RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation


Liheng Zhang*, Lexi Pang*, Hang Ye, Xiaoxuan Ma, Yizhou Wang
Peking University
*Equal contribution

RichControl teaser figure

Environment setup

After cloning this repository, run

conda env create -f environment.yml
conda activate richcontrol

Example

To generate an example image, run

python run_richcontrol.py

After that, the results directory should contain something like

Example result

Feel free to add various images and create your own prompts for controllable generation!

Image input format can be found in configs/image_config.yaml.

Model configurations can be found in configs/model_config.yaml. Note that the Appearance-Rich Prompting module is disabled in the default config. Please update the configurations if you would like to use a prompt model.
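As a minimal sketch of how such a YAML config could be inspected and adjusted programmatically (assuming PyYAML; the key names below are illustrative assumptions, not the actual schema of configs/model_config.yaml):

```python
import yaml  # PyYAML

# Hypothetical snippet in the spirit of configs/model_config.yaml;
# these key names are assumptions for illustration, not the real schema.
config_text = """
appearance_rich_prompting:
  enabled: false     # the module is disabled in the default config
  prompt_model: null # set a prompt model here to enable it
"""

config = yaml.safe_load(config_text)

# Enabling the Appearance-Rich Prompting module would then look like:
config["appearance_rich_prompting"]["enabled"] = True
config["appearance_rich_prompting"]["prompt_model"] = "your-prompt-model"

print(config["appearance_rich_prompting"]["enabled"])  # True
```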

Dataset

To download our dataset, click here. As described in the paper, the dataset comprises 150 image-prompt pairs spanning 7 condition types ("canny edge", "depth map", "HED edge", "normal map", "scribble drawing", "human pose", "segmentation mask") and 7 semantic categories ("animals": 58, "humans": 26, "objects": 20, "buildings": 16, "vehicles": 12, "scenes": 10, "rooms": 8). The dataset is organized as follows:

images
  canny
    beetle_canny
      condition.png
    cat_cartoon
      condition.png
    ...
  depth
    bedroom_depth
      condition.png
    castle_cartoon
      condition.png
    ...
  hed
    ...
  normal
    ...
  pose
    ...
  scribble
    ...
  seg
    ...
  image_config_dataset.yaml
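Given this layout, the condition type and example name can be recovered directly from a relative path such as canny/beetle_canny/condition.png. The helper below is purely illustrative and not part of the released code:

```python
from pathlib import PurePosixPath

# Sketch: recover the condition type and example name from the
# dataset layout shown above (this helper is hypothetical).
def parse_condition_path(rel_path: str) -> tuple[str, str]:
    parts = PurePosixPath(rel_path).parts
    condition_type = parts[0]  # e.g. "canny", "depth", ...
    example_name = parts[1]    # e.g. "beetle_canny"
    return condition_type, example_name

print(parse_condition_path("canny/beetle_canny/condition.png"))
# ('canny', 'beetle_canny')
```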

For clarity, we split the dataset into 7 folders, one per condition type. The file data_prompt-driven.yaml contains all metadata used in our evaluation experiments; each entry includes the condition image path, an inversion prompt, and a generation prompt:

- condition_image: canny/beetle_canny/condition.png
  inversion_prompt: a canny edge map of a volkswagen beetle
  prompt: a cartoon of a volkswagen beetle
- ...
...

Although our method does not require DDIM inversion, we include the inversion_prompt field to facilitate comparisons with baseline methods that rely on DDIM inversion. Our dataset is based on datasets from prior work including Ctrl-X, FreeControl, Plug-and-Play, and ADE20K. We gratefully acknowledge their contributions to the community.
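Loading these entries could be sketched as follows (assuming PyYAML; for simplicity the snippet parses an inline sample in the format shown above rather than the actual data_prompt-driven.yaml file):

```python
import yaml  # PyYAML

# Inline sample in the same format as data_prompt-driven.yaml.
sample = """
- condition_image: canny/beetle_canny/condition.png
  inversion_prompt: a canny edge map of a volkswagen beetle
  prompt: a cartoon of a volkswagen beetle
"""

entries = yaml.safe_load(sample)
for entry in entries:
    # Each entry pairs a condition image with an inversion prompt
    # (for DDIM-inversion baselines) and a generation prompt.
    print(entry["condition_image"], "->", entry["prompt"])
```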

Acknowledgement

Our code is inspired by repositories including Ctrl-X, Plug-and-Play, and Restart Sampling. We thank their contributors for sharing these valuable resources with the community.

Contact

For any questions and discussions, please contact Liheng Zhang (zhangliheng@stu.pku.edu.cn).

Reference

If you use our code in your research, please cite the following work.

@misc{zhang2025richcontrolstructureappearancerichtrainingfree,
      title={RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation}, 
      author={Liheng Zhang and Lexi Pang and Hang Ye and Xiaoxuan Ma and Yizhou Wang},
      year={2025},
      eprint={2507.02792},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.02792}, 
}

About

Official implementation of paper: "RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation"
