RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
Liheng Zhang*, Lexi Pang*, Hang Ye, Xiaoxuan Ma, Yizhou Wang
Peking University
*Equal contribution
After cloning this repository, run
conda env create -f environment.yml
conda activate richcontrol
For example image generation, run
python run_richcontrol.py
After that, the generated images will appear in the results directory.
Feel free to add various images and create your own prompts for controllable generation!
Image input format can be found in configs/image_config.yaml.
Model configurations can be found in configs/model_config.yaml. Note that the Appearance-Rich Prompting module is disabled in the default config. Please update the configurations if you would like to use a prompt model.
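If you prefer to toggle options programmatically rather than editing the YAML by hand, the config files can be loaded and rewritten with PyYAML. A minimal sketch, assuming PyYAML is installed; the key name use_prompt_model is a hypothetical placeholder, so check configs/model_config.yaml for the actual option names:

```python
# Minimal sketch: load a YAML config, set one top-level key, write it back.
# NOTE: the key "use_prompt_model" is a hypothetical placeholder; consult
# configs/model_config.yaml for the actual option names.
import os
import tempfile

import yaml


def set_config_flag(path, key, value):
    """Load a YAML config file, set a top-level key, and save it."""
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    config[key] = value
    with open(path, "w") as f:
        yaml.safe_dump(config, f)
    return config


# Demo on a throwaway file; point `path` at configs/model_config.yaml
# in the actual repository instead.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model_config.yaml")
    with open(path, "w") as f:
        f.write("seed: 42\n")
    cfg = set_config_flag(path, "use_prompt_model", True)
    print(cfg)
```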
To download our dataset, click here. As described in the paper, the dataset comprises 150 image-prompt pairs spanning 7 condition types ("canny edge", "depth map", "HED edge", "normal map", "scribble drawing", "human pose", "segmentation mask") and 7 semantic categories ("animals": 58, "humans": 26, "objects": 20, "buildings": 16, "vehicles": 12, "scenes": 10, "rooms": 8). The dataset is organized as follows:
images
    canny
        beetle_canny
            condition.png
        cat_cartoon
            condition.png
        ...
    depth
        bedroom_depth
            condition.png
        castle_cartoon
            condition.png
        ...
    hed
        ...
    normal
        ...
    pose
        ...
    scribble
        ...
    seg
        ...
image_config_dataset.yaml
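The layout above is uniform, so every condition image can be collected with a simple glob over images/&lt;condition_type&gt;/&lt;sample_name&gt;/condition.png. A minimal sketch (the demo builds a throwaway copy of the layout; point it at the real images folder after downloading the dataset):

```python
# Minimal sketch of iterating the dataset layout: each condition image
# lives at images/<condition_type>/<sample_name>/condition.png.
import pathlib
import tempfile


def list_condition_images(images_dir):
    """Yield (condition_type, sample_name, path) for every condition.png."""
    root = pathlib.Path(images_dir)
    for path in sorted(root.glob("*/*/condition.png")):
        yield path.parent.parent.name, path.parent.name, path


# Demo on a throwaway copy of the layout; substitute the real "images"
# folder from the downloaded dataset.
with tempfile.TemporaryDirectory() as d:
    for sub in ["canny/beetle_canny", "depth/bedroom_depth"]:
        p = pathlib.Path(d, sub)
        p.mkdir(parents=True)
        (p / "condition.png").touch()
    found = [(c, s) for c, s, _ in list_condition_images(d)]
    print(found)  # [('canny', 'beetle_canny'), ('depth', 'bedroom_depth')]
```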
For clarity, we split the entire dataset into 7 folders corresponding to 7 condition types. The file data_prompt-driven.yaml contains all metadata used in our evaluation experiments, where each entry includes the image file path, an inversion prompt, and a generation prompt:
- condition_image: canny/beetle_canny/condition.png
  inversion_prompt: a canny edge map of a volkswagen beetle
  prompt: a cartoon of a volkswagen beetle
- ...
Although our method does not require DDIM inversion, we include the inversion_prompt field to facilitate comparisons with baseline methods that rely on DDIM inversion.
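Each metadata entry can be read back with PyYAML as a list of dictionaries with the three keys shown above. A minimal sketch, assuming PyYAML is installed (the inline SAMPLE stands in for the full metadata file):

```python
# Minimal sketch of reading the evaluation metadata: each entry carries
# an image path, an inversion prompt, and a generation prompt.
import yaml

# Inline stand-in for the full metadata file shipped with the dataset.
SAMPLE = """
- condition_image: canny/beetle_canny/condition.png
  inversion_prompt: a canny edge map of a volkswagen beetle
  prompt: a cartoon of a volkswagen beetle
"""

entries = yaml.safe_load(SAMPLE)
for entry in entries:
    print(entry["condition_image"], "->", entry["prompt"])
```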
Our dataset is based on datasets from prior work including Ctrl-X, FreeControl, Plug-and-Play, and ADE20K. We gratefully acknowledge their contributions to the community.
Our code is inspired by repositories including Ctrl-X, Plug-and-Play, and Restart sampling. We thank their contributors for sharing these valuable resources with the community.
For any questions and discussions, please contact Liheng Zhang (zhangliheng@stu.pku.edu.cn).
If you use our code in your research, please cite the following work.
@misc{zhang2025richcontrolstructureappearancerichtrainingfree,
      title={RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation},
      author={Liheng Zhang and Lexi Pang and Hang Ye and Xiaoxuan Ma and Yizhou Wang},
      year={2025},
      eprint={2507.02792},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.02792},
}

