
Commit 1d48029

[docs] Prompt enhancer (#7565)
* prompt enhance
* edits
* align titles
* feedback
* feedback
* feedback
* link to style
1 parent b2323aa commit 1d48029

2 files changed (+207 −8)

docs/source/en/_toctree.yml (+1 −1)

```diff
@@ -71,7 +71,7 @@
   - local: using-diffusers/control_brightness
     title: Control image brightness
   - local: using-diffusers/weighted_prompts
-    title: Prompt weighting
+    title: Prompt techniques
   - local: using-diffusers/freeu
     title: Improve generation quality with FreeU
   title: Techniques
```

docs/source/en/using-diffusers/weighted_prompts.md (+206 −7)

@@ -10,10 +10,209 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->

-# Prompt weighting
+# Prompt techniques

[[open-in-colab]]

Prompts are important because they describe what you want a diffusion model to generate. The best prompts are detailed, specific, and well-structured to help the model realize your vision. But crafting a great prompt takes time and effort, and sometimes even that isn't enough because language and words can be imprecise. This is where you need to boost your prompt with other techniques, such as prompt enhancing and prompt weighting, to get the results you want.

This guide will show you how to use these prompt techniques to generate high-quality images with less effort and to adjust the weight of certain keywords in a prompt.

## Prompt engineering

> [!TIP]
> This is not an exhaustive guide on prompt engineering, but it will help you understand the necessary parts of a good prompt. We encourage you to continue experimenting with different prompts and combine them in new ways to see what works best. As you write more prompts, you'll develop an intuition for what works and what doesn't!

New diffusion models do a pretty good job of generating high-quality images from a basic prompt, but it is still important to create a well-written prompt to get the best results. Here are a few tips for writing a good prompt:

1. What is the image *medium*? Is it a photo, a painting, a 3D illustration, or something else?
2. What is the image *subject*? Is it a person, animal, object, or scene?
3. What *details* would you like to see in the image? This is where you can get really creative and have a lot of fun experimenting with different words to bring your image to life. For example, what is the lighting like? What is the vibe and aesthetic? What kind of art or illustration style are you looking for? The more specific and precise words you use, the better the model will understand what you want to generate.

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/plain-prompt.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"A photo of a banana-shaped couch in a living room"</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/detail-prompt.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"A vibrant yellow banana-shaped couch sits in a cozy living room, its curve cradling a pile of colorful cushions. on the wooden floor, a patterned rug adds a touch of eclectic charm, and a potted plant sits in the corner, reaching towards the sunlight filtering through the windows"</figcaption>
  </div>
</div>

## Prompt enhancing with GPT2

Prompt enhancing is a technique for quickly improving prompt quality without spending too much effort constructing one. It uses a model like GPT2 pretrained on Stable Diffusion text prompts to automatically enrich a prompt with additional important keywords to generate high-quality images.

The technique works by curating a list of specific keywords and forcing the model to generate those words to enhance the original prompt. This way, your prompt can be "a cat" and GPT2 can enhance the prompt to "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic".

> [!TIP]
> You should also use an [*offset noise*](https://www.crosslabs.org//blog/diffusion-with-offset-noise) LoRA to improve the contrast in bright and dark images and create better lighting overall. This [LoRA](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_offset_example-lora_1.0.safetensors) is available from [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0).

Start by defining certain styles and a list of words (you can check out a more comprehensive list of [words](https://hf.co/LykosAI/GPT-Prompt-Expansion-Fooocus-v2/blob/main/positive.txt) and [styles](https://github.com/lllyasviel/Fooocus/tree/main/sdxl_styles) used by Fooocus) to enhance a prompt with.

```py
import torch
from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList
from diffusers import StableDiffusionXLPipeline

styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
    "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
    "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
    "pixelart": " pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
}

words = [
    "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
    "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
    "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
    "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
    "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
    "elegant", "awesome", "amazing", "dynamic", "trendy",
]
```

You may have noticed that the `words` list contains certain words that can be paired together to create something more meaningful. For example, the words "high" and "quality" can be combined to create "high quality". Let's pair these words together and remove the words that can't be paired.

```py
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]

def find_and_order_pairs(s, pairs):
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])

    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s
```

Next, implement a custom [`~transformers.LogitsProcessor`] class that assigns tokens in the `words` list a value of 0 and assigns tokens not in the `words` list a negative value so they aren't picked during generation. This way, generation is biased towards words in the `words` list. After a word from the list is used, it is also assigned a negative value so it isn't picked again.

```py
class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias

# `tokenizer` is the GPT2 tokenizer loaded further below
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])
```

Combine the prompt and the `cinematic` style prompt defined in the `styles` dictionary earlier.

```py
prompt = "a cat basking in the sun on a roof in Turkey"
style = "cinematic"

prompt = styles[style].format(prompt=prompt)
prompt
"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
```

Load a GPT2 tokenizer and model from the [Gustavosta/MagicPrompt-Stable-Diffusion](https://huggingface.co/Gustavosta/MagicPrompt-Stable-Diffusion) checkpoint (this specific checkpoint is trained to generate prompts) to enhance the prompt.

```py
tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
    "cuda"
)
model.eval()

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 50 - token_count

generation_config = GenerationConfig(
    penalty_alpha=0.7,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    pad_token=model.config.pad_token_id,
    do_sample=True,
)

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )
```

Then you can combine the input prompt and the generated prompt. Feel free to take a look at what the generated prompt (`generated_part`) is, the word pairs that were found (`pairs`), and the remaining words (`words`). This is all packed together in the `enhanced_prompt`.

```py
output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part
enhanced_prompt
"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic"
```

Finally, load a pipeline and the offset noise LoRA with a *low weight* to generate an image with the enhanced prompt.

```py
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

pipeline.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipeline.set_adapters(["offset"], adapter_weights=[0.2])

image = pipeline(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"a cat basking in the sun on a roof in Turkey"</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/enhanced-prompt.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"</figcaption>
  </div>
</div>

## Prompt weighting

Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image. A prompt can include several concepts, which get turned into contextualized text embeddings. The embeddings are used by the model to condition its cross-attention layers to generate an image (read the Stable Diffusion [blog post](https://huggingface.co/blog/stable_diffusion) to learn more about how it works).

Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to its concept in the prompt because you may not necessarily want the model to focus on all concepts equally. The easiest way to prepare the prompt-weighted embeddings is to use [Compel](https://github.com/damian0815/compel), a text prompt-weighting and blending library. Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a [`prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.prompt_embeds) (and optionally [`negative_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.negative_prompt_embeds)) parameter, such as [`StableDiffusionPipeline`], [`StableDiffusionControlNetPipeline`], and [`StableDiffusionXLPipeline`].

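To make this concrete, here is a minimal sketch of that flow, assuming Compel is installed (`pip install compel`) and using [stable-diffusion-v1-5/stable-diffusion-v1-5](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5) as an example checkpoint. The shorter sketches after the following hunks reuse the same `pipeline` and `compel_proc` objects.

```py
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

# example checkpoint; any pipeline that exposes `prompt_embeds` works
pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compel wraps the pipeline's tokenizer and text encoder
compel_proc = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)

# turn the prompt into contextualized text embeddings
prompt_embeds = compel_proc("a red cat playing with a ball")

# pass the embeddings instead of a plain string prompt
image = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
image
```
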
@@ -55,7 +254,7 @@ image
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/compel/forest_0.png"/>
</div>

-## Weighting
+### Weighting

You'll notice there is no "ball" in the image! Let's use compel to upweight the concept of "ball" in the prompt. Create a [`Compel`](https://github.com/damian0815/compel/blob/main/doc/compel.md#compel-objects) object, and pass it a tokenizer and text encoder:

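A rough sketch of the upweighting step this paragraph leads into, reusing the `pipeline` and `compel_proc` objects from the sketch above (`+` is Compel's syntax for increasing a concept's weight):

```py
# "+" upweights a concept; each extra "+" compounds the effect
prompt_embeds = compel_proc("a red cat playing with a ball++")
image = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
image
```
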

@@ -123,7 +322,7 @@ image
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-pos-neg.png"/>
</div>

-## Blending
+### Blending

You can also create a weighted *blend* of prompts by adding `.blend()` to a list of prompts and passing it some weights. Your blend may not always produce the result you expect because it breaks some assumptions about how the text encoder functions, so just have fun and experiment with it!

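A minimal sketch of the `.blend()` syntax, again reusing `pipeline` and `compel_proc` from the earlier sketch:

```py
# blend two prompts with explicit weights; the weights do not have to sum to 1
prompt_embeds = compel_proc('("a red cat playing with a ball", "jungle").blend(0.7, 0.8)')
image = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
image
```
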

@@ -139,7 +338,7 @@ image
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-blend.png"/>
</div>

-## Conjunction
+### Conjunction

A conjunction diffuses each prompt independently and concatenates their results by their weighted sum. Add `.and()` to the end of a list of prompts to create a conjunction:

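A minimal sketch of a conjunction, reusing the same objects:

```py
# each sub-prompt is diffused independently and the results are combined by a weighted sum
prompt_embeds = compel_proc('["a red cat", "playing with a ball", "jungle"].and()')
image = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
image
```
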

@@ -155,7 +354,7 @@ image
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-conj.png"/>
</div>

-## Textual inversion
+### Textual inversion

[Textual inversion](../training/text_inversion) is a technique for learning a specific concept from some images which you can use to generate new images conditioned on that concept.

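In sketch form, the learned concept is loaded into the pipeline and Compel is given a textual inversion manager so the placeholder token is resolved correctly (the [sd-concepts-library/midjourney-style](https://huggingface.co/sd-concepts-library/midjourney-style) concept and its `<midjourney-style>` token are used as an example):

```py
from compel import Compel, DiffusersTextualInversionManager

# load a learned concept; its placeholder token becomes usable in prompts
pipeline.load_textual_inversion("sd-concepts-library/midjourney-style")

# let Compel expand textual inversion placeholder tokens
textual_inversion_manager = DiffusersTextualInversionManager(pipeline)
compel_proc = Compel(
    tokenizer=pipeline.tokenizer,
    text_encoder=pipeline.text_encoder,
    textual_inversion_manager=textual_inversion_manager,
)

prompt_embeds = compel_proc('("a red cat++ playing with a ball <midjourney-style>")')
image = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
image
```
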

@@ -195,7 +394,7 @@ image
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-text-inversion.png"/>
</div>

-## DreamBooth
+### DreamBooth

[DreamBooth](../training/dreambooth) is a technique for generating contextualized images of a subject given just a few images of the subject to train on. It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings. This means you should use [`~DiffusionPipeline.from_pretrained`] to load the DreamBooth model (feel free to browse the [Stable Diffusion Dreambooth Concepts Library](https://huggingface.co/sd-dreambooth-library) for 100+ trained models):

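A sketch of that setup, using [sd-dreambooth-library/dndcoverart-v1](https://huggingface.co/sd-dreambooth-library/dndcoverart-v1) as an example checkpoint (the checkpoint and prompt are illustrative):

```py
import torch
from compel import Compel
from diffusers import DiffusionPipeline

# the DreamBooth checkpoint is loaded as the pipeline itself
pipeline = DiffusionPipeline.from_pretrained(
    "sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16
).to("cuda")

compel_proc = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
prompt_embeds = compel_proc('("magazine cover of a dndcoverart dragon, high quality, intricate details").and()')
image = pipeline(prompt_embeds=prompt_embeds).images[0]
image
```
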

@@ -221,7 +420,7 @@ image
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-dreambooth.png"/>
</div>

-## Stable Diffusion XL
+### Stable Diffusion XL

Stable Diffusion XL (SDXL) has two tokenizers and text encoders so its usage is a bit different. To address this, you should pass both tokenizers and encoders to the `Compel` class:

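Since both text encoders contribute to the conditioning, a minimal sketch with the SDXL base checkpoint passes them (and their tokenizers) as lists and also returns pooled embeddings:

```py
import torch
from compel import Compel, ReturnedEmbeddingsType
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

compel = Compel(
    tokenizer=[pipeline.tokenizer, pipeline.tokenizer_2],
    text_encoder=[pipeline.text_encoder, pipeline.text_encoder_2],
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=[False, True],  # only the second encoder produces pooled embeddings
)

# SDXL pipelines take both the per-token and the pooled embeddings
conditioning, pooled = compel("a red cat playing with a (ball)1.5")
image = pipeline(prompt_embeds=conditioning, pooled_prompt_embeds=pooled, num_inference_steps=30).images[0]
image
```
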
