
Niko's Project Corner

Introduction to Stable Diffusion's parameters

Description: Learning to use a pre-trained SD model.
Languages: Python, PyTorch
Tags: Computer Vision, Autoencoder, Stable Diffusion
Duration: Fall 2022
Modified: 10th November 2022

Stable Diffusion is an image generation network, which was released to the public in 2022. It is based on a diffusion process, in which the model gets a noisy image as input and tries to generate a noise-free image as output. This process can be guided by describing the target image in plain English (aka txt2img), and optionally by also giving it a target image (aka img2img). This article doesn't describe how the model works or how to run it yourself; instead it is more of a tutorial on how various parameters affect the resulting image. Non-technical people can use these image-generating AIs via webpages such as Artistic.wtf (a project by me and a friend), Craiyon.com, Midjourney.com and others.

This article shows how three different parameters (number of steps, "CFG" or classifier-free guidance scale, and img2img strength) affect the image. These may interact with each other in complex ways, but that isn't studied here. Experiments use five different input texts (aka "prompts"): "a white parrot sitting by the sea", "full length portrait of gorgeous goddess standing in field full of flowers", "panda movie poster with yellow background and red title text", "mountains by a river" and "an ancient yellow robot holding a sword on desert". In addition, the prompts contain "prompt-engineering" terms such as "cinematic, highly detailed, illustration, concept art". These greatly improve the results, but that topic isn't covered here either. One can find prompt examples at sites such as lexica.art and huggingface.co's MagicPrompt.

Variations of the "same" image are generated by changing the seed of a random number generator. This affects the noise from which the denoising steps are started (the diffusion process), and the algorithm will converge to quite different-looking images. But they should still be conceptually similar. Here three different seeds are used for each experiment.
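For concreteness, the seed only fixes the starting noise of the diffusion process. Below is a minimal sketch in PyTorch, assuming SD v1.x's 4-channel 64 × 64 latent for a 512 × 512 image; the variable names are illustrative:

```python
import torch

# The seed determines the initial latent noise and nothing else.
# SD v1.x denoises a 4-channel 64 x 64 latent for a 512 x 512 image.
for seed in (1, 2, 3):
    gen = torch.Generator().manual_seed(seed)
    latents = torch.randn((1, 4, 64, 64), generator=gen)
    # The denoising loop starts from `latents`; each seed thus converges
    # to a different but conceptually similar image.
```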

Each parameter is varied over eight different values, either on a linear or a logarithmic scale. Thus each figure consists of 3 × 8 sub-images. Originally each image was generated at a resolution of 512 × 512 (except the panda poster, which has a different aspect ratio), so each resulting image tile has a resolution of 1536 × 4096, and there are 20 such tiles in total. This brings the total image size to 156 megapixels, which would make the file sizes relatively large. The original PNGs take 180 MB, but the images used here have a resolution of 2560 × 960, and with a JPG quality of 90% their total size is only 12 MB (93% less).

Figure 1: "a white parrot sitting by the sea", generated with a varying number of denoising steps. Major details have converged by the 4th or 5th column (15 & 18 steps), but there is still a noticeable jump between the last two columns (21 & 50 steps). The "sea" background lacks any interesting details.

The default settings are 50 denoising steps, a CFG scale of 7.5 and a reference image strength of 0 (meaning no reference image). All other parameters are kept constant while one parameter is varied. (Note: the common convention in img2img applications seems to be to specify the weight of the added random noise, not the weight of the input image! Here the "image strength" is 1 - "noise strength".) The results shown here are based on Stable Diffusion model version 1.4, using the "DDIM" sampler.
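These defaults map onto, for example, Hugging Face's diffusers library roughly as follows. This is a sketch under the assumption that diffusers is used; the article's own generation code may differ:

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load SD v1.4 and swap in the DDIM sampler used for these experiments.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

gen = torch.Generator("cuda").manual_seed(1)  # one of the three seeds
image = pipe("a white parrot sitting by the sea",
             num_inference_steps=50,   # denoising steps
             guidance_scale=7.5,       # CFG scale
             generator=gen).images[0]
image.save("parrot_seed1.png")
```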

The first varied parameter is the number of denoising steps. The used values are [3, 6, 9, 12, 15, 18, 21, 50], and usually the image has converged by step 50. These results are shown in figures 1 - 5.
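A sketch of how one such 3 × 8 grid could be produced, reusing `pipe` from the previous snippet. Note that the generator must be re-seeded before every call, so that each row always starts from the same initial noise:

```python
step_counts = [3, 6, 9, 12, 15, 18, 21, 50]
grid = []
for seed in (1, 2, 3):           # one row per seed
    row = []
    for steps in step_counts:    # one column per step count
        # Re-seed so every image in the row starts from identical noise.
        gen = torch.Generator("cuda").manual_seed(seed)
        row.append(pipe("a white parrot sitting by the sea",
                        num_inference_steps=steps, guidance_scale=7.5,
                        generator=gen).images[0])
    grid.append(row)
# `grid` now holds 3 rows x 8 columns of PIL images.
```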

Figure 2: "an ancient yellow robot holding a sword on desert" at a varying number of steps. The result is quite acceptable already by column four (12 steps), and by 21 steps it has converged so much that the result is very similar to the one using 50 steps (the right-most column).
Figure 3: "full length portrait of gorgeous goddess standing in field full of flowers" at a varying number of steps. Here the convergence is quite slow, and the pose keeps changing throughout the process. Depending on the seed, steps 21 and 50 may produce very different-looking images. This means that using a lower step count for previewing seeds doesn't always work.

All images have their major aspects defined already with just 10 - 15 denoising steps. Most of the time even going from 21 to 50 steps doesn't change the result in any major way, but of course there are exceptions. The most challenging inputs were "full length portrait of gorgeous goddess standing in field full of flowers" (Figure 3) and "panda movie poster with yellow background and red title text" (Figure 5). They don't have much in common, but the "goddess" has a complex foreground and background, and the "panda" has typography (either Chinese or English, depending on the seed), high-contrast colors and only a vague idea of how the composition should look.

This is in contrast to "a white parrot sitting by the sea" (Figure 1) and "an ancient yellow robot holding a sword on desert" (Figure 2), which have a somewhat constrained pose and a simpler background, although the set of all possible robots has huge variety.

Figure 4: "mountains by a river" at a varying number of steps. The initial images lack contrast and detail, but again column 4 (12 steps) already resembles the final result very much. Somehow the art style seems foggy and lacks detail; naturally this can be steered by adding terms like "sunshine" to the prompt.
Figure 5: "panda movie poster with yellow background and red title text" at a varying number of steps. Even column three (9 steps) could pass as a low-quality poster, but the contents continue to change until approximately column 6 (18 steps). Especially the top-right corner's image benefits from the "full" 50 steps, which add detail to its hair and tune the pose to look straight at the camera. It is plausible that original Kung Fu Panda movie posters were part of the training dataset.
Step parameter summary: this parameter is an easy one to tune, since the end result is almost always better with more steps. Sometimes the result has converged after 30 steps, but it is better to let it run for 50 steps to be sure. It is very rare for the result to change significantly beyond that, at least when generating 512 × 512 images.

The next studied parameter is the "CFG" or classifier-free guidance scale, or just "scale". It is described for example at diffusion-news.org, but the conclusions there conflict with the experimental results shown in this article. According to that article, it "is a measure of how close you want the model to stick to your prompt when looking for a related image to show you. A Cfg Scale value of 0 will give you essentially a random image based on the seed, whereas a Cfg Scale of 20 (the maximum on SD) will give you the closest match to your prompt that the model can produce." They even run the tests with various different samplers, but those aren't covered in this article for the sake of sticking to the basics.
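Mechanically, classifier-free guidance runs the denoiser twice per step, once with an empty prompt and once with the actual one, and extrapolates between the two noise predictions. A minimal sketch of that combination step (not SD's exact code):

```python
import torch

def guided_eps(eps_uncond: torch.Tensor, eps_text: torch.Tensor,
               scale: float) -> torch.Tensor:
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the prompt-conditioned one. scale = 1 reduces to the
    plain conditional prediction; larger values follow the prompt harder."""
    return eps_uncond + scale * (eps_text - eps_uncond)
```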

Tested CFG scales come from the formula 1 + 6.5 · 10^i, where i is interpolated linearly between -1 and 1. This was chosen so that the default scale of 7.5 sits in the middle of the range. The values (rounded to two decimals) are [1.65, 2.25, 3.42, 5.68, 10.03, 18.44, 34.67, 66.0]. The actual middle value 7.5 isn't present in the list, since there is an even number of samples. Examples are shown in figures 6 - 10.
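The list can be reproduced as follows:

```python
import numpy as np

# CFG scales on a logarithmic grid, centered (in i) on the default 7.5:
# scale(i) = 1 + 6.5 * 10**i, with i going linearly from -1 to 1.
i = np.linspace(-1.0, 1.0, 8)
scales = 1.0 + 6.5 * 10.0**i
print(np.round(scales, 2))
# -> [ 1.65  2.25  3.42  5.68 10.03 18.44 34.67 66.  ]
```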

Figure 6: "Parrot" at varying CFG scale between 1.65 and 66.0. Lower scale values produce an interesting hand-drawn style, but at larger scales the results aren't very successful.
Figure 7: "Robot" at varying CFG scale between 1.65 and 66.0. The smallest values aren't really coherent, but larger values produce an interesting pop-art style.

The scale parameter has a very concrete impact on the resulting artistic style. Smaller values seem to make the iteration focus on small-scale detail, and the overall contrast and saturation are fairly weak. But if such an artistic style is desired, contrast and saturation can easily be fixed in post-processing. Scale values of around 6 - 8 give the most realistic and well-balanced results. Going beyond that, the results move progressively in the opposite direction: they have very vibrant colors and high global contrast, but are lacking in small details.

CFG scale parameter summary: use smaller values for an intricate hand-drawn style, the default 7.5 for realistic photos, and larger values for a pop-art style with high contrast and saturated colors. But going to either extreme may require further tweaking in photo-editing software.
Figure 8: "Goddess" at varying CFG scale between 1.65 and 66.0. The smallest scales are well suited for this prompt, but larger values don't work that well.
Figure 9: "Mountains" at varying CFG scale between 1.65 and 66.0. Here all scales produce quite decent but very different results.
Figure 10: "Panda" at varying CFG scale between 1.65 and 66.0. Here smaller values produce interesting compositions with lots of detail, and larger values produce less interesting images.

The next studied parameter is the img2img strength (or "weight"), a value between 0% and 100%. All the previously described parameters still play a role, but here they are kept at their default values (50 steps and a CFG scale of 7.5). Note that the strength here uses the opposite convention to the original work; for example the source code at github.com/CompVis/stable-diffusion says that "strength is a value between 0.0 and 1.0, that controls the amount of noise that is added to the input image. Values that approach 1.0 allow for lots of variations but will also produce images that are not semantically consistent with the input." Basically it is a trade-off parameter between the image generated purely from the prompt and the target image: 0% noise means that the output resembles the target 100%, and 100% noise means that the target image has no impact. Hopefully this isn't confusing anybody.
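For example in diffusers' img2img pipeline the strength argument follows the original noise convention, so the "image strength" used in this article corresponds to 1 - strength. A sketch (the target file name is a placeholder):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
target = Image.open("seagull.jpg").convert("RGB").resize((512, 512))

image_strength = 0.2                    # this article's convention
noise_strength = 1.0 - image_strength   # the pipeline's convention
out = img2img(prompt="a white parrot sitting by the sea",
              image=target, strength=noise_strength,
              num_inference_steps=50, guidance_scale=7.5).images[0]
```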

The weight is a very delicate parameter to tune, and if the two images (from the prompt and the target) are very different, there may be a tipping point where changing the weight even a little causes a large change in the resulting image. However, this depends very much on the context, and in many experiments the results changed nicely and gradually. Because of this, the img2img experiments were run with two different parameter gradients. The first one uses weights [0.03, 0.06, 0.09, 0.12, 0.15, 0.18, 0.21], and the second one uses [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]. The results are shown in figures 11 - 20. Note that only seven different weights were used, but the images consist of eight columns. This is because the right-most picture always shows the target image (corresponding to a strength of 1.0, or a noise level of zero).

Figure 11: "Parrot" turning into a seagull by the sea, using small strength values. All values (between 0.03 and 0.21) show interesting interpolation between the "source" and "target" images.
Figure 12: "Parrot" turning into a seagull by the sea, using large strength values. Columns 5 - 7 (strengths 0.5 - 0.7) resemble the target image too much, and very little "parrot" influence is present.
Figure 13: "Robot" on a desert, inspired by a yellowish stick-man drawing, using small strength values. In this case the target image isn't realistic at all, so at these strengths the image hasn't influenced the composition at all; it has only set the yellow color theme.
Figure 14: "Robot" on a desert, inspired by a yellowish stick-man drawing, using large strength values. Here strength values of 0.3 - 0.5 seem to produce the best results; going higher than this loses all details, since they are absent from the target image.
Figure 15: "Goddess" mimicking a woman on a grass field, using small strength values. Here all images are quite interesting, although the composition in the middle column isn't ideal, since the face isn't visible.
Figure 16: "Goddess" mimicking a woman on a grass field, using large strength values. Images beyond the 2nd column (strength 0.2) aren't that good: facial and other features are a mix-n-match with clearly visible artifacts.
Figure 17: "Mountains" taking guidance from a simple drawing, using small strength values. Only the first three columns (strengths 0.03 - 0.09) show interesting results. The target image is so simplistic and unrealistic that it doesn't have a good impact on the resulting image.
Figure 18: "Mountains" taking guidance from a simple drawing, using large strength values. Clearly large strengths don't work well in this case.

Figures 13 and 14 ("robot") and 17 and 18 ("mountains") show that one must use very low strengths if the target image is unrealistic, has lots of regions with uniform color and lacks small-scale details. Better results could be obtained by running the img2img generation several times, using a low strength value each time and picking the most promising-looking image as the starting point for the next iteration.
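A sketch of that iterative idea, reusing the img2img pipeline from the previous snippet; pick_best is a hypothetical stand-in for manually choosing the most promising candidate:

```python
import torch

def refine(img2img, prompt, start_image, pick_best,
           rounds=3, image_strength=0.1, seeds=(1, 2, 3)):
    """Repeated low-image-strength img2img: each round nudges the image
    toward the prompt while keeping a little of the previous result."""
    image = start_image
    for _ in range(rounds):
        candidates = []
        for seed in seeds:
            gen = torch.Generator("cuda").manual_seed(seed)
            candidates.append(img2img(prompt=prompt, image=image,
                                      strength=1.0 - image_strength,
                                      generator=gen).images[0])
        image = pick_best(candidates)  # hypothetical manual selection
    return image
```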

The other examples show that if realistic results are desired, the best results are obtained by using a realistic input image to begin with. But since it is very context-dependent which strength values produce reasonable results, a trial-and-error approach seems the easiest. One must not forget that the CFG scale can be changed as well from the default 7.5, and it will also have a major impact on the outcome.

Figure 19: "Panda" taking guidance from the actual movie poster, using small strength values. Here all results look more or less reasonable.
Figure 20: "Panda" taking guidance from the actual movie poster, using large strength values. By column 5 (strength 0.5) all of the results look very similar. The most interesting results are between columns 2 and 4 (strengths 0.2 - 0.4).
Img2img strength parameter summary: the results depend very much on the prompt and the target image. Some cases work best with a very small weight, while others require a larger one, so trial and error must be used. In general, smaller weights work best when the target image is unrealistic and lacks detail, and larger weights can be used when the target image is already realistic.

There are still many more topics to study, such as what kinds of descriptions to append to the prompt, using different versions of the model (version 1.5 is already out), and using different samplers, among others.

