This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/HonorableFoe on 2024-09-21 20:56:13+00:00.

Original Title: My comfyui Cog video workflow with adtailer using the fun_5b model, with some examples of outputs. You need to really dive in with some prompting, describing clothing and objects being held helps a lot too. Comfy workflow in the comments.

The original was posted on /r/stablediffusion by /u/jenza1 on 2024-09-21 14:35:23+00:00.

The original was posted on /r/stablediffusion by /u/fpgaminer on 2024-09-21 18:37:01+00:00.


This is an update and follow-up to my previous post (). To recap, JoyCaption is being built from the ground up as a free, open, and uncensored captioning VLM model for the community to use in training Diffusion models.

  • Free and Open: It will be released for free, open weights, no restrictions, and just like bigASP, will come with training scripts and lots of juicy details on how it gets built.
  • Uncensored: Equal coverage of SFW and NSFW concepts. No "cylindrical shaped object with a white substance coming out on it" here.
  • Diversity: All are welcome here. Do you like digital art? Photoreal? Anime? Furry? JoyCaption is for everyone. Pains are being taken to ensure broad coverage of image styles, content, ethnicity, gender, orientation, etc.
  • Minimal filtering: JoyCaption is trained on large swathes of images so that it can understand almost all aspects of our world. Almost. Illegal content will never be tolerated in JoyCaption's training.

The Demo

WARNING ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ This is a preview release, a demo, alpha, highly unstable, not ready for production use, not indicative of the final product, may irradiate your cat, etc.

JoyCaption is still under development, but I like to release early and often to garner feedback, suggestions, and involvement from the community. So, here you go!

What's New

Wow, it's almost been two months since the Pre-Alpha! The comments and feedback from the community have been invaluable, and I've spent the time since then working to improve JoyCaption and bring it closer to my vision for version one.

  • First and foremost, based on feedback, I expanded the dataset in various directions to hopefully improve: anime/video game character recognition, classic art, movie names, artist names, watermark detection, male nsfw understanding, and more.
  • Second, and perhaps most importantly, you can now control the length of captions JoyCaption generates! You'll find in the demo above that you can ask for a number of words (20 to 260 words), a rough length (very short to very long), or "Any", which gives JoyCaption free rein.
  • Third, you can now control whether JoyCaption writes in the same style as the Pre-Alpha release, which is very formal and clinical, or a new "informal" style, which will use such vulgar and non-Victorian words as "dong" and "chick".
  • Fourth, there are new "Caption Types" to choose from. "Descriptive" is just like the pre-alpha, purely natural language captions. "Training Prompt" will write random mixtures of natural language, sentence fragments, and booru tags, to try and mimic how users typically write Stable Diffusion prompts. It's highly experimental and unstable; use with caution. "rng-tags" writes only booru tags. It doesn't work very well; I don't recommend it. (NOTE: "Caption Tone" only affects "Descriptive" captions.)

The Details

It has been a grueling month. I spent the majority of the time manually writing 2,000 Training Prompt captions from scratch to try and get that mode working. Unfortunately, I failed miserably. JoyCaption Pre-Alpha was turning out to be quite difficult to fine-tune for the new modes, so I decided to start back at the beginning and massively rework its base training data to hopefully make it more flexible and general. "rng-tags" mode was added to help it learn booru tags better. Half of the existing captions were re-worded into "informal" style to help the model learn new vocabulary. 200k brand new captions were added with varying lengths to help it learn how to write more tersely. And I added a LORA on the LLM module to help it adapt.

The upshot of all that work is the new Caption Length and Caption Tone controls, which I hope will make JoyCaption more useful. The downside is that none of that really helped Training Prompt mode function better. The issue is that, in that mode, it will often go haywire and spiral into a repeating loop. So while it kinda works, it's too unstable to be useful in practice. 2k captions is also quite small and so Training Prompt mode has picked up on some idiosyncrasies in the training data.

That said, I'm quite happy with the new length conditioning controls on Descriptive captions. They help a lot with reducing the verbosity of the captions. And for training Stable Diffusion models, you can randomly sample from the different caption lengths to help ensure that the model doesn't overfit to a particular caption length.
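That random sampling can be sketched in a few lines; the option names below are hypothetical stand-ins for whatever length settings your captioning run exposes, loosely mirroring the demo's controls:

```python
import random

# Hypothetical length options, loosely mirroring the demo's controls:
# "any", rough buckets, and explicit word counts from 20 to 260.
LENGTH_OPTIONS = (
    ["any", "very short", "short", "medium-length", "long", "very long"]
    + [str(n) for n in range(20, 261, 10)]
)

def sample_length(rng=None):
    """Pick one caption-length setting per image so the trained model
    doesn't overfit to a single caption length."""
    rng = rng or random
    return rng.choice(LENGTH_OPTIONS)
```

Feeding a different sampled length into the captioner for each training image is all the "random sampling" amounts to.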

Caveats

As stated, Training Prompt mode is still not working very well, so use with caution. rng-tags mode is mostly just there to help expand the model's understanding, I wouldn't recommend actually using it.

Informal style is ... interesting. For training Stable Diffusion models, I think it'll be helpful because it greatly expands the vocabulary used in the captions. But I'm not terribly happy with the particular style it writes in. It very much sounds like a boomer trying to be hip. Also, the informal style was made by having a strong LLM rephrase half of the existing captions in the dataset; they were not built directly from the images they are associated with. That means that the informal style captions tend to be slightly less accurate than the formal style captions.

And the usual caveats from before. I think the dataset expansion did improve some things slightly like movie, art, and character recognition. OCR is still meh, especially on difficult to read stuff like artist signatures. And artist recognition is ... quite bad at the moment. I'm going to have to pour more classical art into the model to improve that. It should be better at calling out male NSFW details (erect/flaccid, circumcised/uncircumcised), but accuracy needs more improvement there.

Feedback

Please let me know what you think of the new features, if the model is performing better for you, or if it's performing worse. Feedback, like before, is always welcome and crucial to me improving JoyCaption for everyone to use.

The original was posted on /r/stablediffusion by /u/stockimgai on 2024-09-21 16:43:51+00:00.

The original was posted on /r/stablediffusion by /u/randomvariable56 on 2024-09-21 14:53:54+00:00.

The original was posted on /r/stablediffusion by /u/diogodiogogod on 2024-09-21 14:04:04+00:00.


My Civitai article:

So, Flux is great at prompt adherence, right? Right…

But writing directions can be tricky for the model. How would Flux interpret “A full body man with a watch on his right wrist”? It will most probably output a man in front view, with the watch on his LEFT wrist but positioned on the RIGHT side of the image. That’s not what we asked for.

"Full body shot of a man with a watch on his right wrist" 0 out of 2 here

Sometimes Flux gets it right, but often it doesn’t. And that’s mostly because of how we write our prompts.

A warning first: this is in no way perfect. Based on my experimentation, it helps, but it won’t be 100%.

Describing body parts from the character’s perspective (like “his left”) leads to confusion. Instead, it’s better to use the image’s perspective. For example, say “on the left side” instead of “his left.” Adding “side” helps the model a lot. You can also reference specific areas of the image like “on the bottom-left corner”, “on the top-left corner”, “in the center”, or “on the bottom” of the image, etc.

"Full body shot of a man with a watch on his wrist on the left side" 0.5 out of 2, getting there

NEVER use “his right X body part”, ever. “On the left” is already much better than “on his left”, but it still generates a lot of wrong perspectives. More recently I have been experimenting with removing “him/her” from the prompt entirely, and I think it is even better.

"Full body shot of a man with a watch on the wrist on the left side" 1 out of 2, better.

Another example would be:

"A warrior man from behind, climbing stepping up a stone. The leg on the left side is extended down, the leg on the right is bent at the knee. He is wearing a magical glowing green bracelet on the hand on the left side. The hand on the right side is holding the sword vertically upward. The background is the entrance of a magical dark cave, with multiple glowing red neon lights on the top-right side corner inside the cave resembling eyes."

Definitely not all is correct. But it's more consistent.

For side views, when both body parts are on the same side, you can use foreground and background to clarify:

A photo of man in side view wearing an orange tank top and green shorts. He is touching a brick wall arching, leaning forward to the left side. His hand on the background is up touching the wall on the left side. His hand in the foreground is hanging down on the left side.

This is way more inconsistent. It's hit-and-miss most of the time.

Using these strategies, Flux performs better for inference. But what about training with auto captions like Joy Caption?

A trend has been going around claiming the model doesn’t need them, but I still don’t buy it. For simple objects or faces, trigger words might be enough, but for complex poses or anatomy, captions still seem important. I haven't tested enough, though, so I could be wrong.

With the help of ChatGPT I created a script that updates all text files in a folder to the format I mentioned. It’s not perfect, but you can tweak it or ask ChatGPT for more body part examples (I also just recently added "to" instead of only "on").

https://github.com/diodiogod/Search-Replace-Body-Pos

A simpler and faster option would be to just add “side” after “right/left”. But it would still be ambiguous. For example, “her left side arm” might mean her side, not the image’s side. So you need to include the prepositions: “on the left leg” > “on the leg on the left side”, “on his left X” > “on his X on the left side”, etc.
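As a rough illustration of that substitution rule (this is not the linked script, just a minimal regex sketch with a hypothetical, incomplete body-part list):

```python
import re

# Hypothetical, incomplete body-part list; extend as needed.
BODY_PARTS = r"arm|leg|hand|wrist|shoulder|knee|foot|hip|ear|cheek"

# Matches e.g. "on his left wrist" so it can be rewritten as
# "on his wrist on the left side".
PATTERN = re.compile(
    rf"\bon (his|her|the) (left|right) ({BODY_PARTS})\b",
    re.IGNORECASE,
)

def rewrite_direction(caption: str) -> str:
    """Rewrite character-perspective directions into image-perspective ones."""
    return PATTERN.sub(r"on \1 \3 on the \2 side", caption)

# rewrite_direction("a watch on his right wrist")
# -> "a watch on his wrist on the right side"
```

The real script handles more phrasings (including “to” as well as “on”), but the core transformation is this kind of pattern rewrite.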

But another big problem is that Joy Caption and all the other auto captioners are very inconsistent. They often get left and right wrong, probably because of the perspective problem I mentioned. So it’s kind of essential to check manually…. That’s why I add a marker (<###---------####>) after each substitution, so I can easily find and check them. You can then search and replace that string with Taggui, Notepad++ or another tool.

But manually switching left and right can be tedious. So, I built another tool to make it easier: a floating box for fast text swaps. I organize my window so I can manually check each text file, spot substitutions, and easily swap “left side” and “right side.”

https://github.com/diodiogod/Floating-FAST-Text-Swapper
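The swap itself has a classic gotcha: two sequential replaces would turn every occurrence into the same side. A placeholder pass avoids that (a sketch of the idea, not the linked tool):

```python
def swap_sides(text: str) -> str:
    """Swap "left side" <-> "right side" without the second replace
    clobbering the output of the first."""
    placeholder = "\0SIDE\0"  # sentinel that won't occur in captions
    text = text.replace("left side", placeholder)
    text = text.replace("right side", "left side")
    return text.replace(placeholder, "right side")
```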

What I did: using the preview panel, I would organize my window just like this:

Manually clicking on every txt file, I could easily spot in the preview panel any file that had a substitution by looking for the <###---------####> marker, then check whether it was correct. If not, I could drag the txt onto the floating box and easily swap “left side” <> “right side”.

This process isn’t perfect, and you’ll still need to do some manual edits.

But anyway, that’s it. Hope this can help anyone with their captions, or just with their prompt writing.

The original was posted on /r/stablediffusion by /u/fab1an on 2024-09-21 13:35:29+00:00.

The original was posted on /r/stablediffusion by /u/cgpixel23 on 2024-09-21 10:16:05+00:00.

The original was posted on /r/stablediffusion by /u/3dmindscaper2000 on 2024-09-21 06:27:05+00:00.

The original was posted on /r/stablediffusion by /u/stbl_reel on 2024-09-21 06:10:02+00:00.

1990s Rap Album LoRA (www.reddit.com)

The original was posted on /r/stablediffusion by /u/Angrypenguinpng on 2024-09-20 19:07:26+00:00.

The original was posted on /r/stablediffusion by /u/ol_barney on 2024-09-20 16:09:32+00:00.

The original was posted on /r/stablediffusion by /u/Glass-Caterpillar-70 on 2024-09-20 14:09:10+00:00.

The original was posted on /r/stablediffusion by /u/mardy_grass on 2024-09-20 18:12:46+00:00.

The original was posted on /r/stablediffusion by /u/tintwotin on 2024-09-20 16:45:25+00:00.

The original was posted on /r/stablediffusion by /u/rolux on 2024-09-20 16:24:18+00:00.

The original was posted on /r/stablediffusion by /u/cocktail_peanut on 2024-09-20 15:52:30+00:00.

The original was posted on /r/stablediffusion by /u/jjjnnnxxx on 2024-09-20 12:18:58+00:00.

The original was posted on /r/stablediffusion by /u/zazaoo19 on 2024-09-20 04:01:52+00:00.

The original was posted on /r/stablediffusion by /u/theninjacongafas on 2024-09-20 11:38:06+00:00.

The original was posted on /r/stablediffusion by /u/mrfofr on 2024-09-20 10:14:56+00:00.

The original was posted on /r/stablediffusion by /u/R34vspec on 2024-09-20 05:47:21+00:00.

CogVideoX I2V on memes (old.reddit.com)

The original was posted on /r/stablediffusion by /u/4-r-r-o-w on 2024-09-20 05:49:04+00:00.

The original was posted on /r/stablediffusion by /u/dewarrn1 on 2024-09-20 02:44:52+00:00.

The original was posted on /r/stablediffusion by /u/FoxBenedict on 2024-09-20 04:50:34+00:00.


An astonishing paper was released a couple of days ago showing a revolutionary new image generation paradigm. It's a multimodal model with a built-in LLM and a vision model that gives you unbelievable control through prompting. You can give it an image of a subject and tell it to put that subject in a certain scene. You can do that with multiple subjects. No need to train a LoRA or any of that. You can prompt it to edit a part of an image, or to produce an image with the same pose as a reference image, without the need for a ControlNet. The possibilities are so mind-boggling, I am, frankly, having a hard time believing that this could be possible.

They are planning to release the source code "soon". I simply cannot wait. This is on a completely different level from anything we've seen.
