I was never able to get appreciably better results from ElevenLabs than from a lightly trained RVC model :/ The long-script problem is something pretty much any text-to-something model suffers from. The longer the context, the lower the cohesion ends up.
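A common workaround for the long-script issue is to split the script at sentence boundaries and synthesize each chunk on its own, so no single request carries much context. Here is a minimal sketch of the splitting step only. The function name `chunk_script` and the `max_chars` budget of 300 are assumptions for illustration, not anything tied to ElevenLabs, RVC, or another specific tool.

```python
import re


def chunk_script(text, max_chars=300):
    """Group sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    cut mid-sentence, so a chunk can exceed the budget in that case.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Still within budget (or the chunk is empty): keep growing it.
            current = candidate
        else:
            # Budget exceeded: close this chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then go to whatever synthesis step you use, with the audio stitched back together afterwards. The regex split is crude (abbreviations like "e.g." will trip it), so treat it as a starting point.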
I do rotoscoping with SDXL i2i and ControlNet posing together. Without the ControlNet, I found it tends to smear. Do you just do image2image?
Facebook is trying to burn the forest around OpenAI and other closed-model companies: by releasing its own models freely to the community, it removes the market for “models” as a product in themselves. A lot of money is already pivoting toward companies building products that use AI rather than the AI itself. Unless OpenAI pivots to something more substantial than just providing multimodal prompt completion, they’re gonna find themselves without a lot of runway left.