PS: Many paid software tools have appeared on the market, such as Jihu Manjian and SuTui. These are packaged tools, but their core functionality is the same: what ultimately needs to be tested is still the effectiveness of GPT. This year, Sora has emerged as an evolved version in this field, and it is more likely to impact the film and television production industry (UE4).
Function Design#
- Extract storyboard scenes: sentence segmentation of the novel text, SD image generation, and TTS text-to-speech.
- Novel content > derive prompt words (SD painting).
- Merge image and audio into video.
Models: TTS (edge), SD painting model (cetusMix_Whalefall2 used here), GPT (Gemini used here).
Project address: story-vision
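The three steps above can be sketched as a single pipeline loop. The function and path names below (`extract_storyboard`, `run_pipeline`, the `data/` layout) are hypothetical placeholders standing in for the real GPT / SD / TTS calls shown later in this post, not the project's actual API:

```python
# Hypothetical end-to-end sketch: each placeholder stands in for a real step.
def extract_storyboard(novel_text):
    # Placeholder: the real version asks GPT to split the text into scenes.
    return [s.strip() for s in novel_text.split('.') if s.strip()]

def run_pipeline(novel_text):
    scenes = extract_storyboard(novel_text)
    clips = []
    for i, scene in enumerate(scenes):
        image_path = f"data/img/{i}.jpg"    # produced by SD in the real pipeline
        audio_path = f"data/voice/{i}.wav"  # produced by TTS in the real pipeline
        clips.append((scene, image_path, audio_path))
    return clips  # the real pipeline would merge these into a video

clips = run_pipeline("The hero enters the bar. Candlelight flickers.")
```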
Core Code#
Novel Storyboard Extraction GPT#
prompt = """I want you to create a storyboard based on the novel content, inferring scenes from the original text description; infer and supplement missing or implied information (including but not limited to: character clothing, character hairstyle, character hair color, character complexion, character facial features, character posture, character emotions, character body movements, etc.), style description (including but not limited to: era description, space description, time period description, geographical environment description, weather description), object description (including but not limited to: animals, plants, food, fruits, toys), visual perspective (including but not limited to: character proportions, camera depth description, observation angle description), but do not overdo it. Describe richer character emotions and emotional states through camera language, and after you understand this requirement, generate new descriptive content from the sentences. Change the output format to: Illustration 1: Original description: corresponding original sentences; Scene description: corresponding scene plot content; Scene characters: names of characters appearing in the scene; Clothing: protagonist in casual wear; Location: sitting in front of the bar; Expression: facial lines gentle, expression content; Behavior: gently shaking the wine glass in hand. Environment: the background of the bar is dark-toned, candlelight flickers in the background, giving a dreamy feeling. If you understand this requirement, please confirm these five points, and return results only with these five points' content; the novel content is as follows:"""
def split_text_into_chunks(text, max_length=ai_max_length):
    """
    Split text into chunks with a maximum length, ensuring that splits only occur at line breaks.
    """
    lines = text.splitlines()
    chunks = []
    current_chunk = ''
    for line in lines:
        if len(current_chunk + ' ' + line) <= max_length:
            current_chunk += ' ' + line
        else:
            chunks.append(current_chunk)
            current_chunk = line
    chunks.append(current_chunk)
    return chunks
import google.generativeai as genai

def rewrite_text_with_genai(text, prompt="Please rewrite this text:"):
    chunks = split_text_into_chunks(text)
    rewritten_text = ''
    genai.configure(api_key=cfg['genai_api_key'])
    model = genai.GenerativeModel('gemini-pro')
    for chunk in chunks:
        # Note: no trailing comma here, or _prompt silently becomes a tuple.
        _prompt = f"{prompt}\n{chunk}"
        response = model.generate_content(
            contents=_prompt,
            generation_config=genai.GenerationConfig(
                temperature=0.1,
            ),
            stream=True,
            safety_settings=[
                {"category": "HARM_CATEGORY_DANGEROUS", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
            ]
        )
        for _chunk in response:
            if _chunk.text is not None:
                rewritten_text += _chunk.text.strip()
    return rewritten_text
Storyboard Output
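Because the prompt asks for labeled fields ("Scene description: ...; Clothing: ...;"), the returned storyboard text can be split into dictionaries with a small parser. A minimal sketch, assuming the model returns the English field labels exactly as requested (real output may drift and need fuzzier matching); `parse_storyboard` is my name, not the project's:

```python
import re

def parse_storyboard(text):
    """Split GPT storyboard output into one field dict per illustration."""
    illustrations = []
    # Split on "Illustration N:" markers (assumed label format).
    blocks = re.split(r'Illustration \d+:', text)
    for block in blocks:
        # Capture "Label: value" pairs; values end at ';' or a newline.
        fields = dict(re.findall(r'(\w[\w ]*?):\s*([^;\n]+)', block))
        if fields:
            illustrations.append(fields)
    return illustrations

sample = ("Illustration 1: Scene description: hero at the bar; "
          "Clothing: casual wear; Location: sitting in front of the bar")
parsed = parse_storyboard(sample)
```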
SD Text-to-Image#
The prompt for SD is generated by GPT based on the storyboard text output above.
from diffusers import StableDiffusionPipeline
import torch

model_path = "./models/cetusMix_Whalefall2.safetensors"
pipeline = StableDiffusionPipeline.from_single_file(
    model_path,
    torch_dtype=torch.float16,
    variant="fp16"
).to("mps")  # Apple Silicon; use "cuda" on an NVIDIA GPU
generator = torch.Generator("mps").manual_seed(31)

def sd_cetus(save_name, prompt):
    # Pass the seeded generator so results are reproducible.
    image = pipeline(prompt, generator=generator).images[0]
    image.save('data/img/' + save_name + '.jpg')
Image Effect
TTS Audio Generation#
There are many TTS options available online; here we use the one provided by edge.
import edge_tts
import asyncio

voice = 'zh-CN-YunxiNeural'
output = 'data/voice/'
rate = '-4%'
volume = '+0%'

async def tts_function(text, save_name):
    tts = edge_tts.Communicate(
        text,
        voice=voice,
        rate=rate,
        volume=volume
    )
    await tts.save(output + save_name + '.wav')
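The final step from the function design, merging each image with its narration audio into a clip, is not shown above; one common approach is to shell out to ffmpeg, looping the still image for the duration of the audio. A hedged sketch (assumes ffmpeg is installed; the helper names `build_ffmpeg_cmd` and `merge_clip` are mine, not the project's):

```python
import subprocess

def build_ffmpeg_cmd(image_path, audio_path, out_path):
    """Build an ffmpeg command that loops a still image for the
    duration of the audio track and muxes them into one clip."""
    return [
        'ffmpeg', '-y',
        '-loop', '1', '-i', image_path,  # repeat the still image
        '-i', audio_path,
        '-shortest',                     # stop when the audio ends
        '-c:v', 'libx264', '-pix_fmt', 'yuv420p',
        out_path,
    ]

def merge_clip(image_path, audio_path, out_path):
    subprocess.run(build_ffmpeg_cmd(image_path, audio_path, out_path), check=True)

cmd = build_ffmpeg_cmd('data/img/0.jpg', 'data/voice/0.wav', 'data/video/0.mp4')
```

The per-scene clips can then be concatenated with ffmpeg's concat demuxer to produce the full video.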
Video Effect#
[Video: Chapter 1: Entering the Bureau (https://live.csdn.net/v/embed/379613)]