We introduce a novel approach that enhances text-to-image (T2I) models by incorporating Retrieval-Augmented Generation (RAG) with a knowledge graph. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. Furthermore, we propose a self-correcting mechanism within Stable Diffusion models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. To our knowledge, Context Canvas is the first application of graph-based RAG to T2I models, and a significant advance in producing high-fidelity, context-aware, multi-faceted images.
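As a rough illustration of the retrieval step described above, the following minimal sketch looks up entities named in a prompt in a toy in-memory graph and appends their attributes and relations to the prompt. The `KnowledgeGraph` class, the `augment_prompt` helper, the substring-based entity matching, and the attribute wording are all hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy in-memory graph: per-entity attributes plus relation triples."""
    attributes: dict = field(default_factory=dict)  # entity -> list of attribute strings
    relations: list = field(default_factory=list)   # (subject, relation, object) triples

    def retrieve(self, entity):
        """Return stored attributes and outgoing relations for one entity."""
        return {
            "attributes": list(self.attributes.get(entity, [])),
            "relations": [(rel, obj) for subj, rel, obj in self.relations if subj == entity],
        }


def augment_prompt(prompt, graph):
    """Append retrieved context for each graph entity named in the prompt."""
    parts = [prompt.rstrip(".")]
    for entity in graph.attributes:
        if entity.lower() not in prompt.lower():
            continue  # naive substring match; a real system would need entity linking
        ctx = graph.retrieve(entity)
        if ctx["attributes"]:
            parts.append(f"{entity}: " + ", ".join(ctx["attributes"]))
        for rel, obj in ctx["relations"]:
            parts.append(f"{entity} {rel} {obj}")
    return ". ".join(parts)


# Illustrative entries only; the attribute and relation wording is hypothetical.
graph = KnowledgeGraph(
    attributes={
        "Tumburu": ["horse face", "holds a musical instrument"],
        "Vritra": ["multi-headed form"],
    },
    relations=[("Vritra", "appears with", "Indra")],
)
print(augment_prompt("Tumburu performing in a celestial court", graph))
```

A prompt that names no graph entity passes through unchanged, so the augmentation only adds context where the graph has something to contribute.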
Context Canvas enhances image generation across diverse domains by integrating cultural and contextual details often missed by standard models. For example, it accurately portrays rare Indian mythological characters such as Tumburu, with his horse face and instrument, and Gandabherunda, a dual-headed bird (top left). Domains such as mythology often include characters with multiple forms (e.g., a celestial and a human form). Our method picks up subtle cues from user prompts to depict such characters in the right form. For instance, the Indian mythological character Ganga has a heavenly river form and a human form (duality), and our method represents both accurately (left middle). Our method adapts to various mythologies, capturing Melinoe’s ghostly essence (left bottom) and Zhong Kui’s fierce warrior form (middle bottom). In Project Gutenberg domains such as Historical Fiction, Gothic Horror, and Fantasy, it captures narrative-specific details like Captain Ahab’s ivory leg and gaunt expression (top 4th column) and Edmond Dantès’ pale skin, coat, and ring (top 3rd column). For Gothic Horror, it enhances Count Dracula’s menacing presence (middle 3rd column) and infers Manfred’s guilty, dark persona (middle 4th column). Our approach faithfully represents Lilith with bat wings and snakes and Lizarel with ethereal beauty and silvery hair (bottom right), demonstrating superior fidelity across cultural and literary domains.
SRD achieves highly accurate and culturally resonant depictions of complex mythological characters through iterative, context-rich prompt refinement. SRD corrects Vritra’s multi-headed form, drawing on the story’s cultural narrative from a robust knowledge graph to identify and position Indra accurately. For ‘Garuda’, iterative adjustments restore his iconic gold jewelry and vibrant wing color, enhancing his divine representation. For ‘Yama on his vehicle’, our approach automatically identifies his vehicle and adds essential symbols like his crown, mace, and noose, with precise skin tone adjustments. Similarly, for ‘Mahishasura’, the incorrectly generated tail is replaced with the snake he is typically depicted with.
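The iterative refinement described above can be pictured as a generate, check, refine loop. The sketch below is a simplified illustration under stated assumptions: `self_correct`, `fake_generate`, and `fake_describe` are hypothetical names, the "image" is just the prompt text so the loop runs without a model, and the default of two rounds is only chosen to mirror the two self-correction rounds used in the evaluation.

```python
def self_correct(prompt, required, generate, describe, max_rounds=2):
    """Regenerate until all required attributes are detected or rounds run out.

    generate(prompt) -> image and describe(image) -> set of detected attributes
    are caller-supplied stand-ins for a diffusion model and a visual checker.
    """
    image = generate(prompt)
    for _ in range(max_rounds):
        detected = describe(image)
        missing = [attr for attr in required if attr not in detected]
        if not missing:
            break  # image already matches the retrieved context
        # Refine the prompt with whatever the checker did not find.
        prompt = f"{prompt}, with {', '.join(missing)}"
        image = generate(prompt)
    return prompt, image


# Stubs so the sketch runs without a model: the "image" is just the prompt text.
KNOWN = ["crown", "mace", "noose"]


def fake_generate(prompt):
    return prompt


def fake_describe(image):
    return {attr for attr in KNOWN if attr in image}


final_prompt, _ = self_correct("Yama on his vehicle", KNOWN, fake_generate, fake_describe)
print(final_prompt)
```

In a real pipeline the required attributes would come from the knowledge graph and the checker would inspect the generated image; here both are reduced to string matching to keep the control flow visible.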
We conduct a comprehensive evaluation of Context Canvas in two stages, assessing both its foundational RAG component and its image generation. We benchmark our approach against SOTA T2I models such as Flux, SDXL, and DALL-E 3, across image generation and two rounds of self-correction.
We evaluate the foundational RAG system used in our method for both retrieval and generation using GEval. Our RAG process achieves high evaluation scores across both retrieval and generation due to meticulous data curation, retrieval, and prompt engineering.
Traditional metrics like CLIP Score and FID, while effective for general image-text alignment and quality, fail to capture the cultural specificity, narrative depth, and relational accuracy crucial for domain-specific tasks. To address these limitations, we use the LLM-as-a-Judge framework and define metrics such as Attribute Accuracy, Context Relevance, Visual Fidelity, and Intent Representation. We implement the metrics using the DeepEval library. These metrics assess nuanced attributes like character-specific elements, situational alignment, and the overall essence of the generated imagery.
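To make the shape of rubric-based judging concrete, here is a minimal sketch that scores one generation against the four named metrics. It deliberately does not use DeepEval: the `judge` callable is a stand-in for an LLM call, the one-line criteria are hypothetical wording, and a text description stands in for the image, so it should be read only as an outline of the idea.

```python
from statistics import mean

# Metric names come from the text; the one-line criteria are hypothetical.
CRITERIA = {
    "Attribute Accuracy": "Are the character's defining attributes depicted correctly?",
    "Context Relevance": "Does the scene match the character's narrative or cultural setting?",
    "Visual Fidelity": "Is the image coherent and free of visual errors?",
    "Intent Representation": "Does the image reflect what the user prompt asked for?",
}


def judge_prompt(criterion, user_prompt, image_description):
    """Build the instruction sent to the judge for a single criterion."""
    return (
        f"Criterion: {criterion}\n"
        f"User prompt: {user_prompt}\n"
        f"Image description: {image_description}\n"
        "Reply with a score between 0 and 1."
    )


def evaluate(user_prompt, image_description, judge):
    """Score one generation on every criterion; judge(text) must return a number."""
    scores = {}
    for name, criterion in CRITERIA.items():
        raw = judge(judge_prompt(criterion, user_prompt, image_description))
        scores[name] = min(1.0, max(0.0, float(raw)))  # clamp to [0, 1]
    return scores, mean(scores.values())


def stub_judge(text):
    """Stand-in judge returning a fixed score so the sketch runs offline."""
    return 0.8


scores, overall = evaluate(
    "Captain Ahab on deck",
    "a gaunt sailor with an ivory leg",
    stub_judge,
)
print(scores, overall)
```

Scoring each criterion with its own judge call keeps the per-metric results separable, which is what allows attribute-level and context-level failures to be reported independently.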
@misc{venkatesh2024contextcanvasenhancingtexttoimage,
title={Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG},
author={Kavana Venkatesh and Yusuf Dalva and Ismini Lourentzou and Pinar Yanardag},
year={2024},
eprint={2412.09614},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.09614}
}