Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG

Virginia Tech, University of Illinois Urbana-Champaign

TL;DR We propose a training-free approach that leverages knowledge graph-based Retrieval-Augmented Generation (RAG) to enhance image generation and editing in text-to-image (T2I) diffusion models, together with a novel RAG context-guided self-correction mechanism. From simple, high-level user prompts, the approach generates contextually and narratively accurate images for complex, domain-specific scenarios that standard T2I models struggle with.

Teaser Image

Context Canvas significantly improves image generation for domain-specific, complex characters that T2I models might otherwise struggle with (top left). Our method also enables disentangled image editing by retrieving precise item descriptions, for example expanding a simple “add a sword” prompt into the fire sword associated with the character Jambavan. It likewise captures relationships encoded in the graph, generating Jambavan’s daughter from a simple “with his daughter” prompt without needing explicit details about her appearance (bottom middle). Finally, we introduce a novel RAG-based self-correction technique that further refines images to improve visual and narrative accuracy (right).

Abstract

We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating Retrieval-Augmented Generation with a knowledge graph. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. Furthermore, we propose a self-correcting mechanism within Stable Diffusion models that leverages the rich context from the graph to guide corrections, ensuring consistency and fidelity in visual outputs. To our knowledge, Context Canvas represents the first application of graph-based RAG to enhancing T2I models, marking a significant advancement in producing high-fidelity, context-aware, multifaceted images.

Method

Paper Method Diagram

Context Canvas introduces a novel knowledge graph-based RAG framework to enhance T2I diffusion models for context-driven generation and editing of images with complex, domain-specific concepts. The framework comprises three key stages: knowledge graph-driven image generation, context-aware image editing, and a novel self-correction mechanism. In the first stage, user prompts are enriched with detailed character attributes and relationships from the knowledge graph, enabling precise depictions. The editing stage seamlessly integrates specific items or features into images by retrieving relevant contextual data. The self-correction stage iteratively refines images using RAG context-guided prompts to align outputs with the intended narrative, addressing complex character traits and maintaining coherence across cultural and contextual dimensions.
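To make the first stage concrete, the sketch below shows one way knowledge graph-driven prompt enrichment could be wired up, using a networkx graph as the knowledge store and a diffusers SDXL pipeline as the generator. This is a minimal sketch, not the exact pipeline used in the paper; the graph schema and the retrieve_context / enrich_prompt helpers are illustrative assumptions.

import networkx as nx
import torch
from diffusers import StableDiffusionXLPipeline

# Toy knowledge graph: nodes carry attribute text, edges carry relationships.
kg = nx.DiGraph()
kg.add_node("Jambavan", attributes="ancient bear-faced king, dark fur, regal ornaments")
kg.add_node("Jambavati", attributes="adopted human daughter, young woman in royal attire")
kg.add_edge("Jambavan", "Jambavati", relation="daughter")

def retrieve_context(graph, entity):
    """Gather the entity's attributes plus those of its graph neighbors."""
    parts = [graph.nodes[entity]["attributes"]]
    for _, neighbor, data in graph.out_edges(entity, data=True):
        parts.append(f"his {data['relation']}: {graph.nodes[neighbor]['attributes']}")
    return "; ".join(parts)

def enrich_prompt(user_prompt, entity, graph):
    """Append retrieved knowledge-graph context to a high-level user prompt."""
    return f"{user_prompt}, {retrieve_context(graph, entity)}"

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = enrich_prompt("Jambavan with his daughter", "Jambavan", kg)
pipe(prompt).images[0].save("jambavan.png")

In practice retrieval would traverse a much richer graph (multiple hops, attribute filtering, LLM-composed prompts), but the principle is the same: the diffusion model only ever sees the enriched prompt.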

Qualitative Results

Image Generation

Image Generation

Context Canvas enhances image generation across diverse domains by integrating cultural and contextual details often missed by standard models. For example, it accurately portrays rare Indian mythological characters like Tumburu, with his horse face and instrument, and Gandabherunda as a dual-headed bird (top left). Domains such as mythology often comprise characters with multiple forms (e.g., celestial and human). Our method picks up subtle cues from user prompts to depict such characters in the right form: the Indian mythological character 'Ganga', for instance, has both a heavenly river form and a human form (duality), and our method represents each accurately (left middle). Our method adapts to various mythologies, capturing Melinoe’s ghostly essence (left bottom) and Zhong Kui’s fierce warrior form (middle bottom). In Project Gutenberg domains such as Historical Fiction, Gothic Horror, and Fantasy, it captures narrative-specific details like Captain Ahab’s ivory leg and gaunt expression (top, 4th column) and Edmond Dantès’ pale skin, coat, and ring (top, 3rd column). For Gothic Horror, it enhances Count Dracula’s menacing presence (middle, 3rd column) and infers Manfred’s guilty, dark persona (middle, 4th column). Our approach faithfully represents Lilith with bat wings and snakes and Lizarel with ethereal beauty and silvery hair (bottom right), demonstrating superior fidelity across cultural and literary domains.

Self-Correcting RAG-Guided Diffusion (SRD)

Self-Correction Results

SRD achieves highly accurate and culturally resonant depictions of complex mythological characters through iterative, context-rich prompt refinement. SRD corrects Vritra’s multi-headed form, drawing on the story’s cultural narrative from a robust knowledge graph to identify and position Indra accurately. For ‘Garuda’, iterative adjustments restore his iconic gold jewelry and vibrant wing color, enhancing his divine representation. For ‘Yama on his vehicle’, our approach automatically identifies his vehicle and adds essential symbols such as his crown, mace, and noose, with precise skin-tone adjustments. Similarly, for ‘Mahishasura’, his incorrect tail is transformed into the snake he is typically depicted with.
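The loop below is a simplified sketch of how such RAG-guided self-correction can be structured: a critic compares the current image against the retrieved knowledge-graph context, and any reported discrepancies are folded back into the prompt for an img2img refinement pass. The critique_image callable (e.g., a vision-language model prompted with the graph context) and the correction strategy are assumptions for illustration, not the paper's exact procedure.

import torch
from diffusers import StableDiffusionXLImg2ImgPipeline

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

def self_correct(image, base_prompt, kg_context, critique_image, max_rounds=2):
    """Refine `image` until the critic finds no mismatches with the KG context."""
    for _ in range(max_rounds):
        # Hypothetical critic, e.g. returns ["missing gold jewelry", "wings should be vibrant"].
        issues = critique_image(image, kg_context)
        if not issues:
            break
        # Fold the corrections into the prompt and re-denoise from the
        # current image rather than starting from scratch.
        correction_prompt = f"{base_prompt}, {', '.join(issues)}"
        image = refiner(prompt=correction_prompt, image=image, strength=0.5).images[0]
    return image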

Editing

Editing Results

Our method enhances ControlNet’s disentangled editing by adding culturally accurate elements. In the first row, for Jambavan, standard ControlNet introduces generic items. In the second row (bottom right), our approach accurately depicts ‘his sword’ as a fire sword, his ‘primary weapon’ as a mace, and ‘his daughter’ as an adopted human. For ‘Shiva’ (bottom row), the system intuitively adds his son ‘Ganesha’, his snake ‘Vasuki’, and his ‘damaru’ in his mountain ‘abode’ without explicit instructions, while standard ControlNet either adds generic objects or fails to make any edits at all.
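A minimal sketch of this kind of context-aware editing is shown below: the terse edit request is expanded with the item description retrieved from the knowledge graph, and a Canny-edge ControlNet keeps the character's pose and identity intact while the enriched prompt drives the edit. The model checkpoints and the edit_with_context helper are illustrative choices, not the exact setup used in the paper.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

def edit_with_context(source_image, edit_request, item_description):
    """Edit while preserving structure via the source image's edge map."""
    edges = cv2.Canny(np.array(source_image), 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))
    prompt = f"{edit_request}: {item_description}"
    return pipe(prompt, image=control).images[0]

source = Image.open("jambavan.png")
# The item description would come from the knowledge graph, e.g. for "his sword":
edited = edit_with_context(source, "Jambavan with his sword", "a blazing fire sword")
edited.save("jambavan_sword.png")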

Qualitative Comparison

Qualitative Comparison

This figure highlights the comparative performance of Context Canvas against state-of-the-art methods. Our approach consistently achieves superior results in capturing contextually accurate details, including intricate features like character-specific appearances and attributes. For example, when depicting Tumburu or Mahishasura, Context Canvas preserves their unique traits with higher fidelity compared to Flux and ControlNet.

Quantitative Results

We conduct a comprehensive evaluation of Context Canvas in two stages, assessing both its foundational RAG component and its image generation. We benchmark our approach against SOTA T2I models, such as Flux, SDXL, and DALL-E 3, across image generation and two rounds of self-correction.

RAG Evaluation

We evaluate the foundational RAG system used in our method for both retrieval and generation using GEval. Our RAG process achieves high scores on both due to meticulous data curation, retrieval, and prompt engineering.
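As a rough illustration of such an evaluation harness, the sketch below scores a single RAG test case with DeepEval's GEval metric, once for retrieval and once for generation. The criteria wording and the toy test case are assumptions, not our exact evaluation setup.

from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# LLM-judged retrieval quality: is the retrieved graph context relevant?
retrieval_quality = GEval(
    name="Retrieval Quality",
    criteria="Is the retrieval context relevant and sufficient for the input query?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
)
# LLM-judged generation quality: is the output grounded in that context?
generation_quality = GEval(
    name="Generation Quality",
    criteria="Is the actual output faithful to and grounded in the retrieval context?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
)

case = LLMTestCase(
    input="Describe Jambavan for image generation.",
    actual_output="Jambavan is an ancient, wise bear-faced king with dark fur ...",
    retrieval_context=[
        "Jambavan: bear king of the Himalayan caves, ancient and wise.",
        "He has an adopted human daughter, Jambavati.",
    ],
)
evaluate(test_cases=[case], metrics=[retrieval_quality, generation_quality])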

RAG Evaluation Table

Comparison with SOTA T2I Models

Traditional metrics like CLIP Score and FID, while effective for general image-text alignment and quality, fail to capture the cultural specificity, narrative depth, and relational accuracy crucial for domain-specific tasks. To address these limitations, we adopt the LLM-as-a-Judge framework, defining metrics such as Attribute Accuracy, Context Relevance, Visual Fidelity, and Intent Representation. We implement these metrics using the DeepEval library. They assess nuanced attributes like character-specific elements, situational alignment, and the overall essence of the generated imagery.
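The sketch below shows how one such judged metric, Attribute Accuracy, might be expressed as a custom DeepEval GEval metric. Here a VLM-produced caption stands in for the generated image, which is a simplifying assumption for illustration rather than our exact implementation.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

attribute_accuracy = GEval(
    name="Attribute Accuracy",
    criteria=(
        "Check that the image description contains every character-specific "
        "attribute required by the expected context, e.g. Garuda's gold "
        "jewelry and vibrant wing color."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

case = LLMTestCase(
    input="Garuda in flight",
    # A VLM caption of the generated image stands in for the image itself.
    actual_output="A golden-winged eagle deity soaring, adorned with gold jewelry.",
    expected_output="Garuda: divine eagle with vibrant golden wings and gold ornaments.",
)
attribute_accuracy.measure(case)
print(attribute_accuracy.score, attribute_accuracy.reason)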

SOTA Benchmarking Table

We compare Context Canvas with SOTA T2I models across key quantitative metrics. Our framework achieves superior scores in cultural and contextual fidelity, image coherence, and narrative-specific accuracy, demonstrating its robustness and adaptability across diverse domains.

BibTeX

@misc{venkatesh2024contextcanvasenhancingtexttoimage,
  title={Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG},
  author={Kavana Venkatesh and Yusuf Dalva and Ismini Lourentzou and Pinar Yanardag},
  year={2024},
  eprint={2412.09614},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.09614}
}