Code repo for the in-press O'Reilly book on GenAI design patterns by Valliappa Lakshmanan and Hannes Hapke: https://www.oreilly.com/library/view/generative-ai-design/9798341622654/
These are the 32 design patterns covered in the book:
Chapter 2: Controlling Content Style (Patterns 1-5)
Pattern Number | Pattern Name | Problem | Solution | Usage Scenarios | Code Example |
---|---|---|---|---|---|
1 | Logits Masking | Need to ensure generated text conforms to specific style rules for brand, accuracy, or compliance reasons. | Intercept generation at the sampling stage to zero out the probabilities of continuations that don't meet the rules; see the sketch after this table. | Use words associated with a specific brand; avoid repeating factual information; make content compliant with a style book. | examples/01_logits_masking
2 | Grammar | Need text to conform to a specific format or data schema for downstream processing. | Specify rules as a formal grammar (e.g., BNF) or schema that the model framework applies to constrain token generation. | Generating valid SQL timestamps; extracting structured data in a specific format; ensuring output conforms to JSON schema. | examples/02_grammar |
3 | Style Transfer | Need to convert content into a form that mimics specific tone and style that is difficult to express through rules, but can be shown through example conversions. | Use few-shot learning or model fine-tuning to teach the model how to convert content to the desired style. | Rewriting generic content to match brand guidelines; converting academic papers to blog posts; transforming image and text content for different social media platforms or audiences. | examples/03_style_transfer |
4 | Reverse Neutralization | Need to generate content in a specific style that can be shown through example content. | Use an LLM to generate content in an intermediate neutral form, and a fine-tuned LLM to convert that neutral form into the desired style. | Generating letters in region-specific legalese; generating emails in personal style. | examples/04_reverse_neutralization |
5 | Content Optimization | Need to determine optimal style for content without knowing which factors matter. | Generate pairs of content, compare them using an evaluator, create a preference dataset, and perform preference tuning. | Optimizing ad copy, marketing content, or educational materials where effective style factors are unknown. | examples/05_content_optimization |
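To make Pattern 1 (Logits Masking) concrete, here is a minimal sketch using the Hugging Face transformers `LogitsProcessor` hook; the `gpt2` model and the banned brand words are illustrative assumptions, not the book's own example (see examples/01_logits_masking for the full version).

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class BanWordsProcessor(LogitsProcessor):
    """Mask the logits of banned tokens so they can never be sampled."""
    def __init__(self, banned_token_ids):
        self.banned_token_ids = banned_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        scores[:, self.banned_token_ids] = float("-inf")
        return scores

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Ban a couple of words for a hypothetical brand rule (each happens to be a
# single GPT-2 token; multi-token phrases need more careful handling).
banned = [tid for word in [" cheap", " cheapest"]
          for tid in tokenizer(word, add_special_tokens=False).input_ids]

inputs = tokenizer("Our product is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20,
                     logits_processor=LogitsProcessorList([BanWordsProcessor(banned)]))
print(tokenizer.decode(out[0], skip_special_tokens=True))
```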
Chapters 3 and 4: Adding Knowledge (Patterns 6-12)
Pattern Number | Pattern Name | Problem | Solution | Usage Scenarios | Code Example |
---|---|---|---|---|---|
6 | Basic RAG | Knowledge cutoff, confidential data, and hallucinations pose problems for zero-shot generation by LLMs. | Ground the response generated by the LLM by adding relevant information from a knowledge base to the prompt context; see the sketch after this table. | Question answering over confidential or post-cutoff data; reducing hallucinations by grounding responses; applications continue to expand as the technology evolves. | examples/06_basic_rag
7 | Semantic Indexing | Traditional keyword indexing/lookup approaches fail when documents become more complex, contain different media types such as images or tables, or bridge multiple domains. | Use embeddings to capture the meaning of texts, images, and other media types. Find relevant chunks by comparing the embedding of each chunk to that of the query. | | examples/07_semantic_indexing
8 | Indexing at Scale | Dealing with outdated or contradictory information in your knowledge base. | Use metadata, query filtering, and result reranking. | | examples/08_indexing_at_scale
9 | Index-aware Retrieval | Comparing questions to chunks is problematic because the question itself will not appear in the knowledge base, may use synonyms or jargon, or may require holistic interpretation. | Hypothetical answers, query expansion, hybrid search, and GraphRAG. | | examples/09_index_aware_retrieval
10 | Node Postprocessing | Irrelevant content, ambiguous entities, generic answers. | Reranking offers the opportunity to bring in other techniques: hybrid search, query expansion, filtering, contextual compression, disambiguation, and personalization. | | examples/10_node_postprocessing
11 | Trustworthy Generation | How to retain users' trust given that there is no way to completely avoid errors. | Out-of-domain detection, citations, guardrails, human feedback, corrective RAG, and UX design can all help. | | examples/11_trustworthy_generation
12 | Deep Search | RAG systems are less effective for complex information retrieval tasks because of context window constraints, query ambiguity, information verification, shallow reasoning, and multi-hop query challenges. | An iterative process of searching, reading, and reasoning to provide comprehensive answers to complex queries. | | examples/12_deep_search
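Here is a minimal sketch of Pattern 6 (Basic RAG) using sentence-transformers for embedding-based retrieval; the documents, query, and prompt template are hypothetical, and the final LLM call is left out (see examples/06_basic_rag for the full version).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Hypothetical knowledge base, already split into chunks.
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping to Europe typically takes 5-7 business days.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query's."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since vectors are normalized
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "How long do I have to return an item?"
context = "\n".join(retrieve(query))
# Ground the generation by placing the retrieved context in the prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # send `prompt` to the LLM of your choice
```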
Chapter 5: Extending Model Capabilities (Patterns 13-16)
Pattern Number | Pattern Name | Problem | Solution | Usage Scenarios | Code Example |
---|---|---|---|---|---|
13 | Chain of Thought (CoT) | Foundational models often struggle with multi-step reasoning tasks, leading to incorrect or fabricated answers. | CoT prompts the model to break down complex problems into intermediate reasoning steps before providing the final answer. | Complex mathematical problems, logical deductions, and sequential reasoning tasks where step-by-step thinking is required. | examples/13_chain_of_thought |
14 | Tree of Thoughts (ToT) | Many strategic or logical tasks cannot be solved by a single linear reasoning path, requiring exploration of multiple alternatives. | ToT treats problem-solving as a tree search, generating multiple reasoning paths, evaluating them, and backtracking as needed | Complex tasks involving strategic thinking, planning, or creative writing that require exploring multiple solution paths. | examples/14_tree_of_thoughts |
15 | Adapter Tuning | Fully fine-tuning large foundational models for specialized tasks is computationally expensive and requires significant data. | Adapter Tuning trains small add-on neural network layers, leaving the original model weights frozen, making it efficient for specialized adaptation; see the sketch after this table. | Adapting models for specific tasks like classification, summarization, or specialized chatbots with a small (100-10k) dataset of examples. | examples/15_adapter_tuning
16 | Evol-Instruct | Creating high-quality datasets for instruction tuning models on new and complex enterprise tasks is difficult and time-consuming. | Evol-Instruct efficiently generates instruction-tuning datasets by evolving instructions through multiple iterations of LLM-generated tasks and answers. | Teaching models new, domain-specific tasks that are not covered by their pre-training data, particularly in enterprise settings. | examples/16_evol_instruct |
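As one concrete instance of Pattern 15 (Adapter Tuning), here is a minimal LoRA sketch using the PEFT library; the `gpt2` base model and the hyperparameter values are illustrative assumptions, and the training loop itself is omitted (see examples/15_adapter_tuning for the full version).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the adapter output
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # base weights stay frozen

model.print_trainable_parameters()  # only the small adapter layers train
# Train `model` as usual (e.g., with transformers.Trainer) on the small
# task-specific dataset; only the adapter weights are updated.
```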
Chapter 6: Improving Reliability (Patterns 17-20)
Pattern Number | Pattern Name | Problem | Solution | Usage Scenarios | Code Example |
---|---|---|---|---|---|
17 | LLM-as-Judge | Evaluation of GenAI capabilities is hard because the tasks that GenAI performs are open-ended. | Use an LLM as an evaluator to provide detailed, multi-dimensional feedback that can be used to compare models, track improvements, and guide further development. | Evaluation is core to many of the other patterns and to building AI applications effectively. | examples/17_llm_as_judge
18 | Reflection | How to get the LLM to revise an earlier response based on feedback or criticism. | The feedback is used to modify the prompt that is sent to the LLM a second time. | Reliable performance on complex tasks where the approach cannot be predetermined. | examples/18_reflection
19 | Dependency Injection | Need to independently develop and test each component of an LLM chain. | When you build chains of LLM calls, build them such that it is easy to inject a mock implementation to replace any step of the chain; see the sketch after this table. | In any situation where you chain LLM calls or use external tools. | examples/19_dependency_injection
20 | Prompt Optimization | Need to easily update prompts when dependencies change in order to maintain the level of performance. | Systematically set the prompts used in a GenAI pipeline by optimizing them on a dataset of examples. | In any situation where you have to reduce the maintenance overhead associated with LLM version changes (and other dependencies). | examples/20_prompt_optimization
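Here is a minimal sketch of Pattern 19 (Dependency Injection): a two-step chain takes its LLM-calling function as a parameter, so tests can inject a deterministic mock instead of a live model. All names here (`summarize_then_translate`, `mock_llm`) are hypothetical.

```python
from typing import Callable

def summarize_then_translate(text: str, call_llm: Callable[[str], str]) -> str:
    """A two-step chain; every LLM call goes through the injected dependency."""
    summary = call_llm(f"Summarize in one sentence: {text}")
    return call_llm(f"Translate to French: {summary}")

# In production, pass a thin wrapper around a real client,
# e.g. lambda p: client.generate(p).

# In tests, inject a deterministic mock instead of a live model:
def mock_llm(prompt: str) -> str:
    return "un résumé" if prompt.startswith("Translate") else "a summary"

assert summarize_then_translate("a long document ...", mock_llm) == "un résumé"
```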
Chapter 7: Enabling Agents to Take Action (Patterns 21-23)
Pattern Number | Pattern Name | Problem | Solution | Usage Scenarios | Code Example |
---|---|---|---|---|---|
21 | Tool Calling | How can you bridge the LLM and a software API so that the LLM is able to invoke the API and get the job done? | The LLM emits special tokens when it determines that a function needs to be called and also emits the parameters to pass to that function. A client-side postprocessor invokes the function with those parameters and sends the results back to the LLM. The LLM incorporates the function results into its response; see the sketch after this table. | Whenever you want the LLM to not just state the steps needed, but to execute those steps. Also allows you to incorporate up-to-date knowledge from real-time sources, connect to transactional enterprise systems, perform calculations, and use optimization solvers. | examples/21_tool_calling
22 | Code Execution | You have a software system that can do the task, but invoking it involves a DSL. | LLMs generate code that is then executed by an external system. | Creating graphs, annotating images, updating databases. | examples/22_code_execution |
23 | Multi-agent Collaboration | Handle multi-step tasks that require different tools, maintain context over extended interactions, evaluate situations and take appropriate actions without human intervention, and adapt to user preferences. | Multi-agent architectures allow you to solve real-world problems by building specialized single-purpose agents and organizing them in ways that mimic human organizational structures. | Complex reasoning, multi-step problem solving, collaborative content creation, adversarial verification, specialized domain integration, self-improving systems | examples/23_multi_agent
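A minimal, framework-agnostic sketch of Pattern 21 (Tool Calling): it assumes the model has been prompted to emit a JSON object naming a tool and its arguments. The `get_weather` tool and the model output shown are hypothetical; see examples/21_tool_calling for a full version using a real function-calling API.

```python
import json

def get_weather(city: str) -> str:
    """Stand-in for a real weather API call."""
    return f"22°C and sunny in {city}"

TOOLS = {"get_weather": get_weather}  # registry of callable tools

# Assume the LLM, given the tool schema, emitted this function call.
model_output = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'

# Client-side postprocessor: parse the call, dispatch it, capture the result.
call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)
# The result would now be appended to the conversation and sent back to the
# LLM so it can incorporate the tool output into its final response.
```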
Chapter 8: Addressing Constraints (Patterns 24-28)
Pattern Number | Pattern Name | Problem | Solution | Usage Scenarios | Code Example |
---|---|---|---|---|---|
24 | Small Language Model (SLM) | The foundational model you are using is introducing too much latency or cost. | Use a small foundational model to fit within cost and latency constraints without compromising unduly on quality by employing quantization (reduce precision of model parameters), distillation (narrow knowledge scope), or speculative decoding (backstop with a larger model). | Narrow-scoped knowledge applications, cost reduction, edge device deployment, faster inference requirements, GPU-constrained environments | examples/24_small_language_model
25 | Prompt Caching | User requests follow patterns with repeated queries. Recomputing the same responses wastes resources and increases costs. | Reuse previously generated responses (in the case of client-side caching) and/or model internal states (in the case of server-side caching) for the same or similar prompts. The similarity can be based on prompt meaning (semantic cache) or overlap (prefix caching); see the sketch after this table. | Applications with repeated queries, cost optimization, interactive applications requiring fast responses, multi-tenant systems | examples/25_prompt_caching
26 | Inference Optimization | Self-hosting LLMs brings with it GPU constraints and hardware utilization challenges. Real-time applications need faster response times. | Improves the efficiency of model inference by employing continuous batching (requests are pulled from a queue and slotted into GPU cores as soon as they become available), speculative decoding (efficiently compute the next set of tokens whenever the smaller model is able to do so, backstopping this with a large model), and/or prompt compression (preprocess prompts to make them shorter). | Self-hosted LLM deployments, real-time applications, GPU memory-constrained environments, high-throughput serving scenarios | examples/26_inference_optimization |
27 | Degradation Testing | Need metrics to help identify when service quality degrades and the constraint under which the application is bounded. | A set of core metrics — Time-to-First-Token (TTFT), End-to-End Request Latency (EERL), Tokens per Second (TPS) — and a variety of scalability and resilience metrics can help identify degradation of service quality; targeted interventions can help improve specific metrics. | Pre-production testing, performance validation, bottleneck identification, capacity planning, ongoing monitoring and optimization. | examples/27_degradation_testing |
28 | Long-Term Memory | LLM applications need to simulate memory of past interactions by prepending relevant history to each prompt, but this approach can become costly and inefficient with long conversations due to context window limitations. | LLM applications use various types of memory – working, episodic, procedural, and semantic – to maintain context, recall past interactions, personalize responses, and retain key facts, respectively. | Chatbots, multi-step workflows, personalization, processing large documents | examples/28_long_term_memory |
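Here is a minimal sketch of the client-side flavor of Pattern 25 (Prompt Caching): prompts that are identical after normalization are served from a local cache instead of triggering another model call. The `call_llm` stub is a hypothetical stand-in for a real client; a semantic cache would compare prompt embeddings rather than exact strings.

```python
import functools

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    print("cache miss -> calling the model")
    return f"response to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_generate(normalized_prompt: str) -> str:
    return call_llm(normalized_prompt)

def generate(prompt: str) -> str:
    # Normalize whitespace and case so trivially different prompts
    # share a single cache entry.
    return cached_generate(" ".join(prompt.split()).lower())

generate("What is RAG?")     # cache miss: calls the model
generate("what is   RAG?")   # cache hit: no second model call
```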
Chapter 9: Setting Safeguards (Patterns 29-32)
Pattern Number | Pattern Name | Problem | Solution | Usage Scenarios | Code Example |
---|---|---|---|---|---|
29 | Template Generation | The risk of sending content without human review is very high, but human review will not scale to the volume of communications. | Pregenerate templates that are reviewed beforehand. At inference time, only deterministic string replacement is needed, so the output is safe to send directly to consumers. | Personalized communications in business-to-consumer settings. | examples/29_template_generation
30 | Assembled Reformat | Content needs to be presented in an appealing way, but the risk posed by dynamically generated content is too high. | Reduce the risk of inaccurate or hallucinated content by separating out the task of content creation into two low-risk steps — first, assembling data in low-risk ways and second, formatting the content based on that data. | Situations where accurate content needs to be presented in appealing ways, such as in product catalogs. | examples/30_assembled_reformat |
31 | Self-Check | Identify potential hallucinations cost-effectively. | Use token probabilities to detect hallucination in LLM responses; see the sketch after this table. | In any situation where factual (as opposed to creative) responses are needed. | examples/31_self_check
32 | Guardrails | Require safeguards for security, data privacy, content moderation, hallucination, and alignment to ensure that AI applications operate within ethical, legal, and functional parameters. | Wrap the LLM calls with a layer of code that preprocesses the information going into the model and/or postprocesses the output of the model. Knowledge retrieval and tool use will also need to be protected. | Anytime your application could be subject to attacks by malicious adversaries. | examples/32_guardrails
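Finally, a minimal sketch of Pattern 31 (Self-Check): flag a response as a possible hallucination when its average token log-probability is low. The `token_logprobs` values stand in for what an LLM API that exposes per-token log probabilities would return, and the 0.5 threshold is an illustrative assumption to be tuned on labeled data.

```python
import math

def confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of the per-token probabilities of a response."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log probabilities returned alongside a response.
token_logprobs = [-0.05, -0.2, -1.9, -0.1]

score = confidence(token_logprobs)
if score < 0.5:  # illustrative threshold; tune on a labeled dataset
    print(f"Low confidence ({score:.2f}): route to human review or regenerate")
else:
    print(f"Confidence {score:.2f}: accept the response")
```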
- The best part is the inclusion of working examples for each pattern and explanations of code snippets, which make the concepts much clearer. -- Manjunath Janardhan
- It's packed with design patterns, and even when I thought I knew a pattern, Valliappa Lakshmanan and Hannes Hapke offered valuable new insights. There are plenty of examples throughout the book to help illustrate and deepen understanding of the various patterns. This book is an absolute gem! -- Glen Yu
- If you have implemented any of the patterns in the book in production, submit a PR to update the USAGE.md in the folder corresponding to the pattern. See examples/15_adapter_tuning/USAGE.md for an example.
The GenAI Design Patterns book is a companion book to the O'Reilly book Machine Learning Design Patterns.