GenAI Glossary

Clear definitions for AI and GenAI terminology. From RAG to embeddings, understand the key concepts driving modern AI applications. 12 terms defined.

A

AI Agents

Also known as: LLM Agents, Autonomous Agents, Agent Systems

techniques

Autonomous systems that use LLMs to perceive their environment, make decisions, take actions, and work toward goals, often with access to tools and the ability to plan multi-step workflows.

AI agents go beyond simple question-answering by incorporating perception, reasoning, planning, and action. They can use tools, interact with APIs, and execute complex multi-step workflows autonomously.

Agent Components

  1. Perception: Understanding the environment/task
  2. Planning: Breaking down goals into steps
  3. Action: Executing steps using tools
  4. Memory: Maintaining context and learning
  5. Reflection: Evaluating and improving performance

Popular Agent Frameworks

  • LangChain: Python/JS framework for agents
  • AutoGPT: Autonomous goal-oriented agent
  • BabyAGI: Task management and execution
  • CrewAI: Multi-agent collaboration
  • Microsoft AutoGen: Multi-agent conversations

Common Agent Patterns

  • ReAct: Reasoning + Acting in loops (sketched after this list)
  • Plan-and-Execute: Upfront planning, then execution
  • Tool-using: Agents with API/function access
  • Multi-agent: Specialized agents collaborating
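
To make the ReAct pattern concrete, here is a minimal sketch of the reason-act loop. The `call_llm` stub and the `search` tool are hypothetical stand-ins; frameworks like LangChain implement this same loop with robust output parsing and error handling.

```python
# Minimal ReAct-style loop: the model alternates between reasoning
# ("Thought"), invoking a tool ("Action"), and reading the result
# ("Observation") until it produces a final answer.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat-completion API call.
    raise NotImplementedError("plug in your LLM client here")

TOOLS = {
    "search": lambda q: f"(search results for {q!r})",  # stubbed tool
}

def react_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        reply = call_llm(transcript + "\nThought:")
        transcript += f"\nThought: {reply}"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        if "Action:" in reply:
            # Expect a line like: Action: search[latest GenAI frameworks]
            action = reply.split("Action:")[-1].strip()
            name, _, arg = action.partition("[")
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            transcript += f"\nObservation: {observation}"
    return "Stopped: step limit reached"
```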

Use Cases

  • Customer support automation
  • Research and data analysis
  • Code generation and debugging
  • Business process automation

C

Context Window

Also known as: Context Length, Context Limit, Token Limit

concepts

The maximum amount of text (in tokens) that an LLM can process in a single request, including both the input prompt and generated output.

Context windows determine how much information an LLM can “remember” during a conversation or task. Larger context windows enable more sophisticated applications but come with increased cost and latency.

Common Context Sizes

  • GPT-4 Turbo: 128K tokens (~300 pages)
  • Claude 3.5 Sonnet: 200K tokens (~500 pages)
  • Gemini 1.5 Pro: 1M tokens (~2,400 pages)
  • GPT-3.5: 16K tokens (~40 pages)

Context Window Components

[System Message] + [User Prompt] + [Retrieved Docs] + [Chat History] + [Output] ≤ Context Limit

Managing Context

  1. Summarization: Condense older messages
  2. Sliding window: Keep recent N messages (sketched below)
  3. RAG: Retrieve only relevant info, not everything
  4. Compression: Use techniques like AutoCompressors
  5. Hierarchical: Split tasks across multiple calls
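
As an illustration of strategy 2, the sketch below keeps only the most recent messages that fit under a token budget; `count_tokens` is a crude stand-in for a real tokenizer.

```python
# Sliding-window context management: keep the system message plus as
# many recent messages as fit under a token budget.

def count_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token in English); use a real
    # tokenizer such as tiktoken for accurate counts.
    return max(1, len(text) // 4)

def sliding_window(system_msg: str, history: list[str], budget: int) -> list[str]:
    kept: list[str] = []
    used = count_tokens(system_msg)
    for msg in reversed(history):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # older messages no longer fit; drop them
        kept.append(msg)
        used += cost
    return [system_msg] + kept[::-1]  # restore chronological order
```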

Trade-offs

  • Larger context: More info, higher cost, slower
  • Smaller context: Less info, cheaper, faster

E

Embeddings

Also known as: Vector Embeddings, Text Embeddings, Semantic Embeddings

concepts

Dense vector representations of text that capture semantic meaning, allowing machines to understand and compare the similarity between different pieces of content.

Embeddings convert text into arrays of numbers (vectors) in a high-dimensional space where semantically similar text is positioned closer together. This enables semantic search, clustering, and recommendation systems.

How Embeddings Work

  1. Input text: “The cat sat on the mat”
  2. Embedding model: Converts to vector [0.23, -0.15, 0.87, …]
  3. Vector space: Similar meanings → similar vectors
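
A minimal sketch of step 3, using cosine similarity, the standard way to compare embedding vectors. The three-dimensional vectors here are toy values; real models produce hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 3-dimensional embeddings; real models emit e.g. 1536 or 3072 dims.
cat = np.array([0.23, -0.15, 0.87])
kitten = np.array([0.25, -0.10, 0.82])
invoice = np.array([-0.60, 0.70, 0.05])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (similar meaning), 0 = unrelated, -1 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitten))   # high: semantically close
print(cosine_similarity(cat, invoice))  # low: semantically distant
```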

Common Embedding Models

  • OpenAI text-embedding-3-large: High quality, API-based
  • Cohere embed-v3: Multilingual support
  • BGE/E5: Open-source alternatives
  • Instructor: Task-specific embeddings

Applications

  • Semantic search in RAG systems
  • Document clustering and organization
  • Recommendation engines
  • Duplicate detection

F

Fine-tuning

Also known as: Model Fine-tuning, LLM Fine-tuning

techniques

The process of further training a pre-trained LLM on domain-specific data to adapt its behavior, knowledge, or style for specialized tasks or domains.

Fine-tuning takes a general-purpose LLM and specializes it by continuing the training process on a curated dataset. This allows the model to better understand domain-specific terminology, follow particular formats, or exhibit desired behaviors.

When to Fine-tune

  • Domain expertise: Medical, legal, technical terminology
  • Format adherence: Specific output structures (see the sketch after this list)
  • Style consistency: Brand voice, tone
  • Edge cases: Handle uncommon scenarios
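
To make format adherence concrete: chat-model fine-tuning data is typically a JSONL file where each line is a complete example conversation. A sketch in the format used by OpenAI-style fine-tuning APIs (the example content is invented):

```python
import json

# One training record: a full conversation demonstrating the exact
# output format the model should learn to produce.
record = {
    "messages": [
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice #4821 from Acme Corp, total $1,250.00"},
        {"role": "assistant",
         "content": '{"invoice_id": "4821", "vendor": "Acme Corp", "total": 1250.0}'},
    ]
}

with open("train.jsonl", "a") as f:  # one JSON object per line
    f.write(json.dumps(record) + "\n")
```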

Fine-tuning vs RAG

  • Fine-tuning: Bakes knowledge into model weights
  • RAG: Provides knowledge at query time

Often, the best approach uses both: fine-tune for behavior, RAG for knowledge.

G

GraphRAG

Also known as: Graph RAG, Graph-based RAG, Knowledge Graph RAG

techniques

An advanced RAG technique that combines knowledge graphs with traditional vector search to capture relationships between entities and enable more nuanced, context-aware retrieval.

While traditional RAG retrieves similar documents, GraphRAG also considers the relationships between entities, enabling it to answer complex questions that require understanding connections and context.

How GraphRAG Works

  1. Extract entities: Identify people, places, concepts
  2. Build graph: Connect related entities
  3. Hybrid retrieval: Combine vector search + graph traversal
  4. Contextual generation: Use both docs and relationships
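
A toy sketch of steps 2 and 3, using networkx for the graph side. Entity extraction and vector search are stubbed out as hypothetical helpers, since production systems use LLMs and a vector database for those stages.

```python
import networkx as nx

# Step 2: build a small knowledge graph of entities and relationships.
g = nx.Graph()
g.add_edge("Ada Lovelace", "Charles Babbage", relation="collaborated_with")
g.add_edge("Charles Babbage", "Analytical Engine", relation="designed")

def vector_search(query: str) -> list[str]:
    # Hypothetical stub: a real system embeds the query and hits a vector DB.
    return ["Ada Lovelace"]

def graph_rag_retrieve(query: str, hops: int = 2) -> set[str]:
    # Step 3: seed with vector hits, then expand via graph traversal to
    # pull in related entities that pure similarity search would miss.
    context: set[str] = set()
    for seed in vector_search(query):
        reachable = nx.single_source_shortest_path_length(g, seed, cutoff=hops)
        context.update(reachable)
    return context

print(graph_rag_retrieve("Who did Ada Lovelace work with, and on what?"))
# Contains: Ada Lovelace, Charles Babbage, Analytical Engine
```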

Advantages Over Traditional RAG

  • Relationship awareness: “Who worked with whom on what?”
  • Multi-hop reasoning: Follow chains of connections
  • Entity disambiguation: Distinguish between similar names
  • Temporal context: Track changes over time
  • Better coherence: Maintain consistency across related facts

When to Use GraphRAG

  • Complex domains: Legal, scientific research, enterprise knowledge
  • Relationship-heavy: Social networks, org charts, supply chains
  • Multi-entity queries: Questions involving multiple connected entities
  • Citation tracking: Following references and attributions

Tools & Frameworks

  • Microsoft GraphRAG
  • Neo4j + LangChain
  • Knowledge graph databases (Neo4j, Amazon Neptune)

H

Hallucination

Also known as: AI Hallucination, Model Hallucination, Confabulation

concepts

When an LLM generates plausible-sounding but factually incorrect or nonsensical information, often with high confidence, presenting fabricated data as truth.

Hallucinations occur because LLMs are trained to predict the next token based on patterns, not to verify truth. They don’t have access to real-time facts and can’t distinguish between accurate and inaccurate information during generation.

Types of Hallucinations

  • Factual: Incorrect facts, dates, statistics
  • Fabricated: Invented references, citations, people
  • Contradictory: Self-contradicting within response
  • Nonsensical: Logically impossible statements

Mitigation Strategies

  1. RAG: Ground responses in retrieved facts
  2. Lower temperature: Reduce randomness
  3. Explicit instructions: “Only use provided context” (see the template after this list)
  4. Source citation: Require references
  5. Confidence scores: Request uncertainty estimates
  6. Human review: Critical info needs verification
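
Strategies 1, 3, and 4 often combine into a single grounded prompt; a minimal template sketch (the wording is illustrative, not canonical):

```python
# A grounded prompt that pairs RAG context with explicit instructions
# to refuse rather than guess, and to cite sources.
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know."
Cite the source ID for every claim.

Context:
{context}

Question: {question}
"""

prompt = GROUNDED_PROMPT.format(
    context="[doc-7] The warranty period is 24 months.",
    question="How long is the warranty?",
)
```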

Why It Matters

Hallucinations are the primary barrier to using LLMs in high-stakes applications like healthcare, legal, and finance. Production systems must implement robust mitigation strategies.

L

LLM

Also known as: Large Language Model, Large Language Models

models

Large Language Models are AI systems trained on vast amounts of text data to understand and generate human-like text, forming the foundation of modern GenAI applications.

LLMs like GPT-4, Claude, and Llama are transformer-based neural networks with billions of parameters, trained to predict the next token in a sequence. This simple training objective enables them to perform a wide variety of language tasks.
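
You can observe the next-token objective directly with a small open model. A sketch using the Hugging Face transformers library and GPT-2 (chosen because it is small enough to run locally, not because it is representative of frontier models):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers

# GPT-2 is a tiny LLM; frontier models do the same thing at far larger
# scale: predict the next token, append it, repeat.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```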

Common LLMs

  • GPT-4 (OpenAI): Multimodal, excellent reasoning
  • Claude (Anthropic): Long context, strong safety
  • Gemini (Google): Multimodal, fast inference
  • Llama (Meta): Open source, customizable

Key Capabilities

  • Text generation and completion
  • Question answering
  • Summarization
  • Translation
  • Code generation
  • Reasoning and analysis

P

Prompt Engineering

Also known as: Prompting, Prompt Design, Prompt Optimization

techniques

The practice of designing and optimizing input prompts to elicit desired outputs from LLMs, combining techniques like few-shot learning, chain-of-thought reasoning, and role-based prompting.

Effective prompt engineering can dramatically improve LLM performance without any model training. It’s both an art and a science, requiring understanding of how models interpret instructions and context.

Common Techniques

  • Zero-shot: Direct instruction without examples
  • Few-shot: Providing examples in the prompt (sketched after this list)
  • Chain-of-thought: Asking model to show reasoning
  • Role prompting: “You are an expert…”
  • System messages: Setting behavior context
  • Formatting: Using XML, JSON, markdown for structure
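
A sketch of a few-shot prompt that also uses delimiters for structure; the task and examples are invented:

```python
# Few-shot sentiment classification: two worked examples teach the
# model the exact output format before it sees the real input.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

<example>
Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive
</example>

<example>
Review: Stopped working after a week. Total waste of money.
Sentiment: negative
</example>

Review: {review}
Sentiment:"""

prompt = FEW_SHOT_PROMPT.format(review="Setup was painless and support was helpful.")
```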

Best Practices

  1. Be specific: Clear, detailed instructions
  2. Provide context: Background information
  3. Show examples: Demonstrate desired format
  4. Constrain output: Specify length, format, tone
  5. Iterate: Test and refine prompts
  6. Use delimiters: Separate instructions from content

Advanced Patterns

  • ReAct: Reasoning + Acting (for agents)
  • Tree of Thought: Exploring multiple reasoning paths
  • Self-consistency: Multiple samples, majority vote

R

RAG

Also known as: Retrieval Augmented Generation, Retrieval-Augmented Generation

techniques

Retrieval-Augmented Generation combines information retrieval with LLM generation to provide accurate, up-to-date responses grounded in external knowledge bases rather than relying solely on the model's training data.

RAG works by first retrieving relevant documents from a knowledge base, then using those documents as context for an LLM to generate a response. This approach significantly reduces hallucinations and allows AI systems to access information beyond their training cutoff date.

How RAG Works

  1. Query Processing: Convert user query into embeddings
  2. Retrieval: Find relevant documents using semantic search
  3. Augmentation: Add retrieved context to the prompt
  4. Generation: LLM generates response based on context
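
The four steps reduce to a few lines of glue code. This sketch assumes hypothetical `embed`, `vector_db`, and `call_llm` components standing in for your embedding model, vector database, and LLM API:

```python
# Minimal RAG pipeline. `embed`, `vector_db`, and `call_llm` are
# hypothetical stand-ins for real components.

def rag_answer(query: str, vector_db, embed, call_llm, k: int = 3) -> str:
    # 1. Query processing: embed the user query.
    query_vector = embed(query)
    # 2. Retrieval: semantic search for the k most similar chunks.
    docs = vector_db.search(query_vector, top_k=k)
    # 3. Augmentation: put the retrieved text into the prompt.
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 4. Generation: the LLM answers, grounded in the retrieved context.
    return call_llm(prompt)
```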

Benefits

  • Reduces hallucinations by grounding responses in facts
  • Enables access to current, domain-specific information
  • More cost-effective than fine-tuning for knowledge updates
  • Provides source attribution for generated answers

T

Temperature

Also known as: Sampling Temperature, Model Temperature

concepts

A parameter controlling the randomness of LLM outputs, where lower values (0-0.3) produce deterministic, focused responses and higher values (0.7-1.0) generate more creative, diverse outputs.

Temperature works by adjusting the probability distribution when the model selects the next token. At temperature 0, the model always picks the most likely token (deterministic). At higher temperatures, less likely tokens have a better chance of being selected.
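
Mechanically, temperature is a one-line change to the softmax over next-token scores; a worked sketch with toy logits:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Dividing logits by T before softmax: T < 1 sharpens the distribution,
    # T > 1 flattens it. Real APIs special-case T=0 as a pure argmax.
    scaled = logits / max(temperature, 1e-8)  # guard against division by zero
    exp = np.exp(scaled - scaled.max())       # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5])  # toy scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # ~[0.99, 0.01, 0.00]: near-deterministic
print(softmax_with_temperature(logits, 1.0))  # ~[0.63, 0.23, 0.14]: model's raw view
print(softmax_with_temperature(logits, 2.0))  # ~[0.48, 0.29, 0.23]: flatter, more random
```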

Temperature Ranges

  • 0.0: Deterministic, repeatable (great for code, facts)
  • 0.1-0.3: Mostly focused, slight variation
  • 0.5-0.7: Balanced creativity and coherence
  • 0.8-1.0: Creative, diverse, sometimes surprising
  • >1.0: Highly random, often incoherent

When to Use What

  • Low temperature (0.0-0.3):

    • Factual Q&A
    • Code generation
    • Data extraction
    • Classification
  • High temperature (0.7-1.0):

    • Creative writing
    • Brainstorming
    • Marketing copy
    • Story generation

Tokens

Also known as: Tokenization, Token Count, Subword Tokens

concepts

The basic units of text that LLMs process, typically representing parts of words, whole words, or characters, used for input processing, output generation, and usage billing.

LLMs don’t process text as human-readable words. Instead, they use tokenization to split text into chunks. One token roughly equals 4 characters in English, or about 0.75 words.

Tokenization Examples

  • “Hello, world!” → ~4 tokens
  • “AI implementation” → 3 tokens
  • “GPT-4” → 2 tokens (GPT, -4)
  • “antidisestablishmentarianism” → 6-8 tokens (split into parts)
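
You can verify counts like these with OpenAI's tiktoken library; exact numbers vary by tokenizer, and this sketch uses the cl100k_base encoding:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "AI implementation", "antidisestablishmentarianism"]:
    tokens = enc.encode(text)
    # Show the count and how the text was split into subword pieces.
    print(f"{text!r}: {len(tokens)} tokens -> {[enc.decode([t]) for t in tokens]}")
```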

Why Tokens Matter

  1. Cost: API pricing is per token (input + output)
  2. Context limits: Max tokens per request (e.g., 8K, 128K)
  3. Performance: More tokens = slower, more expensive
  4. Prompt design: Optimize prompts to reduce token count

Tokenization Strategies

  • Character-level: One character = one token (rare)
  • Word-level: One word = one token (rare in modern LLMs)
  • Subword (BPE): Balance between characters and words (most common)
  • SentencePiece: Language-agnostic tokenization

Tools

  • OpenAI tokenizer (tiktoken)
  • Hugging Face tokenizers
  • Token counters for cost estimation

V

Vector Database

Also known as: Vector DB, Vector Store, Embedding Database

infrastructure

Specialized databases optimized for storing and querying high-dimensional vectors (embeddings), enabling fast similarity search at scale for RAG and semantic search applications.

Vector databases use specialized indexing algorithms (like HNSW, IVF) to efficiently find similar vectors among millions or billions of embeddings, making them essential for production RAG systems.

Popular Vector Databases

  • Pinecone: Fully managed, easy to use
  • Weaviate: Open-source, GraphQL API
  • Qdrant: Rust-based, high performance
  • Chroma: Simple, developer-friendly
  • pgvector: PostgreSQL extension

Key Features

  • Similarity search: Find nearest neighbors
  • Filtering: Combine vector search with metadata
  • Hybrid search: Mix keyword and semantic search
  • Scalability: Handle billions of vectors
  • Real-time updates: Add/update vectors on the fly
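
A minimal end-to-end sketch using Chroma, chosen for its zero-setup in-memory mode. The documents and metadata are invented; Chroma embeds them with a default model unless you supply your own embeddings:

```python
import chromadb  # pip install chromadb

client = chromadb.Client()  # in-memory instance; use a persistent client in production
collection = client.create_collection("docs")

# Add documents with metadata; Chroma computes embeddings automatically.
collection.add(
    ids=["1", "2", "3"],
    documents=[
        "RAG grounds LLM answers in retrieved documents.",
        "Fine-tuning adapts model weights to a domain.",
        "HNSW indexes make similarity search fast at scale.",
    ],
    metadatas=[{"topic": "rag"}, {"topic": "training"}, {"topic": "infra"}],
)

# Similarity search combined with a metadata filter.
results = collection.query(
    query_texts=["How do I reduce hallucinations?"],
    n_results=2,
    where={"topic": "rag"},
)
print(results["documents"])
```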

Ready to implement GenAI?

Now that you understand the terminology, let's build something together.