GenAI Glossary

Clear definitions for AI and GenAI terminology. From RAG to embeddings, understand the key concepts driving modern AI applications. 12 terms defined.

A

AI Agents

Also known as: LLM Agents, Autonomous Agents, Agent Systems

techniques

Autonomous systems that use LLMs to perceive their environment, make decisions, take actions, and work toward goals, often with access to tools and the ability to plan multi-step workflows.

AI agents go beyond simple question-answering by incorporating perception, reasoning, planning, and action. They can use tools, interact with APIs, and execute complex multi-step workflows autonomously.

Agent Components

  1. Perception: Understanding the environment/task
  2. Planning: Breaking down goals into steps
  3. Action: Executing steps using tools
  4. Memory: Maintaining context and learning
  5. Reflection: Evaluating and improving performance

Popular Agent Frameworks

  • LangChain: Python/JS framework for agents
  • AutoGPT: Autonomous goal-oriented agent
  • BabyAGI: Task management and execution
  • CrewAI: Multi-agent collaboration
  • Microsoft AutoGen: Multi-agent conversations

Common Agent Patterns

  • ReAct: Reasoning + Acting in loops (sketched after this list)
  • Plan-and-Execute: Upfront planning, then execution
  • Tool-using: Agents with API/function access
  • Multi-agent: Specialized agents collaborating
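
To make the ReAct pattern concrete, here is a minimal sketch of the reason-act loop. The `call_llm` stub and the `search` tool are hypothetical stand-ins; frameworks like LangChain implement this same loop with robust output parsing and error handling.

```python
# Minimal ReAct-style loop: the model alternates between reasoning
# ("Thought"), invoking a tool ("Action"), and reading the result
# ("Observation") until it produces a final answer.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat-completion API call.
    raise NotImplementedError("plug in your LLM client here")

TOOLS = {
    "search": lambda q: f"(search results for {q!r})",  # stubbed tool
}

def react_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        reply = call_llm(transcript + "\nThought:")
        transcript += f"\nThought: {reply}"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        if "Action:" in reply:
            # Expect a line like: Action: search[latest GenAI frameworks]
            action = reply.split("Action:")[-1].strip()
            name, _, arg = action.partition("[")
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            transcript += f"\nObservation: {observation}"
    return "Stopped: step limit reached"
```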

Use Cases

  • Customer support automation
  • Research and data analysis
  • Code generation and debugging
  • Business process automation

C

Context Window

Also known as: Context Length, Context Limit, Token Limit

concepts

The maximum amount of text (in tokens) that an LLM can process in a single request, including both the input prompt and generated output.

Context windows determine how much information an LLM can “remember” during a conversation or task. Larger context windows enable more sophisticated applications but come with increased cost and latency.

Common Context Sizes

  • GPT-4 Turbo: 128K tokens (~300 pages)
  • Claude 3.5 Sonnet: 200K tokens (~500 pages)
  • Gemini 1.5 Pro: 1M tokens (~2,400 pages)
  • GPT-3.5: 16K tokens (~40 pages)

Context Window Components

[System Message] + [User Prompt] + [Retrieved Docs] + [Chat History] + [Output] ≤ Context Limit

Managing Context

  1. Summarization: Condense older messages
  2. Sliding window: Keep recent N messages (sketched below)
  3. RAG: Retrieve only relevant info, not everything
  4. Compression: Use techniques like AutoCompressors
  5. Hierarchical: Split tasks across multiple calls
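
As an illustration of strategy 2, the sketch below keeps only the most recent messages that fit under a token budget; `count_tokens` is a crude stand-in for a real tokenizer.

```python
# Sliding-window context management: keep the system message plus as
# many recent messages as fit under a token budget.

def count_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token in English); use a real
    # tokenizer such as tiktoken for accurate counts.
    return max(1, len(text) // 4)

def sliding_window(system_msg: str, history: list[str], budget: int) -> list[str]:
    kept: list[str] = []
    used = count_tokens(system_msg)
    for msg in reversed(history):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # older messages no longer fit; drop them
        kept.append(msg)
        used += cost
    return [system_msg] + kept[::-1]  # restore chronological order
```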

Trade-offs

  • Larger context: More info, higher cost, slower
  • Smaller context: Less info, cheaper, faster

E

Embeddings

Also known as: Vector Embeddings, Text Embeddings, Semantic Embeddings

concepts

Dense vector representations of text that capture semantic meaning, allowing machines to understand and compare the similarity between different pieces of content.

Embeddings convert text into arrays of numbers (vectors) in a high-dimensional space where semantically similar text is positioned closer together. This enables semantic search, clustering, and recommendation systems.

How Embeddings Work

  1. Input text: “The cat sat on the mat”
  2. Embedding model: Converts to vector [0.23, -0.15, 0.87, …]
  3. Vector space: Similar meanings → similar vectors
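
A minimal sketch of step 3, using cosine similarity, the standard way to compare embedding vectors. The three-dimensional vectors here are toy values; real models produce hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 3-dimensional embeddings; real models emit e.g. 1536 or 3072 dims.
cat = np.array([0.23, -0.15, 0.87])
kitten = np.array([0.25, -0.10, 0.82])
invoice = np.array([-0.60, 0.70, 0.05])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (similar meaning), 0 = unrelated, -1 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitten))   # high: semantically close
print(cosine_similarity(cat, invoice))  # low: semantically distant
```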

Common Embedding Models

  • OpenAI text-embedding-3-large: High quality, API-based
  • Cohere embed-v3: Multilingual support
  • BGE/E5: Open-source alternatives
  • Instructor: Task-specific embeddings

Applications

  • Semantic search in RAG systems
  • Document clustering and organization
  • Recommendation engines
  • Duplicate detection

F

Fine-tuning

Also known as: Model Fine-tuning, LLM Fine-tuning

techniques

The process of further training a pre-trained LLM on domain-specific data to adapt its behavior, knowledge, or style for specialized tasks or domains.

Fine-tuning takes a general-purpose LLM and specializes it by continuing the training process on a curated dataset. This allows the model to better understand domain-specific terminology, follow particular formats, or exhibit desired behaviors.

When to Fine-tune

  • Domain expertise: Medical, legal, technical terminology
  • Format adherence: Specific output structures (see the sketch after this list)
  • Style consistency: Brand voice, tone
  • Edge cases: Handle uncommon scenarios
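
To make format adherence concrete: chat-model fine-tuning data is typically a JSONL file where each line is a complete example conversation. A sketch in the format used by OpenAI-style fine-tuning APIs (the example content is invented):

```python
import json

# One training record: a full conversation demonstrating the exact
# output format the model should learn to produce.
record = {
    "messages": [
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice #4821 from Acme Corp, total $1,250.00"},
        {"role": "assistant",
         "content": '{"invoice_id": "4821", "vendor": "Acme Corp", "total": 1250.0}'},
    ]
}

with open("train.jsonl", "a") as f:  # one JSON object per line
    f.write(json.dumps(record) + "\n")
```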

Fine-tuning vs RAG

  • Fine-tuning: Bakes knowledge into model weights
  • RAG: Provides knowledge at query time

Often, the best approach uses both: fine-tune for behavior, RAG for knowledge.

G

GraphRAG

Also known as: Graph RAG, Graph-based RAG, Knowledge Graph RAG

techniques

An advanced RAG technique that combines knowledge graphs with traditional vector search to capture relationships between entities and enable more nuanced, context-aware retrieval.

While traditional RAG retrieves similar documents, GraphRAG also considers the relationships between entities, enabling it to answer complex questions that require understanding connections and context.

How GraphRAG Works

  1. Extract entities: Identify people, places, concepts
  2. Build graph: Connect related entities
  3. Hybrid retrieval: Combine vector search + graph traversal
  4. Contextual generation: Use both docs and relationships
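
A toy sketch of steps 2 and 3, using networkx for the graph side. Entity extraction and vector search are stubbed out as hypothetical helpers, since production systems use LLMs and a vector database for those stages.

```python
import networkx as nx

# Step 2: build a small knowledge graph of entities and relationships.
g = nx.Graph()
g.add_edge("Ada Lovelace", "Charles Babbage", relation="collaborated_with")
g.add_edge("Charles Babbage", "Analytical Engine", relation="designed")

def vector_search(query: str) -> list[str]:
    # Hypothetical stub: a real system embeds the query and hits a vector DB.
    return ["Ada Lovelace"]

def graph_rag_retrieve(query: str, hops: int = 2) -> set[str]:
    # Step 3: seed with vector hits, then expand via graph traversal to
    # pull in related entities that pure similarity search would miss.
    context: set[str] = set()
    for seed in vector_search(query):
        reachable = nx.single_source_shortest_path_length(g, seed, cutoff=hops)
        context.update(reachable)
    return context

print(graph_rag_retrieve("Who did Ada Lovelace work with, and on what?"))
# Contains: Ada Lovelace, Charles Babbage, Analytical Engine
```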

Advantages Over Traditional RAG

  • Relationship awareness: “Who worked with whom on what?”
  • Multi-hop reasoning: Follow chains of connections
  • Entity disambiguation: Distinguish between similar names
  • Temporal context: Track changes over time
  • Better coherence: Maintain consistency across related facts

When to Use GraphRAG

  • Complex domains: Legal, scientific research, enterprise knowledge
  • Relationship-heavy: Social networks, org charts, supply chains
  • Multi-entity queries: Questions involving multiple connected entities
  • Citation tracking: Following references and attributions

Tools & Frameworks

  • Microsoft GraphRAG
  • Neo4j + LangChain
  • Knowledge graph databases (Neo4j, Amazon Neptune)

H

Hallucination

Also known as: AI Hallucination, Model Hallucination, Confabulation

concepts

When an LLM generates plausible-sounding but factually incorrect or nonsensical information, often with high confidence, presenting fabricated data as truth.

Hallucinations occur because LLMs are trained to predict the next token based on patterns, not to verify truth. They don’t have access to real-time facts and can’t distinguish between accurate and inaccurate information during generation.

Types of Hallucinations

  • Factual: Incorrect facts, dates, statistics
  • Fabricated: Invented references, citations, people
  • Contradictory: Self-contradicting within response
  • Nonsensical: Logically impossible statements

Mitigation Strategies

  1. RAG: Ground responses in retrieved facts
  2. Lower temperature: Reduce randomness
  3. Explicit instructions: “Only use provided context” (see the template after this list)
  4. Source citation: Require references
  5. Confidence scores: Request uncertainty estimates
  6. Human review: Critical info needs verification
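
Strategies 1, 3, and 4 often combine into a single grounded prompt; a minimal template sketch (the wording is illustrative, not canonical):

```python
# A grounded prompt that pairs RAG context with explicit instructions
# to refuse rather than guess, and to cite sources.
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know."
Cite the source ID for every claim.

Context:
{context}

Question: {question}
"""

prompt = GROUNDED_PROMPT.format(
    context="[doc-7] The warranty period is 24 months.",
    question="How long is the warranty?",
)
```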

Why It Matters

Hallucinations are the primary barrier to using LLMs in high-stakes applications like healthcare, legal, and finance. Production systems must implement robust mitigation strategies.

L

LLM

Also known as: Large Language Model, Large Language Models

models

Large Language Models are AI systems trained on vast amounts of text data to understand and generate human-like text, forming the foundation of modern GenAI applications.

LLMs like GPT-4, Claude, and Llama are transformer-based neural networks with billions of parameters, trained to predict the next token in a sequence. This simple training objective enables them to perform a wide variety of language tasks.
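
You can observe the next-token objective directly with a small open model. A sketch using the Hugging Face transformers library and GPT-2 (chosen because it is small enough to run locally, not because it is representative of frontier models):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers

# GPT-2 is a tiny LLM; frontier models do the same thing at far larger
# scale: predict the next token, append it, repeat.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```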

Common LLMs

  • GPT-4 (OpenAI): Multimodal, excellent reasoning
  • Claude (Anthropic): Long context, strong safety
  • Gemini (Google): Multimodal, fast inference
  • Llama (Meta): Open source, customizable

Key Capabilities

  • Text generation and completion
  • Question answering
  • Summarization
  • Translation
  • Code generation
  • Reasoning and analysis

P

Prompt Engineering

Also known as: Prompting, Prompt Design, Prompt Optimization

techniques

The practice of designing and optimizing input prompts to elicit desired outputs from LLMs, combining techniques like few-shot learning, chain-of-thought reasoning, and role-based prompting.

Effective prompt engineering can dramatically improve LLM performance without any model training. It’s both an art and a science, requiring understanding of how models interpret instructions and context.

Common Techniques

  • Zero-shot: Direct instruction without examples
  • Few-shot: Providing examples in the prompt (sketched after this list)
  • Chain-of-thought: Asking model to show reasoning
  • Role prompting: “You are an expert…”
  • System messages: Setting behavior context
  • Formatting: Using XML, JSON, markdown for structure
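
A sketch of a few-shot prompt that also uses delimiters for structure; the task and examples are invented:

```python
# Few-shot sentiment classification: two worked examples teach the
# model the exact output format before it sees the real input.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

<example>
Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive
</example>

<example>
Review: Stopped working after a week. Total waste of money.
Sentiment: negative
</example>

Review: {review}
Sentiment:"""

prompt = FEW_SHOT_PROMPT.format(review="Setup was painless and support was helpful.")
```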

Best Practices

  1. Be specific: Clear, detailed instructions
  2. Provide context: Background information
  3. Show examples: Demonstrate desired format
  4. Constrain output: Specify length, format, tone
  5. Iterate: Test and refine prompts
  6. Use delimiters: Separate instructions from content

Advanced Patterns

  • ReAct: Reasoning + Acting (for agents)
  • Tree of Thought: Exploring multiple reasoning paths
  • Self-consistency: Multiple samples, majority vote

R

RAG

Also known as: Retrieval Augmented Generation, Retrieval-Augmented Generation

techniques

Retrieval-Augmented Generation combines information retrieval with LLM generation to provide accurate, up-to-date responses grounded in external knowledge bases rather than relying solely on the model's training data.

RAG works by first retrieving relevant documents from a knowledge base, then using those documents as context for an LLM to generate a response. This approach significantly reduces hallucinations and allows AI systems to access information beyond their training cutoff date.

How RAG Works

  1. Query Processing: Convert user query into embeddings
  2. Retrieval: Find relevant documents using semantic search
  3. Augmentation: Add retrieved context to the prompt
  4. Generation: LLM generates response based on context
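
The four steps reduce to a few lines of glue code. This sketch assumes hypothetical `embed`, `vector_db`, and `call_llm` components standing in for your embedding model, vector database, and LLM API:

```python
# Minimal RAG pipeline. `embed`, `vector_db`, and `call_llm` are
# hypothetical stand-ins for real components.

def rag_answer(query: str, vector_db, embed, call_llm, k: int = 3) -> str:
    # 1. Query processing: embed the user query.
    query_vector = embed(query)
    # 2. Retrieval: semantic search for the k most similar chunks.
    docs = vector_db.search(query_vector, top_k=k)
    # 3. Augmentation: put the retrieved text into the prompt.
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 4. Generation: the LLM answers, grounded in the retrieved context.
    return call_llm(prompt)
```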

Benefits

  • Reduces hallucinations by grounding responses in facts
  • Enables access to current, domain-specific information
  • More cost-effective than fine-tuning for knowledge updates
  • Provides source attribution for generated answers

T

Temperature

Also known as: Sampling Temperature, Model Temperature

concepts

A parameter controlling the randomness of LLM outputs, where lower values (0-0.3) produce deterministic, focused responses and higher values (0.7-1.0) generate more creative, diverse outputs.

Temperature works by adjusting the probability distribution when the model selects the next token. At temperature 0, the model always picks the most likely token (deterministic). At higher temperatures, less likely tokens have a better chance of being selected.
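
Mechanically, temperature is a one-line change to the softmax over next-token scores; a worked sketch with toy logits:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Dividing logits by T before softmax: T < 1 sharpens the distribution,
    # T > 1 flattens it. Real APIs special-case T=0 as a pure argmax.
    scaled = logits / max(temperature, 1e-8)  # guard against division by zero
    exp = np.exp(scaled - scaled.max())       # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5])  # toy scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # ~[0.99, 0.01, 0.00]: near-deterministic
print(softmax_with_temperature(logits, 1.0))  # ~[0.63, 0.23, 0.14]: model's raw view
print(softmax_with_temperature(logits, 2.0))  # ~[0.48, 0.29, 0.23]: flatter, more random
```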

Temperature Ranges

  • 0.0: Deterministic, repeatable (great for code, facts)
  • 0.1-0.3: Mostly focused, slight variation
  • 0.5-0.7: Balanced creativity and coherence
  • 0.8-1.0: Creative, diverse, sometimes surprising
  • >1.0: Highly random, often incoherent

When to Use What

  • Low temperature (0.0-0.3):

    • Factual Q&A
    • Code generation
    • Data extraction
    • Classification
  • High temperature (0.7-1.0):

    • Creative writing
    • Brainstorming
    • Marketing copy
    • Story generation

Tokens

Also known as: Tokenization, Token Count, Subword Tokens

concepts

The basic units of text that LLMs process, typically representing parts of words, whole words, or characters, used for input processing, output generation, and usage billing.

LLMs don’t process text as human-readable words. Instead, they use tokenization to split text into chunks. One token roughly equals 4 characters in English, or about 0.75 words.

Tokenization Examples

  • “Hello, world!” → ~4 tokens
  • “AI implementation” → 3 tokens
  • “GPT-4” → 2 tokens (GPT, -4)
  • “antidisestablishmentarianism” → 6-8 tokens (split into parts)
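
You can verify counts like these with OpenAI's tiktoken library; exact numbers vary by tokenizer, and this sketch uses the cl100k_base encoding:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "AI implementation", "antidisestablishmentarianism"]:
    tokens = enc.encode(text)
    # Show the count and how the text was split into subword pieces.
    print(f"{text!r}: {len(tokens)} tokens -> {[enc.decode([t]) for t in tokens]}")
```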

Why Tokens Matter

  1. Cost: API pricing is per token (input + output)
  2. Context limits: Max tokens per request (e.g., 8K, 128K)
  3. Performance: More tokens = slower, more expensive
  4. Prompt design: Optimize prompts to reduce token count

Tokenization Strategies

  • Character-level: One character = one token (rare)
  • Word-level: One word = one token (rare in modern LLMs)
  • Subword (BPE): Balance between characters and words (most common)
  • SentencePiece: Language-agnostic tokenization

Tools

  • OpenAI tokenizer (tiktoken)
  • Hugging Face tokenizers
  • Token counters for cost estimation

V

Vector Database

Also known as: Vector DB, Vector Store, Embedding Database

infrastructure

Specialized databases optimized for storing and querying high-dimensional vectors (embeddings), enabling fast similarity search at scale for RAG and semantic search applications.

Vector databases use specialized indexing algorithms (like HNSW, IVF) to efficiently find similar vectors among millions or billions of embeddings, making them essential for production RAG systems.

Popular Vector Databases

  • Pinecone: Fully managed, easy to use
  • Weaviate: Open-source, GraphQL API
  • Qdrant: Rust-based, high performance
  • Chroma: Simple, developer-friendly
  • pgvector: PostgreSQL extension

Key Features

  • Similarity search: Find nearest neighbors
  • Filtering: Combine vector search with metadata
  • Hybrid search: Mix keyword and semantic search
  • Scalability: Handle billions of vectors
  • Real-time updates: Add/update vectors on the fly
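
A minimal end-to-end sketch using Chroma, chosen for its zero-setup in-memory mode. The documents and metadata are invented; Chroma embeds them with a default model unless you supply your own embeddings:

```python
import chromadb  # pip install chromadb

client = chromadb.Client()  # in-memory instance; use a persistent client in production
collection = client.create_collection("docs")

# Add documents with metadata; Chroma computes embeddings automatically.
collection.add(
    ids=["1", "2", "3"],
    documents=[
        "RAG grounds LLM answers in retrieved documents.",
        "Fine-tuning adapts model weights to a domain.",
        "HNSW indexes make similarity search fast at scale.",
    ],
    metadatas=[{"topic": "rag"}, {"topic": "training"}, {"topic": "infra"}],
)

# Similarity search combined with a metadata filter.
results = collection.query(
    query_texts=["How do I reduce hallucinations?"],
    n_results=2,
    where={"topic": "rag"},
)
print(results["documents"])
```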

Ready to implement GenAI?

Now that you understand the terminology, let's build something together.