The Role of Examples in Prompt Engineering

In the world of large language models (LLMs), examples play a pivotal role in shaping model behavior. Through a technique known as n-shot (or few-shot) prompting, a set of well-crafted examples included in the input prompt can dramatically improve the model's ability to understand the desired task and generate relevant outputs.

However, not all examples are created equal. Poorly chosen examples can lead to subpar results, wasted resources, and frustration for both developers and end-users. On the other hand, a thoughtfully curated set of examples can unlock the true potential of LLMs, enabling them to tackle complex tasks with ease.

The Art of Example Selection

So what makes a good example? Here are some key principles to keep in mind:

  1. Representativeness: Your examples should cover the breadth of scenarios and edge cases your model is likely to encounter in real-world usage. This helps the model generalize better and handle a wide variety of inputs gracefully.

  2. Clarity: Each example should clearly demonstrate the desired input format and expected output structure. Ambiguous or confusing examples will lead to inconsistent model behavior.

  3. Conciseness: While it's important to cover diverse scenarios, aim to do so with the minimal number of examples required. Too many examples can slow down inference and increase costs.

  4. Quality: Examples should be carefully vetted for accuracy, coherence, and alignment with your specific use case. Even a single low-quality example can degrade model performance.

Let's see how we can implement these principles in code using Python, the Instructor library, and Gemini Flash:

import instructor
import google.generativeai as genai
from pydantic import BaseModel
from typing import List
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Configure Gemini
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
    mode=instructor.Mode.GEMINI_JSON,
)

class ReviewAnalysis(BaseModel):
    sentiment: str
    key_points: List[str]

def analyze_review(review: str) -> ReviewAnalysis:
    # Define examples that are representative, clear, concise, and high-quality
    examples = [
        {"review": "This product exceeded my expectations! The quality is top-notch and it's so easy to use. Highly recommend!", 
         "analysis": {"sentiment": "Positive", "key_points": ["Exceeded expectations", "High quality", "Easy to use", "Recommended"]}},
        {"review": "While the product works, it's overpriced for what you get. The customer service was also lacking when I had questions.", 
         "analysis": {"sentiment": "Mixed", "key_points": ["Functional but overpriced", "Poor customer service"]}},
        {"review": "Absolute waste of money. Broke after a week and the company refused to honor the warranty. Stay away!", 
         "analysis": {"sentiment": "Negative", "key_points": ["Poor durability", "Warranty issues", "Not recommended"]}}
    ]

    # Construct the prompt with examples
    prompt = "Analyze the following product review. Determine the overall sentiment and extract key points.\n\n"
    for example in examples:
        prompt += f"Review: {example['review']}\nAnalysis: {example['analysis']}\n\n"

    prompt += f"Review: {review}\nAnalysis:"

    # Generate the analysis using Gemini
    response = client.chat.completions.create(
        messages=[
            {"role": "user", "content": prompt}
        ],
        response_model=ReviewAnalysis,
    )

    return response

# Example usage
review = "The product looks great, but it's not as durable as I hoped. It scratched easily within the first week. However, the functionality is solid and it does what it's supposed to do."
result = analyze_review(review)
print(f"Sentiment: {result.sentiment}")
print("Key Points:")
for point in result.key_points:
    print(f"- {point}")
Output

Sentiment: Mixed
Key Points:
- Aesthetically pleasing
- Poor durability
- Scratched easily
- Functional

This code demonstrates how to use carefully selected examples to guide the model in analyzing product reviews. The examples cover different sentiments and scenarios, providing clear and concise information to help the model understand the task.
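String concatenation is not the only way to supply few-shot examples. Instructor accepts a list of chat messages, so the same demonstrations can be passed as alternating user and assistant turns, which keeps each example clearly delimited. The sketch below is a minimal variant of analyze_review that reuses the client and ReviewAnalysis defined above; it assumes Instructor's Gemini integration maps the assistant role onto Gemini's model role, so verify this against the library version you are using.

import json

def analyze_review_with_messages(review: str) -> ReviewAnalysis:
    # Few-shot examples expressed as (review, expected analysis) pairs
    examples = [
        ("This product exceeded my expectations! The quality is top-notch.",
         {"sentiment": "Positive", "key_points": ["Exceeded expectations", "High quality"]}),
        ("Absolute waste of money. Broke after a week and the warranty was refused.",
         {"sentiment": "Negative", "key_points": ["Poor durability", "Warranty issues"]}),
    ]

    # Task instruction first, then each example as a user/assistant exchange
    messages = [{"role": "user",
                 "content": "Analyze each product review: give the overall sentiment and the key points."}]
    for example_review, analysis in examples:
        messages.append({"role": "user", "content": f"Review: {example_review}"})
        messages.append({"role": "assistant", "content": json.dumps(analysis)})
    messages.append({"role": "user", "content": f"Review: {review}"})

    return client.chat.completions.create(
        messages=messages,
        response_model=ReviewAnalysis,
    )

Because each assistant turn is serialized JSON matching ReviewAnalysis, the examples also demonstrate the exact output structure the model is expected to produce.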

Putting It All Together

Now that we understand the characteristics of effective examples, let's walk through the process of curating a high-quality example set:

  1. Define your use case: Clearly articulate the specific task your model needs to perform and the expected output format. This will guide your example selection process.

  2. Gather a diverse pool of candidates: Collect a wide range of potential examples from various sources - product documentation, customer inquiries, domain experts, etc. The more comprehensive your initial pool, the better.

  3. Evaluate and filter: Assess each candidate example against the principles of representativeness, clarity, conciseness, and quality. Eliminate any that fall short of these criteria.

  4. Refine and test: Iteratively refine your example set by testing it with your model and evaluating outputs. Identify areas for improvement and adjust accordingly.

  5. Monitor and maintain: As your use case evolves, continuously monitor model performance and collect user feedback. Update your example set as needed to maintain optimal results.

Let's implement a simple example set evaluator:

from dotenv import load_dotenv
import os
import google.generativeai as genai
from instructor import Mode
import instructor
from pydantic import BaseModel
from typing import List

# Load environment variables
load_dotenv()

# Configure Gemini
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
    mode=Mode.GEMINI_JSON,
)

class ExampleEvaluation(BaseModel):
    score: float
    feedback: str

class ExampleSetEvaluation(BaseModel):
    overall_score: float
    evaluations: List[ExampleEvaluation]

def evaluate_example_set(examples: List[dict]) -> ExampleSetEvaluation:
    prompt = """
    Evaluate the following set of examples for use in prompt engineering. 
    Consider representativeness, clarity, conciseness, and quality.

    For each example:
    - Provide a score from 0 to 1
    - Give brief, specific feedback

    Finally, provide an overall score from 0 to 1 for the entire set, considering:
    - Diversity of examples
    - Quality of responses
    - General usefulness for training

    Examples:
    """

    for i, example in enumerate(examples, 1):
        prompt += f"\n{i}. {example}"

    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are an expert in evaluating examples for prompt engineering."},
            {"role": "user", "content": prompt}
        ],
        response_model=ExampleSetEvaluation
    )

    return response

# Example usage
examples = [
    {"input": "How do I reset my password?", "output": "To reset your password, follow these steps: 1. Go to the login page. 2. Click on 'Forgot Password'. 3. Enter your email address. 4. Follow the instructions sent to your email."},
    {"input": "What's the weather like today?", "output": "I'm sorry, but I don't have access to real-time weather information. You can check a weather website or app for the most up-to-date forecast in your area."},
    {"input": "Tell me a joke", "output": "Why don't scientists trust atoms? Because they make up everything!"}
]

evaluation = evaluate_example_set(examples)
print("\n=== Example Set Evaluation ===")
print(f"\nOverall Score: {evaluation.overall_score:.2f}")
print("\nIndividual Examples:")
for i, example_eval in enumerate(evaluation.evaluations, 1):
    print(f"\nExample {i}:")
    print(f"Score: {example_eval.score:.2f}")
    print(f"Feedback: {example_eval.feedback}")
Output

=== Example Set Evaluation ===

Overall Score: 0.80

Individual Examples:

Example 1:
Score: 0.90
Feedback: Good, clear, and concise example. The steps are easy to follow.

Example 2:
Score: 0.80
Feedback: Good example of a limitation. Could be improved by suggesting specific resources or apps.

Example 3:
Score: 0.70
Feedback: Good, but the joke could be funnier or more relevant to the task.

This code helps automate the process of evaluating your example set, providing scores and feedback for each example as well as an overall assessment.
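These scores can also feed directly into the refine-and-test step: drop any example that falls below a threshold, rework it, and re-run the evaluation. Below is a minimal sketch that reuses evaluate_example_set and the examples list from above; it assumes the model returns one evaluation per example in the original order (worth validating in practice), and the 0.75 threshold is an illustrative placeholder rather than a recommended value.

from typing import List

SCORE_THRESHOLD = 0.75  # illustrative cutoff; tune for your own use case

def refine_example_set(examples: List[dict]) -> List[dict]:
    evaluation = evaluate_example_set(examples)
    kept = []
    # Pair each example with its evaluation; assumes they come back in order
    for example, item in zip(examples, evaluation.evaluations):
        if item.score >= SCORE_THRESHOLD:
            kept.append(example)
        else:
            # Surface the feedback so a human can rewrite or replace the example
            print(f"Dropping example (score {item.score:.2f}): {item.feedback}")
    return kept

refined_examples = refine_example_set(examples)
print(f"Kept {len(refined_examples)} of {len(examples)} examples")

Dropped examples are not discarded silently: printing the model's feedback gives you a concrete starting point for rewriting them before the next evaluation pass.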

Avoiding Common Pitfalls

Even with a well-defined process, it's easy to fall into common traps when working with examples. Watch out for these pitfalls (a simple diagnostic sketch follows the list):

  • Over-optimizing for specific examples at the expense of generalizability
  • Neglecting edge cases and focusing only on the most common scenarios
  • Allowing biases or inconsistencies to creep into your example set
  • Failing to regularly review and update examples as your use case changes
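A lightweight way to catch the first three pitfalls is to audit how your examples are distributed across the labels and input shapes you care about. The sketch below is a hypothetical diagnostic for a review-analysis example set shaped like the one inside analyze_review; the label names and checks are illustrative, not exhaustive.

from collections import Counter
from typing import List

def audit_example_coverage(examples: List[dict]) -> None:
    # Count how many examples fall under each sentiment label
    sentiments = Counter(example["analysis"]["sentiment"] for example in examples)
    print("Sentiment distribution:", dict(sentiments))

    # Labels that never appear are candidate edge cases to add
    for label in ("Positive", "Negative", "Mixed", "Neutral"):
        if sentiments[label] == 0:
            print(f"Warning: no examples labelled '{label}'")

    # Very short and very long reviews are easy to forget; check the spread
    lengths = [len(example["review"].split()) for example in examples]
    print(f"Review length (words): min={min(lengths)}, max={max(lengths)}")

# Usage: audit_example_coverage(review_examples), where review_examples follows
# the {"review": ..., "analysis": {...}} structure used in analyze_review.

Running a check like this whenever the example set changes makes skew and missing edge cases visible before they show up as inconsistent model behavior.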

The Road Ahead

As LLMs continue to advance, the importance of prompt engineering and example selection will only grow. By mastering these techniques now, you can position your organization to harness the full power of AI and stay ahead of the curve.