Synthetic Data: Building AI Products Without Real Users
Synthetic data generation is a game-changer for AI development, especially when you're just starting out. This post shows you how to create realistic, diverse test data for your AI products. I'll walk you through a practical implementation that you can adapt for your own projects.
The Chicken-and-Egg Problem of AI Development
Let me paint a familiar scene: You've got a brilliant idea for an AI product. Maybe it's a healthcare chatbot that helps patients understand their symptoms, or a real estate assistant that helps buyers find their dream home. You've sketched out the architecture, chosen your tech stack, and you're ready to start building.
But then reality hits: You need data to test and evaluate your AI product, but you need users to generate that data.
It's the classic chicken-and-egg problem of AI development. Without data, you can't build a good product. Without a good product, you can't attract users who generate data. The solution? Synthetic data generation.
Why Synthetic Data is Your Secret Weapon
Synthetic data isn't just a stopgap measure—it's often a superior approach for several reasons:
- Complete control over test scenarios - You can generate examples that cover every edge case, even rare ones that might take months to observe in production.
- Privacy by design - No worries about PII or compliance issues since the data is artificially created.
- Unlimited scale - Need 10,000 examples of a specific scenario? No problem.
- Faster iteration cycles - Test improvements immediately without waiting for real user interactions.
The Framework: Dimensions-Based Synthetic Data Generation
The key to effective synthetic data is thinking in terms of dimensions. Here's the framework I use:
- Features - What capabilities does your AI need to support?
- Scenarios - What situations will your AI encounter?
- Personas - Who will be using your AI and how?
For example, if you're building a scheduling assistant for physicians, your dimensions might look like:
- Features: appointment booking, rescheduling, cancellation, availability checking
- Scenarios: routine scheduling, urgent requests, conflicting appointments, after-hours requests
- Personas: primary care physician, specialist, nurse, administrative staff, patient coordinator
By generating data that covers combinations of these dimensions, you ensure comprehensive testing of your AI system.
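To get a feel for the scale this framework produces, here's a minimal sketch of the coverage idea: take the cross-product of the three dimensions, and every (feature, scenario, persona) triple becomes a test case to generate. The dimension values below are the ones from the example above.

```python
import itertools

# Dimensions for the physician scheduling assistant example
features = ["appointment_booking", "rescheduling", "cancellation", "availability_check"]
scenarios = ["routine_scheduling", "urgent_request", "conflicting_appointment",
             "after_hours_request", "recurring_appointment"]
personas = ["primary_care_physician", "specialist", "nurse", "admin_staff",
            "patient_coordinator"]

# Every (feature, scenario, persona) triple is one test case to generate
combinations = list(itertools.product(features, scenarios, personas))
print(len(combinations))  # 4 * 5 * 5 = 100 combinations
print(combinations[0])    # ('appointment_booking', 'routine_scheduling', 'primary_care_physician')
```

Even with modest dimension lists, the cross-product grows quickly, which is exactly what you want for coverage (and why the full implementation below supports sampling a subset).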
Implementing Synthetic Data Generation with Gemini 2.0 Flash
Let's build a practical implementation using Google's Gemini 2.0 Flash model. I'll show you how to generate synthetic data for a physician scheduling assistant.
Step 1: Set Up Your Environment
First, let's set up our environment with the necessary libraries:
# Install the required packages
!pip install -q google-generativeai pandas python-dotenv
import google.generativeai as genai
import pandas as pd
import os
import json
from typing import List, Dict, Any
import itertools
from dotenv import load_dotenv
Step 2: Configure Your API Key
# Load environment variables
load_dotenv()
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise ValueError("No API key found. Make sure GOOGLE_API_KEY is set in your .env file.")
# Configure the Gemini API
genai.configure(api_key=api_key)
Step 3: Define Your Dimensions
# Define the dimensions for our physician scheduling assistant
features = [
"appointment_booking", # Creating new appointments
"rescheduling", # Changing existing appointments
"cancellation", # Canceling appointments
"availability_check" # Checking open time slots
]
scenarios = [
"routine_scheduling", # Regular appointment booking
"urgent_request", # Need for immediate appointment
"conflicting_appointment", # Schedule conflicts
"after_hours_request", # Requests outside normal hours
"recurring_appointment" # Regular recurring appointments
]
personas = [
"primary_care_physician", # General practitioner
"specialist", # Medical specialist (cardiologist, etc.)
"nurse", # Nursing staff
"admin_staff", # Administrative personnel
"patient_coordinator" # Staff coordinating patient care
]
Step 4: Create a Function to Generate Synthetic Queries
def generate_synthetic_query(
    feature: str,
    scenario: str,
    persona: str,
    model_name: str = "gemini-2.0-flash"
) -> Dict[str, Any]:
    """Generate a synthetic user query based on the given dimensions."""
    # Create a detailed prompt for Gemini
    prompt = f"""
    Generate a realistic user query for a physician scheduling assistant based on the following parameters:

    FEATURE: {feature} - The scheduling capability being tested
    SCENARIO: {scenario} - The scheduling situation
    PERSONA: {persona} - The medical staff role making the query

    Guidelines:
    1. The query should be in natural language as if typed by a medical professional or staff member
    2. Include appropriate medical context and scheduling details that make sense for the persona
    3. Make it realistic - include abbreviations, medical terminology, or incomplete information if appropriate
    4. For urgent requests, make them realistic but not extreme
    5. Include realistic patient details (use fictional names) and appointment types
    6. Do not include any disclaimers or explanations - just output the user query

    Output the query as plain text only.
    """

    # Call Gemini model
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)

    # Extract the query text
    query_text = response.text.strip()

    # Return the query along with its dimensions
    return {
        "feature": feature,
        "scenario": scenario,
        "persona": persona,
        "query": query_text
    }
Step 5: Generate a Diverse Dataset
def generate_synthetic_dataset(
    features: List[str],
    scenarios: List[str],
    personas: List[str],
    samples_per_combination: int = 1
) -> pd.DataFrame:
    """Generate a synthetic dataset covering all dimension combinations."""
    dataset = []

    # Generate all combinations of dimensions
    combinations = list(itertools.product(features, scenarios, personas))

    # For each combination, generate the specified number of samples
    for feature, scenario, persona in combinations:
        for _ in range(samples_per_combination):
            try:
                synthetic_query = generate_synthetic_query(feature, scenario, persona)
                dataset.append(synthetic_query)
                print(f"Generated query for {feature} - {scenario} - {persona}")
            except Exception as e:
                print(f"Error generating query for {feature} - {scenario} - {persona}: {e}")

    # Convert to DataFrame
    return pd.DataFrame(dataset)
Step 6: Generate and Save Your Dataset
# Generate a smaller sample dataset for testing (1 sample per combination)
# Using a subset of dimensions to make it faster
test_features = features[:2] # Just appointment_booking and rescheduling
test_scenarios = scenarios[:2] # Just routine_scheduling and urgent_request
test_personas = personas[:2] # Just primary_care_physician and specialist
print("Generating synthetic queries...")
synthetic_data = generate_synthetic_dataset(
test_features,
test_scenarios,
test_personas,
samples_per_combination=1
)
# Save to CSV
output_file = "physician_scheduler_synthetic_data.csv"
synthetic_data.to_csv(output_file, index=False)
print(f"Saved {len(synthetic_data)} synthetic queries to {output_file}")
# Display a sample
print("\nSample of generated data:")
print(synthetic_data.sample(min(2, len(synthetic_data))))
Output
Generating synthetic queries...
Generated query for appointment_booking - routine_scheduling - primary_care_physician
Generated query for appointment_booking - routine_scheduling - specialist
Generated query for appointment_booking - urgent_request - primary_care_physician
Generated query for appointment_booking - urgent_request - specialist
Generated query for rescheduling - routine_scheduling - primary_care_physician
Generated query for rescheduling - routine_scheduling - specialist
Generated query for rescheduling - urgent_request - primary_care_physician
Generated query for rescheduling - urgent_request - specialist
Saved 8 synthetic queries to physician_scheduler_synthetic_data.csv
Sample of generated data:
feature ... query
1 appointment_booking ... Okay, can you book Mrs. Eleanor Vance, DOB 03/...
2 appointment_booking ... "Need to squeeze in a patient today if possibl...
[2 rows x 4 columns]
Step 7: Generate AI Responses for Your Synthetic Queries
Now that we have user queries, let's generate AI responses to create complete interactions:
def generate_ai_response(
    query: str,
    feature: str,
    scenario: str,
    persona: str,
    model_name: str = "gemini-2.0-flash"
) -> str:
    """Generate an AI response to a synthetic user query."""
    # Create a detailed prompt for Gemini
    prompt = f"""
    You are a physician scheduling assistant. Respond to the following user query:

    USER QUERY: "{query}"

    CONTEXT:
    - This query relates to: {feature}
    - The scenario is: {scenario}
    - The user is a: {persona}

    SCHEDULING SYSTEM INFORMATION:
    - Available appointment slots: 9:00 AM - 5:00 PM, Monday-Friday
    - Appointment durations: 15 min (quick follow-up), 30 min (standard), 60 min (new patient/complex)
    - Emergency slots: Two 30-minute slots reserved each day at 11:30 AM and 3:30 PM
    - Current date/time: Monday, 10:00 AM

    Guidelines:
    1. Respond in a professional, efficient manner appropriate for a medical setting
    2. For urgent requests, try to accommodate within 24-48 hours if possible
    3. Suggest alternative times/dates if requested slot is unavailable
    4. Be concise but thorough with scheduling details
    5. Keep responses under 150 words
    6. Include confirmation of details and any follow-up information needed

    Respond directly as the scheduling assistant would (no meta-commentary).
    """

    # Call Gemini model
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)

    # Return the response text
    return response.text.strip()
# Load the synthetic queries
input_file = "physician_scheduler_synthetic_data.csv"
try:
    synthetic_data = pd.read_csv(input_file)
    print(f"Loaded {len(synthetic_data)} synthetic queries from {input_file}")
except FileNotFoundError:
    print(f"Error: {input_file} not found. Please run generate_physician_scheduler_data.py first.")
    exit(1)
# Generate AI responses for each query
print("Generating AI responses...")
# Add AI responses to our dataset
synthetic_data['ai_response'] = synthetic_data.apply(
lambda row: generate_ai_response(
row['query'],
row['feature'],
row['scenario'],
row['persona']
),
axis=1
)
# Save the complete dataset
output_file = "physician_scheduler_complete_interactions.csv"
synthetic_data.to_csv(output_file, index=False)
print(f"Saved {len(synthetic_data)} complete interactions to {output_file}")
# Display a sample
print("\nSample of complete interactions:")
sample = synthetic_data.sample(min(1, len(synthetic_data)))
for _, row in sample.iterrows():
    print(f"\nQuery ({row['feature']} - {row['scenario']} - {row['persona']}):")
    print(row['query'])
    print("\nAI Response:")
    print(row['ai_response'])
Output
Loaded 8 synthetic queries from physician_scheduler_synthetic_data.csv
Generating AI responses...
Saved 8 complete interactions to physician_scheduler_complete_interactions.csv
Sample of complete interactions:
Query (appointment_booking - routine_scheduling - primary_care_physician):
Book a routine physical for John Smith, DOB 03/15/1988, with Dr. Lee next Tuesday or Wednesday. Ideally, morning appointment. And can you check if he's due for any vaccines?
AI Response:
Okay, I can help with that.
Let me check Dr. Lee's availability for a routine physical for John Smith (DOB 03/15/1988) next Tuesday or Wednesday morning. A routine physical is typically a 30-minute appointment.
Dr. Lee has an opening at 9:00 AM next Tuesday. Would that work?
I will also check Mr. Smith's vaccination records to see if he is due for any vaccines.
Please confirm if the 9:00 AM Tuesday appointment is acceptable. Once confirmed, I will book it and let you know about any needed vaccines.
Taking It Further: Evaluating Your AI with Synthetic Data
Once you have your synthetic dataset with both user queries and AI responses, you can use it to evaluate your AI system. Let's implement a simplified version of the LLM-as-a-Judge approach, which uses few-shot examples with expert critiques:
Note: This is a basic implementation focused on core evaluation functionality. A more comprehensive approach could include error analysis across dimensions, visualization of failure patterns, tracking agreement rates with domain experts, and systematic improvement of prompts based on evaluation results.
# Define domain expert examples with critiques
EXPERT_EXAMPLES = [
{
"query": "Can you book Mrs. Eleanor Vance for her annual physical with Dr. Ramirez? She's flexible next week, preferably sometime in the AM. And also, schedule a follow-up for Mr. David Rossi for a BP check, two weeks from today, any afternoon is fine.",
"response": "I've scheduled Mrs. Vance for next Tuesday at 9:00 AM with Dr. Ramirez for her annual physical. I've also scheduled Mr. Rossi for a BP check two weeks from today at 2:00 PM. Both appointments are confirmed.",
"critique": "The response directly addresses both scheduling requests and provides specific appointment times that match the requested preferences (morning for Mrs. Vance, afternoon for Mr. Rossi). It confirms both appointments have been scheduled, which is the primary goal. However, it doesn't specify the duration of the appointments or ask for confirmation before finalizing, which would be better practice.",
"judgment": "PASS"
},
{
"query": "Need to reschedule Sarah Johnson's cardiology follow-up from next Friday to sometime the following week. She prefers mornings.",
"response": "I'll look into rescheduling Sarah Johnson's cardiology follow-up. What time was her original appointment on Friday?",
"critique": "The response fails to complete the scheduling task. While it acknowledges the request, it asks for information (the original appointment time) that should be available in the scheduling system. The assistant should be able to locate the existing appointment and suggest specific alternative times for the following week in the morning as requested.",
"judgment": "FAIL"
},
{
"query": "Can you fit in Marie Dubois for a sick visit today? She called complaining of cough, fever, and SOB. Thinks it might be the flu. She's an established patient, DOB 03/15/1968. ASAP if possible.",
"response": "I can schedule Marie Dubois (DOB 03/15/1968) for a sick visit today. Given her symptoms (cough, fever, SOB), I've reserved the 11:30 AM emergency slot for a 30-minute appointment. Please let me know if this works for her, or if she needs a different time.",
"critique": "The response appropriately handles an urgent request by offering the emergency slot at 11:30 AM. It acknowledges the patient's symptoms and allocates an appropriate duration (30 minutes) for the visit. It correctly identifies the patient with their DOB and offers flexibility if the suggested time doesn't work. This effectively addresses the urgent nature of the request.",
"judgment": "PASS"
}
]
def evaluate_ai_response(
    query: str,
    response: str,
    feature: str,
    scenario: str,
    persona: str,
    model_name: str = "gemini-2.0-flash"
) -> dict:
    """
    Evaluate an AI response to a user query using the Critique Shadowing approach.

    Args:
        query: The user's query
        response: The AI's response
        feature: The feature being tested
        scenario: The scenario being tested
        persona: The persona making the query
        model_name: The Gemini model to use

    Returns:
        A dictionary containing the evaluation results
    """
    # Format the few-shot examples
    examples_text = "Here are some examples of how to evaluate physician scheduling assistant responses:\n\n"
    for i, example in enumerate(EXPERT_EXAMPLES, 1):
        examples_text += f"EXAMPLE {i}:\n"
        examples_text += f"USER QUERY: \"{example['query']}\"\n\n"
        examples_text += f"ASSISTANT RESPONSE: \"{example['response']}\"\n\n"
        examples_text += f"CRITIQUE: {example['critique']}\n"
        examples_text += f"JUDGMENT: {example['judgment']}\n\n"

    # Create a detailed prompt for Gemini with few-shot examples
    prompt = f"""
    You are an expert evaluator for a physician scheduling assistant. Your job is to determine if the assistant's responses meet the needs of medical professionals in a scheduling context.

    {examples_text}

    Please evaluate the following interaction:

    USER QUERY: "{query}"

    ASSISTANT RESPONSE: "{response}"

    CONTEXT:
    - This query relates to: {feature}
    - The scenario is: {scenario}
    - The user is a: {persona}

    SCHEDULING SYSTEM INFORMATION:
    - Available appointment slots: 9:00 AM - 5:00 PM, Monday-Friday
    - Appointment durations: 15 min (quick follow-up), 30 min (standard), 60 min (new patient/complex)
    - Emergency slots: Two 30-minute slots reserved each day at 11:30 AM and 3:30 PM

    EVALUATION INSTRUCTIONS:
    1. First, write a detailed critique that analyzes how well the response addresses the scheduling request
    2. Consider whether the response:
       - Directly addresses the scheduling request
       - Provides accurate and helpful scheduling information
       - Is appropriate for the medical professional's role
       - Handles the scenario appropriately (especially for urgent requests)
       - Is professional, clear, and efficient
       - Includes necessary appointment details (time, duration, type)
    3. Then provide a binary PASS/FAIL judgment
       - PASS: The response successfully fulfills the core scheduling need
       - FAIL: The response does not adequately address the scheduling request

    Format your response as JSON:
    {{
        "critique": "Your detailed analysis of the response...",
        "judgment": "PASS" or "FAIL",
        "improvement_suggestions": "Specific suggestions for improvement..."
    }}
    """

    # Call Gemini model
    model = genai.GenerativeModel(model_name)
    response_obj = model.generate_content(prompt)

    # Parse the JSON response
    try:
        # Clean the response text to ensure it's valid JSON
        response_text = response_obj.text.strip()

        # Remove any markdown code block formatting if present
        if response_text.startswith("```json"):
            response_text = response_text.replace("```json", "", 1)
        if response_text.endswith("```"):
            response_text = response_text.rsplit("```", 1)[0]
        response_text = response_text.strip()

        evaluation = json.loads(response_text)
        return evaluation
    except json.JSONDecodeError as e:
        # Fallback if response isn't valid JSON
        print(f"Error parsing JSON: {e}")
        print(f"Raw response: {response_obj.text}")
        return {
            "judgment": "ERROR",
            "critique": "Failed to parse evaluation",
            "improvement_suggestions": "N/A",
            "raw_response": response_obj.text
        }
# Load the complete interactions
input_file = "physician_scheduler_complete_interactions.csv"
try:
    synthetic_data = pd.read_csv(input_file)
    print(f"Loaded {len(synthetic_data)} complete interactions from {input_file}")
except FileNotFoundError:
    print(f"Error: {input_file} not found. Please run generate_ai_responses.py first.")
    exit(1)
# Evaluate a sample of the interactions (to save time)
sample_size = min(3, len(synthetic_data))
evaluation_sample = synthetic_data.sample(sample_size)
print(f"Evaluating {sample_size} interactions...")
# Add evaluations to our sample
evaluation_results = []
for idx, row in evaluation_sample.iterrows():
    print(f"Evaluating {row['feature']} - {row['scenario']} - {row['persona']}...")
    evaluation = evaluate_ai_response(
        row['query'],
        row['ai_response'],
        row['feature'],
        row['scenario'],
        row['persona']
    )
    evaluation_results.append({
        **row.to_dict(),
        'evaluation': evaluation
    })
# Save the evaluation results
output_file = "physician_scheduler_evaluations_simple.json"
with open(output_file, 'w') as f:
    json.dump(evaluation_results, f, indent=2, default=str)
print(f"Saved {len(evaluation_results)} evaluations to {output_file}")
# Calculate pass rate
pass_rate = sum(1 for r in evaluation_results if r['evaluation'].get('judgment') == 'PASS') / len(evaluation_results)
print(f"\nOverall pass rate: {pass_rate:.2%}")
# Display detailed evaluation results
print("\nDetailed evaluation results:")
for result in evaluation_results:
    print(f"\n--- {result['feature']} - {result['scenario']} - {result['persona']} ---")
    print(f"Query: {result['query']}")
    print(f"Response: {result['ai_response']}")
    print(f"Judgment: {result['evaluation'].get('judgment', 'ERROR')}")
    print(f"Critique: {result['evaluation'].get('critique', 'N/A')}")
    print(f"Improvement suggestions: {result['evaluation'].get('improvement_suggestions', 'N/A')}")
Output
Loaded 8 complete interactions from physician_scheduler_complete_interactions.csv
Evaluating 3 interactions...
Evaluating appointment_booking - routine_scheduling - primary_care_physician...
Note: This run hit the API rate limit partway through evaluation. When running this code, you may need to implement rate limiting or use a paid API tier with higher quotas. A successful run would show evaluation results with PASS/FAIL judgments and critiques. Here's an example of what successful evaluation results might look like:
Overall pass rate: 66.67%
Detailed evaluation results:
--- appointment_booking - urgent_request - specialist ---
Query: Hey, can you find me an open slot for an urgent ophthalmology consult for James Riley, DOB 03/15/1968? He's experiencing sudden blurred vision in his left eye, possibly a retinal detachment. Needs to be seen ASAP, preferably today or tomorrow. Primary is Dr. Lee.
Response: Okay, I can help with that. Let me check Dr. Lee's schedule for urgent ophthalmology consult availability for James Riley, DOB 03/15/1968, experiencing sudden blurred vision in his left eye.
Given the urgency and the possibility of a retinal detachment, I see an emergency slot available today at 11:30 AM for 30 minutes. Would that work?
If not, the next available emergency slot is tomorrow at 3:30 PM.
Please confirm if either of these times work for Mr. Riley. If not, I can check Dr. Lee's regular schedule, but it may be later in the week. Please let me know his insurance information for verification purposes before the appointment.
Judgment: PASS
Critique: The response is very good and appropriately handles the urgent consult request. It immediately acknowledges the request, confirms the patient's information, and acknowledges the potential severity (retinal detachment). It then offers two emergency slot options (today at 11:30 AM and tomorrow at 3:30 PM), which aligns with the 'ASAP' and 'today or tomorrow' preference. It also states the duration of the slots. Furthermore, it provides alternatives (checking the regular schedule), demonstrating flexibility, and requests insurance information for verification, which is a practical step. The response is clear, professional, and efficient.
Improvement suggestions: While not strictly necessary, mentioning Dr. Lee's name again when offering alternative slots could reinforce that the consult will be with the specified physician.
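Since the run above hit a rate limit, here's one lightweight way to handle that: wrap each model call in a retry helper with exponential backoff. This is a generic sketch, not part of the original scripts; the function name and parameters are illustrative, and in production you'd catch your client library's specific exception types rather than a bare `Exception`.

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=2.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure.

    A generic sketch for riding out transient API errors such as rate
    limits. In production, catch the specific exception types raised by
    your client library instead of a bare Exception.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # Delay doubles each attempt, with jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage (hypothetical): wrap any model call that might be rate-limited, e.g.
# evaluation = with_backoff(lambda: evaluate_ai_response(query, response, f, s, p))
```

You could also simply add a fixed `time.sleep()` between calls; backoff is just more robust when you don't know the quota window.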
Advanced Evaluation Techniques
While the implementation above covers the basics, a production-ready evaluation system would include additional components:
- Dimension-Based Error Analysis: Systematically analyze which combinations of features, scenarios, and personas have the highest failure rates to identify weak points in your AI system.
- Visualization of Failure Patterns: Create charts and graphs to visualize where your system is struggling, making it easier to prioritize improvements.
- Agreement Tracking with Human Experts: Compare LLM judgments with those of domain experts to ensure the evaluation system itself is reliable.
- Iterative Prompt Refinement: Use evaluation results to continuously improve your prompts and few-shot examples.
- Confidence Scoring: Add confidence levels to judgments to identify borderline cases that might need human review.
Implementing these advanced techniques can significantly improve your evaluation system's effectiveness and help you build a more robust AI product.
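To make the first technique concrete, dimension-based error analysis is essentially a groupby over your evaluation results. Here's a minimal sketch assuming your results are in a pandas DataFrame with the dimension columns and a PASS/FAIL `judgment` column, as the evaluation script above produces; the sample rows are fabricated for illustration.

```python
import pandas as pd

# Hypothetical evaluation results: one row per judged interaction
results = pd.DataFrame([
    {"feature": "appointment_booking", "scenario": "routine_scheduling", "judgment": "PASS"},
    {"feature": "appointment_booking", "scenario": "urgent_request",     "judgment": "FAIL"},
    {"feature": "rescheduling",        "scenario": "urgent_request",     "judgment": "FAIL"},
    {"feature": "rescheduling",        "scenario": "routine_scheduling", "judgment": "PASS"},
    {"feature": "rescheduling",        "scenario": "urgent_request",     "judgment": "PASS"},
])

# Failure rate per (feature, scenario) combination, worst first
failure_rates = (
    results.assign(failed=results["judgment"].eq("FAIL"))
           .groupby(["feature", "scenario"])["failed"]
           .mean()
           .sort_values(ascending=False)
)
print(failure_rates)
```

The same pattern extends to any dimension subset (add `persona` to the groupby), and the resulting Series plugs directly into a bar chart for the visualization step.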
Best Practices for Synthetic Data Generation
Based on my experience, here are some best practices to follow:
- Ground your data in reality - Use real-world constraints and examples to make your synthetic data realistic.
- Generate inputs, not outputs - Use LLMs to generate user queries, not the expected AI responses. This prevents your synthetic data from inheriting the biases of the generating model.
- Verify scenario coverage - Make sure your synthetic data actually triggers the scenarios you want to test.
- Start simple, then add complexity - Begin with straightforward test cases before adding nuance to help isolate issues.
- Continuously refine your dimensions - As you learn more about your users and use cases, update your dimensions to better reflect reality.
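The scenario-coverage check above can be automated crudely with keyword matching: does each generated query actually contain language typical of its intended scenario? This is a rough sketch with made-up keyword lists; in practice you'd build the lists from domain knowledge, or have an LLM classify each query instead.

```python
# Hypothetical keyword lists per scenario (illustrative, not exhaustive)
SCENARIO_KEYWORDS = {
    "urgent_request": ["asap", "urgent", "today", "immediately"],
    "after_hours_request": ["evening", "weekend", "after hours", "saturday"],
}

def covers_scenario(query: str, scenario: str) -> bool:
    """Crude check: does the query contain any keyword typical of its scenario?"""
    keywords = SCENARIO_KEYWORDS.get(scenario, [])
    text = query.lower()
    return any(kw in text for kw in keywords)

print(covers_scenario("Need to see Mr. Riley ASAP, possible retinal detachment", "urgent_request"))  # True
print(covers_scenario("Book a routine physical next month", "urgent_request"))  # False
```

Queries that fail the check are candidates for regeneration with a stricter prompt; a low coverage rate for a scenario usually means the generation prompt isn't emphasizing that dimension enough.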
The ROI of Synthetic Data
The return on investment for synthetic data generation is massive:
- Faster time to market - Start testing and improving your AI before you have real users.
- Higher quality at launch - Identify and fix issues before they affect real users.
- More robust systems - Test edge cases that might be rare in production but critical to handle correctly.
- Reduced development costs - Catch issues early when they're cheaper to fix.
In my experience, teams that leverage synthetic data effectively can cut their development time in half while delivering a more robust product.
Conclusion: Start Generating, Start Learning
Synthetic data generation isn't just a technical solution—it's a mindset shift. Instead of waiting for perfect data to fall into your lap, you proactively create the data you need to build and test your AI systems.
The code examples in this post give you a starting point, but the real power comes from adapting these techniques to your specific use case. Think about the dimensions that matter for your AI product, then use Gemini 2.0 Flash to generate diverse, realistic test data. The few-shot examples with expert critiques in our evaluation approach are particularly powerful - they help the model understand exactly what constitutes a good response in your specific domain.
Remember, the goal isn't to replace real user data entirely—it's to bootstrap your development process so you can build a better product faster. Once you have real users, you can combine synthetic and real data for even more powerful testing and improvement.
So what are you waiting for? Start generating, start testing, and start building better AI products today.