You have the data. You know it has answers in it. But getting from a raw CSV to a clear insight that someone will act on? That's the gap AI closes — not by replacing your analytical thinking, but by eliminating the mechanical work between question and answer.
These 40 prompts cover the full data analysis pipeline: cleaning messy data, exploring patterns, running statistical tests, designing visualizations, writing reports, and generating the SQL or Python code that makes it happen.
Every prompt includes {{placeholders}} for your dataset specifics. The more context you give about your data, the better the output. A prompt that says "analyze my sales data" gets generic advice. A prompt that says "analyze quarterly revenue by product line for a B2B SaaS company with 3 pricing tiers" gets something you can use.
For custom data analysis prompts tailored to your specific dataset, try the Data Analysis Prompt Generator or build your own with the AI Prompt Generator.
ChatGPT Code Interpreter vs. Claude: Which to Use When
Before diving into the prompts, know your tools.
ChatGPT with Code Interpreter lets you upload files directly. It can run Python code on your data, produce charts, and iterate on analysis in real time. Use it when you have a CSV or Excel file and want hands-on analysis. Best for prompts in the Exploratory Analysis, Statistical Analysis, and Visualization sections.
Claude excels at reasoning about data you describe in text. It produces superior written analysis, better report prose, and more thoughtful interpretations. It handles large amounts of pasted data well. Best for prompts in the Reporting section and for strategy and interpretation work.
Either model works well for SQL generation, code writing, and data cleaning prompts. Use whichever you have open.
Tip: For sensitive data you can't upload to AI tools, describe the schema, column types, row count, and 5-10 anonymized sample rows. Every prompt below works with both uploaded files and described datasets.
Data Cleaning Prompts (6)
1. Data Quality Assessment
I have a dataset with the following structure:
Columns: {{list column names and data types}}
Row count: {{approximate number of rows}}
Source: {{where this data came from}}
Time range: {{date range covered}}
Sample rows:
{{paste 10 representative rows including some with issues}}
Assess the data quality:
1. Identify columns with likely null/missing values and suggest handling strategies (impute, drop, flag)
2. Flag potential data type issues (numbers stored as strings, inconsistent date formats)
3. Identify likely outliers and whether they're errors or valid extreme values
4. Check for duplicate rows or near-duplicates
5. Flag inconsistencies in categorical columns (capitalization, spelling variations, abbreviations)
6. Rate overall data quality on a 1-10 scale with justification
For each issue found, provide the specific cleaning step needed.
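To give a sense of what the mechanical part of this assessment looks like, here is a minimal pandas sketch on a toy table (column names and values are hypothetical, not from any real dataset):

```python
import pandas as pd

# Toy data with deliberate issues: a null, a duplicate row, numbers stored
# as strings, and inconsistent casing in a categorical column.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.5", "20.0", "20.0", None],
    "status": ["Shipped", "shipped", "shipped", "PENDING"],
})

null_pct = df.isna().mean() * 100          # % missing per column
dup_rows = int(df.duplicated().sum())      # exact duplicate rows
type_issue = df["amount"].dtype == object  # numeric data stored as strings
# Fewer unique values after lowercasing means casing variants exist
casing_variants = df["status"].str.lower().nunique() < df["status"].nunique()
```

The AI's value is in interpreting these numbers and recommending fixes, not just computing them.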
2. Standardization Script
Write a Python/pandas script to standardize this dataset:
Current column names: {{list messy column names}}
Target column names: {{list clean column names you want}}
Standardization rules:
- Dates should be in {{YYYY-MM-DD or other format}}
- Currency values should be numeric (remove $, commas, handle parentheses for negatives)
- Phone numbers should be {{format}}
- Addresses should be parsed into: street, city, state, zip
- Names should be: first_name, last_name (split from full name field)
- All string columns: strip whitespace, handle encoding issues
- Categorical columns: {{list specific mappings — e.g., "Y/N/Yes/No" → True/False}}
Output a complete, runnable Python script with comments explaining each transformation. Include a validation summary at the end that prints row counts before/after and sample rows.
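The currency rule above is the fiddliest part; a minimal sketch of how a generated script might handle it (hypothetical column name and values):

```python
import pandas as pd

df = pd.DataFrame({"Revenue ($)": [" $1,200.50", "$(300.00)", "$45 "]})
df = df.rename(columns={"Revenue ($)": "revenue"})

s = df["revenue"].str.strip()
negative = s.str.contains(r"\(")  # accounting-style (x) means negative
# Strip $, commas, and parentheses, then convert to float
values = s.str.replace(r"[$,()]", "", regex=True).astype(float)
df["revenue"] = values.where(~negative, -values)
```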
3. Missing Data Strategy
My dataset has missing values in these columns:
{{for each column with missing data, list: column name, data type, approximate % missing, and why it might be missing}}
For each column, recommend and implement the best missing data strategy:
- MCAR (Missing Completely at Random) → what approach?
- MAR (Missing at Random) → what approach?
- MNAR (Missing Not at Random) → what approach?
Consider:
- Is the missingness informative? (Does the fact that data is missing tell us something?)
- Would imputation introduce bias for downstream analysis?
- Should I create a "was_missing" flag column?
- For each imputation method chosen, explain why it's better than alternatives
Write the Python code to implement the chosen strategy for each column.
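One pattern the prompt asks about, preserving the missingness signal before imputing, looks like this in pandas (hypothetical column and values):

```python
import pandas as pd

df = pd.DataFrame({"income": [50_000.0, None, 62_000.0, None, 58_000.0]})

# Flag first, so downstream analysis can still see which rows were imputed
df["income_was_missing"] = df["income"].isna()
# Median imputation is robust to skew; whether it's appropriate depends
# on the missingness mechanism (MCAR/MAR/MNAR) discussed above
df["income"] = df["income"].fillna(df["income"].median())
```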
4. Deduplication Logic
I have a dataset that likely contains duplicate records. The challenge: duplicates aren't exact — they have slight variations.
Dataset: {{describe what each row represents — customers, transactions, products, etc.}}
Key identifying fields: {{list fields that should be unique or near-unique}}
Known variation types: {{e.g., "John Smith" vs "JOHN SMITH" vs "J. Smith", different phone formats, etc.}}
Build a deduplication pipeline that:
1. Identifies exact duplicates and removes them
2. Identifies fuzzy duplicates using appropriate matching (Levenshtein distance, phonetic matching, etc.)
3. Scores match confidence (high/medium/low)
4. Creates a merge strategy (which record to keep, how to combine fields from duplicates)
5. Produces a report of all duplicate clusters for human review before deletion
Write this as a Python script using pandas and any necessary libraries (fuzzywuzzy, recordlinkage, etc.).
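The prompt names fuzzywuzzy and recordlinkage; as an illustration of the underlying idea, here is a sketch using only the standard library's difflib for pairwise similarity scoring (names and threshold are hypothetical):

```python
import difflib
import pandas as pd

df = pd.DataFrame({"name": ["John Smith", "JOHN SMITH", "J. Smith", "Mary Jones"]})

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# All-pairs comparison is fine for small data; real pipelines use
# blocking to avoid comparing every record against every other.
pairs = []
names = df["name"].tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = similarity(names[i], names[j])
        if score >= 0.75:
            pairs.append((names[i], names[j], round(score, 2)))
```

A dedicated library adds phonetic matching, field-weighted scoring, and cluster merging on top of this.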
5. Data Validation Rules
Create a comprehensive data validation framework for this dataset:
Dataset description: {{what the data represents}}
Columns: {{list with data types}}
Business rules:
{{list known constraints — e.g., "order_total must be positive", "ship_date must be after order_date", "status must be one of: pending, shipped, delivered, cancelled"}}
Generate:
1. Validation rules for each column (type checks, range checks, pattern checks, referential integrity)
2. Cross-column validation rules (date ordering, calculated field consistency, logical constraints)
3. Statistical validation rules (values within expected distribution, no sudden shifts in patterns)
4. A Python script that runs all validations and produces a report:
- Number of rows failing each rule
- Sample failing rows for each rule
- Overall data quality score
- Priority ranking of issues to fix first
6. Data Type Conversion and Encoding
My dataset has columns that need type conversion and encoding for analysis:
Current state:
{{list columns with their current types and target types — e.g., "revenue: string with $ signs → float", "category: 15 unique string values → encoded for ML"}}
Generate Python code that:
1. Converts each column to its target type with error handling (what happens when "N/A" is in a numeric column?)
2. Encodes categorical variables appropriately:
- Ordinal categories (low/medium/high) → ordinal encoding
- Nominal categories (red/blue/green) → one-hot or target encoding
- High-cardinality categories (1000+ values) → frequency or target encoding
3. Handles mixed-type columns (column has both numbers and strings)
4. Creates a conversion report showing: original dtype, new dtype, conversion failures, and sample values before/after
Exploratory Data Analysis Prompts (6)
7. Initial Data Exploration
Perform an exploratory data analysis on this dataset:
{{paste column names, types, and 20 sample rows — or upload the file}}
Produce:
1. Summary statistics for all numeric columns (mean, median, std, min, max, quartiles, skewness)
2. Value counts for all categorical columns (top 10 values + "other" count)
3. Missing value summary (count and percentage per column)
4. Distribution assessment for key numeric columns (normal? skewed? bimodal?)
5. The 5 most interesting patterns or anomalies you notice
6. 10 questions this data could answer, ranked by business value
Focus on findings that would surprise or concern a {{business role — e.g., "product manager" or "CFO"}}.
8. Correlation Analysis
Analyze correlations in my dataset:
Key variables I care about: {{list the outcome variables and potential predictors}}
Dataset info: {{describe columns, types, and row count — or upload}}
Produce:
1. Correlation matrix for all numeric variables (highlight strong correlations > 0.5)
2. Non-obvious correlations that might indicate causation worth investigating
3. Variables that are highly correlated with each other (multicollinearity risk for modeling)
4. Correlation between categorical variables and numeric outcomes (using appropriate methods)
5. Time-based correlations if date fields exist (lagged correlations, seasonal patterns)
6. Spurious correlations to be careful about (correlation ≠ causation flags)
Write the Python code and explain each finding in plain language.
9. Segmentation Analysis
Segment the data in my dataset to find meaningful groups:
Dataset: {{describe what each row represents}}
Key columns: {{list the columns most relevant for segmentation}}
Business context: {{what decisions will these segments inform?}}
Perform:
1. Rule-based segmentation using business logic (e.g., high/medium/low value customers based on revenue)
2. Statistical segmentation using clustering (K-means or similar — determine optimal K)
3. Behavioral segmentation if time-series data exists (frequency, recency, monetary value)
4. Profile each segment: size, key characteristics, and how they differ from the overall population
5. Name each segment with a descriptive label a non-technical stakeholder would understand
6. For each segment, suggest one actionable business recommendation
Include Python code for both the segmentation and the profiling.
10. Time Series Exploration
Explore the time-based patterns in my dataset:
Date column: {{column name and format}}
Key metrics to analyze over time: {{list 2-4 numeric columns}}
Time range: {{start to end date}}
Granularity available: {{daily/weekly/monthly/hourly}}
Analyze:
1. Overall trend (increasing, decreasing, flat, cyclical)
2. Seasonality patterns (weekly, monthly, quarterly, annual)
3. Notable anomalies or change points (sudden spikes, drops, or trend shifts)
4. Day-of-week and time-of-day patterns if applicable
5. Year-over-year comparison if multiple years of data exist
6. Growth rate calculations (MoM, QoQ, YoY)
For each finding, specify: the time period, the magnitude of the pattern, and whether it's consistent or a one-time event.
11. Outlier Investigation
Investigate outliers in my dataset:
Columns to check for outliers: {{list numeric columns}}
Dataset context: {{what the data represents and what "normal" looks like}}
For each column:
1. Identify outliers using multiple methods (IQR, Z-score, isolation forest)
2. Categorize each outlier as:
- Data entry error (clearly wrong values)
- True extreme value (real but unusual)
- Systemic issue (a group of related outliers suggesting a process problem)
3. Show the actual outlier values and their context (other column values for those rows)
4. Recommend handling: remove, cap/floor, keep, investigate further
5. Quantify the impact of outliers on summary statistics (mean/median with and without outliers)
Generate Python code that flags outliers and produces a clean dataset with outlier handling applied.
12. Feature Importance Discovery
I'm trying to understand what drives {{target variable}} in my dataset.
Target variable: {{name and type — continuous or categorical}}
Potential predictors: {{list candidate columns}}
Dataset size: {{rows × columns}}
Analyze feature importance using multiple approaches:
1. Univariate analysis: correlation/association of each feature with the target
2. Mutual information: non-linear relationships between features and target
3. Tree-based importance: random forest or gradient boosting feature importances
4. Permutation importance: which features matter most for prediction accuracy
For each approach:
- Rank features from most to least important
- Explain why the rankings differ between methods (they will)
- Identify features that are consistently important across all methods
- Flag features that might be proxies for the same underlying driver
Summarize: "The top 3 drivers of {{target}} are X, Y, Z, and here's what that means for the business."
Statistical Analysis Prompts (6)
13. Hypothesis Testing
I need to test whether {{describe the difference or relationship you're investigating}}.
Context:
- Dataset: {{describe the two groups or conditions being compared}}
- Sample sizes: Group A = {{n}}, Group B = {{n}}
- Variable being compared: {{variable name and type}}
- Significance level: {{0.05 unless you have a reason for something else}}
Perform:
1. State the null and alternative hypotheses in plain language
2. Check assumptions for the appropriate test (normality, equal variance, independence)
3. Select and justify the correct statistical test (t-test, Mann-Whitney, chi-square, ANOVA, etc.)
4. Run the test and report: test statistic, p-value, and confidence interval
5. Calculate effect size (Cohen's d, odds ratio, or appropriate measure)
6. Interpret the result in plain language: "There is/isn't sufficient evidence that..."
7. State the practical significance (is the difference big enough to matter, even if statistically significant?)
Write the Python/R code to run this analysis from data loading to conclusion.
14. A/B Test Analysis
Analyze this A/B test:
Test description: {{what was changed}}
Control group (A): {{sample size and key metric result}}
Treatment group (B): {{sample size and key metric result}}
Primary metric: {{what you're measuring}}
Test duration: {{how long it ran}}
Analyze:
1. Is the sample size sufficient for the observed effect? (Post-hoc power analysis)
2. Is the result statistically significant? (with p-value and confidence interval)
3. What's the practical effect size? (absolute and relative difference)
4. Are there segment-level differences? (If segment data provided: {{list segments}})
5. Novelty effect risk assessment (is the uplift likely to diminish over time?)
6. Recommendation: ship it, kill it, or keep testing — with reasoning
Also flag if there are any red flags in the test design (sample ratio mismatch, peeking problem, multiple comparison issues).
15. Regression Analysis
Build a regression model to predict {{target variable}} using my dataset.
Target variable: {{name, type, range of values}}
Potential predictors: {{list columns to consider}}
Dataset size: {{rows}}
Purpose: {{understanding relationships? making predictions? both?}}
Walk me through:
1. Exploratory analysis of predictor-target relationships
2. Feature selection (which predictors to include and why)
3. Model building (linear, logistic, or polynomial — justify the choice)
4. Assumption checking (linearity, independence, homoscedasticity, normality of residuals)
5. Model performance metrics (R², adjusted R², RMSE, MAE for continuous; AUC, accuracy, precision/recall for classification)
6. Interpretation of coefficients in plain language
7. Residual analysis to check for problems
8. Prediction examples with confidence intervals
Write complete Python code using scikit-learn or statsmodels with clear comments.
16. Cohort Analysis
Perform a cohort analysis on my {{user/customer/employee}} data.
Cohort definition: Group {{entities}} by {{their signup month, first purchase date, hire date, etc.}}
Key metric to track: {{retention, revenue, engagement, etc.}}
Time periods to track: {{weekly/monthly over how many periods}}
Dataset columns available:
{{list relevant columns — entity ID, date of first action, activity dates, metric values}}
Produce:
1. Cohort retention/engagement table (cohort × time period)
2. Visual representation (heatmap description or code to generate one)
3. Which cohorts perform significantly better or worse than average
4. Trends across cohorts (are newer cohorts performing better or worse?)
5. Time-to-event analysis (how long until 50% of a cohort churns/converts/etc.)
6. Actionable insights: what changed for the best/worst performing cohorts?
Write the Python/SQL code to generate this from raw data.
17. Forecasting
Build a forecast for {{metric}} over the next {{time period}}.
Historical data:
- Time range: {{start to end}}
- Granularity: {{daily/weekly/monthly}}
- Known patterns: {{any seasonality, trends, or events you're aware of}}
- External factors: {{anything that might affect future values — promotions, market changes, etc.}}
{{Paste recent data points or upload file}}
Produce:
1. Model selection (moving average, exponential smoothing, ARIMA/SARIMA, Prophet — justify choice)
2. The forecast with confidence intervals (80% and 95%)
3. Decomposition: trend + seasonality + residual components
4. Accuracy metrics on a holdout set (MAPE, RMSE)
5. Scenario analysis: optimistic, baseline, pessimistic forecasts
6. Key assumptions and what could make the forecast wrong
Write the Python code (using statsmodels, Prophet, or scikit-learn as appropriate).
18. Statistical Summary for Non-Technical Audience
I need to present statistical findings to {{audience — e.g., "the executive team" or "the board of directors"}}.
Raw findings:
{{paste your statistical results — p-values, confidence intervals, effect sizes, regression coefficients, etc.}}
Translate this into:
1. A 3-sentence executive summary (no jargon, no p-values, just what it means)
2. A "so what?" section: what should we do differently based on these numbers
3. A confidence statement: how sure are we, in plain language (not "p < 0.05")
4. Analogies or comparisons that make the magnitude intuitive ("this is equivalent to...")
5. Caveats in plain language (what could make this wrong, without undermining the finding)
6. One chart description that would convey the key finding visually
Avoid: p-values, confidence intervals, standard deviations, regression coefficients, or any statistical term the audience wouldn't use in conversation.
Visualization Prompts (6)
19. Dashboard Design Specification
Design a data dashboard for {{audience/role}} tracking {{business area}}.
Key metrics to display:
{{list 5-10 metrics with their data types and update frequency}}
User goals: {{what decisions will this dashboard inform?}}
Data refresh frequency: {{real-time / daily / weekly}}
Tool: {{Tableau / Power BI / Looker / custom D3.js / Python Dash / etc.}}
Produce:
1. Dashboard layout (describe placement of each element)
2. Chart type for each metric with justification
3. Filter and drill-down controls needed
4. Color scheme and encoding rules (what colors mean what)
5. Alert thresholds (what values should trigger visual warnings)
6. Mobile/responsive considerations
7. The "glance test": what should someone understand within 5 seconds of looking at this dashboard?
20. Chart Type Selection
I need to visualize the following data relationships:
{{describe 5-8 specific findings or comparisons you want to show}}
For each finding, recommend:
1. Best chart type and why (not just "bar chart" — why bar over column, grouped over stacked, etc.)
2. What goes on each axis/encoding channel
3. Chart title that communicates the insight (not just "Sales by Region" — instead "Northeast Sales Outpace All Other Regions by 40%")
4. Annotation or callout to highlight the key takeaway
5. Common mistakes to avoid with this chart type
6. An alternative chart type if the first choice isn't available in the tool I'm using
Tool I'm using: {{name}}
Audience: {{who will see these charts}}
21. Python Visualization Code (Matplotlib/Seaborn)
Generate Python visualization code for these charts:
Data structure:
{{describe your DataFrame columns, types, and a few sample rows}}
Charts needed:
1. {{describe chart 1 — what data, what comparison, what insight}}
2. {{describe chart 2}}
3. {{describe chart 3}}
4. {{describe chart 4}}
Style requirements:
- Color palette: {{specific palette or "professional/clean"}}
- Figure size: {{or "appropriate for a presentation/report"}}
- Font sizes readable in {{presentation slides / web page / printed report}}
- Clean, minimal design (no chartjunk)
Generate complete, runnable code using matplotlib and seaborn. Each chart should have:
- Descriptive title
- Axis labels with units
- Legend only if necessary
- Grid lines only if they aid readability
- Any annotations that highlight key data points
22. Interactive Visualization Specification
Design an interactive visualization for exploring {{dataset description}}.
Data dimensions: {{list all filterable/explorable dimensions}}
Key metrics: {{list numeric values to visualize}}
Expected user interactions:
- Filter by {{what}}
- Drill down from {{level}} to {{level}}
- Compare {{what}} across {{dimension}}
- Hover to see {{details}}
Produce:
1. Layout specification (what goes where)
2. Interaction design (what clicking/hovering/filtering does)
3. Transitions and animations (how the viz responds to user actions)
4. Default view (what users see before interacting)
5. Implementation plan for {{Plotly / D3.js / Tableau / Power BI}}
6. Performance considerations for {{row count}} rows of data
Include the code for a working prototype if the tool is Python-based (Plotly Dash, Streamlit).
23. Chart Critique and Improvement
Critique this chart and suggest improvements:
Chart type: {{bar/line/scatter/pie/etc.}}
What it's trying to show: {{the intended message}}
Data: {{describe what data is plotted}}
Current issues I suspect: {{list any concerns — hard to read? misleading? ugly?}}
{{If possible, describe the chart in detail: axes, colors, labels, data points, etc.}}
Evaluate:
1. Does the chart type match the data relationship? (If not, what type would work better?)
2. Is the visual encoding accurate? (Truncated axes? Misleading scales? 3D effects?)
3. Is the data-to-ink ratio good? (Too much chartjunk? Too sparse?)
4. Can a reader extract the key insight within 5 seconds?
5. Accessibility: would this work in black and white? For colorblind readers?
6. Specific improvements with before/after descriptions
24. Automated Report Visualization Suite
I need a complete set of visualizations for a {{monthly/quarterly/annual}} report on {{business area}}.
Key sections of the report:
1. {{section 1 — e.g., "Revenue Performance"}}
2. {{section 2 — e.g., "Customer Metrics"}}
3. {{section 3 — e.g., "Operational Efficiency"}}
4. {{section 4 — e.g., "Market Comparison"}}
For each section:
- Recommend 2-3 charts
- Specify chart type, data source columns, and visual encoding
- Write a chart title that communicates the finding (not just the data)
- Include a one-sentence caption template: "This chart shows [insight]. The key takeaway is [action]."
Then write a Python script that generates all charts from a single DataFrame, outputs them as PNG files with consistent styling, and generates a summary text file with chart captions. The script should be rerunnable each {{reporting period}} with updated data.
Reporting Prompts (6)
25. Executive Data Summary
Write an executive summary based on this data analysis:
Key findings:
{{list 5-8 findings with their supporting numbers}}
Audience: {{role and what decisions they make}}
Context: {{what prompted this analysis — a question, a problem, a quarterly review?}}
Structure:
1. One-paragraph bottom line (the single most important thing the executive needs to know)
2. 3-5 key findings, each in 2-3 sentences with the "so what" implication
3. Recommended actions (specific, not vague — "increase budget for channel X by 15%" not "invest more in marketing")
4. Risks and caveats (what could make these findings wrong)
5. Next steps (what analysis should follow)
Rules:
- No jargon or statistical terms
- Every number needs context (is 15% growth good or bad? compared to what?)
- Lead with insights, not methodology
- Maximum 500 words
26. Data Analysis Narrative
Transform these analysis results into a clear narrative:
Results:
{{paste your analysis output — tables, statistics, key findings}}
Write a data story that:
1. Opens with the question we set out to answer
2. Walks through the evidence in logical order
3. Builds to the key insight (don't reveal the punchline first)
4. Addresses the obvious counterarguments
5. Ends with a clear recommendation and confidence level
Audience: {{who will read this}}
Tone: {{formal report / conversational presentation / Slack summary}}
Length: {{word count}}
Include suggestions for where to place charts or visuals within the narrative, labeled as [CHART: description].
27. Stakeholder-Specific Report Versions
I have one analysis that needs to be communicated to three different audiences:
Analysis summary:
{{paste your key findings and recommendations}}
Write three versions:
1. **C-Suite version** (200 words): Bottom line, business impact, decision needed. No methodology.
2. **Manager version** (500 words): Findings, implications for their team, specific actions. Light methodology.
3. **Analyst version** (800 words): Full findings, methodology, limitations, raw numbers, and reproducibility notes.
Each version should convey the same core insight but at the appropriate depth and vocabulary level. The C-Suite version should be something a CEO reads in 60 seconds and knows exactly what to do.
28. KPI Report Template
Create a reusable KPI report template for {{business area}} reporting.
KPIs to include:
{{list each KPI with: name, definition, target/benchmark, data source}}
Reporting frequency: {{daily/weekly/monthly}}
Audience: {{who receives this report}}
The template should include:
1. Header section (period, prepared by, distribution list)
2. Scorecard summary (all KPIs with status indicators: green/yellow/red)
3. Detailed section per KPI: current value, trend, variance from target, root cause for any red/yellow metrics
4. Commentary section: key events that affected metrics
5. Action items: what's being done about underperforming KPIs
6. Appendix: data definitions and calculation methodology
Make this fill-in-ready. I should be able to update the numbers each {{period}} and have a complete report.
29. Anomaly Report
Write a report explaining this data anomaly:
What happened: {{describe the anomaly — sudden spike, unexpected drop, unusual pattern}}
When: {{date/time range}}
Affected metric(s): {{which KPIs or data points}}
Magnitude: {{how big was the anomaly compared to normal?}}
Investigation data:
{{paste any supporting data — time series, segment breakdowns, related metrics}}
The report should:
1. State what happened in one sentence
2. Quantify the impact (in business terms, not just percentage change)
3. Present the most likely causes (with evidence for each)
4. Rule out common false positives (seasonality, data pipeline issues, one-time events)
5. Recommend immediate actions (if any)
6. Recommend preventive measures to detect this earlier next time
7. Flag if this anomaly changes any existing forecasts or targets
Write for {{audience}} in {{tone — urgent escalation / routine investigation / post-mortem}}.
30. Data Dictionary
Create a comprehensive data dictionary for this dataset:
Dataset name: {{name}}
Source system: {{where the data comes from}}
Refresh frequency: {{how often it updates}}
Columns:
{{list all column names — I'll provide what I know about each}}
Known information:
{{paste any existing documentation, even if incomplete}}
For each column, document:
1. Column name and any aliases
2. Data type (string, integer, float, date, boolean, etc.)
3. Description (what this field represents in business terms)
4. Source (where this value originates — user input, calculated, system-generated, etc.)
5. Possible values (for categorical) or valid range (for numeric)
6. Null handling (is null allowed? what does null mean?)
7. Business rules (any transformations, validations, or dependencies)
8. Related columns (foreign keys, calculated-from relationships)
9. Known data quality issues
10. Example values (3-5 representative examples)
Format as a clean markdown table and also as a JSON schema I can use for automated validation.
SQL Query Prompts (5)
31. Complex Query Builder
Write a SQL query for this analysis:
Database: {{PostgreSQL / MySQL / BigQuery / Snowflake / etc.}}
Tables available:
{{list each table with its columns and relationships / foreign keys}}
What I need:
{{describe the analysis in plain language — e.g., "Monthly revenue by product category for the last 12 months, compared to the same month in the prior year, showing the year-over-year growth rate, excluding refunded orders"}}
Requirements:
- Handle nulls appropriately
- Use CTEs for readability if the query is complex
- Include comments explaining non-obvious logic
- Optimize for {{large tables — millions of rows / small tables — fine to be readable over fast}}
- Output column names should be business-friendly (not database_column_names)
32. Performance Optimization
Optimize this slow SQL query:
{{paste the slow query}}
Database: {{type and version}}
Table sizes: {{approximate row counts for each table}}
Current execution time: {{how long it takes}}
Existing indexes: {{list known indexes, or "unknown"}}
Analyze:
1. Why is this query slow? (Identify the bottleneck — full table scan, missing index, unnecessary join, etc.)
2. Rewrite the query for better performance
3. Suggest indexes that would help (with CREATE INDEX statements)
4. Estimate the expected improvement
5. If the query can't be made fast enough, suggest an alternative approach (materialized view, pre-aggregation, etc.)
Show the before and after, and explain each optimization.
33. Data Pipeline Query Set
I need a set of SQL queries for a data pipeline that transforms raw data into analysis-ready tables.
Source tables:
{{describe raw tables and their contents}}
Target output:
{{describe what the final tables should look like — a clean fact table, dimension tables, aggregated summary table, etc.}}
Write:
1. Staging queries (clean and standardize raw data)
2. Transformation queries (join, aggregate, calculate derived fields)
3. Quality check queries (validate the output before it's used)
4. Each query as a separate, named step with dependencies noted
5. Error handling for common issues (null keys, duplicate rows, type mismatches)
Include a brief DAG description showing the order of execution.
34. Dynamic Reporting Queries
Write parameterized SQL queries for a reporting dashboard.
Dashboard requirements:
- Users can filter by: {{list filter dimensions — date range, region, product, etc.}}
- Key metrics shown: {{list metrics — revenue, count, average, etc.}}
- Drill-down levels: {{e.g., region → state → city}}
- Comparison periods: {{current vs. previous period, YoY, etc.}}
For each dashboard panel, write a query that:
- Accepts filter parameters (use {{parameter_name}} syntax)
- Returns data in the format the visualization needs
- Is performant for interactive use (sub-second response)
- Handles edge cases (empty filters = all data, invalid dates, etc.)
Database: {{type}}
Tables: {{list with key columns}}
35. Data Reconciliation Queries
Write SQL queries to reconcile data between two systems:
System A ({{name}}): {{describe tables and key fields}}
System B ({{name}}): {{describe tables and key fields}}
Joining logic: {{how to match records between systems — shared keys, business rules}}
Generate queries that find:
1. Records in System A not in System B (and vice versa)
2. Records that exist in both but with different values (specify which fields to compare)
3. Aggregate reconciliation (total counts and sums match between systems)
4. Timing discrepancies (records that are in different states due to processing delays)
5. A summary report: matched count, unmatched count per system, discrepancy rate
For each discrepancy type, suggest the likely cause and resolution approach.
Python/R Analysis Prompts (5)
36. End-to-End Analysis Script
Write a complete Python analysis script for this dataset:
Dataset description: {{what the data is, how many rows, column names and types}}
Analysis goal: {{what question we're answering}}
Output needed: {{report? charts? model? all of the above?}}
The script should:
1. Load and validate the data (from {{CSV/database/API}})
2. Clean and preprocess (handle the specific issues in this dataset)
3. Perform exploratory analysis (key distributions, correlations, patterns)
4. Run the core analysis ({{statistical test / model / aggregation / whatever the goal requires}})
5. Generate visualizations (save as PNGs)
6. Output a text summary of findings
7. Log progress and handle errors gracefully
Structure as a well-organized script with functions, not a Jupyter notebook dump. Include a main() function and docstrings. Use: pandas, numpy, matplotlib, seaborn, scipy, scikit-learn as needed.
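The structure that prompt asks for looks roughly like this. A stripped-down sketch: the data is stubbed inline and `sales.csv` is a hypothetical path, but the shape (small functions, a `main()`, logging, errors that fail loudly) is the point.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("analysis")

def load_data(path):
    """Step 1: load and validate (stubbed with inline data here)."""
    return [{"region": "EMEA", "revenue": 100.0}, {"region": "AMER", "revenue": None}]

def clean(rows):
    """Step 2: drop rows with missing revenue."""
    return [r for r in rows if r["revenue"] is not None]

def analyze(rows):
    """Step 4: the core aggregation."""
    return {"total_revenue": sum(r["revenue"] for r in rows)}

def main():
    try:
        rows = clean(load_data("sales.csv"))
        findings = analyze(rows)
        log.info("findings: %s", findings)
        return findings
    except Exception:
        log.exception("analysis failed")  # step 7: log the full traceback
        raise

findings = main()
```

Each function maps to one numbered step, which makes the script easy to test and easy for the AI to extend.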
37. Automated Report Generator
Write a Python script that generates a {{weekly/monthly}} automated report.
Data source: {{CSV file path / database connection / API endpoint}}
Report format: {{PDF / HTML / Markdown / PowerPoint}}
Report sections:
1. {{section 1 with specific metrics}}
2. {{section 2 with specific metrics}}
3. {{section 3 with specific metrics}}
The script should:
- Pull fresh data from the source
- Calculate all metrics with period-over-period comparison
- Generate charts for each section
- Compile into the output format with proper formatting
- Handle edge cases (missing data, incomplete periods, data source errors)
- Be schedulable (can run via cron/Airflow/etc. without human intervention)
- Send the output to {{email / Slack / S3 / shared drive}}
Include a config section at the top for easy customization of date ranges, recipients, and thresholds.
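A sketch of two pieces the prompt calls out, the config block and the period-over-period comparison with its divide-by-zero edge case. The recipient address is a placeholder, and the output here is plain Markdown:

```python
# Config up top for easy customization, as the prompt requests
CONFIG = {
    "title": "Weekly Revenue Report",
    "recipients": ["team@example.com"],  # hypothetical
}

def pct_change(current, previous):
    """Period-over-period comparison, guarding the divide-by-zero edge case."""
    if previous == 0:
        return None
    return round(100.0 * (current - previous) / previous, 1)

def build_report(current, previous):
    change = pct_change(current, previous)
    delta = "n/a" if change is None else f"{change:+.1f}%"
    return "\n".join([
        f"# {CONFIG['title']}",
        f"- Revenue: {current:,.0f} ({delta} vs. prior period)",
    ])

report = build_report(current=120_000, previous=100_000)
```

Rendering Markdown first keeps the script format-agnostic; converting to PDF or HTML becomes a final step rather than a rewrite.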
38. Machine Learning Pipeline
Build a machine learning pipeline to predict {{target variable}} using this dataset:
Target: {{variable name, type (classification/regression), class balance if classification}}
Features available: {{list candidate features}}
Dataset size: {{rows × columns}}
Success metric: {{accuracy / AUC / RMSE / MAE / F1 / business-specific metric}}
Deployment context: {{batch prediction / real-time API / one-time analysis}}
The pipeline should:
1. Split into train/test sets, stratified if needed
2. Engineer features (create useful derived features)
3. Select features (remove low-value ones)
4. Compare models (test 3-4 appropriate algorithms)
5. Tune hyperparameters for the best model
6. Cross-validate for a robust performance estimate
7. Analyze feature importance
8. Run model diagnostics (confusion matrix, residual plots, calibration)
9. Save the trained model and preprocessing pipeline
10. Expose an inference function that takes raw input and returns predictions
Write clean, production-grade Python using scikit-learn pipelines. Not a Jupyter mess.
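The core of what that prompt should produce is a scikit-learn `Pipeline` that bundles preprocessing with the model. A minimal sketch on synthetic data, covering the stratified split, cross-validation, and the single saveable object:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified split (step 1) keeps class balance consistent across train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Bundling preprocessing and model prevents test-set leakage and gives you
# one object to save and reuse at inference time (step 9)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validation (step 6) on the training data only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

pipe.fit(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
```

A real version would add the model comparison and tuning steps, but the `Pipeline`-first structure is what keeps it production-grade rather than a Jupyter mess.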
39. Data Monitoring Script
Write a Python script that monitors data quality and sends alerts:
Dataset: {{describe the data and its source}}
Check frequency: {{how often to run}}
Alert channel: {{email / Slack / PagerDuty}}
Checks to implement:
1. Data freshness (alert if data hasn't updated in {{time threshold}})
2. Volume anomalies (row count outside {{expected range}})
3. Distribution drift (key metrics deviate from historical baseline by > {{threshold}}%)
4. Null rate spikes (nulls in critical columns exceed {{threshold}}%)
5. Schema changes (new/missing/renamed columns)
6. Value range violations (specific columns outside expected bounds)
7. Referential integrity (foreign key relationships still hold)
For each check:
- Log the result (pass/fail with details)
- Categorize severity (critical / warning / info)
- Include context in the alert (what changed, sample failing rows, historical comparison)
Structure as a reusable monitoring framework, not a one-off script.
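The "reusable framework" shape means each check is a small function returning a uniform result dict with a severity. A sketch of checks 1 and 4 with invented thresholds:

```python
from datetime import datetime, timedelta

def check_freshness(last_update, max_age=timedelta(hours=24), now=None):
    """Check 1: alert if the data hasn't updated within the threshold."""
    now = now or datetime.now()
    stale = now - last_update > max_age
    return {"check": "freshness", "passed": not stale,
            "severity": "critical" if stale else "info"}

def check_null_rate(values, column, threshold=0.1):
    """Check 4: alert if the null rate in a critical column spikes."""
    rate = sum(v is None for v in values) / len(values)
    return {"check": f"null_rate:{column}", "passed": rate <= threshold,
            "severity": "warning" if rate > threshold else "info",
            "null_rate": round(rate, 2)}

now = datetime(2024, 1, 2)
fresh = check_freshness(datetime(2024, 1, 1, 12), now=now)
nulls = check_null_rate([1, None, 3, None], "revenue", threshold=0.1)
```

Because every check returns the same shape, the alerting layer can route by severity without knowing anything about individual checks.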
40. Interactive Analysis Notebook Template
Create a well-structured Jupyter notebook template for {{type of analysis — e.g., "customer churn analysis" or "A/B test evaluation"}}.
The notebook should have:
Section 1: Setup and Configuration
- Import libraries
- Configuration variables (file paths, parameters, thresholds) in one cell
- Data loading with validation
Section 2: Data Overview
- Shape, types, sample rows
- Missing value summary
- Basic statistics
Section 3: Data Cleaning
- Document each cleaning step with reasoning
- Before/after row counts
Section 4: Exploratory Analysis
- Key distributions and relationships
- Guided exploration toward the analysis goal
Section 5: Core Analysis
- The main statistical/ML analysis
- Clear interpretation of results
Section 6: Findings and Recommendations
- Markdown summary of key findings
- Recommended actions
- Limitations and next steps
Include markdown cells with instructions for the analyst using the template. Make cells atomic — each cell does one thing and can be rerun independently.
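The Section 1 convention, all configuration in one cell, then atomic rerunnable cells, might look like this for a churn analysis. The path and threshold are hypothetical, and the loader is stubbed so the sketch is self-contained:

```python
# --- Cell 1: configuration (everything an analyst might change lives here) ---
CONFIG = {
    "data_path": "data/churn.csv",  # hypothetical path
    "date_range": ("2024-01-01", "2024-03-31"),
    "churn_threshold_days": 30,
}

# --- Cell 2: data loading with validation ---
def load(config):
    # a real notebook would read config["data_path"]; stubbed here
    rows = [{"customer": "a", "days_inactive": 45}]
    assert rows, "loaded zero rows - check data_path"
    return rows

rows = load(CONFIG)
```

Keeping every tunable value in `CONFIG` means rerunning the whole notebook with a new date range is a one-cell edit.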
Getting Better Results From Data Prompts
Three patterns separate useful AI data analysis from garbage output:
Describe your data precisely. Column names, data types, row counts, and sample rows. The AI can't analyze what it can't see. If you're working with sensitive data, anonymize a subset and paste it in rather than describing it abstractly.
State the business question, not just the technical task. "What's the correlation between X and Y?" produces a number. "We're trying to decide whether to invest in X — does Y data support that decision?" produces an answer you can act on.
Chain your analysis. Use the data cleaning output as input for exploration. Use exploration findings to inform which statistical tests to run. See our prompt chaining guide for the complete technique.
These 40 prompts cover the data analysis pipeline from dirty CSV to boardroom presentation. Pick the ones that match your current bottleneck, fill in the placeholders, and turn your data into decisions.