justuse.me
Use CaseApril 17, 2026

How to Generate Fake Test Data for Database Seeding Without Breaking Privacy Rules

Learn how to create realistic sample data for testing and development without using real customer information or violating privacy regulations.

How to Generate Fake Test Data for Database Seeding Without Breaking Privacy Rules

Last month I was building a dashboard feature and needed 500 user records to test the pagination. I could have copied production data, but that's a terrible idea for about a dozen legal reasons. I needed fake data that looked real enough to catch edge cases but was completely synthetic.

This is one of those problems that seems simple until you actually try to solve it. You need names that don't look like "User123", emails that follow proper formats, addresses that could exist, and phone numbers that pass validation. And you need hundreds or thousands of them.

Why can't I just use production data for testing?

Because GDPR, CCPA, and basically every privacy regulation says you can't. Even if you strip out names, combining zip codes, birth dates, and gender can re-identify people. I've seen companies get fined for this exact mistake.

Beyond legal issues, production data has problems. It's messy, it has gaps, and it doesn't cover edge cases. You need a user with a 50-character last name to test your form validation? Production probably doesn't have one. You need to test how your app handles 10,000 orders? You can't just copy real customer orders.

What makes good fake test data?

Good fake data passes validation but fails reality checks. An email like "john.smith.8473@example.com" will get through your regex, but "john.smith@gmail.com" looks more realistic for UI testing.

I've found you need three levels of fake data depending on what you're testing:

Quick and dirty: Sequential data like "User 1", "User 2". Fine for unit tests where you just need something in the database. Takes 30 seconds to generate with a loop.

Realistic looking: Proper names, valid email formats, real-looking addresses. This is what you need for integration tests and demos. Most fake data generators live here.

Statistically accurate: Data that matches real-world distributions. If 60% of your users are in three states, your test data should reflect that. This matters for performance testing and analytics validation.

Most developers stop at level two, but level three catches interesting bugs. I once found a query that worked fine with evenly distributed data but crawled when 40% of records had the same zip code, which matched our actual user distribution.

How do I generate fake data quickly?

The fastest way is using a browser-based generator. I built Fake Data Generator for exactly this, it runs entirely in your browser and can create thousands of records in seconds. You pick the fields you need (name, email, phone, address, company, whatever), set how many records, and download as JSON or CSV.

No signup, no upload, no server processing your data. This matters more than it sounds like. I've used services where you upload a schema and they generate data server-side, but that means your database structure is leaving your network. For side projects it's fine, but try explaining that to your security team at work.

Mockaroo is probably the most popular alternative. It has more field types and can generate up to 1,000 rows for free. The paid version goes up to 100,000 rows and has API access, which is useful if you're seeding databases as part of your CI/CD pipeline.

Faker.js is the go-to if you're writing code. It's a JavaScript library with methods for generating every type of fake data you can think of. The downside is you have to write the script yourself. For one-off data generation, a web tool is faster. For automated seeding in your test suite, Faker.js makes more sense.

What fields should I include in my test data?

Start with what your database actually stores. If you have a users table with first_name, last_name, email, and created_at, generate exactly those fields. Don't add extra stuff you don't need.

Then think about edge cases. I always include:

  • Names with apostrophes (O'Brien) and hyphens (Smith-Jones)
  • Very long names (some cultures have 4-5 part surnames)
  • Email addresses with plus signs (user+test@example.com)
  • International phone numbers, not just US format
  • Addresses with apartment numbers and suite numbers
  • Special characters in company names (AT&T, H&M)

These catch validation bugs that clean data misses. I found a form that broke on any name with an apostrophe because someone wrote a SQL query without proper escaping. Test data with "O'Brien" would have caught it immediately.

How much test data do I actually need?

For basic functionality testing, 50-100 records is usually enough. You're checking that things work, not stressing the system.

For performance testing, you need to match production scale. If you have 100,000 users in production, test with at least that many. I've seen queries that worked fine with 1,000 rows but timed out with 50,000 because someone forgot to add an index.

One specific number I use: generate at least 3x the page size for any paginated view. If you show 20 items per page, have at least 60 records. This lets you test first page, middle page, and last page behavior.

Can I generate data that matches my production patterns?

Yes, and you should for certain types of testing. If 80% of your orders come from five states, your test data should reflect that. If most users sign up on mobile, generate more mobile-formatted data.

The trick is analyzing your production data to find the patterns, then configuring your generator to match them. Most tools let you set probability distributions. You can say "60% of users should have gmail addresses, 20% yahoo, 20% other" or "generate zip codes weighted by population density."

This is overkill for feature development but critical for load testing and analytics validation. I once debugged a slow dashboard that turned out to be fine with evenly distributed data but slow with realistic clustering. The production database had 40% of records in one category, which changed the query plan completely.

What about generating related data across tables?

This is where it gets tricky. If you have users, orders, and order_items tables, you need foreign keys that actually match up. Generating each table independently gives you orphaned records.

Some tools handle this automatically. You define the relationships and they generate data that respects foreign key constraints. Others require you to generate in order, users first, then orders referencing those user IDs, then order items referencing those order IDs.

I usually generate the parent table first, export it, then use those IDs when generating child tables. It's manual but it works. For complex schemas with 10+ related tables, you might want to write a script with Faker.js instead of clicking through a web interface.

Should I commit test data to my repository?

For small datasets (under 1MB), yes. Having a fixtures file with 50-100 sample records makes it easy for new developers to get started. They clone the repo, run the seed script, and have working data immediately.

For large datasets, no. Committing 100,000 records bloats your repository and slows down clones. Instead, commit the script that generates the data. New developers run the script locally to create their test database.

I keep a seeds.sql file in my repo with about 50 records covering common scenarios and edge cases. That's enough for development. For performance testing, I have a separate script that generates 100,000 records on demand.

The key is making it reproducible. If you generate random data every time, tests become flaky. Use a fixed seed value so the "random" data is actually the same every time. Most generators let you set a seed number for exactly this reason.