Generating fake data based on sampled data

I have a data warehouse that needs to import CSV files from numerous sources. Each of these CSV files has a different specification for the fields we expect to see, as well as different occurrences of “dirty data” that don’t always match the spec.

I would like to automatically generate fake import data based on a sampling of past CSV files, including the dirty data. My idea is to iterate over all of the CSV fields, start with the most specific data generator spec, and iteratively make the spec more generic until it covers every sample of past data (roughly the sketch below). I’m assuming I’d also want the ability to override some of the results with specific fake data generators, such as first and last names, to make the data more readable.
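
Roughly, I picture the “start specific, then generalize” step looking something like this (a quick sketch; the module and spec names are made up, not from any library):

```elixir
defmodule SpecInference do
  # Candidate specs, ordered from most specific to most generic.
  # :string is the catch-all that accepts any sample.
  @specs [:integer, :float, :date, :string]

  # Walk the ordered specs and return the first one that matches
  # every sampled value (CSV cells arrive as strings) for the field.
  def infer(samples) do
    Enum.find(@specs, fn spec ->
      Enum.all?(samples, &matches?(spec, &1))
    end)
  end

  defp matches?(:integer, value), do: match?({_, ""}, Integer.parse(value))
  defp matches?(:float, value), do: match?({_, ""}, Float.parse(value))
  defp matches?(:date, value), do: match?({:ok, _}, Date.from_iso8601(value))
  defp matches?(:string, _value), do: true
end

SpecInference.infer(["12", "7", "300"])     # => :integer
SpecInference.infer(["12", "7.5", "oops"])  # => :string
```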

I know there are fake data generators such as https://github.com/igas/faker but is anyone aware of something that can use sampled data to decide what kind of data we should generate as a starting point?

Thanks.

Honestly this might be something I’d write a neural net to generate data from, just for the learning opportunity… ^.^;

What is the dirty data like, and in what format? There are a few fuzzers out there for Elixir/Erlang that might work well, depending on what it looks like.

Dirty data could include enum values we didn’t plan on seeing or poorly formatted phone numbers, for simple examples. This feels like something that would be generally useful in data pipeline testing, so I was hoping someone had already addressed it. :slight_smile:

Ah, so just bad cells, not a bad CSV format at all?

I’ve not seen anything myself that does precisely that; it’s kind of the opposite of what I normally do (test every possible valid value and ignore invalid ones, beyond checking that a few crash properly). That doesn’t mean it doesn’t exist, though. If it doesn’t, however, we’d all love a library for it! :slight_smile:

Yeah, I’m only talking about some bad fields but a valid CSV file. The ETL process needs to be robust to deal with some amount of data dirtiness, but once we clean it up we’re more strict about what goes into the warehouse.

Thanks for the response.

Honestly I’d probably just parse each field as I’d expect it and add fall-through cases for various bad data, ending in a final fall-through case that logs the field, what it should have been, and a TODO to remind me to handle that input, something like the sketch below… ^.^;
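
Something like this for, say, a phone-number field (a rough sketch; none of this is from a library, and the shapes are invented):

```elixir
defmodule PhoneField do
  require Logger

  # Expected shape first, then known dirty shapes, then give up.
  def parse(value) when is_binary(value) do
    cond do
      # Happy path: exactly ten digits, as the spec expects.
      value =~ ~r/^\d{10}$/ ->
        {:ok, value}

      # Known dirty shape: ten digits with separators mixed in.
      strip(value) =~ ~r/^\d{10}$/ ->
        {:ok, strip(value)}

      # Final fall-through: log what we saw and what we expected,
      # plus a TODO to add a branch for this input later.
      true ->
        Logger.warning("TODO: expected a 10-digit phone, got #{inspect(value)}")
        {:error, :unrecognized}
    end
  end

  defp strip(value), do: String.replace(value, ~r/[\s().-]/, "")
end
```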

Property-based testing libraries all involve generating data, and most let you provide some goal posts to generate within. It’s certainly not a silver bullet, but it might be a good start.

I know there’s a certain pragmatic guy who has created an Elixir property-based testing library that’s definitely worth a look :smile:
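
For instance, the goal posts can be set per field. A sketch with StreamData (which may or may not be the library meant above, but shows the idea; the field specs here are invented):

```elixir
# A generator for whole rows, constrained to each field's spec.
row =
  StreamData.fixed_map(%{
    "id" => StreamData.integer(1..1_000_000),
    "status" => StreamData.member_of(["active", "inactive", "pending"]),
    "phone" => StreamData.string(?0..?9, length: 10)
  })

# Generators are enumerable, so sampling a few rows is just:
Enum.take(row, 3)
# => e.g. [%{"id" => 42, "phone" => "0481726395", "status" => "active"}, ...]
```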

Yeah, those cover the usual success-path testing, but I’m unsure of a good way to have them generate ‘corrupt’ data as well…
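
One rough idea might be to bias a frequency-weighted generator mostly toward valid values and occasionally let something off-spec through (again a StreamData sketch; the 9:1 weighting is arbitrary):

```elixir
valid_status = StreamData.member_of(["active", "inactive", "pending"])

dirty_status =
  StreamData.frequency([
    # Mostly valid values...
    {9, valid_status},
    # ...but occasionally an enum value the spec never promised.
    {1, StreamData.string(:alphanumeric, min_length: 1, max_length: 12)}
  ])

Enum.take(dirty_status, 10)
# => e.g. ["active", "pending", "active", "Zq3", "inactive", ...]
```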

My attempt at a data generator:
