Generating fake data based on sampled data

I have a data warehouse that needs to import CSV files from numerous sources. Each of these CSV files has a different specification for the fields we expect to see, as well as different occurrences of “dirty data” that don’t always match the spec.

I would like to automatically generate fake import data based on a sampling of past CSV files, including the dirty data. My idea is to iterate over all of the CSV fields, start with the most specific data generator spec, and iteratively make the spec more generic until it covers every sample of past data (roughly the sketch below). I’m assuming I’d also want the ability to override some of the results with specific fake data generators, such as first and last names, to make the data more readable.
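
Roughly, I picture the “start specific, then generalize” step looking something like this (a quick sketch; the module and spec names are made up, not from any library):

```elixir
defmodule SpecInference do
  # Candidate specs, ordered from most specific to most generic.
  # :string is the catch-all that accepts any sample.
  @specs [:integer, :float, :date, :string]

  # Walk the ordered specs and return the first one that matches
  # every sampled value (CSV cells arrive as strings) for the field.
  def infer(samples) do
    Enum.find(@specs, fn spec ->
      Enum.all?(samples, &matches?(spec, &1))
    end)
  end

  defp matches?(:integer, value), do: match?({_, ""}, Integer.parse(value))
  defp matches?(:float, value), do: match?({_, ""}, Float.parse(value))
  defp matches?(:date, value), do: match?({:ok, _}, Date.from_iso8601(value))
  defp matches?(:string, _value), do: true
end

SpecInference.infer(["12", "7", "300"])     # => :integer
SpecInference.infer(["12", "7.5", "oops"])  # => :string
```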

I know there are fake data generators such as https://github.com/igas/faker but is anyone aware of something that can use sampled data to decide what kind of data we should generate as a starting point?

Thanks.

Honestly this might be something I’d write a neural net to generate data from, just for the learning opportunity… ^.^;

What is the dirty data like, and in what format? There are a few fuzzers out there for Elixir/Erlang that might work well, depending on what it looks like.

Dirty data could include enum values we didn’t plan on seeing or poorly formatted phone numbers, for simple examples. This feels like something that would be generally useful in data pipeline testing, so I was hoping someone had already addressed it. :slight_smile:

Ah, so just bad cells, not a bad CSV format at all?

I’ve not seen anything myself that does precisely that; it’s kind of the opposite of what I normally do (test every possible valid value and ignore invalid ones, beyond checking that a few crash properly). That doesn’t mean it doesn’t exist, though. If it doesn’t, however, we’d all love a library for it! :slight_smile:

Yeah, I’m only talking about some bad fields but a valid CSV file. The ETL process needs to be robust to deal with some amount of data dirtiness, but once we clean it up we’re more strict about what goes into the warehouse.

Thanks for the response.

Honestly I’d probably just parse each field as I’d expect it and add fall-through cases for various bad data, ending in a final fall-through case that logs the field, what it should have been, and a TODO to remind me to handle that input, something like the sketch below… ^.^;
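
Something like this for, say, a phone-number field (a rough sketch; none of this is from a library, and the shapes are invented):

```elixir
defmodule PhoneField do
  require Logger

  # Expected shape first, then known dirty shapes, then give up.
  def parse(value) when is_binary(value) do
    cond do
      # Happy path: exactly ten digits, as the spec expects.
      value =~ ~r/^\d{10}$/ ->
        {:ok, value}

      # Known dirty shape: ten digits with separators mixed in.
      strip(value) =~ ~r/^\d{10}$/ ->
        {:ok, strip(value)}

      # Final fall-through: log what we saw and what we expected,
      # plus a TODO to add a branch for this input later.
      true ->
        Logger.warning("TODO: expected a 10-digit phone, got #{inspect(value)}")
        {:error, :unrecognized}
    end
  end

  defp strip(value), do: String.replace(value, ~r/[\s().-]/, "")
end
```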

Property-based testing libraries all involve generating data, and most let you provide some goal posts to generate within. It’s certainly not a silver bullet, but it might be a good start.

I know there’s a certain pragmatic guy who has created an Elixir property-based testing library that’s definitely worth a look :smile:
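
For instance, the goal posts can be set per field. A sketch with StreamData (which may or may not be the library meant above, but shows the idea; the field specs here are invented):

```elixir
# A generator for whole rows, constrained to each field's spec.
row =
  StreamData.fixed_map(%{
    "id" => StreamData.integer(1..1_000_000),
    "status" => StreamData.member_of(["active", "inactive", "pending"]),
    "phone" => StreamData.string(?0..?9, length: 10)
  })

# Generators are enumerable, so sampling a few rows is just:
Enum.take(row, 3)
# => e.g. [%{"id" => 42, "phone" => "0481726395", "status" => "active"}, ...]
```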

Yeah, those cover the usual success-path testing, but I’m unsure of a good way to have them generate ‘corrupt’ data as well…
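
One rough idea might be to bias a frequency-weighted generator mostly toward valid values and occasionally let something off-spec through (again a StreamData sketch; the 9:1 weighting is arbitrary):

```elixir
valid_status = StreamData.member_of(["active", "inactive", "pending"])

dirty_status =
  StreamData.frequency([
    # Mostly valid values...
    {9, valid_status},
    # ...but occasionally an enum value the spec never promised.
    {1, StreamData.string(:alphanumeric, min_length: 1, max_length: 12)}
  ])

Enum.take(dirty_status, 10)
# => e.g. ["active", "pending", "active", "Zq3", "inactive", ...]
```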

My attempt at a data generator:
