Merging BigQuery Schemas

I’m trying to basically merge lists in nested maps. There may be a field of type RECORD, for those I need to look at the fields and merge them with the corresponding fields from both schemas and there may be more fields of type RECORD in those which also need to be merged. Everything in the new schema needs to be added to the old schema, as you cannot remove columns in Big Query.

Not really sure where to start, so any bit of direction would be great! Thanks!

Old schema:

%GoogleApi.BigQuery.V2.Model.TableSchema{
  fields: [
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: nil,
      mode: "REQUIRED",
      name: "timestamp",
      type: "TIMESTAMP"
    },
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: nil,
      mode: "NULLABLE",
      name: "event_message",
      type: "STRING"
    },
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: [
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: nil,
          mode: "NULLABLE",
          name: "ip_address",
          type: "STRING"
        },
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: nil,
          mode: "NULLABLE",
          name: "request_method",
          type: "STRING"
        },
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: [
            %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
              description: nil,
              fields: nil,
              mode: "NULLABLE",
              name: "server",
              type: "STRING"
            }
          ],
          mode: "NULLABLE",
          name: "request_headers",
          type: "RECORD"
        }
      ],
      mode: "REPEATED",
      name: "metadata",
      type: "RECORD"
    }
  ]
}

New schema:

%GoogleApi.BigQuery.V2.Model.TableSchema{
  fields: [
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: nil,
      mode: "REQUIRED",
      name: "timestamp",
      type: "TIMESTAMP"
    },
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: nil,
      mode: "NULLABLE",
      name: "event_message",
      type: "STRING"
    },
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: [
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: nil,
          mode: "NULLABLE",
          name: "ip_address",
          type: "STRING"
        },
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: [
            %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
              description: nil,
              fields: nil,
              mode: "NULLABLE",
              name: "server",
              type: "STRING"
            },
            %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
              description: nil,
              fields: nil,
              mode: "NULLABLE",
              name: "user_agent",
              type: "STRING"
            }
          ],
          mode: "NULLABLE",
          name: "request_headers",
          type: "RECORD"
        }
      ],
      mode: "REPEATED",
      name: "metadata",
      type: "RECORD"
    }
  ]
}

Merged schemas:

%GoogleApi.BigQuery.V2.Model.TableSchema{
  fields: [
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: nil,
      mode: "REQUIRED",
      name: "timestamp",
      type: "TIMESTAMP"
    },
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: nil,
      mode: "NULLABLE",
      name: "event_message",
      type: "STRING"
    },
    %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
      description: nil,
      fields: [
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: nil,
          mode: "NULLABLE",
          name: "ip_address",
          type: "STRING"
        },
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: nil,
          mode: "NULLABLE",
          name: "request_method",
          type: "STRING"
        },
        %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
          description: nil,
          fields: [
            %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
              description: nil,
              fields: nil,
              mode: "NULLABLE",
              name: "server",
              type: "STRING"
            },
            %GoogleApi.BigQuery.V2.Model.TableFieldSchema{
              description: nil,
              fields: nil,
              mode: "NULLABLE",
              name: "user_agent",
              type: "STRING"
            }
          ],
          mode: "NULLABLE",
          name: "request_headers",
          type: "RECORD"
        }
      ],
      mode: "REPEATED",
      name: "metadata",
      type: "RECORD"
    }
  ]
}
2 Likes

A couple of observations from the top of my head (you may already know):

  • when the field Schema has the mode REPEATED, expect a list of values as defined by the type
  • when the type is RECORD, expect another serie of fields (this is recursive)

What we do in practice is to build a base map (where the key is the field name), based on the TableSchema. Additionally, every time we encounter a RECORD we reuse the same logic recursively.
Note that at the top level, bq’s elixir wrapper keeps atoms, but within RECORD it’s strings (:v becomes “v”, :f becomes “f”).

Now for merging both schemas, after parsing both schemas into maps, we can deep merge both, i.e instead of using Map.merge, use recursively merge both parsed structs as there might be internal maps (the RECORD).

Any code you can share? I’m struggling a bit to wrap my head around this.

1 Like