Question:

My source is JSON with nested arrays & structures, with a large volume of new data streaming in real-time (20m/day). I have to decide how to store this data, considering that end users want to use 'traditional' SQL.

As far as I can see, my options are to make use of the SUPER data type, or just convert everything to traditional relational tables & types. (Even if I store the full JSON as a super, I still have to serialize critical attributes into regular columns for the purposes of Distribution/Sort.) Regardless, I have been trying to weigh up the pros & cons of super vs. 'traditional':

(1) SUPER data type
- Very easy to ingest data, with low load on the cluster
- Maybe an additional load on the cluster & a performance impact to execute end user queries?
- End users would have to learn PartiQL and deal with unnesting & serialization etc.

(2) 'Traditional' relational tables & types
(a) Load as super, but then use PartiQL to unnest, serialize and store in relational tables
- Would result in some massive tables for the 'tag' nodes
(b) Use Lambda to pre-unnest & serialize the JSON, and insert/copy directly into relational tables
- No load on the cluster to ingest/process incoming data
- Cost/maintenance of the Lambda/other process to transform the JSON
- If I am converting to a relational structure (2b), I could simply store it in S3 and utilize Redshift Spectrum to query it

Is there a 'standard' for this scenario?

Answer:

You will not want to leave things as a monolithic json. Any data that will be repeatedly queried in analytics will want to be its own column. Any data that will be commonly used in a where clause, group by, partition, join condition etc. will likely need to be its own column. The database work to expand the json at ingestion will be dwarfed by the work to repeatedly expand it for every query.

I'd expect any data that is common to 90% of the json elements will want to be in unique columns. Json elements that are rare, unique, or of little analytic interest can be kept in super columns that hold just these subset parts of the json. The data size increase will be less than you think; Redshift is good at compressing columns. The ingestion load is unlikely to be a major concern, but if it is, the Lambda approach is a reasonable way to extend the compute resources to address it.

A hazard you will face is that users will only reference the json and not the extracted columns, and re-extracting the same data repeatedly costs. I'd consider NOT keeping the entire json in the main fact tables, only the json pieces that represent data not otherwise in columns. Keeping the original jsons in a separate table keyed with an identity column will allow joining if some need arises, but the goal will be to not need to do this. I really don't expect this, but if needed it can be folded easily into the existing ETL processes.

Spectrum does not look like a good fit for this use case. Spectrum does well when the compute elements in S3 can apply the first-level where clauses and simple aggregations. I just don't see Spectrum working on this data, so it will just send the entire json data to Redshift repeatedly, which will make things slow and tie up a ton of network bandwidth. That said, storing the full original json, keyed with the identity column, in S3, and having all the expanded columns plus left-over json elements in a native Redshift table, does make sense.
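To make the answer's layout concrete, here is a minimal sketch of the hybrid design it describes: common fields extracted into typed columns, the rare/left-over json pieces (such as the 'tag' nodes) kept in a SUPER column, and the untouched original json in a separate identity-keyed table. All names here (events, events_raw, event_ts, user_id, event_type, tags) and the distribution/sort choices are hypothetical placeholders, not from the original post.

    -- Original json kept whole, off to the side, joinable if ever needed.
    CREATE TABLE events_raw (
        event_id BIGINT IDENTITY(1,1),  -- identity key for the raw json
        raw_json SUPER                  -- the untouched original document
    );

    -- Main fact table: hot fields as typed columns; only the left-over
    -- json piece (here, the tags array) stays in a SUPER column.
    CREATE TABLE events (
        event_id   BIGINT    NOT NULL,  -- carried over from events_raw by the ETL
        event_ts   TIMESTAMP NOT NULL,  -- common field -> its own column
        user_id    VARCHAR(64),         -- used in joins / group by
        event_type VARCHAR(32),         -- used in where clauses
        tags       SUPER                -- left-over json piece only
    )
    DISTKEY (user_id)
    SORTKEY (event_ts);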
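And a sketch of the ingestion and query side under the same assumptions. The COPY uses Redshift's FORMAT JSON 'noshred' option to land each document whole in the single SUPER column; the INSERT ... SELECT does the expansion once, at ingestion, as the answer recommends; and the final query shows the PartiQL unnesting end users would face for anything left in SUPER (the learning-curve concern in option 2a). The S3 path, the IAM role, the double cast through VARCHAR for the timestamp, and the tag objects having a 'name' field are all assumptions.

    -- Land each incoming document whole in the SUPER column;
    -- the identity column is populated automatically.
    COPY events_raw
    FROM 's3://my-bucket/incoming/'   -- hypothetical path
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load'
    FORMAT JSON 'noshred';

    -- Expand once at ingestion: common fields into typed columns,
    -- only the left-over piece (tags) stays as SUPER.
    INSERT INTO events
    SELECT r.event_id,
           (r.raw_json.event_ts)::VARCHAR::TIMESTAMP,
           (r.raw_json.user_id)::VARCHAR(64),
           (r.raw_json.event_type)::VARCHAR(32),
           r.raw_json.tags
    FROM events_raw r;

    -- What end users would write only when they need the tags array:
    -- PartiQL unnests a SUPER array by iterating it in the FROM clause.
    SELECT e.user_id, t.name::VARCHAR AS tag_name
    FROM events e, e.tags t
    WHERE e.event_ts >= '2024-01-01';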