BigQuery: how to preserve nested data in derived tables? - python

I have a few large hourly upload tables with RECORD field types. I want to pull selected records out of those tables and put them into daily per-customer tables. The trouble I'm running into is that using a query to do this seems to flatten the data out.
Is there some way to preserve the nested RECORDs, or do I need to rethink my approach?
If it helps, I'm using the Python API.

It is now possible to preserve nested field structure in query results: use the flatten_results flag in the bq utility.
--[no]flatten_results: Whether to flatten nested and repeated fields in the result schema. If not set, the default behavior is to flatten.
API documentation:
https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.query.flattenResults
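
Since the question mentions the Python API, here is a minimal sketch of the same setting using the google-cloud-bigquery client (the project, dataset, and table names are hypothetical). Note that flatten_results only applies to legacy SQL, and an explicit destination table plus allow_large_results are required for it to take effect:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table for the unflattened results.
dest = bigquery.DatasetReference(client.project, "daily_exports").table("customer_20240101")

job_config = bigquery.QueryJobConfig(
    use_legacy_sql=True,       # flatten_results applies to legacy SQL only
    flatten_results=False,     # preserve nested/repeated RECORD fields
    allow_large_results=True,  # required when writing unflattened results
    destination=dest,
)

# Hypothetical legacy-SQL query over the hourly upload table.
query = "SELECT record_field FROM [mydataset.hourly_uploads]"
client.query(query, job_config=job_config).result()  # wait for the job to finish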

Unfortunately, there isn't a way to do this right now, since, as you realized, all results are flattened.

Related

Relational DB - separate joined tables

Is there any way to join tables from a relational database and then separate them again?
I'm working on a project that involves modifying the data after having joined it. Unfortunately, modifying the data before the join is not an option. I would then want to separate the data according to the original schema.
I'm stuck at the separating part. I have metadata (a Python dictionary) with information on the tables (primary keys, foreign keys, fields, etc.).
I'm working with Python, so a Python solution would be greatly appreciated. An SQL solution would also help.
Edit: Perhaps the question was unclear. To summarize, I would like to create a new database with a schema identical to the old one's. I do not want to make any modifications to the original database. The data that makes up the new database must originally be in a single table (the result of a join of the old tables), as the operations that need to be run must run on a single table; I cannot run them on the individual tables, as the outcome would not be as desired.
I would like to know if this is possible and, if so, how I can achieve it.
Thanks!
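
A minimal sketch of the separating step, assuming the joined result is available as a pandas DataFrame and the metadata dictionary maps each original table to its columns and primary key (all names here are illustrative, not from the original schema):

import pandas as pd

# Hypothetical metadata describing the original schema.
schema = {
    "customers": {"columns": ["customer_id", "name"], "pk": ["customer_id"]},
    "orders": {"columns": ["order_id", "customer_id", "total"], "pk": ["order_id"]},
}

def split_joined(joined, schema):
    """Project the joined table back onto each original table's columns,
    dropping the duplicate rows that the join introduced."""
    tables = {}
    for name, meta in schema.items():
        tables[name] = (
            joined[meta["columns"]]
            .drop_duplicates(subset=meta["pk"])
            .reset_index(drop=True)
        )
    return tables

# Each recovered table can then be written into the new database, e.g. with
# DataFrame.to_sql(name, new_engine, if_exists="replace", index=False).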

In SQLAlchemy, which is better: fetching the data in descending order, or reverse()-ing the list of fetched data?

I have to fetch data in descending order. So which would be faster: fetching the data in descending order, or calling reverse() on the list of fetched data?
Note: I'm using SQLAlchemy with the Flask framework. My application has to fetch hundreds of rows from MySQL.
It depends on whether the SQL table is indexed on the column you are sorting by.
If it is, let the query do the sorting. If it is not, it depends more on how well the sort parallelizes in the SQL engine versus your Python code. With just hundreds of rows, there wouldn't be a significant performance difference between the two approaches even if the table is not indexed on that column.
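
For reference, a minimal sketch of the two approaches with Flask-SQLAlchemy (the model and column names are hypothetical):

from myapp.models import Message  # hypothetical Flask-SQLAlchemy model

# Option 1: let the database sort (can use an index on created_at, if one exists).
newest_first = Message.query.order_by(Message.created_at.desc()).all()

# Option 2: fetch in ascending order and reverse the list in Python.
rows = Message.query.order_by(Message.created_at.asc()).all()
rows.reverse()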

How to change the structure of tables in multiple data dictionaries across an Advantage Database Server?

We have the same table structures in multiple data dictionaries on Advantage Database Server version 10. If the structure of a table changes in one dictionary, we have to make the same change in every other data dictionary manually. Is there any way to automate this with a program or tool, so that the changes can be applied across multiple data dictionaries in a single process? Please help on this subject.

Best way of saving JSON data from Google Analytics into relational DB

I'm looking for the most effective way of loading Google Analytics data, which is represented as JSON files with a nested object structure, into a relational database in parallel, in order to collect and analyze these statistics later.
I've found pandas.io.json.json_normalize, which can flatten nested data into a flat structure; there is also a PySpark solution that converts JSON to a DataFrame, as described here, but I'm not sure about its performance.
Can you describe the best ways of loading data from the Google Analytics API into an RDBMS?
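
For illustration, a minimal sketch of what json_normalize does with a GA-style nested record (the field names are invented; in recent pandas versions the function lives at pandas.json_normalize):

import pandas as pd

# Hypothetical Google Analytics-style nested records.
records = [
    {
        "id": "session-1",
        "device": {"category": "mobile", "browser": "Chrome"},
        "hits": [{"page": "/home"}, {"page": "/pricing"}],
    },
]

# Flatten the nested "device" object into dotted columns,
# e.g. device.category and device.browser.
flat = pd.json_normalize(records)
print(flat.columns.tolist())

# Explode the repeated "hits" array into one row per hit, keeping the id.
hits = pd.json_normalize(records, record_path="hits", meta=["id"])
print(hits)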
I think this question can best be answered when we have more context about what data you want to consume and how you'll be consuming it. For example, if you will consume only a few of all the available fields, then it makes sense to store only those; or if you'll use some specific field as an index, then maybe we can index that field as well.
One thing that I can recall off the top of my head is the JSON type in Postgres, as it's built in and has several helper functions for doing operations on it later on.
References:
https://www.postgresql.org/docs/9.3/datatype-json.html
https://www.postgresql.org/docs/9.3/functions-json.html
If you can update here with what decision you take, it would be great to know.
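
As a sketch of what the Postgres JSON route looks like from Python (the connection string, table, and key names are hypothetical; jsonb is the indexed binary variant of the json type):

import json
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ga_sessions (
            id      serial PRIMARY KEY,
            payload jsonb NOT NULL
        )
    """)
    cur.execute(
        "INSERT INTO ga_sessions (payload) VALUES (%s)",
        [json.dumps({"device": {"category": "mobile"}})],
    )
    # -> keeps the result as json for further traversal; ->> extracts text.
    cur.execute("SELECT payload -> 'device' ->> 'category' FROM ga_sessions")
    print(cur.fetchall())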

Django: storing/querying a dictionary-like data set?

I apologize if this has been asked already, or if this is answered somewhere else.
Anyways, I'm working on a project that, in short, stores image metadata and then allows the user to search said metadata (which resembles a long list of key-value pairs). This wouldn't be too big of an issue if the metadata were standardized. However, the problem is that any given image in the database can have any number of key/value pairs in its metadata, and there is no standard list of keys.
Basically, I need to find a way to store a dictionary for each model, but with arbitrary key/value pairs. And I need to be able to query them. And the organization I'm working for is planning on uploading thousands of images to this program, so it has to query reasonably fast.
I have one model in my database, an image model, with a filefield.
So, I'm in between two options, and I could really use some help from people with more experience on choosing the best one (or any other solutions that would work better)
Using a traditional relational database like MySQL, and creating a separate model with a ForeignKey to the image model, a key field, and a value field. Then, when I need to query the data, I'll fetch every row of this separate table that relates to an image, and query those rows for the key/value combination I need.
Using something like MongoDB, with django-toolbox and its DictField to store the metadata. Then, when I need to query, I'll access the dict and search it for the key/value combination I need.
While I feel like option 1 would be much better in terms of query time, each image may have up to 40 key/value pairs of metadata, and that makes me worry about that separate "dictionary" table growing far too large if there are thousands of images.
Any advice would be much appreciated. Thanks!
What's the type of the metadata? Are both key and value strings? I'll assume that's the case.
The scale of your dataset matters. If you will have up to thousands of images and each image has up to 40 key-value pairs, then in option 1 the separate table would have at most 400k records. That's no problem for a modern database, as long as you have a decent machine and correct DB settings. One thing to take care of is a composite index on the fields of that table. In the Django ORM, it would be something like:
class ImageMeta(models.Model):
    image = models.ForeignKey('Image')
    key = models.CharField(max_length=XXXX)
    value = models.CharField(max_length=XXXX)

    class Meta:
        index_together = [["image", "key", "value"]]  # Django 1.5 and above
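
Querying for images that carry a given pair then looks something like this (the key and value are hypothetical; putting both conditions in the same filter() call constrains them to the same metadata row):

# Images whose metadata contains the pair camera=Nikon D90.
images = Image.objects.filter(
    imagemeta__key="camera",
    imagemeta__value="Nikon D90",
).distinct()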
In a Django project you've got 4 alternatives for this kind of problem, in no particular order:
using PostgreSQL, you can use the hstore field type, which is basically a pickled Python dictionary. It's not very helpful in terms of querying, but it does its job of saving your data.
using Django-NoRel with MongoDB, you get the ListField field type that does the same thing and can be queried just like anything else in Mongo. (option 2)
using Django-eav to create an entity-attribute-value store with your data. An elegant solution, but with painfully slow queries. (option 1)
storing your data as a JSON string in a long enough TextField and writing your own functions to serialize and deserialize the data, without expecting to be able to query over it.
In my own experience, if you by any chance need to query over the data, your option 2 is by far the best choice. EAV in Django, without composite keys, is painful.
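
For completeness, a minimal sketch of the hstore route; in newer Django versions (1.8+), django.contrib.postgres makes it queryable by key, which softens the caveat above (the field names are illustrative):

from django.contrib.postgres.fields import HStoreField
from django.db import models

class Image(models.Model):
    file = models.FileField(upload_to="images/")
    metadata = HStoreField(default=dict)  # arbitrary string key/value pairs

# Key lookups are supported, e.g.:
# Image.objects.filter(metadata__camera="Nikon D90")
# Image.objects.filter(metadata__has_key="exposure")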
