I'm looking for the most effective way to load Google Analytics data, which comes as JSON files with a nested object structure, into a relational database in parallel, so that I can collect and analyze these statistics later.
I've found pandas.io.json.json_normalize, which can flatten nested data into a flat structure, and there is also a PySpark approach that converts the JSON to a DataFrame as described here, but I'm not sure about the performance implications of either.
Can you describe the best ways of loading data from the Google Analytics API into an RDBMS?
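For reference, here is roughly what I mean by the json_normalize approach, assuming the nested response has already been parsed into Python dicts (the field names are made up for illustration):

import pandas as pd

# Hypothetical nested records, loosely shaped like a Google Analytics response
records = [
    {"id": 1, "session": {"source": "google", "medium": "organic"},
     "totals": {"visits": 3, "pageviews": 7}},
    {"id": 2, "session": {"source": "direct", "medium": "(none)"},
     "totals": {"visits": 1, "pageviews": 2}},
]

# pandas.io.json.json_normalize in older pandas; pd.json_normalize since 1.0.
# Nested objects become dotted column names: session.source, totals.visits, ...
flat = pd.json_normalize(records, sep=".")
print(flat.columns.tolist())

# The flat DataFrame could then be bulk-loaded into an RDBMS, for example:
# flat.to_sql("ga_sessions", engine, if_exists="append", index=False)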
I think this question can best be answered with more context about what data you want to consume and how you'll be consuming it. For example, if you will only be consuming a few of all the available fields, then it makes sense to store only those; or if you'll be using some specific field as an index, then we could index that field as well.
One thing I can recall off the top of my head is the JSON type in Postgres: it's built in and has several helper functions for operating on the data later.
References:
https://www.postgresql.org/docs/9.3/datatype-json.html
https://www.postgresql.org/docs/9.3/functions-json.html
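As a rough illustration of that idea (not a tested implementation), loading raw API responses into a Postgres JSON column with psycopg2 could look like this; the connection string, table and column names are placeholders:

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=analytics user=loader")  # placeholder DSN
cur = conn.cursor()

# json is available from 9.3; jsonb (9.4+) is usually preferable for querying/indexing
cur.execute("""
    CREATE TABLE IF NOT EXISTS ga_raw (
        id      serial PRIMARY KEY,
        payload jsonb NOT NULL
    )
""")

payload = {"source": "google", "sessions": 42}  # stand-in for one API response
cur.execute("INSERT INTO ga_raw (payload) VALUES (%s)", [Json(payload)])

# The JSON operators let you query inside the stored documents later
cur.execute(
    "SELECT payload->>'source' FROM ga_raw WHERE (payload->>'sessions')::int > 10"
)
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()

This keeps the nested structure intact and defers flattening to query time; if you only need a handful of fields, you can still pull them out into regular columns later.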
If you can update here with whatever decision you end up taking, it would be great to know.
Related
I'm new to big data. I learned that HDFS is for storing structured data and HBase is for storing unstructured data. I have a REST API from which I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better for loading the data into: HDFS or HBase? Also, can you please direct me to some tutorial for doing this? I came across this tutorial about streaming data, but I'm not sure if it fits my use case.
It would be of great help if you could guide me to a particular resource/technology to solve this issue.
There are several questions you have to think about:
Do you want to work with batch files or streaming? It depends on the rate at which your REST API will be called.
For storage there is not just HDFS and HBase; you have a lot of other options such as Cassandra, MongoDB and Neo4j. It all depends on how you want to use the data (random access vs. full scan, updates with versioning vs. writing new lines, concurrent access). For example, HBase is good for random access and Neo4j for graph storage. If you are receiving JSON files, MongoDB can be a good choice as it stores objects as documents.
What is the size of your data?
Here is a good article on the questions to think about when you start a big data project: documentation
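If you do end up going the MongoDB route, a minimal sketch of pulling JSON from a REST endpoint and storing it as documents might look like the following; the URL, database and collection names are placeholders:

import requests
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["warehouse"]["api_events"]

resp = requests.get("https://api.example.com/events")  # hypothetical REST endpoint
resp.raise_for_status()
data = resp.json()

# insert_many for a list of objects, insert_one for a single document
if isinstance(data, list):
    collection.insert_many(data)
else:
    collection.insert_one(data)

For a streaming ingestion pattern you would typically put something like Kafka or Flume in front of the store instead of polling the API in a loop.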
I apologize if this has been asked already, or if this is answered somewhere else.
Anyway, I'm working on a project that, in short, stores image metadata and then allows the user to search that metadata (which resembles a long list of key-value pairs). This wouldn't be too big of an issue if the metadata were standardized. However, the problem is that any given image in the database can have any number of key/value pairs in its metadata, and there is no standard list of keys.
Basically, I need to find a way to store a dictionary for each model, but with arbitrary key/value pairs. And I need to be able to query them. And the organization I'm working for is planning on uploading thousands of images to this program, so it has to query reasonably fast.
I have one model in my database, an image model, with a FileField.
So I'm torn between two options, and I could really use some help from people with more experience in choosing the best one (or any other solution that would work better):
Using a traditional relational database like MySQL, and creating a separate model with a ForeignKey to the image model, a key field, and a value field. Then, when I need to query the data, I'll fetch every row of this separate table that relates to an image and query those rows for the key/value combination I need.
Using something like MongoDB, with django-toolbox and its DictField to store the metadata. Then, when I need to query, I'll access the dict and search it for the key/value combination I need.
While I feel like option 1 would be much better in terms of query time, each image may have up to 40 key/value pairs of metadata, and that makes me worry about the separate "dictionary" table growing far too large if there are thousands of images.
Any advice would be much appreciated. Thanks!
What's the type of the metadata? Are both key and value strings? I'll assume that's the case.
The scale of your dataset matters. If you will have up to thousands of images and each image has up to 40 key-value pairs, then in option 1 the separate table would have at most about 400k records. That's not a problem for a modern database, as long as you have a reasonable machine and correct DB settings. One thing to take care of is a composite index on the fields in the table. In the Django ORM, it would be something like:
class ImageMeta(models.Model):
    image = models.ForeignKey('Image')
    key = models.CharField(max_length=XXXX)
    value = models.CharField(max_length=XXXX)

    class Meta:
        index_together = [["image", "key", "value"]]  # Django 1.5 and above
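With that model and index in place, a key/value lookup stays a straightforward, indexed query. A quick hypothetical usage example, with a made-up key name and assuming your image model is called Image:

# Images whose metadata contains a given key/value pair
image_ids = (ImageMeta.objects
             .filter(key="camera_model", value="Canon EOS 5D")
             .values_list("image_id", flat=True))
matching_images = Image.objects.filter(id__in=image_ids)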
In a Django project you've got four alternatives for this kind of problem, in no particular order:
using PostgreSQL, you can use the hstore field type, which is essentially a key/value store of strings (much like a Python dictionary). It's not very helpful in terms of querying it, but it does its job of saving your data.
using django-nonrel with MongoDB you get the ListField field type that does the same thing and can be queried just like anything else in Mongo. (option 2)
using django-eav to create an entity-attribute-value store with your data. An elegant solution, but with painfully slow queries. (option 1)
storing your data as a JSON string in a long enough TextField and writing your own functions to serialize and deserialize the data, without expecting to be able to query inside it.
In my own experience, if you need to query over the data at all, your option 2 is by far the best choice. EAV in Django, without composite keys, is painful.
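For completeness, the "JSON string in a TextField" alternative really is only a few lines; a rough sketch (field and helper names are just illustrative), with the caveat from above that you cannot query inside the blob:

import json
from django.db import models

class Image(models.Model):
    file = models.FileField(upload_to="images/")
    metadata_json = models.TextField(blank=True, default="{}")

    def get_metadata(self):
        return json.loads(self.metadata_json or "{}")

    def set_metadata(self, metadata_dict):
        self.metadata_json = json.dumps(metadata_dict)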
I have some programming background, but I'm in the process of both learning Python and making a web app. I'm a long-time lurker but a first-time poster on Stack Overflow, so please bear with me.
I know that SQLite (or another database, seems like PostgreSQL is popular) is the way to store data between sessions. But what's the most efficient way to store large amounts of data during a session?
I'm building a script to identify the strongest groups of employees to work on various projects in a company. I have received one SQLite database per department containing employee data including skill sets, achievements, performance, and pay.
My script currently runs one SQL query on each database in response to an initial query by the user, pulling all the potentially-relevant employees and their data. It stores all of that data in a list of Python dicts so the end-user can mix-and-match relevant people.
I see two other options. I could still run the comprehensive initial queries, but instead of storing the results in Python dicts, dump them all into SQLite temporary tables; my guess is that this would save some space and computation because I wouldn't have to store all the joins with each record. Or I could just load employee names and column/row references, which would save a lot of joins on the first pass, and then pull the data on the fly from the original databases as the user requests additional data, storing little if any data in Python data structures.
What's going to be the most efficient? Or, at least, what is the most common/proper way of handling large amounts of data during a session?
Thanks in advance!
Aren't you over-optimizing? You don't need the best solution, you need a solution which is good enough.
Implement the simplest one, using dicts; it has a fair chance of being adequate. If you test it and find it inadequate, try SQLite or Mongo (both have downsides) and see if either suits you better. But I suspect that buying more RAM instead would be the most cost-effective solution in your case.
(Not-a-real-answer disclaimer applies.)
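For what it's worth, the simplest version is very little code. Here is a rough sketch of pulling rows out of each department database into a list of dicts; the table and column names are assumptions about your schema:

import sqlite3

def load_employees(db_paths, min_performance):
    """Run the initial query against each department database and
    collect the results as a list of plain dicts."""
    employees = []
    for path in db_paths:
        conn = sqlite3.connect(path)
        conn.row_factory = sqlite3.Row  # rows can be converted to dicts
        rows = conn.execute(
            "SELECT name, skills, achievements, performance, pay "
            "FROM employees WHERE performance >= ?",
            (min_performance,),
        )
        employees.extend(dict(row) for row in rows)
        conn.close()
    return employees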
I have a few large hourly upload tables with RECORD field types. I want to pull selected records out of those tables and put them in daily per-customer tables. The trouble I'm running into is that using QUERY to do this seems to flatten the data.
Is there some way to preserve the nested RECORDs, or do I need to rethink my approach?
If it helps, I'm using the Python API.
It is now possible to preserve nested field structure in query results... more here.
Use the flatten_results flag in the bq utility:
--[no]flatten_results: Whether to flatten nested and repeated fields in the result schema. If not set, the default behavior is to flatten.
API Documentation
https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.query.flattenResults
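With the older Python client (google-api-python-client against the v2 REST API), the flag goes into the query job configuration. A rough sketch, assuming legacy SQL; the project, dataset and table names are placeholders, and credentials setup is omitted:

from googleapiclient.discovery import build

# Authentication/credentials setup omitted for brevity
bigquery_service = build("bigquery", "v2")

job_body = {
    "configuration": {
        "query": {
            "query": "SELECT * FROM [mydataset.hourly_uploads]",
            "flattenResults": False,
            # Unflattened results also require a destination table with allowLargeResults
            "allowLargeResults": True,
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "mydataset",
                "tableId": "customer_daily",
            },
        }
    }
}

job = bigquery_service.jobs().insert(projectId="my-project", body=job_body).execute()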
Unfortunately, there isn't a way to do this right now, since, as you realized, all results are flattened.
I would like to store the raw JSON stream (either via Twitter or the NYTimes) efficiently in MongoDB, so that I can later index the data (NYTimes articles, or Tweets/usernames) with either Lucene or Hadoop. What's the smartest way of storing data in Mongo? Should I just pipe in the JSON, or is there something better? I am only using a single machine for mongodb, with 3 replica sets.
Is there an efficient (smart) way of writing queries, or storing my data to better-optimize the search-queries?
Is there an efficient (smart) way of writing queries, or storing my data to better-optimize the search-queries?
This totally depends on what kind of queries you need to make and what the usage pattern of your application will be.
It would be pretty simple to store each tweet in a Mongo Document containing: sender, timestamp, text, etc.
Depending on what queries you need to make, you will need to create indexes on these fields (more info: http://www.mongodb.org/display/DOCS/Indexes)
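A minimal pymongo sketch of both points, storing the raw tweet document and indexing the fields you query on (the connection string and field names are illustrative):

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")
tweets = client["twitter"]["tweets"]

# Store the raw JSON of each tweet as-is; Mongo documents map directly to JSON objects
tweets.insert_one({
    "sender": "some_user",
    "timestamp": 1356998400,
    "text": "example tweet text",
})

# Index only the fields you actually query on
tweets.create_index([("sender", ASCENDING)])
tweets.create_index([("timestamp", DESCENDING)])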
For full text search, you could tokenize/parse/stem the text of the tweets and store an array of tokens with each tweet which you can index to make queries on it fast.
If you need more powerful full-text search features, you could also index the tweets with Lucene and store the ObjectId in each Lucene document, but this introduces the complexity of essentially having two data stores.
Again, there's really no right answer here without knowing the details of the use case.