How to make buckets from a big JSON file in Python?

I have the following local JSON file (around 90 MB):
To make my data more accessible, I want to create smaller JSON files that contain exactly the same data but only 100 of the array entries in Readings.SensorData at a time. So a file that includes the first 100 readings, then a file that includes readings 101-200, and so on... I am aware of the ijson library, but I cannot figure out how to do this in the most memory-efficient way.
Edit: Just to note, I know how to do this with the standard json library, but because it is a big file I want to do it in a way that doesn't grind to a complete halt.
Any help would be appreciated greatly!

Use dict.items() in Python. You will need the json package: json.loads() will give you a Python dictionary, which you can then iterate over with dict.items().
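If loading the whole 90 MB file at once is the problem, a streaming sketch with ijson might look like the following. The path 'Readings.SensorData.item' is assumed from the question's description of the structure, and the input/output file names are placeholders.

import json
import ijson  # pip install ijson

def split_readings(src_path, chunk_size=100):
    # Stream the entries one at a time instead of loading the whole file.
    with open(src_path, 'rb') as src:
        readings = ijson.items(src, 'Readings.SensorData.item')
        chunk, index = [], 0
        for reading in readings:
            chunk.append(reading)
            if len(chunk) == chunk_size:
                write_chunk(chunk, index)
                chunk, index = [], index + 1
        if chunk:  # leftover readings, fewer than chunk_size
            write_chunk(chunk, index)

def write_chunk(chunk, index):
    # Keep the original nesting, just with at most chunk_size entries.
    # default=float handles the Decimal values ijson yields for numbers.
    with open(f'readings_{index:04d}.json', 'w') as out:
        json.dump({'Readings': {'SensorData': chunk}}, out, default=float)

split_readings('big_file.json')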

Related

Quicker way to manage data than using python with JSON

I am using python to scrape, store and plot the data on an odds website for later reference. Initially I am storing the data in numerous .csv files (every X minutes) which I then aggregate into larger json files (per day) for easier access.
The problem is that with the increasing number of events per day (>600), the speed at which the JSON files are manipulated has become unacceptable (~35 s just to load a single 95 MB JSON file).
What would be another set-up which would be more efficient (in terms of speed)? Maybe using SQL alongside python?
Maybe try another JSON library like orjson instead of the standard one.
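A rough sketch of that swap (the file name is made up): orjson exposes loads() and dumps() much like the standard module, but it works with bytes, so the files are opened in binary mode.

import orjson  # pip install orjson

with open('events_2024-01-01.json', 'rb') as f:
    events = orjson.loads(f.read())

# ... manipulate events ...

with open('events_2024-01-01.json', 'wb') as f:
    f.write(orjson.dumps(events))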

What's the best strategy for dumping very large python dictionaries to a database?

I'm writing something that essentially refines and reports various strings out of an enormous python dictionary (the source file for the dictionary is XML over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have you considered using Pandas? Pandas does not natively accept XML, but if you use ElementTree from the xml standard library you should be able to read it into a Pandas DataFrame and do what you need with it, including refining strings and adding more data to the DataFrame as you get it.
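A minimal sketch of that route, assuming a flat XML layout with made-up tag names ('record', 'name', 'value'); the real file will have its own structure.

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('source.xml')   # placeholder file name
rows = [
    {'name': rec.findtext('name'), 'value': rec.findtext('value')}
    for rec in tree.getroot().iter('record')
]
df = pd.DataFrame(rows)
print(df.head())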
So I've decided that this problem is more of a data-design problem than a Python issue. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined XML dictionary as a pickle on a shared filesystem for cold storage and use Mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22 MB to 100 KB.
Thanks for chatting with me about this :)
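For anyone following the same path, a rough sketch of that split; the data, file name, database and collection names are placeholders, and a locally running mongod is assumed.

import pickle
from pymongo import MongoClient  # pip install pymongo

refined_dict = {'item_1': {'raw': '<...>', 'clean': 'value'}}   # placeholder data
refined_queries = [{'key': 'item_1', 'value': 'value'}]         # placeholder data

# Pickle the full dictionary for cold storage on the shared filesystem.
with open('refined_dict.pkl', 'wb') as f:
    pickle.dump(refined_dict, f)

# Store only the small refined records in MongoDB for later comparison.
client = MongoClient()
client['reports']['refined'].insert_many(refined_queries)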

What is the best way to write data to disk to read from it?

Yesterday I did many experiments with writing data to a file and then reading it back. I ran into many difficulties with this simple task.
At first I thought that I could simply save data to a file as is. For example, save a list to file.txt and then read it back into some variable as a list.
I found a solution on Stack Overflow and I think it is the simplest way to read a list from file.txt:
import ast

with open('file.txt', 'r') as f:
    list_variable = ast.literal_eval(f.read())
So, as a newbie in programming, I would like to know from much more experienced coders - what is the best way to save data to disk for future import?
I personally use strings, dictionaries, lists and tuples. Is there an accurate and simple way to work with them (txt, csv, json ... etc.)?
The basic idea of writing to and reading from files is that you need to serialize your data (see Wikipedia).
In order to retrieve your data and restore it as a data structure of your choice (i.e. tuples, lists or whatever), you need a way to de-serialize the data you read. That can be achieved by parsing the input stream of data.
As a solution to your problem, you can import the json package and use json.dumps(...) and json.loads(...) to write and read JSON-formatted data, respectively.
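A small sketch of that round trip; note that JSON has no tuple type, so tuples come back as lists.

import json

data = {'words': ['alpha', 'beta'], 'count': 2}

with open('data.json', 'w') as f:
    f.write(json.dumps(data))

with open('data.json', 'r') as f:
    restored = json.loads(f.read())

print(restored == data)   # True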

Is it a good practice to parse a JSON response to a python dictionary?

I am making an API call that gets a JSON response. However, as the response is huge and I don't need all the information received, I am parsing only the required key:values to a dictionary which I am using to write to a CSV file. Is it a good practice to do? Should I parse the JSON data directly to create the CSV file?
Like all things performance-related, don't bother optimizing until it becomes a problem. What you're doing is the normal, simple approach, so keep doing it until you hit real bottlenecks. A "huge response" is a relative thing. To some a "huge" response might be several kilobytes, while others might consider several megabytes, or hundreds of megabytes to be huge.
If you ever do hit a bottleneck, the first thing you should do is profile your code to see where the performance problems are actually occurring and try to optimize only those parts. Don't guess; for all you know, the CSV writer could turn out to be the poor performer.
Remember, those JSON libraries have been around a long time, have strong test coverage and have been battle tested in the field by many developers. Any custom solution you try to create is going to have none of that.
If you want to write only particular key:value pairs to the CSV file, it is better to convert the JSON into a Python dictionary with the selected key:value pairs and write that to the CSV file.
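A minimal sketch of that approach, with a made-up payload and field names:

import csv
import json

response_text = '[{"id": 1, "name": "a", "extra": "ignored"}]'   # placeholder API payload
wanted = ['id', 'name']

# Keep only the fields you need, then write them with csv.DictWriter.
rows = [{key: item[key] for key in wanted} for item in json.loads(response_text)]

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=wanted)
    writer.writeheader()
    writer.writerows(rows)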

quickest method of accessing key - value pairs?

I hope the question is not too unspecific: I have a huge database-like list (~900,000 entries) which I want to use for processing text files. (More details below.) Since this list will be edited and used with other programs as well, I would prefer to keep it in one separate file and include it in the Python code, either directly or by dumping it to some format that Python can use. I was wondering if you can advise on what would be the quickest and most efficient way. I have looked at several options, but may not have seen what is best:
Include the list as a python dictionary in the form
my_list = { "key": "value" }
directly into my python code.
Dump the list to an sqlite database and use the sqlite3 module.
Have the list as a yml file and use the yaml module.
Any ideas how these approaches would scale if I process a text file and want to do replacements on something like 30,000 lines?
For those interested: this is for linguistic processing, in particular ancient Greek. The list is an exhaustive list of Greek forms and the head words that they are derived from. For every word form in a text file, I want to add the dictionary head word.
Point 1 is much faster than using either YAML or SQL, as #b4hand and #DeepSpace indicated. What you should do, though, is not include the list in the rest of the Python code you are developing, as you indicated, but make a separate .py file with just that dictionary definition.
That way the list in that file is easier to write from a program (or to extend by a program). And, on first import, a .pyc will be created, which speeds up re-reading on further runs of your program. This is actually very similar in performance to using the pickle module and pickling the dictionary to a file and reading it back from there, while keeping the dictionary in an easily human-readable and editable form.
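A small sketch of that idea; the Greek entries are made-up examples and write_forms() is just an illustrative helper. One run dumps the dictionary into its own module, and later runs simply import it and get the cached .pyc for free.

import pprint

forms = {"λόγου": "λόγος", "ἀνθρώπῳ": "ἄνθρωπος"}   # placeholder entries

def write_forms(path="greek_forms.py"):
    # Write the dictionary as a plain Python literal into its own module.
    with open(path, "w", encoding="utf-8") as f:
        f.write("FORMS = " + pprint.pformat(forms) + "\n")

write_forms()

from greek_forms import FORMS   # fast on re-runs thanks to the .pyc cache
print(FORMS["λόγου"])           # -> λόγος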
Less than one million entries is not huge and should fit in memory easily. Thus, your best bet is option 1.
If you are looking for speed, option 1 should be the fastest because the other 2 will need to repeatedly access the HD which will be the bottleneck.
I would use a caching mechanism to hold this data, or maybe a data-structure store like Redis. Loading all of this into memory might become too expensive.
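For the Redis route, a tiny sketch with the redis-py client; the key name and entries are made up, and a locally running Redis server is assumed.

import redis  # pip install redis

r = redis.Redis(decode_responses=True)
r.hset('greek_forms', mapping={'λόγου': 'λόγος', 'ἀνθρώπῳ': 'ἄνθρωπος'})
print(r.hget('greek_forms', 'λόγου'))   # -> λόγος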
