I have been trying to write a large amount (>800 MB) of data to a JSON file; after a fair amount of trial and error I ended up with this code:
import json

def write_to_cube(data):
    # Read the existing file and merge in the new data
    with open('test.json') as file1:
        temp_data = json.load(file1)
        temp_data.update(data)

    # Rewrite the whole file (this is what gets slow as the file grows)
    with open('test.json', 'w') as f:
        json.dump(temp_data, f)
To run it, just call the function: write_to_cube({"some_data": data})
Now the problem with this code is that it's fast for small amounts of data, but it becomes a problem once test.json holds more than 800 MB. When I try to update or add data to it, it takes ages.
I know there are external libraries such as simplejson or jsonpickle, but I am not sure how to use them.
Is there any other way to solve this problem?
Update:
I am not sure how this can be a duplicate; the other articles say nothing about writing or updating a large JSON file, only about parsing:
Is there a memory efficient and fast way to load big json files in python?
Reading rather large json files in Python
Neither of the above makes this question a duplicate; they say nothing about writing or updating.
So the problem is that you have a long-running operation. Here are a few approaches I usually take:
Optimize the operation: this rarely works; I can't recommend some superb library that would parse the JSON a few seconds faster.
Change your logic: if the purpose is to save and load data, you could try something else entirely, such as storing your object in a binary file instead (see the sketch after this list).
Threads and callbacks, or deferred objects in some web frameworks: in web applications the operation sometimes takes longer than a request can wait, so we run it in the background (common cases: zipping files and then emailing the zip to the user, sending SMS through a third party's API, ...).
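For option 2, a minimal sketch using the standard-library shelve module, which keeps a dict-like mapping on disk so only the keys you touch get rewritten; the file and function names are illustrative, not from the question:

import shelve

def write_to_cube(data):
    # shelve stores each value as a separate pickled entry, so an update
    # does not rewrite the whole 800 MB of existing data.
    with shelve.open('test_store') as store:
        store.update(data)

def read_key(key):
    with shelve.open('test_store') as store:
        return store[key]

# Usage (illustrative):
# write_to_cube({"some_data": [1, 2, 3]})
# print(read_key("some_data"))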
Related
Yesterday I ran many experiments writing data to a file and then reading it back, and I experienced many difficulties with this simple task.
At first I thought I could simply save the data to a file as is, for example save a list to file.txt and then read it back into a variable as a list.
I found a solution on StackOverflow, and I think it is the simplest way to read a list from file.txt:
import ast

with open('file.txt', 'r') as f:
    list_variable = ast.literal_eval(f.read())
So, as a newbie to programming, I would like to hear from more experienced coders: what is the best way to save data to disk for later import?
I personally use strings, dictionaries, lists and tuples. Is there an accurate and simple format to work with (txt, csv, json, etc.)?
The basic idea of writing to and reading from files is that you need to serialize your data (see Wikipedia).
In order to retrieve your data and restore it as a data structure of your choice (i.e. tuples, lists or whatever), you need a way to de-serialize the data you read. That can be achieved by parsing the input stream of data.
As a solution to your problem, you can use the json package: import it and use json.dumps(...) and json.loads(...) to respectively write and read JSON-formatted data.
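For example, a minimal round trip with the standard json module (note that JSON has no tuple type, so tuples come back as lists):

import json

original = {'numbers': [1, 2, 3], 'point': (4, 5)}
text = json.dumps(original)       # serialize to a string
restored = json.loads(text)       # parse it back
print(restored)                   # {'numbers': [1, 2, 3], 'point': [4, 5]}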
I am making an API call that returns a JSON response. However, the response is huge and I don't need all of the information, so I parse only the required key/value pairs into a dictionary, which I then use to write a CSV file. Is this good practice? Should I parse the JSON data directly to create the CSV file?
Like all things performance-related, don't bother optimizing until it becomes a problem. What you're doing is the normal, simple approach, so keep doing it until you hit real bottlenecks. A "huge response" is relative: to some, "huge" might mean several kilobytes, while others might consider several megabytes, or hundreds of megabytes, to be huge.
If you ever do hit a bottleneck, the first thing you should do is profile your code to see where the performance problems are actually occurring, and try to optimize only those parts. Don't guess; for all you know, the CSV writer could turn out to be the poor performer.
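For instance, a quick way to profile with the standard library, where process_response is just a placeholder for your own parse-and-write code:

import cProfile
import pstats

def process_response():
    # placeholder for your "parse JSON, write CSV" code
    pass

cProfile.run('process_response()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)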
Remember, those JSON libraries have been around a long time, have strong test coverage and have been battle tested in the field by many developers. Any custom solution you try to create is going to have none of that.
If you want to write only particular key/value pairs into the CSV file, it is better to convert the JSON into a Python dictionary with just the selected key/value pairs and write that to the CSV file.
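A rough sketch of that approach, assuming the response is a JSON array of objects; the sample response and field names here are made up:

import csv
import json

api_response_text = '[{"id": 1, "name": "a", "price": 2.5, "extra": "ignored"}]'
wanted = ['id', 'name', 'price']            # only the keys you actually need

records = json.loads(api_response_text)     # a list of dicts
rows = [{k: rec.get(k) for k in wanted} for rec in records]

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=wanted)
    writer.writeheader()
    writer.writerows(rows)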
I'm wondering if the sas7bdat module in Python creates an iterator-type object or loads the entire file into memory as a list? I'm interested in doing something line-by-line to a .sas7bdat file that is on the order of 750GB, and I really don't want Python to attempt to load the whole thing into RAM.
Example script:
from sas7bdat import SAS7BDAT

count = 0
with SAS7BDAT('big_sas_file.sas7bdat') as f:
    for row in f:
        count += 1
I can also use
it = f.__iter__()
but I'm not sure if that will still go through a memory-intensive data load. Any knowledge of how sas7bdat works OR another way to deal with this issue would be greatly appreciated!
You can see the relevant code on bitbucket. The docstring describes iteration as a "generator", and looking at the code, it appears to be reading small pieces of the file rather than reading the whole thing at once. However, I don't know enough about the file format to know if there are situations that could cause it to read a lot of data at once.
If you really want to get a sense of its performance before trying it on a giant 750G file, you should test it by creating a few sample files of increasing size and seeing how its performance scales with the file size.
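If generating sample SAS files is awkward, another rough check is to watch the process's peak memory while iterating over a smaller file you already have: if iteration really streams, the peak should stay roughly flat as the row count grows. A Unix-only sketch (the file name is made up):

import resource                          # Unix-only
from sas7bdat import SAS7BDAT

def peak_mem_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

count = 0
with SAS7BDAT('sample.sas7bdat') as f:   # a smaller sample file
    for row in f:
        count += 1
        if count % 100000 == 0:
            print(count, 'rows, peak RSS ~', round(peak_mem_mb()), 'MB')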
I have some data which I need to hard-code in my project (quite a lot of data); it is essentially the settings for my form. Its structure is like this:
{X : [{a,b}, {c,d}], Y:[{e,f},{g,h}], Z: [{i,j},{k,l}], ...}
What is a good way to store this hard-coded data: JSON, ini, or something else?
Keeping all of this in settings.py doesn't seem like a good idea!
If it requires nesting, then you can't use ini files.
Other options are json, pickling, a key/value store (like memcache/redis). If it will require modifications, then don't use the disk. Doing so will also make your code incompatible with many PaaS providers that do not have a "filesystem" that you can use.
My recommendations:
Use a k/v store (like memcache/redis). You don't need to serialize (convert) your data, the APIs are very straightforward, and if you go with redis you can store complicated data structures easily. Plus, it will be very, very fast (see the Redis sketch below).
json and pickling share the same problem: you need to use the filesystem. Hits to the filesystem will slow down your execution time, and you will have problems if you want to deploy to Heroku or similar platforms, as they don't provide filesystem access. The other problem is that you may need to write your own conversion code (serializers) if you plan on storing custom objects that can't easily be converted. If you want to use json, my recommendation is to store it in a database.
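A minimal sketch of the Redis recommendation using the redis-py client, assuming a Redis server on localhost; one straightforward approach is to store the whole structure as a JSON string under a single key:

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

settings = {'X': [['a', 'b'], ['c', 'd']], 'Y': [['e', 'f'], ['g', 'h']]}

# Store the structure under one key...
r.set('form_settings', json.dumps(settings))

# ...and read it back when the application needs it
loaded = json.loads(r.get('form_settings'))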
It depends on your definition of "quite big data" and on how frequently it will be changed.
If your settings don't change very often you could use a simple file in the format you like the most. If you go this route I'd recommend taking a look at this project (it supports multiple formats: dict, json, yaml, .ini)
If you'll be constantly making changes to those settings and the data really is big (several thousand lines or so), you'll probably want a proper database or some other storage that provides a better interface for programmatically editing those settings. If you're already using some kind of database for your application's non-settings data, why not use it for this as well?
It's true you could read huge settings from a file, but it will probably be easier to interact with those settings if they're stored in a database.
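A minimal sketch of the database route using the standard-library sqlite3 module, storing each setting under a name as a JSON blob; the table and key names are made up:

import json
import sqlite3

conn = sqlite3.connect('settings.db')
conn.execute('CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)')

def save_setting(key, value):
    conn.execute('INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)',
                 (key, json.dumps(value)))
    conn.commit()

def load_setting(key):
    row = conn.execute('SELECT value FROM settings WHERE key = ?', (key,)).fetchone()
    return json.loads(row[0]) if row else None

save_setting('X', [['a', 'b'], ['c', 'd']])
print(load_setting('X'))      # [['a', 'b'], ['c', 'd']]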
Hope this helps.
In Python, the simplest ways to do this are using straight Python code, pkl, or JSON.
JSON is very easy to load:
import json

with open('data.json', 'r') as f:
    data = json.load(f)
And so is pkl:
import pickle

# pickle files must be opened in binary mode
with open('data.pkl', 'rb') as f:
    data = pickle.load(f)
To generate the pkl file:
with open('data.pkl', 'wb') as f:
    pickle.dump(your_data, f)
In my program I have a method which requires about 4 files to be opened each time it is called, because it needs to read some data from them. All the data from the files is stored in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way to handle these files? And is storing all the data in lists time-consuming; are there better alternatives to lists?
I can give some code, but my previous question was closed because the code only confused everyone: it is part of a big program and would need to be explained completely to be understood. So I am not posting any code; please suggest approaches, treating this as a general question.
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static, and relatively small. Then, the 10k calls will read an in-memory cache rather than a file. Much faster.
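A minimal sketch of that caching strategy, assuming the files don't change while the program runs; the file names and the parsing step are placeholders:

_cache = {}

def load_file(path):
    # Read and parse the file only on the first call; later calls return the
    # cached list, so 10,000 calls cost one disk read per file.
    if path not in _cache:
        with open(path) as f:
            _cache[path] = f.read().splitlines()    # placeholder parsing
    return _cache[path]

def my_method():
    data_a = load_file('a.txt')     # file names are made up
    data_b = load_file('b.txt')
    # ... work with the lists as before ...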
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading, manipulation and storage of this kind of data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on it. Then simple queries against the data will be very fast. You don't need a lot of work to set up a database: you can create an SQLite database without running a separate process, and it doesn't have a complicated installation process.
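A minimal sketch with the standard-library sqlite3 module; the table, column names and sample rows are made up:

import sqlite3

conn = sqlite3.connect('data.db')    # just a file on disk, no server process
conn.execute('CREATE TABLE IF NOT EXISTS records (name TEXT, value REAL)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_records_name ON records (name)')

# Load the file contents once (placeholder rows; parse your real files here)...
rows = [('alpha', 1.0), ('beta', 2.5)]
conn.executemany('INSERT INTO records (name, value) VALUES (?, ?)', rows)
conn.commit()

# ...then each of the 10,000 lookups becomes a fast indexed query.
result = conn.execute('SELECT value FROM records WHERE name = ?', ('beta',)).fetchone()
print(result)    # (2.5,)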
Open the files in the method that calls this one, and pass the data in as parameters to the method.
If the files are structured like configuration files, it might be good to use the ConfigParser library; otherwise, if they have some other structured format, I think it would be better to store all this data in JSON or XML and perform the necessary operations on that.
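A small configparser sketch, with made-up section and option names:

import configparser

# Writing a config file
config = configparser.ConfigParser()
config['database'] = {'host': 'localhost', 'port': '5432'}
with open('settings.ini', 'w') as f:
    config.write(f)

# Reading it back
config2 = configparser.ConfigParser()
config2.read('settings.ini')
host = config2.get('database', 'host')
port = config2.getint('database', 'port')
print(host, port)    # localhost 5432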