I am using Python to scrape, store, and plot data from an odds website for later reference. Initially I store the data in numerous .csv files (every X minutes), which I then aggregate into larger JSON files (one per day) for easier access.
The problem is that with the increasing number of events per day (>600), manipulating the JSON files has become unacceptably slow (~35 s just to load a single 95 MB JSON file).
What would be a more efficient setup in terms of speed? Maybe using SQL alongside Python?
Maybe try another JSON library like orjson instead of the standard one.
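For example, a minimal sketch of swapping in orjson for the daily load (the file name is just a placeholder); orjson parses and serializes large documents considerably faster than the standard library and works on bytes:

import json
import orjson  # pip install orjson

path = "odds_2024-01-01.json"  # placeholder file name

# standard library
with open(path, "r") as f:
    data = json.load(f)

# orjson: read the raw bytes and parse them
with open(path, "rb") as f:
    data = orjson.loads(f.read())

# writing back out: orjson.dumps() returns bytes, not str
with open(path, "wb") as f:
    f.write(orjson.dumps(data))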
I have the following local JSON file (around 90 MB):
To make my data more accessible, I want to create smaller JSON files that contain exactly the same data but only 100 of the array entries in Readings.SensorData at a time: one file with the first 100 readings, then a file with readings 101-200, and so on. I am aware of the ijson library, but I cannot figure out how to do this in the most memory-efficient way.
Edit: Just to note, I know how to do this with the standard json library, but because it is a big file I want to do it in a way that doesn't grind to a halt.
Any help would be appreciated greatly!
You will need the json package in Python. json.loads() (or json.load() on a file object) will give you a Python dictionary, which you can then walk with dict.items() to pull out the chunks you want.
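If memory is the main concern, a streaming sketch with ijson (which the question already mentions) could look like the following. The 'Readings.SensorData.item' prefix and the output layout are assumptions based on the structure described; adjust them to your actual file, and copy over any other top-level fields you need into each output file.

import json
import ijson  # pip install ijson

CHUNK_SIZE = 100

def write_chunk(chunk, part):
    # write one chunk of readings to its own small file
    with open(f"readings_part_{part}.json", "w") as out:
        # default=float handles the Decimal values ijson produces for numbers
        json.dump({"Readings": {"SensorData": chunk}}, out, default=float)

def split_readings(path):
    with open(path, "rb") as f:
        chunk, part = [], 0
        # stream the array entries one by one instead of loading the whole file
        for reading in ijson.items(f, "Readings.SensorData.item"):
            chunk.append(reading)
            if len(chunk) == CHUNK_SIZE:
                write_chunk(chunk, part)
                chunk, part = [], part + 1
        if chunk:  # remaining readings (fewer than 100)
            write_chunk(chunk, part)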
I'm writing something that essentially refines and reports various strings out of an enormous Python dictionary (the source file for the dictionary is an XML file over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have you considered using Pandas? Pandas does not natively accept XML, but if you use ElementTree from the xml standard library you should be able to read it into a Pandas DataFrame and do what you need with it, including refining strings and adding more data to the DataFrame as you get it.
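A minimal sketch of that approach, assuming hypothetical 'record', 'name', and 'value' elements; replace them with whatever tags your XML actually uses. iterparse keeps memory bounded even for a million-line file:

import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_dataframe(path):
    rows = []
    # iterparse streams the file instead of building the whole tree at once
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record":                 # hypothetical element name
            rows.append({
                "name": elem.findtext("name"),   # hypothetical child tags
                "value": elem.findtext("value"),
            })
            elem.clear()  # release the element we no longer need
    return pd.DataFrame(rows)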
So I've decided that this problem is more of a data-design problem than a Python situation. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined XML dictionary as a pickle on a shared filesystem for cold storage and use mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22MB to 100K.
Thanks for chatting with me about this :)
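In case it helps anyone later, here is a rough sketch of that split (paths, database and collection names are placeholders): the full dictionary goes to a pickle on the shared filesystem, and only the small refined result goes into mongo.

import pickle
from pymongo import MongoClient

def archive_and_store(big_dict, refined_doc, pickle_path):
    # cold storage: the full ~22 MB dictionary as a pickle on the shared filesystem
    with open(pickle_path, "wb") as f:
        pickle.dump(big_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

    # hot storage: only the ~100K refined document goes into mongo,
    # with a pointer back to the pickle it came from
    client = MongoClient()  # default localhost connection
    client.mydb.refined.insert_one({"source_pickle": pickle_path, **refined_doc})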
I am working on a personal project (using Python 3) that will retrieve weather information for any city in the United States. My program prompts the user to enter as many city-state combinations as they wish, and then it retrieves the weather information and creates a weather summary for each city entered. Behind the scenes, I'm essentially taking the State entered by the user, opening a .txt file corresponding to that State, and then getting a weather code that is associated with the city entered, which I then use in a URL request to find weather information for the city. Since I have a .txt file for every state, I have 50 .txt files, each with a large number of city-weather code combinations.
Would it be faster to keep my algorithm the way that it currently is, or would it be faster to keep all of this data in a dictionary? This is how I was thinking about storing the data in a dictionary:
info = {'Virginia':{'City1':'ID1','City2':'ID2'},'North Carolina':{'City3':'ID3'}}
I'd be happy to provide some of my code or elaborate if necessary.
Thanks!
If you have a large data file, you will spend a long time sifting through it and putting the values into a .py file. For a small file I would use a dictionary, but for a large file I would stick with the .txt files.
Other possible solutions are:
sqlite
pickle
shelve
Other Resources
Basic data storage with Python
https://docs.python.org/3/library/persistence.html
https://docs.python.org/3/library/pickle.html
https://docs.python.org/3/library/shelve.html
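As a concrete example of the shelve option, here is a small sketch: build the mapping from the 50 .txt files once, then reopen the shelf for lookups without re-parsing anything. The "city,code" line format is an assumption; adapt the parsing to your files.

import shelve

def build_shelf(state_files, shelf_path="weather_codes"):
    # state_files maps a state name to its .txt file path
    with shelve.open(shelf_path) as shelf:
        for state, txt_path in state_files.items():
            codes = {}
            with open(txt_path) as f:
                for line in f:
                    if not line.strip():
                        continue
                    city, code = line.strip().split(",")  # assumes "city,code" lines
                    codes[city] = code
            shelf[state] = codes

def lookup(state, city, shelf_path="weather_codes"):
    with shelve.open(shelf_path) as shelf:
        return shelf[state][city]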
It almost certainly would be much faster to preload the data from the files, if you're using the same python process for many user requests. If the process handles just one request and exits, this approach would be slower and use more memory. For some number of requests between "one" and "many", they'd be about equal on speed.
For a situation like this I would probably use sqlite, for which python has built-in support. It would be much faster than scanning text files without the time and memory overhead of loading the full dictionary.
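A minimal sketch of the sqlite route, again assuming "city,code" lines in each state file; sqlite3 ships with Python, and the indexed lookup avoids both scanning text files and holding the whole dictionary in memory.

import sqlite3

def build_db(state_files, db_path="weather.db"):
    # state_files maps a state name to its .txt file path
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS codes (state TEXT, city TEXT, code TEXT)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_state_city ON codes (state, city)")
    for state, txt_path in state_files.items():
        with open(txt_path) as f:
            rows = [(state, *line.strip().split(",")) for line in f if line.strip()]
        con.executemany("INSERT INTO codes VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

def get_code(state, city, db_path="weather.db"):
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT code FROM codes WHERE state = ? AND city = ?", (state, city)
    ).fetchone()
    con.close()
    return row[0] if row else None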
It is probably not a very good idea to have a large number of text files, because access slows down with large or numerous directories. But if you have large data records, you might want an intermediate solution: index a single data file and load the index into a dictionary.
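A rough sketch of that intermediate idea: keep one concatenated data file, remember each record's byte offset in a dictionary, and seek to it on demand. The one-record-per-line, "key,value" layout here is only illustrative.

def build_index(data_path):
    # map each record's key to its byte offset in the file
    index = {}
    with open(data_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b",", 1)[0].decode()
            index[key] = offset
    return index

def read_record(data_path, index, key):
    # jump straight to the record instead of scanning the file
    with open(data_path, "rb") as f:
        f.seek(index[key])
        return f.readline().decode().rstrip("\n")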
I've accumulated a set of 500 or so files, each of which has an array and header that stores metadata. Something like:
2,.25,.9,26 #<-- header, which is actually cryptic metadata
1.7331,0
1.7163,0
1.7042,0
1.6951,0
1.6881,0
1.6825,0
1.678,0
1.6743,0
1.6713,0
I'd like to read these arrays into memory selectively. We've built a GUI that lets users select one or multiple files from disk, and each is then read into the program. If users want to read in all 500 files, the program is slow because it opens and closes each file. So my question is: will it speed up my program to store all of these in a single structure, something like HDF5? Ideally, this would have faster access than the individual files. What is the best way to go about this? I haven't ever dealt with these kinds of considerations. What's the best way to speed up this bottleneck in Python? The total data is only a few megabytes, so I'd even be amenable to storing it in the program itself, not just on disk (but I don't know how to do this).
Reading 500 files in Python should not take much time, as the overall size is only a few MB. Your data structure in each file is plain and simple, so parsing should not take much time either, I guess.
If the actual slowness is because of opening and closing files, there may be an OS-related issue (it may have very poor I/O).
Have you timed how long it actually takes to read all the files?
You can also try a small embedded database like sqlite, where you can store your file data and access the required pieces on the fly.
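If you do want to try the HDF5 idea from the question, a minimal sketch with h5py (an extra dependency, so treat this as one option among several) would pack all 500 files into a single container, keep each cryptic header as an attribute, and let the GUI read any subset back without 500 open/close calls:

import numpy as np
import h5py  # pip install h5py

def pack_files(paths, out_path="all_curves.h5"):
    with h5py.File(out_path, "w") as h5:
        for path in paths:
            with open(path) as f:
                header = f.readline().strip()        # e.g. "2,.25,.9,26"
                data = np.loadtxt(f, delimiter=",")  # the two-column array
            dset = h5.create_dataset(path, data=data)
            dset.attrs["header"] = header            # keep the metadata with the array

def load_curve(name, h5_path="all_curves.h5"):
    with h5py.File(h5_path, "r") as h5:
        return h5[name][()], h5[name].attrs["header"]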
I have some large XML files (5GB ~ each) that I'm importing to a mongodb database. I'm using Expat to parse the documents, doing some data manipulation (deleting some fields, unit conversion, etc) and then inserting into the database. My script is based on this one: https://github.com/bgianfo/stackoverflow-mongodb/blob/master/so-import
My question is: is there a way to improve this with a batch insert? Would storing these documents in an array before inserting be a good idea? How many documents should I store before inserting, then? Would writing the JSONs to a file and then using mongoimport be faster?
I appreciate any suggestion.
If you want to import XML into MongoDB, and Python is just what you have chosen so far to get the job done but you are open to other approaches, you could also do it with the following steps:
transforming the XML documents into CSV documents using XMLStarlet
transforming the CSVs into files containing JSONs using AWK
importing the JSON files into MongoDB
XMLStarlet and AWK are both extremely fast and you are able to store your JSON objects using a non-trivial structure (sub-objects, arrays).
http://www.joyofdata.de/blog/transforming-xml-document-into-csv-using-xmlstarlet/
http://www.joyofdata.de/blog/import-csv-into-mongodb-with-awk-json/
Would storing these documents in an array before inserting be a good idea?
Yes, that's very likely. It reduces the number of round-trips to the database. You should monitor your system; it's probably idling a lot when inserting because of I/O wait (that is, the overhead and thread synchronization take a lot more time than the actual data transfer).
How many documents should I store before inserting, then?
That's hard to say, because it depends on so many factors. Rule of thumb: 1,000 - 10,000. You will have to experiment a little. In older versions of mongodb, the entire batch must not be larger than the document size limit of 16MB.
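A minimal sketch of that batching with pymongo's insert_many; parse_docs stands in for whatever your Expat handler plus cleanup step yields, and the database/collection names and batch size of 1,000 are just placeholders to experiment with.

from pymongo import MongoClient

BATCH_SIZE = 1_000  # starting point; tune per the rule of thumb above

def import_xml(xml_path):
    collection = MongoClient().mydb.entries  # placeholder db/collection names
    batch = []
    for doc in parse_docs(xml_path):         # your existing parse + manipulation step
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            collection.insert_many(batch, ordered=False)  # one round-trip per batch
            batch = []
    if batch:                                # flush the final partial batch
        collection.insert_many(batch, ordered=False)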
Would writing the JSONs to a file and then using mongoimport be faster?
No, unless your code has a flaw. It would mean copying the data twice, and the entire operation should be I/O-bound anyway.
Also, it's a good idea to insert all documents first and then add any indexes, not the other way around (otherwise the index has to be updated with every insert).
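With pymongo that simply means calling create_index once the bulk load has finished; the field name below is a placeholder for whatever you actually query on.

# create indexes only after all documents have been inserted
collection.create_index("converted_value")  # placeholder field name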