Python: query performance on JSON vs sqlite? - python

I have a JSON file that has the following format:
{
"items": {
"item_1_name": { ...item properties... }
"item_2_name": { ...item properties... }
...
}
}
At last count, the JSON file held over 13K items and was nearly 75 MB on disk.
Now, I have a program that needs to query (read-only) data. Each query operation takes an item name and needs to read its properties. Each invocation of that program may involve from a few to several dozen query ops.
Naturally, loading the JSON file from disk and parsing it takes time and space: it takes 0.76 seconds to load and parse, and the parsed data takes 197 MB in memory. That means on each invocation of that program, I need to first wait nearly a second before it can do anything else with the results. I want to make the program respond faster.
So I have another approach: create a SQLite database file from that JSON file. Afterwards, the program needs to query against the database, instead of querying against the data directly parsed from the JSON file.
However, the SQLite approach has one drawback: unlike json.load(), it doesn't parse the whole file and keep it in memory, so (assuming a cache miss) each query may have to hit the disk. I'm not sure whether the time spent on disk I/O by the query ops would offset the benefit of skipping the JSON load.
So my question is: from your experience, is this use case suitable for SQLite?

I think this depends entirely on how you're querying the data. From the way you describe it, you're querying by an ID only, so you're not going to get the best of what SQLite has to offer by way of efficiencies. It should work just fine for your use case, but it would excel at returning all records matching a value, all records with values between two integers, etc. A third option worth considering is a minimal key/value store, such as a Python dictionary stored as a pickle or a really simple Redis service. Both of these will allow you to query by ID faster than reading a large JSON string.
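To make the SQLite option concrete, here is a minimal sketch of a by-name lookup. It assumes each item's properties are stored as a JSON string in one row keyed by item name; the table name `items`, the column names, and the helper functions `build_db` and `get_item` are illustrative choices, not something from the question.

```python
import json
import sqlite3

def build_db(items, db_path):
    """Write each item as one row: name -> JSON-encoded properties."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(name TEXT PRIMARY KEY, props TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO items (name, props) VALUES (?, ?)",
            ((name, json.dumps(props)) for name, props in items.items()),
        )
    return conn

def get_item(conn, name):
    """Look up one item by name; the primary key makes this an index lookup."""
    row = conn.execute(
        "SELECT props FROM items WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0]) if row is not None else None

# Tiny stand-in for the "items" object in the JSON file (hypothetical data).
sample = {"item_1_name": {"size": 1}, "item_2_name": {"size": 2}}
conn = build_db(sample, ":memory:")
print(get_item(conn, "item_2_name"))  # {'size': 2}
```

With a real file, `build_db` would run once as a conversion step, and each program invocation would only open the database and decode the handful of rows it asks for.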

Related

Python abstraction-layer for SQL

I'm currently using Python to store files in a JSON Database. However the JSON has started to become rather large, and inefficient (reading a 20MB file, changing one value, writing 20MB back to disk again, takes rather long)
So, I was thinking about switching to SQL (SQLite or MySQL), but I don't want to change my entire code. So far, I've been reading the JSON into lists/arrays and accessing them rather easily:
database["key"] = "NewValue"
But if I switched to SQL, I'd have to deal with long SQL queries (select from.....insert into....), apart from the entire overhead-stuff (connect, execute, etc.). That requires me to rewrite every single data-access in my code.
Is there a way (maybe some sort of wrapper) where I can just keep my existing code base and let the wrapper do the conversion for me in the background?
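One way such a wrapper might look is a dict-like class backed by SQLite, so that `database["key"] = "NewValue"` keeps working unchanged. This is only a sketch under assumptions: the class name `SqliteDict`, the table `kv`, and the choice to JSON-encode values are all made up for illustration.

```python
import json
import sqlite3
from collections.abc import MutableMapping

class SqliteDict(MutableMapping):
    """Dict-like view over a single SQLite key/value table (illustrative)."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS kv "
                "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
            )

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def __setitem__(self, key, value):
        # Only the changed row is written, not the whole data set.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

    def __delitem__(self, key):
        with self._conn:
            cur = self._conn.execute("DELETE FROM kv WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        for (key,) in self._conn.execute("SELECT key FROM kv"):
            yield key

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

database = SqliteDict(":memory:")
database["key"] = "NewValue"
print(database["key"])  # NewValue
```

The standard library's `shelve` module offers a similar persistent-dictionary interface and may be enough if SQL itself is not a requirement.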

Store MySql query results for faster reuse

I'm doing analysis on data from a MySql database in python. I query the database for about 200,000 rows of data, then analyze in python using Pandas. I will often do many iterations over the same data, changing different variables, parameters, and such. Each time I run the program, I query the remote database (about 10 second query), then discard the query results when the program finishes. I'd like to save the results of the last query in a local file, then check each time I run the program to see if the query is the same, then just use the saved results. I guess I could just write the Pandas dataframe to a csv, but is there a better/easier/faster way to do this?
If for any reason the MySQL Query Cache doesn't help, then I'd recommend saving the latest result set in either HDF5 or Feather format. Both formats are pretty fast. You may find some demos and tests here:
https://stackoverflow.com/a/37929007/5741205
https://stackoverflow.com/a/42750132/5741205
https://stackoverflow.com/a/42022053/5741205
Just use pickle to write the dataframe to a file, and to read it back out ("unpickle").
https://docs.python.org/3/library/pickle.html
This would be the "easy way".
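A minimal sketch of the pickle approach, including the "check whether the query is the same" step, could look like the following. The helper name `cached_query`, the cache file naming, and the `fake_run_query` stand-in are assumptions for illustration; a list of tuples is used here in place of the DataFrame, which can be pickled the same way.

```python
import hashlib
import os
import pickle
import tempfile

def cached_query(query, run_query, cache_dir):
    """Return the cached result for `query`, calling `run_query` only on a miss."""
    # The cache file name is derived from the query text, so a changed
    # query automatically misses the cache.
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = run_query(query)
    with open(path, "wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result

calls = []

def fake_run_query(query):
    """Stand-in for the slow remote query (hypothetical)."""
    calls.append(query)
    return [(1, "a"), (2, "b")]

cache_dir = tempfile.mkdtemp()
first = cached_query("SELECT * FROM t", fake_run_query, cache_dir)
second = cached_query("SELECT * FROM t", fake_run_query, cache_dir)
print(first == second, len(calls))  # True 1
```

Note that this never invalidates the cache when the remote data changes; deleting the cache directory is the simplest way to force a refresh.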

Extract particular fields from json in python

Say I have a lot of json lines to process and I only care about the specific fields in a json line.
{blablabla, 'whatICare': 1, blablabla}
{blablabla, 'whatICare': 2, blablabla}
....
Is there any way to extract whatICare from these JSON lines without loading them? Since the JSON lines are very long, it may be slow to build objects from them.
Not any reliable way without writing your own parsing code.
But check out ujson! It can be 10x faster than python's built in json library, which is a bit on the slow side.
No, you will have to load and parse the JSON before you know what’s inside and to be able to filter out the desired elements.
That being said, if you worry about memory, you could use ijson, which is an iterative parser. Instead of loading all the content at once, it is able to load only what’s necessary for the next iteration. So if your file contains an array of objects, you can load and parse one object at a time, reducing the memory impact (as you only need to keep one object in memory, plus the data you actually care about). But it won’t become faster, and it also won’t magically skip data you are not interested in.
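For input that really is one JSON document per line, a similar memory saving is possible with the standard library alone: parse one line at a time and keep only the field of interest. This is a sketch, not a speed-up; each line is still fully parsed. The function name `extract_field` and the sample data are made up, and the sample uses double quotes because single-quoted keys are not valid JSON.

```python
import io
import json

def extract_field(lines, field):
    """Yield `field` from each JSON line, discarding the rest of the object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # full parse of this one line only
        if field in record:
            yield record[field]

# In-memory stand-in for a file opened with open(path).
sample = io.StringIO(
    '{"a": 0, "whatICare": 1, "b": [1, 2]}\n'
    '{"a": 0, "whatICare": 2}\n'
)
values = list(extract_field(sample, "whatICare"))
print(values)  # [1, 2]
```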

How to improve a XML import into mongodb?

I have some large XML files (5GB ~ each) that I'm importing to a mongodb database. I'm using Expat to parse the documents, doing some data manipulation (deleting some fields, unit conversion, etc) and then inserting into the database. My script is based on this one: https://github.com/bgianfo/stackoverflow-mongodb/blob/master/so-import
My question is: is there a way to improve this with a batch insert? Storing these documents on an array before inserting would be a good idea? How many documents should I store before inserting, then? Writing the jsons into a file and then using mongoimport would be faster?
I appreciate any suggestion.
If you want to import XML into MongoDB, and Python is just what you have chosen so far but you are open to other approaches, then you might also do this with the following steps:
transforming the XML documents to CSV documents using XMLStarlet
transforming the CSVs to files containing JSONs using AWK
import the JSON files to MongoDB
XMLStarlet and AWK are both extremely fast and you are able to store your JSON objects using a non-trivial structure (sub-objects, arrays).
http://www.joyofdata.de/blog/transforming-xml-document-into-csv-using-xmlstarlet/
http://www.joyofdata.de/blog/import-csv-into-mongodb-with-awk-json/
Storing these documents on an array before inserting would be a good idea?
Yes, that's very likely. It reduces the number of round-trips to the database. You should monitor your system, it's probably idling a lot when inserting because of IO wait (that is, the overhead and thread synchronization is taking a lot more time than the actual data transfer).
How many documents should I store before inserting, then?
That's hard to say, because it depends on so many factors. Rule of thumb: 1,000 - 10,000. You will have to experiment a little. In older versions of mongodb, the entire batch must not be larger than the document size limit of 16MB.
Writing the jsons into a file and then using mongoimport would be faster?
No, unless your code has a flaw. That would mean you have to copy the data twice and the entire operation should be IO bound.
Also, it's a good idea to add all documents first, then add any indexes, not the other way around (because otherwise the index has to be updated on every insert).
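The batching advice above can be sketched as a small helper that groups parsed documents into fixed-size chunks and hands each chunk to an insert function. The names `batches`, `import_documents`, and `insert_batch` are placeholders; with PyMongo, `insert_batch` would typically be a collection's `insert_many` method, and the batch size of 1,000 is just the low end of the rule of thumb.

```python
from itertools import islice

def batches(documents, size):
    """Yield lists of up to `size` documents from any iterable."""
    it = iter(documents)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def import_documents(documents, insert_batch, size=1000):
    """Send documents to `insert_batch` one chunk at a time; return the total."""
    total = 0
    for chunk in batches(documents, size):
        insert_batch(chunk)  # one round-trip per chunk instead of per document
        total += len(chunk)
    return total

# Stand-in for the database: collect the chunks in a list.
received = []
docs = ({"n": i} for i in range(2500))
count = import_documents(docs, received.append, size=1000)
print(count, [len(c) for c in received])  # 2500 [1000, 1000, 500]
```

Because `documents` can be a generator fed by the XML parser, only one chunk needs to be held in memory at a time.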

is there a limit to the (CSV) filesize that a Python script can read/write?

I will be writing a little Python script tomorrow, to retrieve all the data from an old MS Access database into a CSV file first, and then after some data cleansing, munging etc, I will import the data into a mySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie), but I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csv.writer writes each row straight to the underlying file; it's not limited by available memory. You can process any number of records with these without memory limitations.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than more complicated formats like XML (tip: pulldom is painfully slow).
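As a rough illustration of iterating correctly, the sketch below streams rows from a reader to a writer one at a time, so memory use stays flat regardless of row count. The function name `clean_rows` and the whitespace-stripping "cleaning" step are placeholders for whatever munging is actually needed.

```python
import csv
import io

def clean_rows(src, dst):
    """Copy CSV rows from src to dst one at a time, stripping whitespace."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    count = 0
    for row in reader:  # only the current row is held in memory
        writer.writerow([cell.strip() for cell in row])
        count += 1
    return count

# In-memory stand-ins; real files should be opened with newline="".
src = io.StringIO("id,name\n1, alice \n2, bob \n")
dst = io.StringIO()
n = clean_rows(src, dst)
print(n)  # 3
```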
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure when you send the data to the new database, you're writing it a few records at a time; I've seen people do things where they try to load the entire file first, then write it.
