Python abstraction-layer for SQL - python

I'm currently using Python to store files in a JSON Database. However the JSON has started to become rather large, and inefficient (reading a 20MB file, changing one value, writing 20MB back to disk again, takes rather long)
So, I was thinking about switching to SQL (SQLite or Mysql), however I don't want to change my entire code. So far, I've been reading the JSON into lists/arrays and access them rather easily
database["key"] = "NewValue"
But if I switched to SQL, I'd have to deal with long SQL queries (select from.....insert into....), apart from the entire overhead-stuff (connect, execute, etc.). That requires me to rewrite every single data-access in my code.
Is there a way (maybe some sort of wrapper), where I can just keep my existing code-base, and let the wrapper the conversion for me in the background?

Related

Python: query performance on JSON vs sqlite?

I have a JSON file that has the following format:
{
"items": {
"item_1_name": { ...item properties... }
"item_2_name": { ...item properties... }
...
}
}
On my last count, there can be over 13K items stored in the JSON file, and the file itself is nearly 75MB on disk.
Now, I have a program that needs to query (read-only) data. Each query operation takes an item name and needs to read its properties. Each invocation of that program may involve from a few to several dozen query ops.
Naturally, loading the JSON file from disk and parsing it takes time and space: it takes 0.76 seconds to load and parse, and the parsed data takes 197 MB in memory. That means on each invocation of that program, I need to first wait nearly a second before it can do anything else with the results. I want to make the program respond faster.
So I have another approach: create a SQLite database file from that JSON file. Afterwards, the program needs to query against the database, instead of querying against the data directly parsed from the JSON file.
However, the SQLite approach has one drawback: unlike json.load(), it doesn't parse the whole file and keep it around in memory (assuming cache miss), and I'm not sure if the time spent on disk IO encountered by the query ops may offset the benefit of not using the JSON approach.
So my question is: from your experience, is this use case suitable for SQLite?
I think this depends entirely on how you're querying the data. From the way you describe it, you're querying by an ID only, so you're not going to get the best of what sqlite has to offer by way of efficiencies. It should work just fine for your use case, but it would excel at returning all records matching a value, all record with values between two integers, etc. A third option worth considering is a minimal key/value store such as a python dictionary stored as a pickle or a really simple redis service. Both of these will allow to query by ID faster than reading a large json string.

Somthing wrong with using CSV as database for a webapp?

I am using Flask to make a small webapp to manage a group project, in this website I need to manage attendances, and also meetings reports. I don't have the time to get into SQLAlchemy, so I need to know what might be the bad things about using CSV as a database.
Just don't do it.
The problem with CSV is …
a, concurrency is not possible: What this means is that when two people access your app at the same time, there is no way to make sure that they don't interfere with each other, making changes to each other's data. There is no way to solve this with when using a CSV file as a backend.
b, speed: Whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file is eating up both memory and time.
Databases were made to solve this issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for are ORM - Object-relational mapping - who translate Python code into SQL and manage the SQL databases for you.
PeeweeORM and PonyORM. Both are easy to use and translate all SQL into Python and vice versa. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend PeeweeORM. You can start using SQLite as a backend with Peewee, or if your app grows larger, you can plug in MySQL or PostGreSQL easily.
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with that as long as you abstract away from it. I.e. make sure you have a clean separation between what you write and how you implement i . That will bloat your code a bit, but it will make sure you can swap your CSV storage in a matter of days.
I.e. pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in you flask app. Use initPersistence(). Don't write "csvFile.appendRecord()". Use "persister.saveNewReport()". When and if you actually realise CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits like you don't have to use a mock library in tests to make them faster. You just provide another persister.
I am absolutely baffled by how many people discourage using CSV as an database storage back-end format.
Concurrency: There is NO reason why CSV can not be used with concurrency. Just like how a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same binary file. Databases can do EXACTLY the same thing with CSV files. Just as a journal is used to maintain the atomic nature of individual transactions, the same exact thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when the database can do what it does for ALL other database storage formats, look up the starting byte of a record in an index file and SEEK to it in constant time and overwrite the data and comment out anything left over and record the free space for latter use in a separate index file, just like a database could zero out the bytes of any unneeded areas of a binary "row" and record the free space in a separate index file... I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments... etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spread sheet program, but that is no different than having to re-index a binary database after the index/table gets corrupted/deleted/out-of-sync/etc./etc..

How do I use Python with a database?

I wrote a Python program that handles very large data. As it processes the data, it puts the processed data into an array, which easily grows to hundreds of megabytes or even over a gigabyte.
The reason I set it like that is because Python needs to continuously access the data in the array. Because the array gets larger and larger, the process is easily prone to error and very slow.
Is there a way to have the array-like database stored on a different file or database module and access it on a as-needed basis?
Perhaps this is a very basic task, but I have no clue.
You can use sqlite3 if you want. it is part of the python
packages, it is simpler for basic usage.
MySQL for python
Postgres for python

What is the best way to loop through and process a large (10GB+) text file?

I need to loop through a very large text file, several gigabytes in size (a zone file to be exact). I need to run a few queries for each entry in the zone file, and then store the results in a searchable database.
My weapons of choice at the moment, mainly because I know them, are Python and MySQL. I'm not sure how well either will deal with files of this size, however.
Does anyone with experience in this area have any suggestions on the best way to open and loop through the file without overloading my system? How about the most efficient way to process the file once I can open it (threading?) and store the processed data?
You shouldn't have any real trouble storing that amount of data in MySQL, although you will probably not be able to store the entire database in memory, so expect some IO performance issues. As always, make sure you have the appropriate indices before running your queries.
The most important thing is to not try to load the entire file into memory. Loop through the file, don't try to use a method like readlines which will load the whole file at once.
Make sure to batch the requests. Load up a few thousand lines at a time and send them all in one big SQL request.
This approach should work:
def push_batch(batch):
# Send a big INSERT request to MySQL
current_batch = []
with open('filename') as f:
for line in f:
batch.append(line)
if len(current_batch) > 1000:
push_batch(current_batch)
current_batch = []
push_batch(current_batch)
Zone files are pretty normally formatted, consider if you can get away with just using LOAD DATA INFILE. You might also consider creating a named pipe, pushing partially formatted data in to it from python, and using LOAD DATA INFILE to read it in with MySQL.
MySQL has some great tips on optimizing inserts, some highlights:
Use multiple value lists in each insert statement.
Use INSERT DELAYED, particularly if you are pushing from multiple clients at once (e.g. using threading).
Lock your tables before inserting.
Tweak the key_buffer_size and bulk_insert_buffer_size.
The fastest processing will be done in MySQL, so consider if you can get away with doing the queries you need after the data is in the db, not before. If you do need to do operations in Python, threading is not going to help you. Only one thread of Python code can execute at a time (GIL), so unless you're doing something which spends a considerable amount of time in C, or interfaces with external resources, you're only going to ever be running in one thread anyway.
The most important optimization question is what is bounding the speed, there's no point spinning up a bunch of threads to read the file, if the database is the bounding factor. The only way to really know is to try it and make tweaks until it is fast enough for your purpose.
#Zack Bloom's answer is excellent and I upvoted it. Just a couple of thoughts:
As he showed, just using with open(filename) as f: / for line in f is all you need to do. open() returns an iterator that gives you one line at a time from the file.
If you want to slurp every line into your database, do it in the loop. If you only want certain lines that match a certain regular expression, that's easy.
import re
pat = re.compile(some_pattern)
with open(filename) as f:
for line in f:
if not pat.search(line):
continue
# do the work to insert the line here
With a file that is multiple gigabytes, you are likely to be I/O bound. So there is likely no reason to worry about multithreading or whatever. Even running several regular expressions is likely to crunch through the data faster than the file can be read or the database updated.
Personally, I'm not much of a database guy and I like using an ORM. The last project I did database work on, I was using Autumn with SQLite. I found that the default for the ORM was to do one commit per insert, and it took forever to insert a bunch of records, so I extended Autumn to let you explicitly bracket a bunch of inserts with a single commit; it was much faster that way. (Hmm, I should extend Autumn to work with a Python with statement, so that you could wrap a bunch of inserts into a with block and Autumn would automatically commit.)
http://autumn-orm.org/
Anyway, my point was just that with database stuff, doing things the wrong way can be very slow. If you are finding that the database inserting is your bottleneck, there might be something you can do to fix it, and Zack Bloom's answer contains several ideas to start you out.

is there a limit to the (CSV) filesize that a Python script can read/write?

I will be writing a little Python script tomorrow, to retrieve all the data from an old MS Access database into a CSV file first, and then after some data cleansing, munging etc, I will import the data into a mySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSv because it is quite simple and straightforward (and I am a Python newbie) - but
I would like to hear from someone who may have done something similar before.
Memory usage for csvfile.reader and csvfile.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csvfile.writer writes directly to disk; it's not limited by available memory. You can process any number of records with these without memory limitations.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than more complicated formats like XML (tip: pulldom is painfully slow).
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure when you send the data to the new database, you're writing it a few records at a time; I've seen people do things where they try to load the entire file first, then write it.

Categories