I wrote a Python program that handles very large data. As it runs, it appends the processed data to an array, which easily grows to hundreds of megabytes or even over a gigabyte.
I set it up that way because the program needs to access the data in the array continuously. But as the array gets larger and larger, the process becomes error-prone and very slow.
Is there a way to store the array-like data in a separate file or database module and access it on an as-needed basis?
Perhaps this is a very basic task, but I have no clue.
You can use sqlite3 if you want. It is part of the Python standard library and is the simplest option for basic usage (a short example is below).
MySQL for Python and PostgreSQL for Python are also options if you need a client-server database.
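For example, a minimal sqlite3 session might look like this (the table and column names are just placeholders):

    import sqlite3

    # Open (or create) a database file; use ":memory:" for a throwaway in-memory DB.
    conn = sqlite3.connect("processed_data.db")
    cur = conn.cursor()

    # A simple table for processed records (the schema is only an illustration).
    cur.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY, value REAL)")

    # Insert rows one at a time, or in bulk with executemany().
    cur.executemany("INSERT INTO results (value) VALUES (?)", [(1.5,), (2.7,), (3.9,)])
    conn.commit()

    # Fetch only the rows you need instead of keeping everything in a Python list.
    cur.execute("SELECT value FROM results WHERE value > ?", (2.0,))
    for (value,) in cur:
        print(value)

    conn.close()

This keeps the working set on disk, so memory use stays flat no matter how large the result set grows.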
I'm currently using Python to store data in a JSON database. However, the JSON has started to become rather large and inefficient (reading a 20 MB file, changing one value, and writing 20 MB back to disk again takes rather long).
So, I was thinking about switching to SQL (SQLite or MySQL); however, I don't want to change my entire code. So far, I've been reading the JSON into lists/arrays and accessing them rather easily:
database["key"] = "NewValue"
But if I switched to SQL, I'd have to deal with long SQL queries (SELECT ... FROM ..., INSERT INTO ...), on top of all the overhead (connect, execute, etc.). That would require me to rewrite every single data access in my code.
Is there a way (maybe some sort of wrapper) where I can just keep my existing code base and let the wrapper do the conversion for me in the background?
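Third-party packages such as sqlitedict aim to do roughly this. As an illustration, a minimal dict-style wrapper over sqlite3 might look like the following sketch (the class and table names are made up):

    import sqlite3

    class SqlDict:
        """Minimal dict-like key/value store backed by SQLite (illustrative only)."""

        def __init__(self, path):
            self.conn = sqlite3.connect(path)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
            )

        def __getitem__(self, key):
            row = self.conn.execute(
                "SELECT value FROM kv WHERE key = ?", (key,)
            ).fetchone()
            if row is None:
                raise KeyError(key)
            return row[0]

        def __setitem__(self, key, value):
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
            )
            self.conn.commit()

    # Existing code keeps working almost unchanged:
    database = SqlDict("app.db")
    database["key"] = "NewValue"
    print(database["key"])

Only the touched rows are read or written, so there is no 20 MB rewrite on every change.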
I have a Python script that needs to read a huge file into a variable and then search through it and do other things with it.
The problem is that the web server calls this script multiple times, and every time there is a latency of around 8 seconds while the file loads.
Is it possible to make the file persist in memory so it can be accessed faster at later times?
I know I can run the script as a service using supervisor, but I can't do that for this.
Any other suggestions, please?
PS: I am already using var = pickle.load(open(file))
You should take a look at http://docs.h5py.org/en/latest/. It lets you perform various operations on huge files without loading them entirely into memory. It's what NASA uses.
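For example, a short h5py sketch (the file and dataset names are placeholders):

    import h5py
    import numpy as np

    # Write a large array to disk once.
    with h5py.File("big_data.h5", "w") as f:
        f.create_dataset("values", data=np.random.rand(10_000_000))

    # Later, read only the slice you need; the whole array never enters memory.
    with h5py.File("big_data.h5", "r") as f:
        chunk = f["values"][1_000_000:1_001_000]
        print(chunk.mean())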
Not an easy problem. I assume you can do nothing about the fact that your web server calls your application multiple times. In that case I see two solutions:
(1) Write TWO separate applications. The first application, A, loads the large file and then just sits there, waiting for the other application to access the data. "A" provides access as required, so it's basically a sort of custom server. The second application, B, is the one that gets called multiple times by the web server. On each call, it extracts the necessary data from A using some form of interprocess communication; this ought to be relatively fast. The Python standard library offers some tools for interprocess communication (socket, http server), but they are rather low-level; a rough sketch of this pattern follows below. Alternatives are almost certainly going to be operating-system dependent.
(2) Perhaps you can pre-digest or pre-analyze the large file, writing out a more compact file that can be loaded quickly. A similar idea is suggested by tdelaney in his comment (some sort of database arrangement).
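As a sketch of option (1) using the standard library's multiprocessing.managers (the file name, port, and authkey are placeholders):

    # server.py - load the big file once and serve lookups to other processes.
    import pickle
    from multiprocessing.managers import BaseManager

    class DataStore:
        def __init__(self, path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)   # the slow load happens only once

        def lookup(self, key):
            return self.data[key]

    class StoreManager(BaseManager):
        pass

    if __name__ == "__main__":
        store = DataStore("big_file.pkl")
        StoreManager.register("get_store", callable=lambda: store)
        manager = StoreManager(address=("localhost", 50000), authkey=b"secret")
        manager.get_server().serve_forever()

    # client.py - run by the web server; connects instead of reloading the file.
    from multiprocessing.managers import BaseManager

    class StoreManager(BaseManager):
        pass

    StoreManager.register("get_store")
    manager = StoreManager(address=("localhost", 50000), authkey=b"secret")
    manager.connect()
    store = manager.get_store()          # proxy to the object living in server.py
    print(store.lookup("some_key"))

The 8-second load is paid once when server.py starts; each web request then only pays the cost of a small round trip.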
You are talking about memory-caching a large array, essentially…?
There are three fairly viable options for large arrays:
use memory-mapped arrays
use h5py or pytables as a back-end
use an array caching-aware package like klepto or joblib.
Memory-mapped arrays index the array on disk as if it were in memory (see the short numpy.memmap sketch below).
h5py or pytables give you fast access to arrays on disk, and can also avoid loading the entire array into memory. klepto and joblib can store arrays as a collection of "database" entries (typically a directory tree of files on disk), so you can load portions of the array into memory easily. Each has a different use case, so the best choice for you depends on what you want to do. (I'm the klepto author, and it can use SQL database tables as a backend instead of files.)
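As an example of the memory-mapped option (the array shape and file name are arbitrary):

    import numpy as np

    # Create a large array backed by a file instead of RAM.
    arr = np.memmap("results.dat", dtype="float64", mode="w+", shape=(100_000_000,))
    arr[5_000_000:5_000_010] = 42.0     # writes go to the file
    arr.flush()

    # Reopen later in read mode; only the pages you touch are pulled into memory.
    arr2 = np.memmap("results.dat", dtype="float64", mode="r", shape=(100_000_000,))
    print(arr2[5_000_000:5_000_010])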
I am using Flask to make a small webapp to manage a group project. In this site I need to manage attendance and also meeting reports. I don't have the time to get into SQLAlchemy, so I need to know what the downsides of using CSV as a database might be.
Just don't do it.
The problem with CSV is …
a) Concurrency is not possible: when two people access your app at the same time, there is no way to make sure that they don't interfere with each other by changing each other's data. There is no way to solve this when using a CSV file as a backend.
b) Speed: whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree, however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM - object-relational mapping - which translates Python code into SQL and manages the SQL database for you.
Two good candidates are Peewee and Pony ORM. Both are easy to use and translate between Python and SQL for you. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start with SQLite as a backend, and if your app grows larger, you can plug in MySQL or PostgreSQL easily.
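A small Peewee sketch for the attendance use case (the model and field names are invented for illustration):

    from peewee import SqliteDatabase, Model, CharField, DateField, BooleanField

    db = SqliteDatabase("project.db")

    class Attendance(Model):
        member = CharField()
        date = DateField()
        present = BooleanField(default=False)

        class Meta:
            database = db

    db.connect()
    db.create_tables([Attendance])

    # No raw SQL: create and query with plain Python.
    Attendance.create(member="Alice", date="2024-03-01", present=True)
    absentees = Attendance.select().where(Attendance.present == False)
    for record in absentees:
        print(record.member, record.date)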
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, Python's shelve module, etc. The options available in the standard library are summarised in the Data Persistence chapter of the Python documentation.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as Postgres, for which there are a number of Python libraries.
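For instance, shelve from the standard library gives you a persistent dict with almost no setup (keep in mind it has the same concurrency caveats as CSV):

    import shelve

    # Acts like a dict, but entries are persisted to disk.
    with shelve.open("meetings") as db:
        db["2024-03-01"] = {"attendees": ["Alice", "Bob"], "notes": "kickoff"}

    with shelve.open("meetings") as db:
        print(db["2024-03-01"]["notes"])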
I think there's nothing wrong with that as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will make sure you can swap out your CSV storage in a matter of days.
I.e. pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in your Flask app; use initPersistence(). Don't write "csvFile.appendRecord()"; use "persister.saveNewReport()". When and if you actually realise CSV is a bottleneck, you can just write a new persister plugin.
There are added benefits: for example, you don't have to use a mock library in tests to make them faster. You just provide another persister.
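As a sketch of that separation (all the names here, such as ReportPersister and save_new_report, are hypothetical):

    import csv

    class ReportPersister:
        """The interface your Flask code talks to; it never mentions CSV."""

        def save_new_report(self, report):
            raise NotImplementedError

    class CsvPersister(ReportPersister):
        def __init__(self, path):
            self.path = path

        def save_new_report(self, report):
            with open(self.path, "a", newline="") as f:
                csv.writer(f).writerow([report["date"], report["text"]])

    # Later, a SqlitePersister (or anything else) can replace this single line:
    persister = CsvPersister("reports.csv")
    persister.save_new_report({"date": "2024-03-01", "text": "kickoff meeting"})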
I am absolutely baffled by how many people discourage using CSV as a database storage back-end format.
Concurrency: There is NO reason why CSV cannot be used with concurrency. Just as a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same binary file, databases can do EXACTLY the same thing with CSV files. And just as a journal is used to maintain the atomic nature of individual transactions, the exact same thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when it can do what it does for ALL other storage formats: look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, comment out anything left over, and record the freed space for later use in a separate index file, just as it would zero out the bytes of any unneeded area of a binary "row" and record the free space in a separate index file (a rough sketch of this indexing idea is below). I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be read quickly in an emergency... which is the primary reason I do not EVER use binary storage for essential data that should remain accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you changed it in a spreadsheet program, but that is no different from having to re-index a binary database after the index/table gets corrupted/deleted/out-of-sync/etc.
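To make the seek-with-an-index idea concrete, here is a rough sketch (not production code, and the file name is a placeholder): build a byte-offset index for a CSV file once, then jump straight to any record without rereading the whole file.

    import csv

    def build_offset_index(path):
        """Map record number -> byte offset of the start of that line."""
        index = {}
        with open(path, "rb") as f:
            offset = f.tell()
            for i, line in enumerate(iter(f.readline, b"")):
                index[i] = offset
                offset = f.tell()
        return index

    def read_record(path, index, record_no):
        with open(path, "rb") as f:
            f.seek(index[record_no])            # constant-time jump
            line = f.readline().decode("utf-8")
        return next(csv.reader([line]))

    idx = build_offset_index("data.csv")
    print(read_record("data.csv", idx, 1234))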
I have a few million documents. What I am trying to do is simple: process the documents to extract the information I need and load it into a database. I am doing it in Python and using SQLAlchemy. I am also using multiprocessing to make use of all the cores on my machine. The documents are XML with huge chunks of text. The database is MySQL with a custom relational schema defined.
However, it runs very slowly and loads only about 50k documents in 6-7 hours.
Is there any way that I can speed this task up?
Sometimes an RDBMS is not the answer. One sign of such a situation is when your data items have no relations to one another, for example, when every document stands by itself.
If you'd like to make some unstructured data searchable, consider building a searchable index using PyLucene.
Or maybe put the data in a non-relational database like MongoDB.
In any case, try to identify which part of your system is slowing down the process. My guess would be the database or the file system; if it is MySQL, all you can do is throw more hardware at it.
Another way to optimize a system that uses I/O extensively is to switch to async programming with a library like Twisted, but that has a learning curve, so make 100% sure it's needed first.
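If you try the document-database route, a minimal pymongo sketch might look like this (the collection and field names are placeholders):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    collection = client.corpus.documents

    # Each extracted XML document becomes one self-contained record;
    # insert_many() batches the round trips to the server.
    batch = [
        {"doc_id": 1, "title": "First document", "body": "huge chunk of text..."},
        {"doc_id": 2, "title": "Second document", "body": "another chunk..."},
    ]
    collection.insert_many(batch)

    print(collection.count_documents({}))

Inserting in batches from each worker process avoids one round trip per document, which is often where the time goes.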
In my program I have a method which requires about 4 files to be opened each time it is called, as I need to take some data from them. I have been storing all the data from the files in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way to handle these files? And is storing all the data in lists time-consuming? What are better alternatives to a list?
I could give some code, but my previous question was closed because the code only confused everyone; it is part of a big program and would need to be explained completely to be understood. So I am not giving any code here; please suggest approaches, treating this as a general question...
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static and relatively small. Then the 10k calls will read an in-memory cache rather than a file, which is much faster (see the lru_cache sketch below).
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of this data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
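Still, as a sketch of the in-memory cache idea using functools.lru_cache (the loader function and file name are illustrative):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def load_lines(path):
        """Read the file once; repeated calls with the same path hit the cache."""
        with open(path) as f:
            return f.read().splitlines()

    def process(record_id):
        data = load_lines("lookup_table.txt")   # instant after the first call
        return data[record_id]

    for i in range(10_000):
        process(i % 100)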
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on the database. Then it will be very fast to make simple queries against your data. You don't need a lot of work to set up a database. You can create an SQLite database without requiring a separate process and it doesn't have a complicated installation process.
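For example, once the data is in SQLite, an index makes repeated lookups cheap (the schema here is only an illustration):

    import sqlite3

    conn = sqlite3.connect("lookup.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, payload TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_key ON records (key)")
    conn.executemany(
        "INSERT INTO records (key, payload) VALUES (?, ?)",
        [("a", "first"), ("b", "second")],
    )
    conn.commit()

    # With the index in place, each of the 10,000 lookups is a quick seek,
    # not a scan of the whole table.
    row = conn.execute("SELECT payload FROM records WHERE key = ?", ("b",)).fetchone()
    print(row[0])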
Open the files in the method that calls the one you want to run, and pass the data to that method as parameters.
If the files are structured like configuration files, it might be good to use the configparser library; otherwise, if you have some other structured format, I think it would be better to store all this data in JSON or XML and perform the necessary operations on that data.
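For the configuration-file case, a short configparser sketch (the section and option names are made up):

    import configparser

    config = configparser.ConfigParser()
    config.read("settings.ini")

    # Values are read on demand instead of being copied into lists by hand.
    host = config.get("server", "host", fallback="localhost")
    port = config.getint("server", "port", fallback=8080)
    print(host, port)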