Better way to store a set of files with arrays? - python

I've accumulated a set of 500 or so files, each of which has an array and header that stores metadata. Something like:
2,.25,.9,26 #<-- header, which is actually cryptic metadata
1.7331,0
1.7163,0
1.7042,0
1.6951,0
1.6881,0
1.6825,0
1.678,0
1.6743,0
1.6713,0
I'd like to read these arrays into memory selectively. We've built a GUI that lets users select one or more files from disk, and each selected file is then read into the program. If users want to read in all 500 files, the program is slow because it opens and closes each file in turn. So my question is: will it speed up my program to store all of these in a single structure, something like hdf5? Ideally, that structure would offer faster access than the individual files. What is the best way to go about this? I haven't dealt with these kinds of considerations before. What's the best way to remove this bottleneck in Python? The total data is only a few megabytes, so I'd even be amenable to keeping it inside the program somewhere, not just on disk (but I don't know how to do that).

Reading 500 files in Python should not take much time, since the overall size is only a few MB. The data structure in each file is plain and simple, so I'd guess parsing will not take much time either.
If the actual slowness comes from opening and closing the files, there may be an OS-related issue (it may have very poor I/O).
Have you timed how long it takes to read all the files?
You could also try a small database such as SQLite, where you can store your file data and access just the records you need on the fly.
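If you do decide to consolidate, here is a minimal sketch of the HDF5 approach, assuming the h5py package and comma-separated files with a single metadata header line; the file and dataset names are placeholders, not anything from your project:

import glob
import numpy as np
import h5py

# Build one HDF5 file holding every array plus its header metadata.
with h5py.File("combined.h5", "w") as out:
    for path in glob.glob("data/*.txt"):        # placeholder location
        with open(path) as f:
            header = f.readline().strip()       # e.g. "2,.25,.9,26"
        data = np.loadtxt(path, delimiter=",", skiprows=1)
        dset = out.create_dataset(path, data=data)
        dset.attrs["header"] = header           # keep the metadata with the array

# The GUI then opens a single file once and reads only the selected datasets:
with h5py.File("combined.h5", "r") as store:
    arr = store["data/somefile.txt"][...]       # hypothetical dataset name
    meta = store["data/somefile.txt"].attrs["header"]

Whether this is faster than 500 small files depends mostly on your filesystem, so it is worth timing both, as the answer above suggests.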

Related

Instant access to line from a large file without loading the file

In one of my recent projects I need to perform this simple task but I'm not sure what is the most efficient way to do so.
I have several large text files (>5 GB each) and I need to continuously extract random lines from those files. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines a second), and preferably with as little pre-processing as possible.
The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a little pre-processing I can make all lines the same length (though the perfect solution would not require pre-processing).
I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).
The next solution I thought about is to create some kind of index. I found this solution, but it's very outdated, so it needs some work to get running, and even then I'm not sure whether the overhead created while processing the index file won't slow things down to the time scale of the solutions above.
Another solution is converting the file into a binary file and then getting instant access to lines that way. For this solution I couldn't find any Python package that supports binary-text work, and I feel that creating a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.
The final solution I thought about is using some kind of database (sqlite in my case) which will require transferring the lines into a database and loading them this way.
Note: I will also load thousands of (random) lines each time, therefore solutions which work better for groups of lines will have an advantage.
Thanks in advance,
Art.
As said in the comments, I believe using hdf5 would be a good option.
This answer shows how to read that kind of file.
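For the index idea mentioned in the question, here is a rough sketch in plain Python (no external packages; the file name is a placeholder). One pass over the file records the byte offset of every line, and the index itself, one integer per line, fits in memory even when the 5 GB file does not:

import random

def build_index(path):
    # One sequential pass; records where every line starts.
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_lines(path, offsets, line_numbers):
    # Fetch a whole batch per call; sorted access is kinder to the disk.
    out = []
    with open(path, "rb") as f:
        for n in sorted(line_numbers):
            f.seek(offsets[n])
            out.append(f.readline().rstrip(b"\r\n"))
    return out

# Hypothetical usage:
# idx = build_index("big_file.txt")
# batch = read_lines("big_file.txt", idx, random.sample(range(len(idx)), 1000))

The offsets list can be pickled once so that later runs skip the initial scan.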

Is it faster to read large amounts of data from multiple .txt files or from a dictionary?

I am working on a personal project (using Python 3) that will retrieve weather information for any city in the United States. My program prompts the user to enter as many city-state combinations as they wish, and then it retrieves the weather information and creates a weather summary for each city entered. Behind the scenes, I'm essentially taking the State entered by the user, opening a .txt file corresponding to that State, and then getting a weather code that is associated with the city entered, which I then use in a URL request to find weather information for the city. Since I have a .txt file for every state, I have 50 .txt files, each with a large number of city-weather code combinations.
Would it be faster to keep my algorithm the way that it currently is, or would it be faster to keep all of this data in a dictionary? This is how I was thinking about storing the data in a dictionary:
info = {'Virginia':{'City1':'ID1','City2':'ID2'},'North Carolina':{'City3':'ID3'}}
I'd be happy to provide some of my code or elaborate if necessary.
Thanks!
If you have a large data file, you will spend days sifting through it and putting the values into the .py file. If it is a small file, I would use a dictionary; if it is a large file, stick with the .txt files.
Other possible solutions are:
sqlite
pickle
shelve (see the sketch after the resource links below)
Other Resources
Basic data storage with Python
https://docs.python.org/3/library/persistence.html
https://docs.python.org/3/library/pickle.html
https://docs.python.org/3/library/shelve.html
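To make the shelve option concrete, here is a minimal sketch; it assumes each state file holds one "City,ID" pair per line, and the file paths and shelf name are placeholders:

import glob
import os
import shelve

def build_shelf():
    # One-time conversion: one shelf key per state, value is that state's dict.
    with shelve.open("weather_codes") as db:
        for path in glob.glob("states/*.txt"):
            state = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                db[state] = dict(line.strip().split(",", 1)
                                 for line in f if line.strip())

def lookup(state, city):
    # Only the requested state's dict is loaded, not all 50 files.
    with shelve.open("weather_codes") as db:
        return db[state][city]

# build_shelf()                  # run once
# lookup("Virginia", "City1")    # -> 'ID1'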
It almost certainly would be much faster to preload the data from the files, if you're using the same python process for many user requests. If the process handles just one request and exits, this approach would be slower and use more memory. For some number of requests between "one" and "many", they'd be about equal on speed.
For a situation like this I would probably use sqlite, for which python has built-in support. It would be much faster than scanning text files without the time and memory overhead of loading the full dictionary.
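A rough sketch of that sqlite route with the built-in sqlite3 module; the table, column, and file names are made up for illustration:

import sqlite3

conn = sqlite3.connect("weather_codes.db")
conn.execute("""CREATE TABLE IF NOT EXISTS codes
                (state TEXT, city TEXT, code TEXT,
                 PRIMARY KEY (state, city))""")

# One-time load; in practice the rows would be parsed from the 50 .txt files.
rows = [("Virginia", "City1", "ID1"), ("North Carolina", "City3", "ID3")]
conn.executemany("INSERT OR REPLACE INTO codes VALUES (?, ?, ?)", rows)
conn.commit()

# Per-request lookup: hits the primary-key index, no text files are scanned.
cur = conn.execute("SELECT code FROM codes WHERE state = ? AND city = ?",
                   ("Virginia", "City1"))
print(cur.fetchone())   # ('ID1',)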
It is probably not a very good idea to have a large number of text files, because access slows down with large or numerous directories. But if you have large data records, you might choose an intermediate solution: index one data file and load the index into a dictionary.

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them and only see the error message minutes later, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
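One way that workflow can look, sketched under the assumption that the script is refactored into an importable module (here called mymodule, with hypothetical read_big_files and process functions) and run inside an interactive session so the loaded data survives between edits:

# Typed in an interactive session (python -i or IPython), not run as a script.
import importlib
import mymodule                       # placeholder for your own module

data = mymodule.read_big_files()      # the slow step, done once per session

result = mymodule.process(data)       # the code you are still debugging

# ...edit mymodule.py, then, without restarting or re-reading the files:
importlib.reload(mymodule)
result = mymodule.process(data)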
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up with any large amount of data? If it shows up only with particular files, then most probably it is related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is simply the amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the files? Is it just hard-drive performance, or do you do some heavy processing of the data in your script before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load the processed data instead of redoing the processing every time.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
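The "process once, then load the processed result" suggestion above can be as simple as a pickle cache; here is a minimal sketch, where parse_big_files is a placeholder for whatever slow reading the script really does:

import os
import pickle

CACHE = "parsed_data.pkl"             # placeholder cache file

def parse_big_files():
    # Placeholder for the real, slow reading and parsing.
    return {"example": [1, 2, 3]}

def load_data():
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)     # fast path on repeated debug runs
    data = parse_big_files()
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)          # pay the cost once
    return data

Delete the cache file whenever the input files or the parsing code change.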

How to improve an XML import into mongodb?

I have some large XML files (5GB ~ each) that I'm importing to a mongodb database. I'm using Expat to parse the documents, doing some data manipulation (deleting some fields, unit conversion, etc) and then inserting into the database. My script is based on this one: https://github.com/bgianfo/stackoverflow-mongodb/blob/master/so-import
My question is: is there a way to improve this with a batch insert? Would storing these documents in an array before inserting be a good idea? How many documents should I store before inserting, then? Would writing the JSONs to a file and then using mongoimport be faster?
I appreciate any suggestion.
If you want to import XML into MongoDB, and Python is just what you have chosen so far to get the job done but you are open to other approaches, you might also perform this with the following steps:
transforming the XML documents to CSV documents using XMLStarlet
transforming the CSVs to files containing JSONs using AWK
importing the JSON files into MongoDB
XMLStarlet and AWK are both extremely fast and you are able to store your JSON objects using a non-trivial structure (sub-objects, arrays).
http://www.joyofdata.de/blog/transforming-xml-document-into-csv-using-xmlstarlet/
http://www.joyofdata.de/blog/import-csv-into-mongodb-with-awk-json/
Would storing these documents in an array before inserting be a good idea?
Yes, that's very likely. It reduces the number of round-trips to the database. You should monitor your system; it is probably idling a lot during inserts because of I/O wait (that is, the overhead and thread synchronization take far more time than the actual data transfer).
How many documents should I store before inserting, then?
That's hard to say, because it depends on so many factors. Rule of thumb: 1,000 - 10,000. You will have to experiment a little. In older versions of mongodb, the entire batch must not be larger than the document size limit of 16MB.
Would writing the JSONs to a file and then using mongoimport be faster?
No, unless your code has a flaw. That would mean you have to copy the data twice and the entire operation should be IO bound.
Also, it's a good idea to add all the documents first and then create any indexes, not the other way around (because otherwise the index has to be updated with every insert).
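A rough sketch of the batching advice with pymongo; transform, the database and collection names, and the batch size are placeholders to adapt to the existing Expat-based script:

from pymongo import MongoClient

client = MongoClient()                # local mongod assumed
coll = client.mydb.mycollection       # placeholder names

BATCH_SIZE = 1000                     # rule-of-thumb starting point; experiment
batch = []

def transform(raw):
    # Placeholder for the real field deletion / unit conversion.
    return raw

def handle_record(raw):
    # Call this once per parsed XML record (e.g. from the Expat handlers).
    batch.append(transform(raw))
    if len(batch) >= BATCH_SIZE:
        coll.insert_many(batch, ordered=False)
        batch.clear()

def flush():
    # Call once at the end so the last partial batch is not lost.
    if batch:
        coll.insert_many(batch, ordered=False)
        batch.clear()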

how to speed up the code?

In my program I have a method which requires about 4 files to be opened each time it is called, because I need to take some data from them. I have been storing all the data from those files in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way of handling these files? And is storing the whole data in lists time-consuming; what are better alternatives to lists?
I can give some code, but my previous question was closed because the code only confused everyone (it is part of a big program and would need to be explained completely to understand), so I am not giving any code here. Please suggest approaches, treating this as a general question.
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static, and relatively small. Then, the 10k calls will read an in-memory cache rather than a file. Much faster.
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of this kind of data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
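One simple form of that in-memory cache, sketched with functools.lru_cache; the parsing shown (one float per line) is only a placeholder for your real file format:

from functools import lru_cache

@lru_cache(maxsize=None)
def load_file(path):
    # Runs once per distinct path; later calls return the cached result.
    with open(path) as f:
        # A tuple, so the cached value cannot be modified by accident.
        return tuple(float(line) for line in f if line.strip())

def my_method(paths):
    data = [load_file(p) for p in paths]   # the disk is touched only on first use
    ...                                     # work with the data as before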
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on the database. Then it will be very fast to make simple queries against your data. You don't need a lot of work to set up a database. You can create an SQLite database without requiring a separate process and it doesn't have a complicated installation process.
Open the files in the method that calls the one you want to run, and pass the data as parameters to that method.
If the files are structured like configuration files, it might be good to use the ConfigParser library; otherwise, if you have another structured format, I think it would be better to store all this data in JSON or XML and perform any necessary operations on it.
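For completeness, a tiny sketch of the ConfigParser suggestion; the file, section, and option names are all made up:

import configparser

cfg = configparser.ConfigParser()
cfg.read(["settings1.ini", "settings2.ini"])   # read once, keep in memory

if cfg.has_section("thresholds"):
    value = cfg.getfloat("thresholds", "cutoff", fallback=0.0)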
