Is it in memory?
If so, then it doesn't matter if I import chunk by chunk or not because eventually, when I concatenate them, they'll all be stored in memory.
Does that mean for a large data set, there is no way to use pandas?
Yes, they will be stored in memory, and that's the reason why you want to chunk them - that allows you to not read the whole data set in at the same time, but process it in chunks before writing out the end result.
You can use chunksize to tell pandas how many rows should be read for each chunk. If you need a complete set of rows to perform arbitrary lookups, you'll have to back it with some other technology (such as a database).
Yes it is in memory, and yes when the dataset gets too large you have to use other tools.
Of course you can load data in chucks, process one chunk at a time and write down the results (and so free memory for the next chunk).
That works fine for some type of process like filtering and annotating while if you need sorting or grouping you need to use some other tool, personally I like bigquery from google cloud.
Related
I am writing a python script where I am reading over 100s of data files, and then process the data using some equations and eventually, I would like to collect all the data into a single dataframe.
Currently, I use a for loop to go through each file at a time, import the data, then apply an equation on to the data, and then concatenate the data to an empty dataframe, which grows over time as more and more files are read. This method take a lot of time and I was wondering if there are faster alternatives.
Thanks in advance for your suggestions! :-)
Rifat
You can do it with a 2 scan approach. In the first you access to all files and do only the data extraction. The outcome can be a single csv files or one json for each file. This step can be parallelized easily with multiprocessing library or joblib.
In the second scan you read and apply the computation, in this case you can benefit of vectorial computation if the data are loaded in a pandas dataframe.
The benefit of this approach is that data extraction will be done only when input files changes, if changes a subset of files, you will refresh only the extracted data related to them.
If you change something in the computation this can be repeated only using the extracted data and will go fast.
I have a Python program that is controlling some machines and stores some data. The data is produced at a rate of about 20 rows per second (and about 10 columns or so). The whole run of this program can be as long as one week, as a result there is a large dataframe.
What are safe and correct ways to store this data? With safe I mean that if something fails in the day 6, I will still have all the data from days 1→6. With correct I mean not re-writing the whole dataframe to a file in each loop.
My current solution is a CSV file, I just print each row manually. This solution is both safe and correct, but the problem is that CSV does not preserve data types and also occupies more memory. So I would like to know if there is a binary solution. I like the feather format as it is really fast, but it does not allow to append rows.
I can think of two easy options:
store chunks of data (e.g. every 30 seconds or whatever suits your use case) into separate files; you can then postprocess them back into a single dataframe.
store each row into an SQL database as it comes in. Sqlite will likely be a good start, but I'd maybe really go for PostgreSQL. That's what databases are meant for, after all.
I am currently using Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion records pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this, sort of, I am going to leave it open because I suspect given my inexperience with python, my process could be improved.
Basically, I had to ditch the index function from record linkage toolkit. I pulled out the Index of the dataframe I was using, and then converted it to a list, and passed it through the itertools combinations function.
candidates = fl
candidates = candidates.index
candidates = candidates.tolist()
candidates = combinations(candidates,2)
This then gave me an iteration object full of tuples, without having to load everything in to memory. I then passed it into an islice grouper as a for loop.
for x in iter(lambda: list(islice(candidates,1000000)),[]):
I then proceeded to perform all of the necessary comparisons in the for loop, and added the resultant dataframe to a dictionary, which I then concatenate at the end for the full list. Python's memory usage hasn't risen above 3GB the entire time.
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).
I have nearly 60-70 timing log files(all are .csv files, with a total size of nearly 100MB). I need to analyse these files at a single go. Till now, I've tried the following methods :
Merged all these files into a single file and stored it in a DataFrame (Pandas Python) and analysed them.
Stored all the csv files in a database table and analysed them.
My doubt is, which of these two methods is better? Or is there any other way to process and analyse these files?
Thanks.
For me I usually merge the file into a DataFrame and save it as a pickle but if you merge it the file will pretty big and used up a lot of ram when you used it but it is the fastest way if your machine have a lot of ram.
Storing the database is better in the long term but you will waste your time uploading the csv to the database and then waste even more of your time retrieving it from my experience you use the database if you want to query specific things from the table such as you want a log from date A to date B however if you use pandas to query all of that than this method is not very good.
Sometime for me depending on your use case you might not even need to merge it use the filename as a way to query and get the right log to process (using the filesystem) then merge the log files you are concern with your analysis only and don't save it you can save that as pickle for further processing in the future.
What exactly means analysis on a single go?
I think your problem(s) might be solved using dask and particularly the dask dataframe
However, note that the dask documentation recommends to work with one big dataframe, if it fits comfortably in the RAM of you machine.
Nevertheless, an advantage of dask might be to have a better parallelized or distributed computing support than pandas.
I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large number of data, saved in CSV files. Each file contains time-stamped values of the data that I am interested in and I have reached the point where I have to deal with tens of millions of data points. The data has got so large now that the processing time is excessive and inefficient---the way the current code is written the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
Read in all of the existing data to python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added I load my database, append the new data, and resave it. This way only a small number of data need to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: whats the best format to save the data in? HDF5 seems like it might have all the features I need---but would something like SQLite make more sense?
EDIT: My data is single dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it wasn't for the fact that there are so many points then CSV would be an ideal format! I am unlikely to want to do lookups of single entries---more likely is that I might want to plot small subsets of data (eg the last 100 hours, or the last 1000 hours, etc).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), many programs have support for it (matlab for example), there are libraries for C,C++,fortran,python,... It has a complete toolset to display the contents of a HDF5 file. If you later want to do complex MPI calculation on your data, HDF5 has support for concurrently read/writes. It's very well suited to handle very large datasets.
Maybe you could use some kind of key-value database like Redis, Berkeley DB, MongoDB... But it would be nice some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 232 - 1 elements (4294967295, more than 4
billion of elements per list). The main features of Redis Lists from
the point of view of time complexity are the support for constant time
insertion and deletion of elements near the head and tail, even with
many millions of inserted items. Accessing elements is very fast near
the extremes of the list but is slow if you try accessing the middle
of a very big list, as it is an O(N) operation.
I would use a single file with fixed record length for this usecase. No specialised DB solution (seems overkill to me in that case), just plain old struct (see the documentation for struct.py) and read()/write() on a file. If you have just millions of entries, everything should be working nicely in a single file of some dozens or hundreds of MB size (which is hardly too large for any file system). You also have random access to subsets in case you will need that later.