How to analyse multiple csv files very efficiently? - python

I have nearly 60-70 timing log files (all .csv files, with a total size of about 100 MB). I need to analyse these files in a single go. So far, I've tried the following methods:
Merged all these files into a single file, loaded it into a pandas DataFrame, and analysed it.
Stored all the csv files in a database table and analysed them.
My question is: which of these two methods is better? Or is there some other way to process and analyse these files?
Thanks.

For me, I usually merge the files into a single DataFrame and save it as a pickle. The merged file will be pretty big and use up a lot of RAM when you work with it, but this is the fastest way if your machine has plenty of RAM.
Storing the data in a database is better in the long term, but you will spend time uploading the CSVs to the database and then even more time retrieving them. From my experience, a database is worth it when you want to query specific things from the table, such as the logs from date A to date B; if you are going to pull everything back into pandas anyway, this method is not very good.
Sometimes, depending on your use case, you might not even need to merge everything: use the filename (via the filesystem) as the query key, merge only the log files your analysis is concerned with, and you can save that merged subset as a pickle for further processing in the future.
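A minimal sketch of both routes, assuming the logs sit in a logs/ directory and follow a date-based naming scheme (all paths and names here are illustrative):

import glob

import pandas as pd

# Merge every timing log into one DataFrame
files = glob.glob("logs/*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Pickle the merged frame so later runs skip CSV parsing entirely
df.to_pickle("merged_logs.pkl")

# Or use the filename as the query key and merge only what you need
subset = [f for f in files if "2023-01" in f]  # hypothetical date-based names
df_subset = pd.concat((pd.read_csv(f) for f in subset), ignore_index=True)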

What exactly do you mean by analysing them in a single go?
I think your problem(s) might be solved using dask, and particularly the dask dataframe.
However, note that the dask documentation recommends working with one big pandas dataframe if it fits comfortably in the RAM of your machine.
Nevertheless, an advantage of dask is that it has better support for parallelized and distributed computing than pandas.
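For instance, dask.dataframe can read a whole directory of CSVs through a single glob pattern; the path and the "duration" column below are illustrative:

import dask.dataframe as dd

# One lazy dataframe spanning every log file; nothing is read yet
df = dd.read_csv("logs/*.csv")

# Work is parallelized across files and only runs at .compute()
mean_duration = df["duration"].mean().compute()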

Related

Faster alternatives to for loop and concatenate in python

I am writing a Python script in which I read hundreds of data files, process the data using some equations, and eventually want to collect all the data into a single dataframe.
Currently, I use a for loop to go through the files one at a time: import the data, apply an equation to it, and then concatenate it onto an initially empty dataframe, which grows as more and more files are read. This method takes a lot of time, and I was wondering if there are faster alternatives.
Thanks in advance for your suggestions! :-)
Rifat
You can do it with a two-scan approach. In the first scan you access all the files and do only the data extraction. The output can be a single CSV file, or one JSON file per input file. This step is easy to parallelize with the multiprocessing library or joblib.
In the second scan you read the extracted data and apply the computation; here you can benefit from vectorized computation if the data is loaded into a pandas dataframe.
The benefit of this approach is that the extraction only has to be redone when the input files change: if a subset of files changes, you refresh only the extracted data related to them.
If you change something in the computation, it can be repeated using only the extracted data, and that will go fast.
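A sketch of the two-scan idea, where the first pass reduces each raw file to a slimmer intermediate CSV (the directory layout, column names, and the doubling step are all placeholders):

import glob
import os
from multiprocessing import Pool

import pandas as pd

def extract_one(path):
    # First scan: keep only the columns the computation needs
    out = os.path.join("extracted", os.path.basename(path))
    pd.read_csv(path, usecols=["timestamp", "value"]).to_csv(out, index=False)
    return out

if __name__ == "__main__":
    os.makedirs("extracted", exist_ok=True)

    # First scan, parallelized across files
    with Pool() as pool:
        extracted = pool.map(extract_one, glob.glob("raw/*.csv"))

    # Second scan: load everything at once and compute vectorially
    df = pd.concat((pd.read_csv(p) for p in extracted), ignore_index=True)
    df["result"] = df["value"] * 2  # stand-in for the real equations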

How to convert a dbf file to a dask dataframe?

I have a big dbf file, and converting it to a pandas dataframe takes a lot of time.
Is there a way to convert the file into a dask dataframe?
Dask does not have a dbf loading method.
As far as I can tell, dbf files do not support random access to the data, so it is not possible to read sections of the file in separate workers, in parallel. I may be wrong about this, but certainly dbfreader makes no mention of jumping to an arbitrary record.
Therefore, the only way you could read from dbf in parallel, and hope to see a speed increase, would be to split your original data into multiple dbf files and use dask.delayed to read each of them, as sketched below.
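A sketch of that approach, assuming the data has already been split and using the third-party dbfread package as the per-file reader (any loader that yields one pandas DataFrame per file would do):

import dask
import dask.dataframe as dd
import pandas as pd
from dbfread import DBF

@dask.delayed
def load_dbf(path):
    # Each worker parses one whole file; dbf itself cannot be read in slices
    return pd.DataFrame(iter(DBF(path)))

paths = ["part1.dbf", "part2.dbf", "part3.dbf"]  # hypothetical pre-split files
df = dd.from_delayed([load_dbf(p) for p in paths])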
It is worth mentioning that probably the reason dbfreader is slow (but please, do your own profiling!) is that it does byte-by-byte manipulations and creates Python objects for every record before passing the records to pandas. If you really wanted to speed things up, that code should be converted to Cython or maybe numba, writing into a pre-allocated dataframe.

Reading huge CSV files using Pandas vs. MySQL

I have a 500+ MB CSV data file. My question is: which would be faster for data manipulation (e.g., reading, processing)? The Python MySQL client should be faster, since all the work is mapped into SQL queries and optimization is left to the optimizer. But, at the same time, pandas is dealing with a file directly, which should be faster than communicating with a server.
I have already checked "Large data" work flows using pandas, Best practices for importing large CSV files, Fastest way to write large CSV with Python, and Most efficient way to parse a large .csv in python?. However, I haven't really found any comparison regarding Pandas and MySQL.
Use Case:
I am working on a text dataset that consists of 1,737,123 rows and 8 columns. I am feeding this dataset into an RNN/LSTM network. Prior to feeding, I do some preprocessing, namely encoding with a customized encoding algorithm.
More details
I have 250+ experiments to run and 12 architectures (different model designs) to try.
I am confused, I feel I miss something.
There's no comparison online because these two scenarios give different results:
With pandas, you end up with a DataFrame in memory (a NumPy ndarray under the hood), accessible as native Python objects
With MySQL client, you end up with data in a MySQL database on disk (unless you're using an in-memory database), accessible via IPC/sockets
So, the performance will depend on
how much data needs to be transferred by lower-speed channels (IPC, disk, network)
how comparatively fast is transferring vs processing (which of them is the bottleneck)
which data format your processing facilities prefer (i.e. what additional conversions will be involved)
E.g.:
If your processing facility can reside in the same (Python) process that reads the data, reading it directly into Python types is preferable, since you won't need to transfer it all to the MySQL process and then back again (converting formats each time).
OTOH, if your processing facility is implemented in some other process and/or language, or e.g. resides within a computing cluster, hooking it to MySQL directly may be faster, by eliminating the comparatively slow Python from the equation, and because you'll need to transfer the data again and convert it into the processing app's native objects anyway.
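To make the two paths concrete, here is roughly what each looks like in pandas (the file, table, and connection string are illustrative):

import pandas as pd
import sqlalchemy

# Path 1: read the CSV straight into the Python process
# (no IPC, one format conversion: text -> NumPy)
df_csv = pd.read_csv("data.csv")

# Path 2: pull the same rows from a MySQL server
# (hypothetical credentials; every row crosses a socket and is converted
# twice: into MySQL's wire format and back into pandas objects)
engine = sqlalchemy.create_engine("mysql+pymysql://user:pw@localhost/textdb")
df_sql = pd.read_sql("SELECT * FROM samples", engine)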

Comparing 2 large files (not in memory), where to begin

At the moment, I am doing a file comparison on 2 CSV files, checking for duplicate lines in each specific file, checking for data mismatches between the files, and checking for missing data rows in each file.
Currently, I am doing this in memory, built for speed, because the tool will be processing thousands of files constantly. This comes at a price, though: it can only handle files it can completely store in memory.
I am looking to build a fallback so that, if for some reason the files can't fit in memory (although this should never happen), the comparison can still be done.
What would be a good approach to do this?
Use pandas. You can't beat it for data analysis in Python.
https://pandas.pydata.org/pandas-docs/stable/10min.html
Comes complete with a
read_csv(filepath, skiprows=100000, nrows=9999999)
method that loads the specified rows.
It's built on NumPy, most of whose methods are implemented in C, making them incredibly fast.
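read_csv can also stream a file in fixed-size chunks via chunksize, which is one way to build the out-of-memory fallback. A sketch of a chunked duplicate-line check (the file name, chunk size, and use of MD5 row hashes are illustrative; only 16 bytes per row stay in memory):

import hashlib

import pandas as pd

seen = set()
dup_count = 0
for chunk in pd.read_csv("file_a.csv", chunksize=100_000):
    # Serialize each row back into a line and keep only its hash
    for line in chunk.to_csv(index=False, header=False).splitlines():
        h = hashlib.md5(line.encode()).digest()
        if h in seen:
            dup_count += 1
        else:
            seen.add(h)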

Is there a faster way to append many XLS files into a single CSV file?

After the recommendation from Jeff's answer to check out this Google Forum, I still didn't feel satisfied with the conclusion regarding the appendCSV method. Below, you can see my implementation of reading many XLS files. Is there a way to significantly increase the speed of this? It currently takes over 10 minutes for around 900,000 rows.
import glob

import pandas as pd

listOfFiles = glob.glob(file_location)  # file_location: a glob pattern for the XLS files
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    data = pd.read_excel(a_file, sheetname=0, skiprows=range(1, 2), header=1)  # sheet_name= in newer pandas
    data.rename(columns={'Alphabeta': 'AlphaBeta'}, inplace=True)
    frame = frame.append(data)  # re-copies the whole frame on every iteration
# Save to CSV..
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
The very first important point
Optimize only code that is required to be optimized.
If you need to convert all your files just once, then you have already done a great job, congrats! If, however, you need to reuse it really often (and by really often I mean that there is a source producing your Excel files at a rate of at least 900K rows per 10 minutes, and you need to parse them in real time), then what you need to do is analyze your profiling results.
Profiling analysis
Sorting your profile in descending order by 'cumtime' (the cumulative execution time of a function, including its subcalls), you will discover that out of ~2000 seconds of runtime, ~800 seconds are taken by the 'read_excel' method and ~1200 seconds by the 'to_csv' method.
If you then sort the profile by 'tottime' (the total execution time of the functions themselves), you will find that the top time consumers are functions connected with reading and writing lines and converting between formats. So the real problem is that either the libraries you use are slow, or the amount of data you are parsing is really huge.
Possible solutions
For the first reason, please keep in mind that parsing Excel lines and converting them can be a really complex task. It is hard to advise you without an example of your input data, but there could be a real time loss simply because the library you are using is built for everything: it does the hard work of parsing rows several times when you do not actually need it, because your rows have a very simple structure. In this case you may try to switch to libraries that do not perform complex parsing of the input data, for example xlrd for reading data from Excel. But in the title you mentioned that the input files are also CSVs, so if that is applicable in your case, then load lines with just:
line.strip().split(sep)
instead of complex Excel-format parsing. And of course, if your rows are simple, then you can always use
','.join(list_of_rows)
to write CSV, instead of using complex DataFrames at all. However, if your files contain Unicode symbols, complex fields and so on, then those libraries are probably the best choice.
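Putting those two fragments together, the plain-Python route looks something like this (it assumes simple rows with no quoted or escaped fields, and sep is your input delimiter):

sep = ","  # input delimiter, adjust as needed

# Read, transform, and rewrite rows without any DataFrame machinery
with open("input.csv") as src, open("output.csv", "w") as dst:
    for line in src:
        fields = line.strip().split(sep)
        # ... apply your per-row transformation here ...
        dst.write(",".join(fields) + "\n")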
For the second reason: 900K rows could occupy anywhere from 900K bytes upward, so it is really hard to tell whether your data input is really that big, again without an example. If you truly have a lot of data, there is probably not much you can do, and you will just have to wait. Remember that a disk is actually a very slow device: an ordinary disk gives you ~100 MB/s at best, so if you are copying 10 GB of data (because ultimately, that is what you are doing), just physically reading the raw data and writing the result will take at least 3-4 minutes. But if you are not using your disk bandwidth at 100% (for example, if parsing one row with the library you are using takes a time comparable to just reading that row from disk), you might also try to speed up your code by reading the data asynchronously, with multiprocessing's map_async instead of a loop.
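A sketch of the asynchronous-read idea with map_async (load_one, the worker count, and the glob pattern are placeholders):

import glob
from multiprocessing import Pool

import pandas as pd

def load_one(path):
    # Parse one file; other workers keep the disk busy in the meantime
    return pd.read_excel(path, sheet_name=0)

if __name__ == "__main__":
    list_of_files = glob.glob("*.xls")  # adjust to your files
    with Pool(processes=4) as pool:
        result = pool.map_async(load_one, list_of_files)
        frames = result.get()  # blocks until every file is loaded
    frame = pd.concat(frames, ignore_index=True)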
If you are using pandas, you could do this:
import os
import pandas as pd

# dir and suffix are the answer's placeholders. Note: sep=, encoding=, and
# error_bad_lines= are read_csv options, not read_excel options, so they are dropped here.
dfs = [pd.read_excel(os.path.join(dir, name)) for name in os.listdir(dir) if name.endswith(suffix)]
df = pd.concat(dfs, axis=0, ignore_index=True)
This is screaming fast compared to other methods of getting data into pandas. Other tips:
You can also speed this up by specifying dtype for all columns.
If you are doing read_csv, use engine='c' to speed up the import.
Skip malformed rows instead of failing on them (error_bad_lines=False, or on_bad_lines='skip' in newer pandas).
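For instance (the column names and dtypes are illustrative):

import numpy as np
import pandas as pd

df = pd.read_csv(
    "data.csv",
    engine="c",                           # the faster C parser
    dtype={"id": np.int64, "text": str},  # skip per-column type inference
    on_bad_lines="skip",                  # drop malformed rows instead of raising
)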
