The issue is that I have 299 .csv files, each 150-200 MB on average, with millions of rows and 12 columns. Together they make up one year of data (approx. 52 GB/year); I have 6 years and would eventually like to concatenate all of them. I want to concatenate these files into a single .csv file with Python. As you may expect, I ran into memory errors when trying the following code (my machine has 16 GB of RAM):
import os, gzip, pandas as pd, time
rootdir = "/home/eriz/Desktop/2012_try/01"
dataframe_total_list = []
counter = 0
start = time.time()
for subdir, dirs, files in os.walk(rootdir):
    dirs.sort()
    for files_gz in files:
        with gzip.open(os.path.join(subdir, files_gz)) as f:
            df = pd.read_csv(f)
        dataframe_total_list.append(df)
        counter += 1
        print(counter)
total_result = pd.concat(dataframe_total_list)
total_result.to_csv("/home/eriz/Desktop/2012_try/01/01.csv", encoding="utf-8", index=False)
My aim: get a single .csv file that I can then use to train DL models, etc.
My constraint: I'm very new to working with this amount of data, but I have done "part" of the work:
I know that multiprocessing will not help much here; it is a sequential job in which each task must complete before the next one can start. The main problem is running out of memory.
I know that pandas works well with this, even with chunk size added. However, the memory problem is still there because the amount of data is huge.
I have tried to split the work into small tasks so that I don't run out of memory, but I will run out anyway later when concatenating.
My questions:
Is it still possible to do this task with Python/pandas in some other way I'm not aware of, or do I need to switch to a database approach no matter what? Could you advise which?
If the database approach is the only path, will I have problems when I need to perform Python-based operations to train the DL model? I mean, if I need to use pandas/numpy functions to transform the data, is that possible, or will I have problems because of the file size?
Thanks a lot in advance. I would really appreciate a more in-depth explanation of the topic.
UPDATE 10/7/2018
After trying the code snippets below that @mdurant pointed out, I have learned a lot and corrected my perspective about dask and memory issues.
Lessons:
Dask is there to be used AFTER the first pre-processing task (in case you end up with huge files that pandas struggles to load/process). Once you have your "desired" mammoth file, you can load it into a dask.dataframe object without any issue and process it.
Memory related:
First lesson: come up with a procedure so that you don't need to concatenate all the files and run out of memory; just loop over them and reduce their content by changing dtypes, dropping columns, resampling...
Second lesson: try to put into memory ONLY what you need, so that you don't run out.
Third lesson: if neither of the other lessons applies, look for an EC2 instance or big data tools like Spark, SQL, etc.
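The first two lessons can be sketched roughly as below. This is only a minimal illustration, not the code I actually ran: the function name reduce_and_append and its parameters are made up for the example, and it assumes every file has the same columns so that the reduced pieces can simply be appended to one output file.

```python
import pandas as pd

def reduce_and_append(gz_paths, out_path, usecols, dtypes):
    """Read each gzipped CSV on its own, shrink it, and append it to out_path."""
    for i, gz_path in enumerate(gz_paths):
        # Only one file is held in memory at a time; usecols drops unneeded
        # columns and dtype picks smaller types while parsing.
        df = pd.read_csv(gz_path, compression="gzip", usecols=usecols, dtype=dtypes)
        # The first file creates the output and writes the header,
        # every later file is appended without a header.
        df.to_csv(out_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)
```

Any further reduction (resampling, filtering) would go between the read and the write, so that only the reduced frame ever reaches the disk.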
Thanks @mdurant and @gyx-hh for your time and guidance.
First thing: taking the contents of each CSV and concatenating them into one huge CSV is simple enough; you do not need pandas or anything else for that (or even Python):
import gzip, os
# subdir and files come from the os.walk loop in the question
with open('outpath.csv', 'wb') as outfile:  # binary mode, because gzip.open yields bytes by default
    for files_gz in files:
        with gzip.open(os.path.join(subdir, files_gz)) as f:
            for line in f:
                outfile.write(line)
(you may want to ignore the first line of each CSV, if it has a header with column names).
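A minimal sketch of the header handling, assuming every file starts with the same header line and ends with a newline. The function name concat_gz_csvs is just for illustration:

```python
import gzip
import shutil

def concat_gz_csvs(gz_paths, out_path):
    """Stream gzipped CSVs into one uncompressed CSV, keeping only the first header."""
    with open(out_path, "wb") as outfile:
        for i, gz_path in enumerate(gz_paths):
            with gzip.open(gz_path, "rb") as f:
                header = f.readline()        # consume the header of every file
                if i == 0:
                    outfile.write(header)    # but write it only once
                # Copy the remaining rows in blocks, never loading a whole file.
                shutil.copyfileobj(f, outfile)
```

Memory use stays flat regardless of how many files are concatenated, since the data is only ever streamed through.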
Processing the data is harder. The reason is that, although Dask can read all the files and work on the set as a single dataframe, processing will fail if any single file expands to more memory than your system can handle. This is because random access doesn't mix with gzip compression: a gzipped file cannot be split into chunks, so each one has to be loaded whole.
The output file, however, is (presumably) uncompressed, so you can do:
import dask.dataframe as dd
df = dd.read_csv('outpath.csv')  # automatically chunks input
# filter and fields stand for your own row mask and grouping columns
df[filter].groupby(fields).mean().compute()
Here, only the reference to dd and the .compute() are specific to dask.
I have existing code that reads streaming data and stores it in a pandas DataFrame (new data comes in every 5 minutes). I then capture this data category-wise (~350 categories).
Next, I write all the new data (as it is to be stored incrementally) using to_csv in a loop.
The pseudocode is given below:
for row in parentdf.itertuples():  # insert into <tbl>
    mycat = row.category  # this is the ONLY parameter passed to the key function below
    try:
        df = FnforExtractingNParsingData(mycat, NumericParam1, NumericParam1)
        df.insert(0, 'NewCol', sym)
        df = df.assign(calculatedCol=functions1(params))
        df = df.assign(calculatedCol1=functions2(params, 20))
        df = df.assign(calculatedCol3=functions3(more_params, 20))
        df[20:].to_csv(outfile, mode='a', header=False, index=False)
    except Exception:
        continue  # the real exception handling is omitted from this pseudocode
The category-wise reading and storing to CSV takes 2 minutes per cycle. That is close to 0.34 seconds per write for each of the 350 categories.
I am wondering whether I can make the above process faster and more efficient by using dask dataframes.
I looked at dask.org and at the use cases, but didn't get any clear answers.
Additional details: I am using Python 3.7 and pandas 0.25.
Furthermore, the above code doesn't return any errors, and we have already added a good amount of exception handling to it.
My key function, i.e. FnforExtractingNParsingData, is fairly resilient and has been working as desired for a long time.
Sounds like you're reading data into a Pandas DataFrame every 5 minutes and then writing it to disk. The question doesn't mention some key facts:
how much data is ingested every 5 minutes (10MB or 10TB)?
where is the code being executed (AWS Lambda or a big cluster of machines)?
what data operations does FnforExtractingNParsingData perform?
Dask DataFrames can be written to disk as multiple CSV files in parallel, which can be a lot faster than writing a single file with Pandas, but it depends. Dask is overkill for a tiny dataset. Dask can leverage all the CPUs of a single machine, so it can scale up on a single machine better than most people realize. For large datasets, Dask will help a lot. Feel free to provide more details in your question and I can give more specific suggestions.
I want to read large csv files into Python in the fastest way possible. I have a csv file of ~100 million rows. I came across this primer https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d which goes through a few packages:
csv
pandas
dask
datatable
paratext
For my purposes, "csv" is too raw and I want to leverage the type inference included in the other packages. I need it to work on both Windows and Linux machines. I have also looked into datatable and paratext, but have had problems installing the right package dependencies (neither is on the anaconda package repo). So that leaves pandas and dask. At first glance, dask seems much faster, but I then realized that it only does the computations if you call ".compute".
My specific use case is that even though the raw csv file is 100+ million rows, I only need a subset of it loaded into memory. For example, all rows with date >= T. Is there a more efficient way of doing this than just the below example? Both pandas and dask take similar time.
EDIT: The csv file updates daily and there is no pre-known order of the rows in the file, i.e. it is not necessarily the case that the most recent dates are at the end of the file.
import pandas as pd
import dask.dataframe as dd
from datetime import datetime
s = datetime.now()
data1 = pd.read_csv("test.csv", parse_dates=["DATE"])
data1 = data1[data1.DATE>=datetime(2019,12,24)]
print(datetime.now()-s)
s = datetime.now()
data2 = dd.read_csv("test.csv", parse_dates=["DATE"])
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute()
print(datetime.now()-s)
Your Dask solution looks good to me. For parsing CSV in particular you might want to use Dask's multiprocessing scheduler. Most pandas operations are better with threads, but text-based processing, like CSV parsing, is an exception.
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute(scheduler="processes")
See https://docs.dask.org/en/latest/scheduling.html for more information.
CSV is not an efficient file format for filtering: CSV files have no indexes on data fields and no key-based access. For each filter operation you always have to read all lines.
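To illustrate the point: every line still has to be scanned, but you don't have to keep the lines you filter out. Here is a rough sketch with pandas' chunked reader. The function name load_rows_since and the chunk size are assumptions for the example, not something from the question:

```python
import pandas as pd

def load_rows_since(csv_path, date_col, cutoff, chunksize=1_000_000):
    """Scan the whole CSV in chunks but keep only rows with date_col >= cutoff."""
    kept = []
    for chunk in pd.read_csv(csv_path, parse_dates=[date_col], chunksize=chunksize):
        # Each chunk is filtered right away, so only matching rows stay in memory.
        kept.append(chunk[chunk[date_col] >= cutoff])
    return pd.concat(kept, ignore_index=True)
```

This lowers peak memory, not the time spent reading, which is the limitation described above.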
You can improve the performance marginally by using a library that's written in C or that does some things a little smarter than another library, but don't expect miracles. I'd expect an improvement of a few percent to perhaps 3x if you identify or implement an optimized C version that reads your lines and performs the initial filtering.
If you read the CSV file more than once, it might be useful to convert the file during the first read (multiple options exist: hand-crafted helpers, indexes, sorting, databases, ...) and perform subsequent reads on the 'database'.
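As one possible shape of the 'database' option, here is a sketch using the sqlite3 module from the standard library. The table name, the DATE column and both function names are assumptions for the example. It also assumes the dates are stored as ISO strings (YYYY-MM-DD), so that string comparison matches date order:

```python
import csv
import sqlite3
from contextlib import closing

def csv_to_sqlite(csv_path, db_path, table="rows", date_col="DATE"):
    """One-time conversion: load the CSV into SQLite and index the date column."""
    with open(csv_path, newline="") as f, closing(sqlite3.connect(db_path)) as conn:
        reader = csv.reader(f)
        cols = next(reader)  # header row gives the column names
        col_sql = ", ".join(f'"{c}"' for c in cols)
        marks = ", ".join("?" for _ in cols)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
        conn.execute(f'CREATE INDEX IF NOT EXISTS "idx_{date_col}" ON "{table}" ("{date_col}")')
        conn.commit()

def rows_since(db_path, cutoff, table="rows", date_col="DATE"):
    """Later reads use the index instead of scanning every line."""
    with closing(sqlite3.connect(db_path)) as conn:
        query = f'SELECT * FROM "{table}" WHERE "{date_col}" >= ?'
        return conn.execute(query, (cutoff,)).fetchall()
```

The conversion itself still reads every line once, so it only pays off when the same data is filtered repeatedly.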
If you know that the new CSV file is the same as the previous version plus lines appended to the end of the file, you only have to remember the position of the last line of the previous version and add just the new lines to your optimized data file (database, ...).
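A small sketch of remembering the position, under the assumption that the file only ever grows by whole appended lines. The side file that stores the byte offset (offset_path) and the function name are made up for the example:

```python
import os

def read_appended_lines(csv_path, offset_path):
    """Return the lines added to csv_path since the previous call."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    with open(csv_path, "rb") as f:
        f.seek(offset)             # jump past everything already processed
        new_lines = f.readlines()
        new_offset = f.tell()
    with open(offset_path, "w") as f:
        f.write(str(new_offset))   # remember where this read stopped
    return [line.decode("utf-8") for line in new_lines]
```

If the writer can be interrupted in the middle of a line, the last line returned may be incomplete, so a real version would need to check for the trailing newline before advancing the offset.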
Other file formats might be hundreds or thousands of times more efficient, but creating these files the first time is probably at least as expensive as your search (so if you read only once, there's nothing you can optimize).
If none of the above conditions is true, you cannot expect a huge performance increase.
You might look at "What is the fastest way to search the csv file?" for accelerations (assuming the files can be sorted/indexed by a search/filter criterion).
Overall goal: I want to train a pytorch model on a data set that does not fit into memory.
Now forget that I spoke about pytorch. What it boils down to: reading and writing a large file out of core or memory-mapped.
I found a lot of libraries, but I couldn't find a single one that allows me to do multi-threaded sequential reads and writes. What I want is to have multiple threads that append to the file/dataframe (order does not matter; it should be shuffled for the downstream application anyway). Then, when reading, I only need sequential reading (no slicing, no indexing), but again it should be possible to feed multiple threads.
I found/came up with the following solutions:
csv: Not an option, because storing floats leads to precision loss (also horrible for handling encoding and escaping)
numpy.memmap: You need to know the size of the array in advance, both for reading and writing; appending seems non-trivial.
dask: I can't find a way to append to a dataframe; it always creates a new one when appending, and the new dataframe does not seem to be file-backed. This looks good for reading, but creating a new out-of-core dataframe is not documented.
xarray: Again no documentation on how to write to a file-backed dataframe; instead the documentation states: "It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched." So it seems not to be possible?
joblib: Same story, reading yes, iterative writing no.
blaze: Also no row appending
vaex: No row appending. Why‽
It's great that they all support out of core reading, but I need to get it in the specific file format first (writing) – what am I missing here?
Looks like multi-threaded writing is a hard problem. Even incremental single-threaded writing combined with multi-threaded reading would already be good, but there seems to be no library that supports that?
Multi-threaded sequential write can be error prone. Most systems typically prefer formats like Parquet that allow them to write each chunk of data to different files.
If you want to do actual parallel sequential writes you'll have to do some sort of locking, and you're probably on your own in terms of larger all-in-one systems.
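For what "some sort of locking" could look like within a single process, here is a minimal sketch with threading.Lock. The class name LockedAppender is made up, and this only coordinates threads of one process; separate processes or machines would need a different mechanism:

```python
import threading

class LockedAppender:
    """Lets several threads append lines to one file without interleaving them."""

    def __init__(self, path):
        self._file = open(path, "a", encoding="utf-8")
        self._lock = threading.Lock()

    def append(self, line):
        # Only one thread at a time may write, so each line lands intact.
        with self._lock:
            self._file.write(line + "\n")

    def close(self):
        self._file.close()
```

The order of the lines in the file depends on thread scheduling, which is fine when the data gets shuffled downstream anyway.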
I finally found a working solution with pyarrow.
Incremental writing:
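import pandas as pd
import pyarrow as pa

# df and process_row come from the surrounding application
result = []
writer = None
for _, row in df.iterrows():
    result.append(process_row(row))
    if len(result) >= 10000:
        batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
        if writer is None:
            writer = pa.RecordBatchFileWriter('filename.arrow', batch.schema)
        writer.write(batch)
        result = []
# Flush the rows left over after the last full batch
if result:
    batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
    if writer is None:
        writer = pa.RecordBatchFileWriter('filename.arrow', batch.schema)
    writer.write(batch)
if writer is not None:
    writer.close()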
Read all into one dataframe:
pa.RecordBatchFileReader("filename.arrow").read_pandas()
Incremental reading:
rb = pa.RecordBatchFileReader("filename.arrow")
for i in range(rb.num_record_batches):
    b = rb.get_batch(i)
I am trying to be really specific about my issue. I have a dataframe with some 200+ columns and 1 million+ rows. Reading it or writing it to an Excel file takes more than 45 minutes, if I timed it right.
df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter = ',', na_values = ('', 'nan'))
df.to_excel('data_file.xlsx', header=0, index=False)
My question: is there any way to read or write a file faster with a pandas dataframe? This is just one example file; I have many more such files.
Two thoughts:
Investigate Dask, which provides a Pandas like DataFrame that can distribute processing of large datasets across multiple CPUs or clusters. Hard to say to what degree you will get a speed up, if your performance is purely IO bound, but certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities.
If you are going to repeatedly read the same CSV input files, then I would suggest converting these to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It's as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(). Obviously this will only help if you can do this conversion as a once off exercise, and then use the HDF files from that point forward whenever you run your code.
Regards,
Ian
That is a big file you are working with. If you need to process the data then you can't really get around the long read and write times.
Do NOT write to xlsx; use csv. Writing to xlsx takes a long time.
Write to csv. It takes a minute on my cheap laptop with SSD.
After the recommendation from Jeff's Answer to check out this Google Forum, I still didn't feel satisfied on what the conclusion was regarding the appendCSV method. Below, you can see my implementation of reading many XLS files. Is there a way to significantly increase the speed of this? It currently takes over 10 minutes for around 900,000 rows.
import glob
import pandas as pd

listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    # sheet_name was spelled sheetname before pandas 0.21
    data = pd.read_excel(a_file, sheet_name=0, skiprows=range(1,2), header=1)
    data.rename(columns={'Alphabeta':'AlphaBeta'}, inplace=True)
    # pd.concat replaces frame.append(data), which was removed in pandas 2.0
    frame = pd.concat([frame, data])
# Save to CSV..
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
The very first important point
Optimize only code that is required to be optimized.
If you need to convert all your files just once, then you have already done a great job, congrats! If, however, you need to rerun it really often (and by really I mean that there is a source that produces your Excel files at a rate of at least 900K rows per 10 minutes and you need to parse them in real time), then what you need to do is analyze your profiling results.
Profiling analysis
Sorting your profile in descending order by 'cumtime', which is the cumulative execution time of a function including its subcalls, you will discover that out of ~2000 seconds of runtime, ~800 seconds are taken by the 'read_excel' method and ~1200 seconds by the 'to_csv' method.
If you then sort the profile by 'tottime', which is the total execution time of the functions themselves, you will find that the top time consumers are functions connected with reading and writing lines and converting between formats. So the real problem is that either the libraries you use are slow, or the amount of data you are parsing is really huge.
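If you want to reproduce this kind of analysis, a sketch with the standard cProfile and pstats modules could look like this. The helper name profile_call is made up for the example; 'cumtime' and 'tottime' are the sort keys discussed above:

```python
import cProfile
import io
import pstats

def profile_call(func, *args, sort_key="cumtime", top=10, **kwargs):
    """Run func under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # sort_key="cumtime" includes subcalls, sort_key="tottime" excludes them
    stats.sort_stats(sort_key).print_stats(top)
    return result, buf.getvalue()
```

Calling it once with sort_key="cumtime" and once with sort_key="tottime" on your conversion function gives the two views of the same run.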
Possible solutions
For the first reason, please keep in mind that parsing Excel lines and converting them can be a really complex task. It is hard to advise you without an example of your input data. But there could be a real time loss simply because the library you are using is general-purpose and does the hard work of parsing rows several times when you do not actually need it, because your rows have a very simple structure. In this case you may try switching to a different library that does not perform complex parsing of the input data, for example xlrd for reading data from Excel. But in the title you mentioned that the input files are also CSVs, so if this is applicable in your case then load lines with just:
line.strip().split(sep)
instead of complex Excel format parsing. And of course, if your rows are simple, then you can always use
','.join(list_of_rows)
to write CSV instead of using complex DataFrames at all. However, if your files contain Unicode symbols, complex fields and so on then these libraries are probably the best choice.
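Put together, the plain split/join approach could look like the sketch below. It assumes simple rows with no quoted fields and no separators or commas inside values. The function name convert_delimited is made up, and it uses rstrip("\n") rather than strip() so that empty trailing fields survive:

```python
def convert_delimited(in_path, out_path, sep="\t"):
    """Rewrite a simple sep-delimited text file as comma-separated lines."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split(sep)   # no type inference, no quoting
            dst.write(",".join(fields) + "\n")
```

Nothing is held in memory beyond the current line, and no DataFrame is built at all.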
For the second reason: 900K rows could contain anywhere from 900K to an unbounded number of bytes, so again it is really hard to tell whether your input is really that big without an example. If you really have a lot of data then there is probably not much you can do and you just have to wait. And remember that a disk is actually a very slow device. Ordinary disks provide ~100 MB/s at best, so if you are copying (because ultimately that is what you are doing) 10 GB of data, that is about 100 seconds to read plus 100 seconds to write, i.e. at least 3-4 minutes just for physically reading the raw data and writing the result. But if you are not using 100% of your disk bandwidth (for example, if parsing one row with the library you are using takes about as long as reading that row from disk), you might also try to speed up your code by reading data asynchronously with multiprocessing map_async instead of a loop.
If you are using pandas, you could do this:
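import os, pandas as pd
# sep, encoding and error_bad_lines are read_csv options (on_bad_lines='skip' since pandas 1.3)
dfs = [pd.read_csv(os.path.join(dir, name), sep='\t', encoding='cp1252', error_bad_lines=False) for name in os.listdir(dir) if name.endswith(suffix)]
df = pd.concat(dfs, axis=0, ignore_index=True)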
This is screaming fast compared to other methods of getting data into pandas. Other tips:
You can also speed this up by specifying dtype for all columns.
If you are doing read_csv, use the engine='c' to speed up the import.
Skip rows on error
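The three tips combined might look like this. It is only a sketch: the function name read_typed_csv is made up, and it assumes pandas 1.3 or newer, where skipping bad rows is spelled on_bad_lines="skip":

```python
import pandas as pd

def read_typed_csv(csv_path, dtypes, sep=","):
    """Read a CSV with explicit column types, the C parser, and bad rows skipped."""
    return pd.read_csv(
        csv_path,
        sep=sep,
        dtype=dtypes,          # no type inference pass over the data
        engine="c",            # the C parser
        on_bad_lines="skip",   # drop rows with too many fields instead of failing
    )
```

For example, read_typed_csv("data.csv", {"a": "int32", "b": "float64"}) would read only well-formed rows and store column a as 32-bit integers.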