How to use several input files and do parallel processing in python? - python

I have 30 csv files. I want to give it as input in for loop, in pandas?
Each file has names such as fileaa, fileab,fileac,filead,....
I have multiple input files and And i would like to receive one output.
Usually i use read_csv but due to memory error, 'read_csv' doesn't work.
f = "./file.csv"
df = pd.read_csv(f, sep="/", header=0, dtype=str)
So i would like to try parallel processing in python 2.7

You might want to have a look at dask.
Dask docs show a demo on how to read in many csv files and output a single dask dataframe:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
And then MANY (but not all) of the pandas methods are available, i.e.:
df.head()
It would be useful to read more on dask dataframe to understand difference with pandas dataframe

Related

How to perform operations on a Dask dataframe and export the results to a csv?

I have a large input csv file (several GBs) that I import in Dask with a blocksize of 5e6. The input csv contains two columns: "ID" and "Text".
ddf1 = dd.read_csv('REL_Input.csv', names=['ID', 'Text'], blocksize=5e6)
I need to add a third column to ddf1, "Hash", by parsing the existing "Text" column for a string between "Hash=" and ";". In Pandas, I can simply do this:
ddf1['Hash'] = ddf1['Text'].str.extract(r'Hash=(.*?);')
When I do this in Dask, I get an error saying that the "column assignment doesn't support dask.dataframe.core.DataFrame". I tried to use assign but had no luck.
I also need to read multiple large csv files (each several GBs in size) from a directory, concatenate them into another Dask dataframe, ddf2. Each of these csv files have 100s of columns but I only need 2: "Hash" and "Name". Here is the code to create ddf2:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Then, I need to merge the two dataframes on the "Hash" columns--something like this:
ddf3 = ddf1[['ID', 'ddf1_Hash']].merge(ddf2[['ddf2_Hash', 'Name']], left_on='ddf1_Hash', right_on='ddf2_Hash', how='left')
Finally, I need to export ddf3 as a csv:
df3.to_csv('Output.csv')
I looked and it seems I can create the column for ddf1 and perform the merge operation by changing both ddf1 and ddf2 to pandas dfs using compute. However, that's not an option for me due to the sheer size of these dataframes. I also tried using the chunks approach in Pandas, but that does not work due to the "out of memory" error.
Is there a good way to tackle this problem? I'm still learning Python so any help would be appreciated.
UPDATE:
I am able to create the third column and merge the two dataframes. Though, now the issue is that I can't export the merged dataframe as a csv.
Running regex on a string column. The following snippet uses assign:
import dask.dataframe as dd
import pandas as pd
# this step is just to setup a minimum reproducible example
df = pd.DataFrame(list("abcdefghi"), columns=['A'])
ddf = dd.from_pandas(df, npartitions=3)
# this uses assign to extract the relevant content
ddf = ddf.assign(check_c = lambda x: x['A'].str.extract(r'([a-z])'))
# you can see that the computation was done correctly
ddf.compute()
Concatenating csv files. Do csv files have the same structure/columns? If so, you can just use dd.read_csv("path_to_csv_files/*csv"), but if the files have different structures, then your approach is correct:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Merging the dataframes. This is going to be an expensive operation, here's a couple of options to potentially reduce the cost of this:
if any of the dataframes can be put into memory, then it would help to run .compute() to get pandas dataframe before the merge;
setting the key variable as index on one or both dataframes:
ddf1 = ddf1.set_index('Hash')
ddf2 = ddf2.set_index('Hash')
ddf3 = ddf1.merge(ddf2, left_index=True, right_index=True)
Saving csv, by default, dask will save each partition to its own csv file, so your path needs to contain an asterisk, e.g.:
df3.to_csv('Output_*.csv', index=False)
There are other options possible (explicit paths, custom name function, see https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single file, you can use
df3.to_csv('Output.csv', index=False, single_file=True)
However, this option is not supported on all systems, so you might want to check that it works using a small sample first (see documentation).

How to split a large .csv file using dask?

I am trying to use dask in order to split a huge tab-delimited file into smaller chunks on an AWS Batch array of 100,000 cores.
In AWS Batch each core has a unique environment variable AWS_BATCH_JOB_ARRAY_INDEX ranging from 0 to 99,999 (which is copied into the idx variable in the snippet below). Thus, I am trying to use the following code:
import os
import dask.dataframe as dd
idx = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
df = dd.read_csv(f"s3://main-bucket/workdir/huge_file.tsv", sep='\t')
df = df.repartition(npartitions=100_000)
df = df.partitions[idx]
df = df.persist() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df = df.compute() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df.to_csv(f"/tmp/split_{idx}.tsv", sep="\t", index=False)
print(idx, df.shape, df.head(5))
Do I need to call presist and/or compute before calling df.to_csv?
When I have to split a big file into multiple smaller ones, I simply run the following code.
Read and repartition
import dask.dataframe as dd
df = dd.read_csv("file.csv")
df = df.repartition(npartitions=100)
Save to csv
o = df.to_csv("out_csv/part_*.csv", index=False)
Save to parquet
o = df.to_parquet("out_parquet/")
Here you can use write_metadata_file=False if you want to avoid metadata.
Few notes:
I don't think you really need persist and compute as you can directly save to disk. When you have problems like memory error is safer to save to disk rather than compute.
I found using parquet format at least 3x faster than csv when it's time to write.

Join two large files by column in python

I have 2 files with 38374732 lines in each and size 3.3 G each. I am trying to join them on the first column. For doing so I decided to use pandas with the following code that pulled from Stackoverflow:
import pandas as pd
import sys
a = pd.read_csv(sys.argv[1],sep='\t',encoding="utf-8-sig")
b = pd.read_csv(sys.argv[2],sep='\t',encoding="utf-8-sig")
chunksize = 10 ** 6
for chunk in a(chunksize=chunksize):
merged = chunk.merge(b, on='Bin_ID')
merged.to_csv("output.csv", index=False,sep='\t')
However I am getting memory error(not surprising). I looked up at the code with chunks for pandas (something like this How to read a 6 GB csv file with pandas), however how do I implement it for two files in a loop and I don't think I can chunk the second file as I need to lookup for column in the whole second file.Is there a way out for this?
This is already discussed in other posts like the one you mentioned (this, or this, or this).
As it is explained there, I would try to use dask dataframe to load the data and execute the merge, but depending on your PC you may still not be able to do it.
Minimum working example:
import dask.dataframe as dd
# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')
# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()
# Save the merged dataframe
df.to_csv('merged.csv', index=False)

Pandas read_csv large file Performance improvement

Just was wondering if there is a way to improve the performance of reading large csv files into a pandas dataframe. I have 3 large (3.5MM records each) pipe delimited file which I want to load into dataframe and perform some task on it. Currently I am using pandas.read_csv() defining the cols and there datatypes in the parameter like below. I did see some improvement by defining the datatype of the columns but it still takes more than 3 minutes to load.
import pandas as pd
df = pd.read_csv(file_, index_col=None, usecols = sourceFields, sep='|', header=0, dtype={'date':'str', 'gwTimeUtc':'str', 'asset':'|str',
'instrumentId':'|str', 'askPrice':'float64', 'bidPrice':'float64',
'askQuantity':'float64', 'bidQuantity':'float64', 'currency':'|str',
'venue':'|str', 'owner':'|str', 'status':'|str', 'priceNotation':'|str', 'nominalQuantity':'float64'})
Depending on what you wish to do with the data, a good option is dask.dataframe. This library works out-of-memory, and allows you to perform a subset of pandas operations lazily. You can then bring the results in memory as a pandas dataframe. Below is example code you can try:
import dask.dataframe as dd, pandas as pd
# point to all files beginning with "file"
dask_df = dd.read_csv('file*.csv')
# define your calculations as you would in pandas
dask_df['col2'] = dask_df['col1'] * 2
# compute results & return to pandas
df = dask_df.compute()
Crucially, nothing significant is computed until the very last line.
The .feather file is significantly faster than .csv. Pandas has built-in support for feather files.
Read the csv in using pd.read_csv(path) and then export it to a feather file: pd.to_feather(path). Now, read the feather file instead of csv.
In my case, a 950 MB csv file was compressed to a 180 MB feather file. Instead of taking 30 seconds to read, it takes about 1 second. I know I am a bit late to the party, but feather files are seriously underrated.

How to speed up the process of chunk to dataframe?

I try to use multiprocessing to read the csv file faster than using read_csv.
df = pd.read_csv('review-1m.csv', chunksize=10000)
But the df I get is not the dataframe but of the type pandas.io.parsers.TextFileReader. So I try to use
df = pd.concat(tp, ignore_index=True)
to convert df into a dataframe. But this process takes a lot of time thus the result is not much different from directly using read_csv. Does anyone know that how to make the process of converting df into dataframe faster?
pd.read_csv() is likely going to give you the same read time as any other method. If you want a real performance increase you should change the format you store your file in.
http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations

Categories