Joining two very large csv files using pandas - python

I have two large files which I'd like to join on a set of shared columns. File1 is about 9.5 GB, whereas file2 is about 1.5 GB. This is the code I used to join them.
import pandas as pd
df1 = pd.read_csv(r"file1_path", dtype= str)
df2 = pd.read_csv(r"file2_path", dtype = str)
df3 = pd.merge(df1, df2, on = ['Date', 'ProductPartitionID', 'AdGroupID'], how = 'left')
df3=df3[['Date','AccountDescriptiveName','AdGroupID','AdGroupName','CampaignId','CampaignName','ExternalCustomerId', 'ProductGroup','ProductPartitionID', 'Cost','Impressions','Clicks','Conversions','ConversionValue', 'AllConversions','AllConversionValue','SearchAbsoluteTopImpressionShare', 'ShoppingID', 'SKU','CNT_DIST_SESSIONS','SUM_CM_ORDER_AMT','SUM_NEW_CUST_TODAY_IND', 'SUM_ORDER_CNT','SUM_PRODUCT_VIEW_CNT','SUM_REACTIVANT']]
df3.set_index('Date', inplace = True)
df3.to_csv('Shopping_aw_final.csv')
I'm executing this code in Jupyter Notebook. Because the files are so large, the job takes forever and my laptop keeps hanging, which stalls the process again and again. I don't think the task can be completed with pandas in this way. Can anyone suggest alternative methods I could use to get this done?
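One possible direction, sketched here under the assumption that the smaller file (file2, about 1.5 GB) fits in memory: read only the larger file in chunks and left-join each chunk against the smaller one, appending the result to the output file. Because every row of the larger file lands in exactly one chunk, the combined output matches a single left merge. The function name chunked_left_join, the chunk size, and the sample data are made up for illustration.

```python
import os
import tempfile

import pandas as pd

KEYS = ["Date", "ProductPartitionID", "AdGroupID"]


def chunked_left_join(big_path, small_path, out_path, keys, chunksize=500_000):
    """Left-join big_path against small_path without loading big_path at once."""
    # Assumption: the smaller file fits in memory.
    small = pd.read_csv(small_path, dtype=str)
    first = True
    for chunk in pd.read_csv(big_path, dtype=str, chunksize=chunksize):
        merged = chunk.merge(small, on=keys, how="left")
        # Write the header with the first chunk only, then append.
        merged.to_csv(out_path, mode="w" if first else "a",
                      header=first, index=False)
        first = False


# Tiny demonstration with made-up data.
tmp = tempfile.mkdtemp()
big = os.path.join(tmp, "file1.csv")
small = os.path.join(tmp, "file2.csv")
out = os.path.join(tmp, "joined.csv")
pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "ProductPartitionID": ["1", "2", "1"],
    "AdGroupID": ["10", "10", "11"],
    "Cost": ["5", "7", "9"],
}).to_csv(big, index=False)
pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "ProductPartitionID": ["1", "1"],
    "AdGroupID": ["10", "11"],
    "SKU": ["A", "B"],
}).to_csv(small, index=False)

chunked_left_join(big, small, out, KEYS, chunksize=2)
result = pd.read_csv(out, dtype=str)
```

Selecting the needed columns inside the loop (before writing) would reduce the output size further; peak memory is roughly the smaller file plus one merged chunk.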

Related

read_csv stops at 100000

I am trying to import a .csv file from my Downloads folder.
Usually, the read_csv function imports every row, even when there are millions of them.
In this case, my file has 236,905 rows, but exactly 100,000 are loaded.
df = pd.read_csv(r'C:\Users\user\Downloads\df.csv',nrows=9999999,low_memory=False)
I came across the same problem with a file containing 5M rows.
I first tried this option:
tp = pd.read_csv('yourfile.csv', iterator=True, chunksize=1000)
data_customers = pd.concat(tp, ignore_index=True)
It did work, but in my case some rows were not read properly because some columns contained the character ',', which read_csv uses as the delimiter.
The other solution is to use Dask. It has an object called "DataFrame" (as in pandas). Dask reads your file and constructs a Dask dataframe composed of several pandas dataframes.
It's a great solution for parallel computing.
Hope it helps.
You need to create chunks using the chunksize= parameter:
temporary = pd.read_csv(r'C:\Users\user\Downloads\df.csv', iterator=True, chunksize=1000)
df = pd.concat(temporary, ignore_index=True)
ignore_index resets the index so it's not repeating.
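A small sketch of the chunked pattern, using made-up in-memory data instead of a real file. It shows two things hedged from the discussion above: fields that contain commas parse correctly as long as they are quoted in the file, and when only an aggregate is needed the chunks can be processed one at a time instead of being concatenated back into a single large dataframe.

```python
import io

import pandas as pd

# Made-up sample: the name field contains a comma, so it is quoted.
raw = 'id,name\n1,"Smith, John"\n2,"Doe, Jane"\n3,Plain\n'

total_rows = 0
names = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    total_rows += len(chunk)      # per-chunk work happens here
    names.extend(chunk["name"])   # only the needed values are kept
```

If the commas in a real file are not quoted, the file itself is ambiguous and no chunk size will fix the misaligned rows.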

How to perform operations on a Dask dataframe and export the results to a csv?

I have a large input csv file (several GBs) that I import in Dask with a blocksize of 5e6. The input csv contains two columns: "ID" and "Text".
ddf1 = dd.read_csv('REL_Input.csv', names=['ID', 'Text'], blocksize=5e6)
I need to add a third column to ddf1, "Hash", by parsing the existing "Text" column for a string between "Hash=" and ";". In Pandas, I can simply do this:
ddf1['Hash'] = ddf1['Text'].str.extract(r'Hash=(.*?);')
When I do this in Dask, I get an error saying that the "column assignment doesn't support dask.dataframe.core.DataFrame". I tried to use assign but had no luck.
I also need to read multiple large csv files (each several GBs in size) from a directory and concatenate them into another Dask dataframe, ddf2. Each of these csv files has hundreds of columns, but I only need two: "Hash" and "Name". Here is the code to create ddf2:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Then, I need to merge the two dataframes on the "Hash" columns--something like this:
ddf3 = ddf1[['ID', 'ddf1_Hash']].merge(ddf2[['ddf2_Hash', 'Name']], left_on='ddf1_Hash', right_on='ddf2_Hash', how='left')
Finally, I need to export ddf3 as a csv:
ddf3.to_csv('Output.csv')
I looked and it seems I can create the column for ddf1 and perform the merge operation by changing both ddf1 and ddf2 to pandas dfs using compute. However, that's not an option for me due to the sheer size of these dataframes. I also tried using the chunks approach in Pandas, but that does not work due to the "out of memory" error.
Is there a good way to tackle this problem? I'm still learning Python so any help would be appreciated.
UPDATE:
I am now able to create the third column and merge the two dataframes. The remaining issue is that I can't export the merged dataframe as a csv.
Running regex on a string column. The following snippet uses assign:
import dask.dataframe as dd
import pandas as pd
# this step just sets up a minimal reproducible example
df = pd.DataFrame(list("abcdefghi"), columns=['A'])
ddf = dd.from_pandas(df, npartitions=3)
# this uses assign to extract the relevant content;
# expand=False makes str.extract return a Series rather than a one-column DataFrame
ddf = ddf.assign(check_c = lambda x: x['A'].str.extract(r'([a-z])', expand=False))
# you can see that the computation was done correctly
ddf.compute()
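As a hedged illustration of why the column assignment in the question may fail: in pandas, str.extract returns a DataFrame by default (expand=True), even for a single capture group, while expand=False returns a Series for one group. The sketch below uses plain pandas and made-up "Text" values; it assumes Dask mirrors this behaviour, which would explain the "column assignment doesn't support ... DataFrame" error.

```python
import pandas as pd

# Made-up sample rows for the "Text" column.
df = pd.DataFrame({"Text": ["x;Hash=abc123;y", "no hash here", "Hash=ff00;"]})

# Default expand=True: the result is a one-column DataFrame.
as_frame = df["Text"].str.extract(r"Hash=(.*?);")

# expand=False with one capture group: the result is a Series.
hashes = df["Text"].str.extract(r"Hash=(.*?);", expand=False)

df = df.assign(Hash=hashes)
```

Rows without a match get a missing value in the new column rather than raising an error.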
Concatenating csv files. Do csv files have the same structure/columns? If so, you can just use dd.read_csv("path_to_csv_files/*csv"), but if the files have different structures, then your approach is correct:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Merging the dataframes. This is going to be an expensive operation. Here are a couple of options that may reduce the cost:
- if either dataframe fits into memory, run .compute() on it to get a pandas dataframe before the merge;
- set the key variable as the index on one or both dataframes:
ddf1 = ddf1.set_index('Hash')
ddf2 = ddf2.set_index('Hash')
ddf3 = ddf1.merge(ddf2, left_index=True, right_index=True)
Saving csv. By default, dask saves each partition to its own csv file, so your path needs to contain an asterisk, e.g.:
ddf3.to_csv('Output_*.csv', index=False)
There are other options possible (explicit paths, custom name function, see https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single file, you can use
ddf3.to_csv('Output.csv', index=False, single_file=True)
However, this option is not supported on all systems, so you might want to check that it works using a small sample first (see documentation).

Split large dataframes (pandas) into chunks (but after grouping)

I have a large tabular dataset that needs to be merged and then split by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd
from functools import reduce
large_df = pd.read_table('large_file.csv', sep=',')
This basically loads the whole dataset into memory.
# Then I could group the pandas dataframe by some column value (say "block" )
df_by_block = large_df.groupby("block")
# and then write the data by blocks as
for block_id, block_val in df_by_block:
    # write each group's rows, not the groupby object itself
    block_val.to_csv("df_" + str(block_id), sep="\t", index=False)
The only problem with above code is memory allocation, which freezes my desktop. I tried to transfer this code to dask but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split as the "block" value changes. But, the only problem is that "large_df.txt" is created in the pipeline upstream by merging several dataframes.
Any suggestions?
Thanks,
Update:
I tried the following approach but, it still seems to be memory heavy:
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())
# groupby the dataframe
large_df_grouped = large_df.set_index('contig')
# now, split dataframes
for items in contig_list:
    my_df = large_df_grouped.loc[items].compute().reset_index()
    pd.DataFrame.to_csv(my_df, 'dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the code
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to be pulling everything into the memory again.
Any way to improve this code??
"but dask doesn't have a neat groupby implementation"
Actually, dask does have groupby + user defined functions with OOM reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.
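For comparison, here is a pandas-only sketch of the same split that avoids holding the whole table in memory. This is a swapped-in technique, not the dask approach described above: the file is read in chunks, each chunk is grouped, and every group is appended to its own output file. The function name split_by_block, the output file naming, and the sample data are assumptions for illustration. It also assumes the output directory starts empty, since files are opened in append mode.

```python
import os
import tempfile

import pandas as pd


def split_by_block(path, out_dir, key="block", chunksize=1_000_000):
    """Append each chunk's groups to one tab-separated file per key value."""
    seen = set()
    for chunk in pd.read_csv(path, chunksize=chunksize):
        for block_id, block_df in chunk.groupby(key):
            out_path = os.path.join(out_dir, f"df_{block_id}.tsv")
            # Write the header only the first time a block is encountered.
            block_df.to_csv(out_path, sep="\t", index=False,
                            mode="a", header=block_id not in seen)
            seen.add(block_id)
    return seen


# Tiny demonstration with made-up data.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "large_file.csv")
pd.DataFrame({
    "block": ["a", "b", "a", "b", "c"],
    "value": [1, 2, 3, 4, 5],
}).to_csv(src, index=False)

out_dir = os.path.join(tmp, "out")
os.makedirs(out_dir)
blocks = split_by_block(src, out_dir, chunksize=2)
block_a = pd.read_csv(os.path.join(out_dir, "df_a.tsv"), sep="\t")
```

Memory use is bounded by the chunk size rather than by the size of the largest group, at the cost of reopening output files many times.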

Join two large files by column in python

I have 2 files with 38374732 lines each, about 3.3 GB each. I am trying to join them on the first column. To do so, I decided to use pandas with the following code that I pulled from Stack Overflow:
import pandas as pd
import sys
a = pd.read_csv(sys.argv[1],sep='\t',encoding="utf-8-sig")
b = pd.read_csv(sys.argv[2],sep='\t',encoding="utf-8-sig")
chunksize = 10 ** 6
for chunk in a(chunksize=chunksize):
    merged = chunk.merge(b, on='Bin_ID')
    merged.to_csv("output.csv", index=False,sep='\t')
However, I am getting a memory error (not surprising). I looked at the chunking code for pandas (something like this: How to read a 6 GB csv file with pandas), but how do I implement it for two files in a loop? I don't think I can chunk the second file, as I need to look up the column across the whole second file. Is there a way around this?
This is already discussed in other posts like the one you mentioned (this, or this, or this).
As it is explained there, I would try to use dask dataframe to load the data and execute the merge, but depending on your PC you may still not be able to do it.
Minimum working example:
import dask.dataframe as dd
# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')
# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()
# Save the merged dataframe
df.to_csv('merged.csv', index=False)
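If neither file fits in memory, one alternative (a swapped-in technique, not what the answer above proposes) is to let an on-disk database perform the join. The sketch below uses only the standard library: both tab-separated files are streamed into SQLite tables, the join key is indexed, and the joined rows are streamed to the output file. The helper names load_tsv and join_tables, the table names, and the sample data are made up for illustration; column names are assumed to be trusted, since they are interpolated into SQL.

```python
import csv
import os
import sqlite3
import tempfile


def load_tsv(conn, table, path):
    """Stream a tab-separated file into a SQLite table (all columns TEXT)."""
    with open(path, newline="", encoding="utf-8-sig") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        marks = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


def join_tables(conn, left, right, key, out_path):
    """Inner-join two tables on key and stream the result to a TSV file."""
    conn.execute(f'CREATE INDEX "idx_{right}" ON "{right}" ("{key}")')
    cur = conn.execute(f'SELECT * FROM "{left}" JOIN "{right}" USING ("{key}")')
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur)  # rows are fetched lazily from the cursor


# Tiny demonstration with made-up data.
tmp = tempfile.mkdtemp()
path_a = os.path.join(tmp, "a.tsv")
path_b = os.path.join(tmp, "b.tsv")
out = os.path.join(tmp, "output.tsv")
with open(path_a, "w", encoding="utf-8") as fh:
    fh.write("Bin_ID\tx\nb1\t1\nb2\t2\nb3\t3\n")
with open(path_b, "w", encoding="utf-8") as fh:
    fh.write("Bin_ID\ty\nb2\tY2\nb3\tY3\nb4\tY4\n")

conn = sqlite3.connect(os.path.join(tmp, "join.db"))  # database lives on disk
load_tsv(conn, "a", path_a)
load_tsv(conn, "b", path_b)
join_tables(conn, "a", "b", "Bin_ID", out)
conn.close()

with open(out, newline="", encoding="utf-8") as fh:
    rows = list(csv.reader(fh, delimiter="\t"))
```

This trades speed for bounded memory: loading and indexing several gigabytes takes time and disk space, but no step needs a whole file in RAM.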

How to use several input files and do parallel processing in python?

I have 30 csv files that I want to feed as input to a for loop in pandas.
The files have names such as fileaa, fileab, fileac, filead, ...
I have multiple input files and would like to produce a single output.
Usually I use read_csv, but it fails here with a memory error.
f = "./file.csv"
df = pd.read_csv(f, sep="/", header=0, dtype=str)
So I would like to try parallel processing in Python 2.7.
You might want to have a look at dask.
Dask docs show a demo on how to read in many csv files and output a single dask dataframe:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
And then MANY (but not all) of the pandas methods are available, e.g.:
df.head()
It would be useful to read more about dask dataframes to understand how they differ from pandas dataframes.
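If the goal is simply one combined output and the memory error comes from loading everything at once, a sequential pandas loop may already be enough, without any parallelism. The sketch below is a hedged illustration: it assumes all input files share the same columns, and the function name combine_csvs, the glob pattern, and the sample data are made up. The output path must not match the input pattern, or the output would be read back as an input.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(pattern, out_path, chunksize=100_000):
    """Stream every file matching pattern into a single CSV, chunk by chunk."""
    first = True
    for path in sorted(glob.glob(pattern)):
        for chunk in pd.read_csv(path, dtype=str, chunksize=chunksize):
            # Header goes out once; later chunks are appended.
            chunk.to_csv(out_path, mode="w" if first else "a",
                         header=first, index=False)
            first = False


# Tiny demonstration with made-up data.
tmp = tempfile.mkdtemp()
pd.DataFrame({"id": ["1", "2"], "val": ["a", "b"]}).to_csv(
    os.path.join(tmp, "fileaa"), index=False)
pd.DataFrame({"id": ["3", "4"], "val": ["c", "d"]}).to_csv(
    os.path.join(tmp, "fileab"), index=False)

out = os.path.join(tmp, "combined.csv")
combine_csvs(os.path.join(tmp, "file*"), out, chunksize=1)
combined = pd.read_csv(out, dtype=str)
```

Any per-chunk filtering or transformation can be added inside the inner loop before the write.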
