Join two large files by column in python - python

I have 2 files with 38374732 lines in each and size 3.3 G each. I am trying to join them on the first column. For doing so I decided to use pandas with the following code that pulled from Stackoverflow:
import pandas as pd
import sys
a = pd.read_csv(sys.argv[1],sep='\t',encoding="utf-8-sig")
b = pd.read_csv(sys.argv[2],sep='\t',encoding="utf-8-sig")
chunksize = 10 ** 6
for chunk in a(chunksize=chunksize):
merged = chunk.merge(b, on='Bin_ID')
merged.to_csv("output.csv", index=False,sep='\t')
However I am getting memory error(not surprising). I looked up at the code with chunks for pandas (something like this How to read a 6 GB csv file with pandas), however how do I implement it for two files in a loop and I don't think I can chunk the second file as I need to lookup for column in the whole second file.Is there a way out for this?

This is already discussed in other posts like the one you mentioned (this, or this, or this).
As it is explained there, I would try to use dask dataframe to load the data and execute the merge, but depending on your PC you may still not be able to do it.
Minimum working example:
import dask.dataframe as dd
# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')
# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()
# Save the merged dataframe
df.to_csv('merged.csv', index=False)

Related

How to perform operations on a Dask dataframe and export the results to a csv?

I have a large input csv file (several GBs) that I import in Dask with a blocksize of 5e6. The input csv contains two columns: "ID" and "Text".
ddf1 = dd.read_csv('REL_Input.csv', names=['ID', 'Text'], blocksize=5e6)
I need to add a third column to ddf1, "Hash", by parsing the existing "Text" column for a string between "Hash=" and ";". In Pandas, I can simply do this:
ddf1['Hash'] = ddf1['Text'].str.extract(r'Hash=(.*?);')
When I do this in Dask, I get an error saying that the "column assignment doesn't support dask.dataframe.core.DataFrame". I tried to use assign but had no luck.
I also need to read multiple large csv files (each several GBs in size) from a directory, concatenate them into another Dask dataframe, ddf2. Each of these csv files have 100s of columns but I only need 2: "Hash" and "Name". Here is the code to create ddf2:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Then, I need to merge the two dataframes on the "Hash" columns--something like this:
ddf3 = ddf1[['ID', 'ddf1_Hash']].merge(ddf2[['ddf2_Hash', 'Name']], left_on='ddf1_Hash', right_on='ddf2_Hash', how='left')
Finally, I need to export ddf3 as a csv:
df3.to_csv('Output.csv')
I looked and it seems I can create the column for ddf1 and perform the merge operation by changing both ddf1 and ddf2 to pandas dfs using compute. However, that's not an option for me due to the sheer size of these dataframes. I also tried using the chunks approach in Pandas, but that does not work due to the "out of memory" error.
Is there a good way to tackle this problem? I'm still learning Python so any help would be appreciated.
UPDATE:
I am able to create the third column and merge the two dataframes. Though, now the issue is that I can't export the merged dataframe as a csv.
Running regex on a string column. The following snippet uses assign:
import dask.dataframe as dd
import pandas as pd
# this step is just to setup a minimum reproducible example
df = pd.DataFrame(list("abcdefghi"), columns=['A'])
ddf = dd.from_pandas(df, npartitions=3)
# this uses assign to extract the relevant content
ddf = ddf.assign(check_c = lambda x: x['A'].str.extract(r'([a-z])'))
# you can see that the computation was done correctly
ddf.compute()
Concatenating csv files. Do csv files have the same structure/columns? If so, you can just use dd.read_csv("path_to_csv_files/*csv"), but if the files have different structures, then your approach is correct:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Merging the dataframes. This is going to be an expensive operation, here's a couple of options to potentially reduce the cost of this:
if any of the dataframes can be put into memory, then it would help to run .compute() to get pandas dataframe before the merge;
setting the key variable as index on one or both dataframes:
ddf1 = ddf1.set_index('Hash')
ddf2 = ddf2.set_index('Hash')
ddf3 = ddf1.merge(ddf2, left_index=True, right_index=True)
Saving csv, by default, dask will save each partition to its own csv file, so your path needs to contain an asterisk, e.g.:
df3.to_csv('Output_*.csv', index=False)
There are other options possible (explicit paths, custom name function, see https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single file, you can use
df3.to_csv('Output.csv', index=False, single_file=True)
However, this option is not supported on all systems, so you might want to check that it works using a small sample first (see documentation).

Joining two very large csv files using pandas

I have two large files which I'd like to join based on certain similar columns. File1 is about 9.5Gb whereas file2 is about 1.5Gb in size. This is the code I used to join these files.
import pandas as pd
df1 = pd.read_csv(r"file1_path", dtype= str)
df2 = pd.read_csv(r"file2_path", dtype = str)
df3 = pd.merge(df1, df2, on = ['Date', 'ProductPartitionID', 'AdGroupID'], how = 'left')
df3=df3[['Date','AccountDescriptiveName','AdGroupID','AdGroupName','CampaignId','CampaignName','ExternalCustomerId', 'ProductGroup','ProductPartitionID', 'Cost','Impressions','Clicks','Conversions','ConversionValue', 'AllConversions','AllConversionValue','SearchAbsoluteTopImpressionShare', 'ShoppingID', 'SKU','CNT_DIST_SESSIONS','SUM_CM_ORDER_AMT','SUM_NEW_CUST_TODAY_IND', 'SUM_ORDER_CNT','SUM_PRODUCT_VIEW_CNT','SUM_REACTIVANT']]
df3.set_index('Date', inplace = True)
df3.to_csv('Shopping_aw_final.csv')
I'm executing this code using Jupyter Notebook. Since these files are so large, it's taking forever to do the job and my laptop is hanging because of this which stalls the process way too many times. I don't think it's possible to get through with the task using pandas in this way. Can anyone suggest any alternative methods I could use to get this done?

Split large dataframes (pandas) into chunks (but after grouping)

I have a large tabular data, which needs to be merged and splitted by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd;
from functools import reduce;
large_df = pd.read_table('large_file.csv', sep=',')
This, basically load the whole data in memory th
# Then I could group the pandas dataframe by some column value (say "block" )
df_by_block = large_df.groupby("block")
# and then write the data by blocks as
for block_id, block_val in df_by_block:
pd.Dataframe.to_csv(df_by_block, "df_" + str(block_id), sep="\t", index=False)
The only problem with above code is memory allocation, which freezes my desktop. I tried to transfer this code to dask but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split as the "block" value changes. But, the only problem is that "large_df.txt" is created in the pipeline upstream by merging several dataframes.
Any suggestions?
Thanks,
Update:
I tried the following approach but, it still seems to be memory heavy:
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())
# groupby the dataframe
large_df_grouped = large_df.set_index('contig')
# now, split dataframes
for items in contig_list:
my_df = large_df_grouped.loc[items].compute().reset_index()
pd.DataFrame.to_csv(my_df, 'dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the code
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to be pulling everything into the memory again.
Any way to improve this code??
but dask doesn't have a neat groupb
Actually, dask does have groupby + user defined functions with OOM reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.

Dask dataframes: reading multiple files & storing filename in column

I regularly use dask.dataframe to read multiple files, as so:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
However, the origin of each row, i.e. which file the data was read from, seems to be forever lost.
Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows. This would be applied to each "partition" / file that is read into the dataframe, when compute is triggered as part of a workflow.
The idea is that different logic can then be applied depending on the source.
Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:
include_path_column:bool or str, optional
Whether or not to include the path to each particular file.
If True a new column is added to the dataframe called path.
If str, sets new column name. Default is False.
Assuming you have or can make a file_list list that has the file path of each csv file, and each individual file fits in RAM (you mentioned 100 rows), then this should work:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
def read_and_label_csv(filename):
# reads each csv file to a pandas.DataFrame
df_csv = pd.read_csv(filename)
df_csv['partition'] = filename.split('\\')[-1]
return df_csv
# create a list of functions ready to return a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)
With some customization, of course. If your csv files are bigger-than-RAM, then a concatentation of dask.DataFrames is probably the way to go.

How to use several input files and do parallel processing in python?

I have 30 csv files. I want to give it as input in for loop, in pandas?
Each file has names such as fileaa, fileab,fileac,filead,....
I have multiple input files and And i would like to receive one output.
Usually i use read_csv but due to memory error, 'read_csv' doesn't work.
f = "./file.csv"
df = pd.read_csv(f, sep="/", header=0, dtype=str)
So i would like to try parallel processing in python 2.7
You might want to have a look at dask.
Dask docs show a demo on how to read in many csv files and output a single dask dataframe:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
And then MANY (but not all) of the pandas methods are available, i.e.:
df.head()
It would be useful to read more on dask dataframe to understand difference with pandas dataframe

Categories