Pandas Processing Large CSV Data - python

I am processing a large data set, at least 8 GB in size, using pandas.
I've run into problems reading the whole set at once, so I read the file chunk by chunk.
My understanding is that chunking the file creates many separate dataframes, so my existing routine only removes duplicate values within each of those dataframes, not across the whole file.
I need to remove the duplicates across the whole data set based on the ['Unique Keys'] column.
I tried to use pd.concat, but I also ran into memory problems, so instead I tried writing the results of each dataframe out and appending them to one csv file.
After running the code, the file doesn't shrink much, so I think my assumption is right that the current routine is not removing duplicates across the whole data set.
I'm a newbie in Python, so it would really help if someone can point me in the right direction.
def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                              low_memory=False)
    # new_df = pd.DataFrame()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'), mode='w', index=False, encoding='utf8')

If the set of unique keys fits in memory:
def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False,
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # create a set of (unique) ids
    all_ids = set()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        # Filter rows with key in all_ids
        df = df.loc[~df['Unique Keys'].isin(all_ids)]
        # Add new keys to the set
        all_ids = all_ids.union(set(df['Unique Keys'].unique()))
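Note that the loop above only filters each chunk in memory; to actually produce the deduplicated file, each filtered chunk still has to be written out. A minimal sketch of that last step (the output path 'deduplicated.csv' and the write_header flag are my additions, not part of the original answer):
write_header = True          # before the for-loop

# ...then inside the for-loop, after the filtering steps shown above:
df.to_csv('deduplicated.csv', mode='a', header=write_header,
          index=False, encoding='utf8')
write_header = False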

It's probably easier not to do it with pandas; the stdlib csv module is enough:
import csv

with open(input_csv_file, newline='') as fin:
    with open(output_csv_file, 'w', newline='') as fout:
        writer = csv.writer(fout)
        seen_keys = set()
        header = True
        for row in csv.reader(fin):
            if header:
                writer.writerow(row)
                header = False
                continue
            # key_indices: position(s) of the 'Unique Keys' column in each row
            key = tuple(row[i] for i in key_indices)
            if not all(key):  # skip rows with an empty key
                continue
            if key not in seen_keys:
                writer.writerow(row)
                seen_keys.add(key)

I think this is a clear example of when you should use Dask or PySpark. Both allow you to work with files that do not fit in memory.
As an example with Dask you could do:
import dask.dataframe as dd
df = dd.read_csv(filename, na_filter=False)
df = df.dropna(subset=["Unique Keys"])
df = df.drop_duplicates(subset=["Unique Keys"])
df.to_csv(filename_out, index=False, encoding="utf8", single_file=True)

Related

Pandas: use column names if they do not exist

Is there a way, without reading the file twice, to check whether a column exists and otherwise use the column names passed in? I have files of the same structure, but some do not contain a header for some reason.
Example with header:
Field1 Field2 Field3
data1 data2 data3
Example without header:
data1 data2 data3
When trying to use the example below, if the file has a header it will make the header the first row of data instead of replacing it.
pd.read_csv('filename.csv', names=col_names)
When trying to use the below, it will drop the first row of data if there is no header in the file.
pd.read_csv('filename.csv', header=0, names=col_names)
My current workaround is to load the file, check whether the columns exist, and if they don't, read the file again.
df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    del df
    df = pd.read_csv('filename.csv', names=col_names)
Is there a better way to handle this data set that doesn't involve potentially reading the file twice?
Just modify your logic so the first time through only reads the first row:
# Load first row and set up keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    kw_args["names"] = col_names
# Load data
df = pd.read_csv('filename.csv', **kw_args)
You can do this with the seek method of the file object:
with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return the file pointer to the beginning of the file
    # do stuff here
    if 'Field1' in headers:
        ...
    else:
        ...
    df = pd.read_csv(csvfile, ...)
The file is read only once.

Need to reduce memory usage when using pd.concat() on multiple df's

I need to read in multiple large .csv's (20k rows x 6k columns) and store them in a dataframe.
This thread has excellent examples that have worked for me in the past with smaller files.
Such as:
pd.concat((pd.read_csv(f,index_col='Unnamed: 0') for f in file_list))
Another, more direct approach that I have attempted is:
frame = pd.DataFrame()
list_ = []
for file_ in file_list:
    print(file_)
    df = pd.read_csv(file_, index_col=0)
    list_.append(df)
df = pd.concat(list_)
However, all the solutions revolve around creating a list of all the csv files as individual df's and then using pd.concat() at the end over all of the df's.
As far as I can tell, it's this approach which is causing a memory error when concat'ing ~20 of these df's.
How could I get past this and perhaps append each df as I go?
Example of file_list:
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_26.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_30.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_25.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_19.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_27.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_18.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_28.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_23.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_03.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_24.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_29.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_04.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_20.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_22.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_06.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_05.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_01.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_02.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_31.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_21.csv
Your CSVs are still manageably sized, so I would assume the issue is with misaligned headers.
I'd recommend reading in your DataFrames without any header, so concatenation is aligned.
list_ = []
for file_ in file_list:
    df = pd.read_csv(file_, index_col=0, skiprows=1, header=None)
    list_.append(df)
df = pd.concat(list_)
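If memory is still tight after aligning the headers, another option (a sketch of my own, not part of the answer above) is to skip the in-memory concat entirely and append each file's rows to a single CSV on disk as you go:
import pandas as pd

out_path = 'combined.csv'  # hypothetical output file
write_header = True
for file_ in file_list:
    df = pd.read_csv(file_, index_col=0)
    df.to_csv(out_path, mode='a', header=write_header)
    write_header = False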

Insert a row in pd.DataFrame without loading the file

The following code works to insert a row (the feature names) into my dataset as the first row:
features = ['VendorID', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']
df = pd.DataFrame(pd.read_csv(path + 'data.csv', sep=','))
df.loc[-1] = features # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
But data.csv is very large (~10 GB), so I am wondering if I can insert the features row directly into the file without loading it. Is that possible?
Thank you
You don't have to load the entire file into memory; use the stdlib csv module's writer functionality to append a row to the end of the file.
import csv
import os

with open(os.path.join(path, 'data.csv'), 'a') as f:
    writer = csv.writer(f)
    writer.writerow(features)
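If the row really has to end up at the top of the file, as in the pandas snippet in the question, the file does have to be rewritten, but it can be streamed rather than loaded into memory. A rough sketch of that alternative (the output name data_with_header.csv is my own placeholder):
import csv
import os
import shutil

src = os.path.join(path, 'data.csv')
dst = os.path.join(path, 'data_with_header.csv')  # hypothetical output file

with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
    csv.writer(fout).writerow(features)  # write the features row first
    shutil.copyfileobj(fin, fout)        # then stream the original contents after it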

Operations on a very large csv with pandas

I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script to read the csv and compute the frequency of each WORD by group, so the output is like:
group freqW1 freqW2
A 1 0
B 1 0
C 0 1
Then do some other operations on the values. The problem is now I have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but because 'TextFileReader' object is not subscriptable, I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
df = pd.read_csv("csvfile.txt", sep=",", header = None,names=
["group","val1","val2","text"])
freq=Counter(df['group'])
word1=df[df["text"].str.contains("WORD1")].groupby("group").size()
word2=df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq),word1,word2], axis=1)
outfile = open("csv_out.txt","w", encoding='utf-8')
df1.to_csv(outfile, sep=",")
outfile.close()
You can specify a chunksize option in the read_csv call. See here for details.
Alternatively, you could use the Python csv library and create your own csv Reader or DictReader, then use that to read in data in whatever chunk size you choose.
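For example, a rough csv.DictReader version of the counting (treat this as a sketch; the column names just mirror the ones used in the pandas code above):
import csv
from collections import defaultdict

# group -> [total rows, rows containing WORD1, rows containing WORD2]
counts = defaultdict(lambda: [0, 0, 0])
with open("csvfile.txt", newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f, fieldnames=["group", "val1", "val2", "text"])
    for row in reader:
        c = counts[row["group"]]
        c[0] += 1
        c[1] += "WORD1" in row["text"]
        c[2] += "WORD2" in row["text"]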
Okay, I misunderstood the chunksize parameter. I solved it by doing this:
frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"], chunksize=1000000)
for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    frame = frame.add(df1, fill_value=0)
outfile = open("csv_out.txt", "w", encoding='utf-8')
frame.to_csv(outfile, sep=",")
outfile.close()

Comparing large (~40GB) sets of textual data using Pandas or an alternative approach

I have a large body of csv data, around 40GB in size, that I need to process (let's call it the 'body'). Each file in this body is a single-column CSV. Each row is a keyword consisting of words and short sentences, e.g.
Dog
Feeding cat
used cars in Brighton
trips to London
.....
This data needs to be compared against another set of files (this one 7GB in size, which I will call 'Removals'); any keywords from the Removals need to be identified and removed from the body. The data in the Removals is similar to what's in the body, i.e.:
Guns
pricless ming vases
trips to London
pasta recipes
........
While I have an approach that will get the job done, it is very slow and could take a good week to finish. It is a multi-threaded approach in which every file from the body is compared in a for loop against the files from the 7GB Removals. It casts the column from each Removals file to a list and then filters the body file to keep any row that is not in that list. The filtered data is then appended to an output file:
def thread_worker(file_):
    removal_path = "removal_files"
    allFiles_removals = glob.glob(removal_path + "/*.csv", recursive=True)
    print(allFiles_removals)
    print(file_)
    file_df = pd.read_csv(file_)
    file_df.columns = ['Keyword']
    for removal_file_ in allFiles_removals:
        print(removal_file_)
        vertical_df = pd.read_csv(removal_file_, header=None)
        vertical_df.columns = ['Keyword']
        vertical_keyword_list = vertical_df['Keyword'].values.tolist()
        file_df = file_df[~file_df['Keyword'].isin(vertical_keyword_list)]
    file_df.to_csv('output.csv', index=False, header=False, mode='a')
Obviously, my main aim is to work out how to get this done faster. Is Pandas even the best way to do this? I tend to default to using it when dealing with CSV files.
IIUC you can do it this way:
# read up "removal" keywords from all CSV files, get rid of duplicates
removals = pd.concat([pd.read_csv(f, sep='~', header=None, names=['Keyword']) for f in removal_files]
ignore_index=True).drop_duplicates()
df = pd.DataFrame()
for f in body_files:
# collect all filtered "body" data (file-by-file)
df = pd.concat([df,
pd.read_csv(f, sep='~', header=None, names=['Keyword']) \
.query('Keyword not in #removals.Keyword')],
ignore_index=True)
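If even the concatenated Removals frame is too heavy, a lighter-weight variant of the same idea (my own sketch, not part of the answer above) keeps the removal keywords in a plain Python set and filters each body file against it with isin, appending the survivors to the output as it goes:
import glob
import pandas as pd

# build the set of keywords to remove; assumes the unique keywords fit in memory
removals = set()
for f in glob.glob("removal_files/*.csv"):
    removals.update(pd.read_csv(f, sep='~', header=None, names=['Keyword'])['Keyword'])

for f in body_files:  # body_files: list of paths to the body CSVs
    df = pd.read_csv(f, sep='~', header=None, names=['Keyword'])
    df[~df['Keyword'].isin(removals)].to_csv('output.csv', mode='a',
                                             index=False, header=False)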
You can probably read them in small chunks and make the text column a category dtype, so that repeated values are stored only once while reading:
from pandas.api.types import CategoricalDtype

# chunksize is the number of rows per chunk
TextFileReader = pd.read_csv(path, chunksize=1000,
                             dtype={"text_column": CategoricalDtype()})
dfList = []
for df in TextFileReader:
    dfList.append(df)
df = pd.concat(dfList, sort=False)
