Operations on a very large csv with pandas - python

I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script that reads the csv and computes the frequencies of WORD1 and WORD2 by group, so the output looks like:
group freqW1 freqW2
A 1 0
B 1 0
C 0 1
Then I do some other operations on the values. The problem is that I now have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but the returned 'TextFileReader' object is not subscriptable, so I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
import pandas as pd
from collections import Counter

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df['group'])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)

outfile = open("csv_out.txt", "w", encoding='utf-8')
df1.to_csv(outfile, sep=",")
outfile.close()

You can specify a chunksize option in the read_csv call; see the pandas documentation for details.
Alternatively, you could use the Python csv library, create your own csv reader or DictReader, and then use it to read in data in whatever chunk size you choose.
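For example, here is a minimal sketch of the csv-module approach (assuming the same four unnamed columns as in the question; counting is done in plain dictionaries so the whole file never has to sit in memory):
import csv
from collections import defaultdict

# Per-group counts, built up row by row so the whole file never sits in memory.
freq = defaultdict(lambda: {"freqW1": 0, "freqW2": 0})

with open("csvfile.txt", newline='', encoding='utf-8') as f:
    reader = csv.reader(f)  # columns: group, val1, val2, text
    for group, val1, val2, text in reader:
        if "WORD1" in text:
            freq[group]["freqW1"] += 1
        if "WORD2" in text:
            freq[group]["freqW2"] += 1

with open("csv_out.txt", "w", newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(["group", "freqW1", "freqW2"])
    for group in sorted(freq):
        writer.writerow([group, freq[group]["freqW1"], freq[group]["freqW2"]])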

Okay, I misunderstood the chunksize parameter. I solved it by doing this:
import pandas as pd
from collections import Counter

frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=1000000)
for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    frame = frame.add(df1, fill_value=0)

outfile = open("csv_out.txt", "w", encoding='utf-8')
frame.to_csv(outfile, sep=",")
outfile.close()


How to read data (using pandas?) so that it is correctly formatted?

I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved on one line, and a CSV file is not available.
I would like to have it as a data frame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips would be much appreciated, thank you very much.
Is a pandas dataframe even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd

# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)

# Get the first element of the results list, and the first element of the series list.
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']

df = pd.DataFrame(values, columns=columns)
print(df)
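As the comment above hints, ijson can stream the file instead of loading it all at once. Here is a rough sketch under that assumption (the ijson prefix strings below are derived from the structure of test.json and may need adjusting for the real file):
import ijson
import pandas as pd

# First pass: the "columns" list is small, so just grab it.
with open("test.json", "rb") as f:
    columns = next(ijson.items(f, 'results.item.series.item.columns'))

# Second pass: stream each row of the nested "values" array.
rows = []
with open("test.json", "rb") as f:
    for row in ijson.items(f, 'results.item.series.item.values.item'):
        rows.append(row)

df = pd.DataFrame(rows, columns=columns)
print(df)
For a genuinely huge file you would process the streamed rows in batches instead of appending them all to one list.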

Pandas Processing Large CSV Data

I am processing a large data set, at least 8GB in size, using pandas.
I've encountered a problem reading the whole set at once, so I read the file chunk by chunk.
In my understanding, chunking the whole file creates many different dataframes. So my existing routine only removes the duplicate values within each of those dataframes, not the duplicates across the whole file.
I need to remove the duplicates on this whole data set based on the ['Unique Keys'] column.
I tried to use pd.concat, but I also ran into memory problems, so I tried writing the dataframes out to a csv file and appending each result to it.
After running the code, the file doesn't shrink much, so I think my assumption is right: the current routine is not removing all the duplicates across the whole data set.
I'm a newbie in Python so it would really help if someone can point me in the right direction.
from os.path import join
import pandas as pd

def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                              low_memory=False)
    # new_df = pd.DataFrame()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        # file_path and output_name are defined elsewhere in the script
        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'), mode='w', index=False, encoding='utf8')
If the set of unique keys fits in memory:
import pandas as pd

# output_filename is added here so the filtered chunks actually get written out
def removeduplicates(filename, output_filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False,
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # create a set of (unique) ids, shared across all chunks
    all_ids = set()
    for i, df in enumerate(df_iterator):
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        # Filter out rows whose key was already seen in an earlier chunk
        df = df.loc[~df['Unique Keys'].isin(all_ids)]
        # Add the new keys to the set
        all_ids = all_ids.union(set(df['Unique Keys'].unique()))
        # Append the surviving rows to the output file
        df.to_csv(output_filename, mode='w' if i == 0 else 'a',
                  header=(i == 0), index=False, encoding='utf8')
It is probably easier not to do this with pandas at all:
import csv

with open(input_csv_file, newline='') as fin:
    with open(output_csv_file, 'w', newline='') as fout:
        writer = csv.writer(fout)
        seen_keys = set()
        header = True
        for row in csv.reader(fin):
            if header:
                writer.writerow(row)
                header = False
                continue
            # key_indices: positions of the 'Unique Keys' column(s) in each row
            key = tuple(row[i] for i in key_indices)
            if not all(key):  # skip if key is empty
                continue
            if key not in seen_keys:
                writer.writerow(row)
                seen_keys.add(key)
I think this is a clear example of when you should use Dask or PySpark. Both allow you to read files that do not fit in memory.
As an example with Dask you could do:
import dask.dataframe as dd
df = dd.read_csv(filename, na_filter=False)
df = df.dropna(subset=["Unique Keys"])
df = df.drop_duplicates(subset=["Unique Keys"])
df.to_csv(filename_out, index=False, encoding="utf8", single_file=True)
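Since PySpark is mentioned as well, here is a rough equivalent sketch (assuming a local SparkSession; filename, filename_out and the 'Unique Keys' column are the same names used in the Dask example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()

df = spark.read.csv(filename, header=True)
df = df.dropna(subset=["Unique Keys"])
df = df.dropDuplicates(["Unique Keys"])

# coalesce(1) forces a single output file, which can be slow for very large data
df.coalesce(1).write.csv(filename_out, header=True, mode="overwrite")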

Comparing large (~40GB) textual data sets using Pandas or an alternative approach

I have a large body of csv data, around 40GB in size, that I need to process (let's call it the 'body'). Each file in this body is a single-column CSV. Each row is a keyword consisting of words or short sentences, e.g.
Dog
Feeding cat
used cars in Brighton
trips to London
.....
This data needs to be compared against another set of files (this one 7GB in size, which I will call 'Removals'); any keywords from the Removals need to be identified and removed from the body. The data in the Removals is similar to what's in the body, i.e.:
Guns
pricless ming vases
trips to London
pasta recipes
........
While I have an approach that will get the job done, it is very slow and could take a good week to finish. It is a multi-threaded approach in which every file from the body is compared in a for loop against each file from the 7GB Removals set. It casts the column from the Removals file to a list and then filters the body file to keep any row that is not in that list. The filtered data is then appended to an output file:
import glob
import pandas as pd

def thread_worker(file_):
    removal_path = "removal_files"
    allFiles_removals = glob.glob(removal_path + "/*.csv", recursive=True)
    print(allFiles_removals)
    print(file_)

    file_df = pd.read_csv(file_)
    file_df.columns = ['Keyword']
    for removal_file_ in allFiles_removals:
        print(removal_file_)
        removal_df = pd.read_csv(removal_file_, header=None)
        removal_df.columns = ['Keyword']
        removal_keyword_list = removal_df['Keyword'].values.tolist()
        file_df = file_df[~file_df['Keyword'].isin(removal_keyword_list)]
    file_df.to_csv('output.csv', index=False, header=False, mode='a')
Obviously, my main aim is to work out how to get this done faster. Is Pandas even the best way to do this? I tend to default to using it when dealing with CSV files.
IIUC you can do it this way:
# removal_files and body_files are lists of the corresponding file paths
# read up "removal" keywords from all CSV files, get rid of duplicates
removals = pd.concat([pd.read_csv(f, sep='~', header=None, names=['Keyword'])
                      for f in removal_files],
                     ignore_index=True).drop_duplicates()

df = pd.DataFrame()
for f in body_files:
    # collect all filtered "body" data (file-by-file)
    df = pd.concat([df,
                    pd.read_csv(f, sep='~', header=None, names=['Keyword'])
                      .query('Keyword not in @removals.Keyword')],
                   ignore_index=True)
You can probably read them in small chunks and make the text column a categorical dtype, which collapses duplicate values while reading:
import pandas as pd
from pandas.api.types import CategoricalDtype

# chunksize sets the number of rows per chunk
TextFileReader = pd.read_csv(path, chunksize=1000,
                             dtype={"text_column": CategoricalDtype()})

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)

How to extract a single row from multiple CSV files to a new file

I have hundreds of CSV files on my disk, with one file added daily, and I want to extract one row from each of them and put those rows in a new file. Then I want to add values to that same file daily. The CSV files look like this:
business_day,commodity,total,delivery,total_lots
.
.
20160831,CTC,,201710,10
20160831,CTC,,201711,10
20160831,CTC,,201712,10
20160831,CTC,Total,,385
20160831,HTC,,201701,30
20160831,HTC,,201702,30
.
.
I want to fetch the row that contains 'Total' from each file. The new file should look like:
business_day,commodity,total,total_lots
20160831,CTC,Total,385
20160901,CTC,Total,555
.
.
The raw files on my disk are named '20160831_foo.CSV', '20160901_foo.CSV', etc.
After Googling this I have not yet seen any examples of how to extract only one row from a CSV file. Any hints/help much appreciated. Happy to use pandas if that makes life easier.
I ended up with the following:
import pandas as pd
import glob

list_ = []
filenames = glob.glob('c:\\Financial Data\\*_DAILY.csv')
for filename in filenames:
    df = pd.read_csv(filename, index_col=None,
                     usecols=['business_day', 'commodity', 'total', 'total_lots'],
                     parse_dates=['business_day'], infer_datetime_format=True)
    df = df[(df['commodity'] == 'CTC') & (df['total'] == 'Total')]
    list_.append(df)

df = pd.concat(list_, ignore_index=True)
df['total_lots'] = df['total_lots'].astype(int)
df = df.sort_values(['business_day'])
df = df.set_index('business_day')
Then I save it as my required file.
Read the csv files and process them directly like so:
import csv

with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # do something here with `row`
        break
I would recommend appending the rows you want onto a list as you process each file, and then passing that list to a pandas DataFrame, which will simplify your data manipulations a lot. A rough sketch of that idea is below.
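For instance, a minimal sketch (the glob pattern and column positions are assumptions based on the file naming and layout shown in the question):
import csv
import glob
import pandas as pd

rows = []
for path in glob.glob('*_foo.CSV'):
    with open(path, newline='') as f:
        for row in csv.reader(f):
            # keep only the CTC summary row, e.g. 20160831,CTC,Total,,385
            if len(row) >= 5 and row[1] == 'CTC' and row[2] == 'Total':
                rows.append([row[0], row[1], row[2], row[4]])
                break  # one row per file is enough

df = pd.DataFrame(rows, columns=['business_day', 'commodity', 'total', 'total_lots'])
df.to_csv('totals.csv', index=False)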

Is there a size limit for Pandas read_table()?

Let's say I have a .dat file, filename.dat, and I wish to read this into a Pandas Dataframe:
import pandas as pd
df = pd.read_table('filename.dat')
Is there a size limit regarding this? I was hoping to save the columns of a dataframe individually for a file of size 1 TB. Is this possible?
To expand on the usage of chunksize mentioned in the comments, I'd do something like the following:
chunks = pd.read_table('filename.dat', chunksize=10**5)
fileout = 'filename_{}.dat'

for i, chunk in enumerate(chunks):
    mode = 'w' if i == 0 else 'a'
    header = i == 0
    for col in chunk.columns:
        chunk[col].to_csv(fileout.format(col), index=False, header=header, mode=mode)
You'll probably want to experiment with the chunksize parameter to see what's most efficient for your data.
The reason I'm using enumerate is to create a new file with a header when the first chunk is read in, and append without a header for subsequent chunks.
