Fast datetime parsing with multiple columns, read_csv - python

I am reading in a large csv file (10GB+). The raw data loaded from the csv looks like:
SYMBOL DATE TIME PRICE CORR COND
0 BA 20090501 9:29:46 40.24 0 F
1 BA 20090501 9:29:59 40.38 0 F
2 BA 20090501 9:30:01 40.31 0 O
3 BA 20090501 9:30:01 40.31 0 Q
4 BA 20090501 9:30:08 40.38 0 F
My goal is to combine the DATE and TIME columns into a single DATE_TIME column when reading in the data via the read_csv function.
Loading the data first and doing it manually is not an option due to memory constraints.
Currently, I am using
data = pd.read_csv('200905.csv',
                   parse_dates=[['DATE', 'TIME']],
                   infer_datetime_format=True,
                   )
However, using the default dateutil.parser.parser as above increases the loading time by about 4x compared to just loading the raw csv.
A promising direction is the lookup approach described in
Pandas: slow date conversion, since my dataset has a lot of repeated dates.
My issue, however, is how to best exploit the repeated structure of the DATE column while combining it with TIME into a DATE_TIME column (which is likely to have very few repeated entries).
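For reference, here is a minimal sketch of that lookup idea combined with a timedelta for the TIME column (column and file names are taken from the question above; this is untested on the real file and not benchmarked, so treat it as a starting point rather than a definitive answer):
import pandas as pd

# Read DATE and TIME as plain strings; skip datetime parsing in read_csv.
data = pd.read_csv('200905.csv', dtype={'DATE': str, 'TIME': str})

# Parse each unique date only once, then map the result back onto all rows.
unique_dates = data['DATE'].unique()
date_lookup = pd.Series(pd.to_datetime(unique_dates, format='%Y%m%d'), index=unique_dates)

# TIME strings like '9:29:46' parse as time-of-day offsets.
data['DATE_TIME'] = data['DATE'].map(date_lookup) + pd.to_timedelta(data['TIME'])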

Related

Iterate through two dataframes and create a dictionary where each key is a string from one dataframe that is a substring of strings found in the second dataframe (values)

I have two dataframes. One is very large and has over 4 million rows of data while the other has about 26k. I'm trying to create a dictionary where the keys are the strings of the smaller dataframe. This dataframe (df1) contains substrings or incomplete names, and the larger dataframe (df2) contains full names/strings; I want to check if the substring from df1 is in strings in df2 and then create my dict.
No matter what I try, my code takes long and I keep looking for faster ways to iterate through the df's.
org_dict = {}
for rowi in df1.itertuples():
    part = rowi.part_name
    full_list = []
    for rowj in df2.itertuples():
        if part in rowj.full_name:
            full_list.append(rowj.full_name)
    org_dict[part] = full_list
Am I missing a break or is there a faster way to iterate through really large dataframes of way over 1 million rows?
Sample data:
df1
part_name
0 aaa
1 bb
2 856
3 cool
4 man
5 a0
df2
full_name
0 aaa35688d
1 coolbbd
2 8564578
3 coolaaa
4 man4857684
5 a03567
expected output:
{'aaa':['aaa35688d','coolaaa'],
'bb':['coolbbd'],
'856':['8564578']
...}
etc
The issue here is that nested for loops perform very badly time-wise as the data grows larger. Luckily, pandas allows us to perform vectorised operations across rows/columns.
I can't properly test without having access to a sample of your data, but I believe this does the trick and performs much faster:
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr)].tolist() for substr in df1.part_name}
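One caveat, in case any part names contain regex metacharacters: str.contains interprets its pattern as a regular expression by default, so you may want to pass regex=False to force literal substring matching (same comprehension, just hedged against that edge case):
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist()
            for substr in df1.part_name}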

Merge Many CSV Files Leads to Kernel Death

I need to preprocess a lot of csv tables to apply them to an autoencoder.
By using pandas, I read all these tables as data frames. Then I need to merge them based on a shared key (id): merged = pd.merge(df, df1, on='id', how='left').
However, after a couple of merges the size of the resulting table becomes very big and kills the kernel. This is the last shape I got for the merge result before the kernel died: merged.shape = (29180782, 71). And I need to merge many more tables.
All the tables look like this, but with more rows and columns (the values in each column represent categories):
df:                      df1:
   id    a  b  c  d         id    e  f  g  h
0  2000  1  1  1  3      0  2000  1  1  1  1
1  2001  2  1  1  3      1  2001  2  0  0  3
2  2002  1  3  1  2      2  2002  1  3  1  2
3  2003  2  2  1  1      3  2003  1  0  1  1
I have tried feather but it doesn't help. I also tried to downcast the column types, df['a'] = pd.to_numeric(df['a'], downcast='unsigned'), but I saw no difference in table size. The last solution that came to my mind was using chunks. I tried the code below with different chunk sizes, but the kernel died again:
for chunk in pd.read_csv('df1', chunksize=100000, low_memory=False):
    df = pd.merge(df, chunk, on='id', how='left')
So I decided to write to a file instead of using a variable, to prevent the kernel from dying. First, I saved the last merged table to a csv file so I could read it back in chunks for the next merging step.
lastmerged.to_csv(r'/Desktop/lastmerged.csv', index=False)
And then:
from csv import writer

for chunk in pd.read_csv('lastmerged.csv', chunksize=100000, low_memory=False):
    newmerge = pd.merge(df1, chunk, on='id', how='right')
    with open('newmerge.csv', 'a+', newline='') as write_obj:
        csv_writer = writer(write_obj)
        for i in range(len(newmerge)):
            csv_writer.writerow(newmerge.loc[i, :])
I did try this piece of code on some small tables and got the desired result, but for my real tables it ran for so long that I had to stop the kernel :| Besides, the code doesn't seem efficient!
In a nutshell, my question is how to merge tables as they grow larger and larger without killing the kernel or running out of memory.
ps. I have already tried google colab, Jupyter, and terminal. They all work the same.
You can collect them in a list and use
total_df = pd.concat([df1,df2,df3,df4...,dfn], axis=1)
You can also use
for name in filename:
    df = pd.concat([df, pd.read_csv(name, index_col=False)])
This way, you may get past the memory problem.
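As a sketch of the list-collecting idea (filenames here is an assumed list of your csv paths), building the list first and concatenating once is usually gentler on memory than growing df inside the loop:
import pandas as pd

frames = [pd.read_csv(name, index_col=False) for name in filenames]  # read each file once
total_df = pd.concat(frames, ignore_index=True)                      # single concat at the end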
You can convert your pandas dataframes to dask dataframes and then merge them with dd.merge().
import dask.dataframe as dd
d_df = dd.from_pandas(df, chunksize=10000)
For data that fits into RAM, pandas is often faster and easier to use than Dask DataFrame, but when you run into RAM limits, Dask can spill to disk and work with data larger than memory.
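As a hedged end-to-end sketch of that idea (file and column names are assumed from the question, not tested on the real data):
import dask.dataframe as dd

# Read the large tables lazily instead of loading them fully into pandas first.
left = dd.read_csv('df.csv')
right = dd.read_csv('df1.csv')

# The merge itself is lazy; nothing is computed yet.
merged = dd.merge(left, right, on='id', how='left')

# Write the result to disk in partitions instead of materialising
# a ~29-million-row frame in RAM.
merged.to_csv('merged-*.csv', index=False)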

Calculate Pandas dataframe column with replace function

I'm working on calculating a field in a Pandas dataframe. I'm learning Python and trying to find the best method.
The dataframe is quite big, over 55 million rows. It has a few columns, among which date and failure are of interest to me. The dataframe looks like this:
date failure
2018-09-09 0
2016-05-12 1
2013-12-12 1
2018-05-12 1
2018-05-12 1
I want to calculate failure_date (if failure = 1 then failure_date = date).
I tried something like this:
import pandas as pd
abc = pd.read_pickle('data_abc.pkl')
abc['failure_date'] = abc['failure'].replace(1, abc['date'])
The session has been busy for a very long time (1.5 h) with no result so far. Is this the right approach?
Is there a more effective way of calculating a column based on a condition on other columns?
This code adds a column "failure_date" and sets it to the failure date for the failures. It does not address "non-failures".
abc.loc[abc['failure']==1, 'failure_date'] = abc['date']
If you don't mind discarding the rest of the dataframe, you could get all the dates where failure is 1 like this:
abc = abc[abc['failure'] == 1]
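For completeness, here is a small self-contained version of the .loc approach on the sample data, with NaT for non-failures (just an illustration; not benchmarked on 55 million rows):
import pandas as pd

abc = pd.DataFrame({'date': pd.to_datetime(['2018-09-09', '2016-05-12', '2013-12-12']),
                    'failure': [0, 1, 1]})

abc['failure_date'] = pd.NaT                                # non-failures stay missing
abc.loc[abc['failure'] == 1, 'failure_date'] = abc['date']  # copy date only where failure == 1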

Create Multiple New Columns Based on Pipe-Delimited Column in Pandas

I have a pandas dataframe with a pipe delimited column with an arbitrary number of elements, called Parts. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).
For each row, I want to create a new column that acts as an indicator variable for each element of the pipe delimited list. For instance, the row

... 'Parts' ...
... '12|34|56' ...

should be transformed to

... 'Part_12'  'Part_34'  'Part_56' ...
...     1          1          1     ...
Because there are a lot of unique parts, these columns are obviously going to be sparse - mostly zeros, since each row only contains a small fraction of the unique parts.
I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries).
I've also looked at pandas' melt, but I don't think that's the appropriate tool.
The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.
Here's a better illustration of the data
ID    YEAR  AMT      PARTZ
1202  2007  99.34
9321  1988  1012.99  2031|8942
2342  2012  381.22   1939|8321|Amx3
You can use get_dummies and add_prefix:
df.Parts.str.get_dummies().add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 1 1 1
Edit for comment and counting duplicates.
df = pd.DataFrame({'Parts':['12|34|56|12']}, index=[0])
pd.get_dummies(df.Parts.str.split('|',expand=True).stack()).sum(level=0).add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 2 1 1
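Since the question points out that the indicator matrix will be mostly zeros, one optional refinement (a sketch, assuming pandas 0.25+ where pd.SparseDtype is available, and reusing the df.Parts name from the answer above) is to store the dummies with a sparse dtype to keep memory down:
dummies = df.Parts.str.get_dummies().add_prefix('Part_').astype(pd.SparseDtype('int8', 0))
df = df.join(dummies)  # attach the sparse indicator columns to the original frame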

Using Pandas to Manipulate Multiple Columns

I have a 30+ million row data set that I need to apply a whole host of data transformation rules to. For this task, I am trying to explore Pandas as a possible solution because my current solution isn't very fast.
Currently, I am performing a row by row manipulation of the data set, and then exporting it to a new table (CSV file) on disk.
There are 5 functions users can perform on the data within a given column:
remove white space
capitalize all text
format date
replace letter/number
replace word
My first thought was to use the dataframe's apply or applymap, but this can only be used on a single column.
Is there a way to use apply or applymap to many columns instead of just one?
Is there a better workflow I should consider, since I could be doing manipulations on 1:n columns in my dataset, where the maximum number of columns is currently around 30?
Thank you
You can use a list comprehension with concat if you need to apply functions that work only on a Series:
import pandas as pd

data = pd.DataFrame({'A': [' ff ', '2', '3'],
                     'B': [' 77', 's gg', 'd'],
                     'C': ['s', ' 44', 'f']})
print (data)
      A     B    C
0   ff     77    s
1     2  s gg   44
2     3     d    f

print (pd.concat([data[col].str.strip().str.capitalize() for col in data], axis=1))
    A     B   C
0  Ff    77   S
1   2  S gg  44
2   3     D   F
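If the same element-wise cleaning should hit several (or all) columns at once, applymap on a column subset is another option; a minimal sketch reusing the same toy frame (the column list here is just an example):
cols = ['A', 'B', 'C']  # whichever 1:n columns need the transformation
data[cols] = data[cols].applymap(lambda s: s.strip().capitalize())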
