I would like to improve the runtime of a Python program that takes a pandas DataFrame and creates two new variables (group and group date) based on several conditions (the code and logic are below). The code works fine on small datasets, but on a large dataset (20 million rows) it takes 7+ hours to run.
Logic behind code
If the ID is the first occurrence of that ID, then group = 1 and groupdate = date.
Else, if it is not the first occurrence and date - previous date > 10 or date - previous groupdate > 10, then group = previous group number + 1 and groupdate = date.
Else, if it is not the first occurrence and date - previous date <= 10 or date - previous groupdate <= 10, then group = previous group number and groupdate = previous groupdate.
Sample Code
import pandas as pd
import numpy as np
ID = ['a1','a1','a1','a1','a1','a2','a2','a2','a2','a2']
DATE = ['1/1/2014','1/15/2014','1/20/2014','1/22/2014','3/10/2015', \
'1/13/2015','1/20/2015','1/28/2015','2/28/2015','3/20/2015']
ITEM = ['P1','P2','P3','P4','P5','P1','P2','P3','P4','P5']
df = pd.DataFrame({"ID": ID, "DATE": DATE, "ITEM": ITEM})
df['DATE']= pd.to_datetime(df['DATE'], format = '%m/%d/%Y')
ids=df.ID
df['first_id'] = np.where((ids!=ids.shift(1)), 1, 0)
df['last_id'] = np.where((ids!=ids.shift(-1)), 1, 0)
print(df); print('\n')
for i in range(0, len(df)):
    if df.loc[i, 'first_id'] == 1:
        df.loc[i, 'group'] = 1
        df.loc[i, 'groupdate'] = df.loc[i, 'DATE']
    elif df.loc[i, 'first_id'] == 0 and ((df.loc[i, 'DATE'] - df.loc[i-1, 'DATE']).days > 10) or \
            ((df.loc[i, 'DATE'] - df.loc[i-1, 'groupdate']).days > 10):
        df.loc[i, 'group'] = df.loc[i-1, 'group'] + 1
        df.loc[i, 'groupdate'] = df.loc[i, 'DATE']
    else:
        if df.loc[i, 'first_id'] == 0 and ((df.loc[i, 'DATE'] - df.loc[i-1, 'DATE']).days <= 10) or \
                ((df.loc[i, 'DATE'] - df.loc[i-1, 'groupdate']).days <= 10):
            df.loc[i, 'group'] = df.loc[i-1, 'group']
            df.loc[i, 'groupdate'] = df.loc[i-1, 'groupdate']
print(df); print('\n')
Output
ID DATE ITEM GROUP GROUPDATE
1 1/1/2014 P1 1 1/1/2014
1 1/15/2014 P2 2 1/15/2014
1 1/20/2014 P3 2 1/15/2014
1 1/22/2014 P4 2 1/15/2014
1 3/10/2015 P5 3 3/10/2015
2 1/13/2015 P1 1 1/13/2015
2 1/20/2015 P2 1 1/13/2015
2 1/28/2015 P3 2 1/28/2015
2 2/28/2015 P4 3 2/28/2015
2 3/20/2015 P5 4 3/20/2015
Please don't take this as a full answer but as a work in progress and a starting point.
I think your code generates some problems when you move from one group to the next.
You should avoid looping over the rows, so I use groupby instead.
I'm not implementing your logic about the previous groupdate here.
Generate Data
import pandas as pd
import numpy as np
ID = ['a1','a1','a1','a1','a1','a2','a2','a2','a2','a2']
DATE = ['1/1/2014','1/15/2014','1/20/2014','1/22/2014','3/10/2015', \
'1/13/2015','1/20/2015','1/28/2015','2/28/2015','3/20/2015']
ITEM = ['P1','P2','P3','P4','P5','P1','P2','P3','P4','P5']
df = pd.DataFrame({"ID": ID, "DATE": DATE, "ITEM": ITEM})
df['DATE']= pd.to_datetime(df['DATE'], format = '%m/%d/%Y')
ids=df.ID
df['first_id'] = np.where((ids!=ids.shift(1)), 1, 0)
Function that work for every "ID"
def fun(x):
    # To compare with the previous date I add a column
    x["PREVIOUS_DATE"] = x["DATE"].shift(1)
    x["DATE_DIFF1"] = (x["DATE"] - x["PREVIOUS_DATE"]).dt.days
    # These are your simplified conditions
    conds = [x["first_id"] == 1,
             ((x["first_id"] == 0) & (x["DATE_DIFF1"] > 10)),
             ((x["first_id"] == 0) & (x["DATE_DIFF1"] <= 10))]
    # choices for date
    choices_date = [x["DATE"].astype(str),
                    x["DATE"].astype(str),
                    '']
    # choices for group
    # To get the expected output we'll need a cumsum
    choices_group = [1, 1, 0]
    # I use np.select; you can check how it works
    x["group_date"] = np.select(conds, choices_date, default="")
    x["group"] = np.select(conds, choices_group, default=0)
    # some group_date are empty so I fill them
    x["group_date"] = x["group_date"].astype("M8[us]").fillna(method="ffill")
    # Here is the cumsum
    x["group"] = x["group"].cumsum()
    # Remove columns we don't need
    x = x.drop(["first_id", "PREVIOUS_DATE", "DATE_DIFF1"], axis=1)
    return x
How to use
df = df.groupby("ID").apply(fun)
ID DATE ITEM group_date group
0 a1 2014-01-01 P1 2014-01-01 1
1 a1 2014-01-15 P2 2014-01-15 2
2 a1 2014-01-20 P3 2014-01-15 2
3 a1 2014-01-22 P4 2014-01-15 2
4 a1 2015-03-10 P5 2015-03-10 3
5 a2 2015-01-13 P1 2015-01-13 1
6 a2 2015-01-20 P2 2015-01-13 1
7 a2 2015-01-28 P3 2015-01-13 1
8 a2 2015-02-28 P4 2015-02-28 2
9 a2 2015-03-20 P5 2015-03-20 3
Speed up
Here you could think about using dask, modin, or cuDF (see comparisons of modin vs cuDF). But you should probably first work on how your data is organized before processing it; correctly partitioning the data can speed things up considerably.
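If you do need the full logic, including the comparison with the previous groupdate, the recurrence is inherently sequential, so another option is to keep the loop but run it over plain NumPy arrays with a JIT compiler. This is only a hedged sketch of that idea (it assumes numba is installed; assign_groups is a name I made up, not anything from the question):

import numpy as np
import pandas as pd
from numba import njit  # assumption: numba is available

@njit
def assign_groups(codes, days):
    # Sequential pass over integer arrays implementing the question's rules:
    # start a new group when the gap to the previous date OR to the previous
    # groupdate exceeds 10 days; otherwise keep the previous group and groupdate.
    n = len(days)
    group = np.empty(n, dtype=np.int64)
    gdate = np.empty(n, dtype=np.int64)
    for i in range(n):
        if i == 0 or codes[i] != codes[i - 1]:
            group[i] = 1
            gdate[i] = days[i]
        elif (days[i] - days[i - 1] > 10) or (days[i] - gdate[i - 1] > 10):
            group[i] = group[i - 1] + 1
            gdate[i] = days[i]
        else:
            group[i] = group[i - 1]
            gdate[i] = gdate[i - 1]
    return group, gdate

# df must already be sorted by ID and DATE at this point
codes = pd.factorize(df['ID'])[0]
days = df['DATE'].values.astype('datetime64[D]').astype(np.int64)
group, gdate = assign_groups(codes, days)
df['group'] = group
df['groupdate'] = gdate.astype('datetime64[D]')

Working on integer "days since epoch" keeps the jitted loop simple and avoids datetime handling inside numba.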
Related
I have a sample dataframe as shown below
df = pd.DataFrame({'Adm DateTime': ['02/25/2012 09:40:00', '03/05/1996 09:41:00', '11/12/2010 10:21:21', '31/05/2012 04:21:31', '21/07/2019 13:15:02', '31/10/2020 08:21:00'],
                   's_id': [1, 1, 1, 1, 2, 2],
                   't_id': ['t1', 't2', 't3', 't3', 't4', 't5']})
df['Adm DateTime'] = pd.to_datetime(df['Adm DateTime'])
I would like to generate a row number for each group (of s_id)
I tried the below
df['R_N'] = df.sort_values(['Adm DateTime'], ascending=True).groupby(['s_id']).cumcount() + 1
While this works on the sample data, it throws the error below on my original data.
TypeError: '<' not supported between instances of 'datetime.datetime' and 'str'
But there are no NA values in my original Adm DateTime column, and the data type of the column itself is datetime64[ns]. I don't explicitly perform any comparison between dates (except the sorting, which may do so internally).
Why does this error happen, and how can I identify the records that cause it?
You can try the split solution below. Your solution chains the sorting and the creation of the new column, so pandas has to internally reorder the rows back to the original index; I suspect that is the problem here (and the output rows are not sorted).
df['R_N'] = (df.sort_values(['Adm DateTime'], ascending=True)
.groupby(['s_id']).cumcount() + 1)
print (df)
Adm DateTime s_id t_id R_N
0 2012-02-25 09:40:00 1 t1 3
1 1996-03-05 09:41:00 1 t2 1
2 2010-11-12 10:21:21 1 t3 2
3 2012-05-31 04:21:31 1 t3 4
4 2019-07-21 13:15:02 2 t4 1
5 2020-10-31 08:21:00 2 t5 2
If you need this output, one possible idea is to create unique index values first:
df = df.reset_index(drop=True)
df['R_N'] = (df.sort_values(['Adm DateTime'], ascending=True)
.groupby(['s_id']).cumcount() + 1)
My solution sorts first and then creates the new column on the sorted DataFrame, so no reordering of rows is necessary (and the output rows are sorted):
df['Adm DateTime'] = pd.to_datetime(df['Adm DateTime'])
df = df.sort_values(['Adm DateTime'])
df['R_N'] = df.groupby(['s_id']).cumcount() + 1
print (df)
Adm DateTime s_id t_id R_N
1 1996-03-05 09:41:00 1 t2 1
2 2010-11-12 10:21:21 1 t3 2
0 2012-02-25 09:40:00 1 t1 3
3 2012-05-31 04:21:31 1 t3 4
4 2019-07-21 13:15:02 2 t4 1
5 2020-10-31 08:21:00 2 t5 2
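As for the second part of the question, identifying the records that trigger the error: this is a hedged sketch assuming the column is really of object dtype and holds a mix of strings and datetimes (which is what the error message suggests), rather than true datetime64[ns].

# Count the Python types actually stored in the column
print(df['Adm DateTime'].map(type).value_counts())

# Show the rows whose value is still a plain string
mask = df['Adm DateTime'].map(lambda v: isinstance(v, str))
print(df[mask])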
I have seen a bunch of examples on Stack Overflow on how to modify a single column in a dataframe based on a condition, but I cannot figure out how to modify multiple columns based on a single condition.
If I have a dataframe generated based on the below code -
import random
import pandas as pd

random_events = ('SHOT', 'MISSED_SHOT', 'GOAL')

events = list()
for i in range(6):
    event = dict()
    event['event_type'] = random.choice(random_events)
    event['coords_x'] = round(random.uniform(-100, 100), 2)
    event['coords_y'] = round(random.uniform(-42.5, 42.5), 2)
    events.append(event)

df = pd.DataFrame(events)
print(df)
coords_x coords_y event_type
0 4.07 -21.75 GOAL
1 -2.46 -20.99 SHOT
2 99.45 -15.09 MISSED_SHOT
3 78.17 -10.17 GOAL
4 -87.24 34.40 GOAL
5 -96.10 30.41 GOAL
What I want to accomplish is the following (in pseudo-code) on each row of the DataFrame -
if df['coords_x'] < 0:
    df['coords_x'] * -1
    df['coords_y'] * -1
Is there a way to do this via an df.apply() function that I am missing?
Thank you in advance for your help!
IIUC, you can do this with loc, avoiding the need for apply:
>>> df
coords_x coords_y event_type
0 4.07 -21.75 GOAL
1 -2.46 -20.99 SHOT
2 99.45 -15.09 MISSED_SHOT
3 78.17 -10.17 GOAL
4 -87.24 34.40 GOAL
5 -96.10 30.41 GOAL
>>> df.loc[df.coords_x < 0, ['coords_x', 'coords_y']] *= -1
>>> df
coords_x coords_y event_type
0 4.07 -21.75 GOAL
1 2.46 20.99 SHOT
2 99.45 -15.09 MISSED_SHOT
3 78.17 -10.17 GOAL
4 87.24 -34.40 GOAL
5 96.10 -30.41 GOAL
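If you prefer a more explicit, column-by-column form, a hedged alternative with np.where should give the same result as the .loc assignment above:

import numpy as np

# Rows whose x coordinate is negative
neg = df['coords_x'] < 0

# Flip the sign of both coordinates only on those rows
df['coords_x'] = np.where(neg, -df['coords_x'], df['coords_x'])
df['coords_y'] = np.where(neg, -df['coords_y'], df['coords_y'])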
I have a CSV which is generated in a format that I cannot change. The file has a multi-index. The file looks like this.
The end goal is to turn the top row (hours) into an index, and index it with the "ID" column, so that the data looks like this.
I have imported the file into pandas...
myfile = 'c:/temp/myfile.csv'
df = pd.read_csv(myfile, header=[0, 1], tupleize_cols=True)
pd.set_option('display.multi_sparse', False)
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['hour', 'field'])
df
But that gives me three unnamed fields:
My final step is to stack on hour:
df.stack(level=['hour'])
But I am missing what comes before that: how to index the other columns, even though there's a blank multi-index line above them.
I believe the lines you are missing may be # 3 and 4:
df = pd.io.parsers.read_csv('temp.csv', header = [0,1], tupleize_cols = True)
df.columns = [c for _, c in df.columns[:3]] + [c for c in df.columns[3:]]
df = df.set_index(list(df.columns[:3]), append = True)
df.columns = pd.MultiIndex.from_tuples(df.columns, names = ['hour', 'field'])
Convert the tuples to strings by dropping the first element for the first three column headers.
Shelter these headers by placing them in the index.
After you perform the stack, you may reset the index if you like.
e.g.
Before
(Unnamed: 0_level_0, Date) (Unnamed: 1_level_0, id) \
0 3/11/2016 5
1 3/11/2016 6
(Unnamed: 2_level_0, zone) (100, p1) (100, p2) (200, p1) (200, p2)
0 abc 0.678 0.787 0.337 0.979
1 abc 0.953 0.559 0.776 0.520
After
field p1 p2
Date id zone hour
0 3/11/2016 5 abc 100 0.678 0.787
200 0.337 0.979
1 3/11/2016 6 abc 100 0.953 0.559
200 0.776 0.520
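For completeness, a hedged sketch of the stack step that produces the "After" view, plus the optional index reset mentioned above (it relies on the 'hour' level name set earlier):

# Move the 'hour' level from the columns into the row index
stacked = df.stack(level='hour')

# Optionally turn the whole index back into ordinary columns
flat = stacked.reset_index()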
Problem: matching columns while appending CSV files
I have 50 .csv files where each column is a word, each row is a time of day and each file holds all words for one day. They look like this:
Date Time Aword Bword Cword Dword
Date1 t1 0 1 0 12
Date1 t2 0 6 3 0
Date Time Eword Fword Gword Hword Bword
Date2 t1 0 0 1 0 3
Date2 t2 2 0 0 19 0
I want to append the files so that any columns with the same word (like Bword in this example) are matched while new words are added in new columns:
Date Time Aword Bword Cword Dword Eword Fword Gword Hword
Date1 t1 0 1 0 12
Date1 t2 0 6 3 0
Date2 t1 3 0 0 1 0
Date2 t2 0 2 0 0 19
I'm opening the CSV files as dataframes to manipulate them, and using dataframe.append the new files are added like this:
Date Time Aword Bword Cword Dword
Date1 t1 0 1 0 12
Date1 t2 0 6 3 0
Date Time Eword Fword Gword Hword Bword
Date2 t1 0 0 1 0 3
Date2 t2 2 0 0 19 0
Is there a different approach which could align matching columns while appending? ie without iterating through each column and checking for matches.
Sincere apologies if this question is too vague, I'm new to python and still struggling to know when I'm thinking un-pythonically and when I'm using the wrong tools.
EDIT: more information
1) I'll need to perform this task multiple times, once for each of five batches of csvs
2) The files all have 25 rows but have anything from 5 to 294 columns
3) The order of rows is important Day1(t1, t2...tn) then Day2(t1, t2...tn)
4) The order of columns is not important
IIUC, you can simply use pd.concat, which will automatically align on columns:
>>> csvs = glob.glob("*.csv")
>>> dfs = [pd.read_csv(csv) for csv in csvs]
>>> df_merged = pd.concat(dfs).fillna("")
>>> df_merged
Aword Bword Cword Date Dword Eword Fword Gword Hword Time
0 0 1 0 Date1 12 t1
1 0 6 3 Date1 0 t2
0 3 Date2 0 0 1 0 t1
1 0 Date2 2 0 0 19 t2
(Although I'd recommend either fillna(0) or leaving it as nan; if you fill with an empty string to look like your desired output, the column has to have object dtype and those are much slower than int or float.)
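A quick hedged illustration of that dtype point, reusing the dfs list from above:

# Word columns filled with 0 stay numeric; filled with "" they become object dtype
print(pd.concat(dfs).fillna(0).dtypes)
print(pd.concat(dfs).fillna("").dtypes)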
If you're really particular about the column order, you could cheat and use (re)set_index:
>>> df_merged.set_index(["Date", "Time"]).reset_index()
Date Time Aword Bword Cword Dword Eword Fword Gword Hword
0 Date1 t1 0 1 0 12
1 Date1 t2 0 6 3 0
2 Date2 t1 3 0 0 1 0
3 Date2 t2 0 2 0 0 19
I think for this kind of thing you might find using the pandas library a bit easier. Say filelist is a list of file names.
import pandas as pd
df = pd.concat([pd.read_csv(fl, index_col=[0,1]) for fl in filelist])
And you're done! As a side note if you'd like to combine the date and time columns (depending on their format) you can try
df = pd.concat([pd.read_csv(fl, parse_dates=['Date','Time']) for fl in filelist]).drop('Date', axis=1)
If the order of rows and columns is not important (if it is, you need to edit your question to specify how to deal with cases where the order differs among files!), there are no conflicts (different values of the same column for the same date and time), the data fit in memory, and you prefer to work in plain Python rather than pandas (I notice you haven't tagged your question with pandas), one approach might be the following:
import collections
import csv
def merge_csvs(*filenames):
    result_dict = collections.defaultdict(dict)
    all_columns = set()
    for fn in filenames:
        with open(fn) as f:
            dr = csv.DictReader(f)
            update_cols = True
            for row in dr:
                date = row.pop('Date')
                time = row.pop('Time')
                result_dict[date, time].update(row)
                if update_cols:
                    all_columns.update(row)
                    update_cols = False
    for d in result_dict.values():
        missing_cols = all_columns.difference(d)
        d.update(dict.fromkeys(missing_cols, ''))
    return result_dict
This produces a dictionary, keyed by (date, time) pairs, of dictionaries whose keys are all the columns found in any of the input CSVs, with either the corresponding value for that date and time, or an empty string if that column was never present for that date and time.
Now you can deal with this as you wish, e.g.:
d = merge_csvs('a.csv', 'b.csv', 'c.csv')
for date, time in sorted(d):
    dd = d[date, time]
    outlist = [dd[c] for c in sorted(dd)]
    print(date, time, outlist)
or, of course, write it back to a different CSV, and so forth.
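A hedged sketch of that write-back step (the output filename and the 'Date'/'Time'-first column order are my own choices, not anything from the question):

import csv

def write_merged(result_dict, out_filename):
    # Every column seen across the merged rows, in a stable sorted order
    all_cols = sorted({c for row in result_dict.values() for c in row})
    with open(out_filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['Date', 'Time'] + all_cols)
        writer.writeheader()
        for date, time in sorted(result_dict):
            out_row = {'Date': date, 'Time': time}
            out_row.update(result_dict[date, time])
            writer.writerow(out_row)

write_merged(merge_csvs('a.csv', 'b.csv', 'c.csv'), 'merged.csv')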
I did not know of a better name for what I am trying to do; edits welcome. Here is what I want to do.
I have store, date, and product indices and a column called price.
I have two unique products 1 and 2.
But for each store, I don't have an observation for every date, and for every date, I don't have both products necessarily.
I want to create a series for each store that is indexed by dates only when both products are present. The reason is that I want the value of the series to be product 1 price / product 2 price.
This is a highly unbalanced panel, and I did a horrible workaround of about 75 lines of code, so I would appreciate any tips. This will be very useful in the future.
Data looks like below.
weeknum Location_Id Item_Id averageprice
70 201138 8501 1 0.129642
71 201138 8501 2 0.188274
72 201138 8502 1 0.129642
73 201139 8504 1 0.129642
Expected output in this simple case would be:
weeknum Location_Id averageprice
? 201138 8501 0.129642/0.188274
Since that is the only one with every requirement met.
I think this could be a join on the two sub-frames (but perhaps there is a cleaner, pivot-based way):
In [11]: res = pd.merge(df[df['Item_Id'] == 1], df[df['Item_Id'] == 2],
on=['weeknum', 'Location_Id'])
In [12]: res
Out[12]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y
0 201138 8501 1 0.129642 2 0.188274
Now you can divide those two columns in the result:
In [13]: res['price'] = res['averageprice_x'] / res['averageprice_y']
In [14]: res
Out[14]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y price
0 201138 8501 1 0.129642 2 0.188274 0.688582
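As for the cleaner pivot-based alternative mentioned above, a hedged sketch (using the column names from the sample data) might be:

# Pivot Item_Id into columns so each (weeknum, Location_Id) pair has one row
wide = df.pivot_table(index=['weeknum', 'Location_Id'],
                      columns='Item_Id',
                      values='averageprice')

# Keep only rows where both products are present, then take the ratio
wide = wide.dropna(subset=[1, 2])
wide['relprice'] = wide[1] / wide[2]
print(wide)

Note that pivot_table averages duplicate (weeknum, Location_Id, Item_Id) rows by default, which may or may not be what you want.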
Example data similar to yours:
weeknum loc_id item_id avg_price
0 1 8 1 8
1 1 8 2 9
2 1 9 1 10
3 2 10 1 11
First create a date mask that gets you the correct dates:
df_group = df.groupby(['loc_id', 'weeknum'])
df = df.join(df_group.item_id.apply(lambda x: len(x.unique()) == 2),
             on=['loc_id', 'weeknum'], rsuffix='_r')
weeknum loc_id item_id avg_price item_id_r
0 1 8 1 8 True
1 1 8 2 9 True
2 1 9 1 10 False
3 2 10 1 11 False
This gives you a boolean mask, per store and date group, marking where exactly two unique Item_Id values are present. From this you can now apply the function that concatenates your prices:
df[df.item_id_r].groupby(['loc_id','weeknum']).avg_price.apply(lambda x: '/'.join([str(y) for y in x]))
loc_id weeknum
8 1 8/9
It's a bit verbose, with lots of lambdas, but it will get you started, and you can refactor it to be faster and/or more concise if you want.
Let's say your full dataset is called TILPS. Then you might try this:
from __future__ import division
import pandas as pd

# Get list of unique dates present in TILPS
datelist = list(TILPS.loc[:, 'datetime'].unique())

# Get list of unique stores present in TILPS
storelist = list(TILPS.loc[:, 'store'].unique())
# For a given date, extract relative price
def dateLevel(daterow):
    price1 = int(daterow.loc[(daterow['Item_id'] == 1), 'averageprice'].unique())
    price2 = int(daterow.loc[(daterow['Item_id'] == 2), 'averageprice'].unique())
    return pd.DataFrame(pd.Series({'relprice': price1 / price2}))
# For each store, extract relative price for each date
def storeLevel(group, datelist):
    info = {d: None for d in datelist}  # placeholder value assumed; not used below
    exist = group.loc[group['datetime'].isin(datelist), ['weeknum', 'locid']]
    exist_gr = exist.groupby('datetime')
    relprices = exist_gr.apply(dateLevel)
    # Merge relprices with exist on INDEX.
    exist = exist.merge(relprices, left_index=True, right_index=True)
    return exist
# Group TILPS by store
gr_store = TILPS.groupby('store')
fn = lambda x: storeLevel(x, datelist)
output = gr_store.apply(fn)
# Peek at output
print(output.head(30))