I have a Pandas DataFrame with the following structure (about 100 million rows):
Date      Value  ID
'1/1/17'    500   1
'1/2/17'    550   1
'1/4/17'    600   2
If I do:
def get_coeff_var(group):
    group['coeff_var'] = group['Value'].std() / group['Value'].mean()
    return group
df = df.groupby(['ID']).apply(lambda x: get_coeff_var(x))
It completes extremely quickly.
But if I first set the index to the date and take only the last month of data, the same apply takes so long that I can't even wait for it to complete:
df = df.set_index('Date')
df = df.last('1M')
df = df.groupby(['ID']).apply(lambda x: get_coeff_var(x))
What's going on?
Mutating inside a groupby-apply is almost always a bad idea - in general it takes a slow path, although I'm not sure what the exact issue is here.
In your case, the idiomatic, and much faster, way to do this transformation is the following, and it should be fast regardless of your index:
gb = df.groupby('ID')['Value']
df['coeff_var'] = gb.transform('std') / gb.transform('mean')
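For completeness, here's a minimal sketch of the full pipeline from the question rewritten around transform. It assumes Date is parsed to datetimes first so that last('1M') works on the index:
import pandas as pd

# assumption: df has the string Date, numeric Value and ID columns from the question
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').last('1M')   # keep only the last month of data

gb = df.groupby('ID')['Value']
df['coeff_var'] = gb.transform('std') / gb.transform('mean')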
I have created a function that parses through each column of a dataframe, shifts up the data in that respective column to the first observation (shifting past '-'), and stores that column in a dictionary. I then convert the dictionary back to a dataframe to have the appropriately shifted columns. The function is operational and takes about 10 seconds on a 12x3000 dataframe. However, when applying it to 12x25000 it is extremely extremely slow. I feel like there is a much better way to approach this to increase the speed - perhaps even an argument of the shift function that I am missing. Appreciate any help.
def create_seasoned_df(df_orig):
    """
    Creates a seasoned dataframe with only the first 12 periods of a loan
    """
    df_seasoned = df_orig.reset_index().copy()
    temp_dic = {}
    for col in cols:
        to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
        temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
    df_seasoned = pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]
    return df_seasoned
Try letting apply do the shifting column by column instead, with each column computing its own offset:
def create_seasoned_df(df_orig):
    # shift each column up past its own '-' placeholders in a single vectorised pass
    df_seasoned = df_orig.reset_index().apply(
        lambda x: x.shift(-x.eq('-').sum()), axis=0)
    return df_seasoned
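A quick check on a tiny made-up frame (hypothetical data, just to see the per-column shift behave as intended):
import pandas as pd

toy = pd.DataFrame({'a': ['-', '-', 1, 2, 3],
                    'b': ['-', 4, 5, 6, 7]})
print(create_seasoned_df(toy))
# 'a' shifts up by 2 and 'b' by 1, leaving NaN at the bottom of each column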
I am creating a new column in a dataframe that is based on other values in the entire dataframe. I have found a couple of ways to do so (shown below), but they are very slow on large datasets (500k rows takes about an hour to run). I am looking to speed this process up.
I have attempted to use .apply with a lambda function, and .map to build a list to put into the new column. Both methods work but are far too slow.
values = {'ID': ['1', '2', '3', '4', '1', '2', '3'],
          'MOD': ['X', 'Y', 'Z', 'X', 'X', 'Z', 'Y'],
          'Period': ['Current', 'Current', 'Current', 'Current', 'Past', 'Past', 'Past']}
df = pd.DataFrame(values, columns=['ID', 'MOD', 'Period'])
df['ID_MOD'] = df['ID'] + df['MOD']
def funct(identifier, identifier_modification, period):
    if period == "Current":
        if (df.ID == identifier).sum() == 1:
            return "New"
        elif (df.ID_MOD == identifier_modification).sum() == 1:
            return "Unique"
        else:
            return "Repeat"
    else:
        return "n/a"
Initial df:
   ID MOD   Period ID_MOD
0   1   X  Current     1X
1   2   Y  Current     2Y
2   3   Z  Current     3Z
3   4   X  Current     4X
4   1   X     Past     1X
5   2   Z     Past     2Z
6   3   Y     Past     3Y
Here are the two methods that are too slow:
1)
df['new_column']=df.apply(lambda x:funct(x['ID'],x['ID_MOD'],x['Period']), axis=1)
2)
df['new_column']=list(map(funct,df['ID'],df['ID_MOD'],df['Period']))
Intended final df:
   ID MOD   Period ID_MOD new_column
0   1   X  Current     1X     Repeat
1   2   Y  Current     2Y     Unique
2   3   Z  Current     3Z     Unique
3   4   X  Current     4X        New
4   1   X     Past     1X        n/a
5   2   Z     Past     2Z        n/a
6   3   Y     Past     3Y        n/a
There are no error messages; the code just takes ~1 hour to run with a large data set.
Your current code scales as O(N**2), where N is the number of rows. If your df really has 500k rows, that is going to take a long time! You want to lean on numpy and pandas operations that have much lower computational complexity.
The aggregations built into pandas would help a lot in place of your per-row sums, as would learning how pandas does indexing and merge. In your case I can get 500k rows down to well under a second pretty easily.
Start by defining a dummy data set:
import numpy as np
import pandas as pd
N = 500_000
df = pd.DataFrame({
    'id': np.random.choice(N//2, N),
    'a': np.random.choice(list('XYZ'), N),
    'b': np.random.choice(list('CP'), N),
})
Next we can do the aggregations to count across your various groups:
ids = df.groupby(['id']).size().rename('ids')
idas = df.groupby(['id','a']).size().rename('idas')
Next we can join these aggregations back to the original data set.
Cutting down the data as much as possible is always a good idea; in your case the Past rows always get a value of n/a, and since they take up half your data, dropping them up front halves the amount of work:
df2 = df.loc[df['b'] == 'C']
df2 = pd.merge(df2, ids, left_on=['id'], right_index=True)
df2 = pd.merge(df2, idas, left_on=['id','a'], right_index=True)
Finally we use where from numpy to vectorise all your conditions (and hence work much faster), then use pandas index alignment to put everything back together efficiently, patching up the missing values afterwards:
df2['out'] = np.where(
    df2['ids'] == 1, 'New',
    np.where(df2['idas'] == 1, 'Unique', 'Repeat'))
df['out'] = df2['out']
df['out'] = df['out'].fillna('n/a')
Hope some of that helps! For reference, the above runs in ~320ms for 500k rows on my laptop.
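For reference, here's a minimal sketch of the same idea applied directly to the question's column names (ID, MOD, Period), using groupby().transform('size') instead of explicit merges; treat it as a starting point rather than a drop-in answer:
import numpy as np
import pandas as pd

values = {'ID': ['1', '2', '3', '4', '1', '2', '3'],
          'MOD': ['X', 'Y', 'Z', 'X', 'X', 'Z', 'Y'],
          'Period': ['Current', 'Current', 'Current', 'Current', 'Past', 'Past', 'Past']}
df = pd.DataFrame(values)
df['ID_MOD'] = df['ID'] + df['MOD']

# counts over the whole frame, broadcast back onto every row
id_counts = df.groupby('ID')['ID'].transform('size')
id_mod_counts = df.groupby('ID_MOD')['ID_MOD'].transform('size')

df['new_column'] = np.where(
    df['Period'] != 'Current', 'n/a',
    np.where(id_counts == 1, 'New',
             np.where(id_mod_counts == 1, 'Unique', 'Repeat')))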
I have a 10 GB csv file with 170,000,000 rows and 23 columns that I read in to a dataframe as follows:
import pandas as pd
d = pd.read_csv(f, dtype = {'tax_id': str})
I also have a list of strings with nearly 20,000 unique elements:
h = ['1123787', '3345634442', '2342345234', .... ]
I want to create a new column called class in the dataframe d. I want to assign d['class'] = 'A' whenever d['tax_id'] has a value that is found in the list of strings h. Otherwise, I want d['class'] = 'B'.
The following code works very quickly on a 1% sample of my dataframe d:
d['class'] = 'B'
d.loc[d['tax_id'].isin(h), 'class'] = 'A'
However, on the complete dataframe d, this code takes over 48 hours (and counting) to run on a 32 core server in batch mode. I suspect that indexing with loc is slowing down the code, but I'm not sure what it could really be.
In sum: Is there a more efficient way of creating the class column?
If your tax ids are unique, I would recommend setting tax_id as the index and then indexing on that. As it stands, you call isin, which is a linear operation over all the rows; however fast your machine is, scanning 170 million records that way adds up.
d = d.set_index('tax_id')
d['class'] = 'B'
d.loc[d.index.intersection(h), 'class'] = 'A'  # intersection guards against ids in h that never occur
If you're still suffering from performance issues, I'd recommend switching to distributed processing with dask.
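As a lighter-weight middle ground than dask, a rough sketch in plain pandas (assuming the same file path f and id list h from the question) is to tag the rows chunk by chunk while reading the CSV; note the final concat still materialises the whole frame:
import pandas as pd

h = set(h)  # constant-time membership tests
chunks = []
for chunk in pd.read_csv(f, dtype={'tax_id': str}, chunksize=5_000_000):
    chunk['class'] = 'B'
    chunk.loc[chunk['tax_id'].isin(h), 'class'] = 'A'
    chunks.append(chunk)
d = pd.concat(chunks, ignore_index=True)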
"I also have a list of strings with nearly 20,000 unique elements"
Well, for starters, you should make that list a set if you are going to use it for membership testing. list objects have linear-time membership tests, while set objects have very optimized constant-time membership tests. That is the lowest-hanging fruit here. So use:
h = set(h)  # convert list to set
d['class'] = 'B'
d.loc[d['tax_id'].isin(h), 'class'] = 'A'
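A small further simplification (same result, a single assignment) is to build the column with np.where:
import numpy as np

d['class'] = np.where(d['tax_id'].isin(h), 'A', 'B')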
I want to sort through a Dataframe of about 400k rows, with 4 columns, taking out roughly half of them with an if statement:
for a in range(0, howmanytimestorunthrough):
    if 'Primary' not in DataFrameexample[a]:
        # take out row
So far I've been testing either one of the 4 below:
newdf.append(emptyline,)
nefdf.at[b,'column1'] = DataFrameexample.at[a,'column1']
nefdf.at[b,'column2'] = DataFrameexample.at[a,'column2']
nefdf.at[b,'column3'] = DataFrameexample.at[a,'column3']
nefdf.at[b,'column4'] = DataFrameexample.at[a,'column4']
b = b + 1
or the same with .loc
newdf.append(emptyline,)
nefdf.loc[b,:] = DataFrameexample.loc[a,:]
b = b + 1
or changing the if (not in) to an if (in) and using:
DataFrameexample = DataFrameexample.drop([k])
or trying to set emptyline to have values, and then append it:
notemptyline = pd.Series(DataFrameexample.loc[a,:].values, index = ['column1', 'column2', ...)
newdf.append(notemptyline, ignore_index=True)
So from what I've managed to test so far, they all seem to work ok on a small number of rows (2000), but once I start getting a lot more rows they take exponentially longer. .at seems slightly faster than .loc even though I need it to run 4 times, but it still gets slow (10 times the rows takes more than 10 times as long). .drop I think tries to copy the dataframe each time, so it really doesn't work? I can't seem to get .append(notemptyline) to work properly; it just replaces index 0 over and over again.
I know there must be an efficient way of doing this, I just can't seem to quite get there. Any help?
Your speed problem has nothing to do with .loc vs .at vs ... (for a comparison between .loc and .at have a look at this question) but comes from explicitly looping over every row of your dataframe. Pandas is all about vectorising your operations.
You want to filter your dataframe based on a comparison. You can transform that to a boolean indexer.
indexer = df!='Primary'
This will give you an n-rows-by-4-columns frame of boolean values. Now you want to reduce it to a single column of n values that is True if all values in the row (axis 1) are True.
indexer = indexer.all(axis=1)
Now we can use .loc to get only the rows where indexer is True:
df = df.loc[indexer]
This will be much faster than iterating over the rows.
EDIT:
To check whether an entry merely contains the string (rather than being exactly equal to it), you can replace the first line with:
indexer = ~df.apply(lambda x: x.str.contains('Primary', na=False))
The ~ keeps this consistent with the equality version above (True where a cell does not mention 'Primary'); drop the ~ and reduce with .any(axis=1) instead if you want to keep the rows that do contain it. Note that you normally don't want to use an apply statement (internally it uses a for loop for custom functions) to iterate over a lot of elements. In this case we are only looping over the columns, which is fine if you just have a couple of those.
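Putting the pieces together, here is a minimal sketch on the question's DataFrameexample (assuming the four columns hold strings) that keeps the rows in which at least one column contains 'Primary', which is what the original if statement appears to be after; flip the mask with ~ if you want the other half:
import pandas as pd

# na=False treats missing values as non-matches
mask = DataFrameexample.apply(lambda col: col.str.contains('Primary', na=False)).any(axis=1)
newdf = DataFrameexample.loc[mask]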
I have a Pandas DataFrame of subscriptions, each with a start datetime (timestamp) and an optional end datetime (if they were canceled).
For simplicity, I have created string columns for the date (e.g. "20170901") based on start and end datetimes (timestamps). It looks like this:
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None), ...], columns=["sd", "ed"])
The end result should be a time series of how many subscriptions were active on any given date in a range.
To that end, I created an Index for all the days within a range:
days = df.groupby(["sd"])["sd"].count()
I am able to create what I am interested in with a loop each executing a query over the entire DataFrame df.
count_by_day = pd.DataFrame([
    len(df.loc[(df.sd <= i) & (df.ed.isnull() | (df.ed > i))])
    for i in days.index], index=days.index)
Note that I have values for each day in the original dataset, so there are no gaps. I'm sure getting the date range can be improved.
The actual question is: is there an efficient way to compute this for a large initial dataset df, with multiple thousands of rows? It seems the method I used is quadratic in complexity. I've also tried df.query() but it's 66% slower than the Pythonic filter and does not change the complexity.
I tried to search the Pandas docs for examples but I seem to be using the wrong keywords. Any ideas?
It's an interesting problem; here's how I would do it. Not sure about performance.
EDIT: My first answer was incorrect, I didn't read the question fully.
# Initial data, columns as Timestamps
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None)], columns=["sd", "ed"])
df['sd'] = pd.DatetimeIndex(df.sd)
df['ed'] = pd.DatetimeIndex(df.ed)
# Range input and related index
beg = pd.Timestamp('2017-05-15')
end = pd.Timestamp('2017-09-15')
idx = pd.date_range(start=beg, end=end, freq='D')
# We filter out records that fall entirely outside the range and then clip
# the subscription start/end dates to the range bounds.
fdf = df[(df.sd <= end) & ((df.ed >= beg) | (pd.isnull(df.ed)))].copy()
fdf['ed'] = fdf['ed'].fillna(end)
fdf['ps'] = fdf.sd.apply(lambda x: max(x, beg))
fdf['pe'] = fdf.ed.apply(lambda x: min(x, end))
# We run a conditional count
idx.to_series().apply(lambda x: len(fdf[(fdf.ps<=x) & (fdf.pe >=x)]))
Ok, I'm answering my own question after quite a bit of research, fiddling and trying things out. I may still be missing an obvious solution but maybe it helps.
The fastest solution I could find to date is (thanks Alex for some nice code patterns):
# Start with test data from question
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
('20170901', None), ...], columns=['sd', 'ed'])
# Convert to datetime columns
df['sd'] = pd.DatetimeIndex(df['sd'])
df['ed'] = pd.DatetimeIndex(df['ed'])
df.ed.fillna(df.sd.max(), inplace=True)
# Note: In my real data I have timestamps - I convert them like this:
#df['sd'] = pd.to_datetime(df['start_date'], unit='s').apply(lambda x: x.date())
# Set and sort multi-index to enable slices
df = df.set_index(['sd', 'ed'], drop=False)
df.sort_index(inplace=True)
# Compute the active counts by day in range
di = pd.date_range(start=df.sd.min(), end=df.sd.max(), freq='D')
count_by_day = di.to_series().apply(lambda i: len(df.loc[
(slice(None, i.date()), slice(i.date(), None)), :]))
In my real dataset (with >10K rows for df and a date range of about a year), this was twice as fast as the code in the question, about 1.5s.
Here are some lessons I learned:
Creating a Series with counters for the date range and iterating through the dataset df with df.apply or df.itertuples while incrementing counters was much slower. Curiously, apply was slower than itertuples. Don't even think of iterrows.
My dataset had a product_id on each row, and filtering the dataset per product and running the calculation on each filtered result was twice as fast as adding the product_id to the multi-index and slicing on that level too.
Building an intermediate Series of active days (by iterating through each row in df and adding each date in the active range to the Series) and then grouping by date was much slower.
Running the code in the question on a df with a multi-index did not change the performance.
Running the code in the question on a df with a limited set of columns (my real dataset has 22 columns) did not change the performance.
I looked at pd.crosstab and pd.Period but was not able to get anything to work.
Pandas is pretty awesome and trying to outsmart it is really hard (especially anything non-vectorised in Python).
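As an aside, another fully vectorised way to get the same daily counts (a sketch on the question's sd/ed columns, not something I benchmarked against the above) is to turn starts and ends into +1/-1 events and take a cumulative sum:
import pandas as pd

df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
                   ('20170901', None)], columns=['sd', 'ed'])
df['sd'] = pd.to_datetime(df['sd'])
df['ed'] = pd.to_datetime(df['ed'])

days = pd.date_range(df['sd'].min(), df['sd'].max(), freq='D')
starts = df['sd'].value_counts()          # +1 on each start date
ends = df['ed'].dropna().value_counts()   # -1 on each end date (end date itself counts as inactive)

delta = starts.reindex(days, fill_value=0) - ends.reindex(days, fill_value=0)
count_by_day = delta.cumsum()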