I am implementing a Genetic Algorithm. For this algorithm, a number of iterations (between 100 and 500) have to be done, and in each iteration all 100 individuals are evaluated for their 'fitness'. To this end, I have written an evaluate function. However, evaluating the fitness of the 100 individuals already takes 13 seconds for a single iteration. I have to speed this up massively in order to implement an efficient algorithm.
The evaluate function takes two arguments and then performs some calculations. I will share part of the function, since a similar form of calculation is repeated after it. Specifically, I perform a groupby on a dataframe called df_demand and then take the sum of a list comprehension that uses the result of the groupby together with another dataframe called df_distance. A snippet of df_demand looks as follows, but it has larger dimensions in reality (the index is just 0, 1, 2, ...):
date customer deliveries warehouse
2020-10-21 A 30 1
2020-10-21 A 47 1
2020-10-21 A 59 2
2020-10-21 B 130 3
2020-10-21 B 102 3
2020-10-21 B 95 2
2020-10-22 A 55 1
2020-10-22 A 46 4
2020-10-22 A 57 4
2020-10-22 B 89 3
2020-10-22 B 104 3
2020-10-22 B 106 4
and a snippet of df_distance is (where the columns are the warehouses):
index 1 2 3 4
A 30.2 54.3 76.3 30.9
B 96.2 34.2 87.7 102.4
C 57.0 99.5 76.4 34.5
Next, I want to group df_demand such that each combination of (date, customer, warehouse) appears once and all deliveries for that combination are summed. Finally, I want to calculate the total costs. Currently, I have done the following, but it is too slow:
import math

def evaluate(df_demand, df_distance):
    costs = df_demand.groupby(["date", "customer", "warehouse"]).sum().reset_index()
    cost = sum(math.ceil(costs.iat[i, 3] / 20) * df_distance.loc[costs.iat[i, 1], costs.iat[i, 2]]
               for i in range(len(costs)))
    # etc...
    return cost
Since I have to do many iterations and considering the fact that dimensions of my data are considerably larger, my question is: what is the fastest way to do this operation?
let's try:
import numpy as np

def get_cost(df, df2):
    '''
    df: deliveries data
    df2: distance data
    '''
    pivot = np.ceil(df.pivot_table(index=['customer', 'warehouse'], columns=['date'],
                                   values='deliveries', aggfunc='sum', fill_value=0)
                      .div(20))
    return pivot.mul(df2.rename_axis(index='customer', columns='warehouse').stack(),
                     axis='rows').sum().sum()
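For a quick sanity check, here is a sketch that rebuilds the toy frames from the question (with the warehouse labels as integers in both frames so they align) and compares the shared part of the loop in evaluate against get_cost:

import math
import numpy as np
import pandas as pd

df_demand = pd.DataFrame({
    'date': ['2020-10-21'] * 6 + ['2020-10-22'] * 6,
    'customer': ['A', 'A', 'A', 'B', 'B', 'B'] * 2,
    'deliveries': [30, 47, 59, 130, 102, 95, 55, 46, 57, 89, 104, 106],
    'warehouse': [1, 1, 2, 3, 3, 2, 1, 4, 4, 3, 3, 4],
})
df_distance = pd.DataFrame(
    [[30.2, 54.3, 76.3, 30.9],
     [96.2, 34.2, 87.7, 102.4],
     [57.0, 99.5, 76.4, 34.5]],
    index=['A', 'B', 'C'], columns=[1, 2, 3, 4],
)

# The original row-by-row loop, restricted to the part of evaluate shown above
costs = df_demand.groupby(['date', 'customer', 'warehouse']).sum().reset_index()
loop_cost = sum(math.ceil(costs.iat[i, 3] / 20) * df_distance.loc[costs.iat[i, 1], costs.iat[i, 2]]
                for i in range(len(costs)))

print(loop_cost, get_cost(df_demand, df_distance))   # both give 3274.5 on this toy data

The pivot_table version replaces the Python-level loop over grouped rows with one reshape and one vectorised multiply, which is where the speed-up comes from.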
Related
In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas

url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
# now sort the data frame ascending by date
data = data.sort_values('Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, and so on for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
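If you would rather have the differences as a new column of the original frame instead of a separate diffed frame, the same calculation can be written with shift(); a small sketch (the column name close_diff is just for illustration):

data = data.sort_values('Date')                              # make sure rows are in date order
data['close_diff'] = data['Close'] - data['Close'].shift(1)  # current row minus previous row
# equivalently: data['close_diff'] = data['Close'].diff()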
To calculate the difference of a single column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only and keep the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
    A   B  A_dif
0  10  56    NaN
1  45  48   35.0
2  26  48  -19.0
3  32  65    6.0
df = df[df['A_dif']<15]
df=
    A   B  A_dif
2  26  48  -19.0
3  32  65    6.0
(the NaN row drops out because NaN < 15 evaluates to False)
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you a pure-Python solution, which might be of some help even if you end up using pandas:
import csv
import urllib.request

# This retrieves the CSV file and loads it into a list of rows,
# converting all numeric values to floats
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
lines = urllib.request.urlopen(url).read().decode().splitlines()
reader = csv.reader(lines, delimiter=',')
# Sort the list so the records are ordered by date (the first field)
cleaned = sorted([row[0]] + [float(x) for x in row[1:]] for row in list(reader)[1:])
for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<index>, <item>)
    if i == 0:
        continue  # the first record has no previous row to compare against
    # Difference of each numeric field with the same field in the previous row
    print(row[0], [row[j] - cleaned[i - 1][j] for j in range(1, len(row))])
Suppose I have a Pandas DataFrame like this:
item event date
A 1 2020-03-09
B 1 2020-03-09
A 2 2020-05-01
B 2 2020-05-01
C 2 2020-05-01
A 3 2020-06-25
C 3 2020-06-25
B 4 2020-07-18
C 4 2020-07-18
This dataframe contains a unique date per 'event' per 'item', so an item has several events with distinct dates.
Now I would like to calculate, per item, the average number of days between these dates. This will be a different value for each item, so I need to compute the average time between consecutive event dates for each item.
So the expected output would look like:
item average_interval_in_days
A 54
B 65.5
C 39.0
Does anyone have an idea how to do this?
Very similar to @BradSolomon's answer, with two small differences:
df.sort_values(['item', 'date']).groupby('item')['date'].agg(
    lambda g: g.diff().mean() / pd.Timedelta(days=1))
# gives:
item
A 54.0
B 65.5
C 39.0
Notes:
- Ensure that the dates are sorted within each group, otherwise the mean will depend on the order; in your example the dates happen to be sorted, so if you can guarantee that, you may skip .sort_values().
- Use ... / pd.Timedelta(days=1) to produce the mean difference directly in units of days.
Alternative for speed (no sort, no lambda, but a bit more opaque)
gb = df.groupby('item')['date']
(gb.max() - gb.min()) / (gb.count() - 1) / pd.Timedelta(days=1)
# gives:
item
A 54.0
B 65.5
C 39.0
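Both versions assume that date is already a datetime column; if it came in as strings, convert it first with pd.to_datetime. A minimal sketch with the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    'item':  ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B', 'C'],
    'event': [1, 1, 2, 2, 2, 3, 3, 4, 4],
    'date':  ['2020-03-09', '2020-03-09', '2020-05-01', '2020-05-01', '2020-05-01',
              '2020-06-25', '2020-06-25', '2020-07-18', '2020-07-18'],
})
df['date'] = pd.to_datetime(df['date'])   # needed before .diff() / Timedelta arithmetic

out = df.sort_values(['item', 'date']).groupby('item')['date'].agg(
    lambda g: g.diff().mean() / pd.Timedelta(days=1))
print(out)   # A 54.0, B 65.5, C 39.0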
I have a Pandas dataframe with points and the corresponding distances to other points. I am able to get the minimal value across the calculated columns; however, I need the column name itself. I cannot figure out how to get the column names corresponding to those values into a new column. My dataframe looks like this:
df.head():
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 218.039561
71 100.0 381.0 925.324708 ... 647.707783 169.856557 169.856557
61 225.0 69.0 751.353014 ... 515.152768 122.377490 122.377490
Columns 0 and 1 are data points; the rest are distances to data points 1 to 7. In some cases the number of points can differ, but that does not really matter for the question. The code I use to compute the minimum is the following:
new = users.iloc[:,2:].min(axis=1)
users["min"] = new
#could also do the following way
#users.assign(Min=lambda users: users.iloc[:,2:].min(1))
This is quite simple, and there is not much to finding the minimum of multiple columns. However, I need to get the column name instead of the value. So my desired output would look like this (in the example it is always 7, which is not a rule):
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 7
71 100.0 381.0 925.324708 ... 647.707783 169.856557 7
61 225.0 69.0 751.353014 ... 515.152768 122.377490 7
Is there a simple way to achieve this?
Use df.idxmin:
In [549]: df['min'] = df.iloc[:,2:].idxmin(axis=1)
In [550]: df
Out[550]:
0 1 2 6 7 min
9 58.0 94.0 984.003636 696.667367 218.039561 7
71 100.0 381.0 925.324708 647.707783 169.856557 7
61 225.0 69.0 751.353014 515.152768 122.377490 7
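If you also want to keep the minimum value itself next to its column label, a small sketch along the lines of your original code (slicing the distance columns once, before the new columns are added, so they are not picked up by iloc):

dist = users.iloc[:, 2:]                 # distance columns only
users['min'] = dist.min(axis=1)          # smallest distance in each row
users['min_col'] = dist.idxmin(axis=1)   # label of the column holding that distance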
Sorry for being naive. I have the following data and I want to feature-engineer some columns, but I don't know how I can do multiple operations on the same data frame. One thing to mention: I have multiple entries for each customer, so in the end I want aggregated values (i.e. one entry for each customer).
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 06-11-2009 2395
1 860 50.0 09-28-2012 1190
2 1200 100.0 10-25-2005 3720
3 1420 50.0 09-07-2009 2307
4 1940 70.0 01-25-2013 1071
New columns based on min, count and mean:
customer_purchases['amount'] = customer_purchases.groupby(['customer_id'])['purchase_amount'].agg('min')
customer_purchases['frequency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('count')
customer_purchases['recency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('mean')
Expected outcome:
customer_id purchase_amount date_of_purchase days_since recency frequency amount first_purchase
0 760 25.0 06-11-2009 2395 1273 5 38.000000 3293
1 860 50.0 09-28-2012 1190 118 10 54.000000 3744
2 1200 100.0 10-25-2005 3720 1192 9 102.777778 3907
3 1420 50.0 09-07-2009 2307 142 34 51.029412 3825
4 1940 70.0 01-25-2013 1071 686 10 47.500000 3984
One solution:
I can think of 3 separate operations, one for each needed column, and then joining all of those to get a new data frame. I know it's not efficient; it's just for the sake of getting what I need:
df_1 = customer_purchases.groupby('customer_id', sort = False)["purchase_amount"].min().reset_index(name ='amount')
df_2 = customer_purchases.groupby('customer_id', sort = False)["days_since"].count().reset_index(name ='frequency')
df_3 = customer_purchases.groupby('customer_id', sort = False)["days_since"].mean().reset_index(name ='recency')
However, I either get an error or a data frame that does not contain the correct data.
Your help and patience will be appreciated.
SOLUTION
Finally, I found the solution:
def f(x):
    recency = x['days_since'].min()
    frequency = x['days_since'].count()
    monetary_value = x['purchase_amount'].mean()
    c = ['recency', 'frequency', 'monetary_value']
    return pd.Series([recency, frequency, monetary_value], index=c)

df1 = customer_purchases.groupby('customer_id').apply(f)
print(df1)
Use this instead:
customer_purchases.groupby('customer_id')['purchase_amount'].transform(lambda x : x.min())
transform will give output for each row of the original dataframe, instead of one row per group as in the case of agg.
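If what you want in the end is one row per customer rather than the per-row values transform gives you, all three aggregates can also be built in a single groupby using named aggregation (available since pandas 0.25); a sketch following the column definitions from the first attempt:

summary = customer_purchases.groupby('customer_id').agg(
    amount=('purchase_amount', 'min'),   # smallest purchase per customer
    frequency=('days_since', 'count'),   # number of purchases
    recency=('days_since', 'mean'),      # mean days_since
).reset_index()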
I have a dataframe as follows:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
2018-01-01 C 10 235
2018-02-01 C 15 130
I want to use groupby dynamically, i.e. I do not wish to hard-code the column names on which groupby is applied. Specifically, I want to compute the mean of each Group over the last two months.
As we can see, not every Group's data is present in the above dataframe for all dates. So the tasks are as follows:
Add a dummy row based on the date, in case data pertaining to Date = 2018-03-01 is not present for a Group (e.g. add rows for A and C).
Perform group_by to compute mean using last two month's Value and Duration.
So my approach is as follows:
For Task 1:
s = pd.MultiIndex.from_product([df['Date'].unique(), df['Group'].unique()], names=['Date', 'Group'])
df = df.set_index(['Date', 'Group']).reindex(s).reset_index().sort_values(['Group', 'Date']).ffill(axis=0)
Can we have a better method for achieving the 'add row' task? The reference is found here.
For Task 2:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    df_grp = df.groupby(grp_by)[cols_list].transform(lambda x: x.tail(2).mean())
    return df_grp
df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt,'Group',df_cols)
Reference of the above approach is found here.
The above code is throwing IndexError : Column(s) ['index','Group','Date','Value','Duration'] already selected
The expected output is
Group  Value  Duration
A      10     60
B      27.5   224
C      15     130
(Since a row is added for 2018-03-01 with the same values as 2018-02-01 for A and C, we are computing the mean over the last two values.)
Use GroupBy.agg instead of transform if you need one aggregated row per group:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].agg(lambda x: x.tail(2).mean()).reset_index()

df = cond_grp_by(df, 'Group', ['Value', 'Duration'])  # pass only the value columns, not the grouping column
print(df)
Group Value Duration
0 A 10.0 60.0
1 B 27.5 224.0
2 C 15.0 130.0
If you need the last value per group, use GroupBy.last:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].last().reset_index()

df = cond_grp_by(df, 'Group', ['Value', 'Duration'])
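For completeness, a small end-to-end sketch on the sample data from the question, combining the reindex/ffill step from task 1 with the agg-based aggregation for task 2; only Value and Duration are passed to the aggregation, which also avoids selecting the grouping column twice:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-02-01', '2018-01-01', '2018-02-01',
             '2018-03-01', '2018-01-01', '2018-02-01'],
    'Group': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Value': [20, 10, 15, 30, 25, 10, 15],
    'Duration': [30, 60, 180, 210, 238, 235, 130],
})

# Task 1: one row per (Date, Group) combination, forward-filling the added rows
s = pd.MultiIndex.from_product([df['Date'].unique(), df['Group'].unique()],
                               names=['Date', 'Group'])
df = (df.set_index(['Date', 'Group']).reindex(s).reset_index()
        .sort_values(['Group', 'Date']).ffill())

# Task 2: mean of the last two months per Group
out = df.groupby('Group')[['Value', 'Duration']].agg(lambda x: x.tail(2).mean()).reset_index()
print(out)
#   Group  Value  Duration
# 0     A   10.0      60.0
# 1     B   27.5     224.0
# 2     C   15.0     130.0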