Optimizing a Pandas DataFrame Transformation to Link two Columns - python

Given the following df:
SequenceNumber | ID | CountNumber | Side | featureA | featureB
0 | 0 | 3 | Sell | 4 | 2
0 | 1 | 1 | Buy | 12 | 45
0 | 2 | 1 | Buy | 1 | 4
0 | 3 | 1 | Buy | 3 | 36
1 | 0 | 1 | Sell | 5 | 11
1 | 1 | 1 | Sell | 7 | 12
1 | 2 | 2 | Buy | 5 | 35
I want to create a new df such that, for every SequenceNumber value, it takes the rows with CountNumber == 1 and turns each of them into a new row: if Side == 'Buy', that row's ID goes into a column named To; otherwise it goes into a column named From. The column left empty (From or To) then takes the ID of the row with CountNumber > 1 (there is exactly one such row per SequenceNumber value). The rest of the features should be preserved.
NOTE: basically, each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links buyers and sellers, where From is the seller's ID and To is the buyer's ID.
The output should look like this:
SequenceNumber | From | To | featureA | featureB
0 | 0 | 1 | 12 | 45
0 | 0 | 2 | 1 | 4
0 | 0 | 3 | 3 | 36
1 | 0 | 2 | 5 | 11
1 | 1 | 2 | 7 | 12
I implemented a method that does this, but it uses for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?
Here is the original df:
import pandas as pd

df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2],
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})

You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:
features = list(df.filter(like='feature'))

out = (
    # repeat each row CountNumber times
    df.loc[df.index.repeat(df['CountNumber'])]
    # rename Sell/Buy into from/to and add a cumulative count per group
    .assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
            n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount()
            )
    # mask the features where CountNumber > 1
    .assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
    .drop(columns='CountNumber')
    # reshape with a pivot
    .pivot(index=['SequenceNumber', 'n'], columns='Side')
)

out = (
    pd.concat([out['ID'],
               out.drop(columns='ID').groupby(level=0, axis=1).first()],
              axis=1)
    .reset_index('SequenceNumber')
)
Output:
SequenceNumber from to featureA featureB
n
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
0 1 0 2 5.0 11.0
1 1 1 2 7.0 12.0
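Note that newer pandas versions deprecate DataFrame.groupby(..., axis=1). If you hit that warning, the second step above could instead transpose, group along the index, and transpose back (a sketch of the same idea, not part of the original answer):

out = (
    pd.concat([out['ID'],
               out.drop(columns='ID').T.groupby(level=0).first().T],
              axis=1)
    .reset_index('SequenceNumber')
)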
Alternative using a merge, as suggested by ifly6:
features = list(df.filter(like='feature'))

df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))

df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))

out = (df1.merge(df2, on='SequenceNumber')
          .rename(columns={'ID_x': 'from', 'ID_y': 'to'})
          .set_index(['SequenceNumber', 'from', 'to'])
          .filter(like='feature')
          .pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
          .reset_index()
       )
Output:
SequenceNumber from to featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0

Initial response, which gets the answer about halfway there. Split the data into sellers and buyers, then merge it against itself on the sequence number:
ndf = df.query('Side == "Sell"').merge(
    df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
I then drop the side variable.
ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])
This creates a very wide table:
SequenceNumber From CountNumber_sell featureA_sell featureB_sell To CountNumber_buy featureA_buy featureB_buy
0 0 0 3 4 2 1 1 12 45
1 0 0 3 4 2 2 1 1 4
2 0 0 3 4 2 3 1 3 36
3 1 0 1 5 11 2 2 5 35
4 1 1 1 7 12 2 2 5 35
This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.
Is it the side with the lower CountNumber? Is it the side where CountNumber == 1? If the latter, then just null out the relevant entries before the merge, do the merge, and then forward fill the appropriate columns to recover the proper values.
Re nulling. If you null the portions of featureA and featureB where CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.
import numpy as np

s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan

ndf = s.merge(
    b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})

ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
    .ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
    .ffill(axis=1).iloc[:, -1]

ndf = ndf.drop(
    columns=[i for i in ndf.columns if i.startswith('Side')
             or i.endswith('_sell') or i.endswith('_buy')])
The final version of ndf then is:
SequenceNumber From To featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0

Here is an alternative approach
df1 = df.loc[df['CountNumber'] == 1].copy()

df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell',
                               df1['SequenceNumber'].map(
                                   df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
               )
df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy',
                             df1['SequenceNumber'].map(
                                 df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
             )

df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)
SequenceNumber From To featureA featureB
0 0 0 1 12 45
1 0 0 2 1 4
2 0 0 3 3 36
3 1 0 2 5 11
4 1 1 2 7 12

Related

Find list of indices before specific value in column for each unique id

I have dataframe like this:
id | date | status
________________________
... ... ...
1 |2020-01-01 | reserve
1 |2020-01-02 | sold
2 |2020-01-01 | free
3 |2020-01-03 | reserve
3 |2020-01-25 | signed
3 |2020-01-30 | sold
... ... ...
10 |2020-01-02 | signed
10 |2020-02-15 | sold
... ... ....
I want to find the indices of all rows with status sold and then, for each id, assign 1 to the sold row and to every row within 29 days before it, and 0 otherwise.
The desired dataframe is this:
id | date | status | label
_________________________________
... ... ... ...
1 |2020-01-01 | reserve | 1
1 |2020-01-02 | sold | 1
2 |2019-12-02 | free | 0 # no sold status for 2
3 |2020-01-03 | reserve | 1
3 |2020-01-25 | signed | 1
3 |2020-01-30 | sold | 1
... ... ... ...
10 |2020-01-02 | signed | 0 # more than 29 days before 2020-02-15
10 |2020-02-15 | sold | 1
... ... .... ...
I attempted to use apply(), but I found out I can't call a function like that:
def make_labels(df):
    def get_indices(df):
        return list(df[df['date'] >= df.iloc[-1]['date'] - timedelta(days=29)].index)
    df.sort_values(['id', 'date'], inplace=True)
    zero_labels = pd.Series(0, index=df.index, name='sold_labels')
    one_lables = df.groupby('id')['status'].apply(lambda s: get_indices if s.iloc[-1] == 'sold').sum()
    zero_labels.loc[one_lables] = 1
    return zero_labels
df['label'] = make_labels(df)
dataframe constructor of the input:
d = {'id': [1, 1, 2, 3, 3, 3, 10, 10],
     'date': ['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-03', '2020-01-25', '2020-01-30', '2020-01-02', '2020-02-15'],
     'status': ['reserve', 'sold', 'free', 'reserve', 'signed', 'sold', 'signed', 'sold']
     }
df = pd.DataFrame(data=d)
You can use groupby.transform to get the sold date per group, then compare the difference to 29 days:
df['date'] = pd.to_datetime(df['date'])

ref = (df['date']
       .where(df['status'].eq('sold'))
       .groupby(df['id'])
       .transform('first')
       )

df['label'] = (df['date'].rsub(ref)
               .le('29days')
               .astype(int)
               )
Output:
id date status label
0 1 2020-01-01 reserve 1
1 1 2020-01-02 sold 1
2 2 2020-01-01 free 0
3 3 2020-01-03 reserve 1
4 3 2020-01-25 signed 1
5 3 2020-01-30 sold 1
6 10 2020-01-02 signed 0
7 10 2020-02-15 sold 1
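One caveat: with this comparison, a row dated after the sold date also has a difference below 29 days, so it would be labelled 1 as well. If such rows can occur in your data and should stay 0, a bounded variation (my own sketch, not part of the answer above) would be:

df['label'] = (df['date'].rsub(ref)
               .between(pd.Timedelta('0days'), pd.Timedelta('29days'))
               .astype(int)
               )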

split dataframe rows according to ratios

I want to turn this dataframe
| ID| values|
|:--|:-----:|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
into the below one by splitting the values according to the ratios 2:3:5
| ID| values|
|:--|:-----:|
| 1 | 2 |
| 1 | 3 |
| 1 | 5 |
| 2 | 4 |
| 2 | 6 |
| 2 | 10 |
| 3 | 6 |
| 3 | 9 |
| 3 | 15 |
Is there any simple code/convenient way to do this? Thanks!
Let us do
df['new'] = (df['values'].to_numpy()[:,None]*[2,3,5]/10).tolist()
df = df.explode('new')
Out[849]:
ID values new
0 1 10 2.0
0 1 10 3.0
0 1 10 5.0
1 2 20 4.0
1 2 20 6.0
1 2 20 10.0
2 3 30 6.0
2 3 30 9.0
2 3 30 15.0
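If you want the exploded values to replace the original values column, as in the desired output, a small follow-up (my addition, not part of the answer above) could be:

df = (df.drop(columns='values')
        .rename(columns={'new': 'values'})
        .reset_index(drop=True))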
Here is one approach:
import pandas as pd
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "values": [10, 20, 30]
})

ratios = [2, 3, 5]

df = (
    df.assign(values=df["values"].apply(lambda x: [x * (ratio / sum(ratios)) for ratio in ratios]))
    .explode("values")
)
print(df)
In essence, we aim to create cells with lists under the "values" column so that we can take advantage of a DataFrame's explode method which melts cells containing lists into individual cells.
To make these lists we use the apply method on the "values" Series (the pandas term for a column of a DataFrame). This function:
lambda x: [x * (ratio / sum(ratios)) for ratio in ratios]
is an anonymous function that receives a number and returns it split into a list according to the ratios. For example, when x is 10:
10 * (2 / 10) = 2
10 * (3 / 10) = 3
10 * (5 / 10) = 5
Therefore [2, 3, 5]
Then for the next value:
20 * (2 / 10) = 4
20 * (3 / 10) = 6
20 * (5 / 10) = 10
Therefore [4, 6, 10]
etc., which results in the intermediate dataframe:
ID values
0 1 [2.0, 3.0, 5.0]
1 2 [4.0, 6.0, 10.0]
2 3 [6.0, 9.0, 15.0]
Using the explode method on this dataframe produces your desired result.
Here is one way to do it:
import numpy as np

ratio = [2, 3, 5]
ratio_dec = np.divide(ratio, sum(ratio))

df['ratio'] = df['values'].apply(lambda x: np.round(np.multiply(x, ratio_dec), 0))
df.explode('ratio')
ID values ratio
0 1 10 2.0
0 1 10 3.0
0 1 10 5.0
1 2 20 4.0
1 2 20 6.0
1 2 20 10.0
2 3 30 6.0
2 3 30 9.0
2 3 30 15.0
Here's a way:
ratio = [2, 3, 5]

df = (df.assign(**{f'ratio_{i}': df['values'] * x / sum(ratio)
                   for i, x in enumerate(ratio)})
        .set_index(['ID', 'values'])
        .stack()
        .to_frame('values')
        .reset_index(level=0)
        .reset_index(drop=True))
Output:
ID values
0 1 2.0
1 1 3.0
2 1 5.0
3 2 4.0
4 2 6.0
5 2 10.0
6 3 6.0
7 3 9.0
8 3 15.0
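For large frames, a NumPy-based variant (a sketch of the same idea, assuming the ratio list and the original ID/values df constructed above) avoids the per-row Python lambda:

import numpy as np

ratio = np.array([2, 3, 5])
out = pd.DataFrame({
    'ID': np.repeat(df['ID'].to_numpy(), len(ratio)),  # each ID repeated once per ratio
    'values': (df['values'].to_numpy()[:, None] * ratio / ratio.sum()).ravel(),  # row-wise split
})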

List in pandas dataframe columns

I have the following pandas dataframe
| A | B |
| :-|:------:|
| 1 | [2,3,4]|
| 2 | np.nan |
| 3 | np.nan |
| 4 | 10 |
I would like to unlist the first row and place those values sequentially in the subsequent rows. The outcome will look like this:
| A | B |
| :-|:------:|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 10 |
How can I achieve this in a very large dataset with this phenomena occurring in many rows?
If the NaN values serve as "slack" space for the list elements to slot into, i.e. if the lengths match, then you can explode column "B", drop the NaN values with dropna, reset the index, and assign back to "B":
df['B'] = df['B'].explode().dropna().reset_index(drop=True)
Output:
A B
0 1 2
1 2 3
2 3 4
3 4 10
If the number of consecutive NaNs does not match the length of the list, you can form groups that start at each non-NaN element and explode within each group while keeping the group's length constant.
I used a slightly different example for clarity (I also assigned to a different column):
df['C'] = (df['B']
           .groupby(df['B'].notna().cumsum())
           .apply(lambda s: s.explode().iloc[:len(s)])
           .values
           )
Output:
A B C
0 1 [2, 3, 4] 2
1 2 NaN 3
2 3 NaN 4
3 4 NaN NaN
4 5 10 10
Used input:
df = pd.DataFrame({'A': range(1, 6),
                   'B': [[2, 3, 4], np.nan, np.nan, np.nan, 10]
                   })

Pandas Custom Cumulative Calculation Over Group By in DataFrame

I am trying to run a simple calculation over the values of each row within a group inside of a dataframe, but I'm having trouble with the syntax. I think I'm specifically getting confused about which data object I should return, i.e. a DataFrame vs. a Series, etc.
For context, I have a bunch of stock values for each product I am tracking and I want to estimate the number of sales via a custom function which essentially does the following:
# Because stock can go up and down, I'm looking to record the difference
# when the stock is less than the previous stock number from the previous row.
# How do I access each row of the dataframe and then return the series I need?
def get_stock_sold(x):
    # Written in pseudo-code
    stock_sold = previous_stock_no - current_stock_no if current_stock_no < previous_stock_no else 0
    return pd.Series(stock_sold)
I then have the following dataframe:
# 'order' is a date in the real dataset.
data = {
    'id': ['1', '1', '1', '2', '2', '2'],
    'order': [1, 2, 3, 1, 2, 3],
    'current_stock': [100, 150, 90, 50, 48, 30]
}
df = pd.DataFrame(data)
df = df.sort_values(by=['id', 'order'])
df['previous_stock'] = df.groupby('id')['current_stock'].shift(1)
I'd like to create a new column (stock_sold) and apply the logic from above to each row within the grouped dataframe object:
df['stock_sold'] = df.groupby('id').apply(get_stock_sold)
Desired output would look as follows:
| id | order | current_stock | previous_stock | stock_sold |
|----|-------|---------------|----------------|------------|
| 1 | 1 | 100 | NaN | 0 |
| | 2 | 150 | 100.0 | 0 |
| | 3 | 90 | 150.0 | 60 |
| 2 | 1 | 50 | NaN | 0 |
| | 2 | 48 | 50.0 | 2 |
| | 3 | 30 | 48 | 18 |
Try:
df["previous_stock"] = df.groupby("id")["current_stock"].shift()
df["stock_sold"] = np.where(
df["current_stock"] > df["previous_stock"].fillna(0),
0,
df["previous_stock"] - df["current_stock"],
)
print(df)
Prints:
id order current_stock previous_stock stock_sold
0 1 1 100 NaN 0.0
1 1 2 150 100.0 0.0
2 1 3 90 150.0 60.0
3 2 1 50 NaN 0.0
4 2 2 48 50.0 2.0
5 2 3 30 48.0 18.0
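If you prefer to stay entirely within groupby, an equivalent vectorized form (a sketch of my own, not part of the answer above) uses the per-group difference clipped at zero:

# negative of the within-group change; stock increases are clipped to 0 (no sales)
df["stock_sold"] = (
    (-df.groupby("id")["current_stock"].diff())
    .clip(lower=0)
    .fillna(0)
)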

Sort within a group and add a columns indicating rows below and above

I have a pandas dataframe that contains something like
+------+--------+-----+-------+
| Team | Gender | Age | Name |
+------+--------+-----+-------+
| A | M | 22 | Sam |
| A | F | 25 | Annie |
| B | M | 33 | Fred |
| B | M | 18 | James |
| A | M | 56 | Alan |
| B | F | 28 | Julie |
| A | M | 33 | Greg |
+------+--------+-----+-------+
What I'm trying to do is first group by Team and Gender, which I have been able to do by using df.groupby(['Team'], as_index=False).
Is there a way to sort the members of each group by their age and add extra columns indicating how many members are above any particular member and how many are below?
eg:
For group 'Team A':
+------+--------+-----+-------+---------+---------+---------+---------+
| Team | Gender | Age | Name | M_Above | M_Below | F_Above | F_Below |
+------+--------+-----+-------+---------+---------+---------+---------+
| A | M | 22 | Sam | 0 | 2 | 0 | 1 |
| A | F | 25 | Annie | 1 | 2 | 0 | 0 |
| A | M | 33 | Greg | 1 | 1 | 1 | 0 |
| A | M | 56 | Alan | 2 | 0 | 1 | 0 |
+------+--------+-----+-------+---------+---------+---------+---------+
import pandas as pd
df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
                   'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'Age': [22, 25, 33, 18, 56, 28, 33],
                   'Name': ['Sam', 'Annie', 'Fred', 'James', 'Alan', 'Julie', 'Greg']}).sort_values(['Team', 'Age'])
for idx, data in df.groupby(['Team'], as_index=False):
    m_tot = data['Gender'].value_counts().get('M', 0)  # number of males in current team
    f_tot = data['Gender'].value_counts().get('F', 0)  # ditto (females)
    m_seen = 0  # males seen so far for current team
    f_seen = 0  # ditto (females)
    for row in data.iterrows():
        (M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot - m_seen, f_seen, f_tot - f_seen)
        if row[1].Gender == 'M':
            m_seen += 1
            M_below -= 1
        else:
            f_seen += 1
            F_Below -= 1
        df.loc[row[0], 'M_Above'] = M_Above
        df.loc[row[0], 'M_Below'] = M_below
        df.loc[row[0], 'F_Above'] = F_Above
        df.loc[row[0], 'F_Below'] = F_Below
And it results as:
Age Gender Team M_Above M_below F_Above F_Below
0 22 M A 0.0 2.0 0.0 1.0
1 25 F A 1.0 2.0 0.0 0.0
6 33 M A 1.0 1.0 1.0 0.0
4 56 M A 2.0 0.0 1.0 0.0
3 18 M B 0.0 1.0 0.0 1.0
5 28 F B 1.0 1.0 0.0 0.0
2 33 M B 1.0 0.0 1.0 0.0
And if you wish to get the new columns as int (as in your example), use:
for new_col in ['M_Above', 'M_Below', 'F_Above', 'F_Below']:
    df[new_col] = df[new_col].astype(int)
Which results:
Age Gender Name Team M_Above M_Below F_Above F_Below
0 22 M Sam A 0 2 0 1
1 25 F Annie A 1 2 0 0
6 33 M Greg A 1 1 1 0
4 56 M Alan A 2 0 1 0
3 18 M James B 0 1 0 1
5 28 F Julie B 1 1 0 0
2 33 M Fred B 1 0 1 0
EDIT: (running times comparison)
Note that this solution is faster than using ix (the approved solution). The average running time (over 1000 iterations) is roughly 4 times lower, which would probably matter for bigger DataFrames. Run this to check:
import pandas as pd
from time import time
import numpy as np

def f(x):
    for i, d in x.iterrows():
        above = x.ix[:i, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        below = x.ix[i:, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        x.ix[i, 'M_Above'] = above.ix['M']
        x.ix[i, 'M_Below'] = below.ix['M']
        x.ix[i, 'F_Above'] = above.ix['F']
        x.ix[i, 'F_Below'] = below.ix['F']
    return x

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
                   'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'Age': [22, 25, 33, 18, 56, 28, 33],
                   'Name': ['Sam', 'Annie', 'Fred', 'James', 'Alan', 'Julie', 'Greg']}).sort_values(['Team', 'Age'])

times = []
times2 = []

for i in range(1000):
    tic = time()
    for idx, data in df.groupby(['Team'], as_index=False):
        m_tot = data['Gender'].value_counts().get('M', 0)  # number of males in current team
        f_tot = data['Gender'].value_counts().get('F', 0)  # ditto (females)
        m_seen = 0  # males seen so far for current team
        f_seen = 0  # ditto (females)
        for row in data.iterrows():
            (M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot - m_seen, f_seen, f_tot - f_seen)
            if row[1].Gender == 'M':
                m_seen += 1
                M_below -= 1
            else:
                f_seen += 1
                F_Below -= 1
            df.loc[row[0], 'M_Above'] = M_Above
            df.loc[row[0], 'M_Below'] = M_below
            df.loc[row[0], 'F_Above'] = F_Above
            df.loc[row[0], 'F_Below'] = F_Below
    toc = time()
    times.append(toc - tic)

for i in range(1000):
    tic = time()
    df1 = df.groupby('Team', sort=False).apply(f).fillna(0)
    df1.ix[:, 'M_Above':] = df1.ix[:, 'M_Above':].astype(int)
    toc = time()
    times2.append(toc - tic)

print(np.mean(times))
print(np.mean(times2))
Results:
0.0163134906292 # alternative solution
0.0622982912064 # approved solution
You can apply a custom function f with groupby on the Team column.
In function f, for each row, first select the rows above and below with ix, drop the current row, and count the genders with value_counts. Some values may be missing, so reindex and then select with ix. (Note that .ix has since been removed from pandas; .loc and .iloc are the modern replacements.)
def f(x):
    for i, d in x.iterrows():
        above = x.ix[:i, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        below = x.ix[i:, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        x.ix[i, 'M_Above'] = above.ix['M']
        x.ix[i, 'M_Below'] = below.ix['M']
        x.ix[i, 'F_Above'] = above.ix['F']
        x.ix[i, 'F_Below'] = below.ix['F']
    return x

df1 = df.groupby('Team', sort=False).apply(f).fillna(0)
# cast float to int
df1.ix[:, 'M_Above':] = df1.ix[:, 'M_Above':].astype(int)
print(df1)
print (df1)
Age Gender Name Team M_Above M_Below F_Above F_Below
0 22 M Sam A 0 2 0 1
1 25 F Annie A 1 2 0 0
6 33 M Greg A 1 1 1 0
4 56 M Alan A 2 0 1 0
3 18 M James B 0 1 0 1
5 28 F Julie B 1 1 0 0
2 33 M Fred B 1 0 1 0
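For larger frames, a fully vectorized sketch (my own addition, not from either answer; it assumes df is already sorted by Team and Age as above) builds the same four columns from group-wise cumulative sums:

is_m = df['Gender'].eq('M').astype(int)
is_f = 1 - is_m
grp = df['Team']

# members of each gender seen before the current row within the team (excluding the row itself)
df['M_Above'] = is_m.groupby(grp).cumsum() - is_m
df['F_Above'] = is_f.groupby(grp).cumsum() - is_f
# members of each gender remaining after the current row within the team
df['M_Below'] = is_m.groupby(grp).transform('sum') - is_m.groupby(grp).cumsum()
df['F_Below'] = is_f.groupby(grp).transform('sum') - is_f.groupby(grp).cumsum()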
