split dataframe rows according to ratios - python

I want to turn this dataframe
| ID| values|
|:--|:-----:|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
into the below one by splitting the values according to the ratios 2:3:5
| ID| values|
|:--|:-----:|
| 1 | 2 |
| 1 | 3 |
| 1 | 5 |
| 2 | 4 |
| 2 | 6 |
| 2 | 10 |
| 3 | 6 |
| 3 | 9 |
| 3 | 15 |
Is there any simple code/convenient way to do this? Thanks!

Let us do
# multiply each value by the ratios and divide by their sum (2 + 3 + 5 = 10),
# then explode the resulting lists into one row per piece
df['new'] = (df['values'].to_numpy()[:, None] * [2, 3, 5] / 10).tolist()
df = df.explode('new')
Out[849]:
ID values new
0 1 10 2.0
0 1 10 3.0
0 1 10 5.0
1 2 20 4.0
1 2 20 6.0
1 2 20 10.0
2 3 30 6.0
2 3 30 9.0
2 3 30 15.0
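The /10 above hard-codes the fact that the ratios 2:3:5 sum to 10. A hedged variant of the same idea that works for any ratio list simply divides by the ratio sum (my generalization, not part of the original answer):
ratios = [2, 3, 5]
df['new'] = (df['values'].to_numpy()[:, None] * ratios / sum(ratios)).tolist()
df = df.explode('new')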

Here is one approach:
import pandas as pd
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "values": [10, 20, 30]
})

ratios = [2, 3, 5]

df = (
    df.assign(values=df["values"].apply(lambda x: [x * (ratio / sum(ratios)) for ratio in ratios]))
    .explode("values")
)
print(df)
In essence, we aim to create cells with lists under the "values" column so that we can take advantage of a DataFrame's explode method, which melts cells containing lists into individual rows.
To make these lists we use the apply method on the "values" Series (the pandas term for a column of a DataFrame). This function:
lambda x: [x * (ratio / sum(ratios)) for ratio in ratios]
is an anonymous function that receives a number and returns it split into a list according to the ratios. For example, when x is 10:
10 * (2 / 10) = 2
10 * (3 / 10) = 3
10 * (5 / 10) = 5
Therefore [2, 3, 5]
Then for the next value:
20 * (2 / 10) = 4
20 * (3 / 10) = 6
20 * (5 / 10) = 10
Therefore [4, 6, 10]
etc., which results in the intermediate dataframe:
ID values
0 1 [2.0, 3.0, 5.0]
1 2 [4.0, 6.0, 10.0]
2 3 [6.0, 9.0, 15.0]
Using the explode method on this dataframe produces your desired result.
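If explode is new to you, here is a minimal toy illustration (my own example, not from the answer): each list element becomes its own row, the other columns are repeated, and the original index is kept.
import pandas as pd

toy = pd.DataFrame({"ID": [1], "values": [[2.0, 3.0, 5.0]]})
print(toy.explode("values"))
#    ID values
# 0   1    2.0
# 0   1    3.0
# 0   1    5.0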

Here is one way to do it:
import numpy as np

ratio = [2, 3, 5]
ratio_dec = np.divide(ratio, sum(ratio))
df['ratio'] = df['values'].apply(lambda x: np.round(np.multiply(x, ratio_dec), 0))
df.explode('ratio')
ID values ratio
0 1 10 2.0
0 1 10 3.0
0 1 10 5.0
1 2 20 4.0
1 2 20 6.0
1 2 20 10.0
2 3 30 6.0
2 3 30 9.0
2 3 30 15.0
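One caveat (my note, not the original answer's): rounding each piece independently can make the pieces sum to something other than the original value (for example, 1 splits into 0, 0, 0). A hedged sketch that keeps the total exact by letting the last piece absorb the rounding error, using a hypothetical helper split_by_ratio:
import numpy as np

def split_by_ratio(x, ratio=(2, 3, 5)):
    # round the first pieces, then give any remainder to the last piece
    parts = list(np.round(x * np.divide(ratio, sum(ratio))))
    parts[-1] = x - sum(parts[:-1])
    return parts

df['ratio'] = df['values'].apply(split_by_ratio)
df.explode('ratio')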

Here's a way:
ratio = [2, 3, 5]
df = (df.assign(**{f'ratio_{i}': df['values'] * x / sum(ratio) for i, x in enumerate(ratio)})
        .set_index(['ID', 'values'])
        .stack()
        .to_frame('values')
        .reset_index(level=0)
        .reset_index(drop=True))
Output:
ID values
0 1 2.0
1 1 3.0
2 1 5.0
3 2 4.0
4 2 6.0
5 2 10.0
6 3 6.0
7 3 9.0
8 3 15.0

Related

python piecewise linear interpolation across dataframes in a list

I am trying to apply piecewise linear interpolation. I first tried to use pandas built-in interpolate function but it was not working.
Example data looks below
import pandas as pd
import numpy as np
d = {'ID':[5,5,5,5,5,5,5], 'month':[0,3,6,9,12,15,18], 'num':[7,np.nan,5,np.nan,np.nan,5,8]}
tempo = pd.DataFrame(data = d)
d2 = {'ID':[6,6,6,6,6,6,6], 'month':[0,3,6,9,12,15,18], 'num':[5,np.nan,2,np.nan,np.nan,np.nan,7]}
tempo2 = pd.DataFrame(data = d2)
this = []
this.append(tempo)
this.append(tempo2)
The actual data has over 1000 unique IDs, so I filtered each ID into a dataframe and put them into the list.
The first dataframe in the list is tempo, shown above. I am trying to go through all the dataframes in the list to do a piecewise linear interpolation. I tried to change month to an index and use .interpolate(method='index', inplace=True), but it was not working.
The expected output is
ID | month | num
5 | 0 | 7
5 | 3 | 6
5 | 6 | 5
5 | 9 | 5
5 | 12 | 5
5 | 15 | 5
5 | 18 | 8
This needs to be applied across all the dataframes in the list.
Assuming this is a follow-up to your previous question, change the code to:
for i, df in enumerate(this):
    this[i] = (df
        .set_index('month')
        # optional, because of the previous question
        .reindex(range(df['month'].min(), df['month'].max()+3, 3))
        .interpolate()
        .reset_index()[df.columns]
    )
NB. I simplified the code to remove the groupby, which only works if you have a single group per DataFrame, as you mentioned in the other question.
Output:
[ ID month num
0 5 0 7.0
1 5 3 6.0
2 5 6 5.0
3 5 9 5.0
4 5 12 5.0
5 5 15 5.0
6 5 18 8.0,
ID month num
0 6 0 5.00
1 6 3 3.50
2 6 6 2.00
3 6 9 3.25
4 6 12 4.50
5 6 15 5.75
6 6 18 7.00]
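For reference, the OP's original idea of interpolating against the month values also works once month is the index; a minimal sketch under the same assumption of one ID per DataFrame (and without the reindex step, which was specific to the earlier question):
for i, df in enumerate(this):
    this[i] = (df
        .set_index('month')
        .interpolate(method='index')  # linear interpolation against the month values
        .reset_index()[df.columns]
    )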

Optimizing a Pandas DataFrame Transformation to Link two Columns

Given the following df:
   SequenceNumber  ID  CountNumber  Side  featureA  featureB
0               0   0            3  Sell         4         2
1               0   1            1   Buy        12        45
2               0   2            1   Buy         1         4
3               0   3            1   Buy         3        36
4               1   0            1  Sell         5        11
5               1   1            1  Sell         7        12
6               1   2            2   Buy         5        35
I want to create a new df such that for every SequenceNumber value, it takes the rows with CountNumber == 1 and creates new rows where, if Side == 'Buy', their ID goes in a column named To; otherwise their ID goes in a column named From. The empty one of From and To then takes the ID of the row with CountNumber > 1 (there is only one such row per SequenceNumber value). The rest of the features should be preserved.
NOTE: basically each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links the buyers and sellers, where From is the seller ID and To is the buyer ID.
The output should look like this:
   SequenceNumber  From  To  featureA  featureB
0               0     0   1        12        45
1               0     0   2         1         4
2               0     0   3         3        36
3               1     0   2         5        11
4               1     1   2         7        12
I implemented a method that does this, but I am using for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?
Here is the original df:
df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2],
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})
You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:
features = list(df.filter(like='feature'))

out = (
    # repeat the rows with CountNumber > 1
    df.loc[df.index.repeat(df['CountNumber'])]
    # rename Sell/Buy into from/to and de-duplicate the rows per group
    .assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
            n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount()
            )
    # mask the features where CountNumber > 1
    .assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
    .drop(columns='CountNumber')
    # reshape with a pivot
    .pivot(index=['SequenceNumber', 'n'], columns='Side')
)

out = (
    pd.concat([out['ID'], out.drop(columns='ID').groupby(level=0, axis=1).first()], axis=1)
    .reset_index('SequenceNumber')
)
Output:
SequenceNumber from to featureA featureB
n
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
0 1 0 2 5.0 11.0
1 1 1 2 7.0 12.0
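A small maintenance note of mine, not part of the original answer: groupby(..., axis=1) is deprecated in recent pandas (2.1+). Assuming the same intermediate out produced by the pivot above, an equivalent that avoids it is to transpose, group on the first index level, and transpose back:
features_part = out.drop(columns='ID').T.groupby(level=0).first().T
out = pd.concat([out['ID'], features_part], axis=1).reset_index('SequenceNumber')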
Alternative using a merge, as suggested by ifly6:
features = list(df.filter(like='feature'))

df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))

df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))

out = (df1.merge(df2, on='SequenceNumber').rename(columns={'ID_x': 'from', 'ID_y': 'to'})
       .set_index(['SequenceNumber', 'from', 'to'])
       .filter(like='feature')
       .pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
       .reset_index()
       )
Output:
SequenceNumber from to featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
Initial response, which gets the answer about half complete. Split the data into sellers and buyers, then merge it against itself on the sequence number:
ndf = df.query('Side == "Sell"').merge(
    df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
I then drop the side variable.
ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])
This creates a very wide table:
SequenceNumber From CountNumber_sell featureA_sell featureB_sell To CountNumber_buy featureA_buy featureB_buy
0 0 0 3 4 2 1 1 12 45
1 0 0 3 4 2 2 1 1 4
2 0 0 3 4 2 3 1 3 36
3 1 0 1 5 11 2 2 5 35
4 1 1 1 7 12 2 2 5 35
This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.
Is it the side with the lower CountNumber? Is it the side where CountNumber == 1? If the latter, then just null out the entries before the merge, do the merge, and then forward fill the appropriate columns to recover the proper values.
Re nulling. If you null the portions of featureA and featureB where the CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.
import numpy as np

s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan

b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan

ndf = s.merge(
    b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})

ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
    .ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
    .ffill(axis=1).iloc[:, -1]

ndf = ndf.drop(
    columns=[i for i in ndf.columns if i.startswith('Side')
             or i.endswith('_sell') or i.endswith('_buy')])
The final version of ndf then is:
SequenceNumber From To featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
Here is an alternative approach
df1 = df.loc[df['CountNumber'] == 1].copy()

df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell', df1['SequenceNumber']
               .map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
               )

df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy', df1['SequenceNumber']
             .map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
             )

df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)
SequenceNumber From To featureA featureB
0 0 0 1 12 45
1 0 0 2 1 4
2 0 0 3 3 36
3 1 0 2 5 11
4 1 1 2 7 12

Pandas - Applying formula on all column based on a value on the row

Let's say I have a dataframe like the one below:
+------+------+------+-------------+
| A | B | C | devisor_col |
+------+------+------+-------------+
| 2 | 4 | 10 | 2 |
| 3 | 3 | 9 | 3 |
| 10 | 25 | 40 | 10 |
+------+------+------+-------------+
What would be the best command to apply a formula using the values from devisor_col? Do note that I have thousands of columns and rows.
The result should be like this:
+------+------+------+-------------+
| A | B | C | devisor_col |
+------+------+------+-------------+
| 1 | 2 | 5 | 2 |
| 1 | 1 | 3 | 3 |
| 1 | 1.5 | 4 | 10 |
+------+------+------+-------------+
I tried using applymap but I don't know why I can't apply it to all columns.
modResult = my_df.applymap(lambda x: x / x["devisor_col"])
IIUC, use pandas.DataFrame.divide on axis=0:
modResult = (
    pd.concat(
        [my_df, my_df.filter(like="Col")  # selecting columns
                .divide(my_df["devisor_col"], axis=0).add_suffix("_div")], axis=1)
)
# Output:
print(modResult)
Col1 Col2 Col3 devisor_col Col1_div Col2_div Col3_div
0 2 4 10 2 1.0 2.0 5.0
1 3 3 9 3 1.0 1.0 3.0
2 10 25 40 10 1.0 2.5 4.0
If you need only the result of the division, use this:
modResult = my_df.filter(like="Col").divide(my_df["devisor_col"], axis=0)
print(modResult)
Col1 Col2 Col3
0 1.0 2.0 5.0
1 1.0 1.0 3.0
2 1.0 2.5 4.0
Or if you want to overwrite the old columns, use pandas.DataFrame.join:
modResult = (
    my_df.filter(like="Col")
    .divide(my_df["devisor_col"], axis=0)
    .join(my_df["devisor_col"])
)
Col1 Col2 Col3 devisor_col
0 1.0 2.0 5.0 2
1 1.0 1.0 3.0 3
2 1.0 2.5 4.0 10
You can replace my_df.filter(like="Col") with my_df.loc[:, my_df.columns!="devisor_col"].
You can try using .loc:
df = pd.DataFrame([[1, 2, 3, 1], [2, 3, 4, 5], [4, 5, 6, 7]],
                  columns=['col1', 'col2', 'col3', 'divisor'])
df.loc[:, df.columns != 'divisor'] = df.loc[:, df.columns != 'divisor'].divide(df['divisor'], axis=0)

List in pandas dataframe columns

I have the following pandas dataframe
| A | B |
| :-|:------:|
| 1 | [2,3,4]|
| 2 | np.nan |
| 3 | np.nan |
| 4 | 10 |
I would like to unlist the first row and place those values sequentially in the subsequent rows. The outcome will look like this:
| A | B |
| :-|:------:|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 10 |
How can I achieve this in a very large dataset with this phenomena occurring in many rows?
If the number of NaN values serves as "slack" space, so that the list elements can slot in, i.e. if the lengths match, then you can explode column "B", drop the NaN values with dropna, reset the index, and assign back to "B":
df['B'] = df['B'].explode().dropna().reset_index(drop=True)
Output:
A B
0 1 2
1 2 3
2 3 4
3 4 10
If the number of consecutive NaNs does not match the length of the list, you can instead make groups starting at the non-NaN elements and explode while keeping the length of each group constant.
I used a slightly different example for clarity (I also assigned to a different column):
df['C'] = (df['B']
           .groupby(df['B'].notna().cumsum())
           .apply(lambda s: s.explode().iloc[:len(s)])
           .values
           )
Output:
A B C
0 1 [2, 3, 4] 2
1 2 NaN 3
2 3 NaN 4
3 4 NaN NaN
4 5 10 10
Used input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 6),
                   'B': [[2, 3, 4], np.nan, np.nan, np.nan, 10]
                   })

Merging columns within a dataframe with pandas

I'm trying to merge two different columns within a data frame.
Say you have columns A and B: you want A to remain the default value unless it is empty, and if it is empty you want to use the value from B.
pd.merge looks like it only works when merging data frames, not columns within an existing single data frame.
| A   | B   |
|:---:|:---:|
| 2   | 4   |
| NaN | 3   |
| 5   | NaN |
| NaN | 6   |
| 7   | 8   |
Desired Result:
| A |
|:-:|
| 2 |
| 3 |
| 5 |
| 6 |
| 7 |
Credit to Scott Boston for the comment on the OP:
import pandas as pd
df = pd.DataFrame(
    {
        'A': [2, None, 5, None, 7],
        'B': [4, 3, None, 6, 8]
    }
)
df.head()
"""
A B
0 2.0 4.0
1 NaN 3.0
2 5.0 NaN
3 NaN 6.0
4 7.0 8.0
"""
df['A'] = df['A'].fillna(df['B'])
df.head()
"""
A B
0 2.0 4.0
1 3.0 3.0
2 5.0 NaN
3 6.0 6.0
4 7.0 8.0
"""
