I'm having difficulty applying agg to a pandas groupby DataFrame.
I have a dataframe df like this:
order_id distance_theo bird_distance
10 100 80
10 80 80
10 70 80
11 90 70
11 70 70
11 60 70
12 200 180
12 150 180
12 100 180
12 60 180
I want to group by order_id and make a new column crow by dividing the distance_theo of the first row in each group by the bird_distance of the first row (or of any row, since bird_distance has only one value per group). Desired output:
order_id distance_theo bird_distance crow
10 100 80 1.25
10 80 80 1.25
10 70 80 1.25
11 90 70 1.29
11 70 70 1.29
11 60 70 1.29
12 200 180 1.11
12 150 180 1.11
12 100 180 1.11
12 60 180 1.11
My attempt:
df.groupby('order_id').agg({'crow', lambda x: x.distance_theo.head(1) / x.bird_distance.head(1)})
But I get an error:
'Series' object has no attribute 'distance_theo'
How can I solve this? Thanks for any kind of advice!
Using groupby with first:
s = df.groupby('order_id').transform('first')  # each group's first row, broadcast to every row of the group
df.assign(crow=s.distance_theo.div(s.bird_distance))
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
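Since bird_distance has a single value per group, you could also get the same column in one line by transforming only distance_theo; a minimal variation on the same idea:
df['crow'] = df.groupby('order_id')['distance_theo'].transform('first') / df['bird_distance']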
You could do it without groupby, using drop_duplicates and join:
df.join(df.drop_duplicates('order_id')\
.eval('crow = distance_theo / bird_distance')[['crow']]).ffill()
Or use assign instead of eval, per @jezrael's comments below:
df.join(df.drop_duplicates('order_id')\
.assign(crow=df.distance_theo / df.bird_distance)[['crow']]).ffill()
Output:
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
Below I have two dataframes; I need to update df_mapped using df_original.
For each x_time in df_mapped, I need to find the 3 closest rows (closest meaning the smallest difference in x_price) and add those to the df_mapped dataframe.
import io
import pandas as pd
d = """
x_time expiration x_price p_price
60 4 10 20
60 5 11 30
60 6 12 40
60 7 13 50
60 8 14 60
70 5 10 20
70 6 11 30
70 7 12 40
70 8 13 50
70 9 14 60
80 1 10 20
80 2 11 30
80 3 12 40
80 4 13 50
80 5 14 60
"""
df_original = pd.read_csv(io.StringIO(d), delim_whitespace=True)
to_mapped = """
x_time expiration x_price
50 4 15
60 5 15
70 6 13
80 7 20
90 8 20
"""
df_mapped = pd.read_csv(io.StringIO(to_mapped), delim_whitespace=True)
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])
Intermediate output: from this, I need to select the 3 rows with the smallest x_price_delta for each x_time:
int_out = """
x_time expiration_x x_price_x expiration_y x_price_y p_price x_price_delta
50 4 15
60 5 15 6 12 40 3
60 5 15 7 13 50 2
60 5 15 8 14 60 1
70 6 13 7 12 40 1
70 6 13 8 13 50 0
70 6 13 9 14 60 1
80 7 20 3 12 40 8
80 7 20 4 13 50 7
80 7 20 5 14 60 6
90 8 20
"""
df_int_out = pd.read_csv(io.StringIO(int_out), delim_whitespace=True)
Final step: keeping x_time fixed, I need to flatten the dataframe so the 3 closest rows end up in a single row:
final_out = """
x_time expiration_original x_price_original expiration_1 x_price_1 p_price_1 expiration_2 x_price_2 p_price_2 expiration_3 x_price_3 p_price_3
50 4 15
60 5 15 6 12 40 7 13 50 8 14 60
70 6 13 7 12 40 8 13 50 9 14 60
80 7 20 3 12 40 4 13 50 5 14 60
90 8 20
"""
df_out = pd.read_csv(io.StringIO(final_out), delim_whitespace=True)
I am stuck between the intermediate and the final step and can't think of a way forward. What could be done to massage the dataframe?
This is not a complete solution, but it might help you get unstuck. At the end we get the correct data.
In [1]: df = (df_int_out.groupby("x_time")
   ...:       .apply(lambda x: x.sort_values(ascending=False, by="x_price_delta"))
   ...:       .set_index(["x_time", "expiration_x"])
   ...:       .drop(["x_price_delta", "x_price_x"], axis=1))

In [2]: df1 = df.iloc[1:-1]

In [3]: df1.groupby(df1.index).apply(lambda x: pd.concat([pd.DataFrame(d) for d in x.values], axis=1).unstack())
Out[3]:
0
0 1 2 0 1 2 0 1 2
(60, 5) 6.0 12.0 40.0 7.0 13.0 50.0 8.0 14.0 60.0
(70, 6) 7.0 12.0 40.0 9.0 14.0 60.0 8.0 13.0 50.0
(80, 7) 3.0 12.0 40.0 4.0 13.0 50.0 5.0 14.0 60.0
I am sure there are much better ways of handling this case.
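One possibly simpler route from the intermediate output (a sketch, assuming pandas 1.1+ so that pivot accepts a list of values; the rank helper column is mine, not from the question): keep the 3 smallest x_price_delta rows per x_time, number them with cumcount, and pivot into wide form.
# keep the 3 closest matches per x_time, then spread them into columns
closest = (df_mapped.sort_values('x_price_delta')
                    .groupby('x_time').head(3)
                    .sort_values(['x_time', 'x_price_delta']))
closest['rank'] = closest.groupby('x_time').cumcount() + 1
wide = closest.pivot(index='x_time', columns='rank',
                     values=['expiration_y', 'x_price_y', 'p_price'])
wide.columns = [f'{col}_{n}' for col, n in wide.columns]  # expiration_y_1, x_price_y_1, ...
Unmatched x_time values (50 and 90) survive as rows of NaN, matching the desired final output.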
"Multi-Index" might be the incorrect term of what I'm looking to do, but below is an example of what I'm trying to accomplish.
Original DF:
HOD site1_units site1_orders site2_units site2_orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Desired DF
site1 site2
HOD units orders units orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Is there an efficient way to construct/format the dataframe like this? Thank you for the help!
Try this:
df = df.set_index('HOD')
# split 'site1_units' -> ('site1', 'units') so the columns become a MultiIndex
df = df.set_axis(df.columns.map(lambda x: tuple(x.split('_'))), axis=1)
Output:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
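An equivalent trick, assuming every column name has exactly one underscore: str.split with expand=True returns the MultiIndex directly.
df = df.set_index('HOD')
df.columns = df.columns.str.split('_', expand=True)  # 'site1_units' -> ('site1', 'units')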
Here is one way you can do this:
df = df.set_index("HOD")
index = pd.MultiIndex.from_tuples(zip(["site1","site1", "site2", "site2"],["units", "orders", "units", "orders"]))
df.columns = index
Result:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
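If the levels follow a regular pattern, as they do here, pd.MultiIndex.from_product builds the same index with less repetition:
# cartesian product of the two levels, in the same order as the columns
index = pd.MultiIndex.from_product([["site1", "site2"], ["units", "orders"]])
df.columns = index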
I want to add ten more rows to each column of the dataset provided below. They should contain random integer values in these ranges:
20-27 for temperature
40-55 for humidity
150-170 for moisture
Dataset:
Temperature Humidity Moisture
0 22 46 0
1 36 41.4 170
2 18 69.3 120
3 21 39.3 200
4 39 70 150
5 22 78 220
6 27 65 180
7 32 75 250
I have tried:
import numpy as np
import pandas as pd
data1 = np.random.randint(20, 27, size=10)
df = pd.DataFrame(data1, columns=['Temperature'])
print(df)
This method discards all the existing rows and gives only the random values. What I need is the existing rows with the random values appended.
Use:
# np.random.randint's upper bound is exclusive, hence 28, 56 and 171
df1 = pd.DataFrame({'Temperature':np.random.randint(20,28,size=10),
                    'Humidity':np.random.randint(40,56,size=10),
                    'Moisture':np.random.randint(150,171,size=10)})
df = pd.concat([df, df1], ignore_index=True)
print (df)
Temperature Humidity Moisture
0 22 46.0 0
1 36 41.4 170
2 18 69.3 120
3 21 39.3 200
4 39 70.0 150
5 22 78.0 220
6 27 65.0 180
7 32 75.0 250
8 20 52.0 158
9 21 45.0 156
10 23 49.0 151
11 24 51.0 167
12 22 45.0 157
13 21 43.0 163
14 26 55.0 162
15 25 40.0 164
16 24 40.0 155
17 20 48.0 150
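As a side note, if the appended rows need to be reproducible, you can seed NumPy's newer Generator API first (a sketch of the same construction):
rng = np.random.default_rng(42)  # fixed seed -> the same "random" rows every run
df1 = pd.DataFrame({'Temperature': rng.integers(20, 28, size=10),
                    'Humidity': rng.integers(40, 56, size=10),
                    'Moisture': rng.integers(150, 171, size=10)})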
I would like to sum the last 4 numbers of a dataframe column, with the output appended as a new column. I have used a for loop to do it, but it is slow when there are many rows. Is there a way to use df.apply(lambda x: ...) to achieve this?
Sample Input:
values
0 10
1 20
2 30
3 40
4 50
5 60
Output:
values result
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
Use pandas.DataFrame.rolling:
>>> df.rolling(4, min_periods=1).sum()
values
0 10.0
1 30.0
2 60.0
3 100.0
4 140.0
5 180.0
6 220.0
7 260.0
8 300.0
Putting it together:
>>> df.assign(results=df.rolling(4, min_periods=1).sum().astype(int))
values results
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
6 70 220
7 80 260
8 90 300
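For reference, min_periods=1 is what keeps the first three rows from being NaN; with the default (min_periods equal to the window size), incomplete windows produce NaN:
>>> df.rolling(4).sum().head(4)
   values
0     NaN
1     NaN
2     NaN
3   100.0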
I'm wondering how to sum up 10 rows of a data frame starting from any given point.
I tried using rolling(10, min_periods=1).sum(), but the very first row should be the sum of the 10 rows starting there. I have a similar issue with cumsum().
So if my data frame were just the A column, I'd like it to output B.
A B
0 10 550
1 20 650
2 30 750
3 40 850
4 50 950
5 60 1050
6 70 1150
7 80 1250
8 90 1350
9 100 1450
10 110 etc
11 120 etc
12 130 etc
13 140
14 150
15 160
16 170
17 180
18 190
It would be similar to doing this operation in Excel and copying the formula down.
You can reverse your series before using pd.Series.rolling, and then reverse the result:
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1]
print(df)
A B
0 10 550.0
1 20 650.0
2 30 750.0
3 40 850.0
4 50 950.0
5 60 1050.0
6 70 1150.0
7 80 1250.0
8 90 1350.0
9 100 1450.0
10 110 1350.0
11 120 1240.0
12 130 1120.0
13 140 990.0
14 150 850.0
15 160 700.0
16 170 540.0
17 180 370.0
18 190 190.0
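If you are on pandas 1.0 or newer, a forward-looking window indexer computes the same thing without the double reversal; a sketch of the idea:
from pandas.api.indexers import FixedForwardWindowIndexer

# window covering the current row and the 9 rows below it
indexer = FixedForwardWindowIndexer(window_size=10)
df['B'] = df['A'].rolling(indexer, min_periods=1).sum()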