Summing up previous 10 rows of a dataframe - python

I'm wondering how to sum up 10 rows of a data frame from any point.
I tried using rolling(10,window =1).sum() but the very first row should sum up the 10 rows below. Similar issue with cumsum()
So if my data frame is just the A column, id like it to output B.
A B
0 10 550
1 20 650
2 30 750
3 40 850
4 50 950
5 60 1050
6 70 1150
7 80 1250
8 90 1350
9 100 1450
10 110 etc
11 120 etc
12 130 etc
13 140
14 150
15 160
16 170
17 180
18 190
It would be similar to doing this operation in excel and copying it down
Excel Example:

You can reverse your series before using pd.Series.rolling, and then reverse the result:
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1]
print(df)
A B
0 10 550.0
1 20 650.0
2 30 750.0
3 40 850.0
4 50 950.0
5 60 1050.0
6 70 1150.0
7 80 1250.0
8 90 1350.0
9 100 1450.0
10 110 1350.0
11 120 1240.0
12 130 1120.0
13 140 990.0
14 150 850.0
15 160 700.0
16 170 540.0
17 180 370.0
18 190 190.0

Related

How to Multi-Index an existing DataFrame

"Multi-Index" might be the incorrect term of what I'm looking to do, but below is an example of what I'm trying to accomplish.
Original DF:
HOD site1_units site1_orders site2_units site2_orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Desired DF
site1 site2
HOD units orders units orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Is there an efficient way to construct/format the dataframe like this? Thank you for the help!
Try this:
df = df.set_index('HOD')
df = df.set_axis(df.columns.map(lambda x: tuple(x.split('_'))),axis=1)
Output:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Here is one way you can do this:
df = df.set_index("HOD")
index = pd.MultiIndex.from_tuples(zip(["site1","site1", "site2", "site2"],["units", "orders", "units", "orders"]))
df.columns = index
Result:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90

How can I turn this into a DataFrame?

I am new to Python, and was trying to run a basic web scraper. My code looks like this
import requests
import pandas as pd
x = requests.get('https://www.baseball-reference.com/players/p/penaje02.shtml')
dfs = pd.read_html(x.content)
print(dfs)
df = pd.DataFrame(dfs)
when printing dfs it looks like this. I only want the second table.
[ Year Age Tm Lg G PA AB \
0 2018 20 HOU-min A- 36 156 136
1 2019 21 HOU-min A,A+ 109 473 409
2 2021 23 HOU-min AAA,Rk 37 160 145
3 2022 24 HOU AL 136 558 521
4 1 Yr 1 Yr 1 Yr 1 Yr 136 558 521
5 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 665 621
R H 2B ... OPS OPS+ TB GDP HBP SH SF IBB Pos \
0 22 34 5 ... 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 ... 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 ... 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 ... 0.715 101.0 264 6 7 1 6 0 NaN
Awards
0 TRC · NYPL
1 DAV,FAY · MIDW,CARL
2 SKT,AST · AAAW,FCL
3 GG
4 NaN
5 NaN
[6 rows x 30 columns]]
however, i end up with error Must pass 2-d input. shape=(1, 6, 30) after my last line. I have tried using df=dfs[1], but got the error list index our of range. Any way i can turn dfs from a list to a datframe?
What do you mean you only want the second table? There's only one table, it's 6 rows and 30 columns. The backslashes show up when whatever you're trying to print to isn't wide enough to contain the dataframe without line wrapping. Here's the dataframe printed in a wider terminal:
The pd.read_html() function returns a List[DataFrame] so you first need to grab your dataframe from the list, and then you can subset it to get the columns you care about:
df = dfs[0]
columns = ['R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'Pos']
print(df[columns])
Output:
R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB Pos
0 22 34 5 0 1 10 3 0 18 19 0.250 0.340 0.309 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 7 7 54 20 10 47 90 0.303 0.385 0.440 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 3 10 21 6 1 8 41 0.297 0.363 0.579 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 2 26 75 13 2 26 161 0.253 0.289 0.426 0.715 101.0 264 6 7 1 6 0 NaN

How to add more rows of random values to an existing column in my dataset - pandas?

I want to add ten more rows to each column of the dataset provided below. It should add random integer values ranging from :
20-27 for temperature
40-55 for humidity
150-170 for moisture
Dataset:
Temperature Humidity Moisture
0 22 46 0
1 36 41.4 170
2 18 69.3 120
3 21 39.3 200
4 39 70 150
5 22 78 220
6 27 65 180
7 32 75 250
I have tried:
import numpy as np
import pandas as pd
data1 = np.random.randint(20,27,size=10)
df = pd.DataFrame(data, columns=['Temperature'])
print(df)
This method deletes all the existing row values and gives out only the random values. What I all need is the existing rows and the random values in addition.
Use:
df1 = pd.DataFrame({'Temperature':np.random.randint(20,28,size=10),
'Humidity':np.random.randint(40,56,size=10),
'Moisture':np.random.randint(150,171,size=10)})
df = pd.concat([df, df1], ignore_index=True)
print (df)
Temperature Humidity Moisture
0 22 46.0 0
1 36 41.4 170
2 18 69.3 120
3 21 39.3 200
4 39 70.0 150
5 22 78.0 220
6 27 65.0 180
7 32 75.0 250
8 20 52.0 158
9 21 45.0 156
10 23 49.0 151
11 24 51.0 167
12 22 45.0 157
13 21 43.0 163
14 26 55.0 162
15 25 40.0 164
16 24 40.0 155
17 20 48.0 150

Add last n rows of DataFrame into a new column

I would to add 4 last numbers of a dataframe and output will be appended in a new column, I have used for loop to do it but it is slower if there are many rows is there a way we can use df.apply(lamda x:) to achieve this.
**Sample Input:
values
0 10
1 20
2 30
3 40
4 50
5 60
Output:
values result
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180**
use pandas.DataFrame.rolling
>>> df.rolling(4, min_periods=1).sum()
values
0 10.0
1 30.0
2 60.0
3 100.0
4 140.0
5 180.0
6 220.0
7 260.0
8 300.0
Add it together:
>>> df.assign(results=df.rolling(4, min_periods=1).sum().astype(int))
values results
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
6 70 220
7 80 260
8 90 300

Pandas: groupby and make a new column applying aggregate to two columns

I'm having a difficulty with applying agg to a groupby pandas dataframe.
I have a dataframe df like this:
order_id distance_theo bird_distance
10 100 80
10 80 80
10 70 80
11 90 70
11 70 70
11 60 70
12 200 180
12 150 180
12 100 180
12 60 180
I want to groupby order_id, and make a new column crow by dividing distance_theo of the first row in each group by bird_distance in the first row of each group(or in any row, because there is only one value of bird_distance in one group).
order_id distance_theo bird_distance crow
10 100 80 1.25
10 80 80 1.25
10 70 80 1.25
11 90 70 1.29
11 70 70 1.29
11 60 70 1.29
12 200 180 1.11
12 150 180 1.11
12 100 180 1.11
12 60 180 1.11
My attempt:
df.groupby('order_id').agg({'crow', lambda x: x.distance_theo.head(1) / x.bird_distance.head(1)})
But I get an error:
'Series' object has no attribute 'distance_theo'
How can I solve this? Thanks for any kinds of advice!
Using groupby with first:
s = df.groupby('order_id').transform('first')
df.assign(crow=s.distance_theo.div(s.bird_distance))
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
You could do it without groupby and use drop_duplicate and join:
df.join(df.drop_duplicates('order_id')\
.eval('crow = distance_theo / bird_distance')[['crow']]).ffill()
or use assign instead of eval per #jezraela comments below:
df1.join(df1.drop_duplicates('order_id')\
.assign(crow=df1.distance_theo / df1.bird_distance)[['crow']]).ffill()
Output:
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111

Categories