I'm sorry if this has been asked but I can't find another question like this.
I have a data frame in Pandas like this:
Home Away Home_Score Away_Score
MIL NYC 1 2
ATL NYC 1 3
NYC PHX 2 1
HOU NYC 1 6
I want to calculate the moving average for each team, but the catch is that I want to do it for all of their games, both home and away combined.
So for a moving average window of size 3, the values for 'NYC' should be (2+3+2)/3 at row 2 (its third game) and then (3+2+6)/3 at row 3, etc.
You can exploit stack to convert the two score columns into one and then groupby:
(df[['Home_Score','Away_Score']]
 .stack()                                       # one long Series of scores
 .groupby(df[['Home','Away']].stack().values)   # positionally aligned team labels as keys
 .rolling(3).mean()                             # per-team rolling average
 .reset_index(level=0, drop=True)               # drop the team-key level
 .unstack()                                     # back to Home/Away score columns
 .add_prefix('Avg_')
)
Output:
Avg_Away_Score Avg_Home_Score
0 NaN NaN
1 NaN NaN
2 NaN 2.333333
3 3.666667 NaN
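For intuition, here is a minimal, self-contained sketch (rebuilding the sample frame above) showing the two aligned stacks the trick relies on. Because both frames are stacked column by column in the same order, the scores and team labels line up element for element, so the stacked teams can serve as the group key:
import pandas as pd

df = pd.DataFrame({'Home': ['MIL', 'ATL', 'NYC', 'HOU'],
                   'Away': ['NYC', 'NYC', 'PHX', 'NYC'],
                   'Home_Score': [1, 1, 2, 1],
                   'Away_Score': [2, 3, 1, 6]})

scores = df[['Home_Score', 'Away_Score']].stack()  # (row, column) -> score
teams = df[['Home', 'Away']].stack()               # (row, column) -> team, same shape
print(scores.groupby(teams.values).rolling(3).mean())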
I have a messy dataset, as attached below:
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 GS 2001 Assigned NaN
3 V 2004 Received NaN
I am trying to move each misplaced value over into the right column. Ideally it should look like this:
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 NaN GS 2001 Assigned
3 NaN V 2004 Received
I have tried to find a solution on this platform but had no luck. I used df.loc to place the values, but the result is not what I expected. I would really appreciate your support in solving this issue. Thank you.
*Update:
It works with @jezrael's solution, thanks! But is it possible to use it for this case?
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 GS 2001 Assigned NaN
3 3 V 2004 Received NaN
And the result should be like this:
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 NaN GS 2001 Assigned
3 3 NaN V 2004 Received
You can create a mask from the last column testing for missing values with Series.isna, and then use DataFrame.shift with axis=1 only for the filtered rows:
m = df.iloc[:, -1].isna()
df[m] = df[m].shift(axis=1)
print(df)
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 NaN GS 2001 Assigned
3 NaN V 2004 Received
If you need to shift all columns except the first, use DataFrame.iloc with .iloc[m, 1:]; the mask is converted with .to_numpy() because .iloc expects a positional boolean array, not an index-aligned Series:
m = df.iloc[:, -1].isna().to_numpy()
df.iloc[m, 1:] = df.iloc[m, 1:].shift(axis=1)
print(df)
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 NaN GS 2001 Assigned
3 3 NaN V 2004 Received
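One caveat: after this kind of row shift, columns such as Year typically end up with object dtype (they held mixed values to begin with). A one-line cleanup, assuming the frame above:
df['Year'] = pd.to_numeric(df['Year'])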
Given: A pandas dataframe that contains a user_url column among other columns.
Expectation: New columns added to the original dataframe where the columns are composed of information extracted from the URL in the user_url column. Those columns being car_make, model, year and user_id.
Some Extra info: We know that the car_make will only contain letters either with or without a '-'. The model can contain any characters. The year will only be 4 digits long. The user_id will consist of digits of any length.
I tried using a regex to split the URL, but it failed when there was missing or extra information. I also tried just splitting the data, but I had the same issue using split.
Given dataframe
mpg miles user_url
0 NaN NaN https://www.somewebsite.com/suzuki/swift/2015/674857
1 31.6 NaN https://www.somewebsite.com/bmw/x3/2009/461150
2 28.5 NaN https://www.somewebsite.com/mercedes-benz/e300/1998/13
3 46.8 NaN https://www.somewebsite.com/320d/2010/247233
4 21.0 244.4 https://www.somewebsite.com/honda/pass/2019/1038865
5 25.0 254.4 https://www.somewebsite.com/volkswagen/passat/11
Expected Dataframe
mpg miles user_url car_make model year \
0 NaN NaN https://www.somewebsite.com/suzuki/swift/2015/674857 suzuki swift 2015
1 31.6 NaN https://www.somewebsite.com/bmw/x3/2009/461150 bmw x3 2009
2 28.5 NaN https://www.somewebsite.com/mercedes-benz/e300/1998/13 mercedes-benz e300 1998
3 46.8 NaN https://www.somewebsite.com/320d/2010/247233 NaN 320d 2010
4 21.0 244.4 https://www.somewebsite.com/honda/pass/2019/1038865 honda pass 2019
5 25.0 254.4 https://www.somewebsite.com/volkswagen/passat/11 volkswagen passat NaN
user_id
0 674857
1 461150
2 13
3 247233
4 1038865
5 11
You just have to do:
import numpy as np

split = df['user_url'].str.split('/', n=4, expand=True)
df['car_make'] = split[3]
# blank out values that contain a digit -- those are not car makes
df.loc[df['car_make'].str.contains(r'\d'), 'car_make'] = np.nan
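That only fills car_make, though. For all four columns, here is a hedged sketch using a single str.extract with optional named groups; the pattern simply encodes the stated assumptions (car_make is letters with an optional '-', year is exactly 4 digits, user_id is the trailing digits), so rows missing a part get NaN, though truly ambiguous URLs could still mis-assign parts:
import pandas as pd

df = pd.DataFrame({'user_url': [
    'https://www.somewebsite.com/suzuki/swift/2015/674857',
    'https://www.somewebsite.com/320d/2010/247233',
    'https://www.somewebsite.com/volkswagen/passat/11',
]})

pattern = (r'somewebsite\.com/'
           r'(?:(?P<car_make>[A-Za-z-]+)/)?'  # letters and '-' only, optional
           r'(?:(?P<model>[^/]+)/)?'          # any characters, optional
           r'(?:(?P<year>\d{4})/)?'           # exactly 4 digits, optional
           r'(?P<user_id>\d+)$')              # trailing digits of any length
df = df.join(df['user_url'].str.extract(pattern))
print(df)
On this sample, car_make comes out NaN for the 320d row and year comes out NaN for the volkswagen row, matching the expected frame above.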
I'm trying to drop the first two columns in a dataframe that has NaN for column headers. The dataframe looks like this:
15 NaN NaN NaN Energy Supply Energy Supply Renewable Energy
17 NaN Afghanistan Afghanistan 1 2 3
18 NaN Albania Albania 1 2 3
19 NaN Algeria Algeria 1 2 3
I need to drop the first two columns labeled NaN. I tried df=df.drop(df.columns[[1,2]],axis=1), which returns an error:
KeyError: '[nan nan] not found in axis'
What am I missing?
It's strange that you have NaN as column names. Try filtering for the columns that do not start with NaN using a regex:
df.filter(regex='^(?!NaN).+', axis=1)
Using your data
print(df)
15 NaN NaN.1 NaN.2 EnergySupply EnergySupply.1 \
0 17 NaN Afghanistan Afghanistan 1 2
1 18 NaN Albania Albania 1 2
2 19 NaN Algeria Algeria 1 2
RenewableEnergy
0 3
1 3
2 3
Solution
print(df.filter(regex='^(?!NaN).+', axis=1))
15 EnergySupply EnergySupply.1 RenewableEnergy
0 17 1 2 3
1 18 1 2 3
2 19 1 2 3
When the NaN columns exist, I had to use a case-insensitive version of the regex from wwnde's answer for it to successfully filter out the columns:
df = df.filter(regex='(?i)^(?!NaN).+', axis=1)
Other suggestions, such as df=df[df.columns.dropna()] and df=df.drop(np.nan, axis=1), did not work, but the above did.
I'm guessing this is related to the painful reality of np.nan == np.nan not evaluating to True, but ultimately it seems like a bug in pandas.
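If the column labels really are the float NaN rather than the string 'NaN', a boolean mask over the columns also works. A minimal sketch with a made-up frame reproducing the situation:
import numpy as np
import pandas as pd

# hypothetical frame with two genuinely-NaN column labels
df = pd.DataFrame([[17, None, 'Afghanistan'], [18, None, 'Albania']],
                  columns=[15, np.nan, np.nan])

df = df.loc[:, df.columns.notna()]  # keep only columns whose label is not NaN
print(df)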
I have a dataframe, products, which looks like this:
Cust_ID Prod Time_of_Sale
A Bat 1
A Ball 2
A Lego 3
B Lego 3
B Lego 9
B Ball 11
B Bat 11
B Bat 13
C Bat 2
C Lego 2
I want to change it so that it becomes like this:
Cust_ID Bat Bat Ball Lego Lego
A 1 NaN 2 3 NaN
B 11 13 11 3 9
C 2 NaN NaN 2 NaN
I have been playing around with products.groupby() and it is not really leading me anywhere. Any help is appreciated.
The aim is to 'visualize' the order in which each item was purchased by each customer. I have more than 1000 unique Customers.
Edit:
I see that a user suggested that I go through How to pivot a dataframe. But this doesn't work because my columns have duplicate values.
This is a little tricky with duplicates on Prod. Basically you need a cumcount and pivot:
new_df = (df.set_index(['Cust_ID','Prod',
              df.groupby(['Cust_ID', 'Prod']).cumcount()])  # occurrence counter per (customer, product)
              ['Time_of_Sale']
              .unstack(level=(1,2))   # pivot (Prod, counter) pairs into columns
              .sort_index(axis=1)
             )
new_df.columns = [x for x,y in new_df.columns]  # keep only the product name
new_df = new_df.reset_index()
Output:
Cust_ID Ball Bat Bat Lego Lego
0 A 2.0 1.0 NaN 3.0 NaN
1 B 11.0 11.0 13.0 3.0 9.0
2 C NaN 2.0 NaN 2.0 NaN
Note: duplicated column names, although supported, should be avoided in pandas.
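If the duplicate names bother you, a small variant (run in place of the new_df.columns = [x for x,y in new_df.columns] line above) keeps the occurrence counter in the name so every column stays unique:
new_df.columns = [f'{prod}_{i + 1}' for prod, i in new_df.columns]  # Bat_1, Bat_2, ...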
The below is a part of a dataframe which consists of football game results.
FTHG stands for "Full time home goals"
FTAG stands for "Full time away goals"
Date HomeTeam AwayTeam FTHG FTAG FTR
14/08/93 Arsenal Coventry 0 3 A
14/08/93 Aston Villa QPR 4 1 H
16/08/93 Tottenham Arsenal 0 1 A
17/08/93 Everton Man City 1 0 H
21/08/93 QPR Southampton 2 1 H
21/08/93 Sheffield Arsenal 0 1 A
24/08/93 Arsenal Leeds 2 1 H
24/08/93 Man City Blackburn 0 2 A
28/08/93 Arsenal Everton 2 0 H
I want to write Python code that calculates a rolling sum (e.g. over 3 games) of the goals scored by each team, regardless of whether the team was home or away.
The groupby method does half the job. Say "a" is a variable and "df" is the dataframe:
a = df.groupby("HomeTeam")["FTHG"].rolling(3).sum()
The result is something like this:
HomeTeam
Arsenal 0    NaN
        6    NaN
        8    4.0
.....
Name: FTHG, dtype: float64
However, I would like the code to also take into account the goals scored when Arsenal was the visiting team, and to produce a new column (not called FTHG, but some new column) like this:
Arsenal NaN
NaN
2
4
5
Any ideas will be much appreciated.
You can combine the home and away rows into one long frame and then apply groupby:
# note: with day-first date strings, parse first so the sort is chronological:
# df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
tmp1 = df[['Date','HomeTeam', 'FTHG']]
tmp2 = df[['Date','AwayTeam', 'FTAG']]
tmp1.columns = ['Date','name', 'score']  # unify the column names
tmp2.columns = ['Date','name', 'score']
tmp = pd.concat([tmp1,tmp2])             # one row per team per game
tmp.sort_values(by='Date').groupby("name")["score"].rolling(3).sum()
name
Arsenal 0 NaN
2 NaN
5 2.0
6 4.0
8 5.0
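To write the rolling sums back onto the original frame as columns, here is a sketch under the same assumptions (the 'home'/'away' keys and the HomeRolling/AwayRolling column names are made up here):
tmp = pd.concat([tmp1, tmp2], keys=['home', 'away'])  # tag each half, keep the original row labels
rolling = (tmp.sort_values(by='Date')
              .groupby('name')['score']
              .rolling(3).sum()
              .reset_index(level=0, drop=True))  # drop the 'name' level
df['HomeRolling'] = rolling.loc['home']  # aligns on the original row index
df['AwayRolling'] = rolling.loc['away']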