I want to add a new column counting, per year, how many times the points were over 700, considering only years after 2014.
import pandas as pd
import numpy as np

ipl_data = {'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
df.loc[(df['Points'] > 700) & (df['Year'] > 2014), 'High_points'] = df['Points']
#df['Point_per_year_gr_700'] = df.groupby(by='Year')['Points'].transform('count')
df['Point_per_year_gr_700'] = grouped['Points'].agg(np.size)
The end dataframe should look like this, but I can't get the 'Point_per_year_gr_700' column right:
    Year  Points  Point_per_year_gr_700  High_points
0   2014     876                    NaN          NaN
1   2015     789                      3        789.0
2   2014     863                    NaN          NaN
3   2015     673                    NaN          NaN
4   2014     741                    NaN          NaN
5   2015     812                      3        812.0
6   2016     756                      1        756.0
7   2017     788                      1        788.0
8   2016     694                    NaN          NaN
9   2014     701                    NaN          NaN
10  2015     804                      3        804.0
11  2017     690                    NaN          NaN
Use where to mask the DataFrame to NaN where your condition isn't met. You can use this both to create the High_points column and to exclude rows that shouldn't count when you group by year and count how many rows satisfy High_points each year.
df['High_points'] = df['Points'].where(df['Year'].gt(2014) & df['Points'].gt(700))
df['ppy_gt700'] = (df.where(df['High_points'].notnull())
                     .groupby('Year')['Year'].transform('size'))
    Year  Points  High_points  ppy_gt700
0   2014     876          NaN        NaN
1   2015     789        789.0        3.0
2   2014     863          NaN        NaN
3   2015     673          NaN        NaN
4   2014     741          NaN        NaN
5   2015     812        812.0        3.0
6   2016     756        756.0        1.0
7   2017     788        788.0        1.0
8   2016     694          NaN        NaN
9   2014     701          NaN        NaN
10  2015     804        804.0        3.0
11  2017     690          NaN        NaN
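An equivalent route, for what it's worth (a sketch, not part of the answer above): build the boolean mask once and sum it per year with groupby, since True sums as 1.

# Sketch of an alternative, assuming df from the question:
mask = df['Year'].gt(2014) & df['Points'].gt(700)
df['High_points'] = df['Points'].where(mask)
counts = mask.groupby(df['Year']).transform('sum')  # per-year count of qualifying rows
df['ppy_gt700'] = counts.where(mask)                # keep NaN on rows that don't qualify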
I would like to replace missing values based on the values of the column Submitted.
Find below what I have:
Year  Country  Submitted  Age12  Age14
2018  CHI              1    267    NaN
2019  CHI            NaN    NaN    NaN
2020  CHI              1    244    203
2018  ALB              1    163    165
2019  ALB              1    NaN    NaN
2020  ALB              1    161    NaN
2018  GER              1    451    381
2019  GER            NaN    NaN    NaN
2020  GER              1    361    321
And this is what I would like to have:
Year  Country  Submitted  Age12  Age14
2018  CHI              1    267    NaN
2019  CHI            NaN    267    NaN
2020  CHI              1    244    203
2018  ALB              1    163    165
2019  ALB              1    NaN    NaN
2020  ALB              1    161    NaN
2018  GER              1    451    381
2019  GER            NaN    451    381
2020  GER              1    361    321
I tried using the command df.fillna(axis=0, method='ffill')
But this replaces every NaN with the previous value, which is not what I want, because some values should be kept as NaN when the "Submitted" value is 1.
I would like to fill values from the previous row only where the corresponding "Submitted" value is NaN.
Thank you
Try using where together with what you did:
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
This will replace the entries only when Submitted is null.
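One caveat (an assumption about the real data, not part of the answer above): a plain ffill can pull values across country boundaries if rows of different countries interleave. A group-aware sketch that fills only the Age columns within each country:

cols = ['Age12', 'Age14']
filled = df.groupby('Country')[cols].ffill()               # forward-fill within each country only
df[cols] = df[cols].where(df['Submitted'].notna(), filled)  # replace only where Submitted is NaN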
You can do a conditional ffill() using np.where:
import numpy as np

(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
      .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)
You can use .filter() to select the related columns and store them in the list cols. Then use .mask() to replace the values of those columns with their forward-filled values (ffill()) wherever Submitted is NaN, as follows:
cols = df.filter(like='Age').columns
df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())
Result:
print(df)
Year Country Submitted Age12 Age14
0 2018 CHI 1.0 267.0 NaN
1 2019 CHI NaN 267.0 NaN
2 2020 CHI 1.0 244.0 203.0
3 2018 ALB 1.0 163.0 165.0
4 2019 ALB 1.0 NaN NaN
5 2020 ALB 1.0 161.0 NaN
6 2018 GER 1.0 451.0 381.0
7 2019 GER NaN 451.0 381.0
8 2020 GER 1.0 361.0 321.0
I just used a for loop to check and update the values in the dataframe:
import pandas as pd

new_data = [[2018, 'CHI', 1, 267, 30], [2019, 'CHI', 'NaN', 'NaN', 'NaN'], [2020, 'CHI', 1, 244, 203]]
df = pd.DataFrame(new_data, columns=['Year', 'Country', 'Submitted', 'Age12', 'Age14'])
prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if row['Submitted'] == 'NaN':  # this sample data stores the string 'NaN', not a real NaN
        df.at[index, 'Age12'] = prevValue12
        df.at[index, 'Age14'] = prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']
print(df)
Output:
Year Country Submitted Age12 Age14
0 2018 CHI 1 267 30
1 2019 CHI NaN 267 30
2 2020 CHI 1 244 203
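Note that the loop compares against the string 'NaN' because the sample data above is built from literal strings; with real missing values (np.nan), the check would use pd.isna instead (a sketch of the same loop under that assumption):

import numpy as np

new_data = [[2018, 'CHI', 1, 267, 30], [2019, 'CHI', np.nan, np.nan, np.nan], [2020, 'CHI', 1, 244, 203]]
df = pd.DataFrame(new_data, columns=['Year', 'Country', 'Submitted', 'Age12', 'Age14'])
prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if pd.isna(row['Submitted']):  # real-NaN check replaces row['Submitted'] == 'NaN'
        df.at[index, 'Age12'] = prevValue12
        df.at[index, 'Age14'] = prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']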
I am trying to reshape the following dataframe into panel data form by moving the "Year" column so that each year becomes an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want each year to be an individual column; this is an example:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack but I don't think I want a multilevel index as a result. I have been looking through the documentation (to_frame etc.) but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True, then select the column '0' and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
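If you then want the flat layout from the example (State as a regular column, no name on the columns axis), a small cleanup sketch:

df = df.reset_index()   # move State back to a regular column
df.columns.name = None  # drop the leftover 'Award Year' axis label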
pivot_table can help:
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'])
df2
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
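Note that pivot_table aggregates duplicate State/Year pairs (mean by default); since each pair appears only once here, DataFrame.pivot is an equivalent, stricter alternative (a sketch, assuming State sits in the index and the value column is named '0' as in the snippets above):

df2 = df.reset_index().pivot(index='State', columns='Award Year', values='0')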
I have a dataframe that looks something like:
Component Date MTD YTD QTD FC
ABC Jan 2017 56 nan nan nan
DEF Jan 2017 453 nan nan nan
XYZ Jan 2017 657
PQR Jan 2017 123
ABC Feb 2017 56 nan nan nan
DEF Feb 2017 456 nan nan nan
XYZ Feb 2017 6234 57
PQR Feb 2017 123 346
ABC Dec 2017 56 nan nan nan
DEF Dec 2017 nan nan 345 324
XYZ Dec 2017 6234 57
PQR Dec 2017 nan 346 54654 546
And I would like to transpose this dataframe in such a way that the component becomes the prefix of the existing MTD, YTD, QTD, etc. columns, so the expected output would be:
Date ABC_MTD DEF_MTD XYZ_MTD PQR_MTD ABC_YTD DEF_YTD XYZ_YTD PQR_YTD etcetc
Jan 2017 56 453 657 123 nan nan nan nan
Feb 2017 56 456 6234 123 nan nan 57 346
Dec 2017 56 nan 6234 nan 57 346
I am not sure whether a pivot or stack/unstack would be efficient here.
Thanks in advance.
You could try this:
newdf = df.pivot(values=df.columns[2:], index='Date', columns='Component')
newdf.columns = ['%s%s' % (b, '_%s' % a if b else '') for a, b in newdf.columns]  # join the MultiIndex column names
print(newdf)
Output:
df
Component Date MTD YTD QTD FC
0 ABC 2017-01-01 56.0 NaN NaN NaN
1 DEF 2017-01-01 453.0 NaN NaN NaN
2 XYZ 2017-01-01 657.0
3 PQR 2017-01-01 123.0
4 ABC 2017-02-01 56.0 NaN NaN NaN
5 DEF 2017-02-01 456.0 NaN NaN NaN
6 XYZ 2017-02-01 6234.0 57
7 PQR 2017-02-01 123.0 346
8 ABC 2017-12-01 56.0 NaN NaN NaN
9 DEF 2017-12-01 NaN NaN 345 324
10 XYZ 2017-12-01 6234.0 57
11 PQR 2017-12-01 NaN 346 54654 546
newdf
            ABC_MTD  DEF_MTD  PQR_MTD  XYZ_MTD  ABC_YTD  DEF_YTD  PQR_YTD  XYZ_YTD  ABC_QTD  DEF_QTD  PQR_QTD  XYZ_QTD  ABC_FC  DEF_FC  PQR_FC  XYZ_FC
Date
2017-01-01       56      453      123      657      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN     NaN     NaN     NaN     NaN
2017-02-01       56      456      123     6234      NaN      NaN      346       57      NaN      NaN      NaN      NaN     NaN     NaN     NaN     NaN
2017-12-01       56      NaN      NaN     6234      NaN      NaN      346       57      NaN      345    54654      NaN     NaN     324     546     NaN
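As a follow-up sketch: an f-string version of the column join reads a bit more clearly (it is an alternative to the '%s' join above, not a second step), and reset_index brings Date back as a regular column, as in the expected output:

newdf.columns = [f'{comp}_{metric}' for metric, comp in newdf.columns]  # e.g. ('MTD', 'ABC') -> 'ABC_MTD'
newdf = newdf.reset_index()  # Date becomes a regular column again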
I am trying to download data from a website. When I do this, some rows that are not part of the data get included, which is obvious because their first column is not a number.
So I'm getting something like
GM_Num Date Tm
1 Monday, Apr 3 LAA
2 Tuesday, Apr 4 LAA
... ... ...
Gm# May Tm
where the last row is one that I want to drop. In the actual table, there are multiple rows like this randomly throughout the table.
Here is the code that I have tried so far to drop those rows:
import requests
import pandas as pd
url = 'https://www.baseball-reference.com/teams/LAA/2017-schedule-scores.shtml'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df.rename(columns={"Gm#": "GM_Num"}, inplace = True)
#Attempts that didn't work:
df[df['GM_Num'].str.isdigit().isnull()]
#df[df.GM_Num.apply(lambda x: x.isnumeric())].set_index('GM_Num', inplace = True)
#df.set_index('GM_Num', inplace = True)
df
Thank you in advance for any help!
Let's cast your 'Gm#' column and drop records in a couple of steps:
df['Gm#'] = pd.to_numeric(df['Gm#'], errors='coerce')
df = df.dropna(subset=['Gm#'])
df
Output:
Gm# Date Unnamed: 2 Tm Unnamed: 4 Opp W/L R RA \
0 1.0 Monday, Apr 3 boxscore LAA # OAK L 2 4
1 2.0 Tuesday, Apr 4 boxscore LAA # OAK W 7 6
2 3.0 Wednesday, Apr 5 boxscore LAA # OAK W 5 0
3 4.0 Thursday, Apr 6 boxscore LAA # OAK L 1 5
4 5.0 Friday, Apr 7 boxscore LAA NaN SEA W 5 1
.. ... ... ... ... ... ... ... .. ..
162 158.0 Wednesday, Sep 27 boxscore LAA # CHW L-wo 4 6
163 159.0 Thursday, Sep 28 boxscore LAA # CHW L 4 5
164 160.0 Friday, Sep 29 boxscore LAA NaN SEA W 6 5
165 161.0 Saturday, Sep 30 boxscore LAA NaN SEA L 4 6
167 162.0 Sunday, Oct 1 boxscore LAA NaN SEA W 6 2
Inn ... Rank GB Win Loss Save Time D/N \
0 NaN ... 3 1.0 Graveman Nolasco Casilla 2:56 N
1 NaN ... 2 1.0 Bailey Dull Bedrosian 3:17 N
2 NaN ... 2 1.0 Ramirez Cotton NaN 3:15 N
3 NaN ... 2 1.0 Triggs Skaggs NaN 2:44 D
4 NaN ... 1 Tied Chavez Gallardo NaN 2:56 N
.. ... ... ... ... ... ... ... ... ..
162 10 ... 2 20.0 Farquhar Parker NaN 3:58 N
163 NaN ... 2 21.0 Infante Chavez Minaya 3:04 N
164 NaN ... 2 21.0 Wood Rzepczynski Parker 3:01 N
165 NaN ... 2 21.0 Lawrence Bedrosian Diaz 3:32 N
167 NaN ... 2 21.0 Bridwell Simmons NaN 2:38 D
Attendance Streak Orig. Scheduled
0 36067 - NaN
1 11225 + NaN
2 13405 ++ NaN
3 13292 - NaN
4 43911 + NaN
.. ... ... ...
162 17012 - NaN
163 19596 -- NaN
164 35106 + NaN
165 38075 - NaN
167 34940 + NaN
[162 rows x 21 columns]
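The same filter can also be written in one step, without keeping the coerced numeric column around (a sketch, equivalent under the assumption that 'Gm#' is still the original string column):

df = df[pd.to_numeric(df['Gm#'], errors='coerce').notna()]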
I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the unit's previous sale, or NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I group on the unit, keep the units that have several rows, then extract the information for those units associated with the minimal date. I then join this table with my original table, keeping only the rows that have a different date in the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix, and join to append the new DataFrame to the original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
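One thing to watch (an assumption about the data, as the commented line above hints): shift(-1) takes the next row within each unit, which matches here because each unit's sales are listed newest first. If the real data is sorted oldest to newest within each unit, shift(1) is the right direction:

# Hypothetical variant for data sorted oldest-to-newest within each unit:
df = df.join(df.groupby('unit', sort=False).shift(1).add_prefix('prev_'))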