Pandas df group by count elements

Pandas df group by count elements - python

My dataframe looks like this.
# initialize list of lists
data = [[1998, 1998,2002,2003], [2001, 1999,1993,2003], [1998, 1999,2003,1994], [1998,1997,2003,1993], [1999,2001,1996, 1999]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
I would like to count for each date the number of occurrences in %. Such that the dataframe looks like this:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
I tried to use Pandas group-by.

The logic is not fully clear (since it looks that the provided output is not the real one corresponding to the provided input), but here are some approaches:
using crosstab
Percent per year
df2 = df.melt(value_name='year')
df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 75.0 25.0 50.0 0.0 0.0
B 0.0 0.0 0.0 100.0 25.0 50.0 50.0 0.0 0.0
C 50.0 0.0 100.0 0.0 0.0 0.0 0.0 100.0 50.0
D 50.0 100.0 0.0 0.0 0.0 25.0 0.0 0.0 50.0
Percent per variable
df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(1), axis=0).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
using groupby
(df.stack()
.groupby(level=1)
.apply(lambda s: s.value_counts(normalize=True))
.unstack(fill_value=0)
.mul(100)
)
Output:
1993 1994 1996 1997 1998 1999 2001 2002 2003
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0

Another option could be the following:
# getting value_counts for each column
df2 = pd.concat([df[col].value_counts(normalize=True) for col in df.columns], axis=1)
# filling null values with 0
df2.fillna(0, inplace=True)
# transforming to string and adding %
df2 = df2.astype('int').astype('str')+'%'
# getting your output
df2.loc['1997':'1999', 'A':'C'].T
Output:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%

melt + groupby + unstack
(df.melt().groupby(['variable', 'value']).size()
/ df.melt().groupby('value').size()).unstack(1)
Out[1]:
value 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A NaN NaN NaN NaN 0.75 0.25 0.5 NaN NaN
B NaN NaN NaN 1.0 0.25 0.50 0.5 NaN NaN
C 0.5 NaN 1.0 NaN NaN NaN NaN 1.0 0.5
D 0.5 1.0 NaN NaN NaN 0.25 NaN NaN 0.5

Related

Store the result in new columns, named based on another variable (Pandas)

I have a dataframe. What I need is to calculate the difference between the variables A and B, and store the result in the new columns based on the variable df['Value']. If the Value == 1, then the result is stored in column named Diff_1, if the Value == 2, then column Diff_2, and so on.
Here is the code so far, but obviously the line df_red['Diff_' + str(value) ] = df_red['A'] - df_red['B'] is not doing what I want:
import pandas as pd
df = pd.read_excel(r'E:\...\.xlsx')
print(df)
value = list(set(df['Value']))
print(value)
for value in value:
df_red = df[df['Value'] == value]
df_red['Diff_' + str(value) ] = df_red['A'] - df_red['B']
Out[126]:
ID Value A B
0 1 1 56.0 49.0
1 2 3 56.0 50.0
2 3 4 103.0 44.0
3 4 2 89.0 44.0
4 5 1 84.0 41.0
5 6 1 77.0 43.0
6 7 2 71.0 35.0
7 8 4 77.0 32.0
print(value)
[1, 2, 3, 4]
After a simple operation of df['A'] - df['B'] the result should look like this.
Out[128]:
ID Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 1 56.0 49.0 7.0 0.0 0.0 0.0
1 2 3 56.0 50.0 0.0 0.0 6.0 0.0
2 3 4 103.0 44.0 0.0 0.0 0.0 60.0
3 4 2 89.0 44.0 0.0 45.0 0.0 0.0
4 5 1 84.0 41.0 43.0 0.0 0.0 0.0
5 6 1 77.0 43.0 34.0 0.0 0.0 0.0
6 7 2 71.0 35.0 0.0 36.0 0.0 0.0
7 8 4 77.0 32.0 0.0 0.0 0.0 45.0
Not so great way of doing this would be like this, however I am looking for some more efficient, better ways:
df['Diff_1'] = df[df['Value']==1]['A'] - df[df['Value']==1]['B']
df['Diff_2'] = df[df['Value']==2]['A'] - df[df['Value']==2]['B']
df['Diff_3'] = df[df['Value']==3]['A'] - df[df['Value']==3]['B']
df['Diff_4'] = df[df['Value']==4]['A'] - df[df['Value']==4]['B']

You can use:
df.join(df.set_index(['ID', 'Value'])
.eval('A-B')
.unstack(level=1).add_prefix('Diff_')
.reset_index(drop=True)
)
output:
ID Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 1 56.0 49.0 7.0 NaN NaN NaN
1 2 3 56.0 50.0 NaN NaN 6.0 NaN
2 3 4 103.0 44.0 NaN NaN NaN 59.0
3 4 2 89.0 44.0 NaN 45.0 NaN NaN
4 5 1 84.0 41.0 43.0 NaN NaN NaN
5 6 1 77.0 43.0 34.0 NaN NaN NaN
6 7 2 71.0 35.0 NaN 36.0 NaN NaN
7 8 4 77.0 32.0 NaN NaN NaN 45.0

Here is my approach which may not be fastest, but it's a start:
for i in df['Value'].unique():
df.loc[df['Value'] == i, 'Diff_' + str(i)] = df['A'] - df['B']
df.fillna(0, inplace = True)
Output of my fake data:
Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 20 2 18.0 0.0 0.0 0.0
1 1 30 5 25.0 0.0 0.0 0.0
2 2 40 7 0.0 33.0 0.0 0.0
3 2 50 15 0.0 35.0 0.0 0.0
4 3 60 25 0.0 0.0 35.0 0.0
5 3 20 7 0.0 0.0 13.0 0.0
6 4 15 36 0.0 0.0 0.0 -21.0
7 4 14 3 0.0 0.0 0.0 11.0

How to Parse the MLB Team and Player data using Pandas DataFrame?

I am still learning and could use some help. I would like to parse the starting pitchers and their respective teams.
I would like the data in a Pandas Dataframe but do not know how to parse the data correctly. Any suggestions would be very helpful. Thanks for your time!
Here is an example of the desired output:
Game Team Name
OAK Chris Bassitt
1
ARI Zac Gallen
SEA Justin Dunn
2
LAD Ross Stripling
Here is my code:
#url = https://www.baseball-reference.com/previews/index.shtml
#Data needed: 1) Team 2) Pitcher Name
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
test = pd.read_html(url)
for t in test:
name = t[1]
team = t[0]
print(team)
print(name)
I feel like I have to create a Pandas DataFrame and append the Team and Name, however, I am not sure how to parse out just the desired output.

pandas.read_html returns a list of all the tables for a given URL
dataframes in the list can be selected using normal list slicing and selecting methods
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
list_of_dataframes = pd.read_html(url)
# select and combine the dataframes for games; every other dataframe from 0 (even)
games = pd.concat(list_of_dataframes[0::2])
# display(games.head())
0 1 2
0 Cubs (13-6) NaN Preview
1 Cardinals (4-4) NaN 12:00AM
0 Cardinals (4-4) NaN Preview
1 Cubs (13-6) NaN 5:15PM
0 Red Sox (6-16) NaN Preview
# select the players from list_of_dataframes; every other dataframe from 1 (odd)
players = list_of_dataframes[1::2]
# add the Game to the dataframes
for i, df in enumerate(players, 1):
df['Game'] = i
players[i-1] = df
# combine all the dataframe
players = pd.concat(players).reset_index(drop=True)
# create a players column for the name only
players['name'] = players[1].str.split('(', expand=True)[0]
# rename the colume
players.rename(columns={0: 'Team'}, inplace=True)
# drop 1
players.drop(columns=[1], inplace=True)
# display(players.head(6))
Team Game name
0 CHC 1 Tyson Miller
1 STL 1 Alex Reyes
2 STL 2 Kwang Hyun Kim
3 CHC 2 Kyle Hendricks
4 BOS 3 Martin Perez
5 NYY 3 Jordan Montgomery

Love those sports reference.com sites. Trenton's solution is perfect, so don't change the accepted answer, but just wanted to throw this alternative data source for probable pitchers incase you were interested.
Looks like mlb.com has a publicly available api to pull that info (I'm going to assume that's possibly where baseball-reference fills their probable pitcher page). But what I like about this is you can get much more data returned to analyse, and it gives you the option to get a wider date range to get historical data, and possibly probable pitchers 2 or 3 days in advance (as well as day of). So give this code a look over too, play with it, practice with it.
But this could set you up to your first machine learning sort of thing.
PS: Let me know if you figure out what strikeZoneBottom and strikeZoneTop means here if you even bother to look into this data. I haven't been able to figure out what those mean.
I'm also wondering too, if there's data regarding the ballpark. Like in the pitchers stats there's the fly ball:ground ball ratio. If there was data on the ballparks like if you have flyball pitcher in a venue that yields lots of homeruns, that you might see a different situation for that same pitcher in a ballpark where flyballs don't quite travel as far, or the stadium has deeper fences (essentially homeruns turn into warning track fly out and vice versa)??
Code:
import requests
import pandas as pd
from datetime import datetime, timedelta
url = 'https://statsapi.mlb.com/api/v1/schedule'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')
today = datetime.strftime(datetime.now(), '%Y-%m-%d')
tomorrow = datetime.strftime(datetime.now() + timedelta(1), '%Y-%m-%d')
#To get 7 days earlier; notice the minus sign
#pastDate = datetime.strftime(datetime.now() - timedelta(7), '%Y-%m-%d')
#To get 3 days later; notice the plus sign
#futureDate = datetime.strftime(datetime.now() + timedelta(3), '%Y-%m-%d')
#hydrate parameter is to get back certain data elements. Not sure how to alter it exactly yet, would have to play around
#But without hydrate, it doesn't return probable pitchers
payload = {
'sportId': '1',
'startDate': today, #<-- Change these to get a wider range of games (to also get historical stats for machine learning)
'endDate': today, #<-- Change these to get a wider range of games (to possible probable pitchers for next few days. just need to adjust timedelta above)
'hydrate': 'team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)'}
jsonData = requests.get(url, headers=headers, params=payload).json()
dates = jsonData['dates']
rows = []
for date in dates:
games = date['games']
for game in games:
dayNight = game['dayNight']
gameDate = game['gameDate']
city = game['venue']['location']['city']
venue = game['venue']['name']
teams = game['teams']
for k, v in teams.items():
row = {}
row.update({'dayNight':dayNight,
'gameDate':gameDate,
'city':city,
'venue':venue})
homeAway = k
teamName = v['team']['name']
if 'probablePitcher' not in v.keys():
row.update({'homeAway':homeAway,
'teamName':teamName})
rows.append(row)
else:
probablePitcher = v['probablePitcher']
fullName = probablePitcher['fullName']
pitchHand = probablePitcher['pitchHand']['code']
strikeZoneBottom = probablePitcher['strikeZoneBottom']
strikeZoneTop = probablePitcher['strikeZoneTop']
row.update({'homeAway':homeAway,
'teamName':teamName,
'probablePitcher':fullName,
'pitchHand':pitchHand,
'strikeZoneBottom':strikeZoneBottom,
'strikeZoneTop':strikeZoneTop})
stats = probablePitcher['stats']
for stat in stats:
if stat['type']['displayName'] == 'statsSingleSeason' and stat['group']['displayName'] == 'pitching':
playerStats = stat['stats']
row.update(playerStats)
rows.append(row)
df = pd.DataFrame(rows)
Output: First 10 rows
print (df.head(10).to_string())
airOuts atBats balks baseOnBalls blownSaves catchersInterference caughtStealing city completeGames dayNight doubles earnedRuns era gameDate gamesFinished gamesPitched gamesPlayed gamesStarted groundOuts groundOutsToAirouts hitBatsmen hitByPitch hits hitsPer9Inn holds homeAway homeRuns homeRunsPer9 inheritedRunners inheritedRunnersScored inningsPitched intentionalWalks losses obp outs pickoffs pitchHand probablePitcher rbi runs runsScoredPer9 sacBunts sacFlies saveOpportunities saves shutouts stolenBasePercentage stolenBases strikeOuts strikeZoneBottom strikeZoneTop strikeoutWalkRatio strikeoutsPer9Inn teamName triples venue walksPer9Inn whip wildPitches winPercentage wins
0 15.0 44.0 0.0 9.0 0.0 0.0 0.0 Baltimore 0.0 day 2.0 8.0 6.00 2020-08-19T17:05:00Z 0.0 3.0 3.0 3.0 9.0 0.60 0.0 0.0 10.0 7.50 0.0 away 3.0 2.25 0.0 0.0 12.0 0.0 1.0 .358 36.0 0.0 R Tanner Roark 0.0 8.0 6.00 0.0 0.0 0.0 0.0 0.0 1.000 1.0 10.0 1.589 3.467 1.11 7.50 Toronto Blue Jays 0.0 Oriole Park at Camden Yards 6.75 1.58 0.0 .500 1.0
1 18.0 74.0 0.0 3.0 0.0 0.0 0.0 Baltimore 0.0 day 5.0 8.0 4.00 2020-08-19T17:05:00Z 0.0 4.0 4.0 4.0 18.0 1.00 1.0 1.0 22.0 11.00 0.0 home 1.0 0.50 0.0 0.0 18.0 0.0 2.0 .329 54.0 1.0 L Tommy Milone 0.0 11.0 5.50 1.0 1.0 0.0 0.0 0.0 1.000 1.0 18.0 1.535 3.371 6.00 9.00 Baltimore Orioles 1.0 Oriole Park at Camden Yards 1.50 1.39 1.0 .333 1.0
2 14.0 59.0 0.0 2.0 0.0 0.0 0.0 Boston 0.0 day 3.0 7.0 4.02 2020-08-19T17:35:00Z 0.0 3.0 3.0 3.0 14.0 1.00 0.0 0.0 17.0 9.77 0.0 away 2.0 1.15 0.0 0.0 15.2 0.0 2.0 .311 47.0 0.0 R Jake Arrieta 0.0 7.0 4.02 0.0 0.0 0.0 0.0 0.0 .--- 0.0 14.0 1.627 3.549 7.00 8.04 Philadelphia Phillies 0.0 Fenway Park 1.15 1.21 2.0 .333 1.0
3 2.0 14.0 1.0 3.0 0.0 0.0 0.0 Boston 0.0 day 1.0 5.0 22.50 2020-08-19T17:35:00Z 0.0 1.0 1.0 1.0 1.0 0.50 0.0 0.0 7.0 31.50 0.0 home 2.0 9.00 0.0 0.0 2.0 0.0 1.0 .588 6.0 0.0 L Kyle Hart 0.0 7.0 31.50 0.0 0.0 0.0 0.0 0.0 .--- 0.0 4.0 1.681 3.575 1.33 18.00 Boston Red Sox 0.0 Fenway Park 13.50 5.00 0.0 .000 0.0
4 8.0 27.0 0.0 0.0 0.0 0.0 0.0 Chicago 0.0 day 0.0 2.0 2.57 2020-08-19T18:20:00Z 0.0 1.0 1.0 1.0 7.0 0.88 0.0 0.0 6.0 7.71 0.0 away 0.0 0.00 0.0 0.0 7.0 0.0 0.0 .222 21.0 0.0 R Jack Flaherty 0.0 2.0 2.57 0.0 0.0 0.0 0.0 0.0 .--- 0.0 6.0 1.627 3.549 -.-- 7.71 St. Louis Cardinals 0.0 Wrigley Field 0.00 0.86 0.0 1.000 1.0
5 13.0 65.0 0.0 6.0 0.0 0.0 1.0 Chicago 0.0 day 2.0 6.0 2.84 2020-08-19T18:20:00Z 0.0 3.0 3.0 3.0 28.0 2.15 1.0 1.0 10.0 4.74 0.0 home 2.0 0.95 0.0 0.0 19.0 0.0 1.0 .236 57.0 0.0 R Alec Mills 0.0 6.0 2.84 0.0 0.0 0.0 0.0 0.0 .000 0.0 14.0 1.627 3.549 2.33 6.63 Chicago Cubs 0.0 Wrigley Field 2.84 0.84 0.0 .667 2.0
6 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN away NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Chicago Cubs NaN Wrigley Field NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN home NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN St. Louis Cardinals NaN Wrigley Field NaN NaN NaN NaN NaN
8 13.0 92.0 0.0 8.0 0.0 0.0 1.0 Kansas City 0.0 day 6.0 10.0 3.91 2020-08-19T21:05:00Z 0.0 4.0 4.0 4.0 24.0 1.85 0.0 0.0 25.0 9.78 0.0 away 1.0 0.39 0.0 0.0 23.0 0.0 2.0 .327 69.0 0.0 R Luis Castillo 0.0 12.0 4.70 0.0 1.0 0.0 0.0 0.0 .000 0.0 31.0 1.589 3.467 3.88 12.13 Cincinnati Reds 1.0 Kauffman Stadium 3.13 1.43 0.0 .000 0.0
9 10.0 36.0 0.0 5.0 0.0 0.0 0.0 Kansas City 0.0 day 0.0 0.0 0.00 2020-08-19T21:05:00Z 0.0 2.0 2.0 2.0 11.0 1.10 1.0 1.0 5.0 4.09 0.0 home 0.0 0.00 0.0 0.0 11.0 0.0 0.0 .262 33.0 0.0 R Brad Keller 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 .--- 0.0 10.0 1.681 3.575 2.00 8.18 Kansas City Royals 0.0 Kauffman Stadium 4.09 0.91 0.0 1.000 2.0

python how to find the number of days in each month from Dec 2019 and forward between two date columns

I have two date columns 'StartDate' and 'EndDate'. I want to find the number of days in each month between those two dates from Dec 2019 and forward ignoring any prior months of 2019 for calculation. StartDate and EndDate of each row can span across 2 years with overlapped months and Date columns can also be empty.
Sample Data:
df = {'Id': ['1','2','3','4','5','6','7', '8'],
'Item': ['A','B','C','D','E','F','G', 'H'],
'StartDate': ['2019-12-10', '2019-12-01', '2019-10-01', '2020-01-01', '2019-03-01','2019-03-01','2019-10-01', ''],
'EndDate': ['2020-02-21' ,'2020-01-01','2020-08-31','2020-01-30','2019-12-31','2019-12-31','2020-08-31', '']
}
df = pd.DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])
Expected O/P:
The below solution works partially works.
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
def days_of_month(x):
s = pd.date_range(*x, freq='D').to_series()
return s.resample('M').count().rename(lambda x: x.month)
df1 = df[['StartDate', 'EndDate']].apply(days_of_month, axis=1).fillna(0)
df_final = df[['StartDate', 'EndDate']].join([df['StartDate'].dt.year.rename('Year'), df1])

Try this:
df.join(df.dropna(axis=0,how='any')
.apply(lambda x: pd.date_range(x['StartDate'],x['EndDate'], freq='D')
.to_frame().resample('M').count().loc['2019-12-01':].unstack(), axis=1)[0].fillna(0))
Output:
Id Item StartDate EndDate 2019-12-31 00:00:00 2020-01-31 00:00:00 2020-02-29 00:00:00 2020-03-31 00:00:00 2020-04-30 00:00:00 2020-05-31 00:00:00 2020-06-30 00:00:00 2020-07-31 00:00:00 2020-08-31 00:00:00
0 1 A 2019-12-10 2020-02-21 22.0 31.0 21.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 B 2019-12-01 2020-01-01 31.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 C 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0
3 4 D 2020-01-01 2020-01-30 0.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 E 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6 F 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 7 G 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0
7 8 H NaT NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN

We'll create two large DataFrames, one with the start of each month and another with the end of each month. We'll then clip them accordingly which leaves us with a simple subtraction. Since you want to include the end dates, we need to add 1 day, and we clean up any negative dates, which should be 0.
import pandas as pd
df_s = pd.DataFrame([pd.date_range('2019-12-01', '2020-12-01', freq='MS').to_numpy()],
index=df.index)
df_e = df_s + pd.offsets.MonthEnd(1)
df_s = df_s.clip(lower=pd.to_datetime(df.StartDate), axis=0)
df_e = df_e.clip(upper=pd.to_datetime(df.EndDate), axis=0)
res = ((df_e - df_s) + pd.to_timedelta(1, 'd')).clip(lower=pd.to_timedelta(0, 'd'))
res.columns = pd.period_range(start='2019-12', end='2020-12', freq='M')
# So int or float
for col in res.columns:
res[col] = res[col].dt.days
df = pd.concat([df, res], axis=1)
Id Item StartDate EndDate 2019-12 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08 2020-09 2020-10 2020-11 2020-12
0 1 A 2019-12-10 2020-02-21 22.0 31.0 21.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 B 2019-12-01 2020-01-31 31.0 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 C 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0 0.0 0.0 0.0 0.0
3 4 D 2020-01-01 2020-01-30 0.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 E 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6 F 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 7 G 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0 0.0 0.0 0.0 0.0
7 8 H NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Here's another approach, creating the whole day list, and compute the overlap with broadcasting:
dates = pd.date_range('2019-12-01', '2020-12-31', freq='D').values
(pd.DataFrame((df.StartDate.values <= dates[:,None])
& (df.EndDate.values >= dates[:,None]),
index=dates)
.resample('M')
.sum()
.T
)
Output:
2019-12-31 00:00:00 2020-01-31 00:00:00 2020-02-29 00:00:00 2020-03-31 00:00:00 2020-04-30 00:00:00 2020-05-31 00:00:00 2020-06-30 00:00:00 2020-07-31 00:00:00 2020-08-31 00:00:00 2020-09-30 00:00:00 2020-10-31 00:00:00 2020-11-30 00:00:00 2020-12-31 00:00:00
-- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
0 22 31 21 0 0 0 0 0 0 0 0 0 0
1 31 1 0 0 0 0 0 0 0 0 0 0 0
2 31 31 29 31 30 31 30 31 31 0 0 0 0
3 0 30 0 0 0 0 0 0 0 0 0 0 0
4 31 0 0 0 0 0 0 0 0 0 0 0 0
5 31 0 0 0 0 0 0 0 0 0 0 0 0
6 31 31 29 31 30 31 30 31 31 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0

Use the same code, add coerce to to_datetime and dropna and change in the rename part
df['StartDate'] = pd.to_datetime(df['StartDate'], errors='coerce')
df['EndDate'] = pd.to_datetime(df['EndDate'], errors='coerce')
def days_of_month(x):
s = pd.date_range(*x, freq='D').to_series()
return s.resample('M').count().rename(lambda x: x.to_period(freq='M'))
df1 = (df[['StartDate', 'EndDate']].dropna().apply(days_of_month, axis=1)
.reindex(df.index).fillna(0))
df_final = df.join(df1)
Out[1205]:
Id Item StartDate EndDate 2019-03 2019-04 2019-05 2019-06 2019-07 \
0 1 A 2019-12-10 2020-02-21 0.0 0.0 0.0 0.0 0.0
1 2 B 2019-12-01 2020-01-01 0.0 0.0 0.0 0.0 0.0
2 3 C 2019-10-01 2020-08-31 0.0 0.0 0.0 0.0 0.0
3 4 D 2020-01-01 2020-01-30 0.0 0.0 0.0 0.0 0.0
4 5 E 2019-03-01 2019-12-31 31.0 30.0 31.0 30.0 31.0
5 6 F 2019-03-01 2019-12-31 31.0 30.0 31.0 30.0 31.0
6 7 G 2019-10-01 2020-08-31 0.0 0.0 0.0 0.0 0.0
7 8 H NaT NaT 0.0 0.0 0.0 0.0 0.0
2019-08 2019-09 2019-10 2019-11 2019-12 2020-01 2020-02 2020-03 \
0 0.0 0.0 0.0 0.0 22.0 31.0 21.0 0.0
1 0.0 0.0 0.0 0.0 31.0 1.0 0.0 0.0
2 0.0 0.0 31.0 30.0 31.0 31.0 29.0 31.0
3 0.0 0.0 0.0 0.0 0.0 30.0 0.0 0.0
4 31.0 30.0 31.0 30.0 31.0 0.0 0.0 0.0
5 31.0 30.0 31.0 30.0 31.0 0.0 0.0 0.0
6 0.0 0.0 31.0 30.0 31.0 31.0 29.0 31.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2020-04 2020-05 2020-06 2020-07 2020-08
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 30.0 31.0 30.0 31.0 31.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0
6 30.0 31.0 30.0 31.0 31.0
7 0.0 0.0 0.0 0.0 0.0

Python: pivot dataset

I have a data frame which looks like:
df = pd.DataFrame({'Dev':[1,2,3,4,5,6,7,8,9,10,11,12],'2012':[1,2,3,4,5,6,7,8,9,10,11,12],
'GWP':[0,0,0,10,20,30,40,50,60,70,80,90],'Inc':[0,0,0,10,20,30,40,50,60,70,80,90],
'Dev1':[1,2,3,4,5,6,7,8,9,10,np.nan,np.nan],'2013':[1,2,3,4,5,6,7,8,9,10,np.nan,np.nan],
'GWP1':[0,0,0,10,20,30,40,50,60,70,np.nan,np.nan],'Inc1':[0,0,0,10,20,30,40,50,60,70,np.nan,np.nan],
'Dev2':[1,2,3,4,5,6,7,8,np.nan,np.nan,np.nan,np.nan],'2014':[1,2,3,4,5,6,7,8,np.nan,np.nan,np.nan,np.nan],
'GWP2':[0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan],'Inc2':[0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan],
})
df.head()
Dev 2012 GWP Inc Dev1 2013 GWP1 Inc1 Dev2 2014 GWP2 Inc2
0 1 1 0 0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
1 2 2 0 0 2.0 2.0 0.0 0.0 2.0 2.0 0.0 0.0
2 3 3 0 0 3.0 3.0 0.0 0.0 3.0 3.0 0.0 0.0
3 4 4 10 10 4.0 4.0 10.0 10.0 4.0 4.0 10.0 10.0
4 5 5 20 20 5.0 5.0 20.0 20.0 5.0 5.0 20.0 20.0
I'm trying to pivot this dataframe to the following:
result_df = pd.DataFrame({'Dev':list(np.arange(1,13))*3,'YEAR':[2012]*12 + [2013]*12 + [2014]*12,
'GWP':[0,0,0,10,20,30,40,50,60,70,80,90] + [0,0,0,10,20,30,40,50,60,70,np.nan,np.nan] + [0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan],
'Inc':[0,0,0,10,20,30,40,50,60,70,80,90] + [0,0,0,10,20,30,40,50,60,70,np.nan,np.nan] + [0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan]})
result_df.head()
Out[83]:
Dev YEAR GWP Inc
0 1 2012 0.0 0.0
1 2 2012 0.0 0.0
2 3 2012 0.0 0.0
3 4 2012 10.0 10.0
4 5 2012 20.0 20.0
Does anyone know how this is possible using pandas or R?

Consider melt and wide_to_long. Specifically, melt the year columns, 2012-2014, then rename columns to adhere to stubsuffix style. Finally, reshape across multiple columns according to subs, Dev, GWP, Inc:
melt_df = (df.melt(id_vars = df.columns[~df.columns.isin(['2012', '2013', '2014'])],
value_vars=['2012', '2013', '2014'], var_name='Year')
.drop(columns=['value'])
.rename(columns={'GWP':'GWP0', 'Inc':'Inc0', 'Dev':'Dev0'})
)
final_df = pd.wide_to_long(melt_df.assign(id = lambda x: x.index),
["Dev", "GWP", "Inc"], i="id", j="suffix")
print(final_df.head(20))
# Year GWP Inc Dev
# id suffix
# 0 0 2012 0.0 0.0 1.0
# 1 0 2012 0.0 0.0 2.0
# 2 0 2012 0.0 0.0 3.0
# 3 0 2012 10.0 10.0 4.0
# 4 0 2012 20.0 20.0 5.0
# 5 0 2012 30.0 30.0 6.0
# 6 0 2012 40.0 40.0 7.0
# 7 0 2012 50.0 50.0 8.0
# 8 0 2012 60.0 60.0 9.0
# 9 0 2012 70.0 70.0 10.0
# 10 0 2012 80.0 80.0 11.0
# 11 0 2012 90.0 90.0 12.0
# 12 0 2013 0.0 0.0 1.0
# 13 0 2013 0.0 0.0 2.0
# 14 0 2013 0.0 0.0 3.0
# 15 0 2013 10.0 10.0 4.0
# 16 0 2013 20.0 20.0 5.0
# 17 0 2013 30.0 30.0 6.0
# 18 0 2013 40.0 40.0 7.0
# 19 0 2013 50.0 50.0 8.0

How to add rows into data frame in order to organize time series

I have a data frame that looks like this
AUX TER
11/2014 2.0 10.0
01/2015 23.0 117.0
03/2015 57.0 65.0
04/2015 1.0 1.0
05/2015 16.0 20.0
07/2015 19.0 30.0
I want to fill the values for months that are not in data frame with 0
like this
AUX TER
11/2014 2.0 10.0
12/2014 0 0
01/2015 23.0 117.0
03/2015 57.0 65.0
04/2015 1.0 1.0
05/2015 16.0 20.0
06/2015 0 0
07/2015 19.0 30.0

Change your index to datetime
df.index = pd.to_datetime(df.index, format='%m/%Y')
Use asfreq with the fill_value argument
df.asfreq('MS', fill_value=0)
AUX TER
2014-11-01 2.0 10.0
2014-12-01 0.0 0.0
2015-01-01 23.0 117.0
2015-02-01 0.0 0.0
2015-03-01 57.0 65.0
2015-04-01 1.0 1.0
2015-05-01 16.0 20.0
2015-06-01 0.0 0.0
2015-07-01 19.0 30.0

You can use the below to reindex():
s=pd.to_datetime(df.index)
df.reindex(pd.date_range(s.min(),s.max()+pd.DateOffset(months=1),freq='M')
.strftime('%m/%Y'),fill_value=0)
AUX TER
11/2014 2.0 10.0
12/2014 0.0 0.0
01/2015 23.0 117.0
02/2015 0.0 0.0
03/2015 57.0 65.0
04/2015 1.0 1.0
05/2015 16.0 20.0
06/2015 0.0 0.0
07/2015 19.0 30.0

Using df.resample("M").mean().fillna(0)
Ex:
df = pd.read_csv(filename, sep="\s+", parse_dates=['date'])
df.set_index("date", inplace=True)
df = df.resample("M").mean().fillna(0)
df.index = df.index.strftime("%m/%Y")
print(df)
Output:
AUX TER
11/2014 2.0 10.0
12/2014 0.0 0.0
01/2015 23.0 117.0
02/2015 0.0 0.0
03/2015 57.0 65.0
04/2015 1.0 1.0
05/2015 16.0 20.0
06/2015 0.0 0.0
07/2015 19.0 30.0

When you have a datetime format, you can try:
df.resample('MS').mean()
following this post: Python, summarize daily data in dataframe to monthly and quarterly

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas df group by count elements - python

Related

Store the result in new columns, named based on another variable (Pandas)

How to Parse the MLB Team and Player data using Pandas DataFrame?

python how to find the number of days in each month from Dec 2019 and forward between two date columns

Python: pivot dataset

How to add rows into data frame in order to organize time series

Categories

Resources