Pandas: Get per-year counts for Dateranges spanning multiple years - python

I have a dataframe with records spanning multiple years:
WarName | StartDate | EndDate
---------------------------------------------
'fakewar1' 01-01-1990 02-02-1995
'examplewar' 05-01-1990 03-07-1998
(...)
'examplewar2' 05-07-1999 06-09-2002
I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:
Year | Number_of_wars
----------------------------
1989 0
1990 2
1991 2
1992 3
1994 2
Usually I would use someting like df.groupby('year').count() to get total wars by year, but since I am currently working with ranges instead of set dates that approach wouldn't work.
I am currently writing a function that generates a list of years, and then for each year in the list checks each row in the dataframe and runs a function that checks if the year is within the date-range of that row (returning True if that is the case).
years = range(1816, 2006)
year_dict = {}
for year in years:
for index, row in df.iterrows():
range = year_in_range(year, row)
if range = True:
year_dict[year] = year_dict.get(year, 0) + 1
This works, but is also seems extremely convoluted. So I was wondering, what am I missing? What would be the canonical 'pandas-way' to solve this issue?

Use a comprehension with pd.value_counts
pd.value_counts([
d.year for s, e in zip(df.StartDate, df.EndDate)
for d in pd.date_range(s, e, freq='Y')
]).sort_index()
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
Alternate
from functools import reduce
def r(t):
return pd.date_range(t.StartDate, t.EndDate, freq='Y')
pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()
Setup
df = pd.DataFrame(dict(
WarName=['fakewar1', 'examplewar', 'feuxwar2'],
StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
), columns=['WarName', 'StartDate', 'EndDate'])
df
WarName StartDate EndDate
0 fakewar1 1990-01-01 1995-02-02
1 examplewar 1990-05-01 1998-03-07
2 feuxwar2 1999-05-07 2002-06-09

By using np.unique
x,y = np.unique(sum([list(range(x.year,y.year)) for x,y in zip(df.StartDate,df.EndDate)],[]), return_counts=True)
pd.Series(dict(zip(x,y)))
Out[222]:
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64

The other answers with pandas are far preferable, but the native Python answer you showed didn't have to be so convoluted; just instantiate and directly index into an array:
wars = [0] * 191 # max(df['EndDate']).year - min(df['StartDate']).year + 1
yr_offset = 1816 # min(df['StartDate']).year
for _, row in df.iterrows():
for yr in range(row['StartDate'].year-yr_offset, row['EndDate'].year-yr_offset): # or maybe (year+1)
wars[yr] += 1

Related

Pandas DataFrame: Calculate percentage difference between rows?

I have a year wise dataframe with each year has three parameters year,type and value. I'm trying to calculate percentage of taken vs empty. For example year 2014 has total of 50 empty and 50 taken - So 50% in empty and 50% in taken as shown in final_df
df
year type value
0 2014 Total 100
1 2014 Empty 50
2 2014 Taken 50
3 2013 Total 2000
4 2013 Empty 100
5 2013 Taken 1900
6 2012 Total 50
7 2012 Empty 45
8 2012 Taken 5
Final df
year Empty Taken
0 2014 50 50
0 2013 ... ...
0 2012 ... ...
Should i shift cells up and do the percentage calculate or any other method?
You can use pivot_table:
new = df[df['type'] != 'Total']
res = (new.pivot_table(index='year',columns='type',values='value').sort_values(by='year',ascending=False).reset_index())
which gets you:
res
year Empty Taken
0 2014 50 50
1 2013 100 1900
2 2012 45 5
And then you can get the percentages for each column:
total = (res['Empty'] + res['Taken'])
for col in ['Empty','Taken']:
res[col+'_perc'] = res[col] / total
year Empty Taken Empty_perc Taken_perc
2014 50 50 0.50 0.50
2013 100 1900 0.05 0.95
2012 45 5 0.90 0.10
As #sophods pointed out, you can use pivot_table to rearange your dataframe, however, to add to his answer; i think you're after the percentage, hence i suggest you keep the 'Total' record and then apply your calculation:
#pivot your data
res = (df.pivot_table(index='year',columns='type',values='value')).reset_index()
#calculate percentages of empty and taken
res['Empty'] = res['Empty']/res['Total']
res['Taken'] = res['Taken']/res['Total']
#final dataframe
res = res[['year', 'Empty', 'Taken']]
You can filter out records having Empty and Taken in type and then groupby year and apply func. In func, you can set the type as index and then get the required values and calculate the percentage. x in func would be dataframe having type and value columns and data per group.
def func(x):
x = x.set_index('type')
total = x['value'].sum()
return [(x.loc['Empty', 'value']/total)*100, (x.loc['Taken', 'value']/total)*100]
temp = (df[df['type'].isin({'Empty', 'Taken'})]
.groupby('year')[['type', 'value']]
.apply(lambda x: func(x)))
temp
year
2012 [90.0, 10.0]
2013 [5.0, 95.0]
2014 [50.0, 50.0]
dtype: object
Convert the result into the required dataframe
pd.DataFrame(temp.values.tolist(), index=temp.index, columns=['Empty', 'Taken'])
Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0

Write python function that sums values from certain rows for each index type using groupby

In my dataframe, df, I am trying to sum the values from the value column for each Product and Year for two periods of the year (Month), specifically Months 1 through 3 and Months 9 through 11. I know I need to use groupby to group Products and Years, and possibly use a lambda function (or an if statement) to separate the two periods of time.
Here's my data frame df:
import pandas as pd
products = {'Product': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C',
'C','C','C'],
'Month': [1,1,3,4,5,10,4,5,10,11,2,3,5,3,9,
10,11,12],
'Year': [1999,1999,1999,1999,1999,1999,2017,2017,1988,1988,2002,2002,2002,2003,2003,
2003,2003,2003],
'value': [250,810,1200,340,250,800,1200,400,250,800,1200,300,290,800,1200,300, 1200, 300]
}
df = pd.DataFrame(products, columns= ['Product', 'Month','Year','value'])
df
And I want a table that looks something like this:
products = {'Product': ['A','A','B','B','C','C','C'],
'MonthGroups': ['Month1:3','Month9:11','Month1:3','Month9:11','Month1:3','Month1:3','Month9:11'],
'Year': [1999,1999,2017,1988,2002, 2003, 2003],
'SummedValue': [2260, 800, 0, 1050, 1500, 800, 2700]
}
new_df = pd.DataFrame(products, columns= ['Product', 'MonthGroups','Year','SummedValue'])
new_df
What I have so far that is that I should use groupby to group Product and Year. What I'm stuck on is defining the two "Month Groups": Months 1 through 3 and Months 9 through 11, which should be the sum of value per year.
df.groupby(['Product','Year']).value.sum().loc[lambda p: p > 10].to_frame()
This isn't right though because it needs to sum based on the month groups.
First created new column by numpy.select with DataFrame.assign, then aggregate also by MonthGroups and because groupby by default remove rows with misisng values if column used for by parameter (like here MonthGroups) are omitted not matched groups:
df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
df['Month'].between(9,11)],
['Month1:3','Month9:11'], default=None))
.groupby(['Product','MonthGroups','Year']).value
.sum()
.reset_index(name='SummedValue')
)
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
If need also 0 sum values for not matched rows:
df2 = df[['Product','Year']].drop_duplicates().assign(MonthGroups='Month1:3',SummedValue=0)
df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
df['Month'].between(9,11)],
['Month1:3','Month9:11'], default=None))
.groupby(['Product','MonthGroups','Year']).value
.sum()
.reset_index(name='SummedValue')
.append(df2)
.drop_duplicates(['Product','MonthGroups','Year'])
)
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
6 B Month1:3 2017 0
8 B Month1:3 1988 0
A little different approach using pd.cut:
bins = [0,3,8,11]
s = pd.cut(df['Month'],bins,labels=['1:3','irrelevant','9:11'])
(df[s.isin(['1:3','9:11'])].assign(MonthGroups=s.astype(str))
.groupby(['Product','MonthGroups','Year'])['value'].sum().reset_index())
Product MonthGroups Year value
0 A 1:3 1999 2260
1 A 9:11 1999 800
2 B 9:11 1988 1050
3 C 1:3 2002 1500
4 C 1:3 2003 800
5 C 9:11 2003 2700

Count occuriences depending on condition & save in new column

I am relatively new to pandas / python.
I have a list of names and dates. I want to group the entries by Name and count the number of Names for 'after 2016' and 'before 2016'. The count should be added to a new column.
My input:
Name Date
Marc 2006
Carl 2003
Carl 2002
Carl 1990
Marc 1999
Max 2016
Max 2014
Marc 2006
Carl 2003
Carl 2002
Carl 2019
Marc 1999
Max 2016
Max 2014
And the output, should look like this:
Before
2016 Count
Marc 1 4
Marc 0 0
Carl 1 5
Carl 0 1
Max 1 2
Max 0 2
So the Output should have 2 entries for each Name, one with a count of Names before 2016 and one after. Addtionally a column which just stats 1 for before 2016 and 0 for after.
As mentioned before, I am quite a beginner. I was able to count the entries with the condition of the year:
df.groupby('Name')['Date'].apply(lambda x: (x<'2016').sum()).reset_index(name='count')
But honestly, I am not quite sure what to do next. Maybe somebody could point me in the right direction.
You can pass to apply a function which returns a 2x2 dataframe. Something like this:
def counting(x):
bef = (x < 2016).sum()
aft = (x > 2016).sum()
return pd.DataFrame([[1, bef], [0, aft]], index=[x.name, x.name], columns=["before 2016", "Count"])
ddf = df.groupby('Name')['Date'].apply(counting).reset_index(level=0, drop=True)
ddf is:
before 2016 Count
Carl 1 5
Carl 0 1
Marc 1 4
Marc 0 0
Max 1 2
Max 0 0
You can group by an external series having the same length as the dataframe:
s = df['Date'].lt(2016).astype('int')
s.name = 'Before 2016'
df.groupby(['Name', s]).count()
Result:
Date
Name Before 2016
Carl 0 1
1 5
Marc 1 4
Max 0 2
1 2
lt stands for "less than". Other comparison functions are le (less than or equal), gt (greater than), ge (greater than or equal) and eq (equal)
From what I understand you need to populate both 1 and 0 for each names, try with pivot_table with df.unstack():
(df.assign(Before=df['Date'].lt(2016).view('i1'))
.pivot_table('Date','Name','Before',aggfunc='count',fill_value=0).unstack()
.sort_index(level=1).reset_index(0,name='Count'))
Before Count
Name
Carl 0 1
Carl 1 5
Marc 0 0
Marc 1 4
Max 0 2
Max 1 2

Filtering Dataframe in Python

I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing the maximum count of country in every year.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way can be to use groupby and along with size for finding in each category adn sort values and slice by possible number of year. You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
Use:
1.
First get count of each pairs Year and Country by groupby and size.
Then get index of max value by idxmax and select row by loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use custom function with value_counts and head:
df = df.groupby('Year')['Country']
.apply(lambda x: x.value_counts().head(1))
.rename_axis(('Year','Country'))
.reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Just provide a method without groupby
Count=pd.Series(list(zip(df2.Year,df2.Country))).value_counts()
.head(2).reset_index(name='Count')
Count[['Year','Country']]=Count['index'].apply(pd.Series)
Count.drop('index',1)
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India

Ascending/Resetting Integer Values from MultiIndex Pandas

I have a dataframe:
import pandas as pd
tuples = [('a', 1990),('a', 1994),('a',1996),('b',1992),('b',1997),('c',2001)]
index = pd.MultiIndex.from_tuples(tuples, names = ['Type', 'Year'])
vals = ['This','That','SomeName','This','SomeOtherName','SomeThirdName']
df = pd.DataFrame(vals, index=index, columns=['Whatev'])
df
Out[3]:
Whatev
Type Year
a 1990 This
1994 That
1996 SomeName
b 1992 This
1997 SomeOtherName
c 2001 SomeThirdName
And I'd like to add a column of ascending integers corresponding to 'Year' that resets for each 'Type', like so:
Whatev IndexInt
Type Year
a 1990 This 1
1994 That 2
1996 SomeName 3
b 1992 This 1
1997 SomeOtherName 2
c 2001 SomeThirdName 1
Here's my current method:
grouped = df.groupby(level=0)
unique_loc = []
for name, group in grouped:
unique_loc += range(1,len(group)+1)
joined['IndexInt'] = unique_loc
But this seems ugly and convoluted to me, and I imagine it could get slow on the ~50 million row dataframe that I'm working with. Is there a simpler way?
you can use groupby(level=0) + cumcount():
In [7]: df['IndexInt'] = df.groupby(level=0).cumcount()+1
In [8]: df
Out[8]:
Whatev IndexInt
Type Year
a 1990 This 1
1994 That 2
1996 SomeName 3
b 1992 This 1
1997 SomeOtherName 2
c 2001 SomeThirdName 1

Categories