In my dataframe, df, I am trying to sum the values from the value column for each Product and Year for two periods of the year (Month), specifically Months 1 through 3 and Months 9 through 11. I know I need to use groupby to group Products and Years, and possibly use a lambda function (or an if statement) to separate the two periods of time.
Here's my data frame df:
import pandas as pd
products = {'Product': ['A','A','A','A','A','A','B','B','B','B',
                        'C','C','C','C','C','C','C','C'],
            'Month': [1,1,3,4,5,10,4,5,10,11,2,3,5,3,9,10,11,12],
            'Year': [1999,1999,1999,1999,1999,1999,2017,2017,1988,1988,
                     2002,2002,2002,2003,2003,2003,2003,2003],
            'value': [250,810,1200,340,250,800,1200,400,250,800,
                      1200,300,290,800,1200,300,1200,300]
           }
df = pd.DataFrame(products, columns=['Product', 'Month', 'Year', 'value'])
df
And I want a table that looks something like this:
products = {'Product': ['A','A','B','B','C','C','C'],
            'MonthGroups': ['Month1:3','Month9:11','Month1:3','Month9:11',
                            'Month1:3','Month1:3','Month9:11'],
            'Year': [1999,1999,2017,1988,2002,2003,2003],
            'SummedValue': [2260, 800, 0, 1050, 1500, 800, 2700]
           }
new_df = pd.DataFrame(products, columns=['Product', 'MonthGroups', 'Year', 'SummedValue'])
new_df
What I have so far is that I should use groupby to group Product and Year. What I'm stuck on is defining the two "Month Groups", Months 1 through 3 and Months 9 through 11, whose values should be summed per Product and Year.
df.groupby(['Product','Year']).value.sum().loc[lambda p: p > 10].to_frame()
This isn't right though because it needs to sum based on the month groups.
First create a new column with numpy.select via DataFrame.assign, then aggregate by MonthGroups as well. Because groupby by default removes rows with missing values in the by columns (here MonthGroups), the unmatched months are dropped automatically:
import numpy as np

df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                          df['Month'].between(9,11)],
                                         ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
       )
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
If you also need 0 sums for the unmatched Product/Year combinations, append placeholder rows and drop the duplicates:
# placeholder rows: every Product/Year pair with a default group and 0 sum
df2 = df[['Product','Year']].drop_duplicates().assign(MonthGroups='Month1:3', SummedValue=0)

df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                          df['Month'].between(9,11)],
                                         ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
         .append(df2)
         .drop_duplicates(['Product','MonthGroups','Year'])
       )
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
6 B Month1:3 2017 0
8 B Month1:3 1988 0
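Note that DataFrame.append was removed in pandas 2.0, so on current versions the append-and-dedupe step has to be written with pd.concat instead. A sketch of the same logic (summed is just an intermediate name introduced here):
summed = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                             df['Month'].between(9,11)],
                                            ['Month1:3','Month9:11'], default=None))
            .groupby(['Product','MonthGroups','Year']).value
            .sum()
            .reset_index(name='SummedValue'))
# pandas >= 2.0: concatenate the placeholder rows, then keep the first
# occurrence of each Product/MonthGroups/Year combination
df1 = (pd.concat([summed, df2])
         .drop_duplicates(['Product','MonthGroups','Year']))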
A slightly different approach using pd.cut:
# (0,3] -> '1:3', (3,8] -> 'irrelevant', (8,11] -> '9:11'; month 12 falls
# outside the bins, becomes NaN, and is dropped by the isin filter below
bins = [0,3,8,11]
s = pd.cut(df['Month'], bins, labels=['1:3','irrelevant','9:11'])
(df[s.isin(['1:3','9:11'])].assign(MonthGroups=s.astype(str))
   .groupby(['Product','MonthGroups','Year'])['value'].sum().reset_index())
Product MonthGroups Year value
0 A 1:3 1999 2260
1 A 9:11 1999 800
2 B 9:11 1988 1050
3 C 1:3 2002 1500
4 C 1:3 2003 800
5 C 9:11 2003 2700
I have a simple DataFrame, and I want to pick the most recent 2 rows per person (sorted by "Year"), keeping all columns.
import pandas as pd
data = {'People' : ["John","John","John","Kate","Kate","David","David","David","David"],
'Year': ["2018","2019","2006","2017","2012","2006","2019","2018","2017"],
'Sales' : [120,100,60,150,135,140,90,110,160]}
df = pd.DataFrame(data)
I tried the following, but it doesn't produce what I want:
df = df.groupby('People')
df_1 = pd.concat([df.head(2)]).drop_duplicates().sort_values('Year').reset_index(drop=True)
What's the right way to write it? Thank you.
IIUC, use pandas.DataFrame.nlargest:
df['Year'] = df['Year'].astype(int)  # Year is stored as strings; nlargest needs numbers
df.groupby('People', as_index=False).apply(lambda x: x.nlargest(2, "Year"))
Output:
     People  Year  Sales
0 6   David  2019     90
  7   David  2018    110
1 1    John  2019    100
  0    John  2018    120
2 3    Kate  2017    150
  4    Kate  2012    135
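If you'd rather avoid apply and the (group, original-row) MultiIndex it produces, here is a sketch of an equivalent sort-then-head approach, assuming Year has already been cast to int as above (out is just a name used here):
# Sort newest-first, keep the first two rows within each People group,
# then tidy up the ordering and index.
out = (df.sort_values('Year', ascending=False)
         .groupby('People')
         .head(2)
         .sort_values(['People', 'Year'], ascending=[True, False])
         .reset_index(drop=True))
print(out)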
I have a dataframe with records spanning multiple years:
WarName | StartDate | EndDate
---------------------------------------------
'fakewar1' 01-01-1990 02-02-1995
'examplewar' 05-01-1990 03-07-1998
(...)
'examplewar2' 05-07-1999 06-09-2002
I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:
Year | Number_of_wars
----------------------------
1989 0
1990 2
1991 2
1992 3
1994 2
Usually I would use something like df.groupby('year').count() to get total wars by year, but since I am currently working with date ranges instead of single dates, that approach won't work.
I am currently writing a function that generates a list of years, then for each year in the list checks each row in the dataframe against a helper that returns True if the year falls within that row's date range.
years = range(1816, 2006)
year_dict = {}
for year in years:
    for index, row in df.iterrows():
        in_range = year_in_range(year, row)  # my helper; returns True if year is in the row's range
        if in_range:
            year_dict[year] = year_dict.get(year, 0) + 1
This works, but it also seems extremely convoluted. So I was wondering, what am I missing? What would be the canonical 'pandas-way' to solve this issue?
Use a comprehension with pd.value_counts. Note that freq='Y' generates year-end (December 31) timestamps, so a war counts toward a year only if it spans that year's end; a final partial year is excluded.
pd.value_counts([
d.year for s, e in zip(df.StartDate, df.EndDate)
for d in pd.date_range(s, e, freq='Y')
]).sort_index()
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
Alternate
from functools import reduce
def r(t):
    return pd.date_range(t.StartDate, t.EndDate, freq='Y')
pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()
Setup
df = pd.DataFrame(dict(
WarName=['fakewar1', 'examplewar', 'feuxwar2'],
StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
), columns=['WarName', 'StartDate', 'EndDate'])
df
WarName StartDate EndDate
0 fakewar1 1990-01-01 1995-02-02
1 examplewar 1990-05-01 1998-03-07
2 feuxwar2 1999-05-07 2002-06-09
Using np.unique (this also assumes import numpy as np):
x, y = np.unique(sum([list(range(s.year, e.year))
                      for s, e in zip(df.StartDate, df.EndDate)], []),
                 return_counts=True)
pd.Series(dict(zip(x, y)))
Out[222]:
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
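If the end year should also be counted (e.g. a war ending in 1995 contributing to 1995), a sketch extending each range by one year; note the totals then differ from the output above:
# Inclusive end year: range(start, end + 1)
x, y = np.unique(sum([list(range(s.year, e.year + 1))
                      for s, e in zip(df.StartDate, df.EndDate)], []),
                 return_counts=True)
pd.Series(dict(zip(x, y)))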
The other answers with pandas are far preferable, but the native Python answer you showed didn't have to be so convoluted; just instantiate and directly index into an array:
wars = [0] * 191   # max(df['EndDate']).year - min(df['StartDate']).year + 1
yr_offset = 1816   # min(df['StartDate']).year
for _, row in df.iterrows():
    for yr in range(row['StartDate'].year - yr_offset, row['EndDate'].year - yr_offset):  # or +1 to include the end year
        wars[yr] += 1
I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing the maximum count of country in every year.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way is to use groupby together with size to count each Year/Country pair, then sort the counts descending and slice by the number of distinct years. You can try the following:
num_year = df['Year'].nunique()
new_df = (df.groupby(['Year', 'Country']).size()
            .rename('Count')
            .sort_values(ascending=False)
            .reset_index()[:num_year])
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
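One caveat: slicing with [:num_year] assumes each year's top row lands within the first num_year positions, which can fail if one year's counts dominate the sorted list. A sketch of a safer variant that keeps the first (largest) row per year explicitly:
# Sort counts descending, then keep only the first (max) row for each Year.
new_df = (df.groupby(['Year', 'Country']).size()
            .rename('Count')
            .sort_values(ascending=False)
            .reset_index()
            .drop_duplicates('Year')
            .reset_index(drop=True))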
Use:
1.
First get the count of each Year and Country pair with groupby and size. Then get the index of the max value per year with idxmax and select those rows with loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use a custom function with value_counts and head:
df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year','Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Just providing a method without groupby:
Count = (pd.Series(list(zip(df.Year, df.Country)))
           .value_counts()
           .head(2)
           .reset_index(name='Count'))
Count[['Year','Country']] = Count['index'].apply(pd.Series)  # split the (Year, Country) tuples
Count.drop('index', axis=1)
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India
I have a pandas DataFrame in which I would like to fill in some NaN values.
import pandas as pd
tuples = [('a', 1990),('a', 1994),('a',1996),('b',1992),('b',1997),('c',2001)]
index = pd.MultiIndex.from_tuples(tuples, names = ['Type', 'Year'])
vals = ['NaN','NaN','SomeName','NaN','SomeOtherName','SomeThirdName']
df = pd.DataFrame(vals, index=index)
print(df)
0
Type Year
a 1990 NaN
1994 NaN
1996 SomeName
b 1992 NaN
1997 SomeOtherName
c 2001 SomeThirdName
The output that I would like is:
Type Year
a 1990 SomeName
1994 SomeName
1996 SomeName
b 1992 SomeOtherName
1997 SomeOtherName
c 2001 SomeThirdName
This needs to be done on a much larger DataFrame (millions of rows) where each 'Type' can have between 1-5 unique 'Years' and the name value is only present for the most recent year. I'm trying to avoid iterating over rows for performance purposes.
You can sort your data frame by index in descending order and then ffill it:
import pandas as pd
df.sort_index(level = [0,1], ascending = False).ffill()
# 0
# Type Year
# c 2001 SomeThirdName
# b 1997 SomeOtherName
# 1992 SomeOtherName
# a 1996 SomeName
# 1994 SomeName
# 1990 SomeName
Note: the example data doesn't actually contain np.nan values, but rather the string 'NaN', so for ffill to work you first need to replace the 'NaN' strings with np.nan:
import numpy as np
df[0] = np.where(df[0] == "NaN", np.nan, df[0])
Or, as @ayhan suggested, after replacing the string "NaN" with np.nan, use df.bfill().
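A minimal sketch of that suggestion, assuming (as the question states) that the name always sits on the most recent Year of each Type, so back-filling in the existing ascending order cannot pull a name across Type groups:
import numpy as np

# Replace the string placeholders with real missing values, then back-fill:
# each NaN takes the next value below it, which under the stated assumption
# is always the name of its own Type, and the original order is preserved.
df[0] = np.where(df[0] == 'NaN', np.nan, df[0])
print(df.bfill())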