Say we have this data:
list1, list2, list3 = [1,2,3,4], [1990, 1990, 1990, 1991], [2009, 2009, 2009, 2009]
df = pd.DataFrame(list(zip(list1, list2, list3)), columns = ['Index', 'Y0', 'Y1'])
> df
Index Y0 Y1
1 1990 2009
2 1990 2009
3 1990 2009
4 1991 2009
I want to count, for each year, how many rows fall within that year's interval, excluding the Y0 year itself (i.e. rows where Y0 < year <= Y1).
So say we start at the first available year, 1990:
How many rows do we count? 0.
1991:
Three (rows 1, 2, 3)
1992:
Four (rows 1, 2, 3, 4)
...
2009:
Four (rows 1, 2, 3, 4)
So I want to end up with a dataframe that says:
Count Year
0 1990
3 1991
4 1992
... ...
4 2009
My attempt:
df['Y0'] = pd.to_datetime(df['Y0'], format='%Y')
df['Y1'] = pd.to_datetime(df['Y1'], format='%Y')
# Group by the interval between Y0 and Y1
df = df.groupby([df['Y0'].dt.year, df['Y1'].dt.year]).agg({'count'})
df.columns = ['count', 'Y0 count', 'Y1 count']
# sum the total
df_sum = pd.DataFrame(df.groupby(df.index)['count'].sum())
But the result doesn't look right.
Appreciate any help.
You could do the following (note this works on the original integer year columns, before any datetime conversion, and needs NumPy):
import numpy as np

min_year = df[['Y0', 'Y1']].values.min()
max_year = df[['Y0', 'Y1']].values.max()
year_range = np.arange(min_year, max_year + 1)
# Broadcast each row's (Y0, Y1] interval against the full year range and count matches per year
counts = ((df[['Y0']].values < year_range) & (year_range <= df[['Y1']].values)).sum(axis=0)
o = pd.DataFrame({"counts": counts, 'year': year_range})
counts year
0 0 1990
1 3 1991
2 4 1992
3 4 1993
4 4 1994
5 4 1995
6 4 1996
7 4 1997
8 4 1998
9 4 1999
10 4 2000
11 4 2001
12 4 2002
13 4 2003
14 4 2004
15 4 2005
16 4 2006
17 4 2007
18 4 2008
19 4 2009
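For clarity, the key step is the broadcast in the counts line: df[['Y0']].values has shape (4, 1) and year_range has shape (20,), so the comparison produces a (4, 20) boolean array, and summing over axis=0 counts rows per year. A quick way to convince yourself (a small check on the same variables):
mask = (df[['Y0']].values < year_range) & (year_range <= df[['Y1']].values)
print(mask.shape)        # (4, 20): one row per dataframe row, one column per year
print(mask.sum(axis=0))  # counts per year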
The following should also do the job:
counts = []
years = []

def count_in_interval(year):
    n = 0
    for i in range(len(df)):
        if df['Y0'][i] < year <= df['Y1'][i]:
            n += 1
    return n

for i in range(1990, 2010):
    counts.append(count_in_interval(i))
    years.append(i)

result = pd.DataFrame(list(zip(counts, years)), columns=['Count', 'Year'])
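A more compact variant of the same idea (a sketch, assuming df still holds the plain integer years rather than datetimes) is a list comprehension over the year range, which also avoids hard-coding 1990 and 2010:
years = range(int(df['Y0'].min()), int(df['Y1'].max()) + 1)
result = pd.DataFrame({
    'Count': [((df['Y0'] < y) & (y <= df['Y1'])).sum() for y in years],
    'Year': list(years),
})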
I have a dataframe with 3 levels of MultiIndex columns:
quarter Q1 Q2 Totals
year 2021 2022 2021 2022
qty orders qty orders qty orders qty orders qty orders
month name
January 40 2 5 1 1 2 0 0 46 5
February 20 8 2 3 4 6 0 0 26 17
March 2 10 7 4 3 3 0 0 12 17
Totals 62 20 14 8 8 11 0 0 84 39
After doing a groupby on levels (0, 2), I have the following subtotals dataframe:
quarter Q1 Q2 Totals
qty orders qty orders qty orders
month name
January 45 3 1 2 46 5
February 22 10 4 6 26 16
March 9 14 3 3 12 17
Totals 76 28 8 11 84 39
I need to insert the second into the first, without disturbing the columns, levels or index, so that I get the following dataframe:
quarter Q1 Q2 Totals
year 2021 2022 Subtotal 2021 2022 Subtotal
qty orders qty orders qty orders qty orders qty orders qty orders qty orders
month name
January 40 2 5 1 45 3 1 2 0 0 1 2 46 5
February 20 8 2 3 22 10 4 6 0 0 4 6 26 16
March 2 10 7 4 9 14 3 3 0 0 3 3 12 17
Totals 62 20 14 8 76 28 8 11 0 0 8 11 84 39
How do I do this?
With your initial dataframe (before groupby):
import pandas as pd
df = pd.DataFrame(
[
[40, 2, 5, 1, 1, 2, 0, 0],
[20, 8, 2, 3, 4, 6, 0, 0],
[2, 10, 7, 4, 3, 3, 0, 0],
[62, 20, 14, 8, 8, 11, 0, 0],
],
columns=pd.MultiIndex.from_product(
[("Q1", "Q2"), ("2021", "2022"), ("qty", "orders")]
),
index=["January", "February", "March", "Totals"],
)
Here is one way to do it (using product from the Python standard library's itertools module; a nested for-loop would also work):
from itertools import product

# Add new columns
for level1, level2 in product(["Q1", "Q2"], ["qty", "orders"]):
    df.loc[:, (level1, "subtotal", level2)] = (
        df.loc[:, (level1, "2021", level2)] + df.loc[:, (level1, "2022", level2)]
    )

# Sort columns
df = df.reindex(
    pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022", "subtotal"), ("qty", "orders")]
    ),
    axis=1,
)
Then:
print(df)
# Output
Q1 Q2 \
2021 2022 subtotal 2021 2022
qty orders qty orders qty orders qty orders qty orders
January 40 2 5 1 45 3 1 2 0 0
February 20 8 2 3 22 11 4 6 0 0
March 2 10 7 4 9 14 3 3 0 0
Totals 62 20 14 8 76 28 8 11 0 0
subtotal
qty orders
January 1 2
February 4 6
March 3 3
Totals 8 11
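If you would rather reuse the subtotals you already computed with groupby, a rough sketch of the same result, starting again from the df constructed above (before the loop adds the subtotal columns), could look like this; the transpose keeps the column-wise groupby working the same way across pandas versions:
subtotals = df.T.groupby(level=[0, 2]).sum().T
# Relabel the two-level result as (quarter, "subtotal", metric) columns
subtotals.columns = pd.MultiIndex.from_tuples(
    [(q, "subtotal", m) for q, m in subtotals.columns]
)
out = pd.concat([df, subtotals], axis=1).reindex(
    pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022", "subtotal"), ("qty", "orders")]
    ),
    axis=1,
)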
I am merging two dataframes, but I have the following problems:
If a year or date does not exist in df2, a price and a listing_id are still filled in during the merge, but those should be NaN.
The second problem is that when df2 has multiple rows for the same day, the temperature from df is duplicated onto each of them, for example:
d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
d2 = {'id': [122, 244], 'day': [1, 1],
'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 1 4 440 2001
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 1 1 20 2001 4 440 # <-- The second temperature is wrong :/
This should not happen, because if I later have another day-1 row from the year 2002 with a temperature of 30 and I want to calculate the average, I get (20 + 20 + 30) / 3 = 23.3 instead of the correct (20 + 30) / 2 = 25. Therefore, once a temperature has already been used, the duplicate row should contain NaN instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
1 2 2 40 2002
2 3 3 50 2004
3 4 4 60 2005
4 5 2 20 1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 2 4 440 2002
2 387 3 5 500 2004
3 4454 4 6 6600 2005
4 521 2 7 500 2005
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 40 2002 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 4 440
6 5 2 20 1999 7 500
What I want
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 NaN 2005 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 NaN NaN
IIUC:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
             on=['day', 'year'], how='outer')
id day temperature year listing_id price
0 1.0 1 20.0 2001 2.0 20.0
1 2.0 2 40.0 2002 4.0 440.0
2 3.0 3 50.0 2004 5.0 500.0
3 4.0 4 60.0 2005 6.0 6600.0
4 5.0 2 20.0 1999 NaN NaN
5 NaN 2 NaN 2005 7.0 500.0
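As a quick check of the averaging concern above (a sketch using the same df and df2 from the code snippet): each (day, year) pair from df appears only once after the outer merge, and the rows that exist only in df2 carry NaN temperature, so a per-day mean no longer double-counts any temperature.
df3 = df.merge(df2[['day', 'listing_id', 'price', 'year']],
               on=['day', 'year'], how='outer')
# NaN temperatures (rows present only in df2) are skipped by mean()
print(df3.groupby('day')['temperature'].mean())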
I have a dataframe:
df = pd.DataFrame([[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]], columns=['Amy', 'Bob', 'Carl', 'Chris', 'Ben', 'Other', 'Year'])
Amy Bob Carl Chris Ben Other Year
0 2 4 7 8 1 3 2013
1 9 2 4 5 5 6 2014
And a dictionary:
d = {'A': ['Amy'], 'B': ['Bob', 'Ben'], 'C': ['Carl', 'Chris']}
I would like to reshape my dataframe to look like this:
Group Name Year Value
0 A Amy 2013 2
1 A Amy 2014 9
2 B Bob 2013 4
3 B Bob 2014 2
4 B Ben 2013 1
5 B Ben 2014 5
6 C Carl 2013 7
7 C Carl 2014 4
8 C Chris 2013 8
9 C Chris 2014 5
10 Other 2013 3
11 Other 2014 6
Note that Other doesn't have any values in the Name column and the order of the rows does not matter. I think I should be using the melt function but the examples that I've come across aren't too clear.
melt gets you part way there.
In [29]: m = pd.melt(df, id_vars=['Year'], var_name='Name')
This has everything except Group. To get that, we need to reshape d a bit as well.
In [30]: d2 = {}

In [31]: for k, v in d.items():
   ....:     for item in v:
   ....:         d2[item] = k
   ....:
In [32]: d2
Out[32]: {'Amy': 'A', 'Ben': 'B', 'Bob': 'B', 'Carl': 'C', 'Chris': 'C'}
In [34]: m['Group'] = m['Name'].map(d2)
In [35]: m
Out[35]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 Other 3 NaN
11 2014 Other 6 NaN
[12 rows x 4 columns]
And moving 'Other' from Name to Group
In [8]: mask = m['Name'] == 'Other'
In [9]: m.loc[mask, 'Name'] = ''
In [10]: m.loc[mask, 'Group'] = 'Other'
In [11]: m
Out[11]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 3 Other
11 2014 6 Other
[12 rows x 4 columns]
Pandas melt function:
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.
For example (with a generic dataframe that has a "weekday" column plus one score column per person):
melted = pd.melt(df, id_vars=["weekday"],
                 var_name="Person", value_name="Score")
We use melt to transform wide data to long data.
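Applied to the dataframe and dictionary from this question, a minimal sketch along the same lines as the answer above might be:
m = pd.melt(df, id_vars=['Year'], var_name='Name', value_name='Value')
# Invert d so each name maps to its group; unmapped names (Other) become their own group
name_to_group = {name: group for group, names in d.items() for name in names}
m['Group'] = m['Name'].map(name_to_group).fillna(m['Name'])
m.loc[m['Group'] == m['Name'], 'Name'] = ''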
Consider the following dataframe:
df = pd.read_csv("data.csv")
print(df)
Category Year Month Count1 Count2
0 a 2017 December 5 9
1 a 2018 January 3 5
2 b 2017 October 7 6
3 b 2017 November 4 1
4 b 2018 March 3 3
I want to achieve this:
Category Year Month Count1 Count2
0 a 2017 October
1 a 2017 November
2 a 2017 December 5 9
3 a 2018 January 3 5
4 a 2018 February
5 a 2018 March
6 b 2017 October 7 6
7 b 2017 November 4 1
8 b 2017 December
9 b 2018 January
10 b 2018 February
11 b 2018 March 3 3
Here is what I've done so far:
months = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6, "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12}
df["Date"] = pd.to_datetime(10000 * df["Year"] + 100 * df["Month"].apply(months.get) + 1, format="%Y%m%d")
date_min = df["Date"].min()
date_max = df["Date"].max()
new_index = pd.MultiIndex.from_product([df["Category"].unique(), pd.date_range(date_min, date_max, freq="M")], names=["Category", "Date"])
df = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month_name()
df = df[["Category", "Year", "Month", "Count1", "Count2"]]
In the resulting dataframe the last month (March) is missing and all "Count1" and "Count2" values are NaN.
This is complicated by the fact that you want to fill the category as well as the missing dates. One solution is to create a separate data frame for each category and then concatenate them all together.
df['Date'] = pd.to_datetime('1 ' + df.Month.astype(str) + ' ' + df.Year.astype(str))
df_ix = pd.Series(1, index=df.Date.sort_values()).resample('MS').first().reset_index()
df_list = []
for cat in df.Category.unique():
    df_temp = (df.query('Category == @cat')
                 .merge(df_ix, on='Date', how='right')
                 .get(['Date', 'Category', 'Count1', 'Count2'])
                 .sort_values('Date')
               )
    df_temp.Category = cat
    df_temp = df_temp.fillna(0)
    df_temp.loc[:, ['Count1', 'Count2']] = df_temp.get(['Count1', 'Count2']).astype(int)
    df_list.append(df_temp)
df2 = pd.concat(df_list, ignore_index=True)
df2['Month'] = df2.Date.apply(lambda x: x.strftime('%B'))
df2['Year'] = df2.Date.apply(lambda x: x.year)
df2.drop('Date', axis=1)
# returns:
Category Count1 Count2 Month Year
0 a 0 0 October 2017
1 a 0 0 November 2017
2 a 5 9 December 2017
3 a 3 5 January 2018
4 a 0 0 February 2018
5 a 0 0 March 2018
6 b 7 6 October 2017
7 b 4 1 November 2017
8 b 0 0 December 2017
9 b 0 0 January 2018
10 b 0 0 February 2018
11 b 3 3 March 2018
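For what it's worth, the reindex approach from the question appears to fail only because freq="M" generates month-end dates while the constructed Date values fall on the first of the month; a sketch of that fix (reusing the months mapping and column names from the question) would be:
df["Date"] = pd.to_datetime(10000 * df["Year"] + 100 * df["Month"].apply(months.get) + 1,
                            format="%Y%m%d")
new_index = pd.MultiIndex.from_product(
    [df["Category"].unique(),
     pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")],  # MS = month starts
    names=["Category", "Date"],
)
out = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
out["Year"] = out["Date"].dt.year
out["Month"] = out["Date"].dt.month_name()
out = out[["Category", "Year", "Month", "Count1", "Count2"]]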
So I used a groupby on a pandas dataframe, which looks like this:
df.groupby(['year','month'])['AMT'].agg('sum')
And I get something like this:
year month
2003 1 114.00
2 9195.00
3 300.00
5 200.00
6 450.00
7 68.00
8 750.00
9 3521.00
10 250.00
11 799.00
12 1000.00
2004 1 8551.00
2 9998.00
3 17334.00
4 2525.00
5 16014.00
6 9132.00
7 10623.00
8 7538.00
9 3650.00
10 7733.00
11 10128.00
12 4741.00
2005 1 6965.00
2 3208.00
3 8630.00
4 7776.00
5 11950.00
6 11717.00
7 1510.00
...
2015 7 1431441.00
8 966974.00
9 1121650.00
10 1200104.00
11 1312191.90
12 482535.00
2016 1 1337343.00
2 1465068.00
3 1170113.00
4 1121691.00
5 1302936.00
6 1518047.00
7 1251844.00
8 825215.00
9 1491626.00
10 1243877.00
11 1632252.00
12 750995.50
2017 1 905974.00
2 1330182.00
3 1382628.52
4 1146789.00
5 1201425.00
6 1278701.00
7 1172596.00
8 1517116.50
9 1108609.00
10 1360841.00
11 1340386.00
12 860686.00
What I want is to select, from the summed AMT column, the maximum for each year, so that the final dataframe has only one row per year, something like:
year month
2003 2 9195.00
2004 3 17334.00
2005 5 11950.00
... and so on
What do I have to add to my groupby aggregation to do this?
I think you need DataFrameGroupBy.idxmax:
s = df.groupby(['year', 'month'])['AMT'].sum()
out = s.loc[s.groupby(level=0).idxmax()]
print(out)
year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
If multiple max values per year are possible:
out = s[s == s.groupby(level=0).transform('max')]
print (out)
year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
You can use GroupBy + transform with max. Note this gives multiple maximums for any years where a tie exists. This may or may not be what you require.
As you have requested, it's possible to do this in 2 steps, first summing and then calculating maximums by year.
df = pd.DataFrame({'year': [2003, 2003, 2003, 2004, 2004, 2004],
'month': [1, 2, 2, 1, 1, 2],
'AMT': [100, 200, 100, 100, 300, 100]})
# STEP 1: sum by year + month
df2 = df.groupby(['year', 'month']).sum().reset_index()
# STEP 2: filter for max by year
res = df2[df2['AMT'] == df2.groupby(['year'])['AMT'].transform('max')]
print(res)
year month AMT
1 2003 2 300
2 2004 1 400
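Another common pattern for step 2 (a sketch on the summed frame df2 from above): sort by AMT and keep the last row per year, which returns exactly one row per year even when there are ties.
res = (df2.sort_values('AMT')
          .drop_duplicates('year', keep='last')
          .sort_values('year'))
print(res)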