I already got answer to this question in R, wondering how this can be implemented in Python.
Let's say we have a pandas DataFrame like this:
import pandas as pd
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
which displays like this:
2019Q1 2019Q2 2019Q3
0 1 2 3
How can I transform it so that it looks like this:
Year Quarter Value
2019 1 1
2019 2 2
2019 3 3
Use Series.str.split with expand=True to create a MultiIndex in the columns, then reshape with DataFrame.unstack, and finally clean up with Series.rename_axis and Series.reset_index:
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
d.columns = d.columns.str.split('Q', expand=True)
df = (d.unstack(0)
       .reset_index(level=2, drop=True)
       .rename_axis(('Year','Quarter'))
       .reset_index(name='Value'))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Thank you @Jon Clements for another solution:
df = (d.melt()
        .variable
        .str.extract(r'(?P<Year>\d{4})Q(?P<Quarter>\d)')
        .assign(Value=d.T.values.flatten()))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Alternative with split:
df = (d.melt()
        .variable
        .str.split('Q', expand=True)
        .rename(columns={0:'Year', 1:'Quarter'})
        .assign(Value=d.T.values.flatten()))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Using DataFrame.stack with DataFrame.pop and Series.str.split:
df = d.stack().reset_index(level=1).rename(columns={0:'Value'})
df[['Year', 'Quarter']] = df.pop('level_1').str.split('Q', expand=True)
Value Year Quarter
0 1 2019 1
0 2 2019 2
0 3 2019 3
If you care about the order of columns, use reindex:
df = df.reindex(['Year', 'Quarter', 'Value'], axis=1)
Year Quarter Value
0 2019 1 1
0 2019 2 2
0 2019 3 3
Related
I have a df similar to the one below. I need to select the rows where df['Year 2'] is equal to or closest to df['Year'] within subsets grouped by df['ID'], so in this example rows 1, 2 and 5.
df
Year ID A Year 2 C
0 2020 12 0 2019 0
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0
4 2019 6 0 2017 0
5 2019 6 1 2018 0 <-
I am trying to achieve that with the following piece of code, using groupby and passing a function to get the row with the closest value across both columns.
df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))
This particular line raises TypeError: 'int' object is not callable. Any ideas on how to fix this line of code, or a fresh approach to the problem, would be appreciated.
TYIA.
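As a side note on the error itself: min(df['Year 2'], key=lambda x: abs(x - df['Year'].min())) is evaluated first and returns a plain int, which .apply() then tries to call, hence 'int' object is not callable. A minimal sketch of a working apply-based variant, kept close to the original attempt (the answers below give a cleaner vectorised approach; names match the example data above):
# pick, within each ID group, the row whose 'Year 2' is closest to 'Year'
df1 = (df.groupby('ID', group_keys=False)
         .apply(lambda g: g.loc[[(g['Year 2'] - g['Year']).abs().idxmin()]]))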
You can subtract the columns with Series.sub, take the absolute value, and get the index of the per-group minimum with GroupBy.idxmin:
idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
If you need a new boolean column, use Index.isin:
df['new'] = df.index.isin(idx)
print (df)
Year ID A Year 2 C new
0 2020 12 0 2019 0 False
1 2020 12 0 2020 0 True
2 2017 10 1 2017 0 True
3 2017 10 0 2018 0 False
4 2019 6 0 2017 0 False
5 2019 6 1 2018 0 True
If you need to filter the rows instead, use DataFrame.loc:
df1 = df.loc[idx]
print (df1)
Year ID A Year 2 C
5 2019 6 1 2018 0
2 2017 10 1 2017 0
1 2020 12 0 2020 0
One-line solution:
df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
You could get the idxmin per group:
idx = (df['Year 2']-df['Year']).abs().groupby(df['ID']).idxmin()
# assignment for test
df.loc[idx, 'D'] = '<-'
for selection only:
df2 = df.loc[idx]
output:
Year ID A Year 2 C D
0 2020 12 0 2019 0 NaN
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0 NaN
4 2019 6 0 2017 0 NaN
5 2019 6 1 2018 0 <-
Note that there is a difference between:
df.loc[df.index.isin(idx)]
which gets all the min rows
and:
df.loc[idx]
which gets the first match
I have years of transaction data which I am working with by customer id. The transaction information is at invoice level, and an id could easily have multiple invoices on the same day or no invoices for years. I am attempting to create dataframes which contain sums of invoices by customer for each year, but also show the years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key = 'invoice_date', freq = 'Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
Let's suppose the original values are defined in a dataframe named df; then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
            .unstack(fill_value=0)
            .stack()
            .reset_index(name='val'))
Otherwise, first create the column invoice_year:
df['invoice_year'] = df['invoice_date'].dt.year
and repeat the same code. This outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as example:
df = pd.DataFrame({'id': [1]*2 + [2]*2 + [3]*2,
                   'invoice_date': pd.to_datetime(['2018-12-01', '2019-12-01', '2020-12-01']*2,
                                                  infer_datetime_format=True),
                   'val': [1]*6})
Stefan has posted a comment that may help: simply passing dropna=False to your .groupby seems like the best bet. But you could also take the approach of bringing the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter. Starting from your current output (a short sketch follows the table):
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
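A minimal sketch of the bring-the-NaNs-back idea, using hypothetical toy data that mirrors the table above (column names and values are assumptions): round-tripping the grouped sums through unstack/stack reintroduces the missing (id, year) pairs as NaN.
import pandas as pd

# hypothetical stand-in for the invoice-level data (names and values are assumptions)
invoices = pd.DataFrame({
    'id':           [1, 1, 1, 2, 2, 3],
    'invoice_year': [2018, 2019, 2020, 2018, 2020, 2020],
    'sales':        [483982.20, 3453, 453533, 243, 23423, 2330202],
})

sums = invoices.groupby(['id', 'invoice_year'])['sales'].sum()
# unstack builds the full id x year grid (missing pairs become NaN);
# stack(dropna=False) keeps those NaN rows instead of dropping them
full = sums.unstack('invoice_year').stack(dropna=False).reset_index(name='sales')
print(full)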
You can use pd.MultiIndex.from_product and reindex the dataframe from a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max()+1),
                                  range(iy.min(), iy.max()+1)],
                                 names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0
Suppose I have a dataframe as below:
year month message
0 2018 2 txt1
1 2017 4 txt2
2 2019 5 txt3
3 2017 5 txt5
4 2017 5 txt4
5 2020 4 txt3
6 2020 6 txt3
7 2020 6 txt3
8 2020 6 txt4
I want to find the top three message counts for each year, so I grouped the data as below:
df.groupby(['year','month']).count()
which results in:
message
year month
2017 4 1
5 2
2018 2 1
2019 5 1
2020 4 1
6 3
The data is in ascending order for both index levels. But how do I get the results shown below, where the data is sorted by year (ascending) and count (descending) for the top n values? The order of the 'month' index does not matter.
message
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
value_counts sorts the counts in descending order by default:
df.groupby('year')['month'].value_counts()
Output:
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
Name: month, dtype: int64
If you want only the top 2 values for each year, do another groupby:
(df.groupby('year')['month'].value_counts()
   .groupby('year').head(2)
)
Output:
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
Name: month, dtype: int64
This will sort by year (ascending) and count (descending).
df = df.groupby(['year', 'month']).count().sort_values(['year', 'message'], ascending=[True, False])
You can use sort_index, specifying ascending=[True,False] so that only the second level is sorted in descending order:
df = df.groupby(['year','month']).count().sort_index(ascending=[True,False])
message
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
Here you go:
df.groupby(['year', 'month']).count().sort_values(axis=0, ascending=False, by='message').sort_values(axis=0, ascending=True, by='year')
You can use this code for it:
df.groupby(['year', 'month']).count().sort_index(axis=0, ascending=False).sort_values(by="year", ascending=True)
Following up on my previous question here:
import pandas as pd
d = pd.DataFrame({'value':['a', 'b'],'2019Q1':[1, 5], '2019Q2':[2, 6], '2019Q3':[3, 7]})
which displays like this:
value 2019Q1 2019Q2 2019Q3
0 a 1 2 3
1 b 5 6 7
How can I transform it into this shape:
Year measure Quarter Value
2019 a 1 1
2019 a 2 2
2019 a 3 3
2019 b 1 5
2019 b 2 6
2019 b 3 7
Use pd.wide_to_long with DataFrame.melt:
df2 = d.copy()
df2.columns = d.columns.str.split('Q').str[::-1].str.join('_')
new_df = (pd.wide_to_long(df2.rename(columns={'value':'Measure'}),
                          ['1','2','3'],
                          j='Year',
                          i='Measure',
                          sep='_')
            .reset_index()
            .melt(['Measure','Year'], var_name='Quarter', value_name='Value')
            .loc[:, ['Year','Measure','Quarter','Value']]
            .sort_values(['Year','Measure','Quarter']))
print(new_df)
Year Measure Quarter Value
0 2019 a 1 1
2 2019 a 2 2
4 2019 a 3 3
1 2019 b 1 5
3 2019 b 2 6
5 2019 b 3 7
This is just an addition for future visitors: when you split the columns with expand=True, you get a MultiIndex, which allows reshaping with the stack method.
#set value column as index
d = d.set_index('value')
#split columns and convert to multiindex
d.columns = d.columns.str.split('Q',expand=True)
#reshape dataframe
d.stack([0,1]).rename_axis(['measure','year','quarter']).reset_index(name='Value')
measure year quarter Value
0 a 2019 1 1
1 a 2019 2 2
2 a 2019 3 3
3 b 2019 1 5
4 b 2019 2 6
5 b 2019 3 7
How do I compare/merge two data frames based on the start and data columns and get the missing gaps with their counts?
Dataframe 1
id start
1 2009
1 2010
1 2011
1 2012
2 2010
2 2011
2 2012
2 2013
2 2014
Dataframe 2
id data
1 2010
1 2012
2 2010
2 2011
2 2012
Expected Output:
id first last size
1 2009 2009 1
1 2011 2011 1
2 2013 2014 2
How can I achieve this?
Use merge with indicator=True and an outer join first:
df11 = df1.rename(columns={'start':'data'})
df = df2.merge(df11, how='outer', indicator=True, on=['id','data']).sort_values(['id','data'])
print (df)
id data _merge
5 1 2009 right_only
0 1 2010 both
6 1 2011 right_only
1 1 2012 both
2 2 2010 both
3 2 2011 both
4 2 2012 both
7 2 2013 right_only
8 2 2014 right_only
And then use the old solution, only changing the condition:
#boolean mask checking for non-right_only rows, stored in a variable for reuse
m = (df['_merge'] != 'right_only').rename('g')
#create an index by cumulative sum to form unique groups of consecutive right_only rows
df.index = m.cumsum()
print (df)
id data _merge
g
0 1 2009 right_only
1 1 2010 both
1 1 2011 right_only
2 1 2012 both
3 2 2010 both
4 2 2011 both
5 2 2012 both
5 2 2013 right_only
5 2 2014 right_only
#filter only the right_only rows and aggregate first, last and size
df2 = (df[~m.values].groupby(['id', 'g'])['data']
         .agg(['first','last','size'])
         .reset_index(level=1, drop=True)
         .reset_index())
print (df2)
id first last size
0 1 2009 2009 1
1 1 2011 2011 1
2 2 2013 2014 2
I answered a similar question for you yesterday. I don't know where you are getting the first and last columns from, but here is a way to find the missing years based on the example above:
from functools import reduce

df1_year = pd.DataFrame(df1.groupby('id')['start'].apply(list))
df2_year = pd.DataFrame(df2.groupby('id')['data'].apply(list))
dfs = [df1_year,df2_year]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final.reset_index(inplace=True)
def noMatch(a, b):
    return [x for x in a if x not in b]
df3 = []
for i in range(0, len(df_final)):
    df3.append(noMatch(df_final['start'][i], df_final['data'][i]))
missing_year = pd.DataFrame(df3)
missing_year['missingYear'] = missing_year.values.tolist()
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id','missingYear']]
df4 = []
for i in range(0, len(df_concat)):
    df4.append(df_concat.applymap(lambda x: x[i] if isinstance(x, list) else x))
df_final1 = reduce(lambda left,right: pd.merge(left,right,on='id'), df4)
(pd.concat([df_final1[['id', 'missingYear_x']],
            df_final1[['id', 'missingYear_y']].rename(columns={'missingYear_y': 'missingYear_x'})])
   .rename(columns={'missingYear_x': 'missingYear'})
   .sort_index())
id missingYear
0 1 2009
0 1 2011
1 2 2013
1 2 2014
To add it to df2 per your comment, just append the data; a minimal sketch follows.
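A hedged sketch of that append, assuming the final result above is stored in a frame called missing with columns id and missingYear (both names are assumptions):
# rename so the missing years line up with df2's 'data' column, then append
missing = missing.rename(columns={'missingYear': 'data'})
df2 = (pd.concat([df2, missing], ignore_index=True)
         .sort_values(['id', 'data'])
         .reset_index(drop=True))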