Pandas pivot table - Python

When using pd.pivot_table on a dataset, I noticed that it only creates groups for the levels that actually occur within each parent group, not for all possible levels. For example, on a dataset like this:
YEAR CLASS
0 2013 A
1 2013 A
2 2013 B
3 2013 B
4 2013 B
5 2013 C
6 2013 C
7 2013 D
8 2014 A
9 2014 A
10 2014 A
11 2014 B
12 2014 B
13 2014 B
14 2014 C
15 2014 C
There is no level D for year 2014, so the pivot table looks like this:
pd.pivot_table(d, index=["YEAR", "CLASS"], values=["YEAR"], aggfunc=[len], fill_value=0)
len
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
What I want is a separate group for D in 2014 with length 0 in my pivot table. How can I include all possible levels of the child variable for each level of the parent variable?

I think you can use crosstab and stack:
print(pd.pivot_table(df,
                     index=["YEAR", "CLASS"],
                     values=["YEAR"],
                     aggfunc=[len],
                     fill_value=0))
len
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
print(pd.crosstab(df['YEAR'], df['CLASS']))
CLASS A B C D
YEAR
2013 2 3 2 1
2014 3 3 2 0
df = pd.crosstab(df['YEAR'], df['CLASS']).stack()
df.name = 'len'
print(df)
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
D 0
Name: len, dtype: int64
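As a side note (my addition, not part of the original answer): in newer pandas you can get the same effect by making CLASS categorical, since groupby then emits every category, including the unobserved (2014, D) pair. A minimal sketch, assuming pandas 0.23+ for the observed keyword:
import pandas as pd

df = pd.DataFrame({
    'YEAR': [2013] * 8 + [2014] * 8,
    'CLASS': list('AABBBCCD') + list('AAABBBCC'),
})
# A categorical column makes groupby produce all category levels,
# so (2014, 'D') shows up with a count of 0.
df['CLASS'] = df['CLASS'].astype('category')
print(df.groupby(['YEAR', 'CLASS'], observed=False).size())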

Related

Doing joins between 2 csv files [duplicate]

For df2, which only has data for the year 2019:
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
df1 has data for multiple years:
type year value
0 a 2015 12
1 a 2016 2
2 a 2019 3
3 b 2018 50
4 b 2019 10
5 c 2017 1
6 c 2016 5
7 c 2019 8
I need to concatenate them, replacing df1's 2019 values with df2's values for the same year.
The expected result looks like this:
type date value
0 a 2015 12
1 a 2016 2
2 b 2018 50
3 c 2017 1
4 c 2016 5
5 a 2019 13
6 b 2019 5
7 c 2019 5
8 d 2019 20
The result from pd.concat([df1, df2], ignore_index=True, sort=False) clearly has multiple 2019 values for a single type. How should I improve the code? Thank you.
type date value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
4 a 2015 12
5 a 2016 2
6 a 2019 3
7 b 2018 50
8 b 2019 10
9 c 2017 1
10 c 2016 5
11 c 2019 8
Add DataFrame.drop_duplicates after the concat to keep the last row per type and date.
This solution works if the type and date pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'date'], keep='last'))
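For future readers, a minimal runnable sketch with the sample frames above (note the inputs name the column year, so the drop key uses 'year' here):
import pandas as pd

df1 = pd.DataFrame({'type': list('aaabbccc'),
                    'year': [2015, 2016, 2019, 2018, 2019, 2017, 2016, 2019],
                    'value': [12, 2, 3, 50, 10, 1, 5, 8]})
df2 = pd.DataFrame({'type': list('abcd'),
                    'year': [2019] * 4,
                    'value': [13, 5, 5, 20]})

# df2's rows come last, so keep='last' lets them win any (type, year) collision
out = (pd.concat([df1, df2], ignore_index=True, sort=False)
         .drop_duplicates(['type', 'year'], keep='last'))
print(out)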

create new column with values from another column based on condition

I have a dataframe:
A B Value FY
1 5 a 2020
2 6 b 2020
3 7 c 2021
4 8 d 2021
I want to create a column 'prev_FY' that takes the 'Value' from the previous fiscal year and populates it in the current year's row;
my desired output is:
A B Value FY prev_FY
1 5 a 2020
2 6 b 2020
3 7 c 2021 a
4 8 d 2021 b
I tried using pivot_table but it does not work, as the values stay aligned with their own FY. The shift function is not feasible as I have millions of rows.
Use:
# position of each row within its fiscal year
df['g'] = df.groupby('FY').cumcount()
# helper frame advanced one year so last year's values align with this year's rows
df2 = df[['FY', 'Value', 'g']].assign(FY=df['FY'].add(1))
df = df.merge(df2, on=['FY', 'g'], how='left', suffixes=('', '_prev')).drop('g', axis=1)
print(df)
A B Value FY Value_prev
0 1 5 a 2020 NaN
1 2 6 b 2020 NaN
2 3 7 c 2021 a
3 4 8 d 2021 b
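For completeness, a runnable sketch of the same approach with the sample frame, plus a rename to the asked-for column name (the rename is my addition, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'Value': list('abcd'),
                   'FY': [2020, 2020, 2021, 2021]})

# match rows by their position within each fiscal year
df['g'] = df.groupby('FY').cumcount()
prev = df[['FY', 'Value', 'g']].assign(FY=df['FY'] + 1)
df = (df.merge(prev, on=['FY', 'g'], how='left', suffixes=('', '_prev'))
        .drop('g', axis=1)
        .rename(columns={'Value_prev': 'prev_FY'}))
print(df)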

Converting columns to rows in Pandas with a secondary index value

Following up on my previous question:
import pandas as pd
d = pd.DataFrame({'value':['a', 'b'],'2019Q1':[1, 5], '2019Q2':[2, 6], '2019Q3':[3, 7]})
which displays like this:
value 2019Q1 2019Q2 2019Q3
0 a 1 2 3
1 b 5 6 7
How can I transform it into this shape:
Year measure Quarter Value
2019 a 1 1
2019 a 2 2
2019 a 3 3
2019 b 1 5
2019 b 2 6
2019 b 3 7
Use pd.wide_to_long with DataFrame.melt:
df2 = df.copy()
df2.columns = df.columns.str.split('Q').str[::-1].str.join('_')
new_df = (pd.wide_to_long(df2.rename(columns={'value': 'Measure'}),
                          ['1', '2', '3'],
                          j='Year',
                          i='Measure',
                          sep='_')
            .reset_index()
            .melt(['Measure', 'Year'], var_name='Quarter', value_name='Value')
            .loc[:, ['Year', 'Measure', 'Quarter', 'Value']]
            .sort_values(['Year', 'Measure', 'Quarter']))
print(new_df)
Year Measure Quarter Value
0 2019 a 1 1
2 2019 a 2 2
4 2019 a 3 3
1 2019 b 1 5
3 2019 b 2 6
5 2019 b 3 7
This is just an addition for future visitors: when you split columns with expand=True, you get a MultiIndex, which allows reshaping with the stack method.
# set the value column as the index
d = d.set_index('value')
# split the column labels on 'Q' into a (year, quarter) MultiIndex
d.columns = d.columns.str.split('Q', expand=True)
# stack both column levels into the index, then reset back to columns
d.stack([0, 1]).rename_axis(['measure', 'year', 'quarter']).reset_index(name='Value')
measure year quarter Value
0 a 2019 1 1
1 a 2019 2 2
2 a 2019 3 3
3 b 2019 1 5
4 b 2019 2 6
5 b 2019 3 7
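One more sketch for future visitors (my addition, assuming the column labels always follow the YYYYQn pattern): a plain melt followed by a string split reaches the same shape.
import pandas as pd

d = pd.DataFrame({'value': ['a', 'b'],
                  '2019Q1': [1, 5], '2019Q2': [2, 6], '2019Q3': [3, 7]})

# melt to long form, then split the '2019Q1'-style label into Year and Quarter
long_d = d.melt(id_vars='value', var_name='period', value_name='Value')
long_d[['Year', 'Quarter']] = long_d['period'].str.split('Q', expand=True)
out = (long_d.rename(columns={'value': 'Measure'})
             .loc[:, ['Year', 'Measure', 'Quarter', 'Value']]
             .sort_values(['Year', 'Measure', 'Quarter'])
             .reset_index(drop=True))
print(out)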

pandas dataframe join with where restriction

I have 2 pandas DataFrames looking like this:
ranks:
year name rank
2015 A 1
2015 B 2
2015 C 3
2014 A 4
2014 B 5
2014 C 6
and tourneys:
date name
20150506 A
20150708 B
20150910 C
20141212 A
20141111 B
20141010 C
I want to join these two DataFrames on the name column; however, as you can see, the names are not unique and repeat each year. Hence, a restriction for the join is that the year in ranks must match the first four characters of date in tourneys.
The result should look like this:
date name_t year name_r rank
20150506 A 2015 A 1
20150708 B 2015 B 2
20150910 C 2015 C 3
20141212 A 2014 A 4
20141111 B 2014 B 5
20141010 C 2014 C 6
I am aware of the join method in pandas; however, I would also need to restrict the join with some sort of WHERE ranks.year == tourneys.date[:4].
Create a new date4 column for df2 and then merge df1 and df2:
In [103]: df2['date4'] = (df2['date']/10000).astype(int)
Now, merge df1 and df2 on ['year', 'name'] and ['date4', 'name'] combinations.
In [104]: df1.merge(df2, left_on=['year', 'name'], right_on=['date4', 'name'])
Out[104]:
year name rank date date4
0 2015 A 1 20150506 2015
1 2015 B 2 20150708 2015
2 2015 C 3 20150910 2015
3 2014 A 4 20141212 2014
4 2014 B 5 20141111 2014
5 2014 C 6 20141010 2014
where df1 and df2 look like:
In [105]: df1
Out[105]:
year name rank
0 2015 A 1
1 2015 B 2
2 2015 C 3
3 2014 A 4
4 2014 B 5
5 2014 C 6
In [106]: df2
Out[106]:
date name date4
0 20150506 A 2015
1 20150708 B 2015
2 20150910 C 2015
3 20141212 A 2014
4 20141111 B 2014
5 20141010 C 2014
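As a side note (my addition, assuming date might instead be stored as strings rather than integers), the same key can be built by slicing:
# if 'date' is a string column, take the first four characters as the year
df2['date4'] = df2['date'].str[:4].astype(int)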
