Unique IDs for sorted groups of indices - python

I have something like the following DataFrame where I have data points at 2 locations in 4 seasons in 2 years.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(index=pd.MultiIndex.from_product([[1, 2, 3, 4], [2011, 2012], ['A', 'B']], names=['Season', 'Year', 'Location']))
>>> df['Value'] = np.random.randint(1, 100, len(df))
>>> df
                      Value
Season Year Location
1      2011 A            40
            B             7
       2012 A            81
            B            84
2      2011 A            37
            B            59
       2012 A            30
            B             6
3      2011 A            71
            B            43
       2012 A             3
            B            65
4      2011 A            45
            B            13
       2012 A            38
            B            70
>>>
I would like to create a new series that represents the number of the season, sorted by year. For example, the seasons in the first year would just be 1, 2, 3, 4, and the seasons in the second year would be 5, 6, 7, 8. The series would look like this:
Season  Year  Location
1       2011  A           1
              B           1
        2012  A           5
              B           5
2       2011  A           2
              B           2
        2012  A           6
              B           6
3       2011  A           3
              B           3
        2012  A           7
              B           7
4       2011  A           4
              B           4
        2012  A           8
              B           8
Name: SeasonNum, dtype: int64
>>>
Any suggestions on the best way to do this?

You could do:
def seasons(row):
    # 4 seasons per year: offset by whole years from the first year, then add the season.
    return (row['Year'] - 2011) * 4 + row['Season']

df.reset_index(inplace=True)
df['SeasonNum'] = df.apply(seasons, axis=1)
df.set_index(['Season', 'Year', 'Location'], inplace=True)
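A vectorized alternative, as a sketch: it assumes the earliest year in the index is the baseline and that every year has 4 seasons, and it avoids the reset_index/apply round trip entirely:

years = df.index.get_level_values('Year')
season = df.index.get_level_values('Season')
# Offset each year by 4 seasons relative to the earliest year, then add the season number.
df['SeasonNum'] = (years - years.min()) * 4 + season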

Related

Doing joins between 2 csv files [duplicate]

df2 only has data for the year 2019:
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
df1 has data for multiple years:
type year value
0 a 2015 12
1 a 2016 2
2 a 2019 3
3 b 2018 50
4 b 2019 10
5 c 2017 1
6 c 2016 5
7 c 2019 8
I need to concatenate them together while replacing df1's 2019 values with the values from df2 for the same year.
The expected result will look like this:
type year value
0 a 2015 12
1 a 2016 2
2 b 2018 50
3 c 2017 1
4 c 2016 5
5 a 2019 13
6 b 2019 5
7 c 2019 5
8 d 2019 20
The result from pd.concat([df1, df2], ignore_index=True, sort=False) is shown below; it clearly has multiple 2019 values for a single type. How should I improve the code? Thank you.
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
4 a 2015 12
5 a 2016 2
6 a 2019 3
7 b 2018 50
8 b 2019 10
9 c 2017 1
10 c 2016 5
11 c 2019 8
Add DataFrame.drop_duplicates after the concat to keep the last row for each type and year pair.
This solution works if the type/year pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'year'], keep='last'))
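For reference, a self-contained version of the same approach, with both frames rebuilt inline from the tables above (note the inputs use a year column):

import pandas as pd

df1 = pd.DataFrame({'type': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'year': [2015, 2016, 2019, 2018, 2019, 2017, 2016, 2019],
                    'value': [12, 2, 3, 50, 10, 1, 5, 8]})
df2 = pd.DataFrame({'type': ['a', 'b', 'c', 'd'],
                    'year': [2019, 2019, 2019, 2019],
                    'value': [13, 5, 5, 20]})

# df2's rows come last in the concat, so keep='last' prefers them
# whenever a (type, year) pair appears in both frames.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'year'], keep='last')
        .reset_index(drop=True))
print(df)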


Efficient way to subtract a row from the previous row, separated by group, with Pandas

The objective is to subtract each row (N) from the previous row's result (N-1), within each group.
Given a df
years nchar nval
0 2019 a 1
1 2019 b 1
2 2019 c 1
3 2020 a 1
4 2020 s 4
Let's separate out the group for year 2019 and denote it df_2019.
For df_2019, we assign the constant 10.
Then, only for index 0, we do the following operation and assign the result to a new column 'B':
df_2019.loc[df_2019.index[0], 'B'] = 10 - df_2019['nval'].values[0]
whereas for every other index N:
df_2019.loc[df_2019.index[N], 'B'] = df_2019['B'].values[N-1] - df_2019['nval'].values[N]
This will produce the following table:
   years nchar  nval  C  D  B
1   2019     a     1        9
2   2019     b     1        8
3   2019     c     1        7
For the 2020 group, the same computation applies; the only difference is that the starting constant is 7, taken from the last entry of column B for 2019.
To meet this requirement, the following code was written, with extra groups added:
import pandas as pd

year = [2019, 2019, 2019, 2020, 2020, 2020, 2020, 2022, 2022, 2022]
nval = [1, 1, 1, 1, 4, 1, 4, 5, 6, 7]
nchar = ['a', 'b', 'c', 'a', 's', 'c', 'a', 'b', 'c', 'g']
df = pd.DataFrame(zip(year, nchar, nval), columns=['years', 'nchar', 'nval'])
print(df)

year_ls = [2019, 2020, 2022]
nspacing_total = 2
nspacing_between_df = 4
all_df = []
default_val = 10
for idx, dyear in enumerate(year_ls):
    df_ = df[df['years'] == dyear].reset_index(drop=True)
    t = pd.DataFrame([[''] * 3] * len(df_), columns=["C", "D", "B"])
    df_ = pd.concat([df_, t], axis=1)
    Total = df_['nval'].sum()
    df_ = pd.DataFrame([[''] * len(df.columns)], columns=df.columns).append(df_).reset_index(drop=True)
    if idx == 0:
        df_.loc[df_.index[0], 'B'] = default_val
    if idx != 0:
        pre_df = all_df[idx - 1]
        pre_val = pre_df['B'].values[-1]
        nposi = 1
        pre_years = pre_df['years'].values[nposi]
        df_.loc[df_.index[0], 'nchar'] = f'From {pre_years}'
        df_.loc[df_.index[0], 'B'] = pre_val
    for ndexd in range(df_.shape[0] - 1):
        df_.loc[df_.index[ndexd + 1], 'B'] = df_['B'].values[ndexd] - df_['nval'].values[ndexd + 1]
    df_ = df_.append(pd.DataFrame([[''] * len(df.columns)] * nspacing_total, columns=df.columns)).reset_index(drop=True)
    df_.loc[df_.index[-1], 'nval'] = Total
    df_.loc[df_.index[-1], 'nchar'] = 'Total'
    df_.loc[df_.index[-1], 'B'] = df_['B'].values[0] - df_['nval'].values[-1]
    all_df.append(df_)
However, I wonder whether this can be simplified using pandas groupby or similar. I'd really appreciate any tips.
Ultimately, I would like to express the table as below, to be exported to Excel:
    years      nchar  nval  C   D    B
0                             10
1    2019          a     1          9
2    2019          b     1          8
3    2019          c     1          7
4
5               Total     3          7
6
7
8
9
10          From 2019                7
11   2020          a     1          6
12   2020          s     4          2
13   2020          c     1          1
14   2020          a     4         -3
15
16              Total    10         -3
17
18
19
20
21          From 2020               -3
22   2022          b     5          -8
23   2022          c     6         -14
24   2022          g     7         -21
25
26              Total    18         -21
27
28
29
30
The code to produce the above table:
# Optional to represent the table above
all_ap_df = []
for a_df in all_df:
    df = a_df.append(pd.DataFrame([[''] * len(df.columns)] * nspacing_between_df,
                                  columns=df.columns)).reset_index(drop=True)
    all_ap_df.append(df)
df = pd.concat(all_ap_df, axis=0).reset_index(drop=True)
df.loc[df_.index[0], 'D'] = df['B'].values[0]
df.loc[df_.index[0], 'B'] = ''
df = df.fillna('')
I think this is actually quite simple. Use groupby + cumsum:
df['B'] = 10 - df['nval'].cumsum()
Output:
>>> df
years nchar nval B
0 2019 a 1 9
1 2019 b 1 8
2 2019 c 1 7
3 2020 a 1 6
4 2020 s 4 2
In your case, chain it with groupby:
df['new'] = df.groupby('years')['nval'].cumsum().rsub(10)
Out[8]:
0 9
1 8
2 7
3 9
4 5
Name: nval, dtype: int64
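If the balance also has to carry across years, as in the Excel layout above, note that the ungrouped cumsum already does that; the per-year Total and closing-balance figures can then come from a plain groupby. A minimal sketch on the question's ten-row data:

import pandas as pd

year = [2019, 2019, 2019, 2020, 2020, 2020, 2020, 2022, 2022, 2022]
nval = [1, 1, 1, 1, 4, 1, 4, 5, 6, 7]
nchar = ['a', 'b', 'c', 'a', 's', 'c', 'a', 'b', 'c', 'g']
df = pd.DataFrame(zip(year, nchar, nval), columns=['years', 'nchar', 'nval'])

# Running balance carried across all years: 10 minus the global cumulative sum.
df['B'] = 10 - df['nval'].cumsum()

totals = df.groupby('years')['nval'].sum()   # the 'Total' rows
closing = df.groupby('years')['B'].last()    # feeds the 'From <year>' rows
print(df, totals, closing, sep='\n\n')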

Pandas dataframe multiple groupby filtering

I have the following dataframe:
df2 = pd.DataFrame({'season':[1,1,1,2,2,2,3,3],'value' : [-2, 3,1,5,8,6,7,5], 'test':[3,2,6,8,7,4,25,2],'test2':[4,5,7,8,9,10,11,12]},index=['2020', '2020', '2020','2020', '2020', '2021', '2021', '2021'])
df2.index= pd.to_datetime(df2.index)
df2.index = df2.index.year
print(df2)
season test test2 value
2020 1 3 4 -2
2020 1 2 5 3
2020 1 6 7 1
2020 2 8 8 5
2020 2 7 9 8
2021 2 4 10 6
2021 3 25 11 7
2021 3 2 12 5
I would like to filter it to obtain, for each year and each season within that year, the row with the maximum value in the 'value' column. How can I do that efficiently?
Expected result:
print(df_result)
      season  value  test  test2
year
2020       1      3     2      5
2020       2      8     7      9
2021       2      6     4     10
2021       3      7    25     11
Thank you for your help,
Pierre
This is a groupby operation, but a little non-trivial, so posting as an answer.
(df2.set_index('season', append=True)
    .groupby(level=[0, 1])
    .value.max()
    .reset_index(level=1)
)
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7
You can elevate your index to a series, then perform a groupby operation on a list of columns:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])['value'].max().reset_index()
print(df_result)
   year  season  value
0  2020       1      3
1  2020       2      8
2  2021       2      6
3  2021       3      7
If you wish, you can make year your index again via df_result = df_result.set_index('year').
To keep the other columns, use:
df2['year'] = df2.index
df2['value'] = df2.groupby(['year', 'season'])['value'].transform('max')
Then drop any duplicates via pd.DataFrame.drop_duplicates.
Update #1
For your new requirement, you need to apply an aggregation function for 2 series:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])\
               .agg({'value': 'max', 'test': 'last'})\
               .reset_index()
print(df_result)
   year  season  value  test
0  2020       1      3     6
1  2020       2      8     7
2  2021       2      6     4
3  2021       3      7     2
Update #2
For your finalised requirement:
df2['year'] = df2.index
df2['max_value'] = df2.groupby(['year', 'season'])['value'].transform('max')
df_result = df2.loc[df2['value'] == df2['max_value']]\
               .drop_duplicates(['year', 'season'])\
               .drop(columns='max_value')
print(df_result)
      season  value  test  test2  year
2020       1      3     2      5  2020
2020       2      8     7      9  2020
2021       2      6     4     10  2021
2021       3      7    25     11  2021
You can use get_level_values to bring the index values into the groupby:
df2.groupby([df2.index.get_level_values(0),df2.season]).value.max().reset_index(level=1)
Out[38]:
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7
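Another idiom worth sketching here (assuming df2 as built in the question): groupby().idxmax() returns the row label of each group's maximum, so the whole winning row, including test and test2, can be selected in one go:

# idxmax needs unique row labels, so name the year index and reset it first.
tmp = df2.rename_axis('year').reset_index()
df_result = tmp.loc[tmp.groupby(['year', 'season'])['value'].idxmax()]
print(df_result)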

Pandas pivot table

When trying to use pd.pivot_table on a given dataset, I noticed that it only creates the levels that actually exist within each parent group, not all possible levels. For example, on a dataset like this:
YEAR CLASS
0 2013 A
1 2013 A
2 2013 B
3 2013 B
4 2013 B
5 2013 C
6 2013 C
7 2013 D
8 2014 A
9 2014 A
10 2014 A
11 2014 B
12 2014 B
13 2014 B
14 2014 C
15 2014 C
there is no level D for year 2014, so the pivot table will look like this:
pd.pivot_table(df, index=["YEAR", "CLASS"], values=["YEAR"], aggfunc=[len], fill_value=0)
            len
YEAR CLASS
2013 A        2
     B        3
     C        2
     D        1
2014 A        3
     B        3
     C        2
What I want is to get a separate group for D in 2014 with length 0 in my pivot table. How can I include all possible levels in the child variable for the parent variable?
I think you can use crosstab and stack:
print(pd.pivot_table(df,
                     index=["YEAR", "CLASS"],
                     values=["YEAR"],
                     aggfunc=[len],
                     fill_value=0))
            len
YEAR CLASS
2013 A        2
     B        3
     C        2
     D        1
2014 A        3
     B        3
     C        2
print(pd.crosstab(df['YEAR'], df['CLASS']))
CLASS  A  B  C  D
YEAR
2013   2  3  2  1
2014   3  3  2  0
df = pd.crosstab(df['YEAR'], df['CLASS']).stack()
df.name = 'len'
print(df)
YEAR  CLASS
2013  A        2
      B        3
      C        2
      D        1
2014  A        3
      B        3
      C        2
      D        0
Name: len, dtype: int64
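A further sketch along the same lines, assuming the original df with YEAR and CLASS columns from the question (before it was reassigned above): declaring CLASS as a pandas Categorical lets a plain groupby enumerate every category, including those absent from a year.

# Make CLASS categorical over all four levels; observed=False keeps
# unobserved (YEAR, CLASS) combinations, so (2014, D) appears with count 0.
df['CLASS'] = pd.Categorical(df['CLASS'], categories=list('ABCD'))
counts = df.groupby(['YEAR', 'CLASS'], observed=False).size()
print(counts)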
