I have a pandas DataFrame with a 2-level MultiIndex. For each level-1 index value, I want to select the records belonging to its first level-2 index value.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year': ['2020','2020', '2019','2019','2019','2018', '2019','2018','2017'],
                   'class': list('AISAAIASS'),
                   'val': np.random.randint(0, 10, 9)})
df
Person Year class val
0 1 2020 A 8
1 1 2020 I 7
2 1 2019 S 6
3 2 2019 A 8
4 2 2019 A 1
5 2 2018 I 2
6 3 2019 A 0
7 3 2018 S 6
8 3 2017 S 8
I want the 2020 (Year) records for Person 1 (2 records), the 2019 records for Person 2 (2 records), and the 2019 record for Person 3 (1 record).
I have looked at a lot of code but am still unable to get the answer. Is there a simple way of doing this?
Use Index.get_level_values with Index.duplicated to get the first (Person, Year) index value for each person, then filter with Index.isin:
np.random.seed(2020)
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'Year': ['2020','2020', '2019','2019','2019','2018', '2019','2018','2017'],
'class':list('AISAAIASS'),
'val': np.random.randint(0, 10, 9)}).set_index(['Person','Year'])
idx = df.index[~df.index.get_level_values(0).duplicated()]
df1 = df[df.index.isin(idx)]
Or get the first row per person with GroupBy.head and use its index values:
df1 = df[df.index.isin(df.groupby(['Person']).head(1).index)]
print (df1)
class val
Person Year
1 2020 A 0
2020 I 8
2 2019 A 6
2019 A 3
3 2019 A 7
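A hypothetical alternative, not from the answer above: compare each row's Year with the first Year recorded for its Person and keep the matching rows.
# sketch only; assumes the same MultiIndexed df as above
years = df.index.get_level_values('Year').to_series(index=df.index)
mask = years.eq(years.groupby(level='Person').transform('first')).to_numpy()
df1 = df[mask]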
I have data that looks like this:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'DATE': ['1/1/2015','1/2/2015', '1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015','1/8/2015',
'1/9/2016','1/2/2015','1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015'],
'CD': ['A','A','A','A','B','B','A','A','C','A','A','A','A','A','A']})
What I would like to do is group by ID and CD and get the start and end date of each run. I tried using groupby with agg, but it groups all the A rows together even though they need to be kept separate, since there is a B in between two runs of A.
df1 = df.groupby(['ID','CD'])
df1 = df1.agg(
    Start_Date=('DATE', np.min),
    End_Date=('DATE', np.max)
).reset_index()
What I get is one row per (ID, CD) pair, with the two separate A runs for ID 1 collapsed into one.
I was hoping someone could help me get the result I need: one row for each consecutive run of the same CD, with its start and end date.
Make a grouper for the grouping: ne compares each CD value with the previous row's (shift(1)), and cumsum turns the change flags into a running run id, so consecutive identical CD values share one number:
grouper = df['CD'].ne(df['CD'].shift(1)).cumsum()
grouper:
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 5
14 5
Name: CD, dtype: int32
Then use groupby with the grouper:
df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
output:
min max
ID CD
1 A 1/1/2015 1/4/2015
B 1/5/2015 1/6/2015
A 1/7/2015 1/8/2015
C 1/9/2016 1/9/2016
2 A 1/2/2015 1/7/2015
Then rename the columns, call reset_index, and reorder for your desired output:
(df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
.set_axis(['Start_Date', 'End_Date'], axis=1)
.reset_index()
.assign(CD=lambda x: x.pop('CD')))
Result:
ID Start_Date End_Date CD
0 1 1/1/2015 1/4/2015 A
1 1 1/5/2015 1/6/2015 B
2 1 1/7/2015 1/8/2015 A
3 1 1/9/2016 1/9/2016 C
4 2 1/2/2015 1/7/2015 A
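One caveat, not raised in the original answer: DATE here is a plain string, so min and max compare lexicographically. Converting to real datetimes first keeps the aggregation chronological, for example:
# hypothetical addition: parse DATE before grouping so min/max are chronological
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')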
I have two DataFrames in Python, and I'm trying to return the values in df based on the date and column name. The real DataFrame is very long, so this must be done using some kind of loop.
df = pd.DataFrame({
'Avg1': [1, 2, 3, 4],
'Avg2': [3, 5, 1, 15],
'Date': ['2021-08-06', '2021-08-07', '2021-08-08', '2021-08-07']
})
   Avg1  Avg2        Date
0     1     3  2021-08-06
1     2     5  2021-08-07
2     3     1  2021-08-08
3     4    15  2021-08-07
df2 = pd.DataFrame({
'Return Avg': ['Avg1', 'Avg2'],
'At Date': ['2021-08-08', '2021-08-07'],
'Returned values (what I want)': [3, 5]
})
  Return Avg     At Date  Returned values (what I want)
0       Avg1  2021-08-08                              3
1       Avg2  2021-08-07                              5
I think melt is what you are looking for.
df = pd.DataFrame({
'Avg1': [1, 2, 3, 4],
'Avg2': [3, 5, 1, 15],
'Date': ['2021-08-06', '2021-08-07', '2021-08-08', '2021-08-07']
})
Then:
df.melt(id_vars=['Date'], value_vars=['Avg1', 'Avg2'])
Date variable value
0 2021-08-06 Avg1 1
1 2021-08-07 Avg1 2
2 2021-08-08 Avg1 3
3 2021-08-07 Avg1 4
4 2021-08-06 Avg2 3
5 2021-08-07 Avg2 5
6 2021-08-08 Avg2 1
7 2021-08-07 Avg2 15
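To actually pull the requested values out, the melted frame can then be matched against df2, for example with a merge (a sketch; looked_up is just an illustrative name, and drop_duplicates mirrors the "take the first match" behaviour of the .iloc[0] in the answer below when a date appears more than once):
melted = df.melt(id_vars=['Date'], value_vars=['Avg1', 'Avg2'])
melted = melted.drop_duplicates(subset=['variable', 'Date'])   # keep the first match per (column, date)
looked_up = df2.merge(melted, how='left',
                      left_on=['Return Avg', 'At Date'],
                      right_on=['variable', 'Date'])['value']
On the sample data this should yield 3 and 5, matching the values the question asks for.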
Let's start with a function:
table1 = pd.DataFrame({'Avg1':[1,2,3,4],'Avg2':[3,5,1,15],'Date':['2021-08-06','2021-08-07','2021-08-08','2021-08-09']})
df2 = pd.DataFrame({'Return Avg':['Avg1','Avg2'],'At Date':['2021-08-08','2021-08-07']})
def get_the_correct_value(date, col_name, ext_table):
    return ext_table.loc[ext_table['Date'] == date, col_name].iloc[0]
Now, with the function in place, we use apply to create the new column you want:
df2['rv'] = df2.apply(lambda x: get_the_correct_value(x['At Date'], x['Return Avg'], table1), axis=1)
And that's it.
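On the sample data above, df2 should then look like:
  Return Avg     At Date  rv
0       Avg1  2021-08-08   3
1       Avg2  2021-08-07   5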
I'm trying to create a column containing a cumulative count of tid entries, grouped according to unique values of (rid, tid). The cumulative count should increment by the number of entries in the grouping, as shown in the df3 DataFrame below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
After the required operation, the result should be:
df3 = pd.DataFrame({
'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column I'm after is cumulativeentries, although so far I've only figured out how to generate the intermediate groupentries column using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in the "source area" of the tid column: from the beginning of the DataFrame, up to and including the end of the current group.
To compute both required values for each group, I defined the following function:
def fn(grp):
    lastRow = grp.iloc[-1]    # last row of the current group
    lastId = lastRow.name     # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above I used truncate function.
In my opinion it is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
.rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
For the first column, use GroupBy.transform with DataFrameGroupBy.size; for the second, use a custom function that takes all values of the tid column up to the group's last index, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
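For readability, the lambda above can be spelled out as a named function with the same logic (a sketch; cumulative_entries is just an illustrative name):
def cumulative_entries(x):
    stop = x.index[-1]                  # label of the group's last row; with the default
                                        # RangeIndex this equals its position
    seen = df1['tid'].iloc[:stop + 1]   # tid values from the first row through that row
    return (seen == x.iat[-1]).sum()    # count how many equal the group's tid

df1['cumulativeentries'] = df1.groupby(['rid', 'tid'])['tid'].transform(cumulative_entries)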
Let's say I have this data ordered by id:
id | count
1 1
2 2
3 0
4 4
5 3
6 2
7 0
8 10
9 1
10 2
I always want to obtain the rows that come after the last zero. Based on the data above, I would want to get:
id | count
8 10
9 1
10 2
Does anyone know how to do this?
pandas
df.loc[df['count'].ne(0).iloc[::-1].cumprod().astype(bool)]
id count
7 8 10
8 9 1
9 10 2
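For reference, a step-by-step sketch of what the one-liner above does (my own breakdown, not part of the original answer):
m = df['count'].ne(0)    # True where count is non-zero
m = m.iloc[::-1]         # walk the column from the bottom up
m = m.cumprod()          # stays 1 until the first zero seen from the end, i.e. the last zero overall
m = m.astype(bool)
df.loc[m]                # .loc aligns the mask on the index, so no re-reversal is needed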
numpy
df[(df['count'].values[::-1] != 0).cumprod()[::-1].astype(bool)]
id count
7 8 10
8 9 1
9 10 2
with other conditions
df[(df['count'].values[::-1] < 3).cumprod()[::-1].astype(bool)]
# df.loc[df['count'].lt(3).iloc[::-1].cumprod().astype(bool)]
id count
8 9 1
9 10 2
debugging
You should be able to copy and paste this and reproduce my results. If you can't then there is something else wrong. Try resetting your kernel.
import pandas as pd
df = pd.DataFrame({
'count': [1, 2, 0, 4, 3, 2, 0, 10, 1, 2],
'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df[(df['count'].values[::-1] < 3).cumprod()[::-1].astype(bool)]
Should produce
count id
8 1 9
9 2 10
I have a dataframe containing weekly sales for different products (a, b, c). If there were zero sales in a given week (e.g. week 4), there is no record for that week:
In[1]
df = pd.DataFrame({'product': list('aaaabbbbcccc'),
'week': [1, 2, 3, 5, 1, 2, 3, 5, 1, 2, 3, 4],
'sales': np.power(2, range(12))})
Out[1]
product sales week
0 a 1 1
1 a 2 2
2 a 4 3
3 a 8 5
4 b 16 1
5 b 32 2
6 b 64 3
7 b 128 5
8 c 256 1
9 c 512 2
10 c 1024 3
11 c 2048 4
I would like to create a new column containing the cumulative sales for the previous n weeks, grouped by product. E.g. for n=2 it should look like the last_2_weeks column below:
product sales week last_2_weeks
0 a 1 1 0
1 a 2 2 1
2 a 4 3 3
3 a 8 5 4
4 b 16 1 0
5 b 32 2 16
6 b 64 3 48
7 b 128 5 64
8 c 256 1 0
9 c 512 2 256
10 c 1024 3 768
11 c 2048 4 1536
If there was a record for every week, I could just use rolling_sum as described in this question.
Is there a way to set 'week' as an index and only calculate the sum on that index? Or could I resample 'week' and set sales to zero for all missing rows?
Resample is only valid with a DatetimeIndex, TimedeltaIndex or PeriodIndex, but reindex is possible with integers.
First, the week column is set as the index. Then df is grouped by the product column and each group is reindexed from week 1 up to its maximum week. Missing values are filled with 0.
import pandas as pd
import numpy as np
df = pd.DataFrame({'product': list('aaaabbbbcccc'),
'week': [1, 2, 3, 5, 1, 2, 3, 5, 1, 2, 3, 4],
'sales': np.power(2, range(12))})
df = df.set_index('week')
def reindex_by_max_index_of_group(df):
    index = range(1, max(df.index) + 1)
    return df.reindex(index, fill_value=0)

df = df.groupby('product').apply(reindex_by_max_index_of_group)
df.drop(['product'], inplace=True, axis=1)
print(df.reset_index())
# product week sales
#0 a 1 1
#1 a 2 2
#2 a 3 4
#3 a 4 0
#4 a 5 8
#5 b 1 16
#6 b 2 32
#7 b 3 64
#8 b 4 0
#9 b 5 128
#10 c 1 256
#11 c 2 512
#12 c 3 1024
#13 c 4 2048
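The look-back sum itself still has to be computed on top of this; a possible continuation (a sketch using the modern rolling API, with last_2_weeks named as in the question):
out = df.reset_index()    # columns: product, week, sales, as printed above
out['last_2_weeks'] = (out.groupby('product')['sales']
                          .transform(lambda s: s.rolling(3, min_periods=1).sum() - s))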
You can use pivot to create a table which will auto-fill the missing values. This works provided that there is at least one entry for each week in your original data; reindex can be used to ensure that there is a row in the table for every week.
A rolling sum can then be applied to it:
import pandas, numpy
df = pandas.DataFrame({'product': list('aaaabbbbcccc'),
'week': [1, 2, 3, 5, 1, 2, 3, 5, 1, 2, 3, 4],
'sales': numpy.power(2, range(12))})
sales = df.pivot(index='week', columns='product')
# Cope with weeks when there were no sales at all
sales = sales.reindex(range(min(sales.index), 1+max(sales.index))).fillna(0)
# Calculate the sum for the preceding two weeks
sales.rolling(3, min_periods=1).sum() - sales
This gives the following result, which matches the desired output in that it provides the sum for the preceding two weeks:
product a b c
week
1 0 0 0
2 1 16 256
3 3 48 768
4 6 96 1536
5 4 64 3072
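If the goal is a last_2_weeks column on the original rows rather than a wide table, the result above could be stacked and joined back (a sketch, not part of the original answer):
last2 = sales.rolling(3, min_periods=1).sum() - sales   # wide: week x product
last2 = last2['sales'].stack().rename('last_2_weeks')   # long: (week, product) -> value
df = df.join(last2, on=['week', 'product'])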