I have a pandas dataframe with multiple columns, which looks like the following:
Index  ID   Year  Code  Type  Mode
0      100  2018  ABC   1     1
1      100  2019  DEF   2     2
2      100  2019  GHI   3     3
3      102  2018  JKL   4     1
4      103  2019  MNO   5     1
5      103  2018  PQR   6     2
6      102  2019  PQR   3     2
I only want to keep the IDs that have rows for every value of the column Mode. The expected output looks like this:
Index  ID   Year  Code  Type  Mode
0      100  2018  ABC   1     1
1      100  2019  DEF   2     2
2      100  2019  GHI   3     3
I have already tried doing so by using the following code:
df = data.groupby('ID').filter(lambda x: {1, 2, 3}.issubset(x['Mode']))
but this returns an empty result. Can someone help me here?
TIA
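For reference, a minimal sketch reconstructing the sample frame (the dtypes are assumptions; in particular, Mode is taken to hold integers):
import pandas as pd

data = pd.DataFrame({
    'ID':   [100, 100, 100, 102, 103, 103, 102],
    'Year': [2018, 2019, 2019, 2018, 2019, 2018, 2019],
    'Code': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'PQR'],
    'Type': [1, 2, 3, 4, 5, 6, 3],
    'Mode': [1, 2, 3, 1, 1, 2, 2],
})
df = data  # the answers below refer to it as df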
You can try
out = df.groupby('ID').filter(lambda x : pd.Series([1,2,3]).isin(x['Mode']).all())
Out[9]:
Index ID Year Code Type Mode
0 0 100 2018 ABC 1 1
1 1 100 2019 DEF 2 2
2 2 100 2019 GHI 3 3
Your code works just fine (on pandas 1.3, python 3.9):
out = df.groupby('ID').filter(lambda x: {1,2,3}.issubset(x['Mode']))
Output:
Index ID Year Code Type Mode
0 0 100 2018 ABC 1 1
1 1 100 2019 DEF 2 2
2 2 100 2019 GHI 3 3
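If the original call really did return an empty frame, one likely cause (an assumption, since the code itself is correct) is that Mode was read in as strings, e.g. from a CSV, so {1, 2, 3}.issubset(x['Mode']) can never be True. Casting first would fix that:
# Assumption: Mode may hold strings such as '1', '2', '3' rather than integers.
data['Mode'] = data['Mode'].astype(int)
df = data.groupby('ID').filter(lambda x: {1, 2, 3}.issubset(x['Mode']))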
I have a df similar to the one below. I need to select the rows where df['Year 2'] is equal or closest to df['Year'] within subsets grouped by df['ID'], so in this example rows 1, 2 and 5.
df
Year ID A Year 2 C
0 2020 12 0 2019 0
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0
4 2019 6 0 2017 0
5 2019 6 1 2018 0 <-
I am trying to achieve that with the following piece of code, using groupby and passing a function to get the proper row with the closest value for both columns.
df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))
This particular line raises TypeError: 'int' object is not callable, because apply expects a callable but min(...) is evaluated immediately and returns an int. Any ideas how to fix this line of code, or a fresh approach to the problem, are appreciated.
TYIA.
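For reference, a minimal sketch reconstructing the sample frame:
import pandas as pd

df = pd.DataFrame({
    'Year':   [2020, 2020, 2017, 2017, 2019, 2019],
    'ID':     [12, 12, 10, 10, 6, 6],
    'A':      [0, 0, 1, 0, 0, 1],
    'Year 2': [2019, 2020, 2017, 2018, 2017, 2018],
    'C':      [0]*6,
})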
You can subtract the two columns with Series.sub, take the absolute value, and aggregate the indices of the per-group minima with GroupBy.idxmin:
idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
If need new column filled by boolean use Index.isin:
df['new'] = df.index.isin(idx)
print (df)
Year ID A Year 2 C new
0 2020 12 0 2019 0 False
1 2020 12 0 2020 0 True
2 2017 10 1 2017 0 True
3 2017 10 0 2018 0 False
4 2019 6 0 2017 0 False
5 2019 6 1 2018 0 True
If need filter rows use DataFrame.loc:
df1 = df.loc[idx]
print (df1)
Year ID A Year 2 C
5 2019 6 1 2018 0
2 2017 10 1 2017 0
1 2020 12 0 2020 0
One row solution:
df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
You could get the idxmin per group:
idx = (df['Year 2']-df['Year']).abs().groupby(df['ID']).idxmin()
# assignment for test
df.loc[idx, 'D'] = '<-'
for selection only:
df2 = df.loc[idx]
output:
Year ID A Year 2 C D
0 2020 12 0 2019 0 NaN
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0 NaN
4 2019 6 0 2017 0 NaN
5 2019 6 1 2018 0 <-
Note that there is a difference between:
df.loc[df.index.isin(idx)]
which gets all the min rows
and:
df.loc[idx]
which gets the first match
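A quick sketch of that difference, using the hypothetical df built above: both select the same rows here, but isin preserves the original row order, while loc[idx] follows the order of the group keys (and, when a group has ties, idxmin only ever reports the first minimum).
idx = (df['Year 2'] - df['Year']).abs().groupby(df['ID']).idxmin()
print(df.loc[df.index.isin(idx)].index.tolist())  # [1, 2, 5] -- original row order
print(df.loc[idx].index.tolist())                 # [5, 2, 1] -- order of the group keys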
I have years of transaction data which I am working with by customer id. The transaction information is at an invoice level: an id could easily have multiple invoices on the same day, or not have invoices for years. I am attempting to create dataframes which contain sums of invoices by customer by year, but which also show years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key = 'invoice_date', freq = 'Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
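For reference, a minimal sketch of an invoices frame matching the description (the dates are made up; only the id/year/sales combinations follow the sample output):
import pandas as pd

invoices = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3],
    'invoice_date': pd.to_datetime(['2018-03-01', '2019-06-15', '2020-01-10',
                                    '2018-05-20', '2020-08-02', '2020-11-11']),
    'sales': [483982.20, 3453, 453533, 243, 23423, 2330202],
})
invoices['invoice_year'] = invoices['invoice_date'].dt.year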
Let's suppose the original values are defined in a dataframe named df; then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='val'))
Otherwise, you can first create the column invoice_year:
df['invoice_year'] = df['invoice_date'].dt.year
and repeat the same code, grouping on invoice_year instead. This outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as an example:
df = pd.DataFrame({'id': [1]*2 + [2]*2 + [3]*2,
                   'invoice_date': pd.to_datetime(['2018-12-01', '2019-12-01', '2020-12-01']*2,
                                                  infer_datetime_format=True),
                   'val': [1]*6})
Stefan has posted a comment that may help: simply passing dropna=False to your .groupby seems like the best bet. But you could also take the approach of bringing the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter. Starting from your current output as df:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
You can use pd.MultiIndex.from_product and reindex the dataframe from a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max()+1),
range(iy.min(), iy.max()+1)],
names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0
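Note that range(i.min(), i.max()+1) assumes the ids are contiguous integers; if they are not, building that level from the observed values is a drop-in replacement (a sketch, reusing i and iy from above):
idx = pd.MultiIndex.from_product([sorted(i.unique()),
                                  range(iy.min(), iy.max() + 1)],
                                 names=[i.name, iy.name])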
My dataframe is given below:
input_df =
index Year Month Day Hour Minute GHI
0 2017 1 1 7 30 100
1 2017 1 1 8 30 200
2 2017 1 2 9 30 300
3 2017 1 2 10 30 400
4 2017 2 1 11 30 500
5 2017 2 1 12 30 600
6 2017 2 2 13 30 700
I want to sum the GHI data for each day. From the above I am expecting an output like below:
result_df =
index Year Month Day GHI
0 2017 1 1 300
1 2017 1 2 700
2 2017 2 1 1100
3 2017 2 2 700
My code and my present output are:
result_df = input_df.groupby(['Year','Month','Day'])['GHI'].sum()
print(result_df)
result_df =
index Year Month Day GHI
0 2017 1 1 1400
1 2017 2 2 1400
My code above is combining the first day of each month and summing the data, which is wrong. How do I fix it?
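For reference, a minimal sketch reconstructing input_df:
import pandas as pd

input_df = pd.DataFrame({
    'Year':   [2017]*7,
    'Month':  [1, 1, 1, 1, 2, 2, 2],
    'Day':    [1, 1, 2, 2, 1, 1, 2],
    'Hour':   [7, 8, 9, 10, 11, 12, 13],
    'Minute': [30]*7,
    'GHI':    [100, 200, 300, 400, 500, 600, 700],
})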
You are incredibly close in your attempt. The thing to bear in mind is that DataFrame.groupby() has a parameter as_index with a default value of True, so your groupby() outputs a Series with a MultiIndex. To get the desired output you can either chain the reset_index() method after the groupby, or set the as_index parameter to False.
result_df = input_df.groupby(['Year','Month','Day'])['GHI'].sum()
result_df
Out[12]:
Year Month Day
2017 1 1 300
2 700
2 1 1100
2 700
Name: GHI, dtype: int64
# Getting the desired output
input_df.groupby(['Year','Month','Day'])['GHI'].sum().reset_index()
Out[16]:
Year Month Day GHI
0 2017 1 1 300
1 2017 1 2 700
2 2017 2 1 1100
3 2017 2 2 700
input_df.groupby(['Year','Month','Day'], as_index=False)['GHI'].sum()
Out[17]:
Year Month Day GHI
0 2017 1 1 300
1 2017 1 2 700
2 2017 2 1 1100
3 2017 2 2 700
I have a df with 4 observations per company (4 quarters). However, for several companies I have fewer than 4 observations. When I don't have all 4 quarters for a firm, I would like to delete all observations relative to that firm. Any ideas how to do this?
This is what the df looks like:
Quarter Year Company
1 2018 A
2 2018 A
3 2018 A
4 2018 A
1 2018 B
2 2018 B
1 2018 C
2 2018 C
3 2018 C
4 2018 C
In this df I would like to delete rows relative to company B because I only have 2 quarters.
Many thanks!
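For reference, a minimal sketch reconstructing the sample frame:
import pandas as pd

df = pd.DataFrame({
    'Quarter': [1, 2, 3, 4, 1, 2, 1, 2, 3, 4],
    'Year':    [2018]*10,
    'Company': ['A']*4 + ['B']*2 + ['C']*4,
})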
Use transform with 'size' to get a Series the same length as the original DataFrame, which makes boolean filtering possible:
df = df[df.groupby('Company')['Quarter'].transform('size') == 4]
#if want check by Companies and years
#df = df[df.groupby(['Company','Year'])['Quarter'].transform('size') == 4]
print (df)
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
If performance is not important, or the DataFrame is small, use DataFrameGroupBy.filter:
df = df.groupby('Company').filter(lambda x: len(x) == 4)
Using value_counts
s=df.Company.value_counts()
df.loc[df.Company.isin(s[s==4].index)]
Out[527]:
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
You can go through your Company column and check whether each company has all 4 quarterly results:
for i in set(df['Company']):
    if len(df[df['Company'] == i]) != 4:
        df = df[df['Company'] != i]
I am working with pandas on a dataset in which maintenance work is done at a location. The maintenance is done every four years at each site. I want to find the years since the last maintenance action at each site. I am giving only two sites in the following example, but in the original dataset I have thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year, Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year. I know that if the action has been performed in Year Y, the previous maintenance has been performed in Year Y-4.
Site Year Action Measurement
A 2014 0 100
A 2015 0 150
A 2016 1 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 0 60
B 2017 0 110
Given this dataset, first I want to have a temporary dataset like this:
Site Year Action Measurement Years_Since_Last_Action
A 2014 0 100 2
A 2015 0 150 3
A 2016 1 300 4
A 2017 0 80 1
B 2014 0 200 3
B 2015 1 250 4
B 2016 0 60 1
B 2017 0 110 2
Then, I want to have:
Years_Since_Last_Action Mean_Measurement
1 70
2 105
3 175
4 275
Thanks in advance!
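For reference, a minimal sketch reconstructing the sample frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Site':        ['A']*4 + ['B']*4,
    'Year':        [2014, 2015, 2016, 2017]*2,
    'Action':      [0, 0, 1, 0, 0, 1, 0, 0],
    'Measurement': [100, 150, 300, 80, 200, 250, 60, 110],
})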
For your first question:
s = df.loc[df.Action==1, ['Site','Year']].set_index('Site')  # the action year for each site
df['Newyear'] = df.Site.map(s.Year)  # map the action year back to the whole dataframe
s1 = df.Year - df.Newyear
df['action since last year'] = np.where(s1 <= 0, s1 + 4, s1)  # shift non-positive offsets into the 1-4 range
df
Out[167]:
Site Year Action Measurement Newyear action since last year
0 A 2014 0 100 2016 2
1 A 2015 0 150 2016 3
2 A 2016 1 300 2016 4
3 A 2017 0 80 2016 1
4 B 2014 0 200 2015 3
5 B 2015 1 250 2015 4
6 B 2016 0 60 2015 1
7 B 2017 0 110 2015 2
For your second question:
df.groupby('action since last year').Measurement.mean()
Out[168]:
action since last year
1 70
2 105
3 175
4 275
Name: Measurement, dtype: int64
First, build your intermediate using groupby, *fill and a little arithmetic.
v = (df.Year
     .where(df.Action.astype(bool))  # keep Year only on action rows
     .groupby(df.Site)
     .ffill()
     .bfill()                        # every row now holds its site's action year
     .sub(df.Year))                  # action year minus row year
# v > 0: before the action, so 4 - v years since the prior action;
# v < 0: after the action, so |v| years; v == 0: the action year itself, so 4.
df['Years_Since_Last_Action'] = np.select([v > 0, v < 0], [4 - v, v.abs()], default=4)
df
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2.0
1 A 2015 0 150 3.0
2 A 2016 1 300 4.0
3 A 2017 0 80 1.0
4 B 2014 0 200 3.0
5 B 2015 1 250 4.0
6 B 2016 0 60 1.0
7 B 2017 0 110 2.0
Next,
df.groupby('Years_Since_Last_Action', as_index=False).Measurement.mean()
Years_Since_Last_Action Measurement
0 1.0 70
1 2.0 105
2 3.0 175
3 4.0 275
How about:
delta_year = df.loc[df.groupby("Site")["Action"].transform("idxmax"), "Year"].values
years_since = ((df.Year - delta_year) % 4).replace(0, 4)
df["Years_Since_Last_Action"] = years_since
out = df.groupby("Years_Since_Last_Action")["Measurement"].mean().reset_index()
out = out.rename(columns={"Measurement": "Mean_Measurement"})
which gives me
In [230]: df
Out[230]:
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2
1 A 2015 0 150 3
2 A 2016 1 300 4
3 A 2017 0 80 1
4 B 2014 0 200 3
5 B 2015 1 250 4
6 B 2016 0 60 1
7 B 2017 0 110 2
In [231]: out
Out[231]:
Years_Since_Last_Action Mean_Measurement
0 1 70
1 2 105
2 3 175
3 4 275