Given the following python pandas DataFrame:
   ID  Holidays  visit_1  visit_2  visit_3  other
0   0      True        1        2        0    red
1   0     False        3        2        0    red
2   0      True        4        4        1   blue
3   1     False        2        0        0    red
4   1      True        1        2        1  green
5   2     False        1        0        0    red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the total visit counts per ID
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one taking into account only the rows whose Holidays value is True. How could I do this?
Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
    visit_1  visit_2  visit_3
ID
0         5        6        1
1         1        2        1
An alternative, if you also want to keep the groups without any match: mask the non-holiday rows with where (they become NaN and sum to 0), so every ID still appears in the result:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
    visit_1  visit_2  visit_3
ID
0       5.0      6.0      1.0
1       1.0      2.0      1.0
2       0.0      0.0      0.0
A variant that restores nullable integer dtypes and labels the columns:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
    .convert_dtypes()
    .add_suffix('_Holidays')
)
output:
    visit_1_Holidays  visit_2_Holidays  visit_3_Holidays
ID
0                  5                 6                 1
1                  1                 2                 1
2                  0                 0                 0
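For reference, the same keep-all-IDs behavior can also be had by reindexing the result of the plain subset approach; a minimal sketch, assuming the full set of IDs should come from df itself:
# subset, aggregate, then add back any ID with no True rows as zeros
out = (df[df['Holidays']]
         .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
         .reindex(df['ID'].unique(), fill_value=0))
Since no NaN is ever introduced here, the integer dtypes are preserved without a convert_dtypes step.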
I have two DataFrames (example below). I would like to delete any row in df1 containing a value from df2['Patnum'] where the corresponding df2['City'] is NaN.
For example: I would want to drop rows 1 and 3 in df1, since they contain 4 and Patnum 4 in df2 has a missing value in df2['City'].
How would I do this?
df1
   Citer  Citee
0      1      2
1      2      4
2      3      5
3      4      7
df2
   Patnum        City
0       1    new york
1       2   amsterdam
2       3  copenhagen
3       4         NaN
4       5      sydney
expected result:
df1
   Citer  Citee
0      1      2
1      3      5
IIUC: stack, isin and dropna.
The idea is to build a True/False boolean mask of the matches, drop the matching values, then unstack the DataFrame and drop any row left with a missing value.
val = df2.loc[df2['City'].isna(), 'Patnum'].values
df3 = df1.stack()[~df1.stack().isin(val)].unstack().dropna(how="any")
   Citer  Citee
0    1.0    2.0
2    3.0    5.0
Details
df1.stack()[~df1.stack().isin(val)]
0  Citer    1
   Citee    2
1  Citer    2
2  Citer    3
   Citee    5
3  Citee    7
dtype: int64
print(df1.stack()[~df1.stack().isin(val)].unstack())
   Citer  Citee
0    1.0    2.0
1    2.0    NaN
2    3.0    5.0
3    NaN    7.0
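If the reshaping is not needed, the same filter can be expressed row-wise with isin and any; a minimal sketch using the same val as above:
# keep only rows where neither Citer nor Citee matches a Patnum with missing City
df3 = df1[~df1.isin(val).any(axis=1)]
This also avoids the float upcasting seen above, since no NaN is ever created.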
I created this DataFrame and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 value with the difference to the next lower price within the same group?
For example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, NaN and I'm looking for 2, 2, 1, NaN (briefly, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data = [
    [1, 'a', 1, 1, 5], [2, 'a', 1, 1, 5], [3, 'a', 1, 1, 4], [4, 'a', 1, 1, 2],
    [5, 'b', 1, 2, 6], [6, 'b', 1, 2, 6], [7, 'b', 1, 2, 3]
]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
I think you can first remove duplicates across all the groupby columns plus price, compute diff on the deduplicated data, and finally merge the result back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname', 'beds', 'baths', 'price', 'difference_price']], how='left')
print (df)
   id neighborhoodname  beds  baths  price  difference_price
0   1                a     1      1      5               1.0
1   2                a     1      1      5               1.0
2   3                a     1      1      4               2.0
3   4                a     1      1      2               NaN
4   5                b     1      2      6               3.0
5   6                b     1      2      6               3.0
6   7                b     1      2      3               NaN
Alternatively, you can use a lambda function to back-fill the 0 values per group, which avoids wrong output for one-row groups (where a value could otherwise be carried over from another group):
df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
   id neighborhoodname  beds  baths  price  difference_price
0   1                a     1      1      5               1.0
1   2                a     1      1      5               1.0
2   3                a     1      1      4               2.0
3   4                a     1      1      2               NaN
4   5                b     1      2      6               3.0
5   6                b     1      2      6               3.0
6   7                b     1      2      3               NaN
I have this pandas DataFrame df:
df.head()
windIntensity  year  month  day  hour  minute  AOBT  delay
            3  2015      1    1     0       0   0.0   15.0
            2  2015      1    1     0       0   0.0   10.0
            2  2015      1    1     1       0   0.0    5.0
            2  2015      1    1     1       0   0.0    0.0
            1  2015      1    1     2       0   0.0    0.0
When I execute this code:
df = df.groupby(["year", "hour"]).agg({'windIntensity': 'mean', 'delay': ['mean', 'count']}).reset_index()
I get this result:
   year  hour  windIntensity      delay
                        mean       mean  count
0  2015     0       4.239207  24.240373    857
1  2015     1       4.029024  15.770449    758
2  2015     2       3.863928   7.431322    779
3  2015     3       3.859801   4.161290    806
4  2015     4       3.782659   4.722230   6851
But how can I rename the columns to get a single header row instead of two?
Expected result:
   year  hour  windIntensity_mean  delay_mean  count
0  2015     0            4.239207   24.240373    857
1  2015     1            4.029024   15.770449    758
2  2015     2            3.863928    7.431322    779
3  2015     3            3.859801    4.161290    806
4  2015     4            3.782659    4.722230   6851
Demo:
source DF with multi-level columns:
In [223]: r
Out[223]:
   year  hour  windIntensity  delay
                        mean   mean  count
0     1     0           2015    6.0      5
solution:
In [224]: r.columns = r.columns.map(lambda c: ('_' if c[1] else '').join(c))
result:
In [225]: r
Out[225]:
   year  hour  windIntensity_mean  delay_mean  delay_count
0     1     0                2015         6.0            5
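On pandas 0.25+, named aggregation builds the flat column names directly, so no renaming step is needed; a minimal sketch of that alternative:
r = df.groupby(['year', 'hour']).agg(
    windIntensity_mean=('windIntensity', 'mean'),  # (column, function) pairs with flat output names
    delay_mean=('delay', 'mean'),
    delay_count=('delay', 'count'),
).reset_index()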
I have a dataframe that looks something like this:
df = pd.DataFrame({'Name':['a','a','a','a','b','b','b'], 'Year':[1999,1999,1999,2000,1999,2000,2000], 'Name_id':[1,1,1,1,2,2,2]})
  Name  Name_id  Year
0    a        1  1999
1    a        1  1999
2    a        1  1999
3    a        1  2000
4    b        2  1999
5    b        2  2000
6    b        2  2000
What I'd like to have is a new column 'yr_name_id' that increases for each unique Name_id-Year combination and then begins anew with each new Name_id.
  Name  Name_id  Year  yr_name_id
0    a        1  1999           1
1    a        1  1999           1
2    a        1  1999           1
3    a        1  2000           2
4    b        2  1999           1
5    b        2  2000           2
6    b        2  2000           2
I've tried a variety of things and looked here, here and at a few posts on groupby and enumerate.
At first I tried creating a unique dictionary after combining Name_id and Year and then using map to assign values, but when I try to combine Name_id and Year as strings via:
df['yr_name_id'] = str(df['Name_id']) + str(df['Year'])
the new column is filled with non-unique text like 0 1\n1 1\n2 1\n3 1\n4 2\n5 2..., which I don't really understand.
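For reference, str(df['Name_id']) stringifies the entire Series, index and all, which is where the index numbers and \n separators come from; an element-wise conversion uses astype(str). A minimal sketch:
# convert each element to a string, then concatenate row by row
df['yr_name_id'] = df['Name_id'].astype(str) + df['Year'].astype(str)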
A more promising approach, which I think just needs help with the lambda, is to use groupby:
df['yr_name_id'] = df.groupby(['Name_id', 'Year'])['Name_id'].transform(lambda x: )  # unsure from this point
I am very unfamiliar with lambdas, so any guidance on how I might do this would be greatly appreciated.
IIUC you can do it this way:
In [99]: df['yr_name_id'] = pd.Categorical(pd.factorize(df['Name_id'].astype(str) + '-' + df['Year'].astype(str))[0] + 1)
In [100]: df
Out[100]:
  Name  Name_id  Year  yr_name_id
0    a        1  1999           1
1    a        1  1999           1
2    a        1  1999           1
3    a        1  2000           2
4    b        2  1999           3
5    b        2  2000           4
6    b        2  2000           4
In [101]: df.dtypes
Out[101]:
Name            object
Name_id          int64
Year             int64
yr_name_id    category
dtype: object
But looking at your desired DF, it looks like you want to categorize just the Year column, not the combination of Name_id + Year. (This coincides with a per-Name_id counter here only because both groups contain the same years.)
In [102]: df['yr_name_id'] = pd.Categorical(pd.factorize(df.Year)[0] + 1)
In [103]: df
Out[103]:
  Name  Name_id  Year  yr_name_id
0    a        1  1999           1
1    a        1  1999           1
2    a        1  1999           1
3    a        1  2000           2
4    b        2  1999           1
5    b        2  2000           2
6    b        2  2000           2
In [104]: df.dtypes
Out[104]:
Name            object
Name_id          int64
Year             int64
yr_name_id    category
dtype: object
Use itertools.count:
from itertools import count
counter = count(1)
df['yr_name_id'] = (df.groupby(['Name_id', 'Year'])['Name_id']
                      .transform(lambda x: next(counter)))
Output:
  Name  Name_id  Year  yr_name_id
0    a        1  1999           1
1    a        1  1999           1
2    a        1  1999           1
3    a        1  2000           2
4    b        2  1999           3
5    b        2  2000           4
6    b        2  2000           4
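Note that the question's expected output restarts the counter within each Name_id, which neither counter above does. A minimal sketch of that behavior, assuming a dense rank of Year within each Name_id group is what is wanted:
# rank each row's Year within its Name_id group; equal years share a rank
df['yr_name_id'] = (df.groupby('Name_id')['Year']
                      .rank(method='dense')
                      .astype(int))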