Compute an accumulated DataFrame from a column of Boolean values - python

Given the following pandas DataFrame:
ID  Holidays  visit_1  visit_2  visit_3  other
0   True      1        2        0        red
0   False     3        2        0        red
0   True      4        4        1        blue
1   False     2        0        0        red
1   True      1        2        1        green
2   False     1        0        0        red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the columns of the total visit count
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one that takes into account only the rows whose Holidays value is True. How could I do this?

Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
visit_1 visit_2 visit_3
ID
0 5 6 1
1 1 2 1
Alternative, if you also want to keep the groups without any match:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
visit_1 visit_2 visit_3
ID
0 5.0 6.0 1.0
1 1.0 2.0 1.0
2 0.0 0.0 0.0
Variant that converts the result back to integer dtypes and adds a suffix to the column names:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
.convert_dtypes()
.add_suffix('_Holidays')
)
output:
visit_1_Holidays visit_2_Holidays visit_3_Holidays
ID
0 5 6 1
1 1 2 1
2 0 0 0
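For reference, here is a minimal self-contained sketch of the filtering approach, built from the sample data in the question. The names visit_cols, holiday_visits and with_all_ids are only illustrative, and using reindex with fill_value=0 to bring back IDs that have no holiday rows is an assumption about what is wanted, not part of the answer above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 2],
    "Holidays": [True, False, True, False, True, False],
    "visit_1": [1, 3, 4, 2, 1, 1],
    "visit_2": [2, 2, 4, 0, 2, 0],
    "visit_3": [0, 0, 1, 0, 1, 0],
    "other": ["red", "red", "blue", "red", "green", "red"],
})
visit_cols = ["visit_1", "visit_2", "visit_3"]  # illustrative name

# Keep only the holiday rows, then sum the visits per ID
holiday_visits = df[df["Holidays"]].groupby("ID")[visit_cols].sum()

# Optionally bring back IDs without any holiday row, filled with 0
all_ids = pd.Index(df["ID"].unique(), name="ID")
with_all_ids = holiday_visits.reindex(all_ids, fill_value=0)

print(holiday_visits)
print(with_all_ids)
```

Because the missing IDs are filled with an integer 0, the columns keep an integer dtype and no conversion step is needed afterwards.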

Related

Walking average based on two matching columns

I have a dataframe df of the following format:
team1 team2 score1 score2
0 1 2 1 0
1 3 4 3 0
2 1 3 1 1
3 2 4 0 2
4 1 2 3 2
What I want to do is create a new column that returns the rolling average of the score1 column over the last 3 games, but only across games where the pair of teams in team1 and team2 matches.
Expected output:
team1 team2 score1 score2 new
0 1 2 1 0 1
1 3 4 3 0 3
2 1 3 1 1 1
3 2 4 0 2 0
4 1 2 3 2 2
I was able to calculate the rolling average over all games for each team separately like this:
df['new'] = df.groupby('team1')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
but I cannot find a sensible way to extend that to match two teams.
I tried the code below, which returns... something, but definitely not what I need.
df['new'] = df.groupby(['team1','team2'])['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I suppose this could be done with apply(), but I want to avoid it due to performance issues.
I am not sure what your exact expected output is, but you can first reshape the DataFrame to a long format:
(pd.wide_to_long(df.reset_index(), ['team', 'score'], i='index', j='x')
.groupby('team')['score']
.rolling(3, min_periods=1).mean()
)
Output:
team index x
1 0 1 1.0
2 1 1.0
2 3 1 0.0
0 2 0.0
3 1 1 3.0
2 2 2.0
4 1 2 0.0
3 2 1.0
Name: score, dtype: float64
The workaround I've found was to create a 'temp' column that merges the values in 'team1' and 'team2' and to use that column as the key for the rolling average.
# cast to str first so that integer team ids can be concatenated
df['temp'] = df.team1.astype(str) + '_' + df.team2.astype(str)
df['new'] = df.groupby('temp')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
Can this be done in one line?
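As for the one-line version, here is a minimal sketch using the sample data from the question. It assumes that a "match" means the same (team1, team2) pair in the same order. Under that assumption, grouping by both columns directly does the same job as the temporary key column. The variable alt is only an illustrative name for a variant without a lambda.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "team1": [1, 3, 1, 2, 1],
    "team2": [2, 4, 3, 4, 2],
    "score1": [1, 3, 1, 0, 3],
    "score2": [0, 0, 1, 2, 2],
})

# Group by the (team1, team2) pair and take the rolling mean within each pair
df["new"] = (
    df.groupby(["team1", "team2"])["score1"]
      .transform(lambda s: s.rolling(3, min_periods=1).mean())
)

# Variant without a lambda: groupby().rolling() prepends the group keys
# to the index, so drop them again to get back to the original row labels
alt = (
    df.groupby(["team1", "team2"])["score1"]
      .rolling(3, min_periods=1).mean()
      .droplevel(["team1", "team2"])
      .sort_index()
)

print(df)
```

On this sample both versions reproduce the expected column from the question (1, 3, 1, 0, 2).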

Pandas split column based on category of the elements

I have a pandas.DataFrame with a column that contains integers, strings, times, ...
I want to create columns (containing 0 or 1) that tell whether each value in that column is a string or not, a time or not, and so on, in an efficient way.
A
0 Hello
1 Name
2 123
3 456
4 22/03/2019
And the output should be
A A_string A_number A_date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
Using pandas str methods to check for the string type could help:
df = pd.read_clipboard()
df['A_string'] = df.A.str.isalpha().astype(int)
df['A_number'] = df.A.str.isdigit().astype(int)
# naive assumption: anything that is not purely alphanumeric is a date
df['A_Date'] = (~df.A.str.isalnum()).astype(int)
df.filter(['A','A_string','A_number','A_Date'])
A A_string A_number A_Date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
We can use the native pandas functions pd.to_numeric and pd.to_datetime to test for numbers and dates. Then we can use .loc for assignment and fillna to match your target df.
df.loc[~pd.to_datetime(df['A'],errors='coerce').isna(),'A_Date'] = 1
df.loc[~pd.to_numeric(df['A'],errors='coerce').isna(),'A_Number'] = 1
df.loc[(pd.to_numeric(df['A'],errors='coerce').isna())
& pd.to_datetime(df['A'],errors='coerce').isna()
,'A_String'] = 1
df = df.fillna(0)
print(df)
A A_Date A_Number A_String
0 Hello 0.0 0.0 1.0
1 Name 0.0 0.0 1.0
2 123 0.0 1.0 0.0
3 456 0.0 1.0 0.0
4 22/03/2019 1.0 0.0 0.0
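The same idea can be sketched with boolean masks, which gives integer 0/1 columns directly instead of floats. This is only a sketch under two assumptions: all values are stored as strings, and dates always use the day/month/year layout of the sample (hence the explicit format argument). The names is_number and is_date are illustrative.

```python
import pandas as pd

# Sample data from the question, stored as strings
df = pd.DataFrame({"A": ["Hello", "Name", "123", "456", "22/03/2019"]})

# A value is a number if to_numeric can parse it
is_number = pd.to_numeric(df["A"], errors="coerce").notna()

# A value is a date if it matches the assumed dd/mm/YYYY layout
is_date = pd.to_datetime(df["A"], format="%d/%m/%Y", errors="coerce").notna()

# Everything that is neither a number nor a date counts as a string
df["A_string"] = (~is_number & ~is_date).astype(int)
df["A_number"] = is_number.astype(int)
df["A_date"] = is_date.astype(int)

print(df)
```

Passing an explicit format keeps the date test strict, so plain numbers such as 123 are not at risk of being read as dates.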

How could I replace a null value in a group?

I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the value 0 with the difference to the next lower price in the same group?
For example:
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:3
neighborhood:a, bed:1, bath:1, price:2
I get price differences of 0,2,1,nan and I'm looking for 2,2,1,nan (in short, I don't want to compare 2 flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
# group by the columns that exist in df ('neighborhoodname', not 'city')
df['difference_price'] = (df.dropna()
.sort_values('price',ascending=False)
.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1))
I think you can first remove duplicates across all the columns used for the groupby plus price, compute the diff into a new column on the filtered data, and finally merge it back into the original with a left join:
df1 = (df.dropna()
.sort_values('price',ascending=False)
.drop_duplicates(['neighborhoodname','beds','baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back fills the 0 values per group, which avoids wrong outputs for one-row groups (values moved over from another group):
import numpy as np
# group_keys=False keeps the original row index so the result aligns with df
df['difference_price'] = (df.sort_values('price',ascending=False)
.groupby(['neighborhoodname','beds','baths'], group_keys=False)['price']
.apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
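A self-contained sketch of the back-fill idea, using the sample data from the question. It swaps apply for transform, which is a substitution on my part rather than what the answer uses: transform always returns a result aligned with the original rows. The helper gap_to_next_lower and the list keys are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample data from the question
data = [
    [1, "a", 1, 1, 5], [2, "a", 1, 1, 5], [3, "a", 1, 1, 4], [4, "a", 1, 1, 2],
    [5, "b", 1, 2, 6], [6, "b", 1, 2, 6], [7, "b", 1, 2, 3],
]
df = pd.DataFrame(data, columns=["id", "neighborhoodname", "beds", "baths", "price"])

keys = ["neighborhoodname", "beds", "baths"]


def gap_to_next_lower(prices: pd.Series) -> pd.Series:
    """Difference to the next row; ties (0) take the next non-zero gap."""
    return prices.diff(-1).replace(0, np.nan).bfill()


# Sort by price descending so diff(-1) compares each flat with a cheaper one
df["difference_price"] = (
    df.sort_values("price", ascending=False, kind="stable")
      .groupby(keys)["price"]
      .transform(gap_to_next_lower)
)

print(df)
```

The cheapest flat in each group has nothing below it, so its difference stays NaN.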

How can I delete int in Pandas dataframe column?

I have a dataframe like this. How can I delete all the digits in a column?
For example, the value of column[0]['material'] should be transformed from lm792 to lm.
material item
index
0 lm792 1
1 sotl085-pu01. 1
2 lm792 1
3 sotl085-pu01. 1
4 ym11-3527 1
... ... ...
135526 0 0
135527 0 0
135528 0 0
135529 0 0
135530 0 0
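You could use a simple regex:
\d is a digit (a character in the range 0-9), and + means 1 or more times. So, \d+ is 1 or more digits.
# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['material'] = df['material'].str.replace(r'\d+', '', regex=True)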
print(df)
material item
0 lm 1.0
1 sotl-pu. 1.0
2 lm 1.0
3 sotl-pu. 1.0
4 ym- 1.0
5 NaN
6 NaN
7 NaN
8 NaN
9 0.0
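A minimal runnable sketch of the regex approach on a few of the sample values. The last row holds an integer 0, as in the tail of the question's table. Casting the column with astype(str) first is an assumption about how such non-string entries should be handled: without it, .str.replace returns NaN for them.

```python
import pandas as pd

# A few values from the question; the last row holds an integer, not a string
df = pd.DataFrame({
    "material": ["lm792", "sotl085-pu01.", "ym11-3527", 0],
    "item": [1, 1, 1, 0],
})

# Cast to str so integer entries are handled too, then strip every run of digits
df["material"] = df["material"].astype(str).str.replace(r"\d+", "", regex=True)

print(df)
```

With this cast, a purely numeric entry ends up as an empty string rather than NaN.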

How to use Pandas aggregate functions on this DataFrame?

This is the table:
order_id product_id reordered department_id
2 33120 1 16
2 28985 1 4
2 9327 0 13
2 45918 1 13
3 17668 1 16
3 46667 1 4
3 17461 1 12
3 32665 1 3
4 46842 0 3
I want to group by department_id, summing the number of orders that come from that department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:
department_id number_of_orders number_of_reordered_0
3 2 1
4 2 0
12 1 0
13 2 1
16 2 0
I know this can be done in SQL (I forget what the query for that would look like as well, if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work?
I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.
Use GroupBy.agg with DataFrameGroupBy.size and a lambda function that compares the values with Series.eq and counts the matches by summing the True values (True is processed like 1):
df1 = (df.groupby('department_id')['reordered']
.agg([('number_of_orders','size'), ('number_of_reordered_0',lambda x: x.eq(0).sum())])
.reset_index())
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
If the values are only 1 and 0, it is possible to use sum and then subtract:
df1 = (df.groupby('department_id')['reordered']
.agg([('number_of_orders','size'), ('number_of_reordered_0','sum')])
.reset_index())
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
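The same result can also be sketched with named aggregation, which avoids the lambda by precomputing a boolean helper column. The helper column not_reordered and the variable summary are illustrative names, not part of the original data.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "order_id": [2, 2, 2, 2, 3, 3, 3, 3, 4],
    "product_id": [33120, 28985, 9327, 45918, 17668, 46667, 17461, 32665, 46842],
    "reordered": [1, 1, 0, 1, 1, 1, 1, 1, 0],
    "department_id": [16, 4, 13, 13, 16, 4, 12, 3, 3],
})

summary = (
    df.assign(not_reordered=df["reordered"].eq(0))  # True where reordered == 0
      .groupby("department_id")
      .agg(
          number_of_orders=("reordered", "size"),        # rows per department
          number_of_reordered_0=("not_reordered", "sum"),  # True counts as 1
      )
      .reset_index()
)

print(summary)
```

Named aggregation sets the output column names directly in the agg call, so no renaming step is needed afterwards.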
In SQL it would be a simple aggregation:
select department_id,count(*) as number_of_orders,
sum(case when reordered=0 then 1 else 0 end) as number_of_reordered_0
from table_name
group by department_id
