Selecting columns and rows in a dataframe - python

Here I am trying to count the times a police officer is present at an accident (a value of 1; 2 and 3 mean not present) and whether they are more likely to be present on a weekday or at the weekend. So far I have split my data into days of the week. I now need to select the 1 values and compare them, if anyone knows how to do this. The code I have used and the pandas dataframe are below:
#first we need to modify the date so we can find days of the week
accidents['Date'] = pd.to_datetime(accidents['Date'], format="%d/%m/%Y")
accidents.sort_values(['Date', 'Time'], inplace=True)
#now we can assign days of the week
accidents['day'] = accidents['Date'].dt.strftime('%A')
#now we can count the number of police at each day of the week
accidents.value_counts(['Did_Police_Officer_Attend_Scene_of_Accident','day'])
What I'm looking for in this bottom line is something like accidents.value_counts(['Did_Police_Officer_Attend_Scene_of_Accident','day'] ==1), but I'm unsure how to write it.
Data preview:
Accident_Index Location_Easting_OSGR Location_Northing Did_Police_Officer_Attend_Scene_of_Accident day
2019320634369 521429.0 21973.0 1 Tuesday
2019320634368 521429.0 21970.0 2 Tuesday
2019320634367 521429.0 21972.0 1 Wednesday
2019320634366 521429.0 21972.0 3 Sunday
2019320634366 521429.0 21971.0 1 Sunday
2019320634365 521429.0 21975.0 2 Monday
Update, desired outcome.
Here is the code I had for counting all of the attended accidents. I now wish to do this again, but split into weekdays and weekends.
#when did an officer attend
attended = (accidents.Did_Police_Officer_Attend_Scene_of_Accident == 1).sum()
This bit of code now needs to include the weekday condition (and then another version with the weekend condition) before calling .sum().
My desired output would be similar to this but would also count the weekday and weekend values, preferably returned in two dataframes. This would then allow me to compare the weekday dataframe to the weekend dataframe and return a single value for each, showing which has more officers attending.
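One possible way to get there, as a minimal sketch rather than a definitive answer: filter the rows where the attendance column equals 1, then split them with a boolean mask on the day name. The small dataframe below only reproduces the two relevant columns of the data preview; the names `attended`, `is_weekend`, `weekday` and `weekend` are illustrative.

```python
import pandas as pd

# Only the two relevant columns from the data preview above
accidents = pd.DataFrame({
    "Did_Police_Officer_Attend_Scene_of_Accident": [1, 2, 1, 3, 1, 2],
    "day": ["Tuesday", "Tuesday", "Wednesday", "Sunday", "Sunday", "Monday"],
})

col = "Did_Police_Officer_Attend_Scene_of_Accident"

# Keep only accidents where an officer attended (value 1)
attended = accidents[accidents[col] == 1]

# Split the attended accidents into weekend and weekday dataframes
is_weekend = attended["day"].isin(["Saturday", "Sunday"])
weekend = attended[is_weekend]
weekday = attended[~is_weekend]

# One number per group, plus a per-day breakdown if needed
n_weekday = len(weekday)
n_weekend = len(weekend)
per_day = attended["day"].value_counts()

print(n_weekday, n_weekend)
```

Because there are five weekdays and only two weekend days, comparing the raw counts favours weekdays; dividing each count by the total number of accidents in that group would give a fairer comparison.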

Related

ValueError: cannot reindex on an axis with duplicate labels (Pandas reindex dataframe)

I'm trying to create a dataframe using pandas that counts the number of engaged, repeaters, and inactive customers for a company based on a JSON file with the transaction data.
For context, the columns of the new dataframe would be each month from Jan to June, while the rows are:
Repeater (customers who purchased in the current and previous month)
Inactive (customers in all transactions over the months including the current month who have purchased in previous months but not the current month)
Engaged (customers in all transactions over the months including the current month who have purchased in every month)
Hence, I've written code that first fetches the month of each transaction based on the provided transaction date for each record in the JSON. Then, it creates another month column ("month_no") which contains the month number of the month which the transaction was made. Next, a function is defined with the metrics to apply to each group and it is applied to a dataframe grouped by name.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_json('data/data.json')
df = (df.astype({'transaction_date': 'datetime64'}).assign(month=lambda x: x['transaction_date'].dt.month_name()))
months = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6}
df['month_no'] = df['month'].map(months)
df = df.set_flags(allows_duplicate_labels=False)
def grpProc(grp):
    wrk = pd.DataFrame({'engaged': grp.drop_duplicates().sort_values('month_no').set_index('month_no').reindex(months).name.notna()})
    wrk['inactive'] = ~wrk.engaged
    wrk['repeaters'] = wrk.engaged & wrk.engaged.shift()
    return wrk
act = df.groupby('name').apply(grpProc)
result = act.groupby(level=1).sum().astype(int).T
result.columns = months.keys()
However: this code produces these errors:
FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
wrk = pd.DataFrame({'engaged': grp.drop_duplicates().sort_values('month_no').set_index('month_no').reindex(months.values()).name.notna()})
...
ValueError: cannot reindex on an axis with duplicate labels
It highlights the line:
act = df.groupby('name').apply(grpProc)
For your reference, here are the important columns of the dataframe and some dummy data:
Name  Purchase Month
Mark  March
John  January
Luke  March
John  March
Mark  January
Mark  February
Luke  February
John  January
The goal is to create a pivot table based on the above table by counting the repeaters, inactive, and engaged members:
Status     January  February  March
Repeaters  0        1         2
Inactive   1        1         0
Engaged    2        1         1
How do you do this and fix the error? If you have another completely different solution to this that works, please share also.
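The ValueError most likely arises because a customer can have more than one transaction in the same month (John has two in January in the dummy data). Calling drop_duplicates() on whole rows does not remove them when other columns differ, so set_index('month_no') produces duplicate labels and reindex fails. A minimal sketch of a different approach that avoids reindex altogether is given below. It builds a customer-by-month boolean table and derives the three counts from it. The definitions are assumptions inferred from the expected table above: "inactive" means no purchase in the current month, and "engaged" means a purchase in every month up to and including the current one.

```python
import numpy as np
import pandas as pd

# Dummy data from the question
df = pd.DataFrame({
    "name": ["Mark", "John", "Luke", "John", "Mark", "Mark", "Luke", "John"],
    "month": ["March", "January", "March", "March",
              "January", "February", "February", "January"],
})
months = ["January", "February", "March"]

# One row per customer, one column per month: True if they purchased that month.
# crosstab counts transactions, so repeated purchases in a month are harmless.
bought = (pd.crosstab(df["name"], df["month"]) > 0).reindex(columns=months, fill_value=False)
b = bought.to_numpy()

# Purchases shifted one month to the right (no previous month for the first column)
prev = np.zeros_like(b)
prev[:, 1:] = b[:, :-1]

result = pd.DataFrame(
    {
        "Repeaters": (b & prev).sum(axis=0),                        # bought this month and last
        "Inactive": (~b).sum(axis=0),                               # did not buy this month
        "Engaged": np.logical_and.accumulate(b, axis=1).sum(axis=0),  # bought every month so far
    },
    index=months,
).T

print(result)
```

On the dummy data this reproduces the expected pivot table; with real data the definitions should be checked against the business rules before relying on the counts.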

Summing on multiple conditions

I am trying to count the total number of visitors to all restaurants in 2017 (the total number of people who visited any restaurant, not individual restaurants). I only want to count a restaurant's numbers if its store_id appears in the relation_table, but I can't get my code to work. I get a syntax error on "no_visitors".
UPDATE: My problem was with a previous line
total_visits = reservations.loc[reservations["store_id"].isin(relation_table["store_id"]) & (reservations.year==2017), "no_visitors"].sum()
Example dataframe
RESERVATIONS                       RELATION_TABLE
store_id    year  no_visitors      store_id
mcdonalds   2017  4                mcdonalds
kfc         2016  5                kfc
burgerking  2017  2
One way to filter your data (df) is to do df[filter_condition] which returns the rows for which the given condition is true. Now all you need is to take the sum of the column you are interested in (no_visitors).
# df = reservations
df[(df.store_id != "") & (df.year == 2017)].no_visitors.sum()
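Note that the filter df.store_id != "" only excludes empty ids; it does not check membership in relation_table. A minimal sketch of the membership check on the example data, using isin as in the question's own line:

```python
import pandas as pd

# Example data from the question
reservations = pd.DataFrame({
    "store_id": ["mcdonalds", "kfc", "burgerking"],
    "year": [2017, 2016, 2017],
    "no_visitors": [4, 5, 2],
})
relation_table = pd.DataFrame({"store_id": ["mcdonalds", "kfc"]})

# Both conditions must hold: the store is listed in relation_table and the year is 2017.
# Each condition is parenthesised because & binds tighter than ==.
mask = reservations["store_id"].isin(relation_table["store_id"]) & (reservations["year"] == 2017)

total_visits = reservations.loc[mask, "no_visitors"].sum()
print(total_visits)  # 4: only the mcdonalds row satisfies both conditions
```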

How to calculate average weekly spend with groupby, with week being Monday to Sunday?

I have a customer dataframe with purchase amounts and date. In this case I have two customers, A and B:
df1 = pd.DataFrame(index=pd.date_range('2015-04-24', periods = 50)).assign(purchase=[x for x in range(51,101)])
df2 = pd.DataFrame(index=pd.date_range('2015-04-28', periods = 50)).assign(purchase=[x for x in range(0,50)])
df3 = pd.concat([df1,df2], keys=['A','B'])
df3 = df3.rename_axis(['user','date']).reset_index()
print(df3.head())
user date purchase
0 A 2015-04-24 51
1 A 2015-04-25 52
2 A 2015-04-26 53
3 A 2015-04-27 54
4 A 2015-04-28 55
I would just like to know the user's mean weekly spend, with a week being from Monday to Sunday. Expected outcome:
user average_weekly_spend
0 A 51
1 B 60
However I can't figure out how to set it as Monday to Sunday. For now I am using resample with 7D. This means all customers would have a different definition of a week, I think. I believe it takes the 7 days from the first purchase and so on. So every customer will have a different starting date.
df3.groupby('user').apply(lambda x: x.resample('7D', on='date').mean()).groupby('user')['purchase'].mean()
user
A 78.125
B 27.125
Is it possible to define my own week as Monday to Sunday, for all customers?
It seems you need the W-MON frequency:
df = (df3.groupby('user')
         .resample('W-MON', on='date')['purchase']
         .mean()
         # .mean(level=0) was removed in pandas 2.0; group on the index level instead
         .groupby(level=0)
         .mean()
         .reset_index())
print(df)
user purchase
0 A 75.5
1 B 28.7
I am not sure that a mean of means is a good solution here. Alternatively, you can get the counts and sums with resample and then compute the mean by definition, the sum divided by the count:
df = (df3.groupby('user')
         .resample('W-MON', on='date')['purchase']
         .agg(['size','sum'])
         # .sum(level=0) was removed in pandas 2.0; group on the index level instead
         .groupby(level=0)
         .sum())
df['mean'] = df.pop('sum') / df.pop('size')
print(df)
mean
user
A 75.5
B 24.5
Another solution with to_period, interestingly, gives a different answer:
df3.groupby(['user', df3.date.dt.to_period('W-MON')])[['purchase']].mean().groupby(level='user').mean()
Output:
purchase
user
A 75.500
B 27.125
ISO week numbers already run from Monday to Sunday.
If you just use the ISO week number, this is easy. Series.dt.week was deprecated and later removed from pandas; Series.dt.isocalendar().week is its replacement.
df3['week_number'] = df3['date'].dt.isocalendar().week
df3.head(20)
You can check in the df3 above, week 18 starts on 2015-04-27, which is a Monday.
df4 = df3.groupby(['user','week_number'])[['purchase']].mean()
# Final mean
df4.groupby(['user']).mean()
I think this is the correct average weekly spend. This isn't, however, the same as what you shared in your post as Expected Outcome.
Output:
user purchase
A 74.625
B 26.250
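One detail worth checking in the resample and to_period answers above: in pandas, an anchored weekly alias names the day on which the week ends. 'W-MON' bins therefore run Tuesday to Monday, while 'W-SUN' (the same as plain 'W') bins run Monday to Sunday, which is the week the question asks for. A minimal sketch of the mean of weekly means with 'W-SUN', reusing the question's df3; the names `weekly_mean` and `avg` are illustrative.

```python
import pandas as pd

# Rebuild df3 from the question
df1 = pd.DataFrame(index=pd.date_range("2015-04-24", periods=50)).assign(purchase=list(range(51, 101)))
df2 = pd.DataFrame(index=pd.date_range("2015-04-28", periods=50)).assign(purchase=list(range(0, 50)))
df3 = pd.concat([df1, df2], keys=["A", "B"]).rename_axis(["user", "date"]).reset_index()

# Weeks ending on Sunday, i.e. Monday-to-Sunday bins shared by all users
weekly_mean = df3.groupby(["user", pd.Grouper(key="date", freq="W-SUN")])["purchase"].mean()

# Every bin label is the Sunday that closes the week
labels = weekly_mean.index.get_level_values("date")
print(labels.dayofweek.unique())  # Monday=0 ... Sunday=6

# Mean of the weekly means, per user
avg = weekly_mean.groupby(level="user").mean()
print(avg)
```

The first and last weeks of each user are partial, so a mean of weekly means weights those days more heavily than a plain per-user mean would.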

Pandas extend index date using group by

I have a series of transactions similar to this table:
ID Customer Date Amount
1 A 6/12/2018 33,223.00
2 A 9/20/2018 635.00
3 B 8/3/2018 8,643.00
4 B 8/30/2018 1,231.00
5 C 5/29/2018 7,522.00
However I need to get the mean amount of the last six months (as of today)
I was using
df.groupby('Customer').resample('W')['Amount'].sum()
And get something like this:
CustomerCode PayDate
A 2018-05-21 268
2018-05-28 0.00
2018-06-11 0.00
2018-06-18 472,657
2018-06-25 0.00
However with this solution I only get the range of dates where the customers had amount. I need to extend the weeks for each customer so I can get the whole range of the six months (in weeks). In this example I would need to get for customer A from the week of '2018-04-05' (which is exactly six months ago from today) till the week of today (filled with 0 of course since there was no amount)
Here is the solution I found to my question. First I create the dates I want (the last six months, at weekly frequency):
import datetime
# pd.datetime was removed from pandas; use the datetime module directly
dates = pd.date_range(datetime.date.today() - datetime.timedelta(6*365/12),
                      datetime.date.today(),
                      freq='W')
Then I create a multi-index using the product of the customers with the dates.
multi_index = pd.MultiIndex.from_product([pd.Index(df['Customer'].unique()),
                                          dates],
                                         names=('Customer', 'Date'))
Then I reindex the df using the newly created multi-index and, lastly, fill the missing values with zeroes.
# reindex and fillna return new objects, so the result has to be assigned
# (assumes df is already indexed by Customer and weekly Date, e.g. the resample result)
df = df.reindex(multi_index).fillna(0)
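The steps above can be put together as an end-to-end sketch on the question's sample table. It makes a few assumptions: a fixed reference date of 2018-10-05 stands in for "today" (the question says six months ago was 2018-04-05) so that the example is reproducible, the Amount column has already been parsed to floats, and the weekly sums are computed with pd.Grouper rather than resample.

```python
import pandas as pd

# Sample transactions from the question (amounts parsed to floats)
df = pd.DataFrame({
    "Customer": ["A", "A", "B", "B", "C"],
    "Date": pd.to_datetime(["2018-06-12", "2018-09-20", "2018-08-03",
                            "2018-08-30", "2018-05-29"]),
    "Amount": [33223.0, 635.0, 8643.0, 1231.0, 7522.0],
})

# Assumption: fixed "today" so the output does not depend on when this runs
end = pd.Timestamp("2018-10-05")
start = end - pd.DateOffset(months=6)
weeks = pd.date_range(start, end, freq="W")  # week-ending Sundays in the window

# Weekly sums per customer (same Sunday-ending weeks as `weeks`)
weekly = df.groupby(["Customer", pd.Grouper(key="Date", freq="W")])["Amount"].sum()

# Every customer crossed with every week of the window
full_index = pd.MultiIndex.from_product(
    [df["Customer"].unique(), weeks], names=["Customer", "Date"]
)

# Weeks without transactions become 0 instead of being absent
weekly_full = weekly.reindex(full_index, fill_value=0)

# Mean weekly amount over the whole six-month window, per customer
mean_weekly = weekly_full.groupby(level="Customer").mean()
print(mean_weekly)
```

Passing fill_value=0 to reindex does the job of the separate fillna(0) step in one call.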
Resample is super flexible. To get a 6-month sum instead of the weekly sum you currently have all you need is:
df.groupby('Customer').resample('6M')['Amount'].sum()
That groups by month end; month start would be '6MS'.
More documentation on available frequencies can be found here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Annual mean for Pandas dataset but not starting in January

In the dataframe below (a small snippet is shown; the actual dataframe spans 2000 to 2014), I want to compute the annual average, but starting in September of one year and running only until May of the next year.
Cnt Year JD Min_Temp
S 2000 1 277.139
S 2000 2 274.725
S 2001 1 270.945
S 2001 2 271.505
N 2000 1 257.709
N 2000 2 254.533
N 2000 3 258.472
N 2001 1 255.763
I can compute annual average (Jan - Dec) using this code:
df['Min_Temp'].groupby(df['YEAR']).mean()
How do I adapt this code to mean from Sept of first year to May of next year?
--EDIT: Based on comments below, you can assume that a MONTH column is also available, specifying the month for each row
Not sure which column refers to month or if it is missing, but in the past I've used a quick and dirty method to assign custom seasons (interested if anyone has found more elegant route).
I've used Yahoo Finance data to demonstrate approach, unless one of your columns is Month?
EDIT Requires dataframe to be sorted by date ascending
import pandas as pd
# pandas.io.data was moved out of pandas into the separate pandas-datareader package
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2010, 9, 1)
end = datetime.datetime(2015, 5, 31)
df = web.DataReader("F", 'yahoo', start, end)
#Ensure date sorted --required
df = df.sort_index()
#identify custom season and leave months june-august as null
df['season'] = float('nan')
count = 0
season = 1
for i, row in df.iterrows():
    if i.month in [9,10,11,12,1,2,3,4,5]:
        # the first in-season row after a june-august gap starts a new season
        if count == 1:
            season += 1
        # DataFrame.set_value was removed from pandas; .at is the replacement
        df.at[i, 'season'] = season
        count = 0
    else:
        # june-august rows keep the NaN the column was initialised with
        count = 1
#new data frame excluding months june-august
df_data = df[~df['season'].isnull()]
df_data['Adj Close'].groupby(df_data.season).mean()
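If the Year and MONTH columns mentioned in the question's edit are available, the same idea can be sketched without a loop: label each row with the year in which its September-to-May season started, drop June to August, and group on that label. The rows below are made-up illustration values, not data from the question, and `season_start` is an illustrative name.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the question
df = pd.DataFrame({
    "Year":     [2000, 2000, 2001, 2001, 2001, 2001, 2002],
    "MONTH":    [9, 12, 1, 5, 7, 10, 3],
    "Min_Temp": [270.0, 260.0, 250.0, 280.0, 290.0, 265.0, 255.0],
})

# June-August belong to no season
in_season = ~df["MONTH"].isin([6, 7, 8])

# September-December belong to the season starting that year;
# January-May belong to the season that started the previous year
season_start = df["Year"].where(df["MONTH"] >= 9, df["Year"] - 1).rename("season_start")

seasonal_mean = df.loc[in_season, "Min_Temp"].groupby(season_start[in_season]).mean()
print(seasonal_mean)
```

Unlike the loop, this does not require the dataframe to be sorted by date, because the season label depends only on each row's own year and month.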
