Summing with multiple conditions - python

I am trying to count the total number of visitors to all restaurants in 2017 (the total number of people to visit any restaurant, not individual restaurants). I only want to count a restaurant's numbers if its store_id appears in the relation_table, but I can't get my code to work. I get a syntax error on "no_visitors".
UPDATE: My problem was with a previous line
total_visits = reservations.loc[reservations["store_id"].isin(relation_table["store_id"]) & (reservations.year==2017), "no_visitors"].sum()
Example dataframe
RESERVATIONS                          RELATION_TABLE
store_id     year  no_visitors        store_id
mcdonalds    2017  4                  mcdonalds
kfc          2016  5                  kfc
burgerking   2017  2

One way to filter your data (df) is to do df[filter_condition] which returns the rows for which the given condition is true. Now all you need is to take the sum of the column you are interested in (no_visitors).
# df = reservations
df[(df.store_id != "") & (df.year == 2017)].no_visitors.sum()
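For the question's actual goal (only counting restaurants whose store_id appears in relation_table), a minimal self-contained sketch built from the example data above, with column names assumed from the question:
import pandas as pd

reservations = pd.DataFrame({
    "store_id": ["mcdonalds", "kfc", "burgerking"],
    "year": [2017, 2016, 2017],
    "no_visitors": [4, 5, 2],
})
relation_table = pd.DataFrame({"store_id": ["mcdonalds", "kfc"]})

# keep rows whose store_id is in relation_table and whose year is 2017, then sum
total_visits = reservations.loc[
    reservations["store_id"].isin(relation_table["store_id"])
    & (reservations["year"] == 2017),
    "no_visitors",
].sum()
print(total_visits)  # 4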

Related

Selecting columns and rows in a dataframe

Here I am trying to count the times a police officer is present (a value of 1; values of 2 and 3 mean not present) at an accident, and whether there is more chance they are present on a weekday or at the weekend. So far I have sorted my data into days of the week; I now need to select the 1 values and compare them, if anyone knows how to do this. The code I have used and the pandas dataframe are below;
#first we need to modify the date so we can find days of the week
accidents['Date'] = pd.to_datetime(accidents['Date'], format="%d/%m/%Y")
accidents.sort_values(['Date', 'Time'], inplace=True)
#now we can assign days of the week
accidents['day'] = accidents['Date'].dt.strftime('%A')
#now we can count the number of police at each day of the week
accidents.value_counts(['Did_Police_Officer_Attend_Scene_of_Accident','day'])
What I'm looking for in this bottom line is something like accidents.value_counts(['Did_Police_Officer_Attend_Scene_of_Accident','day'] == 1), but I'm unsure how to write it.
data preview;
Accident_Index Location_Easting_OSGR Location_Northing Did_Police_Officer_Attend_Scene_of_Accident day
2019320634369 521429.0 21973.0 1 Tuesday
2019320634368 521429.0 21970.0 2 Tuesday
2019320634367 521429.0 21972.0 1 Wednesday
2019320634366 521429.0 21972.0 3 Sunday
2019320634366 521429.0 21971.0 1 Sunday
2019320634365 521429.0 21975.0 2 Monday
Update, desired outcome.
So here is the code I had for all of the attended accidents. I now wish to do this again, but split into weekdays and weekends.
#when did an officer attend
attended = (accidents.Did_Police_Officer_Attend_Scene_of_Accident == 1).sum()
This bit of code now needs to include the weekday (then another with the weekend) before calling .sum().
My desired output would be similar to this but would also count the weekday and weekend values, preferably returned in 2 dataframes. This would then allow me to compare the weekday dataframe to the weekend dataframe and return a single value for each, showing which has more officers attending.
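A minimal sketch of one way to get those two counts, assuming the accidents dataframe and the 'day' column created above:
# boolean masks: officer attended (value 1) and weekend vs weekday
attended = accidents['Did_Police_Officer_Attend_Scene_of_Accident'] == 1
is_weekend = accidents['day'].isin(['Saturday', 'Sunday'])

weekday_attended = (attended & ~is_weekend).sum()
weekend_attended = (attended & is_weekend).sum()

# or, keeping the split as two separate dataframes as described above
weekday_df = accidents[attended & ~is_weekend]
weekend_df = accidents[attended & is_weekend]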

Matching Strings and Count Frequency

I have a list of companies with their subsidiaries, the data looks as below:
CompanyName Employees
Microsoft China 1
Microsoft India 1
Microsoft Europe 1
Apple Inc 1
Apple Data Inc 1
Apple Customer Service Inc 1
Data Corp 1
Data SHCH 1
Data India 1
City Corp 1
Data City 1
If two companies have same words (e.g. Apple Inc and Apple Data Inc), they are considered one company. I will group those companies together, and calculate their total number of employees.
The expected return should be:
Company Employees
Microsoft 3
Apple 3
Data 3
City 2
The Company column should return the common word.
The Employees column should return the sum for the company and its subsidiaries.
Most of the pandas functions don't really work in this case. Any suggestions on a for loop?
As you requested in the comments, here is an approach that works if the company is always the first word in CompanyName:
# extract company as word at index 0
df.CompanyName = df.CompanyName.str.split(expand=True)[0]
# groupby company name and count
dfg = df.groupby('CompanyName').agg({'CompanyName': 'count'})
# display(dfg)
CompanyName
CompanyName
Apple 3
City 1
Data 4
Microsoft 3
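If the goal is the employee totals from the expected output rather than a row count, a sketch along the same first-word idea (assuming the Employees column is numeric, as in the sample) could be:
# group by the first word of CompanyName and sum Employees
dfg = (
    df.assign(Company=df.CompanyName.str.split().str[0])
      .groupby('Company')['Employees']
      .sum()
)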
I don't think there's a 'very' simple way to do what you want, but it's not too complex either.
First, you need to clearly define the criterion used to decide which names are the same 'company'.
We can try with "take the first word and see if it matches"; obviously it's not a perfect approach, but it'll do for now.
Then, you can create an object to store your new data. I would recommend a dictionary, with entries like company: (total employees).
You'll now iterate over the rows of the dataframe, with apply and a function to do what you want. It'll look like this:
counts = {}

def aggregator(row):
    # the first word of the company name is used as the grouping key
    word1 = row.CompanyName.split(" ")[0]
    if word1 in counts:
        counts[word1] += row.Employees
    else:
        counts[word1] = row.Employees

df.apply(aggregator, axis=1)
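As a short usage note (with pandas already imported as pd, as in the surrounding examples), the dictionary filled by aggregator can be turned back into a dataframe for display:
result = pd.DataFrame(list(counts.items()), columns=['Company', 'Employees'])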

Is there a way to count and calculate mean for text columns using groupby?

I have been using pandas.groupby to pivot data and create descriptive charts and tables for my data. While doing groupby for three variables, I keep running into a DataError: No numeric types to aggregate error while working with the cancelled column.
To describe my data, Year and Month contain yearly and monthly data for multiple columns (multiple years, all months), Type contains the type of order item (Clothes, Appliances, etc.), and cancelled contains yes or no string values indicating whether an order was cancelled or not.
I am hoping to plot a graph and show a table to show what the cancellation rate (and success rate) is by order item. The following is what I'm using so far
df.groupby(['Year', 'Month', 'Type'])['cancelled'].mean()
But this doesn't seem to be working.
Sample
Year Month Type cancelled
2012 1 electronics yes
2012 10 fiber yes
2012 9 clothes no
2013 4 vegetables yes
2013 5 appliances no
2016 3 fiber no
2017 1 clothes yes
Use:
df = pd.DataFrame({
    'Year':[2020] * 6,
    'Month':[7,8,7,8,7,8],
    'cancelled':['yes','no'] * 3,
    'Type':list('aaaaba')
})
print (df)
Get counts per Year, Month, Type columns:
df1 = df.groupby(['Year', 'Month', 'Type','cancelled']).size().unstack(fill_value=0)
print (df1)
cancelled no yes
Year Month Type
2020 7 a 0 2
b 0 1
8 a 3 0
And then divide by sum of values for ratio:
df2 = df1.div(df1.sum()).mul(100)
print (df2)
cancelled no yes
Year Month Type
2020 7 a 0.0 66.666667
b 0.0 33.333333
8 a 100.0 0.000000
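Note that df1.sum() sums each column over all groups, so df2 shows how the 'no' and 'yes' totals are distributed across the groups. If instead you want the cancellation rate within each Year/Month/Type group (each row summing to 100), divide by the row sums, for example:
df2 = df1.div(df1.sum(axis=1), axis=0).mul(100)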
It's possible I have misunderstood what you want your output to look like, but to find the cancellation rate for each item type, you could do something like this:
# change 'cancelled' to numeric values
df.loc[df['cancelled'] == 'yes', 'cancelled'] = 1
df.loc[df['cancelled'] == 'no', 'cancelled'] = 0
# get the mean of 'cancelled' for each item type
res = {}
for t in df['Type'].unique():
    res[t] = df.loc[df['Type'] == t, 'cancelled'].mean()
# if desired, put it into a dataframe
results = pd.DataFrame([res], index=['Rate']).T
Output:
Rate
electronics 1.0
fiber 0.5
clothes 0.5
vegetables 1.0
appliances 0.0
Note: If you want to specify specific years or months, you can do that with loc as well, but given that your example data did not have any repeats within a given year or month, this would return your original dataframe for your given example.
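To tie this back to the original groupby attempt: once 'cancelled' is numeric, the one-liner from the question should work as intended. A sketch, assuming the sample columns above:
# map the strings to numbers, then take the mean per group
df['cancelled'] = df['cancelled'].map({'yes': 1, 'no': 0})
rate = df.groupby(['Year', 'Month', 'Type'])['cancelled'].mean()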

Pandas Dataframe show value in column which appears more than ten times

Currently I am analyzing a .csv file which includes names, birthyear and gender of dogs in a given city. I want to filter out birthyears where less than 10 dogs were born.
What would be the right method to do that?
name birth_year gender
0 "Bobby" Lord Sinclair 2009 m
1 "Buddy" Fortheringhay's J. 2011 m
2 "Zappalla II" Kora v. Tüfibach 2011 w
3 (Karl) Kaiser Karl vom Edersee 2013 m
4 A-Diana 2006 w
The data looks somewhat like that, the list is a lot longer. What I want to do is to filter out birth_year values which occur less than 11 times.
I started with using
df[df["birth_year"] < 11]
but this obviously filters on the birth year value itself (11 and lower) and not on how often it occurs
Greetings
If you send the data... or you can use a filter like
df['some'] = df[df['dog'] < 10]
or pandas query:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
It's unclear whether you want to keep the rows in groups with 10 or less, or throw away the rows in groups with 10 or less. Change the > to <= appropriately.
g = df.groupby("birth_year")
g.filter(lambda x: x["name"].count() > 10)
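An alternative sketch that avoids groupby.filter, using transform to attach the per-year row count to every row (column names assumed from the preview above):
counts = df.groupby('birth_year')['birth_year'].transform('count')
df[counts > 10]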

subset by counting the number of times 0 occurs in a column after groupby in python

I have some typical stock data. I want to create a column called "Volume_Count" that will count the number of 0 volume days per quarter. My ultimate goal is to remove all stocks that have 0 volume for more than 5 days in a quarter. By creating this column, I can write a simple statement to subset Vol_Count > 5.
A typical Dataset:
Stock Date Qtr Volume
XYZ 1/1/19 2019 Q1 0
XYZ 1/2/19 2019 Q1 598
XYZ 1/3/19 2019 Q1 0
XYZ 1/4/19 2019 Q1 0
XYZ 1/5/19 2019 Q1 0
XYZ 1/6/19 2019 Q1 2195
XYZ 1/7/19 2019 Q1 0
... ... and so on (for multiple stocks and quarters)
This is what I've tried - a 1 liner -
df = df.groupby(['stock','Qtr'], as_index=False).filter(lambda x: len(x.Volume == 0) > 5)
However, as stated previously, this produced inconsistent results.
I want to remove the stock from the dataset only for the quarter where the volume == 0 for 5 or more days.
Note: I have multiple Stocks and Qtr in my dataset, therefore it's essential to groupby Qtr, Stock.
Desired Output:
I want to keep the dataset but remove any stocks for a qtr if they have a volume = 0 for > 5 days.. that might entail a stock not being in the dataset for 2019 Q1 (because Vol == 0 >5 days) but being in the df in 2019 Q2 (Vol == 0 < 5 days)...
Try this:
df[df['Volume'].eq(0).groupby([df['Stock'],df['Qtr']]).transform('sum') < 5]
Details:
First, take the Volume column of your dataframe and check whether it is zero for each record.
Next, group that column by the 'Stock' and 'Qtr' columns, get a sum of the True values from step 1, and assign that sum to each record using groupby and transform.
Finally, create a boolean series from that sum which is True where the sum is less than 5, and use that series to boolean-index your original dataframe.
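For comparison with the approach the question attempted, a groupby/filter sketch that keeps only the (Stock, Qtr) groups with fewer than 5 zero-volume days, mirroring the < 5 cutoff used above:
df.groupby(['Stock', 'Qtr']).filter(lambda x: (x['Volume'] == 0).sum() < 5)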
