Finding KPIs using pandas in Python

I have a dataset where I have tried to use the pandas groupby function to group selected columns. I would like to get the count of items in a particular column as part of the same dataframe, but I can't seem to find a way. I am new to Python and pandas. Thanks for the help.
Example:
country customer_no Treatment_Group Open
Atlantis 1352202109 Group A 1
Atlantis 1354540751 Group B 1
Atlantis 1354849289 Group A 1
Oceania 1356553036 Group A 1
Oceania 1356553036 Group A 1
Oceania 1356553036 Group A 1
Oceania 1356883118 Group B 0
Oceania 1356883118 Group B 0
Expected output (Rate = count of distinct customers who opened / total distinct customers, as a percentage):
Group Country  Rate
A     Atlantis (2/2)*100
A     Oceania  (1/1)*100
B     Atlantis (1/1)*100
B     Oceania  (0/1)*100
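A minimal sketch of one way to compute this rate, assuming the data sits in a DataFrame named df with the columns shown above:
# distinct customers who opened, per treatment group and country
opened = df[df['Open'] == 1].groupby(['Treatment_Group', 'country'])['customer_no'].nunique()
# total distinct customers, per treatment group and country
total = df.groupby(['Treatment_Group', 'country'])['customer_no'].nunique()
# rate as a percentage; (group, country) pairs with no opens become 0 instead of NaN
rate = (opened.reindex(total.index, fill_value=0) / total * 100).reset_index(name='Rate')
print(rate)
For the sample data above this gives 100.0 for Group A / Atlantis, Group A / Oceania and Group B / Atlantis, and 0.0 for Group B / Oceania.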

Related

Get index and column name for a particular value in Pandas Dataframe

I have the following Pandas DataFrame:
A B
0 Exporter Invoice No. & Date
1 ABC PVT LTD. ABC/1234/2022-23 DATED 20/08/2022
2 1234/B, XYZ,
3 ABCD, DELHI, INDIA Proforma Invoice No. Date.
4 AB/CDE/FGH/2022-23/1234 20.08.2022
5 Consignee Buyer (If other than consignee)
6 ABC Co.
8 P.O BOX NO. 54321
9 Berlin, Germany
Now I want to search for a value in this DataFrame, and store the index and column name in 2 different variables.
For example:
If I search "Consignee", I should get
index = 5
column = 'A'
Assuming you really want the index/column of the match, you can use a mask and stack:
df.where(df.eq('Consignee')).stack()
output:
5 A Consignee
dtype: object
As a list:
df.where(df.eq('Consignee')).stack().index.tolist()
output: [(5, 'A')]
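From there, unpacking the first match into the two variables the question asks for (a small sketch that assumes there is exactly one match):
index, column = df.where(df.eq('Consignee')).stack().index.tolist()[0]
# index -> 5, column -> 'A'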

Groupby 2 Columns, ['year_range', 'director']. Where director is not in a year range, show 0

I have a dataframe df with different columns, including ['year_range', 'popularity', 'director']. I want to do a groupby() to see the mean of the 'popularity' column for each director per 'year_range'. Some directors don't fall in some year ranges, so nothing is returned for those combinations. I want to return 0 for the year ranges where the director was not active.
df.groupby(['year_range', 'director']).popularity.mean()
If a director is not active in a year range (e.g., James Cameron in the 2010s), it should return 0.
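A sketch of one way to do this, assuming df has the columns named above: compute the grouped mean, unstack to a year_range x director grid so the missing combinations appear, fill them with 0, and stack back.
means = df.groupby(['year_range', 'director']).popularity.mean()
# missing (year_range, director) pairs become 0 instead of being dropped
result = means.unstack(fill_value=0).stack()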

How to groupby and count a binomial variable in Python, and make a plot of it?

I have a dataframe like this:
country   question1   question2
france    yes         no
italy     yes         yes
france    yes         no
germany   no          yes
italy     no          yes
I would like to get an output like a pivot table or a grouping with a count of yes/no for each question and each country (similar to COUNTIFS in Excel).
I tried several methods, such as df.groupby('country').value_counts() or df.groupby('country').sum("Yes"),
but I cannot get the wanted result.
I would also like to make a chart of the result, but only for the YES answers.
Can someone give me some advice?
Thanks
How to groupby and count binomial variables?
We can encode the values in the columns question1 and question2 using get_dummies, then sum the encoded values per unique country to get the count of Yes and No answers for each question per country:
counts = pd.get_dummies(df.set_index('country')).groupby(level=0).sum()
question1_no question1_yes question2_no question2_yes
country
france 0 2 2 0
italy 1 1 0 2
germany 1 0 0 1
How to make a plot of this?
Filter the question columns whose names end with the _yes suffix, then call the plot method of the pandas DataFrame with kind='bar' to create a bar chart showing, for each country, the count of Yes answers per question.
counts.filter(like='_yes').plot(kind='bar')
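To actually render the chart in a plain script, matplotlib also has to be imported and shown (a small usage sketch; the axis label is just an assumption about what you might want):
import matplotlib.pyplot as plt
ax = counts.filter(like='_yes').plot(kind='bar')
ax.set_ylabel('count of Yes answers')
plt.show()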

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do the equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    # count matches for the same guest on the previous day; newby is 1 when there are none
    df.loc[i, "newby"] = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition there is static. I would like to avoid introducing "dummy columns", such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates, since there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the unique visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could do:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1

Pandas: sum of values in one dataframe based on the group in a different dataframe

I have a dataframe that contains companies with their sectors:
Symbol Sector
0 MCM Industrials
1 AFT Health Care
2 ABV Health Care
3 AMN Health Care
4 ACN Information Technology
I have another dataframe that contains companies with their positions
Symbol Position
0 ABC 1864817
1 AAP -3298989
2 ABV -1556626
3 AXC 2436387
4 ABT 878535
What I want is to get a dataframe that contains the aggregate positions for sectors, i.e. the sum of the positions of all the companies in a given sector. I can do this individually by
df2[df2.Symbol.isin(df1.groupby('Sector').get_group('Industrials')['Symbol'].to_list())]
I am looking for a more efficient pandas approach rather than looping over each sector in the groupby. The final dataframe should look like the following:
Sector Sum Position
0 Industrials 14567232
1 Health Care -329173249
2 Information Technology -65742234
3 Energy 6574352342
4 Pharma 6342387658
Any help is appreciated.
If I understood the question correctly, one way to do it is to join both data frames on their index (this assumes the rows of the two frames line up), then group by sector and sum the position column, like so:
df_agg = df1.join(df2['Position']).drop('Symbol', axis=1)
df_agg.groupby('Sector').sum()
where df1 is the df with Sectors and df2 is the df with Positions.
You can map the Symbol column to sector and use that Series to group.
df2.groupby(df2.Symbol.map(df1.set_index('Symbol').Sector)).Position.sum()
Let us just do a merge:
df2.merge(df1,how='left').groupby('Sector').Position.sum()
