How to groupby and get cumcount using a criterion? - python

I have a dataframe like as shown below
import pandas as pd

df = pd.DataFrame(
    {'supplier_id': [1, 1, 1, 1],
     'prod_id': [123, 456, 789, 342],
     'country': ['UK', 'UK', 'UK', 'US'],
     'transaction_date': ['13/11/2020', '10/1/2018', '11/11/2017', '27/03/2016'],
     'industry': ['STA', 'STA', 'PSA', 'STA'],
     'segment': ['testa', 'testb', 'testa', 'testc'],
     'label': [1, 1, 1, 0]})
My objective is to answer the questions below.
a) As of the current row, how many times has the same supplier previously succeeded and failed in the same country (using the supplier_id and country columns)? Here label = 1 means success and label = 0 means failure.
Similarly, I would like to compute the success and failure counts based on industry, country and segment as well.
Note that the first transaction will always start at 0, because the supplier has no previous transactions at that point.
As we are looking at the chronological order of business done, we first need to sort the dataframe by transaction_date.
So, I tried the below
df.sort_values(by=['supplier_id','transaction_date'],inplace=True)
df['prev_biz_country_success_count'] = df.groupby(['supplier_id', 'country']).cumcount()
df['prev_biz_country_failure_count'] = df.groupby(['supplier_id', 'country']).cumcount()
but as you can see, I am not sure how to include the label column value. That is, we need to count separately for label = 1 and label = 0.
I expect my output to be as shown below.

We can group the dataframe by the supplier_id and country columns, then apply shift + cumsum to the label column to count how many previous rows in each group match the criterion:
g = df.groupby(['supplier_id', 'country'])
for criteria, label in dict(success=1, failure=0).items():
    df[f'prev_biz_country_{criteria}_count'] = (
        g['label'].apply(lambda s: s.eq(label).shift(fill_value=0).cumsum())
    )
supplier_id prod_id country transaction_date industry segment label prev_biz_country_success_count prev_biz_country_failure_count
1 1 456 UK 10/1/2018 STA testb 1 0 0
2 1 789 UK 11/11/2017 PSA testa 1 1 0
0 1 123 UK 13/11/2020 STA testa 1 2 0
3 1 342 US 27/03/2016 STA testc 0 0 0
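The question also asks for the same counts keyed on industry and segment. The pattern above can be repeated per grouping-key set; the sketch below only illustrates that idea (the key_sets dict and the column-name pattern are assumptions on my part, not part of the original answer):
# sketch: reuse the shift + cumsum trick for several grouping-key sets
key_sets = {
    'country': ['supplier_id', 'country'],
    'industry': ['supplier_id', 'industry'],
    'segment': ['supplier_id', 'segment'],
}
for name, keys in key_sets.items():
    g = df.groupby(keys)
    for criteria, label in dict(success=1, failure=0).items():
        df[f'prev_biz_{name}_{criteria}_count'] = (
            g['label'].apply(lambda s, v=label: s.eq(v).shift(fill_value=0).cumsum())
        )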

Related

How to find shared characteristics in pandas?

I have a dataset
I want to get to know our customers by looking at the typical shared characteristics (e.g. "Married customers in their 40s like wine"). This would correspond to the itemset {Married, 40s, Wine}.
How can I create a new dataframe called customer_data_onehot such that rows correspond to customers (as in the original data set) and columns correspond to the categories of each of the ten categorical attributes in the data? The new dataframe should only contain boolean values (True/False or 0/1) such that the value in row i and column j is True (or 1) if and only if the attribute value corresponding to column j holds for the customer corresponding to row i. Display the dataframe.
I have this hint: "Hint: For example, for the attribute "Education" there are 5 possible categories: 'Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'. Therefore, the new dataframe must contain one column for each of those attribute values." but I don't understand how I can achieve this.
Can someone guide me here to achieve the correct solution?
I have this code, which imports the csv file and selects 90% of the data from the original dataset.
import pandas as pd

pre_process = pd.read_csv('customer_data.csv')
pre_process = pre_process.sample(frac=0.9, random_state=413808)
pre_process.to_csv('customer_data_2.csv', index=False)
Use get_dummies:
Set up an MRE:
data = {'Customer': ['A', 'B', 'C'],
'Marital_Status': ['Together', 'Married', 'Single'],
'Age_Group': ['40s', '60s', '20s']}
df = pd.DataFrame(data)
print(df)
# Output
Customer Marital_Status Age_Group
0 A Together 40s
1 B Married 60s
2 C Single 20s
out = pd.get_dummies(df.set_index('Customer')).reset_index()
print(out)
# Output
Customer Marital_Status_Married Marital_Status_Single Marital_Status_Together Age_Group_20s Age_Group_40s Age_Group_60s
0 A 0 0 1 0 1 0
1 B 1 0 0 0 0 1
2 C 0 1 0 1 0 0
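Applied to the assignment in the question, a possible sketch is below; the file name customer_data_2.csv and the idea of taking the object-typed columns as the ten categorical attributes are assumptions on my part, not from the original post:
import pandas as pd

customer_data = pd.read_csv('customer_data_2.csv')  # assumed: file produced by the sampling step above
# assumption: the ten categorical attributes are the object-typed (string) columns
categorical_cols = customer_data.select_dtypes(include='object').columns
# dtype=bool makes the dummy columns True/False instead of 0/1
customer_data_onehot = pd.get_dummies(customer_data[categorical_cols], dtype=bool)
print(customer_data_onehot.head())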

Pandas dataframe getting number of categorical values that are in a list

I am new to machine learning on datasets in Python and am trying to perform the following on the dataframe below (only a snippet is shown):
id   country  device     label
100  sg       samsung    0
100  ch       galaxy s   0
200  ab       pocophone  1
200  ee       iphone 1   1
200  my       iphone 2   1
I am trying to:
1) get a list of all the countries where the label = 1
2) for each id, count how many of its countries are in the list from 1), i.e. get the total count of such countries for each id.
Update:
I have managed to get a list of countries where label = 1. For each id, how do I find the number of its countries that fall into the list mentioned before?
You can use
df.loc[df['label'] == 1]['country']
This will find which indices have df['label'] as 1, locate them, and take the 'country' Series from them.
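To turn that selection into the list asked for in step 1), one possible follow-up (a sketch building on the line above, not part of the original answer):
# unique countries where label == 1, as a plain Python list
countries_with_label_1 = df.loc[df['label'] == 1]['country'].unique().tolist()
print(countries_with_label_1)  # ['ab', 'ee', 'my'] for the sample data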
Try via loc accessor and boolean masking:
count = df.loc[df['label'].eq(1), ['id', 'country']].value_counts()
# count the (id, country) pairs where 'label' is 1
lst = count.index.get_level_values(1).unique().tolist()
# get the country names from the index of count
output of lst:
['ab', 'ee', 'my']
output of count:
id country
200 ab 1
ee 1
my 1
dtype: int64
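To also get the per-id totals asked for in step 2), one way to build on lst (again a sketch, not from the original answer):
# for each id, count how many of its distinct countries appear in lst
per_id = df[df['country'].isin(lst)].groupby('id')['country'].nunique()
print(per_id)  # for the sample data: 200 -> 3; id 100 has no country in the list, so it is absent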
If I understand correctly:
unique countries with label = 1
>>> df.query('label == 1')['country'].unique()
array(['ab', 'ee', 'my'], dtype=object)
count of unique countries per id when label = 1
>>> df.query('label == 1').groupby('id')['country'].nunique()
id
200 3
Name: country, dtype: int64
Updated version:
countries = df.query('label == 1')['country'].unique()
df.query('country in @countries').groupby('id')['country'].nunique()

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas dataframe and cannot work out how to count the total of each cluster of 1's per row, grouped by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above, in the format
ID
<heading of the last column where a '1' occurs>: <count of '1's in that cluster>
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row, with a count for each cluster, but I am probably not using the correct Python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to use DataFrame.set_index with DataFrame.stack to reshape. Then create consecutive groups by comparing the values against their shifted values (Series.shift) for inequality and taking the cumulative sum (Series.cumsum) into a new column g. Then filter the rows equal to 1 and aggregate with named aggregation via GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)]
         .groupby(['ID', 'g'])
         .agg(a=('level_1', 'last'), b=('level_1', 'size'))
         .reset_index(level=1, drop=True)
         .reset_index())
print(df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Last, to write to a txt file, use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt", "w") as f:
    for i, g in df1.groupby('ID'):
        f.write(f"{i}\n")
        for a, b in g[['a', 'b']].to_numpy():
            f.write(f"\t{a}: {b}\n")
You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new series equal to the sum of the row.
# create a new column equal to the sum of the values in each row
df['sum'] = df.sum(axis=1)  # axis=1 sums across the columns, one total per row
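Note that df.sum(axis=1) also folds the ID column into each row total; if that is not wanted, a small variation (assuming the identifier column is named 'ID', as in the question) could be:
# sum only the season columns, leaving ID out of the row total
df['sum'] = df.drop(columns='ID').sum(axis=1)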
The best method for getting the sum of each column depends on how you want to use that information, but in general the core is just to use the sum method on the series and assign the result to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add them to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results so you know which total belongs to which column. The two inputs are 1) the dataframe 2) a list of any column names you would like to ignore
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe as a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
Sources: Sum by row, Pandas Sum, Add pairs to dictionary

In a DataFrame, how could we get a list of indexes with 0's in specific columns?

We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,1],[0,0,1,0,0]],columns =
['BL.DB',
'BL.KB',
'MI.RO',
'MI.RA',
'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0]],columns = ['BL.DB',
'BL.KB',
'MI.RO',
'MI.RA',
'MI.XZ'])
Also, could you please explain why we should not modify data we are iterating over, when we do that all the time with for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start from copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
.astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where every one of these groups has at least one non-zero column, by chaining .any and .all along axis=1 as shown above.
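As a side note, groupby(..., axis=1) is deprecated in recent pandas releases; an equivalent sketch that transposes first (same df and the same column-splitting idea as above, just rearranged) could be:
# group the transposed frame by family, then check every family has at least one non-zero member per row
fam = df.columns.str.split('.').str[0]
m = df.T.groupby(fam).any().T.all(axis=1)
df[m]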
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the number of members present in each family for every record (df2.groupby(level=0).sum()). Now we retain the index values where every family has at least one member (.gt(0).all()). We create a mask from these values and apply it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0

How to replace specific rows (based on conditions) using values with similar features condition in pandas?

I'm having trouble replacing specific values that satisfy one condition with values chosen based on another condition.
Example of dataframe (df)
   Gender  Surname  Ticket
0  masc    Family1  a12
1  fem     NoGroup  aa3
2  boy     Family1  125
3  fem     Family2  aa3
4  fem     Family4  525
5  masc    NoGroup  a52
The condition to substitute the values in all rows of the df['Surname'] column is:
if ((df['Gender'] != 'masc') & (df['Surname'] == 'NoGroup'))
The code must look for a row with the same Ticket and substitute the corresponding Surname value; otherwise keep the value that already exists ('NoGroup').
In this example, the Surname value in row 1 ('NoGroup') should be replaced by 'Family2', which corresponds to row 3.
I tried this way, but it did not work
for i in zip((df['Gender']!='man') & df['Surname']=='noGroup'):
    df['Surname'][i] = df.loc[df['Ticket']==df['Surname'][i]]
With Pandas you should aim for vectorised calculations rather than row-wise loops. Here's one approach. First convert selected values to None:
df.loc[df['Gender'].ne('masc') & df['Surname'].eq('NoGroup'), 'Surname'] = None
Then create a series mapping from Ticket to Surname after a filter:
s = df[df['Surname'].notnull()].drop_duplicates('Ticket').set_index('Ticket')['Surname']
Finally, map null values with the calculated series:
df['Surname'] = df['Surname'].fillna(df['Ticket'].map(s))
Result:
Gender Surname Ticket
0 masc Family1 a12
1 fem Family2 aa3
2 boy Family1 125
3 fem Family2 aa3
4 fem Family4 525
5 masc NoGroup a52
