Create binary column in pandas dataframe based on priority - python

I have a pandas dataframe that looks something like this:
Item Status
123 B
123 BW
123 W
123 NF
456 W
456 BW
789 W
789 NF
000 NF
And I need to create a new column Value which will be either 1 or 0 depending on the values in the Item and Status columns. The assignment of the value 1 is prioritized by this order: B, BW, W, NF. So, using the sample dataframe above, the result should be:
Item Status Value
123 B 1
123 BW 0
123 W 0
123 NF 0
456 W 0
456 BW 1
789 W 1
789 NF 0
000 NF 1
Using Python 3.7.

Taking your original dataframe as the input df, the following code will produce your desired output:
import numpy as np

# dictionary assigning order of priority to status values
priority_map = {'B': 1, 'BW': 2, 'W': 3, 'NF': 4}
# new temporary column that converts Status values to order-of-priority values
df['rank'] = df['Status'].map(priority_map)
# create dictionary with Item as key and lowest rank value per Item as value
lowest_val_dict = df.groupby('Item')['rank'].min().to_dict()
# new column that assigns the same Value to all rows per Item
df['Value'] = df['Item'].map(lowest_val_dict)
# replace Values where rank is different with 0's
df['Value'] = np.where(df['Value'] == df['rank'], 1, 0)
# delete rank column
del df['rank']

I would prefer an approach where the status is an ordered pd.Categorical, because a) that's what it is and b) it's much more readable: if you have that, you just compare if a value is equal to the max of its group:
df['Status'] = pd.Categorical(df['Status'], categories=['NF', 'W', 'BW', 'B'], ordered=True)
df['Value'] = df.groupby('Item')['Status'].apply(lambda x: (x == x.max()).astype(int))
# Item Status Value
#0 123 B 1
#1 123 BW 0
#2 123 W 0
#3 123 NF 0
#4 456 W 0
#5 456 BW 1
#6 789 W 1
#7 789 NF 0
#8 0 NF 1
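A note on the same idea: an equivalent sketch using transform keeps the original row index instead of relying on how apply aligns the group results (this assumes the ordered Categorical conversion above and a pandas version where max is supported for ordered categoricals in groupby):
# compare each Status against its group's maximum via transform
df['Value'] = (df['Status'] == df.groupby('Item')['Status'].transform('max')).astype(int)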

I might be able to help you conceptually, by explaining some steps that I would do:
Create the new column Value and fill it with zeros (e.g. np.zeros() or simply assigning 0)
Group the dataframe by Item with grouped = df.groupby('Item')
Iterate through all the groups found: for name, group in grouped:
Using a simple function with ifs, a custom priority queue, custom sorting criteria, or any other preferred method, determine which entry has the highest priority ("the value 1 is prioritized by this order: B, BW, W, NF") and assign 1 to its Value column: df.loc[entry, 'Value'] = 1
Let's say we are looking at group '123':
Item Status Value
-------------------------
123 B 0 (before 0, after 1)
123 BW 0
123 W 0
123 NF 0
Because the row [123, 'B', 0] had the highest priority based on your criteria, you change it to [123, 'B', 1]
When finished, rebuild the dataframe from the groupby object, and you're done. You have a lot of possibilities for doing that; you might check here: Converting a Pandas GroupBy object to DataFrame
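To make that outline concrete, here is a minimal sketch of the steps above. It assumes the sample dataframe from the question and the stated priority order B, BW, W, NF:
import pandas as pd

df = pd.DataFrame({'Item': ['123', '123', '123', '123', '456', '456', '789', '789', '000'],
                   'Status': ['B', 'BW', 'W', 'NF', 'W', 'BW', 'W', 'NF', 'NF']})

priority = ['B', 'BW', 'W', 'NF']        # highest priority first
df['Value'] = 0                          # step 1: fill the new column with zeros

for name, group in df.groupby('Item'):   # steps 2-3: iterate over the groups
    # step 4: pick the row whose Status comes earliest in the priority list
    best_idx = group['Status'].map(priority.index).idxmin()
    df.loc[best_idx, 'Value'] = 1        # assign 1 to the winning row

print(df)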

Related

New column with unique groupby results in data frame

I have a data frame with duplicate ids.
I want to aggregate the data, but first I need to count the unique sessions per id.
id session
123 X
123 X
123 Y
123 Z
234 T
234 T
This code works well, but not when I want to add this new column 'ncount' to my data frame.
df['ncount'] = df.groupby('id')['session'].nunique().reset_index()
I tried using transform and it didn't work.
df['ncount'] = df.groupby('id')['session'].transform('nunique')
This is the result from the transform code (my data has duplicate ids):
id session ncount
123 X 1
123 X 1
123 Y 1
123 Z 1
234 T 1
234 T 1
This is the result I'm interested in:
id session ncount
123 X 3
123 X 3
123 Y 3
123 Z 3
234 T 1
234 T 1
Use the following steps:
1. Group the data and store the result in a separate variable.
2. Then merge it back to the original data frame.
Code:
import pandas as pd
df = pd.DataFrame({"id":[123,123,123,123,234,234],"session":["X","X","Y","Z","T","T"]})
x = df.groupby(["id"])['session'].nunique().reset_index()
res = pd.merge(df,x,how="left",on="id")
print(res)
You can rename the columns if required.
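For example, since both frames have a session column, merge adds the default _x/_y suffixes, so the rename might look like this (a sketch; adjust if your pandas version names the columns differently):
# 'session_x' is the original session, 'session_y' is the per-id unique count
res = res.rename(columns={"session_x": "session", "session_y": "ncount"})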
Using .count()
Steps:
1. Group the data by "id" and count the session values per id, then
2. Decrease the count by one and merge the two DataFrames.
import pandas as pd
df = pd.DataFrame({"id":[123,123,123,123,234,234],"session":["X","X","Y","Z","T","T"]})
uniq_df = df.groupby(["id"])["session"].count().reset_index()
uniq_df["session"] = uniq_df["session"] - 1
result = pd.merge(df,uniq_df,how="left",on="id")
print(result)

In a DataFrame, how could we get a list of indexes with 0's in specific columns?

We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 1], [0, 0, 1, 0, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
Also, could you please explain why we should not modify data we are iterating over, given that we do that all the time with for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start by copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
.astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where each of these groups has at least one non-zero column, by chaining any and all along axis=1 as shown above.
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the number of members in each family for every record (df2.groupby(level=0).sum()). Now we retain the index values where each family has at least one member (.gt(0).all()). We create a mask from these values and apply it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0

Improving performance of Python for loops with Pandas data frames

Please consider the following DataFrame df:
timestamp id condition
1234 A
2323 B
3843 B
1234 C
8574 A
9483 A
Based on the condition contained in the column condition, I have to define a new column in this data frame which counts how many ids are in that condition.
However, please note that since the DataFrame is ordered by the timestamp column, one could have multiple entries of the same id, so a simple .cumsum() is not a viable option.
I have come up with the following code, which works properly but is extremely slow:
import numpy as np

# I start by defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initializing the new column
df['count'] = 0

# Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])
    df.loc[r, 'count'] = ids_with_condition_a.size
Keeping these Numpy arrays is very useful to me because they give the list of the ids in a particular condition. I would also like to be able to dynamically put these arrays in a corresponding cell of the df DataFrame.
Are you able to come out with a better solution in terms of performance?
You need to use groupby on the column 'condition' and cumcount to count how many ids are in each condition up to the current row (which seems to be what your code does):
df['count'] = df.groupby('condition').cumcount()+1 # +1 is to start at 1 not 0
with your input sample, you get:
id condition count
0 1234 A 1
1 2323 B 1
2 3843 B 2
3 1234 C 1
4 8574 A 2
5 9483 A 3
which is faster than using a for loop.
And if you want just the rows with condition A, for example, you can use a mask: if you do
print(df[df['condition'] == 'A']), you see only the rows where condition equals A. So to get an array:
arr_A = df.loc[df['condition'] == 'A','id'].values
print (arr_A)
array([1234, 8574, 9483])
EDIT: to create two columns per condition, you can do, for example for condition A:
# put 1 in a column where the condition is met (requires import numpy as np)
df['nb_cond_A'] = np.where(df['condition'] == 'A', 1, None)
# then use cumsum to increment the number, ffill to fill the same number down
# where the condition is not met, and fillna(0) for the other missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A', 'id'].values
# create the column with apply (other ways might exist, but this is one)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])
the output looks like this:
id condition nb_cond_A partial_arr_A
0 1234 A 1 [1234]
1 2323 B 1 [1234]
2 3843 B 1 [1234]
3 1234 C 1 [1234]
4 8574 A 2 [1234, 8574]
5 9483 A 3 [1234, 8574, 9483]
Then do the same thing for B and C. A loop like for cond in set(df['condition']) would be practical for generalisation; see the sketch below.
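A hedged sketch of that generalisation (same assumptions as above: numpy imported as np, and df holding the id and condition columns):
for cond in df['condition'].unique():
    col = 'nb_cond_{}'.format(cond)
    # running count of rows with this condition seen so far
    df[col] = np.where(df['condition'] == cond, 1, None)
    df[col] = df[col].cumsum().ffill().fillna(0).astype(int)
    # ids having this condition, in order of appearance
    arr = df.loc[df['condition'] == cond, 'id'].values
    # partial array of those ids up to each row (arr bound via a default argument)
    df['partial_arr_{}'.format(cond)] = df[col].apply(lambda x, a=arr: a[:x])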
EDIT 2: one idea to do what you explained in the comments, though I'm not sure it improves the performance:
# array of unique condition
arr_cond = df.condition.unique()
#use apply to create row-wise the list of ids for each condition
df[arr_cond] = (df.apply(lambda row: (df.loc[:row.name].drop_duplicates('id','last')
.groupby('condition').id.apply(list)) ,axis=1)
.applymap(lambda x: [] if not isinstance(x,list) else x))
Some explanations: for each row, select the dataframe up to this row with loc[:row.name], drop the duplicated 'id' and keep the last one with drop_duplicates('id','last') (in your example, this means that once we reach row 3, row 0 is dropped, as the id 1234 appears twice), then the data is grouped by condition with groupby('condition'), and the ids for each condition are put into a single list with id.apply(list). The part starting with applymap fills the missing values with an empty list (you can't use fillna([]); it's not possible).
For the length for each condition, you can do:
for cond in arr_cond:
    df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)
The result looks like this:
id condition A B C len_A len_B len_C
0 1234 A [1234] [] [] 1 0 0
1 2323 B [1234] [2323] [] 1 1 0
2 3843 B [1234] [2323, 3843] [] 1 2 0
3 1234 C [] [2323, 3843] [1234] 0 2 1
4 8574 A [8574] [2323, 3843] [1234] 1 2 1
5 9483 A [8574, 9483] [2323, 3843] [1234] 2 2 1

How do I specify a column header for pandas groupby result?

I need to group by and then return the values of a column in concatenated form. While I have managed to do this, the returned dataframe has a column named 0. Just 0. Is there a way to specify what the resulting column will be called?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting groupby object has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using .rename(index=str, columns={0: "variant"}), this seems very inelegant.
Any way to provide a header for the column? Thanks
The simplest is to remove as_index=False so that a Series is returned, and add the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
'URL':[5,5,4,4,4,4],
'INDEX':list('aaabbb')})
print (all_columns)
INDEX URL VALUE
0 a 5 a
1 a 5 s
2 a 4 d
3 b 4 ss
4 b 4 t
5 b 4 y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
.apply(' '.join) \
.reset_index(name='variant')
print (all_columns_grouped)
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
You can use agg when applied to a column (VALUE in this case) to assign column names to the result of a function.
# Sample data (thanks #jezrael)
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
'URL':[5,5,4,4,4,4],
'INDEX':list('aaabbb')})
# Solution
>>> all_columns.groupby(['INDEX','URL'], as_index=False)['VALUE'].agg(
{'variant': lambda x: ' '.join(x)})
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
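As a side note, the dict-renaming form of agg on a single column was deprecated in later pandas releases; a sketch of the same result using named aggregation (available from pandas 0.25 onwards) could look like this:
result = (all_columns
          .groupby(['INDEX', 'URL'], as_index=False)
          .agg(variant=('VALUE', ' '.join)))
print(result)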

How to process column names and create new columns

This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How do I proceed with steps 2 and 3, so that the solution is automatic? (I don't want to manually define the column names cm1 and cm2, because in the original data set I might have many cm variations.)
You can use:
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First you can filter the columns containing the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can change columns to new values like cm1, cm2, cm3.
print [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print df1
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum it, where True is counted as 1 and False as 0. You need to count by unique column names, so group by the columns and sum the values.
df1 = df1.astype(bool)
print df1
cm1 cm1 cm2
0 True True False
1 True True True
print df1.groupby(df1.columns, axis=1).sum()
cm1 cm2
0 2 0
1 2 1
You need unique columns, which are added to original df:
print df1.columns.unique()
['cm1' 'cm2']
Last, you can add the new columns df[['cm1','cm2']] from the groupby result:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
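For reference, the same steps condensed into a short sketch written for newer pandas (where .ix is gone and axis=1 grouping is deprecated); the column names are taken from the question:
cm = df.filter(regex='cm')
# reduce each column name to its cmN token, e.g. 'old_dt_cm1_tt' -> 'cm1'
cm.columns = cm.columns.str.extract(r'(cm\d+)', expand=False)
# count non-zero values per cm group, row by row
counts = cm.astype(bool).T.groupby(level=0).sum().T
df[counts.columns] = counts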
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard coding this in you might as well use 2
so that the new column name is cm plus the character directly following it; in this case the mapping would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]
Note that int(a != 0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that, because dicts are inherently unordered, it may be preferable to add the new columns in order according to the values (like the answer here):
import operator

for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]
to ensure the new columns are inserted in order.
