Conditional grouping pandas DataFrame - python

I have a DataFrame with the columns below:
import pandas as pd

df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Batch': [0, 0, 0, 0, 0, 0, 0, 0, 0]})
In each batch, a name gets arbitrarily many tries to reach the greatest length.
What I want to do is create a column win that has the value 1 for the greatest length in a batch and 0 otherwise, with the following conditions:
If one name holds the greatest length in a batch across multiple tries, only the first try gets the value 1 in win (see "Abe" in the example above).
If two separate names hold an equal greatest length, then both get the value 1 in win.
What I have managed to do so far is:
df.groupby(['Batch', 'Name'])['Lenght'].apply(lambda x: (x == x.max()).map({True: 1, False: 0}))
But it doesn't support all the conditions; any insight would be highly appreciated.
Expected output:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Batch': [0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'win': [0, 1, 0, 1, 0, 0, 0, 0, 0]})
Many thanks!

Use GroupBy.transform to get the maximum Lenght per Batch, compare it with the Lenght column via Series.eq, and cast the boolean result to integers with Series.astype to map True -> 1 and False -> 0:
# first row changed to match the second row, to demonstrate the tie handling
df = pd.DataFrame({'Name': ['Karl', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['12.5', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Batch': [0, 0, 0, 0, 0, 0, 0, 0, 0]})
df['Lenght'] = df['Lenght'].astype(float)

# rows holding the maximum Lenght of their Batch
m1 = df.groupby('Batch')['Lenght'].transform('max').eq(df['Lenght'])
df1 = df[m1]
# names whose maximum occurs in only one Try
m2 = df1.groupby('Name')['Try'].transform('nunique').eq(1)
# first occurrence of each Name within the Batch among the maxima
m3 = ~df1.duplicated(['Name', 'Batch'])

df['new'] = ((m2 | m3) & m1).astype(int)
print(df)
    Name  Lenght  Try  Batch  new
0   Karl    12.5    0      0    1
1   Karl    12.5    0      0    1
2  Billy    11.0    0      0    0
3    Abe    12.5    1      0    1
4   Karl    12.0    1      0    0
5  Billy    11.0    1      0    0
6    Abe    12.5    2      0    0
7   Karl    10.0    2      0    0
8  Billy     5.0    2      0    0
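For reference, the same three masks applied to the asker's original data reproduce the expected win column (a quick sanity check, not part of the original answer; the explicit reindex just keeps the boolean masks aligned):
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Batch': [0, 0, 0, 0, 0, 0, 0, 0, 0]})
df['Lenght'] = df['Lenght'].astype(float)

m1 = df.groupby('Batch')['Lenght'].transform('max').eq(df['Lenght'])
df1 = df[m1]
m2 = df1.groupby('Name')['Try'].transform('nunique').eq(1)
m3 = ~df1.duplicated(['Name', 'Batch'])

# align the subset masks back to the full index, keeping them boolean
win = (m2 | m3).reindex(df.index, fill_value=False) & m1
df['win'] = win.astype(int)
print(df['win'].tolist())   # [0, 1, 0, 1, 0, 0, 0, 0, 0]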

Related

python pandas substring based on columns values

Given the following df:
import pandas as pd

data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
        'Start': ['6', '1', '5', '1'],
        'Length': ['5', '5', '6', '6']}
df = pd.DataFrame(data)
print(df)
I would like to take a substring of "Description" based on the start and length specified in the other columns; here is the expected output:
data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
        'Start': ['6', '1', '5', '1'],
        'Length': ['5', '5', '6', '6'],
        'Res': ['lemon', 'lemon', 'orange', 'orange']}
df = pd.DataFrame(data)
print(df)
Is there a way to make this dynamic, or another compact way? A fixed slice obviously doesn't use the per-row values:
df['Res'] = df['Description'].str[1:2]
You need to loop; a list comprehension will be the most efficient (Python ≥ 3.8 due to the walrus operator, thanks @I'mahdi):
df['Res'] = [s[(start := int(a) - 1):start + int(b)]
             for (s, a, b) in zip(df['Description'], df['Start'], df['Length'])]
Or use pandas for the conversion (thanks @DaniMesejo):
df['Res'] = [s[a:a + b] for (s, a, b) in
             zip(df['Description'],
                 df['Start'].astype(int) - 1,
                 df['Length'].astype(int))]
output:
  Description Start Length     Res
0  with lemon     6      5   lemon
1       lemon     1      5   lemon
2  and orange     5      6  orange
3      orange     1      6  orange
Handling non-integers / NAs (the output below assumes two extra rows with invalid or missing Start/Length were appended to the input):
df['Res'] = [s[a:a + b] if pd.notna(a) and pd.notna(b) else 'NA'
             for (s, a, b) in
             zip(df['Description'],
                 pd.to_numeric(df['Start'], errors='coerce').convert_dtypes() - 1,
                 pd.to_numeric(df['Length'], errors='coerce').convert_dtypes())]
output:
    Description Start Length     Res
0    with lemon     6      5   lemon
1         lemon     1      5   lemon
2    and orange     5      6  orange
3        orange     1      6  orange
4  pinapple xxx    NA     NA      NA
5      orangiie    NA     NA      NA
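For comparison, the same slicing can be written as a plain row-wise apply, which is usually slower than the list comprehension above (a sketch, not part of the original answer, assuming the four-row df from the question):
df['Res'] = df.apply(
    lambda r: r['Description'][int(r['Start']) - 1:int(r['Start']) - 1 + int(r['Length'])],
    axis=1)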
Given that the fruit name of interest always seems to be the final word in the description column, you might be able to use a regex extract approach here.
data["Res"] = data["Description"].str.extract(r'(\w+)$')
You can use .map to apply a function to each value of the Series: split(' ') separates the words on spaces and [-1] takes the last word of the list.
df['Res'] = df['Description'].map(lambda x: x.split(' ')[-1])

join column names in a new pandas columns conditional on value

I have the following dataset:
import pandas as pd

data = {'Environment': ['0', '0', '0'],
        'Health': ['1', '0', '1'],
        'Labor': ['1', '1', '1'],
        }
df = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor'])
I want to create a new column df['Keyword'] whose value is a join of the column names with value > 0.
Expected Outcome:
data = {'Environment': ['0', '0', '0'],
        'Health': ['1', '0', '1'],
        'Labor': ['1', '1', '1'],
        'Keyword': ['Health, Labor', 'Labor', 'Health, Labor']}
df_test = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor', 'Keyword'])
df_test
How do I go about it?
Another version with .apply():
df['Keyword'] = df.apply(lambda x: ', '.join(b for a, b in zip(x, x.index) if a == '1'), axis=1)
print(df)
Prints:
  Environment Health Labor        Keyword
0           0      1     1  Health, Labor
1           0      0     1          Labor
2           0      1     1  Health, Labor
Another method: mask the zeros, stack, then groupby to aggregate the items. stack drops NaN values by default.
# assumes numeric 0/1 values; with the string data above, convert first (see the next answer)
df['keyword'] = (df.mask(df.lt(1))
                   .stack()
                   .reset_index(1)
                   .groupby(level=0)['level_1']
                   .agg(list))
print(df)
  Environment Health Labor          keyword
0           0      1     1  [Health, Labor]
1           0      0     1          [Labor]
2           0      1     1  [Health, Labor]
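The stack version returns lists; to match the asker's comma-separated strings you could aggregate with ', '.join instead (a small variant, not in the original answer, shown after converting the string data to integers):
num = df[['Environment', 'Health', 'Labor']].astype(int)
df['Keyword'] = (num.mask(num.lt(1))
                    .stack()
                    .reset_index(1)
                    .groupby(level=0)['level_1']
                    .agg(', '.join))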
The first problem is that the sample data values are strings, so to compare them as numbers greater than 0, first convert:
df = df.astype(float).astype(int)
Or:
df = df.replace({'0':0, '1':1})
Then use DataFrame.dot for matrix multiplication of the boolean mask with the column names plus a separator, and finally strip the trailing separator from the right side:
df['Keyword'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print(df)
  Environment Health Labor        Keyword
0           0      1     1  Health, Labor
1           0      0     1          Labor
2           0      1     1  Health, Labor
Or compare the strings directly, e.g. not equal to '0' or equal to '1':
df['Keyword'] = df.ne('0').dot(df.columns + ', ').str.rstrip(', ')
df['Keyword'] = df.eq('1').dot(df.columns + ', ').str.rstrip(', ')
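Conceptually, the .dot step multiplies each boolean by the matching column name (False yields an empty string) and concatenates the products; a tiny plain-Python sketch for the first row:
row = [False, True, True]                          # df.iloc[0].eq('1')
names = ['Environment, ', 'Health, ', 'Labor, ']   # df.columns + ', '
print(''.join(flag * name for flag, name in zip(row, names)))   # 'Health, Labor, ' before rstrip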

Python List of Dictionaries Denormalization

I have a list of lists of dictionaries such as the following:
[[{'ID': '1', 'Value': '100'},
  {'ID': '2', 'Value': '200'}],
 [{'ID': '2', 'Value': '300'},
  {'ID': '2', 'Value': '300'}],
 ...]
I want to convert it into a denormalized dataframe that has a new column for each key, such as:
#    ID  Value  ID  Value
# 0   1    100   2    100
# 1   2    300   2    300
If one item has 3 ID/Value pairs, those should be null for the other items. Running pd.DataFrame on the whole list creates only one ID and one Value column and stacks the values underneath. How can we get these as separate columns?
You can do it with the concat function:
data = [pd.DataFrame(i) for i in input_data]
out = pd.concat(data, axis=1)
print(out)
Prints:
  ID Value ID Value
0  1   100  2   300
1  2   200  2   300
The key is the axis=1 which concatenates along the column axis.
Edit:
Just saw the requirement about zeros for all "shorter" columns. This code produces NaN there instead of zero, but that can be resolved quickly with the fillna() method:
out = out.fillna(value=0)
Example:
import pandas as pd
input_data = [[{'ID': '1', 'Value': '100'},
               {'ID': '2', 'Value': '200'}],
              [{'ID': '2', 'Value': '300'},
               {'ID': '2', 'Value': '300'}],
              [{'ID': '2', 'Value': '300'},
               {'ID': '2', 'Value': '300'},
               {'ID': '3', 'Value': '300'}]]

data = [pd.DataFrame(i) for i in input_data]
out = pd.concat(data, axis=1)
out = out.fillna(value=0)
print(out)
prints:
  ID Value ID Value ID Value
0  1   100  2   300  2   300
1  2   200  2   300  2   300
2  0     0  0     0  3   300
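If the repeated ID/Value headers get awkward downstream, passing keys to concat labels the columns per inner list with a MultiIndex (an optional tweak, not part of the original answer):
out = pd.concat(data, axis=1, keys=list(range(len(data)))).fillna(0)
print(out.columns.tolist())
# [(0, 'ID'), (0, 'Value'), (1, 'ID'), (1, 'Value'), (2, 'ID'), (2, 'Value')]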

Group by with sum conditions [duplicate]

This question already has answers here:
Python Pandas Conditional Sum with Groupby
(3 answers)
Closed 4 years ago.
I have the following df and I'd like to group it by Date & Ref, but with sum conditions.
In this respect I'd need to group by Date & Ref and sum the 'Q' column only where P >= PP.
import pandas as pd

df = pd.DataFrame({'Date': ['1', '1', '1', '1'],
                   'Ref': ['one', 'one', 'two', 'two'],
                   'P': ['50', '65', '30', '38'],
                   'PP': ['63', '63', '32', '32'],
                   'Q': ['10', '15', '20', '10']})
df.groupby(['Date', 'Ref'])['Q'].sum()    # does the right grouping but sums the whole column
df.loc[df['P'] >= df['PP'], 'Q'].sum()    # has the right sum condition, but does not split by Date & Ref
Is there a way to do that?
Many thanks in advance
Just filter prior to grouping:
In [15]: df[df['P'] >= df['PP']].groupby(['Date', 'Ref'])['Q'].sum()
Out[15]:
Date  Ref
1     one    15
      two    10
Name: Q, dtype: object
This reduces the size of the df up front, so it will speed up the groupby operation.
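Note that every column in the sample df holds strings, so both the >= comparison and the sum here operate on text; converting to numbers first is usually safer (a suggested tweak, not part of the original answers):
df[['P', 'PP', 'Q']] = df[['P', 'PP', 'Q']].astype(int)
df[df['P'] >= df['PP']].groupby(['Date', 'Ref'])['Q'].sum()
# Date  Ref
# 1     one    15
#       two    10
# Name: Q, dtype: int64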
You could do:
import pandas as pd

df = pd.DataFrame({'Date': ['1', '1', '1', '1'],
                   'Ref': ['one', 'one', 'two', 'two'],
                   'P': ['50', '65', '30', '38'],
                   'PP': ['63', '63', '32', '32'],
                   'Q': ['10', '15', '20', '10']})
def conditional_sum(x):
    return x[x['P'] >= x['PP']].Q.sum()

result = df.groupby(['Date', 'Ref']).apply(conditional_sum)
print(result)
Output
Date  Ref
1     one    15
      two    10
dtype: object
UPDATE
If you want to sum multiple columns in the output, you could use loc:
def conditional_sum(x):
    return x.loc[x['P'] >= x['PP'], ['Q', 'P']].sum()

result = df.groupby(['Date', 'Ref']).apply(conditional_sum)
print(result)
Output
             Q     P
Date Ref
1    one  15.0  65.0
     two  10.0  38.0
Note that in the example above I used column P for the sake of showing how to do it with multiple columns.
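Another common pattern, not in the original answers: zero out Q where the condition fails with Series.where, then do a plain groupby sum; this also keeps groups that have no qualifying rows (it assumes numeric columns, as in the conversion note above):
df['Q_cond'] = df['Q'].where(df['P'] >= df['PP'], 0)
print(df.groupby(['Date', 'Ref'])['Q_cond'].sum())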

Python pandas: Nested List of Dictionary into Dataframe

Here is the data structure below... It is a list whose inner lists contain two dictionaries each.
I want it as a dataframe with these headings: hasPossession, score and spread.
[[{'hasPossession': '0', 'score': '23', 'spread': '-0'},
  {'hasPossession': '0', 'score': '34', 'spread': '0.0'}],
 [{'hasPossession': '0', 'score': '', 'spread': '-7.5'},
  {'hasPossession': '0', 'score': '', 'spread': '7.5'}],
 [{'hasPossession': '0', 'score': '', 'spread': '-1'},
  {'hasPossession': '0', 'score': '', 'spread': '1.0'}]]
In general, the structure above is a list that contains 3 lists, and each inner list contains 2 dictionaries with 3 keys each.
How do I transform this into a pandas dataframe?
Flatten the list and use the default constructor:
pd.DataFrame([k for item in initial_list for k in item])
  hasPossession score spread
0             0    23     -0
1             0    34    0.0
2             0          -7.5
3             0           7.5
4             0            -1
5             0           1.0
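An equivalent flattening with itertools.chain.from_iterable, assuming the nested list is bound to the name initial_list as in the answer above (a sketch, not part of the original answer):
from itertools import chain
import pandas as pd

df = pd.DataFrame(list(chain.from_iterable(initial_list)))
print(df)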
