I have the following dataset:
import pandas as pd

data = {'Environment': ['0', '0', '0'],
        'Health': ['1', '0', '1'],
        'Labor': ['1', '1', '1'],
        }
df = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor'])
I want to create a new column df['Keyword'] whose value is a join of the column names with value > 0.
Expected Outcome:
data = {'Environment': ['0', '0', '0'],
'Health': ['1', '0', '1'],
'Labor': ['1', '1', '1'],
'Keyword': ['Health, Labor', 'Labor', 'Health, Labor']}
df_test = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor', 'Keyword'])
df_test
How do I go about it?
Another version, with .apply():
df['Keyword'] = df.apply(lambda x: ', '.join(b for a, b in zip(x, x.index) if a == '1'), axis=1)
print(df)
Prints:
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
Another method: mask and stack, then groupby to aggregate the items per row. stack drops NaN values by default.
df['keyword'] = (df.mask(df.lt(1))
                   .stack()
                   .reset_index(1)
                   .groupby(level=0)['level_1']
                   .agg(list))
print(df)
Environment Health Labor keyword
0 0 1 1 [Health, Labor]
1 0 0 1 [Labor]
2 0 1 1 [Health, Labor]
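Because the sample data holds strings, df.lt(1) would compare str to int and raise; a minimal runnable sketch of the mask/stack approach converts to int first and also joins the lists into strings (the explicit .dropna() is defensive, since newer pandas versions of stack keep NaN rows):

```python
import pandas as pd

# same sample data as the question; values are strings, so convert to int
# before the numeric comparison in df.lt(1)
data = {'Environment': ['0', '0', '0'],
        'Health': ['1', '0', '1'],
        'Labor': ['1', '1', '1']}
df = pd.DataFrame(data).astype(int)

# mask non-positive cells to NaN, stack, drop the NaNs, then collect the
# surviving column labels per row
lists = (df.mask(df.lt(1))
           .stack()
           .dropna()
           .reset_index(1)
           .groupby(level=0)['level_1']
           .agg(list))
df['keyword'] = lists.str.join(', ')
print(df)
```

This yields 'Health, Labor', 'Labor', 'Health, Labor' in the keyword column.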
The first problem: the sample data values are strings, so to compare with "greater than", convert them to numbers first:
df = df.astype(float).astype(int)
Or:
df = df.replace({'0':0, '1':1})
Then use DataFrame.dot for matrix multiplication with the column names plus a separator, and finally strip the trailing separator from the right side:
df['Keyword'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
Or compare the strings directly - e.g. not equal to '0', or equal to '1':
df['Keyword'] = df.ne('0').dot(df.columns + ', ').str.rstrip(', ')
df['Keyword'] = df.eq('1').dot(df.columns + ', ').str.rstrip(', ')
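End to end with the string data from the question, the string-comparison variant needs no numeric conversion at all:

```python
import pandas as pd

# sample data straight from the question; values stay as strings
data = {'Environment': ['0', '0', '0'],
        'Health': ['1', '0', '1'],
        'Labor': ['1', '1', '1']}
df = pd.DataFrame(data)

# boolean mask @ (column name + separator), then strip the trailing ', '
df['Keyword'] = df.eq('1').dot(df.columns + ', ').str.rstrip(', ')
print(df['Keyword'].tolist())
```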
I want to print, from each of two dataframes, the rows where there is a mismatch in a given column, here "second_column".
"first_column" is a key value that identifies the same product in both dataframes.
import pandas as pd
data1 = {
'first_column': ['id1', 'id2', 'id3'],
'second_column': ['1', '2', '2'],
'third_column': ['1', '2', '2'],
'fourth_column': ['1', '2', '2']
}
df1 = pd.DataFrame(data1)
print(df1)
test = df1['second_column'].nunique()
data2 = {
'first_column': ['id1', 'id2', 'id3'],
'second_column': ['3', '4', '2'],
'third_column': ['1', '2', '2'],
'fourth_column': ['1', '2', '2']
}
df2 = pd.DataFrame(data2)
print(df2)
expected output:
IIUC
btw, your screenshots don't match your DF definition
df1.loc[~df1['second_column'].isin(df2['second_column'])]
  first_column second_column third_column fourth_column
0          id1             1             1             1
df2.loc[~df2['second_column'].isin(df1['second_column'])]
  first_column second_column third_column fourth_column
0          id1             3             1             1
1          id2             4             2             2
The compare method can do what you want:
different_rows = df1.compare(df2, align_axis=1).index
df1.loc[different_rows]
One important caveat with this method: compare requires identically labeled frames, so if one frame has extra rows (a different index) it cannot be used to report them as differences.
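A minimal end-to-end check with the sample frames from the question (both frames share the same index, as compare requires):

```python
import pandas as pd

df1 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['1', '2', '2'],
                    'third_column': ['1', '2', '2'],
                    'fourth_column': ['1', '2', '2']})
df2 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['3', '4', '2'],
                    'third_column': ['1', '2', '2'],
                    'fourth_column': ['1', '2', '2']})

# the index of the compare result is the set of rows that differ anywhere
different_rows = df1.compare(df2, align_axis=1).index
print(df1.loc[different_rows])
```

Only rows 0 and 1 differ (in second_column), so those are the rows printed.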
Or, if you want to find differences in one column only, you can first join on the index and then check whether the joined columns match:
joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column']!=joined_df['second_column_df2']
print(joined_df.loc[diff, df1.columns])
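Putting the join approach together with the sample data (trimmed to the two relevant columns for brevity):

```python
import pandas as pd

df1 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['1', '2', '2']})
df2 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['3', '4', '2']})

# align on the index, then keep the rows where the two columns disagree
joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column'] != joined_df['second_column_df2']
mismatches = joined_df.loc[diff, df1.columns]
print(mismatches)
```

Rows id1 and id2 come out, since only id3 has a matching second_column.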
I have the following dataframe with multiple dictionaries in a list in the Rules column.
SetID SetName Rules
0 Standard_1 [{'RulesID': '10', 'RuleName': 'name_abc'}, {'RulesID': '11', 'RuleName': 'name_xyz'}]
1 Standard_2 [{'RulesID': '12', 'RuleName': 'name_arg'}]
The desired output is:
SetID SetName RulesID RuleName
0 Standard_1 10 name_abc
0 Standard_1 11 name_xyz
1 Standard_2 12 name_arg
It might be possible that there are more than two dictionaries inside of the list.
I am thinking about a pop, explode or pivot function to build the dataframe but I have no clue how to start.
Any advice would be much appreciated!
EDIT: To build the dataframe you can use the following dataframe constructor:
# initialize list of lists
data = [[0, 'Standard_1', [{'RulesID': '10', 'RuleName': 'name_abc'}, {'RulesID': '11', 'RuleName': 'name_xyz'}]], [1, 'Standard_2', [{'RulesID': '12', 'RuleName': 'name_arg'}]]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['SetID', 'SetName', 'Rules'])
You can use explode:
tmp = df.explode('Rules').reset_index(drop=True)
df = pd.concat([tmp, pd.json_normalize(tmp['Rules'])], axis=1).drop('Rules', axis=1)
Output:
>>> df
SetID SetName RulesID RuleName
0 0 Standard_1 10 name_abc
1 0 Standard_1 11 name_xyz
2 1 Standard_2 12 name_arg
One-liner version of the above (the exploded frame is piped so the temporary variable isn't needed):
df.explode('Rules').reset_index(drop=True).pipe(lambda x: pd.concat([x.drop('Rules', axis=1), pd.json_normalize(x['Rules'])], axis=1))
I need to count the number of times id occurs given the condition that enroll > 0. This is what I have so far...Any thoughts on how to do this? Thanks!
raw_data = [['a', '0'], ['a', '0'], ['a', '1'], ['b', '0'], ['b', '0.5'], ['c', '0'], ['c', '0']]
df = pd.DataFrame(raw_data, columns = ['id', 'enroll'])
df
def countidsperenroll():
    for i in df['id']:
        if (enroll > 0):
            return value.count()
        continue
The result should be a table with the values:
3
2
0
because there were 3 'a' ids and one of the 'a' rows has enroll > 0, and there were 2 'b' ids and one of the 'b' rows has enroll > 0. No 'c' row has enroll > 0, so 'c' gets a 0.
We can do it in two steps with value_counts (converting enroll to numeric first, since the sample values are strings):
df['enroll'] = df['enroll'].astype(float)
s = df.id.value_counts()
s.loc[~s.index.isin(df.loc[df.enroll > 0, 'id'].unique())] = 0
s
a 3
c 0
b 2
Name: id, dtype: int64
df['enroll'] = df['enroll'].astype(float)  # the sample enroll values are strings
df.groupby("id").filter(lambda x: (x["enroll"] > 0).any()).groupby("id").count()
First you groupby with filter to keep only the groups that have at least one enroll greater than 0, then you groupby again to get the aggregated counts.
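Note that filter drops the 'c' group entirely (it never has enroll > 0), so it won't appear with a 0 count as the question asks; a reindex over all ids restores it. A runnable sketch, converting enroll to numeric first:

```python
import pandas as pd

raw_data = [['a', '0'], ['a', '0'], ['a', '1'],
            ['b', '0'], ['b', '0.5'],
            ['c', '0'], ['c', '0']]
df = pd.DataFrame(raw_data, columns=['id', 'enroll'])
df['enroll'] = df['enroll'].astype(float)  # the sample values are strings

# keep groups with any enroll > 0, count their rows, then restore the
# dropped groups with a 0 count
counts = (df.groupby('id').filter(lambda x: (x['enroll'] > 0).any())
            .groupby('id')['enroll'].count()
            .reindex(df['id'].unique(), fill_value=0))
print(counts)
```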
You could use the fact that if any enroll is greater than 0, then the sum per group will be greater than 0 (again, convert enroll to numeric first):
df['enroll'] = df['enroll'].astype(float)
(
    df.assign(temp=df.groupby("id").enroll.transform("sum").gt(0))
      .groupby("id")
      .temp.sum()
)
id
a 3.0
b 2.0
c 0.0
Name: temp, dtype: float64
I have a DataFrame with the columns below:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
'Try': [0,0,0,1,1,1,2,2,2],
'Batch':[0,0,0,0,0,0,0,0,0]})
In each batch a name gets arbitrarily many tries to reach the greatest Lenght.
What I want to do is create a column win that has the value 1 for the greatest Lenght in a batch and 0 otherwise, with the following conditions:
If one name holds the greatest Lenght in a batch over multiple tries, only the first try gets the value 1 in win (see "Abe" in the example below).
If two separate names hold an equal greatest Lenght, then both get the value 1 in win.
What I have managed to do so far is:
df.groupby(['Batch', 'Name'])['Lenght'].apply(lambda x: (x == x.max()).map({True: 1, False: 0}))
But it doesn't support all the conditions; any insight would be highly appreciated.
Expected output:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0,0,0,1,1,1,2,2,2],
                   'Batch':[0,0,0,0,0,0,0,0,0],
                   'win':[0,1,0,1,0,0,0,0,0]})
Many thanks!
Use GroupBy.transform to get the max value per group, compare it to the Lenght column with Series.eq, and to map True -> 1 and False -> 0 cast the values to integers with Series.astype:
# sample modified: the first row now duplicates the second row, to test ties
df = pd.DataFrame({'Name': ['Karl', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
'Lenght': ['12.5', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
'Try': [0,0,0,1,1,1,2,2,2],
'Batch':[0,0,0,0,0,0,0,0,0]})
df['Lenght'] = df['Lenght'].astype(float)
m1 = df.groupby('Batch')['Lenght'].transform('max').eq(df['Lenght'])
df1 = df[m1]
m2 = df1.groupby('Name')['Try'].transform('nunique').eq(1)
m3 = ~df1.duplicated(['Name','Batch'])
df['new'] = ((m2 | m3) & m1).astype(int)
print (df)
Name Lenght Try Batch new
0 Karl 12.5 0 0 1
1 Karl 12.5 0 0 1
2 Billy 11.0 0 0 0
3 Abe 12.5 1 0 1
4 Karl 12.0 1 0 0
5 Billy 11.0 1 0 0
6 Abe 12.5 2 0 0
7 Karl 10.0 2 0 0
8 Billy 5.0 2 0 0
I have the data below. I would like to flag transactions when the same employee has one of ('Car Rental', 'Car Rental - Gas') in the expense type column and 'Car Mileage' on the same day - so in this case employee a's and c's transactions would be highlighted. Employee b's transactions won't be highlighted, as they don't meet the condition - he doesn't have a 'Car Mileage'.
I want the column zflag. Different numbers in that column indicate the groups of instances where the above condition was met.
d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c' ],
'date': ['1', '1', '1', '1', '2', '2', '2', '3', '3', '3', '3' ],
'usd':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ],
'expense type':['Car Mileage', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Mileage', 'Car Rental', 'food', 'wine' ],
'zflag':['1', '1', '1', ' ',' ',' ',' ','2','2',' ',' ' ]
}
df = pd.DataFrame(data=d)
df
Out[253]:
date emp expense type usd zflag
0 1 a Car Mileage 1 1
1 1 a Car Rental 2 1
2 1 a Car Rental - Gas 3 1
3 1 a food 4
4 2 b Car Rental 5
5 2 b Car Rental - Gas 6
6 2 b food 7
7 3 c Car Mileage 8 2
8 3 c Car Rental 9 2
9 3 c food 10
10 3 c wine 11
I would appreciate pointers regarding which functions to use. I am thinking of using groupby... but I'm not sure.
I understand that date+emp will be my primary key.
Here is an approach. It's not the cleanest, but what you're describing is quite specific. Some of this might be simplified with a function.
temp_df = df.groupby(["emp", "date"])["expense type"].apply(lambda x: 1 if "Car Mileage" in x.values and any(k in x.values for k in ["Car Rental", "Car Rental - Gas"]) else 0).rename("zzflag")
temp_df = temp_df.loc[temp_df != 0].cumsum()
final_df = pd.merge(df, temp_df.reset_index(), how="left").fillna(0)
Steps:
Groupby emp/date and search for criteria, 1 if met, 0 if not
Remove rows with 0's and cumsum to produce unique values
Join back to the original frame
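The steps above, end to end with the sample frame from the question (zzflag is the computed counterpart of the hand-written zflag; b's rows come out as 0 after fillna):

```python
import pandas as pd

d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
     'date': ['1', '1', '1', '1', '2', '2', '2', '3', '3', '3', '3'],
     'usd': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
     'expense type': ['Car Mileage', 'Car Rental', 'Car Rental - Gas', 'food',
                      'Car Rental', 'Car Rental - Gas', 'food',
                      'Car Mileage', 'Car Rental', 'food', 'wine']}
df = pd.DataFrame(d)

# 1 per (emp, date) group that has 'Car Mileage' plus a rental expense
temp_df = (df.groupby(['emp', 'date'])['expense type']
             .apply(lambda x: 1 if 'Car Mileage' in x.values
                    and any(k in x.values for k in ['Car Rental', 'Car Rental - Gas'])
                    else 0)
             .rename('zzflag'))
# keep only the flagged groups and number them 1, 2, ...
temp_df = temp_df.loc[temp_df != 0].cumsum()
final_df = pd.merge(df, temp_df.reset_index(), how='left').fillna(0)
print(final_df)
```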
Edit:
To answer the follow-up question: the join works because .reset_index() takes "emp" and "date" out of the index and turns them into ordinary columns, so the merge can align on them.
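A tiny illustration of that reset_index step, on a hypothetical two-group flag Series like the one built above:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', '1'), ('c', '3')], names=['emp', 'date'])
s = pd.Series([1, 2], index=idx, name='zzflag')

flat = s.reset_index()  # 'emp' and 'date' become ordinary columns
print(flat.columns.tolist())
```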