Given the following df:
data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
'Start': ['6', '1', '5', '1'],
'Length': ['5', '5', '6', '6']}
df = pd.DataFrame(data)
print (df)
I would like to take a substring of "Description" based on the start and length specified in the other columns; here is the expected output:
data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
'Start': ['6', '1', '5', '1'],
'Length': ['5', '5', '6', '6'],
'Res': ['lemon', 'lemon', 'orange', 'orange']}
df = pd.DataFrame(data)
print (df)
My attempt so far only takes a fixed slice:
df['Res'] = df['Description'].str[1:2]
Is there a way to make it dynamic, or is there another compact way?
You need to loop; a list comprehension will be the most efficient (python ≥3.8 due to the walrus operator, thanks @I'mahdi):
df['Res'] = [s[(start:=int(a)-1):start+int(b)] for (s,a,b)
in zip(df['Description'], df['Start'], df['Length'])]
Or using pandas for the conversion (thanks @DaniMesejo):
df['Res'] = [s[a:a+b] for (s,a,b) in
zip(df['Description'],
df['Start'].astype(int)-1,
df['Length'].astype(int))]
output:
Description Start Length Res
0 with lemon 6 5 lemon
1 lemon 1 5 lemon
2 and orange 5 6 orange
3 orange 1 6 orange
Handling non-integers / NAs:
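The output below assumes the sample frame was extended with two malformed rows; a sketch reconstructed from that output (the exact bad values in `Start`/`Length` are assumptions):

```python
import pandas as pd

# original four rows plus two rows whose Start/Length are not valid integers
df = pd.DataFrame({
    'Description': ['with lemon', 'lemon', 'and orange', 'orange',
                    'pinapple xxx', 'orangiie'],
    'Start': ['6', '1', '5', '1', 'NA', None],
    'Length': ['5', '5', '6', '6', None, None],
})
```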
df['Res'] = [s[a:a+b] if pd.notna(a) and pd.notna(b) else 'NA'
for (s,a,b) in
zip(df['Description'],
pd.to_numeric(df['Start'], errors='coerce').convert_dtypes()-1,
pd.to_numeric(df['Length'], errors='coerce').convert_dtypes()
)]
output:
Description Start Length Res
0 with lemon 6 5 lemon
1 lemon 1 5 lemon
2 and orange 5 6 orange
3 orange 1 6 orange
4 pinapple xxx NA NA NA
5 orangiie NA NA NA
Given that the fruit name of interest always seems to be the final word in the description column, you might be able to use a regex extract approach here.
df["Res"] = df["Description"].str.extract(r'(\w+)$')
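As a quick standalone check of the pattern on the sample values:

```python
import pandas as pd

s = pd.Series(['with lemon', 'lemon', 'and orange', 'orange'])
# (\w+)$ captures the final run of word characters in each string
res = s.str.extract(r'(\w+)$')[0]
print(res.tolist())  # ['lemon', 'lemon', 'orange', 'orange']
```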
You can use .map to iterate over the Series: split each string on spaces with split(' ') and take the last word with [-1].
df['RES'] = df['Description'].map(lambda x: x.split(' ')[-1])
Here is a simple pandas DataFrame:
data={'Name': ['John', 'Dav', 'Ann', 'Mike', 'Dany'],
'Number': ['2', '3', '2', '4', '2']}
df = pd.DataFrame(data, columns=['Name', 'Number'])
df
I would like to add a third column named "color" where the value is 'red' if Number = 2 and 'Blue' if Number = 3.
This DataFrame has just 5 rows; in reality it has thousands of rows, so I cannot simply add the column manually.
You can use .map:
dct = {2: "Red", 3: "Blue"}
df["color"] = df["Number"].astype(int).map(dct) # remove .astype(int) if the values are already integer
print(df)
Prints:
Name Number color
0 John 2 Red
1 Dav 3 Blue
2 Ann 2 Red
3 Mike 4 NaN
4 Dany 2 Red
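If the unmapped values (4 in this sample) should get a default instead of NaN, one option is to chain .fillna — 'Other' here is just a hypothetical placeholder:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Dav', 'Ann', 'Mike', 'Dany'],
                   'Number': ['2', '3', '2', '4', '2']})
dct = {2: "Red", 3: "Blue"}
# values missing from dct map to NaN, which fillna then replaces
df["color"] = df["Number"].astype(int).map(dct).fillna("Other")
print(df["color"].tolist())  # ['Red', 'Blue', 'Red', 'Other', 'Red']
```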
I hope someone might help me.
I have a dataframe that includes columns with similar names (see example data).
I also have 3 lists of column names holding the original names of the columns, i.e. the string occurring before the question mark (see the lists of column names below).
I need to subset the df dataframe into 3 separate dataframes, based on matching the first part of the column names against the 3 lists; the expected output is at the bottom.
It has to be done with lists (or something programmatic), as I have lots and lots of columns like this. I tried pattern matching, but because some names are very similar, they match multiple lists.
thank you in advance!
example data
df = {'id': ['1','2','3','4'],
'ab? op': ['green', 'red', 'blue', 'None'],
'ab? 1': ['red', 'yellow', 'None', 'None'],
'cd': ['L', 'XL', 'M','L'],
'efab? cba' : ['husband', 'wife', 'husband', 'None'],
'efab? 1':['son', 'grandparent', 'son', 'None'],
'efab? 2':['None', 'son', 'None', 'None'],
'fab? 4':['9', '10', '5', '3'],
'fab? po':['England', 'Scotland', 'Wales', 'NA'] }
df = pd.DataFrame(df, columns = ['id','ab? op', 'ab? 1', 'cd', 'efab? cba', 'efab? 1', 'efab? 2', 'fab? 4', 'fab? po'])
lists of column names for the other 3 data frames
df1_lst = ['ab', 'cd']
df2_lst = ['efab']
df3_lst = ['fab']
desired output
df1 = ['ab? op', 'ab? 1', 'cd']
df2 = ['efab? cba', 'efab? 1', 'efab? 2']
df3 = ['fab? 4', 'fab? po']
You can form a dynamic regex for each list:
df_lists = [df1_lst, df2_lst, df3_lst]
result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]
e.g., for the first list the regex is \b(ab|cd)\??, i.e. look for either ab or cd, standalone on the left side (\b), optionally followed by a ?.
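A quick standalone check of the generated pattern with the stdlib re module (df.filter's regex argument uses the same re.search semantics):

```python
import re

names = ['ab', 'cd']
pattern = fr"\b({'|'.join(names)})\??"   # -> r"\b(ab|cd)\??"

print(bool(re.search(pattern, 'ab? op')))     # True: 'ab' sits at a word boundary
print(bool(re.search(pattern, 'efab? cba')))  # False: 'ab' inside 'efab' has no \b to its left
```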
The desired entries are in the result list e.g.
>>> result[1]
efab? cba efab? 1 efab? 2
0 husband son None
1 wife grandparent son
2 husband son None
3 None None None
Split the column names by ?, keep the first part and check whether it is in the corresponding list:
df1 = df.loc[:, df.columns.str.split('?').str[0].isin(df1_lst)]
df2 = df.loc[:, df.columns.str.split('?').str[0].isin(df2_lst)]
df3 = df.loc[:, df.columns.str.split('?').str[0].isin(df3_lst)]
>>> df1
ab? op ab? 1 cd
0 green red L
1 red yellow XL
2 blue None M
3 None None L
>>> df2
efab? cba efab? 1 efab? 2
0 husband son None
1 wife grandparent son
2 husband son None
3 None None None
>>> df3
fab? 4 fab? po
0 9 England
1 10 Scotland
2 5 Wales
3 3 NA
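To see what the isin check above compares, the split keeps everything before the first ? (standalone sketch; note a single-character pat is treated as a literal string by str.split in recent pandas):

```python
import pandas as pd

cols = pd.Index(['id', 'ab? op', 'ab? 1', 'cd', 'efab? cba', 'fab? 4'])
# everything before the first '?' (the whole name when there is no '?')
firsts = cols.str.split('?').str[0]
print(firsts.tolist())  # ['id', 'ab', 'ab', 'cd', 'efab', 'fab']
```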
I have the following dataset:
data = {'Environment': ['0', '0', '0'],
'Health': ['1', '0', '1'],
'Labor': ['1', '1', '1'],
}
df = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor'])
I want to create a new column df['Keyword'] whose value is a join of the column names with value > 0.
Expected Outcome:
data = {'Environment': ['0', '0', '0'],
'Health': ['1', '0', '1'],
'Labor': ['1', '1', '1'],
'Keyword': ['Health, Labor', 'Labor', 'Health, Labor']}
df_test = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor', 'Keyword'])
df_test
df = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor'])
How do I go about it?
Another version, with .apply():
df['Keyword'] = df.apply(lambda x: ', '.join(b for a, b in zip(x, x.index) if a=='1'),axis=1)
print(df)
Prints:
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
Another method: mask and stack, then groupby to aggregate the items.
stack drops NaN values by default.
df['keyword'] = (df.mask(df.lt(1)).stack().reset_index(1)
                   .groupby(level=0)["level_1"].agg(list))
print(df)
Environment Health Labor keyword
0 0 1 1 [Health, Labor]
1 0 0 1 [Labor]
2 0 1 1 [Health, Labor]
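If you want the comma-separated string from the expected output rather than a list, you can aggregate with ', '.join instead (a sketch, assuming the columns are already numeric):

```python
import pandas as pd

df = pd.DataFrame({'Environment': [0, 0, 0],
                   'Health': [1, 0, 1],
                   'Labor': [1, 1, 1]})
# mask zeros to NaN, stack (drops NaN), then join the surviving column names per row
df['Keyword'] = (df.mask(df.lt(1)).stack().reset_index(1)
                   .groupby(level=0)['level_1'].agg(', '.join))
print(df['Keyword'].tolist())  # ['Health, Labor', 'Labor', 'Health, Labor']
```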
The first problem: the sample data values are strings, so if you want a greater-than comparison, first convert:
df = df.astype(float).astype(int)
Or:
df = df.replace({'0':0, '1':1})
Then use DataFrame.dot for matrix multiplication with the column names plus a separator, and finally strip the trailing separator from the right side:
df['Keyword'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
Or compare the strings directly - e.g. not equal to '0', or equal to '1':
df['Keyword'] = df.ne('0').dot(df.columns + ', ').str.rstrip(', ')
df['Keyword'] = df.eq('1').dot(df.columns + ', ').str.rstrip(', ')
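For instance, with the string comparison on the sample data (the dot trick works because, under object-dtype matrix multiplication, True * 'Health, ' is 'Health, ' and False * 'Health, ' is the empty string):

```python
import pandas as pd

df = pd.DataFrame({'Environment': ['0', '0', '0'],
                   'Health': ['1', '0', '1'],
                   'Labor': ['1', '1', '1']})
# boolean frame @ column names: each row concatenates the names where True
out = df.eq('1').dot(df.columns + ', ').str.rstrip(', ')
print(out.tolist())  # ['Health, Labor', 'Labor', 'Health, Labor']
```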
I have a DataFrame that has below columns:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
'Try': [0,0,0,1,1,1,2,2,2],
'Batch':[0,0,0,0,0,0,0,0,0]})
In each batch a name gets arbitrarily many tries to achieve the greatest Lenght.
What I want to do is create a column win that has the value 1 for the greatest Lenght in a batch and 0 otherwise, with the following conditions:
If one name holds the greatest Lenght in a batch across multiple tries, only the first try gets the value 1 in win (see "Abe" in the example above).
If two separate names hold an equal greatest Lenght, both get the value 1 in win.
What I have managed to do so far is:
df.groupby(['Batch', 'Name'])['Lenght'].apply(lambda x: (x == x.max()).map({True: 1, False: 0}))
But it doesn't support all the conditions; any insight would be highly appreciated.
Expected output:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
'Try': [0,0,0,1,1,1,2,2,2],
'Batch':[0,0,0,0,0,0,0,0,0],
'win':[0,1,0,1,0,0,0,0,0]})
Many thanks!
Use GroupBy.transform to get the maximum value per group, compare it to the Lenght column with Series.eq for equality, and instead of mapping True->1 and False->0 cast the values to integers with Series.astype:
# first row replaced with the second row's data, to also test ties within one name
df = pd.DataFrame({'Name': ['Karl', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
'Lenght': ['12.5', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
'Try': [0,0,0,1,1,1,2,2,2],
'Batch':[0,0,0,0,0,0,0,0,0]})
df['Lenght'] = df['Lenght'].astype(float)
m1 = df.groupby('Batch')['Lenght'].transform('max').eq(df['Lenght'])
df1 = df[m1]
m2 = df1.groupby('Name')['Try'].transform('nunique').eq(1)
m3 = ~df1.duplicated(['Name','Batch'])
df['new'] = ((m2 | m3) & m1).astype(int)
print (df)
Name Lenght Try Batch new
0 Karl 12.5 0 0 1
1 Karl 12.5 0 0 1
2 Billy 11.0 0 0 0
3 Abe 12.5 1 0 1
4 Karl 12.0 1 0 0
5 Billy 11.0 1 0 0
6 Abe 12.5 2 0 0
7 Karl 10.0 2 0 0
8 Billy 5.0 2 0 0
I have data as below. I would like to flag transactions:
when the same employee has one of ('Car Rental', 'Car Rental - Gas') in the expense type column and 'Car Mileage' on the same day - so in this case employee a's and c's transactions would be highlighted. Employee b's transactions won't be highlighted, as they don't meet the condition - he doesn't have a 'Car Mileage'.
I want the column zflag. Different numbers in that column indicate the groups of instances where the above condition was met.
d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c' ],
'date': ['1', '1', '1', '1', '2', '2', '2', '3', '3', '3', '3' ],
'usd':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ],
'expense type':['Car Mileage', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Mileage', 'Car Rental', 'food', 'wine' ],
'zflag':['1', '1', '1', ' ',' ',' ',' ','2','2',' ',' ' ]
}
df = pd.DataFrame(data=d)
df
Out[253]:
date emp expense type usd zflag
0 1 a Car Mileage 1 1
1 1 a Car Rental 2 1
2 1 a Car Rental - Gas 3 1
3 1 a food 4
4 2 b Car Rental 5
5 2 b Car Rental - Gas 6
6 2 b food 7
7 3 c Car Mileage 8 2
8 3 c Car Rental 9 2
9 3 c food 10
10 3 c wine 11
I would appreciate it if I could get pointers regarding which functions to use. I am thinking of using groupby, but I'm not sure.
I understand that date + emp will be my primary key.
Here is an approach. It's not the cleanest but what you're describing is quite specific. Some of this might be able to be simplified with a function.
temp_df = df.groupby(["emp", "date"])["expense type"].apply(
    lambda x: 1 if "Car Mileage" in x.values
    and any(k in x.values for k in ["Car Rental", "Car Rental - Gas"]) else 0
).rename("zzflag")
temp_df = temp_df.loc[temp_df != 0].cumsum()
final_df = pd.merge(df, temp_df.reset_index(), how="left").fillna(0)
Steps:
Groupby emp/date and search for criteria, 1 if met, 0 if not
Remove rows with 0's and cumsum to produce unique values
Join back to the original frame
Edit:
To answer your question below - the join works because .reset_index() moves "emp" and "date" from the index back into regular columns.
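A minimal illustration of that effect (with made-up index values):

```python
import pandas as pd

# a Series indexed by (emp, date), like the groupby result above
s = pd.Series([1, 2], name='zzflag',
              index=pd.MultiIndex.from_tuples([('a', '1'), ('c', '3')],
                                              names=['emp', 'date']))
flat = s.reset_index()  # 'emp' and 'date' become ordinary columns again
print(flat.columns.tolist())  # ['emp', 'date', 'zzflag']
```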