I have data as below. I would like to flag transactions where the same employee has one of ('Car Rental', 'Car Rental - Gas') in the expense type column and 'Car Mileage' on the same day. In this case employee a's and c's transactions would be highlighted; employee b's transactions won't be highlighted, as they don't meet the condition - he doesn't have a 'Car Mileage'.
I want the column zflag. Different numbers in that column indicate groups of instances where the above condition was met:
import pandas as pd

d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c' ],
'date': ['1', '1', '1', '1', '2', '2', '2', '3', '3', '3', '3' ],
'usd':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ],
'expense type':['Car Mileage', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Mileage', 'Car Rental', 'food', 'wine' ],
'zflag':['1', '1', '1', ' ',' ',' ',' ','2','2',' ',' ' ]
}
df = pd.DataFrame(data=d)
df
Out[253]:
date emp expense type usd zflag
0 1 a Car Mileage 1 1
1 1 a Car Rental 2 1
2 1 a Car Rental - Gas 3 1
3 1 a food 4
4 2 b Car Rental 5
5 2 b Car Rental - Gas 6
6 2 b food 7
7 3 c Car Mileage 8 2
8 3 c Car Rental 9 2
9 3 c food 10
10 3 c wine 11
I would appreciate it if I could get pointers regarding which functions to use. I am thinking of using groupby... but I'm not sure.
I understand that date + emp will be my primary key.
Here is an approach. It's not the cleanest, but what you're describing is quite specific. Some of this might be simplified with a function.
temp_df = df.groupby(["emp", "date"])["expense type"].apply(lambda x: 1 if "Car Mileage" in x.values and any(k in x.values for k in ["Car Rental", "Car Rental - Gas"]) else 0).rename("zzflag")
temp_df = temp_df.loc[temp_df != 0].cumsum()
final_df = pd.merge(df, temp_df.reset_index(), how="left").fillna(0)
Steps:
Groupby emp/date and search for criteria, 1 if met, 0 if not
Remove rows with 0's and cumsum to produce unique values
Join back to the original frame
Edit:
To answer the question from the comments: the join works because .reset_index() takes "emp" and "date" out of the index and moves them back to columns, so the merge can align on them.
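For illustration, this is roughly what the merge sees after reset_index() on the example data (a minimal sketch; the zzflag values come from the cumsum):
print(temp_df.reset_index())
#   emp date  zzflag
# 0   a    1       1
# 1   c    3       2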
Given the following df:
data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
'Start': ['6', '1', '5', '1'],
'Length': ['5', '5', '6', '6']}
df = pd.DataFrame(data)
print (df)
I would like to extract a substring of "Description" based on the start and length specified in the other columns; here is the expected output:
data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
'Start': ['6', '1', '5', '1'],
'Length': ['5', '5', '6', '6'],
'Res': ['lemon', 'lemon', 'orange', 'orange']}
df = pd.DataFrame(data)
print (df)
My attempt below only takes a static slice. Is there a way to make it dynamic, or another compact way?
df['Res'] = df['Description'].str[1:2]
You need to loop; a list comprehension will be the most efficient (Python ≥3.8 due to the walrus operator, thanks @I'mahdi):
df['Res'] = [s[(start:=int(a)-1):start+int(b)] for (s,a,b)
in zip(df['Description'], df['Start'], df['Length'])]
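The walrus assignment caches the converted start position, so int(a) - 1 is computed only once per row and reused for the end of the slice.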
Or using pandas for the conversion (thanks @DaniMesejo):
df['Res'] = [s[a:a+b] for (s,a,b) in
zip(df['Description'],
df['Start'].astype(int)-1,
df['Length'].astype(int))]
output:
Description Start Length Res
0 with lemon 6 5 lemon
1 lemon 1 5 lemon
2 and orange 5 6 orange
3 orange 1 6 orange
Handling non-integers / NAs:
df['Res'] = [s[a:a+b] if pd.notna(a) and pd.notna(b) else 'NA'
for (s,a,b) in
zip(df['Description'],
pd.to_numeric(df['Start'], errors='coerce').convert_dtypes()-1,
pd.to_numeric(df['Length'], errors='coerce').convert_dtypes()
)]
output (with two malformed rows added to the example input to demonstrate):
Description Start Length Res
0 with lemon 6 5 lemon
1 lemon 1 5 lemon
2 and orange 5 6 orange
3 orange 1 6 orange
4 pinapple xxx NA NA NA
5 orangiie NA NA NA
Given that the fruit name of interest always seems to be the final word in the Description column, you might be able to use a regex extract approach here.
df["Res"] = df["Description"].str.extract(r'(\w+)$')
You can use .map to iterate over the Series. Use split(' ') to separate the words on spaces and take the last word in the list with [-1].
df['RES'] = df['Description'].map(lambda x: x.split(' ')[-1])
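Calling split() with no argument would split on any run of whitespace, which is slightly more robust if descriptions can contain tabs or repeated spaces.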
Here is a simple pandas DataFrame:
data={'Name': ['John', 'Dav', 'Ann', 'Mike', 'Dany'],
'Number': ['2', '3', '2', '4', '2']}
df = pd.DataFrame(data, columns=['Name', 'Number'])
df
I would like to add a third column named "color" where the value is 'Red' if Number = 2 and 'Blue' if Number = 3.
This dataframe has just 5 rows; in reality it has thousands of rows, so I can't just add the column manually.
You can use .map:
dct = {2: "Red", 3: "Blue"}
df["color"] = df["Number"].astype(int).map(dct) # remove .astype(int) if the values are already integer
print(df)
Prints:
Name Number color
0 John 2 Red
1 Dav 3 Blue
2 Ann 2 Red
3 Mike 4 NaN
4 Dany 2 Red
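If you would rather have a default label than NaN for numbers missing from dct (like Mike's 4 above), one option is to chain .fillna; a minimal sketch, where "unknown" is an arbitrary choice:
df["color"] = df["Number"].astype(int).map(dct).fillna("unknown")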
I have a Pandas aggregate dataframe like this:
import pandas as pd
agg_df = pd.DataFrame({'v1':['item', 'item', 'item', 'item', 'location', 'status', 'status'],
'v2' :['bed', 'lamp', 'candle', 'chair', 'home', 'new', 'used' ],
'count':['2', '2', '2', '1', '7', '4', '3' ]})
agg_df
I want to prepare it for academic publication and I need a new dataframe like this:
# item bed 2
# lamp 2
# candle 2
# chair 1
# location home 7
# status new 4
# used 3
How can I create such a dataframe?
For display only, it is possible to use a MultiIndex:
df = agg_df.set_index(['v1','v2'])
print (df)
count
v1 v2
item bed 2
lamp 2
candle 2
chair 1
location home 7
status new 4
used 3
If you need to replace the duplicated values with blanks, use Series.duplicated with Series.mask:
agg_df['v1'] = agg_df['v1'].mask(agg_df['v1'].duplicated(),'')
print (agg_df)
v1 v2 count
0 item bed 2
1 lamp 2
2 candle 2
3 chair 1
4 location home 7
5 status new 4
6 used 3
If you need to remove the index and the column headers:
print (agg_df.to_string(index=False, header=None))
item bed 2
lamp 2
candle 2
chair 1
location home 7
status new 4
used 3
You can do that using:
import pandas as pd
agg_df = pd.DataFrame({'v1':['item', 'item', 'item', 'item', 'location', 'status', 'status'],
'v2' :['bed', 'lamp', 'candle', 'chair', 'home', 'new', 'used' ],
'count':['2', '2', '2', '1', '7', '4', '3' ]})
agg_df.set_index(["v1","v2"])
I have the following dataset:
data = {'Environment': ['0', '0', '0'],
'Health': ['1', '0', '1'],
'Labor': ['1', '1', '1'],
}
df = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor'])
I want to create a new column df['Keyword'] whose value is a join of the column names with value > 0.
Expected Outcome:
data = {'Environment': ['0', '0', '0'],
'Health': ['1', '0', '1'],
'Labor': ['1', '1', '1'],
'Keyword': ['Health, Labor', 'Labor', 'Health, Labor']}
df_test = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor', 'Keyword'])
df_test
How do I go about it?
Another version with .apply():
df['Keyword'] = df.apply(lambda x: ', '.join(b for a, b in zip(x, x.index) if a=='1'),axis=1)
print(df)
Prints:
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
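Note that .apply with axis=1 loops over rows in Python, so on large frames the vectorized DataFrame.dot approach shown in another answer below will generally be faster.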
Another method: mask and stack, then groupby to aggregate the items. This assumes the columns have been converted to numeric first (see the conversion note in the answer below); stack drops NaN values by default.
df['keyword'] = (df.mask(df.lt(1))
                   .stack()
                   .reset_index(1)
                   .groupby(level=0)["level_1"]
                   .agg(list))
print(df)
Environment Health Labor keyword
0 0 1 1 [Health, Labor]
1 0 0 1 [Labor]
2 0 1 1 [Health, Labor]
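If you prefer the comma-separated string from the expected output over lists, the same chain could aggregate with a join instead; a sketch under the same numeric-columns assumption:
df['keyword'] = (df.mask(df.lt(1))
                   .stack()
                   .reset_index(1)
                   .groupby(level=0)["level_1"]
                   .agg(', '.join))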
The first problem: the sample data values are strings, so if you want to compare with greater-than, convert first:
df = df.astype(float).astype(int)
Or:
df = df.replace({'0':0, '1':1})
Then use DataFrame.dot for matrix multiplication with the column names plus a separator, and finally strip the trailing separator from the right side:
df['Keyword'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
Or compare the strings directly - e.g. not equal to '0', or equal to '1':
df['Keyword'] = df.ne('0').dot(df.columns + ', ').str.rstrip(', ')
df['Keyword'] = df.eq('1').dot(df.columns + ', ').str.rstrip(', ')
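Both string comparisons produce the same Keyword column as above while working directly on the original string data, so no type conversion is needed.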
I have 2 similar dataframes that I concatenated; they have a lot of repeated values because they are basically the same data set, but for different years.
The problem is that one of the sets has some values missing, whereas the other sometimes has them.
For example:
Name Unit Year Level
Nik 1 2000 12
Nik 1 12
John 2 2001 11
John 2 2001 11
Stacy 1 8
Stacy 1 1999 8
.
.
I want to drop duplicates on subset = ['Name', 'Unit', 'Level'], since some repetitions don't have years.
However, that leaves me with the rows that have no Year, and I'd like to keep the rows that do have these values:
Name Unit Year Level
Nik 1 2000 12
John 2 2001 11
Stacy 1 1999 8
.
.
How do I keep these values rather than the blanks?
Use sort_values with the default parameter na_position='last' (so it can be omitted), and then drop_duplicates:
print (df)
Name Unit Year Level
0 Nik 1 NaN 12
1 Nik 1 2000.0 12
2 John 2 2001.0 11
3 John 2 2001.0 11
4 Stacy 1 NaN 8
5 Stacy 1 1999.0 8
subset = ['Name', 'Unit', 'Level']
df = df.sort_values('Year').drop_duplicates(subset)
Or:
df = df.sort_values(subset + ['Year']).drop_duplicates(subset)
print (df)
Name Unit Year Level
5 Stacy 1 1999.0 8
1 Nik 1 2000.0 12
2 John 2 2001.0 11
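If you want to restore the original row order after sorting, chaining sort_index() at the end should do it; a sketch:
df = df.sort_values('Year').drop_duplicates(subset).sort_index()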
Another solution with GroupBy.first, which returns the first non-missing value of Year per group:
df = df.groupby(subset, as_index=False, sort=False)['Year'].first()
print (df)
Name Unit Level Year
0 Nik 1 12 2000.0
1 John 2 11 2001.0
2 Stacy 1 8 1999.0
One solution that comes to mind is to first sort the concatenated dataframe by Year with the sort_values function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
then drop duplicates with the keep='first' parameter:
df = df.sort_values('Year').drop_duplicates(subset=['Name', 'Unit', 'Level'], keep='first')
I would suggest that you look at the creation step of your merged dataset.
When merging the data sets you can do so on multiple indices i.e.
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
With the outer join you collect all data sets and remove duplicates right away. The only thing left is to merge the Year column which you can do like so:
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (pd.notna(x['Year']) and x['Year'] != '') else x['Year_r'], axis=1)
This fills the gaps and afterwards you are able to simply drop the 'Year_r' column.
The benefit here is that not only NaN values of missing years are covered but also missing Years which are represented as empty strings.
Here is a small working example:
import pandas as pd
import numpy as np
left = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo', 'Peter', 'Adam'],
'Unit': ['2', '4', '6', '2', '4', '12'],
'Year': ['', '2009', '1954', '2025', '2012', '2024'],
'Level': ['L1', 'L1', 'L0', 'L4', 'L3', 'L10']})
right = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo'],
'Unit': ['2', '4', '6', '2'],
'Year': ['2010', '2009', '1954', '2025'],
'Level': ['L1', 'L1', 'L0', 'L4']})
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (pd.notna(x['Year']) and x['Year'] != '') else x['Year_r'], axis=1)
df
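As described above, the last step is simply dropping the helper column:
df = df.drop(columns='Year_r')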