Groupby id and create boolean columns - python

I have a dataframe of transactions:
id | type | date
453| online | 08-12-19
453| instore| 08-12-19
453| return | 10-5-19
There are 4 possible types: online, instore, return, other. I want to create boolean columns indicating, for each unique customer, whether they ever had a given transaction type.
I tried the following code but it was not giving me what I wanted.
transactions.groupby('id')['type'].transform(lambda x: x == 'online') == 'online'

Use get_dummies with an aggregate max to get indicator columns per group, and then DataFrame.reindex for a custom order and for adding possible missing types filled by 0:
t = ['online', 'instore', 'return', 'other']
df = pd.get_dummies(df['type']).groupby(df['id']).max().reindex(t, axis=1, fill_value=0)
print (df)
online instore return other
id
453 1 1 1 0
Another idea: join the strings per group and use Series.str.get_dummies:
t = ['online', 'instore', 'return', 'other']
df.groupby('id')['type'].agg('|'.join).str.get_dummies().reindex(t, axis=1, fill_value=0)
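If you need actual True/False values rather than 0/1 indicators, you can cast the result with astype(bool). A minimal self-contained sketch of the first approach, using the sample rows from the question:
import pandas as pd

df = pd.DataFrame({'id': [453, 453, 453],
                   'type': ['online', 'instore', 'return']})

t = ['online', 'instore', 'return', 'other']
out = (pd.get_dummies(df['type'])
         .groupby(df['id']).max()
         .reindex(t, axis=1, fill_value=0)
         .astype(bool))   # True/False instead of 1/0
print(out)
#      online  instore  return  other
# id
# 453    True     True    True  False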

Related

How can I get the index values in DF1 where DF1's column values match DF2's custom MultiIndex values?

I have two data frames: DF1 and DF2.
DF2 is essentially a randomly generated subset of rows in DF1.
I want to get the (integer) indexes of the rows in DF1 where all column values completely match a row of DF2.
I'm trying to do this with a multi-index:
So if I have the following:
DF1:
Index Name Age Gender Label
0 Kate 24 F 1
1 Bill 23 M 0
2 Bob 22 M 0
3 Billy 21 M 0
DF2:
MultiIndex Name Age Gender Label
(Bob,22,M) Bob 22 M 0
(Billy,21,M) Billy 21 M 0
Desired Output: [2,3]
How can I use that MultiIndex in DF2 to check DF1 for those matches?
I found this while searching, but I think this requires you to specify what value you want beforehand? I can't find this exact use case.
df2.loc[(df2.index.get_level_values('Name') == 'xxx') &
        (df2.index.get_level_values('Age') == x) &
        (df2.index.get_level_values('Gender') == x)]
Please let me know the best way.
Thanks!
Edit (Code to generate df1):
Pseudocode: Merge two dataframes to get a total of 10 columns and
drop everything except 4 columns
Edit (Code to generate df2):
if amount_needed - len(lowest_value_keys) > 0:
    extra_samples = df1[df1.Label == 0].sample(n=amount_needed - len(lowest_value_keys), replace=False)
    lowest_value_df = pd.DataFrame(data=lower_value_keys, columns=["Name", 'Age', 'Gender'])
    samples = pd.concat([lowest_value_df, extra_samples])
    samples.index = pd.MultiIndex.from_frame(samples[["Name", 'Age', 'Gender']])
else:
    all_samples = pd.DataFrame(data=lower_value_keys, columns=["Name", 'Age', 'Gender'])
    samples = all_samples.sample(n=amount_needed, replace=False)
    samples.index = pd.MultiIndex.from_frame(samples[["Name", 'Age', 'Gender']])
Not sure if this answers your query, but if we first reset the index of df1 to get it as another column 'Index', then set_index on Name, Age, Gender to find the matches from df2, and just take the resulting Index column, would that work?
So that would be:
df1.reset_index().set_index(['Name','Age','Gender']).loc[df2.set_index(['Name','Age','Gender']).index]['Index'].values
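For illustration, a small self-contained sketch with the sample frames from the question (here df2 is just built from rows 2 and 3 of df1; note that reset_index names the new column 'index' when the original index is unnamed):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Kate', 'Bill', 'Bob', 'Billy'],
                    'Age': [24, 23, 22, 21],
                    'Gender': ['F', 'M', 'M', 'M'],
                    'Label': [1, 0, 0, 0]})
df2 = df1.iloc[[2, 3]].copy()
df2.index = pd.MultiIndex.from_frame(df2[['Name', 'Age', 'Gender']])

idx = (df1.reset_index()
          .set_index(['Name', 'Age', 'Gender'])
          .loc[df2.set_index(['Name', 'Age', 'Gender']).index]['index']
          .values)
print(idx)   # [2 3]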

Comparing rows of string inside groupby and assigning a value to a new column pandas

I have a dataset of employees (their IDs) and the names of their bosses for several years.
df:
What I need to do is to see whether an employee had a change of boss. So, the desired output is:
For employees who appear in the df only once, I just assign 0 (no boss change). However, I cannot figure out how to do it for the employees who are in the df for several years.
I was thinking that first I need to assign 0 for the first year they appear in the df (because we do not know who the boss was before, so there is no boss change). Then I need to compare the name of the boss with the name in the next row and decide whether to assign 1 or 0 in the ManagerChange column.
So far I split the df into two (with unique IDs and duplicated IDs) and assigned 0 to ManagerChange for the unique IDs.
Then I groupby the duplicated IDs and sort them by year. However, I am new to Python and cannot figure out how to compare strings and assign a result value to a new column inside the groupby. Please, help.
Code I have so far:
# splitting database in two
bool_series = df["ID"].duplicated(keep=False)
df_duplicated=df[bool_series]
df_unique = df[~bool_series]
# assigning 0 for ManagerChange for the unique IDs
df_unique['ManagerChange'] = 0
# groupby by ID and sorting by year for the duplicated IDs
df_duplicated.groupby('ID').apply(lambda x: x.sort_values('Year'))
You can groupby, then shift() within each group and compare against the Boss column.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
# Compare Boss column with shifted Boss column
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1)).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
# Change the first in each group to 0
df.loc[df.groupby('ID').head(1).index, 'ManagerChange'] = 0
# print(df)
ID Year Boss ManagerChange
0 1234 2018 Anna 0
1 567 2019 Sarah 0
2 1234 2020 Michael 0
3 8976 2019 John 0
4 1234 2019 Michael 1
5 8976 2020 John 0
You could also make use of the fill_value argument; this will help you get rid of the last df.loc[] operation.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1, fill_value=group['Boss'].iloc[0])).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
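A slightly more compact variant of the same idea (a sketch; Series.shift(fill_value=...) needs pandas 0.24+) keeps everything aligned via groupby().transform, so no tolist() is needed:
# sort so that shift() compares consecutive years within each ID
df = df.sort_values(['ID', 'Year'])
df['ManagerChange'] = (df.groupby('ID')['Boss']
                         .transform(lambda s: s.ne(s.shift(fill_value=s.iloc[0])))
                         .astype(int))
# restore the original row order
df = df.sort_index()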

How to left join based on specific conditions in Python SQL?

I have 2 dataframes, df1 and df2 like this:
df1=
person_id
10001
...
10900
df2=
person_id month_1 place_1
10001 255 X
...
10900 2111 Y
10900 500 X
10900 200 X
I want to left join df2 on df1 only where place_1 is X, with the final value as the sum of month_1.
Like this :
newdf=
person_id month_1 place_1
10900 700 X
So far, I've thought of constructing my sqlite3 code as follows:
import sqlite3
conn=sqlite3.connect(':memory:')
crsr=conn.cursor()
qry='''
SELECT df1.*
FROM df1
left join df2 on sum(month_1)
WHERE UPPER(place_1) like '%X%'
group by df2.person_id
on df1.person_id = df2.person_id;
'''
new_df=pd.read_sql(qry,conn)
What is going wrong in my query approach? How should I implement my query logic correctly?
I'm learning how to use SQL to manage my data within Python. Any help would be greatly helpful!
If I got your question right, you are looking for all records in df2 with a place like X, summed up, and if that person has some records in df1 then pull those as well.
To do that, the following would get you the record set. (While aggregating, the non-grouped columns should be wrapped in an aggregating function such as MAX or MIN.)
SELECT df2.person_id
,sum(df2.month_1)
,max(df1.person_name)
FROM df2
LEFT JOIN df1
ON df2.person_id=df1.person_id
WHERE UPPER(df2.place_1) like '%X%'
GROUP BY df2.person_id
This is your mistake:
left join df2 on sum(month_1)
ON must be followed by a condition on which to join rows. sum(month_1) is not a condition, but a single value.
And while, say, sum(month_1) > 0 is a condition, it wouldn't work either, because you are joining single rows, and sum(month_1) is not a row's value, but an aggregation over several rows.
You have on df1.person_id = df2.person_id later, but the ON clause belongs with the JOIN, not at the end of the query.
What you want is to select SUM(df2.month_1), so put it in the SELECT clause. The following query gives you all df1 rows along with their month_1 sum for place X (or null, when there are no matching df2 entries for the person). Since place_1 is a df2 column, the place filter belongs in the ON clause; filtering df2.place_1 in WHERE would discard the non-matching df1 rows again.
SELECT df1.*, SUM(df2.month_1)
FROM df1
left join df2 ON df2.person_id = df1.person_id
             AND UPPER(df2.place_1) = 'X'
GROUP BY df1.person_id;
I don't know whether SQLite supports grouping by a key and selecting its functionally dependent columns (df1.*), though. If you only want to show df1.person_id then you should replace df1.* by df1.person_id. If you want more df1 columns and SQLite doesn't allow df1.*, then you may want to aggregate (and filter on place_1) before joining, which I consider good style anyway:
SELECT df1.*, d2.total
FROM df1
left join
(
    SELECT person_id, SUM(month_1) AS total
    FROM df2
    WHERE UPPER(place_1) = 'X'
    GROUP BY person_id
) d2 ON d2.person_id = df1.person_id;
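For comparison, the same aggregate-before-join idea expressed directly in pandas rather than SQL (a sketch, assuming place_1 and month_1 live in df2 as in the question's sample):
# sum month_1 per person for place X, then left-join the totals onto df1
totals = (df2[df2['place_1'].str.upper() == 'X']
            .groupby('person_id', as_index=False)['month_1'].sum())
newdf = df1.merge(totals, on='person_id', how='left')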
Try the query below; it doesn't join data, it just filters by place and by IDs present in df1:
select person_id, sum(month_1) from df2
where place_1 = 'X' and
exists(select 1 from df1
where person_id = df2.person_id)
group by person_id
or using in:
select person_id, sum(month_1) from df2
where place_1 = 'X' and
person_id in (select person_id from df1)
group by person_id
I assume that you want all the rows of df1 and this is why you use a LEFT join.
So the condition UPPER(df2.place_1) LIKE '%X%' should be set in the ON clause and not in the WHERE clause:
SELECT df1.person_id, SUM(month_1) AS month_1, MAX(place_1) place_1
FROM df1 LEFT JOIN df2
ON df1.person_id = df2.person_id AND UPPER(df2.place_1) LIKE '%X%'
GROUP BY df1.person_id;
If instead of NULLs you want 0s in the results for the non matching rows then change SUM(month_1) to:
COALESCE(SUM(month_1), 0)
See the demo.
Results:
| person_id | month_1 | place_1 |
| --------- | ------- | ------- |
| 10001 | 255 | X |
| 10900 | 700 | X |
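Whichever query you go with, note that the in-memory sqlite3 connection from the question starts out empty: the DataFrames have to be written into the database before pd.read_sql can see them. A minimal sketch of that setup (assuming df1 and df2 are existing pandas DataFrames and qry holds one of the corrected queries above):
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)   # materialize the DataFrames as SQL tables
df2.to_sql('df2', conn, index=False)
new_df = pd.read_sql(qry, conn)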

Update pandas DataFrame based on a different DataFrame

I have two pandas DataFrames:
df1
key id count
100 9821 7
200 9813 10
df2
nodekey nodeid
100 9821
200 9813
If the nodekey+nodeid in df2 match key+id in df1, count in df1 has to be set to 0. So, the result of the example above should be:
key id count
100 9821 0
200 9813 0
I tried the following (matching on key and nodekey only, as a test) but receive an error:
df1['count']=np.where((df1.key == df2.nodekey),0)
ValueError: either both or neither of x and y should be given
Suggestions?
This should work
df1.loc[df1[['key', 'id']].transform(tuple,1).isin(df2[['nodekey', 'nodeid']].transform(tuple,1)), "count"] = 0
which is basically using
df.loc[mask, 'count']=0
where mask is True for rows where tuple ('key', 'id') matches any tuple ('nodekey', 'nodeid')
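A variant of the same masking idea that may be faster on larger frames (a sketch; pd.MultiIndex.from_frame requires pandas 0.24+) builds MultiIndexes from the key columns instead of row tuples:
mask = pd.MultiIndex.from_frame(df1[['key', 'id']]).isin(
    pd.MultiIndex.from_frame(df2[['nodekey', 'nodeid']]))
df1.loc[mask, 'count'] = 0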
Merge the dataframes using a left merge (the rows that are present in df1 but not in df2 will be filled with NaNs):
combined = df1.merge(df2, left_on=['key', 'id'],
right_on=['nodekey', 'nodeid'], how='left')
Update the counts for the rows that are non-nan:
combined.loc[combined.nodekey.notnull(), 'count'] = 0
Cleanup the unwanted columns:
combined.drop(['nodekey', 'nodeid'], axis=1, inplace=True)
# key id count
#0 100 9821 0
#1 200 9813 0
#2 300 9855 7

Groupby Duplicated in python

I have a dataset of Order_ID and Item_ID.
Order_ID, Item_ID
101,121
101,121
101,223
101,234
I want to check which Item_ID came more than once in any particular Order.
output>
Order_ID, Item_ID, freq
101,121,2
Which would be the most efficient way to do this in python?
Use groupby with size or value_counts first, then filter by query or boolean indexing (faster in a larger DataFrame):
df1 = df.groupby(['Order_ID','Item_ID']).size().reset_index(name='freq').query('freq > 1')
Alternative:
df1=df.groupby('Order_ID')['Item_ID'].value_counts().reset_index(name='freq').query('freq>1')
Or:
df1 = df.groupby(['Order_ID','Item_ID']).size().reset_index(name='freq')
df1 = df1[df1['freq'] > 1]
print (df1)
Order_ID Item_ID freq
0 101 121 2
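On newer pandas (1.1+), DataFrame.value_counts gives a similar one-liner; a sketch of the same idea:
df1 = (df.value_counts(['Order_ID', 'Item_ID'])
         .reset_index(name='freq')
         .query('freq > 1'))
print (df1)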
