I'm trying to reshape my data. At first glance it sounds like a transpose, but it's not. I've tried melt, stack/unstack, joins, etc.
Use Case
I want to have only one row per unique individual, and put all job history on the columns. For clients, it can be easier to read information across rows rather than reading through columns.
Here's the data:
import pandas as pd
import numpy as np
data1 = {'Name': ["Joe", "Joe", "Joe", "Jane", "Jane"],
         'Job': ["Analyst", "Manager", "Director", "Analyst", "Manager"],
         'Job Eff Date': ["1/1/2015", "1/1/2016", "7/1/2016", "1/1/2015", "1/1/2016"]}
df2 = pd.DataFrame(data1, columns=['Name', 'Job', 'Job Eff Date'])
df2
Here's what I want it to look like:
Desired Output Table
.T within groupby
def tgrp(df):
    df = df.drop('Name', axis=1)
    return df.reset_index(drop=True).T
df2.groupby('Name').apply(tgrp).unstack()
Explanation
groupby returns an object that contains information on how the original series or dataframe has been grouped. Instead of performing a groupby with a subsequent action of some sort, we could first assign df2.groupby('Name') to a variable (I often do), say gb.
gb = df2.groupby('Name')
On this object gb we could call .mean() to get an average of each group. Or .last() to get the last element (row) of each group. Or .transform(lambda x: (x - x.mean()) / x.std()) to get a zscore transformation within each group. When there is something you want to do within a group that doesn't have a predefined function, there is still .apply().
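For example (a small sketch on the gb defined above; note that this particular frame has no numeric columns, so .mean() or a z-score .transform() would need numeric data to work on):
gb.last()   # last row of each group
gb.size()   # number of rows in each group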
.apply() for a groupby object is different than it is for a dataframe. For a dataframe, .apply() takes a callable object as its argument and applies that callable to each column (or row) of the object. The object that is passed to that callable is a pd.Series. When you are using .apply in a dataframe context, it is helpful to keep this fact in mind. In the context of a groupby object, the object passed to the callable is a dataframe. In fact, that dataframe is one of the groups specified by the groupby.
When I write such functions to pass to groupby.apply, I typically define the parameter as df to reflect that it is a dataframe.
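A quick way to see the difference (a minimal sketch using the df2 defined above) is to ask each flavour of .apply what kind of object it hands to the callable:
# DataFrame.apply passes each column as a pd.Series
df2.apply(lambda col: type(col).__name__)
# GroupBy.apply passes each group as a pd.DataFrame
df2.groupby('Name').apply(lambda g: type(g).__name__)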
Ok, so we have:
df2.groupby('Name').apply(tgrp)
This generates a sub-dataframe for each 'Name' and passes that sub-dataframe to the function tgrp. Then the groupby object recombines all such groups having gone through the tgrp function back together again.
It'll look roughly like this (output reconstructed from the code and data above; exact spacing may differ):
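                          0         1         2
Name
Jane Job            Analyst   Manager       NaN
     Job Eff Date  1/1/2015  1/1/2016       NaN
Joe  Job            Analyst   Manager  Director
     Job Eff Date  1/1/2015  1/1/2016  7/1/2016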
I took the OP's original attempt to simply transpose to heart, but I had to do some things first. Suppose I had simply transposed each group:
df2[df2.Name == 'Jane'].T
df2[df2.Name == 'Joe'].T
Combining these manually (without groupby):
pd.concat([df2[df2.Name == 'Jane'].T, df2[df2.Name == 'Joe'].T])
Whoa! Now that's ugly. Obviously the index values of [0, 1, 2] don't mesh with [3, 4]. So let's reset.
pd.concat([df2[df2.Name == 'Jane'].reset_index(drop=True).T,
           df2[df2.Name == 'Joe'].reset_index(drop=True).T])
That's much better. But now we are getting into the territory groupby was intended to handle. So let it handle it.
Back to
df2.groupby('Name').apply(tgrp)
The only thing missing here is that we want to unstack the results to get the desired output.
Say you start by unstacking:
df2 = df2.set_index(['Name', 'Job']).unstack()
>>> df2
     Job Eff Date
Job       Analyst  Director   Manager
Name
Jane     1/1/2015      None  1/1/2016
Joe      1/1/2015  7/1/2016  1/1/2016
Now, to make things easier, flatten the multi-index:
df2.columns = df2.columns.get_level_values(1)
>>> df2
Job Analyst Director Manager
Name
Jane 1/1/2015 None 1/1/2016
Joe 1/1/2015 7/1/2016 1/1/2016
Now, just manipulate the columns:
cols = []
for i, c in enumerate(df2.columns):
    col = 'Job %d' % i
    df2[col] = c
    cols.append(col)
    col = 'Eff Date %d' % i
    df2[col] = df2[c]
    cols.append(col)
>>> df2[cols]
Job Job 0 Eff Date 0 Job 1 Eff Date 1 Job 2 Eff Date 2
Name
Jane Analyst 1/1/2015 Director None Manager 1/1/2016
Joe Analyst 1/1/2015 Director 7/1/2016 Manager 1/1/2016
Edit
Jane was never a director (alas). The above code states that Jane became Director at None date. To change the result so that it specifies that Jane became None at None date (which is a matter of taste), replace
df2[col] = c
by
df2[col] = [None if d is None else c for d in df2[c]]
This gives
Job Job 0 Eff Date 0 Job 1 Eff Date 1 Job 2 Eff Date 2
Name
Jane Analyst 1/1/2015 None None Manager 1/1/2016
Joe Analyst 1/1/2015 Director 7/1/2016 Manager 1/1/2016
Here is a possible workaround. Here, I first create a dictionary of the proper form and create a DataFrame based on the new dictionary:
df = pd.DataFrame(data1)
dic = {}
for name, jobs in df.groupby('Name').groups.items():  # .iteritems() only exists in Python 2
    if not dic:
        dic['Name'] = []
    dic['Name'].append(name)
    for j, job in enumerate(jobs, 1):
        jobstr = 'Job {0}'.format(j)
        jobeffdatestr = 'Job Eff Date {0}'.format(j)
        if jobstr not in dic:
            dic[jobstr] = [''] * (len(dic['Name']) - 1)
            dic[jobeffdatestr] = [''] * (len(dic['Name']) - 1)
        dic[jobstr].append(df['Job'].loc[job])              # .ix is deprecated; use .loc
        dic[jobeffdatestr].append(df['Job Eff Date'].loc[job])
df2 = pd.DataFrame(dic).set_index('Name')
##          Job 1    Job 2     Job 3 Job Eff Date 1 Job Eff Date 2 Job Eff Date 3
## Name
## Jane   Analyst  Manager                 1/1/2015       1/1/2016
## Joe    Analyst  Manager  Director       1/1/2015       1/1/2016       7/1/2016
g = df2.groupby('Name').groups
names = list(g.keys())
data2 = {'Name': names}
cols = ['Name']
temp1 = [g[y] for y in names]
job_str = 'Job'
job_date_str = 'Job Eff Date'
for i in range(max(len(x) for x in g.values())):
    temp = [x[i] if len(x) > i else '' for x in temp1]
    job_str_curr = job_str + str(i + 1)
    job_date_curr = job_date_str + str(i + 1)
    # .ix is gone from modern pandas; .reindex keeps the old "missing label -> NaN" behaviour
    data2[job_str_curr] = df2[job_str].reindex(temp).values
    data2[job_date_curr] = df2[job_date_str].reindex(temp).values
    cols.extend([job_str_curr, job_date_curr])
df3 = pd.DataFrame(data2, columns=cols)
df3 = df3.fillna('')
print(df3)
   Name     Job1 Job Eff Date1     Job2 Job Eff Date2      Job3 Job Eff Date3
0  Jane  Analyst      1/1/2015  Manager      1/1/2016
1   Joe  Analyst      1/1/2015  Manager      1/1/2016  Director      7/1/2016
This is not exactly what you were asking but here is a way to print the data frame as you wanted:
df = pd.DataFrame(data1)
for name, jobs in df.groupby('Name').groups.items():
    print('{0:<15}'.format(name), end='')
    for job in jobs:
        print('{0:<15}{1:<15}'.format(df['Job'].loc[job], df['Job Eff Date'].loc[job]), end='')
    print()
## Jane           Analyst        1/1/2015       Manager        1/1/2016
## Joe            Analyst        1/1/2015       Manager        1/1/2016       Director       7/1/2016
Diving into piRSquared's answer...
def tgrp(df):
    df = df.drop('Name', axis=1)
    print(df, '\n')
    out = df.reset_index(drop=True)
    print(out, '\n')
    print(out.T, '\n\n')
    return out.T
dfxx = df2.groupby('Name').apply(tgrp).unstack()
dfxx
The output is shown below. Why does pandas repeat the first group? Is this a bug?
Job Job Eff Date
3 Analyst 1/1/2015
4 Manager 1/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
0 1
Job Analyst Manager
Job Eff Date 1/1/2015 1/1/2016
Job Job Eff Date
3 Analyst 1/1/2015
4 Manager 1/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
0 1
Job Analyst Manager
Job Eff Date 1/1/2015 1/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
2 Director 7/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
2 Director 7/1/2016
0 1 2
Job Analyst Manager Director
Job Eff Date 1/1/2015 1/1/2016 7/1/2016
Related
I have read this post and would like to do something similar.
I have 2 dfs:
df1:

file_num    city        address_line
1           Toronto     123 Fake St
2           Montreal    456 Sample Ave
df2:

DB_Num    Address
AB1       Toronto 123 Fake St
AB3       789 Random Drive, Toronto
I want to know which DB_Num in df2 matches the address_line and city in df1, and include which file_num the match was from.
My ideal output is:

file_num    city       address_line    DB_Num    Address
1           Toronto    123 Fake St     AB1       Toronto 123 Fake St
Based on the above linked post, I have made a look-ahead regex, and am searching using insert and str.extract.
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ')', expand=False))
Since the addresses in df2 are entered manually, they are sometimes out of order, which is why I am using the look-ahead form of the regex. The look-ahead causes str.extract to not output any value, although I can still filter out nulls and keep only the correct matches. My main problem is that I have no way to join back to df1 to get the file_num.
I can solve this with a for loop, iterating over each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!
Start by creating a new series which is the row each "Address" in df2 corresponds to "address_line" in df1, if such a row exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
#output:
0    123 Fake St
1            NaN
Name: Address, dtype: object
Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
   file_num     city address_line DB_Num              Address
0         1  Toronto  123 Fake St    AB1  Toronto 123 Fake St
I am trying to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions:
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    df.loc[i, 'newby'] = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) &
                                    (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition there is static. I would like to avoid introducing "dummy columns", such as transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates, since there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could do:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
I have a df with US citizens' states and I would like to use that as a lookup for world citizens.
df1=
[Sam, New York;
Nick, California;
Sarah, Texas]
df2 =
[Sam;
Phillip;
Will;
Sam]
I would like to either use df2.replace() with the states or create a df3 where my output is:
[New York;
NaN;
NaN;
New York]
I have tried mapping with set_index and dict(zip()) but have had no luck so far.
Thank you.
How about this method:
import pandas as pd

df1 = pd.DataFrame([['Sam', 'New York'], ['Nick', 'California'], ['Sarah', 'Texas']],
                   columns=['name', 'state'])
display(df1)

df2 = pd.DataFrame(['Sam', 'Phillip', 'Will', 'Sam'], columns=['name'])
display(df2)

df2.merge(right=df1, left_on='name', right_on='name', how='left')
resulting in
name state
0 Sam New York
1 Nick California
2 Sarah Texas
name
0 Sam
1 Phillip
2 Will
3 Sam
name state
0 Sam New York
1 Phillip NaN
2 Will NaN
3 Sam New York
You can then filter for just the state column in the merged dataframe.
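If you prefer the set_index/map route mentioned in the question, a minimal sketch (using the df1 and df2 defined above) would be:
# Build a name -> state lookup from df1, then map it onto df2['name'];
# names with no match (Phillip, Will) come back as NaN.
state_lookup = df1.set_index('name')['state']
df2['state'] = df2['name'].map(state_lookup)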
Let's say I have the following data set, turned into a dataframe:
import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])
This yields a dataframe that looks like:
Job Date Employee Manager
0 Job 1 2019-06-09 Jim Tom
1 Job 1 2019-06-09 Bill Tom
2 Job 1 2019-06-09 Tom Tom
3 Job 1 2019-06-10 Bill None
4 Job 2 2019-06-10 Tom Tom
What I am trying to generate is a pivot on each unique Job/Date combo, with a column for Manager, and a column for a string with comma separated, non-manager employees. A couple of things to assume:
All employee names are unique (I'll actually be using unique employee ids rather than names), and Managers are also "employees", so there will never be a case with an employee and a manager sharing the same name/id, but being different individuals.
A work crew can have a manager, or not (see row with id 3, for an example without)
A manager will always also be listed as an employee (see row with id 2 or 4)
A job could have a manager, with no additional employees (see row id 4)
I'd like the resulting dataframe to look like:
Job Date Manager Employees
0 Job 1 2019-06-09 Tom Jim, Bill
1 Job 1 2019-06-10 None Bill
2 Job 2 2019-06-10 Tom None
Which leads to my questions:
Is there a way to do a ','.join like aggregation in a pandas pivot?
Is there a way to make this aggregation conditional (exclude the name/id in the manager column)
I suspect 1) is possible, and 2) might be more difficult. If 2) is a no, I can get around it in other ways later in my code.
The tricky part here is removing the Manager from the Employee column.
u = df.melt(['Job', 'Date'])
f = u[~u.duplicated(['Job', 'Date', 'value'], keep='last')].astype(str)

f.pivot_table(
    index=['Job', 'Date'],
    columns='variable', values='value',
    aggfunc=','.join
).rename_axis(None, axis=1)
Employee Manager
Job Date
Job 1 2019-06-09 Jim,Bill Tom
2019-06-10 Bill None
Job 2 2019-06-10 NaN Tom
Group to aggregate, then fix the Employees by removing the Manager and setting to None where appropriate. Since the employees are unique, sets will work nicely here to remove the Manager.
s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
s['Employee'] = [', '.join(x.difference({y})) for x,y in zip(s.Employee, s.Manager)]
s['Employee'] = s.Employee.replace({'': None})
Manager Employee
Job Date
Job 1 2019-06-09 Tom Jim, Bill
2019-06-10 None Bill
Job 2 2019-06-10 Tom None
I'm partial to building a dictionary up with the desired results and reconstructing the dataframe.
d = {}
for t in df.itertuples():
    d_ = d.setdefault((t.Job, t.Date), {})
    d_['Manager'] = t.Manager
    d_.setdefault('Employees', set()).add(t.Employee)

for k, v in d.items():
    v['Employees'] -= {v['Manager']}
    v['Employees'] = ', '.join(v['Employees'])

pd.DataFrame(d.values(), d).rename_axis(['Job', 'Date']).reset_index()
     Job        Date  Employees Manager
0  Job 1  2019-06-09  Bill, Jim     Tom
1  Job 1  2019-06-10       Bill    None
2  Job 2  2019-06-10                Tom
In your case, try mask + groupby.transform + drop_duplicates, with no lambda needed:
df['Employee'] = (df['Employee'].mask(df['Employee'].eq(df.Manager))
                                .dropna()
                                .groupby([df['Job'], df['Date']])
                                .transform('unique')
                                .str.join(','))
df = df.drop_duplicates(['Job', 'Date'])
df
df
Out[745]:
Job Date Employee Manager
0 Job 1 2019-06-09 Jim,Bill Tom
3 Job 1 2019-06-10 Bill None
4 Job 2 2019-06-10 NaN Tom
How about:
df.groupby(["Job","Date","Manager"]).apply( lambda x: ",".join(x.Employee))
This will find all unique combinations of Job, Date, and Manager and join the employees for each into one comma-separated string.
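One caveat (check this against your pandas version): groupby drops rows whose key is None/NaN by default, so the Job 1 / 2019-06-10 crew with no manager disappears from this result. On pandas 1.1+ a rough way to keep those rows is:
# dropna=False keeps groups whose Manager is None (requires pandas >= 1.1)
df.groupby(["Job", "Date", "Manager"], dropna=False).apply(lambda x: ",".join(x.Employee))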
I have 2 DataFrames:
PROJECT1
key name deadline delivered
0 AA1 Tom 01/05/2018 02/05/2018
1 AA2 Sue 01/05/2018 30/04/2018
2 AA4 Jack 01/05/2018 04/05/2018
PROJECT2
key name deadline delivered
0 AA1 Tom 01/05/2018 30/04/2018
1 AA2 Sue 01/05/2018 30/04/2018
2 AA3 Jim 01/05/2018 03/05/2018
Is it possible to create a column in PROJECT2 named 'In PROJECT1' and apply a condition such as:
pseudo code
for row in PROJECT2:
    if in the same row based on key column PROJECT1['delivered'] >= PROJECT2['deadline']:
        PROJECT2['In PROJECT1'] = 'project delivered before deadline'
    else:
        'Project delayed'
expected result
key name deadline delivered In PROJECT1
0 AA1 Tom 01/05/2018 30/04/2018 Project delayed
1 AA2 Sue 01/05/2018 30/04/2018 project delivered before deadline
2 AA3 Jim 01/05/2018 03/05/2018 NaN
I'm not sure how to approach it (iterrows(), a for loop, df.loc[conditions], np.where(), or perhaps I need to define some kind of function to use in df.apply()); any help is highly appreciated.
You can use numpy.select to add a series with a list of conditions and values.
Note I believe you have your desired criteria reversed, i.e. delivered before deadline should give "project delivered before deadline" rather than vice versa.
import numpy as np

# convert series to datetime if necessary
for col in ['deadline', 'delivered']:
    df1[col] = pd.to_datetime(df1[col], dayfirst=True)
    df2[col] = pd.to_datetime(df2[col], dayfirst=True)

# create series mapping key to delivered date in df1
s = df1.set_index('key')['delivered']

# define conditions and values
conditions = [~df2['key'].isin(s.index), df2['key'].map(s) <= df2['deadline']]
values = [np.nan, 'project delivered before deadline']

# apply conditions and values, with fallback value
df2['In Project1'] = np.select(conditions, values, 'Project delayed')
print(df2)
key name deadline delivered In Project1
0 AA1 Tom 2018-05-01 2018-04-30 Project delayed
1 AA2 Sue 2018-05-01 2018-04-30 project delivered before deadline
2 AA3 Jim 2018-05-01 2018-05-03 nan
Here is an alternative approach that joins the two data sets. This avoids the need for any loop and will be faster.
## join the two data sets
# p1 = Project 1
# p2 = Project 2
p3 = p2.merge(p1.loc[:,['key','delivered']], on='key',how='left', suffixes=['_p2','_p1'])
p3['In PROJECT1'] = np.where((p3['delivered_p1'] >= p3['delivered_p2']),'project delivered before deadline','Project delayed')
# handle cases with NA
set_to_na = p3[['delivered_p1','delivered_p2']].isnull().any(axis=1).values.tolist()
p3['In PROJECT1'].iloc[set_to_na] = np.nan
## remove unwanted columns and rename
p3.drop('delivered_p1', axis=1, inplace=True)
p3.rename(columns={'delivered_p2':'delivered'}, inplace=True)
print(p3)
key name deadline delivered In PROJECT1
0 AA1 Tom 01/05/2018 30/04/2018 Project delayed
1 AA2 Sue 01/05/2018 30/04/2018 project delivered before deadline
2 AA3 Jim 01/05/2018 03/05/2018 NaN
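A small note on the NA handling above: assigning through p3['In PROJECT1'].iloc[...] is chained indexing, which can raise SettingWithCopyWarning and may not take effect under newer pandas with copy-on-write; an equivalent .loc form (run before dropping delivered_p1) would be:
# Equivalent NA handling without chained assignment
na_mask = p3[['delivered_p1', 'delivered_p2']].isnull().any(axis=1)
p3.loc[na_mask, 'In PROJECT1'] = np.nan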