I have the following three pandas DataFrames. I want to replace the values in the company and division columns with the ids from the respective company and division DataFrames.
pd_staff:
id name company division
P001 John Sunrise Headquarter
P002 Jane Falcon Digital Research & Development
P003 Joe Ashford Finance
P004 Adam Falcon Digital Sales
P004 Barbara Sunrise Human Resource
pd_company:
id name
1 Sunrise
2 Falcon Digital
3 Ashford
pd_division:
id name
1 Headquarter
2 Research & Development
3 Finance
4 Sales
5 Human Resource
This is the end result that I am trying to produce
id name company division
P001 John 1 1
P002 Jane 2 2
P003 Joe 3 3
P004 Adam 2 4
P004 Barbara 1 5
I have tried to combine Staff and Company using this code
pd_staff.loc[pd_staff['company'].isin(pd_company['name']), 'company'] = pd_company.loc[pd_company['name'].isin(pd_staff['company']), 'id']
which produces
id name company
P001 John 1.0
P002 Jane NaN
P003 Joe NaN
P004 Adam NaN
P004 Barbara NaN
Your attempt fails because the assignment aligns on index labels, not on the matching names, which is why most rows come out NaN. Instead, you can do:
pd_staff['company'] = pd_staff['company'].map(pd_company.set_index('name')['id'])
pd_staff['division'] = pd_staff['division'].map(pd_division.set_index('name')['id'])
print(pd_staff)
id name company division
0 P001 John 1 1
1 P002 Jane 2 2
2 P003 Joe 3 3
3 P004 Adam 2 4
4 P004 Barbara 1 5
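If a company or division name ever fails to match the lookup table, map leaves NaN behind. A small sketch, assuming you want to keep the original string in that case, chains fillna:
company_ids = pd_company.set_index('name')['id']
pd_staff['company'] = pd_staff['company'].map(company_ids).fillna(pd_staff['company'])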
This will achieve the desired result (here df, df2 and df3 stand for pd_staff, pd_company and pd_division):
df_merge = df.merge(df2, how = 'inner', right_on = 'name', left_on = 'company', suffixes=('', '_y'))
df_merge = df_merge.merge(df3, how = 'inner', left_on = 'division', right_on = 'name', suffixes=('', '_z'))
df_merge = df_merge[['id', 'name', 'id_y', 'id_z']]
df_merge.columns = ['id', 'name', 'company', 'division']
df_merge.sort_values('id')
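One caveat, assuming some staff rows could reference a company or division missing from the lookup tables: how='inner' silently drops those rows. A sketch with left joins keeps them instead (their id_y/id_z values simply come out NaN):
df_merge = df.merge(df2, how='left', left_on='company', right_on='name', suffixes=('', '_y'))
df_merge = df_merge.merge(df3, how='left', left_on='division', right_on='name', suffixes=('', '_z'))
The remaining selection and renaming steps are unchanged.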
First, let's modify the company and division DataFrames a little bit:
df2.rename(columns={'name':'company'},inplace=True)
df3.rename(columns={'name':'division'},inplace=True)
Then
df1=df1.merge(df2,on='company',how='left').merge(df3,on='division',how='left')
df1=df1[['id_x','name','id_y','id']]
df1.rename(columns={'id_x':'id','id_y':'company','id':'division'},inplace=True)
Use apply: you can have a function that replaces the values. From the second Excel file you pass the field to look up and the value to replace it with. Here I am replacing Sunrise with 1 because that is its id in the second file.
import pandas as pd
df = pd.read_excel('teste.xlsx')
df2 = pd.read_excel('ids.xlsx')
def altera(df33, field='Sunrise', new_field='1'):  # for demonstration purposes I left default values, but they should come from the second Excel file
    return df33.replace(field, new_field)
df.loc[:, 'company'] = df['company'].apply(altera)
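A more general sketch of the same idea, assuming ids.xlsx has name and id columns, builds the whole lookup from the second file instead of hard-coding one pair:
mapping = dict(zip(df2['name'], df2['id']))
df['company'] = df['company'].replace(mapping)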
How do I turn the headers inside the rows into columns?
For example, I have the DataFrame below and would like to reshape it into the output shown after the EDIT.
EDIT:
Code to produce current df example
import pandas as pd
df = pd.DataFrame({'Date':[2020,2021,2022], 'James':'', ' Sales': [3,4,5], ' City':'NY', ' DIV':'a', 'KIM':'', ' Sales ': [3,4,5], ' City ':'SF', ' DIV ':'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a name (e.g. James) has no indent, it turns into a column, until another non-indented value (KIM) is hit. So I need a way to move each non-indented header into a new column, stopping whenever a new header (KIM) comes up.
Edit 2: there are not only two names (KIM or JAMES); there are around 20, and not only the three second levels (Sales, City, DIV). Different names have more than 3 second levels; some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})
def rows_to_columns(group):
    # For every label in the block that is neither the person's name nor
    # 'Sales', copy its value into a temporary column, spread it across the
    # whole group with ffill/bfill, then drop the temporary column.
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group
df.loc[df['2020']=='', 'person'] = df.date
df.person = df.person.fillna(method='ffill')
new_df = (df
          .groupby('person')
          .apply(lambda x: rows_to_columns(x))
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to:
1. Copy the name into a separate column and fill that column using .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name. Otherwise it wreaks havoc.
2. Convert all other values, such as 'DIV' and 'City', with rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies the value from the row into a temp column, creates a new column for that row and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is in the intended format once the column 'Sales' is dropped.
Note: This solution probably does not work well on larger datasets.
You gave more details, and I see you are not working with multi-level indexes, so the best approach would be to create the DataFrame in the format you need from the start. The way you are creating the first DataFrame is not well structured: the information is not indexed by name (James/KIM), since the names are columns with empty values and no link to the other values, and the stacking relies on blank spaces inside strings. Take a look at multi-indexing and generate a DataFrame you can work with, or create the DataFrame in the final format you need directly.
-- Answer considering multi-level indexes --
Using the little information provided, I see your DataFrame is stacked, meaning it has multiple index levels: the first level is the person (James/KIM) and the second level is Sales/City/DIV. So your DataFrame should be created like this:
import pandas
multi_index = pandas.MultiIndex.from_tuples([
('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = { '2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi_level DataFrame, you have many ways to transform it. This is what we will do to make it one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changing across years, so we can reduce the City and DIV DataFrames to Series, taking the first year as the reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
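For what it's worth, the same steps can be sketched more compactly by chaining xs and join (same assumptions: City and DIV are constant across years, with 2020 taken as reference):
out = (df.xs('Sales', level=1)
         .join(df.xs('City', level=1)['2020'].rename('City'))
         .join(df.xs('DIV', level=1)['2020'].rename('DIV'))
         .assign(Account='Sales'))
print(out[['City', 'DIV', 'Account', '2020', '2021', '2022']])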
Let's say I have a dataframe of leads as such:
import pandas as pd
leads = {'Unique Identifier':['1','2','3','4','5','6','7','8'],
'Name': ['brad','stacy','holly','mike','phil', 'chris','jane','glenn'],
'Channel': [None,None,None,None,'facebook', 'facebook','google', 'facebook'],
'Campaign': [None,None,None,None,'A', 'B','B', 'C'],
'Gender': ['M','F','F','M','M', 'M','F','M'],
'Signup Month':['Mar','Mar','Apr','May','May','May','Jun','Jun']
}
leads_df = pd.DataFrame(leads)
leads_df
which has missing data for Channel and Campaign for the first 4 leads.
I have a separate dataframe with the missing data:
missing = {'Unique Identifier':['1','2','3','4'],
'Channel': ['google', 'email','facebook', 'google'],
'Campaign': ['B', 'A','C', 'B']
}
missing_df = pd.DataFrame(missing)
missing_df
Using the Unique Identifiers in both tables, how would I go about plugging in the missing data into the main leads table? For context there are about 6,000 leads with missing data.
You can merge the two dataframes together, update the columns using the results from the merge and then proceed to drop the merged columns.
data = leads_df.merge(missing_df, how='outer', on='Unique Identifier')
data['Channel'] = data['Channel_y'].fillna(data['Channel_x'])
data['Campaign'] = data['Campaign_y'].fillna(data['Campaign_x'])
data.drop(['Channel_x', 'Channel_y', 'Campaign_x', 'Campaign_y'], axis=1, inplace=True)
The result:
data
Unique Identifier Name Gender Signup Month Channel Campaign
0 1 brad M Mar google B
1 2 stacy F Mar email A
2 3 holly F Apr facebook C
3 4 mike M May google B
4 5 phil M May facebook A
5 6 chris M May facebook B
6 7 jane F Jun google B
7 8 glenn M Jun facebook C
You can set the index of both DataFrames to the unique identifier and use combine_first to fill the null values in leads_df:
(leads_df
.set_index("Unique Identifier")
.combine_first(missing_df.set_index("Unique Identifier"))
.reset_index()
)
The approach I use in this kind of case is similar to Excel's VLOOKUP function.
leads_df.loc[leads_df.Channel.isna(),'Channel']=pd.merge(leads_df.loc[leads_df.Channel.isna(),'Unique Identifier'],
missing_df,
how='left')['Channel']
This code will result in:
Unique Identifier Name Channel Campaign Gender Signup Month
0 1 brad google None M Mar
1 2 stacy email None F Mar
2 3 holly facebook None F Apr
3 4 mike google None M May
4 5 phil facebook A M May
5 6 chris facebook B M May
6 7 jane google B F Jun
7 8 glenn facebook C M Jun
You can do the same for 'Campaign'.
You just need to fill it using fillna() ...
leads_df.fillna(missing_df, inplace=True)
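Note that fillna with a DataFrame argument aligns on index and columns, so this works here only because rows 0-3 of both frames happen to describe the same leads. A safer sketch aligns on the identifier explicitly:
leads_df = (leads_df.set_index('Unique Identifier')
            .fillna(missing_df.set_index('Unique Identifier'))
            .reset_index())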
There is a pandas DataFrame method for this called combine_first:
voltron = leads_df.combine_first(missing_df)
Like fillna, combine_first aligns on the index, so set 'Unique Identifier' as the index on both frames first (as in the answer above) if the row order ever differs.
Let's say I have the following data set, turned into a DataFrame:
data = [
['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
['Job 2', datetime.date(2019,6,10), 'Tom', 'Tom']
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])
This yields a dataframe that looks like:
Job Date Employee Manager
0 Job 1 2019-06-09 Jim Tom
1 Job 1 2019-06-09 Bill Tom
2 Job 1 2019-06-09 Tom Tom
3 Job 1 2019-06-10 Bill None
4 Job 2 2019-06-10 Tom Tom
What I am trying to generate is a pivot on each unique Job/Date combo, with a column for Manager and a column holding a comma-separated string of the non-manager employees. A couple of things to assume:
- All employee names are unique (I'll actually be using unique employee ids rather than names), and managers are also "employees", so there will never be a case of an employee and a manager sharing the same name/id but being different individuals.
- A work crew can have a manager, or not (see the row with id 3 for an example without).
- A manager will always also be listed as an employee (see the rows with id 2 or 4).
- A job could have a manager with no additional employees (see the row with id 4).
I'd like the resulting dataframe to look like:
Job Date Manager Employees
0 Job 1 2019-06-09 Tom Jim, Bill
1 Job 1 2019-06-10 None Bill
2 Job 2 2019-06-10 Tom None
Which leads to my questions:
1. Is there a way to do a ','.join-like aggregation in a pandas pivot?
2. Is there a way to make this aggregation conditional (excluding the name/id in the manager column)?
I suspect 1) is possible, and 2) might be more difficult. If 2) is a no, I can get around it in other ways later in my code.
The tricky part here is removing the Manager from the Employee column.
u = df.melt(['Job', 'Date'])
f = u[~u.duplicated(['Job', 'Date', 'value'], keep='last')].astype(str)
f.pivot_table(
index=['Job', 'Date'],
columns='variable', values='value',
aggfunc=','.join
).rename_axis(None, axis=1)
Employee Manager
Job Date
Job 1 2019-06-09 Jim,Bill Tom
2019-06-10 Bill None
Job 2 2019-06-10 NaN Tom
Group to aggregate, then fix the Employees by removing the Manager and setting to None where appropriate. Since the employees are unique, sets will work nicely here to remove the Manager.
s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
s['Employee'] = [', '.join(x.difference({y})) for x,y in zip(s.Employee, s.Manager)]
s['Employee'] = s.Employee.replace({'': None})
Manager Employee
Job Date
Job 1 2019-06-09 Tom Jim, Bill
2019-06-10 None Bill
Job 2 2019-06-10 Tom None
I'm partial to building a dictionary up with the desired results and reconstructing the dataframe.
d = {}
for t in df.itertuples():
    d_ = d.setdefault((t.Job, t.Date), {})
    d_['Manager'] = t.Manager
    d_.setdefault('Employees', set()).add(t.Employee)
for k, v in d.items():
    v['Employees'] -= {v['Manager']}
    v['Employees'] = ', '.join(v['Employees'])
pd.DataFrame(list(d.values()), pd.MultiIndex.from_tuples(d.keys())).rename_axis(['Job', 'Date']).reset_index()
Job Date Employees Manager
0 Job 1 2019-06-09 Bill, Jim Tom
1 Job 1 2019-06-10 Bill None
2 Job 2 2019-06-10 Tom
In your case, try not using a lambda: transform + drop_duplicates will do it.
df['Employee'] = (df['Employee']
                  .mask(df['Employee'].eq(df.Manager))
                  .dropna()
                  .groupby([df['Job'], df['Date']])
                  .transform('unique')
                  .str.join(','))
df=df.drop_duplicates(['Job','Date'])
df
Out[745]:
Job Date Employee Manager
0 Job 1 2019-06-09 Jim,Bill Tom
3 Job 1 2019-06-10 Bill None
4 Job 2 2019-06-10 NaN Tom
How about:
df.groupby(["Job","Date","Manager"]).apply( lambda x: ",".join(x.Employee))
This will find all unique combinations of Job, Date and Manager and join the employees with ',' into one string. Note, though, that groupby drops rows with a missing key by default (so the Job 1 / 2019-06-10 crew, whose Manager is None, disappears), and that the manager is included in the joined string.
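A sketch that works around both points, assuming pandas >= 1.1 for the dropna flag (x.name[2] is the group's Manager key, used here to exclude the manager from the join):
out = (df.groupby(['Job', 'Date', 'Manager'], dropna=False)
         .apply(lambda x: ','.join(x.loc[x.Employee != x.name[2], 'Employee'])))
print(out)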
I'm trying to reshape my data. At first glance, it sounds like a transpose, but it's not. I tried melts, stack/unstack, joins, etc.
Use Case
I want to have only one row per unique individual, and put all job history on the columns. For clients, it can be easier to read information across rows rather than reading through columns.
Here's the data:
import pandas as pd
import numpy as np
data1 = {'Name': ["Joe", "Joe", "Joe","Jane","Jane"],
'Job': ["Analyst","Manager","Director","Analyst","Manager"],
'Job Eff Date': ["1/1/2015","1/1/2016","7/1/2016","1/1/2015","1/1/2016"]}
df2 = pd.DataFrame(data1, columns=['Name', 'Job', 'Job Eff Date'])
df2
Here's what I want it to look like:
[Desired output: one row per Name, with the job history spread across columns Job 1, Job Eff Date 1, Job 2, Job Eff Date 2, ...]
.T within groupby
def tgrp(df):
    df = df.drop('Name', axis=1)
    return df.reset_index(drop=True).T
df2.groupby('Name').apply(tgrp).unstack()
Explanation
groupby returns an object that contains information on how the original series or dataframe has been grouped. Instead of performing a groupby with a subsequent action of some sort, we could first assign df2.groupby('Name') to a variable (I often do), say gb.
gb = df2.groupby('Name')
On this object gb we could call .mean() to get an average of each group. Or .last() to get the last element (row) of each group. Or .transform(lambda x: (x - x.mean()) / x.std()) to get a zscore transformation within each group. When there is something you want to do within a group that doesn't have a predefined function, there is still .apply().
.apply() for a groupby object is different than it is for a dataframe. For a dataframe, .apply() takes a callable object as its argument and applies that callable to each column (or row) in the object. The object that is passed to that callable is a pd.Series. When you are using .apply in a dataframe context, it is helpful to keep this fact in mind. In the context of a groupby object, the object passed to the callable is a dataframe. In fact, that dataframe is one of the groups specified by the groupby.
When I write such functions to pass to groupby.apply, I typically define the parameter as df to reflect that it is a dataframe.
Ok, so we have:
df2.groupby('Name').apply(tgrp)
This generates a sub-dataframe for each 'Name' and passes that sub-dataframe to the function tgrp. Then the groupby object recombines all such groups having gone through the tgrp function back together again.
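A quick way to see exactly what apply receives, using the gb object from above (iterating over a groupby yields (name, sub-dataframe) pairs):
for name, group in gb:
    print(name)
    print(group, '\n')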
I took the OP's original attempt to simply transpose to heart. But I had to do some things first. Had I simply done:
df2[df2.Name == 'Jane'].T
df2[df2.Name == 'Joe'].T
Combining these manually (without groupby):
pd.concat([df2[df2.Name == 'Jane'].T, df2[df2.Name == 'Joe'].T])
Whoa! Now that's ugly. Obviously the index values of [0, 1, 2] don't mesh with [3, 4]. So let's reset.
pd.concat([df2[df2.Name == 'Jane'].reset_index(drop=True).T,
df2[df2.Name == 'Joe'].reset_index(drop=True).T])
That's much better. But now we are getting into the territory groupby was intended to handle. So let it handle it.
Back to
df2.groupby('Name').apply(tgrp)
The only thing missing here is that we want to unstack the results to get the desired output.
Say you start by unstacking:
df2 = df2.set_index(['Name', 'Job']).unstack()
>>> df2
Job Eff Date
Job Analyst Director Manager
Name
Jane 1/1/2015 None 1/1/2016
Joe 1/1/2015 7/1/2016 1/1/2016
Now, to make things easier, flatten the multi-index:
df2.columns = df2.columns.get_level_values(1)
>>> df2
Job Analyst Director Manager
Name
Jane 1/1/2015 None 1/1/2016
Joe 1/1/2015 7/1/2016 1/1/2016
Now, just manipulate the columns:
cols = []
for i, c in enumerate(df2.columns):
    col = 'Job %d' % i
    df2[col] = c
    cols.append(col)
    col = 'Eff Date %d' % i
    df2[col] = df2[c]
    cols.append(col)
>>> df2[cols]
Job Job 0 Eff Date 0 Job 1 Eff Date 1 Job 2 Eff Date 2
Name
Jane Analyst 1/1/2015 Director None Manager 1/1/2016
Joe Analyst 1/1/2015 Director 7/1/2016 Manager 1/1/2016
Edit
Jane was never a director (alas). The above code states that Jane became Director at None date. To change the result so that it specifies that Jane became None at None date (which is a matter of taste), replace
df2[col] = c
by
df2[col] = [None if d is None else c for d in df2[c]]
This gives
Job Job 0 Eff Date 0 Job 1 Eff Date 1 Job 2 Eff Date 2
Name
Jane Analyst 1/1/2015 None None Manager 1/1/2016
Joe Analyst 1/1/2015 Director 7/1/2016 Manager 1/1/2016
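For reference, here is a more direct sketch of the whole reshape: number each person's jobs with cumcount, move that counter into the index, and unstack it into columns (the columns come out grouped by field, i.e. all the jobs first, then all the dates):
jobs = pd.DataFrame(data1, columns=['Name', 'Job', 'Job Eff Date'])
n = jobs.groupby('Name').cumcount()
wide = jobs.set_index(['Name', n])[['Job', 'Job Eff Date']].unstack()
wide.columns = ['{0} {1}'.format(field, i) for field, i in wide.columns]
print(wide)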
Here is a possible workaround: first build a dictionary of the proper form, then create a DataFrame from it:
df = pd.DataFrame(data1)
dic = {}
for name, jobs in df.groupby('Name').groups.items():
    if not dic:
        dic['Name'] = []
    dic['Name'].append(name)
    for j, job in enumerate(jobs, 1):
        jobstr = 'Job {0}'.format(j)
        jobeffdatestr = 'Job Eff Date {0}'.format(j)
        if jobstr not in dic:
            dic[jobstr] = [''] * (len(dic['Name']) - 1)
            dic[jobeffdatestr] = [''] * (len(dic['Name']) - 1)
        dic[jobstr].append(df['Job'].loc[job])
        dic[jobeffdatestr].append(df['Job Eff Date'].loc[job])
df2 = pd.DataFrame(dic).set_index('Name')
## Job 1 Job 2 Job 3 Job Eff Date 1 Job Eff Date 2 Job Eff Date 3
## Name
## Jane Analyst Manager 1/1/2015 1/1/2016
## Joe Analyst Manager Director 1/1/2015 1/1/2016 7/1/2016
g = df2.groupby('Name').groups
names = list(g.keys())
data2 = {'Name': names}
cols = ['Name']
temp1 = [g[y] for y in names]
job_str = 'Job'
job_date_str = 'Job Eff Date'
for i in range(max([len(x) for x in g.values()])):
    # Pad with '' for people who have fewer jobs; reindex turns the
    # missing '' label into NaN, which is filled below.
    temp = [x[i] if len(x) > i else '' for x in temp1]
    job_str_curr = job_str + str(i + 1)
    job_date_curr = job_date_str + str(i + 1)
    data2[job_str_curr] = df2[job_str].reindex(temp).values
    data2[job_date_curr] = df2[job_date_str].reindex(temp).values
    cols.extend([job_str_curr, job_date_curr])
df3 = pd.DataFrame(data2, columns=cols)
df3 = df3.fillna('')
print(df3)
Name Job1 Job Eff Date1 Job2 Job Eff Date2 Job3 Job Eff Date3
0 Jane Analyst 1/1/2015 Manager 1/1/2016
1 Joe Analyst 1/1/2015 Manager 1/1/2016 Director 7/1/2016
This is not exactly what you were asking, but here is a way to print the data frame as you wanted:
df = pd.DataFrame(data1)
for name, jobs in df.groupby('Name').groups.items():
    print('{0:<15}'.format(name), end=' ')
    for job in jobs:
        print('{0:<15}{1:<15}'.format(df['Job'].loc[job], df['Job Eff Date'].loc[job]), end=' ')
    print()
## Jane Analyst 1/1/2015 Manager 1/1/2016
## Joe Analyst 1/1/2015 Manager 1/1/2016 Director 7/1/2016
Diving into @piRSquared's answer...
def tgrp(df):
    df = df.drop('Name', axis=1)
    print(df, '\n')
    out = df.reset_index(drop=True)
    print(out, '\n')
    print(out.T, '\n\n')
    return out.T
dfxx = df2.groupby('Name').apply(tgrp).unstack()
dfxx
The output of the above is below. Why does pandas repeat the first group? Is this a bug? (It is not: in older pandas versions, groupby.apply calls the function twice on the first group to decide whether it can take a fast or a slow code path, which is why the first group is printed twice.)
Job Job Eff Date
3 Analyst 1/1/2015
4 Manager 1/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
0 1
Job Analyst Manager
Job Eff Date 1/1/2015 1/1/2016
Job Job Eff Date
3 Analyst 1/1/2015
4 Manager 1/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
0 1
Job Analyst Manager
Job Eff Date 1/1/2015 1/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
2 Director 7/1/2016
Job Job Eff Date
0 Analyst 1/1/2015
1 Manager 1/1/2016
2 Director 7/1/2016
0 1 2
Job Analyst Manager Director
Job Eff Date 1/1/2015 1/1/2016 7/1/2016