Simplify pivot in pandas [duplicate] - python

This question already has answers here: How can I pivot a dataframe? (5 answers)
I have a question about the use of the pivot function in pandas. I have a table (df_init) with about 7000 distinct customer IDs and the product codes they purchased:
CST_ID   PROD_CODE
11111    1234
11111    2345
11111    5425
11111    9875
22222    2345
22222    9251
22222    1234
33333    6542
33333    7498
Each ID can appear at most 4 times in the table, but may appear fewer times (e.g., 22222 and 33333). I want to reorganize that table as follows (df_fin):
CST_ID   PROD_1   PROD_2   PROD_3   PROD_4
11111    1234     2345     5425     9875
22222    2345     9251     1234     NaN
33333    6542     7498     NaN      NaN
Good news is, I have found a way to do so. Bad news is, I am not satisfied, as it loops over the customer IDs and takes a while. Namely, I count the occurrences of a given ID while looping over the column, collect the counts in a list, then append that list as a new column to df_init:
to_append = []
for index in range(len(df_init)):
    # count how many times this CST_ID has occurred up to and including this row
    temp = df_init.iloc[:index + 1]['CST_ID'] == df_init.iloc[index]['CST_ID']
    counter = sum(list(temp))
    to_append.append(counter)
df_init['Product_number'] = to_append
Afterwards I pivot and rename the columns
df_fin = df_init.pivot(index='CST_ID', columns='Product_number', values='PROD_CODE').rename_axis(None).reset_index()
df_fin.columns=['CST_ID', 'pdt1', 'pdt2', 'pdt3', 'pdt4']
Of course this solution works just fine, but looping to create the column I use for the columns argument of pivot takes time. Hence I was wondering if there is a better way (perhaps already built into pandas or the pivot method) to do so.
Thanks to anyone who is willing to participate
Best

You can vectorize the part creating the pivoting column as below: groupby + cumcount generates an increasing counter within each CST_ID group.
df_fin = df_init.assign(key="PROD_" + (df_init.groupby("CST_ID").cumcount()+1).astype(str))
df_fin = df_fin.pivot(index="CST_ID", columns="key", values="PROD_CODE")
df_fin
#key PROD_1 PROD_2 PROD_3 PROD_4
#CST_ID
#11111 1234.0 2345.0 5425.0 9875.0
#22222 2345.0 9251.0 1234.0 NaN
#33333 6542.0 7498.0 NaN NaN
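If the float display caused by the NaN padding is unwanted, one option is pandas' nullable integer dtype; a small sketch, assuming df_fin from above (Int64 needs a reasonably recent pandas, >= 1.0):
df_fin = df_fin.astype("Int64")  # keeps whole numbers, shows <NA> for the missing slots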

For large dataframes I would have done the following, though the solution above works nicely:
import random
import pandas

df = pandas.DataFrame(
    {
        "CST_ID": [11111, 11111, 11111, 11111, 22222, 22222, 22222, 22222, 33333, 33333, 33333, 33333],
        "PROD_CODE": [random.randint(1, 100) for _ in range(12)],
    }
)
df["Product_number"] = df.groupby(['CST_ID']).cumcount() + 1
df = df.pivot(index='CST_ID', columns='Product_number', values='PROD_CODE')
df.columns = ["PROD_%s" % _ for _ in df.columns]
# PROD_1 PROD_2 PROD_3 PROD_4
#CST_ID
#11111 98 11 13 38
#22222 33 13 3 61
#33333 86 35 93 23
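For completeness, the same reshape can be spelled without the helper column via set_index/unstack; a sketch using df_init as in the question (same assumptions as above):
# stack CST_ID with a per-customer counter, then unstack the counter into columns
out = (df_init.set_index(['CST_ID', df_init.groupby('CST_ID').cumcount() + 1])['PROD_CODE']
              .unstack()
              .add_prefix('PROD_'))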

Related

Python Dataframe: pivot rows as columns

I have raw files from different stations. When I combine them into a dataframe, I get three columns: id, name, and component; the id and name values repeat while the component differs. I want to convert this into a dataframe where the name entries become the column names.
Code:
df =
   id           name component
0   1  Serial Number       103
1   2   Station Name        DC
2   1  Serial Number       114
3   2   Station Name        CA
4   1  Serial Number       147
5   2   Station Name        FL
Expected answer:
new_df =
Station Name Serial Number
0 DC 103
1 CA 114
2 FL 147
My answer:
# Solution1
df.pivot_table('id','name','component')
name
NaN NaN NaN NaN
# Solution2
df.pivot(index=None,columns='name')['component']
name
NaN NaN NaN NaN
I am not getting the desired answer. Any help?
First you have to give every two rows the same id; after that you can use pivot.
import pandas as pd

df = pd.DataFrame({'id': ["1", "2", "1", "2", "1", "2"],
                   'name': ["Serial Number", "Station Name", "Serial Number", "Station Name", "Serial Number", "Station Name"],
                   'component': ["103", "DC", "114", "CA", "147", "FL"]})
new_column = [x // 2 + 1 for x in range(len(df))]
df["id"] = new_column
df = df.pivot(index='id', columns='name')['component']
If your Serial Number row always comes just before its Station Name row, you can pivot on the name column and then combine every two rows:
df_ = df.pivot(columns='name', values='component').groupby(df.index // 2).first()
print(df_)
name Serial Number Station Name
0 103 DC
1 114 CA
2 147 FL
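If the rows could arrive in any order, a variant that numbers the records per field instead of relying on row position may be safer; a sketch assuming each record has exactly one of each field:
rec = df.groupby('name').cumcount()  # running record number within each field
wide = (df.assign(rec=rec)
          .pivot(index='rec', columns='name', values='component')
          .rename_axis(None, axis=1))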

How can I convert rows to columns (with custom names) after grouping?

I'm trying to get some row data as columns with pandas.
My original dataframe is something like the following (with many more columns). Most data repeats for the same employee, but some info changes, like salary in this example. Employees have different numbers of entries (in this case employee 1 has two entries, 2 has four, and so on).
employee_id salary other1 other2 other3
1 50000 somedata1 somedata2 somedata3
1 48000 somedata1 somedata2 somedata3
2 80000 somedata20 somedata21 somedata22
2 77000 somedata20 somedata21 somedata22
2 75000 somedata20 somedata21 somedata22
2 74000 somedata20 somedata21 somedata22
3 60000 somedata30 somedata31 somedata32
I'm trying to get something like the following. Salary data should span a few columns and use the last available salary for employees with fewer entries (the repeated salary values in this example).
employee_id salary prevsalary1 prevsalary2 prevsalary3 other1 other2 other3
1 50000 48000 48000 48000 somedata1 somedata2 somedata3
2 80000 77000 75000 74000 somedata20 somedata21 somedata22
3 60000 60000 60000 60000 somedata30 somedata31 somedata32
I tried grouping
df.groupby(["employee_id"])['salary'].nlargest(3).reset_index()
But I don't get all the columns. I can't find a way to preserve the rest of the columns. Do I need to merge or concatenate the result with the original dataframe?
Also, I get a column named "level_1". I think I could get rid of it by using reset_index(level=1, drop=True), but I believe this doesn't return a dataframe.
And finally, I guess if I get this grouping right, there's one more step to get the columns... maybe using pivot or unstack?
I'm starting my journey into machine learning and I keep scratching my head with this one, I hope you can help me :)
Creating dataset:
df = pd.DataFrame({'employee_id': [1, 1, 2, 2, 2, 2, 3],
                   'salary': [50000, 48000, 80000, 77000, 75000, 74000, 60000]})
df['other1'] = ['somedata1', 'somedata1', 'somedata20', 'somedata20', 'somedata20', 'somedata20', 'somedata30']
df['other2'] = df['other1'].apply(lambda x: x + '1')
df['other3'] = df['other1'].apply(lambda x: x + '2')
df
Out[59]:
employee_id salary other1 other2 other3
0 1 50000 somedata1 somedata11 somedata12
1 1 48000 somedata1 somedata11 somedata12
2 2 80000 somedata20 somedata201 somedata202
3 2 77000 somedata20 somedata201 somedata202
4 2 75000 somedata20 somedata201 somedata202
5 2 74000 somedata20 somedata201 somedata202
6 3 60000 somedata30 somedata301 somedata302
One way is using pd.pivot_table with ffill:
g = df.groupby('employee_id')
cols = g.salary.cumcount()
out = df.pivot_table(index='employee_id', values='salary', columns=cols).ffill(axis=1)
# Create list of column names matching the expected output
out.columns = ['salary'] + [f'prevsalary{i}' for i in range(1, len(out.columns))]
print(out)
salary prevsalary1 prevsalary2 prevsalary3
employee_id
1 50000.0 48000.0 48000.0 48000.0
2 80000.0 77000.0 75000.0 74000.0
3 60000.0 60000.0 60000.0 60000.0
Now we just need to join with the unique other columns from the original dataframe:
out = out.join(df.filter(like='other').groupby(df.employee_id).first())
print(out)
salary prevsalary1 prevsalary2 prevsalary3 other1 \
employee_id
1 50000.0 48000.0 48000.0 48000.0 somedata1
2 80000.0 77000.0 75000.0 74000.0 somedata20
3 60000.0 60000.0 60000.0 60000.0 somedata30
other2 other3
employee_id
1 somedata2 somedata3
2 somedata21 somedata22
3 somedata31 somedata32
Pivot the table of salaries first, then merge with the non-salary data:
# first create a copy of the dataset without the salary column
dataset_without_salaries = df.drop('salary', axis=1).drop_duplicates()
# pivot only the salary column, collecting each employee's salaries into a list
temp = pd.pivot_table(data=df[['salary']], index=df['employee_id'], aggfunc=list)
# expand each list into one column per salary
temp2 = temp.apply(lambda x: pd.Series(x['salary']), axis=1)
# merge the two together (temp2 is indexed by employee_id, so join on that index)
final = pd.merge(temp2, dataset_without_salaries, left_index=True, right_on='employee_id')
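A possible follow-up, naming the expanded columns to match the expected output (the names here are an assumption):
temp2.columns = ['salary'] + [f'prevsalary{i}' for i in range(1, temp2.shape[1])]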

Multiple aggregated Counting in Pandas

I have a DF:
data = [["John","144","Smith","200"], ["Mia","220","John","144"],["Caleb","155","Smith","200"],["Smith","200","Jason","500"]]
data_frame = pd.DataFrame(data,columns = ["Name","ID","Manager_name","Manager_ID"])
data_frame
OP:
Name ID Manager_name Manager_ID
0 John 144 Smith 200
1 Mia 220 John 144
2 Caleb 155 Smith 200
3 Smith 200 Jason 500
I am trying to count the number of people reporting under each person in the Name column.
The logic is:
Count the people who report directly plus everyone below them in the chain. For example, with Smith: John and Caleb report to Smith (2), and Mia reports to John (who already reports to Smith), so the total is 3.
Similarly for Jason: Smith reports to him directly, and 3 people already report to Smith, so the total is 4.
I understand how to do it pythonically with some recursion; is there a way to do it efficiently in pandas? Any suggestions?
Expected OP:
Name Number of people reporting
John 1
Mia 0
Caleb 0
Smith 3
Jason 4
Scott Boston's Networkx solution is the preferred solution...
There are two solutions to this problem. The first is a vectorized pandas solution that should be fast on larger datasets; the second is pythonic and does not scale well to the dataset size the OP mentioned (the original df is 223635 x 4).
PANDAS SOLUTION
This problem seeks to find out how many people each person in an organization manages, including subordinates' subordinates. This solution builds a dataframe by adding successive columns holding the managers of the previous column, and then counts the occurrence of each employee in that dataframe to determine the total number under them.
First we set up the input.
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
df = df[["SID", "Manager_SID"]]
# shorten the column names for convenience
df.columns = ["1", "2"]
print(df)
1 2
0 144 200
1 220 144
2 155 200
3 200 500
First, the employees without subordinates must be counted and put into a separate dictionary.
df_not_mngr = df.loc[~df['1'].isin(df['2']), '1']
non_mngr_dict = {str(key):0 for key in df_not_mngr.values}
non_mngr_dict
{'220': 0, '155': 0}
Next we modify the dataframe by adding columns of managers of the previous column. The loop stops when the rightmost column contains no employees.
for i in range(2, 10):
    df = df.merge(
        df[["1", "2"]], how="left", left_on=str(i), right_on="1", suffixes=("_l", "_r")
    ).drop("1_r", axis=1)
    df.columns = [str(x) for x in range(1, i + 2)]
    if df.iloc[:, -1].isnull().all():
        break
print(df)
1 2 3 4 5
0 144 200 500 NaN NaN
1 220 144 200 500 NaN
2 155 200 500 NaN NaN
3 200 500 NaN NaN NaN
All columns except the first are flattened, and each employee is counted into a dictionary.
from collections import Counter
result = dict(Counter(df.iloc[:, 1:].values.flatten()))
The non-manager dictionary is then added to the result.
result.update(non_mngr_dict)
result
{'200': 3, '500': 4, nan: 8, '144': 1, '220': 0, '155': 0}
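The nan key counts the padding the merges introduced and can be dropped; a small cleanup sketch, assuming the result dict from above:
# keep only real SID keys; nan marks rows that ran out of managers
result = {k: v for k, v in result.items() if pd.notna(k)}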
RECURSIVE PYTHONIC SOLUTION
I think this is probably way more pythonic than you were looking for. First I created a list 'all_sids' to make sure we capture all employees, since not all appear in each column.
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
all_sids = pd.unique(df[['SID', 'Manager_SID']].values.ravel('K'))
Then create a pivot table.
dfp = df.pivot_table(values='Name', index='SID', columns='Manager_SID', aggfunc='count')
dfp
Manager_SID 144 200 500
SID
144 NaN 1.0 NaN
155 NaN 1.0 NaN
200 NaN NaN 1.0
220 1.0 NaN NaN
Then a function that will go through the pivot table to total up all the reports.
def count_mngrs(SID, count=0):
    if str(SID) not in dfp.columns:
        return count
    else:
        count += dfp[str(SID)].sum()
        sid_list = dfp[dfp[str(SID)].notnull()].index
        for sid in sid_list:
            count = count_mngrs(sid, count)
        return count
Call the function for each employee and print the results.
print('SID', ' Number of People Reporting')
for sid in all_sids:
    print(sid, " ", int(count_mngrs(sid)))
Results are below; sorry, I was a bit lazy about putting the names with the SIDs.
SID Number of People Reporting
144 1
220 0
155 0
200 3
500 4
Look forward to seeing a more pandas type solution!
This is also a graph problem, and you can use Networkx:
import networkx as nx
import pandas as pd

data = [["John", "144", "Smith", "200"], ["Mia", "220", "John", "144"], ["Caleb", "155", "Smith", "200"], ["Smith", "200", "Jason", "500"]]
data_frame = pd.DataFrame(data, columns=["Name", "ID", "Manager_name", "Manager_ID"])

# create a directed graph object using nx.DiGraph
G = nx.from_pandas_edgelist(data_frame,
                            source='Name',
                            target='Manager_name',
                            create_using=nx.DiGraph())

# use nx.ancestors to get the set of "ancestor" nodes for each node in the directed graph
pd.DataFrame.from_dict({i: len(nx.ancestors(G, i)) for i in G.nodes()},
                       orient='index',
                       columns=['Num of People reporting'])
Output:
Num of People reporting
John 1
Smith 3
Mia 0
Caleb 0
Jason 4
Draw networkx: [graph figure omitted]
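A minimal sketch of how such a drawing could be produced (the layout and styling here are assumptions, not from the original answer):
import matplotlib.pyplot as plt

pos = nx.spring_layout(G, seed=42)  # deterministic layout for reproducibility
nx.draw(G, pos, with_labels=True, node_color='lightblue', arrows=True)
plt.show()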

Pandas join issue: columns overlap but no suffix specified

I have the following data frames:
print(df_a)
mukey DI PI
0 100000 35 14
1 1000005 44 14
2 1000006 44 14
3 1000007 43 13
4 1000008 43 13
print(df_b)
mukey niccdcd
0 190236 4
1 190237 6
2 190238 7
3 190239 4
4 190240 7
When I try to join these data frames:
join_df = df_a.join(df_b, on='mukey', how='left')
I get the error:
*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')
Why is this so? The data frames do have common 'mukey' values.
The error on the snippet of data you posted is a little cryptic: both frames contain a 'mukey' column, so the overlap requires you to supply a suffix for the left and right hand sides:
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn't have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN
The .join() function uses the index of the dataset passed as an argument, so you should call set_index on it, or use the .merge function instead.
Please find the two examples that should work in your case:
join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')
or
join_df = df_a.merge(df_b, on='mukey', how='left')
This error indicates that the two tables share one or more column names.
The error message translates to: "I can see the same column in both tables, but you haven't told me to rename either one before bringing them into the same table."
You either want to delete one of the columns before bringing it in from the other (using del df['column name']), use lsuffix to rename the original column, or use rsuffix to rename the one being brought in.
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Again, the error indicates that the two tables share one or more column names. Anyone with the same error who doesn't want to provide a suffix can rename the columns instead. Also make sure the indexes of both DataFrames match in type and value if you don't want to provide the on='mukey' setting.
# rename example
df_a = df_a.rename(columns={'a_old': 'a_new', 'a2_old': 'a2_new'})
# set the index
df_a = df_a.set_index(['mukey'])
df_b = df_b.set_index(['mukey'])
df_a.join(df_b)
Mainly, join joins on the index, not on column names, so rename the overlapping columns in the two dataframes before joining; otherwise this error is raised.

Is there an "ungroup by" operation opposite to .groupby in pandas?

Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': pd.np.mean})
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string in to pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
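A minimal sketch of that idea using DataFrame.explode (available since pandas 0.25), assuming group_df is the aggregated frame from the question:
# split the joined names back into lists, then emit one row per name
restored = (group_df.assign(name=group_df['name'].str.split('-'))
                    .explode('name')
                    .reset_index())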
It turns out that df.groupby(...) returns an object with the original data stored in .obj, so ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)

print('create dataframe\n')
df = pandas.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'],
                       'age': [1, 36, 32, 26, 30],
                       'family': [1, 1, 1, 2, 2]})
df.index.name = 'indexer'
print(df)

print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)

print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one way is df.groupby(...).filter(lambda x: True), which gets back the original DataFrame.
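A quick round-trip check (assumes df and 'family' from the question):
restored = df.groupby('family').filter(lambda x: True)
assert restored.equals(df)  # every group passes the filter, so all original rows come back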
