Update columns with duplicate values from the DataFrame in Pandas - python

I have a data set where each value sits in its own row, with the first name identifying which person the value belongs to. For instance, James's gender is in the first row and James's age is in a later row.
DataFrame
df1 =
Index  First Name  Age  Gender  Weight in lb  Height in cm
0      James            Male
1      John                     175
2      Patricia    23
3      James       22
4      James                    185
5      John        29
6      John                                   176
I am trying to combine these into a single row per person, as below:
df1 =
Index  First Name  Age  Gender  Weight  Height
0      James       22   Male    185
1      John        29           175     176
2      Patricia    23
I tried to use groupby, but it is not working.

Assuming NaN in the empty cells, you can use groupby.first:
df.groupby('First Name', as_index=False).first()
output:
First Name Age Gender Weight in lb Height in cm
0 James 22.0 Male 185.0 NaN
1 John 29.0 None 175.0 176.0
2 Patricia 23.0 None NaN NaN
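For reference, a self-contained sketch of this approach (the frame below is rebuilt from the question's values, with NaN filling the blanks):

```python
import numpy as np
import pandas as pd

# Rebuild the sparse frame from the question: one known value per row.
df = pd.DataFrame({
    'First Name':   ['James', 'John', 'Patricia', 'James', 'James', 'John', 'John'],
    'Age':          [np.nan,  np.nan, 23,         22,      np.nan,  29,     np.nan],
    'Gender':       ['Male',  None,   None,       None,    None,    None,   None],
    'Weight in lb': [np.nan,  175,    np.nan,     np.nan,  185,     np.nan, np.nan],
    'Height in cm': [np.nan,  np.nan, np.nan,     np.nan,  np.nan,  np.nan, 176],
})

# first() takes the first non-null value per column within each group,
# so the scattered entries collapse into one row per person.
out = df.groupby('First Name', as_index=False).first()
print(out)
```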

Related

Comparing two DataFrames and retrieving modified values

Two separate similar DataFrames with different lengths
df2 =
pd.DataFrame([('James', 25, 'Male', 155),
              ('John', 27, 'Male', 175),
              ('Patricia', 23, 'Female', 135),
              ('Mary', 22, 'Female', 125),
              ('Martin', 30, 'Male', 185),
              ('Margaret', 29, 'Female', 141),
              ('Kevin', 22, 'Male', 198)],
             columns=['First Name', 'Age', 'Gender', 'Weight'])
Index  First Name  Age  Gender  Weight
0      James       25   Male    155
1      John        27   Male    175
2      Patricia    23   Female  135
3      Mary        22   Female  125
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198
df1 =
pd.DataFrame([('James', 25, 'Male', 165, "5'10"),
              ('John', 27, 'Male', 175, "5'9"),
              ('Matthew', 29, 'Male', 183, "6'0"),
              ('Patricia', 23, 'Female', 135, "5'3"),
              ('Mary', 22, 'Female', 125, "5'4"),
              ('Rachel', 29, 'Female', 123, "5'3"),
              ('Jose', 20, 'Male', 175, "5'11"),
              ('Kevin', 22, 'Male', 192, "6'2")],
             columns=['First Name', 'Age', 'Gender', 'Weight', 'Height'])
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    165     5'10
1      John        27   Male    175     5'9
2      Matthew     29   Male    183     6'0
3      Patricia    23   Female  135     5'3
4      Mary        22   Female  125     5'4
5      Rachel      29   Female  123     5'3
6      Jose        20   Male    175     5'11
7      Kevin       22   Male    192     6'2
df2 has some rows that are not in df1, and df1 has some values that are not in df2.
I need to find the modified values: when the First Name is the same, I need to check whether any value changed. For example, the weight of James is 165 in df1 but 155 in df2, so I need to store the modified weight of James (165) and the index (0) in a new dataframe, without iteration; iteration takes a long time because this is a sample of a big dataframe with many more rows and columns.
Desired output:
df2 =
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    155     5'10
1      John        27   Male    175     5'9
2      Patricia    23   Female  135     5'3
3      Mary        22   Female  125     5'4
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198     6'2
Martin's and Margaret's heights are not there in df1, so their heights are not updated in df2
Desired output:
modval =
Index  First Name  Age  Gender  Weight  Height
0      James                    165
7      Kevin                    192
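No answer is shown for this question here, but a merge-based sketch (using the question's df1/df2 and comparing only Weight, as in the example) could look like this; note the row labels come from the merge result, not from df1:

```python
import pandas as pd

df2 = pd.DataFrame([('James', 25, 'Male', 155), ('John', 27, 'Male', 175),
                    ('Patricia', 23, 'Female', 135), ('Mary', 22, 'Female', 125),
                    ('Martin', 30, 'Male', 185), ('Margaret', 29, 'Female', 141),
                    ('Kevin', 22, 'Male', 198)],
                   columns=['First Name', 'Age', 'Gender', 'Weight'])
df1 = pd.DataFrame([('James', 25, 'Male', 165), ('John', 27, 'Male', 175),
                    ('Matthew', 29, 'Male', 183), ('Patricia', 23, 'Female', 135),
                    ('Mary', 22, 'Female', 125), ('Rachel', 29, 'Female', 123),
                    ('Jose', 20, 'Male', 175), ('Kevin', 22, 'Male', 192)],
                   columns=['First Name', 'Age', 'Gender', 'Weight'])

# Inner merge keeps only names present in both frames; df1's Weight gets '_new'.
merged = df2.merge(df1[['First Name', 'Weight']], on='First Name',
                   suffixes=('', '_new'))
# Rows whose weight differs between the two frames are the modified ones.
changed = merged['Weight'] != merged['Weight_new']
modval = (merged.loc[changed, ['First Name', 'Weight_new']]
                .rename(columns={'Weight_new': 'Weight'}))
print(modval)
```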

Copy contents from one Dataframe to another based on column values in Pandas

Two separate similar DataFrames with different lengths
df2 =
Index  First Name  Age  Gender  Weight
0      James       25   Male    155
1      John        27   Male    175
2      Patricia    23   Female  135
3      Mary        22   Female  125
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198
df1 =
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    165     5'10
1      John        27   Male    175     5'9
2      Matthew     29   Male    183     6'0
3      Patricia    23   Female  135     5'3
4      Mary        22   Female  125     5'4
5      Rachel      29   Female  123     5'3
6      Jose        20   Male    175     5'11
7      Kevin       22   Male    192     6'2
df2 has some rows that are not in df1, and df1 has some values that are not in df2.
I am comparing df1 against df2. I have calculated newentries with the following code:
newentries = df2.loc[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1.loc[~df1['First Name'].isin(df2['First Name'])]
where newentries denotes the rows that are in df2 but not in df1, and deletedentries denotes the rows that are in df1 but not in df2. The above code works perfectly fine.
I need to copy the height from df1 to df2 when the first names are equal.
df2.loc[df2['First Name'].isin(df1['First Name']),"Height"] = df1.loc[df1['First Name'].isin(df2['First Name']),"Height"]
The above code copies the values, but the indexing causes an issue and the values are not copied to the corresponding rows. I tried promoting First Name to the index, but that doesn't solve it. Please help me with a solution.
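A note on the snippet above: `.loc` assignment aligns on the row index, not on First Name, which is why values land on the wrong rows. One index-free alternative is to map heights by name; a minimal sketch, using a trimmed version of the question's frames:

```python
import pandas as pd

df2 = pd.DataFrame({'First Name': ['James', 'John', 'Patricia', 'Mary',
                                   'Martin', 'Margaret', 'Kevin'],
                    'Weight': [155, 175, 135, 125, 185, 141, 198]})
df1 = pd.DataFrame({'First Name': ['James', 'John', 'Matthew', 'Patricia',
                                   'Mary', 'Rachel', 'Jose', 'Kevin'],
                    'Height': ["5'10", "5'9", "6'0", "5'3", "5'4", "5'3",
                               "5'11", "6'2"]})

# Build a name -> height lookup, then align by name instead of by position.
# Names missing from df1 (Martin, Margaret) get NaN automatically.
df2['Height'] = df2['First Name'].map(df1.set_index('First Name')['Height'])
print(df2)
```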
Also, I need to find the modified values: when the First Name is the same, I need to check whether any value changed. For example, the weight of James is 165 in df1 but 155 in df2, so I need to store the modified weight of James (165) and the index (0) in a new dataframe, without iteration; iteration takes a long time because this is a sample of a big dataframe with many more rows and columns.
Desired output:
df2 =
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    155     5'10
1      John        27   Male    175     5'9
2      Patricia    23   Female  135     5'3
3      Mary        22   Female  125     5'4
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198     6'2
Martin's and Margaret's heights are not there in df1, so their heights are not updated in df2
newentries =
Index  First Name  Age  Gender  Weight  Height
4      Martin      30   Male    185
5      Margaret    29   Female  141
deletedentries =
Index  First Name  Age  Gender  Weight  Height
2      Matthew     29   Male    183     6'0
5      Rachel      29   Female  123     5'3
6      Jose        20   Male    175     5'11
modval =
Index  First Name  Age  Gender  Weight  Height
0      James                    165
7      Kevin                    192
Building off of Rabinzel's answer:
output = df2.merge(df1, how='left', on='First Name', suffixes=[None, '_old'])
df3 = output[['First Name', 'Age', 'Gender', 'Weight', 'Height']]
cols = df1.columns[1:-1]
modval = pd.DataFrame()
for col in cols:
    modval = pd.concat([modval, output[['First Name', col + '_old']][output[col] != output[col + '_old']].dropna()])
    modval.rename(columns={col + '_old': col}, inplace=True)
newentries = df2[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1[~df1['First Name'].isin(df2['First Name'])]
print(df3, newentries, deletedentries, modval, sep='\n\n')
Output:
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
First Name Age Gender Weight
4 Martin 30 Male 185
5 Margaret 29 Female 141
First Name Age Gender Weight Height
2 Matthew 29 Male 183 6'0
5 Rachel 29 Female 123 5'3
6 Jose 20 Male 175 5'11
First Name Age Gender Weight
0 James NaN NaN 165.0
6 Kevin NaN NaN 192.0
For your desired output for df2 you can try this:
desired_df2 = df2.merge(df1[['First Name', 'Height']], on='First Name', how='left')
# if you want to change the NaN values to e.g. 0, just add ".fillna(0)" after the merge
print(desired_df2)
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
New and deleted entries are already right. For the moment I'm a bit stuck on how to get the modval dataframe; I'll update my answer if I get a solution.
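For completeness, a sketch of one way to get modval without the per-column loop, using `DataFrame.compare` on the rows and columns the two frames share (the names and values are taken from the question):

```python
import pandas as pd

df2 = pd.DataFrame([('James', 25, 'Male', 155), ('John', 27, 'Male', 175),
                    ('Patricia', 23, 'Female', 135), ('Mary', 22, 'Female', 125),
                    ('Martin', 30, 'Male', 185), ('Margaret', 29, 'Female', 141),
                    ('Kevin', 22, 'Male', 198)],
                   columns=['First Name', 'Age', 'Gender', 'Weight'])
df1 = pd.DataFrame([('James', 25, 'Male', 165), ('John', 27, 'Male', 175),
                    ('Matthew', 29, 'Male', 183), ('Patricia', 23, 'Female', 135),
                    ('Mary', 22, 'Female', 125), ('Rachel', 29, 'Female', 123),
                    ('Jose', 20, 'Male', 175), ('Kevin', 22, 'Male', 192)],
                   columns=['First Name', 'Age', 'Gender', 'Weight'])

a = df2.set_index('First Name')
b = df1.set_index('First Name')
idx = a.index.intersection(b.index)       # names present in both frames
cols = a.columns.intersection(b.columns)  # columns present in both frames

# compare() keeps only the cells that differ; 'self' is df2, 'other' is df1.
modval = a.loc[idx, cols].compare(b.loc[idx, cols])
print(modval)
```

This assumes first names are unique within each frame, as they are in the sample data.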

How to update multiple column values in pandas

I've been trying to crack this for a while, but I'm stuck now.
This is my code
l = list()
column_name = [col for col in df.columns if 'SalesPerson' in col]
filtereddf = pd.DataFrame(columns=['Item', 'SerialNo', 'Location',
                                   'SalesPerson01', 'SalesPerson02', 'SalesPerson03',
                                   'SalesPerson04', 'SalesPerson05', 'SalesPerson06',
                                   'PredictedSales01', 'PredictedSales02', 'PredictedSales03',
                                   'PredictedSales04', 'PredictedSales05', 'PredictedSales06'])
for i, r in df.iterrows():
    if len(r['Name'].split(';')) > 1:
        for x in r['Name'].split(';'):
            for y in column_name:
                if x in r[y]:
                    number_is = y[-2:]
                    filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                    filtereddf.at[i, 'Location'] = r['Location']
                    filtereddf.at[i, y] = r[y]
                    filtereddf.at[i, 'Item'] = r['Item']
                    filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                    # The statement below prints the values correctly, but I want
                    # to filter the values and use them in a dataframe:
                    # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'],
                    #       r[f'PredictedSales{number_is}'], r['Definition'])
                    l.append(filtereddf)
    else:
        for y in column_name:
            if r['Name'] in r[y]:
                number_is = y[-2:]
                filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                filtereddf.at[i, 'Location'] = r['Location']
                filtereddf.at[i, y] = r[y]
                filtereddf.at[i, 'Item'] = r['Item']
                filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                l.append(filtereddf)
finaldf = pd.concat(l, ignore_index=True)
It eventually throws an error
MemoryError: Unable to allocate 9.18 GiB for an array with shape (1, 1231543895) and data type object
Basically I want to extract SalesPersonNN and the corresponding PredictedSalesNN from the main dataframe df.
A sample dataset is below (the actual csv file has almost 100k entries):
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY Tom Julie Joe Sara Mary Philip 90 80 30 98 99 100
1 WashingMachine Mike 22222 NJ Tom Julie Joe Mike Mary Philip 80 70 40 74 88 42
2 Dishwasher Tony;Sue 33333 NC Margaret Tony William Brian Sue Bert 58 49 39 59 78 89
3 Microwave Bill;Jeff;Mary 44444 PA Elmo Bill Jeff Mary Chris Kevin 80 70 90 56 92 59
4 Printer Keith;Joe 55555 DE Keith Clark Ed Matt Martha Joe 87 94 59 48 74 89
And I want the output dataframe to look like
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY NaN NaN Joe NaN Mary Philip NaN NaN 30.0 NaN 99.0 100.0
1 WashingMachine Mike 22222 NJ NaN NaN NaN Mike NaN NaN NaN NaN NaN 74.0 NaN NaN
2 Dishwasher Tony;Sue 33333 NC NaN Tony NaN NaN Sue NaN NaN 49.0 NaN NaN 78.0 NaN
3 Microwave Bill;Jeff;Mary 44444 PA NaN Bill Jeff Mary NaN NaN NaN 70.0 90.0 56.0 NaN NaN
4 Printer Keith;Joe 55555 DE Keith NaN NaN NaN NaN Joe 87.0 NaN NaN NaN NaN 89.0
I am not sure if my approach using dataframe.at is correct. Any pointers on what I can use to efficiently filter only the column values that match the value in the Name column?
I would recommend changing from a column-focused dataframe to a row-focused dataframe. You can rewrite your dataset using melt:
# set the identifying columns as the index first, so melt keeps them as labels
df = df.set_index(['Item', 'SerialNo', 'Location'])
df_person = df.loc[:, 'SalesPerson01':'SalesPerson06']
df_sales = df.loc[:, 'PredictedSales01':'PredictedSales06']
df_person = df_person.melt(ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
PredictedSales = df_sales.melt(ignore_index=False, value_name='PredictedSales')[['PredictedSales']]
df_person['PredictedSales'] = PredictedSales
index_cols = ['Item', 'SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().sort_values(index_cols).set_index(index_cols)
df_person will look like this:
Item SerialNo Location SalesPerson PredictedSales
TV 11111 NY Joe 30
Julie 80
Mary 99
Philip 100
Sara 98
Tom 90
WashingMachine 22222 NJ Joe 40
Julie 70
Mary 88
Mike 74
Philip 42
Tom 80
... ... ... ... ...
Printer 55555 DE Clark 94
Ed 59
Joe 89
Keith 87
Martha 74
Matt 48
Now you only want the values for the names in your 'Name' column. Therefore we create a separate dataframe using explode (after splitting the semicolon-separated names into lists, which explode needs):
df_names = df[['Name']].assign(Name=df['Name'].str.split(';')).explode('Name').rename({'Name': 'SalesPerson'}, axis=1)
df_names = df_names.reset_index().set_index(['Item', 'SerialNo', 'Location', 'SalesPerson'])
df_names will look something like this:
Item SerialNo Location SalesPerson
TV 11111 NY Joe
Mary
Philip
WashingMachine 22222 NJ Mike
Dishwasher 33333 NC Tony
Sue
Microwave 44444 PA Bill
Jeff
Mary
Printer 55555 DE Keith
Joe
Now you can simply merge your dataframes:
df_names.merge(df_person, left_index=True, right_index=True)
Now the PredictedSales are added to your df_names dataframe.
Hopefully this will run without errors. Please let me know 😀
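If the wide layout from the question (NaN wherever the salesperson is not listed in Name) is what's needed, a vectorized mask over the paired columns is another option. A sketch under the question's column naming, with a two-row, three-column sample instead of the full six:

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['TV', 'Printer'],
    'Name': ['Joe;Mary;Philip', 'Keith;Joe'],
    'SalesPerson01': ['Tom', 'Keith'], 'SalesPerson02': ['Julie', 'Clark'],
    'SalesPerson03': ['Joe', 'Ed'],
    'PredictedSales01': [90, 87], 'PredictedSales02': [80, 94],
    'PredictedSales03': [30, 59],
})

sp = df.filter(like='SalesPerson')
ps = df.filter(like='PredictedSales')
allowed = df['Name'].str.split(';')

# Row-wise membership test: keep a salesperson cell only if that person
# appears in the row's Name list.
mask = pd.DataFrame({c: [p in names for p, names in zip(df[c], allowed)]
                     for c in sp.columns})
out_sp = sp.where(mask)
# Reuse the same mask for the paired PredictedSales columns.
out_ps = ps.where(mask.set_axis(ps.columns, axis=1))
print(out_sp.join(out_ps))
```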

Group a column of a data frame based on another dataframe

Based on this dataframe
df1 Name Age
Johny 15
Diana 35
Doris 97
Peter 25
Antony 55
I have this dataframe with the ranges that I want to use, for example:
df2 Header Init1 Final1 Init2 Final2 Init3 Final3
Names NaN NaN NaN NaN NaN NaN
Age 0 20 21 50 51 100
What I'm looking for is to get a result like this
df3 Name Age
Johny 0-20
Diana 21-50
Doris 51-100
Peter 21-50
Antony 51-100
I don't know if a possible solution is with cut(), but I'm new to Python.
Using pd.cut:
l = df2.iloc[1, 1:].tolist()
labels = [f'{a}-{b}' for a, b in zip(l, l[1:])]
df1['Age'] = pd.cut(df1['Age'], bins=l, labels=labels)
print(df1)
Name Age
0 Johny 0-20
1 Diana 21-50
2 Doris 51-100
3 Peter 21-50
4 Antony 51-100
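A variation, if the Init/Final pairs should be treated strictly as closed ranges (so no spurious in-between labels like "20-21" are ever produced): build the bins from the pairs with `pd.IntervalIndex`. A sketch with the question's values hard-coded:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Johny', 'Diana', 'Doris', 'Peter', 'Antony'],
                    'Age': [15, 35, 97, 25, 55]})
bounds = [0, 20, 21, 50, 51, 100]   # Init1, Final1, Init2, Final2, Init3, Final3

# Pair up (Init, Final) and make each pair a closed interval.
pairs = list(zip(bounds[0::2], bounds[1::2]))   # [(0, 20), (21, 50), (51, 100)]
bins = pd.IntervalIndex.from_tuples(pairs, closed='both')

# cut() with an IntervalIndex returns the matching interval; format it as "lo-hi".
df1['Age'] = pd.cut(df1['Age'], bins).map(lambda iv: f'{iv.left}-{iv.right}')
print(df1)
```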

Pandas: Dict of data frames to unbalanced Panel

I have a dictionary of DataFrame objects:
dictDF = {0: df0, 1: df1, 2: df2}
Each DataFrame df0,df1,df2 represents a table in a specific date of time, where the first column identifies (like social security number) a person and the other columns are characteristics of this person such as
DataFrame df0
id Name Age Gender Job Income
10 Daniel 40 Male Scientist 100
5 Anna 39 Female Doctor 250
DataFrame df1
id Name Age Gender Job Income
67 Guto 35 Male Engineer 100
7 Anna 39 Female Doctor 300
9 Melissa 26 Female Student 36
DataFrame df2
id Name Age Gender Job Income
77 Patricia 30 Female Dentist 300
9 Melissa 27 Female Dentist 250
Note that the id (social security number) identifies exactly one person. For instance, the same "Melissa" appears in two different DataFrames, but there are two different "Annas".
In these DataFrames the people (and their number) vary over time: some people are represented at all dates and others only at a specific date.
Is there a simple way to transform this dictionary of data frames into an (unbalanced) Panel object, where the ids appear at all dates and, if the data for a given id is not available, it is replaced by NaN?
Of course, I can do that by making a list of all ids and then checking at each date whether a given id is represented. If it is, I copy the data; otherwise, I just write NaN.
I wonder if there an easy way using pandas tools.
I would recommend using a MultiIndex instead of a Panel.
First, add the period to each dataframe:
for n, df in dictDF.items():
    df['period'] = n
Then concatenate into a big dataframe:
big_df = pd.concat(dictDF.values(), ignore_index=True)
Now set your index to period and id and you are guaranteed to have a unique index:
>>> big_df.set_index(['period', 'id'])
Name Age Gender Job Income
period id
0 10 Daniel 40 Male Scientist 100
5 Anna 39 Female Doctor 250
1 67 Guto 35 Male Engineer 100
7 Anna 39 Female Doctor 300
9 Melissa 26 Female Student 36
2 77 Patricia 30 Female Dentist 300
9 Melissa 27 Female Dentist 250
You can also reverse that order:
>>> big_df.set_index(['id', 'period']).sort_index()
Name Age Gender Job Income
id period
5 0 Anna 39 Female Doctor 250
7 1 Anna 39 Female Doctor 300
9 1 Melissa 26 Female Student 36
2 Melissa 27 Female Dentist 250
10 0 Daniel 40 Male Scientist 100
67 1 Guto 35 Male Engineer 100
77 2 Patricia 30 Female Dentist 300
You can even unstack the data quite easily:
big_df.set_index(['id', 'period'])[['Income']].unstack('period')
Income
period 0 1 2
id
5 250 NaN NaN
7 NaN 300 NaN
9 NaN 36 250
10 100 NaN NaN
67 NaN 100 NaN
77 NaN NaN 300
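The loop that stamps a period column can also be skipped: passing the dict straight to `pd.concat` uses its keys as an extra index level. A sketch with small two-row stand-ins for the frames:

```python
import pandas as pd

df0 = pd.DataFrame({'id': [10, 5], 'Name': ['Daniel', 'Anna'], 'Income': [100, 250]})
df1 = pd.DataFrame({'id': [67, 9], 'Name': ['Guto', 'Melissa'], 'Income': [100, 36]})
dictDF = {0: df0, 1: df1}

# concat on a dict uses the keys (the periods) as the outer index level;
# droplevel discards the per-frame row numbers, then 'id' becomes the inner level.
big_df = (pd.concat(dictDF, names=['period'])
            .droplevel(1)
            .set_index('id', append=True))
print(big_df)
```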
