concatenate 2 dataframes while matching multiple columns [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 12 months ago.
I have 2 almost identical pandas dataframes with 5 common columns.
I want to add the second dataframe, which has one new column, to the first.
Dataframe 1
Dataframe 2
But I want it to update the same row when the columns 'Lot name', 'WAFER' and 'SITE' match (green). Where they do not match, I want the value NaN, as shown below.
Desired output
I have to do this with over 160 such columns, each with potentially matching Lot name, WAFER and SITE values.
I have tried the various merge options (left, right, outer) and concat, but I just can't seem to get it right. Any help/comments are appreciated.
Edit, follow up question:
I am trying to use this in a loop, where each iteration generates a new dataframe assigned to temp that needs to be merged with the previous one. I cannot merge with an empty dataframe, as that gives a merge error. How can I achieve this?
alldata = pd.DataFrame()
for i in range(len(operation)):
    temp = data[data['OPE_NO'].isin([operation[i]])]
    temp = temp[temp['PARAM_NAME'].isin([parameter[i]])]
    temp = temp.reset_index(drop=True)
    temp = temp[['LOT', 'Lot name', 'WAFER', 'SITE', 'PRODUCT', 'PARAM_VALUE_NUMBER']]
    temp = temp.rename(columns={'PARAM_VALUE_NUMBER': 'PMRM28LEMCKLYTFR.1~' + operation[i] + '~' + parameter[i]})
    # merge returns a new frame, so assign it back; the first pass still
    # fails because the empty alldata shares no columns with temp
    alldata = alldata.merge(temp, how='outer')
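One common workaround is to start from None and only merge from the second iteration on. A sketch, not a definitive fix: the join keys are assumed from the column list above, and build_temp is a hypothetical helper standing in for the filter/rename steps in the snippet.

import pandas as pd

keys = ['LOT', 'Lot name', 'WAFER', 'SITE', 'PRODUCT']  # assumed join keys
alldata = None
for i in range(len(operation)):
    temp = build_temp(i)  # hypothetical helper: the filter/rename steps above
    if alldata is None:
        alldata = temp  # first iteration: nothing to merge with yet
    else:
        alldata = alldata.merge(temp, how='outer', on=keys)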

The example can be done with the following code:
df1.merge(df2, how="outer")
If I'm misunderstanding the problem, please let me know.
My English is not good, but I have a good heart to help you.
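A minimal runnable sketch with made-up frames (the three key columns are assumed from the question; with no on= argument, merge defaults to all shared columns anyway):

import pandas as pd

df1 = pd.DataFrame({'Lot name': ['A', 'A'], 'WAFER': [1, 2], 'SITE': [1, 1],
                    'meas1': [0.1, 0.2]})
df2 = pd.DataFrame({'Lot name': ['A', 'B'], 'WAFER': [1, 3], 'SITE': [1, 1],
                    'meas2': [9.9, 8.8]})

# matching key rows line up; non-matching rows get NaN in the other side's columns
print(df1.merge(df2, how='outer', on=['Lot name', 'WAFER', 'SITE']))
#   Lot name  WAFER  SITE  meas1  meas2
# 0        A      1     1    0.1    9.9
# 1        A      2     1    0.2    NaN
# 2        B      3     1    NaN    8.8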

Related

How do I combine two dataframes on two columns? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 months ago.
I have two df's. In the first, the first column holds a date (all dates of the last three years), the second column holds the names of participants, and the other columns hold information.
In the second df, the first column holds some dates on which we did tests, the second column the names again, and then more columns of information.
I would like to combine the two dataframes so that the information from the second is added to the first. For example, if we did a test on 2-9-2020 and the same test for the same person on 16-9-2022, then from 2-9-2020 until 16-9-2022 I want the first value, and after that the other.
I hope it's clear what I mean.
I tried
data.merge(data_2, on='Date' & 'About')
but that is not a valid way to give two columns to on.
Please, it would be nice if you could provide an example; I would try this:
import pandas as pd
new_df = pd.merge(data, names_participants, on=['Date'], how='left')
I would also validate that the date format is consistent between the two frames.
With Python and Pandas, you can join on 2 variables by using something like:
df = pd.merge(df, df2, how="left", on=['Date', 'About'])  # how can be "left", "inner", "right" or "outer"
I think you had the right idea, but not quite the right syntax. Does the following work for your situation?
import pandas as pd
new_df = pd.merge(data, data2, on=["Date", "About"], how="left")
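A runnable sketch with made-up frames: the exact two-key merge answers the question as asked, and if the real need is to carry each test value forward until the next test date, pandas.merge_asof (a different tool than plain merge, named here as a suggestion) does exactly that:

import pandas as pd

daily = pd.DataFrame({'Date': pd.date_range('2020-09-01', periods=5),
                      'About': 'anna'})
tests = pd.DataFrame({'Date': pd.to_datetime(['2020-09-02', '2020-09-04']),
                      'About': ['anna', 'anna'],
                      'score': [7, 9]})

# exact match on both keys: days without a test get NaN
exact = daily.merge(tests, on=['Date', 'About'], how='left')

# carry the most recent test forward until the next one
# (both frames must be sorted on the 'on' column)
asof = pd.merge_asof(daily.sort_values('Date'), tests.sort_values('Date'),
                     on='Date', by='About', direction='backward')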

Drop only specified amount of duplicates pandas [duplicate]

This question already has answers here:
Keeping the last N duplicates in pandas
(2 answers)
Closed 11 months ago.
Pandas' drop_duplicates function can be told to keep "first", "last", or False. Instead of keeping just one duplicate (with "first" or "last") or none (with False), I want to keep a certain number, N, of the duplicates.
Any help is appreciated!
Something like this could work, but you haven't specified whether you are using one or more column(s) to deduplicate:
n = 3
df.groupby('drop_dup_col').head(n)
This keeps the first three duplicates, based on a column value, counting from the top (head) of the dataframe. If you want to start from the bottom of the df, use .tail(n) instead.
Change n to the number of rows you want to keep, and change 'drop_dup_col' to the name of the column you are deduplicating on.
Multiple columns can be specified in groupby using:
df.groupby(['col1','col5'])
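A quick runnable check of the idea, with a made-up frame:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'a', 'a', 'b', 'b'],
                   'val': [1, 2, 3, 4, 5, 6]})
n = 3
# keeps at most the first n rows per key, preserving the original order
print(df.groupby('key').head(n))
#   key  val
# 0   a    1
# 1   a    2
# 2   a    3
# 4   b    5
# 5   b    6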
Regarding the question in your comment:
It's a bit hard to implement, because if you want to, say, delete 3 duplicates, there should also be a minimum of 3 duplicates; otherwise, in case only 2 duplicates occur, they would be deleted from the data and no row kept.
n = 3
# count how often each key occurs
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# rows whose key occurs at least n times
df2 = df.loc[df['dup_count'] >= n]
# each df2 row now appears twice in the concatenation, so keep=False drops
# them all, leaving only the keys that occur fewer than n times
df3 = pd.concat([df, df2]).drop_duplicates(keep=False)
I believe a combination of groupby and tail(N) should work for this.
In this case, if you want to keep 4 duplicates in df['myColumnDuplicates']:
df.groupby('myColumnDuplicates').tail(4)
To be more precise, and to complete @Stijn's answer:
tail(n) keeps the last n duplicated rows found, while head(n) keeps the first n duplicated rows.
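Another common idiom for the same task, sketched with a made-up frame: cumcount numbers the rows within each group, so you can threshold it in one expression.

import pandas as pd

df = pd.DataFrame({'key': ['a'] * 5 + ['b'] * 2, 'val': range(7)})
n = 3
first_n = df[df.groupby('key').cumcount() < n]                # first n per key
last_n = df[df.groupby('key').cumcount(ascending=False) < n]  # last n per key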

For a requirement I need to transform a DataFrame by creating rows out of values from lists that are in a column of that DataFrame [duplicate]

This question already has answers here:
Pandas column of lists, create a row for each list element
(10 answers)
Closed 1 year ago.
I need to transform the below DataFrame into the required format without using a loop (or any other inefficient logic), as the dataframe is huge: 950 thousand rows, and the lists in the Points column can be longer than 1000 elements. I get this data by de-serializing blob data from a database, and I will need it to create some ML models.
input:
output:
FinalDF = pd.DataFrame()
for index, val in df.iterrows():
    tempDF = pd.DataFrame(
        [[
            df['I'][index], df['x'][index],
            df['y'][index], df['points'][index],
        ]] * int(df['points'][index]))
    tempDF["Data"] = df['data'][index]
    tempDF["index"] = list(range(1, int(df['k'][index]) + 1))
    FinalDF = FinalDF.append(tempDF, ignore_index=True)
I have tried using a for loop, but for 950 thousand rows it takes so much time that that logic is just not feasible. Please help me find a pandas approach, or if not, some other method to do this.
*I had to post screenshots because I was unable to post the dataframes as tables. Sorry, I'm new to Stack Overflow.
Use explode:
df.explode('points')
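A minimal sketch of what explode does, with a made-up frame shaped like the question's: each list element becomes its own row, and the other columns are repeated.

import pandas as pd

df = pd.DataFrame({'I': [1, 2], 'points': [[10, 20, 30], [40, 50]]})
out = df.explode('points').reset_index(drop=True)
print(out)
#    I points
# 0  1     10
# 1  1     20
# 2  1     30
# 3  2     40
# 4  2     50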

Python Pandas - Dataframe column gets swallowed when I add two columns from second Dataframe [duplicate]

This question already has answers here:
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 3 years ago.
I have two dataframes df and df2 with contents as follows
dataframe df
dataframe df2
I'd like to add to df the two columns from df2, "NUMSESSIONS_ANDROID" and "AVGSESSDUR_ANDROID".
I do this as follows:
df['NUMSESSIONS_ANDROID'] = df2['NUMSESSIONS_ANDROID']
df['AVGSESSDUR_ANDROID'] = df2['AVGSESSDUR_ANDROID']
However, when I print the resulting df, I see ... in place of AVGSESSDUR_IOS (i.e. it appears to have swallowed that column).
I'd appreciate any help resolving this.
As ALollz stated, the fact that you are seeing ... in the output means there is "hidden" data that is part of the dataframe but not shown in your console or IDE. However, you can do an easy print to check all the columns your dataframe contains:
print(list(df))
And this will show you all the names of the columns in your df that way you can check whether the ones you want are there or not.
Furthermore, you can print a specific column as a series (first line) or dataframe (second line):
print(df['column_name'])
print(df[['column_name']])
If successful you will see the series/dataframe, if the column actually doesn't exist in your original dataframe, then you will get a KeyError.
Leveraging @ALollz's hint above:
"The ... indicates that only part of the DataFrame is being shown in your terminal/output, so 'AVGSESSDUR_IOS' is almost certainly still there it's just not shown. You can look at print(df.iloc[:, 0:3]) to see the first 3 columns for instance."
I added the following two lines to increase the number of columns and width of console display and it worked:
pd.set_option('display.max_columns',20)
pd.set_option('display.width', 1000)
print(df.iloc[:,0:5])
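If you'd rather not hard-code a column count, display.max_columns also accepts None (a standard pandas option), which tells pandas never to elide columns with ...:

import pandas as pd

pd.set_option('display.max_columns', None)  # show every column, no ... elision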

Lookup in a pandas Dataframe [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a dataframe which is similar to:
grades = pd.DataFrame(
    columns=["person", "course_code", "grade"],
    data=[[1, 101, 2.0], [2, 102, 1.0], [3, 103, 3.0], [2, 104, 4.0],
          [1, 102, 5.0], [3, 104, 2.5], [2, 101, 1.0]])
Each row is the grade of a certain student in a certain subject.
And I want to convert it to another that looks like this:
students = pd.DataFrame(
    columns=[101, 102, 103, 104],
    data=[[2.0, 5.0, None, None], [1.0, 1.0, None, 4.0], [None, None, 3.0, 2.5]])
Each row is a student (the row index) with the grades obtained in every subject (every column is a different subject).
I have tried doing this:
for subj in grades["COURSE_CODE"].unique():
    grades_subj = grades[grades["COURSE_CODE"] == subj]
    grades_subj = grades_subj.set_index("EXPEDIENT_CODE", drop=True)
    for st in grades["EXPEDIENT_CODE"].unique():
        grade_num = grades_subj.loc[st]["GRADE"]
        student.loc[st][subj] = grade_num
But I get:
KeyError: 'the label [304208] is not in the [index]'
I have tried other ways too and always get errors...
Can someone help me, please?
Try:
grades.pivot_table(index='person', columns='course_code', values='grade')
The values argument lets you choose the aggregation column.
To answer your comment below: you can always add more levels when indexing, simply by passing a list rather than a single string to index. Note you can do the same with columns. So, based on the example you provide:
grades.pivot_table(index=['person', 'school'], columns='course_code', values='grade')
After this I usually recommend reset_index(), unless you are fluent in slicing and indexing with MultiIndex.
Also, if the correspondence is 1 to 1, you could merge both dataframes using the appropriate join.
Here you have all the information about Reshaping and Pivot Tables in Pandas.
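A runnable sketch using the frame from the question (pivot_table's default aggfunc is 'mean', which is harmless here since each person/course pair occurs only once):

import pandas as pd

grades = pd.DataFrame(
    columns=["person", "course_code", "grade"],
    data=[[1, 101, 2.0], [2, 102, 1.0], [3, 103, 3.0], [2, 104, 4.0],
          [1, 102, 5.0], [3, 104, 2.5], [2, 101, 1.0]])

print(grades.pivot_table(index='person', columns='course_code', values='grade'))
# course_code  101  102  103  104
# person
# 1            2.0  5.0  NaN  NaN
# 2            1.0  1.0  NaN  4.0
# 3            NaN  NaN  3.0  2.5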
