DataFrames not merging properly, getting an extra column - python

I want to join DataFrames with the same column names into a single DataFrame. When I run df = pd.concat([df1, df2], ignore_index=True, sort=False), I get an extra column. Please help.
(screenshot: the concatenated DataFrame I am getting, with the extra column)

I think the problem is that the State column name has a trailing space in one of the DataFrames, e.g. 'State ' instead of 'State', so after concat it was not aligned with the State column.
Solution is:
#test columns names
print (df1.columns)
print (df2.columns)
#remove leading/trailing spaces
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
df = pd.concat([df1, df2], ignore_index=True, sort=False)

It looks like it's not recognizing the "state" column as the same field. I'd probably start by taking a look at how the state column is set up in each table to see why pandas thinks they're different. If you can find a difference, format them to be the same and then try again.
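One quick way to see why pandas treats them as different fields is to print the column names with their exact characters. A minimal sketch, using made-up frames that reproduce the symptom (a trailing space in one column name):
import pandas as pd

# Hypothetical frames reproducing the symptom: one column name has a trailing space.
df1 = pd.DataFrame({"State": ["CA"], "Value": [1]})
df2 = pd.DataFrame({"State ": ["NY"], "Value": [2]})

# repr() makes hidden whitespace visible, e.g. 'State ' vs 'State'.
print([repr(c) for c in df1.columns])
print([repr(c) for c in df2.columns])

# Column names present in one frame but not the other:
print(set(df1.columns) ^ set(df2.columns))  # {'State', 'State '}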

Related

Pandas concat two dataframes with different numbers of rows

I have a question about pd.concat. I get some weird results and I do not understand why.
Let's start with a simple example (this should also show what I want to achieve):
import pandas as pd
df1 = pd.DataFrame([[1,2,3],[7,6,5]], columns = ["A","B","C"])
print("DF1: \n", df1)
df2 = pd.DataFrame([[4,5,6]], columns = ["A","B","C"])
print("DF2: \n", df2)
df3 = pd.concat([df1, df2], ignore_index = True)
print("Concat DF1 and DF2: \n",df3)
Now I have my actual program, where I have DataFrames like this:
When I apply the concat function, I get this:
It makes zero sense to me. What could possibly be the reason?
P.S. It's not urgent, because I found a workaround, but this bothers me and makes me a bit angry too.
Use either of the following to connect two DataFrames along their rows (note that DataFrame.append has been removed in recent pandas versions, so prefer the concat form):
Code 1) self.teste_df = (self.teste_df).append(test, ignore_index=True)
Code 2) pd.concat([self.teste_df, test], axis=0, ignore_index=True)
I made them both a list, and combined the lists with +.
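For completeness, a rough sketch of that list-based workaround, reusing the toy frames from the question: convert each DataFrame to a list of row records, combine the lists with +, and rebuild a single DataFrame.
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [7, 6, 5]], columns=["A", "B", "C"])
df2 = pd.DataFrame([[4, 5, 6]], columns=["A", "B", "C"])

# Each frame becomes a list of row dicts; + joins the lists end to end.
rows = df1.to_dict("records") + df2.to_dict("records")
df3 = pd.DataFrame(rows)
print(df3)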

Compare data of two columns of one dataframe with two columns of another dataframe and find mismatch data

I have dataframe df1 as follows:
The second dataframe df2 is as follows:
and I want the resulting dataframe to be as follows:
Dataframes df1 and df2 contain a large number of columns and rows, but I am only showing sample data here. My goal is to compare the Customer and ID columns of df1 with the Customer and Part Number columns of df2. The comparison should find rows whose df1['Customer'] and df1['ID'] values have no match in df2['Customer'] and df2['Part Number'], and store that mismatched data in another dataframe df3. For example: Customer (rishab) with ID (89ab) is present in df1 but not in df2, so its Customer, Order#, and Part are stored in df3.
I am using the isin() method to find mismatches of df1 against df2, but only for a single column; I am not able to do it for a comparison over two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
#here I am only able to find mismatches based on the single ID column, but I want to include Customer as well
I could use a loop as well, but the data is very large (the time complexity would blow up) and I am sure this can be done with a one-liner. I have also tried merge but was not able to produce the exact output.
So, how do I produce this exact output? I am also not able to use isin() over two columns, and I think isin() cannot be used for two columns at once.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='left', indicator=True)
df3.reset_index(inplace = True)
df3 = df3[df3['_merge'] == 'left_only']
Here, you first do a left join on those column pairs with indicator=True, which adds an extra column called _merge that records which side each row came from, and then we keep only the left_only rows.
You can also try an outer join to get the non-matching rows. Something like df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='outer')
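If you go the outer-join route, adding indicator=True again makes it easy to keep only the rows that did not match. A sketch with made-up sample rows standing in for the frames in the question:
import pandas as pd

# Hypothetical sample data shaped like the question's frames.
df1 = pd.DataFrame({"Customer": ["rishab", "john"], "ID": ["89ab", "12cd"]})
df2 = pd.DataFrame({"Customer": ["john"], "Part Number": ["12cd"]})

merged = df1.merge(
    df2,
    left_on=["Customer", "ID"],
    right_on=["Customer", "Part Number"],
    how="outer",
    indicator=True,
)

# Rows flagged left_only exist in df1 but have no match in df2.
df3 = merged[merged["_merge"] == "left_only"].drop(columns=["_merge", "Part Number"])
print(df3)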

Adding correction column to dataframe

I have a pandas dataframe I read from a csv file with df = pd.read_csv("data.csv"):
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9
I also have a dataframe with corrections, df_corr = pd.read_csv("corrections.csv"):
date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2
How do I apply these corrections where date and location match to get the following?
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,4,8
2020-01-03,place2,4,9
EDIT:
I got two good answers and decided to go with set_index(). Here is how I did it 'non-destructively'.
df = pd.read_csv("data.csv")
df_corr = pd.read_csv("corr.csv")
idx = ['date', 'location']
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(columns={"value": "value1"}),
    fill_value=0
).astype(int).reset_index()
It looks like you want to join the two DataFrames on the date and location columns. After that it's a simple matter of applying the correction by adding the value1 and value columns (and finally dropping the column containing the corrections).
# Join on the date and location columns.
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
# Apply the correction by adding the columns (rows without a correction add 0).
df_corrected.value1 = df_corrected.value1 + df_corrected.value.fillna(0)
# Drop the correction column.
df_corrected.drop(columns='value', inplace=True)
Set date and location as the index in both dataframes, add the two, and fillna:
df.set_index(['date', 'location'], inplace=True)
df_corr.set_index(['date', 'location'], inplace=True)
df['value1'] = (df['value1'] + df_corr['value']).fillna(df['value1'])
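For reference, a self-contained sketch that builds the sample data from the question inline (instead of reading CSV files from disk) and applies the index-aligned addition from the EDIT above:
import pandas as pd
from io import StringIO

data = """date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9"""

corr = """date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2"""

df = pd.read_csv(StringIO(data))
df_corr = pd.read_csv(StringIO(corr))

idx = ["date", "location"]
# Align on (date, location); rows and columns with no correction add 0.
df_corrected = (
    df.set_index(idx)
    .add(df_corr.set_index(idx).rename(columns={"value": "value1"}), fill_value=0)
    .astype(int)
    .reset_index()
)
print(df_corrected)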

How to place different dataframes side by side when they are generated iteratively in Python?

I have code through which I am generating different DataFrames and appending them one below the other.
df = pd.DataFrame()
...
new_col = pd.read_parquet(filepath)
aux = pd.concat([aux, new_col])
aux['measure'] = sn
df = df.append(aux)
The code works fine, but I need them side by side. df is an empty DataFrame into which I append every aux, and aux contains all the data. Apparently, therefore, neither concat, join, nor merge works, since I cannot concat df and aux.
Thanks!
Side by side?
pd.concat(list_of_df, axis=1)
This should be good
In order to concatenate them, do it just as you did on the line above, but specify axis=1 and join='outer'. Nonetheless, you have to reset the index beforehand, because concat aligns on the index.
aux.reset_index(inplace=True, drop=True)
df = pd.concat([df, aux], axis=1, join='outer')
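An alternative sketch of the whole loop: collect each aux frame in a list and call concat once with axis=1 at the end. The filepaths and measures names below are placeholders standing in for the question's elided loop inputs.
import pandas as pd

pieces = []
for filepath, sn in zip(filepaths, measures):  # placeholders for the question's inputs
    aux = pd.read_parquet(filepath)
    aux["measure"] = sn
    # Reset the index so rows line up when the frames are placed side by side.
    pieces.append(aux.reset_index(drop=True))

# A single concat at the end puts all the frames next to each other.
df = pd.concat(pieces, axis=1)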

Comparing two dataframe columns and outputting a third

I apologize in advance if this has been covered, I could not find anything quite like this. This is my first programming job (I was previously software QA) and I've been beating my head against a wall on this.
I have 2 dataframes; one of them [df2] is very large (14.6 million lines) and I am iterating through it in chunks. I attempted to compare a column of the same name in each dataframe; where they're equal, I would like to output a secondary column of the larger frame.
i.e.
if df1['tag'] == df2['tag']:
    df1['new column'] = df2['plate']
I attempted a merge but this didn't output what I expected.
df3 = pd.merge(df1, df2, on='tag', how='left')
I hope I did an okay job explaining this.
[Edit:] I also believe I should mention that df2 and df1 both have many additional columns I do not want to interact with/change. Is it possible to only compare the single columns of two dataframes, and output the third additional column?
You may try an inner merge. First, inner merge df1 with df2; then you will get plates only for the common rows, and you can rename the new column in df1 as needed.
df1 = df1.merge(df2, on="tag", how = 'inner')
df1['new column'] = df1['plate']
del df1['plate']
I hope this works.
As smci mentioned, this is a perfect time to use join/merge. If you're looking to preserve df1, a left join is what you want. So you were on the right path:
df1 = pd.merge(df1,
               df2[['tag', 'plate']],
               on='tag', how='left')
df1 = df1.rename({'plate': 'new column'}, axis='columns')
That will only compare the tag columns in each dataframe, so the other columns won't matter. It'll bring over the plate column from df2, and then renames it to whatever you want your new column to be named.
This is totally a case for join/merge. You want df2 as the calling frame, since that is where the plate column lives, joined against df1 indexed by tag.
df2.join(df1.set_index('tag'), on='tag', ...)
You only misunderstood the type of join/merge you want to make:
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
'how'='left' join would create an (unwanted) entry for all rows of the LHS df2. That's not quite what you want (if df2 contained other tag values not seen in df1, you'd also get entries for them).
'how'='inner' would form the intersection of df2 and df1 on the 'on'='tag' field. i.e. you only get entries for where df1 contains a valid tag value according to df2.
So:
df3 = df2.join(df1.set_index('tag'), on='tag', how='inner')
# then reference df3['plate']
or if you only want the 'plate' column in df3 (or some other selection of columns), you can directly do:
df2.join(df1.set_index('tag'), on='tag', how='inner')['plate']
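Since df2 is already being read in chunks, one hedged sketch is to merge each chunk against df1 and collect the matches as you go. The file name, chunk size, and column selection here are assumptions for illustration, not part of the answers above.
import pandas as pd

# df1 (the smaller frame) is assumed to be in memory already.
matches = []
for chunk in pd.read_csv("df2.csv", chunksize=1_000_000):  # hypothetical file and chunk size
    merged = df1.merge(chunk[["tag", "plate"]], on="tag", how="inner")
    matches.append(merged)

# df1's columns plus 'plate' for every tag that also appears in df2.
df3 = pd.concat(matches, ignore_index=True)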
