I have two pandas DataFrames that share a column, 'name'. I want to merge the two DataFrames on that column, but there are a lot of differences in how 'name' is formatted.
** For reference -- the 'name' column stores the names of public companies **
Some of the differences are due to punctuation, or capitalization, or the company's short name vs its long name. For example, 'pepsi co.' in df1 may be 'pepsi co', 'pepsi cola holding company' or 'pepsico, inc.' in df2.
So what I need is to merge the two dataframes into one but let pandas know to ignore these differences. Otherwise, only around 10% of the datasets will match up.
Any ideas of what to do? Thank you :)
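One possible approach, sketched below under some assumptions: build a normalized key for each name (lowercase, no punctuation, no common legal suffixes), then use the standard library's difflib to pick the closest key in the other frame before merging. The sample rows, the SUFFIXES set, the helper names normalize and closest, and the 0.8 cutoff are all illustrative choices rather than anything from the question, and the cutoff in particular would need tuning on real data.

```python
import difflib
import re

import pandas as pd

# Hypothetical sample data standing in for the real frames.
df1 = pd.DataFrame({"name": ["pepsi co.", "Apple Inc.", "General Motors"], "a": [1, 2, 3]})
df2 = pd.DataFrame({"name": ["PepsiCo, Inc.", "apple inc", "Ford Motor Company"], "b": [10, 20, 30]})

# Assumed list of legal suffixes to ignore; extend it for your data.
SUFFIXES = {"inc", "co", "corp", "corporation", "company", "ltd", "llc", "holding", "holdings"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop suffix words, remove spaces."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return "".join(w for w in words if w not in SUFFIXES)


df1["key"] = df1["name"].map(normalize)
df2["key"] = df2["name"].map(normalize)

candidates = df2["key"].unique().tolist()


def closest(key: str, cutoff: float = 0.8):
    """Return the most similar df2 key, or None if nothing clears the cutoff."""
    hits = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Map each df1 key to its best df2 key, then do an ordinary left merge on it.
df1["match_key"] = df1["key"].map(closest)
merged = df1.merge(
    df2, left_on="match_key", right_on="key", how="left", suffixes=("_df1", "_df2")
)
print(merged[["name_df1", "name_df2", "a", "b"]])
```

Rows with no sufficiently close candidate keep NaN in the df2 columns, so they are easy to review by hand. For large frames the pairwise comparison gets slow, and a dedicated fuzzy-matching library may be a better fit.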
Related
I am new to Python and I'm currently working on a project where I have to merge two dataframes. One dataframe, called cancer_df, holds cancer incidence by county, year, sex, gender, etc. The other dataframe, called hspa_df, holds a health score by county and year (FYI, it only covers counties in California). I would like to combine my two dataframes on county and year. Here is the cancer dataframe before the merge, and here is the hspa dataframe before the merge.
Then I imported my data and tried the following merge:
merged_df= pd.merge(cancer_df, hspa_df, on="County" , how="outer")
However, this seems to append the data rather than merge it. It adds my hspa_df at the end and fills the top of the variables they share in common with NaNs. Why is this happening? I have successfully used this merge with other dataframes, but I merged them on numerical columns, not strings.
Here is the merged dataframe's head, and here is the merged dataframe's tail.
I would like to combine my two dataframe on county and year
merged_df = pd.merge(cancer_df, hspa_df, on=['County', 'Year'] )
Whether you want an inner, left, right, or outer join depends on your use case, but note how to specify two key columns.
It fills the top of the variable they share in common as NaNs
This is what an outer join does: it keeps every row from both dataframes and fills in NaN wherever a key from one dataframe has no match in the other.
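To see which rows actually matched, a small sketch like the following may help. The two frames are made-up miniatures of cancer_df and hspa_df (the Cases and Score columns are assumptions), and indicator=True adds a _merge column saying where each row came from.

```python
import pandas as pd

# Hypothetical miniature versions of cancer_df and hspa_df.
cancer_df = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Fresno"],
    "Year": [2010, 2011, 2010],
    "Cases": [120, 130, 80],
})
hspa_df = pd.DataFrame({
    "County": ["Alameda", "Fresno", "Kern"],
    "Year": [2010, 2010, 2010],
    "Score": [7.1, 5.4, 6.0],
})

# Outer join on both keys; _merge is 'both', 'left_only' or 'right_only'.
merged_df = pd.merge(
    cancer_df, hspa_df, on=["County", "Year"], how="outer", indicator=True
)
print(merged_df)

# Counting the labels shows how much of the data really lined up.
print(merged_df["_merge"].value_counts())
```

If nearly every row comes back as left_only or right_only, the key values themselves probably differ between the two frames (spelling, case, whitespace, or dtype), and that is worth checking before changing the join type.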
I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do that, the merged data doesn't carry over and I get NaN values instead. All of the columns are of object dtype and match in that respect, so I'm not sure what's going on.
This is my first dataframe header, which is the output from a program:
This is my second dataframe header. The second df is a 'key' document used to match the first output with its correct id/tastant/etc., and the two share the same date/subject/procedure/etc.
And this is my code that tries to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
with this output (the id, ts and tastant columns should match correctly with the first dataframe, but they don't):
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports data, so it could be treating numbers as int in one dataframe and as object in the other.
For the string columns, check for extra whitespace. It can appear in datasets, and since you can't see it but pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
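As a rough illustration of both checks, the sketch below casts the key columns to strings and strips whitespace before merging. The frames are invented (only a subset of the key columns is shown, and value is a made-up column), so treat it as a pattern rather than a drop-in fix.

```python
import pandas as pd

# Hypothetical frames: one key has a trailing space, another differs in dtype.
first = pd.DataFrame({"subject": ["s1 ", "s2"], "trial": [1, 2], "value": [0.5, 0.7]})
second = pd.DataFrame(
    {"subject": ["s1", "s2"], "trial": ["1", "2"], "tastant": ["sucrose", "quinine"]}
)

# Inspect the dtypes first; mismatched key dtypes are a common cause of NaNs.
print(first.dtypes)
print(second.dtypes)

keys = ["subject", "trial"]
for df in (first, second):
    for col in keys:
        # Cast to string and remove leading/trailing whitespace on every key.
        df[col] = df[col].astype(str).str.strip()

combined = first.merge(second, on=keys, how="left")
print(combined)
```

Casting everything to str is the bluntest option. If a key is genuinely numeric or a date, converting both sides with pd.to_numeric or pd.to_datetime may be the better choice.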
I am performing matching (a kind of fuzzy matching) between the company names in two data frames. To do so, I first perform a full merge between all the company names whose starting letters match. This means all the companies starting with 'A' are matched to all the companies starting with 'A' in the other data frame. This is done as follows:
df1['df1_Start'] = df1['company1'].astype(str).str.slice(0,2)
df2['df2_Start'] = df2['company2'].astype(str).str.slice(0,2)
Merge = pd.merge(df1,df2, left_on='df1_Start',right_on='df2_Start')
Now I want to keep all the rows from Merge where the company in df1 contains the company in df2. This is because companies in df1 have longer names.
Merge1=Merge[Merge['company1'].str.contains(Merge['company2'].str)]
This isn't working for me. How do I perform this task? Also, please suggest other ways I could match the company names, because a company might appear in both data frames without being written in exactly the same way.
I think you need to join all the values with | (regex OR) and pass the result to str.contains:
Merge1 = Merge[Merge['company1'].str.contains("|".join(Merge['company2'].tolist()))]
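Note that the joined pattern keeps a row if company1 contains any of the company2 values, and it treats characters such as . or ( as regex syntax. If the goal is to compare each row's own pair, a plain Python comparison is one alternative. This is a sketch on made-up data, and the case-insensitive comparison is an assumption about what counts as a match.

```python
import pandas as pd

# Hypothetical result of the first-letters merge.
Merge = pd.DataFrame({
    "company1": ["Apple Incorporated", "Amazon.com Services", "Alphabet Holdings"],
    "company2": ["Apple", "Alphabet", "Alphabet"],
})

# Compare each row's company2 against the same row's company1 (no regex involved).
mask = [
    c2.lower() in c1.lower()
    for c1, c2 in zip(Merge["company1"], Merge["company2"])
]
Merge1 = Merge[mask]
print(Merge1)
```

Here the second row is dropped because "Alphabet" does not occur in "Amazon.com Services", even though "Alphabet" matches a different row.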
I have the following dataframe:
symbol, name
abc Jumping Jack
xyz Singing Sue
rth Fat Frog
I then have another dataframe with the same structure (symbol + name). I need to output all the symbols which are in the first dataframe but not the second.
The name column is allowed to differ. For example I could have symbol = xyz in both dataframes but with different names. That is fine. I am simply trying to get the symbols which do not appear in both dataframes.
I am sure this can be done using pandas merge and then outputting the rows that didn't merge, but I just can't seem to get it right.
Use isin and negate the condition using ~:
df[~df['symbol'].isin(df1['symbol'])]
This will return rows where 'symbol' is present in your first df and not in the other df
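Since the question asks about doing this with merge, here is a sketch of that route as well, using indicator=True and keeping only the left_only rows. The contents of the second frame are invented for illustration. On this data it returns the same symbols as the isin version.

```python
import pandas as pd

df = pd.DataFrame({
    "symbol": ["abc", "xyz", "rth"],
    "name": ["Jumping Jack", "Singing Sue", "Fat Frog"],
})
# Hypothetical second frame: shares 'xyz' (with a different name) and adds 'qqq'.
df1 = pd.DataFrame({"symbol": ["xyz", "qqq"], "name": ["Singing Susan", "Quiet Quail"]})

# Left-merge on symbol only, so differing names do not matter.
flagged = df.merge(
    df1[["symbol"]].drop_duplicates(), on="symbol", how="left", indicator=True
)

# Rows marked 'left_only' have a symbol that never appears in df1.
only_in_df = flagged.loc[flagged["_merge"] == "left_only", ["symbol", "name"]]
print(only_in_df)
```

The drop_duplicates call keeps repeated symbols in df1 from multiplying rows in the result.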
I have two dataframes, each with a series of dates as the index. The dates do not overlap (in other words, one date range runs from, say, 2013-01-01 through 2016-06-15 by month, and the second DataFrame starts on 2016-06-15 and runs quarterly through 2035-06-15).
Most of the column names overlap (i.e. are the same) and the join does just fine. However, there is one column in each DataFrame that I would like to preserve as 'belonging' to the original DataFrame so that I have them both available for future use. I gave each a different name. For example, DF1 has a column entitled opselapsed_time and DF2 has a column entitled constructionelapsed_time.
When I try to combine DF1 and DF2 together using the command DF1.combine_first(DF2) or vice versa I get this error: ValueError: Cannot convert NA to integer.
Could someone please give me advice on how best to resolve?
Do I need to just stick with using a merge/join type solution instead of combine_first?
Found the best solution:
pd.concat([test.construction, test.ops], join='outer')
Joins along the date index and keeps the different columns. To the extent the column names are the same, it will join 'inner' or 'outer' as specified.
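A small self-contained sketch of the same idea, using pd.concat from the top-level pandas namespace. The two frames, their dates, and the cost column are invented; only the opselapsed_time and constructionelapsed_time names come from the question.

```python
import pandas as pd

# Hypothetical monthly and quarterly frames with non-overlapping dates.
ops = pd.DataFrame(
    {"cost": [1.0, 2.0], "opselapsed_time": [1, 2]},
    index=pd.to_datetime(["2016-04-15", "2016-05-15"]),
)
construction = pd.DataFrame(
    {"cost": [3.0, 4.0], "constructionelapsed_time": [1, 2]},
    index=pd.to_datetime(["2016-06-15", "2016-09-15"]),
)

# Stack the rows; join='outer' keeps the union of the column names.
combined = pd.concat([ops, construction], join="outer")
print(combined)
print(combined.dtypes)
```

The shared cost column lines up, while each frame's own elapsed-time column is NaN for the rows that came from the other frame. Those integer columns come back as float for that reason, since a plain int64 column cannot hold NaN.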