Finding rows which aren't in two dataframes - python

I have the following dataframe:
symbol, name
abc Jumping Jack
xyz Singing Sue
rth Fat Frog
I then have another dataframe with the same structure (symbol + name). I need to output all the symbols which are in the first dataframe but not the second.
The name column is allowed to differ. For example I could have symbol = xyz in both dataframes but with different names. That is fine. I am simply trying to get the symbols which do not appear in both dataframes.
I am sure this can be done using pandas merge and then outputting the rows that didn't merge, but I just can't seem to get it right.

Use isin and negate the condition using ~:
df[~df['symbol'].isin(df1['symbol'])]
This will return rows where 'symbol' is present in your first df and not in the other df
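Since the question mentions merge, here is a minimal sketch of both routes side by side. The sample rows and the names only_in_df and only_in_df_merge are made up for illustration; df is assumed to be the first dataframe and df1 the second, as in the answer above.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's layout.
df = pd.DataFrame({"symbol": ["abc", "xyz", "rth"],
                   "name": ["Jumping Jack", "Singing Sue", "Fat Frog"]})
df1 = pd.DataFrame({"symbol": ["xyz", "qqq"],
                    "name": ["Singing Susan", "Quiet Quail"]})

# Route 1 (the answer above): keep rows of df whose symbol is absent from df1.
only_in_df = df[~df["symbol"].isin(df1["symbol"])]

# Route 2: a left merge on symbol only, with indicator=True adding a
# "_merge" column that marks rows found only in the left frame.
merged = df.merge(df1[["symbol"]].drop_duplicates(),
                  on="symbol", how="left", indicator=True)
only_in_df_merge = merged.loc[merged["_merge"] == "left_only", ["symbol", "name"]]

print(only_in_df["symbol"].tolist())
print(only_in_df_merge["symbol"].tolist())
```

Both routes ignore the name column, so xyz counts as present in both frames even though its name differs.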

Related

Merge columns in 2 pandas dataframes based on multiple conditions

I have two pandas DFs with the same column, 'name'. I want to merge the two dataframes on that column, but there are a lot of differences in the formatting of the 'name' column.
** For reference -- the 'name' column stores the names of public companies **
Some of the differences are due to punctuation, or capitalization, or the company's short name vs its long name. For example, 'pepsi co.' in df1 may be 'pepsi co', 'pepsi cola holding company' or 'pepsico, inc.' in df2.
So what I need is to merge the two dataframes into one but let pandas know to ignore these differences. Otherwise, only around 10% of the datasets will match up.
Any ideas of what to do? Thank you :)
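One possible starting point, sketched under assumptions: normalize the names (lowercase, strip punctuation) and then pair each name with its closest counterpart using the standard library's difflib. The helper names normalize and closest, the sample rows, and the 0.8 cutoff are all hypothetical choices, not anything from the question.

```python
import difflib
import re

import pandas as pd


def normalize(name):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()


# Hypothetical sample data.
df1 = pd.DataFrame({"name": ["Pepsi Co.", "Acme Corp"]})
df2 = pd.DataFrame({"name": ["pepsi co", "ACME Corp.", "Globex, Inc."]})

df1["key"] = df1["name"].map(normalize)
df2["key"] = df2["name"].map(normalize)

candidates = df2["key"].tolist()


def closest(key):
    """Return the most similar df2 key above the cutoff, or None."""
    hits = difflib.get_close_matches(key, candidates, n=1, cutoff=0.8)
    return hits[0] if hits else None


df1["match_key"] = df1["key"].map(closest)
merged = df1.merge(df2, left_on="match_key", right_on="key",
                   how="left", suffixes=("_df1", "_df2"))
print(merged[["name_df1", "name_df2"]])
```

This only handles punctuation, capitalization and small spelling differences. A short name versus a long legal name (such as 'pepsi co' versus 'pepsi cola holding company') will probably fall below the cutoff and need a lower threshold or a manual lookup table.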

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do that the data I merged in doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that's trying to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
With this output, the id, ts and tastant columns should match up correctly with the first dataframe, but they don't.
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports data, so it could be treating numbers as int in one dataframe and as object in the other.
For the string columns, check for extra whitespace. It can appear in datasets, and since you can't see it but pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
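A minimal sketch of both checks, assuming made-up frames named first and second as in the question, with a reduced set of key columns. The sample values are hypothetical: one frame holds trial as integers and has a trailing space in subject.

```python
import pandas as pd

# Hypothetical data: 'trial' is int in one frame and str in the other,
# and 'subject' carries a stray trailing space in the first frame.
first = pd.DataFrame({"trial": [1, 2],
                      "subject": ["s1 ", "s2"],
                      "value": [0.5, 0.7]})
second = pd.DataFrame({"trial": ["1", "2"],
                       "subject": ["s1", "s2"],
                       "tastant": ["sweet", "sour"]})

# Step 1: compare the dtypes of the key columns.
print(first.dtypes)
print(second.dtypes)

# Step 2: bring the key columns to a common type and strip whitespace.
keys = ["trial", "subject"]
for frame in (first, second):
    for col in keys:
        frame[col] = frame[col].astype(str).str.strip()

combined = first.merge(second, on=keys, how="left")
print(combined)
```

Casting every key to str is the bluntest option; if a column is genuinely numeric or a date in both frames, converting both sides to that type works just as well.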

How can I find very similar but NOT equal rows within two pandas dataframe columns?

I'm trying to compare two columns from two different dataframes to get similar values. The values are strings, so they are not just the same, but very similar. How can I get those similar values?
The dataframes that I use are like the following:
Dataframe 1, column "Company", row = "Company_name"
Dataframe 2, column "Company", row = "Company_name_INC"
What I would like to get:
Dataframe 3, column "Company_source_1" row = "Company_name", column "Company_source_2", row = "Company_name_INC"
I need to find those names that are almost the same, in order to find the companies that appear in both dataframes.
You can use regular expressions:
Regular expressions (https://docs.python.org/3/howto/regex.html) can be used to do exactly what you are asking. For example, if you are looking for a company related to 'Regex' such as:
Regex
Regex_inc
NotRegex
You can do the following:
[Note that I have converted the DataFrame column Name to a Series and use the .str.contains() method, which can be used to index the appropriate rows from your original DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html )]
import pandas as pd
data = [['Regex', 'company_1'],
['Regex_inc', 'company_2'],
['NotRegex', 'company_3']]
df = pd.DataFrame(data).rename(columns={0:'Name', 1:'Company'})
df_sorted = df[pd.Series(df['Name']).str.contains(r'^Regex')]
print(df)
print(df_sorted)
Returns
        Name    Company
0      Regex  company_1
1  Regex_inc  company_2
2   NotRegex  company_3
for df, and:
        Name    Company
0      Regex  company_1
1  Regex_inc  company_2
for df_sorted
The argument for the pd.Series.str.contains() method was '^Regex' which states that for a string to return a True value, it must begin with 'Regex'.
I used this regex cheatsheet (https://www.rexegg.com/regex-quickstart.html) for the special characters. I'm not an expert on regex, but plenty of material can be found online, including the links in this answer. There is also a regex tester here (https://regex101.com/) that can be used to test your patterns.

Contains function in Pandas

I am performing matching (kind of fuzzy matching) between the company names of two data frames. For doing so, first I am performing full merge between all the company names, where the starting alphabet matches. Which means all the companies starting with 'A' would be matched to all the companies starting with 'A' in other data frame. This is done as follows:
df1['df1_Start'] = df1['company1'].astype(str).str.slice(0,2)
df2['df2_Start'] = df2['company2'].astype(str).str.slice(0,2)
Merge = pd.merge(df1,df2, left_on='df1_Start',right_on='df2_Start')
Now I want to have all the rows from Merge where the company in df1 contains the company in df2. This is because companies in df1 have longer names.
Merge1=Merge[Merge['company1'].str.contains(Merge['company2'].str)]
This isn't working for me. How do I perform this task? Also, please suggest other ways I can match the company names, because the same company might appear in both data frames without being written in exactly the same way.
I think you need | with join for generating all values separated by | (or in regex) for str.contains:
Merge1=Merge[Merge['company1'].str.contains("|".join(Merge['company2'].tolist()))]
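Note that a joined pattern keeps a row whenever company1 contains any of the company2 values, not necessarily the one on the same row, and names containing regex special characters would need re.escape first. If the comparison is meant to be row by row, a plain substring test per row is one option. This is a sketch with made-up rows; the name mask is an assumption.

```python
import pandas as pd

# Hypothetical merged frame with one candidate pair per row.
Merge = pd.DataFrame({
    "company1": ["Acme Holdings Inc", "Apple Computer Co", "Amazon Web Services"],
    "company2": ["Acme", "Alphabet", "Amazon"],
})

# Row-wise check: does company1 on this row contain company2 on the same row?
# The 'in' operator does a literal substring test, so no regex escaping is needed.
mask = [c2 in c1 for c1, c2 in zip(Merge["company1"], Merge["company2"])]
Merge1 = Merge[mask]
print(Merge1)
```

The test is case-sensitive; lowercasing both columns first would make it more forgiving.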

'combine first' in pandas produces NA error

I have two dataframes, each with a series of dates as the index. The dates do not overlap (in other words, one date range runs from, say, 2013-01-01 through 2016-06-15 by month, and the second DataFrame starts on 2016-06-15 and runs quarterly through 2035-06-15).
Most of the column names overlap (i.e. are the same) and the join works fine. However, there is one column in each DataFrame that I would like to preserve as 'belonging' to the original DataFrame so that I have them both available for future use. I gave each a different name. For example, DF1 has a column entitled opselapsed_time and DF2 has a column entitled constructionelapsed_time.
When I try to combine DF1 and DF2 together using the command DF1.combine_first(DF2) or vice versa I get this error: ValueError: Cannot convert NA to integer.
Could someone please give me advice on how best to resolve?
Do I need to just stick with using a merge/join type solution instead of combine_first?
Found the best solution:
pd.concat([test.construction,test.ops],join='outer')
Joins along the date index and keeps the different columns. To the extent the column names are the same, it will join 'inner' or 'outer' as specified.
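A small sketch of what that looks like with the top-level pd.concat function, using made-up frames named ops and construction and a shared column called cost (all hypothetical):

```python
import pandas as pd

# Hypothetical frames with non-overlapping date indexes, one shared
# column, and one frame-specific column each.
ops = pd.DataFrame(
    {"cost": [1.0, 2.0], "opselapsed_time": [10, 20]},
    index=pd.to_datetime(["2016-04-15", "2016-05-15"]),
)
construction = pd.DataFrame(
    {"cost": [3.0, 4.0], "constructionelapsed_time": [5, 6]},
    index=pd.to_datetime(["2016-06-15", "2016-09-15"]),
)

# Stack the rows; join="outer" keeps the union of the columns and fills
# the cells a frame does not have with NaN.
combined = pd.concat([ops, construction], join="outer")
print(combined)
```

The frame-specific integer columns come out as floats here because NaN cannot be stored in a plain integer column.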
