Pandas - Getting just the columns that changed between comparison - python

I have a post-outer-merge DataFrame with an indicator column (left_only, right_only) that tells which side each row came from.
But I want only the columns that changed, not the whole row. I'm filtering by left_only, but I need just the columns that effectively changed. The ID column needs to be "locked" in place; it is the only one that must not be removed when I keep only the changed columns.
For example:
ID     Manufacturer  TAG  ID2  _merge
10003  Apple         334  999  left_only
10003  Samsung       223  999  right_only
10004  Samsung       253  567  left_only
10004  Samsung       253  999  right_only
The output should be:
ID     Manufacturer  TAG  ID2
10003  Apple         334  999
10004                     567
And just for the record, I'm doing the merge like this:
df_final = (pd.merge(
    df, df2, how='outer',
    indicator=True)
).query('_merge != "both"')
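For the record, a minimal sketch of one way to get from df_final to that output (assuming each ID appears at most once per side; unchanged values come back empty rather than repeated):

# split the merged frame back into its left and right halves, keyed by ID
left = df_final.loc[df_final['_merge'] == 'left_only'].set_index('ID').drop(columns='_merge')
right = df_final.loc[df_final['_merge'] == 'right_only'].set_index('ID').drop(columns='_merge')

# keep a left-side value only where it differs from the right side
changed = left.where(left.ne(right))
print(changed.reset_index())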

Related

How to iterate each row and find the next matching column value from a specific column in one dataframe and compare it to another dataframe?

I have two dataframes:
DF1: Group A
employee_id | key
100 101001
101 020208
102 101002
103 020208
104 020208
... ...
300 010506
DF2: Group B
employee_id | key
110 101001
111 020208
112 020105
113 020208
114 020208
... ...
600 051007
Compare the key from each row in both dataframes. For each matched employee, create a new dataframe with DF1.employee_id, DF1.key, and DF2.employee_id, and remove the matched person from DF2.
I want to iterate over each employee in DF1 one at a time, find a matched record in DF2, and once there is a match, remove that record from DF2. The goal is not to have a duplicated matched employee from DF2 for each matched employee from DF1. How do I iterate this process?
clean = df_1.merge(df_2, on=['key'], how='left')
The script above gives me duplicated records. I want the new dataframe to look like this:
New Dataframe (Sample):
employee_id_df1 | key | employee_id_df2
100 101001 110
101 020208 111
103 020208 113
104 020208 114
The goal is to have a 1-to-1 match.
You can try to create a temporary column with groupby().cumcount(), which numbers each occurrence of a key (0, 1, 2, ...); merging on ["key", "tmp"] then pairs the n-th occurrence of a key in df1 with the n-th occurrence in df2, which gives the 1-to-1 match:
df1["tmp"] = df1.groupby("key").cumcount()
df2["tmp"] = df2.groupby("key").cumcount()
df_out = pd.merge(df1, df2, on=["key", "tmp"], how="inner")
df_out = df_out.rename(
columns={"employee_id_x": "employee_id_df1", "employee_id_y": "employee_id_df2"}
).drop(columns="tmp")
print(df_out)
Prints (the keys lose their leading zeros here because the sample data was read as integers; read the key column with dtype=str to preserve them):
employee_id_df1 key employee_id_df2
0 100 101001 110
1 101 20208 111
2 102 20105 112
3 103 20208 113
4 104 20208 114
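If you also need DF2 with the matched people removed, as the question asks, a small sketch reusing the same tmp column (df2_remaining is a name introduced here):

# df2 rows whose (key, tmp) pair was never consumed by the 1-to-1 merge
df2_remaining = (df2.merge(df1[['key', 'tmp']], on=['key', 'tmp'],
                           how='left', indicator=True)
                    .query('_merge == "left_only"')
                    .drop(columns=['_merge', 'tmp']))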

Compare rows in df to rows in different df when columns and length of df is different

I have the following data on a df1:
id date ... paid
0 123 2020-10-14 ... 30.0
1 234 2020-09-23 ... 25.5
2 356 2020-08-25 ... 35.5
There's some other information on df2:
id payment_date amount type ... other_info
0 568 2020-08-25 15.9 adj1 ... some_words
1 123 2020-10-14 20.0 adj2 ... more_words
2 234 2020-09-23 25.5 adj2 ... some_other_words
3 356 2020-08-25 35.5 adj2 ... some_more_words
I need to compare every row on df1 against the rows on df2, on the specific columns mentioned. If they are an exact match, I'd like to add a column on df1 with the boolean result, or some str like "Yes". The final output should be similar to this:
id date ... paid new_col
0 123 2020-10-14 ... 30.0 False
1 234 2020-09-23 ... 25.5 True
2 356 2020-08-25 ... 35.5 True
Notice that the index is not important in either of the two dataframes, and their lengths are different (df1 is around 100,000 rows and 6 columns, df2 around 2,000,000 rows and 13 columns). The other columns don't matter for the comparison.
I've tried to use something like:
df1["new_col"] = ((df1["id"] == df2["id"]) &
(df1["date"] == df2["payment_date"]) &
(df1["paid"] == df2["amount"]))
But I get this: "ValueError: Can only compare identically-labeled Series objects". I can't use something like merge because the column names are not the same, and df2 is too big, so it would take additional time. Also, I can't use pd.Series.isin() because each ID has lots of dates and amounts, and they must match perfectly. Dates and amounts are also the same across several rows; the difference only shows up when comparing the three columns together.
I'm looking for a vectorized approach to this problem, or just an efficient way to accomplish this without iterating row by row on both dataframes.
You could use merge like
In [37]: df1['new_col'] = df1.merge(df2,
                                    left_on=['id', 'date', 'paid'],
                                    right_on=['id', 'payment_date', 'amount'],
                                    how='left', indicator=True)['_merge'].eq('both')
In [38]: df1
Out[38]:
id date paid new_col
0 123 2020-10-14 30.0 False
1 234 2020-09-23 25.5 True
2 356 2020-08-25 35.5 True
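One caveat: if df2 contains duplicate (id, payment_date, amount) combinations, which is plausible at 2,000,000 rows, the left merge returns more rows than df1 and the assignment above no longer lines up. A hedged guard is to deduplicate df2 on the comparison columns first (df2_unique is a name introduced for this sketch):

# keep one row per (id, payment_date, amount) combination so the left
# merge returns exactly one row for each row of df1
df2_unique = df2.drop_duplicates(subset=['id', 'payment_date', 'amount'])

df1['new_col'] = df1.merge(df2_unique,
                           left_on=['id', 'date', 'paid'],
                           right_on=['id', 'payment_date', 'amount'],
                           how='left', indicator=True)['_merge'].eq('both')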

Create a new column using str.contains and where the condition fails, set it to null (NaN)

I am trying to create a new column in my pandas dataframe, but only with a value if another column contains a certain string.
My dataframe looks something like this:
raw val1 val2
0 Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456
2 13445 07708-20-2019 US 432 676
3 79935 19028808-15-2019 US 444 234
4 Vendor: company Name 2 234 234
I am trying to create a new column, vendor, that transforms the dataframe into:
raw val1 val2 vendor
0 Vendor Invoice Numbe Inv Date Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456 Vendor: Company Name 1
2 13445 07708-20-2019 US 432 676 NaN
3 79935 19028808-15-2019 US 444 234 NaN
4 Vendor: company Name 2 234 234 company Name 2
5 Vendor: company Name 2 928 528 company Name 2
However, whenever I try,
df['vendor'] = df.loc[df['raw'].str.contains('Vendor', na=False), 'raw']
I get the error
ValueError: cannot reindex from a duplicate axis
I know that at index 4 and 5 it's the same value for the company, but what am I doing wrong, and how do I add the new column to my dataframe?
The problem is that df.loc[df['raw'].str.contains('Vendor', na=False), 'raw'] has a different length than df.
You can try np.where, which assigns the new column from a NumPy array of the same length as df, so it doesn't need index alignment:
import numpy as np

# na=False treats missing values in 'raw' as non-matches
df['vendor'] = np.where(df['raw'].str.contains('Vendor', na=False), df['raw'], np.nan)
You could .extract() the part of the string that comes after Vendor: using a positive lookbehind:
df['vendor'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)')

Pandas merge result missing rows when joining on strings

I have a data set that I've been cleaning and to clean it I needed to put it into a pivot table to summarize some of the data. I'm now putting it back into a dataframe so that I can merge it with some other dataframes. df1 looks something like this:
Count Region Period ACV PRJ
167 REMAINING US WEST 3/3/2018 5 57
168 REMAINING US WEST 3/31/2018 10 83
169 SAN FRANCISCO 1/13/2018 99 76
170 SAN FRANCISCO 1/20/2018 34 21
df2 looks something like this:
Count MKTcode Region
11 RSMR0 REMAINING US SOUTH
12 RWMR0 REMAINING US WEST
13 SFR00 SAN FRANCISCO
I've tried merging them with this code:
df3 = pd.merge(df1, df2, on='Region', how='inner')
but for some reason pandas is not interpreting the Region columns as the same data; the merge turns up NaN in the MKTcode column and seems to append df2 to df1, like this:
Count Region Period ACV PRJ MKTcode
193 WASHINGTON, D.C. 3/3/2018 36 38 NaN
194 WASHINGTON, D.C. 3/31/2018 12 3 NaN
195 ATLANTA NaN NaN NaN ATMR0
196 BOSTON NaN NaN NaN B2MRN
I've tried inner and outer joins, but the real problem seems to be that pandas is interpreting the Region column of each dataframe as different elements.
The MKTcode and Region columns in df2 have only 12 observations, and each occurs only once, whereas df1 has several repeating instances in the Region column (multiples of the same city). Is there a way I can just create a list of the 12 MKTcodes that I need and perform a merge that matches each region I designate? Like a one-to-many match?
Thanks.
When a merge isn't working as expected, the first thing to do is look at the offending columns.
The biggest culprit in most cases is trailing/leading whitespaces. These are usually introduced when DataFrames are incorrectly read from files.
Try getting rid of extra whitespace characters by stripping them out. Assuming you need to join on the "Region" column, use
for df in (df1, df2):
    # Strip the column(s) you're planning to join on
    df['Region'] = df['Region'].str.strip()
Now, merging should work as expected,
pd.merge(df1, df2, on='Region', how='inner')
Count_x Region Period ACV PRJ Count_y MKTcode
0 167 REMAINING US WEST 3/3/2018 5 57 12 RWMR0
1 168 REMAINING US WEST 3/31/2018 10 83 12 RWMR0
2 169 SAN FRANCISCO 1/13/2018 99 76 13 SFR00
3 170 SAN FRANCISCO 1/20/2018 34 21 13 SFR00
Another possibility, if you're still getting NaNs, is a difference in whitespace characters between words. For example, 'REMAINING US WEST' will not compare as equal with 'REMAINING  US WEST' (note the doubled space between words).
This time, the fix is to use str.replace:
for df in (df1, df2):
    # regex=True treats the pattern as a regular expression
    # (the default changed to False in pandas 2.0)
    df['Region'] = df['Region'].str.replace(r'\s+', ' ', regex=True)
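If it's unclear which of the two problems you have, a quick diagnostic sketch (not part of the original answer) is to print the repr of the values that failed to match; repr makes stray tabs, doubled spaces, and non-breaking spaces visible:

# values present in df1's Region column but missing from df2's
unmatched = set(df1['Region']) - set(df2['Region'])
for value in unmatched:
    print(repr(value))  # e.g. 'REMAINING US WEST ' exposes a trailing space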

Adding column from dataframe with different structure

I have the following two dataframe structures:
roc_100
max min
industry Banks Health Banks Health
date
2015-03-15 3456 456 345 567
2015-03-16 6576 565 435 677
2015-03-17 5478 657 245 123
and:
roc_100
max min
date
2015-03-15 546 7856
2015-03-16 677 456
2015-03-17 3546 346
As can be seen, the difference between the two dataframes is that the bottom one doesn't have an 'industry' level. The rest of the structure is the same, i.e. it also has dates down the left and is grouped under roc_100, under which are max and min.
What I need to do is add the columns from the bottom dataframe to the top dataframe, and give the added columns an industry name, eg: 'benchmark'. The resulting dataframe should be as follows:
roc_100
max min
industry Banks Health Benchmark Banks Health Benchmark
date
2015-03-15 3456 456 546 345 567 7856
2015-03-16 6576 565 677 435 677 456
2015-03-17 5478 657 3546 245 123 346
I have tried using append and join, but neither option has worked so far because the one dataframe has an 'industry' and the other doesn't.
Edit:
I have managed to merge them correctly using:
industry_df = industry_df.merge(benchmark_df, how='inner', left_index=True, right_index=True)
The only problem now is that the newly added columns still don't have an 'industry'.
This means that if I just want one industry, eg: Health, then I can do:
print(industry_df['roc_100', 'max', 'Health'])
That works, but if I want to print all the industries including the newly added columns I can't do that. If I try:
print(industry_df['roc_100', 'max'])
This only prints out the newly added columns because they are the only ones which don't have an 'industry'. Is there a way to give these newly merged columns a name ('industry')?
You can use stack() and unstack() to bring the two dataframes to identical index structures, assign the new benchmark columns there, and as a last step restore the initial index/column structure with the same stack() and unstack().
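A minimal sketch of that approach, assuming the layouts shown in the question (industry_df with a three-level column MultiIndex ending in industry, benchmark_df with two levels, both indexed by date; 'Benchmark' is the label suggested in the question):

import pandas as pd

# move the innermost column level (industry) into the row index,
# so both frames have the same two-level columns (roc_100, max/min)
stacked = industry_df.stack(level=-1)      # index becomes (date, industry)

# tag the benchmark rows with the industry label 'Benchmark'
bench = benchmark_df.copy()
bench.index = pd.MultiIndex.from_product(
    [bench.index, ['Benchmark']], names=[bench.index.name, 'industry']
)

# append the benchmark rows, then restore industry to the columns
result = pd.concat([stacked, bench]).unstack('industry')
print(result['roc_100', 'max'])            # Banks, Health and Benchmark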
