I have two CSV files,
a1.csv
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/
a2.csv
city,state,link
Aguila,Arizona,http://www.co.apache.az.us
I want to get the difference.
Here is my attempt:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print(c)
Expected Output:
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
But I am getting an empty result instead:
Empty DataFrame
Columns: [city, state, link]
Index: []
I want to compare the rows based on the first two columns, and if they match, remove the row.
You can use pandas to read in the two files, concatenate them, and remove all duplicate rows:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a, b], axis=0)
ab = ab.drop_duplicates(keep=False)  # keep=False drops every copy of a duplicated row
print(ab)
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
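As a quick sanity check, here is the same idea on two small made-up frames (toy data, not the question's actual files). Note that this only removes rows that match exactly across all columns:

```python
import pandas as pd

a = pd.DataFrame({'city': ['Aguila', 'AkChin', 'Phoenix'],
                  'state': ['Arizona', 'Arizona', 'Arizona']})
b = pd.DataFrame({'city': ['Phoenix'], 'state': ['Arizona']})

# keep=False drops every row that occurs more than once in the
# concatenated frame, leaving only rows unique to either input
diff = pd.concat([a, b]).drop_duplicates(keep=False)
print(diff)
```

The Phoenix row appears in both frames, so both copies are dropped; only the rows unique to `a` remain.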
First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
# of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1
type_of_fruit  name_of_fruit  price
.....          .....          .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or less values than expected. In cases where there are less values than expected (ie the df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then on how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 having cols:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(columns=unwanted_cols, inplace=True)
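Note that set does not preserve column order. If order matters, a list comprehension over df1.columns keeps the DataFrame's original ordering (a small sketch using the sample column names from the question):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]],
                   columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']

# iterate over df1.columns so the original column order is preserved
unwanted_cols = [c for c in df1.columns if c not in expected_cols]
df2 = df1[unwanted_cols]              # the "dropped" columns, saved for later
df1 = df1.drop(columns=unwanted_cols)
print(df1.columns.tolist())  # ['name_of_fruit', 'price']
print(df2.columns.tolist())  # ['type_of_fruit']
```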
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
I'm trying to drop some rows in a pandas data frame, but I'm getting this error: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
I have a list of desired items that I want to stay in the Data Frame, so I wrote this:
import sys
import pandas as pd
biog = sys.argv[1]
df = pd.read_csv(biog, sep ='\t')
desired = ['Affinity Capture-Luminescence', 'Affinity Capture-MS', 'Affinity Capture-Western', 'Co-crystal Structure', 'Far Western', 'FRET', 'PCA', 'Reconstituted Complex']
new_df = df[['OFFICIAL_SYMBOL_A','OFFICIAL_SYMBOL_B','EXPERIMENTAL_SYSTEM']]
for i in desired:
    print(i)
    new_df.drop(new_df[new_df.EXPERIMENTAL_SYSTEM != i].index, inplace=True)
print(new_df)
It works if I apply a single condition at a time, but it doesn't work once the for loop is introduced.
I didn't include the data here because it is too large; I hope this is enough.
Thanks for the help.
You can build the new df by keeping the rows whose column value is in a list of values. There is no need to loop:
new_df = new_df[new_df['EXPERIMENTAL_SYSTEM'].isin(desired)]
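A minimal, self-contained sketch with made-up data (column names taken from the question, values invented):

```python
import pandas as pd

df = pd.DataFrame({
    'OFFICIAL_SYMBOL_A': ['g1', 'g2', 'g3'],
    'OFFICIAL_SYMBOL_B': ['g4', 'g5', 'g6'],
    'EXPERIMENTAL_SYSTEM': ['FRET', 'Two-hybrid', 'PCA'],
})
desired = ['FRET', 'PCA']

# isin builds one boolean mask covering all desired values at once,
# so no loop (and no SettingWithCopyWarning) is needed
new_df = df[df['EXPERIMENTAL_SYSTEM'].isin(desired)]
print(new_df)
```

This keeps only the first and third rows, since their EXPERIMENTAL_SYSTEM values appear in `desired`.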
Is there a way to reference an object within the line of its instantiation?
See the following example :
I wanted to drop the first column (by index) of a csv file just after reading it (usually DataFrame.to_csv writes the index as the first column):
df = pd.read_csv(csvfile).drop(self.columns[[0]], axis=1)
I understand self should be placed in an object context, but here it describes what I intend to do.
(Of course, doing this operation in two separate lines works perfectly.)
One way is to use pd.DataFrame.iloc:
import pandas as pd
from io import StringIO
mystr = StringIO("""col1,col2,col3
a,b,c
d,e,f
g,h,i
""")
df = pd.read_csv(mystr).iloc[:, 1:]
# col2 col3
# 0 b c
# 1 e f
# 2 h i
Assuming you know the total number of columns in the dataset, and the indexes you want to remove -
a = list(range(3))  # range itself has no .remove in Python 3
a.remove(1)
df = pd.read_csv('test.csv', usecols = a)
Here 3 is the total number of columns, and I wanted to remove the 2nd column. You can also directly write the indexes of the columns to use.
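If the total column count isn't known in advance, usecols also accepts a callable over the column names (a small sketch with inline data; col1 stands in for the unwanted index column):

```python
import pandas as pd
from io import StringIO

data = StringIO("""col1,col2,col3
a,b,c
d,e,f
""")

# the callable is evaluated once per column name; return True to keep it
df = pd.read_csv(data, usecols=lambda name: name != 'col1')
print(df)
```

This skips col1 without hard-coding the number of columns in the file.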
I am looking for a way to do some conditional mapping using multiple comparisons.
I have millions and millions of rows that I am investigating using sample SQL extracts in pandas. Along with SQL extracts read into pandas DataFrames I also have some rules tables, each with a few columns (these are also loaded into dateframes).
This is what I want to do: where a row in my SQL extract matches the conditions expressed in any one row in my rules table, I would like to generate a 1, else: 0. In the end I would like to add a column to my SQL extract called Rule Result with either 1's and 0's.
I have got a system that works using df.merge, but it produces many many extra duplicate rows in the process that must then be removed afterwards. I am looking for a better, faster, more elegant solution and would be grateful for any suggestions.
Here is a working example of the problem and the current solution code:
import pandas as pd
import numpy as np
#Create a set of test data
test_df = pd.DataFrame()
test_df['A'] = [1,120,982,1568,29,455,None, None, None]
test_df['B'] = ['EU','US',None, 'GB','DE','EU','US', 'GB',None]
test_df['C'] = [1111,1121,1111,1821,1131,1111,1121,1821,1723]
test_df['C_C'] = test_df['C']
test_df
#Create a rules_table
rules_df = pd.DataFrame()
rules_df['A_MIN'] = [0,500,None,600,200]
rules_df['A_MAX'] = [10,1000,500,1200,800]
rules_df['B_B'] = ['GB','GB','US','EU','EU']
rules_df['C_C'] = [1111,1821,1111,1111,None]
rules_df
def map_rules_to_df(df, rules):
    # create a column that mimics the index to keep track of later duplication
    df['IND'] = df.index
    # merge the rules with the data on C values
    df = df.merge(rules, left_on='C_C', right_on='C_C', how='left')
    # create a rule result column with a default value of zero
    df['RULE_RESULT'] = 0
    # create a mask identifying those test_df rows that fit with a
    # rules_df row
    mask = df[
        ((df['A'] > df['A_MIN']) | (df['A_MIN'].isnull())) &
        ((df['A'] < df['A_MAX']) | (df['A_MAX'].isnull())) &
        ((df['B'] == df['B_B']) | (df['B_B'].isnull())) &
        ((df['C'] == df['C_C']) | (df['C_C'].isnull()))
    ]
    # use mask.index to replace 0's in the result column with a 1
    df.loc[mask.index.tolist(), 'RULE_RESULT'] = 1
    # drop the redundant rules_df columns
    df = df.drop(['B_B', 'C_C', 'A_MIN', 'A_MAX'], axis=1)
    # drop duplicate rows
    df = df.drop_duplicates(keep='first')
    # drop rows where the original index is duplicated and the rule result
    # is zero
    df = df[((df['IND'].duplicated(keep=False)) & (df['RULE_RESULT'] == 0)) == False]
    # reset the df index with the original index
    df.index = df['IND'].values
    # drop the now redundant second index column (IND)
    df = df.drop('IND', axis=1)
    print('df shape', df.shape)
    return df
#map the rules
result_df = map_rules_to_df(test_df,rules_df)
result_df
I hope I have made what I would like to do clear and thank you for your help.
PS: my rep is non-existent, so I was not allowed to post more than two supporting images.
I read a csv file using Pandas. Then, I am checking to see if there are any duplicate rows in the data using the code below:
import pandas as pd
df= pd.read_csv("data.csv", na_values=["", " ", "-"])
print(df.shape)
>> (71644, 15)
print(df.drop_duplicates().shape)
>> (31171, 15)
I find that there are some duplicate rows, so I want to see which rows appear more than once:
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size()
size[size > 1]
Doing that I get Series([], dtype: int64).
Furthermore, I can find the duplicate rows by doing the following:
duplicates = df[df.duplicated() == True]
print(duplicates.shape)
>> (40473, 15)
So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows but groupby doesn't.
My data consist of strings, integers, floats and nan.
Have I misunderstood something in the functions I mention above, or is something else happening?
Simply add reset_index() to realign the aggregates into a new DataFrame.
Additionally, size() creates a column labeled 0 which you can use to filter for duplicate rows. Then just take the length of the resulting DataFrame to get a count of duplicates, as with the other approaches: drop_duplicates() and duplicated()==True.
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()
size[size[0] > 1] # DATAFRAME OF DUPLICATES
len(size[size[0] > 1]) # NUMBER OF DUPLICATES
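The empty result in the question is most likely caused by NaN values in the grouping columns: groupby silently drops any group whose key contains NaN. Assuming pandas >= 1.1, passing dropna=False keeps those groups (a sketch with made-up data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2], 'b': [np.nan, np.nan, 'x']})

# default behaviour: the NaN-keyed group vanishes, so the duplicate
# pair of rows is never reported
size_default = df.groupby(df.columns.tolist()).size()
print(size_default[size_default > 1])   # empty Series

# dropna=False (pandas >= 1.1) keeps NaN keys, so the duplicate shows up
size_all = df.groupby(df.columns.tolist(), dropna=False).size()
print(size_all[size_all > 1])
```

This matches the question's symptom exactly: drop_duplicates() and duplicated() treat NaNs as equal, while groupby discards them by default.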