Finding diffs between two CSV files with primary column - python

I have two CSV files:
File 1
Id, 1st, 2nd
1, first, row
2, second, row
File 2
Id, 1st, 2nd
1, first, row
2, second, line
3, third, row
I am just starting in Python and need to write some code that can diff these files based on a primary column, in this case the first column, "Id". The output should be a delta file identifying the rows that have changed in the second file:
Output delta file
2, second, line
3, third, row

I suggest you load both CSV files as pandas DataFrames, then do an outer merge with indicator=True to flag which file each row came from. Then use query to keep only the rows unique to the second file, and drop the indicator column ('_merge').
import pandas as pd
df1 = pd.read_csv("FILENAME_1.csv")
df2 = pd.read_csv("FILENAME_2.csv")
merged = pd.merge(df1, df2, how="outer", indicator=True)
diff = merged.query("_merge == 'right_only'").drop("_merge", axis="columns")
For further details on finding differences in Pandas DataFrames, read this other question.
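Since the sample rows have a space after each comma, here is a runnable sketch with the data inlined; skipinitialspace=True (an addition of mine, not part of the answer above) strips those spaces so values compare cleanly:

```python
from io import StringIO
import pandas as pd

# the sample files from the question, inlined
data1 = "Id, 1st, 2nd\n1, first, row\n2, second, row\n"
data2 = "Id, 1st, 2nd\n1, first, row\n2, second, line\n3, third, row\n"

df1 = pd.read_csv(StringIO(data1), skipinitialspace=True)
df2 = pd.read_csv(StringIO(data2), skipinitialspace=True)

# rows tagged 'right_only' exist (in this form) only in the second file
merged = pd.merge(df1, df2, how="outer", indicator=True)
delta = merged.query("_merge == 'right_only'").drop("_merge", axis="columns")
print(delta.to_string(index=False))
```

This yields the two delta rows from the question: Id 2 (changed) and Id 3 (new).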

I'd also use pandas, as Enrico suggested, for anything more complex than your example. But if you want to do it in pure Python, you can convert your rows into sets and compute a set difference:
import csv
from io import StringIO
data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""
s1 = {tuple(row) for row in csv.reader(StringIO(data1))}
s2 = {tuple(row) for row in csv.reader(StringIO(data2))}
print(s2-s1)
{('2', ' second', ' line'), ('3', ' third', ' row')}
Note that in your example you are not actually diffing based on your primary column only, but on the entire row. If you really want to consider only the Id column, you can do:
d1 = {row[0]:row[1:] for row in csv.reader(StringIO(data1))}
d2 = {row[0]:row[1:] for row in csv.reader(StringIO(data2))}
diff = {k: d2[k] for k in set(d2) - set(d1)}
print(diff)
{'3': [' third', ' row']}
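The per-key dict can also catch changed rows, not just new Ids, by comparing the stored values for keys present in both files. A sketch reusing the same sample strings:

```python
import csv
from io import StringIO

# same sample strings as above
data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""

d1 = {row[0]: row[1:] for row in csv.reader(StringIO(data1))}
d2 = {row[0]: row[1:] for row in csv.reader(StringIO(data2))}

# keep keys that are new in file 2, plus keys whose values changed
diff = {k: v for k, v in d2.items() if k not in d1 or d1[k] != v}
print(diff)  # {'2': [' second', ' line'], '3': [' third', ' row']}
```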

Related

How to merge three arrays?

My dataset consists of three columns that I need to merge into one. For example, if 1, 2, and 3 are the first entries of each column, the first entry of the merged column should be 123. I attempted to solve this with the concatenate command, but it did not help. Here is my script:
tr = pd.read_csv("YMD.txt", sep='\t',header=None)
Y = tr[0]
M = tr[1]
D = tr[2]
np.concatenate((Y, M, D))
You don't need Pandas or Numpy to read a tab-delimited file and merge the first 3 columns into a new list:
ymd = []
with open("YMD.txt") as f:
    for row in f:
        row = row.strip().split("\t")
        ymd.append("".join(row[:3]))
print(ymd)
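If you'd rather stay with the pandas read you already have, element-wise string concatenation does the per-row merge; np.concatenate instead stacks the three columns end to end, which is why it didn't help. Note dtype=str (my addition) keeps any leading zeros that an integer parse would drop. A sketch with inlined stand-in data:

```python
from io import StringIO
import pandas as pd

# stand-in for pd.read_csv("YMD.txt", sep='\t', header=None) from the question
tr = pd.read_csv(StringIO("2017\t07\t31\n2018\t01\t02\n"),
                 sep="\t", header=None, dtype=str)

# element-wise concatenation: one merged value per row
ymd = tr[0] + tr[1] + tr[2]
print(ymd.tolist())  # ['20170731', '20180102']
```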

How to create new file from two other csv files?

I have two .csv files.
First:
col. names: 'student_id' and 'mark'
Second:
col. names: 'student_id','name','surname'
and I want to create a third .csv file with 'student_id', 'name', 'surname' for the rows where row['mark'] == 'five' or 'four'
good_student = []
for index, row in first_file.iterrows():
    if row['mark'] == 'five':
        good_student.append(row['studentId'])
    elif row['mark'] == 'four':
        good_student.append(row['studentId'])
for index, row in second_file.iterrows():
    for i in good_student:
        if row['studentId'] == i:
As the other user suggested, a dataframe is a robust way of handling your csv problem. First, I would read the two csv files into dataframes using the read_csv function. Then I would join the two based on student id. The result is a dataframe with columns student_id, mark, name, and surname. Any missing values will be NaN (which dataframe the join is called on matters for handling missing values). The joined dataframe is then filtered by the value in the mark cell.
import pandas as pd
df1 = pd.read_csv('one.csv') # student_id, mark
df2 = pd.read_csv('two.csv') # student_id, name, surname
df1 = df1.join(df2.set_index('student_id'), on='student_id')
df1 = df1.loc[(df1['mark'] == 'five') | (df1['mark'] == 'four')]
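To produce the third csv the question asks for, select just the three wanted columns after filtering. A self-contained sketch with made-up sample rows standing in for the two files:

```python
import pandas as pd

# hypothetical sample frames standing in for one.csv and two.csv
df1 = pd.DataFrame({"student_id": [1, 2, 3], "mark": ["five", "two", "four"]})
df2 = pd.DataFrame({"student_id": [1, 2, 3],
                    "name": ["Ann", "Bob", "Cid"],
                    "surname": ["Ames", "Burr", "Cole"]})

joined = df1.join(df2.set_index("student_id"), on="student_id")

# keep the good students and only the three requested columns
good = joined.loc[joined["mark"].isin(["five", "four"]),
                  ["student_id", "name", "surname"]]
good.to_csv("three.csv", index=False)
```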
You can just read both csv's as a dataframe and join them.
import pandas as pd
df_1 = pd.read_csv("csv_1")
df_2 = pd.read_csv("csv_2")
df_1 = df_1.join(df_2)
df_1.to_csv("new_csv")
The result will be a csv file with appended columns. If line 1 of csv_1 and line 1 of csv_2 refer to the same thing (person, object, ad_id...) then it can be used without problems.
Edit:
If both csvs index their rows by student_id, then the easiest way is to include that when loading the dataframes:
import pandas as pd
df_1 = pd.read_csv("csv_1", index_col = "student_id")
df_2 = pd.read_csv("csv_2", index_col = "student_id")
df_1 = df_1.join(df_2)
df_1.to_csv("new_csv")

How to find the first data column where data exists & remove unwanted rows, Python 3.6

I have 343 data frames in total, each with a different column structure. I want to find the text from the first row of the first occurring column.
Actual data in excel file:
Expected Results:
firstRowtext: Q60h. As I read each one, please tell me if
output df(with column name as column 1, 2, 3, 4, 5, 6, 7):
I believe you need to read each file twice - first for the first value and the number of rows to skip, and then again with the parameter skiprows:
import glob
import pandas as pd

files = glob.glob(r'data\*.xlsx')
for f in files:
    df = pd.read_excel(f, index_col=False)
    val = df.columns[0].split()[0]
    print(val)
    pos = df.iloc[:, 0].notnull().idxmax() + 1
    df = pd.read_excel(f, skiprows=pos, header=None).dropna(axis=1, how='all')
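The two lookups can be seen on a small stand-in frame (hypothetical data, mimicking the first read where the wanted text lands in the column name and the first column is empty until the data starts):

```python
import pandas as pd

# hypothetical stand-in for the first read_excel pass
df = pd.DataFrame({"Q60h. As I read each one": [None, None, "a", "b"],
                   "Unnamed: 1": [None, None, 1, 2]})

val = df.columns[0].split()[0]              # first word of the header text
pos = df.iloc[:, 0].notnull().idxmax() + 1  # rows to skip on the second read
print(val, pos)  # Q60h. 3
```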

Merging csv columns while checking ID of the first column

I have 4 csv files exported from an e-shop database. I need to merge them by columns, which I could maybe manage on my own, but the problem is matching the right columns.
First file:
"ep_ID","ep_titleCS","ep_titlePL".....
"601","Kancelářská židle šedá",NULL.....
...
Second file:
"pe_photoID","pe_productID","pe_sort"
"459","603","1"
...
Third file:
"epc_productID","epc_categoryID","epc_root"
"2155","72","1"
...
Fourth file:
"ph_ID","ph_titleCS"...
"379","5391132275.jpg"
...
I need to match the rows so that rows with the same "ep_ID" and "epc_productID" are merged together, and rows with the same "ph_ID" and "pe_photoID" too. I don't really know where to start; hopefully I wrote it understandably.
Update:
I am using :
files = ['produkty.csv', 'prirazenifotek.csv', 'pprirazenikategorii.csv', 'adresyfotek.csv']
dfs = []
for f in files:
    df = pd.read_csv(f, low_memory=False)
    dfs.append(df)
first_and_third = pd.merge(dfs[0], dfs[1], left_on="ep_ID", right_on="pe_photoID")
first_and_third.to_csv('new_filepath.csv', index=False)
Ok, this code works, but it does two things differently than I need:
When there is a row in file one with ID = 1, for example, and in file two there are 5 rows with bID = 1, it creates 5 rows in the final file. I would like one row that holds the values from every row with bID = 1 in file number two. Is that possible?
And it seems to be deleting some rows... not sure until I get rid of the "duplicates"...
You can use pandas's merge method to merge the csvs together. In your question you only provide keys between the 1st and 3rd files, and the 2nd and 4th files. Not sure if you want one giant table that has them all together -- if so you will need to find another intermediary key, maybe one you haven't listed(?).
import pandas as pd
files = ['path_to_first_file.csv', 'second_file.csv', 'third_file.csv', 'fourth_file.csv']
dfs = []
for f in files:
    df = pd.read_csv(f)
    dfs.append(df)
first_and_third = dfs[0].merge(dfs[2], left_on='ep_ID', right_on='epc_productID', how='left')
second_and_fourth = dfs[1].merge(dfs[3], left_on='pe_photoID', right_on='ph_ID', how='left')
If you want to save the dataframe back down to a file you can do so:
first_and_third.to_csv('new_filepath.csv', index=False)
index=False assumes you have no index set on the dataframe and don't want the dataframe's row numbers to be included in the final csv.
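The update above also asks how to collapse several matching rows in the second file into a single row per ID. One hedged option is to aggregate the photo IDs into a list with groupby before merging (the column names mirror the question, the data here is made up):

```python
import pandas as pd

# made-up stand-ins: one product row, several photo rows sharing its ID
products = pd.DataFrame({"ep_ID": [601],
                         "ep_titleCS": ["Kancelářská židle šedá"]})
photos = pd.DataFrame({"pe_photoID": [459, 460, 461],
                       "pe_productID": [601, 601, 601],
                       "pe_sort": [1, 2, 3]})

# collapse the photo rows to one row per product, gathering the IDs in a list
photos_per_product = (photos.groupby("pe_productID")["pe_photoID"]
                      .agg(list)
                      .reset_index())

merged = products.merge(photos_per_product,
                        left_on="ep_ID", right_on="pe_productID", how="left")
print(merged)
```

The product now appears once, with its photo IDs gathered into a single list cell instead of being repeated across five rows.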

Selecting DataFrame column names from a csv file

I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the names of each dataframe
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes: "Tin_MIX_Air", "Tout_Fan2b", etc. There are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variables start.
What I need to do is create a DataFrame from this .csv and use these names as the column names. I'm new to Python and I'm not very sure how to do it.
import pandas as pd
path = r'path-to-file.csv'
data=pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)
data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
data.drop(data.columns[[0, 1, 2, x-1, x-2, x-3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data; I'm dropping all the other columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with each of the names I showed in the first piece of code. As I said before, I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility of the CH# - "Name" combination changing.
Thank you very much for the help!
Comment: is it possible for it to work within the other "open" loop that I have?
Assume the column names sit in rows 3 up to 6, and the data runs from row 7 up to EOF.
For instance (untested code)
data = None
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 6:
            # the name is the second comma-separated field of the CH lines
            ch, name = line.split(',')[:2]
            columns.append(name)
        elif row > 6:
            row_data = [tuple(line.strip().split(','))]
            if data is None:
                data = pd.DataFrame(row_data, columns=columns)
            else:
                data = pd.concat([data, pd.DataFrame(row_data, columns=columns)], ignore_index=True)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
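Putting the idea together on a hypothetical miniature of the file (two channels instead of 16, and fewer preamble lines), a sketch that recovers the quoted signal names and attaches them as column names:

```python
import csv
from io import StringIO
import pandas as pd

# hypothetical miniature of the layout from the question: preamble lines,
# quoted channel names in column 2, a 'Data' marker, a header row, then values
raw = """Trigger time,2017-07-31,10:45:38
CH,Signal name,Input
CH1,"Tin_MIX_Air",TEMP
CH2,"Tout_Fan2b",TEMP
Data
Number,Time,CH1,CH2
1,10:45:38,+25.6,+26.2
2,10:45:39,+25.7,+26.3"""

lines = raw.splitlines()

# pull the quoted names from the CHn lines (skip the 'CH,Signal name' header)
names = []
for ln in lines:
    if ln.startswith("CH") and not ln.startswith("CH,"):
        names.append(next(csv.reader([ln]))[1])

# the measurements start two lines after the 'Data' marker
start = lines.index("Data") + 2
data = pd.read_csv(StringIO("\n".join(lines[start:])), header=None)
data = data.drop(columns=[0, 1])   # drop the row number and timestamp
data.columns = names               # attach the recovered signal names
data = data.apply(pd.to_numeric)   # make sure every column is numeric
print(data)
```

The same two-pass idea (collect names first, then load the data block) should carry over to the full 16-channel file by adjusting the preamble offsets.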
