I have two CSV files:
File 1
Id, 1st, 2nd
1, first, row
2, second, row
File 2
Id, 1st, 2nd
1, first, row
2, second, line
3, third, row
I am just starting out in Python and need to write code that diffs these files on their primary-key column, which in this case is the first column, "Id". The output should be a delta file containing the rows of the second file that are new or changed:
Output delta file
2, second, line
3, third, row
I suggest you load both CSV files as pandas DataFrames and then use an outer merge with indicator=True to find out which rows changed in the second file. Then use query to keep only the rows that appear only in the second file, and drop the indicator column ('_merge').
import pandas as pd
df1 = pd.read_csv("FILENAME_1.csv")
df2 = pd.read_csv("FILENAME_2.csv")
merged = pd.merge(df1, df2, how="outer", indicator=True)
diff = merged.query("_merge == 'right_only'").drop("_merge", axis="columns")
For further details on finding differences in Pandas DataFrames, read this other question.
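A self-contained sketch of the merge-with-indicator approach, run on the sample data from the question. The inline strings stand in for the two files (an assumption made for reproducibility), and skipinitialspace=True is added because the sample has a space after each comma.

```python
import io
import pandas as pd

# Inline stand-ins for FILENAME_1.csv and FILENAME_2.csv (assumption).
file1 = io.StringIO("Id, 1st, 2nd\n1, first, row\n2, second, row\n")
file2 = io.StringIO("Id, 1st, 2nd\n1, first, row\n2, second, line\n3, third, row\n")

df1 = pd.read_csv(file1, skipinitialspace=True)
df2 = pd.read_csv(file2, skipinitialspace=True)

# With no `on=` argument, merge joins on all shared columns, so a row is
# "right_only" when the complete row is absent from the first file.
merged = pd.merge(df1, df2, how="outer", indicator=True)
delta = merged[merged["_merge"] == "right_only"].drop(columns="_merge")

# Plain tuples, sorted so the result does not depend on merge ordering.
delta_rows = sorted(delta.itertuples(index=False, name=None))
print(delta_rows)
```

Because the join uses every column, this is a whole-row diff rather than a diff on "Id" alone, which happens to be what the expected output in the question asks for.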
I'd also use pandas, as Enrico suggested, for anything more complex than your example. But if you want to do it in pure Python, you can convert your rows into sets and compute a set difference:
import csv
from io import StringIO
data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""
s1 = {tuple(row) for row in csv.reader(StringIO(data1))}
s2 = {tuple(row) for row in csv.reader(StringIO(data2))}
print(s2-s1)
{('2', ' second', ' line'), ('3', ' third', ' row')}
Note that in your example you are not actually diffing based on your primary column only, but on the entire row. If you really want to only consider the Id column, you can do:
d1 = {row[0]:row[1:] for row in csv.reader(StringIO(data1))}
d2 = {row[0]:row[1:] for row in csv.reader(StringIO(data2))}
diff = { k : d2[k] for k in set(d2) - set(d1)}
print(diff)
{'3': [' third', ' row']}
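If you want to key on the Id column but still detect rows whose other fields changed, the two ideas above can be combined. The following is only a sketch: the load helper and the added/changed names are made up for illustration, and skipinitialspace=True is an addition to strip the space after each comma.

```python
import csv
from io import StringIO

data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""

def load(text):
    """Return {Id: remaining fields} for one CSV text (hypothetical helper)."""
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    return {row[0]: row[1:] for row in reader}

old, new = load(data1), load(data2)

# Ids that exist only in the second file.
added = {k: v for k, v in new.items() if k not in old}
# Ids present in both files whose other fields differ.
changed = {k: v for k, v in new.items() if k in old and old[k] != v}

print(added)    # {'3': ['third', 'row']}
print(changed)  # {'2': ['second', 'line']}
```

The union of added and changed reproduces the delta file from the question.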
I'm struggling with two CSV files that I have imported.
The CSV files look like this:
csv1
planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, black
...
csv2
planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, blue
...
Both CSV files contain the same planets, but in a different order and sometimes with different values (a mismatch).
The data for each planet (diameter, discovered and color) has been entered independently. I want to cross-check the two sheets and find all the fields that are mismatched. Then I want to generate a new file that contains one line per error with a description of the error.
for example:
sceptri: mismatch (black/blue)
Here is my code so far:
import csv

with open('planets1.csv') as csvfile:
    a = csv.reader(csvfile, delimiter=',')
    data_a = list(a)
    # list(a) exhausts the reader, so loop over the stored rows instead
    for row in data_a:
        print(row)
with open('planets2.csv') as csvfile:
    b = csv.reader(csvfile, delimiter=',')
    data_b = list(b)
    for row in data_b:
        print(row)
print(data_a)
print(data_b)
c = [data_a]
d = [data_b]
thank you in advance for your help!
Assuming the planet names are correct in both files, here is my proposal:
# Working with lists of lists, as you would get from reading the CSV files:
csv1 = [["sceptri", 33.41685587, "28-11-1611 05:15", "black"],
        ["foo", 35.41685587, "29-11-1611 05:15", "black"],
        ["bar", 38.7, "29-11-1611 05:15", "black"]]
csv2 = [["foo", 35.41685587, "29-11-1611 05:15", "black"],
        ["bar", 38.17, "29-11-1611 05:15", "black"],
        ["sceptri", 33.41685587, "28-11-1611 05:15", "blue"]]
# A list to contain the errors:
new_file = []
# A dict to check whether a planet has already been processed:
a_dict = {}
# Read all the planet data:
for planet in csv1 + csv2:
    # Check whether the planet is already a key in a_dict:
    if planet[0] in a_dict:
        # It is, so check for discrepancies.
        if a_dict[planet[0]] != planet[1:]:
            # Some values differ.
            # Put both sets of values in Python sets and take the symmetric difference:
            error = set(planet[1:]) ^ set(a_dict[planet[0]])
            # Append [planet_name, diff_param1, diff_param2] to new_file:
            new_file.append([planet[0]] + list(error))
    else:
        # The planet name becomes a dict key; the other parameters are its value:
        a_dict[planet[0]] = planet[1:]
print(new_file)
# [['bar', 38.17, 38.7], ['sceptri', 'black', 'blue']]
# (sets are unordered, so the order of the values within each entry may vary)
The list new_file can then be saved to a new file; see Writing a list to file.
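As a rough sketch of that last step, the csv module can write one line per mismatch. The write_mismatches helper and the "mismatch" label are made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Result of the comparison above, hard-coded here so the sketch runs on its own.
new_file = [["bar", 38.17, 38.7], ["sceptri", "black", "blue"]]

def write_mismatches(rows, out):
    """Write one 'name,mismatch,value1/value2' line per entry (hypothetical helper)."""
    writer = csv.writer(out, lineterminator="\n")
    for name, *values in rows:
        writer.writerow([name, "mismatch", "/".join(str(v) for v in values)])

buffer = io.StringIO()  # stand-in for a real file
write_mismatches(new_file, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open('mismatches.csv', 'w', newline='') in place of the buffer.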
I'd suggest using Pandas for a task like this.
Firstly, you'll need to read the csv contents into dataframe objects. This can be done as follows:
import pandas as pd
# make a dataframe from each csv file
df1 = pd.read_csv('planets1.csv')
df2 = pd.read_csv('planets2.csv')
You may want to declare names for each column if your CSV file doesn't have them.
colnames = ['col1', 'col2', ..., 'coln']
df1 = pd.read_csv('planets1.csv', names=colnames, index_col=0)
df2 = pd.read_csv('planets2.csv', names=colnames, index_col=0)
# use index_col=0 if csv already has an index column
For the sake of reproducible code, I will define dataframe objects without a csv below:
import pandas as pd
# example column names
colnames = ['A','B','C']
# example dataframes
df1 = pd.DataFrame([[0,3,6], [4,5,6], [3,2,5]], columns=colnames)
df2 = pd.DataFrame([[1,3,1], [4,3,6], [3,6,5]], columns=colnames)
Note that df1 looks like this:
     A  B  C
---------------
0    0  3  6
1    4  5  6
2    3  2  5
And df2 looks like this:
     A  B  C
---------------
0    1  3  1
1    4  3  6
2    3  6  5
The following code compares the dataframes, concatenates the comparison results into a new dataframe, and then saves the result to a CSV:
# define the condition you want to check for (i.e., mismatches)
mask = (df1 != df2)
# df1[mask], df2[mask] will replace matched values with NaN (Not a Number), and leave mismatches
# dropna(how='all') will remove rows filled entirely with NaNs
errors_1 = df1[mask].dropna(how='all')
errors_2 = df2[mask].dropna(how='all')
# add labels to column names
errors_1.columns += '_1' # for planets 1
errors_2.columns += '_2' # for planets 2
# you can now combine horizontally into one big dataframe
errors = pd.concat([errors_1,errors_2],axis=1)
# if you want, reorder the columns of `errors` so compared columns are next to each other
errors = errors.reindex(sorted(errors.columns), axis=1)
# if you don't like the clutter of NaN values, you can replace them with fillna()
errors = errors.fillna('_')
# save to a csv
errors.to_csv('mismatches.csv')
The final result looks something like this:
     A_1  A_2  B_1  B_2  C_1  C_2
-----------------------------------
0    0    1    _    _    6    1
1    _    _    5    3    _    _
2    _    _    2    6    _    _
Hope this helps.
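One caveat: df1 != df2 compares the frames position by position, while the question says the planets appear in a different order in each file. A possible adjustment, sketched below, is to index both frames by planet name and look values up by label. The inline data is a shortened, made-up stand-in for the two files, and the sketch assumes both files list the same planets with unique names.

```python
import io
import pandas as pd

# Shortened stand-ins for planets1.csv and planets2.csv (assumption).
csv1 = io.StringIO("planet,diameter,color\nsceptri,33.4,black\nfoo,35.4,black\n")
csv2 = io.StringIO("planet,diameter,color\nfoo,35.4,black\nsceptri,33.4,blue\n")

# Index by planet name so lookups go by label, not by row position.
df1 = pd.read_csv(csv1).set_index("planet")
df2 = pd.read_csv(csv2).set_index("planet")

messages = []
for planet in df1.index:
    for col in df1.columns:
        a, b = df1.at[planet, col], df2.at[planet, col]
        if a != b:
            messages.append(f"{planet}: mismatch in {col} ({a}/{b})")

print(messages)  # ['sceptri: mismatch in color (black/blue)']
```

The explicit loop is slower than a vectorised mask, but it yields one message per mismatched field, which is the output format the question asks for.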
This kind of problem can be solved by sorting the rows from the csv files, and then comparing the corresponding rows to see if there are differences.
This approach uses a functional style to perform the comparisons and will compare any number of csv files.
It assumes that the csvs contain the same number of records, and that the columns are in the same order.
import contextlib
import csv

def compare_files(readers):
    # Consume the header row of every file; keep the first file's column names.
    colnames = [next(reader) for reader in readers][0]
    sorted_readers = [sorted(r) for r in readers]
    for rows in zip(*sorted_readers):
        yield from compare_rows(colnames, rows)

def compare_rows(colnames, rows):
    col_iter = zip(*rows)
    # Be sure we're comparing the same planets.
    planets = set(next(col_iter))
    assert len(planets) == 1, planets
    planet = planets.pop()
    # The planet column has been consumed, so skip its name as well.
    for colname, vals in zip(colnames[1:], col_iter):
        if len(set(vals)) > 1:
            yield f"{planet} mismatch {colname} ({'/'.join(vals)})"

def main(outfile, *infiles):
    with contextlib.ExitStack() as stack:
        csvs = [stack.enter_context(open(fname)) for fname in infiles]
        readers = [csv.reader(f) for f in csvs]
        with open(outfile, 'w') as out:
            for result in compare_files(readers):
                out.write(result + '\n')

if __name__ == "__main__":
    main('mismatches.txt', 'planets1.csv', 'planets2.csv')
I have a csv file which has three columns (A, B, and C) and their values are like the below figure:
CSV Table
1,2,4
1,257,5
1,258,6
1,8,7
1,260,8
2,24,9
2,26,10
2,234,11
3,14,12
3,22,13
3,78,14
I want to join the values in column B with "-" for rows that share the same value in column A. So the expected output is:
["2-257-258-8-260", "24-26-234", "14-22-78"]
Can anyone help me get these results?
Thanks in advance.
Here's a plain Python solution.
We use a csv reader to read the data. In my code I read from a list of lines named file_data, but you can change file_data to an open file object.
We store the data into a dictionary of lists, with the column A value as the key, and the column B values being appended to a list.
We then loop over the keys in order, joining the B data into strings of the desired form.
import csv
from collections import defaultdict
file_data = '''\
1,2,4
1,257,5
1,258,6
1,8,7
1,260,8
2,24,9
2,26,10
2,234,11
3,14,12
3,22,13
3,78,14
'''.splitlines()
reader = csv.reader(file_data)
data = defaultdict(list)
for a, b, c in reader:
    #print(a, b, c)
    data[a].append(b)
out = ['-'.join(data[k]) for k in sorted(data.keys())]
print(out)
output
['2-257-258-8-260', '24-26-234', '14-22-78']
If your dataset is in the format:
A,B,C
1,2,4
1,257,5
1,258,6
1,8,7
1,260,8
2,24,9
2,26,10
2,234,11
3,14,12
3,22,13
3,78,14
You could use itertools.groupby() to group items from the A column, and join the elements from the B column:
from csv import reader
from itertools import groupby
from operator import itemgetter
with open('data.csv') as in_file:
    csv_reader = reader(in_file)
    # skip headers
    next(csv_reader)
    # sort data by A column, then C column; the fields are strings, so
    # convert to int first (as strings, '10' would sort before '9')
    sorted_data = sorted(csv_reader, key=lambda row: (int(row[0]), int(row[2])))
    # group by A column, and join by B column
    grouped = ['-'.join(map(itemgetter(1), g)) for _, g in groupby(sorted_data, key=itemgetter(0))]
    print(grouped)
Which Outputs:
['2-257-258-8-260', '24-26-234', '14-22-78']
Note: This solution sorts before it groups, just in case the data is not already sorted primarily on column A, then secondarily on column C.
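To illustrate why the sort matters: itertools.groupby only merges consecutive items with equal keys, so unsorted input produces the same key more than once. A tiny sketch with made-up rows:

```python
from itertools import groupby
from operator import itemgetter

# Made-up (A, B) rows in which column A is not sorted.
rows = [("1", "2"), ("2", "24"), ("1", "257")]

def group(pairs):
    """Collect the B values of each consecutive run of equal A values."""
    return [(key, [b for _, b in grp]) for key, grp in groupby(pairs, key=itemgetter(0))]

unsorted_groups = group(rows)
sorted_groups = group(sorted(rows, key=itemgetter(0)))

print(unsorted_groups)  # [('1', ['2']), ('2', ['24']), ('1', ['257'])]
print(sorted_groups)    # [('1', ['2', '257']), ('2', ['24'])]
```

Python's sort is stable, so rows with the same A value keep their original relative order after sorting.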
Pandas solution
Use the pandas groupby function, then apply a lambda that converts each group's values to strings and joins them with '-':
import pandas as pd
df = pd.DataFrame({'A':[1,1,1,2,2,3,3], 'B': [124,456,465,46,35,53,33]})
print(df.groupby('A')['B'].apply(lambda x: '-'.join([str(i) for i in x.values])).tolist())
Output:
['124-456-465', '46-35', '53-33']
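Applied to the data from the question, the same idea might look like the sketch below. The inline string stands in for the CSV file, and the column names A, B and C are supplied by hand on the assumption that the file has no header row.

```python
import io
import pandas as pd

# Inline stand-in for the headerless CSV file from the question (assumption).
csv_text = (
    "1,2,4\n1,257,5\n1,258,6\n1,8,7\n1,260,8\n"
    "2,24,9\n2,26,10\n2,234,11\n"
    "3,14,12\n3,22,13\n3,78,14\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=None, names=["A", "B", "C"])

# groupby keeps the original row order within each group.
joined = df.groupby("A")["B"].agg(lambda s: "-".join(s.astype(str))).tolist()
print(joined)  # ['2-257-258-8-260', '24-26-234', '14-22-78']
```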
I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file, in the preceding rows. Usually I drop all the unnecessary rows to create the DataFrame and then hardcode the names of the columns.
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names appear in double quotes ("Tin_MIX_Air", "Tout_Fan2b", etc.). There are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variable start.
What I need to do is create a DataFrame from this .csv and use these names as the column names. I'm new to Python and not sure how to do it.
import pandas as pd
path = r'path-to-file.csv'
data=pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)
data.drop(data.index[range(0,29)],inplace=True)
x=len(data.iloc[0])
data.drop(data.columns[[0,1,2,x-1,x-2,x-3]],axis=1,inplace=True)
data.reset_index(drop=True,inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get a DataFrame with the useful data. I drop all the columns that aren't useful to me and keep only the values. The last three lines reset the row and column indexes and convert the whole DataFrame to floats. What I would like is to name the columns using the names shown in the first excerpt. As I said before, at the moment I do this manually:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since the CH# to "Name" mapping may change.
Thank you very much for the help!
Comment: possible for it to work within the other "OPEN " loop that I have?
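Assume the column names are in rows 3 to 6 and the data runs from row 7 to EOF.
For instance (untested code)
import pandas as pd

columns = []  # signal names taken from the header block
rows = []     # data rows
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        fields = line.strip().split(',')
        if 2 < row <= 6:
            # fields[0] is the channel (CH1, ...), fields[1] the quoted signal name
            columns.append(fields[1].strip().strip('"'))
        elif row > 6:
            rows.append(fields)
# pd.DataFrame has no ignore_index argument, so build the frame once from the
# collected rows; each data row must have exactly len(columns) fields.
data = pd.DataFrame(rows, columns=columns)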
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
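Putting the pieces together, one possible approach is to collect the CH# to name mapping from the header block and then let pandas read the data table, renaming its CH columns afterwards. This is only a sketch: the inline sample is a shortened, hypothetical stand-in for the logger file, and it assumes the table header is the line starting with "Number" and is followed by a single units row.

```python
import io
import re
import pandas as pd

# Hypothetical, shortened stand-in for the logger file shown in the question.
sample = '''Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,AlarmOut
NO.,Time,ms,degC,degC,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,LLLL
2,2017-07-31 10:45:39,000,+25.7,+26.1,LLLL
'''

names = {}         # maps "CH1" -> "Tin_MIX_Air", ...
header_idx = None  # index of the line that starts the data table
for i, line in enumerate(sample.splitlines()):
    fields = [f.strip().strip('"') for f in line.split(",")]
    if re.fullmatch(r"CH\d+", fields[0]):
        names[fields[0]] = fields[1]
    elif fields[0] == "Number":
        header_idx = i
        break

# Skip everything before the table header, plus the units row right after it.
skip = list(range(header_idx)) + [header_idx + 1]
df = pd.read_csv(io.StringIO(sample), skiprows=skip)
df = df.rename(columns=names)

# Keep only the named signal columns and make sure they are numeric.
values = df[list(names.values())].apply(pd.to_numeric)
print(values)
```

With a real file, the same loop can run over the open file handle, and the path can be passed to pd.read_csv with the same skiprows list.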
I have a CSV file with 2 columns (x,y) and 5653 rows, formatted like this:
0,0
1,0
2,0
3,0
4,0
5,0
....
102,0
102,1
101,1
....
0,1
0,2
1,2
....
Now I want to add a third column taken from another CSV file that contains measured values (e.g. -89); these are mean values.
That file also has 5653 rows, and the values are in its first column.
How can I read the first file, read the second file, and combine them like this:
0,0,-89
1,0,-89
2,0,-89
3,0,-89
4,0,-90
5,0,-90
6,0,-89
7,0,-89
8,0,-89
9,0,-89
So I want the same values, just combined in one CSV file.
You could use the pandas library, which is built to work with tabular data.
A typical workflow:
import pandas as pd
# header=None because the files have no header row
df1 = pd.read_csv("your_path", header=None)  # df is shorthand for DataFrame, pandas' tabular data type
df2 = pd.read_csv("csv2", header=None, usecols=[0])  # only the first column of the second file
# concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
result = pd.concat([df1, df2], axis=1)  # axis=1 puts the frames side by side, i.e. adds columns
result.to_csv("path", header=False, index=False)
You can use the csv module, which, unlike pandas, does not require you to install any third-party libraries. You can just zip the two readers:
import csv
with open('in1.csv', newline='') as fin1:
    with open('in2.csv', newline='') as fin2:
        with open('out.csv', 'w', newline='') as fout:
            r1 = csv.reader(fin1)  # iterator over lists of strings
            r2 = csv.reader(fin2)
            w = csv.writer(fout)
            for row1, row2 in zip(r1, r2):
                w.writerow(row1 + row2[:1])  # row from 1 + first column from 2
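Note that zip stops silently at the shorter input, so a missing row in either file would go unnoticed. A possible variant that checks the row counts is sketched below; the append_first_column helper is made up for illustration, and in-memory buffers stand in for the three files.

```python
import csv
import io
from itertools import zip_longest

def append_first_column(fin1, fin2, fout):
    """Write each row of fin1 followed by the first field of the matching row of fin2."""
    writer = csv.writer(fout, lineterminator="\n")
    for row1, row2 in zip_longest(csv.reader(fin1), csv.reader(fin2)):
        # zip_longest pads the shorter input with None, which exposes a length mismatch.
        if row1 is None or row2 is None:
            raise ValueError("input files have different numbers of rows")
        writer.writerow(row1 + row2[:1])

# In-memory stand-ins for in1.csv, in2.csv and out.csv (assumption).
out = io.StringIO()
append_first_column(io.StringIO("0,0\n1,0\n"), io.StringIO("-89,x\n-90,y\n"), out)
print(out.getvalue())
```

With real files, open the inputs with newline='' and the output with 'w' and newline='', as in the answer above.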