Trying to compare two csv files and write differences as output - python

I'm developing a script that takes the difference between 2 CSV files and writes a new CSV file containing the differences, but only where the same row number in the two input files contains different data. For example, row 3 of file 1 has "mike", "basketball player" and row 3 of file 2 has "mike", "baseball player"; the output CSV should grab these rows, print them, and write them out. It works, but there are some issues. (I know this question has been asked several times before, but others have done it differently to me, and since I'm fairly new to programming I don't quite understand their code.)
The output in the new CSV file has each letter of the output in its own cell, and I believe it's something to do with the delimiter/quotechar/quoting arguments on the csv.writer line. I want the values in their own cells without any full stops, multiple spaces, commas or "|".
Another issue is that it takes a long time to run: I'm working with datasets of up to 50,000 rows and it can take over an hour. Why is this, and what advice would be useful to speed it up? Should something move outside of the for loop, maybe? I did try the difflib method earlier on, but I was only able to print the entire input_file1, not compare it with another file.
# aim of script is to compare csv files and output difference as a new csv
# import necessary libraries
import csv
# File1 = open(raw_input("path:"),"r") #filename, mode
# File2 = open(raw_input("path:"),"r") #filename, mode
# selects the 2 input files to be compared
input_file1 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book1.csv"
input_file2 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book2.csv"
# creates the blank output csv file
output_path = "G:/savestuffhereqwerty/electorate_meshblocks/outputs/output2.csv"
a = open(input_file1, "r")
output_file = open(output_path, "w")
output_file.close()
count = 0
with open(input_file1) as fp1:
    for row_number1, row_value1 in enumerate(fp1):
        if row_number1 == count:
            print "got to 1st point"
            value1 = row_value1
        with open(input_file2) as fp2:
            for row_number2, row_value2 in enumerate(fp2):
                if row_number2 == count:
                    print "got to 2nd point"
                    value2 = row_value2
                    if value1 == value2:
                        print value1, value2
                    else:
                        print value1, value2
                        with open(output_path, 'wb') as f:
                            writer = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                            # testing to see if the code writes text to the csv
                            writer.writerow(["test1"])
                            writer.writerow(["test2", "test3", "test4"])
                            writer.writerows([value1, value2])
                            print "code reached writing stage"
        count += 1
        print count
print "done"
# replace(",",".")

Since you want to compare the two files line by line, you should not loop through the second file for every line of the first. You can simply zip two csv readers and filter the rows:
import csv

input_file1 = "foo"
input_file2 = "bar"
output_path = "baz"

with open(input_file1) as fin1, open(input_file2) as fin2:
    read1 = csv.reader(fin1)
    read2 = csv.reader(fin2)
    diff_rows = (row1 for row1, row2 in zip(read1, read2) if row1 != row2)
    with open(output_path, 'w') as fout:
        writer = csv.writer(fout)
        writer.writerows(diff_rows)
This solution assumes that the two files have the same number of lines.
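If the files may have different lengths, itertools.zip_longest pads the shorter reader with None, so unmatched trailing rows also count as differences. A minimal sketch (the function name is mine, and I assume unmatched rows should be written from whichever file has them):

```python
import csv
from itertools import zip_longest

def diff_rows(path1, path2, out_path):
    # Rows present in only one file have no counterpart (None) and are
    # therefore written out as differences too.
    with open(path1, newline='') as f1, open(path2, newline='') as f2, \
         open(out_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row1, row2 in zip_longest(csv.reader(f1), csv.reader(f2)):
            if row1 != row2:
                writer.writerow(row1 if row1 is not None else row2)
```

Note that writerow expects a list of cell values, not a string; passing a string is exactly what produces one letter per cell.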

How to add another loop to a Python nested loop?

I am new to Python and am having a problem adding a loop to a nested-loop Python script (using Python 3.8 on my Windows 7 machine).
When run once, the code reads from multiple CSV files, row by row and file by file, and uses the data from each row (within a given range) to run the function until no CSV file is left. Each CSV file has 4 columns and one header row. There is a delay of a few seconds between each row read.
Since the code is just for one-time use, when you run it again it reads the same rows; it does not move on to the other rows. So I want to add another loop so that each time you run the file it somehow remembers the last row that was used and starts from the next row.
So assume it has been set to a range of 2 rows:
first run: uses rows 1 and 2 to run the function
second run: uses rows 3 and 4 to run the function, and so on.
I'd appreciate your help in making this work.
Example CSV
img_url;title_1;desc_1;link_1
site.com/image22.jpg;someTitle;description1;site1.com
site.com/image32.jpg;someTitle;description2;site2.com
site.com/image44.jpg;someTitle;description3;site3.com
Here is the working code I have:
import time
from abc.zzz import xyz

path_id_map = [
    {'path': 'file1.csv', 'id': '12345678'},
    {'path': 'file2.csv', 'id': '44556677'},
    {'path': 'file3.csv', 'id': '33377799'},
    {'path': 'file4.csv', 'id': '66221144'}]
s_id = None

for pair in path_id_map:
    with open(pair['path'], 'r') as f:
        next(f)  # skip first header line
        for _ in range(1, 3):
            line = next(f)
            img_url, title_1, desc_1, link_1 = map(str.strip, line.split(';'))
            zzz.func1(img_url=img_url, title_1=title_1, desc_1=desc_1,
                      link_1=link_1, B_id=pair['id'], s_id=s_id)
            time.sleep(25)
**** Update ****
After a few days of looking for a solution, code was posted (UPDATE 2 below), but there is a major problem with it: it works the way I want only when using the print function. I adapted my function to it, but when it runs a second time or more it does not move on to the next rows (it only loops correctly on the last CSV file, though). The author of the code could not correct it, and I cannot figure out what is wrong with it. I checked the CSV files and tested them with the print function; they are OK. Perhaps someone can help correct the problem, or offer another solution altogether.
Hi, I hope I have understood what you're asking. I think the code below might guide you if you adjust it a little for your case: you can store the number of the last line read into a text file. I also assume the delimiter is a semicolon.
UPDATE 1:
Okay, I think I came up with a solution to your problem. The only prerequisite is a text file containing the number of the row you want to begin with on the first run (e.g. 1).
import csv
import time
import subprocess
import os
import itertools

# txt file that contains the number of line to start the next time
dir_txt = './'
fname_txt = 'number_of_last_line.txt'
path = os.path.join(dir_txt, fname_txt)

# assign line number to variable after reading relevant txt
with open(path, 'r', newline='') as f:
    n = int(f.read())

# define path of csv file
fpath = './file1.csv'

# open csv file
with open(fpath, 'r', newline='') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=';')
    # Iterate every row of csv. csv_reader.line_num starts from 1,
    # the csv_reader generator starts from 0
    for row in itertools.islice(csv_reader, n, n+3):
        print('row {0} contains {1}'.format(csv_reader.line_num, row))
        time.sleep(3)
    # Store the number of line to start the next time.
    # Bash (or cmd) command execution; optional, you can do this with python also
    sh_command = 'echo {0} > {1}'.format(csv_reader.line_num, path)
    subprocess.run(sh_command, shell=True)
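As the comment above notes, the shell echo can also be done in plain Python, which avoids the subprocess call and is portable across Windows and Unix. A small sketch (helper names are mine):

```python
def save_last_line(path, line_num):
    # Overwrite the tracking file with the line number to resume from.
    with open(path, 'w') as f:
        f.write(str(line_num))

def load_last_line(path):
    # Read the stored line number back as an int.
    with open(path, 'r') as f:
        return int(f.read())
```

With these, `save_last_line(path, csv_reader.line_num)` replaces the echo command, and the `n = int(f.read())` block becomes `n = load_last_line(path)`.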
UPDATE 2:
Here's a revision with the code working for multiple files, using input from #Error - Syntactical Remorse. The first thing you need to do is open the metadata.json file and insert the number of the row where you want each file to begin, for the first run only. You also need to change the file directories according to your situation.
# Imports
import csv, json
import time
import os
import itertools

# define function
def get_json_metadata(json_fpath):
    """Read json file
    Args:
        json_fpath -- string (filepath)
    Returns:
        json_list -- list"""
    with open(json_fpath, mode='r') as json_file:
        json_str = json_file.read()
        json_list = json.loads(json_str)
    return json_list

# json file that contains the number of line to start the next time
dir_json = './'
fname_json = 'metadata.json'
json_fpath = os.path.join(dir_json, fname_json)

# csv filenames, IDs and number of row to start reading are extracted
path_id_map = get_json_metadata(json_fpath)

# iterate over csvfiles
for nfile in path_id_map:
    print('\n------ Reading {} ------\n'.format(nfile['path']))
    with open(nfile['path'], 'r', newline='') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=';')
        # Iterate every row of csv. csv_reader.line_num starts from 1,
        # the csv_reader generator starts from 0
        for row in itertools.islice(csv_reader, nfile['nrow'], nfile['nrow']+5):
            # skip empty line (list)
            if not row:
                continue
            # assign values to variables
            img_url, title_1, desc_1, link_1 = row
            B_id = nfile['id']
            print('row {0} contains {1}'.format(csv_reader.line_num, row))
            time.sleep(3)
        # Store the number of line to start the next time
        nfile['nrow'] = csv_reader.line_num

with open(json_fpath, mode='w') as json_file:
    json_str = json.dumps(path_id_map, indent=4)
    json_file.write(json_str)
This is how the metadata.json format should be:
[
    {
        "path": "file1.csv",
        "id": "12345678",
        "nrow": 1
    },
    {
        "path": "file2.csv",
        "id": "44556677",
        "nrow": 1
    },
    {
        "path": "file3.csv",
        "id": "33377799",
        "nrow": 1
    },
    {
        "path": "file4.csv",
        "id": "66221144",
        "nrow": 1
    }
]
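For the first run you need this file to exist. A small sketch that creates it with json.dump (the filenames and ids are the question's placeholders; every file starts at row 1, i.e. the first data row after the header):

```python
import json

def write_metadata(json_fpath, entries):
    # Create (or reset) metadata.json with pretty-printed entries,
    # matching the format the main script reads and rewrites.
    with open(json_fpath, 'w') as f:
        json.dump(entries, f, indent=4)

initial = [
    {"path": "file1.csv", "id": "12345678", "nrow": 1},
    {"path": "file2.csv", "id": "44556677", "nrow": 1},
    {"path": "file3.csv", "id": "33377799", "nrow": 1},
    {"path": "file4.csv", "id": "66221144", "nrow": 1},
]
```

Calling `write_metadata('./metadata.json', initial)` once resets every file back to its first data row.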

How to read a csv file and create a new csv file after every nth number of rows?

I'm trying to write a function that reads an existing .csv file and copies every 20 rows to a newly created CSV file. Therefore, it needs a file counter ("file_01, file_02, file_03, ..."), where the first 20 rows are copied to file_01, the next 20 to file_02.csv, and so on.
Currently I have this code, which hasn't worked for me so far.
import csv
import os.path
from itertools import islice

N = 20
new_filename = ""
filename = ""
with open(filename, "rb") as file: # the a opens it in append mode
    reader = csv.reader(file)
    for i in range(N):
        line = next(file).strip()
        #print(line)
with open(new_filename, 'wb') as outfh:
    writer = csv.writer(outfh)
    writer.writerow(line)
    writer.writerows(islice(reader, 2))
I have attached a file for testing.
https://1drv.ms/u/s!AhdJmaLEPcR8htYqFooEoYUwDzdZbg
32.01,18.42,58.98,33.02,55.37,63.25,12.82,-32.42,33.99,179.53,
41.11,33.94,67.85,57.61,59.23,94.69,19.43,-19.15,21.71,-161.13,
49.80,54.12,72.78,100.74,56.97,128.84,26.95,-6.76,10.07,-142.62,
55.49,81.02,68.93,148.17,49.25,157.32,34.94,5.39,0.44,-123.32,
56.01,112.81,59.27,177.87,38.50,179.63,43.43,18.42,-5.81,-102.24,
50.79,142.87,48.06,-162.32,26.60,-161.21,52.38,34.37,-7.42,-79.64,
41.54,167.36,37.12,-145.93,15.01,-142.84,60.90,57.05,-4.47,-56.54,
30.28,-172.09,27.36,-130.24,5.11,-123.66,66.24,91.12,-0.76,-35.44,
18.64,-153.20,19.52,-114.09,-1.54,-102.96,64.77,131.32,5.12,-21.68,
7.92,-134.07,14.24,-96.93,-3.79,-80.91,57.10,162.35,12.51,-9.21,
-0.34,-113.74,11.80,-78.73,-2.49,-58.46,46.75,-175.86,20.81,2.87,
-4.81,-91.85,11.78,-60.28,0.59,-39.26,35.75,-158.12,29.79,15.71,
-4.76,-68.67,13.79,-43.84,6.82,-24.69,25.27,-141.56,39.05,30.71,
-1.33,-46.42,18.44,-30.23,14.53,-11.95,16.21,-124.45,47.91,50.25,
4.14,-29.61,24.89,-18.02,23.01,0.10,9.59,-106.05,54.46,77.07,
11.04,-15.39,32.33,-6.66,31.92,12.48,6.24,-86.34,55.72,110.53,
18.69,-2.32,40.46,4.57,41.11,26.87,6.07,-65.68,50.25,142.78,
26.94,10.56,49.18,16.67,49.92,45.39,8.06,-46.86,40.13,168.29,
35.80,24.58,58.45,31.99,56.83,70.92,12.96,-31.90,28.10,-171.07,
44.90,41.72,67.41,55.89,59.21,103.94,19.63,-18.67,15.97,-152.40,
-5.41,-77.62,11.40,-63.21,4.80,-29.06,31.33,-151.44,43.00,37.25,
-2.88,-54.38,13.08,-46.00,12.16,-15.86,21.21,-134.62,51.25,59.16,
1.69,-35.73,17.44,-32.01,20.37,-3.78,13.06,-117.10,56.18,88.98,
8.15,-20.80,23.70,-19.66,29.11,8.29,7.74,-98.22,54.91,123.30,
15.52,-7.45,31.04,-8.22,38.22,21.78,5.76,-77.99,47.34,153.31,
23.53,5.38,39.07,2.98,47.29,38.71,6.58,-57.45,36.18,176.74,
32.16,18.76,47.71,14.88,55.08,61.71,9.76,-40.52,23.99,-163.75,
41.27,34.36,56.93,29.53,59.23,92.75,15.53,-26.40,12.16,-145.27,
49.92,54.65,66.04,51.59,57.34,126.97,22.59,-13.65,2.14,-126.20,
55.50,81.56,72.21,90.19,49.88,155.84,30.32,-1.48,-4.71,-105.49,
55.92,113.45,70.26,139.40,39.23,178.48,38.55,10.92,-7.09,-83.11,
50.58,143.40,61.40,172.50,27.38,-162.27,47.25,24.86,-4.77,-60.15,
41.30,167.74,50.34,-166.33,15.74,-143.93,56.21,43.14,-0.54,-38.22,
30.03,-171.78,39.24,-149.48,5.71,-124.87,63.77,70.19,4.75,-24.15,
18.40,-152.91,29.17,-133.78,-1.18,-104.31,66.51,108.81,11.86,-11.51,
7.69,-133.71,20.84,-117.74,-3.72,-82.28,61.95,146.15,20.05,0.65,
-0.52,-113.33,14.97,-100.79,-2.58,-59.75,52.78,172.46,28.91,13.29,
-4.91,-91.36,11.92,-82.84,0.34,-40.12,41.93,-167.91,38.21,27.90,
These are some of the problems with your current solution:
You created a csv.reader object but then you did not use it
You read each line but then you did not store them anywhere
You are not keeping track of 20 rows which was supposed to be your requirement
You created the output file in a separate with block which does not have access anymore to the read lines or the csv.reader object
Here's a working solution:
import csv

inp_file = "input.csv"
out_file_pattern = "file_{:{fill}2}.csv"
max_rows = 20

with open(inp_file, "r") as inp_f:
    reader = csv.reader(inp_f)
    all_rows = []
    cur_file = 1
    for row in reader:
        all_rows.append(row)
        if len(all_rows) == max_rows:
            with open(out_file_pattern.format(cur_file, fill="0"), "w") as out_f:
                writer = csv.writer(out_f)
                writer.writerows(all_rows)
            all_rows = []
            cur_file += 1
    # write any leftover rows (fewer than max_rows) to a final file
    if all_rows:
        with open(out_file_pattern.format(cur_file, fill="0"), "w") as out_f:
            csv.writer(out_f).writerows(all_rows)
The flow is as follows:
Read each row of the CSV using a csv.reader
Store each row in an all_rows list
Once that list gets 20 rows, open a file and write all the rows to it
Use the csv.writer's writerows method
Use a cur_file counter to format the filename
Every time 20 rows are dumped to a file, empty out the list and increment the file counter
This solution counts the blank lines as part of the 20 rows. Your test file actually has 19 rows of CSV data and 1 blank line per block. If you need to skip the blank lines, just add a simple check of
if not row:
continue
Also, as I mentioned in a comment, I assume the input file is an actual CSV file, meaning a plain text file with CSV-formatted data. If the input is actually an Excel file, then solutions like this won't work: you'll need special libraries to read Excel files, even if the contents visually look like CSV or you rename the file to .csv.
Without using any special CSV libraries (e.g. csv; you could, I just don't think it is necessary for this case), you could:
excel_csv_fp = open(r"<file_name>", "r", encoding="utf-8")  # Check proper encoding for your file
csv_data = excel_csv_fp.readlines()
file_counter = 0
new_fp = None
for line in csv_data:
    # readlines() keeps the trailing "\n", so strip before testing for a blank line
    if line.strip() == "":
        if new_fp is not None:
            new_fp.close()
        file_counter += 1
        new_file_name = "file_" + "{:02d}".format(file_counter)  # 1 turns into 01, 10 stays 10
        new_fp = open("<some_path>/" + new_file_name + ".csv", "w", encoding="utf-8")  # Makes a new CSV file to start writing to
    elif new_fp is not None:  # make sure new_fp is an open file before writing
        new_fp.write(line)
if new_fp is not None:
    new_fp.close()
excel_csv_fp.close()
If you have any questions on any of the code (how it works, why I choose what etc.), just ask in the comments and I'll try to reply as soon as possible.

How to write the data to the new column in existing csv File using python script

The code below reads the data in File1 from columns 2, 3, 4 and 8 and writes it to NewFile. The data in each row's column 2 (which is already stored in temp_list) should be searched for in File3; if found, the data in the third column of each row of File3 is appended to temp_list. But the second for loop only considers the column-2 data of the first row; it does not consider the column-2 data of the remaining rows.
I added print var1 in the second loop to see whether each column-2 value (copied to NewFile) is being considered, but the output shows a value only for the first row of File3; values in the other rows are not searched. Can someone please help me understand the problem in my code?
import csv
f1 = csv.reader(open("C:/Users/File1.csv","rb"))
f2 = csv.writer(open("C:/Users/NewFile.csv","wb"))
f3 = csv.reader(open("C:/Users/File3.csv","rb"))
for row_f1 in f1:
    if not row_f1[0].startswith("-"):
        temp_list = [row_f1[1],row_f1[2],row_f1[3],row_f1[7]]
        var1 = row_f1[1]
        for row_f3 in f3:
            if var1 in row_f3:
                temp_list.append(row_f3[2])
        f2.writerow(temp_list)
One of your problems is that when you do for row_f3 in f3: the file is consumed, and the reader doesn't go back to the beginning automatically. One option is to read it once, saving the lines to a list, but checking whether var1 exists in a list every time will be very slow.
In which field of row_f3 are you trying to find var1? You can use a dictionary keyed on that field:
d = dict()
for row_f3 in f3:
    d[row_f3[field_index]] = row_f3[2]
And then:
new_field = d.get(var1)
if new_field is not None:
    temp_list.append(new_field)
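The dictionary idea above can be wrapped up in one self-contained helper (the function name is mine, and field_index/value column are assumptions to adjust to your files):

```python
import csv

def build_lookup(path, key_index, value_index):
    # Read File3 once and map its key column to its value column,
    # so each later lookup is O(1) instead of a rescan of the file.
    lookup = {}
    with open(path, newline='') as f:
        for row in csv.reader(f):
            lookup[row[key_index]] = row[value_index]
    return lookup
```

Usage would be `d = build_lookup("C:/Users/File3.csv", 0, 2)` once before the File1 loop, then `d.get(var1)` per row.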
How big are your files? If they are under 1 GB you can also try pandas instead of reading line by line:
import pandas as pd
df1 = pd.read_csv("C:/Users/File1.csv", header=None, index_col=None)
df1 = df1.loc[~df1[0].str.startswith("-"), [1, 2, 3, 7]]
df1[8] = df1[1].apply(lambda x: d.get(x))  # d built from File3 as above
df1.to_csv("C:/Users/NewFile.csv", header=None)
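Instead of applying a dict, pandas can also do the join itself with merge; a sketch under the assumption that File3's first column is the key and its third column is the value to attach (column labels here are the default integer labels from header=None):

```python
import pandas as pd

def join_files(df1, df3, key_col=1):
    # Keep only File3's key and value columns, give them names,
    # then left-join them onto File1 rows by the key column.
    lookup = df3[[0, 2]].rename(columns={0: 'key', 2: 'extra'})
    return df1.merge(lookup, left_on=key_col, right_on='key', how='left')
```

Rows of df1 with no match in df3 simply get NaN in the 'extra' column, mirroring d.get returning None.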
If I understand your description correctly, the following should do what you want. The main problem with your code is that it doesn't close and reopen the third file in order to re-read and copy the data from it. Since your code is also sloppy about closing files in general, I've taken care of that by modifying it to use with statements, which handle it automatically.
import csv

with open("C:/Users/File1.csv", "rb") as file1, \
     open("C:/Users/NewFile.csv", "wb") as file2:
    f2 = csv.writer(file2)
    for row_f1 in csv.reader(file1):
        if not row_f1[0].startswith("-"):
            temp_list = [row_f1[1], row_f1[2], row_f1[3], row_f1[7]]
            var1 = row_f1[1]
            var1_found = False
            with open("C:/Users/File3.csv", "rb") as file3:
                for row_f3 in csv.reader(file3):
                    if var1 in row_f3:
                        var1_found = True
                        break
            if var1_found:
                with open("C:/Users/File3.csv", "rb") as file3:
                    for row_f3 in csv.reader(file3):
                        temp_list.append(row_f3[2])
            f2.writerow(temp_list)

How to write to a specific cell in a csv file from Python

I am quite new to Python and am trying to write to a specific cell in a CSV file, but I can't quite figure it out.
This is part of it, but I don't know how to get it to write the score (line 3) to the cell I want, e.g. cell "B1":
file = open(class_name + ".csv" , 'a')
file.write(str(name + " : " ))
file.write(str(score))
file.write('\n')
file.close()
Pandas will do what you're looking for:
import pandas as pd
# Read csv into dataframe
df = pd.read_csv('input.csv')
# edit cell based on 0-based index: B1 is row 0, column 1
df.iloc[0, 1] = score
# write output
df.to_csv('output.csv', index=False)
There is a CSV reader/writer in python that you can use. CSV files don't really have cells, so I will assume that when you say "B1" you mean "first value in second line". Mind you, that files do not behave the way a spreadsheet behaves. In particular, if you just start writing in the middle of the file, you will write over the content at the point where you are writing. Most of the time, you want to read the entire file, make the changes you want and write it back.
import csv
# read in the data from file
data = [line for line in csv.reader(open('yourfile.csv'))]
# manipulate first field in second line
data[1][0] = 'whatever new value you want to put here'
# write the file back
csv.writer(open('yourfile.csv', 'w')).writerows(data)
You just have to separate your columns with a delimiter (a semicolon here) and your lines with linebreaks. There's no mystery:
name1 = "John"
name2 = "Bruce"
job1 = "actor"
job2 = "rockstar"
csv_str = ""
csv_str += name1 +";"+job1+"\n" #line 1
csv_str += name2 +";"+job2+"\n" #line 2
file = open(class_name + ".csv" , 'a')
file.write(csv_str)
file.close()
This will generate a 2x2 grid
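Manual string building breaks as soon as a value itself contains the delimiter; the csv module quotes such values automatically. A minimal sketch of the same append with csv.writer (the helper name is mine):

```python
import csv

def append_rows(path, rows):
    # The csv module inserts delimiters and quotes any value that
    # contains one, so "actor, producer" survives as a single cell.
    # newline='' prevents blank lines between rows on Windows.
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerows(rows)
```

For example, `append_rows(class_name + ".csv", [["John", "actor"], ["Bruce", "rockstar"]])` appends the same 2x2 grid.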

Reading from two files

I am trying to write a script that will take several 2-column files, write the first and second columns from the first one to a result file, and then append only the second columns from all the other files.
Example:
File one             File two
Column 1   Column 2  dont take this column   Column 2
Line 1     Line 2    dont take this column   Line 2

The final result should be

Result file
Column 1   Column 2   Column 2
Line 1     Line 2     Line 2
etc
I have almost everything working except for adding the second columns onto the first. I am opening ResultFile as r+, and I want to read out the line that's there (the first file's data), then read the corresponding line from each other file, append it, and put it back in.
Here's the code I have for the second section:
#Open each subsequent file for 2nd column data
while n < i:
    with open(FileNames[n], "r") as InputFile:
        with ResultFile:
            Temp2 = ResultFile.readline()
            for line in InputFile:
                Temp2 += line.split(",", 1)[-1]
                if line == LastValue:
                    break
                if len(ResultFile.readline()) == "":
                    break
            YData += (Temp2 + "\n")
    n += 1
The break ifs are not working quite right at the moment; I just needed a way to end the infinite loop. Also, LastValue is equal to the last x-column value from the first file.
Any help would be appreciated.
EDIT: I'm trying to do this without itertools.
It might help to open up all the files first and store them in a list.
fileHandles = []
for f in fileNames:
    fileHandles.append(open(f))
Then you can just readline() them in order for each line in the first file.
dataLine = fileHandles[0].readline()
while dataLine:
    outFields = dataLine.split(",")[0:2]
    for inFile in fileHandles[1:]:
        dataLine = inFile.readline()
        field = dataLine.split(",")[1]
        outFields.append(field)
    print ",".join(outFields)
    dataLine = fileHandles[0].readline()
Fundamentally you want to loop over all input files simultaneously the way zip does with iterators.
This example illustrates the pattern without the distraction of files and csvs:
file_row_col = [[['1A1', '1A2'],   # File 1, Row A, Column 1 and 2
                 ['1B1', '1B2']],  # File 1, Row B, Column 1 and 2
                [['2A1', '2A2'],   # File 2
                 ['2B1', '2B2']],
                [['3A1', '3A2'],   # File 3
                 ['3B1', '3B2']]]

outrows = []
for rows in zip(*file_row_col):
    outrow = [rows[0][0]]  # Column 1 of the first file
    for row in rows:
        outrow.extend(row[1:])  # Only Column 2 and on
    outrows.append(outrow)

# outrows is now [['1A1', '1A2', '2A2', '3A2'],
#                 ['1B1', '1B2', '2B2', '3B2']]
The key to this is the transformation done by zip(*file_row_col).
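The star-unpacking is just a transpose: zip(*sequences) regroups the i-th element of every inner sequence into one tuple. A tiny illustration with plain lists:

```python
# Three "files", each with rows "a" and "b".
files = [["a1", "b1"], ["a2", "b2"], ["a3", "b3"]]

# zip(*files) pairs up row "a" from every file, then row "b".
by_row = list(zip(*files))
# by_row[0] holds the first row of every file,
# by_row[1] holds the second row of every file.
```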
Now let's reimplement this pattern with actual files. I'm going to use the csv library to make reading and writing the csvs easier and safer.
import csv

infilenames = ['1.csv', '2.csv', '3.csv']
outfilename = 'result.csv'

with open(outfilename, 'wb') as out:
    outcsv = csv.writer(out)
    infiles = []
    # We can't use `with` with a list of resources, so we use
    # try...finally the old-fashioned way instead.
    try:
        incsvs = []
        for infilename in infilenames:
            infile = open(infilename, 'rb')
            infiles.append(infile)
            incsvs.append(csv.reader(infile))
        for inrows in zip(*incsvs):
            outrow = [inrows[0][0]]  # Column 1 of file 1
            for inrow in inrows:
                outrow.extend(inrow[1:])
            outcsv.writerow(outrow)
    finally:
        for infile in infiles:
            infile.close()
Given these input files:
#1.csv
1A1,1A2
1B1,1B2
#2.csv
2A1,2A2
2B1,2B2
#3.csv
3A1,3A2
3B1,3B2
the code produces this result.csv:
1A1,1A2,2A2,3A2
1B1,1B2,2B2,3B2
