Issue computing difference between two csv files - python

I'm trying to obtain the difference between two csv files A.csv and B.csv in order to obtain new rows added in the second file. A.csv has the following data.
acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 Redundant/RSK
B.csv has the following data.
acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 Redundant/RSK
acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 DT/89
To write the newly added rows into an output file, I'm using the following script.
input_file1 = "A.csv"
input_file2 = "B.csv"
output_path = "out.csv"
with open(input_file1, 'r') as t1:
    fileone = set(t1)

with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in fileone:
            outFile.write(line)
Expected output is:
acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 DT/89
Output obtained through the above script is:
acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 Redundant/RSK
acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 DT/89
I'm not sure where I'm making a mistake; I tried debugging it but made no progress.

You need to be careful with trailing newlines: most likely the single line in A.csv has no trailing '\n', while the lines read from B.csv do, so they never compare equal. It is safer to remove the newlines before comparing and then add them back when writing:
input_file1 = "A.csv"
input_file2 = "B.csv"
output_path = "out.csv"
with open(input_file1, 'r') as t1:
    fileone = set(t1.read().splitlines())

with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        line = line.strip()
        if line not in fileone:
            outFile.write(line + '\n')
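To see why raw-line comparison fails, here is a self-contained sketch using the sample rows from the question, with the file contents inlined as strings (note that the line from A.csv deliberately has no trailing newline):

```python
# Sample data from the question; a_text has no trailing newline,
# which is exactly why comparing raw lines (with their '\n') fails.
a_text = "acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 Redundant/RSK"
b_text = (
    "acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 Redundant/RSK\n"
    "acct ABC 88888888 99999999 ABC-GHD 4/1/18 4 1 2018 DT/89\n"
)

# Stripping newlines (splitlines) before comparing makes the check robust.
old = set(a_text.splitlines())
new_rows = [line for line in b_text.splitlines() if line not in old]
print(new_rows)  # only the DT/89 row
```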


How to read specific columns in the csv file?

I have lots of live data coming from a sensor. Currently, I store the data in a csv file as follows:
0 2 1 437 464 385 171 0:44:4 dog.jpg
1 1 3 452 254 444 525 0:56:2 cat.jpg
2 3 2 552 525 785 522 0:52:8 car.jpg
3 8 4 552 525 233 555 0:52:8 car.jpg
4 7 5 552 525 433 522 1:52:8 phone.jpg
5 9 3 552 525 555 522 1:52:8 car.jpg
6 6 6 444 392 111 232 1:43:4 dog.jpg
7 1 1 234 322 191 112 1:43:4 dog.jpg
.
.
.
.
The third column has numbers between 1 and 6. I want to read the information in columns #4 and #5 for all the rows that have 2 or 5 in the third column. I also want to write them to another csv file line by line, one line every 2 seconds.
I do this because I have another piece of code that reads the data from that file. How could I write out the lines that have 2 or 5 in their third column? Please advise!
for example:
2 552 525
5 552 525
......
......
.....
.
import csv
with open('newfilename.csv', 'w') as f2:
    with open('mydata.csv', mode='r') as infile:
        reader = csv.reader(infile)  # no conversion to list
        header = next(reader)  # get first line
        for row in reader:  # continue to read one line per loop
            if row[5] == 2 & 5:
The third column has index 2 so you should be checking if row[2] is one of '2' or '5'. I have done this by defining the set select = {'2', '5'} and checking if row[2] in select.
I don't see what you are using header for but I assume you have more code that processes header somewhere. If you don't need header and just want to skip the first line, just do next(reader) without assigning it to header but I have kept header in my code under the assumption you use it later.
We can use time.sleep(2) from the time module to help us write a row every 2 seconds.
Below, "in.txt" is the csv file containing the sample input you provided and "out.txt" is the file we write to.
Code
import csv
import time

select = {'2', '5'}

with open("in.txt") as f_in, open("out.txt", "w") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)
    for row in reader:
        if row[2] in select:
            print(f"Writing {row[2:5]} at {time.time()}")
            writer.writerow(row[2:5])
            # f_out.flush() may need to be run here
            time.sleep(2)
Output
Writing ['2', '552', '525'] at 1650526118.9760585
Writing ['5', '552', '525'] at 1650526120.9763758
"out.txt"
2,552,525
5,552,525
Input
"in.txt"
0,2,1,437,464,385,171,0:44:4,dog.jpg
1,1,3,452,254,444,525,0:56:2,cat.jpg
2,3,2,552,525,785,522,0:52:8,car.jpg
3,8,4,552,525,233,555,0:52:8,car.jpg
4,7,5,552,525,433,522,1:52:8,phone.jpg
5,9,3,552,525,555,522,1:52:8,car.jpg
6,6,6,444,392,111,232,1:43:4,dog.jpg
7,1,1,234,322,191,112,1:43:4,dog.jpg
I think you'd just need to change your if statement to be able to get the rows you want.
for example:
import csv
with open('newfilename.csv', 'w') as f2:
    with open('mydata.csv', mode='r') as infile:
        reader = csv.reader(infile)  # no conversion to list
        header = next(reader)  # get first line
        for row in reader:  # continue to read one line per loop
            if row[2] in ('2', '5'):
inside the if, you'll get the rows that have 2 or 5 in the third column. Note that the third column has index 2, and that csv.reader yields strings, so compare against '2' and '5' rather than the integers.

Compare two files by checking the first 3 columns; if they are not the same values, then print the entire line (python)

I'm kinda new to Python and Stack Overflow, so forgive me if I did not explain my question properly.
First file (test1.txt):
customer ID age country version
- Alex #1233 25 Canada 7
- James #1512 30 USA 2
- Hassan #0051 19 USA 9
Second file (test2.txt):
customer ID age country version
- Alex #1233 25 Canada 3
- James #1512 30 USA 7
- Bob #0061 20 USA 2
- Hassan #0051 19 USA 1
Results for the missing lines should be
Bob #0061 20 USA 2
Here is the code
missing = []
with open('C:\\Users\\yousi\\Desktop\\Work\\Python Project\\test1.txt.txt', 'r') as a_file:
    a_lines = a_file.read().split('\n')
with open('C:\\Users\\yousi\\Desktop\\Work\\Python Project\\test2.txt.txt', 'r') as b_file:
    b_lines = b_file.read().split('\n')
for line_a in a_lines:
    for line_b in b_lines:
        if line_a in line_b:
            break
    else:
        missing.append(line_a)
print(missing)
# the with-statements already close the files, so no explicit close() is needed
The problem with this code is that it compares both files based on the entire line. I only want to check the first 3 columns; if they don't match, it should print the entire line.
new example:
First file (test1.txt)
60122 LX HNN -- 4 32.7390 -114.6357 40 Winterlaven - Sheriff Sabstation
60122 LX HNZ -- 4 32.7390 -114.6357 40 Winterlaven - Sheriff Sabstation
60122 LX HNE -- 4 32.7390 -114.6357 40 Winterlaven - Sheriff Sabstation
second file (test2.txt)
60122 LX HNN -- 4 32.739000 -114.635700 40 Winterlaven - Sheriff Sabstation
60122 LX HNZ -- 4 32.739000 -114.635700 40 Winterlaven - Sheriff Sabstation
60122 LX HNE -- 4 32.739000 -114.635700 40 Winterlaven - Sheriff Sabstation
If you want to compare the first 3 columns, you should do this:
a_line = 'Alex 1233 25 Canada'  # this is one line from a file
# split the line on whitespace
a_line = a_line.split()
>>> ['Alex', '1233', '25', 'Canada']
# take the first 3 columns
a_line = a_line[:3]
>>> ['Alex', '1233', '25']
# then you can compare the slices
['Alex', '1233', '25'] == ['Alex', '1233', '25']
>>> True
['Alex', '1233', '25'] == ['Alex', '1233', '26']
>>> False
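Putting those steps together, a minimal sketch of a helper that compares two lines on their first three columns (the function name is just for illustration):

```python
def same_first_three(line_a, line_b):
    """Return True if the first three whitespace-separated columns match."""
    return line_a.split()[:3] == line_b.split()[:3]

# Lines that differ only in the version column still match:
print(same_first_three("Alex #1233 25 Canada 7", "Alex #1233 25 Canada 3"))  # True
# Lines from different customers do not:
print(same_first_three("Alex #1233 25 Canada 7", "Bob #0061 20 USA 2"))  # False
```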
Instead of using read().split('\n') you could just use readlines().
If test1.txt and test2.txt contains the text from your question, then this script:
with open('test1.txt', 'r') as f1, open('test2.txt', 'r') as f2:
    i1 = [line.split()[:-1] for line in f1 if line.strip().startswith('-')]
    i2 = (line.split() for line in f2 if line.strip().startswith('-'))
    missing = [line for line in i2 if line[:-1] not in i1]

for _, *line in missing:
    print(' '.join(line))
Prints:
Bob #0061 20 USA 2
EDIT: If the file doesn't contain - at the beginning of rows, then this script:
with open('test1.txt', 'r') as f1, open('test2.txt', 'r') as f2:
    i1 = [line.split()[:-1] for line in f1 if line.strip()]
    i2 = (line.split() for line in f2 if line.strip())
    missing = [line for line in i2 if line[:-1] not in i1]

for line in missing:
    print(' '.join(line))
Prints:
Bob #0061 20 USA 2
EDIT 2: To compare only first 3 columns, you can use this example (note the [:3]):
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    i1 = [line.split()[:3] for line in f1 if line.strip()]
    i2 = (line.split() for line in f2 if line.strip())
    missing = [line for line in i2 if line[:3] not in i1]

for line in missing:
    print(' '.join(line))
Prints nothing for the new example files you have in the question.

Reading a txt file with numbers and summing them - python

I have a txt file with the following text in it:
2
4 8 15 16 23 42
1 3 5
6
66 77
77
888
888 77
34
23 234 234
1
32
3
23 23 23
365
22 12
I need a way to read the file and sum all the numbers.
I have this code for now but I'm not sure what to do next. Thanks in advance.
lstComplete = []
fichNbr = open("nombres.txt", "r")
lstComplete = fichNbr
somme = 0
for i in lstComplete:
    i = i.split()
Turn them into a list and sum them:
with open('nombres.txt', 'r') as f:
    num_list = f.read().split()

print(sum(int(n) for n in num_list))
Returns 3227
Open the file and use the read() method to get the content, then convert each string to an int and use sum() to get the result:
>>> sum(map(int,open('nombres.txt').read().split()))
3227
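For reference, the same idea as a self-contained Python 3 snippet, with the sample numbers from the question inlined instead of read from nombres.txt:

```python
# Sample numbers from the question, inlined so the snippet runs standalone.
text = """2
4 8 15 16 23 42
1 3 5
6
66 77
77
888
888 77
34
23 234 234
1
32
3
23 23 23
365
22 12"""

# str.split() with no arguments splits on any whitespace, including newlines.
total = sum(int(n) for n in text.split())
print(total)  # 3227
```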

Read in file as arrays then rearrange columns

I would like to read in a file with multiple columns and write out a new file with columns in a different order than the original file. One of the columns has some extra text that I want eliminated in the new file as well.
For instance, if I read in file: data.txt
1 6 omi=11 16 21 26
2 7 omi=12 17 22 27
3 8 omi=13 18 23 28
4 9 omi=14 19 24 29
5 10 omi=15 20 25 30
I would like the written file to be: dataNEW.txt
26 1 11 16
27 2 12 17
28 3 13 18
29 4 14 19
30 5 15 20
With the help of inspectorG4dget, I came up with this:
import csv
import sys

infile = open('Rearrange Column Test.txt')
sys.stdout = open('Rearrange Column TestNEW.txt', 'w')

for line in csv.reader(infile, delimiter='\t'):
    newline = [line[i] for i in [5, 0, 2, 3]]
    newline[2] = newline[2].split('=')[1]
    print(newline[0], newline[1], newline[2], newline[3])

sys.stdout.close()
Is there a more concise way to get an output without any commas than listing each line index from 0 to the total number of lines?
import csv

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    writer = csv.writer(outfile, delimiter=' ')  # space-delimited output, no commas
    for line in csv.reader(infile, delimiter='\t'):
        newline = [line[i] for i in [-1, 0, 2, 3]]
        newline[2] = newline[2].split('=')[1]
        writer.writerow(newline)
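A quick self-contained check of the index trick, using io.StringIO in place of real files and a space delimiter so the output has no commas (two tab-separated sample rows from the question):

```python
import csv
import io

# Two sample rows from the question, tab-separated as the reader expects.
sample = "1\t6\tomi=11\t16\t21\t26\n2\t7\tomi=12\t17\t22\t27\n"

out = io.StringIO()
writer = csv.writer(out, delimiter=' ')
for line in csv.reader(io.StringIO(sample), delimiter='\t'):
    newline = [line[i] for i in [-1, 0, 2, 3]]  # last, first, third, fourth columns
    newline[2] = newline[2].split('=')[1]       # drop the 'omi=' prefix
    writer.writerow(newline)

print(out.getvalue().splitlines())  # ['26 1 11 16', '27 2 12 17']
```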

Vlookup in python

I am new to Python and learning as fast as possible. I know how to solve my problem in bash and am trying to do it in Python.
I have a data file (data_array.csv in the example) and an index file, index.csv. I want to extract the rows from the data file whose ID appears in the index file and store them in a new file, Out.txt. I also want to put NA in Out.txt for the IDs that have no row in the data file. I know how to do it for one column, but my data has more than 1000 columns (from 1 to 1344), so I'd like a script that can do this efficiently. My data file, index file and proposed output are as follows.
data_array.csv
Id 1 2 3 . . 1344
1 10 20 30 . . -1
2 20 30 40 . . -2
3 30 40 50 . . -3
4 40 50 60 . . -4
6 60 60 70 . . -5
8 80 70 80 . . -6
10 100 80 90 . . -7
index.csv
Id
1
2
8
9
10
Required Output is
Out.txt
Id 1 2 3 . . 1344
1 10 20 30 . . -1
2 20 30 40 . . -2
8 80 70 80 . . -6
9 NA NA NA NA
10 100 80 90 . . -7
I tried
#! /usr/bin/python
import csv

with open('data_array.csv', 'r') as lookuplist:
    with open('index.csv', 'r') as csvinput:
        with open('VlookupOut', 'w') as output:
            reader = csv.reader(lookuplist)
            reader2 = csv.reader(csvinput)
            writer = csv.writer(output)
            for i in reader2:
                for xl in reader:
                    if i[0] == xl[0]:
                        i.append(xl[1:])
                        writer.writerow(i)
But it only does this for the first row. I want the program to work for all the rows and columns of my data files.
It only outputs the first row because after the inner for xl in reader loop runs once, you are at the end of the data file, so the loop never runs again. You would have to seek back to the beginning of the file for every index row. To increase efficiency, you can instead read the data file into a dictionary first, then use dictionary lookups to get the row for each ID (writing NA when an ID is missing):
#! /usr/bin/python
import csv

with open('data_array.csv', 'r') as lookuplist:
    with open('index.csv', 'r') as csvinput:
        with open('VlookupOut', 'w') as output:
            reader = csv.reader(lookuplist)
            reader2 = csv.reader(csvinput)
            writer = csv.writer(output)
            d = {}
            for xl in reader:
                d[xl[0]] = xl[1:]  # map each ID to the rest of its row
            for i in reader2:
                if i[0] in d:
                    writer.writerow(i + d[i[0]])
                else:
                    writer.writerow(i + ['NA'])  # ID missing from the data file
When you read a CSV file using for xl in reader, it will go through every row until it reaches the end, but it will only do this once. You can tell it to go back to the first row by calling .seek(0) on the underlying file object, once per outer iteration:
#! /usr/bin/python
import csv

with open('data_array.csv', 'r') as lookuplist:
    with open('index.csv', 'r') as csvinput:
        with open('VlookupOut', 'w') as output:
            reader = csv.reader(lookuplist)
            reader2 = csv.reader(csvinput)
            writer = csv.writer(output)
            for i in reader2:
                for xl in reader:
                    if i[0] == xl[0]:
                        writer.writerow(i + xl[1:])
                lookuplist.seek(0)  # rewind the data file before the next index row
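The effect of seek(0) can be seen in a tiny standalone example, using io.StringIO as a stand-in for a real file:

```python
import io

f = io.StringIO("a\nb\n")
first_pass = list(f)   # reads to the end of the "file"
second_pass = list(f)  # empty: the file position is already at the end
f.seek(0)              # rewind to the beginning
third_pass = list(f)   # full contents again
print(first_pass, second_pass, third_pass)
```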
