Comparing a txt file with a csv file with Python

I am working on developing Python code that compares a txt file and a csv file and finds out whether they are identical or not. If not, it should find the errors and summarize them in an Excel table.
def main():
    filename1 = input("Enter txt file name:- ")
    filename2 = input("Enter csv file name:- ")
    fp1 = open(filename1, "r")
    fp2 = open(filename2, "r")
    list1 = []
    list2 = []
    for line in fp1:                  # iterate through each line in the txt file
        a = line.split(",")           # split the line on commas
        for i in a:                   # iterate through each element in the list
            list1.append(i.rstrip())  # strip the newline and append the element
    for line in fp2:
        a = line.split(",")
        for i in a:
            list2.append(i.rstrip())
    fp3 = open("res.csv", "a")        # open the result file in append mode (a new handle, so fp2 is not clobbered)
    flag = 0
    for i in range(len(list1)):       # iterate through the flattened values
        if i == len(list1) - 1 and list1[i] != list2[i]:  # the totals differ in both files
            fp3.write(list1[i-1] + "," + str(abs(int(list1[i]) - int(list2[i]))))  # write the difference
            flag = 1
        elif list1[i] != list2[i]:    # some other line differs
            fp3.write(list1[i-1] + "," + list1[i] + "," + list2[i-1] + "," + list2[i] + "\n")  # write both versions
            flag = 1
    if flag == 0:                     # no difference found
        fp3.write("none")

main()  # call the main function
The output should be an Excel table with a summary of the differences between the two files. The above code gives the numeric differences between the files, but if the number of lines differs, the output should also list the lines that are missing from one file. I would appreciate any ideas to improve this code and help creating code that compares a txt and a csv file and outputs the differences to an Excel spreadsheet.
Thank you.
* I am still new here, so please let me know if I need to edit something or make part of my question clearer.

If both files contain the same type of data, I'd recommend reading both the .csv and the .txt data into pandas DataFrames (after transformation).
After reading, you can easily operate on the columns and rows of both datasets, i.e. find the difference between the two tables, and output this difference to any format you want.
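A minimal sketch of that approach (the file names are placeholders; it assumes both files hold comma-separated rows with the same column layout, and to_excel needs the openpyxl package installed):
import pandas as pd

# both files are assumed to be comma-separated with identical columns
df_txt = pd.read_csv("file1.txt", header=None)
df_csv = pd.read_csv("file2.csv", header=None)

# rows that appear in only one of the two files
# (assumes no duplicate rows within a single file)
diff = pd.concat([df_txt, df_csv]).drop_duplicates(keep=False)

# write the summary so Excel can open it
diff.to_excel("differences.xlsx", index=False)
drop_duplicates(keep=False) keeps only the rows that appear in exactly one of the two tables, which is a simple way to summarize the mismatches, including lines missing from one file.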

Related

Python for loop savetxt each iteration in one file

I am trying to load two txt files, each of which has only one column. Then I do things with each file and output two columns/rows per analysis. I want to save all iterations of the for loop; in this case the output file would have four columns/rows. Here is what I have, which didn't work:
filelist = ["1.txt", "2.txt"]
results = []
for file in filelist:
    a = np.loadtxt(file)
    # do things
    sol = rungekutta4(....)
    Theta = (t4, sol[:, 0])
    savetxt('results.csv', results.append(Theta), delimiter=',')
It is possible to append to a file for each iteration of a loop without numpy.
import numpy as np

filelist = ["1.txt", "2.txt"]
results = []
for file in filelist:
    a = np.loadtxt(file)
    ...
    # savetxt('results.csv', results.append(Theta), delimiter=',')  # append() returns None, so savetxt received nothing useful
    with open('outFilename.csv', 'a') as out:  # the 'a' means append to the end of the file
        out.write(','.join(results) + '\n')    # ','.join(list) converts the list into a string, items separated by ','
Just to be clear, the for loop will output the results variable for each iteration.
Here is an example of the output:
a
a,b
a,b,c
a,b,c,d
Note that since the file is being opened over and over again, you may find the loop slow for a large number of iterations.
To fix this, you can instead open the file just once, at the very beginning of the script.
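A sketch of that variant, with placeholder names matching the answer above (the ... stands for the elided analysis steps that build up results):
import numpy as np

filelist = ["1.txt", "2.txt"]
results = []
with open('outFilename.csv', 'a') as out:     # opened once, before the loop
    for file in filelist:
        a = np.loadtxt(file)
        ...                                   # build up results here, as before
        out.write(','.join(results) + '\n')   # one line written per iteration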

Gathering data from huge text files

I have a text file composed of several consecutive tables. I need to get certain values from certain tables and save them in an output file. Every table has a header which contains a string that can be used to find specific tables. The size of these text files can vary from tens of MB to a few GB. I have written the following script to do the job:
string = 'str'
index = 20
n = 2

in_file = open('file.txt')
out_file = open("out.txt", 'w')

current_line = 0
for i in range(-index, index + 1):
    for j in range(-index, index + 1):
        for line in in_file:
            if string in line:
                En = line.split().pop(4)
                for line in in_file:
                    current_line += 1
                    if current_line == 2*(n+1)+2:
                        x = line.split().pop(10)
                    elif current_line == 3*(n+1)+2:
                        y = line.split().pop(10)
                    elif current_line == 4*(n+1)+2:
                        z = line.split().pop(10)
                        current_line = 0
                        break
                print i, j, En, x, y, z
                data = "%d %d %s %s %s %s\n" % (i, j, En, x, y, z)
                out_file.write(data)
                break
in_file.close()
out_file.close()
The script reads the file line by line, searching for the specified string ('str' in this example). When found, it extracts a value from the line containing the string and continues reading the lines that form the data table itself. Since all the tables in the file have the same number of lines and columns, I've used the variable current_line to keep track of which line is read and to specify which line contains the data I need. The first two for loops are just there to generate a pair of indices that I need printed in the output file (in this case they are between -20 and 20).
The script works fine. But since I've been teaching myself Python for about a month, and the files I have to handle can be very big, I'm asking for advice on how to make the script more efficient and, overall, better.
Also, since the tables are regular, I know beforehand which lines contain the values I need. So I was wondering: instead of reading all the lines in the file, is it possible to specify which lines have to be read and jump directly between them?
Sample input file
Here's a sample input file. I've included just some tables so you can get an idea of how it's organized. This file is composed of two blocks with three tables each. In this sample file, the string "table #" is what is used to find the data to be extracted.
Sample output file
And here's a sample output file. Keep in mind that these two files are not equivalent! This output was created by my script from an input file containing 1681 blocks of 16 tables. Each table had 13 lines, just as in the sample input file.

How to remove rows from a csv file when compared to a list in a txt file using Python?

I have a list of 12,000 dictionary entries (the words only, without their definitions) stored in a .txt file.
I have a complete dictionary with 62,000 entries (the words with their definitions) stored in a .csv file.
I need to compare the small list in the .txt file with the larger list in the .csv file and delete the rows containing the entries that don't appear on the smaller list. In other words, I want to trim this dictionary down to only 12,000 entries.
The .txt file holds one word per line, like this:
word1
word2
word3
The .csv file is ordered like this:
ID (column 1) WORD (column 2) MEANING (column 3)
How do I accomplish this using Python?
Good answers so far. If you want to get minimalistic...
import csv
lookup = set(l.strip().lower() for l in open(path_to_file3))
map(csv.writer(open(path_to_file2, 'w')).writerow,
    (row for row in csv.reader(open(path_to_file))
     if row[1].lower() in lookup))
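One caveat: this relies on Python 2's map() being eager; in Python 3, map() is lazy, so nothing would be written. A rough Python 3 equivalent (the paths are placeholders, as in the snippet above):
import csv

path_to_file = 'dictionary.csv'   # placeholder paths
path_to_file2 = 'filtered.csv'
path_to_file3 = 'wordlist.txt'

with open(path_to_file3) as f:
    lookup = set(l.strip().lower() for l in f)

with open(path_to_file) as fin, open(path_to_file2, 'w', newline='') as fout:
    csv.writer(fout).writerows(
        row for row in csv.reader(fin) if row[1].lower() in lookup)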
The following will not scale well, but should work for the number of records indicated.
import csv
csv_in = csv.reader(open(path_to_file, 'r'))
out_fp = open(path_to_file2, 'w')
csv_out = csv.writer(out_fp)
use_words = [word.strip() for word in open(path_to_file3, 'r')]  # strip the newlines so the words match
lookup = dict([(word, None) for word in use_words])
for line in csv_in:
    if line[1] in lookup:  # the word is in column 2
        csv_out.writerow(line)
out_fp.close()  # close the underlying file; csv.writer itself has no close()
One of the lesser-known facts about current computers is that when you delete a line from a text file and save the file, most of the time the editor does this:
load the file into memory
write a temporary file with the rows you want
close the files and move the temp over the original
So you have to load your wordlist:
import csv, os

with open('wordlist.txt') as i:
    wordlist = set(word.strip() for word in i)  # you said the file was small
Then you open the input file:
with open('input.csv') as i:
    with open('output.csv', 'w') as o:
        output = csv.writer(o)
        for line in csv.reader(i):   # iterate over the CSV line by line
            if line[1] in wordlist:  # keep the row only if the word in column 2 is on the list
                output.writerow(line)
os.rename('output.csv', 'input.csv')  # move the temp over the original
This is untested, now go do your homework and comment here if you find any bug... :-)
I would use pandas for this. The data set's not large, so you can do it in memory with no problem.
import pandas as pd

words = pd.read_csv('words.txt', header=None)  # one word per line, no header row
defs = pd.read_csv('defs.csv')
words.set_index(0, inplace=True)
defs.set_index('WORD', inplace=True)
new_defs = words.join(defs)
new_defs.to_csv('new_defs.csv')
You might need to manipulate new_defs to make it look the way you want, but that's the gist of it.

reading from a particular tuple onwards from a file in python

Using seek and tell does not work here, since tell returns the current position in bytes; I need the line number, rather than the file-pointer position, to proceed.
I have a file glass.csv and I need to cluster the datasets. Each line in the file ends with a cluster number (1, 2, 3, ...), like below:
65,1.52172,13.48,3.74,0.90,72.01,0.18,9.61,0.00,0.07,1
66,1.52099,13.69,3.59,1.12,71.96,0.09,9.40,0.00,0.00,1
67,1.52152,13.05,3.65,0.87,72.22,0.19,9.85,0.00,0.17,1
68,1.52152,13.05,3.65,0.87,72.32,0.19,9.85,0.00,0.17,1
69,1.52152,13.12,3.58,0.90,72.20,0.23,9.82,0.00,0.16,1
70,1.52300,13.31,3.58,0.82,71.99,0.12,10.17,0.00,0.03,1
71,1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12,2
72,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,0.00,0.32,2
73,1.51593,13.09,3.59,1.52,73.10,0.67,7.83,0.00,0.00,2
74,1.51631,13.34,3.57,1.57,72.87,0.61,7.89,0.00,0.00,2
142,1.51851,13.20,3.63,1.07,72.83,0.57,8.41,0.09,0.17,2
143,1.51662,12.85,3.51,1.44,73.01,0.68,8.23,0.06,0.25,2
144,1.51709,13.00,3.47,1.79,72.72,0.66,8.18,0.00,0.00,2
145,1.51660,12.99,3.18,1.23,72.97,0.58,8.81,0.00,0.24,2
146,1.51839,12.85,3.67,1.24,72.57,0.62,8.68,0.00,0.35,2
147,1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00,3
148,1.51610,13.33,3.53,1.34,72.67,0.56,8.33,0.00,0.00,3
149,1.51670,13.24,3.57,1.38,72.70,0.56,8.44,0.00,0.10,3
150,1.51643,12.16,3.52,1.35,72.89,0.57,8.53,0.00,0.00,3
I need to take some lines from the tuples having 1 as the last number and save them in one file (train.txt), and the remaining lines in another file (test.txt). Likewise, I need to take certain lines from those having 2 as the last number, append them to the first file (train.txt), and send the remainder to test.txt.
I cannot get the second set of lines; the first result gets appended again instead.
The easiest way, assuming that you have a large file and cannot simply load the whole thing into memory, would be to use one file per class to do your sorting. If it is a small(ish) input file, just load it as a comma-separated file using the csv module.
As a quick and dirty method (assuming smallish files):
data = []
with open('glass.csv', 'r') as infile:
    for line in infile:
        linedata = [float(val) for val in line.strip().split(',')]
        data.append(linedata)
adata = sorted(data, key=lambda items: items[-1])
## Then open both your output files and write them in the required fields.
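A sketch of that final step, continuing from adata above (how many rows of each class go to train.txt is an arbitrary choice here, via the hypothetical split value):
from itertools import groupby

split = 5  # hypothetical: rows per class routed to the training file
with open('train.txt', 'w') as train, open('test.txt', 'w') as test:
    # adata is already sorted by the class label, so groupby works directly
    for label, rows in groupby(adata, key=lambda row: row[-1]):
        for k, row in enumerate(rows):
            target = train if k < split else test
            # note: the values were parsed as floats, so the formatting
            # differs slightly from the input file
            target.write(','.join(str(v) for v in row) + '\n')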
The default behavior for reading a text file is line by line. You can just do something like this:
with open('input.csv', 'r') as f, open('output_1.csv', 'w') as output_1, open('output_2.csv', 'w') as output_2:  # open the outputs in write mode
    for line in f:
        line_fields = line.strip().split(',')
        if line_fields[-1] == '1':
            output_1.write(line)
            continue
        if line_fields[-1] == '2':
            output_2.write(line)
Or you can use the csv module; it's much easier: https://docs.python.org/2/library/csv.html
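A sketch of the csv-module version, using the same file names as above:
import csv

with open('input.csv') as f, \
     open('output_1.csv', 'w') as out_1, \
     open('output_2.csv', 'w') as out_2:
    writer_1 = csv.writer(out_1)
    writer_2 = csv.writer(out_2)
    for row in csv.reader(f):
        if row[-1] == '1':       # the cluster number is the last field
            writer_1.writerow(row)
        elif row[-1] == '2':
            writer_2.writerow(row)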

Sorting a list from a file, outputting in another file

I am trying to find the min and max in a csv file and have them written to a text file. Currently my code outputs all the data into the output file, and I am unsure how to grab the data out of the multiple columns and have it sorted accordingly.
Any guidance would be appreciated, as I don't have a good lead on how to figure this out.
read_file = open("riskfactors.csv", 'r')

def create_file():
    read_file = open("riskfactors.csv", 'r')
    write_file = open("best_and_worst.txt", "w")
    for line_str in read_file:
        read_file.readline()
        print(line_str, file=write_file)
    write_file.close()
    read_file.close()
Assuming your file is a standard .csv file containing only numbers separated by semicolons:
1;5;7;6;
3;8;1;1;
Then it's easiest to use the str.split() command, followed by a type conversion to int.
You could store all values in a list (or quicker: set) and then get the maximum:
valuelist = []
for line_str in read_file:
    for cell in line_str.strip().split(";"):
        if cell:  # skip the empty field left by the trailing ';'
            valuelist.append(int(cell))
print(max(valuelist))
print(min(valuelist))
Warning: if your file contains non-number entries, you'd have to filter them out. .csv files can also use different delimiters.
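Following up on that warning, a sketch that handles both issues (same file name as above; csv.Sniffer guesses the delimiter from a sample of the file):
import csv

values = []
with open('riskfactors.csv') as f:
    dialect = csv.Sniffer().sniff(f.read(1024))  # guess the delimiter
    f.seek(0)
    for row in csv.reader(f, dialect):
        for cell in row:
            if cell.strip().lstrip('-').isdigit():  # keep only integer-looking cells
                values.append(int(cell.strip()))
print(max(values))
print(min(values))
Sniffer is a best-effort guess, so if the delimiter is known it's safer to pass it explicitly (e.g. csv.reader(f, delimiter=';')).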
import sys, csv

def cmp_risks(x, y):
    # This assumes risk factors are prioritised by key columns 1, 3
    # and that column 1 is numeric while column 3 is textual
    return cmp(int(x[0]), int(y[0])) or cmp(x[2], y[2])

l = sorted(csv.reader(sys.stdin), cmp_risks)
# Write out the first and last rows
csv.writer(sys.stdout).writerows([l[0], l[-1]])
Now, I took a shortcut and said the input and output files were sys.stdin and sys.stdout. You'd probably replace these with the file objects you created in your original question (e.g. read_file and write_file).
However, in my case, I'd probably just run it (if I were using Linux) with:
$ ./foo.py <riskfactors.csv >best_and_worst.txt
