Python - Opening files for comparison

I am attempting to open two files and check whether the first word of each line in file_1 appears at the start of any line in file_2. If the first word in a line from file_1 matches the first word in a line in file_2, I'd like to print both lines. However, with the code below I am not getting any result. I will be dealing with very large files, so I'd like to avoid loading the files into memory using a list or dictionary. I can only use the built-in functions in Python 3.3. Any advice would be appreciated, and if there is a better way to do this, please advise.
Steps I am trying to perform:
1.) Open file_1
2.) Open file_2
3.) Check if the first word is in ANY line of file_2.
4.) If the first word in both files matches, print the line from both file_1 and file_2.
Contents of files:
file_1:
Apples: 5 items in stock
Pears: 10 items in stock
Bananas: 15 items in stock
file_2:
Watermelon: 20 items in stock
Oranges: 30 items in stock
Pears: 25 items in stock
Code Attempt:
with open('file_1', 'r') as a, open('file_2', 'r') as b:
    for x, y in zip(a, b):
        if any(x.split()[0] in item for item in b):
            print(x, y)
Desired Output:
('Pears: 10 items in stock', 'Pears: 25 items in stock')

Try:
for i in open('[Your File]'):
    for x in open('[Your File 2]'):
        if i == x:
            print(i)
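As posted, this compares whole lines, so for the sample data above it would never print a match. A hedged tweak (my adaptation, not part of the original answer) that compares only the first word of each line might look like:

with open('file_1', 'r') as a:
    for line_a in a:
        if not line_a.strip():
            continue  # skip blank lines
        key_a = line_a.split()[0]
        # re-open file_2 for each line of file_1 so every line gets scanned
        with open('file_2', 'r') as b:
            for line_b in b:
                if line_b.strip() and line_b.split()[0] == key_a:
                    print((line_a.strip(), line_b.strip()))

This still rescans file_2 once per line of file_1, but it never holds more than one line of either file in memory.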

I would actually strongly suggest against storing data in 1GB-sized text files rather than in some sort of database or standard data-storage file format. If your data were more complex, I'd suggest CSV or some other delimited format at minimum. If you can split and store the data in much smaller chunks, a structured format like XML, HTML, or JSON (which would make navigating and extracting the data easy) is far more organized and already optimized for what you're trying to do (locating matching keys and returning their values).
That said, you could use the readline method described in section 7.2.1 of the Python 3 docs to do what you're trying to do efficiently: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-file.
Or, you could just iterate over the file:
def _get_key(string, delim):
    # Split the key out of the string
    key = string.split(delim)[0].strip()
    return key

def _clean_string(string, charToReplace):
    # Remove garbage from the string
    for character in charToReplace:
        string = string.replace(character, '')
    # Strip leading and trailing whitespace
    string = string.strip()
    return string

def get_matching_key_values(file_1, file_2, delim, charToReplace):
    # Open the files to be compared
    with open(file_1, 'r') as a, open(file_2, 'r') as b:
        # Create an object to hold our matches
        matches = []
        # Iterate over file 'a' and extract the keys, one at a time
        for lineA in a:
            keyA = _get_key(lineA, delim)
            # Iterate over file 'b' and extract the keys, one at a time
            for lineB in b:
                keyB = _get_key(lineB, delim)
                # Compare the keys. You might not need upper, but I usually
                # prefer to compare all uppercase to all uppercase
                if keyA.upper() == keyB.upper():
                    cleanedOutput = (_clean_string(lineA, charToReplace),
                                     _clean_string(lineB, charToReplace))
                    # Append the match to the 'matches' list
                    matches.append(cleanedOutput)
            # Reset file 'b' pointer to the start of the file and try again
            b.seek(0)
        # Return our final list of matches
        # --NOTE: this method CAN return an empty 'matches' list!
        return matches
This is not really the best/most efficient way to go about this:
ALL matches are saved to a list object in memory
There is no handling of duplicates
No speed optimization
Iteration over file 'b' occurs 'n' times, where 'n' is the number of lines in file 'a'. Ideally, you would only iterate over each file once.
Even only using base Python, I'm sure there is a better way to go about it.
For the Gist: https://gist.github.com/MetaJoker/a63f8596d1084b0868e1bdb5bdfb5f16
I think the Gist also has a link to the repl.it I used to write and test the code if you want a copy to play with in your browser.
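For comparison, here is a rough single-pass sketch (my own, not from the Gist) that trades the repeated scans of file 'b' for a dict of its keys held in memory:

def get_matching_key_values_single_pass(file_1, file_2, delim):
    # Index file 'b' by key in a single pass
    index = {}
    with open(file_2, 'r') as b:
        for lineB in b:
            keyB = lineB.split(delim)[0].strip().upper()
            index.setdefault(keyB, []).append(lineB.strip())
    matches = []
    # Stream file 'a' and look each key up in the index
    with open(file_1, 'r') as a:
        for lineA in a:
            keyA = lineA.split(delim)[0].strip().upper()
            for lineB in index.get(keyA, []):
                matches.append((lineA.strip(), lineB))
    return matches

Whether this is acceptable depends on whether file_2's lines fit in memory; if they don't, the seek(0) approach above is still the way to go.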

Related

Joblib too slow using "if not in" loop

I am working with amino acid sequences using the Biopython parser. The data is in FASTA format, so you can think of each record as an id followed by a string of letters, as in the example below. My problem is that I have a huge amount of data, and despite trying to parallelize with joblib, the estimated running time for this simple code is 400 hours.
Basically I have a file that contains a series of ids that I have to remove (ids_to_drop) from the original dataset (original_dataset), to create a new file (new_dataset) that contains all the ids contained in the original dataset without the ids_to_drop.
I've tried them all but I don't know how else to do it and I'm stuck right now. Thanks so much!
def file_without_ids_to_remove(seq):
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r:  # output file, file of ids to remove
        remove = r.read().split("\n")
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq) for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')))
To be clear this is an example of the data (sequence.id + sequence):
WP_051064487.1
MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF
LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV
HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD
KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS
YAERWSDHAPLLVDYTIR
UPDATE: I tried in the following way after the suggestion and it works.
with open(new_dataset, "w") as filtered:
    [SeqIO.write(seq, filtered, "fasta") for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')) if seq.id not in ids_to_remove]
This looks like a simple file filter operation. Turn the ids to remove into a set one time, and then just read/filter/write the original dataset. Sets are optimized for fast lookup. This operation will be I/O bound and would not benefit from parallelization.
with open("ids-to-remove") as f:
ids_to_remove = {seq_id_line.strip() for seq_id_line in f}
# just in case there are blank lines
if "" in ids_to_remove:
ids_to_remove.remove("")
with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)
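If the data really is FASTA, where one record usually spans several lines, the same set lookup is better applied per record than per line. A hedged sketch using Biopython (the file names are placeholders of mine):

from Bio import SeqIO

with open("ids-to-remove") as f:
    ids_to_remove = {line.strip() for line in f if line.strip()}

with open("filtered-data-set.fasta", "w") as filtered:
    # stream records one at a time; only the id set is held in memory
    for seq in SeqIO.parse("original-data-set.fasta", "fasta"):
        if seq.id not in ids_to_remove:
            SeqIO.write(seq, filtered, "fasta")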

Python split tabspaced bilingual txt to two separate txt files (list) with newlines separating strings

I have a bilingual corpus (EN-JP) from tatoeba and want to split it into two separate files. The corresponding strings have to stay on the same line in each file.
I need this for training an NMT in nmt-keras and training data has to be stored in separate files for each language. I tried several approaches, but since I'm an absolute beginner with python and coding in general I feel like I'm running in circles.
So far the best I managed was the following:
Source txt:
Go. 行け。
Go. 行きなさい。
Hi. やっほー。
Hi. こんにちは!
Code:
with open('jpns.txt', encoding="utf8") as f:
    columns = zip(*(l.split("\t") for l in f))
    list1 = list(columns)
    print(list1)
Result with my code:
[('Go.', 'Go.', 'Hi.', 'Hi.'), ('行け。\n', '行きなさい。\n', 'やっほー。\n', 'こんにちは!')]
English and Japanese get properly separated (into tuples?), but I'm stuck at figuring out how to export only the English and only the Japanese to output.en and output.jp respectively.
Expected result:
output.en
Go.
Go.
Hi.
Hi.
output.jp
行け。
行きなさい。
やっほー。
こんにちは!
Each output string should have a \n after it.
Please keep in mind that I'm a total beginner with coding, so I'm not exactly sure what I did after "zip" as I just found this here on Stack Overflow. I'd be really grateful for a fully commented suggestion.
The first thing to be aware of is that iterating over a file retains the newlines. That means that in your two columns, the first has no newlines, while the second has newlines already appended to each line (except possibly the last).
Writing the second column is therefore trivial if you've already unpacked the generator columns:
with open('output.jp', 'w') as f:
    f.writelines(list1[-1])
But you still have to append newlines to the first column (and possibly others if you go full-on multilingual). One way would be to append newlines to all the columns but the last. Another would be to strip the newlines from the last column and process all of the columns the same way.
You can achieve the result you want with a small loop, and another call to zip:
langs = ('en', 'jp')
for index, (lang, data) in enumerate(zip(langs, columns)):
    with open('output.' + lang, 'w') as f:
        if index < len(langs) - 1:
            data = (line + '\n' for line in data)
        f.writelines(data)
This approach replaces the tuple data with a generator that appends newlines, unless we are at the last column.
There are a couple of ways to insert newlines between each line in the output files. The one I show uses a lazy generator to append to each line individually. This should save a little memory. If you don't care about memory savings, you can output the whole file as a single string:
joiner = '\n' if index < len(langs) - 1 else ''
f.write(joiner.join(data))
You can even write the loop yourself and print to the file:
for line in data:
    print(line, file=f, end='\n' if index < len(langs) - 1 else '')
Addendum
Let's also look at the line columns = zip(*(l.split("\t") for l in f)) in detail, since it is a very common Python idiom for transposing nested lists, and is the key to getting the result you want.
The generator expression l.split("\t") for l in f is pretty straightforward: it splits each line in the file around tabs, giving you two elements, one in English, and one in Japanese. Adding a * in front of the generator expands it so that each two-element row becomes a separate argument to zip. zip then re-combines the respective elements of each row, so you get a column of the English elements, and a column of the Japanese elements, effectively transposing your original "matrix".
The result is that columns is a generator over the columns. You can convert it to a list, but that is only necessary for viewing. The generator will work fine for the code shown above.
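A tiny illustration of the idiom with made-up values, just to show the transposition:

rows = [('Go.', '行け。'), ('Hi.', 'やっほー。')]
columns = zip(*rows)            # same as zip(('Go.', '行け。'), ('Hi.', 'やっほー。'))
print(list(columns))            # [('Go.', 'Hi.'), ('行け。', 'やっほー。')]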

How to remove rows from a csv file when compared to a list in a txt file using Python?

I have a list of 12,000 dictionary entries (the words only, without their definitions) stored in a .txt file.
I have a complete dictionary with 62,000 entries (the words with their definitions) stored in a .csv file.
I need to compare the small list in the .txt file with the larger list in the .csv file and delete the rows containing the entries that don't appear in the smaller list. In other words, I want to trim this dictionary down to only 12,000 entries.
The .txt file is ordered in separate lines like this, line by line:
word1
word2
word3
The .csv file is ordered like this:
ID (column 1) WORD (column 2) MEANING (column 3)
How do I accomplish this using Python?
Good answers so far. If you want to get minimalistic...
import csv
lookup = set(l.strip().lower() for l in open(path_to_file3))
map(csv.writer(open(path_to_file2, 'w')).writerow,
    (row for row in csv.reader(open(path_to_file))
     if row[1].lower() in lookup))
The following will not scale well, but should work for the number of records indicated.
import csv
csv_in = csv.reader(open(path_to_file, 'r'))
out_file = open(path_to_file2, 'w')
csv_out = csv.writer(out_file)
use_words = [word.strip() for word in open(path_to_file3, 'r').readlines()]
lookup = dict([(word, None) for word in use_words])

for line in csv_in:
    if lookup.has_key(line[1]):  # the word is in column 2
        csv_out.writerow(line)
out_file.close()
One of the least known facts of current computers is that when you delete a line from a text file and save the file, most of the time the editor does this:
load the file into memory
write a temporary file with the rows you want
close the files and move the temp over the original
So you have to load your wordlist:
with open('wordlist.txt') as i:
    wordlist = set(word.strip() for word in i)  # you said the file was small
Then you open the input file:
import csv, os

with open('input.csv') as i:
    with open('output.csv', 'w') as o:
        output = csv.writer(o)
        for line in csv.reader(i):       # iterate over the CSV line by line
            if line[1] in wordlist:      # keep rows whose word (column 2) is in the smaller list
                output.writerow(line)
os.rename('output.csv', 'input.csv')     # move the temp file over the original
This is untested, now go do your homework and comment here if you find any bug... :-)
I would use pandas for this. The data set's not large, so you can do it in memory with no problem.
import pandas as pd
words = pd.read_csv('words.txt')
defs = pd.read_csv('defs.csv')
words.set_index(0, inplace=True)
defs.set_index('WORD', inplace=True)
new_defs = words.join(defs)
new_defs.to_csv('new_defs.csv')
You might need to manipulate new_defs to make it look like you want it to, but that's the gist of it.
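If it helps, here is a hedged version of the same idea (my guess at the details, using isin rather than a join, and assuming words.txt has one word per line with no header while defs.csv has ID, WORD and MEANING columns as in the question):

import pandas as pd

words = pd.read_csv('words.txt', header=None, names=['WORD'])  # one word per line
defs = pd.read_csv('defs.csv')                                 # columns: ID, WORD, MEANING

# keep only the rows whose WORD appears in the word list
new_defs = defs[defs['WORD'].isin(words['WORD'])]
new_defs.to_csv('new_defs.csv', index=False)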

Simplify python code for txt searching

I am a beginner at python and I need to check for the presence of a given set of strings in a huge txt file. I've written this code so far and it runs with no problems on a light subsample of my database. The problem is that it takes more than 10 hours when searching through the whole database, and I'm looking for a way to speed up the process.
The code so far reads a list of strings from a txt I've put together (list.txt) and search for every item in every line of the database (hugedataset.txt). My final output should be a list of items which are present in the database (or, alternatively, a list of items which are NOT present). I bet there is a more efficient way to do things though...
Thank you for your support!
import re

fobj_in = open('hugedataset.txt')
present = []
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]
print list1
for l in fobj_in:
    for title in list1:
        if title in l:
            print title
            present.append(title)
set = set(present)
print set
Since you don't need any per-line information, you can search the whole thing in one go for each string:
data = open('hugedataset.txt').read()  # Assuming it fits in memory
present = []  # As @svk points out, you could make this a set
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]
print list1
for title in list1:
    if title in data:
        print title
        present.append(title)
set = set(present)
print set
You could use a regexp to check for all substrings in a single pass. Look for example at this answer: Check to ensure a string does not contain multiple values
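A rough sketch of that idea (mine, reusing list.txt and hugedataset.txt from the question): build one alternation out of all the titles and scan each line once.

import re

with open('list.txt') as f:
    titles = [line.strip() for line in f if line.strip()]

# one alternation of all titles, escaped so punctuation is matched literally
pattern = re.compile("|".join(re.escape(t) for t in titles))

present = set()
with open('hugedataset.txt') as data:
    for line in data:
        for match in pattern.finditer(line):
            present.add(match.group(0))
print(present)

With thousands of titles the compiled pattern gets large, so whether this beats the read-the-whole-file approach above is worth measuring on a sample.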

Trouble sorting a list with python

I'm somewhat new to python. I'm trying to sort a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up as road). Also, the items are all on one line separated by spaces. I need to use two arguments: one for the input file and one for the output file. The output should be sorted with the numbers first and then the words without the special characters, each on its own line. I've been looking at loads of list functions but am having trouble putting this together as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys
try:
    infilename = sys.argv[1]  #outfilename = sys.argv[2]
except:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)
ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file, written just as it is in the file and not sorted at all. The goal is to take a file (arg1.txt), sort it, and make a new file (arg2.txt); both are command-line arguments. I used print in this case to speed up the editing but need to have it write to a file. That's why the output-file lines are commented out, but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
You may not even need the key function any more since numbers are placed before letters by default. I don't see anything in your code to remove special characters though.
Since they are all on the same line, you don't really need readlines:
with open('some.txt') as f:
    data = f.read()  # now data = "item 1 item2 etc..."
You can use re to filter out unwanted characters:
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition may be overkill here:
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print sorted(my_list)
However, you will need to do more to force the numbers to sort the way you want, maybe something like:
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers, key=lambda x: int(re.sub("[^0-9]", "", x))) + sorted(strings)
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
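Putting those pieces together, here is a hedged end-to-end sketch (my own arrangement of the suggestions above, keeping the argv handling from the question and assuming the only junk characters are ones like !?#$):

import re
import sys

infilename, outfilename = sys.argv[1], sys.argv[2]

with open(infilename) as ifile:
    tokens = ifile.read().split()          # one line, space-separated

# strip the unwanted symbols from every token
tokens = [re.sub(r"[!?#$]", "", t) for t in tokens]
tokens = [t for t in tokens if t]          # drop anything left empty

numbers = sorted((t for t in tokens if t[0].isdigit()),
                 key=lambda t: int(re.sub(r"[^0-9]", "", t)))
words = sorted(t for t in tokens if not t[0].isdigit())

with open(outfilename, 'w') as ofile:
    ofile.write('\n'.join(numbers + words) + '\n')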
