Fetch the lines in a file that mention specific keywords - Python

I have two files. keywords_file (5MB) contains one keyword per line. Here's a sample:
prénom
pseudonyme
surnom
accès aux origines
puissance paternelle
droit de l’enfant
devoir de protection
droit à la survie
travail des enfants
obligation scolaire
assistance éducative
mauvais traitements
filiation adultérine
r_isa (205MB) contains words that share an "isa" relationship. Here's a sample, where \t represents a literal tab character:
égalité de Parseval\tformule_0.9333\tégalité_1.0
filiation illégitime\tfiliation_1.0
Loi reconnaissant l'égalité\tloi_1.0
égalité entre les sexes\tégalité_1.0
liberté égalité fraternité\tliberté_1.0
This means, "égalité de Parseval" isa "formule" with a score of 0.9333 and isa "égalité" with a score of 1. And so go on..
I want to fetch from the r_isa file, the words that have the hypernym relationship with the keywords mentioned on the keywords_file. Here's what I did:
import pandas as pd

keywords = pd.read_csv("keywords_file.txt", sep="\t", encoding='utf8', header=None)
with open("r_isa.txt", encoding="utf-8") as openfile:
    for line in openfile:
        for k in keywords[0]:
            if k in line:
                file = open('isa.txt', 'a', encoding='utf-8')
                file.write(("".join(line) + "\n"))
                file.close()
This keeps running non-stop through the entire night. I'm guessing something must be wrong. Any help?
PS: I wanted to add a regular expression like this:
...
        for k in keywords[0]:
            if re.search(r'\b' + k + r'\b', line):
...
to look for the exact word on each line, but that threw the following error, so I left it as it is for now:
error: missing ), unterminated subpattern at position 69

Probably the main bottleneck is the repeated opening for appending inside the tight loop. The operating system will need to open the file and seek to the end every time you write to it. If you need several writers to have access to the end of the file, maybe run each of them with output to a separate file, then combine the result files when all the writers are done.
I'm also a bit suspicious about the order in which you read the files. Apparently the raw r_isa.txt file is bigger, but if it contains fewer lines than keywords_file.txt, perhaps switch them. Generally, try to read the smaller set of data into memory, then loop over the bigger file one line at a time.
Here's an attempt completely without Pandas. There's probably nothing wrong with using it, but it also doesn't really provide much value here.
I also switched to using a regex; it's not clear to me whether that's going to be a performance improvement, but at least it should show you how to get this going so you can measure and compare.
import re

keywords = []
with open("keywords_file.txt") as kwfile:
    keywords = [line.rstrip('\n') for line in kwfile]

regex = re.compile(r'\b(?:' + '|'.join(keywords) + r')\b')

with open("r_isa.txt") as readfile, open('isa.txt', 'w') as writefile:
    for line in readfile:
        firstfield = line.split('\t')[0]
        m = regex.match(firstfield)
        if m:
            writefile.write(line)
Regular expressions are good for looking for substring matches and variations; if you simply want every line where exactly the first field exists verbatim as a line in the keywords file, this is almost certainly going to be quicker:
for line in readfile:
    firstfield = line.split('\t')[0]
    if firstfield in keywords:
        writefile.write(line)
and then of course take out import re and the regex assignment. Maybe also then convert keywords to a set().
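As a side note on the PS: the "missing ), unterminated subpattern" error most likely comes from a keyword containing an unescaped regex metacharacter such as "(". Here is a minimal sketch combining both suggestions - re.escape if you stay with the regex, and a set for exact first-field lookups (file names as in the question):
import re

with open("keywords_file.txt", encoding="utf-8") as kwfile:
    keywords = {line.rstrip('\n') for line in kwfile}  # set: O(1) membership tests

# Only needed if you keep the regex route: escape each keyword so "(" etc. are treated literally.
regex = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b')

with open("r_isa.txt", encoding="utf-8") as readfile, \
        open("isa.txt", "w", encoding="utf-8") as writefile:
    for line in readfile:
        firstfield = line.split('\t')[0]
        if firstfield in keywords:  # exact match; use regex.search(line) for substring matching
            writefile.write(line)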

Related

What is the most effective way to compare strings using python in two very large files?

I have two large text files with about 10k lines in each. Each line has a unique string in the same position that needs to be compared with all the other strings in the other file to see if it matches, and if not, print it out. I'm not sure how to do this in a way that makes sense time-wise since the files are so large. Here's an example of the files.
File 1:
https://www.exploit-db.com/exploits/10185/
https://www.exploit-db.com/exploits/10189/
https://www.exploit-db.com/exploits/10220/
https://www.exploit-db.com/exploits/10217/
https://www.exploit-db.com/exploits/10218/
https://www.exploit-db.com/exploits/10219/
https://www.exploit-db.com/exploits/10216/
file 2:
EXPLOIT:10201 CVE-2009-4781
EXPLOIT:10216 CVE-2009-4223
EXPLOIT:10217 CVE-2009-4779
EXPLOIT:10218 CVE-2009-4082
EXPLOIT:10220 CVE-2009-4220
EXPLOIT:10226 CVE-2009-4097
I want to check whether the numbers at the end of the URLs in the first file match any of the numbers after EXPLOIT: in the second file.
As others have said, 10k lines aren't a problem for computers that have gigabytes of memory. The important steps are:
figure out how to get the identifier out of lines in the first file
do the same for the second file
put them together: loop over lines in each file and produce your output
Regular expressions are for working with text like this. I get regexes that look like /([0-9]+)/$ and :([0-9]+) for the two files (services like https://regex101.com/ are great for playing around).
You can put these together in Python like this:
from sys import stderr
import re

# collect all exploits for easy matching
exploits = {}
for line in open('file_2'):
    m = re.search(r':([0-9]+) ', line)
    if not m:
        print("couldn't find an id in:", repr(line), file=stderr)
        continue
    [id] = m.groups()
    exploits[id] = line

# match them up
for line in open('file_1'):
    m = re.search(r'/([0-9]+)/$', line)
    if not m:
        print("couldn't find an id in:", repr(line), file=stderr)
        continue
    [id] = m.groups()
    if id in exploits:
        pass  # print(line, 'matched with', exploits[id])
    else:
        print(line)
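Run against the two samples above, this should print only the lines from file 1 whose IDs (10185, 10189 and 10219) have no matching EXPLOIT: entry in file 2.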

Find matches for records in one file in another file

I have a file containing words and another "dictionary" file containing definitions. I want to find the definition for each word in the dictionary and write it out to a file.
I looked here and saw an answer that uses Unix/Linux commands, but I am on Windows and decided to solve it in Python instead. I have come up with a working solution but am wondering whether there is a better approach.
import re

with open('D:/words_and_definitions.txt', 'w') as fo:
    dict_file = open('D:/Oxford_English_Dictionary-orig.txt', 'r')
    word_file = open('D:/Words.txt', 'r')
    definitions = dict_file.readlines()
    words = word_file.readlines()
    count = 1
    for word in words:
        findStatus = 'not_found'
        word = word.strip() + ' '
        for definition in definitions:
            if re.match(r'' + word, definition) is None:
                count += 1
            else:
                fo.write(definition)
                findStatus = 'found'
                break
        if findStatus == 'not_found':
            fo.write(word + ' ****************no definition' + '\n')
print("all done")
word_file is not sorted alphabetically, dict_file is.
Sample from word_file
Inane
Relevant
Impetuous
Ambivalent
Dejected
Postmortem
Incriminate
Sample from dict_file
Ambiguity -n. the condition of admitting of two or more meanings, of being understood in more than one way, or of referring to two or more things at the same time
Ambiguous adj. 1 having an obscure or double meaning. 2 difficult to classify. ambiguity n. (pl. -ies). [latin ambi- both ways, ago drive]
Ambit n. Scope, extent, or bounds. [latin: related to *ambience]
Ambition n. 1 determination to succeed. 2 object of this. [latin, = canvassing: related to *ambience]
Ambitious adj. 1 full of ambition or high aims. 2 (foll. By of, or to + infin.) Strongly determined.
Ambivalence n. Coexistence of opposing feelings. ambivalent adj. [latin ambo both, *equivalent]
Ambivalent adj. having opposing feelings, undecided
Amble —v. (-ling) move at an easy pace. —n. Such a pace. [latin ambulo walk]
Have you tried using dictionaries to find a definition? Sure, you could run into memory problems if your definition file is too big, but in your case it should be sufficient. That would give a simple solution:
import re

definition_finder = re.compile(r'^(\w+)\s+(.*)$')

with open('Oxford_English_Dictionary-orig.txt') as dict_file:
    definitions = {}
    for line in dict_file:
        definition_found = definition_finder.match(line)
        if definition_found:
            definitions[definition_found.group(1)] = definition_found.group(2)

with open('Words.txt') as word_file:
    with open('words_and_definitions.txt', 'w') as fo:
        input_lines = (line.strip("\n") for line in word_file)
        for line in input_lines:
            fo.write(f"{line} {definitions.get(line, '****************no definition')}\n")
You could build the definitions dict in a more compact way. That would give:
import re

definition_finder = re.compile(r'^(\w+)\s+(.*)$')

with open('Oxford_English_Dictionary-orig.txt') as dict_file:
    definitions_found = (definition_finder.match(line) for line in dict_file)
    definitions = dict(definition_found.groups() for definition_found
                       in definitions_found if definition_found)

with open('Words.txt') as word_file:
    with open('words_and_definitions.txt', 'w') as fo:
        input_lines = (line.strip("\n") for line in word_file)
        for line in input_lines:
            fo.write(f"{line} {definitions.get(line, '****************no definition')}\n")
If your definition file is indeed too big, then you can consider, for example, using a database via the sqlite3 module.
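A minimal sketch of that idea, assuming the same file names as above; the database file, table and column names are just illustrative:
import re
import sqlite3

definition_finder = re.compile(r'^(\w+)\s+(.*)$')

conn = sqlite3.connect('definitions.db')
conn.execute('CREATE TABLE IF NOT EXISTS definitions (word TEXT PRIMARY KEY, definition TEXT)')

# Load the dictionary once; later runs could skip this step and just query.
with open('Oxford_English_Dictionary-orig.txt') as dict_file:
    rows = (m.groups() for m in map(definition_finder.match, dict_file) if m)
    conn.executemany('INSERT OR REPLACE INTO definitions VALUES (?, ?)', rows)
conn.commit()

with open('Words.txt') as word_file, open('words_and_definitions.txt', 'w') as fo:
    for word in (line.strip('\n') for line in word_file):
        row = conn.execute('SELECT definition FROM definitions WHERE word = ?', (word,)).fetchone()
        fo.write(f"{word} {row[0] if row else '****************no definition'}\n")
conn.close()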

Counting how many times a string appears in a CSV file

I have a piece of code that is supposed to tell me how many times a word occurs in a CSV file. Note: the file is pretty large (2 years of text messages).
This is my code:
key_word1 = 'Exmple_word1'
key_word2 = 'Example_word2'
counter = 0
with open('PATH_TO_FILE.csv', encoding='UTF-8') as a:
    for line in a:
        if (key_word1 or key_word2) in line:
            counter = counter + 1
print(counter)
There are two words because I did not know how to make the search case-insensitive.
To test it, I used the Find function in Word on the whole file (using only one of the words, as I was able to do a case-insensitive search there) and I got more than double what my code calculated.
At first I used the value_counts() function, BUT I received different values for the same word (searching for Exmple_word1 showed it appearing 32 times, 56 times, 2 times, and so on). I got stuck there for a while, but it got me thinking: I use two keyboards on my phone, which I change regularly - could it be that the same words are actually different, and that would explain why I am getting these results?
Also, I checked pretty much all the sources I could find regarding this matter and found different approaches that did not actually do what I want them to do (the value_counts() method, for example).
If that is the case, how can I fix this?
Notice some mistakes in your code:
1. key_word1 or key_word2 - it's "lazy": if the left part, key_word1, evaluates to truthy (and a non-empty string always does), it won't even look at key_word2. This causes the code to check only whether key_word1 appears in the line.
An example to emphasize:
>>> w1 = 'word1'
>>> w2 = 'word2'
>>> s = 'bla word2'
>>> (w1 or w2) in s
False
>>> (w2 or w1) in s
True
2. Reading the CSV file: I recommend using the csv package (just import it), something like:
import csv

with open('PATH_TO_FILE.csv') as f:
    for line in csv.reader(f):
        ...  # do your logic here
3. Case sensitivity - don't work too hard: you can simply lowercase each line you read, so you don't have to hold 2 words.
I guess the solution you are looking for should look something like:
import csv

word_to_search = 'donald'
counter = 0

with open('PATH_TO_FILE.csv', encoding='UTF-8') as f:
    for line in csv.reader(f):
        if any(word_to_search in l for l in map(str.lower, line)):
            counter += 1
Running on input:
bla,some other bla,donald rocks
make,who,great
again, donald is here, hura
will result in:
counter=2
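If you do want to keep both spellings from the question, the same pattern extends to several keywords; a small sketch (keywords taken from the question, lowercased; file name as above):
import csv

keywords = {'exmple_word1', 'example_word2'}  # compared lowercase
counter = 0

with open('PATH_TO_FILE.csv', encoding='UTF-8') as f:
    for row in csv.reader(f):
        # count the row if any field contains any of the keywords, ignoring case
        if any(k in field.lower() for field in row for k in keywords):
            counter += 1

print(counter)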

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The letters are all formatted in roughly the same way:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word (for example the answers posted here: "how to split single txt file into multiple txt files by Python", and here: "Python read through file until match, read until next pattern"). None of it seems to work once I adjust it to use my regex for all-capital words.
The closest I've managed to get is the code below. It creates the right number of files, but after the second file is created it all goes wrong. The third file is empty, and in all the rest the text is out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7, etc., or missing entirely.
import re

thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open(thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)", "", thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
            if pattern.search(line):
                doIHaveToCopyTheLine = True
                ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went with: the first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. Each letter is then written to a new file with the same name as the original file (note: it assumes your file has an extension like .txt) but with an incremented integer appended. Try it out and see how it works for you.
import string


def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
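For example, with the letters saved in a file called letters.txt (a hypothetical name), calling the function produces letters1.txt, letters2.txt, and so on:
split_letters('letters.txt')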
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
if smallfile:
    smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?

Output comes twice - update of a question asked 30 minutes before posting this one

Here is my code
import re

with open('newfiles.txt') as f:
    k = f.read()

p = re.compile(r'[\w\:\-\.\,\']+|[^[\w\:\-\.\'\,]\s]')
originaltext = p.findall(k)

uniquelist = []
for word in originaltext:
    if word not in uniquelist:
        uniquelist.append(word)

indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
n = p.findall(indexes)

file = open("newfiletwo.txt", "w")
file.write(' '.join(str(e) for e in n))
file.close()
file = open("newfilethree.txt", "w")
file.write(' '.join(uniquelist))
file.close()

with open('newfiletwo.txt') as f:
    indexess = f.read()
with open('newfilethree.txt') as f:
    differentwords = f.read()

differentwords = p.findall(differentwords)
indexess = [uniquelist.index(word) for word in originaltext]

for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
    i = differentwords.index(word)
    indexess.append(i)

s = ""  # the reconstructed sentence
for i in indexess:
    s = s + differentwords[i] + " "
print(s)
The program basically takes an external text file and, for each word, records the index of its position (if any word repeats, the first position is taken), then saves the positions to an external file. While doing this, I split up the text file, including splitting off punctuation, and saved the distinct words and punctuation that occur in the file to an external file too. Now for the hard part: using both of these external files - the indexes and the distinct separated words - I am trying to recreate the original text file, including the punctuation. But then this error occurred:
Traceback (most recent call last):
  File "E:\Python\Index.py", line 31, in <module>
    s = s + differentwords[i] + " "
IndexError: list index out of range
Not trying to sound rude, but I am something of a beginner; please try to change as little as possible, and in a simple way, as I created this myself. You probably know a far shorter way to do this, but this is the level of simplicity I can handle, as the length of the code shows. I have tried shortening the original text file, but that was no use. Does anyone know why the error occurs and how to fix it? I am not looking for efficiency right now, maybe after another couple of months of learning; the simplest answer (I don't mind long) will be best. Sorry if I have repeated myself a lot :-)
'newfiles' - A bunch of sentences with punctuation
UPDATE
The code no longer shows the error but prints the original sentence twice. The error went away after removing the +1 on line 23. Does anyone know why the output is repeated, though?
The problem is how you decide what is a word and what is not. For instance, is a comma part of a word? In your case it is not treated as such, while it is also not a separator, so you end up with the comma, the dot, and so on as separate "words". I have no access to your input, so I can only provide a sample:
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
There is one catch - in this case 'Word', 'word', 'Word.' and 'word,' are all separate words, since the dot and comma are now parts of the word. You can't have your cake and eat it. To fix that... you need to store information about whether there is whitespace before each separator.
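A rough sketch of that idea (not the asker's code, just an illustration): record, for each token, whether it was preceded by whitespace, so the text can be rebuilt with its original spacing:
import re

text = "Hello, world. It's me!"
token_re = re.compile(r"[\w:'\-]+|[^\w\s]")

# (token, was_preceded_by_whitespace) pairs
tokens = [(m.group(), m.start() > 0 and text[m.start() - 1].isspace())
          for m in token_re.finditer(text)]

rebuilt = ''.join((' ' if spaced else '') + tok for tok, spaced in tokens)
print(rebuilt)  # Hello, world. It's me!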
UPDATE:
Oh, yes. The double output. The files that are written in the middle are OK, so something goes wrong after that. Look at these two lines:
i = differentwords.index(word)
indexess.append(i)
They need to be inside the preceding if statement.
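In other words, the loop should become (this is the indentation fix described above, using the variable names from the question):
for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
        i = differentwords.index(word)
        indexess.append(i)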
