Read the next word in a file in Python

I am searching for certain words in a file in Python. After I find each word I need to read the next two words from the file. I've looked for a solution, but I could not find anything about reading just the next few words.
# offsetFile - file pointer
# searchTerms - list of words
for line in offsetFile:
    for word in searchTerms:
        if word in line:
            pass  # here get the next two terms after the word
Thank you for your time.
Update: Only the first appearance is necessary. Actually only one appearance of the word is possible in this case.
file:
accept 42 2820 access 183 3145 accid 1 4589 algebra 153 16272 algem 4 17439 algol 202 6530
word: ['access', 'algebra']
Searching the file, when I encounter 'access' and 'algebra' I need the values 183 3145 and 153 16272 respectively.

An easy way to deal with this is to read the file using a generator that yields one word at a time from the file.
def words(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word
Then to find the word you're interested in and read the next two words:
with open("offsetfile.txt") as wordfile:
wordgen = words(wordfile)
for word in wordgen:
if word in searchterms: # searchterms should be a set() to make this fast
break
else:
word = None # makes sure word is None if the word wasn't found
foundwords = [word, next(wordgen, None), next(wordgen, None)]
Now foundwords[0] is the word you found, foundwords[1] is the word after that, and foundwords[2] is the second word after it. If there aren't enough words, then one or more elements of the list will be None.
It is a little more complex if you want to force this to match only within one line, but usually you can get away with considering the file as just a sequence of words.
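For completeness, here is a minimal sketch (reusing the words generator above and assuming the sample file layout from the question; find_offsets is a made-up helper name) that collects the two numbers after every search term in a single pass:
def find_offsets(fileobj, searchterms):
    # map each found search term to the two words that follow it
    results = {}
    wordgen = words(fileobj)
    for word in wordgen:
        if word in searchterms:
            results[word] = (next(wordgen, None), next(wordgen, None))
    return results

with open("offsetfile.txt") as wordfile:
    print(find_offsets(wordfile, {"access", "algebra"}))
    # {'access': ('183', '3145'), 'algebra': ('153', '16272')}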

If you only need to retrieve the first two words, just do this:
offsetFile.readline().split()[:2]

word = '3'    # your word
delim = ','   # your delimiter

with open('test_file.txt') as f:
    for line in f:
        if word in line:
            s_line = line.strip().split(delim)
            two_words = (s_line[s_line.index(word) + 1],
                         s_line[s_line.index(word) + 2])
            break

def searchTerm(offsetFile, searchTerms):
    # remove any found words from this list; if empty we can exit
    searchThese = searchTerms[:]
    for line in offsetFile:
        words_in_line = line.split()
        # Use this list comprehension if every word is always followed by two numbers.
        # Otherwise iterate over words_in_line directly.
        for word in [w for i, w in enumerate(words_in_line) if i % 3 == 0]:
            # No more words to search.
            if not searchThese:
                return
            # Search remaining words.
            if word in searchThese:
                searchThese.remove(word)
                i = words_in_line.index(word)
                print(words_in_line[i:i+3])
For 'access', 'algebra' I get this result:
['access', '183', '3145']
['algebra', '153', '16272']

Trie is Using Too Much Memory

I'm trying to get all the words made from the letters 'crbtfopkgevyqdzsh' in a file called web2.txt. The cell posted below follows a block of code that incorrectly returned the whole run-up to each full word, e.g. for the word shocked it would return s, sh, sho, shoc, shock, shocke, shocked.
So I tried a trie (pun intended).
web2.txt is 2.5 MB in size and contains 2,493,838 words of varying length. The trie in the cell below is breaking my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+, to try to accommodate the block of code, but it's still too much. Any more efficient ideas besides a trie to get the same result?
# Find the words3 word list here: svnweb.freebsd.org/base/head/share/dict/web2?view=co
trie = {}
with open('/content/web2.txt') as words3:
    for word in words3:
        cur = trie
        for l in word:
            cur = cur.setdefault(l, {})
        cur['word'] = True  # defined if this node indicates a complete word

def findWords(word, trie=trie, cur='', words3=[]):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur + letter, words3)
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur+letter, word_list)
    return [word for word in words3 if word in words3]

words3 = findWords("crbtfopkgevyqdzsh")
I'm using Python 3.
A trie is overkill. There are about 200 thousand words, so you can just make one pass through all of them to see whether each one can be formed using the letters in the base string.
This is a good use case for collections.Counter, which gives us a clean way to get the frequencies (i.e. "counters") of the letters of an arbitrary string:
from collections import Counter

base_counter = Counter("crbtfopkgevyqdzsh")

with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # can use line_counter <= base_counter instead on Python 3.10+
        if line_counter & base_counter == line_counter:
            print(line)
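To see why that test works: & on two Counters takes the elementwise minimum of the two multisets, so the equality holds exactly when every letter of the candidate word (with multiplicity) is available in the base string. A small illustration (the example words here are made up):
from collections import Counter

base = Counter("crbtfopkgevyqdzsh")
print(Counter("shocked") & base == Counter("shocked"))  # True: every letter is available
print(Counter("wizard") & base == Counter("wizard"))    # False: 'w', 'i' and 'a' are missing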

Python list of words from a file with a specific length, no punctuation included in the words

The result should be a list of words with length greater than 9; the words should be lowercase with no punctuation in them, and there must be only three lines of code in the body of the function.
The problem with my code is that it still keeps punctuation in the words. I tried checking each character, e.g. if ch not in ('-' or '"' or '!'), and also with r'[.,"!-]'.
I also tried opening the file without using with, and it worked; I got the result I wanted, but that way I don't respect the requirement of only three lines of code inside the function body.
import string

min_length = 9
with open('my_file.txt') as file:
    # note: iterating over a file yields whole lines, not single characters,
    # so this filter never removes any punctuation
    content = ''.join([ch for ch in file if ch not in string.punctuation])
    result = [word.lower() for word in content.split() if len(word) > min_length]
    print(result)
My output:
['distinctly', 'repeating,', 'entreating', 'entreating', 'hesitating', 'forgiveness', 'wondering,', 'whispered,', '"lenore!"-', 'countenance', '"nevermore."', 'sculptured', '"nevermore."', 'fluttered-', '"nevermore."', '"doubtless,"', 'unmerciful', 'melancholy', 'nevermore\'."', '"nevermore."', 'expressing', 'nevermore!', '"nevermore."', '"prophet!"', 'undaunted,', 'enchanted-', '"nevermore."', '"prophet!"', '"nevermore."', 'upstarting-', 'loneliness', 'unbroken!-', '"nevermore."', 'nevermore!']
As you can see, there are still words with punctuation.
I got this.
from string import punctuation

with open('test.txt') as f:
    data = f.read().replace('\n', '')
for a in punctuation:
    data = data.replace(a, '')
data = list(set([a for a in data.split(' ') if len(a) > 9]))
print(data)
Output:
An empty list, because in the given data there is not a single word with more than 9 letters.
I believe this could be an appropriate solution:
from string import punctuation

with open('files/text.txt') as f:
    print(set([a for a in f.read().translate(''.maketrans('', '', ''.join([p for p in punctuation]) + '\n')).split(' ') if len(a) > 9]))
However, this is a crime against humanity in terms of readability, and I would highly suggest you relax the three-line requirement to allow your code to be more understandable in the long run.
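For what it's worth, here is a minimal sketch of a more readable version that still keeps three lines in the function body, assuming the requirement counts only the lines inside the function (the function name and file path are made up):
from string import punctuation

def long_words(path, min_length=9):
    with open(path) as f:
        content = f.read().translate(str.maketrans('', '', punctuation))
    return [word.lower() for word in content.split() if len(word) > min_length]

print(long_words('my_file.txt'))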

How can I pull out text snippets around specific words?

I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.
def occurs(word1, word2, filename):
    import os
    infile = open(filename, 'r')  # opens the file and splits it into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]  # this list allows for multiple words
    wordsString = ' '.join(lines)  # joins the lines back into a single string
    words = wordsString.split()    # splits the text into individual words
    f = open(filename, 'w')
    f.write("start")
    f.write(os.linesep)
    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
        for m in matches:
            l = " ".join(words[m-15:m+16])
            f.write(f"...{l}...")  # writes the data to the external file
            f.write(os.linesep)
    f.close()
So far, when two instances of the word are too close together, the program simply skips one of them. Instead, I want a single longer chunk of text that extends 15 words before the earliest occurrence and 15 words after the latest one.
This snippet will get the given number of words around the chosen keyword. If some keywords appear close together, it will join the snippets:
s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''
words = s.split()

from itertools import groupby, chain

word = 'xxx'

def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                current_snippet += [w]
            else:
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]
    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])
    return snippets

for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))
Prints:
xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx
With the same data and a different length:
for snippet in get_snippets(words, word, 2):
print(' '.join(snippet))
Prints:
xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx
As always, a variety of solutions is available here. A fun one would be a recursive wordFind, which searches the next 15 words and, if it finds the target word, calls itself.
A simpler, though perhaps not efficient, solution would be to add words one at a time:
for m in matches:
    l = " ".join(words[m-15:m+1])  # 15 words of context plus the match itself
    remaining = 15
    i = m + 1
    while remaining > 0 and i < len(words):
        l += " " + words[i]
        if words[i].lower() == word:
            remaining = 15  # found the word again: extend the window
        else:
            remaining -= 1
        i += 1
    f.write(f"...{l}...")  # writes the data to the external file
    f.write(os.linesep)
Or, if you want the subsequent occurrences to be removed...
extend = False
for m in matches:
    if not extend:
        l = " ".join(words[m-15:m])
        f.write("...")
    extend = False
    i = 1
    while i < 16:
        if words[m+i].lower() == word:
            l += " " + words[m+i]
            extend = True  # the next match will continue this snippet
            break
        else:
            l += " " + words[m+i]
            i += 1
    f.write(l)
    if not extend:
        f.write("...")
    f.write(os.linesep)
Note: I have not tested this, so it may require a bit of debugging. But the gist is clear: add words piecemeal and extend the addition process when a target word is encountered. With a small addition to the second conditional, this also lets you extend with target words other than the current one.
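For comparison, here is a minimal sketch of a different approach (not from the answers above): compute a window of radius 15 around each match index, merge windows that overlap, and join each merged range once. merged_windows is a made-up helper name:
def merged_windows(words, target, radius=15):
    # indices of every occurrence of the target word
    matches = [i for i, w in enumerate(words) if w.lower() == target]
    windows = []
    for m in matches:
        lo, hi = max(0, m - radius), min(len(words), m + radius + 1)
        if windows and lo <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], hi)  # overlapping: extend the last window
        else:
            windows.append([lo, hi])
    return [" ".join(words[lo:hi]) for lo, hi in windows]

for chunk in merged_windows("a b xxx c d xxx e".split(), "xxx", radius=2):
    print(f"...{chunk}...")  # the two windows overlap, so this prints one merged chunk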

Frequency of keywords in a list

Hi, so I have two text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists with each word and its count.
My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result, without using any imports, dict, or zip.
I am stuck on how to go about the second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency.
I played around with the idea of .find() but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keyword file, but not in the first text file:
def calculateFrequenciesTest(aString):
    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList
EDIT: Currently toying with the idea of reopening my first file, but this time without turning it into a list? I think my errors somehow come from counting the frequency of my list of lists. These are the kinds of results I am getting:
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like the following; I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it to how you like or re-factor it as you wish
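A small follow-up sketch, since the original goal was a list of lists of word/count pairs: the two parallel lists can be paired back together without zip, using plain indexing:
word_counts = [[frequency_word[i], frequency_list[i]] for i in range(len(frequency_word))]
print(word_counts)  # [['hello', 2], ['world', 1], ['test', 1]]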
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more pythonic. Code written this way may look unfamiliar, and can take time getting used to. Yet, you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zip?
So here's a proposal utilizing built-in functionality (no third party) for what it's worth (tested with Python 2):
#!/usr/bin/python

import re           # string matching
import collections  # collections.Counter basically solves your problem

def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Typical punctuation is ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()

def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only letters and ' are allowed in words.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())

# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast at answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count every word in sourcefile_words
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"]  # returns 2
count_a = wordcount_all["a"]        # returns 3

# Only count the words that are in the keywords set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)

count_and = wordcount_keywords["and"]             # returns 2
all_counted_keywords = wordcount_keywords.keys()  # returns ['a', 'and', 'the', 'of']
Here is a solution with no imports. It uses nested linear searches which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still, the input here is quite large, and it is handled in reasonable time. I suspect that if your keywords file were larger (mine has only 3 words) the slowdown would start to show.
Here we take an input file, iterate over the lines, remove punctuation, then split by spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together, then iterate over it, creating a new list containing the string and a count. We can do this by incrementing the count as long as the same word appears in the list, and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.
This is Python 3 code; where applicable it indicates how to modify for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative:
            # line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list, add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to avoid the need to break out of a nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0

# open the keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions: it's quicker and less code.
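For the curious, a minimal sketch of that binary-search variant (hand-rolled to respect the "no imports" constraint; counts is already sorted by word because the word list was sorted before counting, and findword_binary is a made-up name):
def findword_binary(clist, word):
    lo, hi = 0, len(clist)
    while lo < hi:
        mid = (lo + hi) // 2
        if clist[mid][0] < word:
            lo = mid + 1
        elif clist[mid][0] > word:
            hi = mid
        else:
            return clist[mid][1]
    return 0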

finding common occurrence of given words in a line

I have a text file, each line of which contains a few words. Now, given a set of query words, I have to find the number of lines in the file where query words co-occur, i.e. the number of lines containing two of the query words, the number of lines containing three query words, etc.
I tried using the following code. Note that rest(list, word) removes word from list and returns the updated list, and linecount is the number of lines in raw.
raw=open("raw_dataset_1","r")
queryfile=open("queries","r")
query=queryfile.readline().split()
query_size=len(query)
two=0
three=0
four=0
while linecount>0:
line=raw.readline().split()
if query_size>=2:
for word1 in query:
beta=rest(query,word1)
for word2 in beta:
if (word1 in line) and (word2 in line):
two+=1
print line
if (query_size>=3):
for word3 in query:
beta=rest(query,word3)
for word4 in beta:
gama=rest(beta,word4)
for word5 in gama:
if (((word3 in line) and (word4 in line)) and (word5 in line)):
three+=1
print line
linecount-=1
print two
print three
It works, although there is redundancy (I can divide "two" by 2 to get the required number).
Is there a better approach to do it?
I would take a more general approach. Assuming query is a list of your query words and raw_dataset_1 is the name of the file you are analysing, I would do something like:
# list containing the number of lines with 0, 1, 2, 3... occurrences of query words
wordcount = [0, 0, 0, 0, 0]

for line in file("raw_dataset_1").readlines():
    # Loop over each query word, see if it occurs in the given line, and count the hits.
    # The list comprehension builds a list of those query words (query_word) from your
    # query word list (query) which occur in the line (if query_word in line).
    # E.g. if the line contains three query words, those three will be in the list.
    # You are not interested in what those words are, so you just take the length
    # of the list (len). Finally, number_query_words_found is the number of query
    # words present in the current line of text.
    number_query_words_found = len([query_word for query_word in query if query_word in line])
    if number_query_words_found < 5:
        # increase the line count by one; the index corresponds to the number of query words present
        wordcount[number_query_words_found] += 1

print "Number of lines with 2 query words: ", wordcount[2]
print "Number of lines with 3 query words: ", wordcount[3]
This code is not tested and can be optimized. The file is read entirely (inefficient for larger files), and the list wordcount is static where it should be sized dynamically (to allow for any number of word occurrences). But something like this should work, unless I misunderstood your question. For list comprehensions see e.g. here.
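As a minimal sketch of that dynamic version (assuming query is defined as above), the counter list can be grown as needed instead of being fixed at five slots:
wordcount = []
for line in file("raw_dataset_1").readlines():
    n = len([query_word for query_word in query if query_word in line])
    while len(wordcount) <= n:
        wordcount.append(0)  # grow the list to cover this occurrence count
    wordcount[n] += 1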
I would use sets for this:
raw=open("raw_dataset_1","r")
queryfile=open("queries","r")
query_line = queryfile.readline()
query_words = query_line.split()
query_set = set(query_words)
query_size = len(query_set) # Note that this isn't actually used below
for line in raw: # Iterating over a file gives you one line at a time
words = line.strip().split()
word_set = set(words)
common_set = query_set.intersection(word_set)
if len(common_set) == 2:
two += 1
elif len(common_set) == 3:
three += 1
elif len(common_set) == 4:
four += 1
Of course, instead of just counting the occurrences, you might want to save the line to a results file, or anything else. But this should give you the general idea: using sets will simplify your logic immensely.
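Following the same idea, a minimal sketch that tallies every overlap size in one dictionary instead of separate two/three/four counters (counts is a made-up name; the file is re-opened since the loop above consumed it):
counts = {}
for line in open("raw_dataset_1"):
    overlap = len(query_set.intersection(line.strip().split()))
    counts[overlap] = counts.get(overlap, 0) + 1
print counts  # e.g. {0: 10, 2: 3, 3: 1}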
