Finding common occurrence of given words in a line - Python

I have a text file, each line of which contains a few words. Now, given a set of query words, I have to find the number of lines in the file where query words co-occur, i.e. the number of lines containing two query words, the number of lines containing 3 query words, etc.
I tried using the following code. Note that rest(list, word) removes "word" from "list" and returns the updated list, and linecount is the number of lines in raw.
raw=open("raw_dataset_1","r")
queryfile=open("queries","r")
query=queryfile.readline().split()
query_size=len(query)
two=0
three=0
four=0
while linecount>0:
    line=raw.readline().split()
    if query_size>=2:
        for word1 in query:
            beta=rest(query,word1)
            for word2 in beta:
                if (word1 in line) and (word2 in line):
                    two+=1
                    print line
    if (query_size>=3):
        for word3 in query:
            beta=rest(query,word3)
            for word4 in beta:
                gama=rest(beta,word4)
                for word5 in gama:
                    if (((word3 in line) and (word4 in line)) and (word5 in line)):
                        three+=1
                        print line
    linecount-=1
print two
print three
It works, although there is redundancy (I can divide "two" by 2 to get the required number).
Is there a better approach to do it?

I would take a more general approach. Assuming query is a list of your query words and raw_dataset_1 is the name of the file you are analysing, I would do something like:
# list containing the number of lines with 0,1,2,3... occurrences of query words
wordcount = [0,0,0,0,0]
for line in file("raw_dataset_1").readlines():
    # Loop over each query word, see if it occurs in the given line, and just count them.
    # The bracket inside builds a list of elements (query_word) from your query
    # word list (query), but adds only those words which occur in the line
    # (if query_word in line). [See list comprehension.]
    # E.g. if your line contains three query words, those three will be in the list.
    # You are not interested in what those words are, so you just take the
    # length of the list (len). Finally, number_query_words_found is the number
    # of query words present in the current line of text.
    number_query_words_found = len([query_word for query_word in query if query_word in line])
    if number_query_words_found < 5:
        # increase the line count by one; the index corresponds to the number of query words present
        wordcount[number_query_words_found] += 1
print "Number of lines with 2 query words: ", wordcount[2]
print "Number of lines with 3 query words: ", wordcount[3]
This code is not tested and can be optimized. The file will be read entirely (inefficient for larger files), and the list wordcount is static; it should be built dynamically (to allow for any number of word occurrences). But something like this should work, unless I misunderstood your question. For list comprehension see e.g. here.
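As a hedged sketch of that dynamic variant (assuming query is your list of query words; note it splits each line into a set of words, so it counts whole-word matches rather than substrings):
from collections import Counter

# maps "number of query words found in a line" -> "number of such lines",
# so no fixed-size list is needed
wordcount = Counter()
with open("raw_dataset_1") as raw:
    for line in raw:  # iterating lazily avoids reading the whole file at once
        words = set(line.split())
        found = sum(1 for query_word in query if query_word in words)
        wordcount[found] += 1

print("Number of lines with 2 query words:", wordcount[2])
print("Number of lines with 3 query words:", wordcount[3])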

I would use sets for this:
raw = open("raw_dataset_1","r")
queryfile = open("queries","r")
query_line = queryfile.readline()
query_words = query_line.split()
query_set = set(query_words)
query_size = len(query_set) # Note that this isn't actually used below
two = three = four = 0 # the counters must be initialized before the loop
for line in raw: # Iterating over a file gives you one line at a time
    words = line.strip().split()
    word_set = set(words)
    common_set = query_set.intersection(word_set)
    if len(common_set) == 2:
        two += 1
    elif len(common_set) == 3:
        three += 1
    elif len(common_set) == 4:
        four += 1
Of course, instead of just counting the occurrences, you might want to save the line to a results file, or anything else. But this should give you the general idea: using sets will simplify your logic immensely.
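If you'd rather not keep a separate counter for each size, a hedged variant of the same set-based idea tallies every intersection size in one pass (a minimal sketch, reopening the file and reusing query_set from above):
from collections import Counter

counts = Counter()
with open("raw_dataset_1") as raw:
    for line in raw:
        common = query_set.intersection(line.split())
        counts[len(common)] += 1  # key = number of query words shared with the line

print(counts[2], counts[3], counts[4])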

Related

Identify lines of speech which contain words from a list using pandas

I have the following dataframe:
test = pd.DataFrame(columns = ['Line No.','Person','Speech'])
test['Person'] = ['A','B','A','B','A','B']
test['Line No.'] = [1,2,3,4,5,6]
test['Speech'] = ['hello. how was your assessment day? i heard it went very well.',
                  'The beginning was great and the rest of the day was kinda ok.',
                  'why did things go from great to ok?',
                  'i was positive at the beginning and went right with all my answers but then i was not feeling well.',
                  "that's very unfortunate. if there's anything i can help you with please let me know how.",
                  'Will do.']
And the following list which contains keywords:
keywords = ['hello','day','great','well','happy','right','ok','why','positive']
I would like to generate an output which shows both the speaker and the line no. associated with them each time their speech contains at least 3 words from the keywords list. I have tried iterating through each line in the dataframe to see if there were at least 3 keywords present, however my code only returns the last line. Below is the code I used:
def identify_line_numebr(dataframe, keywords:list, thresh:int=3):
    is_person = False
    keyword_match_list = []
    for index, row in dataframe.iterrows():
        if is_person == False:
            # Pulling out the speech
            line = row['Speech']
            for token in line:
                # Checking if each line of speech contains key words
                if token in keywords:
                    keyword_match_list.append(token)
            print(index, is_person, row['Line No.'], row['Person'])
            print(len(keyword_match_list))
            if len(keyword_match_list) == thresh:
                is_person == True
        else:
            break
    return {row['Person'], row['Line No.']}
The expected output for this particular case should be in a similar format:
output = [{1, 'A'},{2, 'B'},{3, 'A'},{5, 'A'}]
whereby the first value is the Line No. whose speech contains at least 3 keywords, and the letter is the person.
The problem is that you stop the iteration over the rows as soon as you find a line containing at least three keywords. Instead, you should iterate over all lines and add the person and line number to a list if the threshold count is met:
def identify_line_numbers(dataframe, keywords, thresh=3):
    person_line = [] # will contain sets of {Line No., Person}
    for line_index, line in enumerate(dataframe.Speech):
        # check if each word is in the current line
        words_in_speech = [word in line for word in keywords]
        # add person and line number to our list if the threshold count is met
        if sum(words_in_speech) >= thresh:
            person_line.append(
                {dataframe.Person[line_index], dataframe['Line No.'][line_index]}
            )
    return person_line
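Called on the test dataframe from the question, a quick check might look like this (a sketch; the element order inside each set can vary, since sets are unordered):
# assuming the `test` dataframe and `keywords` list defined in the question
result = identify_line_numbers(test, keywords)
print(result)  # a list of {Line No., Person} sets, one per qualifying line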

Word & Line Concordance Program

I originally posted this question here but was then told to post it to code review; however, they told me that my question needed to be posted here instead. I will try to better explain my problem so hopefully there is no confusion. I am trying to write a word-concordance program that will do the following:
1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you’re timing) containing only stop words, called stopWordDict. (WARNING: Strip the newline(‘\n’) character from the end of the stop word before adding it to stopWordDict)
2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary(called wordConcordanceDict) containing “main” words for the keys with a list of their associated line numbers as their values.
3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed out in alphabetical order along with their corresponding line numbers.
I tested my program on a small file with a short list of stop words and it worked correctly (provided an example of this below). The outcome was what I expected: a list of the main words with their line count, not including words from the stop_words_small.txt file. The only difference between the small file I tested and the main file I am actually trying to test is that the main file is much longer and contains punctuation. So the problem I am running into is that when I run my program with the main file, I am getting way more results than expected, because the punctuation is not being removed from the file.
For example, below is a section of the outcome where my code counted the word Dmitri as four separate words because of the different capitalization and punctuation that follows the word. If my code were to remove the punctuation correctly, the word Dmitri would be counted as one word followed by all the locations found. My output is also separating upper and lower case words, so my code is not making the file lower case either.
What my code currently displays:
Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]
Dmitri! : [16671, 16672]
Dmitri, : [2530, 3676, 3685, 13160, 16247]
dmitri : [2000]
What my code should display:
dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]
Words are defined to be sequences of letters delimited by any non-letter. There should also be no distinction made between upper and lower case letters, but my program splits those up as well; however, blank lines are to be counted in the line numbering.
Below is my code and I would appreciate it if anyone could take a look at it and give me any feedback on what I am doing wrong. Thank you in advance.
import re

def main():
    stopFile = open("stop_words.txt","r")
    stopWordDict = dict()
    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []
    hwFile = open("WarAndPeace.txt","r")
    wordConcordanceDict = dict()
    lineNum = 1
    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip(' ')
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1
    for word in sorted(wordConcordanceDict):
        print (word," : ",wordConcordanceDict[word])

if __name__ == "__main__":
    main()
Just as another example and reference, here is the small file I tested with the small list of stop words, which worked perfectly.
stop_words_small.txt file
a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was
small_file.txt
This is a sample data (text) file to
be processed by your word-concordance program.
The real data file is much bigger.
correct output
bigger: 4
concordance: 2
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
word: 2
your: 2
You can do it like this:
import re
from collections import defaultdict

wordConcordanceDict = defaultdict(list)

with open('stop_words_small.txt') as sw:
    words = (line.strip() for line in sw)
    stop_words = set(words)

with open('small_file.txt') as f:
    for line_number, line in enumerate(f, 1):
        words = (re.sub(r'[^\w\s]','',word).lower() for word in line.split())
        good_words = (word for word in words if word not in stop_words)
        for word in good_words:
            wordConcordanceDict[word].append(line_number)

for word in sorted(wordConcordanceDict):
    print('{}: {}'.format(word, ' '.join(map(str, wordConcordanceDict[word]))))
Output:
bigger: 4
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
wordconcordance: 2
your: 2

I will add explanations tomorrow, it's getting late here ;). Meanwhile, you can ask in the comments if some part of the code isn't clear for you.
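One detail worth spelling out in the meantime: re.sub(r'[^\w\s]','',word) deletes every character that is neither a word character nor whitespace, which is exactly the punctuation stripping the question was missing. A quick illustration:
import re

for raw_word in ['Dmitri!', 'Dmitri,', 'dmitri']:
    # strip punctuation, then lowercase, so all variants collapse to one key
    print(re.sub(r'[^\w\s]', '', raw_word).lower())  # prints 'dmitri' three times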

Frequency of keywords in a list

Hi, so I have 2 text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists with each word and its count in the file.
My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result without using any imports, dict, or zips.
I am stuck on how to go about this second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency.
I played around with the idea of .find() but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keyword file but not in the first text file:
def calculateFrequenciesTest(aString):
    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList;
EDIT: Currently I'm toying with the idea of reopening my first file, but this time without turning it into a list? I think my errors are somehow coming from counting the frequency of my list of lists. These are the types of results I am getting:
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like this. I am taking a list of words as an example:
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}

for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1

print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []

for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1

print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it how you like, or refactor it as you wish.
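For instance, a hedged sketch that folds the two parallel lists back into the [word, count] list-of-lists shape your first part produces (still no imports, dicts, or zips):
# combine the parallel lists into [[word, count], ...] pairs
keyword_count_list = []
for ind in range(len(frequency_word)):
    keyword_count_list.append([frequency_word[ind], frequency_list[ind]])

print(keyword_count_list)  # [['hello', 2], ['world', 1], ['test', 1]]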
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more Pythonic. Code written this way may look unfamiliar, and can take time getting used to. Yet, you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zips?
So here's a proposal utilizing built-in functionality (no third party) for what it's worth (tested with Python 2):
#!/usr/bin/python

import re          # String matching
import collections # collections.Counter basically solves your problem

def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()

def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only characters and ' are allowed in strings.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())

# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count for every word in sourcefile_words, ignoring your keywords
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"] # returns 2
count_a = wordcount_all["a"]       # returns 3

# Only look for words in the keywords-set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)
count_and = wordcount_keywords["and"] # Returns 2
all_counted_keywords = wordcount_keywords.keys() # Returns ['a', 'and', 'the', 'of']
Here is a solution with no imports. It uses nested linear searches which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still, the input here is quite large, but it handles it in reasonable time. I suspect that if your keywords file were larger (mine has only 3 words) the slowdown would start to show.
Here we take an input file, iterate over the lines, remove punctuation, then split by spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together, and then iterate over it creating a new list containing the string and a count. We can do this by incrementing the count as long as the same word appears in the list and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here, and the keyword file can be cobbled together with just a few words in a file, one per line.
Python 3 code; it indicates where applicable how to modify for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative:
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list, add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent the need to break out of a nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0

# open the keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions. It's quicker and less code.
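For what it's worth, a minimal sketch of that binary-search variant (still no imports; it relies on counts being sorted by word, which the words.sort() above guarantees):
def findword_binary(clist, word):
    # standard binary search over [word, count] pairs sorted by word
    lo, hi = 0, len(clist)
    while lo < hi:
        mid = (lo + hi) // 2
        if clist[mid][0] < word:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(clist) and clist[lo][0] == word:
        return clist[lo][1]
    return 0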

Read the next word in a file in python

I am looking for some words in a file in Python. After I find each word, I need to read the next two words from the file. I've looked for a solution, but I could not find one for reading just the next words.
# offsetFile - file pointer
# searchTerms - list of words
for line in offsetFile:
    for word in searchTerms:
        if word in line:
            pass  # here get the next two terms after the word
Thank you for your time.
Update: Only the first appearance is necessary. Actually only one appearance of the word is possible in this case.
file:
accept 42 2820 access 183 3145 accid 1 4589 algebra 153 16272 algem 4 17439 algol 202 6530
word: ['access', 'algebra']
While searching the file, when I encounter 'access' and 'algebra' I need the values 183 3145 and 153 16272, respectively.
An easy way to deal with this is to read the file using a generator that yields one word at a time from the file.
def words(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word
Then to find the word you're interested in and read the next two words:
with open("offsetfile.txt") as wordfile:
wordgen = words(wordfile)
for word in wordgen:
if word in searchterms: # searchterms should be a set() to make this fast
break
else:
word = None # makes sure word is None if the word wasn't found
foundwords = [word, next(wordgen, None), next(wordgen, None)]
Now foundwords[0] is the word you found, foundwords[1] is the word after that, and foundwords[2] is the second word after it. If there aren't enough words, then one or more elements of the list will be None.
It is a little more complex if you want to force this to match only within one line, but usually you can get away with considering the file as just a sequence of words.
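If you did need the match and the two following words to come from the same line, a hedged per-line sketch of the same idea (reusing searchterms from above):
foundwords = None
with open("offsetfile.txt") as wordfile:
    for line in wordfile:
        tokens = line.split()
        for i, token in enumerate(tokens):
            if token in searchterms:
                # take the matched word plus the next two tokens on this line
                foundwords = tokens[i:i + 3]
                break
        if foundwords:
            break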
If you need to retrieve only the first two words, just do:
offsetFile.readline().split()[:2]
word = '3'   # Your word
delim = ','  # Your delim

with open('test_file.txt') as f:
    for line in f:
        if word in line:
            s_line = line.strip().split(delim)
            two_words = (s_line[s_line.index(word) + 1],
                         s_line[s_line.index(word) + 2])
            break
def searchTerm(offsetFile, searchTerms):
    # remove any found words from this list; if empty we can exit
    searchThese = searchTerms[:]
    for line in offsetFile:
        words_in_line = line.split()
        # Use this list comprehension if two numbers always follow a word.
        # Else use words_in_line.
        for word in [w for i, w in enumerate(words_in_line) if i % 3 == 0]:
            # No more words to search.
            if not searchThese:
                return
            # Search remaining words.
            if word in searchThese:
                searchThese.remove(word)
                i = words_in_line.index(word)
                print words_in_line[i:i+3]
For 'access', 'algebra' I get this result:
['access', '183', '3145']
['algebra', '153', '16272']

Scan through txt, append certain data to an empty list in Python

I have a text file that I am reading in Python. I'm trying to extract certain elements from the text file that follow keywords, to append them into empty lists. The file looks like this:
So I want to make two empty lists:
1st list will append the sequence names
2nd list will be a list of lists which will be in the format [Bacteria, Phylum, Class, Order, Family, Genus, Species]
Most of the organisms will be Uncultured bacterium. I am trying to add the Uncultured bacterium with the following IDs that are separated by ;
Is there any way to scan for a certain word and, when the word is found, take the word that comes after it [separated by a '\t']?
I need it to create a dictionary of the Sequence Name to be translated to the taxonomic data.
I know I will need an empty list to append the names to:
seq_names = []
a second list to put the taxonomy lists into:
taxonomy = []
and a third list that will be reset after every iteration:
temp = []
I'm sure it can be done in Biopython, but I'm working on my Python skills.
Yes, there is a way.
You can split the string which you get from reading the file into an array using the built-in function split. From this you can find the index of the word you are looking for, and then use this index plus one to get the word after it. For example, using a text file called test.txt that looks like so (the formatting is a bit weird because SO doesn't seem to like hard tabs):
one two three four five six seven eight nine
The following code
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
ind = words.index('seven')
desired = words[ind+1]
will return desired as 'eight'
Edit: To return every following word in the list
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
This is using list comprehensions. It enumerates the list of words and, if a word is what you are looking for, includes the word at the next index in the list.
Edit2: To split it on both new lines and tabs you can use regular expressions
import re
f = open('testtest.txt','r')
string = f.read()
words = re.split('\t|\n',string)
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
It sounds like you might want a dictionary indexed by sequence name. For instance,
my_data = {
    'some_sequence': [Bacteria, Phylum, Class, Order, Family, Genus, Species],
    'some_other_sequence': [Bacteria, Phylum, Class, Order, Family, Genus, Species]
}
Then, you'd just access my_data['some_sequence'] to pull up the data about that sequence.
To populate your data structure, I would just loop over the lines of the files, .split('\t') to break them into "columns" and then do something like my_data[the_row[0]] = [the_row[10], the_row[11], the_row[13]...] to load the row into the dictionary.
So,
for row in inp_file.readlines():
    row = row.split('\t')
    my_data[row[0]] = [row[10], row[11], row[13], ...]
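Putting it together, a minimal self-contained sketch (the filename sequences.txt, the tab-separated layout, and the column positions are assumptions for illustration; adjust the indices to match your actual file):
my_data = {}
with open('sequences.txt') as inp_file:  # hypothetical input file
    for row in inp_file:
        cols = row.rstrip('\n').split('\t')
        # cols[0] is assumed to hold the sequence name; the slice is assumed
        # to hold [Bacteria, Phylum, Class, Order, Family, Genus, Species]
        my_data[cols[0]] = cols[1:8]

# look up the taxonomy for one sequence
print(my_data.get('some_sequence'))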
