Hi, so I have 2 text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists with each word and its count in the file.
My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result, without using any imports, dict, or zips.
I am stuck on how to go about this second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency.
I played around with the idea of .find(), but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keyword file, but not in the first text file:
def calculateFrequenciesTest(aString):
    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList
EDIT: Currently toying around with the idea of reopening my first file, but this time without turning it into a list. I think my errors are somehow coming from it counting the frequency of my list of lists. These are the types of results I am getting:
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
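For reference, that nested shape is exactly what happens when the [word, count] pairs produced by the first pass are fed back through the counting function as if they were plain words. Here is a minimal sketch of the missing second step, assuming the word/count pairs from the first file are already in a list called wordCountList and the keywords from the second file are in a list called keywords (both names are placeholders); it uses no imports, dict, or zip:

def countKeywords(wordCountList, keywords):
    # wordCountList is a list of [word, count] pairs built from the first file;
    # keywords is a plain list of keyword strings from the second file.
    keywordCounts = []
    for keyword in keywords:
        found = 0
        for pair in wordCountList:
            if pair[0] == keyword:
                found = pair[1]  # reuse the count from the first pass
                break
        keywordCounts.append([keyword, found])
    return keywordCounts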
You can try something like:
I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it to how you like or refactor it as you wish.
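To tie this back to the keyword part of your question, you can then look each keyword up through its position in frequency_word. A small sketch, assuming the keywords from your second file have already been read into a list (the keywords list below is just a placeholder):

keywords = ['hello', 'missing']  # placeholder; read these from your second file

for kw in keywords:
    if kw in frequency_word:
        ind = frequency_word.index(kw)
        print(kw, frequency_list[ind])
    else:
        print(kw, 0)  # the keyword never appears in the word list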
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more pythonic. Code written this way may look unfamiliar, and can take time getting used to. Yet, you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zips?
So here's a proposal utilizing built-in functionality (no third party) for what it's worth (tested with Python 2):
#!/usr/bin/python

import re           # String matching
import collections  # collections.Counter basically solves your problem


def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()


def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only characters and ' are allowed in strings.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())


# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count every word in sourcefile_words (keywords included)
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"]  # returns 2
count_a = wordcount_all["a"]        # returns 3

# Only count words that are in the keywords set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)

count_and = wordcount_keywords["and"]             # returns 2
all_counted_keywords = wordcount_keywords.keys()  # returns ['a', 'and', 'the', 'of']
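If you also want the per-keyword table the original question asks for, you can iterate over the keyword set and look each one up in the Counter (a Counter simply returns 0 for keys it has never seen). A small follow-up sketch:

# Print every keyword with its frequency in the source text.
for kw in sorted(keywords):
    print("%s: %d" % (kw, wordcount_keywords[kw]))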
Here is a solution with no imports. It uses nested linear searches which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still, the input here is quite large, but it handles it in reasonable time. I suspect that if your keywords file were larger (mine has only 3 words), the slowdown would start to show.
Here we take an input file, iterate over the lines, remove punctuation, then split on spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together, then iterate over it creating a new list containing each string and a count. We can do this by incrementing the count as long as the same word appears in the list, and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is a copy of Hamlet (hamlet.txt in the code below), and the keyword file can be cobbled together with just a few words in a file, one per line.
This is Python 3 code; it indicates where applicable how to modify it for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0

# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions. It's quicker and less code.
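In case it helps, here is a rough sketch of what that binary-search variant of findword might look like; it relies on counts being sorted by word, which it already is because the word list was sorted before the counts were built (and it still needs no imports):

def findword_binary(clist, word):
    # clist is sorted by word because words.sort() ran before counts was built
    lo, hi = 0, len(clist) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if clist[mid][0] == word:
            return clist[mid][1]
        elif clist[mid][0] < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return 0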
I'd like to help my young child broaden his vocabulary. The plan is to parse the dictionary (in this case, macOS), append the words to a list, and if those list items meet specific criteria they are added to another list... perhaps a little messier than it needs to be!
I'd like to see just five randomly chosen words printed. I've managed most of it already, but I get an error when trying to pick a random item to show...
IndexError: list index out of range
And the code thus far...
import random

word_file = "/usr/share/dict/words"
WORDS = open(word_file).read().splitlines()

for x in WORDS:
    myRawList = []
    myRawListWithCount = []
    # filters out words that don't start with 'a'
    if x.startswith("a"):
        myRawList.append(x)
    # word len. cannot exceed 5
    for y in myRawList:
        if (len(y)) <= 5:
            myRawListWithCount.append(y)

# the line that causes an error. Simpler/shorter lists for names etc. seem to work OK with the command.
print(random.choice(myRawListWithCount))
Try changing the scope of your lists. It's possible the error comes from accessing them the way you currently are, with two different scopes.
myRawList = []
myRawListWithCount = []

for x in WORDS:
    # filters out words that don't start with 'a'
    if x.startswith("a"):
        myRawList.append(x)
    # word len. cannot exceed 5
    for y in myRawList:
        if (len(y)) <= 5:
            myRawListWithCount.append(y)

# the line that causes an error. Simpler/shorter lists for names etc. seem to work OK with the command.
print(random.choice(myRawListWithCount))
You set up a new empty list at each step in the loop.
Also, the way you generate the second list is extremely inefficient (you re-read all the found words for each new word).
myRawList = []
myRawListWithCount = []

for x in WORDS:
    # filters out words that don't start with 'a'
    if x.startswith("a"):
        myRawList.append(x)
        # word len. cannot exceed 5
        if (len(x)) <= 5:
            myRawListWithCount.append(x)

print(random.choice(myRawListWithCount))
Example output: amino
Another optimization idea: as the dictionary file is sorted, you could break out of the loop as soon as you find a word not starting with 'a' (you would then need a separate loop to create the second list); see the sketch below.
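Roughly, that early exit could look like the following. This is only a sketch; it assumes the word file is alphabetically sorted, so once we have collected some 'a' words and then hit one that no longer starts with 'a', nothing later in the file can match:

myRawList = []
for x in WORDS:
    if x.startswith("a"):
        myRawList.append(x)
    elif myRawList:
        # we are past the 'a' section of the sorted word list, so stop scanning
        break

# separate loop for the length filter
myRawListWithCount = []
for y in myRawList:
    if len(y) <= 5:
        myRawListWithCount.append(y)

print(random.choice(myRawListWithCount))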
I may have misunderstood your goal. If you just need five words chosen at random from the file, why not use a list comprehension to build up a list of all words from the file that meet your criteria, then use random.sample to pull out a sample of five words?
import random

word_file = "/usr/share/dict/words"

with open(word_file, "r") as f:
    print(random.sample([word for line in f.readlines()
                         for word in [line.strip()]
                         if word.startswith("a") and len(word) <= 5],
                        5))
The goal is to a) print a list of unique words from a text file and also b) find the longest word.
I cannot use imports in this challenge.
File handling and the main functionality do what I want; however, the list needs to be cleaned. As you can see from the output, words are getting joined with punctuation, and therefore maxLength is obviously incorrect.
with open("doc.txt") as reader, open("unique.txt", "w") as writer:
unwanted = "[],."
unique = set(reader.read().split())
unique = list(unique)
unique.sort(key=len)
regex = [elem.strip(unwanted).split() for elem in unique]
writer.write(str(regex))
reader.close()
maxLength = len(max(regex,key=len ))
print(maxLength)
res = [word for word in regex if len(word) == maxLength]
print(res)
===========
Sample:
pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]
Here's a solution that uses str.translate() to throw away all bad characters (+ newline) before we ever do the split(). (Normally we'd use a regex with re.sub(), but you're not allowed.) This makes the cleaning a one-liner, which is really neat:
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))
# We can directly read and clean the entire output, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
# cleaned_input = reader.read().translate(bad_transtable)
# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))
with open("unique.txt", "w") as writer:
for word in unique_words:
writer.write(f'{word}\n')
max_length = len(unique_words[0])
print ([word for word in unique_words if len(word) == max_length])
Notes:
since the input is already 100% cleaned and split, no need to append to a list/insert to a set as we go, then have to make another cleaning pass later. We can just create unique_words directly! (using set() to keep only uniques). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w)) to sort it in decreasing length. Only need to call sort() once. And no iterative-append to lists.
hence we guarantee that max_length = len(unique_words[0])
this approach is also going to be more performant than nested loops (for line in <lines>: for word in line.split(): ...) with an iterative append() to a word list
no need to do explicit writer/reader.open()/.close(), that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
you could also merge the printing of the max_length words inside the writer loop. But it's cleaner code to keep them separate.
note we use f-string formatting f'{word}\n' to add the newline back when we write() an output line
in Python we use lower_case_with_underscores for variable names, hence max_length not maxLength. See PEP8
in fact here, we don't strictly need a with-statement for the reader, if all we're going to do is slurp the file's entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines at a time.)
str.maketrans() is a builtin, but if your teacher objects to the module reference, you can also call it on a bound string e.g. ' '.maketrans()
str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory; regex on Unicode is easier, since you can define entire character classes (see the sketch below).
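For comparison only (remember the challenge forbids imports), the regex version of the cleaning step mentioned in that last note could look roughly like this:

import re

# Replace every character that is not a word character (letter, digit, underscore)
# or apostrophe with a space, then the usual split() gives clean words.
dirty_input = open('doc.txt').read()
cleaned_input = re.sub(r"[^\w']", " ", dirty_input)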
Alternative solution if you don't yet know str.translate()
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input

# If you can't use either 're.sub()' or 'str.translate()', have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this; it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, or got an IOError, or unique_words wasn't a list, etc.:
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
Here is another solution without any functions.
bad = '`~@#$%^&*()-_=+[]{}\|;\':\".>?<,/?'

a = open('doc.txt').read()  # the raw text read from the input file

clean = ' '
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '

cleans = [i for i in clean.split(' ') if len(i)]

clean_uniq = list(set(cleans))
clean_uniq.sort(key=len)

print(clean_uniq)
print(len(clean_uniq[-1]))
Here is a solution. The trick is to use the Python str method .isalpha() to filter out non-alphabetic characters.
with open("unique.txt", "w") as writer:
with open("doc.txt") as reader:
cleaned_words = []
for line in reader.readlines():
for word in line.split():
cleaned_word = ''.join([c for c in word if c.isalpha()])
if len(cleaned_word):
cleaned_words.append(cleaned_word)
# print unique words
unique_words = set(cleaned_words)
print(unique_words)
# write words to file? depends what you need here
for word in unique_words:
writer.write(str(word))
writer.write('\n')
# print length of longest
print(len(sorted(unique_words, key=len, reverse=True)[0]))
I am iterating through hundreds of thousands of words in several documents, looking to find the frequencies of contractions in English. I have formatted the documents appropriately, and it's now a matter of writing the correct function and storing the data properly. I need to store information for each document on which contractions were found and how frequently they were used in the document. Ideally, my data frame would look something like the following:
filename contraction count
file1 it's 34
file1 they're 13
file1 she's 9
file2 it's 14
file2 we're 15
file3 it's 4
file4 it's 45
file4 she's 13
How can I best go about this?
Edit: Here's my code, thus far:
for i in contractions_list:  # for each of the 144 contractions in my list
    for l in every_link:  # for each speech
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if i in contractions:
                count = count + 1
Where processURL_short() is a function I wrote that scrapes a website and returns a speech as a str.
Edit2:
link_store = {}
for i in contractions_list_test:  # for each of the 144 contractions
    for l in every_link_test:  # for each speech
        link_store[l] = {}
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if word == i:
                count = count + 1
        if count: link_store[l][i] = count
        print i, l, count
Here's my file-naming code:
splitlink = l.split("/")
president = splitlink[4]
speech_num = splitlink[-1]
filename = "{0}_{1}".format(president,speech_num)
Opening and reading are slow operations: don't cycle through the entire file list 144 times.
Exceptions are slow: throwing an exception for every non-contraction in every speech will be ponderous.
Don't cycle through your list of contractions checking against words. Instead, use the built-in in operator to see whether a word is on the contraction list, and then use a dictionary to tally the entries, just as you might do by hand.
Go through the files, word by word. When you see a word on the contraction list, see whether it's already on your tally sheet. If so, add a mark; if not, add it to the sheet with a count of 1.
Here's an example. I've made very short speeches and a trivial processURL_short function.
def processURL_short(string):
    return string.lower()

every_link = [
    "It's time for going to Sardi's",
    "We're in the mood; it's about DST",
    "They're he's it's don't",
    "I'll be home for Christmas"]

contraction_list = [
    "it's",
    "don't",
    "can't",
    "i'll",
    "he's",
    "she's",
    "they're"
]

for l in every_link:  # for each speech
    contraction_count = {}
    content = processURL_short(l)
    for word in content.split():
        if word in contraction_list:
            if word in contraction_count:
                contraction_count[word] += 1
            else:
                contraction_count[word] = 1
    for key, value in contraction_count.items():
        print key, '\t', value
you can have your structure set up like this:
links = {}
for l in every_link:
    links[l] = {}
    for i in contractions_list:
        count = 0
        ...  # here is where you do your count, which you seem to know how to do
        ...  # note that in your code, i think you meant if i in word / if i == word for your final if statement
        if count: links[l][i] = count  # only adds the value if count is not 0
you would end up with a data structure like this:
links = {
    'file1': {
        "it's": 34,
        "they're": 14,
        ...,
    },
    'file2': {
        ....,
    },
    ...,
}
which you could easily iterate through to write the necessary data to your file (which I again assume you know how to do, since it's seemingly not part of the question). A possible sketch of that last step is below.
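Purely as an illustration of that final write step (the tab-separated layout and the results.tsv filename are my own assumptions, not something from the question):

# Write the nested dict as tab-separated rows: filename, contraction, count.
# 'results.tsv' is an assumed output name; links is the structure built above.
with open('results.tsv', 'w') as out:
    out.write('filename\tcontraction\tcount\n')
    for filename, counts in links.items():
        for contraction, count in counts.items():
            out.write('%s\t%s\t%d\n' % (filename, contraction, count))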
Dictionaries seem to be the best option here, because they will allow you easier manipulation of your data. Your goal should be indexing results by the filename extracted from the link (the URL to your speech text) to a mapping of contraction to its count.
Something like:
{"file1": {"it's": 34, "they're": 13, "she's": 9},
"file2": {"it's": 14, "we're": 15},
"file3": {"it's": 4},
"file4": {"it's": 45, "she's": 13}}
Here's the full code:
ret = {}
for link, text in ((l, processURL_short(l)) for l in every_link):
    contractions = {c: 0 for c in contractions_list}
    for word in text.split():
        try:
            contractions[word] += 1
        except KeyError:
            # Word or contraction not found.
            pass
    ret[file_naming_code(link)] = contractions
Let's go into each step.
First we initialize ret; it will be the resulting dictionary. Then we use a generator expression to perform processURL_short() for each step (instead of going through the whole link list at once). It yields tuples of (<link-name>, <speech-text>) so we can use the link name later.
Next, contractions is the contraction-count mapping, initialized to 0s; it will be used to count contractions.
Then we split the text into words; for each word we look it up in the contractions mapping, and if it is found we count it, otherwise a KeyError is raised for each key not found. (Another answer pointed out that this will perform poorly; another possibility is checking membership with in, like word in contractions, as sketched below.)
Finally:
ret[file_naming_code(link)] = contractions
Now ret is a dictionary mapping each filename to its contraction occurrences, and you can easily create your table from it. Here's how you would get your output:
print '\t'.join(('filename', 'contraction', 'count'))
for link, counts in ret.items():
    for name, count in counts.items():
        print '\t'.join((link, name, str(count)))  # count is an int, so convert it for join()
I am struggling with a small program in Python which aims at counting the occurrences of a specific set of characters in the lines of a text file.
As an example, if I want to count '!' and '@' from the following lines
hi!
hello@gmail.com
collection!
I'd expect the following output:
!;2
@;1
So far I have functional code, but it's inefficient and does not use the potential that Python libraries have.
I have tried using collections.Counter, with limited success. The efficiency blocker I found is that I couldn't select a specific set of characters with counter.update(); all the other characters found were also counted. Then I would have to filter out the characters I am not interested in, which adds another loop...
I also considered regular expressions, but I can't see an advantage in this case.
This is the functional code I have right now (the simplest idea I could imagine), which looks for special characters in the file's lines. I'd like to see if someone can come up with a neater, more Python-specific idea:
def count_special_chars(filename):
    special_chars = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ')
    dict_count = dict(zip(special_chars, [0] * len(special_chars)))
    with open(filename) as f:
        for passw in f:
            for c in passw:
                if c in special_chars:
                    dict_count[c] += 1
    return dict_count
thanks for checking
Why not count the whole file all at once? You should avoid looping through the string for each line of the file. Use str.count() instead.
from pprint import pprint

# Better coding style: put constant out of the function
SPECIAL_CHARS = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '

def count_special_chars(filename):
    with open(filename) as f:
        content = f.read()
    return dict([(i, content.count(i)) for i in SPECIAL_CHARS])

pprint(count_special_chars('example.txt'))
example output:
{' ': 0,
 '!': 2,
 '.': 1,
 '@': 1,
 '[': 0,
 '~': 0,
 # the remaining keys with a value of zero are ignored
 ...}
Eliminating the extra counts from collections.Counter is probably not significant either way, but if it bothers you, do it during the initial iteration:
from collections import Counter

special_chars = '''!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~ '''

found_chars = [c for c in open(yourfile).read() if c in special_chars]
counted_chars = Counter(found_chars)
You need not process the file contents line by line, and you should avoid nested loops, which increase the complexity of your program.
If you want to count character occurrences in some string, first loop over the entire string to construct an occurrence dict. Then you can look up the occurrences of any character in the dict. This reduces the complexity of the program.
When constructing the occurrence dict, defaultdict helps you initialize the count values.
A refactored version of the program is below:
from collections import defaultdict

special_chars = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ')

dict_count = defaultdict(int)
with open(filename) as f:  # filename as in the question's function
    for c in f.read():
        dict_count[c] += 1

for c in special_chars:
    print('{0};{1}'.format(c, dict_count[c]))
ref. defaultdict Examples: https://docs.python.org/3.4/library/collections.html#defaultdict-examples
I did something like this, where you do not need to use the Counter library. I used it to count all the special characters together, but you can adapt it to put the counts in a dict; see the sketch after the code.
import re

def countSpecial(passwd):
    # special_chars is the list of special characters from the question
    specialcount = 0
    for special in special_chars:
        length = 0
        #print special
        length = len(re.findall(r'(\%s)' % special, passwd))
        if length > 0:
            #print length, special
            specialcount = length + specialcount
    return specialcount
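If you do want per-character counts rather than a single total, one way to adapt it (just a sketch, reusing the question's special_chars list) is:

import re

def count_each_special(special_chars, passwd):
    # Build a {character: count} dict instead of a single total.
    counts = {}
    for special in special_chars:
        counts[special] = len(re.findall(re.escape(special), passwd))
    return counts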
I am looking for some words in a file in Python. After I find each word, I need to read the next two words from the file. I've looked for a solution, but I could not find one for reading just the next words.
# offsetFile - file pointer
# searchTerms - list of words

for line in offsetFile:
    for word in searchTerms:
        if word in line:
            # here get the next two terms after the word
Thank you for your time.
Update: Only the first appearance is necessary. Actually only one appearance of the word is possible in this case.
file:
accept 42 2820 access 183 3145 accid 1 4589 algebra 153 16272 algem 4 17439 algol 202 6530
word: ['access', 'algebra']
Searching the file when I encounter 'access' and 'algebra', I need the values of 183 3145 and 153 16272 respectively.
An easy way to deal with this is to read the file using a generator that yields one word at a time from the file.
def words(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word
Then to find the word you're interested in and read the next two words:
with open("offsetfile.txt") as wordfile:
wordgen = words(wordfile)
for word in wordgen:
if word in searchterms: # searchterms should be a set() to make this fast
break
else:
word = None # makes sure word is None if the word wasn't found
foundwords = [word, next(wordgen, None), next(wordgen, None)]
Now foundwords[0] is the word you found, foundwords[1] is the word after that, and foundwords[2] is the second word after it. If there aren't enough words, then one or more elements of the list will be None.
It is a little more complex if you want to force this to match only within one line, but usually you can get away with considering the file as just a sequence of words.
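For completeness, a rough per-line variant of the same idea (restricting the match and the two following words to a single line; searchterms as above):

foundwords = None
with open("offsetfile.txt") as wordfile:
    for line in wordfile:
        linewords = line.split()
        for i, word in enumerate(linewords):
            if word in searchterms:
                # take up to two words following the match, on this line only
                foundwords = [word] + linewords[i + 1:i + 3]
                break
        if foundwords:
            break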
If you need to retrieve only the first two words, just do it:
offsetFile.readline().split()[:2]
word = '3'   # Your word
delim = ','  # Your delim

with open('test_file.txt') as f:
    for line in f:
        if word in line:
            s_line = line.strip().split(delim)
            two_words = (s_line[s_line.index(word) + 1],
                         s_line[s_line.index(word) + 2])
            break
def searchTerm(offsetFile, searchTerms):
    # remove any found words from this list; if empty we can exit
    searchThese = searchTerms[:]
    for line in offsetFile:
        words_in_line = line.split()
        # Use this list comprehension if always two numbers continue a word.
        # Else use words_in_line.
        for word in [w for i, w in enumerate(words_in_line) if i % 3 == 0]:
            # No more words to search.
            if not searchThese:
                return
            # Search remaining words.
            if word in searchThese:
                searchThese.remove(word)
                i = words_in_line.index(word)
                print words_in_line[i:i+3]
For 'access', 'algebra' I get this result:
['access', '183', '3145']
['algebra', '153', '16272']