Count word in text file - Python

I have a text file in which I want to count occurrences of the word "quack".
Example of the text file, named "quacker.txt":
This is the textfile quack.
Oh, and how quack did quack do in his exams back in 2009?\n Well, he passed with nine P grades and one B.\n He says that quack he wants to go to university in the\n future but decided to try and make a career on YouTube before that Quack....\n So, far, it’s going very quack well Quack!!!!
So here I want 7 as the output.
readf = open("quacker.txt", "r")
lst = []
for x in readf:
    lst.append(str(x).rstrip('\n'))
readf.close()
# above gives a list of each row.
cv = 0
for i in lst:
    if "quack" in i.strip():
        cv += 1
The above only counts "quack" once per element of the list, even if it appears several times.

Well if the file isn't too long, you could try:
with open('quacker.txt') as f:
    text = f.read().lower()  # make it all lowercase so the count works below
quacks = text.count('quack')
As @PadraicCunningham mentioned in the comments, this would also count the 'quack' in
words like 'quacks' or 'quacking'. But if that's not an issue, then this is fine.
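To see the substring behaviour concretely, here's a quick sketch on a made-up string:

```python
text = "quack quacks quacking"
# str.count matches substrings, so every word here contributes one hit
print(text.count('quack'))
```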

You're incrementing by one if the line contains the string, but what if the line has several occurrences of 'quack'?
Try:
for line in lst:
    for word in line.split():
        if 'quack' in word:
            cv += 1

You need to lower, strip and split to get an accurate count:
from string import punctuation

with open("test.txt") as f:
    quacks = sum(word.lower().strip(punctuation) == "quack"
                 for line in f for word in line.split())
print(quacks)
7
You need to split each line of the file into individual words or you will get false positives using in or count. word.lower().strip(punctuation) lowercases each word and removes any punctuation; sum adds up all the times word.lower().strip(punctuation) == "quack" is True.
In your own code, x is already a string, so calling str(x) is unnecessary. You could also just check each line the first time you iterate; there is no need to add the strings to a list and then iterate a second time. The reason you only get one match returned is most likely that all the data is actually on a single line. You are also comparing quack to Quack, which will not match; you need to lowercase the string.
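Putting those points together, here's a single-pass sketch of the corrected approach (the sample lines below are a made-up stand-in for quacker.txt; note this still counts substrings such as 'quacks'):

```python
# Hypothetical lines standing in for the contents of quacker.txt
lines = ["This is the textfile quack.\n",
         "Oh, and how quack did quack do? Quack!\n"]

cv = 0
for x in lines:  # x is already a str, so str(x) is unnecessary
    cv += x.lower().count("quack")  # lowercase so 'Quack' matches too
print(cv)
```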

Related

Read in a file and print only the last words of each line, if these words are 8 characters or longer and don't contain the #, @ or : symbols

So far I can write the code to filter out words that are less than 8 characters long, and also the words that contain the #, @ or : symbols. However, I can't figure out how to get just the last words. My code looks like this so far.
f = open("file.txt").read()
for words in f.split():
    if len(words) >= 8 and not "#" in words and not "@" in words and not ":" in words:
        print(words)
Edit - sorry, I'm pretty new to this, so I've probably done something wrong above as well. The file is quite long, so I'll give the first line and the expected output. The first line is:
"I wish they would show out takes of Dick Cheney #GOPdebates Candidates went after #HillaryClinton 32 times in the #GOPdebate-but remained"
The expected output is "remained"; however, my code outputs "Candidates" and "remained".
for line in open(filename):
    if some_test(line):
        do_rad_thing(line)
I think this is what you want ... you have the some_test part and the do_rad_thing part.
I think this works: you can open the file with readlines, split each line with split(), then get the last word using [-1].
f = open("file.txt").readlines()
for line in f:
    last_word = line.split()[-1]
This should accomplish what you are trying to do.
Split the words of the file into an array using .split() and then access the last value using [-1]. I also put all the illegal characters into an array and did a check to see if any of the chars in the illegal_chars array are in last_word.
f = open("file.txt").read()
illegal_chars = ["#", "@", ":"]
last_word = f.split()[-1]
if len(last_word) >= 8 and not any(c in last_word for c in illegal_chars):
    print(last_word)
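Putting the pieces together, here's a self-contained sketch that checks the last word of each line; the sample line is taken from the question, while the file name and its single-line contents are stand-ins:

```python
# Write a hypothetical sample file, mimicking the question's data
with open("file.txt", "w") as f:
    f.write("I wish they would show out takes of Dick Cheney "
            "#GOPdebates Candidates went after #HillaryClinton "
            "32 times in the #GOPdebate-but remained\n")

illegal_chars = ["#", "@", ":"]
results = []
with open("file.txt") as f:
    for line in f:
        words = line.split()
        if not words:  # skip blank lines
            continue
        last_word = words[-1]
        if len(last_word) >= 8 and not any(c in last_word for c in illegal_chars):
            results.append(last_word)
print(results)
```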

Python: Appending string constructed out of multiple lines to list

I'm trying to parse a txt file and put sentences in a list that fit my criteria.
The text file consists of several thousand lines, and I'm looking for lines that start with a specific string; let's call this string 'start'.
The lines in this text file can belong together and are somewhat randomly separated with \n.
This means I have to look for any string that starts with 'start', put it in an empty string 'complete', and then continue scanning each line after that to see if it also starts with 'start'.
If not, then I need to append it to 'complete', because then it is part of the entire sentence. If it does, I need to append 'complete' to a list, create a new, empty 'complete' string, and start appending to that one. This way I can loop through the entire text file without paying attention to the number of lines a sentence consists of.
My code thus far:
import sys, string
lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''
with open(sys.argv[1]) as f:
    data = f.read()
    for line in data:
        if line.lower().startswith(startswith):
            completeline = line
        else:
            completeline += line
    lines_1.append(completeline)

# check some stuff in output
for l in lines_1:
    print "______"
    print l
print len(lines_1)
However, this puts the entire content in one item in the list, whereas I'd like everything to be separated.
Keep in mind that one sentence can span one, two, 10 or 1000 lines, so the code needs to spot the next startswith value, append the existing completeline to the list, and then fill completeline up with the next sentence.
Much obliged!
Two issues:
Iterating over a string, not lines:
When you iterate over a string, the value yielded is a character, not a line. This means for line in data: is going character by character through the string. Split your input by newlines, returning a list, which you then iterate over. e.g. for line in data.split('\n'):
Overwriting completeline inside the loop:
You append a completed line at the end of the loop, but not when you start recording a new line inside the loop. Change the if in the loop to something like this:
if line.lower().startswith(startswith):
    if completeline:
        lines_1.append(completeline)
    completeline = line
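Combined, a runnable sketch of the whole loop (the data string is a made-up stand-in for f.read()):

```python
lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''

# Hypothetical file contents
data = "keys one\nand more\nvalues two\ntotal three\n"

for line in data.split('\n'):
    if line.lower().startswith(startswith):
        if completeline:  # flush the previous sentence
            lines_1.append(completeline)
        completeline = line
    else:
        completeline += line
if completeline:  # don't forget the last sentence
    lines_1.append(completeline)

print(lines_1)
```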
For a task like this -
"I'm trying to parse a txt file and put sentences in a list that fit my criteria"
- I usually prefer using a dictionary, for example:
from collections import defaultdict

def satisfiesCriteria(criteria, sentence):
    if sentence.lower().startswith(criteria):
        return True
    return False

seperatedItems = defaultdict(list)
for sentence in fileDataAsAList:
    if satisfiesCriteria("start", sentence):
        seperatedItems["start"].append(sentence)
Something like this should suffice. The code is just to give you an idea of what you might like to do: you can have a list of criteria and loop over them, which will add sentences related to the different criteria into the dictionary, something like this:
mycriterias = ['start', 'begin', 'whatever']
for criteria in mycriterias:
    for sentence in fileDataAsAList:
        if satisfiesCriteria(criteria, sentence):
            seperatedItems[criteria].append(sentence)
mind the spellings :p

How to count number of words that start with a string

I'm trying to write code that counts prefixes, suffixes, and roots.
All I need to know is how to count the number of words that start or end with a certain string, such as 'co'.
This is what I have so far.
SWL = open('mediumWordList.txt').readlines()
for x in SWL:
    x.lower
    if x.startswith('co'):
        a = x.count(x)
        while a == True:
            a =+ 1
            print a
All I get from this is an infinite loop of ones.
First of all, as a more Pythonic way of dealing with files, you can use a with statement to open the file, which closes the file automatically at the end of the block.
Also, you don't need to use the readlines method to load all the lines into memory; you can simply loop over the file object.
As for counting the words: you need to split your lines into words, then use str.startswith and str.endswith to count the words based on your conditions.
So you can use a generator expression within the sum function to count your words:
with open('mediumWordList.txt') as f:
    sum(1 for line in f for word in line.split() if word.startswith('co'))
Note that we need to split the line to access the words; if you don't split the lines, you'll loop over all the characters of the line.
As suggested in the comments, a more Pythonic way is the following approach (it relies on each True counting as 1 in the sum):
with open('mediumWordList.txt') as f:
    sum(word.startswith('co') for line in f for word in line.split())
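A self-contained sketch of the same idea, using both str.startswith and str.endswith (the sample lines are made up):

```python
# Hypothetical lines standing in for the file's contents
lines = ["cold coffee comes fast", "no cocoa today"]

starts = sum(word.startswith('co') for line in lines for word in line.split())
ends = sum(word.endswith('co') for line in lines for word in line.split())
print(starts, ends)
```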
You could try using the Counter class from collections. For example, to count 'Foo' in Bar.txt:
from collections import Counter

with open('Bar.txt') as barF:
    words = [word for line in barF.readlines() for word in line.split()]
c = Counter(words)
print c['Foo']

Counting Hashtag

I'm writing a function called HASHcount(name, list), which receives 2 parameters. The name one is the name of the file to be analyzed, a text file structured like this:
Date|||Time|||Username|||Follower|||Text
So basically my input is a list of tweets, with several rows structured as above. The list parameter is a list of hashtags I want to count in that text file. I want my function to check how many times each word of the given list occurred in the tweets list, and give as output a dictionary with each word's count, even if the word is missing.
For instance, with the call HASHcount(December, [Peace, Love]) the program should give as output a dictionary made by checking how many times the words Peace and Love have been used as hashtags in the Text field of each tweet in the file called December.
Also, in the dictionary the words have to be without the hashtag symbol.
I'm stuck on this function. I'm at this point, but I'm having some issues concerning the dictionary:
def HASHcount(name, list):
    f = open(name, "r")
    dic = {}
    l = f.readline()
    for word in list:
        dic[word] = 0
        for line in f:
            li_lis = line.split("|||")
            li_tuple = tuple(li_lis)
            if word in li_tuple[4]:
                dic[word] = dic[word] + 1
    return dic
The main issue is that you are iterating over the lines in the file for each word, rather than the reverse. Thus the first word will consume all the lines of the file, and each subsequent word will have 0 matches.
Instead, you should do something like this:
def hash_count(name, words):
    dic = {word: 0 for word in words}
    with open(name) as f:
        for line in f:
            line_text = line.split('|||')[4]
            for word in words:
                # Check if word appears as a hashtag in line_text
                # If so, increment the count for word
                pass
    return dic
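One minimal way to fill in the commented check - a sketch that, like any plain substring test, can still over-count partial matches such as '#Peaceful' (the sample row below is made up):

```python
def hash_count_lines(lines, words):
    # Same shape as above, but operating on an in-memory list of rows
    dic = {word: 0 for word in words}
    for line in lines:
        line_text = line.split('|||')[4]
        for word in words:
            if '#' + word in line_text:  # naive substring check
                dic[word] += 1
    return dic

# Hypothetical rows in the Date|||Time|||Username|||Follower|||Text format
rows = ["2014-12-01|||12:00|||bob|||10|||merry #Peace to all\n"]
print(hash_count_lines(rows, ["Peace", "Love"]))
```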
There are several issues with your code, some of which have already been pointed out, while others (e.g. concerning the identification of hashtags in a tweet's text) have not. Here's a partial solution not covering the fine points of the latter issue:
def HASHcount(name, words):
    dic = dict.fromkeys(words, 0)
    with open(name, "r") as f:
        for line in f:
            for w in words:
                if '#' + w in line:
                    dic[w] += 1
    return dic
This offers several simplifications, keyed on the fact that hashtags in a tweet do start with # (which you don't want in the dic) -- as a result it's not worth splitting each line into fields, since the # cannot be present except in the text.
However, it still has a fraction of a problem seen in other answers (except the one which just commented out this most delicate of parts!-) -- it can get false positives from partial matches. If the check were just word in linetext, the problem would be huge -- e.g. if a word is cat, it gets counted as a hashtag even when present in perfectly ordinary text (on its own or as part of another word, e.g. vindicative). With the '#' + approach it's a bit better, but prefix matches would still lead to a false positive: e.g. #catalog would erroneously be counted as a hit for cat.
As some suggested, regular expressions can help with that. However, here's an alternative for the body of the for w in words loop...
for w in words:
    where = line.find('#' + w)
    if where == -1: continue
    after = line[where + len(w) + 1]
    if after in chars_acceptable_in_hashes: continue
    dic[w] += 1
The only issue remaining is to determine which characters can be part of hashtags, i.e. the set chars_acceptable_in_hashes -- I haven't memorized Twitter's specs, so I don't know it offhand, but surely you can find out. Note that this works at the end of a line, too, because line has not been stripped, so it's known to end with a \n, which is not in the acceptable set (so a hashtag at the very end of the line will be "properly terminated" too).
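Here's a runnable sketch of that loop, with an assumed definition of chars_acceptable_in_hashes (letters, digits and underscore - a guess, not Twitter's actual spec) and made-up sample tweets:

```python
import string

# Assumption: hashtag characters are letters, digits and underscore
chars_acceptable_in_hashes = set(string.ascii_letters + string.digits + '_')

def hash_count(lines, words):
    dic = dict.fromkeys(words, 0)
    for line in lines:
        for w in words:
            where = line.find('#' + w)
            if where == -1:
                continue
            after = line[where + len(w) + 1]  # lines keep their '\n', so this index exists
            if after in chars_acceptable_in_hashes:
                continue  # partial match, e.g. '#catalog' for 'cat'
            dic[w] += 1
    return dic

tweets = ["I love my #cat\n", "new #catalog out\n", "#cat and #dog\n"]
print(hash_count(tweets, ["cat", "dog"]))
```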
I like using the collections module. This worked for me.
from collections import defaultdict

def HASHcount(file_to_open, lst):
    with open(file_to_open) as my_file:
        my_dict = defaultdict(int)
        for line in my_file:
            line = line.split('|||')
            txt = line[4].strip(" ")
            if txt in lst:
                my_dict[txt] += 1
    return my_dict

python -- trying to count the length of the words from a file with dictionaries

def myfunc(filename):
    filename = open('hello.txt', 'r')
    lines = filename.readlines()
    filename.close()
    lengths = {}
    for line in lines:
        for punc in ".,;'!:&?":
            line = line.replace(punc, " ")
        words = line.split()
        for word in words:
            length = len(word)
            if length not in lengths:
                lengths[length] = 0
            lengths[length] += 1
    for length, counter in lengths.items():
        print(length, counter)
    filename.close()
Use Counter (or its backported equivalent on Python < 2.7).
You are counting the frequency of words in a single line:
for line in lines:
    for word in length.keys():
        print(wordct, length)
Here length is a dict of all distinct words plus their frequency, not their length:
length.get(word, 0) + 1
so you probably want to replace the above with:
for line in lines:
    ....
# keep this at this indentation - will have a very large dict, but of all words
for word in sorted(length.keys(), key=lambda x: len(x)):
    # word, freq, length
    print(word, length[word], len(word), "\n")
I would also suggest:
Don't bring the file into memory like that; file objects and handlers are now iterators and well optimised for reading from files.
Drop the wordct and so on in the main lines loop.
Rename length to something else - perhaps words or dict_words.
Errr, maybe I misunderstood - are you trying to count the number of distinct words in the file (in which case use len(length.keys())), or the length of each word in the file, presumably ordered by length...?
The question has been more clearly defined now, so this replaces the above answer.
The aim is to get a frequency of word lengths throughout the whole file.
I would not even bother going line by line, but use something like:
text = open(file).read()
d_freq = {}
st = -1
while True:
    next_space_index = text.find(" ", st + 1)
    if next_space_index == -1:
        break
    word_len = next_space_index - st - 1
    if word_len:  # skip runs of consecutive spaces
        d_freq[word_len] = d_freq.get(word_len, 0) + 1
    st = next_space_index
print d_freq
I think that will work; note it does not count the final word unless the text ends with a space. HTH
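For comparison, the whole word-length frequency can also be done with collections.Counter, stripping punctuation as in the original function (the sample text is made up):

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for the file's contents
text = "hello there; you, there!"

lengths = Counter(len(word.strip(punctuation)) for word in text.split())
print(dict(lengths))
```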
