Why is this dictionary line number count not working? - python

I have this piece of code; the problem is in the last bit, starting from d = {}.
I'm trying to print each word together with the line numbers where it appears in the text, but it is not working; it only prints the words. Does anyone know why?
import sys
import string

text = []
infile = open(sys.argv[1], 'r').read()
for punct in string.punctuation:
    infile = infile.replace(punct, "")
text = infile.split("\n")

dict = open(sys.argv[2], 'r').read()
dictset = []
dictset = dict.split()

words = []
words = list(set(text) - set(dictset))
words = [text.lower() for text in words]
words.sort()

d = {}
counter = 0
for lines in text:
    counter += 1
    if word not in d:
        d[words] = [counter]
    else:
        d[words].append(counter)
    print(word, d)
This code outputs:
helo
goin
ist
I want it to output:
helo #tab# 3 4
goin #tab# 1 2

text is a list of WORDS, it's not a list of LINES. When you do:
text = infile.split()
you're irreversibly, forever throwing away all connections between a word and the line it was in. So when you later write
for lines in text:
it's a lie: text's items are words, not lines. If they weren't, then this other earlier line:
words = list(set(text) - set(dictset))
would be totally broken -- this depends on text's items being words, not lines.
And, by the way, when you do:
words = [text.lower() for text in words]
text is now left bound to the last item in words -- you've destroyed whatever other value it had previously.
Recommendation number one: stop reusing identifiers for many different, incompatible purposes. Make a commitment to yourself that no identifier shall ever be bound to two different things within any one of your programs. This will, at least, reduce the incredible amount of utter confusion that you manage to pile onto so few lines.
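Putting that advice into practice, here is a minimal sketch (not the answerer's code) that keeps the word-to-line association by walking the file line by line; it assumes the same command-line arguments as the question (sys.argv[1] is the text file, sys.argv[2] is the word list to exclude):

import string
import sys

# Read the text and strip punctuation, as in the question's code.
text = open(sys.argv[1]).read()
for punct in string.punctuation:
    text = text.replace(punct, "")

# Words to exclude, read from the second file.
dictset = set(open(sys.argv[2]).read().lower().split())

# Walk the text line by line so the line number is still known
# when each word is seen.
d = {}
for lineno, line in enumerate(text.split("\n"), start=1):
    for word in line.lower().split():
        if word not in dictset:
            d.setdefault(word, []).append(lineno)

for word in sorted(d):
    print("{}\t{}".format(word, " ".join(str(n) for n in d[word])))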

Related

How to make a Python program that recognizes words common to 2 text files?

So, I'm making a Python program that will read text from a .txt file (source.txt) and check whether source.txt contains any words that appear in a certain wordlist (words.txt). I also need it to tell me which of those common words occurs most.
So, any idea how to do this?
Text File:-
Hello, How are you today
I am doing very fine fine
I am also very cool
My friends are cool too
We are all very cool
Code (not using any list comprehensions, deliberately):
index = []  # Empty list
check = ['fine', 'cool']  # Words to check for
with open('Sample', 'r') as file:  # Open text file
    for line in file:  # Line in text file
        for word in line.split():  # Split the line into words
            for i in range(len(check)):  # Check if words from check match the words in the line
                if word == check[i]:  # i equals the index of the word in the list "check"
                    index.append(i)  # We add the index to our index list

# Find the most common index in our index list
max = 0
res = index[0]
for i in index:
    freq = index.count(i)
    if freq > max:
        max = freq
        res = i  # The element with this index in "check" is the most common

print("The most common word is :", check[res], "It occurs", max, "times in the file")
Output:
The most common word is : cool It occurs 3 times in the file
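If list comprehensions and the standard library are fair game, the same counting step can also be sketched with collections.Counter; this is an alternative sketch, not part of the answer above, and reuses the same 'Sample' file and check list:

from collections import Counter

check = ['fine', 'cool']  # words to check for, as above
with open('Sample') as file:
    # count only the words we care about, line by line
    counts = Counter(word for line in file
                     for word in line.split() if word in check)

word, freq = counts.most_common(1)[0]
print("The most common word is :", word, "It occurs", freq, "times in the file")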
Read the source txt file and use either a regular expression or split to get the list of words from it; methods may vary.
Do the same thing with your words.txt.
Then use the set & operator.
Below is a crude but working example:
f = open('./source.txt').read()
f2 = open('./words.txt').read()
a = set(' '.join(f.split('\n')).split(' '))
b = set(' '.join(f2.split('\n')).split(' '))
print (a&b)
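A slightly tidier variant of the same idea, sketched here with a regular expression so punctuation (such as the comma in "Hello,") doesn't stick to the words; the file names are the same ones used in the example above:

import re

with open('./source.txt') as src, open('./words.txt') as wordlist:
    # extract words, ignoring punctuation and case
    source_words = set(re.findall(r"[a-z']+", src.read().lower()))
    check_words = set(re.findall(r"[a-z']+", wordlist.read().lower()))

# words common to both files
print(source_words & check_words)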

How to convert a list into float for using the '.join' function?

I have to compress a file into a list of words and a list of positions so that the original file can be recreated. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation: using the map function, my program can't convert my list of positions into floats because of the '[' character, since the positions were written out as a list.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
line = text.readline()
if not line:
break
TwoList = line.split()
for word in TwoList:
if word not in CharactersUnique:
CharactersUnique.append(word)
ListOfPositions.append(CharactersUnique.index(word))
if not DownLine:
CharactersUnique.append("\n")
DownLine = True
ListOfPositions.append(CharactersUnique.index("\n"))
w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
w.write(c)
w.close()
x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()
with open("List_WordsPos.txt", "r") as f:
NewWordsUnique = f.readline()
f.close()
h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)
print("Recreated Text:\n")
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?
Why do you want to turn the position values in the list into floats, since they are list indices and those must be integers? I suspected this might be an instance of what is called the XY Problem.
I also found your code difficult to understand because you haven't followed the PEP 8 - Style Guide for Python Code. In particular, many (although not all) of the variable names are CamelCased, which according to the guidelines should be reserved for class names.
In addition, some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variable names into lowercase words separated by underscores. In several instances I also gave them better names to reflect their actual contents or role: for example, CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they would all be closed automatically at the end of the block. In other cases I consolidated multiple file open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. This caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in NewListOfPositions)
because it adds one space before and after every "\n" character encountered, spaces that aren't there in the original file. To work around that, I ended up writing out the for loop that recreates the file instead of using a list comprehension, because doing so allows the newline "words" to be handled properly.
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"
# Two lists to represent contents of input file.
unique_words = ["\n"] # preload with newline "word"
word_positions = []
with open(input_filename, "r") as input_file:
for line in input_file:
for word in line.split():
if word not in unique_words:
unique_words.append(word)
word_positions.append(unique_words.index(word))
word_positions.append(unique_words.index("\n")) # add newline at end of each line
# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
words_repr = " ".join(repr(word) for word in unique_words)
compr_file.write(words_repr + "\n")
positions_repr = " ".join(repr(posn) for posn in word_positions)
compr_file.write(positions_repr + "\n")
def strip_quotes(word):
"""Strip the first and last characters from the string (assumed to be quotes)."""
tmp = word[1:-1]
return tmp if tmp != "\\n" else "\n" # newline "words" are special case
# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
line = compr_file.readline()
new_unique_words = list(map(strip_quotes, line.split()))
line = compr_file.readline()
new_word_positions = map(int, line.split()) # using int, not float here
words = []
lines = []
for posn in new_word_positions:
word = new_unique_words[posn]
if word != "\n":
words.append(word)
else:
lines.append(" ".join(words))
words = []
print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.
Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}

with open("speech.text") as f:
    data = f.read()

# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")

for i, line in enumerate(lines):
    words = line.split(" ")
    for j, word in enumerate(words):
        if word in all_words:
            all_words[word].append((i, j))  # line and pos
        else:
            all_words[word] = [(i, j)]
Note that this does not yield maximum compression as foo and foo. count as separate words. If you want more compression, you'll have to go character by character. Hopefully now you can use a similar approach to do so if desired.
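For completeness, a minimal sketch (not part of the answer above) of how that word -> [(line, pos)] mapping could be inverted to rebuild the original text; it assumes all_words and lines exist exactly as in the preceding snippet:

# one slot list per original line
rebuilt = [[] for _ in lines]
for word, places in all_words.items():
    for line_no, pos in places:
        rebuilt[line_no].append((pos, word))

# sort each line's words back into position and join them up
recreated = "\n".join(
    " ".join(word for _, word in sorted(slots)) for slots in rebuilt
)
print(recreated)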

Counting Hashtag

I'm writing a function called HASHcount(name, list), which receives 2 parameters. The name one is the name of the file that will be analyzed, a text file structured like this:
Date|||Time|||Username|||Follower|||Text
So, basically my input is a list of tweets, with several rows structured as above. The list parameter is a list of hashtags I want to count in that text file. I want my function to check how many times each word of the given list occurred as a hashtag in the tweets, and to give as output a dictionary with each word's count, even if the word never appears.
For instance, with the call HASHcount(December, [Peace, Love]) the program should give as output a dictionary built by checking how many times the words Peace and Love have been used as hashtags in the Text field of each tweet in the file called December.
Also, in the dictionary the words have to appear without the hashtag symbol.
I'm stuck on making this function; I'm at this point, but I'm having some issues concerning the dictionary:
def HASHcount(name, list):
    f = open(name, "r")
    dic = {}
    l = f.readline()
    for word in list:
        dic[word] = 0
        for line in f:
            li_lis = line.split("|||")
            li_tuple = tuple(li_lis)
            if word in li_tuple[4]:
                dic[word] = dic[word] + 1
    return dic
The main issue is that you are iterating over the lines in the file for each word, rather than the reverse. Thus the first word will consume all the lines of the file, and each subsequent word will have 0 matches.
Instead, you should do something like this:
def hash_count(name, words):
    dic = {word: 0 for word in words}
    with open(name) as f:
        for line in f:
            line_text = line.split('|||')[4]
            for word in words:
                # Check if word appears as a hashtag in line_text
                # If so, increment the count for word
    return dic
There are several issues with your code, some of which have already been pointed out, while others (e.g. concerning the identification of hashtags in a tweet's text) have not. Here's a partial solution not covering the fine points of the latter issue:
def HASHcount(name, words):
    dic = dict.fromkeys(words, 0)
    with open(name, "r") as f:
        for line in f:
            for w in words:
                if '#' + w in line:
                    dic[w] += 1
    return dic
This offers several simplifications keyed on the fact that hashtags in a tweet do start with # (which you don't want in the dic) -- as a result it's not worth splitting each line into its fields, since the # cannot be present anywhere except in the text.
However, it still has a fraction of a problem seen in other answers (except the one which just commented out this most delicate of parts!-) -- it can get false positives from partial matches. When the check is just word in line_text the problem would be huge -- e.g. if a word is cat it gets counted as a hashtag even when present in perfectly ordinary text (on its own or as part of another word, e.g. vindicative). With the '#' + approach it's a bit better, but prefix matches would still lead to false positives, e.g. #catalog would erroneously be counted as a hit for cat.
As some suggested, regular expressions can help with that. However, here's an alternative for the body of the for w in words loop...
for w in words:
    where = line.find('#' + w)
    if where == -1: continue
    after = line[where + len(w) + 1]
    if after in chars_acceptable_in_hashes: continue
    dic[w] += 1
The only issue remaining is to determine which characters can be part of hashtags, i.e., the set chars_acceptable_in_hashes -- I haven't memorized Twitter's specs so I don't know it offhand, but surely you can find out. Note that this works at the end of a line, too, because line has not been stripped, so it's known to end with a \n, which is not in the acceptable set (so a hashtag at the very end of the line will be "properly terminated" too).
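For illustration, one plausible definition of that set (an assumption, not Twitter's official rules) is letters, digits and the underscore:

import string

# assumed set of characters that can continue a hashtag
chars_acceptable_in_hashes = set(string.ascii_letters + string.digits + "_")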
I like using the collections module. This worked for me.
from collections import defaultdict

def HASHcount(file_to_open, lst):
    with open(file_to_open) as my_file:
        my_dict = defaultdict(int)
        for line in my_file:
            line = line.split('|||')
            txt = line[4].strip(" ")
            if txt in lst:
                my_dict[txt] += 1
    return my_dict

Why did my method of writing list items to a .txt file not work?

I've written a short program that will take an input file, remove punctuation, sort the contents by number of occurrences per word and then write the 100 most common results to an output file.
I had some trouble with the last part (writing the results to an output file), and though I've fixed it, I don't know what the problem was.
The full code looks like so:
from collections import Counter
from itertools import chain
import sys
import string

wordList = []

# this file contains text from a number of reviews
file1 = open('reviewfile', 'r+')
reviewWords = file1.read().lower()

# this file contains a list of the 1000 most common English words
file2 = open('commonwordsfile', 'r')
commonWords = file2.read().lower()

# remove punctuation
for char in string.punctuation:
    reviewWords = reviewWords.replace(char, " ")

# create a list of individual words from file1
splitWords = reviewWords.split()
for w in splitWords:
    if w not in commonWords and len(w) > 2:
        wordList.append(w)

# sort the resulting list by length
wordList = sorted(wordList, key=len)

# return a list containing the 100
# most common words and number of occurrences
words_to_count = (word for word in wordList)
c = Counter(words_to_count)
commonHundred = c.most_common(100)

# create new file for results and write
# the 100 most common words to it
fileHandle = open("outcome", 'w')
for listItem in commonHundred:
    fileHandle.write(str(listItem) + "\n")
fileHandle.close()
I previously had the following code snippet attempting to write the 100 most common terms to a .txt file, but it didn't work. Can anyone explain why not?
makeFile = open("outputfile", "w")
for item in CommonHundred:
makeFile.write("[0]\n".format(item))
makeFile.close()
Those should be curly braces, like:
makefile.write("{0}\n".format(item))
Run this and see what happens:
a = "[0]".format("test")
print(a)
b = "{0}".format("test")
print(b)
Then go search for "Format String Syntax" here if you'd like to know more: http://docs.python.org/3/library/string.html.

Python: counting unique instance of words across several lines

I have a text file with several observations. Each observation is on one line. I would like to detect unique occurrences of each word in a line: in other words, if the same word occurs twice or more on the same line, it is still counted once. However, I would like to count the frequency of occurrence of each word across all observations; that is, if a word occurs in two or more lines, I would like to count the number of lines it occurred in. Here is the program I wrote, and it is really slow when processing a large number of files. I also remove certain words in the file by referencing another file. Please offer suggestions on how to improve speed. Thank you.
import re, string
from itertools import chain, tee, izip
from collections import defaultdict

def count_words(in_file="", del_file="", out_file=""):
    d_list = re.split('\n', file(del_file).read().lower())
    d_list = [x.strip(' ') for x in d_list]
    dict2 = {}
    f1 = open(in_file, 'r')
    lines = map(string.strip, map(str.lower, f1.readlines()))
    for line in lines:
        dict1 = {}
        new_list = []
        for char in line:
            new_list.append(re.sub(r'[0-9#$?*_><#\(\)&;:,.!-+%=\[\]\-\/\^]', "_", char))
        s = ''.join(new_list)
        for word in d_list:
            s = s.replace(word, "")
        for word in s.split():
            try:
                dict1[word] = 1
            except:
                dict1[word] = 1
        for word in dict1.keys():
            try:
                dict2[word] += 1
            except:
                dict2[word] = 1
    freq_list = dict2.items()
    freq_list.sort()
    f1.close()
    word_count_handle = open(out_file, 'w+')
    for word, freq in freq_list:
        print>>word_count_handle, word, freq
    word_count_handle.close()
    return dict2

dict = count_words("in_file.txt", "delete_words.txt", "out_file.txt")
You're running re.sub on each character of the line, one at a time. That's slow. Do it on the whole line:
s = re.sub(r'[0-9#$?*_><#\(\)&;:,.!-+%=\[\]\-\/\^]', "_", line)
Also, have a look at sets and the Counter class in the collections module. It may be faster if you just count and then discard those you don't want afterwards.
Without having done any performance testing, the following come to mind:
1) you're using regexes -- why? Are you just trying to get rid of certain characters?
2) you're using exceptions for flow control -- although it can be pythonic (better to ask forgiveness than permission), throwing exceptions can often be slow. As seen here:
for word in dict1.keys():
try:
dict2[word] += 1
except:
dict2[word] = 1
3) turn d_list into a set, and use python's in to test for membership, and simultaneously ...
4) avoid heavy use of the replace method on strings -- I believe you're using this to filter out the words that appear in d_list. This could be accomplished instead by avoiding replace and just filtering the words in the line, either with a list comprehension:
[word for word in words if word not in del_words]
or with a filter (not very pythonic):
filter(lambda word: not word in del_words, words)
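Putting those points together, here is a rough sketch (mine, not the answerer's code) under the assumption that the goal is still "count each word at most once per line": the delete-list becomes a set, the per-character re.sub becomes one call per line, and collections.Counter replaces the try/except bookkeeping. The punctuation pattern is copied from the question.

import re
from collections import Counter

def count_words(in_file, del_file, out_file):
    # point 3: a set makes membership tests fast
    with open(del_file) as f:
        del_words = set(line.strip().lower() for line in f)

    punct = re.compile(r'[0-9#$?*_><#\(\)&;:,.!-+%=\[\]\-\/\^]')
    line_counts = Counter()
    with open(in_file) as f:
        for line in f:
            cleaned = punct.sub("_", line.lower())
            # point 4: filter words instead of repeated str.replace calls;
            # a set also gives "count once per line" for free
            unique = set(w for w in cleaned.split() if w not in del_words)
            line_counts.update(unique)

    with open(out_file, 'w') as out:
        for word, freq in sorted(line_counts.items()):
            out.write("%s %s\n" % (word, freq))
    return line_counts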
import re

u_words = set()
u_words_in_lns = []
wordcount = {}
words = []

# buff is assumed to hold the entire contents of the input file as one string
# get unique words per line
for line in buff.split('\n'):
    u_words_in_lns.append(set(line.split(' ')))

# create a set of all unique words
map(u_words.update, u_words_in_lns)

# flatten the sets into a single list of words again
map(words.extend, u_words_in_lns)

# count everything up
for word in u_words:
    wordcount[word] = len(re.findall(word, str(words)))
