My goal is to open a file, split it into unique words, and display that list (along with a count for each word). I think I have to split the file into lines, then split those lines into words and add it all into a list.
The problem is that my program will either run in an infinite loop and not display any results, or it will only read a single line and then stop. The file being read is the Gettysburg Address.
def uniquify(splitz, uniqueWords, lineNum):
    for word in splitz:
        word = word.lower()
        if word not in uniqueWords:
            uniqueWords.append(word)

def conjunctionFunction():
    uniqueWords = []
    with open(r'C:\Users\Alex\Desktop\Address.txt') as f:
        getty = [line.rstrip('\n') for line in f]
    lineNum = 0
    lines = getty[lineNum]
    getty.append("\n")
    while lineNum < 20:
        splitz = lines.split()
        lineNum += 1
        uniquify(splitz, uniqueWords, lineNum)
    print(uniqueWords)

conjunctionFunction()
Using your current code, the line:
lines = getty[lineNum]
should be moved inside the while loop, so that a new line is fetched on every iteration.
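With that change, the loop would look something like this (a minimal sketch; using len(getty) instead of the hard-coded 20 is an extra suggestion of mine that avoids an IndexError on files with fewer than 20 lines):

while lineNum < len(getty):
    lines = getty[lineNum]  # fetch the current line on every iteration
    splitz = lines.split()
    lineNum += 1
    uniquify(splitz, uniqueWords, lineNum)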
You figured out what's wrong with your code, but nonetheless, I would do this slightly differently. Since you need to keep track of the number of unique words and their counts, you should use a dictionary for this task:
wordHash = {}
with open(r'C:\Users\Alex\Desktop\Address.txt', 'r') as f:
    for line in f:
        line = line.rstrip().lower()
        for word in line.split():  # split the line into words, not characters
            if word not in wordHash:
                wordHash[word] = 1
            else:
                wordHash[word] += 1

print(wordHash)
def splitData(filename):
    with open(filename) as f:
        return f.read().split()
Easiest way to split a file into words :)
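For example, a quick usage sketch (the file name is just an assumption):

words = splitData('gettysburg.txt')
print(len(words), 'words read')
print(words[:5])  # first five words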
Assume inp is retrieved from a file:
inp = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense."""
data = inp.splitlines()
print data

_d = {}
for line in data:
    word_lst = line.split()
    for word in word_lst:
        if word in _d:
            _d[word] += 1
        else:
            _d[word] = 1

print _d.keys()
Output
['Beautiful', 'Flat', 'Simple', 'is', 'dense.', 'Explicit', 'better', 'nested.', 'Complex', 'ugly.', 'Sparse', 'implicit.', 'complex.', 'than', 'complicated.']
I recommend:
#!/usr/local/cpython-3.3/bin/python

import pprint
import collections

def genwords(file_):
    for line in file_:
        for word in line.split():
            yield word

def main():
    with open('gettysburg.txt', 'r') as file_:
        result = collections.Counter(genwords(file_))
        pprint.pprint(result)

main()
...but you could use re.findall to deal with punctuation better, instead of string.split.
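For instance, here is a sketch of how the genwords generator above could be adapted; the \w+ pattern and the lower() call are my assumptions about what "deal with punctuation better" would mean here:

import re

def genwords(file_):
    for line in file_:
        # \w+ keeps runs of letters/digits/underscores, dropping punctuation like "nation,"
        for word in re.findall(r'\w+', line.lower()):
            yield word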
Related
I compare two txt files, find a match, and print the matching line plus the three lines after it. I have read How to search a text file for a specific word in Python to accomplish that.
However, I want anything printed to be exported to an Excel file. I think I am getting the call-out words wrong for the List.Word and Match.
An example of the output I want my code to produce:
import os
import xlwt

def createlist():
    items = []
    with open('Trialrun.txt') as input:
        for line in input:
            items.extend(line.strip().split(','))
    return items

print(createlist())
word_list = createlist()

my_xls = xlwt.Workbook(encoding="utf-8")
my_sheet = my_xls.add_sheet("Results")
row_num = 0
my_sheet.write(row_num, 0, "List.Word()")
my_sheet.write(row_num, 1, "Match")
row_num += 1

with open('January 19.txt', 'r') as f:
    for line in f:
        for word in line.strip().split():
            if word in word_list:
                print'\t', List.Word(), '\t,', Match(),
                print(word, end='')
                my_sheet.write(row_num, 0, List.Word())
                my_sheet.write(row_num, 1, Match())
                row_num += 1
                print(next(f))
                print(next(f))
                print(next(f))
            else:
                StopIteration
my_xls.save("results.xls")
I don't completely get what you want to achieve, and I don't understand the second List.Word and Match occurrence, or the print(next(f)) calls at the end.
But maybe something like this helps; at least the script below iterates over the file and outputs results based on a match in the second file.
import os
import xlwt

def createlist():
    items = []
    with open('Trialrun.txt') as input:
        for line in input:
            items.extend(line.strip().split(','))
    return items

word_list = createlist()
my_xls = xlwt.Workbook(encoding="utf-8")
my_sheet = my_xls.add_sheet("Results")
row_num = 0
my_sheet.write(row_num, 0, "List.Word()")
my_sheet.write(row_num, 1, "Match")
row_num += 1
i = 1
with open('January 19.txt', 'r') as f:
    for line in f:
        for word in line.strip().split():
            my_sheet.write(row_num, 0, word)
            for line in word_list:
                if word in line:
                    i += 1
                    my_sheet.write(row_num, i, line)
                else:
                    StopIteration
            row_num += 1
my_xls.save("results.xls")
I'm trying to make a simple program that can find the frequency of occurrences in a text file, line by line. I have it outputting everything correctly except when more than one word is on a line in the text file. (More information below.)
The text file looks like this:
Hello
Hi
Hello
Good Day
Hi
Good Day
Good Night
I want the output to be (it doesn't have to be in the same order):
Hello: 2
Hi: 2
Good Day: 2
Good Night: 2
What it's currently outputting:
Day: 2
Good: 3
Hello: 2
Hi: 2
Night: 1
My code:
file = open("test.txt", "r")
text = file.read() #reads file (I've tried .realine() & .readlines()
word_list = text.split(None)
word_freq = {} # Declares empty dictionary
for word in word_list:
word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
final=word.capitalize()
print(final + ': ' + str(word_freq[word])) # Line that prints the output
You want to preserve the lines, so don't split, don't capitalize, and don't sort.
Use a Counter:
from collections import Counter

c = Counter()
with open('test.txt') as f:
    for line in f:
        c[line.rstrip()] += 1

for k, v in c.items():
    print('{}: {}'.format(k, v))
Instead of splitting the text by None, split it by each line break so you get each line as an element of the list.
file = open("test.txt", "r")
text = file.read() #reads file (I've tried .realine() & .readlines()
word_list = text.split('\n')
word_freq = {} # Declares empty dictionary
for word in word_list:
word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
final=word.capitalize()
print(final + ': ' + str(word_freq[word])) # Line that prints the output
You can make this very easy for yourself by using a Counter object. If you want to count the occurrences of full lines, you can simply do:
from collections import Counter

with open('file.txt') as f:
    c = Counter(f)
print(c)
Edit
Since you asked for a way without modules:
counter_dict = {}
with open('file.txt') as f:
    l = f.readlines()
for line in l:
    if line not in counter_dict:
        counter_dict[line] = 0
    counter_dict[line] += 1
print(counter_dict)
Thank you all for the answers; most of the code produces the desired output, just in different ways. The code I ended up using with no modules was this:
file = open("test.txt", "r")
text = file.read() #reads file (I've tried .realine() & .readlines()
word_list = text.split('\n')
word_freq = {} # Declares empty dictionary
for word in word_list:
word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
final=word.capitalize()
print(final + ': ' + str(word_freq[word])) # Line that prints the output
The code I ended up using with modules was this:
from collections import Counter

c = Counter()
with open('live.txt') as f:
    for line in f:
        c[line.rstrip()] += 1

for k, v in c.items():
    print('{}: {}'.format(k, v))
I have a .txt file which contains some words separated by blank lines, e.g.:

bye

bicycle

bi
cyc
le

and I want to return a list which contains all the words in the file. I have tried some code which actually works, but I think it takes a lot of time to execute for bigger files. Is there a way to make this code more efficient?
with open('file.txt', 'r') as f:
for line in f:
if line == '\n': --> #blank line
lst1.append(line)
else:
lst1.append(line.replace('\n', '')) --> #the way i find more efficient to concatenate letters of a specific word
str1 = ''.join(lst1)
lst_fin = str1.split()
expected output:
lst_fin = ['bye', 'bicycle', 'bicycle']
I don't know if this is more efficient, but at least it's an alternative... :)
with open('file.txt') as f:
    words = f.read().replace('\n\n', '|').replace('\n', '').split('|')
print(words)
...or, if you don't want to insert a character like '|' (which could already be there) into the data, you could also do:
with open('file.txt') as f:
    words = f.read().split('\n\n')
words = [w.replace('\n', '') for w in words]
print(words)
result is the same in both cases:
# ['bye', 'bicycle', 'bicycle']
EDIT:
I think I have another approach. However, it requires the file not to start with a blank line, if I understand correctly...
with open('file.txt') as f:
    res = []
    current_elmnt = next(f).strip()
    for line in f:
        if line.strip():
            current_elmnt += line.strip()
        else:
            res.append(current_elmnt)
            current_elmnt = ''
    res.append(current_elmnt)  # don't forget the word accumulated after the last blank line
print(res)
Perhaps you want to give it a try...
You can use the iter function with a sentinel of '' instead:
with open('file.txt') as f:
    # inner iter: yields stripped lines until a blank line (the '' sentinel);
    # outer iter: joins each such group into one word until the file is exhausted
    lst_fin = list(iter(lambda: ''.join(iter(map(str.strip, f).__next__, '')), ''))
Demo: https://repl.it/#blhsing/TalkativeCostlyUpgrades
You could use this (I don't know about its efficiency):
lst = []
s = ''
with open('tp.txt', 'r') as file:
    l = file.readlines()
for i in l:
    if i == '\n':
        lst.append(s)
        s = ''
    elif i == l[-1]:
        s += i.rstrip()
        lst.append(s)
    else:
        s += i.rstrip()
print(lst)
I am retrieving only the unique words in a file. Here is what I have so far; however, is there a better way to achieve this in Python in terms of big-O notation? Right now this is n squared.
def retHapax():
    file = open("myfile.txt")
    myMap = {}
    uniqueMap = {}
    for i in file:
        myList = i.split(' ')
        for j in myList:
            j = j.rstrip()
            if j in myMap:
                del uniqueMap[j]
            else:
                myMap[j] = 1
                uniqueMap[j] = 1
    file.close()
    print uniqueMap
If you want to find all unique words, and to consider foo the same as foo., then you need to strip punctuation.
from collections import Counter
from string import punctuation

with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

print([word for word, count in word_counts.iteritems() if count == 1])
If you want to ignore case, you also need to use line.lower(). If you want to accurately get unique words, then there is more involved than just splitting the lines on whitespace.
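A minimal sketch combining both points (lower-casing and punctuation stripping), written for Python 3, where iteritems() becomes items():

from collections import Counter
from string import punctuation

with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation)
                          for line in f
                          for word in line.lower().split())

# words that appear exactly once, ignoring case and surrounding punctuation
print([word for word, count in word_counts.items() if count == 1])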
I'd go with the collections.Counter approach, but if you only wanted to use sets, then you could do so by:
with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set()
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)

unique = all_words - dupes
Given an input of:
one two three
two three four
four five six
it has an output of:
{'five', 'one', 'six'}
Try this to get the unique words in a file, using Counter:
from collections import Counter

with open("myfile.txt") as input_file:
    word_counts = Counter(word for line in input_file for word in line.split())

>>> [word for (word, count) in word_counts.iteritems() if count == 1]
-> list of unique words (words that appear exactly once)
You could slightly modify your logic: move a word out of the unique set on its second occurrence (example using sets instead of dicts):
words = set()
unique_words = set()
with open('myfile.txt') as f:  # f was assumed to be an open file in the original snippet
    for w in (word.strip() for line in f for word in line.split(' ')):
        if w in words:
            continue
        if w in unique_words:
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)
I am trying to write a program which reads a text file and then sorts the comments in it into positive, negative, or neutral. I have tried all sorts of ways to do this, each time to no avail. I can search for one word with no problems, but any more than that and it doesn't work. Also, I have an if statement, but I've had to use else twice underneath it, as it wouldn't allow me to use elif. Any help with where I'm going wrong would be really appreciated. Thanks in advance.
middle = open("middle_test.txt", "r")
positive = []
negative = [] #the empty lists
neutral = []
pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"] #the lists I'd like to search
neg_words = ["BAD", "HATE", "SUCKS", "CRAP"]
for tweet in middle:
words = tweet.split()
if pos_words in words: #doesn't work
positive.append(words)
else: #can't use elif for some reason
if 'BAD' in words: #works but is only 1 word not list
negative.append(words)
else:
neutral.append(words)
Use a Counter, see http://docs.python.org/2/library/collections.html#collections.Counter:
import urllib2
from collections import Counter
from string import punctuation

# data from http://inclass.kaggle.com/c/si650winter11/data
target_url = "http://goo.gl/oMufKm"
data = urllib2.urlopen(target_url).read()
word_freq = Counter([i.lower().strip(punctuation) for i in data.split()])

pos_words = ["good", "great", "love", "awesome"]
neg_words = ["bad", "hate", "sucks", "crap"]

for i in pos_words:
    try:
        print i, word_freq[i]
    except:  # if word not in data
        pass
[out]:
good 638
great 1082
love 7716
awesome 2032
You could use the code below to count the number of positive and negative words in a paragraph:
from collections import Counter

def readwords(filename):
    f = open(filename)
    words = [line.rstrip() for line in f.readlines()]
    return words

# >cat positive.txt
# good
# awesome

# >cat negative.txt
# bad
# ugly

positive = readwords('positive.txt')
negative = readwords('negative.txt')
print positive
print negative

paragraph = 'this is really bad and in fact awesome. really awesome.'
count = Counter(paragraph.split())
pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n')  # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val

print pos, neg
You are not reading the lines from the file. And this line:
if pos_words in words:
checks whether the entire list ["GOOD", "GREAT", "LOVE", "AWESOME"] is an element of words, which is never true; you need to test each word individually.
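A minimal sketch of the membership test that was probably intended, using any() (the file name and word lists are taken from the question):

pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"]
neg_words = ["BAD", "HATE", "SUCKS", "CRAP"]
positive, negative, neutral = [], [], []

with open("middle_test.txt") as middle:
    for tweet in middle:
        words = tweet.split()
        if any(w in words for w in pos_words):  # True if at least one positive word is present
            positive.append(words)
        elif any(w in words for w in neg_words):
            negative.append(words)
        else:
            neutral.append(words)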
You have some problems. First, you can create functions that read comments from the file and divide a comment into words. Write them and check that they work as you expect. Then the main procedure can look like:
for comment in get_comments(file_name):
    words = get_words(comment)
    classified = False
    # at first, look for a negative comment
    for neg_word in NEGATIVE_WORDS:
        if neg_word in words:
            classified = True
            negatives.append(comment)
            break
    # now look for a positive one
    if not classified:
        for pos_word in POSITIVE_WORDS:
            if pos_word in words:
                classified = True
                positives.append(comment)
                break
    if not classified:
        neutral.append(comment)
Be careful: open() returns a file object.
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
Use this:
>>> f.readline()
'This is the first line of the file.\n'
Then use set intersection:
positive += list(set(pos_words) & set(tweet.split()))
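Put together, a minimal sketch of how that could look inside the question's loop (collecting the matching words per line is my assumption about the desired output):

pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"]
positive = []

with open("middle_test.txt") as middle:
    for tweet in middle:
        # the set intersection keeps whichever positive words appear on this line
        positive += list(set(pos_words) & set(tweet.split()))

print(positive)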