I'm working on a Python program that prints the words that are in the last file entered from the command line. The words can't be in any of the preceding files. So for example, if I input 2 files from the command line and
File 1 contains: "We are awesome" and File 2 (the last file entered) contains: "We are really awesome"
My final list should only contain: "really"
Right now my code is set up to only look at the last file entered; how can I look at all of the preceding files and compare them in the context of what I am trying to do? Here is my code:
UPDATE
import re
import sys

def get_words(filename):
    test_file = open(filename).read()
    lower_split = test_file.lower()
    new_split = re.split("[^a-z']+", lower_split)
    return set(new_split)  # return a set so set subtraction works below

if __name__ == '__main__':
    bag = []
    for filename in sys.argv[1:]:
        bag.append(get_words(filename))
    unique_words = bag[-1].copy()
    for other in bag[:-1]:
        unique_words -= other
    for word in sorted(unique_words):
        print(word)
Also, note that set() turns a list into a set:
>>> set([1,2,3])
{1, 2, 3}
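and the - operator then drops anything that appears in the other set; a minimal interactive sketch:
>>> {1, 2, 3} - {2, 3, 4}
{1}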
There is really not a lot missing. Step 1: put your code in a function so you can reuse it. You are doing the same thing (parsing a text file) several times, so why not put the corresponding code in a reusable unit?
def get_words(filename):
    test_file = open(filename).read()
    lower_split = test_file.lower()
    new_split = re.split("[^a-z']+", lower_split)
    return set(new_split)
Step 2: Set up a loop to call your function. In this particular case we could use a list comprehension (a one-line version is sketched after the loop), but maybe that's too much for a rookie. You'll come to that in good time:
bag = []
for filename in sys.argv[x:]:  # you'll have to experiment what to put
                               # for x; it will be at least one because
                               # the first argument is the name of your
                               # program
    bag.append(get_words(filename))
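For reference, the list-comprehension version mentioned above is a one-liner (assuming x turns out to be 1):

bag = [get_words(filename) for filename in sys.argv[1:]]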
Now you have all the words conveniently grouped by file. As I said, you can simply take the set difference. So if you want all the words that are only in the very last file:
unique_words = bag[-1].copy()
for other in bag[:-1]:  # loop over all the other files
    unique_words -= other
for word in unique_words:
    print(word)
I didn't test it, so let me know whether it runs.
Consider simplifying by using the set difference operation to 'subtract' the sets of words in your files.
import re
s1 = open('file1.txt', 'r').read()
s2 = open('file2.txt', 'r').read()
set(re.findall(r'\w+',s2.lower())) - set(re.findall(r'\w+',s1.lower()))
result:
{'really'}
I've seen a few people ask how this would be done, but their questions were 'too broad', so I decided to find out how to do it myself and post the method below.
So to do this, first you must open the file (assuming you have a text file called 'text.txt'). We do this by calling the open function.
file = open('text.txt', 'r')
The open function uses the syntax: open(file, mode)
The file is the text document, and the mode is how it's opened ('r' means read-only). The read function reads the file, then split separates each of the words into a list object. Lastly, we use the count function to find how many times the word appears.
word = input('word: ')
print(file.read().split().count(word))
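Putting the pieces together, a minimal sketch of the whole thing (assuming the file is called 'text.txt'; the with block closes the file for you):

word = input('word: ')
with open('text.txt', 'r') as file:
    print(file.read().split().count(word))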
And there you have it, counting words in a text file!
Word counts can be tricky. At a minimum, one would like to avoid differences in capitalization and punctuation. A simple way to take the next step in word counts is to use regular expressions and to convert the resulting words to lower case before we do the count. We could even use collections.Counter and count all of the words.
import re

# `word_finder(somestring)` emits all words in a string as a list
word_finder = re.compile(r'\w+').findall

filename = input('filename: ')
word = input('word: ')

# remove case for the comparison
lword = word.lower()

# `word_finder` emits all of the words, excluding punctuation
# `filter` removes the lower-cased words we don't want
# `len` counts the result
count = len(list(filter(lambda w: w.lower() == lword,
                        word_finder(open(filename).read()))))
print(count)
# we could go crazy and count all of the words in the file
# and do it line by line to reduce the memory footprint
import collections
import itertools
from pprint import pprint

word_counts = collections.Counter(itertools.chain.from_iterable(
    word_finder(line.lower()) for line in open(filename)))
pprint(word_counts)  # pprint prints the counter itself; wrapping it in print() would only print None
Splitting on whitespace isn't sufficient -- split on everything you're not counting and get your case under control:
import re
import sys
file = open(sys.argv[1])
word = sys.argv[2]
print(re.split(r"[^a-z]+", file.read().casefold()).count(word.casefold()))
You can add apostrophes to the inverted pattern [^a-z'] or whatever else you want to include in your count.
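For instance, with the apostrophe included in the pattern, contractions survive the split; a quick interactive check:

>>> import re
>>> re.split(r"[^a-z']+", "don't stop, don't".casefold()).count("don't")
2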
Hogan: Colonel, you're asking and answering your own questions. That's tops in German efficiency.
def words_frequency_counter(filename):
    """Print how many times the word appears in the text."""
    try:
        with open(filename) as file_object:
            contents = file_object.read()
    except FileNotFoundError:
        pass
    else:
        word = input("Give me a word: ")
        print("'" + word + "'" + ' appears ' +
              str(contents.lower().count(word.lower())) + ' times.\n')
First, you want to open the file. Do this with:
your_file = open('file.txt', 'r')
Next, you want to count the word. Let's store the word 'brian' in a variable called life. No reason.
life = 'brian'
print(your_file.read().split().count(life))
That reads the file, splits it into individual words, and counts the instances of the word 'brian'. Hope this helps!
I'm a beginner with Python and would like to know how to use two txt files to count the characters, as well as find the 10 most common characters. Also, how do I convert all characters in the files to lower case and eliminate all characters other than a-z?
Here's what I've tried, with no luck:
from string import ascii_lowercase
from collections import Counter
with open('document1.txt', 'document2.txt') as f:
    print Counter(letter for line in f
                  for letter in line.lower()
                  if letter in ascii_lowercase)
Try like this:
>>> from collections import Counter
>>> import re
>>> words = re.findall(r'\w+', "{} {}".format(open('your_file1').read().lower(), open('your_file2').read().lower()))
>>> Counter(words).most_common(10)
Here is a simple example; you can adapt this code to fit your needs.
from string import ascii_lowercase
from collections import Counter
with open('file1.txt', 'r') as file1data:  # opening and reading file one
    file1 = file1data.read().lower()       # convert the entire file contents to lower case
with open('file2.txt', 'r') as file2data:  # opening and reading file two
    file2 = file2data.read().lower()

# The contents of files 1 and 2 are stored in the file1 and file2 variables.
# Example of how to work with one file; repeat for two files:
file1_list = []
for ch in file1:
    if ch in ascii_lowercase:  # makes sure only lowercase letters are appended; all non-alphabet characters are removed
        file1_list.append(ch)
    elif ch in [" ", ".", ",", "'"]:  # remove this elif block if you just want the letters
        file1_list.append(ch)         # make sure basic punctuation is kept

print("".join(file1_list))                  # this line is not needed; just to show what the text looks like now
print(Counter(file1_list).most_common(10))  # prints the top ten
print(Counter(file1_list))                  # prints the characters and how many times they repeat
Now that you've reviewed that mess above and have an idea of what each line is doing, here is a cleaner version, that does what you were looking for.
from string import ascii_lowercase
from collections import Counter

with open('file1.txt', 'r') as file1data:
    file1 = file1data.read().lower()
with open('file2.txt', 'r') as file2data:
    file2 = file2data.read().lower()

file1_list = []
for ch in file1:
    if ch in ascii_lowercase:
        file1_list.append(ch)

file2_list = []
for ch in file2:
    if ch in ascii_lowercase:
        file2_list.append(ch)

all_counter = Counter(file1_list + file2_list)
top_ten_counter = Counter(file1_list + file2_list).most_common(10)

print(sorted(all_counter.items()))
print(sorted(top_ten_counter))
Unfortunately there is no way to insert into the middle of a file without re-writing it. As previous posters have indicated, you can append to a file or overwrite part of it using seek but if you want to add stuff at the beginning or the middle, you'll have to rewrite it.
This is an operating system thing, not a Python thing. It is the same in all languages.
What I usually do is read from the file, make the modifications and write it out to a new file called myfile.txt.tmp or something like that. This is better than reading the whole file into memory because the file may be too large for that. Once the temporary file is completed, I rename it the same as the original file.
This is a good, safe way to do it because if the file write crashes or aborts for any reason, you still have your untouched original file.
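A minimal sketch of that read-modify-rename pattern (the filenames and the line transformation are only placeholders):

import os

src = 'myfile.txt'   # the file to modify (placeholder name)
tmp = src + '.tmp'

with open(src) as infile, open(tmp, 'w') as outfile:
    for line in infile:  # stream line by line so the whole file never sits in memory
        outfile.write(line.replace('old', 'new'))  # stand-in for your real modification

os.replace(tmp, src)  # swap the finished temp file over the original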
To find the most common words from multiple files:
from collections import Counter
import re
with open('document1.txt') as f1, open('document2.txt') as f2:
    words = re.findall(r'\w+', f1.read().lower()) + re.findall(r'\w+', f2.read().lower())

Counter(words).most_common(10)

will give you the 10 most common words.

If you want the 10 most common characters, reopen the files first (they have already been read and closed above):

with open('document1.txt') as f1, open('document2.txt') as f2:
    print(Counter(f1.read() + f2.read()).most_common(10))
I want to find words in a text file that match words stored in an existing list called items. The list is created in a previous function, and I want to be able to use it in the next function as well, but I'm unsure how to do that; I tried using classes for it but couldn't get it right. I also can't figure out what the problem is with the rest of the code: I tried running it without the class and list, replacing the list items[] in line 8 with a word from the text file being opened, and it still didn't do anything, even though no errors come up. When the code below is run it prints "Please entre a valid textfile name: " and stops there.
class searchtext():
    textfile = input("Please entre a valid textfile name: ")
    items = []
    def __init__search(self):
        with open("textfile") as openfile:
            for line in openfile:
                for part in line.split():
                    if ("items[]=") in part:
                        print(part)
                    else:
                        print("not found")
The list is created from another text file of words by a previous function, which looks like this and works as it should, if that is of any help:
def createlist():
    items = []
    with open('words.txt') as input:
        for line in input:
            items.extend(line.strip().split(','))
    return items

print(createlist())
You can use a regexp in the following way:
>>> import re
>>> words=['car','red','woman','day','boston']
>>> word_exp='|'.join(words)
>>> re.findall(word_exp,'the red car driven by the woman',re.M)
['red', 'car', 'woman']
The second command creates a pattern of acceptable words separated by "|". To run this on a file, just replace the string 'the red car driven by the woman' with open(your_file, 'r').read(). Note that without word boundaries the pattern will also match substrings ('day' would match inside 'days'); a file-based sketch with boundaries follows.
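A quick sketch of that file variant, with the boundaries added (the filename is a placeholder):

import re

words = ['car', 'red', 'woman', 'day', 'boston']
word_exp = r'\b(?:' + '|'.join(words) + r')\b'  # \b keeps 'day' from matching inside 'days'

with open('your_file.txt', 'r') as fh:
    print(re.findall(word_exp, fh.read()))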
This may be a bit cleaner. I feel a class is overkill here.
def createlist():
    items = []
    with open('words.txt') as input:
        for line in input:
            items.extend(line.strip().split(','))
    return items

# store the list
word_list = createlist()

with open('file.txt') as f:
    # split the file content into words (first into lines, then each line into its words)
    for word in (sum([x.split() for x in f.read().split('\n')], [])):
        # check if each word is in the list
        if word in word_list:
            # do something with the word
            print(word + " is in the list")
        else:
            # word not in list
            print(word + " is NOT in the list")
There is nothing like regular expressions for matching: https://docs.python.org/3/howto/regex.html
items = ['one', 'two', 'three', 'four', 'five']  # your items list created previously

import re

file = open('text.txt', 'r')  # load your file
content = file.read()         # save the read output so the reading always starts from the beginning

for i in items:
    lis = re.findall(i, content)
    if len(lis) == 0:
        print('Not found')
    elif len(lis) == 1:
        print('Found Once')
    elif len(lis) == 2:
        print('Found Twice')
    else:
        print('Found', len(lis), 'times')
I've written a short program that will take an input file, remove punctuation, sort the contents by number of occurrences per word, and then write the 100 most common results to an output file.
I had some trouble on the last part (writing the results to an output file), and though I've fixed it, I don't know what the problem was.
The full code looks like so:
from collections import Counter
from itertools import chain
import sys
import string

wordList = []

# this file contains text from a number of reviews
file1 = open('reviewfile', 'r+')
reviewWords = file1.read().lower()

# this file contains a list of the 1000 most common English words
file2 = open('commonwordsfile', 'r')
commonWords = file2.read().lower()

# remove punctuation
for char in string.punctuation:
    reviewWords = reviewWords.replace(char, " ")

# create a list of individual words from file1
splitWords = reviewWords.split()
for w in splitWords:
    if w not in commonWords and len(w) > 2:
        wordList.append(w)

# sort the resulting list by length
wordList = sorted(wordList, key=len)

# return a list containing the 100
# most common words and number of occurrences
words_to_count = (word for word in wordList)
c = Counter(words_to_count)
commonHundred = c.most_common(100)

# create a new file for the results and write
# the 100 most common words to it
fileHandle = open("outcome", 'w')
for listItem in commonHundred:
    fileHandle.write(str(listItem) + "\n")
fileHandle.close()
I previously had the following code snippet attempting to write the 100 most common terms to a .txt file, but it didn't work. Can anyone explain why not?
makeFile = open("outputfile", "w")
for item in CommonHundred:
    makeFile.write("[0]\n".format(item))
makeFile.close()
Those should be curly braces, like:
makeFile.write("{0}\n".format(item))
Run this and see what happens:
a = "[0]".format("test")
print(a)
b = "{0}".format("test")
print(b)
Then go search for "Format String Syntax" here if you'd like to know more: http://docs.python.org/3/library/string.html.
I need to iterate through the words of a large file, which consists of a single, long long line. I am aware of methods iterating through the file line by line, however they are not applicable in my case, because of its single line structure.
Any alternatives?
It really depends on your definition of word. But try this:
f = open("your-filename-here").read()
for word in f.split():
    # do something with word
    print(word)
This will use whitespace characters as word boundaries.
Of course, remember to properly open and close the file; this is just a quick example.
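A version that handles the closing for you, as a small sketch:

with open('your-filename-here') as f:
    for word in f.read().split():
        print(word)  # do something with each word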
Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.
First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.
If not, wrap the buffering in a generator function (the name read_words here is arbitrary), something like:

def read_words(input_file):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
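Usage of the read_words generator sketched above might look like this (the filename is a placeholder):

with open('one-long-line.txt') as input_file:
    for word in read_words(input_file):
        print(word)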
You really should consider using a generator:

def word_gen(file):
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    for word in word_gen(f):
        print(word)
There are more efficient ways of doing this, but syntactically, this might be the shortest:
words = open('myfile').read().split()
If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):
Here is my totally functional approach which avoids having to read and
split lines. It makes use of the itertools module:
Note: for Python 3, replace itertools.imap with the built-in map (a Python 3 version is sketched after the function below).
import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
                            itertools.imap(mfile.read,
                                           itertools.repeat(1))), str.isspace)
    return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:
>>> import sys
>>> for w in readwords(sys.stdin):
... print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>
I guess in your case, this would be the way to use the function:
with open('words.txt', 'r') as f:
for word in readwords(f):
print(word)
Read in the line as normal, then split it on whitespace to break it down into words?
Something like:
word_list = loaded_string.split()
After reading the line you could do:
# assumes `text` holds the line you read and `pattern` the word you're looking for
l = len(pattern)
i = 0
while True:
    i = text.find(pattern, i)
    if i == -1:
        break
    print(text[i:i+l])  # or do whatever
    i += l
Alex.
What Donald Miner suggested looks good. Simple and short. I used the code below in something I wrote some time ago:
l = []
f = open("filename.txt", "rU")
for line in f:
    for word in line.split():
        l.append(word)
A longer version of what Donald Miner suggested.