I need to iterate through the words of a large file, which consists of a single, long long line. I am aware of methods iterating through the file line by line, however they are not applicable in my case, because of its single line structure.
Any alternatives?
It really depends on your definition of word. But try this:
f = file("your-filename-here").read()
for word in f.split():
# do something with word
print word
This will use whitespace characters as word boundaries.
Of course, remember to properly open and close the file, this is just a quick example.
Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.
First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.
If not, use something like:
line = ''
while True:
word, space, line = line.partition(' ')
if space:
# A word was found
yield word
else:
# A word was not found; read a chunk of data from file
next_chunk = input_file.read(1000)
if next_chunk:
# Add the chunk to our line
line = word + next_chunk
else:
# No more data; yield the last word and return
yield word.rstrip('\n')
return
You really should consider using Generator
def word_gen(file):
for line in file:
for word in line.split():
yield word
with open('somefile') as f:
word_gen(f)
There are more efficient ways of doing this, but syntactically, this might be the shortest:
words = open('myfile').read().split()
If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):
Here is my totally functional approach which avoids having to read and
split lines. It makes use of the itertools module:
Note for python 3, replace itertools.imap with map
import itertools
def readwords(mfile):
byte_stream = itertools.groupby(
itertools.takewhile(lambda c: bool(c),
itertools.imap(mfile.read,
itertools.repeat(1))), str.isspace)
return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:
>>> import sys
>>> for w in readwords(sys.stdin):
... print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>
I guess in your case, this would be the way to use the function:
with open('words.txt', 'r') as f:
for word in readwords(f):
print(word)
Read in the line as normal, then split it on whitespace to break it down into words?
Something like:
word_list = loaded_string.split()
After reading the line you could do:
l = len(pattern)
i = 0
while True:
i = str.find(pattern, i)
if i == -1:
break
print str[i:i+l] # or do whatever
i += l
Alex.
What Donald Miner suggested looks good. Simple and short. I used the below in a code that I have written some time ago:
l = []
f = open("filename.txt", "rU")
for line in f:
for word in line.split()
l.append(word)
longer version of what Donald Miner suggested.
Related
I have a really long list of words that are on each line. How do I make a program that takes in all that and print them all side by side?
I tried making the word an element of a list, but I don't know how to proceed.
Here's the code I've tried so far:
def convert(lst):
return([i for item in lst for i in item.split()])
lst = [''' -The list of words come here- ''']
print(convert(lst))
If you already have the words in a list, you can use the join() function to concatenate them. See https://docs.python.org/3/library/stdtypes.html#str.join
words = open('your_file.txt').readlines()
separator = ' '
print(separator.join(words))
Another, a little bit more cumbersome method would be to print the words using the builtin print() function but suppress the newline that print() normally adds automatically to the end of your argument.
words = open('your_file.txt').readlines()
for word in words:
print(word, end=' ')
Try this, and example.txt just has a list of words going down line by line.
with open("example.txt", "r") as a_file:
sentence = ""
for line in a_file:
stripped_line = line.strip()
sentence = sentence + f"{stripped_line} "
print(sentence)
If your input file is really large and you cant fit it all in memory, you can read the words lazy and write them to disk instead of holding the whole output in memory.
# create a generator that yields each individual line
lines = (l for l in open('words'))
with open("output", "w+") as writer:
# read the file line by line to avoid memory issues
while True:
try:
line = next(lines)
# add to the paragraph in the out file
writer.write(line.replace('\n', ' '))
except StopIteration:
break
You can check the working example here: https://replit.com/#bluebrown/readwritewords#main.py
I've seen a few people ask how this would be done, but their questions were 'too broad' so I decided to find out how to do it. I've posted below how.
So to do this, first you must open the file (Assuming you have a file of text called 'text.txt') We do this by calling the open function.
file = open('text.txt', 'r')
The open function uses the syntax: open(file, mode)
The file being the text document, and the mode being how it's opened. ('r' means read only) The read function just reads the file, then split separates each of the words into a list object. Lastly, we use the count function to find how many times the word appears.
word = input('word: ')
print(file.read().split().count(word))
And there you have it, counting words in a text file!
Word counts can be tricky. At a minimum, one would like to avoid differences in capitalization and punctuation. A simple way to take the next step in word counts is to use regular expressions and to convert its resulting words to lower case before we do the count. We could even use collections.Counter and count all of the words.
import re
# `word_finder(somestring)` emits all words in string as list
word_finder = re.compile(r'\w+').findall
filename = input('filename: ')
word = input('word: ')
# remove case for compare
lword = word.lower()
# `word_finder` emits all of the words excluding punctuation
# `filter` removes the lower cased words we don't want
# `len` counts the result
count = len(list(filter(lambda w: w.lower() == lword,
word_finder(open(filename).read()))))
print(count)
# we could go crazy and count all of the words in the file
# and do it line by line to reduce memory footprint.
import collections
import itertools
from pprint import pprint
word_counts = collections.Counter(itertools.chain.from_iterable(
word_finder(line.lower()) for line in open(filename)))
print(pprint(word_counts))
Splitting on whitespace isn't sufficient -- split on everything you're not counting and get your case under control:
import re
import sys
file = open(sys.argv[1])
word = sys.argv[2]
print(re.split(r"[^a-z]+", file.read().casefold()).count(word.casefold()))
You can add apostrophes to the inverted pattern [^a-z'] or whatever else you want to include in your count.
Hogan: Colonel, you're asking and answering your own questions. That's tops in German efficiency.
def words_frequency_counter(filename):
"""Print how many times the word appears in the text."""
try:
with open(filename) as file_object:
contents = file_object.read()
except FileNotFoundError:
pass
else:
word = input("Give me a word: ")
print("'" + word + "'" + ' appears ' +
str(contents.lower().count(word.lower())) + ' times.\n')
First, you want to open the file. Do this with:
your_file = open('file.txt', 'r')
Next, you want to count the word. Let's set your word as brian under the variable life. No reason.
your_file.read().split().count(life)
What that does is reads the file, splits it into individual words, and counts the instances of the word 'brian'. Hope this helps!
I am kind of stuck on a question that I have to do regarding iambic pentameters, but because it is long, I'll try to simplify it.
So I need to get some words and their stress patterns from a text file that look somewhat like this:
if, 0
music,10
be,1
the,0
food,1
of,0
love,1
play,0
on,1
hello,01
world,1
And from the file, you can assume there will be much more words for different sentences. I am trying to get sentences from a text file which have multiple sentences, and to see if the sentence (ignoring punctuation and case) is an iambic pentameter.
For example if the text file contains this:
If music be the food of love play on
hello world
The first sentence will be assigned from the stress dictionary like this: 0101010101, and the second is obviously not a pentameter(011). I would like it so that it only prints sentences which are iambic pentameters.
Sorry if this is a convoluted or messy question.
This is what I have so far:
import string
dict = {};
sentence = open('sentences.txt')
stress = open('stress.txt')
for some in stress:
word,number = some.split(',')
dict[word] = number
for line in sentence:
one = line.split()
I don't think you are building your dictionary of stresses correctly. It's crucial to remember to get rid of the implicit \n character from lines as you read them in, as well as strip any whitespace from words after you've split on the comma. As things stand, the line if, 0 will be split to ['if', ' 0\n'] which isn't what you want.
So to create your dictionary of stresses you could do something like this:
stress_dict = {}
with open('stress.txt', 'r') as f:
for line in f:
word_stress = line.strip().split(',')
word = word_stress[0].strip().lower()
stress = word_stress[1].strip()
stress_dict[word] = stress
For the actual checking, the answer by #khelwood is a good way, but I'd take extra care to handle the \n character as you read in the lines and also make sure that all the characters in the line were lowercase (like in your dictionary).
Define a function is_iambic_pentameter to check whether a sentence is an iambic pentameter (returning True/False) and then check each line in sentences.txt:
def is_iambic_pentameter(line):
line_stresses = [stress_dict[word] for word in line.split()]
line_stresses = ''.join(line_stresses)
return line_stresses == '0101010101'
with open('sentences.txt', 'r') as f:
for line in f:
line = line.rstrip()
line = line.lower()
if is_iambic_pentameter(line):
print line
As an aside, you might be interested in NLTK, a natural language processing library for Python. Some Internet searching finds that people have written Haiku generators and other scripts for evaluating poetic forms using the library.
I wouldn't have thought iambic pentameter was that clear cut: always some words end up getting stressed or unstressed in order to fit the rhythm. But anyway. Something like this:
for line in sentences:
words = line.split()
stresspattern = ''.join([dict[word] for word in words])
if stresspattern=='0101010101':
print line
By the way, it's generally a bad idea to be calling your dictionary 'dict', since you're hiding the dict type.
Here's how the complete code could look like:
#!/usr/bin/env python3
def is_iambic_pentameter(words, word_stress_pattern):
"""Whether words are a line of iambic pentameter.
word_stress_pattern is a callable that given a word returns
its stress pattern
"""
return ''.join(map(word_stress_pattern, words)) == '01'*5
# create 'word -> stress pattern' mapping, to implement word_stress_pattern(word)
with open('stress.txt') as stress_file:
word_stress_pattern = dict(map(str.strip, line.split(','))
for line in stress_file).__getitem__
# print lines that use iambic pentameter
with open('sentences.txt') as file:
for line in file:
if is_iambic_pentameter(line.casefold().split(), word_stress_pattern):
print(line, end='')
In my CS class I have been given a task to read in the entire corpus of Shakespeare's Plays and Sonnets and print out the number of times a particular word occurs. Can anyone help me get my feet off the ground with this. Here is the first level of the stepwise refinement I was given.
Level 0
Define a function that tokenizes a file, returning an array of tokens. Loop through the array, printing each token one per line. For example, your specialized main might look something like this:
def main():
tokens = readTokens("shakespeare.txt")
for i in range(0,len(tokens),1):
print(tokens[i])
I guess my real question is how do I tokenize a file and then read it into an array into python? Sorry if this kind of question is not what this website is for, Im just looking for some help. Thanks.
goodletters = set("abcdefghijklmnopqrstuvwxyz' \t")
def tokenize_file(fname):
tokens = []
with open(fname) as inf:
for line in inf:
clean = ''.join(ch for ch in line.lower() if ch in goodletters)
tokens.extend(clean.split())
return tokens
Written this way for clarity; in production, I would use inf.read().translate(), but the setup for that is significantly different for Python 2.x vs 3.x and I'd prefer not to be more confusing than necessary.
from collections import Counter
def readTokens(file):
tokens = Counter()
with open(file) as f:
for line in f:
tokens += Counter(word.strip() for word in line.split())
# if you're trying to count "Won't", "won't", and "won't!"
# all together, do this instead:
## tokens += Counter(word.strip('"!?,.;:').casefold() for word in line.split())
return tokens
def main():
tokens = readTokens('shakespeare.txt')
for token in tokens:
print(token)
print("The most commonly used word is {}".format(max(tokens.items(), key=
lambda x: x[1])))
We have a file named wordlist, which contains 1,876 KB worth of alphabetized words, all of which are longer than 4 letters and contain one carriage return between each new two-letter construction (ab, ac, ad, etc., words all contain returns between them):
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only words that are not derivatives of other, smaller words. For example, the wordlist contains the following words ["abuser, abused, abusers, abuse, abuses, etc.] The new file that is created should retain only the word "abuse" because it is the "lowest common denominator" (if you will) between all those words. Similarly, the word "rodeo" would be removed because it contains the word rode.
I tried this implementation:
def root_words(wordlist):
result = []
base = wordlist[1]
for word in wordlist:
if not word.startswith(base):
result.append(base)
print base
base=word
result.append(base)
return result;
def main():
wordlist = []
wfile = open("wordlist.txt", "r+")
for line in wfile:
wordlist.append(line[:-1])
wordlist = root_words(wordlist)
newfile = open("newwordlist.txt", "r+")
newfile.write(wordlist)
But it always froze my computer. Any solutions?
I would do something like this:
def bases(words):
base = next(words)
yield base
for word in words:
if word and not word.startswith(base):
yield word
base = word
def get_bases(infile, outfile):
with open(infile) as f_in:
words = (line.strip() for line in f_in)
with open(outfile, 'w') as f_out:
f_out.writelines(word + '\n' for word in bases(words))
This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It's old enough to have one gig of memory.
$ time python words.py
real 0m0.233s
user 0m0.180s
sys 0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions though so I don't feel too bad about it ;)
One possible improvement is to use a database to load the words and avoid loading the full input file in RAM. Another option is to process the words as you read them from the file and write the results without loading everything in memory.
The following example treats the file as it is read without pre-loading stuff in memory.
def root_words(f,out):
result = []
base = f.readline()
for word in f:
if not word.startswith(base):
out.write(base + "\n")
base=word
out.write(base + "\n")
def main():
wfile = open("wordlist.txt", "r+")
newfile = open("newwordlist.txt", "w")
root_words(wfile,newfile)
wfile.close()
newfile.close()
Memory complexity of this solution is O(1) since the variable base is the only thing that you need in order to process the file. This can be done thanks to that the file is alphabetically sorted.
since the list is alphabetized, this does the trick (takes 0.4seconds with 5 megs of data, so should not be a problem with 1.8)
res = [" "]
with open("wordlist.txt","r") as f:
for line in f:
tmp = line.strip()
if tmp.startswith(res[-1]):
pass
else:
res.append(tmp)
with open("newlist.txt","w") as f:
f.write('\n'.join(res[1:]))