article = open("article.txt", encoding="utf-8")
for i in article:
    print(max(i.split(), key=len))
The text is written with line breaks, and this gives me the longest word from each line. How do I get the longest word from the whole text?
One approach would be to read the entire text file into a Python string, replace the newlines with spaces, and then find the longest word:

import re

with open('article.txt', 'r') as file:
    data = re.sub(r'\r?\n', ' ', file.read())  # use a space so words on adjacent lines don't get glued together
longest_word = max(re.findall(r'\w+', data), key=len)
longest = 0
curr_word = ""
with open("article.txt", encoding="utf-8") as f:
    for line in f:  # iterate the file line by line to avoid loading a large file into memory
        for word in line.split(" "):
            word = word.strip()
            if (wl := len(word)) > longest:  # walrus operator, Python 3.8+; otherwise use 2 lines
                longest = wl
                curr_word = word
print(curr_word)
Instead of iterating through each line, you can read the entire text of the file and then split it using article.read().split():

article = open("test.txt", encoding="utf-8")
print(max(article.read().split(), key=len))
article.close()
There are many ways you could do that. This would work:

with open("article.txt", encoding="utf-8") as article:
    txt = [word for item in article.readlines() for word in item.split(" ")]
    biggest_word = sorted(txt, key=lambda word: (-len(word), word))[0]

Note that I am using a with statement to close the connection to the file when the reading is done, that I use readlines to read the entire file, returning a list of lines, and that I unpack the split items twice to get a flat list of words. The last line sorts the list and uses -len(word) to invert the sorting from ascending to descending.
I hope this is what you are looking for :)
If your file is small enough to fit in memory, you can read all of the text at once.

file = open("article.txt", encoding="utf-8", mode='r')
all_text = file.read()
longest = max(all_text.split(), key=len)
print(longest)
I have a really long list of words, one on each line. How do I make a program that takes all of that in and prints the words side by side?
I tried making each word an element of a list, but I don't know how to proceed.
Here's the code I've tried so far:
def convert(lst):
    return [i for item in lst for i in item.split()]

lst = [''' -The list of words come here- ''']
print(convert(lst))
If you already have the words in a list, you can use the join() function to concatenate them. See https://docs.python.org/3/library/stdtypes.html#str.join
words = open('your_file.txt').readlines()
separator = ' '
print(separator.join(word.strip() for word in words))  # strip each trailing newline before joining
Another, slightly more cumbersome method would be to print the words using the built-in print() function but suppress the newline that print() normally adds to the end of its argument.

words = open('your_file.txt').readlines()
for word in words:
    print(word.strip(), end=' ')  # strip the word's own newline, replace print()'s newline with a space
Try this; example.txt just has a list of words, one per line.

with open("example.txt", "r") as a_file:
    sentence = ""
    for line in a_file:
        stripped_line = line.strip()
        sentence = sentence + f"{stripped_line} "
    print(sentence)
If your input file is really large and you can't fit it all in memory, you can read the words lazily and write them to disk instead of holding the whole output in memory.

# create a generator that yields each individual line
lines = (l for l in open('words'))

with open("output", "w+") as writer:
    # read the file line by line to avoid memory issues
    while True:
        try:
            line = next(lines)
            # add to the paragraph in the output file
            writer.write(line.replace('\n', ' '))
        except StopIteration:
            break
You can check the working example here: https://replit.com/#bluebrown/readwritewords#main.py
file_contents = x.read()
# print(file_contents)
for line in file_contents:
    if "ase" in line:
        print(line)
I'm looking for all the sentences in the file that contain the string "ase". When I run it, nothing is printed.
Since file_contents is the result of x.read(), it's a string, not a list of strings.
So you're iterating over each character.
Do this instead:
file_contents = x.readlines()
Now you can search your lines:
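For instance, a minimal sketch reusing the question's already-open file handle x:

file_contents = x.readlines()   # a list of lines, each still ending with '\n'
for line in file_contents:
    if "ase" in line:
        print(line, end='')     # the line already carries its own newline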
Or, if you're not planning to reuse file_contents, iterate over the file handle directly with:

for line in x:

so you don't have to call readlines() and store the whole file in memory (if the file is big, that can make a difference).
read() returns the whole content of the file (not line by line) as a single string. So when you iterate over it, you iterate over the individual characters:
file_contents = """There is a ase."""
for char in file_contents:
print(char)
You can simply iterate over the file object (which returns it line-by-line):
for line in x:
    if "ase" in line:
        print(line)
Note that if you are actually looking for sentences rather than lines that contain 'ase', it gets a bit more complicated. For example, you could read the complete file and split it at '.':

for sentence in x.read().split('.'):
    if "ase" in sentence:
        print(sentence)
However, that would fail if there are periods that don't mark the end of a sentence (like abbreviations).
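For example, a small illustration with made-up sample text:

text = "E.g. this line mentions a database. This one does not."
for sentence in text.split('.'):
    if "ase" in sentence:
        print(sentence)
# prints ' this line mentions a database' -- chopped, because the periods in "E.g." were treated as sentence breaks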
I'm trying to count the number of instances several words appear in a file.
Here is my code:
#!/usr/bin/env python
file = open('my_output', 'r')
word1 = 'wordA'
print('wordA', file.read().split().count(word1))
word2 = 'wordB'
print('wordB', file.read().split().count(word2))
word3 = 'wordC'
print('wordC', file.read().split().count(word3))
The issue in the code is that it only counts the number of instances of word1. How can this code be fixed to count word2 and word3?
Thanks!
I think instead of repeatedly reading and splitting the file, this code would work better (this way you can find the term frequency of any word that appears in the file):
file = open('my_output', 'r')
s = file.read()
s = s.split()
w = set(s)   # unique words in the file
tf = {}
for i in w:
    tf[i] = s.count(i)
print(tf)
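To answer the original question from that dictionary, you could then look up just the three words (a small follow-up sketch; tf is the dictionary built above):

for target in ('wordA', 'wordB', 'wordC'):
    print(target, tf.get(target, 0))  # 0 if the word never appears in the file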
The main problem is that file.read() consumes the file. Thus the second time you search you end up searching an empty file. The simplest solution is to read the file once (if it is not too large) and then just search the previously read text:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    text = file.read()
word1 = 'wordA'
print('wordA', text.split().count(word1))
word2 = 'wordB'
print('wordB', text.split().count(word2))
word3 = 'wordC'
print('wordC', text.split().count(word3))
To improve performance it is also possible to split only once:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    split_text = file.read().split()
word1 = 'wordA'
print('wordA', split_text.count(word1))
word2 = 'wordB'
print('wordB', split_text.count(word2))
word3 = 'wordC'
print('wordC', split_text.count(word3))
Using with will also ensure that the file closes correctly after being read.
Can you try this:
file = open('my_output', 'r')
splitFile = file.read().split()
lst = ['wordA', 'wordB', 'wordC']
for wrd in lst:
    print(wrd, splitFile.count(wrd))
Short solution using collections.Counter object:
import collections

with open('my_output', 'r') as f:
    wordnames = ('wordA', 'wordB', 'wordC')
    counts = (i for i in collections.Counter(f.read().split()).items() if i[0] in wordnames)
    for c in counts:
        print(c[0], c[1])
For the following sample text line:
'wordA some dfasd asdasdword B wordA sdfsd sdasdasdddasd wordB wordC wordC sdfsdfsdf wordA'
we would obtain the output:
wordB 1
wordC 2
wordA 3
In your code the file is consumed (exhausted) on the first read, so the following lines have nothing left to count: the first file.read() reads the whole contents of the file and returns it as a string, the second file.read() has nothing left to read and just returns an empty string '', as does the third file.read().

This is a version that should do what you want:
from collections import Counter

counter = Counter()
with open('my_output', 'r') as file:
    for line in file:
        counter.update(line.split())
print(counter)
You may have to do some preprocessing (to get rid of special characters such as ',' and '.').
Counter is in the Python standard library and is very useful for exactly this kind of thing.
Note that this way you iterate over the file only once and you never have to hold the whole file in memory.
If you only want to keep track of certain words, you can select just those instead of passing the whole line to the counter:
from collections import Counter
import string

counter = Counter()
words = ('wordA', 'wordB', 'wordC')
chars_to_remove = str.maketrans('', '', string.punctuation)

with open('my_output', 'r') as file:
    for line in file:
        line = line.translate(chars_to_remove)
        w = (word for word in line.split() if word in words)
        counter.update(w)
print(counter)
I also included an example of what I meant by preprocessing: punctuation is removed before counting.
from collections import Counter

# Create an empty word_list which stores each of the words from a line.
word_list = []

# file_handle refers to the file object
file_handle = open(r'my_file.txt', 'r+')

# read all the lines in the file
for line in file_handle.readlines():
    # split each line into a list of words and
    # extend those words into the word_list
    word_list.extend(line.split())

# close the file object
file_handle.close()

# Pass the word_list to Counter() and get the dictionary of word counts
dictionary_of_words = Counter(word_list)
print(dictionary_of_words)
I have some pretty large text files (>2 GB) that I would like to process word by word. The files are space-delimited text files with no line breaks (all words are in a single line). I want to take each word, test whether it is a dictionary word (using enchant), and if so, write it to a new file.
This is my code right now:
with open('big_file_of_words', 'r') as in_file:
    with open('output_file', 'w') as out_file:
        words = in_file.read().split(' ')
        for word in words:
            if d.check(word) == True:
                out_file.write("%s " % word)
I looked at "lazy method for reading big file in python", which suggests using yield to read in chunks, but I am concerned that using chunks of a predetermined size will split words in the middle. Basically, I want chunks to be as close to the specified size as possible while splitting only on spaces. Any suggestions?
Combine the last word of one chunk with the first of the next:
def read_words(filename):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(10240)
            if not buf:
                break
            words = (last + buf).split()
            last = words.pop()
            for word in words:
                yield word
        yield last
with open('output.txt', 'w') as output:
    for word in read_words('input.txt'):
        if d.check(word):  # d is the enchant dictionary from the question
            output.write("%s " % word)
You might be able to get away with something similar to an answer on the question you've linked to, but combining re and mmap, e.g.:

import mmap
import re

with open('big_file_of_words', 'r') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    # the mmap behaves like a bytes buffer, so use a bytes pattern
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        # do something
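For the question's use case, the "# do something" part could, for example, test d.check(word) (d being the enchant dictionary from the question) and write matching words to out_file with out_file.write(word + ' ').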
Fortunately, Petr Viktorin has already written code for this. The following generator reads a chunk from the file and yields each contained word. If a word spans two chunks, that's handled as well.
def words_from_file(input_file):  # wrapped in a generator function so it can be iterated
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
https://stackoverflow.com/a/7745406/143880
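A possible way to use it (a sketch; words_from_file is the name given to the wrapped generator above, and d is assumed to be the enchant dictionary from the question):

with open('big_file_of_words') as in_file, open('output_file', 'w') as out_file:
    for word in words_from_file(in_file):
        if d.check(word):
            out_file.write("%s " % word)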
We have a file named wordlist, which contains 1,876 KB worth of alphabetized words, all of which are longer than 4 letters and contain one carriage return between each new two-letter construction (ab, ac, ad, etc., words all contain returns between them):
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only words that are not derivatives of other, smaller words. For example, the wordlist contains the words "abuser", "abused", "abusers", "abuse", "abuses", etc. The new file should retain only the word "abuse" because it is the "lowest common denominator" (if you will) among all those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".
I tried this implementation:
def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            print(base)
            base = word
    result.append(base)
    return result
def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")
    for line in wfile:
        wordlist.append(line[:-1])
    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")
    newfile.write(wordlist)
But it always froze my computer. Any solutions?
I would do something like this:
def bases(words):
    base = next(words)
    yield base
    for word in words:
        if word and not word.startswith(base):
            yield word
            base = word

def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))
This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It's old enough to have one gig of memory.
$ time python words.py
real 0m0.233s
user 0m0.180s
sys 0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
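The slicing variant mentioned here would be a one-line change inside get_bases (a sketch, assuming every line, including the last one, ends with a single '\n'):

words = (line[:-1] for line in f_in)  # slice off the trailing newline instead of calling strip()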
Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions though so I don't feel too bad about it ;)
One possible improvement is to use a database to load the words and avoid loading the full input file in RAM. Another option is to process the words as you read them from the file and write the results without loading everything in memory.
The following example processes the file as it is read, without pre-loading anything into memory.
def root_words(f, out):
    base = f.readline().strip()
    for word in f:
        word = word.strip()  # drop the trailing newline so the prefix comparison works
        if not word.startswith(base):
            out.write(base + "\n")
            base = word
    out.write(base + "\n")
def main():
    wfile = open("wordlist.txt", "r+")
    newfile = open("newwordlist.txt", "w")
    root_words(wfile, newfile)
    wfile.close()
    newfile.close()
The memory complexity of this solution is O(1), since the variable base is the only thing you need to keep while processing the file. This works because the file is alphabetically sorted.
Since the list is alphabetized, this does the trick (it takes 0.4 seconds with 5 MB of data, so 1.8 MB should not be a problem):
res = [" "]
with open("wordlist.txt","r") as f:
for line in f:
tmp = line.strip()
if tmp.startswith(res[-1]):
pass
else:
res.append(tmp)
with open("newlist.txt","w") as f:
f.write('\n'.join(res[1:]))