I have a text file made up of numbers and words, for example like this: 09807754 18 n 03 aristocrat 0 blue_blood 0 patrician. I want to split it so that each word or number comes up on a new line.
A whitespace separator would be ideal, as I would like the words with the dashes to stay connected.
This is what I have so far:
f = open('words.txt', 'r')
for word in f:
    print(word)
I'm not really sure how to go from here; I would like this to be the output:
09807754
18
n
03
aristocrat
...
Given this file:
$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6
If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):
with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)
Prints:
line1
word1
word2
line2
...
word6
Similarly, if you want to flatten the file into a single flat list of words, you might do something like this:
with open('words.txt') as f:
    flat_list = [word for line in f for word in line.split()]
>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']
This can produce the same output as the first example with print('\n'.join(flat_list)).
Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):
with open('words.txt') as f:
    matrix = [line.split() for line in f]
>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]
If you want a regex solution, which would allow you to filter wordN vs lineN type words in the example file:
import re

with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\bword\d+', line):
            # prints the wordN matches only, with no lineN
            print(word)
Or, if you want that to be a line-by-line generator with a regex:
with open("words.txt") as f:
    words = (word for line in f for word in re.findall(r'\w+', line))
    for word in words:  # consume the generator while the file is still open
        print(word)
f = open('words.txt')
for word in f.read().split():
    print(word)
As a supplement, if you are reading a very large file and you don't want to read all of its content into memory at once, you might consider using a buffer and returning each word via yield:
def read_words(inputfile):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(10240)
            if not buf:
                break
            # make sure we end on a space (word boundary)
            while not str.isspace(buf[-1]):
                ch = f.read(1)
                if not ch:
                    break
                buf += ch
            words = buf.split()
            for word in words:
                yield word
        yield ''  # handle the case where the file is empty
if __name__ == "__main__":
    for word in read_words('./very_large_file.txt'):
        process(word)
What you can do is use nltk to tokenize the words and then store all of them in a list; here's what I did. If you don't know nltk, it stands for Natural Language Toolkit and is used to process natural language. Here is a resource if you want to get started:
[http://www.nltk.org/book/]
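One setup note: word_tokenize depends on nltk's punkt tokenizer models, which are a separate one-time download (skip this step if you already have them installed):
import nltk
nltk.download('punkt')  # one-time download of the tokenizer models used by word_tokenize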
import nltk
from nltk.tokenize import word_tokenize

file = open("abc.txt", newline='')
result = file.read()
words = word_tokenize(result)
for i in words:
    print(i)
The output will be this:
09807754
18
n
03
aristocrat
0
blue_blood
0
patrician
with open(filename) as file:
    words = file.read().split()
It's a list of all the words in your file.
import re

with open(filename) as file:
    words = re.findall(r"([a-zA-Z\-]+)", file.read())
Note that this pattern only matches letters and hyphens, so it keeps hyphenated words together but drops digits and splits words like blue_blood at the underscore.
Here is my totally functional approach which avoids having to read and split lines. It makes use of the itertools module:
Note: for Python 3, replace itertools.imap with map.
import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
                            itertools.imap(mfile.read,
                                           itertools.repeat(1))),
        str.isspace)
    return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:
>>> import sys
>>> for w in readwords(sys.stdin):
... print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>
I guess in your case, this would be the way to use the function:
with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
Related
I have a txt file with sentences and am able to find words from a list within it. I would like to print the line above each 'found' line to a separate list. I tried it with the code below, but this only returns [].
Here is my code:
fname_in = "test.txt"
lv_pos = []
search_list = ['word1', 'word2']
with open (fname_in, 'r') as f:
file_l1 = [line.split('\n') for line in f.readlines()]
counter = 0
for word in search_list:
if word in file_l1:
l_pos.append(file_l1[counter - 1])
counter += 1
print(l_pos)
The text file looks something like this:
Bla bla bla
I want this line1.
I found this line with word1.
Bla bla bla
I want this line2.
I found this line with word2.
The result I want is this:
l_pos = ['I want this line1.','I want this line2.']
In the second line of your example you wrote lv_pos instead of l_pos. Inside the with statement you could fix it like this, I think:
fname_in = "test.txt"
l_pos = []
search_list = ['word1', 'word2']
file_l1 = f.readlines()
for line in range(len(file_l1)):
for word in search_words:
if word in file_l1[line].split(" "):
l_pos.append(file_l1[line - 1])
print(l_pos)
I'm not thrilled about this solution but I think it would fix your code with minimal modification.
Treat the file as a collection of pairs of lines and the lines before them (here lines is a list of the file's lines, e.g. lines = open(fname_in).read().splitlines()):
[prev for prev,this in zip(lines, lines[1:])
if 'word1' in this or 'word2' in this]
#['I want this line1.', 'I want this line2.']
This approach can be extended to cover any number of words:
words = {'word1', 'word2'}
[prev for prev,this in zip(lines,lines[1:])
if any(word in this for word in words)]
#['I want this line1.', 'I want this line2.']
Finally, if you care about proper words rather than occurrences (as in "thisisnotword1"), you should properly tokenize lines with, say, nltk.word_tokenize():
from nltk import word_tokenize
[prev for prev,this in zip(lines,lines[1:])
if words & set(word_tokenize(this))]
#['I want this line1.', 'I want this line2.']
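For instance, tokenizing one of the sample lines splits the trailing period off the word, so the set intersection matches whole words only:
>>> from nltk import word_tokenize
>>> word_tokenize("I found this line with word1.")
['I', 'found', 'this', 'line', 'with', 'word1', '.']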
First of all, you have some typos in your code: in some places you wrote l_pos and in others lv_pos.
The other problem is that file_l1 is a list of lists, so the test if word in file_l1: isn't doing what you think. You would need to check each word against each of those sublists.
Here's some working code based on your own:
fname_in = "simple_test.txt"
l_pos = []
search_list = ['word1', 'word2']
with open(fname_in) as f:
lines = f.read().splitlines()
for i, line in enumerate(lines):
for word in search_list:
if word in line:
l_pos.append(lines[i - 1])
print(l_pos) # -> ['I want this line1.', 'I want this line2.']
Update
Here's another way to do it that doesn't require reading the entire file into memory at once, so doesn't require as much memory:
from collections import deque

fname_in = "simple_test.txt"
l_pos = []
search_list = ['word1', 'word2']
with open(fname_in) as file:
    lines = (line.rstrip('\n') for line in file)  # Generator expression.
    try:  # Create and initialize a sliding window.
        sw = deque([next(lines)], maxlen=2)  # Note the list: we want a window of lines, not characters.
    except StopIteration:  # Empty file.
        pass
    for line in lines:
        sw.append(line)
        for word in search_list:
            if word in sw[1]:
                l_pos.append(sw[0])
print(l_pos)  # -> ['I want this line1.', 'I want this line2.']
I'm trying to count the number of times several words appear in a file.
Here is my code:
#!/usr/bin/env python
file = open('my_output', 'r')
word1 = 'wordA'
print('wordA', file.read().split().count(word1))
word2 = 'wordB'
print('wordB', file.read().split().count(word2))
word3 = 'wordC'
print('wordC', file.read().split().count(word3))
The issue in the code is that it only counts the number of instances of word1. How can this code be fixed to count word2 and word3?
Thanks!
I think that instead of continuously reading and splitting the file, this code would work better if you did the following (this way you could find the term frequency of any number of words in the file):
file = open('my_output', 'r')
s = file.read()
s = s.split()
w = set(s)
tf = {}
for i in w:  # iterate over the distinct words
    tf[i] = s.count(i)
print(tf)
The main problem is that file.read() consumes the file. Thus the second time you search you end up searching an empty file. The simplest solution is to read the file once (if it is not too large) and then just search the previously read text:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    text = file.read()
word1 = 'wordA'
print('wordA', text.split().count(word1))
word2 = 'wordB'
print('wordB', text.split().count(word2))
word3 = 'wordC'
print('wordC', text.split().count(word3))
To improve performance it is also possible to split only once:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    split_text = file.read().split()
word1 = 'wordA'
print('wordA', split_text.count(word1))
word2 = 'wordB'
print('wordB', split_text.count(word2))
word3 = 'wordC'
print('wordC', split_text.count(word3))
Using with will also ensure that the file closes correctly after being read.
Can you try this:
file = open('my_output', 'r')
splitFile = file.read().split()
lst = ['wordA', 'wordB', 'wordC']
for wrd in lst:
    print(wrd, splitFile.count(wrd))
Short solution using collections.Counter object:
import collections
with open('my_output', 'r') as f:
    wordnames = ('wordA', 'wordB', 'wordC')
    counts = (i for i in collections.Counter(f.read().split()).items() if i[0] in wordnames)
    for c in counts:
        print(c[0], c[1])
For the following sample text line:
'wordA some dfasd asdasdword B wordA sdfsd sdasdasdddasd wordB wordC wordC sdfsdfsdf wordA'
we would obtain the output:
wordB 1
wordC 2
wordA 3
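Since a Counter returns 0 for keys it has never seen, a minimal variation is to skip the filtering generator and look the target words up directly:
import collections

with open('my_output', 'r') as f:
    counter = collections.Counter(f.read().split())
for word in ('wordA', 'wordB', 'wordC'):
    print(word, counter[word])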
In your code the file is consumed (exhausted) by the first line, so the next lines will not return anything to count: the first file.read() reads the whole contents of the file and returns it as a string; the second file.read() has nothing left to read and just returns an empty string '', as does the third file.read().
This is a version that should do what you want:
from collections import Counter

counter = Counter()
with open('my_output', 'r') as file:
    for line in file:
        counter.update(line.split())
print(counter)
You may have to do some preprocessing in order to get rid of special characters like , and . (see the second example below). Counter is in the Python standard library and is very useful for exactly this kind of thing. Note that this way you iterate over the file only once, and you never have to store the whole file in memory.
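As a quick illustration of what Counter does on its own (a minimal sketch):
from collections import Counter

print(Counter("a b a c a b".split()))
# Counter({'a': 3, 'b': 2, 'c': 1})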
If you only want to keep track of certain words, you can select only those instead of passing the whole line to the counter:
from collections import Counter
import string

counter = Counter()
words = ('wordA', 'wordB', 'wordC')
chars_to_remove = str.maketrans('', '', string.punctuation)
with open('my_output', 'r') as file:
    for line in file:
        line = line.translate(chars_to_remove)
        w = (word for word in line.split() if word in words)
        counter.update(w)
print(counter)
I also included an example of what I meant by preprocessing: punctuation is removed before counting.
from collections import Counter

# Create an empty word_list which stores each of the words from a line.
word_list = []
# file_handle refers to the file object.
file_handle = open(r'my_file.txt', 'r+')
# Read all the lines in the file, split each line into a list of words,
# and extend those words into word_list.
for line in file_handle.readlines():
    word_list.extend(line.split())
# Close the file object.
file_handle.close()
# Pass word_list to Counter() and get the dictionary of the word counts.
dictionary_of_words = Counter(word_list)
print(dictionary_of_words)
import re
twovowels = re.compile(r".*[aeiou].*[aeiou].*", re.I)
nonword = re.compile(r"\W+", re.U)
text_file = open("twoVoweledWordList.txt", "w")
file = open("FirstMondayArticle.html", "r")
for line in file:
    for word in nonword.split(line):
        if twovowels.match(word): print word
        text_file.write('\n' + word)
text_file.close()
file.close()
This is my Python code; I am trying to print only the words that have two or more vowels. When I run this code, it prints everything, including the words and numbers that do not have vowels, to my text file. But the Python shell shows me only the words that have two or more vowels. So how do I change that?
You can remove the vowels with str.translate and compare lengths. If, after removing those letters, the length difference is > 1, you have at least two vowels:
with open("FirstMondayArticle.html") as f, open("twoVoweledWordList.txt", "w") as out:
for line in file:
for word in line.split():
if len(word) - len(word.lower().translate(None,"aeiou")) > 1:
out.write("{}\n".format(word.rstrip()))
In your own code you always write the word, because text_file.write('\n' + word) is outside the if block. This is a good lesson in why you should not have multiple statements on one line; your code is equivalent to:
if twovowels.match(word):
    print(word)
text_file.write('\n' + word)  # <- outside the if
Here is your code with the if in the correct location, some changes to your naming convention, spaces added between assignments, and with used so your files are closed for you:
import re

with open("FirstMondayArticle.html") as f, open("twoVoweledWordList.txt", "w") as out:
    two_vowels = re.compile(r".*[aeiou].*[aeiou].*", re.I)
    non_word = re.compile(r"\W+", re.U)
    for line in f:
        for word in non_word.split(line):
            if two_vowels.match(word):
                print(word)
                out.write("{}\n".format(word.rstrip()))
Because it is outside of the if condition. This is what the code lines should look like:
for line in file:
    for word in nonword.split(line):
        if twovowels.match(word):
            print word
            text_file.write('\n' + word)
text_file.close()
file.close()
Here is a sample program on Tutorialspoint showing the code above is correct.
I would suggest an alternate, and simpler, method, not using re:
def twovowels(word):
    count = 0
    for char in word.lower():
        if char in "aeiou":
            count = count + 1
            if count > 1:
                return True
    return False
with open("FirstMondayArticle.html") as file,
open("twoVoweledWordList.txt", "w") as text_file:
for line in file:
for word in line.split():
if twovowels(word):
print word
text_file.write(word + "\n")
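For example, calling the helper on its own ("aristocrat" has four vowels, "sky" has none):
>>> twovowels("aristocrat")
True
>>> twovowels("sky")
False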
I have an output in which each line contains one list, and each list contains one word of a sentence after hyphenation.
It looks something like this:
['I']
['am']
['a']
['man.']
['I']
['would']
['like']
['to']
['find']
['a']
['so','lu','tion.']
(let's say it's hyphenated like this, I'm not a native English speaker)
etc.
Now, what I'd like to do is write this output to a new .txt file, but each sentence (a sentence ends when an item in a list contains a period) has to be written on a new line. I'd like to have the following result written to this .txt file:
I am a man.
I would like to find a so,lu,tion.
etc.
The code that precedes all this is the following:
with open('file.txt', 'r') as f:
    for line in f:
        for word in line.split():
            if h_en.syllables(word) != []:
                h_en.syllables(word)
            else:
                print([word])
The result I want is a file which contains a sentence on each line.
Each word of a sentence is represented by its hyphenated version.
Any suggestions?
Thank you a lot.
Something basic like this seems to answer your need:
def write_sentences(filename, *word_lists):
    with open(filename, "w") as f:
        sentence = []
        for word_list in word_lists:
            word = ",".join(word_list)  ## last edit
            sentence.append(word)
            if word.endswith("."):
                f.write(" ".join(sentence))
                f.write("\n")
                sentence = []
Feed the write_sentences function with the output filename, then each of your word lists as arguments. If you have a list of word lists (e.g. [['I'], ['am'], ...]), you can use * when calling the function to pass everything.
EDIT: changed to make it work with the latest edit of the question (with multiple words in the word lists).
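For example, with the sample data from the question (the output filename here is just a placeholder):
word_lists = [['I'], ['am'], ['a'], ['man.'], ['I'], ['would'], ['like'], ['to'], ['find'], ['a'], ['so', 'lu', 'tion.']]
write_sentences('sentences.txt', *word_lists)
# sentences.txt now contains:
# I am a man.
# I would like to find a so,lu,tion.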
This short regex does what you want when it is compiled in MULTILINE mode (here string holds the text being searched):
>>> import re
>>> regex = re.compile(r"\[([a-zA-Z\s]*\.?)\]$", re.MULTILINE)
>>> a = regex.findall(string)
>>> a
[u'I', u'am', u'a man.', u'I', u'would like', u'to find', u'a solution.']
Now you just manipulate the list until you get your wanted result. An example follows, but there are more ways to do it:
>>> b = ' '.join(a)
>>> b
'I am a man. I would like to find a solution.'
>>> c = re.sub('\.', '.\n', b)
>>> print(c)
I am a man.
 I would like to find a solution.
>>> with open("result.txt", "wt") as f:
...     f.write(c)
words = [['I'], ['am'], ['a'], ['man.'], ['I'], ['would'], ['like'], ['to'], ['find'], ['a'], ['so', 'lu', 'tion.']]
text = "".join(
    ",".join(item) + ("\n" if item[-1].endswith(".") else " ")
    for item in words)
with open("out.txt", "wt") as f:
    f.write(text)
I have a text file that has a sentence on each line, and I have a word list. I just want to get the sentences which contain at least one word from the list. Is there a Pythonic way to do that?
sentences = [line for line in f if any(word in line for word in word_list)]
Here f would be your file object; for example, you could replace it with open('file.txt') if file.txt is the name of your file and it is located in the same directory as the script.
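Putting it together (the filename and search words here are just placeholders):
word_list = ['aristocrat', 'patrician']  # hypothetical search words
with open('file.txt') as f:
    sentences = [line.rstrip('\n') for line in f
                 if any(word in line for word in word_list)]
print(sentences)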
Using set.intersection (where word_set is a set of your search words, e.g. word_set = set(word_list)):
with open('file') as f:
    [line for line in f if set(line.lower().split()).intersection(word_set)]
Or with filter (note that in Python 3 filter returns an iterator, so wrap it in list() if you need a list):
filter(lambda x: word_set.intersection(set(x.lower().split())), f)
This will give you a start:
words = ['a', 'and', 'foo']
infile = open('myfile.txt', 'r')
match_sentences = []
for line in infile.readlines():
    # check for words in this line
    # if match, append to match_sentences list
    if any(word in line.split() for word in words):
        match_sentences.append(line)