read words from file, line by line and concatenate to paragraph - python

I have a really long file with one word on each line. How do I write a program that reads all of them in and prints them side by side?
I tried making each word an element of a list, but I don't know how to proceed.
Here's the code I've tried so far:
def convert(lst):
    return [i for item in lst for i in item.split()]

lst = [''' -The list of words come here- ''']
print(convert(lst))

If you already have the words in a list, you can use the str.join() method to concatenate them. See https://docs.python.org/3/library/stdtypes.html#str.join
words = [line.strip() for line in open('your_file.txt')]  # strip() removes the trailing newline from each line
separator = ' '
print(separator.join(words))
Another, slightly more cumbersome method is to print the words with the built-in print() function while suppressing the newline that print() normally adds after its argument.
words = open('your_file.txt').readlines()
for word in words:
    print(word.strip(), end=' ')

Try this; example.txt just contains a list of words going down the file line by line.
with open("example.txt", "r") as a_file:
sentence = ""
for line in a_file:
stripped_line = line.strip()
sentence = sentence + f"{stripped_line} "
print(sentence)

If your input file is really large and you can't fit it all in memory, you can read the words lazily and write them to disk instead of holding the whole output in memory.
# create a generator that yields each individual line
lines = (l for l in open('words'))

with open("output", "w+") as writer:
    # read the file line by line to avoid memory issues
    while True:
        try:
            line = next(lines)
            # add to the paragraph in the out file
            writer.write(line.replace('\n', ' '))
        except StopIteration:
            break
You can check the working example here: https://replit.com/#bluebrown/readwritewords#main.py
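For what it's worth, the same streaming behavior can be written as a plain for loop over the file object; this sketch reuses the file names from the snippet above:
# stream line by line; neither file is held fully in memory
with open('words') as reader, open('output', 'w') as writer:
    for line in reader:
        writer.write(line.replace('\n', ' '))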

Related

How to open a file in python, read the comments ("#"), find a word after the comments and select the word after it?

I have a function that loops through a file that looks like this:
"#" XDI/1.0 XDAC/1.4 Athena/0.9.25
"#" Column.4: pre_edge
Content
That is to say, after the "#" there is a comment. My function aims to read each line and, if it starts with a specific word, select what comes after the ":".
For example, given these two lines, I would like to read through them, and if a line starts with "#" and contains the word "Column.4", the word "pre_edge" should be stored.
An example of my current approach follows:
with open(file, "r") as f:
    for line in f:
        if line.startswith('#'):
            word = line.split(" Column.4:")[1]
        else:
            print("n")
I think my trouble is specifically this: after finding a line that starts with "#", how can I parse/search through it and save its content if it contains the desired word?
In case the # comment contains the string Column.4: as stated above, you could parse it this way.
with open(filepath) as f:
    for line in f:
        if line.startswith('#'):
            # Here you process comment lines
            if 'Column.4' in line:
                first, remainder = line.split('Column.4: ')
                # remainder contains everything after '# Column.4: '
                # So if you want to get the first word ->
                word = remainder.split()[0]
        else:
            # Here you can process lines that are not comments
            pass
Note
It is also good practice to use the for line in f: idiom instead of f.readlines() (as used in other answers), because this way you don't load all the lines into memory but process them one by one.
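A minimal sketch of the difference (the file name big.txt is hypothetical):
# readlines() materializes the whole file as a list of lines in memory:
with open('big.txt') as f:
    all_lines = f.readlines()  # memory use grows with the file size

# Iterating over the file object streams one line at a time:
with open('big.txt') as f:
    for line in f:  # only one line is held in memory at a time
        print(line.strip())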
You should start by reading the file into a list and then work through that instead:
file = 'test.txt'  #<- call file whatever you want
with open(file, "r") as f:
    txt = f.readlines()

for line in txt:
    if line.startswith('"#"'):
        word = line.split(" Column.4: ")
        try:
            print(word[1])
        except IndexError:
            print(word)
    else:
        print("n")
Output:
>>> ['"#" XDI/1.0 XDAC/1.4 Athena/0.9.25\n']
>>> pre_edge
A try/except catch is used because the first line also starts with "#" and can't be split with the current logic.
Also, as a side note, in the question the file's lines start with "#" including the quotation marks, so the startswith() call was altered accordingly.
with open('stuff.txt', 'r+') as f:
    data = f.readlines()

for line in data:
    words = line.split()
    if words and ('#' in words[0]) and ("Column.4:" in words):
        print(words[-1])
        # pre_edge
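For completeness, a regex-based sketch of the same extraction (the pattern and file name are assumptions, not taken from the answers above):
import re

with open('stuff.txt') as f:
    for line in f:
        # look for 'Column.4:' followed by the value we want
        match = re.search(r'Column\.4:\s*(\S+)', line)
        if match:
            word = match.group(1)
            print(word)  # pre_edge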

How to convert a list into float for using the '.join' function?

I have to compress a file into a list of words and a list of positions so that the original file can be recreated. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation: using the map function, my program can't convert my list of positions into floats, because of the '[' character that comes from writing the list out as a string.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
    line = text.readline()
    if not line:
        break
    TwoList = line.split()
    for word in TwoList:
        if word not in CharactersUnique:
            CharactersUnique.append(word)
        ListOfPositions.append(CharactersUnique.index(word))
    if not DownLine:
        CharactersUnique.append("\n")
        DownLine = True
    ListOfPositions.append(CharactersUnique.index("\n"))

w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
    w.write(c)
w.close()

x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()

with open("List_WordsPos.txt", "r") as f:
    NewWordsUnique = f.readline()
f.close()

h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)

print("Recreated Text:\n")
recreation = " ".join(NewWordsUnique[pos] for pos in NewListOfPositions)
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?
Why do you want to turn the position values in the list into floats, since they are list indices, and those must be integers? I suspect this might be an instance of what is called the XY Problem.
I also found your code difficult to understand because you haven't followed the PEP 8 - Style Guide for Python Code, in particular with how many (although not all) of the variable names are CamelCased, which according to the guidelines should be reserved for class names.
In addition, some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variables into lowercase underscore-separated words, like camel_case. In several instances I also gave them better names to reflect their actual contents or role: for example, CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they would all be closed automatically at the end of the block. In other cases I consolidated multiple file open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. This caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in NewListOfPositions)
because it adds one space before and after every "\n" character encountered, spaces that aren't there in the original file. To work around that, I ended up writing out the for loop that recreates the file instead of using a generator expression, because doing so allows the newline "words" to be handled properly.
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"

# Two lists to represent contents of input file.
unique_words = ["\n"]  # preload with newline "word"
word_positions = []

with open(input_filename, "r") as input_file:
    for line in input_file:
        for word in line.split():
            if word not in unique_words:
                unique_words.append(word)
            word_positions.append(unique_words.index(word))
        word_positions.append(unique_words.index("\n"))  # add newline at end of each line

# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
    words_repr = " ".join(repr(word) for word in unique_words)
    compr_file.write(words_repr + "\n")
    positions_repr = " ".join(repr(posn) for posn in word_positions)
    compr_file.write(positions_repr + "\n")

def strip_quotes(word):
    """Strip the first and last characters from the string (assumed to be quotes)."""
    tmp = word[1:-1]
    return tmp if tmp != "\\n" else "\n"  # newline "words" are special case

# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
    line = compr_file.readline()
    new_unique_words = list(map(strip_quotes, line.split()))
    line = compr_file.readline()
    new_word_positions = map(int, line.split())  # using int, not float here

words = []
lines = []
for posn in new_word_positions:
    word = new_unique_words[posn]
    if word != "\n":
        words.append(word)
    else:
        lines.append(" ".join(words))
        words = []

print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.
Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}

with open("speech.txt") as f:
    data = f.read()

# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")
for i, line in enumerate(lines):
    words = line.split(" ")
    for j, word in enumerate(words):
        if word in all_words:
            all_words[word].append((i, j))  # line and pos
        else:
            all_words[word] = [(i, j)]
Note that this does not yield maximum compression, as foo and foo. count as separate words. If you want more compression, you'll have to go character by character. Hopefully you can now use a similar approach to do so if desired.
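As a rough halfway step (an assumption on my part, not part of the answer above), trailing punctuation could be stored separately so that foo and foo. share one dictionary entry:
import string

def split_token(word):
    # Separate a word from any trailing punctuation, e.g. 'foo.' -> ('foo', '.')
    bare = word.rstrip(string.punctuation)
    return bare, word[len(bare):]

print(split_token("foo."))  # ('foo', '.')
print(split_token("foo"))   # ('foo', '')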

When counting the occurrence of a string in a file, my code does not count the very first word

Code
def main():
    try:
        file = input('Enter the name of the file you wish to open: ')
        thefile = open(file, 'r')
        line = thefile.readline()
        line = line.replace('.', '')
        line = line.replace(',', '')
        thefilelist = line.split()
        thefilelistset = set(thefilelist)
        d = {}
        for item in thefilelist:
            thefile.seek(0)
            wordcount = line.count(' ' + item + ' ')
            d[item] = wordcount
        for i in d.items():
            print(i)
        thefile.close()
    except IOError:
        print('IOError: Sorry but i had an issue opening the file that you specified to READ from please try again but keep in mind to check your spelling of the file you want to open')

main()
Problem
Basically I am trying to read the file and count the number of times each word in the file appears then print that word with the number of times it appeared next to it.
It all works except that it will not count the first word in the file.
File I am using
My practice file that I am testing this code on contains this text:
This file is for testing. It is going to test how many times the words
in here appear.
Output
('for', 1)
('going', 1)
('the', 1)
('testing', 1)
('is', 2)
('file', 1)
('test', 1)
('It', 1)
('This', 0)
('appear', 1)
('to', 1)
('times', 1)
('here', 1)
('how', 1)
('in', 1)
('words', 1)
('many', 1)
Note
If you notice, it says that 'This' appears 0 times, but it does in fact appear in the file.
Any ideas?
My guess would be this line:
wordcount=line.count(' '+item+' ')
You are looking for "space" + YourWord + "space", and the first word is not preceded by space.
I would suggest making more use of Python's utilities. A big flaw is that you only read one line from the file.
You then create a set of unique words and start counting them individually, which is highly inefficient; the line is traversed many times: once to create the set and then once for each unique word.
Python has a built-in "high performance counter" (https://docs.python.org/2/library/collections.html#collections.Counter) which is specifically meant for use cases like this.
The following few lines replace your program; they use re.split() to split each line on word boundaries (https://docs.python.org/2/library/re.html#regular-expression-syntax).
The idea is to execute this split() on each line of the file and update the word counter with the results of each split. re.sub() is also used to remove the dots and commas in one go before handing the line to the split function.
import re, collections

with open(raw_input('Enter the name of the file you wish to open: '), 'r') as file:
    for d in reduce(lambda acc, line: acc.update(re.split("\W", line)) or acc,
                    map(lambda line: re.sub("[.,]", "", line), file),
                    collections.Counter()).items():
        print d
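On Python 3, a plain Counter loop is a simpler sketch of the same idea (the punctuation handling is an assumption):
import re
from collections import Counter

counts = Counter()
with open(input('Enter the name of the file you wish to open: ')) as f:
    for line in f:
        # strip dots and commas, then split on whitespace
        counts.update(re.sub(r"[.,]", "", line).split())

for item in counts.items():
    print(item)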
If you want a simple fix, it is in this line:
wordcount=line.count(' '+item+' ')
There is no space before "This".
I think there are a couple of ways to fix it, but I recommend using a with block and .readlines().
I also recommend using some more of Python's capabilities. Two things to note: if the file has more than one line, this code won't work, because only the first line is ever read. Also, if a sentence ends like lastwordofsentence.Firstwordofnextsentence, the two words will run together and become one word, so change your replace() calls to substitute a space rather than an empty string, i.e. change '' to ' '; split() will collapse the multiple spaces.
Also, please post whether you are using Python 2.7 or 3.x; it helps with small possible syntax problems.
filename = input('Enter the name of the file you wish to open: ')

# Using a with block like this is cleaner and nicer than try catch
with open(filename, "r") as f:
    all_lines = f.readlines()

d = {}  # Create empty dictionary
# Iterate through all lines in file
for line in all_lines:
    # Replace periods and commas with spaces
    line = line.replace('.', ' ')
    line = line.replace(',', ' ')
    # Get all words on this line
    words_in_this_line = line.split()  # Split into all words
    # Iterate through all words
    for word in words_in_this_line:
        # Check if word already exists in dictionary
        if word in d:  # Word exists, increment count
            d[word] += 1
        else:  # Word doesn't exist, add it with count 1
            d[word] = 1

# Print all words with frequency of occurrence in file
for i in d.items():
    print(i)
You check whether the line contains ' '+item+' ', which means you are searching for a word that both starts and ends with a space. Because "This" is the first word of the line, it is not surrounded by two spaces.
To fix that, you can use the following code:
wordcount = (' ' + line + ' ').count(' ' + item + ' ')
The code above ensures that the first and the last word are counted correctly.
The problem is in the line wordcount=line.count(' '+item+' '): the first word will not have a space in front of it. I have also removed some other redundant statements from your code:
import string

def main():
    try:
        #file=input('Enter the name of the file you wish to open: ')
        thefile = open('C:/Projects/Python/data.txt', 'r')
        line = thefile.readline()
        line = line.translate(string.maketrans("", ""), string.punctuation)
        thefilelist = line.split()
        d = {}
        for item in thefilelist:
            if item not in d:
                d[item] = 0
            d[item] = d[item] + 1
        for i in d.items():
            print(i)
        thefile.close()
    except IOError:
        print('IOError: Sorry but i had an issue opening the file that you specified to READ from please try again but keep in mind to check your spelling of the file you want to open')

main()
The first word does not have a space in front of it.
Quick fix:
line = ' ' + thefile.readline()
But there are many problems in your code. For example:
What about a multi-line file?
What about a file without a '.' at the end?

Iterate through words of a file in Python

I need to iterate through the words of a large file, which consists of a single, long long line. I am aware of methods that iterate through the file line by line; however, they are not applicable in my case because of its single-line structure.
Any alternatives?
It really depends on your definition of word. But try this:
f = file("your-filename-here").read()
for word in f.split():
    # do something with word
    print word
This will use whitespace characters as word boundaries.
Of course, remember to properly open and close the file, this is just a quick example.
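A version that also opens and closes the file properly might look like this sketch:
with open("your-filename-here") as f:
    for word in f.read().split():
        # do something with word
        print(word)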
Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.
First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.
If not, use a generator along these lines (wrapped in a function here so the snippet runs as-is):
def read_words(input_file):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
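A usage sketch, assuming the read_words generator above (the file name is hypothetical):
with open('huge_single_line.txt') as input_file:
    for word in read_words(input_file):
        print(word)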
You really should consider using a generator:
def word_gen(file):
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    for word in word_gen(f):
        print(word)
There are more efficient ways of doing this, but syntactically, this might be the shortest:
words = open('myfile').read().split()
If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):
Here is my totally functional approach which avoids having to read and split lines. It makes use of the itertools module:
Note: for Python 3, replace itertools.imap with map.
import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
                            itertools.imap(mfile.read,
                                           itertools.repeat(1))), str.isspace)
    return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:
>>> import sys
>>> for w in readwords(sys.stdin):
... print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>
I guess in your case, this would be the way to use the function:
with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
Read in the line as normal, then split it on whitespace to break it down into words?
Something like:
word_list = loaded_string.split()
After reading the line you could do:
l = len(pattern)
i = 0
while True:
    i = line.find(pattern, i)
    if i == -1:
        break
    print(line[i:i+l])  # or do whatever
    i += l
Alex.
What Donald Miner suggested looks good. Simple and short. I used the code below in something I wrote some time ago:
l = []
f = open("filename.txt", "rU")
for line in f:
    for word in line.split():
        l.append(word)
It is a longer version of what Donald Miner suggested.

Python - need fast algorithm that removes all words in a file that are derivatives of other words

We have a file named wordlist, which contains 1,876 KB worth of alphabetized words, all of which are longer than 4 letters and separated by carriage returns, grouped by their first two letters (ab, ac, ad, etc.):
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only words that are not derivatives of other, smaller words. For example, the wordlist contains the words ["abuser, abused, abusers, abuse, abuses, etc."]. The new file should retain only the word "abuse", because it is the "lowest common denominator" (if you will) among all those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".
I tried this implementation:
def root_words(wordlist):
result = []
base = wordlist[1]
for word in wordlist:
if not word.startswith(base):
result.append(base)
print base
base=word
result.append(base)
return result;
def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")
    for line in wfile:
        wordlist.append(line[:-1])
    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")
    newfile.write(wordlist)
But it always froze my computer. Any solutions?
I would do something like this:
def bases(words):
    base = next(words)
    yield base
    for word in words:
        if word and not word.startswith(base):
            yield word
            base = word

def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))
This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It's old enough to have one gig of memory.
$ time python words.py

real    0m0.233s
user    0m0.180s
sys     0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions though so I don't feel too bad about it ;)
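That slicing tweak might look like this sketch, under the assumption that every line ends with exactly one newline:
def get_bases(infile, outfile):
    with open(infile) as f_in:
        # line[:-1] drops the trailing newline without scanning for whitespace
        words = (line[:-1] for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))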
One possible improvement is to use a database to load the words and avoid loading the full input file into RAM. Another option is to process the words as you read them from the file and write the results without loading everything into memory.
The following example processes the file as it is read, without pre-loading anything into memory.
def root_words(f, out):
    base = f.readline().strip()
    for line in f:
        word = line.strip()
        if not word.startswith(base):
            out.write(base + "\n")
            base = word
    out.write(base + "\n")

def main():
    wfile = open("wordlist.txt", "r+")
    newfile = open("newwordlist.txt", "w")
    root_words(wfile, newfile)
    wfile.close()
    newfile.close()
The memory complexity of this solution is O(1), since the variable base is the only thing you need in order to process the file. This works thanks to the fact that the file is alphabetically sorted.
Since the list is alphabetized, this does the trick (it takes 0.4 seconds with 5 MB of data, so 1.8 MB should not be a problem):
res = [" "]
with open("wordlist.txt","r") as f:
for line in f:
tmp = line.strip()
if tmp.startswith(res[-1]):
pass
else:
res.append(tmp)
with open("newlist.txt","w") as f:
f.write('\n'.join(res[1:]))
