How do I print specific strings from text files? - python
file_contents = x.read()
#print (file_contents)
for line in file_contents:
    if "ase" in line:
        print (line)
I'm looking for all the sentences that contain the phrase "ase" in the file. When I run it, nothing is printed.
Since file_contents is the result of x.read(), it's a string, not a list of strings, so you're iterating over individual characters. Do this instead:
file_contents = x.readlines()
Now you can search your lines.
Or, if you're not planning to reuse file_contents, iterate over the file handle directly:
for line in x:
That way you don't have to call readlines() and keep the whole file in memory (if the file is big, that can make a difference).
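Putting the fix together, here is a minimal sketch of the whole thing (assuming the file is opened as x; the filename is just a placeholder):

with open("example.txt") as x:
    for line in x:
        if "ase" in line:
            print(line, end="")  # end="" because line already ends with a newline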
read() returns the whole content of the file as a single string (not line by line), so when you iterate over it you iterate over individual characters:
file_contents = """There is a ase."""
for char in file_contents:
print(char)
You can simply iterate over the file object, which yields it line by line:
for line in x:
    if "ase" in line:
        print(line)
Note that if you are actually looking for sentences rather than lines containing 'ase', it gets a bit more complicated. For example, you could read the complete file and split at '.':
for sentence in x.read().split('.'):
    if "ase" in sentence:
        print(sentence)
However, that would fail if there are periods that don't mark the end of a sentence (like abbreviations).
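As a rough workaround, you could split on sentence-ending punctuation followed by whitespace. This is only a sketch and it will still be fooled by abbreviations like "Mr."; the filename is a placeholder:

import re

with open("example.txt") as x:
    text = x.read().replace("\n", " ")  # join wrapped lines so sentences can span them

# split after '.', '!' or '?' when followed by whitespace
for sentence in re.split(r'(?<=[.!?])\s+', text):
    if "ase" in sentence:
        print(sentence)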
Related
How to get the longest word in txt file python
article = open("article.txt", encoding="utf-8")
for i in article:
    print(max(i.split(), key=len))

The text is written with line breaks, and it gives me the longest words from each line. How to get the longest word from all of the text?
One approach would be to read the entire text file into a Python string, replace the newlines, and then find the longest word:

import re

with open('article.txt', 'r') as file:
    data = re.sub(r'\r?\n', ' ', file.read())  # replace newlines with spaces so words on adjacent lines don't merge
longest_word = max(re.findall(r'\w+', data), key=len)
longest = 0
curr_word = ""
with open("article.txt", encoding="utf-8") as f:
    for line in f:  # read line by line to avoid loading a large file into memory
        for word in line.split(" "):
            word = word.strip()
            if (wl := len(word)) > longest:  # the walrus operator needs Python 3.8+, otherwise use 2 lines
                longest = wl
                curr_word = word
print(curr_word)
Instead of iterating through each line, you can get the entire text of the file and then split it using article.read().split():

article = open("test.txt", encoding="utf-8")
print(max(article.read().split(), key=len))
article.close()
There are many ways by which you could do that. This would work:

with open("article.txt", encoding="utf-8") as article:
    txt = [word for item in article.readlines() for word in item.split(" ")]
    biggest_word = sorted(txt, key=lambda word: (-len(word), word))[0]

Note that I am using a with statement to close the connection to the file when the reading is done, that I use readlines to read the entire file, returning a list of lines, and that I unpack the split items twice to get a flat list of words. The last line of code sorts the list and uses -len(word) to invert the sorting from ascending to descending. I hope this is what you are looking for :)
If your file is small enough to fit in memory, you can read it all at once:

file = open("article.txt", encoding="utf-8", mode='r')
all_text = file.read()
longest = max(all_text.split(), key=len)
print(longest)
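For completeness, the same idea as a compact sketch (assuming the file fits in memory and words are separated by whitespace):

with open("article.txt", encoding="utf-8") as f:
    print(max(f.read().split(), key=len))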
read words from file, line by line and concatenate to paragraph
I have a really long list of words, one on each line. How do I make a program that takes in all of that and prints them all side by side? I tried making each word an element of a list, but I don't know how to proceed. Here's the code I've tried so far:

def convert(lst):
    return([i for item in lst for i in item.split()])

lst = ['''
-The list of words come here-
''']
print(convert(lst))
If you already have the words in a list, you can use the join() function to concatenate them. See https://docs.python.org/3/library/stdtypes.html#str.join

words = [word.strip() for word in open('your_file.txt').readlines()]  # strip the trailing newlines
separator = ' '
print(separator.join(words))

Another, a little bit more cumbersome method would be to print the words using the builtin print() function but suppress the newline that print() normally adds automatically to the end of its argument:

words = open('your_file.txt').readlines()
for word in words:
    print(word.strip(), end=' ')
Try this, where example.txt just has a list of words going down line by line:

with open("example.txt", "r") as a_file:
    sentence = ""
    for line in a_file:
        stripped_line = line.strip()
        sentence = sentence + f"{stripped_line} "

print(sentence)
If your input file is really large and you can't fit it all in memory, you can read the words lazily and write them to disk instead of holding the whole output in memory:

# create a generator that yields each individual line
lines = (l for l in open('words'))

with open("output", "w+") as writer:
    # read the file line by line to avoid memory issues
    while True:
        try:
            line = next(lines)
            # add to the paragraph in the out file
            writer.write(line.replace('\n', ' '))
        except StopIteration:
            break

You can check the working example here: https://replit.com/#bluebrown/readwritewords#main.py
Split a .txt at each period instead of by line?
I am attempting to split a .txt file by sentence into a list, but my coding efforts can only split by line. Example of .txt contents:

This is line 1 of txt file,
it is now on line 2.
Here is the
second sentence between line 2 and 3.

Code:

listed = []
with open("example.txt","r") as text:
    Line = text.readline()
    while Line != "":
        Line1 = Line.split(".")
        for sentence in Line1:
            listed.append(sentence)
        Line = text.readline()
print(listed)

This would print something like:

['This is line 1 of txt file,\n', 'it is now on line 2\n', 'Here is the\n', 'second sentence between line 2 and 3\n']

If the entire document was on one line, this would work correctly, except for cases like "Mr." and "Mrs." and such. However, that's a future worry. Does anyone out there know how to use split in the above scenario?
Assuming every sentence ends with a dot, you can just: read the whole file (fic.read()), replace the return characters with spaces (replace('\n', ' ')), split on the dot, apply strip to each sentence to remove leading and trailing spaces, and keep the sentences.

with open("data.txt", "r") as fic:
    content = fic.read().replace('\n', ' ')
    sentences = list(map(str.strip, content.split(".")))

A more detailed version:

with open("data.txt", "r") as fic:
    content = fic.read()
    content = content.replace('\n', ' ')
    sentences = content.split(".")
    sentences = list(map(str.strip, sentences))  # same as sentences = [s.strip() for s in sentences]
split on a string will split on whatever you ask it to, without regard to line breaks; just use read() to pull in the whole file instead of readlines(). The issue becomes whether that's too much text to handle in a single read; if so, you'll need to be more clever. You'll probably also want to filter out the actual line breaks to get the effect of one string per sentence.
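A minimal sketch of that idea (assuming the whole file fits in memory and every '.' really ends a sentence; the filename is a placeholder):

with open("example.txt", "r") as text:
    whole = text.read().replace("\n", " ")  # drop the line breaks so sentences can span lines

listed = [s.strip() for s in whole.split(".") if s.strip()]
print(listed)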
strings in file do not match to string in a set
I have a file with a word on each line and a set of words, and I want to write the words from the set that are not already in the file (collected in a set called 'out') to the file. Here is part of my code:

def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    fin = open(self.finalFile, "r")
    out = set()
    for line in self.lines_seen:  # lines_seen is a set with words
        if line not in fin:
            out.add(line)
        else:
            print line
    fin.close()
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)

But it only matches a few of the words that really are equal. I run it on the same dictionary of words and it adds repeated words to the file on each run. What am I doing wrong? What is happening? I tried the '==' and 'is' comparators and I get the same result.

Edit 1: I am working with huge files (finalFile), which can't be fully loaded into RAM, so I think I should read the file line by line.

Edit 2: Found a big problem with the file pointer:

def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    out = set()
    out.clear()
    with open(self.finalFile, "r") as fin:
        for word in self.lines_seen:
            fin.seek(0, 0)  # with this line, speed drops to 40 lines/second; without it, it doesn't work
            if word in fin:
                self.totalmatches = self.totalmatches + 1
            else:
                out.add(word)
                self.totalLines = self.totalLines + 1
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)

If I put the lines_seen loop before opening the file, I open the file once for each line in lines_seen, but the speed only goes up to 30k lines/second. With set() I am getting 200k lines/second at worst, so I think I will load the file in parts and compare them using sets. Any better solution?

Edit 3: Done!
fin is a filehandle, so you can't compare against it with if line not in fin. The content needs to be read first.

with open(self.finalFile, "r") as fh:
    fin = fh.read().splitlines()  # fin is now a list of words from finalFile

for line in self.lines_seen:  # lines_seen is a set with words
    if line not in fin:
        out.add(line)
    else:
        print line
# remove fin.close()

EDIT: Since lines_seen is a set, try to create a new set with the words from finalFile, then diff the sets:

file_set = set()
with open(self.finalFile, "r") as fh:
    for f_line in fh:
        file_set.add(f_line.strip())

# This will give you all the words in finalFile that are not in lines_seen.
print file_set.difference(self.lines_seen)
Your comparison is likely not working because the lines read from the file will have a newline at the end, so you are comparing 'word\n' to 'word'. Using rstrip will help remove the trailing newlines:

>>> foo = 'hello\n'
>>> foo
'hello\n'
>>> foo.rstrip()
'hello'

I would also iterate over the file, rather than iterate over the variable containing the words you would like to check against. If I've understood your code, you would like to write anything that is in self.lines_seen to self.finalFile, if it is not already in it. If you use 'if line not in fin' as you have, this will not work as you're expecting. For example, if your file contains:

lineone
linetwo
linethree

and the set lines_seen, being unordered, returns 'linethree' and then 'linetwo', then the following will match 'linethree' but not 'linetwo', because the file object has already read past it:

with open(self.finalFile, "r") as fin:
    for line in self.lines_seen:
        if line not in fin:
            print line

Instead, consider using a counter:

from collections import Counter

linecount = Counter()
# using 'with' means you don't have to worry about closing it once the block ends
with open(self.finalFile, "r") as fin:
    for line in fin:
        line = line.rstrip()  # remove the right-most whitespace/newline
        linecount[line] += 1

for word in self.lines_seen:
    if word not in linecount:
        out.add(word)
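Pulling the two answers together, here is a hedged sketch of the set-based approach the question's edits converge on. The filename and the example set below are placeholders (not the original self.finalFile / self.lines_seen), and the file is assumed to already exist:

lines_seen = {"alpha", "beta", "gamma"}  # hypothetical example; stands in for self.lines_seen

with open("final.txt", "r") as fin:  # "final.txt" stands in for self.finalFile
    file_words = {line.rstrip("\n") for line in fin}  # strip newlines so the comparison matches

out = lines_seen - file_words  # words seen but not yet present in the file

with open("final.txt", "a") as fout:
    for word in out:
        fout.write(word + "\n")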
Splitting lines in python based on some character
Input:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

'!' is the starting character and +0013 should be the ending of each line (if present).

The problem I am getting: the output looks like this instead:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W

Any help would be highly appreciated...!!!

My code:

file_open = open('sample.txt', 'r')
file_read = file_open.read()
file_open2 = open('output.txt', 'w+')
counter = 0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter = counter - 1
        counter = counter + 1
    file_open2.write(i)
You can try something like this:

with open("abc.txt") as f:
    data = f.read().replace("\r\n", "")  # replace the newlines with ""
    # the newline can be "\n" in your system instead of "\r\n"
    ans = filter(None, data.split("!"))  # split the data at '!', then filter out empty strings
    for x in ans:
        print "!" + x  # or write to some other file

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?

lines = file_read.split('!')

Now lines is a list which holds the split data. This is almost the lines you want to write -- the only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:

file_open2.writelines('!{0}\n'.format(line) for line in lines)

You might need:

file_open2.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)

if you find that you're getting more newlines than you wanted in the output. A few other points: when opening files, it's nice to use a context manager -- this makes sure that the file is closed properly:

with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile', 'w') as fout:
    fout.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)
Another option, using replace instead of split, since you know the starting and ending characters of each line:

In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')

In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:

import re

outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()

It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4 After we have a match, we strip out the newlines from the match and write it to the file.
Let's try to add a \n before every "!", then let Python splitlines :-) :

file_read.replace("!", "\n!").splitlines()

(You may also want to drop the original line breaks first, e.g. file_read.replace("\n", "").replace("!", "\n!").splitlines(), since the input records are wrapped across lines.)
I would actually implement this as a generator so that you can work on the data stream rather than the entire content of the file. This is quite memory friendly when working with huge files:

>>> def split_on_stream(it, sep="!"):
        prev = ""
        for line in it:
            line = (prev + line.strip()).split(sep)
            for parts in line[:-1]:
                yield parts
            prev = line[-1]
        yield prev

>>> with open("test.txt") as fin:
        for parts in split_on_stream(fin):
            print parts

,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.