I am reading a file with a different string on each line. I want to be able to search an input string for a substring that matches an entire line in the file and then save that substring so that it can be printed. This is what I have right now:
wordsDoc = open('Database.doc', 'r', encoding='latin-1')
words = wordsDoc.read().lower()
matching = [string for string in words if string in op_text]
But this matches on each character. How would I do this properly?
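(For reference, iterating over a string yields its individual characters, which is why the comprehension above tests one character at a time:
>>> words = "first line\nsecond line"
>>> [c for c in words][:6]
['f', 'i', 'r', 's', 't', ' ']
)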
This will create a list named "matching" containing all the lines in the file that exactly match the string in op_text, once lowercased.
with open('Database.doc', 'r', encoding='latin-1') as wordsDoc:
    # rstrip('\n') so the trailing newline doesn't break the comparison
    matching = [line for line in wordsDoc if op_text == line.rstrip('\n').lower()]
I assume the idea is that there is some search phrase and if it is contained in any line from the file, you want to filter those lines out.
Try this, which will compare the lower cased version of the line, but will return the original line from the file if it contains the search_key.
with open('somefile.doc') as f:
    matching = [line for line in f if search_key in line.lower()]
Couple of comments:
First, using with to open a file is usually better:
with open('Database.doc', 'r', encoding='latin-1') as f:
    # closes the file automagically at the end of this block...
Second, there is no need to read in the whole file unless you are doing something with the file as a whole. Since you are searching lines, deal with the lines one by one:
matches = []
with open('Database.doc', 'r', encoding='latin-1') as f:
    for line in f:
        if string in line.lower():
            matches.append(line)
If you are trying to match the entire line:
matches = []
with open('Database.doc', 'r', encoding='latin-1') as f:
    for line in f:
        # strip the trailing newline so a whole-line comparison can succeed
        if string == line.rstrip('\n').lower():
            matches.append(line)
Or, more Pythonically, with a list comprehension:
with open('Database.doc', 'r', encoding='latin-1') as f:
    matches = [line for line in f if line.rstrip('\n').lower() == string]
etc...
Related
So I have a txt file where I am required to add a phrase at the end of every line.
Note that the same phrase is added on every line.
So what I need is:
before:
here are some words
some words are also here
vlavlavlavlavl
blaaablaabalbaaa

after:
here are some words, the end
some words are also here, the end
vlavlavlavlavl, the end
blaaablaabalbaaa, the end
I also tried this method:
with open("Extracts.txt", encoding="utf-8") as f:
for line in f:
data = [line for line in f]
with open("new.txt", 'w', encoding="utf-8") as f:
for line in data:
f.write(", Deposited")
f.write(line)
but the word was shown at the beginning of the line and not the end.
Each line ends with a newline. Remove the newline, then write the line and the addition, followed by a newline.
There's also no need to read the lines into a list first, you can just iterate over the input file directly.
with open("Extracts.txt", encoding="utf-8") as infile, open("new.txt", 'w', encoding="utf-8") as outfile:
for line in infile:
line = line.rstrip("\n")
outfile.write(f"{line}, Deposited\n")
You can first get all the lines in the text file using the readlines method, and then append the text you want to each line:
with open("Extracts.txt", encoding="utf-8") as f:
data = f.readlines()
new_data = []
for line in data:
line = line.replace("\n", "")
line += " , Deposited\n"
new_data.append(line)
with open("new.txt", "w", encoding="utf-8") as f:
f.writelines(new_data)
As mkrieger1 already said, the order of operations here is wrong. You are writing the ", Deposited" to the file before you're writing the content of the line in question. So a working version of the code swaps those operations:
with open("Extracts.txt", encoding="utf-8") as f:
for line in f:
data = [line for line in f]
with open("new.txt", 'w', encoding="utf-8") as f:
for line in data:
f.write(line.strip())
f.write(", Deposited\n")
Note that I also added a strip() call when handling the line of text; this removes whitespace at the start and end of the string, getting rid of any stray line breaks before the ", Deposited". The line break is then manually added back at the end of the string as the literal "\n".
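For example, strip() removes both leading and trailing whitespace, including the newline:
>>> "  here are some words \n".strip()
'here are some words'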
In Python, calling e.g. temp = open(filename,'r').readlines() results in a list in which each element is a line from the file. However, these strings have a newline character at the end, which I don't want.
How can I get the data without the newlines?
You can read the whole file and split lines using str.splitlines:
temp = file.read().splitlines()
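For example, splitlines drops the line endings (and also handles \r\n endings):
>>> "foo\nbar\r\nbaz".splitlines()
['foo', 'bar', 'baz']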
Or you can strip the newline by hand:
temp = [line[:-1] for line in file]
Note: this last solution only works if the file ends with a newline, otherwise the last line will lose a character.
This assumption is true in most cases (especially for files created by text editors, which often do add an ending newline anyway).
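For example, simulating a file whose last line has no trailing newline:
>>> text = "first\nlast"  # no newline after 'last'
>>> [line[:-1] for line in text.splitlines(keepends=True)]
['first', 'las']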
If you want to avoid this you can add a newline at the end of file:
# first make sure the file ends with a newline; binary mode is used because
# Python 3 text-mode files don't allow end-relative seeks (assumes a non-empty file)
with open(the_file, 'rb+') as f:
    f.seek(-1, 2)  # go to the last byte of the file
    if f.read(1) != b'\n':
        # add missing newline if not already present
        f.write(b'\n')

with open(the_file, 'r') as f:
    lines = [line[:-1] for line in f]
Or a simpler alternative is to strip the newline instead:
[line.rstrip('\n') for line in file]
Or even, although pretty unreadable:
[line[:-(line[-1] == '\n') or len(line)+1] for line in file]
This exploits the fact that the return value of or isn't a boolean, but the operand that was evaluated as truthy or falsy.
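For example:
>>> -1 or 11
-1
>>> 0 or 11
11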
The readlines method is actually equivalent to:
def readlines(self):
    lines = []
    for line in iter(self.readline, ''):
        lines.append(line)
    return lines

# or equivalently

def readlines(self):
    lines = []
    while True:
        line = self.readline()
        if not line:
            break
        lines.append(line)
    return lines
Since readline() keeps the newline, readlines() keeps it as well.
Note: for symmetry to readlines() the writelines() method does not add ending newlines, so f2.writelines(f.readlines()) produces an exact copy of f in f2.
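A quick check of that symmetry (the file names f.txt and f2.txt here are just examples):

with open('f.txt') as f, open('f2.txt', 'w') as f2:
    f2.writelines(f.readlines())  # f2 now has the same lines, newlines included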
temp = open(filename,'r').read().split('\n')
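One caveat with split('\n') compared to splitlines: if the file ends with a newline, you get an extra empty string at the end:
>>> 'a\nb\n'.split('\n')
['a', 'b', '']
>>> 'a\nb\n'.splitlines()
['a', 'b']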
Reading the file one row at a time, and removing unwanted characters from the end of the string with str.rstrip(chars):
with open(filename, 'r') as fileobj:
    for row in fileobj:
        print(row.rstrip('\n'))
See also str.strip([chars]) and str.lstrip([chars]).
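The three differ only in which side they strip from:
>>> '  xx  '.strip(), '  xx  '.lstrip(), '  xx  '.rstrip()
('xx', 'xx  ', '  xx')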
I think this is the best option.
temp = [line.strip() for line in file.readlines()]
temp = open(filename,'r').read().splitlines()
My preferred one-liner -- if you don't count from pathlib import Path :)
lines = Path(filename).read_text().splitlines()
This also auto-closes the file, so there is no need for with open()...
Added in Python 3.5.
https://docs.python.org/3/library/pathlib.html#pathlib.Path.read_text
Try this:
u=open("url.txt","r")
url=u.read().replace('\n','')
print(url)
To get rid of trailing end-of-line (\n) characters and of empty list values (''), try:
f = open(path_sample, "r")
lines = [line.rstrip('\n') for line in f.readlines() if line.strip() != '']
You can read the file as a list easily using a list comprehension
with open("foo.txt", 'r') as f:
lst = [row.rstrip('\n') for row in f]
my_file = open("first_file.txt", "r")
for line in my_file.readlines():
if line[-1:] == "\n":
print(line[:-1])
else:
print(line)
my_file.close()
This script will take the lines from file and save every line, stripped of its newline and with ,0 appended, into file2.
file = open("temp.txt", "+r")
file2 = open("res.txt", "+w")
for line in file:
file2.writelines(f"{line.splitlines()[0]},0\n")
file2.close()
If you look at line, its value is data\n, so we use splitlines() to turn it into a list, and [0] picks out just the word data without the newline.
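For example:
>>> 'data\n'.splitlines()
['data']
>>> 'data\n'.splitlines()[0]
'data'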
import csv

with open(filename) as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        print(line[0])
I am reading a file, getting the first element from the start of each line, and comparing it to my list; if it is found, I append the line to a new output file that is supposed to be exactly like the input file in terms of structure.
my_id_list = [
    4985439,
    5605471,
    6144703,
]
input file:
4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733
my attempt:
import numpy as np

output_file = []
input_file = open('input_file', 'r')
for line in input_file:
    my_line = np.array(line.split())
    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')
Question is:
It is currently adding an extra empty line after each line written to the output file. How can I fix it? Or is there any other way to do it more efficiently?
update:
output file should be for this example:
4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507
Try something like this:

# read lines and strip trailing newline characters
with open('input_file', 'r') as f:
    input_lines = [line.strip() for line in f]

# split() returns strings, so compare against string versions of the ids
id_strings = {str(i) for i in my_id_list}

# collect all the lines that match your id list
output_file = [line for line in input_lines if line.split()[0] in id_strings]

# write to output file
with open('output_file', 'w') as f:
    f.write('\n'.join(output_file))
I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:
my_id_list = {'4985439', '5605471', '6144703'}  # a set is faster for membership testing; strings, since split() below returns strings
with open('input_file') as input_file:
    # Your problem is most likely related to line endings, so here
    # we read the input file into a list of lines with intact line endings.
    # To preserve the input exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)

with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)
Getting the first word of each line this way is a bit wasteful, since this:
first_word = line.split()[0]
creates a list of all "words" in the line when we just need the first one.
If you know that the columns are separated by spaces you can make it more efficient by only splitting on the first space:
first_word = line.split(' ', 1)[0]
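Both forms give the same first field; the maxsplit version just stops splitting after the first space. Using a line from the input file above:
>>> line = '5303698 6:0.09407 19:0.132581'
>>> line.split(' ', 1)
['5303698', '6:0.09407 19:0.132581']
>>> line.split(' ', 1)[0]
'5303698'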
I have a file with a word on each line and a set of words, and I want to add the words from the set that are not already in the file (collected in a set called 'out') to the file. Here is part of my code:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    fin = open(self.finalFile, "r")
    out = set()
    for line in self.lines_seen:  # lines_seen is a set with words
        if line not in fin:
            out.add(line)
        else:
            print line
    fin.close()
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
but it only matches a few of the words that are really equal. I run it with the same dictionary of words, and it adds repeated words to the file on each run. What am I doing wrong? What is happening? I tried using the '==' and 'is' comparators and got the same result.
Edit 1: I am working with huge files (finalFile), which can't be fully loaded into RAM, so I think I should read the file line by line.
Edit 2: Found a big problem with the file pointer:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    out = set()
    out.clear()
    with open(self.finalFile, "r") as fin:
        for word in self.lines_seen:
            fin.seek(0, 0)  # with this line speed drops to 40 lines/second; without it, it doesn't work
            if word in fin:
                self.totalmatches = self.totalmatches + 1
            else:
                out.add(word)
                self.totalLines = self.totalLines + 1
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
If I put the lines_seen loop before opening the file, I open the file once for each line in lines_seen, but speed only goes up to 30k lines/second. With set() I get 200k lines/second at worst, so I think I will load the file in parts and compare them using sets. Any better solution?
Edit 3: Done!
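For reference, the "load the file in parts and compare using sets" idea from Edit 2 might be sketched like this (a rough sketch, not the asker's actual final code; the chunk size is made up, and it assumes the stored words keep their trailing newlines, as in the code above):

out = set(self.lines_seen)
with open(self.finalFile, "r") as fin:
    while True:
        chunk = fin.readlines(1024 * 1024)  # read roughly 1 MB worth of lines at a time
        if not chunk:
            break
        out.difference_update(chunk)  # drop words already present in the file
# 'out' now holds the words from lines_seen that were not in the file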
fin is a filehandle, so if line not in fin won't reliably do what you want. The content needs to be read first.
with open(self.finalFile, "r") as fh:
fin = fh.read().splitlines() # fin is now a list of words from finalFile
for line in self.lines_seen: #lines_seen is a set with words
if line not in fin:
out.add(line)
else:
print line
# remove fin.close()
EDIT:
Since lines_seen is a set, try to create a new set with the words from finalFile then diff the sets?
file_set = set()
with open(self.finalFile, "r") as fh:
    for f_line in fh:
        file_set.add(f_line.strip())  # the original had 'new_set' here, which was never defined

# This will give you all the words in finalFile that are not in lines_seen.
print file_set.difference(self.lines_seen)
Your comparison is likely not working because the lines read from the file will have a newline at the end, so you are comparing 'word\n' to 'word'. Using 'rstrip' will help remove the trailing newlines:
>>> foo = 'hello\n'
>>> foo
'hello\n'
>>> foo.rstrip()
'hello'
I would also iterate over the file rather than over the variable containing the words you want to check against. If I've understood your code, you would like to write anything that is in self.lines_seen to self.finalFile, if it is not already in it. Using 'if line not in fin' as you have will not work as you expect. For example, if your file contains:
lineone
linetwo
linethree
and the set lines_seen, being unordered, yields 'linethree' and then 'linetwo', then the following will match 'linethree' but not 'linetwo', because the file object has already read past it:
with open(self.finalFile,"r" as fin:
for line in self.lines_seen:
if line not in fin:
print line
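You can see this consumption effect without a real file by using StringIO as a stand-in file object (Python 2, to match the code above):

from StringIO import StringIO
fin = StringIO('lineone\nlinetwo\nlinethree\n')
print 'linethree\n' in fin  # True, but the whole "file" has now been consumed
print 'linetwo\n' in fin    # False: the iterator is already past that line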
Instead, consider using a counter:
from collections import Counter

linecount = Counter()

# using 'with' means you don't have to worry about closing the file once the block ends
with open(self.finalFile, "r") as fin:
    for line in fin:
        line = line.rstrip()  # remove the right-most whitespace/newline
        linecount[line] += 1

for word in self.lines_seen:
    if word not in linecount:
        out.add(word)
I have a text file that has a sentence on each line, and I have a word list. I just want to get only the sentences that contain at least one word from the list. Is there a Pythonic way to do that?
sentences = [line for line in f if any(word in line for word in word_list)]
Here f would be your file object, for example you could replace it with open('file.txt') if file.txt was the name of your file and it was located in the same directory as the script.
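Put together as a runnable sketch (the file name and word list here are just examples):

word_list = ['apple', 'banana']
with open('file.txt') as f:
    sentences = [line for line in f if any(word in line for word in word_list)]
print(sentences)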
Using set.intersection:
with open('file') as f:
    matching = [line for line in f if set(line.lower().split()).intersection(word_set)]

or with filter:

matching = filter(lambda x: word_set.intersection(set(x.lower().split())), f)
This will give you a start:

words = ['a', 'and', 'foo']
infile = open('myfile.txt', 'r')
match_sentences = []
for line in infile.readlines():
    # check for words in this line; if there's a match, append to the match_sentences list
    if any(word in line.split() for word in words):
        match_sentences.append(line)
infile.close()