I am attempting to split a .txt file by sentence into a list, but my coding efforts can only split by line.
Example of .txt contents:
This is line 1 of txt file,
it is now on line 2. Here is the
second sentence between line 2 and 3.
Code
listed = []
with open("example.txt","r") as text:
    Line = text.readline()
    while Line!="":
        Line1 = Line.split(".")
        for sentence in Line1:
            listed.append(sentence)
        Line = text.readline()
print(listed)
This would print something like: ['This is line 1 of txt file,\n','it is now on line 2\n', 'Here is the\n','second sentence between line 2 and 3\n']
If the entire document was on one line, this would work correctly, except for cases like "Mr." and "Mrs." and such. However, that's a future worry. Does anyone out there know how to use split in the above scenario?
Assuming every sentence ends with a dot (.), you can:
read the whole file: fic.read()
replace each newline with a space, so that words on adjacent lines do not run together: replace('\n', ' ')
split on the dot
apply strip to each sentence to remove leading and trailing whitespace
keep the non-empty sentences
with open("data.txt", "r") as fic:
    content = fic.read().replace('\n', ' ')
sentences = [s for s in map(str.strip, content.split(".")) if s]
A more detailed version:
with open("data.txt", "r") as fic:
    content = fic.read()
content = content.replace('\n', ' ')
sentences = content.split(".")
sentences = list(map(str.strip, sentences))
# same as
sentences = [s.strip() for s in sentences]
# drop the empty string left over after the final dot
sentences = [s for s in sentences if s]
Calling split on a string splits on whatever separator you give it, without regard to line breaks, so just call read() to pull in the whole file instead of reading it line by line with readline(). The issue then becomes whether that is too much text to handle in a single read; if so, you'll need to be more clever. You'll probably also want to filter out the actual line breaks to get the effect of one string per sentence.
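If the file really is too big for a single read, one way to be "more clever" is to keep a small buffer while iterating over the lines. This is only a rough sketch under the same assumption as above (every sentence ends with a dot); the name iter_sentences is made up for illustration.

```python
def iter_sentences(lines):
    """Yield sentences from an iterable of lines, joining text across line breaks."""
    buffer = ""
    for line in lines:
        # Treat the line break as a single space so words do not run together.
        buffer += line.rstrip("\n") + " "
        # Everything before the last "." is complete; the remainder stays buffered.
        *complete, buffer = buffer.split(".")
        for sentence in complete:
            sentence = sentence.strip()
            if sentence:
                yield sentence
    # Text left without a closing "." is still yielded at the end.
    if buffer.strip():
        yield buffer.strip()


# In-memory stand-in for the example file from the question.
sample = [
    "This is line 1 of txt file,\n",
    "it is now on line 2. Here is the\n",
    "second sentence between line 2 and 3.\n",
]
sentences = list(iter_sentences(sample))
print(sentences)
```

Since a file object is itself an iterable of lines, passing the handle from open("example.txt") to iter_sentences should work the same way without loading the whole file at once.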
Related
I have a .txt file which contains 4 texts, and I'd like to create a list in which each of the four texts appears as its own entry, so that I end up with 4 objects in the list. The code should do something like: read the text line by line (appending the lines within a doc), but as soon as you get to 'x doc of 4', start a new entry. I've tried the following, which does not create what I want:
with open('testfile.txt') as f:
    myList = f.readlines()
myList = [x.strip() for x in myList]
testfile.txt
1 doc of 4
Hello World.
This is another question
2 doc of 4
This is a new text file.
Not much in it.
3 doc of 4
This is the third text.
It contains separate info.
4 doc of 4
The final text.
A short one.
expected output for myList:
myList=['Hello World. This is another question',
'This is a new text file. Not much in it.',
'This is the third text. It contains separate info.',
'The final text. A short one.']
Sure.
Something like this will do – it will crash miserably if the document does not start with a header line, though.
import re
# This will hold each document as a list of lines.
# To begin with, there are no documents.
myList = []
# Define a regular expression to match header lines.
header_line_re = re.compile(r'\d+ doc of \d+')
with open('testfile.txt') as f:
    for line in f:  # For each line...
        line = line.strip()  # Remove leading and trailing whitespace
        if header_line_re.match(line):  # If the line matches the header line regular expression...
            myList.append([])  # Start a new group within `myList`,
            continue  # then skip processing the line further.
        if line:  # If the line is not empty, simply add it to the last group.
            myList[-1].append(line)
# Recompose the lines back to strings (separated by spaces, not newlines).
myList = [' '.join(doc) for doc in myList]
print(myList)
The output is:
[
"Hello World. This is another question",
"This is a new text file. Not much in it.",
"This is the third text. It contains separate info.",
"The final text. A short one.",
]
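If crashing on a file that does not start with a header line is a concern, a small variation can open an implicit first document instead. This is just a sketch of that idea, not a change to the approach above; group_documents and the "stray line" sample are made up for illustration.

```python
import re

# Same header pattern as in the answer above.
header_line_re = re.compile(r"\d+ doc of \d+")


def group_documents(lines):
    """Group lines into documents; each header line starts a new document."""
    docs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if header_line_re.match(line):
            docs.append([])
            continue
        if not docs:
            # Text before the first header: open an implicit first document
            # instead of failing on docs[-1].
            docs.append([])
        docs[-1].append(line)
    return [" ".join(doc) for doc in docs]


sample = """Stray line before any header.
1 doc of 4
Hello World.
This is another question
2 doc of 4
This is a new text file.
Not much in it.
""".splitlines()
grouped = group_documents(sample)
print(grouped)
```

Whether stray leading text should become its own entry or be dropped depends on the data; dropping it would only need a continue in place of the implicit docs.append([]).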
I want to read various lines from a file, for example:
hello I live in London.
hello I study.
Then, based on the first word, I want to remove the line from the file.
Can I put each sentence in an array?
You can read the entire contents of the file into memory (into a list), choose which lines you wish to keep, and write those to a new file (you can replace the old one if you wish).
For example:
with open("input.txt", 'r') as f:
    old_lines = f.readlines()
new_lines = []
for line in old_lines:
    words = line.split()
    # `words` is empty for blank lines, so check it before indexing.
    if words and words[0] == 'hello':  # if the first word is "hello", keep it.
        new_lines.append(line)
with open("output.txt", 'w') as f:
    for line in new_lines:
        f.write(line)
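If you do want to replace the old file, one possible sketch is to write the kept lines to a temporary file and then swap it in with os.replace. Note that, unlike the example above, this version removes the lines whose first word matches, which is what the question asks for. The function name drop_lines_starting_with and the demo file contents are made up for illustration.

```python
import os
import tempfile


def drop_lines_starting_with(path, first_word):
    """Rewrite the file at `path` without lines whose first word is `first_word`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in.
    with open(path, "r") as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as dst:
        for line in src:
            words = line.split()
            if words and words[0] == first_word:
                continue  # drop this line
            dst.write(line)
    # Both files are closed here; replace the original with the filtered copy.
    os.replace(dst.name, path)


# Demo on a throwaway file so no real data is touched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "input.txt")
with open(demo_path, "w") as f:
    f.write("hello I live in London.\nhello I study.\ngoodbye for now.\n")
drop_lines_starting_with(demo_path, "hello")
with open(demo_path) as f:
    remaining = f.read()
print(remaining)
```

Writing to a separate file first means the original is left untouched if something fails halfway through the loop.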
file_contents = x.read()
#print (file_contents)
for line in file_contents:
    if "ase" in line:
        print (line)
I'm looking for all the sentences that contain the phrase "ase" in the file. When I run it, nothing is printed.
Since file_contents is the result of x.read(), it's a string not a list of strings.
So you're iterating on each character.
Do that instead:
file_contents = x.readlines()
Now you can search in your lines.
Or, if you're not planning to reuse file_contents, iterate over the file handle directly with:
for line in x:
so you don't have to call readlines() and store the whole file in memory (if it's big, that can make a difference).
read will return the whole content of the file (not line by line) as string. So when you iterate over it you iterate over the single characters:
file_contents = """There is a ase."""
for char in file_contents:
    print(char)
You can simply iterate over the file object (which returns it line-by-line):
for line in x:
    if "ase" in line:
        print(line)
Note that if you actually look for sentences instead of lines where 'ase' is contained it will be a bit more complicated. For example you could read the complete file and split at .:
for sentence in x.read().split('.'):
    if "ase" in sentence:
        print(sentence)
However that would fail if there are .s that don't represent the end of a sentence (like abbreviations).
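To illustrate one way around the abbreviation problem, here is a rough sketch that splits after ".", "!" or "?" followed by whitespace, and glues a piece back onto the next one when it ends in a known abbreviation. The ABBREVIATIONS set is a made-up, deliberately tiny list and the sample text is invented; real text would need a longer list or a proper sentence tokenizer.

```python
import re

# Hypothetical, deliberately short list; extend it for real text.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text into sentences without breaking after known abbreviations."""
    # Collapse line breaks and repeated spaces into single spaces first.
    text = " ".join(text.split())
    sentences = []
    current = ""
    # Split after ".", "!" or "?" when followed by whitespace.
    for piece in re.split(r"(?<=[.!?])\s+", text):
        current = f"{current} {piece}".strip()
        last_word = current.split()[-1] if current else ""
        if last_word in ABBREVIATIONS:
            continue  # not a real sentence end; keep accumulating
        if current:
            sentences.append(current)
        current = ""
    if current:
        sentences.append(current)
    return sentences


text = "Mr. Smith studies lipase.\nThe enzyme is an\nesterase. Nothing else here."
matches = [s for s in split_sentences(text) if "ase" in s]
print(matches)
```

With a file handle x as in the question, split_sentences(x.read()) would be the equivalent call.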
This is the structure of the txt file (repeated units of CDS-text-ORIGIN):
CDS 311..>428
/gene="PNR"
/codon_start=1
/product="photoreceptor-specific nuclear receptor"
/protein_id="AAD28302.1"
/db_xref="GI:4726077"
/translation="METRPTALMSSTVAAAAPAAGAASRKESPGRWGLGEDPT"
ORIGIN
I want to pull out the text from 311..>428 to GEDPT" as a string.
The regex I have so far is:
compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL|re.MULTILINE)
I then use a loop to add each string to a list:
for line in file:
    match = compiler.match(line)
    if match:
        list.append(str(match.group(1)))
But I keep getting an empty list! Any ideas why?
Help would be much appreciated, I'm new to this!
I am assuming that file is a filepointer such as file = open('filename.txt'). If that is the case then using:
for line in file:
will break each line on the newline character. So the first three lines will be:
1: ' CDS 311..>428\n'
2: ' /gene="PNR"\n'
3: ' /codon_start=1\n'
Because each line is separate, you will not match the multiline pattern unless you combine the lines. You may want to consider using:
import re

compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL|re.MULTILINE)
with open('filename.txt') as fp:
    all_text = fp.read() # this reads all the text without splitting on newlines
matches = compiler.findall(all_text) # returns a list of all matches
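To see the non-greedy pattern at work on repeated units without needing the real file, here is a small sketch on an in-memory string. The sample text is made up (shortened fields, an invented second unit) and assumes, as the pattern does, that each CDS line starts with whitespace.

```python
import re

# Same pattern as above: lazily capture everything between "CDS" and "ORIGIN".
pattern = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL | re.MULTILINE)

# Made-up sample with two CDS ... ORIGIN units.
sample = (
    "     CDS             311..>428\n"
    '                     /gene="PNR"\n'
    '                     /translation="METRPTALMSS"\n'
    "ORIGIN\n"
    "        1 acgt\n"
    "     CDS             10..50\n"
    '                     /gene="XYZ"\n'
    "ORIGIN\n"
)

# Strip the trailing newline that precedes each "ORIGIN".
blocks = [m.strip() for m in pattern.findall(sample)]
print(len(blocks))
print(blocks[0].splitlines()[0])
```

The lazy (.+?) matters here: with a greedy (.+) and re.DOTALL, the first match would run on to the last ORIGIN in the text and swallow every unit in between.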
Sorry, I'm still new to Python.
My complete code so far:
for line in file:
    line = line.split("\t")
    if my_var in line[1]:
        print line[13]
What the program should do is read lines from a file.
The lines have the following format:
"word" \t "word" \t "word" ...
The program should split each line into a list of strings containing the words
==> list = (word1, word2, word3, ...)
Then I want to test whether the word at index 1 matches a given word, and if so, print the word at index 13 (each line has the same number of elements).
What I don't understand is why writing:
for line in file:
    line = line.split("\t")
    word = line[1]
    print word
works, while
for line in file:
    line = line.split("\t")
    word = line[1]
    if my_var in word:
        print line[13]
does not work.
I'm pretty sure there is an easy solution to this problem and that I simply can't find it.
Your error is caused by the following line:
print line[16]
Your split list doesn't have 16 items; it contains only 4, and you tried to access index 16.
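One way to avoid this kind of IndexError is to check the length of the split list before indexing into it. The sketch below assumes the layout described in the question (key word at index 1, wanted word at index 13); pick_field and the sample lines are made up for illustration, and it is written for Python 3.

```python
def pick_field(line, key, key_index=1, value_index=13):
    """Return the field at value_index if the field at key_index contains key."""
    fields = line.rstrip("\n").split("\t")
    # Guard against short lines before indexing.
    if len(fields) <= max(key_index, value_index):
        return None
    if key in fields[key_index]:
        return fields[value_index]
    return None


# Made-up lines: one with 14 tab-separated words, one with only 4.
long_line = "\t".join(f"w{i}" for i in range(14)) + "\n"
short_line = "a\tb\tc\td\n"
print(pick_field(long_line, "w1"))
print(pick_field(short_line, "b"))
```

Printing len(line) right after the split is also a quick way to confirm how many fields each line really has.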