Parse lines into individual segments - python

I'm new to python and having issues working with a text file. The text file structure being used is shown. What I'm trying to do is first split the two polylines into their own variable and then split each variable into individual coordinates. The end goal is to have it structured as:
polyline 1:
[###, ###] [###, ###]
polyline 2:
[###, ###] [###, ###]
Text file structure:
Polyline;
1: ###,###; ###,###
2: ###,###; ###,###; ###,###
The code I've tried is just working with a single line. While I've been able to split the single line, I have not been able to move to the next step which is to split the line further.
f = open('txt.txt', 'r')
pl = []
for line in f.read().split('\n'):
    if (line.find('1: ') != -1):
        ln = line.split('1: ')
        print(ln)
f.close()
What is the best way to split the line to the end state?

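First, you can use the with ... as statement to open the file, which closes it automatically at the end of the block. Second, you don't have to read the whole file and split on \n; just loop over the file object with a for loop.
To check that a line starts with a number you can use a regex, in this case the re.match function. Then you can drop the "N:" label, split the rest of the line on ; and use a list comprehension to split each part on , :
import re
with open('txt.txt') as f:
    for line in f:
        # Only process lines that start with a number and a colon, e.g. "1: ..."
        if re.match(r'\d+:', line):
            # Drop the "N:" label so it does not end up in the first coordinate
            points = line.split(':', 1)[1]
            # Split into points on ';', then into stripped coordinates on ','
            ln = [[c.strip() for c in var.split(',')] for var in points.split(';')]
            print(ln)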
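To get all the way to the structure asked for in the question, one option is to collect every polyline into a dict keyed by its number. This is only a sketch: the function name parse_polylines and the sample values are made up for illustration, and it assumes every point is a pair of numbers.

```python
def parse_polylines(lines):
    """Map each polyline label to a list of [x, y] coordinate pairs."""
    polylines = {}
    for line in lines:
        label, sep, rest = line.partition(':')
        # Skip lines without a numeric label, such as the "Polyline;" header
        if not sep or not label.strip().isdigit():
            continue
        points = []
        for chunk in rest.split(';'):
            if chunk.strip():
                x, y = chunk.split(',')
                points.append([float(x), float(y)])
        polylines[int(label)] = points
    return polylines

# Hypothetical sample data in the layout described in the question
sample = ["Polyline;", "1: 10,20; 30,40", "2: 1,2; 3,4; 5,6"]
for number, points in parse_polylines(sample).items():
    print("polyline %d:" % number, points)
```

Passing an open file object instead of the sample list should work the same way, since the function only iterates over lines.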

Related

Create a list from a .txt file separating paragraphs

My current code is:
file = open("quotes.txt", "r")
lines = file.readlines()
quote = 0
quoteList = [line.split() for line in open("quotes.txt")]
print(quoteList)
lines = []
for l in lines:
    lines.append(l.split(" "))
I am trying to open a file and have it read the paragraphs in it and split them up into individual items in a list. It is a file filled with quotes and they all go something like this:
No problem should ever have to be solved twice.
-- Eric S. Raymond, How to become a hacker
There's about a thousand lines of quotes in the file and I am wondering how to split them up and put them into a list where you can access individual quotes and print them at random.
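One possible approach, sketched under the assumption that every quote ends with an attribution line starting with "--" (as in the sample above): collect lines until an attribution line shows up, then store the group as one quote. The name group_quotes and the second sample quote are invented for illustration.

```python
import random

def group_quotes(lines):
    """Group lines into quotes; a line starting with "--" closes the current quote."""
    quotes, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines between quotes
        current.append(text)
        if text.startswith("--"):
            quotes.append("\n".join(current))
            current = []
    if current:  # keep a trailing quote that has no attribution line
        quotes.append("\n".join(current))
    return quotes

# The second entry is a made-up placeholder quote
sample = [
    "No problem should ever have to be solved twice.\n",
    "-- Eric S. Raymond, How to become a hacker\n",
    "A second, made-up quote.\n",
    "-- Hypothetical Author\n",
]
quotes = group_quotes(sample)
print(random.choice(quotes))
```

With the real file, something like `with open("quotes.txt") as f: quotes = group_quotes(f)` should give a list that can be indexed or sampled with random.choice.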

How do I print only the first instance of a string in a text file using Python?

I am trying to extract data from a .txt file in Python. My goal is to capture the last occurrence of a certain word and show the line after it, so I reverse the text and read from the end. In this case I search for the word 'MEC' and show the next line, but I capture every occurrence of the word, not just the first one I reach.
Any idea what I need to do?
Thanks!
This is what my code looks like:
import re
from file_read_backwards import FileReadBackwards
with FileReadBackwards("camdex.txt", encoding="utf-8") as file:
    for l in file:
        lines = l
        while line:
            if re.match('MEC', line):
                x = (file.readline())
                x2 = (x.strip('\n'))
                print(x2)
                break
            line = file.readline()
The txt file contains this:
MEC
29/35
MEC
28,29/35
My code prints this output:
28,29/35
29/35
And my objective is to print only this:
28,29/35
This will give you the result as well. Loop through lines, add the matching lines to an array. Then print the last element.
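import re
# A forward slash avoids the invalid "\c" escape in "data\camdex.txt"
with open("data/camdex.txt", encoding="utf-8") as file:
    result = []
    for line in file:
        if re.match('MEC', line):
            # The value sits on the line after the "MEC" marker
            x = next(file, '')
            result.append(x.strip('\n'))
print(result[-1])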
Get rid of the extra imports and overhead. Read your file normally, remembering the last line that qualifies. This version assumes the value sits on the same line as the marker, as in "MEC 28,29/35".
with open("camdex.txt", encoding="utf-8") as file:
    for line in file:
        if line.startswith("MEC"):
            last = line
print(last[4:-1])  # "4" gets rid of "MEC "; "-1" stops just before the line feed.
If the file is very large, then reading backwards makes sense -- seeking to the end and backing up will be faster than reading to the end.
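For the layout shown in the question, where the value is on the line after MEC, a single forward pass that only remembers the most recent value might look like the sketch below. The function name value_after_last_marker is made up, and it assumes the marker occupies a line of its own.

```python
def value_after_last_marker(lines, marker="MEC"):
    """Return the line that follows the last occurrence of marker, or None."""
    last_value = None
    previous_was_marker = False
    for line in lines:
        text = line.strip()
        if previous_was_marker:
            last_value = text  # overwrite, so only the latest match survives
        previous_was_marker = (text == marker)
    return last_value

# Same content as the sample file in the question
sample = ["MEC\n", "29/35\n", "MEC\n", "28,29/35\n"]
print(value_after_last_marker(sample))  # 28,29/35
```

Because it never builds a list, memory use stays constant no matter how many matches the file contains.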

Python: Access "field" in line

I have the following .txt-File (modified bash emboss-dreg report, the original report has seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access the elements under "sequence" only, to compare them with some variables and delete the whole lines, if the comparison does not give the desired result (using Levenshtein distance for comparison).
But I can't even get started .... :(
I am searching for something like the linux -f option, to directly get to the right "field" in the line to do my comparison.
I came across re.split:
import re
with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to "splitting my lines into elements". I feel like I am going about this completely the wrong way, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I tried to deal with it as plain .txt. Maybe there is a better approach for dealing with it?
Python is the main language I am learning and I am not so strong in Bash, but Bash answers would be fine for me, too.
I am thankful for any hint/link/help :)
The format itself seems to put blank lines between the rows, and splitting on r'\t' is not doing anything because there are no tabs to split on: based on what you've pasted, the data is not tab-delimited but padded with a varying number of spaces to form the table.
To address both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the leading/trailing whitespace, check whether any data is left and, if there is, split it on whitespace to get your line elements:
with open("your_data", "r") as f:
    header = f.readline().split()  # read the first line as a header
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()  # split the data on whitespace to get your elements
            print(elements[-1])  # print the last element
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
    # read the header and turn it into a value:index map
    header = {v: i for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()
            print(elements[header["Sequence"]])  # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
    # read the header and turn it into an index:value map
    header = {i: v for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            # split the line, iterate over it and use the header map to create a dict
            row = {header[i]: v for i, v in enumerate(line.split())}
            print(row["Sequence"])  # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want for some reason, you'll have to create a temporary file, loop through your original file, compare your values, write the ones that you want to keep into the temporary file, delete the original file and finally rename the temporary file to match your original file, something like:
import shutil
from tempfile import NamedTemporaryFile
SOURCE_FILE = "your_data"  # path to the original file to process
def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead
# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line-by-line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or is it an empty line
            elements = row.split()  # split the row into elements
            # now lets call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use
shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This will produce the same file sans the second row from your example, since its sequence ends in TC and compare_func() returns False in that case.
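If you want compare_func to use the Levenshtein distance mentioned in the question, a dependency-free sketch could look like this. REFERENCE and MAX_DISTANCE are made-up placeholders (the reference is simply the first sequence from the sample); pick whatever target sequence and threshold fit your data.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

REFERENCE = "TGATCGCACGCCGAATGGAAACACGTTTT"  # hypothetical reference sequence
MAX_DISTANCE = 5                             # hypothetical threshold

def compare_func(seq):
    # Keep a row only if its sequence is close enough to the reference
    return levenshtein(seq, REFERENCE) <= MAX_DISTANCE
```

Only two rows of the distance table are kept at a time, so memory grows with the length of one sequence rather than with the product of both lengths.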
For a bit less complexity, instead of using temporary files you can load your whole source file into the working memory and then just overwrite it, but that would work only for files that can fit your working memory while the above approach can work with files as large as your free storage space.
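A minimal sketch of that in-memory variant is below. The function name filter_rows_in_memory and the keep callback are made up for illustration, and it assumes the sequence is always the last field of a row. Unlike the temporary-file version, it keeps every blank padding line, including the ones next to removed rows.

```python
def filter_rows_in_memory(path, keep):
    """Rewrite path, keeping the header and the data rows whose last field passes keep()."""
    with open(path, "r") as f:
        lines = f.readlines()
    kept = lines[:1]  # the header line is always kept
    for line in lines[1:]:
        fields = line.split()
        # Blank padding lines are preserved; data rows are tested on their last field
        if not fields or keep(fields[-1]):
            kept.append(line)
    with open(path, "w") as f:
        f.writelines(kept)
```

For example, `filter_rows_in_memory("your_data", lambda seq: not seq.endswith("TC"))` would apply the same rule as compare_func above.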

extract the dimensions from the head lines of text file

Please see following attached image showing the format of the text file. I need to extract the dimensions of data matrix indicated by the first line in the file, here 49 * 70 * 1 for the case shown by the image. Note that the length of name "gd_fac" can be varying. How can I extract these numbers as integers? I am using Python 3.6.
The specification is not very clear. I am assuming that the information you want will always be in the first line, and always in parentheses. Given that:
with open(filename) as infile:
    line = infile.readline()
string = line[line.find('(')+1:line.find(')')]
lst = [int(n) for n in string.split('x')]
This will create the list lst = [49, 70, 1].
What is happening here:
First I open the file (you will need to replace filename with the name of your file, as a string). The with ... as ... structure ensures that the file is closed after use. Then I read the first line. After that, I select only the part of that line that falls after the open paren ( and before the close paren ). Finally, I break the string into parts, with the character x as the separator, and convert each part to an integer. This creates a list of the values in the first line of the file that fall between the parentheses and are separated by x.
Since you have mentioned that the length of 'gd_fac' can vary, the best solution is to use a regular expression.
import re
with open("a.txt") as fh:
    for line in fh:
        if '(' in line and ')' in line:
            dimension = re.findall(r'.*\((.*)\)', line)[0]
            break
print(dimension)
Output:
49x70x1
This looks for the line containing "gd_fac"; if it is there, it removes the parts you don't need and keeps just the dimensions. Note that it depends on the literal name gd_fac, so it has to be adapted if the name changes.
with open('test.txt', 'r') as infile:
    for line in infile:
        if "gd_fac" in line:
            line = line.replace("gd_fac", "")
            line = line.replace("x", "*")
            line = line.replace("(", "")
            line = line.replace(")", "")
            print(line.strip())
            break
OUTPUT: "49*70*1"
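Since the question asks for integers, here is a small sketch that pulls the numbers straight out of the parentheses. The first-line layout "gd_fac (49x70x1)" is a guess based on the description (the original was only shown as an image), and parse_dimensions is a made-up name.

```python
import re

def parse_dimensions(first_line):
    """Return the integers found inside the first pair of parentheses."""
    match = re.search(r'\(([^)]*)\)', first_line)
    if match is None:
        raise ValueError("no parenthesised dimensions found")
    # Grab every run of digits, so the separator between them does not matter
    return [int(n) for n in re.findall(r'\d+', match.group(1))]

# Hypothetical first line; the name before the parentheses can have any length
print(parse_dimensions("gd_fac (49x70x1)"))  # [49, 70, 1]
```

Because it searches for the parentheses rather than the name, it should not care how long the name in front of them is.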

Not able to frame text while adding a line to middle of file in python

My text.txt looks like this
abcd
xyzv
dead-hosts
-abcd.srini.com
-asdsfcd.srini.com
I want to insert a few lines after the "dead-hosts" line, so I wrote a script that adds lines to the file. The host lines start with an extra space that is mandatory in my file, but after the new lines are added that space is removed, and I don't know how to keep it as it is.
Here is my script
Failvrlist = ['srini.com','srini1.com']
tmplst = []
with open('test.txt','r+') as fd:
    for line in fd:
        tmplst.append(line.strip())
    pos = tmplst.index('dead-hosts:')
    tmplst.insert(pos+1,"#extra comment ")
    for i in range(len(Failvrlist)):
        tmplst.insert(pos+2+i," - "+Failvrlist[i])
    tmplst.insert(pos+len(Failvrlist)+2,"\n")
    for i in xrange(len(tmplst)):
        fd.write("%s\n" %(tmplst[i]))
output is as below
abcd
xyzv
dead-hosts
#extra comment
- srini.com
- srini1.com
- abcd.srini.com
- asdsfcd.srini.com
If you look at the last two lines, the space got removed. Please advise.
Points:
1. In your code, pos = tmplst.index('dead-hosts:') looks for dead-hosts: with a colon, but the input file you posted contains only dead-hosts, without the colon. I am assuming the line really is dead-hosts:.
2. When reading the file into the list, use rstrip() instead of strip(). rstrip() only removes trailing whitespace, so the spaces at the start of each line are kept as they are.
3. Once the file has been read into the list, the rest of the code should be outside the with block that opens and reads the file.
The overall flow of the code should be:
1. Open the file, read the lines into a list, and close the file.
2. Modify the list by inserting the values at the specific index.
3. Write the file again.
Code:
Failvrlist = ['srini.com','srini1.com']
tmplst = []
#Open file and read it
with open('result.txt','r+') as fd:
    for line in fd:
        tmplst.append(line.rstrip())
#Modify list
pos = tmplst.index('dead-hosts:')
tmplst.insert(pos+1,"#extra comment")
pos = tmplst.index('#extra comment')
a = 1
for i in Failvrlist:
    to_add = " -" + i
    tmplst.insert(pos+a,to_add)
    a+=1
#Write to file
with open('result.txt','w') as fd:
    for i in range(len(tmplst)):
        fd.write("%s\n" %(tmplst[i]))
Content of result.txt:
abcd
xyzv
dead-hosts:
#extra comment
-srini.com
-srini1.com
-abcd.srini.com
-asdsfcd.srini.com
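The same read, modify, write flow can be wrapped in a small helper that never touches the leading whitespace of the existing lines. This is only a sketch: insert_after is a made-up name, and the sample list assumes the host lines begin with one space, as described in the question.

```python
def insert_after(lines, marker, new_lines):
    """Return a copy of lines with new_lines placed right after the marker line."""
    # Compare without surrounding whitespace, but keep each stored line untouched
    pos = [line.strip() for line in lines].index(marker)
    return lines[:pos + 1] + list(new_lines) + lines[pos + 1:]

# Hypothetical file content, already read with rstrip() so leading spaces survive
original = ["abcd", "xyzv", "dead-hosts:", " -abcd.srini.com", " -asdsfcd.srini.com"]
updated = insert_after(original, "dead-hosts:",
                       ["#extra comment", " -srini.com", " -srini1.com"])
print("\n".join(updated))
```

Slicing builds a new list instead of calling insert() repeatedly, so there is no index arithmetic to get wrong when more hosts are added.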
