complex regex matches in python - python

I have a txt file that contains the following data:
chrI
ATGCCTTGGGCAACGGT...(multiple lines)
chrII
AGGTTGGCCAAGGTT...(multiple lines)
I want to first find 'chrI' and then iterate through the multiple lines of ATGC until I find the xth char. Then I want to print the xth char until the yth char. I have been using regex but once I have located the line containing chrI, I don't know how to continue iterating to find the xth char.
Here is my code:
for i, line in enumerate(sacc_gff):
for match in re.finditer(chromo_val, line):
print(line)
for match in re.finditer(r"[ATGC]{%d},{%d}\Z" % (int(amino_start), int(amino_end)), line):
print(match.group())
What the variables mean:
chromo_val = chrI
amino_start = (some start point my program found)
amino_end = (some end point my program found)
Note: amino_start and amino_end need to be in variable form.
Please let me know if I could clarify anything for you, Thank you.

It looks like you are working with fasta data, so I will provide an answer with that in mind, but if it isn't you can use the sub_sequence selection part still.
fasta_data = {} # creates an empty dictionary
with open( fasta_file, 'r' ) as fh:
for line in fh:
if line[0] == '>':
seq_id = line.rstrip()[1:] # strip newline character and remove leading '>' character
fasta_data[seq_id] = ''
else:
fasta_data[seq_id] += line.rstrip()
# return substring from chromosome 'chrI' with a first character at amino_start up to but not including amino_end
sequence_string1 = fasta_data['chrI'][amino_start:amino_end]
# return substring from chromosome 'chrII' with a first character at amino_start up to and including amino_end
sequence_string2 = fasta_data['chrII'][amino_start:amino_end+1]
fasta format:
>chr1
ATTTATATATAT
ATGGCGCGATCG
>chr2
AATCGCTGCTGC

Since you are working with fasta files which are formatted like this:
>Chr1
ATCGACTACAAATTT
>Chr2
ACCTGCCGTAAAAATTTCC
and are a bioinformatics major I am guessing you will be manipulating sequences often I recommend install the perl package called FAST. Once this is installed to get the 2-14 character of every sequence you would do this:
fascut 2..14 fasta_file.fa
Here is the recent publication for FAST and github that contains a whole toolbox for manipulating molecule sequence data on the command line.

Related

python open csv search for pattern and strip everything else

I got a csv file 'svclist.csv' which contains a single column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip each line from everything except the PL5 directoy and the 2 numbers in the last directory
and should look like that
PL5,00
PL5,01
I started the code as follow:
clean_data = []
with open('svclist.csv', 'rt') as f:
for line in f:
if line.__contains__('profile'):
print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
you can use the regular expression - (PL5)[^/].{0,}([0-9]{2,2})
For explanation, just copy the regex and paste it here - 'https://regexr.com'. This will explain how the regex is working and you can make the required changes.
import re
test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
matchArray = regex.findall(test_string)
result.append(matchArray[0])
with open('outfile.txt', 'w') as f:
for row in result:
f.write(f'{str(row)[1:-1]}\n')
In the above code, I've created one empty list to hold the tuples. Then, I'm writing to the file. I need to remove the () at the start and end. This can be done via str(row)[1:-1] this will slice the string.
Then, I'm using formatted string to write content into 'outfile.csv'
You can use regex for this, (in general, when trying to extract a pattern this might be a good option)
import re
pattern = r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})"
with open('svclist.csv', 'rt') as f:
for line in f:
if 'profile' in line:
last_two_numbers = pattern.findall(line)[0]
print(f'PL5,{last_two_numbers}')
This code goes over each line, checks if "profile" is in the line (this is the same as _contains_), then extracts the last two digits according to the pattern
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_") # splits the string at the underscores
output = test_list[1].strip(
"abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz")) # removing any character
try:
int(output) # testing if the any special characters are left
print(f"PL5, {output}")
except ValueError:
print(f'Something went wrong! Output is PL5,{output}')

Regular Expression to find valid words in file

I need to write a function get_specified_words(filename) to get a list of lowercase words from a text file. All of the following conditions must be applied:
Include all lower-case character sequences including those that
contain a - or ' character and those that end with a '
character.
Exclude words that end with a -.
The function must only process lines between the start and end marker lines
Use this regular expression to extract the words from each relevant line of a file: valid_line_words = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading. That is your open file call should look like open(filename, encoding='utf-8'). This will be especially helpful if your operating system doesn't set Python's default encoding to UTF-8.
The sample text file testing.txt contains this:
That are after the start and should be dumped.
So should that
and that
and yes, that
*** START OF SYNTHETIC TEST CASE ***
Toby's code was rather "interesting", it had the following issues: short,
meaningless identifiers such as n1 and n; deep, complicated nesting;
a doc-string drought; very long, rambling and unfocused functions; not
enough spacing between functions; inconsistent spacing before and
after operators, just like this here. Boy was he going to get a low
style mark.... Let's hope he asks his friend Bob to help him bring his code
up to an acceptable level.
*** END OF SYNTHETIC TEST CASE ***
This is after the end and should be ignored too.
Have a nice day.
Here's my code:
import re
def stripped_lines(lines):
for line in lines:
stripped_line = line.rstrip('\n')
yield stripped_line
def lines_from_file(fname):
with open(fname, 'rt') as flines:
for line in stripped_lines(flines):
yield line
def is_marker_line(line, start='***', end='***'):
min_len = len(start) + len(end)
if len(line) < min_len:
return False
return line.startswith(start) and line.endswith(end)
def advance_past_next_marker(lines):
for line in lines:
if is_marker_line(line):
break
def lines_before_next_marker(lines):
valid_lines = []
for line in lines:
if is_marker_line(line):
break
valid_lines.append(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line))
for content_line in valid_lines:
yield content_line
def lines_between_markers(lines):
it = iter(lines)
advance_past_next_marker(it)
for line in lines_before_next_marker(it):
yield line
def words(lines):
text = '\n'.join(lines).lower().split()
return text
def get_valid_words(fname):
return words(lines_between_markers(lines_from_file(fname)))
# This must be executed
filename = "valid.txt"
all_words = get_valid_words(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(all_words)))
print("word list:")
print("\n".join(all_words))
Here's my output:
File "C:/Users/jj.py", line 45, in <module>
text = '\n'.join(lines).lower().split()
builtins.TypeError: sequence item 0: expected str instance, list found
Here's the expected output:
valid.txt loaded ok.
73 valid words found.
word list:
toby's
code
was
rather
interesting
it
had
the
following
issues
short
meaningless
identifiers
such
as
n
and
n
deep
complicated
nesting
a
doc-string
drought
very
long
rambling
and
unfocused
functions
not
enough
spacing
between
functions
inconsistent
spacing
before
and
after
operators
just
like
this
here
boy
was
he
going
to
get
a
low
style
mark
let's
hope
he
asks
his
friend
bob
to
help
him
bring
his
code
up
to
an
acceptable
level
I need help with getting my code to work. Any help is appreciated.
lines_between_markers(lines_from_file(fname))
gives you a list of list of valid words.
So you just need to flatten it :
def words(lines):
words_list = [w for line in lines for w in line]
return words_list
Does the trick.
But I think that you should review the design of your program :
lines_between_markers should only yield lines between markers, but it does more. Regexp should be use on the result of this function and not inside the function.
What you didn't do :
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading.
That is your open file call should look like open(filename,
encoding='utf-8').

Python: losing nucleotides from fasta file to dictionary

I am trying to write a code to extract longest ORF in a fasta file. It is from Coursera Genomics data science course.
the file is a practice file: "dna.example.fasta"
Data is here:https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below to extract reading frame 2 (start from the second position of a sequence. eg: seq: ATTGGG, to get reading frame 2: TTGGG):
#!/usr/bin/python
import sys
import getopt
o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()
for k,v in o:
opts[k] = v
if '-h' in k:
print "--help\n"
if len(a) < 0:
print "missing fasta file\n"
f = open(a[0], "r")
seq = dict()
for line in f:
line = line.strip()
if line.startswith(">"):
name = line.split()[0]
seq[name] = ''
else:
seq[name] = seq[name] + line[1:]
k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp. Therefore by using this sequence alone I could get the correct answer.
However, with the code, here in the dictionary, this particular sequence becomes only 4736 bp.
I am new to python, so I can not wrap my head around as to where did those 100 bp go?
Thank you,
Xio
Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how the sequences start on the first value of each line.
Your addition line seq[name] = seq[name] + line[1:] is adding everything on that line after the first character, excluding the first (Python 2 indicies are zero based). It turns out your missing number of nucleotides is the number of lines it took to make that genome, because you're losing the first character every time.
The revised way is seq[name] = seq[name] + line which simply adds the line without losing that first character.
The quickest way to find these kind of debugging errors is to either use a formal debugger, or add a bunch of print statements on your code and test with a small portion of the file -- something that you can see the output of and check for yourself if it's coming out right. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also for future reference, please mention the version of python you are using before hand. There are quite a few differences between python 2 (The one you're using) and python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
Addressing your comment,
To start from the 2nd position on the first line of the sequence only and then use the full lines until the following nucleotide, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. This will test if we're on the first line of the sequence, and if so, use the characters starting from the second, if we're on any other line, use the whole line.
if line.startswith(">"):
name = line.split()[0]
seq[name] = ''
#If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "If we're at the
# start of the series..."
elif seq[name] == '':
seq[name] = seq[name] + line[1:]
else:
seq[name] = seq[name]
This adaptation will start from the 2nd nucleotide in the genome without losing the first from every line in the rest of the nucleotide.

Having problems with strings and arrays

I want to read a text file and copy text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new in Python and this is as far as I got:
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a=[0]
b=0
for i,line in enumerate(searchlines):
if '~~~~~~~~~~~~~' in line:
b=b+1
if '~~~~~~~~~~~~~' not in line:
if 's1mb4d' in line:
break
a.insert(b,line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array in which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' is not even, that's why I use 'b' so I can put lines of text into one array index until a new '~~~~~~~~~~~~~' was found.
I check for '~~~~~~~~~~~~~', if found I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once its found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add text to the array.
But things didn't go well. Only 1 line of the entire text between those '~~~~~~~~~~~~~' is being copied to the each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~
You could use regex expression, give a try to this:
import re
input_text = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = []
for line in input_text:
my_text = re.findall(r'[^\~]+', line)
if len(my_text) != 0:
a.append(my_text)
What it does is it reads line by line looks for all characters but '~' if line consists only of '~' it ignores it, every line with text is appended to your a list afterwards.
And just because we can, oneliner (excluding import and source ofc):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In python the solution to a large part of problems is often to find the right function from the standard library that does the job. Here you should try using split instead, it should be way easier.
If I understand correctly your goal, you can do it like that :
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~')
The first line joins your list of lines into a sinle string, and then the second one cut that big string every times it encounters the '~~' sequence.
I tried to clean it up to the best of my knowledge, try this and let me know if it works. We can work together on this!:)
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a = []
currentline = ''
for i,line in enumerate(searchlines):
currentline += line
if '~~~~~~~~~~~~~' in line:
a.append(currentline)
elif 's1mb4d' in line:
break
Some notes:
You can use elif for your break function
Append will automatically add the next iteration to the end of the array
currentline will continue to add text on each line as long as it doesn't have 's1mb4d' or the ~~~ which I think is what you want
s = ['']
with open('path\\to\\sample.txt') as f:
for l in f:
a = l.strip().split("\n")
s += a
a = []
for line in s:
my_text = re.findall(r'[^\~]+', line)
if len(my_text) != 0:
a.append(my_text)
print a
>>> [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ( '~' * 13) to be specific) ...
then you could accomplish this for relatively normal sized files using just
#!/usr/bin/python
## (Should be #!/usr/bin/env python; but StackOverflow's syntax highlighter?)
separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
If you really want any line of any number of ~ characters to serve as a separator than you'll want to use the .split() method from the regular expressions module rather than the .split() method provided by the built-in string objects.
Note that this snippet of code will return all of the text between your separator lines, including any newlines they include. There are other snippets of code which can filter those out. For example given our previous results:
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method; but I prefer the join/split combination). In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results, which we're arbitrarily naming each), performing our transformation on it, and the resulting list is being boun back to the name results; I highly recommend learning and getting comfortable with list comprehension if you're going to learn Python. They're commonly used and can be a bit exotic compared to the syntax of many other programming and scripting languages).
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python3.6 installed under MacOS X using Homebrew ... but just be forewarned).

Python regex - capture text between two words as string, then append to list

This is the structure of the txt file (repeated units of CDS-text-ORIGIN):
CDS 311..>428
/gene="PNR"
/codon_start=1
/product="photoreceptor-specific nuclear receptor"
/protein_id="AAD28302.1"
/db_xref="GI:4726077"
/translation="METRPTALMSSTVAAAAPAAGAASRKESPGRWGLGEDPT"
ORIGIN
I want to pull out the text from 311..<428 to GEDPT" as a string
The regex I have so far is:
compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL|re.MULTILINE)
I then use a loop to add each string to a list:
for line in file:
match = compiler.match(line)
if match:
list.append(str(match.group(1)))
But I keep getting an empty list! Any ideas why?
Help would be much appreciated, I'm new to this!
I am assuming that file is a filepointer such as file = open('filename.txt'). If that is the case then using:
for line in file:
will break each line on the newline character. So the first three lines will be:
1: ' CDS 311..>428\n'
2: ' /gene="PNR"\n'
3: ' /codon_start=1:\n'
Because each line is separate, you will not match the multiline pattern unless you combine the lines. You may want to consider using:
compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL|re.MULTILINE)
fp = open('filename.txt')
all_text = fp.read() # this reads all the text without splitting on newlines
compiler.findall(all_text) # returns a list of all matches

Categories