I have a FASTA file with a header that includes the sequence name and length:
>1 9081 bp
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga
I need to remove everything after the name "1" and tried doing that in python by:
newfile.write(oldfile.replace("bp",""))
This removes "bp" but I still have the numbers now.
>1 9081
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga
How do I specify "any characters followed by bp" so that all of it is replaced with nothing? I tried ***bp, ---bp, and ...bp, but those don't work.
Thanks!
Radwa
You should use a regular expression for this purpose.
Try this (assuming the sequence name may contain more than one character and may contain both digits and letters):
import re
regex = re.compile(r'(^\w+)\s.*', re.DOTALL)
print(regex.sub(r'\1', '1 9081 bp\ngcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga' ))
print(regex.sub(r'\1', 's12d 9081 bp\ngcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga' ))
Output:
1
s12d
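Note that with re.DOTALL the .* in the pattern above also consumes the sequence line, and ^\w+ will not match a header that begins with >. A minimal sketch that trims only the header lines and leaves the sequence untouched, assuming the name is the first whitespace-delimited token after > (the names HEADER and strip_header_details are hypothetical):

```python
import re

# A FASTA header line: '>' plus the name (first token), then the rest of that line.
# re.MULTILINE makes ^ match at the start of every line, not just the string.
HEADER = re.compile(r"^(>\S+)[^\n]*", re.MULTILINE)

def strip_header_details(fasta_text):
    """Keep only '>name' on header lines; sequence lines are left as they are."""
    return HEADER.sub(r"\1", fasta_text)

sample = ">1 9081 bp\ngcgcccgaacagggacttgaaagc\n"
print(strip_header_details(sample))
```

The result of strip_header_details(oldfile) could then be passed to newfile.write(), where oldfile is the string read from the input file as in the question.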
I'm trying to emulate the strike-through markdown from GitHub in python and I managed to do half of the job. Now there's just one problem I have: The pattern I'm using doesn't seem to replace the text containing symbols and I couldn't figure it out so I hope someone can help me
text = "This is a ~~test?~~"
match = re.findall(r"(?<![.+?])(~{2})(?!~~)(.+?)(?<!~~)\1(?![.+?])", text) # Finds all the text between ~~ symbols
if match:
    for _, m in match: # Iterates through the matches. The first variable (_) contains the ~~ symbol and the second one (m) contains the text I want to replace
        text = re.sub(f"~~{m}~~", "\u0336".join(m) + "\u0336", text) # Should replace ~~test?~~ with t̶e̶s̶t̶?̶ but it fails
The problem is in the pattern you build for the substitution. With m equal to test?, the pattern ~~{m}~~ becomes ~~test?~~, and ? is a regex metacharacter (it makes the preceding t optional), so the pattern no longer matches the literal text ~~test?~~. Use re.escape(m) instead of m so that metacharacters are escaped and treated as literals.
Here is your code with that change:
import re
text = "This is a ~~test?~~"
match = re.findall(r"(?<![.+?])(~{2})(?!~~)(.+?)(?<!~~)\1(?![.+?])", text) # Finds all the text between ~~ symbols
if match:
    for _, m in match: # Iterates through the matches. The first variable (_) contains the ~~ symbol and the second one (m) contains the text to replace
        print(m)
        text = re.sub(f"~~{re.escape(m)}~~", "\u0336".join(m) + "\u0336", text) # Replaces ~~test?~~ with t̶e̶s̶t̶?̶
print(text)
This replaces the text as you expected. After printing the matched text (test?), it prints:
This is a t̶e̶s̶t̶?̶
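As an alternative, the find-then-substitute loop can be collapsed into a single re.sub call with a replacement function, which avoids building a pattern out of matched text at all. The sketch below is only an illustration: it uses a simplified pattern, ~~(.+?)~~, rather than the lookaround-based pattern from the question, and the names STRIKE and strike_through are hypothetical.

```python
import re

STRIKE = "\u0336"  # combining long stroke overlay

def strike_through(text):
    """Replace each ~~...~~ span with its characters, each followed by U+0336."""
    return re.sub(
        r"~~(.+?)~~",
        # The function's return value is inserted literally, so metacharacters
        # or backslashes in the matched text need no escaping.
        lambda m: "".join(ch + STRIKE for ch in m.group(1)),
        text,
    )

result = strike_through("This is a ~~test?~~")
print(result)
```

Because the replacement comes from a function, it is not subject to backslash processing the way a replacement string is.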
I'm trying to write a regular expression in python 3.4 that will take the input from a text file of potential prices and match for valid formatting.
The requirements are that the price be in $X.YY or $X format where X must be greater than 0.
Invalid formats include $0.YY, $.YY, $X.Y, $X.YYY
So far this is what I have:
import re
from sys import argv
FILE = 1
file = open(argv[FILE], 'r')
string = file.read()
file.close()
price = re.compile(r""" # beginning of string
(\$ # dollar sign
[1-9] # first digit must be non-zero
\d * ) # followed by 0 or more digits
(\. # optional cent portion
\d {2} # only 2 digits allowed for cents
)? # end of string""", re.X)
valid_prices = price.findall(string)
print(valid_prices)
This is the file I am using to test right now:
test.txt
$34.23 $23 $23.23 $2 $2313443.23 $3422342 $02394 $230.232 $232.2 $05.03
Current output:
[('$34', '.23'), ('$23', ''), ('$23', '.23'), ('$2', ''), ('$2313443', '.23'), ('$3422342', ''), ('$230', '.23'), ('$232', '')]
It is currently matching $230.232 and $232.2 when these should be rejected.
I am separating the dollar portion and the cent portion into different groups to do further processing later on. That is why my output is a list of tuples.
One catch here is that I do not know what delimiter, if any, will be used in the input file.
I am new to regular expressions and would really appreciate some help. Thank you!
If it's really not clear which delimiter will be used, the only check that makes sense to me is that the match is not followed by a digit or a dot:
\$[1-9]\d*(\.\d\d)?(?![\d.])
https://regex101.com/r/jH2dN5/1
Add a zero-width positive lookahead (?=\s|$) to ensure that the match is followed only by whitespace or the end of the string:
>>> s = '$34.23 $23 $23.23 $2 $2313443.23 $3422342 $02394 $230.232 $232.2 $05.03'
>>> re.findall(r'\$[1-9]\d*(?:\.\d{2})?(?=\s|$)', s)
['$34.23', '$23', '$23.23', '$2', '$2313443.23', '$3422342']
Try this:
\$(?!0\d)\d+(?:\.\d{2})?(?=\s|$)
Matches:
$34.23 $23 $23.23 $2 $2313443.23 $3422342 $0.99 $3.00
Note that this pattern also accepts $0.99, which the question lists as an invalid format ($0.YY).
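Since the question wants the dollar and cent portions in separate groups, here is a minimal sketch that keeps the two capturing groups from the original pattern and adds the negative lookahead (?![\d.]) suggested above. The variable names sample and matches are illustrative, and the test string is the one from the question.

```python
import re

price = re.compile(r"""
    (\$[1-9]\d*)   # dollar sign, then a whole amount with a non-zero first digit
    (\.\d{2})?     # optional cents: a dot and exactly two digits
    (?![\d.])      # reject if another digit or a dot follows
""", re.X)

sample = ("$34.23 $23 $23.23 $2 $2313443.23 $3422342 "
          "$02394 $230.232 $232.2 $05.03")

# findall returns one (dollars, cents) tuple per match; cents is '' when absent.
matches = price.findall(sample)
print(matches)
```

The lookahead blocks every shorter alternative the engine would otherwise fall back to, which is why $230.232 and $232.2 produce no match at all instead of a partial one.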
I have a set of strings like this:
uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;
I need to extract the sub-string between two semicolons and I only need the first occurrence.
The result should be: C1orf159
I have tried this code, but it does not work:
import re
info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"
name = re.search(r'\;(.*)\;', info)
print name.group()
Please help me.
Thanks
You can split the string and limit it to two splits.
x = info.split(';',2)[1]
import re
pattern = re.compile(r".*?;([a-zA-Z0-9]+);.*")
print(pattern.match(info).groups())
This finds the first ; by consuming characters non-greedily with .*?. It then captures the alphanumeric string up to the next ;, and the trailing .* consumes the rest of the string. The captured text is returned by .groups().
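To see why the pattern in the question returns too much, it helps to compare it with two common fixes. The sketch below is illustrative only (the variable names greedy, lazy, and negated are made up): a greedy .* runs to the last semicolon, while a non-greedy .*? or a negated character class [^;]* stops at the next one. Note also that .group() with no argument returns the whole match including the semicolons, so .group(1) is needed to get just the captured text.

```python
import re

info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"

# Greedy: .* extends as far as it can, so the match ends at the LAST ';'.
greedy = re.search(r";(.*);", info).group(1)

# Non-greedy: .*? stops at the first ';' that lets the pattern succeed.
lazy = re.search(r";(.*?);", info).group(1)

# Negated class: [^;]* cannot cross a ';' at all.
negated = re.search(r";([^;]*);", info).group(1)

print(greedy)
print(lazy)
print(negated)
```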
I have a string pa$$word. I want to change this string to pa\$\$word. This must be changed to 2 or more such characters only and not for pa$word. The replacement must happen n number of times where n is the number of "$" symbols. For example, pa$$$$word becomes pa\$\$\$\$word and pa$$$word becomes pa\$\$\$word.
How can I do it?
import re
def replacer(matchobj):
    mat = matchobj.group()
    return "".join(item for items in zip("\\" * len(mat), mat) for item in items)
print(re.sub(r"((\$)\2+)", replacer, "pa$$$$word"))
# pa\$\$\$\$word
print(re.sub(r"((\$)\2+)", replacer, "pa$$$word"))
# pa\$\$\$word
print(re.sub(r"((\$)\2+)", replacer, "pa$$word"))
# pa\$\$word
print(re.sub(r"((\$)\2+)", replacer, "pa$word"))
# pa$word
((\$)\2+) - We create two capturing groups here. The first one is the entire match, which can be referred to later as \1. The second capturing group is nested inside it; it captures a single $ and can be referred to as \2. So we first match $ once and then require it to repeat at least once more, consecutively, with \2+.
When such a run is found, re.sub calls the replacer function with the match object. In the replacer function, we get the entire matched string with matchobj.group() and then simply interleave that matched string with \.
I believe the regex you're after is:
[$]{2,}
which will match 2 or more of the character $
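A pattern like this still needs a replacement that knows how long each run is. One possible sketch, assuming a replacement function is acceptable (the name escape_dollar_runs is hypothetical), builds the replacement from the length of the match:

```python
import re

def escape_dollar_runs(s):
    """Prefix each '$' with a backslash, but only inside runs of two or more."""
    # "\\$" is the two-character string backslash + dollar; repeat it once
    # for every '$' in the matched run.
    return re.sub(r"\${2,}", lambda m: "\\$" * len(m.group()), s)

for s in ["pa$word", "pa$$word", "pa$$$word", "pa$$$$word"]:
    print(s, "->", escape_dollar_runs(s))
```

A single $ is never matched by \${2,}, so pa$word passes through unchanged.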
This should help:
import re
result = re.sub(r"\$", r"\\$", yourString)
or you can try
result = yourString.replace("$", "\\$")
Note that both variants escape every $, including a single one as in pa$word.
I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
import re
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind. It checks the text just before the main string so that only strings preceded by name=' are returned. The \ in front of = and ' escapes them so they are treated as literal characters (neither is actually special in a regex, so these escapes are harmless but optional). name=' itself is not part of the result; we just know that every result is preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean a literal . and not the "any character" wildcard. Putting these in [] means we accept anything listed inside the brackets: any alphanumeric character, an underscore, or a period. The + afterwards means at least one of the previous thing, that is, one or more characters matched by [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to take the smallest possible group that meets these specifications, since + could otherwise go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): a lookahead. It checks the text just after the main string so that only strings followed by ' are returned. The \ is again an escape, since otherwise the regex or Python's string parsing might misinterpret '. This final ' is not returned with our results; we just know that in the original string it followed any result we get.
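For comparison, the same three values can be pulled out with plain capturing groups instead of lookarounds. This is a sketch of an alternative technique, not the method used above; the sample text is the one from the question, and the variable name version_name is illustrative.

```python
import re

input_string = """package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'"""

# Capture whatever sits between the quotes; [^']+ cannot run past the closing quote.
name = re.search(r"\bname='([^']+)'", input_string).group(1)
version_name = re.search(r"\bversionName='([^']+)'", input_string).group(1)

# With one capturing group, findall returns a list of just the captured parts.
permissions = re.findall(
    r"uses-permission:'android\.permission\.([A-Z_]+)'", input_string
)

print(name)
print(version_name)
print(permissions)
```

The surrounding text is matched but not returned, because only group 1 is read.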
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is some example code:
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
    if line.startswith("package"):
        words = line.split()
        string1 = words[1].split("=")[1].replace("'","")
        string2 = words[3].split("=")[1].replace("'","")
The test.txt file contains the input data you mentioned earlier.
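The line-by-line approach can be extended to collect the permissions into a list as well. The following is a rough sketch, assuming each field on the package line has the form key='value' with no spaces inside the value; the function name parse_lines and the dictionary layout are hypothetical, and the lines are given inline here instead of being read from test.txt.

```python
lines = [
    "package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'",
    "uses-permission:'android.permission.WRITE_APN_SETTINGS'",
    "uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'",
    "uses-permission:'android.permission.ACCESS_NETWORK_STATE'",
]

def parse_lines(lines):
    """Collect package fields and permission names without using regex."""
    info = {"permissions": []}
    for line in lines:
        line = line.strip()
        if line.startswith("package:"):
            # Skip the 'package:' token, then split each key='value' field.
            for field in line.split()[1:]:
                key, _, value = field.partition("=")
                info[key] = value.strip("'")
        elif line.startswith("uses-permission:"):
            value = line.split(":", 1)[1].strip().strip("'")
            # Keep only the part after the last dot.
            info["permissions"].append(value.rsplit(".", 1)[-1])
    return info

info = parse_lines(lines)
print(info["name"], info["versionName"], info["permissions"])
```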