I'm trying to write a regular expression in Python 3.4 that takes input from a text file of potential prices and matches the ones with valid formatting.
The requirements are that the price be in $X.YY or $X format where X must be greater than 0.
Invalid formats include $0.YY, $.YY, $X.Y, $X.YYY
So far this is what I have:
import re
from sys import argv
FILE = 1
file = open(argv[FILE], 'r')
string = file.read()
file.close()
price = re.compile(r""" # beginning of string
(\$ # dollar sign
[1-9] # first digit must be non-zero
\d * ) # followed by 0 or more digits
(\. # optional cent portion
\d {2} # only 2 digits allowed for cents
)? # end of string""", re.X)
valid_prices = price.findall(string)
print(valid_prices)
This is the file I am using to test right now:
test.txt
$34.23 $23 $23.23 $2 $2313443.23 $3422342 $02394 $230.232 $232.2 $05.03
Current output:
[('$34', '.23'), ('$23', ''), ('$23', '.23'), ('$2', ''), ('$2313443', '.23'), ('$3422342', ''), ('$230', '.23'), ('$232', '')]
It is currently matching $230.232 and $232.2 when these should be rejected.
I am separating the dollar portion and the cent portion into different groups to do further processing later on. That is why my output is a list of tuples.
One catch here is that I do not know what delimiter, if any, will be used in the input file.
I am new to regular expressions and would really appreciate some help. Thank you!
If it's really not clear which delimiter will be used, the only check that makes sense to me is that the match is followed by neither a digit nor a dot:
\$[1-9]\d*(\.\d\d)?(?![\d.])
https://regex101.com/r/jH2dN5/1
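A quick way to check this pattern against the test line from the question is sketched below. The two capturing groups are an addition for illustration (they are not in the pattern above), so that the dollar and cent portions stay separate as the question asks. The names PRICE_RE and sample are made up here.

```python
import re

# Pattern from the answer above, with capturing groups added for the
# dollar part and the optional cent part (an assumption for illustration).
PRICE_RE = re.compile(r"(\$[1-9]\d*)(\.\d\d)?(?![\d.])")

# Test line from the question.
sample = ("$34.23 $23 $23.23 $2 $2313443.23 $3422342 "
          "$02394 $230.232 $232.2 $05.03")

# findall returns one (dollars, cents) tuple per match; cents is ''
# when the optional group did not take part in the match.
matches = PRICE_RE.findall(sample)
print(matches)
```

With this lookahead, $230.232 and $232.2 should no longer appear in the result, because every shorter candidate match is followed by a digit or a dot.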
Add a zero width positive lookahead (?=\s|$) to ensure that the match will be followed by whitespace or end of the line only:
>>> s = '$34.23 $23 $23.23 $2 $2313443.23 $3422342 $02394 $230.232 $232.2 $05.03'
>>> re.findall(r'\$[1-9]\d*(?:\.\d{2})?(?=\s|$)', s)
['$34.23', '$23', '$23.23', '$2', '$2313443.23', '$3422342']
Try this
\$(?!0\d)\d+(?:\.\d{2})?(?=\s|$)
Matches:
$34.23 $23 $23.23 $2 $2313443.23 $3422342 $0.99 $3.00
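One thing to be aware of: the lookahead (?!0\d) only rejects a leading zero that is followed by another digit, so a dollar part of exactly 0 still gets through. If the requirement that X be greater than 0 matters, this may not be what you want. A small check, using a made-up sample string:

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(r"\$(?!0\d)\d+(?:\.\d{2})?(?=\s|$)")

# Hypothetical sample chosen to exercise the zero cases.
sample = "$0.99 $05.03 $3.00 $0"

found = pattern.findall(sample)
print(found)  # "$05.03" is rejected, but "$0.99" and "$0" are accepted
```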
I have text with values like:
this is a value £28.99 (0.28/ml)
I want to remove everything to return the price only so it returns:
£28.99
there could be any number of digits between the £ and .
I think
r"£[0-9]*\.[0-9]{2}"
matches the pattern I want to keep, but I'm unsure how to remove everything else and keep the pattern, instead of replacing the pattern as in the usual re.sub() cases.
You wrote: "I want to remove everything to return the price only."
Why not try to extract the proper information instead?
import re
s = "this is a value £28.99 (0.28/ml)"
m = re.search(r"£\d*(\.\d+)?", s)
if m:
    print(m.group(0))
To find several occurrences, use findall or finditer instead of search.
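For the several-occurrences case, a minimal sketch could look like the following. The input string is made up, and the pattern is a slight variation that requires at least one digit and uses a non-capturing group, since findall returns only the group contents when the pattern has a capturing group.

```python
import re

# Hypothetical input containing two prices.
text = "small £3.50 (0.35/ml), large £28.99 (0.28/ml)"

# (?:...) keeps the decimal part optional without creating a capture group.
price_re = re.compile(r"£\d+(?:\.\d+)?")

# finditer yields match objects; group(0) is the whole matched price.
prices = [m.group(0) for m in price_re.finditer(text)]
print(prices)
```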
You don't care how many digits are before the decimal, so using the zero-or-more matcher was correct. However, you could just rely on the digit class (\d) to provide that more succinctly.
The same is true of after the decimal. You only need two so your limiting the matches to 2 is correct.
The issue then comes in with how you actually capture the value. You can use a capturing group to be sure that you only ever get the value you care about.
Complete regex:
(£\d*\.\d{2})
Sample code:
import re
r = re.compile(r"(£\d*\.\d{2})")
match = r.findall("this is a value £28.99 (0.28/ml)")
if match:  # may bring back an empty list; check for that here
    print(match[0])  # uses the first match, and will print £28.99
If it's a string, you can do something like this:
x = "this is a value £28.99 (0.28/ml)"
x_list = x.split()
for i in x_list:
    if "£" in i:  # or if i.startswith("£") (credit: Jean-François Fabre)
        value = i
print(value)
Output:
£28.99
You can try:
import re
t = "this is a value £28.99 (0.28/ml)"
r = re.sub(r".*(£[\d.]+).*", r"\1", t)
print(r)
Output:
£28.99
I have a FASTA file with a header that includes the sequence name and length:
>1 9081 bp
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga
I need to remove everything after the name "1" and tried doing that in python by:
newfile.write(oldfile.replace("bp",""))
This removes "bp" but I still have the numbers now.
>1 9081
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga
How do I designate the term: any character followed by bp to be replaced with nothing. I tried ***bp or ---bp or ...bp but those don't work.
Thanks!
Radwa
You should use a regular expression for this purpose.
Try this (assuming the sequence name may contain more than one character and may contain both digits and letters):
import re
regex = re.compile(r'(^\w+)\s.*', re.DOTALL)
print(regex.sub(r'\1', '1 9081 bp\ngcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga' ))
print(regex.sub(r'\1', 's12d 9081 bp\ngcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga' ))
Output:
1
s12d
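Note that with re.DOTALL the .* above also consumes the sequence lines. If the sequence should be kept and only the header trimmed, a sketch along these lines may be closer to what the question asks. It assumes header lines start with ">" as in the question's file; the function name is made up.

```python
import re

def strip_header_details(fasta_text):
    """Keep '>' plus the first token on each header line; leave other lines alone."""
    # With re.MULTILINE, ^ and $ match at each line boundary, and without
    # re.DOTALL the .* cannot run past the end of the header line.
    return re.sub(r"^(>\S+).*$", r"\1", fasta_text, flags=re.MULTILINE)

sample = ">1 9081 bp\ngcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga\n"
cleaned = strip_header_details(sample)
print(cleaned)
```

Sequence lines do not start with ">", so the pattern never matches them.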
I'm a newbie at python.
So my file has lines that look like this:
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
I need help coming up with the correct python code to extract every float preceded by a colon and followed by a space (ex: [-0.294118, 0.487437,etc...])
I've tried dataList = re.findall(':(.*) ', str(line)) and dataList = re.split(':(.*) ', str(line)), but these come up with the whole line. I've been researching this problem for a while now, so any help would be appreciated. Thanks!
try this one:
:(-?\d\.\d+)\s
In your code that will be
p = re.compile(r':(-?\d\.\d+)\s')
# findall returns every captured float; match() would only try the start of the line
dataList = p.findall(str(line))
This is more specific about what you want.
In your case, .* will match everything it can.
Tested on Regexr.com: the last element wasn't captured because it has no space following it. If this is a problem, just remove the \s from the regex.
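As a sketch of that idea, the variant below drops the trailing \s and also makes the decimal part optional, so an integer value such as 5:-1 is picked up too. The widened pattern and the names number_re and data_list are assumptions for illustration, not part of the answer above.

```python
import re

line = ("-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 "
        "5:-1 6:0.00149028 7:-0.53117 8:-0.0333333")

# A colon, then an optionally signed number with an optional decimal part.
# Only the number itself is captured.
number_re = re.compile(r":(-?\d+(?:\.\d+)?)")

# findall returns the captured strings; convert each one to float.
data_list = [float(x) for x in number_re.findall(line)]
print(data_list)
```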
This will do it:
import re
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
for match in re.finditer(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE):
    print(match.group(1))
Or, for the first match only:
match = re.search(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE)
if match:
    datalist = match.group(1)
else:
    datalist = ""
Output:
-0.294118
0.487437
0.180328
-0.292929
0.00149028
-0.53117
-0.0333333
Live Python Example:
http://ideone.com/DpiOBq
Regex Demo:
https://regex101.com/r/nR4wK9/3
Regex Explanation
(-?\d\.\d+)
Match the regex below and capture its match into backreference number 1 «(-?\d\.\d+)»
    Match the character “-” literally «-?»
        Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
    Match a single character that is a “digit” (ASCII 0–9 only) «\d»
    Match the character “.” literally «\.»
    Match a single character that is a “digit” (ASCII 0–9 only) «\d+»
        Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Given:
>>> s='-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333.333'
With your particular data example, you can just grab the parts that would be part of a float with a regex:
>>> re.findall(r':([\d.-]+)', s)
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
You can also split and partition, which would be substantially faster:
>>> [e.partition(':')[2] for e in s.split() if ':' in e]
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
Then you can convert those to a float using try/except and map and filter:
>>> def conv(s):
...     try:
...         return float(s)
...     except ValueError:
...         return None
...
>>> list(filter(None, map(conv, [e.partition(':')[2] for e in s.split() if ':' in e])))
[-0.294118, 0.487437, 0.180328, -0.292929, -1.0, 0.00149028, -0.53117]
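One caveat with filter(None, ...): it discards every falsy value, so a legitimate 0.0 would be dropped along with the None results. A sketch that filters on None explicitly, using a made-up sample line and a hypothetical helper name:

```python
def to_float(token):
    """Return float(token), or None if the token is not a valid float."""
    try:
        return float(token)
    except ValueError:
        return None

# Hypothetical line: one zero value and one malformed value.
s = "-1 1:0.0 2:0.5 3:-1 4:-0.0333.333"

tokens = [e.partition(':')[2] for e in s.split() if ':' in e]
converted = [to_float(t) for t in tokens]

# Keep everything that converted, including 0.0.
values = [v for v in converted if v is not None]
print(values)
```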
A simple oneliner using list comprehension -
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
# [1:] skips the chunk before the first colon, which holds only the leading label
[float(s.split()[0]) for s in line.split(':')[1:]]
Note: this is the simplest to understand (and probably the fastest), as no regex evaluation is involved. But it only works for the particular format above; for example, extracting the second number from a less regularly formatted string would need more work than a one-liner.
I have a set of strings like this:
uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;
I need to extract the sub-string between two semicolons and I only need the first occurrence.
The result should be: C1orf159
I have tried this code, but it does not work:
import re
info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"
name = re.search(r'\;(.*)\;', info)
print name.group()
Please help me.
Thanks
You can split the string and limit it to two splits.
x = info.split(';',2)[1]
import re
pattern = re.compile(r".*?;([a-zA-Z0-9]+);.*")
print(pattern.match(info).groups())
This looks for the first ; by matching non-greedily with .*?. Then it captures the alphanumeric string up to the next ;, and the trailing .* consumes the rest of the string. The captured value is retrieved through .groups().
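For reference, the pattern in the question fails because .* is greedy and runs to the last semicolon, and because group() with no argument returns the whole match, semicolons included. Two small fixes are sketched below; the variable names are illustrative.

```python
import re

info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"

# Option 1: make the quantifier lazy so it stops at the next semicolon.
lazy = re.search(r";(.*?);", info)

# Option 2: match only characters that are not semicolons.
negated = re.search(r";([^;]*);", info)

# group(1) is the captured text between the first two semicolons.
print(lazy.group(1))
print(negated.group(1))
```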
I have a string pa$$word. I want to change this string to pa\$\$word. The change should apply only to runs of two or more such characters, not to pa$word. Each "$" in the run must be escaped, so the replacement happens n times, where n is the number of "$" symbols. For example, pa$$$$word becomes pa\$\$\$\$word and pa$$$word becomes pa\$\$\$word.
How can I do it?
import re
def replacer(matchobj):
    mat = matchobj.group()
    return "".join(item for items in zip("\\" * len(mat), mat) for item in items)
print(re.sub(r"((\$)\2+)", replacer, "pa$$$$word"))
# pa\$\$\$\$word
print(re.sub(r"((\$)\2+)", replacer, "pa$$$word"))
# pa\$\$\$word
print(re.sub(r"((\$)\2+)", replacer, "pa$$word"))
# pa\$\$word
print(re.sub(r"((\$)\2+)", replacer, "pa$word"))
# pa$word
((\$)\2+) - We create two capturing groups here. The first one is the entire match as it is, which can be referred to later as \1. The second capturing group is a nested one, which captures a literal $ and is referred to as \2. So we first match $ once and then require that it repeats at least once more, consecutively, with \2+.
When we find a string like that, re.sub calls the replacer function with a match object holding the matched string and the captured groups. In the replacer function, we get the entire matched string with matchobj.group() and then simply interleave that matched string with \.
I believe the regex you're after is:
[$]{2,}
which will match 2 or more of the character $
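A minimal sketch of how that pattern could be plugged into re.sub follows. The function name is made up, and the replacement callback is an assumption about how to use the regex, since the answer above gives only the pattern.

```python
import re

def escape_dollar_runs(text):
    """Put a backslash before each '$' that sits in a run of two or more."""
    # The callback gets the whole run of dollars and returns the same
    # number of "\$" pairs; a string returned by a callback is used
    # literally, so no further escape processing happens.
    return re.sub(r"[$]{2,}", lambda m: "\\$" * len(m.group()), text)

for word in ("pa$$$$word", "pa$$$word", "pa$$word", "pa$word"):
    print(escape_dollar_runs(word))
```

A single $ never matches {2,}, so pa$word comes back unchanged.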
This should help:
import re
result = re.sub(r"\$", r"\\$", yourString)
Or you can try:
yourString.replace("$", "\\$")
Note that both of these escape every $, including a single one.