I want to find the first float that appears in a string using Python 3.
I looked at other similar questions but I couldn't understand them and when I tried to implement them they didn't work for my case.
An example string would be
I would like 1.5 cookies please
I'm pretty sure there's a more elegant solution, but this one works for your specific case:
s = 'I would like 1.5 cookies please'
result = None  # stays None if no token converts
for i in s.split():
    try:
        # try to convert the token to a float
        result = float(i)
        # stop at the first token that converts successfully
        break
    except ValueError:
        continue
print(result)  # 1.5
You can find this using a regex. Note that this pattern only matches substrings that are already written in decimal notation (digits, a dot, more digits), so something like this:
>>> import re
>>> matches = re.findall(r"[+-]?\d+\.\d+", "I would like 1.5 cookies please")
As you say you want only the first one:
>>> matches[0]
'1.5'
Edit: Added [+-]? to the pattern for it to recognize negative floats, as pistache recommended!
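If only the first match is needed, re.search avoids building the whole list, and it makes the no-match case explicit instead of raising an IndexError on matches[0]. A minimal sketch, assuming the same decimal-only pattern as above; the helper name first_float is made up for illustration:

```python
import re

# Same decimal-only pattern as above: optional sign, digits, a dot, digits.
FLOAT_PATTERN = re.compile(r"[+-]?\d+\.\d+")

def first_float(text):
    """Return the first decimal number in text as a float, or None if there is none."""
    match = FLOAT_PATTERN.search(text)
    return float(match.group(0)) if match else None

print(first_float("I would like 1.5 cookies please"))  # 1.5
print(first_float("no numbers here"))                   # None
```

Integers such as "2" and exponent forms such as "1e5" are not matched by this pattern; it would need extending if those should count.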
If you expect whitespace separated decimal floats, using str methods and removing -+.:
s = 'I would like 1.5 cookies please'
results = [t for t in s.split()
           if t.lstrip('+-').replace('.', '', 1).isdigit()]
print(results[0]) #1.5
lstrip removes the sign only on the left-hand side of the text, and the third argument to replace limits the replacement to a single dot. The exact implementation depends on how you expect floats to be formatted (whether to allow whitespace after the sign, etc.).
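To see which tokens this check accepts, it can help to run it over a few edge cases. A small sketch; the helper name looks_like_decimal and the sample tokens are made up for illustration:

```python
def looks_like_decimal(token):
    """The str-method check from above, wrapped in a hypothetical helper."""
    return token.lstrip('+-').replace('.', '', 1).isdigit()

# Made-up sample tokens covering a few edge cases.
for token in ['1.5', '-2.0', '42', '1.2.3', '1e5', '1.5,', '+-1.5']:
    print(repr(token), looks_like_decimal(token))
```

Plain integers such as '42' pass, and so does a doubled sign such as '+-1.5', because lstrip removes every leading '+' or '-'. A trailing comma or an exponent makes the token fail.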
I would use a regex. The pattern below also handles negative values.
import re
stringToSearch = 'I would like 1.5 cookies please'
# no leading ".*": a greedy prefix would swallow the sign and the leading digits
searchPattern = re.compile(r"(-?[0-9]+\.[0-9]+)")
searchMatch = searchPattern.search(stringToSearch)
if searchMatch:
    floatValue = searchMatch.group(1)
else:
    raise Exception('float not found')
You can use PyRegex to check the regex.
I have text with values like:
this is a value £28.99 (0.28/ml)
I want to remove everything to return the price only so it returns:
£28.99
there could be any number of digits between the £ and .
I think
r"£[0-9]*\.[0-9]{2}"
matches the pattern I want to keep, but I'm unsure how to remove everything else and keep the pattern, instead of replacing the pattern as in the usual re.sub() cases.
I want to remove everything to return the price only so it returns:
Why not try to extract the relevant information instead?
import re
s = "this is a value £28.99 (0.28/ml)"
m = re.search(r"£\d*(\.\d+)?", s)
if m:
    print(m.group(0))
To find several occurrences, use findall or finditer instead of search.
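One thing to watch when switching to findall: if the pattern contains a capturing group, findall returns the group contents rather than the whole match. A small sketch, with a made-up sample string, that makes the group non-capturing:

```python
import re

text = "small £5, medium £28.99, large £100.50"  # made-up sample data

# With the capturing group, findall returns only what the group matched.
print(re.findall(r"£\d*(\.\d+)?", text))    # ['', '.99', '.50']

# With a non-capturing group (?:...), findall returns the whole match.
print(re.findall(r"£\d*(?:\.\d+)?", text))  # ['£5', '£28.99', '£100.50']
```

finditer does not have this quirk, since each match object still exposes the full match through group(0).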
You don't care how many digits are before the decimal, so using the zero-or-more matcher was correct. However, you could just rely on the digit class (\d) to provide that more succinctly.
The same is true after the decimal point: you only need two digits, so limiting the match to {2} is correct.
The issue then comes in with how you actually capture the value. You can use a capturing group to be sure that you only ever get the value you care about.
Complete regex:
(£\d*\.\d{2})
Sample code:
import re
r = re.compile(r"(£\d*\.\d{2})")
match = r.findall("this is a value £28.99 (0.28/ml)")
if match:  # findall may return an empty list; check for that here
    print(match[0])  # first match; prints £28.99
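The dot in this pattern is worth escaping: an unescaped . matches any character, so a value with no decimal point can slip through. A quick check, using a made-up input string:

```python
import re

text = "total £2899"  # made-up input with no decimal point

# Unescaped dot: '.' matches the digit '8', so this still finds a "price".
print(re.findall(r"(£\d*.\d{2})", text))   # ['£2899']

# Escaped dot: a literal '.' is required, so nothing matches.
print(re.findall(r"(£\d*\.\d{2})", text))  # []
```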
If it's a string, you can do something like this:
x = "this is a value £28.99 (0.28/ml)"
x_list = x.split()
for i in x_list:
    if "£" in i:  # or i.startswith("£"), credit: Jean-François Fabre
        value = i
        print(value)
>>>£28.99
You can try:
import re
t = "this is a value £28.99 (0.28/ml)"
r = re.sub(r".*(£[\d.]+).*", r"\1", t)
print(r)
Output:
£28.99
How can I get the string "534641" (the value is dynamic and can be 4, 5, or 6 digits long)? How do I find the "-" before "534641"?
import re
string = "http://www.test.com.my/white-red-gift-perfume-powerbank-yellow-534641.html?ff=1\u0026s=Ebsr"
m = re.search('-(.+?).html', string).group(1)
print (m)
https://repl.it/JSxp
You are almost there. Since what you want is only digits, you could use \d to capture only digits:
>>> m = re.search(r'-(\d+)\.html', string).group(1)
>>> print (m)
534641
Another way would be to match 'any character except -':
>>> m = re.search(r'-([^-]+)\.html', string).group(1)
>>> print (m)
534641
For more info, see the doc.
Some quick notes: the .html should be \.html, since an unescaped dot matches any character. Also avoid names such as 'string' or 'list', which shadow a standard-library module and a built-in type; that can cause confusing errors later on.
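Putting those notes together, a minimal sketch might look like this. The helper name trailing_id is made up, and returning None when nothing matches is an assumption about the desired behaviour:

```python
import re

def trailing_id(url):
    """Return the run of digits between a '-' and a literal '.html', or None."""
    match = re.search(r"-(\d+)\.html", url)
    return match.group(1) if match else None

url = "http://www.test.com.my/white-red-gift-perfume-powerbank-yellow-534641.html?ff=1&s=Ebsr"
print(trailing_id(url))                              # 534641
print(trailing_id("http://example.com/index.html"))  # None
```

Checking the match object before calling group(1) avoids an AttributeError when the URL has no such ID.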
You already have the number at the end. Just split on the dashes using:
m = re.search('-(.+?).html', string).group(1).split("-")
# last element in m is the number you are looking for
print (m[-1])
I would like to get the string after a specific keyword.
For example:
import re
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
abc = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"
if findWholeWord("SeedNumber")(abc):
    dddd = re.search(r'(?<=ThreepointShooter)(.\w+)', abc)
    mvp = dddd.group()
    print (mvp)
    print ("found")
else:
    print ("not found")
I expect the result to be 'MVP1times'.
Is there a better method to find a specific string after a keyword? The result may be a string, digits, or even a mix like the result above.
Thanks for help!
You can use look-arounds to get the string surrounded by > and < (assuming this stays consistent):
>>> s = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"
>>> re.search(r'(?<=\>)[^<]+(?=\<)', s).group(0)
'MVP1times'
You can change the regular expression to: (?<=ThreepointShooter['"]>)(\w+). See it live on http://pythex.org/
I'm not sure exactly what you're going to do, but you don't even need a lookbehind expression here.
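Without a lookbehind, the keyword can simply be part of the pattern and the wanted text placed in a capturing group. A rough sketch, assuming the text always follows the keyword, an optional quote, and a '>'; the helper name text_after is made up:

```python
import re

def text_after(keyword, markup):
    """Return the text after keyword, an optional quote, and '>' (hypothetical helper)."""
    # re.escape keeps any special characters in the keyword literal.
    pattern = re.escape(keyword) + r"['\"]?>([^<]+)"
    match = re.search(pattern, markup)
    return match.group(1) if match else None

abc = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"
print(text_after("ThreepointShooter", abc))  # MVP1times
print(text_after("SeedNumber", abc))         # None
```

Because the group uses [^<]+ rather than \w+, it also picks up text containing spaces or punctuation, up to the next tag.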
I have a string like this:
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar><foo>
I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.
I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:
<v1>aaa<b>bbb</b>ccc</v1>
I know I could maybe 'do it right', but I actually do not wish to do xml nor html parsing for my purpose, which is to aid myself visualizing the xml representation of some classes.
Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with regex, i.e. right to left, because that seems to be unsupported:
If you mean, find the right-most match of several (similar to the
rfind method of a string) then no, it is not directly supported. You
could use re.findall() and chose the last match but if the matches can
overlap this may not give the correct result.
But .rstrip is not good with words, and won't do patterns either.
I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.
What approach could be used here? Should I reverse the string (ugly in itself and due to the '<>'s). Do tokenization (why not parse, then?)? Or create static closing tags based on the left-to-right match?
Which strategy to follow to strip the patterns from the end of the string?
The simplest would be to use old-fashioned string splitting and limit the split:
in_str.split('>', 3)[-1].rsplit('<', 3)[0]
Demo:
>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar><foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'
str.split() and str.rsplit() with a limit will split the string from the start or the end up to the limit times, letting you select the remainder unsplit.
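Printing the intermediate lists makes it easier to see what the limited splits leave behind. A small sketch; the function name strip_outer_tags and its n parameter are made up for illustration:

```python
def strip_outer_tags(in_str, n=3):
    """Drop the first n opening tags and the last n closing tags by limited splitting."""
    # split at the first n '>' characters; the last piece is everything after them
    head_parts = in_str.split('>', n)
    print(head_parts)
    # split at the last n '<' characters; the first piece is everything before them
    tail_parts = head_parts[-1].rsplit('<', n)
    print(tail_parts)
    return tail_parts[0]

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer_tags(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```

This counts delimiters, not tags, so it assumes that the string starts with n tags and ends with n tags.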
You've already got practically all the solution. re can't do backwards, but you can:
import re
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)
in_str = in_str[::-1]
print(in_str)  # reversed intermediate string
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)
in_str = in_str[::-1]
print(in_str)
<v1>aaa<b>bbb</b>ccc</v1>
Note the reversed regex for the reversed string, but then it goes back-to-front.
Of course, as mentioned, this is way easier with a proper parser:
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
from lxml import etree
ix = etree.fromstring(in_str)
print(etree.tostring(ix[0][0][0], encoding='unicode'))
<v1>aaa<b>bbb</b>ccc</v1>
I would look into regular expressions and use such a pattern with re.split:
http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
Sorry, can't comment, but will give it as an answer.
in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>.
You just should be aware of this.
To solve the counter example provided by me, you will have to track state (or count) of tags and evaluate that you match the correct pairs.
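As a rough illustration of that idea, the sketch below strips one tag pair at a time and checks that the opening tag's name matches the closing tag at the very end of the string. The function name and the error handling are assumptions, and this is only a partial check, not a full tag balancer:

```python
import re

# An opening tag at the start: '<', a tag name, optional attributes, '>'.
OPEN_TAG = re.compile(r'<([^<>/\s]+)[^<>]*>')

def strip_matching_outer_tags(text, n=3):
    """Strip n outer tag pairs, refusing when the outermost names do not match."""
    for _ in range(n):
        opening = OPEN_TAG.match(text)
        if opening is None:
            raise ValueError('no opening tag at the start')
        closing = '</{}>'.format(opening.group(1))
        if not text.endswith(closing):
            raise ValueError('expected {} at the end'.format(closing))
        # drop the opening tag from the front and its closing tag from the back
        text = text[opening.end():len(text) - len(closing)]
    return text

good = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_matching_outer_tags(good))  # <v1>aaa<b>bbb</b>ccc</v1>
```

With the counterexample above, the first iteration sees `<foo>` at the start but `</another>` at the end and raises ValueError. Sibling elements that happen to share a name would still fool it, which is where a real parser becomes the safer choice.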
I have a bunch of mathematical expressions stored as strings. Here's a short one:
stringy = "((2+2)-(3+5)-6)"
I want to break this string up into a list that contains ONLY the information in each "sub-parenthetical phrase" (I'm sure there's a better way to phrase that.) So my yield would be:
['2+2','3+5']
I have a couple of ideas about how to do this, but I keep running into a "okay, now what" issue.
For example:
for x in stringy:
    substring = stringy[stringy.find('(')+1 : stringy.find(')')+1]
    stringlist.append(substring)
Works just peachy to return 2+2, but that's about as far as it goes, and I am completely blanking on how to move through the remainder...
One way using regex:
import re
stringy = "((2+2)-(3+5)-6)"
for exp in re.findall(r"\(([\s\d+*/-]+)\)", stringy):
    print(exp)
Output
2+2
3+5
You could use regular expressions like the following:
import re
x = "((2+2)-(3+5)-6)"
re.findall(r"(?<=\()[0-9+/*-]+(?=\))", x)
Result:
['2+2', '3+5']
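If the expressions can also contain variable names or other characters, one option is to match anything except parentheses, which picks out the innermost groups. A minimal sketch; the helper name innermost_groups and the second sample expression are made up:

```python
import re

def innermost_groups(expression):
    """Return the contents of every parenthesised group that has no nested parentheses."""
    return re.findall(r"\(([^()]+)\)", expression)

print(innermost_groups("((2+2)-(3+5)-6)"))  # ['2+2', '3+5']
print(innermost_groups("(a*(b+c))/(d-e)"))  # ['b+c', 'd-e']
```

Outer groups are skipped because they contain parentheses themselves; handling arbitrary nesting levels would need a small parser or a stack rather than a single regex.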