How can I solve this regular expression in Python? - python

I would like to construct a reg expression pattern for the following string, and use Python to extract:
str = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
What I want to do is extract the standalone number values and add them up, which should give 278. A preliminary Python snippet is:
import re
x = re.findall('([0-9]+)', str)
The problem with the above code is that digits inside a word, such as the 3 in 'ar3', are matched as well. Any idea how to solve this?

Why not try something simpler like this?:
str = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
print(sum(int(s) for s in str.split() if s.isdigit()))
# 278

s = re.findall(r"\s\d+\s", str) # \s matches the whitespace before and after the number
print(sum(map(int, s))) # int() ignores the surrounding whitespace
\d+ matches one or more digits. For the string above this gives the expected output:
278

How about this?
x = re.findall(r'\s([0-9]+)\s', str)

The solutions posted so far only work (if at all) for numbers that are preceded and followed by whitespace. They will fail if a number occurs at the very start or end of the string, or if a number appears at the end of a sentence, for example. This can be avoided using word boundary anchors:
s = "100 bottles of beer on the wall (ignore the 1000s!), now 99, now only 98"
nums = re.findall(r"\b\d+\b", s) # \b matches at the start/end of an alphanumeric sequence
print(sum(map(int, nums)))
Result: 297
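Word boundaries also accept numbers that touch punctuation (the 99 in "99," above). If only whitespace-delimited numbers should count, as in the original question, lookarounds are one option. A minimal sketch, assuming "independent" means surrounded by whitespace or the string edges; the function name is made up for illustration:

```python
import re

def sum_standalone_numbers(text):
    # (?<!\S): no non-whitespace character directly before the digits
    # (?!\S):  no non-whitespace character directly after the digits
    # Lookarounds consume nothing, so adjacent numbers that share a single
    # space ("34 44") are both found.
    tokens = re.findall(r"(?<!\S)\d+(?!\S)", text)
    return sum(int(t) for t in tokens)

text = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
print(sum_standalone_numbers(text))  # 278
```

Unlike the \s...\s patterns, this also matches a number at the very start or end of the string.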

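To avoid a partial match, anchor the pattern at both ends. Applied to the whole string it only matches when the string consists entirely of digits, so test each whitespace-separated token separately:
'^[0-9]*$'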

Related

pandas regex look ahead and behind from a 1st occurrence of character

I have python strings like below
"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the below
a) extract characters that appear before and after 1st dot
b) The keywords that I want are always found after the last _ symbol
For ex: If you look at 2nd input string, I would like to get only PQRST.GHI as output. It is after last _ and before 1st . and we also get keyword after 1st .
So, I tried the below
for s in strings:
    after_part = s.split('.')[1]
    before_part = s.split('.')[0]
    before_part = before_part.split('_')[-1]
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)
Though this works, it is definitely not a nice and elegant way to do it.
Is there a better way to write this, for example with a regex?
I expect my output to be as below. As you can see, we get the keywords before and after the first dot character:
GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Try:
import re
strings = [
"1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")
for s in strings:
    print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
You can do:
df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)
Output:
0 ABCDEF.GHI
1 PQRST.GHI
2 JKLMN.OPQ
3 WXY.TUV
Name: text, dtype: object
You can also do it with rsplit(). Specify maxsplit, so that you don't split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
If there are strings with fewer than two dots and each returned string should have one dot in it, then add a conditional expression that splits (or not) depending on the number of dots in the string.
[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x
 for s in strings
 for x in [s.rsplit('_', maxsplit=1)[1]]]
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
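The same logic can be wrapped in a small helper, which may be easier to read and test than the nested comprehension. This is only a sketch; extract_keyword is a hypothetical name, and it assumes the keyword always follows the last underscore:

```python
def extract_keyword(filename):
    # Everything after the last underscore, e.g. "PQRST.GHI.xlsx".
    # rpartition returns the whole string in slot [2] if there is no "_".
    tail = filename.rpartition("_")[2]
    # Drop the final extension only when more than one dot remains,
    # so that every result keeps exactly one dot.
    if tail.count(".") > 1:
        tail = tail.rpartition(".")[0]
    return tail

print(extract_keyword("1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"))
# PQRST.GHI
print(extract_keyword("1234_4534_41247612_2462184_2131_GHI.xlsx"))
# GHI.xlsx
```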

Deleting numbers in a string using regex

Replacing numbers with a placeholder in a string, including decimals and percentages, using re in Python
def remove_numbers(text):
    remove = re.sub(r"\W\d\S*", " [DD]", text)
    return remove
The function works fine on this sample string:
sample = "I can give you 10% of 100,000 to you. The thing went up by 10% so it costs 12.25 euros now."
But if a string starts with a number, the first number does not get replaced by the placeholder.
So looping through the replace method seems to be the easiest way to do this. Note that it replaces every single digit, not every number:
def remove_numbers(text):
    nums = '0123456789'
    for i in nums:
        text = text.replace(i, '[DD]')
    return text
\W will not match at the start of string. It appears you are using \W to make sure that the number you are replacing is not a part of a word. This makes sense. But, \W doesn't match at start-of-string. You can use \A for that. But, you probably don't want to add a space when you are replacing at start-of-string. This can be done in a single regex, but I think it results in easier-to-read code if you do it in two steps.
import re
def remove_numbers(text):
    # replace internal numbers that are not a part of a word (adds a space)
    remove = re.sub(r"\W\d\S*", " [DD]", text)
    # replace number at start of string (if any) (does not add a space)
    remove = re.sub(r"\A\d\S*", "[DD]", remove)
    return remove
a = "3 foxes jumped over 3 fences"
b = remove_numbers(a)
print("before <{}>".format(a))
print("after <{}>".format(b))
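For comparison, the single-regex variant mentioned above could look like the following. It is a sketch under one assumption: the character before the number should be kept as it is rather than replaced by a space. The function name is made up for illustration.

```python
import re

def remove_numbers_single_pass(text):
    # (?<!\w) asserts that no word character precedes the digit. It also
    # succeeds at the start of the string and consumes nothing, so the
    # character in front of the number is left untouched.
    return re.sub(r"(?<!\w)\d\S*", "[DD]", text)

print(remove_numbers_single_pass("3 foxes jumped over 3 fences"))
# [DD] foxes jumped over [DD] fences
```

Because the lookbehind consumes nothing, a number in parentheses such as "(3)" becomes "([DD]" here, whereas the two-step version turns the opening parenthesis into a space.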
\W must consume a character, so it cannot match when the number is at the very beginning of the string, where nothing precedes it.
Use \b instead of \W to match word boundaries:
def remove_numbers(text):
    remove = re.sub(r"\b\d\S*", "[DD]", text)
    return remove
Or, keeping more in the spirit of your original code:
def remove_numbers(text):
    remove = re.sub(r"(\s|^)\d\S*", r"\1[DD]", text)
    return remove
And use \d+ instead of \d if you want to also match multiple digits in a row.
Do this:
import re
def remove_numbers(text):
    remove = re.sub(r"\W?\d\S*", " [DD]", text)
    return remove.strip()
print(remove_numbers("3 foxes jumped over 3 fences"))
The ? means zero or one occurrence of the preceding pattern, so the \W before the number becomes optional.
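For reference, ? makes the preceding item optional (zero or one occurrence), whereas * allows zero or more. A quick check with a throwaway helper (fully_matches is a name invented for this sketch):

```python
import re

def fully_matches(pattern, s):
    # True only if the whole string matches the pattern
    return re.fullmatch(pattern, s) is not None

# a? : zero or one "a"
print(fully_matches(r"a?b", "b"))    # True
print(fully_matches(r"a?b", "ab"))   # True
print(fully_matches(r"a?b", "aab"))  # False
# a* : zero or more "a"
print(fully_matches(r"a*b", "aab"))  # True
```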
Change your regex to:
remove = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " [DD] ", text)
All code:
import re
def remove_numbers(text):
    s = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " [DD] ", text)
    return s
t1 = "3 foxes jumped over 3 fences"
print(remove_numbers(t1))
Output:
[DD] foxes jumped over [DD] fences

remove all characters aside from the decimal numbers immediately followed by the £ sign

I have text with values like:
this is a value £28.99 (0.28/ml)
I want to remove everything to return the price only so it returns:
£28.99
There could be any number of digits between the £ and the decimal point.
I think
r"£[0-9]*\.[0-9]{2}"
matches the pattern I want to keep, but I'm unsure how to remove everything else and keep the pattern, instead of replacing the pattern as in the usual re.sub() cases.
You say you want to "remove everything to return the price only".
Why not extract the proper information instead?
import re
s = "this is a value £28.99 (0.28/ml)"
m = re.search(r"£\d*(\.\d+)?", s)
if m:
    print(m.group(0))
To find several occurrences, use findall or finditer instead of search.
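One thing to watch when switching to findall: if the pattern contains a capturing group, findall returns the group contents rather than the whole match. A sketch that sidesteps this with a non-capturing group and finditer; the second price in the sample text is invented, and requiring at least one digit after the £ is an assumption:

```python
import re

def find_prices(text):
    # (?:...) is non-capturing, so each match object's group(0) is the
    # complete price, including the optional decimal part.
    return [m.group(0) for m in re.finditer(r"£\d+(?:\.\d+)?", text)]

print(find_prices("was £31.50, now £28.99 (0.28/ml)"))
# ['£31.50', '£28.99']
```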
You don't care how many digits are before the decimal, so using the zero-or-more matcher was correct. However, you could just rely on the digit class (\d) to provide that more succinctly.
The same is true of after the decimal. You only need two so your limiting the matches to 2 is correct.
The issue then comes in with how you actually capture the value. You can use a capturing group to be sure that you only ever get the value you care about.
Complete regex:
(£\d*\.\d{2})
Sample code:
import re
r = re.compile(r"(£\d*\.\d{2})")
match = r.findall("this is a value £28.99 (0.28/ml)")
if match: # may bring back an empty list; check for that here
    print(match[0]) # uses the first match, and will print £28.99
If it's a string, you can do something like this:
x = "this is a value £28.99 (0.28/ml)"
x_list = x.split()
for i in x_list:
    if "£" in i: #or if i.startswith("£") Credit – Jean-François Fabre
        value=i
        print(value)
>>>£28.99
You can try:
import re
t = "this is a value £28.99 (0.28/ml)"
r = re.sub(r".*(£[\d.]+).*", r"\1", t)
print(r)
Output:
£28.99

Check if a string ends with a decimal in Python 2

I want to check whether a string ends with a decimal number, where the digits vary. After searching for a while, the closest solution I found was to put the possible endings into a tuple and use that as the condition for endswith(). Is there a shorter way that does not require listing every possible combination?
I tried hard-coding the end condition, but if there are new elements in the list it won't work for those. I also tried using a regex, but it returns other elements together with the decimal elements. Any help would be appreciated.
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]
for e in list1:
    if e.endswith('.0') or e.endswith('.98'):
        print 'pass'
Edit: Sorry, I should have specified that I do not want 'qwe -70' to be accepted; only elements with a decimal point should be accepted.
I'd like to propose another solution: using regular expressions to search for an ending decimal.
You can define a regular expression for an ending decimal with the following regex [-+]?[0-9]*\.[0-9]+$.
The regex broken apart:
[-+]?: optional - or + symbol at the beginning
[0-9]*: zero or more digits
\.: required dot
[0-9]+: one or more digits
$: must be at the end of the line
Then we can test the regular expression to see if it matches any of the members in the list:
import re
regex = re.compile(r'[-+]?[0-9]*\.[0-9]+$')
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70", "test"]
for e in list1:
    if regex.search(e) is not None:
        print e + " passes"
    else:
        print e + " does not pass"
The output for the previous script is the following:
abcd 1.01 passes
zyx 22.98 passes
efgh 3.0 passes
qwe -70 does not pass
test does not pass
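The piece-by-piece breakdown above can also live inside the pattern itself by compiling with re.VERBOSE, which ignores whitespace and allows # comments in the regex. A minimal sketch; the names ENDING_DECIMAL and ends_with_decimal are illustrative only:

```python
import re

ENDING_DECIMAL = re.compile(r"""
    [-+]?    # optional - or + sign
    [0-9]*   # zero or more digits
    \.       # required dot
    [0-9]+   # one or more digits
    $        # must be at the end of the string
""", re.VERBOSE)

def ends_with_decimal(s):
    return ENDING_DECIMAL.search(s) is not None

for e in ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70", "test"]:
    print(e, ends_with_decimal(e))
```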
Your example data leaves many possibilities open:
Last character is a digit:
e[-1].isdigit()
Everything after the last space is a number:
try:
    float(e.rsplit(None, 1)[-1])
except ValueError:
    # no number
    pass
else:
    print "number"
Using regular expressions:
re.search('[.0-9]$', e) # search, because match only tries at the start of the string
suspects = [x.split() for x in list1] # split on whitespace; the second item is the candidate number
# try to cast it to float; a ValueError means it is not a number
for x in suspects:
    try:
        float(x[1])
        print "{} - ends with float".format(" ".join(x))
    except ValueError:
        print "{} - does not end with float".format(" ".join(x))
Output:
abcd 1.01 - ends with float
zyx 22.98 - ends with float
efgh 3.0 - ends with float
qwe -70 - ends with float
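As the output shows, a plain float() check also accepts 'qwe -70', which the question's edit rules out. One way to combine both requirements is to insist on a dot in the last token before trying the conversion. A sketch only, written to run on Python 3 as well; the function name is hypothetical:

```python
def ends_with_decimal_number(s):
    parts = s.rsplit(None, 1)  # split off the last whitespace-separated token
    if not parts:
        return False           # empty or whitespace-only string
    token = parts[-1]
    if "." not in token:
        return False           # integers such as "-70" are rejected
    try:
        float(token)
    except ValueError:
        return False
    return True

for e in ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]:
    print(e, ends_with_decimal_number(e))
```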
I think this will work for this case:
import re
regex = r"([0-9]+\.[0-9]+)"
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]
for e in list1:
    token = e.split(' ')[1] # avoid shadowing the built-in str
    if re.search(regex, token):
        print True #Code for yes condition
    else:
        print False #Code for no condition
As you correctly guessed, endswith() is not a good way to approach this, given that the number of combinations is basically infinite. The way to go is, as many suggested, a regular expression that matches a decimal point followed by any number of digits at the end of the string. Besides that, keep the code simple and readable. The strip() is in there just in case an input string has an extra space at the end, which would unnecessarily complicate the regex.
You can see this in action at: https://eval.in/649155
import re
regex = r"[0-9]+\.[0-9]+$"
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]
for e in list1:
    if re.search(regex, e.strip()):
        print e, 'pass'
The following may help:
import re
reg = re.compile(r'^[a-z]+ \-?[0-9]+\.[0-9]+$')
the_string = "abcd 1.01" # example input
if reg.match(the_string):
    print("ends with a decimal") # do something...
else:
    print("no trailing decimal") # do other...

Need help extracting data from a file

I'm a newbie at python.
So my file has lines that look like this:
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
I need help coming up with the correct Python code to extract every float preceded by a colon and followed by a space (e.g. [-0.294118, 0.487437, ...]).
I've tried dataList = re.findall(':(.*) ', str(line)) and dataList = re.split(':(.*) ', str(line)), but these come up with the whole line. I've been researching this problem for a while now, so any help would be appreciated. Thanks!
Try this one:
:(-?\d\.\d+)\s
In your code that will be
p = re.compile(r':(-?\d\.\d+)\s')
dataList = p.findall(str(line)) # findall, because match only tries at the start of the line
This is more specific about what you want.
In your case, .* will match everything it can.
Note that the last element is not captured because no whitespace follows it; if this is a problem, just remove the \s from the regex.
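An alternative to dropping the \s is a lookahead that accepts either whitespace or the end of the string. The sketch below also makes the fractional part optional so that integer values such as 5:-1 are returned; whether those should count as floats is an assumption, and the function name is made up:

```python
import re

line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"

def values_after_colon(line):
    # (?=\s|$) checks for whitespace or end of string without consuming it,
    # so the last value on the line is found as well.
    found = re.findall(r":(-?\d+(?:\.\d+)?)(?=\s|$)", line)
    return [float(v) for v in found]

print(values_after_colon(line))
# [-0.294118, 0.487437, 0.180328, -0.292929, -1.0, 0.00149028, -0.53117, -0.0333333]
```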
This will do it:
import re
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
for match in re.finditer(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE):
    print(match.group(1))
Or, to get only the first match:
match = re.search(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE)
if match:
    datalist = match.group(1)
else:
    datalist = ""
Output:
-0.294118
0.487437
0.180328
-0.292929
0.00149028
-0.53117
-0.0333333
Live Python Example:
http://ideone.com/DpiOBq
Regex Demo:
https://regex101.com/r/nR4wK9/3
Regex Explanation
(-?\d\.\d+)
Match the regex below and capture its match into backreference number 1 «(-?\d\.\d+)»
Match the character “-” literally «-?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character that is a “digit” (ASCII 0–9 only) «\d»
Match the character “.” literally «\.»
Match a single character that is a “digit” (ASCII 0–9 only) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Given:
>>> s='-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333.333'
With your particular data example, you can just grab the parts that would be part of a float with a regex:
>>> re.findall(r':([\d.-]+)', s)
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
You can also split and partition, which would be substantially faster:
>>> [e.partition(':')[2] for e in s.split() if ':' in e]
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
Then you can convert those to floats using try/except with map and filter; the malformed last value '-0.0333.333' fails to convert and is dropped:
>>> def conv(s):
...     try:
...         return float(s)
...     except ValueError:
...         return None
...
>>> list(filter(None, map(conv, [e.partition(':')[2] for e in s.split() if ':' in e])))
[-0.294118, 0.487437, 0.180328, -0.292929, -1.0, 0.00149028, -0.53117]
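One caveat: filter(None, ...) drops every falsy value, so a legitimate 0.0 would be discarded together with the None placeholders. A sketch that keeps zeros by comparing against None explicitly; the helper names and the sample input are invented for illustration:

```python
def to_float_or_none(token):
    try:
        return float(token)
    except ValueError:
        return None

def parse_values(s):
    tokens = [e.partition(":")[2] for e in s.split() if ":" in e]
    converted = (to_float_or_none(t) for t in tokens)
    # Compare against None explicitly so that 0.0 survives
    return [v for v in converted if v is not None]

print(parse_values("1:0.0 2:1.5 3:0.03.3"))
# [0.0, 1.5]
```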
A simple one-liner using a list comprehension:
str = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
[float(s.split()[0]) for s in str.split(':')[1:]]
Note: this is the simplest to understand (and probably the fastest), as we are not doing any regex evaluation. The [1:] skips the piece before the first colon, which holds the leading -1 rather than a value. This only works for the particular format above; a less regularly formatted string would need more work than a one-liner.
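If the position of each value matters, the same split-based idea can build a dictionary instead of a flat list. This sketch assumes that every line is a leading number followed by whitespace-separated index:value pairs, which is only inferred from the sample line; parse_line is a hypothetical name:

```python
def parse_line(line):
    fields = line.split()
    label = float(fields[0])  # assumed: the first field is a standalone number
    values = {}
    for field in fields[1:]:
        index, _, value = field.partition(":")
        values[int(index)] = float(value)
    return label, values

line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
label, values = parse_line(line)
print(label)      # -1.0
print(values[2])  # 0.487437
```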
