I don't want to use string split because the numbers range from 1 to 99, and I have a column of strings that contain '#/#' somewhere in the text.
How can I write a regex to extract the number 10 in the following example:
He got 10/19 questions right.
Use a lookahead so that the / must follow the digits but is not part of the match, like this:
\d+(?=/)
You may need to escape the / if your implementation uses it as its delimiter.
Live example: https://regex101.com/r/xdT4vq/1
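In Python, the lookahead pattern above might be used roughly as follows. This is only a minimal sketch; the helper name first_number_before_slash is made up for illustration and is not part of the original answer.

```python
import re

def first_number_before_slash(text):
    """Return the first run of digits immediately followed by '/', or None."""
    # (?=/) asserts that a slash follows, without consuming it
    match = re.search(r"\d+(?=/)", text)
    return int(match.group()) if match else None

print(first_number_before_slash("He got 10/19 questions right."))  # 10
```

Because the pattern requires digits directly before the slash, a string such as "He/she got 10/19" still yields 10.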
You can still use str.split() if you carefully construct logic around it:
t = "He got 10/19 questions right."
t2 = "He/she got 10/19 questions right"
for q in [t, t2]:
    # split the whole string at spaces, then split each part at "/";
    # keep only parts that contain "/" (but not in first position)
    # and otherwise consist only of digits
    numbers = [x.split("/") for x in q.split()
               if "/" in x and all(c in "0123456789/" for c in x)
               and not x.startswith("/")]
    if numbers:
        print(numbers[0][0])
Output:
10
10
import re
myString = "He got 10/19 questions right."
oldnumber = re.findall('[0-9]+/', myString) #find one or more digits followed by a slash.
newNumber = oldnumber[0].replace("/","") #get rid of the slash.
print(newNumber)
>>>10
res = re.search(r'(\d+)/\d+', 'He got 10/19 questions right.')
res.groups()
('10',)
Find all numbers that precede a forward slash, and keep the slash itself out of the result by wrapping only the digits in a capturing group (parentheses).
>>> import re
>>> myString = 'He got 10/19 questions right.'
>>> stringNumber = re.findall('([0-9]+)/', myString)
>>> stringNumber
['10']
This returns every number that is followed by a forward slash, as a list of strings. If you want integers, map int over the list and convert the result back to a list.
>>> intNumber = list(map(int, stringNumber))
>>> intNumber
[10]
Related
How to get this string "534641" (this value is dynamic, can be 6,5,4 digits)? How to find "-" before "534641"?
import re
string = "http://www.test.com.my/white-red-gift-perfume-powerbank-yellow-534641.html?ff=1\u0026s=Ebsr"
m = re.search('-(.+?).html', string).group(1)
print (m)
https://repl.it/JSxp
You are almost there. Since what you want is only digits, you could use \d to capture only digits:
>>> m = re.search('-(\d+).html', string).group(1)
>>> print (m)
534641
Another way is to match 'any character except -':
>>> m = re.search('-([^-]+).html', string).group(1)
>>> print (m)
534641
For more info, see the doc.
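Some quick notes: the .html should be written \.html, because an unescaped . matches any character. Also avoid variable names such as 'string' or 'list': 'list' is a Python built-in and 'string' is a standard-library module, and shadowing them can cause confusing errors later.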
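To illustrate the note about escaping the dot, here is a small sketch. The second input, "item-123xhtml", is a made-up string used only to show where the unescaped pattern matches too much.

```python
import re

url = "http://www.test.com.my/white-red-gift-perfume-powerbank-yellow-534641.html?ff=1&s=Ebsr"

escaped = re.compile(r"-(\d+)\.html")   # \. matches a literal dot only
unescaped = re.compile(r"-(\d+).html")  # . matches any character

print(escaped.search(url).group(1))  # 534641

lookalike = "item-123xhtml"  # hypothetical input with no ".html" in it
print(unescaped.search(lookalike) is not None)  # True: the "x" satisfies "."
print(escaped.search(lookalike) is None)        # True: no literal dot, no match
```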
You already have the number at the end. Just split on the dashes using:
m = re.search('-(.+?).html', string).group(1).split("-")
# last element in m is the number you are looking for
print (m[-1])
I'm trying to move all non-alphanumeric characters to the end of the strings. I am having a hard time with the regex since I don't know where the special characters will be. Here are a couple of simple examples.
hello*there*this*is*a*str*ing*with*asterisks
and&this&is&a&str&ing&&with&ampersands&in&i&t
one%mo%refor%good%mea%sure%I%think%you%get%it
How would I go about sliding all the special characters to the end of the string?
Here is what I tried, but I didn't get anything.
re.compile(r'(.+?)(\**)')
r.sub(r'\1\2', string)
Edit:
Expected output for the first string would be:
hellotherethisisastringwithasterisks********
There's no need for regex here. Just use str.isalpha and build up two lists, then join them:
strings = ['hello*there*this*is*a*str*ing*with*asterisks',
           'and&this&is&a&str&ing&&with&ampersands&in&i&t',
           'one%mo%refor%good%mea%sure%I%think%you%get%it']
for s in strings:
    a = []
    b = []
    for c in s:
        if c.isalpha():
            a.append(c)
        else:
            b.append(c)
    print(''.join(a+b))
Result:
hellotherethisisastringwithasterisks********
andthisisastringwithampersandsinit&&&&&&&&&&&
onemoreforgoodmeasureIthinkyougetit%%%%%%%%%%
Alternative print() call for Python 3.5 and higher:
print(*a, *b, sep='')
Here is my proposed solution for this with regex:
import re
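def move_nonalpha(string, char):
    # escape the character so regex metacharacters such as * are taken literally
    pattern = re.escape(char)
    char_list = re.findall(pattern, string)
    items = re.split(pattern, string)
    # text pieces first, then every occurrence of char (unchanged if char is absent)
    return ''.join(items) + ''.join(char_list)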
Usage:
string = "hello*there*this*is*a*str*ing*with*asterisks"
print (move_nonalpha(string,"*"))
Gives me output:
hellotherethisisastringwithasterisks********
I tried with your other input patterns as well and it's working. Hope it'll help.
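If the special character is not known in advance, the same idea can be generalized with a negated character class. The following is a sketch under the assumption that "alphanumeric" means ASCII letters and digits; the function name move_nonalnum_to_end is invented for this example.

```python
import re

def move_nonalnum_to_end(text):
    """Move every character that is not an ASCII letter or digit to the end."""
    specials = re.findall(r"[^A-Za-z0-9]", text)  # special characters, in order
    kept = re.sub(r"[^A-Za-z0-9]", "", text)      # text with them removed
    return kept + "".join(specials)

print(move_nonalnum_to_end("hello*there*this*is*a*str*ing*with*asterisks"))
# hellotherethisisastringwithasterisks********
```

Unlike the single-character version, this handles strings that mix several different special characters, keeping them in their original order at the end.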
I would like to construct a reg expression pattern for the following string, and use Python to extract:
str = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
What I want to do is extract the independent number values and add them up, which should give 278. My preliminary Python code is:
import re
x = re.findall('([0-9]+)', str)
The problem with the above code is that numbers within a char substring like 'ar3' would show up. Any idea how to solve this?
Why not try something simpler like this?:
str = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
print(sum(int(s) for s in str.split() if s.isdigit()))
# 278
s = re.findall(r"\s\d+\s", str)  # \s matches the whitespace before and after the number
print(sum(map(int, s)))  # int() ignores the surrounding whitespace; prints the sum
\d+ matches one or more digits. This gives the exact expected output.
278
How about this?
x = re.findall(r'\s([0-9]+)\s', str)
The solutions posted so far only work (if at all) for numbers that are preceded and followed by whitespace. They will fail if a number occurs at the very start or end of the string, or if a number appears at the end of a sentence, for example. This can be avoided using word boundary anchors:
a = "100 bottles of beer on the wall (ignore the 1000s!), now 99, now only 98"
s = re.findall(r"\b\d+\b", a)  # \b matches at the start/end of an alphanumeric sequence
print(sum(map(int, s)))
Result: 297
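Applied to the string from the question, the word-boundary approach could look like the sketch below. The extra "rated 3.5 stars" inputs are made up here to show one caveat: \b also treats punctuation as a boundary, so a whitespace-only lookaround is shown as a possible alternative.

```python
import re

text = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"

standalone = re.findall(r"\b\d+\b", text)
print(standalone, sum(map(int, standalone)))  # ['34', '44', '200'] 278

# Caveat: a dot counts as a word boundary, so decimals are split in two
print(re.findall(r"\b\d+\b", "rated 3.5 stars"))  # ['3', '5']

# Alternative: require whitespace (or string start/end) on both sides
print(re.findall(r"(?<!\S)\d+(?!\S)", "rated 3.5 stars, 12 votes"))  # ['12']
```

Which variant fits depends on whether numbers next to punctuation, such as "99," should count.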
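To avoid a partial match, anchor the pattern at both ends so that it only matches a string consisting entirely of digits, and apply it to each whitespace-separated token rather than to the whole text (use + instead of * if an empty string should not match):
'^[0-9]*$'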
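One way to apply a whole-string pattern like this to the problem is token by token. The sketch below uses re.fullmatch, which behaves like anchoring the pattern at both ends, and [0-9]+ rather than [0-9]* so that at least one digit is required.

```python
import re

text = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"

# Keep only the tokens that consist entirely of digits, then add them up
total = sum(int(tok) for tok in text.split() if re.fullmatch(r"[0-9]+", tok))
print(total)  # 278
```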
I have strings in the format feet'-inches" (e.g. 18'-6") and I want to split them so that the values of the feet and inches are separated.
I have tried:
re.split(r'\s|-', `18'-6`)
but it still returns 18'-6.
Desired output: [18,6] or similar
Thanks!
Just remove the ' and then split on the -:
s="18'-6"
a, b = s.replace("'","").split("-")
print(a,b)
If the string contains both " and ', one of them must be escaped in the literal. In that case, split on the - and drop the last character of each part:
s = "18'-6\""
a, b = s.split("-")
print(a[:-1], b[:-1])
18 6
You can use
import re
p = re.compile(r'[-\'"]')
test_str = "18'-6\""
print(list(filter(None, p.split(test_str))))
Output:
['18', '6']
Ideone demo
A list comprehension will do the trick:
In [13]: [int(i[:-1]) for i in re.split(r'\s|-', "18'-6\"")]
Out[13]: [18, 6]
This assumes that your string is of the format feet(int)'-inches(int)", and you are trying to get the actual ints back, not just numbers in string format.
The built-in str.split method accepts a separator argument and splits the string at each occurrence of it.
"18'-16\"".replace("'", "").replace("\"", "").split("-")
A one-liner. :)
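If the input should also be validated, a regex with two capturing groups is another option. This is a sketch only: the function name parse_feet_inches and the choice to make the trailing " optional are assumptions made for this example.

```python
import re

# feet digits, then '-, then inches digits, then an optional trailing "
FEET_INCHES = re.compile(r"""(\d+)'-(\d+)"?""")

def parse_feet_inches(value):
    """Return (feet, inches) as ints, or raise ValueError on bad input."""
    match = FEET_INCHES.fullmatch(value.strip())
    if match is None:
        raise ValueError("not in feet'-inches\" format: %r" % value)
    return int(match.group(1)), int(match.group(2))

print(parse_feet_inches("18'-6\""))  # (18, 6)
```

Compared with the replace-and-split approaches, this rejects malformed strings instead of silently returning odd pieces.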
I asked a question a little while ago (Python splitting unknown string by spaces and parentheses) which worked great until I had to change my way of thinking. I have still not grasped regex so I need some help with this.
If the user types this:
new test (test1 test2 test3) test "test5 test6"
I would like the result stored in the variable to look like this:
["new", "test", "test1 test2 test3", "test", "test5 test6"]
In other words, if it is one word separated by a space, then split it from the next word; if it is in parentheses, then keep the whole group of words in the parentheses together and remove the parentheses. The same goes for the quotation marks.
I currently am using this code which does not meet the above standard (From the answers in the link above):
>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]',strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
This works well but there is a problem, if you have this:
strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"
It combines the Hello and Test as one split instead of two.
It also doesn't allow the use of parentheses and quotation marks splitting at the same time.
The answer was simply:
re.findall(r'\[[^\]]*\]|\([^\)]*\)|"[^"]*"|\S+', strs)
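Note that a findall over whole alternatives keeps the surrounding parentheses and quotation marks in the results. To get the output requested in the question, with the delimiters removed, one possible variation puts a capturing group inside each alternative. This is a sketch; the names TOKEN and tokenize are invented here, and it handles only parentheses and double quotes.

```python
import re

# One capturing group per alternative; only the group that matched is non-empty
TOKEN = re.compile(r'\(([^)]*)\)|"([^"]*)"|(\S+)')

def tokenize(text):
    # findall returns one tuple per match, with '' for groups that did not take part
    return ["".join(groups) for groups in TOKEN.findall(text)]

print(tokenize('new test (test1 test2 test3) test "test5 test6"'))
# ['new', 'test', 'test1 test2 test3', 'test', 'test5 test6']
```

A delimiter that directly follows other text without a space, as in abc(def), would be swallowed by the \S+ alternative, so this assumes groups are separated from words by whitespace.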
This is pushing what regexps can do. Consider using pyparsing instead. It does recursive descent. For this task, you could use:
from pyparsing import *
import string, re
RawWord = Word(re.sub('[()" ]', '', string.printable))
Token = Forward()
Token << (RawWord |
          Group('"' + OneOrMore(RawWord) + '"') |
          Group('(' + OneOrMore(Token) + ')'))
Phrase = ZeroOrMore(Token)
s = 'new test (test1 test2 test3) test "test5 test6"'  # sample input from the question
Phrase.parseString(s, parseAll=True)
This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
I realize you've long since solved your problem, but this is one of the highest google-ranked pages for problems like this, and pyparsing is an under-known library.
Your problem is not well defined.
Your description of the rules is
In other words if it is one word separated by a space then split it
from the next word, if it is in parentheses then split the whole group
of words in the parentheses and remove them. Same goes for the commas.
I guess with commas you mean inverted commas == quotation marks.
Then with this
strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
you should get that
["Hello (Test1 test2) (Hello1 hello2) other_stuff"]
since everything is surrounded by inverted commas. Most probably, you want to ignore the outermost inverted commas.
I propose this, although it is a bit ugly:
import re, itertools
strs = input("enter a string list ")
# split on (...) first, then split each piece on "..."; drop empty strings
print([y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                    for x in re.split(r'\((.*)\)', strs)])
       if y != ''])
gets
>>>
enter a string list here there (x y ) thereagain "there there"
['here there ', 'x y ', ' thereagain ', 'there there']
This is doing what you expect
import re, itertools
strs = input("enter a string list ")
res1 = [y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                     for x in re.split(r'\((.*)\)', strs)])
        if y != '']
# contents of the quoted and the parenthesized group (both must be present)
set1 = re.search(r'\"(.*)\"', strs).groups()
set2 = re.search(r'\((.*)\)', strs).groups()
# grouped pieces are kept whole; everything else is split on whitespace
print([k for k in res1 if k in set1 or k in set2]
      + list(itertools.chain(*[k.split() for k in res1
                               if k not in set1 and k not in set2])))
For python 3.6 - 3.8
I had a similar question, but I did not like any of those answers, maybe because most of them are from 2013. So I worked out my own solution.
import re
regex = r'\(.+?\)|".+?"|\w+'
test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'
result = re.findall(regex, test)
Here you are looking for three different groups:
Something enclosed in (); the parentheses must be escaped with backslashes in the pattern
Something enclosed in ""
Plain words
The ? after + makes the match lazy instead of greedy, so each match stops at the nearest closing delimiter
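The effect of the lazy quantifier can be seen by comparing it with the greedy form on the same test string. This is a small sketch reusing the pattern and input from the answer above.

```python
import re

test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'

lazy = re.findall(r'\(.+?\)', test)   # stops at the first closing parenthesis
greedy = re.findall(r'\(.+\)', test)  # runs on to the last closing parenthesis

print(lazy)    # ['(Test1 test2)', '(Hello1 hello2)']
print(greedy)  # ['(Test1 test2) (Hello1 hello2)']

# The full pattern keeps the parentheses in its matches
print(re.findall(r'\(.+?\)|".+?"|\w+', test))
# ['Hello', 'Test', '(Test1 test2)', '(Hello1 hello2)', 'other_stuff']
```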