How do you set a variable number of regex expressions? - python

Currently I have out = re.sub(r'[0-9][0-9][0-9]', '', input). I would like to have a variable number of [0-9]'s.
So far I have:
string = ''
for i in xrange(numlen):
    string = string + '[0-9]'
string = 'r' + string
out = re.sub(string, '', input)
This doesn't work, and I've tried using re.compile, but haven't had any luck. Is there a better way of doing this? Or am I just missing something trivial?

You can specify repetition using {}, for example 3 digits would be
[0-9]{3}
So you can do something like
reps = 5 # or whatever value you'd like
out = re.sub('[0-9]{{{}}}'.format(reps), '', input)
Or if you don't know how many digits there will be
out = re.sub('[0-9]+', '', input)
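As for why the attempt in the question fails: the r prefix is part of Python's string-literal syntax, not part of the string's contents, so concatenating 'r' onto the pattern just adds a literal letter r to the regex. Here is a minimal sketch of building the pattern from a variable; the helper name remove_digit_runs and the sample inputs are made up for illustration.

```python
import re

def remove_digit_runs(text, count):
    """Remove each non-overlapping group of `count` consecutive digits (hypothetical helper)."""
    pattern = r'[0-9]{%d}' % count  # e.g. count=3 gives '[0-9]{3}'
    return re.sub(pattern, '', text)

# 'r' + '[0-9]' is the pattern 'r[0-9]': a literal "r" followed by one digit
glued = 'r' + '[0-9]'

print(remove_digit_runs('ab123cd4567', 3))  # abcd7
print(repr(re.sub(glued, '', 'r5 x5')))     # ' x5'
```

Note that '4567' loses only its first three digits here, because {3} matches groups of three; use [0-9]+ if whole runs should go.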

Use the + quantifier, which matches one or more occurrences of a digit:
out = re.sub(r'[0-9]+', '', input)
See how the regex matches http://regex101.com/r/cE6yS6/1
For example
>>> import re
>>> word="hello 123"
>>> out = re.sub(r'[0-9]+', '', word)
>>> word
'hello 123'
>>> out
'hello '

Related

Remove multiple occurrence in a row from a certain character in a string

I want to collapse multiple '.' in a row into a single '.' in Python.
This is the code I came up with:
# remove multiple occurrences of '.'
string = "FOO...BAR......FOO..BAR.FOO"
last_char = None
new_string = ""
for char in string:
    if not (last_char == '.' and char == '.'):
        new_string += char
    last_char = char
string = new_string
print(string)
It does what I want (output below), but I think there has to be a more elegant way to do this.
FOO.BAR.FOO.BAR.FOO
That's easier to do with a regex (re module), using the pattern "\.+", which means:
a literal dot \.
repeated one or more times +
import re
string = "FOO...BAR......FOO..BAR.FOO"
string = re.sub(r"\.+", '.', string)
print(string) # FOO.BAR.FOO.BAR.FOO
Or keep replacing two dots with one until no two dots are adjacent:
string = "FOO...BAR......FOO..BAR.FOO"
while string.find('..') >= 0:
    string = string.replace("..", ".")
print(string) # FOO.BAR.FOO.BAR.FOO
string = "FOO...BAR......FOO..BAR.FOO"
print (".".join([x for x in string.split(".") if x]))
Output:
FOO.BAR.FOO.BAR.FOO
You split at the given character, drop the empty list elements, and join what is left back into a string with the same character as the delimiter. So every run of the given character is replaced by a single one.
Step by step:
string = "FOO...BAR......FOO..BAR.FOO"
print ([x for x in string.split(".")])
print ([x for x in string.split(".") if x])
print (".".join([x for x in string.split(".") if x]))
Output:
['FOO', '', '', 'BAR', '', '', '', '', '', 'FOO', '', 'BAR', 'FOO']
['FOO', 'BAR', 'FOO', 'BAR', 'FOO']
FOO.BAR.FOO.BAR.FOO
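The regex and split/join approaches above can be generalized to any single character. The sketch below is only an illustration: the helper names collapse_runs and split_join are made up, and it assumes the character to collapse is exactly one character long. It also shows one difference worth knowing: split/join drops leading and trailing occurrences, while the regex keeps one of each.

```python
import re

def collapse_runs(text, char):
    """Collapse each run of `char` into a single `char` (hypothetical helper)."""
    # re.escape makes characters such as '.' literal; the lambda avoids
    # backslash processing in the replacement string
    return re.sub(re.escape(char) + '+', lambda m: char, text)

def split_join(text, char):
    """The split/join approach from the answer above, for comparison."""
    return char.join(part for part in text.split(char) if part)

print(collapse_runs("FOO...BAR......FOO..BAR.FOO", "."))  # FOO.BAR.FOO.BAR.FOO
print(collapse_runs("..FOO..BAR..", "."))                 # .FOO.BAR.
print(split_join("..FOO..BAR..", "."))                    # FOO.BAR
```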

Splitting a string with numbers and letters [duplicate]

I'd like to split strings like these
'foofo21'
'bar432'
'foobar12345'
into
['foofo', '21']
['bar', '432']
['foobar', '12345']
Does somebody know an easy and simple way to do this in python?
I would approach this by using re.match in the following way:
import re
match = re.match(r"([a-z]+)([0-9]+)", 'foofo21', re.I)
if match:
    items = match.groups()
    print(items)
>> ('foofo', '21')
def mysplit(s):
    head = s.rstrip('0123456789')
    tail = s[len(head):]
    return head, tail
>>> [mysplit(s) for s in ['foofo21', 'bar432', 'foobar12345']]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
Yet Another Option:
>>> [re.split(r'(\d+)', s) for s in ('foofo21', 'bar432', 'foobar12345')]
[['foofo', '21', ''], ['bar', '432', ''], ['foobar', '12345', '']]
>>> r = re.compile("([a-zA-Z]+)([0-9]+)")
>>> m = r.match("foobar12345")
>>> m.group(1)
'foobar'
>>> m.group(2)
'12345'
So, if you have a list of strings with that format:
import re
r = re.compile("([a-zA-Z]+)([0-9]+)")
strings = ['foofo21', 'bar432', 'foobar12345']
print([r.match(string).groups() for string in strings])
Output:
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
I'm always the one to bring up findall() =)
>>> strings = ['foofo21', 'bar432', 'foobar12345']
>>> [re.findall(r'(\w+?)(\d+)', s)[0] for s in strings]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
Note that I'm using a simpler (less to type) regex than most of the previous answers.
Here is a simple function to separate multiple words and numbers from a string of any length; the re approaches above only separate the first word and number. I think this will help others in the future:
def seperate_string_number(string):
    previous_character = string[0]
    groups = []
    newword = string[0]
    for x, i in enumerate(string[1:]):
        if i.isalpha() and previous_character.isalpha():
            newword += i
        elif i.isnumeric() and previous_character.isnumeric():
            newword += i
        else:
            groups.append(newword)
            newword = i
        previous_character = i
        if x == len(string) - 2:
            groups.append(newword)
            newword = ''
    return groups
print(seperate_string_number('10in20ft10400bg'))
# outputs : ['10', 'in', '20', 'ft', '10400', 'bg']
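For what it's worth, a regex can also handle any number of alternating groups if you use re.findall with an alternation instead of re.match. This is only a sketch: the helper name split_letters_digits is made up, and it assumes ASCII letters and digits, silently skipping any other characters.

```python
import re

def split_letters_digits(text):
    """Return every run of ASCII letters or of digits, in order (hypothetical helper)."""
    return re.findall(r'[A-Za-z]+|[0-9]+', text)

print(split_letters_digits('10in20ft10400bg'))  # ['10', 'in', '20', 'ft', '10400', 'bg']
print(split_letters_digits('foobar12345'))      # ['foobar', '12345']
```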
import re
s = input()
m = re.match(r"([a-zA-Z]+)([0-9]+)", s)
print(m.group(0))
print(m.group(1))
print(m.group(2))
Without regex, using the built-in str.isdigit() method; this only works if the first part is text and the latter part is a number:
def text_num_split(item):
    for index, letter in enumerate(item, 0):
        if letter.isdigit():
            return [item[:index], item[index:]]
print(text_num_split("foobar12345"))
OUTPUT :
['foobar', '12345']
This is a little longer, but more versatile for cases where there are multiple, randomly placed, numbers in the string. Also, it requires no imports.
def getNumbers( input ):
    # Collect Info
    compile = ""
    complete = []
    for letter in input:
        # If compiled string
        if compile:
            # If compiled and letter are same type, append letter
            if compile.isdigit() == letter.isdigit():
                compile += letter
            # If compiled and letter are different types, append compiled string, and begin with letter
            else:
                complete.append( compile )
                compile = letter
        # If no compiled string, begin with letter
        else:
            compile = letter
    # Append leftover compiled string
    if compile:
        complete.append( compile )
    # Return numbers only
    numbers = [ word for word in complete if word.isdigit() ]
    return numbers
Here is a simple solution for that problem, no need for regex:
user = input('Input: ')  # user = 'foobar12345'
int_list, str_list = [], []
for item in user:
    try:
        item = int(item)  # searching for integers in your string
    except ValueError:
        str_list.append(item)
        string = ''.join(str_list)
    else:  # if there are integers I will add them to int_list, but as str, because join only works with str
        int_list.append(str(item))
        integer = int(''.join(int_list))  # if you want it to be a string, just do z = ''.join(int_list)
final = [string, integer]  # you can also add it to a dictionary: d = {string: integer}
print(final)
In addition to the answer of @Evan:
If the incoming string follows the pattern 21foofo, the re.match pattern would look like this:
import re
match = re.match(r"([0-9]+)([a-z]+)", '21foofo', re.I)
if match:
    items = match.groups()
print(items)
>> ('21', 'foofo')
Otherwise match is None, items is never assigned, and you'll get an UnboundLocalError: local variable 'items' referenced before assignment error (a NameError if the code runs at module level).

split string into sentences everytime there is punctuation, with punctuation?

I would like to split a string into separate sentences in a list.
example:
string = "Hey! How are you today? I am fine."
output should be:
["Hey!", "How are you today?", "I am fine."]
You can use a built-in regular expression library.
import re
string = "Hey! How are you today? I am fine."
output = re.findall(".*?[.!\?]", string)
output>> ['Hey!', ' How are you today?', ' I am fine.']
Update:
You may use the split() method, but it will not return the characters used for splitting. Note that ? has to be escaped, because a bare ? is a quantifier.
import re
string = "Hey! How are you today? I am fine."
output = re.split(r"!|\?", string)
output>> ['Hey', ' How are you today', ' I am fine.']
If this works for you, you can use replace() and split().
string = "Hey! How are you today? I am fine."
output = string.replace("!", "?").split("?")
You can try:
>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']
I found it here
You can use the split() method (the ? must be escaped, otherwise re raises an error):
import re
string = "Hey! How are you today? I am fine."
yourlist = re.split(r"!|\?", string)
You don't need regex for this. Just create your own generator:
def split_punc(text):
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # Alternatively, can use:
    # from string import punctuation
    j = 0
    for i, x in enumerate(text):
        if x in punctuation:
            yield text[j:i+1]
            j = i + 1
    # Yield any trailing text that does not end in punctuation
    if j < len(text):
        yield text[j:]
Usage:
list(split_punc(string))
# ['Hey!', ' How are you today?', ' I am fine.']
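The answers above leave a leading space on every sentence after the first, while the desired output in the question has none. One possible way around that, sketched here with a made-up helper name split_sentences, is to split on the whitespace that follows a punctuation mark using a lookbehind. This assumes sentences are separated by whitespace and it will also split after abbreviations such as "e.g.".

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' and drop the whitespace between sentences (hypothetical helper)."""
    # (?<=[.!?]) is a lookbehind: the split happens at whitespace that follows
    # punctuation, so the punctuation stays attached to its sentence
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(split_sentences("Hey! How are you today? I am fine."))
# ['Hey!', 'How are you today?', 'I am fine.']
```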

String split using regex with pattern present in text

I have many strings that I need to split by commas. Example:
myString = r'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
myString = r'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
My desired output would be:
["test", "Test", "NEAR(this,that,DISTANCE=4)", "test again", """another test"""] #list length = 5
I can't figure out how to keep the commas between "this,that,DISTANCE" in one item. I tried this:
l = re.compile(r',').split(myString) # matches all commas
l = re.compile(r'(?<!\(),(?=\))').split(myString) # (negative lookback/lookforward) - no matches at all
Any ideas? Let's say the list of allowed "functions" is defined as:
f = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
You may use
(?:\([^()]*\)|[^,])+
See the regex demo.
The (?:\([^()]*\)|[^,])+ pattern matches one or more occurrences of any substring between parentheses with no ( and ) in them or any char other than ,.
See the Python demo:
import re
rx = r"(?:\([^()]*\)|[^,])+"
s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(re.findall(rx, s))
# => ['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
If you explicitly want to specify which strings count as functions, you need to build the regex dynamically. Otherwise, go with Wiktor's solution.
>>> import re
>>> functions = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
>>> funcs = '|'.join(r'{}\([^\)]+\)'.format(f) for f in functions)
>>> regex = '({})|,'.format(funcs)
>>>
>>> myString1 = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString1)))
['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
>>> myString2 = 'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString2)))
['test',
'Test',
'FOLLOWEDBY(this,that,DISTANCE=4)',
'test again',
'"another test"']
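Both regexes above handle a single level of parentheses. If the functions could ever be nested, e.g. OR(NEAR(x,y),z), which the question does not say, a small depth counter is one way to split only on top-level commas. This is a rough sketch under that assumption; the name split_top_level is made up, and it does not treat quotes specially.

```python
def split_top_level(text, sep=','):
    """Split on `sep` only where it is not inside parentheses (hypothetical helper)."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)  # tolerate a stray closing parenthesis
        if ch == sep and depth == 0:
            # Top-level separator: close the current item
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(split_top_level(s))
# ['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
print(split_top_level('a,OR(NEAR(x,y),z),b'))
# ['a', 'OR(NEAR(x,y),z)', 'b']
```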

Python Regex Split interacting in a weird way

I'm doing my Formal Languages assignment, and I got in some trouble trying to deal with Python Regex, using regex.split(param)
I've the following text:
{q0,q1,q2,q3},{a,b},q0,{q1,q3}
Which must be split as:
["q0,q1,q2,q3", "a,b", "q0", "q1,q3"]
It is always comma-separated, and it contains alpha-numeric values, which might start with a letter or a number.
To achieve the above separation I created this incredibly long piece of code, dealing with String.join() and Array.split():
[x for x in ' '.join(' '.join(' '.join(args.split(',{')).split('}')).split('{')).split(' ') if x != '']
I tried the following with REGEX, but it simply doesn't work:
re.compile("(,{)|}|{|(},)")
It returns me:
['', None, None, 'q0,q1,q2,q3', None, None, '', ',{', None, 'a,b', None, None, ',q0', ',{', None, 'q1,q3', None, None, '']
It is easy to filter out all these falsy values, but why does it keep things like ,{ in the list?
You can get the desired result at once with a simple re.findall. Optionally repeat word characters followed by commas in a group, then finish with more word characters:
str = '{q0,q1,q2,q3},{a,b},q0,{q1,q3}'
re.findall(r'(?:\w+,)*\w+', str)
Output:
['q0,q1,q2,q3', 'a,b', 'q0', 'q1,q3']
The regex will find anything between the outside commas and then I strip it from curly braces if they exist:
import re
s = '{q0,q1,q2,q3},{a,b},q0,{q1,q3}'
result = [i[1:-1] if i.startswith('{') else i for i in re.findall(r'[^,{]*(?:\{[^{}]*\})*[^,}]*', s) if i]
print(result) # ['q0,q1,q2,q3', 'a,b', 'q0', 'q1,q3']
It will also work for other characters than ASCII letters:
import re
s = '{q0,q1,q2,q3.?!},{a,b},q0,#,{q1,q3}'
result = [i[1:-1] if i.startswith('{') else i for i in re.findall(r'[^,{]*(?:\{[^{}]*\})*[^,}]*', s) if i]
print(result) # ['q0,q1,q2,q3.?!', 'a,b', 'q0', '#', 'q1,q3']
Use the following regex:
import re
s = "{q0,q1,q2,q3},{a,b},q0,{q1,q3}"
m = re.findall(r"\{([A-Za-z0-9_,]+)\}|,([A-Za-z0-9_]+),", s)
if m:
    print(m)
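To answer the "why" part of the question: re.split includes the text of every capturing group in its result, and puts None where a group did not take part in the match. That is where the ,{ and None entries come from. Dropping the parentheses removes them. The sketch below reuses the question's input; the pattern is rewritten with escaped braces and with the two-character delimiters tried first, which is my own rearrangement rather than anything from the answers above.

```python
import re

text = '{q0,q1,q2,q3},{a,b},q0,{q1,q3}'

# With capturing groups: matched delimiters (or None) are returned as well
with_groups = re.split(r'(,\{)|\}|\{|(\},)', text)

# Without groups: delimiters are dropped; only empty strings remain to filter out
without_groups = [part for part in re.split(r',\{|\},|\}|\{', text) if part]

print(without_groups)  # ['q0,q1,q2,q3', 'a,b', 'q0', 'q1,q3']
```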
