Regex to separate Numeric from Alpha - python

I have a bunch of strings:
"10people"
"5cars"
..
How would I split this to?
['10','people']
['5','cars']
It can be any amount of numbers and text.
I'm thinking about writing some sort of regex - however I'm sure there's an easy way to do it in Python.

>>> re.findall('(\d+|[a-zA-Z]+)', '12fgsdfg234jhfq35rjg')
['12', 'fgsdfg', '234', 'jhfq', '35', 'rjg']

Use the regex (\d+)([a-zA-Z]+).
import re
a = ["10people", "5cars"]
[re.match('^(\\d+)([a-zA-Z]+)$', x).groups() for x in a]
Result:
[('10', 'people'), ('5', 'cars')]

>>> re.findall("\d+|[a-zA-Z]+","10people")
['10', 'people']
>>> re.findall("\d+|[a-zA-Z]+","10people5cars")
['10', 'people', '5', 'cars']

In general, a split on /(?<=[0-9])(?=[a-z])|(?<=[a-z])(?=[0-9])/i separates a string that way.

>>> import re
>>> s = '10cars'
>>> m = re.match(r'(\d+)([a-z]+)', s)
>>> print m.group(1)
10
>>> print m.group(2)
cars

If you are like me and goes long loops around to avoid regexpes justbecause they are ugly, here is a non-regex approach:
data = "5people10cars"
numbers = "".join(ch if ch.isdigit() else "\n" for ch in data).split()
names = "".join(ch if not ch.isdigit() else "\n" for ch in data).split()
final = zip (numbers, names)

Piggybacking on jsbueno's idea, using str.translate, followed by split:
import string
allchars = ''.join(chr(i) for i in range(32,256))
digExtractTrans = string.maketrans(allchars, ''.join(ch if ch.isdigit() else ' ' for ch in allchars))
alpExtractTrans = string.maketrans(allchars, ''.join(ch if ch.isalpha() else ' ' for ch in allchars))
data = "5people10cars"
numbers = data.translate(digExtractTrans).split()
names = data.translate(alpExtractTrans).split()
You only need to create the translation tables once, then call translate and split as often as you want.

Related

Splitting a string with numbers and letters [duplicate]

I'd like to split strings like these
'foofo21'
'bar432'
'foobar12345'
into
['foofo', '21']
['bar', '432']
['foobar', '12345']
Does somebody know an easy and simple way to do this in python?
I would approach this by using re.match in the following way:
import re
match = re.match(r"([a-z]+)([0-9]+)", 'foofo21', re.I)
if match:
items = match.groups()
print(items)
>> ("foofo", "21")
def mysplit(s):
head = s.rstrip('0123456789')
tail = s[len(head):]
return head, tail
>>> [mysplit(s) for s in ['foofo21', 'bar432', 'foobar12345']]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
Yet Another Option:
>>> [re.split(r'(\d+)', s) for s in ('foofo21', 'bar432', 'foobar12345')]
[['foofo', '21', ''], ['bar', '432', ''], ['foobar', '12345', '']]
>>> r = re.compile("([a-zA-Z]+)([0-9]+)")
>>> m = r.match("foobar12345")
>>> m.group(1)
'foobar'
>>> m.group(2)
'12345'
So, if you have a list of strings with that format:
import re
r = re.compile("([a-zA-Z]+)([0-9]+)")
strings = ['foofo21', 'bar432', 'foobar12345']
print [r.match(string).groups() for string in strings]
Output:
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
I'm always the one to bring up findall() =)
>>> strings = ['foofo21', 'bar432', 'foobar12345']
>>> [re.findall(r'(\w+?)(\d+)', s)[0] for s in strings]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
Note that I'm using a simpler (less to type) regex than most of the previous answers.
here is a simple function to seperate multiple words and numbers from a string of any length, the re method only seperates first two words and numbers. I think this will help everyone else in the future,
def seperate_string_number(string):
previous_character = string[0]
groups = []
newword = string[0]
for x, i in enumerate(string[1:]):
if i.isalpha() and previous_character.isalpha():
newword += i
elif i.isnumeric() and previous_character.isnumeric():
newword += i
else:
groups.append(newword)
newword = i
previous_character = i
if x == len(string) - 2:
groups.append(newword)
newword = ''
return groups
print(seperate_string_number('10in20ft10400bg'))
# outputs : ['10', 'in', '20', 'ft', '10400', 'bg']
import re
s = raw_input()
m = re.match(r"([a-zA-Z]+)([0-9]+)",s)
print m.group(0)
print m.group(1)
print m.group(2)
without using regex, using isdigit() built-in function, only works if starting part is text and latter part is number
def text_num_split(item):
for index, letter in enumerate(item, 0):
if letter.isdigit():
return [item[:index],item[index:]]
print(text_num_split("foobar12345"))
OUTPUT :
['foobar', '12345']
This is a little longer, but more versatile for cases where there are multiple, randomly placed, numbers in the string. Also, it requires no imports.
def getNumbers( input ):
# Collect Info
compile = ""
complete = []
for letter in input:
# If compiled string
if compile:
# If compiled and letter are same type, append letter
if compile.isdigit() == letter.isdigit():
compile += letter
# If compiled and letter are different types, append compiled string, and begin with letter
else:
complete.append( compile )
compile = letter
# If no compiled string, begin with letter
else:
compile = letter
# Append leftover compiled string
if compile:
complete.append( compile )
# Return numbers only
numbers = [ word for word in complete if word.isdigit() ]
return numbers
Here is simple solution for that problem, no need for regex:
user = input('Input: ') # user = 'foobar12345'
int_list, str_list = [], []
for item in user:
try:
item = int(item) # searching for integers in your string
except:
str_list.append(item)
string = ''.join(str_list)
else: # if there are integers i will add it to int_list but as str, because join function only can work with str
int_list.append(str(item))
integer = int(''.join(int_list)) # if you want it to be string just do z = ''.join(int_list)
final = [string, integer] # you can also add it to dictionary d = {string: integer}
print(final)
In Addition to the answer of #Evan
If the incoming string is in this pattern 21foofo then the re.match pattern would be like this.
import re
match = re.match(r"([0-9]+)([a-z]+)", '21foofo', re.I)
if match:
items = match.groups()
print(items)
>> ("21", "foofo")
Otherwise, you'll get UnboundLocalError: local variable 'items' referenced before assignment error.

How do I split a string at a separator unless that separator is followed by a certain pattern?

I have a Python string
string = aaa1bbb1ccc1ddd
and I want to split it like this
re.split('[split at all occurrences of "1", unless the 1 is followed by a c]', string)
so that the result is
['aaa', 'bbb1ccc', 'ddd']
How do I do this?
Use negative-lookahead with regex and the re module:
>>> string = 'aaa1bbb1ccc1ddd'
>>> import re
>>> re.split(r"1(?!c)", string)
['aaa', 'bbb1ccc', 'ddd']
def split_by_delim_except(s, delim, bar):
escape = '\b'
find = delim + bar
return map(lambda s: s.replace(escape, find),
s.replace(find, escape).split(delim))
split_by_delim_except('aaa1bbb1ccc1ddd', '1', 'c')
Although not as pretty as regex, my following code returns the same result:
string = 'aaa1bbb1ccc1ddd'
Split the string at all instances of '1'
p1 = string.split('1')
Create a new empty list so we can append our desired items to
new_result = []
count = 0
for j in p1:
if j.startswith('c'):
# This removes the previous element from the list and stores it in a variable.
prev_element = new_result.pop(count-1)
prev_one_plus_j = prev_element + '1' + j
new_result.append(prev_one_plus_j)
else:
new_result.append(j)
count += 1
print (new_result)
Output:
['aaa', 'bbb1ccc', 'ddd']

Python re.sub to replace 3 or more of same character with 1

I'm trying to find characters that are repeated 3 times or more, for example I want to take the following strings:
('aaa', 'buuuuut', 'oddddddddd')
and replace all occurrences of three or more of a letter with only one:
('a', 'but', 'od').
I've tried following code
s=re.sub(r'(\w)\3*',r'(\w)',s)
but it results in a compile error.
What regex do I need to use?
Look at this:
>>> mystr = 'buuuuuttttt'
>>> re.sub(r'(.)\1{2,}', r'\1', mystr)
'but'
>>> mystr = 'buttt'
>>> re.sub(r'(.)\1{2,}', r'\1', mystr)
'but'
>>>
>>> s = 'abbcccdddd'
>>> s = re.sub(r'(\w)\1(\1+)',r'\1',s)
>>> s
'abbcd'
Maybe try something like this:
s = re.sub(r'(\w)\1\1+', r'\1', s)

Partitioning elements in list by \t . Python

my_list = ['1\tMelkor\tMorgoth\tSauronAtDolGoldul','2\tThingols\tHeirIsDior\tSilmaril','3\tArkenstone\tIsProbablyA\tSilmaril']
I'm trying to split this list into sublists separated by \t
output = [['1','Melkor','Morgoth','SauronAtDolGoldul'],['2','Thigols','HeirIsDior','Silmaril'],['3','Arkenstone','IsProbablyA','Silmaril']]
I was thinking something on the lines of
output = []
for k_string in my_list:
temp = []
for i in k_string:
temp_s = ''
if i != '\':
temp_s = temp_s + i
elif i == '\':
break
temp.append(temp_s)
it gets messed up with the t . . i'm not sure how else I would go about doing it. I've seen people use .join for similar things but I don't really understand how to use .join
You want to use str.split(); a list comprehension lets you apply this to all elements in one line:
output = [sub.split('\t') for sub in my_list]
There is no literal \ in the string; the \t is an escape code that signifies the tab character.
Demo:
>>> my_list = ['1\tMelkor\tMorgoth\tSauronAtDolGoldul','2\tThingols\tHeirIsDior\tSilmaril','3\tArkenstone\tIsProbablyA\tSilmaril']
>>> [sub.split('\t') for sub in my_list]
[['1', 'Melkor', 'Morgoth', 'SauronAtDolGoldul'], ['2', 'Thingols', 'HeirIsDior', 'Silmaril'], ['3', 'Arkenstone', 'IsProbablyA', 'Silmaril']]
>>> import csv
>>> my_list = ['1\tMelkor\tMorgoth\tSauronAtDolGoldul','2\tThingols\tHeirIsDior\tSilmaril','3\tArkenstone\tIsProbablyA\tSilmaril']
>>> list(csv.reader(my_list, delimiter='\t'))
[['1', 'Melkor', 'Morgoth', 'SauronAtDolGoldul'], ['2', 'Thingols', 'HeirIsDior', 'Silmaril'], ['3', 'Arkenstone', 'IsProbablyA', 'Silmaril']]

Adding 2 dictionary entries from one string

Suppose I have the following code:
new_dict = {}
text = "Yes: No Maybe: So"
I want to split the string up into 2 dictionary elements like so:
new_dict = {'Yes':'No', 'Maybe':'So'}
I tried to split the string up into a list in the same fashion to get a brief idea on how to do it, but I haven't had much success.
text = "Yes: No Maybe: So"
words = [w.rstrip(':') for w in text.split()]
new_dict = dict(zip(words[::2], words[1::2]))
If each colon is followed by a space, str.split() will work fine for you:
tokens = (s.rstrip(":") for s in text.split())
new_dict = dict(zip(tokens, tokens))
>>> import re
>>> text = "Yes: No Maybe: So"
>>> dict(re.findall(r'(\w+): (\w+)', text))
{'Maybe': 'So', 'Yes': 'No'}
or the more efficient:
>>> dict(m.groups() for m in re.finditer(r'(\w+): (\w+)', text))
{'Maybe': 'So', 'Yes': 'No'}

Categories