How to split a string in different words - python

I want to split the string: "3quartos2suítes3banheiros126m²"
in this format using python:
3 quartos
2 suítes
3 banheiros
126m²
Is there a built-in function I can use? How can I do this?

You can do this using regular expressions, specifically re.findall()
import re
s = "3quartos2suítes3banheiros126m²"
matches = re.findall(r"[\d,]+[^\d]+", s)
gives a list containing:
['3quartos', '2suítes', '3banheiros', '126m²']
Regex explanation (Regex101):
[\d,]+ : Match a digit, or a comma one or more times
[^\d]+ : Match a non-digit one or more times
Then, add a space after the digits using re.sub():
result = []
for m in matches:
    result.append(re.sub(r"([\d,]+)", r"\1 ", m))
which makes result =
['3 quartos', '2 suítes', '3 banheiros', '126 m²']
Note that this also puts a space between 126 and m², which differs from the requested 126m²; the pattern cannot tell a unit from a word.
Explanation:
Pattern :
r"([\d,]+)" : Match a digit or a comma one or more times, capture this match as a group
Replace with:
r"\1 " : The first captured group, followed by a space
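The two steps above can also be folded into one pass. This is only a sketch of an alternative, not the answer's own method: it assumes the input is always digits followed by non-digits, and it captures each half in its own group so the space can be added while joining.

```python
import re

s = "3quartos2suítes3banheiros126m²"

# Each match is a (digits, non-digits) pair, e.g. ('3', 'quartos').
pairs = re.findall(r"(\d+)(\D+)", s)

# Join every pair with a single space.
result = [f"{number} {word}" for number, word in pairs]
print(result)  # ['3 quartos', '2 suítes', '3 banheiros', '126 m²']
```

The superscript ² is not matched by \d, so it stays with the m.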


What would be the regex pattern for the following?

I have multiple strings in the following formats:
Example:
A='AB.224-QW-2018'
B='AB.876-5-LS-2018'
C='AB.26-LS-18'
D='AB-123-6-LS-2017'
E='IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L'
F='ZX-ss-12L-AB-123-6-LS-2017-BC-22'
G='AB.224-2018'
H='AB.224/QW/2018'
I='AB/224/2018'
J='AB-10-HDB-231-NCLT-1-2017 AD-42-HH-2019'
K='AB-1-HDB-NCLT-1-2016 AD-42-HH-2020'
L='AB-1-HDB-NCLT-1-2016/(AD-42-HH-2020)'
I want a regex pattern that captures the leading letters (AB), the number that follows them, and the year that appears last.
Some strings contain an extra single digit after the number, such as 876-5 in B and 123-6 in D.
I don't want that single digit after the -.
My code:
d = re.search(r"\D*\d*\D*(AB)\D*(\d+)\D*(20)?(\d{2})\D*\d*\D*", s)
Another attempt:
d = re.search(r"\D*\d*\D*(AB)\D*(\d+)\D*\d?\D*(20)?(\d{2})\D*\d*\D*", s)
Neither attempt works for all of the strings.
Is there a pattern that matches all of them?
I have created groups in the regex pattern and extract them as
d.group(1)+"/"+d.group(2)+"/"+d.group(4). The output below is what I expect from a pattern that matches all of them.
Expected Output
A='AB/224/18'
B='AB/876/18'
C='AB/26/18'
D='AB/123/17'
E='AB/224/18'
F='AB/123/17'
G='AB/224/18'
H='AB/224/18'
I='AB/224/18'
J='AB/10/17'
K='AB/1/16'
L='AB/1/16'
You could use 3 capture groups:
\b(AB)\D*(\d+)\S*?(?:20)?(\d\d)\b
\b A word boundary to prevent a partial word match
(AB) Capture AB in group 1
\D* Match optional non digits
(\d+) Capture 1+ digits in group 2
\S*? Optionally match non-whitespace characters, as few as possible
(?:20)? Optionally match 20
(\d\d) Capture 2 digits in group 3
\b A word boundary
Regex demo
For example using re.finditer which returns Match objects that each hold the group values.
Using enumerate you can loop the matches. Every item in the iteration returns a tuple, where the first value is the count (that you don't need here) and the second value contains the Match object.
import re
pattern = r"\b(AB)\D*(\d+)\S*?(?:20)?(\d\d)\b"
s = ("A='AB.224-QW-2018'\n"
"B='AB.876-5-LS-2018'\n"
"C='AB.26-LS-18'\n"
"D='AB-123-6-LS-2017'\n"
"IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L' F='ZX-ss-12L-AB-123-6-LS-2017-BC-22\n"
"A='AB.224-QW-2018'\n"
"B='AB.876-5-LS-2018'\n"
"C='AB.26-LS-18'\n"
"D='AB-123-6-LS-2017'\n"
"E='IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L'\n"
"F='ZX-ss-12L-AB-123-6-LS-2017-BC-22'\n"
"G='AB.224-2018'\n"
"H='AB.224/QW/2018'\n"
"I='AB/224/2018'")
matches = re.finditer(pattern, s)
for _, m in enumerate(matches, start=1):
    print(m.group(1) + "/" + m.group(2) + "/" + m.group(3))
Output
AB/224/18
AB/876/18
AB/26/18
AB/123/17
AB/224/18
AB/123/17
AB/224/18
AB/876/18
AB/26/18
AB/123/17
AB/224/18
AB/123/17
AB/224/18
AB/224/18
AB/224/18
Can't you just look for the last two digits, irrespective of dashes and "20" prefix? Like
(AB)[.-](\d+).*(\d\d)
I've tested in Sublime Text - works for me, it returns the same output you mentioned as desired.
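A quick way to compare candidate patterns is to run each one against the sample strings and the expected outputs from the question. The sketch below does that for cases A to I, the ones used in the test run above; the names CASES and failures are made up for this illustration.

```python
import re

# label -> (input string, expected output), copied from the question (A to I).
CASES = {
    "A": ("AB.224-QW-2018", "AB/224/18"),
    "B": ("AB.876-5-LS-2018", "AB/876/18"),
    "C": ("AB.26-LS-18", "AB/26/18"),
    "D": ("AB-123-6-LS-2017", "AB/123/17"),
    "E": ("IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L", "AB/224/18"),
    "F": ("ZX-ss-12L-AB-123-6-LS-2017-BC-22", "AB/123/17"),
    "G": ("AB.224-2018", "AB/224/18"),
    "H": ("AB.224/QW/2018", "AB/224/18"),
    "I": ("AB/224/2018", "AB/224/18"),
}

def failures(pattern):
    """Return {label: actual result} for every case that misses its expected output."""
    failed = {}
    for label, (text, expected) in CASES.items():
        m = re.search(pattern, text)
        # Both patterns below have exactly three capture groups.
        actual = "/".join(m.groups()) if m else None
        if actual != expected:
            failed[label] = actual
    return failed

print(failures(r"\b(AB)\D*(\d+)\S*?(?:20)?(\d\d)\b"))
print(failures(r"(AB)[.-](\d+).*(\d\d)"))
```

An empty dictionary means the pattern reproduced every expected value. A greedy .* picks up the last two adjacent digits in the string, so inputs with trailing text such as E are worth checking this way.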

Python regex Get first element after specific string

I'm trying to get the first number (int and float) after a specific pattern:
strings = ["Building 38 House 10",
"Building : 10.5 house 900"]
for x in strings:
    print(<rule>)
Wanted result:
'38'
'10.5'
I tried:
for x in strings:
    print(re.findall(r"(?<=Building).+\d+", x))
    print(re.findall(r"(?<=Building).+(\d+.?\d+)", x))
[' 38 House 10']
['10']
[' : 10.5 house 900']
['00']
But I'm missing something.
You could use a capture group:
\bBuilding[\s:]+(\d+(?:\.\d+)?)\b
Explanation
\bBuilding Match the word Building
[\s:]+ Match 1+ whitespace chars or colons
(\d+(?:\.\d+)?) Capture group 1, match 1+ digits with an optional decimal part
\b A word boundary
Regex demo
import re
strings = ["Building 38 House 10",
"Building : 10.5 house 900"]
pattern = r"\bBuilding[\s:]+(\d+(?:\.\d+)?)"
for x in strings:
    m = re.search(pattern, x)
    if m:
        print(m.group(1))
Output
38
10.5
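If the keyword is not always Building, the same capture-group pattern can be built per keyword. This is a minimal sketch under that assumption; first_number_after is a hypothetical helper name, and re.escape guards against keywords that contain regex metacharacters.

```python
import re

def first_number_after(keyword, text):
    """Return the first integer or decimal after `keyword` as a string, or None."""
    # Same shape as the pattern above, with the keyword escaped and inserted.
    pattern = rf"\b{re.escape(keyword)}[\s:]+(\d+(?:\.\d+)?)"
    m = re.search(pattern, text)
    return m.group(1) if m else None

for line in ["Building 38 House 10", "Building : 10.5 house 900", "House 10"]:
    print(first_number_after("Building", line))
```

The third line has no Building keyword, so the helper returns None instead of raising.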
An idea to use \D (negated \d) to match any non-digits in between and capture the number:
Building\D*\b([\d.]+)
See this demo at regex101 or Python demo at tio.run
Just to mention, use word boundaries \b around Building to match the full word.
re.findall(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+", x)
This will find all numbers in the given string.
If you want the first number only you can access it simply through indexing:
re.findall(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+", x)[0]
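Indexing with [0] raises IndexError when the string contains no number. A possible variant, sketched here with the same pattern, uses re.search, which stops at the first match and returns None when nothing is found; NUMBER and first_number are names chosen for this example.

```python
import re

# Same pattern as above, compiled once for reuse.
NUMBER = re.compile(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+")

def first_number(text):
    """Return the first number in `text` as a string, or None if there is none."""
    m = NUMBER.search(text)
    return m.group(0) if m else None

print(first_number("Building 38 House 10"))
print(first_number("Building : 10.5 house 900"))
print(first_number("no digits here"))
```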

Pattern to extract, expand and form a sentence based on a certain delimiter

I was trying to solve a regex problem:
The input is in one of these forms: Number1,2,3 or Number1/2/3 or Number1-2-3. The three delimiters are , / and -
The expected output is: Number1,Number2,Number3
Pattern I've tried so far:
(?<=,)[^,]+(?=,)
but this misses the edge cases, i.e. the first and last elements. I also cannot get it to work for '/'.
You could separate out the key from values, then use a list comprehension to build the output you want.
import re
inp = "Number1,2,3"
matches = re.search(r'(\D+)(.*)', inp)
# Split on any of the three delimiters: , / -
output = [matches[1] + x for x in re.split(r'[,/-]', matches[2])]
print(output) # ['Number1', 'Number2', 'Number3']
You can do it in several steps: 1) validate the string to match your pattern, and once validated 2) add the first non-digit chunk to the numbers while replacing - and / separator chars with commas:
import re
texts = ['Number1,2,3', 'Number1/2/3', 'Number1-2-3']
for text in texts:
    m = re.search(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$', text)
    if m:
        print( re.sub(r'(?<=,)(?=\d)', m.group(1).replace('\\', '\\\\'), text.replace('/',',').replace('-',',')) )
    else:
        print(f"NO MATCH in '{text}'")
See this Python demo.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
The ^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$ regex validates your three types of input:
^ - start of string
(\D+) - Group 1: one or more non-digits
(\d+(?=([,/-]))(?:\3\d+)*) - Group 2: one or more digits, and then zero or more repetitions of ,, / or - and one or more digits (and the separator chars should be consistent due to the capture used in the positive lookahead and the \3 backreference to that value used in the non-capturing group)
$ - end of string.
The re.sub pattern, (?<=,)(?=\d), matches a location between a comma and a digit, the Group 1 value is placed there (note the .replace('\\', '\\\\') is necessary since the replacement is dynamic).
import re
for text in ("Number1,2,3", "Number1-2-3", "Number1/2/3"):
    print(re.sub(r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)", r"\1\2,\1\3,\1\4", text))
\D+ matches "Number" or any other non-number text
\d+ matches a number (or more than one)
[/,-] matches any of /, ,, -
The rest is copy paste 3 times.
The substitution consists of backreferences to the matched "Number" string (\1) and then each group of the (\d+)s.
This works if you're sure that it's always three numbers divided by that separator. This does not ensure that it's the same separator between each number. But it's short.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
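When the count of numbers is not fixed at three, one option is to validate the whole string first and then split the numeric part. The sketch below assumes that approach; expand is a hypothetical helper, and like the short version above it does not check that the same separator is used throughout.

```python
import re

def expand(text):
    """Turn 'Number1,2,3' (or / or - separated) into 'Number1,Number2,Number3'."""
    # Group 1: non-digit prefix. Group 2: digits separated by , / or -
    m = re.fullmatch(r"(\D+)(\d+(?:[,/-]\d+)*)", text)
    if m is None:
        return None
    prefix, numbers = m.groups()
    return ",".join(prefix + n for n in re.split(r"[,/-]", numbers))

for text in ("Number1,2,3", "Number1/2/3", "Number1-2-3-4"):
    print(expand(text))
```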
If you can make use of the pypi regex module you can use the captures collection with a named capture group.
([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)
([^\d\s,/]+) Capture group 1, match 1+ chars other than the listed
(?<num>\d+) Named capture group num matching 1+ digits
([,/-]) Capture either , / - in group 3
(?<num>\d+) Named capture group num matching 1+ digits
(?:\3(?<num>\d+))* Optionally repeat a backreference to group 3 to keep the separators the same and match 1+ digits in group num
(?!\S) Assert a whitespace boundary to the right to prevent a partial match
Regex demo | Python demo
import regex as re
pattern = r"([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)"
s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
matches = re.finditer(pattern, s)
for _, m in enumerate(matches, start=1):
    print(','.join([m.group(1) + c for c in m.captures("num")]))
Output
Number1,Number2,Number3
Number4,Number5,Number6
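If installing the third-party regex module is not an option, something similar can be approximated with the standard re module. This is a sketch of a substitute technique, not the captures() approach itself: it captures the whole run of numbers in one group and splits that run on the separator captured in group 3.

```python
import re

# Group 1: prefix, group 2: the full run of numbers, group 3: the separator.
pattern = re.compile(r"([^\d\s,/]+)(\d+([,/-])\d+(?:\3\d+)*)(?!\S)")
s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"

results = []
for m in pattern.finditer(s):
    prefix, numbers, sep = m.groups()
    # The backreference \3 guarantees one consistent separator, so a plain split is safe.
    results.append(",".join(prefix + n for n in numbers.split(sep)))

print(results)  # ['Number1,Number2,Number3', 'Number4,Number5,Number6']
```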

python regex comma separated value csv of 3 characters

I have a comma-separated string of three-letter lowercase values, and I want to extract them with the re.match function in Python.
My string is ' sdh , ash, vbn'. I want to capture all three values while skipping the whitespace and the commas. However, I don't get the correct output; I get this as a result: (',vbn',). The expression is: re.match('^[a-z]{3}((?:,?)[a-z]{3})*')
You might just match 3 characters surrounded by word boundaries:
import re
csvText = ' sdh , ash, vbn'
matches = re.findall(r'\b\w{3}\b', csvText)
print(matches)  # ['sdh', 'ash', 'vbn']
import re
inp = ' sdh , ash, vbn'
# Remove the spaces first, then capture the three comma-separated values.
m = re.match(r'(\w+),(\w+),(\w+)', inp.replace(" ", ""))
if m:
    print(m.groups())
This regexp will match all characters but whitespaces and commas:
import re
line = ' sdh , ash, vbn'
print(re.findall(r'[^\s,]+', line))
Prints:
['sdh', 'ash', 'vbn']
If you want to use match, you might use:
\s*([a-z]{3})\s*,\s*([a-z]{3}),\s*([a-z]{3})\s*
That matches zero or more whitespace characters (\s*), captures three lowercase characters in a group (([a-z]{3})), then matches optional whitespace and a comma. This is repeated for the first two sets of three characters (the second set allows no whitespace before its comma); the last set has no comma after it.
import re
match = re.match(r'\s*([a-z]{3})\s*,\s*([a-z]{3}),\s*([a-z]{3})\s*', ' sdh , ash, vbn')
if match:
    print(match.groups())
Result:
('sdh', 'ash', 'vbn')
Demo
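The (',vbn',) result in the question is consistent with how repeated capture groups behave in Python: a group inside a * repetition only keeps its last repetition. The sketch below shows that on a whitespace-free copy of the input, followed by one possible alternative that splits on commas and validates each field; parse_codes is a hypothetical helper name.

```python
import re

# A repeated capture group only remembers its final repetition.
m = re.match(r"^[a-z]{3}((?:,?)[a-z]{3})*", "sdh,ash,vbn")
print(m.groups())  # (',vbn',)

def parse_codes(line):
    """Return the three-letter fields of `line`, or None if any field is malformed."""
    fields = [field.strip() for field in line.split(",")]
    if all(re.fullmatch(r"[a-z]{3}", field) for field in fields):
        return fields
    return None

print(parse_codes(" sdh , ash, vbn"))  # ['sdh', 'ash', 'vbn']
```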

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could split any string on a ',' that has non-numeric characters on both sides, I would be able to parse entries like the second one without splitting up chemicals in entries like the first, whose names contain numbers separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
Here is a short explanation based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The pattern (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means that whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, so a comma that sits directly between two digits, as in 2,4,5-TP, is left alone.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
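Since the question mentions a large list, it may help to compile the pattern once and reuse it. The following is a small sketch along those lines using the same regex; SPLIT_COMMA and split_chemicals are names invented for the example.

```python
import re

# Split on a comma that has a non-digit directly before it or directly after it.
SPLIT_COMMA = re.compile(r"(?<=\D),\s*|\s*,(?=\D)")

def split_chemicals(entry):
    """Split one entry into chemical names, keeping names like 2,4,5-TP intact."""
    return SPLIT_COMMA.split(entry)

print(split_chemicals("2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"))
print(split_chemicals("Lead,Paints/Pigments,Zinc"))
```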
Use regex and lookbehind/lookahead assertion
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
