I have a string of three comma-separated values, each made of three lowercase letters, and I want to extract them with Python's re.match function.
My input is ' sdh , ash, vbn'. I want to capture all three values while skipping the whitespace and the commas, but I don't get the expected output. Instead I get this tuple: (',vbn',). The expression is: re.match('^[a-z]{3}((?:,?)[a-z]{3})*')
You might just match 3 characters surrounded by word boundaries:
import re
csvText = ' sdh , ash, vbn'
matches = re.findall(r'\b\w{3}\b', csvText)
print(matches)
import re
inp = ' sdh , ash, vbn'
# Strip the spaces first, then capture the three comma-separated words
m = re.match(r'(\w+),(\w+),(\w+)', inp.replace(" ", ""))
if m:
    print(m.groups())
This regex matches every run of characters that are neither whitespace nor commas:
import re
line = ' sdh , ash, vbn'
print(re.findall(r'[^\s,]+', line))
Prints:
['sdh', 'ash', 'vbn']
If you want to use match, you might use:
\s*([a-z]{3})\s*,\s*([a-z]{3})\s*,\s*([a-z]{3})\s*
This matches zero or more whitespace characters (\s*), then captures three lowercase letters in a group ([a-z]{3}), followed by zero or more whitespace characters (\s*) and a comma. That sequence is repeated for the first two sets of three characters. The last set is not followed by a comma.
import re
match = re.match(r'\s*([a-z]{3})\s*,\s*([a-z]{3})\s*,\s*([a-z]{3})\s*', ' sdh , ash, vbn')
if match:
    print(match.groups())
Result:
('sdh', 'ash', 'vbn')
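As for why the original attempt returned only (',vbn',): when a capturing group is repeated, Python's re keeps only the text from its last iteration. The sketch below reproduces this on a whitespace-free copy of the input and then shows a plain split-and-strip alternative that avoids regex altogether. The variable names are illustrative and not from the question.

```python
import re

text = ' sdh , ash, vbn'
compact = text.replace(' ', '')  # 'sdh,ash,vbn'

# The group ((?:,?)[a-z]{3})* runs twice here, but only its last
# iteration is available through groups().
m = re.match(r'^[a-z]{3}((?:,?)[a-z]{3})*', compact)
last_only = m.groups()
print(last_only)

# Alternative without regex: split on commas and strip whitespace.
fields = [part.strip() for part in text.split(',')]
print(fields)
```

If all values are needed from a regex, re.findall or one explicit group per value (as in the answers above) sidesteps the repeated-group limitation.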
I have strings in a pandas column that contain several spaces followed by commas. This is how the strings are organized:
original_string = "okay, , , , humans"
I want to remove the spaces and the subsequent commas so that the string will be:
goodstring = "okay,humans"
But when I use this regex pattern: [\s,]+ what I get is different. I get
badstring = "okayhumans".
It removes the comma after okay but I want it to be like in goodstring.
How can I do that?
Replace:
[\s,]*,[\s,]*
With:
,
[\s,]* - 0+ leading whitespace characters or commas;
, - A literal comma (ensures we don't replace a single space);
[\s,]* - 0+ trailing whitespace characters or commas.
In Pandas, this would translate to something like:
df[<YourColumn>].str.replace(r'[\s,]*,[\s,]*', ',', regex=True)
You have two issues with your code:
Since [\s,]+ matches any combination of spaces and commas (e.g. single comma ,) you should not remove the match but replace it with ','
[\s,]+ matches any combination of spaces and commas, e.g. just a space ' '; it is not what we are looking for, we must be sure that at least one comma is present in the match.
Code:
import re
text = 'okay, , ,,,, humans! A,B,C'
result = re.sub(r'\s*,[\s,]*', ',', text)
Pattern:
\s* - zero or more (leading) whitespaces
, - comma (we must be sure that we have at least one comma in a match)
[\s,]* - arbitrary combination of spaces and commas
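To see why requiring a literal comma matters, the sketch below compares the pattern above with the bare [\s,]+ class on an input that contains an ordinary space between words. The helper name collapse_commas and the second sample string are made up for illustration.

```python
import re

def collapse_commas(text):
    # Require at least one comma in every match, so lone spaces survive.
    return re.sub(r'\s*,[\s,]*', ',', text)

original_string = "okay, , , , humans"
with_spaces = "two words, , next"

print(collapse_commas(original_string))
print(collapse_commas(with_spaces))

# For comparison: the bare class also replaces the space inside "two words".
naive = re.sub(r'[\s,]+', ',', with_spaces)
print(naive)
```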
Please try this:
re.sub(r'[,\s+,]+', ',', original_string)
The idea is to replace each run of commas and spaces, such as ", ,", with a single ",".
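A caveat about that character class, shown as a small sketch: inside [...], + is a literal plus sign and the repeated comma is redundant, so [,\s+,]+ behaves like [\s,+]+. It gives the desired result on the original string, but it would also turn lone spaces and plus signs into commas. The second sample string is invented for the demonstration.

```python
import re

pattern = r'[,\s+,]+'  # same set of characters as r'[\s,+]+'

on_original = re.sub(pattern, ',', "okay, , , , humans")
on_other = re.sub(pattern, ',', "a b+c")

print(on_original)  # okay,humans
print(on_other)     # a,b,c
```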
You could use substitution:
import re
pattern = r'[\s,]+'
original_string = "okay, , , , humans"
print(re.sub(pattern, ',', original_string))  # okay,humans
I want to split the string: "3quartos2suítes3banheiros126m²"
in this format using python:
3 quartos
2 suítes
3 banheiros
126m²
Is there a built-in function I can use? How can I do this?
You can do this using regular expressions, specifically re.findall()
s = "3quartos2suítes3banheiros126m²"
matches = re.findall(r"[\d,]+[^\d]+", s)
gives a list containing:
['3quartos', '2suítes', '3banheiros', '126m²']
Regex explanation:
[\d,]+ : Match a digit, or a comma one or more times
[^\d]+ : Match a non-digit one or more times
Then, add a space after the digits using re.sub():
result = []
for m in matches:
    result.append(re.sub(r"([\d,]+)", r"\1 ", m))
which makes result =
['3 quartos', '2 suítes', '3 banheiros', '126 m²']
Note that this also adds a space between 126 and m², unlike the requested format, because the substitution treats every number the same way.
Explanation:
Pattern :
r"([\d,]+)" : Match a digit or a comma one or more times, capture this match as a group
Replace with:
r"\1 " : The first captured group, followed by a space
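If the area should stay as 126m², one option is to capture the number and the label separately and decide per label whether to insert the space. This is only a sketch: it assumes m² is the sole label that should stay attached to its number, and the names parts and no_space_units are illustrative.

```python
import re

s = "3quartos2suítes3banheiros126m²"
no_space_units = {"m²"}  # assumption: only this label stays glued to its number

parts = []
# Each match yields a (digits, non-digits) pair.
for number, label in re.findall(r"(\d+)(\D+)", s):
    separator = "" if label in no_space_units else " "
    parts.append(f"{number}{separator}{label}")

print(parts)
```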
I have a DataFrame with list of strings as below
df
text
,info_concern_blue,replaced_mod,replaced_rad
,info_concern,info_concern_red,replaced_unit
,replaced_link
I want to remove everything that follows info_concern up to the next comma, so that, for example, info_concern_blue and info_concern_red both become info_concern.
I tried the following regex:
df['replaced_text'] = [re.sub(r'info_concern[^,]*.+?,', 'info_concern,', x)
                       for x in df['text']]
But this is giving me incorrect results.
Desired output:
replaced_text
,info_concern,replaced_mod,replaced_rad
,info_concern,info_concern,replaced_unit
,replaced_link
Please suggest/advise.
You can use
df['replaced_text'] = df['text'].str.replace(r'(info_concern)[^,]*', r'\1', regex=True)
If you want to make sure the match starts right after a comma or start of string, add the (?<![^,]) lookbehind at the start of the pattern:
df['replaced_text'] = df['text'].str.replace(r'(?<![^,])(info_concern)[^,]*', r'\1', regex=True)
Details:
(?<![^,]) - right before, there should be either , or start of string
(info_concern) - Group 1: info_concern string
[^,]* - zero or more chars other than a comma.
The \1 replacement replaces the match with Group 1 value.
The issue is that in the pattern info_concern[^,]*.+?, the [^,]* part matches up to, but not including, the first comma.
Then the .+?, part matches at least one character (which can itself be a comma, because . matches any character except a newline) and continues up to the next comma.
So if there is a second comma, the pattern overmatches and removes too much.
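The overmatch is easy to reproduce with plain re.sub on one of the rows from the question. This sketch contrasts the original pattern with a capture-group version that removes only the non-comma characters after info_concern. The variable names are illustrative.

```python
import re

row = ",info_concern,info_concern_red,replaced_unit"

# Original attempt: .+?, runs past the first comma and swallows the next field.
overmatched = re.sub(r'info_concern[^,]*.+?,', 'info_concern,', row)

# Fix: keep info_concern and drop only the non-comma characters after it.
fixed = re.sub(r'(info_concern)[^,]*', r'\1', row)

print(overmatched)
print(fixed)
```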
You could also assert info_concern to the left with a lookbehind, match any characters other than a comma, and replace that match with an empty string.
If there has to be a comma to the right, you can assert that as well with a lookahead.
(?<=\binfo_concern)[^,]*(?=,)
The pattern matches:
(?<=\binfo_concern) Positive lookbehind, assert info_concern to the left
[^,]* Match 0+ times any char except ,
(?=,) Positive lookahead, assert , directly to the right
If the comma is not mandatory, you can omit the lookahead
(?<=\binfo_concern)[^,]*
For example
import pandas as pd
texts = [
",info_concern_blue,replaced_mod,replaced_rad",
",info_concern,info_concern_red,replaced_unit",
",replaced_link"
]
df = pd.DataFrame(texts, columns=["text"])
df['replaced_text'] = df['text'].str.replace(r'(?<=\binfo_concern)[^,]*(?=,)', '', regex=True)
print(df)
Output
text replaced_text
0 ,info_concern_blue,replaced_mod,replaced_rad ,info_concern,replaced_mod,replaced_rad
1 ,info_concern,info_concern_red,replaced_unit ,info_concern,info_concern,replaced_unit
2 ,replaced_link ,replaced_link
I want to remove all occurrences of dots separated by single characters, I also want to replace all occurrences of dots separated by more than one consecutive character with a space (if one side has len > 1 char).
For example. Given a string,
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
After processing the output should look like:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
Notice that in the case of A.B.C.D.E., all dots are removed (this should be true for when there is no trailing dot also)
Notice that in the case of K.L.M.NO, the first two dots are removed, the last one is replaced with a space (because NO is not a single character)
Notice that in the case of PQ.R.S, the first dot is replaced with a space, the second dot is removed.
I almost have a working solution:
re.sub(r'(?<!\w)([A-Z])\.', r'\1', s)
But in the example given, T.U.VWXYZ gets translated to TUVWXYZ, whereas it should be TU VWXYZ
Note: it's not important for this to be solved with a single regex, or even regex at all for that matter.
Edit: changed PQ.RS to PQ.R.S in the example string.
I'd take two steps.
replace (\b[A-Z])\.(?=[A-Z]\b|\s|$) with r'\1'
replace (\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,}) with r'\1\2 '
Sample
import re
re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
r = re2.sub(r'\1\2 ', re1.sub(r'\1', s)).strip()
print(r)
outputs
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
which matches your desired result:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
re1 matches all dots that are preceded by a free-standing letter and followed by either another free-standing letter, or whitespace, or the end of the string.
re2 matches all dots that are preceded by at least 2 letters and followed by at least 1 letter (or preceded by a single letter and followed by at least 2).
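As a quick check, here is the same two-step approach wrapped in a function and run against the edited example string from the question, which contains PQ.R.S instead of PQ.RS. The function name collapse_dots is made up for this sketch.

```python
import re

re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')

def collapse_dots(text):
    # Step 1: drop dots that follow a free-standing letter and precede
    # another free-standing letter, whitespace, or the end of the string.
    step1 = re1.sub(r'\1', text)
    # Step 2: turn the remaining letter-to-letter dots into spaces.
    # An unmatched group in the template expands to '' (Python 3.5+).
    return re2.sub(r'\1\2 ', step1).strip()

s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print(collapse_dots(s))
```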
You can first replace all dots followed by two characters by spaces, and then remove the remaining dots:
re.sub(r'\.([A-Z]{2})', r' \1', s).replace(".", "")
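On the original example (with PQ.RS) this gives " ABCDE FGH IJ KLM NO PQ RS TU VWXYZ"; the leading space remains, so call .strip() if it is unwanted. With the edited input, where PQ.RS became PQ.R.S, this approach yields PQRS instead of PQ RS.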
Hopefully this is slightly neater:
import re
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
s = re.sub(r"\.(\w{2})", r" \1", s)
s = re.sub(r"(\w{2})\.(\w)", r"\1 \2", s)
s = re.sub(r"\.", "",s)
s = s.strip()
print(s)
You can use a single regex solution if you consider using a dynamic replacement:
import re
rx = r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.'
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print( re.sub(rx, lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s.strip()) )
# => ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
The regex matches:
\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?) - a word boundary, then Group 1 (that will be replaced with itself after stripping off all periods) capturing:
[A-Z] - an uppercase ASCII letter
(?:\.[A-Z])+ - one or more sequences of a dot and an uppercase ASCII letter
\b - word boundary
(?:\.(?![A-Z]))? - an optional sequence of . that is not followed with an uppercase ASCII letter
| - or
\. - a . in any other context (it will be replaced with a space).
The lambda x: x.group(1).replace('.', '') if x.group(1) else ' ' replacement means that if Group 1 matches, the replacement string is Group 1 value without dots, and if Group 1 does not match the replacement is a single regular space.
I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is to skip the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
Simply put, if a line starts with a ':', I want to drop that ':'.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I can't see my mistake. Could anyone help me understand where I am going wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
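If the goal is simply to drop a colon that starts any line, a multiline anchor may be enough. The sketch below is an alternative to the pattern above, not a restatement of it. It assumes the string uses ordinary newlines and that any leading spaces or tabs before the colon should be kept. The sample string is a simplified version of the one in the question.

```python
import re

st = "emp:firstinfo\n  :secondinfo\nthirdinfo\n"

# With re.MULTILINE, ^ matches at the start of every line.
# Group 1 holds any leading spaces/tabs so they can be put back.
cleaned = re.sub(r'^([ \t]*):', r'\1', st, flags=re.MULTILINE)

print(cleaned)
```

The colon in emp:firstinfo is left alone because it is not at the start of its line.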
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
# Import the regex library
import re
# Remove every lowercase ASCII letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)