I need some help with regular expressions.
I want to match some Roman numerals and replace them with Arabic numbers.
First of all, I use (IX|IV|V?I{0,3}) to match Roman numerals (from 1 to 9).
Then I add some logic to require either whitespace (with some text before or after it) or nothing (start/end of the string) around the numeral: (?:^|\s) before it and (?:\s|$) after it.
So finally I have (?:^|\s)(IX|IV|V?I{0,3})(?:\s|$)
It matches all of these variants:
some text VI
IX here we are
another III text
If I define a dict with a Roman-to-Arabic map, {'iii': 3, 'IX': 9}, how do I replace the matches with values from the dict?
Also, it matches only the first occurrence, i.e. in "some V then III" I get only V.
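I assume you are using re.match or re.search, which return only a single result. re.sub replaces every match, so that stops being an issue, and it also answers your main question: re.sub accepts a callable as the replacement, so each match can be swapped for the corresponding value from your dictionary. The callable must return a string, and your dictionary values are integers, so wrap the lookup in str(). Use
re.sub(your_regex, lambda m: str(your_dict[m.group(1)]), your_string)
This assumes every possible match is a key in your dict. If not, use
re.sub(your_regex, lambda m: str(your_dict.get(m.group(1), m.group(1))), your_string)
Note that your pattern also consumes the whitespace around the numeral, so the replacement removes it. Swapping (?:^|\s) and (?:\s|$) for the lookarounds (?<!\S) and (?!\S) leaves the spaces in place.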
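To make the re.sub approach concrete, here is a minimal sketch. The full ROMAN mapping and the function name roman_to_arabic are my own assumptions (the question only shows two dictionary entries), and I tightened the numeral pattern slightly so it can never match an empty string.

```python
import re

# Hypothetical complete mapping for 1..9; the question only shows two entries.
ROMAN = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5,
         'VI': 6, 'VII': 7, 'VIII': 8, 'IX': 9}

# Lookarounds assert "no non-space character next to the numeral" without
# consuming the surrounding whitespace, so adjacent numerals can both match.
# VI{0,3}|I{1,3} is a variant of V?I{0,3} that cannot match the empty string.
PATTERN = re.compile(r'(?<!\S)(IX|IV|VI{0,3}|I{1,3})(?!\S)')

def roman_to_arabic(text):
    """Replace standalone Roman numerals I..IX with Arabic digits."""
    def repl(match):
        numeral = match.group(1)
        # re.sub needs a string back, so convert the integer value.
        # Unknown numerals are left unchanged.
        return str(ROMAN[numeral]) if numeral in ROMAN else match.group(0)
    return PATTERN.sub(repl, text)

print(roman_to_arabic('some V then III'))  # some 5 then 3
```

Because re.sub visits every non-overlapping match, the "only the first occurrence" problem does not arise here.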
I have a dataframe with a column containing a string (a sentence). This string has many camel-cased abbreviations. There is also a dictionary that maps these abbreviations to their respective long forms.
For Example:
Dictionary: {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}
The dataframe column has text like this (for simplicity, each list entry is one row in the dataframe):
['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']
If I simply do a replace using the dictionary, all replacements are correct except that 'Automatically' is converted to 'Automatikmatically' in the first text.
I tried using a regex in the dictionary keys with this condition: replace the word only if it has a space, the start of the string, or a lowercase letter before it, and a capital letter, a space, or the end of the sentence after it: '(?:^|[a-z])ShFrm(?:[^A-Z]|$)'. But it replaces the characters before and after the abbreviation as well.
Could you please help me modify the regex pattern so that it matches an abbreviation only if it is preceded by a lowercase letter, the start of a word, or a space, and followed by a capital letter, the end of a word, or a space, and so that it replaces only the abbreviation itself and not the characters before and after it?
You need to build an alternation-based regex from the dictionary keys and use a lambda expression as the replacement argument.
See the following Python demo:
import re
d = {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}
col = ['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']
rx = r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format("|".join(d.keys()))
# => (?:\b|(?<=[a-z]))(?:ShFrm|LgFrm|Auto)(?=[A-Z]|\b)
print([re.sub(rx, lambda x: d[x.group()], v) for v in col])
# => ['ShortformLongform should be replaced Automatically', 'Automatik', 'AutomatikLongform']
In Pandas, you would use it like this (here col stands for the name of the column):
df[col] = df[col].str.replace(rx, lambda x: d[x.group()], regex=True)
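If the dictionary keys can contain regex metacharacters, or one key can be a prefix of another, a slightly more defensive way to build the same alternation might look like the sketch below. The helper names build_pattern and expand are hypothetical, and sorting longest-first is my own precaution, not something the data above requires.

```python
import re

def build_pattern(mapping):
    """Compile an alternation of the mapping's keys with the same lookarounds."""
    # Longest keys first, so a longer key wins over a shorter key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape protects keys that contain characters such as '.' or '+'.
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format(alternation))

def expand(text, mapping, pattern=None):
    """Replace every matched abbreviation with its long form."""
    pattern = pattern or build_pattern(mapping)
    return pattern.sub(lambda m: mapping[m.group()], text)

d = {'ShFrm': 'Shortform', 'LgFrm': 'Longform', 'Auto': 'Automatik'}
rows = ['ShFrmLgFrm should be replaced Automatically', 'Auto', 'AutoLgFrm']
print([expand(row, d) for row in rows])
```

For the sample dictionary this gives the same result as the pattern built with a plain join.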
You can use lookaround assertions, which check the text before or after the main expression without including it in the result.
(?<=\b|[a-z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)
That expresses your requirements, but Python's re module only supports fixed-width lookbehind, and \b and [a-z] have different widths, so the pattern above does not compile. We can switch to a negative lookbehind instead:
rx=r"(?<![A-Z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)"
re.findall(rx,"['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']")
Out: ['ShFrm', 'LgFrm', 'Auto', 'Auto', 'LgFrm']
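A quick way to check both claims, as a sketch: the first pattern should be rejected by re.compile, while the negative-lookbehind version compiles and finds the abbreviations row by row. The variable names are mine. One caveat: (?<![A-Z]) only rules out an uppercase letter before the key, so it also accepts a preceding digit or punctuation mark.

```python
import re

# Variable-width lookbehind: \b has width 0 and [a-z] has width 1.
variable_width = r"(?<=\b|[a-z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)"
try:
    re.compile(variable_width)
    compiled = True
except re.error:
    # The standard-library re module rejects lookbehinds of non-fixed width.
    compiled = False

# Fixed-width negative lookbehind: "not preceded by an uppercase letter".
rx = re.compile(r"(?<![A-Z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)")
rows = ['ShFrmLgFrm should be replaced Automatically', 'Auto', 'AutoLgFrm']
found = [rx.findall(row) for row in rows]

print(compiled)
print(found)
```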
This is not for homework!
Hello,
Just a quick question about Regex formatting.
I have a list of different courses.
L = ['CI101', 'CS164', 'ENGL101', 'I-', 'III-', 'MATH116', 'PSY101']
I was looking for a format to find all the words that start with I, or II, or III. Here is what I did. (I used python fyi)
for course in L:
    if re.search("(I?II?III?)*", course):
        L.pop()
I learned that ? in regex means optional. So I was thinking of making I, II, and III optional and * to include whatever follows. However, it seems like it is not working as I intended. What would be a better working format?
Thanks
Here is the regex you should use:
^I{1,3}.*$
^ anchors the match at the start of the string (or of a line, in multiline mode). I{1,3} means I repeated 1 to 3 times. .* matches whatever follows, and $ anchors the end. So this regex matches every word that starts with I, II, or III (strictly speaking, anything that begins with at least one I).
Now look at your regex. First, it has no ^, so it can match anywhere in the string. Second, ? only affects the single character before it, so the first I is optional, the second is required, the third is optional, the fourth and fifth are required, and the sixth is optional. Finally, the parentheses followed by * mean that the whole group can repeat any number of times, including zero. So the pattern matches either no I at all or at least three of them, and because zero repetitions are allowed, re.search succeeds on every string.
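To see this behaviour directly, here is a small sketch (variable names are mine) that prints what the original pattern actually matches for each course, then filters the list with the anchored pattern. It builds a new list rather than calling L.pop() inside the loop, since pop() removes the last element, not the current one.

```python
import re

courses = ['CI101', 'CS164', 'ENGL101', 'I-', 'III-', 'MATH116', 'PSY101']

# The original pattern: the group may repeat zero times, so search() always
# succeeds, usually with an empty match at position 0.
original = re.compile(r"(I?II?III?)*")
matched_text = [original.search(c).group(0) for c in courses]
print(matched_text)  # ['', '', '', '', 'III', '', '']

# Anchored pattern: one to three I's at the very start of the string.
starts_with_i = re.compile(r"^I{1,3}")
kept = [c for c in courses if not starts_with_i.search(c)]
print(kept)  # ['CI101', 'CS164', 'ENGL101', 'MATH116', 'PSY101']
```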
Instead of search() you can use the function match() that matches the pattern at the beginning of string:
import re
l = ['CI101', 'CS164', 'ENGL101', 'I-', 'III-', 'MATH116', 'PSY101']
pattern = re.compile(r'I{1,3}')
[i for i in l if not pattern.match(i)]
# ['CI101', 'CS164', 'ENGL101', 'MATH116', 'PSY101']
I use
re.compile(r"(.+?)\1+").findall('44442(2)2(2)44')
can get
['4','2(2)','4']
, but how can I get
['4444','2(2)2(2)','44']
by using regular expression?
Thanks
No change to your pattern is needed. You just need to use the right function for the job. re.findall returns a list of groups if the pattern contains capturing groups. To get the entire match, use re.finditer instead, so that you can extract the full match from each match object.
pattern = re.compile(r"(.+?)\1+")
[match.group(0) for match in pattern.finditer('44442(2)2(2)44')]
With minimal change to OP's regular expression:
[m[0] for m in re.compile(r"((.+?)\2+)").findall('44442(2)2(2)44')]
findall will give you the full match if there are no groups, or groups if there are some. So given that you need groups for your regexp to work, we simply add another group to encompass the full match, and extract it afterwards.
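Putting the variants side by side, as a short sketch (variable names are mine): findall with one group returns just that group, finditer gives access to the whole match, and an extra outer group makes findall return tuples whose first element is the whole match.

```python
import re

text = '44442(2)2(2)44'
pattern = re.compile(r"(.+?)\1+")

# One capturing group: findall returns only the group's text.
groups_only = pattern.findall(text)

# finditer yields match objects; group(0) is the entire match.
full_matches = [m.group(0) for m in pattern.finditer(text)]

# Extra outer group: findall returns (outer, inner) tuples,
# and the backreference shifts from \1 to \2.
wrapped = [t[0] for t in re.findall(r"((.+?)\2+)", text)]

print(groups_only)   # ['4', '2(2)', '4']
print(full_matches)  # ['4444', '2(2)2(2)', '44']
print(wrapped)       # ['4444', '2(2)2(2)', '44']
```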
You can do:
[i[0] for i in re.findall(r'((\d)(?:[()]*\2*[()]*)*)', s)]
Here the Regex is:
((\d)(?:[()]*\2*[()]*)*)
which will output a list of tuples containing the two captured groups. We are only interested in the first one, hence i[0].
Example:
In [15]: s
Out[15]: '44442(2)2(2)44'
In [16]: [i[0] for i in re.findall(r'((\d)(?:[()]*\2*[()]*)*)', s)]
Out[16]: ['4444', '2(2)2(2)', '44']
I have created a regular expression to match a string that has a "K" preceded by 10 characters and followed by 10 characters.
However, I'm not able to detect every place where a K occurs. I would like to get multiple such strings, one wherever a K is present.
You can use re.findall():
print(re.findall(r'([\w\n]{10}K[\w\n]{10})', s))
result:
['GGKKKTKICDKVSHEEDRISQ', 'ISEILFHLSTKDSVRTSALST', 'FDSHRDSWIRKLRLDLGYHHD', 'HLDVHCFHDNKIPLSIYTCTT', 'PEFVSLP\nCLKIMHFENVSYP', 'ELILFSTMYPKGNVLQLRSDT', 'YAPLLQCLRAKMYSTK\nNFQI', 'DFVNTGGRYQKKKVIEDILID', 'RDLVISSNTWKEFFLYSKSRP', 'MLPTLLESCPKLESLILVMSS']
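Note that re.findall returns non-overlapping matches, so a K that falls inside the window of a previous match is skipped. If a window for every K is wanted, one option (a sketch, not necessarily what the asker needs) is to put the whole pattern inside a lookahead so that no characters are consumed. The sample string, the window width of 2, and the function name windows_around are hypothetical choices to keep the example short, and I use \w where the answer above uses [\w\n].

```python
import re

def windows_around(text, letter='K', width=10):
    """Return every window of `width` chars + letter + `width` chars, overlaps included."""
    # The capturing group sits inside a lookahead, so each match is zero-width
    # and the scan advances one character at a time.
    pattern = r'(?=(\w{{{w}}}{k}\w{{{w}}}))'.format(w=width, k=re.escape(letter))
    return re.findall(pattern, text)

sample = 'ABKCDKEF'  # hypothetical data
plain = re.findall(r'\w{2}K\w{2}', sample)
overlapping = windows_around(sample, width=2)

print(plain)        # ['ABKCD']
print(overlapping)  # ['ABKCD', 'CDKEF']
```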
I am trying to parse a chemical formula that is given to me in unicode in the format C7H19N3
I wish to isolate the position of the first digit after each letter, i.e. 7 is at index 1 and 1 is at index 3. With this I want to insert "sub" in front of the digits.
My first couple of attempts had me looping through the string, trying to isolate the positions of only the first digits, but to no avail.
I think that regular expressions can accomplish this, though I'm quite lost with them.
My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can format it properly.
How about this?
>>> re.sub('(\d+)', 'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'
(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.
Something like this with lookahead and lookbehind:
>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'
This matches the following positions in the string:
C^7H^19N^3 # ^ represents the positions matched by the regex.
Here is one which literally matches the first digit after a letter:
>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'
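For this input the result is the same, but it is perhaps more expressive of the intent. \1 is a shorter form of \g<1>. I also used raw string literals: in a plain string, '\1sub\2' would be read as the control characters \x01 and \x02 rather than as group references, so r'\1sub\2' is needed here.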
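One thing to watch with the letter-based version: [A-Z] only matches an uppercase letter, so a two-letter element symbol such as Cl is not handled. A possible variation, assuming element symbols are one uppercase letter optionally followed by one lowercase letter (the function name add_sub and the CaCl2 example are mine):

```python
import re

def add_sub(formula):
    """Insert 'sub' between an element symbol and the count that follows it."""
    # [A-Z][a-z]? covers symbols like C, H, N as well as Ca, Cl.
    # \d+ takes the whole count, so 'sub' is inserted once per number.
    return re.sub(r'([A-Z][a-z]?)(\d+)', r'\1sub\2', formula)

print(add_sub('C7H19N3'))  # Csub7Hsub19Nsub3
print(add_sub('CaCl2'))    # CaClsub2

# For comparison, the single-letter pattern leaves CaCl2 untouched,
# because the digit follows a lowercase letter.
print(re.sub(r'([A-Z])(\d)', r'\1sub\2', 'CaCl2'))  # CaCl2
```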