I have a dataframe with a column of strings (sentences). Each string contains many camel-cased abbreviations. A separate dictionary maps these abbreviations to their long forms.
For example:
Dictionary: {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}
The dataframe column has text like this (for simplicity, each list entry is one row in the dataframe):
['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']
If I simply replace using the dictionary, all replacements are correct except that 'Automatically' becomes 'Automatikmatically' in the first text.
I tried wrapping each dictionary key in a regex with a condition: replace the abbreviation only if it has a space, the start of the string, or a lowercase letter before it, and a capital letter, a space, or the end of the sentence after it: '(?:^|[a-z])ShFrm(?:[^A-Z]|$)'. However, this also replaces the characters before and after the abbreviation.
Could you please help me modify the regex so that it matches an abbreviation only if it is preceded by a lowercase letter, a space, or the start of a word, and followed by a capital letter, a space, or the end of a word, and replaces only the abbreviation itself, not the surrounding characters?
You need to build an alternation-based regex from the dictionary keys and use a lambda expression as the replacement argument.
See the following Python demo:
import re
d = {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}
col = ['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']
rx = r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format("|".join(d.keys()))
# => (?:\b|(?<=[a-z]))(?:ShFrm|LgFrm|Auto)(?=[A-Z]|\b)
print([re.sub(rx, lambda x: d[x.group()], v) for v in col])
# => ['ShortformLongform should be replaced Automatically', 'Automatik', 'AutomatikLongform']
In Pandas, assuming the text is in a dataframe column named 'col', you would use it like this:
df['col'] = df['col'].str.replace(rx, lambda x: d[x.group()], regex=True)
See the regex demo.
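If the dictionary keys could contain regex metacharacters, or if one key could be a prefix of another, the alternation may need a little more care. The following is a minimal sketch under those assumptions, not part of the original answer: the helper names build_pattern and expand are hypothetical, the keys are escaped with re.escape, and longer keys are tried first.

```python
import re

d = {'ShFrm': 'Shortform', 'LgFrm': 'Longform', 'Auto': 'Automatik'}
col = ['ShFrmLgFrm should be replaced Automatically', 'Auto', 'AutoLgFrm']

def build_pattern(mapping):
    # Hypothetical helper: longest keys first, each key escaped for regex use
    keys = sorted(mapping, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format(alternation))

def expand(text, mapping, pattern):
    # Replace each matched abbreviation with its long form
    return pattern.sub(lambda m: mapping[m.group()], text)

rx = build_pattern(d)
expanded = [expand(v, d, rx) for v in col]
print(expanded)
```

With the sample dictionary the result is the same as in the demo above; the extra steps only matter for keys such as 'A.B' or for pairs such as 'Auto' and 'AutoX'.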
You can use a lookahead, which asserts that a pattern follows the main expression without including it in the match.
(?<=\b|[a-z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)
That pattern expresses your requirements. However, Python's re module only supports fixed-width lookbehinds, and (?<=\b|[a-z]) mixes a zero-width alternative with a one-character alternative, so it does not compile. A fixed-width negative lookbehind can be used instead:
rx=r"(?<![A-Z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)"
re.findall(rx,"['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']")
Out: ['ShFrm', 'LgFrm', 'Auto', 'Auto', 'LgFrm']
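To go from finding the abbreviations to replacing them, the same negative-lookbehind pattern can be passed to re.sub together with the dictionary. This is a small sketch reusing the question's sample data; the variable names variable_width_ok and replaced are only illustrative.

```python
import re

d = {'ShFrm': 'Shortform', 'LgFrm': 'Longform', 'Auto': 'Automatik'}
col = ['ShFrmLgFrm should be replaced Automatically', 'Auto', 'AutoLgFrm']

# The variable-width lookbehind is rejected by the standard re module
try:
    re.compile(r'(?<=\b|[a-z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# The fixed-width negative lookbehind compiles and can drive the replacement
rx = re.compile(r'(?<![A-Z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)')
replaced = [rx.sub(lambda m: d[m.group(1)], v) for v in col]

print(variable_width_ok)
print(replaced)
```

Note that (?<![A-Z]) is slightly more permissive than the stated requirement: it also allows a digit or punctuation before the abbreviation.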
I want to remove a substring between two words from a string with Python without removing the words that delimit this substring.
what I have as input : "abcde"
what I want as output : "abde"
The code I have:
import re
s = "abcde"
a = re.sub(r'b.*?d', "", s)
what I get as Output : "ae"
Edit:
Another example to explain the case:
what I have as input : "c:/user/home/56_image.jpg"
what I want as output : "c:/user/home/image.jpg"
The code I have:
import re
s = "c:/user/home/56_image.jpg"
a = re.sub(r'/.*?image', "", s)
what I get as Output : "c:.jpg"
Note: the number before "image" changes, so I cannot use the replace() function. I want something generic.
You can do it like below:
''.join('abcde'.split('c'))
I would phrase the regex replacement as:
s = "abcde"
a = re.sub(r'b\w*d', "bd", s)
print(a) # abde
I am using \w* to match zero or more word characters in between b and d. This is to ensure that we don't accidentally match across words.
Your pattern also matches the delimiters you want to keep, and the whole match is replaced with an empty string, which is why they are missing from the result.
You can either capture what you want to keep and reuse the group in the replacement, or use lookarounds, which are non-consuming.
For example, capture b in group 1 and use \1 as the replacement:
(b)\w*?(?=d)
Or match only the part to remove, followed by a lookahead, and use an empty string as the replacement:
\d+_(?=image)
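Both patterns can be checked against the two inputs from the question. This is a short sketch; the variable names first and second are only illustrative.

```python
import re

# Capture group: keep "b" via \1, stop before "d" with a lookahead
first = re.sub(r'(b)\w*?(?=d)', r'\1', "abcde")

# Lookahead only: remove the digits and underscore that precede "image"
second = re.sub(r'\d+_(?=image)', '', "c:/user/home/56_image.jpg")

print(first)
print(second)
```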
I would like to include 5 characters before and after a specific word is matched in my regex query. Those words are in a list and I iterate over it.
See example below, this is what I tried:
import re
text = "This is an example of quality and this is true."
words = ['example', 'quality']
words_around = []
for word in words:
    neighbors = re.findall(fr'(.{0,5}{word}.{0,5})', str(text))
    words_around.append(neighbors)
print(words_around)
The output is empty. I would expect a list containing ['s an example of q', 'e of quality and ']
You can use the PyPI regex package here, which allows lookbehind patterns of unbounded length:
import regex
import pandas as pd
words = ['example', 'quality']
df = pd.DataFrame({'col': [
    "This is an example of quality and this is true.",
    "No matches."
]})
rx = regex.compile(fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))')
def extract_regex(s):
    return ["".join(x) for x in rx.findall(s)]
df['col2'] = df['col'].apply(extract_regex)
Output:
>>> df
                                               col                                    col2
0  This is an example of quality and this is true.  [s an example of q, e of quality and ]
1                                      No matches.                                      []
Both the pattern and how it is used matter.
The fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))' part defines the regex pattern. This is a "raw" f-string literal: the f prefix makes it possible to use variables inside the string literal, but it also requires doubling all literal braces inside it. Given the current words list, the pattern looks like (?<=(.{0,5}))(example|quality)(?=(.{0,5})). It captures 0-5 chars before the word inside a positive lookbehind, then captures the word, and then captures the next 0-5 chars in a positive lookahead (lookarounds are used so that overlapping matches are found).
The ["".join(x) for x in rx.findall(s)] part joins the groups of each match into a single string and returns a list of those strings.
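The brace-doubling rule is also the reason the question's loop returned nothing: inside an f-string, {0,5} is evaluated as a Python expression instead of being kept as a quantifier. If the third-party regex package is not an option, a minimal sketch of the original loop with the standard re module could look like this. Wrapping the word in re.escape is an added precaution, not something the question asked for.

```python
import re

text = "This is an example of quality and this is true."
words = ['example', 'quality']

words_around = []
for word in words:
    # {{0,5}} becomes the literal quantifier {0,5}; {re.escape(word)} is interpolated
    neighbors = re.findall(fr'(.{{0,5}}{re.escape(word)}.{{0,5}})', text)
    words_around.append(neighbors)

print(words_around)
```

Because each word is searched separately, this gives one list per word. Unlike the lookaround version, matches of the same word that overlap each other would not all be reported.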
I have been trying to construct a regex that can detect the pattern {word}{.,#}{word} and separate it into [word, ',' (or '.', '#'), word].
But I am not able to create one that matches this pattern strictly and ignores everything else.
I used the following regex
r"[\w]+|[.]"
This one does well, but it doesn't match strictly: if the (, # or .) characters don't occur in the text, it still gives me words, which I don't want.
I would like a regex that strictly matches the above pattern and gives me the splits (using re.findall), and otherwise returns the whole word as it is.
Please note: the words on either side of the {,.#} do not both have to be present, but at least one of them should be.
Some example text for reference:
no.16 would give me ['no','.','16']
#400 would give me ['#','400']
word1.word2 would give me ['word1','.','word2']
Looking forward to some help and assistance from all regex gurus out there
EDIT:
I forgot to add this. @viktor's version works as needed with only one problem: it ignores ALL other words during re.findall.
E.g. ONE TWO THREE #400 with viktor's regex gives me ['','#','400']
but what was expected was ['ONE','TWO','THREE','#','400']
This can be done with NLTK or spaCy, but using those is not an option for me.
I suggest using
(\w+)?([.,#])((?(1)\w*|\w+))
See the regex demo.
Details
(\w+)? - An optional group #1: one or more word chars
([.,#]) - Group #2: ., , or #
((?(1)\w*|\w+)) - Group #3: if Group 1 matched, match zero or more word chars (the word is optional on the right side then), else, match one or more word chars (there must be a word on the right side of the punctuation chars since there is no word before them).
See the Python demo:
import re
pattern = re.compile(r'(\w+)?([.,#])((?(1)\w*|\w+))')
strings = ['no.16', '#400', 'word1.word2', 'word', '123']
for s in strings:
    print(s, '->', pattern.findall(s))
Output:
no.16 -> [('no', '.', '16')]
#400 -> [('', '#', '400')]
word1.word2 -> [('word1', '.', 'word2')]
word -> []
123 -> []
The answer to your edit is
if re.search(r'\w[.,#]|[.,#]\w', text):
    print( re.findall(r'[.,#]|[^\s.,#]+', text) )
If the input string contains a word char followed by one of the three punctuation symbols, or one of those symbols followed by a word char, you can extract all occurrences of the [.,#]|[^\s.,#]+ pattern: either a ., , or #, or one or more chars other than whitespace, ., , and #.
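Put together as a runnable sketch, with the fallback the question asked for (return the text unchanged when the punctuation pattern does not occur). The function name split_tokens and the constant names are hypothetical.

```python
import re

# Both patterns are taken from the answer above
HAS_PUNCT_NEXT_TO_WORD = re.compile(r'\w[.,#]|[.,#]\w')
TOKEN = re.compile(r'[.,#]|[^\s.,#]+')

def split_tokens(text):
    # Split only when a punctuation symbol touches a word char
    if HAS_PUNCT_NEXT_TO_WORD.search(text):
        return TOKEN.findall(text)
    # Otherwise return the input as a single item
    return [text]

for sample in ['no.16', '#400', 'word1.word2', 'ONE TWO THREE #400', 'word']:
    print(sample, '->', split_tokens(sample))
```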
I hope this code will solve your problem if you want to split the string by any of the mentioned special characters:
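import re
a = 'no.16'
b = '#400'
c = 'word1.word2'
lst = [a, b, c]
for elem in lst:
    # The capture group keeps the delimiter in the split result
    result = re.split(r'([.#,])', elem)
    # Drop the empty strings produced when the delimiter is at the start or end
    result = [part for part in result if part]
    print(result)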
You could do something like this:
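import re
text = "no.16"  # named text rather than str, to avoid shadowing the built-in
# [.,#] matches the three delimiters; inside a character class, | would be a literal pipe
pattern = re.compile(r"(\w+)([.,#])(\w+)")
result = list(filter(None, pattern.split(text)))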
The list(filter(...)) part is needed to remove the empty strings that split returns (see Python - re.split: extra empty strings that the beginning and end list).
However, this will only work if your string only contains these two words separated by one of the delimiters specified by you. If there is additional content before or after the pattern, this will also be returned by split.
I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?
You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
in case it is separated by at least one spacing character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
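The two patterns give the same result on the question's string, so the difference only shows on an input where two codes are written without a space between them. A small sketch with a made-up sample string:

```python
import re

strict = re.compile(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b')  # needs whitespace between codes
loose = re.compile(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b')   # whitespace is optional

sample = "codes F56F23 and G87"  # hypothetical input, not from the question

strict_result = strict.findall(sample)
loose_result = loose.findall(sample)
print(strict_result)
print(loose_result)
```

With \s+ the run F56F23 is skipped entirely, because there is no word boundary after F56; with \s* it is returned as one match.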
print(list(map(lambda x: x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))))
Change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)': parentheses around the whole expression, \s* to include items that are connected by whitespace, and a + after the last parenthesis to match more than one occurrence (greedy).
Now you have a list of tuples (one item per group), of which you need the element at index 0. You can use map and lambda for this. To remove the trailing blanks I used strip().
Need some help with regular expressions.
I want to match some Roman numerals and replace them with Arabic numerals.
First of all, I use (IX|IV|V?I{0,3}) to match Roman numerals (from 1 to 9).
Then I add some logic to require either whitespace (with some text before) or nothing (beginning/end of string) around the numeral, with (?:^|\s) and (?:\s|$)
So finally I have (?:^|\s)(IX|IV|V?I{0,3})(?:\s|$)
It matches all these variants:
some text VI
IX here we are
another III text
If I define a dict with a Roman-to-Arabic map {'iii': 3, 'IX': 9}, how do I replace matches with values from the dict? Also, it matches only the first occurrence, i.e. in some V then III I get only V
"Also, it matches only the first occurrence, i.e. in some V then III I get only V"
I assume that you are using re.match or re.search, which only give you one result. We will use re.sub to solve your main question, so this won't be an issue. re.sub can take a callable as the replacement, so we can replace each match with the corresponding value from your dictionary. The callable must return a string, so the integer values are converted with str(). Use
re.sub(your_regex, lambda m: str(your_dict[m.group(1)]), your_string)
This assumes every possible match is in your dict. If not, use
re.sub(your_regex, lambda m: str(your_dict.get(m.group(1), m.group(1))), your_string)
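One thing to watch: the question's pattern consumes the whitespace around the numeral, so replacing the whole match with only the dictionary value would also remove those spaces. The following is a minimal sketch that avoids this with lookarounds. It is an adaptation, not the original answer: the names roman_to_arabic and replace_roman are hypothetical, the dictionary is filled in for 1 to 9, and the alternation is rewritten so it cannot match an empty string.

```python
import re

# Hypothetical full map for 1-9; the question only showed a partial dict
roman_to_arabic = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5,
                   'VI': 6, 'VII': 7, 'VIII': 8, 'IX': 9}

# (?<!\S) and (?!\S) require whitespace or a string edge without consuming it
rx = re.compile(r'(?<!\S)(IX|IV|VI{0,3}|I{1,3})(?!\S)')

def replace_roman(text):
    # Fall back to the matched text if a numeral is missing from the dict
    return rx.sub(lambda m: str(roman_to_arabic.get(m.group(1), m.group(1))), text)

for sample in ['some text VI', 'IX here we are', 'another III text', 'some V then III']:
    print(replace_roman(sample))
```

Because the pattern only looks at the characters, a standalone pronoun "I" would also be turned into 1; handling that would need extra context rules.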