Find and replace semi-common strings in dataframe? - python

I am attempting to find a semi-common occurring string and remove all other data in the column. Pandas and Re have been imported. For instance, I have dataframe...
>>> df
  COLUMN COUNT  DATA
1 this row      RA-123: data 8b43a
2 here          RA-5372: data 94h63c
I need to keep just the RA-'number that follows' and remove everything before and after. The numbers that follow are not always the same length and the 'RA-' string does not always occur in the same position. There is a colon after every instance that can be used as a delimiter.
I tried this (a friend wrote the regex search piece for me because I am not familiar with it).
df.assign(DATA= df['DATA'].str.extract(re.search('RA[^:]+')))
But python returned
TypeError: search() missing 1 required positional argument: 'string'
What am I missing here? Thanks in advance!

You should use a capturing group with extract:
df['DATA'].str.extract(r'(RA-\d+)')
Here, (RA-\d+) is a capturing group matching RA, then a hyphen and then one or more digits.
You may use your own pattern, but you still need to wrap it with capturing parentheses, r'(RA[^:]+)'.
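For instance, a minimal runnable sketch of that approach (the frame below just reproduces the sample data from the question; the column names are assumptions):
import pandas as pd

df = pd.DataFrame({'COUNT': ['this row', 'here'],
                   'DATA': ['RA-123: data 8b43a', 'RA-5372: data 94h63c']})
# expand=False makes extract return a Series instead of a one-column DataFrame
df['DATA'] = df['DATA'].str.extract(r'(RA-\d+)', expand=False)
print(df['DATA'].tolist())  # ['RA-123', 'RA-5372']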

Looking at the docs, you don't need the re.search method, but you do still need a capturing group. You can just call df['DATA'] = df['DATA'].str.extract(r'(RA[^:]+)')

As I mentioned earlier, there is no need for re here.
Other answers have covered well how to use extract directly. However, to answer your question specifically: if you really want to use re, the way to go is re.compile instead of re.search. Note that the compiled pattern still needs a capturing group, e.g. regex_str = r'(RA[^:]+)'.
df.assign(DATA=df['DATA'].str.extract(re.compile(regex_str)))

Extract values in name=value lines with regex

I'm really sorry for asking because there are some questions like this around, but I can't get their answers to fix my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values behind the = sign with a Python 3.9 regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. But I want exactly the opposite.
The expected result (what Python's re.findall() should return) is
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*) (see the Regex101 example):
import re
re.findall(r'profile\d+\.name=(.*)', txt)  # txt holds the config file text
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex; split should work absolutely fine:
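Roughly, a sketch of that split approach (the lines list just restates the sample input):
lines = ["profile2.name=share2", "profile8.name=share8",
         "profile4.name=shareSSH", "profile9.name=share9"]
# keep everything after the first '=' on each line
values = [line.split('=', 1)[1] for line in lines]
print(values)  # ['share2', 'share8', 'shareSSH', 'share9']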
Try removing the ? quantifier: the lazy (.*?) at the end of your pattern is allowed to match an empty string, while a greedy (.*) will grab the rest of the line. See the regex101 demo.

Regex to remove strings from list that do not match given prefix

I have a string that includes multiple comma-separated lists of values, always embedded between <mks:Field name="MyField"> and </mks:Field>.
For example:
<mks:Field name="MyField">X001_ABC</mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2,X002_XYZ</mks:Field>
In this example I have the following values to work with:
X001_ABC
(empty)
X000_Test1,X000_Test2
X001_ABC,X000_Test1
X000_Test1,X000_Test2,X002_XYZ
Now I want to remove all the values that do not start with the prefix "X000_", including any needless commas, so that my result looks like this:
<mks:Field name="MyField"></mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field>
I have tried the following regex, but it does not work properly when a single non-matching value occurs on its own, and I do not want to have to change my regex whenever a new value matching my prefix is introduced (e.g. X000_Test3).
Search: (?<=name="MyField">)[^<>](?:.*?(X000_Test1,X000_Test2|X000_Test1|X000_Test2))?.*?(?=</mks:Field>)
Replace: \1
This gives me the following result that does not match the expected output:
<mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test2</mks:Field>
Unfortunately I cannot simply parse the string with something else - I only have the option of a regex search/replace in this case.
Thank you in advance, any help would be appreciated.
If you are using JavaScript, use this:
prefix='X000';
let pattern= new RegExp(`((?<=>)|,)((?!${prefix}|[>\<,]).)*(,|(?=\<))`, 'g');
For any other language use this:
'/((?<=>)|,)((?!X000|[>\<,]).)*(,|(?=\<))/';
X000 being the prefix you want to keep
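For instance, a minimal sketch applying that pattern with Python's re.sub, assuming an empty replacement string (the sample text is shortened from the question):
import re

text = ('<mks:Field name="MyField">X001_ABC</mks:Field>'
        '<mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field>')
# each match covers one non-X000 value plus at most one neighbouring comma
pattern = r'((?<=>)|,)((?!X000|[><,]).)*(,|(?=<))'
print(re.sub(pattern, '', text))
# <mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1</mks:Field>
Note that the lookbehind fires after any >, so as given the pattern would also strip non-matching values from fields other than MyField.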

regex extraction 2 groups resulting only in one match

New to regex.
Consider you have the following text structure:
"hello_1:45||hello_2:67||bye_1:45||bye_5:89||.....|| bye_last:100" and so on
I want to build a dictionary out of it taking the string value as a key, and the decimal number as the dict value.
I was trying to check my concept using this nice tool
I wrote my regex expression:
(\w+):(\d+)
And got only one match -> the first in the string: hello_1:45
I also tried something like:
.*(\w+):(\d+).*
But that's also no good. Any ideas?
You should use the g (global) modifier to get all the matches instead of stopping at the first one. In Python you can use the re.findall function to get all the matches. Check the example below.
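Roughly, using the sample string from the question:
import re

s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"
# findall returns one (key, value) tuple per match
d = {k: int(v) for k, v in re.findall(r'(\w+):(\d+)', s)}
print(d)  # {'hello_1': 45, 'hello_2': 67, 'bye_1': 45, 'bye_5': 89}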
You can achieve this through the split function alone.
s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"
print({i.split(':')[0]: i.split(':')[1] for i in s.split('||')})
Try this if you want to convert the value part to int:
print({i.split(':')[0]: int(i.split(':')[1]) for i in s.split('||')})
or
print({i.split(':')[0]: float(i.split(':')[1]) for i in s.split('||')})

Regex named conditional lookahead (in Python)

I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.
Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
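A quick check of the combined pattern, using the sample string from the question (the other test strings are assumptions):
import re

common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
print(bool(re.search(final_regex, 'acaaasda')))  # True: 'a' branch requires the trailing 'asda'
print(bool(re.search(final_regex, 'bcaaa')))     # True: 'b' branch does not
print(bool(re.search(final_regex, 'acaaa')))     # False: 'a' start without trailing 'asda'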
You can use something like this:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.
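A rough demonstration, again with made-up test strings:
import re

pattern = r'^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$'
print(bool(re.search(pattern, 'acaaasda')))  # True: 'a' start and the lookahead finds 'asda'
print(bool(re.search(pattern, 'bcaaa')))     # True: 'b' start needs no trailing 'asda'
print(bool(re.search(pattern, 'acaaa')))     # False: 'a' start but no trailing 'asda'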

Python regular expressions with more than 100 groups?

Is there any way to beat the 100-group limit for regular expressions in Python? Also, could someone explain why there is a limit?
There is a limit because it would take too much memory to store the complete state machine efficiently. I'd say that if you have more than 100 groups in your re, something is wrong either in the re itself or in the way you are using them. Maybe you need to split the input and work on smaller chunks or something.
I found the easiest way was to
import regex as re
instead of
import re
The third-party regex module does not impose the 100-group limit. (Its _MAXCACHE default is 500 instead of 100, I believe, but note that _MAXCACHE is the compiled-pattern cache size, a separate setting from the group limit.) This is one of the many reasons I find regex to be a better module than re.
If I'm not mistaken, the "new" regex module (currently third-party, but intended to eventually replace the re module in the stdlib) does not have this limit, so you might give that a try.
I'm not sure what you're doing exactly, but try using a single group, with a lot of OR clauses inside... so (this)|(that) becomes (this|that). You can do clever things with the results by passing a function that does something with the particular word that is matched:
newContents, num = cregex.subn(lambda m: replacements[m.group(0)], contents)  # cregex is the compiled alternation; replacements maps each matched word to its substitute
If you really need so many groups, you'll probably have to do it in stages... one pass for a dozen big groups, then another pass inside each of those groups for all the details you want.
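A minimal self-contained sketch of the single-alternation idea above (the replacements mapping is a made-up example):
import re

replacements = {'this': 'THIS', 'that': 'THAT'}  # hypothetical word -> substitute map
cregex = re.compile('|'.join(map(re.escape, replacements)))  # one group-free alternation
new_contents, num = cregex.subn(lambda m: replacements[m.group(0)], 'this and that')
print(new_contents, num)  # THIS and THAT 2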
I doubt you really need to process 100 named groups in later commands or use them in a regexp replacement; that would be quite impractical. If you just need groups to express rich conditions in the regexp, you can use non-capturing groups:
(?:word1|word2)(?:word3|word4)
etc. Complex scenarios, including nested groups, are possible.
There is no limit on the number of non-capturing groups.
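A quick illustration (the word strings are placeholders):
import re

# non-capturing groups match normally but add nothing to .groups()
m = re.search(r'(?:word1|word2)(?:word3|word4)', 'word1word4')
print(m.group(0), m.groups())  # word1word4 ()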
First, as others have said, there are probably good alternatives to using 100 groups. The re.findall method might be a useful place to start. If you really need more than 100 groups, the only workaround I see is to modify the core Python code.
In [python-install-dir]/lib/sre_compile.py simply modify the compile() function by removing the following lines:
# in lib/sre_compile.py
if pattern.groups > 100:
    raise AssertionError(
        "sorry, but this version only supports 100 named groups"
        )
For a slightly more flexible version, just define a constant at the top of the sre_compile module, and have the above line compare to that constant instead of 100.
Funnily enough, in the (Python 2.5) source there is a comment indicating that the 100 group limit is scheduled to be removed in future versions.
I've found that Python 3 doesn't have this limitation, whereas the same code run on the latest 2.7 displays this error.
When I ran into this, I had a really complex pattern that was actually composed of a bunch of high-level patterns joined by ORs, like this:
pattern_string = u"pattern1|" \
                 u"pattern2|" \
                 u"patternN"
pattern = re.compile(pattern_string, re.UNICODE)
for match in pattern.finditer(string_to_search):
    pass  # Extract data from the groups in the match.
As a workaround, I turned the pattern into a list and I used that list as follows:
pattern_strings = [
    u"pattern1",
    u"pattern2",
    u"patternN",
]
patterns = [re.compile(pattern_string, re.UNICODE) for pattern_string in pattern_strings]
for pattern in patterns:
    for match in pattern.finditer(string_to_search):
        pass  # Extract data from the groups in the match.
    string_to_search = pattern.sub(u"", string_to_search)
I would say you could reduce the number of groups by using non-capturing parentheses, but whatever it is that you're doing seems to need all these groupings.
In my case, I have a dictionary of n words and want to create a single regex that matches all of them, i.e. if my dictionary is
hello
goodbye
my regex would be: (^|\s)hello($|\s)|(^|\s)goodbye($|\s) ... it's the only way to do it, and it works fine on small dictionaries, but when you have more than 50 words, well...
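One way to keep the group count down here is a single alternation inside one non-capturing group; the \b word boundaries below (an assumption, swapped in for the (^|\s)/($|\s) pairs) behave slightly differently around punctuation but contribute no groups at all:
import re

words = ['hello', 'goodbye']
# one non-capturing group for the whole dictionary, zero capture groups total
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')
print(pattern.findall('hello there, goodbye'))  # ['hello', 'goodbye']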
Note that the constant _MAXCACHE = 100 that you'll find in the re module is the size of the compiled-pattern cache, not the group limit, so changing its value (to 1000, for example) will not lift this restriction; the actual check is the sre_compile.py assertion shown above.
