I have a column in my pandas Dataframe df that contains a string with some trailing hex-encoded NULLs (\x00). At least I think that it's that. When I tried to replace them with:
df['SOPInstanceUID'] = df['SOPInstanceUID'].replace('\x00', '')
the column is not updated. When I do the same with
df['SOPInstanceUID'] = df['SOPInstanceUID'].str.replace('\x00', '')
it's working fine.
What's the difference here? (SOPInstanceUID is not an index.)
thanks
The former (Series.replace) looks for cell values that exactly match '\x00', while the latter (Series.str.replace) looks for matches in any part of each string, which is why only the latter works for you.
The .str methods mirror the standard Python string methods, but are vectorised over the Series.
You did not pass a regex, and no cell is exactly equal to '\x00', hence only str.replace worked. Compare the documentation:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)
parameter: to_replace : str, regex, list, dict, Series, numeric, or None
str or regex:
str: string exactly matching to_replace will be replaced with value
regex: regexs matching to_replace will be replaced with value
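A small sketch of the difference (the UID values below are made up):

```python
import pandas as pd

# Hypothetical UID values with trailing NUL characters
s = pd.Series(['1.2.840\x00\x00', '1.2.841\x00'])

# Series.replace compares each whole cell against '\x00', so nothing matches
unchanged = s.replace('\x00', '').tolist()

# Series.str.replace substitutes every occurrence inside each string
stripped = s.str.replace('\x00', '', regex=False).tolist()

print(unchanged)  # ['1.2.840\x00\x00', '1.2.841\x00']
print(stripped)   # ['1.2.840', '1.2.841']
```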
The literal text \x00 is not actually in the string: it contains raw control characters, which Python displays using hexadecimal escape notation. You can remove all non-word characters in the following way:
import re
re.sub(r'[^\w]', '', '\x00\x00\x00\x08\x01\x008\xe6\x7f')
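The same cleanup can be applied column-wise with the vectorised str.replace. Note that '\xe6' ('æ') counts as a word character under Python's Unicode-aware \w, so it survives. A sketch with a made-up Series:

```python
import pandas as pd

# Hypothetical column values containing control characters
s = pd.Series(['\x00\x00\x00\x08\x01\x008\xe6\x7f'])

# Strip everything that is not a word character; '8' and 'æ' remain
cleaned = s.str.replace(r'[^\w]', '', regex=True)
print(cleaned.tolist())  # ['8æ']
```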
I have a dataframe with a column containing string (sentence). This string has many camelcased abbreviations. There is another dictionary which has details of these abbreviations and their respective longforms.
For Example:
Dictionary: {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}
The dataframe column has text like this (for simplicity, each list entry is one row in the dataframe):
['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']
If I simply do a replace using the dictionary, all replacements are correct except that 'Automatically' converts to 'Automatikmatically' in the first text.
I tried putting regex around the dictionary keys with the condition "replace the word only if it has a space/start of string/small letter before it and a capital letter/space/end of string after it": '(?:^|[a-z])ShFrm(?:[^A-Z]|$)', but it replaces the characters before and after the middle string as well.
Could you please help me modify the regex pattern so that it matches the abbreviations only under those conditions and replaces only the middle word, not the characters before and after it?
You need to build an alternation-based regex from the dictionary keys and use a lambda expression as the replacement argument.
See the following Python demo:
import re
d = {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}
col = ['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']
rx = r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format("|".join(d.keys()))
# => (?:\b|(?<=[a-z]))(?:ShFrm|LgFrm|Auto)(?=[A-Z]|\b)
print([re.sub(rx, lambda x: d[x.group()], v) for v in col])
# => ['ShortformLongform should be replaced Automatically', 'Automatik', 'AutomatikLongform']
In Pandas, you would use it like this:
df[col] = df[col].str.replace(rx, lambda x: d[x.group()], regex=True)
See the regex demo.
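If the dictionary can change, it may be safer to escape the keys and try longer alternatives first when building the pattern; the escaping and sorting steps below are my additions, not part of the original answer:

```python
import re

d = {'ShFrm': 'Shortform', 'LgFrm': 'Longform', 'Auto': 'Automatik'}

# Escape each key and sort longest-first, so a short key can never
# shadow a longer key that shares its prefix in the alternation
alts = "|".join(sorted(map(re.escape, d), key=len, reverse=True))
rx = r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format(alts)

col = ['ShFrmLgFrm should be replaced Automatically', 'Auto', 'AutoLgFrm']
result = [re.sub(rx, lambda m: d[m.group()], v) for v in col]
print(result)
```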
You can use a lookahead, which matches a group after the main expression without including it in the result.
(?<=\b|[a-z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)
That matches your requirements. However, Python's re module only supports fixed-width lookbehinds, and \b|[a-z] mixes a zero-width \b with a one-character class, so we can switch to a fixed-width negative lookbehind instead:
rx=r"(?<![A-Z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)"
re.findall(rx,"['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']")
Out: ['ShFrm', 'LgFrm', 'Auto', 'Auto', 'LgFrm']
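A quick re.sub sketch with that negative-lookbehind pattern (the sample sentence is made up):

```python
import re

d = {'ShFrm': 'Shortform', 'LgFrm': 'Longform', 'Auto': 'Automatik'}
rx = r"(?<![A-Z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)"

# Each abbreviation is swapped for its long form; 'Automatically' is left
# alone because the 'Auto' inside it is followed by a lowercase letter
result = re.sub(rx, lambda m: d[m.group(1)], 'AutoLgFrm should be replaced Automatically')
print(result)  # AutomatikLongform should be replaced Automatically
```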
Goal: replace values in column que_text with matches of re.search pattern. Else None
Problem: Receiving only None values in que_text_new column although regex pattern is thoroughly tested!
def override(s):
    x = re.search(r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?', str(s), flags=re.DOTALL | re.MULTILINE)
    if x:
        return x.group(5)
    return None
df2['que_text_new'] = df2['que_text'].apply(override)
What am I doing wrong? Removing return None doesn't help. There must be some structural error in my function, I assume.
You can use a pattern with a single capturing group and then simply use Series.str.extract, chaining .fillna(np.nan) to fill the non-matched values with NaN:
pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'
df2['que_text_new'] = df2['que_text'].astype(str).str.extract(pattern).fillna(np.nan)
Not sure you need .astype(str), but there is str(s) in your code, so it might be safer with this part.
Here,
Capturing groups with single char alternatives are converted to character classes, e.g. (i|ı) -> [iı]
Other capturing groups are converted to non-capturing ones, i.e. ( -> (?:.
To make np.nan work do not forget to import numpy as np.
(?s) is an in-pattern re.DOTALL option.
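Putting it together on a toy frame (the German rows are made up, and I pass expand=False so str.extract returns a Series rather than a one-column DataFrame):

```python
import pandas as pd

pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# Hypothetical sample rows: one matching, one not
df2 = pd.DataFrame({'que_text': ['Hiermit frage ich die Staatsregierung: wann?',
                                 'kein Treffer']})
df2['que_text_new'] = df2['que_text'].astype(str).str.extract(pattern, expand=False)
print(df2['que_text_new'].tolist())  # [': wann?', nan]
```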
I want to filter a dataframe by only keeping the rows that conform with a regex pattern in a given column. The example in the documentation only filters by looking for that regex in every column in the dataframe (documentation to filter)
So how can i change the following example
df.filter(regex='^[\d]*', axis=0)
to something like this: (Which only looks for the regex in the specified column)
df.filter(column='column_name', regex='^[\d]*', axis=0)
Use the vectorized string method contains() or match() - see Testing for Strings that Match or Contain a Pattern:
df[df.column_name.str.contains('^\d+')]
or
df[df.column_name.str.match('\d+')] # Matches only start of the string
Note that I removed the superfluous brackets ([]) and replaced * with +, because \d* always matches: it also matches zero occurrences (a so-called zero-length match).
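For example, with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'column_name': ['123', 'abc', '45x']})

# Boolean mask: True where the value starts with one or more digits
kept = df[df.column_name.str.contains(r'^\d+')]
print(kept['column_name'].tolist())  # ['123', '45x']
```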
Filter the DataFrame using a Boolean mask made from the given column and regex pattern as follows:
df[df.column_name.str.contains('^[\d]*', regex=True)]
I am trying to parse some docstrings.
An example docstrings is:
Test if a column field is larger than a given value
This function can also be called as an operator using the '>' syntax
Arguments:
- DbColumn self
- string or float value: the value to compare to
in case of string: lexicographic comparison
in case of float: numeric comparison
Returns:
DbWhere object
Both the Arguments and Returns parts are optional. I want my regex to return as groups the description (first lines), the Arguments part (if present) and the Returns part (if present).
The regex I have now is:
m = re.search('(.*)(Arguments:.*)(Returns:.*)', s, re.DOTALL)
and works in case all three parts are present, but fails as soon as the Arguments or Returns parts are not available. I have tried several variations with non-greedy modifiers like ??, but to no avail.
Edit: When the Arguments and Returns parts are present, I actually would only like to match the text after Arguments: and Returns: respectively.
Thanks!
Try with:
re.search('^(.*?)(Arguments:.*?)?(Returns:.*)?$', s, re.DOTALL)
Just making the second and third groups optional by appending a ?, and making the quantifiers of the first two groups non-greedy by (again) appending a ? to them (yes, confusing).
Also, if you use the non-greedy modifier on the first group of the pattern, it'll match the shortest possible substring, which for .* is the empty string. You can overcome this by adding the end-of-line character ($) at the end of the pattern, which forces the first group to match as few characters as possible to satisfy the pattern, i.e. the whole string when there's no Arguments and no Returns sections, and everything before those sections, when present.
Edit: OK, if you just want to capture the text after the Arguments: and Returns: tokens, you'll have to tuck in a couple more groups. We're not going to use all of the groups, so naming them with the (?P<name>...) notation (another question mark, argh!) is starting to make sense:
>>> m = re.search('^(?P<description>.*?)(Arguments:(?P<arguments>.*?))?(Returns:(?P<returns>.*))?$', s, re.DOTALL)
>>> m.groupdict()['description']
"Test if a column field is larger than a given value\n This function can also be called as an operator using the '>' syntax\n\n "
>>> m.groupdict()['arguments']
'\n - DbColumn self\n - string or float value: the value to compare to\n in case of string: lexicographic comparison\n in case of float: numeric comparison\n '
>>> m.groupdict()['returns']
'\n DbWhere object'
>>>
If you want to match the text after optional Arguments: and Returns: sections, AND you don't want to use (?P<name>...) to name your capture groups, you can also use, (?:...), the non-capturing version of regular parentheses.
The regex would look like this:
m = re.search('^(.*?)(?:Arguments:(.*?))?(?:Returns:(.*?))?$', doc, re.DOTALL)
# ^^ ^^
According to the Python3 documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
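A self-contained run of the non-capturing version against a docstring shaped like the example above:

```python
import re

# Docstring shaped like the example in the question
doc = """Test if a column field is larger than a given value

Arguments:
- DbColumn self

Returns:
DbWhere object"""

m = re.search('^(.*?)(?:Arguments:(.*?))?(?:Returns:(.*?))?$', doc, re.DOTALL)
description, arguments, returns = m.groups()
```

Dropping the Arguments and Returns sections from doc leaves arguments and returns as None, while description captures the whole string.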
I have some sample string. How can I replace first occurrence of this string in a longer string with empty string?
regex = re.compile('text')
match = regex.match(url)
if match:
    url = url.replace(regex, '')
The string replace() method perfectly solves this problem:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
>>> u'longlongTESTstringTEST'.replace('TEST', '?', 1)
u'longlong?stringTEST'
Use re.sub directly, this allows you to specify a count:
regex.sub('', url, 1)
(Note that the order of arguments is replacement, original, not the opposite, as might be suspected.)
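For example:

```python
import re

# The count argument caps the number of substitutions at one
first_only = re.sub('TEST', '?', 'longlongTESTstringTEST', count=1)
print(first_only)  # longlong?stringTEST
```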