I have a data frame with multiple rows, each containing a different variation of a comma-separated string. Rather than constantly writing variations of code such as df.replace('Word,', ''), I am looking for a simpler way to replace variations in strings in Python. I have heard about regex but am having a difficult time understanding it.
One such example that I am looking into is df.column.str.replace('Word,?', '') which would replace all variations of "Word" regardless of comma position. However, I am unsure as to how this works. Any help in understanding replacing using regex would be greatly appreciated. Thank you in advance.
Example:
'Word, foo, bar'
'Word'
'foo, bar, Word'
'foo, Word, bar'
Desired Output:
'foo, bar'
''
'foo, bar'
'foo, bar'
df.replace(to_replace='Word, ?|(, )?Word', value='', regex=True)
This way the .replace() method does the required work.
to_replace is our regular expression, and it should be a string.
'Word, ?' matches "Word" followed by a comma and an optional space, as at the start of a string. Without the optional space, the first example would keep a leading space.
To match "Word" at the end of a string, where it appears as ", Word", we add "|" (or) and a second alternative, "(, )?Word". Here ? matches 0 or 1 occurrence of ", " (a comma and one space), so both the trailing ", Word" and a string that is just "Word" are matched.
value='' : what each match is replaced with
regex=True : tells pandas to treat the to_replace parameter as a regular expression
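A quick way to check a pattern like this without building a data frame is to run it through re.sub on the four sample strings. This is only a sketch: the pattern used here, Word, ?|(, )?Word, also consumes one optional space after the comma so that the first row comes out without a leading space, and the names PATTERN, samples and cleaned are made up for the illustration.

```python
import re

# "Word" plus a comma and optional space, OR an optional ", " followed by "Word".
PATTERN = re.compile(r'Word, ?|(, )?Word')

samples = ['Word, foo, bar', 'Word', 'foo, bar, Word', 'foo, Word, bar']

# Remove every match from each sample string.
cleaned = [PATTERN.sub('', s) for s in samples]
print(cleaned)  # ['foo, bar', '', 'foo, bar', 'foo, bar']
```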
You can do it as below
Input
import pandas as pd

df = pd.DataFrame([[1, 'Word, foo, bar'],
                   [2, 'Word'],
                   [3, 'foo, bar, Word'],
                   [4, 'foo, Word, bar']], columns=['id', 'text'])
id text
1 Word, foo, bar
2 Word
3 foo, bar, Word
4 foo, Word, bar
Code to replace text 'Word' and following comma & space if any
df['text'] = df['text'].replace(r'Word(,\s)|(,\s)?Word', '', regex=True)
What is happening in the code
Word : matches the literal text 'Word'
(,\s)? : matches a comma followed by one whitespace character (\s). The ? makes the group optional, so if no comma and space are present, just the text 'Word' is matched. So ? is pretty important here.
| : matches either of the two expressions (in your case this is needed for row 3, where a comma and space come before 'Word')
You can see detailed explanation here Regex Demo
Output
id text
1 foo, bar
2
3 foo, bar
4 foo, bar
I have strings that are of the form below:
<p>The is a string.</p>
<em>This is another string.</em>
They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().
Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.
I'd like to do this in one line. What I mean is I want to pass as a parameter something of the form <*> like I would on the command line. I was thinking of using the replace() function to try to do this, but I am not sure what the replace() function parameter would look like.
For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:
x = x.replace("<..>", "")
Unfortunately, str.replace does not support Regex patterns. You need to use re.sub for this:
>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'
>>>
[^>]* matches zero or more characters that are not >.
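To see why the negated class matters, it can help to compare it with a greedy dot on the same line. The sketch below is only an illustration; the variable names line, greedy and tags_only are made up.

```python
import re

line = '<p>The is a string.</p>'

# Greedy dot: one match runs from the first '<' to the last '>',
# so the whole line is removed.
greedy = re.sub(r'<.*>', '', line)

# Negated class: each match stops at the first '>', so only the tags go.
tags_only = re.sub(r'<[^>]*>', '', line)

print(repr(greedy))     # ''
print(repr(tags_only))  # 'The is a string.'
```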
No Need for a 2-Step Solution
You don't need to 1. Split then 2. Replace. The two solutions below show you how to do it with one single step.
Option 1: Match All Instead of Splitting
Match All and Split are Two Sides of the Same Coin, and in this case it is safer to match all:
<[^>]+>|(\w+)
The words will be in Group 1.
Use it like this:
import re

subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
# Tags leave Group 1 empty, so empty strings are filtered out.
matches = [group for group in re.findall(regex, subject) if group]
print(matches)
Output
['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']
Discussion
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Option 2: One Single Split
<[^>]+>|[ .]
On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.
Output
This
is
a
string
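A minimal sketch of how this split pattern might be applied with re.split. The input string is an assumption chosen to reproduce the output above, and the empty strings that appear between adjacent delimiters are filtered out.

```python
import re

# Assumed input; not taken from the original answer.
subject = '<p>This is a string.</p>'

# Split on complete tags, spaces, or periods. Adjacent delimiters
# produce empty strings, which the comprehension drops.
words = [w for w in re.split(r'<[^>]+>|[ .]', subject) if w]

for word in words:
    print(word)
```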
I have the following code line which is splitting the string data2 up into a list upon instances of a white space:
string_list = data2.split()
However in some of my data there are dates in the format "28, Dec". Here the above code is splitting on the white space between the date and the month when I don't want it to. Is there a way I can say "split on the white space, but not if it is after a comma"?
You need to use regular expressions.
>>> re.split('(?<!,) ', 'blah blah, blah')
['blah', 'blah, blah']
From the link:
(?<!...) Matches if the current position in the string is not preceded
by a match for .... This is called a negative lookbehind assertion.
Similar to positive lookbehind assertions, the contained pattern must
only match strings of some fixed length. Patterns which start with
negative lookbehind assertions may match at the beginning of the
string being searched.
Use re.split with a negative lookbehind expression:
re.split(r'(?<!,)\s','I went on 28, Dec')
Out[53]: ['I', 'went', 'on', '28, Dec']
You can split using a regular expression and utilize look-behind expressions to make sure that you don’t split on a whitespace character that is preceded by a comma:
>>> import re
>>> s = 'foo bar 28, Dec bar baz'
>>> re.split(r'(?<!,)\s', s)
['foo', 'bar', '28, Dec', 'bar', 'baz']
Sorry to revive this thread, but I was trying to decode sqlite cells, and something seems odd to me. I'll explain. I'm trying to encode two different numbers into one cell by creating a string with a 0 in between and then converting it to a number, so for example:
a=4
b=7
c=str(a)+'0'+str(b)
The problem is when the first number is 10, so I'm using this
re.split('0([1-9])','1003')
['10','3','']
Why am I getting a list of length three when it should be just 2?
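A likely explanation, sketched rather than definitive: when the pattern passed to re.split contains a capturing group, the captured text is kept in the result, and a match at the very end of the string leaves an empty string after it. The lookahead variant below is a hypothetical alternative that assumes the second number contains no zeros; the names parts, first and second are made up.

```python
import re

# The group captures '3', which re.split keeps in the result, and the
# match ends the string, so an empty string follows it.
parts = re.split(r'0([1-9])', '1003')
print(parts)  # ['10', '3', '']

# Hypothetical alternative: split on a zero that is followed by 1-9.
# The lookahead does not consume the digit, so it stays in the second part.
first, second = re.split(r'0(?=[1-9])', '1003')
print(first, second)  # 10 3
```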
I'm trying to substitute every individual character found in a regex match but I can't seem to make it work.
I have a string containing parenthesized expressions that have to be replaced.
For example, foo bar (baz) should become foo bar (***)
Here is what I came up with: re.sub(r"(\(.*?).(.*?\))", r"\1*\2", "foo bar (baz)") Unfortunately, I can't seem to apply the substitution to every character between the parentheses. Is there any way to make this work?
How about something like this?
>>> import re
>>> s = 'foo bar (baz)'
>>> re.sub(r'(?<=\().*?(?=\))', lambda m: '*'*len(m.group()), s)
'foo bar (***)'
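The same idea can be written with a named function instead of a lambda, which may be easier to read. This is only a sketch with a made-up input containing two parenthesized parts; the lookbehind and lookahead keep the parentheses themselves out of the match, and re.sub calls the function once per match.

```python
import re

def mask(match):
    # Replace the matched text with the same number of asterisks.
    return '*' * len(match.group())

text = 'foo (bar) baz (quux)'
masked = re.sub(r'(?<=\().*?(?=\))', mask, text)
print(masked)  # foo (***) baz (****)
```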
I need to find a list of prefixes of words inside a target string (I would like to have the list of matching indexes in the target string handled as an array).
I think using regex should be the cleanest way.
Given that I am looking for the pattern "foo", I would like to retrieve in the target string words like "foo", "Foo", "fooing", "Fooing"
Given that I am looking for the pattern "foo bar", I would like to retrieve in the target string patterns like "foo bar", "Foo bar", "foo Bar", "foo baring" (they are still all handled as prefixes, am I right?)
At the moment, after running it in different scenarios, my Python code still does not work.
I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix).
I am assuming I have to use something like ^[fF] to be case insensitive with the first letter of my prefix.
I am assuming I should use something like ".*" to let the regexp behave like a prefix.
I am assuming I should use something like prefix1|prefix2|prefix3 to combine many different prefixes with a logical OR in the pattern to search.
The following source code does not work because I am wrongly setting the txt_pattern.
import re
# ' ' ' ' ' ' '
txt_str = "edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo";
txt_pattern = ''#???
out_obj = re.match(txt_pattern,txt_str)
if out_obj:
    print("match!")
else:
    print("No match!")
What am I missing?
How should I set the txt_pattern?
Can you please suggest a good tutorial with minimal working examples? At the moment the standard tutorials from the first page of a Google search are very long and detailed, and not so simple to understand.
Thanks!
Regex is the wrong approach. First parse your string into a list of strings with one word per item. Then use a list comprehension with a filter. The split method on strings is a good way to get the list of words, then you can simply do [item for item in wordlist if item.startswith("foo")]
People spend ages hacking up inefficient code using convoluted regexes when all they need is a few string methods like split, partition, startswith and some pythonic list comprehensions or generators.
Regexes have their uses but simple string parsing is not one of them.
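A minimal sketch of the string-method approach, assuming a case-insensitive comparison is wanted and that the position of each word in the list is enough as an index. It only covers single-word prefixes such as "foo"; a prefix with a space in it, such as "foo bar", would need extra handling. The names words and hits are made up.

```python
txt_str = ("edb foooooo jkds Fooooooo kj fooing jdcnj Fooing "
           "ujndn ggng sxk foo baring sh foo Bar djw Foo")

words = txt_str.split()

# Lowercase each word before comparing so "Foo" and "foo" both count.
# Each hit is a (position in the word list, word) pair.
hits = [(i, w) for i, w in enumerate(words) if w.lower().startswith('foo')]
print(hits)
```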
I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix).
No, the ^ is an anchor that only matches the start of the string. You can use \b instead, meaning a word boundary (but remember to escape the backslash inside a string literal, or use a raw string literal).
You will also have to use re.search instead of re.match because the latter only checks the start of the string, whereas the former searches for matches anywhere in the string.
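A short sketch of these two points on a shortened version of the target string. It assumes re.IGNORECASE is acceptable (it ignores case for the whole prefix, not only the first letter) and uses finditer to collect the start index of each match, since the question asks for indexes.

```python
import re

txt_str = "edb foooooo jkds Fooooooo kj fooing"

# \b anchors at a word boundary; \w* takes the rest of the word.
pattern = re.compile(r'\bfoo\w*', re.IGNORECASE)

print(pattern.match(txt_str))           # None: the string starts with "edb"
print(pattern.search(txt_str).group())  # foooooo

# (start index, matched word) for every match in the string.
spans = [(m.start(), m.group()) for m in pattern.finditer(txt_str)]
print(spans)
```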
>>> s = 'Foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo'
>>> regex = r'(?i)((foo)(\w+)?)'
>>> compiled = re.compile(regex)
>>> re.findall(compiled, s)
[('Foooooo', 'Foo', 'oooo'), ('Fooooooo', 'Foo', 'ooooo'), ('fooing', 'foo', 'ing'), ('Fooing', 'Foo', 'ing'), ('foo', 'foo', ''), ('foo', 'foo', ''), ('Foo', 'Foo', '')]
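(?i) -> case insensitive
(foo) -> inner group matching foo (second item of each tuple)
(\w+)? -> optional group matching the remaining word characters (third item of each tuple)
The outer parentheses capture the whole word (first item of each tuple).
>>> print([i[0] for i in re.findall(compiled, s)])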
['Foooooo', 'Fooooooo', 'fooing', 'Fooing', 'foo', 'foo', 'Foo']
Try using this tool to test some stuff: http://www.pythonregex.com/
Use this reference: docs.python.org/howto/regex.html
I would use something like this for your regex:
\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*
Inside the non-capturing group, separate each prefix with a |. I also placed each prefix inside its own capturing group so you can tell which prefix a given string matched, for example:
for match in re.finditer(r'\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*', txt_str):
    # Find the first capturing group that took part in the match.
    n = 1
    while not match.group(n):
        n += 1
    print("Prefix %d matched '%s'" % (n, match.group(0)))
Output:
Prefix 2 matched 'foooooo'
Prefix 2 matched 'Fooooooo'
Prefix 2 matched 'fooing'
Prefix 2 matched 'Fooing'
Prefix 1 matched 'foo baring'
Prefix 1 matched 'foo Bar'
Prefix 2 matched 'Foo'
Make sure you put longer prefixes first. If you were to put the foo prefix before the foo bar prefix, you would only match 'foo' in 'foo bar'.
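One way to avoid ordering mistakes is to build the alternation from a list sorted by length. This is a sketch under assumptions: the prefixes list and the test string are made up, re.escape is used in case a prefix contains regex metacharacters, and re.IGNORECASE ignores case for the whole prefix rather than only the first letter.

```python
import re

prefixes = ['foo', 'foo bar']  # hypothetical list of prefixes

# Longest first, so 'foo bar' is tried before 'foo'.
ordered = sorted(prefixes, key=len, reverse=True)

pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(p) for p in ordered) + r')\w*',
    re.IGNORECASE,
)

found = pattern.findall('sxk foo baring sh Foo djw')
print(found)  # ['foo baring', 'Foo']
```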
I am trying to write a regular expression in Python that will match either a quoted string with spaces or an unquoted string without spaces. For example given the string term:foo the result would be foo and given the string term:"foo bar" the result would be foo bar. So far I've come up with the following regular expression:
r = re.compile(r'''term:([^ "]+)|term:"([^"]+)"''')
The problem is that the match can come in either group(1) or group(2) so I have to do something like this:
m = r.match(search_string)
term = m.group(1) or m.group(2)
Is there a way I can do this all in one step?
Avoid grouping, and instead use lookahead/lookbehind assertions to eliminate the parts that are not needed:
import re
s = 'term:foo term:"foo bar" term:bar foo term:"foo term:'
re.findall(r'(?<=term:)[^" ]+|(?<=term:")[^"]+(?=")', s)
Gives:
['foo', 'foo bar', 'bar']
It doesn't seem that you really want re.match here. Your regex is almost right, but you're grouping too much. How about this?
>>> s
('xyz term:abc 123 foo', 'foo term:"abc 123 "foo')
>>> re.findall(r'term:([^ "]+|"[^"]+")', '\n'.join(s))
['abc', '"abc 123 "']
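Note that the single-group version keeps the surrounding quotes in the result. If they are not wanted, one possible follow-up (a sketch, not the answerer's code) is to strip them afterwards; the names raw and terms are made up.

```python
import re

s = 'xyz term:abc 123 foo\nfoo term:"abc 123 "foo'

# One group that matches either an unquoted word or a quoted string.
raw = re.findall(r'term:([^ "]+|"[^"]+")', s)

# Drop the quotes from the quoted results.
terms = [t.strip('"') for t in raw]
print(terms)  # ['abc', 'abc 123 ']
```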