Python split with multiple delimiters not working - python

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These string are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one(Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
for row in helper:
print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']

re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so we don't have to write '| |.|,|(|..., there's a nice form where you can use []s to state that everything inside should be treated as "match one of these".

It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
See the IDEONE demo
The [\W_]+ regex matches 1+ characters that are not word (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).

You can try this
str = re.split('[.(_,)]+', row, flags=re.IGNORECASE)
str.pop()
print str
This will result:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']

I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can be translated to --> [\W_]
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Ideone Demo

Related

Split concatenated functions keeping the delimiters

I am trying to split strings containing python functions, so that the resulting output keeps separate functions as list elements.
s='hello()there()' should be split into ['hello()', 'there()']
To do so I use a regex lookahead to split on the closing parenthesis, but not at the end of the string.
While the lookahead seems to work, I cannot keep the ) in the resulting strings as suggested in various posts. Simply splitting with the regex discards the separator:
import re
s='hello()there()'
t=re.split("\)(?!$)", s)
This results in: 'hello(', 'there()'] .
s='hello()there()'
t=re.split("(\))(?!$)", s)
Wrapping the separator as a group results in the ) being retained as a separate element: ['hello(', ')', 'there()']
As does this approach using the filter() function:
s='hello()there()'
u = list(filter(None, re.split("(\))(?!$)", s)))
resulting again in the parenthesis as a separate element: ['hello(', ')', 'there()']
How can I split such a string so that the functions remain intact in the output?
Use re.findall()
\w+\(\) matches one or more word characters followed by an opening and a closing parenthesis> That part matches the hello() and there()
t = re.findall(r"\w+\(\)", s)
['hello()', 'there()']
Edition:
.*? is a non-greedy match, meaning it will match as few characters as possible in the parenthesis.
s = "hello(x, ...)there(6L)now()"
t = re.findall(r"\w+\(.*?\)", s)
print(t)
['hello(x, ...)', 'there(6L)', 'now()']
You can split on a lookbehind for () and negative lookahead for the end of the string.
t = re.split(r'(?<=\(\))(?!$)', s)

Ignore optional suffix without preceding delimiter

I would like to capture the first part of a word, ignoring the optional suffix. Both the suffix and preceding text are composed of the same class of characters (that is, there is no delimiter before the suffix).
My first try only captures the first letter:
m = re.search(r'([A-Za-z]+?)(?:Suff)?', 'textSuff')
m.groups()
>>> ('t',)
I want to capture "text" only, but when I make the first group element greedy, it grabs the entire string.
m = re.search(r'([A-Za-z]+)(?:Suff)?', 'textSuff')
m.groups()
>>> ('textSuff',)
Is it feasible without a different character to delimit the suffix?
If your pattern is all constructed from optional patterns, be sure you will get as few characters in return as possible. Thus, there must be at least a boundary. I guess the word boundary \b is a valid way to go here (since you need to match words):
([A-Za-z]+?)(?:Suff)?\b
See demo
IDEONE DEMO:
import re
p = re.compile(r'([A-Za-z]+?)(?:Suff)?\b')
test_str = "textSuff more words tSuff"
print(re.findall(p, test_str))
Outputs:
['text', 'more', 'words', 't']
You need to specify that after everything either the string must end or there must be a non-acceptable character....
m = re.search(r'([A-Za-z]+?)(?:Suff)?(?:[^A-Za-z]|$)'

Removing lines from a text file using python and regular expressions

I have some text files, and I want to remove all lines that begin with the asterisk (“*”).
Made-up example:
words
*remove me
words
words
*remove me
My current code fails. It follows below:
import re
program = open(program_path, "r")
program_contents = program.readlines()
program.close()
new_contents = []
pattern = r"[^*.]"
for line in program_contents:
match = re.findall(pattern, line, re.DOTALL)
if match.group(0):
new_contents.append(re.sub(pattern, "", line, re.DOTALL))
else:
new_contents.append(line)
print new_contents
This produces ['', '', '', '', '', '', '', '', '', '', '*', ''], which is no goo.
I’m very much a python novice, but I’m eager to learn. And I’ll eventually bundle this into a function (right now I’m just trying to figure it out in an ipython notebook).
Thanks for the help!
Your regular expression seems to be incorrect:
[^*.]
Means match any character that isn't a ^, * or .. When inside a bracket expression, everything after the first ^ is treated as a literal character. This means in the expression you have . is matching the . character, not a wildcard.
This is why you get "*" for lines starting with *, you're replacing every character but *! You would also keep any . present in the original string. Since the other lines do not contain * and ., all of their characters will be replaced.
If you want to match lines beginning with *:
^\*.*
What might be easier is something like this:
pat = re.compile("^[^*]")
for line in contents:
if re.search(pat, line):
new_contents.append(line)
This code just keeps any line that does not start with *.
In the pattern ^[^*], the first ^ matches the start of the string. The expression [^*] matches any character but *. So together this pattern matches any starting character of a string that isn't *.
It is a good trick to really think about when using regular expressions. Do you simply need to assert something about a string, do you need to change or remove characters in a string, do you need to match substrings?
In terms of python, you need to think about what each function is giving you and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Sometimes you might need to do something with the match.
Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of the characters, when you can just skip that line in total? There's no sense in making an empty string when you're filtering.
Most importantly: Do I really need a regex? (Here you don't!)
You don't really need a regular expression here. Since you know the size and position of your delimiter you can simply check like this:
if line[0] != "*":
This will be faster than a regex. They're very powerful tools and can be neat puzzles to figure out, but for delimiters with fixed width and position, you don't really need them. A regex is much more expensive than an approach making use of this information.
You don't want to use a [^...] negative character class; you are matching all characters except for the * or . characters now.
* is a meta character, you want to escape that to \*. The . 'match any character' syntax needs a multiplier to match more than one. Don't use re.DOTALL here; you are operating on a line-by-line basis but don't want to erase the newline.
There is no need to test first; if there is nothing to replace the original line is returned.
pattern = r"^\*.*"
for line in program_contents:
new_contents.append(re.sub(pattern, "", line))
Demo:
>>> import re
>>> program_contents = '''\
... words
... *remove me
... words
... words
... *remove me
... '''.splitlines(True)
>>> new_contents = []
>>> pattern = r"^\*.*"
>>> for line in program_contents:
... new_contents.append(re.sub(pattern, "", line))
...
>>> new_contents
['words\n', '\n', 'words\n', 'words\n', '\n']
You can do:
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))
Example:
txt='''\
words
*remove me
words
words
*remove me '''
import StringIO
f=StringIO.StringIO(txt)
import re
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))

Match single quotes from python re

How to match the following i want all the names with in the single quotes
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How to extract the name within the single quotes only
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve, this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"

How do I remove hex values in a python string with regular expressions?

I have a cell array in matlab
columns = {'MagX', 'MagY', 'MagZ', ...
'AccelerationX', 'AccelerationX', 'AccelerationX', ...
'AngularRateX', 'AngularRateX', 'AngularRateX', ...
'Temperature'}
I use these scripts which make use of matlab's hdf5write function to save the array in the hdf5 format.
I then read in the the hdf5 file into python using pytables. The cell array comes in as a numpy array of strings. I convert to a list and this is the output:
>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
'MagZ\x00\x00\x00\x00\x001',
'AccelerationX',
'AccelerationY',
'AccelerationZ',
'AngularRateX',
'AngularRateY',
'AngularRateZ',
'Temperature']
These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear on the first three items of the list and I need a nice way to deal with them or to find out why they are there in the first place.
>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
I've tried using a regular expression to remove the hex values but have little luck.
>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'
Any suggestions on how to deal with this?
You can remove all non-word characters in the following way:
>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
The regex [^\w] will match any character that is not a letter, digit, or underscore. By providing that regex in re.sub with an empty string as a replacement you will delete all other characters in the string.
Since there may be other characters you want to keep, a better solution might be to specify a larger range of characters that you want to keep that excludes control characters. For example:
>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems more clear to you.
To exclude all characters after this first control character just add a .*, like this:
>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
They're not actually in the string: you have unescaped control characters, which Python displays using the hexadecimal notation - that's why you see a unusual symbol when you print the value.
You should simply be able to remove the extra levels of quoting in your regular expression but you might also simply rely on something like the regexp module's generic whitespace class, which will match whitespace characters other than tabs and spaces:
>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar
I use this one a bit to replace all input whitespace runs, including non-breaking space characters, with a single space:
>>> re.sub(r'[\xa0\s]+', ' ', input_str)
You can also do this without importing re. E.g. if you're content to keep only ascii characters:
good_string = ''.join(c if ord(c) < 129 else '?' for c in bad_string)

Categories