python regex finditer

I have a question about re. I tried to look for the answer in the re documentation, but I think I am too much of a newbie for this.
I have a string like this:
string = "id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2"
I want to retrieve all results after '=', so I used
re.finditer("=[\w]*", string)
My result was as follows:
186
0
empty space <-- there should be [cspacer0]---BlaBla---
2
How should my pattern look to get the channel_name as well?

The \w token only matches word characters; to allow the other characters (brackets, hyphens, and so on) I would use \S (any non-whitespace character) instead. Also, instead of finditer you can use findall for this task:
>>> import re
>>> s = 'id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'=(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
EDIT
The original string looks like this; I want to get everything after =, skipping =ok and idx=0:
>>> s = 'error idx=0 msg=ok id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'(?<!idx)=(?!ok)(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']

Related

Python : Splitting a string by numbers, letters and -_

Let's say that I have a string like this one
string = 'rename_file_1122--23-_12'
Is there a way to split it like this:
parts = ['rename','_','file','_','1122','--','23','-_','12']
I tried with a regular expression, but it does not work:
import re
name_parts = re.findall('\d+|\D+|\w+|\W+', string)
The result was:
['rename_file_', '1122', '--', '23', '-_', '12']
########## Second part
If I have a string like this one :
string2 = 'Hello_-Marco5__-'
What conditions do I need to use to get ['Hello','_-','Marco','5','__-']? My goal is to split a string by groups of letters, digits, and '-_'.
Thanks for your answers.
You can use
re.findall(r'[^\W_]+|[\W_]+', string)
See the regex demo.
Regex details:
[^\W_]+ - one or more chars other than non-word and _ chars (so, one or more letters or digits)
| - or
[\W_]+ - one or more non-word and/or _ chars.
See a Python demo:
import re
string = 'rename_file_1122--23-_12'
name_parts = re.findall(r'[^\W_]+|[\W_]+', string)
print(name_parts)
# => ['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']
Alternatively you could use groupby from itertools:
from itertools import groupby
string = 'rename_file_1122--23-_12'
result = [''.join(value) for key, value in groupby(string, key=str.isalnum)]
print(result)
Output:
['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']
edit:
I came up with a perhaps simpler solution, using regular expressions:
string = 'rename_file_1122--23-_12'
result = re.split('([_-]+)', string)
print(result)
Same output.
re.split will split the string based upon matching the regular expression. The expression I've used includes a grouping pattern, and split includes the match in the result:
([_-]+)
Means match (and remember the result of) a sequence of one or more of _ or -. + means one or more, and [] means any of whatever's inside the square brackets.
Without the group, just using [_-]+ we'd get the following, without the matches:
string = 'rename_file_1122--23-_12'
result = re.split('[_-]+', string)
print(result)
Output:
['rename', 'file', '1122', '23', '12']
I have found the solution for the second part; it is the following:
name_parts = re.findall(r'[^\d_]+|[^\D]+|[^\W_]+|[\W_]+', string)
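A quick check of that pattern against the second example string (my own sketch, not from the original post):
import re
string2 = 'Hello_-Marco5__-'
# letters, digits, and runs of '-'/'_' come out as separate groups
print(re.findall(r'[^\d_]+|[^\D]+|[^\W_]+|[\W_]+', string2))
# => ['Hello', '_-', 'Marco', '5', '__-']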

Extract Only Digits from Dollar Figures

What I'm trying to do is extract only the digits from dollar figures.
Format of Input
...
$1,289,868
$62,000
$421
...
Desired Output
...
1289868
62000
421
...
The regular expression that I was using to extract only the digits and commas is:
r'\d+(,\d+){0,}'
which of course outputs...
...
1,289,868
62,000
421
...
What I'd like to do is convert the output to an integer (int(...)), but obviously this won't work with the commas. I'm sure I could figure this out on my own, but I'm running really short on time right now.
I know I can simply use r'\d+', but this obviously separates each chunk into separate matches...
You can't match discontinuous text within a single match operation; there is no regex you can pass to re.findall against 1,345,456 that will return 1345456 directly. You need to first match the strings you want, and then post-process them in code.
A regex you may use to extract the numbers themselves:
re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)
See this regex demo.
Alternatively, you may use a bit more general regex to be used with re.findall:
r'\$(\d+(?:,\d+)*)'
See this regex demo.
Note that re.findall will only return the captured part of the string (the one matched with the (...) part in the regex).
Details
\$ - a dollar sign
(\d{1,3}(?:,\d{3})*) - Capturing group 1:
\d{1,3} - 1 to 3 digits (if \d+ is used, 1 or more digits)
(?:,\d{3})* - 0 or more sequences of
, - a comma
\d{3} - 3 digits (or if \d+ is used, 1 or more digits).
Python code sample (with the commas removed):
import re
s = """$1,289,868
$62,000
$421"""
result = [x.replace(",", "") for x in re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)]
print(result) # => ['1289868', '62000', '421']
Using re.sub
Ex:
import re
s = """$1,289,868
$62,000
$421"""
print([int(i) for i in re.sub(r'[^0-9\s]', "", s).splitlines()])
Output:
[1289868, 62000, 421]
You don't need regex for this.
int(''.join(filter(str.isdigit, "$1,000,000")))
works just fine.
If you did want to use regex for some reason:
int(''.join(re.findall(r"\d", "$1,000,000")))
If you know how to extract the numbers with comma groupings, the easiest thing to do is just transform that into something int can handle:
for match in matches:
    i = int(match.replace(',', ''))
For example, if match is '1,289,868', then match.replace(',', '') is '1289868', and obviously int(<that>) is 1289868.
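Putting that together with the earlier findall pattern, a minimal end-to-end sketch (assuming the same three-line sample string):
import re
s = """$1,289,868
$62,000
$421"""
matches = re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)
# strip the thousands separators before converting
numbers = [int(m.replace(',', '')) for m in matches]
print(numbers)  # => [1289868, 62000, 421]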
You don't need regex for this. Plain string operations should be enough:
>>> string = '$1,289,868\n$62,000\n$421'
>>> [w.lstrip('$').replace(',', '') for w in string.splitlines()]
['1289868', '62000', '421']
Or alternatively, you can use locale.atoi to convert a string of digits with commas to an int:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
>>> list(map(lambda x: locale.atoi(x.lstrip('$')), string.splitlines()))
[1289868, 62000, 421]

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
for row in helper:
print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, there's a nice form so we don't have to write '| |.|,|(|...: you can use []s to state that everything inside should be treated as "match one of these".
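To illustrate (my own quick check, not part of the original answer), the corrected character-class version splits on each of those single characters, though it leaves empty strings where delimiters sit next to each other:
import re
row = 'feature.append(freq_and_feature(text, freq))'
print(re.split('[\' .,()_]', row))
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', '', 'freq', '', '']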
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
See the IDEONE demo
The [\W_]+ regex matches 1+ characters that are not word (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
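A small sketch of that variant (my own illustration), trimming the leading/trailing non-word characters first so no empty strings appear and the if x filter is unnecessary:
import re
s = 'feature.append(freq_and_feature(text, freq))'
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)  # drops the trailing '))'
print(re.split(r'[\W_]+', trimmed))
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']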
You can try this:
import re
row = 'feature.append(freq_and_feature(text, freq))'
parts = re.split('[.(_,)]+', row)
parts.pop()  # drop the trailing empty string
print parts
This will result in:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can be translated to --> [\W_]
Python Code
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
import re
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Ideone Demo

Removing lines from a text file using python and regular expressions

I have some text files, and I want to remove all lines that begin with the asterisk (“*”).
Made-up example:
words
*remove me
words
words
*remove me
My current code fails. It follows below:
import re
program = open(program_path, "r")
program_contents = program.readlines()
program.close()
new_contents = []
pattern = r"[^*.]"
for line in program_contents:
    match = re.findall(pattern, line, re.DOTALL)
    if match.group(0):
        new_contents.append(re.sub(pattern, "", line, re.DOTALL))
    else:
        new_contents.append(line)
print new_contents
This produces ['', '', '', '', '', '', '', '', '', '', '*', ''], which is no good.
I’m very much a python novice, but I’m eager to learn. And I’ll eventually bundle this into a function (right now I’m just trying to figure it out in an ipython notebook).
Thanks for the help!
Your regular expression seems to be incorrect:
[^*.]
Means match any character that is not * or . (the leading ^ inside the brackets negates the set). Inside a bracket expression, metacharacters like . are treated as literal characters, so the . in your pattern matches a literal dot, not a wildcard.
This is why you get "*" for lines starting with *: you're replacing every character except * and .. You would also keep any . present in the original string. Since the other lines contain neither * nor ., all of their characters are replaced.
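To see the effect (my own quick check, not from the original answer):
import re
print(repr(re.sub(r'[^*.]', '', 'words')))       # => ''
print(repr(re.sub(r'[^*.]', '', '*remove me')))  # => '*'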
If you want to match lines beginning with *:
^\*.*
What might be easier is something like this:
pat = re.compile("^[^*]")
for line in contents:
    if re.search(pat, line):
        new_contents.append(line)
This code just keeps any line that does not start with *.
In the pattern ^[^*], the first ^ matches the start of the string. The expression [^*] matches any character but *. So together this pattern matches any starting character of a string that isn't *.
It is a good trick to really think about when using regular expressions. Do you simply need to assert something about a string, do you need to change or remove characters in a string, do you need to match substrings?
In terms of python, you need to think about what each function is giving you and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Sometimes you might need to do something with the match.
Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of the characters, when you can just skip that line in total? There's no sense in making an empty string when you're filtering.
Most importantly: Do I really need a regex? (Here you don't!)
You don't really need a regular expression here. Since you know the size and position of your delimiter you can simply check like this:
if line[0] != "*":
This will be faster than a regex. They're very powerful tools and can be neat puzzles to figure out, but for delimiters with fixed width and position, you don't really need them. A regex is much more expensive than an approach making use of this information.
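As a complete sketch of that approach (my own illustration; it assumes program_contents from the question and uses startswith so empty lines are handled safely):
# keep every line that does not start with '*'
new_contents = [line for line in program_contents if not line.startswith("*")]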
You don't want to use a [^...] negative character class; you are matching all characters except for the * or . characters now.
* is a meta character, you want to escape that to \*. The . 'match any character' syntax needs a multiplier to match more than one. Don't use re.DOTALL here; you are operating on a line-by-line basis but don't want to erase the newline.
There is no need to test first; if there is nothing to replace the original line is returned.
pattern = r"^\*.*"
for line in program_contents:
    new_contents.append(re.sub(pattern, "", line))
Demo:
>>> import re
>>> program_contents = '''\
... words
... *remove me
... words
... words
... *remove me
... '''.splitlines(True)
>>> new_contents = []
>>> pattern = r"^\*.*"
>>> for line in program_contents:
...     new_contents.append(re.sub(pattern, "", line))
...
>>> new_contents
['words\n', '\n', 'words\n', 'words\n', '\n']
You can do:
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))
Example:
txt='''\
words
*remove me
words
words
*remove me '''
import StringIO
f=StringIO.StringIO(txt)
import re
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))

extracting multiple instances regex python

I have a string:
This is #lame
Here I want to extract lame. But here is the issue: the above string can be
This is lame
Here I don't extract anything. And then this string can be:
This is #lame but that is #not
Here I extract lame and not.
So, the output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters are preceded by a # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
You should use the re lib; here is an example:
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#(\w+)")   # capture just the word after '#'
lst = regular.findall(test_case)  # ['lame', 'not']
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print regex.findall('This is #lame')
print regex.findall('This is lame')
print regex.findall('This is #lame but that is #not')
