I have a problem testing whether "\s" sequences are present in a string. For example, '\sgoogle\s.com' should report that they are.
# use raw strings to ignore escapes:
s = r'\sgoogle\s'
print s, s.find(r'\s') != -1
# and with regex:
import re
print re.search(r'\\s', s)
Gives:
\sgoogle\s True
<_sre.SRE_Match object at 0x7f2d696fc850>
You have to escape your \ as follows:
s = '\sgoogle\s.com'
re.search(r'(\\s)', s)
Demo: http://regex101.com/r/tF9oW6
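As a minimal sketch (written for Python 3, with variable names that are only illustrative), the plain `in` operator is enough to test for a literal backslash followed by s, as long as the needle is a raw string. The regex variant needs the backslash escaped, otherwise \s means "any whitespace character":

```python
import re

# Raw string: the backslashes are kept as literal characters.
s = r'\sgoogle\s.com'

# Substring test: no regex needed for a fixed two-character needle.
has_backslash_s = r'\s' in s

# Regex test: the backslash itself must be escaped inside the pattern.
literal_match = re.search(r'\\s', s)

# Unescaped, \s looks for whitespace, and this string has none.
whitespace_match = re.search(r'\s', s)

print(has_backslash_s)            # True
print(literal_match is not None)  # True
print(whitespace_match)           # None
```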
I am having a hard time doing Data Analysis on a large text that has lots of non-alphabetical chars. I tried using
string = filter(str.isalnum, string)
but I also have "#" in my text that I want to keep. How do I make an exception for a character like "#" ?
It is easier to use regular expressions:
string = re.sub("[^A-Za-z0-9#]", "", string)
You can use re.sub
re.sub(r'[^\w\s\d#]', '', string)
Example:
>>> re.sub(r'[^\w\s\d#]', '', 'This is # string 123 *$^%')
'This is # string 123 '
One way to do this would be to create a function that returns True if an input character is valid and False otherwise.
import string
valid_characters = string.ascii_letters + string.digits + '#'
def is_valid_character(character):
    return character in valid_characters
# Instead of using `filter`, we `join` all characters in the input string
# if `is_valid_character` is `True`.
def get_valid_characters(string):
    return "".join(char for char in string if is_valid_character(char))
Some example output:
>>> print(valid_characters)
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#
>>> get_valid_characters("!Hello_#world?")
'Hello#world'
>>> get_valid_characters("user#example")
'user#example'
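A simpler way to write it would be using regex. Note that \w also matches the underscore (and already includes digits), so an explicit character class is needed to accomplish the same thing:
import re
def get_valid_characters(string):
    return re.sub(r"[^A-Za-z0-9#]", "", string)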
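The difference between \w and an explicit ASCII class is easy to check. A small sketch (the function names are made up for illustration, and the behaviour described is for Python 3, where \w also matches non-ASCII letters by default):

```python
import re

def keep_word_chars(text):
    # \w covers letters, digits and the underscore, so \d adds nothing.
    return re.sub(r"[^\w#]", "", text)

def keep_ascii_alnum(text):
    # Explicit class: only ASCII letters, ASCII digits and '#'.
    return re.sub(r"[^A-Za-z0-9#]", "", text)

print(keep_word_chars("!Hello_#world?"))   # Hello_#world
print(keep_ascii_alnum("!Hello_#world?"))  # Hello#world
```

Which one is right depends on whether underscores and accented letters count as noise in your data.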
You could use a lambda function to specify your allowed characters. But also note that in Python 3 filter returns a <filter object>, which is an iterator over the returned values, so you will have to stitch it back into a string:
string = "?filter_#->me3!"
extra_chars = "#!"
filtered_object = filter(lambda c: c.isalnum() or c in extra_chars, string)
string = "".join(filtered_object)
print(string)
Gives:
filter#me3!
After we found the answer to this question, we ran into the following unusual replacement behavior:
Our regex is:
[\\((\\[{【]+(\\w+|\\s+|\\S+|\\W+)?[)\\)\\]}】]+
We are trying to match all content inside any type of brackets, including the brackets themselves.
The original text is:
物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))
The result is:
物�研真题详解
The code for the replacement is:
delimiter = ' '
if localization == 'CN':
    delimiter = ''
p = re.compile(codecs.encode(unicode(regex), "utf-8"), flags=re.I)
columnString = p.sub(delimiter, columnString).strip()
Why does the � (\ufffd) character appear, and how can we fix this behavior?
We ran into the same problem with this regex:
(\\d*[满|元])
print repr(columnString)='\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'
print repr(regex)=u'[\\(\uff08\\[{\u3010]+(\\w+|\\s+|\\S+|\\W+)?[\uff09\\)\\]}\u3011]+'
print repr(p.pattern)='[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'
You should not mix UTF-8 and regular expressions. Process all your text as Unicode. Make sure you decode both the regex and the input string to unicode values first:
>>> import re
>>> columnString = '\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'
>>> regex = '[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'
>>> utf8_compiled = re.compile(regex, flags=re.I)
>>> utf8_compiled.sub('', columnString)
'\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4'
>>> print utf8_compiled.sub('', columnString).decode('utf8', 'replace')
当代骨�
>>> unicode_compiled = re.compile(regex.decode('utf8'), flags=re.I | re.U)
>>> unicode_compiled.sub('', columnString.decode('utf8'))
u'\u5f53\u4ee3\u9aa8\u4f24\u79d1\u5999\u65b9'
>>> print unicode_compiled.sub('', columnString.decode('utf8'))
当代骨伤科妙方
>>> print unicode_compiled.sub('', u'物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))')
物理化学名校考研真题详解
When you use UTF-8, your pattern consists of separate bytes for the 【 codepoint:
>>> '【'
'\xe3\x80\x90'
which means your character class matches any of those bytes; \xe3, or \x80 or \x90 are each separately valid bytes in that character class.
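This is easy to reproduce in Python 3, where bytes patterns and text patterns are kept strictly apart. A minimal sketch, assuming a two-character sample string chosen only because 理 happens to contain the byte \x90 in its UTF-8 encoding:

```python
import re

text = "理【"

# In UTF-8 the bracket is three separate bytes: e3 80 90.
bracket_bytes = "【".encode("utf-8")

# Bytes pattern: the class contains three independent bytes, so any
# occurrence of \xe3, \x80 or \x90 is removed, even in the middle of
# another character (理 is e7 90 86).
byte_pattern = re.compile(b"[" + bracket_bytes + b"]")
damaged = byte_pattern.sub(b"", text.encode("utf-8"))

# Text pattern: the class contains the single code point U+3010.
text_pattern = re.compile("[【]")
clean = text_pattern.sub("", text)

print(bracket_bytes)                       # b'\xe3\x80\x90'
print(damaged)                             # b'\xe7\x86'
print(damaged.decode("utf-8", "replace"))  # contains U+FFFD
print(clean)                               # 理
```

The leftover bytes are no longer valid UTF-8, which is where the replacement character comes from when the result is decoded or printed.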
Decode your string first, and you can get rid of that � (\ufffd) character.
In [1]: import re
...: subject = '物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))'.decode('utf-8')
...: reobj = re.compile(r"[\((\[{【]+(\w+|\s+|\S+|\W+)?[)\)\]}】]+", re.IGNORECASE | re.MULTILINE)
...: result = reobj.sub("", subject)
...: print result
...:
物理化学名校考研真题详解
I have a script where I need to replace some characters that could cause trouble with other characters.
I would like to reduce the number of operations required:
# Replace % at end of string
find_char = re.match( r'.+\%[a-zA-Z0-9]+', line)
if find_char:
    line=re.sub(r'\%','PCT',line)
Here I want to replace %, but only if it is present at the end of a string. Can I do this in a single operation with re.sub?
Yes, of course, just specify that the match should be at the end of the string, using $:
>>> import re
>>> re.sub("%$", "o", "fo%")
'foo'
>>> re.sub("%$", "o", "f%o")
'f%o'
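One detail to be aware of when the lines come from a file: in Python's re module, $ also matches just before a trailing newline, while \Z matches only at the very end of the string. A short sketch (the helper name is hypothetical):

```python
import re

def replace_trailing_percent(line):
    # $ matches at the end of the string or just before a final newline.
    return re.sub(r"%$", "PCT", line)

print(replace_trailing_percent("fo%"))          # foPCT
print(replace_trailing_percent("f%o"))          # f%o
print(repr(replace_trailing_percent("fo%\n")))  # 'foPCT\n'

# \Z is stricter: the newline after % prevents a match.
print(repr(re.sub(r"%\Z", "PCT", "fo%\n")))     # 'fo%\n'
```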
find_char = re.match( r'.+\%[a-zA-Z0-9]+', line)
if find_char:
    line=re.sub(r'\%$','PCT',line)
Use $ to match a character at the end of the string.
I think you mean this. It replaces a % symbol that follows a non-space character and is followed by a space or the end of the string with PCT:
>>> import re
>>> m = re.sub(r'(?<=\S)%(?= |$)', r'PCT', 'foo%bar foo% bar%')
>>> m
'foo%bar fooPCT barPCT'
If you also want to replace a single % symbol that is preceded by a space and followed by a space, then try this:
>>> m = re.sub(r'(?<=[\S\s])%(?= |$)', r'PCT', 'foo%bar % foo% bar%')
>>> m
'foo%bar PCT fooPCT barPCT'
OR
>>> import regex
>>> m = regex.sub(r'(?<=^|[\S\s])%(?= |$)', r'PCT', '% foo%bar % foo% bar%')
>>> m
'PCT foo%bar PCT fooPCT barPCT'
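Since the lookbehind in the last pattern accepts either the start of the string or any character at all, it does not actually restrict anything. A sketch, assuming that observation holds for your input, that produces the same output with the standard re module and no lookbehind:

```python
import re

text = '% foo%bar % foo% bar%'

# Replace % only when it is followed by a space or the end of the string.
result = re.sub(r'%(?= |$)', 'PCT', text)
print(result)  # PCT foo%bar PCT fooPCT barPCT
```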
I need to dump some HTTP data as a string from an HTTP packet that I have in string format. I am trying to use the regular expression below to match 'data:' and everything after it, but it is not working. I am new to regex and Python.
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.match(pat, string)
>>> print res
None
re.match matches only at the beginning of the string. Use re.search to match at any position. (See search() vs. match())
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> res
<_sre.SRE_Match object at 0x0000000002838100>
>>> res.group()
'p'
To match everything, you need to replace \w with .*. Also remove the /b parts: a word boundary is written \b, not /b, and it is not needed here.
>>> import re
>>> pat = re.compile(r'(?:data:).*$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> print res.group()
data: ndknfdjoj pop
No need for a regular expression here. You can just slice the string:
>>> string
' dnfhndkn data: ndknfdjoj pop'
>>> string.index('data')
10
>>> string[string.index('data'):]
'data: ndknfdjoj pop'
str.index('data') returns the point in the string where the substring data is found. The slice from this position to the end string[10:] gives you the part of the string you are interested in.
By the way, string is a potentially problematic variable name if you are planning on using the string module at any point...
You can just do:
string.split("data:")[1]
assuming "data:" appears only once in each string. Note that this returns only the text after "data:", not the marker itself.
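If the marker might be missing, split(...)[1] raises IndexError and index() raises ValueError. A minimal sketch using str.partition, which never raises (the helper name and the choice to return None are my own assumptions, not part of the question):

```python
def data_section(packet, marker="data:"):
    # partition returns (before, separator, after); the separator is
    # an empty string when the marker is not found.
    _before, found, after = packet.partition(marker)
    if not found:
        return None
    return found + after

print(data_section(" dnfhndkn data: ndknfdjoj pop"))  # data: ndknfdjoj pop
print(data_section("no marker here"))                 # None
```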
In Perl it is possible to do something like this (I hope the syntax is right...):
$string =~ m/lalala(I want this part)lalala/;
$whatIWant = $1;
I want to do the same in Python and get the text inside the parentheses in a string like $1.
If you want to get parts by name you can also do this:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
>>> m.groupdict()
{'first_name': 'Malcom', 'last_name': 'Reynolds'}
The example was taken from the re docs
See: Python regex match objects
>>> import re
>>> p = re.compile("lalala(I want this part)lalala")
>>> p.match("lalalaI want this partlalala").group(1)
'I want this part'
import re
astr = 'lalalabeeplalala'
match = re.search('lalala(.*)lalala', astr)
whatIWant = match.group(1) if match else None
print(whatIWant)
A small note: in Perl, when you write
$string =~ m/lalala(.*)lalala/;
the regexp can match anywhere in the string. The equivalent is accomplished with the re.search() function, not the re.match() function, which requires that the pattern match starting at the beginning of the string.
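A short sketch of the difference, using a made-up input string that has extra text before the pattern:

```python
import re

text = "prefix lalalabeeplalala suffix"
pattern = r"lalala(.*)lalala"

anchored = re.match(pattern, text)   # must match at position 0
anywhere = re.search(pattern, text)  # scans the whole string

print(anchored)           # None
print(anywhere.group(1))  # beep
```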
import re
data = "some input data"
m = re.search("some (input) data", data)
if m: # "if match was successful" / "if matched"
    print m.group(1)
Check the docs for more.
There's no need for regex. Think simple.
>>> "lalala(I want this part)lalala".split("lalala")
['', '(I want this part)', '']
>>> "lalala(I want this part)lalala".split("lalala")[1]
'(I want this part)'
import re
match = re.match('lalala(I want this part)lalala', 'lalalaI want this partlalala')
print match.group(1)
import re
string_to_check = "other_text...lalalaI want this partlalala...other_text"
p = re.compile("lalala(I want this part)lalala") # regex pattern
m = p.search(string_to_check) # use p.match if what you want is always at beginning of string
if m:
    print m.group(1)
While converting a Perl program that parses function names out of modules to Python, I ran into this problem: I received an error saying that "group" was undefined. I soon realized that the exception was thrown because p.match / p.search returns None if there is no matching string.
Thus, the group method cannot be called on the result. So, to avoid an exception, check that a match was returned and only then call group.
import re
filename = './file_to_parse.py'
p = re.compile(r'def (\w*)') # \w* greedily matches [a-zA-Z0-9_] character set
for each_line in open(filename,'r'):
    m = p.match(each_line) # tries to match regex rule in p
    if m:
        m = m.group(1)
        print m
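The same idea as a self-contained sketch that works on a list of lines instead of a file (the sample source and the function name are invented for illustration). It allows leading whitespace so that indented methods are found too, which is a change from the loop above:

```python
import re

# Optional indentation, then "def", then the function name.
DEF_PATTERN = re.compile(r'^\s*def\s+(\w+)')

def function_names(lines):
    names = []
    for line in lines:
        m = DEF_PATTERN.match(line)
        # match() returns None when the line is not a definition,
        # so check before calling group().
        if m is not None:
            names.append(m.group(1))
    return names

sample = [
    "import os",
    "def top_level():",
    "    pass",
    "class Thing:",
    "    def method(self):",
    "        return 1",
]
print(function_names(sample))  # ['top_level', 'method']
```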