Using unicode char code in regular expression - python

Simplifying my task, let's say I want to find any words written in Hebrew on a web page.
So I know that Hebrew char codes are U+05D0 to U+05EA.
I want to write something like:
expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"
website_handle = urllib2.urlopen(url)
website_text = website_handle.read()
matches = sre.findall(expr, website_text)
for item in matches:
    print item
The output I would expect is:
עברית
But instead the output is a lot of Chinese/Japanese characters.

You can use the standard Python representation of Unicode characters inside a character class:
re.findall(u"[\u05D0-\u05EA]", website_text, re.U)

The expression should be:
expr = u"[\u05D0-\u05EA]+"
Notice the 'u' at the beginning.
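For readers on Python 3, where every str literal is already Unicode and the u prefix is optional, the same idea can be sketched roughly as follows. The sample text and the helper name find_hebrew_words are made up for illustration. The sketch assumes the page bytes are UTF-8 and are decoded into a str before matching.

```python
import re

# U+05D0 (alef) through U+05EA (tav) covers the Hebrew letters.
HEBREW_WORD = re.compile("[\u05D0-\u05EA]+")

def find_hebrew_words(text):
    """Return every run of Hebrew letters found in text (a decoded str)."""
    return HEBREW_WORD.findall(text)

# Hypothetical sample: bytes as they might arrive from a web page.
raw = "English \u05e2\u05d1\u05e8\u05d9\u05ea more text \u05e9\u05dc\u05d5\u05dd".encode("utf-8")

# Decode first, then match: the pattern is written in terms of code points.
words = find_hebrew_words(raw.decode("utf-8"))
print(words)
```

Matching against the undecoded bytes would compare the pattern with individual UTF-8 bytes rather than characters, so decoding first is the important step here.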

Related

using variables inside regex patterns in Python

I'm trying to preprocess a text file that is in Persian, but the problem is that for digits, sometimes they used Arabic digits instead of Persian ones. I want to fix this using regex. Here is my snippet of code:
def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = rf"\u066{d}"
        persian_digit = rf"\u06F{d}"
        content = re.sub(arabic_digit, persian_digit, content)
    return(content)
but it gives this error message:
error: bad escape \u at position 0
I wonder how I should use variables inside regex patterns. The weird thing is that the problem is with the second pattern (persian_digit): when I change it to a static string, there are no errors. Thanks for your time.
chr() is the way to generate Unicode code points:
def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = chr(0x660 + d)
        persian_digit = chr(0x6f0 + d)
        content = re.sub(arabic_digit, persian_digit, content)
    return content
However, str has a built-in .translate method for making mass substitutions that is much more efficient. Build a translation table from a string of characters to replace and a same-length string of new characters:
arabic_digits = ''.join([chr(i) for i in range(0x660,0x66a)])
persian_digits = ''.join([chr(i) for i in range(0x6f0,0x6fa)])
print('Arabic: ',arabic_digits)
print('Persian:',persian_digits)
# compute the translation table once
_xlat = str.maketrans(arabic_digits,persian_digits)
def preprocessing(content):
    return content.translate(_xlat)
test = '4\u06645\u06656\u0666'
print('before:',test)
print('after: ',preprocessing(test))
Output:
Arabic: ٠١٢٣٤٥٦٧٨٩
Persian: ۰۱۲۳۴۵۶۷۸۹
before: 4٤5٥6٦
after: 4۴5۵6۶
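According to the re documentation, unknown escapes consisting of '\' and an ASCII letter are errors in re.sub() (in the pattern since Python 3.6 and in the replacement string since Python 3.7). That is the error you came across: \u is not a recognized escape in the replacement string.
What you can do is turn the raw string back into a "normal" string like this, although I am not sure it is the best practice: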
import codecs
import re
def preprocessing(content):
    for d in range(10):
        arabic_digit = codecs.decode(rf"\u066{d}", 'unicode_escape')
        persian_digit = codecs.decode(rf"\u06F{d}", 'unicode_escape')
        content = re.sub(arabic_digit, persian_digit, content)
    return content
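To see why only the second argument triggers the error, here is a small sketch, assuming Python 3.7 or later. The variable names are illustrative only. It shows that a \uXXXX escape is accepted in the pattern but rejected in the replacement string.

```python
import re

arabic_zero = "\u0660"   # ARABIC-INDIC DIGIT ZERO
persian_zero = "\u06F0"  # EXTENDED ARABIC-INDIC DIGIT ZERO

# A \uXXXX escape inside the *pattern* is understood by the re module.
ok = re.sub(r"\u0660", persian_zero, arabic_zero)

# The same kind of escape inside the *replacement* template is rejected.
try:
    re.sub(arabic_zero, r"\u06F0", arabic_zero)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(ok == persian_zero)
print(error_message)
```

Passing real characters (from chr() or a normal string literal) as the replacement avoids the problem, because no backslash ever reaches the replacement template.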

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind. It checks the text just before the main string, so only strings preceded by name=' are returned. The \ in front of = and ' escapes them; neither character is special in a regex, so in this double-quoted raw string the backslashes are harmless but not strictly required. name=' itself is not part of the result; we just know that every result is preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period (inside [] a plain . would also be literal). Putting these in [] means we accept any one of the listed characters: an alphanumeric character, _, or a period. The + afterwards means at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to take the smallest possible match that meets these specifications, since + could otherwise go on for as many characters as [\w\.] matches.
(?=\'): a lookahead. It checks the text just after the main string, so only strings followed by ' are returned. This final ' is not returned with our results; we just know that in the original string it followed any result we do end up getting.
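Putting the three patterns together, a self-contained sketch might look like this. It assumes the whole input is one multi-line string. The unnecessary backslashes are dropped, and the name version_name is only a restyled versionName.

```python
import re

input_string = """package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'"""

# Lookbehind (?<=...) and lookahead (?=...) anchor the match without
# including the surrounding name=' and ' in the result.
name = re.search(r"(?<=name=')[\w.]+?(?=')", input_string).group(0)
version_name = re.search(r"(?<=versionName=')\d+?\.\d+?(?=')", input_string).group(0)

# findall returns every permission suffix as a list.
permissions = re.findall(r"(?<=android\.permission\.)[A-Z_]+(?=')", input_string)

print(name, version_name, permissions)
```

Note that the lowercase name=' lookbehind does not match inside versionName=' because of the capital N.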
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is an example:
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
    if line.startswith("package"):
        words = line.split()
        # words[1] is name='...' and words[3] is versionName='...'
        string1 = words[1].split("=")[1].replace("'","")
        string2 = words[3].split("=")[1].replace("'","")
The test.txt file contains the input data you mentioned earlier.

regex in python 2.4

I have a string in python as below:
"\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
I want to get the string as
"B1xxA1xxMdl1zzInoAEROzzMofIN"
I think this can be done using regex but could not achieve it yet. Please give me an idea.
import re
st = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\","",st)
idx = s.rindex("B1")
print s[idx:]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'
OR
st = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
idx = st.rindex("\\")
print st[idx+1:]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'
Here is a try:
import re
s = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\[^\\]+\\","", s)
print s
Tested on http://py-ide-online.appspot.com (couldn't find a way to share though)
[EDIT] For some explanation, have a look at the Python regex documentation page and the first comment of this SO question:
How to remove symbols from a string with Python?
because using brackets [] can be tricky (IMHO)
In this case, [^\\] means any single character except a backslash: inside the raw string, \\ is the regex escape for one literal backslash, not two.
So [^\\]+ means one or more characters, none of which is a backslash.
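A short Python 3 sketch makes the behaviour of [^\\] visible. The variable names are illustrative, and the string is the one from the question.

```python
import re

# The value contains two real backslashes (each written as \\ in the literal).
s = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"

# [^\\]+ matches runs of characters that are not a backslash.
non_backslash_runs = re.findall(r"[^\\]+", s)

# Removing "\<non-backslashes>\" leaves only the text after the last backslash.
cleaned = re.sub(r"\\[^\\]+\\", "", s)

print(non_backslash_runs)
print(cleaned)
```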
If the desired section of the string is always to the right of the last \ character, then you could use:
string = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
string.rpartition("\\")[2]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'

How can I grab all terms beginning with '#'?

I have a string like so: "sometext #Syrup #nshit #thebluntislit"
and I want to get a list of all terms starting with '#'.
I used the following code:
import re
line = "blahblahblah #Syrup #nshit #thebluntislit"
ht = re.search(r'#\w*', line)
ht = ht.group(0)
print ht
and I get the following:
#Syrup
I was wondering if there is a way that I could instead get a list like:
[#Syrup,#nshit,#thebluntislit]
for all terms starting with '#' instead of just the first term.
A regular expression is not needed with a good programming language like Python:
hashed = [ word for word in line.split() if word.startswith("#") ]
You can use
compiled = re.compile(r'#\w*')
compiled.findall(line)
Output:
['#Syrup', '#nshit', '#thebluntislit']
But there is a problem. If you search a string like 'blahblahblah #Syrup #nshit #thebluntislit beg#end', the output will be ['#Syrup', '#nshit', '#thebluntislit', '#end'].
This problem may be addressed by using positive lookbehind:
compiled = re.compile(r'(?<=\s)#\w*')
(\b (word boundary) cannot be used here: # is not one of the \w characters [0-9a-zA-Z_], so a \b in front of # would require a word character immediately before it, which is the opposite of what we want.)
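One caveat worth sketching: (?<=\s) requires a whitespace character before #, so a hashtag at the very start of the string is skipped. The variation below uses a negative lookbehind, (?<!\S), instead. That pattern is not part of the answer above, and the sample line is made up for illustration.

```python
import re

line = "#first blahblahblah #Syrup #nshit #thebluntislit beg#end"

# Pattern from the answer: needs whitespace right before '#'.
with_positive = re.findall(r"(?<=\s)#\w*", line)

# Variation: '#' must not be preceded by a non-whitespace character,
# which also allows a match at the very start of the string.
with_negative = re.findall(r"(?<!\S)#\w+", line)

print(with_positive)
print(with_negative)
```

Both patterns still reject the #end in beg#end. Only the second one picks up #first.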
Looks like re.findall() will do what you want.
matches = re.findall(r'#\w*', line)

python regex for repeating string

I want to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations, but I can only get either the first or the last code, not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, via re.findall('c?[0-9]+', text).
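The two-step approach described above could be sketched like this. The function name extract_codes and the exact outer pattern are assumptions made for the example, not part of the original answer.

```python
import re

string = "start: c12354, c3456, 34526; other stuff that I don't care about"

def extract_codes(text):
    """Return the list of codes, or None if text does not look like 'start: ...;'."""
    # Step 1: verify the overall shape and isolate the part between 'start:' and ';'.
    outer = re.match(r"start:([^;]*);", text)
    if outer is None:
        return None
    # Step 2: collect every code from the isolated part.
    return re.findall(r"c?[0-9]+", outer.group(1))

codes = extract_codes(string)
print(codes)
```

Splitting validation from extraction sidesteps the fact that a repeated group keeps only its last capture.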
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns True if it starts with this string
s.endswith(";") # returns True if it ends with this string
s[6:-1].strip().split(', ') # drops the leading space, then gives a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    # group(1) is the text captured between "start:" and ";"
    res = re.findall(slst, match.group(1))
results in
['12354', '3456', '34526']
