python-re.sub() and unicode - python

I want to replace all emoji with '' but my regex doesn't work. For example,
content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'
and I want to replace everything of the form \U0001f633 with '', so I wrote this code:
print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)
But it doesn't work.
Thanks a lot.

You won't be able to recognize properly decoded Unicode code points that way (as strings containing \uXXXX, etc.). Once the string is properly decoded, by the time the regex engine gets to it, each one is an actual character, not a backslash escape.
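A quick way to see the difference, sketched for Python 3 (where every str literal is already Unicode; the variable names are only illustrative): the decoded emoji is a single character, while the pattern from the question only matches the ten-character escape text.

```python
import re

# The pattern from the question: a literal backslash, "U", then 8 hex digits.
escape_text = re.compile(r'\\U[0-9a-fA-F]{8}')

decoded = '\U0001f633'      # one real character (the emoji itself)
undecoded = r'\U0001f633'   # raw string: backslash, U, and 8 hex digits

print(len(decoded))                                # 1
print(len(undecoded))                              # 10
print(escape_text.search(decoded))                 # None
print(escape_text.search(undecoded) is not None)   # True
```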
Depending on whether your python was compiled with only 16-bit unicode code points or not, you'll want a pattern something like either:
# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# 32-bit codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
And your code would look like:
import re
# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)
stripped = re_strip.sub('', content)
print(stripped)
Both expressions reduce the number of characters in the stripped string to 26.
These expressions strip out the emojis you were after, but may also strip out other things you do want. It may be worth reviewing a Unicode code point range listing and adjusting them.
You can determine whether your python install will only recognize 16-bit codepoints by doing something like:
import sys
print(sys.maxunicode.bit_length())
If this displays 16, you'll need the first regex expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.
Neither expression will work when used on a python install with the wrong sys.maxunicode.
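For reference, a minimal sketch of the same approach on Python 3.3 or later, assuming (as is the case there) that sys.maxunicode is always 0x10FFFF, so only the second pattern applies. The shortened content string is just sample data.

```python
import re
import sys

# Assumption: Python 3.3+, where strings always cover the full Unicode range.
assert sys.maxunicode == 0x10FFFF

# Everything outside the Basic Multilingual Plane: emoji, but also other
# supplementary-plane characters you may want to keep.
re_strip = re.compile('[\U00010000-\U0010FFFF]')

content = '[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\U0001f633]'
stripped = re_strip.sub('', content)

print(len(content), len(stripped))   # 11 9
```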

Related

How to check if a string is an rgb hex string

I am trying to create a way to proofread command console input and check to make sure that the string is an rgb hex string. (Ex: #FAF0E6) Currently I am working with a try: except: block.
def isbgcolor(bgcolor):
    # checks to see if bgcolor is binary
    try:
        float(bgcolor)
        return True
    except ValueError:
        return False
I tried also using a .startswith('#'). I have seen examples of how to write this function in Java but I'm still a beginner and Python's all I know. Help?
Normally, the best way to see if a string matches some simple format is to actually try to parse it. (Especially if you're only checking so you can then parse it if valid, or print an error if not.) So, let's do that.
The standard library is full of all kinds of useful things, so it's always worth searching. If you want to parse a hex string, the first thing that comes up is binascii.unhexlify. We want to unhexlify everything after the first # character. So:
import binascii
def parse_bgcolor(bgcolor):
    if not bgcolor.startswith('#'):
        raise ValueError('A bgcolor must start with a "#"')
    return binascii.unhexlify(bgcolor[1:])

def is_bgcolor(bgcolor):
    try:
        parse_bgcolor(bgcolor)
    except Exception:
        return False
    else:
        return True
This accepts any even number of hex digits after the #, so 2- or 16-character strings pass, while 3-character ones such as #FFF are rejected because unhexlify needs an even-length input (even though most data formats that use #-prefixed hex RGB accept them). If you want to add a check for the length, you can add that. Is the rule == 6 or in (3, 6) or % 3 == 0? I don't know, but presumably you do if you have a rule you want to add.
If you start using parse_bgcolor, you'll discover that it gives you a bytes object with one value from 0-255 per pair of hex digits: 3 values for #RRGGBB, or 6 values for a 12-digit color when you really wanted 3 values from 0-65535. You can combine them manually, or you can parse each two-character pair as a number (e.g., with int(pair, 16)), or you can feed the 6 bytes you already have into, say, struct.unpack('>HHH', ...). Whatever you need to do is pretty easy once you know exactly what you want to do.
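A small sketch of what those options look like on Python 3 (the sample colors and variable names are made up for illustration):

```python
import binascii
import struct

# Two hex digits become one byte, so the digits of #RRGGBB yield 3 values.
rgb8 = binascii.unhexlify('FAF0E6')
print(list(rgb8))    # [250, 240, 230]

# The digits of a 12-digit color yield 6 bytes; '>HHH' regroups them into
# three big-endian 16-bit values.
rgb16 = struct.unpack('>HHH', binascii.unhexlify('FFFF00008080'))
print(rgb16)         # (65535, 0, 32896)

# Parsing each two-character pair with int(pair, 16) gives the same 8-bit
# values without going through bytes.
digits = 'FAF0E6'
pairs = [int(digits[i:i + 2], 16) for i in range(0, len(digits), 2)]
print(pairs)         # [250, 240, 230]

# An odd number of digits is rejected; binascii.Error subclasses ValueError.
try:
    binascii.unhexlify('FFF')
    odd_length_ok = True
except ValueError:
    odd_length_ok = False
print(odd_length_ok)  # False
```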
Finally, if you're trying to parse CSS or HTML, things like red or rgb(1, 2, 3) are also valid colors. Do you need to handle those? If so, you'll need something a bit smarter than this. The first thing to do is look at the spec for what you're trying to parse, and work out the rules you need to turn into code. Then you can write the code.
The following would match a hex RGB string:
import re
_rgbstring = re.compile(r'#[a-fA-F0-9]{6}$')
def isrgbcolor(value):
    return bool(_rgbstring.match(value))
This only returns True if a string starting with # followed by exactly 6 hex digits is passed in.
Demo:
>>> isrgbcolor('#FAF0E6')
True
>>> isrgbcolor('#FAF0')
False
>>> isrgbcolor('FAF0E6')
False
>>> isrgbcolor('#NotRgb')
False
If you want to support the 3-digit CSS format as well, update the pattern:
_rgbstring = re.compile(r'#[a-fA-F0-9]{3}(?:[a-fA-F0-9]{3})?$')
This matches a hash followed by 3 hex digits, plus an optional 3 extra digits.
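On Python 3.4 and later, re.fullmatch is an alternative to ending the pattern with $. The sketch below uses hypothetical names (_hex_color, is_hex_color); one difference worth knowing is that $ combined with match() still accepts a string that ends in a newline, while fullmatch does not.

```python
import re

# '#' followed by one or two groups of 3 hex digits, i.e. #RGB or #RRGGBB.
_hex_color = re.compile(r'#(?:[0-9a-fA-F]{3}){1,2}')

def is_hex_color(value):
    """Return True only if the whole string is #RGB or #RRGGBB."""
    return _hex_color.fullmatch(value) is not None

print(is_hex_color('#FAF0E6'))    # True
print(is_hex_color('#FFF'))       # True
print(is_hex_color('#FAF0'))      # False
print(is_hex_color('#FAF0E6\n'))  # False

# For comparison: '$' also matches just before a trailing newline.
dollar = re.compile(r'#[a-fA-F0-9]{6}$')
print(bool(dollar.match('#FAF0E6\n')))  # True
```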
This seems to be the simplest way. This regex will notice that the P doesn't belong in the hex string.
import re
from pprint import pprint
hex = '#f8Ed90P'
pprint(re.findall('[^#0-9a-fA-F]', hex))
So if there is anything in the result of re.findall, there's something wrong with your hex string.
This code resulted in:
macbook-pro:Desktop allendar$ python3 test.py
['P']
This code has the flaw that the hash sign can be anywhere, which of course isn't right.
You might want to check for the hash sign at the beginning of the string separately, so the regex is easier to read. After that, just check whether the remaining characters conform to the characters allowed in your regex.
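One way to sketch that two-step check (check_hex is a hypothetical helper name, not part of the answer above):

```python
import re

def check_hex(value):
    """Return (has_leading_hash, list_of_invalid_characters)."""
    has_hash = value.startswith('#')
    # Only the part after a leading hash is checked for non-hex characters.
    body = value[1:] if has_hash else value
    return has_hash, re.findall(r'[^0-9a-fA-F]', body)

print(check_hex('#f8Ed90P'))   # (True, ['P'])
print(check_hex('f8#Ed90'))    # (False, ['#'])
print(check_hex('#f8Ed90'))    # (True, [])
```

This only reports misplaced or invalid characters; it does not check the length.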

python regular expression with utf8 issue

I have a file that contains many lines of plain UTF-8 text, such as the one below (it's Chinese, by the way).
PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08
The file itself was saved as UTF-8; the file name is xx.txt.
Here is my Python code; the environment is Python 2.7:
#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()
The problem is that I get no results.
I want to extract the decimal string from 交易金额:0.01元, which here is 0.01.
Why doesn't this code work? Can anyone explain it to me? I have no clue whatsoever.
There are several issues with your code. First, you should use re.compile(ur'<unicode string>') so that the pattern is a unicode string, like the decoded line. It is also nice to add the re.UNICODE flag (not sure if it's really needed here, though). Next, you still won't get a match, because \d+ only handles a run of digits, not decimals; use \d+(?:\.\d+)? instead (digits, optionally followed by a dot and more digits). Example code:
#coding: utf-8
text = u"PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08"
import re
pattern = re.compile(ur'交易金额:(\d+(?:\.\d+)?)元', re.UNICODE)
print pattern.search(text).group(1)
You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
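The difference between .match() and .search() can be seen in a short Python 3 sketch (the line is abbreviated sample data, and amount_re is an illustrative name):

```python
import re

# Digits, optionally followed by a dot and more digits.
amount_re = re.compile(r'交易金额:(\d+(?:\.\d+)?)元')

line = 'PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 交易金额:0.01元 交易状态:true'

print(amount_re.match(line))             # None: the text is not at position 0
print(amount_re.search(line).group(1))   # 0.01
```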
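If you use UTF-8 for both the file and the source, you can match the undecoded byte strings directly. You can pass flags=re.LOCALE, though it only affects \w, \b, \s and their complements, so this pattern doesn't depend on it:
#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+(?:\.\d+)?)元', flags=re.LOCALE)
for line in open('xx.txt'):
    match = pattern.search(line)
    if match:
        print match.group(1)
For more details, see re.LOCALE. There is no need to convert UTF-8 to unicode.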
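On Python 3, the same idea (matching the undecoded bytes) needs a bytes pattern. A sketch, assuming the data is UTF-8; raw_line stands in for a line read from the file in binary mode:

```python
import re

# Encode the pattern to UTF-8 so it can be applied to raw bytes.
# In a bytes pattern, \d matches ASCII digits only.
pattern = re.compile('交易金额:(\\d+(?:\\.\\d+)?)元'.encode('utf-8'))

raw_line = '订单号:W12022910079166 交易金额:0.01元 交易状态:true'.encode('utf-8')

match = pattern.search(raw_line)
amount = match.group(1).decode('ascii')
print(amount)   # 0.01
```

Decoding each line and using a str pattern is usually simpler; the bytes version mainly shows that decoding is not strictly required.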

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.
So if I'm correct, the regexes are removing the öäå, right? Then again, the only thing that should be left of the hex value after the regex is the x, but there is no x in place of öäå on my page, so maybe this little theory is not correct. Either way, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).
EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to google chrome the encoding is Unicode(utf-8) on this specific site
Always work in unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to use the re.U flag so \w matches unicode letters:
#coding: utf-8
import re
location = "öäå".decode('utf-8')
location = re.sub(r'([^\w])+', '', location, flags=re.U)
print location # prints öäå
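For comparison, a sketch of the same substitution on Python 3, where str patterns are Unicode-aware by default and re.ASCII switches back to the ASCII-only behaviour. The sample string is made up for illustration.

```python
import re

location = 'Malmö, Sverige!'

# In Python 3, \w in a str pattern already matches ö, ä, å.
unicode_result = re.sub(r'[^\w]+', '', location)
print(unicode_result)   # MalmöSverige

# With re.ASCII, only [a-zA-Z0-9_] count as \w, so the ö is removed too.
ascii_result = re.sub(r'[^\w]+', '', location, flags=re.ASCII)
print(ascii_result)     # MalmSverige
```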
It would help if you could dump the strings before and after each step.
Check your value of re.UNICODE first.

how to turn 'C:\Music\song.mp3' into r'C:\Music\song.mp3'

I have been making an mp3 player with Tkinter and the module mp3play.
Say I had this song to play: C:\Music\song.mp3
and to play that song I have to run this script:
import mp3play
music_file=r'C:\Music\song.mp3'
clip = mp3play.load(music_file)
clip.play()
Easy enough; my problem, though, is getting the "r" there.
I have tried:
import mp3play
import re
music_file="'C:\Music\song.mp3'"
music_file='r'+music_file
music_file=re.sub('"','',music_file)
print music_file
clip = mp3play.load(music_file)
clip.play()
Which gets the output: r'C:\Music\song.mp3'
but it is a string, so it won't read the file.
The 'r' in front denotes a particular kind of string literal called a raw string. You can't get that by concatenating two strings or by regex substitution. The result is just an ordinary string, but with the escape characters taken care of.
>>> s = r'something'
>>> s
'something'
>>>
When you are writing the script, use the 'r'. If you are getting the input via raw_input, Python will take care of escaping the characters. So the question is: why are you trying to do that?
Try:
music_file='C:/Music/song.mp3'
In Python, the r prefix introduces a raw string. Outside of raw strings, backslash (\) characters are considered as escape characters and have to be escaped themselves (by doubling them).
Try a simple string instead:
music_file = 'C:\\Music\\song.mp3'
The r you are talking about has to be placed before a string definition, and tells python that the following string is "raw", meaning it will ignore backslash escapes (so it doesn't error on invalid backslashes in filenames, for example).
Why don't you just do it like in the first example? I don't see what you are trying to accomplish in the second example.
You can try music_file = r'%s' % path_to_file, although r'%s' is the same string as '%s', so this simply copies path_to_file.
As a few of the other answers have pointed out (I'm just posting this as an answer because it seemed kind of silly to make it a comment), what you've given in your first code block is exactly what the contents of your script should be. You don't need to do anything special to get the r there. In fact the 'r' is not part of the string, it's part of the code that makes the string.
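A short sketch that shows the prefix only changes how the literal is read, not what kind of object you get (the shorter C:\new path is an illustrative example chosen because \n is a real escape):

```python
raw = r'C:\Music\song.mp3'
escaped = 'C:\\Music\\song.mp3'
forward = 'C:/Music/song.mp3'

# Both literals produce exactly the same str object contents.
print(raw == escaped)      # True
print(type(raw) is str)    # True

# The prefix matters when the literal contains a real escape such as \n:
print(len('C:\new'))       # 5 (C, :, newline, e, w)
print(len(r'C:\new'))      # 6 (C, :, backslash, n, e, w)

# The format string is the same with or without the prefix.
print(r'%s' == '%s')       # True
```

A path that arrives at run time (from user input or a file dialog, say) is already a plain string whose backslashes are literal, so there is nothing to convert.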

Converting html entities into their values in python

I use this regex on some input,
[^a-zA-Z0-9##]
However this ends up removing lots of html special characters within the input, such as
#227;, #1606;, #1588; (I had to remove the & prefix so that it wouldn't show up as the actual value).
Is there a way that I can convert them to their values so that they will satisfy the regular expression? I also have no idea why the text decided to be so big.
Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))
s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
If your terminal emulator can display arbitrary Unicode glyphs, a print u will then show
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
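On Python 3, where unichr and htmlentitydefs no longer exist under those names, the standard library's html.unescape covers both numeric and named references. A minimal sketch with the sample values from the question:

```python
import html

s = '&#227;, &#1606;, &#1588;'
u = html.unescape(s)
print(u)        # ã, ن, ش

# Named references are handled by the same call.
named = html.unescape('x &laquo; y')
print(named)    # x « y
```

After this step the original character-class regex can run on real characters instead of entity text.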
You can adapt the following script:
import htmlentitydefs
import re
def substitute_entity(match):
    name = match.group(1)
    if name in htmlentitydefs.name2codepoint:
        return unichr(htmlentitydefs.name2codepoint[name])
    elif name.startswith('#'):
        try:
            return unichr(int(name[1:]))
        except:
            pass
    return '?'
print re.sub('&(#?\\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
Without knowing what the expression is being used for I can't tell exactly what you need.
This will match special characters or strings of characters excluding letters, digits, #, and #:
[^a-zA-Z0-9##]*|#[0-9A-Za-z]+;
