Converting html entities into their values in python

I use this regex on some input:
[^a-zA-Z0-9##]
However, this ends up removing lots of HTML special characters within the input, such as
#227;, #1606;, #1588; (I had to remove the & prefix so that they wouldn't
show up as the actual values).
Is there a way that I can convert them to their values so that they will satisfy the regular expression?

Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))
s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
If your terminal emulator can display arbitrary Unicode glyphs, print u will then show
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
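For readers on Python 3, where unichr no longer exists, the same idea can be sketched with chr. This is only an illustration under that assumption: decode_numeric_entities is a hypothetical helper name, and it handles decimal entities only, like the pattern above. The standard library's html.unescape covers both numeric and named entities, so it is shown alongside as a cross-check.

```python
import html
import re

# Matches decimal numeric entities such as &#227;
NUMERIC_ENTITY = re.compile(r'&#(\d+);')

def decode_numeric_entities(text):
    """Replace each decimal entity with the character it encodes."""
    return NUMERIC_ENTITY.sub(lambda m: chr(int(m.group(1))), text)

sample = '&#227;, &#1606;, &#1588;'
decoded = decode_numeric_entities(sample)
print(decoded)

# html.unescape should agree for this input
print(html.unescape(sample) == decoded)
```

Once the entities are decoded, the original character-class regex can be applied to the resulting string as before.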

You can adapt the following script:
import htmlentitydefs
import re
def substitute_entity(match):
    name = match.group(1)
    if name in htmlentitydefs.name2codepoint:
        # Named entity such as &laquo;
        return unichr(htmlentitydefs.name2codepoint[name])
    elif name.startswith('#'):
        # Numeric entity such as &#123;
        try:
            return unichr(int(name[1:]))
        except (ValueError, OverflowError):
            pass
    # Unknown entity: fall back to a placeholder
    return '?'
print re.sub(r'&(#?\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
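A rough Python 3 port of the same script, assuming the html.entities module (the Python 3 home of htmlentitydefs) and chr in place of unichr. The function name and the '?' placeholder are carried over from the script above; nothing else is implied about the original author's setup.

```python
import re
from html.entities import name2codepoint

def substitute_entity(match):
    name = match.group(1)
    if name in name2codepoint:
        # Named entity such as &laquo;
        return chr(name2codepoint[name])
    if name.startswith('#'):
        # Decimal numeric entity such as &#123;
        try:
            return chr(int(name[1:]))
        except (ValueError, OverflowError):
            pass
    # Unknown entity: fall back to a placeholder
    return '?'

result = re.sub(r'&(#?\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
print(result)
```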

Without knowing what the expression is being used for I can't tell exactly what you need.
This will match special characters or strings of characters excluding letters, digits, #, and #:
[^a-zA-Z0-9##]*|#[0-9A-Za-z]+;

Related

Read only Arabic text from file in Python [duplicate]

When using Regex in Python, it's easy to use brackets to represent a range of characters a-z, but this doesn't seem to be working for other languages, like Arabic:
import re
pattern = '[ي-ا]'
p = re.compile(pattern)
This results in a long error report that ends with
raise error("bad character range")
sre_constants.error: bad character range
How can this be fixed?

Since Arabic text is rendered from right to left, the correct string below, which reads "from ا to ي", appears backward on screen (try selecting the string if you want to confirm):
'[ا-ي]'
Console output:
>>> re.compile('[ا-ي]')
<_sre.SRE_Pattern object at 0x6001f0a80>
>>> re.compile('[ا-ي]', re.DEBUG)
in
range (1575, 1610)
<_sre.SRE_Pattern object at 0x6001f0440>
So your pattern '[ي-ا]' is actually "from ي to ا", which is an invalid range, since the code point of ا (1575) is smaller than the code point of ي (1610).
To prevent confusion, Ignacio Vazquez-Abrams's suggestion of using Unicode escape is a good alternative to the solution I provide above.
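One way to check which direction a range really runs, regardless of how the editor displays it, is to compare code points with ord(). The sketch below assumes Python 3 (the console session above is Python 2) and uses escapes so that the logical order is unambiguous:

```python
import re

alef = '\u0627'  # ا
yeh = '\u064a'   # ي

# The lower code point must come first in a character range
print(ord(alef), ord(yeh))  # 1575 1610

valid = re.compile('[{0}-{1}]'.format(alef, yeh))

try:
    re.compile('[{0}-{1}]'.format(yeh, alef))
    reversed_ok = True
except re.error as exc:
    reversed_ok = False
    print('rejected:', exc)
```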

Use Unicode escapes instead.
>>> re.compile('[\u0627-\u064a]')
<_sre.SRE_Pattern object at 0x237f460>

The approved answer does work; however, the range [\u0627-\u064a] does not include variants of the letter 'ا' such as 'أ', 'آ' or 'إ', nor 'ؤ', the variant of the letter 'و'. (I wanted to comment on or suggest an edit to the approved answer, but there's a queue.)
So in case someone is (re)visiting this question and needs those letter variants, a range that worked better for me was [\u0600-\u06FF], making the answer:
pattern = re.compile('[\u0600-\u06FF]')
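To illustrate the difference between the two ranges, here is a small Python 3 sketch. The sample string and the names NARROW and BLOCK are made up for the example. Note that [\u0600-\u06FF] is the whole Arabic block, so it also matches Arabic-Indic digits and Arabic punctuation, which may or may not be wanted.

```python
import re

NARROW = re.compile('[\u0627-\u064a]+')  # starts at plain alef
BLOCK = re.compile('[\u0600-\u06FF]+')   # entire Arabic block

# "abc أحمد 123": the Arabic word starts with alef-with-hamza (U+0623)
sample = 'abc \u0623\u062d\u0645\u062f 123'

narrow_hits = NARROW.findall(sample)
block_hits = BLOCK.findall(sample)

print(narrow_hits)  # the leading U+0623 is missing
print(block_hits)   # the full word is matched
```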

python-re.sub() and unicode

I want to replace all emoji with '' but my regex doesn't work. For example,
content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'
and I want to replace every character of the form \U0001f633 with '', so I wrote this code:
print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)
But it doesn't work.
Thanks a lot.

You won't be able to recognize properly decoded Unicode code points that way (as strings containing \uXXXX, etc.). Once decoded, by the time the regex engine gets to them, each one is a single character.
Depending on whether your python was compiled with only 16-bit unicode code points or not, you'll want a pattern something like either:
# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# 32-bit codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
And your code would look like:
import re
# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)
stripped = re_strip.sub('', content)
print(stripped)
Both expressions reduce the number of characters in the stripped string to 26.
These expressions strip out the emojis you were after, but may also strip out other things you do want. It may be worth reviewing a Unicode code point range listing and adjusting them.
You can determine whether your python install will only recognize 16-bit codepoints by doing something like:
import sys
print(sys.maxunicode.bit_length())
If this displays 16, you'll need the first regex expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.
Neither expression will work when used on a python install with the wrong sys.maxunicode.
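On Python 3.3 and later, sys.maxunicode is always 0x10FFFF, so only the second pattern applies there. A minimal sketch under that assumption, using a shortened version of the sample string (the name ASTRAL is just for illustration):

```python
import re
import sys

# Every code point outside the Basic Multilingual Plane
ASTRAL = re.compile('[\U00010000-\U0010FFFF]')

content = '\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\U0001f633'
stripped = ASTRAL.sub('', content)

print(sys.maxunicode.bit_length())
print(len(content), len(stripped))
```

This removes every supplementary-plane character, not only emoji, and it leaves alone emoji-like symbols that live in the Basic Multilingual Plane, so the range may still need adjusting for real data.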

Processing accented Unicode characters with python regex module

I have the following two functions, which work perfectly fine with ASCII strings and use the re module:
import re
def findWord(w):
    return re.compile(r'\b{0}.*?\b'.format(w), flags=re.IGNORECASE).findall
def replace_keyword(w, c, x):
    return re.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=re.I)
However, they fail on UTF-8 encoded strings with accented characters. Searching further, I found that the regex module is better suited for Unicode strings, so I have been trying to port this to regex for the last couple of hours, but nothing seems to work. This is what I have as of now:
import regex
def findWord(w):
    return regex.compile(r'\b{0}.*?\b'.format(w), flags=regex.IGNORECASE|regex.UNICODE).findall
def replace_keyword(w, c, x):
    return regex.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=regex.IGNORECASE|regex.UNICODE)
However, when I use an accented (not normalized) UTF-8 encoded string, I keep getting an "ordinal not in range" error.
EDIT: The suggested possible duplicate question: Regular expression to match non-English characters? doesn't solve my problem. I want to use the python re/regex module. Secondly, I want to get the find and replace functions working using python.
EDIT: I am using python 2
EDIT: If you feel you can help me get these two functions working using Python 3 please let me know. I hope I will be able to invoke python 3 for using just these 2 functions through my python 2 script.
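Regarding the Python 3 route: there, str is already Unicode, and \b, \S and IGNORECASE are Unicode-aware by default in the re module, so the two functions can probably stay almost unchanged. The sketch below is one possible version, not a tested drop-in for the asker's data. The snake_case names and the re.escape call (so that keywords containing regex metacharacters are matched literally) are additions, and the sample sentence is invented.

```python
import re

def find_word(word):
    """Return a findall function for words starting with `word`."""
    pattern = re.compile(r'\b{0}.*?\b'.format(re.escape(word)),
                         flags=re.IGNORECASE)
    return pattern.findall

def replace_keyword(word, color, text):
    """Wrap every word starting with `word` in a <mark> tag."""
    pattern = r'\b({0}\S*)'.format(re.escape(word))
    replacement = r'<mark style="background-color:{0}">\1</mark>'.format(color)
    return re.sub(pattern, replacement, text, flags=re.IGNORECASE)

text = 'Él tomó café en el Café Río.'
found = find_word('café')(text)
marked = replace_keyword('café', 'yellow', text)
print(found)
print(marked)
```

This assumes the text uses precomposed accented characters; if the input mixes composed and decomposed forms, normalizing both the text and the keyword with unicodedata.normalize first would likely be needed.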

I think I am headed somewhere. I am trying to get this working without using the modules re or regex but plain python:
found_keywords = []
for word in keyword_list:
    if word.lower() in article_text.lower():
        found_keywords.append(word)
for word in found_keywords: # highlight the found keyword in the text
    article_text = article_text.lower().replace(word.lower(), '<mark style="background-color:%s">%s</mark>' % (yellow_color, word))
Now, I just have to somehow replace found keywords in a case-insensitive manner and I will be good to go.
Just help me with this last step of replacing keywords in a case-insensitive manner without using re or regex so that it works for accented strings.
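One possible way to do that last step without re or regex is to search in a lowercased copy of the text but slice the replacement out of the original, so the original casing survives. The sketch below is written for Python 3 and highlight_keyword is a hypothetical helper, not part of the code above. It relies on lower() not changing the length of the text; for the rare characters where it does, the function raises rather than guessing at offsets.

```python
def highlight_keyword(text, keyword, color):
    """Wrap each case-insensitive occurrence of keyword in a <mark> tag."""
    lowered_text = text.lower()
    lowered_keyword = keyword.lower()
    if not lowered_keyword:
        return text
    if len(lowered_text) != len(text):
        # Offsets in the lowered copy would not line up with the original
        raise ValueError('lowercasing changed the text length')
    pieces = []
    position = 0
    while True:
        found = lowered_text.find(lowered_keyword, position)
        if found == -1:
            break
        end = found + len(lowered_keyword)
        pieces.append(text[position:found])
        # Slice from the original text to keep its casing
        pieces.append('<mark style="background-color:%s">%s</mark>'
                      % (color, text[found:end]))
        position = end
    pieces.append(text[position:])
    return ''.join(pieces)

print(highlight_keyword('Café y CAFÉ', 'café', 'yellow'))
```

Unlike the loop above, this does not lowercase the whole article, and it matches substrings rather than whole words.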

Python replace with re-using unknown strings

I have an XML in which I'd like to rename one of the tag groups like this:
<string>ABC</string>
<string>unknown string</string>
should be
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
ABC is always the same, so that's no issue. However, "unknown string" is always different, but since I need this information extracted, I also want to keep the same string in the replacement.
Here's what I got so far:
import re
#open the xml file for reading:
file = open('path/file','r+')
#convert to string:
data = file.read()
file.write(re.sub("<string>ABC</string>(\s+)<string>(.*)</string>","<xyz>ABC</xyz>[\1]<xyz>[\2]</xyz>",data))
print (data)
file.close()
I tried to use capture groups, but didn't do it correctly. The string is replaced with weird symbols in my XML. Plus, it's printed twice. I have both the unchanged and the changed version in my XML, which I don't want.

The problem you're experiencing is not due to your regex pattern. The backslashes (\) in your strings escape the characters that follow them, which produces the weird symbols you see.
>>> print "hello\1world"
helloworld
>>> print r"hello\1world"
hello\1world
Always use the raw string notation to define your re patterns.
>>> data = """
... <string>ABC</string>
... <string>unknown string</string>
... """
>>> print re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data)
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
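The duplicated content mentioned in the question most likely comes from the file handling: after read(), the position of a file opened with 'r+' is at the end, so write() appends the new text after the old. A hedged Python 3 sketch that reads, substitutes, and then reopens the file for writing is shown below. The function names are hypothetical, and a temporary file stands in for the real path.

```python
import os
import re
import tempfile

PATTERN = re.compile(r"<string>ABC</string>(\s+)<string>(.*)</string>")

def rename_pair(xml_text):
    """Rename the ABC tag and the tag that follows it, keeping the content."""
    return PATTERN.sub(r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", xml_text)

def rewrite_file(path):
    with open(path, "r", encoding="utf-8") as handle:
        data = handle.read()
    # Opening with "w" truncates the file, so the old text is not kept
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(rename_pair(data))

# Demonstration on a throwaway file
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.xml")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("<string>ABC</string>\n<string>unknown string</string>\n")
    rewrite_file(demo_path)
    with open(demo_path, encoding="utf-8") as handle:
        result = handle.read()

print(result)
```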

Why are you including the content in your replacement operation? All you need to do is:
Replace <string> by <xyz>.
Replace </string> by </xyz>.
It would take two operations but the intent of your code would be clear and you don't need to know what unknown string is.
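The two replacements described above could look like this in Python. It is a sketch only: rename_string_tags is a made-up name, and it renames every <string> element in the text, which is only appropriate if no other <string> tags need to be kept.

```python
def rename_string_tags(xml_text):
    """Rename all <string> elements to <xyz> with two plain replacements."""
    return (xml_text
            .replace('<string>', '<xyz>')
            .replace('</string>', '</xyz>'))

data = '<string>ABC</string>\n<string>unknown string</string>\n'
renamed = rename_string_tags(data)
print(renamed)
```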

how to turn 'C:\Music\song.mp3' into r'C:\Music\song.mp3'

I have been making an mp3 player with Tkinter and the module mp3play.
Say I had this song to play: C:\Music\song.mp3
To play that song I have to run this script:
import mp3play
music_file=r'C:\Music\song.mp3'
clip = mp3play.load(music_file)
clip.play()
Easy enough; my problem, though, is getting the "r" there.
I have tried:
import mp3play
import re
music_file="'C:\Music\song.mp3'"
music_file='r'+music_file
music_file=re.sub('"','',music_file)
print music_file
clip = mp3play.load(music_file)
clip.play()
This produces the output: r'C:\Music\song.mp3'
but it is a string, so it won't read the file.

The 'r' in front denotes a particular kind of string literal called a raw string. You can't get that by concatenating two strings or by substituting with re. The result is just an ordinary string; the prefix only changes how backslash escapes in the literal are handled.
>>> s = r'something'
>>> s
'something'
>>>
When you are writing the script, use the 'r' prefix; if you are getting the input via raw_input, Python will take care of escaping the characters. So the question is: why are you trying to do that?

try:
music_file='C:/Music/song.mp3'

In Python, the r prefix introduces a raw string. Outside of raw strings, backslash (\) characters are considered as escape characters and have to be escaped themselves (by doubling them).
Try a simple string instead:
music_file = 'C:\\Music\\song.mp3'

The r you are talking about has to be placed before a string definition, and tells python that the following string is "raw", meaning it will ignore backslash escapes (so it doesn't error on invalid backslashes in filenames, for example).
Why don't you just do it like in the first example? I don't see what you are trying to accomplish in the second example.

you can try music_file = r'%s' % path_to_file

As a few of the other answers have pointed out (I'm just posting this as an answer because it seemed kind of silly to make it a comment), what you've given in your first code block is exactly what the contents of your script should be. You don't need to do anything special to get the r there. In fact the 'r' is not part of the string, it's part of the code that makes the string.
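A short illustration of that last point: the r prefix only affects how the source code is read, and the resulting string object is the same as one written with doubled backslashes. The C:\temp example is added here to show a case where the prefix actually matters.

```python
raw_path = r'C:\Music\song.mp3'
escaped_path = 'C:\\Music\\song.mp3'

# Both literals produce the same string object contents
same = raw_path == escaped_path

with_tab = 'C:\temp'      # \t is read as a single tab character
without_tab = r'C:\temp'  # backslash and t stay separate

print(same, len(with_tab), len(without_tab))  # True 6 7
```

A path that arrives at runtime, for example from a file dialog or user input, is already an ordinary string with real backslashes in it, so no prefix is needed or possible there.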
