Finding regex, unicode patterns

Finding regex, unicode patterns - python

I'm trying to scrape a website where unicode characters are present. I stated at the very begining -*- coding: utf-8 -*- plus I used the re.UNICODE flag
pattern = re.compile('(?:{}|{})'.format(regex, regex1), re.UNICODE)
However when I print the output I still get those weird chars like �
How do I fix that? Thanks!

Just because a page it has non-latin character doesn't mean it's encoded with unicode (also, which unicode encoding? utf-8? utf-16?).
Additionally, re.UNICODE probably doesn't do what you think it does. From the docs:
Make `\w, \W, \b, \B, \d, \D, \s` and `\S` dependent on the Unicode character properties database.
All this means is that these specific character classes are more broadly defined, it has no effect on the source text.
Moreover, the coding definition, -*- coding: utf-8 -*- is only specifying the encoding of your source file.
Finally, as noted in one of the comments, the � can be the result of using a character which is not supported by the current typeface. This, in turn, can be the result of assuming a certain encoding while the text is encoded in a different encoding.

This may not be an "answer", per-se.. but you could try using http://www.debuggex.com to debug your regexp a bit.

Related

Encoding in python 2.7, from IDE to string

So here I am, I read about encoding all day, now I need some clarification.
First off I'm using eclipse mars with pydev.
Unicode is a (character set + code points), basicaly a table of symbols associated with numerical value.
The way those value are going to be stored at a binary level are defined by the encoding, let's say UTF-8.
1 : shebang
What is the shebang for? when I put # -*- coding: utf-8 -*- does it do something? or does it just indicate that my file is encoded in UTF-8 (but since it's just an indication it could be a lie :o)
2 : Eclipse file encoding
After I wrote my shebang and saved I went into the property of the file, and it said encoding : ISO-8859-1, so my guess is that the shebang does nothing beside indicate in which encoding my file is.
Do I need to manually set every files to UTF-8 or is there a way to teach eclipse to read the shebang and act accordingly.
3 : Why does the shebang only specify the encoding?
My shebang say utf-8, ok right, so what? it does not tell me which caracter set is used.
Since UTF-8 is just an encoding I could use UTF-8 with any character set no?
I could encode ASCII in UTF-8 if I wanted, since an encoding is just a way to convert and store/read code points.
What if my character set encoded in utf-8 does not have the same code points than unicode? (is this possible?)
4 : maybe a solution?
I oftenly read that utf-8 is an implementation of unicode, does that mean that each times you read encoding = UTF-8 you can be 100%, and I say 100%, sure that the characterset+code points is unicode?
I'm lost

There are multiple misconceptions in your question.
Unicode is a standard that is commonly used for working with text. It is not "character set + code points" e.g., Unicode standard defines how to find word boundaries or how to compare Unicode string.
# -*- coding: utf-8 -*- is an encoding declaration. It is not a shebang. Shebang (as it name suggests) starts with #! e.g., #! /usr/bin/env python.
You might need the encoding declaration if there are non-ascii literal characters in your Python source code e.g., you don't need an encoding declaration if you write:
#!/usr/bin/env python2
print u"\N{SNOWMAN}"
But you need it if you use literal non-ascii characters:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print u"☃"
Both scripts produce the same output if the second script is saved using utf-8 encoding. The encoding declaration says how to interpret bytes that constitute the Python source code to get the program text.
"is there a way to teach eclipse to read the shebang encoding declaration and act accordingly." is a good separate question. If IDE has explicit Python support then it should do it automatically.
My shebang encoding declaration say utf-8, ok right, so what? it does not tell me which character set is used.
"character encoding", codepage, and charset may be used interchangeably in many contexts. See What's the difference between encoding and charset? The distinctions are irrelevant for the task of converting from bytes to text and back in Python:
unicode_text = bytestring.decode(character_encoding)
bytestring = unicode_text.encode(character_encoding)
A bytestring is an immutable sequence of bytes in Python (roughly speaking numbers in 0..255 range) that is used to represent arbitrary binary data e.g., images, zip-archives, encrypted data, and text encoded using some character encoding. A Unicode string is an immutable sequence of Unicode codepoints (roughly speaking, numbers in 0..sys.maxunicode range) that is used to represent text in Python.
Some character encodings such as cp437 support only a few Unicode characters. Others such as utf-8 support the full range of Unicode codepoints.

The right way to add the encoding declaration is > # -*- coding: utf-8 -*-
It tells python to change the encoding in the current script to UTF-8 it has nothing to do with the user .

Ok I think I found an awnser to all those questions
1/
thanks to J.Dev, the shebang only tells the python interpreter in what the file is encoded, but YOU have to encode the file in what you put in the shebang
2/
Apparently I have to do it manually
3/
Because an encoding is associated with a charset, if you say encoding=utf-8 then it will always be a unicode charset
Some old 1 byte charset don't have encoding, you don't need encoding since the char are all stored on 1 byte, the natural binary translation is the encoding.
So when you say ASCII for instance you mean the charset and encoding = ASCII
But this leave me wondering, is there other type of charset out there with multiple encoding implementation (like unicode can be encoded in utf-8/16/32)

How to cope with diacritics while trying to match with regex in Python

Trying to use regular expression with unicode html escapes for diacritics:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
htmlstring=u'''/">čćđš</a>.../">España</a>'''
print re.findall( r'/">(.*?)</a', htmlstring, re.U )
produces :
[u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
Any help, please?

This appears to be an encoding question. Your code is working as it should. Were you expecting something different? Your strings that are prefixed with u are unicode literals. The characters that begin with \u are unicode characters followed by four hex digits, whereas the characters that begin with \x are unicode characters followed by only two hex digits. If you print out your results (instead of looking at their __repr__ method), you will see that you have received the result that it appears you were looking for:
results = [u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
for result in results:
print result
čćđš
España
In your code (i.e. in your list), you see the representation of these unicode literals:
for result in results:
print result.__repr__()
u'\u010d\u0107\u0111\u0161' # what shows up in your list
u'Espa\xf1a'
Incidentally, it appears that you are trying to parse html with regexes. You should try BeautifulSoup or something similar instead. It will save you a major headache down the road.

How to handle special characters in comments and hard coded strings in python file?

This question aims at the following two scenarios:
You want to add a string with special characters to a variable:
special_char_string = "äöüáèô"
You want to allow special characters in comments.
# This a comment with special characters in it: äöà etc.
At the moment I handle this this way:
# -*- encoding: utf-8 -*-
special_char_string = "äöüáèô".decode('utf8')
# This a comment with special characters in it: äöà etc.
Works fine.
Is this the recommended way? Or is there a better solution for this?

Python will check the first or second line for an emacs/vim-like encoding specification.
More precisely, the first or second
line must match the regular
expression "coding[:=]\s*([-\w.]+)". The first
group of this
expression is then interpreted as encoding name. If the encoding
is unknown to Python, an error is raised during compilation.
Source: PEP 263
(A BOM would also make Python interpret the source as UTF-8.
I would recommend, you use this over .decode('utf8')
# -*- encoding: utf-8 -*-
special_char_string = u"äöüáèô"
In any case, special_char_string will then contain a unicode object, no longer a str.
As you can see, they're both semantically equivalent:
>>> u"äöüáèô" == "äöüáèô".decode('utf8')
True
And the reverse:
>>> u"äöüáèô".encode('utf8')
'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa1\xc3\xa8\xc3\xb4'
>>> "äöüáèô"
'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa1\xc3\xa8\xc3\xb4'
There is a technical difference, however: if you use u"something", it will instruct the parser that there is a unicode literal, it should be a bit faster.

Yes, this is the recommended way for Python 2.x, see PEP 0263.
In Python 3.x and above, the default encoding is UTF-8 and not ASCII, so you don't need this there. See PEP 3120.

Compile Syntax Error: non ASCII letters in a string

I have a python file that contains a long string of HTML. When I compile & run this file/script I get this error:
_SyntaxError: Non-ASCII character '\x92' in file C:\Users...\GlobalVars.py on line 2509, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details_
I have followed the instructions and gone to the url suggested. But putting something like this at the top of my script still doesn't work:
#!/usr/bin/python
# -*- coding: latin-1 -*-
What do you think I can do to stop this compiler error from occuring?

First, in order to prevent problems like the one specified in the question you should not ever use other encoding than utf-8 for python source code.
This is the correct header to use
#! /usr/bin/env python
# -*- coding: utf-8 -*-
Now you have to convert the file from whatever encoding you may have to utf-8, probably your current text editor is able to do that.
If you wonder why I say this remember that it is impossible for a text editor to safely guess your non-unicode encoding because there is no BOM for non-unicode. For this reason most decent editors are using UTF-8 as default even when encoding is not specified. And BTW, the encoding specified in the python file header is for Python only, most editors ignore what you wrote there.
Also, as you can see Python is trying to decode a character above 128 using ASCII (not latin-1), this is supposed to fail. I am not sure why this happens but I don't even care too much because there is a much better way to solve the problem.

It must be at the top of the script that has the non-ASCII text, and it must match the actual encoding of the file. \x92 is CP1252, not Latin-1.

If you are just concerned about getting rid of this error without getting into the details of it(which you can get from the other answers on this page), you can do the following -
1) Copy your code and paste it in Notepad++
2) Select Encoding -> Encode in UTF-8
3) Select View -> Show Symbol -> Show All Characters
Now it would be visible to you that which symbol is causing the issue(x92 would be visible). Replace/Remove it to solve the problem.

Found this and hope it's helpful to the next person:
http://www.sitepoint.com/forums/showthread.php?567734-Anyone-know-what-this-error-means
Code point 0x92 (146 decimal) is the right single quotation mark, or
apostrophe (’) in Windows-1252. It's an invalid character in ISO 8859
and in UTF-8, since the 0x80-0x9F range is reserved for C1 control
characters.
Not sure if I'm busting copyright. If so please remove the blockquote.

The encoding declaration indicates that you think the file is in latin-1 encoding, but the python interpreter is finding that a char at or very near line 2509 in GlobalVars.py that is not what you think it is.
You should first confirm the encoding of GlobalVars.py. Is it really latin-1?
Next, you should check the characters near line 2509. Are they also latin-1, or were they cut and pasted from a web page or somewhere else (maybe there are UTF-8 chars mixed up in there)?
If you have chars in your source file that aren't what you think they are, then you may need to clean up the file before going any further.

add these lines on top of your code
#! /usr/bin/env python
# -*- coding: utf-8 -*-

An easy workaround solution if your file is really in latin-1 is to change the html string with its representation.
Afaik:
\x92 => 146 in decimal => Æ => Æ
If your character is not Æ, then your file is not encoded into latin-1 ;-) (and you might wanna check if utf-8/cp1292 works better as a quick win)
EDIT:
Of course, you want to check your ACTUAL file encoding before trying. I might be wrong, not 100% sure \x92 is Æ in Iso8859-1 : according to this page, it doesn't seem defined.

UTF in Python Regex

I'm aware that Python 3 fixes a lot of UTF issues, I am not however able to use Python 3, I am using 2.5.1
I'm trying to regex a document but the document has UTF hyphens in it – rather than -. Python can't match these and if I put them in the regex it throws a wobbly.
How can I force Python to use a UTF string or in some way match a character such as that?
Thanks for your help

You have to escape the character in question (–) and put a u in front of the string literal to make it a unicode string.
So, for example, this:
re.compile("–")
becomes this:
re.compile(u"\u2013")

After a quick test and visit to PEP 0264: Defining Python Source Code Encodings, I see you may need to tell Python the whole file is UTF-8 encoded by adding adding a comment like this to the first line.
# encoding: utf-8
Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6
# encoding: utf-8
import re
x = re.compile("–")
print x.search("xxx–x").start()

Don't use UTF-8 in a regular expression. UTF-8 is a multibyte encoding where some unicode code points are encoded by 2 or more bytes. You may match parts of your string that you didn't plan to match. Instead use unicode strings as suggested.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.