How to use dateparser to detect dates in strings? - python

I want to use dateparser to detect which cells contain a date. I have a broad range of date formats: Fr, 21.02.2020 // 20.02.2020 // 21.02 // 21-02-2020 // January, 21 2020 // 21-Jan-2020 // 21/02/20, and I am sure a few more will turn up in the future. The dateparser library detects all of them pretty well, but it also detects 'PO', 'to', '06' and '16:00' as dates or relative dates, which I don't want. I checked the documentation for a way to turn relative dates off, or to restrict detection to "real dates". The settings offer different PARSERS and the possibility of using only some of them. These are the default PARSERS, and the program runs through all of them:
'timestamp': If the input string starts with 10 digits, optionally followed by additional digits or a period (.), those first 10 digits are interpreted as Unix time.
'relative-time': Parses dates and times expressed in relation to the current date and time (e.g. “1 day ago”, “in 2 weeks”).
'custom-formats': Parses dates that match one of the date formats in the list of the date_formats parameter of dateparser.parse() or DateDataParser.get_date_data.
'absolute-time': Parses dates and times expressed in absolute form (e.g. “May 4th”, “1991-05-17”). It takes into account settings such as DATE_ORDER or PREFER_LOCALE_DATE_ORDER.
'base-formats': Parses dates that match one of the following date formats
I tried to use only one of them by passing settings={'base-formats':True} in my code, but that does not work. The documentation also offers the following snippet to turn off individual PARSERS:
>>> from dateparser.settings import default_parsers
>>> parsers = [parser for parser in default_parsers if parser != 'relative-time']
>>> parse('today', settings={'PARSERS': parsers})
This raises the error:
ModuleNotFoundError: No module named 'dateparser.settings'
I tried pip install, but that doesn't help.
Link to docu: https://dateparser.readthedocs.io/en/latest/#settings
And here's my code:
import dateparser
inputlist = [[' ','Supplier:',' Company Y', ' ', 'Project:','Carasco', ' '],[' ','21-Jan-2020',' ','Consultant:','James Farewell', ' ', ' '],['PO', ' Service', ' Cost Center', ' Accounting Object', ' deliver at', ' Amount', ' Unit'],['0106776','XYZ', 'Countable',' ', '16:00','6,00','h',],['Fr, 21.02.2020', '20.03.2020', ' ', ' ', ' ', ' ','6/04/20']]
print(inputlist)
outerlist=[]
for row in inputlist:
    innerlist = []
    for cell in row:
        parsecheck = dateparser.parse(cell, languages=['en', 'de'], settings={'base-formats':True})
        if parsecheck == None:
            innerlist.append(0)
        else:
            innerlist.append(1)
    outerlist.append(innerlist)
print(outerlist)
I currently get:
[[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0, 1]]
Desired Output:
[[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1]]

This is the best I could do:
import dateparser
import locale
inputlist = [[' ','Supplier:',' Company Y', ' ', 'Project:','Carasco', ' '],[' ','21-Jan-2020',' ','Consultant:','James Farewell', ' ', ' '],['PO', ' Service', ' Cost Center', ' Accounting Object', ' deliver at', ' Amount', ' Unit'],['0106776','XYZ', 'Countable',' ', '16:00','6,00','h',],['Fr, 21.02.2020', '20.03.2020', ' ', ' ', ' ', ' ','6/04/20']]
print(inputlist)
# %d is the day of the month; %w would be the weekday number (0-6)
customlist = ["%d.%m.%Y", "%d-%b-%Y", "%d/%m/%y", "%a, %d.%m.%Y"]
outerlist=[]
saved = locale.setlocale(locale.LC_ALL)
locale.setlocale(locale.LC_ALL, 'de_de')
for row in inputlist:
    innerlist = []
    for cell in row:
        parsecheck = dateparser.parse(cell, languages=['en', 'de'], settings={'PARSERS':['custom-formats']}, date_formats=customlist)
        if parsecheck == None:
            innerlist.append(0)
        else:
            innerlist.append(1)
    outerlist.append(innerlist)
locale.setlocale(locale.LC_ALL, saved)
print(outerlist)
The output is:
[[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1]]
To parse Fr, 21.02.2020 I changed the locale to German and, near the end, restored the initial locale.
The formats are based on the "strftime() and strptime() Behavior" section of the Python documentation.
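If the set of formats is known in advance, the same whitelist idea can also be sketched with only the standard library. This is a rough illustration and not dateparser's own mechanism: DATE_FORMATS, WEEKDAY_PREFIX, is_date and flag_dates are hypothetical names, the weekday prefix is simply stripped rather than validated, and %b assumes English month abbreviations (the default C locale).

```python
import re
from datetime import datetime

# Hypothetical whitelist; extend it as new formats show up in the data.
DATE_FORMATS = ("%d.%m.%Y", "%d-%b-%Y", "%d/%m/%y")

# Optional leading weekday abbreviation such as "Fr, " or "Mon, ".
WEEKDAY_PREFIX = re.compile(r"^[A-Za-z]{2,3},\s*")


def is_date(cell):
    """Return True if the cell matches one of DATE_FORMATS exactly."""
    text = WEEKDAY_PREFIX.sub("", cell.strip())
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue  # try the next format
    return False


def flag_dates(rows):
    """Map every cell to 1 (date) or 0 (not a date)."""
    return [[int(is_date(cell)) for cell in row] for row in rows]


print(flag_dates([["PO", "16:00", "Fr, 21.02.2020", "6/04/20"]]))
# [[0, 0, 1, 1]]
```

Because strptime must consume the whole string, values such as '16:00' or '6,00' are rejected without any extra checks. The price is that every new format has to be added to the list by hand.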

I agree that changing the settings does not work as expected based on the docs. Looking at the code, it doesn't look like you can get date-only objects (though I'm not an expert and may have missed something). If I understand correctly, the setting should be settings={'PARSERS': ['base-formats']} (the PARSERS key with a list of parser names, as in the docs snippet you quoted) instead of settings={'base-formats':True}, but that doesn't solve your problem.
I can only suggest a workaround that makes use of the fact that the hour and minute of the returned datetime object default to 0.
import dateparser
outerlist=[]
for row in inputlist:
    innerlist = []
    for cell in row:
        parsecheck = None
        if dateparser.parse(cell, settings={'STRICT_PARSING':True}) != None and dateparser.parse(cell).hour == 0:
            parsecheck = dateparser.parse(cell, languages=['en', 'de'], settings={'PARSER':'date_formats'})
        if parsecheck == None:
            innerlist.append(0)
        else:
            innerlist.append(1)
    outerlist.append(innerlist)
STRICT_PARSING: True means the returned value is None if any of YEAR, DAY or MONTH is missing, which takes care of 'PO', 'h' and '6,00' returning valid datetime objects. Checking that the hour attribute is zero gets rid of the valid times.
Unfortunately
for cell in row:
    parsecheck = dateparser.parse(cell, languages=['en','de'], settings={'STRICT_PARSING':True, 'PARSER':'date_formats'})
    if parsecheck != None and parsecheck.hour == 0:
        innerlist.append(1)
    else:
        innerlist.append(0)
doesn't seem to work, since it interprets '16:00' as a date.
edit - you don't need to import datetime

Related

Python Binary Number Class: Numpy Array Manipulation Getting "TypeError: 'bool' object is not iterable"

I am trying to create a class called "Binary". The main idea is to take a string representing a fixed-width, 16-bit binary number as its only parameter and store it as a numpy integer array in its one instance variable, "bit_array".
If the given string is longer than 16 characters or contains anything other than 0s and 1s, it is to raise a RuntimeError. If "string" is shorter than 16 characters, it is to pad the leftmost digit of the given string onto the start of the numpy array until the resulting array contains exactly 16 digits. The given string defaults to '0' and accepts an empty string, treating an empty string as '0'. The code I have written for init is as follows (in "hw4.py", the main file for this assignment):
import numpy as np
class Binary:
    def __init__(self, string='0'):
        if not string:
            string = '0'
        else:
            if len(string) > 16:
                raise RuntimeError
            else:
                for i in string:
                    if i != '1' and i != '0':
                        raise RuntimeError
        if len(string) == 16:
            int_arr = np.array(tuple(string))
            self.bit_array = int_arr
        else:
            int_arr_inc = np.array(tuple(string))
            if int_arr_inc[0] == 0:
                pad = np.zeros((16 - len(int_arr_inc)), int)
            else:
                pad = np.ones((16 - len(int_arr_inc)), int)
            self.bit_array = np.concatenate((pad, int_arr_inc))
There is also an eq overloaded method intended to compare the bit_array of another "Binary" object in this class (this passes the spec's test case so there shouldn't be any changes required here, just more for being transparent):
    def __eq__(self, other):
        if str(self.bit_array) == str(other.bit_array):
            return True
        else:
            return False
The spec requires that there is no use of list() or any form of lists at all in the init method. The method is not to return anything, and its only purpose is to build a Binary object whose bit_array holds the string parameter converted into a numpy integer array. The tests I have to pass include the following (in a test file "hw4_test.py"):
from hw4 import Binary
import unittest, numpy as np
class TestBinary(unittest.TestCase):
    def setUp(self):
        self.bin_0 = Binary("")
        self.bin_1 = Binary("01")
        self.bin_2 = Binary("010")
    def test_init(self):
        self.assertTrue(all(np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) == self.bin_0.bit_array))
        self.assertTrue(all(np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]) == self.bin_1.bit_array))
        self.assertTrue(all(np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]) == self.bin_2.bit_array))
However, the first test case here fails, giving me:
Traceback (most recent call last):
  File "C:\Users\17606\Desktop\ISTA-350\hw4\hw4_test.py", line 30, in test_init
    self.assertTrue(all(np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) == self.bin_0.bit_array))
TypeError: 'bool' object is not iterable
I've tried many ways of creating the numpy arrays, such as fromstring (which is deprecated anyways) and trying with a map of the parameter string, to no avail. As the test case code above shows, I am expecting
Binary("")
To result in a Binary object containing a bit_array of a numpy array like
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
But the TypeError above keeps occurring. Any ideas? I am having trouble understanding where I am even iterating over anything that could be a bool type. I am using Python 3.10 on Windows 10 with PyCharm as my IDE.
P.S
This is for a school course and I am not very concerned about conventions or efficiency for this. Just syntax and logic :)
Thanks!
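A likely cause, offered as a guess rather than a confirmed diagnosis: np.array(tuple(string)) builds an array of one-character strings, not integers, so int_arr_inc[0] == 0 is never true and the concatenated bit_array ends up with a string dtype. Some NumPy versions answer a comparison between an integer array and a string array with a single False instead of an elementwise result, and all() cannot iterate over that. A minimal sketch of a constructor that converts each character to int first (the error message and the use of np.full are my own choices, not part of the spec):

```python
import numpy as np


class Binary:
    def __init__(self, string='0'):
        if not string:
            string = '0'
        if len(string) > 16 or any(ch not in '01' for ch in string):
            raise RuntimeError('expected at most 16 characters of 0 and 1')
        # Convert each character to int so the array has an integer dtype
        # instead of a string dtype.
        bits = np.array(tuple(int(ch) for ch in string), dtype=int)
        # Pad on the left with copies of the leftmost digit.
        pad = np.full(16 - len(bits), bits[0], dtype=int)
        self.bit_array = np.concatenate((pad, bits))


print(Binary("01").bit_array)
```

No list is created anywhere in the method; the characters go through a tuple, as in the original code.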

I have a Python list; using the map function omits the first zero of the list

I have this code in Python. When I print the last line, the output is "11100101100", but I expect "011100101100". Notice that the output starts with 1 and not 0, although the variable gamma_sum_list is a list of 12 digits that starts with 0. The function somehow drops the first zero automatically. The following is the code, followed by the exact gamma_sum_list:
def convert(list):
    res = int("".join(map(str,list)))
    return res
print(convert(gamma_sum_list))
Input:
[0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
Expected Output:
011100101100
Actual Output :
11100101100
Your issue is caused by converting the result of the join operation to an integer. Integers do not have leading zeroes. If you remove the int function you'll get a string with the leading zero you're after.
gamma_sum_list = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
def convert(my_list):
    res = "".join(map(str,my_list))
    return res
print(convert(gamma_sum_list))
Output:
011100101100
def convert(some_list):
    res = "".join(map(str,some_list))
    return res
gamma_sum_list = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
print(convert(gamma_sum_list))
or
conv = lambda x: ''.join(map(str, x))
print(conv(gamma_sum_list))
Consider that:
>>> "".join(list(map(str, [0, 1])))
'01'
How would you convert '01' to an integer? Well, it's just 1.
>>> int("".join(list(map(str, [0, 1]))))
1
So you probably want to not convert the string to an int, just keep it as a str.
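If an integer is needed at some point anyway, the leading zeros can be restored afterwards as long as the intended width is known. A small sketch of both directions; digits_to_string and restore_width are hypothetical helper names, and the width is assumed to be the length of the original list:

```python
def digits_to_string(digits):
    """Join digits into text, keeping any leading zeros."""
    return "".join(map(str, digits))


def restore_width(number, width):
    """Re-pad an integer with leading zeros up to the given width."""
    return str(number).zfill(width)


gamma_sum_list = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]

as_text = digits_to_string(gamma_sum_list)
as_int = int(as_text)  # the leading zero is lost here
padded = restore_width(as_int, len(gamma_sum_list))

print(as_text)  # 011100101100
print(as_int)   # 11100101100
print(padded)   # 011100101100
```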

How to retain values appended to a python list

Please, I need help with this code. I want mylist to retain the values appended to it the next time the function 'no_repeat_rule' is called. I'm pretty new to Python. My code is below:
def no_repeat_rule(play_board):
    mylist = list()
    if seeds_left(play_board) == 2 and sum(play_board[:6])== 1:
        mylist.append(play_board)
    return mylist
Output of this code (in part) is:
...
[[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
Player 1 chose cup 0
[[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
Player 2 chose cup 6
[[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
...
What I want the function 'no_repeat_rule' to do is grow mylist each time a player plays. I don't know if this is clear enough to get help.
The simplest thing to do would be to add another parameter to the function definition, so that it looks like:
def no_repeat_rule(play_board, aList):
Before you call the function, declare a list outside of it. Pass that list as an argument whenever you call the function, and assign the return value back to it. For instance:
x = list()
def no_repeat_rule(play_board, aList):
    myList = aList
    if seeds_left(play_board) == 2 and sum(play_board[:6])== 1:
        myList.append(play_board)
    return myList
x = no_repeat_rule(someBoardHere, x)
I believe this should work if I understand what you're asking. If not, please respond and I'll try something else.
What you need is an object associated with the function itself, which is called a function attribute. It is very handy in Python.
Your code may look like this:
def no_repeat_rule(play_board):
    if not hasattr(no_repeat_rule,"mylist"):
        no_repeat_rule.mylist = []
    if seeds_left(play_board) == 2 and sum(play_board[:6])== 1:
        no_repeat_rule.mylist.append(play_board)
    return no_repeat_rule.mylist
I couldn't check this code, but it should work for local attributes. BTW, it is for Python 2.7.
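A third option, sketched here as an alternative to both suggestions above, is to keep the history in a small object so the state is explicit. NoRepeatTracker and record are hypothetical names, and seeds_left below is only a stand-in for the asker's helper (assumed to return the total number of seeds on the board):

```python
def seeds_left(play_board):
    # Stand-in for the asker's helper; assumed to count all seeds on the board.
    return sum(play_board)


class NoRepeatTracker:
    """Remembers every qualifying board across calls."""

    def __init__(self):
        self.history = []

    def record(self, play_board):
        if seeds_left(play_board) == 2 and sum(play_board[:6]) == 1:
            # Store a copy so later changes to the board do not alter history.
            self.history.append(list(play_board))
        return self.history


tracker = NoRepeatTracker()
tracker.record([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
tracker.record([0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
print(len(tracker.history))  # 2
```

Storing a copy matters if the game mutates play_board in place; otherwise every entry in the history would point at the same list.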

How to convert a string to list using python?

I am working with RC-522 RFID Reader for my project. I want to use it for paying transportation fee. I am using python and used the code in: https://github.com/mxgxw/MFRC522-python.git
In the Python script Read.py, sector 8 is read with this code:
# Check if authenticated
if status == MIFAREReader.MI_OK:
    MIFAREReader.MFRC522_Read(8)  # prints sector 8
    MIFAREReader.MFRC522_StopCrypto1()
else:
    print "Authentication error"
The output of this was:
Sector 8 [100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
I converted that last part (Sector 8 [100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) to a string. I want it to be a list, but I can't get there: I put it in a variable x and used x.split(), but print x then shows "None".
x = str(MIFAREReader.MFRC522_Read(8))
x = x.split()
print x # prints ['None']
I want it to be like this:
DATA = [100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
so that I can use the sum(DATA) to check for balance, and I can access it using indexes like DATA[0]
Thanks a lot!!
Follow these steps:
Open MFRC522.py (the module for the RFID reader):
vi MFRC522.py
Look for the function
def MFRC522_Read(self, blockAddr):
Add the line return backData at the end of the function.
Save it.
In the Read.py script, call it like this:
DATA=(MIFAREReader.MFRC522_Read(8))
print 'DATA :',DATA
I hope this solves the problem.
You can use .split(",") to specify the delimiter ",".
Something like this:
input_string = "[100, 234, 0, 0, 567, 0, 0, 0, 3, 0, 235, 0, 0, 12, 0, 0]"
listed_string = input_string[1:-1].split(",")
total = 0  # avoid shadowing the built-in sum()
for item in listed_string:
    total += int(item)
print(total)
prints
1151
In line with Moutch's answer, using a list comprehension:
input='[100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]'
DATA = [int(item) for item in input[1:-1].split(',')]
print(sum(DATA))
If the data string is the entire output of Read.py:
input="""Card read UID: 67,149,225,43
Size: 8
Sector 8 [100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"""
#find index position of 'Sector' text and select from this using slices.
inputn = input[input.index('Sector')+9:]
DATA = [int(item) for item in inputn[1:-1].split(',')]
print(DATA)
print(sum(DATA))
If you have some guarantee about the source and nature of the data in that list (and you know the format will always be the same), Python's eval would work. For example:
original_string = 'Sector 8 [100, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]'
data_start_index = original_string.index('[') # find '['
data_string = original_string[data_start_index:] # extract the list
data = eval(data_string)
print(type(data)) # <class 'list'>
print(sum(data)) # 101
If you don't have these guarantees, you'll have to use the split method as suggested by Moutch, due to the fragility and exploitability of eval - it blindly executes whatever (potentially malicious) code is passed to it.
Edit: Use ast.literal_eval instead of plain old eval for safety guarantees. This still requires that the formatting of the string be consistent (e.g., that it always have square brackets) in order to properly evaluate to a Python list.
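A minimal sketch of the ast.literal_eval variant, assuming the line always contains exactly one bracketed list; parse_sector_line is a hypothetical helper name:

```python
import ast


def parse_sector_line(line):
    """Extract the bracketed list from a line like 'Sector 8 [100, 0, ...]'."""
    start = line.index('[')
    end = line.rindex(']') + 1
    data = ast.literal_eval(line[start:end])  # only evaluates Python literals
    if not isinstance(data, list):
        raise ValueError('expected a list literal')
    return data


original_string = 'Sector 8 [100, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]'
data = parse_sector_line(original_string)
print(data[0])    # 100
print(sum(data))  # 101
```

Unlike eval, literal_eval raises an exception for anything that is not a plain literal, so a malformed or hostile string fails loudly instead of being executed.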

Unicode as String without conversion Python

I'm trying to convert unicode text to string literally, but I don't seem to find a way to do this.
input= u'/123/123/123'
convert to string:
output="/123/123/123"
If I try to do str(), it will encode it and if I try to loop over the text and convert letter by letter, it will give me each one of the unicode characters.
EDIT: Take into consideration that the objective is not to convert the string but to take the letters in the unicode text and create a string. If I follow the link provided in the comment:
Convert a Unicode string to a string in Python (containing extra symbols)
import unicodedata
unicodedata.normalize('NFKD', input).encode('ascii','ignore')
output='SSS'
As you can see, that is not the expected output.
Edit: I wrote the unicode u'/123' as an example, but I'm trying to convert Chinese characters, for example:
a=u'\u6c34'
str(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u6c34' in position 0: ordinal not in range(128)
output_expected="\u6c34"
I've tried to convert it with str() as you mention in your question, and it does work for me. You can check the encoding with type().
>>> input= u'/123/123/123'
>>> type(input)
<type 'unicode'>
>>> output=str(input)
>>> print output
/123/123/123
>>> type(output)
<type 'str'>
How do you try to iterate among the letters? I've tried and they are still as a string. You could convert the input first and then do whatever you want once they are str:
>>> letters = [x for x in output]
>>> for letter in letters:
...     print type(letter)
...
I hope it helps!
Here's how to do it the easy way:
>>> a=u'\x83\u6c34\U00103ABC'
>>> a.encode('unicode_escape')
'\\x83\\u6c34\\U00103abc'
>>> print a.encode('unicode_escape')
\x83\u6c34\U00103abc
Here's how to do it the hard way.
ascii_printable = set(unichr(i) for i in range(0x20, 0x7f))
def convert(ch):
    if ch in ascii_printable:
        return ch
    ix = ord(ch)
    if ix < 0x100:
        return '\\x%02x' % ix
    elif ix < 0x10000:
        return '\\u%04x' % ix
    return '\\U%08x' % ix
output = ''.join(convert(ch) for ch in input)
For Python 3 use chr instead of unichr.
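In Python 3 every str is already Unicode, so the "easy way" needs one extra step to get text back from the encoded bytes. A short sketch (escape_non_ascii is a hypothetical name); note that the unicode_escape codec also escapes backslashes and control characters such as newlines:

```python
def escape_non_ascii(text):
    """Return text with non-ASCII characters written as backslash escapes."""
    return text.encode('unicode_escape').decode('ascii')


print(escape_non_ascii('\u6c34'))        # \u6c34
print(escape_non_ascii('/123/123/123'))  # /123/123/123
```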
Somebody wrote a really complete piece of code for doing this (Python 2), so cool. Source:
import unicodedata
def fix_bad_unicode(text):
    if not isinstance(text, unicode):
        raise TypeError("This isn't even decoded into Unicode yet. "
                        "Decode it first.")
    if len(text) == 0:
        return text
    maxord = max(ord(char) for char in text)
    tried_fixing = []
    if maxord < 128:
        # Hooray! It's ASCII!
        return text
    else:
        attempts = [(text, text_badness(text) + len(text))]
        if maxord < 256:
            tried_fixing = reinterpret_latin1_as_utf8(text)
            tried_fixing2 = reinterpret_latin1_as_windows1252(text)
            attempts.append((tried_fixing, text_cost(tried_fixing)))
            attempts.append((tried_fixing2, text_cost(tried_fixing2)))
        elif all(ord(char) in WINDOWS_1252_CODEPOINTS for char in text):
            tried_fixing = reinterpret_windows1252_as_utf8(text)
            attempts.append((tried_fixing, text_cost(tried_fixing)))
        else:
            # We can't imagine how this would be anything but valid text.
            return text
        # Sort the results by badness
        attempts.sort(key=lambda x: x[1])
        #print attempts
        goodtext = attempts[0][0]
        if goodtext == text:
            return goodtext
        else:
            return fix_bad_unicode(goodtext)
def reinterpret_latin1_as_utf8(wrongtext):
    newbytes = wrongtext.encode('latin-1', 'replace')
    return newbytes.decode('utf-8', 'replace')
def reinterpret_windows1252_as_utf8(wrongtext):
    altered_bytes = []
    for char in wrongtext:
        if ord(char) in WINDOWS_1252_GREMLINS:
            altered_bytes.append(char.encode('WINDOWS_1252'))
        else:
            altered_bytes.append(char.encode('latin-1', 'replace'))
    return ''.join(altered_bytes).decode('utf-8', 'replace')
def reinterpret_latin1_as_windows1252(wrongtext):
    return wrongtext.encode('latin-1').decode('WINDOWS_1252', 'replace')
def text_badness(text):
    assert isinstance(text, unicode)
    errors = 0
    very_weird_things = 0
    weird_things = 0
    prev_letter_script = None
    for pos in xrange(len(text)):
        char = text[pos]
        index = ord(char)
        if index < 256:
            weird_things += SINGLE_BYTE_WEIRDNESS[index]
            if SINGLE_BYTE_LETTERS[index]:
                prev_letter_script = 'latin'
            else:
                prev_letter_script = None
        else:
            category = unicodedata.category(char)
            if category == 'Co':
                # Unassigned or private use
                errors += 1
            elif index == 0xfffd:
                # Replacement character
                errors += 1
            elif index in WINDOWS_1252_GREMLINS:
                lowchar = char.encode('WINDOWS_1252').decode('latin-1')
                weird_things += SINGLE_BYTE_WEIRDNESS[ord(lowchar)] - 0.5
            if category.startswith('L'):
                name = unicodedata.name(char)
                scriptname = name.split()[0]
                freq, script = SCRIPT_TABLE.get(scriptname, (0, 'other'))
                if prev_letter_script:
                    if script != prev_letter_script:
                        very_weird_things += 1
                    if freq == 1:
                        weird_things += 2
                    elif freq == 0:
                        very_weird_things += 1
                prev_letter_script = script
            else:
                prev_letter_script = None
    return 100 * errors + 10 * very_weird_things + weird_things
def text_cost(text):
    """
    Assign a cost function to the length plus weirdness of a text string.
    """
    return text_badness(text) + len(text)
WINDOWS_1252_GREMLINS = [
# adapted from http://effbot.org/zone/unicode-gremlins.htm
0x0152, # LATIN CAPITAL LIGATURE OE
0x0153, # LATIN SMALL LIGATURE OE
0x0160, # LATIN CAPITAL LETTER S WITH CARON
0x0161, # LATIN SMALL LETTER S WITH CARON
0x0178, # LATIN CAPITAL LETTER Y WITH DIAERESIS
0x017E, # LATIN SMALL LETTER Z WITH CARON
0x017D, # LATIN CAPITAL LETTER Z WITH CARON
0x0192, # LATIN SMALL LETTER F WITH HOOK
0x02C6, # MODIFIER LETTER CIRCUMFLEX ACCENT
0x02DC, # SMALL TILDE
0x2013, # EN DASH
0x2014, # EM DASH
0x201A, # SINGLE LOW-9 QUOTATION MARK
0x201C, # LEFT DOUBLE QUOTATION MARK
0x201D, # RIGHT DOUBLE QUOTATION MARK
0x201E, # DOUBLE LOW-9 QUOTATION MARK
0x2018, # LEFT SINGLE QUOTATION MARK
0x2019, # RIGHT SINGLE QUOTATION MARK
0x2020, # DAGGER
0x2021, # DOUBLE DAGGER
0x2022, # BULLET
0x2026, # HORIZONTAL ELLIPSIS
0x2030, # PER MILLE SIGN
0x2039, # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x203A, # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x20AC, # EURO SIGN
0x2122, # TRADE MARK SIGN
]
# a list of Unicode characters that might appear in Windows-1252 text
WINDOWS_1252_CODEPOINTS = range(256) + WINDOWS_1252_GREMLINS
# Rank the characters typically represented by a single byte -- that is, in
# Latin-1 or Windows-1252 -- by how weird it would be to see them in running
# text.
#
# 0 = not weird at all
# 1 = rare punctuation or rare letter that someone could certainly
# have a good reason to use. All Windows-1252 gremlins are at least
# weirdness 1.
# 2 = things that probably don't appear next to letters or other
# symbols, such as math or currency symbols
# 3 = obscure symbols that nobody would go out of their way to use
# (includes symbols that were replaced in ISO-8859-15)
# 4 = why would you use this?
# 5 = unprintable control character
#
# The Portuguese letter à (0xc3) is marked as weird because it would usually
# appear in the middle of a word in actual Portuguese, and meanwhile it
# appears in the mis-encodings of many common characters.
SINGLE_BYTE_WEIRDNESS = (
# 0 1 2 3 4 5 6 7 8 9 a b c d e f
5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 5, 5, 5, 5, 5, # 0x00
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, # 0x10
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, # 0x20
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, # 0x30
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, # 0x40
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, # 0x50
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, # 0x60
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, # 0x70
2, 5, 1, 4, 1, 1, 3, 3, 4, 3, 1, 1, 1, 5, 1, 5, # 0x80
5, 1, 1, 1, 1, 3, 1, 1, 4, 1, 1, 1, 1, 5, 1, 1, # 0x90
1, 0, 2, 2, 3, 2, 4, 2, 4, 2, 2, 0, 3, 1, 1, 4, # 0xa0
2, 2, 3, 3, 4, 3, 3, 2, 4, 4, 4, 0, 3, 3, 3, 0, # 0xb0
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, # 0xc0
1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, # 0xd0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, # 0xe0
1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, # 0xf0
)
# Pre-cache the Unicode data saying which of these first 256 characters are
# letters. We'll need it often.
SINGLE_BYTE_LETTERS = [
unicodedata.category(unichr(i)).startswith('L')
for i in xrange(256)
]
# A table telling us how to interpret the first word of a letter's Unicode
# name. The number indicates how frequently we expect this script to be used
# on computers. Many scripts not included here are assumed to have a frequency
# of "0" -- if you're going to write in Linear B using Unicode, you're
# probably aware enough of encoding issues to get it right.
#
# The lowercase name is a general category -- for example, Han characters and
# Hiragana characters are very frequently adjacent in Japanese, so they all go
# into category 'cjk'. Letters of different categories are assumed not to
# appear next to each other often.
SCRIPT_TABLE = {
'LATIN': (3, 'latin'),
'CJK': (2, 'cjk'),
'ARABIC': (2, 'arabic'),
'CYRILLIC': (2, 'cyrillic'),
'GREEK': (2, 'greek'),
'HEBREW': (2, 'hebrew'),
'KATAKANA': (2, 'cjk'),
'HIRAGANA': (2, 'cjk'),
'HIRAGANA-KATAKANA': (2, 'cjk'),
'HANGUL': (2, 'cjk'),
'DEVANAGARI': (2, 'devanagari'),
'THAI': (2, 'thai'),
'FULLWIDTH': (2, 'cjk'),
'MODIFIER': (2, None),
'HALFWIDTH': (1, 'cjk'),
'BENGALI': (1, 'bengali'),
'LAO': (1, 'lao'),
'KHMER': (1, 'khmer'),
'TELUGU': (1, 'telugu'),
'MALAYALAM': (1, 'malayalam'),
'SINHALA': (1, 'sinhala'),
'TAMIL': (1, 'tamil'),
'GEORGIAN': (1, 'georgian'),
'ARMENIAN': (1, 'armenian'),
'KANNADA': (1, 'kannada'), # mostly used for looks of disapproval
'MASCULINE': (1, 'latin'),
'FEMININE': (1, 'latin')
}
Then you just call the method:
fix_bad_unicode(u'aあä')
>> u'a\u3042\xe4'
