Unicodedata.normalize() ValueError: invalid normalization form


I'm trying to take foreign language text and output a human-readable, filename-safe equivalent. After looking around, it seems like the best option is unicodedata.normalize(), but I can't get it to work. I've tried putting the exact code from some answers here and elsewhere, but it keeps giving me this error. I only got one success, when I ran:
unicodedata.normalize('NFD', '\u00C7')
'C\u0327'
But every other time, I get an error. Here's the code I've tried:
unicodedata.normalize('NFKD', u'\u2460') #error, not sure why. Looks the same as above.
s = 'ذهب الرجل'
unicodedata.normalize('NKFC',s) #error
unicodedata.normalize('NKFD', 'ñ') #error
Specifically, the error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid normalization form
I don't understand why this isn't working. All of these are strings, which means they are unicode in Python 3. I tried encoding them using .encode(), but then normalize() said it only takes arguments of string, so I know that can't be it. I'm seriously at a loss because even code I'm copying from here seems to error out. What's going on here?

Looking at unicodedata.c, the only way you can get that error is if you enter an invalid form string. The valid values are "NFC", "NFKC", "NFD", and "NFKD", but you seem to be using values with the "F" and "K" switched around:
>>> import unicodedata
>>>
>>> unicodedata.normalize('NKFD', 'ñ')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid normalization form
>>>
>>> unicodedata.normalize('NFKD', 'ñ')
'ñ'
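To catch a transposed form name earlier, one option is a small wrapper that checks the form against the four valid names before calling unicodedata.normalize(). This is only a sketch; safe_normalize and VALID_FORMS are hypothetical names, not part of the unicodedata module.

```python
import unicodedata

# The four form names accepted by unicodedata.normalize().
VALID_FORMS = ("NFC", "NFKC", "NFD", "NFKD")

def safe_normalize(form, text):
    """Hypothetical helper: report a misspelled form with a clearer message."""
    if form not in VALID_FORMS:
        raise ValueError(
            "unknown form {!r}; expected one of {}".format(form, ", ".join(VALID_FORMS))
        )
    return unicodedata.normalize(form, text)

# U+00F1 (n with tilde) decomposes into 'n' plus U+0303 COMBINING TILDE.
decomposed = safe_normalize("NFKD", "\u00f1")
print([hex(ord(ch)) for ch in decomposed])  # ['0x6e', '0x303']

# U+2460 (circled digit one) has a compatibility decomposition to '1'.
circled = safe_normalize("NFKD", "\u2460")
print(circled)  # 1

try:
    safe_normalize("NKFD", "\u00f1")
except ValueError as exc:
    message = str(exc)
    print(message)
```

The decomposed result still displays as a single character in most terminals, which is why the output above looks unchanged even though it is now two code points.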

Related

How to check if a chr()'s output will be undefined

I'm using chr() to run through a list of Unicode characters, but whenever it comes across a character that is unassigned, it just keeps running and doesn't raise an error. How do I check whether the output of chr() will be undefined?
For example,
print(chr(55396))
is in the Unicode range; it's just an unassigned character. How do I check whether chr() will give me an actual character, so that this hang-up doesn't occur?

You could use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(chr(55396))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name(chr(120))
'LATIN SMALL LETTER X'
>>>
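If catching the exception is inconvenient, unicodedata.name() also accepts a default value, and unicodedata.category() reports the general category of a character. The sketch below uses both; describe and is_assigned are hypothetical helper names. Note that 55396 is U+D864, which falls in the surrogate range (category Cs) rather than being unassigned (category Cn).

```python
import unicodedata

def describe(code_point):
    """Return (name or None, general category) for a code point."""
    ch = chr(code_point)
    return unicodedata.name(ch, None), unicodedata.category(ch)

def is_assigned(code_point):
    """Hypothetical check: False for unassigned (Cn) and surrogate (Cs) code points."""
    return unicodedata.category(chr(code_point)) not in ("Cn", "Cs")

print(describe(120))     # ('LATIN SMALL LETTER X', 'Ll')
print(describe(55396))   # (None, 'Cs')
print(is_assigned(120), is_assigned(55396))  # True False
```

Checking the category may be the safer test: control characters such as chr(10) have no name but are assigned (category Cc), so a name lookup alone would report them as missing.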

How can I solve this error? (Python crawling)

I have been crawling Flickr data for 2 weeks.
The crawling had been working well.
But today, when I ran the Python code in Windows PowerShell, this error occurred:
Traceback (most recent call last):
File "getdata_tag.py", line 3, in <module>
nsid= info["owner"]["nsid"];
TypeError: string indices must be integers, not str
How can I modify this code?
I will add the code here

It looks like either info["owner"] or info itself is a string, not a dictionary.
Check which case applies, then remove ["owner"]["nsid"] if info is a string, or only ["nsid"] if info["owner"] is a string.
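One way to find out which case applies is to test the types with isinstance() before indexing. The following is a minimal sketch; extract_nsid and the sample values are hypothetical, since the actual shape of the Flickr response is not shown in the question.

```python
def extract_nsid(info):
    """Hypothetical helper covering the three cases described above."""
    if isinstance(info, str):
        # info itself is a string: use it without any indexing.
        return info
    owner = info["owner"]
    if isinstance(owner, str):
        # info["owner"] is a string: drop the ["nsid"] lookup.
        return owner
    # Both levels are dictionaries: the original lookup works.
    return owner["nsid"]

# Made-up sample values, for illustration only.
print(extract_nsid({"owner": {"nsid": "12345@N00"}}))  # 12345@N00
print(extract_nsid({"owner": "12345@N00"}))            # 12345@N00
print(extract_nsid("12345@N00"))                       # 12345@N00
```

Printing type(info) and type(info["owner"]) just before the failing line would give the same diagnosis without changing the code's behaviour.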

Why am I getting "TypeError: non-empty format string passed to object.__format__" [duplicate]

I hit this TypeError exception recently, which I found very difficult to debug. I eventually reduced it to this small test case:
>>> "{:20}".format(b"hi")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: non-empty format string passed to object.__format__
This is very non-obvious, to me anyway. The workaround for my code was to decode the byte string into unicode:
>>> "{:20}".format(b"hi".decode("ascii"))
'hi                  '
What is the meaning of this exception? Is there a way it can be made more clear?

bytes objects do not have a __format__ method of their own, so the default from object is used:
>>> bytes.__format__ is object.__format__
True
>>> '{:20}'.format(object())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: non-empty format string passed to object.__format__
It just means that you cannot use anything other than straight up, unformatted unaligned formatting on these. Explicitly convert to a string object (as you did by decoding bytes to str) to get format spec support.
You can make the conversion explicit by using the !s string conversion:
>>> '{!s:20s}'.format(b"Hi")
"b'Hi'               "
>>> '{!s:20s}'.format(object())
'<object object at 0x1100b9080>'
object.__format__ explicitly rejects format strings to avoid implicit string conversions, specifically because formatting instructions are type specific.
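For your own classes, the same rule means a format spec only works once the class defines __format__. Here is a small sketch with two hypothetical classes, Plain and Label; Label simply delegates to str formatting, which is one possible choice rather than the required one.

```python
class Plain:
    """No __format__ of its own, so object.__format__ is used."""

class Label:
    """Hypothetical type that opts in to string-style format specs."""
    def __init__(self, text):
        self.text = text

    def __format__(self, spec):
        # Delegate to str formatting so width and alignment work.
        return format(self.text, spec)

padded = "{:>6}".format(Label("hi"))
print(repr(padded))  # '    hi'

try:
    "{:>6}".format(Plain())
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```

The exact wording of the TypeError differs between Python versions, so the sketch checks only the exception type.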

This also happens when trying to format None:
>>> '{:.0f}'.format(None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: non-empty format string passed to object.__format__
That took a moment to work out (in my case, when None was being returned by an instance variable)!
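A guard before formatting avoids the exception when a value may be None. This is a sketch only; format_or_placeholder and its default placeholder text are hypothetical choices.

```python
def format_or_placeholder(value, spec=".0f", placeholder="n/a"):
    """Hypothetical helper: format value, or return placeholder for None."""
    if value is None:
        return placeholder
    return format(value, spec)

print(format_or_placeholder(3.7))           # 4
print(format_or_placeholder(None))          # n/a
print(format_or_placeholder(2.5, ".2f"))    # 2.50
```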

printing numpy timedelta64 with format()

I would like to print a numpy.timedelta64() value in a formatted way. The direct method works well:
>>> import numpy as np
>>> print np.timedelta64(10,'m')
10 minutes
Which I guess comes from the __str__() method
>>> np.timedelta64(10,'m').__str__()
'10 minutes'
But when I try to print it with the format() function I get the following error:
>>> print "my delta is : {delta}".format(delta=np.timedelta64(10,'m'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: don't know how to convert scalar number to long
I would like to understand the underlying mechanism of the "string".format() function, and why it doesn't work in this particular case.

According to the Format String Syntax documentation:
The conversion field causes a type coercion before formatting.
Normally, the job of formatting a value is done by the __format__()
method of the value itself. However, in some cases it is desirable to
force a type to be formatted as a string, overriding its own
definition of formatting. By converting the value to a string before
calling __format__(), the normal formatting logic is bypassed.
>>> np.timedelta64(10,'m').__format__('')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: don't know how to convert scalar number to long
By appending !s conversion flag, you can force it to use str:
>>> "my delta is : {delta!s}".format(delta=np.timedelta64(10,'m'))
'my delta is : 10 minutes'
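The same mechanism can be shown without NumPy. In the sketch below, Duration is a hypothetical stand-in whose __format__ always raises, not the real numpy.timedelta64; it only illustrates that !s calls str() first, so the object's own __format__ is never reached.

```python
class Duration:
    """Hypothetical stand-in for a value whose __format__ fails."""
    def __init__(self, minutes):
        self.minutes = minutes

    def __str__(self):
        return "{} minutes".format(self.minutes)

    def __format__(self, spec):
        raise TypeError("cannot format this value")

delta = Duration(10)

try:
    "my delta is : {delta}".format(delta=delta)
    failed = False
except TypeError:
    failed = True

# With !s the value is converted by str() and the result is formatted as a str.
converted = "my delta is : {delta!s}".format(delta=delta)
aligned = "[{delta!s:>12}]".format(delta=delta)
print(failed)     # True
print(converted)  # my delta is : 10 minutes
print(aligned)    # [  10 minutes]
```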

falsetru mentions one aspect of the problem. The other is why this errors at all.
Looking at the code for __format__, we see that it is a generic implementation.
The important part is:
else if (PyArray_IsScalar(self, Integer)) {
#if defined(NPY_PY3K)
obj = Py_TYPE(self)->tp_as_number->nb_int(self);
#else
obj = Py_TYPE(self)->tp_as_number->nb_long(self);
#endif
}
This triggers, and tries to run:
int(numpy.timedelta64(10, "m"))
but Numpy (rightly) says that you can't convert a number with units to a raw number.
This looks like a bug.

%s should be fine. It calls str() on the object.

How to format a write statement in Python?

I have data that I want to print to file. For missing data, I wish to print the mean of the actual data. However, the mean is calculated to more than the required 4 decimal places. How can I write the mean to file and format it at the same time?
I have tried the following, but keep getting errors:
outfile.write('{0:%.3f}'.format(str(mean))+"\n")

First, remove the % since it makes your format syntax invalid. See a demonstration below:
>>> '{:%.3f}'.format(1.2345)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Invalid conversion specification
>>> '{:.3f}'.format(1.2345)
'1.234'
>>>
Second, don't put mean in str since str.format is expecting a float (that's what the f in the format syntax represents). Below is a demonstration of this bug:
>>> '{:.3f}'.format('1.2345')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Unknown format code 'f' for object of type 'str'
>>> '{:.3f}'.format(1.2345)
'1.234'
>>>
Third, the +"\n" is unnecessary since you can put the "\n" in the string you used on str.format.
Finally, as shown in my demonstrations, you can remove the 0 since it is redundant.
In the end, the code should be like this:
outfile.write('{:.3f}\n'.format(mean))
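Put together with the missing-data case from the question, this could look like the sketch below. The readings list is made-up sample data, None is assumed to mark a missing value, and io.StringIO stands in for the real output file.

```python
import io
from statistics import mean

# Made-up sample data; None marks a missing value (an assumption).
readings = [1.0, None, 2.5, None, 3.25]

# Mean of the values that are actually present.
present = [value for value in readings if value is not None]
average = mean(present)

outfile = io.StringIO()  # stand-in for an open file
for value in readings:
    # Substitute the mean for missing entries, then format to 3 decimals.
    outfile.write('{:.3f}\n'.format(average if value is None else value))

written = outfile.getvalue()
print(written)
```

For the four decimal places mentioned in the question, the spec would be '{:.4f}' instead.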

You don't need to convert to string using str(). Also, the "%" is not required. Just use:
outfile.write('{0:.3f}'.format(mean)+"\n")

First of all, the formatting of your string has nothing to do with your write statement. You can reduce your problem to:
string = '{0:%.3f}'.format(str(mean))+"\n"
outfile.write(string)
Then, your string specification is incorrect and should be:
string = '{0:.3f}\n'.format(mean)

outfile.write('{:.3f}\n'.format(mean))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
