I've very recently migrated to Python 3.5.
This code was working properly in Python 2.7:
with open(fname, 'rb') as f:
    lines = [x.strip() for x in f.readlines()]
for line in lines:
    tmp = line.strip().lower()
    if 'some-pattern' in tmp: continue
    # ... code
But in 3.5, on the if 'some-pattern' in tmp: continue line, I get an error which says:
TypeError: a bytes-like object is required, not 'str'
I was unable to fix the problem using .decode() on either side of the in, nor could I fix it using
if tmp.find('some-pattern') != -1: continue
What is wrong, and how do I fix it?
You opened the file in binary mode:
with open(fname, 'rb') as f:
This means that all data read from the file is returned as bytes objects, not str. You cannot then use a string in a containment test:
if 'some-pattern' in tmp: continue
You'd have to use a bytes object to test against tmp instead:
if b'some-pattern' in tmp: continue
or open the file as a text file instead by replacing the 'rb' mode with 'r'.
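A minimal sketch of both fixes, assuming a small temporary file written only for this demonstration (the file contents and the variable names kept_bytes and kept_text are made up here, and UTF-8 is an assumed encoding):

```python
import os
import tempfile

# Hypothetical sample data, written only for this demonstration.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp_file:
    tmp_file.write(b'keep this line\nSkip Some-Pattern here\n')
    fname = tmp_file.name

# Fix 1: stay in binary mode and compare bytes with bytes.
with open(fname, 'rb') as f:
    kept_bytes = [line.strip() for line in f
                  if b'some-pattern' not in line.lower()]

# Fix 2: open in text mode and compare str with str.
with open(fname, 'r', encoding='utf-8') as f:
    kept_text = [line.strip() for line in f
                 if 'some-pattern' not in line.lower()]

os.remove(fname)
print(kept_bytes)  # [b'keep this line']
print(kept_text)   # ['keep this line']
```

Either variant works as long as both operands of in have the same type.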
You can encode your string by using .encode()
Example:
'Hello World'.encode()
As the error describes, in order to write a string to a file opened in binary mode you need to encode it to a bytes-like object first; encode() converts the str to a bytes object.
Like it has been already mentioned, you are reading the file in binary mode and then creating a list of bytes. In your following for loop you are comparing string to bytes and that is where the code is failing.
Decoding the bytes while adding to the list should work. The changed code should look as follows:
with open(fname, 'rb') as f:
    lines = [x.decode('utf8').strip() for x in f.readlines()]
The separate bytes type was introduced in Python 3, which is why your code worked in Python 2. In Python 2, bytes was only an alias for str:
>>> s=bytes('hello')
>>> type(s)
<type 'str'>
You have to change from wb to w:
def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'wb'))
    self.myCsv.writerow(['title', 'link'])
to
def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'w'))
    self.myCsv.writerow(['title', 'link'])
After changing this, the error disappears, but you can't write to the file (in my case). So after all, I don't have an answer?
Source: How to remove ^M
Changing to 'rb' brings me the other error: io.UnsupportedOperation: write
Call encode() on the hard-coded string literal (the value in single quotes) so that it becomes a bytes object.
Example:
file.write(answers[i] + '\n'.encode())
Or
line.split(' +++$+++ '.encode())
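Note that in both lines the .encode() call applies only to the string literal, not to the whole expression. A small sketch, assuming the other operand is already bytes (the variable answer is hypothetical):

```python
answer = b'42'  # hypothetical bytes value, e.g. read from a binary file

# .encode() binds to the literal only, so this is answer + b'\n'.
line = answer + '\n'.encode()

# The separator passed to bytes.split() must itself be bytes.
parts = b'a +++$+++ b'.split(' +++$+++ '.encode())

print(line)   # b'42\n'
print(parts)  # [b'a', b'b']
```

Writing b'\n' or b' +++$+++ ' directly has the same effect for ASCII-only literals.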
For this small example, just adding a b prefix before
'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n' solved my problem:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send(b'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data)
mysock.close()
What does the 'b' character do in front of a string literal?
You opened the file in binary mode, so each line is a bytes object.
The following code will raise a TypeError: a bytes-like object is required, not 'str':
for line in lines:
    print(type(line))  # <class 'bytes'>
    if 'substring' in line:
        print('success')
The following code will work - you have to use the decode() function:
for line in lines:
    line = line.decode()
    print(type(line))  # <class 'str'>
    if 'substring' in line:
        print('success')
Try opening your file as text:
with open(fname, 'rt') as f:
    lines = [x.strip() for x in f.readlines()]
Additionally, here is a link for Python 3.x on the official page:
io — Core tools for working with streams.
And this is the open function: open
If you are really trying to handle it as a binary then consider encoding your string.
I got this error when I was trying to convert a char (or string) to bytes, the code was something like this with Python 2.7:
# -*- coding: utf-8 -*-
print(bytes('ò'))
This is the way of Python 2.7 when dealing with Unicode characters.
This won't work in Python 3.6, since bytes requires an extra argument for the encoding. This can be a little tricky, since different encodings may produce different results:
print(bytes('ò', 'iso_8859_1')) # prints: b'\xf2'
print(bytes('ò', 'utf-8')) # prints: b'\xc3\xb2'
In my case I had to use iso_8859_1 when encoding bytes in order to solve the issue.
Summary
Python 2.x encouraged many bad habits WRT text handling. In particular, its type named str does not actually represent text per the Unicode standard (that type is unicode), and the default "string literal" in fact produces a sequence of raw bytes - with some convenience functions for treating it like a string, if you can get away with assuming a "code page" style encoding.
In 3.x, "string literals" now produce actual strings, and built-in functionality no longer does any implicit conversions between the two types. Thus, the same code now has a TypeError, because the literal and the variable are of incompatible types. To fix the problem, one of the values must be either replaced or converted, so that the types match.
The Python documentation has an extremely detailed guide to working with Unicode properly.
In the example in the question, the input file is processed as if it contains text. Therefore, the file should have been opened in a text mode in the first place. The only good reason the file would have been opened in binary mode even in 2.x is to avoid universal newline translation; in 3.x, this is done by specifying the newline keyword parameter when opening a file in text mode.
To read a file as text properly requires knowing a text encoding, which is specified in the code by (string) name. The encoding iso-8859-1 is a safe fallback; it interprets each byte separately, as representing one of the first 256 Unicode code points, in order (so it will never raise an exception due to invalid data). utf-8 is much more common as of the time of writing, but it does not accept arbitrary data. (However, in many cases, for English text, the distinction will not matter; both of those encodings, and many more, are supersets of ASCII.)
Thus:
with open(fname, 'r', newline='\n', encoding='iso-8859-1') as f:
    lines = [x.strip() for x in f.readlines()]
# proceed as before
# If the results are wrong, take additional steps to ascertain the correct encoding
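The trade-off between the two encodings can be seen in a short sketch. The byte values below are made-up sample data, not taken from the question:

```python
raw = b'caf\xe9'  # 'café' encoded as ISO 8859-1; not valid UTF-8

# ISO 8859-1 maps every byte to a code point, so decoding cannot fail.
as_latin1 = raw.decode('iso-8859-1')

# UTF-8 rejects the lone 0xE9 byte.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The fallback never raises, but it silently gives wrong text
# when the data was really UTF-8.
wrong = 'é'.encode('utf-8').decode('iso-8859-1')

print(as_latin1)  # café
print(utf8_ok)    # False
print(wrong)      # Ã©
```

So "never raises" is not the same as "always correct"; the fallback only guarantees that reading will not stop with an exception.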
How the error is created when migrating from 2.x to 3.x
In 2.x, 'some-pattern' creates a str, i.e. a sequence of bytes that the programmer is then likely to pretend is text. The str type is the same as the bytes type, and different from the unicode type that properly represents text. Many methods are offered to treat this data as if it were text, but it is not a proper representation of text. The meaning of each value as a text character (the encoding) is assumed. (In order to enable the illusion of raw data as "text", there would sometimes be implicit conversions between the str and unicode types. However, this results in confusing errors of its own - such as getting UnicodeDecodeError from an attempt to encode, or vice-versa).
In 3.x, 'some-pattern' creates what is also called a str; but now str means the Unicode-using, properly-text-representing string type. (unicode is no longer used as a type name, and only bytes refers to the sequence-of-bytes type.) Some changes were made to bytes to dissociate it from the text-with-assumed-encoding interpretation (in particular, indexing into a bytes object now results in an int, rather than a 1-element bytes), but many strange legacy methods persist (including ones rarely used even with actual strings any more, like zfill).
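The indexing difference mentioned above is easy to check in Python 3 with a short illustrative sketch:

```python
data = b'abc'
text = 'abc'

first_byte = data[0]     # indexing bytes gives an int
first_slice = data[0:1]  # slicing bytes gives bytes
first_char = text[0]     # indexing str gives a 1-character str

print(first_byte)   # 97
print(first_slice)  # b'a'
print(first_char)   # a
print(list(data))   # [97, 98, 99]
```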
Why this causes a problem
The data, tmp, is a bytes instance. It came from a binary source: in this case, a file opened with a 'b' file mode. In other cases, it could come from a raw network socket, a web request made with urllib or similar, or some other API call.
This means that it cannot do anything meaningful in combination with a string. The elements of a string are Unicode code points (i.e., abstractions that represent, for the most part, text characters, in a universal form that represents all world languages and many other symbols). The elements of a bytes are, well, bytes. (Specifically in 3.x, they are interpreted as unsigned integers ranging from 0 to 255 inclusive.)
When the code was migrated, the literal 'some-pattern' went from describing a bytes, to describing text. Thus, the code went from making a legal comparison (byte-sequence to byte-sequence), to making an illegal one (string to byte-sequence).
Fixing the problem
In order to operate on a string and a byte-sequence - whether it's checking for equality with ==, lexicographic comparison with <, substring search with in, concatenation with +, or anything else - either the string must be converted to a byte-sequence, or vice-versa. In general, only one of these will be the correct, sensible answer, and it will depend on the context.
Fixing the source
Sometimes, one of the values can be seen to be "wrong" in the first place. For example, if reading the file was intended to result in text, then it should have been opened in a text mode. In 3.x, the file encoding can simply be passed as an encoding keyword argument to open, and conversion to Unicode is handled seamlessly without having to feed a binary file to an explicit translation step (thus, universal newline handling still takes place seamlessly).
In the case of the original example, that could look like:
with open(fname, 'r') as f:
    lines = [x.strip() for x in f.readlines()]
This example assumes a platform-dependent default encoding for the file. This will normally work for files that were created in straightforward ways, on the same computer. In the general case, however, the encoding of the data must be known in order to work with it properly.
If the encoding is known to be, for example, UTF-8, that is trivially specified:
with open(fname, 'r', encoding='utf-8') as f:
    lines = [x.strip() for x in f.readlines()]
Similarly, a string literal that should have been a bytes literal is simply missing a prefix: to make the bytes sequence representing integer values [101, 120, 97, 109, 112, 108, 101] (i.e., the ASCII values of the letters example), write the bytes literal b'example', rather than the string literal 'example'. The same applies the other way around: drop the prefix when text was intended.
In the case of the original example, that would look like:
if b'some-pattern' in tmp:
There is a safeguard built in to this: the bytes literal syntax only allows ASCII characters, so something like b'ëxãmþlê' will be caught as a SyntaxError, regardless of the encoding of the source file (since it is not clear which byte values are meant; in the old implied-encoding schemes, the ASCII range was well established, but everything else was up in the air.) Of course, bytes literals with elements representing values 128..255 can still be written by using \x escaping for those values: for example, b'\xebx\xe3m\xfel\xea' will produce a byte-sequence corresponding to the text ëxãmþlê in Latin-1 (ISO 8859-1) encoding.
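A quick check of the escaped literal, as a sketch in Python 3 (the variable names are chosen here for illustration):

```python
escaped = b'\xebx\xe3m\xfel\xea'

decoded = escaped.decode('iso-8859-1')
same_bytes = 'ëxãmþlê'.encode('iso-8859-1') == escaped
utf8_length = len('ëxãmþlê'.encode('utf-8'))

print(decoded)      # ëxãmþlê
print(same_bytes)   # True
print(utf8_length)  # 11 (each accented letter takes two bytes in UTF-8)
```

The same text therefore corresponds to different byte sequences under different encodings, which is why the literal syntax refuses to guess.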
Converting, when appropriate
Conversion between byte-sequences and text is only possible when an encoding has been determined. It has always been so; we just used to assume an encoding locally, and then mostly ignore that we had done so. (Programmers in places like East Asia have been more aware of the problem historically, because they commonly need to work with scripts that have more than 256 distinct symbols, and thus their text requires multi-byte encodings.)
In 3.x, because there is no pressure to be able to treat byte-sequences implicitly as text with an assumed encoding, there are therefore no implicit conversion steps behind the scenes. This means that understanding the API is straightforward: Bytes are raw data; therefore, they are used to encode text, which is an abstraction. Therefore, the .encode() method is provided by str (which represents text), in order to encode text into raw data. Similarly, the .decode() method is provided by bytes (which represents a byte-sequence), in order to decode raw data into text.
Applying these to the example code, again supposing UTF-8 encoding is appropriate, gives:
if 'some-pattern'.encode('utf-8') in tmp:
and
if 'some-pattern' in tmp.decode('utf-8'):
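As a sketch of the decode variant, the comparison could be wrapped in a small helper. The function name contains_pattern and the sample data are hypothetical, and UTF-8 is only an assumed default:

```python
def contains_pattern(data, pattern, encoding='utf-8'):
    """Return True if the text `pattern` occurs in the bytes `data`.

    Hypothetical helper: decodes the bytes once, then compares str to str.
    """
    return pattern in data.decode(encoding)

tmp = 'line with some-pattern and café'.encode('utf-8')

found_ascii = contains_pattern(tmp, 'some-pattern')
found_accented = contains_pattern(tmp, 'café')
# Encoding the pattern with a different encoding than the data misses the match.
found_mismatch = 'café'.encode('iso-8859-1') in tmp

print(found_ascii)     # True
print(found_accented)  # True
print(found_mismatch)  # False
```

The last line shows why the encoding has to match the data: for ASCII-only patterns the choice rarely matters, but for anything else it does.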
I'm working with a JSON file that contains some strings in an unknown encoding, like the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I loaded this text with the json.load() function in a Python 3.7 environment and tried to encode/decode it with several methods I found around the Internet, but I still cannot get the string I expect (in this case, it should be Lê Nguyễn Phú).
My question is: which encoding did they use, and how do I parse this text properly in Python?
The JSON file comes from an external source that I don't control, so I cannot know or change how the text was encoded.
[Updated] More details:
The JSON file looks like this:
{
    "content":"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
with open(json_path, 'r') as f:
    data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'Lê Nguyá»\x85n Phú'
Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
Which decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion; there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will raise an exception. But if you try to apply it to a validly-encoded string containing only codepoints up to U+00FF, it may turn your perfectly good string into a different kind of garbage.
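A guarded version of that round trip might look like the sketch below. The function name repair_mojibake is made up for illustration, and returning the input unchanged on failure is one possible policy, not the only one. It remains a heuristic: a correct string that happens to survive the round trip would still be altered.

```python
import json

def repair_mojibake(text):
    """Try to undo UTF-8 bytes that were mis-decoded as Latin-1.

    Hypothetical helper: returns the input unchanged when the
    round trip is not possible.
    """
    try:
        return text.encode('latin_1').decode('utf_8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

raw = r'{"content": "L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"}'
data = json.loads(raw)

repaired = repair_mojibake(data['content'])
untouched = repair_mojibake(repaired)  # already correct, so left as-is

print(repaired)   # Lê Nguyễn Phú
print(untouched)  # Lê Nguyễn Phú
```

The second call is left alone because the repaired string contains a code point above U+00FF, so the Latin-1 encoding step fails and the function falls back to the input.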
Not sure if this is exactly the problem, but I'm trying to insert a tag around the first letter of a unicode string and it does not seem to work. Could this be because unicode indices work differently from those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)
def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: indices work over individual bytes when you are dealing with raw bytes, i.e. str in Python 2.x.
To work seamlessly with Unicode data, you first need to let Python 2.x know that you are dealing with Unicode, and then do the string manipulation. You can finally convert back to raw bytes to keep the behavior abstracted, i.e. you receive a str and you return a str.
Ideally, you should convert all data from UTF-8 encoded bytes to unicode objects (I am assuming your source encoding is UTF-8 because that is the standard used by most applications these days) at the very beginning of your code, and convert back to raw bytes at the very end, such as when saving to a DB or responding to a client. Some frameworks might handle that for you so that you don't have to worry.
def bold_letters(string, index):
    # Decode the UTF-8 byte string so that indexing works per character
    string = string.decode('utf8')
    string = u"<b>" + string[0] + u"</b>" + string[index:]
    return string.encode('utf8')
This will also work for ASCII because UTF8 is a super-set of ASCII. You can understand how Unicode works and in Python specifically better by reading http://nedbatchelder.com/text/unipain.html
Python 3.x String is a Unicode object so you don't have to explicitly do anything.
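For comparison, a Python 3 sketch of the same idea needs no decode or encode step. The helper name bold_first_letter is made up here, and it assumes the first code point is the whole letter (no combining marks such as vowel points):

```python
def bold_first_letter(text):
    """Wrap the first character of `text` in <b> tags (hypothetical helper)."""
    if not text:
        return text
    return '<b>' + text[0] + '</b>' + text[1:]

word = 'הקדמה'
result = bold_first_letter(word)

# In UTF-8 the first letter alone occupies two bytes, which is why
# indexing the raw byte string splits the character.
first_letter_bytes = len(word[0].encode('utf-8'))

print(result)
print(first_letter_bytes)  # 2
```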
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character, whereas Unicode strings use one code unit per character (at least for characters in the BMP on Python 2, i.e. the first 65536 code points):
#coding:utf8
import io

s = u"הקדמה"
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
# io.open accepts an encoding argument on both Python 2 and Python 3
with io.open('out.htm','w',encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as:
I am doing a word count on some text files, storing the results in a dictionary. My problem is that after outputting to file, the words are not displayed right even if they were in the original text. (I use TextWrangler to look at them).
For instance, dashes show up as dashes in the original but as \u2014 in the output; in the output, every word is prefixed by a u as well.
Problem
I do not know where, when and how in my script this happens.
I am reading the files with codecs.open() and writing the output both with codecs.open() and with json.dump(). Both outputs go wrong in the same way. In between, all I do is
tokenizing
regular expressions
collect in dictionary
And I don't know where I mess things up; I have de-activated tokenizing and most other functions to no effect. All this is happening in Python 2.
Following previous advice, I tried to keep everything within the script in Unicode.
Here is what I do (non-relevant code omitted):
#read in file, iterating over a list of "fileno"s
with codecs.open(os.path.join(dir,unicode(fileno)+".txt"), "r", "utf-8") as inputfili:
    inputtext=inputfili.read()
#process the text: tokenize, lowercase, remove punctuation and conjugation
content=regular expression to extract text w/out metadata
contentsplit=nltk.tokenize.word_tokenize(content)
text=[i.lower() for i in contentsplit if not re.match(r"\d+", i)]
text= [re.sub(r"('s|s|s's|ed)\b", "", i) for i in text if i not in string.punctuation]
#build the dictionary of word counts
for word in text:
    dicti[word].append(word)
#collect counts for each word, make dictionary of unique words
dicti_nos={unicode(k):len(v) for k,v in dicti.items()}
hapaxdicti= {k:v for k,v in perioddicti_nos.items() if v == 1}
#sort the dictionary
sorteddict=sorted(dictionary.items(), key=lambda x: x[1], reverse=True)
#output the results as .txt and json-file
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([unicode(i) for i in sorteddict]))
with open(file_name+".json", "w") as jsonoutputi:
    json.dump(dictionary, jsonoutputi, encoding="utf-8")
EDIT: Solution
Looks like my main issue was writing the file in the wrong way. If I change my code to what's reproduced below, things work out. Looks like joining a list of (string, number) tuples messed the string part up; if I join the tuples first, things work.
For the json output, I had to change to codecs.open() and set ensure_ascii to False. Apparently just setting the encoding to utf-8 does not do the trick like I thought.
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([":".join([i[0],unicode(i[1])]) for i in sorteddict]))
with codecs.open(file_name+".json", "w", "utf-8") as jsonoutputi:
    json.dump(dictionary, jsonoutputi, ensure_ascii=False)
Thanks for your help!
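The effect of ensure_ascii can be seen in isolation. This is a sketch in Python 3 (the question itself uses Python 2), and the counts dictionary is made-up sample data:

```python
import json

# Hypothetical word counts; '\u2014' is the dash character from the question.
counts = {'\u2014': 3, 'word': 7}

escaped = json.dumps(counts)                       # default: ensure_ascii=True
readable = json.dumps(counts, ensure_ascii=False)  # keeps the character itself

print(escaped)   # {"\u2014": 3, "word": 7}
print('\\u2014' in escaped)   # True: a literal backslash escape
print('\u2014' in readable)   # True: the actual dash character
```

Both strings are valid JSON and load back to the same dictionary; ensure_ascii only changes how non-ASCII characters look in the file.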
As your example is partially pseudocode there's no way to run a real test and give you something that runs and has been tested, but from reading what you have provided I think you may misunderstand the way Unicode works in Python 2.
The unicode type (such as is produced via the unicode() or unichr() functions) is meant to be an internal representation of a Unicode string that can be used for string manipulation and comparison purposes. It has no associated encoding. The unicode() function will take a buffer as its first argument and an encoding as its second argument and interpret that buffer using that encoding to produce an internally usable Unicode string that is from that point forward unencumbered by encodings.
That Unicode string isn't meant to be written out to a file as-is; all file formats assume some encoding, and you're supposed to provide one again before writing that Unicode string out to a file. Every place you have a construct like unicode(fileno) or unicode(k) or unicode(i) is suspect, both because you're relying on a default encoding (which probably isn't what you want) and because you're going on to expose most of these values directly to the file system.
After you're done working with these Unicode strings you can use the built-in method encode() on them with your desired encoding as an argument to pack them into strings of ordinary bytes set as required by your encoding.
So looking back at your example above: because you read the file with codecs.open() and the "utf-8" argument, the data is already decoded while reading, so your inputtext variable is a unicode object rather than an ordinary byte string. Had you read the file with the plain open() function, you would get a byte string containing UTF-8 encoded data, which you could convert with an operation like inputuni = unicode(inputtext, 'utf-8'). Either way, any Unicode string that you plan to write out to a file has to be encoded, with the equivalent of inputuni.encode('UTF-8'), unless the file object does that for you, as one opened with codecs.open() does.