How do I convert LF to CRLF? - python

I found a list of the majority of English words online, but the line breaks are Unix-style (the file is encoded in UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html
How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.
This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard
It should be:
bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard
How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).
I don't know where to start because I've never worked with unicode. Thanks in advance!
Using rU as the parameter (as suggested), with this in my code:
with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
my_file.close()
I get this error:
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
addWords('B Words')
File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
for line in my_file:
File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>
Can anyone help me with this?

Instead of converting, you should be able to just open the file using Python's universal newline support:
f = open('words.txt', 'rU')
(Note the U.)
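For the word-list files from the question, that could look roughly like the sketch below. Note that in Python 3 text mode already applies universal-newline handling, so the 'U' flag is unnecessary (and rejected in recent versions); the filename here is just a placeholder.
# A minimal sketch, not the exact code from the question: collect one word per line.
new_words = []
with open('B Words.txt', 'r', encoding='utf-8') as my_file:
    for line in my_file:
        word = line.strip()
        if word:
            new_words.append(word)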

You can use the replace method of strings, like:
txt.replace('\n', '\r\n')
EDIT: in your case:
with open('input.txt') as inp, open('output.txt', 'w') as out:
    txt = inp.read()
    txt = txt.replace('\n', '\r\n')
    out.write(txt)
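Since the question mentions 26 files, here is a hedged sketch that applies the same replace to every .txt file in a directory; the directory name and output suffix are assumptions, and newline='' keeps Python from translating the line endings a second time on write.
import glob
import os

# Sketch only: convert every .txt file in word_lists/ from LF to CRLF,
# writing the result next to the original with a .crlf.txt suffix.
for path in glob.glob(os.path.join('word_lists', '*.txt')):
    with open(path, 'r', encoding='utf-8', newline='') as inp:
        txt = inp.read()
    # Normalize first so any existing CRLFs are not doubled.
    txt = txt.replace('\r\n', '\n').replace('\n', '\r\n')
    with open(path + '.crlf.txt', 'w', encoding='utf-8', newline='') as out:
        out.write(txt)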

You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use Python's universal newline support.
The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8, but when reading them Python falls back to the cp1252 encoding (the Windows default) to convert the bytes read from the file into a Python 3 string (i.e. a sequence of Unicode code points). There are bytes in those files that cannot be decoded with the cp1252 encoding, and that causes the UnicodeDecodeError.
If you open the files with the correct encoding, e.g. open(my_file_name, 'r', encoding='utf-8'), you should no longer get the UnicodeDecodeError (and the str(line) call is unnecessary, since each line is already a string). Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
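Applied to the addWords snippet from the question, that would look roughly like this (a sketch; the variable names are taken from the question):
# Sketch of the question's loop with an explicit encoding.
# No str() or decode() call is needed: each line is already a string.
new_words = []
with open(my_file_name, 'r', encoding='utf-8') as my_file:
    for line in my_file:
        new_words.append(line.strip())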
Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.

You can use the cereja package:
pip install cereja==1.2.0
import cereja
cereja.lf_to_crlf(dir_or_file_path)
or
cereja.lf_to_crlf(dir_or_file_path, ext_in=[".py", ".csv"])
You can convert to any line-ending standard. See the filetools module.

Related

Unicode file with python and fileinput

I am becoming more and more convinced that the business of file encodings is made as confusing as possible on purpose. I have a problem with reading a file in utf-8 encoding that contains just one line:
“blabla this is some text”
(note that the quotation marks are some fancy version of the standard quotation marks).
Now, I run this piece of Python code on it:
import fileinput
def charinput(paths):
    with open(paths) as fi:
        for line in fi:
            for char in line:
                yield char

i = charinput('path/to/file.txt')
for item in i:
    print(item)
with two results:
If I run my Python code from the command prompt, the result is some strange characters, followed by an error message:
ď
»
ż
â
Traceback (most recent call last):
File "krneki.py", line 11, in <module>
print(item)
File "C:\Python34\lib\encodings\cp852.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position
0: character maps to <undefined>
I get the idea that the problem comes from the fact that Python tries to read a "wrongly" encoded document, but is there a way to tell fileinput.input to read UTF-8?
EDIT: Some really weird stuff is happening and I have NO idea how any of it works. After saving the same file as before in Notepad++, the Python code now runs within IDLE and results in the following output (newlines removed):
“blabla this is some text”
while I can get the command prompt to not crash if I first input chcp 65001. Running the file then results in
Ä»żâ€śblabla this is some text ”
Any ideas? This is a horrible mess, if you ask me, but it is vital I understand it...
Encoding
Every file is encoded. The byte 0x4C is interpreted as latin capital letter L according to the ASCII encoding, but as less-than sign ('<') according to the EBCDIC encoding. There Ain't No Such Thing As Plain Text.
There are single byte character sets like ASCII that use a single byte to encode each symbol, there are double byte character sets like KS X 1001 that use two bytes to encode each symbol, and there are encodings like the popular UTF-8 that use a variable number of bytes per symbol.
UTF-8 has become the most popular encoding for new applications, so I'll give some examples: The Latin Capital Letter A is stored as a single byte: 0x41. The Left Double Quotation Mark (“) is stored as three bytes: 0xE2 0x80 0x9C. The emoji Pile of Poo is stored as four bytes: 0xF0 0x9F 0x92 0xA9.
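You can check those byte sequences directly in Python:
# Verifying the UTF-8 byte sequences mentioned above.
print('A'.encode('utf-8'))            # b'A' (0x41)
print('\u201c'.encode('utf-8'))       # b'\xe2\x80\x9c' (Left Double Quotation Mark)
print('\U0001F4A9'.encode('utf-8'))   # b'\xf0\x9f\x92\xa9' (Pile of Poo)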
Any program that reads a file and has to interpret the bytes as symbols has to know (or to guess) which encoding was used.
If you are not familiar with Unicode or UTF-8 you might want to read http://www.joelonsoftware.com/articles/unicode.html
Reading Files in Python 3
Python 3's built-in function open() has an optional keyword argument encoding to support different encodings. To open a UTF-8 encoded file you can write open(filename, encoding="utf-8") and Python will take care of the decoding.
Also, the fileinput module supports encodings via the openhook keyword argument: fileinput.input(filename, openhook=fileinput.hook_encoded("utf-8")).
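Adapted to the generator from the question, that might look like the sketch below (same function name, the path is a placeholder). Note that printing to a console that cannot display the characters (e.g. cp852) can still raise a UnicodeEncodeError; that is a separate output-encoding issue.
import fileinput

def charinput(paths):
    # Sketch: the file is decoded as UTF-8 via fileinput's openhook.
    with fileinput.input(paths, openhook=fileinput.hook_encoded("utf-8")) as fi:
        for line in fi:
            for char in line:
                yield char

for item in charinput('path/to/file.txt'):
    print(item)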
If you are not familiar with Python and Unicode or UTF-8 you should read http://docs.python.org/3/howto/unicode.html
I also found some nice tricks in http://www.chirayuk.com/snippets/python/unicode
Reading Strings in Python 2
In Python 2 open() does not know about encodings. Instead you can use the codecs module to specify which encoding should be used: codecs.open(filename, encoding="utf-8")
The best source for Python 2/Unicode enlightenment is http://docs.python.org/2/howto/unicode.html

How to open an unicode text file inside a zip?

I tried
import zipfile

with zipfile.ZipFile("5.csv.zip", "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, 'rU') as readFile:
            line = readFile.readline()
            print(line)
            split = line.split('\t')
it answers:
b'$0.0\t1822\t1\t1\t1\n'
Traceback (most recent call last)
File "zip.py", line 6
split = line.split('\t')
TypeError: Type str doesn't support the buffer API
How do I open the text file as Unicode text instead of as bytes (the b'...' prefix)?
To convert a byte stream into Unicode stream, you could use io.TextIOWrapper():
import io
import zipfile

encoding = 'utf-8'
with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        with zfile.open(name) as readfile:
            for line in io.TextIOWrapper(readfile, encoding):
                print(repr(line))
Note: TextIOWrapper() uses universal newline mode by default. rU mode in zfile.open() is deprecated since version 3.4.
It avoids issues with multibyte encodings described in @Peter DeGlopper's answer.
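Since the lines in the question look tab-separated (an assumption based on the sample output), the wrapped stream can also be fed straight to csv.reader:
import csv
import io
import zipfile

# Sketch: parse the tab-separated lines with csv.reader instead of split().
with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        with zfile.open(name) as readfile:
            reader = csv.reader(io.TextIOWrapper(readfile, encoding='utf-8'),
                                delimiter='\t')
            for row in reader:
                print(row)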
edit For Python 3, using io.TextIOWrapper as this answer describes is the best choice. The answer below could still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TextIOWrapper is still better.
If the file is utf-8, this will work:
# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
    line = readFile.readline().decode('utf8')
    # etc
If you're going to be iterating over the file you can use codecs.iterdecode, but that won't work with readline().
with zfile.open(name, 'rU') as readFile:
    for line in codecs.iterdecode(readFile, 'utf8'):
        print line
        # etc
Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A non-unicode aware tool looking for newlines will split that incorrectly, leaving the null bytes on the following line. In such a case you'd have to use something that doesn't try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.
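For such an encoding, a rough sketch of the decode-everything-at-once approach (the archive name is from the question, but the UTF-16 encoding is only an assumption for illustration):
import zipfile

# Sketch: read the whole member as bytes, decode once, then split into lines.
# Useful when per-line byte splitting would break the encoding (e.g. UTF-16).
with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        text = zfile.read(name).decode('utf-16-le')
        for line in text.splitlines():
            print(line.split('\t'))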
The reason why you're seeing that error is that you are trying to mix bytes with Unicode. The argument to split must also be a byte string:
>>> line = b'$0.0\t1822\t1\t1\t1\n'
>>> line.split(b'\t')
[b'$0.0', b'1822', b'1', b'1', b'1\n']
To get a unicode string, use decode:
>>> line.decode('utf-8')
'$0.0\t1822\t1\t1\t1\n'

How to use unidecode in python (3.3)

I'm trying to remove all non-ascii characters from a text document. I found a package that should do just that, https://pypi.python.org/pypi/Unidecode
It should accept a string and convert all non-ASCII characters to the closest available ASCII character. I used the equivalent module in Perl easily enough, just calling while (<input>) { $_ = unidecode($_); }, and this one is a direct port of the Perl module; the documentation indicates that it should work the same.
I'm sure this is something simple, I just don't understand enough about character and file encoding to know what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem may have more to do with my lack of encoding knowledge and handling strings wrong than the module, hopefully someone can explain why though. I've tried everything I know without just randomly inserting code and search the errors I'm getting with no luck so far.
Here's my Python:
from unidecode import unidecode
def toascii():
    origfile = open(r'C:\log.convert', 'rb')
    convertfile = open(r'C:\log.toascii', 'wb')
    for line in origfile:
        line = unidecode(line)
        convertfile.write(line)
    origfile.close()
    convertfile.close()

toascii()
If I don't open the original file in byte mode (origfile = open('file.txt', 'r')), then I get an error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the for line in origfile: line.
If I do open it in byte mode 'rb' I get TypeError: ord() expected string length 1, but int found from the line = unidecode(line) line.
If I declare line as a string (line = unidecode(str(line))), then it will write to the file, but... not correctly. \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It's writing out the \n, \r, etc. and Unicode characters instead of converting them to anything.
If I convert the line to a string as above, and open the convertfile in byte mode 'wb', it gives the error TypeError: 'str' does not support the buffer interface.
If I open it in byte mode 'wb' without declaring it a string and call unidecode(line), then I get the TypeError: ord() expected string length 1, but int found error again.
The unidecode module accepts Unicode string values and returns a Unicode string in Python 3. You are giving it binary data instead. Decode to Unicode or open the input text file in text mode, and encode the result to ASCII before writing it to a file, or open the output text file in text mode.
Quoting from the module documentation:
The module exports a single function that takes an Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)
Emphasis mine.
This should work:
def toascii():
    with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, \
         open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
        for line in origfile:
            line = unidecode(line)
            convertfile.write(line)
This opens the input file in text mode (using the UTF-8 encoding, which, judging by your sample line, is correct) and writes in text mode (encoding to ASCII).
You do need to explicitly specify the encoding of the file you are opening; if you omit the encoding the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.
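A quick interactive check of what unidecode does (the sample characters here are just illustrative, not taken from the log file):
from unidecode import unidecode

# Illustrative only: unidecode maps non-ASCII characters to ASCII approximations.
print(unidecode('caf\u00e9'))            # cafe
print(unidecode('\u201cquoted\u201d'))   # "quoted"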

Pythonic way of reading NUL in a file

I'm using Python to read a text file with the segment below
(I can't post a screenshot since I'm a noob), but this is what it looks like in Notepad++:
NULSOHSOHNULNULNULSUBMesssage-ID:
error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print(f.readline())
File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7673: character maps to <undefined>
Opening the file as binary:
f = open('file.txt','rb')
f.readline()
gives me the text as binary
b'\x00\x01\x01\x00\x00\x00\x1a\xb7Message-ID:
But how do I get the text as ASCII? And what's the easiest/most Pythonic way of handling this?
The problem is with "byte 0x8f in position 7673", not with "byte 0x00 in position 1". I.e., your NUL is not the problem. If you look at the cp1252 code page on Wikipedia, you can see that 0x8f has no corresponding character.
The larger issue is that your file is not in a single encoding: it appears to be a mix of binary framing and text segments. What you really need to do is figure out the format of this file and parse it into binary pieces (or perhaps some richer data structure, like a tuple, list, dict, or object), then decode the text pieces into Unicode if you need to process them further.
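As a very rough illustration of that idea (the marker, encoding, and filename below are assumptions; the real file format would have to be worked out first):
# Rough sketch only: treat the file as bytes, locate a known text marker,
# and decode just the text portion that follows it.
with open('file.txt', 'rb') as f:
    data = f.read()

start = data.find(b'Message-ID:')
if start != -1:
    text = data[start:].decode('ascii', errors='replace')
    print(text.splitlines()[0])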
When opening a file in text mode, you can specifically tell which encoding to use:
f = open('file.txt','r',encoding='ascii')
However your real problem is different: the binary piece that you cited can not be read as ASCII, because the byte \xb7 is outside of ASCII range (0-127). The exception traceback tells that Python is using cp1252 codec by default, which cannot decode your file either.
You need either to figure out which encoding the file has, or to handle it as binary all the time.
Perhaps open it in the correct read mode?
f = open('file.txt','r')
f.readline()

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to `<undefined>`
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to @Lennart Regebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not UTF-8), and you find yourself merely guessing, there are online tools that you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
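If you would rather guess the encoding programmatically than with an online tool, the third-party chardet package (not mentioned in the answer above, so this is just an alternative sketch) can make a similar guess:
import chardet  # pip install chardet

# Sketch: read the raw bytes and let chardet guess the encoding.
with open(filename, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess)

text = raw.decode(guess['encoding'])  # then decode with the guessed encoding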
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same code page as the current environment (cp1252 in the case of the opening post) and tries to decode it with that codec. If the file contains byte values not defined in this code page (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file can contain mixed encodings.
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then simply dropped, but other errors will be masked too.
A very good solution is to specify the encoding, yet not just any encoding (like cp1252), but one which has ALL byte values defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
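A tiny check of why cp437 never raises: every byte value from 0x00 to 0xFF is mapped to some character, so decoding cannot fail (even if the characters are not what the file's author intended).
# Every possible byte value decodes under cp437, so no UnicodeDecodeError can occur.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('cp437')
print(len(decoded))                           # 256
print(decoded.encode('cp437') == all_bytes)   # True: the round trip preserves the bytes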
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding" go to "Character sets" and patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090), and then consider removing it from the file.
In newer versions of Python (starting with 3.7), you can add the interpreter option -X utf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value in the Interpreter options field to -X utf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
For me, changing the MySQL character encoding to match my code helped to sort out the solution: photo = open('pic3.png', encoding='latin1')
