I need to convert EBCDIC files to ASCII using Python 2.
An extract from the sample file looks like the below (in Notepad++)
I have tried to decode it with 'cp500' and then encode it as 'utf8' in Python, like below
with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('cp500').encode('utf8').strip()
    print line
And below
with io.open(path, 'rb', encoding="cp500") as input_file:
    line = input_file.read()
    print line
Also, tried with codecs
with codecs.open(path, 'rb') as input_file:
    count = 0
    line = input_file.read()
    line = codecs.decode(line, 'cp500').encode('utf8')
    print line
Also, tried importing/installing the ebcdic module, but it doesn't seem to be working properly.
here is the sample output for the first 58 chars
It does transform the data to some human-readable values for some bytes but doesn't seem to be 100 percent in ASCII. For example, the 4th character in the input file is 'P' (after the first three NUL), and if I open the file in hex mode, the hex code for 'P' is 0x50, which maps to character 'P' in ASCII. But the code above gives me the character '&' for this in output, which is the EBCDIC character for hex value 0x50.
Also, tried the below code,
with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('utf8').strip()
    print line
It gives me the below error.
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 4: invalid continuation byte
And if I change 'utf8' to 'latin1' in the above code, it generates the same output as the input file shown above, as opened in Notepad++.
Can anyone please help me with how to transform the EBCDIC to ASCII correctly?
Should I build my own mapping dictionary/table/map to transform the EBCDIC to ASCII i.e. convert the file data in hex codes and then get the corresponding ASCII char from that mapping table/dict?
If I do so, then hex 0x40 is 'Space' and 0xe2 is 'S' in EBCDIC, but in ASCII 0x40 is '@' and 0xe2 has no mapping at all. But as per the input data, it looks like I need EBCDIC characters in this case.
So should I construct some map by looking at the input data, decide whether I want the EBCDIC or the ASCII character for each particular hex value, and build the lookup accordingly?
Or I need to follow some other way to correctly parse the data.
Note: The non-alphanumeric data is needed as well. There are some images at particular places in the input file, encoded in those non-alphanumeric/alphanumeric chars, which we can extract, so I am not sure if I need to convert that to ASCII or leave it as is.
Thanks in advance
Posting for others how I was able to transform the EBCDIC to ASCII.
I learned that I only needed to convert the non-binary alpha-numeric data to ASCII from EBCDIC. To know which data will be non-binary alphanumeric data, one needs to understand the format/structure of the EBCDIC/input file. Since I knew the format/structure of the input file, I was aware of which fields/bytes of the input files needed transformation and did transform only those bytes leaving other binary data as it is in the input file.
Earlier I was trying to convert the whole file into ASCII, which was converting the binary data as well, hence distorting the data in conversion. Hence, by understanding the structure/format of the files I converted only the required alphanumeric data to ASCII and processed it. It worked.
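The field-by-field approach described above can be sketched as follows. This is only an illustration: the record layout, field names, offsets and the choice of cp500 are hypothetical, since the real structure of the input file is not shown here.

```python
import codecs

# Hypothetical fixed-length record layout: (field name, offset, length, kind).
# "text" fields hold EBCDIC characters; "binary" fields are left untouched.
LAYOUT = [
    ("record_type", 0, 2, "binary"),
    ("customer_name", 2, 10, "text"),
    ("image_chunk", 12, 4, "binary"),
]


def convert_record(record, encoding="cp500"):
    """Return a dict with text fields decoded and binary fields as raw bytes."""
    fields = {}
    for name, offset, length, kind in LAYOUT:
        raw = record[offset:offset + length]
        if kind == "text":
            # Only the alphanumeric fields are translated from EBCDIC.
            fields[name] = codecs.decode(raw, encoding).rstrip()
        else:
            # Binary data (counters, image bytes, ...) must not be translated.
            fields[name] = raw
    return fields


# A made-up 16-byte record: 2 binary bytes, a 10-char EBCDIC name, 4 binary bytes.
sample = b"\x00\x01" + u"SMITH     ".encode("cp500") + b"\xff\xd8\xff\xe0"
print(convert_record(sample))
```

Keeping the layout in a table like this makes it easy to see which byte ranges are translated and which are passed through unchanged.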
Options
Convert the file to Text on the Mainframe - They have the tools that understand the formats
You might be able to use Stingray to read the file in python
Write a Cobol program (GNU Cobol) to translate the file
Use java utilities coboltocsv or coboltoxml to convert the file
Java/Jython code with JRecord
ZOS Mainframe Files
The 2 main mainframe file formats
FB - all records (lines) are the same length
VB - each record starts with a length and is followed by the data. These files can be transferred to other platforms with or without the record length.
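As a rough sketch of how the two formats could be split into records in Python: the VB reader below assumes the file was transferred with the record length retained as a 4-byte record descriptor word (a 2-byte big-endian length that includes the 4 bytes themselves, then 2 reserved bytes) and without block descriptor words. Both function names are made up for this example.

```python
import struct


def iter_fb_records(data, record_length):
    """Yield fixed-length (FB) records from a bytes buffer."""
    for pos in range(0, len(data), record_length):
        yield data[pos:pos + record_length]


def iter_vb_records(data):
    """Yield the payload of each length-prefixed (VB) record in a bytes buffer."""
    pos = 0
    while pos + 4 <= len(data):
        # Assumed prefix: 2-byte big-endian length (including the 4-byte
        # prefix itself) followed by 2 reserved bytes.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 4:
            raise ValueError("invalid record length at offset %d" % pos)
        yield data[pos + 4:pos + length]
        pos += length


# Two made-up VB records holding EBCDIC "ABC" and "DE".
vb_sample = b"\x00\x07\x00\x00\xc1\xc2\xc3" + b"\x00\x06\x00\x00\xc4\xc5"
vb_records = list(iter_vb_records(vb_sample))
fb_records = list(iter_fb_records(b"\xc1\xc2\xc3\xc4\xc5\xc6", 3))
print(vb_records)
print(fb_records)
```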
Cobol Files
A Cobol copybook allows you to work out
Where fields start and End
The format of the field
Some examples of Cobol fields and their representation
In this example I will look at 2 Cobol field definitions
and how 4 values are represented in a file
Cobol field definition
03 fld1 pic s999v99.
03 fld2 pic s999v99 comp-3.
Representation in the file
Numeric-Value    pic s999v99    pic s999v99 comp-3
    12.34        0123D          x'01234C'
   -12.34        0123M          x'01234D'
    12.35        0123E          x'01235C'
   -12.35        0123N          x'01235D'
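The table above can be reproduced with a small decoder. This is a minimal sketch, assuming the data is still in EBCDIC, that the sign nibble is C for positive and D for negative as in the table, and that the number of decimal places (2 for v99) is known from the copybook. The function names are made up for this example.

```python
from decimal import Decimal


def unpack_comp3(raw, scale=2):
    """Decode packed decimal (comp-3): two digits per byte, sign in the last nibble."""
    data = bytearray(raw)
    digits = []
    for byte in data[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(data[-1] >> 4)
    value = int("".join(str(d) for d in digits))
    if data[-1] & 0x0F == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def unpack_zoned(raw, scale=2):
    """Decode EBCDIC zoned decimal: one digit per byte, sign in the last zone nibble."""
    data = bytearray(raw)
    value = int("".join(str(byte & 0x0F) for byte in data))
    if data[-1] >> 4 == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


print(unpack_comp3(b"\x01\x23\x4c"))
print(unpack_zoned(u"0123M".encode("cp500")))
```

Note that decoding a zoned field with cp500 alone yields text such as 0123M, which is why the sign has to be taken from the last byte rather than from the decoded string.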
You are reading the file in binary mode so the content in the buffer is in EBCDIC. You need to decode it to ASCII. Try the following:
with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('utf8').strip()
    print line
The above suggestion was tested on a z/OS machine, but if you are running on an ASCII machine you can try the following instead:
with codecs.open(path, 'rb', 'cp500') as input_file:
    line = input_file.read()
    print line
These suggestions assume you have a text file, but if the file contains binary data mixed with text you will need a different approach as suggested by #bruce-martin.
Related
I've received several text files, where each file contains thousands of lines of text. Because the files use Unicode encoding, each file ends up being around 1GB. I know this might sound borderline ridiculous, but it unfortunately is the reality:
I'm using Python 2.7 on a Windows 7 machine. I've only started using Python but figured this would be a good chance to really start using the language. You've gotta use it to learn it, right?
What I'm hoping to do is to be able to make a copy of all of these massive files. The new copies would be using ASCII character encoding and would ideally be significantly smaller in size. I know that changing the character encoding is a solution because I've had success by opening a file in MS WordPad and saving it to a regular text file:
Using WordPad is a manual and slow process: I need to open the file, which takes forever because it's so big, and then save it as a new file, which also takes forever since it's so big. I'd really like to automate this by having a script run in the background while I work on other things. I've written a bit of Python to do this, but it's not working correctly. What I've done so far is the following:
import io
import os

def convertToAscii():
    # Getting a list of the current files in the directory
    cwd = os.getcwd()
    current_files = os.listdir(cwd)
    # I don't want to mess with all of the files, so I'll just pick the second one since the first file is the script itself
    test_file = current_files[1]
    # Determining a new name for the ASCII-encoded file
    file_name_length = len(test_file)
    ascii_file_name = test_file[:file_name_length - 3 - 1] + "_ASCII" + test_file[file_name_length - 3 - 1:]
    # Then we open the new blank file
    the_file = open(ascii_file_name, 'w')
    # Finally, we open our original file for testing...
    with io.open(test_file, encoding='utf8') as f:
        # ...read it line by line
        for line in f:
            # ...encode each line into ASCII
            line.encode("ascii")
            # ...and then write the ASCII line to the new file
            the_file.write(line)
    # Finally, we close the new file
    the_file.close()

convertToAscii()
And I end up with the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
But that doesn't make any sense.... The first line within all of the text files is either a blank line or a series of equal signs, such as ===========.
I was wondering if someone would be able to put me onto the right path for this. I understand that doing this operation can take a very long time since I'm essentially reading each file line by line and then encoding the string into ASCII. What must I do in order to get around my current issue? And is there a more efficient way to do this?
For characters that exist in ASCII, UTF-8 already encodes using single bytes. Opening a UTF8 file with only single byte characters then saving an ASCII file should be a non-operation.
For any size difference, your files would have to be some wider encoding of Unicode, like UTF-16 / UCS-2. That would also explain the utf8 codec complaining about unexpected bytes in the source file.
Find out what encoding your files actually are, then save using utf8 codec. That way your files will be just as small (equivalent to ASCII) for single byte characters, but if your source files happen to have any multibyte characters, the result file will still be able to encode them and you won't be doing a lossy conversion.
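One cheap way to find out what the encoding is, at least for files that start with a byte order mark, is to look at the first few bytes. The sketch below is only a heuristic: the function name sniff_bom is made up, and files without a BOM will return None and need another approach.

```python
import codecs


def sniff_bom(first_bytes):
    """Return a codec name suggested by a leading byte order mark, or None."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE mark
    # begins with the same two bytes as the UTF-16 LE mark.
    candidates = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in candidates:
        if first_bytes.startswith(bom):
            return name
    return None


# 0xff 0xfe at position 0 is the little-endian UTF-16 mark.
print(sniff_bom(b"\xff\xfe=\x00=\x00"))
```

In practice the argument would be the result of reading the first four bytes of the file opened in 'rb' mode.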
There's a potential speedup if you avoid splitting the file into lines, since the only thing that you're doing is joining the lines back together. This allows you to process the input in larger blocks.
Using the shutil.copyfileobj function (which is just read and write in a loop):
import shutil
with open('input.txt', encoding='u16') as infile, \
        open('output.txt', 'w', encoding='u8') as outfile:
    shutil.copyfileobj(infile, outfile)
(Using Python 3 here, by passing the encoding argument directly to open, but it should be the same as the library function io.open.)
I have a Python 2.7 script which imports data from CSV files exported from various other sources.
As part of the import process I have a small function that establishes the correct character encoding for the file. I then open the file and loop the lines using:
with io.open(filename, "r", encoding=file_encoding) as input_file:
    for raw_line in input_file:
        cleaned_line = raw_line.replace('\x00', '').replace(u"\ufeff", "").encode('utf-8')
        # do stuff
The files from this source usually come as UTF-8 (with BOM) and I detect the encoding 'utf-8-sig' and use that to open the file.
The problem I am having is that one of my data sources returns a file that seems to have an encoding error. The rest of the file (about 27k lines of CSV data) are all correct, as usual, but one line fails.
The line in question fails with this error (at the for raw_line in input_file line):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1709: invalid start byte
The line has several non-breaking space characters that are encoded as a single byte with value 'A0' rather than as 2 bytes with 'C2 A0'.
I am already doing some light cleaning on a line by line basis for other problems as you can see on my "cleaned_line" line at the top of the loop (I dislike doing this per line but with the files I get I haven't found a way to do it better). However, the code fails before I ever even get there.
Is there a correct/nice way to handle this particular issue? I thought I'd nailed the whole encoding issue until this.
You can tell Python to ignore decoding errors, or to replace the faulty bytes with a placeholder character.
Set errors to 'ignore' to ignore the A0 bytes:
with io.open(filename, "r", encoding=file_encoding, errors='ignore') as input_file:
or to 'replace' to replace them with the U+FFFD REPLACEMENT CHARACTER:
with io.open(filename, "r", encoding=file_encoding, errors='replace') as input_file:
UTF-8 is a self-correcting encoding; bytes are either always part of a multi-byte code point, or can be decoded as ASCII directly, so ignoring un-decodable bytes is relatively safe.
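To see what the two error handlers do, here is a small self-contained demonstration on a made-up byte string that mixes a stray A0 byte with a correctly encoded C2 A0 non-breaking space. The same handler names are accepted by bytes.decode and by io.open.

```python
# Made-up sample: one stray 0xA0 byte and one valid UTF-8 non-breaking space.
data = b"price:\xa0100\xc2\xa0EUR"

# 'ignore' silently drops the byte that cannot be decoded.
ignored = data.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = data.decode("utf-8", "replace")

print(repr(ignored))
print(repr(replaced))
```

With 'replace' the affected lines can still be found afterwards by searching for U+FFFD, which is not possible once the bytes have been ignored.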
You can also encode with the 'translit/long' codec to transliterate non-ASCII characters to ASCII approximations; you need to import translitcodec first.
I'm writing a python script that opens a file and reads a two character string value e.g. 32 or 6f. It is then supposed to write that value as a single byte to a binary file as hex e.g. 0x32 or 0x6f.
file_in = open('32.xml')
file_contents = file_in.read()
file_in.close()
file_out = open('testfile', 'wb')
file_out.write(file_contents)
file_out.close()
In this example, 32.xml is a plain text file that contains the string '32'. But the contents of the testfile output file are '32' instead of 0x32 (or just 2).
I've tried all kinds of variations on the write command. I tried the chr() function but that requires converting the string to an int.
file_out.write(chr(int(file_contents)))
That ended up writing the hex value of the string, not what I wanted. It also failed as soon as you had a value containing a-f.
I also tried
file_out.write('\x' + file_contents)
but the python interpreter didn't like that.
You need to interpret the original string as a hexadecimal integer. Hexadecimal is a base-16 notation, so pass 16 as the base (the second argument) to the int() call:
file_out.write(chr(int(file_contents, 16)))
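An alternative sketch that also handles more than one byte (for example the text 326f) uses binascii.unhexlify from the standard library. The helper name hex_text_to_bytes is made up, and stripping the trailing newline is an assumption about the input file.

```python
import binascii


def hex_text_to_bytes(text):
    """Turn a string of hex digit pairs such as '32' or '326f' into raw bytes."""
    # Strip surrounding whitespace (e.g. a trailing newline) before converting.
    return binascii.unhexlify(text.strip().encode("ascii"))


print(repr(hex_text_to_bytes("32")))
print(repr(hex_text_to_bytes("6f\n")))
```

The result can be written directly to a file opened with 'wb'.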
I'm writing a program to 'manually' arrange a csv file to be proper JSON syntax, using a short Python script. From the input file I use readlines() to format the file as a list of rows, which I manipulate and concatenate into a single string, which is then written to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is double-spaced horizontally (a whitespace character is added in between each character). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what. When I detect the encoding of the input and output files (using the .encoding attribute), they both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    # ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)
# make each line of input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list
for i in range(0,len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
I highly recommend Ned Batchelder's presentation
http://nedbatchelder.com/text/unipain.html
for details.
There's an explanation about the use of "unicode" as an encoding on windows: What's the difference between Unicode and UTF-8?
TLDR:
Microsoft uses UTF16 as encoding for unicode strings, but decided to call it "unicode" as they also use it internally.
Even if Python2 is a bit lenient as to string/unicode conversions, you should get used to always decode on input and encode on output.
In your case
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF16")
# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)
encoded_result = result.encode("UTF-16") #really, just using UTF8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
You need to tell Python to use the Unicode character encoding to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
I found a list of the majority of English words online, but the line breaks are of unix-style (encoded in Unicode: UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html
How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.
This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard
It should be:
bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard
How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).
I don't know where to start because I've never worked with unicode. Thanks in advance!
Using rU as the parameter (as suggested), with this in my code:
with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
my_file.close()
I get this error:
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    addWords('B Words')
  File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
    for line in my_file:
  File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>
Can anyone help me with this?
Instead of converting, you should be able to just open the file using Python's universal newline support:
f = open('words.txt', 'rU')
(Note the U.)
You can use the replace method of strings. Like
txt.replace('\n', '\r\n')
EDIT :
in your case :
with open('input.txt') as inp, open('output.txt', 'w') as out:
    txt = inp.read()
    txt = txt.replace('\n', '\r\n')
    out.write(txt)
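One caveat with a text-mode replace like this: on Windows, text mode may already translate '\n' to '\r\n' when writing, so the result can end up with '\r\r\n'. A hedged alternative is to do the replacement on bytes read and written in binary mode ('rb' and 'wb'). The helper name to_crlf below is made up for this sketch.

```python
def to_crlf(data):
    """Convert line endings in a bytes buffer to CRLF without doubling existing ones."""
    # Normalise any existing CRLF to LF first, then expand every LF to CRLF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


converted = to_crlf(b"backbite\nbackbiter\r\nbackbiters\n")
print(repr(converted))
```

Because the function works on raw bytes, it does not need to know the text encoding of the file, as long as that encoding represents a line feed as the single byte 0x0A (true for UTF-8 and ASCII, not for UTF-16).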
You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use python's universal newlines mode.
The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8, but open() was not given an encoding, so Python 3 falls back to the platform default (cp1252 on this Windows machine) when it decodes the bytes read from the file into strings (i.e. sequences of Unicode code points). There are bytes in those files that cannot be decoded with cp1252, which is why the traceback points at the for line in my_file loop rather than at str(line).
If you pass the encoding explicitly, for example open(my_file_name, encoding='utf-8'), you should no longer get the UnicodeDecodeError; each line is then already a str, so the str(line) call is unnecessary. Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.
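Putting the pieces from this answer together, a minimal sketch of reading a UTF-8 word list one word per line could look like this. The function name read_words and the temporary demo file are made up for illustration; io.open is used so the same code runs on Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return one word per line, regardless of LF or CRLF line endings."""
    with io.open(path, "r", encoding=encoding) as word_file:
        # Text mode applies universal newlines; strip() removes the line ending.
        return [line.strip() for line in word_file if line.strip()]


# Demo with a temporary file that has Unix line endings and a non-ASCII word.
handle, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(handle)
with io.open(demo_path, "w", encoding="utf-8", newline="\n") as demo_file:
    demo_file.write(u"backbite\nbackbiter\ncaf\u00e9\n")

words = read_words(demo_path)
os.remove(demo_path)
print(words)
```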
You can use cereja package
pip install cereja==1.2.0
import cereja
cereja.lf_to_crlf(dir_or_file_path)
or
cereja.lf_to_crlf(dir_or_file_path, ext_in=[".py", ".csv"])
You can substitute for any standard. See the filetools module