EBCDIC to ASCII Conversions - python

I have mainframe files in EBCDIC format and I want to convert those files into ASCII format.
I have tried converting EBCDIC to ASCII using Python 2.6, but there are several issues: compressed fields don't get converted and the record count increases.
Is there any way to convert EBCDIC files that have compressed fields to ASCII format?

If you already have the file downloaded, you can easily convert it from EBCDIC to ASCII on a Linux or macOS machine using the command line.
To do that, use the dd command.
Here's a quick overview of some of the parameters it takes:
dd [bs=size] [cbs=size] [conv=conversion] [count=n] [ibs=size] [if=file] [imsg=string] [iseek=n] [obs=s] [of=file] [omsg=string] [seek=n] [skip=n]
There are more parameters than those above; to see them all, run man dd, which lists every available parameter along with an explanation of each.
In your case you should start with:
dd conv=ascii if=EBCDIC_file.txt of=ASCII_file.txt
where EBCDIC_file.txt is the filename of your input EBCDIC file and ASCII_file.txt will be the file created as output with all bytes converted from EBCDIC to ASCII.
Likewise you can do the reverse by using conv=ebcdic to convert a file from ASCII to EBCDIC.
Here's the man page for dd on the web: https://www.man7.org/linux/man-pages/man1/dd.1.html
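For plain character data you can also do the conversion in Python itself with the built-in cp037 (EBCDIC) codec. Below is a minimal sketch, assuming a reasonably recent Python and a file containing only display (text) fields; packed-decimal (COMP-3) and binary fields are not character data, so a byte-for-byte translation like this (or dd) will corrupt them, which is likely what you are seeing with your compressed fields:
with open('EBCDIC_file.txt', 'rb') as src, open('ASCII_file.txt', 'wb') as dst:
    # Decode EBCDIC code page 037 to text, then write it back out as ASCII,
    # replacing anything that has no ASCII equivalent.
    dst.write(src.read().decode('cp037').encode('ascii', 'replace'))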
When you say your file is compressed, do you mean the whole file comes compressed from the mainframe? It probably came TERSED (produced with the TERSE utility on the mainframe). If that is the case, there is a public version of TERSE that runs on DOS, Linux, macOS, AIX and others. It is available on the CBT Tape site: http://www.cbttape.org/ftp/cbt/CBT892.zip

Options
Some options:
Convert the file to a text file on the mainframe (SORT or Easytrieve will both do this)
If it is a one-off, File-Aid / File Master can convert the file to text on the mainframe
If it is a one-off, the RecordEditor should be able to edit the file with a Cobol copybook. It can also generate JRecord code to read the file.
If there is only one record type in the file, CobolToCsv can use the Cobol copybook to convert the file to a CSV.
JRecord lets you read a Cobol copybook in Java
JRecord has a Cobol Copy utility that will let you do a Cobol-to-Cobol copy. If there is only one record type you can:
Copy the EBCDIC file to the equivalent ASCII file (text fields are converted, binary fields are left unchanged). This is useful when converting a Mainframe Cobol file for use in a Windows / Linux Cobol system
Copy an EBCDIC binary file to an ASCII text file
The Stingray project provides access to Cobol files in Python.
CobolToCsv
For example, to convert a Cobol data file to CSV (single record type) using CobolToCsv:
java -jar ../lib/Cobol2Csv.jar -I In/DTAR020.bin -O Out/o_DTAR020_space.csv ^
-C DTAR020.cbl ^
-Q DoubleQuote -FS Fixed_Length ^
-IC CP037 -Delimiter ,
Where
In/DTAR020.bin is the input Cobol data file
Out/o_DTAR020_space.csv is the output CSV file
DTAR020.cbl is the Cobol copybook
Fixed_Length indicates it is a fixed-length file (FB on the mainframe)
RecordEditor
To edit the file see How do you edit a Binary Mainframe file in the RecordEditor using a Cobol Copybook (pt1)
To generate JRecord Code see How do you generate java~jrecord code for a Cobol copybook

Related

Postgres COPY FROM file throwing unicode error while referenced character apparently not in file

First of all, thank you to everyone on Stack Overflow for past, present, and future help. You've all saved me from disaster (both of my own design and otherwise) too many times to count.
The present issue is part of a decision at my firm to transition from a Microsoft SQL Server 2005 database to PostgreSQL 9.4. We have been following the notes on the Postgres wiki (https://wiki.postgresql.org/wiki/Microsoft_SQL_Server_to_PostgreSQL_Migration_by_Ian_Harding), and these are the steps we're following for the table in question:
Download table data [on Windows client]:
bcp "Carbon.consensus.observations" out "Carbon.consensus.observations" -k -S [servername] -T -w
Copy to Postgres server [running CentOS 7]
Run Python pre-processing script on Postgres server to change encoding and clean:
import sys
import os
import re
import codecs
import fileinput
base_path = '/tmp/tables/'
cleaned_path = '/tmp/tables_processed/'
files = os.listdir(base_path)
for filename in files:
    source_path = base_path + filename
    temp_path = '/tmp/' + filename
    target_path = cleaned_path + filename
    BLOCKSIZE = 1048576  # or some other, desired size in bytes
    with open(source_path, 'r') as source_file:
        with open(target_path, 'w') as target_file:
            start = True
            while True:
                contents = source_file.read(BLOCKSIZE).decode('utf-16le')
                if not contents:
                    break
                if start:
                    if contents.startswith(codecs.BOM_UTF8.decode('utf-8')):
                        contents = contents.replace(codecs.BOM_UTF8.decode('utf-8'), ur'')
                contents = contents.replace(ur'\x80', u'')
                contents = re.sub(ur'\000', ur'', contents)
                contents = re.sub(ur'\r\n', ur'\n', contents)
                contents = re.sub(ur'\r', ur'\\r', contents)
                target_file.write(contents.encode('utf-8'))
                start = False
    for line in fileinput.input(target_path, inplace=1):
        if '\x80' in line:
            line = line.replace(r'\x80', '')
        sys.stdout.write(line)
Execute SQL to load table:
COPY consensus.observations FROM '/tmp/tables_processed/Carbon.consensus.observations';
The issue is that the COPY command is failing with a unicode error:
[2015-02-24 19:52:24] [22021] ERROR: invalid byte sequence for encoding "UTF8": 0x80
Where: COPY observations, line 2622420: "..."
Given that this could very likely be because of bad data in the table (which also contains legitimate non-ASCII characters), I'm trying to find the actual byte sequence in context, and I can't find it anywhere (sed to look at the line in question, regexes to replace the character as part of the preprocessing, etc). For reference, this grep returns nothing:
cat /tmp/tables_processed/Carbon.consensus.observations | grep --color='auto' -P "[\x80]"
What am I doing wrong in tracking down where this byte sequence sits in context?
I would recommend loading the SQL file (which appears to be /tmp/tables_processed/Carbon.consensus.observations) into an editor that has a hex mode. This should allow you to see it (depending on the exact editor) in context.
gVim (or terminal-based Vim) is one option I would recommend.
For example, if I open in gVim an SQL copy file that has this content:
1 1.2
2 1.1
3 3.2
I can then convert it to hex mode via the command %!xxd (in gVim or terminal Vim), or the menu option Tools > Convert to HEX.
That yields this display:
0000000: 3109 312e 320a 3209 312e 310a 3309 332e 1.1.2.2.1.1.3.3.
0000010: 320a 2.
You can then run %!xxd -r to convert it back, or the Menu option Tools > Convert back.
Note: This actually modifies the file, so it would be advisable to do this to a copy of the original, just in case the changes somehow get written (you would have to explicitly save the buffer in Vim).
This way, you can see both the hex sequences on the left, and their ASCII equivalent on the right. If you search for 80, you should be able to see it in context. With gVim, the line numbering will be different for both modes, though, as is evidenced by this example.
It's likely the first 80 you find will be that line, though, since if there were earlier ones, it likely would've failed on those instead.
Another tool which might help, and that I've used in the past, is the graphical hex editor GHex. Since that's a GNOME project, I'm not quite sure it'll work with CentOS. wxHexEditor supposedly works with CentOS and looks promising from the website, although I've not yet used it. It's pitched as a "hex editor for massive files", so if your SQL file is large, that might be the way to go.
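If a hex editor isn't convenient, a small Python script can report the byte offsets of the raw 0x80 bytes together with some surrounding context (a rough sketch; the path is the one from the question):
path = '/tmp/tables_processed/Carbon.consensus.observations'
with open(path, 'rb') as f:
    data = f.read()

# Print the offset and 20 bytes of context on either side of every raw 0x80 byte.
pos = data.find(b'\x80')
while pos != -1:
    print(pos, repr(data[max(0, pos - 20):pos + 20]))
    pos = data.find(b'\x80', pos + 1)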

Embedding binary data in a script efficiently

I have seen some installation files for Unix-like systems (huge ones, install.sh for Matlab or Mathematica, for example) that must have quite a lot of binary data, such as icons, sound and graphics, embedded in the script. I am wondering how that can be done, since it could be useful for simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ... This is terribly inefficient; it takes about half a MB of source for a 100 KB wav file.
base64: better than option 1, but it enlarges the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?
You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2
>>> bz2.compress(b'\x00' * 100 + b'\x01' * 200).encode('base64').replace('\n', '')
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2
data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(data.decode('base64')))
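Note the code above is Python 2 (bytes objects there have an .encode('base64') method). On Python 3 a rough equivalent, offered here only as a sketch, uses the base64 module explicitly:
import base64
import bz2

# Build the string to embed in your script.
payload = bz2.compress(b'\x00' * 100 + b'\x01' * 200)
data = base64.b64encode(payload).decode('ascii')

# In the script that carries `data`, reverse the process and write the file.
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(base64.b64decode(data)))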
Here's a quick and dirty way. Create the following script called myInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to the new file specified by of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc.) after the dd command. Just adjust the number in skip= to reflect the total length of the script before the binary data starts.
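If you would rather not count the stub's bytes by hand, a small Python sketch (using the hypothetical file names from above) can measure it and do the append:
import os

stub = 'myInstaller'   # the shell stub shown above
payload = 'myBinary'   # the binary to embed

# The stub's size in bytes is exactly what dd needs to skip to reach the payload,
# so finalise the stub (including its skip= value) before appending.
print('use skip=%d in the dd line' % os.path.getsize(stub))

with open(stub, 'ab') as out, open(payload, 'rb') as src:
    out.write(src.read())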

Writing files in Python and carriage return in Windows

I'm using the OpenCV Python library to extract descriptors and write them to a file. Each descriptor is 32 bytes and I only save 80 of them, so the final file should be exactly 2560 bytes. But it's 2571 bytes.
I also have another file which was written by the same Python script (not on Windows; I think it was on Linux), and it is exactly 2560 bytes.
Using WinMerge, I tried to compare them and it warned me that the carriage returns differ between the two files and asked if I wanted to treat them as equal. If I say "yes", the files are identical, but if I say "no" they are different.
I was wondering if there is any way in Python to write binary files that produce identical results on both Windows and Linux?
This is the relevant part of the script:
f = open("something", "w+")
f.write(descriptors)
f.close()
Yes, there is: open the file in binary mode by putting the b character into the mode string passed to open.
f = open("something", "wb+")
If you don't do that on Windows, every linefeed '\n' will be converted to the two-character line ending used by Windows, '\r\n'.
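Applied to the snippet from the question, that looks like this (a minimal sketch; descriptors is whatever bytes your OpenCV code produced):
# "b" in the mode disables the \n -> \r\n translation Windows applies in text mode.
with open("something", "wb+") as f:
    f.write(descriptors)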

AppDailySales: Works, but the downloaded gzip file is corrupted

I am trying to use the appdailysales.py module to download the daily sales reports for our iPhone apps. I am a .NET developer, so I tried running this using IronPython in a C# solution using the following code:
using IronPython.Hosting;
var ipy = Python.CreateRuntime();
dynamic appSales = ipy.UseFile("appdailysales.py");
appSales.main();
Because I didn't have gzip, I took out the references to that module. I was going to use the GZipStream C# class to decompress the file (Apple provides their downloads as .gz files). So I commented out lines 75 and 429-435.
I have tried executing appdailysales.py in my C# solution, directly from IronPython and using Python 2.7 (installed ActivePython last night); all with the same results: When I try to open the .gz file using 7zip, I get the following error:
CRC Failed ... file is broken
When I try using the GZipStream class I get:
The CRC in GZip footer does not match the CRC calculated from the decompressed data
If I download the .gz file manually, I can decompress the file just fine using 7Zip or GZipStream.
I am fluent in C#, but new to Python. Any help you can provide would be much appreciated.
Thanks for your time.
Looks like line 444 is the problem. Here are lines 444-446:
downloadFile = open(filename, 'w')
downloadFile.write(filebuffer)
downloadFile.close()
At this stage, IF you have deleted lines 429-435 OR selected not to unzip, then filebuffer refers to the raw gzipped stream that you got from the web. The output file is opened in TEXT mode, and you are on Windows, so every \n in the BINARY gzipped stream will be converted to \r\n -- CORRUPTION, like the error message said.
So: for the module to be used portably on both Windows and other platforms, the open mode must be "wb" (b for binary). If the gunzipped result file is also a binary file, "wb" can be hardcoded in the open call. However if the gunzipped file is a text file (meant to be capable of being opened in a text editor), then you need just "w" for that purpose, and you should set a variable mode to either "wb" or "w" as appropriate, and use mode in the open call.
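A hedged sketch of what that mode selection could look like around line 444 (unzip_requested is a hypothetical stand-in for however the script tracks whether the buffer was gunzipped):
# Raw gzipped bytes must be written in binary mode; gunzipped text can use text mode.
mode = 'w' if unzip_requested else 'wb'
downloadFile = open(filename, mode)
downloadFile.write(filebuffer)
downloadFile.close()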
Big question: I understand why you removed the gzip references for IronPython usage. Did you remove those lines for Python 2.7? Or did you run it under Python 2.7 with those lines still in, but set options.unzipFile to False?

Line reading chokes on 0x1A

I have the following file:
abcde
kwakwa
<0x1A>
line3
linllll
Where <0x1A> represents a byte with the hex value of 0x1A. When attempting to read this file in Python as:
for line in open('t.txt'):
    print line,
It only reads the first two lines, and exits the loop.
The solution seems to be to open the file in binary (or universal newline) mode: 'rb' or 'rU'. Can you explain this behavior?
0x1A is Ctrl-Z, and DOS historically used that as an end-of-file marker. For example, try opening a command prompt and "type"ing your file; it will only display the content up to the Ctrl-Z.
Python uses the Windows CRT function _wfopen, which implements the "Ctrl-Z is EOF" semantics.
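That is why binary mode, as mentioned in the question, reads past the 0x1A byte; a minimal sketch:
# Binary mode bypasses the CRT's text-mode handling, so 0x1A is just another byte.
with open('t.txt', 'rb') as f:
    for line in f:
        print(repr(line))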
Ned is of course correct.
If your curiosity runs a little deeper, the root cause is backwards compatibility taken to an extreme. Windows is compatible with DOS, which used Ctrl-Z as an optional end of file marker for text files. What you might not know is that DOS was compatible with CP/M, which was popular on small computers before the PC. CP/M's file system didn't keep track of file sizes down to the byte level, it only kept track by the number of floppy disk sectors. If your file wasn't an exact multiple of 128 bytes, you needed a way to mark the end of the text. This Wikipedia article implies that the selection of Ctrl-Z was based on an even older convention used by DEC.