convert a .dat file with python - python
We have a .dat file that is completely unreadable in a text editor. This .dat file holds information about materials and is used on a machine which can display the data.
Now we are going to use the files for other purposes and need to convert them to a readable format.
When I open the .dat file in Notepad++ this is what I see:
2ñg½x¹¾T?B½ÛÁ#½^fÓ¼":°êȽ¸ô»YY‚½g *½$S)¼¤“è¼F„J¼c1¼$ ¼*‡Ç»½Ú7¼F]S¼Ê(Ï<(‚¤½Y´½å½#ø;N‡;o¸¼¨*S:ΣC¼ÎÀR½žO<š_å¼T÷½p4>½8«»«=ýÆZ<¿[=²”<æt¼pc»q³<×R<ï4¼}Ž‚8pýw<~ï»z†:Qš¼^Kp;XI=<Ѷ ¼
j½…é-=*Ý=;-X7½ßÓ:<ÐZ<Ás!=²LÀ;æã=võu<„4½§V9„燺ý$D<"Š|»å€,<E{=+»¥;2wN¼¸rF=h®<ç[=²=é\=Îý<…À¦¼Î,è¼u…<#_.¼¾Ã¨9æ3½Å°“<ª×½°ÇD¼JÝþ»ph{=Ÿv8;Ne¼’Q; ´{»(ì¿<6Þï»éõ¼*p½©m¼ÝM–<ròä¼½™™¼Õö=j|½±‰Í;2¥C¼¯ 輓?½>¼:„3» ù¼¦k
¼wÞ¹¼Öm‚»=T¼êy¦¼k[…»ÎÉO¼Žc¼$ï½ÖN;H¼4Ø:8¸ž¼dLý¼ø9ø»cI(;4뼈Q¼ž7½,h?¼À ɼy½Å’œ¼¶Åº¼å"±¼bžu¼ Z;½¨½øáY¼ZÖ»2
½ð^š<Þ„§<»ƒ<#±c<f<ŸPÝ;‹œlºÐöï»ö²ñ;ÜŠb=¦';f´<ò=¬3B<\mÛ¼¹©»åB<»Xô;€ºp»¸ ±¼‰Øâ¼7Ug¼€÷ø¼lËû»j}»²‘ô;wu½®ö²¼Ÿ„¼ŠÉ¼ÖV8 Š¼‹÷¯¼ål¼é°ª¼‹o4½ðî$<4Q:.A<
<Ž¬ë<^·G<n
œ<¶l<: è;’MÜ9êÁa<’¢T;~&¼gY®»"P¼¤µº;$H=½…o<6ëæ»ûÒ¼Ê,<‚p½¯À¼#êw»Ír¥¼¸wغA:«<TDI»Nºµ<€ŠMºwnܸ·6:CÕj<àÆ:Dr<7ëo9STÏ<G¼R?M<:)N;.3 <†L<ºZ=I,Y<ñF;iÙ.» pºå0<;:=Tʪ;—ÄË;?'й0Ž:J’J<jR¯»´/½Ô”ؼ•¥˜¼hμd™<9¼iˆ‘<(Šd<ɇÖ#·³È‚»#O><Úo<Ó¸ <ëî;ÒQ<õöî<#Nm¼öw4¼’O¼v <:3<
We know the data in the .dat file has the following format:
MaterialBase ThicknessBase ThicknessIterated Pixel Value
Plastic 0 0 1 -5.662651e-02
Plastic 0 0 2 -1.501216e-01
Plastic 0 0 3 -4.742368e-02
After searching through lots of code online (including here, of course), I came up with the following:
import csv

with open('validationData.dat.201805271617', 'rb') as input_file:
    lines = input_file.readlines()

newLines = []
for line in lines:
    newLine = line.strip('|').split()
    newLines.append(newLine)

with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
The error I get now is:
File "c:\Users\joost.bazelmans\Documents\python\dat2csv.py", line 15, in
<module>
newLine = line.strip('|').split()
TypeError: a bytes-like object is required, not 'str'
It looks like the script is reading the file, but then it cannot split it by the | character. I am lost now. Any ideas on how to continue?
Edit 2018-07-23 13:00
Based on guidot's reply we tried to work with struct. We are now able to get a list of floating point values out of the .dat file, which is also what the R script does. The R script, however, then translates the floats into readable data.
import struct

datavector = []
with open('validationData.dat.201805271617', "rb") as f:
    n = 1000000
    count = 0
    byte = f.read(4)  # same as size = 4 in R
    while count < n and byte != b"":
        datavector.append(struct.unpack('f', byte))
        count += 1
        byte = f.read(4)
print(datavector)
The result we get is something like this:
[(-0.05662650614976883,), (-0.1501215696334839,), (-0.047423675656318665,), (-0.04705987498164177,), (-0.025805648416280746,), (0.0006194132147356868,), (-0.09810388088226318,), (-0.007468236610293388,), (-0.06364697962999344,), (-0.04153480753302574,), (-0.010334763675928116,), (-0.028390713036060333,), (-0.01236063800752163,), (-0.010809036903083324,), (-0.0195484422147274,), (-0.006089110858738422,), (-0.011221584863960743,), (-0.012900656089186668,), (0.02528800442814827,), (-0.0803263783454895,), (-0.03630480542778969,), (-0.03244496509432793,), (0.007571130990982056,), (0.004120028577744961,), (-0.022513896226882935,), (0.0008055367507040501,), (-0.011940909549593925,), (-0.05145340412855148,), (0.008258728310465813,), (-0.02799968793988228,), (-0.035880401730537415,), (-0.04643672704696655,), (-0.005221989005804062,), (0.03542486950755119,), (0.013353106565773487,), (0.035976167768239975,), (0.008336232975125313,), (-0.01492307148873806,), (-0.003470425494015217,), (0.02190450392663479,), (0.012822589837014675,), (-0.008801682852208614,), (6.225423567229882e-05,), (0.015136107802391052,), (-0.007297097705304623,), (0.0010259768459945917,), (-0.018891485407948494,), (0.0036666016094386578,), (0.01155313104391098,), (-0.009809211827814579,), (-0.03696637228131294,), (0.04245902970433235,), (0.002897093538194895,), (-0.04476182535290718,), (0.011403053067624569,), (0.01330728828907013,), (0.03941703215241432,), (0.005868517793715,), (0.031955622136592865,), (0.015012135729193687,), (-0.0439620167016983,), (0.00014146660396363586,), (-0.0010368679650127888,), (0.011971709318459034,), (-0.003853448200970888,), (0.010528777725994587,), (0.06129004433751106,), (0.00505771255120635,), (-0.012601660564541817,), (0.01481446623802185,), (0.019019771367311478,), (0.004633020609617233,), (-0.021741455420851707,), (-0.033449672162532806,), (-0.021316081285476685,), (0.00593474181368947,), (0.0030296281911432743,), (0.023055575788021088,), (0.0256675872951746,), 
(0.03663543984293938,), (0.044298700988292694,), (0.01264342200011015,), (0.032493121922016144,), (-0.06546197831630707,), (0.031123168766498566,), (0.005013703368604183,), (-0.006611336953938007,), (-0.041526272892951965,), (0.0007577596697956324,), (0.030475322157144547,), (0.034476157277822495,), (-0.015037396922707558,), (0.07587681710720062,)]
Now the question is how to convert these floating point numbers to readable content.
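As a side note on the struct snippet above: the whole file can be unpacked in one call with struct.iter_unpack. This is only a sketch; the '<f' little-endian single-precision format is an assumption that matches the values shown, and how the flat float list maps back onto the Material/Thickness/Pixel rows depends on how the machine writes them.

```python
import struct

def floats_from_bytes(data):
    """Unpack every consecutive 4-byte little-endian float in `data`."""
    # iter_unpack walks the buffer in 4-byte steps and yields 1-tuples,
    # so flatten them into a plain list of floats
    return [v for (v,) in struct.iter_unpack("<f", data)]

# Usage with the file from the question:
# with open('validationData.dat.201805271617', 'rb') as f:
#     datavector = floats_from_bytes(f.read())
```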
Since you opened your file in binary mode (mode='rb'), I think you probably just need to pass a bytes object to strip:
newLine = line.strip(b'|').split()
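To see the difference: bytes methods require bytes arguments. A quick illustration using a made-up pipe-delimited line (the real .dat file is binary, so this only demonstrates the str-vs-bytes mismatch behind the TypeError):

```python
# Data read in 'rb' mode is bytes, so the strip/split arguments must be
# bytes too; passing a str like '|' raises the TypeError from the question.
line = b'|Plastic 0 0 1 -5.662651e-02|\n'
parts = line.strip(b'|\n').split()
```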
Related
- Decoding Multiple CDR Records from a File with asn1tools (python)
How to decode multiple CDR records from a file with asn1tools. This is my Python code:

import asn1tools

# input file name
fileName = 'D:/Python/Asn1/SIO/bHWMSC12021043012454329.dat'
# create Dict from asn file struct
Foo = asn1tools.compile_files('asn_Huawei.asn', cache_dir='My-Cache', numeric_enums=True)
# open binary file
with open(fileName, "rb+") as binaryfile:
    buffer = binaryfile.read()
# match and decode all records with Dict
decoded = Foo.decode('CallEventRecord', buffer)
print(decoded)

print(decoded) gives only the first record. My file contains 1550 records. How can I read my file tag by tag with asn1tools?
I got the same issue and am trying to figure it out. You can +1 this issue. I managed to decode multi-CDR files with a combination of pyasn1 and asn1tools; I'll update when it is fully tested.
I would use decode_with_length instead of decode:

decoded, lenDecoded = Foo.decode_with_length('CallEventRecord', buffer)
buffer = buffer[lenDecoded:]

Then just loop until the buffer is empty.
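That loop pattern can be sketched generically. The stub decoder below stands in for Foo.decode_with_length (which returns the decoded record and the number of bytes it consumed); substitute the real call for your schema.

```python
def decode_all(buffer, decode_with_length):
    """Repeatedly decode one record, then advance past its encoded bytes."""
    records = []
    while buffer:
        decoded, consumed = decode_with_length(buffer)
        records.append(decoded)
        buffer = buffer[consumed:]  # drop the bytes just decoded
    return records

# With asn1tools this would be, e.g.:
# records = decode_all(buffer, lambda b: Foo.decode_with_length('CallEventRecord', b))
```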
For me (it highly depends on how your files are internally structured and what your goal is), the fastest method so far is to use find to jump to the next relevant record; mine always start with b'\xa7':

with open(file, 'rb') as encoded_bytes:
    data = encoded_bytes.read()

file_len = len(data)
index = 0
while index < file_len:
    next_occurence = data[index:].find(b'\xa7')
    if next_occurence < 0:
        break
    # ...
    index += next_occurence
    decoded_record, record_length = schema.decode_with_length('CallEventRecord', data[index:])
    # ...
    index += 1
    continue
Removing blanks in Python
I am new to Python, so I am trying to keep this as simple as possible. I am working with a CSV file containing data that I have to MapReduce. In my mapper portion I get blank data, which breaks the reduce step, because the CSV file has blanks in it. I need advice on how to remove blanks in my mapper so they will not go into my reducer.

Example of my result:

BLUE 1
GY 1
WT 1
1
WH 1
1
BLACK 1
1
GN 1
BLK 1
BLACK 1
RED 1

My code:

#!/usr/bin/python
import sys

sys_stdin = open("Parking_Violations.csv", "r")
for line in sys_stdin:
    line = line.split(",")
    vehiclecolor = line[33]  # the column in the CSV file where the data I need is located
    try:
        issuecolor = str(vehiclecolor)
        print("%s\t%s" % (issuecolor, 1))
    except ValueError:
        continue
You can use the builtin str.strip method:

#!/usr/bin/python
import sys

sys_stdin = open("Parking_Violations.csv", "r")
for line in sys_stdin:
    vehiclecolor = line.split(",")[33].strip()
    if vehiclecolor:
        print("%s\t%s" % (vehiclecolor, 1))

What it does is take the 34th comma-separated column and remove the surrounding whitespace with .strip(). This assumes every row actually has at least 34 columns; otherwise it will raise an IndexError. It then checks whether vehiclecolor has any characters left via the if, and prints only when there is a value: in Python an empty string evaluates as false.
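A more robust variant uses the csv module, which handles quoted fields containing commas that a plain split(",") would break on. A sketch, keeping the question's column index 33:

```python
import csv
import io

def colors_from_csv(text, col=33):
    """Yield non-blank, whitespace-stripped values from the given column."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) > col:          # skip short rows instead of raising
            value = row[col].strip()
            if value:               # skip blank colors entirely
                yield value
```

With a real file you would pass the open file object to csv.reader directly instead of wrapping a string.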
Python parsing a text file and logical methods
I'm a bit stuck with Python logic and would like some advice on how to tackle a problem I'm having with parsing data. I've spent a bit of time reading the Python reference documents and going through this site, and I understand there are several ways to do what I'm trying to achieve; this is the path I've gone down.

I'm re-formatting some text files with data generated from satellite hardware, to be uploaded into a MySQL database. This is the raw data:

TP N: 1
Frequency: 12288.635 Mhz
Symbol rate: 3000 KS
Polarization: Vertical
Spectrum: Inverted
Standard/Modulation: DVB-S2/QPSK
FEC: 1/2
RollOff: 0.20
Pilot: on
Coding mode: ACM/VCM
Short frame
Transport stream
Single input stream
RF-Level: -49 dBm
Signal/Noise: 6.3 dB
Carrier width: 3.600 Mhz
BitRate: 2.967 Mbit/s

The above section is repeated for each transponder (TP N) on the satellite. I'm using this script to extract the data I need:

strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
for line in sat_raw:
    if any(s in line for s in strings):
        for word in line.split():
            if ':' in word:
                sat_out.write(line.split(':')[-1])
sat_raw.close()
sat_out.close()

The output data is then formatted like this before it's sent to the database:

12288.635 Mhz
3000 KS
Vertical
DVB-S2/QPSK
1/2
-49 dBm
6.3 dB
3.600 Mhz
2.967 Mbit/s

This script is working fine, but for some features I want to implement on MySQL I need to edit it further:

Remove the decimal point, the 3 digits after it, and "Mhz" on the first "Frequency" line.
Remove all the trailing measurement references (KS, dBm, dB, Mhz, Mbit).
Join the 9 fields into a comma-delimited string so each transponder (approx. 30 per file) is on its own line.

I'm unsure whether to continue down this path, adding onto this existing script (I'm stuck at the point where the output file is written), or rethink my approach to the way I'm processing the raw file.
My solution is crude and might not work in corner cases, but it is a good start.

import re
import csv

strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
csv_writer = csv.writer(sat_out)
csv_output = []
for line in sat_raw:
    if any(s in line for s in strings):
        try:
            m = re.match(r'^.*:\s+(\S+)', line)
            value = m.groups()[0]
            # Attempt to convert to int, thus removing the decimal part
            value = int(float(value))
        except ValueError:
            pass  # ignore conversion failures (non-numeric values)
        except AttributeError:
            pass  # ignore case when m is None (no match)
        csv_output.append(value)
    elif line.startswith('TP N'):
        # Before we start a new set of values, write out the old set
        if csv_output:
            csv_writer.writerow(csv_output)
        csv_output = []
# If we reach the end of the file, don't miss the last set of values
if csv_output:
    csv_writer.writerow(csv_output)
sat_raw.close()
sat_out.close()

Discussion:
The csv package helps with CSV output.
The re (regular expression) module helps parse each line and extract its value.
In the line that reads value = int(...), we attempt to turn the string value into an integer, thus removing the dot and following digits.
When the code encounters a line that starts with 'TP N', which signals a new set of values, we write the old set out to the CSV file.
import math

strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
files = ['/BLScan/reports/1520.txt']
sat_out = open('1520out.txt', 'w')
combineOutput = []
for myfile in files:
    sat_raw = open(myfile, 'r')
    singleOutput = []
    for line in sat_raw:
        if any(s in line for s in strings):
            marker = line.split(':')[1]
            try:
                data = str(int(math.floor(float(marker.split()[0]))))
            except ValueError:
                data = marker.split()[0]
            singleOutput.append(data)
    sat_raw.close()
    combineOutput.append(",".join(singleOutput))
for rec in combineOutput:
    sat_out.write("%s\n" % rec)
sat_out.close()

Add all the files that you want to parse to the files list. It will write the output of each file as a separate line, with each field comma-separated.
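The "strip the decimal part and the unit" step in both answers can be isolated into one small helper. A sketch (clean_value is a made-up name); it keeps a leading integer if one exists and otherwise returns the field unchanged, so text fields like "Vertical" pass through:

```python
import re

def clean_value(field):
    """Keep the leading integer, dropping any decimal part and trailing unit."""
    m = re.match(r'\s*(-?\d+)', field)
    return m.group(1) if m else field.strip()
```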
Read Wireshark dump file for packet times
My instructions were to read the wireshark.bin data file dumped from the Wireshark program and pick out the packet times. I have no idea how to skip the header and find the first time.

""" reads the wireshark.bin data file dumped from the wireshark program """
import struct
import datetime

# file = r"V:\workspace\Python3_Homework08\src\wireshark.bin"
idList = []
with open("wireshark.bin", "rb") as f:
    while True:
        bytes_read = f.read(struct.calcsize("=l"))
        if not bytes_read:
            break
        if len(bytes_read) > 3:
            o = struct.unpack("=l", bytes_read)[0]
            idList.append(o)
            print(datetime.date.fromtimestamp(o))
Try reading the entire file at once, and then accessing it as a list:

data = open("wireshark.bin", "rb").read()  # let Python automatically close file
magic = data[:4]             # magic wireshark number (also reveals byte order)
gmt_correction = data[8:12]  # GMT offset
data = data[24:]             # actual packets

Now you can loop through data in (16?) byte size chunks, looking at the appropriate offset in each chunk for the timestamp. The magic number is 0xa1b2c3d4, which takes four bytes, or two words. We can determine the order (big-endian or little-endian) by examining those first four bytes using the struct module:

magic = struct.unpack('>L', data[0:4])[0]  # before the data = data[24:] line above
if magic == 0xa1b2c3d4:
    order = '>'
elif magic == 0xd4c3b2a1:
    order = '<'
else:
    raise NotWireSharkFile()

Now that we have the order (and know it's a wireshark file), we can loop through the packets:

field0, field1, field2, field3 = struct.unpack('%sllll' % order, data[:16])
payload = data[16 : 16 + field?]
data = data[16 + field? :]

I left the names vague, since this is homework, but those field? names represent the information stored in the packet header, which includes the timestamp and the length of the following packet data. This code is incomplete, but hopefully it will be enough to get you going.
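The magic-number check from the answer can be packaged as a small function; this is just the byte-order detection, not the full packet loop the homework asks for.

```python
import struct

def pcap_byte_order(header):
    """Return the struct byte-order prefix implied by a pcap magic number."""
    magic = struct.unpack('>L', header[:4])[0]
    if magic == 0xa1b2c3d4:
        return '>'  # file was written big-endian
    if magic == 0xd4c3b2a1:
        return '<'  # file was written little-endian
    raise ValueError('not a pcap capture file')
```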
Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?
I have to convert a number of large files (up to 2GB) of EBCDIC 500 encoded data to Latin-1. Since I could only find EBCDIC-to-ASCII converters (dd, recode), and the files contain some additional proprietary character codes, I thought I'd write my own converter. I have the character mapping, so I'm interested in the technical aspects.

This is my approach so far:

# char mapping lookup table
EBCDIC_TO_LATIN1 = {
    0xC1: '41',  # A
    0xC2: '42',  # B
    # and so on...
}

BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(ebd2latin1(buffer))
    buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()

This is the function that does the converting:

def ebd2latin1(ebcdic):
    result = []
    for ch in ebcdic:
        result.append(EBCDIC_TO_LATIN1[ord(ch)])
    return ''.join(result).decode('hex')

The question is whether or not this is a sensible approach from an engineering standpoint. Does it have serious design issues? Is the buffer size OK? And so on.

As for the "proprietary characters" that some don't believe in: each file contains a year's worth of patent documents in SGML format. The patent office used EBCDIC until they switched to Unicode in 2005. There are thousands of documents within each file, separated by some hex values that are not part of any IBM specification; they were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you about the length of the file. I don't really need that information, but if I want to process the file I have to deal with them.

Also:

$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'

Thanks for the help so far.
EBCDIC 500, aka Code Page 500, is amongst Python's encodings, although you link to cp1047, which isn't. Which one are you using, really? Anyway, this works for cp500 (or any other encoding that you have):

from __future__ import with_statement
import sys
from contextlib import nested

BUFFER_SIZE = 16384
with nested(open(sys.argv[1], 'rb'), open(sys.argv[2], 'wb')) as (infile, outfile):
    while True:
        buffer = infile.read(BUFFER_SIZE)
        if not buffer:
            break
        outfile.write(buffer.decode('cp500').encode('latin1'))

This way you shouldn't need to keep track of the mappings yourself.
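The core of that answer is the decode/encode round trip, which also works unchanged on Python 3 bytes (cp500 ships with Python's standard codecs). A minimal sketch; characters outside the Latin-1 repertoire would raise UnicodeEncodeError here:

```python
def ebcdic_chunk_to_latin1(chunk):
    """Transcode one buffer of cp500 (EBCDIC 500) bytes to Latin-1 bytes."""
    return chunk.decode('cp500').encode('latin1')
```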
If you set up the table correctly, then you just need to do: translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1) where ebcdic contains EBCDIC characters and EBCDIC_TO_LATIN1 is a 256-char string which maps each EBCDIC character to its Latin-1 equivalent. The characters in EBCDIC_TO_LATIN1 are the actual binary values rather than their hex representations. For example, if you are using code page 500, the first 16 bytes of EBCDIC_TO_LATIN1 would be '\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F' using this reference.
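The 256-entry table need not be typed in by hand: it can be generated once from the codec itself, then applied with bytes.translate (the Python 3 spelling of the same idea; on Python 2 the table would be a 256-char str). A sketch:

```python
# Build a 256-byte translation table via the cp500 codec: entry i is the
# Latin-1 byte for EBCDIC byte i ('replace' guards against unmappable chars).
TABLE = bytes(
    bytes([i]).decode('cp500').encode('latin1', 'replace')[0]
    for i in range(256)
)

def translate_chunk(chunk):
    """Translate a cp500 chunk to Latin-1 without a per-character Python loop."""
    return chunk.translate(TABLE)
```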
While this might not help the original poster anymore, some time ago I released a package for Python 2.6+ and 3.2+ that adds most of the western 8 bit mainframe codecs including CP1047 (French) and CP1141 (German): https://pypi.python.org/pypi/ebcdic. Simply import ebcdic to add the codecs and then use open(..., encoding='cp1047') to read or write files.
Answer 1: Yet another silly question: what gave you the impression that recode produced only ASCII as output? AFAICT it will transcode ANY of its repertoire of charsets to ANY of its repertoire, AND its repertoire includes IBM cp500 and cp1047, and OF COURSE latin1. Reading the comments, you will note that Lennaert and I have discovered that there aren't any "proprietary" codes in those two IBM character sets. So you may well be able to use recode after all, once you are certain what charset you've actually got.

Answer 2: If you really need/want to transcode IBM cp1047 via Python, you might like to first get the mapping from an authoritative source, processing it via a script with some checks:

URL = "http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/glibc-IBM1047-2.1.2.ucm"
""" Sample lines:
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
<U0004> \x37 |0
<U0005> \x2D |0
"""
import urllib, re

text = urllib.urlopen(URL).read()
regex = r"<U([0-9a-fA-F]{4,4})>\s+\\x([0-9a-fA-F]{2,2})\s"
results = re.findall(regex, text)
wlist = [None] * 256
for result in results:
    unum, inum = [int(x, 16) for x in result]
    assert wlist[inum] is None
    assert 0 <= unum <= 255
    wlist[inum] = chr(unum)
assert not any(x is None for x in wlist)
print repr(''.join(wlist))

Then carefully copy/paste the output into your transcoding script for use with Vinay's buffer.translate(the_mapping) idea, with a buffer size perhaps a bit larger than 16KB and certainly a bit smaller than 2GB :-)
No crystal ball, no info from OP, so I had a bit of a rummage in the EPO website. Found freely downloadable weekly patent info files, still available in cp500/SGML even though the website says this will be replaced by utf8/XML in 2006 :-). Got the 2009 week 27 file. It is a zip containing 2 files, s350927[ab].bin. "bin" means "not XML". Got the spec!

It looks possible that the "proprietary codes" are actually BINARY fields. Each record has a fixed 252-byte header. The first 5 bytes are the record length in EBCDIC, e.g. hex F0F2F2F0F8 -> 2208 bytes. The last 2 bytes of the fixed header are the BINARY length (redundant) of the following variable part. In the middle are several text fields, two 2-byte binary fields, and one 4-byte binary field. The binary fields are serial numbers within groups, but all I saw are 1. The variable part is SGML.

Example (last record from s350927b.bin):

Record number: 7266
pprint of header text and binary slices:
['EPB102055619 TXT00000001', 1, ' 20090701200927 08013627.8 EP20090528NN ', 1, 1, ' T *lots of spaces snipped*']

Edited version of the rather long SGML:

<PATDOC FILE="08013627.8" CY=EP DNUM=2055619 KIND=B1 DATE=20090701 STATUS=N>
*snip*
<B541>DE<B542>Windschutzeinheit für ein Motorrad
<B541>EN<B542>Windshield unit for saddle-ride type vehicle
<B541>FR<B542>Unité pare-brise pour motocyclette</B540>
*snip*
</PATDOC>

There are no header or trailer records, just this one record format. So if the OP's annual files are anything like this, we might be able to help him out.

Update: Above was the "2 a.m. in my timezone" version. Here's a bit more info. OP said: "at the beginning of each file there are a few digits in ASCII that tell you about the length of the file." Translate that to "at the beginning of each record there are five digits in EBCDIC that tell you exactly the length of the record" and we have a (very fuzzy) match!

Here is the URL of the documentation page: http://docs.epoline.org/ebd/info.htm — the FIRST file mentioned is the spec.
Here is the URL of the download-weekly-data page: http://ebd2.epoline.org/jsp/ebdst35.jsp

An observation: the data that I looked at is in the ST.35 series. Also available for download is ST.32, which appears to be a parallel version containing only the SGML content (in "reduced cp437/850", one tag per line). This indicates that the fields in the fixed-length header of the ST.35 records may not be very interesting and can thus be skipped over, which would greatly simplify the transcoding task.

For what it's worth, here is my (investigatory, written after midnight) code [Update 2: tidied up the code a little; no functionality changes]:

from pprint import pprint as pp
import sys
from struct import unpack

HDRSZ = 252
T = '>s'  # text
H = '>H'  # binary 2 bytes
I = '>I'  # binary 4 bytes
hdr_defn = [
    6, T,
    38, H,
    40, T,
    94, I,
    98, H,
    100, T,
    251, H,  # length of following SGML text
    HDRSZ + 1,
    ]
# above positions as per spec, reduce to allow for counting from 1
for i in xrange(0, len(hdr_defn), 2):
    hdr_defn[i] -= 1

def records(fname, output_encoding='latin1', debug=False):
    xlator = ''.join(
        chr(i).decode('cp500').encode(output_encoding, 'replace')
        for i in range(256)
        )
    # print repr(xlator)
    def xlate(ebcdic):
        return ebcdic.translate(xlator)
        # return ebcdic.decode('cp500')  # use this if unicode output desired
    f = open(fname, 'rb')
    recnum = -1
    while True:
        # get header
        buff = f.read(HDRSZ)
        if not buff:
            return  # EOF
        recnum += 1
        if debug: print "\nrecnum", recnum
        assert len(buff) == HDRSZ
        recsz = int(xlate(buff[:5]))
        if debug: print "recsz", recsz
        # split remainder of header into text and binary pieces
        fields = []
        for i in xrange(0, len(hdr_defn) - 2, 2):
            ty = hdr_defn[i + 1]
            piece = buff[hdr_defn[i]:hdr_defn[i + 2]]
            if ty == T:
                fields.append(xlate(piece))
            else:
                fields.append(unpack(ty, piece)[0])
        if debug: pp(fields)
        sgmlsz = fields.pop()
        if debug:
            print "sgmlsz: %d; expected: %d - %d = %d" % (sgmlsz, recsz, HDRSZ, recsz - HDRSZ)
        assert sgmlsz == recsz - HDRSZ
        # get sgml part
        sgml = f.read(sgmlsz)
        assert len(sgml) == sgmlsz
        sgml = xlate(sgml)
        if debug: print "sgml", sgml
        yield recnum, fields, sgml

if __name__ == "__main__":
    maxrecs = int(sys.argv[1])  # dump out the last `maxrecs` records in the file
    fname = sys.argv[2]
    keep = [None] * maxrecs
    for recnum, fields, sgml in records(fname):
        # do something useful here
        keep[recnum % maxrecs] = (recnum, fields, sgml)
    keep.sort()
    for k in keep:
        if k:
            recnum, fields, sgml = k
            print
            print recnum
            pp(fields)
            print sgml
Assuming cp500 contains all of your "additional proprietary characters", a more concise version based on Lennart's answer, using the codecs module:

import sys, codecs

BUFFER_SIZE = 64 * 1024
ebd_file = codecs.open(sys.argv[1], 'r', 'cp500')
latin1_file = codecs.open(sys.argv[2], 'w', 'latin1')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(buffer)
    buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()