Python kludge to read UCS-2 (UTF-16?) as ASCII

I'm in a little over my head on this one, so please pardon my terminology in advance.
I'm running this using Python 2.7 on Windows XP.
I found some Python code that reads a log file, does some stuff, then displays something.
What, that's not enough detail? Ok, here's a simplified version:
#!/usr/bin/python
import re
import sys
class NotSupportedTOCError(Exception):
    pass
def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s*
                   .+\s+ \| (?#track)
                   \s+.+\s+ \| (?#start)
                   \s+.+\s+ \| (?#length)
                   \s+.+\s+ \| (?#start sec)
                   \s+.+\s*$ (?#end sec)
                   """, line, re.X):
            lines.next()
            break
    while True:
        line = lines.next()
        m = re.match(r"""
            ^\s*
            (?P<num>\d+)
            \s*\|\s*
            (?P<start_time>[0-9:.]+)
            \s*\|\s*
            (?P<length_time>[0-9:.]+)
            \s*\|\s*
            (?P<start_sector>\d+)
            \s*\|\s*
            (?P<end_sector>\d+)
            \s*$
            """, line, re.X)
        if not m:
            break
        yield m.groupdict()
def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)
    tracknums = [int(e['num']) for e in eac]
    if range(1,num_tracks+1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII although that may not be precise/accurate - for e.g. Notepad++ indicates it's ANSI).
However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").
I get the following error:
Traceback (most recent call last):
  File "simple.py", line 55, in <module>
    mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
  File "simple.py", line 49, in calculate_mb_toc_numbers
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range
This log works
This log breaks
I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:
type ascii.log > scrubbed.log
and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).
One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to a temporary file and then have the script run on that), but I would like to have Python "ignore" the encoding if it's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.
I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.

codecs.open() will allow you to open a file using a specific encoding, and reading from it will produce unicode objects. You can try a few encodings, going from most likely to least likely (or the tool could just always produce UTF-16LE but ha ha fat chance).
Also, "Unicode In Python, Completely Demystified".
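A minimal sketch of the "try a few encodings" idea, written for Python 3, where bytes.decode() does the job codecs.open() does in 2.x. The helper name decode_guessing and the candidate list are assumptions for illustration, not part of the original script:

```python
def decode_guessing(raw, encodings=("ascii", "utf-16")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings fit: %r" % (encodings,))


# A UTF-16 file with a BOM fails the ASCII attempt and falls through to utf-16.
sample = "Track 1\r\n".encode("utf-16")
text, used = decode_guessing(sample)
```

Order matters here: put the strict encodings first, because a permissive single-byte codec such as latin-1 accepts any byte sequence and would hide every later candidate.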

works.log appears to be encoded in ASCII:
>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True
breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:
>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
to this:
f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)) )
print mb_toc_urlpart
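For anyone on Python 3, roughly the same hack could look like the sketch below. The function names decode_log and iter_lines are made up for this example, and it assumes (as above) that a file without a UTF-16 byte order mark is plain ASCII. In Python 3 the script's lines.next() calls would also need to become next(lines).

```python
import codecs

def decode_log(raw):
    """Decode raw log bytes, using a UTF-16 byte order mark when one is present."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")   # the codec consumes the BOM and picks the byte order
    return raw.decode("ascii")        # assumption: anything else is plain ASCII


def iter_lines(raw):
    """Yield newline-terminated lines, like iterating over a text file."""
    for line in decode_log(raw).splitlines():
        yield line + "\n"
```

The generator returned by iter_lines(open(path, 'rb').read()) can then be passed wherever the script expects the open file.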

In Python 2.x a normal string (str) is a sequence of bytes, usually one byte per character, so it does not handle UTF-16 data transparently. Try this:
Put this at the top of your Python source file:
from __future__ import unicode_literals
And change all the str to unicode.
[edit]
And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.

Read Null terminated string in python

I'm trying to read a null terminated string but i'm having issues when unpacking a char and putting it together with a string.
This is the code:
def readString(f):
    str = ''
    while True:
        char = readChar(f)
        str = str.join(char)
        if (hex(ord(char))) == '0x0':
            break
    return str
def readChar(f):
    char = unpack('c',f.read(1))[0]
    return char
Now this is giving me this error:
TypeError: sequence item 0: expected str instance, int found
I'm also trying the following:
char = unpack('c',f.read(1)).decode("ascii")
But it throws me:
AttributeError: 'tuple' object has no attribute 'decode'
I don't even know how to read the chars and add it to the string, Is there any proper way to do this?
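Here's a version that (ab)uses iter()'s lesser-known "sentinel" argument:
with open('file.txt', 'rb') as f:
    val = ''.join(iter(lambda: f.read(1).decode('ascii'), '\x00'))
Note that this loops forever if the file ends before a null byte is found, because read(1) then keeps returning an empty string, which never equals the sentinel.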
How about:
myString = myNullTerminatedString.split("\x00")[0]
For example:
myNullTerminatedString = "hello world\x00\x00\x00\x00\x00\x00"
myString = myNullTerminatedString.split("\x00")[0]
print(myString) # "hello world"
This works by splitting the string on the null character. Since the string should terminate at the first null character, we simply grab the first item in the list after splitting. split will return a list of one item if the delimiter doesn't exist, so it still works even if there's no null terminator at all.
It also will work with byte strings:
myByteString = b'hello world\x00'
myStr = myByteString.split(b'\x00')[0].decode('ascii') # "hello world" as normal string
If you're reading from a file, you can do a relatively larger read - estimate how much you'll need to read to find your null string. This is a lot faster than reading byte-by-byte. For example:
resultingStr = b''
while True:
    buf = f.read(512)
    if len(buf)==0: break
    resultingStr += buf
    if (b"\x00" in resultingStr):
        nullIndex = resultingStr.index(b"\x00")
        # number of bytes read from the null byte onwards
        extraBytes = len(resultingStr) - nullIndex
        resultingStr = resultingStr[:nullIndex]
        # now "resultingStr" contains the string
        f.seek(0 - extraBytes,1) # seek backwards by the number of bytes, now the pointer will be on the null byte in the file
        # or f.seek(1 - extraBytes,1) to skip the null byte in the file
        break
(edit version 2, added extra way at the end)
Maybe there are some libraries out there that can help you with this, but as I don't know of any, let's attack the problem at hand with what we know.
In Python 2, bytes and str are basically the same thing. That changes in Python 3, where str is what Python 2 calls unicode and bytes is its own separate type. This means you don't need to define a readChar in Python 2, as no extra work is required, and I don't think you need unpack for this particular case. With that in mind, let's define the new readString:
def readString(myfile):
    chars = []
    while True:
        c = myfile.read(1)
        if c == chr(0):
            return "".join(chars)
        chars.append(c)
Just like your code, I read one character at a time, but I save them in a list instead. The reason is that strings are immutable, so doing str+=char results in unnecessary copies. When I find the null character, I return the joined string. chr is the inverse of ord: it gives you the character for a given ASCII value. This will exclude the null character; if it's needed, just move the appending...
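A Python 3 take on the same loop, as a sketch only: the name read_cstring and the ASCII default are my assumptions. It compares against a bytes literal instead of chr(0) and also stops at end of file, since read(1) returns an empty bytes object there and the loop would otherwise never finish.

```python
import io

def read_cstring(f, encoding="ascii"):
    """Read bytes from binary file f up to (not including) the next null byte."""
    chunks = []
    while True:
        c = f.read(1)
        if c == b"\x00" or c == b"":   # terminator found, or end of file
            return b"".join(chunks).decode(encoding)
        chunks.append(c)


demo = io.BytesIO(b"sword\x00Sword_Wea_Dummy\x00tail")
names = [read_cstring(demo), read_cstring(demo), read_cstring(demo)]
```

The in-memory BytesIO stands in for a real file opened with mode 'rb'.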
Now let's test it with your sample file.
For instance, let's try to read "Sword_Wea_Dummy" from it:
with open("sword.blendscn","rb") as archi:
    #lets simulate that some prior processing was made by
    #moving the pointer of the file
    archi.seek(6)
    string=readString(archi)
    print "string repr:", repr(string)
    print "string:", string
    print ""
    #and the rest of the file is there waiting to be processed
    print "rest of the file: ", repr(archi.read())
and this is the output
string repr: 'Sword_Wea_Dummy'
string: Sword_Wea_Dummy
rest of the file: '\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
other tests
>>> with open("sword.blendscn","rb") as archi:
        print readString(archi)
        print readString(archi)
        print readString(archi)
sword
Sword_Wea_Dummy
ÍÌÌ=p=Š4:¦6¿JÆ=
>>> with open("sword.blendscn","rb") as archi:
        print repr(readString(archi))
        print repr(readString(archi))
        print repr(readString(archi))
'sword'
'Sword_Wea_Dummy'
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6='
>>>
Now that I think about it, you mention that the data portion is of fixed size. If that is true for all files, and the structure of all of them is as follows
[unknown-size data][known-size data]
then that is a pattern we can exploit: we only need to know the size of the file, and we can get both parts smoothly as follows
import os
def getDataPair(filename,knowSize):
    size = os.path.getsize(filename)
    with open(filename, "rb") as archi:
        unknown = archi.read(size-knowSize)
        know = archi.read()
    return unknown, know
Knowing the size of the data portion (which I got by playing with the prior example), its use is simple
>>> string_data, data = getDataPair("sword.blendscn", 80)
>>> string_data
'sword\x00Sword_Wea_Dummy\x00'
>>> data
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
>>> string_data.split(chr(0))
['sword', 'Sword_Wea_Dummy', '']
>>>
Now a simple split will suffice to get each string, and you can pass the rest of the file, contained in data, to the appropriate function to be processed.
Doing file I/O one character at a time is horribly slow.
Instead use readline0, now on pypi: https://pypi.org/project/readline0/ . Or something like it.
In 3.x, there's a "newline" argument to open, but it doesn't appear to be as flexible as readline0.
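To illustrate the "read in chunks, not characters" point without depending on any package, here is a rough Python 3 sketch. It is not readline0's API (I haven't reproduced that here); iter_null_terminated and its chunk_size parameter are names invented for this example.

```python
import io

def iter_null_terminated(f, chunk_size=4096):
    """Yield each null-terminated record from binary file f as bytes."""
    pending = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # everything before the last separator is complete; the rest may continue
        *records, pending = pending.split(b"\x00")
        for record in records:
            yield record
    if pending:
        yield pending  # trailing data with no terminator


demo = io.BytesIO(b"sword\x00Sword_Wea_Dummy\x00")
found = list(iter_null_terminated(demo, chunk_size=4))
```

The tiny chunk_size in the demo is only there to show that records spanning several reads are stitched back together; a real file would use the default or larger.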
Here is my implementation:
import struct
def read_null_str(f):
    r_str = ""
    while 1:
        back_offset = f.tell()
        try:
            r_char = struct.unpack("c", f.read(1))[0].decode("utf8")
        except:
            f.seek(back_offset)
            temp_char = struct.unpack("<H", f.read(2))[0]
            r_char = chr(temp_char)
        if ord(r_char) == 0:
            return r_str
        else:
            r_str += r_char

Two regex functions together do not work

I am trying to get the index for the start of a tag and the end of another tag. However, when I use one regex it works absolutely fine but for two regex functions, it gives an error for the second one.
Kindly help in explaining the reason
The below code works fine:
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
opentag = re.search('<TEXT>',f.read())
begin = opentag.start()+6
print begin
But when I add another similar regex it give me the error
AttributeError: 'NoneType' object has no attribute 'start'
which I understand is due to re.search() returning None, so there is no match object to call start() on
Below is the code:
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
opentag = re.search('<TEXT>',f.read())
begin = opentag.start()+6
print begin
closetag = re.search('</TEXT>',f.read())
end = closetag.start() - 1
print end
Please provide a solution to how can I get this working. Also I am a newbie here so please don't mind if I ask more questions on the solution.
You are reading the file with f.read(), which reads the whole file and leaves the file position at the end. That means there is no text left to read when you call f.read() the next time.
If you need to search on the same text again, save the output of f.read(), and then do a regular expression search on it as below:
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
text = f.read()
opentag = re.search('<TEXT>',text)
begin = opentag.start()+6
print begin
closetag = re.search('</TEXT>',text)
end = closetag.start() - 1
print end
f.read() reads the whole file. So there's nothing left to read on the second f.read() call.
See https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
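A quick way to see this for yourself, using an in-memory file so nothing on disk is needed (the sample text is made up):

```python
import io

f = io.StringIO("<TEXT>hello</TEXT>")
first = f.read()    # consumes everything, position is now at end of file
second = f.read()   # nothing left: returns an empty string
f.seek(0)           # rewind to the start
third = f.read()    # the full contents again
```

So either keep the result of the first read() in a variable, or seek(0) before reading again. Keeping the text in a variable is usually simpler.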
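First of all, you have to know that f.read() leaves the file position at EOF after reading the file, so if you call f.read() again it gives you an empty string ''. Secondly, it is good practice to put r before a string passed as the pattern to re.search. The r means raw: Python leaves backslashes in the literal untouched, so sequences like \d or \b reach the regex engine intact. It does not escape regex special characters, and for a pattern without backslashes such as '<TEXT>' it makes no difference. So you have to do something like this: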
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
data = f.read()
opentag = re.search(r'<TEXT>',data)
begin = opentag.start()+6
print begin
closetag = re.search(r'</TEXT>',data)
end = closetag.start() - 1
print end
gl & hf with Python :)
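For what it's worth, here is a small sketch of what the r prefix does and does not do. The sample strings are invented; re.escape() is the standard-library function for escaping regex metacharacters.

```python
import re

# A raw string only changes how Python reads the literal.
same_pattern = (r"\d" == "\\d")            # both are backslash followed by 'd'
lengths = (len("\n"), len(r"\n"))          # one newline char vs. backslash + 'n'

# It does not escape regex metacharacters; re.escape() does that.
loose = re.compile(r"10.180")              # '.' still matches any character
strict = re.compile(re.escape("10.180"))   # '.' matches only a literal dot
loose_hit = bool(loose.match("10x180"))
strict_hit = bool(strict.match("10x180"))
```

For '<TEXT>' and '</TEXT>' there are no backslashes or metacharacters involved, so the plain and raw versions of those patterns behave identically.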

Parsing a text file for pattern and writing found pattern back to another file python 3.4

I am trying to open a text file. Parse the text file for specific regex patterns then when if I find that pattern I write the regex returned pattern to another text file.
Specifically a list of IP Addresses which I want to parse specific ones out of.
So the file may have
10.10.10.10
9.9.9.9
5.5.5.5
6.10.10.10
And say I want just the IPs that end in 10 (the regex I think I am good with). My example looks for the 10.180.42, or 41.XX IP hosts, but I will adjust as needed.
I've tried several methods and failed miserably at them all. It's days like this I know why I just never mastered any language. But I'm committed to Python so here goes.
import re
textfile = open("SymantecServers.txt", 'r')
matches = re.findall('^10.180\.4[3,1].\d\d',str(textfile))
print(matches)
This gives me empty brackets. I had to encase the textfile in the str function or it just puked. I don't know if this is right.
This just failed all over the place no matter how I fine tuned it.
f = open("SymantecServers.txt","r")
o = open("JustIP.txt",'w', newline="\r\n")
for line in f:
    pattern = re.compile("^10.180\.4[3,1].\d\d")
    print(pattern)
    #o.write(pattern)
    #o.close()
    f.close()
I did get one working, but it just returned the entire line (including netmask and other text like hostname, which are all on the same line in the text file). I just want the IP.
I'd appreciate any help on how to read a text file and, if a line has the IP pattern, grab the full IP and write that into another text file, so I end up with a text file with a list of just the IPs I want. I am 3 hours into it and behind on work so going to do the first file by hand...
I am just at a loss what I am missing. Sorry for being a newbie
here is it working:
>>> s = """10.10.10.10
... 9.9.9.9
... 5.5.5.5
... 10.180.43.99
... 6.10.10.10"""
>>> re.findall(r'10\.180\.4[31]\.\d\d', s)
['10.180.43.99']
you do not really need to add line boundaries, as you're matching a very specific IP address, if your file does not have weird things like '123.23.234.10.180.43.99.21354' that you don't want to match, it should be ok!
your syntax of [3,1] is matching either 3, 1 or , and you don't want to match against a comma ;-)
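To make the comma point concrete, a tiny check (the test string is made up):

```python
import re

sloppy = re.compile(r"4[3,1]")   # character class containing '3', ',' and '1'
tidy = re.compile(r"4[31]")      # character class containing only '3' and '1'

sloppy_hits = sloppy.findall("41 43 4, 42")
tidy_hits = tidy.findall("41 43 4, 42")
```

The sloppy version also picks up "4," because inside square brackets a comma is just another character to match, not a separator.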
about your function:
r = re.compile(r'10\.180\.4[31]\.\d\d')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                o.write(match + "\n")  # one IP per line
though if I were you, I'd extract IPs using:
r = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                a, b, c, d = match.split('.')
                if int(a) < 255 and int(b) < 255 and int(c) in (43, 41) and int(d) < 100:
                    o.write(match + "\n")
or another way to do it:
r = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            m = r.match(line)
            if m:
                a, b, c, d = m.groups()
                if int(a) < 255 and int(b) < 255 and int(c) in (43, 41) and int(d) < 100:
                    # m.group(0) is the whole matched IP
                    o.write(m.group(0) + "\n")
which uses the regex to split the IP address into groups.
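Since the question is about Python 3.4, the standard-library ipaddress module is another option for the validation step. This is only a sketch: extract_ips, CANDIDATE and the 10.180.0.0/16 network are stand-ins I picked for illustration, so swap in whatever range you actually need.

```python
import ipaddress
import re

# Anything shaped like four dot-separated numbers; validity is checked below.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def extract_ips(lines, network="10.180.0.0/16"):
    """Yield valid IPv4 addresses found in lines that fall inside network."""
    net = ipaddress.ip_network(network)
    for line in lines:
        for text in CANDIDATE.findall(line):
            try:
                addr = ipaddress.ip_address(text)
            except ValueError:
                continue  # looks like an IP but an octet is out of range
            if addr in net:
                yield text


sample = ["10.180.43.99 255.255.255.0 host-a", "9.9.9.9", "10.180.999.1 bogus"]
wanted = list(extract_ips(sample))
```

Passing the open file object as lines works too, and each yielded string can be written to the output file followed by a newline.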
What you're missing is that you're doing a re.compile() which creates a Regular Expression object in Python. You're forgetting to match.
You could try:
# This isn't the best way to match IPs, but if it fits your use case keep it for now.
pattern = re.compile(r"^10\.180\.4[13]\.\d\d")
f = open("SymantecServers.txt",'r')
o = open("JustIP.txt",'w')
for line in f:
    m = pattern.match(line)
    if m is not None:
        print("Match: %s" % (m.group(0)))
        o.write(m.group(0) + "\n")
f.close()
o.close()
Which is compiling the Python object, attempting to match the line against the compiled object, and then printing out that current match. I can avoid having to split my matches, but I have to pay attention to matching groups - therefore group(0)
You can also look at re.search() which you can do, but if you're running search enough times with the same regular expression, it becomes more worthwhile to use compile.
Also note that I moved the f.close() to the outside of the for loop.

Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?

I have to convert a number of large files (up to 2GB) of EBCDIC 500 encoded files to Latin-1. Since I could only find EBCDIC to ASCII converters (dd, recode) and the files contain some additional proprietary character codes, I thought I'd write my own converter.
I have the character mapping so I'm interested in the technical aspects.
This is my approach so far:
# char mapping lookup table
EBCDIC_TO_LATIN1 = {
    0xC1:'41', # A
    0xC2:'42', # B
    # and so on...
}
BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')
buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(ebd2latin1(buffer))
    buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()
This is the function that does the converting:
def ebd2latin1(ebcdic):
    result = []
    for ch in ebcdic:
        result.append(EBCDIC_TO_LATIN1[ord(ch)])
    return ''.join(result).decode('hex')
The question is whether or not this is a sensible approach from an engineering standpoint. Does it have some serious design issues? Is the buffer size OK? And so on...
As for the "proprietary characters" that some don't believe in: Each file contains a year's worth of patent documents in SGML format. The patent office had been using EBCDIC until they switched to Unicode in 2005. So there are thousands of documents within each file. They are separated by some hex values that are not part of any IBM specification. They were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you about the length of the file. I don't really need that information, but if I want to process the file I have to deal with them.
Also:
$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'
Thanks for the help so far.
EBCDIC 500, aka Code Page 500, is among Python's encodings, although you link to cp1047, which isn't. Which one are you using, really? Anyway, this works for cp500 (or any other encoding that Python has).
from __future__ import with_statement
import sys
from contextlib import nested
BUFFER_SIZE = 16384
with nested(open(sys.argv[1], 'rb'), open(sys.argv[2], 'wb')) as (infile, outfile):
    while True:
        buffer = infile.read(BUFFER_SIZE)
        if not buffer:
            break
        outfile.write(buffer.decode('cp500').encode('latin1'))
This way you shouldn't need to keep track of the mappings yourself.
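The same loop as a Python 3 sketch, for readers who no longer have contextlib.nested (it was removed in Python 3; a plain with statement accepts several files). The function name transcode and its parameters are assumptions for this example, and BytesIO objects stand in for the real files.

```python
import io

def transcode(infile, outfile, src="cp500", dst="latin-1", buffer_size=64 * 1024):
    """Copy binary infile to outfile, re-encoding each chunk from src to dst."""
    while True:
        chunk = infile.read(buffer_size)
        if not chunk:
            break
        outfile.write(chunk.decode(src).encode(dst))


source = io.BytesIO("Unit\xe9 pare-brise".encode("cp500"))
target = io.BytesIO()
transcode(source, target, buffer_size=4)
result = target.getvalue().decode("latin-1")
```

Decoding chunk by chunk is safe here only because both encodings use exactly one byte per character. For a multi-byte source encoding a chunk boundary could split a character, and an incremental decoder would be needed instead.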
If you set up the table correctly, then you just need to do:
translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1)
where ebcdic contains EBCDIC characters and EBCDIC_TO_LATIN1 is a 256-char string which maps each EBCDIC character to its Latin-1 equivalent. The characters in EBCDIC_TO_LATIN1 are the actual binary values rather than their hex representations. For example, if you are using code page 500, the first 16 bytes of EBCDIC_TO_LATIN1 would be
'\x00\x01\x02\x03\x9C\x09\x86\x7F\x97\x8D\x8E\x0B\x0C\x0D\x0E\x0F'
using this reference.
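If typing out 256 values by hand sounds error-prone, the table can be generated from Python's own cp500 codec. A Python 3 sketch, assuming the standard codec is an acceptable starting point; any proprietary byte values would still have to be patched into the table afterwards:

```python
# Build the 256-byte table once: entry i is the Latin-1 byte for EBCDIC byte i.
EBCDIC_TO_LATIN1 = bytes(range(256)).decode("cp500").encode("latin-1", "replace")

def ebd2latin1(ebcdic):
    """Translate a bytes object from code page 500 to Latin-1 in a single call."""
    return ebcdic.translate(EBCDIC_TO_LATIN1)


converted = ebd2latin1(b"\xc8\x85\x93\x93\x96")  # EBCDIC bytes for "Hello"
```

bytes.translate() does the per-byte lookup in C, so there is no Python-level loop over the buffer.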
While this might not help the original poster anymore, some time ago I released a package for Python 2.6+ and 3.2+ that adds most of the western 8 bit mainframe codecs including CP1047 (Open Systems Latin-1) and CP1141 (German): https://pypi.python.org/pypi/ebcdic. Simply import ebcdic to add the codecs and then use open(..., encoding='cp1047') to read or write files.
Answer 1:
Yet another silly question: What gave you the impression that recode produced only ASCII as output? AFAICT it will transcode ANY of its repertoire of charsets to ANY of its repertoire, AND its repertoire includes IBM cp500 and cp1047, and OF COURSE latin1. Reading the comments, you will note that Lennart and I have discovered that there aren't any "proprietary" codes in those two IBM character sets. So you may well be able to use recode after all, once you are certain what charset you've actually got.
Answer 2:
If you really need/want to transcode IBM cp1047 via Python, you might like to firstly get the mapping from an authoritative source, processing it via script with some checks:
URL = "http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/glibc-IBM1047-2.1.2.ucm"
"""
Sample lines:
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
<U0004> \x37 |0
<U0005> \x2D |0
"""
import urllib, re
text = urllib.urlopen(URL).read()
regex = r"<U([0-9a-fA-F]{4,4})>\s+\\x([0-9a-fA-F]{2,2})\s"
results = re.findall(regex, text)
wlist = [None] * 256
for result in results:
    unum, inum = [int(x, 16) for x in result]
    assert wlist[inum] is None
    assert 0 <= unum <= 255
    wlist[inum] = chr(unum)
assert not any(x is None for x in wlist)
print repr(''.join(wlist))
Then carefully copy/paste the output into your transcoding script for use with Vinay's buffer.translate(the_mapping) idea, with a buffer size perhaps a bit larger than 16KB and certainly a bit smaller than 2GB :-)
No crystal ball, no info from OP, so had a bit of a rummage in the EPO website. Found freely downloadable weekly patent info files, still available in cp500/SGML even though website says this to be replaced by utf8/XML in 2006 :-). Got the 2009 week 27 file. Is a zip containing 2 files s350927[ab].bin. "bin" means "not XML". Got the spec! Looks possible that "proprietary codes" are actually BINARY fields. Each record has a fixed 252-byte header. First 5 bytes are record length in EBCDIC e.g. hex F0F2F2F0F8 -> 2208 bytes. Last 2 bytes of the fixed header are the BINARY length (redundant) of the following variable part. In the middle are several text fields, two 2-byte binary fields, and one 4-byte binary field. The binary fields are serial numbers within groups, but all I saw are 1. The variable part is SGML.
Example (last record from s350927b.bin):
Record number: 7266
pprint of header text and binary slices:
['EPB102055619 TXT00000001',
1,
' 20090701200927 08013627.8 EP20090528NN ',
1,
1,
' T *lots of spaces snipped*']
Edited version of the rather long SGML:
<PATDOC FILE="08013627.8" CY=EP DNUM=2055619 KIND=B1 DATE=20090701 STATUS=N>
*snip*
<B541>DE<B542>Windschutzeinheit für ein Motorrad
<B541>EN<B542>Windshield unit for saddle-ride type vehicle
<B541>FR<B542>Unité pare-brise pour motocyclette</B540>
*snip*
</PATDOC>
There are no header or trailer records, just this one record format.
So: if the OP's annual files are anything like this, we might be able to help him out.
Update: Above was the "2 a.m. in my timezone" version. Here's a bit more info:
OP said: "at the beginning of each file there are a few digits in ASCII that tell you about the length of the file." ... translate that to "at the beginning of each record there are five digits in EBCDIC that tell you exactly the length of the record" and we have a (very fuzzy) match!
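As a small illustration of the "five digits in EBCDIC" record length, here is a Python 3 sketch. The helper name record_length is made up, and the padding bytes in the demo are dummy filler, not real header content.

```python
def record_length(header):
    """Return the record length stored as five EBCDIC digits at the start of header."""
    return int(header[:5].decode("cp500"))


# hex F0F2F2F0F8 is "02208" in EBCDIC; pad to the 252-byte fixed header size
length = record_length(bytes.fromhex("F0F2F2F0F8") + b"\x00" * 247)
```

Read as ASCII or Latin-1 those five bytes are not digits at all, which may be why they looked like proprietary codes at first glance.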
Here is the URL of the documentation page: http://docs.epoline.org/ebd/info.htm
The FIRST file mentioned is the spec.
Here is the URL of the download-weekly-data page: http://ebd2.epoline.org/jsp/ebdst35.jsp
An observation: The data that I looked at is in the ST.35 series. There is also available for download ST.32 which appears to be a parallel version containing only the SGML content (in "reduced cp437/850", one tag per line). This indicates that the fields in the fixed-length header of the ST.35 records may not be very interesting, and can thus be skipped over, which would greatly simplify the transcoding task.
For what it's worth, here is my (investigatory, written after midnight) code:
[Update 2: tidied up the code a little; no functionality changes]
from pprint import pprint as pp
import sys
from struct import unpack
HDRSZ = 252
T = '>s' # text
H = '>H' # binary 2 bytes
I = '>I' # binary 4 bytes
hdr_defn = [
    6, T,
    38, H,
    40, T,
    94, I,
    98, H,
    100, T,
    251, H, # length of following SGML text
    HDRSZ + 1
    ]
# above positions as per spec, reduce to allow for counting from 1
for i in xrange(0, len(hdr_defn), 2):
    hdr_defn[i] -= 1
def records(fname, output_encoding='latin1', debug=False):
    xlator=''.join(chr(i).decode('cp500').encode(output_encoding, 'replace') for i in range(256))
    # print repr(xlator)
    def xlate(ebcdic):
        return ebcdic.translate(xlator)
        # return ebcdic.decode('cp500') # use this if unicode output desired
    f = open(fname, 'rb')
    recnum = -1
    while True:
        # get header
        buff = f.read(HDRSZ)
        if not buff:
            return # EOF
        recnum += 1
        if debug: print "\nrecnum", recnum
        assert len(buff) == HDRSZ
        recsz = int(xlate(buff[:5]))
        if debug: print "recsz", recsz
        # split remainder of header into text and binary pieces
        fields = []
        for i in xrange(0, len(hdr_defn) - 2, 2):
            ty = hdr_defn[i + 1]
            piece = buff[hdr_defn[i]:hdr_defn[i+2]]
            if ty == T:
                fields.append(xlate(piece))
            else:
                fields.append(unpack(ty, piece)[0])
        if debug: pp(fields)
        sgmlsz = fields.pop()
        if debug: print "sgmlsz: %d; expected: %d - %d = %d" % (sgmlsz, recsz, HDRSZ, recsz - HDRSZ)
        assert sgmlsz == recsz - HDRSZ
        # get sgml part
        sgml = f.read(sgmlsz)
        assert len(sgml) == sgmlsz
        sgml = xlate(sgml)
        if debug: print "sgml", sgml
        yield recnum, fields, sgml
if __name__ == "__main__":
    maxrecs = int(sys.argv[1]) # dumping out the last `maxrecs` records in the file
    fname = sys.argv[2]
    keep = [None] * maxrecs
    for recnum, fields, sgml in records(fname):
        # do something useful here
        keep[recnum % maxrecs] = (recnum, fields, sgml)
    keep.sort()
    for k in keep:
        if k:
            recnum, fields, sgml = k
            print
            print recnum
            pp(fields)
            print sgml
Assuming cp500 contains all of your "additional proprietary characters", a more concise version based on Lennart's answer using the codecs module:
import sys, codecs
BUFFER_SIZE = 64*1024
ebd_file = codecs.open(sys.argv[1], 'r', 'cp500')
latin1_file = codecs.open(sys.argv[2], 'w', 'latin1')
buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(buffer)
    buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()

str.startswith() not working as I intended

I'm trying to test for a \t or a space character and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the loc for the file, and then recording the names of each function present within the file along with their individual lines of code. The bit of code below is where I attempt to count the loc for the functions.
import re
...
    else:
        loc += 1
        for line in infile:
            line_t = line.lstrip()
            if len(line_t) > 0 \
            and not line_t.startswith('#') \
            and not line_t.startswith('"""'):
                if not line.startswith('\s'):
                    print ('line = ' + repr(line))
                    loc += 1
                    return (loc, name)
                else:
                    loc += 1
            elif line_t.startswith('"""'):
                while True:
                    if line_t.rstrip().endswith('"""'):
                        break
                    line_t = infile.readline().rstrip()
        return(loc,name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a \t, but the if statement explicitly says (or so I thought) that it should only execute with no whitespace characters present.
Here is my full test file I have been using:
def count_loc(infile):
    """ Receives a file and then returns the amount
    of actual lines of code by not counting commented
    or blank lines """
    loc = 0
    for line in infile:
        line = line.strip()
        if len(line) > 0 \
        and not line.startswith('//') \
        and not line.startswith('/*'):
            loc += 1
            func_loc, func_name = checkForFunction(line);
        elif line.startswith('/*'):
            while True:
                if line.endswith('*/'):
                    break
                line = infile.readline().rstrip()
    return loc
if __name__ == "__main__":
    print ("Hi")
Function LOC = 15
File LOC = 19
\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.
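A few ways to test for leading whitespace, side by side. This is just a sketch using the line from the question's output; the variable names are mine.

```python
import re

line = "\tloc = 0\n"

literal = line.startswith("\\s")          # looks for a backslash followed by 's'
by_tuple = line.startswith((" ", "\t"))   # startswith accepts a tuple of prefixes
by_method = line[:1].isspace()            # True for any leading whitespace character
by_regex = re.match(r"\s", line) is not None
```

Only the first check fails for this line: it is the one the question used, and it searches for two literal characters rather than for whitespace.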
Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTLR is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize
ignored_tokens = [tokenize.NEWLINE,tokenize.COMMENT,tokenize.N_TOKENS
                 ,tokenize.STRING,tokenize.ENDMARKER,tokenize.INDENT
                 ,tokenize.DEDENT,tokenize.NL]
with open('test.py', 'r') as f:
    g = tokenize.generate_tokens(f.readline)
    line_num = 0
    for a_token in g:
        if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
            line_num = a_token[2][0]
            print(a_token)
As a_token above is already parsed, you can easily check for function definition, too. You can also keep track where the function ends by looking at the current column start a_token[2][1]. If you want to do more complex things, you should use ast.
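To give an idea of the ast route, here is a rough sketch that maps each function name to the number of source lines it spans. It assumes Python 3.8 or later (for end_lineno), the helper name function_spans is invented, and note that a span includes blank lines, comments and docstrings, so it is not yet a true LOC count.

```python
import ast

def function_spans(source):
    """Return {function name: number of source lines it spans} for a module."""
    spans = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            spans[node.name] = node.end_lineno - node.lineno + 1
    return spans


sample = (
    "def count_loc(infile):\n"
    "    loc = 0\n"
    "    return loc\n"
    "\n"
    "print('Hi')\n"
)
spans = function_spans(sample)
```

Combining these spans with the line numbers collected by the tokenize loop above would give per-function counts that skip comments and docstrings.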
Your string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'
