I have to change the formatting of some text output.
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 20G 15G 4.2G 78% /
/dev/sda6 68G 39G 26G 61% /u01
/dev/sda2 30G 5.8G 22G 21% /opt
/dev/sda1 99M 19M 76M 20% /boot
tmpfs 48G 8.2G 39G 18% /dev/shm
/dev/mapper/vg3-KPGBKUP4
10T 7.6T 2.5T 76% /KPGBKUP4
I want the output as below:
20G 15G 4.2G 78%
68G 39G 26G 61%
30G 5.8G 22G 21%
99M 19M 76M 20%
48G 8.2G 39G 18%
10T 7.6T 2.5T 76%
This is what I have done, but it requires me to put the name of every partition in my script. I have to run this script on more than 25 servers, which have different partition names that keep changing. Is there a better way?
This is my current script:
import os, sys, fileinput
for line in fileinput.input('report.txt', inplace=True):
    line = line.replace("/dev/sda3 ", "")
    line = line.replace("/dev/sda6 ", "")
    line = line.replace("/dev/sda2 ", "")
    line = line.replace("/dev/sda1 ", "")
    line = line.replace("tmpfs ", "")
    line = line.replace("/dev/mapper/vg3-KPGBKUP4", "")
    line = line.replace("/KPGBKUP4", "")
    line = line.lstrip()
    sys.stdout.write(line)
Did you try something like this?
import os, sys, fileinput
for line in fileinput.input('report.txt'):
    line = line.split()
    print ' '.join(line[1:])
This works for the input you provided:
for line in open('report.txt'):
    if line.startswith('Filesystem'):
        # Skip header line
        continue
    try:
        part, size, used, avail, used_percent, mount = line.split(None)
    except ValueError:
        # Unexpected line format, skip
        continue
    print ' '.join([size,
                    used.rjust(5),
                    avail.rjust(5),
                    used_percent.rjust(5)])
A key point here is to use line.split(None) in order to split on consecutive whitespace - see the docs on str.split() for details on this behavior.
Also see str.rjust() on how to format a string to be right justified.
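For example, pasting one of the lines from the question into an interactive session (the exact column spacing does not matter to split):
>>> line = '/dev/sda3             20G   15G  4.2G  78% /'
>>> line.split(None)
['/dev/sda3', '20G', '15G', '4.2G', '78%', '/']
>>> '76M'.rjust(5)
'  76M'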
You can use regex:
import os, sys, fileinput
import re
for line in fileinput.input('report.txt', inplace=True):
    sys.stdout.write(re.sub(r'^.+?\s+', '', line.strip()) + '\n')
If there are occurrences in the file like in your first example, where the line is actually broken into two lines by a newline character (not just wrapped in the terminal), you should read the whole file into a variable and use flags=re.MULTILINE in the sub() function:
f = open('report.txt', 'r')
a = f.read()
print re.sub(r'^.+?\s+', '', a, flags=re.MULTILINE)
#!/usr/bin/env python
import re, subprocess, shlex
cmd = 'df -Th'
# execute command
output = subprocess.Popen(shlex.split(cmd), stdout = subprocess.PIPE)
# read stdout-output from command
output = output.communicate()[0]
# concat line-continuations
output = re.sub(r'\n\s+', ' ', output)
# split output to list of lines
output = output.splitlines()[1:]
# print fields right-aligned in columns of width 6
for line in output:
    print("{2:>6}{3:>6}{4:>6}{5:>6}".format(*tuple(line.split())))
I use subprocess to execute the df command instead of reading report.txt.
To format the output I used Python's Format String Syntax.
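For example, the >6 in each replacement field right-aligns that value in a column six characters wide; with sample values from a df -Th style line (fields 2-5 are Size, Used, Avail and Use%, and the ext4 type is just illustrative):
>>> "{2:>6}{3:>6}{4:>6}{5:>6}".format('/dev/sda3', 'ext4', '20G', '15G', '4.2G', '78%', '/')
'   20G   15G  4.2G   78%'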
Here's a quick and dirty solution:
first = True
for line in open('A.py'):
    # Skip the header row
    if first:
        first = False
        continue
    # For skipping improperly formatted lines, like the last one
    if len(line.split()) == 6:
        filesystem, size, used, avail, use, mount = line.split()
        print size, used, avail, use
        # or, without unpacking the fields:
        # print ' '.join(line.split()[1:5])
Related
I need to write a script that loops through the output of the command df -h and checks the Use% field: if a filesystem is 100% utilized, that line should be printed in red, and if it is 80% utilized, the line should be printed in yellow.
Output should be similar to :
Filesystem Size Used Avail Use% Mounted on
***/dev/sda1 44G 44G 44G 100% /*** -> Should be in red.
udev 8.9G 112K 8.9G 1% /dev
tmpfs 8.9G 84K 8.9G 1% /dev/shm
/dev/sdb1 6.9G 4.5G 4.5G 80% /storage/log -> Should be in yellow.
/dev/sdb2 7.9G 147M 7.4G 2% /storage/ext
/dev/sdc1 25G 173M 24G 1% /storage/artifactory
/dev/sdd1 50G 5.8G 42G 13% /storage/db
The code I have written so far is simple, but I am not quite sure how to loop over the subprocess output and check the Use% field.
def DiskSize():
    import subprocess
    dfh = subprocess.Popen(["df", "-h"], stdout=subprocess.PIPE)
    for line in iter(dfh.stdout.readline, ''):
        print line.rstrip()
The output I get is the expected:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 44G 44G 44G 100% /
tmpfs 8.9G 84K 8.9G 1% /dev/shm
/dev/sdb1 6.9G 4.5G 4.5G 80% /storage/log
/dev/sdb2 7.9G 147M 7.4G 2% /storage/ext
/dev/sdc1 25G 173M 24G 1% /storage/artifactory
/dev/sdd1 50G 5.8G 42G 13% /storage/db
Thanks for your help.
A better way of doing it would be:
import os

try:
    import psutil
except ImportError:
    ## Time to monkey patch in all the psutil-functions as if the real psutil existed.
    class disk():
        def __init__(self, size, free, percent):
            self.size = size
            self.free = free
            self.percent = percent

    class psutil():
        @staticmethod
        def disk_usage(partition):
            disk_stats = os.statvfs(partition)
            free_size = disk_stats.f_bfree * disk_stats.f_bsize
            disk_size = disk_stats.f_blocks * disk_stats.f_bsize
            # report the used percentage of the partition (psutil's .percent is also a used percentage)
            percent = 100.0 * (disk_size - free_size) / disk_size
            return disk(disk_size, free_size, percent)

print(psutil.disk_usage('/').percent)
This will use psutil if it is present, or fall back to os.statvfs on the desired partition to calculate the free space. Again, this only applies if what you're after is the percentage used on a disk/drive/partition.
If you actually need to parse and print the lines, this won't help.
But I'll leave this here as a reference in case it does help anyone passing by trying to parse df -h data :)
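For the simple case, here is a minimal sketch of the psutil route (assuming psutil is installed; disk_usage() returns a named tuple with total, used, free and percent, where percent is the used percentage). The 80/100 thresholds mirror the question:
import psutil

usage = psutil.disk_usage('/')
print(usage.percent)            # e.g. 78.0 - percentage of the partition in use
if usage.percent >= 100:
    print('full')
elif usage.percent >= 80:
    print('almost full')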
Here's a working example if you really do need to go through the trouble:
from select import epoll, EPOLLIN, EPOLLHUP
from subprocess import Popen, STDOUT, PIPE

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

handle = Popen("df -h".split(), stdout=PIPE, stderr=STDOUT)
lines = b''
while handle.poll() is None:
    lines += handle.stdout.read()

for index, line in enumerate(lines.split(b'\n')):
    if index == 0:
        print(line.decode('UTF-8'))
        continue
    parts = line.split(b' ')
    if len(line) > 2:
        percentage = int(parts[-2][:-1])  # as Jean-François said, replace this line with a regexp for better accuracy
        if percentage >= 100:
            print(bcolors.FAIL + line.decode('UTF-8') + bcolors.ENDC)
            continue
        elif percentage >= 80:
            print(bcolors.WARNING + line.decode('UTF-8') + bcolors.ENDC)
            continue
    print(line.decode('UTF-8'))
This will grab all the data from df -h, iterate over each line, and skip the header from being parsed. It fetches the second-to-last item in the columns (unless your mounted paths have spaces, that is; correct for that if you need to), removes the % from that column, converts it to an integer and does the check.
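As a rough illustration of the regexp idea mentioned in the comment above, something like this could pull the Use% value out of a df line without relying on its position (the pattern is an assumption, not something from the original answer):
import re

line = '/dev/sdb1       6.9G  4.5G  4.5G  80% /storage/log'
m = re.search(r'(\d+)%\s', line)
if m:
    percentage = int(m.group(1))
    print(percentage)  # 80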
Please consider upvoting this answer if you end up using the latter of these two. I completely stole the output formatting from him :)
I would like to remove entries from a FASTA file where all nucleotides are N, but keep entries that contain a mix of ACGT and N nucleotides.
An example of the input file content:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
Hoping the output file content to be:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
Any suggestions for doing this with awk, perl, python, or something else?
Thank you!
FDS
With GNU awk
awk -v RS='#>seq[[:digit:]]+' '!/^[N\n]+$/{printf "%s",term""$0}; {term=RT}' file
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
With python, using the BioPython module:
import Bio.SeqIO

INPUT = "bio_input.fas"
OUTPUT = "bio_output.fas"

def main():
    records = Bio.SeqIO.parse(INPUT, 'fasta')
    filtered = (rec for rec in records if any(ch != 'N' for ch in rec.seq))
    Bio.SeqIO.write(filtered, OUTPUT, 'fasta')

if __name__ == "__main__":
    main()
However, note that the FASTA spec says sequence ids are supposed to start with >, not #>!
Run against
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
this produces
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
(Note, the default linewrap length is 60 chars).
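If the 60-character wrapping is a problem, Biopython's FastaWriter can change or disable it; a minimal sketch (the exact class location and keyword may differ between Biopython versions):
from Bio import SeqIO
from Bio.SeqIO.FastaIO import FastaWriter

records = (rec for rec in SeqIO.parse("bio_input.fas", "fasta")
           if any(ch != 'N' for ch in rec.seq))
with open("bio_output.fas", "w") as handle:
    # wrap=None writes each sequence on a single line; use e.g. wrap=80 to re-wrap
    FastaWriter(handle, wrap=None).write_file(records)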
So essentially blocks start with a #> marker, and you wish to remove blocks where no line contains anything other than N. One way in Python:
#! /usr/bin/env python
import fileinput, sys, re

block = []
nonN = re.compile('[^N\n]')
for line in fileinput.input():
    if line.startswith('#>'):
        if len(block) == 1 or any(map(nonN.search, block[1:])):
            sys.stdout.writelines(block)
        block = [line]
    else:
        block.append(line)
if len(block) == 1 or any(map(nonN.search, block[1:])):
    sys.stdout.writelines(block)
In Python, using regex:
#!/usr/bin/env python
import re
ff = open('test', 'r')
data = ff.read()
ff.close()
m = re.compile(r'(#>seq\d+[N\n]+)$', re.M)
f = re.sub(m, '', data)
fo = open('out', 'w')
fo.write(f)
fo.close()
and you will get in your out file:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
hope this helps.
With the shell command egrep (grep + regex):
egrep -B 1 "^[^NNNN && ^#seq]" your.fa > convert.fa
# -B 1 means print the matching row and the previous row
# ^[^NNNN && ^#seq] is intended as a pattern meaning: do not match rows that
# begin with NNNN or with #seq
So it only matches rows that begin with an ordinary A/T/G/C sequence, and prints each such row together with its previous row, which is the FASTA header.
I have some real-time packets coming in as below, stored in a file log.log, which I am reading with tail -f and parsing. But all the lines are random, with no fixed values: random IPs, random values in the data:: blocks, where each data:: is a column value. E.g. in log.log:
Other types of lines...
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.200data::10.109.0.86data::wandata::p4data::1400data::800data::end
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.201data::10.109.8.86data::landata::p4data::1400data::700data::end
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.200data::10.109.58.86data::3gdata::p4data::400data::800data::end
something.. else...
Now, how can I parse the line so that it ignores everything else and only parses lines matching this pattern:
connect data::ANYdata::ANYdata::ANYdata::ANYdata::ANYdata::ANYdata::end
Run:
$ tail -f /var/tmp/log.log | python -u /var/tmp/myparse.py
myparse.py:
import sys, time, os, subprocess
import re

def p(command):
    subprocess.Popen(command, shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

while True:
    line = sys.stdin.readline()
    if line:
        if "command:start" in line:
            print "OK - working"
            p("/var/tmp/single_thread_process.sh")
        if "connect data::" in line:
            ..
        else:
            # ^(?:\+|00)(\d+)$ Parse the 0032, 32, +32
            #match = re.search(r'^(?:\+|00)(\d+)$', line)
            #if match:
            #print "OK"

            ### NOT working ###
            match = re.search(r'^connect data::*data::*data::*data::*data::*data::*data::end$', line)
            if match:
                print "OK"
Try using:
match = re.search(r'connect data::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::end$', line)
The beginning of line anchor ^ is the first thing that's preventing the matches.
Also, * is not a wildcard in regex; it's a quantifier meaning 'zero or more of the preceding token'. You can use [^:]+ to mean 'one or more characters that are not colons'.
regex101 demo
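To illustrate, running the suggested pattern against one of the sample lines from the question (the "incomeing" typo is preserved from the original log):
import re

line = ('[TCP]: incomeing data: 91 bytes, data=connect '
        'data::10.109.0.200data::10.109.0.86data::wandata::p4data::1400data::800data::end')
match = re.search(r'connect data::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::end$', line)
print(bool(match))  # True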
I've read all of the articles I could find, and even understood a few of them, but as a Python newb I'm still a little lost and hoping for help :)
I'm working on a script to parse items of interest out of an application-specific log file. Each line begins with a time stamp, which I can match, and I can define two things to identify what I want to capture: some partial content and a string that marks the end of what I want to extract.
My issue is multi-line entries: in most cases every log line is terminated with a newline, but some entries contain SQL that may have newlines within it and therefore creates new lines in the log.
So, in a simple case I may have this:
[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)
This all appears as one line which I can match with this:
re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')
However, in some cases there may be line breaks in the SQL; I still want to capture those entries (and potentially replace the line breaks with spaces). I am currently reading the file a line at a time, which obviously isn't going to work, so...
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
How would I write a multi-line RegEx that would match either the whole thing on one line or if it is spread across multiple lines?
My overall goal is to parameterize this so I can use it for extracting log entries that match different patterns of the starting string (always the start of a line), the ending string (where I want to capture to) and a value that is between them as an identifier.
Thanks in advance for any help!
Chris.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')
lines = []

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line):
                print 'Full Line Found'
                print line
                print "- Record Separator -"
            else:
                print 'Partial Line Found'
                print line
                print "- Record Separator -"

print "--- DONE ----"
Next step, for my partial line I'll continue reading until I find lineEndsWith and assemble the lines in to one block.
I'm no expert so suggestions are always welcome!
UPDATE - So I have it working, thanks to all the responses that helped direct things. I realize it isn't pretty and I need to clean up my if/elif mess and make it more efficient, but IT'S WORKING! Thanks for all the help.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')
lines = []
multiLine = False

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            # Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False

for line in lines:
    print line
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
There are two options here.
You could read the file block by block, making sure to attach any "leftover" bit at the end of each block to the start of the next one, and search each block. Of course you will have to figure out what counts as "leftover" by looking at what your data format is and what your regex can match, and in theory it's possible for multiple blocks to all count as leftover…
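A minimal sketch of that block-by-block idea, with a placeholder file name and a simplified stand-in for the real entry pattern; note that if a block contains no complete match, the leftover keeps growing, which is exactly the "what counts as leftover" problem mentioned above:
import re

compiled_re = re.compile(r'\[.*?\].*?milliseconds\)', re.DOTALL)
BLOCK_SIZE = 1024 * 1024  # read 1 MB at a time

with open('Test.log', 'r') as f:
    leftover = ''
    while True:
        block = f.read(BLOCK_SIZE)
        if not block:
            break
        chunk = leftover + block
        last_end = 0
        for match in compiled_re.finditer(chunk):
            print(match.group(0))
            last_end = match.end()
        # carry anything after the last complete match into the next block,
        # in case an entry is split across the block boundary
        leftover = chunk[last_end:]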
Or you could just mmap the file. An mmap acts like a bytes (or like a str in Python 2.x), and leaves it up to the OS to handle paging blocks in and out as necessary. Unless you're trying to deal with absolutely huge files (gigabytes in 32-bit, even more in 64-bit), this is trivial and efficient:
import mmap

with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)
In older versions of Python, mmap isn't a context manager, so you'll need to wrap contextlib.closing around it (or just use an explicit close if you prefer).
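For instance, a minimal sketch of the contextlib.closing variant for Python 2.x, where mmap objects are not context managers (the file name and the regex are placeholders):
import mmap, re
from contextlib import closing

compiled_re = re.compile(r'\[.*?\].*?milliseconds\)', re.DOTALL)

with open('bigfile', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        for match in compiled_re.finditer(m):
            print(match.group(0))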
How would I write a multi-line RegEx that would match either the whole thing on one line or of it is spread across multiple lines?
You could use the DOTALL flag, which makes the . match newlines. You could instead use the MULTILINE flag and put appropriate $ and/or ^ characters in, but that makes simple cases a lot harder, and it's rarely necessary. Here's an example with DOTALL (using a simpler regexp to make it more obvious):
>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and
(exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDT']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDT']
As you can see the second .*? matched the newline just as easily as a space.
If you're just trying to treat a newline as whitespace, you don't need either; '\s' already catches newlines.
For example:
>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']
You can read an entire file into a string and then you can use re.split to make a list of all the entries separated by times. Here's an example:
f = open(...)
allLines = ''.join(f.readlines())
entries = re.split(regex, allLines)
I'm in a little over my head on this one, so please pardon my terminology in advance.
I'm running this using Python 2.7 on Windows XP.
I found some Python code that reads a log file, does some stuff, then displays something.
What, that's not enough detail? Ok, here's a simplified version:
#!/usr/bin/python

import re
import sys

class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s*
                    .+\s+ \| (?#track)
                    \s+.+\s+ \| (?#start)
                    \s+.+\s+ \| (?#length)
                    \s+.+\s+ \| (?#start sec)
                    \s+.+\s*$ (?#end sec)
                    """, line, re.X):
            lines.next()
            break

    while True:
        line = lines.next()
        m = re.match(r"""
                    ^\s*
                    (?P<num>\d+)
                    \s*\|\s*
                    (?P<start_time>[0-9:.]+)
                    \s*\|\s*
                    (?P<length_time>[0-9:.]+)
                    \s*\|\s*
                    (?P<start_sector>\d+)
                    \s*\|\s*
                    (?P<end_sector>\d+)
                    \s*$
                    """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)

    tracknums = [int(e['num']) for e in eac]
    if range(1, num_tracks + 1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)

    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets

f = open(sys.argv[1])

mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))

print mb_toc_urlpart
The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII, although that may not be precise/accurate - for example, Notepad++ indicates it's ANSI).
However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").
I get the following error:
Traceback (most recent call last):
  File "simple.py", line 55, in <module>
    mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
  File "simple.py", line 49, in calculate_mb_toc_numbers
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range
This log works
This log breaks
I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:
type ascii.log > scrubbed.log
and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).
One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to a temporary file and then have the script run on that), but I would like to have Python "ignore" the encoding if it's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.
I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.
codecs.open() will allow you to open a file using a specific encoding, and it will produce unicode strings. You can try a few encodings, going from most likely to least likely (or the tool could just always produce UTF-16LE, but ha ha, fat chance).
Also, "Unicode In Python, Completely Demystified".
works.log appears to be encoded in ASCII:
>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True
breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:
>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
to this:
f = open(sys.argv[1], 'rb')
data = f.read()
f.close()

if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')

# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())

mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))

print mb_toc_urlpart
Python 2.x expects normal strings to be ASCII (or at least single-byte). Try this:
Put this at the top of your Python source file:
from __future__ import unicode_literals
And change all the str to unicode.
[edit]
And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.