How to efficiently parse fixed width files?


I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent a column, characters 21 to 30 another one, and so on.
Assuming that each line holds 100 characters, what would be an efficient way to parse a line into its components?
I could use string slicing per line, but it gets a little ugly if the line is long. Are there any other fast methods?

Using the Python standard library's struct module would be fairly easy as well as fairly fast, since it's written in C. The code below shows how to use it. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.
import struct
fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)
# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
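If every record in a file has exactly the same length, line terminator included, the same kind of format string can also split a whole buffer in one pass. The following is only a sketch under that assumption: the trailing x (one pad byte for a single \n) is not part of the format string above, and the sample data is made up.

```python
import struct

# '2s 10x 24s' as above, plus one pad byte ('x') for the assumed '\n' terminator.
record = struct.Struct('2s 10x 24s x')

# Two made-up records of identical length (36 data bytes + newline each).
data = b'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n' * 2

# iter_unpack() requires len(data) to be an exact multiple of record.size.
rows = [tuple(field.decode() for field in fields)
        for fields in record.iter_unpack(data)]
print(rows)
```

This would not work for files with mixed \n and \r\n endings or with short final lines; in those cases the per-line parse shown above is the safer choice.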
Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. It is kind of complicated, and speedwise it's about the same as the version based on the struct module, although I have an idea about how it could be sped up (which might make the extra complexity worthwhile). See the update below on that topic.
from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
Update
As I suspected, there is a way of making the string-slicing version of the code faster. In Python 2.7 this makes it about the same speed as the version using struct, but in Python 3.x it makes it 233% faster than both the struct version and its own un-optimized form (which is about the same speed as the struct version).
What the version presented above does is define a lambda function that's primarily a comprehension that generates the limits of a bunch of slices at runtime.
parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
Depending on the values of i and j in the for loop, this is equivalent to something like:
parse = lambda line: (line[0:2], line[12:36], line[36:51], ...)
However, the latter executes more than twice as fast since the slice boundaries are all constants.
Fortunately, it is relatively easy to convert and "compile" the former into the latter using the built-in eval() function:
def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

I'm not really sure if this is efficient, but it should be readable (as opposed to doing the slicing manually). I defined a function slices that takes a string and column lengths and returns the substrings. I made it a generator, so for really long lines it doesn't build a temporary list of substrings.
def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length
Example
In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']
In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']
In [51]: d,c,h = slices('dogcathouse', 3, 3, 5)
In [52]: d,c,h
Out[52]: ('dog', 'cat', 'house')
But I think the advantage of a generator is lost if you need all columns at once. Where it helps is when you want to process the columns one by one, say in a loop.

Two more options that are easier and prettier than already mentioned solutions:
The first is using pandas:
import pandas as pd
path = 'filename.txt'
# Inferred, as suggested in the comments by James Paul Mason
data = pd.read_fwf(path, colspecs='infer')
# Or using pandas with a column specification; each pair is a half-open
# interval [start, end), so adjacent columns share a boundary
col_specification = [(0, 20), (20, 30), (30, 50), (50, 100)]
data = pd.read_fwf(path, colspecs=col_specification)
And the second option using numpy.loadtxt:
import numpy as np
# Using NumPy with its defaults: this splits on whitespace and expects
# numeric data, so it only fits files whose columns are space-separated
data_also = np.loadtxt(path)
It really depends on how you want to use your data.

The code below gives a sketch of what you might want to do if you have some serious fixed-column-width file handling to do.
"Serious" = multiple record types in each of multiple file types, records up to 1000 bytes, the layout-definer and "opposing" producer/consumer is a government department with attitude, layout changes result in unused columns, up to a million records in a file, ...
Features: Precompiles the struct formats. Ignores unwanted columns. Converts input strings to required data types (sketch omits error handling). Converts records to object instances (or dicts, or named tuples if you prefer).
Code:
import struct, datetime, io, pprint

# functions for converting input fields to usable data
cnv_text = str.rstrip
cnv_int = int
cnv_date_dmy = lambda s: datetime.datetime.strptime(s, "%d%m%Y")  # ddmmyyyy
# etc

# field specs (field name, start pos (1-relative), len, converter func)
fieldspecs = [
    ('surname', 11, 20, cnv_text),
    ('given_names', 31, 20, cnv_text),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
]
fieldspecs.sort(key=lambda x: x[1])  # just in case

# build the format for struct.unpack
unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
print(unpack_len, unpack_fmt)
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass
    # or use named tuples

raw_data = """\
....v....1....v....2....v....3....v....4....v....5....v....6....v....7....v....8
          Featherstonehaugh   Algernon Marmaduke  31121969            01012005XX
"""

# struct works on bytes, so wrap the sample as a binary file-like object
f = io.BytesIO(raw_data.encode("ascii"))
headings = next(f)
for line in f:
    # The guts of this loop would of course be hidden away in a function/method
    # and could be made less ugly
    raw_fields = unpacker(line)
    r = Record()
    for x in field_indices:
        # decode each bytes field before handing it to its converter
        setattr(r, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x].decode("ascii")))
    pprint.pprint(r.__dict__)
    print("Customer name:", r.given_names, r.surname)
Output:
78 10x20s20s8s12x8s
{'birth_date': datetime.datetime(1969, 12, 31, 0, 0),
'given_names': 'Algernon Marmaduke',
'start_date': datetime.datetime(2005, 1, 1, 0, 0),
'surname': 'Featherstonehaugh'}
Customer name: Algernon Marmaduke Featherstonehaugh

> str = '1234567890'
> w = [0,2,5,7,10]
> [ str[ w[i-1] : w[i] ] for i in range(1,len(w)) ]
['12', '345', '67', '890']

This is how I solved it, with a dictionary that records where each field starts and ends. Giving explicit start and end points also helped me manage changes in column lengths.
# fixed length
#      '---------- ------- ----------- -----------'
line = '20.06.2019 myname  active      mydevice   '
SLICES = {'date_start': 0,
          'date_end': 10,
          'name_start': 11,
          'name_end': 18,
          'status_start': 19,
          'status_end': 30,
          'device_start': 31,
          'device_end': 42}

def get_values_as_dict(line, SLICES):
    values = {}
    key_list = {key.split("_")[0] for key in SLICES.keys()}
    for key in key_list:
        values[key] = line[SLICES[key+"_start"]:SLICES[key+"_end"]].strip()
    return values

>>> print (get_values_as_dict(line,SLICES))
{'status': 'active', 'name': 'myname', 'date': '20.06.2019', 'device': 'mydevice'}
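A variation on the same idea, offered as a sketch rather than as part of the answer above: store one slice object per field instead of separate _start/_end keys. The names FIELDS and get_values, and the way the sample line is built, are illustrative assumptions.

```python
# Hypothetical layout: one slice per field, matching the offsets used above.
FIELDS = {
    'date': slice(0, 10),
    'name': slice(11, 18),
    'status': slice(19, 30),
    'device': slice(31, 42),
}

def get_values(line, fields=FIELDS):
    """Return a dict mapping each field name to its stripped column text."""
    return {name: line[span].strip() for name, span in fields.items()}

# Build a sample line by padding each value to its column width (10, 7, 11, 11)
# and joining the columns with the single separator space of the layout.
sample = ' '.join([
    '20.06.2019'.ljust(10),
    'myname'.ljust(7),
    'active'.ljust(11),
    'mydevice'.ljust(11),
])
print(get_values(sample))
```

Because dicts keep insertion order in Python 3.7+, the result comes back in layout order, which the set-based key_list above does not guarantee.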

Here's a simple module for Python 3, based on John Machin's answer - adapt as needed :)
"""
fixedwidth

Parse and iterate through a fixedwidth text file, returning record objects.

Adapted from https://stackoverflow.com/a/4916375/243392


USAGE

    import fixedwidth, pprint

    # define the fixed width fields we want
    # fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define the fieldtype conversion functions
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': int,
    }

    # iterate over record objects in the file
    # (path is the name of your data file)
    with open(path, 'rb') as f:
        for record in fixedwidth.reader(f, fieldspecs, fieldtype_fns):
            pprint.pprint(record.__dict__)

    # output:
    {'FILEID': 'SF1ST', 'LOGRECNO': 2, 'POP100': 1, 'STUSAB': 'TX', 'SUMLEV': '040'}
    {'FILEID': 'SF1ST', 'LOGRECNO': 3, 'POP100': 2, 'STUSAB': 'TX', 'SUMLEV': '040'}
    ...

"""
import struct, io


# fieldspec columns
iName, iDescription, iStart, iWidth, iType = range(5)


def get_struct_unpacker(fieldspecs):
    """
    Build the format string for struct.unpack to use, based on the fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    Returns a string like "6s2s3s7x7s4x9s".
    """
    unpack_len = 0
    unpack_fmt = ""
    for fieldspec in fieldspecs:
        start = fieldspec[iStart] - 1
        end = start + fieldspec[iWidth]
        if start > unpack_len:
            unpack_fmt += str(start - unpack_len) + "x"
        unpack_fmt += str(end - start) + "s"
        unpack_len = end
    struct_unpacker = struct.Struct(unpack_fmt).unpack_from
    return struct_unpacker


class Record(object):
    pass
    # or use named tuples


def reader(f, fieldspecs, fieldtype_fns):
    """
    Wrap a fixedwidth file and return records according to the given fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldtype_fns is a dictionary of functions used to transform the raw string values,
    one for each type.
    """

    # make sure fieldspecs are sorted properly
    fieldspecs.sort(key=lambda fieldspec: fieldspec[iStart])

    struct_unpacker = get_struct_unpacker(fieldspecs)

    field_indices = range(len(fieldspecs))

    for line in f:
        raw_fields = struct_unpacker(line)  # split line into field values
        record = Record()
        for i in field_indices:
            fieldspec = fieldspecs[i]
            fieldname = fieldspec[iName]
            s = raw_fields[i].decode()  # convert raw bytes to a string
            fn = fieldtype_fns[fieldspec[iType]]  # get conversion function
            value = fn(s)  # convert string to value (eg to an int)
            setattr(record, fieldname, value)
        yield record


if __name__=='__main__':

    # test module

    import pprint, io

    # define the fields we want
    # fieldspecs are [name, description, start, width, type]
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define a conversion function for integers
    def to_int(s):
        """
        Convert a numeric string to an integer.
        Allows a leading ! as an indicator of missing or uncertain data.
        Returns None if no data.
        """
        try:
            return int(s)
        except ValueError:
            try:
                return int(s[1:])  # ignore a leading !
            except ValueError:
                return None  # assume has a leading ! and no value

    # define the conversion fns
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': to_int,
        # 'N': int,
        # 'D': lambda s: datetime.datetime.strptime(s, "%d%m%Y"), # ddmmyyyy
        # etc
    }

    # define a fixedwidth sample
    # (every record must be a full 38 columns wide, so the third line
    # keeps its trailing spaces)
    sample = """\
SF1ST TX04089000  00000023748        1
SF1ST TX04090000  00000033748!       2
SF1ST TX04091000  00000043748!        
"""
    sample_data = sample.encode()  # convert string to bytes
    file_like = io.BytesIO(sample_data)  # create a file-like wrapper around bytes

    # iterate over record objects in the file
    for record in reader(file_like, fieldspecs, fieldtype_fns):
        # print(record)
        pprint.pprint(record.__dict__)

Here is what NumPy uses under the hood (much much simplified, but still - this code is found in the LineSplitter class within the _iotools module):
import numpy as np
DELIMITER = (20, 10, 10, 20, 10, 10, 20)
idx = np.cumsum([0] + list(DELIMITER))
slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
def parse(line):
    return [line[s] for s in slices]
It does not handle negative delimiters for skipping columns, so it is not as versatile as struct, but it is faster.

Because my previous job often involved files with a million lines of fixed-width data, I researched this issue when I started using Python.
There are two types of fixed width:
ASCII FixedWidth (ascii character length = 1, double-byte encoded character length = 2)
Unicode FixedWidth (ascii character & double-byte encoded character length = 1)
If the source string consists entirely of ASCII characters, then ASCII FixedWidth = Unicode FixedWidth.
Fortunately, str and bytes are different types in Python 3, which removes a lot of confusion when dealing with double-byte encoded characters (e.g. gbk, big5, euc-jp, shift-jis, etc.).
To process "ASCII FixedWidth" data, the string is usually converted to bytes and then split.
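To make the distinction concrete, here is a small sketch with a made-up six-character string. It assumes GBK encoding, where each CJK character takes two bytes; split_fixed is a hypothetical helper, not something from the benchmark code further down.

```python
text = 'ab中文cd'            # made-up sample: 2 ASCII, 2 CJK, 2 ASCII characters
raw = text.encode('gbk')    # each CJK character becomes 2 bytes in GBK

def split_fixed(data, widths):
    """Cut a str or bytes object into consecutive pieces of the given widths."""
    pieces, pos = [], 0
    for width in widths:
        pieces.append(data[pos:pos + width])
        pos += width
    return pieces

# "Unicode FixedWidth": widths count characters, so slice the str.
by_chars = split_fixed(text, (2, 2, 2))
# "ASCII FixedWidth": widths count bytes, so slice the bytes, then decode.
by_bytes = [piece.decode('gbk') for piece in split_fixed(raw, (2, 4, 2))]
print(len(text), len(raw), by_chars, by_bytes)
```

The same middle field is 2 wide in one convention and 4 wide in the other, which is why the width tuples in the code below come in two variants.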
Without importing third-party modules, and with totalLineCount = 1 million, lineLength = 800 bytes, FixedWidthArgs = (10, 25, 4, ...), I split each line in about 5 ways and reached the following conclusions:
struct is the fastest (1x)
A plain loop with no pre-processing of FixedWidthArgs is the slowest (5x+)
slice(bytes) is faster than slice(string)
With a bytes source, the results were: struct (1x), operator.itemgetter (1.7x), precompiled slice objects & list comprehensions (2.8x), re.Pattern object (2.9x)
When dealing with large files, we often use with open(file, "rb") as f:.
Simply iterating over one of the above files this way takes about 2.4 seconds.
I think an appropriate handler, processing 1 million rows and splitting each row into 20 fields, should take less than 2.4 seconds.
I found that only struct and itemgetter meet that requirement.
PS: For normal display, I converted the unicode str to bytes.
If you are in a double-byte environment, you don't need to do this.
from itertools import accumulate
from operator import itemgetter

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    # Negative parameter field index
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    # Get slice args and Ignore fields of negative length
    ig_Args = tuple(item for i, item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    # Generate `operator.itemgetter` object
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

lineb = b'abcdefghijklmnopqrstuvwxyz\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4\xb6\xee\xb7\xa2\xb8\xf6\xba\xcd0123456789'
line = lineb.decode("GBK")
# Unicode Fixed Width
fieldwidthsU = (13, -13, 4, -4, 5,-5) # Negative width fields is ignored
# ASCII Fixed Width
fieldwidths = (13, -13, 8, -8, 5,-5) # Negative width fields is ignored
# Unicode FixedWidth processing
parse = oprt_parser(fieldwidthsU)
fields = parse(line)
print('Unicode FixedWidth','fields: {}'.format(tuple(map(lambda s: s.encode("GBK"), fields))))
# ASCII FixedWidth processing
parse = oprt_parser(fieldwidths)
fields = parse(lineb)
print('ASCII FixedWidth','fields: {}'.format(fields))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)
parse = oprt_parser(fieldwidths)
fields = parse(line)
print(f"fields: {fields}")
Output:
Unicode FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
ASCII FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
oprt_parser is 4x make_parser(list comprehensions + slice)
During the research, I found that the re method seems to gain more than the others as CPU speed increases.
Since I don't have more or better computers to test on, here is my test code; if anyone is interested, you can run it on a faster computer.
Run Environment:
os:win10
python: 3.7.2
CPU:amd athlon x3 450
HD:seagate 1T
import timeit
import time
import re
from itertools import accumulate
from operator import itemgetter

def eff2(stmt,onlyNum= False,showResult=False):
    '''test function'''
    if onlyNum:
        rl = timeit.repeat(stmt=stmt,repeat=roundI,number=timesI,globals=globals())
        avg = sum(rl) / len(rl)
        return f"{avg * (10 ** 6)/timesI:0.4f}"
    else:
        rl = timeit.repeat(stmt=stmt,repeat=10,number=1000,globals=globals())
        avg = sum(rl) / len(rl)
        print(f"【{stmt}】")
        print(f"\tquick avg = {avg * (10 ** 6)/1000:0.4f} s/million")
        if showResult:
            print(f"\t Result = {eval(stmt)}\n\t timelist = {rl}\n")
        else:
            print("")

def upDouble(argList,argRate):
    return [c*argRate for c in argList]

tbStr = "000000001111000002222真2233333333000000004444444QAZ55555555000000006666666ABC这些事中文字abcdefghijk"
tbBytes = tbStr.encode("GBK")
a20 = (4,4,2,2,2,3,2,2, 2 ,2,8,8,7,3,8,8,7,3, 12 ,11)
a20U = (4,4,2,2,2,3,2,2, 1 ,2,8,8,7,3,8,8,7,3, 6 ,11)
Slng = 800
rateS = Slng // 100
tStr = "".join(upDouble(tbStr , rateS))
tBytes = tStr.encode("GBK")
spltArgs = upDouble( a20 , rateS)
spltArgsU = upDouble( a20U , rateS)
testList = []
timesI = 100000
roundI = 5
print(f"test round = {roundI} timesI = {timesI} sourceLng = {len(tStr)} argFieldCount = {len(spltArgs)}")

print(f"pure str \n{''.ljust(60,'-')}")
# ==========================================
def str_parser(sArgs):
    def prsr(oStr):
        r = []
        r_ap = r.append
        stt=0
        for lng in sArgs:
            end = stt + lng
            r_ap(oStr[stt:end])
            stt = end
        return tuple(r)
    return prsr

Str_P = str_parser(spltArgsU)
# eff2("Str_P(tStr)")
testList.append("Str_P(tStr)")

print(f"pure bytes \n{''.ljust(60,'-')}")
# ==========================================
def byte_parser(sArgs):
    def prsr(oBytes):
        r, stt = [], 0
        r_ap = r.append
        for lng in sArgs:
            end = stt + lng
            r_ap(oBytes[stt:end])
            stt = end
        return r
    return prsr

Byte_P = byte_parser(spltArgs)
# eff2("Byte_P(tBytes)")
testList.append("Byte_P(tBytes)")

# re,bytes
print(f"re compile object \n{''.ljust(60,'-')}")
# ==========================================
def rebc_parser(sArgs,otype="b"):
    re_Args = "".join([f"(.{{{n}}})" for n in sArgs])
    if otype == "b":
        rebc_Args = re.compile(re_Args.encode("GBK"))
    else:
        rebc_Args = re.compile(re_Args)
    def prsr(oBS):
        return rebc_Args.match(oBS).groups()
    return prsr

Rebc_P = rebc_parser(spltArgs)
# eff2("Rebc_P(tBytes)")
testList.append("Rebc_P(tBytes)")
Rebc_Ps = rebc_parser(spltArgsU,"s")
# eff2("Rebc_Ps(tStr)")
testList.append("Rebc_Ps(tStr)")

print(f"struct \n{''.ljust(60,'-')}")
# ==========================================
import struct
def struct_parser(sArgs):
    struct_Args = " ".join(map(lambda x: str(x) + "s", sArgs))
    def prsr(oBytes):
        return struct.unpack(struct_Args, oBytes)
    return prsr

Struct_P = struct_parser(spltArgs)
# eff2("Struct_P(tBytes)")
testList.append("Struct_P(tBytes)")

print(f"List Comprehensions + slice \n{''.ljust(60,'-')}")
# ==========================================
import itertools
def slice_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    slice_Args = tuple(zip((0,)+tl,tl))
    def prsr(oBytes):
        return [oBytes[s:e] for s, e in slice_Args]
    return prsr

Slice_P = slice_parser(spltArgs)
# eff2("Slice_P(tBytes)")
testList.append("Slice_P(tBytes)")

def sliceObj_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    tl2 = tuple(zip((0,)+tl,tl))
    sliceObj_Args = tuple(slice(s,e) for s,e in tl2)
    def prsr(oBytes):
        return [oBytes[so] for so in sliceObj_Args]
    return prsr

SliceObj_P = sliceObj_parser(spltArgs)
# eff2("SliceObj_P(tBytes)")
testList.append("SliceObj_P(tBytes)")
SliceObj_Ps = sliceObj_parser(spltArgsU)
# eff2("SliceObj_Ps(tStr)")
testList.append("SliceObj_Ps(tStr)")

print(f"operator.itemgetter + slice object \n{''.ljust(60,'-')}")
# ==========================================
def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    ig_Args = tuple(item for i,item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

Oprt_P = oprt_parser(spltArgs)
# eff2("Oprt_P(tBytes)")
testList.append("Oprt_P(tBytes)")
Oprt_Ps = oprt_parser(spltArgsU)
# eff2("Oprt_Ps(tStr)")
testList.append("Oprt_Ps(tStr)")

print("|".join([s.split("(")[0].center(11," ") for s in testList]))
print("|".join(["".center(11,"-") for s in testList]))
print("|".join([eff2(s,True).rjust(11," ") for s in testList]))
Output:
Test round = 5 timesI = 100000 sourceLng = 744 argFieldCount = 20
...
...
   Str_P   |   Byte_P  |   Rebc_P  |  Rebc_Ps  |  Struct_P |  Slice_P  | SliceObj_P|SliceObj_Ps|   Oprt_P  |  Oprt_Ps
-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------
     9.6315|     7.5952|     4.4187|     5.6867|     1.5123|     5.2915|     4.2673|     5.7121|     2.4713|     3.9051

String slicing doesn't have to be ugly as long as you keep it organized. Consider storing your field widths in a dictionary and then using the associated names to create an object:
from collections import OrderedDict

class Entry:
    def __init__(self, line):
        name2width = OrderedDict()
        name2width['foo'] = 2
        name2width['bar'] = 3
        name2width['baz'] = 2
        pos = 0
        for name, width in name2width.items():
            val = line[pos : pos + width]
            if len(val) != width:
                raise ValueError("not enough characters: '{}'".format(line))
            setattr(self, name, val)
            pos += width

file = "ab789yz\ncd987wx\nef555uv"
entry = []
for line in file.split('\n'):
    entry.append(Entry(line))
print(entry[1].bar)  # output: 987

I like to process text files containing fixed width fields using regular expressions. More specifically, using named capture groups. It's fast, does not require importing large libraries and is quite descriptive and convenient (in my opinion).
I also like the fact that the named capture groups are basically auto-documenting the data format, acting as a sort of data specification, since each capture group can be written to define each fields' name, data type and length.
Here's a simple example...
import re

data = [
    "1234ABCDEFGHIJ5",
    "6789KLMNOPQRST0"
]

record_regex = (
    r"^"
    r"(?P<firstnumbers>[0-9]{4})"
    r"(?P<middletext>[a-zA-Z0-9_\-\s]{10})"
    r"(?P<lastnumber>[0-9]{1})"
    r"$"
)

records = []
for line in data:
    match = re.match(record_regex, line)
    if match:
        records.append(match.groupdict())
print(records)
...that yields a convenient dictionary of each record:
[
{'firstnumbers': '1234', 'lastnumber': '5', 'middletext': 'ABCDEFGHIJ'},
{'firstnumbers': '6789', 'lastnumber': '0', 'middletext': 'KLMNOPQRST'}
]
Helpful tools, like the online regex tester and debugger, are available if you are not familiar (or comfortable) with Python regular expressions or named capture groups.
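When the layout already exists as a list of names and widths, the pattern can be generated instead of typed by hand. This is a rough sketch, not part of the answer above: FIELDS and build_record_regex are hypothetical names, and every field is matched with . (any character), so the per-field character classes shown earlier are lost.

```python
import re

# Hypothetical layout: (field name, width in characters), in column order.
FIELDS = [('firstnumbers', 4), ('middletext', 10), ('lastnumber', 1)]

def build_record_regex(fields):
    """Compile one named group of exactly `width` characters per field."""
    parts = ''.join('(?P<{}>.{{{}}})'.format(name, width) for name, width in fields)
    return re.compile(parts)

record_re = build_record_regex(FIELDS)
# fullmatch() rejects lines that are shorter or longer than the layout.
match = record_re.fullmatch('1234ABCDEFGHIJ5')
print(match.groupdict())
```

Compiling the pattern once and reusing it avoids re-parsing the expression for every line of a large file.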

Related

Output is incorrect (list not sorted)

The program takes as input a data set of orders in the format below, where id, t_selection and t_shipping are of type unsigned int, n is the number of orders, and each separator is followed by a space character:
id_1, t_selection_1, t_shipping_1; ...; id_n, t_selection_n, t_shipping_n \n
The expected output is a space-separated list of the ids, sorted by t_selection + t_shipping
and terminated by a newline \n.
Input: 1, 500, 100; 2, 700, 100; 3, 100, 100; 4, 50, 50\n
Output: 4 3 1 2\n
My output, however, is:
output: 4 1 2 3
Could somebody help me fix this? Thanks in advance. My code is below; it contains some annotations from my teacher, don't mind them.
#!/usr/bin/env python3
import sys

class Order:
    def __init__(self, id: int, selection_time: int, shipping_time: int):
        self.id: int = id
        self.selection_time: int = selection_time
        self.shipping_time: int = shipping_time
        '''
        Remove me if you don't need me.
        Add a method to assign to me.
        '''
        self.next: Order = None

    '''
    Make your life easier and your code prettier, use `Operator Overloading`.
    '''

def sort(data):
    sorted_order = selection_t + shipping_t
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if sorted_order[i] > sorted_order[j]:
                data[i], data[j] = data[j], data[i]
    return data

if __name__ == '__main__':
    '''
    Retrieves and splits the input
    '''
    data = input()
    data = data.split('; ')
    for d in data:
        id, selection_t, shipping_t = d.split(', ', 2)
        order: Order = Order(int(id), int(selection_t), int(shipping_t))
    sort(data)
    for order.id in data:
        sys.stdout.write(order.id[0])
        sys.stdout.write(" ")

As pointed out by Matt, you're not actually using the Order class. The funny thing about classes is that they have a number of so-called magic methods (the naming is very apt).
The magic method that is useful to you in this case is __lt__, which stands for "less than".
If you set up this magic method for the class, you can simply call sort on a list containing only instances of that class.
If you do not want to use this magic method, the other option is to use a lambda to tell the sort method how to sort the list.
(Also, I removed the sys.stdout calls and replaced them with the standard print.)
#!/usr/bin/env python3

class Order:
    def __init__(self, id: int, selection_time: int, shipping_time: int):
        self.id: int = id
        self.selection_time: int = selection_time
        self.shipping_time: int = shipping_time
        self.sort_value: int = shipping_time + selection_time

    def __lt__(self, other) -> bool:
        return self.sort_value < other.sort_value

if __name__ == "__main__":
    data = "1, 500, 100; 2, 700, 100; 3, 100, 100; 4, 50, 50"
    data = data.split("; ")
    order_list = []
    order_list_lambda = []
    for d in data:
        id, selection_t, shipping_t = [int(s) for s in d.split(", ")]
        order: Order = Order(id, selection_t, shipping_t)
        order_list.append(order)
        order_list_lambda.append(order)
    print("using __lt__ class magic method")
    order_list.sort()
    for order in order_list:
        print(order.id)
    print("-----")
    print("using lambda")
    order_list_lambda.sort(key=lambda x: x.sort_value)
    for order in order_list_lambda:
        print(order.id)
output
using __lt__ class magic method
4
3
1
2
-----
using lambda
4
3
1
2
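To produce the single space-separated line that the question asks for, the same key-based sort can be wrapped in a small function. This is a sketch under the assumption that the input always follows the "id, t_selection, t_shipping; ..." layout; order_ids is a hypothetical helper name and it does no error handling.

```python
def order_ids(line: str) -> str:
    """Return the order ids sorted by selection time + shipping time."""
    orders = []
    for chunk in line.strip().split(';'):
        # int() tolerates the space that follows each comma and semicolon.
        order_id, t_selection, t_shipping = (int(part) for part in chunk.split(','))
        orders.append((t_selection + t_shipping, order_id))
    # Sort on the total time only; list.sort() is stable, so ties keep input order.
    orders.sort(key=lambda pair: pair[0])
    return ' '.join(str(order_id) for _, order_id in orders)

print(order_ids('1, 500, 100; 2, 700, 100; 3, 100, 100; 4, 50, 50\n'))
```

For the sample input the totals are 600, 800, 200 and 100, which gives the expected order 4 3 1 2.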

Convert a buffer representing a list of little-endian ints into a Python class

I'm trying to extract data from a buffer that is represented as a string.
Example:
Given:
str = "0004000001000000020000000A000000"
class MyData:
    length
    some_data
    array_data
    buf_data
data = parse(str)
Expected:
length=1024, some_data=1, array_data=[2,10], buf_data="000000020000010"
Explanation:
length=1024 because the 8 digits "00040000" represent a hex number in little endian,
and the rest follows the same idea:
"00040000 01000000 02000000 0A000000"
1024, 1, 2, 10
Any ideas?
I have a solution, but it's too messy and isn't easy to maintain.

This is one way to do it:
class MyData:
    mmap = [16**1, 16**0, 16**3, 16**2, 16**5, 16**4, 16**7, 16**6]

    def __init__(self, buffer):
        self.buffer = buffer
        self.integers = []

    def get_integers(self):
        if len(self.integers) == 0:
            for i in range(0, len(self.buffer), 8):
                a = 0
                for x, y in zip(self.buffer[i:i+8], self.mmap):
                    a += int(x, 16) * y
                self.integers.append(a)
        return self.integers

mydata = MyData('0004000001000000020000000A000000')
print(mydata.get_integers())
Output:
[1024, 1, 2, 10]
NOTE: This is specifically for 32-bit unsigned values
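For comparison, the standard library can do the same decoding. This is a sketch that assumes the buffer is a hex string whose decoded length is a multiple of 4 bytes and that every value is a 32-bit unsigned little-endian integer; parse_le_uint32 is a hypothetical name.

```python
import struct

def parse_le_uint32(hex_buffer):
    """Decode a hex string into a list of little-endian 32-bit unsigned ints."""
    raw = bytes.fromhex(hex_buffer)   # 2 hex digits per byte
    count = len(raw) // 4             # 4 bytes per 32-bit value
    # '<' selects little-endian byte order, 'I' is an unsigned 32-bit int.
    return list(struct.unpack('<{}I'.format(count), raw))

print(parse_le_uint32('0004000001000000020000000A000000'))
```

The first value could then be assigned to length, the second to some_data, and the rest to array_data.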

Is it possible to convert a really large int to a string quickly in python

I am building an encryption program which produces a massive integer. It looks something like this:
a = plaintextOrd**bigNumber
When I do
a = str(a)
it takes over 28 minutes.
Is there any way to convert an integer like this more quickly than with the built-in str() function?
The reason I need it to be a string is this function:
def divideStringIntoParts(parts,string):
    parts = int(parts)
    a = len(string)//parts
    new = []
    firstTime = True
    secondTime = True
    for i in range(parts):
        if firstTime:
            new.append(string[:a])
            firstTime = False
        elif secondTime:
            new.append(string[a:a+a])
            secondTime = False
        else:
            new.append(string[a*i:a*(i+1)])
    string2 = ""
    for i in new:
        for i in i:
            string2 += i
    if len(string2) - len(string) != 0:
        lettersNeeded = len(string) - len(string2)
        for i in range(lettersNeeded):
            new[-1] += string[len(string2) + i]
    return new

You wrote in the comments that you want to get the length of the integer in decimal format. You don't need to convert this integer to a string, you can use "common logarithm" instead:
import math
math.ceil(math.log(a, 10))
Moreover, if you know that:
a = plaintextOrd**bigNumber
then math.log(a, 10) is equal to math.log(plaintextOrd, 10) * bigNumber, which shouldn't take more than a few milliseconds to calculate:
>>> plaintextOrd = 12345
>>> bigNumber = 67890
>>> a = plaintextOrd**bigNumber
>>> len(str(a))
277772
>>> import math
>>> math.ceil(math.log(a, 10))
277772
>>> math.ceil(math.log(plaintextOrd, 10) * bigNumber)
277772
It should work even if a wouldn't fit on your hard drive:
>>> math.ceil(math.log(123456789, 10) * 123456789012345678901234567890)
998952457326621672529828249600
As mentioned by @kaya3, Python standard floats aren't precise enough to describe the exact length of such a large number.
You could use mpmath (arbitrary-precision floating-point arithmetic) to get results with the desired precision:
>>> from mpmath import mp
>>> mp.dps = 1000
>>> mp.ceil(mp.log(123456789, 10) * mp.mpf('123456789012345678901234567890'))
mpf('998952457326621684655868656199.0')
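One caveat with math.ceil(math.log(a, 10)): for an exact power of ten such as 1000 the logarithm is 3 while the number has 4 digits, since the digit count is floor(log10(a)) + 1. Below is a sketch that takes the float estimate and then corrects it with exact integer comparisons. decimal_length is a hypothetical helper; it assumes the number is small enough that computing 10**digits is affordable and that the float estimate is off by at most one.

```python
import math

def decimal_length(n: int) -> int:
    """Number of decimal digits of a positive int, without calling str()."""
    if n <= 0:
        raise ValueError("n must be positive")
    # math.log10 accepts arbitrarily large ints; the result is only a float estimate.
    estimate = math.floor(math.log10(n)) + 1
    # Correct a possible off-by-one using exact integer arithmetic.
    if 10 ** (estimate - 1) > n:
        estimate -= 1
    elif 10 ** estimate <= n:
        estimate += 1
    return estimate

print(decimal_length(12345 ** 67890))
```

For numbers too large to build 10**digits, the mpmath approach above remains the way to go.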

Some quick notes on the "I need it for this function".
You don't need the first/second logic:
[:a] == [a*0:a*(0+1)]
[a:a+a] == [a*1:a*(1+1)]
So we have
new = []
for i in range(parts):
    new.append(string[a*i:a*(i+1)])
or just new = [string[a*i:a*(i+1)] for i in range(parts)].
Note that you have silently discarded the last len(string) % parts characters.
In your second loop, you shadow i with for i in i, which happens to work but is awkward and dangerous. It can also be replaced with string2 = ''.join(new), which means you can just do string2 = string[:-(len(string) % parts)].
You then see if the strings are the same length, and then add the extra letters to the end of the last list. This is a little surprising, e.g. you would have
>>> divideStringIntoParts(3, '0123456789a')
['012', '345', '6789a']
Most algorithms, by contrast, would produce something that favors even distributions and earlier elements, e.g.:
>>> divideStringIntoParts(3, '0123456789a')
['0123', '4567', '89a']
Regardless of this, we see that you don't really care about the value of the string at all here, just how many digits it has. Thus you could rewrite your function as follows.
import math

def divide_number_into_parts(number, parts):
    '''
    >>> divide_number_into_parts(12345678901, 3)
    [123, 456, 78901]
    '''
    total_digits = math.ceil(math.log(number + 1, 10))
    part_digits = total_digits // parts
    extra_digits = total_digits % parts
    remaining = number
    results = []
    for i in range(parts):
        to_take = part_digits
        if i == 0:
            to_take += extra_digits
        digits, remaining = take_digits(remaining, to_take)
        results.append(digits)
    # Reverse results, since we go from the end to the beginning
    return results[::-1]

def take_digits(number, digits):
    '''
    Removes the last <digits> digits from number.
    Returns those digits along with the remainder, e.g.:
    >>> take_digits(12345, 2)
    (45, 123)
    '''
    mod = 10 ** digits
    return number % mod, number // mod
This should be very fast, since it avoids strings altogether. You can change it to strings at the end if you'd like, which may or may not benefit from the other answers here, depending on your chunk sizes.
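If you do convert the parts back to strings, keep in mind that integer chunks lose their leading zeros, so every chunk except the most significant one needs zero-padding. A minimal sketch, assuming a non-negative number; int_to_str_chunked and its chunk_digits parameter are hypothetical names introduced for illustration, and no claim is made here about its speed relative to str():

```python
def int_to_str_chunked(number, chunk_digits=4):
    """Build the decimal string of a non-negative int from fixed-size digit chunks."""
    if number == 0:
        return "0"
    mod = 10 ** chunk_digits
    pieces = []
    while number:
        # Peel off the lowest chunk_digits digits as an integer.
        number, low = divmod(number, mod)
        pieces.append(low)
    # The most significant chunk is printed as-is; the rest are zero-padded.
    head = str(pieces[-1])
    tail = ''.join(str(p).zfill(chunk_digits) for p in reversed(pieces[:-1]))
    return head + tail

print(int_to_str_chunked(1000456, 3))  # 1000456
```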

GMPY2 provides a faster int-to-str conversion than Python's built-in str():
Source of Example Below
import time
from gmpy2 import mpz
# Test number (Large)
x = 123456789**12345
# int to str using Python str()
start = time.time()
python_str = str(x)
end = time.time()
print('str conversion time {0:.4f} seconds'.format(end - start))
# int to str using GMPY2 module
start = time.time()
r = mpz(x)
gmpy2_str = r.digits()
end = time.time()
print('GMPY2 conversion time {0:.4f} seconds'.format(end - start))
print('Length of 123456789**12345 is: {:,}'.format(len(python_str)))
print('str result == GMPY2 result {}'.format(python_str==gmpy2_str))
Results (GMPY2 was 12 times faster in test)
str conversion time 0.3820 seconds
GMPY2 conversion time 0.0310 seconds
Length of 123456789**12345 is: 99,890
str result == GMPY2 result True

python - unique set of ranges, merging when needed

Is there a data structure that will maintain a unique set of ranges, merging any contiguous or overlapping ranges that are added? I need to track which ranges have been processed, but this may occur in an arbitrary order. E.g.:
range_set = RangeSet() # doesn't exist that I know of, this is what I need help with
def process_data(start, end):
    global range_set
    range_set.add_range(start, end)
    # ...
process_data(0, 10)
process_data(20, 30)
process_data(5, 15)
process_data(50, 60)
print(range_set.missing_ranges())
# [[16,19], [31, 49]]
print(range_set.ranges())
# [[0,15], [20,30], [50, 60]]
Notice that overlapping or contiguous ranges get merged together. What is the best way to do this? I looked at using the bisect module, but its use didn't seem terribly clear.

Another approach is based on sympy.sets.
>>> import sympy as sym
>>> a = sym.Interval(1, 2, left_open=False, right_open=False)
>>> b = sym.Interval(3, 4, left_open=False, right_open=False)
>>> domain = sym.Interval(0, 10, left_open=False, right_open=False)
>>> missing = domain - a - b
>>> missing
[0, 1) U (2, 3) U (4, 10]
>>> 2 in missing
False
>>> missing.complement(domain)
[1, 2] U [3, 4]

You could get some similar functionality with Python's built-in set data structure, supposing only integer values are valid for start and end.
>>> whole_domain = set(range(12))
>>> A = set(range(0,1))
>>> B = set(range(4,9))
>>> C = set(range(3,6)) # overlaps B: 4 and 5 are processed twice
>>> done = A | B | C
>>> print(sorted(done))
[0, 3, 4, 5, 6, 7, 8]
>>> missing = whole_domain - done
>>> print(sorted(missing))
[1, 2, 9, 10, 11]
This still lacks many 'range' features but might be sufficient.
A simple query of whether a certain range was already processed could look like this:
>>> isprocessed = [foo in done for foo in range(2,6)]
>>> print(isprocessed)
[False, True, True, True]
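To get from such a set of integers back to ranges in the shape the question asks for, consecutive values can be grouped into inclusive [start, end] pairs. A small sketch; to_ranges is a hypothetical helper name and it assumes integer values only:

```python
def to_ranges(values):
    """Group integers into sorted, inclusive [start, end] runs of consecutive values."""
    ranges = []
    for v in sorted(values):
        if ranges and v == ranges[-1][1] + 1:
            # v extends the current run by one.
            ranges[-1][1] = v
        else:
            # v starts a new run.
            ranges.append([v, v])
    return ranges

done = set(range(0, 16)) | set(range(20, 31)) | set(range(50, 61))
print(to_ranges(done))                       # [[0, 15], [20, 30], [50, 60]]
print(to_ranges(set(range(0, 61)) - done))   # [[16, 19], [31, 49]]
```

Memory use grows with the size of the domain, so this only suits fairly small integer ranges.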

I've only lightly tested it, but it sounds like you're looking for something like this. You'll need to add the methods to get the ranges and missing ranges yourself, but it should be very straightforward, as RangeSet.ranges is a list of Range objects maintained in sorted order. For a more pleasant interface you could write a convenience method that converts it to a list of 2-tuples, for example.
EDIT: I've just modified it to use less-than-or-equal comparisons for merging. Note, however, that this won't merge "adjacent" entries (e.g. it won't merge (1, 5) and (6, 10)). To do this you'd need to simply modify the condition in Range.check_merge().
import bisect

class Range(object):
    # Reduces memory usage, overkill unless you're using a lot of these.
    __slots__ = ["start", "end"]

    def __init__(self, start, end):
        """Initialise this range."""
        self.start = start
        self.end = end

    def __lt__(self, other):
        """Sort ranges by their initial item (this is what bisect compares)."""
        return self.start < other.start

    def check_merge(self, other):
        """Merge in specified range and return True iff it overlaps."""
        if other.start <= self.end and other.end >= self.start:
            self.start = min(other.start, self.start)
            self.end = max(other.end, self.end)
            return True
        return False

class RangeSet(object):
    def __init__(self):
        self.ranges = []

    def add_range(self, start, end):
        """Merge or insert the specified range as appropriate."""
        new_range = Range(start, end)
        offset = bisect.bisect_left(self.ranges, new_range)
        # Check if we can merge backwards.
        if offset > 0 and self.ranges[offset - 1].check_merge(new_range):
            new_range = self.ranges[offset - 1]
            offset -= 1
        else:
            self.ranges.insert(offset, new_range)
        # Scan for forward merges.
        check_offset = offset + 1
        while (check_offset < len(self.ranges) and
               new_range.check_merge(self.ranges[check_offset])):
            check_offset += 1
        # Remove any entries that we've just merged.
        if check_offset - offset > 1:
            self.ranges[offset+1:check_offset] = []
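The accessors left as an exercise above, together with the "adjacent" merging tweak, could be sketched independently of the class. The functions below are hypothetical and work on plain (start, end) pairs with inclusive integer bounds; they use a simple sort-and-merge pass rather than bisect, so they illustrate the merge condition, not the incremental insertion:

```python
def merge_ranges(pairs, merge_adjacent=True):
    """Merge inclusive integer (start, end) pairs; optionally join adjacent ones."""
    # With slack 1, (1, 5) and (6, 10) count as touching and are merged.
    slack = 1 if merge_adjacent else 0
    merged = []
    for start, end in sorted(pairs):
        if merged and start <= merged[-1][1] + slack:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def missing_ranges(merged):
    """Return the inclusive gaps between consecutive merged ranges."""
    return [[a_end + 1, b_start - 1]
            for (_, a_end), (b_start, _) in zip(merged, merged[1:])
            if b_start - a_end > 1]

merged = merge_ranges([(0, 10), (20, 30), (5, 15), (50, 60)])
print(merged)                  # [[0, 15], [20, 30], [50, 60]]
print(missing_ranges(merged))  # [[16, 19], [31, 49]]
```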

You have hit on a good solution in your example use case. Rather than try to maintain a set of the ranges that have been used, keep track of the ranges that haven't been used. This makes the problem pretty easy.
class RangeSet:
    def __init__(self, min, max):
        self.__gaps = [(min, max)]
        self.min = min
        self.max = max
    def add(self, lo, hi):
        new_gaps = []
        for g in self.__gaps:
            for ng in (g[0],min(g[1],lo)),(max(g[0],hi),g[1]):
                if ng[1] > ng[0]: new_gaps.append(ng)
        self.__gaps = new_gaps
    def missing_ranges(self):
        return self.__gaps
    def ranges(self):
        i = iter([self.min] + [x for y in self.__gaps for x in y] + [self.max])
        return [(x,y) for x,y in zip(i,i) if y > x]
The magic is in the add method, which checks each existing gap to see whether it is affected by the new range, and adjusts the list of gaps accordingly.
Note that the behaviour of the tuples used for ranges here is the same as Python's range objects, i.e. they are inclusive of the start value and exclusive of the stop value. This class will not behave in exactly the way you described in your question, where your ranges seem to be inclusive of both.
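To get the inclusive behaviour from the question with this half-open approach, one option is to add 1 to each inclusive end on the way in and subtract 1 on the way out. The sketch below re-implements the gap update as a standalone function (remove_from_gaps is a hypothetical name) so that it runs on its own; the domain 0..60 is an assumption taken from the question's example:

```python
def remove_from_gaps(gaps, lo, hi):
    """Remove the half-open range [lo, hi) from a list of half-open gaps."""
    new_gaps = []
    for g_lo, g_hi in gaps:
        # Keep whatever part of the gap lies left of lo and right of hi.
        for piece in ((g_lo, min(g_hi, lo)), (max(g_lo, hi), g_hi)):
            if piece[1] > piece[0]:
                new_gaps.append(piece)
    return new_gaps

gaps = [(0, 61)]  # half-open form of the inclusive domain 0..60
for start, end in [(0, 10), (20, 30), (5, 15), (50, 60)]:
    gaps = remove_from_gaps(gaps, start, end + 1)  # inclusive end -> exclusive stop

inclusive_gaps = [(lo, hi - 1) for lo, hi in gaps]
print(inclusive_gaps)  # [(16, 19), (31, 49)]
```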

Have a look at portion (https://pypi.org/project/portion/). I'm the maintainer of this library, and it supports disjunction of continuous intervals out of the box. It automatically simplifies adjacent and overlapping intervals.
Consider the intervals provided in your example:
>>> import portion as P
>>> i = P.closed(0, 10) | P.closed(20, 30) | P.closed(5, 15) | P.closed(50, 60)
>>> # get "used ranges"
>>> i
[0,15] | [20,30] | [50,60]
>>> # get "missing ranges"
>>> i.enclosure - i
(15,20) | (30,50)

Similar to DavidT's answer – also based on sympy's sets, but using a list of any length and addition (union) in a single operation:
import sympy
intervals = [[1,4], [6,10], [3,5], [7,8]] # pairs of left,right
print(intervals)
symintervals = [sympy.Interval(i[0],i[1], left_open=False, right_open=False) for i in intervals]
print(symintervals)
merged = sympy.Union(*symintervals) # one operation; adding to a union one by one is much slower for a large number of intervals
print(merged)
for i in merged.args: # assumes that the "merged" result is a union, not a single interval
    print(i.left, i.right) # getting bounds of merged intervals

Here's my solution:
def flatten(collection):
    subset = set()
    for elem in collection:
        to_add = elem
        to_remove = set()
        for s in subset:
            if s[0] <= to_add[0] <= s[1] or s[0] <= to_add[1] <= s[1] or (s[0] > to_add[0] and s[1] < to_add[1]):
                to_remove.add(s)
                to_add = (min(to_add[0], s[0]), max(to_add[1], s[1]))
        subset -= to_remove
        subset.add(to_add)
    return subset
range_set = {(-12, 4), (3, 20), (21, 25), (25, 30), (-13, -11), (5, 10), (-13, 20)}
print(flatten(range_set))
# {(21, 30), (-13, 20)}

How to make a random but partial shuffle in Python?

Instead of a complete shuffle, I am looking for a partial shuffle function in python.
Example : "string" must give rise to "stnrig", but not "nrsgit"
It would be better if I can define a specific "percentage" of characters that have to be rearranged.
The purpose is to test string comparison algorithms. I want to determine the "percentage of shuffle" beyond which my algorithm will mark two (shuffled) strings as completely different.
Update :
Here is my code. Improvements are welcome !
import random
percent_to_shuffle = int(input("Give the percent value to shuffle : "))
to_shuffle = list(input("Give the string to be shuffled : "))
num_of_chars_to_shuffle = int((len(to_shuffle)*percent_to_shuffle)/100)
for i in range(0,num_of_chars_to_shuffle):
    x=random.randint(0,(len(to_shuffle)-1))
    y=random.randint(0,(len(to_shuffle)-1))
    z=to_shuffle[x]
    to_shuffle[x]=to_shuffle[y]
    to_shuffle[y]=z
print(''.join(to_shuffle))

This problem is simpler than it looks, and as usual the language has the right tools to stay out of the way between you and the idea:
import random
def pashuffle(string, perc=10):
    data = list(string)
    for index, letter in enumerate(data):
        if random.randrange(0, 100) < perc/2:
            new_index = random.randrange(0, len(data))
            data[index], data[new_index] = data[new_index], data[index]
    return "".join(data)

Your problem is tricky, because there are some edge cases to think about:
Strings with repeated characters (i.e. how would you shuffle "aaaab"?)
How do you measure chained character swaps or rearranging blocks?
In any case, the metric defined to shuffle strings up to a certain percentage is likely to be the same one you are using in your algorithm to see how close they are.
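One simple candidate for such a metric is the fraction of positions whose character differs from the original (a normalized Hamming distance). This is an assumption about what "percentage of shuffle" means, not something the question defines; fraction_changed is a hypothetical helper:

```python
def fraction_changed(original, shuffled):
    """Fraction of positions at which the two equal-length strings differ."""
    if len(original) != len(shuffled):
        raise ValueError("strings must have the same length")
    if not original:
        return 0.0
    differing = sum(a != b for a, b in zip(original, shuffled))
    return differing / len(original)

print(fraction_changed('string', 'stnrig'))  # 0.5
```

Note that swapping two identical characters, as can happen in "aaaab", leaves this measure unchanged, which is exactly the first edge case listed above.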
My code to shuffle n characters:
import random
def shuffle_n(s, n):
    idx = list(range(len(s)))
    random.shuffle(idx)
    idx = idx[:n]
    mapping = dict((idx[i], idx[i-1]) for i in range(n))
    return ''.join(s[mapping.get(x,x)] for x in range(len(s)))
This basically chooses n positions at random and then exchanges each of them with the next in the list. This way it ensures that no inverse swaps are generated and exactly n characters are swapped (if there are repeated characters, bad luck).
Explained run with 'string', 3 as input:
idx is [0, 1, 2, 3, 4, 5]
we shuffle it, now it is [5, 3, 1, 4, 0, 2]
we take just the first 3 elements, now it is [5, 3, 1]
those are the characters that we are going to swap
s t r i n g
  ^   ^   ^
t (1) will be i (3)
i (3) will be g (5)
g (5) will be t (1)
the rest will remain unchanged
so we get 'sirgnt'
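The walk-through can be reproduced by fixing the chosen positions instead of drawing them at random. apply_cycle is a hypothetical helper that only isolates the mapping step of shuffle_n:

```python
def apply_cycle(s, positions):
    """Rotate the characters of s along the given positions, as shuffle_n does."""
    # Each chosen position receives the character from the previous chosen position.
    mapping = {positions[i]: positions[i - 1] for i in range(len(positions))}
    return ''.join(s[mapping.get(x, x)] for x in range(len(s)))

print(apply_cycle('string', [5, 3, 1]))  # sirgnt
```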
The bad thing about this method is that it does not generate all the possible variations, for example, it could not make 'gnrits' from 'string'. This could be fixed by making partitions of the indices to be shuffled, like this:
import random
def randparts(l):
    n = len(l)
    if n == 0: # nothing to partition
        return
    s = random.randint(0, n-1) + 1
    if s >= 2 and n - s >= 2: # the split makes two valid parts
        yield l[:s]
        for p in randparts(l[s:]):
            yield p
    else: # the split would make a single cycle
        yield l
def shuffle_n(s, n):
    idx = list(range(len(s)))
    random.shuffle(idx)
    mapping = dict((x[i], x[i-1])
                   for x in randparts(idx[:n])
                   for i in range(len(x)))
    return ''.join(s[mapping.get(x,x)] for x in range(len(s)))

import random
def partial_shuffle(a, part=0.5):
    # which characters are to be shuffled:
    idx_todo = random.sample(range(len(a)), int(len(a) * part))
    # what are the new positions of these to-be-shuffled characters:
    idx_target = idx_todo[:]
    random.shuffle(idx_target)
    # map all "normal" character positions {0:0, 1:1, 2:2, ...}
    mapper = dict((i, i) for i in range(len(a)))
    # update with all shuffles in the string: {old_pos:new_pos, old_pos:new_pos, ...}
    mapper.update(zip(idx_todo, idx_target))
    # use mapper to modify the string:
    return ''.join(a[mapper[i]] for i in range(len(a)))
for i in range(5):
    print(partial_shuffle('abcdefghijklmnopqrstuvwxyz', 0.2))
which prints, for example:
abcdefghljkvmnopqrstuxwiyz
ajcdefghitklmnopqrsbuvwxyz
abcdefhwijklmnopqrsguvtxyz
aecdubghijklmnopqrstwvfxyz
abjdefgcitklmnopqrshuvwxyz

Evil, and using a deprecated API (the cmp argument of sorted, which exists only in Python 2):
import random
# adjust constant to taste
# 0 -> no effect, 0.5 -> completely shuffled, 1.0 -> reversed
# Of course this assumes your input is already sorted ;)
''.join(sorted(
'abcdefghijklmnopqrstuvwxyz',
cmp = lambda a, b: cmp(a, b) * (-1 if random.random() < 0.2 else 1)
))
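On Python 3, where sorted no longer accepts cmp, the same trick could be sketched with functools.cmp_to_key. The names noisy_sorted, flip_probability and rng are hypothetical; the idea of randomly inverting comparisons is the one from the snippet above, and it is just as "evil", since the sort is fed an inconsistent ordering:

```python
import random
from functools import cmp_to_key

def noisy_sorted(text, flip_probability=0.2, rng=random):
    """Sort the characters of text, inverting each comparison with some probability."""
    def noisy_cmp(a, b):
        result = (a > b) - (a < b)  # classic three-way comparison: -1, 0 or 1
        return -result if rng.random() < flip_probability else result
    return ''.join(sorted(text, key=cmp_to_key(noisy_cmp)))

print(noisy_sorted('abcdefghijklmnopqrstuvwxyz', 0.2))
```

With flip_probability set to 0 the input comes back sorted, and with 1.0 every comparison is inverted, so the result is in reverse order.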

Maybe like so:
>>> import random
>>> s = 'string'
>>> shufflethis = list(s[2:])
>>> random.shuffle(shufflethis)
>>> s[:2]+''.join(shufflethis)
'stingr'
Taking from fortran's idea, I'm adding this to the collection. It's pretty fast:
def partial_shuffle(st, p=20):
    p = int(round(p/100.0*len(st)))
    idx = range(len(st))
    sample = random.sample(idx, p)
    res=str()
    samptrav = 1
    for i in range(len(st)):
        if i in sample:
            res += st[sample[-samptrav]]
            samptrav += 1
            continue
        res += st[i]
    return res
