Tricky string parsing with Python

I have a text file like this:
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
.
.
51 21 41 42
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
.
.
The four-tuples are tab delimited.
For each ID, I'm trying to build a dictionary with the ID as the key and the four-tuples (in list/tuple form) as the values for that key:
d = {31: [(1, 2, 12, 40), (2, 3, 4, 21), ...], 34: [(3, 2, 12, 40), (4, 3, 4, 21), ...]}
My string parsing knowledge is limited to looping over file.readlines() and using str.replace() and str.split() on 'ID = '. But there has to be a better way. Here are the beginnings of what I have.
file = open('text.txt', 'r')
fp = file.readlines()
B = []
for x in fp:
    # str.replace() returns a new string; the result has to be kept
    x = x.replace('\t', ',')
    x = x.replace('\n', ')')
    B.append(x)

Something like this:
ll = []
for line in fp:
    tt = tuple(int(x) for x in line.split())
    ll.append(tt)
that will produce a list of tuples to assign to the key for your dictionary.
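To tie it together with the ID lines, a rough (untested) sketch of the whole loop might look like this, assuming every block starts with an 'ID = ' line and only the tuple rows begin with a digit:
result = {}
current = None
for line in fp:
    line = line.strip()
    if line.startswith('ID = '):
        # start a new entry keyed by the ID number
        current = int(line.split('=')[1])
        result[current] = []
    elif line and line[0].isdigit() and current is not None:
        # a tab-delimited data row: convert it to a tuple of ints
        result[current].append(tuple(int(x) for x in line.split()))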

Python's great for this stuff, why not write up a 5-10 liner for it? It's kind of what the language is meant to excel at.
$ cat test
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
data = {}
for block in open('test').read().split('ID = '):
    if not block:
        continue  # skip the empty chunk before the first ID
    lines = block.split('\n')
    ID = int(lines[0])
    # lines[1:4] are the Ne/====/header lines; the data rows start at lines[4]
    tups = [list(map(int, filter(None, line.split('\t')))) for line in lines[4:]]
    data[ID] = tuple(filter(None, tups))
print(data)
# {34: ([3, 2, 12, 40], [4, 3, 4, 21]), 31: ([1, 2, 12, 40], [2, 3, 4, 21])}
The only annoying thing is all the filters; sorry, that's just the result of the empty strings left behind by extra newlines, etc. For a one-off little script, it's no biggie.
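If you'd rather avoid the filters altogether, here's an untested comprehension-based variant of the same idea that drops the empty strings up front:
data = {}
for block in open('test').read().split('ID = '):
    lines = [l for l in block.split('\n') if l]  # discard empty strings immediately
    if not lines:
        continue  # the chunk before the first ID is empty
    # lines[0] is the ID; lines[1:4] are the Ne/====/header lines
    data[int(lines[0])] = tuple([int(n) for n in row.split('\t')] for row in lines[4:])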

I think this will do the trick for you:
def parse_file(filename):
    """
    Parses an input data file containing tags of the form "ID = ##" (where ## is a
    number) followed by rows of data. Returns a dictionary where the ID numbers
    are the keys and all of the rows of data are stored as a list of tuples
    associated with the key.

    Args:
        filename (string): name of the file you want to parse
    Returns:
        my_dict (dictionary): dictionary of data with ID numbers as keys
    """
    my_dict = {}
    with open(filename, "r") as my_file:  # handles opening and closing the file
        rows = my_file.readlines()
        for row in rows:
            if "ID = " in row:
                my_key = int(row.split("ID = ")[1])  # grab the ID number
                my_list = []  # initialize a new data list for a new ID
            elif row != "\n":  # skip rows that only have a newline char
                try:  # if this fails, we don't have a valid data line
                    my_list.append(tuple(int(x) for x in row.split()))
                except ValueError:  # int() fails on header lines like "Ne = 5122"
                    my_dict[my_key] = my_list  # store the list; later appends keep mutating it
                    continue  # repeat until done with the file
    return my_dict
I made it a function so that you can call it from anywhere, just passing the filename. It makes assumptions about the file format, but if the file format is always what you showed us here, it should work for you. You would call it on your data.txt file like:
a_dictionary = parse_file("data.txt")
I tested it on the data that you gave us and it seems to work just fine after deleting the "..." rows.
Edit: I noticed one small bug. As written, it will add an empty tuple in place of a new line character ("\n") wherever that appears alone on a line. To fix this, put the try: and except: clauses inside of this:
elif row != "\n": # skips rows that only contain newline char
I added this to the full code above as well.
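For reference, calling it on the sample data above (minus the "..." rows) should produce something like:
a_dictionary = parse_file("data.txt")
print(a_dictionary)
# {31: [(1, 2, 12, 40), (2, 3, 4, 21), (51, 21, 41, 42)], 34: [(3, 2, 12, 40), (4, 3, 4, 21)]}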

Related

Find first line of text according to value in Python

How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In Linux I can get the closest line and do the processing I need using grep, sed, cut and others, but I'd like to do it in Python.
Any help will be greatly appreciated!
Thank you.
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
You can use pandas to import the data into a dataframe and then easily manipulate it. Since the value to check is not an exact match, I have converted both to strings and compared prefixes.
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])  # import the text file as a dataframe
value_to_check = 37.0459  # user defined

for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break

print(data.iloc[i - 3:i + 4, :])
Output:
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
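If you'd rather avoid the explicit loop, here is an untested sketch using pandas' vectorized string methods; idxmax() returns the position of the first True in the mask, so this assumes at least one row matches:
mask = data["latitude"].astype(str).str.startswith(str(value_to_check))
i = mask.idxmax()  # index of the first matching row
print(data.iloc[max(i - 3, 0):i + 4, :])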
A solution with iterators that keeps only the necessary lines in memory and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice

def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                # islice pulls up to `after` more lines from the still-open file
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
#  '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there are fewer lines than expected before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
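A quick illustration of the difference, using a line from the file:
line = '37.0444,-95.57054\n'
print('37.044' in line)                        # True: substring match, a false positive
print(37.044 in map(float, line.split(',')))   # False: float comparison is exact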
This solution will print the before and after elements even if there are fewer than 3 of them.
Also, I am using string matching, as the question implies that you want partial matches too, i.e. 37.0459 will match 37.04597.
search_term = '37.04462'
with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines]  # remove '\n' and split on the comma

for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break

left = 0
right = 0
for k in range(1, 4):  # because the last one is not included
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1

for i in range(index - left, index + right + 1):  # because the last one is not included
    print(lines[i][0], lines[i][1])

How can I organize case-insensitive text and the material following it?

I'm very new to Python so it'd be very appreciated if this could be explained as in-depth as possible.
If I have some text like this on a text file:
matthew : 60 kg
MaTtHew : 5 feet
mAttheW : 20 years old
maTThEw : student
MaTTHEW : dog owner
How can I make a piece of code that can write something like...
Matthew : 60 kg , 5 feet , 20 years old , student , dog owner
...by only gathering information from the text file?
from functools import reduce  # reduce lives in functools in Python 3

def test_data():
    # This is obviously the source data as a multi-line string constant.
    source = \
"""
matthew : 60 kg
MaTtHew : 5 feet
mAttheW : 20 years old
maTThEw : student
MaTTHEW : dog owner
bob : 70 kg
BoB : 6 ft
"""
    # Split on newline. This will return a list of lines like ["matthew : 60 kg", "MaTtHew : 5 feet", etc.]
    return source.split("\n")

def append_pair(d, p):
    k, v = p
    if k in d:
        d[k] = d[k] + [v]
    else:
        d[k] = [v]
    return d

if __name__ == "__main__":
    # Do a list comprehension. For every line in the test data, split on the first ":",
    # strip off leading/trailing whitespace, and convert to lowercase. This will yield
    # a list of lists, mostly key/value size-2 lists.
    pairs = [[x.strip().lower() for x in line.split(":", 1)] for line in test_data()]
    # Filter out the lists that do not have a size of 2. This will yield a list of
    # key/value pairs like: [["matthew", "60 kg"], ["matthew", "5 feet"], etc.]
    cleaned_pairs = [p for p in pairs if len(p) == 2]
    # This will iterate over the list of key/value pairs and send each to append_pair,
    # which will either append to an existing key or create a new key.
    d = reduce(append_pair, cleaned_pairs, {})
    # Now, just print out the resulting dictionary.
    for k, v in d.items():
        print("{}: {}".format(k, ", ".join(v)))
import sys

# There's a number of assumptions I have to make based on your description.
# I'll try to point those out.

# Should be self-explanatory. Something like: "C:\Users\yourname\yourfile"
path_to_file = "put_your_path_here"

# Open the file for reading. The 'r' indicates read-only.
infile = open(path_to_file, 'r')

# Read in the file line by line and strip the "invisible" endline character.
readLines = [line.strip() for line in infile]

# Make sure we close the file.
infile.close()

# An associative array. Does not use normal numerical indexing.
# Instead, in our case, we'll use a string (the name) to index into it.
# At a given name index (AKA key) we'll save the attributes about that person.
names = dict()

# Iterate through each line we read in from the file.
# Each line in this loop will be stored in the variable
# item for that iteration.
for item in readLines:
    # Assuming that your file has a strict format:
    # name : attribute
    index = item.find(':')
    # If a ':' was found, then continue.
    if index != -1:  # 'is not' compares identity, not value; use != for ints
        # Grab only the name of the person and convert the string to all lowercase.
        name = item[0:index].lower()
        # See if our associative array already has that person.
        if name in names:  # dict.has_key() is gone in Python 3; use 'in'
            # If that person has already been indexed, add the new attribute.
            # This assumes there are no duplicates, so I don't check for them.
            names[name].append(item[index+1:len(item)])
        else:
            # If that person was not in the array, then add them.
            # We're adding a list at that index to store their attributes.
            names[name] = list()
            # Append the attribute to the list.
            # The len() function tells us how long the string 'item' is.
            # Offsetting the index by 1 so we don't capture the ':'.
            names[name].append(item[index+1:len(item)])
    else:
        # There was no ':' found in the line, so skip it.
        pass

# Iterate through the keys (names) we found.
for name in names:
    # Write it to stdout. I am using this because the "print" built-in in Python
    # always ends with a new line. This way I can print the name and then
    # iterate through the attributes associated with them.
    sys.stdout.write(name + " : ")
    # Iterate through the attributes.
    for attribute in names[name]:
        sys.stdout.write(attribute + ", ")
    # End each person with a new line.
    sys.stdout.write('\r\n')
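For comparison, a more compact variant of the same idea (an untested sketch) using str.partition and dict.setdefault:
merged = {}
with open(path_to_file) as f:
    for line in f:
        name, sep, attr = line.partition(':')
        if sep:  # skip lines that contain no ':'
            merged.setdefault(name.strip().lower(), []).append(attr.strip())
for name, attrs in merged.items():
    print(name.capitalize() + " : " + ", ".join(attrs))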

How to read and organize text files divided by keywords

I'm working on code (in Python) that reads a text file.
The text file contains information to construct a certain geometry, and it is separated into sections by keywords. For example, the file:
*VERTICES
1 0 0 0
2 10 0 0
3 10 10 0
4 0 10 0
*EDGES
1 1 2
2 1 4
3 2 3
4 3 4
contains the information of a square with vertices at (0,0), (0,10), (10,0), (10,10). The "*Edges" part defines the connection between the vertices. The first number in each row is an ID number.
Here is my problem: the information in the text file is not necessarily in order; sometimes the "Vertices" section appears first, and other times the "Edges" section comes first. I have other keywords as well, so I'm trying to avoid repeating if statements to test whether each line has a new keyword.
What I have been doing is reading the text file multiple times, each time looking for a different keyword:
open file
read line by line
if line == *Points
store all the following lines in a list until a new *command is encountered
close file
open file (again)
read line by line
if line == *Edges
store all the following lines in a list until a new *command is encountered
close file
open file (again)
...
Can someone point out how I can identify these keywords without such a tedious procedure? Thanks.
You can read the file once and store the contents in a dictionary. Since you have conveniently labeled the "command" lines with a *, you can use all lines beginning with a * as the dictionary key and all following lines as the values for that key. You can do this with a for loop:
with open('geometry.txt') as f:
    x = {}
    key = None  # store the most recent "command" here
    for y in f.readlines():
        if y[0] == '*':
            key = y[1:].strip()  # your "command", minus the '*' and the trailing newline
            x[key] = []
        else:
            x[key].append(y.split())  # add subsequent lines to the most recent key
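With the square file from the question, x should then end up looking something like:
print(x)
# {'VERTICES': [['1', '0', '0', '0'], ['2', '10', '0', '0'], ['3', '10', '10', '0'], ['4', '0', '10', '0']],
#  'EDGES': [['1', '1', '2'], ['2', '1', '4'], ['3', '2', '3'], ['4', '3', '4']]}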
Or you can take advantage of python's list and dictionary comprehensions to do the same thing in one line:
with open('test.txt') as f:
    x = {y.split('\n')[0]: [z.split() for z in y.strip().split('\n')[1:]] for y in f.read().split('*')[1:]}
which I'll admit is not very nice looking, but it gets the job done by splitting the entire file into chunks between '*' characters and then using newlines and spaces as delimiters to break the remaining chunks up into dictionary keys and lists of lists (as dictionary values).
Details about splitting, stripping, and slicing strings can be found here
The fact that the sections are unordered lends itself well, I think, to parsing into a dictionary from which you can access values later. I wrote a function that you may find useful for this task:
features = ['POINTS', 'EDGES']

def parseFile(dictionary, f, features):
    """
    Creates a format where you can access a shape feature like:
    dictionary[shapeID][feature] = [[1, 1, 1], [1, 1, 1], ...]
    Assumes that, although the features may appear out of order, each
    shape's features occur together:
    shape1
    *feature1
    .
    .
    .
    *featuren
    Assumes all possible features are in the list features.
    f is the input file handle.
    """
    shapeID = 0
    found = []
    feature = None
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('*') and set(found) != set(features):
            found.append(line[1:])  # append a feature like POINTS to found
            feature = line[1:]
        elif line.startswith('*') and set(found) == set(features):
            found = [line[1:]]  # all features seen: this line starts a new shape
            shapeID += 1
            feature = line[1:]  # current feature
        else:
            # setdefault creates the nested dict/list the first time each
            # shape/feature combination is seen
            dictionary.setdefault(shapeID, {}).setdefault(feature, []).append(
                [int(i) for i in line.split()]
            )
    return dictionary

# to access the shape features, you can get the vertices like:
for vertice in dictionary[shapeID]['POINTS']:
    print(vertice)
# to access the edges:
for edge in dictionary[shapeID]['EDGES']:
    print(edge)
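A hypothetical call, assuming the question's data is saved as geometry.txt (the filename is illustrative). Note that the features list would need to match the keywords actually in the file, e.g. ['VERTICES', 'EDGES'] for the square above:
dictionary = {}
with open('geometry.txt') as f:
    parseFile(dictionary, f, features)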
You should just create a dictionary of the sections. You could use a generator to read the file and yield each section in whatever order they arrive and build a dictionary from the results.
Here's some incomplete code that might help you along:
def load(f):
    with open(f) as file:
        section = next(file).strip()  # Assumes the first line is always a section
        data = []
        for line in file:
            if line[0] == '*':  # Any appropriate test for a new section
                yield section, data
                section = line.strip()
                data = []
            else:
                data.append(list(map(int, line.strip().split())))
        yield section, data
Assuming the data above is in a file called data.txt:
>>> data = dict(load('data.txt'))
>>> data
{'*EDGES': [[1, 1, 2], [2, 1, 4], [3, 2, 3], [4, 3, 4]],
'*VERTICES': [[1, 0, 0, 0], [2, 10, 0, 0], [3, 10, 10, 0], [4, 0, 10, 0]]}
Then you can reference each section, e.g.:
for edge in data['*EDGES']:
...
Assuming your file is named 'data.txt'
from collections import defaultdict

def get_data():
    d = defaultdict(list)
    with open('data.txt') as f:
        key = None
        for line in f:
            if line.startswith('*'):
                key = line.rstrip()
                continue
            d[key].append(line.rstrip())
    return d
The returned defaultdict looks like this:
defaultdict(list,
            {'*EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4'],
             '*VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0']})
You access the data just like a normal dictionary:
d['*EDGES']
['1 1 2', '2 1 4', '3 2 3', '4 3 4']
A common strategy with this type of parsing is to build a function that can yield the data a section at a time. Then your top-level calling code can be fairly simple because it doesn't have to worry about the section logic at all. Here's an example with your data:
import sys

def main(file_path):
    # An example usage.
    for section_name, rows in sections(file_path):
        print('===============')
        print(section_name)
        for row in rows:
            print(row)

def sections(file_path):
    # Setup.
    section_name = None
    rows = []
    # Process the file.
    with open(file_path) as fh:
        for line in fh:
            # Section start: yield any rows we have so far,
            # and then update the section name.
            if line.startswith('*'):
                if rows:
                    yield (section_name, rows)
                    rows = []
                section_name = line[1:].strip()
            # Otherwise, just add another row.
            else:
                row = line.split()
                rows.append(row)
        # Don't forget the last batch of rows.
        if rows:
            yield (section_name, rows)

main(sys.argv[1])
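Running it on the data above (say as python script.py data.txt, where script.py is a hypothetical name for this file) should print something like:
===============
VERTICES
['1', '0', '0', '0']
['2', '10', '0', '0']
['3', '10', '10', '0']
['4', '0', '10', '0']
===============
EDGES
['1', '1', '2']
['2', '1', '4']
['3', '2', '3']
['4', '3', '4']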
A dictionary is probably the way to go given that your data isn't ordered. You can access it by section name after reading the file into a list. Note that the with keyword closes your file automatically.
Here's what it might look like:
# read the data file into a simple list, stripping newlines:
with open('file.dat') as f:
    lines = [line.strip() for line in f]

# get the line numbers for each section:
section_line_nos = [line_no for line_no, data in enumerate(lines) if data.startswith('*')]

# add a terminating line number to mark the end of the file:
section_line_nos.append(len(lines))

# split each section off into a new list, all contained in a dictionary
# with the section names as keys:
section_dict = {
    lines[section_line_no][1:]: lines[section_line_no + 1:section_line_nos[section_no + 1]]
    for section_no, section_line_no in enumerate(section_line_nos[:-1])
}
You will get a dictionary that looks like this:
{'VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0'], 'EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4']}
Access each section this way:
section_dict['EDGES']
Note that the above code assumes each section starts with *, and that no other line starts with *. If the first is not the case, you could make this change:
section_names = ['*EDGES', '*VERTICES']
section_line_nos = [line for line, data in enumerate(lines) if data.strip() in section_names]
Also note that this part of the section_dict code:
lines[section_line_no][1:]
...gets rid of the star at the beginning of each section name. If this is not desired, you can change that to:
lines[section_line_no]
If it is possible there will be undesired whitespace in your section name lines, stripping when reading (as done above) takes care of it; you could also strip just the name:
lines[section_line_no].strip()[1:]
I haven't tested all of this yet but this is the general idea.

Compare configuration data text with a default data text

I am in the process of understanding how to compare data from two text files and print the data that does not match into a new document or output.
The Program Goal:
Allow the user to compare the data in a file that contains many lines of data with a default file that has the correct values of the data.
Compare multiple lines of different data with the same parameters against a default list of the data with the same parameters
Example:
Let's say I have the following text document that has these parameters and data.
Let's call it Config.txt:
<231931844151>
Bird = 3
Cat = 4
Dog = 5
Bat = 10
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Cat = 40
Dog = 10
Bat = Null
Tiger = 19
Fish = 24
etc. etc.
Let's call this my configuration data. Now I need to make sure that the values of these parameters inside my Config data file are correct.
So I have a default data file that has the correct values for these parameters/variables. Let's call it Default.txt:
<Correct Parameters>
Bird = 3
Cat = 40
Dog = 10
Bat = 10
Tiger = 19
Fish = 234
This text file is the default configuration or the correct configuration for the data.
Now I want to compare these two files and print out the data that is incorrect.
So, in theory, if I were to compare these two text documents I should get the following output. Let's call this Output.txt:
<231931844151>
Cat = 4
Dog = 5
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Bat = Null
Fish = 24
etc. etc.
These are the parameters that are incorrect or do not match. In this case we see that for <231931844151> the parameters Cat, Dog, Tiger, and Fish did not match the default text file, so those get printed. In the case of <92103884812>, Bird, Bat, and Fish do not match the default parameters, so those get printed.
So that's the gist of it for now.
Code:
This is the approach I am currently trying; however, I'm not sure how I can compare a data file that has different sets of lines with the same parameters against a default data file.
# open each file directly in a with statement; open() takes a filename, not a file object
with open("Config.txt") as f:
    dataConfig = f.read().splitlines()
with open("Default.txt") as d:
    dataDefault = d.read().splitlines()

def make_dict(data):
    return dict((line.split(None, 1)[0], line) for line in data)

defdict = make_dict(dataDefault)
outdict = make_dict(dataConfig)

# Create a sorted list containing all the keys
allkeys = sorted(set(defdict) | set(outdict))
# print(allkeys)

difflines = []
for key in allkeys:
    indef = key in defdict
    inout = key in outdict
    if indef and not inout:
        difflines.append(defdict[key])
    elif inout and not indef:
        difflines.append(outdict[key])
    else:
        # key must be in both dicts
        defval = defdict[key]
        outval = outdict[key]
        if outval != defval:
            difflines.append(outval)

for line in difflines:
    print(line)
Summary:
I want to compare two text documents that have data/parameters in them. One text document will have several series of data with the same parameters, while the other will have just one series of data with those parameters. I need to compare those parameters and print out the ones that do not match the default. How can I go about doing this in Python?
EDIT:
Okay, so thanks to @Maria's code I think I am almost there. Now I just need to figure out how to compare the dictionary to the list and print out the differences. Here's an example of what I am trying to do:
for i in range(len(setNames)):
    print(setNames[i])
    for k in setData[i]:
        if k in dataDefault:
            print(dataDefault)
Obviously the print line is just there to see if it worked or not, but I'm not sure if this is the proper way to go about this.
Sample code for parsing the file into separate dictionaries. This works by finding the group separators (blank lines). setNames[i] is the name of the set of parameters in the dictionary at setData[i]. Alternatively, you can create an object which has a string name member and a dictionary data member and keep a list of those. Doing the comparisons and outputting them how you want is up to you; this just regurgitates the input file to the command line in a slightly different format.
# The function you wrote
def make_dict(data):
    return dict((line.split(None, 1)[0], line) for line in data)

# open the file and read the lines into a list of strings
with open("Config.txt") as f:
    dataConfig = f.read().splitlines()

# get rid of trailing '', as they cause problems and are unnecessary
while (len(dataConfig) > 0) and (dataConfig[-1] == ''):
    dataConfig.pop()

# find the indexes of all the ''. They amount to one index past the end of each set of parameters
setEnds = []
index = 0
while '' in dataConfig[index:]:
    setEnds.append(dataConfig[index:].index('') + index)
    index = setEnds[-1] + 1

# separate out your input into separate dictionaries, and keep track of the name of each dictionary
setNames = []
setData = []
i = 0
j = 0
while j < len(setEnds):
    setNames.append(dataConfig[i])
    setData.append(make_dict(dataConfig[i + 1:setEnds[j]]))
    i = setEnds[j] + 1
    j += 1

# handle the last set, which runs to the end of the list. Alternatively you could
# add len(dataConfig) to the end of setEnds and you wouldn't need this
if i < len(dataConfig):
    setNames.append(dataConfig[i])
    setData.append(make_dict(dataConfig[i + 1:]))

# regurgitate the input to prove it worked the way you wanted
for i in range(len(setNames)):
    print(setNames[i])
    for k in setData[i]:
        print("\t" + k + ": " + setData[i][k])
    print("")
Why not just use those dicts and loop through them to compare the values?
for key in outdict:
    if key in defdict and defdict[key] != outdict[key]:
        print(outdict[key])
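To connect this with the EDIT in the question, a hedged sketch: assuming defdict was built from Default.txt with make_dict (as in the question's code) and setNames/setData come from the parsing code above, you could print only the mismatching lines per set:
for name, params in zip(setNames, setData):
    bad = [params[k] for k in params if k in defdict and defdict[k] != params[k]]
    if bad:
        print(name)
        for line in bad:
            print(line)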

Python splitting up line into separate lists

I have data in a text file that is space separated into right aligned columns. I would like to be able to take each column and put it in a list, basically like you would do with an array. I can't seem to find an equivalent to
left(strname,#ofcharacters)/mid(strname,firstcharacter,lastcharacter)/right(strname,#ofcharacters)
like you would normally use in VB to accomplish the task. How do I separate off the data and group each 'unit' with the corresponding values from the following lines in Python?
Is it possible? Oh yeah, some columns are 12 characters wide (right aligned) while others are 15 characters wide.
-1234 56 32452 68584.4 Extra_data
-5356 9 546 12434.5 Extra_data
- 90 12 2345 43522.1 Extra_data
Desired output:
[-1234, -5356, -90]
[56, 9, 12]
[32452, 546, 2345]
etc
The equivalent method in Python is str.split() without any arguments, which splits the string on whitespace. It also takes care of any trailing newlines/spaces, and unlike your VB example, you do not need to care about the data width.
Example
with open("data.txt") as fin:
data = map(str.split, fin) #Split each line of data on white-spaces
data = zip(*data) #Transpose the Data
But if you have columns containing whitespace, you need some way to split the data based on column position:
>>> def split_on_width(data, pos):
...     if pos[-1] != len(data):
...         pos = pos + (len(data),)
...     # create index pairs with the current start and the next start as the end
...     indexes = zip(pos, pos[1:])
...     # slice the data using the indexes
...     return [data[start:end].strip() for start, end in indexes]

>>> def trynum(n):
...     try:
...         return int(n)
...     except ValueError:
...         pass
...     try:
...         return float(n)
...     except ValueError:
...         return n

>>> pos
(0, 5, 13, 22, 36)
>>> with open("test.txt") as fin:
...     data = (split_on_width(line.strip(), pos) for line in fin)
...     data = [[trynum(n) for n in row] for row in zip(*data)]

>>> data
[[-1234, -5356, -90], [56, 9, 12], [32452, 546, 2345], [68584.4, 12434.5, 43522.1], ['Extra_data', 'Extra_data', 'Extra_data']]
Just use str.split() with no arguments; it splits an input string on arbitrary width whitespace:
>>> ' some_value another_column 123.45 42 \n'.split()
['some_value', 'another_column', '123.45', '42']
Note that any columns containing whitespace would also be split.
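For example, the third row of the question's data has a stray space after the minus sign, so it splits into an extra column:
>>> '- 90 12 2345 43522.1 Extra_data'.split()
['-', '90', '12', '2345', '43522.1', 'Extra_data']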
If you wanted to have lists of columns, you need to transpose the rows:
with open(filename) as inputfh:
    columns = zip(*(l.split() for l in inputfh))
