How to read and organize text files divided by keywords - python

I'm working on some Python code that reads a text file.
The text file contains the information needed to construct a certain geometry, and it is divided into sections by keywords. For example, the file:
*VERTICES
1 0 0 0
2 10 0 0
3 10 10 0
4 0 10 0
*EDGES
1 1 2
2 1 4
3 2 3
4 3 4
contains the information of a square with vertices at (0,0), (0,10), (10,0), (10,10). The "*Edges" part defines the connection between the vertices. The first number in each row is an ID number.
Here is my problem: the information in the text file is not necessarily in order. Sometimes the "Vertices" section appears first, and other times the "Edges" section comes first. I have other keywords as well, so I'm trying to avoid repeating if statements to test whether each line has a new keyword.
What I have been doing is reading the text file multiple times, each time looking for a different keyword:
open file
read line by line
    if line == *Points
        store all the following lines in a list until a new *command is encountered
close file
open file (again)
read line by line
    if line == *Edges
        store all the following lines in a list until a new *command is encountered
close file
open file (again)
...
Can someone point out how I can identify these keywords without such a tedious procedure? Thanks.

You can read the file once and store the contents in a dictionary. Since you have conveniently labeled the "command" lines with a *, you can use all lines beginning with a * as the dictionary key and all following lines as the values for that key. You can do this with a for loop:
x = {}
key = None  # store the most recent "command" here
with open('geometry.txt') as f:
    for y in f:
        y = y.strip()  # drop the trailing newline
        if not y:
            continue  # skip blank lines
        if y[0] == '*':
            key = y[1:]  # your "command"
            x[key] = []
        else:
            x[key].append(y.split())  # add subsequent lines to the most recent key
Or you can take advantage of python's list and dictionary comprehensions to do the same thing in one line:
with open('test.txt') as f:
    x = {y.split('\n')[0]: [z.split() for z in y.strip().split('\n')[1:]] for y in f.read().split('*')[1:]}
which I'll admit is not very nice looking but it gets the job done by splitting the entire file into chunks between '*' characters and then using new lines and spaces as delimiters to break up the remaining chunks into dictionary keys and lists of lists (as dictionary values).
Details about splitting, stripping, and slicing strings can be found here
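The one-liner above can be unpacked into explicit steps, which may be easier to follow. This is only a sketch: the function name parse_chunks and the in-memory SAMPLE text (standing in for the real file) are assumptions made for illustration.

```python
import io

# Sample input copied from the question; io.StringIO stands in for the real file.
SAMPLE = """*VERTICES
1 0 0 0
2 10 0 0
3 10 10 0
4 0 10 0
*EDGES
1 1 2
2 1 4
3 2 3
4 3 4
"""

def parse_chunks(text):
    """Split text on '*' and build {section name: list of token lists}."""
    sections = {}
    # Everything before the first '*' is discarded, hence the [1:].
    for chunk in text.split('*')[1:]:
        lines = chunk.strip().split('\n')
        name = lines[0].strip()  # the first line of a chunk is the keyword
        sections[name] = [row.split() for row in lines[1:]]
    return sections

with io.StringIO(SAMPLE) as f:
    parsed = parse_chunks(f.read())

print(parsed['EDGES'])
```

Like the one-liner, this keeps every value as a string and assumes that '*' never appears inside a data row.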

The fact that they are unordered I think lends itself well for parsing into a dictionary from which you can access values later. I wrote a function that you may find useful for this task:
features = ['POINTS', 'EDGES']
def parseFile(dictionary, f, features):
    """
    Creates a format where you can access a shape feature like:
        dictionary[shapeID][feature] = [[1, 1, 1], [1, 1, 1], ...]
    Assumes: all features, although possibly out of order, occur in the form
        shape1
        *feature1
        .
        .
        .
        *featuren
    Assumes all possible features are in the list features
    f is the input file handle
    """
    shapeID = 0
    found = []
    feature = None
    for line in f:
        line = line.strip()  # drop the trailing newline
        if not line:
            continue
        if line[0] == '*':
            if sorted(found) == sorted(features):
                # every feature of the current shape was seen: start a new shape
                found = []
                shapeID += 1
            feature = line[1:]  # current feature, like POINTS
            found.append(feature)
        else:
            dictionary.setdefault(shapeID, {}).setdefault(feature, []).append(
                [int(i) for i in line.split()]
            )
    return dictionary
# to access the shape features you can get vertices like:
for vertice in dictionary[shapeID]['POINTS']:
    print(vertice)
# to access edges
for edge in dictionary[shapeID]['EDGES']:
    print(edge)

You should just create a dictionary of the sections. You could use a generator to read the file and yield each section in whatever order they arrive and build a dictionary from the results.
Here's some incomplete code that might help you along:
def load(f):
    with open(f) as file:
        section = next(file).strip()  # Assumes first line is always a section
        data = []
        for line in file:
            if line[0] == '*':  # Any appropriate test for a new section
                yield section, data
                section = line.strip()
                data = []
            else:
                data.append(list(map(int, line.strip().split())))
        yield section, data
Assuming the data above is in a file called data.txt:
>>> data = dict(load('data.txt'))
>>> data
{'*EDGES': [[1, 1, 2], [2, 1, 4], [3, 2, 3], [4, 3, 4]],
'*VERTICES': [[1, 0, 0, 0], [2, 10, 0, 0], [3, 10, 10, 0], [4, 0, 10, 0]]}
Then you can reference each section, e.g.:
for edge in data['*EDGES']:
...

Assuming your file is named 'data.txt'
from collections import defaultdict
def get_data():
    d = defaultdict(list)
    with open('data.txt') as f:
        key = None
        for line in f:
            if line.startswith('*'):
                key = line.rstrip()
                continue
            d[key].append(line.rstrip())
    return d
The returned defaultdict looks like this:
defaultdict(list,
{'*EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4'],
'*VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0']})
You access the data just like a normal dictionary
d['*EDGES']
['1 1 2', '2 1 4', '3 2 3', '4 3 4']
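The rows returned above are still plain strings. As a possible next step, here is a minimal sketch that converts them into a vertex lookup and a list of edges, assuming (as the question states) that the first number in each row is an ID. The name build_geometry and the hard-coded dictionary d are illustrative only.

```python
def build_geometry(sections):
    """Turn string rows into {vertex_id: (x, y, z)} and [(start_id, end_id), ...]."""
    vertices = {}
    for row in sections['*VERTICES']:
        vid, x, y, z = (int(tok) for tok in row.split())
        vertices[vid] = (x, y, z)
    edges = []
    for row in sections['*EDGES']:
        _edge_id, start, end = (int(tok) for tok in row.split())
        edges.append((start, end))
    return vertices, edges

# Same content as the defaultdict shown above, written out as a plain dict.
d = {
    '*EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4'],
    '*VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0'],
}
vertices, edges = build_geometry(d)

# Look up the coordinates of both ends of the first edge.
start_id, end_id = edges[0]
print(vertices[start_id], vertices[end_id])
```

Keying the vertices by ID means the edges can refer to them regardless of the order in which the sections or rows appear in the file.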

A common strategy with this type of parsing is to build a function that can yield the data a section at a time. Then your top-level calling code can be fairly simple because it doesn't have to worry about the section logic at all. Here's an example with your data:
import sys
def main(file_path):
    # An example usage.
    for section_name, rows in sections(file_path):
        print('===============')
        print(section_name)
        for row in rows:
            print(row)
def sections(file_path):
    # Setup.
    section_name = None
    rows = []
    # Process the file.
    with open(file_path) as fh:
        for line in fh:
            # Section start: yield any rows we have so far,
            # and then update the section name.
            if line.startswith('*'):
                if rows:
                    yield (section_name, rows)
                    rows = []
                section_name = line[1:].strip()
            # Otherwise, just add another row.
            else:
                row = line.split()
                rows.append(row)
    # Don't forget the last batch of rows.
    if rows:
        yield (section_name, rows)
main(sys.argv[1])

A dictionary is probably the way to go given that your data isn't ordered. You can access it by section name after reading the file into a list. Note that the with keyword closes your file automatically.
Here's what it might look like:
# read the data file into a simple list (without the trailing newlines):
with open('file.dat') as f:
    lines = f.read().splitlines()
# get the line numbers for each section:
section_line_nos = [line_no for line_no, data in enumerate(lines) if data.startswith('*')]
# add a terminating line number to mark end of the file:
section_line_nos.append(len(lines))
# split each section off into a new list, all contained in a dictionary
# with the section names as keys
section_dict = {
    lines[section_line_no][1:]: lines[section_line_no + 1:section_line_nos[section_no + 1]]
    for section_no, section_line_no in enumerate(section_line_nos[:-1])
}
You will get a dictionary that looks like this:
{'VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0'], 'EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4']}
Access each section this way:
section_dict['EDGES']
Note that the above code assumes each section starts with *, and that no other line starts with *. If the first is not the case, you could make this change:
section_names = ['*EDGES', '*VERTICES']
section_line_nos = [line for line, data in enumerate(lines) if data.strip() in section_names]
Also note that this part of the section_dict code:
lines[section_line_no][1:]
...gets rid of the star at the beginning of each section name. If this is not desired, you can change that to:
lines[section_line_no]
If it is possible there will be undesired white space in your section name lines, you can do this to get rid of it:
lines[section_line_no].strip()[1:]
I haven't tested all of this yet but this is the general idea.

Related

Get Number after a text in Python

I need some help with getting numbers after a certain text
For example, I have a list:
['Martin 9', ' Leo 2 10', ' Elisabeth 3']
Now I need to change the list into variables like this:
Martin = 9
Leo2 =10
Elisabeth = 3
I haven't tried many things, because I'm new to Python.
Thanks for reading my question and probably helping me
I suppose you got the list using Selenium, so you can get the integers from it without using re, like this:
names = ["Martin 9", "Leo 10", "Elisabeth 3"]
for a in names:
    print(''.join(filter(str.isdigit, a)))
Output :
9
10
3
Let's assume you loaded it from a file and got it line by line:
numbers = []
for line in lines:
    line = line.split()
    if line[0] == "Martin":
        numbers.append(int(line[1]))
I don't think you really want a bunch of variables, especially if the list grows large. Instead I think you want a dictionary where the names are the key and the number is the value.
I would start by splitting your string into a list as described in this question/answer.
s = """Martin 9
Leo 10
Elisabeth 3"""
listData = s.splitlines()
Then I would turn that list into a dictionary as described in this question/answer.
myDictionary = {}
for listItem in listData:
    i = listItem.split(' ')
    myDictionary[i[0]] = int(i[1])
Then you can access the number, in Leo's case 10, via:
myDictionary["Leo"]
I have not tested this syntax and I'm not used to python, so I'm sure a little debugging will be involved. Let me know if I need to make some corrections.
I hope that helps :)
s="""Leo 1 9
Leo 2 2"""
I shortened the list here.
Look for the word in the list:
if "Leo 1" in s:
Then get its index:
i = s.index("Leo 1")
Then get the number of characters in the word:
length = len("Leo 1")
And then add them together (+1 because of the space):
b = i + length + 1
And then get the number at position b:
x = int(s[b])
So in total:
if "Leo 1" in s:
    i = s.index("Leo 1")
    length = len("Leo 1")
    b = i + length + 1
    x = int(s[b])
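For names that themselves contain a number, such as ' Leo 2 10' in the question's list, one option is to split only on the last run of whitespace so that the trailing number is separated from the rest. This is a sketch under that assumption; the variable names items and values are illustrative.

```python
items = ['Martin 9', ' Leo 2 10', ' Elisabeth 3']

values = {}
for item in items:
    # rsplit with maxsplit=1 separates only the trailing number from the name.
    name, number = item.strip().rsplit(None, 1)
    # Remove inner spaces so 'Leo 2' becomes 'Leo2', as in the question.
    values[name.replace(' ', '')] = int(number)

print(values)  # {'Martin': 9, 'Leo2': 10, 'Elisabeth': 3}
```

A dictionary is used instead of separate variables, in line with the suggestion above; values['Leo2'] then gives 10.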

Find first line of text according to value in Python

How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In Linux I can get the closest line and do the processing I need using grep, sed, cut and others, but I'd like to do it in Python.
Any help will be greatly appreciated!
Thank you.
You can try:
with open("text_filter.txt") as f:
    text = f.readlines()  # read text lines into a list
target = "37.0459"
match = [i for i, x in enumerate(text) if target in x]  # indexes of the lines containing target
if match:
    first = match[0]
    start = max(first - 3, 0)  # up to 3 lines before the first match
    print("".join(text[start:first + 4]).strip())  # the match itself plus up to 3 lines after
Output:
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
You can use pandas to import the data in a dataframe and then easily manipulate it. As per your question the value to check is not the exact match and therefore I have converted it to string.
import pandas as pd
data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])  # imports text file as dataframe
value_to_check = 37.0459  # user defined
for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break
print(data.iloc[i-3:i+4, :])
output
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
A solution with iterators, that only keeps in memory the necessary lines and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice
def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)  # keeps only the last `before` lines seen
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
It also works if there are fewer than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
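The difference between the two kinds of matching can be seen on a single line. This is a small illustration with a made-up sample line, not part of the answer's code:

```python
# A made-up sample line in the same "latitude,longitude" format.
line = '37.0444,-95.57054\n'

# Substring test: '37.04' is found inside '37.0444', a false positive.
substring_hit = '37.04' in line

# Float test: parse each field and compare numerically.
fields = [float(part) for part in line.split(',')]
float_hit = 37.04 in fields

print(substring_hit, float_hit)  # True False
```

The trade-off is that float matching requires the full value, so a truncated search term such as 37.0459 would not match 37.04597.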
This solution will print the before and after elements even if they are less than 3.
I am also using strings, since the question implies that you want partial matches as well, i.e. 37.0459 will match 37.04597.
search_term = '37.04462'
with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines]  # remove '\n'
for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break
left = 0
right = 0
for k in range(1, 4):  # k = 1, 2, 3; the end of a range is not included
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1
for i in range(index - left, index + right + 1):  # +1 because the end of a range is not included
    print(lines[i][0], lines[i][1])
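The bounds bookkeeping above can also be expressed with a clamped slice. A minimal sketch, where the helper name context_window and the sample list rows are assumptions made for illustration:

```python
def context_window(rows, index, before=3, after=3):
    """Return rows[index] with up to `before` rows above and `after` rows below."""
    start = max(index - before, 0)             # clamp at the top of the list
    stop = min(index + after + 1, len(rows))   # slice end is exclusive
    return rows[start:stop]

rows = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
near_start = context_window(rows, 1)  # only one row exists above index 1
middle = context_window(rows, 4)

print(near_start)
print(middle)
```

Clamping the start matters because a negative slice start would count from the end of the list instead of stopping at the first row.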

Compare configuration data text with a default data text

I am in the process of understanding how to compare data from two text files and print the data that does not match into a new document or output.
The Program Goal:
Allow the user to compare the data in a file that contains many lines of data with a default file that has the correct values of the data.
Compare multiple lines of different data with the same parameters against a default list of the data with the same parameters
Example:
Let's say I have the following text document that has these parameters and data.
Let's call it Config.txt:
<231931844151>
Bird = 3
Cat = 4
Dog = 5
Bat = 10
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Cat = 40
Dog = 10
Bat = Null
Tiger = 19
Fish = 24
etc. etc.
Let's call this my configuration data. Now I need to make sure that the values of these parameters inside my Config.txt file are correct.
So I have a default data file that has the correct values for these parameters/variables. Let's call it Default.txt:
<Correct Parameters>
Bird = 3
Cat = 40
Dog = 10
Bat = 10
Tiger = 19
Fish = 234
This text file is the default configuration or the correct configuration for the data.
Now I want to compare these two files and print out the data that is incorrect.
So, in theory, if I were to compare these two text documents, I should get the following output. Let's call it Output.txt:
<231931844151>
Cat = 4
Dog = 5
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Bat = Null
Fish = 24
etc. etc.
Since these are the parameters that are incorrect or do not match. So in this case we see that for <231931844151> the parameters Cat, Dog, Tiger, and Fish did not match the default text file so those get printed. In the case of <92103884812> Bird, Bat, and Fish do not match the default parameters so those get printed.
So that's the gist of it for now.
Code:
Currently this is my approach I am trying to do however I'm not sure how I can compare a data file that has different sets of lines with the same parameters to a default data file.
with open("Config.txt") as f:
    dataConfig = f.read().splitlines()
with open("Default.txt") as d:
    dataDefault = d.read().splitlines()
def make_dict(data):
    return dict((line.split(None, 1)[0], line) for line in data)
defdict = make_dict(dataDefault)
outdict = make_dict(dataConfig)
# Create a sorted list containing all the keys
allkeys = sorted(set(defdict) | set(outdict))
# print(allkeys)
difflines = []
for key in allkeys:
    indef = key in defdict
    inout = key in outdict
    if indef and not inout:
        difflines.append(defdict[key])
    elif inout and not indef:
        difflines.append(outdict[key])
    else:
        # key must be in both dicts
        defval = defdict[key]
        outval = outdict[key]
        if outval != defval:
            difflines.append(outval)
for line in difflines:
    print(line)
Summary:
I want to compare two text documents that have data/parameters in them, One text document will have a series of data with the same parameters while the other will have just one series of data with the same parameters. I need to compare those parameters and print out the ones that do not match the default. How can I go about doing this in Python?
EDIT:
Okay, so thanks to @Maria's code I think I am almost there. Now I just need to figure out how to compare the dictionary to the list and print out the differences. Here's an example of what I am trying to do:
for i in range(len(setNames)):
    print(setNames[i])
    for k in setData[i]:
        if k in dataDefault:
            print(dataDefault)
Obviously the print line is just there to see if it worked or not, but I'm not sure if this is the proper way of going about this.
Sample code for parsing the file into separate dictionaries. This works by finding the group separators (blank lines). setNames[i] is the name of the set of parameters in the dictionary at setData[i]. Alternatively you can create an object which has a string name member and a dictionary data member and keep a list of those. Doing the comparisons and outputting it how you want is up to you, this just regurgitates the input file to the command line in a slightly different format.
# The function you wrote
def make_dict(data):
    return dict((line.split(None, 1)[0], line) for line in data)
# open the file and read the lines into a list of strings
with open("Config.txt") as f:
    dataConfig = f.read().splitlines()
# get rid of trailing '', as they cause problems and are unnecessary
while dataConfig and dataConfig[-1] == '':
    dataConfig.pop()
# find the indexes of all the ''. They amount to one index past the end of each set of parameters
setEnds = []
index = 0
while '' in dataConfig[index:]:
    setEnds.append(dataConfig[index:].index('') + index)
    index = setEnds[-1] + 1
# separate out your input into separate dictionaries, and keep track of the name of each dictionary
setNames = []
setData = []
i = 0
j = 0
while j < len(setEnds):
    setNames.append(dataConfig[i])
    setData.append(make_dict(dataConfig[i+1:setEnds[j]]))
    i = setEnds[j] + 1
    j += 1
# handle the last index to the end of the list. Alternatively you could add len(dataConfig) to the end of setEnds and you wouldn't need this
if i < len(dataConfig):
    setNames.append(dataConfig[i])
    setData.append(make_dict(dataConfig[i+1:]))
# regurgitate the input to prove it worked the way you wanted.
for i in range(len(setNames)):
    print(setNames[i])
    for k in setData[i]:
        print("\t" + k + ": " + setData[i][k])
    print("")
Why not just use those dicts and loop through them to compare?
for key in outdict:
    if outdict[key] != defdict.get(key):  # value differs from the default
        print(outdict[key])
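Putting the pieces of this thread together, the comparison could be sketched as follows. This is not any answerer's code: it assumes that each set starts with a header line in angle brackets, that every setting has the form "name = value", and that values are compared as plain strings. The names parse_sections, mismatches, CONFIG and DEFAULT are made up for the example, and the sample text is the data from the question.

```python
def parse_sections(text):
    """Return a list of (header, {name: value}) pairs, one per '<...>' header."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines are ignored
        if line.startswith('<') and line.endswith('>'):
            sections.append((line, {}))
        elif '=' in line and sections:
            name, value = line.split('=', 1)
            sections[-1][1][name.strip()] = value.strip()
    return sections

def mismatches(config_text, default_text):
    """Yield (header, name, value) for every setting that differs from the default."""
    defaults = parse_sections(default_text)[0][1]
    for header, params in parse_sections(config_text):
        for name, value in params.items():
            if defaults.get(name) != value:
                yield header, name, value

CONFIG = """<231931844151>
Bird = 3
Cat = 4
Dog = 5
Bat = 10
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Cat = 40
Dog = 10
Bat = Null
Tiger = 19
Fish = 24
"""

DEFAULT = """<Correct Parameters>
Bird = 3
Cat = 40
Dog = 10
Bat = 10
Tiger = 19
Fish = 234
"""

# Group the mismatches by header so they can be printed like Output.txt.
report = {}
for header, name, value in mismatches(CONFIG, DEFAULT):
    report.setdefault(header, []).append(name + " = " + value)

for header, lines in report.items():
    print(header)
    for line in lines:
        print(line)
```

To work with the real files, the two strings could be replaced by the contents of Config.txt and Default.txt read with open(...).read().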

tricky string parsing with python

I have a text file like this:
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
.
.
51 21 41 42
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
.
.
The four-tuples are tab delimited.
For each ID, I'm trying to make a dictionary with the ID being the key and the four-tuples (in list/tuple form) as elements of that key.
dict = {31: [(1,2,12,40), (2,3,4,21), ...], 34: [(3,2,12,40), (4,3,4,21), ...]}
My string parsing knowledge is limited to reading the lines with file.readlines() and using str.replace() and str.split() on 'ID = '. But there has to be a better way. Here are some beginnings of what I have.
file = open('text.txt', 'r')
fp = file.readlines()
B = []
for x in fp:
    x = x.replace('\t', ',')  # str.replace returns a new string
    x = x.replace('\n', ')')
    B.append(x)
Something like this:
ll = []
for line in fp:
    tt = tuple(int(x) for x in line.split())
    ll.append(tt)
That will produce a list of tuples to assign to the key for your dictionary.
Python's great for this stuff, why not write up a 5-10 liner for it? It's kind of what the language is meant to excel at.
$ cat test
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
data = {}
for block in open('test').read().split('ID = '):
    if not block.strip():
        continue  # nothing before the first 'ID = '
    lines = block.split('\n')
    ID = int(lines[0])
    # skip the three header lines, drop blank lines, convert each field to int
    tups = [[int(x) for x in line.split()] for line in lines[4:] if line.strip()]
    data[ID] = tuple(tups)
print(data)
# {31: ([1, 2, 12, 40], [2, 3, 4, 21]), 34: ([3, 2, 12, 40], [4, 3, 4, 21])}
The only annoying thing is the filtering, sorry; that's just the result of empty strings and stuff from extra newlines, etc. For a one-off little script, it's no biggie.
I think this will do the trick for you:
def parse_file(filename):
    """
    Parses an input data file containing tags of the form "ID = ##" (where ## is a
    number) followed by rows of data. Returns a dictionary where the ID numbers
    are the keys and all of the rows of data are stored as a list of tuples
    associated with the key.
    Args:
        filename (string) name of the file you want to parse
    Returns:
        my_dict (dictionary) dictionary of data with ID numbers as keys
    """
    my_dict = {}
    with open(filename, "r") as my_file:  # handles opening and closing file
        rows = my_file.readlines()
    for row in rows:
        if "ID = " in row:
            my_key = int(row.split("ID = ")[1])  # grab the ID number
            my_list = []  # initialize a new data list for a new ID
            my_dict[my_key] = my_list  # stores the data list
        elif row != "\n":  # skip rows that only have newline char
            try:  # if this fails, we don't have a valid data line
                my_list.append(tuple([int(x) for x in row.split()]))
            except ValueError:
                continue  # header line such as "Ne = 5122": move on
    return my_dict
I made it a function so that you can call it from anywhere, just passing the filename. It makes assumptions about the file format, but if the file format is always what you showed us here, it should work for you. You would call it on your data.txt file like:
a_dictionary = parse_file("data.txt")
I tested it on the data that you gave us and it seems to work just fine after deleting the "..." rows.
Edit: I noticed one small bug. As written, it will add an empty tuple in place of a new line character ("\n") wherever that appears alone on a line. To fix this, put the try: and except: clauses inside of this:
elif row != "\n": # skips rows that only contain newline char
I added this to the full code above as well.

python--import data from file and autopopulate a dictionary

I am a python newbie and am trying to accomplish the following.
A text file contains data in a slightly weird format and I was wondering whether there is an easy way to parse it and auto-fill an empty dictionary with the correct keys and values.
The data looks something like this
01> A B 2 ##01> denotes the line number, that's all
02> EWMWEM
03> C D 3
04> EWWMWWST
05> Q R 4
06> WESTMMMWW
So each pair of lines describes a full set of instructions for a robot arm. Lines 1-2 are for arm 1, lines 3-4 are for arm 2, and so on. The first line states the location and the second line states the set of instructions (movement, changes in direction, turns, etc.)
What I am looking for is a way to import this text file, parse it properly, and populate a dictionary that will generate automatic keys. Note that the file only contains values, which is why I am having a hard time. How do I tell the program to generate armX (where X is the ID from 1 to n) and assign a tuple (or a pair) to it such that the dictionary reads:
dict = {'arm1': ('A''B'2, EWMWEM) ...}
I am sorry if the newbie-ish vocab is redundant or unclear. Please let me know and I will be happy to clarify.
A commented code that is easy to understand will help me learn the concepts and motivation.
Just to provide some context. The point of the program is to load all the instructions and then execute the methods on the arms. So if you think there is a more elegant way to do it without loading all the instructions, please suggest.
def get_instructions_dict(instructions_file):
    even_lines = []
    odd_lines = []
    with open(instructions_file) as f:
        i = 1
        for line in f:
            # split the lines into id and command lines
            if i % 2 == 0:
                # command line
                even_lines.append(line.strip())
            else:
                # id line
                odd_lines.append(line.strip())
            i += 1
    # create tuples of (id, cmd) and zip them with armX ( armX, (id, command) )
    # and combine them into a dict
    result = dict(zip(tuple("arm%s" % i for i in range(1, len(odd_lines) + 1)),
                      tuple(zip(odd_lines, even_lines))))
    return result
>>> print get_instructions_dict('instructions.txt')
{'arm3': ('Q R 4', 'WESTMMMWW'), 'arm1': ('A B 2', 'EWMWEM'), 'arm2': ('C D 3', 'EWWMWWST')}
Note dict keys are not ordered. If that matters, use OrderedDict
I would do something like this:
mydict = {}  # empty dict
buffer = ''
for line in open('myFile'):  # open the file, read line by line
    linelist = line.strip().replace(' ', '').split('>')  # line 1 would become ['01', 'AB2']
    if len(linelist) > 1:  # eliminates empty lines
        number = int(linelist[0])
        if number % 2:  # location line
            buffer = linelist[1]  # we keep this till we know the instruction
        else:
            # we know the instructions, we write all to the dict
            mydict['arm%i' % (number // 2)] = (buffer, linelist[1])
robot_dict = {}
arm_number = 1
key = None
for line in open('sample.txt'):
    line = line.strip()
    if not key:
        location = line
        key = 'arm' + str(arm_number)  # setting key for dict
    else:
        instruction = line
        robot_dict[key] = (location, instruction)
        key = None  # reset key
        arm_number = arm_number + 1
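Pairing every two lines can also be done by zipping an iterator with itself. A minimal sketch, assuming the "01>" prefixes are not actually in the file (the question says they only denote line numbers); the function name load_arms and the in-memory SAMPLE are illustrative.

```python
import io

# The question's data without the "01>" line-number prefixes.
SAMPLE = "A B 2\nEWMWEM\nC D 3\nEWWMWWST\nQ R 4\nWESTMMMWW\n"

def load_arms(lines):
    """Pair consecutive non-blank lines as (location, instructions)."""
    cleaned = (line.strip() for line in lines if line.strip())
    # Zipping the same iterator with itself consumes two lines per step.
    pairs = zip(cleaned, cleaned)
    return {'arm%d' % n: pair for n, pair in enumerate(pairs, start=1)}

arms = load_arms(io.StringIO(SAMPLE))
print(arms['arm2'])  # ('C D 3', 'EWWMWWST')
```

Passing an open file object instead of io.StringIO should work the same way, since both yield lines when iterated.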
