Functional approach to file parsing in Python

I have a text file describing an electronic circuit and a few other things done with it. I've written a simple Python script that splits the file into different units, which can then be analyzed further if needed.
The syntax of the simulation language defines these units as contained within the following lines:
subckt xxx .....
...
...
ends xxx ...
There are a few of these 'text blocks', plus other content that I either parse or leave out, such as comment lines.
To accomplish this, I use the following core:
with open('input') as f:
    for l in iter(f):
        if 'subckt' not in l:
            pass
        else:
            # append mode, so that every block found ends up in the output file
            with open('output', 'a') as o:
                o.write(l)
                for l in iter(f):
                    if 'ends' in l:
                        o.write(l)
                        break
                    else:
                        o.write(l)
(can't easily paste the real code, there might be oversights)
The nice thing about it is the fact that iter(f) keeps scanning the file so when I break out of the inner loop as I reached the ends line of a subckt, the outer loop keeps going from that point onward, searching for new occurrences of the token subckt in subsequent lines.
I am looking for suggestions and/or guidance on how to transform the forest of if/then clauses into something more functional, i.e. based on 'pure' functions which just yield values (the file rows or lines) and are then composed to produce the final result.
Specifically, I am not sure how to approach the fact that the generator/map/filter should yield different rows depending on whether it has found the subckt token or not.
I can think of a filter of the form:
line = filter(lambda x: 'subckt' in x, iter(f))
but this of course only gives me the lines where that string is present, whereas I would like, from that moment on, to yield all lines until the ends token is found.
Is this something I'd have to handle with recursion? Or maybe itertools.tee?
Seems to me that what I want is to have some form of state, i.e. "you have reached a subckt", but without resorting to a true state variable, which would be against the functional paradigm.

Not sure if this is what you are looking for. blocks(f) is a generator producing the blocks in your file f. Each block is an iterator over the lines between 'subckt' and 'ends'. If you want to include those two lines in the block, you'd have to do some more work in __block. But I hope this gives you an idea:
def __block(f):
    while 'subckt' not in next(f): pass # raises StopIteration at EOF
    return iter(next(iter([])) if 'ends' in l else l.strip() for l in f)
def blocks(f):
    while 1: yield __block(f) # StopIteration from __block will stop the generator
f = open('data.txt')
for block in blocks(f):
    # process block
    for line in block:
        pass # process line
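The next(iter([])) expression is a little hack that raises StopIteration to terminate a comprehension/generator early. Note that from Python 3.7 onward (PEP 479), a StopIteration raised inside a generator is converted into a RuntimeError, so both this hack and the way blocks relies on the StopIteration from __block only work on older versions.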
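For newer Python versions, the same idea can be sketched without raising StopIteration by hand. This is only a minimal sketch, not the answer's original method: it assumes each block is wanted as a list of stripped lines (rather than a lazy iterator), still excludes the 'subckt' and 'ends' lines, and uses the hypothetical name iter_blocks.

```python
from itertools import takewhile

def iter_blocks(lines):
    """Yield each block as a list of the lines between 'subckt' and 'ends'."""
    lines = iter(lines)  # one shared iterator, so the position is kept between blocks
    for line in lines:
        if 'subckt' in line:
            # takewhile consumes the 'ends' line but does not yield it
            yield [l.strip() for l in takewhile(lambda l: 'ends' not in l, lines)]

# Made-up sample input, standing in for an open file
sample = [
    "* comment\n",
    "subckt inv in out\n",
    "M1 out in vdd\n",
    "M2 out in gnd\n",
    "ends inv\n",
    "subckt buf a b\n",
    "X1 a b inv\n",
    "ends buf\n",
]
for block in iter_blocks(sample):
    print(block)
```

Because the outer for loop and takewhile share one iterator, the outer loop resumes right after each 'ends' line, which is the same property the question relies on.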

This answer also works, still very keen on hearing comments:
from itertools import takewhile, dropwhile
def start(l): return 'subckt' not in l
def stop(l): return 'ends' not in l
def sub(iter):
    while True:
        a = list(dropwhile(start,takewhile(stop,iter)))
        if len(a):
            yield a
        else:
            return
f = open('file.txt')
for b in sub(f):
    pass #process b
f.close()
Something I couldn't work out yet: how to include the last line (the one containing the ends keyword) in the output.
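One possible way to keep the ends line, sketched under the assumption that every block is closed by a line containing 'ends': swap takewhile for a small helper that yields the terminating line before it stops. The names take_through and sub_inclusive are hypothetical, and the substring tests are as loose as in the code above.

```python
from itertools import dropwhile

def take_through(is_last, lines):
    """Yield lines up to and including the first one for which is_last is true."""
    for line in lines:
        yield line
        if is_last(line):
            return

def sub_inclusive(lines):
    """Yield each block as a list, from its 'subckt' line through its 'ends' line."""
    lines = iter(lines)  # shared iterator, so each chunk continues where the last stopped
    while True:
        chunk = take_through(lambda l: 'ends' in l, lines)
        block = list(dropwhile(lambda l: 'subckt' not in l, chunk))
        if not block:
            return  # same stopping rule as above: first chunk without a 'subckt' line
        yield block

# Made-up sample input, standing in for an open file
sample = ["* c\n", "subckt a\n", "r1\n", "ends a\n", "junk\n", "subckt b\n", "ends b\n", "tail\n"]
for b in sub_inclusive(sample):
    print(b)
```

Like the version above, this stops at the first chunk that contains no 'subckt' line, so a stray 'ends' outside a block would end the scan early.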

Related

Python - program for searching for relevant cells in excel does not work correctly

I've written some code to search for relevant cells in an Excel file. However, it does not work as well as I had hoped.
In pseudocode, this is what it should do:
Ask for input excel file
Ask for input textfile containing keywords to search for
Convert input textfile to list containing keywords
For each keyword in list, scan the excelfile
If the keyword is found within a cell, write it into a new excelfile
Repeat with next word
The code works, but some keywords are not found while they are present within the input excelfile. I think it might have something to do with the way I iterate over the list, since when I provide a single keyword to search for, it works correctly. This is my whole code: https://pastebin.com/euZzN3T3
This is the part I suspect is not working correctly. Splitting the textfile into a list works fine (I think).
#IF TEXTFILE
elif btext == True:
    #Split each line of textfile into a list
    file = open(txtfile, 'r')
    #Keywords in list
    for line in file:
        keywordlist = file.read().splitlines()
    nkeywords = len(keywordlist)
    print(keywordlist)
    print(nkeywords)
    #Iterate over each string in list, look for match in .xlsx file
    for i in range(1, nkeywords):
        nfound = 0
        ws_matches.cell(row = 1, column = i).value = str.lower(keywordlist[i-1])
        for j in range(1, worksheet.max_row + 1):
            cursor = worksheet.cell(row = j, column = c)
            cellcontent = str.lower(cursor.value)
            if match(keywordlist[i-1], cellcontent) == True:
                ws_matches.cell(row = 2 + nfound, column = i).value = cellcontent
                nfound = nfound + 1
and my match() function:
def match(keyword, content):
    """Check if the keyword is present within the cell content, return True if found, else False"""
    if content.find(keyword) == -1:
        return False
    else:
        return True
I'm new to Python so my apologies if the way I code looks like a warzone. Can someone help me see what I'm doing wrong (or could be doing better?)? Thank you for taking the time!

Splitting the textfile into a list works fine (I think).
This is something you should actually test (hint: it does but is inelegant). The best way to make easily testable code is to isolate functional units into separate functions, i.e. you could make a function that takes the name of a text file and returns a list of keywords. Then you can easily check if that bit of code works on its own. A more pythonic way to read lines from a file (which is what you do, assuming one word per line) is as follows:
with open(filename) as f:
    # splitlines() drops the trailing newline that readlines() would keep on each keyword
    keywords = f.read().splitlines()
The rest of your code may actually work better than you expect. I'm not able to test it right now (and don't have your spreadsheet to try it on anyway), but if you're relying on nfound to give you an accurate count for all keywords, you've made a small but significant mistake: it's set to zero inside the loop, and thus you only get a count for the last keyword. Move nfound = 0 outside the loop.
In Python, the way to iterate over lists - or just about anything - is not to increment an integer and then use that integer to index the value in the list. Rather loop over the list (or other iterable) itself:
for keyword in keywordlist:
...
As a hint, you shouldn't need nkeywords at all.
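The two suggestions above (a separate, testable function for reading keywords, and looping over the list directly) could be combined roughly as follows. This is only a sketch: it works on a plain list of cell values instead of an openpyxl worksheet so that it can be run on its own, and load_keywords and find_matches are hypothetical names. Looping over the list directly also avoids index arithmetic such as range(1, nkeywords), which stops one short of the last keyword.

```python
def load_keywords(filename):
    """Return the non-empty lines of a text file as lowercase keywords."""
    with open(filename) as f:
        return [line.strip().lower() for line in f if line.strip()]

def find_matches(keywords, cells):
    """Map each keyword to the cell contents that contain it (case-insensitive)."""
    matches = {}
    for keyword in keywords:
        # str() is an assumption here, so that empty cells (None) do not raise
        matches[keyword] = [c for c in cells if keyword in str(c).lower()]
    return matches

# Made-up cell values standing in for one worksheet column
cells = ["Red apple", "Green pear", None, "apple pie"]
print(find_matches(["apple", "pear"], cells))
```

Each function can now be checked separately, for example by calling load_keywords on a small test file before involving the spreadsheet at all.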
I hope this gets you on the right track. When asking questions in future, it'd be a great help to provide more information about what goes wrong, and preferably enough to be able to reproduce the error.

Read N lines from a file

So for class we have to start out our problem by doing this:
Write a function that takes as its input a filename and an integer. The function should open the file and read in the first number of lines given as the second argument. (You'll need to have a variable to use as a counter for this part).
It's very basic and I figure a loop is needed but I can't figure out how to incorporate a loop into the question. What I've tried doesn't work and it's been about 3 hours and the best I can come up with is
def filewrite(textfile,line):
    infile=open(textfile,'r',encoding='utf-8')
    text=infile.readline(line)
    print(text)
However, that doesn't get me what I need from the function. It's still early in my intro to Python class, so basic code is all we have worked with.

There are two basic looping strategies you could use here:
you could count up to n, reading lines as you go
you could read lines from file, keeping track of how many you've read, and stop when you reach a certain number.
def filewrite(textfile, n):
    with open(textfile) as infile:
        for _ in range(n):
            print(infile.readline(), end='')
        print()
def filewrite(textfile, n):
    with open(textfile) as infile:
        counter = 0
        for line in infile:
            if counter >= n:
                break
            print(line, end='')
            counter += 1
The first is obviously more readable, and since readline will just return an empty string if it runs out of lines, it's safe to use even if the user asks for more lines than the infile has.
Here I'm also using a context manager to make sure the files are closed when I'm done with them.
Here's a version without the stuff you don't recognize:
def filewrite(textfile, n):
    infile = open(textfile)
    count = 0
    while count < n:
        print(infile.readline())
        count += 1
    infile.close()
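For reference, and probably beyond what an intro class expects, the standard library can also do the counting. A minimal sketch, assuming the lines should be returned as a list rather than printed; first_lines is a hypothetical name:

```python
from itertools import islice

def first_lines(textfile, n):
    """Return the first n lines of textfile (fewer if the file is shorter)."""
    with open(textfile, encoding='utf-8') as infile:
        # islice stops after n items, or earlier if the file runs out
        return list(islice(infile, n))
```

Returning the lines instead of printing them leaves it to the caller to decide what to do with them.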

Parsing with multiple loops opening files

I'm trying to count the number of lines contained by a file that looks like this:
-StartACheck
---Lines--
-EndACheck
-StartBCheck
---Lines--
-EndBCheck
with this:
count=0
z={}
for line in file:
    s=re.search(r'\-+Start([A-Za-z0-9]+)Check',line)
    if s:
        e=s.group(1)
        for line in file:
            z.setdefault(e,[]).append(count)
            q=re.search(r'\-+End',line)
            if q:
                count=0
                break
for a,b in z.items():
    print(a,len(b))
I basically want to store the number of lines present inside ACheck, BCheck, etc. in a dictionary, but I keep getting the wrong output.
Something like this
A,15
B,9
etc
I found out that even though the code should work, it doesn't because of the way the file is opened. I can't change the way it is opened and was looking for an implementation that only opens the file once but counts the same things and gives the exact same output without all the added functions of the newer python version.

This kind of problem can be resolved with a finite state machine. This is a complex matter that would need more explanation than what I could write here. You should look into it to further understand what you can do with it.
But first of all, I'm going to make a few assumptions:
The input file doesn't have any errors
If you have more than one section with the same name, you want their count to be combined
Even though you have tagged this question python 2.7, because you are using print(), I'll assume you are using python 3.x
Here's my suggestion:
import re
input_filename = "/home/evens/Temporaire/StackOverflow/Input_file-39339007.txt"
matchers = {
    'start_section' : re.compile(r'\-+Start([A-Za-z0-9]+)Check'),
    'end_section' : re.compile(r'\-+End'),
}
inside_section = False # Am I inside a section ?
section_name = None # Which section am I in ?
tally = {} # Sums of each section
with open(input_filename) as file_read:
    for line in file_read:
        line_matches = {k: v.match(line) for (k, v) in matchers.items()}
        # By default the state doesn't change (avoids a NameError if the first line isn't a section start)
        future_inside_section = inside_section
        if inside_section:
            if line_matches['end_section']:
                future_inside_section = False
            else:
                future_inside_section = True
                if section_name in tally:
                    tally[section_name] += 1
                else:
                    tally[section_name] = 1
        else:
            if line_matches['start_section']:
                future_inside_section = True
                section_name = line_matches['start_section'].group(1)
        # Just before we go in the future
        inside_section = future_inside_section
for (a,b) in tally.items():
    print('Total of all "{}" sections: {}'.format(a, b))
What this code does is determine:
How it should change its state (Am I going to be inside or outside a section on the next line?)
What else should be done:
Change the name of the section I'm in?
Count this line in the present section?
But even this code has its problems:
It doesn't check to see if a section start has a matching section end (-StartACheck could be ended by -EndATotallyInvalidCheck)
It doesn't handle the case where two consecutive section starts (or ends) are detected (Error? Nested sections?)
It doesn't handle the case where there are lines outside a section
And probably other corner cases.
How you want to handle these cases is up to you.
This code could probably be further simplified, but I don't want to make it too complex for now.
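As one possible simplification, the two state variables can be folded into a single one: the current section name, or None when outside a section. This is only a sketch under the same assumptions and with the same limitations listed above; count_section_lines is a hypothetical name, and collections.Counter stands in for the hand-written tally.

```python
import re
from collections import Counter

START = re.compile(r'-+Start([A-Za-z0-9]+)Check')
END = re.compile(r'-+End')

def count_section_lines(lines):
    """Count the lines found inside each named section."""
    tally = Counter()
    section = None  # name of the current section, or None when outside one
    for line in lines:
        if section is None:
            m = START.match(line)
            if m:
                section = m.group(1)
        elif END.match(line):
            section = None
        else:
            tally[section] += 1
    return tally

# Made-up sample input, standing in for an open file
sample = ["-StartACheck", "x", "y", "-EndACheck",
          "-StartBCheck", "z", "-EndBCheck",
          "-StartACheck", "w", "-EndACheck"]
for name, total in count_section_lines(sample).items():
    print('Total of all "{}" sections: {}'.format(name, total))
```

Since the function takes any iterable of lines, the file only has to be opened once and can be passed in directly.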
Hope this helps. Don't hesitate to ask if you need further explanations.

Write Python code in a single line

Can I write the following code in single line in python?
t=int(input())
while t:
    t-=1
    n=int(input())
    a=i=0
    while not(n&1<<i):
        i+=1
    while n&1<<i:
        n^=1<<i
        a=a*2+1
        i+=1
    print(n^1<<i)+a/2
If not, how can I write this piece of code in the minimum possible number of lines? (PS: I could reduce this to 6 lines; can it be any better?)
My solution:
t=int(input())
while t:
    t-=1;n=int(input());a=i=0
    while not(n&1<<i):i+=1
    while n&1<<i:n^=1<<i;a=a*2+1;i+=1
    print(n^1<<i)+a/2
Thanks

Since Python's list comprehensions are Turing-complete and require no line breaks, any program can be written as a Python one-liner.
If you enforce arbitrary restrictions (like "order of the statements" - what does that even mean? Execution order? First appearance in the source code?), then the answer is: you can eliminate some line breaks, but not all.
instead of
if x:
    do_stuff()
you can do:
if x: do_stuff()
instead of
x = 23
y = 42
you can do:
x,y = 23, 42
and instead of
do_stuff()
do_more_stuff()
you can do
do_stuff(); do_more_stuff()
And if you really, really have to, you can exec a multi-line python program in one line, so your program becomes something like:
exec('''t=int(input())\nwhile t:\n t-=1;n=int(input());a=i=0\n while not(n&1<<i):i+=1\n while n&1<<i:n^=1<<i;a=a*2+1;i+=1\n print(n^1<<i)+a/2\n''')
But if you do this in "real" code, i.e. not just for fun, kittens die.

It's not recommended to collapse lines in Python very often, because you lose Python's famous simplicity and clarity that way. And you often cannot collapse lines, because indentation levels are used to define block structure / nesting.
But if you really want a condensed version:
print "s0"
while True:
    print "s1"; print "s2"
    while True: print "s3"
    while True: print "s4"; print "s5"; print "s6"
    print "s7"
(Where your expressions have been replaced with True for simplicity)

Finding the length of longest string

I have just started to learn how to use python. A part of my exercise is to find the length of longest string in texts, defined as 'box' in the following case:
def file(box):
    maxlen=0
    f=box.splitlines()
    for i in f:
        if len(i)>=maxlen:
            maxlen=len(i)
        return maxlen
print file("""abcd efgh ijkl
on different lines
I""")
In this case, I get the number 14 instead of 18, which is the correct answer. Can somebody please help me solve this problem?

You've indented your return statement too much:
for i in f:
    if len(i)>=maxlen:
        maxlen=len(i)
    return maxlen
At the moment, you're telling it to return at the end of the first iteration of the loop, which means only the length of the first line is returned. Move the return statement outside the loop:
for i in f:
    if len(i)>=maxlen:
        maxlen=len(i)
return maxlen
...and it should work.
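Once the loop version works, the same result can also be obtained with the built-in max. A minimal sketch, assuming Python 3 (the default argument of max needs 3.4 or later); longest_line_length is a hypothetical name:

```python
def longest_line_length(text):
    """Return the length of the longest line in text (0 for empty text)."""
    return max((len(line) for line in text.splitlines()), default=0)

print(longest_line_length("""abcd efgh ijkl
on different lines
I"""))  # prints 18
```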
