I am using pyparsing to parse a language called Pig. I get an unexpected result from the 'lineno' function when the input text contains tab characters ('\t').
To make the question easier to ask, I have reduced the code to a minimal example that shows the problem:
#!/usr/bin/env python
from pyparsing import *
ident = Word(alphas)
statement1 = ident + Literal('=')+ Keyword('GENERATE', caseless = True) + SkipTo(Literal(';'),ignore = nestedExpr())+ Literal(';').suppress()
statement2 = Keyword('STORE',caseless = True) + ident + Literal(';').suppress()
statement = statement1|statement2
text = """
fact = GENERATE
('Exp' :(a
)
) ;
STORE fact ;
"""
all_statements = statement.scanString(text)
for tokens,startloc,endloc in all_statements:
    print 'startloc:' + str(startloc) , 'lineno:' + str(lineno(startloc,text))
    print 'endloc:' + str(endloc), 'lineno:' + str(lineno(endloc,text))
    print tokens
Notice that in the input text, the third line begins with more than three '\t' characters.
When I run this, the output is:
startloc:1 lineno:2
endloc:66 lineno:10
['fact', '=', 'GENERATE', "('Exp' :(a\n )\n) "]
startloc:68 lineno:10
endloc:80 lineno:10
['STORE', 'fact']
This must be wrong: the text has only 9 lines in total, yet the output says the first statement runs from line 2 to line 10.
I happened to find that when I delete those '\t' characters, so that the input text is:
text = """
fact = GENERATE
('Exp' :(a
)
) ;
STORE fact ;
"""
and run it again, the result is:
startloc:1 lineno:2
endloc:34 lineno:5
['fact', '=', 'GENERATE', "('Exp' :(a\n)\n) "]
startloc:36 lineno:7
endloc:48 lineno:7
['STORE', 'fact']
This result looks correct: the first statement runs from line 2 to line 5, and the second statement is on line 7 only. This is what I expected.
So I think there may be something wrong in the lineno() function, or maybe in scanString.
Or is there something wrong in my code?
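Call parseWithTabs() on the parser (here, statement) before calling scanString. By default pyparsing expands tabs to spaces before parsing, so the reported locations refer to the expanded string rather than to the original text that you pass to lineno.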
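A minimal sketch of why the line numbers drift, using only the standard library. The helper line_number is a hypothetical stand-in that counts the newlines before an offset (the same idea as lineno), and the sample text is made up for illustration. It shows that an offset measured in a tab-expanded copy of the text gives a wrong line number when it is looked up in the original, shorter text.

```python
def line_number(loc, text):
    """1-based line number of character offset loc in text (hypothetical helper)."""
    return text.count("\n", 0, loc) + 1


# Made-up input: the second line starts with three tabs.
original = "a = GENERATE\n\t\t\t(x\n)\n;\nSTORE a;\n"
expanded = original.expandtabs()  # each tab is padded out to the next multiple of 8

# Offset of the first ';', measured in the expanded text.
loc = expanded.index(";")

# Offset and text agree: the ';' is reported on its real line.
print(line_number(loc, expanded))

# Offset from the expanded text, looked up in the original text:
# the offset overshoots, so too many newlines are counted.
print(line_number(loc, original))
```

Once the parser is told to keep tabs, the offsets it reports and the text passed to lineno refer to the same string again.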
I am trying to use triple-quoted strings in Python 3(.7) to build some formatted strings.
I have a list of inner strings, which all need to be tabbed in:
This is some text
across multiple
lines.
And a string which should contain the inner string
data{
// string goes here
}
I cannot tab the inner string when I create it. So, my thought was to use dedent with Python 3 triple-quoted f-strings:
import textwrap
inner_str = textwrap.dedent(
'''\
This is some text
across multiple
lines.'''
)
full_str = textwrap.dedent(
f'''\
data{{
// This should all be tabbed
{inner_str}
}}'''
)
print(full_str)
However, the indentation is not maintained:
data{
// This should all be tabbed
This is some text
across multiple
lines.
}
The desired result:
data{
    // This should all be tabbed
    This is some text
    across multiple
    lines.
}
How can I preserve the indentation of the f-string without pre-tabbing the inner string?
This seems to provide what you want.
import textwrap
inner_str = textwrap.dedent(
'''\
This is some text
across multiple
lines.'''
)
full_str = textwrap.dedent(
f'''
data{{
{textwrap.indent(inner_str, " ")}
}}'''
)
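For what it's worth, here is a small sketch of why the direct interpolation loses its indentation and why indenting the inner string first helps. textwrap.dedent only strips whitespace that is common to every line, and the interpolated continuation lines start at column 0, so the common margin is empty. The four-space and eight-space widths are assumptions chosen for the example.

```python
import textwrap

inner_str = "This is some text\nacross multiple\nlines."

# Direct interpolation: only the first inner line sits behind the template's
# indent, so the common margin is empty and dedent changes nothing.
broken = textwrap.dedent(f"""\
    data{{
        {inner_str}
    }}""")

# Indent every inner line to the placeholder's level first; the placeholder
# itself then starts at column 0 so the first line is not indented twice.
fixed = textwrap.dedent(f"""\
    data{{
{textwrap.indent(inner_str, " " * 8)}
    }}""")

print(broken)
print(fixed)
```

The cost of this approach is that the placeholder line has to sit at column 0 in the source, which is the readability trade-off discussed in this thread.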
A better solution:
idt = str.maketrans({'\n': "\n "})
print(textwrap.dedent(
f'''
data{{
{inner_str.translate(idt)}
}}'''
))
Another solution with customized tab width:
def indent_inner(inner_str, indent):
    return inner_str.replace('\n', '\n' + indent) # os.linesep could be used if the function is needed across different OSs
print(textwrap.dedent(
f'''
data{{
{indent_inner(inner_str, " ")}
}}'''
))
None of the answers here seemed to do what I want, so I'm answering my own question with a solution that gets as close as I can to what I was looking for. It does not require pre-tabbing the data to define its indentation level, nor does it require leaving the placeholder line unindented (which breaks readability). Instead, the indentation level of the current line in the f-string is passed at usage time.
Not perfect but it works.
import textwrap
def code_indent(text, tab_sz, tab_chr=' '):
    def indented_lines():
        for i, line in enumerate(text.splitlines(True)):
            yield (
                tab_chr * tab_sz + line if line.strip() else line
            ) if i else line
    return ''.join(indented_lines())
inner_str = textwrap.dedent(
'''\
This is some text
across multiple
lines.'''
)
full_str = textwrap.dedent(
f'''
data{{
{code_indent(inner_str, 8)}
}}'''
)
print(full_str)
Edited to avoid tabbing inner_str.
import textwrap
line_tab = '\n\t'
inner_str = f'''\
This is some text
across multiple
lines.
'''
full_str = textwrap.dedent(f'''\
data{{
// This should all be tabbed
{line_tab.join(inner_str.splitlines())}
}}''')
print(full_str)
Output:
data{
// This should all be tabbed
This is some text
across multiple
lines.
}
I'm faced with a really ugly code generator that takes a configuration file and outputs C code.
It works, but the script is full of things like:
outstr = "if(" + mytype + " == " + otherType + "){\n"
outstr += " call_" + fun_for_type(mytype) + "();\n"
outstr += "}\n"
# Now imagine 1000 times more lines like the previous ones...
Is there a tool to automatically change code like that to something more palatable (partial changes are more than welcome)? Like:
outstr = """if ({type} == {otherType}) {{
call_{fun_for_type}({type});
}}
""".format(type=mytype, otherType=otherType, fun_for_type=fun_for_type(mytype))
If this were C, I would have abused Coccinelle, but I don't know of similar tools for Python.
Thanks
You can use a dictionary. Note that literal braces in the template must be doubled:
datas = {"type": mytype, "otherType": otherType, "fun_for_type": fun_for_type(mytype)}
outstr = "if ({type} == {otherType}) {{\n\
call_{fun_for_type}({type});\n\
}}\n".format(**datas)
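A self-contained sketch of the same idea, in case it helps: keep the C snippet as one template and fill it from a dict. The body of fun_for_type here is a hypothetical stand-in for the generator's real helper, and the names TEMPLATE and render are made up for the example.

```python
def fun_for_type(type_name):
    # Hypothetical stand-in: the real generator has its own mapping.
    return type_name.lower()


# Literal C braces are doubled so str.format leaves them alone.
TEMPLATE = (
    "if ({type} == {otherType}) {{\n"
    "    call_{fun}();\n"
    "}}\n"
)


def render(mytype, other_type):
    """Fill the C template from a dict of named values."""
    values = {"type": mytype, "otherType": other_type, "fun": fun_for_type(mytype)}
    return TEMPLATE.format(**values)


print(render("INT", "LONG"))
```

This does not convert the existing concatenations automatically, but each converted snippet becomes readable as C again.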
First off, I'm fairly new to Python, so sorry if this question is obvious. The traceback points at the detectEnglish module, but the module works perfectly fine when I run it on its own: there are no errors, and I've rewritten it a couple of times to triple-check it.
Traceback (most recent call last):
File "H:\Python\Python Cipher Program\transposition hacker.py", line 49, in <module>
main()
File "H:\Python\Python Cipher Program\transposition hacker.py", line 11, in main
hackedMessage = hackTransposition(myMessage)
File "H:\Python\Python Cipher Program\transposition hacker.py", line 34, in hackTransposition
if detectEnglish.isEnglish(decryptedText):
File "H:\Python\Python Cipher Program\detectEnglish.py", line 48, in isEnglish
wordsMatch = getEnglishCount(message) * 100 >= wordPercentage
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
This is the error I get when trying to run the Transposition Hacker (copied directly from here).
Here is the code for the detectEnglish module:
# Detect english Module
# to use this code
# import detectEnglish
# detectEnglish.isEnglish(somestring)
# returns True or False
# there must be a dictionary.txt file in the same directory
# all english words
# one per line
UPPERLETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'
def loadDictionary():
    dictionaryFile = open('Dictionary.txt')
    englishWords = {}
    for word in dictionaryFile.read().split('\n'):
        englishWords[word] = None
    dictionaryFile.close()
    return englishWords
ENGLISH_WORDS = loadDictionary()
def getEnglishCount(message):
message = message.upper()
message = removeNonLetters(message)
possibleWords = message.split()
if possibleWords == []:
return 0.0
matches = 0
for word in possibleWords:
if word in ENGLISH_WORDS:
matches += 1
return float(matches) / len(possibleWords)
def removeNonLetters(message):
    lettersOnly = []
    for symbol in message:
        if symbol in LETTERS_AND_SPACE:
            lettersOnly.append(symbol)
    return ''.join(lettersOnly)
def isEnglish(message, wordPercentage=20, letterPercentage=85):
    # by default 20% of the words must exist in the dictionary file
    # 85% of characters in the message must be spaces or letters
    wordsMatch = getEnglishCount(message) * 100 >= wordPercentage
    numLetters = len(removeNonLetters(message))
    messageLettersPercentage = float(numLetters) / len(message) * 100
    lettersMatch = messageLettersPercentage >= letterPercentage
    return wordsMatch and lettersMatch
getEnglishCount looks like it is missing a return statement. If Python reaches the end of a function without hitting a return statement, it returns None, which is what you're seeing.
Try this:
def getEnglishCount(message):
    message = message.upper()
    message = removeNonLetters(message)
    possibleWords = message.split()
    # if possibleWords == []: # redundant
    #     return 0.0
    return len(possibleWords)
Edit: @Kevin Yeah, I think you're right: there was more in that function. Maybe try this:
def getEnglishCount(message):
    message = message.upper()
    message = removeNonLetters(message)
    possibleWords = message.split()
    if possibleWords == []:
        return 0.0
    matches = 0.
    for word in possibleWords:
        if word in ENGLISH_WORDS:
            matches += 1
    return matches / len(possibleWords)
I'd guess the indentation somehow got changed when you copy and pasted the code, with the return statement nested under the if.
As the other poster has said, getEnglishCount does not reach a return statement on some path, so it returns None (whose type is NoneType), meaning that no value comes back.
You can't do math on None, so None * 100 fails, which is what the bottom of your error traceback says.
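To make the failure concrete, here is a sketch of one possible mis-indentation that would produce exactly this traceback. It is a guess at what the pasted code might have looked like, not the asker's actual file; the function names and the two-word dictionary are made up for the example.

```python
ENGLISH_WORDS = {"HELLO": None, "WORLD": None}  # tiny made-up dictionary


def get_english_count_buggy(message):
    possible_words = message.upper().split()
    if possible_words == []:
        return 0.0
    matches = 0
    for word in possible_words:
        if word in ENGLISH_WORDS:
            matches += 1
            # Mis-indented: this return is only reached when a word matches.
            return float(matches) / len(possible_words)
    # No match: the function falls off the end and returns None implicitly.


def get_english_count_fixed(message):
    possible_words = message.upper().split()
    if possible_words == []:
        return 0.0
    matches = 0
    for word in possible_words:
        if word in ENGLISH_WORDS:
            matches += 1
    # Correct: return once, after the loop has counted every word.
    return float(matches) / len(possible_words)


try:
    get_english_count_buggy("xqz zzz") * 100
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

A mis-indentation like this explains why the module can seem fine in isolation: any test message containing a dictionary word as its first match returns a number, and only a message with no matches returns None.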
I was making a site component scanner with Python. Unfortunately, something went wrong when I added another value to my script. This is my script:
#!/usr/bin/python
import sys
import urllib2
import re
import time
import httplib
import random
# Color Console
W = '\033[0m' # white (default)
R = '\033[31m' # red
G = '\033[1;32m' # green bold
O = '\033[33m' # orange
B = '\033[34m' # blue
P = '\033[35m' # purple
C = '\033[36m' # cyan
GR = '\033[37m' # gray
#Bad HTTP Responses
BAD_RESP = [400,401,404]
def main(path):
    print "[+] Testing:",host.split("/",1)[1]+path
    try:
        h = httplib.HTTP(host.split("/",1)[0])
        h.putrequest("HEAD", "/"+host.split("/",1)[1]+path)
        h.putheader("Host", host.split("/",1)[0])
        h.endheaders()
        resp, reason, headers = h.getreply()
        return resp, reason, headers.get("Server")
    except(), msg:
        print "Error Occurred:",msg
        pass
def timer():
    now = time.localtime(time.time())
    return time.asctime(now)
def slowprint(s):
    for c in s + '\n':
        sys.stdout.write(c)
        sys.stdout.flush() # defeat buffering
        time.sleep(8./90)
print G+"\n\t Whats My Site Component Scanner"
coms = { "index.php?option=com_artforms" : "com_artforms" + "link1","index.php?option=com_fabrik" : "com_fabrik" + "ink"}
if len(sys.argv) != 2:
    print "\nUsage: python jx.py <site>"
    print "Example: python jx.py www.site.com/\n"
    sys.exit(1)
host = sys.argv[1].replace("http://","").rsplit("/",1)[0]
if host[-1] != "/":
    host = host+"/"
print "\n[+] Site:",host
print "[+] Loaded:",len(coms)
print "\n[+] Scanning Components\n"
for com,nme,expl in coms.items():
    resp,reason,server = main(com)
    if resp not in BAD_RESP:
        print ""
        print G+"\t[+] Result:",resp, reason
        print G+"\t[+] Com:",nme
        print G+"\t[+] Link:",expl
        print W
    else:
        print ""
        print R+"\t[-] Result:",resp, reason
        print W
print "\n[-] Done\n"
And this is the error message that comes up:
Traceback (most recent call last):
File "jscan.py", line 69, in <module>
for com,nme,expl in xpls.items():
ValueError: need more than 2 values to unpack
I already tried changing the 2 value into 3 or 1, but it doesn't seem to work.
xpls.items() yields tuples of two items (key, value), and you're trying to unpack each one into three names. You initialize the dict yourself with two key: value pairs:
coms = { "index.php?option=com_artforms" : "com_artforms" + "link1","index.php?option=com_fabrik" : "com_fabrik" + "ink"}
Besides, the traceback seems to be from another script: the dict is called xpls there, and coms in the code you posted.
You can try
for (xpl, poc) in xpls.items():
    ...
    ...
because dict.items() gives you tuples with 2 values.
You have all the information you need. As with any bug, the best place to start is the traceback. Let's look at it:
for com,nme,expl in xpls.items():
ValueError: need more than 2 values to unpack
Python raises ValueError when a given object is of the correct type but has an incorrect value. In this case, it tells us that each element produced by xpls.items() is an iterable and thus can be unpacked, but the attempt failed.
The description of the exception narrows down the problem: each element has 2 values, but more were required. By looking at the quoted line, we can see that "more" is 3.
In short: each item from xpls.items() was supposed to have 3 values, but has 2.
Note that I never read the rest of the code. Debugging this was possible using only those 2 lines.
Learning to read tracebacks is vital. When you encounter an error such as this one again, devote at least 10 minutes to working with this information. You'll be repaid tenfold for your effort.
As already mentioned, dict.items() yields tuples with two values (key and value). If you store a list of strings as each dictionary value, instead of a single string that would have to be split afterwards anyway, you can use this syntax:
coms = { "index.php?option=com_artforms" : ["com_artforms", "link1"],
         "index.php?option=com_fabrik" : ["com_fabrik", "ink"]}
for com, (name, expl) in coms.items():
    print com, name, expl
>>> index.php?option=com_artforms com_artforms link1
>>> index.php?option=com_fabrik com_fabrik ink
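The same nested unpacking as a Python 3 sketch, next to the flat three-name unpack that fails. The dict contents are taken from the question; the variable names rows and error are made up for the example, and the wording of the exception message differs between Python versions.

```python
coms = {
    "index.php?option=com_artforms": ("com_artforms", "link1"),
    "index.php?option=com_fabrik": ("com_fabrik", "ink"),
}

# Flat unpacking fails: each item is a (key, value) pair, i.e. two values.
try:
    for path, name, link in coms.items():
        pass
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# Nested unpacking matches the real shape: key, then a two-element value.
rows = [(path, name, link) for path, (name, link) in coms.items()]
for row in rows:
    print(*row)
```

The parentheses in the loop target mirror the structure of the data, which is why the second loop succeeds.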
Here is a scraper I created using Python on ScraperWiki:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
    # DEBUG BEGIN
    if not type(data["arwu_rank"]) is str:
        print type(data["arwu_rank"])
        print data["arwu_rank"]
        print data["university"]
    # DEBUG END
    if "-" in data["arwu_rank"]:
        arwu_rank_bounds = data["arwu_rank"].split("-")
        data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
    if not type(data["arwu_rank"]) is int:
        data["arwu_rank"] = int(data["arwu_rank"])
    scraperwiki.sqlite.save(unique_keys=['university'], data=data)
It works perfectly except when scraping the final data row of the table (the "York University" line), at which point instead of lines 9 through 11 of the code causing the string "401-500" to be retrieved from the table and assigned to data["arwu_rank"], those lines somehow seem instead to be causing the int 450 to be assigned to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but also that that debugging code doesn't go very deep.
I have two questions:
What are my options for debugging scrapers run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? E.g. is there a way to step through?
Can you tell me why the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?
EDIT 6 May 2013, 20:07h UTC
The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
        # DEBUG BEGIN
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # DEBUG END
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)
Unfortunately, there's no easy way to debug your scripts on ScraperWiki: it just sends your code in its entirety and gets the results back, so there's no way to execute the code interactively.
I added a couple more prints to a copy of your code, and it looks like the if check before the bit that assigns data
if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
doesn't trigger for "York University", so data keeps the int value (which you set later in the loop body) from the previous time around the loop.
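A stripped-down sketch of that effect, with no scraping involved. The rows, the function names, and the use of None to stand for "the guard did not match" are all made up for illustration. When the processing sits outside the guard, a row that fails the guard silently reuses the dict left over from the previous iteration.

```python
def ranks_stale(rows):
    out = []
    for rank, name in rows:
        if rank is not None:  # guard, like the td.ranking / td.rankingname check
            data = {"rank": rank, "name": name}
        # Outside the guard: runs even when the guard failed, on the old data.
        # str() is used here only so the sketch does not crash on the int.
        if "-" in str(data["rank"]):
            low, high = data["rank"].split("-")
            data["rank"] = int((float(low) + float(high)) * 0.5)
        out.append(dict(data))
    return out


def ranks_guarded(rows):
    out = []
    for rank, name in rows:
        if rank is not None:
            data = {"rank": rank, "name": name}
            # Inside the guard: only rows that produced fresh data are processed.
            if "-" in data["rank"]:
                low, high = data["rank"].split("-")
                data["rank"] = int((float(low) + float(high)) * 0.5)
            out.append(data)
    return out


rows = [("401-500", "A"), (None, "B")]
print(ranks_stale(rows))
print(ranks_guarded(rows))
```

In the stale version the second entry is a copy of the first row with its rank already converted to an int, which matches the int 450 showing up where a string was expected.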