Find all strings in python code files

I would like to list all strings within my large python project.
Imagine the different possibilities to create a string in python:
mystring = "hello world"
mystring = ("hello "
"world")
mystring = "hello " \
"world"
I need a tool that outputs "filename, linenumber, string" for each string in my project. Strings that are spread over multiple lines using "\" or "('')" should be shown in a single line.
Any ideas how this could be done?

unwind's suggestion of using the ast module in 2.6 is a good one. (There's also the undocumented _ast module in 2.5.) Here's example code for that
code = """a = 'blah'
b = '''multi
line
string'''
c = u"spam"
"""
import ast
root = ast.parse(code)
class ShowStrings(ast.NodeVisitor):
    def visit_Str(self, node):
        print "string at", node.lineno, node.col_offset, repr(node.s)
show_strings = ShowStrings()
show_strings.visit(root)
The problem is multiline strings. If you run the above, you'll get:
string at 1 4 'blah'
string at 4 -1 'multi\nline\nstring'
string at 5 4 u'spam'
You see that it doesn't report the start of the multiline string, only the end. There's no good solution for that using the builtin Python tools.
Another option is that you can use my 'python4ply' module. This is a grammar definition for Python for PLY, which is a parser generator. Here's how you might use it:
import compiler
import compiler.visitor
# from python4ply; requires the ply parser generator
import python_yacc
code = """a = 'blah'
b = '''multi
line
string'''
c = u"spam"
d = 1
"""
tree = python_yacc.parse(code, "<string>")
#print tree
class ShowStrings(compiler.visitor.ASTVisitor):
    def visitConst(self, node):
        if isinstance(node.value, basestring):
            print "string at", node.lineno, repr(node.value)
visitor = ShowStrings()
compiler.walk(tree, visitor)
The output from this is
string at 1 'blah'
string at 2 'multi\nline\nstring'
string at 5 u'spam'
There's no support for column information. (There is some mostly complete commented out code to support that, but it's not fully tested.) Then again, I see you don't need it. It also means working with Python's 'compiler' module, which is clumsier than the AST module.
Still, with 30-40 lines of code you should have exactly what you want.
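On Python 3.8 and later (an assumption about the reader's environment, not something the answer above covers), the standard ast module represents string literals as ast.Constant nodes and reports the line on which a string starts. A minimal sketch along those lines follows; find_strings and sample.py are illustrative names, and docstrings and the literal parts of f-strings will also show up in the result.

```python
import ast


def find_strings(source, filename="<string>"):
    """Return (filename, lineno, value) for each string literal in source."""
    tree = ast.parse(source, filename)
    nodes = [
        node
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]
    # ast.walk does not guarantee source order, so sort by position.
    nodes.sort(key=lambda node: (node.lineno, node.col_offset))
    return [(filename, node.lineno, node.value) for node in nodes]


# The three spellings from the question; adjacent literals are merged
# by the parser, so each assignment yields a single string.
sample = '''a = "hello world"
b = ("hello "
     "world")
c = "hello " \\
    "world"
'''

for name, lineno, value in find_strings(sample, "sample.py"):
    print(name, lineno, repr(value))
```

To cover a whole project, the same function could be called on the contents of each .py file found with os.walk or pathlib.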

Python's included tokenize module will also do the trick.
from __future__ import with_statement
import sys
import tokenize
for filename in sys.argv[1:]:
    with open(filename) as f:
        for toktype, tokstr, (lineno, _), _, _ in tokenize.generate_tokens(f.readline):
            if toktype == tokenize.STRING:
                strrepr = repr(eval(tokstr))
                print filename, lineno, strrepr
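The tokenizer reports each quoted piece as its own STRING token, so a string split with a backslash or parentheses arrives in several parts. A rough Python 3 sketch of one way to merge them is shown below; string_literals is a hypothetical helper name, ast.literal_eval is swapped in for eval, and f-strings are not handled (on some Python versions literal_eval will reject them).

```python
import ast
import io
import tokenize


def string_literals(source):
    """Yield (lineno, value), merging adjacent string tokens into one entry."""
    pending = []       # source text of adjacent STRING tokens
    start_line = None  # line of the first token in the current group
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            if not pending:
                start_line = tok.start[0]
            pending.append(tok.string)
        elif tok.type in (tokenize.NL, tokenize.COMMENT):
            # These may sit between the parts of a parenthesised string.
            continue
        elif pending:
            # Any other token ends the group; evaluate the joined literals.
            yield start_line, ast.literal_eval(" ".join(pending))
            pending = []


sample = '''a = "hello world"
b = ("hello "
     "world")
c = "hello " \\
    "world"
'''

for lineno, value in string_literals(sample):
    print(lineno, repr(value))
```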

If you can do this in Python, I'd suggest starting by looking at the ast (Abstract Syntax Tree) module, and going from there.

Are you asking about the I18N utilities in Python?
http://docs.python.org/library/gettext.html#internationalizing-your-programs-and-modules
There's a utility called po-utils (formerly xpot) that can help with this.
http://po-utils.progiciels-bpi.ca/README.html

You may also consider parsing your code with pygments. I don't know the other solutions, but it is certainly very simple to use.

Gettext might help you. Put your strings in _(...) structures:
a = _('Test')
b = a
c = _('Another text')
Then run in the shell prompt:
pygettext test.py
You'll get a messages.pot file with the information you need:
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2009-02-25 08:48+BRT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: ENCODING\n"
"Generated-By: pygettext.py 1.5\n"
#: teste.py:1
msgid "Test"
msgstr ""
#: teste.py:3
msgid "Another text"
msgstr ""

Related

How to ignore some characters when parsing a string?

I would like to parse a list of log strings in Python 3 using the parse module.
A string can look like:
[00:00:01.913,100] [0m<inf> main: starting v2.9.0(b2393) on Zephyr v2.6.99[0m
or
[00:00:02.789,700] [1;33m<wrn> BC127: codec config differs. set default one[0m
So far I defined my parser
myParser = parse.compile("[{}] <{}> {}: {}")
Of course my parser fails. I don't know how to tell the parse module that it should discard for example everything between ] and <. Like "]*<".
my pseudo code is
myParser = parse.compile("[{}]*<{}> {}: {}*")

Use a regular expression:
>>> import re
>>> s = '<# """#$ FSDF > important information < not important >'
>>> re.sub('<[^>]+>', '', s).strip()
'important information'
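Applied to the log lines from the question, the same idea might look like the sketch below. It assumes the "[0m" and "[1;33m" fragments are ANSI colour codes (possibly with the escape character lost) that can be dropped before matching; the group names and parse_line are made up for illustration.

```python
import re

# Colour codes such as "[0m" or "[1;33m", with or without the ESC character.
ANSI = re.compile(r"\x1b?\[[0-9;]*m")
# "[timestamp] <level> module: message"
LINE = re.compile(
    r"\[(?P<time>[^\]]+)\]\s*<(?P<level>\w+)>\s*(?P<module>\w+):\s*(?P<message>.*)"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    clean = ANSI.sub("", line)
    match = LINE.match(clean)
    return match.groupdict() if match else None


print(parse_line("[00:00:02.789,700] [1;33m<wrn> BC127: codec config differs. set default one[0m"))
```

Note that the colour-code pattern would also strip a look-alike sequence inside the message text, which may or may not matter for your logs.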

Force Sphinx to interpret Markdown in Python docstrings instead of reStructuredText

I'm using Sphinx to document a python project. I would like to use Markdown in my docstrings to format them. Even if I use the recommonmark extension, it only covers the .md files written manually, not the docstrings.
I use autodoc, napoleon and recommonmark in my extensions.
How can I make sphinx parse markdown in my docstrings?

Sphinx's Autodoc extension emits an event named autodoc-process-docstring every time it processes a doc-string. We can hook into that mechanism to convert the syntax from Markdown to reStructuredText.
Unfortunately, Recommonmark does not expose a Markdown-to-reST converter. It maps the parsed Markdown directly to a Docutils object, i.e., the same representation that Sphinx itself creates internally from reStructuredText.
Instead, I use Commonmark for the conversion in my projects because it's fast, much faster than Pandoc, for example. Speed is important as the conversion happens on the fly and handles each doc-string individually. Other than that, any Markdown-to-reST converter would do. M2R2 would be a third example. The downside of any of these is that they do not support Recommonmark's syntax extensions, such as cross-references to other parts of the documentation. They handle just the basic Markdown.
To plug in the Commonmark doc-string converter, make sure that package is installed (pip install commonmark) and add the following to Sphinx's configuration file conf.py:
import commonmark
def docstring(app, what, name, obj, options, lines):
    md = '\n'.join(lines)
    ast = commonmark.Parser().parse(md)
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += rst.splitlines()
def setup(app):
    app.connect('autodoc-process-docstring', docstring)
Meanwhile, Recommonmark was deprecated in May 2021. The Sphinx extension MyST, a more feature-rich Markdown parser, is the replacement recommended by Sphinx and by Read-the-Docs. MyST does not yet support Markdown in doc-strings either, but the same hook as above can be used to get limited support via Commonmark.
A possible alternative to the approach outlined here is using MkDocs with the MkDocStrings plug-in, which would eliminate Sphinx and reStructuredText entirely from the process.

Building on @john-hennig's answer, the following will keep reStructuredText roles such as :py:attr:, :py:class:, etc. This allows you to reference other classes and so on.
import re
import commonmark
py_attr_re = re.compile(r"\:py\:\w+\:(``[^:`]+``)")
def docstring(app, what, name, obj, options, lines):
    md = '\n'.join(lines)
    ast = commonmark.Parser().parse(md)
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += rst.splitlines()
    for i, line in enumerate(lines):
        while True:
            match = py_attr_re.search(line)
            if match is None:
                break
            start, end = match.span(1)
            line_start = line[:start]
            line_end = line[end:]
            line_modify = line[start:end]
            line = line_start + line_modify[1:-1] + line_end
        lines[i] = line
def setup(app):
    app.connect('autodoc-process-docstring', docstring)
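To see what the loop does in isolation, the regular expression can be tried on a single rendered line. In this small sketch, unwrap_roles is a hypothetical helper name and the input line is made up; it only shows how the doubled backticks after a :py:...: role are reduced to single ones.

```python
import re

py_attr_re = re.compile(r"\:py\:\w+\:(``[^:`]+``)")


def unwrap_roles(line):
    """Turn :py:role:``target`` into :py:role:`target` everywhere in line."""
    while True:
        match = py_attr_re.search(line)
        if match is None:
            return line
        start, end = match.span(1)
        # Drop one backtick from each side of the matched ``target``.
        line = line[:start] + line[start:end][1:-1] + line[end:]


print(unwrap_roles("See :py:class:``Foo`` and :py:attr:``Foo.bar``."))
```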

I had to extend the accepted answer by john-hen to allow multi-line descriptions of Args: entries to be considered a single parameter:
def docstring(app, what, name, obj, options, lines):
    wrapped = []
    literal = False
    for line in lines:
        if line.strip().startswith(r'```'):
            literal = not literal
        if not literal:
            line = ' '.join(x.rstrip() for x in line.split('\n'))
        indent = len(line) - len(line.lstrip())
        if indent and not literal:
            wrapped.append(' ' + line.lstrip())
        else:
            wrapped.append('\n' + line.strip())
    ast = commonmark.Parser().parse(''.join(wrapped))
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += rst.splitlines()
def setup(app):
    app.connect('autodoc-process-docstring', docstring)

The current answer by @john-hennig is great, but it seems to fail for multi-line Args: entries in Python-style docstrings. Here is my fix:
def docstring(app, what, name, obj, options, lines):
    md = "\n".join(lines)
    ast = commonmark.Parser().parse(md)
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += _normalize_docstring_lines(rst.splitlines())
def _normalize_docstring_lines(lines: list[str]) -> list[str]:
    """Fix an issue with multi-line args which are incorrectly parsed.
    ```
    Args:
        x: My multi-line description which fit on multiple lines
          and continue in this line.
    ```
    Is parsed as (missing indentation):
    ```
    :param x: My multi-line description which fit on multiple lines
    and continue in this line.
    ```
    Instead of:
    ```
    :param x: My multi-line description which fit on multiple lines
        and continue in this line.
    ```
    """
    is_param_field = False
    new_lines = []
    for l in lines:
        if l.lstrip().startswith(":param"):
            is_param_field = True
        elif is_param_field:
            if not l.strip():  # Blank line reset param
                is_param_field = False
            else:  # Restore indentation
                l = "    " + l.lstrip()
        new_lines.append(l)
    return new_lines
def setup(app):
    app.connect("autodoc-process-docstring", docstring)

How to decode python string

I have some code which I would like to decode, but I am not having much luck guessing what the codepage is, if any is being used. Any help would be much appreciated.
I am using the Python command line on a Windows 7 PC. If any Python guru could guide me on how to decrypt and see the code, that would be appreciated.
exec("import re;import base64");exec((lambda p,y:(lambda o,b,f:re.sub(o,b,f))(r"([0-9a-f]+)",lambda m:p(m,y),base64.b64decode("NTQgYgo1NCA3CjU0IDMKNTQgMWUKNTQgOQo1NCAxOAozZiAgICAgICA9IGIuMTAoKQoxNiAgID0gIjQzOi8vMTIuM2QvNGMvMWQuMjUuZi00ZC4zZSIKYSA9ICIxZC4yNS5mIgoyYSA4KDYpOgoJMzMgMy5jKCczYy4yZSglNTIpJyAlIDYpID09IDEKCjJhIDE1KDM1KToKCTUgPSAzLjQoMWUuNS4xYygnMTM6Ly8yZC8xZicsJzMwJykpCgkyMyA1CgkyMSA9IDcuMTQoKQoJMjEuMzgoIjEwIDI4IiwiMjAgMTAuLiIsJycsICczNiA0MCcpCgkxMT0xZS41LjFjKDUsICdlLjNlJykKCTM5OgoJCTFlLjFhKDExKQoJMWI6CgkJMmMKCQk5LmUoMzUsIDExLCAyMSkKCQkyID0gMy40KDFlLjUuMWMoJzEzOi8vMmQnLCcxZicpKQoJCTIzIDIKCQkyMS4zNCgwLCIiLCAiM2IgNDciKQoJCTE4LjQ4KDExLDIsMjEpCgkJCgkJMy41MygnMjIoKScpOyAKCQkzLjUzKCcyNigpJyk7CgkJMy41MygiNDUuZCgpIik7IAoJCTE5PTcuMzcoKTsgMTkuNTAoIjMyISIsIjJmIDNhIDQ5IDQxIDI5IiwiICAgWzI0IDQ2XTMxIDUxIDRhIDRlIDE3LjNkWy8yNF0iKQoJCSIiIgoJCTM5OgoJCQkxZS4xYSgxMSkKCQkxYjoKCQkJMmMKCQkJIzI3KCkKCQk0Mjo0NCgpCgkJIiIiCgoyYSAyYigpOgoJNGYgNGIgOChhKToKCQkxNSgxNikKCQoKCjJiKCk=")))(lambda a,b:b[int("0x"+a.group(1),16)],"0|1|addonfolder|xbmc|translatePath|path|script_name|xbmcgui|script_chk|downloader|scriptname|xbmcaddon|getCondVisibility|UpdateLocalAddons|download|supermax|Addon|lib|supermaxwizard|special|DialogProgress|INSTALL|website|SuperMaxWizard|extract|dialog|remove|except|join|plugin|os|addons|Installing|dp|UnloadSkin|print|COLOR|video|ReloadSkin|FORCECLOSE|Installer|Installed|def|Main|pass|home|HasAddon|SuperMax|packages|Brought|Success|return|update|url|Please|Dialog|create|try|Wizard|Nearly|System|com|zip|addon|Wait|been|else|http|quit|XBMC|gold|Done|all|has|You|not|sm|MP|By|if|ok|To|s|executebuiltin|import".split("|")))

The code is uglified. You can unobfuscate it yourself by executing the contents of exec(...) in your Python shell.
import re
import base64
print ((lambda p,y.....split("|")))
EDIT: As snakecharmerb says, it is generally not safe to execute unknown code. I analysed the code to find that running the insides of exec will only decrypt, and leaving off the exec itself will just result in a string. This procedure ("execute stuff inside exec") is by no means a generally safe method to decrypt uglified code, and you need to actually analyse what it does. But, at this point, I was asking you to trust my judgement, which, if it is wrong, theoretically could expose you to an attack. In addition, it seems you have problems getting it to run on your Python; so here's what I'm getting from the above:
import xbmcaddon
import xbmcgui
import xbmc
import os
import downloader
import extract
addon = xbmcaddon.Addon()
website = "http://supermaxwizard.com/sm/plugin.video.supermax-MP.zip"
scriptname = "plugin.video.supermax"
def script_chk(script_name):
    return xbmc.getCondVisibility('System.HasAddon(%s)' % script_name) == 1
def INSTALL(url):
    path = xbmc.translatePath(os.path.join('special://home/addons','packages'))
    print path
    dp = xbmcgui.DialogProgress()
    dp.create("Addon Installer","Installing Addon..",'', 'Please Wait')
    lib=os.path.join(path, 'download.zip')
    try:
        os.remove(lib)
    except:
        pass
        downloader.download(url, lib, dp)
        addonfolder = xbmc.translatePath(os.path.join('special://home','addons'))
        print addonfolder
        dp.update(0,"", "Nearly Done")
        extract.all(lib,addonfolder,dp)
        xbmc.executebuiltin('UnloadSkin()');
        xbmc.executebuiltin('ReloadSkin()');
        xbmc.executebuiltin("XBMC.UpdateLocalAddons()");
        dialog=xbmcgui.Dialog(); dialog.ok("Success!","SuperMax Wizard has been Installed"," [COLOR gold]Brought To You By SuperMaxWizard.com[/COLOR]")
        """
        try:
            os.remove(lib)
        except:
            pass
            #FORCECLOSE()
        else:quit()
        """
def Main():
    if not script_chk(scriptname):
        INSTALL(website)
Main()
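The decoding step can be reproduced without exec at all, which keeps the result as a plain string. The sketch below illustrates the scheme on a tiny made-up payload rather than the real one: the base64 text is decoded, and every run of hex digits is then replaced by the word at that index in a lookup table. The name deobfuscate and the sample table are hypothetical.

```python
import base64
import re


def deobfuscate(payload_b64, words):
    """Decode base64, then replace each hex token with words[int(token, 16)]."""
    text = base64.b64decode(payload_b64).decode("utf-8")
    return re.sub(r"([0-9a-f]+)", lambda m: words[int(m.group(1), 16)], text)


# Entries 0 and 1 map to themselves so that literal 0 and 1 survive.
words = "0|1|os|import|print".split("|")
payload = base64.b64encode(b"3 2\n4(1)").decode("ascii")

# The result is only printed, never executed.
print(deobfuscate(payload, words))
```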

Modify/update an entry in a po file with polib in Python

polib appears to be THE library of choice for working with gettext/po files in Python. The docs show how to iterate through message strings, save po and mo files, etc. However, it's not clear to me, how can one edit a specific entry?
Let's say, I iterate over all messages in an existing po file and display them in an HTML form with textareas. By submitting the form, I get - as an example - the original
msgid = "Hello World"
and the translation entered in the textarea
msgstr = "Hallo Welt"
The original part inside the po file may look like this:
#: .\accounts\forms.py:26 .\accounts\registration\forms.py:48
msgid "Hello World"
msgstr ""
or with fuzzy flag set:
#: .\accounts\forms.py:26 .\accounts\registration\forms.py:48
#, fuzzy
msgid "Hello World"
msgstr "Hallo"
Now how do I update this particular translation in the actual po file? And in case this message was marked as "fuzzy", how do I remove this flag?
Any help appreciated ...

OK, after reading through the source code of polib, I found this way to achieve what I want:
# po is a polib.POFile, e.g. as returned by polib.pofile()
entry = po.find('Email address')
if entry:
    entry.msgstr = 'E-Mail-Adresse'
    if 'fuzzy' in entry.flags:
        entry.flags.remove('fuzzy')
This seems to be the way to go ...
In the case of pluralisation - just as an example:
entry = po.find('%s hour ago')
if entry and entry.msgid_plural:
    entry.msgstr_plural['0'] = 'Vor %s Stunde'
    entry.msgstr_plural['1'] = 'Vor %s Stunden'
The docs of polib should definitely be updated. Otherwise it is a great tool.

Escaping '<' and '>' in xml when using xml.dom.minidom

I am stuck while escaping "<" and ">" in the xml file using xml.dom.minidom.
I tried to get the unicode hex value and use that instead
http://slayeroffice.com/tools/unicode_lookup/
Tried to use the standard "&lt;" and "&gt;" but still with no success.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = '<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
same result with writexml()
Also tried by specifying encoding 'UTF-8', 'utf-8', 'utf' in the toxml() writexml() calls but with same results.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = u'<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
u'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
Tried other ways but with same results.
The only way I could work around this is by overriding the writer:
import xml.dom.minidom as md
# XXX Hack to handle '<' and '>'
def wd(writer, data):
    data = data.replace("&lt;", "<").replace("&gt;", ">")
    writer.write(data)
md._write_data = wd
Edit - This is the code.
import xml.dom.minidom as md
doc = md.Document()
entity_descr = doc.createElement("EntityDescriptor")
doc.appendChild(entity_descr)
entity_descr.setAttribute('xmlns', 'urn:oasis:names:tc:SAML:2.0:metadata')
entity_descr.setAttribute('xmlns:saml', 'urn:oasis:names:tc:SAML:2.0:assertion')
entity_descr.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
# Get the entity_id from saml20_idp_settings
entity_descr.setAttribute('entityID', self.group['entity_id'])
idpssodescr = doc.createElement('IDPSSODescriptor')
idpssodescr.setAttribute('WantAuthnRequestsSigned', 'true')
idpssodescr.setAttribute('protocolSupportEnumeration',
'urn:oasis:names:tc:SAML:2.0:protocol')
entity_descr.appendChild(idpssodescr)
keydescr = doc.createElement('KeyDescriptor')
keydescr.setAttribute('use', 'signing')
idpssodescr.appendChild(keydescr)
keyinfo = doc.createElement('ds:KeyInfo')
keyinfo.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
keydescr.appendChild(keyinfo)
x509data = doc.createElement('ds:X509Data')
keyinfo.appendChild(x509data)
# check this part
s = "this is a cert blah blah"
x509cert = doc.createElement('ds:X509Certificate')
cert = doc.createTextNode(s)
x509cert.appendChild(cert)
x509data.appendChild(x509cert)
sso = doc.createElement('SingleSignOnService')
sso.setAttribute('Binding', 'urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect')
sso.setAttribute('Location', 'http://googleapps/singleSignOn')
idpssodescr.appendChild(sso)
# Write the metadata file.
fobj = open('metadata.xml', 'w')
doc.writexml(fobj, " ", "", "\n", "UTF-8")
fobj.close()
This produces
<?xml version="1.0" encoding="UTF-8"?>
<EntityDescriptor entityID="skar" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"
xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
<IDPSSODescriptor WantAuthnRequestsSigned="true"
protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
<KeyDescriptor use="signing">
<ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<ds:X509Data>
<ds:X509Certificate>
this is a cert blah blah
</ds:X509Certificate>
</ds:X509Data>
</ds:KeyInfo>
</KeyDescriptor>
<SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
Location="http:///singleSignOn"/>
</IDPSSODescriptor>
</EntityDescriptor>
Note that the "this is a cert" text comes out separately, on a line of its own.
Have broken my head over this but with the same results.

This is not a bug, it is a feature. To insert actual XML, insert DOM objects instead. Text inside an XML tag needs to be entity escaped though to be valid XML.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
eh = doc.createElement("hello")
s1 = 'bhaskar'
text = doc.createTextNode(s1)
eh.appendChild(text)
e.appendChild(eh)
e.toxml()
EDIT: I don't know what Python's API is like, but it looks very similar to C#'s, so you might be able to do something like e.innerXml = s1 to do what you're trying to do... but that could be bad. The better thing to do is parse it and appendChild it as well.
EDIT 2: I just ran this via Python locally, and there's definitely something wrong on your end, not in the libraries. Make sure that your string doesn't have any newlines or whitespace at the start of it. For reference, the test code I used was:
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.dom.minidom import Document
>>> cert = "---- START CERTIFICATE ----\n Hello world\n---- END CERTIFICATE ---"
>>> doc = Document()
>>> e = doc.createElement("cert")
>>> certEl = doc.createTextNode(cert)
>>> e.appendChild(certEl)
<DOM Text node "'---- START'...">
>>> print e.toxml()
<cert>---- START CERTIFICATE ----
Hello world
---- END CERTIFICATE ---</cert>
>>>
EDIT 3: The final edit. The problem is in your writexml call. Simply using the following fixes this:
doc.writexml(fobj)
# or
doc.writexml(fobj, "", " ", "")
Unfortunately, it seems that you won't be able to use the newline parameter to get pretty printing, though. It seems that the Python library (or at least minidom) is written rather poorly and will modify text nodes while printing them. Not so much a poor implementation as a naive one. A shame, really.

If you use "<" as text in XML, you need to escape it, else it is considered markup. So xml.dom is right in escaping it, since you've asked for a text node.
Assuming you really want to insert a piece of XML, I recommend using createElement("hello"). If you have a fragment of XML whose structure you don't know, you should first parse it, and then move the nodes of that parse result into the other tree.
If you want to hack, you can inherit from xml.dom.minidom.Text, and overwrite the writexml method. See the source of minidom for details.
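The difference between the two approaches can be sketched with the standard library alone. In the example below, the first element gets the string as a text node, so the angle brackets are escaped on output; the second gets the parsed fragment imported as real nodes. The variable names are illustrative, and the fragment is assumed to be well-formed XML with a single root element.

```python
from xml.dom.minidom import Document, parseString

doc = Document()

# 1) As text: "<" and ">" are data here, so they are escaped on output.
as_text_el = doc.createElement("abc")
as_text_el.appendChild(doc.createTextNode("<hello>bhaskar</hello>"))
as_text = as_text_el.toxml()

# 2) As markup: parse the fragment, then import its nodes into this document.
fragment = parseString("<hello>bhaskar</hello>").documentElement
as_markup_el = doc.createElement("abc")
as_markup_el.appendChild(doc.importNode(fragment, True))  # True = deep copy
as_markup = as_markup_el.toxml()

print(as_text)
print(as_markup)
```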
