how to insert variable into a re.compile in python [duplicate]

I would like to put an int into a string. This is what I am doing at the moment:
num = 40
plot.savefig('hanning40.pdf') #problem line
I have to run the program for several different numbers, so I'd like to do a loop. But inserting the variable like this doesn't work:
plot.savefig('hanning', num, '.pdf')
How do I insert a variable into a Python string?
See also
If you tried using + to concatenate a number with a string (or between strings, etc.) and got an error message, see How can I concatenate str and int objects?.
If you are trying to assemble a URL with variable data, do not use ordinary string formatting, because it is error-prone and more difficult than necessary. Specialized tools are available. See Add params to given URL in Python.
If you are trying to assemble a SQL query, do not use ordinary string formatting, because it is a major security risk. This is the cause of "SQL injection" which costs real companies huge amounts of money every year. See for example Python: best practice and securest way to connect to MySQL and execute queries for proper techniques.
If you just want to print (output) the string, you can prepare it this way first, or if you don't need the string for anything else, print each piece of the output individually using a single call to print. See How can I print multiple things (fixed text and/or variable values) on the same line, all at once? for details on both approaches.

Using f-strings:
plot.savefig(f'hanning{num}.pdf')
This was added in 3.6 and is the new preferred way.
Using str.format():
plot.savefig('hanning{0}.pdf'.format(num))
String concatenation:
plot.savefig('hanning' + str(num) + '.pdf')
Conversion Specifier:
plot.savefig('hanning%s.pdf' % num)
Using local variable names (neat trick):
plot.savefig('hanning%(num)s.pdf' % locals())
Using string.Template:
plot.savefig(string.Template('hanning${num}.pdf').substitute(locals()))
See also:
Fancier Output Formatting - The Python Tutorial
Python 3's f-Strings: An Improved String Formatting Syntax (Guide) - RealPython

With the introduction of formatted string literals ("f-strings" for short) in Python 3.6, it is now possible to write this with a briefer syntax:
>>> name = "Fred"
>>> f"He said his name is {name}."
'He said his name is Fred.'
With the example given in the question, it would look like this:
plot.savefig(f'hanning{num}.pdf')

plot.savefig('hanning(%d).pdf' % num)
The % operator, when following a string, allows you to insert values into that string via format codes (the %d in this case). For more details, see the Python documentation:
printf-style String Formatting

You can also use + for ordinary string concatenation, combined with str() to convert the number:
"hello " + str(10) + " world" == "hello 10 world"

In general, you can create strings using:
stringExample = "someString " + str(someNumber)
print(stringExample)
plot.savefig(stringExample)

If you want to put multiple values into the string, you can use format:
nums = [1,2,3]
plot.savefig('hanning{0}{1}{2}.pdf'.format(*nums))
This results in the string hanning123.pdf. The same works with any sequence, such as a list or tuple.

Special cases
Depending on why variable data is being used with strings, the general-purpose approaches may not be appropriate.
If you need to prepare an SQL query
Do not use any of the usual techniques for assembling a string. Instead, use your SQL library's functionality for parameterized queries.
A query is code, so it should not be thought about like normal text. Using the library will make sure that any inserted text is properly escaped. If any part of the query could possibly come from outside the program in any way, that is an opportunity for a malevolent user to perform SQL injection. This is widely considered one of the important computer security problems, costing real companies huge amounts of money every year and causing problems for countless customers. Even if you think you know the data is "safe", there is no real upside to using any other approach.
The syntax will depend on the library you are using and is outside the scope of this answer.
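As one illustration only, here is a minimal sketch using the standard library's sqlite3 module, which uses ? as its placeholder. The table and column names are made up for the example, and other libraries use different placeholder styles (for example %s or :name), so check the documentation of the one you use.

```python
import sqlite3

# In-memory database with a hypothetical table, just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plots (name TEXT, num INTEGER)")

num = 40
name = "hanning'; DROP TABLE plots; --"  # hostile-looking input

# The values are passed separately from the SQL text, so they are
# treated as data and never parsed as part of the query.
conn.execute("INSERT INTO plots (name, num) VALUES (?, ?)", (name, num))

rows = conn.execute(
    "SELECT name, num FROM plots WHERE num = ?", (num,)
).fetchall()
print(rows)
```

The odd-looking name is stored verbatim as a value; the table is still there afterwards.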
If you need to prepare a URL query string
See Add params to given URL in Python. Do not do it yourself; there is no practical reason to make your life harder.
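For a rough idea of what the specialized tools look like, here is a small sketch using urllib.parse.urlencode from the standard library. The base URL and parameter names are hypothetical.

```python
from urllib.parse import urlencode

base_url = "https://example.com/search"  # hypothetical endpoint
params = {"q": "hanning window", "num": 40}

# urlencode escapes each key and value and joins the pairs with '&'.
url = base_url + "?" + urlencode(params)
print(url)  # https://example.com/search?q=hanning+window&num=40
```

Note that the space in the value was escaped automatically, which is exactly the kind of detail that is easy to get wrong by hand.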
Writing to a file
While it's possible to prepare a string ahead of time, it may be simpler and more memory efficient to just write each piece of data with a separate .write call. Of course, non-strings will still need to be converted to string before writing, which may complicate the code. There is not a one-size-fits-all answer here, but choosing badly will generally not matter very much.
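A small sketch of the piece-by-piece approach follows. It uses io.StringIO as a stand-in for a real file object so that it runs anywhere; with a real file you would use open(path, 'w') instead. The variable names are only for illustration.

```python
import io

num = 40
values = [0.0, 0.5, 1.0]

# io.StringIO stands in for a file opened with open(path, 'w').
out = io.StringIO()
out.write('hanning')
out.write(str(num))   # non-strings must be converted explicitly
out.write('\n')
for v in values:
    out.write(str(v))
    out.write('\n')

text = out.getvalue()
print(text)
```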
If you are simply calling print
The built-in print function accepts a variable number of arguments, and can take in any object and stringify it using str. Before trying string formatting, consider whether simply passing multiple arguments will do what you want. (You can also use the sep keyword argument to control spacing between the arguments.)
# display a filename, as an example
print('hanning', num, '.pdf', sep='')
Of course, there may be other reasons why it is useful for the program to assemble a string; so by all means do so where appropriate.
It's important to note that print is a special case. The only functions that work this way are ones that are explicitly written to work this way. For ordinary functions and methods, like input, or the savefig method of Matplotlib plots, we need to prepare a string ourselves.
Concatenation
Python supports using + between two strings, but not between strings and other types. To work around this, we need to convert other values to string explicitly: 'hanning' + str(num) + '.pdf'.
Template-based approaches
Most ways to solve the problem involve having some kind of "template" string that includes "placeholders" that show where information should be added, and then using some function or method to add the missing information.
f-strings
This is the recommended approach when possible. It looks like f'hanning{num}.pdf'. The names of variables to insert appear directly in the string. It is important to note that there is not actually such a thing as an "f-string"; it's not a separate type. Instead, Python will translate the code ahead of time:
>>> def example(num):
...     return f'hanning{num}.pdf'
...
>>> import dis
>>> dis.dis(example)
  2           0 LOAD_CONST               1 ('hanning')
              2 LOAD_FAST                0 (num)
              4 FORMAT_VALUE             0
              6 LOAD_CONST               2 ('.pdf')
              8 BUILD_STRING             3
             10 RETURN_VALUE
Because it's a special syntax, it can access opcodes that aren't used in other approaches.
str.format
This is the recommended approach when f-strings aren't possible - mainly, because the template string needs to be prepared ahead of time and filled in later. It looks like 'hanning{}.pdf'.format(num), or 'hanning{num}.pdf'.format(num=num). Here, format is a method built in to strings, which can accept arguments either by position or keyword.
Particularly for str.format, it's useful to know that the built-in locals, globals and vars functions return dictionaries that map variable names to the contents of those variables. Thus, rather than something like '{a}{b}{c}'.format(a=a, b=b, c=c), we can use something like '{a}{b}{c}'.format(**locals()), unpacking the locals() dict.
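A minimal sketch of the locals() trick, using a made-up helper function so that the local namespace is small and predictable:

```python
def filename(num, ext):
    # Inside this function, locals() is {'num': num, 'ext': ext};
    # ** unpacks it into keyword arguments that match the placeholders.
    return 'hanning{num}.{ext}'.format(**locals())

result = filename(40, 'pdf')
print(result)  # hanning40.pdf
```

Extra names in the dictionary that do not appear in the template are simply ignored by format.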
str.format_map
This is a rare variation on .format. It looks like 'hanning{num}.pdf'.format_map({'num': num}). Rather than accepting keyword arguments, it accepts a single argument which is a mapping.
That probably doesn't sound very useful - after all, rather than 'hanning{num}.pdf'.format_map(my_dict), we could just as easily write 'hanning{num}.pdf'.format(**my_dict). However, this is useful for mappings that determine values on the fly, rather than ordinary dicts. In these cases, unpacking with ** might not work, because the set of keys might not be determined ahead of time; and trying to unpack keys based on the template is unwieldy (imagine: 'hanning{num}.pdf'.format(num=my_mapping['num']), with a separate argument for each placeholder).
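To make the difference concrete, here is a sketch with a hypothetical dict subclass that invents a value for any key it does not have. The class name and the '<key>' fallback are assumptions made up for the example.

```python
class DefaultingMapping(dict):
    """Hypothetical mapping that computes a value for any missing key."""
    def __missing__(self, key):
        return '<' + key + '>'

values = DefaultingMapping(num=40)
template = 'hanning{num}-{window}.pdf'

# format_map looks keys up on the mapping itself, so __missing__ runs.
result = template.format_map(values)
print(result)  # hanning40-<window>.pdf

# ** copies the existing keys into a plain dict, so 'window' is lost.
try:
    template.format(**values)
    unpacking_failed = False
except KeyError:
    unpacking_failed = True
```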
string.Formatter
The string standard library module contains a rarely used Formatter class. Using it looks like string.Formatter().format('hanning{num}.pdf', num=num). The template string uses the same syntax again. This is obviously clunkier than just calling .format on the string; the motivation is to allow users to subclass Formatter to define a different syntax for the template string.
All of the above approaches use a common "formatting language" (although string.Formatter allows changing it); there are many other things that can be put inside the {}. Explaining how it works is beyond the scope of this answer; please consult the documentation. Do keep in mind that literal { and } characters need to be escaped by doubling them up. The syntax is presumably inspired by C#.
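As a quick taste of what else can go inside the {}, and of the brace escaping, here is a short sketch. The values are arbitrary.

```python
num = 7

padded = f'hanning{num:03d}.pdf'      # zero-pad the integer to width 3
ratio = '{:.2f}'.format(3.14159)      # fixed-point, two decimal places
braces = 'set {{{}}}'.format(num)     # doubled braces come out literally

print(padded)  # hanning007.pdf
print(ratio)   # 3.14
print(braces)  # set {7}
```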
The % operator
This is a legacy way to solve the problem, inspired by C and C++. It has been discouraged for a long time, but is still supported. It looks like 'hanning%s.pdf' % num, for simple cases. As you'd expect, literal '%' symbols in the template need to be doubled up to escape them.
It has some issues:
It seems like the conversion specifier (the letter after the %) should match the type of whatever is being interpolated, but that's not actually the case. Instead, the value is converted to the specified type, and then to string from there. This isn't normally necessary; converting directly to string works most of the time, and converting to other types first doesn't help most of the rest of the time. So 's' is almost always used (unless you want the repr of the value, using 'r'). Despite that, the conversion specifier is a mandatory part of the syntax.
Tuples are handled specially: passing a tuple on the right-hand side is the way to provide multiple arguments. This is an ugly special case that's necessary because we aren't using function-call syntax. As a result, if you actually want to format a tuple into a single placeholder, it must be wrapped in a 1-tuple.
Other sequence types are not handled specially, and the different behaviour can be a gotcha.
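The issues listed above can be seen in a few lines. This is just a demonstration with arbitrary values:

```python
point = (1, 2)

two_args = '%s-%s' % point      # the tuple supplies two arguments
wrapped = '%s' % (point,)       # a 1-tuple formats the tuple itself
as_list = '%s' % [1, 2]         # a list counts as a single value
truncated = '%d' % 3.7          # converted to int first, then to string

try:
    '%s' % point                # two arguments, but only one placeholder
    error = None
except TypeError as exc:
    error = str(exc)

print(two_args, wrapped, as_list, truncated)  # 1-2 (1, 2) [1, 2] 3
print(error)
```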
string.Template
The string standard library module contains a rarely used Template class. Instances provide substitute and safe_substitute methods that work similarly to the built-in .format (safe_substitute will leave placeholders intact rather than raising an exception when the arguments don't match). This should also be considered a legacy approach to the problem.
It looks like string.Template('hanning$num.pdf').substitute(num=num), and is inspired by traditional Perl syntax. It's obviously clunkier than the .format approach, since a separate class has to be used before the method is available. Braces ({}) can be used optionally around the name of the variable, to avoid ambiguity. Similarly to the other methods, literal '$' in the template needs to be doubled up for escaping.
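A short sketch showing the optional braces, the $$ escape, and the difference between substitute and safe_substitute. The placeholder names are made up for the example.

```python
from string import Template

# ${num} needs braces here; otherwise '$num_' would be read as one name.
t = Template('hanning${num}_$kind.pdf costs $$5')

full = t.substitute(num=40, kind='wide')
partial = t.safe_substitute(num=40)   # leaves $kind in place

try:
    t.substitute(num=40)              # missing 'kind'
    missing = None
except KeyError as exc:
    missing = exc.args[0]

print(full)     # hanning40_wide.pdf costs $5
print(partial)  # hanning40_$kind.pdf costs $5
```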

I had a need for an extended version of this: instead of embedding a single number in a string, I needed to generate a series of file names of the form 'file1.pdf', 'file2.pdf' etc. This is how it worked:
['file' + str(i) + '.pdf' for i in range(1,4)]

You can make dict and substitute variables in your string.
var = {"name": "Abdul Jalil", "age": 22}
temp_string = "My name is %(name)s. I am %(age)s years old." % var

Related

How to use gettext with python >3.6 f-strings

Previously you would use gettext as follows:
_('Hey {},').format(username)
But what about Python's new f-strings?
f'Hey {username}'

'Hey {},' is contained in your translation dictionary as is.
If you use f'Hey {username},', that creates another string, which won't be translated.
In that case, the format method remains the only usable option, but you can get close to the f-string style by using named parameters:
_('Hey {username},').format(username=username)
or if you have a dictionary containing your data, this cool trick where format picks the required information in the input dictionary:
d = {"username":"John", "city":"New York", "unused":"doesn't matter"}
_('Hey {username} from {city},').format(**d)
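To see why the f-string version is not translated, here is a sketch that replaces gettext with a hypothetical one-entry dictionary. The catalog contents and the _ function are stand-ins invented for the example, not the real gettext API.

```python
# Hypothetical stand-in for a gettext catalog: msgid -> translation.
catalog = {'Hey {username},': 'Salut {username},'}

def _(message):
    # Like gettext, fall back to the original text when nothing matches.
    return catalog.get(message, message)

username = 'John'

# The template is looked up first, then filled in.
translated = _('Hey {username},').format(username=username)

# The f-string is filled in first, so the lookup sees 'Hey John,'.
untranslated = _(f'Hey {username},')

print(translated)    # Salut John,
print(untranslated)  # Hey John,
```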

My solution is to make a function f() which performs the f-string interpolation after gettext has been called.
from copy import copy
from inspect import currentframe
def f(s):
    frame = currentframe().f_back
    kwargs = copy(frame.f_globals)
    kwargs.update(frame.f_locals)
    # Evaluate the translated text as if it were an f-string literal
    return eval('f' + repr(s), kwargs)
Now you just wrap _(...) in f() and don’t preface the string with an f:
f(_('Hey, {username}'))
Note of caution
I’m usually against the use of eval, as it can make a function unsafe, but I think it is justified here, so long as you’re aware of what’s being formatted. That said, use it at your own risk.
Remember
This isn’t a perfect solution; it is just my solution. As PEP 498 states, each formatting method “have their advantages, but in addition have disadvantages”, and that includes this one.
For example, if you change the expression inside the string, it will no longer match and will therefore not be translated unless you also update your .po file. Also, if someone else does the translating and you use an expression whose outcome is hard to decipher, that can cause miscommunication or other issues in translation.

Is it possible to declare a variable with a value for string and a placeholder in python?

I am trying to initialize a long string value to a variable, but this string has a word that can not be constant, like this example.
Say I want to store a string like this.
str = "https://stackoverflow.com/users/7833397/meskerem"
But assume the number 7833397 will change over time, so I am trying to find a way to store the string while keeping a placeholder for the number. I am not sure if this can be done in Python.

Use the format method.
template = "https://stackoverflow.com/users/{0}/meskerem"
# Lots of stuff happens here
url = template.format("7833397")
The format method supports its own little mini language, and depending on your use-case you may find it more intuitive to name the various parts of your template, too:
template = "https://stackoverflow.com/users/{id}/{username}"
# Lots of stuff happens here
url = template.format(id="7833397", username="meskerem")

First, avoid using the identifier str, since it shadows the built-in type. Second, you can put placeholders in strings using two methods of string formatting:
Old style
The "old" style uses C-style string formatting syntax, and "modulo" operation on the string to do the actual insertion. You can pass multiple replacements as a tuple:
s = "foo%sbaz" # expects a string
print(s%"bar")
s2 = "foo%s%d"
print(s2%("bar", 2))
New style
The "new" style uses a generic {} which can be filled using the str.format() method. Multiple replacements are passed as separate arguments rather than as a tuple:
s = "foo{}baz" # can be "anything"
print(s.format("bar"))
s2 = "foo{}{}"
print(s2.format("bar", 2))
This site might come in handy as a reference.

You can use '%s' (string formatting syntax):
modified_str = "https://stackoverflow.com/users/%s/meskerem" % (number,)

Python style for `chained` function calls

More and more we use chained function calls:
value = get_row_data(original_parameters).refine_data(level=3).transfer_to_style_c()
It can get long. To avoid an overly long line of code, which is preferred?
value = get_row_data(
        original_parameters).refine_data(
        level=3).transfer_to_style_c()
or:
value = get_row_data(original_parameters)\
    .refine_data(level=3)\
    .transfer_to_style_c()
I feel it is good to use a backslash (\) and put each .function on a new line. This gives each function call its own line, which is easy to read. But this does not seem to be preferred by many. And when code has subtle errors that are hard to debug, I always start to worry that there might be a space or something after the backslash (\).
To quote from the Python style guide:
Long lines can be broken over multiple lines by wrapping expressions
in parentheses. These should be used in preference to using a
backslash for line continuation. Make sure to indent the continued
line appropriately. The preferred place to break around a binary
operator is after the operator, not before it.

I tend to prefer the following, which eschews the non-recommended \ at the end of a line, thanks to an opening parenthesis:
value = (get_row_data(original_parameters)
         .refine_data(level=3)
         .transfer_to_style_c())
One advantage of this syntax is that each method call is on its own line.
A similar kind of \-less structure is also often useful with string literals, so that they don't go beyond the recommended 79 character per line limit:
message = ("This is a very long"
           " one-line message put on many"
           " source lines.")
This is a single string literal, which is created efficiently by the Python interpreter (this is much better than summing strings, which creates multiple strings in memory and copies them multiple times until the final string is obtained).
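One way to convince yourself of this is to compile a tiny snippet and look at its constants. This is only a sketch; the snippet text and file name passed to compile are arbitrary.

```python
message = ("This is a very long"
           " one-line message put on many"
           " source lines.")

# Compile a tiny snippet and inspect its constants: the adjacent
# literals are already merged into one string at compile time.
code = compile('x = ("ab" "cd")', '<example>', 'exec')
merged = 'abcd' in code.co_consts

print(message)
print(merged)  # True
```

Remember to include the separating spaces yourself, since nothing is inserted between the pieces.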
Python's code formatting is nice.

What about this option:
value = get_row_data(original_parameters,
      ).refine_data(level=3,
      ).transfer_to_style_c()
Note that commas are redundant if there are no other parameters but I keep them to maintain consistency.

The answer that avoids quoting my own preference or listing alternatives (although see the comments on your question :)) is:
Stick to the style guidelines on any project you have already - if not stated, then keep as consistent as you can with the rest of the code base in style.
Otherwise, pick a style you like and stick with that - and let others know somehow that's how you'd appreciate chained function calls to be written if they are not reasonably readable on one line (or however you wish to describe it).

Python 2.x: how to automate enforcing unicode instead of string?

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)

You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else: places where you don't care which type you get and/or where it doesn't matter that Python 3 changes the default string type.
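To illustrate where that leads, here is a small sketch meant to be run under Python 3 (an assumption; the question itself is about 2.x). There, u'...' is still accepted and gives the ordinary str type, while b'...' gives bytes, and the two are converted explicitly.

```python
text = u'caf\u00e9'        # inherently Unicode text
data = b'caf\xc3\xa9'      # the same text as UTF-8 encoded bytes

# Under Python 3 the two literals have distinct types.
is_text = isinstance(text, str)
is_bytes = isinstance(data, bytes)

# Conversion between them is always explicit.
encoded = text.encode('utf-8')
decoded = data.decode('utf-8')

print(is_text, is_bytes)   # True True
print(encoded == data)     # True
print(decoded == text)     # True
```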

It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name
def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())
def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue
print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
Good question!
Edit
Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
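For comparison, the same idea can be sketched with the standard tokenize module, which also reads the source without executing it. Note that this is a different technique from the parser-based code above, and it is written for Python 3 (where the parser module is no longer available); the function name is made up. Like the original, it is naive: it also reports docstrings and b'...' literals.

```python
import io
import tokenize

def non_unicode_literals(source):
    """Return (line_number, literal) for string tokens lacking a u prefix."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # The prefix is whatever precedes the first quote character.
        quote_positions = (tok.string.find("'"), tok.string.find('"'))
        first_quote = min(i for i in quote_positions if i != -1)
        prefix = tok.string[:first_quote].lower()
        if 'u' not in prefix:
            hits.append((tok.start[0], tok.string))
    return hits

sample = "a = 'This should be reported'\nb = u'although this is ok.'\n"
found = non_unicode_literals(sample)
print(found)
```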

Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files, using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, composed of various language keywords and pattern tokens that match varying language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python including "UnicodeStrings", non-unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't unicode strings, your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.
