Create safe filenames from user input in Python - python

I'm programming an IRC and XMPP bot that needs to convert user provided input to a filename. I have already written a function to do this. Is it sane enough?
Here is the code:
import os
import string
allowednamechars = string.ascii_letters + string.digits + '_+/$.-'
def stripname(name, allowed=""):
    """ strip all not allowed chars from name. """
    n = name.replace(os.sep, '+')
    n = n.replace("@", '+')
    n = n.replace("#", '-')
    n = n.replace("!", '.')
    res = u""
    for c in n:
        if ord(c) < 31: continue
        elif c in allowednamechars + allowed: res += c
        else: res += "-" + str(ord(c))
    return res
It's a whitelist with extra code to remove control characters and replace os.sep, as well as some replacements to make the filename Google App Engine compatible.
The bot in question is at http://jsonbot.googlecode.com.
So what do you think of it?

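urllib.quote(name.encode("utf8")) will produce something human-readable, which should also be safe, with one caveat: by default quote() leaves '/' unescaped (note the ../.. in the output), so pass safe='' if path separators must be escaped too. Example: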
In [1]: urllib.quote(u"foo bar$=+:;../..(boo)\u00c5".encode('utf8'))
Out[1]: 'foo%20bar%24%3D%2B%3A%3B../..%28boo%29%C3%85'
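For Python 3, the same idea might look like the sketch below. It assumes the input is already a str, and the helper name quote_filename is made up for illustration. urllib.quote lives in urllib.parse there, and passing safe='' makes it escape '/' as well.

```python
from urllib.parse import quote, unquote

def quote_filename(name):
    """Percent-encode every character except letters, digits and '_.-~'."""
    # safe='' overrides the default safe='/', so path separators are escaped too.
    return quote(name, safe='')

encoded = quote_filename("foo bar$=+:;../..(boo)\u00c5")
print(encoded)
# The encoding is reversible, so the original input can be recovered.
print(unquote(encoded))
```

Dots are left untouched, so a name that consists only of "." or ".." would still need separate handling.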

You might consider just doing base64.urlsafe_b64encode(name), which will always produce a safe name, unless you really want a human-readable file name. Otherwise, the list of edge cases is pretty long, and if you forget one of them, you've got a security problem.
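A minimal Python 3 sketch of the base64 approach, assuming the name arrives as a str and is encoded as UTF-8 first (urlsafe_b64encode needs bytes there). The helper names b64_filename and original_name are made up for illustration.

```python
import base64

def b64_filename(name):
    """Encode arbitrary text using only A-Z, a-z, 0-9, '-', '_' and '='."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")

def original_name(filename):
    """Reverse b64_filename to recover the user-provided name."""
    return base64.urlsafe_b64decode(filename.encode("ascii")).decode("utf-8")

encoded = b64_filename("#channel/../etc")
print(encoded)
print(original_name(encoded))
```

The result is not human-readable and is roughly a third longer than the input, so very long names may still hit filesystem length limits. The encoding is also case-sensitive, which matters on case-insensitive filesystems.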

Related

Combining values with text in python function

I have a python script that scrapes a webpage and downloads the mp3s found on it.
I am trying to name the files using elements that I have successfully captured in a separate function.
I am having trouble naming the downloaded files, this is what I have so far:
def make_safe_filename(text: str) -> str:
    """convert forbidden chars to underscores"""
    return ''.join(c if (c.isalnum() or c.isspace()) else '_' for c in text).strip('_ ')
filename = make_safe_filename(a['artist'] + a['title'] + a['date'] + a['article_url'])
I am trying to save the file name as "Artist - Title - Date - Article_url" however I am struggling to do this. At the moment the variables are all mashed together without spaces, eg. ArtistTitleDateArticle_url.mp3
I've tried
filename = make_safe_filename(a['artist'] + "-" + a['title'] + "-" + a['date'] + "-" +
a['article_url'])
but this throws up errors.
Can anyone shed some light on where I am going wrong? I know it's something to do with combining variables but I am stuck. Thanks in advance.
I am guessing your a is a dictionary? Maybe you could clarify this in your question? Also what do you typically have in a['article_url']? Could you also post the traceback?
This is my attempt (note: no changes to the function):
def make_safe_filename(text: str) -> str:
    """convert forbidden chars to underscores"""
    return ''.join(c if (c.isalnum() or c.isspace()) else '_' for c in text).strip('_ ')
a = {
    'artist': 'Metallica',
    'title': 'Nothing Else Matters',
    'date': '1991',
    'article_url': 'unknown',
}
filename = make_safe_filename(a['artist'] + '-' + a['title'] + '-' + a['date'] + '-' + a['article_url'])
print(filename)
Which produced the following output:
Metallica_Nothing Else Matters_1991_unknown
Your code should actually work, but if you add the - before passing the joined string to the function, it will just replace those with _ as well. Instead, you could pass the individual fields and then join those in the function, after replacing the "illegal" characters for each field individually. Also, you could use regular expressions and re.sub for the actual replacing:
import re
def safe_filename(*fields):
    return " - ".join(re.sub(r"[^\w\s]", "_", s) for s in fields)
>>> safe_filename("Art!st", "S()ng", "§$%")
'Art_st - S__ng - ___'
Of course, if your a is a dictionary and you always want the same fields from that dict (artist, title, etc.) you could also just pass the dict itself and extract the fields within the function.
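Passing the dict itself could look roughly like this. It is only a sketch: the function name safe_filename_from and the default key order are assumptions based on the fields named in the question.

```python
import re

def safe_filename_from(record, keys=("artist", "title", "date", "article_url")):
    """Sanitize each selected field separately, then join them with ' - '."""
    return " - ".join(re.sub(r"[^\w\s]", "_", str(record[key])) for key in keys)

a = {
    'artist': 'Metallica',
    'title': 'Nothing Else Matters',
    'date': '1991',
    'article_url': 'unknown',
}
print(safe_filename_from(a) + ".mp3")
```

Because the separator is added after sanitizing, the " - " between fields survives while any punctuation inside a field is still replaced.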
I had a similar problem recently, the best solution is probably to use regex, but I'm too lazy to learn that, so I wrote a replaceAll function:
def replaceAll(string, characters, replacement):
    s = string
    for i in characters:
        s = s.replace(i, replacement)
    return s
and then I used it to make a usable filename:
fName = replaceAll(name, '*<>?|"/\\.,\':', "")
in your case it would be:
filename = replaceAll(a['artist'] + a['title'] + a['date'] + a['article_url'], '*<>?|"/\\.,\':', "-")
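For comparison, the same replacement can be done in a single pass with the built-in str.translate, without a loop or regex. This is a sketch; the name replace_all is made up to avoid clashing with the function above.

```python
def replace_all(text, characters, replacement):
    """Replace every character listed in `characters` with `replacement`."""
    # Build a translation table mapping each unwanted character to the replacement.
    table = str.maketrans(dict.fromkeys(characters, replacement))
    return text.translate(table)

print(replace_all('AC/DC: Back?', '*<>?|"/\\.,\':', '-'))
```

An empty replacement string removes the characters instead of substituting them.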

Using function returns with None

Write a function file_in_english(filename, character_limit) that takes a filename (as a str) and a character_limit (as an int). The filename is the name of the file to convert from Cat Latin to English and the character limit is the maximum number of characters that can be converted. The limit is on the total number of characters in the output (including newline characters).
The function should return a string that contains all the converted lines in the same order as the file - remember the newline character at the end of each line (that is make sure you include a newline character at the end of each converted line so that it is included in the line's length).
If the limit is exceeded (ie, a converted sentence would take the output over the limit) then the sentence that takes the character count over the limit shouldn't be added to the output. A line with "<<Output limit exceeded>>" should be added at the end of the output instead. The processing of lines should then stop.
The lines in the file will each be a sentence in Weird Latin and your program should print out the English version of each sentence.
The function should keep adding sentences until it runs out of input from the file or the total number of characters printed (including spaces) exceeds the limit.
The answer must include your definition of english_sentence and its helper(s) functions - that I should have called english_word or similar.
You MUST use while in your file_in_english function.
You can only use one return statement per function.
The test file used in the examples (test1.txt) has the following data:
impleseeoow estteeoow aseceeoow
impleseeoow estteeoow aseceeoow ineleeoow 2meeoow
impleseeoow estteeoow aseceeoow ineleeoow 3meeoow
impleseeoow estteeoow aseceeoow ineleeoow 4meeoow
My program works fine except that sometimes it returns None.
def english_sentence(sentence):
    """Reverse Translation"""
    consonants = 'bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ'
    eng_sentence = []
    for coded_word in sentence.split():
        if coded_word.endswith("eeoow") and (coded_word[-6] in consonants):
            english_word = coded_word[-6] + coded_word[:-6]
            if (coded_word[-6] == 'm') and (coded_word[0] not in consonants):
                english_word = '(' + english_word + ' or ' + coded_word[:-6] + ')'
        eng_sentence.append(english_word)
    return " ".join(eng_sentence)
def file_in_english(filename, character_limit):
    """English File"""
    newone = open(filename)
    nowline = newone.readline()
    characters = 0
    while characters < character_limit and nowline != "":
        process = nowline[0:-1]
        print(english_sentence(process))
        characters += len(nowline)
        nowline = newone.readline()
    if characters > character_limit:
        return("<<Output limit exceeded>>")
ans = file_in_english('test1.txt', 20)
print(ans)
Output is:
simple test case
simple test case line (m2 or 2)
simple test case line (m3 or 3)
simple test case line (m4 or 4)
None
But I must use only one return statement in each function. How can I do that for the second function and avoid the "None" in output?
You're doing the same thing as:
def f():
    print('hello')
print(f())
So it basically boils down to:
print(print('hello world'))
Also btw:
>>> type(print('hello'))
hello
<class 'NoneType'>
>>>
To fix your code, do:
def file_in_english(filename, character_limit):
    """English File"""
    s = ""
    newone = open(filename)
    nowline = newone.readline()
    characters = 0
    while characters < character_limit and nowline != "":
        process = nowline[0:-1]
        s += english_sentence(process) + '\n'
        characters += len(nowline)
        nowline = newone.readline()
    if characters > character_limit:
        s += "<<Output limit exceeded>>"
    return s
ans = file_in_english('test1.txt', 20)
print(ans)
You have to make sure that any function that should return something does so for ALL the ways the function can end.
Your function file_in_english only returns something in the case characters > character_limit.
If characters == character_limit or characters < character_limit, that is not the case, and the function returns nothing explicitly.
Any function that reaches its end without an explicit return implicitly returns None to its caller.
def something(boolean):
    """Function that only returns something meaningful if boolean is True."""
    if boolean:
        return "Wow"
print(something(True))   # prints Wow
print(something(False))  # implicitly returns None, so prints None
You can find this fact, for example, in the Python tutorial:
Coming from other languages, you might object that fib is not a
function but a procedure since it doesn’t return a value. In fact,
even functions without a return statement do return a value, albeit a
rather boring one. This value is called None (it’s a built-in name).
Writing the value None is normally suppressed by the interpreter if it
would be the only value written. You can see it if you really want to
using print():
Source: https://docs.python.org/3.7/tutorial/controlflow.html#defining-functions - shortly after the 2nd green example box
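Putting the points above together, here is one possible sketch that uses a while loop and a single return per function. It makes some assumptions: the limit is applied to the length of the converted output (as the assignment text describes), the marker line ends with a newline, and words that do not look like Cat Latin are kept unchanged. The english_sentence logic is copied from the question.

```python
def english_sentence(sentence):
    """Translate one Cat Latin sentence back to English (logic from the question)."""
    consonants = 'bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ'
    words = []
    for coded_word in sentence.split():
        english_word = coded_word  # assumption: unrecognised words pass through unchanged
        if (len(coded_word) >= 6 and coded_word.endswith("eeoow")
                and coded_word[-6] in consonants):
            english_word = coded_word[-6] + coded_word[:-6]
            if coded_word[-6] == 'm' and coded_word[0] not in consonants:
                english_word = '(' + english_word + ' or ' + coded_word[:-6] + ')'
        words.append(english_word)
    return " ".join(words)

def file_in_english(filename, character_limit):
    """Convert lines until the file ends or the output limit would be exceeded."""
    output = ""
    limit_hit = False
    with open(filename) as infile:
        line = infile.readline()
        while line != "" and not limit_hit:
            converted = english_sentence(line) + "\n"
            if len(output) + len(converted) > character_limit:
                # This sentence would push the output over the limit: add the marker and stop.
                output += "<<Output limit exceeded>>\n"
                limit_hit = True
            else:
                output += converted
                line = infile.readline()
    return output  # the only return, reached on every path
```

Because the result is built up in output and returned once at the end, print(file_in_english('test1.txt', 20)) never prints None.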

Python - Concatenate a variable into string format

I'm trying to retrieve the number from a file, and determine the padding of it, so I can apply it to the new file name, but with an added number. I'm basically trying to do a file saver sequencer.
Ex.:
fileName_0026
0026 = 4 digits
add 1 to the current number and keep the same amount of digit
The result should be 0027 and on.
What I'm trying to do is retrieve the padding number from the file and use the '%04d'%27 string formatting. I've tried everything I know (my knowledge is very limited), but nothing works. I've looked everywhere to no avail.
What I'm trying to do is something like this:
O=fileName_0026
P=Retrieved padding from original file (4)
CN=retrieve current file number (26)
NN=add 1 to current file number (27)
'%0 P d' % NN
Result=fileName_0027
I hope this is clear enough, I'm having a hard time trying to articulate this.
Thanks in advance for any help.
Cheers!
There's a few things going on here, so here's my approach and a few comments.
import os
def get_next_filename(existing_filename):
    prefix = existing_filename.split("_")[0]  # get string prior to _
    tail = existing_filename.split("_")[-1]  # everything after the last _, e.g. "0026.txt"
    int_string = tail.split(".")[0]  # pull out the number as a string so we can extract an integer value as well as the number of characters
    extension = tail.split(".")[-1] if "." in tail else None  # check for extension
    int_new = int(int_string) + 1  # integer value of new filename
    digits = len(int_string)  # number of characters/padding in name
    formatter = "%0" + str(digits) + "d"  # creates a statement that int_string_new can use to create a number as a string with leading zeros
    int_string_new = formatter % (int_new,)  # applies that format
    new_filename = prefix + "_" + int_string_new  # put it all together
    if extension:  # add the extension if present in original name
        new_filename += "." + extension
    return new_filename
# since we only want to do this when the file already exists, check if it exists and execute function if so
our_filename = 'file_0026.txt'
while os.path.isfile(our_filename):
    our_filename = get_next_filename(our_filename)  # loop until a unique filename found
I am writing some hints to achieve that. It's unclear what exactly you want to achieve.
import re
old_name = "fileName_0026.txt"
name = re.split(r"[_.]", old_name)  # Output: ['fileName', '0026', 'txt']
n = str(int(name[1]) + 1)  # '27'
s = n.zfill(len(name[1]))  # '0027'
new_name = "_".join([name[0], s]) + "." + name[2]  # "fileName_0027.txt"
fh = open(new_name, "w")  # Write a new file
fh.close()
Use the rjust method of strings:
O=fileName_0026
P=Retrieved padding from original file (4)
CN=retrieve current file number (26)
NN=add 1 to current file number (27)
new_padding = str(NN).rjust(P, '0')
Result=fileName_ + new_padding
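The pseudocode above could be made runnable along these lines. This is a sketch that assumes the number is the run of digits at the very end of the name (no extension); the function name next_numbered_name is made up for illustration.

```python
import re

def next_numbered_name(name):
    """Increment the trailing number of `name`, keeping its zero padding."""
    match = re.search(r"\d+$", name)
    if match is None:
        raise ValueError("no trailing number in %r" % name)
    digits = match.group(0)
    padding = len(digits)         # P: width of the original number
    new_number = int(digits) + 1  # NN: current number plus one
    # rjust pads on the left with '0' up to the original width.
    return name[:match.start()] + str(new_number).rjust(padding, "0")

print(next_numbered_name("fileName_0026"))
```

When the incremented number needs more digits than the original padding, rjust leaves it unpadded rather than truncating it.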
import re
m = re.search(r".*_(0*)(\d*)", "fileName_00023")
print(m.groups())
print("fileName_{0:04d}".format(int(m.groups()[1])+1))
{0:04d} means pad out to four digits wide with leading zeros.
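The width does not have to be hard-coded to four. As a sketch, assuming the width is taken from the length of the original number string, each of the three standard formatting styles accepts it as a parameter (the helper names are made up for illustration):

```python
def pad_percent(number, width):
    # '*' takes the field width from the argument tuple.
    return "%0*d" % (width, number)

def pad_format(number, width):
    # A nested replacement field supplies the width.
    return "{0:0{1}d}".format(number, width)

def pad_fstring(number, width):
    # Same nested-field idea in an f-string (Python 3.6+).
    return f"{number:0{width}d}"

number_string = "0026"
width = len(number_string)
next_number = int(number_string) + 1
print(pad_percent(next_number, width))
print(pad_format(next_number, width))
print(pad_fstring(next_number, width))
```

The first form is the closest match to the '%0 P d' idea in the question.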
As you can see, there are a few ways to do this that are quite similar. One thing the other answers haven't mentioned is leading zeroes: int('0026') with the default base 10 simply returns 26, but an integer literal written with a leading zero, or a string passed to int(s, 0), is interpreted as octal in Python 2, so it's safest to strip off any existing leading zeroes from your file's number string before converting it.
edit
I just realised that my previous code crashes if the file number is zero! :embarrassed:
Here's a better version that also copes with a missing file number and names with multiple or no underscores.
#! /usr/bin/env python
def increment_filename(s):
    parts = s.split('_')
    #Handle names without a number after the final underscore
    if not parts[-1].isdigit():
        parts.append('0')
    tail = parts[-1]
    try:
        n = int(tail.lstrip('0'))
    except ValueError:
        #Tail was all zeroes
        n = 0
    parts[-1] = str(n + 1).zfill(len(tail))
    return '_'.join(parts)
def main():
    for s in (
        'fileName_0026',
        'data_042',
        'myfile_7',
        'tricky_99',
        'myfile_0',
        'bad_file',
        'worse_file_',
        '_lead_ing_under_score',
        'nounderscore',
    ):
        print("'%s' -> '%s'" % (s, increment_filename(s)))
if __name__ == "__main__":
    main()
output
'fileName_0026' -> 'fileName_0027'
'data_042' -> 'data_043'
'myfile_7' -> 'myfile_8'
'tricky_99' -> 'tricky_100'
'myfile_0' -> 'myfile_1'
'bad_file' -> 'bad_file_1'
'worse_file_' -> 'worse_file__1'
'_lead_ing_under_score' -> '_lead_ing_under_score_1'
'nounderscore' -> 'nounderscore_1'
Some additional refinements possible:
An optional arg to specify the number to add to the current file number,
An optional arg to specify the minimum width of the file number string,
Improved handling of names with weird number / position of underscores.

Python Memory error during function return statement

Hi, I am processing a 600 MB file and have written the code below. It searches for a keyword in the data between <dest> tags and, if the keyword exists, adds a city tag inside the <dest> tag. It works fine for a small data set, but when I run the program on the large file it throws a MemoryError. I guess the error occurs when I use the return statement in the if condition. Can anyone please let me know how to solve this?
import re
def casp ( tx ):
    def tbcnv( st ):
        ct = ''
        prt = re.compile(r"(?i)(Slip Copy,.*?\))", re.DOTALL|re.M)
        val = re.search(prt, st)
        try:
            ct = val.group(1)
            if re.search(r"(?i)alaska", ct):
                jval = "Alaska"
                print(jval)
                if jval:
                    prt = re.compile(r"(?i)(.*?<dest.*?>)", re.DOTALL|re.M)
                    vl = re.sub(prt, "\\1\n" + "<city>" + jval + "</city>" + "\n" ,st)
                    return vl
                else:
                    return st
            else:
                return st
        except:
            print("Not available")
            return st
    pt = re.compile("(?i)(<dest.*?</dest>)", re.DOTALL|re.M)
    t = re.sub(pt, lambda m: tbcnv(m.group(1)), tx)
    return t
with open('input.txt', 'r') as content_file:
    content = content_file.read()
pt = re.compile(r"(?i)<Lrlevel level='3'>(.*?)</Lrlevel>", re.DOTALL|re.M)
content = re.sub(pt,lambda m: "<Lrlevel level='3'>" + casp(m.group(1) + "</Lrlevel>" ), content)
with open('out.txt', 'w') as out_file:
    out_file.write(content)
If you remove the return statement just before the except, then the string built by re.sub() is much smaller.
I'm getting memory usage that is 3 times the file size, which means that you'd get a MemoryError if you don't have (more than) 2GB. This is reasonable here --- or at least I can guess why. It's how re.sub() works.
This means that you're using somehow the wrong tools, as explained in the comments above. You should either use a full xml-processing tool like lxml, or if you want to stick with regular expressions, find a way to never need the whole string in memory; or at least to never call re.sub() on it (e.g. only the tx variable ever contains a big string, which is the input; and you do pt.search(tx, startpos) in a loop, locating the places to change, and writing piece by piece parts of tx).
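The search-in-a-loop idea might be sketched as follows. It is a simplification, not a drop-in replacement: it looks for the keyword anywhere inside a <dest> block rather than only in the "Slip Copy, ...)" part, and the function name and parameters are made up for illustration.

```python
import re

# Same idea as the question's <dest.*?</dest> pattern, split into opening tag and remainder.
DEST_BLOCK = re.compile(r"(<dest.*?>)(.*?</dest>)", re.DOTALL | re.IGNORECASE)

def write_with_city_tags(text, out_file, keyword="alaska", city="Alaska"):
    """Copy text to out_file, inserting a <city> tag into matching <dest> blocks."""
    pos = 0
    match = DEST_BLOCK.search(text, pos)
    while match is not None:
        out_file.write(text[pos:match.start()])  # unchanged text before the block
        out_file.write(match.group(1))           # the opening <dest ...> tag
        if keyword in match.group(2).lower():
            out_file.write("\n<city>" + city + "</city>\n")
        out_file.write(match.group(2))           # rest of the block, unchanged
        pos = match.end()
        match = DEST_BLOCK.search(text, pos)
    out_file.write(text[pos:])                   # whatever follows the last block
```

Only the input string stays in memory; the output is written piece by piece, for example with open('out.txt', 'w') as out_file: write_with_city_tags(content, out_file), instead of being rebuilt as a second large string by re.sub().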

How to encrypt all possible strings in a defined character set python?

I am trying to encrypt all possible strings in a defined character set then compare them to a hash given by user input.
This is what I currently have
import string
from itertools import product
import crypt
def decrypt():
    hash1 = input("Please enter the hash: ")
    salt = input("Please enter the salt: ")
    charSet = string.ascii_letters + string.digits
    for wordchars in product(charSet, repeat=2):
        hash2 = crypt.METHOD_CRYPT((wordchars), (salt))
        print (hash2)
Obviously it's not finished yet, but I am having trouble encrypting "wordchars".
Any help is appreciated
crypt.METHOD_CRYPT is not callable so the traceback that you provided doesn't correspond to the code in your question. crypt.METHOD_CRYPT could be used as the second parameter for crypt.crypt() function.
Also, as @martineau pointed out, wordchars is a tuple but you need a string to pass to the crypt.crypt() function.
From the docs:
Since a few crypt(3) extensions allow different values, with different
sizes in the salt, it is recommended to use the full crypted password
as salt when checking for a password.
To find a plain text from a defined character set given its crypted form: salt plus hash, you could:
from crypt import crypt
from itertools import product
from string import ascii_letters, digits
def decrypt(crypted, charset=ascii_letters + digits):
    # find hash for all 4-char strings from the charset
    # and compare with the given hash
    for candidate in map(''.join, product(charset, repeat=4)):
        if crypted == crypt(candidate, crypted):
            return candidate
Example
salt, hashed = 'qb', '1Y.qWr.DHs6'
print(decrypt(salt + hashed))
# -> e2e4
assert crypt('e2e4', 'qb') == (salt + hashed)
The assert line makes sure that calling crypt with the word e2e4 and the salt qb produces qb1Y.qWr.DHs6 where qb is the salt.
Hmm, maybe it's better to use bcrypt?
https://github.com/fwenzel/python-bcrypt
Below is a simple program that does what you asked:
def gen_word(charset, L):
    if L == 1:
        for char in charset:
            yield char
        return  # a plain return ends the generator; raising StopIteration here is a RuntimeError since Python 3.7
    for char in charset:
        for word in gen_word(charset, L - 1):
            yield char + word
def encrypt(word):
    '''Your encrypt function, replace with what you wish'''
    return word[::-1]
charset = ['1', '2', '3']
user_word = '12'
user_hash = encrypt(user_word)
max_length = 3
for length in range(1, max_length):
    for word in gen_word(charset, length):
        if encrypt(word) == user_hash:
            print('Word found: %s' % word)
Basically, it uses a Python generator to produce words of a fixed length from the charset. You can replace the encrypt function with whatever you want (in the example, string reversal is used as the hash).
Note that with actual modern hashing methods, it'll take forever to decrypt an ordinary password, so I don't think you could actually use this.
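The same search can also be written with itertools.product in place of the recursive generator. The sketch below uses SHA-256 from hashlib purely as a stand-in hash so it runs on any platform. That choice is an assumption, not what the question uses, and the names sha256_hex and find_preimage are made up for illustration.

```python
import hashlib
from itertools import product

def sha256_hex(word):
    """Stand-in hash function; swap in whatever hash is actually being tested."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()

def find_preimage(target_hash, charset, max_length):
    """Try every string of length 1..max_length; return the match or None."""
    for length in range(1, max_length + 1):
        for chars in product(charset, repeat=length):
            candidate = "".join(chars)  # product yields tuples, so join them
            if sha256_hex(candidate) == target_hash:
                return candidate
    return None

print(find_preimage(sha256_hex("cab"), "abc", 3))
```

The number of candidates grows as len(charset) ** length, which is why this only works for very short inputs.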
Here's my completely different answer based on J.F. Sebastian's answer and comment about my previous answer. The most important point being that crypt.METHOD_CRYPT is not a callable even though the documentation somewhat confusingly calls a hashing method as if it were a method function of a module or an instance. It's not -- just think of it as an id or name of one of the various kinds of encryption supported by the crypt module.
So the problem with your code is two-fold: first, you were trying to use wordchars as a string when it is actually a tuple produced by product(), and second, you're trying to call the id crypt.METHOD_CRYPT.
I'm at a bit of a disadvantage answering this because I'm not running Unix, don't have Python v3.3 installed, and don't completely understand what you're trying to accomplish with your code. Given all those caveats, I think something like the following, which is derived from your code, ought to at least run:
import string
from itertools import product
import crypt
def decrypt():
    hash1 = input("Please enter the hash: ")
    salt = input("Please enter the salt: ")
    charSet = string.ascii_letters + string.digits
    for wordchars in product(charSet, repeat=2):
        hash2 = crypt.crypt(''.join(wordchars), salt=salt)  # or salt=crypt.METHOD_CRYPT
        print(hash2)
