Understanding a custom encryption method in python


As part of an assignment I've been given some code written in python that was used to encrypt a message, and I have to try and understand the code and decrypt the ciphertext. I've never used python before and am somewhat out of my depth.
I understand most of it and the overall gist of what the code is trying to accomplish, however there are a few lines near the end tripping me up. Here's the entire thing (the &&& denotes sections of code which are supposed to be "damaged", while testing the code I've set secret to "test" and count to 3):
import string
import random
from base64 import b64encode, b64decode
secret = '&&&&&&&&&&&&&&' # We don't know the original message or length
secret_encoding = ['step1', 'step2', 'step3']
def step1(s):
    _step1 = string.maketrans("zyxwvutsrqponZYXWVUTSRQPONmlkjihgfedcbaMLKJIHGFEDCBA","mlkjihgfedcbaMLKJIHGFEDCBAzyxwvutsrqponZYXWVUTSRQPON")
    return string.translate(s, _step1)
def step2(s): return b64encode(s)
def step3(plaintext, shift=4):
    loweralpha = string.ascii_lowercase
    shifted_string = loweralpha[shift:] + loweralpha[:shift]
    converted = string.maketrans(loweralpha, shifted_string)
    return plaintext.translate(converted)
def make_secret(plain, count):
    a = '2{}'.format(b64encode(plain))
    for count in xrange(count):
        r = random.choice(secret_encoding)
        si = secret_encoding.index(r) + 1
        _a = globals()[r](a)
        a = '{}{}'.format(si, _a)
    return a
if __name__ == '__main__':
    print make_secret(secret, count=&&&)
Essentially, I assume the code is meant to choose randomly from the three encryption methods step1, step2 and step3, then apply them to the cleartext a number of times, as governed by the value of "count".
The "make_secret" method is the part that's bothering me, as I'm having difficulty working out how it ties everything together and what the overall purpose of it is. I'll go through it line by line and give my reasons on each part, so someone can correct me if I'm mistaken.
a = '2{}'.format(b64encode(plain))
This takes the base64 encoding of whatever the "plain" variable corresponds to and appends a 2 to the start of it, resulting in something like "2VGhpcyBpcyBhIHNlY3JldA==" using "this is a secret" for plain as a test. I'm not sure what the 2 is for.
r = random.choice(secret_encoding)
si = secret_encoding.index(r) + 1
r is a random selection from the secret_encoding array, while si corresponds to the next array element after r.
_a = globals()[r](a)
This is one of the parts that has me stumped. From researching globals() it seems that the intention here is to turn "r" into a global dictionary consisting of the characters found in "a", i.e. somewhere later in the code a's characters will be used as a limited character set to choose from. Is this correct or am I way off base?
I've tried printing _a, which gives me what appears to be the letters and numbers found in the final output of the code.
a = '{}{}'.format(si, _a)
It seems as if this is creating a string which is a concatenation of the si and _a variables, however I'll admit I don't understand the purpose of doing this.
I realize this is a long question, but I thought it would be best to put the parts that are bothering me into context.

I will refrain from commenting on the readability of the code. I daresay
it was all intentional, anyway, for purposes of obfuscation. Your
professor is an evil bastard and I want to take his or her course :)
r = random.choice(secret_encoding)
...
_a = globals()[r](a)
You're way off base. This is essentially an ugly and hard-to-read way to
randomly choose one of the three functions and run it on a. The
function globals() returns a dict that maps global names to the objects
they refer to; it includes the three functions and other things.
globals()[r] looks up one of the three functions based on the name r.
Putting (a) after that runs the function with a as the argument.
a = '{}{}'.format(si, _a)
The idea here is to prepend each interim result with the number of the
function that encrypted it, so you know which function you need to
reverse to decrypt that step. They all accumulate at the beginning, and
get encrypted and re-encrypted with each step, except for the last one.
a = '2{}'.format(b64encode(plain))
Essentially, this is applying step2 first. Each encryption with
step2 prepends a 2.
So, the program applies count encryptions to the plaintext, with each
step using a randomly-chosen transformation, and the choice appears in
plaintext before the ciphertext. Your task is to read each prepended
number and apply the inverse transformation to the rest of the message.
You stop when the first character is not in "123".
One problem I see is that if the plaintext begins with a digit in
"123", it will look like we should perform another decryption step. In
practice, however, I feel sure that the professor's choice of plaintext
does not begin with such a digit (unless they're really evil).
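The decryption loop described above could be sketched as follows. This is a minimal Python 3 sketch, not the assignment's reference solution: the names undo_step1, undo_step2, undo_step3, INVERSES and decrypt are made up for illustration. It relies on step1 being ROT13 (which is its own inverse) and on step3 only shifting lowercase letters by 4, and it does not handle the caveat about a plaintext that itself begins with 1, 2 or 3.

```python
import string
from base64 import b64decode

LOWER = string.ascii_lowercase
UPPER = string.ascii_uppercase

# step1 swaps a-m with n-z (ROT13), so applying it again undoes it.
ROT13 = str.maketrans(LOWER + UPPER,
                      LOWER[13:] + LOWER[:13] + UPPER[13:] + UPPER[:13])


def undo_step1(s):
    return s.translate(ROT13)


def undo_step2(s):
    # Inverse of b64encode; the payload is assumed to be ASCII text.
    return b64decode(s).decode("ascii")


def undo_step3(s, shift=4):
    # step3 maps each lowercase letter forward by `shift`; map it back.
    shifted = LOWER[shift:] + LOWER[:shift]
    return s.translate(str.maketrans(shifted, LOWER))


# Hypothetical lookup table: prefix digit -> function that reverses that step.
INVERSES = {"1": undo_step1, "2": undo_step2, "3": undo_step3}


def decrypt(ciphertext):
    text = ciphertext
    # Peel off one prefix digit at a time and reverse the matching step.
    while text[:1] in INVERSES:
        text = INVERSES[text[0]](text[1:])
    return text


print(decrypt("132uTIquN=="))
```

The sample ciphertext was built by hand from the plaintext "test": base64 with a 2 prefix, then step3 with a 3 prefix, then step1 with a 1 prefix.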

Related

Write a custom JSON interpreter for a file that looks like JSON but isn't, using Python

What I need to do is write a module that can read and write files that use the PDX script language. This language looks a lot like JSON but has enough differences that a custom encoder/decoder is needed to do anything with those files (without a mess of regex substitutions, which would make maintenance hell). I originally read them as text files and used regex find-and-replace to convert them to valid JSON. This led me to my current point, where any addition requires far more code than I would like just to support some small new thing. With a custom JSON-like parser I could write code that describes the valid key:value pairs and then use that to handle the files. To me that would be a lot less code and a lot easier to maintain.
So what does this code look like? In general it looks like this (tried to put all possible syntax, this is not an example of a working file):
#key = value # this is the definition for the scripted variable
key = {
    # This is a comment. No multiline comments
    function # This is a single key, usually optimize_memory
    # These are the accepted key:value pairs. The quoted version is being phased out
    key = "value"
    key = value
    key = #key # This key is using a scripted variable, defined either in the file its in or in the `scripted_variables` folder. (see above for example on how these are initially defined)
    # type is what the key type is. Like trigger:planet_stability where planet_stability is a trigger
    key = type:key
    # Variables like this allow for custom names to be set. Mostly used for flags and such things
    [[VARIABLE_NAME]
        math_key = $VARIABLE_NAME$
    ]
    # this is inline math, I dont actually understand how this works in the script language yet as its new. The "<" can be replaced with any math symbol.
    # Valid example: planet_stability < #[ stabilitylevel2 + 10 ]
    key < #[ key + 10 ]
    # This is used alot to handle code blocks. Valid example:
    # potential = {
    #     exists = owner
    #     owner = {
    #         has_country_flag = flag_name
    #     }
    # }
    key = {
        key = value
    }
    # This is just a list. Inline brackets are used alot which annoys me...
    key = { value value }
}
The major differences between JSON and PDX script are the nearly complete lack of quotation marks, the use of an equals sign instead of a colon as the separator, and the absence of commas at the ends of lines. Now, before you ask me to change the PDX code: I can't. It's not mine. This is what I have to work with, and I can't make any changes to the syntax. And no, I don't want to convert back and forth; as I have already mentioned, that would require a lot of work. I have looked for examples of this, but all I can find are references on converting already valid JSON to a Python object, which is not what I want. So I can't give any examples of what I have already done, as I can't find anywhere to even start.
Some additional info:
Order of key:value pairs does not technically matter, however it is expected to be in a certain order, and when not in that order causes issues with mods and conflict solvers
bool properties always use yes or no rather than true or false
Lowercase is expected and in some cases required
Math operators are used as separators as well, e.g. >=, <=, etc.
The list of syntax is not exhaustive, but should contain most of the syntax used in the language
Past work:
My last attempts at this all revolved around converting it from a text file to a JSON file. This was a lot of work just to get a small piece of it working.
Example:
potential = {
    exists = owner
    owner = {
        is_regular_empire = yes
        is_fallen_empire = no
    }
    NOR = {
        has_modifier = resort_colony
        has_modifier = slave_colony
        uses_habitat_capitals = yes
    }
}
And here is what I did to get most of the way to JSON (I couldn't find a way to add quotes):
test_string = test_string.replace("\n", ",")
test_string = test_string.replace("{,", "{")
test_string = test_string.replace("{", "{\n")
test_string = test_string.replace(",", ",\n")
test_string = test_string.replace("}, ", "},\n")
test_string = "{\n" + test_string + "\n}"
# Replace the equals sign with a colon
test_string = test_string.replace(" =", ":")
This resulted in this:
{
    potential: {
        exists: owner,
        owner: {
            is_regular_empire: yes,
            is_fallen_empire: no,
        },
        NOR: {
            has_modifier: resort_colony,
            has_modifier: slave_colony,
            uses_habitat_capitals: yes,
        },
    }
}
Very, very close, yes, but I could find no way to add quotation marks around each word (I think I did try a regex sub, but wasn't able to get it to work, since this whole thing is just one unbroken string). That left this attempt stuck, and it also shows just how much work is required to get even a very simple potential block to mostly work. However, this is not the method I want anymore: first because it's a lot of work, and second because I couldn't find anything to finish it. So a custom JSON interpreter is what I want.

The classical approach (potentially leading to more code, but also more "correctness"/elegance) is probably to build a "recursive descent parser" from a bunch of conditionals/checks, loops and (sometimes recursive) functions/handlers that deal with each of the elements/characters encountered on the input stream. An implicit parse/call tree might be sufficient if you directly output/print the JSON equivalent; otherwise you could also create a representation/model in memory for later output/conversion.
A related book recommendation would be "Language Implementation Patterns" by Terence Parr (I'm avoiding promoting my own interpreters and introductory materials :-) ). In case you need further help, maybe write to me?
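To make the recursive descent idea concrete, here is a minimal sketch for a small subset of the syntax shown in the question: key/operator/value pairs, nested { } blocks, bare lists, quoted values and comments. The names tokenize, parse_block, parse_value and parse are made up for illustration. It is an assumption of this sketch that every # starts a comment, so scripted variables (key = #key) and inline math (#[ ... ]) are not handled. Entries are kept as a list of tuples rather than a dict, because keys such as has_modifier can repeat and their order matters.

```python
import re

# Comments, braces, operators, quoted strings, then bare words.
TOKEN_RE = re.compile(r'#[^\n]*|[{}]|>=|<=|[=<>]|"[^"]*"|[^\s{}=<>"#]+')
OPERATORS = {"=", "<", ">", "<=", ">="}


def tokenize(text):
    # Whitespace is never matched, so findall skips it; comments are dropped.
    return [t for t in TOKEN_RE.findall(text) if not t.startswith("#")]


def parse_value(tokens, pos):
    """Parse one value: a nested block or a single (possibly quoted) word."""
    if pos >= len(tokens):
        raise SyntaxError("expected a value")
    tok = tokens[pos]
    if tok == "{":
        return parse_block(tokens, pos + 1)
    return tok.strip('"'), pos + 1


def parse_block(tokens, pos, top_level=False):
    """Parse entries until the matching '}' (or end of input at top level)."""
    entries = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "}":
            if top_level:
                raise SyntaxError("unexpected '}'")
            return entries, pos + 1
        if pos + 1 < len(tokens) and tokens[pos + 1] in OPERATORS:
            # key <operator> value
            op = tokens[pos + 1]
            value, pos = parse_value(tokens, pos + 2)
            entries.append((tok, op, value))
        else:
            # bare word or list item
            value, pos = parse_value(tokens, pos)
            entries.append(value)
    if not top_level:
        raise SyntaxError("missing '}'")
    return entries, pos


def parse(text):
    entries, _ = parse_block(tokenize(text), 0, top_level=True)
    return entries


sample = """
potential = {
    exists = owner  # comment
    owner = { is_regular_empire = yes }
    list = { a b }
}
"""
print(parse(sample))
```

Writing the file back out would be a second recursive function over the same structure, emitting each tuple as "key op value" and each nested list inside braces.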

I do not understand why this function reverses the string

This function requests a string input and reverses it. For some reason, I just cannot wrap my head around the logic behind it.
def reverse(s):
    new = ""
    for i in s:
        print(new)
        new = i + new
    return new
oldStr = input("String?")
newStr = reverse(oldStr)
print(newStr)
print(reverse("good bye"))
A friend suggested I print the variable new in the string which I added and it helped a little, but I just don't understand it.

It looks to me as if you are in a stage where you want to learn how to debug small programs.
For questions on StackOverflow, you want to provide a minimal reproducible example. What does that mean in your case?
"reproducible" means that you should not depend on user input. Replace user input by hardcoded values.
"minimal" means that you don't call the function twice. Once is enough.
For that reason, I'll not walk you through your original program, but through this one instead:
def reverse(s):
    new = ""
    for i in s:
        print(new)
        new = i + new
    return new
print(reverse("abcdef"))
I'll use a free program called PyCharm Community Edition. You set a breakpoint where you want to understand how things work. Since you don't understand the reverse() function, put it right at the beginning of that function. You do that by clicking between the line number and the text.
Even if your code has no bug, go to the Run menu and choose Debug.
Execution will stop at the breakpoint and you'll now be able to see all the variables and step through your code line by line.
Look at all the variables after each single line. Compare the values to what you think the values should be. If there is a mismatch, it's likely a misunderstanding on your side, not by Python. Do that a few times and it'll be obvious why and how the string is reversed.

Let's analyze it, using "good bye" as the input:
Iteration 1:
new = '', i = g
Perform new = i + new: new = g
Iteration 2:
new = g, i = o
Perform new = i + new: new = og
.
.
.
This happens because each new character i is placed in front of the string built up in the previous iterations.

The key is in "new = i + new". Note that new holds the characters from the previous iterations and sits on the right side of the current character; this causes the string to be reversed.
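The same point can be seen by recording the value of new after every iteration. The following is a small sketch; reverse_with_trace is a made-up variant of the reverse function from the question that returns the intermediate values instead of printing them.

```python
def reverse_with_trace(s):
    new = ""
    steps = []
    for ch in s:
        # The current character goes in FRONT of everything collected so far.
        new = ch + new
        steps.append(new)
    return new, steps


result, steps = reverse_with_trace("abc")
for number, value in enumerate(steps, start=1):
    print("after iteration", number, "new =", repr(value))
print("result:", result)
```

Each step shows the latest character landing at the left end, so the first character of the input ends up last.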

Extracting information from unconventional text files? (Python)

I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some python code which names a sequence of lists. They look something like this:
#PHASE = 0
x = np.array(1,2,...)
y = np.array(3,4,...)
z = np.array(5,6,...)
#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)
#PHASE = 40
...
And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file which can then be read by ascii.read() as a Table object for manipulation in a different section of code.
My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function
def makeTable(a,b,c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output
Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code
fileName_phase = makeTable(a,b,c)
Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.
Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?

If efficiency and code reuse instead of copy is the goal, I think that Classes might provide a good way. I'm going to sleep now, but I'll edit later. Here's my thoughts: create a class called FileWithArrays and use a parser to read the lines and put them inside the object FileWithArrays you will create using the class. Once that's done, you can then create a method to transform the object in a table.
P.S. A good idea for the parser is to store all the lines in a list and parse them one by one, using list.pop() to auto shrink the list. Hope it helps, tomorrow I'll look more on it if this doesn't help a lot. Try to rewrite/reformat the question if I misunderstood anything, it's not very easy to read.

I will suggest a way which will be scorned by many but will get your work done.
So apologies to every one.
The prerequisite for this method is that you absolutely trust the correctness of the input files, which I guess you do (after all, they come from your collaborator).
So the key point here is that the text in the file is code which means it can be executed.
So you can do something like this
import re
import numpy as np # this is for the actual code in the files. You might have to install numpy library for this to work.
file = open("xyz.txt")
content = file.read()
Now that you have all the content, you have to separate it by phase.
For this we will use the re.split function.
phase_data = re.split("#PHASE = .*\n", content)
Now we have the content of each phase in an array.
Now comes for the part of executing it.
for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z) # the x, y and z are defined by the exec.
    # do whatever you want with the table.
I will reiterate that you have to absolutely trust the contents of the file. Since you are executing it as code.
But your work seems like a scripting one and I believe this will get your work done.
PS : The other "safer" alternative to exec is to have a sandboxing library which takes the string and executes it without affecting the parent scope.

To avoid the safety issue of using exec as suggested by @Ajay Brahmakshatriya, but keeping his first processing step, you can create your own minimal 'phase parser', something like:
import numpy as np
from astropy.table import Table

VARS = 'xyz'
def makeTable(phase):
    # phase is the text of one phase block, so split it into lines first
    lines = phase.strip().splitlines()
    assert len(lines) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in lines[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        # expects the form np.array([ ... ])
        assert arr[:10]=='np.array([' and arr[-2:]=='])'
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output
and then call
table = makeTable(phase)
instead of
exec(phase)
table = makeTable(x, y, z)
You could also skip all these assert statements without compromising safety; if the file is corrupted or not formatted as expected, the error that is thrown might just be harder to understand.
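A variant of the same idea that needs only the standard library is to pull the phase labels and the numbers out with regular expressions. This is a sketch under assumptions: parse_phases, PHASE_RE and ARRAY_RE are made-up names, the lines are assumed to look like x = np.array([1,2,3]) (brackets optional), and the elided "..." from the question's sample is assumed not to appear in the real files.

```python
import re

# "#PHASE = 30" at the start of a line; the label is captured.
PHASE_RE = re.compile(r'^#PHASE = (\S+)[ \t]*$', re.MULTILINE)
# "x = np.array([1,2,3])" or "x = np.array(1,2,3)"; name and numbers captured.
ARRAY_RE = re.compile(
    r'^(\w+)[ \t]*=[ \t]*np\.array\(\[?([^\])]*)\]?\)[ \t]*$', re.MULTILINE)


def parse_phases(text):
    """Return {phase_label: {column_name: [floats]}} without executing anything."""
    parts = PHASE_RE.split(text)
    # With one capture group, split() yields [preamble, label1, body1, label2, body2, ...]
    phases = {}
    for label, body in zip(parts[1::2], parts[2::2]):
        columns = {}
        for name, numbers in ARRAY_RE.findall(body):
            columns[name] = [float(n) for n in numbers.split(',') if n.strip()]
        phases[label] = columns
    return phases


sample = (
    "#PHASE = 0\n"
    "x = np.array([1,2])\n"
    "y = np.array([3,4])\n"
    "#PHASE = 30\n"
    "x = np.array([1,4])\n"
)
print(parse_phases(sample))
```

Each inner dictionary holds the x, y and z lists for one phase, which could then be handed to the makeTable function from the question and written out per phase.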

Algorithm for A/B testing

I need to develop an A/B testing method for my users. Basically I need to split my users into a number of groups - for example 40% and 60%.
I have around 1,000,00 users and I need to know what would be my best approach. Random numbers are not an option because the users will get different results each time. My second option is to alter my database so each user will have a predefined number (randomly generated). The negative side is that if I get 50 for example, I will always have that number unless I create a new user. I don't mind but I'm not sure that altering the database is a good idea for that purpose.
Are there any other solutions so I can avoid that?

Run a simple algorithm against the primary key. For instance, if you have an integer for user id, separate by even and odd numbers.
Use a mod function if you need more than 2 groups.
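As a sketch of the modulo idea for an uneven split such as 40%/60%: group_by_id and percent_a are made-up names, and the approach assumes integer user ids that are spread roughly evenly (for example auto-increment keys).

```python
def group_by_id(user_id, percent_a=40):
    # The remainder is always in 0..99, so the first `percent_a` values go to A.
    return "A" if user_id % 100 < percent_a else "B"


for uid in (7, 39, 40, 140, 1000001):
    print(uid, group_by_id(uid))
```

The same user id always lands in the same group, and nothing has to be stored in the database.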

Well you are using MySQL so whether it's a good idea or not, it's hard to tell. Altering databases could be costly. Also it could affect performance in the long run if it starts getting bigger. Also you would have to modify your system to include that number in the database for every new user. You have tagged this as a python question. So here is another way of doing it without making any changes to the database. Since you are talking about users you probably have a unique identifier for all of them, let's say e-mail. Instead of email I'll be using uuid's.
import hashlib
def calculateab(email):
    maxhash = 16**40
    emailhash = int(hashlib.sha1(email).hexdigest(), 16)
    div = (maxhash/100)-1
    return int(float(emailhash/div))
#A small demo
if __name__ == '__main__':
    import uuid, time, json
    emails = []
    verify = {}
    for i in range(1000000):
        emails.append(str(uuid.uuid4()))
    starttime = time.time()
    for i in emails:
        ab = calculateab(i)
        if ab not in verify:
            verify[ab] = 1
        else:
            verify[ab] += 1
    #json for your eye's pleasure
    print json.dumps(verify, indent = 4)
    #if you look at the numbers, you'll see that they are well distributed so
    #unless you are going to do that every second for all users, it should work fine
    print "total calculation time {0} seconds".format((time.time() - starttime))
Not that much to do with python, more of a math solution. You could use md5, sha1 or anything along those lines, as long as it has a fixed length and it's a hex number. The -1 on the 6-th line is optional - it sets the range from 0 to 99 instead of 1 to 100. You could also modify that to use floats which will give you a greater flexibility.
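For reference, a Python 3 sketch of the same hashing idea follows. Note that it swaps in a modulo on the hash value instead of the division used above, which keeps the bucket strictly in the range 0 to 99. The names bucket, group_by_hash and percent_a are made up for illustration.

```python
import hashlib


def bucket(identifier, buckets=100):
    # Python 3 needs bytes for hashing, so encode the string first.
    digest = hashlib.sha1(identifier.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def group_by_hash(identifier, percent_a=40):
    # Buckets 0..percent_a-1 form group A, the rest group B.
    return "A" if bucket(identifier) < percent_a else "B"


print(group_by_hash("user@example.com"))
```

Because the hash depends only on the identifier, a user gets the same group on every request, and changing percent_a later moves the boundary without touching stored data.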

I would add an auxiliary table with just userId and A/B. You do not change the existing table, and it is easy to change the percentage per class if you ever need to. It is minimally invasive.

Here is the JS one liner:
const AB = (str) => parseInt(sha1(str).slice(0, 1), 16) % 2 === 0 ? 'A': 'B';
and the result for 10 million random emails:
{ A: 5003530, B: 4996470 }

Caeser cipher encryption from file in Python

My Caeser cipher works interactively in the shell with a string, but when I've tried to undertake separate programs to encrypt and decrypt I've run into problems, I don't know whether the input is not being split into a list or not, but the if statement in my encryption function is being bypassed and defaulting to the else statement that fills the list unencrypted. Any suggestions appreciated. I'm using FileUtilities.py from the Goldwasser book. That file is at http://prenhall.com/goldwasser/sourcecode.zip in chapter 11, but I don't think the problem is with that, but who knows. Advance thanks.
#CaeserCipher.py
class CaeserCipher:
    def __init__ (self, unencrypted="", encrypted=""):
        self._plain = unencrypted
        self._cipher = encrypted
        self._encoded = ""
    def encrypt (self, plaintext):
        self._plain = plaintext
        plain_list = list(self._plain)
        i = 0
        final = []
        while (i <= len(plain_list)-1):
            if plain_list[i] in plainset:
                final.append(plainset[plain_list[i]])
            else:
                final.append(plain_list[i])
            i+=1
        self._encoded = ''.join(final)
        return self._encoded
    def decrypt (self, ciphertext):
        self._cipher = ciphertext
        cipher_list = list(self._cipher)
        i = 0
        final = []
        while (i <= len(cipher_list)-1):
            if cipher_list[i] in cipherset:
                final.append(cipherset[cipher_list[i]])
            else:
                final.append(cipher_list[i])
            i+=1
        self._encoded = ''.join(final)
        return self._encoded
    def writeEncrypted(self, outfile):
        encoded_file = self._encoded
        outfile.write('%s' %(encoded_file))
#encrypt.py
from FileUtilities import openFileReadRobust, openFileWriteRobust
from CaeserCipher import CaeserCipher
caeser = CaeserCipher()
source = openFileReadRobust()
destination = openFileWriteRobust('encrypted.txt')
caeser.encrypt(source)
caeser.writeEncrypted(destination)
source.close()
destination.close()
print 'Encryption completed.'

Change
caeser.encrypt(source)
into
caeser.encrypt(source.read())
source is a file object. The fact that this code "works" (by not encrypting anything) is interesting: it turns out that you call list() on source before iterating, and that turns it into a list of the lines in the file instead of the usual result of list(string), which is a list of characters. So when it tries to encrypt each character, it finds a whole line that doesn't match any of the replacements you set.
Also like others pointed out, you forgot to include plainset in the code, but that doesn't really matter.
A few random notes about your code (probably nitpicking you didn't ask for, heh)
You typo'd "Caesar"
You're using idioms which are inefficient in python (what's usually called "not pythonic"), some of which might come from experience with other languages like C.
Those while loops could be for item in string: - strings already work as lists of bytes like what you tried to convert.
The line that writes to outfile could be just outfile.write(self._encoded)
Both functions are very similar, almost copy-pasted code. Try to write a third function that shares the functionality of both but has two "modes", encrypt and decrypt. You could just make it work over cipher_list or plain_list depending on the mode, for example
I know you're doing this for practice but the standard library includes these functions for this kind of replacements. Batteries included!
Edit: if anyone is wondering what those file functions do and why they work, they call raw_input() inside a while loop until there's a suitable file to return. openFileWriteRobust() has a parameter that is the default value in case the user doesn't input anything. The code is linked on the OP post.
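The difference between list(file) and list(string) is easy to check in isolation. In this small sketch, io.StringIO stands in for the opened file so that no file on disk is needed; that substitution is an assumption of the example, not part of the original code.

```python
import io

# A file-like object with two lines of text.
source = io.StringIO("abc\ndef\n")

# Iterating a file object yields whole lines, so list() gives a list of lines.
as_lines = list(source)

# read() returns the whole content as one string; list() then splits it
# into individual characters, which is what the cipher loop expects.
source.seek(0)
as_chars = list(source.read())

print(as_lines)
print(as_chars)
```

With the list of lines, every element is a multi-character string, so a lookup keyed on single characters never matches and the text passes through unencrypted.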

Some points:
Using a context manager (with) makes sure that files are closed after being read or written.
Since the caesar cipher is a substitution cipher where the shift parameter is the key, there is no need for a separate encrypt and decrypt member function: they are the same but with the "key" negated.
The writeEncrypted method is but a wrapper for a file's write method. So the class has effectively only two methods, one of which is __init__.
That means you could easily replace it with a single function.
With that in mind your code can be replaced with this:
import string
def caesartable(shift):
    # build a translation table for the given shift (Python 2 string.maketrans)
    shift = int(shift)
    if shift > 25 or shift < -25:
        raise ValueError('illegal shift value')
    az = string.ascii_lowercase
    AZ = string.ascii_uppercase
    eaz = az[-shift:]+az[:-shift]
    eAZ = AZ[-shift:]+AZ[:-shift]
    tt = string.maketrans(az + AZ, eaz + eAZ)
    return tt
enc = caesartable(3) # for example. decrypt would be caesartable(-3)
with open('plain.txt') as inf:
    txt = inf.read()
with open('encrypted.txt', 'w+') as outf:
    outf.write(txt.translate(enc))
If you are using a shift of 13, you can use the built-in rot13 encoder instead.
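For readers on Python 3, where string.maketrans no longer exists, a rough equivalent can be written with str.maketrans. This is a sketch with made-up names (caesar_table, caesar), and it shifts forward (a becomes d for a shift of 3), which is the opposite direction from the table built above.

```python
import string


def caesar_table(shift):
    # Normalise so negative shifts (decryption) work with plain slicing.
    shift %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    return str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift])


def caesar(text, shift):
    # Characters outside a-z and A-Z are left untouched by translate().
    return text.translate(caesar_table(shift))


secret = caesar("Hello, World!", 3)
print(secret)
print(caesar(secret, -3))
```

Decryption is the same function called with the negated shift, which is the point made above about not needing separate encrypt and decrypt methods.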

It isn't obvious to me that there will be anything in source after the call to openFileReadRobust(). I don't know the specification for openFileReadRobust() but it seems like it won't know what file to open unless there is a filename given as a parameter, and there isn't one.
Thus, I suspect source is empty, thus plain is empty too.
I suggest printing out source, plaintext and plain to ensure that their values are what you expect them to be.
Parenthetically, the openFileReadRobust() function doesn't seem very helpful to me if it can return non-sensical values for non-sensical parameter values. I very much prefer my functions to throw an exception immediately in that sort of circumstance.
