How to find parenthesis-bound strings in Python

I'm learning Python and wanted to automate one of my assignments in a cybersecurity class.
I'm trying to figure out how to extract the parts of a file that are enclosed in a pair of parentheses. The contents of the (.txt) file look like:
cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative
And here is my code so far:
import sys, os, subprocess, glob, shutil
# Finding the .jpg files that will be copied.
sourcepath = os.getcwd() + '\\imgs\\'
destpath = 'stegdetect'
rawjpg = glob.glob(sourcepath + '*.jpg')
# Copying the said .jpg files into the destpath variable
for filename in rawjpg:
    shutil.copy(filename, destpath)
# Asks user for what password file they want to use.
passwords = raw_input("Enter your password file with the .txt extension:")
shutil.copy(passwords, 'stegdetect')
# Navigating to stegdetect. Feel like this could be abstracted.
os.chdir('stegdetect')
# Preparing the arguments then using subprocess to run
args = "stegbreak.exe -r rules.ini -f " + passwords + " -t p *.jpg"
# Uses open to open the output file, and then write the results to the file.
with open('cracks.txt', 'w') as f: # opens cracks.txt and prepares to write
    subprocess.call(args, stdout=f)
# Processing what's in the new file.
f = open('cracks.txt')

If the text just needs to be bounded by ( and ), you can use the following regex. It requires an opening ( and a closing ) and allows letters, digits, and spaces between them. Add any other symbols you want to allow to the character class; the sample passwords in the question contain characters such as ; * ! and #, which this class does not cover yet.
[\(][a-z A-Z 0-9]*[\)]
[\(] - matches the opening parenthesis
[a-z A-Z 0-9]* - matches all the text inside the parentheses
[\)] - matches the closing parenthesis
So for the input sdfsdfdsf(sdfdsfsdf)sdfsdfsdf, the match will be (sdfdsfsdf)
Test this regex here: https://regex101.com/
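Since the sample passwords contain punctuation, a negated character class may be simpler than listing every allowed symbol. The following is only a sketch in Python 3, assuming a password never contains a closing parenthesis itself; the name extract_parenthesized is made up for illustration.

```python
import re

# Match "(", capture everything up to the next ")", then match ")".
PAREN_RE = re.compile(r'\(([^)]*)\)')

def extract_parenthesized(text):
    """Return the contents of every (...) group found in text."""
    return PAREN_RE.findall(text)

sample = "cow.jpg : jphide[v5](asdfl;kj88876)\ndog.jpg : negative"
print(extract_parenthesized(sample))  # ['asdfl;kj88876']
```

Lines without parentheses, such as the "negative" result, simply produce no match.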

I'm learning Python
If you are learning, you should consider alternative implementations, not only regexps.
To iterate over a text file line by line, open the file and loop over the file handle with for:
with open('file.txt') as f:
    for line in f:
        do_something(line)
Each line is a string with the line contents, including the end-of-line character '\n'. To find the start index of a specific substring in a string you can use find:
>>> A = "hello (world)"
>>> A.find('(')
6
>>> A.find(')')
12
To get a substring from the string you can use the slice notation in the form:
>>> A[6:12]
'(world'
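Note that the end index of a slice is exclusive and that A[6:12] still includes the opening parenthesis. Putting find and slicing together could look like the sketch below (Python 3). The helper name between_parens is hypothetical, and it assumes you want the text between the first '(' and the last ')' on the line.

```python
def between_parens(line):
    """Return the text between the first '(' and the last ')', or None."""
    start = line.find('(')
    end = line.rfind(')')
    if start == -1 or end == -1 or end < start:
        return None  # no complete pair of parentheses on this line
    # Skip the "(" itself; the end index is exclusive, so ")" is left out.
    return line[start + 1:end]

print(between_parens("hello (world)"))       # world
print(between_parens("dog.jpg : negative"))  # None
```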

You should use regular expressions, which are implemented in the Python re module.
A simple regex like \(.*\) would match your parenthesized string,
but it is better to use a group, \((.*)\), which lets you get only the content inside the parentheses.
import re
test_string = """cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative"""
REGEX = re.compile(r'\((.*)\)', re.MULTILINE)
print(REGEX.findall(test_string))
# ['asdfl;kj88876', '65498ghjk;0-', 'poi098*/8!##', 'sJ*=tT#&Ve!2', 'nKFdFX+C!:V9', '!~rFX3FXszx6', 'X&aC$|mg!wC2', 'pe8f%yC$V6Z3']
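With a single group, findall returns only the passwords, so the file names are lost. If you also want to know which image each password belongs to, one option is to match line by line. This is a sketch in Python 3 that assumes every cracked line has the form "name : jphide[v5](password)" as in the sample output; parse_cracks is a hypothetical helper, not part of any library.

```python
import re

# File name, " : ", a tool tag such as jphide[v5], then the password in parentheses.
LINE_RE = re.compile(r'^(?P<name>\S+) : \S+?\((?P<password>.*)\)\s*$')

def parse_cracks(lines):
    """Map each image name to its recovered password; skip all other lines."""
    found = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            found[match.group('name')] = match.group('password')
    return found

sample = [
    "cow.jpg : jphide[v5](asdfl;kj88876)\n",
    "dog.jpg : negative\n",
]
print(parse_cracks(sample))  # {'cow.jpg': 'asdfl;kj88876'}
```

Because iterating over an open file yields lines, the same function should also accept a file object opened on cracks.txt.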

Related

Python: Make List of Matching Patterns for Subprocess Call to pcregrep multiline

TLDR: Is there a clean way to make a list of entries for subprocess.check_output('pcregrep', '-M', '-e', pattern, file)?
I'm using python's subprocess.check_output() to call pcregrep -M. Normally I would separate results by calling splitlines() but since I'm looking for a multiline pattern, that won't work. I'm having trouble finding a clean way to create a list of the matching patterns, where each entry of the list is an individual matching pattern.
Here's a simple example file I'm running pcregrep on:
module test_module(
    input wire in0,
    input wire in1,
    input wire in2,
    input wire cond,
    input wire cond2,
    output wire out0,
    output wire out1
);
assign out0 = (in0 & in1 & in2);
assign out1 = cond1 ? in1 & in2 :
              cond2 ? in1 || in2 :
              in0;
Here's (some of) my python code
#!/usr/bin/env python
import subprocess, re
output_str = subprocess.check_output(['pcregrep', '-M', '-e', r"^\s*assign\s+\bout0\b[^;]+;",
                                      "/home/<username>/pcregrep_file.sv"]).split(';')
# Print out the matches
for idx, line in enumerate(output_str):
    print "output_str[%d] = %s" % (idx, line)
# Drop the whitespace-only list entries
output_str = [line for line in output_str if re.search(r'\S', line)]
Here is the output
output_str[0] =
assign out0 = in0 & in1 & in2
output_str[1] =
assign out1 = cond1 ? in1 & in2 :
cond2 ? in1 || in2 :
in0
output_str[2] =
It would be nice if I could do something like
output_list = subprocess.check_output('pcregrep', -M, -e, <pattern>, <file>).split(<multiline_delimiter>)
without creating garbage to clean up (whitespace list entries) or even to have a delimiter to split() on that is independent on the pattern.
Is there a clean way to create a list of the matching multiline patterns?
Per Casimir et Hippolyte's comment and the very helpful post "How do I re.search or re.match on a whole file without reading it all into memory?", I read the file in Python instead of making an external call to pcregrep, and used re.findall(pattern, file, re.MULTILINE).
Full solution (which only slightly modifies the referenced post):
#!/usr/bin/env python
import re, mmap
filename = "/home/<username>/pcregrep_file.sv"
with open(filename, 'r+') as f:
    data = mmap.mmap(f.fileno(), 0)
    output_str = re.findall(r'^\s*assign\s+\bct_ela\b[^;]+;', data, re.MULTILINE)
    for i, l in enumerate(output_str):
        print "output_str[%d] = '%s'" % (i,l)
which creates the desired list.
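The same idea can be tried on a small in-memory string without mmap. Below is a sketch in Python 3 that uses a shortened copy of the example file from the question; the \d+ after "out" is my own generalization so that both assignments match. On Python 3, searching an mmap object would need a bytes pattern (rb'...'), because mmap objects are bytes-like.

```python
import re

source = """assign out0 = (in0 & in1 & in2);
assign out1 = cond1 ? in1 & in2 :
              cond2 ? in1 || in2 :
              in0;
"""

# [^;]+ also matches newlines, so each match runs on to the terminating ";".
ASSIGN_RE = re.compile(r'^\s*assign\s+\bout\d+\b[^;]+;', re.MULTILINE)

matches = ASSIGN_RE.findall(source)
for idx, text in enumerate(matches):
    print("matches[%d] = %r" % (idx, text))
```

Each list entry is one complete statement, so there is nothing to split on and no empty entries to clean up afterwards.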
Don't do that. If you can't use the Python regular expression module for some reason, just use the Python bindings for pcre.

Find, decode and replace all base64 values in text file

I have a SQL dump file that contains text with html links like:
<a href="http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=">attached file</a>
I'd like to find, decode and replace the base64 part of the text in each of these links.
I've been trying to use Python w/ regular expressions and base64 to do the job. However, my regex skills are not up to the task.
I need to select any string that starts with
'getattachment.php?data='
and ends with
'"'
I then need to decode the part between 'data=' and '&quot' using base64.b64decode()
results should look something like:
<a href="http://blahblah.org/kb/4/Topcon_data-download_howto.pdf">attached file</a>
I think the solution will look something like:
import re
import base64
with open('phpkb_articles.sql') as f:
    for line in f:
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64, line)
Any ideas?
EDIT: Answer for anyone who's interested.
import re
import base64
import sys
def decode_base64(s):
    """
    Method to decode base64 into ascii
    """
    # fix escaped equal signs in some base64 strings
    base64_string = re.sub('%3D', '=', s.group(1))
    decodedString = base64.b64decode(base64_string)
    # replace '|' with '/'
    decodedString = re.sub(r'\|', '/', decodedString)
    # escape the spaces in file names
    decodedString = re.sub(' ', '%20', decodedString)
    # print 'assets/' + decodedString + '&quot' # Print for debug
    return 'assets/' + decodedString + '&quot'
count = 0
pattern = r'getattachment\.php\?data=([^&]+?)&quot'
# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except TypeError:
            # output unedited line if decoding fails to prevent corruption
            sys.stdout.write(line)
            # print line
        count += 1
You already have it; you just need the small pieces.
The pattern r'data=([^&]+?)&quot' will match anything after data= and before &quot:
>>> pat = r'data=([^&]+?)&quot'
>>> line = '<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>'
>>> decodeString = re.search(pat,line).group(1) # the base64 string is captured by the group, so we only want group(1)
>>> decodeString
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='
You can then use the str.replace() method together with base64.b64decode() to finish the rest. I don't want to just write your code for you, but this should give you a good idea of where to go.
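For reference, the pieces can be combined roughly as follows. This is only a sketch in Python 3: the sample line is an assumption (it supposes the dump stores the quote after the base64 value as the literal text &quot;), and decode_match is a made-up helper name.

```python
import base64
import binascii
import re

# Assumed sample: the closing quote appears as the literal entity "&quot;".
line = ('<a href=&quot;http://blahblah.org/kb/getattachment.php?data='
        'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>')

PATTERN = re.compile(r'getattachment\.php\?data=([^&]+?)&quot')

def decode_match(match):
    """Replace one matched link tail with its decoded path."""
    try:
        decoded = base64.b64decode(match.group(1)).decode('ascii')
    except (binascii.Error, UnicodeDecodeError):
        return match.group(0)  # leave undecodable values untouched
    # The decoded value uses "|" where the target URL needs "/".
    return decoded.replace('|', '/') + '&quot'

result = PATTERN.sub(decode_match, line)
print(result)
```

Passing a function to re.sub means the decoding runs once per match, so a line with several links is handled in one call.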

re.match doesn't pick up on txt file format

import os.path
import re
def request ():
    print ("What file should I write to?")
    file = input ()
    thing = os.path.exists (file)
    if thing == True:
        start = 0
    elif re.match ("^.+.\txt$", file):
        stuff = open (file, "w")
        stuff.write ("Some text.")
        stuff.close ()
        start = 0
    else:
        start = 1
    go = "yes"
    list1 = (start, file, go)
    return list1
start = 1
while start == 1:
    list1 = request ()
    (start, file, go) = list1
Whenever I enter Thing.txt as the text, the elif should catch that it's in the format given. However, start doesn't change to 0, and a file isn't created. Have I formatted the re.match incorrectly?
"^.+.\txt$" is an incorrect pattern for matching .txt files: in a regular string \t is a tab character, and the dot before it is not escaped. You can use the following regex instead:
r'^\w+\.txt$'
\w matches word characters (letters, digits, and the underscore). If you want the file name to contain only letters, you could use [a-zA-Z] instead:
r'^[a-zA-Z]+\.txt$'
Note that you need to escape the . because it is a special character in regular expressions:
re.match (r'^\w+\.txt$',file)
As an alternative for matching file names with a specific extension, you can use endswith():
file.endswith('.txt')
Also, instead of if thing == True you can just write if thing:, which is more Pythonic.
You should escape the second dot and unescape the "t" character:
re.match (r"^.+\.txt$", file)
Also note that you don't really need a regex for this; you can simply use endswith, or use a module that gives you the file extension:
import os
fileName, fileExtension = os.path.splitext('your_file.txt')
fileExtension is .txt, which is exactly what you're looking for.
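To see the difference between the patterns discussed above, they can be compared side by side. A small sketch in Python 3; the helper name is_txt_name is hypothetical.

```python
import os.path
import re

def is_txt_name(name):
    """True if name ends with the .txt extension (case-sensitive)."""
    return os.path.splitext(name)[1] == '.txt'

# "\t" in a normal string literal is a tab, so the question's pattern
# only matches names that contain a tab followed by "xt".
broken = re.compile("^.+.\txt$")
fixed = re.compile(r"^.+\.txt$")

print(bool(broken.match("Thing.txt")))  # False
print(bool(fixed.match("Thing.txt")))   # True
print(is_txt_name("Thing.txt"))         # True
print("Thing.txt".endswith(".txt"))     # True
```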

how to manipulate SREC file

I have an S19 file looking something like below:
S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8
I want to separate the first two characters and also the next two characters, and so on... I want it to look like below (last two characters are also to be separated for each line):
S0, 03, 0000, FC
S3, 0D, 0003C000, 0F00000000000000, 20
S3, FD, 00000000, 782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, ED, 000000F8, 3D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, 15, 00000400, FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF, 7D
S3, FD, 00000410, 10B5DFF828000468012147F22C10C4F20300016047F22010C4F20300, 00
S7, 05, 00008EB4, B8
How can I do this in Python?
I have something like this:
#!/usr/bin/python
import string,os,sys,re,fileinput
print "hi"
inputfile = "k60.S19"
outputfile = "k60_out.S19"
# open the source file and read it
fh = file(inputfile, 'r')
subject = fh.read()
fh.close()
# create the pattern object. Note the "r". In case you're unfamiliar with Python
# this is to set the string as raw so we don't have to escape our escape characters
pattern2 = re.compile(r'S3')
pattern3 = re.compile(r'S7')
pattern1 = re.compile(r'S0')
# do the replace
result1 = pattern1.sub("S0, ", subject)
result2 = pattern2.sub("S3, ", subject)
result3 = pattern3.sub("S7, ", subject)
# write the file
f_out = file(outputfile, 'w')
f_out.write(result1)
f_out.write(result2)
f_out.write(result3)
f_out.close()
#EoF
But it is not working the way I'd like. Can someone help me come up with a proper regular expression for this?
Try the bincopy package; it may be what you need.
bincopy - Interpret strings as packed binary data
Mangling of various file formats that conveys binary information (Motorola S-Record, Intel HEX and binary files).
import bincopy
f = bincopy.BinFile()
f.add_srec_file("path/to/your/s19/file.s19")
f.as_binary() # returns the s19 contents as binary data
Or you can simply use open() on the file and slice each line:
with open("path/to/your/s19/file.s19") as s19:
    for line in s19:
        line = line.strip() # drop the trailing newline before slicing
        rec_type = line[0:2]
        count = line[2:4]
        address = line[4:12] # 8 hex characters, as in the S3 and S7 records
        data = line[12:-2]
        crc = line[-2:]
        print(rec_type + ", " + count + ", " + address + ", " + data + ", " + crc)
Hope it helps.
Motorola S-record file format
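The address field is not the same width for every record type (the S0 line in the question has a 4-character address, the S3 and S7 lines have 8), so a slicing approach needs to pick the width per type. A sketch in Python 3 under that assumption; split_srec_line and ADDRESS_HEX_CHARS are names invented for this example, and the checksum is not verified.

```python
# Address width in hex characters for each S-record type.
ADDRESS_HEX_CHARS = {
    'S0': 4, 'S1': 4, 'S5': 4, 'S9': 4,  # 16-bit address or count
    'S2': 6, 'S6': 6, 'S8': 6,           # 24-bit address or count
    'S3': 8, 'S7': 8,                    # 32-bit address
}

def split_srec_line(line):
    """Split one S-record into type, count, address, data and checksum."""
    line = line.strip()
    rec_type = line[0:2]
    addr_end = 4 + ADDRESS_HEX_CHARS[rec_type]
    fields = [rec_type, line[2:4], line[4:addr_end], line[addr_end:-2], line[-2:]]
    # Some records (for example the S7 line) carry no data; drop the empty field.
    return ", ".join(field for field in fields if field)

print(split_srec_line("S0030000FC"))      # S0, 03, 0000, FC
print(split_srec_line("S70500008EB4B8"))  # S7, 05, 00008EB4, B8
```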
You can do it using a callback function as replacement with re.sub:
#!/usr/bin/python
import re
data = r'''S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8'''
pattern = re.compile(r'^(..)(..)((?:.{4}){1,2})(.*)(?=..)', re.M)
def repl(m):
    repstr = ''
    for g in m.groups():
        if (g):
            repstr += g + ', '
    return repstr
print re.sub(pattern, repl, data)
However, as Mark Setchell points out, there is probably a nicer way to do it with slicing.
I know you are thinking Python and regexes, but this was made for awk and the following will maybe help you work out the way to do it using slicing:
awk '{r=length($0);print substr($0,1,2),substr($0,3,2),substr($0,5,8),substr($0,13,r-14),substr($0,r-1)}' OFS=, k60.s19
That says "get the length of the line in variable r, then print the first two characters, the next two characters, the next 8 characters and so on... and use a comma as the field separator".
EDITED
Here are a few more hints to get you started...
if you want to avoid printing line 1, you can do
awk 'FNR==1{next} ...rest of awk script above ... '
If you want to only process lines longer than 40 characters, you can do
awk 'length($0)>40 {print}' yourfile
If you only want to process lines where the second field is "xx", you can do
awk '$2 ~ "xx" {print}' yourfile

Ignore characters in quotation marks inside a find and replace algorithm

I have been wondering how I can make Python ignore characters inside double quotation marks (") in my find and replace function. My code is:
def findAndReplace(textToSearch, textToReplace,fileToSearch):
    oldFileName = 'old-' + fileToSearch
    tempFileName = 'temp-' + fileToSearch
    tempFile = open( tempFileName, 'w' )
    for line in fileinput.input( fileToSearch ):
        tempFile.write( line.replace( textToSearch, textToReplace ) )
    tempFile.close()
    # Rename the original file by prefixing it with 'old-'
    os.rename( fileToSearch, oldFileName )
    # Rename the temporary file to what the original was named...
    os.rename( tempFileName, fileToSearch )
Suppose that our file (test.txt) has contents (THIS IS OUR ACTUAL TEXT):
I like your code "I like your code"
and I execute
findAndReplace('code','bucket','test.txt')
which will write the following to my file:
I like your bucket "I like your bucket"
However, I want it to skip the double-quoted part and get this as a result
I like your bucket "I like your code"
What should I add to my source code?
Thanks in advance
haystack = 'I like your code "I like your code"'
needle = "code"
replacement = "bucket"
parts = haystack.split('"')
for i in range(0,len(parts),2):
    parts[i] = parts[i].replace(needle,replacement)
print '"'.join(parts)
assuming you cannot have nested quotes ...
If you don't need to handle quotes inside quotes, or anything like that, this is pretty easy. You could do it with regular expressions. But, since I'm guessing you don't know regexps (or you would have used them in the first place), let's do it with simple string methods: split your string on quote characters, replace only in the even-indexed substrings (the ones outside the quotes), then join it back together:
for line in fileinput.input( fileToSearch ):
    bits = line.split('"')
    bits[::2] = [bit.replace(textToSearch, textToReplace) for bit in bits[::2]]
    tempFile.write('"'.join(bits))
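For comparison, the regular-expression route mentioned above could look like the sketch below (Python 3). It assumes quotes are balanced and not nested and that the search text itself contains no double quotes; replace_outside_quotes is a hypothetical name.

```python
import re

def replace_outside_quotes(text, needle, replacement):
    """Replace needle in text, leaving double-quoted sections untouched."""
    # Alternation: either a whole quoted section, or the needle itself.
    pattern = re.compile(r'"[^"]*"|' + re.escape(needle))

    def swap(match):
        found = match.group(0)
        # Quoted sections are matched only so that they can be skipped.
        return found if found.startswith('"') else replacement

    return pattern.sub(swap, text)

print(replace_outside_quotes('I like your code "I like your code"', 'code', 'bucket'))
# I like your bucket "I like your code"
```

Because the quoted alternative is tried first at each position, any occurrence of the search text inside quotes is consumed as part of the quoted section and returned unchanged.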
