regex for substring on a string then replace text

I have a text file that I'm reading as:
with open(file,'r+') as f:
    file_data = f.read()
file_data is a long string that contains the following text:
'''This file starts here dig_1 = hello\n doge ras = friend\n sox = pie\n'''
I want to search for dig_1, take all the text after the '=' up to the newline character \n, and replace it with different text, so that dig_1 = hello\n becomes dig_1 = unknown. I want to do the same with the others (ras = friend\n becomes ras = unknown and sox = pie\n becomes sox = unknown). Is there an easy way to do this using regex?

You can use the sub function of Python's re module.
The pattern you want to replace is an equals sign followed by a space and the rest of the line, up to and including the trailing newline. The replacement has to put the '= ' and the newline back, otherwise they are dropped from the result.
# import re module
import re
# original text
txt = '''This file starts here dig_1 = hello\n doge ras = friend\n sox = pie\n'''
# pattern to look for: '= ' followed by the rest of the line and its newline
pattern = r'= .*?\n'
# string to replace with
repl = '= unknown\n'
# replace every match of 'pattern' in the string 'txt' with the string in 'repl'
re.sub(pattern, repl, txt)
'This file starts here dig_1 = unknown\n doge ras = unknown\n sox = unknown\n'

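You may use re.sub here. The pattern captures the key in group 1, and the lookahead (?=\n|$) stops the match at the end of the line without consuming the newline:
import re
inp = "This file starts here dig_1 = hello\n doge ras = friend\n sox = pie\n"
output = re.sub(r'\b(\S+)\s*=\s*.*(?=\n|$)', r'\1 = unknown', inp)
print(output)
This prints:
This file starts here dig_1 = unknown
 doge ras = unknown
 sox = unknown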
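If only some of the keys should be changed, the same idea can be narrowed to a list of names. The following is a minimal sketch rather than part of the original answers; the helper replace_values and its parameters are hypothetical names chosen for illustration.

```python
import re

def replace_values(text, keys, new_value="unknown"):
    """Replace the text after 'key =' (up to the end of the line) for the given keys."""
    # Build an alternation of the escaped key names, e.g. dig_1|ras.
    names = "|".join(re.escape(k) for k in keys)
    # Group 1 captures the key; [^\n]* consumes the old value up to the newline.
    pattern = re.compile(r"\b(" + names + r")[ \t]*=[ \t]*[^\n]*")
    # A function replacement avoids any backslash handling in new_value.
    return pattern.sub(lambda m: m.group(1) + " = " + new_value, text)

text = "This file starts here dig_1 = hello\n doge ras = friend\n sox = pie\n"
result = replace_values(text, ["dig_1", "ras"])
print(result)
```

Here sox is left untouched because it is not in the list of keys. To update the file itself, the result would still have to be written back, for example with f.seek(0), f.write(result) and f.truncate() on a handle opened in 'r+' mode.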

Related

Searching for the word following a given pattern

I want to get the word 'MASTER_INACTIVE' from the string:
'p_esco_link->state = MASTER_INACTIVE; /*M-t10*/'
by searching for the regular expression 'p_esco_link->state =' and finding the word that follows it.
I need to replace direct data accesses with API function calls. I tried a regular expression in Python 3.6, but it does not work.
pattern = '(?<=\bp_esco_link->state =\W)\w+'
if __name__ == "__main__":
    syslogger.info(sys.argv)
    if version_info.major != 3:
        raise Exception('Only works on Python 3.x')
    with open(cFile, encoding='utf-8') as file_obj:
        lineNum = 0
        for line in file_obj:
            print(len(line))
            re_obj = re.compile(pattern)
            result = re.search(pattern, line)
            lineNum += 1
            #print(result)
            if result:
                print(str(lineNum) + ' ' +str(result.span()) + ' ' + result.group())
Expected: the re module finds the position of 'MASTER_INACTIVE' and puts it into result.group().
Actual: the re module finds nothing.

Your pattern is fine; it just needs to be a raw string. Without the r prefix, Python turns '\b' into a backspace character before the re module ever sees it, so the pattern cannot match.
Change the line below in your code:
pattern = r'(?<=\bp_esco_link->state =\W)\w+' # add r prefix
Here is a working sample, using your string as line:
import re
pattern = r'(?<=\bp_esco_link->state =\W)\w+'
line = 'p_esco_link->state = MASTER_INACTIVE; /*M-t10*/'
re_obj = re.compile(pattern)
result = re.search(pattern, line)
print(result.span()) # (21, 36)
print(result.group()) # 'MASTER_INACTIVE'
See the questions below to learn more about the 'r' prefix:
Python regex - r prefix
What exactly do “u” and “r” string flags do, and what are raw string literals?
What does preceding a string literal with “r” mean? [duplicate]
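To see why the r prefix matters here, the short sketch below compares the two spellings of the pattern. The variable names are illustrative only, and the doubled backslashes in bad_pattern are used just to reproduce what Python builds from the unprefixed literal without triggering invalid-escape warnings.

```python
import re

# In a normal string literal '\b' is a single backspace character (U+0008).
print(len('\b'), len(r'\b'))  # 1 2

line = 'p_esco_link->state = MASTER_INACTIVE; /*M-t10*/'

# What the regex engine receives without the r prefix: a backspace
# followed by the rest of the pattern.
bad_pattern = '(?<=\bp_esco_link->state =\\W)\\w+'
# What it receives with the r prefix: \b as a word-boundary assertion.
good_pattern = r'(?<=\bp_esco_link->state =\W)\w+'

bad = re.search(bad_pattern, line)
good = re.search(good_pattern, line)

print(bad)           # None, because the line contains no backspace
print(good.group())  # MASTER_INACTIVE
```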

How to find parenthesis-bound strings in Python

I'm learning Python and wanted to automate one of my assignments in a cybersecurity class.
I'm trying to figure out how to find the contents of a file that are enclosed in a pair of parentheses. The contents of the (.txt) file look like:
cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative
And here is my code so far:
import sys, os, subprocess, glob, shutil
# Finding the .jpg files that will be copied.
sourcepath = os.getcwd() + '\\imgs\\'
destpath = 'stegdetect'
rawjpg = glob.glob(sourcepath + '*.jpg')
# Copying the said .jpg files into the destpath variable
for filename in rawjpg:
    shutil.copy(filename, destpath)
# Asks user for what password file they want to use.
passwords = raw_input("Enter your password file with the .txt extension:")
shutil.copy(passwords, 'stegdetect')
# Navigating to stegdetect. Feel like this could be abstracted.
os.chdir('stegdetect')
# Preparing the arguments then using subprocess to run
args = "stegbreak.exe -r rules.ini -f " + passwords + " -t p *.jpg"
# Uses open to open the output file, and then write the results to the file.
with open('cracks.txt', 'w') as f: # opens cracks.txt and prepares to write
    subprocess.call(args, stdout=f)
# Processing what's in the new file.
f = open('cracks.txt')

If the text just needs to be bounded by ( and ), you can use the following regex. It requires an opening ( and a closing ), and allows letters, digits, and spaces between them. Add any other symbols you want to allow to the character class.
[\(][a-z A-Z 0-9]*[\)]
[\(] - matches the opening parenthesis
[a-z A-Z 0-9]* - matches the text inside the parentheses
[\)] - matches the closing parenthesis
So for the input sdfsdfdsf(sdfdsfsdf)sdfsdfsdf, the match will be (sdfdsfsdf).
You can test this regex at https://regex101.com/
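The passwords in the question contain symbols such as ; and !, which the character class above does not include. The sketch below compares that pattern with one possible alternative, a negated class that accepts anything except a closing parenthesis. The alternative is an assumption added for illustration, not part of the answer.

```python
import re

# Character class from the answer: only letters, digits, and spaces.
narrow = re.compile(r'[\(][a-z A-Z 0-9]*[\)]')
# Assumed alternative: anything except ')', captured in a group.
broad = re.compile(r'\(([^)]*)\)')

simple = 'sdfsdfdsf(sdfdsfsdf)sdfsdfsdf'
sample = 'cow.jpg : jphide[v5](asdfl;kj88876)'

simple_match = narrow.search(simple)
narrow_match = narrow.search(sample)  # None: ';' is not in the class
broad_match = broad.search(sample)

print(simple_match.group())   # (sdfdsfsdf)
print(narrow_match)           # None
print(broad_match.group(1))   # asdfl;kj88876
```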

"I'm learning Python"
If you are learning, you should consider alternative implementations, not only regexes.
To iterate over a text file line by line, open the file and loop over the file handle with for:
with open('file.txt') as f:
    for line in f:
        do_something(line)
Each line is a string with the line's contents, including the end-of-line character '\n'. To find the start index of a specific substring in a string, you can use find:
>>> A = "hello (world)"
>>> A.find('(')
6
>>> A.find(')')
12
To get a substring from the string, you can use slice notation. The end index is exclusive, so A[6:12] stops just before the closing parenthesis:
>>> A[6:12]
'(world'
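Putting find and slicing together, a minimal sketch of extracting the text between the parentheses could look like this. The function name between_parens is hypothetical, and the sketch assumes the enclosed text itself contains no ')' character.

```python
def between_parens(line):
    """Return the text between the first '(' and the next ')', or None."""
    start = line.find('(')
    if start == -1:
        return None
    end = line.find(')', start + 1)
    if end == -1:
        return None
    # Skip the '(' itself; the slice end is exclusive, so ')' is left out.
    return line[start + 1:end]

lines = [
    "cow.jpg : jphide[v5](asdfl;kj88876)",
    "dog.jpg : negative",
]
results = [between_parens(line) for line in lines]
print(results)  # ['asdfl;kj88876', None]
```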

You should use regular expressions, which are implemented in Python's re module.
A simple regex like \(.*\) would match your "parenthesis string", but it is better to use a group, \((.*)\), which lets you get only the content inside the parentheses.
import re
test_string = """cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative"""
REGEX = re.compile(r'\((.*)\)', re.MULTILINE)
print(REGEX.findall(test_string))
# ['asdfl;kj88876', '65498ghjk;0-', 'poi098*/8!##', 'sJ*=tT#&Ve!2', 'nKFdFX+C!:V9', '!~rFX3FXszx6', 'X&aC$|mg!wC2', 'pe8f%yC$V6Z3']

Find, decode and replace all base64 values in text file

I have a SQL dump file that contains text with html links like:
<a href="http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=">attached file</a>
I'd like to find, decode and replace the base64 part of the text in each of these links.
I've been trying to use Python w/ regular expressions and base64 to do the job. However, my regex skills are not up to the task.
I need to select any string that starts with
'getattachment.php?data='
and ends with
'"'
I then need to decode the part between 'data=' and '&quot' using base64.b64decode()
results should look something like:
<a href="http://blahblah.org/kb/4/Topcon_data-download_howto.pdf">attached file</a>
I think the solution will look something like:
import re
import base64
with open('phpkb_articles.sql') as f:
    for line in f:
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64)
Any ideas?
EDIT: Answer for anyone who's interested.
import re
import base64
import sys
def decode_base64(s):
    """
    Method to decode base64 into ascii
    """
    # fix escaped equal signs in some base64 strings
    base64_string = re.sub('%3D', '=', s.group(1))
    decodedString = base64.b64decode(base64_string)
    # substitute '|' for '/'
    decodedString = re.sub(r'\|', '/', decodedString)
    # escape the spaces in file names
    decodedString = re.sub(' ', '%20', decodedString)
    # print 'assets/' + decodedString + '&quot' # Print for debug
    return 'assets/' + decodedString + '&quot'
count = 0
pattern = r'getattachment.php\?data=([^&]+?)&quot'
# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except TypeError:
            # output unedited line if decoding fails to prevent corruption
            sys.stdout.write(line)
            # print line
        count += 1

You already have it; you just need the small pieces.
The pattern r'data=([^&]+?)&quot' will match anything after data= and before &quot:
>>> import re
>>> pat = r'data=([^&]+?)&quot'
>>> line = '<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>'
>>> decodeString = re.search(pat, line).group(1) # the base64 string is captured by the group, so we only want group(1)
>>> decodeString
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='
You can then use the str.replace() method together with base64.b64decode() to finish the rest. I don't want to just write your code for you, but this should give you a good idea of where to go.
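The code posted in the question's edit is written in Python 2 style, where base64.b64decode() returns a str. On Python 3 it returns bytes, so the result has to be decoded before any string operations. The following is a rough Python 3 sketch of the same steps. It assumes the dump stores quotes as &quot; (as the pattern implies), and the 'assets/' prefix is taken from the asker's own code. The name decode_link is hypothetical.

```python
import base64
import re

# Same idea as the asker's pattern, with the dot escaped as well.
PATTERN = re.compile(r'getattachment\.php\?data=([^&]+?)&quot')

def decode_link(match):
    # Some base64 strings have '=' padding escaped as %3D; undo that first.
    encoded = match.group(1).replace('%3D', '=')
    # b64decode returns bytes on Python 3, so decode to str before editing.
    decoded = base64.b64decode(encoded).decode('ascii')
    # '|' separates directory and file name; spaces are percent-encoded.
    path = decoded.replace('|', '/').replace(' ', '%20')
    return 'assets/' + path + '&quot'

line = ('<a href=&quot;http://blahblah.org/kb/getattachment.php?data='
        'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>')
edited = PATTERN.sub(decode_link, line)
print(edited)
```

Using str.replace() instead of re.sub() for the fixed substitutions keeps the helper simple, since none of them needs a pattern.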

re.match doesn't pick up on txt file format

import os.path
import re
def request ():
    print ("What file should I write to?")
    file = input ()
    thing = os.path.exists (file)
    if thing == True:
        start = 0
    elif re.match ("^.+.\txt$", file):
        stuff = open (file, "w")
        stuff.write ("Some text.")
        stuff.close ()
        start = 0
    else:
        start = 1
    go = "yes"
    list1 = (start, file, go)
    return list1
start = 1
while start == 1:
    list1 = request ()
    (start, file, go) = list1
Whenever I enter Thing.txt as the text, the elif should catch that it's in the format given. However, start doesn't change to 0, and a file isn't created. Have I formatted the re.match incorrectly?

"^.+.\txt$" is an incorrect pattern for matching .txt files: in a normal string, \t is a tab character, and the dot before it is left unescaped. You can use the following regex instead:
r'^\w+\.txt$'
\w matches word characters (letters, digits, and underscore). If you want the file name to contain only letters, you could use [a-zA-Z] instead:
r'^[a-zA-Z]+\.txt$'
Note that you need to escape the ., as it is a special character in regular expressions.
re.match (r'^\w+\.txt$',file)
As an alternative for matching file names with a specific extension, you can use endswith():
file.endswith('.txt')
Also, instead of if thing == True you can just write if thing:, which is more Pythonic.

You should escape the second dot and unescape the "t" character. Use a raw string so that the backslash reaches the regex engine unchanged:
re.match (r"^.+\.txt$", file)
Also note that you don't really need a regex for this; you can simply use endswith, or use os.path.splitext, which gives you the file's extension:
import os
fileName, fileExtension = os.path.splitext('your_file.txt')
fileExtension is '.txt', which is exactly what you're looking for.
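The three approaches mentioned in these answers can be compared side by side. This is only an illustrative sketch using the file name from the question; the variable names are made up for the example.

```python
import os.path
import re

name = "Thing.txt"

# In a normal string "\t" is a tab, so this pattern asks for
# "any characters, a tab, then 'xt'" and cannot match "Thing.txt".
broken = re.match("^.+.\txt$", name)

# Raw string with the dot escaped: matches names ending in ".txt".
fixed = re.match(r"^.+\.txt$", name)

# Non-regex alternatives from the answers.
ends_with_txt = name.endswith(".txt")
root, extension = os.path.splitext(name)

print(broken)           # None
print(bool(fixed))      # True
print(ends_with_txt)    # True
print(root, extension)  # Thing .txt
```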

Unicode to original character in Python

When I use for example,
unicode_string = u"Austro\u002dHungarian_gulden"
unicode_string.encode("ascii", "ignore")
This gives the output 'Austro-Hungarian_gulden'.
But I am using a txt file that contains data like the following:
Austria\u002dHungary Austro\u002dHungarian_gulden
Cocos_\u0028Keeling\u0029_Islands Australian_dollar
El_Salvador Col\u00f3n_\u0028currency\u0029
Faroe_Islands Faroese_kr\u00f3na
Georgia_\u0028country\u0029 Georgian_lari
I have to process this data using regular expressions in Python, so I created the script below, but it does not replace the Unicode escape sequences with the appropriate characters in the string.
For example:
'\u002d' should become '-'
'\u0028' should become '('
'\u0029' should become ')'
Script for processing the text file:
import re
import collections
def extract():
    filename = raw_input("Enter file Name:")
    in_file = file(filename,"r")
    out_file = file("Attribute.txt","w+")
    for line in in_file:
        values = line.split("\t")
        if values[1]:
            str1 = ""
            for list in values[1]:
                list = re.sub("[^\Da-z0-9A-Z()]","",list)
                list = list.replace('_',' ')
                out_file.write(list)
                str1 += list
            out_file.write(" ")
        if values[2]:
            str2 = ""
            for list in values[2]:
                list = re.sub("[^\Da-z0-9A-Z\n]"," ",list)
                list = list.replace('"','')
                list = list.replace('_',' ')
                out_file.write(list)
                str2 += list
        s1 = str1.lstrip()
        s1 = str1.rstrip()
        s2 = str2.lstrip()
        s2 = str2.rstrip()
        print s1+s2
Expected output for the given data is:
Austria-Hungary Austro-Hungarian gulden
Cocos (Keeling) Islands Australian dollar
El Salvador Coln (currency)
FaroeIslands Faroese krna
Georgia (country) Georgian lari
How can I do it?

Convert the input into Unicode using decode("unicode_escape"), then encode() the output to an encoding of your choice.
>>> r"Austro\u002dHungarian_gulden".decode("unicode_escape")
u'Austro-Hungarian_gulden'
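The snippet above is Python 2, where a str has a decode method. On Python 3, str has no decode, so one way to apply the same codec is to encode to bytes first. The sketch below assumes the text is plain ASCII apart from the \uXXXX escapes; the helper name unescape is hypothetical.

```python
def unescape(text):
    """Turn literal \\uXXXX sequences in an ASCII string into real characters."""
    # encode() gives bytes; the unicode_escape codec then interprets the escapes.
    return text.encode("ascii").decode("unicode_escape")

# Raw strings keep the backslashes, as they would appear when read from a file.
rows = [
    r"Austria\u002dHungary",
    r"Col\u00f3n_\u0028currency\u0029",
]
cleaned = [unescape(row).replace("_", " ") for row in rows]
print(cleaned)
```

After this step the underscores can be replaced and the columns processed as ordinary text, without having to strip characters through a regex.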
