Parsing text file containing unique pattern using Python

How can I parse a text file containing the pattern "KEYWORD: Out:" and write the result to an output file using Python?
input.txt
DEBUG 2020-11:11:17.401 KEYWORD: Out:0xaaaf0000 In:0x80000000.1110ffff.
DEBUG 2020-11:11:17.401 KEYWORD: Out:0xaaaf00cc In:0x80000000.1110ffaa.
output.txt
0xaaaf0000:1110ffff
0x80000000:1110ffaa

You could use a regex:
import re

txt = '''\
DEBUG 2020-11:11:17.401 KEYWORD: Out:0xaaaf0000 In:0x80000000.1110ffff.
DEBUG 2020-11:11:17.401 KEYWORD: Out:0xaaaf00cc In:0x80000000.1110ffaa.'''
# Group 1 is the Out value, group 2 is the part of the In value after the dot
pat = r'KEYWORD: Out:(0x[a-f0-9]+)[ \t]+In:0x[a-f0-9]+\.([a-f0-9]+)'
print('\n'.join(m[0] + ':' + m[1] for m in re.findall(pat, txt)))
This prints:
0xaaaf0000:1110ffff
0xaaaf00cc:1110ffaa
If you want to do this line-by-line from a file:
import re

pat = r'KEYWORD: Out:(0x[a-f0-9]+)[ \t]+In:0x[a-f0-9]+\.([a-f0-9]+)'
with open(ur_file) as f:  # ur_file is the path to your input file
    for line in f:
        m = re.search(pat, line)
        if m:
            print(m.group(1) + ':' + m.group(2))
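Neither snippet above writes the result to a file, which is what the question asks for. A minimal sketch of that last step, assuming the log format shown in the question; the names `extract_pairs` and `convert` are made up for illustration and are not part of the original answer:

```python
import re

# Same pattern as above: capture the Out value and the part of In after the dot.
PAT = re.compile(r'KEYWORD: Out:(0x[a-f0-9]+)[ \t]+In:0x[a-f0-9]+\.([a-f0-9]+)')

def extract_pairs(lines):
    """Yield 'out:suffix' strings for every line that matches PAT."""
    for line in lines:
        m = PAT.search(line)
        if m:
            yield m.group(1) + ':' + m.group(2)

def convert(in_path, out_path):
    """Read in_path line by line and write one result per line to out_path."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for pair in extract_pairs(src):
            dst.write(pair + '\n')
```

Calling `convert('input.txt', 'output.txt')` would then write one `Out:suffix` pair per matching line and skip lines that do not contain the keyword.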

In [6]: lines
Out[6]:
['DEBUG 2020-11:11:17.401 KEYWORD: Out:0xaaaf0000 In:0x80000000.1110ffff.',
'DEBUG 2020-11:11:17.401 KEYWORD: Out:0xaaaf00cc In:0x80000000.1110ffaa.']
In [7]: [x.split('Out:')[1].split(' ')[0] + ':' + x.split('In:')[1].split('.')[1] for x in lines]
Out[7]: ['0xaaaf0000:1110ffff', '0xaaaf00cc:1110ffaa']

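I think the second line of your output.txt might be wrong: it starts with 0x80000000 (the In value), whereas the first line starts with the Out value. If that is intended, the rule is more complex and you'd need to point it out.
Otherwise, maybe a regex like this: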
(.*Out:)(0x[0-9a-f]{1,8}) In:0x[0-9a-f]{1,8}\.([0-9a-f]{1,8}).
https://regex101.com/r/lMrR06/2

Related

Replace decimals within a special character in file

I am currently trying to read in a file and strip the decimals from every number that sits between thorn characters (þ), so that:
þ219.91þ
þ122.1919þ
þ467.426þ
þ104.351þ
þ104.0443þ
will become
þ219þ
þ122þ
þ467þ
þ104þ
þ104þ
I have a regex replacement that works in Notepad++ (below) and I am trying to replicate it in Python (code below, which is not working). Any suggestions?
In Notepad++:
Find: (\xFE\d+)\.\d+(\xFE)
Replace: $1$2
Python:
for line in file:
    line = re.sub("(\xFE\d+)\.\d+(\xFE)", "\xFE\d+\xFE", line)

I don't think it is necessary to write \xFE; using the character directly might simply work:
import re

regex = r"(þ\d+)\.\d+(þ)"
test_str = ("þ219.91þ\n"
            "þ122.1919þ\n"
            "þ467.426þ\n"
            "þ104.351þ\n"
            "þ104.0443þ")
subst = "\\1\\2"
# You can limit the number of replacements by changing the count argument (0 means all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
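The loop from the question can also be kept as it is. What breaks it is the replacement string, which inserts the literal text `\d+` instead of referring back to the captured groups. A sketch of a corrected loop body, assuming the file is decoded so that the delimiter arrives as U+00FE; the helper name `strip_decimals` is made up for illustration:

```python
import re

# \xFE is the thorn character; the two groups keep the delimiters and the
# integer part, while the unparenthesised \.\d+ is what gets dropped.
THORN_DECIMAL = re.compile(r"(\xFE\d+)\.\d+(\xFE)")

def strip_decimals(line):
    """Drop the fractional part of every thorn-delimited number in line."""
    # \1 and \2 are backreferences, the Python spelling of Notepad++'s $1$2.
    return THORN_DECIMAL.sub(r"\1\2", line)
```

Inside the original loop this would be used as `line = strip_decimals(line)`.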

You're not replacing decimals: you're truncating the values. Would a numeric treatment do for you? This assumes that all lines have the format you show.
for line in file:
    _, val, _ = line.split('þ')  # empty string, value, rest of the line
    # int() cannot parse '219.91' directly, so go through float() first
    line = 'þ' + str(int(float(val))) + 'þ'
Note that you could reduce this to a single line in the loop:
    line = 'þ' + str(int(float(line.split('þ')[1]))) + 'þ'

You could use a one-liner such as:
f = ["þ219.91þ", "þ122.1919þ", "þ467.426þ", "þ104.351þ", "þ104.0443þ"]
print(["þ{}þ".format(int(float(l.strip("þ")))) for l in f])
Result:
['þ219þ', 'þ122þ', 'þ467þ', 'þ104þ', 'þ104þ']

How to find parenthesis bound strings in python

I'm learning Python and wanted to automate one of my assignments in a cybersecurity class.
I'm trying to figure out how to find the parts of a file's contents that are enclosed in parentheses. The contents of the (.txt) file look like:
cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative
And here is my code so far:
import sys, os, subprocess, glob, shutil

# Finding the .jpg files that will be copied.
sourcepath = os.getcwd() + '\\imgs\\'
destpath = 'stegdetect'
rawjpg = glob.glob(sourcepath + '*.jpg')
# Copying the said .jpg files into the destpath directory
for filename in rawjpg:
    shutil.copy(filename, destpath)
# Asks user for what password file they want to use (raw_input on Python 2).
passwords = input("Enter your password file with the .txt extension:")
shutil.copy(passwords, 'stegdetect')
# Navigating to stegdetect. Feel like this could be abstracted.
os.chdir('stegdetect')
# Preparing the arguments then using subprocess to run
args = "stegbreak.exe -r rules.ini -f " + passwords + " -t p *.jpg"
# Uses open to open the output file, and then write the results to the file.
with open('cracks.txt', 'w') as f:  # opens cracks.txt for writing
    subprocess.call(args, stdout=f)
# Processing what's in the new file.
f = open('cracks.txt')

If the string should just be bound by ( and ), you can use the following regex. It requires an opening ( and a closing ) and allows letters, digits and spaces between them. Add any other symbols you want to allow to the character class.
[\(][a-z A-Z 0-9]*[\)]
[\(] - matches the opening parenthesis
[a-z A-Z 0-9]* - matches the text inside the parentheses
[\)] - matches the closing parenthesis
So for input sdfsdfdsf(sdfdsfsdf)sdfsdfsdf , the output will be (sdfdsfsdf)
Test this regex here: https://regex101.com/
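As written, the character class above would not match the passwords in the question, which contain punctuation such as `;`, `*` and `#`. One way to avoid listing every symbol is a negated class that accepts anything except a closing parenthesis. A sketch under that assumption (the function name is illustrative):

```python
import re

# \( and \) match literal parentheses; [^)]* matches any run of characters
# that are not ')'. The capture group keeps only the inner text.
PAREN_CONTENT = re.compile(r'\(([^)]*)\)')

def paren_contents(text):
    """Return the text found inside each (...) pair, in order of appearance."""
    return PAREN_CONTENT.findall(text)
```

The trade-off is that a password which itself contains `)` would be cut off at that character.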

"I'm learning Python"
If you are learning, you should consider alternative implementations, not only regexps.
To iterate over a text file line by line, just open the file and loop over the file handle:
with open('file.txt') as f:
    for line in f:
        do_something(line)
Each line is a string with the line contents, including the end-of-line character '\n'. To find the start index of a specific substring in a string you can use find:
>>> A = "hello (world)"
>>> A.find('(')
6
>>> A.find(')')
12
To get a substring from the string you can use the slice notation in the form:
>>> A[6:12]
'(world'
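Putting find and slicing together for the file in the question might look like the sketch below. It assumes at most one parenthesised password per line and uses rfind for the closing parenthesis, so that a `)` inside the password does not cut it short. `extract_password` is a made-up name:

```python
def extract_password(line):
    """Return the text between the first '(' and the last ')', or None."""
    start = line.find('(')
    end = line.rfind(')')
    if start == -1 or end < start:
        return None  # e.g. 'dog.jpg : negative' has no parentheses
    # start + 1 skips the '(' itself; the slice end is exclusive, so ')' is dropped.
    return line[start + 1:end]
```

With the string from the session above, `extract_password("hello (world)")` gives `'world'`, whereas the slice `A[6:12]` still includes the opening parenthesis.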

You should use regular expressions, which are implemented in the Python re module.
A simple regex like \(.*\) could match your "parenthesis string",
but a group, \((.*)\), is better because it lets you get only the content inside the parentheses.
import re
test_string = """cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative"""
REGEX = re.compile(r'\((.*)\)', re.MULTILINE)
print(REGEX.findall(test_string))
# ['asdfl;kj88876', '65498ghjk;0-', 'poi098*/8!##', 'sJ*=tT#&Ve!2', 'nKFdFX+C!:V9', '!~rFX3FXszx6', 'X&aC$|mg!wC2', 'pe8f%yC$V6Z3']

Find, decode and replace all base64 values in text file

I have a SQL dump file that contains text with HTML links like:
<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>
I'd like to find, decode and replace the base64 part of the text in each of these links.
I've been trying to use Python with regular expressions and base64 to do the job. However, my regex skills are not up to the task.
I need to select any string that starts with
'getattachment.php?data='
and ends with
'&quot;'
I then need to decode the part between 'data=' and '&quot' using base64.b64decode()
Results should look something like:
<a href=&quot;http://blahblah.org/kb/4/Topcon_data-download_howto.pdf&quot;>attached file</a>
I think the solution will look something like:
import re
import base64
with open('phpkb_articles.sql') as f:
    for line in f:
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64, line)
Any ideas?
EDIT: Answer for anyone who's interested.
import re
import base64
import sys

def decode_base64(s):
    """
    Decode the base64 payload of a regex match into an ASCII file path
    """
    # fix escaped equal signs in some base64 strings
    base64_string = re.sub('%3D', '=', s.group(1))
    # b64decode returns bytes on Python 3, so decode to str before editing
    decodedString = base64.b64decode(base64_string).decode('ascii')
    # replace '|' with '/'
    decodedString = re.sub(r'\|', '/', decodedString)
    # escape the spaces in file names
    decodedString = re.sub(' ', '%20', decodedString)
    # print('assets/' + decodedString + '&quot')  # Print for debug
    return 'assets/' + decodedString + '&quot'

count = 0
pattern = r'getattachment\.php\?data=([^&]+?)&quot'
# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except (TypeError, ValueError):
            # output unedited line if decoding fails to prevent corruption
            # (bad base64 raises binascii.Error, a ValueError subclass, on Python 3)
            sys.stdout.write(line)
        # print(line)
        count += 1

You already have it; you just need the small pieces.
The pattern r'data=([^&]+?)&quot' will match anything after data= and before &quot:
>>> import re
>>> pat = r'data=([^&]+?)&quot'
>>> line = '<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>'
>>> decodeString = re.search(pat, line).group(1)  # the b64 string is captured by the group, so we only want group(1)
>>> decodeString
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='
You can then use the str.replace() method together with base64.b64decode() to finish the rest. I don't want to just write your code for you, but this should give you a good idea of where to go.
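The remaining steps could be sketched as follows. This assumes Python 3 (where b64decode returns bytes), that the decoded payload is ASCII, and that `|` should become `/` as in the asker's expected output. `decode_link` is a hypothetical helper, not code from the answer:

```python
import base64
import re

PAT = re.compile(r'data=([^&]+?)&quot')

def decode_link(line):
    """Replace the first base64 payload in line with its decoded text."""
    m = PAT.search(line)
    if m is None:
        return line  # nothing to decode on this line
    encoded = m.group(1)
    decoded = base64.b64decode(encoded).decode('ascii')
    # Swap the encoded text for the decoded path, turning '|' into '/'.
    return line.replace(encoded, decoded.replace('|', '/'))
```

Note that this only swaps the payload; the `getattachment.php?data=` prefix stays in place, so rewriting the whole URL still needs a substitution like the one in the asker's edit.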

How to squeeze the characters in python

I want to squeeze repeated characters. I wrote the code below for this, but my regular expression did not work.
The Python code is:
import re
import nltk

file1 = open("C:/Python26/Normalized.txt")
normal = re.compile(r'(.)(\1+)')
f1 = open("rzlt.txt", 'w')
contents1 = file1.read()
tokens1 = nltk.word_tokenize(contents1)
for t in tokens1:
    t = re.sub(normal, r'\1', t)
    f1.write(t + "\n")
f1.close()
My file looks like this:
AA-0
A-aaaa-aaaaaaaaaaa
aaaaaaaa-aaaaaaaa
aaaaaaaaaaaaa-aaaaaa
AA-aaaaa-A
aaaaa-A-aaaa
AAA-0-aaaa-aaaaaaaa-aaaaaa
AAA-0
AAA-0-aaaaaaaa
AAA-0
aaaaaaaa
Desired output is
A-0
A-a-a
a-a
a-a
A-a-A
......

import re

normal = re.compile(r'(.)(\1+)')
with open("Normalized.txt") as file1:
    with open("rzlt.txt", 'w') as f1:
        for line in file1:
            f1.write(normal.sub(r'\1', line))
This produces the output:
A-0
A-a-a
a-a
a-a
A-a-A
a-A-a
A-0-a-a-a
A-0
A-0-a
A-0
a
Notes
The files are opened in with statements, which ensures that they are closed afterwards.

There's no need to use regular expressions for this purpose. The most common way of eliminating duplicates is to turn the collection into a set, but, as Shashank has commented, even that is not necessary for the input format you're dealing with, where every dash-separated field is a single repeated character:
for line in file:
    newline = '-'.join(x[0] for x in line.split('-'))
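If runs of repeated characters should be squeezed without regular expressions and without relying on the dash-separated layout, itertools.groupby from the standard library is one option. This is a different technique from the split-based line above, shown only as a sketch:

```python
from itertools import groupby

def squeeze(text):
    """Collapse each run of identical consecutive characters to one character."""
    # groupby yields one (key, group) pair per run; the key is the character.
    return ''.join(ch for ch, _run in groupby(text))
```

Unlike the split-based version, this keeps the trailing newline of each line and also handles fields that mix different characters.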

IPv4 address substitution in Python script

I'm having trouble getting this to work, and I am hoping for any ideas:
My goal: to take a file, read it line by line, substitute any IP address for a specific substitute, and write the changes to the same file.
I KNOW THIS IS NOT CORRECT SYNTAX
Pseudo-Example:
$ cat foo
10.153.193.0/24 via 10.153.213.1
def swap_ip_inline(line):
    m = re.search('some-regex', line)
    if m:
        for each_ip_it_matched:
            ip2db(original_ip)
            new_line = reconstruct_line_with_new_ip()
        line = new_line
    return line
for l in foo.readlines():
    swap_ip_inline(l)
    do some foo to rebuild the file.
I want to take the file 'foo', find each IP in a given line, substitute the ip using the ip2db function, and then output the altered line.
Workflow:
1. Open File
2. Read Lines
3. Swap IPs
4. Save lines (altered/unaltered) into tmp file
5. Overwrite original file with tmp file
*edited to add pseudo-code example

Here you go:
>>> import re
>>> ip_addr_regex = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')
>>> f = open('foo')
>>> for line in f:
...     print(line)
...
10.153.193.0/24 via 10.153.213.1
>>> f.seek(0)
>>> specific_substitute = 'foo'
>>> for line in f:
...     re.sub(ip_addr_regex, specific_substitute, line)
...
'foo/24 via foo\n'
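To cover the whole workflow from the question (read, substitute, write to a temporary file, overwrite the original), something along these lines may work. `ip2db` here is only a stand-in for the asker's function, whose behaviour is not shown, and the other names are illustrative too:

```python
import os
import re
import tempfile

IP_ADDR = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')

def ip2db(ip):
    """Placeholder for the asker's lookup; returns a fixed marker."""
    return 'foo'

def swap_ips_in_file(path, substitute=ip2db):
    """Rewrite path in place, passing every matched IP through substitute."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so os.replace stays on one filesystem.
    with tempfile.NamedTemporaryFile('w', dir=directory, delete=False) as tmp:
        with open(path) as src:
            for line in src:
                # The callback receives a match object; group(0) is the IP text.
                tmp.write(IP_ADDR.sub(lambda m: substitute(m.group(0)), line))
    os.replace(tmp.name, path)  # overwrite the original with the temp file
```

Passing a function to `sub` instead of a fixed string is what lets each address be looked up individually. The sketch does not clean up the temporary file if an error occurs midway.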

This link gave me the breakthrough I was looking for:
Python - parse IPv4 addresses from string (even when censored)
A simple modification passes initial smoke tests:
def _sub_ip(self, line):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    ips = [each[0] for each in re.findall(pattern, line)]
    # normalise censored forms such as "10 dot 1 dot 1 dot 1" to dotted notation
    for item in ips:
        location = ips.index(item)
        ip = re.sub(r"[ ()\[\]]", "", item)
        ip = re.sub("dot", ".", ip)
        ips.remove(item)
        ips.insert(location, ip)
    for ip in ips:
        line = line.replace(ip, self._ip2db(ip))
    return line
I'm sure I'll clean it up down the road, but it's a great start.
