Finding a pattern multiple times between start and end patterns python regex - python

i am trying to find a certain pattern between a start and end patterns for multiple lines. Here is what i mean:
i read a file and saved it in variable File, this is what the original file looks like:
File:
...
...
...
Keyword some_header_file {
XYZ g1234567S7894561_some_other_trash_underscored_text;
XYZ g1122334S9315919_different_other_trash_underscored_text;
}
...
...
...
I am trying to grab the 1234567 between the g and S and also the 1122334. The some_header_file block can be any number of lines but always ends with }
So i am trying to grab exactly 7 digits between the g and the S for all the lines from the "Keyword" till the "}" for that specific header.
this is what i used:
FirstSevenDigitPart = str(re.findall(r"Keyword\s%s.*\n.*XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}.*\}"%variable , str(File) , flags=re.MULTILINE))
but unfortunately it does not return anything.. just a blank []
what am i doing wrong? how can i accomplish this?
Thanks in advance.

You may read your file into a contents variable and use
import re
contents = "...\n...\n...\nKeyword some_header_file {\n XYZ \ng1234567S7894561_some_other_trash_underscored_text;\n XYZ \n1122334S9315919_different_other_trash_underscored_text;\n}\n...\n...\n..."
results = []
variable = 'some_header_file'
block_rx = r'Keyword\s+{}\s*{{([^{{}}]*)}}'.format(re.escape(variable))
value_rx = r'XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}'
for block in re.findall(block_rx, contents):
results.extend(re.findall(value_rx, block))
print(results)
# => ['1234567', '1122334']
See the Python demo.
The first regex (block_rx) will look like Keyword\s+some_header_file\s*{([^{}]*)} and will match all the blocks you need to search for values in. The second regex, XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}, matches what you need and returns the list of captures.

I think that the simplest way here will be to use two expressions and run it in two steps. There is a little example. Of course you should optimize it for your needs.
import re
text = """Keyword some_header_file {
XYZ g1234567S7894561_some_other_trash_underscored_text;
XYZ g1122334S9315919_different_other_trash_underscored_text;
}"""
all_lines_pattern = 'Keyword\s*%s\s*\{\n(?P<all_lines>(.|\s)*)\}'
first_match = re.match(all_lines_pattern % 'some_header_file', text)
if first_match is None:
# some break logic here
pass
found_lines = first_match.group(1)
print(found_lines) # ' XYZ g1234567S7894561_some_other_trash_underscored_text;\n XYZ g1122334S9315919_different_other_trash_underscored_text;\n '
sub_pattern = '(XYZ\s*[gd](?P<your_pattern>[0-9]{7})[A-Z]).*;'
found_groups = re.findall(sub_pattern, found_lines)
print(found_groups) # [('XYZ g1234567S', '1234567'), ('XYZ g1122334S', '1122334')]

Related

multiple modification to a list at once

I have a text file of some ip's and Mac's. The format of the Mac's are xxxx.xxxx.xxxx, I need to change all the MAC's to xx:xx:xx:xx:xx:xx
I am already reading the file and putting it into a list. Now I am looping through each line of the list and I need to make multiple modification. I need to remove the IP's and then change the MAC format.
The problem I am running into is that I cant seem to figure out how to do this in one shot unless I copy the list to a newlist for every modification.
How can I loop through the list once, and update each element on the list with all my modification?
count = 0
output3 = []
for line in output:
#print(line)
#removes any extra spaces between words in a string.
output[count] = (str(" ".join(line.split())))
#create a new list with just the MAC addresses
output3.append(str(output[count].split(" ")[3]))
#create a new list with MAC's using a ":"
count += 1
print(output3)
It appears you are trying to overthink the problem, so that may be where your frustration is spinning you around a bit.
First, you should always consider if you need a count variable in python. Usually you do not, and the enumerate() function is your friend here.
Second, there is no need to process data multiple times in python. You can use variables to your advantage and leverage python's expressiveness, rather than trying to hide your problem from the language.
PSA an implementation example that may help you think through your approach. Good luck on solving your harder problems, and I hope python will help you out with them!
#! /usr/bin/env python3
import re
from typing import Iterable
# non-regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: this assumes a source with '.' separators only
# reformat_mac = lambda _: ':'.join(_ for _ in _.split('.') for _ in (_[:2], _[2:]))
# regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: Only requires at least two hex digits adjacent at a time
reformat_mac = lambda _: ":".join(re.findall(r"(?i)[\da-f]{2}", _))
def generate_output3(output: Iterable[str]) -> Iterable[str]:
for line in output:
col1, col2, col3, mac, *cols = line.split()
mac = reformat_mac(mac)
yield " ".join((col1, col2, col3, mac, *cols))
if __name__ == "__main__":
output = [
"abc def ghi 1122.3344.5566",
"jklmn op qrst 11a2.33c4.55f6 uv wx yz",
"zyxwu 123 next 11a2.33c4.55f6 uv wx yz",
]
for line in generate_output3(output):
print(line)
Solution
You can use the regex (regular expression) module to extract any pattern that matches that of the
mac-ids: "xxxx:xxxx:xxxx" and then process it to produce the expected output ("xx-xx-xx-xx-xx-xx")
as shown below.
Note: I have used a dummy data file (see section: Dummy Data below) to make this answer
reproducible. It should work with your data as well.
# import re
filepath = "input.txt"
content = read_file(filepath)
mac_ids = extract_mac_ids(content, format=True) # format=False --> "xxxx:xxxx:xxxx"
print(mac_ids)
## OUTPUT:
#
# ['a0-b1-ff-33-ac-d5',
# '11-b9-33-df-55-f6',
# 'a4-d1-e7-33-ff-55',
# '66-a1-b2-f3-b9-c5']
Code: Convenience Functions
How does the regex work? see this example
def read_file(filepath: str):
"""Reads and returns the content of a file."""
with open(filepath, "r") as f:
content = f.read() # read in one attemp
return content
def format_mac_id(mac_id: str):
"""Returns a formatted mac_id.
INPUT FORMAT: "xxxxxxxxxxxx"
OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
"""
mac_id = list(mac_id)
mac_id = ''.join([ f"-{v}" if (i % 2 == 0) else v for i, v in enumerate(mac_id)])[1:]
return mac_id
def extract_mac_ids(content: str, format: bool=True):
"""Extracts and returns a list of formatted mac_ids after.
INPUT FORMAT: "xxxx:xxxx:xxxx"
OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
"""
import re
# pattern = "(" + ':'.join([r"\w{4}"]*3) + "|" + ':'.join([r"\w{2}"]*6) + ")"
# pattern = r"(\w{4}:\w{4}:\w{4}|\w{2}:\w{2}:\w{2}:\w{2}:\w{2}:\w{2})"
pattern = r"(\w{4}:\w{4}:\w{4})"
pat = re.compile(pattern)
mac_ids = pat.findall(content) # returns a list of all mac-ids
# Replaces the ":" with "" and then formats
# each mac-id as: "xx-xx-xx-xx-xx-xx"
if format:
mac_ids = [format_mac_id(mac_id.replace(":", "")) for mac_id in mac_ids]
return mac_ids
Dummy Data
The following code block creates a dummy file with some sample mac-ids.
filepath = "input.txt"
s = """
a0b1:ff33:acd5 ghwvauguvwi ybvakvi
klasilvavh; 11b9:33df:55f6
haliviv
a4d1:e733:ff55
66a1:b2f3:b9c5
"""
# Create dummy data file
with open(filepath, "w") as f:
f.write(s)

Parse unstructured text in python

Am new to python and am trying to read a PDF file to pull the ID No.. I have been successful so far to extract the text out of the PDF file using pdfplumber. Below is the code block:
import pdfplumber
with pdfplumber.open('ABC.pdf') as pdf_file:
firstpage = pdf_file.pages[0]
raw_text = firstpage.extract_text()
print (raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, am unable to find the optimum way to parse this unstructured text further. The final output am expecting to be is just the ID No. i.e. 10101010. On a side note, the script would be using against fairly huge set of PDFs so performance would be of concern.
Try using a regular expression:
import pdfplumber
import re
with pdfplumber.open('ABC.pdf') as pdf_file:
firstpage = pdf_file.pages[0]
raw_text = firstpage.extract_text()
m = re.search(r'ID No\. : (\d+)', raw_text)
if m:
print(m.group(1))
Of course you'll have to iterate over all the PDF's contents - not just the first page! Also ask yourself if it's possible that there's more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave it as an exercise for you.
If the length of the id number is always the same, I would try to find the location of it with the find-function. position = raw_text.find('ID No. : ')should return the position of the I in ID No. position + 9 should be the first digit of the id. When the number has always a length of 8 you could get it with int(raw_text[position+9:position+17])
If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
val textFromPage = (1 until (new PdfReader(fil)).getNumberOfPages).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
val res = for (m <- r.findAllMatchIn(textFromPage )) yield m.group(0)
res.foreach(println)

RegEx for capturing groups using dictionary key

I'm having trouble displaying the right named capture in my dictionary function. My program reads a .txt file and then transforms the text in that file into a dictionary. I already have the right regex formula to capture them.
Here is my File.txt:
file Science/Chemistry/Quantum 444 1
file Marvel/CaptainAmerica 342 0
file DC/JusticeLeague/Superman 300 0
file Math 333 0
file Biology 224 1
Here is the regex link that is able to capture the ones I want:
By looking at the link, the ones I want to display is highlighted in green and orange.
This part of my code works:
rx= re.compile(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+')
i = sub_pattern.match(data) # 'data' is from the .txt file
x = (i.group(1), i.group(3))
print(x)
But since I'm making the .txt into a dictionary I couldn't figure out how to make .group(1) or .group(3) as keys to display specifically for my display function. I don't know how to make those groups display when I use print("Title: %s | Number: %s" % (key[1], key[3])) and it will display those contents. I hope someone can help me implement that in my dictionary function.
Here is my dictionary function:
def create_dict(data):
dictionary = {}
for line in data:
line_pattern = re.findall(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+', line)
dictionary[line] = line_pattern
content = dictionary[line]
print(content)
return dictionary
I'm trying to make my output look like this from my text file:
Science 444
Marvel 342
DC 300
Math 333
Biology 224
You may create and populate a dictionary with your file data using
def create_dict(data):
dictionary = {}
for line in data:
m = re.search(r'file\s+([^/\s]*)\D*(\d+)', line)
if m:
dictionary[m.group(1)] = m.group(2)
return dictionary
Basically, it does the following:
Defines a dictionary dictionary
Reads data line by line
Searches for a file\s+([^/\s]*)\D*(\d+) match, and if there is a match, the two capturing group values are used to form a dictionary key-value pair.
The regex I suggest is
file\s+([^/\s]*)\D*(\d+)
See the Regulex graph explaining it:
Then, you may use it like
res = {}
with open(filepath, 'r') as f:
res = create_dict(f)
print(res)
See the Python demo.
You already used named group in your 'line_pattern', simply put them to your dictionary. re.findall would not work here. Also the character escape '\' before '/' is redundant. Thus your dictionary function would be:
def create_dict(data):
dictionary = {}
for line in data:
line_pattern = re.search(r'file (?P<path>.*?)( |/.*?)? (?P<views>\d+).+', line)
dictionary[line_pattern.group('path')] = line_pattern.group('views')
content = dictionary[line]
print(content)
return dictionary
This RegEx might help you to divide your inputs into four groups where group 2 and group 4 are your target groups that can be simply extracted and spaced with a space:
(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})

How to extract text part from file using Python & Regular Expressions

Using Python I want to read a text file, search for a string and print all lines between this matching string and another one.
The textfile looks like the following:
Text=variables.Job_SalesDispatch.CaptionNew
Tab=0
TabAlign=0
}
}
}
[UserVariables]
User1=#StJid;IF(fields.Fieldtype="Artikel.Gerät" , STR$(fields.id,0,0) , #StJid)
[Parameters]
[#Parameters]
{
[Parameters]
{
LL.ProjectDescription=? (default)
LL.SortOrderID=
}
}
[PageLayouts]
[#PageLayouts]
{
[PageLayouts]
{
[PageLayout]
{
DisplayName=
Condition=Page() = 1
SourceTray=0
Now I want to print all "UserVariables", so only the lines between [UserVariables] and the next line starting with a square bracket. In this example this would be [Parameters].
What I have done so far is:
with open("path/testfile.lst", encoding="utf8", errors="ignore") as file:
for line in file:
uservars = re.findall('\b(\w*UserVariables\w*)\b', line)
print (uservars)
what gives me only [].
If using regular expressions is not a mandatory requirement for you, you can go with something like this:
with open("path/testfile.lst", encoding="utf8", errors="ignore") as file:
inside_uservars = False
for line in file:
if inside_uservars:
if line.strip().startswith('['):
inside_uservars = False
else:
print(line)
if line.strip() == '[UserVariables]':
inside_uservars = True
We can try using re.findall with the following regex pattern:
\[UserVariables\]\n((?:(?!\[.*?\]).)*)
This says to match a [UserVariables] tag, followed by a slightly complicated looking expression:
((?:(?!\[.*?\]).)*)
This expression is a tempered dot trick which matches any character, one at a time, so long as what lies immediately ahead is not another tag contained in square brackets.
matches = re.findall(r'\[UserVariables\]\n((?:(?!\[.*?\]).)*)', input, re.DOTALL)
print(matches)
[' User1=#StJid;IF(fields.Fieldtype="Artikel.Ger\xc3\xa4t" , STR$(fields.id,0,0) , #StJid)\n']
Edit:
My answer assumes that the entire file content sits in memory, in a single Python string. You may read the entire file using:
with open('Path/to/your/file.txt', 'r') as content_file:
input = content_file.read()
matches = re.findall(r'\[UserVariables\]\n((?:(?!\[.*?\]).)*)', input, re.DOTALL)
print(matches)

Match lines from a file and parse them on Python

I have this file with different lines, and I want to take only some information from each line (not the whole of it) here is a sample of how the file looks like:
18:10:12.960404 IP 132.227.127.62.12017 > 134.157.0.129.53: 28192+ A? safebrowsing-cache.google.com. (47)
18:10:12.961114 IP 134.157.0.129.53 > 132.227.127.62.12017: 28192 12/4/4 CNAME safebrowsing.cache.l.google.com., A 173.194.40.102, A 173.194.40.103, A 173.194.40.104, A 173.194.40.105, A 173.194.40.110, A 173.194.40.96, A 173.194.40.97, A 173.194.40.98, A 173.194.40.99, A 173.194.40.100, A 173.194.40.101 (394)
18:13:46.206371 IP 132.227.127.62.49296 > 134.157.0.129.53: 47153+ PTR? b._dns-sd._udp.upmc.fr. (40)
18:13:46.206871 IP 134.157.0.129.53 > 132.227.127.62.49296: 47153 NXDomain* 0/1/0 (101)
18:28:57.253746 IP 132.227.127.62.54232 > 134.157.0.129.53: 52694+ TXT? time.apple.com. (32)
18:28:57.254647 IP 134.157.0.129.53 > 132.227.127.62.54232: 52694 1/8/8 TXT "ntp minpoll 9 maxpoll 12 iburst" (381)
.......
.......
It is actually the output of a DNS request, so from it I want to extract these elements:
[timestamp], [srcip], [src prt], [dst ip], [dst prt], [domaine (if existed)], [related ips addresses]
After looking in the website old topics, I found that the re.match() is a great and helpful way to do that, but since as you see every line is different of the other, I am kind of lost, some help would be great, here is the code I wrote so far and it is correct:
def extractDNS(filename):
objList = []
obj = {}
with open(filename) as fi:
for line in fi:
line = line.lower().strip()
#18:09:29.960404
m = re.match("(\d+):(\d+):(\d+.\d+)",line)
if m:
obj = {} #New object detected
hou = int(m.group(1))
min = int(m.group(2))
sec = float(m.group(3))
obj["time"] = (hou*3600)+(min*60)+sec
objList.append(obj)
#IP 134.157.0.129.53
m=re.match("IP\s+(\d{1,3}\.\d{1,3}\.\d{1,3}.\d{1,3}).(\d+)",bb)
if m:
obj["dnssrcip"] = m.group(1)
obj["dnssrcport"] = m.group(2)
# > 134.157.0.129.53:
m = re.match("\s+>\s+(\d{1,3}\.\d{1,3}\.\d{1,3}.\d{1,3}).(\d+):",line)
if m:
obj["dnsdstip"] = m.group(1)
obj["dnsdstport"] = m.group(2)
tstFile3=open("outputFile","w+")
tstFile3.write("%s\n" %objList)
tstFile3.close()
extractDNS(sys.argv[1])
I know I have to make if else statements after this, because what comes after them is different every time, and I showed in the 3 cases I get generaly in every dns output file, which are :
- A? followed by CNAME, the exact domain, and the IP addresses,
- PTR? followed by a NXDOmain, means the domain is non-existant, so I will just ignore this line,
- TXT? followed by a domain, but it only gives words, so I ll ignore this one two
I only want the request that their responses contain IP addresses, which are in this case the A?
If you know the first 5 columns are always present, why don't you just split the line up and handle those directly (use datetime for the timestamp, and manually parse the IP addresses/ports). Then, you could use your regular expression to match only the CNAME records and the contents you are interested in from that one field. There is no need to have a regular expression scan over the different possibilities if you aren't going to actually use the output. So, if it doesn't match the CNAME form, then you don't care how it would be handled. At least, that's what it sounds like.
As user632657 said above, there's no need to care about the lines you, well, don't care about. Just use one regular expression for that line, and if it doesn't match, ignore that line:
pattern = re.compile( r'(\d{2}):(\d{2}):(\d{2}\.\d+)\s+IP\s+(\d+\.\d+\.\d+\.\d+)\.(\d+)\s+>\s+(\d+\.\d+\.\d+\.\d+)\.(\d+):\s+(\d+).*?CNAME\s+([^,]+),\s+(.*?)\s+\(\d+\)' )
That will match the CNAME records only. You only need to define that once, outside your loop. Then, within the loop:
try:
hour, minute, seconds, source_ip, source_port, dst_ip, dst_port, domain, records = pattern.match( line ).groups()
except:
continue
records = [ r.split() for r in records.split( ', ' ) ]
This will pull all the fields you've asked about in the relevant variables, and parse the associated IPs into a list of pairs (class, IP), which I figure will be useful :P

Categories