Regex with end of line in group - python

Given this kind of input:
.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
| params = none
| raw_sig = 4:64+0:0:0:32768,0:::0
|
`----
.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
| params = none
| raw_sig = 4:64+0:0:0:32768,0:::0
|
`----
...
I'm trying to use a regex to get the value of every os field in the output (there will be hundreds).
I've tried this:
import os, subprocess, re

dir = '/home/user/Documents/ics-passif-asset-enumeration/pcap/'
for filename in os.listdir(dir):
    inp = '...'
    match = re.match( r'(.*)os(.*)\n(.*)', inp )
    print match.group(1)
But match is a NoneType. I've never really played with regex before and I'm a bit lost.
Edit:
The expected output is a list of all the os values. In this case it would be:
???
???

I hope this is what you are looking for
>>> import re
>>> string = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----"""
>>> match = re.match( r'(.*)os\s*=(.*?)\n', string, re.DOTALL)
>>> match.group(2)
' ???'
Changes made:
re.DOTALL: this flag makes . match newlines as well, so the pattern can be applied to the multiline input.
os\s*=(.*?)
\s*= keeps the = and the surrounding whitespace outside the capture group, since we are not interested in them.
(.*?) with the ? makes the match non-greedy, so it captures only up to the end of that line (the first newline after the value).
match.group(2): the value is in the second capture group, not the first.
A better and shorter solution
You can use re.findall() with a slightly different regex:
os\s*=(.*)
Test
>>> string = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
...
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
... ..."""
>>> re.findall(r"os\s*=(.*)", string)
[' ???', ' ???']

re.findall will return a list of results! Fantastic! Assuming the format of your input is pretty consistent, this should work like a charm:
>>> inp = '''
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
...
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
... ...
... '''
>>> re.findall(r'^\| os\s+= (.*)$', inp, flags=re.MULTILINE)
['???', '???']
I agree with the idea that the format should be strict to ensure that the string won't appear somewhere else. If this all came from a script then the strictness shouldn't be a problem (you'd hope). If it was via manual entry... I'd be surprised.

For the dot operator (.) to match newlines, add the re.DOTALL flag to the match call:
match = re.match( r'(.*)os(.*)\n(.*)', inp, flags=re.DOTALL )

If I understand what you want (and assuming your input is what you copied here: multiline, multi-entry), this regex should do the job with the g and m modifiers, so that all entries are matched and ^ and $ match the start and end of each line respectively:
^\|\s*os\s*=\s*(.*)$
Demo Here

You may try using the findall() method:
for filename in os.listdir(dir):
    inp = '...'
    match = re.findall('os(.*)\n', inp)
    print match

As @Tensibai says, you're probably best off using ^ and $ to match the start and end of the line, and a very specific pattern (like the one he gives) to make sure that the string "os" is not matched somewhere else, within a hostname for example.
To directly find all of the matching "os = " lines, use re.findall( r'^\|\s*os\s*=\s*(.*)$', inp, re.MULTILINE ), which returns a list of the matching os values.
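As a usage sketch of that approach inside the original loop, every os value could be collected into one list. This assumes each report is already available as a plain-text file under dir; the question builds inp some other way (the '...' above is elided), so the reading step may need to change.
import os, re

dir = '/home/user/Documents/ics-passif-asset-enumeration/pcap/'
all_os = []
for filename in os.listdir(dir):
    # assumption: the report for each capture is stored as a plain-text file
    with open(os.path.join(dir, filename)) as f:
        inp = f.read()
    all_os.extend(v.strip() for v in re.findall(r'^\|\s*os\s*=\s*(.*)$', inp, re.MULTILINE))

print(all_os)   # e.g. ['???', '???', ...]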

Related

Regex substitution reversal?

I have a question:
starting from this text example:
input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
I managed to clean this text using these functions:
import re
import string

arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations

arabic_diacritics = re.compile("""
    ّ | # Tashdid
    َ | # Fatha
    ً | # Tanwin Fath
    ُ | # Damma
    ٌ | # Tanwin Damm
    ِ | # Kasra
    ٍ | # Tanwin Kasr
    ْ | # Sukun
    ـ   # Tatwil/Kashida
""", re.VERBOSE)

def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    return text

def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text

def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)
Which gives me this text as the result:
result = "اكتب الدرس و احفضه ثم اقرا القصيدة"
Now, in this case, how can I find the word "اقرا" in the original input_test?
The input text can be in English, too. I'm thinking of regex, but I don't know where to start.
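One possible direction (a minimal sketch, assuming the cleaning steps above are the only transformations applied): turn the cleaned word back into a tolerant pattern, where each letter may appear in any of its normalized alef forms, may be repeated, and may be followed by the diacritics or tatwil that were stripped. Punctuation removed inside a word (such as the underscore in الدر_س) would need the same treatment; the names below are just for illustration.
import re

ALEF_VARIANTS = "[\u0622\u0623\u0625\u0627]"   # آ أ إ ا, all collapsed to ا by normalize_arabic
NOISE = "[\u064B-\u0652\u0640]*"               # the diacritics and tatwil removed by remove_diacritics

def build_pattern(clean_word):
    parts = []
    for ch in clean_word:
        unit = ALEF_VARIANTS if ch == "\u0627" else re.escape(ch)
        parts.append(unit + "+" + NOISE)       # '+' tolerates what remove_repeating_char collapsed
    return re.compile("".join(parts))

input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
match = build_pattern("اقرا").search(input_test)
print(match.group(0) if match else None)       # expected to find إقرأ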

Python regex pattern match starts with dot and store it in dict format

#-----------------------------------------------------------------------------------
from pprint import pprint
data = '''
.
.
.
#Long log file
-------------------------------------------------------------------------------
Section Name | Budget | Size | Prev Size | Overflow
--------------------------------+-----------+-----------+-----------+----------
.text.resident | 712924 | 794576 | 832688 | YES
.rodata.resident | 77824 | 77560 | 21496 | YES
.data.resident | 28672 | 28660 | 42308 | NO
.bss.resident | 52672 | 1051632 | 1455728 | YES
.
.
.
'''
Output expected:
MEMDICT = {'.text.resident' : {'Budget':'712924', 'Size':'794576', 'Prev Size': '832688' , 'Overflow': 'YES'},
'.rodata.resident' : {'Budget':'', 'Size':'', 'Prev Size': '' , 'Overflow': 'YES'},
'.data.resident' :{'Budget':'', 'Size':'', 'Prev Size': '' , 'Overflow': 'NO'},
'.bss.resident' :{'Budget':'', 'Size':'', 'Prev Size': '' , 'Overflow': 'YES'}}
I am a beginner in Python. Please suggest some simple steps.
Logic:
Search for a regex pattern and get the headers in a list:
pattern = re.compile(r'\sSection Name\s|\sBudget*')  # This can be improved
if pattern.match(line):
    key_list = (''.join(line.split())).split('|')  # Unable to handle space issues, so trimmed and used.
Search for a regex pattern to match .something.resident | \d+ | \d+ | \d+ | ** and get it in value_list (I need some help here).
Make all the lists into the dict in a loop:
mem_info = {}  # reset the list
for i in range(0, len(key_list)):
    mem_info[key_list[i]] = value_list[i]
MEMDICT[sta_info[0]] = sta_info
The only thing you haven't shown us is what line ends the section. Other than that, this is what you need:
keeper = False
memdict = {}
for line in open(file):
    if not keeper:
        if 'Section Name' in line:
            keeper = True
        continue
    if '-------------------' in line:
        continue
    if 'whatever ends the section' in line:
        break
    parts = [p.strip() for p in line.split('|')]  # split on the column separator
    memdict[parts[0]] = {
        'Budget': int(parts[1]),
        'Size': int(parts[2]),
        'Prev Size': int(parts[3]),
        'Overflow': parts[4]
    }
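If you would rather stay close to the regex-driven plan sketched in the question (match the header row, then the rows that start with a dot, and zip them together), a minimal sketch could look like this. The two patterns are assumptions based on the table layout shown above, and data is the log text from the question.
import re
from pprint import pprint

header_re = re.compile(r'^\s*Section Name\s*\|(.*)$')   # header row
row_re = re.compile(r'^\s*(\.\S+)\s*\|(.*)$')           # rows starting with .name

MEMDICT = {}
headers = []
for line in data.splitlines():
    h = header_re.match(line)
    if h:
        headers = [c.strip() for c in h.group(1).split('|')]
        continue
    m = row_re.match(line)
    if m and headers:
        values = [v.strip() for v in m.group(2).split('|')]
        MEMDICT[m.group(1)] = dict(zip(headers, values))

pprint(MEMDICT)   # values stay strings here; wrap them in int() where needed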

Print elements containing only 2 strings

I have this list
lst = [' SOME TEXT\nSOME TEXT\nFTY = 1', 'A|1\nB|5\nC|3\n \nD|0\nE|0', 'D|4\nE|1\nG|1', '\nblah blah', '\n--- HHGTY',
'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3', '', 'blah blah\n \nblah blah',
'--- HHGTY'
]
and I want to print only the elements containing | or HHGTY. I am using the code below, but it is printing
SOME TEXT and FTY = 1 too. What is wrong? Thanks
>>> for s in lst:
... if ("|" in s) or ("HHGTY" in s):
... print(s)
...
A|1
B|5
C|3
D|0
E|0
D|4
E|1
G|1
--- HHGTY
SOME TEXT
FTY = 1
A|3
B|2
C|8
D|6
E|9
F|3
--- HHGTY
>>>
I think what you want is:
for s in lst:
    for subs in s.split('\n'):
        if ("|" in subs) or ("HHGTY" in subs):
            print(subs)
Your code is doing everything right: SOME TEXT and FTY = 1 are parts of the element 'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3', and since '|' is present in that element, the whole element gets printed.
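Equivalently, the matching lines can be gathered with a single comprehension (just a compact sketch of the same split-then-filter idea):
matches = [sub for s in lst for sub in s.split('\n')
           if "|" in sub or "HHGTY" in sub]
print('\n'.join(matches))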

Formatting a table from a CSV file

I'm trying to make a table from data in a CSV file using only the csv module. Could anyone tell me what I should do to display a '|' at the end of every row (just after the last element in the row)?
Here's what I have so far:
import csv

def display_playlist( filename ):
    if not filename.endswith('.csv'):                 # check if it ends with the .csv extension
        filename = filename + ('.csv')                # adding .csv if given without the extension
    max_element_length = 0
    # aligning columns to the longest element
    for row in get_datalist_from_csv( filename ):
        for element in row:
            if len(element) > max_element_length:
                max_element_length = len(element)
    # print(max_element_length)
    # return max_element_length
    print('-----------------------------------------------------------------------------')
    for row in get_datalist_from_csv( filename ):
        for element in row:
            print('| ', end='')
            if len(element) <= 4 and element.isdigit():
                print(pad_to_length(element, 4), end=' |')   # trying to get '|' at the end
            else:
                print(pad_to_length(element, max_element_length), end=' ')
        print('\n')
    print('-----------------------------------------------------------------------------')

## Read data from a csv format file
def get_datalist_from_csv( filename ):
    ## Create a 'file object' f, for accessing the file:
    with open( filename ) as f:
        reader = csv.reader(f)      # create a 'csv reader' from the file object
        datalist = list( reader )   # create a list from the reader
    return datalist                 # we have a list of lists

## For aligning table columns
## It adds spaces to the end of a string to make it up to length n.
def pad_to_length( string, n ):
    return string + " " * (n - len(string))   ## s*n gives empty string for n<1
The output I get for now is:
| Track | Artist | Album | Time
| Computer Love | Kraftwerk | Computer World | 7:15
| Paranoid Android | Radiohead | OK Computer | 6:27
| Computer Age | Neil Young | Trans | 5:24
| Digital | Joy Division | Still | 2:50
| Silver Machine | Hawkwind | Roadhawks | 4:39
| Start the Simulator | A-Ha | Foot of the Mountain | 5:11
| Internet Connection | M.I.A. | MAYA | 2:56
| Deep Blue | Arcade Fire | The Suburbs | 4:29
| I Will Derive! | MindofMatthew | You Tube | 3:17
| Lobachevsky | Tom Lehrer | You Tube | 3:04
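One possible fix (a minimal sketch reusing get_datalist_from_csv and pad_to_length from above; the 79-character rule width is just an assumption): pad every cell to the widest element and join the cells with ' | ', so each row both starts and ends with a bar.
def display_playlist(filename):
    if not filename.endswith('.csv'):
        filename += '.csv'
    rows = get_datalist_from_csv(filename)
    # width of the longest element, so all columns line up
    width = max(len(element) for row in rows for element in row)
    rule = '-' * 79
    print(rule)
    for row in rows:
        # join() puts ' | ' between cells; the leading and trailing '|' close the row
        print('| ' + ' | '.join(pad_to_length(element, width) for element in row) + ' |')
    print(rule)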

Getting single letters instead of sentences after applying NLTK's sentence tokenizer in Python 3.5.1

import codecs, os
import re
import string
import mysql
import mysql.connector

y_ = ""

'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root, file), "r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines

'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)
tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]

'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = ''
for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    tokenized_docs_no_punctuation += (new_review)
print(tokenized_docs_no_punctuation)

'''Connecting and inserting tokenized documents without punctuation in a database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'test' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i])))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()
After writing the above code, the result is like:
2 | S | N |
| 3 | S | o |
| 4 | S | |
| 5 | S | d |
| 6 | S | o |
| 7 | S | u |
| 8 | S | b |
| 9 | S | t |
| 10 | S | |
| 11 | S | m |
| 12 | S | y |
| 13 | S |
| 14 | S | d
in the database.
It should be like:
1 | S | No doubt, my dear friend.
2 | S | no doubt.
I suggest making the following edits (use whichever you like); this is what I used to get your code running. Your issue is that review in for review in tokenized_docs: is already a string, so token in for token in review: iterates over single characters. Therefore, to fix this I tried:
tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."']
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_token = regex.sub(u'', review)
    if not new_token == u'':
        tokenized_docs_no_punctuation.append(new_token)
print(tokenized_docs_no_punctuation)
and got this -
['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle']
The final format of the output is up to you. I prefer using lists. But you could concatenate this into a string as well.
nw = []
for review in tokenized_docs[0]:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    nw.append(new_review)
'''Inserting into database'''
def connect():
    for j in nw:
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid, j))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()
