I have this list
lst = [' SOME TEXT\nSOME TEXT\nFTY = 1', 'A|1\nB|5\nC|3\n \nD|0\nE|0', 'D|4\nE|1\nG|1', '\nblah blah', '\n--- HHGTY',
'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3', '', 'blah blah\n \nblah blah',
'--- HHGTY'
]
and I want to print only the elements containing | or HHGTY. I am using the code below, but it is printing
SOME TEXT and FTY = 1 too. What is wrong? Thanks
>>> for s in lst:
...     if ("|" in s) or ("HHGTY" in s):
...         print(s)
...
A|1
B|5
C|3
D|0
E|0
D|4
E|1
G|1
--- HHGTY
SOME TEXT
FTY = 1
A|3
B|2
C|8
D|6
E|9
F|3
--- HHGTY
>>>
I think what you want is:
for s in lst:
    for subs in s.split('\n'):
        if ("|" in subs) or ("HHGTY" in subs):
            print(subs)
Your code is doing exactly what you asked: the in test is applied to each list element as a whole.
SOME TEXT and FTY = 1 are part of the element 'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3',
and that element is printed in full because it contains '|'.
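To make the difference concrete, here is a minimal sketch on a shortened version of the list. The names `lines` and `wanted` are illustrative, not from the question; the idea is to flatten every element into individual lines first and filter afterwards.

```python
lst = [' SOME TEXT\nSOME TEXT\nFTY = 1', 'A|1\nB|5',
       'SOME TEXT\nFTY = 1\nA|3\nB|2', '\n--- HHGTY']

# Testing a whole element is True if *any* of its lines contains '|',
# which is why the third element is printed together with 'SOME TEXT'.
print('|' in lst[2])  # True

# Split every element into lines, then test each line on its own.
lines = [line for s in lst for line in s.split('\n')]
wanted = [line for line in lines if '|' in line or 'HHGTY' in line]
print(wanted)  # ['A|1', 'B|5', 'A|3', 'B|2', '--- HHGTY']
```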
I have a question:
starting from this text example:
input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
I managed to clean this text using these functions:
import re
import string
arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations
arabic_diacritics = re.compile("""
ّ | # Tashdid
َ | # Fatha
ً | # Tanwin Fath
ُ | # Damma
ٌ | # Tanwin Damm
ِ | # Kasra
ٍ | # Tanwin Kasr
ْ | # Sukun
ـ # Tatwil/Kashida
""", re.VERBOSE)
def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    return text
def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text
def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)
Which gives me this text as the result:
result = "اكتب الدرس و احفضه ثم اقرا القصيدة"
Now if I have this case, how can I find the word "اقرا" in the original input_test?
The input text can be in English, too. I'm thinking of regex, but I don't know where to start…
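One possible direction, sketched under assumptions rather than as a definitive answer: instead of searching the original text for the cleaned word, clean each whitespace-separated token of the original and compare the result with the cleaned word, remembering where the token started. The helper names `clean` and `find_in_original` are hypothetical, and `clean` is only a condensed stand-in for the cleaning functions above (it skips diacritics and most punctuation).

```python
import re

def clean(token):
    # Condensed stand-in for the question's cleaning chain (assumption:
    # replace this with your own normalize/remove_* functions).
    token = re.sub("[\u0622\u0623\u0625\u0627]", "\u0627", token)  # alef variants -> bare alef
    token = re.sub("[\u0640_]", "", token)                          # tatweel and underscore
    return re.sub(r"(.)\1+", r"\1", token)                          # collapse repeated characters

def find_in_original(original, cleaned_word):
    """Return (start, end, raw_token) for every token that cleans to cleaned_word."""
    hits = []
    for m in re.finditer(r"\S+", original):
        if clean(m.group()) == cleaned_word:
            hits.append((m.start(), m.end(), m.group()))
    return hits

# Works the same way for English input.
print(find_in_original("Helllo wor_ld", "world"))  # [(7, 13, 'wor_ld')]
```

Because each token is cleaned separately, this only finds words that survive as a single token; a cleaning step that merges or splits tokens would need a different mapping.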
I have a string variable like below.
AKT= PDK1 & ~ PTEN
AP1= JUN & (FOS | ATF2)
Apoptosis= ~ BCL2 & ~ ERK & FOXO3 & p53
ATF2= JNK | p38
ATM= DNA_damage
BCL2= CREB & AKT
I want to remove '&', '~', '(', ')' and '|' (or), and to list the words that are left, like below.
AKT = ['PDK1', 'PTEN']
AP1 = ['JUN', 'FOS', 'ATF2']
...
Here's one way you can do this,
s = '''AKT= PDK1 & ~ PTEN
AP1= JUN & (FOS | ATF2)
Apoptosis= ~ BCL2 & ~ ERK & FOXO3 & p53
ATF2= JNK | p38
ATM= DNA_damage
BCL2= CREB & AKT'''
import re
final_list = []
for line in s.split('\n'):
    valid_words = re.findall(r'\w+', line)
    lhs = valid_words[0]   # name on the left of '='
    rhs = valid_words[1:]  # operands on the right of '='
    final_list.append([lhs, rhs])
for item in final_list:
    print(item[0], '=', item[1])
Outputs:
AKT = ['PDK1', 'PTEN']
AP1 = ['JUN', 'FOS', 'ATF2']
Apoptosis = ['BCL2', 'ERK', 'FOXO3', 'p53']
ATF2 = ['JNK', 'p38']
ATM = ['DNA_damage']
BCL2 = ['CREB', 'AKT']
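If the operands need to be looked up by name later, a dictionary may be more convenient than a list of pairs. This is a small variation on the same `re.findall` idea, shown on a shortened input; the name `rules` is illustrative.

```python
import re

s = '''AKT= PDK1 & ~ PTEN
AP1= JUN & (FOS | ATF2)
ATM= DNA_damage'''

rules = {}
for line in s.splitlines():
    # Split once at the first '=' so only the right-hand side is scanned.
    name, _, expression = line.partition('=')
    rules[name.strip()] = re.findall(r'\w+', expression)

print(rules['AP1'])  # ['JUN', 'FOS', 'ATF2']
```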
You could split and join, i.e.
AKT = 'PDK1 & ~ PTEN'
AKT = ''.join(AKT.split('&'))  # drops every '&'
AKT = ''.join(AKT.split('~'))  # drops every '~'
AKT = AKT.split()              # ['PDK1', 'PTEN']
...
I would like to write a function in python3 to parse a string based on the input list element. The following function works but is there a better way to do it?
def func(oStr, s_s):
    res = []  # collected pieces; was never initialised in the original
    if not oStr:
        return s_s
    elif '' in s_s:
        return [oStr]
    else:
        for x in s_s:
            st = oStr.find(x)
            end = st + len(x)
            res.append(oStr[st:end])
            oStr = oStr.replace(x, '')
        if oStr:
            res.append(oStr)
        return res
case 1
o_str = 'ABCNew York - Address'
s_str = ['ABC']
return ['ABC', 'New York - Address']
case 2
o_str = 'New York Friend Add | NumberABCNewYork Name | FirstName Last Name | time : Jan-31-2017'
s_str = ['New York Friend Add | Number', 'ABC', 'NewYork Name | FirstName Last Name | time: Jan-31-2017']
return ['New York Friend Add | Number', 'ABC', 'NewYork Name | FirstName Last Name | time: Jan-31-2017']
case 3
o_str = '-'
s_str = ['']
return ['-']
case 4
o_str = '1'
s_str = ['']
return ['1']
case 5
o_str = '1234Family-Name'
s_str = ['1234']
return ['1234', 'Family-Name']
case 6
o_str = ''
s_str = ['12345667', 'name']
return ['12345667', 'name']
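One alternative worth considering, sketched under assumptions: `re.split` with a capturing group keeps the separators in the result, so the known substrings and the leftover text come back in one pass. The function name `split_on_tokens` is hypothetical. Unlike the loop above, this version returns the pieces in the order they appear in the string and silently ignores tokens that do not occur.

```python
import re

def split_on_tokens(text, tokens):
    tokens = [t for t in tokens if t]   # drop empty tokens such as ''
    if not text:
        return tokens                   # nothing to split: mirror case 6
    if not tokens:
        return [text]                   # no usable tokens: mirror cases 3 and 4
    # Longest tokens first so a short token cannot pre-empt a longer one.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = '(' + '|'.join(re.escape(t) for t in ordered) + ')'
    # The capturing group makes re.split keep the matched tokens.
    return [part for part in re.split(pattern, text) if part]

print(split_on_tokens('ABCNew York - Address', ['ABC']))
# ['ABC', 'New York - Address']
```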
To use a string like an array, convert it to a list first: Python strings are immutable and have no .insert() or .append() of their own. For example
myStr = "Hello, World!"
chars = list(myStr)
chars.insert(len(chars), "!")  # your character here
myStr = "".join(chars)
For your purposes .append() on the list would work exactly the same way. Hope I helped.
import codecs, os
import re
import string
import mysql
import mysql.connector
y_ = ""
'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root,file),"r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines
'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)
tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = ''
for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review+= new_token
    tokenized_docs_no_punctuation += (new_review)
print(tokenized_docs_no_punctuation)
'''Connecting and inserting tokenized documents without punctuation in database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'test' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i])))
        conn.commit()
        conn.close()
if __name__ == '__main__':
    connect()
After running the above code, the result looks like
| 2 | S | N |
| 3 | S | o |
| 4 | S | |
| 5 | S | d |
| 6 | S | o |
| 7 | S | u |
| 8 | S | b |
| 9 | S | t |
| 10 | S | |
| 11 | S | m |
| 12 | S | y |
| 13 | S |
| 14 | S | d
in the database.
It should be like:
1 | S | No doubt, my dear friend.
2 | S | no doubt.
I suggest making the following edits (use whichever you like); this is what I used to get your code running. Your issue is that review in for review in tokenized_docs: is already a string, so token in for token in review: iterates over single characters. To fix this I tried:
tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."']
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_token = regex.sub(u'', review)
    if not new_token == u'':
        tokenized_docs_no_punctuation.append(new_token)
print(tokenized_docs_no_punctuation)
and got this -
['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle']
The final format of the output is up to you. I prefer using lists. But you could concatenate this into a string as well.
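The one-character-per-row symptom can be reproduced without a database. In this small sketch (the two sentences are taken from the expected output in the question; the variable names are illustrative), concatenating into a string and then indexing it yields characters, while a list yields whole sentences.

```python
sentences = ['No doubt, my dear friend.', 'no doubt.']

# Accumulating with += on a str glues everything into one long string.
joined = ''
for sentence in sentences:
    joined += sentence

# Indexing a str gives single characters: this is what was inserted row by row.
print([joined[i] for i in range(3)])        # ['N', 'o', ' ']

# Indexing a list gives one sentence per element, i.e. one row per sentence.
print([sentences[i] for i in range(len(sentences))])
```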
nw = []
for review in tokenized_docs[0]:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    nw.append(new_review)
'''Inserting into database'''
def connect():
    for j in nw:
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,j))
        conn.commit()
        conn.close()
if __name__ == '__main__':
    connect()
Given this kind of input:
.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
| params = none
| raw_sig = 4:64+0:0:0:32768,0:::0
|
`----
.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
| params = none
| raw_sig = 4:64+0:0:0:32768,0:::0
|
`----
...
I'm trying to use regex to get the value of every os field in the output (there will be hundreds).
I've tried this:
import os, subprocess, re
dir = '/home/user/Documents/ics-passif-asset-enumeration/pcap/'
for filename in os.listdir(dir):
    inp = '...'
    match = re.match( r'(.*)os(.*)\n(.*)', inp )
    print(match.group(1))
But match is None (a NoneType). I have never really used regex before and I'm a bit lost.
Edit:
The expected output is a list of all the os values. In this case it would be:
???
???
I hope this is what you are looking for
>>> import re
>>> string = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----"""
>>> match = re.match( r'(.*)os\s*=(.*?)\n', string, re.DOTALL)
>>> match.group(2)
' ???'
Changes made
re.DOTALL This flag is required so that . also matches newlines, which lets the leading (.*) span the lines before os.
os\s*=(.*?)
\s*= The = and the spaces before it are kept outside the capture group since we are not interested in them.
(.*?) The ? makes it non-greedy so that it matches only up to the end of that line.
match.group(2) The value is in the second capture group, not the first.
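The effect of the non-greedy quantifier is easiest to see side by side. This is a small sketch on a shortened, made-up three-line input; only the `?` differs between the two patterns.

```python
import re

text = "| os = ???\n| dist = 0\n| params = none\n"

# Greedy: with DOTALL, (.*) runs to the end and backs off to the LAST newline.
greedy = re.match(r'(.*)os\s*=(.*)\n', text, re.DOTALL)
# Non-greedy: (.*?) stops at the FIRST newline after the '='.
lazy = re.match(r'(.*)os\s*=(.*?)\n', text, re.DOTALL)

print(repr(greedy.group(2)))  # ' ???\n| dist = 0\n| params = none'
print(repr(lazy.group(2)))    # ' ???'
```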
A better and shorter solution
You can use re.findall() with a slightly different regex
os\s*=(.*)
Test
>>> string = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
...
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
... ..."""
>>> re.findall(r"os\s*=(.*)", string)
[' ???', ' ???']
re.findall will return a list of results! Fantastic! Assuming the format of your input is pretty consistent, this should work like a charm:
>>> inp = '''
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
...
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
... ...
... '''
>>> re.findall(r'^\| os\s+= (.*)$', inp, flags=re.MULTILINE)
['???', '???']
I agree with the idea that the format should be strict to ensure that the string won't appear somewhere else. If this all came from a script then the strictness shouldn't be a problem (you'd hope). If it was via manual entry... I'd be surprised.
For the dot operator (.) to match newlines, add a flag to the match-call:
match = re.match( r'(.*)os(.*)\n(.*)', inp, flags=re.DOTALL )
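A short sketch of why the original call returned None, using a trimmed, made-up block (`.-[ header ]-` stands in for the real first line): `re.match` only matches at the start of the string, and without `re.DOTALL` the first `(.*)` cannot get past the first line to reach `os`.

```python
import re

inp = ".-[ header ]-\n|\n| os = ???\n| dist = 0\n`----"

# Without DOTALL, '.' stops at the first newline, so 'os' is never reached.
print(re.match(r'(.*)os(.*)\n(.*)', inp))  # None

# With DOTALL the pattern matches, but the greedy groups swallow extra lines.
m = re.match(r'(.*)os(.*)\n(.*)', inp, flags=re.DOTALL)
print(repr(m.group(2)))  # ' = ???\n| dist = 0'
```

So the flag makes the match succeed, but the captured group still needs trimming, which is why the findall-based patterns elsewhere in this thread are usually the simpler route.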
If I understand what you want (and assuming your input is what you pasted here: multiline, multi-entry), this regex should do it, with the modifiers g and m to match all entries and to let ^ and $ match the start and end of each line. The leading pipe is escaped so that it is a literal character:
^\|\s*os\s*=\s*(.*)$
You may try using the findall() method:
for filename in os.listdir(dir):
    inp = '...'
    match = re.findall('os(.*)\n', inp)
    print(match)
As @Tensibai says, you're probably best off using ^ and $ to match the start and end of the line, and a very specific pattern (as he gives) to make sure that the string "os" is not matched somewhere else, like within a hostname for example.
To directly find all of the matching "os = " lines, use re.findall( r'^\|\s*os\s*=\s*(.*)$', inp, re.MULTILINE ), which returns a list of the os values. The leading | is escaped so that it matches a literal pipe instead of acting as regex alternation.
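To illustrate the hostname concern, here is a sketch on a made-up block in which the server is `localhost` (an assumption for the example; the original input uses IP addresses). Note that the pipe has to be written as `\|`: an unescaped `|` is regex alternation, and `^|...` would also match the empty start of every line.

```python
import re

# Hypothetical input: 'localhost' happens to contain the letters "os".
inp = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = localhost/502
| os = ???
| dist = 0
`----
"""

loose = re.findall(r'os(.*)\n', inp)
strict = re.findall(r'^\|\s*os\s*=\s*(.*)$', inp, re.MULTILINE)

print(loose)   # ['t/502', ' = ???']  <- picks up the hostname too
print(strict)  # ['???']
```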