Print elements containing only 2 strings

I have this list
lst = [' SOME TEXT\nSOME TEXT\nFTY = 1', 'A|1\nB|5\nC|3\n \nD|0\nE|0', 'D|4\nE|1\nG|1', '\nblah blah', '\n--- HHGTY',
'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3', '', 'blah blah\n \nblah blah',
'--- HHGTY'
]
and I want to print only the elements containing | or HHGTY. I am using the code below, but it prints
SOME TEXT and FTY = 1 too. What is wrong? Thanks
>>> for s in lst:
...     if ("|" in s) or ("HHGTY" in s):
...         print(s)
...
A|1
B|5
C|3
D|0
E|0
D|4
E|1
G|1
--- HHGTY
SOME TEXT
FTY = 1
A|3
B|2
C|8
D|6
E|9
F|3
--- HHGTY
>>>

I think what you want is:
for s in lst:
    for subs in s.split('\n'):
        if ("|" in subs) or ("HHGTY" in subs):
            print(subs)

Your code is doing exactly what you wrote:
SOME TEXT and FTY = 1 are part of the single element 'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3'.

Because '|' is present in your 'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3' element, the test is true for the whole element and all of its lines are printed.
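
The line-by-line filtering described in the answers can also be written as a small helper. This is only a sketch: the function name matching_lines and its needles parameter are made up for illustration, and the list is a shortened version of the one in the question.

```python
# Shortened version of the list from the question.
lst = [' SOME TEXT\nSOME TEXT\nFTY = 1', 'A|1\nB|5', '\n--- HHGTY',
       'SOME TEXT\nFTY = 1\nA|3\nB|2', '', 'blah blah\n \nblah blah']

def matching_lines(elements, needles=("|", "HHGTY")):
    """Return every individual line that contains at least one needle."""
    return [line
            for element in elements
            for line in element.splitlines()   # test lines, not whole elements
            if any(needle in line for needle in needles)]

print("\n".join(matching_lines(lst)))
```

The point is that the membership test runs on each line after splitting, so lines such as SOME TEXT are never printed just because they share an element with a line containing |.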

Related

Regex substitution reversal?

I have a question:
starting from this text example:
input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
I managed to clean this text using these functions:
import re
import string
arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations
arabic_diacritics = re.compile("""
ّ | # Tashdid
َ | # Fatha
ً | # Tanwin Fath
ُ | # Damma
ٌ | # Tanwin Damm
ِ | # Kasra
ٍ | # Tanwin Kasr
ْ | # Sukun
ـ # Tatwil/Kashida
""", re.VERBOSE)
def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    return text
def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text
def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)
Which gives me this text as the result:
result = "اكتب الدرس و احفضه ثم اقرا القصيدة"
Now, given this case, how can I find the word "اقرا" in the original input_test?
The input text can be in English, too. I'm thinking of regex — but I don't know from where to start…
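
One possible direction, offered only as a sketch: instead of reversing the substitutions, clean each word of the original text separately and keep the original words whose cleaned form equals the word you are looking for. The names find_original_words and simple_clean are hypothetical, and simple_clean is just a stand-in for the cleaning functions above. It assumes the cleaning never merges or splits words.

```python
import string

def find_original_words(original, cleaned_word, clean):
    """Return the words of `original` whose cleaned form equals `cleaned_word`."""
    return [word for word in original.split() if clean(word) == cleaned_word]

def simple_clean(text):
    # Hypothetical stand-in for the question's pipeline:
    # drop ASCII punctuation and lowercase the rest.
    return text.translate(str.maketrans('', '', string.punctuation)).lower()

original = "Read the PO_EM, then recite it"
print(find_original_words(original, "poem", simple_clean))
```

To use it with the question's data you would pass a function that chains normalize_arabic, remove_diacritics, remove_punctuations and remove_repeating_char in the same order that produced the cleaned result.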

Hi, How can I remove some symbols in string and make rest words listed?

I have a string variable like below.
AKT= PDK1 & ~ PTEN
AP1= JUN & (FOS | ATF2)
Apoptosis= ~ BCL2 & ~ ERK & FOXO3 & p53
ATF2= JNK | p38
ATM= DNA_damage
BCL2= CREB & AKT
I want to remove '&', '~', '(', ')' and '|', and list the remaining words like below.
AKT = ['PDK1', 'PTEN']
AP1 = ['JUN', 'FOS', 'ATF2']
...

Here's one way you can do this,
s = '''AKT= PDK1 & ~ PTEN
AP1= JUN & (FOS | ATF2)
Apoptosis= ~ BCL2 & ~ ERK & FOXO3 & p53
ATF2= JNK | p38
ATM= DNA_damage
BCL2= CREB & AKT'''
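import re
final_list = []
for line in s.split('\n'):
    # \w+ matches each run of letters, digits and underscores, so the operators are skipped
    valid_words = re.findall(r'\w+', line)
    lhs = valid_words[0]   # the name before the '='
    rhs = valid_words[1:]  # the names after the '='
    final_list.append([lhs, rhs])
for item in final_list:
    print(item[0], '=', item[1])
Output: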
AKT = ['PDK1', 'PTEN']
AP1 = ['JUN', 'FOS', 'ATF2']
Apoptosis = ['BCL2', 'ERK', 'FOXO3', 'p53']
ATF2 = ['JNK', 'p38']
ATM = ['DNA_damage']
BCL2 = ['CREB', 'AKT']
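
If you want to look the lists up by name afterwards, the same re.findall idea can fill a dictionary instead of a list of pairs. This is a sketch, not part of the answer above; parse_rules is a made-up name and only three of the input lines are used.

```python
import re

rules = '''AKT= PDK1 & ~ PTEN
AP1= JUN & (FOS | ATF2)
ATM= DNA_damage'''

def parse_rules(text):
    """Map each name on the left of '=' to the list of words on the right."""
    parsed = {}
    for line in text.splitlines():
        name, _, expression = line.partition('=')   # split at the first '=' only
        parsed[name.strip()] = re.findall(r'\w+', expression)
    return parsed

print(parse_rules(rules)['AP1'])
```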

You could split and join, i.e.
AKT = 'PDK1 & ~ PTEN'
AKT = ''.join(AKT.split('&'))  # 'PDK1  ~ PTEN'
AKT = ''.join(AKT.split('~'))  # 'PDK1   PTEN'
AKT = AKT.split()              # ['PDK1', 'PTEN']
...
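
The repeated split and join can be collapsed into one pass with str.translate, which deletes every listed character. A minimal sketch, assuming the only symbols to drop are &, ~, |, ( and ); the names OPERATORS and words_without_operators are made up for illustration.

```python
# Translation table that deletes the operator characters.
OPERATORS = str.maketrans('', '', '&~|()')

def words_without_operators(expression):
    """Remove the operator characters, then split on whitespace."""
    return expression.translate(OPERATORS).split()

print(words_without_operators('JUN & (FOS | ATF2)'))
```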

parse string into list based on input list

I would like to write a function in python3 to parse a string based on the input list element. The following function works but is there a better way to do it?
def func(oStr, s_s):
    res = []
    if not oStr:
        return s_s
    elif '' in s_s:
        return [oStr]
    else:
        for x in s_s:
            st = oStr.find(x)
            end = st + len(x)
            res.append(oStr[st:end])
            oStr = oStr.replace(x, '')
        if oStr:
            res.append(oStr)
        return res
case 1
o_str = 'ABCNew York - Address'
s_str = ['ABC']
return ['ABC', 'New York - Address']
case 2
o_str = 'New York Friend Add | NumberABCNewYork Name | FirstName Last Name | time : Jan-31-2017'
s_str = ['New York Friend Add | Number', 'ABC', 'NewYork Name | FirstName Last Name | time: Jan-31-2017']
return ['New York Friend Add | Number', 'ABC', 'NewYork Name | FirstName Last Name | time: Jan-31-2017']
case 3
o_str = '-'
s_str = ['']
return ['-']
case 4
o_str = '1'
s_str = ['']
return ['1']
case 5
o_str = '1234Family-Name'
s_str = ['1234']
return ['1234', 'Family-Name']
case 6
o_str = ''
s_str = ['12345667', 'name']
return ['12345667', 'name']

To use a string like an array, convert it to a list first, because Python strings are immutable and have no .insert() or .append() method. For example
myStr = "Hello, World!"
chars = list(myStr)
chars.insert(len(chars), "!")  # your character here
myStr = "".join(chars)
For your purposes .append() on the list would work exactly the same way. Hope I helped.
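
As for the "better way" asked about in the question, one option is re.split with a capturing group, which keeps the matched markers in the result. This is a sketch under assumptions: split_on_markers is a made-up name, the two special cases mirror the ones in the question's function, and unlike the original it returns the pieces in the order they occur in the text rather than markers first.

```python
import re

def split_on_markers(text, markers):
    """Split `text` around the given marker strings, keeping the markers."""
    if not text:
        return list(markers)          # same special case as the question (case 6)
    if not markers or '' in markers:
        return [text]                 # same special case as cases 3 and 4
    pattern = '|'.join(re.escape(marker) for marker in markers)
    # The capturing group makes re.split keep the markers; empty pieces are dropped.
    return [piece for piece in re.split('(' + pattern + ')', text) if piece]

print(split_on_markers('ABCNew York - Address', ['ABC']))
```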

getting alphabets after applying sentence tokenizer of nltk instead of sentences in Python 3.5.1

import codecs, os
import re
import string
import mysql
import mysql.connector
y_ = ""
'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root,file),"r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines
'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)
tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = ''
for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review+= new_token
    tokenized_docs_no_punctuation += (new_review)
print(tokenized_docs_no_punctuation)
'''Connecting and inserting tokenized documents without punctuation in database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'test' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i])))
        conn.commit()
        conn.close()
if __name__ == '__main__':
    connect()
After running the above code, the result is like
2 | S | N |
| 3 | S | o |
| 4 | S | |
| 5 | S | d |
| 6 | S | o |
| 7 | S | u |
| 8 | S | b |
| 9 | S | t |
| 10 | S | |
| 11 | S | m |
| 12 | S | y |
| 13 | S |
| 14 | S | d
in the database.
It should be like:
1 | S | No doubt, my dear friend.
2 | S | no doubt.

I suggest making the following edits (use what you like); this is what I used to get your code running. Your issue is that review, in the loop for review in tokenized_docs:, is already a string, so token, in the loop for token in review:, is a single character. To fix this I tried -
tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."']
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_token = regex.sub(u'', review)
    if not new_token == u'':
        tokenized_docs_no_punctuation.append(new_token)
print(tokenized_docs_no_punctuation)
and got this -
['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle']
The final format of the output is up to you. I prefer using lists. But you could concatenate this into a string as well.
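
The difference between collecting into a list and concatenating into a string is what decides whether each database row gets a sentence or a single character. The sketch below shows it without nltk or MySQL; strip_punctuation and the sample sentences are made up for illustration.

```python
import re
import string

PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

def strip_punctuation(sentences):
    """Return each sentence without punctuation, skipping ones that become empty."""
    cleaned = []
    for sentence in sentences:
        stripped = PUNCTUATION.sub('', sentence)
        if stripped:
            cleaned.append(stripped)
    return cleaned

sentences = ['No doubt, my dear friend.', 'no doubt.', '...']
as_list = strip_punctuation(sentences)
as_string = ''.join(as_list)   # what the original code effectively built

print(as_list[0])    # indexing the list gives a whole sentence
print(as_string[0])  # indexing the string gives one character
```

Looping over as_list in the insert step should therefore store one sentence per row, while looping over range(len(as_string)) stores one character per row, which matches the output shown in the question.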

nw = []
for review in tokenized_docs[0]:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    nw.append(new_review)
'''Inserting into database'''
def connect():
    for j in nw:
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,j))
        conn.commit()
        conn.close()
if __name__ == '__main__':
    connect()

Regex with end of line in group

Given this kind of input:
.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
| params = none
| raw_sig = 4:64+0:0:0:32768,0:::0
|
`----
.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
| params = none
| raw_sig = 4:64+0:0:0:32768,0:::0
|
`----
...
I'm trying to use a regex to get the value of every os field in the output (there will be hundreds).
I've tried this:
import os, subprocess, re
dir = '/home/user/Documents/ics-passif-asset-enumeration/pcap/'
for filename in os.listdir(dir):
    inp = '...'
    match = re.match( r'(.*)os(.*)\n(.*)', inp )
    print match.group(1)
But match is a NoneType. Never really played with regex before and I'm a bit lost.
Edit:
The expected output is a list of all the os values. In this case it would be:
???
???

I hope this is what you are looking for
>>> import re
>>> string = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----"""
>>> match = re.match( r'(.*)os\s*=(.*?)\n', string, re.DOTALL)
>>> match.group(2)
' ???'
Changes made
re.DOTALL This flag makes . match newlines as well, which is required because the input spans multiple lines.
os\s*=(.*?)
\s*= The = and the spaces before it are kept out of the capture group since we are not interested in them.
(.*?) The ? makes the group non-greedy, so that it matches only up to the end of the first line.
match.group(2) The value is in the second capture group, not the first.
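
The effect of re.DOTALL can be checked directly. In this sketch the sample is a shortened copy of one block from the question, and the variable names are made up for illustration.

```python
import re

sample = (".-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-\n"
          "|\n"
          "| server = 127.0.0.1/502\n"
          "| os = ???\n"
          "| dist = 0\n"
          "`----")

pattern = r'(.*)os\s*=(.*?)\n'

# Without the flag, (.*) cannot cross the first newline, and re.match is
# anchored at the start of the string, so nothing matches.
without_flag = re.match(pattern, sample)

# With re.DOTALL, (.*) can run across lines up to the "os =" line.
with_flag = re.match(pattern, sample, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(2)))
```

The captured value still carries the leading space after the =, so you may want to call .strip() on it.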
A better and shorter solution
You can use re.findall() with a slightly different regex
os\s*=(.*)
Test
>>> string = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
...
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
... ..."""
>>> re.findall(r"os\s*=(.*)", string)
[' ???', ' ???']

re.findall will return a list of results! Fantastic! Assuming the format of your input is pretty consistent, this should work like a charm:
>>> inp = '''
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
...
... .-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
... |
... | server = 127.0.0.1/502
... | os = ???
... | dist = 0
... | params = none
... | raw_sig = 4:64+0:0:0:32768,0:::0
... |
... `----
... ...
... '''
>>> re.findall(r'^\| os\s+= (.*)$', inp, flags=re.MULTILINE)
['???', '???']
I agree with the idea that the format should be strict to ensure that the string won't appear somewhere else. If this all came from a script then the strictness shouldn't be a problem (you'd hope). If it was via manual entry... I'd be surprised.

For the dot operator (.) to match newlines, add a flag to the match-call:
match = re.match( r'(.*)os(.*)\n(.*)', inp, flags=re.DOTALL )

If I understand what you want (and assuming your input is what you copied here: multiline, multi-entry), this regex should do it, with the modifiers g and m to match all occurrences and to let ^ and $ match the start and end of each line respectively:
^\|\s*os\s*=\s*(.*)$

You may try using the findall() method:
for filename in os.listdir(dir):
    inp = '...'
    match = re.findall('os(.*)\n', inp)
    print match

As @Tensibai says, you're probably best to use ^ and $ to match the start and end of the line, and a very specific pattern (as he gives) to make sure that the string "os" is not matched somewhere else, like within a hostname for example.
To directly find all of the matching "os = " lines, use re.findall( r'^\|\s*os\s*=\s*(.*)$', inp, re.MULTILINE ), which returns a list of the matching os values. The leading | has to be escaped as \| because a bare | means alternation in a regex.
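
Put together as a runnable sketch: the pipe at the start of each line is escaped as \| (a bare | would be read as alternation), and re.MULTILINE lets ^ and $ match at every line. The names OS_LINE and os_values are made up for illustration, and the report is a shortened copy of the input from the question.

```python
import re

# ^ and $ match at each line because of re.MULTILINE; \| is a literal pipe.
OS_LINE = re.compile(r'^\|\s*os\s*=\s*(.*)$', re.MULTILINE)

def os_values(report):
    """Return the value of every "| os = ..." line in the report."""
    return [value.strip() for value in OS_LINE.findall(report)]

report = """.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
`----
.-[ 127.0.0.1/44963 -> 127.0.0.1/502 (syn+ack) ]-
|
| server = 127.0.0.1/502
| os = ???
| dist = 0
`----"""

print(os_values(report))
```

In the question's loop you would call os_values on the text produced for each file instead of on this fixed string.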
