Python Regular Expression / Middle word in result

I have a problem with unnecessary text in the result. I want to pull only the http/https URLs out of a file.
My code is:
import sys
import os
import hashlib
import re
if len(sys.argv) < 2:
    # Polish: "To use, type: python %s filename"
    sys.exit('Aby uzyc wpisz: python %s filename' % sys.argv[0])
if not os.path.exists(sys.argv[1]):
    # Polish: 'ERROR!: File "%s" not found!'
    sys.exit('BLAD!: Plik "%s" nie znaleziony!' % sys.argv[1])
with open(sys.argv[1], 'rb') as f:
    plik = f.read()  # plik = "file"
print("MD5: %s" % hashlib.md5(plik).hexdigest())
print("SHA1: %s" % hashlib.sha1(plik).hexdigest())
print("SHA256: %s" % hashlib.sha256(plik).hexdigest())
print("Podejrzane linki: \n")  # Polish: "Suspicious links"
pliki = open(sys.argv[1], 'r')
for line in pliki:
    if re.search("(H|h)ttps:(.*)", line):
        print(line)
    elif re.search("(H|h)ttp:(.*)", line):
        print(line)
pliki.close()
The result:
MD5: f16a93fd2d6f2a9f90af9f61a19d28bd
SHA1: 0a9b89624696757e188412da268afb2bf5b600aa
SHA256: 3b365deb0e272146f00f9d723a9fd4dbeacddc10123aec8237a37c10c19fe6df
Podejrzane linki:
GrizliPolSurls = "http://xxx.xxx.xxx.xxx"
FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False
I want only the strings that are inside double quotes and start with http or https, e.g. http://xxx.xxx.xxx.xxx
Desired result:
MD5: f16a93fd2d6f2a9f90af9f61a19d28bd
SHA1: 0a9b89624696757e188412da268afb2bf5b600aa
SHA256: 3b365deb0e272146f00f9d723a9fd4dbeacddc10123aec8237a37c10c19fe6df
Podejrzane linki:
http://xxx.xxx.xxx.xxx
https://xxx.xxx.xxx.xxx

You can use re.findall with the following regex (explained on regex101):
"([Hh]ttps?.*?)"
For example:
import re
s = '''MD5: f16a93fd2d6f2a9f90af9f61a19d28bd
SHA1: 0a9b89624696757e188412da268afb2bf5b600aa
SHA256: 3b365deb0e272146f00f9d723a9fd4dbeacddc10123aec8237a37c10c19fe6df
Podejrzane linki:
GrizliPolSurls = "http://xxx.xxx.xxx.xxx"
FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False'''
urls = re.findall('"([Hh]ttps?.*?)"', s)
print(urls)
# ['http://xxx.xxx.xxx.xxx', 'https://xxx.xxx.xxx.xxx']

You need this pattern: (?<=")http[^"]+
(?<=") - positive lookbehind: asserts that a " precedes the current position, without including it in the match.
http - matches http literally.
[^"]+ - matches everything up to the next ". This is the negated character class technique, which avoids needing a lazy quantifier such as .*?
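A minimal sketch of the lookbehind pattern in use, assuming the two sample lines from the question's output as input. The variable names (text, quoted_url, urls) are only illustrative.

```python
import re

# Sample lines copied from the question's output.
text = '''GrizliPolSurls = "http://xxx.xxx.xxx.xxx"
FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False'''

# (?<=") requires a double quote just before the match without consuming it;
# [^"]+ then runs up to, but not including, the closing quote.
quoted_url = re.compile(r'(?<=")http[^"]+')

urls = quoted_url.findall(text)
print(urls)
```

Because the lookbehind demands a quote immediately before "http", the "https" inside the identifier FilnMoviehttpsd is skipped.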

re.search() returns a match object (or None when nothing matches).
You have to fetch the matched text from the result:
line = "my text line contains a http://192.168.1.1 magic url"
result = re.search(r"[Hh]ttps?://\d+\.\d+\.\d+\.\d+", line)  # raw string, so \d and \. reach the regex engine unchanged
print(result.group()) # will print http://192.168.1.1
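Applied to the question's line-by-line loop, this could look roughly like the sketch below. The list of lines is a stand-in for the file contents, and the pattern (a quoted URL with a capture group) is one possible choice rather than the only one.

```python
import re

# Stand-in for the lines read from the file.
lines = [
    'GrizliPolSurls = "http://xxx.xxx.xxx.xxx"',
    'FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False',
    'no link on this line',
]

# Group 1 captures the URL between the double quotes.
url_in_quotes = re.compile(r'"([Hh]ttps?://[^"]+)"')

found = []
for line in lines:
    match = url_in_quotes.search(line)
    if match is not None:              # search() returns None when there is no match
        found.append(match.group(1))   # keep only the captured URL, not the whole line

print("\n".join(found))
```

Printing match.group(1) instead of line is what removes the surrounding text from the output.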

Related

Get rid of parenthesis in output

I think this is an easy question, as I am a beginner with Python 3.
When I print the header of a FASTA file, the output contains parentheses. How can I remove them?
import sys
from Bio import Entrez
from Bio import SeqIO
#define email for entrez login
db = "nuccore"
Entrez.email = "someone#email.com"
#load accessions from arguments
if len(sys.argv[1:]) > 1:
    accs = sys.argv[1:]
else: #load accessions from stdin
    accs = [ l.strip() for l in sys.stdin if l.strip() ]
#fetch
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
for i,acc in enumerate(accs):
    try:
        sys.stderr.write( " %9i %s \r" % (i+1,acc))
        handle = Entrez.efetch(db=db, rettype="fasta", id=acc)
        seq_record = SeqIO.read(handle, "fasta")
        if (len(seq_record.seq) > 0):
            header = ">" + seq_record.description + " Len:" , len(seq_record.seq)
            print(header)
            print(seq_record.seq)
    except:
        sys.stderr.write( "Error! Cannot fetch: %s \n" % acc)
./acc2fasta.py 163345 303239
It will return
(">M69206.1 Bovine MHC class I AW10 mRNA (haplotype AW10), 3' end Len:", 1379)
TCCTGCTGCTCTCGGGGGTCCTGGTCCTGACCGAGACCCGGGCTGGCTCCCACTCGATGAGGTATTTCAGCACCGCCGTGTCCCGGCCCGGCCTCGGGGAGCCCCGGTACCTGGAAGTCGGCTACGTGGACGACACGCAGTTCGTGCGGTTTGACAGCGACGCCCCGAATCCGAGGATGGAGCCGCGGGCGCGGTGGGTGGAGCAGGAGGGGCCGGAGTATTGGGATCGGGAGACGCAAAGGGCCAAGGGCAACGCACAATTTTTCCGAGTGAGCCTGAACAACCTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAGTGGATGTCCGGCTGCTACGTGGGGCCGGACGGGCGTCCTCCGCGCGGGTTCATGCAGTTCGGCTACGACGGCAGAGATTACCTCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGTGGAGACGATGGCTCAGATCTCCAAACGCAAGATGGAGGCGGCCGGTGAAGCTGAGGTACAGAGGAACTACCTGGAGGGCCGGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGACACGCTGCTGCGCGCAGACCCTCCAAAGGCACATGTGACCCGTCACCCGATCTCTGGTCGTGAGGTCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGAAGAGATCTCACTGACCTGGCAGCGCAATGGGGAGGACCAGACCCAGGACATGGAGCTTGTGGAGACCAGGCCTTCAGGGGACGGAAACTTCCAGAAGTGGGCGGCCCTGTTGGTGCCTTCTGGAGAGGAGCAGAAATACACATGCCAAGTGCAGCACGAGGGGCTTCAGGAGCCCCTCACCCTGAAATGGGAACCTCCTCAGCCCTCCTTCCTCACCATGGGCATCATTGTTGGCCTGGTTCTCCTCGTGGTCACTGGAGCTGTGGTGGCTGGAGTTGTGATCTGCATGAAGAAGCGCTCAGGTGAAAAACGAGGGACTTATATCCAGGCTTCAAGCAGTGACAGTGCCCAGGGCTCTGATGTGTCTCTCACGGTTCCTAAAGTGTGAGACACCTGCCTTCGGGGGACTGAGTGATGCTTCATCCCGCTATGTGACATCAGATCCCCGGAACCCCTTTTTCTGCAGCTGCATCTGAATGTGTCAGTGCCCCTATTCGCATAAGTAGGAGTTAGGGAGACTGGCCCACCCATGCCCACTGCTGCCCTTCCCCACTGCCGTCCCTCCCCACCCTGACCTGTGTTCTCTTCCCTGATCCACTGTCCTGTTCCAGCAGAGACGAGGCTGGACCATGTCTATCCCTGTCTTTGCTTTATATGCACTGAAAAATGATATCTTCTTTCCTTATTGAAAATAAAATCTGTC
Error! Cannot fetch: 303239
How do I get rid of the parentheses in the output?

header = ">" + seq_record.description + " Len:" , len(seq_record.seq)
print(header)
Doing so prints the representation of a tuple: the comma after " Len:" makes header a tuple, so you get the comma (expected) but also parentheses (unwanted).
The best way is to join the items instead, so that a comma is inserted between the fields but the tuple representation is left out:
print(",".join(header))
In your case it's a little trickier: you have to convert the non-string items to strings first (the tuple representation did that conversion, but join doesn't):
print(",".join([str(x) for x in header]))
Result:
>M69206.1 Bovine MHC class I AW10 mRNA (haplotype AW10), 3' end Len:,1379
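If the comma after "Len:" is not wanted either, another option is to build the header as a single string with a format expression. A minimal sketch, where description and sequence are hypothetical stand-ins for seq_record.description and seq_record.seq:

```python
# Stand-ins for seq_record.description and seq_record.seq (sequence shortened).
description = "M69206.1 Bovine MHC class I AW10 mRNA (haplotype AW10), 3' end"
sequence = "TCCTGCTGCT"

# One format expression yields a plain string, so no tuple is created
# and print() shows neither parentheses nor an extra comma.
header = ">%s Len:%d" % (description, len(sequence))
print(header)
```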

find and replace regular expression rather than full string

I've loaded a dictionary of "regex": "picture" pairs parsed from JSON.
Each regex is meant to be matched within a message string and replaced with its picture, for display in a Flash plugin that renders HTML text.
For instance, typing:
Hello MVGame everyone.
Would return:
Hello <img src='http://static-cdn.jtvnw.net/jtv_user_pictures/chansub-global-emoticon-1a1a8bb5cdf6efb9-24x32.png' height = '32' width = '24'> everyone.
However, if I type:
Hello :) everyone.
the :) is not replaced, because that emoticon is encoded as the regular expression "\\:-?\\)" rather than as a plain string.
How do I get the regular expression itself to be used for matching?
Here is my test code:
# regular expression test
import urllib
import json # for loading json's for emoticons
import urllib.request # more for loadings jsons from urls
import re # allows pattern filtering for emoticons
def loademotes():
    #Create emoteicon dictionary
    try:
        print ("Trying to load emoteicons from twitch")
        response = urllib.request.urlopen('https://api.twitch.tv/kraken/chat/emoticons').read()
        mydata = json.loads(response.decode('utf-8'))
        for idx,item in enumerate(mydata['emoticons']):
            regex = item['regex']
            url = "<img src='" + item['images'][0]['url'] + "'" + " height = '" + str(item['images'][0]['height']) + "'" + " width = '" + str(item['images'][0]['width']) + "' >"
            emoticonDictionary[regex] = url
        print ("All emoteicons loaded")
    except IOError as e:
        print ("I/O error({0}) : {1}".format(e.errno, e.strerror))
        print ("Cannot load emoteicons.")
emoticonDictionary = {} # create emoticon dictionary indexed by words returns url in html image tags
loademotes()
while 1:
    myString = input ("Here you type something : ")
    pattern = re.compile(r'\b(' + '|'.join(emoticonDictionary.keys()) + r')\b')
    results = pattern.sub(lambda x: emoticonDictionary[x.group()], myString)
    print (results)

I think you could make sure that each special character in the regular expression is wrapped in a character class before you feed it to re. For example, write something that takes :) and turns it into [:][)]
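Two details of the question's code may also matter here. First, \b only matches between a word character and a non-word character, so a pattern wrapped in \b...\b cannot match " :) " where the emoticon is surrounded by spaces. Second, x.group() returns the matched text (":)"), which is not a key in a dictionary indexed by regex. The sketch below is one way around both, assuming emoticons are delimited by whitespace; the two-entry emotes dictionary and the image names are made up for illustration.

```python
import re

# Hypothetical regex -> HTML mapping, in the same shape as the question's dictionary.
emotes = {
    r"\:-?\)": "<img src='smile.png'>",
    r"MVGame": "<img src='mvgame.png'>",
}

# Each pattern compiled on its own, to find out which one produced a match.
compiled = [(re.compile(rx), html) for rx, html in emotes.items()]

# One combined pattern; the lookarounds require whitespace (or the string
# boundary) on both sides instead of \b, so ":)" can match too.
combined = re.compile(r"(?<!\S)(?:%s)(?!\S)" % "|".join("(?:%s)" % rx for rx in emotes))

def replace_emotes(text):
    def pick(match):
        token = match.group(0)
        # Look up the replacement by testing each regex against the matched text.
        for rx, html in compiled:
            if rx.fullmatch(token):
                return html
        return token  # not reached in practice; leave the text unchanged
    return combined.sub(pick, text)

print(replace_emotes("Hello :) everyone."))
print(replace_emotes("Hello MVGame everyone."))
```

With whitespace as the delimiter, an emoticon directly followed by punctuation (for example "MVGame.") is left alone; whether that is acceptable depends on the chat format.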

Removing \r and \n from list

I'm trying to remove \r and \n from an Urban Dictionary JSON API response, but every time I use re.sub I get this:
expected string or buffer
I'm not sure why. Here's the code:
elif used_prefix and cmd == "udi" and len(args) > 0 and self.getAccess(user) >= 1:
    try:
        f = urllib.request.urlopen("http://api.urbandictionary.com/v0/define?term=%s" % args.lower().replace(' ', '+'))
        data = json.loads(f.readall().decode("utf-8"))
        data = re.sub(r'\s+', ' ', data).replace("\\","")
        if (len(data['list']) > 0):
            definition = data['list'][0][u'definition']
            example = data['list'][0][u'example']
            permalink = data['list'][0][u'permalink']
            room.message("Urban Dictionary search for %s: %s Example: %s Link: %s" % (args.title(), definition, example, permalink), True)
        else: room.message("Word not found.")
    except:
        room.message((str(sys.exc_info()[1])))
        print(traceback.format_exc())
This is the traceback:
Traceback (most recent call last):
  File "C:\Users\dell\Desktop\b0t\TutorialBot.py", line 2186, in onMessage
    data = re.sub(r'\s+', ' ', data).replace("\\","")
  File "C:\lib\re.py", line 170, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

The problem is that you are calling re.sub on a dict rather than a string: json.loads has already turned the response into a dict. Your code also seems a little messy in places. Try this instead (written for Python 2, hence urllib2 and the print statement):
import urllib2
import json
import re
def test(*args):
    f = urllib2.urlopen("http://api.urbandictionary.com/v0/define?term=%s" % '+'.join(args).lower()) # note urllib2.urlopen rather than urllib.request.urlopen
    data = json.loads(f.read().decode("utf-8")) # note f.read() instead of f.readall()
    if len(data['list']) > 0:
        definition = data['list'][0][u'definition']
        example = data['list'][0][u'example']
        permalink = data['list'][0][u'permalink']
        return "Urban Dictionary search for %s: %s Example: %s Link: %s" % (str(args), definition, example, permalink) # returns a string
print test('mouth', 'hugging').replace('\n\n', '\n') # prints the string after replacing '\n\n' with '\n'
The result:
Urban Dictionary search for ('mouth', 'hugging'): When you put a beer bottle in your mouth, and keep your mouth wrapped around it all day. Example: Josh: "mhmgdfhwrmhhh (attempts to talk while drinking a beer)"
Ryan: "You know I can't hear you when you're mouth hugging."
Josh: "mmmffwrrggddsshh" Link: http://mouth-hugging.urbanup.com/7493517
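To actually collapse the \r and \n characters, re.sub can be applied to the individual string fields after json.loads, since those are strings while the parsed response is a dict. A minimal Python 3 sketch: the raw JSON text is a made-up stand-in for the API response, and squeeze_whitespace is a hypothetical helper name.

```python
import json
import re

# Made-up stand-in for the response body; the real API returns more fields.
raw = '{"list": [{"definition": "First line.\\r\\nSecond   line.", "example": "A\\n\\nB"}]}'

data = json.loads(raw)  # a dict, so re.sub(..., data) would raise TypeError

def squeeze_whitespace(text):
    # Collapse any run of whitespace (including \r and \n) into one space.
    return re.sub(r"\s+", " ", text).strip()

entry = data["list"][0]
definition = squeeze_whitespace(entry["definition"])  # entry values are strings
example = squeeze_whitespace(entry["example"])
print(definition)
print(example)
```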

Find all lines that match regex pattern and grab part of string

f = open("machinelist.txt", 'r')
lines = f.readlines()
for host in lines:
    hostnames = host.strip()
    print hostnames
Returns:
\\TESTHOSTDEV01
\\TESTHOSTDEVDB01
\\TESTHOSTDEVDBQA
\\TESTHOSTDEVQA02
\\BTLCMOODY01 MRA Server
\\BTLCSTG05 StG Server
\\BTLCWEB02
\\BTLCWSUS01 Test Update Server
\\HIMSAPP01
\\SLVAPP01
\\TORAAPP01
\\HNSVAPP01
\\TESAPP01
I am curious whether there is a way to use re.findall() to grab all lines that begin with "\\". I only want to capture the hostnames, not the "\\" or the comments after the host such as "MRA Server" (example: BTLCMOODY01).

You can do something like this (no regex needed):
Use str.startswith to check whether a line starts with a backslash:
>>> strs = "\\BTLCMOODY01 MRA Server\n"
>>> strs.startswith('\\')
True
Then use a combination of str.split and str.lstrip to get the first word:
>>> strs.split(None, 1)
['\\BTLCMOODY01', 'MRA Server\n']
#apply str.lstrip on the first item
>>> strs.split(None, 1)[0].lstrip('\\')
'BTLCMOODY01'
Code:
>>> with open('abc1') as f:
...     for line in f:
...         if line.startswith('\\'): #check if the line starts with `\`
...             print line.split(None,1)[0].lstrip('\\')
...
TESTHOSTDEV01
TESTHOSTDEVDB01
TESTHOSTDEVDBQA
TESTHOSTDEVQA02
BTLCMOODY01
BTLCSTG05
BTLCWEB02
BTLCWSUS01
HIMSAPP01
SLVAPP01
TORAAPP01
HNSVAPP01
TESAPP01

An approach using regular expression:
import re
f = open("machinelist.txt", 'r')
lines = f.readlines()
for host in lines:
    hostnames = host.strip()
    if hostnames.startswith('\\'):
        print(re.match(r'\\\\(\S+)',hostnames).group(1))
It yields:
TESTHOSTDEV01
TESTHOSTDEVDB01
TESTHOSTDEVDBQA
TESTHOSTDEVQA02
BTLCMOODY01
BTLCSTG05
BTLCWEB02
BTLCWSUS01
HIMSAPP01
SLVAPP01
TORAAPP01
HNSVAPP01
TESAPP01

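import re
# a backslash followed by the hostname (letters, digits, underscore, hyphen): \HOSTNAME
# (the earlier [a-z]+ followed by whitespace skipped names that contain digits, such as TESTHOSTDEV01)
pattern = re.compile(r"\\([\w-]+)")
with open("file.txt","r") as fh:
    for x in fh:
        match = pattern.search(x)
        if match: print(match.group(1))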
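Since the question asks about re.findall() specifically, here is a sketch that reads the whole text at once and uses re.MULTILINE so that ^ anchors at the start of every line. The sample text is a shortened copy of the question's listing plus one made-up non-host line.

```python
import re

# Raw string, so each \\ below is two literal backslashes, as in the file.
text = r"""\\TESTHOSTDEV01
\\BTLCMOODY01 MRA Server
\\BTLCWSUS01 Test Update Server
not a host line
\\HIMSAPP01"""

# ^ matches at each line start with re.MULTILINE; \\\\ matches the two leading
# backslashes; (\S+) captures the hostname up to the first whitespace.
hostnames = re.findall(r"^\\\\(\S+)", text, flags=re.MULTILINE)
print(hostnames)
```

With a real file, text would come from something like open("machinelist.txt").read().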

Python: Replace tags but preserve inner text V2

I've got a script that does search and replace; it's based on a script here.
I modified it to accept a file as input, but it does not seem to handle regexes well.
The script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, os
import re
import glob
_replacements = {
    '[B]': '**',
    '[/B]': '**',
    '[I]': '//',
    '[/I]': '//',
}
def _do_replace(match):
    return _replacements.get(match.group(0))
def replace_tags(text, _re=re.compile('|'.join((r) for r in _replacements))):
    return _re.sub(_do_replace, text)
def getfilecont(FN):
    if not glob.glob(FN): return -1 # No such file
    text = open(FN, 'rt').read()
    text = replace_tags(text, re.compile('|'.join(re.escape(r) for r in _replacements)))
    return replace_tags(text)
scriptName = os.path.basename(sys.argv[0])
if sys.argv[1:]:
    srcfile = glob.glob(sys.argv[1])[0]
else:
    print """%s: Error you must specify file, to convert forum tages to wiki tags!
Type %s FILENAME """ % (scriptName, scriptName)
    exit(1)
dstfile = os.path.join('.' , os.path.basename(srcfile)+'_wiki.txt')
converted = getfilecont(srcfile)
try:
    open(dstfile, 'wt+').write(converted)
    print 'Done.'
except:
    print 'Error saving file %s' % dstfile
print converted
#print replace_tags("This is an [[example]] sentence. It is [[{{awesome}}]].")
What I want is to replace
'[B]': '**',
'[/B]': '**',
with a single regex rule like this:
\[B\](.*?)\[\/B\] : **\1**
That would be very helpful with BBCode tags like this:
[FONT=Arial]Hello, how are you?[/FONT]
Then I could use something like this:
\[FONT=(.*?)\](.*?)\[\/FONT\] : ''\2''
But I can't seem to do that with this script. The original source of this script shows other ways to do regex search and replace, but they handle one tag at a time using re.sub. Another advantage of this script is that I can add as many lines as I want, so I can update it later.

For starters, you're escaping the patterns on this line:
text = replace_tags(text, re.compile('|'.join(re.escape(r) for r in _replacements)))
re.escape takes a string and escapes it so that, when the new string is used as a regex, it matches exactly the input string and nothing else.
Removing the re.escape won't entirely solve your problem, however, as you find the replacement by looking up the matched text in your dict on this line:
return _replacements.get(match.group(0))
To fix this you could make each pattern into its own capture group:
text = replace_tags(text, re.compile('|'.join('(%s)' % r for r in _replacements)))
You'll also need to know which pattern goes with which substitution. Something like this might work:
_replacements_dict = {
    r'\[B\]': '**',  # keys are regex patterns now, so literal brackets are escaped
    r'\[/B\]': '**',
    r'\[I\]': '//',
    r'\[/I\]': '//',
}
_replacements, _subs = zip(*_replacements_dict.items())
def _do_replace(match):
    # groups() has one entry per pattern; only the alternative that matched is not None
    for i, group in enumerate(match.groups()):
        if group:
            return _subs[i]
Note that this changes _replacements into a tuple of the patterns, and creates a parallel tuple _subs for the actual replacements. (I would have named them regexes and replacements, but didn't want to have to re-edit every occurrence of "_replacements").
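A compact, self-contained variant of this idea is sketched below. It assumes the patterns contain no capture groups of their own, so that match.lastindex identifies which alternative matched. The rules list and the re.I flag are illustrative choices, not part of the original script.

```python
import re

# (pattern, replacement) pairs; the patterns have no inner capture groups.
rules = [
    (r"\[B\]", "**"),
    (r"\[/B\]", "**"),
    (r"\[I\]", "//"),
    (r"\[/I\]", "//"),
]

# Wrap every pattern in its own group: group 1 belongs to rules[0], and so on.
combined = re.compile("|".join("(%s)" % pattern for pattern, _ in rules), re.I)

def replace_tags(text):
    # lastindex is the number of the group that matched (1-based).
    return combined.sub(lambda m: rules[m.lastindex - 1][1], text)

print(replace_tags("[B]bold[/B] and [i]italic[/i]"))
```

Once a pattern brings its own groups, as in \[FONT=(.*?)\], the group numbers shift and this lookup no longer works as written.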

Someone has done it here.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, os
import re
import glob
# keys are regex patterns (raw strings), values are re.sub replacement templates
_replacements_dict = {
    r'\[B\]': '**',
    r'\[\/B\]': '**',
    r'\[I\]': '//',
    r'\[\/I\]': '//',
    r'\[IMG\]' : '{{',
    r'\[\/IMG\]' : '}}',
    r'\[URL=(.*?)\]\s*(.*?)\s*\[\/URL\]' : r'[[\1|\2]]',
    r'\[URL\]\s*(.*?)\s*\[\/URL\]' : r'[[\1]]',
    r'\[FONT=(.*?)\]' : '',
    r'\[color=(.*?)\]' : '',
    r'\[SIZE=(.*?)\]' : '',
    r'\[CENTER]' : '',
    r'\[\/CENTER]' : '',
    r'\[\/FONT\]' : '',
    r'\[\/color\]' : '',
    r'\[\/size\]' : '',
}
_replacements, _subs = zip(*_replacements_dict.items())
def replace_tags(text):
    # apply the rules one at a time, each over the whole text
    for i, _s in enumerate(_replacements):
        tag_re = re.compile(_s, re.I)
        text, n = tag_re.subn(_subs[i], text)
    return text
def getfilecont(FN):
    if not glob.glob(FN): return -1 # No such file
    text = open(FN, 'rt').read()
    return replace_tags(text)
scriptName = os.path.basename(sys.argv[0])
if sys.argv[1:]:
    srcfile = glob.glob(sys.argv[1])[0]
else:
    print """%s: Error you must specify file, to convert forum tages to wiki tags!
Type %s FILENAME """ % (scriptName, scriptName)
    exit(1)
dstfile = os.path.join('.' , os.path.basename(srcfile)+'_wiki.txt')
converted = getfilecont(srcfile)
try:
    open(dstfile, 'wt+').write(converted)
    print 'Done.'
except:
    print 'Error saving file %s' % dstfile
#print converted
#print replace_tags("This is an [[example]] sentence. It is [[{{awesome}}]].")
http://pastie.org/1447448
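The same one-rule-at-a-time idea can be sketched in Python 3 with paired-tag rules of the kind the question asks for, where \1 and \2 in the replacement refer to groups in the pattern. The RULES list, the bbcode_to_wiki name and the sample text are assumptions for illustration. A list is used so the order of application is explicit.

```python
import re

# Ordered (pattern, replacement) rules; \1 and \2 refer to groups in the pattern.
RULES = [
    (r"\[B\](.*?)\[/B\]", r"**\1**"),
    (r"\[I\](.*?)\[/I\]", r"//\1//"),
    (r"\[URL=(.*?)\]\s*(.*?)\s*\[/URL\]", r"[[\1|\2]]"),
    (r"\[FONT=(.*?)\](.*?)\[/FONT\]", r"''\2''"),
]

# re.I makes tags case-insensitive; re.S lets a tag pair span several lines.
COMPILED = [(re.compile(pattern, re.I | re.S), repl) for pattern, repl in RULES]

def bbcode_to_wiki(text):
    # Apply each rule in turn over the whole text.
    for regex, repl in COMPILED:
        text = regex.sub(repl, text)
    return text

sample = "[FONT=Arial][B]Hello[/B], how are [i]you[/i]?[/FONT] [URL=http://example.com] link [/URL]"
print(bbcode_to_wiki(sample))
```

New tags can be supported by appending another tuple to RULES, which keeps the "add as many lines as I want" property the question mentions.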
