Fetching specific data within the quotes using regex in Python

I am trying to fetch the plugin version from a config XML file (its contents are in fread) using the regex below.
However, I get the entire rest of the line instead. I am only interested in the version, i.e. "4.3.0". How can I capture just that?
(Pdb) key
'plugin="git'
(Pdb) re.findall(key+".*",fread)
['plugin="git#4.2.2">\\n <configVersion>2</configVersion>\\n <userRemoteConfigs>\\n <hudson.plugins.git.UserRemoteConfig>\\n

This may not be the most optimal regex, but it does work: Regexr Tester. The link explains how the capture works. After that, just strip the # symbol from the captured string.
This also finds multiple matches if more than one is defined. If you want a more precise check, you can search for git# followed by the version, return all of those, and then strip the leading part.
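A minimal sketch of the capture-group idea in Python, assuming the version always follows plugin="git# and consists only of digits and dots. The fread string here is a made-up fragment modelled on the debugger output in the question, not the real config file.

```python
import re

# Hypothetical excerpt of the config file, modelled on the output in the question.
fread = (
    '<scm class="hudson.plugins.git.GitSCM" plugin="git#4.2.2">\n'
    ' <configVersion>2</configVersion>\n'
)

key = 'plugin="git'

# The parentheses form a capture group, so findall returns only the
# digits-and-dots part instead of the whole matched text.
pattern = re.escape(key) + r'#([\d.]+)"'
versions = re.findall(pattern, fread)
print(versions)
```

Because the group stops at the closing quote, nothing after the version leaks into the result, and no stripping step is needed afterwards.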

Related

RegEx works in regexr but not in python re

I have this regex: If you don't want these messages, please [a-zA-Z0-9öäüÖÄÜ<>\n\-=#;&?_ "/:.#]+settings<\/a>. It works on regexr but not when I am using the re library in Python:
data = "<my text (comes from a file)>"
search = "If you don't want these messages, please [a-zA-Z0-9öäüÖÄÜ<>\n\-=#;&?_ \"/:.#]+settings<\/a>" # this search string comes from a database, so it's not hardcoded into my script
print(re.search(search, data))
Is there something I don't see?
Thank you!
The pattern you are using on regexr contains \- but the one in your example shows \\-, which may give an incorrect regex. Also add the r prefix in front of the string, as jupiterby said.
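A small sketch of why the number of backslashes matters, under the assumption that the string from the database reaches Python with doubled backslashes. The data value and the names good and bad are made up for illustration.

```python
import re

# Hypothetical input text; the real data comes from a file.
data = ('If you don\'t want these messages, please '
        '<a href="https://example.com/prefs?id=1">change your settings</a>.')

# Raw string: every backslash reaches the regex engine exactly as typed.
good = r"""If you don't want these messages, please [a-zA-Z0-9öäüÖÄÜ<>\n\-=#;&?_ "/:.#]+settings</a>"""
match = re.search(good, data)
print(match is not None)

# The same character class with a doubled backslash: "\\" is now a literal
# backslash, and "-=" is read as the end of a range, which is invalid.
bad = r"""[a-zA-Z0-9öäüÖÄÜ<>\n\\-=#;&?_ "/:.#]+"""
try:
    re.compile(bad)
    compiled = True
except re.error as exc:
    compiled = False
    print("invalid pattern:", exc)
```

If the stored pattern cannot be changed, printing repr() of the string that comes back from the database is a quick way to see which of the two forms the script actually receives.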

Issues with extracting URLs from text

I am trying to find a regular expression that extracts any valid URL (not only http[s]). Unfortunately, each one I tried outputs weird things. The best results I achieved were with this regex:
\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
But I can mark at least the following issues:
http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones#enron.com&=refdoc=3D(01-128) is extracted as http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones#enron.com&=
http://www.onlinefilefolder.com',AJAXTHRESHOLD should be extracted without AJAXTHRESHOLD
CSS / HTML styling is extracted, for example xmlns:x="urn:schemas-microsoft-com:xslt, ze:12px;color:#666, font-size:12px;color etc
How can I improve this regex to make sure only valid URLs are extracted? I am extracting them not only from HTML but also from plain text, so using only BeautifulSoup is not an option for my use case.
No regex is perfect, but this one might help you:
(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&##\/%=~_|$?!:,.]*\)|[-A-Z0-9+&##\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&##\/%=~_|$?!:,.]*\)|[A-Z0-9+&##\/%=~_|$])
Flags to enable: case-insensitive, global, multiline (igm)
Source: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/
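A sketch of how that pattern could be used from Python. The pattern is copied unchanged from the answer; the sample text is made up from one of the problem cases in the question. Python has no "g" flag because re.findall already returns every match, and the "m" flag has no effect here since the pattern contains no ^ or $ anchors.

```python
import re

# Pattern from the answer, unchanged. re.IGNORECASE plays the role of "i".
URL_RE = re.compile(
    r"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&##\/%=~_|$?!:,.]*\)|[-A-Z0-9+&##\/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&##\/%=~_|$?!:,.]*\)|[A-Z0-9+&##\/%=~_|$])",
    re.IGNORECASE,
)

# Hypothetical sample text built from one of the problem cases in the question.
text = "see http://www.onlinefilefolder.com',AJAXTHRESHOLD or ftp.example.org/pub, ok"
urls = URL_RE.findall(text)
print(urls)
```

The last character class excludes punctuation such as commas and quotes, which is why the trailing ',AJAXTHRESHOLD part is not picked up.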

Replace text in HTML and BBCode sample

First of all I'd like to say this is my first post on SO, which has been of great help for years to me, so thank you all!
Now onto my question:
I have a string of characters containing unicode text, html tags and bbcode tags (which is obviously extracted from a forum).
Sample:
This is my sample text.
It may contain HTML tags,
[b]BBCode[b],
or even [b][u]both[/u] intricated[/b]!
I have also a list of keywords which may appear in the text described above, and for each of these words I have an associated URL.
Sample:
kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}
As you can see I'm currently using Python because I'm used to the language, but I can be flexible.
My goal is to detect which words in my keyword list are present in the sample text, and to "decorate" the matching words with a link (preferably in bbcode) to the corresponding URL, without altering the rest of the string (just like for Wikis).
Taking further the examples above I'd like to retrieve:
This is my [url=http://www.sample.fr]sample[/url] text.
It may contain HTML tags,
[b][url=http://www.bbcode.sp]BBCode[/url][b],
or even [b][u]both[/u] intricated[/b]!
The main problem here is that sometimes, one of the keywords in my list appears inside a tag, which I do not want to "decorate" with a link for obvious reasons.
In other words, the text I'd like to replace can be located only outside the anchor tags:
**HERE** <not here>[not here] **HERE** [/not here]</not here> **HERE**
Also, I've already tried using BeautifulSoup (along with PostMarkup to convert BBCode to HTML before parsing with BeautifulSoup) but it doesn't allow me to keep the initial string...
Remark: "real" text actually can never be placed between brackets (angle nor squared) due to the general usage of my forum, so this simplifies the problem quite a bit.
I'm sorry for my very long question, I hope everything is clear!
Any help appreciated, thanks to everyone in advance!
Update: Casimir's solution in Python (see below) works just great. Thank you Casimir et Hippolyte!
The approach is always the same: first match what you want to avoid.
Example:
(?s)               # dotall mode
(                  # capture everything you want to avoid
    <!--.*?-->     # html comment
  |
    <[^>]+>        # html tag
  |
    \[[^\]]+\]     # bbcode
)
|                  # OR
kw1|kw2|kw3|...
Then use a function as the replacement. Inside the function, if capture group 1 is defined, return the match unchanged; otherwise return the replacement string for the matched keyword.
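A minimal Python sketch of this technique, using the kw dictionary from the question. The sample text, the function name decorate, and the \b word-boundary anchors around the keywords are additions made for illustration and are not part of the original answer.

```python
import re

kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}

# Alternation of the keywords, escaped in case one contains regex metacharacters.
keywords = "|".join(re.escape(word) for word in kw)

pattern = re.compile(
    r"""
    (                 # group 1: everything that must stay untouched
        <!--.*?-->    #   HTML comment
      | <[^>]+>       #   HTML tag
      | \[[^\]]+\]    #   BBCode tag
    )
    |                 # OR
    \b(?:""" + keywords + r""")\b   # one of the keywords
    """,
    re.DOTALL | re.VERBOSE,
)


def decorate(match):
    # Group 1 matched: this is a comment or tag, so return it unchanged.
    if match.group(1) is not None:
        return match.group(0)
    # Otherwise a keyword matched: wrap it in a BBCode link.
    word = match.group(0)
    return "[url={}]{}[/url]".format(kw[word], word)


# Hypothetical sample text; note the keyword hidden inside an HTML attribute.
text = ('This is my sample text.\n'
        'It may contain <b title="sample">HTML</b> tags,\n'
        '[b]BBCode[/b],\n'
        'or even [b][u]both[/u] intricated[/b]!')
result = pattern.sub(decorate, text)
print(result)
```

Because tags are consumed as whole matches before the keyword alternative is tried, a keyword that appears inside a tag is never seen on its own and stays undecorated.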

regex regarding symbols in urls

I want to collapse consecutive repeated symbols into just one, for example
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
However, I noticed that this might also replace symbols in URLs that occur in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice on how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
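The unmatched-group problem from the last edit can be sidestepped with a replacement function instead of a template string. Since Python 3.5, re.sub also replaces unmatched groups in a template with an empty string, so the template version may work as written on current releases. The sketch below uses the function approach, which does not depend on that behaviour; the name collapse and the sample subject are made up for illustration.

```python
import re

pattern = re.compile(
    r"""
    # either a URL, which is kept unchanged
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
    |
    # or a non-word character repeated one or more times
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """,
    re.IGNORECASE | re.VERBOSE,
)


def collapse(match):
    # Only one of the two named groups takes part in any given match.
    if match.group("URL") is not None:
        return match.group("URL")
    return match.group("rpt")


# Hypothetical input combining both examples from the question.
subject = "this is a dog??? see http://example.com/this--is-a-page.html !!"
result = pattern.sub(collapse, subject)
print(result)
```

The URL alternative comes first, so the double hyphen inside the address is consumed as part of the URL and never reaches the repeated-symbol branch.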

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
http://www.ptop.se
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from bs4 import BeautifulSoup  # BeautifulSoup 4; "from BeautifulSoup import BeautifulSoup" is version 3 only
soup = BeautifulSoup(html_to_parse, 'html.parser')
for tag in soup.find_all('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
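This should work, although there might be more elegant ways.
import re
# sample markup (assumed); the lookarounds need an href="..." attribute to match
url = '<a href="http://www.ptop.se">http://www.ptop.se</a>'
r = re.compile(r'(?<=href=").*?(?=")')
print(r.findall(url))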
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
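As an illustration of the parser route with no third-party dependency, here is a minimal sketch using HTMLParser from the standard library. The class name LinkCollector and the sample markup are made up; the real HTML from the question is not shown in the post.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical markup; the original HTML from the question is not shown.
html_to_parse = '<p>Visit <a href="http://www.ptop.se" target="_blank">ptop</a></p>'
collector = LinkCollector()
collector.feed(html_to_parse)
print(collector.links)
```

The parser deals with quoting styles and attribute order itself, which are the cases that make the regex versions fragile.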
This regex can help you. Get the first group with \1 or whatever method your language provides.
href="([^"]*)
example:
amgheziName
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seem to work are extremely complicated and still don't cover all cases.
This works pretty well: the non-capturing group skips the href= prefix, so only the link itself is captured. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Output:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']
