Extract URLs, including fragments, from a string using Python regex

OK, I know people are going to say this question has been asked a million times, but my question is different. I have searched Stack Overflow many times to make sure this is not a duplicate.
I want a regex in Python that extracts a URL from a string, including its fragment.
What I have done so far is:
import re
test = 'This is a string with my URL as follows http://www.example.org/foo.html#bar and here i continue with my string'
test = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', test)
print(test)
The output I get for the above code is ['http://www.example.org/foo.html']
which is not what I want.
I want the output to be ['http://www.example.org/foo.html#bar']

Your original regex is this:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
Couldn't you just add '#', like this?
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),#]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
I am unclear as to what you mean by 'fragments'. Do you mean everything up to the next space in the string?
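A quick check of the suggestion above, as a sketch rather than a definitive URL matcher: '#' is character 0x23, one position below the $-_ range (0x24 to 0x5F), so it is only matched when it is listed explicitly. The sketch assumes the first character class reads [$-_@.&+], the commonly circulated form of this pattern; the variable names are illustrative.

```python
import re

text = ("This is a string with my URL as follows "
        "http://www.example.org/foo.html#bar and here i continue with my string")

# Assumed starting pattern: '#' is not in any character class.
without_hash = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
# Same pattern with '#' added to the second class, as the answer suggests.
with_hash = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),#]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

plain_matches = re.findall(without_hash, text)
fragment_matches = re.findall(with_hash, text)

print(plain_matches)     # match stops right before '#'
print(fragment_matches)  # match now runs up to the next space
```

The match still ends at the first whitespace character, so a URL followed directly by punctuation such as a comma would keep that punctuation.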

Related

Extract values in name=value lines with regex

I'm sorry for asking, because there are similar questions around, but I can't adapt their answers to my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign with a Python 3.9 regex.
I tried this on regex101:
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected results (what Python's re.findall() returns) are
['share2', 'share8', 'shareSSH', 'share9']

Try the pattern profile\d+\.name=(.*); see the Regex 101 example.
import re
# txt holds the input lines from the question
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex; split should work fine.
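The split-based approach mentioned above might look roughly like this. It is only a sketch: txt is assumed to hold the config lines from the question as one string, and values is an illustrative name.

```python
txt = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9"""

# Split each line once on '=' and keep the right-hand side.
# maxsplit=1 keeps values that themselves contain '=' intact.
values = [line.split("=", 1)[1] for line in txt.splitlines() if "=" in line]

print(values)
```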

Try removing the ? quantifier. With it, the lazy capture group matches an empty string.
regex101
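To see why the lazy group comes back empty, here is a small comparison. It assumes the pattern is applied with the MULTILINE flag (as a regex101 test against several lines typically is); the variable names are illustrative.

```python
import re

txt = "profile2.name=share2\nprofile8.name=share8\nprofile4.name=shareSSH\nprofile9.name=share9"

# Lazy group with nothing after it: it is satisfied by matching zero characters.
lazy = re.findall(r"^profile[0-9]\.name=(.*?)", txt, flags=re.M)

# Greedy group: runs to the end of the line ('.' does not match a newline).
greedy = re.findall(r"^profile[0-9]\.name=(.*)", txt, flags=re.M)

# A lazy group also works once something after it forces it to expand.
anchored_lazy = re.findall(r"^profile[0-9]\.name=(.*?)$", txt, flags=re.M)

print(lazy)
print(greedy)
print(anchored_lazy)
```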

How to get python to search for whole numbers in a string-not just digits

Okay please do not close this and send me to a similar question because I have been looking for hours at similar questions with no luck.
Python can search for digits using re.search([0-9])
However, I want to search for any whole number. It could be 547 or 2 or 16589425. I don't know how many digits there are going to be in each whole number.
Furthermore I need it to specifically find and match numbers that are going to take a form similar to this: 1005.2.15 or 100.25.1 or 5.5.72 or 1102.170.24 etc.
It may be that there isn't a way to do this using re.search but any info on what identifier I could use would be amazing.

Just use
import re
your_string = 'this is 125.156.56.531 and this is 0540505050.5 !'
result = re.findall(r'\d[\d\.]*', your_string)
print(result)
output
['125.156.56.531', '0540505050.5']

Assuming that you're looking for whole numbers only, try re.search(r"[0-9]+", your_string)
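For the dotted form from the question (1005.2.15, 100.25.1 and so on), a stricter pattern is one option. This is a sketch that assumes exactly three dot-separated groups of digits; the sample text and variable names are made up for illustration.

```python
import re

text = "Builds 1005.2.15, 100.25.1 and 5.5.72 shipped; 1102.170.24 is next, ticket 42 is unrelated."

# \d+ matches a run of one or more digits, i.e. a whole number of any length.
whole_numbers = re.findall(r"\d+", text)

# Three digit runs joined by literal dots; \b keeps the match from starting
# or ending in the middle of a longer run of word characters.
dotted = re.findall(r"\b\d+\.\d+\.\d+\b", text)

print(whole_numbers)
print(dotted)
```

Unlike \d[\d\.]*, this does not pick up plain numbers such as 42 or a number followed by a sentence-ending dot.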

find substrings and replace them but get their information [python]

I want to do something like this to a text (This is just an example to show the problem):
new_text = re.sub(r'\[(?P<index>[0-9]+)\]',
'(Found pattern the ' + index + ' time', text)
Where text is my original text. I want to find any substring like [3] or [454]. That isn't the hard part; the hard part is getting at the number inside. I want to pass the number to a method called add_link(number), which expects a number (instead of the string I'm building with "Found pattern...", which is just an example). (The method looks up links that are stored in a database under matching IDs.)
Python tells me it doesn't know the local variable index. How can I make it known?
Edit: I have been told I didn't ask clearly. (I already have an answer, but maybe someone will read this in the future.) The question was how to get the text matched by [0-9]+ into a local variable. I guessed it would be something like (?P<index>[0-9]+), and it was.
Thanks in advance, Asqiir

You can reference a named group in the replacement string with the syntax \g<name>. Use a raw string so the backslash reaches the regex engine unchanged. So your code should be written as:
new_text = re.sub(r'\[(?P<index>[0-9]+)\]', r'(Found pattern the \g<index> time', text)
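If the number has to be passed to a function such as add_link(number), re.sub also accepts a callable as the replacement; it is called with each match object. A minimal sketch, where add_link is a hypothetical stand-in for the asker's database-backed method and replace_reference is an illustrative name:

```python
import re

def add_link(number):
    # Hypothetical stand-in: the real add_link() would look the ID up in a database.
    return f"<link #{number}>"

def replace_reference(match):
    # The named group is available on the match object; convert it to int
    # because add_link() is described as expecting a number.
    return add_link(int(match.group("index")))

text = "See [3] and also [454]."
new_text = re.sub(r"\[(?P<index>[0-9]+)\]", replace_reference, text)

print(new_text)
```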

Finding a random sentence in HTML with python regex

I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1"
Essentially, I need it to pull whatever sentence is between <br> tags.
I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences, or single words.
This, however, needs to pull whatever arbitrary string is between <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: I ended up going with a different approach altogether: simply splitting the HTML into a list separated by <br> and pulling [3], which made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.

You need to use the DOTALL flag, as there are newlines in the text that you need to match. I would use
re.findall(r'<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> on that page. You may want to use the more specific:
re.findall(r'<hr><br>(.*?)<br><hr>', html, re.S)
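The effect of the flag can be checked offline. In this sketch the html string is a made-up sample modeled on the page layout described in this thread, not the real response, and the variable names are illustrative.

```python
import re

# Hypothetical sample: the quote sits between <hr><br> and <br><hr>, spread over several lines.
html = (
    "<body><br><br>\n<hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr>\n</body>"
)

# Without re.S, '.' stops at newlines, so the multi-line quote is never matched.
without_flag = re.findall(r"<hr><br>(.*?)<br><hr>", html)

# With re.S (DOTALL), '.' also matches newlines.
with_flag = re.findall(r"<hr><br>(.*?)<br><hr>", html, re.S)

# Collapse all runs of whitespace (newlines, tabs) into single spaces.
quote = " ".join(with_flag[0].split())

print(without_flag)
print(quote)
```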

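from urllib.request import urlopen  # Python 3; Python 2 used "from urllib import urlopen"
import re
# Decode the response so a str pattern can be applied (assumes UTF-8-compatible text)
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read().decode("utf-8", errors="replace")
output = re.findall(r'<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if output:
    print(output)
    # Put the quote on one line: newlines become spaces, tabs are dropped
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)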
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace all those inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so that the original line breaks are preserved visually.

All quotes on that page follow the same layout, with no ambiguity, so you can use this:
output = re.findall(r'(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag, because there's no dot in the pattern.

This is, uh, 7 years later, but for future reference:
Use the Beautiful Soup library for this kind of purpose, as suggested by Floris in the comments.

Regular expression for string between two strings?

Sorry, I know this is probably a duplicate but having searched for 'python regular expression match between' I haven't found anything that answers my question!
The document (which to make clear, is a long HTML page) I'm searching has a whole bunch of strings in it (inside a JavaScript function) that look like this:
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};
I want to extract the links (i.e. everything between quotes within these strings) - e.g. /Hidden/SidebySideYellow/dei1=1204970159862
To get the links, I know I need to start with:
re.findall(regexp, doc_string)
But what should regexp be?

The answer to your question depends on what the rest of the string looks like. If the lines all have the form link: '<URL>'}; then you can do it very simply with plain string slicing:
myString = "link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
print( myString[7:-3] )
(If you have one string containing several such lines, you can split it into lines first.)
If it is a bit more complex, though, using regular expressions is fine. One example that just looks for the URL inside the quotes would be:
import re
myDoc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""
print( re.findall( "'([^']+)'", myDoc ) )
Depending on how the whole string looks, you might have to include the link: as well:
print( re.findall( "link: '([^']+)'", myDoc ) )

I'd start with:
regexp = "'([^']+)'"
And check whether it works. If the only condition is that the string sits on one line between single quotes, it should be good as it is.
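Whether the bare quote pattern is enough depends on what else is in the page. The sketch below uses a made-up document with one extra quoted string to show the difference between the two patterns suggested in this thread; doc and the result names are illustrative.

```python
import re

# Hypothetical page excerpt: one unrelated quoted string plus the two link lines.
doc = """var title = 'Side by side';
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

# Any single-quoted string, including the unrelated title.
any_quoted = re.findall(r"'([^']+)'", doc)

# Only quoted strings that directly follow "link:".
links_only = re.findall(r"link:\s*'([^']+)'", doc)

print(any_quoted)
print(links_only)
```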

Use a few simple splits
>>> s = "link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
>>> s.split("'")
['link: ', '/Hidden/SidebySideGreen/dei1=1204970159862', '};']
>>> for i in s.split("'"):
...     if "/" in i:
...         print(i)
...
/Hidden/SidebySideGreen/dei1=1204970159862
>>>
