Excluding a string containing character regex [duplicate] - python

I am trying to extract the proper URLs from a string that contains both proper and improper URLs, using regular expressions. The code should return a list of the proper URLs found in the input string. The problem is that I cannot get rid of "https://example{.com": my pattern matches up to the "{" character, so "https://example" ends up in the results.
The code I am checking is below:
import re
text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
print(re.findall('http[s]?[://](?:[a-zA-Z0-9$-_#.&+])+', text))
So is there a good way to get all the matches but excluding matches containing bad characters (like "{")?

It's difficult to know exactly what you need, but this should help. Parsing URLs with regular expressions is hard, and Python comes with a URL parser. The URLs look space-separated, so you could do something like this:
from urllib.parse import urlparse
text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
for token in text.split():
    result = urlparse(token)
    if result.scheme in {'http', 'https'} \
            and result.netloc \
            and all(c == '.' or c.isalpha() for c in result.netloc):
        print(token)
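Split the text into a list of tokens with text.split(), then try to parse each one with urlparse(token). Print the token if the scheme is http or https, the domain (a.k.a. netloc) is non-empty, and every character of the domain is a letter or a dot. Note that str.isalpha() rejects digits and hyphens, so a host such as example42.com would be skipped.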
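If you want a list back instead of printed lines, the same idea can be wrapped in a function. This is only a sketch: the name extract_urls and the allowed host alphabet (ASCII letters, digits, dots and hyphens) are my assumptions, not part of the answer above.

```python
from urllib.parse import urlparse

# Assumption: a host may only contain ASCII letters, digits, dots and hyphens.
ALLOWED_HOST_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789.-")

def extract_urls(text):
    """Return the whitespace-separated tokens that parse as http(s) URLs."""
    urls = []
    for token in text.split():
        parts = urlparse(token)
        host = parts.netloc.lower()
        if parts.scheme in {"http", "https"} and host and set(host) <= ALLOWED_HOST_CHARS:
            urls.append(token)
    return urls

text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
print(extract_urls(text))
# ['http://example.com', 'http://example.hgg.com/da.php?=id42']
```

Unlike the isalpha() check, this version accepts hosts with digits or hyphens, but it still ignores ports and credentials in the netloc.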

In your example, a URL ends with whitespace, so you can use a lookahead to require whitespace (or the end of the string) right after the match: (?=\s|$).
Your regex can be fixed as follows:
print(re.findall(r'https?://(?:[a-zA-Z0-9$-_#.&+])+(?=\s|$)', text))
Note: don't forget to use a raw string (prefixed with r). The character class [:/] matches only a single ':' or '/', so the scheme separator is written out as :// here.
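To see what the lookahead changes, here is a small comparison. The character class [\w.] is a deliberately simplified stand-in for the one in the question, so treat it as an illustration only.

```python
import re

text = "https://example{.com http://example.com"

# Without the lookahead the match simply stops at the first bad character.
without_lookahead = re.findall(r"https?://[\w.]+", text)

# With the lookahead a match only counts if whitespace or the end of the
# string follows it, so the truncated "https://example" is rejected.
with_lookahead = re.findall(r"https?://[\w.]+(?=\s|$)", text)

print(without_lookahead)  # ['https://example', 'http://example.com']
print(with_lookahead)     # ['http://example.com']
```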
You can also improve your RegEx, for instance:
import re
text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
URL_REGEX = r"(?:https://|http://|ftp://|file://|mailto:)[-\w+&@#/%=~_|?!:,.;]+[-\w+&@#/%=~_|](?=\s|$)"
print(re.findall(URL_REGEX, text))
You get:
['http://example.com', 'http://example.hgg.com/da.php?=id42']
For a more thorough regex, take a look at this question: “What is the best regular expression to check if a string is a valid URL?”
One answer there points to this regex for Python:
URL_REGEX = re.compile(
    r'(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)', re.IGNORECASE)
It works like a charm!
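One caveat: without anchors, this pattern can still match a valid-looking prefix of a bad token. A possible way around that, sketched below, is to split on whitespace and require re.fullmatch on each token. The helper name valid_urls is hypothetical, and the pattern is copied from above.

```python
import re

URL_REGEX = re.compile(
    r'(?:http|ftp)s?://'
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'
    r'localhost|'
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
    r'(?::\d+)?'
    r'(?:/?|[/?]\S+)', re.IGNORECASE)

def valid_urls(text):
    """Keep only the whitespace-separated tokens that match the pattern entirely."""
    return [token for token in text.split() if URL_REGEX.fullmatch(token)]

text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
print(valid_urls(text))
# ['http://example.com', 'http://example.hgg.com/da.php?=id42']

# search() alone would accept the valid prefix of a bad token:
print(URL_REGEX.search("http://example.com{bad").group(0))
# http://example.com
```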

Related

Find n characters after a specific string in Python

I have a webpage's source. It's just a ton of random numbers and letters and function names, saved as a string in python3. I want to find the text that says \"followerCount\": in the source code of this string, but I also want to find a little bit of the text that follows it (n characters). This would hopefully have the piece of text I'm looking for. Can I search for a specific part of a string and the n characters that follow it in python3?
Use .find() to get the position:
html = "... lots of html source ..."
position = html.find('"followerCount":')
Then use string slicing to extract that part of the string:
n = 50 # or however many characters you want
print(html[position:position+n])
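Keep in mind that .find() returns -1 when the marker is missing, and slicing from -1 silently returns the wrong text. A small sketch that guards against this and skips past the marker itself (the function name text_after is just for illustration):

```python
def text_after(source, marker, n):
    """Return the n characters that follow marker, or None if marker is absent."""
    start = source.find(marker)
    if start == -1:  # str.find signals "not found" with -1
        return None
    begin = start + len(marker)
    return source[begin:begin + n]

html = '... "followerCount":12345,"other":1 ...'
print(text_after(html, '"followerCount":', 5))  # 12345
print(text_after(html, '"missing":', 5))        # None
```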
A standard way of looking for text based on a pattern is a regex. For example, here you can ask for any three characters following "followerCount":
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'(?<="followerCount":).{3}', s)
if match:
    print(match.group(0))
    # prints '123'
Alternatively you can make a regex without the lookbehind and capture the three characters in a group:
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'"followerCount":(.{3})', s)
if match:
    print(match.group(1))
    # prints '123'
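Since the question asks for n characters rather than exactly three, the count can be built into the pattern. A minimal sketch, assuming the marker should be matched literally (hence re.escape); the function name is made up for this example:

```python
import re

def n_chars_after(marker, n, s):
    """Return exactly n characters following marker, or None if there is no such match."""
    # re.escape keeps quotes, colons and other punctuation in the marker literal.
    pattern = re.escape(marker) + r"(.{%d})" % n
    match = re.search(pattern, s, re.DOTALL)  # DOTALL lets '.' span newlines too
    return match.group(1) if match else None

s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
print(n_chars_after('"followerCount":', 3, s))  # 123
print(n_chars_after('"followerCount":', 8, s))  # 123_more
```

Note that this returns None when fewer than n characters follow the marker, whereas slicing would return a shorter string.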

use dynamic int variable inside regex pattern python

I'm in my early days of learning Python; sorry if this question has already been asked.
I'm writing here because the existing answers didn't help me. My requirement is to read a file and print all the URLs in it. Inside a for loop, the regex pattern I used was [^https://][\w\W]*, and it worked fine. But I wanted to know whether I can dynamically pass the length of the part of the line after https:// and use that number of occurrences instead of *.
I tried [^https://][\w\W]{var}} where var=len(line)-len(https://)
These are some other patterns I tried:
pattern = '[^https://][\w\W]{'+str(int(var))+'}'
pattern = r'[^https://][\w\W]{{}}'.format(var)
pattern = r'[^https://][\w\W]{%s}'%var
I might be misunderstanding your question, but if you know that each URL always starts with https://, then that prefix is the first eight characters. You can slice it off after finding the URLs:
# Example of list containing urls - you should fill that with your for loop
list_urls = ['https://stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://stackoverflow.com']
for url in list_urls:
    print(url[8:])
Output
stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
stackoverflow.com
Instead of a for loop you could find all URLs using re.findall:
import re
# Raw string; the host must contain at least one dot, followed by an optional path
url_pattern = r"((https://)([\w-]+\.)+[\w-]+([\w%/~+#.-]*))"
# text refers to your document, which should be read before this
urls = re.findall(url_pattern, text)
# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))
# print the urls
print(unique_urls)
In your pattern you use [^https://], which is a negated character class: [^...] matches any single character except the ones listed, so it does not match the literal text https://.
One option is to make use of literal string interpolation. Assuming your links do not contain spaces, you could use \S instead of [\w\W] as the latter variant will match any character including spaces and newlines.
\bhttps://\S{{{var}}}(?!\S)
The assertion (?!\S) at the end is a whitespace boundary to prevent partial matches and the word boundary \b will prevent http being part of a larger word.
For example
import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"
var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"
print(re.findall(pattern, lines))
Output
['https://www.test.com', 'https://thisisatestt']
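For completeness, here is how the other formatting styles from the question can produce the same quantifier. The key point is that str.format and f-strings need doubled braces for a literal { or }, which is why {{}} on its own came out as literal braces with no number. The value 12 is just an example.

```python
var = 12

by_concat = r"https://\S{" + str(var) + "}"
by_format = r"https://\S{{{}}}".format(var)   # {{ -> {, {} -> 12, }} -> }
by_percent = r"https://\S{%d}" % var          # % formatting leaves braces alone
by_fstring = rf"https://\S{{{var}}}"

print(by_fstring)  # https://\S{12}
print(by_concat == by_format == by_percent == by_fstring)  # True
```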

A way of stripping everything before and after ' or " (including the quotes themselves)

I am trying to find a means to strip everything before or after either single or double quotes, including the quotes themselves.
For example:
<script src = "https://example.com/file.js"></script>
Result:
https://example.com/file.js
Or:
url = 'https://example.com/service/api'
Result:
https://example.com/service/api
I have tried using .strip and .replace , as well as the re library, but I am grasping in the dark here.
Using an HTML parsing library is no good here, because we don't know in advance in which language the code is. We are searching through lines of text looking for URL's to then send the URL itself to another API. This could be in text files, yaml, json, java, c#, python, ruby, etc.
Rather than trying to remove everything prior and after the target string, you can think of it as extracting the target string and not its surrounding context.
Extract the quoted string using regex match groups:
import re
string = '<script src = "https://example.com/file.js"></script>'
match = re.search("(\".+?\"|'.+?')", string)
target = match.group(1).strip("\"'")
target is equal to https://example.com/file.js.
The regex in re.search() matches either "somestring" or 'somestring'. The contents of the group between parentheses can be extracted using match.group(1). We then remove the quotes on either side using strip().
You might want to use something like
if match:
    target = match.group(1).strip("\"'")
because match will be None if the regex doesn't match anything.
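If a line can hold several quoted strings, or a string quoted one way contains the other kind of quote, a backreference keeps the opening and closing quote in step. This is a sketch rather than a full parser: it does not handle escaped quotes, and the names QUOTED and quoted_strings are made up here.

```python
import re

# Group 1 captures the opening quote; \1 requires the same character to close it.
QUOTED = re.compile(r"""(["'])(.+?)\1""")

def quoted_strings(line):
    """Return the contents of every quoted string on the line, without the quotes."""
    return [match.group(2) for match in QUOTED.finditer(line)]

print(quoted_strings('<script src = "https://example.com/file.js"></script>'))
# ['https://example.com/file.js']
print(quoted_strings("url = 'https://example.com/service/api'"))
# ['https://example.com/service/api']
print(quoted_strings("nothing quoted here"))
# []
```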

Regex URL Help: Word or Phrase

I am an absolute noob at regex (I kind of know the basics) and need help matching a word or a phrase in a URL. If it is a phrase, the words are separated with a hyphen (-):
This is my current regex, which only matches one word:
r'^streams/search/(?P<stream_query>\w+)/$'
The (?P<stream_query>...) named group just allows the URL to take a parameter.
Extra note: I am using python re module with the Django urls.py
Any suggestions?
Here are some examples:
game
gsl
starcraft-2014
final-fantasy-iv
word1-word2-word-3
Updated explanation:
I basically need a regular expression to expand the current one, so inside the same regex, no other one:
r'^streams/search/(?P<stream_query>\w+)/$'
So include the new regex INSIDE this one, where (?P<stream_query>\w+) is any word that Django considers a parameter (and is passed into a function).
URL definition, which includes the regex:
url(r'^streams/search/(?P<stream_query>\w+)/$', 'stream_search', name='stream_search')
Then, Django passes that parameter into the stream_search function, which takes that parameter:
def stream_search(request, stream_query):
    # here I manipulate the stream_query string, e.g. removing the hyphens
    ...
So, once again, I need an re to match a word or phrase, that are passed into the stream_query parameter (or if necessary, a second one).
So, what I want stream_query to have is:
word1
or
word1-word2-word3
If I understand your question correctly then you might not have to use regexes at all.
Based on your example:
example.com/streams/search/rocket-league-fsdfs-fsdfs
It seems that the term you want to deal with is always found after the last /. So you can rsplit and then check for -. Here is an example:
url = "example.com/streams/search/rocket-league-fsdfs-fsdfs"
result = url.rsplit("/", 1)[-1]
# url.rsplit("/", 1) == ["example.com/streams/search", "rocket-league-fsdfs-fsdfs"],
# so result == "rocket-league-fsdfs-fsdfs"
if "-" in result:
    pass  # a phrase: do whatever you want with the string
else:
    pass  # a single word: do whatever you want with the string
Alternatively, a regex that matches either word or word-word-word is: [\w-]+
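Dropped into the pattern from the question, that character class would look roughly like this. The sketch uses plain re instead of Django's url() so it can run on its own; STREAM_URL and parse_stream_query are hypothetical names, and splitting on hyphens is just one possible way to post-process the parameter.

```python
import re

# Same shape as the pattern in the question, with \w+ widened to [\w-]+
# so that hyphen-separated phrases are accepted as well.
STREAM_URL = re.compile(r'^streams/search/(?P<stream_query>[\w-]+)/$')

def parse_stream_query(path):
    """Return the words of the query, or None if the path does not match."""
    match = STREAM_URL.match(path)
    if match is None:
        return None
    return match.group('stream_query').split('-')

print(parse_stream_query('streams/search/game/'))              # ['game']
print(parse_stream_query('streams/search/final-fantasy-iv/'))  # ['final', 'fantasy', 'iv']
print(parse_stream_query('streams/search/two words/'))         # None
```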
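Try this:
import re
url = "http://example.com/something?id=123&action=yes"
# This pattern only matches parameters named query<digits>; the sample URL
# has none, so findall returns []. A pattern such as r"(\w+)=(\w+)" would
# capture every name=value pair instead.
regex = r"(query\d+)=(\w+)"
print(re.findall(regex, url))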
You can also use Python's URL parsing library:
from urllib.parse import urlparse  # Python 3; in Python 2: from urlparse import urlparse
parsed = urlparse("http://example.com/something?id=123&action=yes")
Calling urlparse returns:
ParseResult(scheme='http', netloc='example.com', path='/something', params='', query='id=123&action=yes', fragment='')
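From there, the query component can be turned into a dictionary with parse_qs from the same module. A short sketch using the sample URL from this answer (Python 3 import path):

```python
from urllib.parse import urlparse, parse_qs

parsed = urlparse("http://example.com/something?id=123&action=yes")
params = parse_qs(parsed.query)  # each key maps to a list of values

print(params)               # {'id': ['123'], 'action': ['yes']}
print(params["action"][0])  # yes
```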

Python Get Tags from URL

I have the following URL:
http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf
What is the best way in Python to get mytag from the string ~$AA=mytag&~?
Try this,
>>> import re
>>> str = 'http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf'
>>> m = re.search(r'.*\$AA=([^&]*)\&.*', str)
>>> m.group(1)
'mytag'
$ has a special meaning in a regex (it anchors the end of the string), so you have to escape it to match a literal $. & has no special meaning, so escaping it is harmless but unnecessary.
Use this regex: =(.+)&
import re
regex = r"=(.+)&"
print(re.findall(regex, "http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf")[0])
To retrieve the mytag that comes after $AA=, you can use this simple regex:
(?<=\$AA=)[^&]+
In Python:
match = re.search(r"(?<=\$AA=)[^&]+", subject)
Explanation of the regex:
(?<=      # look behind to see if there is:
  \$      #   '$'
  AA=     #   'AA='
)         # end of look-behind
[^&]+     # any character except '&' (1 or more times,
          # matching as many as possible)
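Put together as a runnable snippet, with the URL from the question standing in for subject and a guard for the no-match case:

```python
import re

subject = 'http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf'

# The lookbehind requires "$AA=" just before the match without including it.
match = re.search(r"(?<=\$AA=)[^&]+", subject)
tag = match.group(0) if match else None

print(tag)  # mytag
```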
I'm just going to throw this one out there to show there are other ways of doing this:
from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 both live in the urlparse module
url = "http://google.com/sadfasdfsd?AA=mytag&SS=sdfsdf"
query = urlparse(url).query  # Extract the query string from the full URL
parsed_query = parse_qs(query)  # Parse the query string into a dict of lists
print(parsed_query["AA"][0])
# mytag
See https://docs.python.org/3/library/urllib.parse.html for documentation on the urllib.parse module (the original Python 2 urlparse module is documented at https://docs.python.org/2/library/urlparse.html).
NB: parse_qs maps each key to a list of values, so we use [0] to get the first one.
Also, I have assumed the question has a typo and have amended the URL (replacing $ with ?) so that it contains a conventional query string.
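If the $ is not a typo and the URL really uses it as a separator, one option is to split on it first and hand the remainder to parse_qs. This is a sketch under that assumption, using the Python 3 import path:

```python
from urllib.parse import parse_qs

url = "http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf"

# partition returns (before, separator, after); separator is '' if '$' is absent.
_, sep, tail = url.partition("$")
params = parse_qs(tail) if sep else {}
tag = params.get("AA", [None])[0]

print(params)  # {'AA': ['mytag'], 'SS': ['sdfsdf']}
print(tag)     # mytag
```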
