python Get a link from string - python

I need to use a Python script to take an email, find a link in it, and then open that link so that a request containing the verification link is sent to the server and the account gets verified. How would I use Python to take the
https://www.boomlings.com/database/accounts/activate.php?uid=8722046actcode=xLCReGjLdkWmINt1GY9e
out of
{'Sender': 'Geometry Dash', 'Subject': 'Please activate your account.', 'body': b'<style type="text/css">\n#google_translate_element{\n float: right;\n padding:0 0 10px 10px;\n}\n/* twitter do\xc4\x9frulama linki fix */\n.bulletproof-btn-1 a {\n font-size: 20px!important;\n color: #fff!important;\n padding: 20px!important;\n line-height: 33px!important;\n text-decoration: none!important;\n}\n</style>\n<div id="google_translate_element"></div><script type="text/javascript">\nfunction googleTranslateElementInit() {\n new google.translate.TranslateElement({pageLanguage: \'en\', layout: google.translate.TranslateElement.InlineLayout.SIMPLE, autoDisplay: false, multilanguagePage: true}, \'google_translate_element\');\n}\n</script><script type="text/javascript" src="//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit"></script>\n\r\n\r\n<html>\r\n<head>\r\n\t<title></title>\r\n</head>\r\n<body>\r\n<p>Thank you for registering a Geometry Dash account</p>\r\n\r\n<p>Your account information:<br />\r\nUsername: SUKAFUTCUCK</p>\r\n\r\n<p>Please click the link below to activate your account:<br />\r\nClick\r\nHere</p>\r\n\r\n<p>Please contact support#robtopgames.com if you have any questions or\r\nneed assistance.</p>\r\n\r\n<p>If you did not send an account request using this email, then you\r\ncan safely disregard this message and nothing will happen.</p>\r\n\r\n<p>Regards,<br />\r\nRobTop Games</p>\r\n</body>\r\n</html>\r\n\r\n\r\n'}
The link will be different in different emails so I need something that can do this.
https://www.boomlings.com/database/accounts/activate.php?uid=*actcode=*
Here * means that a string of any length can go there, because each email will contain a different activate.php code.

You can use a regex for that, with something like:
import re
# Capture only the href value, not the leading '<a href="' text.
c = re.search(r'<a href="(.*?)"', yourDict["body"].decode("utf-8"))
print(c.group(1))
It is much better, though, to use a package like parsel, because then you extract from the HTML with XPath rather than with regex.
EDIT
I used the regular expression because it is the shortest and fastest way, with no need to download a package, but if your response changes drastically I recommend parsel instead. Example:
from parsel import Selector
sel = Selector(text=yourDict["body"].decode("utf-8"))
url = sel.xpath('//a[@target="_blank"]/@href').extract_first()

Assuming that dict from your description is now in a variable named d (it was just a bit long to put in here):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(d['body'], 'lxml')
>>> link = soup.find('a', target='_blank')
>>> link['href']
'http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e'
BeautifulSoup docs
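If installing a third-party package is not an option, roughly the same lookup can be sketched with the standard library's html.parser. This is only a sketch under assumptions: the sample body is made up, and it assumes the real email wraps "Click Here" in an <a> tag as the answers above imply (the body pasted in the question has lost that tag). The class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; entities such as &amp;
        # in attribute values are already decoded by the parser.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical stand-in for d['body']; like the real one, it is bytes.
sample_body = (
    b'<p>Please click the link below to activate your account:<br />\r\n'
    b'<a href="http://www.boomlings.com/database/accounts/activate.php'
    b'?uid=123&amp;actcode=abc" target="_blank">Click Here</a></p>'
)

collector = LinkCollector()
collector.feed(sample_body.decode("utf-8"))

# Keep only links that point at the activation script.
activation_links = [u for u in collector.links if "activate.php" in u]
print(activation_links)
```

The resulting URL could then be opened with something like urllib.request.urlopen or requests.get to trigger the activation; that step is left out here because it needs the live server.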

The email could be in HTML or text format.
If it's in HTML format, use a library like bs4, pyquery, etc.
If it's text, use a regex to search for the URL, such as the following:
regex = ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Refer: http://www.ietf.org/rfc/rfc3986.txt
Use re module to search the string as
import re
regex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
urls = re.findall( regex, text )
print(urls)
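One caveat worth illustrating: the RFC 3986 pattern is meant to split a single URI into its components, not to locate URLs inside a longer text. A minimal sketch of that use follows. The group numbers come from Appendix B of the RFC, and the URL assumes the "&" separator that appears in the BeautifulSoup answer's output; the name URI_PARTS is hypothetical.

```python
import re
from urllib.parse import parse_qs

# The pattern from RFC 3986, Appendix B. It breaks ONE URI into parts.
URI_PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

url = ("https://www.boomlings.com/database/accounts/activate.php"
       "?uid=8722046&actcode=xLCReGjLdkWmINt1GY9e")

m = URI_PARTS.match(url)
# Groups 2, 4, 5, 7 and 9 are scheme, authority, path, query and fragment.
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)

# The query string can then be split into its parameters.
params = parse_qs(query)
print(params["uid"][0], params["actcode"][0])
```

For actually finding the link inside an email body, a pattern specific to the activation URL is likely the more practical choice.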
Use pyquery module
from pyquery import PyQuery as pq
q = pq( text )
a_list = q( "a" )
# .items() yields PyQuery objects, so .attr() is available on each anchor
urls = [ a.attr( 'href' ) for a in a_list.items() ]
print(urls)
EDIT:
Instead of a generic URL pattern, we can use one specific to this URL, for example https?:\/\/www\.boomlings\.com\/database\/accounts\/activate\.php\?uid=.*&actcode=.*
https://ideone.com/NFj90L
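A possible refinement of the specific pattern, as a sketch only: ".*" is greedy and can run past the end of the URL, so character classes are used here instead. It assumes uid is all digits and actcode is alphanumeric (which matches the example in the question but is not confirmed), and that an HTML body may encode "&" as "&amp;". The names ACTIVATION_URL and find_activation_url are hypothetical.

```python
import re
from html import unescape

# Character classes instead of ".*" so the match stops where the URL ends.
ACTIVATION_URL = re.compile(
    r"https?://www\.boomlings\.com/database/accounts/activate\.php"
    r"\?uid=\d+&(?:amp;)?actcode=\w+"
)


def find_activation_url(text):
    """Return the first activation URL in text, or None if there is none."""
    match = ACTIVATION_URL.search(text)
    # Undo "&amp;" so the returned URL can be requested directly.
    return unescape(match.group()) if match else None


# Made-up sample text; only the URL shape is taken from the question.
sample = (
    'Click <a href="https://www.boomlings.com/database/accounts/'
    'activate.php?uid=42&amp;actcode=AbC123">Here</a> now'
)
print(find_activation_url(sample))
```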

Related

Trouble parsing HTML with BeautifulSoup or golang colly

FTR I have written quite a few scrapers successfully in both frameworks but I'm stumped. Here is a screenshot of the data I'm trying to scrape (you can also go to the actual link in the get request):
I attempt to target the div.section_content:
import requests
from bs4 import BeautifulSoup
html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html)
soup.findAll("div", {"class": "section_content"})
Printing the last line shows some other divs, but not the one with the pitching data.
However, I can see it's in the text, so it's not a javascript triggered loading problem (the phrase "Pitching" only comes up in that table):
>>> "Pitching" in soup.text
True
Here is an abbreviated version of one of the golang attempts:
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("www.baseball-reference.com"),
	)
	c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
		fmt.Println(e.ChildText("div.section_content"))
	})
	c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
}
It looks to me like the HTML is actually commented out, so that's why BeautifulSoup can't find it. Either remove the comment markers from the HTML string before you parse it or use BeautifulSoup to extract the comments and parse the return value.
For example:
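from bs4 import BeautifulSoup, Comment
# Find every HTML comment node, then parse its contents as HTML again
for element in soup(text=lambda text: isinstance(text, Comment)):
    comment = element.extract()
    comment_soup = BeautifulSoup(comment)
    # work with comment_soup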
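The other option mentioned, removing the comment markers before parsing, could look roughly like the sketch below. To stay dependency-free it uses the standard library's html.parser in place of BeautifulSoup, and the page fragment is made up. The helper names DivClassCounter and count_divs are hypothetical.

```python
import re
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count <div> start tags that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.wanted_class in classes:
            self.count += 1


def count_divs(html, wanted_class):
    parser = DivClassCounter(wanted_class)
    parser.feed(html)
    return parser.count


# Made-up page: the second div only exists inside an HTML comment.
html = """
<div class="section_content">visible</div>
<!--
<div class="section_content">Pitching table hidden in a comment</div>
-->
"""

# Drop the comment delimiters but keep what was between them.
uncommented = re.sub(r"<!--|-->", "", html)

visible_before = count_divs(html, "section_content")
visible_after = count_divs(uncommented, "section_content")
print(visible_before, visible_after)
```

The uncommented string could equally be handed to BeautifulSoup, after which the original findAll call should see the previously hidden div.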

Python 2.7 BeautifulSoup , email scraping

Hope you are all well. I'm new in Python and using python 2.7.
I'm trying to extract only the mailto from this public website business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The emails I'm looking for are the ones listed in every widget, from A to Z, in the full directory. Unfortunately, this directory does not have an API.
I'm using BeautifulSoup, but with no success so far.
Here is my code:
import urllib
from bs4 import BeautifulSoup
website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://'+ website).read()
soup = BeautifulSoup(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
What I get is just the site's own links, like http://www.tecomdirectory.com, along with other hrefs, rather than the mailto links or the websites in the widgets. I also tried replacing soup('a') with soup('target'), but no luck! Can anybody help me please?
You cannot just find every anchor; you need to look specifically for "mailto:" in the href. You can use the CSS selector a[href^="mailto:"], which finds anchor tags whose href starts with mailto: (the value is quoted so the selector stays valid CSS; newer BeautifulSoup releases may reject the unquoted form):
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content)
print([a["href"] for a in soup.select('a[href^="mailto:"]')])
Or extract the text:
print([a.text for a in soup.select('a[href^="mailto:"]')])
Using find_all("a"), you would need a regex to achieve the same:
import re
soup.find_all("a", href=re.compile(r"^mailto:"))
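The hrefs returned this way still carry the mailto: prefix, and sometimes a query such as ?subject=. A small sketch for reducing them to bare addresses with the standard library follows; the sample hrefs are invented, and address_from_mailto is a hypothetical helper name.

```python
from urllib.parse import unquote, urlsplit


def address_from_mailto(href):
    """Return the bare address from a mailto: href, or None for other links."""
    parts = urlsplit(href)
    if parts.scheme != "mailto":
        return None
    # For mailto: URLs the address is in .path; "?subject=..." lands in .query.
    return unquote(parts.path)


# Invented hrefs of the kind the selector above might return.
hrefs = [
    "mailto:info@example.com",
    "mailto:sales%40example.com?subject=Hello",
    "http://www.tecomdirectory.com",
]
emails = [a for a in map(address_from_mailto, hrefs) if a]
print(emails)
```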

Need to extract data from a website and store in list using regex

So I have a task which requires me to extract data from a website to form a 'top 10 list'. I have chosen IMDB top 250 page http://www.imdb.com/chart/top.
In other words I need a little help using regex to isolate the names of the films and then store them in a list. I already have the HTML stored in a variable as a string (if this is the wrong way of approaching it let me know).
Also, I am limited to use of modules urlopen, re and htmlparser
import HTMLParser
from urllib import urlopen
import re
site = urlopen("http://www.imdb.com/chart/top?tt0468569")
content = site.read()
print content
You really shouldn't use regex but you stated in your comment you have to, so here it is with regex:
import re
import requests
respText = requests.get("http://www.imdb.com/chart/top").text
# re.DOTALL lets "." match the newlines inside each table cell
for title in re.findall(r'<td class="titleColumn">.+?>(.+?)<', respText, re.DOTALL):
    print(title)
In BeautifulSoup (which you can't use):
from bs4 import BeautifulSoup
soup = BeautifulSoup(respText, "html.parser")
for item in soup.find_all("td", {"class" : "titleColumn"}):
    print(item.find("a").text)
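To see how the regex behaves without fetching the page, and to get the "top 10 list" the question asks for, here is a small offline sketch. The HTML fragment is invented and only mimics the titleColumn markup that the regex above expects; the live page may differ.

```python
import re

# Made-up fragment shaped like the markup the regex above expects.
sample_html = """
<td class="titleColumn">
  1.
  <a href="/title/tt0000001/" title="Director A">First Film</a>
  <span class="secondaryInfo">(1994)</span>
</td>
<td class="titleColumn">
  2.
  <a href="/title/tt0000002/" title="Director B">Second Film</a>
  <span class="secondaryInfo">(1972)</span>
</td>
"""

# ".+?>" lazily skips to the end of the opening <a ...> tag;
# "(.+?)<" then captures the link text up to the next tag.
# re.DOTALL lets "." cross the line breaks inside each cell.
TITLE = re.compile(r'<td class="titleColumn">.+?>(.+?)<', re.DOTALL)

top_10 = TITLE.findall(sample_html)[:10]  # keep at most the first ten titles
print(top_10)
```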

Python:Getting text from html using Beautifulsoup

I am trying to extract the ranking text from this example link: Kaggle user ranking no. 1.
I am using the following code:
import requests
from bs4 import BeautifulSoup
def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText)
    for item_name in soup.findAll('h4',{'data-bind':"text: rankingText"}):
        print(item_name.string)
item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
The result is None. The problem is that soup.findAll('h4',{'data-bind':"text: rankingText"}) outputs:
[<h4 data-bind="text: rankingText"></h4>]
but in the page's HTML, when inspected in the browser, the element looks like this:
<h4 data-bind="text: rankingText">1st</h4>
It's clear that the text is missing. How can I get around that?
Edit:
Printing the soup variable in the terminal I can see that this value exists:
So there should be a way to access through soup.
Edit 2: I tried unsuccessfully to use the most voted answer from this stackoverflow question. Could be a solution around there.
If you aren't going to try browser automation through Selenium as @Ali suggested, you would have to parse the JavaScript containing the desired information. You can do this in different ways. Here is working code that locates the script by a regular expression pattern, then extracts the profile object, loads it with json into a Python dictionary and prints out the desired ranking:
import re
import json
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
profile_text = pattern.search(script.text).group(1)
profile = json.loads(profile_text)
print profile["ranking"], profile["rankingText"]
Prints:
1 1st
The data is databound using javascript, as the "data-bind" attribute suggests.
However, if you download the page with e.g. wget, you'll see that the rankingText value is actually there inside this script element on initial load:
<script type="text/javascript">
profile: {
...
"ranking": 96,
"rankingText": "96th",
"highestRanking": 3,
"highestRankingText": "3rd",
...
So you could use that instead.
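One way to read that embedded object without guessing where it ends is json.JSONDecoder.raw_decode, which parses a single JSON value starting at a given index and ignores whatever follows. The sketch below runs on an invented page source; only the "profile: {...}" shape follows the snippet above, and the real page may differ. extract_profile and Site.init are hypothetical names.

```python
import json

# Made-up page source; only the "profile: {...}" shape follows the answer.
page_source = """
<script type="text/javascript">
  Site.init({
    profile: {"userName": "someone", "ranking": 96, "rankingText": "96th"},
    other: {"flag": true}
  });
</script>
"""


def extract_profile(source):
    """Decode the JSON object that follows the first 'profile:' marker."""
    marker = source.index("profile:")
    start = source.index("{", marker)
    # raw_decode parses one JSON value beginning at `start` and returns it
    # together with the index where that value ends.
    profile, _end = json.JSONDecoder().raw_decode(source, start)
    return profile


profile = extract_profile(page_source)
print(profile["ranking"], profile["rankingText"])
```

Compared with a greedy regex, this avoids capturing text beyond the object's closing brace when other objects follow it in the script.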
I have solved your problem using regex on the plain text:
import re
import requests
def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    #soup = BeautifulSoup(plainText, "html.parser")
    # Matches e.g. ranking": 96 in the embedded profile data
    pattern = re.compile("ranking\": [0-9]+")
    name = pattern.search(plainText)
    ranking = name.group().split()[1]
    print(ranking)
item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
This returns only the rank number, but I think it will help you, since from what I can see rankingText just adds 'st', 'th', etc. to the right of the number.
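If the text form is needed as well, the suffix can be rebuilt from the number. A minimal sketch, assuming the site uses ordinary English ordinals (the function name ordinal is hypothetical):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


print([ordinal(n) for n in (1, 2, 3, 11, 96)])
```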
This could be because the data is filled in dynamically.
Some JavaScript code fills this tag after the page loads, so if you fetch the HTML using requests it is not filled in yet.
<h4 data-bind="text: rankingText"></h4>
Please take a look at Selenium WebDriver. With it you can fetch the complete page with the JavaScript executed as in a normal browser.

scraping HTML in Python

I'm trying to find a series of URLs (twitter links) from the source of a page and then put them into a list in a text document. The problem I have is that once I .readlines() the urlopen object, I have a grand total of 3-4 lines each consisting of dozens of urls that I need to collect one-by-one. This is the snippet of my code where I try to rectify this:
page = html.readlines()
for line in page:
    ind_start = line.find('twitter')
    ind_end = line.find('</a>', ind_start+1)
    while ('twitter' in line[ind_start:ind_end]):
        output.write(line[ind_start:ind_end] + "\n")
        ind_start = line.find('twitter', ind_start)
        ind_end = line.find('</a>', ind_start + 1)
Unfortunately I can't extract any urls using this. Any advice?
You can extract links using lxml and an XPath expression:
from lxml.html import parse
p = parse('http://domain.tld/path')
for link in p.xpath('.//a/@href'):
    if "twitter" in link:
        print link, "match 'twitter'"
Using regex here is not the best way: parsing HTML is a solved problem in 2013. See RegEx match open tags except XHTML self-contained tags
You could use the BeautifulSoup module:
from bs4 import BeautifulSoup
soup = BeautifulSoup('your html')
elements = soup.findAll('a')
for el in elements:
    print el['href']
If not - just use regexp:
import re
expression = re.compile(r'http:\/\/*')
m = expression.search('your string')
if m:
    print 'match found!'
This would match also the urls within <img /> tags, but you can tweak my solution easily to only find urls within <a /> tags
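As a sketch of that tweak, the pattern below only looks at href attributes inside <a> tags and only keeps twitter.com URLs, so many links on one long line are collected in a single pass. The sample line is invented, the pattern assumes double-quoted href values, and TWITTER_LINK is a hypothetical name.

```python
import re

# Made-up single line with several anchors, like the page described above.
line = (
    '<a href="http://twitter.com/alice">alice</a>'
    '<img src="http://twitter.com/logo.png" />'
    '<a class="u" href="https://twitter.com/bob">bob</a>'
    '<a href="http://example.com/">other</a>'
)

# Only href attributes inside <a ...> tags, and only twitter.com URLs.
TWITTER_LINK = re.compile(
    r'<a\b[^>]*\bhref="(https?://(?:www\.)?twitter\.com/[^"]*)"',
    re.IGNORECASE,
)

links = TWITTER_LINK.findall(line)
print(links)

# One URL per line, ready to be written to a text file.
output_text = "\n".join(links) + "\n"
```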
