Parsing a URL - python

I am writing a simple script that checks whether a website appears on the first page of Google results for a given keyword.
This is the function that parses a URL and returns the host name:
from urllib.parse import urlparse

def parse_url(url):
    url = urlparse(url)
    hostname = url.netloc
    return hostname
and starting from a list of tags selected by:
linkElems = soup.select('.r a') #in google first page the resulting urls have class r
I wrote this:
for link in linkElems:
    l = link.get("href")[7:]
    url = parse_url(l)
    if "www.example.com" == url:
        pass  # do stuff (e.g. store in a list, etc.)
In the second line of that loop I have to slice from index 7, because all the href values start with '/url?q='.
I am learning Python, so I am wondering if there is a better way to do this, or simply an alternative (maybe with a regex, the replace method, or something from the urlparse library).

You can use the Python lxml module for this, which is also an order of magnitude faster than BeautifulSoup.
It can be done something like this:
import requests
from lxml import html
blah_url = "https://www.google.co.in/search?q=blah&oq=blah&aqs=chrome..69i57j0l5.1677j0j4&sourceid=chrome&ie=UTF-8"
r = requests.get(blah_url).content
root = html.fromstring(r)
print(root.xpath('//h3[@class="r"]/a/@href')[0].replace('/url?q=', ''))
print([url.replace('/url?q=', '') for url in root.xpath('//h3[@class="r"]/a/@href')])
This will result in:
http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA
['http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA', 'http://www.dictionary.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggZMAE&usg=AFQjCNE1UVR3krIQHfEuIzHOeL0ZvB5TFQ', 'http://www.dictionary.com/browse/blah-blah-blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggeMAI&usg=AFQjCNFw8eiSqTzOm65PQGIFEoAz0yMUOA', 'https://en.wikipedia.org/wiki/Blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggjMAM&usg=AFQjCNFxEB8mEjEy6H3YFOaF4ZR1n3iusg', 'https://www.merriam-webster.com/dictionary/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggpMAQ&usg=AFQjCNHYXX53LmMF-DOzo67S-XPzlg5eCQ', 'https://en.oxforddictionaries.com/definition/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgguMAU&usg=AFQjCNGlgcUx-BpZe0Hb-39XvmNua2n8UA', 'https://en.wiktionary.org/wiki/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggzMAY&usg=AFQjCNGc9VmmyQls_rOBOR_lMUnt1j3Flg', 'http://dictionary.cambridge.org/dictionary/english/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg5MAc&usg=AFQjCNHJgZR1c6VY_WgFa6Rm-XNbdFJGmA', 'http://www.thesaurus.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg-MAg&usg=AFQjCNEtnpmKxVJqUR7P1ss4VHnt34f4Kg', 'https://www.youtube.com/watch%3Fv%3D3taEuL4EHAg&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQtwIIRTAJ&usg=AFQjCNFnKlMFxHoYAIkl1MCrc_OXjgiClg']
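As a standard-library alternative for the original loop (no slicing by a hard-coded offset), here is a minimal sketch that lets urllib.parse pull the real destination out of Google's '/url?q=' redirect; it assumes linkElems comes from the same soup.select('.r a') call as in the question:
from urllib.parse import urlparse, parse_qs

def target_hostname(href):
    # e.g. '/url?q=http://www.example.com/page&sa=U&...' -> 'www.example.com'
    query = parse_qs(urlparse(href).query)
    if 'q' not in query:
        return None
    return urlparse(query['q'][0]).netloc

for link in linkElems:
    if target_hostname(link.get("href")) == "www.example.com":
        pass  # do stuff (e.g. store in a list)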

Related

list index out of range - beautiful soup

NEW TO PYTHON*** Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". The code was written by someone else; I only changed the URL, and now I get the error. When I print(list_of_documents) it is blank.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is being used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the info in the reportTable is generated by JavaScript after the page loads, so it doesn't show up in the request response. You could either try something like Selenium, or retrieve the data from the underlying source itself.
If you open the network tab while inspecting the site, you'll find the request that actually retrieves the data for the table, and if you inspect the table's HTML you'll find, just above it, the scripts that generate that data.
In the suggested solution below, getReqUrl scrapes your link to get the URL for requesting the report list (and also the base URL for downloading the documents).
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")

    # find the <script> block that defines reportListUrl / reportTypeID
    # and keep only the lines that assign a single quoted value
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]

    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]
    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the exact value of the last query parameter [the &_=... part] from the site, but {int(time.time())} seems to get close enough - the results are the same, and they stay the same even when that last bit is omitted entirely, so it's optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The printed results will list each document's constructed name followed by its download link.
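If the end goal is still to pull down the zip file itself, as in the original script, a minimal sketch of the download step might look like this; it assumes the documents behind downloadLink really are zip archives and reuses the download location from the question:
import io
from zipfile import ZipFile

resp = requests.get(downloadLink)          # downloadLink from the loop above
resp.raise_for_status()
with ZipFile(io.BytesIO(resp.content)) as archive:
    archive.extractall(r"C:\Scripts")      # the downloads folder from the question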

How to Extract/Scrape a specific part of information from WikiData URLs

I have a list of web IDs that I want to scrape from the WikiData website. Here are two links as an example.
https://www.wikidata.org/wiki/Special:EntityData/Q317521.jsonld
https://www.wikidata.org/wiki/Special:EntityData/Q478214.jsonld
I only need the first set of "P31" values from each URL. For the first URL the information I need is "wd:Q5", and for the second URL it is ["wd:Q786820", "wd:Q167037", "wd:Q6881511", "wd:Q4830453", "wd:Q431289", "wd:Q43229", "wd:Q891723"]; I want to store these in a list.
When I use find on the page and search for "P31", I only need the first result out of all the matches.
The output should look like this:
info = ['wd:Q5',
["wd:Q786820", "wd:Q167037", "wd:Q6881511","wd:Q4830453","wd:Q431289","wd:Q43229","wd:Q891723"],
]
lst = ["Q317521","Q478214"]
for q in range(len(lst)):
link =f'https://www.wikidata.org/wiki/Special:EntityData/{q}.jsonld'
page = requests.get(link)
soup = BeautifulSoup(page.text, 'html.parser')
After that, I do not know how to extract the information from the first set of "P31". I am using the requests, BeautifulSoup, and Selenium libraries, but I am wondering whether there are better ways to scrape/extract that information from the URL besides using XPath or a class.
Thank you so much!
You only need requests as you are getting a JSON response.
You can use a function that loops over the relevant nested JSON object and exits at the first occurrence of the target key, appending the associated value to your list.
The loop variable should be the id itself, which goes into the url for the request.
import requests

lst = ["Q317521", "Q478214"]
info = []

def get_first_p31(data):
    # walk the JSON-LD graph and stop at the first node that has a P31 entry
    for i in data['@graph']:
        if 'P31' in i:
            info.append(i['P31'])
            break

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    for q in lst:
        link = f'https://www.wikidata.org/wiki/Special:EntityData/{q}.jsonld'
        try:
            r = s.get(link).json()
            get_first_p31(r)
        except Exception:
            print('failed with link: ', link)
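After the loop, info should roughly match the structure given in the question (a single string for Q317521 and a list for Q478214):
print(info)
# expected, per the question:
# ['wd:Q5', ['wd:Q786820', 'wd:Q167037', 'wd:Q6881511', 'wd:Q4830453', 'wd:Q431289', 'wd:Q43229', 'wd:Q891723']]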

Regular Expression In Order To Identify Tor Domains

I am working on a scraper that goes through HTML code trying to extract Tor domains. However, I am having trouble coming up with a regular expression to match them.
Tor domains are typically in the format of:
http://sitegoeshere.onion
or
https://sitegoeshere.onion
I just want to match URLs contained within a page, in the format http://sitetexthere.onion or https://sitehereitis.onion, inside a bunch of text that may not all be URLs. It should just pull out the URLs.
I am sure there is a good piece of regex that will do this, but I have not been able to find one. If anyone is able to link one or quickly spin one up, that would be much appreciated. Many thanks.
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'
r = session.get('http://facebookcorewwwi.onion')
print(r.text)
The regex.match will return None if the URL isn't matched.
import re
regex = re.compile(r"^https?\:\/\/[\w\-\.]+\.onion")
url = 'https://sitegoes-here.onion'
if regex.match(url):
    print('Valid Tor Domain!')
else:
    print('Invalid Tor Domain!')
For optional http(s):
regex = re.compile(r"^(?:https?\:\/\/)?[\w\-\.]+\.onion")
Regex patterns are mostly standard, so I would recommend this pattern:
'\.onion$'
The backslash escapes the dot, and the '$' character anchors the end of the string. Since all URLs start with 'http(s)://', there is no need to include that in the pattern.
Assuming these are taken from href attributes, you could try an attribute = value CSS selector with the $ (ends with) operator:
from bs4 import BeautifulSoup as bs
import requests
resp = requests.get("https://en.wikipedia.org/wiki/Tor_(anonymity_network)") #example url. Replace with yours.
soup = bs(resp.text,'lxml')
links = [item['href'] for item in soup.select('[href$=".onion"]')]

How to get all urls on a specific google search python

So I am trying to create a program that gets all the URLs on a Google search results page and returns them in a list, in the order they appear on that page. For example, if the top result for "random" is https://www.random.org/, then the first item in the returned list should be "https://www.random.org/", since it is the first link in the page source when you search for random on Google. I am using urllib and the re module because I do not really know how to use Beautiful Soup or lxml, but an answer using Beautiful Soup and/or lxml would also be fine. This is my code so far:
import urllib.request
import re

def find(start, end):
    urls = []
    with open('data.txt', 'r') as myFile:
        pass  # Needs to append every instance of a url between the start and end inputs in data.txt
    # Returns all instances of urls between the start and end parameters in data.txt
    return urls

def parse(query):
    # Creates the url with the query
    url = 'https://www.google.com/search?q=' + query
    # Gets past Google's attempt to block parsing
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    # Fetches data
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    # Saves the source code in a txt file
    saveFile = open('data.txt', 'w')
    saveFile.write(str(respData))
    saveFile.close()
    # Finds the urls and returns them
    newUrl = find('<h3 class="r"><a href="', '"')
    return newUrl

print(parse("random"))
PROBLEM: My problem is making the find() function work. I am not sure how to get the URLs from the source code saved in data.txt (or from the variable respData), and I want to make this efficient, so I was thinking of using regular expressions. However, I am not sure how to extract the URLs based on where each one starts (the class bit, which is the first parameter of find) and where it ends (the closing quote, which is the second parameter).
SIMPLIFIED PROBLEM: Given some text data, how would you create a list of every substring of data that sits between two strings, start and finish? And how would you make this efficient for a large amount of data, and then apply it to the find() function in my original code?
NOTE: I am using Python 3.6.3, so I am not using urllib2 but the standard-library urllib. And if it is going to take a long time to get every URL on a Google search page, the first 10 URLs are fine.
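For the simplified problem, here is a minimal sketch of find() using re.findall, with re.escape so both markers are treated literally; it keeps the original signature and reads data.txt the way the existing code expects:
import re

def find(start, end):
    with open('data.txt', 'r') as myFile:
        data = myFile.read()
    # non-greedy capture of everything between the two literal markers
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, data)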
With beautiful soup, after urlopen
from bs4 import BeautifulSoup
#code snip
resp = urllib.request.urlopen(req)
soup = BeautifulSoup(resp)
for x in soup.findAll('a', {"class": "r"}):
    print(x)
I have not tested this, but that is how you do the search in Beautiful Soup.
On a side note, using regex to parse HTML can be tricky when going it alone. It's better to use Beautiful Soup 4 or Scrapy to handle the parsing for you.

Searching through a website directory, validate, then place URL in a list depending on content

I've been working on a script and thought I would ask for help. I'm looking to search a series of web pages and check whether each one is valid. The next step is to check for specific content on the page; if the page holds that content, the URL should be placed in a list.
import urllib2

National = []
Local = []
Sports = []
Culture = []

def getPage():
    url = "http://readingeagle.com/section.aspx?id=2"
    for i in range(0, 100, 1):
        req = urllib2.Request("http://readingeagle.com/section.aspx?id=" + str(i))
        if "national" in response:
            response = urllib2.urlopen(req)
            return response.read()
    for g in range(0, 100, 1):
        if "national" in response:
            National.append("http://readingeagle.com/section.aspx?id=" + str(g))
    # I would like to set up an iteration to check the entry id from 1-100. If the term is found on the page, place the url in the list.

if __name__ == "__main__":
    namesPage = getPage()
    print(namesPage)
Here's my answer to the question of how to validate a given web site.
python check html valid
For checking the content of the page, the tools range from basic string methods and regex to more sophisticated libraries like lxml or BeautifulSoup.
matchingSites = []
matchingSites.append(url) #Since you asked. :-p
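Here is a hedged sketch of the iteration described in the question's comment, in the same urllib2 style as the question: it checks ids 1 through 100 and appends the URL to National when the page contains "national". The id query parameter and range are taken from the question's URL and comment, so treat the exact pattern as an assumption.
import urllib2

National = []

def collect_national(base="http://readingeagle.com/section.aspx?id="):
    for entry_id in range(1, 101):
        url = base + str(entry_id)
        try:
            page = urllib2.urlopen(urllib2.Request(url)).read()
        except urllib2.URLError:
            continue  # skip ids that do not return a valid page
        if "national" in page:
            National.append(url)
    return National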
