I would like to retrieve the URLs of a web page recursively and collect the result in a list.
This is the code I'm using:
catalog_url = "http://nomads.ncep.noaa.gov:9090/dods/gfs_0p25/"
from bs4 import BeautifulSoup # conda install -c asmeurer beautiful-soup=4.3.2
import urllib2
from datetime import datetime
html_page = urllib2.urlopen(catalog_url)
soup = BeautifulSoup(html_page)
urls_day = []
for link in soup.findAll('a'):
    if datetime.today().strftime('%Y') in link.get('href'): # String contains today's year in name
        print link.get('href')
        urls_day.append(link.get('href'))

urls_final = []
for run in urls_day:
    html_page2 = urllib2.urlopen(run)
    soup2 = BeautifulSoup(html_page2)
    for links in soup2.findAll('a'):
        if datetime.today().strftime('%Y') in soup2.get('a'):
            print links.get('href')
            urls_final.append(links.get('href'))
In the first loop I get the URLs in catalog_url. urls_day is a list with the URLs that contain the current year in their name.
The second loop fails with the following output:
GrADS Data Server
Traceback (most recent call last):
File "<stdin>", line 6, in <module>
TypeError: argument of type 'NoneType' is not iterable
urls_final should be the list containing the URLs of interest.
Any idea how to solve this? I've checked similar posts about BeautifulSoup and recursion, but I always get the same 'NoneType' error.
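For reference, the TypeError in the second loop most likely comes from testing `in` against None: soup2.get('a') looks up a tag attribute named "a" on the soup (not the anchor tags) and returns None, and link.get('href') can also be None for anchors without an href. A minimal guard over the original inner loop, keeping everything else unchanged, might look like this (a sketch, not the accepted fix below):

for links in soup2.findAll('a'):
    href = links.get('href')
    if href and datetime.today().strftime('%Y') in href:  # skip anchors without an href
        print href
        urls_final.append(href)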
You should check whether the returned value is None before calling the recursive function. I wrote an example which you can improve upon.
from bs4 import BeautifulSoup
from datetime import datetime
import urllib2

CATALOG_URL = "http://nomads.ncep.noaa.gov:9090/dods/gfs_0p25/"
today = datetime.today().strftime('%Y')
cache = {}

def cached(func):
    def wraps(url):
        if url not in cache:
            cache[url] = True
            return func(url)
    return wraps

@cached
def links_from_url(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "lxml")
    s = set([link.get('href') for link in soup.findAll('a') if today in link.get('href')])
    return s if len(s) else url

def crawl(links):
    if not links:  # Checking for NoneType
        return
    if type(links) is str:
        return links
    if len(links) > 1:
        return [crawl(links_from_url(link)) for link in links]

if __name__ == '__main__':
    crawl(links_from_url(CATALOG_URL))
    print cache.keys()
I am attempting to use BeautifulSoup to look through and request each URL in a txt file. So far I am able to scrape the first link for what I seek; progressing to the next URL, I hit an error.
This is the error I keep getting:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
from bs4 import BeautifulSoup as bs
import requests
import constants as c

file = open(c.fvtxt)
read = file.readlines()
res = []
DOMAIN = c.vatican_domain
pdf = []

def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')

for link in read:
    bs = get_soup(link)
    res.append(bs)
    soup = bs.find('div', {'class': 'headerpdf'})
    pdff = soup.find('a')
    li = pdff.get('href')
    surl = f"{DOMAIN}{li}"
    pdf.append(f"{surl}\n")

print(pdf)
Your variable name is what confuses the Python interpreter: you cannot use the same name for a function and a variable at the same time, in your case 'bs'.
It should work fine if you rename the variable bs to parsed_text or anything else but bs.
for link in read:
    parsed_text = get_soup(link)
    res.append(parsed_text)
    soup = parsed_text.find('div', {'class': 'headerpdf'})
    pdff = soup.find('a')
    li = pdff.get('href')
    print(li)
    surl = f"{DOMAIN}{li}"
    pdf.append(f"{surl}\n")

print(pdf)
I am trying to scrape the info from the election results in 18 NI constituencies here:
http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results
Each of the unique URLs starts like this:
http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/
The selector for the 18 URLs is as follows:
#container > div.two-column-content.clearfix > div > div.right-column.cms > div > ul > li
What I want to start with is a list of the 18 URLs. This list should be clean (i.e. just the actual addresses, no tags, etc.).
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver
url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
response = requests.get(url)
response.status_code
text = requests.get(url).text
soup = BeautifulSoup(text, parser="html5lib")
link_list = []
for a in soup('a'):
    if a.has_attr('href'):
        link_list.append(a)
re_pattern = r"^/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/"
This is where I get lost, as I need to search for all 18 URLs that start with that pattern (the pattern is wrong, I am pretty sure. Please help!)
The rest of the code:
import re
good_urls = [url for url in link_list if re.match(re_pattern, url)]
here I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-f3fbbd3199b1> in <module>
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]
<ipython-input-36-f3fbbd3199b1> in <listcomp>(.0)
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]
~/opt/anaconda3/lib/python3.7/re.py in match(pattern, string, flags)
173 """Try to apply the pattern at the start of the string, returning
174 a Match object, or None if no match was found."""
--> 175 return _compile(pattern, flags).match(string)
176
177 def fullmatch(pattern, string, flags=0):
TypeError: expected string or bytes-like object
What should I type differently to get those 18 URLs? Thank you!
This seems to do the job.
I've removed some unnecessary imports and stuff that's not needed here; just re-add them if you need them elsewhere, of course.
The error message was due to trying to do a regex comparison on a soup object; it needs to be cast to string first (the same problem as discussed in the link @Huzefa posted, so that was definitely relevant).
Fixing that still left the issue of isolating the correct strings. I've simplified the regex for matching, then split the string on the double-quote character and selected the second item resulting from the split (which is our URL).
import requests
from bs4 import BeautifulSoup
import re
url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
response = requests.get(url)
text = requests.get(url).text
soup = BeautifulSoup(text, "html.parser")
re_pattern = "<a href=\".*/Elections-2019/.*"
link_list = []
for a in soup('a'):
    if a.has_attr('href') and re.match(re_pattern, str(a)):
        link_list.append(str(a).split('"')[1])
Hope it fits your purpose, ask if anything is unclear.
This is my code:
https://pastebin.com/R11qiTF4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as req
from urllib.parse import urljoin
import re

urls = ["https://www.helios-gesundheit.de"]
domain_list = ["https://www.helios-gesundheit.de/kliniken/schwerin/"]
prohibited = ["info", "news"]
text_keywords = ["Helios", "Helios"]
url_list = []
desired = "https://www.helios-gesundheit.de/kliniken/schwerin/unser-angebot/unsere-fachbereiche-klinikum/allgemein-und-viszeralchirurgie/team-allgemein-und-viszeralchirurgie/"

for x in range(len(domain_list)):
    url_list.append(urls[x] + domain_list[x].replace(urls[x], ""))
print(url_list)

def prohibitedChecker(prohibited_list, string):
    for x in prohibited_list:
        if x in string:
            return True
        else:
            return False
            break

def parseHTML(url):
    requestHTML = req(url)
    htmlPage = requestHTML.read()
    requestHTML.close()
    parsedHTML = soup(htmlPage, "html.parser")
    return parsedHTML

searched_word = "Helios"

for url in url_list:
    parsedHTML = parseHTML(url)
    href_crawler = parsedHTML.find_all("a", href=True)
    for href in href_crawler:
        crawled_url = urljoin(url, href.get("href"))
        print(crawled_url)
        if "www" not in crawled_url:
            continue
        parsedHTML = parseHTML(crawled_url)
        results = parsedHTML.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)
        for single_result in results:
            keyword_text_check = prohibitedChecker(text_keywords, single_result.string)
            if keyword_text_check != True:
                continue
            print(single_result.string)
I'm trying to print the contents of the ''desired'' variable. The problem is the following: my code never even requests the URL in ''desired'' because it's not within the scope of the pages being scraped. The ''desired'' href is inside another href that's inside the page I'm currently scraping. I thought I'd fix this by adding another for loop inside the loop at line 39 that requests every href found by the first one, but this is too messy and inefficient.
Is there a way to get a list of every directory of a website URL?
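One way to build such a list is a small breadth-first crawler that keeps a queue of pages to visit and only follows links that stay under a fixed prefix. This is only a sketch, not code from the thread; the helper name crawl_directory, the max_pages cap, and using the /kliniken/schwerin/ prefix as the crawl scope are assumptions for illustration:

from collections import deque
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawl_directory(start_url, prefix, max_pages=200):
    """Breadth-first crawl: collect every page under `prefix` reachable from `start_url`."""
    seen = set()
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urlopen(url).read()
        except Exception:
            continue  # skip pages that fail to load
        soup = BeautifulSoup(page, "html.parser")
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))  # absolute URL, fragment stripped
            if link.startswith(prefix) and link not in seen:
                queue.append(link)
    return sorted(seen)

# Hypothetical usage with the URLs from the question:
pages = crawl_directory(
    "https://www.helios-gesundheit.de/kliniken/schwerin/",
    "https://www.helios-gesundheit.de/kliniken/schwerin/",
)
print(len(pages))

Assuming the ''desired'' page is linked (directly or indirectly) from pages under that prefix, it should eventually show up in the returned list once the crawl reaches the page that links to it.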
I am new to Python and web scraping. I am trying to scrape a website (link is the URL). I am getting the error "'NoneType' object is not iterable" on the last line of the code below. Could anyone point out what could have gone wrong?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://labtestsonline.org/tests-index'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# Function to get hyper-links for all test components
hyperlinks = []
def parseUrl(url):
    global hyperlinks
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    for a in soup.findAll('div', {'class': 'field-content'}):
        a = a.find('a')
        href = urljoin(url, a.get('href'))
        hyperlinks.append(href)
parseUrl(url)

# Function to get header and common questions for each test component
def header(url):
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    h = []
    commonquestions = []
    for head in soup.find('div', {'class': 'field-item'}).find('h1'):
        heading = head.get_text()
        h.append(heading)
    for q in soup.find('div', {'id': 'Common_Questions'}):
        questions = q.get_text()
        commonquestions.append(questions)

for i in range(0, len(hyperlinks)):
    header(hyperlinks[i])
Below is the traceback error:
<ipython-input-50-d99e0af6db20> in <module>()
1 for i in range(0, len(hyperlinks)):
2 header(hyperlinks[i])
<ipython-input-49-15ac15f9071e> in header(url)
5 soup = BeautifulSoup(page, 'lxml')
6 h = []
for head in soup.find('div',{'class':'field-item'}).find('h1'):
heading = head.get_text()
h.append(heading)
TypeError: 'NoneType' object is not iterable
soup.find('div',{'class':'field-item'}).find('h1') is returning None. First check whether the lookup returns anything before looping over it.
Something like:
heads = soup.find('div',{'class':'field-item'}).find('h1')
if heads:
    for head in heads:
        # remaining code
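A fuller sketch of the same idea, guarding both lookups (note that if the outer div is missing, the chained .find('h1') would itself raise an AttributeError rather than return None). The selectors are taken from the question; returning the two lists is an assumption for illustration:

def header(url):
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    h = []
    commonquestions = []
    field_item = soup.find('div', {'class': 'field-item'})
    heading = field_item.find('h1') if field_item else None  # guard the outer lookup too
    if heading:
        h.append(heading.get_text())
    questions = soup.find('div', {'id': 'Common_Questions'})
    if questions:
        commonquestions.append(questions.get_text())
    return h, commonquestions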
Try this. It should solve the issues you are having at this moment. I used a CSS selector to get the job done.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

link = 'https://labtestsonline.org/tests-index'
page = requests.get(link)
soup = BeautifulSoup(page.content, 'lxml')

for a in soup.select('.field-content a'):
    new_link = urljoin(link, a.get('href'))  # joining broken urls so as to reuse these
    response = requests.get(new_link)  # sending another http request
    sauce = BeautifulSoup(response.text, 'lxml')
    for item in sauce.select("#Common_Questions .field-item"):
        print(item.text)
    print("<<<<<<<<<>>>>>>>>>>>")
I'm working through a scraping task in Python using BeautifulSoup and am getting some strange errors. The traceback mentions strip, which I'm not using, so I'm guessing it might be related to BeautifulSoup's internals?
In the task I'm trying to go to the original URL, find the 18th link, click that link 7 times, and then return the name result for the 18th link on the 7th page. I'm trying to use a function to get the href from the 18th link, then adjust the global variable to recurse through with a different URL each time. Any advice on what I'm missing would be really helpful. I'll list the code and errors:
from bs4 import BeautifulSoup
import urllib
import re

nameList = []
urlToUse = "http://python-data.dr-chuck.net/known_by_Basile.html"

def linkOpen():
    global urlToUse
    html = urllib.urlopen(urlToUse)
    soup = BeautifulSoup(html, "lxml")
    tags = soup("li")
    count = 0
    for tag in tags:
        if count == 17:
            tagUrl = re.findall('href="([^ ]+)"', str(tag))
            nameList.append(tagUrl)
            urlToUse = tagUrl
            count = count + 1
        else:
            count = count + 1
            continue

bigCount = 0
while bigCount < 9:
    linkOpen()
    bigCount = bigCount + 1

print nameList[8]
Errors:
Traceback (most recent call last):
  File "assignmentLinkScrape.py", line 26, in <module>
    linkOpen()
  File "assignmentLinkScrape.py", line 10, in linkOpen
    html = urllib.urlopen(urlToUse)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap
    url = url.strip()
AttributeError: 'list' object has no attribute 'strip'
re.findall() returns a list of matches. urlToUse is a list and you are trying to pass it to urlopen() which expects a URL string instead.
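For example, inside the loop you could keep just the first match so the global stays a string (a sketch, assuming the regex finds an href at all):

tagUrl = re.findall('href="([^ ]+)"', str(tag))
if tagUrl:                    # findall returns a list of strings
    urlToUse = tagUrl[0]      # take the first match: a plain string that urlopen() accepts
    nameList.append(urlToUse)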
Alexce has explained your error, but you don't need a regex at all: you just want to get the 18th li tag and extract the href from the anchor tag inside it. You can use find with find_all:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("http://python-data.dr-chuck.net/known_by_Basile.html").content,"lxml")
url = soup.find("ul").find_all("li", limit=18)[-1].a["href"]
Or use a css selector:
url = soup.select_one("ul li:nth-of-type(18) a")["href"]
So to get the name after visiting the URL seven times, put the logic in a function, visit the initial URL, then visit and extract the anchor seven times; on the last visit just extract the text from the anchor:
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://python-data.dr-chuck.net/known_by_Basile.html").content, "lxml")

def get_nth(n, soup):
    return soup.select_one("ul li:nth-of-type({}) a".format(n))

start = get_nth(18, soup)
for _ in range(7):
    soup = BeautifulSoup(requests.get(start["href"]).content, "html.parser")
    start = get_nth(18, soup)
print(start.text)