How do I get href links from hrefs using Python/pandas?

I need to get the href links that are present on a page whose href I already have. So I need to visit those links and collect the hrefs present on them. With my code I only get the first level of hrefs; I want to follow each of those and collect the hrefs on those pages. How can I do that?
I tried:
from bs4 import BeautifulSoup
import requests

url = 'https://www.iea.org/oilmarketreport/reports/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#soup.prettify()
#table = soup.find("table")
#print(table)

links = []
for href in soup.find_all(class_='omrlist'):
    #print(href)
    links.append(href.find('a').get('href'))
print(links)

Here is how to loop to get the report URLs:
import requests
from bs4 import BeautifulSoup

root_url = 'https://www.iea.org'

def getLinks(url):
    all_links = []
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all(class_='omrlist'):
        all_links.append(root_url + href.find('a').get('href'))  # add the 'https://...' prefix
    return all_links

yearLinks = getLinks(root_url + '/oilmarketreport/reports/')

# get report URLs
reportLinks = []
for url in yearLinks:
    links = getLinks(url)
    reportLinks.extend(links)
print(reportLinks)

for url in reportLinks:
    if '.pdf' in url:
        url = url.replace('../../..', '')
        # download the pdf file
        ...
    else:
        # extract the pdf url from the html page and download it
        ...
Now you can loop over reportLinks to get the PDF URLs.
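For the download step left as a placeholder above, a minimal sketch (my addition; it assumes each cleaned URL is directly fetchable and skips error handling) might look like this:

import os
import requests

# Hypothetical download loop for the '.pdf' branch above; assumes
# reportLinks is already populated and each cleaned URL resolves directly.
for url in reportLinks:
    if '.pdf' in url:
        url = url.replace('../../..', '')
        response = requests.get(url)
        filename = os.path.basename(url)  # last path segment of the URL
        with open(filename, 'wb') as f:
            f.write(response.content)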

Cannot find the table data within the soup, but I know it's there

I am trying to create a function that scrapes college baseball team roster pages for a project. I have created a function that crawls the roster page and gets a list of the links I want to scrape. But when I try to scrape the individual links for each player, it works but cannot find the data that is on their page.
This is the link to the page I am crawling from at the start:
https://gvsulakers.com/sports/baseball/roster
These are just functions that I call within the function that I am having a problem with:
import requests
from bs4 import BeautifulSoup

# headers is assumed to be defined elsewhere in my script

def parse_row(rows):
    return [str(x.string) for x in rows.find_all('td')]

def scrape(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    row = soop.find_all('tr')
    lopr = [parse_row(rows) for rows in row]
    return lopr
Here is what I am having an issue with: when I assign type1_roster(...) to a variable and print it, I only get an empty list. Ideally it should contain data about a player or players from a player's roster page.
# Roster page crawler
def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    href_tags = soop.find_all(href=True)
    hrefs = [tag.get('href') for tag in href_tags]
    # get all player links
    player_hrefs = []
    for href in hrefs:
        if 'sports/baseball/roster' in href:
            if 'sports/baseball/roster/coaches' not in href:
                if 'https:' not in href:
                    player_hrefs.append(href)
    # get rid of duplicates
    player_links = list(set(player_hrefs))
    # scrape the roster links
    for link in player_links:
        player_ = url + link[24:]
    return find_data(player_)
A number of things:
I would pass the headers as a global.
You are slicing the link one character too late for player_, I think.
You need to re-work the logic of find_data(), as the data is present in a mixture of element types and not in table/tr/td elements; some of it is found in spans, for example. The HTML attributes are nice and descriptive and will support targeting content easily.
You can target the player links from the landing page more tightly with the CSS selector list shown below. This removes the need for multiple loops as well as the use of list(set()).
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def scrape(url):
    page = requests.get(url, headers=HEADERS)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=HEADERS)
    #print(page)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # re-think logic here to return desired data e.g.
    # soop.select_one('.sidearm-roster-player-jersey-number').text
    first_name = soop.select_one('.sidearm-roster-player-first-name').text
    # soop.select_one('.sidearm-roster-player-last-name').text
    # targeted string cleaning may be needed
    bio = soop.select_one('#sidearm-roster-player-bio').get_text('')
    return (first_name, bio)

def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # scrape the roster links
    for link in player_links:
        player_ = url + link[23:]
        # print(player_)
    return find_data(player_)

print(type1_roster('gvsulakers'))
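One caveat about both versions of type1_roster(): find_data() is only called after the for loop finishes, so only the last player link actually gets scraped. If you want data for every player, a sketch (my variation, reusing the functions above) that accumulates the results:

# Sketch: return data for every player instead of only the last link
def type1_roster_all(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    return [find_data(url + link[23:]) for link in player_links]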

Scraping multiple pages in Python

I am trying to scrape a page that includes 12 links. I need to open each of these links and scrape all of their titles. When I open each link, I find multiple pages within it. However, my code can only scrape the first page of each of these 12 links.
With the code below, I can print the URLs of all 12 links that exist on the main page.
import requests
from bs4 import BeautifulSoup

url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all("a")
all_urls = []
for link in links[1:]:
    link_address = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/' + link.get("href")
    all_urls.append(link_address)
Then, I looped over all of them:
for i in range(0, 12):
    url = all_urls[i]
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
The titles can be extracted by the lines below:
title_news = []
news_div = soup.find_all('div', class_='article')
for container in news_div:
    title = container.h5.a.text
    title_news.append(title)
The output of this code only includes the titles from the first page of each of these 12 links, while I need my code to go through all of the pages in these 12 URLs.
The line below gives me the link of the next page within each of these 12 links, if it is placed in an appropriate loop (it reads the pagination section and looks for the next page's URL):
page = soup.find('ul', {'class' : 'pagination'}).select('li', {'class': "page-link"})[2].find('a')['href']
How should I use a page variable inside my code to extract multiple pages from all of these 12 links and read all the titles, not only the first-page titles?
You can use this code to get all titles from all the pages:
import requests
from bs4 import BeautifulSoup

base_url = "http://mlg.ucd.ie/modules/COMP41680/assignment2/"

soup = BeautifulSoup(
    requests.get(base_url + "index.html").content, "html.parser"
)

title_news = []
for a in soup.select("#all a"):
    next_link = a["href"]
    print("Getting", base_url + next_link)
    while True:
        soup = BeautifulSoup(
            requests.get(base_url + next_link).content, "html.parser"
        )
        for title in soup.select("h5 a"):
            title_news.append(title.text)
        next_link = soup.select_one('a[aria-label="Next"]')["href"]
        if next_link == "#":
            break

print("Length of title_news:", len(title_news))
Prints:
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-feb-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-mar-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-apr-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-may-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jun-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jul-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-aug-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-sep-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-oct-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-nov-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-dec-001.html
Length of title_news: 16226
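One caveat with the loop above: if a page ever lacks the "Next" link entirely, select_one() returns None and the ["href"] lookup raises a TypeError. A small hypothetical helper (my addition, not part of the original answer) makes the loop tail more defensive:

def next_page_href(soup):
    # Return the next-page href, or None when pagination ends
    next_a = soup.select_one('a[aria-label="Next"]')
    if next_a is None or next_a.get("href", "#") == "#":
        return None
    return next_a["href"]

Inside the while loop you would then write next_link = next_page_href(soup) and break when it is None.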

How do I get hrefs from hrefs?

How do I get hrefs from hrefs using Python, with a class-and-method structure?
I have tried:
from bs4 import BeautifulSoup
import requests

root_url = 'https://www.iea.org'

class IEAData:
    def __init__(self):
        try:
            ...
        except:
            ...

    def get_links(self, url):
        all_links = []
        page = requests.get(root_url)
        soup = BeautifulSoup(page.text, 'html.parser')
        for href in soup.find_all(class_='omrlist'):
            all_links.append(root_url + href.find('a').get('href'))
        return all_links
        #print(all_links)

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

reportLinks = []
for url in yearLinks:
    links = iea_obj.get_links(yearLinks)
    print(links)
Expected: the links variable should have all the month hrefs, but it does not, so please tell me how I should do it.
There were a couple of issues with your code. Your get_links() function was not using the url that was passed to it. When looping over the returned links, you were passing yearLinks rather than the url.
The following should get you going:
from bs4 import BeautifulSoup
import requests

root_url = 'https://www.iea.org'

class IEAData:
    def get_links(self, url):
        all_links = []
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        for li in soup.find_all(class_='omrlist'):
            all_links.append(root_url + li.find('a').get('href'))
        return all_links

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

for url in yearLinks:
    links = iea_obj.get_links(url)
    print(url, links)
This would give you output starting:
https://www.iea.org/oilmarketreport/reports/2018/ ['https://www.iea.org/oilmarketreport/reports/2018/0118/', 'https://www.iea.org/oilmarketreport/reports/2018/0218/', 'https://www.iea.org/oilmarketreport/reports/2018/0318/', 'https://www.iea.org/oilmarketreport/reports/2018/0418/', 'https://www.iea.org/oilmarketreport/reports/2018/0518/', 'https://www.iea.org/oilmarketreport/reports/2018/0618/', 'https://www.iea.org/oilmarketreport/reports/2018/0718/', 'https://www.iea.org/oilmarketreport/reports/2018/0818/', 'https://www.iea.org/oilmarketreport/reports/2018/1018/']
https://www.iea.org/oilmarketreport/reports/2017/ ['https://www.iea.org/oilmarketreport/reports/2017/0117/', 'https://www.iea.org/oilmarketreport/reports/2017/0217/', 'https://www.iea.org/oilmarketreport/reports/2017/0317/', 'https://www.iea.org/oilmarketreport/reports/2017/0417/', 'https://www.iea.org/oilmarketreport/reports/2017/0517/', 'https://www.iea.org/oilmarketreport/reports/2017/0617/', 'https://www.iea.org/oilmarketreport/reports/2017/0717/', 'https://www.iea.org/oilmarketreport/reports/2017/0817/', 'https://www.iea.org/oilmarketreport/reports/2017/0917/', 'https://www.iea.org/oilmarketreport/reports/2017/1017/', 'https://www.iea.org/oilmarketreport/reports/2017/1117/', 'https://www.iea.org/oilmarketreport/reports/2017/1217/']
I'm fairly new to programming, and I'm still learning and trying to understand how classes and whatnot all work together. But I gave it a shot (that's how we learn, right?).
Not sure if this is what you're looking for as your output. I changed two things and was able to put all the links from within the yearLinks into a list. Note that it will also include the PDF links as well as the month links that I think you wanted. If you don't want those PDF links and want exclusively the months, just filter out the .pdf entries (see the commented line in the code).
So here's the code I did it with, and maybe you can use that to fit into how you have it structured.
import bs4
import requests

root_url = 'https://www.iea.org'

class IEAData:
    def get_links(self, url):
        all_links = []
        page = requests.get(url)
        soup = bs4.BeautifulSoup(page.text, 'html.parser')
        for href in soup.find_all(class_='omrlist'):
            all_links.append(root_url + href.find('a').get('href'))
        return all_links
        #print(all_links)

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

reportLinks = []
for url in yearLinks:
    links = iea_obj.get_links(url)
    # uncomment the line below if you do not want the .pdf links
    #links = [x for x in links if ".pdf" not in x]
    reportLinks += links
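For example, once reportLinks is filled you could split it into month pages and direct PDF links (an illustrative snippet of mine, not part of the original answer):

# Illustrative split of the collected links
pdf_links = [x for x in reportLinks if '.pdf' in x]
month_links = [x for x in reportLinks if '.pdf' not in x]
print(len(pdf_links), 'PDF links and', len(month_links), 'month links')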

Using Beautiful Soup to scrape data from Indeed

I am trying to use BeautifulSoup to scrape resumes on Indeed, but I have run into some problems.
Here is the sample site: https://www.indeed.com/resumes?q=java&l=&cb=jt
Here is my code:
URL = "https://www.indeed.com/resumes?q=java&l=&cb=jt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
def scrape_job_title(soup):
job = []
for div in soup.find_all(name='li', attrs={'class':'sre'}):
for a in div.find_all(name='a', attrs={'class':'app-link'}):
job.append(a['title'])
return(job)
scrape_job_title(soup)
It prints out nothing: [].
I want to grab the job title "Java developer" (shown in a screenshot in the original question).
The class is app_link, not app-link. Additionally, a['title'] doesn't do what you want. Use a.contents[0] instead.
URL = "https://www.indeed.com/resumes?q=java&l=&cb=jt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
def scrape_job_title(soup):
job = []
for div in soup.find_all(name='li', attrs={'class':'sre'}):
for a in div.find_all(name='a', attrs={'class':'app_link'}):
job.append(a.contents[0])
return(job)
scrape_job_title(soup)
Try this to get all the job titles:
import requests
from bs4 import BeautifulSoup

URL = "https://www.indeed.com/resumes?q=java&l=&cb=jt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html5lib')

for items in soup.select('.sre'):
    data = [item.text for item in items.select('.app_link')]
    print(data)
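The two answers are equivalent; the nested find_all() calls in the first can also be collapsed into a single descendant CSS selector (a small sketch of mine, based on the class names above):

# Sketch: one descendant selector instead of two nested find_all() loops
jobs = [a.get_text(strip=True) for a in soup.select('li.sre a.app_link')]
print(jobs)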

How to web scrape images that are in a link in a table?

We need to get an image out of a link in a table. There is a table, in the table there are links, and in each link there are images. What is the best way to get all the images from the website?
So far we have the text from the table and the links.
But how do we get the images from this site?
from bs4 import BeautifulSoup
import urllib2

url = "http://www.gestolenkunst.nl/gestolen%20overzicht.htm"  # change to whatever your url is
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
link = soup.a

def main():
    """scrape the table rows"""
    for tr in soup.find_all('tr')[2:]:
        tds = tr.find_all('td')
        print "Datum: %s\n Plaats: %s\n Gestolen van: %s\n" % \
            (tds[0].text.strip(), tds[1].text.strip(), tds[2].text.strip())

for link in soup.find_all('a'):
    print link["href"]
    print link.renderContents()
If you want all the links whose hrefs start with images/..., you can use select with a CSS selector:
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
print([a["href"] for a in soup.select("a[href^=images/]")])
Which will give you:
[u'images/2015/eindhoven/eindhoven.htm', u'images/2015/rtl4/rtl4_schilderij%20gezocht.htm', u'images/2015/eindhoven/eindhoven.htm', u'images/2015/rtl4/rtl4_schilderij%20gezocht.htm', u'images/2015/doetinchem/doetinchem.htm', u'images/2014/emmeloord/emmeloord.htm', u'images/2014/heelsum/heelsum.htm', u'images/2014/blaricum/blaricum.htm', u'images/2014/hilversum/hilversum.htm', u'images/2014/kerkrade/kerkrade.htm', u'images/2013/heerhugowaard/artfarm.htm', u'images/2013/sittard/sittard.htm', u'images/2013/rotterdam/rotterdan.htm', u'images/2013/epe/epe.htm', u'images/2013/beek/beek.htm', u'images/2012/utrecht/utrecht.htm', u'images/2012/amsterdam/amsterdam.htm', u'images/2012/zwijndrecht/zwijdrecht.htm', u'images/2012/de_bilt/bakker.htm', u'images/2012/zutphen/zutphen.htm', u'images/2012/rheden/carmiggelt.htm', u'images/2012/dieren/dieren.htm', u'images/2011/denhaag/denhaag.htm', u'images/2011/rotterdam/bronzenbeeld_rotterdam.htm', u'images/2011/utrecht/utrecht.htm', u'images/2012/denhaag/denhaag.htm', u'images/2011/oosterbeek/oosterbeek.htm', u'images/2011/teruggevonden_alkmaar/sumo_worstelaar_brons.htm', u'images/2011/teruggevonden_alkmaar/sumo_worstelaar_brons.htm', u'images/2011/vlierden/vlierden.htm', u'images/2011/bergen/bergen.htm', u'images/2011/alkmaar/mca.htm', u'images/2010/hendrik_ido_ambacht/hendrik-ido-ambacht.htm', u'images/2010/nijmegen/nijmegen.htm', u'images/2010/heesch', u'images/2010/boxie/boxie_gestolen_powerbook.htm', u'images/2010/ijmuiden/ijmuiden_nic_jonk.htm', u'images/2010/jitsbakker/bakkerjuli2010.htm', u'images/2010/gouda/gouda.htm', u'images/2010/enschede/enschede.htm', u'images/2010/someren/someren.htm', u'images/2010/jitsbakker/jitsbakker_steltloper_maart2010.htm', u'images/2009/hogeweg/hogeweg.htm', u'images/2009/wildevuur/wildevuur.htm', u'images/2009/bad-nieuweschans/heus.htm', u'images/2009/bad-nieuweschans/heus.htm', u'images/2009/darp/kneulman.htm', u'images/2009/keramikos/keramikos.htm', u'images/2009/maasbracht/maasbracht.htm', u'images/2008/groet/groet.htm', u'images/2009/rotterdam/rotterdam.htm', u'images/2009/rotterdam/rotterdam.htm', u'images/2008/swifterband/swifterband.htm', u'images/2008/laren_snoek/laren_snoek.htm', u'images/2008/beetsterzwaag/Beetsterzwaag.htm', u'images/2008/callantsoog/booghaard.htm', u'images/2008/lelystad/lelystad.htm', u'images/2008/amsterdam_kok/amsterdam_kok.htm', u'images/2008/lochem/lochem.htm', u'images/2008/liempde/liempde.htm', u'images/2008/heerhugowaard/zande.htm', u'images/2008/amsterdam_hatterman/amsterdam_hatterman.htm', u'images/2008/delft/delft.htm', u'images/2008/sgraveland/sgraveland.htm', u'images/2008/laren/laren110308.htm', u'images/2008/laren/laren110308.htm', u'images/2008/alphen_ad_rijn/alphen_ad_rijn.htm', u'images/2008/hardinxveld/hardinxveld.htm', u'images/2008/denhaag_karres/denhaag_karres.htm', u'images/2008/amsterdam_eijsenberger/amsterdam_eijs.htm', u'images/2008/amsterdam/amsterdam.htm', u'images/2008/denhaag/denhaag_brauw.htm', u'images/2008/groenhart/groenhart.htm', u'images/2007/aalsmeer/aalsmeer.htm', u'images/2007/delft/delft_klaus.htm', u'images/2007/malden/malden.htm', u'images/2007/sterksel/sterksel.htm', u'images/2007/zeist/zeist_achmea.htm', u'images/2007/maaseik_laar/maaseik.htm', u'images/2007/meerssen/meerssen.htm', u'images/2007/lisse/lisse.htm', u'images/2007/kortenhoef/kortenhoef.htm', u'images/2007/schijndel/schijndel.htm', u'images/2007/alkmaar/smit.htm', u'images/2007/heerlen/heerlen.htm', u'images/2007/heerlen/heerlen.htm', u'images/2007/tiel/kaayk.htm', 
u'images/2007/arnhem/arnhem.htm', u'images/2007/amsterdam_noort/amsterdam_noort.htm', u'images/2007/sgravenhage/sgravenhage.htm', u'images/2007/hazelaar/hazelaar.htm', u'images/2007/putte-stabroek/putte_stabroek.htm', u'images/2007/maarssen/maarssen_beeldentuin.htm', u'images/2007/huizen/huizen_gemeente.htm', u'images/2007/Maastricht_laar/maastricht_laar.htm', u'images/2007/bilthoven/bilthoven_v.htm', u'images/2007/sypesteyn/sypesteyn.htm', u'images/2007/hulzen/hulzen.htm', u'images/2007/huizen_limieten/huizen_limieten.htm', u'images/2007/elburg/elburg_galerie.htm', u'images/2007/hasselt/schildwacht_hasselt.htm', u'images/2006/woerden/woerden.htm', u'images/2006/amsterdam_slotervaart/sheils.htm', u'images/2006/recr_klepperstee_ouddorp/klepperstee_ouddorp.htm', u'images/2006/recr_klepperstee_ouddorp/klepperstee_ouddorp.htm', u'images/2007/erichem/krol.htm', u'images/2006/someren/someren.htm', u'images/2006/sliedrecht/sliedrecht_jitsbakker.htm', u'images/2006/blank/blank_2006.htm', u'images/2006/kemps_eindhoven/kemps_eindhoven.htm', u'images/2006/schoorl/begraafplaats_diels.htm', u'images/2006/bloemendaal/bloemendaal.htm', u'images/2010/zwolle/zwolle_leeser.htm', u'images/2006/vinkeveen/vinkeveen_jozefschool.htm', u'images/2006/gemeente_bergen/gemeente_bergen.htm', u'images/2006/alphen_ad_rijn/alphenadrijn.htm', u'images/2006/alphen_ad_rijn/alphenadrijn.htm', u'images/2006/Nieuwegein/nieuwegein.htm', u'images/2006/alkmaar_lange/lange.htm', u'images/2006/janneman/janneman.htm', u'images/2006/schoffelen/schoffelen.htm', u'images/2006/keuenhof_ede/keuenhof.htm', u'images/2006/rucphen/rucphen.htm', u'images/2006/lemberger_amsterdam/lemberger.htm', u'images/2006/bronckhorst/bronckhorst.htm', u'images/2006/arnhem_peja/peja.htm', u'images/2005/klomp/klomp.htm', u'images/2005/kalverda/kalverda.htm', u'images/2005/schellekens/schellekens.htm', u'images/2005/beeldhouwwinkel/beeldhouwwinkel.htm', u'images/2005/huisman/huisman.htm', u'images/2005/lith/lith.htm', u'images/2005/bergen/bergen_onna.htm', u'images/2005/emst_remeeus/emst.htm', u'images/2005/water/water.htm', u'images/2005/maastricht_drielsma/maastricht_drielsma.htm', u'images/2005/bosch/bosch.htm', u'images/2005/fransen/fransen.htm', u'images/vaart/vaart.htm', u'images/2005/lammertink/lammertink.htm', u'images/brocke/brocke.htm', u'images/wood/wood.htm', u'images/2005/klijzing/klijzing.htm', u'images/metalart/ponsioen.htm', u'images/harderwijk/harderwijk.htm', u'images/gulpen/gulpen.htm', u'images/limburg/limburg.htm', u'images/landgraaf/landgraaf.htm', u'images/pijnappel.htm', u'images/termaat/termaat.htm', u'images/vries/vries.htm', u'images/hartigh/hartigh.htm', u'images/hengelo/hengelo.htm', u'images/nijmegen/teunen/teunen.htm', u'images/nijmegen/nijmegen.htm', u'images/hollants/hollants_carla.htm', u'images/laren/laren_gils.htm', u'images/2003/smakt/smakt.htm', u'images/koopman/koopman.htm', u'images/voorschoten/voorschoten.htm', u'images/hagen/hagen.htm', u'images/bakker/gestolen%20kunst%20Jits%20Bakker.htm', u'images/bliekendaal/bliekendaal.htm', u'images/2003/utrecht/utrecht.htm', u'images/davina/davina.htm', u'images/janneman/janneman.htm', u'images/dijk/dijk.htm', u'images/clarissenbolwerk/havermans.htm', u'images/appelhof/appelhof.htm', u'images/blank/blank.htm', u'images/dussen/dussen.htm', u'images/bakker/gestolen%20kunst%20Jits%20Bakker.htm', u'images/rijs/rijs.htm', u'images/janssen', u'images/bakker/gestolen%20kunst%20Jits%20Bakker.htm', u'images/2002/nijssen/ouderamstel_nijssen.htm', u'images/onna/onna.htm', 
u'images/haring/haring.htm', u'images/dijk/dijk.htm', u'images/janneman/janneman.htm', u'images/hessels/hessels.htm', u'images/onna/onna.htm', u'images/2012/culemborg/culemborg_1998.htm', u'images/1998/mierlo/mierlo.htm', u'images/oud/wigbold/wigbold.htm', u'images/oud/nijmegen/nijmegen.htm', u'images/oud/amsterdam_ommen/amstedam_ommen.htm']
a[href^=images/] simply means: find all the anchor tags that have an href attribute whose value starts with images/, then we just pull the href from each anchor.
To use the links you will need to join each one to the main URL:
from bs4 import BeautifulSoup
from urlparse import urljoin
import urllib2

url = "http://www.gestolenkunst.nl/gestolen%20overzicht.htm"  # change to whatever your url is
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")

base = "http://www.gestolenkunst.nl/"
images = [urljoin(base, a["href"]) for a in soup.select("a[href^=images/]")]
print(images)
Output:
[u'http://www.gestolenkunst.nl/images/2015/eindhoven/eindhoven.htm', u'http://www.gestolenkunst.nl/images/2015/rtl4/rtl4_schilderij%20gezocht.htm', u'http://www.gestolenkunst.nl/images/2015/eindhoven/eindhoven.htm', u'http://www.gestolenkunst.nl/images/2015/rtl4/rtl4_schilderij%20gezocht.htm', u'http://www.gestolenkunst.nl/images/2015/doetinchem/doetinchem.htm', u'http://www.gestolenkunst.nl/images/2014/emmeloord/emmeloord.htm', u'http://www.gestolenkunst.nl/images/2014/heelsum/heelsum.htm', u'http://www.gestolenkunst.nl/images/2014/blaricum/blaricum.htm', u'http://www.gestolenkunst.nl/images/2014/hilversum/hilversum.htm', u'http://www.gestolenkunst.nl/images/2014/kerkrade/kerkrade.htm', u'http://www.gestolenkunst.nl/images/2013/heerhugowaard/artfarm.htm', u'http://www.gestolenkunst.nl/images/2013/sittard/sittard.htm', u'http://www.gestolenkunst.nl/images/2013/rotterdam/rotterdan.htm', u'http://www.gestolenkunst.nl/images/2013/epe/epe.htm', u'http://www.gestolenkunst.nl/images/2013/beek/beek.htm', u'http://www.gestolenkunst.nl/images/2012/utrecht/utrecht.htm', u'http://www.gestolenkunst.nl/images/2012/amsterdam/amsterdam.htm', u'http://www.gestolenkunst.nl/images/2012/zwijndrecht/zwijdrecht.htm', u'http://www.gestolenkunst.nl/images/2012/de_bilt/bakker.htm', u'http://www.gestolenkunst.nl/images/2012/zutphen/zutphen.htm', u'http://www.gestolenkunst.nl/images/2012/rheden/carmiggelt.htm', u'http://www.gestolenkunst.nl/images/2012/dieren/dieren.htm', u'http://www.gestolenkunst.nl/images/2011/denhaag/denhaag.htm', u'http://www.gestolenkunst.nl/images/2011/rotterdam/bronzenbeeld_rotterdam.htm', u'http://www.gestolenkunst.nl/images/2011/utrecht/utrecht.htm', u'http://www.gestolenkunst.nl/images/2012/denhaag/denhaag.htm', u'http://www.gestolenkunst.nl/images/2011/oosterbeek/oosterbeek.htm', u'http://www.gestolenkunst.nl/images/2011/teruggevonden_alkmaar/sumo_worstelaar_brons.htm', u'http://www.gestolenkunst.nl/images/2011/teruggevonden_alkmaar/sumo_worstelaar_brons.htm', u'http://www.gestolenkunst.nl/images/2011/vlierden/vlierden.htm', u'http://www.gestolenkunst.nl/images/2011/bergen/bergen.htm', u'http://www.gestolenkunst.nl/images/2011/alkmaar/mca.htm', u'http://www.gestolenkunst.nl/images/2010/hendrik_ido_ambacht/hendrik-ido-ambacht.htm', u'http://www.gestolenkunst.nl/images/2010/nijmegen/nijmegen.htm', u'http://www.gestolenkunst.nl/images/2010/heesch', u'http://www.gestolenkunst.nl/images/2010/boxie/boxie_gestolen_powerbook.htm', u'http://www.gestolenkunst.nl/images/2010/ijmuiden/ijmuiden_nic_jonk.htm', u'http://www.gestolenkunst.nl/images/2010/jitsbakker/bakkerjuli2010.htm', u'http://www.gestolenkunst.nl/images/2010/gouda/gouda.htm', u'http://www.gestolenkunst.nl/images/2010/enschede/enschede.htm', u'http://www.gestolenkunst.nl/images/2010/someren/someren.htm', u'http://www.gestolenkunst.nl/images/2010/jitsbakker/jitsbakker_steltloper_maart2010.htm', u'http://www.gestolenkunst.nl/images/2009/hogeweg/hogeweg.htm', u'http://www.gestolenkunst.nl/images/2009/wildevuur/wildevuur.htm', u'http://www.gestolenkunst.nl/images/2009/bad-nieuweschans/heus.htm', u'http://www.gestolenkunst.nl/images/2009/bad-nieuweschans/heus.htm', u'http://www.gestolenkunst.nl/images/2009/darp/kneulman.htm', u'http://www.gestolenkunst.nl/images/2009/keramikos/keramikos.htm', u'http://www.gestolenkunst.nl/images/2009/maasbracht/maasbracht.htm', u'http://www.gestolenkunst.nl/images/2008/groet/groet.htm', u'http://www.gestolenkunst.nl/images/2009/rotterdam/rotterdam.htm', 
u'http://www.gestolenkunst.nl/images/2009/rotterdam/rotterdam.htm', u'http://www.gestolenkunst.nl/images/2008/swifterband/swifterband.htm', u'http://www.gestolenkunst.nl/images/2008/laren_snoek/laren_snoek.htm', u'http://www.gestolenkunst.nl/images/2008/beetsterzwaag/Beetsterzwaag.htm', u'http://www.gestolenkunst.nl/images/2008/callantsoog/booghaard.htm', u'http://www.gestolenkunst.nl/images/2008/lelystad/lelystad.htm', u'http://www.gestolenkunst.nl/images/2008/amsterdam_kok/amsterdam_kok.htm', u'http://www.gestolenkunst.nl/images/2008/lochem/lochem.htm', u'http://www.gestolenkunst.nl/images/2008/liempde/liempde.htm', u'http://www.gestolenkunst.nl/images/2008/heerhugowaard/zande.htm', u'http://www.gestolenkunst.nl/images/2008/amsterdam_hatterman/amsterdam_hatterman.htm', u'http://www.gestolenkunst.nl/images/2008/delft/delft.htm', u'http://www.gestolenkunst.nl/images/2008/sgraveland/sgraveland.htm', u'http://www.gestolenkunst.nl/images/2008/laren/laren110308.htm', u'http://www.gestolenkunst.nl/images/2008/laren/laren110308.htm', u'http://www.gestolenkunst.nl/images/2008/alphen_ad_rijn/alphen_ad_rijn.htm', u'http://www.gestolenkunst.nl/images/2008/hardinxveld/hardinxveld.htm', u'http://www.gestolenkunst.nl/images/2008/denhaag_karres/denhaag_karres.htm', u'http://www.gestolenkunst.nl/images/2008/amsterdam_eijsenberger/amsterdam_eijs.htm', u'http://www.gestolenkunst.nl/images/2008/amsterdam/amsterdam.htm', u'http://www.gestolenkunst.nl/images/2008/denhaag/denhaag_brauw.htm', u'http://www.gestolenkunst.nl/images/2008/groenhart/groenhart.htm', u'http://www.gestolenkunst.nl/images/2007/aalsmeer/aalsmeer.htm', u'http://www.gestolenkunst.nl/images/2007/delft/delft_klaus.htm', u'http://www.gestolenkunst.nl/images/2007/malden/malden.htm', u'http://www.gestolenkunst.nl/images/2007/sterksel/sterksel.htm', u'http://www.gestolenkunst.nl/images/2007/zeist/zeist_achmea.htm', u'http://www.gestolenkunst.nl/images/2007/maaseik_laar/maaseik.htm', u'http://www.gestolenkunst.nl/images/2007/meerssen/meerssen.htm', u'http://www.gestolenkunst.nl/images/2007/lisse/lisse.htm', u'http://www.gestolenkunst.nl/images/2007/kortenhoef/kortenhoef.htm', u'http://www.gestolenkunst.nl/images/2007/schijndel/schijndel.htm', u'http://www.gestolenkunst.nl/images/2007/alkmaar/smit.htm', u'http://www.gestolenkunst.nl/images/2007/heerlen/heerlen.htm', u'http://www.gestolenkunst.nl/images/2007/heerlen/heerlen.htm', u'http://www.gestolenkunst.nl/images/2007/tiel/kaayk.htm', u'http://www.gestolenkunst.nl/images/2007/arnhem/arnhem.htm', u'http://www.gestolenkunst.nl/images/2007/amsterdam_noort/amsterdam_noort.htm', u'http://www.gestolenkunst.nl/images/2007/sgravenhage/sgravenhage.htm', u'http://www.gestolenkunst.nl/images/2007/hazelaar/hazelaar.htm', u'http://www.gestolenkunst.nl/images/2007/putte-stabroek/putte_stabroek.htm', u'http://www.gestolenkunst.nl/images/2007/maarssen/maarssen_beeldentuin.htm', u'http://www.gestolenkunst.nl/images/2007/huizen/huizen_gemeente.htm', u'http://www.gestolenkunst.nl/images/2007/Maastricht_laar/maastricht_laar.htm', u'http://www.gestolenkunst.nl/images/2007/bilthoven/bilthoven_v.htm', u'http://www.gestolenkunst.nl/images/2007/sypesteyn/sypesteyn.htm', u'http://www.gestolenkunst.nl/images/2007/hulzen/hulzen.htm', u'http://www.gestolenkunst.nl/images/2007/huizen_limieten/huizen_limieten.htm', u'http://www.gestolenkunst.nl/images/2007/elburg/elburg_galerie.htm', u'http://www.gestolenkunst.nl/images/2007/hasselt/schildwacht_hasselt.htm', u'http://www.gestolenkunst.nl/images/2006/woerden/woerden.htm', 
u'http://www.gestolenkunst.nl/images/2006/amsterdam_slotervaart/sheils.htm', u'http://www.gestolenkunst.nl/images/2006/recr_klepperstee_ouddorp/klepperstee_ouddorp.htm', u'http://www.gestolenkunst.nl/images/2006/recr_klepperstee_ouddorp/klepperstee_ouddorp.htm', u'http://www.gestolenkunst.nl/images/2007/erichem/krol.htm', u'http://www.gestolenkunst.nl/images/2006/someren/someren.htm', u'http://www.gestolenkunst.nl/images/2006/sliedrecht/sliedrecht_jitsbakker.htm', u'http://www.gestolenkunst.nl/images/2006/blank/blank_2006.htm', u'http://www.gestolenkunst.nl/images/2006/kemps_eindhoven/kemps_eindhoven.htm', u'http://www.gestolenkunst.nl/images/2006/schoorl/begraafplaats_diels.htm', u'http://www.gestolenkunst.nl/images/2006/bloemendaal/bloemendaal.htm', u'http://www.gestolenkunst.nl/images/2010/zwolle/zwolle_leeser.htm', u'http://www.gestolenkunst.nl/images/2006/vinkeveen/vinkeveen_jozefschool.htm', u'http://www.gestolenkunst.nl/images/2006/gemeente_bergen/gemeente_bergen.htm', u'http://www.gestolenkunst.nl/images/2006/alphen_ad_rijn/alphenadrijn.htm', u'http://www.gestolenkunst.nl/images/2006/alphen_ad_rijn/alphenadrijn.htm', u'http://www.gestolenkunst.nl/images/2006/Nieuwegein/nieuwegein.htm', u'http://www.gestolenkunst.nl/images/2006/alkmaar_lange/lange.htm', u'http://www.gestolenkunst.nl/images/2006/janneman/janneman.htm', u'http://www.gestolenkunst.nl/images/2006/schoffelen/schoffelen.htm', u'http://www.gestolenkunst.nl/images/2006/keuenhof_ede/keuenhof.htm', u'http://www.gestolenkunst.nl/images/2006/rucphen/rucphen.htm', u'http://www.gestolenkunst.nl/images/2006/lemberger_amsterdam/lemberger.htm', u'http://www.gestolenkunst.nl/images/2006/bronckhorst/bronckhorst.htm', u'http://www.gestolenkunst.nl/images/2006/arnhem_peja/peja.htm', u'http://www.gestolenkunst.nl/images/2005/klomp/klomp.htm', u'http://www.gestolenkunst.nl/images/2005/kalverda/kalverda.htm', u'http://www.gestolenkunst.nl/images/2005/schellekens/schellekens.htm', u'http://www.gestolenkunst.nl/images/2005/beeldhouwwinkel/beeldhouwwinkel.htm', u'http://www.gestolenkunst.nl/images/2005/huisman/huisman.htm', u'http://www.gestolenkunst.nl/images/2005/lith/lith.htm', u'http://www.gestolenkunst.nl/images/2005/bergen/bergen_onna.htm', u'http://www.gestolenkunst.nl/images/2005/emst_remeeus/emst.htm', u'http://www.gestolenkunst.nl/images/2005/water/water.htm', u'http://www.gestolenkunst.nl/images/2005/maastricht_drielsma/maastricht_drielsma.htm', u'http://www.gestolenkunst.nl/images/2005/bosch/bosch.htm', u'http://www.gestolenkunst.nl/images/2005/fransen/fransen.htm', u'http://www.gestolenkunst.nl/images/vaart/vaart.htm', u'http://www.gestolenkunst.nl/images/2005/lammertink/lammertink.htm', u'http://www.gestolenkunst.nl/images/brocke/brocke.htm', u'http://www.gestolenkunst.nl/images/wood/wood.htm', u'http://www.gestolenkunst.nl/images/2005/klijzing/klijzing.htm', u'http://www.gestolenkunst.nl/images/metalart/ponsioen.htm', u'http://www.gestolenkunst.nl/images/harderwijk/harderwijk.htm', u'http://www.gestolenkunst.nl/images/gulpen/gulpen.htm', u'http://www.gestolenkunst.nl/images/limburg/limburg.htm', u'http://www.gestolenkunst.nl/images/landgraaf/landgraaf.htm', u'http://www.gestolenkunst.nl/images/pijnappel.htm', u'http://www.gestolenkunst.nl/images/termaat/termaat.htm', u'http://www.gestolenkunst.nl/images/vries/vries.htm', u'http://www.gestolenkunst.nl/images/hartigh/hartigh.htm', u'http://www.gestolenkunst.nl/images/hengelo/hengelo.htm', u'http://www.gestolenkunst.nl/images/nijmegen/teunen/teunen.htm', 
u'http://www.gestolenkunst.nl/images/nijmegen/nijmegen.htm', u'http://www.gestolenkunst.nl/images/hollants/hollants_carla.htm', u'http://www.gestolenkunst.nl/images/laren/laren_gils.htm', u'http://www.gestolenkunst.nl/images/2003/smakt/smakt.htm', u'http://www.gestolenkunst.nl/images/koopman/koopman.htm', u'http://www.gestolenkunst.nl/images/voorschoten/voorschoten.htm', u'http://www.gestolenkunst.nl/images/hagen/hagen.htm', u'http://www.gestolenkunst.nl/images/bakker/gestolen%20kunst%20Jits%20Bakker.htm', u'http://www.gestolenkunst.nl/images/bliekendaal/bliekendaal.htm', u'http://www.gestolenkunst.nl/images/2003/utrecht/utrecht.htm', u'http://www.gestolenkunst.nl/images/davina/davina.htm', u'http://www.gestolenkunst.nl/images/janneman/janneman.htm', u'http://www.gestolenkunst.nl/images/dijk/dijk.htm', u'http://www.gestolenkunst.nl/images/clarissenbolwerk/havermans.htm', u'http://www.gestolenkunst.nl/images/appelhof/appelhof.htm', u'http://www.gestolenkunst.nl/images/blank/blank.htm', u'http://www.gestolenkunst.nl/images/dussen/dussen.htm', u'http://www.gestolenkunst.nl/images/bakker/gestolen%20kunst%20Jits%20Bakker.htm', u'http://www.gestolenkunst.nl/images/rijs/rijs.htm', u'http://www.gestolenkunst.nl/images/janssen', u'http://www.gestolenkunst.nl/images/bakker/gestolen%20kunst%20Jits%20Bakker.htm', u'http://www.gestolenkunst.nl/images/2002/nijssen/ouderamstel_nijssen.htm', u'http://www.gestolenkunst.nl/images/onna/onna.htm', u'http://www.gestolenkunst.nl/images/haring/haring.htm', u'http://www.gestolenkunst.nl/images/dijk/dijk.htm', u'http://www.gestolenkunst.nl/images/janneman/janneman.htm', u'http://www.gestolenkunst.nl/images/hessels/hessels.htm', u'http://www.gestolenkunst.nl/images/onna/onna.htm', u'http://www.gestolenkunst.nl/images/2012/culemborg/culemborg_1998.htm', u'http://www.gestolenkunst.nl/images/1998/mierlo/mierlo.htm', u'http://www.gestolenkunst.nl/images/oud/wigbold/wigbold.htm', u'http://www.gestolenkunst.nl/images/oud/nijmegen/nijmegen.htm', u'http://www.gestolenkunst.nl/images/oud/amsterdam_ommen/amstedam_ommen.htm']
You could be more specific with the select by using "td a[href^=images/]", but it will return the same result, as all of those links are inside td tags.
The actual images are in the linked pages, so we need to visit each link, find the img tag, extract the src, then download and save the image:
from bs4 import BeautifulSoup
from urlparse import urljoin
import urllib2

url = "http://www.gestolenkunst.nl/gestolen%20overzicht.htm"  # change to whatever your url is
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")

base = "http://www.gestolenkunst.nl/"
images = [urljoin(base, a["href"]) for a in soup.select("td a[href^=images/]")]

for url in images:
    img = BeautifulSoup(urllib2.urlopen(url).read(), "lxml").find("img")["src"]
    with open(img, "wb") as f:  # open in binary mode for image bytes
        f.write(urllib2.urlopen("{}/{}".format(url.rsplit("/", 1)[0], img)).read())
So for http://www.gestolenkunst.nl/images/2015/eindhoven/eindhoven.htm we get the image of the stolen artwork from that page (shown in the original answer).
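Note that this answer targets Python 2 (urllib2, urlparse, print statements). On Python 3 the same flow would use urllib.request and urllib.parse instead; a rough sketch of mine:

# Rough Python 3 port of the image-download loop above
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.gestolenkunst.nl/gestolen%20overzicht.htm"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")
base = "http://www.gestolenkunst.nl/"
pages = [urljoin(base, a["href"]) for a in soup.select('td a[href^="images/"]')]

for page_url in pages:
    img = BeautifulSoup(urlopen(page_url).read(), "lxml").find("img")["src"]
    with open(img, "wb") as f:  # save the image bytes locally
        f.write(urlopen("{}/{}".format(page_url.rsplit("/", 1)[0], img)).read())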
