I am scraping a website. My code loops through the 52 listing pages and collects the URL of each article.
It then iterates through those URLs and tries to get the link to the English translation of each article. On the Mongolian site there is a section "Орчуулга" ("Translation") at the top right with "English" underneath; that is the link to the English translation.
However, my code fails to grab the link to the English translation and returns the wrong URL.
Below is sample output for the first article.
1
{'https://mn.usembassy.gov/mn/2020-naadam-mn/': 'https://mn.usembassy.gov/mn/sitemap-mn/'}
The expected output for the first page should be
1
{'https://mn.usembassy.gov/mn/2020-naadam-mn/': 'https://mn.usembassy.gov/2020-naadam/'}
Below is my code
import requests
from bs4 import BeautifulSoup

url = 'https://mn.usembassy.gov/mn/news-events-mn/page/{page}/'
urls = []
for page in range(1, 53):
    print(str(page) + "/52")
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    for h in soup.find_all('h2'):
        a = h.find('a')
        urls.append(a.attrs['href'])
print(urls)

i = 0
bilingual_dict = {}
for url in urls:
    i += 1
    print(i)
    soup = BeautifulSoup(requests.get(url.format(page=url)).content, 'html.parser')
    for div in soup.find_all('div', class_='translations_sidebar'):
        for ul in soup.find_all('ul'):
            for li in ul.find_all('li'):
                a = li.find('a')
                bilingual_dict[url] = a['href']
    print(bilingual_dict)
print(bilingual_dict)
This script prints the link to the English translation:
import requests
from bs4 import BeautifulSoup
url = 'https://mn.usembassy.gov/mn/2020-naadam-mn/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
link = soup.select_one('a[hreflang="en"]')
print(link['href'])
Prints:
https://mn.usembassy.gov/2020-naadam/
Complete code (where there is no link to an English translation, the value is set to None):
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://mn.usembassy.gov/mn/news-events-mn/page/{page}/'
urls = []
for page in range(1, 53):
    print('Page {}...'.format(page))
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    for h in soup.find_all('h2'):
        a = h.find('a')
        urls.append(a.attrs['href'])
pprint(urls)

bilingual_dict = {}
for url in urls:
    print(url)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    link = soup.select_one('a[hreflang="en"]')
    bilingual_dict[url] = link['href'] if link else None
pprint(bilingual_dict)
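As for why the original loop returned the sitemap link: its inner loop calls soup.find_all('ul') instead of searching inside div, so it walks every list on the page and the last link it sees overwrites the dictionary entry. Here is a minimal offline sketch of the difference. The HTML fragment is made up and only mimics the layout described in the question; it is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the layout described in the question.
html = """
<div class="translations_sidebar">
  <ul><li><a hreflang="en" href="https://mn.usembassy.gov/2020-naadam/">English</a></li></ul>
</div>
<ul class="footer">
  <li><a href="https://mn.usembassy.gov/mn/sitemap-mn/">Sitemap</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# The question's approach: the inner loop searches `soup`, not `div`,
# so every <li> link on the page is visited and the last one is kept.
wrong = None
for div in soup.find_all("div", class_="translations_sidebar"):
    for ul in soup.find_all("ul"):
        for li in ul.find_all("li"):
            wrong = li.find("a")["href"]

# The answer's approach: select the one link marked as English.
link = soup.select_one('a[hreflang="en"]')
right = link["href"] if link else None

print(wrong)
print(right)
```

With this fragment, the first value is the sitemap link and the second is the English article.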
I'm trying to scrape the URLs of the ads on the Marktplaats website (link below).
I'm looking for 30 URLs. They are stored in 'href' attributes and all start with "/a/auto-s/". Unfortunately, I only get the first few URLs. I found out that on this site all the data is placed within <li class="mp-Listing mp-Listing--list-item"> ... </li>. Does anyone have an idea how to fix it? (If you run my code, you will see that it does not find all the ad URLs.)
Link:
https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true
My code:
import requests
from bs4 import BeautifulSoup
url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
url_list = soup.find_all(class_ = 'mp-Listing mp-Listing--list-item')
print(url_list)
You can try something like this:
import requests
from bs4 import BeautifulSoup

def parse_links(url):
    links = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for li in soup.find_all(class_="mp-Listing mp-Listing--list-item"):
        links.append(li.a.get('href'))
    return links

url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
links = parse_links(url)
print('\n'.join(map(str, links)))
Output
/a/auto-s/oldtimers/a1302359148-allis-chalmers-ed40-1965.html
/a/auto-s/bestelauto-s/a1258166221-opel-movano-2-3-cdti-96kw-2018.html
/a/auto-s/oldtimers/a1302359184-chevrolet-biscayne-bel-air-1960.html
/a/auto-s/renault/a1240974413-ruim-aanbod-rolstoelauto-s-www-autoland-nl.html
/a/auto-s/volkswagen/m1457703674-golf-6-1-2tsi-comfortline-bluemotion-77kw-2de-eigenaar.html
/a/auto-s/peugeot/m1457564187-peugeot-208-1-6-e-hdi-68kw-92pk-5-d-2014-zwart.html
/a/auto-s/volkswagen/m1457124365-volkswagen-touareg-3-2-v6-177kw-4motion-aut-2004-grijs.html
/a/auto-s/volkswagen/m1456753596-volkswagen-golf-vii-2-0-tdi-highline-150pk-xenon-trekhaak.html
/a/auto-s/bestelauto-s/a1001658686-200-nw-en-gebruikte-bestelwagens-personenbusjes-pick-ups.html
/a/auto-s/bestelauto-s/m940111355-bus-verkopen-bestelauto-inkoop-bestelwagen-opkoper-rdw.html
/a/auto-s/volkswagen/m1456401063-volkswagen-golf-1-6-74kw-2000-zwart.html
/a/auto-s/renault/m1456242548-renault-espace-2-0-dci-110kw-e4-2006-zwart.html
/a/auto-s/nissan/m1448699345-nissan-qashqai-1-5-dci-connect-2011-grijs-panoramadak.html
/a/auto-s/bestelauto-s/a1212708374-70-x-kleine-bestelwagens-lage-km-scherpe-prijzen.html
/a/auto-s/bmw/m1452641019-bmw-5-serie-2-0-520d-touring-aut-2014-grijs.html
/a/auto-s/mercedes-benz/m1448671698-mercedes-benz-a-klasse-a250-amg-224pk-7g-dct-panoramadak-wid.html
/a/auto-s/bmw/m1455671862-bmw-3-serie-2-0-i-320-cabrio-aut-2007-bruin.html
/a/auto-s/bestelauto-s/m1455562699-volkswagen-transporter-kmstand-151-534-2-5-tdi-65kw-2002.html
/a/auto-s/bestelauto-s/a1295698562-35-x-renault-kangoo-2013-t-m-2015-v-a-25000-km.html
/a/auto-s/infiniti/m1458111256-infiniti-q50-3-5-hybrid-awd-2016-grijs.html
/a/auto-s/ford/m1458111166-ford-ka-1-3-i-44kw-2007-zwart.html
/a/auto-s/bestelauto-s/m1457499260-renault-master-l3h2-2018-airco-camera-cruise-laadruimte-12.html
/a/auto-s/land-rover/m1458110209-land-rover-discovery-4-3-0-tdv6-2010-grijs.html
/a/auto-s/dodge/a1279463634-5-jaar-ram-dealer-garantie-lage-bijtelling.html
/a/auto-s/bmw/m1455389317-bmw-320i-e46-sedan-bieden.html
/a/auto-s/ford/m1457306473-ford-galaxy-2-0-tdci-85kw-dpf-2011-blauw.html
/a/auto-s/peugeot/m1456912876-peugeot-407-2-0-16v-sw-2006-grijs.html
/a/auto-s/bestelauto-s/m1457161395-renault-master-t35-2-3-dci-l3h2-130-pk-navi-airco-camera-pdc.html
/a/auto-s/bestelauto-s/a1299134880-citroen-berlingo-1-6-hdi-2017-airco-sd-3-zits-v-a-179-p-m.html
/a/auto-s/hyundai/m1458105451-hyundai-atos-gezocht-hoge-prijs-tel-0653222206.html
/a/auto-s/volkswagen/m1458103618-volkswagen-polo-1-4-tsi-132kw-dsg-2012-wit.html
/a/auto-s/vrachtwagens/m1458101965-scania-torpedo.html
/a/auto-s/toyota/m1458101624-toyota-yaris-1-0-12v-vvt-i-aspiration-5dr-2012.html
/a/auto-s/dodge/a1279447576-5-jaar-ram-dealer-garantie-en-historie-bekijk-onze-website.html
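One thing worth checking if the results do not match the filtered page in your browser: everything after the # in the URL is a fragment, which is kept on the client and is not sent to the server in an HTTP request. The sketch below uses only the standard library to show which part of the URL a plain request actually uses. That the site applies these filters in the browser with JavaScript is an assumption that follows from this, not something verified here.

```python
from urllib.parse import urlsplit

url = ("https://www.marktplaats.nl/l/auto-s/"
       "#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001"
       "|offeredSince:TODAY|searchInTitleAndDescription:true")

parts = urlsplit(url)

# What the server sees: scheme, host and path only.
requested = parts.scheme + "://" + parts.netloc + parts.path
print(requested)

# What stays on the client: the filter settings after '#'.
print(parts.fragment)
```

If the filters matter, they would have to be passed in a form the server understands, or the page rendered with a browser-driving tool.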
You can also build the full URL of each page by prepending 'https://www.marktplaats.nl' to li.a.get('href'). The whole code then looks like this:
import requests
from bs4 import BeautifulSoup

def parse_links(url):
    links = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for li in soup.find_all(class_="mp-Listing mp-Listing--list-item"):
        links.append('https://www.marktplaats.nl' + li.a.get('href'))
    return links

url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
links = parse_links(url)
print('\n'.join(map(str, links)))
It should produce output like this:
https://www.marktplaats.nl/a/auto-s/renault/a1302508082-mooi-renault-megane-scenic-1-6-16v-aut-2005-2003-groen-airco.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302359157-morris-minor-cabriolet-1970.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302743902-online-veiling-oldtimers-en-classic-cars-zedelgem-vavato.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302359138-mercedes-benz-g-500-guard-pantzer-1999.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1457703674-golf-6-1-2tsi-comfortline-bluemotion-77kw-2de-eigenaar.html
https://www.marktplaats.nl/a/auto-s/peugeot/m1457564187-peugeot-208-1-6-e-hdi-68kw-92pk-5-d-2014-zwart.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1457124365-volkswagen-touareg-3-2-v6-177kw-4motion-aut-2004-grijs.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1456753596-volkswagen-golf-vii-2-0-tdi-highline-150pk-xenon-trekhaak.html
https://www.marktplaats.nl/a/auto-s/volkswagen/a1279696849-vw-take-up-5-d-radio-airco-private-lease.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m940111355-bus-verkopen-bestelauto-inkoop-bestelwagen-opkoper-rdw.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1456401063-volkswagen-golf-1-6-74kw-2000-zwart.html
https://www.marktplaats.nl/a/auto-s/renault/m1456242548-renault-espace-2-0-dci-110kw-e4-2006-zwart.html
https://www.marktplaats.nl/a/auto-s/nissan/m1448699345-nissan-qashqai-1-5-dci-connect-2011-grijs-panoramadak.html
https://www.marktplaats.nl/a/auto-s/citroen/a1277007710-citroen-c1-feel-5-d-airco-private-lease-vanaf-189-euro-mnd.html
https://www.marktplaats.nl/a/auto-s/bmw/m1452641019-bmw-5-serie-2-0-520d-touring-aut-2014-grijs.html
https://www.marktplaats.nl/a/auto-s/mercedes-benz/m1448671698-mercedes-benz-a-klasse-a250-amg-224pk-7g-dct-panoramadak-wid.html
https://www.marktplaats.nl/a/auto-s/bmw/m1455671862-bmw-3-serie-2-0-i-320-cabrio-aut-2007-bruin.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1455562699-volkswagen-transporter-kmstand-151-534-2-5-tdi-65kw-2002.html
https://www.marktplaats.nl/a/auto-s/peugeot/a1298813052-private-lease-occasion-outlet-prive-lease.html
https://www.marktplaats.nl/a/auto-s/audi/m1458114563-audi-a4-2-0-tfsi-132kw-avant-multitronic-nl-auto.html
https://www.marktplaats.nl/a/auto-s/mercedes-benz/m1452983872-mercedes-a-klasse-2-0-cdi-a200-5drs-aut-2007-grijs.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1457499260-renault-master-l3h2-2018-airco-camera-cruise-laadruimte-12.html
https://www.marktplaats.nl/a/auto-s/infiniti/m1458111256-infiniti-q50-3-5-hybrid-awd-2016-grijs.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/a1001658686-200-nw-en-gebruikte-bestelwagens-personenbusjes-pick-ups.html
https://www.marktplaats.nl/a/auto-s/ford/m1458111166-ford-ka-1-3-i-44kw-2007-zwart.html
https://www.marktplaats.nl/a/auto-s/land-rover/m1458110209-land-rover-discovery-4-3-0-tdv6-2010-grijs.html
https://www.marktplaats.nl/a/auto-s/bmw/m1455389317-bmw-320i-e46-sedan-bieden.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1457161395-renault-master-t35-2-3-dci-l3h2-130-pk-navi-airco-camera-pdc.html
https://www.marktplaats.nl/a/auto-s/renault/a1302508082-mooi-renault-megane-scenic-1-6-16v-aut-2005-2003-groen-airco.html
https://www.marktplaats.nl/a/auto-s/ford/m1457306473-ford-galaxy-2-0-tdci-85kw-dpf-2011-blauw.html
https://www.marktplaats.nl/a/auto-s/peugeot/m1456912876-peugeot-407-2-0-16v-sw-2006-grijs.html
https://www.marktplaats.nl/a/auto-s/hyundai/m1458105451-hyundai-atos-gezocht-hoge-prijs-tel-0653222206.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1458103618-volkswagen-polo-1-4-tsi-132kw-dsg-2012-wit.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302743902-online-veiling-oldtimers-en-classic-cars-zedelgem-vavato.html
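String concatenation works here because every href starts with a slash. A slightly more defensive sketch uses urljoin from the standard library, which also copes with hrefs that are already absolute. The second href below is a made-up example for illustration.

```python
from urllib.parse import urljoin

base = "https://www.marktplaats.nl"
hrefs = [
    "/a/auto-s/ford/m1458111166-ford-ka-1-3-i-44kw-2007-zwart.html",
    "https://www.marktplaats.nl/a/auto-s/example.html",  # hypothetical absolute href
]

# urljoin resolves relative hrefs against the base and leaves
# absolute URLs untouched.
links = [urljoin(base, href) for href in hrefs]
print("\n".join(links))
```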
Good luck!
I am just a beginner at Python.
I am trying to scrape data from a site and have managed to write the code below.
However, I am not sure how to proceed, as I am unable to get the href attributes so that I can go to each listing and get the data. I am also not very familiar with HTML tags, so I suspect that I have not identified the tags properly.
Here is my code:
import requests
from bs4 import BeautifulSoup

urls = []
for i in range(1,5):
    pages = "https://directory.singaporefintech.org/?p={0}&category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'sabai-directory-title'})
    hrefs = [link['href'] for link in links]
The above code is producing hrefs as a blank list.
Any help would be highly appreciated!!
Thanks!!!
The code is fine; the class you're looking for just doesn't exist on those pages. For example, I substituted the comment-reply-link class for sabai-directory-title after inspecting https://directory.singaporefintech.org/hello-world/?category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random and got results when I added print statements.
You can scrape the links using a CSS selector. The selector div.sabai-directory-title a finds any <a> tag inside a <div> tag with the class sabai-directory-title (I updated the URL, as yours was giving me error pages):
from bs4 import BeautifulSoup
import requests
from pprint import pprint
r = requests.get('https://directory.singaporefintech.org/')
soup = BeautifulSoup(r.text, 'lxml')
hrefs = [a['href'] for a in soup.select('div.sabai-directory-title a')]
pprint(hrefs)
This will print:
['https://directory.singaporefintech.org/directory/listing/silent-eight',
'https://directory.singaporefintech.org/directory/listing/incomlend',
'https://directory.singaporefintech.org/directory/listing/bizgrow',
'https://directory.singaporefintech.org/directory/listing/makerscut',
'https://directory.singaporefintech.org/directory/listing/soho-fintech',
'https://directory.singaporefintech.org/directory/listing/dxmarkets',
'https://directory.singaporefintech.org/directory/listing/fundrevo',
'https://directory.singaporefintech.org/directory/listing/money4money',
'https://directory.singaporefintech.org/directory/listing/onelyst',
'https://directory.singaporefintech.org/directory/listing/hearti-lab',
'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/ceo-1',
'https://directory.singaporefintech.org/directory/listing/arcadier',
'https://directory.singaporefintech.org/directory/listing/plmp-fintech-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/cash-in-asia',
'https://directory.singaporefintech.org/directory/listing/grc-systems',
'https://directory.singaporefintech.org/directory/listing/sendexpense',
'https://directory.singaporefintech.org/directory/listing/jinjerjade',
'https://directory.singaporefintech.org/directory/listing/hatcher',
'https://directory.singaporefintech.org/directory/listing/fintech-consortium']
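To see why the original find_all('a', attrs={'class': 'sabai-directory-title'}) came back empty, here is a small offline sketch. The HTML fragment and the link texts are assumptions that only mimic the structure implied by the selector above, where the class sits on the <div> and not on the <a>.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the class is on the <div>, the href on the nested <a>.
html = """
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/silent-eight">Silent Eight</a>
</div>
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/incomlend">Incomlend</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Looks for <a class="sabai-directory-title">, which does not exist here.
by_anchor_class = soup.find_all("a", attrs={"class": "sabai-directory-title"})

# Looks for <a> tags nested inside <div class="sabai-directory-title">.
hrefs = [a["href"] for a in soup.select("div.sabai-directory-title a")]

print(by_anchor_class)
print(hrefs)
```

The first lookup finds nothing, while the selector returns both listing URLs.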
Hi, I have made a few changes to the code:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

urls = []
for i in range(1,5):
    pages = "https://directory.singaporefintech.org"
    urls.append(pages)

Data = []
hrefs = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    # the class is on the <div>, so find the divs first, then the links inside
    links = soup.find_all('div', attrs ={'class' :'sabai-directory-title'})
    for link in links:
        # in Python 3, keep the href as str (encoding it would yield bytes)
        Data.extend([a['href'] for a in link.find_all('a', href=True) if a.text])
pprint(Data)
output:
['https://directory.singaporefintech.org/directory/listing/silent-eight',
'https://directory.singaporefintech.org/directory/listing/moolahsense',
'https://directory.singaporefintech.org/directory/listing/myfinb',
'https://directory.singaporefintech.org/directory/listing/wefinance',
'https://directory.singaporefintech.org/directory/listing/quber',
'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/ceo-1',
'https://directory.singaporefintech.org/directory/listing/acekards',
'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/fundmylife',
'https://directory.singaporefintech.org/directory/listing/mooments',
'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/junotele_',
'https://directory.singaporefintech.org/directory/listing/mobilecover',
'https://directory.singaporefintech.org/directory/listing/cherrypay',
'https://directory.singaporefintech.org/directory/listing/toast',
'https://directory.singaporefintech.org/directory/listing/cashdab',
'https://directory.singaporefintech.org/directory/listing/silent-eight',
'https://directory.singaporefintech.org/directory/listing/moolahsense',
'https://directory.singaporefintech.org/directory/listing/myfinb',
'https://directory.singaporefintech.org/directory/listing/wefinance',
'https://directory.singaporefintech.org/directory/listing/quber',
'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/ceo-1',
'https://directory.singaporefintech.org/directory/listing/acekards',
'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/fundmylife',
'https://directory.singaporefintech.org/directory/listing/mooments',
'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/junotele_',
'https://directory.singaporefintech.org/directory/listing/mobilecover',
'https://directory.singaporefintech.org/directory/listing/cherrypay',
'https://directory.singaporefintech.org/directory/listing/toast',
'https://directory.singaporefintech.org/directory/listing/cashdab',
'https://directory.singaporefintech.org/directory/listing/silent-eight',
'https://directory.singaporefintech.org/directory/listing/moolahsense',
'https://directory.singaporefintech.org/directory/listing/myfinb',
'https://directory.singaporefintech.org/directory/listing/wefinance',
'https://directory.singaporefintech.org/directory/listing/quber',
'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/ceo-1',
'https://directory.singaporefintech.org/directory/listing/acekards',
'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/fundmylife',
'https://directory.singaporefintech.org/directory/listing/mooments',
'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/junotele_',
'https://directory.singaporefintech.org/directory/listing/mobilecover',
'https://directory.singaporefintech.org/directory/listing/cherrypay',
'https://directory.singaporefintech.org/directory/listing/toast',
'https://directory.singaporefintech.org/directory/listing/cashdab',
'https://directory.singaporefintech.org/directory/listing/silent-eight',
'https://directory.singaporefintech.org/directory/listing/moolahsense',
'https://directory.singaporefintech.org/directory/listing/myfinb',
'https://directory.singaporefintech.org/directory/listing/wefinance',
'https://directory.singaporefintech.org/directory/listing/quber',
'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/ceo-1',
'https://directory.singaporefintech.org/directory/listing/acekards',
'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/fundmylife',
'https://directory.singaporefintech.org/directory/listing/mooments',
'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
'https://directory.singaporefintech.org/directory/listing/junotele_',
'https://directory.singaporefintech.org/directory/listing/mobilecover',
'https://directory.singaporefintech.org/directory/listing/cherrypay',
'https://directory.singaporefintech.org/directory/listing/toast',
'https://directory.singaporefintech.org/directory/listing/cashdab']
Is this the output you are expecting? Note that each entry appears four times because the loop requests the same URL four times.
Hope it helps!!
How do I get the 'things to do' list? I am new to web scraping and I don't know how to loop through each page to get the href of every 'thing to do'. Where am I going wrong? Any help would be highly appreciated. Thanks in advance.
import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

offset = 0
url = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
urls = []
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)

for offset in range(0, last_offset, 30):
    print('--- page offset:', offset, '---')
    url = 'https://www.tripadvisor.com/Attractions-g255057-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    for link in soup.find_all('a', {'property_title'}):
        iurl = 'https://www.tripadvisor.com/Attraction_Review-g255057' + link.get('href')
        print(iurl)
Basically, I want the href of each 'thing to do'.
My desired output for 'things to do' is:
https://www.tripadvisor.com/Attraction_Review-g255057-d3377852-Reviews-Weston_Park-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Attraction_Review-g255057-d591972-Reviews-Canberra_Museum_and_Gallery-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Attraction_Review-g255057-d312426-Reviews-Lanyon_Homestead-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Attraction_Review-g255057-d296666-Reviews-Australian_National_University-Canberra_Australian_Capital_Territory.html
For comparison, I used the code below to get the href of each restaurant in Canberra.
My code for restaurants, which works perfectly, is:
import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

with requests.Session() as session:
    for offset in range(0, 1050, 30):
        url = 'https://www.tripadvisor.com/Restaurants-g255057-oa{0}-Canberra_Australian_Capital_Territory.html#EATERY_LIST_CONTENTS'.format(offset)
        soup = BeautifulSoup(session.get(url).content, "html.parser")
        for link in soup.select('a.property_title'):
            iurl = 'https://www.tripadvisor.com/' + link.get('href')
            print(iurl)
The output of the restaurant code is:
https://www.tripadvisor.com/Restaurant_Review-g255057-d1054676-Reviews-Lanterne_Rooms-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Restaurant_Review-g255057-d755055-Reviews-Courgette_Restaurant-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Restaurant_Review-g255057-d6893178-Reviews-Pomegranate-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Restaurant_Review-g255057-d7262443-Reviews-Les_Bistronomes-Canberra_Australian_Capital_Territory.html
.
.
.
.
OK, it's not that hard; you just have to know which tags to use.
Let me explain with this example:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.tripadvisor.com/' ## we need this to join the links later ##
main_page = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa{}-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
links = []

## get the initial page to find the number of pages ##
r = requests.get(main_page.format(0))
soup = BeautifulSoup(r.text, "html.parser")
## select the last page from the list of pages ('a', {'class':'pageNum taLnk'}) ##
last_page = max([ int(page.get('data-offset')) for page in soup.find_all('a', {'class':'pageNum taLnk'}) ])

## now iterate over that range (first page, last page, number of links), and extract the links from each page ##
for i in range(0, last_page + 30, 30):
    page = main_page.format(i)
    soup = BeautifulSoup(requests.get(page).text, "html.parser") ## get the next page and parse it with BeautifulSoup ##
    ## get the hrefs from ('div', {'class':'listing_title'}), and join them with base_url to make the links ##
    links += [ base_url + link.find('a').get('href') for link in soup.find_all('div', {'class':'listing_title'}) ]

for link in links:
    print(link)
That gives us 8 pages and 212 links in total (30 on each page, 2 on the last).
I hope this clears things up a bit
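The paging arithmetic can be checked without any network access. This sketch assumes the largest data-offset found on the first page is 210, which is the value consistent with the 8 pages mentioned above; it is not read from the live site.

```python
last_page = 210  # assumed largest data-offset on the first page
per_page = 30

# Same range as in the loop above: the stop value is last_page + 30
# so that the final offset itself is included.
offsets = list(range(0, last_page + per_page, per_page))
print(offsets)
print(len(offsets))

# 7 full pages of 30 links plus 2 links on the last page.
total_links = (len(offsets) - 1) * per_page + 2
print(total_links)
```

Using range(0, last_page, 30) instead would stop one page short, which is an easy off-by-one to make here.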
I need to write code that finds a word in scraped image links.
To explain: starting from a sitemap.xml page, my code must visit every link in the XML file and check whether any image link on that page contains a specific word.
The sitemap is from adidas: http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml
This is the code I wrote to find images whose link contains the word "zoom":
import requests
from bs4 import BeautifulSoup

html = requests.get(
    'http://www.adidas.it/scarpe-superstar/C77124.html').text
bs = BeautifulSoup(html, 'html.parser')
possible_links = bs.find_all('img')
for link in possible_links:
    # has_attr covers the check that the deprecated has_key used to do
    if link.has_attr('src'):
        if 'zoom' in link['src']:
            print(link['src'])
But I am looking for a way to process the whole list automatically.
Thank you so much.
I tried this to get the list:
from bs4 import BeautifulSoup
import requests

url = "http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
# each <loc> element in the sitemap holds one page URL
for loc in soup.find_all("loc"):
    print(loc.text)
But I can't work out how to attach the request to it.
I want to find the word "zoom" in every link listed in sitemap.xml.
Thank you so much.
import requests
from bs4 import BeautifulSoup
import re

def make_soup(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup

# put urls in a list
def get_xml_urls(soup):
    urls = [loc.string for loc in soup.find_all('loc')]
    return urls

# get the img urls
def get_src_contain_str(soup, string):
    srcs = [img['src'] for img in soup.find_all('img', src=re.compile(string))]
    return srcs

if __name__ == '__main__':
    xml = 'http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml'
    soup = make_soup(xml)
    urls = get_xml_urls(soup)
    # loop through the urls
    for url in urls:
        url_soup = make_soup(url)
        srcs = get_src_contain_str(url_soup, 'zoom')
        print(srcs)
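The src=re.compile(string) filter can be tried offline. In this sketch the HTML and the image URLs are made up for illustration; only the find_all call mirrors the code above. Note that the pattern is case-sensitive unless re.IGNORECASE is passed to re.compile.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical product page fragment.
html = """
<img src="https://example.com/images/zoom/C77124_01.jpg">
<img src="https://example.com/images/thumb/C77124_01.jpg">
<img alt="no src attribute">
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex as the attribute value is matched with a search,
# so any src containing 'zoom' is kept; tags without src are skipped.
srcs = [img["src"] for img in soup.find_all("img", src=re.compile("zoom"))]
print(srcs)
```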