I'm trying to get the number of pages from a webpage, along with their links, but for some of the webpages there is no href on the page number. I've tried to handle this with an if statement, but it still returns the error.
The aim was: if the page count is not present, just assign it the value one. I'm pretty inexperienced, so I'd appreciate some support on this.
It seems to fail on the seventh page:
final_data = []
for m in range(0, 10):
    df = {'data': []}
    driver.get(links_countries['links'][m])
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    pages = soup.select("#yw2 a")
    pag = []
    for p in pages:
        pag.append(p.get_text(strip=True).replace('', ''))
    pag = [string for string in pag if string != ""]
    if int(pag[-1]) < 1:
        int(1)
    else:
        continue
    print('Page number', pag)
    href = []
    for t in pages:
        href.append(t['href'])
    href = [string for string in href if string != ""]
    urls = "https://www.transfermarkt.co.uk" + href[0]
    print(urls)
The output below was produced by replacing the if statement with pag = int(pag[-1]):
Output:
Page number 2
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/1/kontinent_id/0/yt0/Show/0/
Page number 10
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/2/kontinent_id/0/yt0/Show/0/
Page number 10
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/3/kontinent_id/0/yt0/Show/0/
Page number 10
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/4/kontinent_id/0/yt0/Show/0/
Page number 4
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/5/kontinent_id/0/yt0/Show/0/
Page number 5
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/6/kontinent_id/0/yt0/Show/0/
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-199-1fd034d415eb> in <module>
10 pag.append(p.get_text(strip=True).replace('', ''))
11 pag = [string for string in pag if string != ""]
---> 12 pag = int(pag[-1])
13 print('Page number', pag)
14 href = []
IndexError: list index out of range
Take a look at the error message.
IndexError: list index out of range
The reason you get the above error is that your list pag is empty, so there is no last element to index with pag[-1].
If you want to skip the page, use a check like this instead of handling the error:
if pag:
    pag = int(pag[-1])
else:
    continue
The statement if pag is equivalent to if len(pag) > 0. However, I follow the Google Python Style Guide, which prefers the implicit truth test, so I'm going to stick with if pag.
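If your real goal is to treat pages without pagination links as a single page rather than skipping them, a small variation of the same check would do it (just a sketch, assuming an empty pag means the country only has one page of results):

if pag:
    pag = int(pag[-1])
else:
    # no pagination links were found, so treat this country as a single page
    pag = 1
print('Page number', pag)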
Managed to at least skip the url with:
try:
    pag = int(pag[-1])
except:
    continue
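A small refinement: catching IndexError specifically avoids a bare except silently swallowing unrelated errors, and you could assign the default page count there instead of skipping:

try:
    pag = int(pag[-1])
except IndexError:
    # no pagination links found; assume a single page instead of skipping
    pag = 1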
Related
I'm trying to scrape Yellow Pages, but my code only picks up the first business of each page and skips every other business on the page, e.g. the 1st company of page 1, the 1st company of page 2, and so on.
I have no clue why it isn't iterating through the 'web_page' variable first, then checking for additional pages, and finally looking for the closing condition and executing break.
If anyone can provide me with clues or help it would be highly appreciated!
web_page_results = []

def yellow_pages_scraper(search_term, location):
    page = 1

    while True:
        url = f'https://www.yellowpages.com/search?search_terms={search_term}&geo_location_terms={location}&page={page}'
        r = requests.get(url, headers=headers)
        soup = bs(r.content, 'html.parser')

        web_page = soup.find_all('div', {'class':'search-results organic'})

        for business in web_page:
            business_dict = {}

            try:
                business_dict['name'] = business.find('a', {'class':'business-name'}).text
                print(f'{business_dict["name"]}')
            except AttributeError:
                business_dict['name'] = ''

            try:
                business_dict['street_address'] = business.find('div', {'class':'street-address'}).text
            except AttributeError:
                business_dict['street_address'] = ''

            try:
                business_dict['locality'] = business.find('div', {'class':'locality'}).text
            except AttributeError:
                business_dict['locality'] = ''

            try:
                business_dict['phone'] = business.find('div', {'class':'phones phone primary'}).text
            except AttributeError:
                business_dict['phone'] = ''

            try:
                business_dict['website'] = business.find('a', {'class':'track-visit-website'})['href']
            except AttributeError:
                business_dict['website'] = ''

            try:
                web_page_results.append(business_dict)
                print(web_page_results)
            except:
                print('saving not working')

        # If the last iterated page doesn't find the "next page" button, break the loop and return the list
        if not soup.find('a', {'class': 'next ajax-page'}):
            break

        page += 1

    return web_page_results
It's worth looking at this line:
web_page = soup.find_all('div', {'class':'search-results organic'})
When I go to the request URL, I can only find one instance of search-results organic on the page. You then iterate over the list (web_page), but there will only be one value in it. So when you do the for loop:
for business in web_page:
you will only ever execute it once, because there is a single item in the list, and therefore you only get the first result on the page.
You need to loop through the list of businesses on the page, not the container holding the business listings. I recommend creating a list from class='srp-listing':
web_page = soup.find_all('div', {'class':'srp-listing'})
This should give you a list of all the businesses on the page. When you iterate over this new list of businesses, you will go through more than just the one listing.
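Something like this should then give you every listing (a sketch that reuses your existing field extraction, and assumes the srp-listing class is still present on the site):

web_page = soup.find_all('div', {'class': 'srp-listing'})

for business in web_page:
    business_dict = {}
    try:
        business_dict['name'] = business.find('a', {'class': 'business-name'}).text
    except AttributeError:
        business_dict['name'] = ''
    # ... extract street_address, locality, phone and website exactly as before ...
    web_page_results.append(business_dict)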
I'm trying to write a scraper that randomly chooses a wiki article link from a page, goes there, grabs another, and loops. I want to exclude links with "Category:", "File:", or "List" in the href. I'm pretty sure the links I want are all inside p tags, but when I include "p" in find_all, I get an "int object is not subscriptable" error.
The code below returns wiki pages but does not exclude the things I want to filter out.
This is a learning journey for me. All help is appreciated.
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")
    print(title.text)
    print(url)

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        # Here I am trying to select hrefs with /wiki/ in them and exclude hrefs with
        # "Category:" etc. It does select for wikis but does not exclude anything.
        if link['href'].find("/wiki/") == -1:
            if link['href'].find("Category:") == 1:
                if link['href'].find("File:") == 1:
                    if link['href'].find("List") == 1:
                        continue

        # Use this link to scrape
        linkToScrape = link
        articleTitles = open("savedArticles.txt", "a+")
        articleTitles.write(title.text + ", ")
        articleTitles.close()
        time.sleep(6)
        break

    scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
You need to modify the for loop. .attrs is used to access the attributes of a tag. If you want to exclude links whose href value contains a particular keyword, use a != -1 comparison.
Modified code:
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        if("href" in link.attrs):
            if link.attrs['href'].find("/wiki/") == -1 or link.attrs['href'].find("Category:") != -1 or link.attrs['href'].find("File:") != -1 or link.attrs['href'].find("List") != -1:
                continue

            linkToScrape = link
            articleTitles = open("savedArticles.txt", "a+")
            articleTitles.write(title.text + ", ")
            articleTitles.close()
            time.sleep(6)
            break

    if(linkToScrape):
        scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape.attrs['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
This section seems problematic.
if link['href'].find("/wiki/") == -1:
    if link['href'].find("Category:") == 1:
        if link['href'].find("File:") == 1:
            if link['href'].find("List") == 1:
                continue
find returns the index of the substring you are looking for, or -1 if it is not found, so the == 1 comparisons don't do what you want; the nesting is also wrong. The check should be: if /wiki/ is not found, or Category:, File:, etc. does appear in the href, then continue.
if link['href'].find("/wiki/") == -1 or \
        link['href'].find("Category:") != -1 or \
        link['href'].find("File:") != -1 or \
        link['href'].find("List") != -1:
    print("skipped " + link["href"])
    continue
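Equivalently, you could use substring membership with in, which reads a little more clearly than chained find() calls (just a sketch of the same check):

href = link['href']
# skip anything that is not a /wiki/ link, or that contains an excluded keyword
if "/wiki/" not in href or any(bad in href for bad in ("Category:", "File:", "List")):
    print("skipped " + href)
    continue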
Saint Petersburg
https://en.wikipedia.org/wiki/St._Petersburg
National Diet Library
https://en.wikipedia.org/wiki/NDL_(identifier)
Template talk:Authority control files
https://en.wikipedia.org/wiki/Template_talk:Authority_control_files
skipped #searchInput
skipped /w/index.php?title=Template_talk:Authority_control_files&action=edit&section=1
User: Tom.Reding
https://en.wikipedia.org/wiki/User:Tom.Reding
skipped http://toolserver.org/~dispenser/view/Main_Page
Iapetus (moon)
https://en.wikipedia.org/wiki/Iapetus_(moon)
87 Sylvia
https://en.wikipedia.org/wiki/87_Sylvia
skipped /wiki/List_of_adjectivals_and_demonyms_of_astronomical_bodies
Asteroid belt
https://en.wikipedia.org/wiki/Main_asteroid_belt
Detached object
https://en.wikipedia.org/wiki/Detached_object
Use :not() to handle the list of exclusions within the href, alongside the * contains operator; this will filter out hrefs containing the specified substrings. Precede this with an attribute contains selector, [href*="/wiki/"]. I have specified a case-insensitive match via i for the first two exclusions, which can be removed:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money')
soup = bs(r.content, 'lxml') # 'html.parser'
links = [i['href'] for i in soup.select('#bodyContent a[href*="/wiki/"]:not([href*="Category:" i], [href*="File:" i], [href*="List"])')]
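From there, if you want to continue the random walk, something like this should work (a sketch; next_url is just an illustrative name, and links comes from the selector above):

import random

# pick one of the filtered /wiki/ links at random to visit next
if links:
    next_url = "https://en.wikipedia.org" + random.choice(links)
    print(next_url)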
I am trying to conduct some NLP on subreddit pages. I have a chunk of code that gathers data from two web pages. It scrapes data until it reaches the end of range(40). This would be fine, except I know that the subreddits I have chosen have more posts than my code is allowing me to scrape.
Could anyone figure out what is going on here?
posts_test = []

url = 'https://www.reddit.com/r/TheOnion/.json?after='

for i in range(40):
    res = requests.get(url, headers={'User-agent': 'Maithili'})
    the_onion = res.json()
    for i in range(25):
        post_t = []
        post_t.append(the_onion['data']['children'][i]['data']['title'])
        post_t.append(the_onion['data']['children'][i]['data']['subreddit'])
        posts_test.append(post_t)
    after = the_onion['data']['after']
    url = 'https://www.reddit.com/r/TheOnion/.json?after=' + after
    time.sleep(3)

# Not the onion
url = 'https://www.reddit.com/r/nottheonion/.json?after='

for i in range(40):
    res3 = requests.get(url, headers=headers2)
    not_onion_json = res2.json()
    for i in range(25):
        post_t = []
        post_t.append(not_onion_json['data']['children'][i]['data']['title'])
        post_t.append(not_onion_json['data']['children'][i]['data']['subreddit'])
        posts_test.append(post_t)
    after = not_onion_json['data']['after']
    url = "https://www.reddit.com/r/nottheonion/.json?after=" + after
    time.sleep(3)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-57-6c1cfdd42421> in <module>
7 for i in range(25):
8 post_t = []
----> 9 post_t.append(the_onion['data']['children'][i]['data']['title'])
10 post_t.append(the_onion['data']['children'][i]['data']['subreddit'])
11 posts_test.append(post_t)
IndexError: list index out of range
The reason you are stopping after 40 pages is that you are telling Python to stop at 40:
for i in range(40):
The good news is that you are already collecting the cursor for the next page here:
after = not_onion_json['data']['after']
On the assumption that after is None once you reach the last page, I would suggest using a while loop instead. Something like:
while after != None:
That will continue until you get to the end.
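A rough sketch of that for the first subreddit, assuming the response shape from your code (data.children and data.after) and that after comes back as None on the last page:

import time
import requests

posts_test = []
base = 'https://www.reddit.com/r/TheOnion/.json?after='
after = ''

while after is not None:
    res = requests.get(base + after, headers={'User-agent': 'Maithili'})
    the_onion = res.json()
    # iterate over however many posts the page actually returned,
    # instead of assuming there are always 25
    for child in the_onion['data']['children']:
        posts_test.append([child['data']['title'], child['data']['subreddit']])
    after = the_onion['data']['after']   # None once the last page is reached
    time.sleep(3)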
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list")
c = r.content
soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "col _2-gKeQ"})

page_nr = soup.find_all("a", {"class": "_33m_Yg"})[-1].text
print(page_nr, "number of pages were found")

#all[0].find("div",{"class":"_1vC4OE _2rQ-NK"}).text

l = []
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list"

for page in range(0, int(page_nr)*10, 10):
    print()
    r = requests.get(base_url + str(page) + ".html")
    c = r.content
    #c = r.json()["list"]
    soup = BeautifulSoup(c, "html.parser")
    for item in all:
        d = {}
        #price
        d["Price"] = item.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
        #Name
        d["Name"] = item.find("div", {"class": "_3wU53n"}).text
        for li in item.find_all("li", {"class": "_1ZRRx1"}):
            if " EMI" in li.text:
                d["EMI"] = li.text
            else:
                d["EMI"] = None
        for li1 in item.find_all("li", {"class": "_1ZRRx1"}):
            if "Special " in li1.text:
                d["Special Price"] = li1.text
            else:
                d["Special Price"] = None
        for val in item.find_all("li", {"class": "tVe95H"}):
            if "Display" in val.text:
                d["Display"] = val.text
            elif "Warranty" in val.text:
                d["Warrenty"] = val.text
            elif "RAM" in val.text:
                d["Ram"] = val.text
        l.append(d)

import pandas
df = pandas.DataFrame(l)
This might work on standard pagination:

i = 1
items_parsed = set()
loop = True
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list"

while True:
    page = requests.get(base_url.format(i))
    soup = BeautifulSoup(page.content, "html.parser")
    items = soup.select("#yourelements#")   # replace with the selector for your product elements
    if not items:
        break

    for item in items:
        # scrape your item; once it has been scraped successfully, put the URL
        # of the parsed item into url_parsed (details below the code), for example:
        url_parsed = your_stuff(items)
        if url_parsed in items_parsed:
            loop = False
        items_parsed.add(url_parsed)

    if not loop:
        break

    i += 1
I formatted your URL so that ?page=X is filled in by base_url.format(i), letting it iterate until no items are found on a page; note that some sites send you back to page 1 once you go past max_page + 1.
If, above the maximum page, you get items you already parsed on the first page, you can declare a set(), add the URL of every item you parse, and then check whether you have already parsed it.
Note that this is just an idea.
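As a rough illustration of that idea against the markup from your own code (the col _2-gKeQ and _3wU53n classes come from your question and may well have changed on the live site):

import requests
from bs4 import BeautifulSoup

base_url = ("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on"
            "&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list")

seen_names = set()
page = 1

while True:
    r = requests.get(base_url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    items = soup.find_all("div", {"class": "col _2-gKeQ"})
    if not items:
        break                      # no products on this page: stop

    stop = False
    for item in items:
        name_tag = item.find("div", {"class": "_3wU53n"})
        if name_tag is None:
            continue
        if name_tag.text in seen_names:
            stop = True            # wrapped back onto a page we already saw
            break
        seen_names.add(name_tag.text)

    if stop:
        break
    page += 1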
Since the page number in the URL is almost in the middle I'd apply a similar change to your code:
base_url="https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page="
end_url ="&q=laptop&sid=6bo%2Fb5g&viewType=list"
for page in range(1, page_nr + 1):
r=requests.get(base_url+str(page)+end_url+".html")
You only have access to the first 10 pages from the initial URL.
You can make a loop from "&page=1" to "&page=26".
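For example, something along these lines (just a sketch reusing the search URL from your code; adjust 26 to the actual last page):

import requests
from bs4 import BeautifulSoup

base_url = ("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on"
            "&otracker=start&q=laptop&sid=6bo%2Fb5g&viewType=list&page=")

for page in range(1, 27):              # pages 1 to 26
    r = requests.get(base_url + str(page))
    soup = BeautifulSoup(r.content, "html.parser")
    # extract the product details for this page as in the original code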
My problem is related to this answer.
I have the following code:
import urllib.request
from bs4 import BeautifulSoup

time = 0

html = urllib.request.urlopen("https://www.kramerav.com/de/Product/VM-2N").read()
html2 = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()

try:
    div = str(BeautifulSoup(html).select("div.large-image")[0])
    if(str(BeautifulSoup(html).select("div.large-image")[1]) != ""):
        div += str(BeautifulSoup(html).select("div.large-image")[1])
        time = time + 1
except IndexError:
    div = ""
    time = time + 1
finally:
    print(str(time) + div)
The page behind the variable html has two div elements with the class "large-image". The page behind html2 has only one.
With html the program works as intended, but if I switch to html2 the variable div ends up completely empty.
I would like to save the one div rather than saving nothing. How can I achieve this?
the variable div is going to be completely empty.
That's because your error handler assigned it the empty string.
Please don't use subscripts, conditionals, and handlers in that way. It would be more natural to iterate over the results of select() with for, building up a result list (or string).
Also, you should create soup = BeautifulSoup(html) just once, as that can be a fairly expensive operation, since it carefully parses a potentially long web page. With that, you could build up a list of HTML fragments with:
images = [str(image)
          for image in soup.select('div.large-image')]
Or if for some reason you're not fond of list comprehensions, you could equivalently write:
images = []
for image in soup.select('div.large-image'):
    images.append(str(image))
and then get the required html with div = '\n'.join(images).
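Putting that together, a small self-contained sketch (using the second product URL from your question, which has only one large-image div):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()
soup = BeautifulSoup(html, "html.parser")

# collect however many large-image divs exist: zero, one, or many
images = [str(image) for image in soup.select("div.large-image")]
time = len(images)
div = '\n'.join(images)
print(str(time) + div)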
You can concatenate all items inside a for loop:
all_divs = soup.select("div.large-image")

for item in all_divs:
    div += str(item)
    time += 1
or using join()
time = len(all_divs)
div = ''.join(str(item) for item in all_divs)
You can also write to the file directly inside the for loop, so each div goes into its own row:
for item in all_divs:
    csv_writer.writerow([str(item).strip()])
    time += 1
Working example
import urllib.request
from bs4 import BeautifulSoup
import csv

div = ""
time = 0

f = open('output.csv', 'w')
csv_writer = csv.writer(f)

all_urls = [
    "https://www.kramerav.com/de/Product/VM-2N",
    "https://www.kramerav.com/de/Product/SDIA-IN2-F16",
]

for url in all_urls:
    print('url:', url)

    html = urllib.request.urlopen(url).read()

    try:
        soup = BeautifulSoup(html)
        all_divs = soup.select("div.large-image")

        for item in all_divs:
            div += str(item)
            time += 1

        # or

        time = len(all_divs)
        div = ''.join(str(item) for item in all_divs)

        # or

        for item in all_divs:
            #div += str(item)
            #time += 1
            csv_writer.writerow([time, str(item).strip()])

    except IndexError as ex:
        print('Error:', ex)
        time += 1

    finally:
        print(time, div)

f.close()