My problem is related to this answer.
I have the following code:
import urllib.request
from bs4 import BeautifulSoup

time = 0
html = urllib.request.urlopen("https://www.kramerav.com/de/Product/VM-2N").read()
html2 = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()

try:
    div = str(BeautifulSoup(html).select("div.large-image")[0])
    if(str(BeautifulSoup(html).select("div.large-image")[1]) != ""):
        div += str(BeautifulSoup(html).select("div.large-image")[1])
        time = time + 1
except IndexError:
    div = ""
    time = time + 1
finally:
    print(str(time) + div)
The page behind html has two divs with the class "large-image". The page behind html2 has only one.
With html the program works as intended, but if I switch to html2 the variable div ends up completely empty.
I would like to save the one div rather than saving nothing. How could I achieve this?
the variable div ends up completely empty.
That's because your IndexError handler assigned it the empty string.
Please don't use subscripts, conditionals, and handlers in that way. It would be more natural to iterate over the results of select() with for, building up a result list (or string).
Also, you should create soup = BeautifulSoup(html) just once, as that can be a fairly expensive operation, since it carefully parses a potentially long web page. With that, you could build up a list of HTML fragments with:
images = [str(image)
          for image in soup.select('div.large-image')]
Or, if for some reason you're not fond of list comprehensions, you could equivalently write:
images = []
for image in soup.select('div.large-image'):
    images.append(str(image))
and then get the required html with div = '\n'.join(images).
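Putting those pieces together, a minimal end-to-end sketch for the two URLs from the question could look like this (it parses each page only once and handles zero, one, or two matches the same way):

import urllib.request
from bs4 import BeautifulSoup

urls = [
    "https://www.kramerav.com/de/Product/VM-2N",
    "https://www.kramerav.com/de/Product/SDIA-IN2-F16",
]

for url in urls:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")      # parse each page only once

    images = [str(image) for image in soup.select("div.large-image")]
    div = "\n".join(images)                        # "" when nothing matched
    time = len(images)                             # how many large-image divs were found

    print(str(time) + div)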
You can concatenate all items inside a for loop:
all_divs = soup.select("div.large-image")

for item in all_divs:
    div += str(item)
    time += 1
or using join()
time = len(all_divs)
div = ''.join(str(item) for item in all_divs)
You can also write to a file directly inside the for loop, one row per item:
for item in all_divs:
    csv_writer.writerow([str(item).strip()])
    time += 1
Working example
import urllib.request
from bs4 import BeautifulSoup
import csv

div = ""
time = 0

f = open('output.csv', 'w')
csv_writer = csv.writer(f)

all_urls = [
    "https://www.kramerav.com/de/Product/VM-2N",
    "https://www.kramerav.com/de/Product/SDIA-IN2-F16",
]

for url in all_urls:
    print('url:', url)
    html = urllib.request.urlopen(url).read()

    try:
        soup = BeautifulSoup(html)
        all_divs = soup.select("div.large-image")

        for item in all_divs:
            div += str(item)
            time += 1

        # or

        time = len(all_divs)
        div = ''.join(str(item) for item in all_divs)

        # or

        for item in all_divs:
            #div += str(item)
            #time += 1
            csv_writer.writerow([time, str(item).strip()])

    except IndexError as ex:
        print('Error:', ex)
        time += 1

    finally:
        print(time, div)

f.close()
Related
I'm trying to write a scraper that randomly chooses a wiki article link from a page, goes there, grabs another, and keeps looping. I want to exclude links with "Category:", "File:", or "List" in the href. I'm pretty sure the links I want are all inside p tags, but when I include "p" in find_all, I get an "int object is not subscriptable" error.
The code below returns wiki pages but does not exclude the things I want to filter out.
This is a learning journey for me. All help is appreciated.
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find(id="firstHeading")
    print(title.text)
    print(url)

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        # Here I am trying to select hrefs with /wiki/ in them and exclude hrefs with "Category:" etc.
        # It does select for wikis but does not exclude anything.
        if link['href'].find("/wiki/") == -1:
            if link['href'].find("Category:") == 1:
                if link['href'].find("File:") == 1:
                    if link['href'].find("List") == 1:
                        continue

        # Use this link to scrape
        linkToScrape = link
        articleTitles = open("savedArticles.txt", "a+")
        articleTitles.write(title.text + ", ")
        articleTitles.close()
        time.sleep(6)
        break

    scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
You need to modify the for loop. .attrs is used to access the attributes of a tag. If you want to exclude links whose href value contains a particular keyword, use a != -1 comparison.
Modified code:
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        if("href" in link.attrs):
            if (link.attrs['href'].find("/wiki/") == -1
                    or link.attrs['href'].find("Category:") != -1
                    or link.attrs['href'].find("File:") != -1
                    or link.attrs['href'].find("List") != -1):
                continue

            linkToScrape = link
            articleTitles = open("savedArticles.txt", "a+")
            articleTitles.write(title.text + ", ")
            articleTitles.close()
            time.sleep(6)
            break

    if(linkToScrape):
        scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape.attrs['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
This section seems problematic.
if link['href'].find("/wiki/") == -1:
    if link['href'].find("Category:") == 1:
        if link['href'].find("File:") == 1:
            if link['href'].find("List") == 1:
                continue
find returns the index of the substring you are looking for (or -1 if it is not found); you are also using it incorrectly here.
So: if "/wiki/" is not found, or "Category:", "File:", etc. appear in the href, then continue:
if link['href'].find("/wiki/") == -1 or \
   link['href'].find("Category:") != -1 or \
   link['href'].find("File:") != -1 or \
   link['href'].find("List") != -1:
    print("skipped " + link["href"])
    continue
Saint Petersburg
https://en.wikipedia.org/wiki/St._Petersburg
National Diet Library
https://en.wikipedia.org/wiki/NDL_(identifier)
Template talk:Authority control files
https://en.wikipedia.org/wiki/Template_talk:Authority_control_files
skipped #searchInput
skipped /w/index.php?title=Template_talk:Authority_control_files&action=edit&section=1
User: Tom.Reding
https://en.wikipedia.org/wiki/User:Tom.Reding
skipped http://toolserver.org/~dispenser/view/Main_Page
Iapetus (moon)
https://en.wikipedia.org/wiki/Iapetus_(moon)
87 Sylvia
https://en.wikipedia.org/wiki/87_Sylvia
skipped /wiki/List_of_adjectivals_and_demonyms_of_astronomical_bodies
Asteroid belt
https://en.wikipedia.org/wiki/Main_asteroid_belt
Detached object
https://en.wikipedia.org/wiki/Detached_object
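As a side note (not part of the original answer), the same filter reads a little more naturally with Python's in operator; this sketch keeps exactly the same logic:

href = link.get('href', '')
if '/wiki/' not in href or any(word in href for word in ('Category:', 'File:', 'List')):
    print("skipped " + href)
    continue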
Use :not() to handle the list of exclusions within href, together with the * (contains) operator; this filters out hrefs containing the specified substrings. Precede this with an attribute-contains selector, [href*="/wiki/"], that requires /wiki/ to be present. I have specified a case-insensitive match via the i flag for the first two exclusions, which can be removed:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money')
soup = bs(r.content, 'lxml') # 'html.parser'
links = [i['href'] for i in soup.select('#bodyContent a[href*="/wiki/"]:not([href*="Category:" i], [href*="File:" i], [href*="List"])')]
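A quick sanity check of the result (purely illustrative):

print(len(links))   # number of links that survived the filter
print(links[:5])    # a small sample of them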
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list")
c = r.content
soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "col _2-gKeQ"})
page_nr = soup.find_all("a", {"class": "_33m_Yg"})[-1].text
print(page_nr, "number of pages were found")

#all[0].find("div",{"class":"_1vC4OE _2rQ-NK"}).text

l = []
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list"

for page in range(0, int(page_nr)*10, 10):
    print()
    r = requests.get(base_url + str(page) + ".html")
    c = r.content
    #c = r.json()["list"]
    soup = BeautifulSoup(c, "html.parser")

    for item in all:
        d = {}
        #price
        d["Price"] = item.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
        #Name
        d["Name"] = item.find("div", {"class": "_3wU53n"}).text
        for li in item.find_all("li", {"class": "_1ZRRx1"}):
            if " EMI" in li.text:
                d["EMI"] = li.text
            else:
                d["EMI"] = None
        for li1 in item.find_all("li", {"class": "_1ZRRx1"}):
            if "Special " in li1.text:
                d["Special Price"] = li1.text
            else:
                d["Special Price"] = None
        for val in item.find_all("li", {"class": "tVe95H"}):
            if "Display" in val.text:
                d["Display"] = val.text
            elif "Warranty" in val.text:
                d["Warrenty"] = val.text
            elif "RAM" in val.text:
                d["Ram"] = val.text
        l.append(d)

import pandas
df = pandas.DataFrame(l)
This might work for standard pagination:
i = 1
items_parsed = set()
loop = True
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list"

while True:
    page = requests.get(base_url.format(i))
    items = ...  # select your elements from the fetched page here

    if not items:
        break

    for item in items:
        # Scrape your item; once you have scraped it successfully, put the URL of the
        # parsed item into url_parsed (details below the code), for example:
        url_parsed = your_stuff(items)

        if url_parsed in items_parsed:
            loop = False

        items_parsed.add(url_parsed)

    if not loop:
        break

    i += 1
I parameterized your URL's ?page=X with base_url.format(i) so the loop can keep iterating until no items are found on the page. Sometimes, once you go past max_page, the site sends you back to page 1 instead.
If, above the maximum page, you get the items you already parsed on the first page, you can declare a set(), put the URL of every item you parse into it, and then check whether you have already parsed it.
Note that this is just an idea.
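To make that idea a bit more concrete, here is a rough sketch for the laptop listing; the class names ("col _2-gKeQ" for a result card, "_3wU53n" for the product name) are taken from the question's code and may well have changed on the live site:

import requests
from bs4 import BeautifulSoup

base_url = ("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on"
            "&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list")

seen = set()
i = 1
while True:
    page = requests.get(base_url.format(i))
    soup = BeautifulSoup(page.content, "html.parser")
    items = soup.find_all("div", {"class": "col _2-gKeQ"})    # result cards, per the question
    if not items:
        break                                   # nothing on this page -> past the last page
    stop = False
    for item in items:
        name_div = item.find("div", {"class": "_3wU53n"})     # product name, per the question
        if name_div is None:
            continue
        name = name_div.text
        if name in seen:                        # already parsed -> the site looped back
            stop = True
            break
        seen.add(name)
        # ... scrape the rest of the item here ...
    if stop:
        break
    i += 1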
Since the page number in the URL is almost in the middle I'd apply a similar change to your code:
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page="
end_url = "&q=laptop&sid=6bo%2Fb5g&viewType=list"

for page in range(1, int(page_nr) + 1):
    r = requests.get(base_url + str(page) + end_url)
You have access to only the first 10 pages from the initial URL.
You can make a loop from "&page=1" to "&page=26".
I've been using Beautiful Soup to iterate through pages, but for whatever reason I can't get the loop to advance beyond the first page. It seems like it should be easy because it's a text string, but it seems to loop back; maybe it's my structure rather than my text string?
Here's what I have:
import csv
import urllib2
from bs4 import BeautifulSoup

f = open('nhlstats.csv', "w")

groups = ['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015", "2014", "2013", "2012"]

for yr in year:
    for gr in groups:
        url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/" + str(yr)
        #www.espn.com/nhl/statistics/player/_/stat/points/year/2014/

    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")

    pagecount = soup.findAll(attrs={"class": "page-numbers"})[0].string
    pageliteral = int(pagecount[5:])

    for i in range(0, pageliteral):
        number = int(((i * 40) + 1))
        URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/" + str(yr) + "/count/" + str(number)
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")
        for tr in soup.select("#my-players-table tr[class*=player]"):
            row = []
            for ob in range(1, 15):
                player_info = tr('td')[ob].get_text(strip=True)
                row.append(player_info)
            f.write(str(yr) + "," + ",".join(row) + "\n")

f.close()
This gets the same first 40 records over and over.
I tried using this solution as an if and found that doing
prevLink = soup.select('a[rel="nofollow"]')[0]
newurl = "http:" + prevLink.get('href')
worked better, but I'm not sure how to structure the loop so that it keeps advancing. Possibly I'm just tired, but my loop there still just goes to the next set of records and gets stuck on that one. Please help me fix my loop.
Your code indentation was mostly at fault. Also, it would be wise to actually use the csv library you imported; it will automatically wrap the player names in quotes so that any commas inside them don't ruin the CSV structure.
This works by looking for the link to the next page and extracting the starting count, which is then used to build the request for the next page. If no next page can be found, it moves on to the next year/group. Note that the count is not a page count but a starting entry count.
import csv
import urllib2
from bs4 import BeautifulSoup

groups = ['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015", "2014", "2013", "2012"]

with open('nhlstats.csv', "wb") as f_output:
    csv_output = csv.writer(f_output)

    for yr in year:
        for gr in groups:
            start_count = 1
            while True:
                #print "{}, {}, {}".format(yr, gr, start_count)   # show progress
                url = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/{}/count/{}".format(yr, start_count)
                page = urllib2.urlopen(url)
                soup = BeautifulSoup(page, "html.parser")

                for tr in soup.select("#my-players-table tr[class*=player]"):
                    row = [yr]
                    for ob in range(1, 15):
                        player_info = tr('td')[ob].get_text(strip=True)
                        row.append(player_info)
                    csv_output.writerow(row)

                try:
                    start_count = int(soup.find(attrs={"class": "page-numbers"}).find_next('a')['href'].rsplit('/', 1)[1])
                except:
                    break
Using with will also automatically close your file at the end.
This would give you a csv file starting as follows:
2016,"Patrick Kane, RW",CHI,82,46,60,106,17,30,1.29,287,16.0,9,17,20
2016,"Jamie Benn, LW",DAL,82,41,48,89,7,64,1.09,247,16.6,5,17,13
2016,"Sidney Crosby, C",PIT,80,36,49,85,19,42,1.06,248,14.5,9,10,14
2016,"Joe Thornton, C",SJ,82,19,63,82,25,54,1.00,121,15.7,6,8,21
You are changing the URL many times before you are opening it the first time, due to an indentation error. Try this:
for gr in groups:
    url = "...some_url..."
    page = urllib2.urlopen(url)
    ...everything else should be indented....
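Concretely, applied to the question's code, the nesting would look roughly like this (a sketch only; note that the inner loop should also open the paginated URL rather than the original url, which the question's code did not do):

for yr in year:
    for gr in groups:
        url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/" + str(yr)
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")
        pagecount = soup.findAll(attrs={"class": "page-numbers"})[0].string
        pageliteral = int(pagecount[5:])
        for i in range(0, pageliteral):
            number = int((i * 40) + 1)
            URL = ("http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"
                   + str(yr) + "/count/" + str(number))
            page = urllib2.urlopen(URL)        # the paginated URL, not the original url
            soup = BeautifulSoup(page, "html.parser")
            # ... parse the player rows and write them out as before ...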
My code:
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "https://realpython.com/practice/profiles.html"

html_page = urlopen(url)
html_text = html_page.read()
soup = BeautifulSoup(html_text)

links = soup.find_all('a', href=True)
files = []
base = "https://realpython.com/practice/"

def page_names():
    for a in links:
        files.append(base + a['href'])

page_names()

for i in files:
    all_page = urlopen(i)

all_text = all_page.read()
all_soup = BeautifulSoup(all_text)

print all_soup
The first half of the parsing collects three links; the second half is supposed to print out all of their HTML.
Sadly, it only prints the last link's HTML.
Possibly because of
for i in files:
    all_page = urlopen(i)
It was working previously with eight lines of code serving the for i in files: purpose, but I wanted to clean it up and got it down to those two. Well, clearly that didn't work.
No error though!
You only store the last value in your loop; you need to move all the assignments and the print inside the loop:
for i in files:
    all_page = urlopen(i)
    all_text = all_page.read()
    all_soup = BeautifulSoup(all_text)
    print all_soup
If you are going to use functions, I would pass parameters and create the list inside the function; otherwise you might get unexpected output:
def page_names(b, lnks):
    files = []
    for a in lnks:
        files.append(b + a['href'])
    return files

for i in page_names(base, links):
    all_page = urlopen(i)
    all_text = all_page.read()
    all_soup = BeautifulSoup(all_text)
    print all_soup
Your function can then return a list comprehension:
def page_names(b, lnks):
    return [b + a['href'] for a in lnks]
In your for loop you are assigning to all_page, which gets overwritten on each pass through the loop, so it will only ever hold the value of the last iteration.
If you want it to print the all_soup for each page, you could just indent those three lines to be inside the for loop as well; then they would be executed each time through the loop.
It seems to be just a formatting issue; you probably meant to print it in the loop, right?
for i in files:
    all_page = urlopen(i)
    all_text = all_page.read()
    all_soup = BeautifulSoup(all_text)
    print all_soup
I'm trying to scrape multiple pages off of a single website for BeautifulSoup to parse. So far, I've tried using urllib2 to do this, but have been encountering some problems. What I've attempted is:
import urllib2, sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

title = soup.find("span", {"class": "paperstitle"})
date = soup.find("span", {"class": "docdate"})
span = soup.find("span", {"class": "displaytext"})  # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]

first = title.string
second = date.string
start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]

print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
This only gives me results for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I also made some attempts at using mechanize, but had no success. Ideally, I would like to have a page with a list of links, automatically select a link, pass the HTML off to BeautifulSoup, and then move on to the next link in the list.
You need to put the rest of the code inside the loop. Right now you're iterating over both items in the tuple, but at the end of the iteration only the last item remains assigned to address which subsequently gets parsed outside the loop.
I think you just missed the indentation in the loop:
import urllib2, sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)

    title = soup.find("span", {"class": "paperstitle"})
    date = soup.find("span", {"class": "docdate"})
    span = soup.find("span", {"class": "displaytext"})  # span.string gives you the first bit
    paras = [x for x in span.findAllNext("p")]

    first = title.string
    second = date.string
    start = span.string
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
    last = paras[-1].contents[0]

    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
I think this should solve the problem.
Here's a tidier solution (using lxml):
import lxml.html as lh

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid='
page_ids = ['85753', '87433']

def scrape_page(page_id):
    url = root_url + page_id
    tree = lh.parse(url)
    title = tree.xpath("//span[@class='paperstitle']")[0].text
    date = tree.xpath("//span[@class='docdate']")[0].text
    text = tree.xpath("//span[@class='displaytext']")[0].text_content()
    return title, date, text

if __name__ == '__main__':
    for page_id in page_ids:
        title, date, text = scrape_page(page_id)
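        # For example (not part of the original answer), print a short preview of each page:
        print title
        print date
        print text[:200]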