I'm trying to make a function that scrapes book names from Goodreads using Python and BeautifulSoup.
I've realized some Goodreads pages share a common URL of the form:
"https://www.goodreads.com/shelf/show/" + category_name + "?page=" + page_number, so I've made a function that receives a category name and a maximum page count in order to iterate from page 1 to max_pages.
The problem is that on every iteration the program doesn't advance to the requested page; it keeps returning the first (default) page for the category. I've tried providing the full URL, for example https://www.goodreads.com/shelf/show/art?page=2, but it still doesn't work, so I'm guessing BeautifulSoup might be converting the URL I'm passing into some other format that doesn't work, but I don't know.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def scrape_category(category_name, search_range):
    book_names = []
    for i in range(search_range):
        # Build the URL for page i + 1 of the shelf
        quote_page = "https://www.goodreads.com/shelf/show/" + category_name + "?page=" + str(i + 1)
        page = urlopen(quote_page)
        soup = BeautifulSoup(page, 'lxml')
        names = soup.find_all('a', attrs={"class": 'bookTitle'})
        for name in names:
            book_name = name.text
            book_name = re.sub(r'\"', '', book_name)
            book_names.append(book_name)
    return book_names
The result from this code is always the book names from the first page of the category I pass as a parameter, never the second, third, ... or nth page in the range 1 to max_pages that I'm requesting.
I see the same books when I enter https://www.goodreads.com/shelf/show/art?page=2 and https://www.goodreads.com/shelf/show/art?page=15 in my browser. This isn't a problem with BeautifulSoup; it's just how this site was built.
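If you want to check that yourself, here's a minimal sketch (mine, not from your code) that fetches two page numbers and compares the first title returned:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def first_title(url):
    # fetch the page and return the first book title found
    soup = BeautifulSoup(urlopen(url), 'lxml')
    first = soup.find('a', attrs={"class": 'bookTitle'})
    return first.text.strip() if first else None

# if the site ignores ?page=..., both calls print the same title
print(first_title("https://www.goodreads.com/shelf/show/art?page=2"))
print(first_title("https://www.goodreads.com/shelf/show/art?page=15"))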
I am trying to paginate a scraper on my university's website.
Here is the URL for one of the profile pages:
https://www.bu.edu/com/profile/david-abel/
where david-abel is a first name followed by a last name. (It would be first-middle-last if a middle name were given, which poses a problem since my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:
How do I go about combining names from my firstnames and lastnames lists with my base URL to get a corresponding URL in the layout above?
import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []

html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

for name in split_names:
    f, l = zip(*split_names)
    firstnames.append(f)
    lastnames.append(l)

# \/ appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl +

print(firstnames)
print(lastnames)
This simple modification should give you what you want; let me know if you have any more questions or if anything needs to be changed!
# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl + "-".join(name)
    print(newurl)
Even better:
for name in split_names:
    profile_url = f"https://www.bu.edu/com/profile/{'-'.join(name)}"
    print(profile_url)
As for the pagination part, this should work and isn't hard-coded. Say new faculty join and there are now 9 pages; this code should still work in that case.
url = 'https://www.bu.edu/com/profiles/faculty/page'

with requests.get(f"{url}/1") as response:
    soup = BeautifulSoup(response.text, 'html.parser')
    # select the pagination numbers shown, e.g. [2, 3, 7, Next] (omit "Next")
    page_numbers = [int(n.text) for n in soup.select("a.page-numbers")[:-1]]
    # take the min and max for pagination
    start_page, stop_page = min(page_numbers), max(page_numbers) + 1

# loop through pages
for page in range(start_page, stop_page):
    with requests.get(f"{url}/{page}") as response:
        soup = BeautifulSoup(response.text, 'html.parser')
        professors = soup.select('h4.profile-card__name')
I believe this is the best and most concise way to solve your problem. As a tip, you should use with when making requests, since it takes care of a lot of issues for you and you don't have to pollute the namespace with things like resp1, resp2, etc. As mentioned above, f-strings are amazing and super easy to use.
I'm building a Scrapy spider that crawls two pages (e.g. PageDucky, PageHorse), and I pass those two pages in the start_urls field.
But for pagination, I need to take each URL and concatenate "?page=" onto it, so I can't just pass the entire list.
I've already tried a for loop, but without success.
Does anyone know how I can make the pagination work for both pages?
Here is my code for now:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'QuotesSpider'
    start_urls = ['https://PageDucky.com', 'https://PageHorse.com']
    categories = []
    count = 1

    def parse(self, response):
        # Get categories
        urli = response.url
        QuotesSpider.categories = urli[urli.find('/browse') + 7:].split('/')
        QuotesSpider.categories.pop(0)

        # GET ITEMS PER PAGE AND CALC THE PAGINATION
        items = int(response.xpath(
            '*//div[@id="body"]/div/label[@class="item-count"]/text()').get().replace(' items', ''))
        pages = items / 10

        # CALL THE OTHER DEF TO READ THE PAGE ITSELF
        for i in response.css('div#body div a::attr(href)').getall():
            if i[:5] == '/item':
                yield scrapy.Request('http://mainpage' + i, callback=self.parseobj)

        # HERE IS THE PROBLEM: I TESTED, AND WITHOUT THE FOR LOOP IT WORKS FOR ONE URL ONLY
        for y in QuotesSpider.start_urls:
            if pages >= QuotesSpider.count:
                next_page = y + '?page=' + str(QuotesSpider.count)
                QuotesSpider.count = QuotesSpider.count + 1
                yield scrapy.Request(next_page, callback=self.parse)
Whatever website you're scraping, find the XPath/CSS location of the 'next page' button. Get its href, and yield your next request to that link.
Alternatively, you don't need to use start_urls if you write your own start_requests function, where you can put custom logic, like looping through your desired URLs and appending the correct page number to each (see the sketch below). See: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
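A minimal sketch of that second suggestion (my own illustration; the URLs and max_pages value are placeholders, not taken from the question):
import scrapy

class PaginatedSpider(scrapy.Spider):
    name = 'paginated'
    # placeholder URLs and page count, adjust for your own sites
    base_urls = ['https://PageDucky.com', 'https://PageHorse.com']
    max_pages = 5

    def start_requests(self):
        # build one request per (url, page) combination up front
        for url in self.base_urls:
            for page in range(1, self.max_pages + 1):
                yield scrapy.Request(url + '?page=' + str(page), callback=self.parse)

    def parse(self, response):
        # extract items from each page here
        pass
This keeps the page counter out of class-level state entirely, which avoids the shared-count problem in the original parse method.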
UPDATE WITH SOLUTION
I can't use "href" because it isn't the same link; for example, page 01 was 'https://pageducky.com' and page 02 was 'https://duckyducky.com?page=2'.
So I used response.url and manipulated the string around the ?page= part, something like this:
resp1 = response.url[:response.url.find('?page=')]
resp = resp1 + '?page=' + str(QuotesSpider.count)
I am trying to extract different pieces of information from websites with BeautifulSoup, such as the product title and the price.
I do that with different URLs, looping through them with for...in. Here, I'll just provide a snippet without the loop.
from bs4 import BeautifulSoup
import requests
import csv
url= 'https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
price = soup.find('meta', property="product:price:amount")
title = soup.find("div", {"class": "flix-model-name"})
title2 = soup.find('div', class_="flix-model-name")
title3 = soup.find("div", attrs={"class": "flix-model-name"})
print(price['content'])
print(title)
print(title2)
print(title3)
So from this URL https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html I want to extract the product number. The only place I find it is in the div class="flix-model-name". However, I am totally unable to reach it. I tried different ways to access it with title, title2, title3, but I always get the output None.
I am a bit of a beginner, so I guess I am probably missing something basic... If so, please pardon me for that.
Any help is welcome! Many thanks in advance!
Just for info, with each URL I thought of appending the data and writing it to a CSV file like this:
data = []  # one CSV line per URL
for url in urls:
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    row = []
    try:
        # title = YOUR VERY WELCOMED ANSWER
        prices = soup.find('meta', property="product:price:amount")
        row = (title.text + ',' + prices['content'] + '\n')
        data.append(row)
    except:
        pass

file = open('database.csv', 'w')
i = 0
while i < len(data):
    file.write(data[i])
    i += 1
file.close()
Many thanks in advance for your help!
David
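As a quick side note on the CSV-writing part: the standard csv module handles quoting and newlines for you. A minimal sketch, assuming the loop collects (title, price) pairs into a list (the values shown here are placeholders):
import csv

rows = [("some product title", "123.45")]  # placeholder (title, price) pairs collected in the loop

with open('database.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])  # header row
    writer.writerows(rows)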
Try the approach below using python-requests: it's simple, straightforward, reliable, fast, and requires less code. I fetched the API URL from the website itself after inspecting the network section of the Google Chrome browser.
What exactly the script below is doing:
First it takes the API URL, creates the URL based on 2 dynamic parameters (product and category), and then does a GET request to fetch the data.
After getting the data, the script parses the JSON using json.loads.
Finally, it iterates over the list of products one by one and prints the details, which are divided into 2 categories, 'box1_ProductToProduct' and 'box2_KategorieTopseller': Brand, Name, Product number and Unit price. In the same way you can add more details by looking into the API call.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_product_details():
    PRODUCT = 'MMCH1991479'  # Product number
    CATEGORY = '680942'  # Category number
    URL = 'https://www.mediamarkt.ch/rde_server/res/MMCH/recomm/product_detail/sid/WACXyEbIf3khlu6FcHlh1B1?product=' + PRODUCT + '&category=' + CATEGORY  # dynamic URL
    response = requests.get(URL, verify=False)  # GET request to fetch the data
    result = json.loads(response.text)  # Parse JSON data using json.loads
    box1_ProductToProduct = result[0]['box1_ProductToProduct']  # Extracted data from API
    box2_KategorieTopseller = result[1]['box2_KategorieTopseller']

    for item in box1_ProductToProduct:  # loop over extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
        print('-' * 100)

    for item in box2_KategorieTopseller:  # loop over extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
        print('-' * 100)

scrap_product_details()
Below is my code and it works, but the issue is that sometimes it does not work. I'd call it an intermittent issue, probably because of dynamic elements on the page? What is the solution for dynamic elements?
import requests
from bs4 import BeautifulSoup

def collect_bottom_url(product_string):
    """
    collect_bottom_url:
    This function accepts a product name as an argument,
    creates a URL for the product, and then collects all the URLs given at the bottom of the page for the product.
    :return: list_of_urls
    """
    url = 'https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product_string
    # download the main webpage of the product
    webpage = requests.get(url)
    # Store the main URL of the product in a list
    list_of_urls = list()
    list_of_urls.append(url)
    # Create a soup of the downloaded page using the lxml parser
    my_soup = BeautifulSoup(webpage.text, "lxml")
    # find_all class = pagnLink in the web page
    urls_at_bottom = my_soup.find_all(class_='pagnLink')
    empty_list = list()
    for b_url in urls_at_bottom:
        empty_list.append(b_url.find('a')['href'])
    for item in empty_list:
        item = "https://www.amazon.in/" + item
        list_of_urls.append(item)
    print(list_of_urls)

collect_bottom_url('book')
Here is output 1, which is fine:
['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book', 'https://www.amazon.in//book/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Abook', 'https://www.amazon.in//book/s?ie=UTF8&page=3&rh=i%3Aaps%2Ck%3Abook']
Here is output 2, which is incorrect:
['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book']
It's not dynamic; it asks for a captcha because you use the default user-agent. Change it:
headers = {"User-Agent": 'Mozilla/5.0.............'}

def collect_bottom_url(product_string):
    .....
    webpage = requests.get(url, headers=headers)
For dynamic pages, use Selenium.
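If a page really is rendered dynamically by JavaScript, a minimal Selenium sketch (my own illustration, not part of the answer above) is to let the browser render the page and then hand the HTML to BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is installed and on your PATH
driver.get('https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book')
soup = BeautifulSoup(driver.page_source, 'lxml')  # parse the fully rendered HTML
driver.quit()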
I am trying to scrape multiple pages using BeautifulSoup, but I am getting only the last page's results as output. Please suggest the right way. Below is my code.
# For every page
for page in range(0,8):
    # Make a get request
    response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page))
    # Pause the loop
    sleep(randint(8,15))
    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)

    html_soup = BeautifulSoup(response.text, 'html.parser')
    all_table_info = html_soup.find('table', class_ = "views-table cols-4")
    for name in all_table_info.find_all('div', class_="views-field views-field-view"):
        names.append(name.text.replace("\n", " ") if name.text else None)
    for organization in all_table_info.find_all('td', class_="views-field views-field-field-employer"):
        orgs.append(organization.text.strip() if organization.text else None)
    for year in all_table_info.find_all('td', class_ = "views-field views-field-view-2"):
        Years.append(year.text.strip() if year.text else None)

df = pd.DataFrame({'Name' : names, 'Org' : orgs, 'year' : Years})
print(df)
There is a typo: a plus instead of a dot. You need 'http://nati...ge=0%2C{}'.format(page),
but you wrote
'http://nati...ge=0%2C{}' + format(page)
URLs that still contain the braces before the page number all end up at the same page.
EDIT:
If I was not clear, you just need to change the line
response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page))
to
response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}'.format(page))
In the first case the resulting URL also contains the substring '{}', which causes the problem.
Note: there are 9 pages on the site, identified by page=0,0 through page=0,8, so your loop should use range(9). Or, even better, load the first page, then get the URL for the next page from the next link, and iterate over all the pages by following the next link until there is no next link on the page.
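That "follow the next link" approach could be sketched like this (my own illustration; the 'li.pager-next a' selector is an assumption about the site's markup, not something taken from the question):
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = 'http://nationalacademyhr.org/fellowsdirectory'
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... extract names, organisations and years from soup here ...
    next_link = soup.select_one('li.pager-next a')  # assumed selector for the "next" link
    url = urljoin(url, next_link['href']) if next_link else None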
Further to xhancar's answer, which identifies the problem, a better way is to avoid string operations when building URLs, and instead let requests construct the URL query string for you:
for page in range(9):
    params = {'page': '0,{}'.format(page)}
    response = get('http://nationalacademyhr.org/fellowsdirectory', params=params)
The params parameter is passed to requests.get(), which adds the values to the URL query string. The query parameters will be properly encoded, e.g. the , replaced with %2C.