How can I scrape prices from the next pages? - python

I'm new to Python and web scraping.
I wrote some code using requests and BeautifulSoup. One script scrapes prices, names, and links. It works fine and is as follows:
from bs4 import BeautifulSoup
import requests
urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1"
source = requests.get(urls).text
soup = BeautifulSoup(source, 'lxml')
for figcaption in soup.find_all('figcaption'):
    price = figcaption.div.text
    name = figcaption.find('a', class_='title').text
    link = figcaption.find('a', class_='title')['href']
    print(price)
    print(name)
    print(link)
I also have a second script that builds the other URLs I need to scrape. It prints the correct URLs when I use print():
x = 0
counter = 1
for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    counter += 1
    x += 1
    print(urls)
But when I combine the two, so that the script scrapes a page, moves on to the next URL, and scrapes that one, it just prints the information from the first page 70 times. Please guide me through this. The whole code is as follows:
from bs4 import BeautifulSoup
import requests
x = 0
counter = 1
for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    source = requests.get(urls).text
    soup = BeautifulSoup(source, 'lxml')
    counter += 1
    x += 1
    print(urls)
    for figcaption in soup.find_all('figcaption'):
        price = figcaption.div.text
        name = figcaption.find('a', class_='title').text
        link = figcaption.find('a', class_='title')['href']
        print(price)
        print()
        print(name)
        print()
        print(link)

Setting x = 0 and then incrementing it by 1 is redundant, because the loop already iterates x through range(0, 70). You don't need counter either, since x + 1 gives you the page number.
However, I believe the issue is not with the looping but with the URL itself. If you manually go to the two pages listed below, the content doesn't change:
https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1
and then
https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-2
Since the site loads its listings dynamically, you'll need to find a different way to iterate from page to page, or figure out the exact URL the page requests its data from. So try:
from bs4 import BeautifulSoup
import requests
for x in range(0, 70):
    try:
        # URL taken from the XHR request the site makes when you change pages
        urls = 'https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html&pagesize[]=24&order[]=new&stock[]=1&page[]=' +str(x+1) + '&ajax=ok?_=1561559181560'
        source = requests.get(urls).text
        soup = BeautifulSoup(source, 'lxml')
        print('Page: %s' %(x+1))
        for figcaption in soup.find_all('figcaption'):
            price = figcaption.find('span', {'class':'new_price'}).text.strip()
            name = figcaption.find('a', class_='title').text
            link = figcaption.find('a', class_='title')['href']
            print('%s\n%s\n%s' %(price, name, link))
    except Exception:
        # Stop at the first page that fails to load or parse
        break
You can find that link by opening the site's dev tools (Ctrl+Shift+I, or right-click and choose 'Inspect') -> Network -> XHR.
When I did that and then clicked through to the next page, I could see the request that fetched the data, and that is where I found the URL.
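One reason the original loop returns the same page 70 times is that everything after # in a URL is a fragment: it is handled on the client side (here, presumably by the site's JavaScript) and is not sent to the server in the HTTP request. A minimal sketch using the standard library's urllib.parse.urldefrag shows that the two page URLs above reduce to the same request target. The variable names are only for illustration:

```python
from urllib.parse import urldefrag

base = ("https://www.meisamatr.com/fa/product/cat/"
        "2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html")
page_1 = base + "#/pagesize-24/order-new/stock-1/page-1"
page_2 = base + "#/pagesize-24/order-new/stock-1/page-2"

# urldefrag splits a URL into the part that is requested and the fragment
url_1, fragment_1 = urldefrag(page_1)
url_2, fragment_2 = urldefrag(page_2)

print(url_1 == url_2)  # the request target is identical for both pages
print(fragment_1)      # only the fragment carries the page number
print(fragment_2)
```

This is why changing page-1 to page-2 after the # makes no difference to what requests.get() receives.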

Related

Python web scraping multiple pages

I am scraping all the words from the Merriam-Webster website.
I want to scrape all pages from a to z, and all the pages within each letter, and save them to a text file. The problem I'm having is that I only get the first result of the table instead of all of them. I know this is a large amount of text (around 500k), but I'm doing it to educate myself.
CODE:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.merriam-webster.com/browse/dictionary/a/'
page = 1
# for page in range(1, 75):
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
containers = soup.find('div', attrs={'class', 'entries'})
table = containers.find_all('ul')
for entries in table:
    links = entries.find_all('a')
    name = links[0].text
    print(name)
Now what I want is to get all the entries from this table, but instead I only get the first entry.
I'm kind of stuck here, so any help would be appreciated.
Thanks
https://www.merriam-webster.com/browse/medical/a-z
https://www.merriam-webster.com/browse/legal/a-z
https://www.merriam-webster.com/browse/dictionary/a-z
https://www.merriam-webster.com/browse/thesaurus/a-z
To get all entries, you can use this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.merriam-webster.com/browse/dictionary/a/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for a in soup.select('.entries a'):
    print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
Prints:
(a) heaven on earth https://www.merriam-webster.com/dictionary/%28a%29%20heaven%20on%20earth
(a) method in/to one's madness https://www.merriam-webster.com/dictionary/%28a%29%20method%20in%2Fto%20one%27s%20madness
(a) penny for your thoughts https://www.merriam-webster.com/dictionary/%28a%29%20penny%20for%20your%20thoughts
(a) quarter after https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20after
(a) quarter of https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20of
(a) quarter past https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20past
(a) quarter to https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20to
(all) by one's lonesome https://www.merriam-webster.com/dictionary/%28all%29%20by%20one%27s%20lonesome
(all) choked up https://www.merriam-webster.com/dictionary/%28all%29%20choked%20up
(all) for the best https://www.merriam-webster.com/dictionary/%28all%29%20for%20the%20best
(all) in good time https://www.merriam-webster.com/dictionary/%28all%29%20in%20good%20time
...and so on.
To scrape multiple pages:
url = 'https://www.merriam-webster.com/browse/dictionary/a/{}'
for page in range(1, 76):
    soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')
    for a in soup.select('.entries a'):
        print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
EDIT: To get all pages from A to Z:
import requests
from bs4 import BeautifulSoup
url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'
for char in range(ord('a'), ord('z')+1):
    page = 1
    while True:
        soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
        for a in soup.select('.entries a'):
            print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
        last_page = soup.select_one('[aria-label="Last"]')['data-page']
        if last_page == '':
            break
        page += 1
EDIT 2: To save to file:
import requests
from bs4 import BeautifulSoup
url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'
with open('data.txt', 'w') as f_out:
    for char in range(ord('a'), ord('z')+1):
        page = 1
        while True:
            soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
            for a in soup.select('.entries a'):
                print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
                print('{}\t{}'.format(a.text, 'https://www.merriam-webster.com' + a['href']), file=f_out)
            last_page = soup.select_one('[aria-label="Last"]')['data-page']
            if last_page == '':
                break
            page += 1
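The A-to-Z loop above mixes two concerns: fetching a page and deciding when to stop. A minimal sketch of the same stop-on-last-page pattern, with the fetching hidden behind a hypothetical fetch_page callable (an assumption for illustration, not part of the answer), might look like this. In real use, fetch_page would wrap the requests and BeautifulSoup calls and derive has_next from the data-page check. Here a small in-memory stand-in with made-up entries is used so the sketch runs without network access:

```python
def iter_pages(fetch_page, start=1):
    """Yield (page_number, items) until fetch_page reports no further page."""
    page = start
    while True:
        items, has_next = fetch_page(page)
        yield page, items
        if not has_next:
            break
        page += 1


# Hypothetical stand-in for a real fetcher: three "pages" of made-up entries.
FAKE_SITE = {1: ["aardvark", "abacus"], 2: ["abandon"], 3: ["abate", "abbey"]}


def fake_fetch(page):
    # Return the page's entries and whether a later page exists.
    return FAKE_SITE[page], (page + 1) in FAKE_SITE


words = [word for _, items in iter_pages(fake_fetch) for word in items]
print(words)
```

Keeping the stop condition in one place makes it easier to change later, for example if the site changes how it marks the last page.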
I think you need another loop:
for entries in table:
    links = entries.find_all('a')
    for name in links:
        print(name.text)

python crawling beautifulsoup how to crawl several pages?

Please help.
I want to get all the company names from each page. There are 12 pages.
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/2
-- only the number changes between pages.
Here is my code so far.
Can I get just the title (company name) from all 12 pages?
Thank you in advance.
from bs4 import BeautifulSoup
import requests
maximum = 0
page = 1
URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1'
response = requests.get(URL)
source = response.text
soup = BeautifulSoup(source, 'html.parser')
whole_source = ""
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/' + str(page_number)
    response = requests.get(URL)
    whole_source = whole_source + response.text
soup = BeautifulSoup(whole_source, 'html.parser')
find_company = soup.select("#content > div.wrap_analysis_data > div.public_con_box.public_list_wrap > ul > li:nth-child(13) > div > strong")
for company in find_company:
    print(company.text)
---------Output of one page
---------page source :)
So, you want to remove all the surrounding markup and get only the company name as a string?
Basically, you can use soup.findAll to find the list of companies, each in a format like this:
<strong class="company"><span>중소기업진흥공단</span></strong>
Then you use .find to extract the <span> tag:
<span>중소기업진흥공단</span>
After that, you use .contents to get the string from the <span> tag:
'중소기업진흥공단'
So you write a loop that does the same for each page, and keep a list called company_list to which the results from every page are appended.
Here's the code:
from bs4 import BeautifulSoup
import requests
maximum = 12
company_list = [] # List for result storing
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(page_number)
    response = requests.get(URL)
    print(page_number)
    whole_source = response.text
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.findAll('strong', attrs={'class': 'company'}): # Finding all company names in the page
        company_list.append(entry.find('span').contents[0]) # Extracting name from the result
company_list will then hold all the company names you want.
I figured it out eventually. Thank you for your answer though!
image : code captured in jupyter notebook
Here is my final code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
company_list=[]
for n in range(12):
    url = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(n+1)
    webpage = urlopen(url)
    source = BeautifulSoup(webpage,'html.parser',from_encoding='utf-8')
    companys = source.findAll('strong',{'class':'company'})
    for company in companys:
        company_list.append(company.get_text().strip().replace('\n','').replace('\t','').replace('\r',''))
file = open('company_name1.txt','w',encoding='utf-8')
for company in company_list:
    file.write(company+'\n')
file.close()
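The chained .replace() calls in the final code strip line breaks and tabs one character at a time. A slightly more compact alternative, offered as a sketch rather than a correction, is to split on any whitespace and re-join, and to write the file inside a with block so it is closed even if an error occurs. Note one behavioural difference: this collapses internal runs of whitespace to a single space instead of deleting them. The helper names clean_name and save_names, the second sample name, and the temporary output path are made up for this example:

```python
import os
import tempfile


def clean_name(raw):
    """Collapse all runs of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(raw.split())


def save_names(path, names):
    """Write one name per line; the with block closes the file automatically."""
    with open(path, "w", encoding="utf-8") as f_out:
        for name in names:
            f_out.write(name + "\n")


# Sample input imitating text pulled out of the <strong class="company"> tags
raw_names = ["\n\t중소기업진흥공단\r\n", "  Example   Company \t"]
company_list = [clean_name(raw) for raw in raw_names]

out_path = os.path.join(tempfile.mkdtemp(), "company_name1.txt")
save_names(out_path, company_list)
print(company_list)
```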

Trouble parsing product names out of some links with different depth

I've written a script in Python to reach the target pages of a website, where each category lists its available item names. My script below can get the product names from most of the links (generated by traversing the category links and then the subcategory links).
The script can parse the subcategory links revealed by clicking the + sign next to each category, which are visible in the image below, and then parse all the product names from the target page. This is one such target page.
However, a few of the links do not have the same depth as the others. For example, this link and this one are different from usual links like this one.
How can I get all the product names from all the links, irrespective of their depth?
This is what I've tried so far:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
link = "https://www.courts.com.sg/"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select(".nav-dropdown li a"):
    if "#" in item.get("href"): continue  # kick out invalid links
    newlink = urljoin(link,item.get("href"))
    req = requests.get(newlink)
    sauce = BeautifulSoup(req.text,"lxml")
    for elem in sauce.select(".product-item-info .product-item-link"):
        print(elem.get_text(strip=True))
How to find target links:
The site has six main product categories. Products that belong to a subcategory can also be found in a main category (for example, the products in /furniture/furniture/tables can also be found in /furniture), so you only have to collect products from the main categories. You could get the category links from the main page, but it'd be easier to use the sitemap.
url = 'https://www.courts.com.sg/sitemap/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]
As you've mentioned, there are some links that have a different structure, like this one: /televisions. But if you click the View All Products link on that page, you will be redirected to /tv-entertainment/vision/television. So you can get all the /televisions products from /tv-entertainment. Similarly, the products in links to brands can be found in the main categories. For example, the /asus products can be found in /computing-mobile and other categories.
The code below collects products from all the main categories, so it should collect all the products on the site.
from bs4 import BeautifulSoup
import requests
url = 'https://www.courts.com.sg/sitemap/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]
products = []
for link in links:
    link += '?product_list_limit=24'
    while link:
        r = requests.get(link)
        soup = BeautifulSoup(r.text, 'html.parser')
        link = (soup.select_one('a.action.next') or {}).get('href')
        for elem in soup.select(".product-item-info .product-item-link"):
            product = elem.get_text(strip=True)
            products += [product]
            print(product)
I've increased the number of products per page to 24, but still this code takes a lot of time, as it collects products from all main categories and their pagination links. However, we could make it much faster with the use of threads.
from bs4 import BeautifulSoup
import requests
from threading import Thread, Lock
from urllib.parse import urlparse, parse_qs
lock = Lock()
threads = 10
products = []
def get_products(link, products):
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    tags = soup.select(".product-item-info .product-item-link")
    with lock:
        products += [tag.get_text(strip=True) for tag in tags]
        print('page:', link, 'items:', len(tags))
url = 'https://www.courts.com.sg/sitemap/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]
for link in links:
    link += '?product_list_limit=24'
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    last_page = soup.select_one('a.page.last')['href']
    last_page = int(parse_qs(urlparse(last_page).query)['p'][0])
    threads_list = []
    for i in range(1, last_page + 1):
        page = '{}&p={}'.format(link, i)
        thread = Thread(target=get_products, args=(page, products))
        thread.start()
        threads_list += [thread]
        if i % threads == 0 or i == last_page:
            for t in threads_list:
                t.join()
print(len(products))
print('\n'.join(products))
This code collects 18,466 products from 773 pages in about 5 minutes. I'm using 10 threads because I don't want to stress the server too much, but you could use more (most servers can handle 20 threads easily).
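The manual batching above (start up to 10 threads, then join them) can also be expressed with the standard library's concurrent.futures.ThreadPoolExecutor, which caps the number of concurrent workers for you. This is a sketch of that alternative, not the answer's method. fetch_products is a hypothetical stand-in that returns made-up names so the example runs offline, and the example.com URLs are placeholders. In real use it would contain the requests and BeautifulSoup calls from get_products:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_products(page_url):
    # Hypothetical stand-in: a real version would download page_url and
    # return the product names parsed from it.
    return ['{} item {}'.format(page_url, i) for i in range(3)]


def collect(page_urls, max_workers=10):
    """Fetch all pages with at most max_workers threads running at once."""
    products = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as page_urls
        for names in pool.map(fetch_products, page_urls):
            products.extend(names)
    return products


pages = ['https://example.com/cat?p={}'.format(i) for i in range(1, 6)]
products = collect(pages)
print(len(products))
```

Because the results are gathered in the calling thread, this version does not need an explicit Lock around the shared list.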
I would recommend starting your scrape from the site's sitemap,
found here.
If they add products, those are likely to show up there as well.
Since your main issue is finding the links, here is a generator that will find all of the category and sub-category links using the sitemap krflol pointed out in his solution:
from bs4 import BeautifulSoup
import requests
def category_urls():
    response = requests.get('https://www.courts.com.sg/sitemap')
    html_soup = BeautifulSoup(response.text, features='html.parser')
    categories_sitemap = html_soup.find(attrs={'class': 'xsitemap-categories'})
    for category_a_tag in categories_sitemap.find_all('a'):
        yield category_a_tag.attrs['href']
And to find the product names, simply scrape each of the yielded category_urls.
I looked at the website and found that links to all the product categories are listed at the bottom left of the main page, https://www.courts.com.sg/. Clicking one of them takes you to the promotional front page of that category, where you have to click All Products to get the product listing.
The complete code is as follows:
import requests
from bs4 import BeautifulSoup
def parser():
    parsing_list = []
    url = 'https://www.courts.com.sg/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    ul = soup.find('footer',{'class':'page-footer'}).find('ul')
    for l in ul.find_all('li'):
        nextlink = url + l.find('a').get('href')
        response = requests.get(nextlink)
        inner_soup = BeautifulSoup(response.text, "html.parser")
        parsing_list.append(url + inner_soup.find('div',{'class':'category-static-links ng-scope'}).find('a').get('href'))
    return parsing_list
This function returns a list of the All Products page links for every category, including the ones your code didn't reach. You can then scrape the product names from each of those pages.
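The function above builds each link with url + href, which only gives a valid address when the href is relative and has no leading slash. The answer does not say which form the site's hrefs take, so as a hedged alternative, urllib.parse.urljoin (already used in the question's code) resolves relative, root-relative, and absolute hrefs alike. The paths below are the ones mentioned earlier in this thread, used purely as sample inputs:

```python
from urllib.parse import urljoin

base = "https://www.courts.com.sg/"

# Plain concatenation produces a double slash for a root-relative href
concatenated = base + "/furniture"
print(concatenated)

# urljoin resolves each kind of href against the base URL
hrefs = [
    "/furniture",
    "furniture/furniture/tables",
    "https://www.courts.com.sg/tv-entertainment",
]
joined = [urljoin(base, href) for href in hrefs]
for address in joined:
    print(address)
```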

How to choose a random link on the page?

I am using Beautiful Soup to get links from a page.
What I would like it to do is select one of the links at random and continue with the rest of the program. Currently it uses all the links and continues with the rest of the program, but I only want it to choose one link.
The rest of the program will then look at the link and decide whether it is good enough for what I want. If it is not good enough, it will go back and pick another link, and repeat the process.
Any idea how to get it to do this?
This is my current code for looking up the links.
import requests
import os.path
from bs4 import BeautifulSoup
import urllib.request
import hashlib
import random
max_page = 1
img_limit = 5
def pic_spider(max_pages):
    page = random.randrange(0, max_page)
    pid = page * 40
    pic_good = 1
    while pic_good == 1:
        if page <= max_pages:
            url = 'http://safebooru.org/index.php?page=post&s=list&tags=yuri&pid=' + str(pid)
            source_code = requests.get(url)
            plain_text = source_code.text
            soup = BeautifulSoup(plain_text, "html.parser")
            id_list_location = os.path.join(id_save, "ids.txt")
            first_link = soup.findAll('a', id=True, limit=img_limit)
            for link in first_link:
                href = "http://safebooru.org/" + link.get('href')
                picture_id = link.get('id')
                print("Page number = " + str(page + 1))
                print("pid = " + str(pid))
                print("Id = " + picture_id)
                print(href)
                if picture_id in open(id_list_location).read():
                    print("Already Downloaded or Picture checked to be too long")
                else:
                    log_id(picture_id)
                    if ratio_get(href) >= 1.3:
                        print("Picture too long")
                    else:
                        #img_download_link(href, picture_id)
                        print("Ok download")
I'm not really sure how I would do it, so any ideas would help me out. If you have any questions, feel free to ask!
Am I missing something? Don't you just need to replace this:
first_link = soup.findAll('a', id=True, limit=img_limit)
for link in first_link:
With:
from random import choice
first_link = soup.findAll('a', id=True, limit=img_limit)
link = choice(first_link)
This will select one random item from the list.
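choice picks one link, but the question also asks to go back and try another link when the first is not good enough. One way to sketch that, assuming a hypothetical is_good check standing in for the rest of the program, is to shuffle the candidates once and walk through them, so that no link is tried twice. The function name pick_first_good and the sample links are made up for this example:

```python
import random


def pick_first_good(candidates, is_good, rng=random):
    """Try candidates in random order; return the first that passes is_good."""
    pool = list(candidates)  # copy so the caller's list is left untouched
    rng.shuffle(pool)
    for item in pool:
        if is_good(item):
            return item
    return None  # nothing passed the check


# Made-up links; the check here accepts only the one ending in "2"
links = ["post?id=1", "post?id=2", "post?id=3"]
chosen = pick_first_good(links, lambda link: link.endswith("2"))
print(chosen)
```

In the question's code, is_good would correspond to the "already downloaded" and ratio checks.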

Display all search results when web scraping with Python

I'm trying to scrape a list of URLs from the European Parliament's Legislative Observatory. I leave the search keyword empty in order to get links to all documents (currently 13172). I can easily scrape the first 10 results displayed on the website using the code below. However, I want all the links without having to somehow press the next-page button. Please let me know if you know of a way to achieve this.
import requests, bs4, re
# main url of the Legislative Observatory's search site
url_main = 'http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y'
# function gets a list of links to the procedures
def links_to_procedures (url_main):
    # requesting html code from the main search site of the Legislative Observatory
    response = requests.get(url_main)
    soup = bs4.BeautifulSoup(response.text, 'html.parser') # loading text into Beautiful Soup
    links = [a.attrs.get('href') for a in soup.select('div.procedure_title a')] # getting a list of links of the procedure title
    return links
print(links_to_procedures(url_main))
You can follow the pagination by specifying the page GET parameter.
First, get the results count, then calculate the number of pages to process by dividing that count by the number of results per page. Then iterate over the pages one by one and collect the links:
import math
import re
from bs4 import BeautifulSoup
import requests
response = requests.get('http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y')
soup = BeautifulSoup(response.content, 'html.parser')
# get the results count
num_results = soup.find('span', class_=re.compile('resultNum')).text
num_results = int(re.search(r'(\d+)', num_results).group(1))
print("Results found: " + str(num_results))
results_per_page = 50
base_url = "http://www.europarl.europa.eu/oeil/search/result.do?page={page}&rows=%s&sort=d&searchTab=y&sortTab=y&x=1411566719001" % results_per_page
links = []
# round up so that a final, partially filled page is also fetched
num_pages = math.ceil(num_results / results_per_page)
for page in range(1, num_pages + 1):
    print("Current page: " + str(page))
    url = base_url.format(page=page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    links += [a.attrs.get('href') for a in soup.select('div.procedure_title a')]
print(links)
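The page count described above is a ceiling division: when the total is not an exact multiple of the page size, the final, partially filled page still needs its own request. A minimal sketch of that arithmetic, using the 13172 results from the question and the 50 results per page from the answer (the function name page_numbers is made up for this example):

```python
import math


def page_numbers(num_results, results_per_page):
    """Return the 1-based page numbers needed to cover num_results items."""
    last_page = math.ceil(num_results / results_per_page)
    return range(1, last_page + 1)


pages = page_numbers(13172, 50)
print(len(pages), pages[-1])  # 264 264
```

Here 263 full pages cover 13150 results, and page 264 holds the remaining 22.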
