Scraping through every product on a retailer's website - python
We are trying to scrape every product in every category on Forever 21's website. Given a product page, we know how to extract the information we need, and given a category, we can extract every product. However, we do not know how to crawl through every product category. Here is our code that, given a category, gets every product:
import requests
from bs4 import BeautifulSoup
import json
#import re
params = {"action": "getcategory",
"br": "f21",
#"category": re.compile('\S+'),
"category": "dress",
"pageno": 1,
"pagesize": "",
"sort": "",
"fsize": "",
"fcolor": "",
"fprice": "",
"fattr": ""}
url = "http://www.forever21.com/Ajax/Ajax_Category.aspx"
js = requests.get(url, params=params).json()
soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")
i = 0
j = 0
while len(soup.select("div.item_pic a")) != 0:
    for a in soup.select("div.item_pic a"):
        #print a["href"]
        i = i + 1
    params["pageno"] = params["pageno"] + 1
    j = j + 1
    js = requests.get(url, params=params).json()
    soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")
print i
print j
As you can see in the comments, we tried to use regular expressions for the category but had no success. i and j are just product and page counters. Any suggestions on how to modify/add to this code to get every product category?
You can scrape the category page and get all subcategories from the navigation menu:
import requests
from bs4 import BeautifulSoup
url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=app-main"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")
menues = [li["class"][0] for li in soup.select("#has_sub .white nav ul > li")]
print(menues)
Prints:
[u'women-new-arrivals', u'want_list', u'dress', u'top_blouses', u'outerwear_coats-and-jackets', u'bottoms', u'intimates_loungewear', u'activewear', u'swimwear_all', u'acc', u'shoes', u'branded-shop-women-clothing', u'sale_women|women', u'women-new-arrivals-clothing-dresses', u'women-new-arrivals-clothing-tops', u'women-new-arrivals-clothing-outerwear', u'women-new-arrivals-clothing-bottoms', u'women-new-arrivals-clothing-intimates-loungewear', u'women-new-arrivals-clothing-swimwear', u'women-new-arrivals-clothing-activewear', u'women-new-arrivals-accessories|women-new-arrivals', u'women-new-arrivals-shoes|women-new-arrivals', u'promo-web-exclusives', u'promo-best-sellers-app', u'backinstock-women', u'promo-shop-by-outfit-women', u'occasion-shop-wedding', u'contemporary-main', u'promo-basics', u'21_items', u'promo-summer-forever', u'promo-coming-soon', u'dress_casual', u'dress_romper', u'dress_maxi', u'dress_midi', u'dress_mini', u'occasion-shop-dress', u'top_blouses-off-shoulder', u'top_blouses-lace-up', u'top_bodysuits-bustiers', u'top_graphic-tops', u'top_blouses-crop-top', u'top_t-shirts', u'sweater', u'top_blouses-sweatshirts-hoodies', u'top_blouses-shirts', u'top_plaids', u'outerwear_bomber-jackets', u'outerwear_blazers', u'outerwear_leather-suede', u'outerwear_jean-jackets', u'outerwear_lightweight', u'outerwear_utility-jackets', u'outerwear_trench-coats', u'outerwear_faux-fur', u'promo-jeans-refresh|bottoms', u'bottoms_pants', u'bottoms_skirt', u'bottoms_shorts', u'bottoms_shorts-active', u'bottoms_leggings', u'bottoms_sweatpants', u'bottom_jeans|', u'intimates_loungewear-bras', u'intimates_loungewear-panties', u'intimates_loungewear-bodysuits-slips', u'intimates_loungewear-seamless', u'intimates_loungewear-accessories', u'intimates_loungewear-sets', u'activewear_top', u'activewear_sports-bra', u'activewear_bottoms', u'activewear_accessories', u'swimwear_tops', u'swimwear_bottoms', u'swimwear_one-piece', u'swimwear_cover-ups', u'acc_features', u'acc_jewelry', u'acc_handbags', u'acc_glasses', u'acc_hat', u'acc_hair', u'acc_legwear', u'acc_scarf-gloves', u'acc_home-and-gift-items', u'shoes_features', u'shoes_boots', u'shoes_high-heels', u'shoes_sandalsflipflops', u'shoes_wedges', u'shoes_flats', u'shoes_oxfords-loafers', u'shoes_sneakers', u'Shoes_slippers', u'branded-shop-new-arrivals-women', u'branded-shop-women-clothing-dresses', u'branded-shop-women-clothing-tops', u'branded-shop-women-clothing-outerwear', u'branded-shop-women-clothing-bottoms', u'branded-shop-women-clothing-intimates', u'branded-shop-women-accessories|branded-shop-women-clothing', u'branded-shop-women-accessories-jewelry|', u'branded-shop-shoes-women|branded-shop-women-clothing', u'branded-shop-sale-women', u'/brandedshop/brandlist.aspx', u'promo-branded-boho-me', u'promo-branded-rare-london', u'promo-branded-selfie-leslie', u'sale-newly-added', u'sale_dresses', u'sale_tops', u'sale_outerwear', u'sale_sweaters', u'sale_bottoms', u'sale_intimates', u'sale_swimwear', u'sale_activewear', u'sale_acc', u'sale_shoes', u'the-outlet', u'sale-under-5', u'sale-under-10', u'sale-under-15']
Note the values of the br and category GET parameters: f21 is the "Women" category, and app-main is the main page for a category.
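From there, one option is to feed each scraped slug back into the Ajax endpoint from the question. Here is a minimal sketch, assuming every entry in menues is accepted as-is as the category parameter; entries containing "|" or full paths (e.g. /brandedshop/brandlist.aspx) would likely need to be cleaned or skipped first:
# Sketch: run the question's paging logic over every scraped category slug.
# Assumption: each slug in `menues` (from the navigation-menu scrape above)
# works as the "category" value; path-like or "|" entries may need cleaning.
import requests
from bs4 import BeautifulSoup

ajax_url = "http://www.forever21.com/Ajax/Ajax_Category.aspx"

def product_links(category):
    params = {"action": "getcategory", "br": "f21", "category": category,
              "pageno": 1, "pagesize": "", "sort": "", "fsize": "",
              "fcolor": "", "fprice": "", "fattr": ""}
    links = []
    while True:
        js = requests.get(ajax_url, params=params).json()
        soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")
        anchors = soup.select("div.item_pic a")
        if not anchors:
            break
        links.extend(a["href"] for a in anchors)
        params["pageno"] += 1
    return links

all_products = {category: product_links(category) for category in menues}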
Related
python requests for load more data
I want to scrape the product images from https://society6.com/art/i-already-want-to-take-a-nap-tomorrow-pink for each product.
Step 1: I go into the div with class_='card_card__l44w' (which holds each product link).
Step 2: I parse the href of each product, but I only get back the first 15 product links instead of all 44.
The second issue: when I parse each product link and grab the JSON from it, under ['product']['response']['product']['data']['attributes']['media_map'] there are many other keys like b, c, d, e, f, g (each containing a src: with the image link), and I only want to parse the .jpg image from every key. Below is my code:
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://society6.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}

r = requests.get('https://society6.com/art/flamingo-cone501586', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

productslist = soup.find_all('div', class_='card_card__l44w')
productlinks = []
for item in productslist:
    for link in item.find_all('a', href=True):
        productlinks.append(baseurl + link['href'])

newlist = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    scripts = soup.find_all('script')[9].text.strip()[24:]
    data = json.loads(scripts)
    url = data['product']['response']['product']['data']['attributes']['media_map']
    detail = {'links': url}
    newlist.append(detail)
    print('saving')

df = pd.DataFrame(newlist)
df.to_csv('haja.csv')
All the information is loaded on the first visit, and all 66 products are stored in window.__INITIAL_STATE. If you scroll almost to the end of the file you can see it. You can use that to parse the information.
import re
import json

data = json.loads((soup
                   .find("script", text=re.compile("^window.__INITIAL_STATE"))
                   .text
                   .replace("</script>", "")
                   .replace("window.__INITIAL_STATE = ", "")))

products = data["designDetails"]["response"]["designDetails"]["data"]["products"]
products is a list with 66 items. Example:
{'sku': 's6-7120491p92a240v826', 'retail_price': 29, 'discount_price': 24.65, 'image_url': 'https://ctl.s6img.com/society6/img/yF7u4l5D3MODQBBerUQBHdYsfN8/h_264,w_264/acrylic-boxes/small/top/~artwork,fw_1087,fh_1087,fx_-401,fy_-401,iw_1889,ih_1889/s6-original-art-uploads/society6/uploads/misc/f7916751f46d4d9c9fb7f6fe4e5d5729/~~/flamingo-cone501586-acrylic-boxes.jpg', 'product_type': {'id': 92, 'title': 'Acrylic Box', 'slug': 'acrylic-box', 'slug_plural': 'acrylic-boxes'}, 'department': {'id': 83, 'title': 'Office'}, 'sort': 0}
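Since the question also asks for just the .jpg links, here is a small follow-up sketch, assuming each product dict carries an image_url string like the example above:
# Sketch: keep only the .jpg image links from the parsed products.
# Assumption: every product dict has an "image_url" key as in the example entry.
jpg_links = [p["image_url"] for p in products
             if p.get("image_url", "").split("?")[0].lower().endswith(".jpg")]
print(len(jpg_links))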
How to scrape the page inside the result card using BS4?
<img class="no-img" data-src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium" alt="Biryani By Kilo" data-gatype="RestaurantImageClick" data-url="/delhi/biryani-by-kilo-connaught-place-central-delhi-40178" data-w-onclick="cardClickHandler" src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium"> page url - https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p=1 this page contains some restaurants card now while scrapping the page in the loop I want to go inside the restaurant card URL which is in the above HTML code name by data-url class and scrape the no. of reviews from inside it, I don't know how to do it my current code for normal front page scrapping is ; def extract(page): url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}" # URL of the website header = {'User-Agent':'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'} # Temporary user agent r = requests.get(url, headers=header) soup = BeautifulSoup(r.content, 'html.parser') return soup def transform(soup): # function to scrape the page divs = soup.find_all('div', class_ = 'restnt-card restaurant') for item in divs: title = item.find('a').text.strip() # restaurant name loc = item.find('div', class_ = 'restnt-loc ellipsis').text.strip() # restaurant location try: # used this try and except method because some restaurants are unrated and while scrpaping those we would run into an error rating = item.find('div', class_="img-wrap").text rating = (re.sub("[^0-9,.]", "", rating)) except: rating = None pricce = item.find('span', class_="double-line-ellipsis").text.strip() # price for biriyani price = re.sub("[^0-9]", "", pricce)[:-1] biry_del = { 'name': title, 'location': loc, 'rating': rating, 'price': price } rest_list.append(biry_del) rest_list = [] for i in range(1,18): print(f'getting page, {i}') c = extract(i) transform(c) I hope you guys understood please ask in comment for any confusion.
It's not very fast, but it looks like you can get all the details you want, including the review count (not 232!), if you hit this backend API endpoint: https://www.dineout.co.in/get_rdp_data_main/delhi/69676/restaurant_detail_main
import requests
from bs4 import BeautifulSoup
import pandas as pd

rest_list = []
for page in range(1, 3):
    print(f'getting page, {page}')
    s = requests.Session()
    url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}"  # URL of the website
    header = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}  # Temporary user agent
    r = s.get(url, headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')
    divs = soup.find_all('div', class_='restnt-card restaurant')
    for item in divs:
        code = item.find('a')['href'].split('-')[-1]  # restaurant code
        print(f'Getting details for {code}')
        data = s.get(f'https://www.dineout.co.in/get_rdp_data_main/delhi/{code}/restaurant_detail_main').json()
        info = data['header']
        info.pop('share')  # clean up csv
        info.pop('options')
        rest_list.append(info)

df = pd.DataFrame(rest_list)
df.to_csv('dehli_rest.csv', index=False)
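The exact field in that JSON that holds the review count isn't shown above, so before picking columns it may help to dump the keys of one header dict and choose from there. A quick sketch, reusing the endpoint and restaurant code from the answer:
# Sketch: print the available fields once, then pick the one holding the review count.
# Assumption: the endpoint responds without extra headers, as in the answer above.
import requests

sample = requests.get(
    "https://www.dineout.co.in/get_rdp_data_main/delhi/69676/restaurant_detail_main"
).json()
print(sorted(sample["header"].keys()))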
Scrape table fields from HTML with a specific class
So I want to build a simple scraper for Google Shopping and I encountered some problems. This is the HTML text from my request (to https://www.google.es/shopping/product/7541391777504770249/online), where I'm trying to query the div class sh-osd__total-price inside the div class sh-osd__offer-row. My code is currently:
from bs4 import BeautifulSoup
from requests import get

url = 'https://www.google.es/shopping/product/7541391777504770249/online'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

r = html_soup.findAll('tr', {'class': 'sh-osd__offer-row'})  # Returns empty
print(r)
r = html_soup.findAll('tr', {'class': 'sh-osd__total-price'})  # Returns empty
print(r)
Both r are empty; Beautiful Soup doesn't find anything. Is there any way to find these two classes with Beautiful Soup?
You need to add a user agent to the headers:
from bs4 import BeautifulSoup
from requests import get

url = 'https://www.google.es/shopping/product/7541391777504770249/online'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}  # <-- added line
response = get(url, headers=headers)  # <-- include here
html_soup = BeautifulSoup(response.text, 'html.parser')

r = html_soup.find_all('tr', {'class': 'sh-osd__offer-row'})  # no longer empty
print(r)
r = html_soup.find_all('tr', {'class': 'sh-osd__total-price'})  # no longer empty
print(r)
But, since it's a <table> tag, you can use pandas (it uses BeautifulSoup under the hood), which does the hard work for you. It returns a list of dataframes, one for each <table> element on the page:
import pandas as pd

url = 'https://www.google.es/shopping/product/7541391777504770249/online'
dfs = pd.read_html(url)
print(dfs[-1])
Output:
                     Sellers Seller Rating  ...              Base Price Total Price
0               One Fragance     No rating  ...  £30.95 +£8.76 delivery      £39.71
1                       eBay     No rating  ...                  £46.81      £46.81
2              Carethy.co.uk     No rating  ...  £34.46 +£3.99 delivery      £38.45
3               fruugo.co.uk     No rating  ...  £36.95 +£9.30 delivery      £46.25
4  cosmeticsmegastore.com/gb     No rating  ...  £36.95 +£9.30 delivery      £46.25
5           Perfumes Club UK     No rating  ...  £30.39 +£5.99 delivery      £36.38

[6 rows x 5 columns]
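One caveat: pd.read_html(url) fetches the page itself without a user agent, so it can hit the same blocking problem as the first snippet. If it does, a workaround sketch (reusing the header from above) is to fetch the HTML with requests and hand the text to read_html, which also accepts raw HTML strings:
# Sketch: fetch with a user-agent header, then let pandas parse the tables from the text.
import pandas as pd
from requests import get

url = 'https://www.google.es/shopping/product/7541391777504770249/online'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
response = get(url, headers=headers)
dfs = pd.read_html(response.text)  # parse tables from the already-fetched HTML
print(dfs[-1])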
Not able to scrape all the reviews
I am trying to scrape this website and get the reviews, but I am facing an issue: the page loads only 50 reviews. To load more, you have to click "Show More Reviews", and I don't know how to get all the data, as there is no page link, and "Show More Reviews" doesn't have a URL to explore; the address remains the same.
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

a = []
url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class": "review-comments"})
#print(table)
for x in table:
    a.append(x.text)

df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first. Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an ajax call and returns the additional info; all you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}

url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
Data = []

# Number of pages to get; each page is equivalent to 50 comments:
MaximumCommentPages = 3

with requests.Session() as session:
    info = session.get(url)

    # Get the product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)

    # Extract info from the main page
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)

    # Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # down to 1 because the main page data was already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        #print(additionalInfo.text)

        # Extract the additional info:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)

# Extract data the old fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the ajax call.
Edit 1: You can reload the webpage multiple times and call the ajax again to get even more data.
Edit 2: Save data using your own method.
Edit 3: Changed some stuff; it now gets any number of pages for you and saves to a file with good ol' open().
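Since the asker was already building the CSV with pandas, the collected Data list can also be written that way instead of the manual open() loop. A small sketch:
# Sketch: save the gathered review tags with pandas instead of writing the file by hand.
import pandas as pd

df = pd.DataFrame({"review": [tag.get_text(strip=True) for tag in Data]})
df.to_csv("review.csv", index=False)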
BeautifulSoup parsing error
I am trying to extract some information about an app on Google Play and BeautifulSoup doesn't seem to work. The link is this (say): https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts
My code:
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html)
l = soup.find_all("div", { "class" : "document-subtitles"})
print len(l)  # prints 0 -- how is this 0?! There is clearly a div with that class
I decided to go all in; that didn't work either:
i = soup.select('html body.no-focus-outline.sidebar-visible.user-has-no-subscription div#wrapper.wrapper.wrapper-with-footer div#body-content.body-content div.outer-container div.inner-container div.main-content div div.details-wrapper.apps.square-cover.id-track-partial-impression.id-deep-link-item div.details-info div.info-container div.info-box-top')
print i
What am I doing wrong?
You need to pretend to be a real browser by supplying the User-Agent header:
import requests
from bs4 import BeautifulSoup

url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})
html = r.content
soup = BeautifulSoup(html, "html.parser")

title = soup.find(class_="id-app-title").get_text()
rating = soup.select_one(".document-subtitle .star-rating-non-editable-container")["aria-label"].strip()
print(title)
print(rating)
Prints the title and the current rating:
Weird Facts
Rated 4.3 stars out of five stars
To get the additional information field values, you can use the following generic function:
def get_info(soup, text):
    return soup.find("div", class_="title", text=lambda t: t and t.strip() == text).\
        find_next_sibling("div", class_="content").get_text(strip=True)
Then, if you do:
print(get_info(soup, "Size"))
print(get_info(soup, "Developer"))
You will see printed:
1.4M
Email email#here.com