Python & BeautifulSoup 4 - Loop returning duplicate results

I'm trying to scrape from 6pm.com and I'm running into an issue - my loop seems to be returning duplicate results, e.g. it keeps repeating the same product multiple times when a distinct product should only appear once.
Here's my code:
import requests
from bs4 import BeautifulSoup

url_list1 = ['https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=1',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=2',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=3',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=4',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=5',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=6',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=7',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=8',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=9',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=10',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=11',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=12',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=13',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=14',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=15',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=16',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=17',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=18',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=19',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=20',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=21',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=22',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=23',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=24',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=25',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=26',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=27',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=28',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=29',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=30',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=31',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=32',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=33',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=34',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=35',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=36',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=37',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=38',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=39',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=40',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=41',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=42',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=43',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=44',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=45'
]
url_list2 = []
for url1 in url_list1:
    data1 = requests.get(url1)
    soup1 = BeautifulSoup(data1.text, 'html.parser')
    productUrls = soup1.findAll('article')
    for url2 in productUrls:
        get_urls = "https://www.6pm.com" + url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)
print(url_list2)
So the first part (url_list1) is basically a list of links. Each link leads to a page with 100 products of the selected brands. When I open each link in my browser, each page contains different products and there are no duplicates (that I'm aware of).
Next up, I initialize an empty list (url_list2) where I'm trying to store all the actual product URLs (so this list should contain 46 pages * 100 products = around 4,600 product URLs).
The first "for" loop iterates through each link in url_list1. The productUrls variable is a list that is supposed to store all "article" elements on each of the 46 pages.
The second, nested "for" loop iterates through the productUrls list and constructs the actual product URL. It is then supposed to append the constructed product URL to the empty list I initialized earlier, url_list2.
Testing the results with the print statement, I have noticed that many products appear multiple times instead of once (a quick way to quantify this is sketched below).
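For reference, here is a small diagnostic sketch built on the url_list2 produced by the loop above, just to measure how bad the duplication is:
from collections import Counter

# how many URLs were collected vs. how many are actually unique
print(len(url_list2), "collected,", len(set(url_list2)), "unique")

# which product URLs show up more than once, and how often
for product_url, count in Counter(url_list2).most_common(10):
    if count > 1:
        print(count, "x", product_url)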
Why would this be happening if by opening each url manually in my browser in url_list1 I can see different products on each page and don't notice any duplicates?
Any and all help is much appreciated.

There is a better way to do this for your scenario. You don't need to build the whole list of URLs by hand. Please try the code below, which is a simpler way to achieve the same result.
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso"
url_list2 = []
page_num = 1
session = requests.Session()
while page_num < 47:
    pageTree = session.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    productUrls = pageSoup.findAll('article')
    for url2 in productUrls:
        get_urls = "https://www.6pm.com" + url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)
    page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?p={}".format(page_num)
    page_num += 1
print(url_list2)
print(len(url_list2))
Let me know if that helps.

What happens is that the pages you see in your browser are not the same pages that requests gets. To solve the problem you must keep the requests session alive.
Try this, it worked for me. Replace your big for loop with:
with requests.Session() as s:  # <--- here we create a session that stays alive
    for url1 in url_list1:
        data1 = s.get(url1)  # <--- here we call the links with the same session
        soup1 = BeautifulSoup(data1.text, 'html.parser')
        productUrls = soup1.findAll('article')
        for url2 in productUrls:
            get_urls = "https://www.6pm.com" + url2.find('a', attrs={'itemprop': 'url'})['href']
            url_list2.append(get_urls)
Good luck !

Related

How to get all page results - Web Scraping - Pagination

I am a beginner in regards to coding. Right now I am trying to get a grip on simple web scrapers using python.
I want to scrape a real estate website and get the Title, price, sqm, and what not into a CSV file.
My questions:
It seems to work for the first page of results, but then it does not run through the 40 pages; instead, it fills the file with the same results repeated.
The listings have info about the square meters and the number of rooms. When I inspect the page, it seems that the same class is used for both elements. How would I extract the number of rooms, for example?
Here is the code that I have gathered so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={1}'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    divs = soup.find_all('div', class_ = 'col-xs-12 place-over-understitial sel-bg-gray-lighter')
    for item in divs:
        title = item.find('div', {'class': 'text-225'}).text.strip().replace('\n', '')
        title2 = title.replace('\t', '')
        hausart = item.find('span', class_ = 'text-100').text.strip().replace('\n', '')
        hausart2 = hausart.replace('\t', '')
        try:
            price = item.find('span', class_ = 'text-250 text-strong text-nowrap').text.strip()
        except:
            price = 'Auf Anfrage'
        wohnflaeche = item.find('p', class_ = 'text-250 text-strong text-nowrap').text.strip().replace('m²', '')
        angebot = {
            'title': title2,
            'hausart': hausart2,
            'price': price
        }
        hauslist.append(angebot)
    return

hauslist = []
for i in range(0, 40):
    print(f'Getting page {i}...')
    c = extract(i)
    transform(c)

df = pd.DataFrame(hauslist)
print(df.head())
df.to_csv('immonetHamburg.csv')
This is my first post on stackoverflow so please be kind if I should have posted my problem differently.
Thanks
Pat
You have a simple mistake.
In the URL you have to use {page} instead of {1}. That's all.
url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
I see another problem:
You start scraping at page 0, but servers often give the same result for page 0 and page 1.
You should use range(1, ...) instead of range(0, ...).
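Putting both fixes together, a minimal sketch of the corrected functions from the question (assuming the results really span pages 1 to 40) would be:
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    # use the page argument instead of the hard-coded {1}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

# start at page 1, not page 0
for i in range(1, 41):
    print(f'Getting page {i}...')
    transform(extract(i))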
As for searching for elements:
BeautifulSoup can search not only by class but also by id and any other attribute in a tag, e.g. name, style, data attributes, etc. It can also search by text such as "number of rooms", and it can use a regex for this. You can also supply your own function that checks an element and returns True/False to decide whether to keep it in the results.
You can also chain .find() with another .find() or .find_all().
price = item.find('div', {"id": lambda value: value and value.startswith('selPrice')}).find('span')
if price:
    print("price:", price.text)
And if you know that "square meter" comes before "number of rooms", then you can use find_all() to get both of them and later use [0] for the first and [1] for the second, as in the sketch below.
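For example, here is a short sketch of those search options; the id pattern, the text 'Zimmer', and the helper function are only illustrations of the API, not values taken from immonet's real markup (except the class text-250, which appears in the question's code):
import re

# search by an attribute other than class, e.g. an id matched by a regex
price_div = item.find('div', id=re.compile('^selPrice'))

# search by text, exactly or with a regex
label = item.find('strong', text='Zimmer')
label = item.find('strong', text=re.compile('Zimmer'))

# search with your own function that returns True/False
def is_detail_paragraph(tag):
    return tag.name == 'p' and 'text-250' in tag.get('class', [])

details = item.find_all(is_detail_paragraph)

# find_all() plus indexing when you know the order of the values
square_meter = details[0].text.strip()   # first match
rooms = details[-1].text.strip()         # last match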
You should read all the documentation because it can be very useful.
I advise you to use Selenium instead, because you can physically click the 'next page' button until you cover all pages, and the whole thing only takes a few lines of code.
As @furas mentioned, you have a mistake with the page.
To get the rooms you need find_all() and then take the last index with -1, because sometimes there are 3 items and sometimes 2.
# to remove all \n and \t
translator = str.maketrans({chr(10): '', chr(9): ''})
rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
    rooms = rooms[-1].text.translate(translator).strip()

Scrapy - Every Page is scraped but scrapy wraps around and scrapes first x amount of pages

class HomedepotcrawlSpider(CrawlSpider):
    name = 'homeDepotCrawl'
    #allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=0']

    def parse(self, response):
        for item in self.parseHomeDepot(response):
            yield item
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

    def parseHomeDepot(self, response):
        items = response.css('.plp-pod')
        for product in items:
            item = HomedepotSpiderItem()

            # get the SKU
            productSKU = product.css('.pod-plp__model::text').getall()

            # get rid of all the stuff I don't need
            productSKU = [x.strip(' ') for x in productSKU]          # whitespace
            productSKU = [x.strip('\n') for x in productSKU]
            productSKU = [x.strip('\t') for x in productSKU]
            productSKU = [x.strip(' Model# ') for x in productSKU]   # gets rid of the model label
            productSKU = [x.strip('\xa0') for x in productSKU]       # gets rid of non-breaking spaces
            item['productSKU'] = productSKU
            yield item
Explanation of the Problem
Here is part of the program that I have been working on to scrape data. I left out my code for scraping other fields because I did not think it was necessary to include it with this post. When I run this program and export the data to Excel, I get the first 240 items (10 pages). That goes up to row 241 of my spreadsheet (the first row is occupied by labels). Then, starting from row 242, the first 241 rows are repeated once again, and then again at rows 482 and 722.
The Scraper outputs the first 240 items 3 times
EDIT
So I was looking through the log during scraping and it turned out that every page was getting scraped. The last page is:
https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=696&Ns=None>
then afterwards the logfile is showing the first page getting scraped again, which is:
https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default
I assume because of..
The terminal command that I'm using to export to excel is:
scrapy crawl homeDepotCrawl -t csv -o - > "(File Location)"
Edit: The reason why I am using this command is because when exporting, Scrapy appends the scraped data to the file, so this erases the target file and just creates it again.
The HTML element that I used to work out how to get all the pages is:
<a class="hd-pagination__link" title="Next" href="/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=24&Ns=None" data-pagenumber="2"></a>
Originally I thought it was the website that was causing this unexpected behavior so on settings.py I changed ROBOTSTXT_OBEY = 0 and I added a delay but that did not change anything.
So what I would like help with:
-Figuring out why the CSV output only takes the first 240 Items (10 Pages) and repeats 3 times
-How to ensure the spider doesn't go back to the first page after scraping the first 30
I would suggest doing something like this. The main difference is that I'm grabbing the info from the JSON stored on the page and I'm paginating myself by recognizing that Nao is the product offset. The code is much shorter too:
import requests, json, re

product_skus = set()
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
base_url = 'https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=%s'

for page_num in range(1, 1000):
    url = base_url % (page_num * 24)
    res = requests.get(url, headers=headers)
    json_data = json.loads(re.search(r'digitalData\.content=(.+);', res.text).group(1))
    prev_len = len(product_skus)
    for product in json_data['product']:
        product_skus.add(product['productInfo']['sku'])
    if len(product_skus) == prev_len: break  # this line is optional and can determine when you want to break
Additionally, it looks like the Home Depot pages repeat every 10 pages (at least in what you sent) which is why you're seeing the 240 limitation. Here is an example from browsing it myself:
HD Page 5
HD Page 15
You are indeed wrapping around to the beginning; Chrome dev tools reveal that 'next' points to the first set of items when you reach the end.
You can detect and circumvent this with logic that looks at the current item index:
>>> from urllib.parse import urlparse, parse_qs
>>> url = 'https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=696&Ns=None'
>>> parsed = urlparse(url)
>>> page_index = int(parse_qs(parsed.query)['Nao'][0])
>>> page_index
696
Then edit your if next_page_url logic to also require something like page_index > last_page_index, as in the sketch below.
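A minimal sketch of how that could look inside the parse() method from the question (the last_page_index attribute is something I'm introducing here, not part of the original spider):
from urllib.parse import urlparse, parse_qs

def parse(self, response):
    for item in self.parseHomeDepot(response):
        yield item
    next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
    if next_page_url:
        # Nao is the product offset; a wrap-around points back to a smaller (or missing) offset
        page_index = int(parse_qs(urlparse(next_page_url).query).get('Nao', ['0'])[0])
        last_page_index = getattr(self, 'last_page_index', -1)
        if page_index > last_page_index:
            self.last_page_index = page_index
            yield response.follow(url=next_page_url, callback=self.parse)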

Not able to scrape the all the reviews

I am trying to scrape this website to get the reviews, but I am facing an issue:
The page loads only 50 reviews.
To load more, you have to click "Show More Reviews", and I don't know how to get all the data, as there is no page link; "Show More Reviews" also doesn't have a URL to explore, since the address stays the same.
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
a = []
url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class":"review-comments"})
#print(table)
for x in table:
    a.append(x.text)
df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first.
Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an AJAX call that returns the additional info; all you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
Data = []

# Each page is equivalent to 50 comments:
MaximumCommentPages = 3

with requests.Session() as session:
    info = session.get(url)
    # Get the product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)
    # Extract info from the main page
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)
    # Number of pages to get:
    # Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # number 1 because one of them was the main page data which we already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        #print(additionalInfo.text)
        # Extract info from the additional pages:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)

# Extract data the old-fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the ajax call.
Edit 1: You can reload the webpage multiple times and call the AJAX endpoint again to get even more data.
Edit 2: Save the data using your own method.
Edit 3: Changed some stuff; it now gets any number of pages for you and saves to a file with good ol' open().

find_element in Selenium (Python with BeautifulSoup) is not finding all elements from a certain class in a website. Why is that?

I'm trying to apply web scraping to get all the discounts in a website (www.ofertop.pe).
I have a script that happens to crawl a website, using Selenium together with BS4 and Requests (Python 3.5+) to list all the subsections and then scrape each subsection (which is a grid with several discounts).
The website goes as follow: www.ofertop.pe.
I have analyzed the website HTML and found there is a tree of sections inside the element "accesoSeccion". From that, I find all the "nivel2" elements in order to get all the subsections inside the first section map. You can see the code here:
import random
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

binary = FirefoxBinary(r'/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=binary, executable_path='/usr/local/bin/geckodriver')

def load_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

soup = load_html("https://ofertop.pe/")
lista_urls = ['https://ofertop.pe' + c.find('a')['href'] for c in soup.find('ul', {'id': 'accesoSeccion'}).find_all('li')]
categories = {c.split('/')[-1]: c for c in lista_urls}
sp = load_html(categories["bienestar-y-salud"])  # can be any section, since nivel2 repeats in each of them
subcats = [[ss.find("a")["href"] for ss in s.find_all("li")][1:] for s in sp.findAll("div", {"class": "nivel2"})]
subcats_flat = [x for y in subcats for x in y]
When I print categories I get:
https://ofertop.pe/descuentos/salon-de-belleza
https://ofertop.pe/descuentos/gastronomia
https://ofertop.pe/descuentos/entretenimiento
https://ofertop.pe/descuentos/viajes-y-turismo
https://ofertop.pe/descuentos/servicios
https://ofertop.pe/descuentos/estetica
https://ofertop.pe/descuentos/productos
https://ofertop.pe/descuentos/bienestar-y-salud
The subcategories (stored as a flat list in subcats_flat) are the categories inside the menu on the left side of each category page. For example, for the last category (bienestar y salud) you get the following subsections:
https://ofertop.pe/descuentos/bienestar-y-salud/dental
https://ofertop.pe/descuentos/bienestar-y-salud/gimnasio-fitness
https://ofertop.pe/descuentos/bienestar-y-salud/otros-bienestar-salud
https://ofertop.pe/descuentos/bienestar-y-salud/spa-y-relajacion
https://ofertop.pe/descuentos/bienestar-y-salud/salud
Note: I delete the first subcategory since it's equivalent to the union of every remaining subcategory.
After this, I iterate over all the subsections to get the discounts, using the following loop:
df_final_discounts = pd.DataFrame()
for x in subcats_flat:
    url = "https://ofertop.pe" + x
    df = category_scraping(url)
    df["segmento"] = url.split("/")[-2]
    df["subsegmento"] = url.split("/")[-1]
    df_final_discounts = pd.concat([df_final_discounts, df])
The code that analyzes each subsection and gets all the elements is:
def category_scraping(url_discounts):
    df_final = pd.DataFrame()
    driver.implicitly_wait(random.randrange(4))
    while url_discounts != '':
        driver.get(url_discounts)
        #print(url_discounts)
        try:
            driver.implicitly_wait(random.randrange(3))
            url_discounts = driver.find_element_by_id('tripleBasic').find_element_by_xpath("//a[@rel='next']").get_attribute("href")
        except:
            url_discounts = ''
        print(url_discounts)
        # Already used this one but it doesn't get 100% of the discounts either
        #df = pd.DataFrame([get_discount_window(c) for c in driver.find_elements_by_class_name("prod_track")])
        df = pd.DataFrame([get_discount_window(c) for c in driver.find_elements(By.XPATH, "//article[contains(@class, 'prod_track')]")])
        df_final = pd.concat([df_final, df])
        #print(driver.page_source)
    return df_final
Reviewing the HTML of every subsection (for example https://ofertop.pe/descuentos/estetica/rostro-y-piel), I can see that the discounts are produced with embedded JavaScript and are article elements with the class prod_track, so I use Selenium to get all those elements. However, whenever I run the script, I get values different from the ones I see in a normal browser (and in my Selenium-driven browser too). Currently, I'm getting around 600 different discounts, but the website Ofertop has around 1600.
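One thing worth ruling out (only a hedged sketch, not a confirmed fix) is that the random implicit waits are sometimes too short for the JavaScript-rendered cards; an explicit wait on the prod_track articles makes the timing deterministic:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_product_cards(driver, timeout=10):
    # wait until at least one prod_track article has been rendered, then grab them all
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, "//article[contains(@class, 'prod_track')]"))
    )
    return driver.find_elements(By.XPATH, "//article[contains(@class, 'prod_track')]")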

Different results from BeautifulSoup each time

I'm doing a web scrape of a website with 122 different pages with 10 entries per page. The code breaks on random pages, on random entries each time it is ran. I can run the code on a url one time and it works while other times it does not.
def get_soup(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return soup

def from_soup(soup, myCellsList):
    cellsList = soup.find_all('li', {'class': 'product clearfix'})
    for i in range(len(cellsList)):
        ottdDict = {}
        ottdDict['Name'] = cellsList[i].h3.text.strip()
This is only a piece of my code, but this is where the error occurs. The problem is that when I use this code, the h3 tag does not always appear in each item in cellsList. This results in a NoneType error when the last line of the code is run. However, the h3 tag is always there in the HTML when I inspect the webpage.
cellsList vs html 1
same comparison made from subsequent soup request
What could be causing these differences and how can I avoid this problem? I was able to run the code successfully for a time, and it seems to have all of a sudden stopped working. The code is able to scrape some pages without problem but it randomly does not register the h3 tags on random entries on random pages.
There are slight discrepancies in the HTML for various elements as you progress through the site's pages; the best way to get the name is actually to select the outer div and extract the text from the anchor.
This will get all the info from each product and put it into dicts where the keys are 'Tissue', 'Cell', etc. and the values are the related descriptions:
import requests
from time import sleep
from bs4 import BeautifulSoup

def from_soup(url):
    with requests.Session() as s:
        s.headers.update({
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"})
        # id for the next-page anchor.
        id_ = "#layoutcontent_2_middlecontent_0_threecolumncontent_0_content_ctl00_rptCenterColumn_dcpCenterColumn_0_ctl00_0_productRecords_0_bottomPaging_0_liNextPage_0"
        soup = BeautifulSoup(s.get(url).content, "html.parser")
        for li in soup.select("ul.product-list li.product.clearfix"):
            name = li.select_one("div.product-header.clearfix a").text.strip()
            d = {"name": name}
            for div in li.select("div.search-item"):
                k = div.strong.text
                d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
            yield d
        # get the anchor for the next page and loop until it is no longer there.
        nxt = soup.select_one(id_)
        # loop until there is no more next page.
        while nxt:
            # sleep between requests
            sleep(.5)
            resp = s.get(nxt.a["href"])
            soup = BeautifulSoup(resp.content, "html.parser")
            for li in soup.select("ul.product-list li.product.clearfix"):
                name = li.select_one("div.product-header.clearfix a").text.strip()
                d = {"name": name}
                for div in li.select("div.search-item"):
                    k = div.strong.text
                    d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
                yield d
            # refresh the next-page anchor for the page we just parsed
            nxt = soup.select_one(id_)
After running:
for ind, h in enumerate(from_soup(
        "https://www.lgcstandards-atcc.org/Products/Cells_and_Microorganisms/Cell_Lines/Human/Alphanumeric.aspx?geo_country=gb")):
    print(ind, h)
You will see 1211 dicts with all the data.
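If you would rather have the results in a file than printed dicts, a small follow-up sketch (pandas and the output filename are my own additions here, not part of the original answer) could collect everything the generator yields:
import pandas as pd

url = ("https://www.lgcstandards-atcc.org/Products/Cells_and_Microorganisms/"
       "Cell_Lines/Human/Alphanumeric.aspx?geo_country=gb")

# collect every yielded dict; missing keys simply become empty cells
df = pd.DataFrame(list(from_soup(url)))
df.to_csv("cell_lines.csv", index=False)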
