I am a beginner when it comes to coding. Right now I am trying to get a grip on simple web scrapers using Python.
I want to scrape a real estate website and get the Title, price, sqm, and what not into a CSV file.
My questions:
It seems to work for the first page of results, but then it does not run through the 40 pages; instead it fills the file with the same results over and over.
The listings have info about the square meters and the number of rooms. When I inspect the page, though, both values seem to use the same class. How would I extract the number of rooms, for example?
Here is the code that I have gathered so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={1}'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    divs = soup.find_all('div', class_ = 'col-xs-12 place-over-understitial sel-bg-gray-lighter')
    for item in divs:
        title = item.find('div', {'class': 'text-225'}).text.strip().replace('\n', '')
        title2 = title.replace('\t', '')
        hausart = item.find('span', class_ = 'text-100').text.strip().replace('\n', '')
        hausart2 = hausart.replace('\t', '')
        try:
            price = item.find('span', class_ = 'text-250 text-strong text-nowrap').text.strip()
        except:
            price = 'Auf Anfrage'
        wohnflaeche = item.find('p', class_ = 'text-250 text-strong text-nowrap').text.strip().replace('m²', '')
        angebot = {
            'title': title2,
            'hausart': hausart2,
            'price': price
        }
        hauslist.append(angebot)
    return

hauslist = []
for i in range(0, 40):
    print(f'Getting page {i}...')
    c = extract(i)
    transform(c)

df = pd.DataFrame(hauslist)
print(df.head())
df.to_csv('immonetHamburg.csv')
This is my first post on Stack Overflow, so please be kind if I should have posted my problem differently.
Thanks
Pat
You have a simple mistake.
In the URL you have to use {page} instead of {1}. That's all.
url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
I see another problem: you start scraping at page 0, but servers often return the same result for page 0 and page 1. You should use range(1, ...) instead of range(0, ...).
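Putting both fixes together, a minimal sketch (the rest of your code stays the same; I also pass headers with the headers= keyword so requests sends it as an HTTP header rather than as query parameters):

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    # {page} instead of {1}, so every request asks for a different page
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, 'html.parser')

for i in range(1, 41):  # pages 1..40, starting at 1 instead of 0
    print(f'Getting page {i}...')
    transform(extract(i))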
As for searching elements:
BeautifulSoup can search not only by class but also by id and any other attribute of a tag, e.g. name, style, data attributes, etc. It can also search by text such as "number of rooms", and it can use a regex for this. You can also pass your own function that checks an element and returns True/False to decide whether to keep it in the results.
You can also chain .find() with another .find() or .find_all().
price = item.find('div', {"id": lambda value: value and value.startswith('selPrice')}).find('span')
if price:
    print("price:", price.text)
And if you know that the square meters come before the number of rooms, then you could use find_all() to get both of them and later use [0] to get the first and [1] to get the second.
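For example, a sketch of that approach, assuming both values sit in tags that share the class text-250 and appear in that order:

details = item.find_all('p', class_='text-250')
if len(details) >= 2:
    wohnflaeche = details[0].text.strip()  # first one: square meters
    zimmer = details[1].text.strip()       # second one: number of rooms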
You should read the whole documentation because it can be very useful.
I advise you to use Selenium instead, because you can physically click the 'next page' button until you cover all pages, and the whole code will only take a few lines.
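For instance, a rough Selenium sketch of that idea; the CSS selector for the next-page button is only an assumption and has to be taken from the site's real markup:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page=1')
while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... extract the listings from soup here, e.g. with your transform() ...
    try:
        driver.find_element(By.CSS_SELECTOR, 'a[class*="next"]').click()  # hypothetical selector
    except Exception:
        break  # no next-page button left, we are done
    time.sleep(1)  # crude wait for the next page to load
driver.quit()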
As @furas mentioned, you have a mistake with the page parameter.
To get the rooms you need to use find_all() and take the last element with index -1, because sometimes there are 3 items and sometimes 2.
# to remove all \n and \t
translator = str.maketrans({chr(10): '', chr(9): ''})

rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
    rooms = rooms[-1].text.translate(translator).strip()
Related
I need to get the value "Anti-Mage" from a class in Python. How can I do it?
<td class="cell-xlarge">Anti-Mage<div class="subtext minor"><time data-time-ago="2021-07-26T23:27:54+00:00" datetime="2021-07-26T23:27:54+00:00" title="Mon, 26 Jul 2021 23:27:54 +0000">2021-07-26</time></div></td>
r1 = requests.get(f"https://www.dotabuff.com/players/{a}/heroes", headers=headers)
html1 = BS(r1.content, 'lxml')
for a in html1.find('td', {'class': 'cell-xlarge'}):
    b = a.findChildren('a', recursive=False)
    a_value = b.string
    print(a_value)
EDIT
Based on your comment, here is how to get all the <a> elements.
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.dotabuff.com/players/432283612'
headers = {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
r = requests.get(url, headers=headers)
soup = BS(r.text, "html.parser")
[x.text for x in soup.select('article a[href*="matches?hero"]')]
Output
['Anti-Mage', 'Shadow Fiend', 'Slark', 'Morphling', 'Tinker', 'Bristleback', 'Invoker', 'Broodmother', 'Templar Assassin', 'Monkey King']
Assuming the HTML posted in your question is parsed into the BeautifulSoup object soup, access the text attribute of the <a>:
soup.a.text
or select more specifically with the class you mentioned:
soup.select_one('.cell-xlarge a').text
Note: Selecting elements by class is in some cases only the third-best option, because classes can be dynamic and are not unique. Better strategies are to select by id or tag name.
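For example, if the cell had an id (hypothetical here, the snippet in the question only shows a class), selecting by it would be more stable:

soup.select_one('td#hero-cell a').text  # '#hero-cell' is a made-up id for illustration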
First, you'll need to select the parent item (td in this case) by its class name. You can do something like
td = soup.find('td', {'class': 'cell-xlarge'})
and then find the a child tags with something like this
a = td.findChildren('a', recursive=False)[0]
And this will give you the a tag. To get its value, you use .string like this
a_value = a.string
And that gives you the value of Anti-Mage
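Putting those steps together as one sketch (assuming the td really contains an <a> child, as on the live page):

td = soup.find('td', {'class': 'cell-xlarge'})
a_value = td.findChildren('a', recursive=False)[0].string
print(a_value)  # Anti-Mage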
I am trying to scrape this website to get the reviews, but I am facing an issue:
the page loads only 50 reviews.
To load more you have to click "Show More Reviews", and I don't know how to get all the data, as there is no page link. "Show More Reviews" also doesn't have a URL to explore; the address remains the same.
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
a = []
url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class":"review-comments"})
#print(table)
for x in table:
    a.append(x.text)
df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first.
Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an ajax call that returns the additional info; all you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"

Data = []
# Each page is equivalent to 50 comments; number of pages to get:
MaximumCommentPages = 3

with requests.Session() as session:
    info = session.get(url)
    # Get product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)

    # Extract info from the main page
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)

    # Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # stop at 1 because one of the pages was the main page, which we already extracted
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        #print(additionalInfo.text)

        # Extract info from the additional pages:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)

# Save the data the old-fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the ajax call.
Edit 1: You can reload the webpage multiple times and call the ajax again to get even more data.
Edit 2: Save the data using your own method.
Edit 3: Changed some stuff; it now gets any number of pages for you and saves to a file with good ol' open().
I'm trying to scrape from 6pm.com and I'm running into an issue - my loop seems to be returning duplicate results, e.g. it keeps repeating the same product multiple times when a distinct product should only appear once.
Here's my code:
url_list1 = ['https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=1',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=2',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=3',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=4',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=5',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=6',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=7',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=8',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=9',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=10',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=11',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=12',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=13',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=14',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=15',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=16',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=17',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=18',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=19',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=20',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=21',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=22',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=23',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=24',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=25',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=26',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=27',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=28',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=29',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=30',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=31',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=32',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=33',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=34',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=35',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=36',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=37',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=38',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=39',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=40',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=41',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=42',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=43',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=44',
'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=45'
]
url_list2 = []

for url1 in url_list1:
    data1 = requests.get(url1)
    soup1 = BeautifulSoup(data1.text, 'html.parser')
    productUrls = soup1.findAll('article')
    for url2 in productUrls:
        get_urls = "https://www.6pm.com" + url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)

print(url_list2)
So the first part (url_list1) is basically a link list. Each link leads to a page with 100 products of the selected brands. When I click each link and it opens in my browser, each page contains different products and there are no duplicates (that I'm aware of).
Next up, I initialize an empty list (url_list2) where I'm trying to store all the actual product URLs (so this list should contain 46 pages*100 products = around 4600 product URLs).
The first "for" loop iterates through each link in url_list1. The productUrls variable is a list that is supposed to store all "article" elements on each of the 46 pages.
The second, nested "for" loop iterates through the productUrls list and constructs the actual product URL. It is then supposed to append the constructed product URL to the empty list I initialized earlier, url_list2.
Testing the results with the print statement, I have noticed that many products are duplicated rather than distinct.
Why would this be happening if by opening each url manually in my browser in url_list1 I can see different products on each page and don't notice any duplicates?
Any and all help is much appreciated.
You can do this in a better way for this scenario. You don't need to put all the URLs in a list. Please try the code below, which is a simpler way to achieve the result.
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso"
url_list2 = []
page_num = 1
session = requests.Session()
while page_num < 47:
    pageTree = session.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    productUrls = pageSoup.findAll('article')
    for url2 in productUrls:
        get_urls = "https://www.6pm.com" + url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)
    page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?p={}".format(page_num)
    page_num += 1
print(url_list2)
print(len(url_list2))
Let me know if that helps.
What happens is that the pages you see in your browser are not the same ones that requests gets. To solve the problem you must keep the (requests) session alive.
Try this; it worked for me. Replace your big for loop with:
with requests.Session() as s:  # <--- here we create a session that stays alive
    for url1 in url_list1:
        data1 = s.get(url1)  # <--- here we call the links with the same session
        soup1 = BeautifulSoup(data1.text, 'html.parser')
        productUrls = soup1.findAll('article')
        for url2 in productUrls:
            get_urls = "https://www.6pm.com" + url2.find('a', attrs={'itemprop': 'url'})['href']
            url_list2.append(get_urls)
Good luck!
I'm trying to apply web scraping to get all the discounts on a website (www.ofertop.pe).
I have a script that crawls the website, using Selenium together with BS4 and Requests (Python 3.5+), to list all the subsections and then scrape each subsection (which is a grid with several discounts).
The website is: www.ofertop.pe.
I have analyzed the website's HTML and found that there is a tree of sections inside the element "accesoSeccion". From that, I find all the "nivel2" elements in order to get all the subsections inside the first section. You can see the code here:
binary = FirefoxBinary(r'/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=binary, executable_path='/usr/local/bin/geckodriver')
def load_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup
soup = load_html("https://ofertop.pe/")
lista_urls = ['https://ofertop.pe' + c.find('a')['href'] for c in soup.find('ul', {'id':'accesoSeccion'}).find_all('li')]
categories = {c.split('/')[-1]:c for c in lista_urls}
sp = load_html(categories["bienestar-y-salud"]) #can be any section, since nivel2 repeats in each of them
subcats = [[ss.find("a")["href"] for ss in s.find_all("li")][1:] for s in sp.findAll("div", {"class":"nivel2"})]
subcats_flat = [x for y in subcats for x in y]
When I print categories I get:
https://ofertop.pe/descuentos/salon-de-belleza
https://ofertop.pe/descuentos/gastronomia
https://ofertop.pe/descuentos/entretenimiento
https://ofertop.pe/descuentos/viajes-y-turismo
https://ofertop.pe/descuentos/servicios
https://ofertop.pe/descuentos/estetica
https://ofertop.pe/descuentos/productos
https://ofertop.pe/descuentos/bienestar-y-salud
The subcategories (stored as a flat list in subcats_flat) are the categories inside the menu on the left side of each category page. For example, for the last category (bienestar y salud) you get the following subsections:
https://ofertop.pe/descuentos/bienestar-y-salud/dental
https://ofertop.pe/descuentos/bienestar-y-salud/gimnasio-fitness
https://ofertop.pe/descuentos/bienestar-y-salud/otros-bienestar-salud
https://ofertop.pe/descuentos/bienestar-y-salud/spa-y-relajacion
https://ofertop.pe/descuentos/bienestar-y-salud/salud
Note: I delete the first subcategory since it's equivalent to the union of every remaining subcategory.
After this, I iterate over all the subsections to get the discounts, using the following loop:
df_final_discounts = pd.DataFrame()
for x in subcats_flat:
    url = "https://ofertop.pe" + x
    df = category_scraping(url)
    df["segmento"] = url.split("/")[-2]
    df["subsegmento"] = url.split("/")[-1]
    df_final_discounts = pd.concat([df_final_discounts, df])
The code that analyzes each subsection and gets all its elements is:
def category_scraping(url_discounts):
    df_final = pd.DataFrame()
    driver.implicitly_wait(random.randrange(4))
    while url_discounts != '':
        driver.get(url_discounts)
        #print(url_discounts)
        try:
            driver.implicitly_wait(random.randrange(3))
            url_discounts = driver.find_element_by_id('tripleBasic').find_element_by_xpath("//a[@rel='next']").get_attribute("href")
        except:
            url_discounts = ''
        print(url_discounts)
        # Already used this one, but it doesn't get 100% of the discounts either:
        #df = pd.DataFrame([get_discount_window(c) for c in driver.find_elements_by_class_name("prod_track")])
        df = pd.DataFrame([get_discount_window(c) for c in driver.find_elements(By.XPATH, "//article[contains(@class, 'prod_track')]")])
        df_final = pd.concat([df_final, df])
        #print(driver.page_source)
    return df_final
Reviewing the HTML of each subsection (for example https://ofertop.pe/descuentos/estetica/rostro-y-piel), I can see that the discounts are produced with embedded JavaScript and are article elements with the class prod_track, so I use Selenium to get all those elements. However, whenever I run the script, I get values different from the ones I see in a normal browser (and in my Selenium-driven browser too). Currently I'm getting around 600 different discounts, but the website Ofertop has around 1600.
I'm doing a web scrape of a website with 122 different pages and 10 entries per page. The code breaks on random pages and random entries each time it is run. I can run the code on a URL once and it works, while other times it does not.
def get_soup(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return soup

def from_soup(soup, myCellsList):
    cellsList = soup.find_all('li', {'class': 'product clearfix'})
    for i in range(len(cellsList)):
        ottdDict = {}
        ottdDict['Name'] = cellsList[i].h3.text.strip()
This is only a piece of my code, but this is where the error occurs. The problem is that when I use this code, the h3 tag does not always appear in each item in cellsList, which results in a NoneType error when the last line of the code is run. However, the h3 tag is always there in the HTML when I inspect the webpage.
(Screenshots: cellsList vs. the page HTML, and the same comparison made from a subsequent soup request.)
What could be causing these differences and how can I avoid this problem? I was able to run the code successfully for a time, and it seems to have all of a sudden stopped working. The code is able to scrape some pages without problem but it randomly does not register the h3 tags on random entries on random pages.
There are slight discrepancies in the HTML for various elements as you progress through the site's pages; the best way to get the name is actually to select the outer div and extract the text from the anchor.
This will get all the info from each product and put it into dicts where the keys are 'Tissue', 'Cell', etc. and the values are the related descriptions:
import requests
from bs4 import BeautifulSoup
from time import sleep

def from_soup(url):
    with requests.Session() as s:
        s.headers.update({
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"})
        # id of the next-page anchor.
        id_ = "#layoutcontent_2_middlecontent_0_threecolumncontent_0_content_ctl00_rptCenterColumn_dcpCenterColumn_0_ctl00_0_productRecords_0_bottomPaging_0_liNextPage_0"
        soup = BeautifulSoup(s.get(url).content, "html.parser")
        for li in soup.select("ul.product-list li.product.clearfix"):
            name = li.select_one("div.product-header.clearfix a").text.strip()
            d = {"name": name}
            for div in li.select("div.search-item"):
                k = div.strong.text
                d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
            yield d
        # get the anchor for the next page and loop until it is no longer there.
        nxt = soup.select_one(id_)
        # loop until there is no more next page.
        while nxt:
            # sleep between requests
            sleep(.5)
            resp = s.get(nxt.a["href"])
            soup = BeautifulSoup(resp.content, "html.parser")
            for li in soup.select("ul.product-list li.product.clearfix"):
                name = li.select_one("div.product-header.clearfix a").text.strip()
                d = {"name": name}
                for div in li.select("div.search-item"):
                    k = div.strong.text
                    d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
                yield d
            # re-select the next-page anchor from the new soup
            nxt = soup.select_one(id_)
After running:
for ind, h in enumerate(from_soup(
        "https://www.lgcstandards-atcc.org/Products/Cells_and_Microorganisms/Cell_Lines/Human/Alphanumeric.aspx?geo_country=gb")):
    print(ind, h)
You will see 1211 dicts with all the data.
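If you want the same kind of CSV output as in the other threads above, a small follow-up sketch (this assumes pandas is installed; it is not part of the original answer):

import pandas as pd

url = "https://www.lgcstandards-atcc.org/Products/Cells_and_Microorganisms/Cell_Lines/Human/Alphanumeric.aspx?geo_country=gb"
df = pd.DataFrame(list(from_soup(url)))
df.to_csv("products.csv", index=False)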