Scraping dynamic DataTable of many pages but same URL - python

I have experience with C and I'm starting to approach Python, mostly for fun.
I am trying to scrape this page here https://www.justetf.com/it/find-etf.html?groupField=index&from=search&/it/find-etf.html%3F1-1.0-esearch-etfsPanel.
Since the table with the content I'm interested in is dynamically created after connecting to the page, I'm using:
Selenium to load the page in the browser
Beautiful soup 4 for scraping the data loaded
At the moment I'm able to scrape all the fields of interest for the first 25 entries, the ones that are loaded when the page first opens. A single page can show up to 100 entries, but there are 1045 entries in total, split across several pages. The problem is that the URL is the same for all the pages and the content of the table is loaded dynamically at runtime.
What I would like to do is find a way to scrape all 1045 entries. Reading around on the internet, I've understood that I should send a proper POST request from my code (I've also found that they retrieve the data from https://www.finanztreff.de/), get the data from the response and scrape it.
I can see two possibilities :
Retrieve all the entries at once
Retrieve the pages one after the other and scrape them one by one
I have no idea how to build up the POST request.
I think there is no need to post the code but if needed I can re-edit the question.
Thanks in advance to everybody.
EDITED
Here you go with some code
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from bs4 import BeautifulSoup
import requests
firefox_binary = FirefoxBinary('some path\\firefox.exe')
browser = webdriver.Firefox(firefox_binary=firefox_binary)
url = "https://www.justetf.com/it/find-etf.html"
browser.get(url)
delay = 5 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'Alerian')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
page_source = browser.page_source
soup = BeautifulSoup(page_source, 'lxml')
From here on I just play a bit with the bs4 APIs.
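For example, something along these lines (the table and cell selectors here are just assumptions; the real markup may differ):

# Rough sketch of the bs4 part; the 'table'/'tr'/'td' selectors are assumptions
table = soup.find('table')
if table is not None:
    for row in table.find_all('tr')[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:
            print(cells)  # one list of fields per ETF entry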

This should do the trick (getting all the data at once):
import requests as r
link = 'https://www.justetf.com/it/find-etf.html?groupField=index&from=search&/it/find-etf.html%3F1-1.0-esearch-etfsPanel'
link2 = 'https://www.justetf.com/servlet/etfs-table'
data = {
    'draw': 1,
    'start': 0,
    'length': 10000000,
    'lang': 'it',
    'country': 'DE',
    'universeType': 'private',
    'etfsParams': link.split('?')[1]
}
res = r.post(link2, data=data)
result = res.json()
print(len(result["data"]))
EDIT: For the explanation: I opened the Network tab in Chrome and clicked through the next pages to see what requests were made, and I noticed that a POST request was sent to link2 with a lot of parameters, most of which were mandatory.
As for the required parameters: for draw I only needed a single draw (one request); start begins at position 0; for length I used a big number to scrape everything at once. If length were, say, 10, you'd need many draws, going like draw=2&start=10&length=10, draw=3&start=20&length=10 and so on. For lang, country and universeType I don't know their exact purpose, but removing them gets the request rejected. Finally, etfsParams is whatever comes after the '?' in link.
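If you'd rather go page by page instead of one huge request, a sketch of the same call in a loop could look like this (same endpoint and parameters as above; the page size of 100 and the stop condition are my own choices):

import requests as r

link = 'https://www.justetf.com/it/find-etf.html?groupField=index&from=search&/it/find-etf.html%3F1-1.0-esearch-etfsPanel'
link2 = 'https://www.justetf.com/servlet/etfs-table'

page_size = 100  # example page size
start = 0
draw = 1
rows = []

while True:
    data = {
        'draw': draw,
        'start': start,
        'length': page_size,
        'lang': 'it',
        'country': 'DE',
        'universeType': 'private',
        'etfsParams': link.split('?')[1],
    }
    result = r.post(link2, data=data).json()
    chunk = result['data']
    rows.extend(chunk)
    if len(chunk) < page_size:  # assume a short page means the last page was reached
        break
    start += page_size
    draw += 1

print(len(rows))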

Related

Using Selenium, Python and XPATH to try to grab image urls from a website, doesn't work

None of this seems to work; the browser just closes or it just prints "None".
Any idea whether it's the wrong XPaths or what's going on?
Thanks a lot in advance.
Here's the HTML containing the image:
`
<a data-altimg="" data-prdcount="" href="/product/prd-5178/levis-505-regular-jeans-men.jsp?prdPV=5" rel="/product/prd-5178/levis-505-regular-jeans-men.jsp?prdPV=5">
<img alt="Men's Levi's® 505™ Regular Jeans" class="pmp-hero-img" title="Men's Levi's® 505™ Regular Jeans" width="120px" data-herosrc="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&hei=240&op_sharpen=1" loading="lazy" srcset="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&hei=240&op_sharpen=1 240w, https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=152&hei=152&op_sharpen=1 152w" sizes="(max-width: 728px) 20px" src="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&hei=240&op_sharpen=1">
</a>
`
Here's my script:
`
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.webdriver.common.action_chains import ActionChains
import time
# Start a webdriver instance
browser = webdriver.Firefox()
# Navigate to the page you want to scrape
browser.get('https://www.kohls.com/catalog/mens-clothing.jsp?CN=Gender:Mens+Department:Clothing&cc=mens-TN2.0-S-mensclothing')
time.sleep(12)
#images = browser.find_elements(By.XPATH, "//img[@class='pmp-hero-img']")
#images = browser.find_elements(By.CLASS_NAME, 'pmp-hero-img')
images = browser.find_elements(By.XPATH, "/html/body/div[2]/div[2]/div[2]/div[2]/div[1]/div/div/div[3]/div/div[4]/ul/li[*]/div[1]/div[2]/a/img")
#images = browser.find_elements(By.XPATH, "//*[@id='root_panel4124695']/div[4]/ul/li[5]/div[1]/div[2]/a/img")
for image in images:
    prod_img = image.get_attribute("src")
    print(prod_img)
# Close the webdriver instance
browser.close()
`
Tried to get the URLs, wasn't successful.
First, do not use very long XPath strings. They are hard to read and work with.
You can find your images like this:
images = browser.find_elements(By.CSS_SELECTOR, 'img[class="pmp-hero-img"]')
Now, the attribute you want to find:
for image in images:
    prod_img = image.get_attribute("data-herosrc")
    print(prod_img)
As I said in my comment, I suggest always going for a request-only approach; there are only a few limited use cases where browser-based web automation is really needed.
First, I'd like to give you step-by-step instructions on how I would do such a job:
Go to the website and look for the data you want scraped.
Open the browser Dev Tools and go to the Network tab.
Hard-reload the page and look for the backend API calls that return the data you are looking for.
If the site is server-side rendered (with PHP, for example), you would need to extract the data from the raw HTML. But most sites today are client-side rendered and receive their content dynamically.
The biggest "pro" of this approach is that you can extract far more content per request. Most APIs deliver their data in JSON format, which can be used directly. Now let's look at your example:
While inspecting the Network tab, this request came to my attention:
https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1%7C15
Further inspection shows that this API call returns all the products and corresponding information such as image URLs. Now all you need to do is check whether you can manipulate the call to return more products, and then save the URLs.
When we inspect the API call with Postman, we can see that one of its parameters is the following:
Horizontal1%7C15
It seems that the 15 at the end corresponds to the number of products returned by the backend. Let's test it with 100:
https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1%7C100
I was right: changing this parameter of the URL gets us more products. Let's see what the upper boundary is by setting the parameter to the maximum number of products.
I've tested it; that did not work. The upper boundary is 155, so you can scrape 155 products per request. Not too shabby. But how do we retrieve the rest? Let's investigate that URL further.
Hmm... it seems we can't get the data for the following pages with the same URL, since the site uses a different URL for subsequent pages. That's a bummer.
Here is the code for the first page:
import requests
url = "https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1%7C100"
payload = "{\"departmentName\":\"Clothing\",\"gender\":\"Mens\",\"mcmId\":\"39824086452562678713609272051249254622\"}"
headers = {
'x-app-api_key': 'NQeOQ7owHHPkdkMkKuH5tPpGu0AvIIOu',
'Content-Type': 'text/plain',
'Cookie': '_abck=90C88A9A2DEE673DCDF0BE9C3126D29B~-1~YAAQnTD2wapYufCEAQAA+/cLUAmeLRA+xZuD/BVImKI+dwMVjQ/jXiYnVjPyi/kRL3dKnruzGKHvwWDd8jBDcwGHHKgJbJ0Hyg60cWzpwLDLr7QtA969asl8ENsgUF6Qu37tVpmds9K7H/k4zr2xufBDD/QcUejcrvj3VGnWbgLCb6MDhUJ35QPh41dOVUzJehpnZDOs/fucNOil1CGeqhyKazN9f16STd4T8mBVhzh3c6NVRrgPV1a+5itJfP+NryOPkUj4L1C9X5DacYEUJauOgaKhoTPoHxXXvKyjmWwYJJJ+sdU05zJSWvM5kYuor15QibXx714mO3aBuYIAHY3k3vtOaDs2DqGbpS/dnjJAiRQ8dmC5ft9+PvPtaeFFxflv8Ldo+KTViHuYAqTNWntvQrinZxAif8pJnMzd00ipxmrj2NrLgxIKQOu/s1VNsaXrLiAIxADl7nMm7lAEr5rxKa27/RjCxA+SLuaz0w9PnYdxUdyfqKeWuTLy5EmRCUCYgzyhO3i8dUTSSgDLglilJMM9~0~-1~1672088271; _dyid_server=7331860388304706917; ak_bmsc=B222629176612AB1EBF71F96EAB74FA1~000000000000000000000000000000~YAAQnTD2wXhfufCEAQAAxuAOUBKVYljdVEM6mA086TVhGvlLmbRxihQ+5L1dtLAKrX5oFG1qC+dg6EbPflwDVA7cwPkft84vUGj0bJkllZnVb0FZKSuVwD728oW1+rCdos7GLBUTkq3XFzCXh/qMr8oagYPHKMIEzXb839+BKmOjGlNvBQsP/eJm+BmxnSlYq03uLpBZVRdmPX7mDAq2KyPq9kCnB+6o+D+eVfzchurFxbpvmWb+XCG0oAD+V5PgW3nsSey99M27WSy4LMyFFljUqLPkSdTRFQGrm8Wfwci6rWuoGgVpF00JAVBpdO2eIVjxQdBVXS7q5CmNYRifMU3I1GpLUr6EH+kKoeMiDQNhvU95KXg/e8lrTkvaaJLOs5BZjeC3ueLY; bm_sv=CF184EA45C8052AF231029FD15170EBD~YAAQnTD2wSxgufCEAQAARkkPUBKJBEwgLsWkuV8MSzWmw5svZT0N7tUML8V5x3su83RK3/7zJr0STY4BrULhET6zGrHeEo1xoSz0qvgRHB3NGYVE6QFAhRZQ4qnqNoLBxM/EhIXl2wBere10BrAtmc8lcIYSGkPr8emEekEQ9bBLUL9UqXyJWSoaDjlY7Z2NdEQVQfO5Z8NxQv5usQXOBCqW/ukgxbuM3C5S2byDmjLtU7f2S5VjdimJ3kNSzD80~1; 019846f7bdaacd7a765789e7946e59ec=52e83be20b371394f002256644836518; akacd_EDE_GCP=2177452799~rv=5~id=a309910885f7706f566b983652ca61e9'
}
response = requests.request("POST", url, headers=headers, data=payload)
data = response.json()
print(data)
for product in data["payload"]["experiences"][0]["expPayload"]["products"]:
    print(product["image"]["url"])
Do something similar for the following pages and you will be set.

How to scrape news articles from cnbc with keyword "Green hydrogen"?

I am trying to scrape the news articles listed at this URL; all the articles are in span.Card-title. But this gives blank output. Is there any way to resolve this?
from bs4 import BeautifulSoup as soup
import requests
cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
html = requests.get(cnbc_url)
bsobj = soup(html.content,'html.parser')
day = bsobj.find(id="root")
print(day.find_all('span',class_='Card-title'))
for link in bsobj.find_all('span',class_='Card-title'):
    print('Headlines : {}'.format(link.text))
The problem is that the content is not present on the page when it initially loads; it is only fetched from the server afterwards, using a URL like this:
https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=green%20hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false&timezoneoffset=-240&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&needtoptickers=1&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28
and then added to the page.
Take a look at the /json.aspx endpoint in devtools; the data seems to be there.
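A minimal sketch of that approach, using the URL above as-is (the 'results' and 'cn:title' keys are what the response appears to contain; double-check them in devtools):

import requests

api_url = ("https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3"
           "&query=green%20hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false"
           "&timezoneoffset=-240&facetedfields=formats&facetedkey=formats%7C"
           "&facetedvalue=!Press%20Release%7C&needtoptickers=1"
           "&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28")

res = requests.get(api_url)
for item in res.json().get('results', []):  # key names assumed from the JSON response
    print(item.get('cn:title'))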
As mentioned in another answer, the data about the articles is loaded via another link, which you can find via the Network tab in devtools. [In Chrome, you can open devtools with Ctrl+Shift+I, then go to the Network tab to see the requests made, click on the name starting with 'json.aspx?...' to see its details, and copy the Request URL from the Headers section.]
Once you have the Request URL, you can copy it and make the request in your code to get the data:
# dataReqUrl contains the copied Request URL
dataReq = requests.get(dataReqUrl)
for r in dataReq.json()['results']: print(r['cn:title'])
If you don't feel like trying to find that one request in 250+ other requests, you might also try to assemble a shorter form of the url with something like:
# import urllib.parse
# find link to js file with api key
jsLinks = bsobj.select('link[href][rel="preload"]')
jUrl = [m.get('href') for m in jsLinks if 'main' in m.get('href')][0]
jRes = requests.get(jUrl) # request js file api key
# get api key from javascript
qKey = jRes.text.replace(' ', '').split(
    'QUERYLY_KEY:'
)[-1].split(',')[0].replace('"', '').strip()
# form url
qParams = {
    'queryly_key': qKey,
    'query': search_for,  # = 'green hydrogen'
    'batchsize': 10  # can go up to 100 apparently
}
qUrlParams = urllib.parse.urlencode(qParams, quote_via=urllib.parse.quote)
dataReqUrl = f'https://api.queryly.com/cnbc/json.aspx?{qUrlParams}'
Even though the assembled dataReqUrl is not identical to the copied one, it seems to be giving the same results (I checked with a few different search terms). However, I don't know how reliable this method is, especially compared to the much less convoluted approach with selenium:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# define chromeDriver_path <-- where you saved 'chromedriver.exe'
cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
driver = webdriver.Chrome(chromeDriver_path)
driver.get(cnbc_url)
ctSelector = 'span.Card-title'
WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located(
    (By.CSS_SELECTOR, ctSelector)))
cardTitles = driver.find_elements(By.CSS_SELECTOR, ctSelector)
cardTitles_text = [ct.get_attribute('innerText') for ct in cardTitles]
for c in cardTitles_text: print(c)
In my opinion, this approach is more reliable as well as simpler.

Create get request in a webdriver with selenium

Is it possible to send a get request to a webdriver using selenium?
I want to scrape a website with an infinitely scrolling page and gather a substantial number of the objects on it. For this I use Selenium to open the website in a webdriver and scroll down the page until enough objects are visible.
However, I'd like to scrape the information on the page with BeautifulSoup, since that is the most effective way in this case. If the GET request is sent in the normal way (see the code), the response only holds the first objects and not the objects from the scrolled-down page (which makes sense).
But is there any way to send a GET request to an open webdriver?
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import requests
from bs4 import BeautifulSoup
import time  # needed for time.sleep in the scroll loop
# Opening the website in the webdriver
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
# Loop for scrolling
scroll_start = 0
for i in range(100):
    scroll_end = scroll_start + 1080
    driver.execute_script(f'window.scrollTo({scroll_start}, {scroll_end})')
    time.sleep(2)
    scroll_start = scroll_end
# The get request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
You should probably find out what endpoint the website is using to fetch the data for the infinite scrolling.
Go to the website, open the Dev Tools, open the Network tab and find the HTTP request that asks for the content you're after; then maybe you can use it too. Just know that there are a lot of variables: are they using some sort of authorization for their APIs? Do the APIs return JSON, XML, HTML, ...? Also, I am not sure whether this counts as fair use.
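As a rough illustration of that idea only, here is a sketch; the endpoint, the offset/limit parameters and the JSON shape below are all hypothetical placeholders for whatever you actually find in the Network tab:

import requests

# Hypothetical endpoint and parameters; replace with the real request
# you see in the Network tab while scrolling the page.
endpoint = "https://example.com/api/items"
items = []
offset, limit = 0, 50

while len(items) < 1000:
    resp = requests.get(endpoint, params={"offset": offset, "limit": limit})
    resp.raise_for_status()
    batch = resp.json().get("items", [])  # response shape is an assumption
    if not batch:  # no more pages
        break
    items.extend(batch)
    offset += limit

print(len(items))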

Can't get all titles from a list with Python WebScraping

I'm practicing web scraping with Python at the moment and I ran into a problem: I wanted to scrape a website that has a list of anime I've watched, but when I try to scrape it (via requests or Selenium) it only gets around 30 of the 110 anime names on the page.
Here is my code with selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
    print(title.getText())
When I run it, the page source only goes up to an anime called 'Golden Time', even though there are 70 or more titles left on the page.
Thanks
Edit: Code that works now thanks to 'supputuri':
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
print(str(footer))
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    print('loading')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
    print(title.getText())
driver.close()
driver.quit()
ret = input()
Here is the solution.
Make sure to add import time
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    time.sleep(1)
print(str(driver.page_source))
This will iterate until all the anime is loaded and then gets the page source.
Let us know if this was helpful.
So, this is the gist of what I get when I load the page source:
AniListwindow.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH';Sorry, AniList requires Javascript.Please enable Javascript or http://outdatedbrowser.com>upgrade to a modern web browser.Sorry, AniList requires a modern browser.Please http://outdatedbrowser.com>upgrade to a newer web browser.
Since I know damn well that Javascript is enabled and my Chrome version is fully up to date, and the URL listed takes one to a nonsecure website to "download" a new version of your browser, I think this is a spam site. Not sure if you were aware of that when posting so I won't flag as such, but I wanted you and others who come across this to be aware.

Web scraping when scrolling down is needed

I want to scrape, e.g., the title of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. And I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me the text at https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, there aren't enough of them; on the actual web page, a user needs to scroll down manually to trigger extra loading.
Does anyone know how I could mimic "scrolling down" programmatically to load more of the page's content?
Infinite scrolling on a webpage is based on JavaScript functionality. Therefore, to find out which URL we need to access and which parameters to use, we need to either thoroughly study the JS code running inside the page or, preferably, examine the requests that the browser makes when you scroll down the page. We can study the requests using the Developer Tools.
Take Quora as an example: the more you scroll down, the more requests are generated. Your requests should then go to that URL instead of the normal page URL, but keep in mind to send the correct headers and payload.
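Sketched very roughly, with the URL, headers and payload left as placeholders you would copy from the request shown in the Developer Tools:

import requests

# Placeholders only: copy the real request URL, headers and payload
# from the Network tab after scrolling the page.
scroll_url = "https://www.quora.com/<endpoint-from-devtools>"
headers = {"User-Agent": "Mozilla/5.0", "Content-Type": "application/json"}
payload = {"<key>": "<value copied from the browser request>"}

resp = requests.post(scroll_url, headers=headers, json=payload)
print(resp.status_code)  # then inspect resp.json() or resp.text for the question data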
Another, easier solution is to use Selenium.
I couldn't find a way using requests, but you can use Selenium. I first printed out the number of questions at the initial load, then sent the End key to mimic scrolling down. You can see the number of questions went from 20 to 40 after sending the End key.
I used a wait of 5 seconds before reading the DOM again, in case the script runs too fast before the DOM has loaded. You can improve this by using EC (expected conditions) with Selenium.
The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, then you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # needed for the wait inside the loop
import time  # needed for time.sleep inside the loop
CHROMEDRIVER_PATH = ""
CHROME_PATH = ""
WINDOW_SIZE = "1920,1080"
chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
def scrape(url, times):
    if not url.startswith('http'):
        raise Exception('URLs need to start with "http"')
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH,
        chrome_options=chrome_options
    )
    driver.get(url)
    counter = 1
    while counter <= times:
        q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
        questions = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        q_len = len(questions)
        print(q_len)
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        wait = WebDriverWait(driver, 5)
        time.sleep(5)
        questions2 = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        print(len(questions2))
        counter += 1
    driver.close()

if __name__ == '__main__':
    scrape(url, 5)
I recommend using Selenium rather than bs4.
Selenium can control the browser as well as handle the parsing: scrolling down, clicking buttons, and so on.
This example scrolls down to collect all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", this probably means that the page is using Javascript to dynamically load the content.
You can try using a web client such as PhantomJS to load the page and execute the JavaScript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (see "Simulate scroll event using Javascript").
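If you drive the page with a Selenium webdriver instead, the same scroll-injection idea could be sketched like this; the scroll step and number of iterations are arbitrary choices:

from selenium import webdriver
import time

# Sketch only: simulate scrolling by injecting JS, as described above.
driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/Stack-Overflow-4/all_questions")

for step in range(1, 11):  # 10 scroll steps, arbitrary
    driver.execute_script(f"document.body.scrollTop = {step * 1080};")
    # window.scrollTo(0, document.body.scrollHeight) is an alternative on pages
    # where the body element itself is not the scrolling container
    time.sleep(2)  # give the page time to load the next batch

html = driver.page_source  # hand this to BeautifulSoup as usual
driver.quit()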
