Scraping data from tables dependent on an interactive map - Python

Another scraping question. I'm trying to scrape data from the following website:
https://www.flightradar24.com/data/airlines/kl-klm/routes
However, the data I want to get only shows up after you click on one of the airports, in the form of a table under the map. From this table, I want to extract the number that indicates the frequency of daily flights to each airport. E.g., if you click on Paris Charles de Gaulle and inspect the row for the Netherlands in the table, the row above shows td rowspan="6", which in this case indicates that KLM has 6 flights a day to Paris.
I'm assuming that I would need to use a browser session like Selenium or something similar, so I started with the following code, but I'm not sure where to go from here as I'm not able to locate the airport dots in the source code.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.binary_location = 'C:/Users/C55480/AppData/Local/Google/Chrome SxS/Application/chrome.exe'
driver = webdriver.Chrome(executable_path='C:/Users/C55480/.spyder-py3/going_headless/chromedriver.exe', chrome_options=chrome_options)
airlines = ['kl-klm', 'dy-nax', 'lh-dlh']
for a in airlines:
    url = 'https://www.flightradar24.com/data/airlines/' + a + '/routes'
    page = driver.get(url)
Is there a way to make Selenium click on each dot and scrape the number of daily flights for every airport, and then from this find the total number of daily flights to each country?

Instead of using Selenium, try to get the required data with direct HTTP requests:
import requests
import json
s = requests.session()
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0"}
r = s.get("https://www.flightradar24.com/data/airlines/kl-klm/routes", headers=headers)
Data for each airport can be found in a script node that looks like
<script>var arrRoutes=[{"airport1":{"country":"Denmark","iata":"AAL","icao":"EKYT","lat":57.092781,"lon":9.849164,"name":"Aalborg Airport"}...]</script>
To get the JSON from the arrRoutes variable:
my_json = json.loads(r.text.split("arrRoutes=")[-1].split(", arrDates=")[0])
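The split trick works as long as both markers appear exactly once in the page; a slightly more defensive variant (same markers, just anchored with a regex) might look like this:
import re

match = re.search(r"arrRoutes=(.*?), arrDates=", r.text, re.DOTALL)
my_json = json.loads(match.group(1))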
You need to get the abbreviation (the value of the "iata" key) for each destination airport:
abbs_list = []
for route in my_json:
    if route["airport1"]["country"] == "Netherlands":
        abbs_list.append(route["airport2"]["iata"])
The output of print(abbs_list) should look like ['AAL', 'ABZ'...].
Now we can request data for each airport:
url = "https://www.flightradar24.com/data/airlines/kl-klm/routes?get-airport-arr-dep={}"
cookie = r.cookies.get_dict()
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0",
           "Content-Type": "application/json",
           "x-fetch": "true"}
for abbr in abbs_list:
    response = s.get(url.format(abbr), cookies=cookie, headers=headers).json()
    print(abbr, ": ", response["arrivals"]["Netherlands"]["number"]["flights"])
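To get from per-airport numbers to the per-country totals you asked about, you can sum these counts grouped by the destination airport's country, which is already present in arrRoutes. A minimal sketch, assuming every airport's response carries the same "arrivals" structure as in the print above:
from collections import defaultdict

# map each destination airport to its country, taken from arrRoutes
country_by_iata = {route["airport2"]["iata"]: route["airport2"]["country"]
                   for route in my_json
                   if route["airport1"]["country"] == "Netherlands"}

totals = defaultdict(int)
for abbr in abbs_list:
    response = s.get(url.format(abbr), cookies=cookie, headers=headers).json()
    # daily flights from the Netherlands into this airport, as shown above
    totals[country_by_iata[abbr]] += response["arrivals"]["Netherlands"]["number"]["flights"]

for country, flights in sorted(totals.items()):
    print(country, ":", flights)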

The map isn't represented through HTML/CSS, so I do not think it is possible to interact with it through Selenium natively.
However, I stumbled upon the Sikuli API, which uses image recognition to interact with things like Google Maps (as on the page you linked) and captchas. You could crop an image of an airport marker and have Sikuli recognize it and click on it. See http://www.assertselenium.com/maven/sikuliwebdriver-2/ for a small example of how to use it.
The data in the tables, however, can easily be selected using XPaths and parsed with a tool like Selenium. It seems that Sikuli is usable only from Java, so you would have to use Selenium with Java as well.

Related

Trouble collecting different property ids from a webpage using the requests module

After clicking on the button 11.331 Treffer located at the top right corner within the filter of this webpage, I can see the result displayed on that page. I've created a script using the requests module to fetch the ID numbers of different properties from that page.
However, when I run the script, I get json.decoder.JSONDecodeError. If I copy the cookies from dev tools directly and paste them within the headers, I get the results as expected.
I don't wish to copy cookies from dev tools every time I run the script, so I used Selenium to collect cookies from the landing page and supply them within headers to get the desired result, but I still get the same error.
This is what I'm trying:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
start_url = 'https://www.immobilienscout24.de/'
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?pagenumber=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?enteredFrom=one_step_search',
    'accept': 'application/json; charset=utf-8',
    'x-requested-with': 'XMLHttpRequest'
}
def get_cookies():
    with webdriver.Chrome() as driver:
        driver.get(start_url)
        time.sleep(10)
        cookiejar = {c['name']: c['value'] for c in driver.get_cookies()}
    return cookiejar
cookies = get_cookies()
cookie_string = "; ".join([f"{item}={val}" for item,val in cookies.items()])
with requests.Session() as s:
    s.headers.update(headers)
    s.headers['cookie'] = cookie_string
    res = s.get(link)
    container = res.json()['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
    for item in container:
        try:
            project_id = item['#id']
        except KeyError:
            project_id = ""
        print(project_id)
How can I scrape property ids from that webpage using the requests module?
EDIT:
The presence of the following portion within the cookies is crucial; without it, the script most likely raises the error I mentioned. However, Selenium failed to include that portion in the cookies it collected.
reese84=3:/qdGO9he7ld4/8a35vlw8g==:+/xBfAtVPRKHBSJgzngTQw1ywoViUvmVKLws+f8Y6edDgM+3s0Xzo17NvfgPrx9Z/suRy7hee5xcEgo85V3LdGsIop9/29g1ib1JQ0pO3UHWrtn81MseS6G8KE6AF4SrWZ2t8eTr1SEogUmCkB1HNSqXT88sAZaEi+XSzUyAGqikVjEcLX9TeI+KN37QNr9Sl+oTaOPchSgS/IowPj83zvT471Ewabg8CAc6q8I9AJ8Zb9FfLqePweCM+QFKIw+ZUp5GR4TXxZVcWdipbIEAyv3kj2x9Xs1K1k+8aXmy9VES6rFvW1xOsAjLmXbg6REPBye+QcAgPUh/x79mBWktcWC/uQ5L2W2dBLBS4eM2+bpEBw5EHMfjq9bk9hnmmZuxPGALLKASeXBt5lUUwx7x+wtGcjyvB9ZSE6gI2VxFLYqncYmhKqoNzgwQY8wRThaEraiJF/039/vVMa2G3S38iwniiOGHsOxq6VTdnWJGgvJqUmpWfXzz6XQXWL2xcykAoj7LMqHF2tC0DQyInUmZ3T7zjPBV7mEMgZkDn0z272E=:qQHyFe1/pp8/BS4RHAtxftttcOYJH4oqG1mW0+aNXF4=;
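A quick way to check which cookie names Selenium actually collected (reusing the get_cookies() helper above):
cookies = get_cookies()
print(sorted(cookies))        # names of the cookies Selenium captured
print('reese84' in cookies)   # False would explain the JSONDecodeError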
I think another part of your problem is that the link does not return JSON; it returns an HTML document. Part of that document contains JavaScript that sets a JS variable to a JSON object. You can't get at that with res.json().
In theory, you could use Selenium to go to the link and grab the contents of the IS24.resultList variable by executing JavaScript like this:
import json

driver.get(link)
time.sleep(10)
# serialize the JS object to a string first; execute_script would otherwise
# hand back a Python dict, which json.loads cannot parse
result_list = json.loads(driver.execute_script("return JSON.stringify(window.IS24.resultList)"))
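If that returns the variable, the ids should then be reachable the same way as in your requests attempt; a sketch, assuming IS24.resultList has the same shape as the JSON you were already indexing:
container = result_list['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
ids = [item.get('#id', '') for item in container]
print(ids)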
In practice, I think they're really serious about blocking bots and I suspect convincing them you're not a bot might take more than spoofing a cookie. When I visit via Selenium I don't even get the recaptcha option that I get when visiting through a regular browser session with incognito mode.

Python scraper not returning full html code on some subdomains

I am throwing together a Walmart review scraper; it currently scrapes HTML from most Walmart pages without a problem. As soon as I try scraping a page of reviews, it only comes back with a small portion of the page's code, mainly just text from the reviews and a few errant tags. Does anyone know what the problem could be?
import requests
headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'Accept-Language': 'en-us',
    'Referer': 'https://www.walmart.com/',
    'sec-ch-ua-platform': 'Windows',
}
cookie_jar = {
    '_pxvid': '35ed81e0-cb1a-11ec-aad0-504d5a625548',
}
product_num = input('Enter Product Number: ')
url2 = 'https://www.walmart.com/reviews/product/' + product_num
r = requests.get(url2, headers=headers, cookies=cookie_jar, timeout=5)
print(r.text)
As larsks already commented, some content is loaded in dynamically, for example when you scroll down far enough.
requests doesn't execute the JavaScript that loads that content, and BeautifulSoup only parses what you hand it, but you can solve this with Selenium.
Selenium opens your URL in a script-controlled web browser; it lets you fill out forms and also scroll down. Below is a code example of how to use Selenium with BS4.
from bs4 import BeautifulSoup
from selenium import webdriver

# Search on Google for the driver and save it in the path below
driver = webdriver.Firefox(executable_path=r"C:\Program Files (x86)\geckodriver.exe")
# for Chrome it's: driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")

# Here you open the url with the reviews
driver.get("https://www.example.com")
driver.maximize_window()

# This scrolls down to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

# Now you can scrape the loaded page from your Selenium browser using:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
This solution assumes that the reviews are loaded in through scrolling down the page. Of course you don't have to use BeautifulSoup to scrape the site, it's personal preference. Let me know if it helped.
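If a single scroll doesn't load everything (many pages fetch another batch of reviews each time you reach the bottom), a common pattern is to keep scrolling until the page height stops growing. A sketch, where the 2-second pause is an assumption you may need to tune:
import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new appeared, we've hit the real bottom
    last_height = new_height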

Problem with web scraping - JavaScript on website is disabled

Hello,
I've been playing with Discord bots (in Python) for a while now, and I've come across a problem with scraping information from websites that protect themselves from data collection by rendering everything with JavaScript, so you can't get at their data directly.
I have already looked at many websites that recommend changing the headers, among other things, but it has not helped.
The next step was to use Selenium, which returns this message:
We're sorry but Hive-Engine Explorer doesn't work properly without JavaScript enabled. Please enable it to continue.
Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://he.dtools.dev/richlist/BEE")
htmlSource = driver.page_source
print(htmlSource)
I also checked what it looks like on the browser side itself, and after entering the page there is no way to see the HTML source.
My question is: is it possible to bypass such security measures? I would have preferred to download the information from an API, but that is not possible in this case.
You don't need to run Selenium to get this data; the site uses a backend API to deliver it, which you can replicate easily in Python:
import requests
import pandas as pd
import time
import json
token = 'BEE'
limit = 100
id_ = int(time.time())
headers = {
    'accept': 'application/json, text/plain, */*',
    'content-type': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.3'
}
url = 'https://api.hive-engine.com/rpc/contracts'
payload = {"jsonrpc":"2.0","id":id_,"method":"find","params":{"contract":"tokens","table":"balances","query":{"symbol":token},"offset":0,"limit":limit}}
resp = requests.post(url, headers=headers, data=json.dumps(payload)).json()
df = pd.DataFrame(resp['result'])
df.to_csv('HiveData.csv', index=False)
print('Saved to HiveData.csv')
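The call above only fetches the first limit rows. If you want the whole rich list, you can page through the table by bumping the offset until the API returns an empty result; a sketch, under the assumption that the endpoint keeps honoring larger offsets:
# page through the rich list in chunks of `limit`
all_rows = []
offset = 0
while True:
    payload["params"]["offset"] = offset
    resp = requests.post(url, headers=headers, data=json.dumps(payload)).json()
    rows = resp["result"]
    if not rows:
        break  # assumed: an empty result means the whole table has been read
    all_rows.extend(rows)
    offset += limit

df = pd.DataFrame(all_rows)
df.to_csv('HiveData.csv', index=False)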

Unable to print the information in this div on a webpage? - Tried multiple methods - Python - BS4

Currently having some trouble attempting to pull the below text from the webpage:
"https://www.johnlewis.com/mulberry-bayswater-small-zipped-leather-handbag-summer-khaki/p5807862"
I am using the below code and I'm trying to print the product name, product price and number available in stock.
I am easily able to print the name and price, but I seem to be unable to print the number in stock.
I have tried using both StockInformation_stock__3OYkv and DefaultTemplate_product-stock-information__dFTUx, but I am either presented with nothing, or the price again.
What am I doing wrong?
Thanks in advance.
import requests
from bs4 import BeautifulSoup
url = 'https://www.johnlewis.com/mulberry-bayswater-small-zipped-leather-handbag-summer-khaki/p5807862'
response = requests.get(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})
soup = BeautifulSoup(response.content, 'html.parser')
numInStock = soup.find(class_="StockInformation_stock__3OYkv").get_text().strip()
productName = soup.find(id="confirmation-anchor-desktop").get_text().strip()
productPrice = soup.find(class_="ProductPrice_price__DcrIr").get_text().strip()
print(productName)
print(productPrice)
print(numInStock)
The webpage you chose has some dynamic elements, such as the rapidly changing stock number. The page you pulled first displays the more static elements, such as the product name and price, and then makes supplementary requests to different API URLs for the stock data (since it changes frequently). After the browser requests the supplemental data, it injects it into the original HTML page, which is why the frame with the name and product is there but not the stock. In simple terms, the webpage is "still loading" as you request it, and there are hundreds of other requests for images, files, and data that must also complete before you get the full page that your browser and eyes would normally see.
Fortunately, we only need one more request, which grabs the stock data.
To fix this, we are going to make an additional request to the URL for the stock information. I am unsure how much you know about reverse engineering, but I'll touch on it lightly. I did some reverse engineering and found the request goes to https://www.johnlewis.com/fashion-ui/api/stock/v2 in the form of a POST with the JSON parameters {"skus":["240280782"]} (skus being a list of products). The SKU is available in the webpage, so the full code to get the stock is as follows:
import requests
from bs4 import BeautifulSoup
url = 'https://www.johnlewis.com/longchamp-le-pliage-original-large-shoulder-bag/p5051141'
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})
soup = BeautifulSoup(response.content, 'html.parser')
numInStock = soup.find(class_="StockInformation_stock__3OYkv").get_text().strip()
productName = soup.find(id="confirmation-anchor-desktop").get_text().strip()
# also find the sku by extracting the numbers out of the following mess found in the webpage: ......"1,150.00"},"productId":"5807862","sku":"240280782","url":"https://www.johnlewis.com/mulberry-ba.....
sku = response.text.split('"sku":"')[1].split('"')[0]
# supplemental request with the newfound sku
response1 = requests.post('https://www.johnlewis.com/fashion-ui/api/stock/v2', headers={
    'authority': 'www.johnlewis.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'content-type': 'application/json',
    'accept': '*/*',
    'origin': 'https://www.johnlewis.com',
    'referer': url,  # the product page we just fetched
}, json={"skus": [sku]})
# returns JSON like: {"stocks":[{"skuId":"240280782","stockQuantity":2,"availabilityStatus":"SKU_AVAILABLE","stockMessage":"Only 2 in stock online","lastUpdated":"2021-12-05T22:03:27.613Z"}]}
# index the json; the quantity lives under stocks -> stockQuantity
try:
    stockQuantity = response1.json()["stocks"][0]["stockQuantity"]
except (KeyError, IndexError, ValueError):
    print("There was an error getting the stock")
    stockQuantity = "NaN"
print(productName)
print(stockQuantity)
print(numInStock)
I also made sure to test with other products. Since we dynamically simulate what a webpage does (step 1: get the page template; step 2: use the data from the template to make additional requests to the server), it works for any product URL.
This is EXTREMELY difficult and a pain. You need knowledge of front end, back end, JSON, and parsing to get it working.

Using Python to scrape push data?

I'm trying to scrape the left side of this news site (= SENESTE NYT):
https://www.dr.dk/nyheder/
But it seems the data is nowhere to be found, neither in the HTML nor in a related API/JSON etc. Is it some kind of push data?
Using Chrome's Network console, I've found this API, but it doesn't contain the news items on the left side:
https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100
Can anyone help me? How do I scrape "SENESTE NYT"?
I first loaded the page with Selenium and then processed the page source with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.dr.dk/nyheder"
driver = webdriver.Chrome()
driver.get(url)
page_source = driver.page_source
soup = BeautifulSoup(page_source, "lxml")
div = soup.find("div", {"class":"timeline-container"})
headlines = div.find_all("h3")
print(headlines)
And it seems to find the headlines:
[<h3>Puigdemont: Debatterede spørgsmål af interesse for hele Europa</h3>,
<h3>Afblæser tsunami-varsel for Hawaii</h3>,
<h3>56.000 flygter fra vulkan i udbrud </h3>,
<h3>Pence: USA offentliggør snart plan for ambassadeflytning </h3>,
<h3>Østjysk motorvej genåbnet </h3>]
Not sure if this is what you wanted.
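To turn those tags into plain strings, a small follow-up:
titles = [h3.get_text(strip=True) for h3 in headlines]
print(titles)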
-----EDITED-----
A more efficient way would be to create a request with some custom headers (though it has already been confirmed that this is not working):
import requests
headers = {
    "Accept": "*/*",
    "Host": "www.dr.dk",
    "Referer": "https://www.dr.dk/nyheder",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
r = requests.get(url="https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100", headers=headers)
r.json()
