Is there any way to download the audio from a certain page - python

I am working on a Selenium script in Python and want to download the audio played by a certain page.
The HTML of the page:
<html>
<head>
<meta name="viewport" content="width=device-width">
</head>
<body>
<video controls="" autoplay="" name="media">
<source src="https://website//id=47c484fc7f8f" type="audio/mp3">
</video>
</body>
</html>
My code so far:
from seleniumwire import webdriver
import sys
from webdriver_manager.chrome import ChromeDriverManager
import time
import pyaudio
import wave
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
# for Linux/Ubuntu only
# chrome_options.add_argument("--no-sandbox")
browser = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
browser.get("website")
search = browser.find_element_by_id("text-area")
search.clear()
text = input("text here : ")
search.send_keys(text)
# print(data)
time.sleep(2)
browser.find_element_by_id("btn").click()
# Access the captured requests via the `requests` attribute
for request in browser.requests:
    # Open the audio URL once its request has received a response
    if request.response and 'website//id' in request.url:
        browser.get(request.url)
I am open to working with any language to achieve the goal.

You don't need Selenium for this; the requests library is enough. You must send a unique identifier as sessionID in your POST request, so that you can pick up the generated file in the following GET request.
Use the following snippet as an example. It saves the generated file under the provided sessionID name.
import requests
sessionID = '78aa8dd0-9529-11eb-a8b3-0242ac130003'
payload = {'ssmlText': '<prosody pitch="default" rate="-0%">Roses are red, violets are blue</prosody>', 'sessionID': sessionID}
# Store the text under the session ID
r1 = requests.post("https://www.ibm.com/demos/live/tts-demo/api/tts/store", data=payload)
r1.raise_for_status()
print(r1.status_code, r1.reason)
tts_url = 'https://www.ibm.com/demos/live/tts-demo/api/tts/newSynthesize?voice=en-US_OliviaV3Voice&id=' + sessionID
try:
    # Fetch the generated audio for the same session ID
    r2 = requests.get(tts_url, timeout=10, cookies=r1.cookies)
    print(r2.status_code, r2.reason)
    try:
        with open(sessionID + '.mp3', "wb") as f:
            f.write(r2.content)
    except IOError:
        print("IOError: could not write a file")
except requests.exceptions.Timeout as err:
    print("Timeout: could not get response from the server")
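The session identifier in the snippet above is hard-coded. Below is a minimal sketch of generating a fresh identifier per request and building the download URL from it. It assumes the service accepts any unique UUID string; the sample value looks like a UUID, but the answer does not confirm the required format. The helper name build_tts_url is hypothetical.

```python
import uuid
from urllib.parse import urlencode

# Endpoint and voice name are the ones used in the answer above.
TTS_ENDPOINT = "https://www.ibm.com/demos/live/tts-demo/api/tts/newSynthesize"


def build_tts_url(session_id, voice="en-US_OliviaV3Voice"):
    """Return the GET URL for the audio generated under session_id."""
    return TTS_ENDPOINT + "?" + urlencode({"voice": voice, "id": session_id})


# Assumption: any unique UUID string is accepted as the session ID.
session_id = str(uuid.uuid4())
tts_url = build_tts_url(session_id)
print(tts_url)
```

The same session_id would then go into the POST payload, so both requests refer to the same generated file.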

Related

How Do I Monitor Network Flow with Selenium?

I am trying to scrape data from this URL with Python and Selenium:
https://shopee.co.id/PANCI-PRESTO-24cm-3.5L-TEFLON-i.323047288.19137193916?sp_atk=7e8e7abc-834c-4f4a-9234-19da9ddb2445&xptdk=7e8e7abc-834c-4f4a-9234-19da9ddb2445
If you watch the network traffic, you will see that the page calls a back-end API like this: https://shopee.co.id/api/v4/item/get?itemid=19137193916&shopid=323047288. How can I get the response returned by this API with Selenium?
Solved!
import json
import time
from selenium import webdriver
# Set up the Selenium webdriver with performance logging enabled
options = webdriver.ChromeOptions()
options.binary_location = "/usr/bin/brave"
options.add_argument("--ignore-certificate-errors")
# Selenium 4 takes capabilities through the options object
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)
# Navigate to the URL and monitor network flow
url = "https://shopee.co.id/PANCI-PRESTO-24cm-3.5L-TEFLON-i.323047288.19137193916?sp_atk=7e8e7abc-834c-4f4a-9234-19da9ddb2445&xptdk=7e8e7abc-834c-4f4a-9234-19da9ddb2445"
driver.get(url)
time.sleep(3)  # Wait for the page to load
# Find the API request and save the returned data to a file
logs = driver.get_log("performance")
for entry in logs:
    # Each log entry wraps a JSON-encoded DevTools event
    parsed_message = json.loads(entry.get("message", "{}"))
    message_dict = parsed_message.get("message", {})
    method = message_dict.get("method")
    if method == "Network.requestWillBeSent":
        params = message_dict.get("params", {})
        request_url = params.get("request", {}).get("url", "")
        if "https://shopee.co.id/api/v4/item/get?itemid=19137193916&shopid=323047288" in request_url:
            response = driver.execute_cdp_cmd(
                "Network.getResponseBody", {"requestId": params.get("requestId")}
            )
            with open("response.json", "w") as f:
                f.write(response.get("body", ""))
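The log-filtering step above can be separated from the browser so it is easier to test. The sketch below is one possible way to do that, not the author's method: it matches on Network.responseReceived events (an assumption made here so that the body is more likely to be available when it is requested), and the helper name find_request_ids and the sample log entries are hypothetical.

```python
import json


def find_request_ids(log_entries, url_fragment):
    """Return the requestId of every Network.responseReceived event
    whose response URL contains url_fragment."""
    request_ids = []
    for entry in log_entries:
        # Each performance-log entry wraps a JSON-encoded DevTools event.
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        params = event.get("params", {})
        if url_fragment in params.get("response", {}).get("url", ""):
            request_ids.append(params["requestId"])
    return request_ids


# Hand-made entries shaped like the output of driver.get_log("performance").
sample_logs = [
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {"requestId": "42.1",
                   "response": {"url": "https://shopee.co.id/api/v4/item/get?itemid=1"}},
    }})},
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent",
        "params": {"requestId": "42.2",
                   "request": {"url": "https://shopee.co.id/other"}},
    }})},
]
print(find_request_ids(sample_logs, "/api/v4/item/get"))
```

With a live driver, each returned ID could be passed to driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id}), as in the code above.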
I use selenium-wire for this. Install it with pip install selenium-wire, then import it into your project and use it like so:
from seleniumwire import webdriver
# Set the option that disables response encoding
sw_options = {
    'disable_encoding': True
}
# Create the driver with the selected options; selenium-wire captures traffic automatically
driver = webdriver.Chrome(seleniumwire_options=sw_options)
# Navigate to the page
driver.get('https://shopee.co.id/PANCI-PRESTO-24cm-3.5L-TEFLON-i.323047288.19137193916?sp_atk=7e8e7abc-834c-4f4a-9234-19da9ddb2445&xptdk=7e8e7abc-834c-4f4a-9234-19da9ddb2445')
# Iterate through the requests and find the one with the endpoint you need in the URL
for a in driver.requests:
    if "/api/v4/item/get?itemid=19137193916&shopid=323047288" in a.url:
        body = a.response.body
        print(body)
We add disable_encoding to the options; otherwise the body would come back encoded and you'd have to decode it manually, which can be done like so:
from seleniumwire.utils import decode
body = decode(response.body, response.headers.get('Content-Encoding', 'identity'))
Or set it in the selenium-wire options as I did.
You can find more information here:
https://pypi.org/project/selenium-wire/#response-objects
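To illustrate what "decoding manually" means here, the sketch below undoes the common Content-Encoding values using only the standard library. It is a stand-in written for illustration, not selenium-wire's own decode helper; the function name decode_body is hypothetical, and Brotli ("br") is deliberately left unhandled because it needs a third-party package.

```python
import gzip
import zlib


def decode_body(body, content_encoding="identity"):
    """Return the raw bytes of an HTTP body given its Content-Encoding."""
    encoding = (content_encoding or "identity").lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form that most servers send for "deflate".
        return zlib.decompress(body)
    if encoding == "identity":
        return body
    raise ValueError("unsupported Content-Encoding: " + encoding)


# Round trip on a made-up payload.
compressed = gzip.compress(b'{"item": "example"}')
print(decode_body(compressed, "gzip"))
```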

Web Scraping Blocked by Robots Meta Directives

I am working on a web scraper to access scheduling data from a website. Our company has full access to this website and data via login credentials. With dynamic site navigation required, I am using Selenium for automated data scraping, Python, and BeautifulSoup to work with the HTML structure. With all variables defined, I have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import lxml.html as lh
opt = Options()
opt.headless = True
driver = webdriver.Chrome(options=opt, executable_path=<path to chromedriver.exe>)
driver = webdriver.Chrome(<path to chromedriver.exe>)
driver.get('<website login page URL>?username=' + username + '&password=' + password)
driver.get('<url of website page with data>?start_date=' + start_date + '&end_date=' + end_date + '&type=Excel')
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
The result of the print(soup) is as follows:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
</head>
<body> ... irrelevant ... </body></html>
Before any questions: I do not have much knowledge of robots directives or HTTP requests. My questions are:
1. When I run a headless driver as above, the scrape is blocked by robots. When I run a regular, non-headless driver where an automated browser opens, the scrape is successful. Why is this the case?
2. What is the best method to get around this? The scrape is legal and non-exploitative, as we practically have full access to the data we are scraping (we are a registered client). Will using the requests library solve this problem? Are there other methods of running headless web drivers that won't get blocked? Is there some parameter I can change that prevents the block?
3. How do I see the robots.txt file of a website?
You can use the following code to hide the webdriver flag:
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
Also, add these to your Chrome options:
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option('useAutomationExtension', False)
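On the third question above: by convention, a site's robots.txt lives at the root of the host, so it can be viewed by requesting /robots.txt there. The ROBOTS meta tag in the printed output is a separate, page-level directive. A minimal sketch of deriving that address from any page URL follows; the helper name robots_txt_url is hypothetical and example.com is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/schedule/report?type=Excel"))
```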

How to download website when URL doesn't change after data addition

I would like to download data from the http://ec.europa.eu/taxation_customs/vies/ site. The problem is that when I enter data on it through my program, the URL doesn't change, so the file saved to disk contains the same page as the one opened at the beginning, without the data. Maybe I don't know how to access the page after the data has been added? I'm new to Python and looked for a solution without success, so if this has been asked before, please link me to it. Here's my code. I appreciate all responses :)
import requests
import selenium
import select as something
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pdfkit
url = "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
driver = webdriver.Chrome(executable_path ="C:\\Users\\Python\\Chromedriver.exe")
driver.get("http://ec.europa.eu/taxation_customs/vies/")
#wait = WebDriverWait(driver, 10)
obj = Select(driver.find_element_by_id("countryCombobox"))
obj = obj.select_by_index(1)
vies_r = requests.get(url)
vies_vat = driver.find_element_by_id("number")
vies_vat.send_keys('U54799909')
vies_verify = driver.find_element_by_id("submit")
vies_verify.click()
path_wkhtmltopdf = r'C:\Users\Python\wkhtmltox\wkhtmltox\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
print(driver.current_url)
pdfkit.from_url(driver.current_url, "out.pdf", configuration=config)
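One likely cause, offered as an assumption rather than a confirmed diagnosis: pdfkit.from_url fetches the URL again on its own, without the form data, so it only sees the empty page. A minimal sketch of saving the HTML that the Selenium session currently holds instead is shown below. The helper name save_rendered_html is hypothetical, and a stand-in object replaces the real driver so the example runs without a browser.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_rendered_html(driver, path):
    """Write the page currently held by the driver to path.

    Works with any object that exposes a page_source attribute,
    which Selenium webdrivers do. Returns the number of characters saved.
    """
    html = driver.page_source
    Path(path).write_text(html, encoding="utf-8")
    return len(html)


# Stand-in for a real webdriver (hypothetical content).
fake_driver = SimpleNamespace(page_source="<html><body>VAT result</body></html>")
out_file = Path(tempfile.gettempdir()) / "vies_result.html"
written = save_rendered_html(fake_driver, out_file)
print(written, "characters saved to", out_file)
```

With the real driver, the call would go after vies_verify.click() and a wait for the result to appear. The same HTML string could also be passed to pdfkit.from_string, although relative images and stylesheets may not resolve.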

How to download embedded PDF from webpage using selenium?

I want to download an embedded PDF from a web page using Selenium.
For example, page like this:
https://www.sebi.gov.in/enforcement/orders/jun-2019/adjudication-order-in-respect-of-three-entities-in-the-matter-of-prism-medico-and-pharmacy-ltd-_43323.html
I tried the code mentioned below but it did not work out.
def download_pdf(lnk):
    from selenium import webdriver
    from time import sleep
    options = webdriver.ChromeOptions()
    download_folder = "/*My folder*/"
    # Disable the built-in PDF viewer so that PDFs are downloaded instead of displayed
    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}
    options.add_experimental_option("prefs", profile)
    print("Downloading file from link: {}".format(lnk))
    driver = webdriver.Chrome('/*Path of chromedriver*/', chrome_options=options)
    driver.get(lnk)
    imp_by1 = driver.find_element_by_id("secondaryToolbarToggle")
    imp_by1.click()
    imp_by = driver.find_element_by_id("secondaryDownload")
    imp_by.click()
    print("Status: Download Complete.")
    driver.close()
download_pdf('https://www.sebi.gov.in/enforcement/orders/jun-2019/adjudication-order-in-respect-of-three-entities-in-the-matter-of-prism-medico-and-pharmacy-ltd-_43323.html')
Any help is appreciated.
Thanks in advance!!
Here you go; the description is in the code comments:
from selenium import webdriver
import os
# initialise browser
browser = webdriver.Chrome(os.getcwd()+'/chromedriver')
# load page with iframe
browser.get('https://www.sebi.gov.in/enforcement/orders/jun-2019/adjudication-order-in-respect-of-three-entities-in-the-matter-of-prism-medico-and-pharmacy-ltd-_43323.html')
# find pdf url
pdf_url = browser.find_element_by_tag_name('iframe').get_attribute("src")
# load page with pdf
browser.get(pdf_url)
# download file
download = browser.find_element_by_xpath('//*[@id="download"]')
download.click()
Here is another way to grab the file without clicking/downloading. This method also helps you to download the file to your local machine if your tests are executed in Selenium Grid (remote nodes).
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
public class FileDownloader extends MyPage {
    public void downloadFile() {
        // grab the file download url from your download icon/button/element
        // ('iframe' and 'driver' are fields inherited from 'MyPage.java')
        String src = iframe.getAttribute("src");
        driver.get(src);
        // Grab cookies from the current driver session (authenticated cookie
        // information is vital to download the file from 'src')
        StringBuilder cookies = new StringBuilder();
        for (Cookie cookie : driver.manage().getCookies()) {
            String value = cookie.getName() + "=" + cookie.getValue();
            if (cookies.length() == 0)
                cookies.append(value);
            else
                cookies.append(";").append(value);
        }
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(src).openConnection();
            con.setRequestMethod("GET");
            con.addRequestProperty("Cookie", cookies.toString());
            // set your own download path, probably a dynamic file name with timestamp
            String downloadPath = System.getProperty("user.dir") + File.separator + "file.pdf";
            OutputStream outputStream = new FileOutputStream(new File(downloadPath));
            InputStream inputStream = con.getInputStream();
            int BUFFER_SIZE = 4096;
            byte[] buffer = new byte[BUFFER_SIZE];
            int bytesRead = -1;
            // copy the response to the file in 4 KB chunks
            while ((bytesRead = inputStream.read(buffer)) != -1)
                outputStream.write(buffer, 0, bytesRead);
            outputStream.close();
            inputStream.close();
        } catch (Exception e) {
            // file download failed.
        }
    }
}
Here is what my DOM looks like:
<iframe src="/files/downloads/pdfgenerator.aspx" id="frame01">
  #document
  <html>
    <body>
      <embed width="100%" height ="100%" src="about:blank" type="application/pdf" internalid="1234567890">
    </body>
  </html>
</iframe>
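For Python users, the cookie-forwarding idea in the Java answer can be sketched as follows. This is an illustration under assumptions, not a port of that answer: the helper names cookie_header and absolute_src, the cookie values and the example.com URLs are all hypothetical. The cookie list has the shape that Selenium's driver.get_cookies() returns in Python (dicts with name and value keys).

```python
from urllib.parse import urljoin


def cookie_header(cookies):
    """Build a Cookie header value from a list of cookie dicts."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


def absolute_src(page_url, src):
    """Resolve a possibly relative iframe src against the page URL."""
    return urljoin(page_url, src)


# Made-up session cookies and page address.
cookies = [
    {"name": "ASP.NET_SessionId", "value": "abc"},
    {"name": "auth", "value": "xyz"},
]
headers = {"Cookie": cookie_header(cookies)}
pdf_url = absolute_src("https://example.com/reports/view.aspx",
                       "/files/downloads/pdfgenerator.aspx")
print(pdf_url, headers)
```

The resulting pdf_url and headers could then be handed to any HTTP client to fetch the file outside the browser, which is what the Java code does with HttpURLConnection.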

How to handle alerts with Python?

I would like to handle alerts with Python. What I would like to do is:
1. Open a URL
2. Submit a form or click some links
3. Check if an alert occurs in the new page
I did this in JavaScript using PhantomJS, but I would like to do it in Python as well.
Here is the JavaScript code:
file test.js:
var webPage = require('webpage');
var page = webPage.create();
var url = 'http://localhost:8001/index.html'
page.onConsoleMessage = function (msg) {
    console.log(msg);
}
page.open(url, function (status) {
    page.evaluate(function () {
        document.getElementById('myButton').click()
    });
    page.onConsoleMessage = function (msg) {
        console.log(msg);
    }
    page.onAlert = function (msg) {
        console.log('ALERT: ' + msg);
    };
    setTimeout(function () {
        page.evaluate(function () {
            console.log(document.documentElement.innerHTML)
        });
        phantom.exit();
    }, 1000);
});
file index.html
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title></title>
<meta charset="utf-8" />
</head>
<body>
<form>
<input id="username" name="username" />
<button id="myButton" type="button" value="Page2">Go to Page2</button>
</form>
</body>
</html>
<script>
document.getElementById("myButton").onclick = function () {
location.href = "page2.html";
};
</script>
file page2.html
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title></title>
<meta charset="utf-8" />
</head>
<body onload="alert('hello')">
</body>
</html>
This works; it detects the alert on page2.html.
Now I made this python script:
test.py
import requests
from test import BasicTest
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'http://localhost:8001/index.html'
def main():
    #browser = webdriver.Firefox()
    browser = webdriver.PhantomJS()
    browser.get(url)
    html_source = browser.page_source
    #browser.quit()
    soup = BeautifulSoup(html_source, "html.parser")
    soup.prettify()
    request = requests.get('http://localhost:8001/page2.html')
    print(request.text)
    #Handle Alert
if __name__ == "__main__":
    main()
Now, how can I check whether an alert occurs on page2.html with Python? First I open index.html, then page2.html.
I'm just starting out, so any suggestions will be appreciated.
P.S. I also tested webdriver.Firefox(), but it is extremely slow.
I also read this question: Check if any alert exists using selenium with python
but it doesn't work (below is the same script as before plus the solution suggested in that answer).
.....
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
....
def main():
    .....
    #Handle Alert
    try:
        WebDriverWait(browser, 3).until(EC.alert_is_present(),
                                        'Timed out waiting for PA creation ' +
                                        'confirmation popup to appear.')
        # switch_to.alert is a property, not a method
        alert = browser.switch_to.alert
        alert.accept()
        print("alert accepted")
    except TimeoutException:
        print("no alert")
if __name__ == "__main__":
    main()
I get the error:
"selenium.common.exceptions.WebDriverException: Message: Invalid Command Method.."
PhantomJS uses GhostDriver to implement the WebDriver Wire Protocol, which is how it works as a headless browser within Selenium.
Unfortunately, GhostDriver does not currently support alerts, although it looks like the maintainers would welcome help implementing the feature:
https://github.com/detro/ghostdriver/issues/20
You could switch to the JavaScript version of PhantomJS or use the Firefox driver within Selenium:
from selenium import webdriver
from selenium.common.exceptions import NoAlertPresentException
if __name__ == '__main__':
    # Switch to this driver and switch_to.alert will fail.
    # driver = webdriver.PhantomJS('<Path to Phantom>')
    driver = webdriver.Firefox()
    driver.set_window_size(1400, 1000)
    driver.get('http://localhost:8001/page2.html')
    try:
        # Accept the alert if one is open
        driver.switch_to.alert.accept()
        print('Alarm! ALARM!')
    except NoAlertPresentException:
        print('*crickets*')
