How to scrape news articles from cnbc with keyword "Green hydrogen"? - python

I am trying to scrape the news articles listed at this URL; all articles are in span.Card-title. But this gives blank output. Is there any way to resolve this?
from bs4 import BeautifulSoup as soup
import requests
cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
html = requests.get(cnbc_url)
bsobj = soup(html.content,'html.parser')
day = bsobj.find(id="root")
print(day.find_all('span',class_='Card-title'))
for link in bsobj.find_all('span',class_='Card-title'):
    print('Headlines : {}'.format(link.text))

The problem is that the content is not present on the page when it loads initially; it is only fetched from the server afterwards, using a URL like this
https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=green%20hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false&timezoneoffset=-240&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&needtoptickers=1&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28
and added to the page.
Take a look at the /json.aspx endpoint in devtools; the data seems to be there.
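A minimal sketch of hitting that endpoint directly with requests (assuming the response is plain JSON with a results list and cn:title fields, as the next answer shows):
import requests

# the json.aspx URL copied from devtools (see above); callback is left empty so plain JSON comes back
api_url = ("https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3"
           "&query=green%20hydrogen&endindex=0&batchsize=10&callback="
           "&showfaceted=false&timezoneoffset=-240&facetedfields=formats"
           "&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C"
           "&needtoptickers=1&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,"
           "3bfbe40caee7443e,626fdfcd96444f28")

resp = requests.get(api_url)
for result in resp.json().get('results', []):
    print(result.get('cn:title'))  # field name taken from the answer below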

As mentioned in the other answer, the data about the articles is loaded via another URL, which you can find through the Network tab in devtools. [In Chrome, you can open devtools with Ctrl+Shift+I, go to the Network tab to see the requests made, click on the name starting with 'json.aspx?...' to see the details, and then copy the Request URL from the Headers section.]
Once you have the Request URL, you can copy it and make the request in your code to get the data:
# dataReqUrl contains the copied Request URL
dataReq = requests.get(dataReqUrl)
for r in dataReq.json()['results']: print(r['cn:title'])
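The copied URL also carries endindex and batchsize parameters; presumably endindex advances by the batch size for each page (an assumption I have not verified), so paging through the results could look roughly like this:
# dataReqUrl is the Request URL copied from devtools, containing endindex=0&batchsize=10
import requests

batch = 10
for start in range(0, 50, batch):  # first five pages, as an illustration
    page_url = dataReqUrl.replace('endindex=0', f'endindex={start}')
    for r in requests.get(page_url).json().get('results', []):
        print(r['cn:title'])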
If you don't feel like trying to find that one request in 250+ other requests, you might also try to assemble a shorter form of the url with something like:
# import urllib.parse
# find link to js file with api key
jsLinks = bsobj.select('link[href][rel="preload"]')
jUrl = [m.get('href') for m in jsLinks if 'main' in m.get('href')][0]
jRes = requests.get(jUrl) # request js file api key
# get api key from javascript
qKey = jRes.text.replace(' ', '').split(
    'QUERYLY_KEY:'
)[-1].split(',')[0].replace('"', '').strip()
# form url
qParams = {
    'queryly_key': qKey,
    'query': search_for, # = 'green hydrogen'
    'batchsize': 10 # can go up to 100 apparently
}
qUrlParams = urllib.parse.urlencode(qParams, quote_via=urllib.parse.quote)
dataReqUrl = f'https://api.queryly.com/cnbc/json.aspx?{qUrlParams}'
Even though the assembled dataReqUrl is not identical to the copied one, it seems to be giving the same results (I checked with a few different search terms). However, I don't know how reliable this method is, especially compared to the much less convoluted approach with selenium:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# define chromeDriver_path <-- where you saved 'chromedriver.exe'
cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
driver = webdriver.Chrome(chromeDriver_path)
driver.get(cnbc_url)
ctSelector = 'span.Card-title'
WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located(
    (By.CSS_SELECTOR, ctSelector)))
cardTitles = driver.find_elements(By.CSS_SELECTOR, ctSelector)
cardTitles_text = [ct.get_attribute('innerText') for ct in cardTitles]
for c in cardTitles_text: print(c)
In my opinion, this approach is more reliable as well as simpler.

Related

Using Selenium, Python and XPATH to try to grab image urls from a website, doesn't work

None of this seems to work; the browser just closes or it just prints "None".
Any idea whether it's the wrong XPaths or something else going on?
Thanks a lot in advance.
Here's the HTML containing the image:
<a data-altimg="" data-prdcount="" href="/product/prd-5178/levis-505-regular-jeans-men.jsp?prdPV=5" rel="/product/prd-5178/levis-505-regular-jeans-men.jsp?prdPV=5">
<img alt="Men's Levi's® 505™ Regular Jeans" class="pmp-hero-img" title="Men's Levi's® 505™ Regular Jeans" width="120px" data-herosrc="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&hei=240&op_sharpen=1" loading="lazy" srcset="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&hei=240&op_sharpen=1 240w, https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=152&hei=152&op_sharpen=1 152w" sizes="(max-width: 728px) 20px" src="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&hei=240&op_sharpen=1">
</a>
Here's my script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.webdriver.common.action_chains import ActionChains
import time
# Start a webdriver instance
browser = webdriver.Firefox()
# Navigate to the page you want to scrape
browser.get('https://www.kohls.com/catalog/mens-clothing.jsp?CN=Gender:Mens+Department:Clothing&cc=mens-TN2.0-S-mensclothing')
time.sleep(12)
#images = browser.find_elements(By.XPATH, "//img[#class='pmp-hero-img']")
#images = browser.find_elements(By.CLASS_NAME, 'pmp-hero-img')
images = browser.find_elements(By.XPATH, "/html/body/div[2]/div[2]/div[2]/div[2]/div[1]/div/div/div[3]/div/div[4]/ul/li[*]/div[1]/div[2]/a/img")
#images = browser.find_elements(By.XPATH, "//*[#id='root_panel4124695']/div[4]/ul/li[5]/div[1]/div[2]/a/img")
for image in images:
    prod_img = image.get_attribute("src")
    print(prod_img)
# Close the webdriver instance
browser.close()
Tried to get the URLs, wasn't successful.
First, do not use very long XPath strings. They are hard to read and work with.
You can find your images like this:
images = browser.find_elements(By.CSS_SELECTOR, 'img[class="pmp-hero-img"]')
Now, the attribute you actually want is data-herosrc, not src:
for image in images:
    prod_img = image.get_attribute("data-herosrc")
    print(prod_img)
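Putting both fixes back into the original script, a minimal end-to-end sketch (same page and Firefox driver as in the question):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Firefox()
browser.get('https://www.kohls.com/catalog/mens-clothing.jsp?CN=Gender:Mens+Department:Clothing&cc=mens-TN2.0-S-mensclothing')
time.sleep(12)  # as in the original script; an explicit WebDriverWait would be more robust

images = browser.find_elements(By.CSS_SELECTOR, 'img.pmp-hero-img')
for image in images:
    print(image.get_attribute("data-herosrc"))

browser.close()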
As I've said in my comment, I suggest always trying a request-only approach first. There are only a few very limited use cases where browser-based web automation is really needed.
First, I would like to give you step-by-step instructions on how I would approach such a job:
Go to the website and find the data you want scraped.
Open the browser dev tools and go to the Network tab.
Hard-reload the page and look for the backend API calls that return the data you are looking for.
If the site is server-side rendered (with PHP, for example), you would need to extract the data from the raw HTML. But most sites today are client-side rendered and fetch their content dynamically.
The biggest pro of doing this is that you can extract far more content per request. Most APIs deliver their data as JSON, which you can use directly. Now let's look at your example:
While inspecting the Network tab, this request came to my attention:
https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1%7C15
Further inspection shows that this API call gives us all the products and corresponding information, like image URLs. Now all you need to do is check whether you can manipulate the call to return more products, and then save the URLs.
When we inspect the API call with Postman we can see that one parameter is the following:
Horizontal1%7C15
It seems that the 15 at the end corresponds to the number of products returned by the backend. Let's test it with 100.
https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1%7C100
I was right: changing this parameter of the URL gets us more products. Let's see what the upper boundary is by setting the parameter to the maximum number of products.
I've tested it, and it did not work: the upper boundary is 155. So you can scrape 155 products per request. Not too shabby. But how do we retrieve the rest? Let's investigate that URL further.
Hmm... it seems that on this website we can't get the data for the following pages with the same URL; they use another URL for the subsequent pages. That's a bummer.
Here is the code for the first page:
import requests
url = "https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1%7C100"
payload = "{\"departmentName\":\"Clothing\",\"gender\":\"Mens\",\"mcmId\":\"39824086452562678713609272051249254622\"}"
headers = {
'x-app-api_key': 'NQeOQ7owHHPkdkMkKuH5tPpGu0AvIIOu',
'Content-Type': 'text/plain',
'Cookie': '_abck=90C88A9A2DEE673DCDF0BE9C3126D29B~-1~YAAQnTD2wapYufCEAQAA+/cLUAmeLRA+xZuD/BVImKI+dwMVjQ/jXiYnVjPyi/kRL3dKnruzGKHvwWDd8jBDcwGHHKgJbJ0Hyg60cWzpwLDLr7QtA969asl8ENsgUF6Qu37tVpmds9K7H/k4zr2xufBDD/QcUejcrvj3VGnWbgLCb6MDhUJ35QPh41dOVUzJehpnZDOs/fucNOil1CGeqhyKazN9f16STd4T8mBVhzh3c6NVRrgPV1a+5itJfP+NryOPkUj4L1C9X5DacYEUJauOgaKhoTPoHxXXvKyjmWwYJJJ+sdU05zJSWvM5kYuor15QibXx714mO3aBuYIAHY3k3vtOaDs2DqGbpS/dnjJAiRQ8dmC5ft9+PvPtaeFFxflv8Ldo+KTViHuYAqTNWntvQrinZxAif8pJnMzd00ipxmrj2NrLgxIKQOu/s1VNsaXrLiAIxADl7nMm7lAEr5rxKa27/RjCxA+SLuaz0w9PnYdxUdyfqKeWuTLy5EmRCUCYgzyhO3i8dUTSSgDLglilJMM9~0~-1~1672088271; _dyid_server=7331860388304706917; ak_bmsc=B222629176612AB1EBF71F96EAB74FA1~000000000000000000000000000000~YAAQnTD2wXhfufCEAQAAxuAOUBKVYljdVEM6mA086TVhGvlLmbRxihQ+5L1dtLAKrX5oFG1qC+dg6EbPflwDVA7cwPkft84vUGj0bJkllZnVb0FZKSuVwD728oW1+rCdos7GLBUTkq3XFzCXh/qMr8oagYPHKMIEzXb839+BKmOjGlNvBQsP/eJm+BmxnSlYq03uLpBZVRdmPX7mDAq2KyPq9kCnB+6o+D+eVfzchurFxbpvmWb+XCG0oAD+V5PgW3nsSey99M27WSy4LMyFFljUqLPkSdTRFQGrm8Wfwci6rWuoGgVpF00JAVBpdO2eIVjxQdBVXS7q5CmNYRifMU3I1GpLUr6EH+kKoeMiDQNhvU95KXg/e8lrTkvaaJLOs5BZjeC3ueLY; bm_sv=CF184EA45C8052AF231029FD15170EBD~YAAQnTD2wSxgufCEAQAARkkPUBKJBEwgLsWkuV8MSzWmw5svZT0N7tUML8V5x3su83RK3/7zJr0STY4BrULhET6zGrHeEo1xoSz0qvgRHB3NGYVE6QFAhRZQ4qnqNoLBxM/EhIXl2wBere10BrAtmc8lcIYSGkPr8emEekEQ9bBLUL9UqXyJWSoaDjlY7Z2NdEQVQfO5Z8NxQv5usQXOBCqW/ukgxbuM3C5S2byDmjLtU7f2S5VjdimJ3kNSzD80~1; 019846f7bdaacd7a765789e7946e59ec=52e83be20b371394f002256644836518; akacd_EDE_GCP=2177452799~rv=5~id=a309910885f7706f566b983652ca61e9'
}
response = requests.request("POST", url, headers=headers, data=payload)
data = response.json()
print(data)
for product in data["payload"]["experiences"][0]["expPayload"]["products"]:
    print(product["image"]["url"])
Do something similar for the following pages and you will be set.
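For reference, a small sketch of parameterising the batch size in that URL (the plids value is just 'Horizontal1|<n>' URL-encoded; headers and payload as in the snippet above):
from urllib.parse import quote

def kohls_experiences_url(batch_size):
    # 'Horizontal1|<n>' controls how many products the backend returns; 155 seemed to be the maximum
    plids = quote(f"Horizontal1|{batch_size}", safe="")
    return f"https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids={plids}"

url = kohls_experiences_url(155)
# response = requests.request("POST", url, headers=headers, data=payload)  # same call as above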

How to download PDF from url in python

Note: This is a very different problem compared to other SO answers (Selenium Webdriver: How to Download a PDF File with Python?) for similar questions.
This is because the URL https://webice.ongc.co.in/pay_adv?TRACKNO=8262# does not directly return the PDF; it makes several other calls, and one of them is the URL that returns the PDF file.
I want to be able to call the URL with a variable for the query param TRACKNO and save the PDF file using Python.
I was able to do this using selenium, but my code fails to work when the browser is used in headless mode and I need it to work in headless mode. The code that I wrote is as follows:
import requests
from urllib3.exceptions import InsecureRequestWarning
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def extract_url(driver):
    advice_requests = driver.execute_script("var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;")
    print(advice_requests)
    for request in advice_requests:
        if request.get('initiatorType', "") == 'object' and request.get('entryType', "") == 'resource':
            link_split = request['name'].split('-')
            if link_split[-1] == 'filedownload=X':
                print("..... Successful")
                return request['name']
    print("..... Failed")

def save_advice(advice_url, tracking_num):
    requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
    response = requests.get(advice_url, verify=False)
    with open(f'{tracking_num}.pdf', 'wb') as f:
        f.write(response.content)

def get_payment_advice(tracking_nums):
    options = webdriver.ChromeOptions()
    # options.add_argument('headless')  # DOES NOT WORK IN HEADLESS MODE SO COMMENTED OUT
    driver = webdriver.Chrome(options=options)
    for num in tracking_nums:
        print(num, end=" ")
        driver.get(f'https://webice.ongc.co.in/pay_adv?TRACKNO={num}#')
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'ls-highlight-domref')))
            time.sleep(0.1)
            advice_url = extract_url(driver)
            save_advice(advice_url, num)
        except:
            pass
    driver.quit()

get_payment_advice(['8262'])
As can be seen, I get all the network calls the browser makes in the first line of the extract_url function and then parse each request to find the correct one. However, this does not work in headless mode.
Is there any other way of doing this, as this seems like a workaround? If not, can this be fixed to work in headless mode?
I fixed it; I only changed one function. The correct URL is in the driver's page_source (with BeautifulSoup you can parse HTML, XML, etc.):
from bs4 import BeautifulSoup
def extract_url(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")
    return f"https://webice.ongc.co.in{data}"
The hostname part can probably be extracted from the driver as well.
I don't think I changed anything else, but if it doesn't work for you, I can paste the full code.
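For example, deriving the hostname from the driver instead of hard-coding it could look like this (my own sketch, not part of the original answer):
from urllib.parse import urlsplit

def base_url(driver):
    # turn the page the driver is currently on into "https://webice.ongc.co.in"
    parts = urlsplit(driver.current_url)
    return f"{parts.scheme}://{parts.netloc}"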
Old Answer:
If you print the text of the returned page (print(driver.page_source)), I think you would get a message that says something like:
"Because of your system configuration the pdf can't be loaded"
This is because the requested site checks some browser properties to decide whether you are a robot or not. Maybe it helps to change some arguments (screen size, user agent) to fix this; there is plenty of material online about how headless browsers are detected.
And for next time, you should paste all relevant code (including the imports) into the question to make it easier to test.
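For what it's worth, the kind of option tweaks meant here would look roughly like this (these are real Chrome arguments, but whether they get past this particular site's checks is untested):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')  # a realistic screen size
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36')
driver = webdriver.Chrome(options=options)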

Retrieve Mechanical Soup results after submitting a form

I am struggling to retrieve some results from a simple form submission. This is what I have so far:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.set_verbose(2)
url = "https://www.dermcoll.edu.au/find-a-derm/"
browser.open(url)
form = browser.select_form("#find-derm-form")
browser["postcode"] = 3000
browser.submit_selected()
form.print_summary()
Where do these results end up...?
Many thanks
As per the MechanicalSoup FAQ, you shouldn't use this library when dealing with a dynamic JavaScript-enabled form, which seems to be the case for the website in your example.
Instead, you can use Selenium in combination with BeautifulSoup (and a little bit of help from webdriver-manager) to achieve your desired result. A short example would look like this:
from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager
# set up the Chrome driver instance using webdriver_manager
driver = webdriver.Chrome(ChromeDriverManager().install())
# navigate to the page
driver.get("https://www.dermcoll.edu.au/find-a-derm/")
# find the postcode input and enter your desired value
postcode_input = driver.find_element_by_name("postcode")
postcode_input.send_keys("3000")
# find the search button and perform the search
search_button = driver.find_element_by_class_name("search-btn.location_derm_search_icon")
search_button.click()
# get all search results and load them into a BeautifulSoup object for parsing
search_results = driver.find_element_by_id("search_result")
search_results = search_results.get_attribute('innerHTML')
search_results = BeautifulSoup(search_results, "html.parser")
# get individual result cards
search_results = search_results.find_all("div", {"class": "address_sec_contents"})
# now you can parse for whatever information you need
[x.find("h4") for x in search_results] # names
[x.find("p", {"class": "qualification"}) for x in search_results] # qualifications
[x.find("address") for x in search_results] # addresses
While this way may seem more involved, it's a lot more robust and can be easily repurposed for many more situations where MechanicalSoup falls short.
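As a small follow-up, the pieces above are bs4 tags, so pulling plain text out of each result card could look something like this (a sketch using the same element names as above):
for card in search_results:
    name = card.find("h4")
    qualification = card.find("p", {"class": "qualification"})
    address = card.find("address")
    print(
        name.get_text(strip=True) if name else "",
        qualification.get_text(strip=True) if qualification else "",
        address.get_text(" ", strip=True) if address else "",
        sep=" | ",
    )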

Scraping dynamic DataTable of many pages but same URL

I have experience with C and I'm starting to approach Python, mostly for fun.
I am trying to scrape this page here https://www.justetf.com/it/find-etf.html?groupField=index&from=search&/it/find-etf.html%3F1-1.0-esearch-etfsPanel.
Since the table with the content I'm interested in is dynamically created after connecting to the page, I'm using:
Selenium to load the page in the browser
Beautiful soup 4 for scraping the data loaded
At the moment I'm able to scrape all the fields of interest for the first 25 entries, the ones which are loaded once connected to the page. I can have up to 100 entries on one page, but there are 1045 entries in total, which are split across different pages. The problem is that the URL is the same for all the pages and the content of the table is dynamically loaded at runtime.
What I would like to do is find a way to scrape all 1045 entries. Reading around the internet, I have understood that I should send a proper POST request from my code (I've also found that they retrieve data from https://www.finanztreff.de/), get the data from the response, and scrape it.
I can see two possibilities :
Retrieve all the entries at once
Retrieve one page after the other and scrape one after the other
I have no idea how to build up the POST request.
I think there is no need to post the code but if needed I can re-edit the question.
Thanks in advance to everybody.
EDITED
Here you go with some code
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from bs4 import BeautifulSoup
import requests
firefox_binary = FirefoxBinary('some path\\firefox.exe')
browser = webdriver.Firefox(firefox_binary=firefox_binary)
url = "https://www.justetf.com/it/find-etf.html"
browser.get(url)
delay = 5 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'Alerian')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
page_source = browser.page_source
soup = BeautifulSoup(page_source, 'lxml')
From here on I just play a bit with the bs4 APIs.
This should do the trick (getting all the data at once):
import requests as r
link = 'https://www.justetf.com/it/find-etf.html?groupField=index&from=search&/it/find-etf.html%3F1-1.0-esearch-etfsPanel'
link2 = 'https://www.justetf.com/servlet/etfs-table'
data = {
    'draw': 1,
    'start': 0,
    'length': 10000000,
    'lang': 'it',
    'country': 'DE',
    'universeType': 'private',
    'etfsParams': link.split('?')[1]
}
res = r.post(link2, data=data)
result = res.json()
print(len(result["data"]))
EDIT: For the explanation, I opened the Network tab in Chrome and clicked through to the next pages to see what requests were made, and I noticed that a POST request was made to link2 with a lot of parameters, most of which were mandatory.
As for the needed parameters: for draw I only needed one draw (one request); start starts from position 0; for length I used a big number to scrape everything at once. If length were, for example, 10, you'd need a lot of draws; they go like draw=2&start=10&length=10, draw=3&start=20&length=10 and so on. For lang, country and universeType I don't know their exact purpose, but removing them caused the request to be rejected. And lastly, etfsParams is what comes after the '?' in link.
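If you'd rather fetch the data in smaller chunks instead of one huge length, a sketch following the draw/start/length pattern described above (assuming an empty data list marks the end, which I have not verified):
import requests as r

page_size = 100
all_rows = []
draw, start = 1, 0
while True:
    data = {
        'draw': draw,
        'start': start,
        'length': page_size,
        'lang': 'it',
        'country': 'DE',
        'universeType': 'private',
        'etfsParams': link.split('?')[1]  # link as defined above
    }
    chunk = r.post(link2, data=data).json()["data"]  # link2 as defined above
    if not chunk:
        break  # assumption: an empty page signals the end
    all_rows.extend(chunk)
    draw += 1
    start += page_size

print(len(all_rows))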

Need to scrape a table which is loaded through ajax using python(selenium)

I have a page with a table (table id="ctl00_ContentPlaceHolder_ctl00_ctl00_GV", class="GridListings") that I need to scrape.
I usually use BeautifulSoup & urllib for this, but in this case the problem is that the table takes some time to load, so it isn't captured when I try to fetch it using BS.
I cannot use PyQt4, dryscrape or windmill because of some installation issues, so the only possible way is to use Selenium/PhantomJS.
I tried the following, still with no success:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get(url)
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table#ctl00_ContentPlaceHolder_ctl00_ctl00_GV')))
The above code doesn't give me the desired contents of the table.
How do I go about achieving this?
You can get the data using requests and bs4. With almost all (if not all) ASP sites, there are a few POST params that always need to be provided, like __EVENTTARGET, __EVENTVALIDATION etc.:
from bs4 import BeautifulSoup
import requests
data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
"__EVENTARGUMENT": "LISTINGS;0",
"ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
"__ASYNCPOST": "true"}
And for the actual post, we need to add a few more values to our post data:
post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
    soup = BeautifulSoup(s.get(post).content)
    data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
    data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
    data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
    r = s.post(post, data=data)
    soup2 = BeautifulSoup(r.content)
    table = soup2.select_one("div.GridListings")
    print(table)
You will see the table printed when you run the code.
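To go one step further and pull the rows out of that listings markup, something like this should work (a sketch; the exact row/cell layout inside GridListings is an assumption):
rows = []
for tr in table.select("tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if cells:  # skip header-only or empty rows
        rows.append(cells)

for row in rows:
    print(row)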
If you want to scrape something, it is nice to first install a web debugger (Firebug for Mozilla Firefox, for example) to watch how the website you want to scrape works.
Next, you need to reproduce the process the website uses to talk to its backend.
As you said, the content you want to scrape is loaded asynchronously (only when the document is ready).
Assuming the debugger is running and you have refreshed the page, you will see the following request on the Network tab:
POST https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx
The final process flow to reach your goal will be:
1/ Use the requests python module
2/ Open a requests session to the index page of the site (with cookie handling)
3/ Scrape all the inputs of the specific POST form
4/ Build a POST parameter dict containing all the input name/value fields scraped in the previous step, plus some specific fixed params
5/ POST the request (with the required data)
6/ Finally, use the BS4 module (as usual) to soup the returned HTML and scrape your data
Please see below a working example:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import requests
base_url="https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
#create requests session
s = requests.session()
#get index page
r=s.get(base_url)
#soup page
bs=BeautifulSoup(r.text)
#extract FORM html
form_soup= bs.find('form',{'name':'aspnetForm'})
#extracting all inputs
input_div = form_soup.findAll("input")
#build the data parameters for POST request
#we add some required <fixed> data parameters for post
data={
'__EVENTARGUMENT':'LISTINGS;0',
'__EVENTTARGET':'ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV',
'__EVENTVALIDATION':'/wEWGwKis6fzCQLDnJnSDwLq4+CbDwK9jryHBQLrmcucCgL56enHAwLRrPHhCgKDk6P+CwL1/aWtDQLm0q+gCALRvI2QDAKch7HjBAKWqJHWBAKil5XsDQK58IbPAwLO3dKwCwL6uJOtBgLYnd3qBgKyp7zmBAKQyTBQK9qYAXAoieq54JAuG/rDkC1djKyQMC1qnUtgoC0OjaygUCv4b7sAhfkEODRvsa3noPfz2kMsxhAwlX3Q=='
}
#we add some <dynamic> data parameters
for input_d in input_div:
    try:
        data[input_d['name']] = input_d['value']
    except:
        pass  # skip unused input field
#post request
r2=s.post(base_url,data=data)
#write the result
with open("post_result.html","w") as f:
f.write(r2.text.encode('utf8'))
Now, please take a look at the post_result.html content and you will find the data!
Regards
