I have tried to scrape the response of the form on https://www.languagesandnumbers.com/how-to-count-in-german/en/deu/ by filling out the form and submitting it with requests and BeautifulSoup. After inspecting the network traffic of the submit, I found out that the POST parameters are "numberz" and "lang". That's why I tried to post the following:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    response = session.post('https://www.languagesandnumbers.com/how-to-count-in-german/en/deu/', data={
        "numberz": "23",
        "lang": "deu"
    })
soup = BeautifulSoup(response.content, "lxml")
print(soup.find(id='words').get_text())
Unfortunately, the response is generated dynamically, so after submitting the form I always get the main page back without any text in the particular div that actually carries the result. Is there another way to scrape the response using requests and BeautifulSoup, without using Selenium?
You do not need BeautifulSoup, only the correct URL, which returns just the written-out number:
https://www.languagesandnumbers.com/ajax/en/
Because it returns the result in the form ack:::dreiundzwanzig, you have to extract the string:
response.text.split(':')[-1]
Example
import requests

with requests.Session() as session:
    response = session.post('https://www.languagesandnumbers.com/ajax/en/', data={
        "numberz": "23",
        "lang": "deu"
    })
print(response.text.split(':')[-1])
Output
dreiundzwanzig
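For reuse, the endpoint can be wrapped in a small helper. A minimal sketch, assuming the endpoint behaves the same for other numbers and languages (the function name spell_number is mine, not part of the site's API):

import requests

def spell_number(number, lang="deu"):
    # POST to the site's ajax endpoint and strip the "ack:::" prefix.
    resp = requests.post('https://www.languagesandnumbers.com/ajax/en/',
                         data={"numberz": str(number), "lang": lang})
    return resp.text.split(':')[-1]

print(spell_number(42))  # zweiundvierzig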
Related
Code:
from bs4 import BeautifulSoup as bs
import requests
keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"
r = requests.get(url)
soup = bs(r.content, "html.parser")
print(r)
Output: <Response [200]>
If I change print(r) to print(soup), I get a lot of text that doesn't look like HTML.
Something like: ..."marketing_brand_charlotte_phoenix":"control","marketing_brand_houston_chicago":"enabled","marketing_brand_houston_miami_business":"enabled","marketing_brand_northnhinenestphalia_bavaria":"enabled","marketing_brand_northrhinewestphalia_bavaria_business":"enabled","marketing_brand_seattle_dallas_business":"control","marketing_brand_seattle_orlando":"control","merchant_discovery_shopify_boosting":"control","merchant_storefront_mojito_migration":"enabled","merchant_success_activation_banner_collapse_over_dismiss":"enabled","merchant_success_auto_enroll_approved_merchant":"enabled","merchant_success_catalog_activation_card_copy_update":"control","merchant_success_claim_your_website_copy_update":"control","merchant_success_i18n_umr_review_flag":"enabled","merchant_success_product_tagging":"enabled","merchant_success_switch_merchant_review_queues":"enabled","merchant_success_tag_activation_card_copy_update":"control","merchant_success_tag_installation_redirect":"enabled","merchant_success_unified_merchant_review":"enabled","mini_renux_homefeed_refresh":"enabled","more_ideas_email_notifications":"enabled","more_ideas_newshub_notifications":"enabled","more_ideas_push_notifications":"enabled","msft_pwa_announcement_email":"control","multi_format_ad_group":"enabled","mweb_account_switcher_v2":"enabled","mweb_advertiser_growth_add_biz_create_entrypoint":"enabled","mweb_all_profiles_follow_parity":"enabled","mweb_auth_android_lite_low_res_limit_width":"enabled_736x","mweb_auth_low_res_limit_width":"enabled_736x","mweb_auth_no_client_context":"enabled"...
How can I get the HTML so that I can eventually scrape the pictures of a page for a given keyword? And how do I handle endless (infinite-scroll) pages when scraping?
You can get the right page if you request it properly. To prevent scraping, servers use some basic checks to filter out automated requests, so the problem concerns only the request, not the scraping part. The first and cheapest attempt is to pass a "fake" user agent with your request.
user_agent = "Mozilla/5.0 ..."  # of Firefox, Chrome, Safari, Opera, ...
response = requests.get(url, headers={"user-agent": user_agent})
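Put together with the code from the question, a minimal sketch could look like this (the user-agent string below is just an example of a Firefox one; any real browser string should work):

import requests
from bs4 import BeautifulSoup

keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"

# An example Firefox user-agent string; substitute your own browser's.
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
              "Gecko/20100101 Firefox/115.0")

r = requests.get(url, headers={"user-agent": user_agent})
soup = BeautifulSoup(r.content, "html.parser")
print(soup.title)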
I am trying to log in to the website https://www.icloudemserp.com/tpct/ to scrape some data, but I am unable to log in. I was trying it with requests in Python, using GET to fetch the page and POST to send the form data to the post URL. It just doesn't work, or I don't understand it.
I know how I can achieve this with Selenium. The website only posts to https://www.icloudemserp.com/corecampus/checkuser1.php after I log in, and this is the form data:
branchid: 1
userid:****
pass_word:***
branchid: 17
sel_acad_yr: 2013-2014
sel_sem: Sem 1
import requests
from bs4 import BeautifulSoup

login_data = {
    'branchid': '1',
    'userid': '****',
    'pass_word': '***',
    'branchid': '17',
    'sel_acad_yr': '2013-2014',
    'sel_sem': 'Sem1',
}

with requests.Session() as s:
    url = 'https://www.icloudemserp.com/tpct/'
    r = s.get(url)
    #soup = BeautifulSoup(r.content, 'html5lib')
    r = s.post('https://www.icloudemserp.com/corecampus/checkuser1.php', data=login_data)
    print(r.content)
Am I even on the right track?
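One detail worth noting about the attempt above: a Python dict cannot hold 'branchid' twice, so the second value ('17') silently replaces the first and only one branchid is ever sent. If the server really expects the field twice, as the captured form data suggests, requests also accepts the form data as a list of tuples, which preserves repeated fields. A sketch (credentials still masked, so it cannot run as-is):

import requests

login_data = [
    ('branchid', '1'),
    ('userid', '****'),
    ('pass_word', '***'),
    ('branchid', '17'),  # the repeated field is preserved this way
    ('sel_acad_yr', '2013-2014'),
    ('sel_sem', 'Sem 1'),
]

with requests.Session() as s:
    s.get('https://www.icloudemserp.com/tpct/')
    r = s.post('https://www.icloudemserp.com/corecampus/checkuser1.php',
               data=login_data)
    print(r.status_code)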
I'm trying to get the shipping price from this link:
https://www.banggood.com/Xiaomi-Mi-Air-Laptop-2019-13_3-inch-Intel-Core-i7-8550U-8GB-RAM-512GB-PCle-SSD-Win-10-NVIDIA-GeForce-MX250-Fingerprint-Sensor-Notebook-p-1535887.html?rmmds=search&cur_warehouse=CN
but it seems that the "strong" is empty.
I've tried a few solutions, but all of them gave me an empty "strong".
I'm using BeautifulSoup with Python 3.
For example, this code led me to an empty "strong":
import requests
from bs4 import BeautifulSoup

url = "https://www.banggood.com/Xiaomi-Mi-Air-Laptop-2019-13_3-inch-Intel-Core-i7-8550U-8GB-RAM-512GB-PCle-SSD-Win-10-NVIDIA-GeForce-MX250-Fingerprint-Sensor-Notebook-p-1535887.html?rmmds=search&cur_warehouse=CN"
client = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(client.content, 'lxml')
for child in soup.find("span", class_="free_ship").children:
    print(child)
The issue is that the 'Free Shipping' text is generated by JavaScript after the page loads, rather than being sent with the webpage itself. The page might obtain the shipping price by performing an HTTP request after it has loaded, or the price may be hidden within the page.
You might be able to find the XHR request that pulls the shipping price using DevTools in Firefox or Chrome (the Network tab) and use that to get the price.
Using the XHR, you can find that data:
import requests

url = 'https://m.banggood.com/ajax/product/dynamicPro/index.html'
payload = {
    'c': 'api',
    'sq': 'IY38TmCNgDhATYCmIDGxYisATHA7ANn2HwX2RNwEYrcAGAVgDNxawIQFhLpFhkOCuZFFxA'}

response = requests.get(url, params=payload).json()
data = response['result']
shipping = data['shipment']

for each in shipping.items():
    print(each)

print(shipping['shipCost'])
Output of print(shipping['shipCost']):
<b>Free Shipping</b>
I've been trying to learn to use the urllib2 package in Python. I tried to log in as a student (the left form) on a signup page for maths students: http://reg.maths.lth.se/. I have inspected the code (using Firebug), and the left form should obviously be submitted using POST with a key called pnr whose value should be a string 10 characters long (the last part can perhaps not be seen from the HTML code, but it is basically my social security number, so I know how long it should be). Note that the action in the header for the appropriate POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried the following (with a fake pnr in the example below; I used my real number in my own code):
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, what gets printed is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I can add any key/value pairs to the values dictionary without producing any error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you inspect what request is sent to the server when you enter the number and submit the form, you will notice that it is a POST request with pnr and _token parameters.
You are missing the _token parameter which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser that would ease the form submission (see the MechanicalSoup sketch after the code below). You may also parse the HTML yourself with an HTML parser like BeautifulSoup, extract the token, and send it via urllib2 or requests:
import requests
from bs4 import BeautifulSoup

PNR = "0000000000"  # 10 characters, as the form expects

url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"

with requests.Session() as session:
    # extract token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]

    # submit form
    session.post(login_url, data={
        "_token": token,
        "pnr": PNR
    })

    # navigate to the main page again (should be logged in)
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title)
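For comparison, here is a rough sketch of the same login using MechanicalSoup, which carries hidden inputs like _token over automatically; it assumes the student form is the first <form> on the page, which you should verify:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("http://reg.maths.lth.se/")
browser.select_form("form")  # assumption: the student form is the first <form>
browser["pnr"] = "0000000000"
browser.submit_selected()  # the hidden _token input is sent along automatically
print(browser.get_current_page().title)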
I am trying to write a Python (2.7) script that goes to a form on a website (with the name "form1") and fills in the first input field in said form with the word hello, the second input field with the word Ronald, and the third field with the email address ronaldG54#gmail.com.
Can anyone help me code this, or give me any tips or pointers on how to do it?
Aside from Mechanize and Selenium, which David has mentioned, this can also be achieved with Requests and BeautifulSoup.
To be more clear: use Requests to send requests to and retrieve responses from the server, and use BeautifulSoup to parse the response HTML in order to know what parameters to send to the server.
Here is an example script I wrote that uses both Requests and BeautifulSoup to submit a username and password to log in to Wikipedia:
import requests
from bs4 import BeautifulSoup as bs

def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n['value'] for n in soup.find_all('input')
             if n.get('name') == 'wpLoginToken']
    return token[0]

payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wpLoginAttempt': 'Log in',
    #'wpLoginToken': '',
}

with requests.session() as s:
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')
Update:
For your specific case, here is the working code:
import requests
from bs4 import BeautifulSoup as bs

def get_session_id(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = soup.find_all('input', {'name': 'survey_session_id'})[0]['value']
    return token

payload = {
    'f213054909': 'o213118718',  # 21st checkbox
    'f213054910': 'Ronald',  # first input-field
    'f213054911': 'ronaldG54#gmail.com',
}

url = r'https://app.e2ma.net/app2/survey/39047/213008231/f2e46b57c8/?v=a'

with requests.session() as s:
    resp = s.get(url)
    payload['survey_session_id'] = get_session_id(resp)
    response_post = s.post(url, data=payload)
    print(response_post.text)
Take a look at Mechanize and Selenium. Both are excellent pieces of software that would allow you to automate filling and submitting a form, among other browser tasks.
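For completeness, a rough sketch of the Selenium route (this uses modern Selenium 4 syntax, which requires Python 3; the URL is a placeholder, and the assumption that the three inputs can be addressed by position inside form1 should be checked against the actual page):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/page-with-form1")  # placeholder URL

# Locate the form by its name, then its input fields by position.
form = driver.find_element(By.NAME, "form1")
fields = form.find_elements(By.TAG_NAME, "input")
fields[0].send_keys("hello")
fields[1].send_keys("Ronald")
fields[2].send_keys("ronaldG54#gmail.com")
form.submit()
driver.quit()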