Last time I had a question,
MechanicalSoup and Request,
which was closed because I had asked two questions and only one question is allowed. It's weird, but I'll take it, since I managed to answer question 2 with this code using the re module:
html = "<boop boop bap bap> </boop boop bap bap> <title form=ipooped> test </title>"
match_results = re.search("<title.*?>.*?</title.*?>", html, re.IGNORECASE)
content = match_results.group() #title = re.sub("<.*?>", "", title)
print(content);
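For what it's worth, the same title can be pulled out without a regex, since MechanicalSoup already depends on BeautifulSoup; a minimal sketch, assuming the same test string as above:

from bs4 import BeautifulSoup

html = "<boop boop bap bap> </boop boop bap bap> <title form=ipooped> test </title>"

# html.parser is lenient enough to find the <title> tag in this messy markup
soup = BeautifulSoup(html, "html.parser")
title = soup.find("title")
if title is not None:
    print(title.get_text(strip=True))  # -> test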
So now I'm going to ask a question again: how can I set a cookie in MechanicalSoup the way I would with requests.Session()? For example, like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'cookie': 'PHPSESSID=a9ej3sro77tkdoh7hdhj832m68; security=low ...'
}
response = session.get(url, headers=headers)
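MechanicalSoup's StatefulBrowser wraps an ordinary requests.Session, exposed as browser.session, so cookies and headers can be set on it directly before opening a page. A minimal sketch; the URL and the User-Agent are placeholders, and the cookie values are the ones from the question:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# browser.session is a plain requests.Session, so the usual requests APIs apply
browser.session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # placeholder UA
})
browser.session.cookies.set('PHPSESSID', 'a9ej3sro77tkdoh7hdhj832m68')
browser.session.cookies.set('security', 'low')

url = 'http://example.com/page.php'  # placeholder for the page you want
response = browser.open(url)  # the cookies are sent with the request
print(response.status_code)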
In an HTML file, I have a tag that includes <source type="audio/mpeg" src="/us/media in it. How can I extract the src attribute from that tag using bs4?
Here is code that produces the desired output:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
res = requests.get('https://dictionary.cambridge.org/us/dictionary/english/vulnerable', headers=headers)
soup = BeautifulSoup(res.content, 'html.parser')

# every <source> whose src attribute contains "us/media"
srcs = soup.select('source[src*="us/media"]')
for src in srcs:
    try:
        print(src['src'])
    except KeyError:
        pass
Output:
/us/media/english/us_pron/v/vul/vulne/vulnerable.mp3
/us/media/english/us_pron_ogg/v/vul/vulne/vulnerable.ogg
/us/media/english/uk_pron/u/ukv/ukvor/ukvorte027.mp3
/us/media/english/uk_pron_ogg/u/ukv/ukvor/ukvorte027.ogg
/us/media/english/us_pron/v/vul/vulne/vulnerable.mp3
/us/media/english/us_pron_ogg/v/vul/vulne/vulnerable.ogg
/us/media/english/us_pron/e/eus/eus74/eus74904.mp3
/us/media/english/us_pron_ogg/e/eus/eus74/eus74904.ogg
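If absolute links are needed, those relative src paths can be joined with the page URL; a small follow-up sketch using the standard library, reusing srcs from the snippet above:

from urllib.parse import urljoin

base = 'https://dictionary.cambridge.org/us/dictionary/english/vulnerable'
for src in srcs:
    # turns e.g. /us/media/... into https://dictionary.cambridge.org/us/media/...
    print(urljoin(base, src['src']))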
I am trying to scrape the houzz website.
In the browser dev tools it shows HTML content, but when I scrape it with BeautifulSoup it returns something else together with some of the HTML; I do not have much knowledge of this.
A small part of what I get is as follows:
</div><style data-styled="true" data-styled-version="5.2.1">.fzynIk.fzynIk{box-sizing:border-box;margin:0;overflow:hidden;}/*!sc*/
.eiQuKK.eiQuKK{box-sizing:border-box;margin:0;margin-bottom:4px;}/*!sc*/
.chJVzi.chJVzi{box-sizing:border-box;margin:0;margin-left:8px;}/*!sc*/
.kCIqph.kCIqph{box-sizing:border-box;margin:0;padding-top:32px;padding-bottom:32px;border-top:1px solid;border-color:#E6E6E6;}/*!sc*/
.dIRCmF.dIRCmF{box-sizing:border-box;margin:0;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-box-pack:justify;-webkit-justify-content:space-between;-ms-flex-pack:justify;justify-content:space-between;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;margin-bottom:16px;}/*!sc*/
.kmAORk.kmAORk{box-sizing:border-box;margin:0;margin-bottom:24px;}/*!sc*/
.bPERLb.bPERLb{box-sizing:border-box;margin:0;margin-bottom:-8px;}/*!sc*/
What should I do with this? Is this not achievable with BeautifulSoup?
Developer Tools operate on a live browser DOM: what you see when inspecting the page is not the original HTML but a version modified by browser clean-up and by executing JavaScript code.
Requests does not execute JavaScript, so the content can deviate slightly, but you can still scrape it; just take a deeper look into your soup.
Example (project titles)
from bs4 import BeautifulSoup
import requests

url_news = "https://www.houzz.com.au/professionals/home-builders/turrell-building-pty-ltd-pfvwau-pf~1099128087"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# the project titles are plain <h3> elements inside the #projects section
[title.text for title in soup.select('#projects h3')]
Output:
['Major Renovation & Master Wing',
'"The Italian Village" Private Residence',
'Country Classic',
'Residential Resort',
'Resort Style Extension, Stone and Timber',
'Old Northern Rd Estate']
I want to find the email in this webpage:
https://reachuae.com/livesearch/brand-detail/3910/A-ALICO-LTD-Sharjah
I wrote this code, but no email is found:
import requests
import re

url = 'https://reachuae.com/livesearch/brand-detail/3910/A-ALICO-LTD-Sharjah'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

r = requests.get(url, headers=headers)
print(r.status_code)

page_text = r.text
# simple email pattern: local part, @, domain, TLD
email = re.findall(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', page_text, re.IGNORECASE)
print(email)
It returns an empty list.
The email is not found at the URL you mentioned in the question, but when you click "(Click here to send enquiry)", another URL is generated at the bottom of the page. That URL contains the mail id. Using the Python code below, you can extract that mail id:
import requests
from lxml import html

Mail_url = 'https://reachuae.com/livesearch/brand-detail/3910/A-ALICO-LTD-Sharjah'

def mailExtractor():
    # the numeric id in the brand-detail URL is also used by the contact form page
    mail = Mail_url.split('/')
    innumber = mail[-2]
    Actual_url = 'https://reachuae.com/includes/contact_company.php?id={}&KeepThis=true&'.format(innumber)
    getr = requests.get(Actual_url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
    sour = html.fromstring(getr.content)
    # the address sits in the value attribute of the <input name="mail"> field
    emails = sour.xpath('//input[@name="mail"]//@value')
    for mail in emails:
        print(mail)

mailExtractor()
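If you would rather stay with BeautifulSoup, as in the earlier snippets, the same field can be read this way; a minimal sketch under the same assumption that the contact page contains an <input name="mail"> with the address in its value attribute:

import requests
from bs4 import BeautifulSoup

contact_url = 'https://reachuae.com/includes/contact_company.php?id=3910&KeepThis=true&'
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(contact_url, headers=headers)
soup = BeautifulSoup(resp.content, 'html.parser')

mail_input = soup.find('input', {'name': 'mail'})
if mail_input is not None:
    print(mail_input.get('value'))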
A URL returns this JSON, which is not in standard format:
{}&& {identifier:'ID', label:'As at 08-03-2018 5:06 PM',items:[{ID:0,N:'2ndChance W200123',SIP:'',NC:'CDWW',R:'',I:'',M:'',LT:0.009,C:0.000,VL:108.200,BV:2149.900,B:'0.008',S:'0.009',SV:7218.300,O:0.009,H:0.009,L:0.008,V:873.700,SC:'5',PV:0.009,P:0.000,BL:'100',P_:'X',V_:''},{ID:1,N:'3Cnergy',SIP:'',NC:'502',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:50.000,B:'0.022',S:'0.025',SV:36.000,O:0,H:0,L:0,V:0.000,SC:'2',PV:0.021,P:0,BL:'100',P_:'X',V_:''},{ID:2,N:'3Cnergy W200528',SIP:'',NC:'1E0W',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:0,B:'',S:'0.004',SV:50.000,O:0,H:0,L:0,V:0.000,SC:'5',PV:0.002,P:0,BL:'100',P_:'X',V_:''}`
I want to put all the data into a list or into pandas, starting from ID.
The prefix {}&& {identifier:'ID', label:'As at 08-03-2018 5:06 PM',items: is not wanted when I request the URL.
import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = 'http://www.sgx.com/JsonRead/JsonstData?qryId=RAll'
page = requests.get(url, headers=headers)
alldata = html.fromstring(page.content)
However, I am unable to continue because the JSON format is not standard. How can I correct it?
import requests
import execjs

url = 'http://www.sgx.com/JsonRead/JsonstData?qryId=RAll'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

page = requests.get(url, headers=headers)

# drop the leading '{}&& ' so only the object literal is left
text = page.text
content = text[len('{}&& '):] if text.startswith('{}&& ') else text

# evaluate the JavaScript object literal and get a Python dict back
data = execjs.get().eval(content)
print(data)
The data is a JavaScript object in literal notation.
We can use PyExecJS to evaluate it and get the corresponding Python dict.
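The question also asked for the data in pandas; once data is a dict, its items list can be fed straight into a DataFrame. A minimal follow-up sketch, assuming data comes from the snippet above:

import pandas as pd

# each entry of data['items'] is one row (ID, N, LT, C, ...)
df = pd.DataFrame(data['items'])
print(df.head())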