Different content when accessing a website with requests - python

I am trying to automatically get the corresponding handle IDs in ARIN from a company's name, like "Google".
https://search.arin.net/rdap/?query=google*
My naive approach is to use requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
html = 'https://search.arin.net/rdap/?query='
comp = 'google*'
r = requests.get(html + comp)
soup = BeautifulSoup(r.text, 'html.parser')
#example search
search = soup.body.find_all(text = "Handle$")
However, requests does not return the same output as what I see in Google Chrome. The HTML code returned by requests is different, and I cannot access the corresponding handles.
Does anyone know how to change the code?

The data you see on the page is loaded from an external API URL. You can use the requests module to call it directly:
import json
import requests
api_url = "https://rdap.arin.net/registry/entities"
params = {"fn": "google*"}
data = requests.get(api_url, params=params).json()
# pretty print the data:
print(json.dumps(data, indent=4))
Prints:
...
{
    "handle": "GF-231",
    "vcardArray": [
        "vcard",
        [
            [
                "version",
                {},
                "text",
                "4.0"
            ],
            [
                "fn",
                {},
                "text",
                "GOOGLE FIBER INC"
            ],
            [
                "adr",
                {
                    "label": "3425 MALONE DR\nCHAMBLEE\nGA\n30341\nUnited States"
                },
                "text",
                [
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    ""
                ]
            ],
            [
                "kind",
                {},
                "text",
                "org"
            ]
        ]
    ],
...
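The handles can then be pulled out of that JSON directly. A minimal sketch, run against an embedded sample shaped like the response above (RDAP entity searches conventionally return matches under an "entitySearchResults" key; verify the key name against a live response before relying on it):

```python
import json

# Sample shaped like the ARIN RDAP response above, truncated to two vCard entries.
sample = json.loads("""
{"entitySearchResults": [
    {"handle": "GF-231",
     "vcardArray": ["vcard", [["version", {}, "text", "4.0"],
                              ["fn", {}, "text", "GOOGLE FIBER INC"]]]}
]}
""")

for entity in sample["entitySearchResults"]:
    # Each vCard entry is a [name, params, type, value] list; pick out "fn".
    fn = next(v[3] for v in entity["vcardArray"][1] if v[0] == "fn")
    print(entity["handle"], fn)  # GF-231 GOOGLE FIBER INC
```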

Related

Extract URLs from a website ( The Hindu) which uses google search console using python

I'm trying to extract links from a website using Beautiful Soup. The website link is https://www.thehindu.com/search/?q=central+vista&sort=relevance&start=#gsc.tab=0&gsc.q=central%20vista&gsc.page=1
The code I used is given below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thehindu.com/search/?q=central+vista&sort=relevance&start=#gsc.tab=0&gsc.q=central%20vista&gsc.page=1'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
    urls.append(link.get('href'))
The code runs and gives all the URLs present on the website except the ones in the Google search console results, which is the part I need. I am basically stuck. Can someone help me sort it out?
The data you see is loaded with JavaScript, so beautifulsoup doesn't see it. You can use requests + re/json modules to get the data:
import re
import json
import requests
url = "https://cse.google.com/cse/element/v1"
params = {
    "rsz": "filtered_cse",
    "num": "10",
    "hl": "sk",
    "source": "gcsc",
    "gss": ".com",
    "cselibv": "f275a300093f201a",
    "cx": "264d7caeb1ba04bfc",
    "q": "central vista",
    "safe": "active",
    "cse_tok": "AB1-RNWPlN01WUQgebV0g3LpWU6l:1670351743367",
    "lr": "",
    "cr": "",
    "gl": "",
    "filter": "0",
    "sort": "",
    "as_oq": "",
    "as_sitesearch": "",
    "exp": "csqr,cc,4861326",
    "callback": "google.search.cse.api3099",
}
data = requests.get(url, params=params).text
data = re.search(r"(?s)\((.*)\)", data).group(1)
data = json.loads(data)
for r in data["results"]:
    print(r["url"])
Prints:
https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece
https://www.thehindu.com/news/national/central-vista-project-sc-dismisses-plea-against-delhi-hc-verdict-refusing-to-halt-work/article35031575.ece
https://www.thehindu.com/opinion/editorial/monumental-hurry-on-central-vista-project/article31734021.ece
https://www.thehindu.com/news/national/central-vista-new-buildings-on-kg-marg-africa-avenue-proposed-for-relocating-govt-offices/article31702494.ece
https://www.thehindu.com/society/beyond-the-veils-of-secrecy-the-central-vista-project-is-both-the-cause-and-effect-of-its-own-multiple-failures/article32980560.ece
https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece%3Fhomepage%3Dtrue
https://www.thehindu.com/news/national/work-on-new-parliament-central-vista-avenue-projects-on-track/article36296821.ece
https://www.thehindu.com/news/national/2466-trees-removed-for-central-vista-projects-so-far-govt/article65665595.ece
https://www.thehindu.com/news/national/central-vista-avenue-redevelopment-project-to-be-completed-by-july-18-puri/article65611471.ece%3Fhomepage%3Dtrue
https://www.thehindu.com/news/national/central-vista-jharkhand-firm-is-lowest-bidder-for-vice-president-enclave/article37310541.ece
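Note that the response is JSONP (JSON wrapped in a callback call), and parameters like cse_tok are session tokens copied from the browser's network tab, so they expire. The unwrap step can be isolated into a small helper; a sketch, with a made-up sample mimicking the callback shape:

```python
import json
import re

def unwrap_jsonp(text):
    # Strip the "callback( ... )" wrapper and parse the JSON payload inside.
    payload = re.search(r"(?s)\((.*)\)", text).group(1)
    return json.loads(payload)

# Hypothetical JSONP response, mimicking the google.search.cse.* callback shape.
sample = 'google.search.cse.api3099({"results": [{"url": "https://example.com/a"}]})'
data = unwrap_jsonp(sample)
print([r["url"] for r in data["results"]])  # ['https://example.com/a']
```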

How do I look for the right class and id for parsing a page?

This is the code I have so far; I'm attempting to get the subscriber count.
This is the error I get:
AttributeError: 'NoneType' object has no attribute 'find_all'
I hope someone can help!
from bs4 import BeautifulSoup
import requests as r
url = r.get("https://www.youtube.com/channel/UC57EgpLB1Q0tXc5tWDhttoQ")
soup_content = BeautifulSoup(url.content, 'html.parser')
id_ = soup_content.find(id="meta")
class_ = id_.find_all(class_="style-scope ytd-c4-tabbed-header-renderer")
hopefully_it_works = class_[0]
print(hopefully_it_works.prettify())
BS and web scraping are not the correct way to get YouTube data.
Please consider using the official YouTube API, which provides the data you need and much more :)
https://developers.google.com/youtube/v3/docs/
You can try your request here: https://developers.google.com/youtube/v3/docs/channels/list#try-it
In your case you need to fill in:
part: statistics
id: UC57EgpLB1Q0tXc5tWDhttoQ
(https://www.googleapis.com/youtube/v3/channels?part=statistics&id=channel_id&key=your_key)
the response is
{
    "kind": "youtube#channelListResponse",
    "etag": "eQRUDH-2j1eYIpexSuSOsz12tc8",
    "pageInfo": {
        "totalResults": 1,
        "resultsPerPage": 1
    },
    "items": [
        {
            "kind": "youtube#channel",
            "etag": "D-yZ896UMFRcDDSrfATBaiygDkc",
            "id": "UC57EgpLB1Q0tXc5tWDhttoQ",
            "statistics": {
                "viewCount": "9",
                "commentCount": "0",
                "subscriberCount": "3",
                "hiddenSubscriberCount": false,
                "videoCount": "1"
            }
        }
    ]
}
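Once you have that response as a Python dict (e.g. via requests.get(...).json()), the subscriber count is a couple of key lookups away. A sketch over the sample response above, embedded so it runs without an API key:

```python
import json

# Trimmed copy of the channels.list response shown above.
response = json.loads("""
{
    "items": [
        {
            "id": "UC57EgpLB1Q0tXc5tWDhttoQ",
            "statistics": {
                "subscriberCount": "3",
                "hiddenSubscriberCount": false
            }
        }
    ]
}
""")

stats = response["items"][0]["statistics"]
# The API returns counts as strings, so cast if you need a number.
print(int(stats["subscriberCount"]))  # 3
```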
Edit:
If you really want to use BS, here is a solution:
from bs4 import BeautifulSoup as bs
import requests as r
content = r.get("https://www.youtube.com/channel/UC57EgpLB1Q0tXc5tWDhttoQ")
soup = bs(content.content, "html.parser")
subscribe_container = soup.find("span", attrs={"class": "channel-header-subscription-button-container yt-uix-button-subscription-container with-preferences"})
channel_subscribers = subscribe_container.find("span", attrs={"class": "yt-subscription-button-subscriber-count-branded-horizontal subscribed yt-uix-tooltip"}).text
print(channel_subscribers)

How to print json info with python?

I have a JSON feed (url = http://open.data.amsterdam.nl/ivv/parkeren/locaties.json) and I want to print every 'title', 'adres' and 'postcode'. How can I do that?
I want to print it like this:
title.
adres.
postcode.
title.
adres.
postcode.
so one below the other.
I hope you can help me with this.
import requests

url = "http://open.data.amsterdam.nl/ivv/parkeren/locaties.json"
search = requests.get(url).json()
print(search['title'])
print(search['adres'])
print(search['postcode'])
Using print(json.dumps(search, indent=4)) you can see that the structure is
{
    "parkeerlocaties": [
        {
            "parkeerlocatie": {
                "title": "Fietsenstalling Tolhuisplein",
                "Locatie": "{\"type\":\"Point\",\"coordinates\":[4.9032801,52.3824545]}",
                ...
            }
        },
        {
            "parkeerlocatie": {
                "title": "Fietsenstalling Paradiso",
                "Locatie": "{\"type\":\"Point\",\"coordinates\":[4.8833735,52.3621851]}",
                ...
            }
        },
So to access the inner properties, you need to follow the JSON path:
import requests
url = 'http://open.data.amsterdam.nl/ivv/parkeren/locaties.json'
search = requests.get(url).json()
for parkeerlocatie in search["parkeerlocaties"]:
    content = parkeerlocatie['parkeerlocatie']
    print(content['title'])
    print(content['adres'])
    print(content['postcode'])
    print()
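If some entries happen to lack an 'adres' or 'postcode' key, plain indexing raises KeyError; dict.get with a default keeps the loop going. A sketch with made-up sample data in the same shape as the feed:

```python
# Hypothetical sample in the same shape as the locaties.json feed;
# the second entry deliberately omits 'adres' and 'postcode'.
search = {
    "parkeerlocaties": [
        {"parkeerlocatie": {"title": "Fietsenstalling Tolhuisplein",
                            "adres": "Buiksloterweg 5",
                            "postcode": "1031 CC"}},
        {"parkeerlocatie": {"title": "Fietsenstalling Paradiso"}},
    ]
}

for parkeerlocatie in search["parkeerlocaties"]:
    content = parkeerlocatie["parkeerlocatie"]
    print(content.get("title", "?"))
    print(content.get("adres", "?"))      # missing key -> "?" instead of KeyError
    print(content.get("postcode", "?"))
    print()
```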

How to get text within <script> tag

I am scraping the LaneBryant website.
Part of the source code is
<script type="application/ld+json">
{
    "@context": "http://schema.org/",
    "@type": "Product",
    "name": "Flip Sequin Teach & Inspire Graphic Tee",
    "image": [
        "http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477",
        "http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477_Back"
    ],
    "description": "Get inspired with [...]",
    "brand": "Lane Bryant",
    "sku": "356861",
    "offers": {
        "@type": "Offer",
        "url": "https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861",
        "priceCurrency": "USD",
        "price": "44.95",
        "availability": "http://schema.org/InStock",
        "itemCondition": "https://schema.org/NewCondition"
    }
}
</script>
In order to get price in USD, I have written this script:
def getPrice(self, start):
    fprice = []
    discount = ""
    price1 = start.find('script', {'type': 'application/ld+json'})
    data = ""
    #print("price 1 is + "+ str(price1)+"data is "+str(data))
    price1 = str(price1).split(",")
    #price1=str(price1).split(":")
    print("final price +" + str(price1[11]))
where start is:
d = webdriver.Chrome('/Users/fatima.arshad/Downloads/chromedriver')
d.get(url)
start = BeautifulSoup(d.page_source, 'html.parser')
It doesn't print the price even though I am getting the correct text. How do I get just the price?
In this instance you can just regex for the price
import requests, re
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r'"price":"(.*?)"')
print(p.findall(r.text)[0])
Otherwise, target the appropriate script tag by id and then parse its .text with the json library:
import requests, json
from bs4 import BeautifulSoup
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
start = BeautifulSoup(r.text, 'html.parser')
data = json.loads(start.select_one('#pdpInitialData').text)
price = data['pdpDetail']['product'][0]['price_range']['sale_price']
print(price)
price1 = start.find('script', {'type': 'application/ld+json'})
This is actually the <script> tag, so a better name would be
script_tag = start.find('script', {'type': 'application/ld+json'})
You can access the text inside the script tag using .text. That will give you the JSON in this case.
json_string = script_tag.text
Instead of splitting by commas, use a JSON parser to avoid misinterpretations:
import json
clothing=json.loads(json_string)
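From there the price is just nested dictionary access. A sketch with a trimmed copy of the JSON-LD shown above, embedded so it runs without fetching the page:

```python
import json

# Trimmed copy of the ld+json product data from the question.
json_string = """
{
    "@type": "Product",
    "name": "Flip Sequin Teach & Inspire Graphic Tee",
    "offers": {"priceCurrency": "USD", "price": "44.95"}
}
"""
clothing = json.loads(json_string)
# Follow the nesting: offers -> price (a string in the source JSON).
print(clothing["offers"]["price"])  # 44.95
```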

Parsing and getting a list from the response of a GET request

I'm trying to parse a website with the requests module:
import requests
some_data = {'a': '',
             'b': ''}
with requests.Session() as s:
    result = s.post('http://website.com', data=some_data)
    print(result.text)
The page is responding as below:
{
    "arrangetype": "U",
    "list": [
        {
            "product_no": 43,
            "display_order": 4,
            "is_selling": "T",
            "product_empty": "F",
            "fix_position": null,
            "is_auto_sort": false
        },
        {
            "product_no": 44,
            "display_order": 6,
            "is_selling": "T",
            "product_empty": "F",
            "fix_position": null,
            "is_auto_sort": false
        }
    ],
    "length": 2
}
I found that instead of parsing the full HTML, it would be better to deal with this response directly, as all the data I want is in it.
What I want to get is a list of the values of product_no, so the expected result is:
[43,44]
How do I do this?
Convert your JSON response to a dictionary with json.loads(), and collect your results in a list comprehension.
Demo:
from json import loads
data = """{
    "arrangetype": "U",
    "list": [
        {
            "product_no": 43,
            "display_order": 4,
            "is_selling": "T",
            "product_empty": "F",
            "fix_position": null,
            "is_auto_sort": false
        },
        {
            "product_no": 44,
            "display_order": 6,
            "is_selling": "T",
            "product_empty": "F",
            "fix_position": null,
            "is_auto_sort": false
        }
    ],
    "length": 2
}"""
json_dict = loads(data)
print([x['product_no'] for x in json_dict['list']])
# [43, 44]
Full Code:
import requests
from json import loads
some_data = {'a': '',
             'b': ''}
with requests.Session() as s:
    result = s.post('http://website.com', data=some_data)
    json_dict = loads(result.text)
    print([x["product_no"] for x in json_dict["list"]])
