Web scraping through API - Python

I'm trying to scrape a website with Python.
URL = "https://www.boerse-frankfurt.de/bond/xs0216072230"
With the code below I get no result; the output is just {}.
The code is below:
import requests

url = "https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
headers = {
    "X-Client-TraceId": "d87b41992f6161c09e875c525c70ffcf",
    "X-Security": "d361b3c92e9c50a248e85a12849f8eee",
    "Client-Date": "2022-08-25T09:07:36.196Z",
}
data = requests.get(url, headers=headers).json()
print(data)
It should print:
{
    "isin": "XS0216072230",
    "type": {
        "originalValue": "25",
        "translations": {
            "de": "(Industrie-) und Bankschuldverschreibungen",
            "en": "Industrial and bank bonds",
        },
    },
    "market": {
        "originalValue": "OPEN",
        "translations": {"de": "Freiverkehr", "en": "Open Market"},
    ...
Any help would be appreciated; I am avoiding a Selenium-based approach for now.
Thanks in advance.

The URL should return some data, but https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230 comes back empty for me.
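If you want to see what the endpoint actually returns before parsing, a quick debugging sketch (just inspection, not a fix):
import requests

resp = requests.get("https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230")
print(resp.status_code)  # the server may answer 200 with an empty body, or an error status
print(repr(resp.text))   # inspect the raw body before calling .json()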

This works for me:
import requests

url = "https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
header = {
    "authority": "api.boerse-frankfurt.de",
    "method": "GET",
    "path": "/v1/data/master_data_bond?isin=XS0216072230",
    "scheme": "https",
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.6",
    "client-date": "2022-08-26T18:35:26.470Z",
    "origin": "https://www.boerse-frankfurt.de",
    "referer": "https://www.boerse-frankfurt.de/",
    "x-client-traceid": "21eb43fb86f0065542ba9a34b7f2fa93",
    "x-security": "14407a81ab4670847d3d55b0d74a3aea",
}
data = requests.get(url, headers=header).json()
print(data)
But I think you might need to update x-client-traceid, client-date, and x-security regularly.
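If you do need to refresh them per request, a minimal sketch for the two time-sensitive values (assuming client-date is just the current UTC time in the ISO-8601 millisecond format seen above, and that x-client-traceid can be any 32-character hex string; x-security appears to be computed by the site's JavaScript, so without reverse engineering it you still have to copy a fresh value from the browser's DevTools):
import secrets
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
# Same shape as the captured header, e.g. 2022-08-26T18:35:26.470Z
client_date = now.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % (now.microsecond // 1000)
trace_id = secrets.token_hex(16)  # 32 hex chars, matching the captured value's shape

header.update({
    "client-date": client_date,
    "x-client-traceid": trace_id,
    # "x-security" is left as captured; it likely depends on the other values.
})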

Related

Reading key values from a JSON array which is a set in Python

I have the following code:
import requests
import json
import sys

credentials_User = sys.argv[1]
credentials_Password = sys.argv[2]
email = sys.argv[3]

def auth_api(login_User, login_Password):
    gooddata_user = login_User
    gooddata_password = login_Password
    body = json.dumps({
        "postUserLogin": {
            "login": gooddata_user,
            "password": gooddata_password,
            "remember": 1,
            "verify_level": 0
        }
    })
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    url = "https://reports.domain.com/gdc/account/login"
    response = requests.request("POST", url, headers=headers, data=body)
    sst = response.headers.get('Set-Cookie')
    return sst

def query_api(cookie, email):
    url = "https://reports.domain.com/gdc/account/domains/domain/users?login=" + email
    body = {}
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Cookie': cookie
    }
    response = requests.request("GET", url, headers=headers, data=body)
    jsonContent = []
    jsonContent.append({response.text})  # note: the braces here build a set, not a dict
    accountSettings = jsonContent[0]
    print(accountSettings)

cookie = auth_api(credentials_User, credentials_Password)
profilehash = query_api(cookie, email)
The code itself works and sends a request to the GoodData API.
The query_api() function returns JSON similar to the following:
{
    "accountSettings": {
        "items": [
            {
                "accountSetting": {
                    "login": "user@example.com",
                    "email": "user@example.com",
                    "firstName": "First Name",
                    "lastName": "Last Name",
                    "companyName": "Company Name",
                    "position": "Data Analyst",
                    "created": "2020-01-08 15:44:23",
                    "updated": "2020-01-08 15:44:23",
                    "timezone": null,
                    "country": "United States",
                    "phoneNumber": "(425) 555-1111",
                    "old_password": "secret$123",
                    "password": "secret$234",
                    "verifyPassword": "secret$234",
                    "authenticationModes": [
                        "SSO"
                    ],
                    "ssoProvider": "sso-domain.com",
                    "language": "en-US",
                    "ipWhitelist": [
                        "127.0.0.1"
                    ],
                    "links": {
                        "projects": "/gdc/account/profile/{profile_id}/projects",
                        "self": "/gdc/account/profile/{profile_id}",
                        "domain": "/gdc/domains/default",
                        "auditEvents": "/gdc/account/profile/{profile_id}/auditEvents"
                    },
                    "effectiveIpWhitelist": "[ 127.0.0.1 ]"
                }
            }
        ],
        "paging": {
            "offset": 20,
            "count": 100,
            "next": "/gdc/uri?offset=100"
        }
    }
}
The issue I am having is reading specific keys from this JSON dict. I can use accountSettings = jsonContent[0], but that just returns the same JSON.
What I want to do is read the value of the projects key within links.
How would I do this with a dict?
Thanks
Based on your description, you have your value inside a list, not a set (forget about sets: sets are not used with JSON). Inside your list, you either have your content as a single string, which you would then have to parse with json.loads, or it is a well-behaved nested data structure already extracted from the JSON but sitting inside a single-element list. The latter seems the most likely.
So you should be able to do:
accountlink = jsonContent[0]["accountSettings"]["items"][0]["accountSetting"]["login"]
Otherwise, if it is encoded as a JSON string, you have to parse it first:
import json
accountlink = json.loads(jsonContent[0])["accountSettings"]["items"][0]["accountSetting"]["login"]
Now, given your question, I'd say you are at a beginner level as a programmer, or a casual user just using Python to automate something. Either way, I'd recommend you do some exercises before proceeding: it will save you a lot of time. I am not trying to bully or mock you here: this is the best advice I can offer. Look for tutorials that play around in interactive mode, rather than trying entire programs at once that you just copy and paste.
Using the below code fixed the issue:
jsonContent = json.loads(response.text)
print(type(jsonContent))
test = jsonContent["accountSettings"]["items"][0]
test2 = test["accountSetting"]["links"]["self"]
print(test)
print(test2)
I believe this works because I hadn't noticed I was using .append for jsonContent, which resulted in the data type being something other than it should have been.
Thanks to everyone who tried helping me.
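For the projects link the question originally asked about, the same navigation applies; a small sketch looping over all items (assuming the response shape shown above):
jsonContent = json.loads(response.text)
for item in jsonContent["accountSettings"]["items"]:
    # each item wraps a single accountSetting; links is a plain nested dict
    print(item["accountSetting"]["links"]["projects"])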

Scraping tabbed table from AWS pricing

I am trying to build a scraper for the tabbed tables on this page (https://aws.amazon.com/sagemaker/pricing/). I am only interested in the data for training, processing, and a few others.
import bs4
import requests

url = "https://aws.amazon.com/sagemaker/pricing/"
req = requests.get(url)
soup = bs4.BeautifulSoup(req.content, "html.parser")
tables = soup.find_all("table")
inst_table = str(tables[0])
But it looks like I need some sort of dynamic mechanism to handle the tab switching.
Assume we clicked on the training tab; my goal is to build a file that stores the scraped data like this:
"ml.t2.medium": {
    "vCPU": 2.0,
    "mem_GiB": 4.0,
    "price": 0.15,
    "category": "Standard",
    "task": "training",
}
The good news is you don't need the heavy guns of Selenium.
As with most of AWS, there's almost always an API you can query that returns the data you want.
Here's what you need and how to get it:
import json
import time

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:94.0) Gecko/20100101 Firefox/94.0",
}

endpoint = (
    "https://b0.p.awsstatic.com/pricing/2.0/meteredUnitMaps/"
    "sagemaker/USD/current/sagemaker-instances.json"
    f"?timestamp={int(time.time())}"
)

response = requests.get(endpoint, headers=headers).json()

for region, region_data in response["regions"].items():
    if region == "EU (Frankfurt)":
        for instance_type, instance_data in region_data.items():
            print(json.dumps(instance_data, indent=2))
Sample output for EU (Frankfurt) (shortened for brevity):
{
  "rateCode": "X7Z5CZBN2ZY5QED6.JRTCKXETXF.6YS6EN2CT7",
  "price": "6.1120000000",
  "Instance": "ml.g4dn.12xlarge",
  "Clock Speed": "2.5 GHz",
  "Instance Type": "ml.g4dn.12xlarge-AsyncInf",
  "Component": "AsyncInf",
  "VCPU": "48",
  "Memory": "192 GiB"
}
{
  "rateCode": "F926HEYB3SV5TQ3Y.JRTCKXETXF.6YS6EN2CT7",
  "price": "6.8000000000",
  "Instance": "ml.g4dn.16xlarge",
  "Clock Speed": "2.5 GHz",
  "Instance Type": "ml.g4dn.16xlarge-AsyncInf",
  "Component": "AsyncInf",
  "VCPU": "64",
  "Memory": "256 GiB"
}
{
  "rateCode": "7SMSS7DTJHR8UWN7.JRTCKXETXF.6YS6EN2CT7",
  "price": "1.8810000000",
  "Instance": "ml.g4dn.4xlarge",
  "Clock Speed": "2.5 GHz",
  "Instance Type": "ml.g4dn.4xlarge-AsyncInf",
  "Component": "AsyncInf",
  "VCPU": "16",
  "Memory": "64 GiB"
}
and much more ...
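To get from that payload to the file format sketched in the question, a hedged transformation (assuming the field names from the sample output above; the API has no explicit "task" or "category" field, so Component is used as the closest stand-in for the task and category is omitted):
import json

# Assumes `response` is the parsed JSON from the request above.
result = {}
for instance_data in response["regions"]["EU (Frankfurt)"].values():
    result[instance_data["Instance"]] = {
        "vCPU": float(instance_data["VCPU"]),
        "mem_GiB": float(instance_data["Memory"].split()[0]),  # "192 GiB" -> 192.0
        "price": float(instance_data["price"]),
        "task": instance_data.get("Component", ""),  # e.g. "AsyncInf"; hedged stand-in
    }

with open("sagemaker_prices.json", "w") as f:
    json.dump(result, f, indent=2)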

TF400898: An Internal Error Occurred. Activity Id: 1fc05eca-fed8-4065-ae1a-fc8f2741c0ea

I'm trying to push files into a git repo via the Azure DevOps API, but I'm getting an internal error with an activity id. I followed their documentation and am trying to add a simple file to my repo.
Here is my code:
import base64

import requests

pat_token = "xxxx-xxxxx-xxxx"
b64Val = base64.b64encode(pat_token.encode()).decode()

payload = {
    "refUpdates": [
        {
            "name": "refs/heads/main",
            "oldObjectId": "505aae1f15ae153b7fc53e8bdb79ac997caa99e6"
        }
    ],
    "commits": [
        {
            "comment": "Added task markdown file.",
            "changes": [
                {
                    "changeType": "add",
                    "item": {
                        "path": "TimeStamps.txt"
                    },
                    "newContent": {
                        "content": "# Tasks\n\n* Item 1\n* Item 2",
                        "contentType": "rawtext"
                    }
                }
            ]
        }
    ]
}

headers = {
    'Authorization': 'Basic %s' % b64Val,
    'Content-Type': 'application/json',
}
params = (
    ('api-version', '6.0'),
)

response = requests.post(
    'https://dev.azure.com/{organization}/{project}/_apis/git/repositories/{repo}/pushes',
    headers=headers,
    data=payload,
    params=params,
)
Does anyone know how to solve this issue? I have also posted it on their developer community.
I've fixed that error. The payload was not in JSON format, so I had to serialize it as JSON, and after that it worked fine.
Like this:
response = requests.post('https://dev.azure.com/{organization}/{project}/_apis/git/repositories/{repoId}/pushes', headers=headers, params=params, data=json.dumps(payload))
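Equivalently, requests can do the serialization for you via its json= parameter, which also sets the Content-Type header:
response = requests.post(
    'https://dev.azure.com/{organization}/{project}/_apis/git/repositories/{repoId}/pushes',
    headers=headers,
    params=params,
    json=payload,  # requests serializes the dict and sends it as application/json
)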

How to feed data to Elasticsearch as an integer using Python?

I am using this Python script to feed my data to Elasticsearch 6.0. How can I store the variable Value with type float in Elasticsearch?
I can't use the metric options for the visualization in Kibana, because all the data is automatically stored as strings.
from elasticsearch import Elasticsearch

Device = ""
Value = ""
for key, value in row.items():
    Device = key
    Value = value
    print("Dev", Device, "Val:", Value)
    doc = {'Device': Device, 'Measure': Value, 'Sourcefile': filename}
    print(' doc: ', doc)
    es.index(index=name, doc_type='trends', body=doc)
Thanks
EDIT:
After the advice of @Saul, I could fix this problem with the following code:
import csv
import os
import time

import requests
from elasticsearch import Elasticsearch

Datum = time.strftime("%Y-%m-%d_")
path = '/home/pi/Desktop/Data'
os.chdir(path)
name = 'test'
es = Elasticsearch()

doc = {
    "mappings": {
        "doc": {
            "properties": {
                "device": {"type": "text"},
                "measure": {"type": "text"},
                "age": {"type": "integer"},
                "created": {
                    "type": "date",
                    "format": "strict_date_optional_time||epoch_millis"
                }
            }
        }
    }
}

r = es.index(index=name, doc_type='trends', body=doc)
print(r)
You need to create the index with its mapping by sending an HTTP PUT request using Python requests, as follows:
url = "http://localhost:9200/index_name?pretty"
data = {
    "mappings": {
        "doc": {
            "properties": {
                "title": {"type": "text"},
                "name": {"type": "text"},
                "age": {"type": "integer"},
                "created": {
                    "type": "date",
                    "format": "strict_date_optional_time||epoch_millis"
                }
            }
        }
    }
}
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r = requests.put(url, data=json.dumps(data), headers=headers)
Please replace index_name in the URL with the name of the index you are defining in Elasticsearch.
If you want to delete the index before creating it again, do as follows:
url = "http://localhost:9200/index_name"
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r = requests.delete(url, headers=headers)
Again, replace index_name in the URL with your actual index name. After deleting the index, create it again with the first code example above, including the mappings you need. Enjoy.
Elasticsearch defines field types in the index mapping. It looks like you probably have dynamic mapping enabled, so when you send data to Elasticsearch for the first time, it makes an educated guess about the shape of your data and the field types.
Once those types are set, they are fixed for that index, and Elasticsearch will continue to interpret your data according to those types no matter what you do in your python script.
To fix this you need to either:
Define the index mapping before you load any data. This is the better option as it gives you complete control over how your data is interpreted. https://www.elastic.co/guide/en/elasticsearch/reference/6.0/mapping.html
Make sure that, the first time you send data into the index, you use the correct data types. This relies on dynamic mapping generation, but it will typically do the right thing.
Defining the index mapping is the best option. It's common to do that once off, in Kibana or with curl, or if you create a lot of indices, with a template.
However, if you want to use Python, you should look at the create and put_mapping functions on IndicesClient.
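A minimal sketch of that approach (field names borrowed from the question; assumes a local Elasticsearch 6.x and a matching elasticsearch-py client, with Measure mapped as a float so Kibana's metric visualizations work):
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index="test", body={
    "mappings": {
        "trends": {  # the doc_type used in the question
            "properties": {
                "Device": {"type": "keyword"},
                "Measure": {"type": "float"},
                "Sourcefile": {"type": "keyword"},
            }
        }
    }
})

# Index the value as a number, not a string, so it matches the mapping.
es.index(index="test", doc_type="trends",
         body={"Device": "dev1", "Measure": 3.14, "Sourcefile": "data.csv"})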

Simple Python social media scrape of Public information

I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns info for Twitter, and I know the XPath is correct for Instagram, but for some reason I'm not getting data for it. I know the XPaths could be more specific, but I can fix that later. Both my accounts are public.
1) I thought maybe it didn't like the Python user agent, so I tried changing it, and I still get nothing. That line is commented out, but it's still there.
2) I heard something about an API on GitHub. That lengthy code is very intimidating and way above my level of understanding; I don't know more than half of what I'm reading on there.
from lxml import html
import requests
import webbrowser

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)

pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)

instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")
instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")

twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")
twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")

print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)
As mentioned, Instagram's page source does not reflect its rendered content, because a JavaScript function passes content from JSON data to the browser. Hence, what Python scrapes from the page source does not show exactly what the browser renders to screen. Welcome to the new world of dynamic web programming! Consider using Instagram's API or another parser that can retrieve JavaScript-generated content (not just the page source).
With that said, if you simply need the IG account data, you can still use Python's lxml to XPath the JSON content in the <script> tag (specifically the sixth occurrence, but adjust to your needed page). The example below parses Google's Instagram JSON data:
import lxml.etree as et
import urllib.request as rq

rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()
tree = et.HTML(txtpage)

jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")
for i in jsondata:
    print(i)
OUTPUT
window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
JSON Pretty Print (extracting the window._sharedData variable from above)
See below where the user data (followers, following, etc.) appears near the beginning:
{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {
},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...
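If all you need are the counts, a hedged follow-up to the lxml example above (assuming the script text keeps the shape shown: a window._sharedData = {...}; assignment, with the entry_data.ProfilePage[0].user path from the pretty print):
import json

for i in jsondata:
    if i.strip().startswith('window._sharedData'):
        # strip the "window._sharedData = " prefix and any trailing semicolon
        raw = i.split('=', 1)[1].strip().rstrip(';')
        shared = json.loads(raw)
        user = shared['entry_data']['ProfilePage'][0]['user']
        print('followers:', user['followed_by']['count'])
        print('following:', user['follows']['count'])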
If anyone is still interested in this sort of thing, using Selenium solved my problems:
http://pastebin.com/5eHeDt3r
Is there a faster way?
In case you want to find information about yourself and others without hassling with code, try this piece of software. Apart from automatic scraping, it analyzes and visualizes the received information in a PDF report, covering these social networks: Facebook, Twitter, Instagram, plus the Google Search engine.
P.S. I am the main developer and maintainer of this project.
