Scraping images with Wikipedia API does not work all the time

I'm trying to scrape the first image of Wikipedia pages of companies. (Quite often it's a logo.)
This works sometimes:
import requests
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
title = "Microsoft"
# Set parameters for the API request
params = {
    "action": "query",
    "format": "json",
    "formatversion": 2,
    "prop": "pageimages",
    "piprop": "original",
    "titles": title,
}
response = requests.get(API_ENDPOINT, params=params)
data = response.json()
print(data)
But for other companies, let's say Binance or Coinbase, it does not work. I'm not able to figure out why.

I can't see it documented anywhere, but I suspect that pageimages does not include .svg files, and an .svg is the only image file in the Coinbase article. Using prop=images instead works fine:
import requests
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
title = "Coinbase"
# Set parameters for the API request
params = {
    "action": "query",
    "format": "json",
    "formatversion": 2,
    "prop": "images",
    "titles": title,
    "imlimit": 1,
}
response = requests.get(API_ENDPOINT, params=params)
data = response.json()
print(data)
Returns:
{'continue': {'imcontinue': '39596725|Commons-logo.svg', 'continue': '||'}, 'query': {'pages': [{'pageid': 39596725, 'ns': 0, 'title': 'Coinbase', 'images': [{'ns': 6, 'title': 'File:Coinbase.svg'}]}]}}
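
The response above only contains a file title, not a URL. As a minimal sketch (the helper names `first_image_title` and `imageinfo_params` are made up for illustration), the title can be pulled out of the response and fed into a follow-up `prop=imageinfo` query that asks for the file's URL:

```python
def first_image_title(data):
    """Return the first file title from a prop=images response, or None."""
    pages = data.get("query", {}).get("pages", [])
    for page in pages:
        images = page.get("images", [])
        if images:
            return images[0]["title"]
    return None


def imageinfo_params(file_title):
    """Build parameters for a follow-up query that asks for the file's URL."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "imageinfo",
        "iiprop": "url",
        "titles": file_title,
    }


# The response shown above, used here as sample input
sample = {
    "continue": {"imcontinue": "39596725|Commons-logo.svg", "continue": "||"},
    "query": {
        "pages": [
            {
                "pageid": 39596725,
                "ns": 0,
                "title": "Coinbase",
                "images": [{"ns": 6, "title": "File:Coinbase.svg"}],
            }
        ]
    },
}

file_title = first_image_title(sample)
print(file_title)
print(imageinfo_params(file_title))
```

Passing `imageinfo_params(file_title)` to `requests.get(API_ENDPOINT, params=...)` should then return the file's URL under the page's `imageinfo` entry; the exact response shape is worth checking against a live response.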

Related

Web scraping through API - Python

I'm trying to scrape a website with Python.
URL = "https://www.boerse-frankfurt.de/bond/xs0216072230"
With the code below I get no result; the output is just: {}
The code is below:
import requests
url = (
    "https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
)
headers = {
    "X-Client-TraceId": "d87b41992f6161c09e875c525c70ffcf",
    "X-Security": "d361b3c92e9c50a248e85a12849f8eee",
    "Client-Date": "2022-08-25T09:07:36.196Z",
}
data = requests.get(url, headers=headers).json()
print(data)
It should print:
{
    "isin": "XS0216072230",
    "type": {
        "originalValue": "25",
        "translations": {
            "de": "(Industrie-) und Bankschuldverschreibungen",
            "en": "Industrial and bank bonds",
        },
    },
    "market": {
        "originalValue": "OPEN",
        "translations": {"de": "Freiverkehr", "en": "Open Market"},
Any help would be appreciated. I am avoiding the Selenium approach for this at the moment.
Thanks in advance.

The URL must return some data. https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230 returns an empty response.

This works for me:
import requests
url = (
    "https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
)
header = {
    "authority": "api.boerse-frankfurt.de",
    "method": "GET",
    "path": "/v1/data/master_data_bond?isin=XS0216072230",
    "scheme": "https",
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.6",
    "client-date": "2022-08-26T18:35:26.470Z",
    "origin": "https://www.boerse-frankfurt.de",
    "referer": "https://www.boerse-frankfurt.de/",
    "x-client-traceid": "21eb43fb86f0065542ba9a34b7f2fa93",
    "x-security": "14407a81ab4670847d3d55b0d74a3aea",
}
data = requests.get(url, headers=header).json()
print(data)
But I think you might need to update x-client-traceid, client-date, and x-security regularly.
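
Of the three values, only client-date has a format that can be reproduced locally. A small sketch, assuming the server expects the same UTC millisecond format as in the headers above (the helper name `client_date_now` is made up; x-client-traceid and x-security still have to be copied from the browser, and whether the server accepts a fresh date alongside older values of those two is not confirmed):

```python
from datetime import datetime, timezone


def client_date_now():
    """Return the current UTC time in the form 2022-08-26T18:35:26.470Z."""
    now = datetime.now(timezone.utc)
    # isoformat(timespec="milliseconds") ends in +00:00 for UTC; swap it for Z
    return now.isoformat(timespec="milliseconds").replace("+00:00", "Z")


print(client_date_now())
```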

How to exclude particular websites in Bing Web Search API result?

I'm using the Bing Web Search API to search Bing.com. However, for some queries the results include websites that are not relevant for my purposes.
For example, if the search query is "books to read this summer", Bing sometimes returns youtube.com or amazon.com as a result. I don't want these kinds of websites in my search results, so I want to block them before the search starts.
Here is my sample code:
params = {
    "mkt": "en-US",
    "setLang": "en-US",
    "textDecorations": False,
    "textFormat": "raw",
    "responseFilter": "Webpages",
    "q": "books to read this summer",
    "count": 20,
}
headers = {
    "Ocp-Apim-Subscription-Key": MY_APY_KEY_HERE,
    "Accept": "application/json",
    "Retry-After": "1",
}
response = requests.get(WEB_SEARCH, headers=headers, params=params, timeout=15)
response.raise_for_status()
search_data = response.json()
The search_data variable contains links that I want to exclude without iterating over the response object. In particular, I want Bing not to include youtube and amazon in the result.
Any help will be appreciated.
Edited: WEB_SEARCH is the web search endpoint

After digging through the Bing Web Search documentation, I found an answer. Here it is:
import requests
LINKS_TO_EXCLUDE = ["-site:youtube.com", "-site:amazon.com"]
def bing_data(user_query):
    params = {
        "mkt": "en-US",
        "setLang": "en-US",
        "textDecorations": False,
        "textFormat": "raw",
        "responseFilter": "Webpages",
        "count": 30,
    }
    headers = {
        "Ocp-Apim-Subscription-Key": MY_APY_KEY_HERE,
        "Accept": "application/json",
        "Retry-After": "1",
    }
    # This is the line I needed: append the exclusions to the query
    params["q"] = user_query + " " + " ".join(LINKS_TO_EXCLUDE)
    response = requests.get(WEB_SEARCH_ENDPOINT, headers=headers, params=params)
    response.raise_for_status()
    search_data = response.json()
    return search_data
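
The query-building step can also be written as a small standalone helper that takes plain domain names. This is only a sketch of the same string concatenation; the function name `exclude_sites` is made up for illustration:

```python
def exclude_sites(query, domains):
    """Append a -site: operator to the query for each domain to exclude."""
    exclusions = " ".join(f"-site:{domain}" for domain in domains)
    # strip() keeps the query unchanged when no domains are given
    return f"{query} {exclusions}".strip()


print(exclude_sites("books to read this summer", ["youtube.com", "amazon.com"]))
# → books to read this summer -site:youtube.com -site:amazon.com
```

The returned string would then be assigned to params["q"] before calling requests.get.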

add params to url in python

I would like to pass two parameters to my URL (status code and parent id). The JSON response of the request looks like this:
{
    "page": 1,
    "per_page": 10,
    "total": 35,
    "total_pages": 4,
    "data": [
        {
            "id": 11,
            "timestamp": 1565193225660,
            "status": "RUNNING",
            "operatingParams": {
                "rotorSpeed": 2363,
                "slack": 63.07,
                "rootThreshold": 0
            },
            "asset": {
                "id": 4,
                "alias": "Secondary Rotor"
            },
            "parent": {
                "id": 2,
                "alias": "Main Rotor Shaft"
            }
        }
    ]
}
I would like to know how to pass the two parameters in the URL. Passing ?status=RUNNING returns all the devices whose status is RUNNING (that's pretty straightforward).
So far I have tried this:
import requests
resp = requests.get('https://jsonmock.hackerrank.com/api/iot_devices/search?status=RUNNING')
q = resp.json()
print(q)
How should I pass in parentid=2, so that it returns the devices whose parent id is 2? Thank you.

It's plainly documented under "Passing Parameters in URLs" in the Requests docs.
resp = requests.get(
    'https://jsonmock.hackerrank.com/api/iot_devices/search',
    params={
        'status': 'RUNNING',
        'parentid': 2,
    },
)
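
To see what the params dict turns into, the query string can be built by hand with the standard library. This is a sketch for inspection only; Requests does the encoding itself when given params, and the result should look roughly like this:

```python
from urllib.parse import urlencode

base_url = "https://jsonmock.hackerrank.com/api/iot_devices/search"
params = {"status": "RUNNING", "parentid": 2}

# urlencode joins the key=value pairs with "&" and escapes special characters
full_url = base_url + "?" + urlencode(params)
print(full_url)
# → https://jsonmock.hackerrank.com/api/iot_devices/search?status=RUNNING&parentid=2
```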

To add a second GET parameter, use the & separator:
import requests
resp = requests.get('https://jsonmock.hackerrank.com/api/iot_devices/search?status=RUNNING&parentid=2')
q = resp.json()
print(q)

If you want to send data via a GET request, the process is straightforward. Note how the different values are separated with '&':
url?name1=value1&name2=value2
If you are using Flask for the backend, you can access these parameters like this:
para1 = request.args.get("name1")
para2 = request.args.get("name2")
On the front end, you can use Ajax to send the request:
var xhttp = new XMLHttpRequest();
var url = "url?name1=value1&name2=value2";
xhttp.open("GET", url, true);
xhttp.onreadystatechange = function() {
    if (this.readyState == 4 && this.status == 200) {
        console.log(this.responseText);
    }
};
xhttp.send();

POST CSV to server using python

I am trying to POST a CSV using Python requests. I am using the following code but getting this error:
{"code":400,"message":"Invalid input parameters","status":"error"}
Here is my code:
import requests
import json
api_url = "https://anlyticstts.com//api/insights/v1/reports"
headers = {
    'Content-Type': 'application/json',
    'X-access-key': 'e13168e9f1504d63455'
}
data = {
    "search_term_ids": [60, 61],
    "product_list_ids": [120],
    "start_date": "20180801",
    "end_date": "20180805",
    "columns": {
        "product": ["crawl_date", "product_name"],
        "status": ["no_longer_available"],
        "ranking": ["search_rank"],
        "pricing": ["price"]
    },
    "page_one_only": True,
    "format": "csv"
}
r = requests.post(url=api_url, data=data, headers=headers)
print(r.text)

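Your Content-Type header says application/json, but passing a dict to data= sends it form-encoded, so the body is not JSON. You can try serializing it first:
import json
r = requests.post(api_url, data=json.dumps(data), headers=headers)
instead of
r = requests.post(url=api_url, data=data, headers=headers)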
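
The difference between the two bodies can be inspected without sending anything. The sketch below uses a shortened version of the data dict and approximates the form encoding with the standard library's urlencode (an approximation of what Requests does with data=dict, not its exact code path):

```python
import json
from urllib.parse import urlencode

# Shortened version of the payload from the question
data = {"search_term_ids": [60, 61], "page_one_only": True, "format": "csv"}

# Roughly what a dict passed to data= becomes: a form-encoded string
form_body = urlencode(data, doseq=True)

# What an endpoint expecting application/json needs: a JSON document
json_body = json.dumps(data)

print(form_body)
print(json_body)
```

Requests also accepts the dict directly via the json= parameter, for example requests.post(api_url, json=data, headers=headers), which serializes it for you.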

Python post api request

I need to make a POST request to an API. I'm working with Python. I'm new to this and I can't create an ad tag. I tried creating a dict with the information from the API example, but it didn't work. When I evaluate sitios_creados, the result is ''
and when I check sites.status_code, I receive 415.
I don't understand why, because as you can see in my code, I already made a successful POST with Python requests to get the token.
I have to take the publisherid and use it together with the token to create the ad tag.
My publisherid is: 15663
My code:
import requests
import json
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Authorization': 'Basic ',
}
data = [
    ('grant_type', 'password'),
    ('username', ''),
    ('password', ''),
]
response = requests.post('http://api.site.com/v1/oauth/generateOauthToken', headers=headers, data=data)
json_data = json.loads(response.text)
token = json_data['access_token'].encode("utf-8")
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Authorization': 'Bearer {}'.format(token)
}
sites = requests.post('http://api.site.com/v1/inventorymgmt/publisherAdTag?entityId=15663', headers=headers, data=data)
sitios_creados = sites.content
API information example:
URL: http://api.site.com/v1/inventorymgmt/publisherAdTag?entityId=2685
Method: POST
Request Body:
{
    "publisherId": 2685,
    "publisherSiteurl": "http://example.org",
    "adTagName": "THIS_IS_TEST_DEMAND_5",
    "adCodeTypeId": 1,
    "foldPlacementId": 1,
    "adTypeId": 3,
    "pagePlacementId": 1,
    "adExpansionDirectionId": 1,
    "adSize": {
        "name": null,
        "width": 0,
        "height": 0,
        "id": 9
    },
    "adTagPlacements": [{
        "adTagPlacementId": 0,
        "linkOnlyToGeo": false,
        "ecpm": 1,
        "adScript": "THIS IS DEMO SCRIPT",
        "currency": 1
    }],
    "adTagCustomParamMap": [{
        "name": "kadcarrier",
        "macroValue": "techno.carrier"
    }, {
        "name": "kadcity",
        "macroValue": "geo.city"
    }]
}

Is it an API for a website?
If so, you can inspect the network traffic in your browser's developer tools and copy the POST request as a curl command.
Then go to curl.trillworks.com and convert the curl command into a Python POST request.
You can then modify the values inside that Python request.
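
Status 415 means "Unsupported Media Type". Since the API example shows a JSON request body, one likely cause is that the second request sends the form-encoded token data with a form Content-Type. The sketch below is an assumption about what the API expects, not confirmed by its documentation; the helper name `build_ad_tag_request` and the shortened ad tag dict are made up for illustration:

```python
import json


def build_ad_tag_request(token, publisher_id, ad_tag):
    """Return (url, headers, body) for a JSON POST that creates an ad tag."""
    url = (
        "http://api.site.com/v1/inventorymgmt/publisherAdTag"
        f"?entityId={publisher_id}"
    )
    headers = {
        # Assumption: the endpoint expects JSON, as in the API example
        "Content-Type": "application/json",
        # token must be a str here; bytes would render as b'...'
        "Authorization": f"Bearer {token}",
    }
    body = json.dumps({**ad_tag, "publisherId": publisher_id})
    return url, headers, body


# Placeholder token and a shortened ad tag, for illustration only
url, headers, body = build_ad_tag_request(
    "TOKEN", 15663, {"adTagName": "THIS_IS_TEST_DEMAND_5", "adTypeId": 3}
)
print(url)
print(headers)
print(body)
```

The request itself would then be requests.post(url, headers=headers, data=body), with the full request body from the API example in place of the shortened dict.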
