I'm using the Bing Web Search API to search for information on Bing.com. However, for some search queries I'm getting websites in the results that are not relevant to my purposes.
For example, if the search query is "books to read this summer", Bing sometimes returns youtube.com or amazon.com as a result. I don't want these kinds of websites in my search results, so I want to block them before the search starts.
Here is my sample code:
params = {
    "mkt": "en-US",
    "setLang": "en-US",
    "textDecorations": False,
    "textFormat": "raw",
    "responseFilter": "Webpages",
    "q": "books to read this summer",
    "count": 20
}
headers = {
    "Ocp-Apim-Subscription-Key": MY_APY_KEY_HERE,
    "Accept": "application/json",
    "Retry-After": "1",
}
response = requests.get(WEB_SEARCH, headers=headers, params=params, timeout=15)
response.raise_for_status()
search_data = response.json()
The search_data variable contains the links I want to block. I'd like to exclude them without iterating over the response object; in particular, I want Bing not to include youtube.com and amazon.com in the results.
Any help will be appreciated.
Edited: WEB_SEARCH is the web search endpoint
After digging through the Bing Web Search documentation, I found an answer. Here it is.
LINKS_TO_EXCLUDE = ["-site:youtube.com", "-site:amazon.com"]
def bing_data(user_query):
    params = {
        "mkt": "en-US",
        "setLang": "en-US",
        "textDecorations": False,
        "textFormat": "raw",
        "responseFilter": "Webpages",
        "count": 30
    }
    headers = {
        "Ocp-Apim-Subscription-Key": MY_APY_KEY_HERE,
        "Accept": "application/json",
        "Retry-After": "1",
    }
    # This is the line I needed: append the "-site:" operators to the query
    params["q"] = user_query + " " + " ".join(LINKS_TO_EXCLUDE)
    response = requests.get(WEB_SEARCH_ENDPOINT, headers=headers, params=params)
    response.raise_for_status()
    search_data = response.json()
    return search_data
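For completeness, a minimal usage sketch; it assumes the standard webPages.value shape of a Bing Web Search v7 response and the same endpoint and key placeholders as above:

search_data = bing_data("books to read this summer")

# Iterate over the web results; the excluded domains should no longer appear
for page in search_data.get("webPages", {}).get("value", []):
    print(page["name"], "->", page["url"])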
Related
I'm trying to scrape the first image of Wikipedia pages of companies. (Quite often it's a logo.)
This works sometimes:
import requests
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
title = "Microsoft"
# Set parameters for the API request
params = {
    "action": "query",
    "format": "json",
    "formatversion": 2,
    "prop": "pageimages",
    "piprop": "original",
    "titles": title,
}
response = requests.get(API_ENDPOINT, params=params)
data = response.json()
print(data)
But for other companies, let's say Binance or Coinbase, it does not work. I'm not able to figure out why.
I can't see it documented anywhere, but I suspect that pageimages does not include .svg, which is the only image file in the Coinbase article. Using images instead works fine:
import requests
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
title = "Coinbase"
# Set parameters for the API request
params = {
    "action": "query",
    "format": "json",
    "formatversion": 2,
    "prop": "images",
    "titles": title,
    "imlimit": 1
}
response = requests.get(API_ENDPOINT, params=params)
data = response.json()
print(data)
Returns:
{'continue': {'imcontinue': '39596725|Commons-logo.svg', 'continue': '||'}, 'query': {'pages': [{'pageid': 39596725, 'ns': 0, 'title': 'Coinbase', 'images': [{'ns': 6, 'title': 'File:Coinbase.svg'}]}]}}
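If you also need the actual file URL rather than just the title, a second request with prop=imageinfo should get it. This is only a sketch; the File:Coinbase.svg title is taken from the output above:

import requests

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Resolve the file title returned above to a direct image URL
params = {
    "action": "query",
    "format": "json",
    "formatversion": 2,
    "prop": "imageinfo",
    "iiprop": "url",
    "titles": "File:Coinbase.svg",
}
response = requests.get(API_ENDPOINT, params=params)
page = response.json()["query"]["pages"][0]
print(page["imageinfo"][0]["url"])  # direct link to the file on upload.wikimedia.org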
I have python code that successfully downloads a Nessus scan report in csv format, but I need to add some additional fields to the downloaded report. I include parameters in the request payload to include some fields, but the scan that is downloaded does not include those fields.
I've tried changing the value of the reportContents params to actual Boolean types with the True keyword.
Also, I changed the format to pdf and it exports a PDF file that is just a title page and a page with a blank table of contents.
The downloaded csv file has data in it, but it only includes the default headers, i.e.:
Plugin ID,CVE,CVSS v2.0 Base Score,Risk,Host,Protocol,Port,Name,Synopsis,Description,Solution,See Also,Plugin Output
The raw output of the POST request looks like:
POST https://localhost:8834/scans/<scan_id>/export
X-ApiKeys: accessKey=accessKey;secretKey=secretKey
Content-Type: application/x-www-form-urlencoded
Content-Length: 122
format=csv&reportContents.vulnerabilitySections.exploitable_with=true&reportContents.vulnerabilitySections.references=true
import json
import requests

def download_scan(scan_num):
    # Post an export request
    headers = {
        'X-ApiKeys': 'accessKey=accessKey;secretKey=secretKey',
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    data = {
        'format': 'csv',
        'reportContents.vulnerabilitySections.exploitable_with': 'true',
        'reportContents.vulnerabilitySections.references': 'true'
    }
    res = requests.post(url + '/scans/{id_num}/export'.format(id_num = scan_num), data=data, verify=False, headers=headers)
    if res.status_code == 200:
        export = json.loads(res.text)
        file_id = export.get('file')
        # Continually check the scan status until the status is ready
        while True:
            # Check file status
            res = requests.get(url + '/scans/{id_num}/export/{file_num}/status'.format(id_num = scan_num, file_num = file_id), verify=False, headers=headers)
            if res.status_code == 200:
                status = json.loads(res.text)['status']
                if status == 'ready':
                    break
        # Download the scan
        res = requests.get(url + '/scans/{scan_num}/export/{file_num}/download'.format(scan_num = scan_num, file_num = file_id), verify=False, headers=headers)
        # If the scan is successfully downloaded, get the attachment file
        if res.status_code == 200:
            attachment = res.content
            print("Scan downloaded!!!")
        else:
            raise Exception("Download request failed with status code: " + str(res))
    return attachment

def main():
    # Download the scan based on the scan_id. I have a helper function that returns the id that I am omitting here
    try:
        scan = download_scan(scan_id)
    except Exception as e:
        print(e)
        quit()
    with open("scan.csv", "wb") as f:
        f.write(scan)

if __name__ == "__main__":
    main()
I'm having the exact same issue but with PowerShell. Neither my additional columns nor my filters appear to be working. I was wondering if you'd had any joy getting this to work?
If I change the scan_id I get the correct different results, which suggests it is receiving the JSON but ignoring the columns and filters.
My JSON is as follows...
{
    "scan_id": 3416,
    "format": "csv",
    "reportContents.vulnerabilitySections.cvss3_base_score": true,
    "filters": {
        "filter.0.quality": "gt",
        "filter.0.filter": "cvss2_base_score",
        "filter.0.value": "6.9",
        "filter.1.quality": "neq",
        "filter.1.filter": "cvss2_base_score",
        "filter.1.value": ""
    }
}
I managed to fix it. My problem was that I was using Python's requests module with its data={} keyword, which defaults to the header content-type: application/x-www-form-urlencoded; that generates reports with strictly 13 fields regardless of your payload.
To make it actually consider your payload, explicitly set the header "content-type": "application/json" and pass your payload with json={} instead of data={}.
WILL NOT WORK:
requests.post(
    nessus_url + f"/scans/{scan_id}/export",
    data={
        "format": "csv",
        "template_id": "",
        "reportContents": {
            "csvColumns": {
                "id": True,
                "cve": True,
                "cvss": True,
                **other_columns,
            }
        }
    },
    verify=False,
    headers={
        "X-ApiKeys": f"accessKey={credentials['access_key']}; secretKey={credentials['secret_key']}",
    },
)
WILL WORK:
requests.post(
    nessus_url + f"/scans/{scan_id}/export",
    json={
        "format": "csv",
        "template_id": "",
        "reportContents": {
            "csvColumns": {
                "id": True,
                "cve": True,
                "cvss": True,
                **other_columns
            }
        }
    },
    verify=False,
    headers={
        "X-ApiKeys": f"accessKey={credentials['access_key']}; secretKey={credentials['secret_key']}",
        "content-type": "application/json",
    },
)
I am currently trying to update a Sharepoint 2013 list.
This is the module that I am using to accomplish that task. However, when I run the POST request I receive the following error:
"b'{"error":{"code":"-1, Microsoft.SharePoint.Client.InvalidClientQueryException","message":{"lang":"en-US","value":"Invalid JSON. A token was not recognized in the JSON content."}}}'"
Any idea of what I am doing wrong?
import requests
from requests_ntlm import HttpNtlmAuth  # assuming the NTLM auth module mentioned above

def update_item(sharepoint_user, sharepoint_password, ad_domain, site_url, sharepoint_listname):
    login_user = ad_domain + '\\' + sharepoint_user
    auth = HttpNtlmAuth(login_user, sharepoint_password)
    sharepoint_url = site_url + '/_api/web/'
    sharepoint_contextinfo_url = site_url + '/_api/contextinfo'
    headers = {
        'accept': 'application/json;odata=verbose',
        'content-type': 'application/json;odata=verbose',
        'odata': 'verbose',
        'X-RequestForceAuthentication': 'true'
    }
    r = requests.post(sharepoint_contextinfo_url, auth=auth, headers=headers, verify=False)
    form_digest_value = r.json()['d']['GetContextWebInformation']['FormDigestValue']
    item_id = 4991  # This id is one of the Ids returned by the code above
    api_page = sharepoint_url + "lists/GetByTitle('%s')/items(%d)" % (sharepoint_listname, item_id)
    update_headers = {
        "Accept": "application/json; odata=verbose",
        "Content-Type": "application/json; odata=verbose",
        "odata": "verbose",
        "X-RequestForceAuthentication": "true",
        "X-RequestDigest": form_digest_value,
        "IF-MATCH": "*",
        "X-HTTP-Method": "MERGE"
    }
    r = requests.post(api_page, {'__metadata': {'type': 'SP.Data.TestListItem'}, 'Title': 'TestUpdated'}, auth=auth, headers=update_headers, verify=False)
    if r.status_code == 204:
        print(str('Updated'))
    else:
        print(str(r.status_code))
I used your code for my scenario and fixed the problem.
I also faced the same problem. I think the way the data is passed for the update is not correct.
Pass it like below:
json_data = {
    "__metadata": { "type": "SP.Data.TasksListItem" },
    "Title": "updated title from Python"
}
and pass json_data to requests like below:
r = requests.post(api_page, json.dumps(json_data), auth=auth, headers=update_headers, verify=False).text
Note: I used SP.Data.TasksListItem as it is my type. Use http://SharePointurl/_api/web/lists/getbytitle('name')/ListItemEntityTypeFullName to find the type
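If you'd rather look the type up programmatically than in the browser, a small sketch along these lines should also work (it reuses the auth object and site_url from the question's code, and 'Tasks' is just a placeholder list name):

# Read ListItemEntityTypeFullName for a list via the SharePoint REST API
list_url = site_url + "/_api/web/lists/getbytitle('Tasks')?$select=ListItemEntityTypeFullName"
r = requests.get(list_url, auth=auth, headers={"Accept": "application/json;odata=verbose"}, verify=False)
print(r.json()["d"]["ListItemEntityTypeFullName"])  # e.g. SP.Data.TasksListItem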
I am trying to get all results from https://www.ncl.com/. I found that the request must be a GET sent to this link: https://www.ncl.com/search_vacations
So far I got the first 12 results and there is no problem parsing them. The problem is I cannot find a way to "change" the page of the results. I get 12 of 499 and I need to get them all. I've tried https://www.ncl.com/search_vacations?current_page=1 and incremented it every time, but I get the same (first) result every time. I also tried adding a JSON body to the request, json = {"current_page": '1'}, again with no success.
This is my code so far:
import math
import requests
session = requests.session()
proxies = {'https': 'https://97.77.104.22:3128'}
headers = {
"authority": "www.ncl.com",
"method": "GET",
"path": "/search_vacations",
"scheme": "https",
"accept": "application/json, text/plain, */*",
"connection": "keep-alive",
"referer": "https://www.ncl.com",
"cookie": "AkaUTrackingID=5D33489F106C004C18DFF0A6C79B44FD; AkaSTrackingID=F942E1903C8B5868628CF829225B6C0F; UrCapture=1d20f804-718a-e8ee-b1d8-d4f01150843f; BIGipServerpreprod2_www2.ncl.com_http=61515968.20480.0000; _gat_tealium_0=1; BIGipServerpreprod2_www.ncl.com_r4=1957341376.10275.0000; MP_COUNTRY=us; MP_LANG=en; mp__utma=35125182.281213660.1481488771.1481488771.1481488771.1; mp__utmc=35125182; mp__utmz=35125182.1481488771.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); utag_main=_st:1481490575797$ses_id:1481489633989%3Bexp-session; s_pers=%20s_fid%3D37513E254394AD66-1292924EC7FC34CB%7C1544560775848%3B%20s_nr%3D1481488775855-New%7C1484080775855%3B; s_sess=%20s_cc%3Dtrue%3B%20c%3DundefinedDirect%2520LoadDirect%2520Load%3B%20s_sq%3D%3B; _ga=GA1.2.969979116.1481488770; mp__utmb=35125182; NCL_LOCALE=en-US; SESS93afff5e686ba2a15ce72484c3a65b42=5ecffd6d110c231744267ee50e4eeb79; ak_location=US,NY,NEWYORK,501; Ncl_region=NY; optimizelyEndUserId=oeu1481488768465r0.23231006365903206",
"Proxy-Authorization": "Basic QFRLLTVmZjIwN2YzLTlmOGUtNDk0MS05MjY2LTkxMjdiMTZlZTI5ZDpAVEstNWZmMjA3ZjMtOWY4ZS00OTQxLTkyNjYtOTEyN2IxNmVlMjlk"
}
def get_count():
    response = requests.get(
        "https://www.ncl.com/search_vacations?cruise=1&cruiseTour=0&cruiseHotel=0&cruiseHotelAir=0&flyCruise=0&numberOfGuests=4294953449&state=undefined&pageSize=10&currentPage=",
        proxies=proxies)
    tmpcruise_results = response.json()
    tmpline = tmpcruise_results['meta']
    total_record_count = tmpline['aggregate_record_count']
    return total_record_count
total_cruise_count = get_count()
total_page_count = math.ceil(int(total_cruise_count) / 10)
session.headers.update(headers)
cruises = []
page_counter = 1
while page_counter <= total_page_count:
    url = "https://www.ncl.com/search_vacations?current_page=" + str(page_counter) + ""
    page = requests.get(url, headers=headers, proxies=proxies)
    cruise_results = page.json()
    for line in cruise_results['results']:
        cruises.append(line)
        print(line)
    page_counter += 1
    print(cruise_results['pagination']["current_page"])
    print("----------")
print(len(cruises))
I'm using requests and a proxy. Any ideas how to do that?
The website claims to have 12264 search results (for a blank search), organised in pages of 12.
The search url takes a parameter Nao which seems to define the search result offset from which your result page will start.
So fetching https://www.ncl.com/uk/en/search_vacations?Nao=45
should get a "page" of 12 search results, starting with result 46.
and sure enough:
"pagination": {
"starting_record": "46",
"ending_record": "57",
"current_page": "4",
"start_page": "1",
...
So to page through all results, start with Nao = 0 and add 12 for each fetch.
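For example, a rough paging loop; this is only a sketch, assuming the results key and the JSON shape seen in the question's code, and that the endpoint answers without the extra accept header/cookies from the question (add them back if it doesn't):

import requests

BASE_URL = "https://www.ncl.com/uk/en/search_vacations"
PAGE_SIZE = 12

cruises = []
offset = 0
while True:
    # Nao is the result offset: 0, 12, 24, ...
    data = requests.get(BASE_URL, params={"Nao": offset}).json()
    results = data.get("results", [])
    if not results:
        break
    cruises.extend(results)
    offset += PAGE_SIZE

print(len(cruises))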
I have a string like this:
url = 'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
I wish to convert it to this:
converted_url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10'
I have tried this:
converted_url = url.decode('utf-8')
However, this error is thrown:
AttributeError: 'str' object has no attribute 'decode'
You can use requests to do decoding automatically for you.
Note: the after_author URL parameter is a next-page token, so when you make a request to the exact URL you provided, the returned HTML will not be the same as you expect, because the after_author URL parameter changes on every request; for example, in my case it is different - uB8AAEFN__8J, while in your URL it's rukAAOJ8__8J.
To get it to work you need to parse the next page token from the first page that will lead to the second page and so on, for example:
# from my other answer:
# https://github.com/dimitryzub/stackoverflow-answers-archive/blob/main/answers/scrape_all_scholar_profiles_bs4.py
params = {
    "view_op": "search_authors",
    "mauthors": "valve",
    "hl": "pl",
    "astart": 0
}

authors_is_present = True
while authors_is_present:
    # if next page is present -> update next page token and increment to the next page
    # if next page is not present -> exit the while loop
    if soup.select_one("button.gs_btnPR")["onclick"]:
        params["after_author"] = re.search(r"after_author\\x3d(.*)\\x26", str(soup.select_one("button.gs_btnPR")["onclick"])).group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False
Code and example to extract profiles data in the online IDE:
from parsel import Selector
import requests, json
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "label:security",
    "hl": "pl",
    "view_op": "search_authors"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
profiles = []
for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()
    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })
print(json.dumps(profiles, indent=2))
Alternatively, you can achieve the same thing using Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.
The difference is that you don't need to figure out how to extract data, bypass blocks from search engines, increase the number of requests, and so on.
Example code to integrate:
from serpapi import GoogleSearch
import os, json
params = {
    "api_key": os.getenv("API_KEY"),       # SerpApi API key
    "engine": "google_scholar_profiles",   # SerpApi profiles parsing engine
    "hl": "pl",                            # language
    "mauthors": "label:security"           # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))
# part of the output:
'''
{
  "name": "Johnson Thomas",
  "link": "https://scholar.google.com/citations?hl=pl&user=eKLr0EgAAAAJ",
  "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=pl",
  "author_id": "eKLr0EgAAAAJ",
  "affiliations": "Professor of Computer Science, Oklahoma State University",
  "email": "Zweryfikowany adres z cs.okstate.edu",
  "cited_by": 159999,
  "interests": [
    {
      "title": "Security",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Asecurity",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:security"
    },
    {
      "title": "cloud computing",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Acloud_computing",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:cloud_computing"
    },
    {
      "title": "big data",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Abig_data",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:big_data"
    }
  ],
  "thumbnail": "https://scholar.google.com/citations/images/avatar_scholar_56.png"
}
'''
Disclaimer: I work for SerpApi.
decode() is used to convert bytes into a string, and your url is a string, not bytes.
You can use encode() to convert this string into bytes and later use decode() to convert it back into the correct string.
(I use the prefix r to simulate text with this problem; without the prefix, the url doesn't have to be converted.)
url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
print(url)
url = url.encode('utf-8').decode('unicode_escape')
print(url)
result:
http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10
http://scholar.google.pl/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10
BTW: first check print(url); maybe you have the correct url but are using the wrong method to display it. The Python shell displays results without print() using repr(), which shows some characters as escape codes to indicate what encoding is used in the text (utf-8, iso-8859-1, win-1250, latin-1, etc.).