Azure Cognitive Search: - python

I recently upgraded my Azure Cognitive Search instance so that it supports semantic search.
However, when I pass query_type="semantic" to the client's search call, I get the following stack trace:
Traceback (most recent call last):
File "call_semantic_search.py", line 34, in <module>
c, r = main(search_text='what is a ')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "call_semantic_search.py", line 28, in main
count: float = search_results.get_count()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.11/site-packages/azure/search/documents/_paging.py", line 82, in get_count
return self._first_iterator_instance().get_count()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.11/site-packages/azure/search/documents/_paging.py", line 91, in wrapper
self._response = self._get_next(self.continuation_token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.11/site-packages/azure/search/documents/_paging.py", line 115, in _get_next_cb
return self._client.documents.search_post(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.11/site-packages/azure/search/documents/_generated/operations/_documents_operations.py", line 312, in search_post
raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: () The request is invalid. Details: parameters : Requested value 'semantic' was not found.
Code:
Message: The request is invalid. Details: parameters : Requested value 'semantic' was not found.
This is the code that I have been using to call the search index.
import logging
from typing import Dict, Iterable, Tuple

import settings
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from search import SearchableItem

TOP = 10
SKIP = 0


def main(search_text: str) -> Tuple[float, Iterable[Dict]]:
    client = SearchClient(
        api_version="2021-04-30-Preview",
        endpoint=settings.SEARCH_SERVICE_ENDPOINT,
        index_name=settings.SOCIAL_IDX_NAME,
        credential=AzureKeyCredential(key=settings.SEARCH_SERVICE_KEY)
    )
    logging.info(f"Calling: /search?top={TOP}&skip={SKIP}&q={search_text}")
    search_results = client.search(
        search_text=search_text,
        top=TOP,
        skip=SKIP,
        query_type="semantic",
        include_total_count=True,
    )
    count: float = search_results.get_count()
    results = SearchableItem.from_result_as_dict(search_results)
    return count, results


if __name__ == "__main__":
    count, results = main(search_text='what is a ')
    print(count, list(results))
And here is my Azure configuration (I'm able to perform semantic searches via the portal):
EDITS
Taking @Thiago Custodio's advice, I enabled logging with:
import sys

logger = logging.getLogger('azure')
logger.setLevel(logging.DEBUG)
# Configure a console output
handler = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler)
# ...
search_results = client.search(
    search_text=search_text,
    top=TOP,
    skip=SKIP,
    query_type="semantic",
    include_total_count=True,
    logging_enable=True
)
# ...
And I got the following:
DEBUG:azure.core.pipeline.policies._universal:Request URL: 'https://search.windows.net//indexes('idx-name')/docs/search.post.search?api-version=2020-06-30'
Request method: 'POST'
Request headers:
'Content-Type': 'application/json'
'Accept': 'application/json;odata.metadata=none'
'Content-Length': '86'
'x-ms-client-request-id': 'fbaafc9e-qwww-11ed-9117-a69cwa6c72e'
'api-key': '***'
'User-Agent': 'azsdk-python-search-documents/11.3.0 Python/3.11.1 (macOS-13.0-x86_64-i386-64bit)'
So the outgoing request URL is pinned to api-version=2020-06-30. In the Azure Portal, if I change the search API version to the same value, semantic search is unavailable.
I seem to have an outdated version of the search library, even though I installed it via:
pip install azure-search-documents
The most notable difference is that in my local azure/search/documents/_generated/operations/_documents_operations.py the api_version seems to be hardcoded to 2020-06-30.
Looking at the source, I need the api_version to be set dynamically, so that I can pass it to the search client from the caller. This is already implemented in the main branch of the source (see: Source), but for some reason my local version is different.
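To confirm which API version a request actually carries, the api-version query parameter can be read directly from the request URL printed in the debug log. The following is a minimal sketch using only the standard library; api_version_from_url is a hypothetical helper name and not part of the Azure SDK.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def api_version_from_url(url: str) -> Optional[str]:
    """Return the api-version query parameter of a request URL, or None."""
    values = parse_qs(urlparse(url).query).get("api-version")
    return values[0] if values else None


# URL copied from the debug log in the question.
logged_url = (
    "https://search.windows.net//indexes('idx-name')/docs/"
    "search.post.search?api-version=2020-06-30"
)
print(api_version_from_url(logged_url))
```

If the printed value differs from the api_version passed to SearchClient, the installed library is overriding or ignoring that argument.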

from your code:
search_results = client.search(
search_text=search_text,
top=TOP,
skip=SKIP,
query_type="semantic",
include_total_count=True,
)
Semantic search is not a parameter, but an endpoint. Rather than calling /search, you should call /semantic
that's what you need:
import os


def semantic_ranking():
    # [START semantic_ranking]
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
    index_name = os.getenv("AZURE_SEARCH_INDEX_NAME")
    api_key = os.getenv("AZURE_SEARCH_API_KEY")
    credential = AzureKeyCredential(api_key)
    client = SearchClient(endpoint=endpoint,
                          index_name=index_name,
                          credential=credential)
    results = list(client.search(search_text="luxury", query_type="semantic", query_language="en-us"))
Note the query_type argument in the last line.

Fixed with:
azure-search-documents==11.4.0b3
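The pinned pre-release can be installed with pip install azure-search-documents==11.4.0b3. To check which version is actually present in the active environment, something like the sketch below may help. It relies on the standard library's importlib.metadata (Python 3.8+); installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version("azure-search-documents"))
```

The User-Agent header in the debug log (azsdk-python-search-documents/11.3.0) reports the same information for the library that sent the request.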

Related

UnknownApiNameOrVersion from Google's My Business API (mybusiness)

I'm using Google's My Business API via Google's API Python Client Library.
Without further ado, here is a complete code example:
from dotenv import load_dotenv
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from os.path import exists
from pprint import pprint
import os
import pickle

load_dotenv()

API_DEVELOPER_KEY = os.getenv('API_DEVELOPER_KEY')
API_SCOPE = os.getenv('API_SCOPE')
STORED_CLIENT_CREDENTIALS = os.getenv('STORED_CLIENT_CREDENTIALS')
GOOGLE_APPLICATION_CREDENTIALS = os.getenv('GOOGLE_APPLICATION_CREDENTIALS')


def get_google_credentials(path=STORED_CLIENT_CREDENTIALS):
    '''Loads stored credentials. Gets and stores new credentials if necessary.'''
    if exists(path):
        with open(path, 'rb') as pickle_in:
            credentials = pickle.load(pickle_in)
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            client_secrets_file=GOOGLE_APPLICATION_CREDENTIALS, scopes=API_SCOPE)
        flow.run_local_server()
        credentials = flow.credentials
        store_google_credentials(credentials)
    return credentials


def store_google_credentials(credentials, path=STORED_CLIENT_CREDENTIALS):
    '''Store credentials for future reuse to avoid authenticating every time.'''
    with open(path, 'wb') as pickle_out:
        pickle.dump(credentials, pickle_out)


def get_google_api_interface(credentials, service_name, service_version, service_discovery_url=None):
    '''Get a resource object with methods for interacting with Google's API.'''
    return build(service_name,
                 service_version,
                 credentials=credentials,
                 developerKey=API_DEVELOPER_KEY,
                 discoveryServiceUrl=service_discovery_url)


def extract_dict_key(dict, key):
    '''Utility to extract particular values from a dictionary by their key.'''
    return [d[key] for d in dict]


def transform_list_to_string(list, separator=' '):
    return separator.join(map(str, list))


def get_google_account_names():
    '''Get a list of all account names (unique ids).'''
    google = get_google_api_interface(
        get_google_credentials(),
        service_name='mybusinessaccountmanagement',
        service_version='v1',
        service_discovery_url='https://mybusinessaccountmanagement.googleapis.com/$discovery/rest?version=v1')
    accounts = google.accounts().list().execute()
    return extract_dict_key(accounts['accounts'], 'name')


def get_google_store_reviews(account_name):
    '''Get all store reviews for a specific account from Google My Business.'''
    google = get_google_api_interface(
        get_google_credentials(),
        service_name='mybusiness',
        service_version='v4',
        service_discovery_url='https://mybusiness.googleapis.com/$discovery/rest?version=v4')
    return google.accounts().locations().batchGetReviews(account_name).execute()


account_names = get_google_account_names()
pprint(account_names)
first_account_name = account_names[0]
pprint(get_google_store_reviews(first_account_name))
And here is the contents of .env:
API_DEVELOPER_KEY = ********
API_SCOPE = https://www.googleapis.com/auth/business.manage
STORED_CLIENT_CREDENTIALS = secrets/credentials.pickle
GOOGLE_APPLICATION_CREDENTIALS = secrets/client_secrets.json
My function get_google_account_names() works fine and returns the expected data:
['accounts/******************020',
'accounts/******************098',
'accounts/******************872',
'accounts/******************021',
'accounts/******************112']
I have tested and validated get_google_credentials() to ensure that CLIENT_CREDENTIALS and API_DEVELOPER_KEY are indeed loaded correctly and working.
Also, in .env, I'm setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the client_secret.json path, as required by some methods in Google's Python Client Library.
My function get_google_store_reviews(), however, results in this error:
Traceback (most recent call last):
File "/my-project-dir/my-script.py", line 88, in <module>
pprint(get_google_store_reviews())
File "/my-project-dir/my-script.py", line 76, in get_google_store_reviews
google = get_google_api_interface(
File "/my-project-dir/my-script.py", line 46, in get_google_api_interface
return build(service_name,
File "/my-project-dir/.venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
return wrapped(*args, **kwargs)
File "/my-project-dir/.venv/lib/python3.9/site-packages/googleapiclient/discovery.py", line 324, in build
raise UnknownApiNameOrVersion("name: %s version: %s" % (serviceName, version))
googleapiclient.errors.UnknownApiNameOrVersion: name: mybusiness version: v4
I have also tried v1 of the Discovery Document with the same result.
Does anyone know what's going on here? It seems like the API mybusiness is not discoverable via the Discovery Document provided by Google, but I'm not sure how to verify my suspicion.
Note that this and this issue are related, but not exactly the same. The answers to those questions are old and don't seem to be applicable anymore after recent changes by Google.
Update:
As a commenter pointed out, this API appears to be deprecated. That might explain the issues I'm having, however, Google's documentation states:
"Deprecated indicates that the version of the API will continue to function […]"
Furthermore, notice that even though the top-level accounts.locations is marked as deprecated, some of the underlying methods (including batchGetReviews) are not.
See screenshot for more details:
This issue has also been reported in GitHub.
The batchGetReviews method expects a single account as the path parameter.
You should thus loop over get_google_account_names() and call .batchGetReviews(google_account) instead of .batchGetReviews(google_accounts).
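The per-account loop suggested here could look roughly like the sketch below. It shows only the control flow: fetch_reviews is a hypothetical callable standing in for the real API call (its exact arguments are not confirmed by this thread), and fake_fetch exists purely so the example runs without credentials.

```python
from typing import Any, Callable, Dict, Iterable


def collect_reviews(
    account_names: Iterable[str],
    fetch_reviews: Callable[[str], Any],
) -> Dict[str, Any]:
    """Call fetch_reviews once per account and collect the results by account name."""
    return {name: fetch_reviews(name) for name in account_names}


# Stand-in for the real API call, used here only to show the control flow.
def fake_fetch(account_name: str) -> dict:
    return {"account": account_name, "reviews": []}


reviews = collect_reviews(["accounts/1", "accounts/2"], fake_fetch)
print(sorted(reviews))
```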

Bad Request creating Automation Account in Azure Python SDK

I'm trying to create a new AutomationAccount using Python SDK. There's no problem if I get, list, update or delete any account, but I'm getting a BadRequest error when I try to create a new one.
Documentation is pretty easy: AutomationAccountOperations Class > create_or_update()
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
from azure.identity import AzureCliCredential
from azure.mgmt.automation import AutomationClient
credential = AzureCliCredential()
automation_client = AutomationClient(credential, "xxxxx")
result = automation_client.automation_account.create_or_update("existing_rg", 'my_automation_account', {"location": "westeurope"})
print(f'Automation account {result.name} created')
This tiny script is throwing me this error:
Traceback (most recent call last):
File ".\deploy.py", line 10
result = automation_client.automation_account.create_or_update("*****", 'my_automation_account', {"location": "westeurope"})
File "C:\Users\Dave\.virtualenvs\new-azure-account-EfYek8IT\lib\site-packages\azure\mgmt\automation\operations\_automation_account_operations.py", line 174, in create_or_update
raise HttpResponseError(response=response, model=error, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (BadRequest) {"Message":"The request body on Account must be present, and must specify, at a minimum, the required fields set to valid values."}
Code: BadRequest
Message: {"Message":"The request body on Account must be present, and must specify, at a minimum, the required fields set to valid values."}
I've tried the same operation (create_or_update) with a different SDK (PowerShell), using the same parameters, and it worked.
Any thoughts?
The solution is to set the SKU parameter.
For some reason it is not necessary in PowerShell, but it is in the Python SDK. This snippet now creates my AutomationAccount successfully:
credential = AzureCliCredential()
automation_client = AutomationClient(credential, "xxxxx")
params = {"name": "my_automation_account", "location": "westeurope", "tags": {}, "sku": {"name": "free"}}
result = automation_client.automation_account.create_or_update("existing_rg", 'my_automation_account', params)
print(f'Automation account {result.name} created')
Docs about this:
AutomationAccountOperations Class > create_or_update
AutomationAccountCreateOrUpdateParameters Class
Sku Class
Thanks @UpQuark
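To avoid forgetting the SKU again, the request body can be built by a small helper. This is only a sketch: build_account_params is a hypothetical name, the default SKU value "free" is taken from the snippet above, and the assumption that location and sku are the fields the service insists on comes from this thread, not from the SDK reference.

```python
from typing import Dict, Optional


def build_account_params(
    name: str,
    location: str,
    sku_name: str = "free",
    tags: Optional[Dict[str, str]] = None,
) -> dict:
    """Build a create_or_update body that always carries a SKU."""
    if not location:
        raise ValueError("location is required")
    return {
        "name": name,
        "location": location,
        "tags": tags or {},
        "sku": {"name": sku_name},
    }


params = build_account_params("my_automation_account", "westeurope")
print(params)
```

The resulting dictionary can then be passed as the third argument of create_or_update, as in the snippet above.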

Google Ads API Python - 403 Request had insufficient authentication scopes

I'm trying to access Google Ads campaign reports from Python following this tutorial.
I've requested my Developer Token with Basic Access. I think it has enough privileges to execute the script. I can see that my token is active when I go to "API Center" in Google Ads.
I've created a project in Google Cloud and an OAuth token.
In Google Cloud:
Created a new project
Activated the Google Ads API.
When I go to Manage -> Credentials I see that the OAuth token is compatible with that API.
I have successfully created a refresh token.
I'm using this script as proof of concept:
import os
import json
import sys

from google.ads.google_ads.errors import GoogleAdsException

# Put an account id to download stats from. Note: not MCC, no dash lines
CUSTOMER_ID = "xxxxxxxxxx"


def get_account_id(account_id, check_only=False):
    """
    Converts int to str, checks if str has dashes. Returns 10 chars str or raises error
    :check_only - if True, returns None instead of Error
    """
    if isinstance(account_id, int) and len(str(account_id)) == 10:
        return str(account_id)
    if isinstance(account_id, str) and len(account_id.replace("-", "")) == 10:
        return account_id.replace("-", "")
    if check_only:
        return None
    raise ValueError(f"Couldn't recognize account id from {account_id}")


def micros_to_currency(micros):
    return micros / 1000000.0


def main(client, customer_id):
    ga_service = client.get_service("GoogleAdsService")  # , version="v5")
    query = """
        SELECT
            campaign.id,
            campaign.name,
            ad_group.id,
            ad_group.name,
            ad_group_criterion.criterion_id,
            ad_group_criterion.keyword.text,
            ad_group_criterion.keyword.match_type,
            metrics.impressions,
            metrics.clicks,
            metrics.cost_micros
        FROM keyword_view
        WHERE
            segments.date DURING LAST_7_DAYS
            AND campaign.advertising_channel_type = 'SEARCH'
            AND ad_group.status = 'ENABLED'
            AND ad_group_criterion.status IN ('ENABLED', 'PAUSED')
        ORDER BY metrics.impressions DESC
        LIMIT 50"""
    # Issues a search request using streaming.
    response = ga_service.search_stream(customer_id, query)  # THIS LINE GENERATES THE ERROR
    keyword_match_type_enum = client.get_type(
        "KeywordMatchTypeEnum"
    ).KeywordMatchType
    try:
        for batch in response:
            for row in batch.results:
                campaign = row.campaign
                ad_group = row.ad_group
                criterion = row.ad_group_criterion
                metrics = row.metrics
                keyword_match_type = keyword_match_type_enum.Name(
                    criterion.keyword.match_type
                )
                print(
                    f'Keyword text "{criterion.keyword.text}" with '
                    f'match type "{keyword_match_type}" '
                    f"and ID {criterion.criterion_id} in "
                    f'ad group "{ad_group.name}" '
                    f'with ID "{ad_group.id}" '
                    f'in campaign "{campaign.name}" '
                    f"with ID {campaign.id} "
                    f"had {metrics.impressions} impression(s), "
                    f"{metrics.clicks} click(s), and "
                    f"{metrics.cost_micros} cost (in micros) during "
                    "the last 7 days."
                )
    except GoogleAdsException as ex:
        print(
            f'Request with ID "{ex.request_id}" failed with status '
            f'"{ex.error.code().name}" and includes the following errors:'
        )
        for error in ex.failure.errors:
            print(f'\tError with message "{error.message}".')
            if error.location:
                for field_path_element in error.location.field_path_elements:
                    print(f"\t\tOn field: {field_path_element.field_name}")
        sys.exit(1)


if __name__ == "__main__":
    # credentials dictionary
    creds = {"google_ads": "googleads.yaml"}
    if not os.path.isfile(creds["google_ads"]):
        raise FileNotFoundError("File googleads.yaml doesn't exist.")
    resources = {"config": "config.json"}
    # This logging allows to see additional information on debugging
    import logging
    logging.basicConfig(level=logging.INFO, format='[%(asctime)s - %(levelname)s] %(message).5000s')
    logging.getLogger('google.ads.google_ads.client').setLevel(logging.DEBUG)
    # Initialize the google_ads client
    from google.ads.google_ads.client import GoogleAdsClient
    gads_client = GoogleAdsClient.load_from_storage(creds["google_ads"])
    id_to_load = get_account_id(CUSTOMER_ID)
    main(gads_client, id_to_load)
I've changed CUSTOMER_ID to the account number that appears in the upper left corner.
I've created a googleads.yaml and I've loaded the aforementioned information.
When I execute the script I get this error:
Traceback (most recent call last):
File "download_keywords_from_account.py", line 138, in <module>
main(gads_client, id_to_load)
File "download_keywords_from_account.py", line 70, in main
response = ga_service.search_stream(customer_id, query)
File "google/ads/google_ads/v6/services/google_ads_service_client.py", line 366, in search_stream
return self._inner_api_calls['search_stream'](request, retry=retry, timeout=timeout, metadata=metadata)
File "google/api_core/gapic_v1/method.py", line 145, in __call__
return wrapped_func(*args, **kwargs)
File "google/api_core/retry.py", line 281, in retry_wrapped_func
return retry_target(
File "google/api_core/retry.py", line 184, in retry_target
return target()
File "google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "google/api_core/grpc_helpers.py", line 152, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.PermissionDenied: 403 Request had insufficient authentication scopes
The googleads.yaml file looks like this:
#############################################################################
# Required Fields #
#############################################################################
developer_token: {developer token as seen in google ads -> tools -> api center}
#############################################################################
# Optional Fields #
#############################################################################
login_customer_id: {Id from the top left corner in google ads, only numbers}
# user_agent: INSERT_USER_AGENT_HERE
# partial_failure: True
validate_only: False
#############################################################################
# OAuth2 Configuration #
# Below you may provide credentials for either the installed application or #
# service account flows. Remove or comment the lines for the flow you're #
# not using. #
#############################################################################
# The following values configure the client for the installed application
# flow.
client_id: {Oauth client id taken from gcloud -> api -> credentials} ends with apps.googleusercontent.com
client_secret: {got it while generating the token}
refresh_token: 1//0hr.... made with generate_refresh_token.py
# The following values configure the client for the service account flow.
path_to_private_key_file: ads.json
# delegated_account: INSERT_DOMAIN_WIDE_DELEGATION_ACCOUNT
#############################################################################
# ReportDownloader Headers #
# Below you may specify boolean values for optional headers that will be #
# applied to all requests made by the ReportDownloader utility by default. #
#############################################################################
# report_downloader_headers:
# skip_report_header: False
# skip_column_header: False
# skip_report_summary: False
# use_raw_enum_values: False
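The template's own comment asks for the lines of the unused OAuth2 flow to be removed or commented out. A quick way to see which flows a parsed configuration has keys for is sketched below. It works on a plain dictionary, so loading the YAML is left out; configured_flows and the two key sets are hypothetical names, and the thread does not establish whether this is related to the 403 error.

```python
from typing import Dict, List

# Keys taken from the googleads.yaml template shown above.
INSTALLED_APP_KEYS = {"client_id", "client_secret", "refresh_token"}
SERVICE_ACCOUNT_KEYS = {"path_to_private_key_file", "delegated_account"}


def configured_flows(config: Dict[str, object]) -> List[str]:
    """Return the names of the OAuth2 flows that have at least one key set."""
    flows = []
    if INSTALLED_APP_KEYS.intersection(config):
        flows.append("installed_application")
    if SERVICE_ACCOUNT_KEYS.intersection(config):
        flows.append("service_account")
    return flows


# Placeholder values mirroring the uncommented keys of the file above.
example_config = {
    "developer_token": "placeholder",
    "login_customer_id": "placeholder",
    "client_id": "placeholder",
    "client_secret": "placeholder",
    "refresh_token": "placeholder",
    "path_to_private_key_file": "ads.json",
}
print(configured_flows(example_config))
```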
NOTES:
The file ads.json contains the private key downloaded from the credentials page in gcloud.
I've seen some posts on this issue, but none of them are about Python + Google Ads, and I couldn't find a solution there either.
I have also tried other Python + Google Ads examples and got the same error. This makes me
think that I must be configuring something wrong in gcloud / Google Ads, but I don't understand what.
Please help me make the query work; I'm really stuck.
Thanks a lot!
@DazWilkin's comments solved my problem. Thanks!

Disabling Billing on Google Cloud Using Google Cloud Function (Keyerror: 'data')

I am attempting to write a Google Cloud Function to set caps to disable usage above a certain limit. I followed the instructions here: https://cloud.google.com/billing/docs/how-to/notify#cap_disable_billing_to_stop_usage.
This is what my cloud function looks like (I am just copying and pasting from the Google Cloud docs page linked above):
import base64
import json
import os

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = os.getenv('GCP_PROJECT')
PROJECT_NAME = f'projects/{PROJECT_ID}'


def stop_billing(data, context):
    pubsub_data = base64.b64decode(data['data']).decode('utf-8')
    pubsub_json = json.loads(pubsub_data)
    cost_amount = pubsub_json['costAmount']
    budget_amount = pubsub_json['budgetAmount']
    if cost_amount <= budget_amount:
        print(f'No action necessary. (Current cost: {cost_amount})')
        return
    billing = discovery.build(
        'cloudbilling',
        'v1',
        cache_discovery=False,
        credentials=GoogleCredentials.get_application_default()
    )
    projects = billing.projects()
    if __is_billing_enabled(PROJECT_NAME, projects):
        print(__disable_billing_for_project(PROJECT_NAME, projects))
    else:
        print('Billing already disabled')


def __is_billing_enabled(project_name, projects):
    """
    Determine whether billing is enabled for a project
    @param {string} project_name Name of project to check if billing is enabled
    @return {bool} Whether project has billing enabled or not
    """
    res = projects.getBillingInfo(name=project_name).execute()
    return res['billingEnabled']


def __disable_billing_for_project(project_name, projects):
    """
    Disable billing for a project by removing its billing account
    @param {string} project_name Name of project disable billing on
    @return {string} Text containing response from disabling billing
    """
    body = {'billingAccountName': ''}  # Disable billing
    res = projects.updateBillingInfo(name=project_name, body=body).execute()
    print(f'Billing disabled: {json.dumps(res)}')
Also attaching screenshot of what it looks like on Google Cloud Function UI:
I'm also attaching a screenshot to show that I copied and pasted the relevant things to the requirements.txt file as well.
But when I go to test the code, it gives me an error:
{
insertId: "000000-69dce50a-e079-45ed-b949-a241c97fdfe4"
labels: {…}
logName: "projects/stanford-cs-231n/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
receiveTimestamp: "2020-02-06T16:24:26.800908134Z"
resource: {…}
severity: "ERROR"
textPayload: "Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
_function_handler.invoke_user_function(event_object)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
event_context.Context(**request_or_event.context))
File "/user_code/main.py", line 9, in stop_billing
pubsub_data = base64.b64decode(data['data']).decode('utf-8')
KeyError: 'data'
"
timestamp: "2020-02-06T16:24:25.411Z"
trace: "projects/stanford-cs-231n/traces/8e106d5ab629141d5d91b6b68fb30c82"
}
Any idea why?
Relevant Stack Overflow Post: https://stackoverflow.com/a/58673874/3507127
There seems to be an error in the code Google provided. I got it working when I changed the stop_billing function:
def stop_billing(data, context):
    if 'data' in data.keys():
        pubsub_data = base64.b64decode(data['data']).decode('utf-8')
        pubsub_json = json.loads(pubsub_data)
        cost_amount = pubsub_json['costAmount']
        budget_amount = pubsub_json['budgetAmount']
    else:
        cost_amount = data['costAmount']
        budget_amount = data['budgetAmount']
    if cost_amount <= budget_amount:
        print(f'No action necessary. (Current cost: {cost_amount})')
        return
    if PROJECT_ID is None:
        print('No project specified with environment variable')
        return
    billing = discovery.build('cloudbilling', 'v1', cache_discovery=False, )
    projects = billing.projects()
    billing_enabled = __is_billing_enabled(PROJECT_NAME, projects)
    if billing_enabled:
        __disable_billing_for_project(PROJECT_NAME, projects)
    else:
        print('Billing already disabled')
The problem is that a real Pub/Sub message delivers its input as a JSON message with a base64-encoded 'data' entry. The testing functionality, however, passes the JSON you provide directly, without a 'data' key and without encoding it. The function I rewrote above checks for both cases.
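The two input shapes can be exercised locally without deploying anything. The sketch below isolates the payload handling in a hypothetical helper, extract_budget, and feeds it both a plain test-style event and a Pub/Sub-style event with a base64-encoded 'data' entry; the amounts are made-up sample values.

```python
import base64
import json
from typing import Tuple


def extract_budget(data: dict) -> Tuple[float, float]:
    """Return (costAmount, budgetAmount) from a wrapped or a plain event."""
    if "data" in data:
        # Pub/Sub delivery: the JSON payload is base64-encoded under 'data'.
        payload = json.loads(base64.b64decode(data["data"]).decode("utf-8"))
    else:
        # Test invocation: the JSON payload is passed as-is.
        payload = data
    return payload["costAmount"], payload["budgetAmount"]


plain = {"costAmount": 120.0, "budgetAmount": 100.0}
wrapped = {"data": base64.b64encode(json.dumps(plain).encode("utf-8")).decode("ascii")}
print(extract_budget(plain), extract_budget(wrapped))
```

Both calls should yield the same pair, which is what lets one function body serve the test console and the real trigger.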

How to download large files in Python 2

I'm trying to download large files (approx. 1GB) with mechanize module, but I have been unsuccessful. I've been searching for similar threads, but I have found only those, where the files are publicly accessible and no login is required to obtain a file. But this is not my case as the file is located in the private section and I need to login before the download. Here is what I've done so far.
import mechanize

g_form_id = ""


def is_form_found(form1):
    return "id" in form1.attrs and form1.attrs['id'] == g_form_id


def select_form_with_id_using_br(br1, id1):
    global g_form_id
    g_form_id = id1
    try:
        br1.select_form(predicate=is_form_found)
    except mechanize.FormNotFoundError:
        print "form not found, id: " + g_form_id
        exit()


url_to_login = "https://example.com/"
url_to_file = "https://example.com/download/files/filename=fname.exe"
local_filename = "fname.exe"
br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]
response = br.open(url_to_login)
# Find login form
select_form_with_id_using_br(br, 'login-form')
# Fill in data
br.form['email'] = 'email@domain.com'
br.form['password'] = 'password'
br.set_all_readonly(False)  # allow everything to be written to
br.submit()
# Try to download file
br.retrieve(url_to_file, local_filename)
But I'm getting an error when 512MB is downloaded:
Traceback (most recent call last):
File "dl.py", line 34, in <module>
br.retrieve(url_to_file, local_filename)
File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve
block = fp.read(bs)
File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read
self.__cache.write(data)
MemoryError: out of memory
Do you have any ideas how to solve this?
Thanks
You can use bs4 and requests to get you logged in then write the streamed content. There are a few form fields required including a _token_ field that is definitely necessary:
from bs4 import BeautifulSoup
import requests
from urlparse import urljoin

data = {'email': 'email@domain.com', 'password': 'password'}
base = "https://support.codasip.com"
with requests.Session() as s:
    # update headers
    s.headers.update({'User-agent': 'Firefox'})
    # use bs4 to parse the form fields
    soup = BeautifulSoup(s.get(base).content)
    form = soup.select_one("#frm-loginForm")
    # works as it is a relative path. Not always the case.
    action = form["action"]
    # Get rest of the fields, ignore password and email.
    for inp in form.find_all("input", {"name": True, "value": True}):
        name, value = inp["name"], inp["value"]
        if name not in data:
            data[name] = value
    # login
    s.post(urljoin(base, action), data=data)
    # get protected url
    with open(local_filename, "wb") as f:
        for chk in s.get(url_to_file, stream=True).iter_content(1024):
            f.write(chk)
Try downloading and writing the file in chunks. It seems the whole file is being held in memory.
You can also specify a Range header in your request, if the server supports it.
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
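The chunked approach can be sketched independently of any HTTP library. The example below is written for Python 3, unlike the Python 2 code in the question, and uses in-memory streams so it runs without a network; copy_in_chunks and range_header are hypothetical helper names. In a real download, src would be the response's file-like body and dst an open file.

```python
import io
from typing import BinaryIO, Dict, Optional


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Copy src to dst one chunk at a time and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def range_header(start: int, end: Optional[int] = None) -> Dict[str, str]:
    """Build an HTTP Range header for bytes start..end (open-ended if end is None)."""
    last = "" if end is None else str(end)
    return {"Range": f"bytes={start}-{last}"}


source = io.BytesIO(b"x" * 2500)
target = io.BytesIO()
copied = copy_in_chunks(source, target, chunk_size=1024)
print(copied, range_header(1024), range_header(0, 1023))
```

Only one chunk is held in memory at a time, so peak memory use stays near chunk_size regardless of the file size.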
