Schedule web scraping jobs on Azure and store results on ADLS - python

I have a Python job which uses Beautiful Soup to scrape data from the web. I have tried executing the script using U-SQL, but I keep receiving a generic error message:
An unhandled exception from user code has been reported
I haven't dug into the error much, as I am not sure whether it is even possible to scrape the web through U-SQL.
Is this possible using U-SQL, and if not, which Azure resource can I use to schedule this script and store the results on Azure Data Lake Store?

It would normally be helpful if you provided the complete error message and exactly how you want to scrape the web.
I will assume for now that you wrote some code that accesses web pages and tried to run it from within U-SQL. If that is correct, you are being blocked because the U-SQL container blocks all external network access. For more details on why that is done, see the previous answer here.

Hi I'm a PM from the Azure Data Lake team and I'd love to help out with this. I just need some clarification first about what you're trying to do. Could you reach out to me at mabasile(at)microsoft.com with the job ID of the failed job? (Any sensitive information can of course be scrubbed out). That'll be the best way to figure out exactly what you're trying to do and if it's possible on ADL.
Thanks, and I hope to hear from you soon!
Matt Basile
Azure Data Lake Analytics
Update: Confirming Michael Rys's answer - you cannot call external services through U-SQL. If ADLA scales out to hundreds of vertices and each vertex makes a separate call, you could end up DDoSing the service, so ADLA blocks external calls.
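For reference, here is a minimal sketch of one way to do this outside of U-SQL: run the Beautiful Soup script on whatever compute you schedule it from (for example a VM, an Azure Function, or a Data Factory custom activity) and write the results to Azure Data Lake Store (Gen1) with the azure-datalake-store Python SDK. The URL, store name, AAD credentials and paths below are placeholders.
# Minimal sketch: run the scraper outside U-SQL and push the results to
# Azure Data Lake Store (Gen1) with the azure-datalake-store SDK.
# Store name, AAD credentials, URL and paths are placeholders.
import requests
from bs4 import BeautifulSoup
from azure.datalake.store import core, lib

# Scrape something simple (placeholder URL and selector).
html = requests.get("https://example.com/products", timeout=30).text
soup = BeautifulSoup(html, "html.parser")
rows = [tag.get_text(strip=True) for tag in soup.select("h2.title")]

# Authenticate against ADLS Gen1 with an AAD service principal.
token = lib.auth(
    tenant_id="<TENANT_ID>",
    client_id="<CLIENT_ID>",
    client_secret="<CLIENT_SECRET>",
    resource="https://datalake.azure.net/",
)
adls = core.AzureDLFileSystem(token, store_name="<ADLS_STORE_NAME>")

# Write one line per scraped record.
with adls.open("/scraped/products.csv", "wb") as f:
    f.write("\n".join(rows).encode("utf-8"))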

Related

Use of ELK with Python

The project that I am working on is a bit confidential, but I will try to explain my issues and be as clear as possible because I need your opinion.
Project:
They asked me to set up a local ELK environment, and to use Python scripts to communicate with this stack (ELK) to store data, retrieve it, analyse it and visualise it with Kibana; finally, there is decision making based on that data (AI). So as you can see, it is a data engineering project with some AI for the decision-making process. The issues that I am facing are:
I don't know how to use Python to communicate with the stack; I didn't find resources about it
Since the data is confidential, how can I ensure a high level of security?
How many instances to use?
I am lost because I am new to ELK and my team is not Dev oriented
I am new to ELK, so please any advice would be really helpful!
I don't know how to use Python to communicate with the stack; I didn't find resources about it
To learn how to interact with your stack, use the official Python client for Elasticsearch.
You can install it with pip3 install elasticsearch, and the following links contain a wealth of tutorials on almost anything you would need to do.
https://kb.objectrocket.com/category/elasticsearch?filter=python
I suggest you start with these two:
https://kb.objectrocket.com/elasticsearch/how-to-parse-lines-in-a-text-file-and-index-as-elasticsearch-documents-using-python-641
https://kb.objectrocket.com/elasticsearch/how-to-query-elasticsearch-documents-in-python-268
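As a starting point, here is a minimal sketch using that client. The host, index name and documents are placeholders, and the keyword arguments assume a recent 8.x client (older 7.x versions take body= instead).
# Minimal sketch of talking to Elasticsearch from Python with the official
# client (pip3 install elasticsearch). Host, index name and documents are
# placeholders; keyword arguments assume an 8.x client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index (store) a document.
es.index(index="sensor-data", id="1", document={"device": "pump-3", "temp": 71.4})

# Retrieve it back by id.
doc = es.get(index="sensor-data", id="1")
print(doc["_source"])

# Search with a simple match query.
hits = es.search(index="sensor-data", query={"match": {"device": "pump-3"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"])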
Since the data is confidential, how can I ensure a high level of security?
You can mask the data or restrict index access.
https://www.elastic.co/guide/en/elasticsearch/reference/current/authorization.html
https://nl.devoteam.com/expert-view/field-level-security-and-data-masking-in-elasticsearch/
How many instances to use?
I am lost because I am new to ELK and my team is not Dev oriented
I suggest you start with one Elasticsearch node; if you're on AWS, use a t3a.large or equivalent and run Elasticsearch, Kibana and Logstash all on the same machine.
For setting it up: https://www.elastic.co/guide/en/elastic-stack-get-started/current/get-started-stack-docker.html#run-docker-secure
If you want to use Python as your integration tool with Elasticsearch, you can use the Elasticsearch Python client.
Alternatively, you can use Python to create the results and save them to a log file or insert them into a database, and Logstash will then pick up your data (a rough sketch of this follows below).
For security, ELK has good built-in features, from API authorization and user authentication to cluster security; see Secure the Elastic Stack.
I just use one instance, but feel free to separate Kibana, Elasticsearch and Logstash (if you use it) if you think you will need to, or you can use Docker to separate them.
Based on my experience, if you are going to load a lot of data in a short time it is wise to separate them so the processes don't interfere with each other.
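As a rough sketch of that "Python writes a log file, Logstash ships it" option: append one JSON object per line so a Logstash file input with a JSON codec can pick it up. The path and fields below are placeholders, and the matching Logstash pipeline config is not shown.
# Minimal sketch: append newline-delimited JSON records for Logstash to tail.
# Path and fields are placeholders.
import json
from datetime import datetime, timezone

def append_result(record: dict, path: str = "/var/log/myapp/results.log") -> None:
    record = {**record, "@timestamp": datetime.now(timezone.utc).isoformat()}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

append_result({"device": "pump-3", "status": "ok", "temp": 71.4})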

API endpoint for data from Microsoft Exchange Online Protection?

I am working on a project where I have been using Python to make API calls to our organization's various technologies to get data, which I then push to Power BI to track metrics over time relating to IT Security.
My boss wants to see info added from Exchange Online Protection such as malware detected in emails, spam blocks etc., essentially replicating some of the email and collaboration reports you'd see in M365 defender > reports > email and collaboration (security.microsoft.com/emailandcollabreport).
I have tried the Defender API and MS Graph API, read through a ton of documentation, and can't seem to find anywhere to pull this info from. Has anyone done something similar, or know where this data can be pulled from?
Thanks in advance.
You can try the Microsoft Graph Security API, which lets you retrieve alerts, information protection and secure score data. You can also refer to the alerts section in the documentation, which lists the providers currently supported by the Microsoft Graph Security API.
In case anyone else runs into this, this is the solution I ended up using (hacky as it may be):
The only way to extract the pertinent info seems to be through PowerShell; you need the ExchangeOnlineManagement and PSWSMan modules, so those will need to be installed.
You need to register an app in your Azure tenant with at least the Global Reader role (or something custom), and generate and upload a self-signed certificate to the app.
I then ran the following lines as a ps1 script:
Connect-ExchangeOnline -CertificateFilePath "<PATH>" -AppID "<APPID>" -Organization "<ORG>.onmicrosoft.com" -CertificatePassword (ConvertTo-SecureString -String '<PASSWORD>' -AsPlainText -Force)
$dte = (Get-Date).AddDays(-30)
Get-MailflowStatusReport -StartDate $dte -EndDate (Get-Date)
Disconnect-ExchangeOnline
I used Python to call the PowerShell script, then extracted the info I needed from the output and pushed it to Power BI.
I'm sure there is a more secure and efficient way to do this, but I was able to accomplish the task this way.
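For anyone wanting the glue code, here is a minimal sketch of the "call the PowerShell script from Python" step. The script file name, the use of pwsh (PowerShell 7+, needed for PSWSMan on Linux), and the output handling are assumptions; the real parsing depends on the Get-MailflowStatusReport output format.
# Minimal sketch: run the PowerShell script above from Python and capture output.
# Script name and "pwsh" executable are assumptions.
import subprocess

result = subprocess.run(
    ["pwsh", "-NoProfile", "-File", "get_mailflow_report.ps1"],
    capture_output=True,
    text=True,
    check=True,
)

# Keep only non-empty lines of the report for further processing / push to Power BI.
lines = [line for line in result.stdout.splitlines() if line.strip()]
for line in lines:
    print(line)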

how to debug long running python commands in Azure Databricks notebook?

I am following this tutorial: https://learn.microsoft.com/en-us/academic-services/graph/tutorial-azure-databricks-hindex
I have obtained access to the Microsoft Academic Graph data set and want to run some basic PySpark code against it, exactly as in the tutorial.
For example, this code:
# Get affiliations
Affiliations = MAG.getDataframe('Affiliations')
Affiliations = Affiliations.select(Affiliations.AffiliationId, Affiliations.DisplayName)
Affiliations.show(3)
When I run the code with Shift+Enter, it goes into a 'Running command' state and never seems to finish, even after half an hour. I have attached a screenshot of this to my post.
I have run these commands individually, and it's the last one (Affiliations.show(3)) that causes the slowness.
For example, when I run the command (Affiliations = MAG.getDataframe('Affiliations')) by itself, I actually get a result:
AffiliationId:long
Rank:integer
NormalizedName:string
DisplayName:string
GridId:string
OfficialPage:string
WikiPage:string
PaperCount:long
CitationCount:long
Latitude:float
Longitude:float
CreatedDate:date
Question: how can I debug this to find out what's causing the slowness?
Debugging a distributed application is still challenging in the notebook environment. Even though the web UI has the necessary information, there is a gap between web UIs and the development environment: it’s usually difficult to locate information in the web UI that is relevant to the code you are investigating; and there is no easy way to find historical runtime information.
Understanding how to debug with the Databricks Spark UI:
The Spark UI contains a wealth of information you can use for debugging your Spark jobs. There are a bunch of great visualizations, and there is a blog post about those features.
For more details, click the View (Stages) link next to the relevant job under the cell:
Reference: Tips to Debug Apache Spark UI with Databricks
Hope this helps.
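As a concrete starting point, here is a minimal PySpark sketch, assuming the same Affiliations DataFrame from the tutorial. Note that getDataframe and select are lazy, which is why the earlier cells return only the schema instantly; show(3) is the first action that actually reads data. Labelling the job and printing the plan makes the slow job easier to locate in the Spark UI.
# Minimal sketch: label the job and inspect the plan so the slow step is
# easy to find in the Spark UI. Assumes the tutorial's Affiliations DataFrame.
spark.sparkContext.setJobDescription("Affiliations sample - debug")

# Print the physical plan: look at the input source and any wide shuffles.
Affiliations.explain(True)

# How many partitions (and therefore tasks) will the read produce?
print(Affiliations.rdd.getNumPartitions())

# Trigger a small action; watch this job's stages/tasks in the Spark UI.
Affiliations.show(3)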

503 error when downloading data from imdb api

I am trying to download the plot for almost 25,000 movies using the IMDbPY module for Python. To speed things up, I'm using the Pool class from the multiprocessing module. However, after almost 100 requests a 503 error occurs with the following message: Service Temporarily Unavailable. After 10-15 minutes I can process again, but after approximately 20 requests the same error occurs again.
I am aware that it might simply be the API blocking excessive calls; however, I can't find any info on the web about the maximum number of requests per time unit.
Do you have any idea how to make so many calls without being shut down? Also, do you know where I can find the documentation for the IMDb API?
Best
Please, don't do it.
Scraping is forbidden by IMDb's terms of service, and IMDbPY was never intended to be used to mass-scrape the web site: in fact it's explicitly designed to fetch a single movie at a time.
In theory IMDbPY can manage the plain text data files they distribute, but unfortunately they recently changed both the format and the content of the data.
IMDb has no APIs that I know of; if you have to manage such a huge portion of their data, you have to get a licence.
Please consider using http://www.omdbapi.com/ instead.
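If you do switch to OMDb, a minimal sketch might look like the following. The API key, the titles and the one-second delay are placeholders; check OMDb's own limits for your key tier.
# Minimal sketch: fetch plots from OMDb instead of scraping IMDb.
# API key, titles and delay are placeholders.
import time
import requests

API_KEY = "<YOUR_OMDB_KEY>"

def fetch_plot(title):
    resp = requests.get(
        "http://www.omdbapi.com/",
        params={"apikey": API_KEY, "t": title, "plot": "full"},
        timeout=10,
    )
    data = resp.json()
    return data.get("Plot") if data.get("Response") == "True" else None

for title in ["The Matrix", "Blade Runner"]:
    print(title, "->", fetch_plot(title))
    time.sleep(1)  # be polite: throttle requests instead of hammering the API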

IBM Bluemix IoT Watson service Driver Behaviour

I want to explore and use the Driver Behavior service in my application. Unfortunately I got stuck, as I'm getting an empty response from the getAnalyzedTripSummary API instead of a trip UUID.
Here are the steps I've followed.
I've added the Driver Behavior and Context Mapping services to my application on Bluemix.
Pushed multiple sample data packets to Driver Behavior using the "sendCarProbe" API.
Sent a job request using the "sendJobRequest" API with from and to dates as POST data.
Tried the "getJobInfo" API, which returns the job status ("job_status": "SUCCEEDED").
Tried "getAnalyzedTripSummaryList" to get the trip_uuid, but the result is empty ([]).
Could someone help me understand what's wrong and why I'm getting an empty response?
I think your procedure is OK.
There are the following possible reasons for not getting a valid analysis result.
(1) The current Driving Behavior analysis requires at least 10 valid GPS points within a trip (trip_id) for a vehicle. Please check the data you are sending via the "sendCarProbe" API.
(2) Please check that the "sendJobRequest" API's from and to dates (yyyy-mm-dd) really match your car probe timestamps.
