I am following this tutorial: https://learn.microsoft.com/en-us/academic-services/graph/tutorial-azure-databricks-hindex
I have obtained access to the Microsoft Academic Graph data set and want to run some basic PySpark code against it, exactly as shown in the tutorial.
For example, this code:
# Get affiliations
Affiliations = MAG.getDataframe('Affiliations')
Affiliations = Affiliations.select(Affiliations.AffiliationId, Affiliations.DisplayName)
Affiliations.show(3)
When I run the code with Shift + Enter, the cell goes into a 'Running command' state and never seems to finish, even after half an hour. I have attached a screenshot of this to my post.
I have run these commands individually, and it's the last one (Affiliations.show(3)) that causes the slowness.
For example, when I run the command Affiliations = MAG.getDataframe('Affiliations') by itself, I do get a result (the DataFrame's schema):
AffiliationId:long
Rank:integer
NormalizedName:string
DisplayName:string
GridId:string
OfficialPage:string
WikiPage:string
PaperCount:long
CitationCount:long
Latitude:float
Longitude:float
CreatedDate:date
Question: how can I debug this to find out what's causing the slowness?
Debugging a distributed application is still challenging in the notebook environment. Even though the web UI has the necessary information, there is a gap between web UIs and the development environment: it’s usually difficult to locate information in the web UI that is relevant to the code you are investigating; and there is no easy way to find historical runtime information.
Understanding how to debug with the Databricks Spark UI:
The Spark UI contains a wealth of information you can use for debugging your Spark jobs. There are a number of great visualizations, and there is a Databricks blog post covering those features.
For more details, click into the view for a specific job to see its stages.
Reference: Tips to Debug Apache Spark UI with Databricks
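Before opening the UI, a few quick checks from the notebook itself can narrow things down. This is a minimal sketch (not part of the MAG tutorial) using only the Affiliations DataFrame already defined above:

# Quick checks to run in a notebook cell before digging into the Spark UI.
# Inspect the physical plan; a simple scan-and-project should not be expensive.
Affiliations.explain()

# See how many partitions (and therefore tasks) show() will schedule;
# a very large number of small files can make even show(3) feel slow.
print(Affiliations.rdd.getNumPartitions())

# Try an explicitly bounded action for comparison.
print(Affiliations.limit(3).collect())

If explain() shows a plain scan and the partition count is very large, the time is likely going into listing and reading many small files rather than into the query itself.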
Hope this helps.
Related
I've created an Azure Function App with Python and have published an app that runs every 5 minutes. I used to go to Function > Monitoring to see the last 30 days of runs. I checked today and all the logs have disappeared, and the function does not display any runs in the Overview.
The last time I checked before this happened I had loads of logs in here, but now I have none. I know the function is running because if I go into Application Insights Live Monitoring I can see the traces and can also check that the results are being processed. I haven't changed anything in the script and I'm not sure why this is happening. Has anyone experienced this and found a fix?
EDIT
I've recreated the Function App and noticed that it creates a DefaultResourceGroup-XXX resource group with a Default Workspace in it, which I remember deleting when I first created the Function App. I've left it in place this time and now I see the logs in Monitoring, but I cannot see any connection to the Function App itself. Does anyone know how this workspace relates to the logs, and is there a way I can create a more user-friendly workspace name and link it to the app?
Thank you sheldonzy. Posting your suggestions as an answer to help other community members.
On your Function App, go to Monitor. If Application Insights is enabled, you will see a Run Query in Application Insights option.
Open Run Query and check the exception tables in Application Insights.
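If you prefer to check the same thing outside the portal, here is a rough sketch using the azure-monitor-query SDK. It assumes a workspace-based Application Insights resource; the workspace ID and the query below are placeholders:

# Hypothetical sketch: query the Application Insights exception table
# programmatically instead of through the portal's Run Query button.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
AppExceptions
| where TimeGenerated > ago(30d)
| project TimeGenerated, ProblemId, OuterMessage
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=30),
)

for table in response.tables:
    for row in table.rows:
        print(row)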
I have a python script on my local machine that reads a CSV file and outputs some metrics. The end goal is to create a web interface where the user uploads the CSV file and the metrics are displayed, while all being hosted on Azure.
I want to use a VM on Azure to run this python script.
The script takes the CSV file and outputs metrics which are stored in CosmosDB.
A web interface reads from this DB and displays graphs from the data generated by the script.
Can someone elaborate on the steps I need to follow to achieve this? Detailed steps are not strictly required; a brief overview with links to relevant learning resources would be helpful.
There's an article that lists the primary options for hosting sites in Azure: https://learn.microsoft.com/en-us/azure/developer/python/quickstarts-app-hosting
As Sadiq mentioned, Functions is probably your best choice: it will likely be less expensive, require less maintenance, and can handle both the script and the web interface. Here is a Python tutorial for that method: https://learn.microsoft.com/en-us/azure/developer/python/tutorial-vs-code-serverless-python-01
Option 2 would be to run a traditional website on an App Service plan, with background tasks handled either by Functions or a WebJob; they both use the WebJobs SDK, so the code is very similar: https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-app-service/
VMs are an option if neither of those works, but they come with significantly more administration. This learning path has info on how to do that; the website in it is built on the MEAN stack, but the approach applies to Python as well: https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-virtual-machines/
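If you go the Functions route, a very rough sketch of the upload-and-store piece might look like the following. This is an assumption-laden illustration, not a recipe: the metric computed and the Cosmos DB output binding (named 'doc', configured in function.json, not shown) are placeholders.

# HTTP-triggered Azure Function (Python) that accepts a CSV upload, computes a
# placeholder metric, and writes it to Cosmos DB via an output binding.
import csv
import io
import json

import azure.functions as func


def main(req: func.HttpRequest, doc: func.Out[func.Document]) -> func.HttpResponse:
    # Assume the CSV arrives as the raw request body.
    rows = list(csv.DictReader(io.StringIO(req.get_body().decode("utf-8"))))

    # Placeholder metric; replace with whatever your script actually computes.
    metrics = {"id": "latest", "row_count": len(rows)}

    # Persist the metrics document so the web interface can read it back.
    doc.set(func.Document.from_dict(metrics))

    return func.HttpResponse(json.dumps(metrics), mimetype="application/json")

The web interface can then read the same Cosmos DB container (for example with an input binding or the azure-cosmos SDK) and render the graphs.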
I'm following the documentation from the link below.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_gbq.html#pandas.DataFrame.to_gbq
Everything is set up perfectly fine in my data frame. Now I'm trying to export it to GBQ. Here's a one-liner that should pretty much work...but it doesn't...
pandas_gbq.to_gbq(my_df, 'table_in_gbq', 'my_project_id', chunksize=None, reauth=False, if_exists='append', private_key=False, auth_local_webserver=True, table_schema=None, location=None, progress_bar=True, verbose=None)
I'm having a lot of trouble getting Cloud Scheduler to run the job successfully. The scheduler runs, and I see a message saying the 'Result' was a 'Success', but no data is actually loaded into BigQuery. When I run the job on the client side, everything is fine. When I run it from the server, no data gets loaded. I'm guessing the credentials are throwing it off, but I can't tell for sure what's going on. All I know for sure is that Google says the job runs successfully, but no data is loaded into my table.
My question is, how can I modify this to run using Google Cloud Scheduler, with minimal security, so I can rule out some kind of security issue? Or, otherwise determine exactly what's going on here?
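One way to rule the credentials in or out is to stop relying on local webserver auth and pass an explicit service account to pandas-gbq. This is a sketch under assumptions: the key-file path and the dataset/table names are placeholders, and it uses the credentials argument available in recent pandas-gbq releases (which replaced private_key):

# Sketch: load explicit service-account credentials and pass them to to_gbq,
# so the behaviour is the same on the client and on the server/scheduler side.
import pandas_gbq
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"  # placeholder path
)

pandas_gbq.to_gbq(
    my_df,
    "my_dataset.table_in_gbq",  # destination must be qualified as dataset.table
    project_id="my_project_id",
    credentials=credentials,
    if_exists="append",
)

If this version loads data when run from the server, the original failure was most likely an authentication or permission issue for the identity the scheduled job runs as.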
I have been working on a website for over a year now, using Django and Python 3 primarily. A few of my buddies and I built a front end where a user enters some parameters and submits; this goes to GAE to run the job and return the results.
In my local dev environment, everything works well. I have two separate dev environments. One builds the entire service up in a Docker container; this produces the desired results in roughly 11 seconds. The other environment runs the source files locally on my computer and connects to the Postgres database hosted in Google Cloud, so the Python application runs locally. It takes roughly 2 minutes to run that way, with a lot of latency between the cloud and the posts/gets from my local machine.
Once I run gcloud app deploy and attempt to run it in production, it never finishes. I have some print statements built into the code, so I know it gets to the part where the submitted parameters are passed to the Python code. I monitor via this command on my local computer: gcloud app logs read.
I suspect that since my local computer is a beast (i7-7770 processor with 64 GB of RAM), it runs the whole thing no problem. But in the GAE, I don't think it's providing the proper machines to do the job efficiently (not enough compute, not enough RAM). That's my guess.
So, I need help in how to troubleshoot this. I tried changing my app.yaml file so that resources would scale to 16 GB of memory, but it would never deploy. I received an error 13.
One other note, after it spins around trying to run the job for 60 minutes, the website crashes and displays this message:
502 Server Error
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
OK, so just in case anybody in the future is having a similar problem...the constant crashing of my Google App Engine workers was because of using Pandas DataFrames in the production environment. I don't know exactly what Pandas was doing, but I kept getting memory errors that would crash the site, and it didn't appear to be occurring in a single line of code. That is, it randomly happened somewhere in a Pandas DataFrame operation.
I am still using a Pandas DataFrame simply to read in the CSV file. I then use
import pandas as pd

df = pd.read_csv('my_file.csv')  # placeholder path for the CSV mentioned above
data_I_care_about = dict(zip(df.col1, df.col2))
# or
other_data = df.col3.values.tolist()
and then go to town with processing. As a note, on my local machine (my development environment, basically) it took 6 seconds to run from start to finish. That's a long time for a web request, but I was in a hurry, which is why I used Pandas to begin with.
After refactoring, the same job completed in roughly 200ms using python lists and dicts (again, in my dev environment). The website is up and running very smoothly now. It takes a maximum of 7 seconds after pressing "Submit" for the back-end to return the data sets and render on the web page. Thanks for the help peeps!
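For anyone curious what the refactor roughly looked like, here is a sketch of one way to build the same dict/list structures without pandas at all (above I keep pandas just for the read); the file name and column names are placeholders, not the actual ones from my project:

# Build the dict/list structures directly while streaming the CSV,
# instead of materialising a pandas DataFrame first.
import csv

data_I_care_about = {}
other_data = []

with open("my_file.csv", newline="") as f:
    for row in csv.DictReader(f):
        data_I_care_about[row["col1"]] = row["col2"]
        other_data.append(row["col3"])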
I have a Python job which uses Beautiful Soup to scrape data from the web. I have tried executing the script using U-SQL; however, I keep receiving a generic error message:
An unhandled exception from user code has been reported
I haven't explored the error too much, as I am not sure whether it is even possible to scrape the web through U-SQL.
Is this possible using U-SQL, and if not, which Azure resource can I use to schedule this script and store the results in Azure Data Lake Store?
It would also normally be helpful if you provided the complete error code and exactly how you want to scrape the web.
I make the random assumption right now that you wrote some code that accesses web pages and tried to run it from within U-SQL. If that is correct, you will be blocked because the U-SQL container blocks all external network access. For more details on why that is done, see the previous answer here.
Hi, I'm a PM on the Azure Data Lake team and I'd love to help out with this. I just need some clarification first about what you're trying to do. Could you reach out to me at mabasile(at)microsoft.com with the job ID of the failed job? (Any sensitive information can of course be scrubbed out.) That'll be the best way to figure out exactly what you're trying to do and whether it's possible on ADL.
Thanks, and I hope to hear from you soon!
Matt Basile
Azure Data Lake Analytics
Update: Confirming Michael Rys's answer: you cannot call external services through U-SQL, because if ADLA scales out to hundreds of vertices and each vertex makes a separate call, you could end up DDoSing the service, so ADLA blocks external calls.
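For completeness, one common pattern once the scraping is moved out of U-SQL is to run the Beautiful Soup script elsewhere (a timer-triggered Azure Function, a scheduled VM job, or an ADF custom activity) and land the results in the lake for ADLA to pick up. The sketch below assumes an ADLS Gen2 account; the URL, storage account name, file system, and selectors are all placeholders:

# Hypothetical scrape-then-land sketch: run outside U-SQL, write results to the
# data lake so downstream U-SQL/ADLA jobs can read them.
import requests
from bs4 import BeautifulSoup
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Scrape the page and pull out whatever data you need.
html = requests.get("https://example.com/page-to-scrape", timeout=30).text
titles = [h.get_text(strip=True) for h in BeautifulSoup(html, "html.parser").find_all("h2")]

# Write the result into the lake.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
file_client = service.get_file_system_client("scraped").get_file_client("titles.csv")
file_client.upload_data("\n".join(titles), overwrite=True)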