Apologies in advance as this question is quite broad. My overall goal is to create a web application that allows for interactive analysis of NOAA model output data. For now, let's say the project will consist of two different dashboards: The first dashboard will take current model output data from NOAA based on user inputs (e.g., location) and display the data using interactive plotly plots. The second dashboard will display historical data from archived NOAA data.
My background is more in chemistry and science, so while I have a good understanding of data analysis packages like Pandas, I'm much newer to web dev tools such as Django. I was originally able to create my two dashboards using Plotly Dash. I saved some sample data to my local computer and used Pandas to select subsets of data that were then plotted with plotly. However, I'd like to scale this to a fully-functioning webpage and therefore need to migrate this project to a Django framework. This is where I need help.
Basically I'm trying to determine how the big-picture flow of this project would need to look. For example, below I give the workflow for a specific example in which a user selects a location and three plots are loaded for temperature, wind speed, and pressure. Can you please help me flesh this out or correct any mistakes I'm making?
The user inserts a certain location using a form.
A timeseries of weather data for the selected location is loaded from a database into a Django model* and includes wind speed, temperature, and pressure, as well as any other data the observational instrument collected.
*NOTE: I am really struggling with this aspect of the workflow. The data exists as CSVs on NOAA's webpages. So should I query these CSVs each time a user inputs a location, or should I populate my own SQL database from these CSVs in advance? Also, is it even possible for a Django model to be a timeseries of data? In all the tutorials I've seen, model attributes are given as single data points.
The data is then loaded into Pandas and processed (for example, computing monthly averages or converting to different units).
The three plots are then created from the Pandas dataframes using plotly.
The plots are displayed on the webpage.
As I mentioned above, I am struggling with how this process looks using Django. I think specifically my confusion arises in the use of Django Models. Once the data gets to Pandas, I'm pretty much set, but it's Step 2 and the workflow for querying the data and loading it into a model that's confusing me.
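To make the confusing part concrete, here is a minimal sketch of what I imagine the model and view might look like if each database row holds a single observation (so the "timeseries" is just a queryset of rows filtered by location). All field, file, and template names here are my own guesses, not working code from my project:

# models.py -- one row per observation; a timeseries is just a queryset of rows
from django.db import models

class Observation(models.Model):
    location = models.CharField(max_length=100)
    timestamp = models.DateTimeField()
    temperature = models.FloatField(null=True)
    wind_speed = models.FloatField(null=True)
    pressure = models.FloatField(null=True)

# views.py -- pull the queryset into pandas, then build a plotly figure
import pandas as pd
import plotly.express as px
from django.shortcuts import render

def dashboard(request):
    location = request.GET.get("location")
    rows = Observation.objects.filter(location=location).values(
        "timestamp", "temperature", "wind_speed", "pressure")
    df = pd.DataFrame(list(rows))
    monthly = df.set_index("timestamp").resample("M").mean().reset_index()
    fig = px.line(monthly, x="timestamp", y="temperature")
    return render(request, "dashboard.html",
                  {"plot": fig.to_html(full_html=False)})

If that is roughly right, then my question reduces to how the Observation table gets populated from NOAA's CSVs in the first place (a scheduled import in advance versus fetching on demand).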
Again, sorry for how broad this question is. Any advice at all is greatly appreciated.
Related
I am currently trying to build an adjustment framework for a forecasting tool.
Essentially, I have an ML tool that you upload financial data to, and it creates a forecast. I am adding a feature where the user can manually adjust some of the forecast (e.g., add $1M to FY22Q3), but I am stuck on how to make the user inputs dynamic based on the categories in their data. Right now, the inputs are hard-coded to the four categories in my example workbook data. I want to be able to read the dataframe and generate the inputs from the names of the column headers. Any help would be appreciated. Thanks!
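For reference, this is roughly the behavior I'm after, sketched with made-up file and column names (I'm not tied to this structure):

import pandas as pd

# read the uploaded workbook; "date" is assumed to be the only non-category column
df = pd.read_excel("uploaded_forecast.xlsx")
categories = [col for col in df.columns if col != "date"]

# build one adjustment input per category instead of hard-coding four of them
adjustments = {cat: 0.0 for cat in categories}
adjustments[categories[0]] = 1_000_000   # e.g. the user adds $1M to one column

adjusted = df.copy()
for cat, delta in adjustments.items():
    adjusted[cat] = adjusted[cat] + delta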
I am working on a project that tracks my Spotify play history which I’ll use to generate personalized playlists.
I am saving my play history to a local dataframe (that I append to every week) as a pickle file. I also have a second dataframe that contains specific track features, also pickled locally.
I’m wondering 1) if there is a better way of structuring my data and 2) if I should be using other ways of saving my data.
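For context, the weekly update currently looks roughly like this (file names are placeholders and the new-plays dataframe stands in for whatever the Spotify API returns):

import pandas as pd

# weekly update: load the accumulated history and append this week's plays
history = pd.read_pickle("play_history.pkl")
new_plays = pd.DataFrame({"track_id": ["abc123"],
                          "played_at": ["2024-01-01T10:00:00Z"]})  # stand-in for API data
history = pd.concat([history, new_plays], ignore_index=True)
history.to_pickle("play_history.pkl")

# a second, separately pickled dataframe holds per-track features
features = pd.read_pickle("track_features.pkl")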
I currently have a project where I extract the data from a Firebird database and do the ETL process with Knime, then the CSV files are imported into PowerBI, where I create table relationships and develop the measures.
With Knime I summarize and denormalize several tables.
I would like to migrate completely to Python; I am currently learning Pandas.
I would like to know how to handle relational modeling in Python, a star schema for example.
In PowerBI there is a section dedicated to this where I establish relationships and indicate whether they are uni- or bidirectional.
The only thing I can think of so far is to perform joins in Pandas in every situation/function where they are needed, but it seems to me that there must be a better way.
I would be grateful if you could point me to what I should learn to approach this.
I think I can answer your question now that I have a better understanding of what you're trying to do in Python. My stack for reporting also involves Python for ETL operations and Power BI for the front end, so this is how I approach it even if there may be other ways that I'm not aware of.
While I create actual connections in Power BI for the data model I am using, you don't actually need to tell Python anything in advance. Power BI is declarative: you build the visualizations by specifying what information you want related, and Power BI performs the required operations on the backend to get that data. However, it needs some minimal information in order to do this, which is why you describe the way you want the data modeled to Power BI.
Python, in contrast, is imperative. Instead of telling it what you want at the end, you tell it what instructions you want it to perform. This means that you have to give all of the instructions yourself and that you need to know the data model.
So, the simple answer is that you don't need to deal with relational modeling. The more complicated and correct answer is that you need to plan your ETL tools around a logical data model. The logical data model doesn't really exist in one physical space like how Power BI stores what you tell it. It basically comes down to you knowing how the tables are supposed to relate and ensuring that the data stored within them allows those transformations to take place.
When the time comes to join tables in Python, perform join operations as needed, using the appropriate functions (e.g., merge()) in combination with the logical data model you have in your head (or written down).
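For instance, a star-schema style query in pandas might look like this (all table and column names are invented for the example):

import pandas as pd

# illustrative star schema: one fact table plus two dimension tables
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id": [10, 10, 20],
    "amount": [5.0, 3.5, 7.25],
})
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "product_name": ["Widget", "Gadget"]})
dim_store = pd.DataFrame({"store_id": [10, 20],
                          "city": ["Lisbon", "Porto"]})

# denormalize by joining the fact table to its dimensions
report = (fact_sales
          .merge(dim_product, on="product_id", how="left")
          .merge(dim_store, on="store_id", how="left"))

# then aggregate the denormalized result, e.g. sales by city
print(report.groupby("city")["amount"].sum())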
The link I'm including here is a good place to start researching/learning how to think about data modeling at the more conceptual level you will need:
https://www.guru99.com/data-modelling-conceptual-logical.html
How do I handle frequent changes to a dataset in Azure Machine Learning Studio? My dataset may change over time; I need to add more rows to it. How can I refresh the dataset I currently use to train the model with the newly updated data? I need to do this programmatically (in C# or Python) instead of manually in the studio.
When registering an AzureML Dataset, no data is moved; only information such as where the data lives and how it should be loaded is stored. The purpose is to make accessing the data as simple as calling dataset = Dataset.get(name="my dataset").
In the snippet below (full example), if I register the dataset, I could technically overwrite weather/2018/11.csv with a new version after registering, and my Dataset definition would stay the same, but the new data would be available if you use it in training after overwriting.
from azureml.core import Dataset, Workspace

# workspace and datastore setup (assumed)
datastore = Workspace.from_config().get_default_datastore()

# create a TabularDataset from 3 paths in the datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
However, there are two more recommended approaches (my team does both):
Isolate your data and register a new version of the Dataset, so that you can always roll back to a previous version. See Dataset Versioning Best Practice.
Use a wildcard/glob datapath to refer to a folder that has new data loaded into it on a regular basis. In this way you can have a Dataset that grows over time without having to re-register.
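Here is a rough sketch of both approaches with the azureml-core SDK (the dataset name "weather" and the folder layout are placeholders):

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# approach 1: register a new version whenever the data changes,
# so older versions remain retrievable
new_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'weather/2019/*.csv')])
new_ds.register(workspace=ws, name="weather", create_new_version=True)

# approach 2: point the Dataset at a folder that keeps receiving new files,
# using a wildcard so they are picked up without re-registering
growing_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'weather/latest/*.csv')])

# later, load whichever registered version you need
latest = Dataset.get_by_name(ws, name="weather")             # latest version
pinned = Dataset.get_by_name(ws, name="weather", version=1)  # a pinned version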
Does that work for you?
You can manipulate the dataset object; see https://stackoverflow.com/a/60639631/12925558
I have extracted 6 months of email metadata and saved it as a csv file. The csv now only contains two columns (from and to email addresses). I want to build a graph where the vertices are the people I communicated with and who communicated with me, the edges represent a communication link, and each edge is labeled with how many communications we exchanged. What is the best approach for going about this?
One option is to use Linked Data principles (although this is not advisable if you are short on time and don't have a background in Linked Data). Here's a possible approach:
Represent each entity as a URI
Use an existing ontology (such as foaf) to describe the data
Transform the data into the Resource Description Framework (RDF)
Use an RDF visualization tool.
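As a toy illustration of the first three steps using rdflib (the library choice and all names are mine, not part of the approach):

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

# two correspondents and one relationship, using made-up URIs
g = Graph()
EX = Namespace("http://example.org/people/")

alice, bob = EX["alice"], EX["bob"]
g.add((alice, RDF.type, FOAF.Person))
g.add((bob, RDF.type, FOAF.Person))
g.add((alice, FOAF.mbox, URIRef("mailto:alice@example.org")))
g.add((bob, FOAF.mbox, URIRef("mailto:bob@example.org")))
g.add((alice, FOAF.knows, bob))

# serialize the RDF graph, e.g. as Turtle, for a visualization tool
print(g.serialize(format="turtle"))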
Since RDF is inherently a graph, you will be able to visualize your data as well as extend it.
If you are unfamiliar with Linked Data, a simpler way to view the graphs is using Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). This approach is much simpler but lacks the benefits of semantic interoperability, assuming you care about those in the first place.
Cytoscape might be able to import your data in that format and build a network from it.
http://www.cytoscape.org/
Your question (while mentioning Python) does not say what part or how much you want to do with Python. I will assume Python is a tool you know but that the main goal is to get the data visualized. In that case:
1) Use the Gephi network analysis tool - there are tools that can use your CSV file as-is, and Gephi is one of them. In your case edge weights need to be preserved (the number of emails exchanged between two email addresses), which can be done using the "mixed" variation of Gephi's CSV format.
2) Another option is to pre-process your CSV file (e.g. using Python), calculate the edge weights (the number of e-mails between every two email addresses), and save the result in any format you like; a rough sketch of this pre-processing follows at the end of this answer. The result can be visualized in network analysis tools (such as Gephi) or directly in Python (e.g. using https://graph-tool.skewed.de).
Here's an example of an email network analysis project (though their graph does not show weights).
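Here is the sketch of option 2, assuming the CSV columns are literally named "from" and "to" and using networkx instead of graph-tool purely for brevity:

import pandas as pd
import networkx as nx

# count how many emails were exchanged between each pair of addresses
emails = pd.read_csv("email_metadata.csv")            # columns: "from", "to"
edges = (emails.groupby(["from", "to"])
               .size()
               .reset_index(name="weight"))

# build a weighted directed graph: vertices are addresses,
# edge weight is the number of messages sent between them
G = nx.from_pandas_edgelist(edges, source="from", target="to",
                            edge_attr="weight", create_using=nx.DiGraph)

# export a weighted edge list that Gephi (or another tool) can import
nx.write_weighted_edgelist(G, "email_graph.edgelist")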