How to manipulate Excel header names with Python

I am currently trying to build an adjustment framework for a forecasting tool.
Essentially, I have an ML tool that you upload financial data to, and it creates a forecast. I am adding a feature where the user can manually adjust some of the forecast (e.g., add $1M to FY22Q3), but I am stuck on how to make the user inputs dynamic based on the categories in their data. Right now the inputs are hard-coded against the 4 categories in my example workbook data. I want to be able to read the data frame and have the inputs be driven by the names of the column headers. Any help would be appreciated. Thanks!
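A minimal sketch of the dynamic-input idea, assuming the uploaded workbook has one column per category (the file name and layout below are hypothetical):

    import pandas as pd

    # Hypothetical workbook: one column per financial category, one row per fiscal quarter.
    df = pd.read_excel("example_workbook.xlsx", index_col=0)

    # Discover the categories from the column headers instead of hard-coding them.
    categories = list(df.columns)
    print("Detected categories:", categories)

    # Collect one adjustment per detected category (a UI form could replace input()).
    adjustments = {}
    for category in categories:
        raw = input(f"Adjustment for {category} (blank for none): ")
        adjustments[category] = float(raw) if raw.strip() else 0.0

    # Apply the adjustments to the forecast frame.
    for category, amount in adjustments.items():
        df[category] = df[category] + amount

Because nothing is tied to specific column names, the same loop works however many categories the user's workbook contains.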

Related

Is there a way to read a general ledger file into a Dataframe in python?

I have two Excel files which are in general ledger format, and I am trying to open them as dataframes so I can do some analysis, specifically look for duplicates. I tried opening them using pd.read_excel(r"Excelfile.xls") in pandas. The files are being read, but when I use df.head() I am getting NaNs for all the records and columns. Is there a way to load data in general ledger format into a data frame?
This is how the dataset looks in the Jupyter notebook, and this is what it looks like in Excel (screenshots attached).
I am new to Stack Overflow and haven't yet learnt how to upload part of a dataset, so I hope the images help describe my situation.
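General ledger exports usually put several rows of report titles above the real column names, which makes pandas read everything as NaN. A hedged sketch of the usual fix (the skiprows value and file name below are guesses; adjust them to match your export):

    import pandas as pd

    # General ledger exports often stack report titles above the real header row.
    # The skiprows value below is an assumption - tweak it until df.head() shows
    # the expected column names.
    df = pd.read_excel(
        r"Excelfile.xls",
        skiprows=5,   # skip the report title block (guess)
        header=0,     # first remaining row holds the column names
    )

    # Drop fully empty rows/columns left over from the report layout.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    print(df.head())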

What is the best strategy to save/load data (using python) that will be updated on a weekly basis over long periods of time?

I am working on a project that tracks my Spotify play history which I’ll use to generate personalized playlists.
I am saving my play history to a local dataframe (that I append to every week) as a pickle file. I also have a second dataframe that contains specific track features, also pickled locally.
I’m wondering 1) if there is a better way of structuring my data and 2) if I should be using other ways of saving my data.
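For what it's worth, a common pattern for a weekly append like this is to keep each table in a columnar format such as Parquet instead of pickle, since Parquet files remain readable across pandas versions and are easy to de-duplicate on reload. A rough sketch (the file and column names below are made up):

    import os
    import pandas as pd

    HISTORY_PATH = "play_history.parquet"   # hypothetical file name

    def append_week(new_plays: pd.DataFrame) -> pd.DataFrame:
        """Append this week's plays to the saved history and de-duplicate."""
        if os.path.exists(HISTORY_PATH):
            history = pd.concat([pd.read_parquet(HISTORY_PATH), new_plays],
                                ignore_index=True)
        else:
            history = new_plays
        # 'track_id' and 'played_at' are assumed column names.
        history = history.drop_duplicates(subset=["track_id", "played_at"])
        history.to_parquet(HISTORY_PATH, index=False)
        return history

The track-features table can be stored the same way, keyed by track id, so the two files stay small and independent.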

Developing a dashboard using Django?

Apologies in advance as this question is quite broad. My overall goal is to create a web application that allows for interactive analysis of NOAA model output data. For now, let's say the project will consist of two different dashboards: The first dashboard will take current output model data from NOAA based on user inputs (e.g., location) and display the data using interactive plotly plots. The second dashboard will be for displaying historical data from archived NOAA data.
My background is more in chemistry and science, so while I have a good understanding of data analysis packages like Pandas, I'm much newer to web dev tools such as Django. I was originally able to create my two dashboards using Plotly Dash. I saved some sample data to my local computer and used Pandas to select subsets of data that were then plotted with plotly. However, I'd like to scale this to a fully-functioning webpage and therefore need to migrate this project to a Django framework. This is where I need help.
Basically I'm trying to determine how the big-picture flow of this project would need to look. For example, below I give the workflow for a specific example in which a user selects a location and three plots are loaded for temperature, wind speed, and pressure. Can you please help me flesh this out or correct any mistakes I'm making?
1) The user inserts a certain location using a form.
2) A timeseries of weather data for the selected location is loaded from a database into a Django model* and includes wind speed, temperature, pressure, as well as any other data the observational instrument collected.
*NOTE: I am really struggling with this aspect of the workflow. The data exists as CSVs on NOAA's webpages. So should I query these CSVs each time a user inputs a location, or should I build my own SQL database from these CSVs in advance? Also, is it even possible for a Django model to hold a timeseries of data? In all the tutorials, model attributes are given as a single data point. (A rough sketch of one possibility is given after this question.)
3) The data is then loaded into Pandas and processed (for example, computing monthly averages or converting to different units).
4) The three plots are then created from the Pandas dataframes using plotly.
5) The plots are displayed on the webpage.
As I mentioned above, I am struggling with how this process looks using Django. I think specifically my confusion arises in the use of Django Models. Once the data gets to Pandas, I'm pretty much set, but it's Step 2 and the workflow for querying the data and loading it into a model that's confusing me.
Again, sorry for how broad this question is. Any advice at all is greatly appreciated.
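A minimal sketch of what step 2 could look like, assuming the NOAA CSVs are loaded into your own database ahead of time rather than fetched per request. A Django model can represent a timeseries: each row is one observation, and the timeseries is simply the queryset filtered by location. All model, field, and function names below are assumptions for illustration:

    # models.py (hypothetical names)
    from django.db import models

    class WeatherObservation(models.Model):
        location = models.CharField(max_length=100)
        observed_at = models.DateTimeField()
        temperature = models.FloatField()
        wind_speed = models.FloatField()
        pressure = models.FloatField()

        class Meta:
            indexes = [models.Index(fields=["location", "observed_at"])]

    # elsewhere (e.g. a view): turn one location's rows into a DataFrame for plotting
    import pandas as pd

    def load_timeseries(location: str) -> pd.DataFrame:
        qs = WeatherObservation.objects.filter(location=location).values(
            "observed_at", "temperature", "wind_speed", "pressure"
        )
        return pd.DataFrame.from_records(qs)

A one-off management command or scheduled task could read the NOAA CSVs with pandas and bulk-insert them into this table, so each user request only hits your own database.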

Python graphing from csv

I have extracted 6 months of email metadata and saved it as a csv file. The csv now contains only two columns (from and to email addresses). I want to build a graph where the vertices are the people I communicated with (in either direction) and the edges are the communication links, labelled by how many communications I had with each person. What is the best approach for going about this?
One approach is to use Linked Data principles (although not advisable if you are short on time and don't have a background in Linked Data). Here's a possible approach:
Depict each entity as a URI
Use an existing ontology (such as foaf) to describe the data
Transform the data into Resource Description Framework (RDF)
Use an RDF visualization tool.
Since RDF is inherently a graph, you will be able to visualize your data as well as extend it.
If you are unfamiliar with Linked Data, a way to view the graphs is Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). This approach is much simpler but lacks the benefits of semantic interoperability, assuming you care about them in the first place.
Cytoscape might be able to import your data in that format and build a network from it.
http://www.cytoscape.org/
Your question (while mentioning Python) does not say what part or how much you want to do with Python. I will assume Python is a tool you know but that the main goal is to get the data visualized. In that case:
1) Use the Gephi network analysis tool - there are tools that can use your CSV file as-is, and Gephi is one of them. In your case the edge weights need to be preserved (the number of emails exchanged between two addresses), which can be done using the "mixed" variation of Gephi's CSV format.
2) Another option is to pre-process your CSV file (e.g. using Python), calculate the edge weights (the number of emails between every two addresses), and save the result in any format you like. The result can then be visualized in network analysis tools (such as Gephi) or directly in Python (e.g. using https://graph-tool.skewed.de).
Here's an example of an email network analysis project (though their graph does not show weights).
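For option 2, a minimal sketch of the pre-processing step with pandas (the column names "from" and "to" are assumptions about the CSV layout):

    import pandas as pd

    # Assumed layout: two columns, 'from' and 'to', one row per email.
    emails = pd.read_csv("email_metadata.csv")

    # Sort each pair so A->B and B->A count toward the same undirected edge.
    pairs = pd.DataFrame(
        [sorted(p) for p in zip(emails["from"], emails["to"])],
        columns=["source", "target"],
    )

    # Edge weight = number of emails exchanged between the two addresses.
    edges = pairs.groupby(["source", "target"]).size().reset_index(name="weight")
    edges.to_csv("email_edges.csv", index=False)

The resulting edge list (source, target, weight) can be loaded directly into Gephi, graph-tool, or networkx.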

Python Classification - data structure

I am trying to develop a classifier for documents. I am relatively new to Python, and I am trying to figure out the best/standard way of creating the storage structure. I am looking to feed the dataset to machine learning algorithms.
I am ingesting txt files, and I was thinking of having one column hold the entire document content and a second column hold the class (0 or 1 in my case). I initially tried creating a list of lists, such as [["the sky is blue", 1], ["the sky is grey", 1], ["the sky is red", 0]].
I was also trying to create a pandas Dataframe because I thought its structure may be more suitable for data manipulation.
I would go with the pandas DataFrame. Given that the goal is to build and train a classifier, you will need to extract/compute some features from the text of the files, and when you do, the ability to easily generate and add new columns to a DataFrame will come in handy.
However, it also depends on the size of the data you will be crunching. If you will have massive data, you should research different concepts and frameworks (for instance TensorFlow).
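A minimal sketch of that structure, assuming the txt files are split across two folders by class (the folder names below are made up):

    import pathlib
    import pandas as pd

    # Hypothetical layout: documents/positive/*.txt -> class 1, documents/negative/*.txt -> class 0
    rows = []
    for label, folder in [(1, "documents/positive"), (0, "documents/negative")]:
        for path in pathlib.Path(folder).glob("*.txt"):
            rows.append({"text": path.read_text(encoding="utf-8"), "label": label})

    df = pd.DataFrame(rows, columns=["text", "label"])
    print(df.head())

From here, feature extraction (e.g. scikit-learn's CountVectorizer on df["text"]) simply adds more columns or arrays alongside the label.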
