At work we need to build a report that forecasts the occupation (utilization) of our field service staff.
All the data is in SAP and is extracted as an Excel sheet, as in the example below:
We need to calculate how many Man Days (column C) we have per month (based on start date and end date), per work center (column G), and per activity type (column J). Monthly capacity is calculated as the number of working days in the month multiplied by the number of employees. The idea is to have a bar chart of the planned Man Days with a line representing the capacity.
It is possible to do this manually, but I'm trying to find a more practical way, because today it is an extremely manual job with copy and paste all over the place. Does anyone have any idea how to do this with Python or even Power BI?
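One possible approach in Python, as a minimal sketch only: spread each order's Man Days evenly over the weekdays between its start and end date, then sum per month, work center, and activity type. The column names, the sample rows, and the `headcount` dictionary are assumptions made for illustration (they are not the real SAP field names), and public holidays are ignored.

```python
import pandas as pd

# Hypothetical extract: column names and values are assumptions, not real SAP fields.
orders = pd.DataFrame({
    "work_center": ["WC1", "WC1", "WC2"],
    "activity_type": ["Install", "Repair", "Install"],
    "start": pd.to_datetime(["2024-01-29", "2024-02-05", "2024-02-12"]),
    "end": pd.to_datetime(["2024-02-02", "2024-02-09", "2024-02-16"]),
    "man_days": [10.0, 5.0, 5.0],
})
headcount = {"WC1": 3, "WC2": 2}  # assumed number of employees per work center


def spread_over_working_days(row):
    """Split one order's man days evenly over its Monday-to-Friday days."""
    # Assumes the date range contains at least one weekday.
    days = pd.bdate_range(row["start"], row["end"])
    return pd.DataFrame({
        "month": days.strftime("%Y-%m"),
        "work_center": row["work_center"],
        "activity_type": row["activity_type"],
        "man_days": row["man_days"] / len(days),
    })


def working_days(month):
    """Count Monday-to-Friday days in a month given as 'YYYY-MM' (no holidays)."""
    period = pd.Period(month, freq="M")
    return len(pd.bdate_range(period.start_time, period.end_time.normalize()))


# One row per order and working day, then aggregated to the reporting level.
daily = pd.concat(
    [spread_over_working_days(row) for _, row in orders.iterrows()],
    ignore_index=True,
)
monthly = daily.groupby(
    ["month", "work_center", "activity_type"], as_index=False
)["man_days"].sum()

# Planned load versus capacity (working days x employees) per work center.
load = monthly.groupby(["month", "work_center"], as_index=False)["man_days"].sum()
load["capacity"] = load["month"].map(working_days) * load["work_center"].map(headcount)
print(monthly)
print(load)
```

The `load` table has one row per month and work center, which is the shape a bar chart (`man_days`) with a capacity line (`capacity`) would need. The same table could also be exported and charted in Power BI.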
I am probably using poor search terms when trying to find the answer to this problem but I hope that I can explain by posting an image.
I have a weekly df (left table). For each week, I am trying to get the overall average across all cities, plus the average of certain observations selected by two lists (right table).
[Image: Excel representation of the dataframe]
Can anyone please help figure out how to do this?
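Since the image is not available, here is only a minimal sketch under assumptions: the dataframe is taken to be in long format with hypothetical columns `week`, `city`, and `value`, and the two lists are taken to contain city names.

```python
import pandas as pd

# Hypothetical weekly data in long format; all names and numbers are made up.
df = pd.DataFrame({
    "week": [1, 1, 1, 1, 2, 2, 2, 2],
    "city": ["A", "B", "C", "D"] * 2,
    "value": [10, 20, 30, 40, 20, 40, 60, 80],
})
list_1 = ["A", "B"]  # assumed first list of cities
list_2 = ["C", "D"]  # assumed second list of cities

# Each column is a per-week mean; the Series align on the shared "week" index.
summary = pd.DataFrame({
    "avg_all": df.groupby("week")["value"].mean(),
    "avg_list_1": df[df["city"].isin(list_1)].groupby("week")["value"].mean(),
    "avg_list_2": df[df["city"].isin(list_2)].groupby("week")["value"].mean(),
})
print(summary)
```

If the real data has one column per city instead, `df.melt` could be used first to bring it into this long shape.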
I'm trying to analyze the data of a food ordering application.
The data consists of both numerical and categorical variables. The main variable I'm studying is the total delivery time of an order, which represents the time from placing the order to closing it. I want to find out which variables affect it the most.
Example rows from the data:
| order id | branch id | date | time placed | day | period | items id | no. items | total no. items | total delivery time | total time in seconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 571 | 4 | 11 | 00:46:19 | 2805 |
| 113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 573 | 4 | 11 | 00:46:19 | 2805 |
I want to study the effects of all the variables on the total time, even items id and branch id. Does a certain item affect the time? Do the day and the period of the day affect it as well?
I used linear regression to get the correlation between total time and the numerical variables, and tried one-way ANOVA for some categorical variables, but I didn't like the results. Is there a way to analyze all variables together without encoding the categorical variables?
I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algos like Regression love numbers. ML algos like Classification love labels (non-numbers). You can certainly convert labeled data to 'numbered' data. One example is to code ['red','green','blue'] with [1,2,3], which would produce weird things: 'red' is lower than 'blue', and if you average a 'red' and a 'blue' you get a 'green'. A more subtle example is coding ['low', 'medium', 'high'] with [1,2,3]. In the latter case the ordering makes sense, but subtle inconsistencies can still appear when 'medium' is not exactly in the middle of 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, it isn't using large, medium, and small to do its analysis, it's converting those categories to numbers. I think. Maybe someone can confirm this for me.
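To make the coding problem concrete, here is a small pandas sketch. The color and level values are the ones from the paragraph above, and the variable names are just for illustration. It shows the integer coding, an ordered categorical for low/medium/high, and one-hot encoding as a common alternative that avoids implying an order.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "red"], name="color")

# Integer coding implies an order and a distance that colors do not have.
codes = colors.map({"red": 1, "green": 2, "blue": 3})
mean_red_blue = (1 + 3) / 2  # equals 2, the code for "green"

# For genuinely ordered labels, an ordered categorical keeps the order explicit.
levels = pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)
level_codes = list(levels.codes)  # positions within the declared category order

# One-hot encoding: one 0/1 column per category, no order implied.
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

One-hot columns still count as encoding, but they let a regression estimate a separate effect per category instead of forcing the categories onto a single numeric scale.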
Thus, I don't think it makes sense to try to measure any kind of relationship between IDs and specific outcomes, like 'totaltime', 'totaldays', etc. If you kick off a project on a Monday or a Friday, does the project end sooner or later than non-Monday-start or non-Friday-start projects? Well, maybe it does. But, is that correlation or causation? You can find correlations between all kinds of things, but these don't necessarily imply causation between these same things. Let's say you find a strong relationship between multiple projects that start on the second Monday of the month and all of these projects get finished off much faster than all other projects. This seems like pure coincidence, rather than causation. Or, there is some other factor impacting the outcome. Maybe projects that start on the second Monday of the month are typically small upgrades, rather than full-blown new undertakings, so the volume of work is less, and the project is done faster. However, starting the work on the second Monday of the month doesn't CAUSE the project to be finished off faster. Tell me if I am wrong. I'm always open to feedback.
I'm trying to figure out how to calculate the average cost of employees per country. I have tried using the mean() function, but a single mean() call is not enough to pull out the average cost per country. Do you have any tips for how to get it done? Is my approach too complicated as it stands right now, or do I need to do it step by step using, for example, groupby?
Feel free to ask for more information about this problem. I hope I have shown enough for you to understand the problem and help me with a solution.
Picture of the current CSV file I'm working on, after some cleaning. There are several countries in this dataframe, not just the ones you see.
A second picture, so you can see which other countries are in it.
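A minimal sketch of the groupby route, assuming the dataframe has one row per employee with a country column and a numeric cost column. The names `country` and `cost` and the sample values are placeholders, since the real column names are only visible in the pictures.

```python
import pandas as pd

# Hypothetical sample; replace with the cleaned dataframe read from the CSV file.
employees = pd.DataFrame({
    "country": ["Norway", "Norway", "Sweden", "Sweden", "Denmark"],
    "cost": [100.0, 200.0, 150.0, 250.0, 300.0],
})

# Group the rows by country, then average the cost within each group.
avg_cost = employees.groupby("country")["cost"].mean()
print(avg_cost)
```

Calling mean() on the whole column gives one overall number. Grouping first is what turns it into one average per country.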
I'm currently doing some analysis on the stats of my podcast, and I have merged my Spotify listening numbers with the ones from my RSS-feed in pandas. So far so good, I now have a dataframe with a column of "Total" which tells me how many listeners I had on each episode and what the average number of listeners is.
Now, what I want to do is see how many of my episodes fit into (at least) three categories: Good, Normal and Bad. So I need to divide my Totals into three ranges and then see how many of my episodes land within each of those ranges. I have some limited experience of messing around with Python and pandas, but it's been a while since I last sat down with it and I don't really know how to approach this problem.
Any help is highly appreciated!
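One way this could be done, sketched with made-up numbers: `pd.cut` assigns each Total to a range with fixed edges, and `pd.qcut` splits the episodes into equally sized groups instead. The thresholds 200 and 500 and the sample values are assumptions; only the "Total" column name comes from the description above.

```python
import pandas as pd

# Hypothetical listener totals per episode.
episodes = pd.DataFrame({
    "episode": [1, 2, 3, 4, 5, 6],
    "Total": [120, 340, 560, 80, 410, 300],
})
labels = ["Bad", "Normal", "Good"]

# Fixed thresholds (assumed): up to 200 is Bad, up to 500 is Normal, above is Good.
episodes["rating"] = pd.cut(
    episodes["Total"], bins=[0, 200, 500, float("inf")], labels=labels
)
counts = episodes["rating"].value_counts()

# Alternative: let the data decide the edges so each third holds equally many episodes.
episodes["tercile"] = pd.qcut(episodes["Total"], q=3, labels=labels)
tercile_counts = episodes["tercile"].value_counts()
print(counts)
print(tercile_counts)
```

Fixed edges answer "how many episodes beat my own target", while terciles always produce roughly equal groups, so the choice depends on which question matters more.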