Aggregating and calculating weekly differences in CSV reports with Pandas? - python

Overflow Data/CSV/Pandas peeps hivemind....I need your help!
I've only recently started using Python/Pandas and I found a really good project to possibly work on, that would save me a lot of time.
I do weekly reports and report on the differences in data week by week.
I dont know Pandas 100% but I dont think this would be that hard to do with code and I feel like this project would be a great way for me to learn.
Here is an example of the report I have:
Report Example
Now, I have a list of items from the items list (and gets concatenated in the item info column) that I'm to be reporting on:
I'm essentially trying to have code that can compute:
-IF the name (from my list) is found in the item info column AND the Week number(s) is a particular number AND the year(s) is 2022 THEN aggregate the total number of the POS/sales altogether as data A
&
-IF there is viable data there as well for Week 16 (compute the similar above info for that week as data B), then subtract the difference between these weeks (A and B) and output that data to me as information point C (aka the difference)
-THEN if that difference is positive, divide C by B (aka, give me the percentage of that move)
Tl:dr-I want to aggregate the total sales of an item for the week and subtract it from the corresponding amount for the previous week for the same item and verify the difference, as well as the percentage of movement in amounts.
I only know so much in Pandas right now, would anyone be able to point me in a direction that could help? I so feel like this shouldn't be that hard to do/I'd love to make it a weekend project and saves myself a good bit of time at work and learn how to automate some work tasks too. :)

Related

Occupation Report with Capacities

I've come across a situation at our work where we need to create a report to measure our field service staff occupation for the future (as a forecast).
We have all the data at SAP as the example below, extracted as an excel sheet:
We need to calculate how many Man Days (column C) we have per month (start date and end date) and per work center (column G), and also by activity type (column J). Per month we have a calculation where we have the number of working days in a month multiplied by the number of employees. The idea is to have a bar chart with a line representing the capacity.
It is possible to do this manually but i'm trying to find a more practical way because today it is an extremely manual job with copy and paste all over the place. Does anyone have any idea on how to do this with Python or even Power BI?

How to calculate the average of rows based on date

I am probably using poor search terms when trying to find the answer to this problem but I hope that I can explain by posting an image.
I have a weekly df (left table) and I am trying to get the total average across all cities within one week and the average of certain observations based on 2 lists (right table)
excel representation of the dataframe
Can anyone please help figure out how to do this?

how to analyze numerical and categorical variables at the same time?

I'm trying to analyze the data of a food ordering application,
the data consist of both numerical and categorical variables, the main variable I'm studying is the total delivery time of an order, which represent the time from placing the order to closing it, I want to study what are the variables the affects it the most.
an example of rows in the data is the following:
order id
branch id
date
time placed
day
period
items id
no. items
total no. items
total delivery time
total time in seconds
113113
31
2/2/2021
13:32:24
Tuesday
afternoon
571
4
11
00:46:19
2805
113113
31
2/2/2021
13:32:24
Tuesday
afternoon
573
4
11
00:46:19
2805
I want to study the effects of all the variables on the total time, even items id and branch id, does a certain item affect time? does the day and period of the day affect it as well?
I used linear regression to get the correlation between total time and the numerical variables, and tried one way anova for some categorical variables, but I didn't like the results, is there a way to analyze all variable together without encoding categorical variables?
I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algos like Regression, love numbers. ML algos like Classification love labels (non-numbers). You can certainly convert labeled data to 'numbered' data. One example is to code ['red','green','blue'] with [1,2,3], would produce weird things like 'red' is lower than 'blue', and if you average a 'red' and a 'blue' you will get a 'green'. Another more subtle example might happen when you code ['low', 'medium', 'high'] with [1,2,3]. In the latter case it might happen to have an ordering which makes sense, however, some subtle inconsistencies might happen when 'medium' in not in the middle of 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, it isn't using large, medium, and small to do it's analysis, it's converting those categories to numbers. I think. Maybe someone can confirm this for me.
Thus, I don't think it makes sense to try to measure any kind of relationship between IDs and specific outcomes, like 'totaltime', 'totaldays', etc. If you kick off a project on a Monday or a Friday, does the project end sooner or later than non-Monday-start or non-Friday-start projects? Well, maybe it does. But, is that correlation or causation? You can find correlations between all kinds of things, but these don't necessarily imply causation between these same things. Let's say you find a strong relationship between multiple projects that start on the second Monday of the month and all of these projects get finished off much faster than all other projects. This seems like pure coincidence, rather than causation. Or, there is some other factor impacting the outcome. Maybe projects that start on the second Monday of the month are typically small upgrades, rather than full-blown new undertakings, so the volume of work is less, and the project is done faster. However, starting the work on the second Monday of the month doesn't CAUSE the project to be finished off faster. Tell me if I am wrong. I'm always open to feedback.

Calculating mean values on a complicated dataframe

Im trying to figure out how to calculate the average cost of employees per country. I have tried using the mean() function, but the line of code is not complicated enough to pull out the average cost of employees per country. Do you guys have any tips for how to get it done? Is it to complicated as it stands right now, and do i need to do it step by step by using for an example, groupby?
Feel free to ask for more information surrounding this problem. I hope i have showcased enough for you to understand the problem and help me with a soloution.
Picture of the current csv file im working on, there has been some cleaning done to it. There are several countries inside this dataframe, not just the ones you see
Here so you can see what other country is inside

How to find out how many items in a dataframe is within a certain range?

I'm currently doing some analysis on the stats of my podcast, and I have merged my Spotify listening numbers with the ones from my RSS-feed in pandas. So far so good, I now have a dataframe with a column of "Total" which tells me how many listeners I had on each episode and what the average number of listeners is.
Now, what I want to do is to see how many of my episodes fit in to three categories (at least), Good, Normal and Bad. So I need to divide my Totals into three ranges and then see how many of my episodes land within each of those frames. I have some limited experience of messing around with Python and pandas, but it's been a while since I last sat down with it and I dont really know how to approach this problem.
Any help is highly appreciated!

Categories