Calculating mean values on a complicated dataframe - python

I'm trying to figure out how to calculate the average cost of employees per country. I have tried using the mean() function, but my line of code is too simple to pull out the average cost of employees per country. Do you have any tips for how to get it done? Is it too complicated as it stands right now, and do I need to do it step by step using, for example, groupby?
Feel free to ask for more information about this problem. I hope I have shown enough for you to understand the problem and help me with a solution.
Picture of the current CSV file I'm working on; some cleaning has been done to it. There are several countries in this dataframe, not just the ones you see.
Here you can see which other countries are in it.
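Since the question mentions groupby, here is a minimal sketch of a per-country mean. The real column names are only visible in the screenshot, so "Country" and "Cost" and the values below are placeholders, not the actual data.

```python
import pandas as pd

# Hypothetical data: "Country" and "Cost" are assumed column names,
# and the values are made up for illustration.
df = pd.DataFrame({
    "Country": ["Norway", "Norway", "Sweden", "Sweden", "Denmark"],
    "Cost": [100.0, 200.0, 150.0, 250.0, 300.0],
})

# Group the rows by country, then average the cost within each group.
avg_cost_per_country = df.groupby("Country")["Cost"].mean()
print(avg_cost_per_country)
```

If the cost column was read from the CSV as text, it would probably need converting first, for example with pd.to_numeric, before mean() gives a sensible result.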

Related

Stable Marriage Problem but with numerical ratings for each other person instead of a ranking

Basically I have a group of people who rate each other from 1-10. I want to find the best way to get them into groups of 5, or preferably be able to change the group size easily.
My main issue with my previous attempts has been efficiency, as there are 1000+ people.
Some additional info: the data comes from a CSV, but I can convert that to any input that works better. I have to write this in Python, but again I am happy to convert from another language if someone finds a solution to this problem.
A similar problem I've found is a variation of the stable marriage problem where the group size is customizable.
https://www.reddit.com/r/learnpython/comments/ga0anm/comment/fowyem6/?utm_source=share&utm_medium=web2x&context=3
This is where I found an explanation to that problem. I'm just unsure how to make this work and optimize it for ratings out of ten instead of ordered preferences.
Thanks for any help
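One possible starting point is a greedy heuristic on a ratings matrix. This is not the stable-marriage variant from the linked comment and it does not guarantee an optimal or stable grouping; it is only a sketch of something fast enough for 1000+ people. The function name greedy_groups and the random ratings are made up for illustration.

```python
import numpy as np

def greedy_groups(ratings, group_size=5):
    """Greedy heuristic (hypothetical helper): ratings[i, j] is the score
    person i gave person j. Returns a list of groups of person indices."""
    n = ratings.shape[0]
    # Symmetric score: how much i likes j plus how much j likes i.
    mutual = ratings + ratings.T
    unassigned = set(range(n))
    groups = []
    while unassigned:
        # Seed each group with the lowest-numbered unassigned person.
        seed = min(unassigned)
        group = [seed]
        unassigned.remove(seed)
        while len(group) < group_size and unassigned:
            candidates = sorted(unassigned)
            # Total mutual score of each candidate with the current group.
            scores = mutual[np.ix_(candidates, group)].sum(axis=1)
            best = candidates[int(np.argmax(scores))]
            group.append(best)
            unassigned.remove(best)
        groups.append(group)
    return groups

# Made-up ratings from 1 to 10 for 20 people.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 11, size=(20, 20))
groups = greedy_groups(ratings, group_size=5)
print(groups)
```

The group size is a plain parameter, so it is easy to change. The result could then be improved with a local search that swaps people between groups, if the greedy grouping is not good enough.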

Aggregating and calculating weekly differences in CSV reports with Pandas?

Overflow Data/CSV/Pandas peeps hivemind....I need your help!
I've only recently started using Python/Pandas and I found a really good project to possibly work on, that would save me a lot of time.
I do weekly reports and report on the differences in data week by week.
I don't know Pandas 100%, but I don't think this would be that hard to do with code, and I feel like this project would be a great way for me to learn.
Here is an example of the report I have:
Report Example
Now, I have a list of items from the items list (which gets concatenated in the item info column) that I'm supposed to report on:
I'm essentially trying to have code that can compute:
-IF the name (from my list) is found in the item info column AND the week number is a particular number AND the year is 2022, THEN aggregate the total POS/sales as data A
&
-IF there is viable data for Week 16 as well (computed the same way for that week, as data B), THEN take the difference between these weeks (A and B) and output it as information point C (aka the difference)
-THEN, if that difference is positive, divide C by B (aka, give me the percentage of that move)
TL;DR: I want to aggregate the total sales of an item for the week, compare it with the corresponding amount for the previous week for the same item, and report the difference as well as the percentage of movement.
I only know so much in Pandas right now; would anyone be able to point me in a direction that could help? I feel like this shouldn't be that hard to do, and I'd love to make it a weekend project that saves me a good bit of time at work and teaches me how to automate some work tasks too. :)
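The steps above could be sketched roughly like this. The report itself is only shown as an image, so the column names "Item Info", "Week", "Year" and "POS", the helper name weekly_change, and the sample rows are all assumptions.

```python
import pandas as pd

# Made-up rows; the column names are assumed, not taken from the real report.
df = pd.DataFrame({
    "Item Info": ["Widget red", "Widget blue", "Gadget", "Widget red", "Gadget"],
    "Week": [16, 16, 16, 17, 17],
    "Year": [2022, 2022, 2022, 2022, 2022],
    "POS": [10, 5, 8, 18, 6],
})

def weekly_change(df, name, week, prev_week, year=2022):
    """Hypothetical helper: total POS for `name` in two weeks (A and B),
    their difference C, and C / B as a percentage when C is positive."""
    matches = df["Item Info"].str.contains(name, case=False, regex=False)
    in_year = df["Year"] == year
    a = df.loc[matches & in_year & (df["Week"] == week), "POS"].sum()
    b = df.loc[matches & in_year & (df["Week"] == prev_week), "POS"].sum()
    c = a - b
    # Only report a percentage for a positive move, and avoid dividing by zero.
    pct = c / b * 100 if c > 0 and b != 0 else None
    return a, b, c, pct

a, b, c, pct = weekly_change(df, "widget", week=17, prev_week=16)
print(a, b, c, pct)
```

To cover the whole list of names at once, the helper could be called in a loop over the list and the results collected into a new dataframe.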

can't access data from URL in pandas/jupyter notebook - Programming noob

I'm new to Python.
I started using Jupyter Notebook for a project that I'm doing to get into programming school.
I wanted to work with COVID data. I took the raw data from the Johns Hopkins GitHub via URLs.
I got data for confirmed cases, deaths and recovered cases. Each set of data is at a different URL.
Everything works fine except recovered cases. Apparently I can't access the data, since my code returns NaN values for every country. I pushed my code to GitHub so a friend could take a look, and he can access some data (not a lot), while I can't.
I don't get why...
I have another issue: I tried to make a figure with different curves showing the progression of the COVID cases in France (I picked France because I'm French),
and there are several issues with those curves.
The "recovered" (green) and "deaths" (orange) curves are flat. I was expecting it for the recovered cases since I can't access the data, but I don't get why it would happen with the death cases, since I have values.
Also, I've been trying to find another way to display the dates (on the y axis). There are so many values (one entry a day for the whole COVID crisis) that they overlap each other. I made them vertical, but it's not enough.
My code is available at: https://github.com/aaanoushka/Projet-OCR-Covid19/blob/main/Analyse_covid19_pays.ipynb?fbclid=IwAR3cjmCze1vJQ101l8wlD4tAx_slhOZQ1YgJ8jpnmso05CLmYoyFL2DofXc
I'd appreciate it so much if someone would be willing to take a look!
Feel free to ask me anything; I'll try my best to give you any detail needed.
Thank you
the "recovered"(green) and "deaths"(orange) curves are flat.
There are two issues here.
The data source you are using has discontinued publishing the 'recovery' statistic. You can read the details here. It seems that their concern is that there isn't really a globally consistent definition of 'recovery.' Some places only count confirmed recoveries. Other places say that if a patient is not reported as dead, then they must have recovered.
You may be able to find another source of this data elsewhere.
The death count is not flat on that plot. It is just very hard to see. If you comment out the confirmed case count plotting, you'll see what I mean:
Another way to check this is to compare the last element of confirmed and the last element of deaths:
print("Most recent death count in France", deaths_fr.iloc[-1])
print("Most recent case count in France", confirmed_fr.iloc[-1])
Output:
Most recent death count in France 135264
Most recent case count in France 21511997
If you plot these two on the same scale, the death count will be squished - there are about 100 times more cases than deaths.
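One way to make both curves readable is to give the deaths their own y axis. This is only a sketch: the numbers below are placeholders, not the real Johns Hopkins data, and the variable names simply mirror the ones used above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder series, roughly 100 times apart in scale.
dates = pd.date_range("2022-01-01", periods=5, freq="D")
confirmed_fr = pd.Series([1_000_000, 1_200_000, 1_500_000, 1_900_000, 2_400_000], index=dates)
deaths_fr = pd.Series([10_000, 10_400, 10_900, 11_500, 12_200], index=dates)

fig, ax_cases = plt.subplots()
ax_cases.plot(confirmed_fr.index, confirmed_fr.values, color="tab:blue")
ax_cases.set_ylabel("confirmed cases")

# A second y axis sharing the same x axis, scaled to the death counts.
ax_deaths = ax_cases.twinx()
ax_deaths.plot(deaths_fr.index, deaths_fr.values, color="tab:orange")
ax_deaths.set_ylabel("deaths")
```

Another option is to keep a single axis and switch it to a logarithmic scale with ax_cases.set_yscale("log").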
Also, i've been trying to find another way to display the dates (on the y axis)
It looks like the indexes of the dataframes are defined as strings, and not as dates. Try converting them to dates:
deaths_fr.index = pd.to_datetime(deaths_fr.index)
recovered_fr.index = pd.to_datetime(recovered_fr.index)
confirmed_fr.index = pd.to_datetime(confirmed_fr.index)
I get more reasonable axis labels when I do that.
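A quick way to check that the conversion worked is to look at the index type before and after. The sketch below assumes the date labels look like "1/22/20" (month/day/two-digit year); if yours differ, the format string needs adjusting or can be left out.

```python
import pandas as pd

# Placeholder values with string dates as the index (assumed format).
deaths_fr = pd.Series([0, 1, 1], index=["1/22/20", "1/23/20", "1/24/20"])
print(type(deaths_fr.index))  # a plain Index of strings

# An explicit format avoids any day/month ambiguity.
deaths_fr.index = pd.to_datetime(deaths_fr.index, format="%m/%d/%y")
print(type(deaths_fr.index))  # now a DatetimeIndex
```

With a DatetimeIndex, matplotlib treats the x values as dates and picks a small number of tick labels instead of drawing one per day.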

How to calculate the average of rows based on date

I am probably using poor search terms when trying to find the answer to this problem but I hope that I can explain by posting an image.
I have a weekly df (left table) and I am trying to get the total average across all cities within one week and the average of certain observations based on 2 lists (right table)
excel representation of the dataframe
Can anyone please help figure out how to do this?
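Without the image it is hard to know the exact layout, so this sketch assumes one row per week and one column per city. The city names, the two lists and the values are made up.

```python
import pandas as pd

# Assumed layout: weekly dates as the index, one column per city.
df = pd.DataFrame(
    {"Oslo": [1.0, 2.0], "Bergen": [3.0, 4.0], "Paris": [5.0, 6.0], "Lyon": [7.0, 8.0]},
    index=pd.to_datetime(["2022-01-03", "2022-01-10"]),
)

# The two hypothetical lists of cities to average separately.
group_a = ["Oslo", "Bergen"]
group_b = ["Paris", "Lyon"]

# axis=1 averages across the columns, giving one value per week.
summary = pd.DataFrame({
    "avg_all": df.mean(axis=1),
    "avg_group_a": df[group_a].mean(axis=1),
    "avg_group_b": df[group_b].mean(axis=1),
})
print(summary)
```

If the data is in long form instead (one row per city per week), a groupby on the date column followed by mean() would play the same role.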

How to find out how many items in a dataframe are within a certain range?

I'm currently doing some analysis on the stats of my podcast, and I have merged my Spotify listening numbers with the ones from my RSS-feed in pandas. So far so good, I now have a dataframe with a column of "Total" which tells me how many listeners I had on each episode and what the average number of listeners is.
Now, what I want to do is see how many of my episodes fit into (at least) three categories: Good, Normal and Bad. So I need to divide my Totals into three ranges and then see how many of my episodes land within each of those ranges. I have some limited experience of messing around with Python and pandas, but it's been a while since I last sat down with it and I don't really know how to approach this problem.
Any help is highly appreciated!
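One common approach is pd.cut, which assigns each value to a labelled range. In this sketch the listener numbers and the cut-off points (200 and 500) are made up; only the "Total" column name comes from the question.

```python
import pandas as pd

# Placeholder listener totals, one per episode.
df = pd.DataFrame({"Total": [120, 340, 560, 90, 410, 800, 150]})

# Assumed thresholds: up to 200 is Bad, up to 500 is Normal, above that is Good.
bins = [0, 200, 500, float("inf")]
labels = ["Bad", "Normal", "Good"]
df["Category"] = pd.cut(df["Total"], bins=bins, labels=labels)

# Count the episodes in each category, keeping the category order.
counts = df["Category"].value_counts(sort=False)
print(counts)
```

If the ranges should instead hold roughly the same number of episodes each, pd.qcut(df["Total"], q=3, labels=labels) splits on quantiles rather than fixed thresholds.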
