can't access data from URL in pandas/jupyter notebook - Programming noob

I'm new to Python.
I started using Jupyter Notebook for a project that I'm doing to get into a programming school.
I wanted to work with COVID data, so I took the raw data from the Johns Hopkins GitHub repository via URLs.
I got data for confirmed cases, deaths and recovered cases. Each data set is at a different URL.
Everything works fine except recovered cases. Apparently I can't access the data: in my code, it returns NaN values for every country. I pushed my code to GitHub so a friend could take a look, and he can access some data (not a lot), while I can't.
I don't get why...
I have another issue: I tried to make a figure with different curves showing the progression of COVID cases in France (I picked France because I'm French), and there are several issues with those curves.
The "recovered"(green) and "deaths"(orange) curves are flat. I was expecting that for the recovered cases since I can't access the data, but I don't get why it would happen with the deaths, since I have values.
Also, I've been trying to find another way to display the dates (on the y axis). There are so many values (one entry a day for the whole COVID crisis) that they overlap each other. I made the labels vertical, but it's not enough.
My code is available at : https://github.com/aaanoushka/Projet-OCR-Covid19/blob/main/Analyse_covid19_pays.ipynb?fbclid=IwAR3cjmCze1vJQ101l8wlD4tAx_slhOZQ1YgJ8jpnmso05CLmYoyFL2DofXc
I'd appreciate it so much if someone would be willing to take a look!
Feel free to ask me anything; I'll try my best to give you any detail needed.
Thank you

the "recovered"(green) and "deaths"(orange) curves are flat.
There are two issues here.
The data source you are using has discontinued publishing the 'recovery' statistic. You can read the details here. It seems that their concern is that there isn't really a globally consistent definition of 'recovery.' Some places only count confirmed recoveries. Other places say that if a patient is not reported as dead, then they must have recovered.
You may be able to find another source of this data elsewhere.
The death count is not flat on that plot. It is just very hard to see. If you comment out the confirmed case count plotting, you'll see what I mean:
Another way to check this is to compare the last element of confirmed and the last element of deaths:
print("Most recent death count in France", deaths_fr.iloc[-1])
print("Most recent case count in France", confirmed_fr.iloc[-1])
Output:
Most recent death count in France 135264
Most recent case count in France 21511997
If you plot these two on the same scale, the death count will be squished: there are roughly 160 times more cases than deaths (21511997 / 135264 ≈ 159).
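One possible way around the scale problem is to give the death count its own y-axis. The sketch below is only an illustration: it uses small synthetic series as stand-ins for `confirmed_fr` and `deaths_fr` (not the real data), and the colors and labels are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-ins for confirmed_fr and deaths_fr (made-up numbers, not real data)
dates = pd.date_range("2020-03-01", periods=5, freq="D")
confirmed_fr = pd.Series([100, 1_000, 10_000, 100_000, 1_000_000], index=dates)
deaths_fr = pd.Series([1, 8, 70, 650, 6_300], index=dates)

fig, ax_cases = plt.subplots()
ax_cases.plot(confirmed_fr.index, confirmed_fr.values, color="tab:blue")
ax_cases.set_ylabel("confirmed cases")

# twinx() adds a second y-axis that shares the same x-axis,
# so each curve is drawn against its own scale
ax_deaths = ax_cases.twinx()
ax_deaths.plot(deaths_fr.index, deaths_fr.values, color="tab:orange")
ax_deaths.set_ylabel("deaths")
```

Another option, if a single axis is preferred, is a logarithmic scale via `ax_cases.set_yscale("log")`, which keeps both curves visible at the cost of a less intuitive reading.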
Also, i've been trying to find another way to display the dates (on the y axis)
It looks like the indexes of the dataframes are defined as strings, and not as dates. Try converting them to dates:
deaths_fr.index = pd.to_datetime(deaths_fr.index)
recovered_fr.index = pd.to_datetime(recovered_fr.index)
confirmed_fr.index = pd.to_datetime(confirmed_fr.index)
I get more reasonable axis labels when I do that.
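As a self-contained illustration of that conversion, here is a minimal sketch on a synthetic series. The `m/d/yy` date strings and the explicit `format` argument are assumptions about how the column labels look in the source data; if your strings differ, adjust or omit `format`.

```python
import pandas as pd

# Synthetic series whose index holds dates as plain strings (assumed m/d/yy format)
deaths_fr = pd.Series(
    [10, 11, 13, 12],
    index=["9/30/20", "10/1/20", "10/2/20", "1/1/21"],
)
print(type(deaths_fr.index[0]))  # the labels are ordinary strings at this point

# Convert the string labels into real timestamps
deaths_fr.index = pd.to_datetime(deaths_fr.index, format="%m/%d/%y")

# With a DatetimeIndex, date-aware operations become available,
# for example selecting a whole month by a partial date string
october = deaths_fr.loc["2020-10"]
print(october)
```

Once the index holds real dates, plotting libraries can choose sensible tick positions instead of drawing one label per row.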

Related

Aggregating and calculating weekly differences in CSV reports with Pandas?

Overflow Data/CSV/Pandas peeps hivemind... I need your help!
I've only recently started using Python/Pandas and I found a really good project to work on that would save me a lot of time.
I do weekly reports and report on the differences in data week by week.
I don't know Pandas 100%, but I don't think this would be that hard to do with code, and I feel like this project would be a great way for me to learn.
Here is an example of the report I have:
Report Example
Now, I have a list of items from the items list (which gets concatenated in the item info column) that I'm to report on:
I'm essentially trying to write code that can compute:
- IF the name (from my list) is found in the item info column AND the week number is a particular number AND the year is 2022, THEN aggregate the total POS/sales as data A
- IF there is viable data for Week 16 as well, compute the same total for that week as data B, then take the difference between these weeks (A and B) and output it as data point C (the difference)
- THEN, if that difference is positive, divide C by B (i.e., give me the percentage of that move)
TL;DR: I want to aggregate the total sales of an item for the week, compare it with the corresponding amount for the previous week for the same item, and get the difference as well as the percentage of movement.
I only know so much Pandas right now; would anyone be able to point me in a direction that could help? I feel like this shouldn't be that hard to do, and I'd love to make it a weekend project, save myself a good bit of time at work, and learn how to automate some work tasks too. :)
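The steps described above could be sketched roughly as follows. Everything here is hypothetical: the column names (`item_info`, `year`, `week`, `pos_sales`), the item names, and the numbers are invented for illustration, and the sketch assumes C = A - B with week 17 as the current week (A) and week 16 as the previous week (B).

```python
import pandas as pd

# Hypothetical report layout; the real column names will differ
report = pd.DataFrame({
    "item_info": ["Widget A red", "Widget A blue", "Widget A red", "Gadget B", "Gadget B"],
    "year": [2022, 2022, 2022, 2022, 2022],
    "week": [16, 16, 17, 16, 17],
    "pos_sales": [100, 50, 180, 80, 60],
})
items = ["Widget A", "Gadget B"]  # hypothetical list of names to report on

rows = []
for name in items:
    # Rows whose item info contains the name, restricted to the year 2022
    mask = report["item_info"].str.contains(name, regex=False) & (report["year"] == 2022)
    weekly = report[mask].groupby("week")["pos_sales"].sum()

    a = weekly.get(17)  # data A: total for the current week (assumed to be week 17)
    b = weekly.get(16)  # data B: total for the previous week
    if a is None or b is None:
        continue  # no viable data for one of the two weeks

    c = a - b  # data C: the difference
    # Percentage is only computed when the difference is positive, as described above
    pct = c / b * 100 if c > 0 else float("nan")
    rows.append({"item": name, "A": a, "B": b, "C": c, "pct_change": pct})

summary = pd.DataFrame(rows)
print(summary)
```

For a report loaded from CSV, `pd.read_csv` would replace the hand-built `report` frame; the filtering and `groupby` logic would stay the same.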

how to analyze numerical and categorical variables at the same time?

I'm trying to analyze the data of a food ordering application.
The data consists of both numerical and categorical variables. The main variable I'm studying is the total delivery time of an order, which represents the time from placing the order to closing it. I want to find out which variables affect it the most.
An example of rows in the data is the following:
order id | branch id | date | time placed | day | period | items id | no. items | total no. items | total delivery time | total time in seconds
113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 571 | 4 | 11 | 00:46:19 | 2805
113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 573 | 4 | 11 | 00:46:19 | 2805
I want to study the effects of all the variables on the total time, even items id and branch id. Does a certain item affect the time? Do the day and the period of the day affect it as well?
I used linear regression to get the correlation between total time and the numerical variables, and tried one-way ANOVA for some categorical variables, but I didn't like the results. Is there a way to analyze all variables together without encoding the categorical variables?

I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algos like regression love numbers. ML algos like classification love labels (non-numbers). You can certainly convert labeled data to 'numbered' data, but it can go wrong. One example: coding ['red','green','blue'] as [1,2,3] would produce weird things, like 'red' being lower than 'blue', and averaging a 'red' and a 'blue' giving you a 'green'. A more subtle example is coding ['low', 'medium', 'high'] as [1,2,3]. In that case the ordering happens to make sense, but subtle inconsistencies can arise when 'medium' is not exactly in the middle of 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, it isn't using large, medium, and small to do its analysis; it's converting those categories to numbers. I think. Maybe someone can confirm this for me.
Thus, I don't think it makes sense to try to measure any kind of relationship between IDs and specific outcomes, like 'totaltime', 'totaldays', etc. If you kick off a project on a Monday or a Friday, does the project end sooner or later than non-Monday-start or non-Friday-start projects? Well, maybe it does. But, is that correlation or causation? You can find correlations between all kinds of things, but these don't necessarily imply causation between these same things. Let's say you find a strong relationship between multiple projects that start on the second Monday of the month and all of these projects get finished off much faster than all other projects. This seems like pure coincidence, rather than causation. Or, there is some other factor impacting the outcome. Maybe projects that start on the second Monday of the month are typically small upgrades, rather than full-blown new undertakings, so the volume of work is less, and the project is done faster. However, starting the work on the second Monday of the month doesn't CAUSE the project to be finished off faster. Tell me if I am wrong. I'm always open to feedback.
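To make the encoding point above concrete, here is a small sketch contrasting integer codes with one-hot columns. The frame, its column names (`period`, `total_time_s`), and the values are made up for illustration; this is not the asker's data.

```python
import pandas as pd

# Made-up orders with one categorical and one numerical column
orders = pd.DataFrame({
    "period": ["morning", "afternoon", "evening", "afternoon"],
    "total_time_s": [1500, 2805, 2100, 2600],
})

# Integer codes: pandas assigns them in sorted (alphabetical) category order,
# which implies an ordering the labels do not really have
codes = orders["period"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(orders["period"], prefix="period")
print(one_hot)

# A quick descriptive look: mean delivery time per category
mean_by_period = orders.groupby("period")["total_time_s"].mean()
print(mean_by_period)
```

The per-category means are only descriptive; as noted above, a difference between groups does not by itself show that the category causes it.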

Statistical analysis of Likert data in python

I have two sets of Likert data on a scale from 0 to 100, where 0 is strongly disagree and 100 is strongly agree. The first set consists of answers from a sample of 500 users. The second set also consists of numerical answers from a sample of 500 users. The data sets are related in this way: the ith user in the first set has matched with the ith user in the second set on numerous occasions on a particular gaming platform (e.g., a party on PlayStation Network), for i = 1,...,500. The question asked to the user is: Do you like dogs? Here's an example of how the data looks:
user_1_data = [100,60,98, 50,0,...,20,100]
user_2_data = [50,75,12,...,100,20]
where user_1_data[0] is the user who matched with user_2_data[0], and their responses to the question "Do you like dogs?" are 100 and 50 respectively, and so on until i = 500.
I managed to plot the actual data as the probability distribution below, where the x axis is the rating from 0 to 100 and the y axis is the probability of picking that particular rating.
Although the distributions look similar, I need some sort of test to prove some significance between them (if any). Ultimately I'd like to answer the question: Does a similar distribution of answers imply that the users will play together on different occasions?
Please feel free to edit this question for formatting and to be easier to understand.
This is a statistics question. Please use statistics terms and math language if possible. I am new to data science and would love to learn how to answer my own question in the future.
I code in python.
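Since the two lists are paired (the ith entries belong to matched users), one possible starting point is to look at tests that respect the pairing. The sketch below is not taken from any answer in this thread: it uses synthetic ratings in place of the real data and simply shows how three standard SciPy tests could be called. Whether any of them answers the underlying question is a separate statistical judgment.

```python
import numpy as np
from scipy import stats

# Synthetic paired ratings on a 0-100 scale (stand-ins for the real data)
rng = np.random.default_rng(0)
user_1_data = rng.integers(0, 101, size=500)
user_2_data = np.clip(user_1_data + rng.integers(-20, 21, size=500), 0, 100)

# Spearman rank correlation: do matched users tend to give similar ratings?
rho, p_rho = stats.spearmanr(user_1_data, user_2_data)

# Wilcoxon signed-rank test on the paired differences:
# is there a systematic shift between the two members of each pair?
w_stat, p_w = stats.wilcoxon(user_1_data, user_2_data)

# Two-sample Kolmogorov-Smirnov test: compares the two overall distributions,
# but ignores which user was matched with which
ks_stat, p_ks = stats.ks_2samp(user_1_data, user_2_data)

print(f"Spearman rho={rho:.3f} (p={p_rho:.3g})")
print(f"Wilcoxon W={w_stat:.1f} (p={p_w:.3g})")
print(f"KS D={ks_stat:.3f} (p={p_ks:.3g})")
```

Note that similar-looking marginal distributions (what the KS test compares) say nothing about whether individual matched pairs agree; the rank correlation is the measure that uses the pairing directly.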

Calculating mean values on a complicated dataframe

I'm trying to figure out how to calculate the average cost of employees per country. I have tried using the mean() function, but that single line of code is not enough to pull out the average cost of employees per country. Do you have any tips for how to get it done? Is it too complicated as it stands right now, and do I need to do it step by step using, for example, groupby?
Feel free to ask for more information about this problem. I hope I have shown enough for you to understand the problem and help me with a solution.
Picture of the current CSV file I'm working on; some cleaning has been done to it. There are several countries in this dataframe, not just the ones you see.
Here you can see which other countries are in it.
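A per-country average is the kind of task `groupby` is designed for. The sketch below assumes hypothetical column names (`country`, `employee_cost`) and made-up values, since the actual columns are only visible in the pictures.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned CSV; real column names will differ
employees = pd.DataFrame({
    "country": ["Norway", "Norway", "Sweden", "Sweden", "Denmark"],
    "employee_cost": [500.0, 700.0, 400.0, 600.0, 550.0],
})

# Split the rows by country, then take the mean cost within each group
avg_cost = employees.groupby("country")["employee_cost"].mean()
print(avg_cost)
```

If the cost column was read from CSV as text, it would first need converting to numbers (for example with `pd.to_numeric`) before `mean()` gives a meaningful result.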

How to find out how many items in a dataframe is within a certain range?

I'm currently doing some analysis on the stats of my podcast, and I have merged my Spotify listening numbers with the ones from my RSS-feed in pandas. So far so good, I now have a dataframe with a column of "Total" which tells me how many listeners I had on each episode and what the average number of listeners is.
Now, what I want to do is see how many of my episodes fit into (at least) three categories: Good, Normal and Bad. So I need to divide my Totals into three ranges and then see how many of my episodes land within each of those ranges. I have some limited experience messing around with Python and pandas, but it's been a while since I last sat down with it and I don't really know how to approach this problem.
Any help is highly appreciated!
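One way to approach this is `pd.cut`, which assigns each value to a labeled range, followed by `value_counts`. In the sketch below the listener numbers are invented, and the thresholds (25% below and above the mean) are an arbitrary assumption chosen only to show the mechanics.

```python
import pandas as pd

# Invented listener totals, one row per episode
episodes = pd.DataFrame({"Total": [120, 340, 95, 410, 220, 180, 305]})
mean_total = episodes["Total"].mean()

# Hypothetical thresholds: more than 25% below the mean is "Bad",
# more than 25% above the mean is "Good", everything in between is "Normal"
bins = [-float("inf"), 0.75 * mean_total, 1.25 * mean_total, float("inf")]
episodes["category"] = pd.cut(
    episodes["Total"], bins=bins, labels=["Bad", "Normal", "Good"]
)

# Count how many episodes fall into each category
counts = episodes["category"].value_counts()
print(counts)
```

If three equally sized groups are preferred over fixed thresholds, `pd.qcut(episodes["Total"], q=3, labels=["Bad", "Normal", "Good"])` splits on quantiles instead.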
