How to combine province data into one country row in python - python

This is my first question here. I have COVID data in Python: the distribution of COVID cases across several provinces in each country.
What should I do if I want each country to have only one row of data, and drop the province column?
Thank you

You need to group by date and country and apply sum() as the aggregate function; if you select only the cases column before summing, the province column is dropped automatically:
e.g. if you have a dataframe df, with columns date, country, province, and cases, then you should do:
grouped_df = df.groupby(['date', 'country'], as_index=False)['cases'].sum()
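A minimal runnable sketch of the above, using made-up sample data with the column names described in the question:

```python
import pandas as pd

# Hypothetical sample mirroring the described layout: date, country, province, cases
df = pd.DataFrame({
    "date":     ["2020-06-01"] * 3 + ["2020-06-02"] * 3,
    "country":  ["Canada", "Canada", "France", "Canada", "Canada", "France"],
    "province": ["Ontario", "Quebec", None, "Ontario", "Quebec", None],
    "cases":    [10, 5, 7, 12, 6, 9],
})

# Selecting only 'cases' before summing drops 'province' implicitly;
# as_index=False keeps date and country as regular columns
grouped_df = df.groupby(["date", "country"], as_index=False)["cases"].sum()
print(grouped_df)
```

Each (date, country) pair now occupies exactly one row, with provincial cases summed.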

Related

Grouping a Pandas DataFrame by Months for Distributor Names

I have a dataframe that looks like this:
[1]: https://i.stack.imgur.com/KnHba.png
Essentially, there is a distributor name column, a sales column, and a MM/DD/YYYY Date column.
For each distributor, by month, I want the sum of sales.
What I tried:
df = df.groupby(df['Distributor Name'],df.Date.dt.month)['Sales'].sum()
This throws an error. "Unhashable type: Series". This works when I remove Distributor Name, but I don't just want the overall monthly sales. I want the monthly sales BY distributor.
Thanks in advance!!
Joel
The correct way to group by multiple columns is to put them in a list as the first argument:
result = df.groupby(['Distributor Name', df['Date'].dt.month])['Sales'].sum()
This creates a pandas Series with a MultiIndex, with Distributor Name and Date as index levels. If you wish to create a dataframe with three columns (Distributor Name, Date, Sales), you can reset the index of this Series:
result = result.reset_index()
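A short self-contained sketch of this answer, with hypothetical distributor data standing in for the screenshot:

```python
import pandas as pd

# Hypothetical sample with the columns described in the question
df = pd.DataFrame({
    "Distributor Name": ["Acme", "Acme", "Acme", "Bolt"],
    "Date": pd.to_datetime(
        ["01/05/2021", "01/20/2021", "02/03/2021", "01/10/2021"],
        format="%m/%d/%Y",
    ),
    "Sales": [100, 50, 70, 30],
})

# Group by distributor and by the calendar month of the Date column
result = df.groupby(["Distributor Name", df["Date"].dt.month])["Sales"].sum()

# Flatten the MultiIndex back into ordinary columns
result = result.reset_index()
print(result)
```

Note that mixing a column label ('Distributor Name') and a derived Series (df['Date'].dt.month) in the same groupby list is what the asker's code was missing.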

How Can I Segregate data in pandas From my Timestamp Column

I am working with an Excel sheet in pandas, analysing some data from it.
The sheet has 8 columns; one is Timestamp, another is City, and the rest are things like Domain, State, etc.
I want to analyse only the City and Timestamp columns.
I selected the City and Timestamp columns from the Excel sheet into a DataFrame. I computed the city count (how many rows contain each city) using cities_df['Count'] = df['City_Town_Village '].value_counts()
After finding the city count I computed the percentage of each city using cities_df['PctCnt'] =(cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Now my question: as I compute the city count, the number of rows in my dataframe decreases from 238 to 128. No issue there; they decrease because of the count.
But I also have the Timestamp column in my df. For a city like Delhi, some people registered on 28-May-2021 and some on 29-May-2021. After computing the city count, my df only shows the timestamp for the earliest date, i.e. 28-May.
I don't know why this is happening. What I actually want is to segregate the data into two-week buckets and plot a graph week by week, along with the city percentages.
Here is my Excel file
This is the code I'm using:
import pandas as pd
df = pd.read_excel('PCS_NWR_Sheet.xlsx')
df.head()
pd.set_option('display.max_rows', 300)
cities_df = pd.DataFrame()
cities_df['Count'] = df['City_Town_Village '].value_counts()
cities_df.index.names=['City']
cities_df.reset_index(inplace = True)
cities_df['Timestamp'] = df['Timestamp']
column_names = ['Timestamp', 'City', 'Count']
cities_df = cities_df.reindex(columns=column_names)
cities_df['PctCnt'] =(cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Metro_list = ['Hyderabad', 'Kolkata', 'Delhi', 'Pune', 'Bengaluru', 'Noida', 'Kanpur', 'Gurgaon']
top_metro=cities_df[cities_df['City'].isin(Metro_list)]
top_metro
.value_counts() returns a Series whose size equals the number of unique elements in the column being counted. You are getting fewer rows because it collapses the duplicates.
I can think of two ways to solve this (if I understand the question correctly).
1. Call .value_counts() on both the date column and the city column:
df[['City_Town_Village','Date']].value_counts()
If your timestamps are not already dates, you'll need to create a date column first (you probably won't be able to group on raw datetimes, since the times will vary). This gives a Series whose row count equals the number of existing combinations of the two columns.
2. Make a separate dataframe with the value_counts of the town column and merge it back in. That is, if you want a column in your main dataframe holding the number of times each town appears anywhere in the data, that count has a different length (as noted above), so you store it separately and bring it back in as needed:
df2 = pd.DataFrame(df['City_Town_Village'].value_counts())
df2.reset_index(inplace=True)  # by default, the df built from value_counts() uses the city/town as the index; this restores a normal index
df2.columns = ['City_Town_Village','Count']  # rename the columns
df = df.merge(df2, how='left', on='City_Town_Village')
This adds a Count column to df, holding the count of each City/Town in the original dataset.
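A runnable sketch of the merge approach, using a few hypothetical rows in place of the asker's Excel file:

```python
import pandas as pd

# Hypothetical sample standing in for the Excel data
df = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2021-05-28", "2021-05-28", "2021-05-29", "2021-05-29"]
    ),
    "City_Town_Village": ["Delhi", "Delhi", "Delhi", "Pune"],
})

# Per-city counts live in a separate, shorter frame
df2 = df["City_Town_Village"].value_counts().reset_index()
df2.columns = ["City_Town_Village", "Count"]

# Merging back keeps every original row (and its Timestamp) intact,
# with the city's total count attached to each row
df = df.merge(df2, how="left", on="City_Town_Village")
print(df)
```

Because the merge preserves all 4 original rows, no timestamps are lost, which was the asker's complaint about working with value_counts alone.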

Python (pandas) - sum multiple columns based on one column

I have a dataframe of Covid-19 deaths by country. Countries are identified in the Country column. Sub-national classification is based on the Province column.
I want to generate a dataframe which sums all columns based on the value in the Country column (except the first 2, which are geographical data). In short, for each date, I want to compress the observations for all provinces of a country such that I get a single number for each country.
Right now, I am able to do that for a single date:
import pandas as pd
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
raw = pd.read_csv(url)
del raw['Lat']
del raw['Long']
raw.rename({'Country/Region': 'Country', 'Province/State': 'Province'}, axis=1, inplace=True)
raw2 = raw.groupby('Country')['6/29/20'].sum()
How can I achieve this for all dates?
You can use iloc to group every date column at once; after dropping Lat and Long, the date columns start at position 2:
raw2 = raw.iloc[:,2:].groupby(raw.Country).sum()
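A self-contained sketch of this answer. The frame below is hypothetical sample data mirroring the CSV's layout after Lat and Long are removed (Province, Country, then one column per date), so the example runs without fetching the URL:

```python
import pandas as pd

# Hypothetical frame mirroring the layout after Lat/Long are dropped
raw = pd.DataFrame({
    "Province": ["Ontario", "Quebec", None],
    "Country": ["Canada", "Canada", "France"],
    "6/28/20": [1, 2, 3],
    "6/29/20": [4, 5, 6],
})

# iloc[:, 2:] selects every date column; grouping by raw.Country
# sums them all per country in a single pass
raw2 = raw.iloc[:, 2:].groupby(raw.Country).sum()
print(raw2)
```

The result has one row per country and one column per date, which is exactly the per-date version of the single-date groupby in the question.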

Filter on specific values in a Pandas DataFrame

I am trying to filter on certain values in many columns:
in the Dimension column, filter on Education; then in the next column (Indicator Name), filter on Mean years of schooling (years); then in the Country Name column, filter on USA, Canada, etc.
I tried the script below, but I couldn't filter on the specifics mentioned above:
raw_data = {}
for Dimension in new_df["Dimension"]:
    dimension_df = new_df.loc[new_df["Dimension"] == Dimension]
    arr = []
    arr.append(dimension_df["Indicator Name"].values[0])
    arr.append(dimension_df["ISO Country Code"].values[0])
    raw_data[Dimension] = arr
pd.DataFrame(raw_data)
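Rather than looping, filtering on several columns at once is usually done with boolean masks combined with &, using .isin() for the multi-valued country filter. A minimal sketch, with hypothetical rows standing in for the asker's data:

```python
import pandas as pd

# Hypothetical frame with the column names given in the question
new_df = pd.DataFrame({
    "Dimension": ["Education", "Education", "Health"],
    "Indicator Name": ["Mean years of schooling (years)",
                       "Mean years of schooling (years)",
                       "Life expectancy"],
    "Country Name": ["USA", "Canada", "USA"],
})

# One mask per column, combined with &; .isin() handles the country list
filtered = new_df[
    (new_df["Dimension"] == "Education")
    & (new_df["Indicator Name"] == "Mean years of schooling (years)")
    & (new_df["Country Name"].isin(["USA", "Canada"]))
]
print(filtered)
```

Each parenthesized comparison yields a boolean Series; & intersects them row by row, so only rows matching all three conditions survive.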

How can I convert these dataframe indexes into a column?

This is a dataframe with data on military spending for some countries from 2010-2017. I would like to convert the years row of the dataframe
into a column with the name "Year", and another one with the values corresponding to each year for each country. It should look like this dataframe (ignore the name of the third column, it's just an example):
Using
df.reset_index().melt('Country')
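A runnable sketch of this one-liner. The frame below is hypothetical (Country as the index, one column per year); var_name and value_name are added here only to label the melted columns:

```python
import pandas as pd

# Hypothetical military-spending frame: Country index, one column per year
df = pd.DataFrame(
    {"2010": [10, 20], "2011": [12, 22]},
    index=pd.Index(["USA", "France"], name="Country"),
)

# reset_index turns the Country index into a column; melt then stacks
# the year columns into long-format (Year, Spending) pairs
long_df = df.reset_index().melt("Country", var_name="Year", value_name="Spending")
print(long_df)
```

Every (country, year) cell of the wide table becomes one row of the long table, which is the shape the asker described.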
