Python (pandas) - sum multiple columns based on one column

I have a dataframe of Covid-19 deaths by country. Countries are identified in the Country column. Sub-national classification is based on the Province column.
I want to generate a dataframe which sums all columns based on the value in the Country column (except the first 2, which are geographical data). In short, for each date, I want to compress the observations for all provinces of a country such that I get a single number for each country.
Right now, I am able to do that for a single date:
import pandas as pd
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
raw = pd.read_csv(url)
del raw['Lat']
del raw['Long']
raw.rename({'Country/Region': 'Country', 'Province/State': 'Province'}, axis=1, inplace=True)
raw2 = raw.groupby('Country')['6/29/20'].sum()
How can I achieve this for all dates?

You can use iloc to select every date column and group by the Country column (after Lat and Long are dropped, the dates start at position 2):
raw2 = raw.iloc[:, 2:].groupby(raw.Country).sum()
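If you prefer not to slice by position, a roughly equivalent sketch (using the raw frame from the question) is to group the whole frame and sum only the numeric date columns:
# group by country and sum every numeric (date) column; numeric_only=True skips the Province strings
raw2 = raw.groupby('Country').sum(numeric_only=True)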

Related

How to use duplicated, sort values, and pivot table to group the data w.r.t. cell value and keep the cell values that have occurred more than once

I want to group the data from the table according to the 'Name' column values and keep only the rows whose 'Name' value occurs more than once in the table. The following code works fine for the given small data table.
import pandas as pd
data = {'Name': ['Danny', 'Damny', 'Monny', 'Quony', 'Dimny', 'Danny'],
        'Email': ['danny#gmail.com', 'danny#gmail.com', 'monny#gmail.com', 'quony#gmail.com', 'danny#gmail.com', 'danny#gmail.com'],
        'IBAN': ['NLAMRO123456789', 'NLINGB126656723', 'BGFFEO128856754', 'NLAMRO123896763', 'DUDMRO567456722', 'NLRABO123456712']}  # data with three columns
df = pd.DataFrame(data)  # creation of dataframe
df['No Dutch Bank'] = None  # extra column for analysis
df.loc[df['IBAN'].str.find('NL') == -1, 'No Dutch Bank'] = 'ja'  # flag rows whose bank numbers are not Dutch
df_filt = df[['Name', 'Email', 'IBAN']]  # keep only the columns needed in the final result
df_gb = df_filt[df_filt.duplicated(subset=['Name'], keep=False)].sort_values(by='Name', ascending=False).reset_index(drop=True)  # keep rows whose Name occurs more than once
piv_tab = pd.pivot_table(df_gb, index=['Name', 'Email', 'IBAN'])  # apply pivot table
piv_tab
This works fine for the original data of three columns and six rows. In practice I have data with thirty columns and thirty thousand (30,000) rows. When I select three columns (Name, Email and IBAN) and run the same code, it does not filter out the rows whose Name appears in the table only once.
Why?
Everything except the pivot logic is correct and should work for the larger dataset as well.
Another approach would be to calculate the count of names, filter by count > 1, and merge the names back into this table via a left join.
This way the count > 1 filter is more explicit, and it is more flexible if you later need to filter by count > 2, etc.
import pandas as pd
# count names
df_name = df_filt.groupby('Name')['Email'].count()
# get names with count > 1
df_name = df_name[df_name > 1].reset_index()
# merge filtered names back to original df to get filtered df
df_gb = pd.merge(df_name['Name'], df_filt, on=['Name'], how='left')
# sorting etc.
df_gb = df_gb.sort_values(by='Name', ascending=False).reset_index(drop=True)
# and some pivot stuff
...
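A more compact variant of the same idea, assuming the df_filt frame from the question, uses transform to attach each row's name count and filter on it directly:
# keep only rows whose Name occurs more than once (change 1 to 2 for count > 2, etc.)
df_gb = df_filt[df_filt.groupby('Name')['Name'].transform('count') > 1]
df_gb = df_gb.sort_values(by='Name', ascending=False).reset_index(drop=True)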

how to create a dataframe using groupby such that the grouping criterion is contained in the data

I wanted to create a 2D dataframe about coronavirus that contains one column of countries and another column with the number of deaths. The CSV file I am using is date-oriented, so for some days the number of deaths is 0, which is why I decided to group by Country and sum the deaths. Yet it returned a dataframe with only one column, even though writing it to a CSV file produces two columns.
here is my code:
#import matplotlib.pyplot as plt
import pandas as pd
from pandas.core.frame import DataFrame
covid_data = pd.read_csv('countries-aggregated.csv')
bar_data = pd.DataFrame(covid_data.groupby('Country')['Deaths'].sum())
It is difficult to give a perfect answer without seeing the dataset; however, groupby sets your key as the index, so Country ends up as the index rather than a column (and selecting the single 'Deaths' column returns a Series). You can pass as_index=False:
bar_data = covid_data.groupby('Country', as_index=False)['Deaths'].sum()
Or, if you have only one column in the DataFrame to aggregate:
bar_data = covid_data.groupby('Country', as_index=False).sum()
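Another option, assuming the same covid_data frame, is to keep the default groupby and call reset_index() afterwards so Country becomes a regular column again:
bar_data = covid_data.groupby('Country')['Deaths'].sum().reset_index()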

How can I segregate data in pandas from my Timestamp column

I am working with an Excel sheet in pandas, analysing some of its data.
The sheet has 8 columns: one is Timestamp, another is City, and the rest are things like Domain, State, etc.
I want to analyse only the City and Timestamp columns.
I have selected the City and Timestamp columns from the Excel sheet into a DataFrame. I worked out the city count, i.e. how many rows contain the same city, using cities_df['Count'] = df['City_Town_Village '].value_counts()
After finding the city count I worked out the percentages of all the cities using cities_df['PctCnt'] = (cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Now my question: as I compute the city count, the number of rows in my dataframe shrinks. My df has 238 rows, but after the count it drops to 128. That is fine so far, since the shrinking is just a consequence of the count.
I also have the Timestamp column inside my df. Say for the city Delhi some people registered on 28-May-2021 and some on 29-May-2021, and so on. But after finding the city count, my df only shows the timestamp for the earliest date, i.e. 28-May.
I don't know why this is happening. What I actually want is to segregate the data into two weeks and plot a graph week-wise, and also the city percentages.
Here is my Excel file
This is the code I'm using:
import pandas as pd
df = pd.read_excel('PCS_NWR_Sheet.xlsx')
df.head()
pd.set_option('display.max_rows', 300)
cities_df = pd.DataFrame()
cities_df['Count'] = df['City_Town_Village '].value_counts()
cities_df.index.names=['City']
cities_df.reset_index(inplace = True)
cities_df['Timestamp'] = df['Timestamp']
column_names = ['Timestamp', 'City', 'Count']
cities_df = cities_df.reindex(columns=column_names)
cities_df['PctCnt'] =(cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Metro_list = ['Hyderabad', 'Kolkata', 'Delhi', 'Pune', 'Bengaluru', 'Noida', 'Kanpur', 'Gurgaon']
top_metro=cities_df[cities_df['City'].isin(Metro_list)]
top_metro
.value_counts() will return a series whose size equals the number of unique elements in what you are counting, so you are getting fewer rows because it is grouping those elements.
I can think of two ways to solve the question (if I understand it right).
Do .value_counts() on both the date column and the city column.
df[['City_Town_Village','Date']].value_counts()
If you don't currently have your timestamps as dates, you'll need to make a date column first (you probably won't be able to group on datetimes, since the times will vary). This will give a series whose row count equals the number of existing combinations of the two columns.
Make a separate dataframe with the value_counts of the town column, and merge them. That is, if you want a column in your main dataframe with the number of times the town appears anywhere in the data, that count series is a different size (as noted above), so you store it separately and bring it back in as needed.
df2 = pd.DataFrame(df['City_Town_Village'].value_counts())
df2.reset_index(inplace=True) # by default, building the df from value_counts() makes the city/town the index; this restores a normal index
df2.columns = ['City_Town_Village','Count'] #rename the columns
df = df.merge(df2,how='left',on='City_Town_Village')
This adds a Count column to df, holding the number of times each City/Town appears in the original dataset.
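For the week-wise split mentioned in the question, one possible sketch (assuming the Timestamp column parses as datetimes, and using the column name as written in the code above) is to count rows per city and calendar week with a Grouper:
# convert to real datetimes, then count registrations per week and city
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
weekly = (df.groupby([pd.Grouper(key='Timestamp', freq='W'), 'City_Town_Village'])
            .size()
            .reset_index(name='WeeklyCount'))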

How do I merge data between two pandas data frames where one data frame has duplicate index values

I have two data frames loaded into Pandas. Each data frame holds property information indexed by a 'pin' unique to a particular parcel of land.
The first data frame (df1) represents historic sales data. Because properties can be sold multiple times, index values (the 'pin') repeat: each time a property was sold there is a row with the parcel's 'pin' as the index. If the property was sold once in the data set, the index/'pin' is unique; if it was sold 5 times, the index/'pin' occurs 5 times in the data set.
The second data frame (df2) is a property record. Again they are indexed by the unique parcel pin, but because this data frame is a record of each property, the value_counts() for each index value is 1 (i.e. index values do not repeat).
I would like to add data to df1 or create a new data frame which keeps all data from df1 intact, but adds values from df2 based upon matching index values.
For Example: df1 has columns ['SALE_YEAR', 'SALE_VALUE'] - where there can be multiple rows with the same index value. df2 has columns ['Address', 'SQFT'], where the index values are all unique within the data frame. I want to add 'Address' & 'SQFT' data points to df1 by matching the index values.
Merge() and concat() do not seem to work. I believe this is because the syntax has a hard time matching df2 values to multiple df1 rows.
Visual Example:
Thank you for the help.
Try this:
import pandas as pd
merged_df = pd.merge(left=df1, right=df2, on='PIN', how='left')
If that still isn't working, maybe the PIN columns' datatypes do not match.
df1['PIN'] = df1['PIN'].astype(int)
df2['PIN'] = df2['PIN'].astype(int)
merged_df = pd.merge(left=df1, right=df2, on='PIN', how='left')
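Since the question says 'pin' is the index of both frames rather than a regular column (an assumption about how the data is set up), a variant of the same merge can join on the index directly:
# join df2's one-row-per-pin columns onto every matching row of df1, matching on the index of both frames
merged_df = pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left')
# or, equivalently, using join (a left join on the index by default)
merged_df = df1.join(df2)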

how to combine province data into one row country data in python

This is my first question here. I have COVID data in Python: the distribution of COVID cases across several provinces in each country.
What should I do if I want each country to have only one row of data, with the province column dropped?
Thank you
You need to drop the province column, group by date and country, and apply sum() as the aggregate function:
e.g. if you have a dataframe df with columns date, country, province, and cases, then you could do:
grouped_df = df.drop('province', axis=1).groupby(['date', 'country']).sum()
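A minimal sketch with made-up data (the column names are assumed from the answer above) showing the collapse to one row per country per date:
import pandas as pd

df = pd.DataFrame({
    'date':     ['2020-06-29', '2020-06-29', '2020-06-29'],
    'country':  ['Canada', 'Canada', 'Cuba'],
    'province': ['Ontario', 'Quebec', None],
    'cases':    [100, 50, 7],
})

grouped_df = df.drop('province', axis=1).groupby(['date', 'country']).sum()
print(grouped_df)
#                     cases
# date       country
# 2020-06-29 Canada     150
#            Cuba         7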
