I am trying to pivot the Johns Hopkins Data so that date columns are rows and the rest of the information stays the same. The first seven columns should stay columns, but the remaining columns (date columns) should be rows. Any help would be appreciated.
Load and Filter data
import pandas as pd
import numpy as np
deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
dea = pd.read_csv(deaths_url)
con = pd.read_csv(confirmed_url)
dea = dea[(dea['Province_State'] == 'Texas')]
con = con[(con['Province_State'] == 'Texas')]
View recency of data and pivot
# get the most recent date in the data
mostRecentDate = con.columns[-1]  # the last column is the latest date
# show the data frame
con.sort_values(by=mostRecentDate, ascending = False).head(10)
# save this index variable to preserve the column order.
index = data.columns.drop(['Province_State'])
# The pivot_table method will eliminate duplicate entries from Countries with more than one city
data.pivot_table(index = 'Admin2', aggfunc = sum)
# formatting using a variety of methods to process and sort data
finalFrame = data.transpose().reindex(index).transpose().set_index('Admin2').sort_values(by=mostRecentDate, ascending=False).transpose()
The resulting data frame looks like this; however, it did not preserve any of the datetimes.
I have also tried:
date_columns = con.iloc[:, 7:].columns
con.pivot(index = date_columns, columns = 'Admin2', values = con.iloc[:, 7:])
ValueError: Must pass DataFrame with boolean values only
Edit:
As per the guidance, I tried the melt command listed in the first answer, but it did not create rows of dates; it just removed all the other non-date values.
date_columns = con.iloc[:, 7:].columns
con.melt(id_vars=date_columns)
The end result should look like this:
Date iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ Combined_Key
1/22/2020 US USA 840 48001 Anderson Texas US 31.81534745 -95.65354823 Anderson, Texas, US
1/22/2020 US USA 840 48003 Andrews Texas US 32.30468633 -102.6376548 Andrews, Texas, US
1/22/2020 US USA 840 48005 Angelina Texas US 31.25457347 -94.60901487 Angelina, Texas, US
1/22/2020 US USA 840 48007 Aransas Texas US 28.10556197 -96.9995047 Aransas, Texas, US
Use pandas melt. There is a great example in the pandas documentation.
Example:
In [41]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
....: 'last': ['Doe', 'Bo'],
....: 'height': [5.5, 6.0],
....: 'weight': [130, 150]})
....:
In [42]: cheese
Out[42]:
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
In [43]: cheese.melt(id_vars=['first', 'last'])
Out[43]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [44]: cheese.melt(id_vars=['first', 'last'], var_name='quantity')
Out[44]:
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In your case, you need to be operating on the dataframe that actually holds your date columns (i.e. con, finalFrame, or wherever they are), and id_vars should be the identifier columns you want to keep as columns, so that the date columns are the ones melted into rows. For example:
con.melt(id_vars=con.columns[:7])
See specific example here.
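For the Johns Hopkins frame specifically, a slightly fuller sketch (assuming, as in your description, that the first seven columns are the identifier columns and everything after them is a date) might look like:
id_cols = list(con.columns[:7])                       # columns to keep as-is
long_con = con.melt(id_vars=id_cols, var_name='Date', value_name='Confirmed')
long_con['Date'] = pd.to_datetime(long_con['Date'])   # optional: parse the date strings
long_con = long_con.sort_values(['Date', 'Admin2'])   # one row per county per date
The var_name and value_name labels are just choices here; melt would otherwise default to 'variable' and 'value'.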
Related
I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (like the one from Excel) that will add up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There can be multiple ways of solving this. One option is to use pandas concat to join the required columns and then use groupby:
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'}),
])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
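For reference, the intermediate merged_df (before the groupby) simply stacks the two salesperson/amount pairs on top of each other, with the original row labels repeated:
  Salesperson  Amount
0        John    1000
1        John    1000
2         Bob       0
3     Richard     500
0         Bob     400
1     Richard     200
2        John     300
3         Tom     500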
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'}),
    df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3': 'Salesperson', 'Amount3': 'Amount'}),
])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson': 'Salesperson 1', 'Amount': 'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson', 'Amount'], i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you the required output:
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index="Client",
names_to=".value",
names_pattern=r"([a-zA-Z]+).*",
)
.groupby("Salesperson", as_index = False)
.Amount
.sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only the part of each column name that matches the capture group as the header. The column names share a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. That pattern is captured in names_pattern; .value is paired with the regex group in parentheses, and whatever falls outside the group is discarded.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.
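To see what that regex keeps as a header, you can run the pattern over the column names directly (a small illustration only, not part of the pivot_longer call):
import re
for col in ['Salesperson', 'Amount', 'Salesperson 2', 'Amount2']:
    print(col, '->', re.match(r'([a-zA-Z]+).*', col).group(1))
# Salesperson -> Salesperson, Amount -> Amount,
# Salesperson 2 -> Salesperson, Amount2 -> Amount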
I have a dataframe of people's addresses and names. I have a function that processes names that I want to apply. I am creating sub selections of people with matching addresses and applying the function to those groups.
To this point I have been using .loc as follows:
for x in df['address'].unique():
sub_selection = df.loc[df['address'] == x]
sub_selection.apply(lambda x: function(x), axis = 1)
Is there a more efficient way to approach this? I am looking into pandas' .groupby() functionality, but I am struggling to get it to work.
df.groupby('address').agg(lambda x: function(x['names']))
Here is some sample data:
address, name, Unique_ID
1022 Boogie Woogie Ave, John Smith, np.nan
1022 Boogie Woogie Ave, Frederick Smith, np.nan
1022 Boogie Woogie Ave, John Jacob Smith, np.nan
3030 Sesame Street, Big Bird, np.nan
3030 Sesame Street, Elmo, np.nan
3030 Sesame Street, Big Yellow Bird, np.nan
My function itself has some moving parts, but basically I check the name against a reference dictionary I create. This process goes through a few other steps, but returns a list of indexes where the name matches. I use those indexes to assign a shared unique id to matching names. In my example, Big Bird and Big Yellow Bird would match.
def function(x):
    match_list = []
    if x['name'] in __lookup_dict[0]:
        match_list.append(__lookup_dict[0][x['name']])
    # reduce the matching lists to a single set of place ids shared by all elements
    result = set(match_list[0])
    for s in match_list[1:]:
        if len(result.intersection(s)) != 0:
            result.intersection_update(s)
    # take the reduced set and assign each place id a unique id
    # note we are working with place ids, not the sub df's index; they don't match
    if pd.isnull(x['Unique_ID']):
        uid = str(uuid.uuid4())
        for g in result:
            df.at[df.index[df.index == g].tolist()[0], 'Unq_ID'] = uid
    else:
        pass
    return result
Try using
df.groupby('address').apply(lambda x: function(x['names']))
Edited:
Check this example. I've used a dataframe from another StackOverflow question.
import pandas as pd
df = pd.DataFrame({
"City":["Delhi","Delhi","Mumbai","Mumbai","Lahore","Lahore"],
"Points":[90.1,90.3,94.1,95,89,90.5],
"Gender":["Male","Female","Female","Male","Female","Male"]
})
d = {k:v for v,k in enumerate(df.City.unique())}
df['idx'] = df['City'].replace(d)
print(df)
Output:
City Points Gender idx
0 Delhi 90.1 Male 0
1 Delhi 90.3 Female 0
2 Mumbai 94.1 Female 1
3 Mumbai 95.0 Male 1
4 Lahore 89.0 Female 2
5 Lahore 90.5 Male 2
So, try using
d = {k:v for v,k in enumerate(df['address'].unique())}
df['idx'] = df['address'].replace(d)
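If all you need is a shared numeric id per group of matching addresses, groupby().ngroup() is a one-liner alternative (a sketch, assuming your frame has an 'address' column as in the sample data):
df['idx'] = df.groupby('address').ngroup()  # same integer for every row sharing an address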
I'm facing the following situation. I have a dataframe which looks like this (due to sensitivity of the data I have to paraphrase it)
Column A Column B
A1B12C123 Japan
A2B34C456 Switzerland
A3B45C789 Japan
A1B15C729 Japan
My goal is to group Column A by the recurring pattern, which describes a certain property.
Meaning: Group by A1, Group by B12, Group by C123.
In order to do that, I split the Column and created new ones for each level of hierarchy, e.g.:
Column A Column B Column C
A1 B12 C123
A2 B34 C456
A3 B45 C789
A1 B15 C729
Those columns I have to add to my existing dataframe, and then I'll be able to group the way I wanted to.
I think this can work, but it seems a bit tedious and inelegant.
Is there a possibility or a way in Pandas to do this more elegantly?
I'd be happy about any input on that matter.
Best regards
Taking Seyi Daniel's idea from the comments, you can use the extractall() string method on Column A to explode it based on regex groups and join Column B onto it.
import pandas as pd
from io import StringIO
data = StringIO("""
Column_A Column_B
A1B12C123 Japan
A2B34C456 Switzerland
A3B45C789 Japan
A1B15C729 Japan
""")
df = pd.read_csv(data, delim_whitespace=True)
regex_df = df["Column_A"].str.extractall(r"(A\d*)|(B\d*)|(C\d*)")
# drop extra levels
regex_s = regex_df.stack().reset_index((1,2), drop=True)
# give the new column a name
regex_s.name = "group"
# add column B
result = pd.merge(regex_s, df["Column_B"], left_index=True, right_index=True)
print(result)
group Column_B
0 A1 Japan
0 B12 Japan
0 C123 Japan
1 A2 Switzerland
1 B34 Switzerland
1 C456 Switzerland
2 A3 Japan
2 B45 Japan
2 C789 Japan
3 A1 Japan
3 B15 Japan
3 C729 Japan
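If you would rather keep the three hierarchy levels as separate columns (as you sketched in the question) instead of exploding to long form, str.extract with one named group per level is a compact alternative (a sketch; the group names A, B, C are just placeholders):
levels = df['Column_A'].str.extract(r'(?P<A>A\d+)(?P<B>B\d+)(?P<C>C\d+)')
df_with_levels = df.join(levels)  # adds columns A, B, C alongside Column_A and Column_B
# df_with_levels.groupby('A') (or 'B', 'C') now groups by the chosen level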
I have a df named population with a column of country names. I want to merge rows so they reflect regions (Africa, West Hem, Asia, Europe, Mideast). I have another df named regionref from Kaggle that has all the countries of the world and the region each is associated with.
How do I create a new column in the population df that has the corresponding region for each country in the country column, using the region column from the Kaggle dataset?
so essentially this is the population dataframe
CountryName 1960 1950 ...
US
Zambia
India
And this is the regionref dataset
Country Region GDP...
US West Hem
Zambia Africa
India Asia
And I want the population df to look like
CountryName Region 1960 1950 ...
US West Hem
Zambia Africa
India Asia
EDIT: I tried the concatenation, but for some reason the two columns are not recognizing the same values:
population['Country Name'].isin(regionref['Country']).value_counts()
This returned False for all values, as in there are no values in common.
And this is the output; as you can see, there are values in common.
You just need a join, or to put it in pandas terms, a concatenation.
Given two DataFrames pop, region:
pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]], columns=['CountryName', 1950, 1960])
CountryName 1950 1960
0 US 1000 2000
1 CN 2000 3000
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']], columns = ['Country', 'Region', 'GDP'])
Country Region GDP
0 US AMER 5
1 CN ASIA 4
You can do:
pd.concat([region.set_index('Country'), pop.set_index('CountryName')], axis=1)\
    .drop('GDP', axis=1)
Region 1950 1960
US AMER 1000 2000
CN ASIA 2000 3000
The axis=1 is for concatenating horizontally. You have to set the country column as the index on both frames so they join correctly.
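If you only want the Region column added to population (and to keep every population row even when a country has no match), a left merge is an alternative; this is a sketch assuming the column names from the question, and the str.strip() calls are there because stray whitespace is a common reason the isin check from the edit returns all False:
population['CountryName'] = population['CountryName'].str.strip()  # clean up whitespace before matching
regionref['Country'] = regionref['Country'].str.strip()
population = population.merge(
    regionref[['Country', 'Region']],
    left_on='CountryName', right_on='Country', how='left',
).drop(columns='Country')  # drop the duplicate country column from regionref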
I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1= df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate each country's percentage of the total of the "Sold" column.
You can get the percentage by adding this code
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by the per-country sums, kept at the same length as the original DataFrame by using transform. Note that this gives each row's share within its own country (the rows for each country sum to 1), rather than each country's share of the grand total; multiply by 100 if you want percentages.
df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple Solution
You were almost there.
First you need to group by country
Then create the new percentage column (by dividing the grouped sales by the sum of all sales):
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2