Heatmap correlation using values of a column - python

Suppose I have the following data of repeat observations for US states with some value of interest:
US_State Value
Alabama 1
Alabama 10
Alabama 9
Michigan 8
Michigan 9
Michigan 2
...
How can I generate pairwise correlations for Value between all the US_State combinations? I've tried a few different things (pivot, groupby, and more), but I can't seem to wrap my head around the proper approach.
The ideal output would look like:
          Alabama  Michigan  ...
Alabama         1       0.5
Michigan      0.5         1
...

There is a way to do this using pandas alone, but only under the assumption that each state in the input dataset has the same number of observations; otherwise the correlation coefficient does not really make sense and the results get a bit funky.
import pandas as pd

df = pd.DataFrame()
df['US_State'] = ["Alabama", "Alabama", "Alabama", "Michigan", "Michigan", "Michigan", "Oregon", "Oregon", "Oregon"]
df['Value'] = [1, 10, 9, 8, 9, 2, 6, 1, 2]

# collect lists per state, make states the columns, expand lists into rows, correlate
pd.DataFrame(df.groupby("US_State")['Value'].apply(list)).T.apply(lambda x: pd.Series(*x), axis=0).corr()
which results in
US_State    Alabama  Michigan    Oregon
US_State
Alabama    1.000000 -0.285578 -0.996078
Michigan  -0.285578  1.000000  0.199667
Oregon    -0.996078  0.199667  1.000000
What the code does: it collects the data for each state into a single cell as a list, transposes the dataframe so that the states become columns, and then expands each cell's list into rows, one per observation. At that point you can just call the standard corr() method of the pandas DataFrame.
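An equivalent and arguably more readable sketch (again assuming equal observation counts per state) numbers the observations within each state and pivots the states into columns:
# number observations within each state, pivot states into columns, correlate
df['obs'] = df.groupby('US_State').cumcount()
wide = df.pivot(index='obs', columns='US_State', values='Value')
print(wide.corr())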

Pandas DataFrame has a built-in correlation matrix method. You will need to get your data into a DataFrame first; the constructor takes NumPy objects, a plain dict (shown), etc.
from pandas import DataFrame

data = {'AL': [1, 10, 9],
        'MI': [8, 9, 2],
        'CO': [11, 5, 17]}
df = DataFrame(data)
corrMatrix = df.corr()
print(corrMatrix)

# optional heatmap
import seaborn as sn
import matplotlib.pyplot as plt

sn.heatmap(corrMatrix, annot=True, cmap='coolwarm')
plt.show()
          AL        MI        CO
AL  1.000000 -0.285578 -0.101361
MI -0.285578  1.000000 -0.924473
CO -0.101361 -0.924473  1.000000
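For the long format in the original question, one possible way to build such a dict (a sketch, again assuming equal observation counts per state) is to collect each state's values with groupby:
# 'df' here is the long-format frame with 'US_State' and 'Value' columns
data = df.groupby('US_State')['Value'].apply(list).to_dict()
corrMatrix = DataFrame(data).corr()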

Related

How to Calculate Win Percentage In Pandas Library?

HomeTeamName AwayTeamName HomeTeamGoals AwayTeamGoals
0 France Mexico 4.0 1.0
1 USA Belgium 3.0 0.0
2 Yugoslavia Brazil 2.0 1.0
3 Romania Peru 3.0 1.0
4 Argentina France 1.0 0.0
A percentage is calculated by dividing a value by the sum of all the values and then multiplying the result by 100. This works directly on pandas DataFrames. Here, the built-in sum() method of a pandas Series is used to compute the sum of all the values of a column.
Syntax: Series.sum()
Return: Returns the sum of the values.
Formula:
df['percent'] = (df['column_name'] / df['column_name'].sum()) * 100
For example:
# Import the required library
import pandas as pd

# Raw scores
data = {'Name': ['abc', 'bcd', 'cde', 'def', 'efg', 'fgh', 'ghi'],
        'Math_score': [52, 87, 49, 74, 28, 59, 48]}

# Create a DataFrame
df1 = pd.DataFrame(data, columns=['Name', 'Math_score'])

# Calculate each score as a percentage of the column total
df1['percent'] = (df1['Math_score'] / df1['Math_score'].sum()) * 100

# Show the dataframe
df1
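To answer the original win-percentage question from the match table above, here is a minimal sketch (assuming a "win" means scoring more goals than the opponent, and counting draws as non-wins):
import pandas as pd

matches = pd.DataFrame({
    'HomeTeamName': ['France', 'USA', 'Yugoslavia', 'Romania', 'Argentina'],
    'AwayTeamName': ['Mexico', 'Belgium', 'Brazil', 'Peru', 'France'],
    'HomeTeamGoals': [4.0, 3.0, 2.0, 3.0, 1.0],
    'AwayTeamGoals': [1.0, 0.0, 1.0, 1.0, 0.0],
})

# One row per team per match, flagging whether that team won
home = pd.DataFrame({'team': matches['HomeTeamName'],
                     'won': matches['HomeTeamGoals'] > matches['AwayTeamGoals']})
away = pd.DataFrame({'team': matches['AwayTeamName'],
                     'won': matches['AwayTeamGoals'] > matches['HomeTeamGoals']})
games = pd.concat([home, away], ignore_index=True)

# Win percentage = wins / games played * 100
win_pct = games.groupby('team')['won'].mean() * 100
print(win_pct)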

How to calculate weighted mean and median in python?

I have data in a pandas DataFrame or NumPy array and want to calculate the weighted mean (average) or weighted median based on weights in another column or array. I am looking for a simple solution rather than writing functions from scratch or copy-pasting them everywhere I need them.
The data looks like this -
state.head()
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
And I want to calculate the weighted mean or median of murder rate which takes into account the different populations in the states.
How can I do that?
First, install the weightedstats library:
pip install weightedstats
Then import it and do the following:
import weightedstats as ws
Weighted Mean
ws.weighted_mean(state['Murder.Rate'], weights=state['Population'])
4.445833981123394
Weighted Median
ws.weighted_median(state['Murder.Rate'], weights=state['Population'])
4.4
It also has dedicated weighted mean and median methods for NumPy arrays. The methods above will work on arrays too, but these are there in case you need them.
my_data = [1, 2, 3, 4, 5]
my_weights = [10, 1, 1, 1, 9]
ws.numpy_weighted_mean(my_data, weights=my_weights)
ws.numpy_weighted_median(my_data, weights=my_weights)
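If you only need the weighted mean, NumPy's built-in np.average accepts weights directly, so no extra dependency is required (a small sketch using the five rows shown above):
import numpy as np
import pandas as pd

state = pd.DataFrame({'Murder.Rate': [5.7, 5.6, 4.7, 5.6, 4.4],
                      'Population': [4779736, 710231, 6392017, 2915918, 37253956]})

# weighted mean without any extra library
print(np.average(state['Murder.Rate'], weights=state['Population']))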

Python compute correlation of a single variable between groups

I'd like to compute the correlation of the variable "hours" between two groups in a panel data. Specifically, I'd like to compute the correlation of hours between groups A and B with group C. So the end result would contain two numbers: corr(hours_A, hours_C), and corr(hours_B, hours_C).
I have tried:
data.groupby('group').corr()
But it gave me the correlation between "hours" and "other variables" within each group, but I want the correlation of just the "hours" variable across two groups. I'm new to Python, so any help is welcome!
group  year  hours  other variables
A      2000  2784   567
A      2001  2724   567
A      2002  2715   567
B      2000  2301   567
B      2001  2612   567
B      2002  2489   567
C      2000  2190   567
C      2001  2139   567
C      2002  2159   567
Update:
Thank you for answering my question!
I eventually figured out some code of my own, but my code is not as elegant as the answers provided. For what it's worth, I'm posting it here.
df = df.set_index(['group','year'])
df = df.unstack(level=0)
df.index = pd.to_datetime(df.index).year
df.columns = df.columns.rename(['variables',"group"])
df.xs('hours', level="variables", axis=1).corr()
Indexing year isn't necessary for the correlation, but if I want to create cross sections of the data later, it might come in handy.
Maybe it is not the best way to do it, but I believe this will get you on your way.
import pandas as pd

# keep only the needed columns, then pivot the groups into columns
data = data[['group', 'year', 'hours']]
data_new = data.set_index(['year', 'group']).unstack('group')
final_df = pd.DataFrame(data_new.to_numpy(), columns=['A', 'B', 'C'])
final_df.corr()
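A small variant (just a sketch) that keeps the group labels automatically instead of hardcoding ['A', 'B', 'C']:
# select the hours column as a Series, then unstack the groups
wide = data.set_index(['year', 'group'])['hours'].unstack('group')
print(wide.corr())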
I will also leave the code that (I think) reproduces your problem, for anyone who wishes to give it a try!
import pandas as pd
import numpy as np
data_str = '''A|2000|2784|567
A|2001|2724|567
A|2002|2715|567
B|2000|2301|567
B|2001|2612|567
B|2002|2489|567
C|2000|2190|567
C|2001|2139|567
C|2002|2159|567'''.split('\n')
data = pd.DataFrame([x.split('|') for x in data_str], columns=['group', 'year', 'hours', 'other_variables'])
data['hours'] = data['hours'].astype(int)
You can apply list to the groups and then convert to Series, transpose, and then call corr() on the data.
from io import StringIO
import pandas as pd

>>> data = StringIO("""group,year,hours,other_variables
A,2000,2784,567
A,2001,2724,567
A,2002,2715,567
B,2000,2301,567
B,2001,2612,567
B,2002,2489,567
C,2000,2190,567
C,2001,2139,567
C,2002,2159,567""")
>>> df = pd.read_csv(data)
>>> df.groupby('group')['hours'].apply(list).apply(pd.Series).T.corr()
group         A         B         C
group
A      1.000000 -0.865940  0.867835
B     -0.865940  1.000000 -0.999993
C      0.867835 -0.999993  1.000000
How does this work? The groupby + apply(list) produces the following, which is a Series with three rows, each being a list of three items.
A [2784, 2724, 2715]
B [2301, 2612, 2489]
C [2190, 2139, 2159]
The apply(pd.Series) converts the list in each row to a series. You then have to transpose with the T operator to get the data for each group in a single column.
       0     1     2
group
A   2784  2724  2715
B   2301  2612  2489
C   2190  2139  2159
transposed is
group     A     B     C
0      2784  2301  2190
1      2724  2612  2139
2      2715  2489  2159
If you only want the two values, it would be
>>> df.groupby('group')['hours'].apply(list).apply(pd.Series).T.corr().iloc[1:3,0].values
array([-0.86594029, 0.86783525])
In this example, you use iloc to get the second and third rows in the first column (python indexes are zero-based) and then the values property of a Series to return an array rather than a Series.
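For completeness, a shorter route to exactly the two numbers the question asks for (a sketch, reusing the df built above) is to pivot the groups into columns and read the C column of the correlation matrix:
# pivot so each group's hours become a column, correlate, and pull out
# corr(hours_A, hours_C) and corr(hours_B, hours_C)
wide = df.pivot(index='year', columns='group', values='hours')
print(wide.corr().loc[['A', 'B'], 'C'])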

Python - Group rows in list in pandas dataframe

I have a dataframe like this:
  long   lat  Place
-6.779  61.9  Aarhus
-6.790  62.0  Aarhus
54.377  24.4  Dhabi
38.834   9.0  Addis
35.698   9.2  Addis
Is it possible to transform the dataframe into a format like below?
Office long + lat
Aarhus [[-6.779,61.9], [-6.790,62.0]]
Dhabi [[54.377]]
Addis [[38.834,9.0], [35.698,9.2]]
I tried different methods but still couldn't work this out. This is
what I tried to get a list for each distinct place value:
df2["index"] = df2.index
df2["long"] = df2.groupby('index')['long'].apply(list)

list1 = []
for values in ofce_list:
    if df['Office'].any() == values:
        list1.append(df.loc[df['Office'] == values, 'long'])
But this returned a series in a list instead, which is not desired. Please help. Thank you so much.
(df.groupby('Place')[['long', 'lat']]
   .apply(lambda x: x.values.tolist())
   .reset_index(name='long + lat'))
Out[1380]:
    Place                       long + lat
0  Aarhus  [[-6.779, 61.9], [-6.79, 62.0]]
1   Addis   [[38.834, 9.0], [35.698, 9.2]]
2   Dhabi     [[54.376999999999995, 24.4]]
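An equivalent sketch that builds the [long, lat] pairs explicitly with zip, in case the .values route feels too implicit:
# list comprehension over the paired columns within each group
out = (df.groupby('Place')
         .apply(lambda g: [[lo, la] for lo, la in zip(g['long'], g['lat'])])
         .reset_index(name='long + lat'))
print(out)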

How to operate over pandas dataframe columns where the column name is a datetime string with a suffix?

Ultimately, I am trying to calculate and then graph the price elasticity of demand for housing from 2010 to 2016 across 290 U.S. Counties.
Elasticity of demand is equal to the % change in quantity divided by the % change in price.
The data I have is from Zillow: 'InventoryMeasure_County_Public.csv' and 'City_Zhvi_AllHomes.csv'.
The data are in a multi-indexed pandas DataFrame containing time series data. It looks like this, although with 55 columns and 290 rows:
PQ = pd.DataFrame({"2010q1_x": [0.1, 0.2, 0.3],
                   "2010q2_x": [0.2, 0.2, 0.2],
                   # ... goes to 2016q4_x
                   "2010q1_y": [2.1, 2.2, 2.3],
                   "2010q2_y": [1.2, 1.2, 1.3],
                   # ... goes to 2016q4_y
                   },
                  index=pd.MultiIndex.from_tuples(
                      [('Alabama', 'Huntsville'), ('Alabama', 'Rainbow City'),
                       # ... for all 50 States
                       ('Wyoming', 'Burton County'), ('Wyoming', 'Joe Falls')],
                      names=['State', 'County']))
I seem to only be able to perform one operation at a time. For example:
PQ['2010q1_x'].div(PQ['2010q1_y'])
yields:
State County
Alabama Madison -0.017560
Mobile -0.112925
Shelby -0.100689
Tuscaloosa 0.319638
Alaska Anchorage 0.261926
Juneau 0.099720
Arizona Maricopa -0.003240
Pima 0.098894
Yuma -1.982047
# ... and so on.
Which is perfect, it's exactly what I need. I just need to do the operation over each of the 55 columns without having to write 55 expressions.
I would like to write something like this:
(PQ['20{}q{}_x'.format([x for x in range(10,17)],[x for x in range(1,5)])])
.div(PQ['20{}q{}_y'.format([x for x in range(10,17)],[x for x in range(1,5)])])
However, when I run the above code, there is a key error:
KeyError: '20[10, 11, 12, 13, 14, 15, 16]q[1, 2, 3, 4]_x'
I found these, however, they didn't give me anything conclusive.
Python documentation for operators
Can generators be used with .format?
Adding Dataframes with same column names
I also tried converting the columns of the DataFrame to NumPy arrays, and I was able to operate across both datasets successfully; however, when I attempted to add the results back to the multi-indexed DataFrame, the results were all NaN.
I also tried 'de-multi-indexing': I changed the index to the tuples of the State, County pairs to see if the problem was with the multi-index.
Hopefully I have been relatively clear in explaining this - my end goal is really very simple, and I am sure I'm just over-thinking this.
Thank you in advance for your help.
Let's use .str.extract to group the columns, then divide.
Input:
print(PQ)
2010q1_x 2010q1_y 2010q2_x 2010q2_y
State County
Alabama Huntsville 0.1 2.1 0.2 1.2
Rainbow City 0.2 2.2 0.2 1.2
Wyoming Burton County 0.3 2.3 0.2 1.3
df_out = (PQ.groupby(by=PQ.columns.str.extract(r'(\d{4}q\d)', expand=False), axis=1)
            .apply(lambda x: x.iloc[:, 0].div(x.iloc[:, 1])))
print(df_out)
Output:
2010q1 2010q2
State County
Alabama Huntsville 0.047619 0.166667
Rainbow City 0.090909 0.166667
Wyoming Burton County 0.130435 0.153846
I like the groupby() approach by @Scott Boston, but here's another way.
import pandas as pd

PQ = pd.DataFrame({"2010q1_x": [0.1, 0.2, 0.3],
                   "2010q2_x": [0.2, 0.2, 0.2],
                   "2010q1_y": [2.1, 2.2, 2.3],
                   "2010q2_y": [1.2, 1.2, 1.3]},
                  index=pd.MultiIndex.from_tuples(
                      [('Alabama', 'Huntsville'), ('Alabama', 'Rainbow City'),
                       ('Wyoming', 'Burton County')],
                      names=['State', 'County']))
print (PQ)
2010q1_x 2010q1_y 2010q2_x 2010q2_y
State County
Alabama Huntsville 0.1 2.1 0.2 1.2
Rainbow City 0.2 2.2 0.2 1.2
Wyoming Burton County 0.3 2.3 0.2 1.3
Using pandas filter we can divide the "_x" columns by the values of the "_y" columns:
eods = PQ.filter(like='_x') / PQ.filter(like='_y').values
after some column name cleanup it yields
eods.columns = eods.columns.str.replace('_x','_eod')
print (eods)
2010q1_eod 2010q2_eod
State County
Alabama Huntsville 0.047619 0.166667
Rainbow City 0.090909 0.166667
Wyoming Burton County 0.130435 0.153846
I don't know if there is a more elegant way to do this, but this method works.
# create an empty DataFrame whose columns pair each numerator ('_x')
# with its matching denominator ('_y')
result = pd.DataFrame(
    None,
    columns=pd.MultiIndex.from_tuples(
        list(zip(['20{}q{}_x'.format(x, y) for x in range(10, 17) for y in range(1, 5)],
                 ['20{}q{}_y'.format(x, y) for x in range(10, 17) for y in range(1, 5)])),
        names=['numerator', 'denominator']),
    index=PQ.index)

# fill the result: each column divides its numerator by its denominator
result = result.apply(lambda s: PQ[s.name[0]].div(PQ[s.name[1]]))
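One more sketch along the same lines: split the frame by suffix, strip the suffixes so the two halves align by column name, and divide directly. Label-based alignment avoids relying on column order:
# '_x' over '_y', matched on the shared '2010q1'-style stem
x = PQ.filter(like='_x').rename(columns=lambda c: c[:-2])
y = PQ.filter(like='_y').rename(columns=lambda c: c[:-2])
elasticity = x / y
print(elasticity)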
