I couldn't quite find a consensus answer for this question, or one that fits my needs. I have data in three columns of a text file: X, Y, and Z; the columns are tab-delimited. I would like to make a heatmap representation of these data with Python, where each (X, Y) position is shaded by the value in Z, which ranges from 0 to 1 (a discrete probability of X and Y). I tried seaborn's heatmap and matplotlib's pcolormesh, but unfortunately these need 2D data arrays.
My data runs through X from 1 to 37 for each constant Y, then Y increases in steps of 0.1. The maximum Y fluctuates between data sets, but the minimum Y is always 0. Schematically:
[X Y Z] row1[1...37 0.0000 Zvalue], row2[1...37 0.1000 Zvalue] etc.
import numpy as np
import pandas as pd
import seaborn as sns

sns.set()
# loadtxt already returns a float array, so no cast is needed
df = np.loadtxt("file.txt", delimiter="\t")
Any tips for next steps?
If I understand you correctly, you have three columns, with X and Y denoting the position of a value Z.
Consider the following example. There are three columns: X and Y contain positional information (categories in this case) and Z contains the values for shading the heatmap.
x = np.array(['a','b','c','a','b','c','a','b','c'])
y = np.array(['a','a','a','b','b','b','c','c','c'])
z = np.array([0.3,-0.3,1,0.5,-0.25,-1,0.25,-0.23,0.25])
Then we create a dataframe from these columns and transpose them (so x, y and z actually become columns), give the columns names, and make sure Z_value is numeric.
df = pd.DataFrame(np.array([x, y, z]).T)
df.columns = ['X_value', 'Y_value', 'Z_value']
df['Z_value'] = pd.to_numeric(df['Z_value'])
resulting in this dataframe.
X_value Y_value Z_value
0 a a 0.30
1 b a -0.30
2 c a 1.00
3 a b 0.50
4 b b -0.25
5 c b -1.00
6 a c 0.25
7 b c -0.23
8 c c 0.25
From this you cannot create a heatmap directly; however, by calling df.pivot you can reshape the dataframe into a form that can be used for a heatmap (newer pandas versions require the keyword arguments here).
pivoted = df.pivot(index='Y_value', columns='X_value', values='Z_value')
The resulting dataframe looks like this.
X_value a b c
Y_value
a 0.30 -0.30 1.00
b 0.50 -0.25 -1.00
c 0.25 -0.23 0.25
You can then feed pivoted to sns.heatmap to create your heatmap.
sns.heatmap(pivoted, cmap='RdBu')
Resulting in this heatmap.
You may need to make some adjustments to the code for your precise needs, but since I had no example data to work from, I made up my own.
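Applied to the layout described in the question, the same approach might look like the sketch below. I'm assuming file.txt contains tab-separated X, Y, Z columns with no header row, since I don't have the actual file:
import pandas as pd
import seaborn as sns

# assumed layout: tab-separated X, Y, Z columns, no header row
df = pd.read_csv("file.txt", sep="\t", header=None, names=["X", "Y", "Z"])

# pivot so rows are Y values, columns are X values, and cells hold Z
heat = df.pivot(index="Y", columns="X", values="Z")
sns.heatmap(heat, cmap="RdBu")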
I'm trying to add a slope calculation on individual subsets of two fields in a dataframe and have that slope value applied to all rows in each subset. (I've used the "slope" function in Excel previously, although I'm not married to that exact algorithm.) The "desired_output" field is what I'm expecting as the output. The subsets are distinguished by the "strike_order" column; each subset starts at 1 and has no specific highest value.
"IV" is the y value
"Strike" is the x value
Any help would be appreciated as I don't even know where to begin with this....
import pandas
df = pandas.DataFrame(
    [[1200, 1, .4, 0.005], [1210, 2, .35, 0.005], [1220, 3, .3, 0.005],
     [1230, 4, .25, 0.005], [1200, 1, .4, 0.003], [1210, 2, .37, .003]],
    columns=["strike", "strike_order", "IV", "desired_output"])
df
strike strike_order IV desired_output
0 1200 1 0.40 0.005
1 1210 2 0.35 0.005
2 1220 3 0.30 0.005
3 1230 4 0.25 0.005
4 1200 1 0.40 0.003
5 1210 2 0.37 0.003
Let me know if this isn't a well posed question and I'll try to make it better.
You can use NumPy's least squares.
We can rewrite the line equation y = mx + c as y = Ap, where A = [[x, 1]] and p = [[m], [c]]. Then use lstsq to solve for p, so we need to create A by adding a column of ones to df.
import numpy as np

df['ones'] = 1
A = df[['strike', 'ones']]
y = df['IV']
# rcond=None silences the FutureWarning in newer NumPy versions
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
Alternatively you can use scikit-learn's linear_model.LinearRegression.
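A minimal sketch of that alternative, fitting the same strike and IV columns; it should give the same slope and intercept as lstsq:
from sklearn import linear_model

reg = linear_model.LinearRegression()
# scikit-learn expects a 2D array of features, hence the double brackets
reg.fit(df[['strike']], df['IV'])
m, c = reg.coef_[0], reg.intercept_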
You can verify the result by plotting the data as a scatter plot and the fitted line on top:
import matplotlib.pyplot as plt

plt.scatter(df['strike'], df['IV'], color='r', marker='d')
x = df['strike']
# plug x into the equation y = mx + c
y_line = c + m * x
plt.plot(x, y_line)
plt.xlabel('Strike')
plt.ylabel('IV')
plt.show()
The resulting plot is shown below.
Try this.
First create a subset column by iterating over the dataframe, treating each row where strike_order returns to 1 as the start of a new subset:
#create subset column
subset_counter = 0
for index, row in df.iterrows():
    if row["strike_order"] == 1:
        df.loc[index, 'subset'] = subset_counter
        subset_counter += 1
    else:
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset']
df['subset'] = df['subset'].astype(int)
Then run a linear regression over each subset using groupby
# run linear regression on each subset of the dataframe using groupby
from sklearn import linear_model

model = linear_model.LinearRegression()
for group, df_gp in df.groupby('subset'):
    X = df_gp[['strike']]
    y = df_gp.IV
    model.fit(X, y)
    df.loc[df.subset == group, 'slope'] = model.coef_[0]
df
strike strike_order IV desired_output subset slope
0 1200 1 0.40 0.005 0 -0.005
1 1210 2 0.35 0.005 0 -0.005
2 1220 3 0.30 0.005 0 -0.005
3 1230 4 0.25 0.005 0 -0.005
4 1200 1 0.40 0.003 1 -0.003
5 1210 2 0.37 0.003 1 -0.003
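As an aside, since every subset starts at strike_order == 1, the subset column can also be built without an explicit loop; a vectorized sketch:
# each strike_order == 1 row starts a new subset, so a cumulative count
# of those rows (minus 1) yields 0-based subset labels
df['subset'] = df['strike_order'].eq(1).cumsum() - 1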
@Scott This worked, except that it assigned subset values 0 and 1 and then gave every subsequent subset the value 2. I added an extra conditional at the beginning and a rather clumsy "seed" value to stop it from looking up row -1.
seed = df.loc[0, "date_exp"]
#seed = "08/11/200015/06/2001C"
#print(seed)
subset_counter = 0
for index, row in df.iterrows():
    #if index['strike_order'] == 0:
    if row['date_exp'] == seed:
        df.loc[index, 'subset'] = 0
    elif row["strike_order"] == 1:
        df.loc[index, 'subset'] = subset_counter
        subset_counter = 1 + df.loc[index - 1, 'subset']
    else:
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset']
df['subset'] = df['subset'].astype(int)
This now does exactly what I want, although I think using the seed value is clunky; I would have preferred something like if row == 0. But it's Friday and this works.
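For what it's worth, that index-based variant might look like the following sketch (assuming the dataframe has a default integer index starting at 0; untested against the real data):
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'subset'] = 0
    elif row["strike_order"] == 1:
        # a new subset starts: previous subset label plus one
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset'] + 1
    else:
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset']
df['subset'] = df['subset'].astype(int)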
Cheers
I have a dataframe ( "df") equivalent to:
Cat Data
x 0.112
x 0.112
y 0.223
y 0.223
z 0.112
z 0.112
In other words, I have a category column and a data column; the data values do not vary within a category, but they may repeat between different categories (i.e. the values in categories 'x' and 'z' are both 0.112). This means that I need to select one data point from each category, rather than just subsetting on unique values of "Data".
The way I've done it is like this:
import pandas as pd

aLst = []
bLst = []
for i in df.index:
    if df.loc[i, 'Cat'] not in aLst:
        aLst += [df.loc[i, 'Cat']]
        bLst += [i]
new_series = pd.Series(df.loc[bLst, 'Data'])
Then I can do whatever I want with it. But the problem is this just seems like a clunky, un-pythonic way of doing things. Any suggestions?
I think you need drop_duplicates:
#by column Cat
print (df.drop_duplicates(['Cat']))
Cat Data
0 x 0.112
2 y 0.223
4 z 0.112
Or:
#by columns Cat and Data
print (df.drop_duplicates(['Cat','Data']))
Cat Data
0 x 0.112
2 y 0.223
4 z 0.112
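If, as in the loop in the question, the end goal is a Series of one data point per category, you can chain set_index onto the deduplicated frame; a sketch:
# one row per category, indexed by Cat, keeping only the Data values
new_series = df.drop_duplicates('Cat').set_index('Cat')['Data']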
I have a table of sensor data, for which some columns are measurements and some columns are sensor bias. For example, something like this:
df=pd.DataFrame({'x':[1.0,2.0,3.0],'y':[4.0,5.0,6.0],
'dx':[0.25,0.25,0.25],'dy':[0.5,0.5,0.5]})
dx dy x y
0 0.25 0.5 1.0 4.0
1 0.25 0.5 2.0 5.0
2 0.25 0.5 3.0 6.0
I can add a column to the table by subtracting the bias from the measurement like this:
df['newX'] = df['x'] - df['dx']
dx dy x y newX
0 0.25 0.5 1.0 4.0 0.75
1 0.25 0.5 2.0 5.0 1.75
2 0.25 0.5 3.0 6.0 2.75
But I'd like to do that for many columns at once. This doesn't work:
df[['newX','newY']] = df[['x','y']] - df[['dx','dy']]
for two reasons, it seems:
1. When subtracting DataFrames, the column labels are used to align the subtraction, so I wind up with a 4-column result ['x', 'y', 'dx', 'dy'].
2. It seems I can insert a single column into the DataFrame using indexing, but not more than one.
Obviously I can iterate over the columns and do each one individually, but is there a more compact way to accomplish what I'm trying to do that is more analogous to the one column solution?
DataFrames generally align operations such as arithmetic on column and row indices. Since df[['x','y']] and df[['dx','dy']] have different column names, the dx column is not subtracted from the x column, and similarly for the y columns.
In contrast, if you subtract a NumPy array from a DataFrame, the operation is done elementwise, since the NumPy array has no pandas-style indices to align upon.
Hence, if you use df[['dx','dy']].values to extract a NumPy array consisting of the values in df[['dx','dy']], then your assignment can be done as desired:
import pandas as pd
df = pd.DataFrame({'x':[1.0,2.0,3.0],'y':[4.0,5.0,6.0],
'dx':[0.25,0.25,0.25],'dy':[0.5,0.5,0.5]})
df[['newx','newy']] = df[['x','y']] - df[['dx','dy']].values
print(df)
yields
dx dy x y newx newy
0 0.25 0.5 1.0 4.0 0.75 3.5
1 0.25 0.5 2.0 5.0 1.75 4.5
2 0.25 0.5 3.0 6.0 2.75 5.5
Beware that if you try to assign a NumPy array (on the right-hand side) to a DataFrame (on the left-hand side), the column names specified on the left must already exist.
In contrast, when assigning a DataFrame on the right-hand side to a DataFrame on the left, new columns can be used since in this case Pandas zips the keys (new column names) on the left with the columns on the right and assigns values in column-order instead of by aligning columns:
for k1, k2 in zip(key, value.columns):
    self[k1] = value[k2]
Thus, using a DataFrame on the right
df[['newx','newy']] = df[['x','y']] - df[['dx','dy']].values
works, but using a NumPy array on the right
df[['newx','newy']] = df[['x','y']].values - df[['dx','dy']].values
does not.
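As a side note, in recent pandas versions .to_numpy() is the documented spelling of .values, so the working assignment can equivalently be written as:
# .to_numpy() strips the column labels, so the subtraction is elementwise
df[['newx', 'newy']] = df[['x', 'y']] - df[['dx', 'dy']].to_numpy()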
New to Pandas and I'm wondering if there's a better way to accomplish the following -
Set up:
import pandas as pd
import numpy as np
x = np.arange(0, 1, .01)
y = np.random.binomial(10, x, 100)
bins = 20
df = pd.DataFrame({'x':x, 'y':y})
print(df.head())
x y
0 -1 1
1 38 1
2 56 0
3 42 0
4 41 0
I would like to group the x values into equal size bins, and for each bin take the average value of both x and y.
my_bins = pd.cut(x, bins=bins)
data = df[['x', 'y']].groupby(my_bins).agg(['mean', 'size'])
print(data.head())
x y
mean size mean size
age
(-1.101, 4.05] -1.000000 87990 0.768428 87990
(4.05, 9.1] NaN 0 NaN 0
(9.1, 14.15] NaN 0 NaN 0
(14.15, 19.2] 18.512286 1872 0.493590 1872
(19.2, 24.25] 22.768022 8906 0.496968 8906
Well that works. But from here, how do I plot x's mean vs y's mean? I know I can do something like
data.columns = data.columns.droplevel() # remove the multiple levels that were created
data.columns = ['x_mean', 'x_size', 'y_mean', 'y_size'] # manually set new column names
data.plot.scatter(x='x_mean', y='y_mean') # plot
But this feels wrong and clunky as I have to drop the column levels (which removes useful structure from my data) and I have to manually rename the columns. Is there a better way?
You can specify the x and y parameters pointing to the multi-level columns using tuples:
data.plot.scatter(x=('x', 'mean'), y=('y', 'mean'))
This way, you don't need to rename the columns in order to plot it.
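If you do eventually want flat column names (for example, before saving to CSV), joining the levels is a common pattern; a sketch that produces the same names as the manual rename above:
data.columns = ['_'.join(col) for col in data.columns]
data.plot.scatter(x='x_mean', y='y_mean')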
Suppose I have a pandas data frame df:
I want to calculate the column-wise mean of a data frame.
This is easy:
df.apply(np.mean)
then the column-wise range max(col) - min(col). This is easy again:
df.apply(max) - df.apply(min)
Now for each element I want to subtract its column's mean and divide by its column's range. I am not sure how to do that.
Any help/pointers are much appreciated.
In [92]: df
Out[92]:
a b c d
A -0.488816 0.863769 4.325608 -4.721202
B -11.937097 2.993993 -12.916784 -1.086236
C -5.569493 4.672679 -2.168464 -9.315900
D 8.892368 0.932785 4.535396 0.598124
In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())
In [94]: df_norm
Out[94]:
a b c d
A 0.085789 -0.394348 0.337016 -0.109935
B -0.463830 0.164926 -0.650963 0.256714
C -0.158129 0.605652 -0.035090 -0.573389
D 0.536170 -0.376229 0.349037 0.426611
In [95]: df_norm.mean()
Out[95]:
a -2.081668e-17
b 4.857226e-17
c 1.734723e-17
d -1.040834e-17
In [96]: df_norm.max() - df_norm.min()
Out[96]:
a 1
b 1
c 1
d 1
If you don't mind importing the sklearn library, I would recommend the approach discussed on this blog: sklearn's MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min).
import pandas as pd
from sklearn import preprocessing

data = {'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]}
df = pd.DataFrame(data)
cols = df.columns  # taken from the DataFrame, not the dict, which has no .columns

min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns=cols)
df_normalized
You can use apply for this, and it's a bit neater:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)
0 1 2 3
0 9.497381 0.552974 0.887313 -1.291874
1 6.461631 -6.206155 9.979247 -0.044828
2 4.276156 2.002518 8.848432 -5.240563
3 1.710331 1.463783 7.535078 -1.399565
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.515087 0.133967 -0.651699 0.135175
1 0.125241 -0.689446 0.348301 0.375188
2 -0.155414 0.310554 0.223925 -0.624812
3 -0.484913 0.244924 0.079473 0.114448
Also, it works nicely with groupby, if you select the relevant columns:
df['grp'] = ['A', 'A', 'B', 'B']
0 1 2 3 grp
0 9.497381 0.552974 0.887313 -1.291874 A
1 6.461631 -6.206155 9.979247 -0.044828 A
2 4.276156 2.002518 8.848432 -5.240563 B
3 1.710331 1.463783 7.535078 -1.399565 B
df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.5 0.5 -0.5 -0.5
1 -0.5 -0.5 0.5 0.5
2 0.5 0.5 0.5 -0.5
3 -0.5 -0.5 -0.5 0.5
Slightly modified from Python Pandas Dataframe: Normalize data between 0.01 and 0.99?, but some of the comments made me think it was relevant here (sorry if this counts as a repost though...).
I wanted customized normalization, in that regular percentile-of-datum or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them from something other than my sample, or use a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assume your sample covers the population, but sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code to make it as readable as possible):
import numpy as np

def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
    if low == 'min':
        low = min(s)
    elif low == 'abs':
        low = max(abs(min(s)), abs(max(s))) * -1.  # sign(min(s))
    if hi == 'max':
        hi = max(s)
    elif hi == 'abs':
        hi = max(abs(min(s)), abs(max(s))) * 1.  # sign(max(s))
    if center == 'mid':
        center = (max(s) + min(s)) / 2
    elif center == 'avg':
        center = np.mean(s)
    elif center == 'median':
        center = np.median(s)
    # shift everything so the chosen center sits at 0
    s2 = [x - center for x in s]
    hi = hi - center
    low = low - center
    center = 0.
    # map [low, center] to [0, 0.5] and [center, hi] to [0.5, 1], clipping outliers
    r = []
    for x in s2:
        if x < low:
            r.append(0.)
        elif x > hi:
            r.append(1.)
        else:
            if x >= center:
                r.append((x - center) / (hi - center) * 0.5 + 0.5)
            else:
                r.append((x - low) / (center - low) * 0.5 + 0.)
    if insideout:
        r = [(1. - abs(z - 0.5) * 2.) for z in r]
    # pull values away from the endpoints 0 and 1 by the shrink factor
    rr = [x - (x - 0.5) * shrinkfactor for x in r]
    return rr
This will take in a pandas Series, or even just a list, and normalize it to your specified low, center, and high points. There is also a shrink factor, to let you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: Single pcolormesh with more than one colormap using Matplotlib). You can likely see how the code works, but basically say you have values [-5, 2, 10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, such as our "10", is effectively treated as a 7) with a midpoint of 1, then shrink it to fit a 256-color RGB colormap:
#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]
It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than to hi/low. You could heatmap based on normalized data with insideout=True:
#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]
So now "2" which is closest to the center, defined as "1" is the highest value.
Anyways, I thought my application was relevant if you're looking to rescale data in other ways that could have useful applications to you.
This is how you do it column-wise:
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())