I have a dataframe with a lot of columns; I will show only the relevant columns in this question:
data[['HomePlanet', 'CryoSleep', 'Transported']].head()
HomePlanet CryoSleep Transported
0 Europa False False
1 Earth False True
2 Europa False False
3 Europa False False
4 Earth False True
data[['HomePlanet', 'CryoSleep', 'Transported']].dtypes
HomePlanet object
CryoSleep boolean
Transported boolean
dtype: object
I want to make a heatmap based on this pivot_table:
result_1 = data.pivot_table(columns='HomePlanet', index='CryoSleep', values='Transported')
result_1
HomePlanet Earth Europa Mars
CryoSleep
False 0.320992 0.400172 0.276982
True 0.656295 0.989023 0.911809
But when I try to build a heatmap with seaborn I get an error:
sns.heatmap(result_1, annot=True, cmap="PiYG_r")
TypeError: Image data of dtype object cannot be converted to float
I tried swapping columns and index in pivot_table but got the same error:
result_2 = data.pivot_table(index='HomePlanet', columns='CryoSleep', values='Transported')
result_2
CryoSleep False True
HomePlanet
Earth 0.320992 0.656295
Europa 0.400172 0.989023
Mars 0.276982 0.911809
sns.heatmap(result_2, annot=True, cmap="PiYG_r")
TypeError: Image data of dtype object cannot be converted to float
What am I doing wrong? How can I build a heatmap based on pivot_table?
Thanks to @JohanC.
I looked at the dtypes of result_1 and result_2, and the columns were Float64 (the pandas nullable float dtype):
result_1.dtypes
Output:
HomePlanet
Earth Float64
Europa Float64
Mars Float64
dtype: object
I changed all the columns to float64 with these lines:
for column in result_1.columns:
    result_1[column] = result_1[column].astype('float64')
And now the heatmap finally worked!
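For reference, the same conversion can be done on the whole frame in one call before plotting (a minimal sketch, assuming the result_1 pivot table from above):
result_1 = result_1.astype('float64')  # cast every nullable Float64 column to plain numpy float64
sns.heatmap(result_1, annot=True, cmap="PiYG_r")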
Related
I got my .dat data formatted into arrays I could use in graphs and whatnot.
I got my data from this website and it requires an account if you want to download it yourself. The data will still be provided below, however.
https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1028
data in python:
import pandas as pd
df = pd.read_csv("ocean_flux_co2_2d.dat", header=None)
print(df.head())
0 1 2 3
0 -178.75 -77.0 0.000003 32128.7
1 -176.25 -77.0 0.000599 32128.7
2 -173.75 -77.0 0.001649 39113.5
3 -171.25 -77.0 0.003838 58934.0
4 -168.75 -77.0 0.007192 179959.0
I then decided to put this data into arrays that I could use in graphs and other functions.
Like so:
import numpy as np

lat = []
lon = []
sed = []
area = []
with open('/home/srowpie/SrowFinProj/Datas/ocean_flux_tss_2d.dat') as f:
    for line in f:
        parts = line.split(',')
        lat.append(float(parts[0]))
        lon.append(float(parts[1]))
        sed.append(float(parts[2]))
        area.append(float(parts[3]))
lat = np.array(lat)
lon = np.array(lon)
sed = np.array(sed)
area = np.array(area)
My question now is how can I put this data into a map with data points? Column 1 is latitude, Column 2 is longitude, Column 3 is sediment flux, and Column 4 is the area covered. Or do I have to bootleg it by making a graph that takes into account the variables lat, lon, and sed?
You don't need to parse the file again to build the arrays by hand. Just use df.values and you have a NumPy array of all the data in the dataframe.
Example -
array([[-1.78750e+02, -7.70000e+01, 3.00000e-06, 3.21287e+04],
[-1.76250e+02, -7.70000e+01, 5.99000e-04, 3.21287e+04],
[-1.73750e+02, -7.70000e+01, 1.64900e-03, 3.91135e+04],
[-1.71250e+02, -7.70000e+01, 3.83800e-03, 5.89340e+04],
[-1.68750e+02, -7.70000e+01, 7.19200e-03, 1.79959e+05]])
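For completeness, the call that produces that array (to_numpy() is the newer equivalent of .values):
arr = df.to_numpy()  # same data as df.values, as a plain NumPy array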
I wouldn't recommend storing individual columns as separate variables. Instead, set the column names on the dataframe and then use them to extract the data in a column as a pandas Series.
df.columns = ["Latitude", "Longitude", "Sediment Flux", "Area covered"]
This is what the table looks like after that:
   Latitude  Longitude  Sediment Flux  Area covered
0   -178.75      -77.0       0.000003       32128.7
1   -176.25      -77.0       0.000599       32128.7
2   -173.75      -77.0       0.001649       39113.5
3   -171.25      -77.0       0.003838       58934.0
4   -168.75      -77.0       0.007192      179959.0
Simply do df[column_name] to get the data in that column.
For example -> df["Latitude"]
Output -
0 -178.75
1 -176.25
2 -173.75
3 -171.25
4 -168.75
Name: Latitude, dtype: float64
Once you have done all this, you can use folium to plot the rows on real interactive maps.
import folium as fl

# center the map on the first row's (Latitude, Longitude) pair
map = fl.Map(df.iloc[0, :2], zoom_start=100)
# add one marker per row; the popup shows the remaining columns as a dict
for index in df.index:
    row = df.loc[index, :]
    fl.Marker(row[:2].values, f"{dict(row[2:])}").add_to(map)
map
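If an interactive map isn't needed, a plain scatter plot of the same three columns also works as a quick check (a sketch, assuming matplotlib is installed and the column names set above):
import matplotlib.pyplot as plt

plt.scatter(df["Longitude"], df["Latitude"], c=df["Sediment Flux"], s=5, cmap="viridis")
plt.colorbar(label="Sediment Flux")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()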
I'm trying to create a column based on some specific "rules". I'd like to have a new column "Result" at the end, with the following result based on the first three columns:
is_return  From  To   Result
True       Fir   Fem  FirFem
False      Tre   Syv  TreSyv
True       Syv   Tre  TreSyv_r
False      Tre   Syv  TreSyv2
True       Syv   Tre  TreSyv_r2
False      Snø   Van  SnøVan
Basically, if a trip is not a return, just combine From and To, and append a number starting from 2 if there are duplicates (row 4). If it's tagged as a return trip, first check whether it exists as a non-return trip, as in rows 2 and 3. But if it's tagged as a return without a non-return variant existing, keep the plain format as is (row 1).
IIUC, you need several steps (commented in the code):
import numpy as np
import pandas as pd

# compute the combined string for both directions
s1 = df['From']+df['To']
s2 = df['To']+df['From']
# pick the string in the correct order: if the reversed trip exists
# as a non-return trip, mark this one as its return with '_r'
s = pd.Series(np.where(s2.isin(s1[~df['is_return']]), s2+'_r', s1),
              index=df.index)
# add a number (starting from 2) to duplicates
count = s.groupby(s).cumcount().add(1)
df['Result'] = s+np.where(count.gt(1), count.astype(str), '')
output:
is_return From To Result
0 True Fir Fem FirFem
1 False Tre Syv TreSyv
2 True Syv Tre TreSyv_r
3 False Tre Syv TreSyv2
4 True Syv Tre TreSyv_r2
5 False Snø Van SnøVan
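If it helps to see the moving parts, printing the intermediates from the snippet above on this sample gives roughly (a sketch, run right after the code above):
print(s1.tolist())     # ['FirFem', 'TreSyv', 'SyvTre', 'TreSyv', 'SyvTre', 'SnøVan']
print(s2.tolist())     # ['FemFir', 'SyvTre', 'TreSyv', 'SyvTre', 'TreSyv', 'VanSnø']
print(s.tolist())      # ['FirFem', 'TreSyv', 'TreSyv_r', 'TreSyv', 'TreSyv_r', 'SnøVan']
print(count.tolist())  # [1, 1, 1, 2, 2, 1]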
I am working on a multiclass classification problem. My target column has 4 classes: Low, Medium, High, and Very High. When I try to encode it, I get only 0 from value_counts(), and I am not sure why.
The value counts in the original dataframe are:
High 18767
Very High 15856
Medium 9212
Low 5067
Name: physician_segment, dtype: int64
I have tried the methods below to encode my target column.
Using the replace() method:
target_enc = {'Low':0,'Medium':1,'High':2,'Very High':3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
df1['physician_segment'] = df1['physician_segment'].replace(target_enc)
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using the factorize() method:
from pandas.api.types import CategoricalDtype
df1['physician_segment'] = df1['physician_segment'].factorize()[0]
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using LabelEncoder:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df1['physician_segment'] = labelencoder.fit_transform(df1['physician_segment'])
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
With all three techniques I get only one class, 0; the length of the dataframe is 48902.
Can someone please point out what I am doing wrong?
I want my target column to have the values 0, 1, 2, 3.
target_enc = {'Low':0,'Medium':1,'High':2,'Very High':3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
After that, define a function:
def func(val):
    if val in target_enc.keys():
        return target_enc[val]  # values not present in the dict fall through and return None
and finally use the apply() method:
df1['physician_segment'] = df1['physician_segment'].apply(func)
Now if you print df1['physician_segment'].value_counts() you will get the correct output.
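As a side note, Series.map does the same dictionary lookup in one step (a sketch using the same target_enc; keys missing from the dict become NaN instead of None):
df1['physician_segment'] = df1['physician_segment'].map(target_enc)
df1['physician_segment'].value_counts()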
Given a pandas dataframe, I want to exclude rows corresponding to outliers (Z-value = 3) based on one of the columns.
The dataframe looks like this:
df.dtypes
_id object
_index object
_score object
_source.address object
_source.district object
_source.price float64
_source.roomCount float64
_source.size float64
_type object
sort object
priceSquareMeter float64
dtype: object
For the line:
dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
The following exception is raised:
-------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-68-02fb15620e33> in <module>()
----> 1 dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof)
2239 """
2240 a = np.asanyarray(a)
-> 2241 mns = a.mean(axis=axis)
2242 sstd = a.std(axis=axis, ddof=ddof)
2243 if axis and mns.ndim < a.ndim:
/opt/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
68 is_float16_result = True
69
---> 70 ret = umr_sum(arr, axis, dtype, out, keepdims)
71 if isinstance(ret, mu.ndarray):
72 ret = um.true_divide(
TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'
And the return value of
np.isreal(df['_source.price']).all()
is
True
Why do I get the above exception, and how can I exclude the outliers?
If one wants to use the Interquartile Range (IQR) of a given dataset:
def Remove_Outlier_Indices(df):
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
trueList = ~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR)))
return trueList
Based on the above function, the non-outlier subset of the dataset (according to its statistical content) can be obtained:
# Arbitrary Dataset for the Example
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Index List of Non-Outliers
nonOutlierList = Remove_Outlier_Indices(df)
# Non-Outlier Subset of the Given Dataset
dfSubset = df[nonOutlierList]
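One thing to watch out for: with a multi-column frame, df[nonOutlierList] masks element-wise and leaves NaN where the condition is False. To drop whole rows instead, reduce the boolean frame per row first (a sketch):
dfRowSubset = df[nonOutlierList.all(axis=1)]  # keep only rows where every column passes the IQR fences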
Use this kind of boolean mask whenever you have this sort of issue:
df=pd.DataFrame({'Data':np.random.normal(size=200)}) #example
df[np.abs(df.Data-df.Data.mean())<=(3*df.Data.std())] #keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] #or the other way around
I believe you could create a boolean mask flagging the outliers and then select the opposite of it.
outliers = np.abs(stats.zscore(df['_source.price'])) > 3  # zscore returns a plain array, so compare it directly
df_without_outliers = df[~outliers]
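As for why the original stats.zscore(df) call blew up: the frame mixes object and float columns, so NumPy ends up averaging object data, which is most likely what raised the TypeError. Restricting the computation to the numeric columns avoids it (a sketch, reusing the question's imports):
import numpy as np
from scipy import stats

num = df.select_dtypes(include='number')            # only the float64 columns
mask = (np.abs(stats.zscore(num)) < 3).all(axis=1)  # rows where every numeric |z| < 3
dff = df[mask]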
I'm trying to extract a cell from a pandas dataframe to a simple floating point number. I'm trying
prediction = pd.to_numeric(baseline.ix[(baseline['Weekday']==5) & (baseline['Hour'] == 8)]['SmsOut'])
However, this returns
128 -0.001405
Name: SmsOut, dtype: float64
I want it to just return a simple Python float: -0.001405
How can I do that?
The output is a Series with one value, so there are several possible solutions:
convert to a NumPy array with to_numpy and select the first value by indexing
select by position with iloc or iat
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction.to_numpy()[0])
print (prediction.iloc[0])
print (prediction.iat[0])
Sample:
baseline = pd.DataFrame({'Weekday':[5,3],
'Hour':[8,4],
'SmsOut':[-0.001405,6]}, index=[128,130])
print (baseline)
Hour SmsOut Weekday
128 8 -0.001405 5
130 4 6.000000 3
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction)
128 -0.001405
Name: SmsOut, dtype: float64
print (prediction.to_numpy()[0])
-0.001405
print (prediction.iloc[0])
-0.001405
print (prediction.iat[0])
-0.001405
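If you know the selection will always contain exactly one value, .item() (or .squeeze()) also collapses the one-element Series to a plain Python float (a sketch with the same prediction Series):
print(prediction.item())     # -0.001405
print(prediction.squeeze())  # -0.001405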