HeatMap on pandas python 3.7 - python

I'm trying to make a nice heatmap using pandas. The data is a CSV file in the same folder as the Python script.
My code raises an error that looks simple:
File "<ipython-input-6-1b7ca215e6d0>", line 4
fid = datadf u'/my_Path/File.csv'
^
SyntaxError: invalid syntax
I suspect the syntax is not the real problem.
Could you help me?
My code is:
datadf = pd.read_csv("D:\my_Path\File.csv")
## Loading the data
fid = datadf u'/my_Path/File.csv'
key = u'dataset_key'
## Load the dataframe
df = pd.read_hdf(fid,key)
## Default plot ranges:
long_range = (datadf['long'].min(), datadf['long'].max())
lat_range = (datadf['lat'].min(), datadf['lat'].max())
## France plot ranges
long_range_fr = (-5,10)
lat_range_fr = (40,52)
## Visualization
### Custom functions
def bg(img):
    return tf.set_background(img,"black")
def create_image(long_range=long_range, lat_range=lat_range, w=800, h=800):
    cvs = ds.Canvas(x_range=long_range, y_range=lat_range, plot_height=h, plot_width=w)
    agg = cvs.points(df, 'lon', 'lat')
    return bg(tf.shade(agg, cmap = cm(Hot,0.2), how='eq_hist'))
### Static plot
create_image(long_range=long_range_fr, lat_range=lat_range_fr)
A sample of my data:
long lat
-0.91655 43.456863
-0.495795 43.162117
-0.029272 43.097401
-0.108955 43.233845
-0.10237 43.207676
-0.096726 43.19257
-0.102862 43.216438
-0.1091 43.234241
-0.105826 43.225636
-0.096518 43.190247
-0.098496 43.19902
-0.079585 43.229698
-0.081321 43.232929
-0.079448 43.232937
-0.624699 43.364143
-0.429526 43.328094

As @Kristóf Varga rightly mentioned, a seaborn heatmap can be used to solve this.
A solution can be found here: Using seaborn heatmap
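The SyntaxError itself comes from the line `fid = datadf u'/my_Path/File.csv'`, which puts two expressions side by side with no operator between them. The script also reads a CSV and then calls `pd.read_hdf`, and it mixes the column names 'long' and 'lon'. Below is a minimal sketch of the seaborn route, assuming the columns are named `long` and `lat` as in the sample. The helper names `bin_points` and `plot_heatmap` and the bin count are hypothetical, not part of the original code.

```python
import numpy as np
import pandas as pd


def bin_points(df, bins=50, long_range=None, lat_range=None):
    """Count points per (lat, long) cell and return the grid as a DataFrame."""
    if long_range is None:
        long_range = (df["long"].min(), df["long"].max())
    if lat_range is None:
        lat_range = (df["lat"].min(), df["lat"].max())
    counts, lat_edges, long_edges = np.histogram2d(
        df["lat"], df["long"], bins=bins, range=[lat_range, long_range]
    )
    # Label rows and columns with the cell centres so the axes are readable.
    lat_centres = (lat_edges[:-1] + lat_edges[1:]) / 2
    long_centres = (long_edges[:-1] + long_edges[1:]) / 2
    grid = pd.DataFrame(
        counts, index=lat_centres.round(3), columns=long_centres.round(3)
    )
    # Reverse the rows so that north ends up at the top of the plot.
    return grid.iloc[::-1]


def plot_heatmap(grid):
    """Draw the binned counts; seaborn and matplotlib are imported only here."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(grid, cmap="hot")
    ax.set_xlabel("long")
    ax.set_ylabel("lat")
    plt.show()


# A few rows taken from the sample in the question.
sample = pd.DataFrame(
    {
        "long": [-0.91655, -0.495795, -0.029272, -0.108955],
        "lat": [43.456863, 43.162117, 43.097401, 43.233845],
    }
)
grid = bin_points(sample, bins=4)
```

With the real file, something like `plot_heatmap(bin_points(pd.read_csv(r"D:\my_Path\File.csv"), long_range=(-5, 10), lat_range=(40, 52)))` should draw the France view, provided seaborn and matplotlib are installed.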

Related

Weird Time-Series Graph Using Pycaret and plotly

I am trying to visualize air quality data as time-series charts using the pycaret and plotly dash Python libraries, but I am getting very weird graphs. Below is my code:
import pandas as pd
import plotly.express as px
data = pd.read_csv('E:/Self Learning/Djang_Dash/2019-2020_5.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')
#data.set_index('Date', inplace=True)
# combine store and item column as time_series
data['OBJECTID'] = ['Location_' + str(i) for i in data['OBJECTID']]
#data['AQI_Bins_AI'] = ['Bin_' + str(i) for i in data['AQI_Bins_AI']]
data['time_series'] = data[['OBJECTID']].apply(lambda x: '_'.join(x), axis=1)
data.drop(['OBJECTID'], axis=1, inplace=True)
# extract features from date
data['month'] = [i.month for i in data['Date']]
data['year'] = [i.year for i in data['Date']]
data['day_of_week'] = [i.dayofweek for i in data['Date']]
data['day_of_year'] = [i.dayofyear for i in data['Date']]
data.head(4000)
data['time_series'].nunique()
for i in data['time_series'].unique():
    subset = data[data['time_series'] == i]
    subset['moving_average'] = subset['CO'].rolling(window = 30).mean()
    fig = px.line(subset, x="Date", y=["CO","moving_average"], title = i, template = 'plotly_dark')
    fig.show()
I would appreciate any help with this.
Here is my sample data: Google Drive Link
The data was not provided in a usable form, so I looked for similar publicly available data and found: https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=station_hour.csv
Using this data, with a couple of cleanups to your code, the plots show no issues. I suspect your data has one of these issues:
Date is not datetime64[ns] in your data frame
Date is not sorted, which leads to lines being drawn in the way you describe
By refactoring the way the moving average is calculated, you can also use an animation instead of lots of separate figures.
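The two suspected issues can be checked and fixed in a few lines. This is only a sketch on made-up rows, assuming the `Date` and `CO` column names and the day-first date format from the question.

```python
import pandas as pd

# Made-up rows, deliberately out of order and with dates stored as strings.
data = pd.DataFrame(
    {
        "Date": ["03/01/2019", "01/01/2019", "02/01/2019"],
        "CO": [0.7, 0.5, 0.6],
    }
)

# Issue 1: the column must have a datetime64 dtype, not object.
data["Date"] = pd.to_datetime(data["Date"], format="%d/%m/%Y")

# Issue 2: unsorted dates make the plotted line jump back and forth.
was_sorted = data["Date"].is_monotonic_increasing
data = data.sort_values("Date").reset_index(drop=True)
```

If `was_sorted` is False for the real file, the zig-zag lines are most likely an ordering problem rather than a plotly problem.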
Get some data:
import kaggle.cli
import sys, math
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import plotly.express as px
# download data set
# https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=station_hour.csv
sys.argv = [
    sys.argv[0]
] + "datasets download rohanrao/air-quality-data-in-india".split(
    " "
)
kaggle.cli.main()
zfile = ZipFile("air-quality-data-in-india.zip")
print([f.filename for f in zfile.infolist()])
Plot using the code from the question:
import pandas as pd
import plotly.express as px
from pathlib import Path
from distutils.version import StrictVersion
# data = pd.read_csv('E:/Self Learning/Djang_Dash/2019-2020_5.csv')
# use kaggle data
# dfs = {f.filename:pd.read_csv(zfile.open(f)) for f in zfile.infolist() if f.filename in ['station_day.csv',"stations.csv"]}
# data = pd.merge(dfs['station_day.csv'],dfs["stations.csv"], on="StationId")
# data['Date'] = pd.to_datetime(data['Date'])
# # kaggle data is different from question, make it compatible with questions data
# data = data.assign(OBJECTID=lambda d: d["StationId"])
# sample data from google drive link
data2 = pd.read_csv(Path.home().joinpath("Downloads").joinpath("AQI.csv"))
data2["Date"] = pd.to_datetime(data2["Date"])
data = data2
# as per the very first comment - it's important that the data is ordered!
data = data.sort_values(["Date","OBJECTID"])
data['time_series'] = "Location_" + data["OBJECTID"].astype(str)
# clean up data, remove rows where there is no CO value
data = data.dropna(subset=["CO"])
# can do moving average in one step (can also be used by animation)
if StrictVersion(pd.__version__) < StrictVersion("1.3.0"):
    data["moving_average"] = data.groupby("time_series",as_index=False)["CO"].rolling(window=30).mean().to_frame()["CO"].values
else:
    data["moving_average"] = data.groupby("time_series",as_index=False)["CO"].rolling(window=30).mean()["CO"]
# just the first three for purpose of demonstration
for i in data['time_series'].unique()[0:3]:
    subset = data.loc[data['time_series'] == i]
    fig = px.line(subset, x="Date", y=["CO","moving_average"], title = i, template = 'plotly_dark')
    fig.show()
Alternatively, use an animation:
px.line(
    data,
    x="Date",
    y=["CO", "moving_average"],
    animation_frame="time_series",
    template="plotly_dark",
).update_layout(yaxis={"range":[data["CO"].min(), data["CO"].quantile(.97)]})
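The pandas version check in the code above suggests that the shape of the grouped rolling result differs between releases. One way to sidestep that, sketched here on made-up data with a window of 2 instead of 30, is `groupby(...).transform`, which returns a result aligned to the original index.

```python
import pandas as pd

# Made-up data: two locations with a handful of CO readings each.
data = pd.DataFrame(
    {
        "time_series": [
            "Location_1", "Location_1", "Location_1",
            "Location_2", "Location_2",
        ],
        "CO": [1.0, 3.0, 5.0, 10.0, 20.0],
    }
)

# transform keeps the original index, so the result can be assigned directly.
# The window never crosses from one location into the next.
data["moving_average"] = data.groupby("time_series")["CO"].transform(
    lambda s: s.rolling(window=2).mean()
)
```

The first row of each location is NaN because the window is not yet full, which is the same behaviour as the per-location loop in the question.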

Openpyxl minor gridlines

I am working on a Python application where I am collecting data from a device, and attempting to plot it in an excel file by using the Openpyxl library. I am successfully able to do everything including plotting the data, and formatting the scatter plot that I made, but I am having some trouble in adding minor gridlines to the plot.
I feel like this is definitely possible because in the API, I can see under the openpyxl.chart.axis module, there is a “minorGridlines” attribute, but it is not a boolean input (ON/OFF), rather it takes a Chartlines class. I tried going a bit down the rabbit-hole of seeing how I would do this, but I am wondering what the most straightforward way of adding the minor-gridlines would be? Do you have to construct chart lines manually, or is there a simple way of doing this?
I would really appreciate any help!
I think I answered my own question, but I will post it here if anybody else needs this (as I don’t see any other answers to this question on the forum).
Example Code (see lines 4, 38):
# Imports for script
from openpyxl import Workbook # For plotting things in excel
from openpyxl.chart import ScatterChart, Reference, Series
from openpyxl.chart.axis import ChartLines
from math import log10
# Variables for script
fileName = 'testFile.xlsx'
dataPoints = 100
# Generating a workbook to test with
wb = Workbook()
ws = wb.active # Fill data into the first sheet
ws_name = ws.title
# We will just generate a logarithmic plot, and scale the axis logarithmically (will look linear)
x_data = []
y_data = []
for i in range(dataPoints):
    x_data.append(i + 1)
    y_data.append(log10(i + 1))
# Go back through the data, and place the data into the sheet
ws['A1'] = 'x_data'
ws['B1'] = 'y_data'
for i in range(dataPoints):
    ws['A%d' % (i + 2)] = x_data[i]
    ws['B%d' % (i + 2)] = y_data[i]
# Generate a reference to the cells that we can plot
x_axis = Reference(ws, range_string='%s!A2:A%d' % (ws_name, dataPoints + 1))
y_axis = Reference(ws, range_string='%s!B2:B%d' % (ws_name, dataPoints + 1))
function = Series(xvalues=x_axis, values=y_axis)
# Actually create the scatter plot, and append all of the plots to it
ScatterPlot = ScatterChart()
ScatterPlot.x_axis.minorGridlines = ChartLines()
ScatterPlot.x_axis.scaling.logBase = 10
ScatterPlot.series.append(function)
ScatterPlot.x_axis.title = 'X_Data'
ScatterPlot.y_axis.title = 'Y_Data'
ScatterPlot.title = 'Openpyxl Plotting Test'
ws.add_chart(ScatterPlot, 'D2')
# Save the file at the end to output it
wb.save(fileName)
Background on solution:
I looked at how the code for Openpyxl generates the Major axis gridlines, which seems to follow a similar convention as the Minor axis gridlines, and I found that in the ‘NumericAxis’ class, they generated the major gridlines with the following line (labeled ‘##### This Line #####’ which is originally copied from the ‘openpyxl->chart->axis’ file):
class NumericAxis(_BaseAxis):
    tagname = "valAx"
    axId = _BaseAxis.axId
    scaling = _BaseAxis.scaling
    delete = _BaseAxis.delete
    axPos = _BaseAxis.axPos
    majorGridlines = _BaseAxis.majorGridlines
    minorGridlines = _BaseAxis.minorGridlines
    title = _BaseAxis.title
    numFmt = _BaseAxis.numFmt
    majorTickMark = _BaseAxis.majorTickMark
    minorTickMark = _BaseAxis.minorTickMark
    tickLblPos = _BaseAxis.tickLblPos
    spPr = _BaseAxis.spPr
    txPr = _BaseAxis.txPr
    crossAx = _BaseAxis.crossAx
    crosses = _BaseAxis.crosses
    crossesAt = _BaseAxis.crossesAt
    crossBetween = NestedNoneSet(values=(['between', 'midCat']))
    majorUnit = NestedFloat(allow_none=True)
    minorUnit = NestedFloat(allow_none=True)
    dispUnits = Typed(expected_type=DisplayUnitsLabelList, allow_none=True)
    extLst = Typed(expected_type=ExtensionList, allow_none=True)
    __elements__ = _BaseAxis.__elements__ + ('crossBetween', 'majorUnit',
                                             'minorUnit', 'dispUnits',)
    def __init__(self,
                 crossBetween=None,
                 majorUnit=None,
                 minorUnit=None,
                 dispUnits=None,
                 extLst=None,
                 **kw
                ):
        self.crossBetween = crossBetween
        self.majorUnit = majorUnit
        self.minorUnit = minorUnit
        self.dispUnits = dispUnits
        kw.setdefault('majorGridlines', ChartLines()) ######## THIS Line #######
        kw.setdefault('axId', 100)
        kw.setdefault('crossAx', 10)
        super(NumericAxis, self).__init__(**kw)
    @classmethod
    def from_tree(cls, node):
        """
        Special case value axes with no gridlines
        """
        self = super(NumericAxis, cls).from_tree(node)
        gridlines = node.find("{%s}majorGridlines" % CHART_NS)
        if gridlines is None:
            self.majorGridlines = None
        return self
I took a stab, and after importing the ‘ChartLines’ class like so:
from openpyxl.chart.axis import ChartLines
I was able to add minor gridlines to the x-axis like so:
ScatterPlot.x_axis.minorGridlines = ChartLines()
As far as formatting the minor gridlines, I’m at a bit of a loss, and personally have no need, but this at least is a good start.
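For the formatting part, one possible route is the `spPr` (graphical properties) attribute that `ChartLines` accepts. The sketch below is untested against Excel rendering and uses an arbitrary colour, width and dash style as assumptions. It relies on the same `line.solidFill`, `line.width` and `line.dashStyle` attributes that openpyxl uses for styling series lines.

```python
from openpyxl.chart import ScatterChart
from openpyxl.chart.axis import ChartLines
from openpyxl.chart.shapes import GraphicalProperties

chart = ScatterChart()

# GraphicalProperties carries the line styling for the gridlines.
style = GraphicalProperties()
style.line.solidFill = "D9D9D9"  # light grey, hex RGB (arbitrary choice)
style.line.width = 6350          # in EMU; 12700 EMU = 1 pt
style.line.dashStyle = "dash"

# Attach the styled gridlines to the x-axis of the chart.
chart.x_axis.minorGridlines = ChartLines(spPr=style)
```

The same `ChartLines(spPr=style)` object could be assigned to `y_axis.minorGridlines` or to the major gridlines if those should be styled too.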

Bokeh callback not updating chart [duplicate]

Struggling to understand why this bokeh visual will not allow me to change plots and see the predicted data. The plot and select (dropdown-looking) menu appears, but I'm not able to change the plot for items in the menu.
Running Bokeh 1.2.0 via Anaconda. The code has been run both inside & outside of Jupyter. No errors display when the code is run. I've looked through the handful of SO posts relating to this same issue, but I've not been able to apply the same solutions successfully.
I wasn't sure how to create a toy problem out of this, so in addition to the code sample below, the full code (including the regression code and corresponding data) can be found at my github here (code: Regression&Plotting.ipynb, data: pred_data.csv, historical_data.csv, features_created.pkd.)
import pandas as pd
import datetime
from bokeh.io import curdoc, output_notebook, output_file
from bokeh.layouts import row, column
from bokeh.models import Select, DataRange1d, ColumnDataSource
from bokeh.plotting import figure
#Must be run from the command line
def get_historical_data(src_hist, drug_id):
    historical_data = src_hist.loc[src_hist['ndc'] == drug_id]
    historical_data.drop(['Unnamed: 0', 'date'], inplace = True, axis = 1)#.dropna()
    historical_data['date'] = pd.to_datetime(historical_data[['year', 'month', 'day']], infer_datetime_format=True)
    historical_data = historical_data.set_index(['date'])
    historical_data.sort_index(inplace = True)
    # csd_historical = ColumnDataSource(historical_data)
    return historical_data
def get_prediction_data(src_test, drug_id):
    #Assign the new date
    #Write a new dataframe with values for the new dates
    df_pred = src_test.loc[src_test['ndc'] == drug_id].copy()
    df_pred.loc[:, 'year'] = input_date.year
    df_pred.loc[:, 'month'] = input_date.month
    df_pred.loc[:, 'day'] = input_date.day
    df_pred.drop(['Unnamed: 0', 'date'], inplace = True, axis = 1)
    prediction = lin_model.predict(df_pred)
    prediction_data = pd.DataFrame({'drug_id': prediction[0][0], 'predictions': prediction[0][1], 'date': pd.to_datetime(df_pred[['year', 'month', 'day']], infer_datetime_format=True, errors = 'coerce')})
    prediction_data = prediction_data.set_index(['date'])
    prediction_data.sort_index(inplace = True)
    # csd_prediction = ColumnDataSource(prediction_data)
    return prediction_data
def make_plot(historical_data, prediction_data, title):
    #Historical Data
    plot = figure(plot_width=800, plot_height = 800, x_axis_type = 'datetime',
                  toolbar_location = 'below')
    plot.xaxis.axis_label = 'Time'
    plot.yaxis.axis_label = 'Price ($)'
    plot.axis.axis_label_text_font_style = 'bold'
    plot.x_range = DataRange1d(range_padding = 0.0)
    plot.grid.grid_line_alpha = 0.3
    plot.title.text = title
    plot.line(x = 'date', y='nadac_per_unit', source = historical_data, line_color = 'blue', ) #plot historical data
    plot.line(x = 'date', y='predictions', source = prediction_data, line_color = 'red') #plot prediction data (line from last date/price point to date, price point for input_date above)
    return plot
def update_plot(attrname, old, new):
    ver = vselect.value
    new_hist_source = get_historical_data(src_hist, ver) #calls the function above to get the data instead of handling it here on its own
    historical_data.data = ColumnDataSource.from_df(new_hist_source)
    # new_pred_source = get_prediction_data(src_pred, ver)
    # prediction_data.data = new_pred_source.data
#Import data source
src_hist = pd.read_csv('data/historical_data.csv')
src_pred = pd.read_csv('data/pred_data.csv')
#Prep for default view
#Initialize plot with ID number
ver = 781593600
#Set the prediction date
input_date = datetime.datetime(2020, 3, 31) #Make this selectable in future
#Select-menu options
menu_options = src_pred['ndc'].astype(str) #already contains unique values
#Create select (dropdown) menu
vselect = Select(value=str(ver), title='Drug ID', options=sorted((menu_options)))
#Prep datasets for plotting
historical_data = get_historical_data(src_hist, ver)
prediction_data = get_prediction_data(src_pred, ver)
#Create a new plot with the source data
plot = make_plot(historical_data, prediction_data, "Drug Prices")
#Update the plot every time 'vselect' is changed'
vselect.on_change('value', update_plot)
controls = row(vselect)
curdoc().add_root(row(plot, controls))
UPDATED: ERRORS:
1) No errors show up in Jupyter Notebook.
2) The CLI shows a UserWarning: Pandas doesn't allow columns to be created via a new attribute name, referencing `historical_data.data = ColumnDataSource.from_df(new_hist_source)`.
Ultimately, the plot should have a line for historical data, and another line or dot for predicted data derived from sklearn. It also has a dropdown menu to select each item to plot (one at a time).
Your update_plot is a no-op that does not actually make any changes to Bokeh model state, which is what is necessary to change a Bokeh plot. Changing Bokeh model state means assigning a new value to a property on a Bokeh object. Typically, to update a plot, you would compute a new data dict and then set an existing CDS from it:
source.data = new_data # plain python dict
Or, if you want to update from a DataFrame:
source.data = ColumnDataSource.from_df(new_df)
As an aside, don't assign the .data from one CDS to another:
source.data = other_source.data # BAD
By contrast, your update_plot computes some new data and then throws it away. Note there is never any purpose to returning anything at all from any Bokeh callback. The callbacks are called by Bokeh library code, which does not expect or use any return values.
Lastly, I don't think any of those last JS console errors were generated by BokehJS.
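To make the pattern concrete, here is a minimal sketch of a callback that does change Bokeh model state. The data is made up, and the helper `rows_for` and the column name `price` are hypothetical. The essential point is that `source` is a ColumnDataSource (not a DataFrame) and that the callback assigns a plain dict to `source.data`.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, Select

# Made-up stand-in for the historical data in the question.
src_hist = pd.DataFrame(
    {
        "ndc": [1, 1, 2, 2],
        "day": [1, 2, 1, 2],
        "nadac_per_unit": [0.10, 0.12, 3.00, 3.50],
    }
)


def rows_for(drug_id):
    """Return the rows for one drug as a plain dict of lists."""
    subset = src_hist.loc[src_hist["ndc"] == drug_id]
    return {
        "day": subset["day"].tolist(),
        "price": subset["nadac_per_unit"].tolist(),
    }


# Glyphs would be drawn from this source, e.g.
# plot.line("day", "price", source=source)
source = ColumnDataSource(data=rows_for(1))
vselect = Select(title="Drug ID", value="1", options=["1", "2"])


def update_plot(attrname, old, new):
    # Select values are strings, so convert before filtering.
    source.data = rows_for(int(new))


vselect.on_change("value", update_plot)
```

Because the glyph keeps a reference to `source`, replacing `source.data` is enough for the plot to redraw; nothing needs to be returned from the callback.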

How to highlight multiline graph in Altair python

I'm trying to create an interactive timeseries chart with more than 20 lines of data using the Altair module in Python.
The code to create the dataframe of the shape I'm looking at is here:
import numpy as np
import pandas as pd
import altair as alt
year = np.arange(1995, 2020)
day = np.arange(1, 91)
def gen_next_number(previous, limit, max_reached):
    if max_reached:
        return np.nan, True
    increment = np.random.randint(0, 10)
    output = previous + increment
    if output >= 100:
        output = 100
        max_reached = True
    return output, max_reached
def gen_list():
    output_list = []
    initial = 0
    limit = 100
    max_reached = False
    value = 0
    for i in range(1, 91):
        value, max_reached = gen_next_number(value, limit, max_reached)
        if max_reached:
            value = np.nan
        output_list.append(value)
    return output_list
df = pd.DataFrame(index = day, columns=year )
for y in year:
    data = gen_list()
    df[y] = data
df['day'] = df.index
df = df.melt("day")
df = df.dropna(subset=["value"])
I can use the following Altair code to produce the initial plot, but it's not pretty:
alt.Chart(df).mark_line().encode(
    x='day:N',
    color="variable:N",
    y='value:Q',
    tooltip=["variable:N", "value"]
)
But when I've tried this code to create something interactive, it fails:
highlight = alt.selection(type='single', on='mouseover',
                          fields='variable', nearest=True, empty="none")
alt.Chart(plottable).mark_line().encode(
    x='day:N',
    color="variable:N",
    y=alt.condition(highlight, 'value:Q', alt.value("lightgray")),
    tooltip=["variable:N", "value"]
).add_selection(
    highlight
)
It fails with the error:
TypeError: sequence item 1: expected str instance, int found
Can someone help me out?
Also, is it possible to make the legend interactive? So a hover over a year highlights a line?
Two issues:
In alt.selection, you need to provide a list of fields rather than a single field
The y encoding does not accept a condition. I suspect you meant to put the condition on color.
With these two fixes, your chart works:
highlight = alt.selection(type='single', on='mouseover',
                          fields=['variable'], nearest=True, empty="none")
alt.Chart(df).mark_line().encode(
    x='day:N',
    y='value:Q',
    color=alt.condition(highlight, 'variable:N', alt.value("lightgray")),
    tooltip=["variable:N", "value"]
).add_selection(
    highlight
)
Because the selection doesn't change z-order, you'll find that the highlighted line is often hidden behind other gray lines. If you want it to pop out in front, you could use an approach similar to the one in https://stackoverflow.com/a/55796860/2937831
I would like to create a multi-line plot similar to the one above
without a legend
without hovering or mouseover.
I would simply like to pass a highlighted_value and have a single line highlighted.
I have modified the code below, but I am not very familiar with the proper use of "selection" and recognize that my approach is a somewhat kludgy way to get the result I want.
Is there a cleaner way to do this?
highlight = alt.selection(type='single', on='mouseover',
                          fields=['variable'], nearest=True, empty="none")
background = alt.Chart(df[df['variable'] != 1995]).mark_line().encode(
    x='day:N',
    y='value:Q',
    color=alt.condition( highlight, 'variable:N', alt.value("lightgray")),
    tooltip=["variable:N", "value"],
).add_selection(
    highlight
)
foreground = alt.Chart(df[df['variable'] == 1995]).mark_line(color= "blue").encode(
    x='day:N',
    y='value:Q',
    color=alt.Color('variable',legend=None)
)
foreground + background

Importing data as an array for plotting in Python

I am new to this topic and hope to benefit from your advice. Sorry if the question is amateurish.
I have the following code, which ends by showing a plot. I am only including part of it.
...
cov = np.dot(A, A.T)
samps2 = np.random.multivariate_normal([0]*ndim, cov, size=nsamp)
print(samps2)
names = ["x%s"%i for i in range(ndim)]
labels = ["x_%s"%i for i in range(ndim)]
samples2 = MCSamples(samples=samps2,names = names, labels = labels, label='Second set')
g = plots.getSubplotPlotter()
g.triangle_plot([samples2], filled=True)
It has no problem. The plot is drawn using the data coming from samps2. To see what the samps2 is, we do print(samps2) and see:
[[-0.11213986 -0.0582685 ]
[ 0.20346731 0.25309022]
[ 0.22737737 0.2250694 ]
[-0.09544588 -0.12754274]
[-1.05491483 -1.15432073]
[-0.31340717 -0.36144749]
[-0.99158936 -1.12785124]
[-0.5218308 -0.59193326]
[ 0.76552123 0.82138362]
[ 0.65083618 0.70784292]]
My question is: if I want to read this data from a txt file, what should I do?
Thank you.
There are several ways. What comes to my mind is:
plain python:
data = []
with open(filename, 'r') as f:
    for line in f:
        data.append([float(num) for num in line.split()])
numpy:
import numpy as np
data = np.genfromtxt(filename, ...)
pandas:
import pandas as pd
df = pd.read_table(filename, sep=r'\s+', header=None)
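If the text file was produced by copying the output of print(samps2), it still contains the square brackets, which none of the readers above will parse as numbers. A small sketch, assuming that bracketed layout and using an in-memory string in place of the real file:

```python
import io

import numpy as np

# Text exactly as print(samps2) would show it, brackets included.
printed = """[[-0.11213986 -0.0582685 ]
 [ 0.20346731  0.25309022]
 [ 0.22737737  0.2250694 ]]"""

# Strip the brackets so every line is just whitespace-separated numbers.
cleaned = printed.replace("[", " ").replace("]", " ")

# np.loadtxt accepts a file name or any file-like object.
samps2 = np.loadtxt(io.StringIO(cleaned))
```

If the file already contains only numbers, `np.loadtxt(filename)` is enough, and the resulting 2-D array can be passed to `MCSamples(samples=samps2, ...)` as in the question.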
