Folium Heatmap generate legend based on geopoint occurrence - python

I am attempting to generate a folium heatmap based on geopoints and how often each geopoint appears in my dataset. However, the count of how often a geopoint appears does not seem to affect my heatmap.
In addition, I need a legend so that everybody can read my heatmap.
My data is saved in a pandas dataframe with following columns:
Latitude Longitude count
count holds the number of times each Latitude/Longitude point occurs in the dataset.
If I generate a heatmap like:
heat_data2 = [[row['Latitude'], row['Longitude'], row['count']] for index, row in df.iterrows()]
it seems that the count does not get included. I noticed this the moment I added a legend like this:
import branca
from collections import defaultdict

steps = 200
colormap = branca.colormap.linear.YlOrRd_09.scale(0, 5000).to_step(steps)
gradient_map = defaultdict(dict)
for i in range(steps):
    gradient_map[1 / steps * i] = colormap.rgb_hex_str(1 / steps * i)
colormap.add_to(map)
All points I see on the heatmap have the color of the least occurrence. How can I combine the count and the geopoints to get a heatmap that shows how often each point occurred?
I would also appreciate any tips for better tools to generate heatmaps for geodata in Python!

Related

Random colors in pyplot.scatter based on value

I have a large pandas DataFrame that I clustered, and the cluster id is stored in a column of the DataFrame.
I would like to display the clusters in such a way that each cluster has a different color.
I tried doing this with a colormap, but the problem is that I have so many points and clusters that nearby clusters get assigned only slightly different colors, so when I plot all of them I just get a big mashup that looks like this:
Note that this image contains about 4000 clusters, but because cluster colors are assigned top to bottom, nearby clusters blend together.
I would like nearby clusters to be painted in clearly different colors, so I tried making a random color for each cluster and then assigning each point a color based on its cluster label, like this:
import random
import matplotlib.pyplot as plt

# creating a color for each distinct cluster label
colors = [(random.random(), random.random(), random.random())
          for _ in range(len(set(data['labels'])))]
# assigning a color to each point based on its cluster label
for index, row in data.iterrows():
    plt.scatter(row['x'], row['y'], color=colors[int(row['labels'])])
This program works, but it is much slower than the vectorized version above.
Is there a way to color each cluster in clearly different colors without writing a for loop?
This creates a random colormap of 256 colors that you can then pass to scatter:
def segmentation_cmap():
    vals = np.linspace(0, 1, 256)
    np.random.shuffle(vals)
    return plt.cm.colors.ListedColormap(plt.cm.CMRmap(vals))

ax.scatter(data['x'], data['y'], c=data['labels'], s=1, cmap=segmentation_cmap())
You may add more colors, but you would have trouble seeing the differences anyway at some point!
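For completeness, a self-contained sketch of the vectorized call (the data here is randomly generated for illustration; in the question it would be the clustered DataFrame's columns):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs as a script
import matplotlib.pyplot as plt

def segmentation_cmap():
    # 256 colors from CMRmap in shuffled order, so neighboring
    # cluster ids get clearly different colors
    vals = np.linspace(0, 1, 256)
    np.random.shuffle(vals)
    return plt.cm.colors.ListedColormap(plt.cm.CMRmap(vals))

# hypothetical data: 1000 points in 40 clusters
rng = np.random.default_rng(0)
data = {'x': rng.normal(size=1000),
        'y': rng.normal(size=1000),
        'labels': rng.integers(0, 40, size=1000)}

fig, ax = plt.subplots()
# one vectorized scatter call instead of a Python loop over rows
sc = ax.scatter(data['x'], data['y'], c=data['labels'], s=1,
                cmap=segmentation_cmap())
```

Passing the whole columns to `c=` and `cmap=` is what avoids the per-row loop and its slowness.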

How can we plot line-chart between repeating non-numeric column values in python, containing information of more than two columns?

Considering the dataset image attached below, suppose we have to plot x = TIME and y = Value; the plot should place countries in the graph for particular quarters and values, so three column values interact with each other. We are trying to represent countries on the axes of TIME and Value. I am trying to find an alternative without using one-hot encoding.
When trying to plot the data using this code:
x = x.sort_values(by = ['TIME'])
x[['TIME', 'Value']].plot(x="TIME", y = "Value", kind="bar")
The quarters get repeated on the x-axis.
Can you explain how to deal with such scenarios?
(A sample of the dataset is attached as an image.)
One of the best solutions would be to convert the values using one-hot encoding and then plot them.
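Alternatively, a sketch that avoids one-hot encoding entirely: pivot the data so each country becomes its own series, giving one bar group per quarter instead of repeated x-axis labels. The country column name and values here are assumptions, since the actual screenshot is not available:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripting
import pandas as pd

# hypothetical sample in the shape described: one row per country per quarter
x = pd.DataFrame({
    'TIME':    ['2020-Q1', '2020-Q1', '2020-Q2', '2020-Q2'],
    'Country': ['AUS', 'BEL', 'AUS', 'BEL'],  # column name is an assumption
    'Value':   [1.2, 0.8, 1.5, 0.9],
})

# one column per country, one row per quarter -> grouped bars, no repeats
pivoted = x.pivot_table(index='TIME', columns='Country', values='Value')
ax = pivoted.plot(kind='bar')
```

Each quarter then appears once on the x-axis, with one colored bar per country and a legend identifying the countries.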

Bokeh graphs y_range coordinate off by half a coordinate

I am displaying the frequency of nouns in a dataframe using bokeh charts.
The data consists of companies and their patents from which I extracted the nouns.
When I display the frequencies using a y_range of (0,10) the data is displayed perfectly. When I use the list of companies, the data is offset by half a y_range coordinate.
scatter = figure(plot_width=800, plot_height=200,
                 x_range=max_words,
                 y_range=companies,
                 tools=tools
                 )
compared to
scatter = figure(plot_width=800, plot_height=200,
                 x_range=max_words,
                 y_range=(0, 10),
                 tools=tools
                 )
Any suggestions on how this issue can be resolved?
If you are providing a list of categorical factors e.g. y_range=companies then the actual coordinate values in the data also need to be the same (string) categorical factors, not numbers.
There is an underlying synthetic coordinate system for categorical ranges, which is why passing numbers "works" in any sense at all. But doing this is not the intended usage, and there is no guarantee that the mapping from categorical factors to (internal) synthetic numeric coordinates won't change at any time (i.e. it should not be relied upon).
See the User's Guide chapter Handling Categorical Data for more information and many examples.
Alternatively, if you really want to keep numerical y-coordinates, you could use a FuncTickFormatter that converts the integer coordinates into company names to display, in order to "fake" a categorical y-axis.
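A minimal sketch of the first approach (the company names, words, and DataFrame here are hypothetical): map the numeric company index back to the same string factors used in y_range, so the glyph coordinates and the categorical range agree:

```python
import pandas as pd
from bokeh.plotting import figure

companies = ['Acme', 'Globex', 'Initech']  # hypothetical factor list
df = pd.DataFrame({'word': ['engine', 'patent', 'claim'],
                   'company_idx': [0, 1, 2],
                   'freq': [5, 3, 7]})

# coordinates must be the same string factors as y_range, not numbers
df['company'] = df['company_idx'].map(dict(enumerate(companies)))

p = figure(x_range=df['word'].tolist(), y_range=companies)
p.scatter(x='word', y='company', size='freq', source=df)
```

With string coordinates the points sit exactly on their category rows, instead of being offset by the half-unit padding of the synthetic coordinate system.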

how to create interactive graph on a large data set?

I am trying to create an interactive graph using HoloViews on a large dataset. Below is a sample of the data file called trackData.csv:
Event      Time             ID   Venue
Javeline   11:25:21:012345  JVL  Dome
Shot pot   11:25:22:778929  SPT  Dome
4x4        11:25:21:993831  FOR  Track
4x4        11:25:22:874293  FOR  Track
Shot pot   11:25:21:087822  SPT  Dome
Javeline   11:25:23:878792  JVL  Dome
Long Jump  11:25:21:892902  LJP  Aquatic
Long Jump  11:25:22:799422  LJP  Aquatic
This is how I read the data and plot a scatter plot.
trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter
Because this dataset is quite large, zooming in and out of the scatter plot is very slow, and I would like to speed this process up.
I researched and found HoloViews' decimate, which is recommended for large datasets, but I don't know how to use it in the above code.
Most things I tried seem to throw an error. Also, is there a way to make sure the Time column is converted to microseconds? Thanks in advance for the help.
Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software as of my imagination -- what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.
A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).
Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.
If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events as in https://examples.pyviz.org/iex_trading/IEX_stocks.html . You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).
The disadvantage of decimate() is that it downsamples your data points.
I think you need datashade() here, but Datashader doesn't like that ID is a categorical variable instead of a numerical value.
So a solution could be to convert your categorical variable to numerical codes.
See the code example below for both hvPlot (which I prefer) and HoloViews:
import io
import pandas as pd
import hvplot.pandas
import holoviews as hv
# dynspread is for making point sizes larger when using datashade
from holoviews.operation.datashader import datashade, dynspread

# sample data (columns separated by 2+ spaces to match the separator below)
text = """
Event      Time             ID   Venue
Javeline   11:25:21:012345  JVL  Dome
Shot pot   11:25:22:778929  SPT  Dome
4x4        11:25:21:993831  FOR  Track
4x4        11:25:22:874293  FOR  Track
Shot pot   11:25:21:087822  SPT  Dome
Javeline   11:25:23:878792  JVL  Dome
Long Jump  11:25:21:892902  LJP  Aquatic
Long Jump  11:25:22:799422  LJP  Aquatic
"""

# create dataframe and parse time
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
df = df.set_index('Time').sort_index()

# get a column that converts categorical id's to numerical id's
df['ID'] = pd.Categorical(df['ID'])
df['ID_code'] = df['ID'].cat.codes

# use this to overwrite numerical yticks with categorical yticks
yticks = [(0, 'FOR'), (1, 'JVL'), (2, 'LJP'), (3, 'SPT')]

# this is the hvplot solution: set datashade=True
df.hvplot.scatter(
    x='Time',
    y='ID_code',
    datashade=True,
    dynspread=True,
    padding=0.05,
).opts(yticks=yticks)

# this is the holoviews solution
scatter = hv.Scatter(df, kdims=['Time'], vdims=['ID_code'])
dynspread(datashade(scatter)).opts(yticks=yticks, padding=0.05)
More info on datashader and decimate:
http://holoviews.org/user_guide/Large_Data.html
Resulting plot:

correlation heatmap in bokeh python

I tried looking around a bit to find a simpler way to plot the correlation matrix in bokeh via a heatmap, but could not find much help.
Let's say I have a correlation DataFrame created by way of:
corr_df = df.corr()
Can you please assist with how I can use the continuous number range in this DataFrame to reflect the color intensity in bokeh?
I understand that for the X and Y columns I will first have to pull the unique column names into a factors list.
factors = ["A","B","C"]
x = ["A","A","A","B"]
y = ["A","B","C","C"]
Is there an easy-peasy way to do all of this?
I can do all of this in seaborn with a single-line function.
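Not quite one line, but a possible self-contained sketch using rect with a LinearColorMapper (the DataFrame here is randomly generated for illustration): the trick is to stack the correlation matrix into (x, y, value) triples and let the mapper turn the values into colors.

```python
import numpy as np
import pandas as pd
from bokeh.models import ColorBar, LinearColorMapper
from bokeh.plotting import figure

# hypothetical data; in the question this would be the existing df
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=['A', 'B', 'C'])
corr_df = df.corr()

# flatten the matrix into (x, y, value) triples for rect()
stacked = corr_df.stack().reset_index()
stacked.columns = ['x', 'y', 'value']
factors = list(corr_df.columns)

# correlations live in [-1, 1], so fix the mapper to that range
mapper = LinearColorMapper(palette='Viridis256', low=-1, high=1)
p = figure(x_range=factors, y_range=factors)
p.rect(x='x', y='y', width=1, height=1, source=stacked,
       fill_color={'field': 'value', 'transform': mapper},
       line_color=None)
p.add_layout(ColorBar(color_mapper=mapper), 'right')
```

The categorical x_range and y_range take care of the factors, so there is no need to build the x and y lists by hand as in the question.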
