Is there a good way to visualize a large number of subplots (> 500)? - python

I am still working on my New York Subway data. I cleaned and wrangled the data in such a fashion that I now have 'Average Entries' and 'Average Exits' per Station per hour (ranging from 0 to 23) separated for weekend and weekday (category variable with two possible values: weekend/weekday).
What I was trying to do is create a plot with each station being a row, each row having two columns (first for weekday, second for weekend). I would like to plot 'Average Entries' and 'Average Exits' per hour to gain some information about the stations. There are two things of interest here: firstly, the sheer numbers, to indicate how busy a station is; secondly, the ratio between entries and exits for a given hour, to indicate whether the station is in a residential area (loads of entries in the morning, loads of exits in the evening) or more of a working area (loads of exits in the morning, entries peaking around 4, 6 and 8 pm or so). The only problem: there are roughly 550 stations.
I tried plotting it with seaborn's FacetGrid, which can't handle more than a few stations (10 or so) without running into memory issues.
So I was wondering if anybody had a good idea to accomplish what I am trying to do.
Please find attached a notebook (the second-to-last cell shows my attempt at visualizing the data, i.e. the plotting for 4 stations). That clearly wouldn't work for 500+ stations, so maybe 5 stations per row after all?
The very last cell contains the data for station R001, as requested in a comment.
https://github.com/FBosler/Udacity/blob/master/Example.ipynb
Any input much appreciated!
Fabian

Rather than making 550+ subplots, see if you can build two big NumPy arrays and then use two imshow subplots, one for weekdays and one for weekends.
For the y-values, first find the min (0) and max (10,000?) of your average values, scale them to fit a fake row of, for example, 10 px per station, then offset each station's row in your data by 10 px times the row number.
Since you want line plots for each of your 24 data points, you'll have to linearly interpolate between data points in increments of, again for example, 10 px, so that the final NumPy arrays will be 240 x 5500 x 2 (i.e. two 240 x 5500 arrays, one per panel). A rough sketch of this idea follows.
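A minimal sketch of that rasterization idea, using made-up entry counts (the array names, sizes, and the 10 px band height are all assumptions, not values from the notebook):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: average entries per station per hour,
# one array for weekdays and one for weekends
n_stations, n_hours, px = 550, 24, 10
rng = np.random.default_rng(0)
weekday = rng.integers(0, 10_000, size=(n_stations, n_hours)).astype(float)
weekend = rng.integers(0, 10_000, size=(n_stations, n_hours)).astype(float)

def rasterize(data, px=10):
    """Draw each station's 24 hourly values as a tiny line inside its own
    horizontal band of `px` image rows."""
    n_stations, n_hours = data.shape
    width = n_hours * px                       # 240 columns after interpolation
    img = np.zeros((n_stations * px, width))
    x_old = np.arange(n_hours)
    x_new = np.linspace(0, n_hours - 1, width)
    lo, hi = data.min(), data.max()
    for i, row in enumerate(data):
        # interpolate to pixel resolution and scale into the 0..px-1 band
        y = np.interp(x_new, x_old, row)
        y = (y - lo) / (hi - lo) * (px - 1)
        img[i * px + y.astype(int), np.arange(width)] = 1
    return img

fig, axes = plt.subplots(1, 2, figsize=(10, 20), sharey=True)
for ax, img, title in zip(axes, map(rasterize, (weekday, weekend)),
                          ("Weekday", "Weekend")):
    ax.imshow(img, aspect='auto', interpolation='nearest', cmap='gray_r')
    ax.set_title(title)
plt.show()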

A possible way you could do it is to use the ratio of entries to exits per station. Each day/hour could form a column of an image and each row would be a station. As an example:
from matplotlib import pyplot as plt
import random
import numpy as np

all_stations = []
for i in range(550):
    # Fake data: one entry/exit count for each hour over a week
    entries = [float(random.randint(1, 50)) for _ in range(7 * 24)]
    exits = [float(random.randint(1, 50)) for _ in range(7 * 24)]
    # First two days are the weekend, the rest are weekdays
    weekend_entries = entries[:2 * 24]
    weekend_exits = exits[:2 * 24]
    day_entries = entries[2 * 24:]
    day_exits = exits[2 * 24:]
    weekend_ratio = list(np.array(weekend_entries) / np.array(weekend_exits))
    day_ratio = list(np.array(day_entries) / np.array(day_exits))
    whole_week = weekend_ratio + day_ratio
    all_stations.append(whole_week)

plt.figure()
plt.imshow(all_stations, aspect='auto', interpolation="nearest")
plt.xlabel("Hours")
plt.ylabel("Station number")
plt.title("Entry/exit ratio per station")
plt.colorbar(label="Entry/exit ratio")
# Add some vertical lines to indicate days
for j in range(1, 7):
    plt.plot([j * 24] * 2, [0, 550], color="black")
plt.xlim(0, 7 * 24)
plt.ylim(0, 550)
plt.show()
If you would like to show the actual numbers involved and not the ratio, I would consider splitting the data in two, with one image each for the entries and the exits. The intensity of each pixel could then be used to convey the magnitude rather than the ratio; a rough sketch follows.
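As a minimal sketch of that split, with made-up counts standing in for the real per-station data:
import numpy as np
from matplotlib import pyplot as plt

# Hypothetical raw counts: 550 stations x (7*24) hours, entries and exits separately
rng = np.random.default_rng(1)
entries = rng.integers(0, 50, size=(550, 7 * 24))
exits = rng.integers(0, 50, size=(550, 7 * 24))

fig, axes = plt.subplots(1, 2, figsize=(12, 8), sharey=True)
for ax, data, title in zip(axes, (entries, exits), ("Entries", "Exits")):
    im = ax.imshow(data, aspect='auto', interpolation='nearest')
    ax.set_title(title)
    ax.set_xlabel("Hours")
    fig.colorbar(im, ax=ax, label="Average count")
axes[0].set_ylabel("Station number")
plt.show()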

You're going to have problems displaying them all on a screen no matter what you do, unless you have a whole wall of monitors. However, to get around the memory constraint, you could rasterize the plots and save them to image files (I would suggest .png for compressibility, since these images have few distinct colors).
What you want for that is pyplot.savefig()
Here's an answer to another question on how to do that, with some tips and tricks
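A minimal sketch of that approach (the station_data dict, its contents, and the output file names are made up for illustration, not taken from the notebook):
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen; nothing is shown on screen
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real per-station data:
# station name -> (hours, average entries, average exits)
rng = np.random.default_rng(0)
hours = np.arange(24)
station_data = {f"R{i:03d}": (hours, rng.random(24) * 1000, rng.random(24) * 1000)
                for i in range(1, 551)}

for name, (hrs, station_entries, station_exits) in station_data.items():
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(hrs, station_entries, label="entries")
    ax.plot(hrs, station_exits, label="exits")
    ax.set_title(name)
    ax.legend()
    fig.savefig(f"station_{name}.png", dpi=100)
    plt.close(fig)               # free the figure so memory use stays flat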

Related

How to deal with categorical data that has 35 unique values?

I am working on an IPL cricket dataset which has batting stats for all the teams, over by over.
I want to visualise how different cricket grounds affect the total score of the batting team. I tried to plot a simple scatter plot, but the stadium names are too long and the plot does not show them clearly.
Do I have to convert the 35 values into numeric values? It prints nothing when I try to find the correlation with the target variable.
The data set:
The problem with reading the plot (the x-axis):
You can change the size of the font and/or rotate it: https://matplotlib.org/api/matplotlib_configuration_api.html#matplotlib.rc
You can make your plot bigger by setting figsize. Add this as the first line:
plt.figure(figsize=(14, 8))
and then rotate the xticks at the end:
plt.xticks(rotation=90)
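Putting both together, a minimal sketch (the df and its 'stadium' and 'total_score' columns are placeholders, not the real dataset's column names):
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real data: 35 long ground names, one total per row
df = pd.DataFrame({
    "stadium": [f"Some Very Long Stadium Name {i}" for i in range(35)] * 10,
    "total_score": range(350),
})

plt.figure(figsize=(14, 8))
plt.scatter(df["stadium"], df["total_score"])
plt.xticks(rotation=90, fontsize=8)
plt.tight_layout()   # keep the long rotated labels inside the figure
plt.show()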

how to create interactive graph on a large data set?

I am trying to create an interactive graph using HoloViews on a large data set. Below is a sample of the data file called trackData.csv:
Event Time ID Venue
Javeline 11:25:21:012345 JVL Dome
Shot pot 11:25:22:778929 SPT Dome
4x4 11:25:21:993831 FOR Track
4x4 11:25:22:874293 FOR Track
Shot pot 11:25:21:087822 SPT Dome
Javeline 11:25:23:878792 JVL Dome
Long Jump 11:25:21:892902 LJP Aquatic
Long Jump 11:25:22:799422 LJP Aquatic
This is how I read the data and plot a scatter plot.
trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter
Because this data set is quite huge, zooming in and out of the scatter plot is very slow, and I would like to speed this up.
I researched and found HoloViews' decimate, which is recommended for large datasets, but I don't know how to use it in the above code.
Most of the things I tried seem to throw an error. Also, is there a way to make sure the Time column is parsed with microsecond precision? Thanks in advance for the help.
Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software as of my imagination -- what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.
A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).
Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.
If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events as in https://examples.pyviz.org/iex_trading/IEX_stocks.html . You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).
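For instance, a minimal sketch of the per-ID distribution idea using hv.Violin (the DataFrame, its column names, and the synthetic timestamps are made up for illustration, not the asker's data):
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')

# Hypothetical stand-in: many timestamped events for a handful of IDs,
# with the time of day expressed in seconds so it is numeric
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'ID': rng.choice(['JVL', 'SPT', 'FOR', 'LJP'], size=100_000),
    'seconds': rng.normal(loc=41_000, scale=600, size=100_000),
})

# One 1D distribution per category instead of a categorical "scatter"
violin = hv.Violin(df, kdims=['ID'], vdims=['seconds'])
violin.opts(width=600, height=400, ylabel='Time of day (s)')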
The disadvantage of decimate() is that it downsamples your datapoints.
I think you need datashade() here, but datashader doesn't like that ID is a categorical variable instead of a numerical value.
So a solution could be to convert your categorical variable to a numerical code.
See the code example below for both hvPlot (which I prefer) and HoloViews:
import io
import pandas as pd
import hvplot.pandas
import holoviews as hv
# dynspread is for making point sizes larger when using datashade
from holoviews.operation.datashader import datashade, dynspread
# sample data
text = """
Event Time ID Venue
Javeline 11:25:21:012345 JVL Dome
Shot pot 11:25:22:778929 SPT Dome
4x4 11:25:21:993831 FOR Track
4x4 11:25:22:874293 FOR Track
Shot pot 11:25:21:087822 SPT Dome
Javeline 11:25:23:878792 JVL Dome
Long Jump 11:25:21:892902 LJP Aquatic
Long Jump 11:25:22:799422 LJP Aquatic
"""
# create dataframe and parse time
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
df = df.set_index('Time').sort_index()
# get a column that converts categorical id's to numerical id's
df['ID'] = pd.Categorical(df['ID'])
df['ID_code'] = df['ID'].cat.codes
# use this to overwrite numerical yticks with categorical yticks
yticks=[(0, 'FOR'), (1, 'JVL'), (2, 'LJP'), (3, 'SPT')]
# this is the hvPlot solution: set datashade=True
df.hvplot.scatter(
    x='Time',
    y='ID_code',
    datashade=True,
    dynspread=True,
    padding=0.05,
).opts(yticks=yticks)
# this is the holoviews solution
scatter = hv.Scatter(df, kdims=['Time'], vdims=['ID_code'])
dynspread(datashade(scatter)).opts(yticks=yticks, padding=0.05)
More info on datashader and decimate:
http://holoviews.org/user_guide/Large_Data.html
Resulting plot:

Using Python: Group by and plot ratios to compare them, add additional calculations (i.e. histogram, scatter plot, density plot)

Measuring sales with ratios and plotting them.
The following data is about 4 salespeople.
The salespeople always work in pairs.
There are 3 data sets for each pair of salespeople; 12 likely combinations of the salespeople, so 36 rows of data.
One salesperson is seated at a desk and the other person is standing, both are talking to clients, s1 = salesman # 1, s2 = salesman # 2, s3 = salesman # 3, s4 = salesman # 4.
There are 12 combinations where each of the salespeople is seated or standing at different times, giving 36 data points.
In a plot, I want to show how far the standing/seated ratio is from the target ratio, and then add the number of minutes worked (using bars, maybe). In the end, I want 3 standing/seated ratios per pair and to see how far each is from the target ratio. I should have 12 plots because there are 12 different pairs.
I have tried this in Python with groupby (pandas) but I cannot plot any of it.
At this point I am not sure if I should continue using groupby. I want to plot each pair (e.g. s1, s2), showing the standing/seated ratio in one color and the target ratio in a different color. I am not sure whether I should use a scatter plot, a density plot, or something else.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
OB = pd.read_excel('C:\\Users\\isaac\\Example_datav2.xlsx')
OB.shape
OB2 = OB.groupby(['seated','standing'])
OB2.describe
Data:
https://docs.google.com/spreadsheets/d/1K06PGtZk5CeGJTCoLmLZpSWMHXLgovfBUHvsfKhuWCQ/edit#gid=0
You need to use e.g. OB2.describe(include='all') - see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html for your calculation. OB.groupby is used correctly. You may also use DataFrame.reindex, if required.
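A minimal sketch of that, assuming (purely as placeholders) that the spreadsheet has columns named sales_seated, sales_standing, and ratio_target alongside seated and standing:
import pandas as pd

# Column names below are assumptions about the spreadsheet, not its real headers
OB = pd.read_excel('Example_datav2.xlsx')

OB2 = OB.groupby(['seated', 'standing'])
print(OB2.describe(include='all'))   # note the parentheses: describe is a method call

# One possible plot: mean standing/seated ratio vs. the target ratio per pair
summary = OB2[['sales_standing', 'sales_seated', 'ratio_target']].mean()
summary['ratio'] = summary['sales_standing'] / summary['sales_seated']
summary[['ratio', 'ratio_target']].plot.bar(figsize=(10, 5))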

matplotlib radar plot min values

I started with the matplotlib radar example, but values below some minimum disappear.
I have a gist here.
The result looks like this (plot omitted): as you can see in the gist, the values for D and E in series A are both 3, but they don't show up at all.
There is some scaling going on.
In order to find out what the problem is, I started with the original values and removed them one by one.
When I removed one whole series, the scale would shrink.
For example, removing Factor 5 shrinks the scale in the [0, 0.2] range (before/after plots omitted).
I don't care so much about the scaling, but I would like my values with a score of 3 to show up.
Many thanks
Actually, the values for D and E in series A do show up, although they are plotted in the center of the plot. This is because the limits of your "y-axis" are autoscaled.
If you want a fixed "minimum radius", you can simply put ax.set_ylim(bottom=0) in your for-loop.
If you want the minimum radius to be relative to the lowest plotted value, you can include something like ax.set_ylim(np.asarray(list(data.values())).flatten().min() - margin) in the for-loop, where margin is the distance from the lowest plotted value to the center of the plot.
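A minimal sketch of the second option, assuming data is the dict of series and axes is the grid of radar axes from the gist (both names are assumptions about that code, not defined here):
import numpy as np

margin = 1
all_values = np.asarray(list(data.values())).flatten()
for ax in axes.flat:
    # either pin the center at 0 ...
    # ax.set_ylim(bottom=0)
    # ... or keep it a fixed margin below the smallest plotted value
    ax.set_ylim(bottom=all_values.min() - margin)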
With fixed center at radius 0 (added markers to better show that the points are plotted):
By setting margin = 1, and using the relative y-limits, I get this output:

Python - FFT leads to wrong physical meanings

I am new to Python.
I intend to do Fourier Transform to an array of discrete points, (time, acceleration), and plot the result out.
I copy and paste the sample FFT code, and modify accordingly.
Please see the code:
import numpy as np
import matplotlib.pyplot as plt

# Load the .txt file in
myData = np.loadtxt('twenty_z_up.txt')
# Extract the time column
time = myData[:, 0].copy()
# Extract the acceleration column
zAcc = myData[:, 3].copy()

t = np.arange(10080)
sp = np.fft.fft(zAcc)
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, sp.real)
myData is a rectangular matrix with 10080 rows and 10 columns.
Thus, zAcc is column 3 extracted from the matrix.
In the plot drawn by Spyder, most of the harmonics are concentrated around 0.
They are all extremely small.
But my data are actually the accelerations of a phone carried by a walking person (including gravity), so I expect the most significant harmonic to be around 2 Hz.
Why does the graph make no sense?
Thanks in advance!
==============UPDATES: My Graphs======================
The first one is in the time domain:
the x-axis is in milliseconds;
the y-axis is in m/s^2 and, due to Earth's gravity, has a DC offset of ~10.
You do get two spikes at (approximately) 2Hz. Your sampling period is around 2.8 ms (as best as I can infer from your first plot), giving +/-2Hz a normalized frequency of +/-0.056, which is about where your spikes are. np.fft.fftfreq by default returns the normalized frequency (i.e. it assumes a sampling period of 1). You can set the d argument to the sampling period, and you'll get a vector containing the actual frequencies.
Your huge spike in the middle is obviously the DC offset (which you can trivially remove by subtracting the mean).
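A minimal sketch putting both points together (zAcc is the array from the question's code; the 2.8 ms sampling period is only my guess from the plot):
import numpy as np
import matplotlib.pyplot as plt

dt = 2.8e-3                                  # assumed sampling period in seconds
zAcc_ac = zAcc - zAcc.mean()                 # remove the DC offset (gravity)
sp = np.fft.fft(zAcc_ac)
freq = np.fft.fftfreq(zAcc_ac.size, d=dt)    # real frequencies in Hz, not normalized

plt.plot(freq, np.abs(sp))
plt.xlabel("Frequency (Hz)")
plt.show()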
As others said, we need to see the data, post it somewhere. Just to check, try first fixing the timestep size in fftfreq, then plot this synthetic signal, and then plot your signal to see how they compare:
import numpy as np
import matplotlib.pyplot as plt

timestep = 1. / 50.  # Assume sampling at 50 Hz. Change this accordingly.
N = 10080            # the number of samples
T = N * timestep
t = np.linspace(0, T, N)  # needed only to generate xAcc_synthetic
freq = 2.                 # peak frequency at 2 Hz
# generate a synthetic signal at 2 Hz and add some noise to it
xAcc_synthetic = np.sin((2 * np.pi) * freq * t) + np.random.rand(N) * 0.2
sp_synthetic = np.fft.fft(xAcc_synthetic)
freq = np.fft.fftfreq(t.size, d=timestep)
print(max(abs(freq)) == (1 / timestep) / 2.)  # simple check of the highest frequency
plt.plot(freq, abs(sp_synthetic))
plt.xlabel('Hz')
Now, at x = 2 on the axis you actually have a physical frequency of 2 Hz, and you may spot the more pronounced peak you are looking for. Moreover, you may also want to have a look at yAcc and zAcc.
