Producing data for a cumulative distribution plot in bigquery - python

I'd like to produce a cumulative plot that shows, for a given value, the percentage of data that is less than or equal to that value. In python / matplotlib / pandas, I can do this with the quantile function provided by pandas (and I guess numpy too):
import numpy as np
import matplotlib.pyplot as plt

def plot_quantile(series, start=0., end=0.99):
    y = np.linspace(start, end, 500)
    x = series.quantile(y)
    plt.plot(x, y * 100)
    plt.xlabel("Duration in minutes")
    plt.ylabel("% of drives less than x minutes")
    plt.grid()

plot_quantile(df.duration)
In this case I'm plotting the distribution of taxi ride duration from the NYC taxi dataset.
I'd like to produce similar data with an SQL query to bigquery. I'm pretty close with the following query:
select
approx_quantiles(duration, 100) as duration_quantile
from base_table
This gives me 101 data points, starting at the minimum value and ending at the max value. Now I have two problems:
I have no idea how the values correspond to the quantile (e.g. which value is the P50?) - I'll need to generate those numbers too for the plot, as you see in the python code.
I don't seem to have a way to truncate it near the top - as the max value is likely to be a very large outlier that makes my plot hard to read.
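For what it's worth, both points can also be handled client-side in Python. Here is a minimal sketch (my own, not part of the question), assuming the 101 values returned by the query have been loaded into a list called duration_quantiles (a hypothetical name):
import numpy as np
import matplotlib.pyplot as plt
# element k of approx_quantiles(duration, 100) is (approximately) the k-th percentile
y = np.linspace(0, 1, 101)
x = np.asarray(duration_quantiles)  # hypothetical: the query result as a Python list/array
keep = y <= 0.99                    # truncate the top to hide the outlier max
plt.plot(x[keep], y[keep] * 100)
plt.xlabel("Duration in minutes")
plt.ylabel("% of drives less than x minutes")
plt.grid()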

The following might not be exactly what you want, but I hope it helps.
SELECT min,
       SUM(cnt) OVER (ORDER BY min) cumulative_cnt,
       ROUND(SUM(cnt) OVER (ORDER BY min) / SUM(cnt) OVER (), 4) cumulative_pct
FROM (
  -- trip count per duration in minutes
  SELECT ROUND(trip_seconds/60) min, COUNT(1) cnt
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  GROUP BY 1
);
I've used another dataset from the bigquery-public-data project instead of the NYC taxi dataset.
The query above produces the cumulative distribution shown below in Looker Studio.
I don't seem to have a way to truncate it near the top - as the max value is likely to be a very large outlier that makes my plot hard to read.
Yes, it's not easy to read as you mentioned.
Looker Studio provides a custom filter. If you apply it to a chart, you can get a smoother curve.
Instead of filtering in Looker Studio, if you want to do it in the query itself, you can add a QUALIFY clause at the bottom of the query above.
SELECT min,
       SUM(cnt) OVER (ORDER BY min) cumulative_cnt,
       ROUND(SUM(cnt) OVER (ORDER BY min) / SUM(cnt) OVER (), 4) cumulative_pct
FROM (
  SELECT ROUND(trip_seconds/60) min, COUNT(1) cnt
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  GROUP BY 1
)
QUALIFY cumulative_pct < 0.99;
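If the end goal is still a matplotlib plot like the one in the question, here is a small sketch of plotting that query output, assuming it has been loaded into a hypothetical pandas DataFrame called cdf with columns min and cumulative_pct:
import matplotlib.pyplot as plt
# cdf: hypothetical DataFrame holding the query result
plt.plot(cdf["min"], cdf["cumulative_pct"] * 100)
plt.xlabel("Duration in minutes")
plt.ylabel("% of trips less than x minutes")
plt.grid()
plt.show()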


Why are the whiskers not displayed correctly with boxplots?

I would like to plot a boxplot for columns of a dataframe which contain percentages, and to set the
lower limit to 0 and the upper limit to 100 to visually detect the outliers. However, I didn't succeed in plotting the whiskers correctly.
Here I created a column with random percentages with some outliers.
import random
from random import randint
import matplotlib.pyplot as plt
import pandas as pd

random.seed(42)
lst = []
for x in range(140):
    x = randint(1, 100)
    lst.append(x)
lst.append(-1)
lst.append(300)
lst.append(140)
print(lst)
df = pd.DataFrame({0: lst})
Here is my function:
def boxplot(df, var, lower_limit=None, upper_limit=None):
    q1 = df[var].quantile(0.25)
    q3 = df[var].quantile(0.75)
    iqr = q3 - q1
    w1 = w2 = 1.5
    if (q1 != q3) and (lower_limit is not None):
        w1 = (q1 - lower_limit) / iqr
    if (q1 != q3) and (upper_limit is not None):
        w2 = (upper_limit - q3) / iqr
    plt.figure(figsize=(5, 5))
    df.boxplot(column=var, whis=(w1, w2))
    plt.show()
    print(f'The minimum of {var} is', df[var].min(), 'and its maximum is ', df[var].max(), "\n")
    print(f'The first quantile of {var} is ', q1, 'its median is ', df[var].median(), 'and its third quantile is ', q3, "\n")
I called boxplot(df, 0, lower_limit=0, upper_limit=100) and got this result:
But the whiskers don't go to 100 and I would like to know why.
TLDR: I don't think you can do what you want to do. The whiskers must snap to values within your dataset, and cannot be set arbitrarily.
Here is a good reference post: https://stackoverflow.com/a/65390045/13386979.
First of all, kudos on a nice first post. It is great that you provided code to reproduce your problem 👏 There were a few small syntax errors, see my edit.
My impression is that what you want to do is not possible with the matplotlib boxplot (which is called by df.boxplot). One issue is that the units of the whis parameter (when you pass a pair of floats) are in percentiles. Taken from the documentation:
If a pair of floats, they indicate the percentiles at which to draw the whiskers (e.g., (5, 95)). In particular, setting this to (0, 100) results in whiskers covering the whole range of the data.
When you pass lower_limit=0, upper_limit=100 to your function, you end up with w1 == 0.5490196078431373 and w2 == 0.4117647058823529 (you can add a print statement to verify this). This tells the boxplot to extend whiskers to the 0.5th and 0.4th percentile, which are both very small (the boxplot edges are the 25th to 75th percentile). The latter is smaller than the 75th percentile, so the top whisker is drawn at the upper edge of the box.
It seems that you have based your calculation of w1 and w2 on this section from the documentation:
If a float, the lower whisker is at the lowest datum above Q1 - whis*(Q3-Q1), and the upper whisker at the highest datum below Q3 + whis*(Q3-Q1), where Q1 and Q3 are the first and third quartiles. The default value of whis = 1.5 corresponds to Tukey's original definition of boxplots.
I say this because if you also print q1 - w1 * iqr and q3 + w2 * iqr within your call, you get 0 and 100 (respectively). But this calculation is only relevant when a single float is passed (not a pair).
But okay, then what can you pass to whis to get the limits to be any arbitrary value? This is the real problem: I don't think this is possible. The percentiles will always be a value in your data set (there is no interpolation between points). Thus, the edges of the whiskers always snap to a point in your dataset. If you have a point near 0 and near 100, you could find the corresponding percentile to place the whisker there. But without a point there, you cannot hack the whis parameter to set the limits arbitrarily.
I think to fully implement what you want, you should look into drawing the boxes and whiskers manually. Though the caution shared in the other post I referenced is also relevant here:
But be aware that this is not a box and whiskers plot anymore, so you should clearly describe what you're plotting here, otherwise people will be mislead.
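If you do draw things manually, one hedged option (my own sketch, not from the referenced post) is matplotlib's Axes.bxp, which draws a box from precomputed statistics and therefore lets the whisker ends be set to arbitrary values such as 0 and 100:
import matplotlib.pyplot as plt

def boxplot_fixed_whiskers(df, var, lower_limit, upper_limit):
    # build the box statistics by hand so the whisker ends need not snap to data points
    stats = [{
        'med': df[var].median(),
        'q1': df[var].quantile(0.25),
        'q3': df[var].quantile(0.75),
        'whislo': lower_limit,   # whisker ends set explicitly
        'whishi': upper_limit,
        'fliers': df[var][(df[var] < lower_limit) | (df[var] > upper_limit)].values,
    }]
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.bxp(stats)                # draw the box from the precomputed stats
    plt.show()

boxplot_fixed_whiskers(df, 0, 0, 100)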

Python: efficient way to calculate moving average for fixed time window (NOT fixed observation window)

Problem description
Say I have:
vector of time, dtype is numpy datetime64,
vector of parameters, dtype is numpy float
time horizon, dtype is numpy timedelta64
And time.shape == parameters.shape. Values of time are unique and distances between elements are not even.
My goal: for each moment t in time, calculate some statistic (for instance mean, min, max, sum, etc.) over the parameters values whose timestamps fall in the period from t - horizon to t.
The rookie way would be to use a loop (which I don't want to do for performance reasons) or some pandas aggregation/resampling (which is not ideal either, as I don't want to aggregate - that creates a new time vector, while I want to preserve my original time vector).
My current approach
I create the following matrix. The visualization (on real data) shows why I need a different range to calculate the statistic for each observation separately - sometimes 15 minutes of history contains 5,000 observations, sometimes only a few hundred. This is also something that I measure - how many events occurred within a fixed time horizon.
past = (time < time[:, None]) & (time > (time - horizon)[:, None])
plt.imshow(past)
The first problem: creating a matrix like the one above for long observation vectors is time-consuming. Is there a better way to create such a matrix? The matrix shown represents real data for one day, but the vectors can also be longer (up to 50,000 unique observations), and scalability is what I'm really aiming for.
Later I use TensorFlow to calculate the desired statistic: first I multiply the matrices together, so that I only keep data where past is True, and then I compute the desired statistic (mean, count or whatever I want) over the rows of the resulting matrix. What is returned is a vector with shape == parameters.shape.
The second question - is there a better way to do that? By better of course I mean faster.
EDIT
Sample code
import datetime
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def multiply_time(param, time):
    if param.shape[0] == 1 or param.ndim == 1:
        _temp_param = np.ma.masked_equal(param * time, 0).data
    else:
        _temp_param = np.ma.masked_equal(np.sum(param, axis=1) * time, 0).data
    return_param = np.nanmean(np.where(_temp_param != 0, _temp_param, np.nan), axis=1)
    return return_param

horizon = np.timedelta64(10, 's')
increment = np.timedelta64(1, 's')
vector_len = 100
parameters = np.random.rand(vector_len)

# create time vector where distances between elements are not even
increment_vec = np.cumsum(np.random.randint(0, 10, vector_len) * increment)
time = np.datetime64(datetime.datetime.now()) + increment_vec

past = (time < time[:, None]) & (time > (time - horizon)[:, None])
plt.imshow(past)

result = multiply_time(parameters, past)

pd_result = pd.DataFrame(parameters).rolling(10, 1).mean()

plt.plot(time, result, c='r', label='desired')
plt.plot(time, parameters, c='g', label='original')
plt.plot(time, pd_result, c='b', label='pandas')
plt.legend()
plt.show()
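As an aside on the first question (avoiding the full n x n matrix), here is a hedged sketch, not from the post, of an O(n log n) alternative that uses np.searchsorted on the sorted time vector plus a cumulative sum. Note it includes the current observation in each window, so the boundary handling differs slightly from the strict inequalities used to build past.
import numpy as np
# reuses time, parameters and horizon from the sample code above
left = np.searchsorted(time, time - horizon, side='right')  # first index inside each window
right = np.arange(1, len(time) + 1)                         # window ends at (and includes) the current point
csum = np.concatenate(([0.0], np.cumsum(parameters)))
windowed_mean = (csum[right] - csum[left]) / (right - left)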
EDIT 2:
I guess we can close this, as the answer using pandas rolling gives the best results.
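For reference, a minimal sketch of that time-based pandas rolling (assuming the time and parameters arrays from the sample code above): putting the timestamps into a DatetimeIndex and passing an offset string as the window keeps one row per original observation instead of resampling to a new time vector.
import pandas as pd
s = pd.Series(parameters, index=pd.DatetimeIndex(time))
rolling_mean = s.rolling('10s').mean()  # 10-second time window ending at each observation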

Is there a good way to visualize a large number of subplots (> 500)?

I am still working on my New York Subway data. I cleaned and wrangled the data in such a fashion that I now have 'Average Entries' and 'Average Exits' per Station per hour (ranging from 0 to 23) separated for weekend and weekday (category variable with two possible values: weekend/weekday).
What I was trying to do is to create a plot with each station being a row, each row having two columns (first for weekday, second for weekend). I would like to plot 'Average Entries' and 'Average Exits' per hour to gain some information about the stations. There are two things of interest here; firstly the sheer numbers to indicate how busy a station is; secondly the ratio between entries and exits for a given hour to indicate if the station is in a living area (loads of entries in the morning, loads of exits in the evening) or more of a working area (loads of exits in the morning, entries peaking around 4, 6 and 8 pm or so). The only problem: there are roughly 550 stations.
I tried plotting it with seaborn FacetGrid, which can't handle more than a few stations (10 or so) without running into memory issues.
So I was wondering if anybody had a good idea to accomplish what I am trying to do.
Please find attached a notebook (the second-to-last cell shows my attempt at visualizing the data, i.e. the plotting for 4 stations). That clearly wouldn't work for 500+ stations, so maybe 5 stations in a row after all?
The very last cell contains the data for Station R001 as requested in a comment.
https://github.com/FBosler/Udacity/blob/master/Example.ipynb
Any input much appreciated!
Fabian
Rather than making 550+ subplots, see if you can make two big numpy arrays and then use two imshow subplots, one for weekdays and one for weekends.
For the y-values, first find the min (0) and max (10,000?) of your average values, scale these to fit each fake row of, for example, 10px, then offset each row of your data by 10px * the row number.
Since you want line plots for each of your 24 data points, you'll have to do linear interpolation between your data points in increments of, again for example, 10px so that the final numpy arrays will be 240 x 5500 x 2.
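A rough sketch of that idea (my own illustration with assumed sizes and random stand-in data, so treat the shapes and scaling as placeholders):
import numpy as np
import matplotlib.pyplot as plt

n_stations, row_px, x_px = 550, 10, 240        # 10px band per station, 24 hours -> 240 x-samples
data = np.random.rand(n_stations, 24) * 10000  # stand-in for the per-station hourly averages
x_new = np.linspace(0, 23, x_px)
canvas = np.zeros((n_stations * row_px, x_px))
for i, row in enumerate(data):
    y = np.interp(x_new, np.arange(24), row)                       # interpolate 24 points to 240
    offset = (y - data.min()) / (data.max() - data.min()) * (row_px - 1)
    canvas[i * row_px + offset.astype(int), np.arange(x_px)] = 1   # one "line" pixel per column
plt.imshow(canvas, aspect='auto', origin='lower')
plt.show()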
A possible way you could do it is to use the ratio of entries to exits per station. Each day/hour could form a column of an image and each row would be a station. As an example:
from matplotlib import pyplot as plt
import random
import numpy as np

all_stations = []
for i in range(550):
    entries = [float(random.randint(0, 50)) for i in range(7*24)]  # Data point for each hour over a week
    exits = [float(random.randint(0, 50)) for i in range(7*24)]
    weekend_entries = entries[:2*7]
    weekend_exits = exits[:2*7]
    day_entries = entries[2*7:]
    day_exits = exits[2*7:]
    weekend_ratio = [np.array(en) / np.array(ex) for en, ex in zip(weekend_entries, weekend_exits)]
    day_ratio = [np.array(en) / np.array(ex) for en, ex in zip(day_entries, day_exits)]
    whole_week = weekend_ratio + day_ratio
    all_stations.append(whole_week)

plt.figure()
plt.imshow(all_stations, aspect='auto', interpolation="nearest")
plt.xlabel("Hours")
plt.ylabel("Station number")
plt.title("Entry/exit ratio per station")
plt.colorbar(label="Entry/exit ratio")

# Add some vertical lines to indicate days
for j in range(1, 7):
    plt.plot([j*24]*2, [0, 550], color="black")

plt.xlim(0, 7*24)
plt.ylim(0, 550)
plt.show()
If you would like to show the actual numbers involved and not the ratio, I would consider splitting the data in two: one image each for the entries and the exits data sets. The intensity of each pixel could then be used to convey the numbers rather than the ratio.
You're going to have problems displaying them all on a screen no matter what you do, unless you have a whole wall of monitors. However, to get around the memory constraint, you could rasterize the plots and save them to image files (I would suggest .png for its compressibility with images of few distinct colors).
What you want for that is pyplot.savefig()
Here's an answer to another question on how to do that, with some tips and tricks
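A hedged sketch of that workflow (the filenames, batch size, and figure contents here are assumptions, not from the answer):
import matplotlib
matplotlib.use("Agg")                  # render off-screen; nothing is shown on screen
import matplotlib.pyplot as plt
import numpy as np

hours = np.arange(24)
for start in range(0, 550, 50):        # 50 stations per image keeps each figure manageable
    fig, axes = plt.subplots(50, 2, figsize=(8, 100))
    for row, (ax_weekday, ax_weekend) in enumerate(axes):
        ax_weekday.plot(hours, np.random.rand(24))   # stand-in for the real per-station averages
        ax_weekend.plot(hours, np.random.rand(24))
        ax_weekday.set_ylabel(f"Station {start + row}", fontsize=4)
    fig.savefig(f"stations_{start:03d}.png", dpi=100)
    plt.close(fig)                     # free the figure's memory before the next batch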

Python: histogram/ binning data from 2 arrays.

I have two arrays of data: one is radius values and the other is the corresponding intensity reading at that radius:
e.g. a small section of the data. First column is radius and the second is the intensities.
29.77036614 0.04464427
29.70281027 0.07771409
29.63523525 0.09424901
29.3639355 1.322793
29.29596385 2.321502
29.22783249 2.415751
29.15969437 1.511504
29.09139827 1.01704
29.02302068 0.9442765
28.95463729 0.3109002
28.88609766 0.162065
28.81754446 0.1356054
28.74883612 0.03637681
28.68004928 0.05952569
28.61125036 0.05291172
28.54229804 0.08432806
28.4732599 0.09950128
28.43877462 0.1091304
28.40421016 0.09629156
28.36961249 0.1193614
28.33500089 0.102711
28.30037503 0.07161685
How can I bin the radius data and find the average intensity corresponding to each binned radius?
The aim is to then use the average intensity to assign an intensity value to radius points with a missing (NaN) intensity reading.
I've never had to use the histogram functions before and have very little idea of how they work or whether it's possible to do this with them. The full data set is large, with 336,622 data points, so I don't really want to use loops or if statements to achieve this.
Many Thanks for any help.
If you only need to do this for a handful of points, you could do something like this.
If intensities and radius are numpy arrays of your data:
bin_width = 0.1  # Depending on how narrow you want your bins

def get_avg(rad):
    average_intensity = intensities[(radius >= rad - bin_width/2.) & (radius < rad + bin_width/2.)].mean()
    return average_intensity

# This will return the average intensity in the bin: 27.95 <= rad < 28.05
average = get_avg(28.)
It's not really histogramming that you are after. A histogram is more a count of items that fall into a specific bin. What you want to do is more of a group-by operation, where you'd group your intensities by radius intervals and apply some aggregation method, like average or median, to each group of intensities.
What you are describing, however, sounds a lot more like some sort of interpolation you want to perform. So I would suggest thinking about interpolation as an alternative to solve your problem. Anyway, here's a suggestion for how you can achieve what you asked for (assuming you can use numpy) - I'm using random inputs to illustrate:
import random
import numpy

radius = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=numpy.float64)
intensities = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=numpy.float64)

# group your radius input into 20 equally spaced bins
bins = numpy.linspace(radius.min(), radius.max(), 20)
groups = numpy.digitize(radius, bins)

# groups now holds the index of the bin into which radius[i] falls
# loop through all bin indexes and select the corresponding intensities
# perform your aggregation on the selected intensities
# i'm keeping the aggregation for each group in a dict
aggregated = {}
for i in range(len(bins) + 1):
    selected_intensities = intensities[groups == i]
    aggregated[i] = selected_intensities.mean()
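For completeness, here is a loop-free variant of the same aggregation (a sketch assuming the same radius and intensities arrays as above): np.histogram can compute per-bin sums and counts directly, and their ratio gives the mean intensity per radius bin.
import numpy as np

bins = np.linspace(radius.min(), radius.max(), 20)
sums, _ = np.histogram(radius, bins=bins, weights=intensities)  # sum of intensities per bin
counts, _ = np.histogram(radius, bins=bins)                     # number of points per bin
bin_means = sums / counts                                       # empty bins produce NaN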

How can I find the FWHM of a peak in a noisy data set in python (numpy/scipy)?

I am analyzing an image of two crossing lines (like a + sign) and I am extracting a line of pixels (an nx1 numpy array) perpendicular to one of the lines. This gives me an array of floating point values (representing colors) that I can then plot. I am plotting the data with matplotlib and I get a bunch of noisy data between 180 and 200 with a distinct peak in the middle that spikes down to around 100.
I need to find FWHM of this data. I figured I needed to filter the noise first, so I used a gaussian filter, which smoothed out my data, but its still not super flat at the top.
I was wondering if there is a better way to filter the data.
How can I find the FWHM of this data?
I would like to only use numpy, scipy, and matplotlib if possible.
Here is the original data:
Here is the filtered data:
I ended up not using any filter, but rather used the original data.
The procedure I used was:
Found the minimum and maximum points and calculated difference = max(arr_y) - min(arr_y)
Found the half max (in my case it is half min) HM = difference / 2
Found the nearest data point to HM: nearest = (np.abs(arr_y - HM)).argmin()
Calculated the distance between nearest and min (this gives me the HWHM)
Then simply multiplied by 2 to get the FWHM
I don't know whether this is the best way, but it works and seems to be fairly accurate based on comparison.
Your script already does the correct calculation.
But the error in your distance between nearest and pos_extremum can be reduced by taking the distance between nearest_above and nearest_below - the positions at half the extremal value (maximum/minimum) on both of its sides.
import numpy as np
from scipy.stats import norm

# Example data
arr_x = np.linspace(norm.ppf(0.00001), norm.ppf(0.99999), 10000)
arr_y = norm.pdf(arr_x)

# Effective code
difference = max(arr_y) - min(arr_y)
HM = difference / 2
pos_extremum = arr_y.argmax()  # or in your case: arr_y.argmin()
nearest_above = (np.abs(arr_y[pos_extremum:-1] - HM)).argmin()
nearest_below = (np.abs(arr_y[0:pos_extremum] - HM)).argmin()
FWHM = (np.mean(arr_x[nearest_above + pos_extremum]) -
        np.mean(arr_x[nearest_below]))
For this example you should receive the relation between FWHM and the standard deviation:
FWHM = 2.355 times the standard deviation (here 1) as mentioned on Wikipedia.
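As an alternative (my own sketch, not part of the answer above), scipy.signal offers peak_widths, which measures the width at half the peak prominence-relative height in samples; multiplying by the x-spacing converts it to data units.
import numpy as np
from scipy.signal import find_peaks, peak_widths
from scipy.stats import norm

arr_x = np.linspace(-5, 5, 10000)
arr_y = norm.pdf(arr_x)
peaks, _ = find_peaks(arr_y)                                     # index of the single maximum here
widths, heights, left, right = peak_widths(arr_y, peaks, rel_height=0.5)
FWHM = widths[0] * (arr_x[1] - arr_x[0])                         # ~2.355 for a unit-sigma Gaussian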
