Splitting large data set and plotting the average in matplotlib


I have a large data set with over 10,000 rows with values between 0 and 400,000,000. I would like to plot those values against the mean of another column in matplotlib, with the x axis incrementing by 50,000,000, but I am unsure how to do so. I can plot it using pandas, but I would really like to do it using matplotlib directly. This is what I have in pandas:
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
mean_values.plot(kind='line',figsize=(12,5))

I think I figured out what your problem is.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Create some data
df = pd.DataFrame({'budget_adj': np.random.uniform(0, 4000000000, 10000),
'vote_average': np.random.uniform(0, 100000, 10000)})
# Calculate the mean values
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
And this is what I suspect you are doing:
# This won't work since mean_values.index holds intervals
plt.plot(mean_values.index, mean_values)
This won't work since your index is a categorical index of intervals. For plot to work, your x-values have to be numbers. We can convert the intervals in several ways:
# You can pick the left endpoint...
x_values = [i.left for i in mean_values.index]
# the right endpoint...
x_values = [i.right for i in mean_values.index]
# or the center value.
x_values = [i.mid for i in mean_values.index]
# And NOW you will get no error
plt.plot(x_values, mean_values)
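If you would rather avoid the interval index altogether, the bin means can also be computed with NumPy alone and passed straight to matplotlib. The following is only a sketch under stated assumptions: the data is randomly generated, the range 0 to 400,000,000 is taken from the question text rather than from its code, and the variable names are made up for illustration. Note that np.histogram uses left-closed bins while pd.cut defaults to right-closed ones, so values that fall exactly on an edge are assigned differently.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data; replace with df['budget_adj'] and df['vote_average']
rng = np.random.default_rng(0)
budget = rng.uniform(0, 400_000_000, 10_000)
votes = rng.uniform(0, 10, 10_000)

# Bin edges every 50,000,000 (the upper bound is included as the last edge)
edges = np.arange(0, 400_000_001, 50_000_000)

# Per-bin count and per-bin sum of votes; their ratio is the per-bin mean
counts, _ = np.histogram(budget, bins=edges)
sums, _ = np.histogram(budget, bins=edges, weights=votes)
with np.errstate(invalid="ignore", divide="ignore"):
    means = sums / counts  # empty bins become NaN

# Plot each mean at the center of its bin, with ticks on the bin edges
centers = (edges[:-1] + edges[1:]) / 2
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(centers, means, marker="o")
ax.set_xticks(edges)
ax.set_xlabel("budget_adj")
ax.set_ylabel("mean vote_average")
```

Call plt.show() afterwards to display the figure.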

Related

Highlight data gaps (NaN) in Matplotlib Scatter Plot

I am plotting some time-based data from pandas in matplotlib (it can be tens of thousands of rows) and I would like to highlight periods where there are NaNs in the data. The way I thought to accomplish this was to use axvspan to draw red boxes on the plot, starting and stopping where there are data gaps. I did think about just drawing a vertical line each time there was a NaN using axvline, but this could create thousands of objects on the plot and cause the resulting PNG to take a long time to write, so I think axvspan is more appropriate. However, where I am stuck is finding the start and stop indices of the groups of NaNs.
The code below isn't my actual code; it is just a basic mockup to show what I am trying to achieve.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
print(df)
# Code to find the start index and stop index of the groups of NaNs
# results in a list which contains lists of each gap start and stop datetime
gaps = []
plt.plot(df.index, df['col'])
for gap in gaps:
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.show()
The result would look something like the mockup below:
Other suggestions for visualizing the gaps would also be appreciated, such as a straight line in a different color connecting the data across the gap using some sort of fillna.

To find the start and stop indices of the groups of NaNs, first create a boolean variable that marks where col is NaN. With it you can find the rows where the data switches between valid and NaN values: shift offsets the series by one row and ne compares each row with the previous one, flagging every transition. Applying cumsum to those flags then gives each contiguous run of valid or NaN values its own group number.
Now, using only the rows with NaN values (df[is_nan]), use groupby with n_groups to gather each gap into its own group. Next, apply aggregate to return a single tuple with the start and end timestamps of each group. The DateOffset extends each rectangle to the adjacent valid points, matching the desired output image. Finally, ['col'].values pulls the tuples out of the frame returned by aggregate as an array.
...
...
df = df.set_index('idx')
print(df)
# Code to find the start index and stop index of the groups of NaNs
is_nan = df['col'].isna()
n_groups = is_nan.ne(is_nan.shift()).cumsum()
gap_list = df[is_nan].groupby(n_groups).aggregate(
    lambda x: (
        x.index[0] + pd.DateOffset(days=-1),
        x.index[-1] + pd.DateOffset(days=+1)
    )
)["col"].values
# results in an array which contains tuples of each gap start and stop datetime
gaps = gap_list
plt.plot(df.index, df['col'], marker='o')
plt.xticks(df.index, rotation=45)
for gap in gaps:
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.grid()
plt.show()
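The shift/ne/cumsum grouping described above can also be written without a lambda that returns tuples, which may behave differently across pandas versions. This is a minimal sketch, assuming a Series with fixed dates instead of datetime.now(); the names s, group_id and bounds are illustrative only. Unlike the code above, it returns the first and last NaN timestamp of each gap without the one-day extension.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2021-03-08", periods=14, freq="D")
data = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
s = pd.Series(data, index=days)

# True where the value is missing
is_nan = s.isna()
# A new group number starts every time is_nan changes value
group_id = is_nan.ne(is_nan.shift()).cumsum()

# Timestamps of the NaN rows, grouped by the run they belong to
nan_times = s.index[is_nan.to_numpy()].to_series()
bounds = nan_times.groupby(group_id[is_nan]).agg(["first", "last"])

# One (start, stop) tuple per gap
gaps = list(zip(bounds["first"], bounds["last"]))
```

To reproduce the extended rectangles, subtract and add pd.Timedelta(days=1) to the start and stop of each tuple before passing them to axvspan.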

We can use fill_between to highlight areas. However, it is much easier to define the regions that contain data than the regions that do not, because the latter would require working out where each gap meets its neighboring data points. So we simply highlight the entire plotting area, then paint over the regions that contain data in white, then plot:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
fig, ax = plt.subplots()
ax.fill_between(df.index, df.col.min(), df.col.max(), where=df.col, facecolor="lightblue", alpha=0.5)
ax.fill_between(df.index, df.col.min(), df.col.max(), where=np.isfinite(df.col), facecolor="white", alpha=1)
ax.plot(df.index, df.col)
ax.xaxis.set_tick_params(rotation=45)
plt.tight_layout()
plt.show()
Sample output:

You can loop through the enumerated list of boolean values given by df['col'].isna() and compare each boolean value to the previous one to select the timestamps for the starts and stops of the gaps. Here is an example based on your code sample and where the plot is generated with the pandas plotting function:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame(dict(col=data), index=days)
ax = df.plot(y='col', marker='.', figsize=(8,4))
# Generate lists of start and stop timestamps for gaps in the time series,
# assuming that the first and last data points are not NaNs
starts, stops = [], []
for idx, isna in enumerate(df['col'].isna()):
    # .iloc makes the lookup of the previous row explicitly positional
    if isna != df['col'].isna().iloc[idx-1] and isna:
        starts.append(df.index[idx-1])
    elif isna != df['col'].isna().iloc[idx-1] and not isna:
        stops.append(df.index[idx])
# Plot red vertical spans for gaps in time series
for start, stop in zip(starts, stops):
    ax.axvspan(start, stop, facecolor='r', alpha=0.3)
plt.show()
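For long series, the same starts and stops can be found without a Python loop by looking at where the boolean mask changes. The sketch below is an assumption-laden variant of the loop above, not a drop-in replacement: it keeps the same premise that the first and last points are not NaN, and the names change, start_pos and stop_pos are made up for illustration.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2021-03-08", periods=14, freq="D")
data = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
df = pd.DataFrame(dict(col=data), index=days)

# 1 where the next point is the first NaN of a gap, -1 where the next point is valid again
is_nan = df["col"].isna().to_numpy()
change = np.diff(is_nan.astype(int))

start_pos = np.flatnonzero(change == 1)       # last valid point before each gap
stop_pos = np.flatnonzero(change == -1) + 1   # first valid point after each gap

starts = df.index[start_pos]
stops = df.index[stop_pos]
```

The resulting starts and stops can be zipped and passed to ax.axvspan exactly as in the loop version.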

In the end I took a little from each of the provided answers; thanks for the feedback. Building the list of starts and stops was very slow for real-world data (tens to hundreds of thousands of rows). Since I didn't need a numerical answer, just a visual one, I did it using matplotlib alone with the following code:
ax[i].fill_between(data.index, 0, (is_nan*data.max()), color='r', step='mid', linewidth='0')
ax[i].plot(data.index, data, color='b', linestyle='-', marker=',', label=ylabel)
The fill_between call creates the shaded blocks where the NaNs are. Multiplying is_nan by data.max() makes the blocks reach the top of the data range. step='mid' squares off the sides. A linewidth of 0 hides the red outline wherever the fill height is 0 (that is, where the data is not NaN).
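Since the two lines above depend on variables that are not shown (ax[i], data, is_nan, ylabel), here is a self-contained sketch of the same idea. It assumes a single Axes and the mock data from the question with fixed dates; the alpha value is an addition for readability and not part of the original snippet.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

days = pd.date_range("2021-03-08", periods=14, freq="D")
values = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
data = pd.Series(values, index=days)

# 1.0 where the value is missing, 0.0 elsewhere, scaled to the data maximum
is_nan = data.isna()
fill_height = is_nan.to_numpy() * data.max()

fig, ax = plt.subplots()
# step='mid' gives the shaded blocks vertical sides; linewidth=0 hides the outline
ax.fill_between(data.index, 0, fill_height, color="r", step="mid", linewidth=0, alpha=0.3)
ax.plot(data.index, data, color="b", linestyle="-", marker=".")
```

Call plt.show() or fig.savefig(...) afterwards to view the result.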

Three subplots in Python using the same data

I need some help here since I'm kind of a novice. I was able to import and plot my time series data via pandas and matplotlib, respectively. The thing is, the plot is too cramped (due to the amount of data, lol).
Using the same data set, is it possible to 'divide' the whole plot into 3 separate subplots?
Here's a sample of what I mean:
What I'm trying to do here is to distribute my plot into 3 subplots (it seems it doesn't have ncol=x).
Initially, my code runs like this:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
data = pd.read_csv ('all_visuallc.csv')
df = pd.DataFrame(data, columns= ['JD', 'Magnitude'])
print(df) #displays ~37000ish data x 2 columns
colors = ('#696969') #a very nice dim grey heh
area = np.pi*1.7
ax = df.plot.scatter(x="JD", y="Magnitude", s=area, c=colors, alpha=0.2)
ax.set(title='HD 39801', xlabel='Julian Date', ylabel='Visual Magnitude')
ax.invert_yaxis()
ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())
ax.yaxis.set_minor_locator(ticker.AutoMinorLocator())
plt.rcParams['figure.figsize'] = [20, 4]
plt.rcParams['figure.dpi'] = 250
plt.savefig('test_p.jpg')
plt.show()
which shows a very tight plot:
Thanks everyone and I do hope for your help and responses.
P.S. I think iloc[value:value] to slice from a df may work?

First of all, you have to create a separate plot for each part of your data.
For example, if we want to split the data into 3 parts, we create 3 subplots. Then, as you correctly guessed, we can apply iloc (or another type of indexing) to the data.
Here is a toy example, but I hope you will be able to apply your decorations to it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
y = np.arange(0,20,1)
x = np.arange(20,40,1)
sample = pd.DataFrame(x,y).reset_index().rename(columns={'index':'y',
                                                         0:'x'})
n_plots = 3
figs, axs = plt.subplots(n_plots, figsize=[30,10])
# Suppose we want to split data into equal parts
start_ind = 0
for i in range(n_plots):
    end_ind = start_ind + round(len(sample)/n_plots) #(*)
    part_of_frame = sample.iloc[start_ind:end_ind]
    axs[i].scatter(part_of_frame['x'], part_of_frame['y'])
    start_ind = end_ind
It's also possible to split the data into unequal parts by changing the logic on the line marked (*).
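As one possible variation on the line marked (*), the slice boundaries can be computed up front so that every row is guaranteed to land in exactly one subplot, regardless of how the length divides. This is a sketch on the same toy data; the names edges and parts are illustrative, and np.linspace is a swapped-in way of choosing the boundaries, not the method used above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

sample = pd.DataFrame({"x": np.arange(20, 40), "y": np.arange(0, 20)})
n_plots = 3

# n_plots + 1 boundaries from 0 to len(sample), truncated to integer positions
edges = np.linspace(0, len(sample), n_plots + 1).astype(int)

fig, axs = plt.subplots(n_plots, figsize=(12, 8))
parts = []
for ax, start, stop in zip(axs, edges[:-1], edges[1:]):
    part = sample.iloc[start:stop]   # rows start .. stop-1
    parts.append(part)
    ax.scatter(part["x"], part["y"])
```

For unequal parts, edges can simply be replaced by a hand-written list of row positions.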

How to remove certain values before plotting data

I'm using python for the first time. I have a csv file with a few columns of data: location, height, density, day etc... I am plotting height (i_h100) v density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the values on the y axis to be within a certain 'day' range say (85-260). I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)

Use .loc to subset the data going into the graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a':np.random.randint(0,365,20),
'b':np.random.rand(20),
'c':np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()
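Applied to the column names from the question, the same filter might look like the sketch below. The tiny DataFrame is invented so the example runs on its own, and the column name 'day' is an assumption based on the question's description of the CSV; with real data you would start from pandas.read_csv('data.csv') instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for data.csv
data = pd.DataFrame({
    "i_h100": [10, 45, 60, 30],
    "i_cd": [0.2, 0.5, 0.1, 0.9],
    "day": [50, 100, 200, 300],   # assumed column name
})

# Keep rows with day in [85, 260] and height below 50
mask = data["day"].between(85, 260) & (data["i_h100"] < 50)
subset = data.loc[mask]

ax = subset.plot(kind="scatter", x="i_h100", y="i_cd")
```

Filtering on i_h100 in the mask replaces the plt.xlim(right=50) call, since points outside the range are never plotted.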

How to plot stacked barplot based, when data is not periodical

I have a numpy array Xs containing index values, and another array heights containing heights. How can I plot a bar chart from these values elegantly, when some indexes are missing from Xs (I want an empty space there in the plot) and some are present multiple times (I want separate, stacked rectangles in that case)?
My naive solution uses two for loops to get the n-th elements and create multiple y arrays, then plots them on top of each other in another for loop, with automatic stacking. Is there a more convenient numpy/matplotlib function to handle my data?
import numpy as np
import matplotlib.pyplot as plt
Xs=np.array([0,1,1,1,3,4,4,6,6,6,7,8,9])
heights = np.array([10,9,8,5,7,6,4,3,2,1,1,12,1])
values, counts = np.unique(Xs, return_counts=True)
print (values, counts, max(counts))
WholeY=[]
smallY=np.zeros(max(Xs)+1)
for freq in range(1,max(counts)+1):
    for val, cnt in zip(values, counts):
        if cnt >= freq:
            index = np.where(Xs==val)[0][freq-1]
            smallY[val] = heights[index]
    WholeY.append(smallY)
    smallY=np.zeros(max(Xs)+1)
fig, ax = plt.subplots()
## stack them on each other automatically, create init bottom:
previousBars=np.zeros_like(smallY)
for smallY in WholeY:
    currentBars=ax.bar(np.arange(len(smallY)),smallY, bottom=previousBars)
    # accumulate the heights so a third layer sits on top of the first two
    previousBars=previousBars+smallY
plt.show()

Using pandas might be convenient. Not sure if this is what you're looking for:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Xs=np.array([0,1,1,1,3,4,4,6,6,6,7,8,9])
heights = np.array([10,9,8,5,7,6,4,3,2,1,1,12,1])
# Make an empty template with missing indexes included
g = {k:pd.Series(dtype=float) for k in range(max(Xs)+1)}
df = pd.DataFrame(heights, index=Xs)
# Get heights array for each index with groupby method and update corresponding entries in g
df.groupby(df.index).apply(lambda x: g.update({x.name: x[0].reset_index(drop=True)}))
# Plot stacked bar graph from pandas DataFrame
# Fill in empty values with 0 so that there will be an empty space for missing indexes
pd.DataFrame(g).T.fillna(0).plot.bar(stacked=True, legend=False)
plt.show()
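Another way to get the stacking, offered here only as a sketch, is to compute the bottom of every rectangle directly: within each x value, the bottom of a bar is the sum of the heights that came before it. A grouped cumulative sum gives this in one line, and a single ax.bar call then draws everything. The edgecolor is an illustrative choice so the stacked rectangles remain distinguishable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Xs = np.array([0, 1, 1, 1, 3, 4, 4, 6, 6, 6, 7, 8, 9])
heights = np.array([10, 9, 8, 5, 7, 6, 4, 3, 2, 1, 1, 12, 1])

# Running total of heights within each x value, minus the bar's own height,
# gives the height at which each rectangle has to start
bottoms = pd.Series(heights).groupby(Xs).cumsum().to_numpy() - heights

fig, ax = plt.subplots()
ax.bar(Xs, heights, bottom=bottoms, edgecolor="black")
ax.set_xticks(np.arange(Xs.max() + 1))  # missing indexes stay as empty slots
```

Unlike the layered approaches, all rectangles share one color here unless a color array is passed to ax.bar.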

Seaborn pairplot and NaN values

I'm trying to understand why this fails, even though the documentation says:
dropna : boolean, optional
Drop missing values from the data before plotting.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
'a': np.random.normal(size=(100,)),
'b': np.random.lognormal(size=(100,)),
'c': np.random.exponential(size=(100,))})
sns.pairplot(a) # this works as expected
# snip
b = a.copy()
b.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(b) # this fails with error
# "AttributeError: max must be larger than min in range parameter."
# in histogram(a, bins, range, normed, weights, density)
sns.pairplot(b, dropna=True) # same error as above

When you pass the data directly, i.e.
sns.pairplot(b) #Same as sns.pairplot(b, x_vars=['a','b','c'] , y_vars=['a','b','c'],dropna=True)
you are plotting every column of the DataFrame against every other, so make sure all columns have the same number of valid (non-NaN) rows.
sns.pairplot(b, x_vars=['a','c'] , y_vars=['a','b','c'],dropna=True)
In this case it works fine, but the graph will differ slightly because the NaN value has been removed.
So, if you want to plot the whole data set, then:
either the null values must be replaced using fillna(),
or the whole row containing NaN values must be dropped:
b = b.drop(b.index[5])
sns.pairplot(b)
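Both options can be written without hard-coding the row position. The sketch below shows only the pandas side and leaves the pairplot call out; the seed and the choice of the column mean as fill value are assumptions for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
b = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.lognormal(size=100),
    "c": rng.exponential(size=100),
})
b.iloc[5, 2] = np.nan  # one missing value in column 'c'

# Option 1: drop every row that contains at least one NaN
clean = b.dropna()

# Option 2: keep all rows and replace NaNs, here with each column's mean
filled = b.fillna(b.mean())
```

Either clean or filled can then be passed to sns.pairplot in place of b.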

I'm going to post an answer to my own question, even though it doesn't exactly solve the problem in general, but at least it solves my problem.
The problem arises when trying to draw histograms. However, it looks like the KDEs are much more robust to missing data. Therefore, this works, despite the NaN in the middle of the dataframe:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
'a': np.random.normal(size=(100,)),
'b': np.random.lognormal(size=(100,)),
'c': np.random.exponential(size=(100,))})
a.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(a, diag_kind='kde')

Something of a necro, but as I cracked the answer to this today I thought it might be worth sharing. I could not find this solution elsewhere on the web. If the Seaborn dropna keyword has not worked for your data and you don't want to drop all rows that have any NaN, this should work for you.
All of this is in Seaborn 0.9 with pandas 0.23.4, assuming a data frame (df) with j rows (samples) that have n columns (attributes).
Seaborn cannot cope with NaN-containing arrays being passed to it, which is a particular problem when you want to retain a row because its other columns hold useful data. The solution is to use a function that intercepts each pair of columns before they are passed to the PairGrid for plotting.
Functions can be passed to the grid sectors to carry out an operation per subplot. A simple example of this would be to calculate and annotate RMSE for a column pair (subplot) onto each plot:
import math
import sklearn.metrics as skm  # assumed: the original does not show what skm refers to

def rmse(x,y, **kwargs):
    rmse = math.sqrt(skm.mean_squared_error(x, y))
    label = 'RMSE = ' + str(round(rmse, 2))
    ax = plt.gca()
    ax.annotate(label, xy = (0.1, 0.95), size = 20, xycoords = ax.transAxes)
grid = grid.map_upper(rmse)
Therefore, by writing a function that Seaborn can take as a plotting argument and that drops NaNs per column pair as grid.map_ iterates over the main data frame, we can minimize data loss per sample (row). One NaN in a row no longer causes the entire row to be lost for all subplots; only the subplots involving that specific column exclude the row.
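To make the difference concrete, the small sketch below counts how many rows survive when NaNs are dropped globally versus per column pair. The four-row DataFrame is invented purely for illustration and no plotting is involved.

```python
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Dropping globally keeps only rows that are complete in every column
rows_global = len(df.dropna())

# Dropping per pair keeps every row that is complete in those two columns
rows_per_pair = {
    (x, y): len(df[[x, y]].dropna())
    for x, y in itertools.combinations(df.columns, 2)
}
```

Here the global drop keeps 2 rows, while the pairs ('a', 'c') and ('b', 'c') each keep 3, which is the data that the per-pair approach recovers.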
The following function carries out the pairwise NaN drop and then plots the two cleaned series on the current axes with matplotlib's scatter plot:
df = [YOUR DF HERE]
def col_nan_scatter(x,y, **kwargs):
    df = pd.DataFrame({'x':x[:],'y':y[:]})
    df = df.dropna()
    x = df['x']
    y = df['y']
    plt.gca()
    plt.scatter(x,y)
cols = df.columns
grid = sns.PairGrid(data= df, vars = cols, height = 4)
grid = grid.map_upper(col_nan_scatter)
The same can be done with seaborn plotting functions (for example, with just the x value):
def col_nan_kde_histo(x, **kwargs):
    df = pd.DataFrame({'x':x[:]})
    df = df.dropna()
    x = df['x']
    plt.gca()
    sns.kdeplot(x)
cols = df.columns
grid = sns.PairGrid(data= df, vars = cols, height = 4)
grid = grid.map_upper(col_nan_scatter)
# a function that takes only x belongs on the diagonal, so map_diag is used here
grid = grid.map_diag(col_nan_kde_histo)
