Plot certain range of values with pandas and matplotlib - python

I have parsed data out of a .json file and plotted it, but I only want a certain range of it,
e.g. year-month = 2014-12 to 2020-03.
The code is:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_json("observed-solar-cycle-indices.json", orient='records')
data = pd.DataFrame(data)
print(data)
x = data['time-tag']
y = data['ssn']
plt.plot(x, y, 'o')
plt.xlabel('Year-day'), plt.ylabel('SSN')
plt.show()
Here is the result; as you can see, there are too many points.
Here is the JSON file: https://services.swpc.noaa.gov/json/solar-cycle/observed-solar-cycle-indices.json
How can I either parse out certain values from the JSON file or plot a certain range?

The following should work:
Select the data using a start and end date
ndata = data[ (data['time-tag'] > '2014-01') & (data['time-tag'] < '2020-12')]
Plot the data. The x-axis labeling is adapted to display only every 12th label
x = ndata['time-tag']
y = ndata['ssn']
fig, ax = plt.subplots()
plt.plot(x, y, 'o')
every_nth = 12
for n, label in enumerate(ax.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)
plt.xlabel('Year-Month')
plt.xticks(rotation='vertical')
plt.ylabel('SSN')
plt.show()
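An alternative is to convert 'time-tag' to real dates before filtering, so that the range check does not rely on string comparison and matplotlib can space the date ticks by itself. This is only a minimal sketch: the small DataFrame is made-up sample data in the same shape as the NOAA file (the ssn values are invented), and the 'date' column name is my own choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample rows in the same shape as the NOAA file (ssn values are invented)
data = pd.DataFrame({
    'time-tag': ['2014-11', '2014-12', '2015-01', '2020-03', '2020-04'],
    'ssn': [70.1, 78.0, 67.0, 1.5, 5.2],
})

# Parse the 'YYYY-MM' strings into timestamps (first day of each month)
data['date'] = pd.to_datetime(data['time-tag'], format='%Y-%m')

# Keep 2014-12 through 2020-03; between() includes both end points by default
mask = data['date'].between('2014-12-01', '2020-03-01')
subset = data.loc[mask]

fig, ax = plt.subplots()
ax.plot(subset['date'], subset['ssn'], 'o')
ax.set_xlabel('Year-Month')
ax.set_ylabel('SSN')
```

With the real file, replace the sample DataFrame with the result of pd.read_json(...) and call plt.show() at the end.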

You could do a search for the index value of your start and end dates for both x and y values. Use this to create a smaller set of lists that you can plot.
For example, it might be something like
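x = data['time-tag'].tolist()  # list.index() needs plain lists, not a Series
y = data['ssn'].tolist()
start_index = x.index('2014-12')
end_index = x.index('2020-03')
# the end of a slice is exclusive, so add 1 to keep the end month
x_subsection = x[start_index : end_index + 1]
y_subsection = y[start_index : end_index + 1]
plt.plot(x_subsection, y_subsection, 'o')
plt.xlabel('Year-Month'), plt.ylabel('SSN')
plt.show()
Note that a pandas Series has no .index() method (Series.index is an attribute), which is why the columns are converted to plain lists with .tolist() first.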

Related

How to plot large dataset of date vs time using matplotlib

I want to plot a date vs. time graph using matplotlib. The issue I am facing is that, because of the excess of data, many lines are showing on the x-axis, and I can't find a way to plot my time on the x-axis cleanly with one-hour gaps. Say I have data in my list as strings, e.g. ['6:01','6:30','7:20','7:25']. I want to divide my x-axis from 6:00 to 7:00, and the time points between them should be plotted based on time.
Note: the time list is just an example; I want to do this for the whole 24 hours.
I tried to use ticks and many other options to complete my task, but unfortunately I am stuck on this problem. My data is in a CSV file.
Below is my code:
def arrivalGraph():
    from datetime import datetime, timedelta
    from matplotlib import pyplot as plt
    from matplotlib import dates as mpl_dates
    with open("Timetable2021.csv","r") as f:
        fileData = f.readlines()
    del fileData[0]
    date = []
    train1 = []
    for data in fileData:
        ind = data.split(",")
        date.append(datetime.strptime(ind[0],"%d/%m/%Y").date())
        train1Time = datetime.strptime(ind[1],"%H:%M").time()
        train1.append(train1Time.strftime("%H:%M"))
    plt.style.use("seaborn")
    plt.figure(figsize = (10,10))
    plt.plot_date(train1,date)
    plt.gcf().autofmt_xdate()#gcf is get current figure - autofmt is auto format
    dateformater = mpl_dates.DateFormatter("%b ,%d %Y")
    plt.gca().xaxis.set_major_formatter(dateformater) # to format the xaxis
    plt.xlabel("Date")
    plt.ylabel("Time")
    plt.title("Train Time vs Date Schedule")
    plt.tight_layout()
    plt.show()
When I run the code I get the following output:
[Figure: output of the above code]
Assuming that every single minute is present in train1 (i.e. train1 = ["00:00", "00:01", "00:02", "00:03", ... , "23:59"]), you can use plt.xticks() by generating an array representing the xticks, with an empty string for every minute that is not on the hour.
unique_times = sorted(set(train1))
xticks = ['' if time[-2:]!='00' else time for time in unique_times]
plt.style.use("seaborn")
plt.figure(figsize = (10,10))
plt.plot_date(train1,date)
plt.gcf().autofmt_xdate()#gcf is get current figure - autofmt is auto format
dateformater = mpl_dates.DateFormatter("%b ,%d %Y")
# I think you wanted to format the yaxis instead of xaxis
plt.gca().yaxis.set_major_formatter(dateformater) # to format the yaxis
plt.ylabel("Date")
plt.xlabel("Time")
plt.title("Train Time vs Date Schedule")
plt.xticks(range(len(xticks)), xticks)
plt.tight_layout()
plt.show()
If not every minute is present in the train1 array, you have to keep the train1 data as datetime objects and generate arrays of xtick locations and labels to pass to plt.xticks().
date = []
train1 = []
for data in fileData:
    ind = data.split(",")
    date.append(datetime.strptime(ind[0],"%d/%m/%Y").date())
    train1Time = datetime.strptime(ind[1],"%H:%M")
    train1.append(train1Time)
plt.style.use("seaborn")
plt.figure(figsize = (10,10))
plt.plot_date(train1,date)
plt.gcf().autofmt_xdate()#gcf is get current figure - autofmt is auto format
dateformater = mpl_dates.DateFormatter("%b ,%d %Y")
# I think you wanted to format the y axis instead of xaxis
plt.gca().yaxis.set_major_formatter(dateformater) # to format the yaxis
plt.ylabel("Date")
plt.xlabel("Time")
plt.title("Train Time vs Date Schedule")
ax = plt.gca()
xticks_val = []
xticks_loc = []
distance = (ax.get_xticks()[-1] - ax.get_xticks()[0]) / 24
def to_hour_str(x):
    x = str(x)
    if len(x) < 2:
        x = '0' + x
    return x + ':00'
for h in range(25):
    xticks_val.append(to_hour_str(h))
    xticks_loc.append(ax.get_xticks()[0] + h * distance)
plt.xticks(xticks_loc, xticks_val, rotation=90, ha='left')
plt.tight_layout()
plt.show()
Here's the code output using dummy data I generated myself.
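Another option, instead of computing tick positions by hand, is to let matplotlib's date locators place one tick per hour. The following is only a sketch under assumptions: the two short lists are made-up sample data standing in for the CSV columns, and the axis limits (6:00 to 8:00) are chosen just for this sample.

```python
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Made-up sample data standing in for the two CSV columns
days = ["01/01/2021", "02/01/2021", "03/01/2021", "04/01/2021"]
times = ["6:01", "6:30", "7:20", "7:25"]

dates = [datetime.strptime(d, "%d/%m/%Y").date() for d in days]
# Parsing only "%H:%M" puts every time on the same dummy day (1900-01-01),
# so the x-axis compares times of day only
clock = [datetime.strptime(t, "%H:%M") for t in times]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(clock, dates, "o")

# One major tick per hour, labelled as HH:MM
ax.xaxis.set_major_locator(mdates.HourLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
ax.set_xlim(datetime(1900, 1, 1, 6, 0), datetime(1900, 1, 1, 8, 0))

ax.set_xlabel("Time")
ax.set_ylabel("Date")
```

For a full day, the limits would run from datetime(1900, 1, 1, 0, 0) to datetime(1900, 1, 2, 0, 0); call plt.show() to display the figure.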

Cannot prepare proper labels in Matplotlib

I have very simple code:
from matplotlib import dates
import matplotlib.ticker as ticker
my_plot=df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90)
I get a plot, but I would like to have fewer labels on the x-axis. To do this I added:
my_plot.xaxis.set_major_locator(ticker.MaxNLocator(12))
It generates fewer labels, but the labels show the wrong values (the first few labels of the whole list).
What am I doing wrong?
Additional information:
I forgot to show what is inside the DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to a string; it looks like '2017-10' (YYYY-MM)
The box plot groups the data by 'Day', and I would like to show the 'Day' values as labels, but not all of them, for example every third one.
You were almost there. Just set ticker.MultipleLocator.
The pandas.DataFrame.boxplot also returns axes, which is an object of class matplotlib.axes.Axes. So you can use this code snippet to customize your labels:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
center = np.random.randint(50,size=(10, 20))
spread = np.random.rand(10, 20) * 30
flier_high = np.random.rand(10, 20) * 30 + 30
flier_low = np.random.rand(10, 20) * -30
y = np.concatenate((spread, center, flier_high, flier_low))
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(y)
x = ['Label '+str(i) for i in range(20)]
ax.set_xticklabels(x)
ax.set_xlabel('Day')
# Set a tick on each integer multiple of a base within the view interval.
ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
plt.xticks(rotation=90)
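Be aware that a locator chooses tick positions only; labels set with set_xticklabels are then handed out in order to whatever ticks the locator produces, which can lead to exactly the mismatch described in the question. A hedged sketch of one way to keep labels and boxes aligned is to thin out positions and labels together. The random data and the every_nth name are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random sample data: 20 boxes with 40 observations each
rng = np.random.default_rng(0)
values = rng.normal(size=(40, 20))
names = ['Label ' + str(i) for i in range(20)]

fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(values)  # boxes are drawn at x positions 1..20

# Keep every 5th box position together with its own label
every_nth = 5
positions = np.arange(1, len(names) + 1)
kept_positions = positions[::every_nth]
kept_names = names[::every_nth]

ax.set_xticks(kept_positions)
ax.set_xticklabels(kept_names, rotation=90)
ax.set_xlabel('Day')
```

Because the kept positions and the kept names are sliced with the same step, each remaining tick still carries the label of the box it sits under.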
I think there is a compatibility issue with Pandas plots and Matplotlib formatters.
With the following code:
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
You get this chart:
But I notice that this is grouped by month instead of by day. It may not be what you wanted.
Adding the day component to the string 'Day' messes up the chart as there seems to be too many boxes.
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 15 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
The for loop keeps one tick label every n periods and blanks out the rest. In the first chart the labels were set every 3 months; in the second one, every 15 days.
If you would like to see fewer grid lines:
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
new_locs = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        loc = locs[i]
        i += 1
        new_labels.append(label)
        new_locs.append(loc)
    else:
        i += 1
ax.set_xticks(new_locs)
ax.set_xticklabels(new_labels)
ax.grid(axis='y')
plt.show()
I've read about x_compat in Pandas plot in order to apply Matplotlib formatters, but I get an error when trying to apply it. I'll give it another shot later.
Old unsuccessful answer
The tick labels seem to be dates. If they are set as datetime in your dataframe, you can:
months = mdates.MonthLocator(bymonth=(1, 4, 7, 10)) #Choose the months you like the most
ax.xaxis.set_major_locator(months)
Otherwise, you can let Matplotlib know they are dates by:
ax.xaxis_date()
Your comment:
Additional information:
I forgot to show what is inside the DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to a string; it looks like '2017-10' (YYYY-MM)
The box plot groups the data by 'Day', and I would like to show the 'Day' values as labels, but not all of them, for example every third one.
Based on your comment quoted above, I would use reg_Date as the input and the following lines:
days = mdates.DayLocator(interval=3)
daysFmt = mdates.DateFormatter('%Y-%m') #to format display
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(daysFmt)
I forgot to mention that you will need to:
import matplotlib.dates as mdates
Does this work?

np.array mean to single column data frame

I have a 2-column array that I calculate the mean of (thus creating column A). I would like to be able to refer to and manipulate column A, but cannot seem to save it as a new single column. Here is my specific example; 'filtered' is what I'd like to be able to save and use. The error I regularly get is ValueError: Wrong number of items passed 2, placement implies 1.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df=pd.read_csv('/Users/myfile.csv', delimiter=',', usecols=['Time','Distance'])
x = df['Time']
y = df['Distance']
n = 25 #small n = less smoothed
fwd = pd.Series.ewm(df,span=n, adjust=True).mean()
bwd = pd.Series.ewm(df[::-1],span=n, adjust=True).mean()
filtered = np.stack(( fwd, bwd[::-1] ))
filtered2 = np.mean(filtered, axis=0)
plt.subplot(2,1,1)
plt.title('smoothed and raw data')
plt.plot(x,y, color = 'orange')
plt.plot(x,filtered, color='green')
plt.plot(x,fwd, color='red')
plt.plot(x[::-1],bwd, color='blue')
plt.xlabel('time')
plt.ylabel('distance')
df['filtered2'] = pd.DataFrame(filtered, dtype='str', index=None)
print(filtered2)
smoothed_velocity = ((df.filtered2 - df.filtered2.shift(1)) / df['Time'] - df['Time'].shift(1))
print(smoothed_velocity)
plt.subplot (2,1,2)
plt.title ('smoothed velocity')
plt.plot (smoothed_velocity, color = 'orange')
plt.tight_layout()
plt.show()
Because I define 'filtered' twice, I tried changing one to a different variable, with no luck. The error presented was ValueError: x and y must have same first dimension, but have shapes (458,) and (2, 458, 2)
Any help would be rad!
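One possible reading of the shapes in the error, (2, 458, 2), is that the smoothing is applied to the whole two-column DataFrame rather than to the 'Distance' column alone. The sketch below works under that assumption: it smooths a single column forward and backward, averages the two passes, and stores the result as one new column. The small DataFrame is made-up data in place of the CSV, and the 'smoothed_velocity' column name is my own.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the 'Time' and 'Distance' columns of the CSV
df = pd.DataFrame({
    'Time': np.arange(10, dtype=float),
    'Distance': np.arange(10, dtype=float) ** 2,
})

n = 3  # small n = less smoothed

# Smooth only the 'Distance' column, once forward and once backward
fwd = df['Distance'].ewm(span=n, adjust=True).mean()
bwd = df['Distance'][::-1].ewm(span=n, adjust=True).mean()[::-1]

# Averaging two Series aligns on the index and yields a single column
df['filtered'] = (fwd + bwd) / 2

# Velocity = change in smoothed distance divided by change in time
df['smoothed_velocity'] = df['filtered'].diff() / df['Time'].diff()
```

Note the parentheses implied by diff(): the time difference is computed first and only then used as the divisor, which is not what the expression in the question does.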

python KDE get contours and paths into specific json format leaflet-friendly

I am doing a Kernel Density Estimation in Python and getting the contours and paths as shown below. (here is my sample data: https://pastebin.com/193PUhQf).
from numpy import *
from math import *
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x_2d = []
y_2d = []
data = {}
data['nodes'] = []
# here is the sample data:
# https://pastebin.com/193PUhQf
X = [.....]
for Picker in xrange(0, len(X)):
    x_2d.append(X[Picker][0])
    y_2d.append(X[Picker][1])
# convert to arrays
m1 = np.array([x_2d])
m2 = np.array([y_2d])
x_min = m1.min() - 30
x_max = m1.max() + 30
y_min = m2.min() - 30
y_max = m2.max() + 30
x, y = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
kde = stats.gaussian_kde(values)
z = np.reshape(kde(positions).T, x.shape)
fig = plt.figure(2, dpi=200)
ax = fig.add_subplot(111)
pc = ax.pcolor(x, y, z)
cb = plt.colorbar(pc)
cb.ax.set_ylabel('Probability density')
c_s = plt.contour(x, y, z, 20, linewidths=1, colors='k')
ax.plot(m1, m2, 'o', mfc='w', mec='k')
ax.set_title("My Title", fontsize='medium')
plt.savefig("kde.png", dpi=200)
plt.show()
There is a similar way to get the contours using R, which is described here:
http://bl.ocks.org/diegovalle/5166482
Question: how can I achieve the same output using my Python script, or using it as a starting point?
The desired output should be like contours_tj.json, which can be used by the leaflet.js library.
UPDATE:
My input data structure is composed of three columns, comma separated:
first one is the X value
second one is the Y value
third one is the ID of my data, it has no numerical value, it is simply an identifier of the data point.
Update 2:
Question, if simply put, is that I want the same output as in the above link using my input file which is in numpy array format.
update 3:
my input data structure is of list type:
print type(X)
<type 'list'>
and here are the first few lines:
print X[0:5]
[[10.800584, 11.446064, 4478597], [10.576840,11.020229, 4644503], [11.434276,10.790881, 5570870], [11.156718,11.034633, 6500333], [11.054956,11.100243, 6513301]]
geojsoncontour is a Python library that converts matplotlib contours to GeoJSON.
geojsoncontour.contour_to_geojson requires a contour_levels argument. The levels in pyplot.contour are chosen automatically, but you can access them with c_s._levels.
So, for your example you could do:
import geojsoncontour
# your code here
c_s = plt.contour(x, y, z, 20, linewidths=1, colors='k')
# Convert matplotlib contour to geojson
geojsoncontour.contour_to_geojson(
contour=c_s,
geojson_filepath='out.geojson',
contour_levels=c_s._levels,
ndigits=3,
unit='m'
)
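If you would rather not depend on an extra library, the contour lines can also be read from the ContourSet and written out by hand. This is a rough sketch, not what geojsoncontour does internally: it uses a synthetic surface in place of the KDE result, treats x as longitude and y as latitude (an assumption), and the "level" property name is my own choice.

```python
import json
import numpy as np
import matplotlib.pyplot as plt

# Synthetic density surface standing in for the KDE result z
x, y = np.mgrid[-3:3:50j, -3:3:50j]
z = np.exp(-(x ** 2 + y ** 2))

fig, ax = plt.subplots()
c_s = ax.contour(x, y, z, 5)

features = []
# allsegs holds, for each contour level, a list of (N, 2) vertex arrays
for level, segments in zip(c_s.levels, c_s.allsegs):
    for seg in segments:
        if len(seg) < 2:
            continue  # a LineString needs at least two points
        features.append({
            "type": "Feature",
            "properties": {"level": float(level)},
            "geometry": {
                "type": "LineString",
                # assumes x is longitude and y is latitude
                "coordinates": np.round(seg, 5).tolist(),
            },
        })

geojson = {"type": "FeatureCollection", "features": features}
geojson_text = json.dumps(geojson)
```

The resulting string can be written to a file and loaded in Leaflet as a GeoJSON layer.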

Plot 2D array with Pandas, Matplotlib, and Numpy

I parsed the output of my simulations using pandas groupby(), and I am having some difficulty plotting the data the way I want. Here's the pandas output (abridged for simplicity) that I'm trying to plot:
                 Avg-del   Min-del    Max-del   Avg-retx  Min-retx    Max-retx
Prob Producers
0.3  1          8.060291  0.587227  26.709371  42.931779  5.130041  136.216642
     5          8.330889  0.371387  54.468836  43.166326  3.340193  275.932170
     10         1.012147  0.161975   4.320447   6.336965  2.026241   19.177802
0.5  1          8.039639  0.776463  26.053635  43.160880  5.798276  133.090358
     5          4.729875  0.289472  26.717824  25.732373  2.909811  135.289244
     10         1.043738  0.160671   4.353993   6.461914  2.015735   19.595393
My y-axis is delay and my x-axis is the number of producers. I want one set of error bars for probability p=0.3 and another for p=0.5.
My python script is the following:
import sys
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.expand_frame_repr', False)
outputFile = 'averages.txt'
f_out = open(outputFile, 'w')
data = pd.read_csv(sys.argv[1], delimiter=",")
result = data.groupby(["Prob", "Producers"]).mean()
print "Writing to output file: " + outputFile
result_s = str(result)
f_out.write(result_s)
f_out.close()
*** Update from James ***
for prob_index in result.index.levels[0]:
    r = result.loc[prob_index]
    labels = [col for col in r]
    lines = plt.plot(r)
    [line.set_label(str(prob_index)+" "+col) for col, line in zip(labels, lines)]
ax = plt.gca()
ax.legend()
ax.set_xticks(r.index)
ax.set_ylabel('Latency (s)')
ax.set_xlabel('Number of producer nodes')
plt.show()
Now I have 4 sliced arrays, one for each probability.
How do I slice them again based on delay (del) and retx, and plot error bars based on avg, min, and max?
Ok, there is a lot going on here. First, it is plotting 6 lines. When your code calls
plt.plot(np.transpose(np.array(result)[0:3, 0:3]), label = 'p=0.3')
plt.plot(np.transpose(np.array(result)[3:6, 0:3]), label = 'p=0.5')
it is calling plt.plot on a 3x3 array of data. plt.plot interprets this input not as an x and y, but rather as 3 separate series of y-values (with 3 points each). For the x values, it fills in the values 0, 1, 2. In other words, for the first plot call it is plotting the data:
x = [0,1,2]; y = [8.060291, 8.330889, 1.012147]
x = [0,1,2]; y = [0.587227, 0.371387, 0.161975]
x = [0,1,2]; y = [26.709371, 54.468836, 4.320447]
Based on your x-label, I think you want the values to be x = [1,5,10]. Try this to see if it gets the plot you want.
# iterate over the first dataframe index
for prob_index in result.index.levels[0]:
    r = result.loc[prob_index]
    labels = [col for col in r]
    lines = plt.plot(r)
    [line.set_label(str(prob_index)+" "+col) for col, line in zip(labels, lines)]
ax = plt.gca()
ax.legend()
ax.set_xticks(r.index)
ax.set_ylabel('Latency (s)')
ax.set_xlabel('Number of producer nodes')
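For the error bars themselves, one option is plt.errorbar with asymmetric errors, using the distance from the average down to the minimum and up to the maximum. This is a sketch only: the DataFrame is rebuilt by hand from the delay columns of the table in the question, so that the example runs without the original CSV.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Delay columns copied from the table in the question
index = pd.MultiIndex.from_product([[0.3, 0.5], [1, 5, 10]],
                                   names=['Prob', 'Producers'])
result = pd.DataFrame({
    'Avg-del': [8.060291, 8.330889, 1.012147, 8.039639, 4.729875, 1.043738],
    'Min-del': [0.587227, 0.371387, 0.161975, 0.776463, 0.289472, 0.160671],
    'Max-del': [26.709371, 54.468836, 4.320447, 26.053635, 26.717824, 4.353993],
}, index=index)

fig, ax = plt.subplots()
for prob in result.index.levels[0]:
    r = result.loc[prob]  # rows for one probability, indexed by Producers
    # Asymmetric error bars: first row reaches down to min, second up to max
    yerr = np.vstack([r['Avg-del'] - r['Min-del'],
                      r['Max-del'] - r['Avg-del']])
    ax.errorbar(r.index.to_numpy(), r['Avg-del'].to_numpy(), yerr=yerr,
                marker='o', capsize=4, label='p=' + str(prob))

ax.legend()
ax.set_xticks([1, 5, 10])
ax.set_ylabel('Latency (s)')
ax.set_xlabel('Number of producer nodes')
```

The same loop can be repeated with the Avg-retx, Min-retx and Max-retx columns on a second axes to get the retransmission plot.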
