Plot 2D array with Pandas, Matplotlib, and Numpy - python

I parsed the output of my simulations using Pandas groupby(), but I am having some difficulty plotting the data the way I want. Here's the Pandas output (truncated for simplicity) that I'm trying to plot:
                Avg-del   Min-del    Max-del   Avg-retx  Min-retx    Max-retx
Prob Producers
0.3  1         8.060291  0.587227  26.709371  42.931779  5.130041  136.216642
     5         8.330889  0.371387  54.468836  43.166326  3.340193  275.932170
     10        1.012147  0.161975   4.320447   6.336965  2.026241   19.177802
0.5  1         8.039639  0.776463  26.053635  43.160880  5.798276  133.090358
     5         4.729875  0.289472  26.717824  25.732373  2.909811  135.289244
     10        1.043738  0.160671   4.353993   6.461914  2.015735   19.595393
My y-axis is delay and my x-axis is the number of producers. I want to have errorbars for probability p=0.3 and another one for p=0.5.
My python script is the following:
import sys
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.expand_frame_repr', False)
outputFile = 'averages.txt'
f_out = open(outputFile, 'w')
data = pd.read_csv(sys.argv[1], delimiter=",")
result = data.groupby(["Prob", "Producers"]).mean()
print("Writing to output file: " + outputFile)
result_s = str(result)
f_out.write(result_s)
f_out.close()
*** Update from James ***
for prob_index in result.index.levels[0]:
    r = result.loc[prob_index]
    labels = [col for col in r]
    lines = plt.plot(r)
    [line.set_label(str(prob_index)+" "+col) for col, line in zip(labels, lines)]
ax = plt.gca()
ax.legend()
ax.set_xticks(r.index)
ax.set_ylabel('Latency (s)')
ax.set_xlabel('Number of producer nodes')
plt.show()
Now I have 4 sliced arrays, one for each probability.
How do I slice them again based on delay(del) and retx, and plot errorbars based on ave, min, max?

Ok, there is a lot going on here. First, it is plotting 6 lines. When your code calls
plt.plot(np.transpose(np.array(result)[0:3, 0:3]), label = 'p=0.3')
plt.plot(np.transpose(np.array(result)[3:6, 0:3]), label = 'p=0.5')
it is calling plt.plot on a 3x3 array of data. plt.plot interprets this input not as an x and a y, but as 3 separate series of y-values (one per column, with 3 points each). For the x values, it falls back to the default indices 0, 1, 2. In other words, because of the transpose, the first plot call is plotting one series per producer count:
x = [0,1,2]; y = [8.060291, 0.587227, 26.709371]
x = [0,1,2]; y = [8.330889, 0.371387, 54.468836]
x = [0,1,2]; y = [1.012147, 0.161975, 4.320447]
Based on your x-label, I think you want the values to be x = [1,5,10]. Try this to see if it gets the plot you want.
# iterate over the first dataframe index
for prob_index in result.index.levels[0]:
    r = result.loc[prob_index]
    labels = [col for col in r]
    lines = plt.plot(r)
    [line.set_label(str(prob_index)+" "+col) for col, line in zip(labels, lines)]
ax = plt.gca()
ax.legend()
ax.set_xticks(r.index)
ax.set_ylabel('Latency (s)')
ax.set_xlabel('Number of producer nodes')
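For the error bars asked about in the question, one possible approach is Matplotlib's errorbar with asymmetric errors computed from the min and max columns. The following is only a sketch: the frame is rebuilt by hand from the delay columns of the table in the question, and the names lower, upper and producers are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the delay columns of the grouped result shown in the question
# (assumption: in the real script this is the output of groupby().mean()).
producers = [1, 5, 10]
index = pd.MultiIndex.from_product([[0.3, 0.5], producers],
                                   names=["Prob", "Producers"])
result = pd.DataFrame(
    {
        "Avg-del": [8.060291, 8.330889, 1.012147, 8.039639, 4.729875, 1.043738],
        "Min-del": [0.587227, 0.371387, 0.161975, 0.776463, 0.289472, 0.160671],
        "Max-del": [26.709371, 54.468836, 4.320447, 26.053635, 26.717824, 4.353993],
    },
    index=index,
)

fig, ax = plt.subplots()
for prob, r in result.groupby(level="Prob"):
    r = r.droplevel("Prob")               # index is now the producer count
    lower = r["Avg-del"] - r["Min-del"]   # distance from average down to min
    upper = r["Max-del"] - r["Avg-del"]   # distance from average up to max
    ax.errorbar(r.index.to_numpy(), r["Avg-del"].to_numpy(),
                yerr=[lower.to_numpy(), upper.to_numpy()],
                marker="o", capsize=4, label=f"p={prob}")

ax.set_xticks(producers)
ax.set_xlabel("Number of producer nodes")
ax.set_ylabel("Latency (s)")
ax.legend()
```

Calling plt.show() afterwards displays the figure. The same loop should work for the retx columns by swapping the column names.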

Related

Plot certain range of values with pandas and matplotlib

I have parsed data from a .json file and plotted it, but I only want a certain range of it,
e.g. year-month = 2014-12 to 2020-03.
The code is:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_json("observed-solar-cycle-indices.json", orient='records')
data = pd.DataFrame(data)
print(data)
x = data['time-tag']
y = data['ssn']
plt.plot(x, y, 'o')
plt.xlabel('Year-day'), plt.ylabel('SSN')
plt.show()
Here is the result; as you can see, there are too many points.
Here is the JSON file: https://services.swpc.noaa.gov/json/solar-cycle/observed-solar-cycle-indices.json
How can I either parse out only certain values from the JSON file or plot only a certain range?
The following should work:
Select the data using a start and end date
ndata = data[ (data['time-tag'] > '2014-01') & (data['time-tag'] < '2020-12')]
Plot the data. The x-axis labeling is adapted to display only every 12th label
x = ndata['time-tag']
y = ndata['ssn']
fig, ax = plt.subplots()
plt.plot(x, y, 'o')
every_nth = 12
for n, label in enumerate(ax.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)
plt.xlabel('Year-Month')
plt.xticks(rotation='vertical')
plt.ylabel('SSN')
plt.show()
You could search for the index values of your start and end dates and use them to slice out smaller lists of x and y values that you can plot.
For example, it might be something like this (assuming the time-tag values look like 'YYYY-MM', as in the question):
# convert the Series to plain lists so that list.index() can be used
x = data['time-tag'].tolist()
y = data['ssn'].tolist()
start_index = x.index('2014-12')
end_index = x.index('2020-03')
x_subsection = x[start_index : end_index]
y_subsection = y[start_index : end_index]
plt.plot(x_subsection, y_subsection, 'o')
plt.xlabel('Year-day'), plt.ylabel('SSN')
plt.show()
The conversion with tolist() is needed because a pandas Series has no index() method; its index attribute holds the row labels instead.
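Another option is to convert the time-tag column to real dates, so that the range is compared as dates rather than as strings. This is a minimal sketch: the five-row sample frame stands in for the downloaded JSON, and the 'YYYY-MM' format of time-tag is an assumption taken from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample standing in for the records read with pd.read_json
data = pd.DataFrame({
    "time-tag": ["2014-11", "2014-12", "2015-01", "2020-03", "2020-04"],
    "ssn": [70.1, 78.0, 67.0, 1.5, 5.2],
})

# Parse the strings into timestamps (first day of each month)
data["date"] = pd.to_datetime(data["time-tag"], format="%Y-%m")

# Keep only the rows inside the wanted range (both ends inclusive)
start, end = pd.Timestamp("2014-12-01"), pd.Timestamp("2020-03-01")
subset = data.loc[data["date"].between(start, end)]

fig, ax = plt.subplots()
ax.plot(subset["date"], subset["ssn"], "o")
ax.set_xlabel("Year-Month")
ax.set_ylabel("SSN")
```

With a date axis, Matplotlib chooses a readable number of tick labels on its own, so hiding every nth label is usually not needed.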

Units on x axis of graph do not have the same scale

I have the following code
import numpy as np
import matplotlib.pyplot as plt
#x axis
data = open("spectra.dat","r")
linesColumn = data.readlines()
xaxis = []
for x in linesColumn:
    xaxis.append(x.split()[0])
data.close()
#y data
data = open("spectra.dat","r")
linesColumn1 = data.readlines()
firstColumn = []
for x in linesColumn1:
    firstColumn.append(x.split()[1])
data.close()
plt.plot(xaxis, firstColumn)
plt.show()
The data is here https://drive.google.com/file/d/177kzRGXIoSKvH1fC9XJZubfK3AHzHsFF/view?usp=sharing
When I plot the graph I get a linear function because the units on the x axis do not scale the same way. In the beginning one unit is 0.1 and in the end it's 5, but it's still displayed as the same distance on x axis.
How do I fix that?
Also, is there a way to optimize the column splitting (doing it through loop or something) and storing each column as one list?
The problem is that xaxis consists of str values, so Matplotlib treats them as categories and spaces them evenly. Converting the values to float gives the expected figure (the y values are converted as well, for the same reason). The code is shown below:
import numpy as np
from matplotlib import pyplot as plt
#x axis
data = open("spectra.DAT","r")
linesColumn = data.readlines()
xaxis = []
for x in linesColumn:
    xaxis.append(float(x.split()[0]))
data.close()
#y data
data = open("spectra.DAT","r")
linesColumn1 = data.readlines()
firstColumn = []
for x in linesColumn1:
    firstColumn.append(float(x.split()[1]))  # convert the y values too
data.close()
plt.plot(xaxis, firstColumn)
plt.show()
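As for the second part of the question (reading every column without a separate loop per column), np.loadtxt can do the splitting and the float conversion in one call. The sketch below assumes the file is whitespace-separated, purely numeric and has no header; the small sample file it writes is made up for illustration and is not the real spectra file.

```python
import os
import tempfile
import numpy as np

# Write a tiny made-up sample file so the sketch is self-contained
sample = "0.1 10.0 1.0\n0.2 12.5 1.1\n1.0 7.5 1.3\n5.0 3.0 0.9\n"
path = os.path.join(tempfile.mkdtemp(), "spectra_sample.dat")
with open(path, "w") as f:
    f.write(sample)

# unpack=True transposes the result, so each row of `columns`
# is one column of the file, already converted to float
columns = np.loadtxt(path, unpack=True)
xaxis, first_column = columns[0], columns[1]
```

After this, plt.plot(xaxis, first_column) plots numeric values on both axes, and columns[i] gives any further column.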

np.array mean to single column data frame

I have a 2-column array that I calculate the mean of (thus creating column A). I would like to be able to refer to and manipulate column A, but cannot seem to save it as a new single column. Here is my specific example; 'filtered' is what I'd like to be able to save and use. The error I regularly get is ValueError: Wrong number of items passed 2, placement implies 1.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df=pd.read_csv('/Users/myfile.csv', delimiter=',', usecols=['Time','Distance'])
x = df['Time']
y = df['Distance']
n = 25 #small n = less smoothed
fwd = pd.Series.ewm(df,span=n, adjust=True).mean()
bwd = pd.Series.ewm(df[::-1],span=n, adjust=True).mean()
filtered = np.stack(( fwd, bwd[::-1] ))
filtered2 = np.mean(filtered, axis=0)
plt.subplot(2,1,1)
plt.title('smoothed and raw data')
plt.plot(x,y, color = 'orange')
plt.plot(x,filtered, color='green')
plt.plot(x,fwd, color='red')
plt.plot(x[::-1],bwd, color='blue')
plt.xlabel('time')
plt.ylabel('distance')
df['filtered2'] = pd.DataFrame(filtered, dtype='str', index=None)
print(filtered2)
smoothed_velocity = ((df.filtered2 - df.filtered2.shift(1)) / df['Time'] - df['Time'].shift(1))
print(smoothed_velocity)
plt.subplot (2,1,2)
plt.title ('smoothed velocity')
plt.plot (smoothed_velocity, color = 'orange')
plt.tight_layout()
plt.show()
Because I define 'filtered' twice I tried changing one to a different variable with no luck. Error presented was ValueError: x and y must have same first dimension, but have shapes (458,) and (2, 458, 2)
Any help would be rad!
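One possible reading of the errors: ewm is applied to the whole two-column frame, so fwd and bwd each have two columns, and stacking them gives the (2, 458, 2) shape from the message. A sketch that smooths only the Distance column and stores the result as a single column could look like this. The sample data is made up, and the velocity line assumes the intended formula is the change in smoothed distance divided by the change in time.

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the Time/Distance columns of the CSV
df = pd.DataFrame({
    "Time": np.arange(10, dtype=float),
    "Distance": np.arange(10, dtype=float) ** 2,
})

n = 3  # small n = less smoothed
dist = df["Distance"]  # a single column (Series), not the whole frame

# Forward and backward exponentially weighted means of that one column
fwd = dist.ewm(span=n, adjust=True).mean()
bwd = dist[::-1].ewm(span=n, adjust=True).mean()[::-1]

# Both are Series on the same index, so their average is one column
df["filtered"] = (fwd + bwd) / 2

# Change in smoothed distance divided by change in time
df["smoothed_velocity"] = df["filtered"].diff() / df["Time"].diff()
```

Because df["filtered"] is an ordinary column, it can be plotted against df["Time"] directly with plt.plot(df["Time"], df["filtered"]).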

Pandas groupby results on the same plot

I am dealing with the following data frame (only for illustration, actual df is quite large):
seq x1 y1
0 2 0.7725 0.2105
1 2 0.8098 0.3456
2 2 0.7457 0.5436
3 2 0.4168 0.7610
4 2 0.3181 0.8790
5 3 0.2092 0.5498
6 3 0.0591 0.6357
7 5 0.9937 0.5364
8 5 0.3756 0.7635
9 5 0.1661 0.8364
I am trying to plot a multiple-line graph for the above coordinates (x as "x1" against y as "y1").
Rows with the same "seq" form one path and have to be plotted as one separate line, i.e. all the x, y coordinates corresponding to seq = 2 belong to one line, and so on.
I am able to plot them, but only on separate graphs. I want all the lines on the same graph. I tried using subplots, but am not getting it right.
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib notebook
df.groupby("seq").plot(kind = "line", x = "x1", y = "y1")
This creates hundreds of graphs (one per unique seq). Please suggest a way to get all the lines on the same graph.
**UPDATE**
To resolve the above problem, I implemented the following code:
fig, ax = plt.subplots(figsize=(12,8))
df.groupby('seq').plot(kind='line', x = "x1", y = "y1", ax = ax)
plt.title("abc")
plt.show()
Now, I want a way to plot the lines with specific colors. I am clustering path from seq = 2 and 5 in cluster 1; and path from seq = 3 in another cluster.
So, there are two lines under cluster 1 which I want in red and 1 line under cluster 2 which can be green.
How should I proceed with this?
You need to create the axes before plotting, as in this example:
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['ProjID','Xcoord','Ycoord'])
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
df.groupby('ProjID').plot(kind='line', x = "Xcoord", y = "Ycoord", ax=ax)
plt.show()
Consider the dataframe df
df = pd.DataFrame(dict(
ProjID=np.repeat(range(10), 10),
Xcoord=np.random.rand(100),
Ycoord=np.random.rand(100),
))
Then we create abstract art like this
df.set_index('Xcoord').groupby('ProjID').Ycoord.plot()
Another way:
for k,g in df.groupby('ProjID'):
    plt.plot(g['Xcoord'],g['Ycoord'])
plt.show()
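The explicit loop above also makes it straightforward to choose a color per line, which is what the update to the question asks for. This is a sketch under assumptions: the cluster_of and color_of dictionaries are hypothetical helpers that encode the clusters described in the question (seq 2 and 5 in cluster 1, seq 3 in cluster 2).

```python
import pandas as pd
import matplotlib.pyplot as plt

# The illustrative frame from the question
df = pd.DataFrame({
    "seq": [2, 2, 2, 2, 2, 3, 3, 5, 5, 5],
    "x1": [0.7725, 0.8098, 0.7457, 0.4168, 0.3181,
           0.2092, 0.0591, 0.9937, 0.3756, 0.1661],
    "y1": [0.2105, 0.3456, 0.5436, 0.7610, 0.8790,
           0.5498, 0.6357, 0.5364, 0.7635, 0.8364],
})

# Hypothetical lookups: which cluster each seq belongs to, and its color
cluster_of = {2: 1, 5: 1, 3: 2}
color_of = {1: "red", 2: "green"}

fig, ax = plt.subplots(figsize=(12, 8))
for seq, g in df.groupby("seq"):
    # one line per path, colored by the cluster of that path
    ax.plot(g["x1"], g["y1"], color=color_of[cluster_of[seq]],
            label=f"seq {seq}")
ax.legend()
```

For a large frame, the cluster lookup could instead come from a column produced by the clustering step.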
Here is a working example including the ability to adjust legend names.
fig, ax = plt.subplots() #Create the axes that all groups are drawn on.
grp = df.groupby('groupCol')
legendNames = grp.apply(lambda x: x.name) #Get group names using the name attribute.
#legendNames = list(grp.groups.keys()) #Alternative way to get group names. Someone else might be able to speak on speed. This might iterate through the grouper and find keys which could be slower? Not sure
plots = grp.plot('x1','y1',legend=True, ax=ax)
for txt, name in zip(ax.legend_.texts, legendNames):
    txt.set_text(name)
Explanation:
Legend values get stored in the attribute ax.legend_, which in turn contains a list of Text() objects, with one item per group, where the Text class is found within the matplotlib.text API. To set the text object values, you can use the setter method set_text(self, s).
As a side note, the Text class has a number of set_X() methods that allow you to change the font sizes, fonts, colors, etc. I haven't used those, so I don't know for sure they work, but can't see why not.
Based on Serenity's answer, I made the legend better.
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['ProjID','Xcoord','Ycoord'])
# plot groupby results on the same canvas
grouped = df.groupby('ProjID')
fig, ax = plt.subplots(figsize=(8,6))
grouped.plot(kind='line', x = "Xcoord", y = "Ycoord", ax=ax)
ax.legend(labels=grouped.groups.keys()) ## better legend
plt.show()
and you can also do it like:
grouped = df.groupby('ProjID')
fig, ax = plt.subplots(figsize=(8,6))
g_plot = lambda x:x.plot(x = "Xcoord", y = "Ycoord", ax=ax, label=x.name)
grouped.apply(g_plot)
plt.show()

Python Bar Plot pylab not reading in all data from csv

I am attempting to create a bar plot from a large text file of data, stored in a csv file, that looks like the following:
#DowntonPBS, 23
#DowntonAbbey, 12
#Download, 8
#Download:, 2
#Downloads, 2
#DownstairsMixtape, 1
#DownWithAssad, 1
#DownYoTLParty, 1
#DowntonAbbey?, 1
#Downtonabbey, 1
#DowntownAbbey, 1
The following code is where I'm at, and while this method has worked in the past for different plotting scripts, I've done something wrong here that I just can't seem to find. Instead of plotting all of the data, I only seem to be getting three records.
import pylab as p
import sys
from matplotlib.mlab import csv2rec
y = []
fig = p.figure()
ax = fig.add_subplot(1,1,1)
input = open(sys.argv[1], 'r')
data = csv2rec(input, names=['tag', 'count'])
for item in data['count']:
    y.append(item)
N = len(y)
ind = range(N)
ax.bar(ind, y, align='center')
ax.set_ylabel('Counts')
ax.set_title('HashTag Diversity')
ax.set_xticks(ind)
group_labels = data['tag']
ax.set_xticklabels(group_labels)
fig.autofmt_xdate()
p.show()
If I add print statements for y and N, and run the script against my larger dataset I end up with:
[45, 37, 36]
3
y should be a very large array of about 1000 values, and the length (N) should be 1000. I'm not sure what's going on here.
By default, csv2rec() ignores lines that start with "#". You can change this by passing a different comment marker:
csv2rec(input, names=['tag', 'count'], comments="something")
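As far as I am aware, csv2rec is no longer shipped with recent Matplotlib releases, so on a current installation the file may have to be read another way. A sketch using pandas instead is shown below. It is an alternative technique, not the original method: pandas.read_csv does not treat "#" as a comment marker unless the comment argument is set, and the inline sample string stands in for the real file.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# A few lines from the question, standing in for the real csv file
raw = "#DowntonPBS, 23\n#DowntonAbbey, 12\n#Download, 8\n#Download:, 2\n"

# skipinitialspace drops the blank after each comma; "#" lines are kept
data = pd.read_csv(io.StringIO(raw), names=["tag", "count"],
                   skipinitialspace=True)

ind = list(range(len(data)))
fig, ax = plt.subplots()
ax.bar(ind, data["count"], align="center")
ax.set_ylabel("Counts")
ax.set_title("HashTag Diversity")
ax.set_xticks(ind)
ax.set_xticklabels(data["tag"])
fig.autofmt_xdate()
```

To read the real file, the io.StringIO(raw) argument would be replaced by the file path, e.g. sys.argv[1] as in the question.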
