So I have this code based on a simple data array that looks like this:
5020 : 2015 7 11 11 42 54 782705
5020 : 2015 7 11 11 44 55 575776
5020 : 2015 7 11 11 46 56 560755
5020 : 2015 7 11 11 48 57 104872
and the plotting code looks like this:
import scipy as sp
import matplotlib.pyplot as plt
data = sp.genfromtxt("E:/Python/data.txt", delimiter=" : ")
x = data[:,0]
y = data[:,1]
plt.scatter(x,y)
plt.title("Instagram")
plt.xlabel("Time")
plt.ylabel("Followers")
plt.xticks([w*2*60 for w in range(10)],
['2-minute interval %i'%w for w in range(10)])
plt.autoscale(tight=True)
plt.grid()
plt.show()
I'm looking for a simple way to use the datetime values as the x-axis intervals on the graph, but I can't figure out how to make matplotlib understand them. There's also this:
In [15]:sp.sum(sp.isnan(y))
Out[15]: 77
Which I guess is because of the spaces? I'm new to machine learning in Python, forgive my ignorance.
Thank you very much.
I would solve this by directly passing datetime.datetime objects to pyplot. Here is a short example:
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib
# Note: reading the data from the file is left to you
x = [dt.datetime(2015,7,11,11,42,54),
     dt.datetime(2015,7,11,11,44,55),
     dt.datetime(2015,7,11,11,46,56),
     dt.datetime(2015,7,11,11,48,57)]
#define the x limit:
xstart= dt.datetime(2015,7,11,11,40,54)
xstop = dt.datetime(2015,7,11,11,50,54)
y = [782705, 575776, 560755, 104872]
fig,ax= plt.subplots()
ax.scatter(x,y)
xfmt = matplotlib.dates.DateFormatter('%D %H:%M:%S')
ax.xaxis.set_major_formatter(xfmt)
ax.set_title("Instagram")
ax.set_xlabel("Time")
ax.set_ylabel("Followers")
ax.set_xlim(xstart,xstop)
plt.xticks(rotation='vertical')
plt.show()
Result:
Yes, it's because of the spaces. With delimiter=" : ", the whole timestamp (2015 7 11 11 42 54 782705) is read as a single field that cannot be converted to a float, so it is stored as NaN.
Try this; it's a little longer but should work:
data = []
x_old = []  # raw timestamp strings
x = []
y = []
with open('data.txt', 'r') as f:
    for line in f:
        data.append(line.split(':'))
for i in data:
    y.append(int(i[0]))  # follower count
    x_old.append(i[1])
for t in x_old:
    # minutes + (seconds.microseconds) / 60
    x.append(float(t[17:19]+'.'+t[20:])/60+int(t[14:16]))
Because of the spaces I had to convert the data into floats manually. I divided the seconds plus microseconds by 60 and added the result to the minutes, since I'm assuming that is all you're interested in (2-minute intervals).
If the timestamp were written in a more regular format, you could parse it with datetime directly. For example:
from datetime import datetime
my_time = datetime.strptime('2015 7 11 11 42 54.782705', '%Y %m %d %H %M %S.%f')
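For the space-separated format shown in the question, here is a minimal sketch. It assumes every line looks like `5020 : 2015 7 11 11 42 54 782705`, with the follower count before the colon and microseconds as the last field. It builds the datetime from the integer fields directly rather than going through strptime, and `parse_line` is a hypothetical helper name.

```python
from datetime import datetime

def parse_line(line):
    """Split 'followers : Y M D h m s microseconds' into (datetime, int)."""
    followers, stamp = line.split(":")
    # datetime(year, month, day, hour, minute, second, microsecond)
    when = datetime(*map(int, stamp.split()))
    return when, int(followers)

# Two lines copied from the question's data file
sample = [
    "5020 : 2015 7 11 11 42 54 782705",
    "5020 : 2015 7 11 11 44 55 575776",
]
x, y = zip(*(parse_line(line) for line in sample))
```

The resulting `x` holds datetime objects that can be passed straight to a scatter plot, and `y` holds the follower counts as integers.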
I am making a matplotlib.axes.Axes.stem plot where the x-axis is a timeline in days. Some days have data; other days have none (no entries exist for them in my data).
Question 1: How do I make a timeline stem plot that shows my data, including the days with no data? Is this possible? Is there some way to auto-scale the x-axis to handle such a situation?
Below is a sample data file called test.txt and my Python script, which reads the data and draws a timeline stem plot. The output of this script is also given below.
Question 2: Presentation question. How do I show a "-" symbol at each annotation? Also, how do I rotate the annotation by 30 degrees?
test.txt
No. Date
1 23/01/2020
2 24/01/2020
3 24/01/2020
4 26/01/2020
5 27/01/2020
6 28/01/2020
7 29/01/2020
8 29/01/2020
9 30/01/2020
10 30/01/2020
11 31/01/2020
12 31/01/2020
13 01/02/2020
14 01/02/2020
15 04/02/2020
16 04/02/2020
17 04/02/2020
18 05/02/2020
19 05/02/2020
20 05/02/2020
21 06/02/2020
22 07/02/2020
23 07/02/2020
24 07/02/2020
25 08/02/2020
26 08/02/2020
27 08/02/2020
28 08/02/2020
29 08/02/2020
30 09/02/2020
31 10/02/2020
32 10/02/2020
33 11/02/2020
34 11/02/2020
38 13/02/2020
39 13/02/2020
40 13/02/2020
41 13/02/2020
42 13/02/2020
43 13/02/2020
44 14/02/2020
45 14/02/2020
46 14/02/2020
47 14/02/2020
48 14/02/2020
49 14/02/2020
50 15/02/2020
51 15/02/2020
52 15/02/2020
53 15/02/2020
54 15/02/2020
57 18/02/2020
58 18/02/2020
59 18/02/2020
60 19/02/2020
61 21/02/2020
stem_plot.py
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates
from datetime import datetime
from pathlib import Path
#########################
#### DATA EXTRACTION ####
#########################
source = Path('./test.txt')
with source.open() as f:
    lines = f.readlines()
#print( lines )
# Store source data in dictionary with date shown as mm-dd.
data={}
for line in lines[1:]:
    case, cdate = line.strip().split()
    cdate = datetime.strptime(cdate, "%d/%m/%Y").strftime('%m-%d')
    data[case] = cdate
print( f'\ndata = {data}' )
# Collate data's y-axis for each date, i.e. history
history2={}
cdates = list(data.values())
sorted_dates = sorted( set( cdates ) )
for i in sorted_dates:
    cases=[]
    for case, date in data.items():
        if i == date:
            cases.append(case)
    #print( i, cases)
    history2[i] = cases
print( f'\nhistory2 = {history2}')
###########################
#### DATA PRESENTATION ####
###########################
# Create figure and plot a stem plot with the date
fig, ax = plt.subplots(figsize=(8.8, 5), constrained_layout=True)
ax.set(title="Test")
labels=list( history2.values() ) # For annotation
yy = [ len(i) for i in labels ] # y-axis
xx = list(history2.keys()) # x-axis
markerline, stemline, baseline = ax.stem(
    xx, yy, linefmt="C1:", basefmt="k-", use_line_collection=True)
plt.setp(markerline, marker="None" )
# annotate stem lines
for ann_x, label in list(history2.items()):
    print(ann_x, label)
    each_count=1
    for each in label:
        ax.annotate( each, xy=(ann_x, each_count), xycoords='data')
        each_count += 1
        #print(f'each_count = {each_count}' )
# format xaxis
plt.setp( ax.get_xticklabels(), rotation=30 )
# remove top and right spines
for spine in ["top", "right"]:
    ax.spines[spine].set_visible(False)
# show axis name
ax.get_yaxis().set_label_text(label='Y-axis')
ax.get_xaxis().set_label_text(label='X-axis')
plt.show()
Current output:
About your first question: build a list of every day between your first and last date and use that as the x-axis. Add this to the beginning of your code:
import pandas as pd
alldays = pd.date_range(start="20200123",
end="20200221",
normalize=True)
dates = []
for i in alldays:
    dates.append(f"{i.month:02}-{i.day:02}")
This gets a pandas date range between the two dates and converts it into a list of month-day strings.
Then modify this part of your code like this:
# Collate data's y-axis for each date, i.e. history
history2={}
cdates = list(data.values())
sorted_dates = sorted( set( cdates ) )
for i in dates: # This is the only change!
    cases=[]
    for case, date in data.items():
        if i == date:
            cases.append(case)
    #print( i, cases)
    history2[i] = cases
And this change would give you this:
About your second question, change your code to this:
# annotate stem lines
for ann_x, label in list(history2.items()):
    print(ann_x, label)
    each_count=1
    for each in label:
        ax.annotate(f"--{each}", xy=(ann_x, each_count), xycoords='data', rotation=30)
        each_count += 1
I only changed the ax.annotate line. The two changes are:
added a "--" prefix to each of your annotation labels,
added a rotation parameter. The rotation parameter does not appear directly in the annotate documentation, but the documentation says that any Text property can be passed as a keyword argument, and rotation is one of them.
This would hopefully give you what you have asked for:
Adding to @SinanKurmus's answer to my first question:
Solution 1:
A time axis with a daily interval covering the entire history of the given data can be obtained with matplotlib's drange and num2date functions plus standard Python. The use of pandas can be avoided here.
First, express the start and end dates of the time axis as Python datetime objects. Note that you need to add one day to the end date, otherwise data from the last date will not be included. Next, define the interval as one day with a datetime.timedelta object. Then pass all three to matplotlib.dates.drange, which returns a NumPy array of Matplotlib date numbers. Matplotlib's num2date function in turn converts those back to Python datetime objects.
from datetime import datetime, timedelta
import matplotlib.dates as mdates

def get_time_axis( data ):
    # Assumes the dates in data are stored as "YYYY-MM-DD" strings
    start = datetime.strptime(min(data.values()), "%Y-%m-%d")
    end = datetime.strptime(max(data.values()), "%Y-%m-%d") + timedelta(days=1)
    delta = timedelta(days=1)
    time_axis_md = mdates.drange( start, end, delta )
    time_axis_py = mdates.num2date( time_axis_md, tz=None ) # Add tz when required
    return time_axis_py
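The underlying idea, a gap-free daily axis on which days without data get a count of zero, can also be sketched with the standard library alone. This is only an illustration under assumptions: `daily_counts` is a hypothetical helper that is not part of matplotlib or of the original script, and the dates are assumed to be already parsed into `datetime.date` objects.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(case_dates):
    """Return (days, counts) for every day from the first to the last date."""
    counts = Counter(case_dates)
    start, end = min(case_dates), max(case_dates)
    n_days = (end - start).days + 1  # +1 so the last day is included
    days = [start + timedelta(days=i) for i in range(n_days)]
    # A Counter returns 0 for keys it has never seen, i.e. days without data
    return days, [counts[d] for d in days]

# The first four entries of test.txt; 25/01/2020 has no entry
cases = [date(2020, 1, 23), date(2020, 1, 24), date(2020, 1, 24), date(2020, 1, 26)]
days, counts = daily_counts(cases)
```

The `days` and `counts` lists line up one-to-one, so they can serve as the x and y inputs of a stem plot, with the empty day showing up as a zero-height stem.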
Solution 2:
Apparently, Matplotlib also has a FAQ on how to skip dates where there is no data. I have included their sample code example below.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib.ticker as ticker
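# Note: mlab.csv2rec has been removed from recent Matplotlib releases, so this FAQ sample only runs on old versions
r = mlab.csv2rec('../data/aapl.csv')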
r.sort()
r = r[-30:] # get the last 30 days
N = len(r)
ind = np.arange(N) # the evenly spaced plot indices
def format_date(x, pos=None):
    thisind = np.clip(int(x+0.5), 0, N-1)
    return r.date[thisind].strftime('%Y-%m-%d')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ind, r.adj_close, 'o-')
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
fig.autofmt_xdate()
plt.show()
Update 5/22/18: Answer by @aorr below original question.
I am trying to collect each ID and its data for thousands of inputs: gather the rows for each individual ID, sort them by date, then plot each ID's data and export one chart per ID.
Edited
Sample data:
Col names: Id Date O G Company Date2
aab72ffd-4d0b-4c62-b6fe-4c55b98be9a0 3/1/1999 180.66 673 A 1/1/1996
aab72ffd-4d0b-4c62-b6fe-4c55b98be9a0 3/1/1995 173.9 651 A 1/1/1996
a15961bc-0263-4c66-a825-1deb69bda8be 12/1/2010 55.14 542 C 1/1/2011
a15961bc-0263-4c66-a825-1deb69bda8be 5/1/2012 49.24 577 C 1/1/2011
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 12/1/2000 48.14 290 D 3/1/2002
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 3/1/2003 69.03 282.5 D 3/1/2002
Desired output arrays/charts, but sorted by date.
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/2005 28.24 327
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 10/1/1998 45.11 335
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/2001 28.22 348
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/1997 44.53 350.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 8/1/2001 28.4 333.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 10/1/2005 41.72 314
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 12/1/2001 29.53 313.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 8/1/2002 43.24 319
The code I have written so far successfully creates an indexed array of the different data types. Now I am trying to iterate over all rows and organize the data so that it writes out individual arrays/charts per ID.
Here is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#import data
mydataset = pd.read_csv('input_test.csv', dtype=None)
x = mydataset.iloc[:,:].values
y = mydataset.iloc[:,:].values
#Id
b = np.array((x[:,0]), dtype=str)
#Date
c = np.array((x[:,1]), dtype=str)
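# O var (contains decimals such as 180.66, so use float rather than int)
d = np.array((x[:,2]), dtype=float)
# G var (contains values such as 282.5)
e = np.array((x[:,3]), dtype=float)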
#Stack
f = np.vstack((b,c,d,e))
#Transpose array
g = f.T
#Plot data
plt.figure()
plt.plot(x[:,2], y[:,3], label ='Rate over time')
plt.xlabel('m')
plt.ylabel('r/m')
#plt.legend()
Update based on @aorr's answer:
Thanks for helping us noobs.
This plots both O and G on the Y axis with Date on the X axis for each Id. And everything is sorted based on date. Great starting point to expand with this data. More to follow based on updates.
for Id in data['Id'].unique():
    fig, ax = plt.subplots(figsize=(5,3))
    plot_data = data.query("Id==@Id").sort_values('Date')
    _ = plot_data.plot(x='Date',y='O', ax=ax)
    _ = plot_data.plot(x='Date', y='G', ax=ax)
    #Plot Company name in each chart
    for Company in plot_data['Company']:
        _ = plt.title(Company)
    #Plot Date2 Event onto X-axis
    for Date2 in plot_data['Date2']:
        _ = plt.axvline(Date2)
Have you tried solving this with pandas? I don't think you need to create numpy arrays for every element, pandas already stores them as ndarrays internally.
import pandas as pd
import matplotlib.pyplot as plt
# Column names here are lowercase; adjust them to match the headers in your CSV
data = pd.read_csv('input_test.csv', parse_dates=['date'])
for id in data['id'].unique():
    fig, ax = plt.subplots(figsize=(5,3))
    plot_data = data.query("id==@id").sort_values('date')
    _ = plot_data.plot(x='O',y='G', ax=ax)
That should get you nearly all the way there. The pandas visualization docs have a bunch of other really helpful options for exploring data quickly, but if you're picky about the look of the figure, you'll want to use plain matplotlib for the figure and axes layouts.
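As an alternative to filtering with query inside the loop, groupby can hand back one sub-frame per ID. The following is a rough sketch rather than a drop-in solution: it uses three rows from the question's sample data with the IDs shortened for readability, and it assumes the dates are month/day/year, which the question does not state.

```python
import pandas as pd

# Three rows from the question's sample data (columns Id, Date, O, G);
# the Ids are shortened here for readability.
data = pd.DataFrame({
    "Id": ["aab72ffd", "aab72ffd", "a15961bc"],
    "Date": ["3/1/1999", "3/1/1995", "12/1/2010"],
    "O": [180.66, 173.9, 55.14],
    "G": [673, 651, 542],
})
# Assumption: dates are month/day/year
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")

# groupby yields one sub-frame per Id; sort each one by date before plotting
per_id = {
    key: group.sort_values("Date").reset_index(drop=True)
    for key, group in data.groupby("Id")
}
```

Each value in `per_id` is an ordinary DataFrame, so it can be plotted with `.plot(x="Date", y="O")` in the same way as in the loop above.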
In a CSV file, I have three columns z, x, y, where the 'z' column is used to group the x and y columns, which are then plotted per 'z' value. Below is the table for z, x, y.
z x y
23 1,75181E-07 6,949512
23 8,88901E-07 6,963877
23 1,61279E-05 7,293052
23 5,35262E-05 8,135064
23 8,56942E-05 8,903738
23 0,000114883 9,579907
23 0,01068653 211,0798
23 0,01070811 211,3568
23 0,0107263 211,5871
23 0,01074606 211,8401
23 0,01076813 212,1311
23 0,01078525 212,3436
40 1,75181E-07 6,949513217
40 8,889E-07 6,96388319
40 1,61277E-05 7,293169621
40 5,35248E-05 8,135499439
40 0,00029527 13,63721607
40 0,000319049 14,1825142
40 0,000340228 14,69608917
40 0,014110191 252,3893548
40 0,014132366 252,5804547
40 0,014155023 252,8030254
40 0,014180293 253,0374241
40 0,014202693 253,1983821
40 0,014226167 253,4140887
40 0,014251631 253,6566835
40 0,014272699 253,8120535
Now I need to replace the 'x' and 'y' columns with new values, say 'x1' and 'y1', given by the equations x1 = ln(1 + x) and y1 = y*(1 + x), and then plot x1 and y1 grouped by the same 'z' column.
I have tried the code below. I can see my new values, but I am not able to plot them.
import csv
import os
import tkinter as tk
import sys
from tkinter import filedialog
import pandas as pd
import matplotlib.pyplot as plt
import math
from tkinter import ttk
import numpy as np
def readCSV(self):
    x=[] # Initializing empty lists to store the 3 columns in csv
    y=[]
    z=[]
    global df
    self.filename = filedialog.askopenfilename()
    df = pd.read_csv(self.filename, error_bad_lines=False) #Reading CSV file using pandas
    read = csv.reader(df, delimiter = ",")
    fig = plt.figure()
    ax= fig.add_subplot(111)
    df.set_index('x', inplace=True) #Setting index
    line = df.groupby('z')['y'].plot(legend=True,ax=ax) #grouping and plotting
    cursor = datacursor(line)
    gdf= df[df['z'] == 23]
    x=np.asarray(gdf.index.values)
    y=np.asarray(gdf['y'].values)
    x1 = np.log(1+x)
    y1 = y * (1 + x)
    df.set_index('x1', inplace=True) #Setting new index
    line = df.groupby('z')['y1'].plot(legend=True,ax=ax) #grouping and plotting for new values
    cursor = datacursor(line)
    gdf= df[df['z'] == 23]
    x1=np.asarray(gdf.index.values)
    print ("x1:",x1)
    y1=np.asarray(gdf['y1'].values)
    print ("y1:",y1)
    ax1 = ax.twinx()
    ax.grid(True)
    ax.set_ylim(0,None)
    ax.set_xlim(0,None)
    align_yaxis(ax, y.max(), ax1, 1)
    plt.show()
I am getting an error on the line "df.set_index('x1', inplace=True)":
KeyError: 'x1'
Thanks in advance
You must assign 'x1' and 'y1' to the dataframe before you can set either of them as the index:
x=np.asarray(df.index.values)
y=np.asarray(df['y'].values)
df['x1'] = np.log(1+x)
df['y1'] = y * (1+x)
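Putting the pieces together, here is a small self-contained sketch of the same fix. It uses two rows per group from the question's table, typed in with decimal points instead of decimal commas, and it works on the 'x' column directly instead of first moving it into the index. The `curves` dictionary is only an illustrative name.

```python
import numpy as np
import pandas as pd

# Two rows per group, taken from the table in the question
# (decimal commas already converted to decimal points).
df = pd.DataFrame({
    "z": [23, 23, 40, 40],
    "x": [1.75181e-07, 0.01068653, 1.75181e-07, 0.014110191],
    "y": [6.949512, 211.0798, 6.949513217, 252.3893548],
})

# Create the derived columns first; set_index can only use existing columns
df["x1"] = np.log1p(df["x"])        # ln(1 + x)
df["y1"] = df["y"] * (1 + df["x"])  # y * (1 + x)

# One Series per z value, indexed by x1, ready for plotting
curves = {z: g.set_index("x1")["y1"] for z, g in df.groupby("z")}
```

If the file on disk really uses decimal commas as shown in the table, `pd.read_csv` has a `decimal=","` option for that; the column separator would then have to be something other than a comma, which the question does not show.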
Sorry, I couldn't find out how to do this by googling, so I'm asking here.
Here is a sandbox data table:
mode X Y
0 1 3 10
1 1 4 11
2 1 3 12
3 1 4 13
4 2 3 14
5 2 4 15
6 2 3 16
7 2 4 17
I created the following sandbox code. I want a plot with TWO lines, one for each mode ('mode 1' and 'mode 2'). The x-axis should be 3, 4. The line for mode 1 should use the averaged Y, i.e. (3, (10+12)/2)--(4, (11+13)/2), and analogously (3, 15)--(4, 16) for mode 2.
But this code doesn't even work.
#!/usr/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,1,1,1,2,2,2,2],[3,4,3,4,3,4,3,4],list(range(10,18))]).T
df.columns = ['mode','X','Y']
mode = df.groupby(['mode'])['mode'].mean()
Ox = df.groupby(['X'])['X'].mean()
Oy = df.groupby(['mode','X'])['Y'].mean()
for x in mode:
    plt.plot(Ox, Oy[Oy['mode'== x]] , label = 'test' + x)
plt.savefig('testpandas.pdf')
You might want to try the seaborn package, which has a lot of functionality for stuff like this
import seaborn as sns
sns.lmplot(data=df,hue='mode',x='X',y='Y',x_estimator=np.mean)
Here's one way to do it in plain pandas:
y_means=df.groupby(['mode','X'],as_index=False).mean()
for mode,g in y_means.groupby('mode'):
    plt.plot(g['X'],g['Y'],'o-',label = 'mode = ' + str(mode))
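The averaging per (mode, X) pair that the groupby performs can be checked by hand with plain Python. This is just a verification sketch on the sandbox table from the question, not a replacement for the pandas code.

```python
from collections import defaultdict

rows = [  # (mode, X, Y) from the sandbox table
    (1, 3, 10), (1, 4, 11), (1, 3, 12), (1, 4, 13),
    (2, 3, 14), (2, 4, 15), (2, 3, 16), (2, 4, 17),
]

# Collect all Y values that share the same (mode, X) pair
groups = defaultdict(list)
for mode, x, y in rows:
    groups[(mode, x)].append(y)

# Average each group; these are the points the two lines pass through
means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
```

For mode 1 this gives the points (3, 11) and (4, 12), and for mode 2 the points (3, 15) and (4, 16).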
This is an answer from the person who asked the question.
I actually found a solution myself.
#!/usr/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,1,1,1,2,2,2,2],[3,4,3,4,3,4,3,4],list(range(10,18))]).T
df.columns = ['mode','X','Y']
mode = df.groupby(['mode'])['mode'].mean()
Ox = df.groupby(['X'])['X'].mean()
Oy = df.groupby(['mode','X'])['Y'].mean()
for x in mode:
    plt.plot(Ox, Oy[mode[x]] , label = 'test' + str(x))
plt.savefig('testpandas.png')
I would guess the easiest way to do this is to use a pivot_table. This reduces the whole thing to two lines:
piv = pd.pivot_table(df, columns="mode", index="X")
plt.plot(piv)
or even only one, if you use pandas integrated plotting functionality:
pd.pivot_table(df, columns="mode", index="X").plot()
The complete solution using matplotlib:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,1,1,1,2,2,2,2],[3,4,3,4,3,4,3,4],list(range(10,18))]).T
df.columns = ['mode','X','Y']
piv = pd.pivot_table(df, columns="mode", index="X")
print(piv)
plt.plot(piv)
plt.legend(labels=["mode {}".format(c[1]) for c in piv.columns.values])
plt.show()
which prints the pivot table as
       Y
mode   1   2
X
3     11  15
4     12  16
and creates the plot
I've created a script to create multiple plots in one object. The results I am looking for are two plots one over the other such that each plot has different y axis scale but x axis is fixed - dates. However, only one of the plots (the top) is properly created, the bottom plot is visible but empty i.e the geom_line is not visible. Furthermore, the y-axis of the second plot does not match the range of values - min to max. I also tried using facet_grid (scales="free") but no change in the y-axis. The y-axis for the second graph has a range of 0 to 0.05.
I've limited the date range to the past few weeks. This is the code I am using:
df = df.set_index('date')
weekly = df.resample('w-mon',label='left',closed='left').sum()
data = weekly[-4:].reset_index()
data= pd.melt(data, id_vars=['date'])
pplot = ggplot(aes(x="date", y="value", color="variable", group="variable"), data)
#geom_line()
scale_x_date(labels = date_format('%d.%m'),
limits=(data.date.min() - dt.timedelta(2),
data.date.max() + dt.timedelta(2)))
#facet_grid("variable", scales="free_y")
theme_bw()
The dataframe sample (df): it's a daily dataset containing values for each of the variables x and a. In this case 'date' is the index:
date x a
2016-08-01 100 20
2016-08-02 50 0
2016-08-03 24 18
2016-08-04 0 10
The dataframe sample (to_plot) - weekly overview:
date variable value
0 2016-08-01 x 200
1 2016-08-08 x 211
2 2016-08-15 x 104
3 2016-08-22 x 332
4 2016-08-01 a 8
5 2016-08-08 a 15
6 2016-08-15 a 22
7 2016-08-22 a 6
Sorry for not adding the df dataframe before.
Your calls to the plot directives geom_line(), scale_x_date(), etc. are standing on their own in your script; you do not connect them to your plot object. Thus, they do not have any effect on your plot.
In order to apply a plot directive to an existing plot object, use the graphics language and "add" them to your plot object by connecting them with a + operator.
The result (as intended):
The full script:
from __future__ import print_function
import sys
import pandas as pd
import datetime as dt
from ggplot import *
if __name__ == '__main__':
    df = pd.DataFrame({
        'date': ['2016-08-01', '2016-08-08', '2016-08-15', '2016-08-22'],
        'x': [100, 50, 24, 0],
        'a': [20, 0, 18, 10]
    })
    df['date'] = pd.to_datetime(df['date'])
    data = pd.melt(df, id_vars=['date'])
    plt = ggplot(data, aes(x='date', y='value', color='variable', group='variable')) +\
        scale_x_date(
            labels=date_format('%y-%m-%d'),
            limits=(data.date.min() - dt.timedelta(2), data.date.max() + dt.timedelta(2))
        ) +\
        geom_line() +\
        facet_grid('variable', scales='free_y')
    plt.show()