Python 3.4 reading from CSV formats - python

I have Python code that imports data from a CSV file. The problem is that some columns in that file aren't plain numbers. One column is text, either "INT" or "EXT", and another is in o'clock format, ranging from "0:00" to "11:59". A third column is an ordinary number, a distance in "00.00" format.
My question is how to plot distance vs. o'clock, and then color the dots of the scatter plot according to whether each row is INT or EXT.
My first problem is getting the program to read the o'clock format and the text format from a CSV.
Any ideas or suggestions? Thanks in advance.
Here is a sample of the CSV I'm trying to import:
ML INT .10 534.15 0:00
ML EXT .25 654.23 3:00
ML INT .35 743.12 6:30
I want to plot the 4th column on the x axis and the 5th column on the y axis.
I also want to color-code the scatter plot dots red or blue depending on whether a row is INT or EXT.
Here is a sample of the code I have so far:
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
style.use('ggplot')
a, b, c, d = np.loadtxt('numbers.csv',
                        unpack=True,
                        delimiter=',')
plt.scatter(a,b)
plt.title('Charts')
plt.ylabel('Y Axis')
plt.xlabel('X Axis')
plt.show()

Reading in from your example csv using pandas:
import pandas as pd
import matplotlib.pyplot as plt
import datetime
data = pd.read_csv('data.csv', sep='\t', header=None)
print(data)
prints:
0 1 2 3 4
0 ML INT 0.10 534.15 0:00
1 ML EXT 0.25 654.23 3:00
2 ML INT 0.35 743.12 6:30
Then separate the 'INT' from the 'EXT':
ints = data[data[1]=='INT']
exts = data[data[1]=='EXT']
change them to datetime and grab the distances:
int_times = [datetime.datetime.strptime(t, '%H:%M').time() for t in ints[4]]
ext_times = [datetime.datetime.strptime(t, '%H:%M').time() for t in exts[4]]
int_dist = [d for d in ints[3]]
ext_dist = [d for d in exts[3]]
then plot a scatter plot for 'INT' and 'EXT' each:
fig, ax = plt.subplots()
ax.scatter(int_dist, int_times, c='orange', s=150)
ax.scatter(ext_dist, ext_times, c='black', s=150)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.show()
EDIT: Adding code to answer a question in the comments regarding how to change the time to 12 hour format (ranging from 0:00 to 11:59) and strip the seconds.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('data.csv', header=None)
ints = data[data[1]=='INT']
exts = data[data[1]=='EXT']
INT_index = data[data[1]=='INT'].index
EXT_index = data[data[1]=='EXT'].index
time = [t for t in data[4]]
int_dist = [d for d in ints[3]]
ext_dist = [d for d in exts[3]]
fig, ax = plt.subplots()
ax.scatter(int_dist, INT_index, c='orange', s=150)
ax.scatter(ext_dist, EXT_index, c='black', s=150)
ax.set_yticks(np.arange(len(data[4])))
ax.set_yticklabels(time)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.ylabel('Time')
plt.show()
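
For the "how do I read text and o'clock columns" part of the question, here is a minimal sketch that uses only the standard library. It assumes the file is whitespace-separated like the sample rows shown in the question (a real file may use commas or tabs instead), and `parse_rows` is a hypothetical helper name, not something from the original code.

```python
import io
from datetime import datetime

# Sample rows copied from the question; whitespace separation is an assumption.
SAMPLE = """ML INT .10 534.15 0:00
ML EXT .25 654.23 3:00
ML INT .35 743.12 6:30
"""

def parse_rows(handle):
    """Group (distance, time) pairs by the INT/EXT label in column 2."""
    groups = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        _, kind, _, distance, o_clock = fields
        # '%H:%M' accepts unpadded hours such as '0:00' or '6:30'
        when = datetime.strptime(o_clock, "%H:%M").time()
        groups.setdefault(kind, []).append((float(distance), when))
    return groups

groups = parse_rows(io.StringIO(SAMPLE))
for kind, points in sorted(groups.items()):
    print(kind, points)
```

Each group can then be passed to its own `scatter` call with its own color, as in the answer above.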

I have worked another answer to this, but will leave the original as I believe it's still good, just not exactly answering your particular question.
I also generated a few more rows of data to make the problem, at least on my end, a bit more meaningful.
What solved this for me was generating a 5th column (in code, not the csv) which is the number of minutes corresponding to a particular o'clock time, i.e. 11:59 maps to 719 min. Using pandas I inserted this new column into the dataframe. I could then place string ticklabels for every hour ('0:00', '1:00', etc.) at every 60 min.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('Workbook2.csv', header=None)
print(data)
Prints my faked data:
0 1 2 3 4
0 ML INT 0.10 534.15 0:00
1 ML EXT 0.25 654.23 3:00
2 ML INT 0.30 743.12 6:30
3 ML EXT 0.35 744.20 4:30
4 ML INT 0.45 811.47 7:00
5 ML EXT 0.55 777.90 5:45
6 ML INT 0.66 854.70 7:54
7 ML EXT 0.74 798.40 6:55
8 ML INT 0.87 947.30 11:59
Now make a function to convert o'clock to minutes:
def convert_to_min(o_clock):
    h, m = o_clock.split(':')
    return int(h) * 60 + int(m)
# using this function, create a list of times in minutes for each time in col 4
min_col = [convert_to_min(t) for t in data[4]]
data[5] = min_col # inserts this list as a new column '5'
print(data)
Our new data:
0 1 2 3 4 5
0 ML INT 0.10 534.15 0:00 0
1 ML EXT 0.25 654.23 3:00 180
2 ML INT 0.30 743.12 6:30 390
3 ML EXT 0.35 744.20 4:30 270
4 ML INT 0.45 811.47 7:00 420
5 ML EXT 0.55 777.90 5:45 345
6 ML INT 0.66 854.70 7:54 474
7 ML EXT 0.74 798.40 6:55 415
8 ML INT 0.87 947.30 11:59 719
Now build the x and y axis data, the ticklabels, and the tick locations:
INTs = data[data[1]=='INT']
EXTs = data[data[1]=='EXT']
int_dist = INTs[3] # x-axis data for INT
ext_dist = EXTs[3]
# plotting time as minutes in range [0 720]
int_time = INTs[5] # y-axis data for INT
ext_time = EXTs[5]
time = ['0:00', '1:00', '2:00', '3:00', '4:00', '5:00',
        '6:00', '7:00', '8:00', '9:00', '10:00', '11:00', '12:00']
# this will place the strings above at every 60 min
tick_location = [t*60 for t in range(13)]
Now plot:
fig, ax = plt.subplots()
ax.scatter(int_dist, int_time, c='orange', s=150)
ax.scatter(ext_dist, ext_time, c='black', s=150)
ax.set_yticks(tick_location)
ax.set_yticklabels(time)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.ylabel('Time')
plt.title('Seems to work...')
plt.show()
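
The hourly tick labels above are typed out by hand. As a small sketch, they could also be generated from the tick locations. `minutes_to_label` is a hypothetical helper introduced here for illustration; `convert_to_min` is the same conversion used in the answer.

```python
def convert_to_min(o_clock):
    """Convert an o'clock string such as '6:30' to minutes (390)."""
    h, m = o_clock.split(':')
    return int(h) * 60 + int(m)

def minutes_to_label(minutes):
    """Hypothetical inverse helper: 390 -> '6:30'."""
    h, m = divmod(minutes, 60)
    return '{}:{:02d}'.format(h, m)

# One tick per hour from 0:00 to 12:00 inclusive
tick_location = list(range(0, 12 * 60 + 1, 60))
tick_labels = [minutes_to_label(t) for t in tick_location]
print(tick_labels)
```

The two lists can be passed to `ax.set_yticks` and `ax.set_yticklabels` in place of the hand-written ones.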

Related

two DataFrame plot in a single plot matplotlib

I want to plot two DataFrames in a single plot. I have seen similar posts, but none of them seems to work.
The first 5 rows of my DataFrames look like this:
df1
name type start stop strand
0 geneA transcript 2000 7764 +
1 geneA exon 2700 5100 +
2 geneA exon 6000 6800 +
3 geneB transcript 9000 12720 -
4 geneB exon 9900 10100 -
df2
P1 P2 P3 P4
0 0.28 0.14 0.19 0.19
1 0.30 0.16 0.17 0.20
2 0.26 0.13 0.20 0.12
3 0.21 0.13 0.25 0.15
4 0.31 0.03 0.24 0.20
I want the plot to look like this:
I tried doing this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ax = df1.plot()
df1.plot(ax=ax)
but the output was not meaningful.
I would appreciate suggestions on how to achieve this.
Here is a minimal example:
import matplotlib.pyplot as plt
f, axes = plt.subplots(nrows=len(df2.columns)+1, sharex=True, )
# plots for df2 columns
i = 0
for col in df2.columns:
    df2[col].plot(ax=axes[i])
    axes[i].set_ylim(0, 1.2)
    axes[i].set_ylabel(col)
    i += 1
## code to plot annotations
# axes[-1].plot(…)
axes[-1].set_xlabel('Genomic position')
# remove space between plots
plt.subplots_adjust(hspace=0)
Here is the full graph:
f, axes = plt.subplots(nrows=len(df2.columns)+1, sharex=True, )
# plots for df2 columns
i = 0
for col in df2.columns:
    df2[col].plot(ax=axes[i], color='#505050')
    axes[i].set_ylim(0, 1.3)
    axes[i].set_ylabel(col)
    i += 1
## code to plot annotations
axes[-1].set_xlabel('Genomic position')
axes[-1].set_ylabel('annotations')
axes[-1].set_ylim(-0.5, 1.5)
axes[-1].set_yticks([0, 1])
axes[-1].set_yticklabels(['−', '+'])
for _, r in df1.iterrows():
    marker = '|'
    lw = 1
    if r['type'] == 'exon':
        marker = None
        lw = 8
    y = 1 if r['strand'] == '+' else 0
    axes[-1].plot((r['start'], r['stop']), (y, y),
                  marker=marker, lw=lw,
                  solid_capstyle='butt',
                  color='#505050')
# remove space between plots
plt.subplots_adjust(hspace=0)
You can use subplots to do this. Since it is hard to tell how the two DataFrames should be plotted, here is a general example (X1, y1, X2 and y2 stand in for your own data):
import matplotlib.pyplot as plt
fig, axes = plt.subplots(4, 1)  # 4 rows, one column
for ax in axes:
    ax.plot(X1, y1)  # loop over each subplot and draw on it
    ax.plot(X2, y2)

how to plot multiple dataframes from csv in the same plot

I have 4 dataframes in 4 csv. I need to plot timeseries ( Date , mean ) in the same plot.
This is my script :
cc = Series.from_csv('D:/python/means2000_2001.csv' , header=0)
fig = plt.figure()
plt.plot(cc , color='red')
fig.suptitle('test title', fontsize=20)
plt.xlabel('Date', fontsize=15)
plt.ylabel('MEANS ', fontsize=15)
plt.xticks(rotation=90)
The 4 dataframes are like this ( x=Date and y=mean )
Out[307]:
Date
07-28 0.17
08-13 0.18
08-29 0.17
09-14 0.19
09-30 0.19
10-16 0.20
11-01 0.18
11-17 0.22
12-03 0.21
12-19 0.82
01-02 0.59
01-18 0.52
02-03 0.54
02-19 0.53
03-07 0.33
03-23 0.32
04-08 0.31
04-24 0.39
05-10 0.40
05-26 0.40
06-11 0.37
06-27 0.33
07-13 0.29
Name: mean, dtype: float64
When I plot the time series I get this graph:
How can I plot all the dataframes in the same plot with different colors?
I need something like this:
You can do both:
plot all curves with a single command, see plt.plot()
address each curve individually, see the for loop with plt.fill_between()
If you have 2 DataFrames, say df1 and df2, then call plt.plot() twice:
plt.plot(t, df1); plt.plot(t, df2); plt.show()
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
#--- generate data and DataFrame --
nt = 100
t= np.linspace(0,1,nt)*3*np.pi
y1 = np.sin(t); y2 = np.cos(t); y3 = y1*y2
df = pd.DataFrame({'y1':y1,'y2':y2,'y3':y3 })
#--- graphics ---
plt.style.use('fast')
fig, ax0 = plt.subplots(figsize=(20,4))
plt.plot(t,df, lw=4, alpha=0.6); # plot all curves with 1 command
for j in range(len(df.columns)):  # add-on: fill_between for each curve
    plt.fill_between(t, df.values[:, j], label=df.columns[j], alpha=0.2)
plt.legend(prop={'size':15});plt.grid(axis='y');plt.show()
The answer
You can plot multiple dataframes on a single graph by capturing the Axes object that df.plot returns and then reusing it. Here's an example with two dataframes, df1 and df2:
ax = df1.plot(x='dates', y='vals', label='val 1')
df2.plot(x='dates', y='vals', label='val 2', ax=ax)
plt.show()
Output:
Details
Here's the code I used to generate random example values for df1 and df2:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
def random_dates(start, end, n=10):
    if isinstance(start, str):
        start = pd.to_datetime(start)
    if isinstance(end, str):
        end = pd.to_datetime(end)
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
# generate two random dfs
df1 = pd.DataFrame({'dates': random_dates('2016-01-01', '2016-12-31'), 'vals': np.random.rand(10)})
df2 = pd.DataFrame({'dates': random_dates('2016-01-01', '2016-12-31'), 'vals': np.random.rand(10)})
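
The same reuse-the-Axes idea extends to the four CSV files in the question by looping. The following is a sketch under assumptions: the four frames are synthetic stand-ins (in practice each would come from `pd.read_csv`), and the column names `Date` and `mean` are taken from the question's printed output.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins for the four CSV files (assumed layout: Date, mean)
rng = np.random.default_rng(0)
dates = pd.date_range("2000-07-28", periods=23, freq="16D")
frames = {
    "series {}".format(k): pd.DataFrame({"Date": dates, "mean": rng.random(23)})
    for k in range(1, 5)
}

fig, ax = plt.subplots()
for label, frame in frames.items():
    # Passing ax=ax draws every frame on the same Axes;
    # matplotlib cycles through its default colors automatically.
    frame.plot(x="Date", y="mean", ax=ax, label=label)
ax.set_xlabel("Date")
ax.set_ylabel("mean")
```

In an interactive session, drop the `matplotlib.use("Agg")` line and finish with `plt.show()`.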

Wrong Dates in Dataframe and Subplots

I am trying to plot the data in my CSV file. Currently my dates are not shown properly in the plot, even though I am converting them. How can I change it to show the proper date format, defined as Y-m-d? My second question: I am currently plotting all the data in one plot, but I want one subplot for every Valuegroup.
My code looks like the following:
import pandas as pd
import matplotlib.pyplot as plt
csv_loader = pd.read_csv('C:/Test.csv', encoding='cp1252', sep=';', index_col=0).dropna()
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'], format="%Y-%m-%d")
print(csv_loader)
fig, ax = plt.subplots()
csv_loader.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
The csv file looks like the following:
Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45
You can just tell pandas to parse that column as a datetime and it will just work:
In[151]:
import io
import pandas as pd
import matplotlib.pyplot as plt
t="""Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45"""
df = pd.read_csv(io.StringIO(t), parse_dates=['Date'], sep=';', index_col=0)
df
Out[151]:
Valuegroup id Date Value
Calcgroup
Group1 A 1 2008-01-03 0.10
Group1 A 1 2008-01-04 0.30
Group1 A 1 2008-01-07 0.50
Group1 A 1 2008-01-08 0.90
Group1 B 1 2008-01-03 0.50
Group1 B 1 2008-01-04 1.30
Group1 B 1 2008-01-07 2.00
Group1 B 1 2008-01-08 0.15
Group1 C 1 2008-01-03 1.90
Group1 C 1 2008-01-04 2.10
Group1 C 1 2008-01-07 2.90
Group1 C 1 2008-01-08 0.45
fig, ax = plt.subplots()
df.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
plt.show()
results in:
Besides, your format string was incorrect anyway; it should be:
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'], format="%Y%m%d")
However, this won't work on its own, because that column will have been loaded with an int dtype, so you would need to convert it to string first:
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'].astype(str), format="%Y%m%d")
To format the dates on the x-axis you can use DateFormatter from matplotlib; see the related question "Editing the date formatting of x-axis tick labels in matplotlib":
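
The format codes behave the same way in the standard library, which makes the mismatch easy to check in isolation. This is a small sketch; `parse_compact` is a hypothetical helper name and `raw` mimics one integer value from the Date column.

```python
from datetime import datetime

raw = 20080103  # one value from the Date column, as an integer

def parse_compact(value):
    """Parse a YYYYMMDD integer or string into a date."""
    return datetime.strptime(str(value), "%Y%m%d").date()

print(parse_compact(raw))  # 2008-01-03

# The hyphenated format from the question does not match this layout
try:
    datetime.strptime(str(raw), "%Y-%m-%d")
except ValueError as err:
    print("hyphenated format fails:", err)
```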
from matplotlib.dates import DateFormatter
fig, ax = plt.subplots()
df.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
myFmt = DateFormatter("%d-%m-%Y")
ax.xaxis.set_minor_formatter(myFmt)
plt.show()
now gives plot:
You're parsing your dates wrong; "%Y-%m-%d" would work for dates like 2017-12-11 (which is Dec 11, 2017). Your dates are of the form "%Y%m%d", without the hyphens.
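
For the second part of the question (one subplot per Valuegroup), one possible approach is to create one Axes per group and plot each group on its own. This is a sketch, not taken from the answers above: it uses a shortened copy of the question's data and reads the Date column as a string so the "%Y%m%d" format applies directly.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Shortened copy of the CSV from the question
t = """Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
"""
df = pd.read_csv(io.StringIO(t), sep=";", dtype={"Date": str})
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")

groups = df.groupby("Valuegroup")
# squeeze=False keeps axes two-dimensional even if there is only one group
fig, axes = plt.subplots(nrows=groups.ngroups, sharex=True, squeeze=False)
for ax, (name, group) in zip(axes[:, 0], groups):
    ax.plot(group["Date"], group["Value"])
    ax.set_ylabel(name)
    ax.grid(True)
fig.autofmt_xdate()  # rotate the shared date labels
```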

Pandas dataframe plotting - issue when switching from two subplots to single plot w/ secondary axis

I have two sets of data I want to plot together on a single figure. I have a set of flow data at 15 minute intervals I want to plot as a line plot, and a set of precipitation data at hourly intervals, which I am resampling to a daily time step and plotting as a bar plot. Here is what the format of the data looks like:
2016-06-01 00:00:00 56.8
2016-06-01 00:15:00 52.1
2016-06-01 00:30:00 44.0
2016-06-01 00:45:00 43.6
2016-06-01 01:00:00 34.3
At first I set this up as two subplots, with precipitation and flow rate on different axes. This works totally fine. Here's my code:
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
filename = 'manhole_B.csv'
plotname = 'SSMH-2A B'
plt.style.use('bmh')
# Read csv with precipitation data, change index to datetime object
pdf = pd.read_csv('precip.csv', delimiter=',', header=None, index_col=0)
pdf.columns = ['Precipitation[in]']
pdf.index.name = ''
pdf.index = pd.to_datetime(pdf.index)
pdf = pdf.resample('D').sum()
print(pdf.head())
# Read csv with flow data, change index to datetime object
qdf = pd.read_csv(filename, delimiter=',', header=None, index_col=0)
qdf.columns = ['Flow rate [gpm]']
qdf.index.name = ''
qdf.index = pd.to_datetime(qdf.index)
# Plot
f, ax = plt.subplots(2)
qdf.plot(ax=ax[1], rot=30)
pdf.plot(ax=ax[0], kind='bar', color='r', rot=30, width=1)
ax[0].get_xaxis().set_ticks([])
ax[1].set_ylabel('Flow Rate [gpm]')
ax[0].set_ylabel('Precipitation [in]')
ax[0].set_title(plotname)
f.set_facecolor('white')
f.tight_layout()
plt.show()
2 Axis Plot
However, I decided I want to show everything on a single plot, so I modified my code to put precipitation on a secondary axis. Now my flow data has disappeared from the plot, and even when I set the axis ticks to an empty set, I get these 00:15, 00:30 and 00:45 tick marks along the x-axis.
Secondary-y axis plots
Any ideas why this might be occurring?
Here is my code for the single axis plot:
f, ax = plt.subplots()
qdf.plot(ax=ax, rot=30)
pdf.plot(ax=ax, kind='bar', color='r', rot=30, secondary_y=True)
ax.get_xaxis().set_ticks([])
Here is an example:
Setup
In [1]: from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame({'x': np.arange(10),
                   'y1': np.random.rand(10,),
                   'y2': np.square(np.arange(10))})
df
Out[1]: x y1 y2
0 0 0.451314 0
1 1 0.321124 1
2 2 0.050852 4
3 3 0.731084 9
4 4 0.689950 16
5 5 0.581768 25
6 6 0.962147 36
7 7 0.743512 49
8 8 0.993304 64
9 9 0.666703 81
Plot
In [2]: fig, ax1 = plt.subplots()
ax1.plot(df['x'], df['y1'], 'b-')
ax1.set_xlabel('Series')
ax1.set_ylabel('Random', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')
ax2 = ax1.twinx()  # Note twinx, not twiny. I was wrong when I commented on your question.
ax2.plot(df['x'], df['y2'], 'ro')
ax2.set_ylabel('Square', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
Out[2]:
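
Applied to the flow and precipitation data, the same `twinx` approach might look like the sketch below. The data here is synthetic (the real values would come from the two CSV files), and calling matplotlib's `bar` directly is a swapped-in choice rather than what the question used: as far as I understand, pandas bar plots place bars at integer positions instead of dates, which is a likely reason the two plots did not line up on one axis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins: two days of 15-minute flow and hourly precipitation
flow_index = pd.date_range("2016-06-01", periods=192, freq="15min")
qdf = pd.DataFrame({"Flow rate [gpm]": np.linspace(30, 60, 192)}, index=flow_index)
rain_index = pd.date_range("2016-06-01", periods=48, freq="60min")
pdf = pd.DataFrame({"Precipitation[in]": np.full(48, 0.01)}, index=rain_index)
pdf = pdf.resample("D").sum()  # daily totals, as in the question

fig, ax1 = plt.subplots()
ax1.plot(qdf.index, qdf["Flow rate [gpm]"], color="b")
ax1.set_ylabel("Flow Rate [gpm]")

ax2 = ax1.twinx()  # second y-axis sharing the same date x-axis
# Bars sit on real dates; on a date axis the width is measured in days
ax2.bar(pdf.index, pdf["Precipitation[in]"], width=1.0, align="edge",
        color="r", alpha=0.4)
ax2.set_ylabel("Precipitation [in]")
fig.autofmt_xdate()
```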

Fixed y axis in Python plotting times in 12 hr format

I have this plot, but I need the y axis to be fixed to 00:00, 01:00, 02:00, etc., all the way up to 12:00. As of now it only shows the values I have in the CSV on the y axis. The CSV is in the following format. How do I get the y axis to be constant, showing only 00:00 to 12:00 in 1-hour increments, and still have the data plotted correctly?
ML INT 0.1 534.15 0:00
ML EXT 0.25 654.23 3:00
ML INT 0.35 743.12 6:30
And the following is the code I have so far.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('data.csv', header=None)
ints = data[data[1]=='INT']
exts = data[data[1]=='EXT']
INT_index = data[data[1]=='INT'].index
EXT_index = data[data[1]=='EXT'].index
time = [t for t in data[4]]
int_dist = [d for d in ints[3]]
ext_dist = [d for d in exts[3]]
fig, ax = plt.subplots()
ax.scatter(int_dist, INT_index, c='orange', s=150)
ax.scatter(ext_dist, EXT_index, c='black', s=150)
ax.set_yticks(np.arange(len(data[4])))
ax.set_yticklabels(time)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.ylabel('Time')
plt.show()
I generated a few more rows of data to make the problem, at least on my end, a bit more meaningful.
What solved this for me was generating a 5th column (in code, not the csv) which is the number of minutes corresponding to a particular o'clock time, i.e. 11:59 maps to 719 min. Using pandas I inserted this new column into the dataframe. I could then place string ticklabels for every hour ('0:00', '1:00', etc.) at every 60 min.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('Workbook2.csv', header=None)
print(data)
Prints my faked data:
0 1 2 3 4
0 ML INT 0.10 534.15 0:00
1 ML EXT 0.25 654.23 3:00
2 ML INT 0.30 743.12 6:30
3 ML EXT 0.35 744.20 4:30
4 ML INT 0.45 811.47 7:00
5 ML EXT 0.55 777.90 5:45
6 ML INT 0.66 854.70 7:54
7 ML EXT 0.74 798.40 6:55
8 ML INT 0.87 947.30 11:59
Now make a function to convert o'clock to minutes:
def convert_to_min(o_clock):
    h, m = o_clock.split(':')
    return int(h) * 60 + int(m)
# using this function, create a list of times in minutes for each time in col 4
min_col = [convert_to_min(t) for t in data[4]]
data[5] = min_col # inserts this list as a new column '5'
print(data)
Our new data:
0 1 2 3 4 5
0 ML INT 0.10 534.15 0:00 0
1 ML EXT 0.25 654.23 3:00 180
2 ML INT 0.30 743.12 6:30 390
3 ML EXT 0.35 744.20 4:30 270
4 ML INT 0.45 811.47 7:00 420
5 ML EXT 0.55 777.90 5:45 345
6 ML INT 0.66 854.70 7:54 474
7 ML EXT 0.74 798.40 6:55 415
8 ML INT 0.87 947.30 11:59 719
Now build the x and y axis data, the ticklabels, and the tick locations:
INTs = data[data[1]=='INT']
EXTs = data[data[1]=='EXT']
int_dist = INTs[3] # x-axis data for INT
ext_dist = EXTs[3]
# plotting time as minutes in range [0 720]
int_time = INTs[5] # y-axis data for INT
ext_time = EXTs[5]
time = ['0:00', '1:00', '2:00', '3:00', '4:00', '5:00',
        '6:00', '7:00', '8:00', '9:00', '10:00', '11:00', '12:00']
# this will place the strings above at every 60 min
tick_location = [t*60 for t in range(13)]
Now plot:
fig, ax = plt.subplots()
ax.scatter(int_dist, int_time, c='orange', s=150)
ax.scatter(ext_dist, ext_time, c='black', s=150)
ax.set_yticks(tick_location)
ax.set_yticklabels(time)
plt.legend(['INT', 'EXT'], loc=4)
plt.xlabel('Distance')
plt.ylabel('Time')
plt.title('Seems to work...')
plt.show()
The ticks will be a lot smarter if you use a datetime for the y-axis.
Fake data:
df = pd.DataFrame({'value':[530,640,710], 'time':['0:00', '3:00', '6:30']})
time value
0 0:00 530
1 3:00 640
2 6:30 710
Convert df.time from str to datetime:
time2 = pd.to_datetime(df.time, format='%H:%M')
plt.plot(df.value, time2, marker='o', linestyle='None')
I can't seem to get this into a scatter instead of plot, in case that matters to you (I suppressed the line instead). Maybe that is because datetime values belong in a time-series line plot and not in a scatter plot; I welcome comments on whether that is indeed the case and datetime cannot be used in a scatter.
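
One possible way to get a scatter with a fixed 0:00 to 12:00 axis is to convert the datetimes to matplotlib date numbers first and then attach an hourly locator and formatter. This is a sketch under assumptions, reusing the fake data above; behavior may differ between matplotlib versions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'value': [530, 640, 710], 'time': ['0:00', '3:00', '6:30']})
# With only a time given, pandas fills in the date 1900-01-01
time2 = pd.to_datetime(df['time'], format='%H:%M')
y = mdates.date2num(time2.to_numpy())  # floats measured in days

fig, ax = plt.subplots()
ax.scatter(df['value'], y)
ax.yaxis.set_major_locator(mdates.HourLocator(interval=1))   # one tick per hour
ax.yaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))  # label as HH:MM
start = mdates.date2num(pd.Timestamp('1900-01-01 00:00').to_pydatetime())
ax.set_ylim(start, start + 0.5)  # half a day = 12 hours
```

Fixing the limits with `set_ylim` is what keeps the axis at 00:00 to 12:00 regardless of which times appear in the data.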
