Plotting 2 variables in Python from an Excel file - python

I am a Python beginner. I want to start learning Python with plotting.
I would really appreciate it if someone could help me write a script to plot an Excel file with 2 variables (velocity and direction), shown below:
Date Velocity Direction
3/12/2011 0:00 1.0964352 10
3/12/2011 0:30 1.1184975 15
3/12/2011 1:00 0.48979592 20
3/12/2011 1:30 0.13188942 45

Prepare the data
import pandas as pd
from io import StringIO
data = '''\
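Date  Velocity  Direction
3/12/2011 0:00  1.0964352  10
3/12/2011 0:30  1.1184975  15
3/12/2011 1:00  0.48979592  20
3/12/2011 1:30  0.13188942  45
'''
# engine='python' is required for a regex separator; without it pandas falls back with a warning
df = pd.read_csv(StringIO(data), sep=r'\s{2,}', engine='python', parse_dates=[0], dayfirst=True)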
I use a trick here. Because the values in the Date column contain a time part that is separated from the date by a single space, I split the columns on two or more whitespace characters. This is why I give the separator as a regex, sep=r'\s{2,}'. In a real CSV the columns are normally separated by commas, which makes things easier (sep=',' is the default setting).
Note that the Date column has been parsed as dates. Its column type is datetime64.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Date 4 non-null datetime64[ns]
Velocity 4 non-null float64
Direction 4 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 176.0 bytes
Once the Date column is set as the index, plotting the data is simple:
df.set_index('Date').plot()
This will result in a line plot where both velocity and direction are plotted for each timestamp.
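Velocity and direction are on quite different scales, so on a shared y-axis the velocity curve ends up nearly flat. A minimal sketch of one way around that, assuming pandas' secondary_y plotting option and rebuilding the sample data by hand (the Agg backend and the axis labels are my additions, not part of the answer above):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Sample rows from the question, read day-first (3 December 2011)
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2011-12-03 00:00", "2011-12-03 00:30",
             "2011-12-03 01:00", "2011-12-03 01:30"]
        ),
        "Velocity": [1.0964352, 1.1184975, 0.48979592, 0.13188942],
        "Direction": [10, 15, 20, 45],
    }
)

# Velocity goes on the left axis, Direction on a second y-axis on the right
ax = df.set_index("Date").plot(secondary_y="Direction")
ax.set_ylabel("Velocity")
ax.right_ax.set_ylabel("Direction")
```

For a real workbook, pd.read_excel would take the place of the hand-built DataFrame.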

Related

Using axvline with datetimeindex

I am trying to plot a vertical line using axvline, but I keep getting errors even though I saw here that it's possible to just feed in the date into axvline: Python/Matplotlib plot vertical line for specific dates in line chart
Could anyone point out what I am doing wrong? (I am also a beginner.)
Ideally, I am looking for a way to just be able to feed in the date into axvline without adding extra pieces of code.
Here is my df:
CPISXS CPIX WTI UMCSI
Dates
2022-08-31 387.748 263.732 93.67 44.0
2022-09-30 390.555 264.370 84.26 42.0
2022-10-31 390.582 264.442 87.55 42.0
2022-11-30 390.523 263.771 84.37 46.0
2022-12-31 NaN NaN NaN NaN
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 756 entries, 1960-01-31 to 2022-12-31
Freq: M
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CPISXS 480 non-null float64
1 CPIX 671 non-null float64
2 WTI 755 non-null float64
3 UMCSI 670 non-null float64
dtypes: float64(4)
memory usage: 45.7 KB
And here is the code:
fig2, m1 = plt.subplots(figsize=(12,6))
m2=m1.twinx()
m1.axvline(x='1990-01-30')
m1.plot(df0['UMCSI'],'r--',linewidth=1)
m2.plot(df0['WTI'],'b')
When I run it, I always get the vertical line at 1970-01-01.
Move axvline after the plot commands.
Why the order matters:
If you use axvline first, the line gets added to a blank plot without a datetime x-axis. Since there is no datetime x-axis, axvline doesn't know what to do with a date string input.
Conversely if you plot the time series first, the datetime x-axis gets established, at which point axvline will understand the date input.
fig2, m1 = plt.subplots(figsize=(12, 6))
m2 = m1.twinx()
# plot the time series first to establish the datetime x-axis
m1.plot(df0['UMCSI'], 'r--', linewidth=1)
m2.plot(df0['WTI'], 'b')
# now axvline will understand the date input
m1.axvline(x='1990-01-30')
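A sketch of the same fix with made-up data, since the original df0 isn't available. The column names mirror the question; the values, the date range and the Agg backend are assumptions for illustration. Passing a pd.Timestamp rather than a bare string is a cautious choice that avoids relying on string conversion:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df0: 36 monthly rows starting in January 1989
idx = pd.date_range("1989-01-01", periods=36, freq="MS")
df0 = pd.DataFrame(
    {"UMCSI": np.linspace(90, 70, 36), "WTI": np.linspace(18, 25, 36)},
    index=idx,
)

fig2, m1 = plt.subplots(figsize=(12, 6))
m2 = m1.twinx()

# Plot the time series first so the x-axis is set up for dates
m1.plot(df0.index, df0["UMCSI"], "r--", linewidth=1)
m2.plot(df0.index, df0["WTI"], "b")

# Then add the vertical line, using an explicit timestamp
marker = pd.Timestamp("1990-01-30")
vline = m1.axvline(x=marker, color="k")
```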

(Only) some dates in date index get interpreted wrong after import from csv

I want to analyse a dataframe in python. I loaded a csv which consists of two columns, one date/time and one mean value.
I loaded the data like this:
df = pd.read_csv('L_strom_30974_gerundet.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
The problem is that some dates seem to be interpreted incorrectly by Python. The csv only ranges from 01.01.2009 00:00 to 04.10.2010 23:45 (original format), but when I load the file into Python, the plot and df.info also show dates from November and December 2010:
PeriodIndex: 61628 entries, 2009-01-01 00:00 to 2010-12-09 23:45
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 61628 non-null float64
dtypes: float64(1)
I searched in the csv for values from this time, but couldn't find any. Also, the number of entries in the df.info matches the rows of my csv, so I reckon that some dates must have been interpreted wrong.
The tail of my dataframe after the import looks like this:
Mean
Timestamp
2010-12-09 22:45 186
2010-12-09 23:00 206
2010-12-09 23:15 168
2010-12-09 23:30 150
2010-12-09 23:45 132
I searched for similar problems, but could not find an explanation as to why most of the data is interpreted correctly, but some incorrectly. Any idea?
The assumed need for infer_datetime_format=True gives away that you are not passing the correct format. Have a look at the strftime documentation. You are using:
format='%d.%m.%y %H:%M'
# %y = Year without century as a zero-padded decimal number: 09, 10
But the format required is:
format='%d.%m.%Y %H:%M'
# %Y = Year with century as a decimal number: 2009, 2010
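Apparently, because the given format does not match, the parsing falls back on inference, and infer_datetime_format isn't able to infer each string correctly: where both numbers could be a month (for example 12.09.2010), it takes the day as the month and vice versa. Indeed, let's reproduce the error: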
Create csv:
import pandas as pd
import numpy as np
data = {'Timestamp': pd.date_range('01-01-2009', '10-04-2010', freq='H'),
        'Mean': np.random.randint(0, 10, 15385)}
df_orig = pd.DataFrame(data)
df_orig['Timestamp'] = df_orig['Timestamp'].dt.strftime('%d.%m.%Y %H:%M')
df_orig.to_csv('test.csv', sep=';', index=None, header=None)
# csv like:
01.01.2009 00:00;7
01.01.2009 01:00;6
01.01.2009 02:00;0
01.01.2009 03:00;2
01.01.2009 04:00;3
Load csv incorrectly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M',
infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
df.info() # note the incorrect `PeriodIndex`, ending with `2010-12-09 23:00`
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-12-09 23:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB
Load csv correctly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%Y %H:%M')
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-10-04 00:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB
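To see where a date like 2010-12-09 can come from, it may help to parse a single ambiguous string in isolation. A small sketch, using a made-up sample value (12 September 2010 in the CSV's day-first notation); note also that infer_datetime_format is deprecated in pandas 2.0 and later, so the explicit format is the safer route anyway:

```python
import pandas as pd

sample = pd.Series(["12.09.2010 23:45"])  # hypothetical row: 12 September 2010

# Correct reading: day first, four-digit year
day_first = pd.to_datetime(sample, format="%d.%m.%Y %H:%M")

# What a month-first reading of the same string produces
month_first = pd.to_datetime(sample, format="%m.%d.%Y %H:%M")

# The format from the question (%y, two-digit year) does not match the string
try:
    pd.to_datetime(sample, format="%d.%m.%y %H:%M")
    two_digit_year_matches = True
except ValueError:
    two_digit_year_matches = False

print(day_first[0], month_first[0], two_digit_year_matches)
```

Only days up to 12 can be misread as months, which would be consistent with most rows coming out right and only some landing in the wrong month.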

How to remove rows in pandas of type datetime64[ns] by date?

I'm pretty new to this and have just started using Python for my project.
I have a dataset whose first column has type datetime64[ns]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5889 entries, 0 to 5888
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5889 non-null datetime64[ns]
1 title 5889 non-null object
2 stock 5889 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 138.1+ KB
and
type(BA['date'])
gives
pandas.core.series.Series
The dates have the format 2020-06-10.
I need to delete all rows before a specific date, for example 2015-09-09.
What I tried:
convert to string. Failed
Create conditions using:
.df.year <= %y & .df.month <= %m
<= ('%y-%m-%d')
create data with datetime() method
create variable with datetime64 format
just copy with .loc() and .copy()
All of this failed. I got all kinds of errors: it's not int, it's not datetime, datetime mutable, not this, not that, not a holy cow.
How can this pandas format be any more counterintuitive? I can't believe it: for the first time, writing a CSV parser in C++ seems easier than using a ready-made library in Python.
Thank you for understanding
Toy Example
df = pd.DataFrame({'date':['2021-1-1', '2020-12-6', '2019-02-01', '2020-02-01']})
df.date = pd.to_datetime(df.date)
df
Input df
date
0 2021-01-01
1 2020-12-06
2 2019-02-01
3 2020-02-01
Delete rows before 2020.01.01.
We select the rows whose dates fall after 2020.01.01 and leave out the older ones.
df.loc[df.date>'2020.01.01']
Output
date
0 2021-01-01
1 2020-12-06
3 2020-02-01
If we want the index to be reset:
df = df.loc[df.date>'2020.01.01'].reset_index(drop=True)
df
Output
date
0 2021-01-01
1 2020-12-06
2 2020-02-01
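The same filter can be written with an explicit cutoff timestamp, which makes the comparison independent of how a date string happens to be parsed. A minimal sketch on the toy data above (the names cutoff and kept are mine):

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-01", "2020-12-06", "2019-02-01", "2020-02-01"])}
)

cutoff = pd.Timestamp("2020-01-01")

# Boolean mask: True for rows on or after the cutoff
mask = df["date"] >= cutoff

# Keep the matching rows and renumber the index from 0
kept = df[mask].reset_index(drop=True)
print(kept)
```

Use > instead of >= if rows dated exactly on the cutoff should be dropped as well.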

formatting inconsistent date data with pandas

I'm wondering how I might approach the problem of inconsistent date formats with pandas. Initially I used a regular expression to extract a date from a large data set of urls. That worked great; however, the extracted dates do not share a single format:
dates
20140609
20140624
20140404
3/18/14
3/10/14
3/14/2014
20140807
20140806
2014-07-18
As you can see there is an inconsistent formatting of the date data in this dataset. Is there a way to fix this formatting so that all the dates are of the same format?
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 122270 entries, 0 to 122269
Data columns (total 4 columns):
id 119534 non-null float64
x1 122270 non-null int64
url 122270 non-null object
date 122025 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 4.7+ MB
Use to_datetime; it seems capable of handling your inconsistent formatting:
In [77]:
df['dates'] = pd.to_datetime(df['dates'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 1 columns):
dates 9 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes
In [78]:
df
Out[78]:
dates
0 2014-06-09
1 2014-06-24
2 2014-04-04
3 2014-03-18
4 2014-03-10
5 2014-03-14
6 2014-08-07
7 2014-08-06
8 2014-07-18
For your sample dataset to_datetime works fine. If it didn't work for you, it will be because you have some formats that it can't convert. You can either pass errors='coerce' (coerce=True in very old pandas versions), which sets any values that cannot be converted to NaT, or errors='raise' to report any problems.
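Recent pandas versions (2.0 and later) are stricter about mixed formats in a single column, so a plain to_datetime call may no longer behave as shown above. One version-independent approach is to try each expected format in turn with errors='coerce' and keep the first successful parse. A sketch, assuming the four formats visible in the sample are the only ones present (the formats list and the "not a date" row are my additions):

```python
import pandas as pd

raw = pd.Series(["20140609", "3/18/14", "3/14/2014", "2014-07-18", "not a date"])

# Candidate formats, assumed from the sample values in the question
formats = ["%Y%m%d", "%m/%d/%y", "%m/%d/%Y", "%Y-%m-%d"]

# Start with all-NaT and fill in whatever each format manages to parse
parsed = pd.Series(pd.NaT, index=raw.index)
for fmt in formats:
    attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

# Rows that no format could parse stay NaT and can be inspected
unparsed = raw[parsed.isna()]
print(parsed)
print(unparsed)
```

Anything left in unparsed points to a format that is missing from the list.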

Transforming financial data from postgres to pandas dataframe for use with Zipline

I'm new to Pandas and Zipline, and I'm trying to learn how to use them (and use them with this data that I have). Any sorts of tips, even if no full solution, would be much appreciated. I have tried a number of things, and have gotten quite close, but run into indexing issues, Exception: Reindexing only valid with uniquely valued Index objects, in particular. [Pandas 0.10.0, Python 2.7]
I'm trying to transform monthly returns data I have for thousands of stocks in postgres from the form:
ticker_symbol :: String, monthly_return :: Float, date :: Timestamp
e.g.
AAPL, 0.112, 28/2/1992
GS, 0.13, 30/11/1981
GS, -0.23, 22/12/1981
NB: The frequency of the reporting is monthly, but there is going to be considerable NaN data here, as not all of the over 6000 companies I have here are going to be around at the same time.
…to the form described below, which is what Zipline needs to run its backtester. (I think. Can Zipline's backtester work with monthly data like this, easily? I know it can, but any tips for doing this?)
The below is a DataFrame (of timeseries? How do you say this?), in the format I need:
> data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2268 entries, 1993-01-04 00:00:00+00:00 to 2001-12-31 00:00:00+00:00
Data columns:
AA 2268 non-null values
AAPL 2268 non-null values
GE 2268 non-null values
IBM 2268 non-null values
JNJ 2268 non-null values
KO 2268 non-null values
MSFT 2268 non-null values
PEP 2268 non-null values
SPX 2268 non-null values
XOM 2268 non-null values
dtypes: float64(10)
The below is a TimeSeries, and is in the format I need.
> data.AAPL:
Date
1993-01-04 00:00:00+00:00 73.00
1993-01-05 00:00:00+00:00 73.12
...
2001-12-28 00:00:00+00:00 36.15
2001-12-31 00:00:00+00:00 35.55
Name: AAPL, Length: 2268
Note, there isn't return data here, but prices instead. They're adjusted (by Zipline's load_from_yahoo—though, from reading the source, really by functions in pandas) for dividends, splits, etc, so there's an isomorphism (less the initial price) between that and my return data (so, no problem here).
(EDIT: Let me know if you'd like me to write what I have, or attach my iPython notebook or a gist; I just doubt it'd be helpful, but I can absolutely do it if requested.)
I suspect you are trying to set the date as the index too early. My suggestion would be to first set_index as date and company name, then you can unstack the company name and resample.
Something like this:
In [11]: df1
Out[11]:
ticker_symbol monthly_return date
0 AAPL 0.112 1992-02-28 00:00:00
1 GS 0.130 1981-11-30 00:00:00
2 GS -0.230 1981-12-22 00:00:00
df2 = df1.set_index(['date','ticker_symbol'])
df3 = df2.unstack(level=1)
df4 = df3.resample('M')  # recent pandas needs an explicit aggregation here, e.g. .mean()
In [14]: df2
Out[14]:
monthly_return
date ticker_symbol
1992-02-28 AAPL 0.112
1981-11-30 GS 0.130
1981-12-22 GS -0.230
In [15]: df3
Out[15]:
monthly_return
ticker_symbol AAPL GS
date
1981-11-30 NaN 0.13
1981-12-22 NaN -0.23
1992-02-28 0.112 NaN
In [16]: df4
Out[16]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 124 entries, 1981-11-30 00:00:00 to 1992-02-29 00:00:00
Freq: M
Data columns:
(monthly_return, AAPL) 1 non-null values
(monthly_return, GS) 2 non-null values
dtypes: float64(2)
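For current pandas versions, the same reshaping might look like the sketch below. It swaps in pivot_table for set_index plus unstack, because pivot_table aggregates duplicate (date, ticker) pairs instead of raising on them, which may be relevant to the "uniquely valued Index" error in the question. It also uses the 'MS' (month start) alias, since the spelling of the month-end alias has changed in recent releases. Both choices are assumptions of mine, not part of the answer above:

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "ticker_symbol": ["AAPL", "GS", "GS"],
        "monthly_return": [0.112, 0.13, -0.23],
        "date": pd.to_datetime(["1992-02-28", "1981-11-30", "1981-12-22"]),
    }
)

# One column per ticker, one row per date; duplicates are averaged
wide = df1.pivot_table(
    index="date", columns="ticker_symbol", values="monthly_return", aggfunc="mean"
)

# Regular monthly grid, labelled by month start; empty months become NaN
monthly = wide.resample("MS").mean()
print(monthly.shape)
```

The result has one row per calendar month between the first and last observation, with NaN wherever a ticker has no return for that month.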
