Calculating similarity between rows of pandas dataframe - python

The goal is to identify the top 10 most similar rows for each row in the dataframe.
I start with the following dictionary:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
d = {'0001': [('skiing', 0.789), ('snow', 0.65), ('winter', 0.56)],
     '0002': [('drama', 0.89), ('comedy', 0.678), ('action', -0.42), ('winter', -0.12), ('kids', 0.12)],
     '0003': [('action', 0.89), ('funny', 0.58), ('sports', 0.12)],
     '0004': [('dark', 0.89), ('Mystery', 0.678), ('crime', 0.12), ('adult', -0.423)],
     '0005': [('cartoon', -0.89), ('comedy', 0.678), ('action', 0.12)],
     '0006': [('drama', -0.49), ('funny', 0.378), ('Suspense', 0.12), ('Thriller', 0.78)],
     '0007': [('dark', 0.79), ('Mystery', 0.88), ('crime', 0.32), ('adult', -0.423)]}
To put it in a dataframe I do the following:
col_headers = []
entities = []
for key, scores in d.iteritems():
    entities.append(key)
    d[key] = dict(scores)
    col_headers.extend(d[key].keys())
col_headers = list(set(col_headers))
Then I populate the dataframe:
df = pd.DataFrame(columns=col_headers, index=entities)
for k in d:
    df.loc[k] = pd.Series(d[k])
df.fillna(0.0, axis=1)
One issue I have at this point, in addition to my main goal, is that my dataframe still has NaNs. This is probably why my result matrix is filled with NaNs.
Mystery drama kids winter funny snow crime dark sports Suspense adult skiing action comedy cartoon Thriller
0004 0.678 NaN NaN NaN NaN NaN 0.12 0.89 NaN NaN -0.423 NaN NaN NaN NaN NaN
0005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.12 0.678 -0.89 NaN
0006 NaN -0.49 NaN NaN 0.378 NaN NaN NaN NaN 0.12 NaN NaN NaN NaN NaN 0.78
0007 0.88 NaN NaN NaN NaN NaN 0.32 0.79 NaN NaN -0.423 NaN NaN NaN NaN NaN
0001 NaN NaN NaN 0.56 NaN 0.65 NaN NaN NaN NaN NaN 0.789 NaN NaN NaN NaN
0002 NaN 0.89 0.12 -0.12 NaN NaN NaN NaN NaN NaN NaN NaN -0.42 0.678 NaN NaN
0003 NaN NaN NaN NaN 0.58 NaN NaN NaN 0.12 NaN NaN NaN 0.89 NaN NaN NaN
To calculate cosine similarity and generate the similarity matrix between rows I do the following:
data = df.values
m, k = data.shape
mat = np.zeros((m, m))
for i in xrange(m):
    for j in xrange(m):
        if i != j:
            mat[i][j] = cosine(data[i,:], data[j,:])
        else:
            mat[i][j] = 0.
Here is what mat looks like:
[[ 0. nan nan nan nan nan nan]
[ nan 0. nan nan nan nan nan]
[ nan nan 0. nan nan nan nan]
[ nan nan nan 0. nan nan nan]
[ nan nan nan nan 0. nan nan]
[ nan nan nan nan nan 0. nan]
[ nan nan nan nan nan nan 0.]]
Assuming the NaN issue gets fixed and mat holds a meaningful similarity matrix, how can I get output like the following?
{0001:[003,005,002],0002:[0001, 0004, 0007]....}

One issue I have at this point, in addition to my main goal, is that my dataframe still has NaNs.
That's because df.fillna does not modify the DataFrame in place, but returns a new one. Fix that and your result will be fine.
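For completeness, here is a minimal sketch of the fix plus the requested output, reusing the df and mat built above (top_n is capped at the number of other rows, since this sample only has 7 entities):

df = df.fillna(0.0)  # assign the result back; fillna returns a new frame

data = df.values.astype(float)
m = data.shape[0]
mat = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        if i != j:
            # cosine() returns a distance, so smaller means more similar
            mat[i][j] = cosine(data[i, :], data[j, :])

top_n = min(10, m - 1)
# for each row, sort distances ascending and keep the top_n closest other rows
similar = {
    df.index[i]: [df.index[j] for j in np.argsort(mat[i]) if j != i][:top_n]
    for i in range(m)
}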

Related

nan shows up on my boxplot but there is no column labeled nan?

I started with:
print(NAsales_boxplotdf)
and I got the following output:
Genre NA_Sales
0 Sports 41.49
1 Platform 29.08
2 Racing 15.85
3 Sports 15.75
4 Role-Playing 11.27
... ... ...
16594 Shooter 0.01
16595 Racing 0.00
16596 Puzzle 0.00
16597 Platform 0.01
16598 NaN NaN
[16599 rows x 2 columns]
I then pivoted the dataframe as follows:
NAsales_boxplotpivot = NAsales_boxplotdf.pivot(values = 'NA_Sales', columns = 'Genre')
to get:
Genre NaN Action Adventure Fighting Misc Platform Puzzle Racing \
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN 29.08 NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN 15.85
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
16594 NaN NaN NaN NaN NaN NaN NaN NaN
16595 NaN NaN NaN NaN NaN NaN NaN 0.00
16596 NaN NaN NaN NaN NaN NaN 0.0 NaN
16597 NaN NaN NaN NaN NaN 0.01 NaN NaN
16598 NaN NaN NaN NaN NaN NaN NaN NaN
Genre Role-Playing Shooter Simulation Sports Strategy
0 NaN NaN NaN 41.49 NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 15.75 NaN
4 11.27 NaN NaN NaN NaN
... ... ... ... ... ...
16594 NaN 0.01 NaN NaN NaN
16595 NaN NaN NaN NaN NaN
16596 NaN NaN NaN NaN NaN
16597 NaN NaN NaN NaN NaN
16598 NaN NaN NaN NaN NaN
[16599 rows x 13 columns]
I dropped the NaN column using:
NAsales_boxplotpivot[NAsales_boxplotpivot.columns.dropna()]
to get this, and then I put it in a boxplot using:
NAsales_boxplotpivot.plot(kind='box', rot = 90, figsize = (20,10), showfliers = False)
to get this.
In the image you posted, you displayed
NAsales_boxplotpivot[NAsales_boxplotpivot.columns.dropna()]
using a Jupyter notebook, but you did not store the result back in NAsales_boxplotpivot. However, you then used NAsales_boxplotpivot to create the boxplot, so the boxplot is created from the original dataframe, which still has the NaN column.
To remove the NaN column and store the result you'd need to write
NAsales_boxplotpivot = NAsales_boxplotpivot[NAsales_boxplotpivot.columns.dropna()]
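Putting it together, a minimal sketch of the corrected pipeline (same names as above):

# store the frame without the NaN column, then plot from the stored result
NAsales_boxplotpivot = NAsales_boxplotpivot[NAsales_boxplotpivot.columns.dropna()]
NAsales_boxplotpivot.plot(kind='box', rot=90, figsize=(20, 10), showfliers=False)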

Filter forward col 1 without iteration

I am dealing with a "waterfall" structure DataFrame in Pandas, Python.
Column 1 is full, while the rest of the data set is mostly empty, representing series that are available for only a subset of the total period considered:
Instrument AUPRATE. AIB0411 AIB0511 AIB0611 ... AIB1120 AIB1220 AIB0121 AIB0221
Field ...
Date ...
2011-03-31 4.75 4.730 4.710 4.705 ... NaN NaN NaN NaN
2011-04-29 4.75 4.745 4.750 4.775 ... NaN NaN NaN NaN
2011-05-31 4.75 NaN 4.745 4.755 ... NaN NaN NaN NaN
2011-06-30 4.75 NaN NaN 4.745 ... NaN NaN NaN NaN
2011-07-29 4.75 NaN NaN NaN ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
2019-05-31 1.50 NaN NaN NaN ... NaN NaN NaN NaN
2019-06-28 1.25 NaN NaN NaN ... 0.680 NaN NaN NaN
2019-07-31 1.00 NaN NaN NaN ... 0.520 0.530 NaN NaN
2019-08-30 1.00 NaN NaN NaN ... 0.395 0.405 0.405 NaN
2019-09-30 1.00 NaN NaN NaN ... 0.435 0.445 0.445 0.45
What I would like to do is to push the values from "AUPRATE" to the start of the data in every row (such that they effectively represent the zeroth observation). Where the AUPRATE values are not adjacent to the dataset, they should be replaced with NaN.
I could probably write a junky loop to do this but I was wondering if there was an efficient way of achieving the same outcome.
I am very much a novice in pandas and Python. Thank you in advance.
[edit]
Desired output:
Instrument AUPRATE. AIB0411 AIB0511 AIB0611 ... AIB1120 AIB1220 AIB0121 AIB0221
Field ...
Date ...
2011-03-31 4.75 4.730 4.710 4.705 ... NaN NaN NaN NaN
2011-04-29 4.75 4.745 4.750 4.775 ... NaN NaN NaN NaN
2011-05-31 NaN 4.75 4.745 4.755 ... NaN NaN NaN NaN
2011-06-30 NaN NaN 4.75 4.745 ... NaN NaN NaN NaN
2011-07-29 NaN NaN NaN NaN ... NaN NaN NaN NaN
I have implemented the following, based on the suggestion below. I would still be happy if there was a way of doing this without iteration.
for i in range(AU_furures_rates.shape[0]):          # iterate over rows
    for j in range(AU_furures_rates.shape[1] - 1):  # iterate over cols
        if pd.notnull(AU_furures_rates.iloc[i, j + 1]) and pd.isnull(AU_furures_rates.iloc[i, 1]):  # move rate when needed
            AU_furures_rates.iloc[i, j] = AU_furures_rates.iloc[i, 0]
            AU_furures_rates.iloc[i, 0] = np.nan    # np.nan rather than the string "NaN"
            break
Maybe someone will find a 'cleaner' solution, but my idea was to first iterate over the columns to check, for each row, which column's value needs to be replaced (going backwards, so that we end up with the first occurrence):
df['column_to_move'] = np.nan
cols = df.columns.tolist()
for i in range(len(cols) - 2, 1, -1):  # iterate over the columns, backwards
    df.loc[pd.isna(df[cols[i]]) & pd.notna(df[cols[i + 1]]), 'column_to_move'] = cols[i]
And then iterate over the columns to fill in the value from AUPRATE. where it's needed, and replace AUPRATE. itself with np.nan:
for col in cols[2:-1]:
    df.loc[df['column_to_move'] == col, col] = df['AUPRATE.']
    df.loc[df['column_to_move'] == col, 'AUPRATE.'] = np.nan
df.drop('column_to_move', axis=1, inplace=True)
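If you want to avoid iteration entirely, here is a vectorized sketch (an assumption on my part: the first column is 'AUPRATE.' and all columns are numeric). It finds the first observed column per row, writes the AUPRATE value into the cell just left of it, and blanks AUPRATE where it was moved or where the row has no data at all:

import numpy as np
import pandas as pd

def push_auprate(df):
    arr = df.to_numpy(copy=True)
    observed = ~np.isnan(arr[:, 1:])     # the data part, everything after AUPRATE.
    has_data = observed.any(axis=1)
    first = observed.argmax(axis=1)      # index of the first observation per row
    rows = np.arange(len(df))
    movable = has_data & (first > 0)     # AUPRATE not already adjacent to the data
    # data-part column (first - 1) is arr column first, i.e. just left of the data
    arr[rows[movable], first[movable]] = arr[rows[movable], 0]
    arr[movable | ~has_data, 0] = np.nan # clear AUPRATE where moved or row is empty
    return pd.DataFrame(arr, index=df.index, columns=df.columns)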

Rolling stdev to remove outliers with NaNs

Right, so I'm a bit rusty with Python (picking it back up after 4 years) and was looking for a solution to this problem. While there were similar threads, I wasn't able to figure out what I'm doing wrong.
I have some data that looks like this:
print (fwds)
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP \
Date
2019-10-15 1.47518 -0.503679 0.681473 1.84996 -0.804212 0.626394
2019-10-14 NaN -0.513647 0.684232 NaN -0.815201 0.643280
2019-10-11 1.51515 -0.520474 0.654544 1.84918 -0.812819 0.697584
2019-10-10 1.39085 -0.538651 0.564055 1.72812 -0.846291 0.546696
2019-10-09 1.30827 -0.568942 0.564897 1.63652 -0.896871 0.479307
... ... ... ... ... ... ...
1995-01-09 8.59473 NaN 10.830200 9.59729 NaN 9.407250
1995-01-06 8.58316 NaN 10.851200 9.42043 NaN 9.434480
1995-01-05 8.56470 NaN 10.839000 9.51209 NaN 9.560490
1995-01-04 8.44306 NaN 10.745900 9.51142 NaN 9.507650
1995-01-03 8.58847 NaN NaN 9.38380 NaN 9.611590
The problem is the data quality is not great, and I need to remove outliers on a rolling basis (since these time series have been trending, a static z-score will not work).
I tried a few solutions. One was to get a rolling z-score and then filter out the large values. However, when I try calculating the z-score, my result is all NaNs:
def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0, skipna=True).shift(1)
    z = (x - m) / s
    return z
print (fwds)
print (zscore(fwds, 200))
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP 1y1yJPY \
Date
2019-10-15 NaN NaN NaN NaN NaN NaN NaN
2019-10-14 NaN NaN NaN NaN NaN NaN NaN
2019-10-11 NaN NaN NaN NaN NaN NaN NaN
2019-10-10 NaN NaN NaN NaN NaN NaN NaN
2019-10-09 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
1995-01-09 NaN NaN NaN NaN NaN NaN NaN
1995-01-06 NaN NaN NaN NaN NaN NaN NaN
1995-01-05 NaN NaN NaN NaN NaN NaN NaN
1995-01-04 NaN NaN NaN NaN NaN NaN NaN
1995-01-03 NaN NaN NaN NaN NaN NaN NaN
Another approach:
r = fwds.rolling(window=200)
large = r.mean() + 4 * r.std()
small = r.mean() - 4 * r.std()
print(fwds[fwds > large])
print(fwds[fwds < small])
returns:
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP 1y1yJPY \
Date
2019-10-15 NaN NaN NaN NaN NaN NaN NaN
2019-10-14 NaN NaN NaN NaN NaN NaN NaN
2019-10-11 NaN NaN NaN NaN NaN NaN NaN
2019-10-10 NaN NaN NaN NaN NaN NaN NaN
2019-10-09 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
1995-01-09 NaN NaN NaN NaN NaN NaN NaN
1995-01-06 NaN NaN NaN NaN NaN NaN NaN
1995-01-05 NaN NaN NaN NaN NaN NaN NaN
1995-01-04 NaN NaN NaN NaN NaN NaN NaN
1995-01-03 NaN NaN NaN NaN NaN NaN NaN
for both max and min as well. Anyone have any idea how to deal with these darn NaNs when calculating rolling stdev or zscores?
Any hints appreciated. Thanks!
Edit:
For further clarity, I was hoping to remove things like the spike in the green and brown lines from the chart systematically:
fwds.plot()
Link below: https://i.stack.imgur.com/udu5O.png
Welcome to Stack Overflow. Depending on your use case (and how many extreme values there are), data interpolation should fit the bill.
Since you're looking at forwards (I think), interpolation should be statistically sound unless some of your missing values are the result of massive disruption in the market.
You can use pandas' DataFrame.interpolate to fill in your NaN values with interpolated values.
From the docs
Filling in NaN in a Series via linear interpolation.
>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0 0.0
1 1.0
2 NaN
3 3.0
dtype: float64
>>> s.interpolate()
0 0.0
1 1.0
2 2.0
3 3.0
dtype: float64
Edit: I just realized you are looking for market dislocations, so you may not want to use linear interpolation, as that will mute the effect of the missing data.
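If you do want the rolling z-score route to work despite the NaNs, one likely culprit is that rolling statistics return NaN for any window with fewer than window valid points unless min_periods is lowered (and rolling.std takes no skipna argument; rolling operations already skip NaNs on their own). A sketch, where min_periods=20 and the 4-sigma cutoff are assumptions to tune:

def zscore(x, window, min_periods=20):
    r = x.rolling(window=window, min_periods=min_periods)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    return (x - m) / s

z = zscore(fwds, 200)
cleaned = fwds.mask(z.abs() > 4)  # blank observations more than 4 rolling stdevs out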

Pandas: Create view on MultiColumn object

I'm trying to set a value on a multi columned table. However, I appear to be working on a copy, as the value does not persist:
In[4]: tIndex = np.array([32, 34, 134, 234, 334, 434])
topColumns = ['homogenous', 'heterogenous']
mus = ['mu_el', 'mu_eh', 'mu_ul', 'mu_uh']
bottomColumns = mus + ['Jl', 'Jh', 'v', 'u']
arrays = [topColumns, bottomColumns]
#tuples = list(zip(*arrays))
columns = pd.MultiIndex.from_product(arrays)
df = pd.DataFrame(columns=columns, index=tIndex)
In[6]: df.loc[32, 'homogenous']['v'] = 1
In[8]: df.loc[32, 'homogenous']['v']
Out[8]: nan
The case of a multi-index inside .loc[] is trivial and covered extensively in the documentation. However, how do I work with a view on a multi-columned data frame?
You need to pass a tuple to represent the different levels:
In [125]:
df.loc[32, ('homogenous','v')] = 1
df
Out[125]:
homogenous heterogenous \
mu_el mu_eh mu_ul mu_uh Jl Jh v u mu_el mu_eh mu_ul
32 NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN
34 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
134 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
234 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
334 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
434 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mu_uh Jl Jh v u
32 NaN NaN NaN NaN NaN
34 NaN NaN NaN NaN NaN
134 NaN NaN NaN NaN NaN
234 NaN NaN NaN NaN NaN
334 NaN NaN NaN NaN NaN
434 NaN NaN NaN NaN NaN
It looks like you're doing chained indexing, which is why your assignment does not persist.
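The difference in one place (a sketch using the frame above):

df.loc[32, 'homogenous']['v'] = 1    # chained: two indexing calls, the assignment may hit a temporary copy
df.loc[32, ('homogenous', 'v')] = 1  # single .loc call: writes to the frame itself and persists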

Python reads time stamp of datafile as NAN

I have a file with 69 columns, and I want to plot column 1 vs column 0 in Python. Column 0 is a timestamp, defined as x in the code below, but it is read as nan all the way through. Can anyone help me read it as a timestamp and convert it into a real time value?
Output of my original program is below (the long array dumps are truncated here):
(386L, 69L)
[  1.02063350e+01   1.01029230e+01   1.01483550e+01 ...,  -7.00000000e-03
   9.92500000e-01]
[ 10.206335  10.102923  10.148355  10.132251  10.165206  10.166675
  10.032804  10.103769  10.159484  10.191772]
[ nan  nan  nan ...,  nan  nan  nan]
2
I am not sure why the datetime is shown as nan when I try to extract that information. Please help me understand what I did wrong and why I am seeing nan.
Below is the truncated data:
9/30/2014 14:13 10.206335
9/30/2014 14:13 10.102923
9/30/2014 14:13 10.148355
9/30/2014 14:13 10.132251
9/30/2014 14:13 10.165206
9/30/2014 14:13 10.166675
9/30/2014 14:13 10.032804
from __future__ import division  # avoids problems with integer division
import numpy as np  # many numerical routines, like vector/matrix multiplication, FFT
import pylab as p
import scipy as sp
import matplotlib.pyplot as plt
import datetime as dt

data = sp.genfromtxt("C:\\users\\mshah\\desktop\\SN 32014-01 manual Stepdown 10 5 4 1 0 Mode 1.TXT", delimiter="\t")
#print(data[10,1])
print(data.shape)
print(data[:,1])
x = data[:,0]
y = data[:,1]
#z = dt.datetime.strftime(x,"%Y/%m/%d %H:%M")
print(y[:10])
print(x)
nansum = sp.sum(sp.isnan(y))
print(nansum)
x = x[~sp.isnan(y)]
y = y[~sp.isnan(y)]
plt.scatter(x, y)
plt.title("Step Test Process")
plt.xlabel("time")
plt.ylabel("PPMV")
plt.autoscale(tight=True)
plt.grid()
plt.show()
Try adding this after your import statements:
from matplotlib.dates import date2num, MinuteLocator, DateFormatter

def datetime_converter(date_string):
    return date2num(dt.datetime.strptime(date_string, '%m/%d/%Y %H:%M'))
Then modify your call to genfromtxt to use the 'converters' argument,
data = sp.genfromtxt("your_data_file.txt",
                     delimiter="\t",
                     converters={0: datetime_converter})
The one issue here is that the data as supplied in this webpage is delimited by spaces, not tabs. As long as the date and time columns are separated by spaces and the data (third column) is separated by a tab (you use the tab delimiter so I assume your data file has tabs in it somewhere) this will work.
If the date and time columns are separated by the same delimiter as your other columns you could parse them separately and combine them after the fact, e.g.:
def date_converter(date_string):
    return date2num(dt.datetime.strptime(date_string, '%m/%d/%Y'))

def time_converter(time_string):
    h, m = time_string.split(':')
    return (int(h) + int(m) / 60.) / 24

data = sp.genfromtxt("txtfile.txt",
                     delimiter=" ",
                     converters={0: date_converter,
                                 1: time_converter})
x = data[:,0] + data[:,1]
y = data[:,2]
Plotting with matplotlib numeric dates
Before your ax.scatter call, create an axes object that you can work with,
ax = plt.axes()
ax.scatter(x,y)
Then at the end of your script, you can format the x-axis of this axes using DateLocator and DateFormatter objects (see the matplotlib.dates import above),
ax.xaxis.set_major_locator(MinuteLocator())
ax.xaxis.set_major_formatter(DateFormatter('%H:%M'))
Here I have used MinuteLocator, but matplotlib.dates has several more locators that can be used.
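For example, if your timestamps span hours rather than minutes, something like this may fit better (a sketch; HourLocator is one of the other locators in matplotlib.dates):

from matplotlib.dates import HourLocator
ax.xaxis.set_major_locator(HourLocator())
ax.xaxis.set_major_formatter(DateFormatter('%m/%d %H:%M'))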
Hopefully that's more helpful.
To plot the first two tab-separated columns where the first column is a date, using pandas:
from datetime import datetime
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('input.txt',
                 sep='\t', parse_dates=[0], header=None, index_col=0, usecols=[0,1],
                 date_parser=lambda s: datetime.strptime(s, '%m/%d/%Y %H:%M %S.%f'))
df.plot()
plt.show()
