Pandas DataFrame - desired index has duplicate values - python

This is my first time trying Pandas. I think I have a reasonable use case, but I am stumbling. I want to load a tab-delimited file into a Pandas DataFrame, then group it by Symbol and plot it with the x-axis indexed by the TimeStamp column. Here is a subset of the data:
Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
Note two things about the TimeStamp column:
it has duplicate values and
the intervals are irregular.
I thought I could do something like this...
from pandas import *
import pylab as plt
df = read_csv('data.txt',index_col=5)
df.sort(ascending=False)
df.plot()
plt.show()
But the read_csv method raises an exception "Tried columns 1-X as index but found duplicates". Is there an option that will allow me to specify an index column with duplicate values?
I would also be interested in aligning my irregular timestamp intervals to one-second resolution. I would still want to plot multiple events for a given second, but maybe I could introduce a unique index and then align my prices to it?
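For reference, a minimal sketch with current pandas (an assumption, not part of the original question): the duplicate-index error can be sidestepped by reading the file without an index column, parsing the timestamps, and setting the index afterwards, since a DataFrame index may contain duplicate values.
import pandas as pd

df = pd.read_csv('data.txt')                      # use sep='\t' if the file is really tab-delimited
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], format='%I:%M:%S %p')
df = df.set_index('TimeStamp').sort_index()       # duplicate index values are allowed here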

I created several issues just now to address some features / conveniences that I think would be nice to have: GH-856, GH-857, GH-858
We're currently working on a revamp of the time series capabilities and doing alignment to secondly resolution is possible now (though not with duplicates, so would need to write some functions for that). I also want to support duplicate timestamps in a better way. However, this is really panel (3D) data, so one way that you might alter things is the following:
In [29]: df.pivot('Symbol', 'TimeStamp').stack()
Out[29]:
M1 M2 Price Volume
Symbol TimeStamp
FUEL 9:58:40 AM 6 1.05 3.8544 100116
9:58:47 AM 7 1.07 3.8599 102116
9:59:09 AM 8 1.11 3.9099 105265
9:59:11 AM 9 1.15 3.9490 109674
GBR 9:57:52 AM 2 0.34 3.7500 47521
9:58:20 AM 3 0.45 3.8000 63211
9:58:24 AM 4 0.46 3.8300 64251
MPET 9:57:52 AM 3 0.26 1.4200 44600
ORBC 9:59:02 AM 2 0.22 3.4000 10509
SUNH 9:59:09 AM 6 0.09 4.3700 24394
TBET 9:59:05 AM 2 8.03 2.1800 1121629
9:59:14 AM 3 8.05 2.1900 1124179
XRA 9:58:08 AM 3 0.12 3.6167 42310
note that this created a MultiIndex. Another way I could have gotten this:
In [32]: df.set_index(['Symbol', 'TimeStamp'])
Out[32]:
Price M1 M2 Volume
Symbol TimeStamp
TBET 9:59:14 AM 2.1900 3 8.05 1124179
FUEL 9:59:11 AM 3.9490 9 1.15 109674
SUNH 9:59:09 AM 4.3700 6 0.09 24394
FUEL 9:59:09 AM 3.9099 8 1.11 105265
TBET 9:59:05 AM 2.1800 2 8.03 1121629
ORBC 9:59:02 AM 3.4000 2 0.22 10509
FUEL 9:58:47 AM 3.8599 7 1.07 102116
9:58:40 AM 3.8544 6 1.05 100116
GBR 9:58:24 AM 3.8300 4 0.46 64251
9:58:20 AM 3.8000 3 0.45 63211
XRA 9:58:08 AM 3.6167 3 0.12 42310
GBR 9:57:52 AM 3.7500 2 0.34 47521
MPET 9:57:52 AM 1.4200 3 0.26 44600
In [33]: df.set_index(['Symbol', 'TimeStamp']).sortlevel(0)
Out[33]:
Price M1 M2 Volume
Symbol TimeStamp
FUEL 9:58:40 AM 3.8544 6 1.05 100116
9:58:47 AM 3.8599 7 1.07 102116
9:59:09 AM 3.9099 8 1.11 105265
9:59:11 AM 3.9490 9 1.15 109674
GBR 9:57:52 AM 3.7500 2 0.34 47521
9:58:20 AM 3.8000 3 0.45 63211
9:58:24 AM 3.8300 4 0.46 64251
MPET 9:57:52 AM 1.4200 3 0.26 44600
ORBC 9:59:02 AM 3.4000 2 0.22 10509
SUNH 9:59:09 AM 4.3700 6 0.09 24394
TBET 9:59:05 AM 2.1800 2 8.03 1121629
9:59:14 AM 2.1900 3 8.05 1124179
XRA 9:58:08 AM 3.6167 3 0.12 42310
you can get this data in a true panel format like so:
In [35]: df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
Out[35]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 11 (major) x 7 (minor)
Items: Price to Volume
Major axis: 9:57:52 AM to 9:59:14 AM
Minor axis: FUEL to XRA
In [36]: panel = df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
In [37]: panel['Price']
Out[37]:
Symbol FUEL GBR MPET ORBC SUNH TBET XRA
TimeStamp
9:57:52 AM NaN 3.75 1.42 NaN NaN NaN NaN
9:58:08 AM NaN NaN NaN NaN NaN NaN 3.6167
9:58:20 AM NaN 3.80 NaN NaN NaN NaN NaN
9:58:24 AM NaN 3.83 NaN NaN NaN NaN NaN
9:58:40 AM 3.8544 NaN NaN NaN NaN NaN NaN
9:58:47 AM 3.8599 NaN NaN NaN NaN NaN NaN
9:59:02 AM NaN NaN NaN 3.4 NaN NaN NaN
9:59:05 AM NaN NaN NaN NaN NaN 2.18 NaN
9:59:09 AM 3.9099 NaN NaN NaN 4.37 NaN NaN
9:59:11 AM 3.9490 NaN NaN NaN NaN NaN NaN
9:59:14 AM NaN NaN NaN NaN NaN 2.19 NaN
you can then generate some plots from that data.
note here that the timestamps are still strings -- I guess they could be converted to Python datetime.time objects and things might be a bit easier to work with. I don't have many plans to provide a lot of support for raw times vs. timestamps (date + time), but if enough people need it I suppose I can be convinced :)
If you have multiple observations on a second for a single symbol then some of the above methods will not work. But I want to build in better support for that in upcoming releases of pandas, so knowing your use cases will be helpful to me-- consider joining the mailing list (pystatsmodels)
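As a concrete illustration of those two notes, here is a hedged sketch using the current pandas API rather than the 2012 one above; the file and column names are taken from the question. It parses the TimeStamp strings, keeps the last tick when several fall in the same second for one symbol, and pivots prices for plotting.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.txt')
# parse the time-of-day strings; the dummy date attached by to_datetime is irrelevant here
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], format='%I:%M:%S %p')

# collapse duplicate ticks within the same second per symbol so (Symbol, TimeStamp) is unique again
ticks = (df.sort_values('TimeStamp')
           .groupby(['Symbol', 'TimeStamp'], as_index=False)
           .last())

# one price column per symbol, indexed by time; NaN where a symbol did not trade that second
prices = ticks.pivot(index='TimeStamp', columns='Symbol', values='Price')
prices.plot(marker='o')
plt.show()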

Related

Incorrect exported data - How to split values in a specific column and shift them to other columns in a pandas dataframe?

I exported some data, and the export did not work completely correctly.
I read my data into a pandas dataframe and it now looks like this:
time A B C D E F
1 NaN nullnull0.54 0.74 0.89 NaN NaN
2 NaN nullnull0.01 3.32 1.19 NaN NaN
3 NaN nullnull1.89 0.65 4.50 NaN NaN
4 NaN nullnull4.64 2.87 2.22 NaN NaN
5 0.52 1.43 3.56 5.65 0.06 1.11
6 3.51 0.89 0.96 1.10 2.08 4.29
7 0.11 10.20 3.36 2.15 0.70 1.99
time is my first column, and then I have six columns A to F.
Column A is correct; I did not get any data here. The problem begins in column B. For B and C I could not actually extract any values other than these null values. But instead of writing these null values into columns B and C of my csv file, the export writes nullnull0.54 into column B, puts the remaining extracted values into columns C and D, and fills E and F with NaN. That is, for all rows where this nullnull pattern is observed, the values currently in C should be in E, the values in D should be in F, and the numeric part of B should go into D. So I need to write code which splits the value in B into three parts (null, null, and the number) and then shifts the numeric values two columns to the right, beginning with the numeric value in B, but only for the rows where this nullnull pattern is observed.
Edit:
The output should look like this:
time A B C D E F
1 NaN null null 0.54 0.74 0.89
2 NaN null null 0.01 3.32 1.19
3 NaN null null 1.89 0.65 4.50
4 NaN null null 4.64 2.87 2.22
5 0.52 1.43 3.56 5.65 0.06 1.11
6 3.51 0.89 0.96 1.10 2.08 4.29
7 0.11 10.20 3.36 2.15 0.70 1.99
I used this code to read the csv-file:
df = pd.read_csv(r'path\to\file.csv',delimiter=';',names=['time','A','B','C','D','E','F'],index_col=False)
It is not due to the code I used to read the file; it is due to the export, which went wrong. The csv file itself already contains these nullnullxyz values in one column.
Firstly, I would suggest fixing the corrupt csv file, or better the root cause of the corruption, before loading it into pandas.
If you really have to do it in pandas, here is a slow-and-dirty fix using .apply():
def fix(row: pd.Series) -> pd.Series:
    """Fix 'nullnull' assuming it occurs only in column B."""
    if str(row["B"]).startswith("nullnull"):
        return pd.Series([row["time"], row["A"], float('nan'), float('nan'),
                          float(row["B"][8:]), row["C"], row["D"]],
                         index=df.columns)
    else:  # no need to fix
        return row

# apply the fix for each row
df2 = df.apply(fix, axis=1)
# column B is in object type originally
df2["B"] = df2["B"].astype(float)
Output
print(df2)
time A B C D E F
0 1.0 NaN NaN NaN 0.54 0.74 0.89
1 2.0 NaN NaN NaN 0.01 3.32 1.19
2 3.0 NaN NaN NaN 1.89 0.65 4.50
3 4.0 NaN NaN NaN 4.64 2.87 2.22
4 5.0 0.52 1.43 3.56 5.65 0.06 1.11
5 6.0 3.51 0.89 0.96 1.10 2.08 4.29
6 7.0 0.11 10.20 3.36 2.15 0.70 1.99
Also verify data types:
print(df2.dtypes)
time float64
A float64
B float64
C float64
D float64
E float64
F float64
dtype: object
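For larger files the row-by-row apply can be slow; a vectorized variant of the same fix is sketched below. This is an assumption-laden sketch (same column names, 'nullnull' always exactly eight characters), not part of the original answer.
num_cols = ['B', 'C', 'D', 'E', 'F']

# rows whose B value starts with the spurious 'nullnull' prefix
mask = df['B'].astype(str).str.startswith('nullnull')

fixed = df.loc[mask, num_cols].copy()
fixed['B'] = fixed['B'].str[len('nullnull'):].astype(float)   # keep only the number
# shift the block two columns to the right: B and C become NaN, D..F receive the values
df.loc[mask, num_cols] = fixed.shift(2, axis=1).values

df[num_cols] = df[num_cols].astype(float)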

What's the most idiomatic way to set all values to NaN except the end of the month?

I'd like to learn the most idiomatic way to set all values of a data frame to NaN except the values corresponding to the last business day of the month. I've worked out the following solution but it feels clunky.
If you're wondering what my original use case is ... I get mixed daily and monthly data into one big data frame. I extract the monthly data, which basically repeats the same value within each month, and I'd like to replace the dull repeated values with an interpolated estimate, e.g. using loess. To that end I need the in-between dates to hold NA values, which the interpolation can then fill.
# get the values corresponding to the last business day of each month
df_eofm = df.resample('BM').last()
# fill the original data frame with NaN's
df[:] = np.nan
# now try to set the last business days to the values we saved
df.update(df_eofm)
print(df)
print(df.dropna())
This produces the expected result:
Col1 Col2 Col3
Date
1963-12-31 57.5 -28 0.89
1964-01-01 NaN NaN NaN
1964-01-02 NaN NaN NaN
1964-01-03 NaN NaN NaN
1964-01-04 NaN NaN NaN
... ... ... ...
2020-03-11 NaN NaN NaN
2020-03-12 NaN NaN NaN
2020-03-13 NaN NaN NaN
2020-03-14 NaN NaN NaN
2020-03-15 NaN NaN NaN
[20530 rows x 3 columns]
Col1 Col2 Col3
Date
1963-12-31 57.5 -28 0.89
1964-01-31 54 106 0.65
1964-02-28 57.1 126 0.68
1964-03-31 57.9 266 0.73
1964-04-30 60.2 144 0.72
... ... ... ...
2019-10-31 47.8 136 0.11
2019-11-29 48.3 128 0.22
2019-12-31 48.1 266 0.37
2020-01-31 47.2 145 -0.08
2020-02-28 50.9 225 -0.45
[675 rows x 3 columns]
You could use is_month_end and index the dataframe with the resultant boolean series:
df[~df.index.is_month_end] = np.nan
For the last business day, using this answer we could do something like:
def is_business_day(date):
    return bool(len(pd.bdate_range(date, date)))

# for each calendar month in the index, find its last business day,
# then blank out every other row
last_bus = (df.index.to_series()
              .loc[lambda s: s.map(is_business_day)]
              .groupby(lambda d: (d.year, d.month))
              .last())
df[~df.index.isin(last_bus)] = np.nan
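An alternative sketch, closer to the question's own resample('BM') idea and not part of the answer above, assuming a daily DatetimeIndex: collect the business-month-end labels and blank out every row whose date is not one of them.
import numpy as np

last_bdays = df.resample('BM').last().index   # 'BM' = business month end
df[~df.index.isin(last_bdays)] = np.nan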

Pandas combine two columns into one and exclude NaN values

I have a 5k x 2 column dataframe called "both".
I want to create a new 5k x 1 DataFrame or column (doesn't matter) by replacing any NaN value in one column with the value of the adjacent column.
ex:
Gains Loss
0 NaN NaN
1 NaN -0.17
2 NaN -0.13
3 NaN -0.75
4 NaN -0.17
5 NaN -0.99
6 1.06 NaN
7 NaN -1.29
8 NaN -0.42
9 0.14 NaN
So, for example, I need to replace the NaNs in the first column in rows 1 through 5 with the values in the same rows of the second column, to get a new df of the following form:
Change
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
How do I tell Python to do this?
You may fill the NaN values with zeroes and then simply add your columns:
both["Change"] = both["Gains"].fillna(0) + both["Loss"].fillna(0)
Then — if you need it — you may return the resulting zeroes back to NaNs:
both["Change"].replace(0, np.nan, inplace=True)
The result:
Gains Loss Change
0 NaN NaN NaN
1 NaN -0.17 -0.17
2 NaN -0.13 -0.13
3 NaN -0.75 -0.75
4 NaN -0.17 -0.17
5 NaN -0.99 -0.99
6 1.06 NaN 1.06
7 NaN -1.29 -1.29
8 NaN -0.42 -0.42
9 0.14 NaN 0.14
Finally, if you want to get rid of your original columns, you may drop them:
both.drop(columns=["Gains", "Loss"], inplace=True)
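A tidier alternative worth noting (a hedged suggestion, not part of the answer above) is Series.combine_first, which takes Gains where it is present and falls back to Loss, so there is no round trip through zeros and genuine zero values survive unchanged:
both["Change"] = both["Gains"].combine_first(both["Loss"])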
There are many ways to achieve this. One is using the loc property:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price1': [np.nan, np.nan, np.nan, np.nan,
                              np.nan, np.nan, 1.06, np.nan, np.nan],
                   'Price2': [np.nan, -0.17, -0.13, -0.75, -0.17,
                              -0.99, np.nan, -1.29, -0.42]})
df.loc[df['Price1'].isnull(), 'Price1'] = df['Price2']
df = df.loc[:6,'Price1']
print(df)
Output:
Price1
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
You can see more complex recipes in the Cookbook
IIUC, we can filter for null values and just sum the columns to make your new dataframe.
cols = ['Gains','Loss']
s = df.isnull().cumsum(axis=1).eq(len(df.columns)).any(axis=1)
# add df[cols].isnull() if you only want to measure the price columns for nulls.
df['prices'] = df[cols].loc[~s].sum(axis=1)
df = df.drop(cols,axis=1)
print(df)
prices
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
7 -1.29
8 -0.42

Calculate mean only when the number of values in each row is higher than a certain number in python pandas

I have a daily time series dataframe with nine columns. Each column represents the measurement from a different method. I want to calculate the daily mean only when there are more than two measurements; otherwise I want to assign NaN. How can I do that with a pandas dataframe?
Suppose my df looks like:
0 1 2 3 4 5 6 7 8
2000-02-25 NaN 0.22 0.54 NaN NaN NaN NaN NaN NaN
2000-02-26 0.57 NaN 0.91 0.21 NaN 0.22 NaN 0.51 NaN
2000-02-27 0.10 0.14 0.09 NaN 0.17 NaN 0.05 NaN NaN
2000-02-28 NaN NaN NaN NaN NaN NaN NaN NaN 0.14
2000-02-29 0.82 NaN 0.75 NaN NaN NaN 0.14 NaN NaN
and I'm expecting mean values like:
0
2000-02-25 NaN
2000-02-26 0.48
2000-02-27 0.11
2000-02-28 NaN
2000-02-29 0.57
Use where to set NaN values by a condition created with DataFrame.count, which counts each row's non-NaN values, compared using Series.gt (>):
s = df.where(df.count(axis=1).gt(2)).mean(axis=1)
#alternative solution with changed order
#s = df.mean(axis=1).where(df.count(axis=1).gt(2))
print (s)
2000-02-25 NaN
2000-02-26 0.484
2000-02-27 0.110
2000-02-28 NaN
2000-02-29 0.570
dtype: float64

How to get indexes of values in a Pandas DataFrame?

I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 15.0 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val are not of the same dimension. I feel the way to overcome this would be to select only those x intervals for which val is not null. But I am not able to do that. I am probably making some very stupid mistakes (I am not a programmer and I am very untalented, so please be patient with me:) ).
Here is the code so far (now that I think of it, I may have introduced some mistakes just by leaving some old pieces of code because I've been messing with it for a while, trying different things):
import pandas as pd
import numpy as np
df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]
for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative = np.diff(V)/np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here, or would I be better off calculating it manually with finite differences? I'm not finding the numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
val1 val2 val3
0 NaN NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN 0.96
5 1 NaN 0.96
6 1 15.2 0.96
7 1 -13.0 0.96
8 1 13.0 0.96
9 1 -10.8 0.96
10 1 10.8 0.96
11 1 -8.6 0.96
12 1 8.6 0.96
13 1 -6.4 0.96
14 1 6.4 3.20
15 1 -4.2 NaN
16 1 4.2 NaN
17 1 -2.0 NaN
18 1 2.0 NaN
19 1 0.2 NaN
20 1 -0.2 NaN
We difference all columns (except the first one), and then apply a lambda function to each column which divides it by the difference in column x.
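If you want the derivative computed only over the rows where each column actually has data, as the question's loop attempts, a hedged per-column sketch is below. The original loop's problem was differencing the boolean mask V rather than the masked values; dropping the NaN rows per column keeps x and the values the same length.
import numpy as np

derivatives = {}
for col in df.columns[1:]:            # every value column; 'x' is column 0
    sub = df[['x', col]].dropna()     # rows where this column has a value
    derivatives[col] = np.diff(sub[col].to_numpy()) / np.diff(sub['x'].to_numpy())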
