Why doesn't Pandas allow multiple index setting? - python

Hello, I am trying to set a multi-index on my office computer:
data.set_index(['POM', 'DTM'], inplace=True)
but I get the following error:
Categorical levels must be unique
At home I don't get the error. Both machines run pandas version 0.13.1.
Here is some sample data:
POM DTM RNF WET HMD TMP DEW INF
0 QuintaVilar 2011-11-01 00:00:00 0 0 0 0 0 0
1 QuintaVilar 2011-11-01 00:15:00 0 0 0 0 0 0
2 QuintaVilar 2011-11-01 00:30:00 0 0 0 0 0 0
3 QuintaVilar 2011-11-01 00:45:00 0 0 0 0 0 0
4 QuintaVilar 2011-11-01 01:00:00 0 0 0 0 0 0
5 QuintaVilar 2011-11-01 01:15:00 0 0 0 0 0 0
6 QuintaVilar 2011-11-01 01:30:00 0 0 0 0 0 0
Could you help me?
Thank you

It shouldn't be failing. But how about just creating a MultiIndex?
In [52]:
print df
POM DTM RNF WET HMD TMP DEW INF
0 QuintaVilar 2011-11-01 00:00:00 0 0 0 0 0 0
1 QuintaVilar 2011-11-01 00:15:00 0 0 0 0 0 0
2 QuintaVilar 2011-11-01 00:30:00 0 0 0 0 0 0
3 QuintaVilar 2011-11-01 00:45:00 0 0 0 0 0 0
4 QuintaVilar 2011-11-01 01:00:00 0 0 0 0 0 0
5 QuintaVilar 2011-11-01 01:15:00 0 0 0 0 0 0
6 QuintaVilar 2011-11-01 01:30:00 0 0 0 0 0 0
[7 rows x 8 columns]
In [53]:
idx=pd.MultiIndex.from_arrays(df[['POM','DTM']].values.T)
In [54]:
df.index=idx
In [56]:
print df
POM DTM RNF WET \
QuintaVilar 2011-11-01 00:00:00 QuintaVilar 2011-11-01 00:00:00 0 0
2011-11-01 00:15:00 QuintaVilar 2011-11-01 00:15:00 0 0
2011-11-01 00:30:00 QuintaVilar 2011-11-01 00:30:00 0 0
2011-11-01 00:45:00 QuintaVilar 2011-11-01 00:45:00 0 0
2011-11-01 01:00:00 QuintaVilar 2011-11-01 01:00:00 0 0
2011-11-01 01:15:00 QuintaVilar 2011-11-01 01:15:00 0 0
2011-11-01 01:30:00 QuintaVilar 2011-11-01 01:30:00 0 0
HMD TMP DEW INF
QuintaVilar 2011-11-01 00:00:00 0 0 0 0
2011-11-01 00:15:00 0 0 0 0
2011-11-01 00:30:00 0 0 0 0
2011-11-01 00:45:00 0 0 0 0
2011-11-01 01:00:00 0 0 0 0
2011-11-01 01:15:00 0 0 0 0
2011-11-01 01:30:00 0 0 0 0
[7 rows x 8 columns]
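For reference (not from the original thread): on current pandas versions the set_index call from the question runs without error. A minimal sketch, reconstructing a few rows of the sample data above:
import pandas as pd

# rebuild a few rows of the question's sample data (all measurements are 0)
df = pd.DataFrame({
    'POM': ['QuintaVilar'] * 3,
    'DTM': pd.date_range('2011-11-01', periods=3, freq='15min'),
    'RNF': 0, 'WET': 0, 'HMD': 0, 'TMP': 0, 'DEW': 0, 'INF': 0,
})

df.set_index(['POM', 'DTM'], inplace=True)  # the call from the question
print(df.index.nlevels)  # 2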

Related

Splitting a record into 12 months based on the date in pandas dataframe

I have the data in the below format stored in a pandas dataframe
PolicyNumber InceptionDate
1 2017-12-28 00:00:00.0
I want to split this single record into 12 records based on the inception date. For example:
1 2017-12-28 00:00:00.0
1 2018-1-28 00:00:00.0
1 2018-2-28 00:00:00.0
1 2018-3-28 00:00:00.0
.
.
1 2018-11-28 00:00:00.0
Is this possible?
You can use pd.date_range to generate a range of dates, then explode the column:
df['InceptionDate'] = pd.to_datetime(df['InceptionDate'])
# build 12 month-start dates, then shift each by the original day-of-month
df = (df.assign(InceptionDate=df['InceptionDate'].apply(
          lambda date: pd.date_range(start=date, periods=12, freq='MS')
                       + pd.DateOffset(days=date.day - 1)))
        .explode('InceptionDate'))
print(df)
PolicyNumber InceptionDate
0 1 2018-01-28
0 1 2018-02-28
0 1 2018-03-28
0 1 2018-04-28
0 1 2018-05-28
0 1 2018-06-28
0 1 2018-07-28
0 1 2018-08-28
0 1 2018-09-28
0 1 2018-10-28
0 1 2018-11-28
0 1 2018-12-28
To convert it back from datetime to your original string format:
df['InceptionDate'] = df['InceptionDate'].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
PolicyNumber InceptionDate
0 1 2018-01-28 00:00:00.000000
0 1 2018-02-28 00:00:00.000000
0 1 2018-03-28 00:00:00.000000
0 1 2018-04-28 00:00:00.000000
0 1 2018-05-28 00:00:00.000000
0 1 2018-06-28 00:00:00.000000
0 1 2018-07-28 00:00:00.000000
0 1 2018-08-28 00:00:00.000000
0 1 2018-09-28 00:00:00.000000
0 1 2018-10-28 00:00:00.000000
0 1 2018-11-28 00:00:00.000000
0 1 2018-12-28 00:00:00.000000
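If you want the output to start from the original inception date (2017-12-28 through 2018-11-28, as the question sketches), a variant that offsets the original date by whole months. A sketch, not from the answer above:
df['InceptionDate'] = pd.to_datetime(df['InceptionDate'])
# offset the original date by 0..11 whole months, keeping the day (the 28th)
df = (df.assign(InceptionDate=df['InceptionDate'].apply(
          lambda d: [d + pd.DateOffset(months=n) for n in range(12)]))
        .explode('InceptionDate'))
pd.DateOffset(months=n) keeps the day-of-month where possible and clamps at month end (e.g. Jan 31 + 1 month gives Feb 28).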

Pandas: How to preserve rows with index and empty values in pivot_table for appending to dataframe

I have been pulling out my hair (if I'd had any left) getting the following to work for me:
import pandas as pd
import numpy as np
d1 = {'TS': ['2021-07-17 00:05:00', '2021-07-17 00:05:00', '2021-07-17 00:10:00', '2021-07-17 00:15:00', '2021-07-17 00:20:00'],
      'CM': ['C1', 'C1', 'C2', 'C3', 'C4'],
      'ST': ['S1', 'S1', 'S2', 'S2', 'S3']}
d2 = {'TS': ['2021-07-18 00:05:00', '2021-07-18 00:10:00', '2021-07-18 00:16:00', '2021-07-18 00:21:00', '2021-07-18 00:27:00'],
      'CM': [np.nan, np.nan, np.nan, np.nan, np.nan],
      'ST': [np.nan, np.nan, np.nan, np.nan, np.nan]}
dtot = pd.DataFrame()
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
d1p = pd.pivot_table(df1, index=['TS'], values=['CM'], columns=['ST'], aggfunc=len, fill_value=0, dropna=False)
d2p = pd.pivot_table(df2, index=['TS'], values=['CM'], columns=['ST'], aggfunc=len, fill_value=0, dropna=False)
dtot = dtot.append(d1p)
dtot = dtot.append(d2p)
This will result in:
CM
ST S1 S2 S3
TS
2021-07-17 00:05:00 2 0 0
2021-07-17 00:10:00 0 1 0
2021-07-17 00:15:00 0 1 0
2021-07-17 00:20:00 0 0 1
So rows of df2 are all ignored (d2p will be an empty dataframe).
What I would like to have is:
CM
ST S1 S2 S3
TS
2021-07-17 00:05:00 2 0 0
2021-07-17 00:10:00 0 1 0
2021-07-17 00:15:00 0 1 0
2021-07-17 00:20:00 0 0 1
2021-07-18 00:05:00 0 0 0
2021-07-18 00:10:00 0 0 0
2021-07-18 00:16:00 0 0 0
2021-07-18 00:21:00 0 0 0
2021-07-18 00:27:00 0 0 0
So I need to preserve the timestamps and get "zero" values in d2p.
The issue, of course, is that the possible values (the occurrences S1, S2, S3) are entirely absent from df2, so there is nothing to categorize in the pivot d2p.
Still, how can I achieve the end result?
Your d2p is empty due to the NaN values.
You can fix that by adding a fillna before the pivot:
df2=pd.DataFrame(d2).fillna(0)
The pivot then picks up the filled-in 0 as an extra ST category, so skip the last column of dtot, fillna again to remove the NaNs, and convert to int if you do not want the 0.0 values.
dtot.iloc[:,:-1].fillna(0).astype(int)
Output:
CM
ST S1 S2 S3
TS
2021-07-17 00:05:00 2 0 0
2021-07-17 00:10:00 0 1 0
2021-07-17 00:15:00 0 1 0
2021-07-17 00:20:00 0 0 1
2021-07-18 00:05:00 0 0 0
2021-07-18 00:10:00 0 0 0
2021-07-18 00:16:00 0 0 0
2021-07-18 00:21:00 0 0 0
2021-07-18 00:27:00 0 0 0
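Putting those steps together in one place (using pd.concat, since DataFrame.append was deprecated and later removed), a sketch:
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2).fillna(0)   # keep the all-NaN rows alive through the pivot

kwargs = dict(index=['TS'], values=['CM'], columns=['ST'],
              aggfunc=len, fill_value=0, dropna=False)
d1p = pd.pivot_table(df1, **kwargs)
d2p = pd.pivot_table(df2, **kwargs)

# fillna(0) introduces a spurious ('CM', 0) column; it lands last, so drop it
dtot = pd.concat([d1p, d2p]).iloc[:, :-1].fillna(0).astype(int)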
I would recommend only running pivot_table on the dataframe that has actual observed values, then building a combined index of timestamps and .reindex-ing with it to get your desired output.
This has the advantage of being less computationally intensive (there is no point in running a pivot_table call on a table we know has no valid observations).
timestamps = pd.concat([df1["TS"], df2["TS"]]).unique()  # Series.append was removed in pandas 2.0
new_df = (
    df1.pivot_table(
        index=['TS'],
        values=['CM'],
        columns=['ST'],
        aggfunc=len,
        fill_value=0,
        dropna=False,
    )
    .reindex(timestamps, fill_value=0)
)
print(new_df)
CM
ST S1 S2 S3
TS
2021-07-17 00:05:00 2 0 0
2021-07-17 00:10:00 0 1 0
2021-07-17 00:15:00 0 1 0
2021-07-17 00:20:00 0 0 1
2021-07-18 00:05:00 0 0 0
2021-07-18 00:10:00 0 0 0
2021-07-18 00:16:00 0 0 0
2021-07-18 00:21:00 0 0 0
2021-07-18 00:27:00 0 0 0

Identify continuous sequences or groups of boolean data in Pandas

I have a boolean time-based data set, as per the example below. I am interested in highlighting continuous sequences of more than three 1's in the data set, and I would like to capture this in a new column called [Continuous_out_x]. Is there an efficient operation to do this?
I generated test data in this way:
df = pd.DataFrame(
    zip(list(np.random.randint(2, size=20)), list(np.random.randint(2, size=20))),
    columns=['tag1', 'tag2'],
    index=pd.date_range('2020-01-01', periods=20, freq='s'))
The output I got was the following:
print (df):
tag1 tag2
2020-01-01 00:00:00 0 0
2020-01-01 00:00:01 1 0
2020-01-01 00:00:02 1 0
2020-01-01 00:00:03 1 1
2020-01-01 00:00:04 1 0
2020-01-01 00:00:05 1 0
2020-01-01 00:00:06 1 1
2020-01-01 00:00:07 0 1
2020-01-01 00:00:08 0 0
2020-01-01 00:00:09 1 1
2020-01-01 00:00:10 1 0
2020-01-01 00:00:11 0 1
2020-01-01 00:00:12 1 0
2020-01-01 00:00:13 0 1
2020-01-01 00:00:14 0 1
2020-01-01 00:00:15 0 1
2020-01-01 00:00:16 1 1
2020-01-01 00:00:17 0 0
2020-01-01 00:00:18 0 1
2020-01-01 00:00:19 1 0
A solution to this example set (above) would look like this:
print(df):
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
You can do this as:
create a series that distinguishes each streak (group)
assign a bool to groups of 1s spanning more than three rows
code
# ok to loop over a few columns, still very performant
for col in ["tag1", "tag2"]:
    col_no = col[-1]
    # label each streak: a new group starts whenever the value changes
    df[f"group_{col}"] = np.cumsum(df[col].shift(1) != df[col])
    # streak length > 3, restricted to streaks of 1s (otherwise long runs of 0s would match too)
    df[f"{col}_counts"] = (df.groupby(f"group_{col}")[col].transform("count") > 3) & df[col].astype(bool)
    df[f"Continuous_out_{col_no}"] = df[f"{col}_counts"].astype(int)
    df = df.drop(columns=[f"group_{col}", f"{col}_counts"])
output
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
You can identify the regions of contiguous True/False and check if they are greater than your cutoff.
# snapshot the columns first, since the loop adds new ones to df
for colname, series in list(df.items()):
    new = f'Continuous_{colname}'
    df[new] = series.diff().ne(0).cumsum()                # label contiguous regions
    df[new] = series.groupby(df[new]).transform('size')   # get size of region
    df[new] = df[new].gt(3) * series                      # mark with cutoff
Output
tag1 tag2 Continuous_tag1 Continuous_tag2
index
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
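For what it's worth, both answers rest on the same run-labelling idiom, diff().ne(0).cumsum(). A minimal demonstration on a toy series (not from either answer):
s = pd.Series([0, 1, 1, 1, 1, 0, 0, 1])
labels = s.diff().ne(0).cumsum()              # 1 2 2 2 2 3 3 4: one label per run
sizes = s.groupby(labels).transform('size')   # 1 4 4 4 4 2 2 1: length of each run
print((sizes.gt(3) * s).tolist())             # [0, 1, 1, 1, 1, 0, 0, 0]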

Pandas: fill one column with count of # of obs between occurrences in a 2nd column

Say I have the following DataFrame which has a 0/1 entry depending on whether something happened/didn't happen within a certain month.
Y = [0,0,1,1,0,0,0,0,1,1,1]
X = pd.date_range(start = "2010", freq = "MS", periods = len(Y))
df = pd.DataFrame({'R': Y},index = X)
R
2010-01-01 0
2010-02-01 0
2010-03-01 1
2010-04-01 1
2010-05-01 0
2010-06-01 0
2010-07-01 0
2010-08-01 0
2010-09-01 1
2010-10-01 1
2010-11-01 1
What I want is to create a 2nd column that lists the # of months until the next occurrence of a 1.
That is, I need:
R F
2010-01-01 0 2
2010-02-01 0 1
2010-03-01 1 0
2010-04-01 1 0
2010-05-01 0 4
2010-06-01 0 3
2010-07-01 0 2
2010-08-01 0 1
2010-09-01 1 0
2010-10-01 1 0
2010-11-01 1 0
What I've tried: I haven't gotten far, but I'm able to fill in the first stretch:
A = list(df.index)
T = df[df['R']==1]
a = df.index[0]
b = T.index[0]
c = A.index(b) - A.index(a)
df.loc[a:b, 'F'] = np.linspace(c,0,c+1)
R F
2010-01-01 0 2.0
2010-02-01 0 1.0
2010-03-01 1 0.0
2010-04-01 1 NaN
2010-05-01 0 NaN
2010-06-01 0 NaN
2010-07-01 0 NaN
2010-08-01 0 NaN
2010-09-01 1 NaN
2010-10-01 1 NaN
2010-11-01 1 NaN
EDIT: It probably would have been better to provide an example that spanned multiple years.
Y = [0,0,1,1,0,0,0,0,1,1,1,0,0,1,1,1,0,1,1,1]
X = pd.date_range(start = "2010", freq = "MS", periods = len(Y))
df = pd.DataFrame({'R': Y},index = X)
Here is my way:
s = df.R.cumsum()
df.loc[df.R == 0, 'F'] = s.groupby(s).cumcount(ascending=False) + 1
df['F'] = df['F'].fillna(0)  # assigning back avoids chained-assignment issues
df
Out[12]:
R F
2010-01-01 0 2.0
2010-02-01 0 1.0
2010-03-01 1 0.0
2010-04-01 1 0.0
2010-05-01 0 4.0
2010-06-01 0 3.0
2010-07-01 0 2.0
2010-08-01 0 1.0
2010-09-01 1 0.0
2010-10-01 1 0.0
2010-11-01 1 0.0
Create a series containing your dates, mask this series when your R series is not equal to 1, bfill, and subtract!
u = df.index.to_series()
ii = u.where(df.R.eq(1)).bfill()
12 * (ii.dt.year - u.dt.year) + (ii.dt.month - u.dt.month)
2010-01-01 2
2010-02-01 1
2010-03-01 0
2010-04-01 0
2010-05-01 4
2010-06-01 3
2010-07-01 2
2010-08-01 1
2010-09-01 0
2010-10-01 0
2010-11-01 0
Freq: MS, dtype: int64
Here is a way that worked for me; not as elegant as @user3483203's, but it does the job.
df = df.reset_index()  # the output below shows an integer index, so reset it first
df['F'] = 0
for i in df.index:
    j = i
    # count rows until the next 1 (assumes the series ends in a 1)
    while df.loc[j, 'R'] == 0:
        df.loc[i, 'F'] = df.loc[i, 'F'] + 1
        j = j + 1
df
################
Out[39]:
index R F
0 2010-01-01 0 2
1 2010-02-01 0 1
2 2010-03-01 1 0
3 2010-04-01 1 0
4 2010-05-01 0 4
5 2010-06-01 0 3
6 2010-07-01 0 2
7 2010-08-01 0 1
8 2010-09-01 1 0
9 2010-10-01 1 0
10 2010-11-01 1 0
My take
s = (df.R.diff().ne(0) | df.R.eq(1)).cumsum()
s.groupby(s).transform(lambda g: np.arange(len(g), 0, -1) if len(g) > 1 else 0)
2010-01-01 2
2010-02-01 1
2010-03-01 0
2010-04-01 0
2010-05-01 4
2010-06-01 3
2010-07-01 2
2010-08-01 1
2010-09-01 0
2010-10-01 0
2010-11-01 0
Freq: MS, Name: R, dtype: int64
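A frequency-agnostic variant of the bfill idea works on row positions instead of calendar months. A sketch (like the answers above, it assumes the series ends in a 1; otherwise the trailing positions stay NaN):
import numpy as np

pos = pd.Series(np.arange(len(df)), index=df.index)
next_one = pos.where(df['R'].eq(1)).bfill()   # position of the next 1
df['F'] = (next_one - pos).astype(int)        # rows until that 1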

Create a Series from a Pandas DataFrame by choosing an element from different columns on each row

My goal is to create a Series from a Pandas DataFrame by choosing an element from different columns on each row.
For example, I have the following DataFrame:
In [171]: pred[:10]
Out[171]:
0 1 2
Timestamp
2010-12-21 00:00:00 0 0 1
2010-12-20 00:00:00 1 1 1
2010-12-17 00:00:00 1 1 1
2010-12-16 00:00:00 0 0 1
2010-12-15 00:00:00 1 1 1
2010-12-14 00:00:00 1 1 1
2010-12-13 00:00:00 0 0 1
2010-12-10 00:00:00 1 1 1
2010-12-09 00:00:00 1 1 1
2010-12-08 00:00:00 0 0 1
And, I have the following series:
In [172]: useProb[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 1
2010-12-20 00:00:00 2
2010-12-17 00:00:00 1
2010-12-16 00:00:00 2
2010-12-15 00:00:00 2
2010-12-14 00:00:00 2
2010-12-13 00:00:00 0
2010-12-10 00:00:00 2
2010-12-09 00:00:00 2
2010-12-08 00:00:00 0
I would like to create a new series, usePred, that takes the values from pred, based on the column information in useProb to return the following:
In [172]: usePred[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
This last step is where I fail. I've tried things like:
usePred = pd.DataFrame(index = pred.index)
for row in usePred:
    usePred['PREDS'].ix[row] = pred.ix[row, useProb[row]]
And, I've tried:
usePred['PREDS'] = pred.iloc[:,useProb]
I've googled and searched on Stack Overflow for hours, but can't seem to solve the problem.
One solution could be to use get_dummies (which should be more efficient than apply):
In [11]: (pd.get_dummies(useProb) * pred).sum(axis=1)
Out[11]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: float64
You could use an apply with a couple of locs:
In [21]: pred.apply(lambda row: row.loc[useProb.loc[row.name]], axis=1)
Out[21]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: int64
The trick is that you have access to the row's index label via the name property.
Here is another way to do it using DataFrame.lookup:
pred.lookup(row_labels=pred.index,
            col_labels=pred.columns[useProb['0']])
It seems to be exactly what you need, except that care must be taken to supply values which are labels. For example, if pred.columns are strings, and useProb['0'] values are integers, then we could use
pred.columns[useProb['0']]
so that the values passed to the col_labels parameter are proper label values.
For example,
import io
import pandas as pd

content = io.StringIO('''\
Timestamp            0  1  2
2010-12-21 00:00:00  0  0  1
2010-12-20 00:00:00  1  1  1
2010-12-17 00:00:00  1  1  1
2010-12-16 00:00:00  0  0  1
2010-12-15 00:00:00  1  1  1
2010-12-14 00:00:00  1  1  1
2010-12-13 00:00:00  0  0  1
2010-12-10 00:00:00  1  1  1
2010-12-09 00:00:00  1  1  1
2010-12-08 00:00:00  0  0  1''')
pred = pd.read_table(content, sep=r'\s{2,}', engine='python',
                     parse_dates=True, index_col=[0])

content = io.StringIO('''\
Timestamp            0
2010-12-21 00:00:00  1
2010-12-20 00:00:00  2
2010-12-17 00:00:00  1
2010-12-16 00:00:00  2
2010-12-15 00:00:00  2
2010-12-14 00:00:00  2
2010-12-13 00:00:00  0
2010-12-10 00:00:00  2
2010-12-09 00:00:00  2
2010-12-08 00:00:00  0''')
useProb = pd.read_table(content, sep=r'\s{2,}', engine='python',
                        parse_dates=True, index_col=[0])

print(pd.Series(pred.lookup(row_labels=pred.index,
                            col_labels=pred.columns[useProb['0']]),
                index=pred.index))
yields
Timestamp
2010-12-21 0
2010-12-20 1
2010-12-17 1
2010-12-16 1
2010-12-15 1
2010-12-14 1
2010-12-13 0
2010-12-10 1
2010-12-09 1
2010-12-08 0
dtype: int64
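Note that DataFrame.lookup was deprecated in pandas 1.2 and has since been removed. The usual replacement is NumPy integer indexing on the underlying array; a sketch (the names rows, cols, and usePred are illustrative):
import numpy as np

rows = np.arange(len(pred))
cols = pred.columns.get_indexer(pred.columns[useProb['0']])  # labels -> positions
usePred = pd.Series(pred.to_numpy()[rows, cols], index=pred.index)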
