I have the following dataframe:
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d ...), the mean value between specific hours. The hours range from 9 to 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, between 13 and 15 (or any period I decide on).
I first tried to convert the column values to datetime format, and then I thought it would be easier to do this:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then the column names came back with a fake year, "1900-01-01 09:00:00"...
Also, the column headers' dtype was object, so I felt a bit lost...
My end goal is to be able to calculate new columns with the mean value for each row, using only the columns that fall inside the defined time period (e.g. 9-11, etc.).
If you need a fixed period, e.g. every 2 hours, use resample:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
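If your pandas version complains about resample(..., axis=1) (the axis argument has been deprecated in newer releases), an equivalent sketch is to transpose, resample along the rows, and transpose back (assuming df.columns were already converted to datetimes as above):
# transpose so the datetimes sit on the row index, resample, then transpose back
df1 = df.T.resample('2H').mean().T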
If you need custom periods, it is possible to use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
And last, take the mean per column level; the reason for the all-NaN columns below is that the column labels are categoricals, so every category appears even if empty:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
EDIT: The same solution as above, but with timedeltas, because the format is HH:MM:SS:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
#missing values because no times exist between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
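If your pandas version no longer accepts the level argument to mean (it was removed in newer releases), grouping the column level explicitly gives the same result; a sketch, using the string labels as above so the empty categorical bins are dropped:
# group identical (string) column labels and average across each group of columns;
# transposing avoids grouping along axis=1 directly
df3 = df.rename(columns=str).T.groupby(level=0).mean().T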
I am going to show you my code and the results after the execution.
First, import the libraries and create the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],
                            [1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00',
                           '09:20:00','09:25:00','09:30:00'])
It would be nice to create a class in order to define what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end

    def __repr__(self):
        return self.initial + ' -- ' + self.end
With the .loc command we can get a sub-dataframe with the columns that I desire:
def get_colMean(df, period):
    # keep only the columns between the period endpoints (inclusive)
    df2 = df.loc[:, period.initial:period.end]
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally, we use .join in order to add our column with the means to our original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
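For instance, a minimal usage sketch (a hypothetical period; with .loc label slicing the endpoints must be existing column labels of the string-labelled frame):
p = Period('9:05:00', '09:20:00')
df_out = join_colMean(df, p)   # adds a 'mean_9:05:00--09:20:00' column
print(df_out)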
I am going to show you my results:
Please help.
My first table looks like:
id val1 val2
0 4 30
1 5 NaN
2 3 10
3 2 8
4 3 NaN
My second table looks like:
id val1 val2_estimate
0 1 8
1 2 12
2 3 13
3 4 16
4 5 22
I want to replace NaN in the 1st table with the estimated values from column val2_estimate of the 2nd table where the val1 values match. val1 in the 2nd table is unique. The end result needs to look like this:
id val1 val2
0 4 30
1 5 22
2 3 10
3 2 8
4 3 13
I want to replace NaN values only.
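For reference, the two example frames can be rebuilt like this (a sketch taken from the tables above, with np.nan marking the missing values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                   'val1': [4, 5, 3, 2, 3],
                   'val2': [30, np.nan, 10, 8, np.nan]})
df2 = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                    'val1': [1, 2, 3, 4, 5],
                    'val2_estimate': [8, 12, 13, 16, 22]})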
Use merge to get the corresponding estimate from df2 for each row of df, then use fillna:
df['val2'] = df['val2'].fillna(
df.merge(df2, on=['val1'], how='left')['val2_estimate'])
df
id val1 val2
0 0 4 30.0
1 1 5 22.0
2 2 3 10.0
3 3 2 8.0
4 4 3 13.0
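One caveat: fillna aligns on the index, and merge returns a result with a fresh RangeIndex, so the line above matches up only because df also has a default 0..n-1 index. If your real frame is indexed differently, realign explicitly first (a sketch):
# realign the merged estimates to df's own index before filling
est = df.merge(df2, on='val1', how='left')['val2_estimate']
est.index = df.index
df['val2'] = df['val2'].fillna(est)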
Many ways to skin a cat, this is one of them.
Use fillna with map from a pd.Series created using set_index:
df['val2'] = df['val2'].fillna(df['val1'].map(df2.set_index('val1')['val2_estimate']))
df
Output:
val1 val2
id
0 4 30.0
1 5 22.0
2 3 10.0
3 2 8.0
4 3 13.0
Given a time-series DataFrame, is it possible to create a new DataFrame with the same dimensions, where the values are the ranking of each value within its row compared to the other columns (smallest value ranked first)?
Example:
ABC DEFG HIJK XYZ
date
2018-01-14 0.110541 0.007615 0.063217 0.002543
2018-01-21 0.007012 0.042854 0.061271 0.007988
2018-01-28 0.085946 0.177466 0.046432 0.069297
2018-02-04 0.018278 0.065254 0.038972 0.027278
2018-02-11 0.071785 0.033603 0.075826 0.073270
The first row would become:
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
as XYZ has the smallest value in that row and ABC the largest.
numpy.argsort looks like it might help; however, since it outputs the locations themselves, I have not managed to get it to work.
Many thanks
Use a double argsort to rank within each row and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df.values.argsort().argsort() + 1, index=df.index, columns=df.columns)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
Or use DataFrame.rank with method='dense':
df1 = df.rank(axis=1, method='dense').astype(int)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
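The two approaches differ only when a row contains ties: method='dense' gives tied values the same rank with no gaps, while method='first' breaks ties by column order and reproduces the double-argsort result exactly. A small sketch with a hypothetical tied row:
import pandas as pd

tied = pd.DataFrame([[0.1, 0.1, 0.3, 0.2]], columns=['ABC', 'DEFG', 'HIJK', 'XYZ'])
print(tied.rank(axis=1, method='dense').astype(int))   # 1 1 3 2
print(tied.rank(axis=1, method='first').astype(int))   # 1 2 4 3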
I want to delete values based on their relative rank within their column. Specifically, I want to isolate the X highest and X lowest values within several columns. So if X=2 and my dataframe looks like this:
ID Val1 Val2 Val3
001 2 8 14
002 10 15 8
003 3 1 20
004 11 11 7
005 14 4 19
The output should look like this:
ID Val1 Val2 Val3
001 2 NaN NaN
002 NaN 15 8
003 3 1 20
004 11 11 7
005 14 4 19
I know that I can make a sub-table to isolate the high and low rank using:
df = df.sort_values('Column Name')
df2 = df.head(X) # OR: df.tail(X)
And I figure I can clear these sub-tables of the values from the other columns using:
df2['Other Column'] = np.NaN
df2['Other Column B'] = np.NaN
Then merge the sub-tables back together in a way that replaces NaN values when there is data in one of the tables. I tried:
df2.update(df3) # df3 is a sub-table made the same way as df2 using a different column
Which only updated rows already present in df2.
I tried:
out = pd.merge(df2, df3, how='outer')
which gave me separate rows when a row appeared in both df2 and df3.
I tried:
out = df2.combine_first(df3)
which overwrote numerical values with NaN values in some cases, making it unsuitable.
There must be a way to do this: I want the original dataframe with NaN values plugged in whenever a value is not among the X highest or X lowest values in its column.
Interesting question. You can get the position of each value within the sorted values of its column (here in the mask DataFrame), and then keep the values whose position falls within your defined boundary.
In [98]:
print(df)
Val1 Val2 Val3
ID
1 2 8 14
2 10 15 8
3 3 1 20
4 11 11 7
5 14 4 19
In [99]:
mask = df.apply(lambda x: np.searchsorted(sorted(x),x))
print(mask)
Val1 Val2 Val3
ID
1 0 2 2
2 2 4 1
3 1 0 4
4 3 3 0
5 4 1 3
In [100]:
print((mask<=1)|(mask>=(len(mask)-2)))
Val1 Val2 Val3
ID
1 True False False
2 False True True
3 True True True
4 True True True
5 True True True
In [101]:
print(df.where((mask<=1)|(mask>=(len(mask)-2))))
Val1 Val2 Val3
ID
1 2 NaN NaN
2 NaN 15 8
3 3 1 20
4 11 11 7
5 14 4 19
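An equivalent rank-based variant avoids the apply/searchsorted pass entirely (a sketch with X=2 as in the question; method='first' guarantees a strict 1..n ordering per column even with ties):
X = 2
ranks = df.rank(method='first')              # 1..n within each column
keep = (ranks <= X) | (ranks > len(df) - X)  # X lowest or X highest
print(df.where(keep))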
Given a data frame like:
>>> df
ix val1 val2 val3 val4
1.31 2 3 4 5
8.22 2 3 4 5
5.39 2 3 4 5
7.34 2 3 4 5
Is it possible to edit the index using something like replace?
Pseudo-code (since the df index doesn't have a str attribute):
df.index=df.index.str.replace("\\.[0-9]*","")
I need something like:
>>> df
ix val1 val2 val3 val4
1 2 3 4 5
8 2 3 4 5
5 2 3 4 5
7 2 3 4 5
The problem is that my dataframe is huge.
Thanks in advance
You can do:
df.index = df.index.to_series().astype(str).str.replace(r'\.[0-9]*','').astype(int)
You may also use .extract:
df.index.to_series().astype(str).str.extract(r'(\d+)').astype(int)
Alternatively, you may just map the index to int:
pd.Index(map(int, df.index))
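If the index is actually numeric (floats such as 1.31), a plain cast may be the simplest and fastest option on a huge frame (a sketch; it assumes non-negative values, where truncation matches stripping the decimals):
# truncate the fractional part directly instead of going through strings
df.index = df.index.astype(int)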