How do I select/slice from multiple starting points, in this case starting from max() for each of the columns? Each stock has its own max value, so the selection should begin from that particular point for each column.
df
>>> TSLA MSFT
2017-05-15 00:00:00+00:00 314 68
2017-05-16 00:00:00+00:00 319 69
2017-05-17 00:00:00+00:00 320 61
2017-05-18 00:00:00+00:00 313 66
2017-05-19 00:00:00+00:00 316 70
2017-05-22 00:00:00+00:00 314 65
2017-05-23 00:00:00+00:00 310 63
max_idx = df.idxmax() # returns index of max value
>>> TSLA 2017-05-17 00:00:00+00:00
>>> MSFT 2017-05-19 00:00:00+00:00
max_value = df.max() # returns max value
>>> TSLA = 320
>>> MSFT = 70
Is there any way to do something like
df2 = df.loc[max_idx:]
I want the output such that I can later find the max_value and max_idx again on this new output, starting from
TSLA 2017-05-17 00:00:00+00:00
MSFT 2017-05-19 00:00:00+00:00
EDIT: I am expecting the following output:
df2
>>> TSLA MSFT
2017-05-17 00:00:00+00:00 320 2017-05-19 00:00:00+00:00 70
2017-05-18 00:00:00+00:00 313 2017-05-22 00:00:00+00:00 65
2017-05-19 00:00:00+00:00 316 2017-05-23 00:00:00+00:00 63
2017-05-22 00:00:00+00:00 314
2017-05-23 00:00:00+00:00 310
Similar to how @unutbu used MultiIndexing, the new dataframe can be multi-indexed if possible.
Just as an example I only posted 2 columns, but there will be hundreds of columns,
so please keep such big data in mind. Thanks!
You could use the apply method:
In [204]: df.apply(lambda s: s.loc[s.idxmax():])
Out[204]:
MSFT TSLA
2017-05-17 NaN 320
2017-05-18 NaN 313
2017-05-19 70.0 316
2017-05-22 65.0 314
2017-05-23 63.0 310
or, building on MaxU's answer,
In [205]: pd.concat({c:df.loc[max_idx[c]:, c] for c in df.columns}).unstack(level=0)
Out[205]:
MSFT TSLA
2017-05-17 NaN 320.0
2017-05-18 NaN 313.0
2017-05-19 70.0 316.0
2017-05-22 65.0 314.0
2017-05-23 63.0 310.0
Both of these solutions loop over the columns. (df.apply's loop is done under
the hood, but it amounts to a Python-speed loop performance-wise.) I know you
are looking for a vectorized solution but in this case I don't see a way to
avoid the loop.
If you want to avoid the NaNs, you could leave the answer unstacked:
In [208]: pd.concat({c:df.loc[max_idx[c]:, c] for c in df.columns})
Out[208]:
MSFT 2017-05-19 70
2017-05-22 65
2017-05-23 63
TSLA 2017-05-17 320
2017-05-18 313
2017-05-19 316
2017-05-22 314
2017-05-23 310
dtype: int64
or, if you're using df.apply, call stack to move the columns labels into a level of the row index:
In [213]: df.apply(lambda s: s.loc[s.idxmax():]).T.stack()
Out[213]:
MSFT 2017-05-19 70.0
2017-05-22 65.0
2017-05-23 63.0
TSLA 2017-05-17 320.0
2017-05-18 313.0
2017-05-19 316.0
2017-05-22 314.0
2017-05-23 310.0
dtype: float64
So let's look at performance. With this setup (to test on a bigger DataFrame):
shape = (1000, 2000)
bigdf = pd.DataFrame(np.random.randint(100, size=shape),
                     index=pd.date_range('2000-1-1', periods=shape[0]))
def using_apply(df):
    return df.apply(lambda s: s.loc[s.idxmax():])

def using_loop(df):
    max_idx = df.idxmax()
    return pd.concat({c: df.loc[max_idx[c]:, c] for c in df.columns}).unstack(level=0)
MaxU's using_loop is slightly faster than using_apply:
In [202]: %timeit using_apply(bigdf)
1 loop, best of 3: 1.45 s per loop
In [203]: %timeit using_loop(bigdf)
1 loop, best of 3: 1.22 s per loop
Note, however, that it is best to test benchmarks on your own machine as results
may vary.
We could do something like this:
In [120]: {c:df.loc[max_idx[c]:, c].max() for c in df.columns}
Out[120]: {'MSFT': 70, 'TSLA': 320}
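If you instead keep one of the sliced frames from above as df2, the usual reductions still work and skip the NaN entries, so max_value and max_idx can be recomputed directly on it. A minimal sketch:
df2 = df.apply(lambda s: s.loc[s.idxmax():])
new_max_value = df2.max()    # per-column max of the sliced data (NaNs are ignored)
new_max_idx = df2.idxmax()   # date on which that max occurs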
If you want to slice based on the index of the max, you can use:
df[(df.index > max_idx.TSLA) & (df.index > max_idx.MSFT)]
which gives you the rows with a timestamp greater than both maxima (you could also choose one or the other; I wasn't sure which you wanted).
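With hundreds of columns you can also compare against an aggregate of max_idx itself instead of naming each column; a small sketch of that variant:
after_all = df[df.index > max_idx.max()]  # rows strictly after the latest per-column max date
after_any = df[df.index > max_idx.min()]  # rows strictly after the earliest per-column max date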
I have a table with 3 main columns. I would like to first group the data by Company ID, then get the Highest Post Valuation per Company ID, and its corresponding Deal Date.
Question: How do I add corresponding Deal Date in?
The data:
     Company ID  Post Valuation   Deal Date
60    119616-85             NaN  2022-03-01
80    160988-50            6.77  2022-02-10
85    108827-47             NaN  2022-02-01
89    154876-33            1.40  2022-01-27
104   435509-92            6.16  2022-01-05
107   186777-73           17.26  2022-01-03
111   232001-47             NaN  2022-01-01
113   160988-50             NaN  2021-12-31
119   114196-78             NaN  2021-12-15
128   481375-00            2.82  2021-12-01
130   128348-20             NaN  2021-11-25
131   166855-60          658.36  2021-11-25
150   113503-87             NaN  2021-10-20
156   178448-68           21.75  2021-10-07
170   479007-64             NaN  2021-09-13
182   128479-51             NaN  2021-09-01
185   113503-87             NaN  2021-08-31
186   128348-20             NaN  2021-08-30
191   108643-42            8.02  2021-08-13
192   186272-74             NaN  2021-08-12
The attempt
df_X.sort_values('Post Valuation', ascending=True).groupby('Company ID', as_index=False)['Post Valuation'].first()
Sort and drop duplicates:
result = df.sort_values('Post Valuation').drop_duplicates(subset='Company ID', keep='last')
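As a rough, self-contained sketch of that idea on a few rows of the sample data (na_position='first' is my addition, so that a NaN valuation for a company does not shadow its real maximum):
import pandas as pd
df = pd.DataFrame({
    'Company ID': ['160988-50', '154876-33', '160988-50', '108643-42'],
    'Post Valuation': [6.77, 1.40, None, 8.02],
    'Deal Date': ['2022-02-10', '2022-01-27', '2021-12-31', '2021-08-13'],
})
# sorting puts each company's highest valuation last; drop_duplicates(keep='last')
# then keeps exactly that row, Deal Date included
result = (df.sort_values('Post Valuation', na_position='first')
            .drop_duplicates(subset='Company ID', keep='last'))
print(result)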
I would like to create a new DataFrame and add a bunch of stock data for each date.
Declaring a DataFrame with a multi-index - date and stock ticker.
Adding data for 2020-06-07
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
Adding data for 2020-06-08
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
2020-06-08 AAPL 32.50 34.20 31.1 32.30
2020-06-08 MSFT 58.50 59.20 52.1 53.30
What would be the best and most efficient solution?
Here's my current version that doesn't work as I expect.
df = pd.DataFrame()
for date in dates:
    universe500 = get_universe(date)  # returns stocks on a specific date
    for security in universe500:
        prices = data.get_prices(security, ['open','high','low','close'], 1, '1d')  # returns pd.DataFrame
        df.iloc[(date, security), :] = prices
If prices is a DataFrame formatted in the same manner as the original df, you can use concat:
In[0]:
#constructing a fake entry
arrays = [['2020-06-09'], ['ABCD']]
multi = pd.MultiIndex.from_arrays(arrays, names=('date', 'stock'))
to_add = pd.DataFrame({'open':1, 'high':2, 'low':3, 'close':4},index=multi)
print(to_add)
Out[0]:
open high low close
date stock
2020-06-09 ABCD 1 2 3 4
In[1]:
#now adding to your data
df = pd.concat([df, to_add])
print(df)
Out[1]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
If the data (prices) were just an array of 4 numbers (the open, high, low, and close values), then loc would work in the place where you used iloc:
In[2]:
df.loc[('2020-06-10','WXYZ'),:] = [10,20,30,40]
Out[2]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
2020-06-10 WXYZ 10.0 20.0 30.0 40.0
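Applying the same concat idea to the loop in the question, one hedged rework (get_universe and data.get_prices are the question's own placeholders, and I'm assuming each prices frame comes back as a single row with open/high/low/close columns) collects the pieces and concatenates once at the end:
import pandas as pd
pieces = []
for date in dates:
    universe500 = get_universe(date)  # question's helper: stocks on a specific date
    for security in universe500:
        prices = data.get_prices(security, ['open', 'high', 'low', 'close'], 1, '1d')
        # re-key the single row with a (date, stock) MultiIndex entry
        prices.index = pd.MultiIndex.from_tuples([(date, security)], names=('date', 'stock'))
        pieces.append(prices)
df = pd.concat(pieces)  # one concat instead of growing df row by row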
I have the following dataframe:
YEARMODA TEMP MAX MIN
0 19730701 74.5 90.0 53.6
1 19730702 74.5 88.9 57.9
2 19730703 81.7 95.0 63.0
3 19730704 85.0 95.0 65.8
4 19730705 85.0 97.9 63.9
How do I convert the date to a datetime-like type? I want to get the average and standard deviation of the temperature by year and by month. I know how to use groupby; it's just working with YEARMODA that is the problem.
Here are two ways to solve this; take your pick:
df['YEARMODA'] = pd.to_datetime(df['YEARMODA'], format='%Y%m%d')
YEARMODA TEMP MAX MIN
0 1973-07-01 74.5 90.0 53.6
1 1973-07-02 74.5 88.9 57.9
2 1973-07-03 81.7 95.0 63.0
3 1973-07-04 85.0 95.0 65.8
4 1973-07-05 85.0 97.9 63.9
--------------------------------------------------------------------
from functools import partial
p = partial(pd.to_datetime, format='%Y%m%d')
df['YEARMODA'] = df['YEARMODA'].apply(p)
YEARMODA TEMP MAX MIN
0 1973-07-01 74.5 90.0 53.6
1 1973-07-02 74.5 88.9 57.9
2 1973-07-03 81.7 95.0 63.0
3 1973-07-04 85.0 95.0 65.8
4 1973-07-05 85.0 97.9 63.9
Edit: The issue you are having is that you are not providing the correct format to your pd.to_datetime expression, hence it is failing.
Edit 2: To get the std by month the way you want, you would do it as such:
df.groupby(df.YEARMODA.apply(p).dt.strftime('%B')).TEMP.std()
YEARMODA
July 5.321936
Name: TEMP, dtype: float64
df.assign(temp=pd.to_datetime(df['YEARMODA'], format='%Y%m%d') \
               .dt \
               .strftime('%B')) \
  .groupby('temp') \
  .TEMP \
  .std()
temp
July 5.321936
Name: TEMP, dtype: float64
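For the by-year-and-month statistics the question asks about, a minimal sketch once YEARMODA has been converted as above:
dt = pd.to_datetime(df['YEARMODA'], format='%Y%m%d')
stats = df.groupby([dt.dt.year, dt.dt.month])['TEMP'].agg(['mean', 'std'])  # mean and std of TEMP per (year, month)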
I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to be able to do is add a column to the dataframe df and then put the rank of the row based on the TIMESTAMP and corresponding day's rank from ranked (so in a day all the 5 minute timesteps will have the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values
df = df.set_index(df.TIMESTAMP.dt.date)\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
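An equivalent sketch that may read more plainly uses map on the normalized timestamp (this assumes ranked['DATE'] is datetime64, as stated, so it lines up with dt.normalize()):
rank_by_day = ranked.set_index('DATE')['RANK']
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(rank_by_day)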
I have a time series of events that spans multiple days; I'm mostly interested in counts per 10-minute interval. So currently, after resampling, it looks like this:
2018-02-27 16:20:00 5
2018-02-27 16:30:00 4
2018-02-27 16:40:00 0
2018-02-27 16:50:00 0
2018-02-27 17:00:00 0
...
2018-06-19 05:30:00 0
2018-06-19 05:40:00 0
2018-06-19 05:50:00 1
How can I "fold" this data over to have just one "day" of data, with the counts added up? So it would look something like this
00:00:00 0
00:10:00 0
...
11:00:00 47
11:10:00 36
11:20:00 12
...
23:40:00 1
23:50:00 0
If your series index is a DatetimeIndex, you can use the attribute time -- if it's a DataFrame and your datetimes are a column, you can use .dt.time. For example:
In [19]: times = pd.date_range("2018-02-27 16:20:00", "2018-06-19 05:50:00", freq="10 min")
...: ser = pd.Series(np.random.randint(0, 6, len(times)), index=times)
...:
...:
In [20]: ser.head()
Out[20]:
2018-02-27 16:20:00 0
2018-02-27 16:30:00 1
2018-02-27 16:40:00 4
2018-02-27 16:50:00 5
2018-02-27 17:00:00 0
Freq: 10T, dtype: int32
In [21]: out = ser.groupby(ser.index.time).sum()
In [22]: out.head()
Out[22]:
00:00:00 285
00:10:00 293
00:20:00 258
00:30:00 263
00:40:00 307
dtype: int32
In [23]: out.tail()
Out[23]:
23:10:00 280
23:20:00 291
23:30:00 236
23:40:00 303
23:50:00 299
dtype: int32
If I understand correctly, you want a sum of values per 10-minute interval, grouped by the time in the first column. You can perhaps try something like:
df.groupby('columns')['value'].agg(['sum'])
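For this particular question, a minimal sketch of that idea against the already-resampled counts (assuming they live in a Series called counts with a DatetimeIndex) groups on the time of day and sums, as in the answer above:
folded = counts.groupby(counts.index.time).sum()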