I have a strange problem with np.where. I first load a DataFrame called df and create a duplicate of it, df_1. I then use np.where to set each value in df_1 to 1 if the cell is greater than or equal to its column mean (stored in the DataFrame df_mean), and to 0 otherwise. I use a for loop to iterate over the column headers of df_1 and look up each column's mean in df_mean. Here's my code:
#Load the data
df = pd.read_csv('F:\\file.csv')
df.head(2)
>>> A AA AAP AAPL ABC
2011-01-10 09:30:00 -0.000546 0.006528 -0.001051 0.034593 -0.000095 ...
2011-01-10 09:30:10 -0.000256 0.007705 -0.001134 0.008578 -0.000549 ...
# df_mean holds each column's average
>>> df_mean.head(4)
A 0.000656
AA 0.002068
AAP 0.001134
AAPL 0.001728
...
df_1 = df
for x in list:  # list holds the column headers
    df_1[x] = np.where(df_1[x] >= df_mean[x], 1, 0)
>>> df_1.head(4)  # my desired output (but it also turns df into df_1... WHY?)
A AA AAP AAPL ABC
2011-01-10 09:30:00 0 1 0 1 0 ...
2011-01-10 09:30:10 0 1 0 1 0 ...
2011-01-10 09:30:20 0 0 0 1 0 ...
2011-01-10 09:30:30 0 0 0 1 1 ...
Now, I get what I want, which is a binary 1/0 matrix for df_1, but it turns out that df is also converted into the same binary matrix as df_1. WHY? The loop never touches df...
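A minimal sketch (with made-up data, independent of the frame above) of the difference between aliasing a DataFrame and copying it, which is the behaviour I suspect is involved here:
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

df_alias = df          # just a second name bound to the same object
df_copy = df.copy()    # an independent copy of the data

df_copy['A'] = 0       # df is untouched by this
df_alias['A'] = 0      # df changes too, because df and df_alias are one object
print(df)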
Although this is not quite what you asked for, my spidey sense tells me you want some form of indicator of whether a stock is currently over- or underperforming with regard to "something", using the mean of that "something". Maybe try this:
import numpy as np
import pandas as pd

S = pd.DataFrame(
    np.array([[1.2, 3.4], [1.1, 3.5], [1.4, 3.3], [1.2, 1.6]]),
    columns=["Stock A", "Stock B"],
    index=pd.date_range("2014-01-01", "2014-01-04", freq="D")
)
indicator = S > S.mean()
binary = indicator.astype("int")
print(S)
print(indicator)
print(binary)
This gives the output:
Stock A Stock B
2014-01-01 1.2 3.4
2014-01-02 1.1 3.5
2014-01-03 1.4 3.3
2014-01-04 1.2 1.6
[4 rows x 2 columns]
Stock A Stock B
2014-01-01 False True
2014-01-02 False True
2014-01-03 True True
2014-01-04 False False
[4 rows x 2 columns]
Stock A Stock B
2014-01-01 0 1
2014-01-02 0 1
2014-01-03 1 1
2014-01-04 0 0
[4 rows x 2 columns]
While you are at it, you should probably look into pd.rolling_mean(S, n_periods_for_mean).
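pd.rolling_mean has since been removed from pandas; on current versions the equivalent, continuing from the S defined above, looks like this (the 2-period window is only an example value):
rolling_avg = S.rolling(2).mean()                    # modern replacement for pd.rolling_mean(S, 2)
rolling_indicator = (S > rolling_avg).astype("int")  # over/under the rolling mean
print(rolling_indicator)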
Related
From the picture below, we see that serial C failed on 3rd January and A failed on 5th January, within a 6-day period. I am interested in taking samples from the 3 days before the failure of each serial number.
My code:
import datetime

import numpy as np
import pandas as pd
df = pd.read_csv('https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv',sep='\t')
df['date'] = pd.to_datetime(df['date'])
#df.drop(columns=df.columns[0], axis=1, inplace=True)
df = df.sort_values(by="date")
d = datetime.timedelta(days = 3)
df_fail_date = df[df['failure']==1].groupby(['serial_number'])['date'].min()
df_fail_date = df_fail_date - d
df_fail_date
I was not able to move further and sample my data. I am interested in getting the following data, i.e. the 3 days before the failure. Serial C had only 1 day available before its failure, so I want to keep that one as well. It would be nice to add a duration column to count the days before the failure occurred. I appreciate your suggestions. Thanks!
Expected output dataframe:
You can use a groupby.rolling to get the dates/serials to keep, then merge to select:
df['date'] = pd.to_datetime(df['date'])
N = 3
m = (df.sort_values(by='date')
       .loc[::-1]
       .groupby('serial_number', group_keys=False)
       .rolling(f'{N+1}d', on='date')
       ['failure'].max().eq(1)
       .iloc[::-1]
     )

out = df.merge(m[m], left_on=['serial_number', 'date'],
               right_index=True, how='right')
Output:
date serial_number failure_x smart_5_raw smart_187_raw failure_y
2 2014-01-01 C 0 0 80 True
8 2014-01-02 C 0 0 200 True
4 2014-01-03 C 1 0 120 True
7 2014-01-02 A 0 0 180 True
5 2014-01-03 A 0 0 140 True
9 2014-01-04 A 0 0 280 True
14 2014-01-05 A 1 0 400 True
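The question also asked for a duration column counting the days before the failure; one simple way to bolt that onto out, assuming duration should just number the kept rows of each serial in date order (a sketch, not part of the answer above):
out = out.sort_values(['serial_number', 'date'])
out['duration'] = out.groupby('serial_number').cumcount() + 1   # 1, 2, ... per serial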
Another possible solution:
N = 4
df['date'] = pd.to_datetime(df['date'])

(df[df.groupby('serial_number')['failure'].transform(sum) == 1]
 .sort_values(by=['serial_number', 'date'])
 .groupby('serial_number')
 .apply(lambda g: g.assign(
     duration=1 + np.arange(min(0, min(N, len(g)) - len(g)), min(N, len(g)))))
 .loc[lambda x: x['duration'] > 0]
 .reset_index(drop=True))
Output:
date serial_number failure smart_5_raw smart_187_raw duration
0 2014-01-02 A 0 0 180 1
1 2014-01-03 A 0 0 140 2
2 2014-01-04 A 0 0 280 3
3 2014-01-05 A 1 0 400 4
4 2014-01-01 C 0 0 80 1
5 2014-01-02 C 0 0 200 2
6 2014-01-03 C 1 0 120 3
I need to add a new column based on the values of two existing columns.
My data set looks like this:
Date Bid Ask Last Volume
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0
... ... ... ... ... ...
80356 2021.02.01 23:58:54.332 1.20603 1.20605 0.0 0
I need to generate a new column named "New" whose values are random numbers between column "Bid" and column "Ask". Each value of "New" has to lie in the range from Bid to Ask (it may equal Bid or Ask).
I have tried this:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Ask,x.Bid), axis=1)
But I got this
Exception has occurred: ValueError
low >= high
I am new to Python.
Use np.random.uniform so you get a random float with equal probability between your low and high bounds, drawn from the half-open interval [low_bound, high_bound).
Also ditch the apply; np.random.uniform can generate the numbers using arrays of bounds. (I added a row at the bottom to make this obvious).
import numpy as np
df['New'] = np.random.uniform(df.Bid, df.Ask, len(df))
Date Bid Ask Last Volume New
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0 1.213114
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0 1.212969
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0 1.213342
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0 1.212933
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0 1.212948
5 2021.02.01 00:02:08.920 100.00000 115.00000 0.0 0 100.552836
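If you are on a recent NumPy, the same idea also works with the newer Generator API; the seed below is only there to make the sketch reproducible:
rng = np.random.default_rng(0)                      # seeded only for reproducibility
df['New'] = rng.uniform(df.Bid, df.Ask, len(df))    # same broadcasting over the bound columns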
All you need to do is switch the order of x.Ask and x.Bid in your code. In your dataframe the ask prices are always higher than the bid, which is why you are getting the error (note, though, that np.random.randint draws integers, so with float prices this close together it will still complain; np.random.uniform is the better fit for float bounds):
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask), axis=1)
If your ask value is sometimes greater and sometimes less than the bid, use:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask) if x.Ask > x.Bid else np.random.randint(x.Ask,x.Bid), axis=1)
Finally, if it is possible for ask to be greater than, less than, or equal to bid, use:
def helper(x):
    if x.Ask > x.Bid:
        return np.random.randint(x.Bid, x.Ask)
    elif x.Bid > x.Ask:
        return np.random.randint(x.Ask, x.Bid)
    else:
        return None

df['rand_between'] = df.apply(helper, axis=1)
You can loop through the rows using apply and then use your randint function (for floats you might want to use random.uniform). For example:
In [1]: import pandas as pd
...: from random import randint
...: df = pd.DataFrame({'bid':range(10),'ask':range(0,20,2)})
...:
...: df['new'] = df.apply(lambda x: randint(x['bid'],x['ask']), axis=1)
...: df
Out[1]:
bid ask new
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 6
4 4 8 6
5 5 10 9
6 6 12 9
7 7 14 12
8 8 16 13
9 9 18 9
The axis=1 is telling the apply function to loop over rows, not columns.
I want to create a column in a pandas DataFrame that adds up the values of the other columns (which are 0s or 1s). The column is called "sum".
The head of my DataFrame looks like:
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 0.0 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 0.0 0 0 1 .... 0 0 1
In an image from pythoneverywhere:
Expected result (assuming there are no more columns):
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 2 0 0 1 .... 0 0 1
As you see, the values of 'sum' stay 0 even though there are 1s in some columns.
What am I doing wrong?
The basics of the code are:
theMatrix=pd.DataFrame([datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['sum'] = 0
So far, so good.
Then I add all the values with loc, and then I want to add up the values with:
theMatrix.fillna(0, inplace=True)
# this being the key line:
theMatrix['sum'] = theMatrix.sum(axis=1)
theMatrix.sort_index(axis=0, ascending=True, inplace=True)
As you see in the result (attached image), the sum remains 0.
I had a look here and here and at the pandas documentation, to no avail.
Actually, the expression
theMatrix['sum'] = theMatrix.sum(axis=1)
is the one I got from there.
Changing this last line to:
theMatrix['sum'] = theMatrix[3:0].sum(axis=1)
in order to avoid summing the first three columns gives this result:
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 nan 1 1 0 .... 0 0 0
~00pr 0 0.0 1.0 0 0 0 .... 0 0 1
~00te 0 0.0 0 0 0 0 .... 0 0 0
Please observe two things:
a) In row '~00c' the sum is nan even though there are 1s in that row.
b) Before calculating the sum, theMatrix.fillna(0, inplace=True) should have changed every possible nan into 0, so the sum should never be nan, since in theory there are no nan values in any of the columns [3:].
It still wouldn't work.
Any ideas?
Thanks.
PS: Later edit, just in case you wondered how the dataframe is populated: by reading and parsing an XML, and the lines are:
# myDocId being the name of the columns
# concept being the index.
theMatrix.loc[concept,myDocId]=1
If I understand correctly, this can help you:
import pandas as pd
import datetime
#create dataframe following your example
theMatrix=pd.DataFrame([datetime.datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['col1'] = 1
theMatrix['col2'] = 1
# create 'sum' column with summed values from certain columns
theMatrix['sum'] = theMatrix['col1'] + theMatrix['col2']
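If there are many value columns, spelling each one out gets tedious. Assuming, as in your own attempt, that the first three columns are the ones to skip, a positional slice is a common alternative (just a sketch, not tested against your full data):
# .iloc[:, 3:] slices columns by position, whereas theMatrix[3:0] is a row slice
# (and an empty one at that), which is why that attempt could not work.
theMatrix['sum'] = theMatrix.iloc[:, 3:].sum(axis=1)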
Put whichever columns you choose to sum into a list, and use that list to select them before calling sum with axis=1. This will give you the desired outcome. Here is a sample related to your data.
Sample File Data:
Date,a,b,c
bad, bad, bad, bad # Used to simulate your data better
2018-11-19,1,0,0
2018-11-20,1,0,0
2018-11-21,1,0,1
2018-11-23,1,nan,0 # Nan here is just to represent the missing data
2018-11-28,1,0,1
2018-11-30,1,nan,1 # Nan here is just to represent the missing data
2018-12-02,1,0,1
Code:
import pandas as pd

df = pd.read_csv(yourdata.filename)  # Your method of loading the data

# cols_to_sum = ['a','b','c']   # The columns you wish to summarize
cols_to_sum = df.columns[1:]    # Alternate method: everything after 'Date'.
df = df.fillna(0)               # Used to fill the NaN you were talking about.
# Skip the bad first row, and use astype(int) because that bad row forced the
# columns to be read in as strings.
df['sum'] = df[cols_to_sum][1:].astype(int).sum(axis=1)
print(df)
Output:
Date a b c sum
bad bad bad bad NaN
2018-11-19 1 0 0 1.0
2018-11-20 1 0 0 1.0
2018-11-21 1 0 1 2.0
2018-11-23 1 0 0 1.0
2018-11-28 1 0 1 2.0
2018-11-30 1 0 1 2.0
2018-12-02 1 0 1 2.0
I have a dataframe with multiple status fields per row. I want to check if any of the status fields have values in a list, and if so, I need to take the lowest date field for the corresponding status. My list of acceptable values and a sample dataframe look like this:
checkList = ['Foo','Bar']
df = pd.DataFrame(
    [['A', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
     ['B', 'Foo', datetime.datetime(2017,10,1), 'Other', datetime.datetime(2017,9,1), np.nan, np.nan],
     ['C', 'Bar', datetime.datetime(2016,1,1), np.nan, np.nan, 'Foo', datetime.datetime(2016,5,5)]],
    columns=['record', 'status1', 'status1_date', 'status2', 'status2_date', 'another_status', 'another_status_date'])
print(df)
record status1 status1_date status2 status2_date another_status \
0 A NaN NaT NaN NaT NaN
1 B Foo 2017-10-01 Other 2017-09-01 NaN
2 C Bar 2016-01-01 NaN NaT Foo
another_status_date
0 NaT
1 NaT
2 2016-05-05
I need to figure out if any of the statuses are in the approved list. If so, I need the first date for an approved status. The output would look like this:
print(output_df)
record master_status master_status_date
0 A False NaT
1 B True 2017-10-01
2 C True 2016-01-01
Thoughts on how best to approach this? I can't just take the min date; I need the min only where the corresponding status field is in the list.
master_status = df.apply(
    lambda x: False if all([pd.isnull(rec) for rec in x[1:]]) else True, axis=1)
master_status_date = df.apply(
    lambda x: min([i for i in x[1:] if isinstance(i, datetime.datetime)]), axis=1)
record = df['record']
n_df = pd.concat([record, master_status, master_status_date], axis=1)
print(n_df)
record 0 1
0 A False NaT
1 B True 2017-09-01
2 C True 2016-01-01
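Note that the code above takes the minimum over every date in the row, not just the dates whose paired status is in checkList, which is why B comes out as 2017-09-01 rather than the expected 2017-10-01. Here is a sketch that also applies the approved-list filter, assuming the status/date columns always come in adjacent name pairs as in the sample frame (first_approved_date is a hypothetical helper, not part of the answer above):
def first_approved_date(row, status_cols, date_cols, approved):
    # Collect only the dates whose paired status value is in the approved list.
    dates = [row[d] for s, d in zip(status_cols, date_cols) if row[s] in approved]
    return min(dates) if dates else pd.NaT

status_cols = ['status1', 'status2', 'another_status']
date_cols = ['status1_date', 'status2_date', 'another_status_date']

output_df = pd.DataFrame({
    'record': df['record'],
    'master_status': df[status_cols].isin(checkList).any(axis=1),
    'master_status_date': df.apply(first_approved_date, axis=1,
                                   args=(status_cols, date_cols, checkList)),
})
print(output_df)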
I would like to calculate the number of instances in which two criteria are fulfilled in a pandas DataFrame at different index values. A snippet of the DataFrame is:
GDP USRECQ
DATE
1947-01-01 NaN 0
1947-04-01 NaN 0
1947-07-01 NaN 0
1947-10-01 NaN 0
1948-01-01 0.095023 0
1948-04-01 0.107998 0
1948-07-01 0.117553 0
1948-10-01 0.078371 0
1949-01-01 0.034560 1
1949-04-01 -0.004397 1
I would like to count the number of observations for which USRECQ[DATE+1]==1 and GDP[DATE]>a if GDP[DATE]!='NAN'.
By referring to DATE+1 and DATE I mean that the value of USRECQ should be checked at the date immediately after the one at which the value of GDP is examined. Unfortunately, I do not know how to deal with the different time indices in my selection. Can someone kindly advise me on how to count the number of instances properly?
One way of achieving this is to create a new column to show what the next value of 'USRECQ' is:
>>> df['USRECQ NEXT'] = df['USRECQ'].shift(-1)
>>> df
DATE GDP USRECQ USRECQ NEXT
0 1947-01-01 NaN 0 0
1 1947-04-01 NaN 0 0
2 1947-07-01 NaN 0 0
3 1947-10-01 NaN 0 0
4 1948-01-01 0.095023 0 0
5 1948-04-01 0.107998 0 0
6 1948-07-01 0.117553 0 0
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
9 1949-04-01 -0.004397 1 NaN
Then you could filter your DataFrame according to your requirements as follows:
>>> a = 0.01
>>> df[(df['USRECQ NEXT'] == 1) & (df['GDP'] > a) & (pd.notnull(df['GDP']))]
DATE GDP USRECQ USRECQ NEXT
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
To count the number of rows in a DataFrame, you can just use the built-in function len.
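For example, combining len with the filter above (a is the threshold defined earlier):
# NaN GDP rows already fail the > a test, so no separate notnull check is needed here.
count = len(df[(df['USRECQ NEXT'] == 1) & (df['GDP'] > a)])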
I think the DataFrame.shift method is the key to what you seek in terms of looking at the next index.
And Numpy's logical expressions can come in really handy for these sorts of things.
So if df is your DataFrame, then I think what you're looking for is something like
count = len(df[np.logical_and(df.shift(-1)['USRECQ'] == 1, df.GDP > -0.1)])
The example I used to test this is on github.