How to replace negative values in dataframe with specified values? - python

I have a df and need to replace negative values with specified values. How can I make my code simpler and avoid the warning?
Before replacement:
datetime a0 a1 a2
0 2022-01-01 0.097627 0.430379 0.205527
1 2022-01-02 0.089766 -0.152690 0.291788
2 2022-01-03 -0.124826 0.783546 0.927326
3 2022-01-04 -0.233117 0.583450 0.057790
4 2022-01-05 0.136089 0.851193 -0.857928
5 2022-01-06 -0.825741 -0.959563 0.665240
6 2022-01-07 0.556314 0.740024 0.957237
7 2022-01-08 0.598317 -0.077041 0.561058
8 2022-01-09 -0.763451 0.279842 -0.713293
9 2022-01-10 0.889338 0.043697 -0.170676
After replacement:
datetime a0 a1 a2
0 2022-01-01 9.762701e-02 4.303787e-01 2.055268e-01
1 2022-01-02 8.976637e-02 1.000000e-13 2.917882e-01
2 2022-01-03 1.000000e-13 7.835460e-01 9.273255e-01
3 2022-01-04 1.000000e-13 5.834501e-01 5.778984e-02
4 2022-01-05 1.360891e-01 8.511933e-01 1.000000e-13
5 2022-01-06 1.000000e-13 1.000000e-13 6.652397e-01
6 2022-01-07 5.563135e-01 7.400243e-01 9.572367e-01
7 2022-01-08 5.983171e-01 1.000000e-13 5.610584e-01
8 2022-01-09 1.000000e-13 2.798420e-01 1.000000e-13
9 2022-01-10 8.893378e-01 4.369664e-02 1.000000e-13
<ipython-input-5-887189ce29a9>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df2[df2 < 0] = float(1e-13)
<ipython-input-5-887189ce29a9>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df2[df2 < 0] = float(1e-13)
My code is as follows; the generate_data function generates demo data.
import numpy as np
import pandas as pd
np.random.seed(0)
# This function generates demo data.
def generate_data():
    datetime1 = pd.date_range(start='20220101', end='20220110')
    df = pd.DataFrame(data=datetime1, columns=['datetime'])
    col = [f'a{x}' for x in range(3)]
    df[col] = np.random.uniform(-1, 1, (10, 3))
    return df
def main():
    df = generate_data()
    print(df)
    col = list(df.columns)[1:]
    df2 = df[col]
    df2[df2 < 0] = float(1e-13)
    df[col] = df2
    print(df)
    return
if __name__ == '__main__':
    main()

You get the warning because df2 = df[col] is a slice of df (you slice because not all columns contain numerical values), and assigning into that slice triggers the SettingWithCopyWarning. You can use df2.mask(...) to avoid the warning:
df2 = df2.mask(df2 < 0, float(1e-13))
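For completeness, a minimal sketch of doing the whole replacement in one step (reusing the generate_data helper from the question), so no intermediate slice is ever written to and no warning appears:
df = generate_data()
num_cols = df.columns[1:]  # the numeric columns 'a0', 'a1', 'a2'
df[num_cols] = df[num_cols].mask(df[num_cols] < 0, float(1e-13))
print(df)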

You may try np.where:
pd.concat([df.datetime, df.iloc[:, 1:4].apply(lambda x: np.where(x < 0, float(1e-13), x), axis=0)], axis=1)
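Note that this concat expression builds a new DataFrame rather than modifying df in place, so you would assign the result back to keep it, e.g.:
df = pd.concat([df.datetime, df.iloc[:, 1:4].apply(lambda x: np.where(x < 0, float(1e-13), x), axis=0)], axis=1)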
Btw thanks for the beautiful reproducible example

The loc function from the pandas library may help. Once your df is generated:
# get the columns to check for the condition
cols = list(df.columns)[1:]
# iterate through the columns and replace
for col in cols:
    df.loc[df[col] < 0, col] = float(1e-13)
This should do the trick, hope it helps!

Maybe this:
df1 = df[['datetime']].copy()  # keep a copy of the datetime column before masking
df = df.mask(df.loc[:, df.columns != 'datetime'] < 0, float(1e-13))
df['datetime'] = df1['datetime']
print(df)
All the code:
import numpy as np
import pandas as pd
np.random.seed(0)
# This function generates demo data.
def generate_data():
    datetime1 = pd.date_range(start='20220101', end='20220110')
    df = pd.DataFrame(data=datetime1, columns=['datetime'])
    col = [f'a{x}' for x in range(3)]
    df[col] = np.random.uniform(-1, 1, (10, 3))
    return df
def main():
    df = generate_data()
    df1 = df[['datetime']].copy()  # keep a copy of the datetime column before masking
    df = df.mask(df.loc[:, df.columns != 'datetime'] < 0, float(1e-13))
    df['datetime'] = df1['datetime']
    print(df)
    return
if __name__ == '__main__':
    main()

Related

Resampling timeseries dataframe with multi-index

Generate data:
import pandas as pd
import numpy as np
FREQ = 10  # assumed sampling frequency in minutes (consistent with the output shown in the answer)
df = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}T', start='2020-10-01', periods=12 * 24))
df['col1'] = np.random.normal(size=df.shape[0])
df['col2'] = np.random.randint(1, 101, size=df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}T', start='2020-10-01', periods=12 * 24))
df2['col1'] = np.random.normal(size=df2.shape[0])
df2['col2'] = np.random.randint(1, 51, size=df2.shape[0])
df2['uid'] = 2
df3 = pd.concat([df, df2]).reset_index()
df3 = df3.set_index(['index', 'uid'])
I am trying to resample the data to 30-minute intervals and specify, for each uid and each column individually, how to aggregate the data. I have many columns and need to choose whether I want the mean, median, std, max, or min for each one. Since there are duplicate timestamps, I need to do this operation per user, which is why I set the MultiIndex and do the following:
df3.groupby(pd.Grouper(freq='30Min', closed='right', label='right')).agg({
    "col1": "max", "col2": "min", 'uid': 'max'})
but I get the following error
ValueError: MultiIndex has no single backing array. Use
'MultiIndex.to_numpy()' to get a NumPy array of tuples.
How can I do this operation?
You have to specify the level name when you use pd.Grouper on index:
out = (df3.groupby([pd.Grouper(level='index', freq='30T', closed='right', label='right'), 'uid'])
          .agg({"col1": "max", "col2": "min"}))
print(out)
# Output
col1 col2
index uid
2020-10-01 00:00:00 1 -0.222489 77
2 -1.490019 22
2020-10-01 00:30:00 1 1.556801 16
2 0.580076 1
2020-10-01 01:00:00 1 0.745477 12
... ... ...
2020-10-02 23:00:00 2 0.272276 13
2020-10-02 23:30:00 1 0.378779 20
2 0.786048 5
2020-10-03 00:00:00 1 1.716791 20
2 1.438454 5
[194 rows x 2 columns]
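If you want several statistics per column, as mentioned in the question, agg also accepts a list of functions per column. A minimal sketch reusing the same grouper (the statistics chosen here are just examples):
out = (df3.groupby([pd.Grouper(level='index', freq='30T', closed='right', label='right'), 'uid'])
          .agg({'col1': ['mean', 'median', 'std'], 'col2': ['min', 'max']}))
print(out)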

check if column is blank in pandas dataframe

I have the following csv file:
A|B|C
1100|8718|2021-11-21
1104|21|
I want to create a dataframe that gives me the date output as follows:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
This means:
if C is empty:
    put double quotes
else:
    format the date to yyyymmddhhmmss (adding 0s for hhmmss)
My code:
df['C'] = np.where(df['C'].empty, df['C'].str.replace('', '""'), df['C'] + '000000')
but it gives me the following:
A B C
0 1100 8718 2021-11-21
1 1104 21 0
I have tried another piece of code:
if df['C'].empty:
    df['C'] = df['C'].str.replace('', '""')
else:
    df['C'] = df['C'].str.replace('-', '') + '000000'
OUTPUT:
A B C
0 1100 8718 20211121000000
1 1104 21 0000000
Use dt.strftime:
df = pd.read_csv('data.csv', sep='|', parse_dates=['C'])
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)
# Output:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
A good way would be to convert the column into datetime using pd.to_datetime with the parameter errors='coerce', then dropping the null values.
import pandas as pd
x = pd.DataFrame({
    'one': 20211121000000,
    'two': 'not true',
    'three': '20211230'
}, index=[1])
x.apply(lambda x: pd.to_datetime(x, errors='coerce')).T.dropna()
# Output:
1
one 1970-01-01 05:36:51.121
three 2021-12-30 00:00:00.000
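Applied to the original pipe-separated file, the same idea could look like this (a sketch, using the data.csv name from the first answer):
df = pd.read_csv('data.csv', sep='|')
dates = pd.to_datetime(df['C'], errors='coerce')  # blank cells become NaT
df['C'] = dates.dt.strftime('%Y%m%d%H%M%S').fillna('""')  # format the dates, quote the blanks
print(df)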

Checking Padded data in Pandas Dataframe on specific columns

I have a DataFrame that looks like this:
import numpy as np
raw_data = {'Series_Date':['2017-03-10','2017-03-13','2017-03-14','2017-03-15'],'SP':[35.6,56.7,41,41],'1M':[-7.8,56,56,-3.4],'3M':[24,-31,53,5]}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Series_Date','SP','1M','3M'])
print(df)
I would like to run a test only on certain columns in this DataFrame, whose names are in this set:
check = {'1M','SP'}
print(check)
For these columns, I would like to know when the value in either of them is the same as the value on the previous day. The output dataframe should then contain the series date and a Comment, such as (for this example):
output_data = {'Series_Date':['2017-03-14','2017-03-15'],'Comment':["Value for 1M data is same as previous day","Value for SP data is same as previous day"]}
output_data_df = pd.DataFrame(output_data,columns = ['Series_Date','Comment'])
print(output_data_df)
Could you please provide some assistance with this?
The following does more or less what you want.
A column item + '_ok' is added to the original dataframe for each checked column, specifying whether its value is the same as on the previous day:
from datetime import timedelta
df['Date_diff'] = pd.to_datetime(df['Series_Date']).diff()
for item in check:
    df[item + '_ok'] = (df[item].diff() == 0) & (df['Date_diff'] == timedelta(1))
df_output = df.loc[(df[[item + '_ok' for item in check]]).any(axis=1)]
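To turn those flag columns into the Comment column requested in the question, a possible follow-up sketch (building on the *_ok columns created above):
comments = []
for item in check:
    for d in df.loc[df[item + '_ok'], 'Series_Date']:
        comments.append((d, 'Value for %s data is same as previous day' % item))
output_data_df = pd.DataFrame(comments, columns=['Series_Date', 'Comment']).sort_values('Series_Date').reset_index(drop=True)
print(output_data_df)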
I'm not sure it is the cleanest way to do it, but it works:
check = {'1M', 'SP'}
prev_dict = {c: None for c in check}
def check_prev_value(row):
    global prev_dict
    msg = ""
    # MAYBE add clause to check if both are equal
    for column in check:
        if row[column] == prev_dict[column]:
            msg = 'Value for %s data is same as previous day' % column
        prev_dict[column] = row[column]
    return msg
df['comment'] = df.apply(check_prev_value, axis=1)
output_data_df = df[df['comment'] != ""]
output_data_df = output_data_df[["Series_Date", "comment"]].reset_index(drop=True)
For your input:
Series_Date SP 1M 3M
0 2017-03-10 35.6 -7.8 24
1 2017-03-13 56.7 56.0 -31
2 2017-03-14 41.0 56.0 53
3 2017-03-15 41.0 -3.4 5
The output is:
Series_Date comment
0 2017-03-14 Value for 1M data is same as previous day
1 2017-03-15 Value for SP data is same as previous day
Reference: this answer
cols = ['1M','SP']
for col in cols:
    df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()
Output column will have an integer greater than zero when a duplicate is found.
df:
Series_Date SP 1M 3M 1M_dup SP_dup
0 2017-03-10 35.6 -7.8 24 0 0
1 2017-03-13 56.7 56.0 -31 0 0
2 2017-03-14 41.0 56.0 53 1 0
3 2017-03-15 41.0 -3.4 5 0 1
Slice to find dups:
col = 'SP'
dup_df = df[df[col + '_dup'] > 0][['Series_Date', col + '_dup']]
dup_df:
Series_Date SP_dup
3 2017-03-15 1
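As a quick sketch of what the consecutive-run idiom does on the 'SP' column (intermediate values written out, assuming the demo df from the question):
runs = (df['SP'] != df['SP'].shift()).cumsum()  # run id: 1, 2, 3, 3 (a new id whenever the value changes)
df['SP_dup'] = df['SP'].groupby(runs).cumcount()  # position within each run: 0, 0, 0, 1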
Here is a function version of the above (with the added feature of handling multiple columns):
import pandas as pd
import numpy as np
def find_repeats(df, col_list, date_col='Series_Date'):
    dummy_df = df[[date_col, *col_list]].copy()
    dates = dummy_df[date_col]
    date_series = []
    code_series = []
    if len(col_list) > 1:
        for col in col_list:
            these_repeats = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount().values
            repeat_idx = list(np.where(these_repeats > 0)[0])
            date_arr = dates.iloc[repeat_idx]
            code_arr = [col] * len(date_arr)
            date_series.extend(list(date_arr))
            code_series.extend(code_arr)
        return pd.DataFrame({date_col: date_series, 'col_dup': code_series}).sort_values(date_col).reset_index(drop=True)
    else:
        col = col_list[0]
        dummy_df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()
        return dummy_df[dummy_df[col + '_dup'] > 0].reset_index(drop=True)
find_repeats(df, ['1M'])
Series_Date 1M 1M_dup
0 2017-03-14 56.0 1
find_repeats(df, ['1M', 'SP'])
Series_Date col_dup
0 2017-03-14 1M
1 2017-03-15 SP
And here is another way using pandas diff:
def find_repeats(df, col_list, date_col='Series_Date'):
    code_list = []
    dates = list()
    for col in col_list:
        these_dates = df[date_col].iloc[np.where(df[col].diff().values == 0)[0]].values
        code_arr = [col] * len(these_dates)
        dates.extend(list(these_dates))
        code_list.extend(code_arr)
    return pd.DataFrame({date_col: dates, 'val_repeat': code_list}).sort_values(date_col).reset_index(drop=True)
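Usage of this variant is the same as before; on the question's data the result should match the col_dup frame shown above, with the column named val_repeat instead:
print(find_repeats(df, ['1M', 'SP']))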

drop NaN in pandas python

Can't figure out why dropna() is not dropping cells with NaN values?
Help please, I've gone through the pandas documentation and don't know what I'm doing wrong.
import pandas as pd
import quandl
df = quandl.get("GOOG/NYSE_SPY")
df2 = quandl.get("YAHOO/AAPL")
date = pd.date_range('2010-01-01', periods = 365)
df3 = pd.DataFrame(index = date)
df3 = df3.join(df['Open'], how = 'inner')
df3.rename(columns = {'Open': 'SPY'}, inplace = True)
df3 = df3.join(df2['Open'], how = 'inner')
df3.rename(columns = {'Open': 'AAPL'}, inplace = True)
df3['Spread'] = df3['SPY'] / df3['AAPL']
df3 = df3 / df3.ix[0]
df3.dropna(how = 'any')
df3.plot()
print(df3)
Change df3.dropna(how = 'any') to df3 = df3.dropna(how = 'any').
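dropna returns a new DataFrame by default, so the result has to be kept. A minimal alternative is the in-place form:
df3.dropna(how='any', inplace=True)  # modifies df3 directly instead of returning a new frame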
I tried to replicate your problem with a simple csv file:
In [6]: df
Out[6]:
a b
0 1.0 3.0
1 2.0 NaN
2 NaN 6.0
3 5.0 3.0
Both df.dropna(how='any') and df1 = df.dropna(how='any') work; even just df.dropna() works. I am wondering whether your issue is because you are performing a division in the previous line:
df3 = df3 / df3.ix[0]
df3.dropna(how = 'any')
For instance, if I divide by df.ix[1], since one of the elements is a NaN, it converts all elements of a column in the result to NaN, and then if I remove NaNs using dropna, it will remove all rows:
In [17]: df.ix[1]
Out[17]:
a 2.0
b NaN
Name: 1, dtype: float64
In [18]: df2 = df / df.ix[1]
In [19]: df2
Out[19]:
a b
0 0.5 NaN
1 1.0 NaN
2 NaN NaN
3 2.5 NaN
In [20]: df2.dropna()
Out[20]:
Empty DataFrame
Columns: [a, b]
Index: []

Vectorizing a multiplication and dict mapping on a Pandas DataFrame without iterating?

I have a Pandas DataFrame, df:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
and a dict, mask:
mask = {1:32,2:64,3:100,4:200}
I want my end result to be a DataFrame like this:
A B C
1 1 32
2 2 64
2 3 96
4 4 400
nan nan nan
Right now I am doing this, which seems inefficient:
for idx, row in df.iterrows():
    if not math.isnan(row['A']):
        if row['A'] != 1:
            df.loc[idx, 'C'] = row['B'] * mask[row['A'] - 1]
        else:
            df.loc[idx, 'C'] = row['B'] * mask[row['A']]
Is there an easy way to vectorize this?
This should work:
df['C'] = df.B * (df.A - (df.A != 1)).map(mask)
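A short breakdown of that expression on the sample data (intermediate values written out for clarity):
# df.A != 1           -> [False, True, True, True, True]  (booleans act as 0/1 in arithmetic)
# df.A - (df.A != 1)  -> [1.0, 1.0, 1.0, 3.0, NaN]        (the key to look up in mask)
# .map(mask)          -> [32, 32, 32, 100, NaN]
# df.B * ...          -> [32, 64, 96, 400, NaN]
df['C'] = df.B * (df.A - (df.A != 1)).map(mask)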
Timing
Setup for 10,000 rows:
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(2000)])
Setup for 100,000 rows:
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(20000)])
Here is an option using apply and the dictionary get method, which returns None if the key is not in the dictionary:
df['C'] = df.apply(lambda r: mask.get(r.A) if r.A == 1 else mask.get(r.A - 1), axis = 1) * df.B
df
# A B C
#0 1 1 32
#1 2 2 64
#2 2 3 96
#3 4 4 400
#4 NaN 5 NaN
