I have a DataFrame in which some columns contain wrong information. This wrong information always occurs right before a longer sequence of NaN values. Let's imagine I have the following dataset:
import pandas as pd
from numpy import nan
d = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [0.1, 0.1, nan, nan, nan, nan, 100, 101]}
df = pd.DataFrame(data=d)
"Obs1" is without wrong information, while "Obs2" has wrong values before the 4-NaN sequence. Does anyone know how to find such a longer sequence in a timeseries (e.g. an occurence of 4 NaN values), to then fill all previous entries with NaN? To give an example, my desired Output would be:
Output = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [nan, nan, nan, nan, nan, nan, 100, 101]}
Thanks in advance
For each column, check whether the i'th and (i+1)'th elements are both NaN, and find the largest index i for which that holds. See the following code.
for col in df.columns:
    # True at position i when both element i and element i+1 are NaN
    cond = df[col].isnull() & df[col].shift(-1).isnull()
    if cond.sum() >= 2:
        last = cond[cond].index[-1]       # where the last NaN pair starts
        df.loc[:last + 1, col] = nan      # blank everything up to the end of that pair
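With the example frame above, this should reproduce the desired output; a quick check against the expected values from the question:
# expected result taken from the question
expected = pd.DataFrame({'Obs1': [1, 2, 3, 4, 5, 6, 7, 8],
                         'Obs2': [nan, nan, nan, nan, nan, nan, 100, 101]})
print(df.equals(expected))  # should print True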
I need to assign a stack of rows from a pd.DataFrame by index to a matrix of known size as efficiently as possible. Many of the indices will not exist in the dataframe, and these rows must be filled with NaNs. This operation will happen inside a loop iterating over each row in the dataframe and must be as fast as possible.
In short, I want to speed up the following sequence:
import numpy as np
import pandas as pd

# DF to iterate over
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [3, 6, 9, 12]})
# (Relative) DF indices to return if possible
indices = [-2, 0, 2, 4]
len_df = len(df)
len_indices = len(indices)
len_cols = len(df.columns)
values = np.zeros((len_df, len_indices, len_cols))
for i in range(len_df):
    for j, n in enumerate(indices):
        idx = i + n
        try:
            assert idx >= 0  # avoid wrapping around to the end of the frame
            values[i, j] = df.iloc[idx]
        except (AssertionError, IndexError):
            values[i, j] = np.nan
values
Returns
[[[nan, nan, nan],
[ 1, 2, 3],
[ 3, 6, 9],
[nan, nan, nan]],
[[nan, nan, nan],
[ 2, 4, 6],
[ 4, 8, 12],
[nan, nan, nan]],
[[ 1, 2, 3],
[ 3, 6, 9],
[nan, nan, nan],
[nan, nan, nan]],
[[ 2, 4, 6],
[ 4, 8, 12],
[nan, nan, nan],
[nan, nan, nan]]]
As desired, but it is quite slow. Any suggestions?
Here's a solution that builds a lookup table:
# repeat indices into a new dimension
rep_ids = np.tile(indices, (len(df), 1))
# add the row index to the tiled ids
range_ids = rep_ids + np.arange(len(df))[:, np.newaxis]
# check which ids make a valid lookup in the df
# (>= 0 matches the `assert idx >= 0` in the original loop)
valid_test = (range_ids >= 0) & (range_ids < len(df))
# lookup table, using 0 for invalid ids (overwritten later)
lookup_ids = np.where(valid_test, range_ids, 0)
# look up all values at once
values = df.values[lookup_ids].astype(np.float64)
# set the invalid positions to NaN
values[~valid_test] = np.nan
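As a quick sanity check (assuming the loop-based result from the question was kept in a separate variable, here called values_loop, which is not in the original code), the two arrays should agree, NaNs included:
# compare the vectorized result against the loop-based one, treating NaNs as equal
print(np.array_equal(values, values_loop, equal_nan=True))  # should print True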
I have a 1-D NumPy array containing both NaN and non-NaN values:
arr = np.array([4, np.nan, np.nan, 3, np.nan, 5])
I need to replace each NaN value with the previous non-NaN value, and replace the non-NaN values with NaN, as per below:
result = np.array([np.nan, 4, 4, np.nan, 3, 5])
Thanks in advance.
Tommaso
If Pandas is available, you can use ffill(), and then replace the original non-NaN values with a boolean mask:
import numpy as np
import pandas as pd

arr2 = pd.Series(arr).ffill()   # forward-fill the NaNs
mask = ~np.isnan(arr)           # all elements which started non-NaN
mask[-1] = False                # the last value never fills anything after it, so keep it
arr2[mask] = np.nan             # replace the original non-NaNs with NaN
Output:
arr2
0 NaN
1 4.0
2 4.0
3 NaN
4 3.0
5 5.0
dtype: float64
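If a plain NumPy array is needed rather than a pandas Series, the result can be converted back, e.g.:
result = arr2.to_numpy()   # array([nan,  4.,  4., nan,  3.,  5.])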
If not using Pandas, then:
import numpy as np

arr = np.array([4.1, np.nan, np.nan, 3.1, np.nan, 5.1])
arr2 = np.array([])
val = 0
for n in arr:
    if not np.isnan(n):
        # remember the value and mark this position as NaN
        val = n
        arr2 = np.append(arr2, np.nan)
    else:
        # fill the NaN with the last value seen
        arr2 = np.append(arr2, val)
Note that, unlike the Pandas version above, this also turns the trailing non-NaN value into NaN, as the output shows.
Output:
arr
array([4.1, nan, nan, 3.1, nan, 5.1])
arr2
array([nan, 4.1, 4.1, nan, 3.1, nan])
I have a pandas DataFrame df
L C
0 [1, 2, 3] 5
1 [4, nan, 6] 0
2 [nan, nan, nan] 15
and another DataFrame other
C
0 0
1 25
2 0
Then I append other to df, which adds 3 rows with NaN values in the L column:
L C
0 [1, 2, 3] 5
1 [4, nan, 6] 0
2 [nan, nan, nan] 15
0 NaN 0
1 NaN 25
2 NaN 0
I want to create a column that gets the value 1 if the L column is NaN and C is 0, and the value 0 otherwise. I also make computations with rows that do not contain NaN values, but that is outside the scope of this post.
I found that the way Pandas detects NaN values is pd.isna().
I created the function
def check_cols(L, C):
    if pd.isna(L) and C == 0:
        return 1
    elif pd.isna(L) and C != 0:
        return 0
and I apply the function on every row
df['col'] = df.apply(lambda row: check_cols(row.L,row.C), axis=1)
but I get the error
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
because it checks every element of the list for NaN. I don't want to check whether the elements of the list are NaN or not; I want to check whether the cell holds a list (even one whose elements are all NaN) or a single NaN value. Another way to do it is to create a column with pd.isna() like this:
L C is_NaN
0 [1, 2, 3] 5 False
1 [4, nan, 6] 0 False
2 [nan, nan, nan] 15 False
0 NaN 0 True
1 NaN 25 True
2 NaN 0 True
and then pass the three columns as arguments to the function, which works. However, I want to do the same check (is it a list or a NaN value?) inside the function, without having to create the extra column.
It would also be great if someone could explain why the first approach checks every element of the list while the second does the check I want, and/or point me to some sources.
The reason behind the exception is that you should use & instead of and, and also that an if condition cannot evaluate the result, because the output is a Series of booleans rather than a single True or False. Example:
pd.isna(df.L) & df.C == 0
0 True
1 True
2 True
0 True
1 False
2 True
dtype: bool
The result above cannot be evaluated by an if condition.
Here's a solution that directly returns the condition you mentioned:
import pandas as pd
import numpy as np

def check_cols(L, C):
    return pd.isna(L) & (C == 0)

data = {
    'L': [[1, 2, 3], [4, np.nan, 6], [np.nan, np.nan, np.nan], np.nan, np.nan, np.nan],
    'C': [5, 0, 15, 0, 25, 0]}
df = pd.DataFrame(data=data, index=[0, 1, 2, 0, 1, 2])
res = check_cols(df.L, df.C)
df['res'] = res
df
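Since the question asks for 1/0 rather than True/False, the boolean result can also be cast to integers, for example:
df['res'] = res.astype(int)   # 1 where L is NaN and C == 0, else 0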
EDIT: Updated solution according to the comments
Then the issue is that you are applying pd.isna to a list - for example, L = [1, 2, 3] in the first row - and that result cannot be evaluated by the if condition.
import pandas as pd
import numpy as np

def check_cols(L, C):
    if not isinstance(L, list) and np.isnan(L) and C == 0:
        return 1
    elif not isinstance(L, list) and np.isnan(L) and C != 0:
        return 0
    else:
        # L is a list
        return 1

data = {
    'L': [[1, 2, 3], [4, np.nan, 6], [np.nan, np.nan, np.nan], np.nan, np.nan, np.nan],
    'C': [5, 0, 15, 0, 25, 0]
}
df = pd.DataFrame(data=data, index=[0, 1, 2, 0, 1, 2])
df['col'] = df.apply(lambda row: check_cols(row.L, row.C), axis=1)
df
EDIT 2: I decided to go for np.isnan, but it also works with pd.isna.
I'm processing data from a readout from a storage device in this format:
id:name:UPS_serial_number:WWNN:status:IO_group_id:IO_group_name:config_node:UPS_unique_id:hardware:iscsi_name:iscsi_alias:panel_name:enclosure_id:canister_id:enclosure_serial_number:site_id:site_name
10:node_A::00A550:online:0:io_grp0:yes::SV1:iqn.1986-03.com:2145.test.nodeA::A:::::
15:node_B::00A548:online:0:io_grp0:no::SV1:iqn.1986-03.com.:2145.test.nodeB::B:::::
How can I read that data as a 2D array, like datarray['15']['status']?
I tried it this way:
# Create the result array
datarray = []
# Loop through the list of readout lines (lis)
for i, x in enumerate(lis):
    # Split on the delimiter
    linesplit = x.split(":")
    row = []
    for lsi, lsx in enumerate(linesplit):
        row.append([lsi, lsx])
    datarray.append(row)
But that seems to slice the data wrong:
[[[0, u'id'], [1, u'name'], [2, u'UPS_serial_number'], [3, u'WWNN'], [4, u'status'], [5, u'IO_group_id'], [6, u'IO_group_name'], [7, u'config_node'], [8, u'UPS_unique_id'], [9, u'hardware'], [10, u'iscsi_name'], [11, u'iscsi_alias'], [12, u'panel_name'], [13, u'enclosure_id'],
Use a csv.DictReader to read the individual lines as dictionaries and then use a dictionary comprehension to create the "outer" dict mapping the ID attribute to the inner dicts with that ID.
raw = """id:name:UPS_serial_number:WWNN:status:IO_group_id:IO_group_name:config_node:UPS_unique_id:hardware:iscsi_name:iscsi_alias:panel_name:enclosure_id:canister_id:enclosure_serial_number:site_id:site_name
10:node_A::00A550:online:0:io_grp0:yes::SV1:iqn.1986-03.com:2145.test.nodeA::A:::::
15:node_B::00A548:online:0:io_grp0:no::SV1:iqn.1986-03.com.:2145.test.nodeB::B:::::"""
reader = csv.DictReader(raw.splitlines(), delimiter=":")
result = {line["id"]: line for line in reader}
print(result["15"]["status"]) # 'online'
Note that this is not a 2D array but a dictionary of dictionaries (a dictionary being an associative array). With a plain 2D array, a query like result["15"]["status"] would not work.
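In practice the lines would likely come from a file rather than a string literal; assuming a hypothetical file name readout.txt, the same approach looks like:
import csv

with open("readout.txt", newline="") as fh:  # hypothetical file name
    reader = csv.DictReader(fh, delimiter=":")
    result = {line["id"]: line for line in reader}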
What I can make out of the data is that it is colon (:) separated and the first line is a header. If that is the case, you can load it into a pandas DataFrame the same way you load a CSV file, using separator ':', and then convert that DataFrame to a NumPy array.
import pandas as pd
import os
os.chdir('/Users/Downloads/')
df = pd.read_csv('train.txt',sep=':')
df
id name UPS_serial_number WWNN status IO_group_id IO_group_name config_node UPS_unique_id hardware iscsi_name iscsi_alias panel_name enclosure_id canister_id enclosure_serial_number site_id site_name
10 node_A NaN 00A550 online 0 io_grp0 yes NaN SV1 iqn.1986-03.com 2145.test.nodeA NaN A NaN NaN NaN NaN NaN
15 node_B NaN 00A548 online 0 io_grp0 no NaN SV1 iqn.1986-03.com. 2145.test.nodeB NaN B NaN NaN NaN NaN NaN
df.values  # convert the DataFrame to a NumPy array (as_matrix() is deprecated)
array([['node_A', nan, '00A550', 'online', 0, 'io_grp0', 'yes', nan,
'SV1', 'iqn.1986-03.com', '2145.test.nodeA', nan, 'A', nan, nan,
nan, nan, nan],
['node_B', nan, '00A548', 'online', 0, 'io_grp0', 'no', nan,
'SV1', 'iqn.1986-03.com.', '2145.test.nodeB', nan, 'B', nan, nan,
nan, nan, nan]], dtype=object)
How can I get the index of the median value for an array which contains NaNs?
For example, I have the array of values [NaN, 2, 5, NaN, 4, NaN, 3, 1] with a corresponding array of errors on those values [np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3]. The median is then 3, and its error is 0.4.
Is there a simple way to do this?
EDIT: I edited the error array to reflect a more realistic situation. And yes, I am using NumPy.
It's not really clear how you intend to meaningfully extract the error from the median, but if you happen to have an array such that the median is one of its entries, the corresponding error array is defined at that index, there are no other entries with the same value as the median, and probably several other disclaimers, then you can do the following:
import numpy as np

a = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
aerr = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# median, ignoring NaNs
amedian = np.median(a[np.isfinite(a)])
# find the index of the closest value to the median in a
idx = np.nanargmin(np.abs(a-amedian))
# this is the corresponding "error"
aerr[idx]
EDIT: as @DSM points out, if you have NumPy 1.9 or above, you can simplify the calculation of amedian as amedian = np.nanmedian(a).
numpy has everything you need:
import numpy as np

values = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
errors = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# filter
filtered = values[~np.isnan(values)]
# find median
median = np.median(filtered)
# find indexes
indexes = np.where(values == median)[0]
# find errors
errors[indexes] # array([ 0.4])
Let's say you have your list named "a"; then you can use this code to build a masked array without the NaN values and take the median with numpy.ma.median():
import numpy

a = [numpy.nan, 2, 5, numpy.nan, 4, numpy.nan, 3, 1]
am = numpy.ma.masked_array(a, [numpy.isnan(x) for x in a])
numpy.ma.median(am)
You can do the same for the errors as well.
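A possible sketch of that last step (reusing the masked median together with the nanargmin idea from the first answer; the errors list is the one from the question):
errors = [numpy.nan, 0.1, 0.2, numpy.nan, 0.1, numpy.nan, 0.4, 0.3]
med = float(numpy.ma.median(am))
# index of the entry closest to the median, ignoring NaNs
idx = numpy.nanargmin(numpy.abs(numpy.array(a) - med))
print(errors[idx])   # 0.4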