I'm processing data from a storage device readout in this format:
id:name:UPS_serial_number:WWNN:status:IO_group_id:IO_group_name:config_node:UPS_unique_id:hardware:iscsi_name:iscsi_alias:panel_name:enclosure_id:canister_id:enclosure_serial_number:site_id:site_name
10:node_A::00A550:online:0:io_grp0:yes::SV1:iqn.1986-03.com:2145.test.nodeA::A:::::
15:node_B::00A548:online:0:io_grp0:no::SV1:iqn.1986-03.com.:2145.test.nodeB::B:::::
How can I read that data as a 2D array, like datarray['15']['status']?
I tried it this way:
# Create array
datarray = []
# Loop through the list of lines
for i, x in enumerate(lis):
    # Split on the delimiter
    linesplit = x.split(":")
    row = []
    for lsi, lsx in enumerate(linesplit):
        row.append([lsi, lsx])
    datarray.append(row)
But that seems to slice the data wrong:
[[[0, u'id'], [1, u'name'], [2, u'UPS_serial_number'], [3, u'WWNN'], [4, u'status'], [5, u'IO_group_id'], [6, u'IO_group_name'], [7, u'config_node'], [8, u'UPS_unique_id'], [9, u'hardware'], [10, u'iscsi_name'], [11, u'iscsi_alias'], [12, u'panel_name'], [13, u'enclosure_id'],
Use a csv.DictReader to read the individual lines as dictionaries, and then use a dictionary comprehension to create the "outer" dict mapping the ID attribute to the inner dict with that ID.
raw = """id:name:UPS_serial_number:WWNN:status:IO_group_id:IO_group_name:config_node:UPS_unique_id:hardware:iscsi_name:iscsi_alias:panel_name:enclosure_id:canister_id:enclosure_serial_number:site_id:site_name
10:node_A::00A550:online:0:io_grp0:yes::SV1:iqn.1986-03.com:2145.test.nodeA::A:::::
15:node_B::00A548:online:0:io_grp0:no::SV1:iqn.1986-03.com.:2145.test.nodeB::B:::::"""
reader = csv.DictReader(raw.splitlines(), delimiter=":")
result = {line["id"]: line for line in reader}
print(result["15"]["status"]) # 'online'
Note that this is not a 2D array but a dictionary of dictionaries (dictionaries being associative arrays); with a plain 2D array, a lookup like result["15"]["status"] would not work.
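If the readout comes from a file rather than a string, the same approach works on the open file handle (the filename below is just a placeholder):
import csv

with open("lsnode_output.txt") as fh:  # placeholder filename
    reader = csv.DictReader(fh, delimiter=":")
    result = {line["id"]: line for line in reader}

print(result["15"]["status"])  # online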
From what I can make out, the data is colon (:) separated and the first line is a header. If that is the case, you can load it into a pandas DataFrame the same way you load a CSV file, with separator ':', and then convert that DataFrame to a NumPy array.
import pandas as pd
import os
os.chdir('/Users/Downloads/')
df = pd.read_csv('train.txt',sep=':')
df
id name UPS_serial_number WWNN status IO_group_id IO_group_name config_node UPS_unique_id hardware iscsi_name iscsi_alias panel_name enclosure_id canister_id enclosure_serial_number site_id site_name
10 node_A NaN 00A550 online 0 io_grp0 yes NaN SV1 iqn.1986-03.com 2145.test.nodeA NaN A NaN NaN NaN NaN NaN
15 node_B NaN 00A548 online 0 io_grp0 no NaN SV1 iqn.1986-03.com. 2145.test.nodeB NaN B NaN NaN NaN NaN NaN
df.as_matrix()
array([['node_A', nan, '00A550', 'online', 0, 'io_grp0', 'yes', nan,
'SV1', 'iqn.1986-03.com', '2145.test.nodeA', nan, 'A', nan, nan,
nan, nan, nan],
['node_B', nan, '00A548', 'online', 0, 'io_grp0', 'no', nan,
'SV1', 'iqn.1986-03.com.', '2145.test.nodeB', nan, 'B', nan, nan,
nan, nan, nan]], dtype=object)
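Note that as_matrix() was deprecated in pandas 0.23 and removed in 1.0.0, so on current pandas the last step would instead be:
df.to_numpy()  # returns the same object-dtype array on modern pandas versions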
I have a rather large dataset, and need to find in that dataset extreme values, including coordinates.
The real dataset is much larger, but let's take this one for testing:
import xarray as xr
import numpy as np
import pandas as pd
values = np.array(
[[[3, 1, 1],
[1, 1, 1],
[1, 1, 1]],
[[1, 1, 1],
[1, 1, 1],
[1, 1, 4]],
[[1, 1, 1],
[1, 1, 1],
[1, 1, 5]]]
)
da = xr.DataArray(values, dims=('time', 'lat', 'lon'),
coords={'time': list(range(3)), 'lat': list(range(3)), 'lon':list(range(3))})
I want to find all values larger than 2 in this DataArray. I found this solution here:
da.where(da>2, drop=True)
but even in this small example, this produces a lot more nans than values:
array([[[ 3., nan],
[nan, nan]],
[[nan, nan],
[nan, 4.]],
[[nan, nan],
[nan, 5.]]])
and it's worse in my actual dataset.
I've tried to write a helper function to convert it to a pandas dataframe, like this:
def find_val(da):
res = pd.DataFrame(columns=['Time', 'Latitude', 'Longitude', 'Value'])
for time_idx, time in enumerate(da['time']):
for lat_idx, lat in enumerate(da['lat']):
for lon_idx, lon in enumerate(da['lon']):
value = da.isel(time=time_idx, lat=lat_idx, lon=lon_idx).item()
if not np.isnan(value):
res.loc[len(res.index)] = [time.item(), lat.item(), lon.item(), value]
return res
find_val(da.where(da>2, drop=True))
This produces the output I want, but three nested for loops seem excessive.
Time Latitude Longitude Value
0 0.0 0.0 0.0 3.0
1 1.0 1.0 1.0 4.0
2 2.0 1.0 1.0 5.0
Any good suggestions on how to improve this?
There is already a built-in way of converting to pandas:
DataArray.to_dataframe(name=None, dim_order=None)
https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_dataframe.html
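A rough sketch of how that could replace the nested loops (the column name 'Value' is an arbitrary choice; rename the index columns afterwards if you want Time/Latitude/Longitude):
# keep the full shape (no drop=True) and let pandas drop the NaN rows
df = (da.where(da > 2)
        .to_dataframe(name='Value')   # MultiIndex (time, lat, lon) -> column 'Value'
        .dropna()                     # remove all the NaN combinations
        .reset_index())               # turn the coords into ordinary columns
print(df)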
On a side note, if you are looking to remove extreme values without a specific range, you might want to check out outlier detection:
https://scikit-learn.org/stable/modules/outlier_detection.html
I need to assign a stack of rows from a pd.DataFrame by index to a matrix of known size as efficiently as possible. Many of the indices will not exist in the dataframe, and these rows must be filled with NaNs. This operation will happen inside a loop iterating over each row in the dataframe and must be as fast as possible.
In short, I want to speed up the following sequence:
# DF to iterate over
df = pd.DataFrame({'a': [1,2,3,4], 'b': [2,4,6,8], 'c': [3,6,9,12]})
# (Relative) DF indices to return if possible
indices = [-2, 0, 2, 4]
len_df = len(df)
len_indices = len(indices)
len_cols = len(df.columns)
values = np.zeros((len_df, len_indices, len_cols))
for i in range(len_df):
for j, n in enumerate(indices):
idx = i + n
try:
assert idx >= 0 # avoid wrapping around
values[i, j] = df.iloc[idx]
except:
values[i, j] = np.nan
values
Returns
[[[nan, nan, nan],
[ 1, 2, 3],
[ 3, 6, 9],
[nan, nan, nan]],
[[nan, nan, nan],
[ 2, 4, 6],
[ 4, 8, 12],
[nan, nan, nan]],
[[ 1, 2, 3],
[ 3, 6, 9],
[nan, nan, nan],
[nan, nan, nan]],
[[ 2, 4, 6],
[ 4, 8, 12],
[nan, nan, nan],
[nan, nan, nan]]]
As desired, but is quite slow. Any suggestions?
Here's a solution that builds a lookup table:
# repeat indices into new dimension
rep_ids = np.tile(indices, (len(df), 1))
# add row index to the tiled ids
range_ids = rep_ids + np.arange(len(df))[:, np.newaxis]
# check which ids make a valid lookup in the df
# (>= 0 matches the `assert idx >= 0` in the original loop)
valid_test = (range_ids >= 0) & (range_ids < len(df))
# lookup table, using 0 for invalid ids (will be overwritten later)
lookup_ids = np.where(valid_test, range_ids, 0)
# lookup all values
values = df.values[lookup_ids].astype(np.float64)
# set the invalids to nan
values[~valid_test] = np.nan
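To sanity-check the vectorized result against the loop above (values_loop is an assumed name for the output of the original loop):
import numpy as np

np.testing.assert_allclose(values, values_loop)  # NaNs compare equal by default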
I have a 1d NumPy array with NaN and non-NaN values:
arr = np.array([4, np.nan, np.nan, 3, np.nan, 5])
I need to replace NaN values with the previous non-NaN value and replace non-NaN values with NaN, as below:
result = np.array([np.nan, 4, 4, np.nan, 3, 5])
Thanks in advance.
Tommaso
If Pandas is available, you can use ffill(), and then replace the original non-NaN values with a boolean mask:
import numpy as np
import pandas as pd
arr2 = pd.Series(arr).ffill()
mask = ~np.isnan(arr) # all elements which started non-NaN
mask[-1] = False # last element will never forward-fill
arr2[mask] = np.nan # replace non-NaNs with NaNs
Output:
arr2
0 NaN
1 4.0
2 4.0
3 NaN
4 3.0
5 5.0
dtype: float64
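Since the question asks for a NumPy array and arr2 is a pandas Series, a final conversion could be:
result = arr2.to_numpy()  # array([nan, 4., 4., nan, 3., 5.])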
If you're not using pandas:
arr = np.array([4.1, np.nan, np.nan, 3.1, np.nan, 5.1])
arr2 = np.array([])
val = 0
for n in arr:
    if not np.isnan(n):
        # remember the value, but emit NaN in its place
        val = n
        arr2 = np.append(arr2, np.nan)
    else:
        # a NaN becomes the most recently seen value
        arr2 = np.append(arr2, val)
output:
arr
array([4.1, nan, nan, 3.1, nan, 5.1])
arr2
array([nan, 4.1, 4.1, nan, 3.1, nan])
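A loop-free, NumPy-only variant is also possible. This is a sketch using the usual index-propagation trick for forward fill; it assumes the array starts with a non-NaN value, as in the example, and keeps the last element unchanged like the pandas answer does:
import numpy as np

arr = np.array([4.0, np.nan, np.nan, 3.0, np.nan, 5.0])

nan_mask = np.isnan(arr)
# index of the most recent non-NaN element at each position
idx = np.where(~nan_mask, np.arange(arr.size), 0)
np.maximum.accumulate(idx, out=idx)
filled = arr[idx]                 # forward-filled copy of arr

result = filled.copy()
result[~nan_mask] = np.nan        # original non-NaNs become NaN...
result[-1] = arr[-1]              # ...except the last element
print(result)                     # [nan  4.  4. nan  3.  5.]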
I have a Dataframe in which some columns contain wrong information. This wrong information is always before a longer sequence of NaN values. Let's imagine I have the following dataset:
import pandas as pd
from numpy import nan
d = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [0.1, 0.1, nan, nan, nan, nan, 100, 101]}
df = pd.DataFrame(data=d)
"Obs1" is without wrong information, while "Obs2" has wrong values before the 4-NaN sequence. Does anyone know how to find such a longer sequence in a timeseries (e.g. an occurence of 4 NaN values), to then fill all previous entries with NaN? To give an example, my desired Output would be:
Output = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [nan, nan, nan, nan, nan, nan, 100, 101]}
Thanks in advance
For each column, check whether the i-th and (i+1)-th elements are both NaN, and find the largest index i for which that holds.
See the following code.
for col in df.columns:
cond = df[col].iloc[1:].isnull() + df[col].iloc[:-1].isnull() == 2
if sum(cond) >= 2:
df[col].iloc[:cond.index[-1] - 1] = nan
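If the trigger is specifically "a run of at least N consecutive NaNs", an alternative sketch that labels the NaN runs explicitly might look like this (min_run is an assumed threshold of 4 here):
import pandas as pd
from numpy import nan

d = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8],
     'Obs2': [0.1, 0.1, nan, nan, nan, nan, 100, 101]}
df = pd.DataFrame(data=d)

min_run = 4  # treat a run of at least this many NaNs as the cut-off

for col in df.columns:
    isna = df[col].isna()
    # label consecutive runs: the id changes whenever NaN-ness flips
    run_id = (isna != isna.shift()).cumsum()
    run_len = isna.groupby(run_id).transform('size')
    long_nan_run = isna & (run_len >= min_run)
    if long_nan_run.any():
        # blank out everything up to and including the last long NaN run
        cut = long_nan_run[long_nan_run].index[-1]
        df.loc[:cut, col] = nan

print(df)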
I would like to convert everything but the first column of a pandas dataframe into a numpy array. For some reason using the columns= parameter of DataFrame.as_matrix() is not working.
df:
viz a1_count a1_mean a1_std
0 n 3 2 0.816497
1 n 0 NaN NaN
2 n 2 51 50.000000
I tried X=df.as_matrix(columns=[df[1:]]) but this yields an array of all NaNs
The easy way is the .values property: df.iloc[:,1:].values
a=df.iloc[:,1:]
b=df.iloc[:,1:].values
print(type(df))
print(type(a))
print(type(b))
which prints the following types:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
Use the pandas to_numpy() method. Below is an example:
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1, 2], "B":[3, 4], "C":[5, 6]})
>>> df
A B C
0 1 3 5
1 2 4 6
>>> s_array = df[["A", "B", "C"]].to_numpy()
>>> s_array
array([[1, 3, 5],
[2, 4, 6]])
>>> t_array = df[["B", "C"]].to_numpy()
>>> print (t_array)
[[3 5]
[4 6]]
Hope this helps. You can select any number of columns using
columns = ['col1', 'col2', 'col3']
df1 = df[columns]
Then apply to_numpy() method.
The columns parameter accepts a collection of column names. You're passing a list containing a dataframe with two rows:
>>> [df[1:]]
[ viz a1_count a1_mean a1_std
1 n 0 NaN NaN
2 n 2 51 50]
>>> df.as_matrix(columns=[df[1:]])
array([[ nan, nan],
[ nan, nan],
[ nan, nan]])
Instead, pass the column names you want:
>>> df.columns[1:]
Index(['a1_count', 'a1_mean', 'a1_std'], dtype='object')
>>> df.as_matrix(columns=df.columns[1:])
array([[ 3. , 2. , 0.816497],
[ 0. , nan, nan],
[ 2. , 51. , 50. ]])
Hope this easy one-liner helps:
cols_as_np = df[df.columns[1:]].to_numpy()
The best way to convert to a NumPy array is '.to_numpy(self, dtype=None, copy=False)', which is new in version 0.24.0. Reference
You can also use '.array' on a Series. Reference
Pandas .as_matrix has been deprecated since version 0.23.0.
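A minimal illustration of both, using the df from the question (the column picked for .array is arbitrary):
cols_array = df[df.columns[1:]].to_numpy()  # ndarray of everything but the first column
one_col = df['a1_count'].array              # .array returns the column's ExtensionArray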
Instead of .as_matrix(), use .values, because the former was deprecated. Otherwise, on newer pandas versions you will get:
'DataFrame' object has no attribute 'as_matrix'
The fastest and easiest way is to use .as_matrix(). One short line:
df.iloc[:,[1,2,3]].as_matrix()
Gives:
array([[3, 2, 0.816497],
[0, 'NaN', 'NaN'],
[2, 51, 50.0]], dtype=object)
By using indices of the columns, you can use this code for any dataframe with different column names.
Here are the steps for your example:
import pandas as pd
columns = ['viz', 'a1_count', 'a1_mean', 'a1_std']
index = [0,1,2]
vals = {'viz': ['n','n','n'], 'a1_count': [3,0,2], 'a1_mean': [2,'NaN', 51], 'a1_std': [0.816497, 'NaN', 50.000000]}
df = pd.DataFrame(vals, columns=columns, index=index)
Gives:
viz a1_count a1_mean a1_std
0 n 3 2 0.816497
1 n 0 NaN NaN
2 n 2 51 50
Then:
x1 = df.iloc[:,[1,2,3]].as_matrix()
Gives:
array([[3, 2, 0.816497],
[0, 'NaN', 'NaN'],
[2, 51, 50.0]], dtype=object)
Here x1 is a numpy.ndarray.
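Since .as_matrix() was removed in pandas 1.0, a drop-in equivalent of that line on current pandas would be:
x1 = df.iloc[:, [1, 2, 3]].to_numpy()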