Suppose I have an H2OFrame called df. What is the quickest way to get the values of column x from said frame as a numpy array?
One could do
x_array = df['x'].as_data_frame()['x'].values
But that seems unnecessarily verbose. Especially passing via a pandas DataFrame with as_data_frame seems superfluous. I was hoping for something more elegant, e.g. df['x'].to_array(), but I can't find anything like it.
Here is another way, although I'm not sure it's faster. I'm using the h2o.as_list() function to convert the column to a list and then np.array() to convert that list to an array.
import h2o
import numpy as np
h2o.init()
# Using sample dataset from H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
## Creating np array from h2o frame column
np.array(h2o.as_list(train['x1']))
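Note that np.array() here typically gives a 2-D array of shape (n, 1), since the column comes back as a one-column table; if a flat 1-D array is wanted it can be raveled. A small sketch (the ravel step is my addition, not part of the original snippet):
x_array = np.array(h2o.as_list(train['x1'])).ravel()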
How can I get the same result I'm getting with pandas when using Dask?
The objective is to have a uniform time interval for each group, replicating the last value until we have a new one.
import pandas as pd
import numpy as np
import datetime
data=pd.DataFrame([["AAAA","2020-01-15",2],
["AAAA","2020-02-15",9],
["AAAA","2020-02-20",2],
["AAAA","2020-02-25",9],
["AAAA","2020-04-18",2],
["BBBB","2020-01-01",5],
["BBBB","2020-02-15",5],
["BBBB","2020-02-20",4],
["BBBB","2020-02-25",4],
["BBBB","2020-04-15",2],
["CCCC","2020-01-01",9],
["CCCC","2020-02-15",5],
["CCCC","2020-03-20",7],
["CCCC","2020-04-25",4],
["CCCC","2020-05-15",2]])
data.columns=['Asset','Date','P']
data['Date']=pd.to_datetime(data['Date'])
data.index=data['Date'].values
temp=data.groupby('Asset').resample('2D').pad()
temp
(This is just an example; the real-world application is really big.)
Thanks!
.resample() functionality is not fully replicated in the current version of dask. My suggestion would be to either look into xarray (if you want a grid-like structure) or use dask.delayed wrapped around pandas.
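For example, a minimal sketch of the dask.delayed route, assuming each per-asset group fits in memory (variable names follow the question; this is an illustration, not a drop-in solution):
import dask
import pandas as pd

@dask.delayed
def resample_asset(group):
    # forward-fill each asset onto a 2-day grid, as in the pandas example
    return group.set_index('Date').resample('2D').ffill()

parts = [resample_asset(g) for _, g in data.groupby('Asset')]
result = pd.concat(dask.compute(*parts))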
I need to create a multi-index for dask by concatenating two arrays (preferably dask arrays). I found the following solution for numpy, but looking for a dask solution
import numpy as np
cols = 100000
index = np.array([x1 + x2 + x3 for x1, x2, x3 in
                  zip(np.repeat(1, cols).astype('str'), np.repeat('-', cols), np.repeat(1, cols).astype('str'))])
If I pass it to da.from_array(), it balks at the + of two arrays.
I have also tried np.core.defchararray.add(); this works but converts the dask arrays to numpy arrays (as far as I can tell).
You might want to try da.map_blocks. You can write a numpy function that does whatever you want, and da.map_blocks will apply it blockwise to each of the numpy arrays that make up your dask array.
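A small sketch of the idea, assuming the string pieces from the question; the chunk size, np.char.add, and the '<U8' dtype are my choices for illustration:
import numpy as np
import dask.array as da

cols = 100000
left = da.from_array(np.repeat(1, cols).astype('str'), chunks=10000)
right = da.from_array(np.repeat(1, cols).astype('str'), chunks=10000)

def join_blocks(a, b):
    # np.char.add concatenates element-wise on the plain numpy blocks
    return np.char.add(np.char.add(a, '-'), b).astype('<U8')

index = da.map_blocks(join_blocks, left, right, dtype='<U8')
index.compute()  # array(['1-1', '1-1', ...], dtype='<U8')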
I am stuck with an issue when it comes to taking slices of my data in Python (I come from using Matlab).
So here is the code I'm using,
import scipy.io as sc
import math as m
import numpy as np
from scipy.linalg import expm, sinm, cosm
import matplotlib.pyplot as plt
import pandas as pd
import sys
data = pd.read_excel('DataDMD.xlsx')
print(data.shape)
print(data)
The output looks like this:
[screenshot: Output]
So I wish to take only certain rows (or, from my understanding, slices in Python) of this data matrix. The other problem I have is that the top row of my matrix becomes almost like the column titles instead of actual data points. So I have three problems:
1) I don't need the top of the matrix to have any 'titles' or anything of that sort, because it's all numeric and all of it is data.
2) I only need to take the 6th row of the whole matrix as a new data matrix.
3) I plan on using matrix multiplication later, so is pandas fine or do I need numpy?
So this is what I've tried,
data.iloc[0::6,:]
this gives me something like this:
[screenshot: Output2]
which is wrong because I don't need the values of 24.8 to be the 'title' but be the first row of the new matrix.
I've also tried using np.array for this, but my problem is that when I try to use iloc, it says (which makes sense):
'numpy.ndarray' object has no attribute 'iloc'
If anyone has any ideas, please let me know! Thanks!
To avoid loading the first record as the header, try using the following:
pd.read_excel('DataDMD.xlsx', header=None)
The read_excel function has a header argument; its value indicates which row of the data should be used as the header, and it defaults to 0. Pass header=None if none of the rows in your data function as a header.
There are many useful arguments, all described in the documentation of the function.
This should also help with number 2.
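For instance, a small sketch putting the two together (the variable names are mine; .to_numpy() is just one way to get a plain array for matrix multiplication later):
data = pd.read_excel('DataDMD.xlsx', header=None)
row6 = data.iloc[[5], :]       # 6th row (zero-based index 5) as a 1-row frame
row6_array = row6.to_numpy()   # plain numpy array, ready for matrix products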
Hope this helps.
Good luck!
I want to initialise an array that will hold some data. I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
To further explain my situation: I have data I need to store in an array. Say I have 8 rows of data. The number of elements in each row is not equal, so my matrix row length needs to be as long as the longest row. In other rows, some elements will not be filled. I don't want to use zeros since some of my data might actually be zeros.
I realise I could use some value I know my data will never take, but NaNs are definitely clearer. I'm just wondering if that can cause any issues later with processing. I realise I need to use nanmax instead of max, and so on.
I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
You can use np.full, for example:
np.full((100, 100), np.nan)
However, depending on your needs, you could have a look at numpy.ma for masked arrays or scipy.sparse for sparse matrices. They may or may not be suitable, though. Either way you may need to use different functions from the corresponding module instead of the normal numpy ufuncs.
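For example, a sketch of the masked-array route with made-up ragged data (names and values are illustrative only):
import numpy as np
import numpy.ma as ma

rows = [[1.0, 0.0, 2.5], [3.0], [0.0, 4.2]]
width = max(len(r) for r in rows)

data = np.full((len(rows), width), np.nan)   # pad with NaN up to the longest row
for i, r in enumerate(rows):
    data[i, :len(r)] = r

masked = ma.masked_invalid(data)             # mask out the NaN padding
masked.max()                                 # ignores the padding, unlike plain max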
A way I like to do it, which probably isn't the best but is easy to remember, is adding a 'nans' function to the numpy module like this:
import numpy as np
def nans(n):
    return np.array([np.nan for i in range(n)])

setattr(np, 'nans', nans)
and now you can simply use np.nans as if it were np.zeros:
np.nans(10)
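That said, the same result is available without monkey-patching: np.full(10, np.nan) produces an identical array and is arguably easier to maintain.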
I am wondering if there is a Python or Pandas function that approximates the Ruby #each_slice method. In this example, the Ruby #each_slice method will take the array or hash and break it into groups of 100.
var.each_slice(100) do |batch|
  # do some work on each batch
end
I am trying to do this same operation on a Pandas dataframe. Is there a Pythonic way to accomplish the same thing?
I have checked out this answer: Python equivalent of Ruby's each_slice(count)
However, it is old and is not Pandas specific. I am checking it out but am wondering if there is a more direct method.
There isn't a built-in method as such, but you can use numpy's array_split: you can pass the dataframe to it along with the number of slices.
To get slices of roughly 100 rows, you need to calculate that number, which is simply the number of rows divided by 100:
import numpy as np
# df.shape returns the dimensions in a tuple, the first dimension is the number of rows
np.array_split(df, df.shape[0] // 100)
This returns a list of dataframes sliced as evenly as possible.
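As a small usage sketch with a made-up frame (recent pandas versions may warn about splitting a DataFrame this way, in which case iloc-based slicing works just as well):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(950)})
batches = np.array_split(df, max(df.shape[0] // 100, 1))
for batch in batches:
    # do some work on each batch, mirroring Ruby's each_slice
    print(len(batch))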