I was wondering how I would be able to convert my binned dataframe to a binned numpy array that I can use in sklearn's PCA.
Here's my code so far (x is my original unbinned dataframe):
bins=(2,6,10,14,20,26,32,38,44,50,56,62,68,74,80,86,92,98)
binned_data = x.groupby(pd.cut(x.Weight, bins))
I want to convert binned_data to a numpy array. Thanks in advance.
EDIT:
When I try binned_data.values, I receive this error:
AttributeError: Cannot access attribute 'values' of 'DataFrameGroupBy' objects, try using the 'apply' method
You need to apply some kind of aggregation to the GroupBy object to return a DataFrame. Once you have that, you can use .values to extract the numpy array.
For example, if you wanted the sum or count of the data in each bin you could do:
binned_data.sum().values
binned_data.size().values
Edit:
My code wasn't exactly right, because the column (Weight) and the index end up with the same name, which makes reset_index() fail. It can be fixed by renaming the index, as below:
binned_data = x.groupby(pd.cut(x.Weight, bins)).sum()
binned_data.index.name = 'Weight_Bin'
binned_data.reset_index().values
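Here is a minimal end-to-end sketch of that pipeline, using hypothetical data (the Height column and the use of sum() as the aggregation are assumptions; only Weight and the bin edges come from the question):
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# hypothetical stand-in for the original unbinned dataframe x
x = pd.DataFrame({'Weight': np.random.uniform(2, 98, 500),
                  'Height': np.random.uniform(140, 200, 500)})
bins = (2, 6, 10, 14, 20, 26, 32, 38, 44, 50, 56, 62, 68, 74, 80, 86, 92, 98)
binned_data = x.groupby(pd.cut(x.Weight, bins)).sum()   # aggregate each bin
binned_data.index.name = 'Weight_Bin'                   # avoid the name clash with 'Weight'
arr = binned_data.reset_index().values                  # plain numpy array
# PCA needs purely numeric input, so drop the bin-label column and cast the rest
pca = PCA(n_components=1).fit(arr[:, 1:].astype(float))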
How can I add a list or a numpy array as a column to a Dask dataframe? When I try the regular pandas syntax df['x']=x, it gives me a TypeError: Column assignment doesn't support type list error.
You can add a pandas series:
df["new_col"] = pd.Series(my_list, index=index_matching_df_index)
The issue is that the index is essential for dask to understand how the data is partitioned. The size of each partition in a dask dataframe is not always known, so you cannot assign by position.
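A minimal sketch of the Series approach with hypothetical data (the column names and values are made up; the key point is that the Series is built on the same index the dask dataframe uses):
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame({'a': range(6)})
df = dd.from_pandas(pdf, npartitions=2)
my_list = [10, 20, 30, 40, 50, 60]
# build the Series on the index that matches the dask dataframe's index
df['new_col'] = pd.Series(my_list, index=pdf.index)
print(df.compute())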
I finally solved it by casting the list into a dask array with dask.array.from_array(), which I think is the most direct way.
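A sketch of that route with hypothetical data (it assumes a reasonably recent dask and that the array's chunk size lines up with the dataframe's partition sizes, which dask requires for this kind of assignment):
import pandas as pd
import dask.array as da
import dask.dataframe as dd
pdf = pd.DataFrame({'a': range(6)})
df = dd.from_pandas(pdf, npartitions=2)        # two partitions of 3 rows each
my_list = [10, 20, 30, 40, 50, 60]
df['x'] = da.from_array(my_list, chunks=3)     # chunk size matches the partitions
print(df.compute())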
I am using value_counts() to get the frequency for sec_id. The output of value_counts() should be integers.
When I build a DataFrame from these integers, I find the columns are object dtype. Does anyone know the reason?
They are object dtype because your sec_id column contains string values (e.g. "94114G"). When you call .values on the dataframe created by .reset_index(), the string ids and the integer counts are packed into a single array, which forces the whole array to object dtype.
More importantly, I think you are doing some unnecessary work. Try this:
>>> sec_count_df = df['sec_id'].value_counts().rename_axis("sec_id").rename("count").reset_index()
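As a quick check of that one-liner with hypothetical ids (the sample values below are invented), the resulting columns get the expected dtypes:
import pandas as pd
df = pd.DataFrame({'sec_id': ['94114G', '94114G', '02376R']})   # hypothetical ids
sec_count_df = (df['sec_id'].value_counts()
                            .rename_axis('sec_id')
                            .rename('count')
                            .reset_index())
print(sec_count_df.dtypes)   # sec_id is object, count is int64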
When I tried to run this code:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
I am getting a new index for the df_new data frame which is not the same as df.
I tried changing the code below to retain the original index. However, it gives an error:
X_test = df.values(index=df.index)
'numpy.ndarray' object is not callable.
Is there a way to give df_new the same index as the df dataframe?
DataFrames have a set_index() method for manually setting the "index column". Koalas in particular accepts as its main argument:
keys: label or array-like or list of labels/arrays
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index and np.ndarray.
By that, you can pass the Index object of your original df:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
df_new = df_new.set_index(df.index)
Now, about the line where you get an error:
X_test = df.values(index=df.index)
The error arises because you are confusing numpy arrays with pandas DataFrames.
When you call df.values of a DataFrame df, this returns a np.ndarray object with all the dataframe values without the index.
It is an attribute, not a function, so you cannot "call" it by writing (index=df.index).
Numpy arrays don't have custom indexes; they are just arrays. Only df_new cares about an index, and you can set it as shown above.
Disclaimer: I wasn't able to install koalas for this answer, so this is only tested on pandas DataFrames. If koalas fully supports pandas' interface, it should work there too.
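For reference, here is the same pattern in plain pandas with hypothetical data (single Sales column and made-up index values):
import pandas as pd
df = pd.DataFrame({'Sales': [1.0, 2.0, 3.0]}, index=[10, 20, 30])   # hypothetical
X_test = df.values                                # ndarray; the index is lost here
df_new = pd.DataFrame(X_test, columns=['Sales'])  # gets a fresh 0..n-1 index
df_new = df_new.set_index(df.index)               # reattach the original index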
I have a pandas Series 'ids' of only unique ids, which has dtype object.
data_df.id.dtype
returns dtype('O')
I'm trying to follow the example here to create a sparse matrix from my df: Efficiently create sparse pivot tables in pandas?
reviewer_u = list(data_df.id.unique())
row = data_df.id.astype('category', categories=reviewer_u).cat.codes
and I get:
TypeError: data type "category" not understood
I'm not sure what this error means and I haven't been able to find much on it.
Passing a categories keyword to astype() is no longer supported in recent pandas versions, which is what triggers that error. Try instead:
row = pd.Categorical(data_df['id'], categories=reviewer_u)
You can get the codes using:
row.codes
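A small sketch with hypothetical data (the id values are invented; reviewer_u stands in for the list of unique ids, as in the question):
import pandas as pd
data_df = pd.DataFrame({'id': ['a1', 'b2', 'a1', 'c3']})   # hypothetical ids
reviewer_u = list(data_df['id'].unique())
row = pd.Categorical(data_df['id'], categories=reviewer_u)
print(row.codes)   # integer code per row, e.g. [0 1 0 2]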
I am trying to set colb in a pandas dataframe depending on the value in cola.
The order in which I pass the two indexers to .loc seems to determine whether the assignment works. Why is this?
Here is an example of what I mean.
I set up my dataframe:
import numpy as np
import pandas as pd
test = pd.DataFrame(np.random.rand(20, 1))
test['cola'] = [x for x in range(20)]
test['colb'] = 0
If I try to set column b using the following code:
test.loc['colb',test.cola>2]=1
I get the error: ValueError: setting an array element with a sequence
If I use the following code, the code alters the dataframe as I expect.
test.loc[test.cola>2,'colb']=1
Why is this?
Further, is there a better way to assign a column using a test like this?