I have a pandas Series 'ids' containing only unique ids, with dtype object:
data_df.id.dtype
returns dtype('O')
I'm trying to follow the example here to create a sparse matrix from my df: Efficiently create sparse pivot tables in pandas?
id_u= list(data_df.id.unique())
row = data_df.id.astype('category', categories=id_u).cat.codes
and I get:
TypeError: data type "category" not understood
I'm not sure what this error means and I haven't been able to find much on it.
Try instead:
row = pd.Categorical(data_df['id'], categories=id_u)
You can get the codes using:
row.codes
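A minimal runnable sketch with made-up ids (the data_df and id_u names follow the question):

import pandas as pd

data_df = pd.DataFrame({'id': ['a', 'b', 'a', 'c']})  # hypothetical ids
id_u = list(data_df.id.unique())

# build the categorical explicitly, then take the integer codes
row = pd.Categorical(data_df['id'], categories=id_u)
print(row.codes)  # [0 1 0 2]

In recent pandas the astype route needs an explicit dtype object rather than a categories keyword, i.e. data_df['id'].astype(pd.CategoricalDtype(categories=id_u)).cat.codes gives the same result.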
I have my data set at https://github.com/mayuripandey/Data-Analysis/blob/main/similarity.csv. Is there any way I can take two specific columns, e.g. Count and Topic, and make a matrix out of them?
Simply subset the columns of interest and retrieve the values, without the column names, via the .values attribute:

import pandas as pd
import numpy as np

# GitHub renders the CSV as an HTML table, which read_html can parse
df = pd.read_html("https://github.com/mayuripandey/Data-Analysis/blob/main/similarity.csv")[0]
df[["Count", "Topic"]].values

This returns a 2D numpy array of only the values. If you then need a matrix object, you can convert it like this:

np.matrix(df[["Count", "Topic"]].values)
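Note that read_html only works here because GitHub renders the CSV as an HTML table. Reading the raw file is more robust; a small sketch, assuming the usual raw.githubusercontent.com mapping of that repository path:

import pandas as pd

# read the raw CSV directly instead of scraping the rendered GitHub page
url = "https://raw.githubusercontent.com/mayuripandey/Data-Analysis/main/similarity.csv"
df = pd.read_csv(url)
mat = df[["Count", "Topic"]].to_numpy()  # same as .values, the recommended spelling

Also note that np.matrix is no longer recommended by numpy itself; a plain 2D array from .to_numpy() is usually what downstream libraries expect.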
I am using value_counts() to get the frequency of each sec_id. The output of value_counts() should be integers, but when I build a DataFrame from the result, those columns come out with object dtype. Does anyone know the reason?
They have object dtype because your sec_id column contains string values (e.g. "94114G"). When you call .values on the DataFrame created by .reset_index(), numpy has to pick a single dtype that fits both the string ids and the integer counts, so everything is upcast to object.
More importantly, I think you are doing some unnecessary work. Try this:
sec_count_df = df['sec_id'].value_counts().rename_axis("sec_id").rename("count").reset_index()
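A quick sketch with made-up sec_id values showing where the object dtype comes from:

import pandas as pd

df = pd.DataFrame({'sec_id': ['94114G', '94114G', '02079K']})  # hypothetical ids

sec_count_df = df['sec_id'].value_counts().rename_axis("sec_id").rename("count").reset_index()
print(sec_count_df.dtypes)        # sec_id: object, count: int64
print(sec_count_df.values.dtype)  # object -- strings and ints must share one array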
When I tried to run this code:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
I am getting a new index for the df_new DataFrame which is not the same as df's.
I tried changing the code as below to retain the original index. However, it gives an error:
X_test = df.values(index=df.index)
TypeError: 'numpy.ndarray' object is not callable
Is there a way to give df_new the same index as the df DataFrame?
DataFrames have a set_index() method for manually setting the index. The Koalas version in particular accepts as its main argument:
keys: label or array-like or list of labels/arrays
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index and np.ndarray.
By that, you can pass the Index object of your original df:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
df_new = df_new.set_index(df.index)  # reuse the index of the original df
Now, about the line that gives you the error:
X_test = df.values(index=df.index)
The error arises because you are confusing numpy arrays with pandas DataFrames.
df.values is an attribute of the DataFrame df: it returns an np.ndarray holding all the values of the dataframe, without the index. It is not a function, so you cannot call it by writing (index=df.index).
Numpy arrays don't carry a custom index; they are just arrays. That is why df_new starts with a fresh default index, and why you have to set the index afterwards as shown above.
Disclaimer: I wasn't able to install Koalas for this answer, so this is only tested on pandas DataFrames. If Koalas fully supports the pandas interface, it should work there too.
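For reference, here is the pandas version of the round trip, with made-up values and only two of the six columns:

import pandas as pd

df = pd.DataFrame({'Sales': [10.0, 20.0], 'T_Year': [2020, 2021]},
                  index=['r1', 'r2'])

X_test = df.values                                  # plain ndarray, index is gone
df_new = pd.DataFrame(X_test, columns=['Sales', 'T_Year'])
df_new = df_new.set_index(df.index)                 # restore the original index
print(df_new.index.tolist())                        # ['r1', 'r2']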
I have the following data frame:
After I perform this operation:
pages = df_ref.groupby("KV").work_p.unique().reset_index()
I'm getting a new dataframe where each value in the work_p column is an array. How can I extract/convert it to an integer?
I feel like I could also achieve the goal by changing the first step, but as I am new to pandas I am, unfortunately, stuck.
Try extracting the first element of each array (this assumes every KV group has exactly one unique work_p value):
pages['work_p'] = pages.work_p.apply(lambda x: x[0])
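A runnable sketch with made-up data, assuming each KV group really does map to a single work_p value:

import pandas as pd

df_ref = pd.DataFrame({'KV': ['a', 'a', 'b'], 'work_p': [3, 3, 7]})  # hypothetical data

pages = df_ref.groupby("KV").work_p.unique().reset_index()
pages['work_p'] = pages.work_p.apply(lambda x: int(x[0]))  # unwrap array([3]) -> 3
print(pages.dtypes)  # work_p: int64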
I was wondering how I would be able to convert my binned dataframe to a binned numpy array that I can use in sklearn's PCA.
Here's my code so far (x is my original unbinned dataframe):
bins=(2,6,10,14,20,26,32,38,44,50,56,62,68,74,80,86,92,98)
binned_data = x.groupby(pd.cut(x.Weight, bins))
I want to convert binned_data to a numpy array. Thanks in advance.
EDIT:
When I try binned_data.values, I receive this error:
AttributeError: Cannot access attribute 'values' of 'DataFrameGroupBy' objects, try using the 'apply' method
You need to apply some kind of aggregation to the GroupBy object to get a DataFrame back. Once you have that, you can use .values to extract the numpy array.
For example, if you wanted the sum or the count of the data in each bin, you could do:
binned_data.sum().values   # per-bin sums of each column
binned_data.size().values  # number of rows in each bin
Edit:
My code wasn't exactly right: after the aggregation, the index and the Weight column end up with the same name, so reset_index() fails. It can be fixed by renaming the index, as below:
binned_data = x.groupby(pd.cut(x.Weight, bins)).sum()
binned_data.index.name = 'Weight_Bin'
binned_data.reset_index().values
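Putting it together, a runnable sketch with made-up data (only the Weight column and the bin edges come from the question; the Height column, the random values, and the sklearn step are assumptions about the intended use):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = pd.DataFrame({'Weight': rng.integers(3, 98, size=200),       # hypothetical data
                  'Height': rng.normal(170, 10, size=200)})

bins = (2, 6, 10, 14, 20, 26, 32, 38, 44, 50, 56, 62, 68, 74, 80, 86, 92, 98)

binned_data = x.groupby(pd.cut(x.Weight, bins), observed=False).sum()
binned_data.index.name = 'Weight_Bin'

X = binned_data.values                     # purely numeric 2D array, one row per bin
print(X.shape)                             # (17, 2)
reduced = PCA(n_components=1).fit_transform(X)

Note that for PCA you want binned_data.values rather than reset_index().values: mixing the Interval bin labels into the array would force object dtype, which sklearn cannot use.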