How to maintain the same index for a dictionary from a dataframe - python

When I tried to run this code:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
I got a new index for the df_new DataFrame, which is not the same as df's.
I tried changing the code as below to retain the index. However, it gives an error:
X_test = df.values(index=df.index)
'numpy.ndarray' object is not callable.
Is there a way to give df_new the same index as the df DataFrame?

DataFrames have a set_index() method for manually setting the index. Koalas in particular accepts as its main argument:
keys: label or array-like or list of labels/arrays
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index and np.ndarray.
This means you can pass the Index object of your original df:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
df_new = df_new.set_index(df.index)
Now, about the line that raises the error:
X_test = df.values(index=df.index)
The error arises because you are confusing numpy arrays with pandas DataFrames.
Accessing df.values on a DataFrame df returns a np.ndarray holding all the DataFrame's values, without the index.
values is an attribute, not a method, so you cannot call it by writing (index=df.index).
Numpy arrays don't have custom indexes; they are just plain arrays. The index only becomes relevant again in df_new, and you can set it there as shown above.
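To make the distinction concrete, here is a minimal sketch, tested with pandas only (the tiny DataFrame is made up for illustration):
import pandas as pd

df = pd.DataFrame({'Sales': [10, 20]}, index=['a', 'b'])
values = df.values  # np.ndarray: the data only, the index is dropped
# values(index=df.index)  # would raise: 'numpy.ndarray' object is not callable

# Rebuild a DataFrame and reattach the original index:
df_new = pd.DataFrame(values, columns=['Sales']).set_index(df.index)
print(df_new.index.equals(df.index))  # True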
Disclaimer: I wasn't able to install Koalas for this answer, so this is only tested with pandas DataFrames. If Koalas supports the pandas interface completely, this should work.

Related

Add list or numpy array as column to a dask dataframe

How can I add a list or a numpy array as a column to a Dask dataframe? When I try the regular pandas syntax df['x']=x, it gives me a TypeError: Column assignment doesn't support type list error.
You can add a pandas series:
df["new_col"] = pd.Series(my_list, index=index_matching_df_index)
The issue is that the index matters a great deal here: dask uses it to understand how the data is partitioned. The size of each partition in a dask dataframe is not always known, so you cannot assign by position.
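A minimal runnable sketch of this, assuming the dask dataframe was built from a pandas one so the indexes line up (the data is made up):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

my_list = [10, 20, 30, 40]
# Wrap the list in a pandas Series whose index matches the dask dataframe's index
ddf['new_col'] = pd.Series(my_list, index=pdf.index)
print(ddf.compute())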
I finally solved it by just casting the list into a dask array with dask.array.from_array(), which I think is the most direct way.
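And a self-contained sketch of that dask.array route; the assumption on my part is that the chunk size should line up with the partition sizes:
import pandas as pd
import dask.dataframe as dd
import dask.array as da

pdf = pd.DataFrame({'a': [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Cast the list to a dask array first; chunks=2 matches the two partitions
ddf['new_col'] = da.from_array([10, 20, 30, 40], chunks=2)
print(ddf.compute())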

Outer merging two dataframes where one contains a StringArray raises ValueError

I am trying to perform an outer (or left) merge on two dataframes, the latter of which has an exclusive column with a dtype of "StringDtype", and this raises an error like ValueError: StringArray requires a sequence of strings or pandas.NA. I understand that I can use Series.astype(str) to cast the column to the "object" dtype, but the dataframe has many of these columns and it seems unnecessary to me. I'm wondering if this is a bug, or whether there is another workaround I'm not aware of.
Here is an example to recreate the error:
import pandas as pd
df1 = pd.DataFrame(dict(id=pd.Series([1], dtype=int)))
df2 = pd.DataFrame(dict(id=pd.Series([], dtype=int), first_name=pd.Series([], dtype="string")))
df_final = df1.merge(df2, on="id", how="outer") # "outer" can be replaced with "left" with the same effect
This should result in a DataFrame with a single row containing the id of 1 and a first_name of NA (or something similar); instead it raises the ValueError.
It will work if I cast the "first_name" column to the object dtype with df2.first_name = df2.first_name.astype(str), but I'd like to avoid that, as explained in the first paragraph.
Pandas version 1.0.5 is installed.
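As a side note on the cast described above: it can at least be applied to all string columns in one go rather than one by one. A sketch of that (the select_dtypes step is an assumption on my part, not from the original post):
import pandas as pd

df1 = pd.DataFrame(dict(id=pd.Series([1], dtype=int)))
df2 = pd.DataFrame(dict(id=pd.Series([], dtype=int),
                        first_name=pd.Series([], dtype="string")))

# Cast every "string"-dtype column to object before merging
string_cols = df2.select_dtypes(include="string").columns
df2[string_cols] = df2[string_cols].astype(str)

df_final = df1.merge(df2, on="id", how="outer")
print(df_final)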

Why are the original values lost when using the reindex method of pandas on a DataFrame?

This is the original DataFrame tols (image in the original post).
What I wanted: to convert the DataFrame above into a multi-indexed-column DataFrame (image in the original post).
I managed to do it with this piece of code:
# tols : original dataframe
cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']])
tols = tols.set_axis(cols, axis=1, inplace=False)
What I tried: I tried to do the same with the reindex method, like this:
cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']])
tols.reindex(cols, axis='columns')
It resulted in an output in which all the original values were replaced by NaN (image in the original post).
My problem:
As you can see in the output above, all my original numerical values go missing when I employ the reindex method. The documentation clearly states:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one.
So I don't understand:
Where exactly did I err in employing the reindex method, so that my original values were lost?
How should I have employed the reindex method to get my desired output?
You need to assign new column names; the only requirement is that the length of the MultiIndex matches the number of columns in the original DataFrame:
tols.columns = pd.MultiIndex.from_product([['A','B'],['Y','X'], ['P','Q']])
The problem with DataFrame.reindex here is that pandas looks up the values of cols among the original column names; because they are not found, they are set to missing values.
It is the intended behaviour, from the documentation:
Conform DataFrame to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index
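A small sketch contrasting the two approaches, with made-up data (eight columns, to match the 2x2x2 MultiIndex):
import pandas as pd
import numpy as np

tols = pd.DataFrame(np.arange(16).reshape(2, 8))
cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']])

# Assigning the names directly keeps the data:
renamed = tols.copy()
renamed.columns = cols

# reindex instead looks the new labels up among the old ones (0..7),
# finds none of them, and fills everything with NaN:
lost = tols.reindex(cols, axis='columns')
print(lost.isna().all().all())  # True: all original values are gone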

Subsetting dataframe via a list

I dummified one column in my data frame using get_dummies, but that produced an additional 400 columns. The issue is that I would like to subset the data frame, which now has over 700 columns, to run the operation below:
df.replace([np.inf, -np.inf], np.nan).dropna()
I tried isolating the new columns generated by get_dummies by storing them in a list, which I initialized as col1.
When I tried to subset the df using
df = df[['var1','var2','var3',[col1] ]]
I got an error message saying "ValueError: setting an array element with a sequence".
Is there a way to subset the new dummies without having to type them all out?
You can use an asterisk to unpack your list in the column selection.
Otherwise you're passing your list in as a sublist of the column list, so your current method becomes:
df[['var1','var2','var3',['sub1','sub2','sub3']]]
But:
df = df[['var1','var2','var3',*col1]]
is unpacked to
df[['var1','var2','var3','sub1','sub2','sub3']]
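A self-contained sketch of the whole pattern, with made-up column names:
import pandas as pd

df = pd.DataFrame({'var1': [1, 2], 'var2': [3, 4], 'cat': ['x', 'y']})
df = pd.get_dummies(df, columns=['cat'])

# Collect the freshly created dummy columns into a list
col1 = [c for c in df.columns if c.startswith('cat_')]

subset = df[['var1', 'var2', *col1]]
print(subset.columns.tolist())  # ['var1', 'var2', 'cat_x', 'cat_y']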

Convert Pandas Series to Categorical

I have a pandas Series ids of only unique ids, which has a dtype of object.
data_df.id.dtype
returns dtype('O')
I'm trying to follow the example here to create a sparse matrix from my df: Efficiently create sparse pivot tables in pandas?
id_u = list(data_df.id.unique())
row = data_df.id.astype('category', categories=id_u).cat.codes
and I get:
TypeError: data type "category" not understood
I'm not sure what this error means and I haven't been able to find much on it.
Try instead:
row = pd.Categorical(data_df['id'], categories=id_u)
You can get the codes using:
row.codes
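A quick runnable sketch with made-up ids:
import pandas as pd

data_df = pd.DataFrame({'id': ['u1', 'u2', 'u1', 'u3']})
id_u = list(data_df.id.unique())

row = pd.Categorical(data_df['id'], categories=id_u)
print(row.codes)  # [0 1 0 2]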
