Confusion about modifying a DataFrame within a function - Python

When working within a function, I am struggling to understand why I can sometimes modify DataFrame values and other times I cannot. For example, column assignment works without a return statement, but pd.get_dummies seems to require a return statement at the end of the function.
import numpy as np
import pandas as pd

data = {
    'x': np.linspace(0, 10, 3),
    'y': np.linspace(10, 20, 3),
    'cat1': ['dog', 'cat', 'fish'],
    'cat2': ['website1', 'website1', 'website2']}
df = pd.DataFrame(data)
   cat1      cat2     x     y
0   dog  website1   0.0  10.0
1   cat  website1   5.0  15.0
2  fish  website2  10.0  20.0
Modifies the original dataframe:
def change_variable(df):
    df['x'] = 999

Doesn't modify:
def ready_for_ml(df):
    pd.get_dummies(df, columns=['cat1', 'cat2'])
My work-around:
def ready_for_ml_return_df_version(df):
    df = pd.get_dummies(df, columns=['cat1', 'cat2'])
    return df

df = ready_for_ml_return_df_version(df)
      x     y  cat  dog  fish  website1  website2
0   0.0  10.0    0    1     0         1         0
1   5.0  15.0    1    0     0         1         0
2  10.0  20.0    0    0     1         0         1
I have read about using inplace transforms as a solution (How to modify a pandas DataFrame in a function so that changes are seen by the caller?), but pd.get_dummies doesn't seem to have an inplace argument. So I am curious about this function, but also, more broadly, about the underlying behavior.
Thank you for your time.
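
For reference, a minimal sketch of the underlying distinction, assuming only standard pandas: item assignment mutates the very object the caller passed in, while pd.get_dummies returns a brand-new DataFrame, so assigning it inside the function only rebinds the local name:

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'cat1': ['dog', 'cat']})

def change_variable(frame):
    # Item assignment modifies the caller's object in place,
    # so no return statement is needed.
    frame['x'] = 999

def ready_for_ml(frame):
    # get_dummies builds and returns a new DataFrame; assigning it to
    # the local name 'frame' does not touch the caller's df, which is
    # why the result must be returned and reassigned.
    frame = pd.get_dummies(frame, columns=['cat1'])

change_variable(df)
print(df['x'].tolist())     # [999, 999] -> the mutation is visible
ready_for_ml(df)
print(df.columns.tolist())  # ['x', 'cat1'] -> the rebinding is not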

Related

How to fill an empty df with a FOR loop

I need to create a dataframe with two columns: a variable, and a function of that variable. The following code raises an error:
test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1, 30):
    k = 0.5**i
    test.append(i, k)
print(test)
TypeError: cannot concatenate object of type '<class 'int'>'; only Series and DataFrame objs are valid
What do I need to fix here? The answer is probably simple, but I'm struggling to find it...
Many thanks for your help
Is there a specific reason you are trying to use a loop? You can create the df with Column_1 and use Pandas vectorized operations to create Column_2:
df = pd.DataFrame(np.arange(1,30), columns = ['Column_1'])
df['Column_2'] = 0.5**df['Column_1']
Column_1 Column_2
0 1 0.50000
1 2 0.25000
2 3 0.12500
3 4 0.06250
4 5 0.03125
I like Vaishali's way of approaching it. If you really want to use the for loop, this is how I would have done it:
import pandas as pd

test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1, 30):
    test = test.append({'Column_1': i, 'Column_2': 0.5**i}, ignore_index=True)
test = test.round(5)
print(test)
Column_1 Column_2
0 1.0 0.50000
1 2.0 0.25000
2 3.0 0.12500
3 4.0 0.06250
4 5.0 0.03125
5 6.0 0.01562
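
Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older versions. A sketch of an equivalent approach for current pandas (my adaptation, not the original answer) collects the rows in a list and builds the frame once:

import pandas as pd

# Building all rows first and constructing the DataFrame in a single
# call is also much faster than growing a DataFrame row by row.
rows = [{'Column_1': i, 'Column_2': 0.5**i} for i in range(1, 30)]
test = pd.DataFrame(rows).round(5)
print(test)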

One-hot encoding only some elements of a column

On my dataset I have many columns with mixed categorical and numerical values. Basically, when the numerical value was not available, a code was assigned, like 'M', 'C', etc., associated with the reason it was missing.
They have special meaning and peculiar behavior, so I want to cast them as categorical, and keep the rest as numeric.
Minimal example:
# Original df
import numpy as np
import pandas as pd

ex1 = ['a', 'b', '0', '1', '2']
df = pd.DataFrame(ex1, columns=['CName'])
print(df)
CName
0 a
1 b
2 0
3 1
4 2
## What I want to achieve
df['CName_a'] = (df.CName == 'a').astype(int)
df['CName_b'] = (df.CName == 'b').astype(int)
ff = (df.CName == 'b') | (df.CName == 'a')
df['CName_num'] = np.where(ff, np.nan, df.CName)
df2 = df.drop('CName', axis=1)
print(df2)
print(df2)
   CName_a  CName_b CName_num
0        1        0       NaN
1        0        1       NaN
2        0        0         0
3        0        0         1
4        0        0         2
Question 1.
Q1: How can this be done efficiently? Ideally I need to chain it in a Pipeline, some fit_transform kind of thing. Do I have to write it from scratch, or is there a hack in common libraries to hot-encode a subset of a column, like ['a', 'b', 'else']?
Question 2.
Q2: How should I fill the NaN for CName_num? The categorical elements ('a' and 'b' in the example) have behavior that differs from the average of the numerical values (actually, from any of them). I feel assigning 0 or the mean is not the right choice, but I have run out of options. I plan to use Random Forest, DNN, or even regression-like training if it performs decently.
Here is one potential solution. First create a boolean mask using str.isdigit. Use pandas.get_dummies and pandas.concat for your final DataFrame:
mask = df['CName'].str.isdigit()

pd.concat([pd.get_dummies(df.loc[~mask, 'CName'], prefix='CName')
             .reindex(df.index).fillna(0),
           df.loc[mask].add_suffix('_num')], axis=1)
[out]
CName_a CName_b CName_num
0 1.0 0.0 NaN
1 0.0 1.0 NaN
2 0.0 0.0 0
3 0.0 0.0 1
4 0.0 0.0 2
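
Toward Q1's Pipeline point, a minimal sketch, assuming plain pandas, that wraps the same mask/get_dummies/concat logic in a reusable function (the name split_codes is illustrative, not from any library); it could then be wrapped in an sklearn FunctionTransformer:

import pandas as pd

def split_codes(df, col):
    # Dummy-encode the letter codes; keep the digits as a numeric
    # column with NaN wherever a code was present instead.
    mask = df[col].str.isdigit()
    dummies = (pd.get_dummies(df.loc[~mask, col], prefix=col)
                 .reindex(df.index).fillna(0))
    numeric = pd.to_numeric(df[col].where(mask)).rename(f'{col}_num')
    return pd.concat([dummies, numeric], axis=1)

df = pd.DataFrame(['a', 'b', '0', '1', '2'], columns=['CName'])
print(split_codes(df, 'CName'))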

Python: How to replace only 0 values in a column by multiplication of 2 columns in Dataframe with a loop?

Here is my Dataframe:
df = pd.DataFrame({'pack': [2, 2, 2, 2], 'a_cost': [10.5, 0, 11, 0], 'b_cost': [0, 6, 0, 6.5]})
At this point you will find that the a_cost and b_cost columns have 0s where the other column has a value. I would like my function to follow this logic...
for i in df.a_cost:
    if i == 0:
        # multiply the b_cost(column) value by the pack(column) value and
        # replace the 0 with this new multiplied value (example: 6.0*2 = 12.0)
for i in df.b_cost:
    if i == 0:
        # divide the a_cost(column) value by the pack(column) value and
        # replace the 0 with this new divided value (example: 10.5/2 = 5.25)
I can't figure out how to write this logic successfully... Here is the expected output:
Output in code:
df = pd.DataFrame({'pack': [2, 2, 2, 2], 'a_cost': [10.5, 12.0, 11, 13.0], 'b_cost': [5.25, 6, 5.50, 6.5]})
Help is really appreciated!
IIUC,
df.loc[df.a_cost.eq(0), 'a_cost'] = df.b_cost * df.pack
df.loc[df.b_cost.eq(0), 'b_cost'] = df.a_cost / df.pack
You can also play with mask and fillna:
df['a_cost'] = df.a_cost.mask(df.a_cost.eq(0)).fillna(df.b_cost * df.pack)
df['b_cost'] = df.b_cost.mask(df.b_cost.eq(0)).fillna(df.a_cost / df.pack)
Update: as commented, you can use other in mask:
df['a_cost'] = df.a_cost.mask(df.a_cost.eq(0), other=df.b_cost * df.pack)
Also note that the second filtering is not needed once you have already filled the 0s in column a_cost. That is, we can just do:
df['b_cost'] = df.a_cost / df.pack
after the first command in both methods.
Output:
pack a_cost b_cost
0 2 10.5 5.25
1 2 12.0 6.00
2 2 11.0 5.50
3 2 13.0 6.50
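
Putting that simplification together, a minimal sketch assuming the df defined in the question:

import pandas as pd

df = pd.DataFrame({'pack': [2, 2, 2, 2],
                   'a_cost': [10.5, 0, 11, 0],
                   'b_cost': [0, 6, 0, 6.5]})

# Fill the missing a_cost values first, then derive b_cost for every
# row; this is safe here because a_cost == pack * b_cost throughout.
df.loc[df.a_cost.eq(0), 'a_cost'] = df.b_cost * df.pack
df['b_cost'] = df.a_cost / df.pack
print(df)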
import numpy as np
import pandas as pd

df = pd.DataFrame({'pack': [2, 2, 2, 2], 'a_cost': [10.5, 0, 11, 0], 'b_cost': [0, 6, 0, 6.5]})
df['a_cost'] = np.where(df['a_cost']==0, df['pack']*df['b_cost'], df['a_cost'])
df['b_cost'] = np.where(df['b_cost']==0, df['a_cost']/df['pack'], df['b_cost'])
print (df)
#   pack  a_cost  b_cost
#0     2    10.5    5.25
#1     2    12.0    6.00
#2     2    11.0    5.50
#3     2    13.0    6.50
Try this:
df['a_pack'] = df.apply(lambda x: x['b_cost']*x['pack'] if x['a_cost'] == 0 and x['b_cost'] != 0 else x['a_cost'], axis = 1)
df['b_pack'] = df.apply(lambda x: x['a_cost']/x['pack'] if x['b_cost'] == 0 and x['a_cost'] != 0 else x['b_cost'], axis = 1)

Conditional counting for group variables

I have the following dataframe:
import pandas as pd
df = pd.DataFrame({"Shop_type": [1,2,3,3,2,3,1,2,1],
"Self_managed" : [True,False,False,True,True,False,False,True,False],
"Support_required" : [True,True,True,False,False,False,False,False,True]})
My goal is to get an overview of the number of Self_managed shops and Support_required shops, somewhat looking like this:
   Shop_type  Self_count  Support_count
0          1           1              2
1          2           2              1
2          3           1              1
Currently I use the following code to achieve this, but it looks very long and unprofessional. Since I am still learning Python, I would like to improve and have more efficient code. Any ideas?
df1 = df[df["Self_managed"] == True]
df1 = df1.groupby(['Shop_type']).size().reset_index(name='Self_count')
df2 = df[df["Support_required"] == True]
df2 = df2.groupby(['Shop_type']).size().reset_index(name='Support_count')
df = df1.merge(df2, how = "outer", on="Shop_type")
Seems like you need
df.groupby('Shop_type',as_index=False).sum()
Out[298]:
Shop_type Self_managed Support_required
0 1 1.0 2.0
1 2 2.0 1.0
2 3 1.0 1.0
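
If you also want integer counts and the exact column names from the desired output, a hedged variation (the rename mapping is my addition, not part of the original answer):

import pandas as pd

df = pd.DataFrame({"Shop_type": [1, 2, 3, 3, 2, 3, 1, 2, 1],
                   "Self_managed": [True, False, False, True, True, False, False, True, False],
                   "Support_required": [True, True, True, False, False, False, False, False, True]})

# Summing booleans per group counts the True values
out = (df.groupby('Shop_type', as_index=False)[['Self_managed', 'Support_required']]
         .sum()
         .rename(columns={'Self_managed': 'Self_count',
                          'Support_required': 'Support_count'}))
print(out)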

How to encode two Pandas dataframes according to the same dummy vectors?

I'm trying to encode categorical values to dummy vectors.
pandas.get_dummies does a perfect job, but the dummy vectors depend on the values present in the DataFrame. How can I encode a second DataFrame according to the same dummy vectors as the first one?
import pandas as pd
df=pd.DataFrame({'cat1':['A','N','K','P'],'cat2':['C','S','T','B']})
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
print(b)
cat1_A cat1_K cat1_N cat1_P
0 1 0 0 0
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})
c = pd.get_dummies(df_test['cat1'], prefix='cat1').astype('int')
print(c)
cat1_A cat1_N
0 1 0
1 0 1
How can I get this output?
cat1_A cat1_K cat1_N cat1_P
0 1 0 0 0
1 0 0 1 0
I was thinking of manually computing the unique values for each column and then creating a dictionary to map the second DataFrame, but I'm sure there is already a function for that...
Thanks!
I always use categorical_encoding because it has a great choice of encoders. It also works with Pandas very nicely, is pip installable, and is written in line with the sklearn API. That means you can quickly test different types of encoders with the fit and transform methods or in a Pipeline.
If you wish to encode just the first column, like in your example, we can do so.
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1'])
# cols=None, all string columns encoded
df_trans = enc_ohe.fit_transform(df)
print(df_trans)
cat1_0 cat1_1 cat1_2 cat1_3 cat2
0 0 1 0 0 C
1 0 0 0 1 S
2 1 0 0 0 T
3 0 0 1 0 B
The default is to have column names have numerical encoding instead of the original letters. This is helpful though when you have long strings as categories. This can be changed by passing the use_cat_names=True kwarg, as mentioned by Arthur.
Now we can use the transform method to encode your second DataFrame.
df_test = pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
df_test_trans = enc_ohe.transform(df_test)
print(df_test_trans)
cat1_1 cat1_3 cat2
0 1 0 T
1 0 1 B
As commented in line 5, not setting cols defaults to encode all string columns.
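
For completeness, a plain-pandas sketch (my addition, not part of this answer) that aligns the test dummies to the training dummy columns with reindex, filling unseen categories with 0:

import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P']})
df_test = pd.DataFrame({'cat1': ['A', 'N']})

train_dummies = pd.get_dummies(df['cat1'], prefix='cat1')
# Reindex to the training columns; categories unseen in the test set become 0
test_dummies = (pd.get_dummies(df_test['cat1'], prefix='cat1')
                  .reindex(columns=train_dummies.columns, fill_value=0))
print(test_dummies)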
I had the same problem before. This is what I did, which is not necessarily the best way to do this, but it works for me.
df=pd.DataFrame({'cat1':['A','N'],'cat2':['C','S']})
df['cat1'] = df['cat1'].astype('category', categories=['A','N','K','P'])
# then run the get_dummies
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
This uses the astype function with the 'categories' values passed in as a parameter.
To apply the same categories to all DFs, you had better store the category values in a variable, like:
cat1_categories = ['A','N','K','P']
cat2_categories = ['C','S','T','B']
Then use astype like
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})
df_test['cat1'] = df_test['cat1'].astype('category', categories=cat1_categories)
c = pd.get_dummies(df_test['cat1'], prefix='cat1').astype('int')
print(c)
cat1_A cat1_N cat1_K cat1_P
0 1 0 0 0
1 0 1 0 0
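
Note: the astype('category', categories=...) signature used above was removed in later pandas versions. A sketch of the modern equivalent, assuming current pandas, goes through pd.CategoricalDtype:

import pandas as pd

cat1_categories = ['A', 'N', 'K', 'P']
cat1_dtype = pd.CategoricalDtype(categories=cat1_categories)

df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})
df_test['cat1'] = df_test['cat1'].astype(cat1_dtype)

# get_dummies emits one column per category, even for values absent here
c = pd.get_dummies(df_test['cat1'], prefix='cat1').astype('int')
print(c)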
