Pandas get_dummies in for loop - python

I would like to convert categorical variables into dummies using pandas.get_dummies in a for loop.
However, the following code does not convert the dataframes.
data_cleaner = [data_train, data_val]
for df in data_cleaner:
    df = pd.get_dummies(df, columns=categorical_fields)
data_train.head()  # Not changed
I know that the loop variable in a for loop is just a temporary reference. But the modified code also didn't work.
for i in range(len(data_cleaner)):
    data_cleaner[i] = pd.get_dummies(data_cleaner[i], columns=categorical_fields)
data_train.head()  # Still not changed
Can anyone help? Do I have to run get_dummies manually for each dataframe? FYI, pandas get_dummies doesn't provide an inplace option.

You can run it as a list comprehension:
data_cleaner = [pd.get_dummies(df, columns=categorical_fields) for df in data_cleaner]
or
data_train_dum, data_val_dum = [pd.get_dummies(df, columns=categorical_fields) for df in [data_train, data_val]]
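As a quick check, here is a minimal runnable sketch of the comprehension approach (the column names and data are hypothetical):
import pandas as pd

categorical_fields = ['color']
data_train = pd.DataFrame({'color': ['red', 'blue'], 'x': [1, 2]})
data_val = pd.DataFrame({'color': ['blue', 'green'], 'x': [3, 4]})

# Rebind the original names to the new, dummified frames
data_train, data_val = [pd.get_dummies(df, columns=categorical_fields)
                        for df in [data_train, data_val]]
print(data_train.columns.tolist())  # ['x', 'color_blue', 'color_red']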

Try the following:
data_cleaner = [data_train, data_val]
for i, df in enumerate(data_cleaner):
    data_cleaner[i] = pd.get_dummies(df, columns=categorical_fields)
data_train, data_val = data_cleaner

Related

Changing data types of multiple pandas dataframes in a for loop

I have several data frames whose datatypes I need to convert to integers. I tried using a for loop to make my code a bit tidier, but after running it and checking the dtypes, they don't change. Does anyone know why this could be, or any workarounds? I think it's something to do with creating copies. An example of similar code is below:
for df in [df1, df2, df3]:
    df = df.astype(int)
The problem here is that you are not changing your initial objects, only your loop variable df.
To change your initial dataframes you could do the following:
df_list = [df1, df2, df3]
for i in range(len(df_list)):
    df_list[i] = df_list[i].astype(int)
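Note that this rebinds the list entries, not the original names df1, df2, and df3; to update those you would unpack the list afterwards, as in the get_dummies answer above. A minimal sketch:
df_list = [df.astype(int) for df in (df1, df2, df3)]
df1, df2, df3 = df_list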
I've found a nicer way to code this using a function. It's not quite as elegant as I initially hoped with a for loop, but it will save me writing out long lists of column names to be changed several times:
def to_int(df, cols):
    df[cols] = df[cols].astype(int)
    return df

df = to_int(df, ['col1', 'col2'])
This allows me to change the data type of only the desired columns.

How do I split rows in DataFrame?

I want to split the rows while maintaining the values.
How can I split the rows like that?
The data frame below is an example, followed by the output that I want to see.
You can use pd.melt(). Read the documentation for more information: https://pandas.pydata.org/docs/reference/api/pandas.melt.html
I tried working on your problem.
import pandas as pd
melted_df = data.melt(id_vars=['value'], var_name="ToBeDropped", value_name="ID1")
This would show a warning because of the ambiguity in the string passed for the "value_name" argument. It would also create a new column, which I have already named 'ToBeDropped'. The code below will remove that column for you.
df = melted_df.drop(columns = ['ToBeDropped'])
'df' will be your desired output.
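For reference, a small runnable sketch of the melt-and-drop approach, using a hypothetical frame whose shape matches the snippet (a 'value' column plus two ID columns) and a value_name that does not collide with an existing column, so it avoids the warning mentioned above:
import pandas as pd

data = pd.DataFrame({'value': [10, 20], 'ID1': ['a', 'b'], 'ID2': ['c', 'd']})
# var_name holds the old column labels; value_name holds their contents
melted_df = data.melt(id_vars=['value'], var_name='ToBeDropped', value_name='IDs')
df = melted_df.drop(columns=['ToBeDropped'])
print(df)  # each 'value' now appears once per original ID column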
via wide_to_long:
df = pd.wide_to_long(df, stubnames='ID', i='value',
                     j='ID_number').reset_index(0)
via set_index and stack:
df = df.set_index('value').stack().reset_index(name='IDs').drop(columns='level_1')
via melt:
df = df.melt(id_vars='value', value_name="ID1").drop(columns='variable')

column filter and multiplication in dask dataframe

I am trying to replicate the following operation on a dask dataframe, where I have to filter the dataframe based on a column value and multiply another column by a factor. Following is the pandas equivalent:
import dask.dataframe as dd

df['adjusted_revenue'] = 0
df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
I am trying to do this on a dask dataframe but it doesn't support assignment.
TypeError: '_LocIndexer' object does not support item assignment
This is working for me:
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] == 1]
df1['adjusted_revenue'] = 0.7 * df1['gross_revenue']
df2 = df.loc[df['tracked'] == 0]
df2['adjusted_revenue'] = 0.3 * df2['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping if there is any simpler way to do this.
Thanks!
You should use .apply, which is probably the right thing to do with pandas too; or perhaps where. However, to keep things as similar to your original as possible, here it is with map_partitions, in which you act on each piece of the dataframe independently, and those pieces really are pandas dataframes.
def make_col(df):
    df['adjusted_revenue'] = 0
    df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
    df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
    return df

new_df = df.map_partitions(make_col)
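The where route mentioned above can also be written directly on the dask dataframe, since plain column assignment (unlike .loc assignment) is supported. A minimal sketch, assuming tracked only ever takes the values 0 and 1:
# Pick 0.7 * gross_revenue where tracked == 1, else 0.3 * gross_revenue
df['adjusted_revenue'] = (0.7 * df['gross_revenue']).where(
    df['tracked'] == 1, 0.3 * df['gross_revenue'])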

Efficient way to empty a pandas dataframe using its mutable alias property

So I want to create a function in which part of the code modifies an existing pandas dataframe df and, under some conditions, the df will be modified to empty. The challenge is that this function is not allowed to return the dataframe itself; it can only modify the df through the alias. An example of this is the following function:
import pandas as pd
import random

def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup

testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that the reason df = pd.DataFrame() won't work is that the local name df becomes bound to the new pd.DataFrame() instead of the mutable alias of the input dataframe. So is there any way to change the df in place to an empty dataframe?
Thank you in advance.
EDIT 1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the dataframe is big enough (by big enough I mean 1mil+ total entries).
df = pd.DataFrame(columns=df.columns)
will empty a dataframe in pandas (and be way faster than using the drop method).
I believe that is what you're asking.
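Note that inside a function this reassignment rebinds the local name, which is exactly the problem the question describes; to mutate the caller's dataframe through the alias you would still need an in-place operation. A minimal sketch using the drop approach from the question's edit:
def empty_inplace(df):
    # Drops all rows through the alias, so the caller's dataframe is emptied too
    df.drop(df.index, inplace=True)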

Need help to solve the Unnamed and to change it in dataframe in pandas

How do I set my column names from "Unnamed" to the first line of my dataframe in Python?
import pandas as pd

df = pd.read_excel('example.xls', 'Day_Report', index_col=None, skipfooter=31)
df = df.dropna(how='all', axis=1)
df = df.dropna(how='all')
df = df.drop(2)
To set the column names (assuming that's what you mean by "indexes") to the first row, you can use
df.columns = df.loc[0, :].values
Following that, if you want to drop the first row, you can use
df.drop(0, inplace=True)
Edit
As coldspeed correctly notes below, if the source of this is reading a CSV, then adding the skiprows=1 parameter is much better.
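For the Excel case in the question, a similar effect can come from pointing the header parameter at the row that actually holds the column names. A sketch; the row number 3 here is hypothetical and depends on the file layout:
# header=3 tells read_excel to use the fourth row as the column names
df = pd.read_excel('example.xls', 'Day_Report', header=3, skipfooter=31)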
