Using pandas.get_dummies - python

So essentially I have a data frame with a bunch of columns, some of which I want to keep (stored in to_keep) and some other columns that I want to create categorical variables for using pandas.get_dummies (these are stored in to_change).
However, I can't seem to get the syntax of how to do this down, and all the examples I have seen (i.e here: http://blog.yhat.com/posts/logistic-regression-and-python.html), don't seem to help.
Here's what I have at present:
new_df = df.copy()
dummies= pd.get_dummies(new_df[to_change])
new_df = new_df[to_keep].join(dummies)
return new_df
Any help on where I am going wrong would be appreciated, as the problem I keep running into is that this only adds categorical variables for the first column in to_change.

Didn't understand the problem completely, I must say.
However, say your DataFrae is df, and you have a list of columns to_make_categorical.
The DataFrame with the non-categorical columns, is
wo_categoricals = df[[c for c in list(df.columns) if c not in to_make_categorical]]
The DataFrames of the categorical expansions are
categoricals = [pd.get_dummies(df[c], prefix=c) for c in to_make_categorical]
Now you could just concat them horizontally:
pd.concat([wo_categoricals] + categoricals, axis=1)

Related

Is there a way to return a pandas dataframe with a modified column?

Say I have a dataframe df with column "age".
Say "age" has some NaN values and I want to create two new dataframes, dfMean and dfMedian, which fill in the NaN values differently.
This is the way I would do it:
# Step 1:
dfMean = df
dfMean["age"].fillna(df["age"].mean(),inplace=True)
# Step 2:
dfMedian= df
dfMedian["age"].fillna(df["age"].median(),inplace=True)
I'm curious whether there's a way to do each of these steps in one line instead of two, by returning the modified dataframe without needing to copy the original. But I haven't been able to find anything so far. Thanks, and let me know if I can clarify or if you have a better title in mind for the question :)
Doing dfMean = dfMean["age"].fillna(df["age"].mean()) you create a Series, not a DataFrame.
To add two new Series (=columns) to your DataFrame, use:
df2 = df.assign(age_fill_mean=df["age"].fillna(df["age"].mean()),
age_fill_median=df["age"].fillna(df["age"].median()),
)
You alternatively can use alias Pandas.DataFrame.agg()
"Aggregate using one or more operations over the specified axis."
df.agg({'age' : ['mean', 'median']})
No, need 2 times defined new 2 DataFrames by DataFrame.fillna with dictionary for specify columns names for replacement missing values:
dfMean = df.fillna({'age': df["age"].mean()})
dfMedian = df.fillna({'age': df["age"].median()})
One line is:
dfMean,dfMedian=df.fillna({'age': df["age"].mean()}), df.fillna({'age': df["age"].median()})

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead [duplicate]

I have a initial dataframe D. I extract two data frames from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0.
Use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
If you're working with big data and need to concatenate multiple datasets calling concat many times can get performance-intensive.
If you don't want to create a new df each time, you can instead aggregate the changes and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs under concatenating objects at the bottom of the section):
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
If you want to update/replace the values of first dataframe df1 with the values of second dataframe df2. you can do it by following steps —
Step 1: Set index of the first dataframe (df1)
df1.set_index('id')
Step 2: Set index of the second dataframe (df2)
df2.set_index('id')
and finally update the dataframe using the following snippet —
df1.update(df2)
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.,:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
# collect excel content into list of dataframes
data = []
for excel_file in excel_files:
data.append(pd.read_excel(excel_file, engine="openpyxl"))
# concatenate dataframes horizontally
df = pd.concat(data, axis=1)
# save combined data to excel
df.to_excel(excelAutoNamed, index=False)
You can try the above when you are appending horizontally! Hope this helps sum1
Use this code to attach two Pandas Data Frames horizontally:
df3 = pd.concat([df1, df2],axis=1, ignore_index=True, sort=False)
You must specify around what axis you intend to merge two frames.

Comparing two large dataframes w/ pySpark

This post mentions symmetric difference and leveraging code df1.except(df2).union(df2.except(df1)) and/ordf1.unionAll(df2).except(df1.intersect(df2)) but I'm getting syntax errors when using except.
I'm trying to compare two dataframes who can have up to 50 or 50+ columns. I have the working code below but need to avoid hard coding columns.
sample code and example
# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
#Use window function to check if the count of same rows is more than 1 and if it indeed is, then mark column FLAG as SAME, else keep it the way it is. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
You can get all column names from df and use that list as parameter for the Window function:
cols = df.columns
cols.remove('FLAG')
my_window = Window.partitionBy(cols).rowsBetween(-sys.maxsize, sys.maxsize)
The remaining code stays unchanged.

Adding comuln/s if not existing using Pandas

I am using Pandas with PsychoPy to reorder my results in a dataframe. The problem is that the dataframe is going to vary according to the participant performance. However, I would like to have a common dataframe, where non-existing columns are created as empty. Then the columns have to be in a specific order in the output file.
Let´s suppose I have a dataframe from a participant with the following columns:
x = ["Error_1", "Error_2", "Error_3"]
I want the final dataframe to look like this:
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
Where "Error_4" is created as an empty column.
I tried applying something like this (adapted from another question):
if "Error_4" not in x:
x["Error_4"] = ""
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
In principle it should work, however I have more or less other 70 columns for which I should do this, and it doesn´t seem practical to do it for each of them.
Do you have any suggestions?
I also tried creating a new dataframe with all the possible columns, e.g.:
y = ["Error_1", "Error_2", "Error_3", "Error_4"]
However, it is still not clear to me how to merge the dataframes x and y skipping columns with the same header.
Use DataFrame.reindex:
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
Thanks for the reply, I followed your suggestion and adapted it. I post it here, since it may be useful for someone else.
First I create a dataframe y as I want my output to look like:
y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
Then, I get my actual output file df and modify it as df2, adding to it all the columns of y in the exact same order.
df = pd.DataFrame(myData)
columns = df.columns.values.tolist()
df2 = df.reindex(columns = y, fill_value='')
In this case, all the columns that are absent in df2 but are present in y, are going to be added to df2.
However, let´s suppose that in df2 there is a column "Error_7" absent in y. To keep track of these columns I am just applying merge and creating a new dataframe df3:
df3 = pd.merge(df2,df)
df3.to_csv(filename+'UPDATED.csv')
The missing columns are going to be added at the end of the dataframe.
If you think this procedure might have drawbacks, or if there is another way to do it, let me know :)

How do I combine two dataframes?

I have a initial dataframe D. I extract two data frames from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0.
Use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
If you're working with big data and need to concatenate multiple datasets calling concat many times can get performance-intensive.
If you don't want to create a new df each time, you can instead aggregate the changes and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs under concatenating objects at the bottom of the section):
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
If you want to update/replace the values of first dataframe df1 with the values of second dataframe df2. you can do it by following steps —
Step 1: Set index of the first dataframe (df1)
df1.set_index('id')
Step 2: Set index of the second dataframe (df2)
df2.set_index('id')
and finally update the dataframe using the following snippet —
df1.update(df2)
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.,:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
# collect excel content into list of dataframes
data = []
for excel_file in excel_files:
data.append(pd.read_excel(excel_file, engine="openpyxl"))
# concatenate dataframes horizontally
df = pd.concat(data, axis=1)
# save combined data to excel
df.to_excel(excelAutoNamed, index=False)
You can try the above when you are appending horizontally! Hope this helps sum1
Use this code to attach two Pandas Data Frames horizontally:
df3 = pd.concat([df1, df2],axis=1, ignore_index=True, sort=False)
You must specify around what axis you intend to merge two frames.

Categories