Drop function in Pandas is changing original dataframe? - python

I am using drop in pandas with an inplace=True set. I am performing this on a duplicate dataframe, but the original dataframe is also being modified.
df1 = df
for col in df1.columns:
if df1[col].sum() > 1:
df1.drop(col,inplace=True,axis=1)
This is modifying my 'df' dataframe and don't seem to understand why.

Use df1 = df.copy(). Otherwise they are the same object in memory.
However, it would be better to generate a new DataFrame directly, e.g.
df1 = df.loc[:, df.sum() <= 0]

Related

Python Pandas drop a specific raneg of columns that are all NaN

I'm attempting to drop a range of columns in a pandas dataframe that have all NaN. I know the following code:
df.dropna(axis=1, how='all', inplace = True)
Will search all the columns in the dataframe and drop the ones that have all NaN.
However, when I extend this code to a specific range of columns:
df[df.columns[48:179]].dropna(axis=1, how='all', inplace = True)
The result is the original dataframe with no columns removed. I also no for a fact that the selected range has multiple columns with all NaN's
Any idea what I might be doing wrong here?
Don't use inplace=True. Instead do this:
cols = df.columns[48:179]
df[cols] = df[cols].dropna(axis=1, how='all')
inplace=True can only used when you apply changes to the whole dataframe. I won't work in range of columns. Try to use dropna without inplace=True to see the results(in a jupyter notebook)

Pandas returning whole dataframe when .loc to a specific column that is not existant

I have a dataframe with column names ['2533,3093', '1645,2421', '1776,1645', '3133,2533', '2295,2870'] and I'm trying to add a new column which is '2009,3093'.
I'm using df.loc[:, col] = some series, but it is returning a KeyError meaning that column does not exist. But by default, pandas would create that column. If I do df.loc[:, 'test'] = value it works fine.
But somehow, when I do df.loc[:, col], it returns me the entire dataframe. When it should actually return a KeyError, because the column does not existe in the dataframe.
Any thoughts?
Thanks
please use this syntax
df.loc[:,[column name]] = series
df.loc[:, ['2009,3093']] = series
I have used this code for testing, not sure what series you were trying to assing
import pandas as pd
col = ['2533,3093', '1645,2421', '1776,1645', '3133,2533']
df = pd.DataFrame(columns=col)
df.loc[:, ['2009,3093']] = ['a','b','c','d']
print(df)

Dropping column in dataframe with assignment not workng in a loop

I have two dataframes (df_train and df_test) containing a column ('Date') that I want to drop.
As far as I understood, I could do it in two ways, i.e. either by using inplace or by assigning the dataframe to itself, like:
if 'Date' in df_train.columns:
df_train.drop(['Date'], axis=1, inplace=True)
OR
if 'Date' in df_train.columns:
df_train = df_train.drop(['Date'], axis=1)
Both the methods work on the single dataframe, but the former way should be more memory friendly, since with the assignent a copy of the dataframe is created.
The weird thing is, I have to do it for both the dataframes, so I tried to do the same within a loop:
for data in [df_train, df_test]:
if 'Date' in data.columns:
data.drop(['Date'], axis=1, inplace=True)
and
for data in [df_train, df_test]:
if 'Date' in data.columns:
data = data.drop(['Date'], axis=1)
and the weird thing is that, in this case, only the first ways (using inplace) works. If I use the second way, the 'Date' columns aren't dropped.
Why is that?
It doesn't work because iterating through the list and changing what's in the list doesn't actually change the actual list of dataframes because it only changes the iterators, so you should try:
lst = []
for data in [df_train, df_test]:
if 'Date' in data.columns:
lst.append(data.drop(['Date'], axis=1))
print(lst)
Now lst contains all the dataframes.
Its better to use a list comprehension:
res = [data.drop(['Date'], axis=1) for data in [df_train, df_test] if 'Date' in data.columns]
Here, you will get a copy of both dataframes after columns are dropped.

pandas conditional selection - returning a view not a copy

I have an original pandas Dataframe with a chain of objects doing conditional selection on it. Each time I do a conditional selection, pandas creates a new dataframe. In other words:
import pandas as pd
df = pd.DataFrame(dict(A=range(3,23), B=range(5,25)))
print(id(df))
df2 = df[df['A']> 15]
print(id(df2))
df = pd.DataFrame(dict(A=range(3,43), B=range(5,45)))
print(id(df))
# output:
139963862409288
139963862409456
139963862275296
In the above example, I want df2 to change when I update df. I know that now because I rebind the variable df to a new Pandas DataFrame (a new object), its ID changes and df2 is not connected to the new df anymore. Is there anyway to do it the way I want? Is there any method/attribute in pandas to keep the connection between the original Dataframe and my conditional selection, or any Pythonic way I'm not aware of?
What are you trying to accomplish? Maybe it can be accomplished in a different way?
Regarding having views instead of copies -- when you select a single row or column, you have a view. The code below demonstrates this:
import pandas as pd
df = pd.DataFrame(dict(A=range(8,13), B=range(10,15), C=range(-3,2)))
print(df)
print('-----------')
dfa = df['A']
df2 = df.loc[2]
dfi = df.iloc[2]
dfa[2]=42
df2['B']=99
dfi['C']=-1
print(df)
print(dfa)
print(df2)
print(dfi)

Pandas how to concat two dataframes without losing the column headers

I have the following toy code:
import pandas as pd
df = pd.DataFrame()
df["foo"] = [1,2,3,4]
df2 = pd.DataFrame()
df2["bar"]=[4,5,6,7]
df = pd.concat([df,df2], ignore_index=True,axis=1)
print(list(df))
Output: [0,1]
Expected Output: [foo,bar] (order is not important)
Is there any way to concatenate two dataframes without losing the original column headers, if I can guarantee that the headers will be unique?
Iterating through the columns and then adding them to one of the DataFrames comes to mind, but is there a pandas function, or concat parameter that I am unaware of?
Thanks!
As stated in merge, join, and concat documentation, ignore index will remove all name references and use a range (0...n-1) instead. So it should give you the result you want once you remove ignore_index argument or set it to false (default).
df = pd.concat([df, df2], axis=1)
This will join your df and df2 based on indexes (same indexed rows will be concatenated, if other dataframe has no member of that index it will be concatenated as nan).
If you have different indexing on your dataframes, and want to concatenate it this way. You can either create a temporary index and join on that, or set the new dataframe's columns after using concat(..., ignore_index=True).
I don't think the accepted answer answers the question, which is about column headers, not indexes.
I am facing the same problem, and my workaround is to add the column names after the concatenation:
df.columns = ["foo", "bar"]

Categories