Does a function exist in pandas that can delete a row without making a copy?
Functions like
.drop()
.dropna()
.drop_duplicates()
unfortunately return a copy of the data frame.
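For illustration, a minimal sketch of that default behavior on a throwaway DataFrame; note that these methods also accept an inplace=True argument, which modifies the frame in place and returns None:

import pandas as pd

df = pd.DataFrame({'quantity': [1, 1, 2]})

# Default: a new DataFrame is returned; df itself is untouched
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2

# With inplace=True the method returns None and modifies df directly
df.drop_duplicates(inplace=True)
print(len(df))  # 2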
Related
In the codebase I'm working in, I see the following
df1 = some pandas dataframe (one of the columns is quantity)
df1_subset = df1[df1.quantity == input_quantity].copy()
I am wondering why the .copy() is needed here? Doesn't df1_subset = df1[df1.quantity == input_quantity] make df1_subset a copy and not a reference?
I saw in "Why should I make a copy of a data frame in pandas?" that a reference would be returned, but I think this is outdated?
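For illustration, a minimal sketch of the difference the .copy() makes (the price column is made up; whether the warning actually fires depends on the pandas version and whether copy-on-write is enabled):

import pandas as pd

df1 = pd.DataFrame({'quantity': [1, 2, 2], 'price': [10, 20, 30]})
input_quantity = 2

# Without .copy(): the subset is a new object, but pandas (before
# copy-on-write) may still flag later assignments on it.
df1_subset = df1[df1.quantity == input_quantity]
df1_subset['price'] = 0  # may emit SettingWithCopyWarning

# With .copy(): explicitly independent, assignments never warn and
# never risk touching df1.
df1_subset = df1[df1.quantity == input_quantity].copy()
df1_subset['price'] = 0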
I'm new to Python and just trying to redo my first project from MATLAB. I've written code in VS Code to import an Excel file using pandas:
filename=r'C:\Users\user\Desktop\data.xlsx'
sheet=['data']
with pd.ExcelFile(filename) as xls:
    Dateee = pd.read_excel(xls, sheet, index_col=0)
Then I want to access data in a row and column.
I tried to print the data using the code below:
for key in dateee.keys():
    print(dateee.keys())
but this returns nothing.
Is there any way to access the data (as a list)?
You can iterate on each column, making the contents of each a list:
for c in df:
    print(df[c].to_list())
df is whatever name the DataFrame was assigned to. (The OP's syntax was inconsistent, so I didn't use their names.)
Look into df.iterrows() or df.itertuples() if you want to iterate by row. Example:
for row in df.itertuples():
    print(row)
Look into df.iloc and df.loc for row and column selection of individual values; see Pandas iloc and loc – quickly select rows and columns in DataFrames.
Or df.iat or df.at for getting or setting single values, see here, here, and here.
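A quick sketch of those selectors on a throwaway DataFrame:

import pandas as pd

df = pd.DataFrame({'quantity': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df.loc['b', 'quantity'])  # label-based selection -> 20
print(df.iloc[1, 0])            # position-based selection -> 20
print(df.at['b', 'quantity'])   # fast single-value access by label
print(df.iat[1, 0])             # fast single-value access by position
df.at['b', 'quantity'] = 25     # setting a single value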
I've imported a CSV dataset, cleaned it (e.g. removed duplicates) and then tried to export the updated CSV. However, the exported CSV file contains the same data as the original, rather than the updated DataFrame.
I've tried both df.to_csv('out.csv') and df.to_csv(r'out.csv')
Reading data from a CSV file into a DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
Drop duplicates
df.drop_duplicates()
Save updated DataFrame to CSV
df.to_csv(r'cleanedData.csv')
or
df.to_csv('cleanedData.csv')
Can anyone spot what I'm doing wrong?
Since the data has 100 rows, and 25 are duplicates, I expect there to be 75 left. Within a Jupyter notebook, the duplicates are correctly dropped from the DataFrame. However, when I open the actual CSV file that I exported, I still have 100 rows of data.
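For reference, a quick check along these lines shows that df itself is not being modified (row counts follow the numbers in the question):

import pandas as pd

df = pd.read_csv('data.csv')
deduped = df.drop_duplicates()

print(df.shape)       # still 100 rows: df itself is unchanged
print(deduped.shape)  # 75 rows: the deduplicated result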
You need to set the inplace argument to True
Ex:
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace=True)
More info
You also need to pass inplace=True to drop_duplicates(); otherwise it will return a new DataFrame, leaving your original intact.
You should add the inplace option to keep the changes in your original DataFrame:
df.drop_duplicates(inplace=True)
By default, the drop_duplicates() method returns a new DataFrame with duplicated elements removed, so in your case df remains the same. You should write:
df.drop_duplicates(inplace=True)
df.to_csv('cleanedData.csv')
See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
According to the pandas drop_duplicates documentation, df.drop_duplicates() returns the deduplicated DataFrame, so the correct form would be:
deduplicated_df = df.drop_duplicates()
On the other hand, drop_duplicates() has an inplace flag to modify the DataFrame directly; if you want to replace df, set the flag to True:
df.drop_duplicates(inplace=True)
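Putting either variant together with the export step from the question, a minimal sketch:

import pandas as pd

df = pd.read_csv('data.csv')

# Variant 1: keep the returned DataFrame and export it
cleaned = df.drop_duplicates()
cleaned.to_csv('cleanedData.csv')

# Variant 2: modify df in place, then export df itself
df.drop_duplicates(inplace=True)
df.to_csv('cleanedData.csv')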
I've recently seen this kind of code:
import pandas as pd
data = pd.read_csv('/path/to/some/data.csv')
colX = data['colX'].copy()
data.drop(labels=['colX'], inplace=True, axis=1)
I know that, to make an explicit copy of an object, I need copy(), but in this case, when extracting and subsequently deleting a column, is there a good reason to use copy()?
@EdChum stated in the comments:
the user may want to separate that column from the main df. Of course, if the user just wanted to delete that column, then taking a copy is pointless. But if, instead of taking a copy, they took a reference, then operations on that column may or may not affect the original df if it wasn't dropped.
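A minimal sketch of that point, using the column name from the snippet above (colY is made up; whether a write actually propagates back depends on the pandas version and copy-on-write settings):

import pandas as pd

data = pd.DataFrame({'colX': [1, 2, 3], 'colY': [4, 5, 6]})

# With an explicit copy, colX is independent of data, so dropping the
# column and editing the extracted Series never interact.
colX = data['colX'].copy()
data.drop(labels=['colX'], inplace=True, axis=1)
colX.iloc[0] = 99  # only the copy changes

# Without .copy(), colX may share memory with the original frame, and
# (before copy-on-write) writes into it can warn or propagate back.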
Let's presume I have a simple DataFrame df
and a simple method that does something to the DataFrame:
def alterDF(df):
    df1['new column'] = df['some column'] + x
    return df1
In the above method I modify an entire column with x and save it to a new variable name...inside the method!
However, when I inspect my original dataframe (i.e. df) I see that it also has the new column added to it...
I am aware that the original dataframe I created exists outside of the method. But I would expect that any alterations that occur inside the method, should remain there, unless I save the changes via the return block in my method.
However, I know that I am wrong... the changes applied within my method also occur outside of it. How can this be? Why is this so?
Probably because you have a line like this:
df1 = df  # this only copies the reference, so df1 and df are the same object
If you want to copy a DataFrame, use
df1 = df.copy()
instead
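A minimal sketch of the difference, reusing the names from the question (df2 and 'another column' are made-up names for contrast):

import pandas as pd

df = pd.DataFrame({'some column': [1, 2, 3]})
x = 10

# Reference: df1 and df are the same object, so the new column
# shows up on df as well.
df1 = df
df1['new column'] = df['some column'] + x
print('new column' in df.columns)  # True

# Independent copy: the change stays on the copy only.
df2 = df.copy()
df2['another column'] = df2['some column'] * 2
print('another column' in df.columns)  # False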