For example, let's take the Penguins dataset. I want to drop all rows where the value in the bill_length_mm column is more than 30:
import seaborn as sns
import pandas as pd
ds = sns.load_dataset("penguins")
ds.head()
ds.drop(ds[ds['bill_length_mm']>30])
This gives me an error. If I add axis=1, it just drops every column in the dataset:
ds.drop(ds[ds['bill_length_mm']>30], axis=1)
So what should I do to achieve my goal?
Try
ds=ds.drop(ds[ds['bill_length_mm']>30].index)
Or
ds = ds[ds['bill_length_mm']<=30]
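ds.drop removes rows or columns by label, so it expects index labels (or column names with axis=1), not a DataFrame. When you pass the filtered DataFrame, pandas iterates over its column names, which is why axis=1 drops every column. If you only want to keep the rows where bill_length_mm<=30, you can use a boolean mask: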
ds = ds[ds['bill_length_mm']<=30]
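One difference between the two approaches above is worth checking if the column contains missing values: comparisons against NaN are always False, so dropping by index keeps NaN rows while the `<=` mask removes them. A minimal sketch on a small hand-made frame (the values are made up for illustration and are not taken from the penguins dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins data; values are illustrative only.
ds = pd.DataFrame({
    "species": ["A", "B", "C", "D"],
    "bill_length_mm": [25.0, 39.1, np.nan, 30.0],
})

# Approach 1: drop rows by index label where the condition holds.
dropped = ds.drop(ds[ds["bill_length_mm"] > 30].index)

# Approach 2: keep rows where the opposite condition holds.
kept = ds[ds["bill_length_mm"] <= 30]

print(dropped["species"].tolist())  # the NaN row survives: NaN > 30 is False
print(kept["species"].tolist())     # the NaN row is removed: NaN <= 30 is False
```

If the NaN rows should be handled explicitly, `dropna(subset=["bill_length_mm"])` before filtering makes the two approaches agree.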
I'm trying to set up a data quality check for the numeric columns in a dataframe. I want to run describe() to produce stats on each numeric column. How can I filter out the other columns before producing the stats? Here is the code I'm using:
df1 = pandas.read_csv("D:/dc_Project/loans.csv")
print(df1.describe(include=sorted(df1)))
Went with the following from a teammate:
import pandas as pd
import numpy as np
df1 = pd.read_csv("D:/dc_Project/loans.csv")
df2=df1.select_dtypes(include=np.number)
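As a rough illustration of the teammate's approach, the sketch below selects the numeric columns and then calls describe() on them. The frame and its column names are invented for the example, since the real loans.csv is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical loan data; the column names are made up for illustration.
df1 = pd.DataFrame({
    "loan_id": ["a1", "b2", "c3"],
    "amount": [1000.0, 2500.0, 400.0],
    "term_months": [12, 36, 6],
})

# Keep only the numeric columns, then summarise them.
numeric = df1.select_dtypes(include=np.number)
stats = numeric.describe()
print(stats)
```

Note that on a frame with mixed dtypes, describe() with no arguments already restricts itself to numeric columns, so select_dtypes is mainly useful when the numeric subset is needed for later steps too.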
Thank you in advance for taking the time to help me! (Code provided below) (Data Here)
I am trying to average the first 3 columns and insert it as a new column labeled 'Topsoil'. What is the best way to go about doing that?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
raw_data = pd.read_csv('all-deep-soil-temperatures.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
df_selected_station.fillna(method = 'ffill', inplace=True);
df_selected_station_D=df_selected_station.resample(rule='D').mean()
df_selected_station_D['Day'] = df_selected_station_D.index.dayofyear
mean=df_selected_station_D.groupby(by='Day').mean()
mean['Day']=mean.index
#mean.head()
Try this :
mean['avg3col']=mean[['5 cm', '10 cm','15 cm']].mean(axis=1)
df['new column'] = (df['col1'] + df['col2'] + df['col3'])/3
You could use the apply method in the following way:
mean['Topsoil'] = mean.apply(lambda row: np.mean(row.iloc[:3]), axis=1)
You can read about the apply method in the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The logic is that apply calls the same function once for every row (axis=1) or once for every column (axis=0).
Note: It is not wise to give a data structure the same name as a function. In your case, mean_df would be a better name than mean.
Use DataFrame.iloc to select by position (here the first 3 columns) and take the row-wise mean:
mean['Topsoil'] = mean.iloc[:, :3].mean(axis=1)
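To compare the name-based and position-based answers side by side, here is a small sketch. The depth column names follow the answer above, while the temperature values and the extra '100 cm' column are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical daily means; the values are illustrative only.
mean_df = pd.DataFrame({
    "5 cm": [10.0, 12.0],
    "10 cm": [11.0, 13.0],
    "15 cm": [12.0, 14.0],
    "100 cm": [20.0, 21.0],
})

# Row-wise mean of three columns selected by name.
by_name = mean_df[["5 cm", "10 cm", "15 cm"]].mean(axis=1)

# Row-wise mean of the first three columns selected by position.
by_position = mean_df.iloc[:, :3].mean(axis=1)

mean_df["Topsoil"] = by_position
print(mean_df)
```

Selecting by name is more robust if the column order might change; selecting by position is shorter when the layout is fixed.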
I have a data frame with 6 columns, of which the first 2 should be plotted as x and y. I want to replace the values of the 6th column with other values and then exclude the x, y points whose values fall outside a threshold range such as 0.0003-0.002. My attempt is below:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('configuration_1000.out', sep=r"\s+", header=None)
#print(df)
col_5 = df.iloc[:,5]
g = col_5.abs()
g = g*0.00005
#print(g)
df.loc[:,5].replace(g, inplace=True)
#df.head()
selected = df[ (df.loc[:,5] > 0.0003) & (df.loc[:,5] < 0.002) ]
print(selected)
plt.plot(selected[0], selected[1],marker=".")
But when I do this, nothing changes.
You don't need iloc for this, nor do you need to go through the intermediate steps. Just manipulate the column directly.
df[df.columns[5]] = abs(df[df.columns[5]])*0.00005
To solve this problem, you just need to assign the new values back to the column:
df.loc[:,5] = g
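The original attempt changed nothing because replace() searches for values to swap out; it is not an assignment of new values. A minimal sketch of the assign-then-filter flow, assuming a six-column frame with integer column labels like the one header=None produces (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical six-column frame; the values are illustrative only.
df = pd.DataFrame({
    0: [0.0, 1.0, 2.0, 3.0],
    1: [0.5, 1.5, 2.5, 3.5],
    2: [0, 0, 0, 0],
    3: [0, 0, 0, 0],
    4: [0, 0, 0, 0],
    5: [-4.0, 10.0, -30.0, 50.0],
})

# Scale the absolute values and assign the result back to the column.
df[5] = df[5].abs() * 0.00005

# Keep only rows whose scaled value lies strictly inside the threshold band.
selected = df[(df[5] > 0.0003) & (df[5] < 0.002)]
print(selected[[0, 1, 5]])
```

The filtered frame can then be plotted with selected[0] and selected[1] as in the question.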
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
df = pd.read_csv('time_series_covid_19_deaths_US.csv')
df = df.drop(['UID','iso2','iso3','code3','FIPS','Admin2','Combined_Key'],axis =1)
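# Drop the date columns (their names contain '/'); iterate over a copy of the
# column names because the frame is modified inside the loop.
for name in list(df.columns):
    if '/' in name:
        df.drop([name], axis=1, inplace=True)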
df2 = df.set_index(['Lat','Long_'])
print(df2.head())
lat = df2[df2["Lat"]]
print(lat)
long = df2[df2['Long_']]
Code is above. I got the data set from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset - using the US deaths.
I have attached an image of the output. I do not know what this error means.
Apologies if worded ambiguously / incorrectly, or if there is a preexisting answer somewhere
When you define an index using one or more columns, e.g. via set_index(), these columns are promoted to index and by default no longer accessible using the df[<colname>] notation. This behavior can be changed with set_index(..., drop=False) but that's usually not necessary.
With the index in place, use df.loc[] to access single rows by their index value, aka label. Read about label-based indexing here.
To access the values of your MultiIndex as you would do with a column, you can use df.index.get_level_values(<colname>).array (or .to_numpy()). So in your case you could write:
lat = df2.index.get_level_values('Lat').array
print(lat)
long = df2.index.get_level_values('Long_').array
print(long)
BTW: read_csv() has a useful usecols argument that lets you specify which columns to load (others will be ignored).
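A small sketch of the behavior described in this answer, using an invented two-row frame in place of the Kaggle file (only the Lat and Long_ column names come from the question). It also shows reset_index(), which is one way to turn the index levels back into ordinary columns if that is more convenient.

```python
import pandas as pd

# Hypothetical data; only the 'Lat' and 'Long_' names come from the question.
df = pd.DataFrame({
    "Province_State": ["X", "Y"],
    "Lat": [32.5, 30.7],
    "Long_": [-86.6, -87.7],
    "Population": [100, 200],
})
df2 = df.set_index(["Lat", "Long_"])

# After set_index, 'Lat' and 'Long_' are index levels, not columns,
# so df2["Lat"] would raise a KeyError.
lat = df2.index.get_level_values("Lat").to_numpy()
long = df2.index.get_level_values("Long_").to_numpy()

# reset_index() moves the index levels back into regular columns.
restored = df2.reset_index()
print(lat, long)
```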
I am trying to drop some useless columns in a dataframe but I am getting the error: "too many indices for array"
Here is my code :
import pandas as pd
def answer_one():
    energy = pd.read_excel("Energy Indicators.xls")
    energy.drop(energy.index[0,1], axis = 1)
answer_one()
Option 1
The syntax for selecting multiple positions is wrong (it needs a list inside the brackets), and to drop columns you should index energy.columns, not energy.index:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(energy.columns[[0,1]], axis=1)
Option 2
I'd do it like this
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.iloc[:, 2:]
I think it's better to skip unneeded columns when parsing/reading Excel file:
energy = pd.read_excel("Energy Indicators.xls", usecols='C:ZZ')
If you're trying to drop columns, you need to change the syntax. You can refer to them by their header name or by their position. Here is how you would refer to them by name:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(['first_column', 'second_column'], axis=1, inplace=True)
Another solution would be to exclude them in the first place:
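# usecols does not accept slice syntax; list the positions of the columns to keep
# (the positions below are placeholders, adjust them to your sheet)
energy = pd.read_excel("Energy Indicators.xls", usecols=[2, 3, 4, 5])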
This will help speed up the import as well.
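For the position-based options above, here is a minimal sketch on an in-memory frame, since the Excel file itself is not available here. The column names are hypothetical. It also shows that drop() returns a new frame unless inplace=True is passed, so the result has to be assigned.

```python
import pandas as pd

# Hypothetical frame with two unwanted leading columns; names are made up.
energy = pd.DataFrame({
    "unused_a": [None, None],
    "unused_b": [None, None],
    "Country": ["X", "Y"],
    "Energy Supply": [1.0, 2.0],
})

# Option 1: drop the first two columns by position via their labels.
dropped = energy.drop(energy.columns[[0, 1]], axis=1)

# Option 2: keep everything from the third column onward.
sliced = energy.iloc[:, 2:]

print(list(dropped.columns))
print(energy.shape)  # the original frame is unchanged
```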