I am trying to drop some unneeded columns from a DataFrame, but I am getting the error: "too many indices for array"
Here is my code:
import pandas as pd
def answer_one():
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(energy.index[0,1], axis = 1)
answer_one()
Option 1
Your slicing syntax is wrong, and you should be slicing the columns, not the index:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(energy.columns[[0,1]], axis=1)
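As a minimal sketch with an invented toy frame (not the actual Excel data), dropping by column position works like this; note that drop returns a new DataFrame by default, so assign the result back to keep it:

```python
import pandas as pd

# Toy stand-in for the Excel data; column names are made up.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# df.columns[[0, 1]] picks the labels of the first two columns ("a", "b");
# drop returns a copy, so reassign to keep the change.
df = df.drop(df.columns[[0, 1]], axis=1)
print(list(df.columns))  # ['c']
```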
Option 2
I'd do it like this
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.iloc[:, 2:]
I think it's better to skip the unneeded columns when reading the Excel file in the first place:
energy = pd.read_excel("Energy Indicators.xls", usecols='C:ZZ')
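A toy illustration of the positional slice above (the column names here are invented): iloc[:, 2:] keeps every row and all columns from position 2 onward.

```python
import pandas as pd

df = pd.DataFrame({"x": [0], "y": [1], "keep1": [2], "keep2": [3]})

# All rows, columns from position 2 to the end
trimmed = df.iloc[:, 2:]
print(list(trimmed.columns))  # ['keep1', 'keep2']
```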
If you're trying to drop columns, you need to change the syntax. You can refer to them by header name or by index. Here is how you would refer to them by name:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(['first_column', 'second_column'], axis=1, inplace=True)
Another solution would be to exclude them in the first place. Note that usecols needs an explicit sequence of positions (a bare slice like [2:] is not valid Python), for example:
energy = pd.read_excel("Energy Indicators.xls", usecols=range(2, 8))  # positions 2 onward; adjust the upper bound to your file
This will help speed up the import as well.
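A self-contained sketch of the same idea, using read_csv on an in-memory buffer so it runs anywhere (read_excel accepts the same usecols parameter); the column names are invented:

```python
import pandas as pd
from io import StringIO

csv = StringIO("skip1,skip2,a,b\n1,2,3,4\n5,6,7,8")

# Only columns at positions 2 and 3 are parsed; the rest are never read.
df = pd.read_csv(csv, usecols=[2, 3])
print(list(df.columns))  # ['a', 'b']
```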
For example, let's take the Penguins dataset. I want to drop all entries in the bill_length_mm column where the value is more than 30:
import seaborn as sns
import pandas as pd
ds = sns.load_dataset("penguins")
ds.head()
ds.drop(ds[ds['bill_length_mm']>30])
And it gives me an error. And if I try to add axis=1, it just drops every column in the dataset.
ds.drop(ds[ds['bill_length_mm']>30], axis=1)
So what should I do to accomplish my goal?
Try
ds=ds.drop(ds[ds['bill_length_mm']>30].index)
Or
ds = ds[ds['bill_length_mm']<=30]
ds.drop expects row or column labels, not a boolean-filtered DataFrame. If you only want to keep the rows where bill_length_mm<=30, you can use
ds = ds[ds['bill_length_mm']<=30]
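A quick check on a small toy frame (standing in for the penguins dataset, so no download is needed) shows both approaches keep the same rows:

```python
import pandas as pd

ds = pd.DataFrame({"bill_length_mm": [25.0, 35.0, 28.0, 40.0]})

# Approach 1: drop the rows whose index matches the condition
dropped = ds.drop(ds[ds["bill_length_mm"] > 30].index)

# Approach 2: keep only the rows satisfying the complement
kept = ds[ds["bill_length_mm"] <= 30]

print(dropped["bill_length_mm"].tolist())  # [25.0, 28.0]
print(kept["bill_length_mm"].tolist())     # [25.0, 28.0]
```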
Provided I have a multiindex data Frame as follows:
import pandas as pd
import numpy as np
input_id = np.array(['input_id'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])
index = pd.MultiIndex.from_product([input_id,docType,docId,sec_type,sec_ids])
content= [str(np.random.randint(1,10))+ '##' + str(np.random.randint(1,10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
I would like to query the MultiIndex DataFrame:
# query a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,'pub',:,'de',:]]
Resulting in:
I would like to get the values of the index level sec_ids directly as a list. How do I have to modify the query to get the following result:
['x-y','z-k']
Thanks
You can use the MultiIndex.get_level_values() method to get the values of a specific level of a MultiIndex. So in this case call it after your slice.
df.loc[idx[:,'pub',:,'de',:]].index.get_level_values('sec_ids').tolist()
#['x-y', 'z-k']
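On a smaller stand-alone MultiIndex (two levels instead of five, with made-up labels), the same pattern applies: slice with IndexSlice, then read a level off the resulting index:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["pre", "pub"], ["x-y", "z-k"]], names=["docType", "sec_ids"]
)
df = pd.DataFrame({"content": range(4)}, index=index)

idx = pd.IndexSlice
# Select the 'pub' slice, then pull the sec_ids level out as a plain list
result = df.loc[idx["pub", :], :].index.get_level_values("sec_ids").tolist()
print(result)  # ['x-y', 'z-k']
```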
Thank you in advance for taking the time to help me! (Code provided below) (Data Here)
I am trying to average the first 3 columns and insert it as a new column labeled 'Topsoil'. What is the best way to go about doing that?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
raw_data = pd.read_csv('all-deep-soil-temperatures.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
df_selected_station.fillna(method='ffill', inplace=True)
df_selected_station_D=df_selected_station.resample(rule='D').mean()
df_selected_station_D['Day'] = df_selected_station_D.index.dayofyear
mean=df_selected_station_D.groupby(by='Day').mean()
mean['Day']=mean.index
#mean.head()
Try this :
mean['avg3col']=mean[['5 cm', '10 cm','15 cm']].mean(axis=1)
Or, equivalently, summing the columns by hand:
df['new column'] = (df['col1'] + df['col2'] + df['col3'])/3
You could use the apply method in the following way:
mean['Topsoil'] = mean.apply(lambda row: np.mean(row[0:3]), axis=1)
You can read about the apply method in the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The logic is that you perform the same task along a specific axis multiple times.
Note: it is unwise to name variables after functions; in your case mean_df would be a better name than mean.
Use DataFrame.iloc for select by positions - first 3 columns with mean:
mean['Topsoil'] = mean.iloc[:, :3].mean(axis=1)
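A minimal end-to-end sketch of the positional version, using invented column names in place of the soil-temperature data:

```python
import pandas as pd

mean_df = pd.DataFrame({"5 cm": [1.0, 4.0], "10 cm": [2.0, 5.0],
                        "15 cm": [3.0, 6.0], "Day": [1, 2]})

# Average across the first three columns, row by row (axis=1)
mean_df["Topsoil"] = mean_df.iloc[:, :3].mean(axis=1)
print(mean_df["Topsoil"].tolist())  # [2.0, 5.0]
```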
Here below is the CSV file that I'm working with:
I'm trying to get my hands on the enj coin: (United States) column. However, when I print all of the columns of the DataFrame, it doesn't appear to be treated as a column.
Code:
import pandas as pd
df = pd.read_csv("/multiTimeline.csv")
print(df.columns)
I get the following output:
Index(['Category: All categories'], dtype='object')
I've tried accessing the column with df['Category: All categories']['enj coin: (United States)'] but sadly it doesn't work.
Question:
Could someone possibly explain to me how I could possibly transform this DataFrame (which has only one column Category: All categories) into a DataFrame which has two columns Time and enj coin: (United States)?
Thank you very much for your help
Try using the parameter skiprows=2 when reading in the CSV, i.e.:
df = pd.read_csv("/multiTimeline.csv", skiprows=2)
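A self-contained check, with an in-memory buffer standing in for the real file (the two preamble lines mimic the Google Trends export format):

```python
import pandas as pd
from io import StringIO

csvdata = StringIO(
    "Category: All categories\n"
    "\n"
    "Time,enj coin: (United States)\n"
    "2019-04-10T19,7\n"
    "2019-04-10T20,20\n"
)

# Skip the two preamble lines so the third line becomes the header
df = pd.read_csv(csvdata, skiprows=2)
print(list(df.columns))  # ['Time', 'enj coin: (United States)']
```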
The CSV itself looks fine; just ignore the extra header line at the top by telling pandas which row holds the real header:
pd.read_csv(csvdata, header=[1])
The entire header can be taken in as well (as a two-level column index), although it is not delimited the way the data is.
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in later pandas versions
print(pd.__version__)
csvdata = StringIO("""Category: All categories
Time,enj coin: (United States)
2019-04-10T19,7
2019-04-10T20,20""")
df = pd.read_csv(csvdata, header=[0,1])
print(df)
0.24.2
Category: All categories
Time
2019-04-10T19 7
2019-04-10T20 20
I have a dataframe that currently looks like this:
import numpy as np
raw_data = {'Series_Date':['2017-03-10','2017-03-13','2017-03-14','2017-03-15'],'SP':[35.6,56.7,41,41],'1M':[-7.8,56,56,-3.4],'3M':[24,-31,53,5]}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Series_Date','SP','1M','3M'])
print(df)
I would like to transpose it so that all the value fields go into a Value column, with the date repeated for each row and the original column name of the value field becoming a row in the Desc column. That is, the resulting DataFrame should look like this:
import numpy as np
raw_data = {'Series_Date':['2017-03-10','2017-03-10','2017-03-10','2017-03-13','2017-03-13','2017-03-13','2017-03-14','2017-03-14','2017-03-14','2017-03-15','2017-03-15','2017-03-15'],'Value':[35.6,-7.8,24,56.7,56,-31,41,56,53,41,-3.4,5],'Desc':['SP','1M','3M','SP','1M','3M','SP','1M','3M','SP','1M','3M']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Series_Date','Value','Desc'])
print(df)
Could someone please help me flip and transpose my DataFrame this way?
Use pd.melt to transform the DataFrame from wide format to long:
idx = "Series_Date" # identifier variable
pd.melt(df, id_vars=idx, var_name="Desc").sort_values(idx).reset_index(drop=True)
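Run on (a two-date subset of) the question's data, the melt produces the long format directly:

```python
import pandas as pd

raw_data = {"Series_Date": ["2017-03-10", "2017-03-13"],
            "SP": [35.6, 56.7], "1M": [-7.8, 56], "3M": [24, -31]}
df = pd.DataFrame(raw_data, columns=["Series_Date", "SP", "1M", "3M"])

idx = "Series_Date"  # identifier variable
# Each non-id column becomes (Desc, Value) pairs, one row per original cell
long_df = (pd.melt(df, id_vars=idx, var_name="Desc", value_name="Value")
             .sort_values(idx)
             .reset_index(drop=True))
print(long_df)
```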