Drop first row when index doesn't start at 0 - python

I want to drop the first row of a dataframe subset, which is a subset of the main dataframe main. The first row of subset has index = 31, so when I try dropping the first row I get the following error:
>>> subset.drop(0, axis=1)
KeyError: '[0] not found in axis'
I want to perform this drop on multiple dataframes, so I cannot drop index 31 on every dataframe. Is it possible to drop the first row when the index isn't equal to 0?

Simplest is to select all rows except the first by position:
df = df.iloc[1:]
Or with drop it is possible to select the first index value, but if there are duplicated index values, all matching rows are removed:
df = df.drop(df.index[0])
Your solution tries to remove the column named 0, because axis=1 refers to columns:
subset.drop(0, axis=1)

Another option, which drops the first row only when the index does not start at 0:
df = df if df.index[0] == 0 else df.iloc[1:]
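As a minimal sketch (the data below is made up), iloc[1:] works no matter what the first index label is:
import pandas as pd

# hypothetical subset whose index starts at 31, as in the question
subset = pd.DataFrame({'a': [10, 20, 30]}, index=[31, 32, 33])

subset = subset.iloc[1:]  # drops the first row by position, whatever its label
print(subset)  # the index-31 row is gone; 32 and 33 remain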

Related

Pandas dataframe: sample() function resets indexes?

Please consider a pandas dataframe final_df with 142457 rows, correctly indexed:
0, 1, 2, 3, 4, ..., 142452, 142453, 142454, 142455, 142456
I create / sample a new df data_test_for_all_models from this one:
data_test_for_all_models = final_df.copy().sample(frac=0.1, random_state=786)
A few indexes: 2235, 118727, 23291
Now I drop rows from final_df with indexes in data_test_for_all_models:
final_df = final_df.drop(data_test_for_all_models.index)
If I check one of those indexes against final_df:
final_df.iloc[2235]
it wrongly returns a row.
I think it's a problem of reset indexes, but which function does it: drop() or sample()?
Thanks.
You are using .iloc, which provides integer-based (positional) indexing. You are getting the row at position 2235, not the row with index label 2235.
For that, you should use .loc:
final_df.loc[2235]
And you should get a KeyError.
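A small sketch with toy data (the frame below is made up) showing the positional/label distinction after a drop:
import pandas as pd

final_df = pd.DataFrame({'x': range(5)})              # labels 0..4
sample = final_df.sample(frac=0.4, random_state=786)  # sampled rows keep their labels
final_df = final_df.drop(sample.index)

dropped = sample.index[0]
print(final_df.iloc[0])    # positional: returns whichever row is now first
try:
    final_df.loc[dropped]  # label-based: the dropped label no longer exists
except KeyError:
    print('label', dropped, 'was dropped, as expected')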

Reindex a dataframe with the index of another dataframe, using the sum to fill the values

First create a dataframe with a regular index; this is the df that I want to resample using the index of df1:
df0 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50,freq='1s'))
I didn't know how to create a df that has an irregular index, so I created a new dataframe (whose index I want to use) to resample df0:
df1 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50,freq='20s'))
For a minimal reproducible example, create a column with values between 0 and 1:
df0['dat'] = np.random.rand(len(df0))
I want to find the rows where the dat column has a value greater than or equal to 0.5:
df0['target'] = 0
df0.loc[(df0['dat'] >= 0.5), 'target'] = 1
I then want to reindex df0 using the index of df1, but each row of the df0['target'] column should hold the sum of the values that fall in that window.
What I have tried is:
new_index = df1.index
df_new = df0.reindex(df0.index.union(new_index)).interpolate(method='linear').reindex(new_index).sum()
But this sum() screws everything up: it collapses the whole frame into one total per column.
IIUC:
try:
df_new = df0.reindex(df1.index.union(df0.index)).interpolate(method='linear').reset_index()
Finally make use of pd.Grouper() and groupby():
out = df_new.groupby(pd.Grouper(key='index', freq='1 min')).sum()
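Put together, a self-contained sketch of that pipeline. Note the answer groups by '1 min'; the sketch assumes instead that the windows should match df1's 20-second spacing, so it uses freq='20s':
import numpy as np
import pandas as pd

df0 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50, freq='1s'))
df1 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50, freq='20s'))
df0['dat'] = np.random.rand(len(df0))
df0['target'] = (df0['dat'] >= 0.5).astype(int)

# align onto the union of both indexes, interpolate, then sum per window
df_new = df0.reindex(df1.index.union(df0.index)).interpolate(method='linear').reset_index()
out = df_new.groupby(pd.Grouper(key='index', freq='20s')).sum()  # assumed 20s windows
print(out.head())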

Add values in specific columns horizontally in a Python dataframe

I have created a dataframe in pandas from a NumPy array, but I want to know how to add values to specific columns horizontally, not vertically.
Let's assume I have this dataframe:
df = pd.DataFrame(data=data1)
How can I add [1.2, 3.5, 2.2] to the second row of (-1,label), (-2,label), (0,label)?
Use DataFrame.loc:
# if you need to set the last 3 columns for index label 1
df.loc[1, df.columns[-3:]] = [1.2, 3.5, 2.2]
Or DataFrame.iloc:
# if you need to set the last 3 columns for the second row by position
df.iloc[1, -3:] = [1.2, 3.5, 2.2]
Or:
# if you need to set columns by name
cols = ['col1', 'col3', 'col5']
df.loc[1, cols] = [1.2, 3.5, 2.2]
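A runnable sketch; the frame and its column names are made up, since data1 isn't defined in the question:
import pandas as pd

df = pd.DataFrame(0.0, index=range(3), columns=['(-2,label)', '(-1,label)', '(0,label)'])
df.iloc[1, -3:] = [1.2, 3.5, 2.2]  # second row, last three columns, by position
print(df)  # row 1 now holds 1.2, 3.5, 2.2; the other rows stay 0.0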

Pandas Dataframe - Drop columns if all the values within the column are either 0, 1, NaN

Assuming this is the dataframe, I'm looking to drop the columns where all the values within the column are either 0, 1, or NaN.
df = pd.DataFrame([[1,0,0,0], [0,0,1,0],[2,'NaN',1,0]])
The end result should be just the first column "0", with the remaining columns dropped.
Try:
lst = [0, 1, 'NaN']
mask = df.isin(lst).all(axis=0)
df.drop(mask.loc[mask].index, axis=1, inplace=True)
It will drop every column whose values are all in lst.
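Run end-to-end on the question's own frame (note that the question's 'NaN' is the literal string, which is why it appears in lst):
import pandas as pd

df = pd.DataFrame([[1, 0, 0, 0], [0, 0, 1, 0], [2, 'NaN', 1, 0]])

lst = [0, 1, 'NaN']
mask = df.isin(lst).all(axis=0)  # True where every value in a column is in lst
df = df.drop(mask.loc[mask].index, axis=1)
print(df)  # only column 0 survives, because of the value 2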
Or, hard-coding the offending columns for this particular frame:
df.drop([1, 2, 3], axis=1)
See also the pandas DataFrame.drop documentation.

pandas add column to dataframe aggregate on time series

I've done a dataframe aggregation and I want to add a new column in which, if a row has a value > 0 in year 2020, it will put a 1, otherwise a 0.
This is my code (the head of the dataframe is not reproduced here):
df['year'] = pd.DatetimeIndex(df['TxnDate']).year # add year column
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ') # add column with the first 3 words
Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)
df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020'] = np.where(df1['year'] == 2020, 1, 0)
The df1 printed before the last line looks correct, but the last line raises: KeyError: 'year'
Thanks.
When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and those columns are a MultiIndex. You can inspect them by calling:
print(df1.columns)
And then you can select them.
Using the MultiIndex column
So to select the column which matches 2020, you can use:
df1.loc[:, df1.columns.get_level_values(2).isin({2020})]
You can probably get the correct column and then check whether 2020 has a non-zero value using:
df1['nb2020'] = df1.loc[:, df1.columns.get_level_values('year').isin({2020})] > 0
If you would like to have the 1 and 0 (instead of the bool types), you can convert to int (using astype).
Renaming the columns
If you think this is a bit complicated, you might prefer to change the columns to a single index, using something like:
df1.columns = df1.columns.get_level_values('year')
Or
df1.columns = df1.columns.get_level_values(2)
And then
df1['nb2020'] = (df1[2020] > 0).astype(int)
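A self-contained sketch with made-up clients and amounts, tying the pieces together:
import pandas as pd

df = pd.DataFrame({
    'client': ['acme', 'acme', 'beta', 'beta'],
    'year':   [2019, 2020, 2019, 2020],
    'Amount': [100.0, 0.0, 50.0, 75.0],
})

df1 = df.groupby(['client', 'year']).agg({'Amount': ['sum']}).unstack()
print(df1.columns)  # MultiIndex; the year level holds 2019 and 2020

# select the 2020 column through the MultiIndex, then flag non-zero sums
col_2020 = df1.loc[:, df1.columns.get_level_values('year') == 2020]
df1['nb2020'] = (col_2020.squeeze() > 0).astype(int)
print(df1)  # acme gets 0 (its 2020 sum is 0.0), beta gets 1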
