Increase index of pandas DataFrame by one - python

I'd like to have my dataframe start with index 1 instead of 0. But somehow I am not getting it:
In[1]: df = pd.DataFrame([[4,7],[10,11],[7,2]],columns=['one', 'two'])
In[2]: df
Out[2]:
   one  two
0    4    7
1   10   11
2    7    2
In[3]: df.reindex(range(1,len(df)+1))
Out[3]:
   one  two
1   10   11
2    7    2
3  NaN  NaN
Where did my first row go? What am I getting wrong about reindex()?

To increase your index by 1 you can simply modify the index like this: df.index += 1.
Full example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[4,7],[10,11],[7,2]],columns=['one', 'two'])
In [3]: df
Out[3]:
   one  two
0    4    7
1   10   11
2    7    2
In [4]: df.index += 1
In [5]: df
Out[5]:
   one  two
1    4    7
2   10   11
3    7    2

reindex does not reassign the index values while preserving the order; for that, you can assign to the index directly:
In [25]:
df.index = range(1,len(df)+1)
df
Out[25]:
   one  two
1    4    7
2   10   11
3    7    2
The docs show that reindex conforms your data to the new index: rows whose labels are not in the new index are dropped (that is where your first row went), and labels with no existing data get NaN values, which is why reindex has a fill_value parameter.
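As a minimal sketch of that explanation, reindex with fill_value keeps the conforming behavior but fills missing labels instead of leaving NaN (assuming the df from the question):

import pandas as pd

df = pd.DataFrame([[4,7],[10,11],[7,2]], columns=['one', 'two'])
# labels 1 and 2 keep their existing rows; label 3 has no data and is
# filled with 0; label 0 is not in the new index, so that row is dropped
df.reindex(range(1, len(df) + 1), fill_value=0)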

Related

Pandas Dataframe Create New Column That is Row Below Current Row's Value

I've got a pandas dataframe. I want to 'lag' one of my columns: for example, shift the entire column 'gdp' up by one and then remove the excess row at the bottom so that all columns are of equal length again.
df =
   y  gdp  cap
0  1    2    5
1  2    3    9
2  8    7    2
3  3    4    7
4  6    7    7
df_lag =
   y  gdp  cap
0  1    3    5
1  2    7    9
2  8    4    2
3  3    7    7
Any way to do this?
In [44]: df['gdp'] = df['gdp'].shift(-1)
In [45]: df
Out[45]:
   y  gdp  cap
0  1    3    5
1  2    7    9
2  8    4    2
3  3    7    7
4  6  NaN    7
In [46]: df[:-1]
Out[46]:
   y  gdp  cap
0  1    3    5
1  2    7    9
2  8    4    2
3  3    7    7
Shift column gdp up:
df.gdp = df.gdp.shift(-1)
and then remove the last row.
Time has passed, and the current pandas documentation recommends this way:
df.loc[:, 'gdp'] = df.gdp.shift(-1)
To shift by 5 values, for example, and also get rid of the NaN rows without having to keep track of how many values you shifted by:
df['gdp'] = df['gdp'].shift(-5)
df = df.dropna()
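One caveat not in the answer above: dropna() with no arguments drops rows with NaN in any column, so if other columns may already contain NaN it is safer to restrict it to the shifted column (a sketch):

df['gdp'] = df['gdp'].shift(-5)
df = df.dropna(subset=['gdp'])  # only drop rows where the shifted column is NaN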
First, shift the column:
df['gdp'] = df['gdp'].shift(-1)
Second, remove the last row, which contains a NaN cell:
df = df[:-1]
Third, reset the index:
df = df.reset_index(drop=True)
df.gdp = df.gdp.shift(-1)            ## shift up
df.drop(df.index[-1], inplace=True)  ## remove the last row
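Putting those steps together, a small helper might look like this (lag_column is a hypothetical name, not a pandas function):

import pandas as pd

def lag_column(df, col, periods=1):
    # return a copy with `col` shifted up by `periods` rows and the
    # trailing rows (now NaN in `col`) removed
    out = df.copy()
    out[col] = out[col].shift(-periods)
    return out.iloc[:-periods].reset_index(drop=True)

df = pd.DataFrame({'y': [1, 2, 8, 3, 6],
                   'gdp': [2, 3, 7, 4, 7],
                   'cap': [5, 9, 2, 7, 7]})
df_lag = lag_column(df, 'gdp')  # matches df_lag from the question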

How to assign values to multiple non existing columns in a pandas dataframe?

So what I want to do is add columns to a dataframe and fill all rows of each new column with a single value (one value per column).
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,2],[3,4]]), columns = ["A","B"])
arr = np.array([7,8])
# this is what I would like to do
df[["C","D"]] = arr
# and this is what I want to achieve
#    A  B  C  D
# 0  1  2  7  8
# 1  3  4  7  8
# but it yields a KeyError sadly
# KeyError: "['C' 'D'] not in index"
I do know about the assign functionality and how I would tackle this issue if I were only adding one column at a time. I just want to know whether there is a clean and simple way to do this with multiple new columns, as I was not able to find one.
This works for me:
df[["C","D"]] = pd.DataFrame([arr], index=df.index)
Or join:
df = df.join(pd.DataFrame([arr], columns=['C','D'], index=df.index))
Or assign:
df = df.assign(**pd.Series(arr, index=['C','D']))
print (df)
   A  B  C  D
0  1  2  7  8
1  3  4  7  8
You can use assign and pass a dict into it:
df.assign(**dict(zip(['C','D'],[arr.tolist()]*2)))
Out[755]:
   A  B  C  D
0  1  2  7  7
1  3  4  8  8
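For what it's worth, in recent pandas versions (1.x and later, as far as I can tell) assigning a 2D array to a list of new column labels works directly, so the original attempt only needs the array broadcast to one row per index entry (a sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1,2],[3,4]]), columns=["A","B"])
arr = np.array([7,8])
# tile the 1D array into one identical row per existing row, then assign
df[["C","D"]] = np.tile(arr, (len(df), 1))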

In pandas Dataframe with multiindex how can I filter by order?

Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
     2  3
0 1
1 A  9  9
  C  8  8
  D  4  5
2 H  7  7
  L  5  5
I want to filter the rows in the nth position of level 1 of the multiindex, i.e. filtering the first
     2  3
0 1
1 A  9  9
2 H  7  7
or filtering the third
     2  3
0 1
1 D  4  5
How can I achieve this ?
You can filter the rows with the help of GroupBy.nth after grouping on the first level of the multi-index DF. Since n uses 0-based indexing, you need to provide the values to it accordingly, as shown:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)
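A runnable demo of both calls, assuming a recent pandas in which nth acts as a filter and keeps the original MultiIndex:

import pandas as pd

L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5), (2,'H',7,7), (2,'L',5,5)]
df = pd.DataFrame.from_records(L).set_index([0, 1])
print(df.groupby(level=0).nth(0))  # first row of each level-0 group
print(df.groupby(level=0).nth(2))  # third row; groups with fewer rows simply drop out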

Apply a for loop to multiple DataFrames in Pandas

I have multiple DataFrames that I want to do the same thing to.
First I create a list of the DataFrames. All of them have the same column called 'result'.
df_list = [df1,df2,df3]
I want to keep only the rows in all the DataFrames with value 'passed' so I use a for loop on my list:
for df in df_list:
    df = df[df['result'] == 'passed']
...this does not work; the values are not filtered out of each DataFrame.
If I filter each one separately then it does work.
df1 =df1[df1['result'] == 'passed']
df2 =df2[df2['result'] == 'passed']
df3 =df3[df3['result'] == 'passed']
This is because every time you take a subset like df[<whatever>] you return a new dataframe and assign it to the df looping variable, which gets overwritten each time you go to the next iteration (although you do keep the last one). This is similar to slicing lists:
>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1, list2:
...     lyst = lyst[1:-1]
...
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
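By contrast, a mutating operation such as slice assignment changes the list objects themselves, which is exactly the point about mutator methods below (a sketch):

>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1, list2:
...     lyst[:] = lyst[1:-1]
...
>>> list1, list2
([2, 3], [12, 13])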
Usually, you need to use a mutator method if you want to actually modify the lists in place. Equivalently, with a dataframe, you can assign through an indexer such as .loc or .iloc, combined with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three dataframes and I want to keep only the rows where the second column is positive:
Warning: this way is not ideal; look at the edit below for a better way.
In [11]: df1
Out[11]:
0 1 2 3
0 0.957288 -0.170286 0.406841 -3.058443
1 1.762343 -1.837631 -0.867520 1.666193
2 0.618665 0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4 0.638419 -0.271300 -0.918894 1.524009
5 0.957006 1.181246 0.513298 0.370174
6 0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700 0.099904
In [12]: df2
Out[12]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
2 0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273 0.719489
4 -0.887862 1.073306 0.351835 0.313953
5 -0.517824 -0.096929 -0.300282 0.716020
6 -1.121527 0.183219 0.938509 0.842882
7 0.003498 -2.241854 -1.146984 -0.751192
In [13]: df3
Out[13]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
2 -0.493466 -0.717872 1.090417 -0.591872
3 1.021246 -0.060453 -0.013952 0.304933
4 -0.859882 -0.947950 0.562609 1.313632
5 0.917199 1.186865 0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7 1.088068 -0.893466 -0.188419 -0.451623
In [14]: for df in df1, df2, df3:
   ....:     df.loc[:,:] = df.loc[df[1] > 0,:]
   ....:     df.dropna(inplace=True, axis=0)
   ....:
In [15]: df1
Out[15]:
0 1 2 3
2 0.618665 0.660312 -1.319740 -0.024854
5 0.957006 1.181246 0.513298 0.370174
In [16]: df2
Out[16]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
4 -0.887862 1.073306 0.351835 0.313953
6 -1.121527 0.183219 0.938509 0.842882
In [17]: df3
Out[17]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
5 0.917199 1.186865 0.354839 -1.771787
Edited to Add:
I think I found a better way just using the .drop method.
In [21]: df1
Out[21]:
0 1 2 3
0 -0.804913 -0.481498 0.076843 1.136567
1 -0.457197 -0.903681 -0.474828 1.289443
2 -0.820710 1.610072 0.175455 0.712052
3 0.715610 -0.178728 -0.664992 1.261465
4 -0.297114 -0.591935 0.487698 0.760450
5 1.035231 -0.108825 -1.058996 0.056320
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [22]: df2
Out[22]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [23]: df3
Out[23]:
0 1 2 3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856 1.468124 -0.961131
2 -0.048568 0.373192 -0.666330 0.867719
3 0.533597 -1.222963 0.119789 -0.037949
4 1.203075 -0.773511 0.475809 1.352943
5 -0.984069 -0.352267 -0.313516 0.138259
6 0.114596 0.354404 2.119963 -0.452462
7 -1.033029 -0.787237 0.479321 -0.818260
In [25]: for df in df1, df2, df3:
   ....:     df.drop(df.index[df[1] < 0], axis=0, inplace=True)
   ....:
In [26]: df1
Out[26]:
0 1 2 3
2 -0.820710 1.610072 0.175455 0.712052
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [27]: df2
Out[27]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [28]: df3
Out[28]:
0 1 2 3
2 -0.048568 0.373192 -0.666330 0.867719
6 0.114596 0.354404 2.119963 -0.452462
Certainly faster:
In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036
In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
If multiple columns hold a pass condition (for example, a score of 40 or more counts as passed, otherwise failed):
import numpy as np
col_list_contains_passed = ['result1', 'result2']
df_list = [df1, df2, df3]
for d in df_list:
    for c in col_list_contains_passed:
        d[c] = np.where(d[c] >= 40, 'passed', 'failed')  # put your own condition here
This works because assigning to d[c] mutates the dataframe object itself, unlike rebinding the loop variable. Then you can filter each individual dataframe by condition:
df1 = df1[df1['col_name'] == 'xyz']
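A common alternative that sidesteps in-place mutation entirely is to rebuild the list with a comprehension and rebind the original names (a sketch, not from the original answers):

df_list = [df[df['result'] == 'passed'] for df in [df1, df2, df3]]
df1, df2, df3 = df_list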

Python - Get group names from aggregated results in pandas

I have a dataframe like this:
   minute  values
0       1       3
1       2       4
2       1       1
3       4       6
4       3       7
5       2       2
When I apply
df.groupby('minute').sum().sort_values('values', ascending=False)
This gives:
        values
minute
3            7
2            6
4            6
1            4
I want to get the first two values from the minute column as a list, like [3, 2]. How can I access the values in the minute column?
If what you want is the values from the minute column in the grouped dataframe (which is also its index), you can use DataFrame.index to access that column. Example -
grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
grouped.index[:2]
If you really want it as a list, you can use .tolist() to convert it. Example -
grouped.index[:2].tolist()
Demo -
In [3]: df
Out[3]:
   minute  values
0       1       3
1       2       4
2       1       1
3       4       6
4       3       7
5       2       2
In [4]: grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
In [5]: grouped.index[:2]
Out[5]: Int64Index([3, 2], dtype='int64', name='minute')
In [6]: grouped.index[:2].tolist()
Out[6]: [3, 2]
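Note that DataFrame.sort was removed from pandas long ago; the snippets above use sort_values instead. In current pandas, nlargest gets the same top-k index labels even more directly (a sketch):

In [7]: df.groupby('minute')['values'].sum().nlargest(2).index.tolist()
Out[7]: [3, 2]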
