Suppose I have a DataFrame dataset like the following:
import pandas as pd

dataset = pd.DataFrame({'id': list('123456'),
                        'B': [4,5,4,5,5,4],
                        'C': [7,8,9,4,2,3]})
print(dataset)
id B C
0 1 4 7
1 2 5 8
2 3 4 9
3 4 5 4
4 5 5 2
5 6 4 3
Now I slice it using iloc and get
dataset = dataset.iloc[2:5]
id B C
2 3 4 9
3 4 5 4
4 5 5 2
Now I set the id as the new index due to my needs in my project, so I do
dataset.set_index("id", inplace=True)
print(dataset)
B C
id
3 4 9
4 5 4
5 5 2
I would like to select from the new dataset using iloc with the original index. So if I do dataset.iloc[3] I would like to see the first row. However, that throws an out-of-bounds error, while dataset.iloc[0] gives me the first row.
Is there any way I can preserve the original index? Thanks.
iloc selects purely by position, not by label. Row id k originally sat at position k-1, and the slice dropped the first n rows, so its new position is k - n - 1:
n = 2 # n is 2 since you slice 2:5
dataset.iloc[3-n-1]
Out[648]:
B 4
C 9
Name: 3, dtype: int64
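Rather than hand-computing the offset, a more robust variant (a small sketch) is to ask the index for the label's position with Index.get_loc and feed that to iloc:
pos = dataset.index.get_loc('3')  # 0 here; note the ids are strings
dataset.iloc[pos]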
In this case it is recommended to use loc instead of iloc:
dataset.index = dataset.index.astype('int')
dataset.loc[3]
>>>
B 4
C 9
Name: 3, dtype: int64
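Alternatively, since the id column was built from list('123456'), the labels are strings, so you can skip the astype conversion and look the label up as a string:
dataset.loc['3']  # works because the index holds strings, not ints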
Consider a dataframe which contains several groups of integers:
d = pd.DataFrame({'label': ['a','a','a','a','b','b','b','b'], 'value': [1,2,3,2,7,1,8,9]})
d
label value
0 a 1
1 a 2
2 a 3
3 a 2
4 b 7
5 b 1
6 b 8
7 b 9
Within each of these groups, every integer has to be greater than or equal to the previous one; if it is not, it takes on the value of the previous integer. I replace such values using
s.where(~(s < s.shift()), s.shift())
which works fine for a single series. I can even group the dataframe, and loop through each extracted series:
grouped = d.groupby('label')['value']
for _, s in grouped:
    print(s.where(~(s < s.shift()), s.shift()))
0 1.0
1 2.0
2 3.0
3 3.0
Name: value, dtype: float64
4 7.0
5 7.0
6 8.0
7 9.0
Name: value, dtype: float64
However, how do I now get these values back into my original dataframe?
Or, is there a better way to do this? I don't care for using .groupby and don't consider the for loop a pretty solution either...
IIUC, you can use cummax in the groupby like:
d['val_max'] = d.groupby('label')['value'].cummax()
print (d)
label value val_max
0 a 1 1
1 a 2 2
2 a 3 3
3 a 2 3
4 b 7 7
5 b 1 7
6 b 8 8
7 b 9 9
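If the goal is to write the corrected values back into the original dataframe rather than into a new column, the same expression can simply be assigned in place:
d['value'] = d.groupby('label')['value'].cummax()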
I have a data frame containing several groups of numeric series whose values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]: df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with the starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In [82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()

In [83]: df
Out[83]:
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid the NaNs; you want to fill them with the starting values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or selected) columns, run it over the list of columns of interest:
sources = ['Cumulative1', 'Cumulative2']
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
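As a small variation, fillna does the same job as combine_first here, since the NaNs occur only where diff had no previous row:
for s, t in zip(sources, targets):
    df[t] = df[t].fillna(df[s])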
Say I want to delete a set of adjacent columns in a DataFrame and my code looks something like this currently:
del df['1'], df['2'], df['3'], df['4'], df['5'], df['6']
This works, but I was wondering if there was a more efficient, compact, or aesthetically pleasing way to do it, such as:
del df['1','6']
I think you need drop; to build the list of labels to remove you can use range or numpy.arange:
import numpy as np
import pandas as pd

df = pd.DataFrame({'1': [1,2,3],
                   '2': [4,5,6],
                   '3': [7,8,9],
                   '4': [1,3,5],
                   '5': [7,8,9],
                   '6': [1,3,5],
                   '7': [5,3,6],
                   '8': [5,3,6],
                   '9': [7,4,3]})
print (df)
1 2 3 4 5 6 7 8 9
0 1 4 7 1 7 1 5 5 7
1 2 5 8 3 8 3 3 3 4
2 3 6 9 5 9 5 6 6 3
print (np.arange(1,7))
[1 2 3 4 5 6]
print (range(1,7))
range(1, 7)
#convert string column names to int
df.columns = df.columns.astype(int)
df = df.drop(np.arange(1,7), axis=1)
#another solution with range
#df = df.drop(range(1,7), axis=1)
print (df)
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
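If you'd rather leave the column names as strings, you can skip the conversion and build the string labels directly; a sketch of the same drop:
df = df.drop([str(i) for i in range(1, 7)], axis=1)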
You can do this without modifying the columns, by passing a slice object to drop:
In [29]:
df.drop(df.columns[slice(df.columns.tolist().index('1'), df.columns.tolist().index('6') + 1)], axis=1)
Out[29]:
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
This looks up the ordinal positions of the two column end points and uses them to build a slice object against the columns array.
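As a related variant, .loc slices column labels inclusively, so you can select the span by label and pass the resulting columns to drop (assuming the columns are laid out in order, as here):
df.drop(df.loc[:, '1':'6'].columns, axis=1)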
Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense, because the grouped table has a MultiIndex and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the original df, you should use transform; it produces a Series whose index is aligned with your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
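For what it's worth, the same transform can also be spelled with the string alias, which pandas dispatches to the groupby's own nunique:
df['nc'] = df.groupby(['b','c'])['a'].transform('nunique')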
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
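To end up with the nc column from the question instead of the a_x/a_y suffixes, you can rename the aggregated column before merging (a small sketch):
df.merge(gb.rename(columns={'a': 'nc'}).reset_index(), on=['b','c'])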
I'm facing a problem with applying a function to a DataFrame (to model a solar collector based on annual hourly weather data).
Suppose I have the following (simplified) DataFrame:
df2:
A B C
0 11 13 5
1 6 7 4
2 8 3 6
3 4 8 7
4 0 1 7
Now I have defined a function that takes all rows as input to create a new column called D, but I want the function to also take the last calculated value of D (except of course for the first row as no value for D is calculated) as input.
def Funct(x):
    # intended: current D = A + B + C + the previous row's D
    D = x['A'] + x['B'] + x['C'] + (x-1)['D']
I know that the function above is not working, but it gives an idea of what I want.
So to summarise:
Create a function that creates a new column in the dataframe and takes the value of the new column one row above it as input
Can somebody help me?
Thanks in advance.
It sounds like you are calculating a cumulative sum. In that case, use cumsum:
In [45]: df['D'] = (df['A']+df['B']+df['C']).cumsum()
In [46]: df
Out[46]:
A B C D
0 11 13 5 29
1 6 7 4 46
2 8 3 6 63
3 4 8 7 82
4 0 1 7 90
[5 rows x 4 columns]
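Equivalently, the per-row total can be written with sum(axis=1) before taking the cumulative sum, which scales better if more columns are involved:
df['D'] = df[['A', 'B', 'C']].sum(axis=1).cumsum()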
Are you looking for this?
You can use shift to align the previous row with the current row, and then you can do your operation.
In [7]: df
Out[7]:
a b
1 1 1
2 2 2
3 3 3
4 4 4
[4 rows x 2 columns]
In [8]: df['c'] = df['b'].shift(1)  # first row will be NaN
In [9]: df
Out[9]:
a b c
1 1 1 NaN
2 2 2 1
3 3 3 2
4 4 4 3
[4 rows x 3 columns]
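If the recurrence were more complicated than a plain cumulative sum, so that neither cumsum nor shift could express it, a plain Python loop over the rows still works. A minimal sketch rebuilding the question's df2, with an assumed starting value of 0 for the first row:
import pandas as pd

df2 = pd.DataFrame({'A': [11, 6, 8, 4, 0],
                    'B': [13, 7, 3, 8, 1],
                    'C': [5, 4, 6, 7, 7]})

prev = 0  # assumed seed: nothing to carry into the first row
values = []
for row in df2.itertuples(index=False):
    prev = row.A + row.B + row.C + prev  # current D uses the previous D
    values.append(prev)
df2['D'] = values  # 29, 46, 63, 82, 90 -- matches the cumsum result above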