How to change row value by index [duplicate] - python

I have created a Pandas DataFrame
df = DataFrame(index=['A','B','C'], columns=['x','y'])
and have got this
x y
A NaN NaN
B NaN NaN
C NaN NaN
Now, I would like to assign a value to particular cell, for example to row C and column x.
I would expect to get this result:
x y
A NaN NaN
B NaN NaN
C 10 NaN
with this code:
df.xs('C')['x'] = 10
However, the contents of df have not changed: the DataFrame still contains only NaNs.
Any suggestions?

RukTech's answer, df.set_value('C', 'x', 10), is far and away faster than the options I've suggested below. However, it has been slated for deprecation.
Going forward, the recommended method is .iat/.at.
Why df.xs('C')['x']=10 does not work:
df.xs('C') by default returns a new object (here a Series for row C) holding a copy of the data, so
df.xs('C')['x']=10
modifies only this new object.
In the pandas versions this answer was written for, df['x'] returns a view of the df dataframe, so
df['x']['C'] = 10
modifies df itself.
Warning: It is sometimes difficult to predict if an operation returns a copy or a view. For this reason the docs recommend avoiding assignments with "chained indexing".
So the recommended alternative is
df.at['C', 'x'] = 10
which does modify df.
In [18]: %timeit df.set_value('C', 'x', 10)
100000 loops, best of 3: 2.9 µs per loop
In [20]: %timeit df['x']['C'] = 10
100000 loops, best of 3: 6.31 µs per loop
In [81]: %timeit df.at['C', 'x'] = 10
100000 loops, best of 3: 9.2 µs per loop
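To make the recommended accessors concrete, here is a minimal sketch on a frame shaped like the one in the question. The float NaN fill and the extra .iat call are illustrative additions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Frame shaped like the one in the question, filled with NaN.
df = pd.DataFrame(np.nan, index=['A', 'B', 'C'], columns=['x', 'y'])

# Label-based scalar assignment: row 'C', column 'x'.
df.at['C', 'x'] = 10

# Position-based scalar assignment: row 0 ('A'), column 1 ('y').
df.iat[0, 1] = 20

print(df)
```

Both calls write into df directly, so no intermediate copy is involved.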

Update: The .set_value method is going to be deprecated. .iat/.at are good replacements; unfortunately, pandas provides little documentation for them.
The fastest way to do this is using set_value. This method is ~100 times faster than the .ix method. For example:
df.set_value('C', 'x', 10)

You can also use a conditional lookup using .loc as seen here:
df.loc[df[<some_column_name>] == <condition>, [<another_column_name>]] = <value_to_add>
where <some_column_name> is the column you want to check the <condition> variable against and <another_column_name> is the column you want to add to (it can be a new column or one that already exists). <value_to_add> is the value you want to add to that column/row.
This example doesn't match the question at hand precisely, but it might be useful for someone who wants to add a specific value based on a condition.

Try using df.loc[row_index,col_indexer] = value

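The recommended way (according to the maintainers at the time) to set a value was:
df.ix['C','x']=10
Note the order: row label first, then column label. .ix has since been deprecated, and df.loc['C','x']=10 takes the same arguments.
Using 'chained indexing' (df['x']['C']) may lead to problems.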
See:
https://stackoverflow.com/a/21287235/1579844
http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-view-versus-copy
https://github.com/pydata/pandas/pull/6031

This is the only thing that worked for me!
df.loc['C', 'x'] = 10
Learn more about .loc here.

To set values, use:
df.at[0, 'clm1'] = 0
This is the fastest recommended method for setting a single value.
set_value and ix have been deprecated.
No warning, unlike iloc and loc

.iat/.at is a good solution.
Suppose you have this simple data_frame:
A B C
0 1 8 4
1 3 9 6
2 22 33 52
If we want to modify the value of the cell [0,"A"], we can use either of these solutions:
df.iat[0,0] = 2
df.at[0,'A'] = 2
And here is a complete example of how to use iat to get and set the value of a cell:
def prepossessing(df):
    for index in range(0,len(df)):
        df.iat[index,0] = df.iat[index,0] * 2
    return df
y_train before :
0
0 54
1 15
2 15
3 8
4 31
5 63
6 11
y_train after calling the prepossessing function, which uses iat to multiply the value of each cell by 2:
0
0 108
1 30
2 30
3 16
4 62
5 126
6 22
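For comparison, the same doubling can be written without an explicit loop. The sketch below rebuilds the y_train values shown above and checks that a cell-by-cell .iat loop and a whole-column operation agree; the variable names looped and vectorized are illustrative.

```python
import pandas as pd

y_train = pd.DataFrame([54, 15, 15, 8, 31, 63, 11])

# Cell-by-cell version, reading and writing each value with .iat.
looped = y_train.copy()
for i in range(len(looped)):
    looped.iat[i, 0] = looped.iat[i, 0] * 2

# Whole-column version: one vectorized multiplication.
vectorized = y_train.copy()
vectorized[0] = vectorized[0] * 2

print(looped[0].tolist())
```

The loop is useful when each cell needs separate handling; for a uniform operation the column form is usually shorter and faster.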

I would suggest:
df.loc[index_position, "column_name"] = some_value
To modify multiple cells at the same time:
df.loc[start_idx_pos:end_idx_pos, "column_name"] = some_value
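A short sketch of the slice form, assuming a default integer index. Note that .loc slices by label and includes both endpoints.

```python
import pandas as pd

df = pd.DataFrame({'column_name': [0, 0, 0, 0, 0]})

# Label slice 1:3 with .loc covers rows 1, 2 and 3 (end included).
df.loc[1:3, 'column_name'] = 7

print(df['column_name'].tolist())
```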

Avoid Assignment with Chained Indexing
You are dealing with an assignment with chained indexing which will result in a SettingWithCopy warning. This should be avoided by all means.
Your assignment will have to resort to one single .loc[] or .iloc[] slice, as explained here. Hence, in your case:
df.loc['C', 'x'] = 10

In my example I change the value only in selected cells:
for index, row in result.iterrows():
    if np.isnan(row['weight']):
        result.at[index, 'weight'] = 0.0
Here 'result' is a DataFrame with a column 'weight'.
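If the only goal is to replace missing weights, a boolean mask with .loc gives the same result without iterating over rows. This is a different technique from the loop above, shown as a sketch with made-up data.

```python
import numpy as np
import pandas as pd

result = pd.DataFrame({'weight': [1.5, np.nan, 3.0, np.nan]})

# Select the rows where 'weight' is missing and overwrite them in one step.
result.loc[result['weight'].isna(), 'weight'] = 0.0

print(result['weight'].tolist())
```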

Here is a summary of the valid solutions provided by all users, for data frames indexed by integer and string.
df.iloc, df.loc and df.at work for both types of data frames. df.iloc only works with integer row/column positions, whereas df.loc and df.at support setting values using column names and/or index labels.
When the specified index does not exist, both df.loc and df.at would append the newly inserted rows/columns to the existing data frame, but df.iloc would raise "IndexError: positional indexers are out-of-bounds". A working example tested in Python 2.7 and 3.7 is as follows:
import numpy as np, pandas as pd
df1 = pd.DataFrame(index=np.arange(3), columns=['x','y','z'])
df1['x'] = ['A','B','C']
df1.at[2,'y'] = 400
# rows/columns specified does not exist, appends new rows/columns to existing data frame
df1.at['D','w'] = 9000
df1.loc['E','q'] = 499
# using df[<some_column_name>] == <condition> to retrieve target rows
df1.at[df1['x']=='B', 'y'] = 10000
df1.loc[df1['x']=='B', ['z','w']] = 10000
# using a list of index to setup values
df1.iloc[[1,2,4], 2] = 9999
df1.loc[[0,'D','E'],'w'] = 7500
df1.at[[0,2,"D"],'x'] = 10
df1.at[:, ['y', 'w']] = 8000
df1
>>> df1
x y z w q
0 10 8000 NaN 8000 NaN
1 B 8000 9999 8000 NaN
2 10 8000 9999 8000 NaN
D 10 8000 NaN 8000 NaN
E NaN 8000 9999 8000 499.0
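The enlargement behaviour described above can be checked in isolation. The sketch below uses a small made-up frame and only scalar keys; it shows .loc and .at adding a missing row, and .iloc refusing an out-of-bounds position.

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0]}, index=['A', 'B'])

# Label-based setters create the row when the label is missing.
df.loc['C', 'x'] = 3.0
df.at['D', 'x'] = 4.0

# Position-based setting cannot enlarge the frame.
try:
    df.iloc[10, 0] = 5.0
    raised = False
except IndexError:
    raised = True

print(df)
print(raised)
```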

you can use .iloc.
df.iloc[[2], [0]] = 10

set_value() is deprecated.
Starting from the release 0.23.4, Pandas "announces the future"...
>>> df
Cars Prices (U$)
0 Audi TT 120.0
1 Lamborghini Aventador 245.0
2 Chevrolet Malibu 190.0
>>> df.set_value(2, 'Prices (U$)', 240.0)
__main__:1: FutureWarning: set_value is deprecated and will be removed in a future release.
Please use .at[] or .iat[] accessors instead
Cars Prices (U$)
0 Audi TT 120.0
1 Lamborghini Aventador 245.0
2 Chevrolet Malibu 240.0
Considering this advice, here's a demonstration of how to use them:
by row/column integer positions
>>> df.iat[1, 1] = 260.0
>>> df
Cars Prices (U$)
0 Audi TT 120.0
1 Lamborghini Aventador 260.0
2 Chevrolet Malibu 240.0
by row/column labels
>>> df.at[2, "Cars"] = "Chevrolet Corvette"
>>> df
Cars Prices (U$)
0 Audi TT 120.0
1 Lamborghini Aventador 260.0
2 Chevrolet Corvette 240.0
References:
pandas.DataFrame.iat
pandas.DataFrame.at

One way to use an index with a condition is to first get the index of all the rows that satisfy your condition, and then use those row indexes in any of several ways.
conditional_index = df.loc[ df['col name'] <condition> ].index
Example conditions:
==5, >10 , =="Any string", >= DateTime
You can then use these row indexes in a variety of ways, such as:
Replace the value of one column for conditional_index
df.loc[conditional_index , ['col name']]= <new value>
Replace the value of multiple columns for conditional_index
df.loc[conditional_index, [col1,col2]]= <new value>
One benefit of saving conditional_index is that you can assign the value of one column to another column for the same rows
df.loc[conditional_index, [col1,col2]]= df.loc[conditional_index,'col name']
This is all possible because .index returns an array of index labels that .loc can use for direct addressing, so it avoids repeated traversals.
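The saved-index idea can be sketched as follows. The column names a, b, c and the threshold are made up for illustration, and only single target columns are used.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 12, 30], 'b': [0, 0, 0], 'c': [0, 0, 0]})

# Compute the matching row labels once.
conditional_index = df.loc[df['a'] > 10].index

# Reuse them to set a constant in one column...
df.loc[conditional_index, 'b'] = -1

# ...and to copy values from another column for the same rows.
df.loc[conditional_index, 'c'] = df.loc[conditional_index, 'a']

print(df)
```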

I tested, and df.set_value is a little faster, but the official method df.at looks like the fastest non-deprecated way to do it.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100, 100))
%timeit df.iat[50,50]=50 # ✓
%timeit df.at[50,50]=50 # ✔
%timeit df.set_value(50,50,50) # will deprecate
%timeit df.iloc[50,50]=50
%timeit df.loc[50,50]=50
7.06 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.52 µs ± 64.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.68 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
98.7 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note that this sets the value of a single cell. For whole vectors, loc and iloc should be better options, since they are vectorized.

If one wants to change the cell in the position (0,0) of the df to a string such as '"236"76"', the following options will do the work:
df[0][0] = '"236"76"'
# %timeit df[0][0] = '"236"76"'
# 938 µs ± 83.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Or using pandas.DataFrame.at
df.at[0, 0] = '"236"76"'
# %timeit df.at[0, 0] = '"236"76"'
#15 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Or using pandas.DataFrame.iat
df.iat[0, 0] = '"236"76"'
# %timeit df.iat[0, 0] = '"236"76"'
# 41.1 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Or using pandas.DataFrame.loc
df.loc[0, 0] = '"236"76"'
# %timeit df.loc[0, 0] = '"236"76"'
# 5.21 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Or using pandas.DataFrame.iloc
df.iloc[0, 0] = '"236"76"'
# %timeit df.iloc[0, 0] = '"236"76"'
# 5.12 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If time is of relevance, using pandas.DataFrame.at is the fastest approach.

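So, your question is how to convert the NaN at ['C', 'x'] to the value 10.
One answer is:
df['x'].loc['C':]=10
df
Note that this is chained indexing, and the slice 'C': covers every row from C onward; it only gives the desired result here because C is the last row. The alternative, which sets exactly one cell, is:
df.loc['C', 'x']=10
df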

df.loc['C','x']=10
This will change the value in row C and column x. Labels are case-sensitive, so a lowercase 'c' would add a new row instead.

If you want to change values not for whole row, but only for some columns:
x = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
x.iloc[1] = dict(A=10, B=-10)

You can also use the .at method (it has been available since well before version 0.21.1). There are some differences compared to .loc, as discussed in "pandas .at versus .loc", but it's faster for single-value replacement.

In addition to the answers above, here is a benchmark comparing different ways to add rows of data to an already existing dataframe. It shows that using at or set_value is the most efficient way for large dataframes (at least for these test conditions).
Create new dataframe for each row and...
... append it (13.0 s)
... concatenate it (13.1 s)
Store all new rows in another container first, convert to new dataframe once and append...
container = lists of lists (2.0 s)
container = dictionary of lists (1.9 s)
Preallocate whole dataframe, iterate over new rows and all columns and fill using
... at (0.6 s)
... set_value (0.4 s)
For the test, an existing dataframe comprising 100,000 rows and 1,000 columns and random numpy values was used. To this dataframe, 100 new rows were added.
The code is below:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 21 16:38:46 2018
@author: gebbissimo
"""
import pandas as pd
import numpy as np
import time
NUM_ROWS = 100000
NUM_COLS = 1000
data = np.random.rand(NUM_ROWS,NUM_COLS)
df = pd.DataFrame(data)
NUM_ROWS_NEW = 100
data_tot = np.random.rand(NUM_ROWS + NUM_ROWS_NEW,NUM_COLS)
df_tot = pd.DataFrame(data_tot)
DATA_NEW = np.random.rand(1,NUM_COLS)
#%% FUNCTIONS
# create and append
def create_and_append(df):
    for i in range(NUM_ROWS_NEW):
        df_new = pd.DataFrame(DATA_NEW)
        df = df.append(df_new)
    return df
# create and concatenate
def create_and_concat(df):
    for i in range(NUM_ROWS_NEW):
        df_new = pd.DataFrame(DATA_NEW)
        df = pd.concat((df, df_new))
    return df
# store as list of lists, convert once and append
def store_as_list(df):
    lst = [[] for i in range(NUM_ROWS_NEW)]
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            lst[i].append(DATA_NEW[0,j])
    df_new = pd.DataFrame(lst)
    df_tot = df.append(df_new)
    return df_tot
# store as dict of lists, convert once and append
def store_as_dict(df):
    dct = {}
    for j in range(NUM_COLS):
        dct[j] = []
        for i in range(NUM_ROWS_NEW):
            dct[j].append(DATA_NEW[0,j])
    df_new = pd.DataFrame(dct)
    df_tot = df.append(df_new)
    return df_tot
# preallocate and fill using .at
def fill_using_at(df):
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            #print("i,j={},{}".format(i,j))
            df.at[NUM_ROWS+i,j] = DATA_NEW[0,j]
    return df
# preallocate and fill using set_value
def fill_using_set(df):
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            #print("i,j={},{}".format(i,j))
            df.set_value(NUM_ROWS+i,j,DATA_NEW[0,j])
    return df
#%% TESTS
t0 = time.time()
create_and_append(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))
t0 = time.time()
create_and_concat(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))
t0 = time.time()
store_as_list(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))
t0 = time.time()
store_as_dict(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))
t0 = time.time()
fill_using_at(df_tot)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))
t0 = time.time()
fill_using_set(df_tot)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

I too was searching for this topic and I put together a way to iterate through a DataFrame and update it with lookup values from a second DataFrame. Here is my code.
src_df = pd.read_sql_query(src_sql,src_connection)
for index1, row1 in src_df.iterrows():
    for index, row in vertical_df.iterrows():
        src_df.set_value(index=index1,col=u'etl_load_key',value=etl_load_key)
        if (row1[u'src_id'] == row['SRC_ID']) is True:
            src_df.set_value(index=index1,col=u'vertical',value=row['VERTICAL'])

Related

How to speed up calculations involving previous row in pandas?

I'm trying to create a new Pandas DataFrame column using shifted values of the column being created itself.
The only way I've been able to do so is by iterating through the data which is too slow and causing a bottleneck in my code.
import pandas as pd
df = pd.DataFrame([1,6,2,8], columns=['a'])
df.at[0, 'b'] = 5
for i in range(1, len(df)):
    df.loc[i, ('b')] = (df.a[i-1] + df.b[i-1]) /2
I tried using shift but it didn't work. It fills in the value for row 1 and NaN for the rest. I'm assuming this method can't read newly created values on the fly.
df.loc[1:, ('b')] = (df.a.shift() + df.b.shift()) /2
UPDATE
I was able to significantly reduce the timing by using df.at in the iteration rather than df.loc
def with_df_loc(df):
    for i in range(1, len(df)):
        df.loc[i, ('b')] = (df.a[i-1] + df.b[i-1]) /2
    return df
def with_df_at(df):
    for i in range(1, len(df)):
        df.at[i, 'b'] = (df.a[i-1] + df.b[i-1]) /2
    return df
%timeit with_df_loc(df)
183 ms ± 75.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit with_df_at(df)
19.4 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This timing is based on a larger dataset of 150 rows. Considering that df.rolling(20).mean() takes about 3ms, I think this might be the best I can do.
Thanks for the answers, if I need to further optimize I'll look into Asish M's suggestion of numba.
We can use numba to speed up calculations here, see Enhancing performance section in the docs.
import numba
import numpy as np
@numba.njit
def func(a, b_0=5):
    n = len(a)
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = (b[i - 1] + a[i - 1]) / 2
    return b
df['b'] = func(df['a'].to_numpy())
df
a b
0 1 5.00
1 6 3.00
2 2 4.50
3 8 3.25
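If numba is not available, the same recurrence can be run as a plain loop over NumPy arrays and written back to the frame in a single assignment, which avoids per-cell DataFrame indexing. This is a sketch without JIT compilation; the function name running_half_sum is made up, and it assumes a non-empty input.

```python
import numpy as np
import pandas as pd

def running_half_sum(a, b_0=5.0):
    """Compute b[i] = (a[i-1] + b[i-1]) / 2 with b[0] = b_0."""
    b = np.empty(len(a), dtype=np.float64)
    b[0] = b_0
    for i in range(1, len(a)):
        b[i] = (a[i - 1] + b[i - 1]) / 2
    return b

df = pd.DataFrame([1, 6, 2, 8], columns=['a'])

# Loop over the raw array, then assign the whole column at once.
df['b'] = running_half_sum(df['a'].to_numpy())

print(df)
```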
Comparing performance
Benchmarking code, for reference.
The blue line represents the performance of the fastest version of your current method (using .at). The orange line represents the numba's performance.
You could try shift + cumsum, starting from 5 with fillna:
import pandas as pd
df = pd.DataFrame([1,2,3,4], columns=['a'])
df['b'] = df['a'].shift().fillna(5).cumsum()
print(df)
Output
a b
0 1 5.0
1 2 6.0
2 3 8.0
3 4 11.0
I may well have misunderstood your question, but try this if you're looking to create a shifted column:
df = pd.DataFrame([1,2,3,4], columns=['a'])
df["b"] = df.a.shift()

Get unique strings in pandas column by delimiter

Let's say I have the data below:
import numpy as np
import pandas as pd
data=np.array([["xxx--xxx--xxx--yyy"],
["aaa--bbb--aaa--ccc"],
["xxx--axa--axa--ccc"],
["bbb--bab--bbb--bab--tgh"]])
df = pd.DataFrame({'Practice Column': data.ravel()})
print(df)
How could I create a new column in this dataframe that keeps only the unique parts of each string, in their original order? For example, the first row should become xxx--yyy.
Any help is appreciated. Thanks.
Use a list comprehension with split, then either pandas.unique (unique values in the original order) or set with sorted, and finally join the pieces back together:
df['des'] = ['--'.join(pd.unique(x.split('--'))) for x in df['Practice Column']]
Or:
df['des'] = ['--'.join(sorted(set(x.split('--')),key=x.index)) for x in df['Practice Column']]
print (df)
Practice Column des
0 xxx--xxx--xxx--yyy xxx--yyy
1 aaa--bbb--aaa--ccc aaa--bbb--ccc
2 xxx--axa--axa--ccc xxx--axa--ccc
3 bbb--bab--bbb--bab--tgh bbb--bab--tgh
If ordering is not important, the solution is simpler:
df['des'] = ['--'.join(set(x.split('--'))) for x in df['Practice Column']]
print (df)
Practice Column des
0 xxx--xxx--xxx--yyy yyy--xxx
1 aaa--bbb--aaa--ccc ccc--bbb--aaa
2 xxx--axa--axa--ccc ccc--axa--xxx
3 bbb--bab--bbb--bab--tgh bab--tgh--bbb
Consider using OrderedDict here to drop duplicates and keep order very efficiently.
from collections import OrderedDict as o
df['Desired'] = [
    '--'.join(o.fromkeys(x.split('--'), 1))
    for x in df['Practice Column']]
df
Practice Column Desired
0 xxx--xxx--xxx--yyy xxx--yyy
1 aaa--bbb--aaa--ccc aaa--bbb--ccc
2 xxx--axa--axa--ccc xxx--axa--ccc
3 bbb--bab--bbb--bab--tgh bbb--bab--tgh
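On Python 3.7 and later, a plain dict also preserves insertion order, so the same order-keeping deduplication can be written without importing OrderedDict. The sketch below rebuilds the question's data as a simple list.

```python
import pandas as pd

df = pd.DataFrame({'Practice Column': [
    'xxx--xxx--xxx--yyy',
    'aaa--bbb--aaa--ccc',
    'xxx--axa--axa--ccc',
    'bbb--bab--bbb--bab--tgh',
]})

# dict.fromkeys keeps the first occurrence of each key, in order.
df['Desired'] = [
    '--'.join(dict.fromkeys(s.split('--')))
    for s in df['Practice Column']
]

print(df)
```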
Performance
df_ = df
df = pd.concat([df] * 1000, ignore_index=True)
%%timeit
df['des'] = [
    '--'.join(sorted(set(x.split('--')),key=x.index))
    for x in df['Practice Column']]
%%timeit
df['des'] = [
    '--'.join(o.fromkeys(x.split('--'), 1))
    for x in df['Practice Column']
]
14.6 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.18 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Haven't timed jez's second solution as it does not maintain order.
Hope this works
df = pd.DataFrame({'Practice Column': data.ravel(),'Desired':data.unique()})

Creating a column in Dataframe if values exist in other columns

I have a DataFrame with a number of columns. There are 3 columns that contain rows that are either blank or, if the row corresponds to the column variable, have a random number/letter string. I would like to take this data and create another column that has a string with the name of the variable for each row.
For example:
raw_data['A']
Out[192]:
0 00Q2400000GUxMjEAL
1 00Q2400000G5QDzEAN
2 NaN
3 NaN
4 NaN
5 NaN
So far I have tried writing a function to apply but it only returns 'xyz' for every row.
def type(row):
    if row['A'] is not None:
        return 'xyz'
    elif row['B'] is not None:
        return 'acb'
    else:
        return 'efg'
raw_data['TUV'] = raw_data.apply(lambda row: type(row), axis = 1)
Any help would be greatly appreciated.
Using pd.notnull:
def type(row):
    if pd.notnull(row['A']):
        return 'xyz'
    elif pd.notnull(row['B']):
        return 'acb'
    else:
        return 'efg'
df['TUV'] = df.apply(lambda row: type(row), axis = 1)
Edit: it is better to use pd.notnull, since the missing values here are NaN rather than None.
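A quick way to see why the original function returned 'xyz' for every row: the blank cells are NaN (a float), and NaN is not the same object as None, so the `is not None` test always passes. A minimal check, using a made-up sample string for the non-missing case:

```python
import numpy as np
import pandas as pd

value = np.nan  # what pandas stores for a missing cell

# NaN is a float, not None, so this identity test is always True
print(value is not None)

# pd.notnull treats both NaN and None as missing
print(pd.notnull(value))
print(pd.notnull(None))
print(pd.notnull("00Q2400000GUxMjEAL"))
```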
With bigger datasets, apply can be slow.
Even with just 10,000 rows, you can get about a 25x speedup on this task with simple indexing operations.
Here's some example data:
N = 10000
data = {"A": np.random.choice([1, None], size=N),
"B": np.random.choice([1, None], size=N)}
df = pd.DataFrame(data)
df.head()
A B
0 1 1
1 None 1
2 1 1
3 1 1
4 None None
Using basic assignment and indexing:
%%timeit
df["TUV"] = "efg"
df.loc[df.B.notnull(), "TUV"] = "acb"
# A is assigned last so that it takes priority over B, matching the apply version
df.loc[df.A.notnull(), "TUV"] = "xyz"
# 6.15 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using apply:
%%timeit
def type(row):
    if pd.notnull(row['A']):
        return 'xyz'
    elif pd.notnull(row['B']):
        return 'acb'
    else:
        return 'efg'
df['TUV2'] = df.apply(lambda row: type(row), axis = 1)
# 152 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df.TUV.equals(df.TUV2) # True
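Another vectorized option, not used in the answers above, is np.select, which takes the conditions in priority order and picks the first one that is true for each row. A small sketch with made-up sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a row can have A, B, both, or neither
df = pd.DataFrame({
    "A": ["00Q1", None, None, "00Q2"],
    "B": [None, "00Q3", None, "00Q4"],
})

# Conditions are checked in order; the first True one wins for each row
conditions = [df["A"].notnull(), df["B"].notnull()]
choices = ["xyz", "acb"]

df["TUV"] = np.select(conditions, choices, default="efg")
print(df)
```

Because the first matching condition wins, the row where both A and B are present gets 'xyz', which is the same priority as the if/elif chain.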

Pandas slicing excluding the end

When slicing a dataframe using loc,
df.loc[start:end]
both start and end are included. Is there an easy way to exclude the end when using loc?
Easiest I can think of is df.loc[start:end].iloc[:-1].
Chops off the last one.
loc includes both the start and the end. One less-than-ideal workaround is to look up the index positions and slice with iloc instead (assuming the index has no duplicates):
df=pd.DataFrame({'A':[1,2,3,4]}, index = ['a','b','c','d'])
df.iloc[df.index.get_loc('a'):df.index.get_loc('c')]
# A
#a 1
#b 2
df.loc['a':'c']
# A
#a 1
#b 2
#c 3
None of the answers addresses the situation where end is not part of the index.
The more general solution is simply to compare the index to start and end; that way you can make either bound inclusive or exclusive.
df[(df.index >= start) & (df.index < end)]
For instance:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(
...     {
...         "x": np.arange(48),
...         "y": np.arange(48) * 2,
...     },
...     index=pd.date_range("2020-01-01 00:00:00", freq="1H", periods=48)
... )
>>> start = "2020-01-01 14:00"
>>> end = "2020-01-01 19:30" # this is not in the index
>>> df[(df.index >= start) & (df.index < end)]
x y
2020-01-01 14:00:00 14 28
2020-01-01 15:00:00 15 30
2020-01-01 16:00:00 16 32
2020-01-01 17:00:00 17 34
2020-01-01 18:00:00 18 36
2020-01-01 19:00:00 19 38
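If the index is sorted, the same half-open selection can also be sketched with Index.searchsorted plus iloc, which avoids building a boolean mask over the whole index. This is an alternative to the comparison above rather than part of it, and it assumes a monotonically increasing index.

```python
import numpy as np
import pandas as pd

# Hourly index built with a Timedelta frequency, 48 rows as in the example above
df = pd.DataFrame(
    {"x": np.arange(48), "y": np.arange(48) * 2},
    index=pd.date_range("2020-01-01 00:00:00", periods=48, freq=pd.Timedelta(hours=1)),
)

start = pd.Timestamp("2020-01-01 14:00")
end = pd.Timestamp("2020-01-01 19:30")  # not in the index

# side="left" gives the first position whose label is >= the given value,
# so the slice [lo:hi) contains labels with start <= label < end
lo = df.index.searchsorted(start, side="left")
hi = df.index.searchsorted(end, side="left")

result = df.iloc[lo:hi]
print(result)
```

On an unsorted index searchsorted does not give meaningful positions, so the boolean comparison is the safer choice there.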
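For slicing a DatetimeIndex, you can try this. It will grab everything up to one nanosecond before your end time. This excludes a row at exactly the end time (assuming you aren't using ns precision), but it does not blindly drop the last row of the result. Note that end has to be a pd.Timestamp (not a string) for the subtraction to work.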
df.loc[start:(end - pd.Timedelta('1ns'))]
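A small runnable sketch of that idea, with made-up hourly data and start/end given as pd.Timestamp objects so the subtraction is valid:

```python
import pandas as pd

# Hypothetical hourly data: 00:00 through 05:00
idx = pd.date_range("2020-01-01", periods=6, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"x": range(6)}, index=idx)

start = pd.Timestamp("2020-01-01 01:00")
end = pd.Timestamp("2020-01-01 04:00")

# loc is inclusive on both ends, so stop one nanosecond short of end
result = df.loc[start:(end - pd.Timedelta("1ns"))]
print(result)
```

The row stamped exactly at end is left out, while the rows at 01:00, 02:00 and 03:00 are kept.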
pd.RangeIndex can be used instead for slicing indices with .loc with an exclusive stop provided that the index has integer dtype. Here is a straightforward helper:
class _eidx:
    def __getitem__(self, s: slice) -> pd.RangeIndex:
        return pd.RangeIndex(s.start, s.stop, s.step)

eidx = _eidx()
Example:
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})
print(df.loc[eidx[3:5]])
x y
3 3 13
4 4 14
An even simpler way is just using python range:
print(df.loc[range(3, 5)])
x y
3 3 13
4 4 14
There doesn't seem to be any really neat way to do this, but I would favour solutions which are expressive (is it clear what I'm trying to do?).
For this reason, I like this solution even though it's somewhat basic and a bit clumsy.
A more robust, expressive and, I think, performant version of this same idea would be to first create the inclusive slice, then filter the result to exclude the end-point:
df.loc[start:end][lambda _: _.index != end]
This solution is reasonably fast. Here I've set s = start and e = end, and timed it on a Series called ts:
In [1]: %timeit ts[s:e]
135 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [2]: %timeit ts[(ts.index >= s) & (ts.index < e)]
45.1 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: %timeit ts[s:e][lambda s: s.index != e]
299 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It can be made even more readable by allowing an intermediate variable:
inclusive = df.loc[start:end]
exclusive = inclusive[inclusive.index != end]
