Set value for particular cell in pandas DataFrame with iloc - python

I have a question similar to this and this. The difference is that I have to select the row by position, as I do not know the index.
I want to do something like df.iloc[0, 'COL_NAME'] = x, but iloc does not allow this kind of mixed access. If I do df.iloc[0]['COL_NAME'] = x instead, the warning about chained indexing appears.

For mixed position and label indexing, use .ix. BUT you need to make sure that your index is not integer-based, otherwise it will cause confusion. (Note that .ix has since been deprecated and was removed in pandas 1.0.)
df.ix[0, 'COL_NAME'] = x
Update:
Alternatively, try
df.iloc[0, df.columns.get_loc('COL_NAME')] = x
Example:
import pandas as pd
import numpy as np
# your data
# ========================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 2), columns=['col1', 'col2'], index=np.random.randint(1,100,10)).sort_index()
print(df)
col1 col2
10 1.7641 0.4002
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337
# .iloc with get_loc
# ===================================
df.iloc[0, df.columns.get_loc('col2')] = 100
df
col1 col2
10 1.7641 100.0000
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337

One thing I would add here is that the .at accessor on a dataframe is much faster, particularly if you are doing a lot of assignments of individual (not slice) values.
df.at[index, 'col_name'] = x
In my experience I have gotten a 20x speedup. There is a write-up on this that is in Spanish but still gives an impression of what's going on.
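For what it's worth, a minimal timing sketch of the difference (the frame size, column names, and repeat count are illustrative; the exact speedup depends on your data and pandas version):
import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame(np.random.randn(1000, 3), columns=['a', 'b', 'c'])

def set_with_loc():
    df.loc[500, 'b'] = 1.0    # label-based scalar assignment

def set_with_at():
    df.at[500, 'b'] = 1.0     # optimized scalar accessor

print('.loc:', timeit(set_with_loc, number=10_000))
print('.at :', timeit(set_with_at, number=10_000))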

If you know the position, why not just get the index from that?
Then use .loc:
df.loc[index, 'COL_NAME'] = x
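For example, the label can be looked up from the position first (column name and value are placeholders):
index = df.index[0]              # label of the row at position 0
df.loc[index, 'COL_NAME'] = x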

You can use:
df.set_value('Row_index', 'Column_name', value)
set_value is ~100 times faster than the .ix method, and also better than using df['Row_index']['Column_name'] = value.
But since set_value is deprecated now, .iat/.at are good replacements.
For example, if we have this DataFrame:
A B C
0 1 8 4
1 3 9 6
2 22 33 52
If we want to modify the value of the cell [0, "A"], we can do:
df.iat[0,0] = 2
or df.at[0,'A'] = 2
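And if only the row position is known, the row label can be looked up first, for example:
# .iat is purely positional, .at is purely label-based;
# df.index[...] bridges the two when only the position is known
df.at[df.index[0], 'A'] = 2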

Another way is to assign a column value for a given row based on the row's index position; index positions always start at zero, and the last position is the length of the dataframe minus one:
df["COL_NAME"].iloc[0] = x

To modify the value in the cell at the intersection of the row where column "A" equals "r" and column "C":
retrieve the index label of that row:
i = df[ df['A']=='r' ].index.values[0]
then modify the value in the desired column "C":
df.loc[i,"C"] = "newValue"
Note: beforehand, be sure to reset the row index so you have a clean index to work with:
df = df.reset_index(drop=True)
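Putting those steps together, a small end-to-end sketch (the frame, the 'r' marker, and the replacement value are placeholders):
import pandas as pd

df = pd.DataFrame({'A': ['p', 'q', 'r'], 'C': [1, 2, 3]})
df = df.reset_index(drop=True)             # ensure a clean row index

i = df[df['A'] == 'r'].index.values[0]     # index label of the matching row
df.loc[i, 'C'] = 99                        # assign at that label
print(df)
#    A   C
# 0  p   1
# 1  q   2
# 2  r  99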

Another way is to get the row index and then use df.loc or df.at.
# get row index 'label' from row number 'irow'
label = df.index.values[irow]
df.at[label, 'COL_NAME'] = x

Extending Jianxun's answer, using the set_value method in pandas. It sets the value for a column at a given index.
From the pandas documentation:
DataFrame.set_value(index, col, value)
To set the value at a particular index for a column, do:
df.set_value(index, 'COL_NAME', x)
Hope it helps. (Note that set_value has since been deprecated; .at/.iat are the modern replacements.)

Related

Pandas: Optimal subtract every nth row

I'm writing a function for a special case of row-wise subtraction in pandas.
First the user should be able to specify rows either by regex (e.g. "_BL[0-9]+") or by a regular index, e.g. every 6th row
Then we must subtract every matching row from rows preceding it, but not past another match
[Optionally] Drop selected rows
Column to match on should be user-defined by either index or label
For example if:
Samples           var1  var1
something           10    20
something           20    30
something           40    30
some_BL20_thing    100   100
something           50    70
something           90   100
some_BL10_thing    100    10
Expected output should be:
Samples      var1  var1
something     -90   -80
something     -80   -70
something     -60   -70
something     -50    60
something     -10    90
My current (incomplete) implementation relies heavily on looping:
from re import compile, search

def subtract_blanks(data:pd.DataFrame, num_samples:int)->pd.DataFrame:
    '''
    Accepts a data dataframe and a mod int and
    subtracts each blank from all mod preceding samples
    '''
    expr = compile(r'(_BL[0-9]{1})')
    output = data.copy(deep = True)
    for idx,row in output.iterrows():
        if search(expr,row['Sample']):
            for i in range(1,num_samples+1):
                # data_start marks the first value column (defined elsewhere)
                output.iloc[idx-i,data_start:] = output.iloc[idx-i,6:]-row.iloc[6:]
    return output
Is there a better way of doing this? This implementation seems pretty ugly. I've also considered maybe splitting the DataFrame into chunks and operating on them instead.
Code
# Create boolean mask for matching rows
# m = np.arange(len(df)) % 6 == 5 # for index match
m = df['Samples'].str.contains(r'_BL\d+') # for regex match
# mask the values and backfill to propagate the row
# values corresponding to match in backward direction
df['var1'] = df['var1'] - df['var1'].mask(~m).bfill()
# Delete the matching rows
df = df[~m].copy()
Samples var1 var1
0 something -90.0 -80.0
1 something -80.0 -70.0
2 something -60.0 -70.0
4 something -50.0 60.0
5 something -10.0 90.0
Note: The core logic is specified in the code above, so I'll leave the final function implementation up to the OP.
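For completeness, here is one way that vectorized logic could be wrapped into a function. This is only a sketch: it assumes the sample-name column is called 'Samples', that every other column holds values to adjust, and that the value columns have unique names (the duplicated 'var1' headers in the example would need renaming first).
import pandas as pd

def subtract_blanks(data: pd.DataFrame, pattern: str = r'_BL\d+',
                    sample_col: str = 'Samples',
                    drop_matches: bool = True) -> pd.DataFrame:
    '''Subtract each matching ("blank") row from the rows above it,
    stopping at the previous match, then optionally drop the blank rows.'''
    out = data.copy(deep=True)
    m = out[sample_col].str.contains(pattern)              # True on blank rows
    value_cols = [c for c in out.columns if c != sample_col]
    for col in value_cols:
        vals = out[col].astype(float)
        # clear non-blank rows, backfill so each row sees the next blank below it
        out[col] = vals - vals.mask(~m).bfill()
    return out.loc[~m].copy() if drop_matches else out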

groupby aggregate does not work as expected for Pandas

I need some help with aggregation and joining the dataframe groupby output.
Here is my dataframe:
df = pd.DataFrame({
    'Date': ['2020/08/18','2020/08/18', '2020/08/18', '2020/08/18', '2020/08/18', '2020/08/18', '2020/08/18'],
    'Time':['Val3',60,30,'Val2',60,60,'Val2'],
    'Val1': [0, 53.5, 33.35, 0,53.5, 53.5,0],
    'Val2':[0, 0, 0, 45, 0, 0, 35],
    'Val3':[48.5,0,0,0,0,0,0],
    'Place':['LOC_A','LOC_A','LOC_A','LOC_B','LOC_B','LOC_B','LOC_A']
})
I want following result:
Place Total_sum Factor Val2_new
0 LOC_A 86.85 21.71 35
1 LOC_B 107.00 26.75 45
I have tried the following:
df_by_place = df.groupby('Place')['Val1'].sum().reset_index(name='Total_sum')
df_by_place['Factor'] = round(df_by_place['Total_sum']*0.25, 2)
df_by_place['Val2_new'] = df.groupby('Place')['Val2'].agg('sum')
print(df_by_place)
But I get the following result:
Place Total_sum Factor Val2_new
0 LOC_A 86.85 21.71 NaN
1 LOC_B 107.00 26.75 NaN
When I do the following operation by itself:
print(df.groupby('Place')['Val2'].agg('sum'))
Output is desired:
Place
LOC_A 35
LOC_B 45
But when I assign it to a column, it gives NaN values.
Any help with this issue would be appreciated.
Thank you in advance.
Groupby in pandas >= 0.25 will allow you to assign names to columns inside of it and do what you want in one go.
df.groupby('Place').agg(Total_sum=('Val1', 'sum'),
                        Factor=('Val1', lambda x: round((x * 0.25).sum(), 2)),
                        Val2_new=('Val2', 'sum')).reset_index()
This provides your desired result.
Place Total_sum Factor Val2_new
0 LOC_A 86.85 21.71 35
1 LOC_B 107.00 26.75 45
Using lambda functions within groupby will make things a lot neater!
The answer given by sushanth seems to hold good.
df_by_place['Val2_new'] = df.groupby('Place')['Val2'].agg('sum').reset_index(drop=True)
By passing drop=True to reset_index, the previously created 'Place' index is discarded, so the resulting Series gets a plain integer index that lines up with df_by_place and the assignment works.
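Equivalently, if you prefer to keep the alignment explicit rather than relying on row order, the group sums can be merged in on 'Place' (a small sketch using the same names as above):
val2_sum = df.groupby('Place')['Val2'].sum().rename('Val2_new').reset_index()
df_by_place = df_by_place.merge(val2_sum, on='Place')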
A slight variation on maishm's answer, but basically the same idea:
df.groupby('Place').agg(total_sum=pd.NamedAgg(column='Val1', aggfunc=sum),
                        factor=pd.NamedAgg(column='Val1', aggfunc=lambda x: round(sum(x)*0.25, 2)),
                        val2_new=pd.NamedAgg(column='Val2', aggfunc=sum)).reset_index()
output:
Place total_sum factor val2_new
0 LOC_A 86.85 21.71 35
1 LOC_B 107.00 26.75 45

Replace value in a pandas dataframe column by the previous one

My code detects outliers in a time series. What I want to do is to replace the outliers in the first dataframe column with the previous value which is not an outlier.
This code just detects outliers, creating a boolean array where:
True means that a value in the dataframe is an outlier
False means that a value in the dataframe is not an outlier
series = read_csv('horario_completo.csv', header=None, squeeze=True)
df=pd.DataFrame(series)
from pandas import rolling_median
consumos=df.iloc[:,0]
df['rolling_median'] = rolling_median(consumos, window=48, center=True).fillna(method='bfill').fillna(method='ffill')
threshold =50
difference = np.abs(consumos - df['rolling_median'])
outlier = difference > threshold
Up to this point, everything works.
The next step I have thought of is to create a mask to replace the True values with the previous value in the same column (if this were possible, it would be much faster than making a loop).
I'll try to explain it with a little example:
This is what I have:
index consumo
0 54
1 67
2 98
index outlier
0 False
1 False
2 True
And this is what I want to do:
index consumo
0 54
1 67
2 67
I think I should create a mask like this:
df.mask(outlier, df.columns=[[0]][i-1],axis=1)
Obviously this IS NOT the way to write it. It's just an illustration of how I think it could be done (I'm talking about the [i-1]).
It seems you need shift:
consumo = consumo.mask(outlier, consumo.shift())
print (consumo)
0 54.0
1 67.0
2 67.0
Name: consumo, dtype: float64
Lastly, if all values are ints, add astype:
consumo = consumo.mask(outlier, consumo.shift()).astype(int)
print (consumo)
0 54
1 67
2 67
Name: consumo, dtype: int32
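One caveat worth noting: if two outliers occur back to back, shift() hands the second one the first outlier's (bad) value. If that matters, an alternative sketch is to clear the outliers and carry the last valid reading forward:
consumo = consumo.mask(outlier).ffill()    # NaN-out the outliers, then forward-fill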

In Pandas, how can I patch a dataframe with missing values with values from another dataframe given a similar index?

From
Fill in missing row values in pandas dataframe
I have the following dataframe and would like to fill in missing values.
mukey hzdept_r hzdepb_r sandtotal_r silttotal_r
425897 0 61
425897 61 152 5.3 44.7
425911 0 30 30.1 54.9
425911 30 74 17.7 49.8
425911 74 84
I want each missing value to be the average of values corresponding to that mukey. In this case, e.g., the first row's missing values would be the averages of sandtotal_r and silttotal_r for mukey==425897. pandas fillna doesn't seem to do the trick. Any help?
While the code works for the sample dataframe in that example, it is failing on the larger dataset I have uploaded here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
df1.fillna(df.groupby('mukey').mean(),inplace=True)
df1.reset_index()
I get the error: InvalidIndexError. Why is it not working?
Use combine_first. It allows you to patch up the missing data in the left dataframe with the matching data in the right dataframe, based on the same index.
In this case, df1 is on the left and df2, the means, as the one on the right.
In [48]: df = pd.read_csv('www004.csv')
...: df1 = df.set_index('mukey')
...: df2 = df.groupby('mukey').mean()
In [49]: df1.loc[426178,:]
Out[49]:
hzdept_r hzdepb_r sandtotal_r silttotal_r claytotal_r om_r
mukey
426178 0 36 NaN NaN NaN 72.50
426178 36 66 NaN NaN NaN 72.50
426178 66 152 42.1 37.9 20 0.25
In [50]: df2.loc[426178,:]
Out[50]:
hzdept_r 34.000000
hzdepb_r 84.666667
sandtotal_r 42.100000
silttotal_r 37.900000
claytotal_r 20.000000
om_r 48.416667
Name: 426178, dtype: float64
In [51]: df3 = df1.combine_first(df2)
...: df3.loc[426178,:]
Out[51]:
hzdept_r hzdepb_r sandtotal_r silttotal_r claytotal_r om_r
mukey
426178 0 36 42.1 37.9 20 72.50
426178 36 66 42.1 37.9 20 72.50
426178 66 152 42.1 37.9 20 0.25
Note that the following rows still won't have values in the resulting df3
426162
426163
426174
426174
426255
because they were single rows to begin with, hence, .mean() doesn't mean anything to them (eh, see what I did there?).
The problem is the duplicate index values. When you use df1.fillna(df2), if you have multiple NaN entries in df1 where both the index and the column label are the same, pandas will get confused when trying to slice df1, and throw that InvalidIndexError.
Your sample dataframe works because even though you have duplicate index values there, only one of each index value is null. Your larger dataframe contains null entries that share both the index value and column label in some cases.
To make this work, you can do this one column at a time. For some reason, when operating on a series, pandas will not get confused by multiple entries of the same index, and will simply fill the same value in each one. Hence, this should work:
import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
grouped = df.groupby('mukey').mean()
for col in ['sandtotal_r', 'silttotal_r']:
    df1[col] = df1[col].fillna(grouped[col])
df1.reset_index()
NOTE: Be careful using the combine_first method if you ever have "extra" data in the dataframe you're filling from. The combine_first function will include ALL indices from the dataframe you're filling from, even if they're not present in the original dataframe.
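If the explicit loop feels clunky, the same per-group fill can usually be written with groupby().transform, which returns a result aligned to the original rows. Applied to the original df (before set_index), the index stays unique, so the duplicate-index issue never comes up (a sketch using the same column names):
for col in ['sandtotal_r', 'silttotal_r']:
    df[col] = df[col].fillna(df.groupby('mukey')[col].transform('mean'))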

What does `ValueError: cannot reindex from a duplicate axis` mean?

I am getting a ValueError: cannot reindex from a duplicate axis when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.
Here is my session inside of ipdb trace. I have a DataFrame with string index, and integer columns, float values. However when I try to create sum index for sum of all columns I am getting ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem, what could I be missing?
I don't really understand what ValueError: cannot reindex from a duplicate axis means. What does this error message mean? Maybe this will help me diagnose the problem, and this is the most answerable part of my question.
ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')
ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False
Here is the error:
ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
I tried to reproduce this with a simple example, but I failed
In [32]: import pandas as pd
In [33]: import numpy as np
In [34]: a = np.arange(35).reshape(5,7)
In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In [36]: df.values.dtype
Out[36]: dtype('int64')
In [37]: df.loc['sums'] = df.sum(axis=0)
In [38]: df
Out[38]:
10 11 12 13 14 15 16
x 0 1 2 3 4 5 6
y 7 8 9 10 11 12 13
u 14 15 16 17 18 19 20
z 21 22 23 24 25 26 27
w 28 29 30 31 32 33 34
sums 70 75 80 85 90 95 100
This error usually rises when you join / assign to a column when the index has duplicate values. Since you are assigning to a row, I suspect that there is a duplicate value in affinity_matrix.columns, perhaps not shown in your question.
As others have said, you've probably got duplicate values in your original index. To find them do this:
df[df.index.duplicated()]
Indices with duplicate values often arise if you create a DataFrame by concatenating other DataFrames. If you don't care about preserving the values of your index, and you want them to be unique values, set ignore_index=True when you concatenate the data.
Alternatively, to overwrite your current index with a new one, instead of using df.reindex(), set:
df.index = new_index
Simple Fix
Run this before grouping
df = df.reset_index()
Thanks to this github comment for the solution.
For people who are still struggling with this error, it can also happen if you accidentally create a duplicate column with the same name. Remove duplicate columns like so:
df = df.loc[:,~df.columns.duplicated()]
Simply sidestep the error by using .values at the end:
affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0).values
I came across this error today when I wanted to add a new column like this
df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
I wanted to process the REMARK column of df_temp to return 1 or 0. However, I typed the wrong variable, df. And it returned an error like this:
----> 1 df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
2417 else:
2418 # set column
-> 2419 self._set_item(key, value)
2420
2421 def _setitem_slice(self, key, value):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
2483
2484 self._ensure_valid_index(value)
-> 2485 value = self._sanitize_column(key, value)
2486 NDFrame._set_item(self, key, value)
2487
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value, broadcast)
2633
2634 if isinstance(value, Series):
-> 2635 value = reindexer(value)
2636
2637 elif isinstance(value, DataFrame):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in reindexer(value)
2625 # duplicate axis
2626 if not value.index.is_unique:
-> 2627 raise e
2628
2629 # other
ValueError: cannot reindex from a duplicate axis
As you can see, the right code should be
df_temp['REMARK_TYPE'] = df_temp.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
df and df_temp have a different number of rows, so the assignment raised ValueError: cannot reindex from a duplicate axis.
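A couple of quick sanity checks can catch this kind of slip before the assignment blows up:
print(len(df_temp), len(df))              # row counts should match
print(df_temp.index.equals(df.index))     # and the indexes should line up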
Hope this helps other people debug their code.
In my case, this error popped up not because of duplicate values, but because I attempted to join a shorter Series to a Dataframe: both had the same index, but the Series had fewer rows (missing the top few). The following worked for my purposes:
df.head()
SensA
date
2018-04-03 13:54:47.274 -0.45
2018-04-03 13:55:46.484 -0.42
2018-04-03 13:56:56.235 -0.37
2018-04-03 13:57:57.207 -0.34
2018-04-03 13:59:34.636 -0.33
series.head()
date
2018-04-03 14:09:36.577 62.2
2018-04-03 14:10:28.138 63.5
2018-04-03 14:11:27.400 63.1
2018-04-03 14:12:39.623 62.6
2018-04-03 14:13:27.310 62.5
Name: SensA_rrT, dtype: float64
df = series.to_frame().combine_first(df)
df.head(10)
SensA SensA_rrT
date
2018-04-03 13:54:47.274 -0.45 NaN
2018-04-03 13:55:46.484 -0.42 NaN
2018-04-03 13:56:56.235 -0.37 NaN
2018-04-03 13:57:57.207 -0.34 NaN
2018-04-03 13:59:34.636 -0.33 NaN
2018-04-03 14:00:34.565 -0.33 NaN
2018-04-03 14:01:19.994 -0.37 NaN
2018-04-03 14:02:29.636 -0.34 NaN
2018-04-03 14:03:31.599 -0.32 NaN
2018-04-03 14:04:30.779 -0.33 NaN
2018-04-03 14:05:31.733 -0.35 NaN
2018-04-03 14:06:33.290 -0.38 NaN
2018-04-03 14:07:37.459 -0.39 NaN
2018-04-03 14:08:36.361 -0.36 NaN
2018-04-03 14:09:36.577 -0.37 62.2
I wasted a couple of hours on the same issue. In my case, I had to reset_index() on a dataframe before using the apply function.
Before merging, or looking up from another indexed dataset, you need to reset the index, as one dataset can have only one index.
I got this error when I tried adding a column from a different table. Indeed I got duplicate index values along the way. But it turned out I was just doing it wrong: I actually needed to df.join the other table.
This pointer might help someone in a similar situation.
In my case it was caused by a mismatch in dimensions: I accidentally used a column from a different df during a mul operation.
This can also be a cause. It may happen even if you are trying to insert a DataFrame-type column inside a DataFrame. You can try this:
df['my_new'] = pd.Series(my_new.values)
If you get this error after merging two dataframes, removing the suffixes, and then trying to write to Excel:
Your problem is that there are columns you are not merging on that are common to both source DataFrames. Pandas needs a way to say which one came from where, so it adds the suffixes, the defaults being '_x' on the left and '_y' on the right.
If you have a preference on which source data frame to keep the columns from, then you can set the suffixes and filter accordingly, for example if you want to keep the clashing columns from the left:
# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
df,
tempdf[what_i_care_about],
on=['myid', 'myorder'],
how='outer',
suffixes=('', '_delete_suffix') # Left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delete_suffix')]]
Alternatively, you can drop one of each of the clashing columns prior to merging, then Pandas has no need to assign a suffix.
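For instance, something along these lines (reusing the frame and key names from above):
keys = ['myid', 'myorder']
# drop the clashing columns from the right-hand frame before merging,
# so pandas never needs to invent '_x'/'_y' suffixes
clashing = [c for c in tempdf.columns if c in df.columns and c not in keys]
df = pd.merge(df, tempdf.drop(columns=clashing), on=keys, how='outer')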
Just add .to_numpy() to the end of the series you want to concatenate.
It happened to me when I appended 2 dataframes into another (df3 = df1.append(df2)), so the output was:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 4 d
1 5 e
2 6 f
df3
A B
0 1 a
1 2 b
2 3 c
0 4 d
1 5 e
2 6 f
The simplest way to fix the indexes is to use the "df.reset_index(drop=bool, inplace=bool)" method, as Connor said... You can set the 'drop' argument to True to keep the old index from being added as a column, and 'inplace' to True to make the index reset permanent.
Here is the official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
In addition, you can also use the ".set_index(keys=list, inplace=bool)" method, like this:
new_index_list = list(range(0, len(df3)))
df3['new_index'] = new_index_list
df3.set_index(keys='new_index', inplace=True)
official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
Make sure your index does not have any duplicates. I simply did df.reset_index(drop=True, inplace=True) and I don't get the error anymore! But you might want to keep the index; in that case, just set drop to False.
df = df.reset_index(drop=True) worked for me
I was trying to create a histogram using seaborn.
sns.histplot(data=df, x='Blood Chemistry 1', hue='Outcome', discrete=False, multiple='stack')
I get ValueError: cannot reindex from a duplicate axis. To solve it, I had to choose only the rows where x has no missing values:
data = df[~df['Blood Chemistry 1'].isnull()]
