Pandas replacing values on specific columns - python

I am aware of these two similar questions:
Pandas replace values
Pandas: Replacing column values in dataframe
I used a different approach for substituting values, which I think should be the cleanest one, but it does not work. I know how to work around it, but I would like to understand why it fails:
In [108]: df=pd.DataFrame([[1, 2, 8],[3, 4, 8], [5, 1, 8]], columns=['A', 'B', 'C'])
In [109]: df
Out[109]:
A B C
0 1 2 8
1 3 4 8
2 5 1 8
In [110]: df.loc[:, ['A', 'B']].replace([1, 3, 2], [3, 6, 7], inplace=True)
In [111]: df
Out[111]:
A B C
0 1 2 8
1 3 4 8
2 5 1 8
In [112]: df.loc[:, 'A'].replace([1, 3, 2], [3, 6, 7], inplace=True)
In [113]: df
Out[113]:
A B C
0 3 2 8
1 6 4 8
2 5 1 8
If I slice only one column (In [112]), it behaves differently from slicing several columns (In [110]). As I understand it, .loc returns a view and not a copy. By my logic, an inplace change on the slice should therefore change the whole DataFrame; this is what happens at In [112], but not at In [110].
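A small experiment makes the difference visible. This is only a sketch; the exact behavior depends on the pandas version, and under the copy-on-write semantics of recent releases neither inplace call reaches df:
import pandas as pd

df = pd.DataFrame([[1, 2, 8], [3, 4, 8], [5, 1, 8]], columns=['A', 'B', 'C'])

# Selecting several columns materializes a new object (a copy), so an
# inplace replace on it never reaches the original DataFrame.
several = df.loc[:, ['A', 'B']]
several.replace([1, 3, 2], [3, 6, 7], inplace=True)
print(df)  # unchanged

# Selecting a single column may share memory with the original, which is
# why In [112] appeared to work -- but this is not guaranteed behavior.
single = df.loc[:, 'A']
single.replace([1, 3, 2], [3, 6, 7], inplace=True)
print(df)  # modified in older pandas, unchanged under copy-on-write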

Here is the answer by one of the developers: https://github.com/pydata/pandas/issues/11984
This should ideally show a SettingWithCopyWarning, but I think this is quite difficult to detect.
You should NEVER do this type of chained inplace setting. It is simply bad practice.
The idiomatic way is:
In [7]: df[['A','B']] = df[['A','B']].replace([1, 3, 2], [3, 6, 7])
In [8]: df
Out[8]:
A B C
0 3 7 8
1 6 4 8
2 5 3 8
(You can do this with df.loc[:, ['A', 'B']] as well, but the above is clearer.)

Alternatively, you can pass a per-column mapping to replace:
to_rep = dict(zip([1, 3, 2], [3, 6, 7]))
df.replace({'A': to_rep, 'B': to_rep}, inplace=True)
This will return:
A B C
0 3 7 8
1 6 4 8
2 5 3 8
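For completeness, here is a runnable version of the per-column mapping on the data from the question:
import pandas as pd

df = pd.DataFrame([[1, 2, 8], [3, 4, 8], [5, 1, 8]], columns=['A', 'B', 'C'])

# One mapping, applied only to columns A and B; column C is left untouched
# because it has no entry in the outer dict.
to_rep = dict(zip([1, 3, 2], [3, 6, 7]))
df = df.replace({'A': to_rep, 'B': to_rep})
print(df)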

Related

Pandas dataframe with N columns

I need to use Python with Pandas to build a DataFrame with N columns. This is a simplified version of what I have:
Ind=[[1, 2, 3],[4, 5, 6],[7, 8, 9],[10, 11, 12]]
DAT = pd.DataFrame([Ind[0],Ind[1],Ind[2],Ind[3]], index=None).T
DAT.head()
Out
0 1 2 3
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
This is the result that I want, but my real Ind has 121 sets of points and I really don't want to write each one in the DataFrame's argument. Is there a way to write this easily? I tried using a for loop, but that didn't work out.
You can just pass the list directly:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
df = pd.DataFrame(data, index=None).T
df.head()
Outputs:
0 1 2 3
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
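If you would rather avoid the transpose, here is a small sketch of the same idea using zip (column labels are just the defaults):
import pandas as pd

Ind = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# zip(*Ind) pairs up the i-th element of every inner list, so each inner
# list ends up as one column, for any number of lists.
df = pd.DataFrame(list(zip(*Ind)))
print(df)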

Replace part of df column with values defined in Series/dictionary

I have a DataFrame column whose index often has repeated values. Some index values are exceptions and need to be changed based on another Series I've made, while the rest are fine as is. The Series index is unique.
Here's a couple variables to illustrate
df = pd.DataFrame(data={'hi':[1, 2, 3, 4, 5, 6, 7]}, index=[1, 1, 1, 2, 2, 3, 4])
Out[52]:
hi
1 1
1 2
1 3
2 4
2 5
3 6
4 7
exceptions = pd.Series(data=[90, 95], index=[2, 4])
Out[36]:
2 90
4 95
I would like to set the df to ...
hi
1 1
1 2
1 3
2 90
2 90
3 6
4 95
What's a clean way to do this? I'm a bit new to Pandas; my first thought is just to loop, but I don't think that's the proper way to solve this.
Assuming the index of exceptions is guaranteed to be a subset of df's index, we can use loc and the Series index to assign the values:
df.loc[exceptions.index, 'hi'] = exceptions
We can use index.intersection if exceptions contains extra labels that do not or should not align with df:
exceptions = pd.Series(data=[90, 95, 100], index=[2, 4, 5])
df.loc[exceptions.index.intersection(df.index, sort=False), 'hi'] = exceptions
df:
hi
1 1
1 2
1 3
2 90
2 90
3 6
4 95
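Put together, a runnable sketch of the loc-based assignment on the data from the question:
import pandas as pd

df = pd.DataFrame(data={'hi': [1, 2, 3, 4, 5, 6, 7]}, index=[1, 1, 1, 2, 2, 3, 4])
exceptions = pd.Series(data=[90, 95], index=[2, 4])

# .loc with a list of labels selects every row carrying those labels, so the
# two rows labelled 2 are both overwritten; the right-hand side aligns on
# the index of `exceptions`.
df.loc[exceptions.index, 'hi'] = exceptions
print(df)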

Append a dataframe with a column of another dataframe and a constant with Python

Let's take these two dataframes :
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df1
A B
0 1 2
1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))
df2
C D
0 5 6
1 7 8
I would like to append column C of df2 below column A of df1, and to put 9 in column B. To sum up, I would like to have:
df1
A B
0 1 2
1 3 4
2 5 9
3 7 9
I tried numerous things with the append function but didn't manage to find the right code. Could you please help me?
df1.append(df2.rename(columns={'C':'A'}).drop(columns='D'), ignore_index=True) \
.fillna(9).astype(int)
A B
0 1 2
1 3 4
2 5 9
3 7 9
Another alternative, based on #splash58's answer:
df1.append(df2.rename(columns={'C':'A'}).drop(df2.columns.difference(['C']), 1), ignore_index=True,sort=False) \
.fillna(9).astype(int)
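Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so a concat-based sketch of the same idea may be useful:
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))

# Keep only column C of df2, rename it to A so it lines up with df1,
# add the constant column B, then stack the two frames.
extra = df2[['C']].rename(columns={'C': 'A'}).assign(B=9)
out = pd.concat([df1, extra], ignore_index=True)
print(out)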

pandas reorder only a specific row

I have a DataFrame in which I want to switch the order of the values in only the third row, while keeping the other rows the same.
For my project I have to switch the order under some condition; here is an example that probably has no real meaning.
Suppose the dataset is
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df
out[1]:
A B C
0 0 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
I want to have the output:
A B C
0 0 5 a
1 1 6 b
2 **7 2** c
3 3 8 d
4 4 9 e
How do I do it?
I have tried:
new_order = [1, 0, 2] # specify new order of the third row
i = 2 # specify row number
df.iloc[i] = df[df.columns[new_order]].loc[i] # reorder the third row only and assign new values to df
I observed from the output of the right-hand side that the columns are reordered as I wanted:
df[df.columns[new_order]].loc[i]
Out[2]:
B 7
A 2
C c
Name: 2, dtype: object
But when I assigned it back to df, nothing changed. I guess it's because of the name matching.
Can someone help me? Thanks in advance!
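The assignment in the attempt above aligns the right-hand side on its labels (B, A, C), which snaps the values back to their original columns; this is the "name matching" guessed at above. A sketch that sidesteps the alignment by assigning the raw values:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

new_order = [1, 0, 2]  # new order for the values in the target row
i = 2                  # row position

# .to_numpy() drops the labels, so the assignment is purely positional and
# the values are not re-aligned back to their original columns.
df.iloc[i] = df.iloc[i, new_order].to_numpy()
print(df)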

Pandas groupby multiindex when unique on first level: unexpected results

Python version: 3.5.2; Pandas version: 0.23.1
I am noticing unexpected behavior when I group by two index levels but each row is unique on the first level. The code I am executing on my DataFrame with column c is:
df.c.groupby(df.index.names).min()
Everything works as expected when the rows are not unique on the first level. To make this clear, I've included two versions below. Edit: now including three versions!
Version 1: Has the expected output
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 2, 4]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
Input:
c
a b
1 2 3
2 4
4 5 6
Output:
a b
1 2 3
4 5 6
Version 2: Has the unexpected output
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
Input:
c
a b
1 2 3
4 5 6
Output:
a 3
b 6
Expected Output:
a b
1 2 3
4 5 6
Version 3: Has expected output, but not expected with version 2 in mind.
df = pd.DataFrame([[1, 2, 3, 4], [4, 5, 6, 7]], columns=['a', 'b1', 'b2', 'c'])
df = df.set_index(['a','b1','b2']).sort_index()
Input:
c
a b1 b2
1 2 3 4
4 5 6 7
Output:
a b1 b2
1 2 3 4
4 5 6 7
Here is a peek into what is going on. Take a look at the name of the Series that gets passed into the "applied" function, f.
In the first case (Expected Results):
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 2, 4]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
def f(x):
    print(x)
    print('\n')
    print(min(x))
    print('\n')
    return min(x)
df.c.groupby(['a','b']).apply(f)
Output:
a b
1 2 3
2 4
Name: (1, 2), dtype: int64
3
a b
4 5 6
Name: (4, 5), dtype: int64
6
Out[292]:
a b
1 2 3
4 5 6
In the second case (unexpected results), note the name of the series passed in:
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df1 = df1.set_index(['a','b']).sort_index()
def f(x):
    print(x)
    print('\n')
    print(min(x))
    print('\n')
    return min(x)
df1.c.groupby(['a','b']).apply(f)
Output:
a b
1 2 3
Name: a, dtype: int64
3
a b
4 5 6
Name: b, dtype: int64
6
Out[293]:
a 3
b 6
Name: c, dtype: int64
It uses these Series to build the resulting DataFrame. The naming of the Series is the culprit, due to the nature of the data. Why? Well, we would have to look into the code for that.
The idiomatic fix for this problem is to use this syntax:
df1.groupby(df1.index.names)['c'].min()
Output:
a b
1 2 3
4 5 6
Name: c, dtype: int64
You can use the level argument of groupby:
>>> df
c
a b
1 2 3
4 5 6
>>> df.c.groupby(level=[0,1]).min()
a b
1 2 3
4 5 6
Name: c, dtype: int64
From the docs
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
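For reference, a self-contained version of the level-based call on the Version 2 data:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df = df.set_index(['a', 'b']).sort_index()

# Grouping by index *levels* keeps both levels in the result's index,
# even though every (a, b) group here contains a single row.
print(df['c'].groupby(level=[0, 1]).min())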
This behavior has since been changed in pandas; newer versions produce the expected output in all cases.
