How to add a value to specific columns of a pandas dataframe? - python

I have to perform the same arithmetic operation on specific columns of a pandas DataFrame. I do it as
c.loc[:,'col3'] += cons
c.loc[:,'col5'] += cons
c.loc[:,'col6'] += cons
There should be a simpler way to do all of these in one operation, i.e. updating col3, col5, and col6 in a single command.

pd.DataFrame.loc label indexing accepts lists:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.loc[:, ['B', 'C']] += 10
print(df)
   A   B   C
0  1  12  13
1  4  15  16
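Applied to the columns from the question (a sketch; the frame c and the constant cons stand in for the original data, which was not shown):

```python
import pandas as pd

# Stand-ins for the question's data: c with columns col3, col5, col6
# and a constant cons to add.
c = pd.DataFrame({'col3': [1, 2], 'col5': [3, 4], 'col6': [5, 6]})
cons = 10

# One command instead of three: pass the column labels as a list to .loc.
c.loc[:, ['col3', 'col5', 'col6']] += cons
print(c)
```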

Related

How to calculate number of rows between 2 indexes of pandas dataframe

I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
   col1  col2
A     1     6
B     2     7
C     3     8
D     4     9
E     5    10
I need to write a function (say getNrRows(fromIndex)) that takes an index value as input and returns the number of rows between that given index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be
len(df[row_index:]) - 1
For your information, pandas has the built-in method get_indexer_for:
len(df) - df.index.get_indexer_for(['C']) - 1
Out[179]: array([2], dtype=int64)
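Wrapping the slicing idea above into the requested function, a minimal sketch:

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d, index=['A', 'B', 'C', 'D', 'E'])

def getNrRows(fromIndex):
    # Label-based slicing is inclusive of both endpoints, so df[fromIndex:]
    # contains the starting row itself; subtract 1 to count only the steps.
    return len(df[fromIndex:]) - 1

print(getNrRows('C'))  # 2
```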

Merge (or concat) two dataframes by index with duplicate indexes

I have two dataframes, A and B, with common indexes. These common indexes can appear several times (duplicated) in both A and B.
I want to merge A and B according to these 3 cases:
Case 0: If index i appears once in A (i1) and once in B (i1), I want my
merged-by-index dataframe to contain the row A(i1), B(i1).
Case 1: If index i appears once in A (i1) and twice in B, in this order
(i1, i2), I want the merged dataframe to contain the rows A(i1), B(i1)
and A(i1), B(i2).
Case 2: If index i appears twice in A, in this order (i1, i2), and twice
in B, in this order (i1, i2), I want the merged dataframe to contain the
rows A(i1), B(i1) and A(i2), B(i2).
These 3 cases cover everything that can occur in my data.
When using pandas.merge, cases 0 and 1 work. But for case 2, the returned dataframe contains the rows A(i1), B(i1) and A(i1), B(i2) and A(i2), B(i1) and A(i2), B(i2) instead of A(i1), B(i1) and A(i2), B(i2).
I could use pandas.merge and then delete the undesired merged rows, but is there a way to handle all 3 cases at once?
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
pd.merge(A,B, left_index=True, right_index=True, how='inner')
For example, in the merged dataframe above, I want exactly this result but without the second and third rows for index 'a'.
Basically, your 3 cases can be summarized into 2:
Index i occurs the same number of times (1 or 2) in A and B: merge according to order.
Index i occurs 2 times in A and 1 time in B: merge using B's content for all rows.
Prep code:
def add_secondary_index(df):
    df.index.name = 'Old'
    df['Order'] = df.groupby(df.index).cumcount()
    df.set_index('Order', append=True, inplace=True)
    return df
import pandas as pd
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
index_times = A.groupby(A.index).count() == B.groupby(B.index).count()
Case 1 is easy to solve; you just need to add the secondary index:
same_times_index = index_times[index_times[0].values].index
A_same = A.loc[same_times_index].copy()
B_same = B.loc[same_times_index].copy()
add_secondary_index(A_same)
add_secondary_index(B_same)
result_merge_same = pd.merge(A_same,B_same,left_index=True,right_index=True)
Case 2 needs to be considered separately:
not_same_times_index = index_times[~index_times.index.isin(same_times_index)].index
A_notsame = A.loc[not_same_times_index].copy()
B_notsame = B.loc[not_same_times_index].copy()
result_merge_notsame = pd.merge(A_notsame,B_notsame,left_index=True,right_index=True)
You could then decide whether to add a secondary index to result_merge_notsame, or drop it from result_merge_same.
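Putting the two parts together, a minimal end-to-end sketch (reusing the prep code above; the final concat with droplevel is my addition, dropping the secondary index so the two parts stack cleanly):

```python
import pandas as pd

def add_secondary_index(df):
    # Tag each repeat of an index label with its occurrence number.
    df.index.name = 'Old'
    df['Order'] = df.groupby(df.index).cumcount()
    df.set_index('Order', append=True, inplace=True)
    return df

A = pd.DataFrame([[1, 2], [4, 2], [5, 5], [5, 5], [1, 1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7, 7], [5, 5]], index=['b', 'c', 'a', 'a'])

# Indexes occurring the same number of times in A and B: merge by order.
index_times = A.groupby(A.index).count() == B.groupby(B.index).count()
same_times_index = index_times[index_times[0].values].index
A_same = add_secondary_index(A.loc[same_times_index].copy())
B_same = add_secondary_index(B.loc[same_times_index].copy())
result_merge_same = pd.merge(A_same, B_same, left_index=True, right_index=True)

# Indexes occurring a different number of times: plain (cartesian) merge.
not_same_times_index = index_times[~index_times.index.isin(same_times_index)].index
result_merge_notsame = pd.merge(A.loc[not_same_times_index], B.loc[not_same_times_index],
                                left_index=True, right_index=True)

# Drop the secondary index again and stack both parts into a single frame.
combined = pd.concat([result_merge_same.droplevel('Order'), result_merge_notsame])
print(combined)
```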

pandas data frame / numpy array - roll without aggregate function

Rolling in pandas aggregates data:
x = pd.DataFrame([[1,'a'],[2,'b'],[3,'c'],[4,'d']], columns=['a','b'])
y = x.rolling(2).mean()
print(y)
gives:
     a  b
0  NaN  a
1  1.5  b
2  2.5  c
3  3.5  d
What I need is a 3-dimensional dataframe (or numpy array) of windows of 3 samples, shifted by 1 step each (in this example):
[
[[1,'a'],[2,'b'],[3,'c']],
[[2,'b'],[3,'c'],[4,'d']]
]
What's the right way to do it for 900 samples, shifting by 1 each step?
Using np.concatenate:
np.concatenate([x.values[:-1],
                x.values[1:]], axis=1).reshape([x.shape[0] - 1, x.shape[1], -1])
You can try concatenating shifted copies of the dataframe, one per position in the chosen window length (here 2):
length = x.dropna().shape[0] - 1
cols = len(x.columns)
pd.concat([x.shift(1), x], axis=1).dropna().astype(int, errors='ignore').values.reshape((length, cols, 2))
Out:
array([[[1, 'a'],
[2, 'b']],
[[2, 'b'],
[3, 'c']],
[[3, 'c'],
[4, 'd']]], dtype=object)
Let me know whether this solution suits your question.
p = x[['a','b']].values.tolist()  # create a list of lists, [i.a, i.b] for every row i in x
#### Output ####
[[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']]
# iterate through the list except the last two items and, for every i, collect p[i], p[i+1], p[i+2] into a list
list_of_3 = [[p[i],p[i+1],p[i+2]] for i in range(len(p)-2)]
#### Output ####
[
[[1, 'a'], [2, 'b'], [3, 'c']],
[[2, 'b'], [3, 'c'], [4, 'd']]
]
# This is used if in case the list you require is numpy ndarray
from numpy import array
a = array(list_of_3)
#### Output ####
[[['1' 'a']
['2' 'b']
['3' 'c']]
[['2' 'b']
['3' 'c']
['4' 'd']]
]
Since pandas 1.1 you can iterate over rolling objects:
[window.values.tolist() for window in x.rolling(3) if window.shape[0] == 3]
The if makes sure we only get full windows. This solution has the advantage that you can use any parameter of the handy rolling function of pandas.
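For large inputs (such as the 900 samples mentioned), a stride-based view avoids copying each window. A sketch using numpy's sliding_window_view (available since NumPy 1.20):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']], columns=['a', 'b'])

# sliding_window_view creates overlapping windows without copying the data.
# Windowing along axis 0 appends the window dimension last, so transpose
# to get the shape (n_windows, window_len, n_cols).
windows = np.lib.stride_tricks.sliding_window_view(x.values, 3, axis=0)
windows = windows.transpose(0, 2, 1)
print(windows.shape)  # (2, 3, 2)
```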

Pandas: renaming columns that have the same name

I have a dataframe that has duplicated column names a, b and b. I would like to rename the second b into c.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "b1": [7, 8, 9]})
df.rename(index=str, columns={'b1' : 'b'})
I tried this with no success:
df.rename(index=str, columns={2 : "c"})
try:
>>> df.columns = ['a', 'b', 'c']
>>> df
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
You can always just manually rename all the columns.
df.columns = ['a', 'b', 'c']
If your columns are ordered and you want lettered columns, don't type names out manually. This is prone to error.
You can use string.ascii_lowercase, assuming you have a maximum of 26 columns:
from string import ascii_lowercase
df = pd.DataFrame(columns=['a', 'b', 'b1'])
df.columns = list(ascii_lowercase[:len(df.columns)])
print(df.columns)
Index(['a', 'b', 'c'], dtype='object')
These solutions don't address the problem of having many columns.
Here is a solution that, independent of the number of columns, renames columns sharing the same name to unique names:
df.columns = ['name'+str(col[0]) if col[1] == 'name' else col[1] for col in enumerate(df.columns)]
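A generic variant along the same lines (a sketch of my own, not from the answers above): suffix every repeated name with its occurrence count, so it works for any duplicated name and any number of duplicates:

```python
import pandas as pd

# A small frame with a duplicated column name, mirroring the question.
df = pd.DataFrame([[1, 4, 7], [2, 5, 8]], columns=['a', 'b', 'b'])

# Count how often each name has occurred so far, then append that count
# as a suffix to every repeat ('b' -> 'b', 'b_1', 'b_2', ...).
s = pd.Series(df.columns)
counts = s.groupby(s).cumcount()
df.columns = [f"{name}_{i}" if i else name for name, i in zip(df.columns, counts)]
print(df.columns.tolist())  # ['a', 'b', 'b_1']
```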

cannot add multiple columns with values in Python Pandas

I want to add the data of reference to data, so I use
data[reference.columns] = reference
but it only creates the columns with no values. How can I add the values?
Your two DataFrames are indexed differently, so when you do data[reference.columns] = reference it tries to align the new columns on indices. Since the indices of reference are not in data (or only partially overlap), it adds the columns but fills the values with NaN.
It looks like you want to add multiple static columns to data with the values from reference. You can just assign these:
for col in reference.columns:
    data[col] = reference[col].values[0]
Here's an illustration of the issue.
import pandas as pd
data = pd.DataFrame({'id': [1, 2, 3, 4],
'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
'val2': ['A', 'B', 'C', 'D']})
These have the same indices ranging from 0-3.
data[reference.columns] = reference
Outputs
   id val1  id2 val2
0   1    A    1    A
1   2    B    2    B
2   3    C    3    C
3   4    D    4    D
But, if these DataFrames have different indices (that only partially overlap):
data = pd.DataFrame({'id': [1, 2, 3, 4],
'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
'val2': ['A', 'B', 'C', 'D']})
reference.index=[3,4,5,6]
data[reference.columns]=reference
Outputs:
   id val1  id2 val2
0   1    A  NaN  NaN
1   2    B  NaN  NaN
2   3    C  NaN  NaN
3   4    D  1.0    A
As only the index value of 3 is shared.
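If you want purely positional assignment regardless of how the two frames are indexed, one way (a sketch) is to bypass alignment by assigning the underlying array instead of the DataFrame:

```python
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3, 4], 'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4], 'val2': ['A', 'B', 'C', 'D']})
reference.index = [3, 4, 5, 6]  # mismatched index, as in the example above

# A plain numpy array has no index, so the values are assigned by position.
data[reference.columns] = reference.to_numpy()
print(data)
```

This assumes both frames have the same number of rows; with unequal lengths the positional assignment would fail rather than produce NaNs.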
