Sums of dataframe columns

I have a dataframe with a varying number of columns, depending on my dataset. I want a function that sums every pairwise combination of these columns and appends the new 'summed columns' to the existing dataframe.
For example, if I have 3 columns, I want 3 more columns: 1 summed with 2, 1 summed with 3, and 2 summed with 3.
Much obliged.

IIUC, you can use itertools.combinations combined with pandas.concat:
from itertools import combinations
import pandas as pd
out = pd.concat({f'{a}+{b}': df[a]+df[b] for a,b in combinations(df, 2)}, axis=1)
Example:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({k: np.random.randint(0, 10, 5) for k in list('ABC')})
from itertools import combinations
out = pd.concat({f'{a}+{b}': df[a]+df[b] for a,b in combinations(df, 2)}, axis=1)
print(df.join(out))
Output:
   A  B  C  A+B  A+C  B+C
0  5  9  7   14   12   16
1  0  3  6    3    6    9
2  3  5  8    8   11   13
3  3  2  8    5   11   10
4  7  4  1   11    8    5
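
Since the question asks for a function, the approach above could be wrapped roughly as follows. This is only a sketch: the name add_pairwise_sums is made up for illustration, and it assumes all columns are numeric and that the column labels are unique.

```python
from itertools import combinations

import pandas as pd


def add_pairwise_sums(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with one extra column per pair of original columns.

    Hypothetical helper, not part of pandas.
    """
    sums = {f"{a}+{b}": df[a] + df[b] for a, b in combinations(df.columns, 2)}
    if not sums:  # fewer than two columns: nothing to add
        return df.copy()
    return df.join(pd.DataFrame(sums, index=df.index))


df = pd.DataFrame({"A": [1, 2], "B": [10, 20], "C": [100, 200]})
result = add_pairwise_sums(df)
print(result)
```

Because combinations works on whatever columns are present, the same call should cover datasets with a different number of columns.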

Related

Function in pandas to stack rows into columns by number of rows?

Suppose I have a heterogeneous dataframe:
    a   b   c   d
1   1   2   3   4
2   5   6   7   8
3   9  10  11  12
4  13  14  15  16
And I want to stack the rows like so:
          a          b          c          d
1  1,5,9,13  2,6,10,14  3,7,11,15  4,8,12,16
Etc...
All the references for groupby etc. seem to require some feature to group on; I just want to put x rows into columns, regardless of their content. Each row has a timestamp, and I am looking to group values by sample count, so I want 1 row with all the values of x sample rows as columns.
I should end up with a dataframe that has x * (original number of columns) columns and (original number of rows) / x rows.
I'm sure there must be some simple method I'm missing here that avoids a series of loops.

If you need to join all values into strings, use:
df1 = df.astype(str).agg(','.join).to_frame().T
print (df1)
          a          b          c          d
0  1,5,9,13  2,6,10,14  3,7,11,15  4,8,12,16
Or, if you need to create lists, use:
df2 = pd.DataFrame([[list(df[x]) for x in df]], columns=df.columns)
print (df2)
               a               b                c                d
0  [1, 5, 9, 13]  [2, 6, 10, 14]  [3, 7, 11, 15]  [4, 8, 12, 16]
If you need scalars with a MultiIndex (generated from the index and column labels), use:
df3 = df.unstack().to_frame().T
print (df3)
   a            b             c             d
   1  2  3   4  1  2   3   4  1  2   3   4  1  2   3   4
0  1  5  9  13  2  6  10  14  3  7  11  15  4  8  12  16
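
The snippets above collapse the whole frame into one row. For the "x rows per output row" case described in the question, a NumPy reshape is one possible sketch. The function name stack_rows and the <column>_<position> labels are made up here, and the sketch assumes the number of rows is an exact multiple of x.

```python
import numpy as np
import pandas as pd


def stack_rows(df: pd.DataFrame, x: int) -> pd.DataFrame:
    """Pack every x consecutive rows of df into a single wide row.

    Hypothetical helper, not part of pandas.
    """
    if len(df) % x != 0:
        raise ValueError("number of rows must be a multiple of x")
    # Row-major reshape: each output row holds x original rows back to back.
    values = df.to_numpy().reshape(len(df) // x, x * df.shape[1])
    # Label each new column as <original column>_<position within the block>.
    columns = [f"{col}_{i}" for i in range(x) for col in df.columns]
    return pd.DataFrame(values, columns=columns)


df = pd.DataFrame(np.arange(1, 17).reshape(4, 4), columns=list("abcd"))
wide = stack_rows(df, 2)
print(wide)
```

With 4 rows, 4 columns and x = 2 this gives 2 rows and 8 columns, matching the shape the question asks for. The original index (the timestamps) is dropped in this sketch.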

Move specific columns to the rightmost of the DataFrame

I want to shift some columns in the middle of the dataframe to the rightmost.
I could do this with individual column using code:
cols=list(df.columns.values)
cols.pop(cols.index('one_column'))
df=df[cols +['one_column']]
df
But it's inefficient to do this one column at a time when there are 100 columns in 2 series, i.e. series1_1 ... series1_50 and series2_1 ... series2_50, in the middle of the dataframe.
How can I do it by assigning the 2 series as lists, popping them and putting them back? Maybe something like
cols=list(df.columns.values)
series1 = list(df.loc['series1_1':'series1_50'])
series2 = list(df.loc['series2_1':'series2_50'])
cols.pop('series1', 'series2')
df=df[cols +['series1', 'series2']]
but this didn't work. Thanks

If you just want to shift the columns, you could call concat like this:
cols_to_shift = ['colA', 'colB']
pd.concat([
    df[df.columns.difference(cols_to_shift)],
    df[cols_to_shift]
], axis=1)
Or, you could do a little list manipulation on the columns.
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
Minimal Example
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 10, (3, 5)), columns=list('ABCDE'))
df
   A  B  C  D  E
0  6  1  4  4  8
1  4  6  3  5  8
2  7  9  9  2  7
cols_to_shift = ['B', 'C']
pd.concat([
    df[df.columns.difference(cols_to_shift)],
    df[cols_to_shift]
], axis=1)
   A  D  E  B  C
0  6  4  8  1  4
1  4  5  8  6  3
2  7  2  7  9  9
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
   A  D  E  B  C
0  6  4  8  1  4
1  4  5  8  6  3
2  7  2  7  9  9
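
One thing worth checking with the concat version: Index.difference sorts its result by default, so the remaining columns can come out reordered when they are not already in sorted order (in the example above A, D, E happen to be sorted). A small sketch with made-up column names, using sort=False to keep the original order:

```python
import pandas as pd

# Made-up frame whose remaining columns ("z", "a") are not in sorted order.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["z", "b", "c", "a"])
cols_to_shift = ["b", "c"]

sorted_rest = df.columns.difference(cols_to_shift)                # default: sorted labels
ordered_rest = df.columns.difference(cols_to_shift, sort=False)   # keeps original order

print(list(sorted_rest))
print(list(ordered_rest))

shifted = pd.concat([df[ordered_rest], df[cols_to_shift]], axis=1)
print(shifted.columns.tolist())
```

The list-comprehension variant does not have this issue, since it walks df.columns in order.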

I think list.pop only takes the index of an element in the list.
You should use list.remove instead.
cols = df.columns.tolist()
for s in ('series1', 'series2'):
    cols.remove(s)
df = df[cols + ['series1', 'series2']]
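
If the columns are really named series1_1 ... series1_50 and series2_1 ... series2_50 as in the question, selecting them by prefix may be simpler than removing them one by one. A minimal sketch, assuming those prefixes and a small made-up frame (the id and value columns are invented for the example):

```python
import pandas as pd

# Made-up frame: two groups of "series" columns sit between id and value.
df = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5]],
    columns=["id", "series1_1", "series1_2", "series2_1", "series2_2", "value"],
)

prefixes = ("series1_", "series2_")  # str.startswith accepts a tuple of prefixes
cols_to_shift = [c for c in df.columns if c.startswith(prefixes)]
cols_to_keep = [c for c in df.columns if not c.startswith(prefixes)]

# Reorder: everything else first, then both series groups at the right.
df = df[cols_to_keep + cols_to_shift]
print(df.columns.tolist())
```

This assumes every column label is a string; mixed label types would need a str(c) conversion first.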

Randomly choose two values without repetition in dataframe

Consider a dataframe df with N columns and M rows:
>>> df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
>>> df
   a  b  c  d  e
0  4  4  5  5  7
1  9  3  8  8  1
2  2  8  1  8  5
3  9  5  1  2  7
4  3  5  8  2  3
5  2  8  8  2  8
6  3  1  7  2  6
7  4  1  5  6  3
8  5  4  4  9  5
9  3  7  5  6  6
I want to randomly choose two columns and then randomly choose one particular row (this would give me two values of the same row). I can achieve this using
>>> df.sample(2, axis=1).sample(1, axis=0)
   e  a
1  3  5
I want to perform this K times like below:
>>> for i in range(5):
...     df.sample(2, axis=1).sample(1, axis=0)
...
   e  a
1  3  5
   d  b
2  1  9
   e  b
4  8  9
   c  b
0  6  5
   e  c
1  3  5
I want to ensure that I do not choose the same two values (by choosing the same two columns and same row) in any of the trials. How would I achieve this?
I want to then perform a bitwise XOR operation on the two chosen values in each trial as well. For example, 3 ^ 5, 1 ^ 9 , .. and count all the bit differences in the chosen values.

You can build a list of every (index label, pair of columns) tuple and then take random selections from it without replacement.
Sample Data
import pandas as pd
import numpy as np
from itertools import combinations, product
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
#df = df.reset_index() #if index contains duplicates
Code
K = 5
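# dtype=object is needed on recent NumPy because each entry mixes a scalar with a tuple
choices = np.array(list(product(df.index, combinations(df.columns, 2))), dtype=object)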
idx = choices[np.r_[np.random.choice(len(choices), K, replace=False)]]
#array([[9, ('a', 'e')],
# [2, ('a', 'e')],
# [1, ('a', 'c')],
# [3, ('b', 'e')],
# [8, ('d', 'e')]], dtype=object)
Then you can decide how exactly you want your output, but something like this is close to what you show:
pd.concat([df.loc[myid[0], list(myid[1])].reset_index().T for myid in idx])
# 0 1
#index a e
#9 4 8
#index a e
#2 1 1
#index a c
#1 7 1
#index b e
#3 2 3
#index d e
#8 5 7
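
The XOR part of the question could be handled along these lines. This is a sketch rather than a drop-in answer: it swaps in random.sample on a plain list instead of the NumPy indexing above, uses its own random data, and assumes "bit differences" means the number of 1 bits in x ^ y.

```python
import random
from itertools import combinations, product

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df = pd.DataFrame(rng.integers(1, 10, (10, 5)), columns=list("abcde"))

K = 5
# Every (row label, (col1, col2)) candidate: 10 rows * 10 column pairs = 100.
choices = list(product(df.index, combinations(df.columns, 2)))
random.seed(0)
picked = random.sample(choices, K)  # sampling without replacement

total_bit_diff = 0
for row, (c1, c2) in picked:
    x, y = int(df.at[row, c1]), int(df.at[row, c2])
    bit_diff = bin(x ^ y).count("1")  # number of bit positions where x and y differ
    total_bit_diff += bit_diff
    print(row, (c1, c2), x, y, bit_diff)

print("total differing bits:", total_bit_diff)
```

Because the candidates are sampled without replacement, no (row, column pair) can appear twice, though two different picks may still happen to contain equal values.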

How to return a dataframe value from row and column reference?

I know this is probably a basic question, but somehow I can't find the answer. I was wondering how it's possible to return a value from a dataframe if I know the row and column to look for? E.g. If I have a dataframe with columns 1-4 and rows A-D, how would I return the value for B4?

You can use loc for this (ix, which older answers recommend, was removed in pandas 1.0):
In [236]:
df = pd.DataFrame(np.random.randn(4,4), index=list('ABCD'), columns=[1,2,3,4])
df
Out[236]:
          1         2         3         4
A  1.682851  0.889752 -0.406603 -0.627984
B  0.948240 -1.959154 -0.866491 -1.212045
C -0.970505  0.510938 -0.261347 -1.575971
D -0.847320 -0.050969 -0.388632 -1.033542
In [237]:
df.loc['B', 4]
Out[237]:
-1.2120448782618383

Use at, if rows are A-D and columns 1-4:
print (df.at['B', 4])
If rows are 1-4 and columns A-D:
print (df.at[4, 'B'])
at is intended for fast scalar value getting and setting.
Sample:
df = pd.DataFrame(np.arange(16).reshape(4,4),index=list('ABCD'), columns=[1,2,3,4])
print (df)
    1   2   3   4
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
print (df.at['B', 4])
7
df = pd.DataFrame(np.arange(16).reshape(4,4),index=[1,2,3,4], columns=list('ABCD'))
print (df)
    A   B   C   D
1   0   1   2   3
2   4   5   6   7
3   8   9  10  11
4  12  13  14  15
print (df.at[4, 'B'])
13
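
For comparison, here is a short sketch of the label-based and position-based accessors side by side, reusing the first sample frame above. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=list("ABCD"), columns=[1, 2, 3, 4])

by_at = df.at["B", 4]    # label-based, single scalar only
by_loc = df.loc["B", 4]  # label-based, also accepts slices and lists
by_iat = df.iat[1, 3]    # position-based: second row, fourth column

print(by_at, by_loc, by_iat)

df.at["B", 4] = 70       # at can also set a single value in place
print(df.at["B", 4])
```

All three reads return the same cell here; which one fits depends on whether you know the labels or the positions.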

Python Pandas add column with relative order numbers

How do I add an order number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])
def add_stats(row):
    row['sum'] = sum([row['a'], row['b'], row['c']])
    row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
    row['max'] = max(row['a'], row['b'], row['c'])
    return row
frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
    a  b  c  sum  sum_sq  max
0   1  4  2    7      21    4
1   8  9  2   19     149    9
2  10  2  1   13     105   10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these 3 columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).

I think you're looking for rank where you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
    a  b  c  sum  sum_sq  max  sum_order  sum_sq_order  max_order  mean_order
0   1  4  2    7      21    4          1             1          1    1.000000
1   8  9  2   19     149    9          3             3          2    2.666667
2  10  2  1   13     105   10          2             2          3    2.333333
The rank method also has options, for example to specify how ties and NA values are handled.
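
As a rough illustration of those options, here is a sketch on a made-up Series containing a tie and a missing value. The method and na_option parameters are part of Series.rank; the data and variable names are invented for the example.

```python
import pandas as pd

s = pd.Series([7, 19, 13, 13, None])

average = s.rank()                    # default: tied values share the mean of their ranks
lowest = s.rank(method="min")         # tied values share the lowest rank of the group
dense = s.rank(method="dense")        # like "min", but ranks increase by 1 between groups
na_last = s.rank(na_option="bottom")  # NaN is ranked last instead of staying NaN

print(pd.DataFrame({
    "value": s,
    "average": average,
    "min": lowest,
    "dense": dense,
    "na_bottom": na_last,
}))
```

Which method is appropriate depends on how ties should affect the mean of the order numbers computed afterwards.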
