Python Pandas add column with relative order numbers - python

How do I add an order number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])
def add_stats(row):
    row['sum'] = sum([row['a'], row['b'], row['c']])
    row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
    row['max'] = max(row['a'], row['b'], row['c'])
    return row
frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
    a  b  c  sum  sum_sq  max
0   1  4  2    7      21    4
1   8  9  2   19     149    9
2  10  2  1   13     105   10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these 3 columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).

I think you're looking for rank where you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
    a  b  c  sum  sum_sq  max  sum_order  sum_sq_order  max_order  mean_order
0   1  4  2    7      21    4          1             1          1    1.000000
1   8  9  2   19     149    9          3             3          2    2.666667
2  10  2  1   13     105   10          2             2          3    2.333333
The rank method also takes options, for example method to control how identical values (ties) are ranked and na_option to control where NA values are placed.
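To illustrate those options, here is a small sketch on a made-up Series that contains a tie and a missing value. The data and variable names are illustrative and do not come from the question; method, ascending and na_option are parameters of Series.rank.

```python
import pandas as pd

# Illustrative data: two equal values (19) and one missing value
s = pd.Series([7, 19, 13, 19, None], name="sum")

# Default: ties share the average of the ranks they span, NaN stays NaN
avg_rank = s.rank()                                   # 1.0, 3.5, 2.0, 3.5, NaN

# method="min": ties all get the lowest rank of the group
min_rank = s.rank(method="min")                       # 1.0, 3.0, 2.0, 3.0, NaN

# method="dense" with ascending=False: largest value gets rank 1, no gaps
dense_desc = s.rank(method="dense", ascending=False)  # 3.0, 1.0, 2.0, 1.0, NaN

# na_option="bottom": missing values are ranked last instead of left as NaN
na_last = s.rank(na_option="bottom")                  # 1.0, 3.5, 2.0, 3.5, 5.0

print(pd.DataFrame({"avg": avg_rank, "min": min_rank,
                    "dense_desc": dense_desc, "na_last": na_last}))
```

Which tie-breaking rule is appropriate depends on how the mean of the order numbers should treat equal values; the default average keeps the sum of ranks constant.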

Related

Replace specific values in a data frame with column mean

I have a DataFrame and I want to replace every 7 with the rounded mean of its column, where the 7s in that column are not counted towards the mean. Here is a simple example:
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3]
df['b'] =[3, 0, -1]
df['c'] = [4, 7, 6]
df['d'] = [7, 7, 6]
   a  b  c  d
0  1  3  4  7
1  2  0  7  7
2  3 -1  6  6
And here is the output I want:
   a  b  c  d
0  1  3  4  2
1  2  0  3  2
2  3 -1  6  6
For example, in row 1 the 7 in column c becomes 3, because the mean of column c is 3.33, which rounds to 3. In column d the replacement is 2, since the other 7 in that column is not counted either.
Can you please help me with that?
Here is one way to do it:
import numpy as np
# replace 7 with np.nan
df.replace(7, np.nan, inplace=True)
# fill each NaN with its column mean, computed with NaN counted as 0
(df.fillna(df.apply(lambda x: x.replace(np.nan, 0)
                              .mean(skipna=False)))
   .round(0)
   .astype(int))
   a  b  c  d
0  1  3  4  2
1  2  0  3  2
2  3 -1  6  6
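Alternatively, starting from the original df, compute the column means with 7 counted as 0 and substitute them (astype(int) truncates rather than rounds, which gives the same result for this data):
temp = df.replace(to_replace=7, value=0)
df.replace(to_replace=7, value=temp.mean().astype(int), inplace=True)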
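The same idea can be wrapped in a small helper that leaves the input untouched. This is only a sketch: the function name replace_with_column_mean is made up for illustration, and it assumes, as in the expected output above, that the target value counts as 0 in the mean while the divisor stays the full row count.

```python
import pandas as pd


def replace_with_column_mean(df, target):
    """Return a copy of df where `target` is replaced by the rounded column mean.

    The mean is computed with every `target` counted as 0 (assumption taken
    from the expected output in the question).
    """
    is_target = df.eq(target)                       # boolean mask of cells to replace
    col_means = df.mask(is_target, 0).mean().round()
    # mask() without a replacement value writes NaN, fillna() then fills per column
    return df.mask(is_target).fillna(col_means).astype(int)


df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [3, 0, -1],
                   'c': [4, 7, 6],
                   'd': [7, 7, 6]})
result = replace_with_column_mean(df, 7)
print(result)
```

Because nothing is modified in place, the original df still contains the 7s afterwards, which makes it easier to compare the two frames.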

Apply different mathematical function in table in Python

I have a table with two columns, Column A and Column B, containing numeric values.
Now I want to apply an arithmetic operation to each row and store the result in a new column, but with a different operator on each row:
A+B for first row
A-B for second row
A*B for third row
A/B for fourth row
and so on up to the nth row, repeating the same sequence of operations.
Can someone please help me with this in Python?
We can use:
row.name to access the index label of a row when using apply with axis=1
a dictionary to map index values to operations
Code
import operator as _operator
import pandas as pd
# Data
d = {"A": [5, 6, 7, 8, 9, 10, 11],
     "B": [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(d)
print(df)
# Mapping from index to mathematical operation
operator_map = {
    0: _operator.add,
    1: _operator.sub,
    2: _operator.mul,
    3: _operator.truediv,
}
# row.name % 4 cycles through the four operators (assumes the default 0..n-1 index)
df['new'] = df.apply(lambda row: operator_map[row.name % 4](*row), axis=1)
Output
Initial df
    A  B
0   5  1
1   6  2
2   7  3
3   8  4
4   9  5
5  10  6
6  11  7
New df
    A  B   new
0   5  1   6.0
1   6  2   4.0
2   7  3  21.0
3   8  4   2.0
4   9  5  14.0
5  10  6   4.0
6  11  7  77.0
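IIUC, you can try DataFrame.apply on rows with the operator module. Taking row.name modulo the number of operators repeats the cycle when there are more than four rows. With a four-row df that holds only columns A and B:
import operator
operators = [operator.add, operator.sub, operator.mul, operator.truediv]
df['C'] = df.apply(lambda row: operators[row.name % len(operators)](*row), axis=1)
print(df)
   A  B     C
0  5  1   6.0
1  6  2   4.0
2  7  3  21.0
3  8  4   2.0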
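For larger frames, a column-wise variant avoids calling a Python function per row. The following is a sketch under the assumption that the operator should depend on the row position (0, 1, 2, ...) rather than on the index label; the column name result is made up for illustration. It uses numpy.select to pick one of four precomputed columns per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 7, 8, 9, 10, 11],
                   "B": [1, 2, 3, 4, 5, 6, 7]})

# Position of each row within the repeating cycle of four operations
pos = np.arange(len(df)) % 4

# np.select takes, for each row, the choice whose condition is True
df["result"] = np.select(
    [pos == 0, pos == 1, pos == 2, pos == 3],
    [df["A"] + df["B"],   # rows 0, 4, 8, ...
     df["A"] - df["B"],   # rows 1, 5, 9, ...
     df["A"] * df["B"],   # rows 2, 6, 10, ...
     df["A"] / df["B"]],  # rows 3, 7, 11, ...
)
print(df)
```

Note that all four operations are evaluated for every row before the selection happens, so this trades some extra arithmetic for the absence of a Python-level loop.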

Function in pandas to stack rows into columns by number of rows?

Suppose I have a heterogeneous DataFrame:
    a   b   c   d
1   1   2   3   4
2   5   6   7   8
3   9  10  11  12
4  13  14  15  16
And I want to stack the rows like so:
          a          b          c          d
1  1,5,9,13  2,6,10,14  3,7,11,15  4,8,12,16
Etc...
All the references for groupby etc. seem to require some feature to group on; I just want to put x rows into columns, regardless of their content. Each row has a timestamp, and I am looking to group values by sample count, so I want 1 row with all the values of x sample rows as columns.
I should end up with a DataFrame that has x * (original number of columns) columns and (original number of rows) / x rows.
I'm sure there must be some simple method I'm missing here that avoids a series of loops etc.
If you need to join all values into strings, use:
df1 = df.astype(str).agg(','.join).to_frame().T
print (df1)
          a          b          c          d
0  1,5,9,13  2,6,10,14  3,7,11,15  4,8,12,16
Or, if you need lists, use:
df2 = pd.DataFrame([[list(df[x]) for x in df]], columns=df.columns)
print (df2)
               a               b               c               d
0  [1, 5, 9, 13]  [2, 6, 10, 14]  [3, 7, 11, 15]  [4, 8, 12, 16]
If you need scalars with a MultiIndex (generated from the index and column labels), use:
df3 = df.unstack().to_frame().T
print (df3)
   a            b             c             d
   1  2  3   4  1  2   3   4  1  2   3   4  1  2   3   4
0  1  5  9  13  2  6  10  14  3  7  11  15  4  8  12  16
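The snippets above collapse all rows into one. For the case described in the question, where every x consecutive rows should become one wide row, one possible sketch is to label each row with a block number and a position inside the block, then unstack the position. The names x, block, pos and wide are illustrative, and the sketch assumes the number of rows is a multiple of x (otherwise the last block is padded with NaN).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4),
                  columns=list("abcd"), index=[1, 2, 3, 4])

x = 2                                # number of consecutive rows per output row
block = np.arange(len(df)) // x      # 0, 0, 1, 1 -> which output row
pos = np.arange(len(df)) % x         # 0, 1, 0, 1 -> position inside the block

# Index by (block, pos), then move pos into the columns
wide = df.set_index([block, pos]).unstack()
print(wide)
```

The result has len(df) / x rows and x times the original number of columns, with a two-level column index of (original column, position in block). The levels can be flattened afterwards if single-level column names are preferred.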

How can I extract a column from dataframe and attach it to rows while keeping other columns intact

How can I extract a column from a pandas DataFrame and append it as rows while keeping the other columns the same?
This is my example dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': np.arange(0,5),
                   'sample_1' : [5,6,7,8,9],
                   'sample_2' : [10,11,12,13,14],
                   'group_id' : ["A","B","C","D","E"]})
The output I'm looking for is:
df2 = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
                    'sample_1' : [5,6,7,8,9,10,11,12,13,14],
                    'group_id' : ["A","B","C","D","E","A","B","C","D","E"]})
I have tried to slice the dataframe and concat using pd.concat but it was giving NaN values.
My original dataset is large.
You could do this using stack: set the index to the columns you don't want to modify, call stack, sort by value, then reset the index. Sorting by value gives the desired order here only because every sample_1 value is smaller than every sample_2 value:
df.set_index(['ID','group_id']).stack().sort_values().reset_index([0,1]).reset_index(drop=True)
   ID group_id   0
0   0        A   5
1   1        B   6
2   2        C   7
3   3        D   8
4   4        E   9
5   0        A  10
6   1        B  11
7   2        C  12
8   3        D  13
9   4        E  14
Using pd.wide_to_long:
res = pd.wide_to_long(df, stubnames='sample_', i='ID', j='group_id')
res.index = res.index.droplevel(1)
res = res.rename(columns={'sample_': 'sample_1'}).reset_index()
print(res)
   ID group_id  sample_1
0   0        A         5
1   1        B         6
2   2        C         7
3   3        D         8
4   4        E         9
5   0        A        10
6   1        B        11
7   2        C        12
8   3        D        13
9   4        E        14
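The function you are looking for is called melt. Some pandas versions raise a ValueError when value_name matches an existing column, so it is safer to melt into a temporary name and rename afterwards.
For example:
df2 = pd.melt(df, id_vars=['ID', 'group_id'], value_vars=['sample_1', 'sample_2'], value_name='sample')
df2 = df2.drop('variable', axis=1).rename(columns={'sample': 'sample_1'})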
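As for the pd.concat attempt mentioned in the question: concat aligns the pieces on column names, so slices whose value columns are named differently (sample_1 and sample_2) end up in separate columns padded with NaN. That is probably what happened, although the original attempt is not shown. A sketch that renames each slice before concatenating (sample_cols and pieces are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.arange(0, 5),
                   'sample_1': [5, 6, 7, 8, 9],
                   'sample_2': [10, 11, 12, 13, 14],
                   'group_id': ["A", "B", "C", "D", "E"]})

sample_cols = ['sample_1', 'sample_2']

# One slice per sample column, each renamed to the same target name
pieces = [
    df[['ID', col, 'group_id']].rename(columns={col: 'sample_1'})
    for col in sample_cols
]

# ignore_index=True renumbers the rows 0..9 instead of repeating 0..4
df2 = pd.concat(pieces, ignore_index=True)
print(df2)
```

Unlike the sort-based approach, the row order here follows the order of sample_cols and does not depend on the values themselves.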

How to return a dataframe value from row and column reference?

I know this is probably a basic question, but somehow I can't find the answer. I was wondering how it's possible to return a value from a dataframe if I know the row and column to look for? E.g. If I have a dataframe with columns 1-4 and rows A-D, how would I return the value for B4?
You can use loc for this (older answers use ix, which has since been removed from pandas):
In [236]:
df = pd.DataFrame(np.random.randn(4,4), index=list('ABCD'), columns=[1,2,3,4])
df
Out[236]:
          1         2         3         4
A  1.682851  0.889752 -0.406603 -0.627984
B  0.948240 -1.959154 -0.866491 -1.212045
C -0.970505  0.510938 -0.261347 -1.575971
D -0.847320 -0.050969 -0.388632 -1.033542
In [237]:
df.loc['B',4]
Out[237]:
-1.2120448782618383
Use at if the rows are A-D and the columns are 1-4:
print (df.at['B', 4])
If the rows are 1-4 and the columns are A-D:
print (df.at[4, 'B'])
The at accessor is intended for fast scalar value getting and setting.
Sample:
df = pd.DataFrame(np.arange(16).reshape(4,4),index=list('ABCD'), columns=[1,2,3,4])
print (df)
    1   2   3   4
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
print (df.at['B', 4])
7
df = pd.DataFrame(np.arange(16).reshape(4,4),index=[1,2,3,4], columns=list('ABCD'))
print (df)
    A   B   C   D
1   0   1   2   3
2   4   5   6   7
3   8   9  10  11
4  12  13  14  15
print (df.at[4, 'B'])
13
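To put the accessors side by side, here is a short sketch on the same sample frame. loc and at look values up by label, while iloc and iat use integer positions; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=list('ABCD'), columns=[1, 2, 3, 4])

by_label = df.loc['B', 4]       # label-based, also accepts slices and lists
by_label_fast = df.at['B', 4]   # label-based, single scalar only
by_position = df.iat[1, 3]      # position-based: second row, fourth column

print(by_label, by_label_fast, by_position)  # 7 7 7

# at can also set a single value in place
df.at['B', 4] = 70
print(df.loc['B', 4])  # 70
```

For a single cell, at and iat are the narrower tools; loc and iloc are the more general ones when a row, a column, or a slice is needed.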
