Computed/Reactive column in Pandas? - python

I would like to emulate an Excel formula in Pandas. I've tried this:
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
df['c'] = lambda x : df.a + df.b + 1 # Displays <function <lambda> ..> instead of the result
df['d'] = df.a + df.b + 1 # Static computation
df.a *= 2
df # Result of column c and d not updated :(
a b c d
0 6 5 <function <lambda> at 0x7f2354ddcca0> 9
1 4 3 <function <lambda> at 0x7f2354ddcca0> 6
2 2 2 <function <lambda> at 0x7f2354ddcca0> 4
3 0 1 <function <lambda> at 0x7f2354ddcca0> 2
What I expect is:
df
a b c
0 6 5 12
1 4 3 8
2 2 2 5
3 0 1 2
df.a /= 2
a b c
0 3 5 9
1 2 3 6
2 1 2 4
3 0 1 2
Is it possible to have a dynamically computed column in Pandas?

Maybe this code will give you a step in the right direction:
import pandas as pd
c_list = []
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
c_list2 = list(map(lambda x: x + df.b + 1, list(df.a)))
for i in range(4):
    c_list.append(pd.DataFrame(c_list2[i])["b"][i])
df['c'] = c_list
df['d'] = df.a + df.b # Static computation
df.a *= 2
df

Reactivity between columns in a DataFrame does not seem practically feasible. My cellopype package does give you Excel-like reactivity between DataFrames. Here's my take on your question:
pip install cellopype
import pandas as pd
from cellopype import Cell
# define source df and wrap it in a Cell:
df_ab = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
cell_ab = Cell(recalc=lambda: df_ab.copy())
# define the dependent/reactive Cell (with a single column 'c')
cell_c = Cell(
    recalc=lambda df: pd.DataFrame(df.a + df.b, columns=['c']),
    sources=[cell_ab]
)
# and get its value
print(cell_c.value)
c
0 8
1 5
2 3
3 1
# change source df and recalc its Cell...
df_ab.loc[0,'a']=100
cell_ab.recalc()
# cell_c has been updated in response
print(cell_c.value)
c
0 105
1 5
2 3
3 1
Also see my response to this question.
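If you don't need a full reactive framework, a plain-pandas workaround is to recompute the derived column on demand through a small helper function (a minimal sketch; with_computed is a made-up name, not part of any library):

```python
import pandas as pd

def with_computed(df):
    """Return a copy of df with the derived column 'c' recomputed."""
    out = df.copy()
    out['c'] = out.a + out.b + 1
    return out

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
print(with_computed(df)['c'].tolist())  # [9, 6, 4, 2]
df.a *= 2
print(with_computed(df)['c'].tolist())  # [12, 8, 5, 2]
```

Calling the helper each time you read 'c' gives the Excel-like behavior at the cost of an explicit recompute instead of automatic propagation.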

Related

Find difference in two different data-frames

I have two data frames: df1 is 26000 rows, df2 is 25000 rows.
I'm trying to find data points that are in df1 but not in df2, and vice versa.
This is what I wrote (code below), but when I cross-check it, it still shows me shared data points:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df_join = pd.concat([df1, df2], axis=1).drop_duplicates(keep=False)
only_df1 = df_join.loc[df_join[df2.columns.to_list()].isnull().all(axis = 1), df1.columns.to_list()]
Order doesn't matter; I just want to know whether a data point exists in one data frame or the other.
With two dfs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})
print(df1)
print(df2)
a b
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
a b
0 2 1
1 3 1
2 4 1
3 5 1
4 6 1
You could do:
df_differences = df1.merge(df2, how='outer', indicator=True)
print(df_differences)
Result:
a b _merge
0 1 1 left_only
1 2 1 both
2 3 1 both
3 4 1 both
4 5 1 both
5 6 1 right_only
And then:
only_df1 = df_differences[df_differences['_merge'].eq('left_only')].drop(columns=['_merge'])
only_df2 = df_differences[df_differences['_merge'].eq('right_only')].drop(columns=['_merge'])
print(only_df1)
print()
print(only_df2)
a b
0 1 1
a b
5 6 1
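For completeness, the asker's concat idea also works once the frames are stacked vertically (the default axis=0) instead of side by side: with keep=False, drop_duplicates removes every row that occurs in both frames, leaving the symmetric difference (a sketch that assumes rows are unique within each frame):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})

# Stack the frames, then drop all copies of any row that appears twice.
sym_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(sym_diff)
#    a  b
# 0  1  1
# 4  6  1
```

Unlike the merge approach, this one-liner doesn't tell you which frame each leftover row came from.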

Apply function rowwise to pandas dataframe while referencing a column

I have a pandas dataframe like this:
df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
A B C D total
0 2 1 0 1 4
1 3 2 1 0 6
I'm trying to perform a rowwise calculation and create a new column with the result. The calculation is to divide each column ABCD by the total, square it, and sum it up rowwise. This should be the result (0 if total is 0):
A B C D total result
0 2 1 0 1 4 0.375
1 3 2 1 0 6 0.389
This is what I've tried so far, but it always returns 0:
df['result'] = df[['A', 'B', 'C', 'D']].apply(lambda x: ((x/df['total'])**2).sum(), axis=1)
I guess the problem is df['total'] in the lambda function, because if I replace this by a number it works fine. I don't know how to work around this though. Appreciate any suggestions.
A combination of div, pow and sum can solve this:
df["result"] = df.drop(columns="total").div(df.total, axis=0).pow(2).sum(axis=1)
df
A B C D total result
0 2 1 0 1 4 0.375000
1 3 2 1 0 6 0.388889
You could also do:
df['result'] = (df.loc[:, "A": 'D'].divide(df.total, axis=0) ** 2).sum(axis=1)
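If you prefer to stay with apply, the fix is to work on whole rows with axis=1 so the division does not align against df['total'] by column label; this also covers the "0 if total is 0" case from the question, which the vectorized versions above would turn into inf (a sketch, with row_score as an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1],
                   'D': [1, 0], 'total': [4, 6]})

def row_score(row):
    # Guard the zero-total case explicitly, then square and sum the shares.
    if row['total'] == 0:
        return 0.0
    return ((row[['A', 'B', 'C', 'D']] / row['total']) ** 2).sum()

df['result'] = df.apply(row_score, axis=1)
print(df['result'].tolist())  # [0.375, 0.3888...]
```

The original attempt returned 0 because inside apply each row x is indexed by 'A'-'D' while df['total'] is indexed 0, 1, so the division produced all NaN and sum() of NaNs is 0.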

How to name a dataframe column filled by numpy array?

I am filling a DataFrame by transposing some numpy arrays:
for symbol in syms[:5]:
    price_p = Share(symbol)
    closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    dump = np.array(closes_p)
    na_price_ar.append(dump)
    print symbol
df = pd.DataFrame(na_price_ar).transpose()
df, the DataFrame, is filled correctly; however, the column names are 0, 1, 2, ... I would like to rename them with the values of syms[:5]. I googled it and I found this:
for symbol in syms[:5]:
    df.rename(columns={'' + str(i) + '': symbol}, inplace=True)
    i = i + 1
But if I check the variable df, I still have the same column names.
Any ideas?
Instead of using a list of arrays and transposing, you could build the DataFrame from a dict whose keys are symbols and whose values are arrays of column values:
import numpy as np
import pandas as pd
np.random.seed(2016)
syms = 'abcde'
na_price_ar = {}
for symbol in syms[:5]:
    # price_p = Share(symbol)
    # closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    # dump = np.array(closes_p)
    dump = np.random.randint(10, size=3)
    na_price_ar[symbol] = dump
    print(symbol)
df = pd.DataFrame(na_price_ar)
print(df)
yields
a b c d e
0 3 3 8 2 4
1 7 8 7 6 1
2 2 4 9 3 9
You can use:
na_price_ar = [['A','B','C'],[0,2,3],[1,2,4],[5,2,3],[8,2,3]]
syms = ['q','w','e','r','t','y','u']
df = pd.DataFrame(na_price_ar, index=syms[:5]).transpose()
print (df)
q w e r t
0 A 0 1 5 8
1 B 2 2 2 2
2 C 3 4 3 3
You may use df.columns[number] as the dictionary key in the .rename() method:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1], 'd': [4, 1, 3, 1], 'e': [5, 2, 6, 0]}
df = pd.DataFrame(dic)
number = 0
for symbol in syms[:5]:
    df.rename(columns={df.columns[number]: symbol}, inplace=True)
    number = number + 1
and the result is
i f g h i
0 4 4 5 4 5
1 1 2 7 1 2
2 3 1 9 3 6
3 1 4 1 1 0
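The simplest fix for the original loop, though, is to skip rename entirely and assign all the column labels in one go (a sketch with stand-in data, since Share/get_historical is not reproducible here):

```python
import numpy as np
import pandas as pd

syms = ['q', 'w', 'e', 'r', 't', 'y', 'u']   # placeholder symbols
na_price_ar = [np.arange(3) + i for i in range(5)]  # stand-in price arrays

df = pd.DataFrame(na_price_ar).transpose()
df.columns = syms[:5]            # label all five columns at once
print(df.columns.tolist())       # ['q', 'w', 'e', 'r', 't']
```

Assigning df.columns replaces every label positionally, so there is no need to match the old 0, 1, 2, ... names by string.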

Pandas tuples groupby aggregation

Certain columns of my data frame contain tuples. Whenever I do an aggregation via groupby, those columns do not appear in the resulting data frame unless explicitly specified.
Example:
df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = map(lambda s: (s,), df['B'])
print df
A B C
0 1 1 (1,)
1 2 2 (2,)
2 1 3 (3,)
3 2 4 (4,)
If I do it the following way, the column C does not appear in the aggregation:
print df.groupby('A').sum()
B
A
1 4
2 6
But if I specify it explicitly, it appears as expected:
print df[['A', 'C']].groupby('A').sum()
C
A
1 (1, 3)
2 (2, 4)
Could you please tell me why the C column didn't appear in the first case?
I would like it to go by default.
Because the sum is applied to column B, not column C: C holds tuples (object dtype), so groupby silently drops it as a "nuisance" column unless you select it explicitly:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = map(lambda s: (s,), df['B'])
print df
df.at[0,'B'] = 10
print df
A B C
0 1 10 (1,)
1 2 2 (2,)
2 1 3 (3,)
3 2 4 (4,)
print df.groupby('A').sum()
B
A
1 13
2 6
print df.groupby('A')['B'].sum()
1 13
2 6
Name: B, dtype: int64
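A side note for Python 3, where map returns a lazy iterator instead of a list: wrap it in list() before assigning, and select the columns to aggregate explicitly rather than relying on which columns groupby keeps by default (a sketch; on recent pandas the tuple column is summed via a Python-level fallback, where + concatenates tuples):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [1, 2, 3, 4]})
df['C'] = list(map(lambda s: (s,), df['B']))  # list() is required on Python 3

# Ask for both columns explicitly; "summing" tuples concatenates them.
out = df.groupby('A')[['B', 'C']].sum()
print(out)
```

Explicit column selection makes the intent clear and is robust to pandas versions changing how non-numeric columns are treated.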

Change pivot table from Series to DataFrame

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'B': ['X', 'Y', 'Z'] * 3,
'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
>>> df
A B C
0 1 X 1
1 1 Y 2
2 1 Z 3
3 2 X 1
4 2 Y 2
5 2 Z 3
6 3 X 1
7 3 Y 2
8 3 Z 3
result = df.pivot_table(index=['B'], values='C', aggfunc=sum)
>>> result
B
X 3
Y 6
Z 9
Name: C, dtype: int64
How can I have the column name for C show up above the sums, and how can I sort result either ascending or descending? result is a Series, not a DataFrame, and seems non-sortable.
Python: 2.7.11 and Pandas: 0.17.1
You were very close. Note that the brackets around the values coerce the result into a dataframe instead of a series (i.e. values=['C'] instead of values='C').
result = df.pivot_table(index = ['B'], values=['C'], aggfunc=sum)
>>> result
C
B
X 3
Y 6
Z 9
As result is now a dataframe, you can use sort_values on it:
>>> result.sort_values('C', ascending=False)
C
B
Z 9
Y 6
X 3
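A groupby equivalent reaches the same sorted dataframe in one chain, and sidesteps the Series-vs-DataFrame question entirely (on recent pandas versions, pivot_table already returns a DataFrame here):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['X', 'Y', 'Z'] * 3,
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

# Sum C per B group, sort descending, and present as a DataFrame.
result = (df.groupby('B')['C'].sum()
            .sort_values(ascending=False)
            .to_frame())
print(result)
#    C
# B
# Z  9
# Y  6
# X  3
```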
