Applying an operation on multiple columns with a fixed column in pandas - python

I have a dataframe as shown below. The last column holds the sum of the values from all the other columns, i.e. A, B, D, K and T. Please note that some of the columns contain NaN as well.
word1,A,B,D,K,T,sum
na,,63.0,,,870.0,933.0
sva,,1.0,,3.0,695.0,699.0
a,,102.0,,1.0,493.0,596.0
sa,2.0,487.0,,2.0,15.0,506.0
su,1.0,44.0,,136.0,214.0,395.0
waw,1.0,9.0,,34.0,296.0,340.0
How can I calculate the entropy for each row? i.e. I should compute something like the following:
df['A']/df['sum']*log(df['A']/df['sum']) + df['B']/df['sum']*log(df['B']/df['sum']) + ...... + df['T']/df['sum']*log(df['T']/df['sum'])
The condition is that whenever the value inside the log is zero or NaN, the whole term should be treated as zero (otherwise it would produce an error, since log 0 is undefined).
I know how to apply a lambda to individual columns, but I cannot come up with a pure pandas solution in which the fixed column sum is applied across the different columns A, B, D, etc. I can only think of a simple loop over the CSV file with hard-coded column names.

I think you can use ix to select the columns from A to T, divide by the sum column with div, multiply by numpy.log of the same ratio, and finally call sum along the rows (in newer pandas, where ix has been removed, loc works the same way here):
print (df['A']/df['sum']*np.log(df['A']/df['sum']))
0 NaN
1 NaN
2 NaN
3 -0.021871
4 -0.015136
5 -0.017144
dtype: float64
print (df.ix[:,'A':'T'].div(df['sum'],axis=0)*np.log(df.ix[:,'A':'T'].div(df['sum'],axis=0)))
A B D K T
0 NaN -0.181996 NaN NaN -0.065191
1 NaN -0.009370 NaN -0.023395 -0.005706
2 NaN -0.302110 NaN -0.010722 -0.156942
3 -0.021871 -0.036835 NaN -0.021871 -0.104303
4 -0.015136 -0.244472 NaN -0.367107 -0.332057
5 -0.017144 -0.096134 NaN -0.230259 -0.120651
print((df.ix[:,'A':'T'].div(df['sum'],axis=0)*np.log(df.ix[:,'A':'T'].div(df['sum'],axis=0)))
.sum(axis=1))
0 -0.247187
1 -0.038471
2 -0.469774
3 -0.184881
4 -0.958774
5 -0.464188
dtype: float64
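For reference, the NaN handling the question asks for falls out of sum automatically: p*log(p) is NaN wherever p is 0 or NaN, and sum(axis=1) skips NaN by default, so those terms contribute zero. A minimal sketch wrapping this as a helper (the name row_entropy and the explicit column list are mine, not from the answer):
import numpy as np

def row_entropy(frame, value_cols, total_col='sum'):
    # normalise each count column by the row total
    p = frame[value_cols].div(frame[total_col], axis=0)
    # p * np.log(p) is NaN for p == 0 or NaN; sum(skipna=True) treats those terms as zero
    return (p * np.log(p)).sum(axis=1)

# entropy = row_entropy(df, list('ABDKT'))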

df1 = df.iloc[:, :-1]              # all columns except the precomputed 'sum'
df2 = df1.div(df1.sum(1), axis=0)  # normalise each row by its own total
df2.mul(np.log(df2)).sum(1)        # p * log(p), summed per row (NaN terms skipped)
word1
na -0.247187
sva -0.038471
a -0.469774
sa -0.184881
su -0.958774
waw -0.464188
dtype: float64
Setup
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd
import numpy as np
text = """word1,A,B,D,K,T,sum
na,,63.0,,,870.0,933.0
sva,,1.0,,3.0,695.0,699.0
a,,102.0,,1.0,493.0,596.0
sa,2.0,487.0,,2.0,15.0,506.0
su,1.0,44.0,,136.0,214.0,395.0
waw,1.0,9.0,,34.0,296.0,340.0"""
df = pd.read_csv(StringIO(text), index_col=0)
df

Related

Pandas Groupby with lambda gives some NANs

I have a DF where I'd like to create a new column with the difference of 2 other column values.
name rate avg_rate
A 10 3
B 6 5
C 4 3
I wrote this code to calculate the difference:
result= df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate)
df['rate_diff']=result.reset_index(drop=True)
df.tail(3)
But I notice that some of the calculated values are NaNs. What is the best way to handle this?
Output I am getting:
name rate avg_rate rate_diff
A 10 3 NAN
B 6 5 NAN
C 4 3 NAN
If you want to use groupby and apply, then the following should work:
res = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate).reset_index().set_index('level_1')
df = pd.merge(df,res,on=['name'],left_index = True, right_index=True).rename({0:'rate_diff'},axis=1)
However, as @sacuL suggested in the comments, you don't need groupby to calculate the difference: you get it by simply subtracting the columns (side by side), and a groupby-apply is overkill for this simple task.
df["rate_diff"] = df.rate - df.avg_rate

How to filter Pandas rows based on last/next row?

I have two data sets from different pulse oximeters, and plot them with pyplot as displayed below. As you may see, the green data set has a lot of outliers (vertical drops). In my work I've defined these outliers as non-valid for my statistical analysis; they are most certainly not real measurements. Therefore I argue that I can simply remove them.
The characteristic of these rogue values is that they are single (or at most two) value outliers (see df below). The "real" sample values are either the same as the previous value, or differ by ±1. In e.g. Java (pseudocode) I would do something like:
for(i; i <df.length; i++)
if (df[i+1|-1].spo2 - df[i].spo2 > 1|-1)
df[i].drop
What would be the pandas (numpy?) equivalent of what I'm trying to do, i.e. remove values that differ by more than 1 from the previous/next value?
df:
time, spo2
1900-01-01 18:18:41.194 98.0
1900-01-01 18:18:41.376 98.0
1900-01-01 18:18:41.559 78.0
1900-01-01 18:18:41.741 98.0
1900-01-01 18:18:41.923 98.0
1900-01-01 18:18:42.105 90.0
1900-01-01 18:18:42.288 97.0
1900-01-01 18:18:42.470 97.0
1900-01-01 18:18:42.652 98.0
Have a look at pandas.DataFrame.shift. It is a column-wise operation that shifts the values of a column up or down by a given number of rows, so each row can be compared with its neighbour in a new column:
# original df
x1
0 0
1 1
2 2
3 3
4 4
# shift down
df['x2'] = df.x1.shift(1)
x1 x2
0 0 NaN # Beware
1 1 0
2 2 1
3 3 2
4 4 3
# Shift up
df['x2'] = df.x1.shift(-1)
x1 x2
0 0 1
1 1 2
2 2 3
3 3 4
4 4 NaN # Beware
You can use this to put the spo2 value of timestamp n+1 next to the spo2 value of timestamp n in the same row, and then filter based on conditions comparing the two columns.
df['spo2_Next'] = df['spo2'].shift(-1)
# replace NaN to allow float comparison
df['spo2_Next'] = df['spo2_Next'].fillna(1)
# Apply your row-wise condition to create filter column
df.loc[((df.spo2_Next - df.spo2) > 1) | ((df.spo2_Next - df.spo2) < -1), 'Outlier'] = True
# filter
df_clean = df[df.Outlier != True]
# remove filter column
del df_clean['Outlier']
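An alternative, assuming the same spo2 column, is to compare each row with both neighbours via diff and keep only rows within 1 of each; this is a sketch, not part of the original answer:
prev_ok = df['spo2'].diff().abs().le(1) | df['spo2'].diff().isna()      # within 1 of the previous sample (first row kept)
next_ok = df['spo2'].diff(-1).abs().le(1) | df['spo2'].diff(-1).isna()  # within 1 of the next sample (last row kept)
df_clean = df[prev_ok & next_ok]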
When you filter a pandas dataframe like:
df[(df.column1 == 2) & (df.column2 < 3)], you are:
comparing each numeric series to a scalar value and generating a boolean series
combining the two boolean series with a logical AND
then using that boolean series to filter the data frame (rows where it is False are not included in the new data frame)
So you just need to create an iterative algorithm over the data frame to produce such a boolean array, and use it to filter the dataframe, as in:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df[ [True, False, True]]
You can also create a closure to filter the data frame (using df.apply), keeping previous observations inside the closure to detect abrupt changes, but that would be way too complicated. I would go for the straightforward imperative solution, sketched below.
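A rough sketch of that imperative approach, assuming the spo2 column from the question (the variable names are mine):
values = df['spo2'].tolist()
keep = []
for i, v in enumerate(values):
    # a row is kept if it is within 1 of both the previous and the next sample
    prev_ok = i == 0 or abs(v - values[i - 1]) <= 1
    next_ok = i == len(values) - 1 or abs(v - values[i + 1]) <= 1
    keep.append(prev_ok and next_ok)
df_clean = df[keep]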

Pandas Groupby and apply method with custom function

I built the following function with the aim of estimating an optimal exponential moving average of a pandas DataFrame column.
from scipy import optimize
from sklearn.metrics import mean_squared_error
import pandas as pd
## Function that finds best alpha and uses it to create ewma
def find_best_ewma(series, eps=10e-5):
    def f(alpha):
        ewm = series.shift().ewm(alpha=alpha, adjust=False).mean()
        return mean_squared_error(series, ewm.fillna(0))
    result = optimize.minimize(f, .3, bounds=[(0 + eps, 1 - eps)])
    return series.shift().ewm(alpha=result.x, adjust=False).mean()
Now I want to apply this function to each of the groups created using pandas-groupby on the following test df:
## test
data1 data2 key1 key2
0 -0.018442 -1.564270 a x
1 -0.038490 -1.504290 b x
2 0.953920 -0.283246 a x
3 -0.231322 -0.223326 b y
4 -0.741380 1.458798 c z
5 -0.856434 0.443335 d y
6 -1.416564 1.196244 c z
To do so, I tried the following two ways:
## First way
test.groupby(["key1","key2"])["data1"].apply(find_best_ewma)
## Output
0 NaN
1 NaN
2 -0.018442
3 NaN
4 NaN
5 NaN
6 -0.741380
Name: data1, dtype: float64
## Second way
test.groupby(["key1","key2"]).apply(lambda g: find_best_ewma(g["data1"]))
## Output
key1 key2
a x 0 NaN
2 -0.018442
b x 1 NaN
y 3 NaN
c z 4 NaN
6 -0.741380
d y 5 NaN
Name: data1, dtype: float64
Both ways produce a pandas.core.series.Series but ONLY the second way provides the expected hierarchical index.
I do not understand why the first way does not produce the hierarchical index and instead returns the original dataframe index. Could you please explain why this happens?
What am I missing?
Thanks in advance for your help.
The first way creates a pandas.core.groupby.DataFrameGroupBy object, which becomes a pandas.core.groupby.SeriesGroupBy object once you select a specific column from it. It is to this object that the apply method is applied, hence a Series is returned.
test.groupby(["key1","key2"])["data1"]#.apply(find_best_ewma)
<pandas.core.groupby.SeriesGroupBy object at 0x7fce51fac790>
The second way stays with the DataFrameGroupBy object. The function you apply to it selects the column inside each group, so find_best_ewma is applied to that column group by group, but the apply method itself runs on the original DataFrameGroupBy; hence the group keys are kept, and the 'magic' is that they appear as the outer levels of the hierarchical index.

Pandas DataFrame constructor introduces NaN when including the index argument

I'm creating a pandas DataFrame object using the DataFrame constructor. My data is a dict of lists and categorical data Series objects. When I pass an index to the constructor, my categorical data series gets reset with NaN values. What's going on here? Thanks in advance!
Example:
import pandas as pd
import numpy as np
a = pd.Series(['a','b','c'],dtype="category")
b = pd.Series(['a','b','c'],dtype="object")
c = pd.Series(['a','b','cc'],dtype="object")
A = pd.DataFrame({'A':a,'B':[1,2,3]},index=["0","1","2"])
AA = pd.DataFrame({'A':a,'B':[1,2,3]})
B = pd.DataFrame({'A':b,'C':[4,5,6]})
print("DF A:")
print(A)
print("\nDF A, without specifying an index in the constructor:")
print(AA)
print("\nDF B:")
print(B)
This doesn't have anything to do with categories vs. object, it has to do with index alignment.
You're getting NaNs in A because you're telling the constructor you want an index of three strings. But a has an index of its own, consisting of the integers [0, 1, 2]. Since that doesn't match the index you've said you want, the data doesn't align, and so you get a DataFrame with the index you said you wanted and the NaNs highlight that the data is missing. By contrast, B is simply a list, and so there's no index to ignore, and accordingly it assumes the data is given in index-appropriate order.
This might be easier to see than to explain. Regardless of dtype, if the indices don't match, you get NaN:
In [147]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="category"),'B':[1,2,3]},
index=["0","1","2"])
Out[147]:
A B
0 NaN 1
1 NaN 2
2 NaN 3
In [148]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="object"),'B':[1,2,3]},
index=["0","1","2"])
Out[148]:
A B
0 NaN 1
1 NaN 2
2 NaN 3
If you use a fully-matching index, it works:
In [149]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="object"),'B':[1,2,3]},
index=[0,1,2])
Out[149]:
A B
0 a 1
1 b 2
2 c 3
And if you use a partially-matching index, you'll get values where the indices align and NaN where they don't:
In [150]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="object"),'B':[1,2,3]},
index=[0,1,10])
Out[150]:
A B
0 a 1
1 b 2
10 NaN 3
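If the goal is simply to keep the data regardless of the Series' own index, one option (a sketch, not from the answer above) is to pass the underlying values, so there is no index left to align:
import pandas as pd

a = pd.Series(['a', 'b', 'c'], dtype="category")
# .values hands the constructor a plain array-like, so the data is taken
# positionally and no NaN-producing index alignment happens
A = pd.DataFrame({'A': a.values, 'B': [1, 2, 3]}, index=["0", "1", "2"])
print(A)
#    A  B
# 0  a  1
# 1  b  2
# 2  c  3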

Pandas. Selection by label. One-row output

I'm trying to select every entry in a pandas DataFrame D corresponding to a certain userid, filling missing etime values with zeros as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is that for some ids there exists exactly one entry, and thus the .loc method returns a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seriously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: I have to apologize for the unintelligible formulation; this is my first post here. I'll try again.
So the deal is: there is a dataframe, indexed by userid. Every userid can possibly have up to some number N of corresponding dataframe rows (columns are: 'etime', 'requested', 'rejected'), for which 'etime' is basically the key. For some 'userid' all of the N corresponding entries exist, but for most of them some 'etime' entries are missing.
My intention is: for every 'userid', construct an explicit DataFrame object containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing the index to 'etime' and then reindexing the selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which .loc subsetting returns not a dataframe with one row indexed by 'userid' but a series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing index fails. Checking dimensions and index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
from pandas import DataFrame

full_etime_range = range(10)
df = DataFrame(index=[0, 0, 1],
               columns=['etime', 'requested'],
               data=[[0, 1], [1, 1], [1, 1]])
for i in df.index:
    tmp = df.loc[i]
    tmp.index = tmp['etime']
    tmp = tmp.reindex(full_etime_range, fill_value=0)
    print(tmp)
So, starting with df being your dataframe, we can do the following safely:
In[215]: df.set_index([df.index, 'etime'], inplace=True)
In[216]: df
Out[216]:
         requested
  etime
0 0              1
  1              1
1 1              1
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In[225]: df0.join(df.loc[0])
Out[225]:
requested
0 1
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
In[230]: df1 = DF.copy()
In[231]: df1.join(df.loc[1])
Out[231]:
requested
0 NaN
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(0).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
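A guess (not verified in the original thread): inside apply each group x still carries both index levels, so DF, which is indexed only by etime, cannot align with it; dropping the outer level first may be enough:
df.groupby(level=0).apply(lambda x: DF.copy().join(x.reset_index(level=0, drop=True)))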
Are you just trying to fill NaNs? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries.fillna(0)
Should do the trick. But if you want to fillna just for the etime field, what you should do is:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_entries["etime"] = temp
Hope it helps. If not, clarify what you're trying to achieve
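A further option, stated as a general pandas fact rather than something from the answers above: selecting with a list of labels always returns a DataFrame, even when only one row matches, which avoids the Series case described in the question.
user_entries = D.loc[[userid]]   # one-row DataFrame instead of a Series when userid matches a single row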
