Pandas - starting iteration index and slicing with .loc - python

I'm still quite new to Python and programming in general. With luck, I have the right idea, but I can't quite get this to work.
With my example df, I want iteration to start when entry == 1.
import pandas as pd
import numpy as np
nan = np.nan
a = [0,0,4,4,4,4,6,6]
b = [4,4,4,4,4,4,4,4]
entry = [nan,nan,nan,nan,1,nan,nan,nan]
df = pd.DataFrame(columns=['a', 'b', 'entry'])
df = df.assign(a=a, b=b, entry=entry)
I wrote a function, with little success: it raises TypeError: unhashable type: 'slice'. FWIW, I'm applying this function to groups of various lengths.
def exit_row(df):
    start = df.index[df.entry == 1]
    df.loc[start:, (df.a > df.b), 'exit'] = 1
    return df
Ideally, the result would be as below:
   a  b  entry  exit
0  0  4    NaN   NaN
1  0  4    NaN   NaN
2  4  4    NaN   NaN
3  4  4    NaN   NaN
4  4  4    1.0   NaN
5  4  4    NaN   NaN
6  6  4    NaN     1
7  6  4    NaN     1
Any advice much appreciated. I had wondered if I should attempt a For loop instead, though I often find them difficult to read.

You can use boolean indexing:
# what are the rows after entry?
m1 = df['entry'].notna().cummax()
# in which rows is a>b?
m2 = df['a'].gt(df['b'])
# set 1 where both conditions are True
df.loc[m1&m2, 'exit'] = 1
output:
   a  b  entry  exit
0  0  4    NaN   NaN
1  0  4    NaN   NaN
2  4  4    NaN   NaN
3  4  4    NaN   NaN
4  4  4    1.0   NaN
5  4  4    NaN   NaN
6  6  4    NaN   1.0
7  6  4    NaN   1.0
Intermediates:
   a  b  entry  notna     m1     m2  m1&m2  exit
0  0  4    NaN  False  False  False  False   NaN
1  0  4    NaN  False  False  False  False   NaN
2  4  4    NaN  False  False  False  False   NaN
3  4  4    NaN  False  False  False  False   NaN
4  4  4    1.0   True   True  False  False   NaN
5  4  4    NaN  False   True  False  False   NaN
6  6  4    NaN  False   True   True   True   1.0
7  6  4    NaN  False   True   True   True   1.0
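Since you mentioned applying this to groups of various lengths, the same masks can be wrapped in a function and run per group. A minimal sketch, assuming a grouping column (here called group, which is not in the example df):

def exit_rows(g):
    # rows at or after the first non-NaN 'entry' in this group
    m1 = g['entry'].notna().cummax()
    # rows where a > b
    m2 = g['a'].gt(g['b'])
    g.loc[m1 & m2, 'exit'] = 1
    return g

# hypothetical 'group' column
df = df.groupby('group', group_keys=False).apply(exit_rows)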

Related

Tricky Pandas Task - Reversely check other column based on True/False

I have a pandas DataFrame with 2 columns that basically looks like this. If I have True in B, I want the last non-NaN value of A in C. Is that even possible?
Actual table:
      A      B
0   754  False
1  None  False
2  None  False
3  None  False
4  None   True
5   999  False
6  None  False
7  None   True
8  None  False
9   875  False
Wanted table:
      A      B    C
0   754  False  754
1  None  False  NaN
2  None  False  NaN
3  None  False  NaN
4  None   True  NaN
5   999  False  999
6  None  False  NaN
7  None   True  NaN
8  None  False  NaN
9   875  False  NaN
I'm unclear as to exactly what you want, but what you describe in the text can be achieved by:

import math

# mark the True rows of B in C with their index
df.loc[df.B, 'C'] = df.loc[df.B].index
for n in df.C:
    if not math.isnan(n):
        cap = df.C[df.C == n].index[0]
        # replace the marker with the last valid value of A up to that row
        df.loc[cap, 'C'] = df.A[df.A[:cap].last_valid_index()]
output:
       A      B      C
0  754.0  False    NaN
1    NaN  False    NaN
2    NaN  False    NaN
3    NaN  False    NaN
4    NaN   True  754.0
5  999.0  False    NaN
6    NaN  False    NaN
7    NaN   True  999.0
8    NaN  False    NaN
9  875.0  False    NaN
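A vectorized alternative without the loop, as a sketch (it reproduces the output above: forward-fill the last non-NaN value of A, then keep it only where B is True):

df['C'] = df['A'].ffill().where(df['B'])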

Pandas python: getting one value from a DataFrame

I implemented a function that goes to the first occurrence of a value in a pandas DataFrame, but I feel the implementation is kinda ugly. Would you have a nicer way to implement it?
mots is an array of strings.
# Without a doubt the worst implementation in the world...
def find_singular_value(self, mots):
    bool_table = self.document.isin(mots)
    for i in range(bool_table.shape[0]):
        for j in range(bool_table.shape[1]):
            if bool_table.iloc[i, j]:
                return self.document.iloc[i, j + 1]
Here's a solution for getting the j+1 value. It uses df.unstack and df.shift
df = self.document.unstack()
vals = df[df.isin(mots).shift().fillna(False)]
vals will contain all of the j+1 values in self.document. You can then select the first one as in my previous answer.
Hopefully this works for you.
This one-liner should give you what you need:
self.document[self.document.isin(mots)].melt()["value"].dropna().values[0]
It applies your isin mask to the original df, then finds the first non-NaN value using df.melt and dropna.
Here's a simple breakdown:
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
>>> df.isin([4, 6])
       a      b      c
0  False   True  False
1  False  False  False
2  False   True  False
>>> df[df.isin([4, 6])]
    a    b   c
0 NaN  4.0 NaN
1 NaN  NaN NaN
2 NaN  6.0 NaN
>>> df[df.isin([4, 6])].melt()
  variable  value
0        a    NaN
1        a    NaN
2        a    NaN
3        b    4.0
4        b    NaN
5        b    6.0
6        c    NaN
7        c    NaN
8        c    NaN
>>> df[df.isin([4, 6])].melt()["value"]
0    NaN
1    NaN
2    NaN
3    4.0
4    NaN
5    6.0
6    NaN
7    NaN
8    NaN
Name: value, dtype: float64
>>> df[df.isin([4, 6])].melt()["value"].dropna()
3    4.0
5    6.0
Name: value, dtype: float64
>>> df[df.isin([4, 6])].melt()["value"].dropna().values
array([4., 6.])
>>> df[df.isin([4, 6])].melt()["value"].dropna().values[0]
4.0
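Note that the one-liner returns the first matching value itself, while the original loops return the value one column to the right of the first match. If that j+1 behaviour is what you need, a sketch with np.argwhere (which walks the mask in the same row-major order as the nested loops):

import numpy as np

def find_singular_value(self, mots):
    mask = self.document.isin(mots).to_numpy()
    hits = np.argwhere(mask)  # (row, col) pairs in row-major order
    if hits.size:
        i, j = hits[0]
        return self.document.iloc[i, j + 1]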

Remove any empty fields in a loop?

A list has many paths of certain csv's.
How can I check, on each loop iteration, whether the csv has any empty columns, and delete them if it does?
Code:
for i in list1:
    if (i.columns == '').any():
        # remove that column from i
Hope this explains what I am talking about.
Sample:
df = pd.DataFrame({
    '': list('abcdef'),
    'B': [4, 5, 4, 5, 5, np.nan],
    'C': [''] * 6,
    'D': [np.nan] * 6,
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabb') + ['']
})
print (df)
        B  C    D  E  F
0  a  4.0     NaN  5  a
1  b  5.0     NaN  3  a
2  c  4.0     NaN  6  a
3  d  5.0     NaN  9  b
4  e  5.0     NaN  2  b
5  f  NaN     NaN  4
First, remove the unnamed column (its column name is the empty string). That means keeping only the columns whose name is not empty, filtering with loc and boolean indexing:
df1 = df.loc[:, df.columns != '']
print (df1)
     B  C    D  E  F
0  4.0     NaN  5  a
1  5.0     NaN  3  a
2  4.0     NaN  6  a
3  5.0     NaN  9  b
4  5.0     NaN  2  b
5  NaN     NaN  4
Next, remove column C, because it contains only empty strings. Compare all values against '' and require at least one True per column with DataFrame.any, again filtering with loc and boolean indexing:
df2 = df.loc[:, (df != '').any()]
print (df2)
        B    D  E  F
0  a  4.0  NaN  5  a
1  b  5.0  NaN  3  a
2  c  4.0  NaN  6  a
3  d  5.0  NaN  9  b
4  e  5.0  NaN  2  b
5  f  NaN  NaN  4
print ((df != ''))
          B      C     D     E      F
0  True  True  False  True  True   True
1  True  True  False  True  True   True
2  True  True  False  True  True   True
3  True  True  False  True  True   True
4  True  True  False  True  True   True
5  True  True  False  True  True  False
print ((df != '').any())
      True
B     True
C    False
D     True
E     True
F     True
dtype: bool
Finally, remove column D, because it contains only missing values; use dropna:
df3 = df.dropna(axis=1, how='all')
print (df3)
        B  C  E  F
0  a  4.0     5  a
1  b  5.0     3  a
2  c  4.0     6  a
3  d  5.0     9  b
4  e  5.0     2  b
5  f  NaN     4
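Putting the three checks together for the original loop over CSV paths, a sketch (assuming list1 holds the file paths, as in the question):

import pandas as pd

cleaned = []
for path in list1:
    df = pd.read_csv(path)
    df = df.loc[:, df.columns != '']    # drop columns with an empty name
    df = df.loc[:, (df != '').any()]    # drop columns holding only empty strings
    df = df.dropna(axis=1, how='all')   # drop columns holding only NaN
    cleaned.append(df)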

how to merge two dataframes if the index and length both do not match?

I have two DataFrames, predictor_df and solution_df, like this:
predictor_df
1000    A    B    C
1001    1    2    3
1002    4    5    6
1003    7    8    9
1004  NaN  NaN  NaN
and a solution_df
0   D
1  10
2  11
3  12
The reason for the names is that predictor_df is used to do some analysis on its columns to arrive at solution_df. My analysis leaves out the rows with NaN values in predictor_df, hence the shorter solution_df.
Now I want to know how to join these two DataFrames to obtain my final DataFrame as:
A    B    C    D
1    2    3    10
4    5    6    11
7    8    9    12
NaN  NaN  NaN
Please guide me through it. Thanks in advance.
Edit: I tried to merge the two DataFrames, but the result comes out like this:
A    B    C    D
1    2    3    NaN
4    5    6    NaN
7    8    9    NaN
NaN  NaN  NaN
Edit 2: Also, when I do pd.concat([predictor_df, solution_df], axis=1), it becomes like this:
A    B    C    D
NaN  NaN  NaN  10
NaN  NaN  NaN  11
NaN  NaN  NaN  12
NaN  NaN  NaN  NaN
You could use reset_index with drop=True, which resets each index to the default integer index:
pd.concat([df_1.reset_index(drop=True), df_2.reset_index(drop=True)], axis=1)
     A    B    C     D
0    1    2    3  10.0
1    4    5    6  11.0
2    7    8    9  12.0
3  NaN  NaN  NaN   NaN
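If you want to keep predictor_df's original index instead, a sketch that assumes the complete rows of predictor_df correspond, in order, to the rows of solution_df:

# select the rows without NaN and assign solution_df's values positionally
mask = predictor_df.notna().all(axis=1)
predictor_df.loc[mask, 'D'] = solution_df['D'].to_numpy()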

How can I use apply with pandas rolling_corr()

I posted this a while ago but no one could solve the problem.
First, let's create some correlated DataFrames and call rolling_corr() with dropna(), as I am going to sparse them up later, and with no min_periods set, as I want to keep the results robust and consistent with the set window:
import numpy as np
from pandas import DataFrame, rolling_corr  # old, pre-0.18 pandas API

hey = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hoo = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hey_corr = rolling_corr(hey.dropna(), hoo.dropna(), 4)
gives me
In [388]: hey_corr
Out[388]:
           0         1         2
0        NaN       NaN       NaN
1        NaN       NaN       NaN
2        NaN       NaN       NaN
3   0.991087  0.978383  0.992614
4   0.974117  0.974871  0.989411
5   0.966969  0.972894  0.997427
6   0.942064  0.994681  0.996529
7   0.932688  0.986505  0.991353
8   0.935591  0.966705  0.980186
9   0.969994  0.977517  0.931809
10  0.979783  0.956659  0.923954
11  0.987701  0.959434  0.961002
12  0.907483  0.986226  0.978658
13  0.940320  0.985458  0.967748
14  0.952916  0.992365  0.973929
Now when I sparse it up, it gives me...
hey.ix[5:8,0] = np.nan
hey.ix[6:10,1] = np.nan
hoo.ix[5:8,0] = np.nan
hoo.ix[6:10,1] = np.nan
hey_corr_sparse = rolling_corr(hey.dropna(),hoo.dropna(), 4)
hey_corr_sparse
Out[398]:
           0         1         2
0        NaN       NaN       NaN
1        NaN       NaN       NaN
2        NaN       NaN       NaN
3   0.991273  0.992557  0.985773
4   0.953041  0.999411  0.958595
11  0.996801  0.998218  0.992538
12  0.994919  0.998656  0.995235
13  0.994899  0.997465  0.997950
14  0.971828  0.937512  0.994037
Chunks of data are missing; it looks like we only have data where the dropna() can form a complete window across the DataFrame.
I can solve the problem with an ugly iter-fudge as follows...
hey_corr_sparse = DataFrame(np.nan, index=hey.index, columns=hey.columns)
for i in hey_corr_sparse.columns:
    hey_corr_sparse.ix[:, i] = rolling_corr(hey.ix[:, i].dropna(), hoo.ix[:, i].dropna(), 4)
hey_corr_sparse
Out[406]:
           0         1         2
0        NaN       NaN       NaN
1        NaN       NaN       NaN
2        NaN       NaN       NaN
3   0.991273  0.992557  0.985773
4   0.953041  0.999411  0.958595
5        NaN  0.944246  0.961917
6        NaN       NaN  0.941467
7        NaN       NaN  0.963183
8        NaN       NaN  0.980530
9   0.993865       NaN  0.984484
10  0.997691       NaN  0.998441
11  0.978982  0.991095  0.997462
12  0.914663  0.990844  0.998134
13  0.933355  0.995848  0.976262
14  0.971828  0.937512  0.994037
Does anyone in the community know if it is possible to make this an array function that gives this result? I've attempted to use .apply but drawn a blank. Is it even possible to .apply a function that works on two data structures (hey and hoo in this example)?
many thanks, LW
you can try this:
>>> def sparse_rolling_corr(ts, other, window):
...     return rolling_corr(ts.dropna(), other[ts.name].dropna(), window).reindex_like(ts)
...
>>> hey.apply(sparse_rolling_corr, args=(hoo, 4))
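Here apply passes each column of hey to sparse_rolling_corr as ts; ts.name picks the matching column of hoo, and reindex_like restores the full index so the results line up. In modern pandas, where the top-level rolling_corr no longer exists, an equivalent sketch uses Series.rolling(...).corr:

def sparse_rolling_corr(ts, other, window):
    # correlate the non-NaN parts, then realign to the original index
    return ts.dropna().rolling(window).corr(other[ts.name].dropna()).reindex_like(ts)

hey.apply(sparse_rolling_corr, args=(hoo, 4))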
