Remove linearly increasing "count" columns pandas - python

I have a dataframe in which some columns represent counts for every timestep. I would like to drop these automatically, similar to the df.dropna() functionality, e.g. something like df.dropcounts().
Here is an example dataframe:
array = [[0.0,1.6,2.7,12.0],[1.0,3.5,4.5,13.0],[2.0,6.5,8.6,14.0]]
pd.DataFrame(array)
0 1 2 3
0 0.0 1.6 2.7 12.0
1 1.0 3.5 4.5 13.0
2 2.0 6.5 8.6 14.0
I would like to drop the first and last columns

I believe you need:
val = 1
df = df.loc[:, df.diff().fillna(val).ne(val).any()]
print (df)
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
Explanation:
First compute the differences with DataFrame.diff:
print (df.diff())
0 1 2 3
0 NaN NaN NaN NaN
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Replace NaNs:
print (df.diff().fillna(val))
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Compare for inequality with ne:
print (df.diff().fillna(val).ne(val))
0 1 2 3
0 False False False False
1 False True True False
2 False True True False
And check for at least one True per column with DataFrame.any:
print (df.diff().fillna(val).ne(val).any())
0 False
1 True
2 True
3 False
dtype: bool

Using all:
df.loc[:, ~df.diff().fillna(1).eq(1).all().values]
Out[295]:
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
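For the df.dropcounts() convenience the question asks for, either approach can be wrapped in a small helper. A minimal sketch (the name dropcounts is hypothetical, and it assumes a "count" column is any column whose step between rows is constant, not necessarily 1):
import pandas as pd

def dropcounts(df):
    # keep only columns whose consecutive differences are not all identical
    diffs = df.diff().iloc[1:]   # first row of diff() is always NaN
    return df.loc[:, diffs.nunique() > 1]

array = [[0.0,1.6,2.7,12.0],[1.0,3.5,4.5,13.0],[2.0,6.5,8.6,14.0]]
print (dropcounts(pd.DataFrame(array)))
     1    2
0  1.6  2.7
1  3.5  4.5
2  6.5  8.6
Note this also drops constant columns (step 0), which may or may not be desired.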

Related

How to iterate over an array using a lambda function with pandas apply

I have the following dataset:
0 1 2
0 2.0 2.0 4
0 1.0 1.0 2
0 1.0 1.0 3
3 1.0 1.0 5
4 1.0 1.0 2
5 1.0 NaN 1
6 NaN 1.0 1
What I want to do is insert a new column that, for each row, contains 0 if the row has a NaN, and otherwise copies the value from column '2', to get this:
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
The following code is what I have so far; it runs fine but does not iterate over the values of column '2'.
df.isna().sum(axis=1).apply(lambda x: df[2].iloc[x] if x==0 else 0)
If I use df[2].iloc[x] I get:
0 4
1 4
2 4
3 4
4 4
5 0
6 0
How can I iterate over the column '2'?
Try the code below, using np.where with isna and any:
>>> df['3'] = np.where(df[['0', '1']].isna().any(axis=1), 0, df['2'])
>>> df
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
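For reference, a pandas-only alternative is Series.where, which keeps the values of column '2' where the condition holds and substitutes 0 elsewhere (a sketch assuming the same string column labels as in the answer above):
df['3'] = df['2'].where(df[['0', '1']].notna().all(axis=1), 0)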

Number Of Rows Since Positive/Negative in Pandas

I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
We can use np.sign with diff to create the subgroups, then groupby + cumcount:
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
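To make the grouping easier to follow, here is an illustrative breakdown of the intermediate steps on the same df:
sign = np.sign(df['MACD'])          # -1.0 for negative, 1.0 for positive
print (sign.tolist())
[-1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0]

s = sign.diff().ne(0).cumsum()      # a new group starts at every sign change
print (s.tolist())
[1, 1, 2, 2, 2, 3, 4, 4]
groupby(s).cumcount()+1 then numbers the rows within each group, and mask(s.eq(1)) blanks the first group, where there is no previous opposite-sign run to count from.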

pandas if column is true, perform percent change on that row

I have a df with an additional column of boolean values based on a conditional statement.
df = pd.DataFrame({'col1': [1,2,3,2.5,5,2]})
df['bool'] = df['col1'] >= 3
The df looks like...
col1 bool
0 1.0 False
1 2.0 False
2 3.0 True
3 2.5 False
4 5.0 True
5 2.0 False
I would like to get pct_change() of "col1" where "bool" is True, and NaN where it is False. The output should look something like...
col1 pct_change
0 1.0 NaN
1 2.0 NaN
2 3.0 -0.167
3 2.5 NaN
4 5.0 -0.600
5 2.0 NaN
What would be the best way of going about this?
Use numpy.where to apply df["bool"] as a boolean mask (pct_change().shift(-1) gives the change from each row to the next, which is what the desired output shows):
df["pct_change"] = np.where(df["bool"], df["col1"].pct_change().shift(-1), np.nan)
print(df)
Output:
col1 bool pct_change
0 1.0 False NaN
1 2.0 False NaN
2 3.0 True -0.166667
3 2.5 False NaN
4 5.0 True -0.600000
5 2.0 False NaN
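The same result can also be obtained without numpy via Series.where, which defaults to NaN where the condition is False:
df["pct_change"] = df["col1"].pct_change().shift(-1).where(df["bool"])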

Forward fill Pandas df only if an entire line is made of Nan

I would like to forward fill a pandas df with the previous line, but only when the current line is entirely composed of NaN.
This means that fillna(method='ffill', limit=1) does not work in my case, because it works element-wise while I need a fillna that works row-wise.
Is there a more elegant way to achieve this task than the following instructions?
s = df.count(axis = 1)
for d in df.index[1:]:
    if s.loc[d] == 0:
        i = s.index.get_loc(d)
        df.iloc[i] = df.iloc[i-1]
Input
v1 v2
1 1 2
2 nan 3
3 2 4
4 nan nan
Output
v1 v2
1 1 2
2 nan 3
3 2 4
4 2 4
You can build a boolean mask to filter the rows before applying ffill:
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
dtype: bool
print (df[m])
v1 v2
1 1.0 2.0
3 2.0 4.0
4 NaN NaN
df[m] = df[m].ffill()
print (df)
v1 v2
1 1.0 2.0
2 NaN 3.0
3 2.0 4.0
4 2.0 4.0
EDIT:
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 NaN NaN
5 2.0 4.0
6 NaN 3.0
7 NaN NaN
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
5 True
6 False
7 True
dtype: bool
long_str = 'some long helper str'
df[~m] = df[~m].fillna(long_str)
df = df.ffill().replace(long_str, np.nan)
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 4.0 8.0
5 2.0 4.0
6 NaN 3.0
7 NaN 3.0
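If, as in the question's loop with limit semantics, at most one consecutive line is ever entirely NaN, a shorter alternative is to copy the previous row directly (a sketch under that assumption; it would not propagate through two adjacent all-NaN rows):
m = df.isna().all(axis=1)
df[m] = df.shift()[m]
This reproduces the EDIT output above, including keeping the NaN in v1 of the last row, because the previous row is copied as-is.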

Transposing a subset of columns in a Pandas DataFrame while using others as grouping variable?

Let's say I have a Pandas dataframe (that is already in the dataframe format):
x = [[1,2,8,7,9],[1,3,5.6,4.5,4],[2,3,4.5,5,5]]
df = pd.DataFrame(x, columns=['id1','id2','val1','val2','val3'])
id1 id2 val1 val2 val3
1 2 8.0 7.0 9
1 3 5.6 4.5 4
2 3 4.5 5.0 5
I want val1, val2, and val3 in one column, with id1 and id2 as grouping variables. I can use this extremely convoluted code:
dfT = df.iloc[:, 2:].T.reset_index(drop=True)
n_points = dfT.shape[0]
final = pd.DataFrame()
for i in range(df.shape[0]):
    data = np.asarray([[df.loc[i, 'id1']] * n_points,
                       [df.loc[i, 'id2']] * n_points,
                       dfT.iloc[:, i].values]).T
    temp = pd.DataFrame(data, columns=['id1', 'id2', 'val'])
    final = pd.concat([final, temp], axis=0)
to get my dataframe into the correct format:
id1 id2 val
0 1.0 2.0 8.0
1 1.0 2.0 7.0
2 1.0 2.0 9.0
0 1.0 3.0 5.6
1 1.0 3.0 4.5
2 1.0 3.0 4.0
0 2.0 3.0 4.5
1 2.0 3.0 5.0
2 2.0 3.0 5.0
but there must be a more efficient way of doing this, since on a large dataframe this takes way too long.
Suggestions?
You can use melt and then drop the variable column:
print (pd.melt(df, id_vars=['id1','id2'], value_name='val')
.drop('variable', axis=1))
id1 id2 val
0 1 2 8.0
1 1 3 5.6
2 2 3 4.5
3 1 2 7.0
4 1 3 4.5
5 2 3 5.0
6 1 2 9.0
7 1 3 4.0
8 2 3 5.0
Another solution with set_index and stack:
print (df.set_index(['id1','id2'])
.stack()
.reset_index(level=2, drop=True)
.reset_index(name='val'))
id1 id2 val
0 1 2 8.0
1 1 2 7.0
2 1 2 9.0
3 1 3 5.6
4 1 3 4.5
5 1 3 4.0
6 2 3 4.5
7 2 3 5.0
8 2 3 5.0
There's an even simpler one which can be done using lreshape (not yet documented, though):
pd.lreshape(df, {'val': ['val1', 'val2', 'val3']}).sort_values(['id1', 'id2'])
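A further option is pd.wide_to_long, a sketch assuming each (id1, id2) pair occurs only once (wide_to_long requires i to identify rows uniquely):
out = (pd.wide_to_long(df, stubnames='val', i=['id1', 'id2'], j='num')
         .reset_index(level='num', drop=True)
         .reset_index())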
