I have the following dataset:
     0    1  2
0  2.0  2.0  4
0  1.0  1.0  2
0  1.0  1.0  3
3  1.0  1.0  5
4  1.0  1.0  2
5  1.0  NaN  1
6  NaN  1.0  1
What I want to do is add a new column that, for each row, holds 0 if the row contains a NaN and otherwise copies the value from column 2, to get this:
     0    1  2  3
0  2.0  2.0  4  4
0  1.0  1.0  2  2
0  1.0  1.0  3  3
3  1.0  1.0  5  5
4  1.0  1.0  2  2
5  1.0  NaN  1  0
6  NaN  1.0  1  0
The following code is what I have so far; it runs, but it does not iterate over the values of column 2.
df.isna().sum(axis=1).apply(lambda x: df[2].iloc[x] if x==0 else 0)
With df[2].iloc[x], x is 0 for every complete row, so I always pick up the first value of column 2 and get:
0    4
1    4
2    4
3    4
4    4
5    0
6    0
How can I iterate over the column '2'?
Try np.where combined with isna and any:
>>> df['3'] = np.where(df[['0', '1']].isna().any(axis=1), 0, df['2'])
>>> df
     0    1  2  3
0  2.0  2.0  4  4
0  1.0  1.0  2  2
0  1.0  1.0  3  3
3  1.0  1.0  5  5
4  1.0  1.0  2  2
5  1.0  NaN  1  0
6  NaN  1.0  1  0
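If you prefer to avoid numpy, the same thing can be written with Series.mask. Here is a minimal runnable sketch; rebuilding the frame with string column labels is an assumption made to match the answer's code:

import numpy as np
import pandas as pd

# Rebuild the question's frame (string column labels assumed here).
df = pd.DataFrame({'0': [2.0, 1.0, 1.0, 1.0, 1.0, 1.0, np.nan],
                   '1': [2.0, 1.0, 1.0, 1.0, 1.0, np.nan, 1.0],
                   '2': [4, 2, 3, 5, 2, 1, 1]},
                  index=[0, 0, 0, 3, 4, 5, 6])

# Equivalent of the np.where line: replace the value from column '2'
# with 0 wherever the row has a NaN in column '0' or '1'.
df['3'] = df['2'].mask(df[['0', '1']].isna().any(axis=1), 0)
print(df)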
I have a DataFrame similar to this:
   MACD
0  -2.3
1  -0.3
2   0.8
3   0.1
4   0.6
5  -0.7
6   1.1
7   2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
   MACD  RowsSince
0  -2.3        NaN
1  -0.3        NaN
2   0.8          1
3   0.1          2
4   0.6          3
5  -0.7          1
6   1.1          1
7   2.4          2
We can use np.sign with diff to create the subgroups, then groupby + cumcount:
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
   MACD  new
0  -2.3  NaN
1  -0.3  NaN
2   0.8  1.0
3   0.1  2.0
4   0.6  3.0
5  -0.7  1.0
6   1.1  1.0
7   2.4  2.0
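To see why this works, here is a small sketch of the intermediate values: diff().ne(0) flags every sign change, cumsum() turns those flags into run labels, and cumcount() numbers the rows inside each run.

import numpy as np
import pandas as pd

df = pd.DataFrame({'MACD': [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})

sign = np.sign(df['MACD'])   # -1 -1  1  1  1 -1  1  1
flip = sign.diff().ne(0)     #  T  F  T  F  F  T  T  F (the NaN diff on row 0 counts as a flip)
s = flip.cumsum()            #  1  1  2  2  2  3  4  4 -> one label per run of equal sign
df['new'] = (df.groupby(s).cumcount() + 1).mask(s.eq(1))  # first run has no prior side -> NaN
print(df)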
I have a df with an additional column of boolean values based on a conditional statement.
df = pd.DataFrame({'col1': [1,2,3,2.5,5,2]})
df['bool'] = df['col1'] >= 3
The df looks like...
   col1   bool
0   1.0  False
1   2.0  False
2   3.0   True
3   2.5  False
4   5.0   True
5   2.0  False
I would like to get pct_change() of "col1" where "bool" is True, and NaN where it is False. The output should look something like...
   col1  pct_change
0   1.0         NaN
1   2.0         NaN
2   3.0      -0.167
3   2.5         NaN
4   5.0      -0.600
5   2.0         NaN
What would be the best way of going about this?
Use numpy.where with df["bool"] as the condition:
df["pct_change"] = np.where(df["bool"], df["col"].pct_change().shift(-1), np.nan)
print(df)
Output:
   col1   bool  pct_change
0   1.0  False         NaN
1   2.0  False         NaN
2   3.0   True   -0.166667
3   2.5  False         NaN
4   5.0   True   -0.600000
5   2.0  False         NaN
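The same result can be had without numpy; here is a runnable sketch on the question's own data that computes the forward change once and keeps it only where the flag holds, via Series.where:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2.5, 5, 2]})
df['bool'] = df['col1'] >= 3

# pct_change().shift(-1) is the change from this row to the NEXT one;
# Series.where keeps it where 'bool' is True and leaves NaN elsewhere.
df['pct_change'] = df['col1'].pct_change().shift(-1).where(df['bool'])
print(df)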
I would like to forward fill a pandas df with the previous row, but only when the current row is entirely composed of NaN.
This means that fillna(method='ffill', limit=1) does not work in my case, because it works element-wise, while I need a row-wise fillna.
Is there a more elegant way to achieve this task than the following instructions?
s = df.count(axis=1)
for d in df.index[1:]:
    if s.loc[d] == 0:
        i = s.index.get_loc(d)
        df.iloc[i] = df.iloc[i-1]
Input
    v1   v2
1    1    2
2  nan    3
3    2    4
4  nan  nan
Output
    v1   v2
1    1    2
2  nan    3
3    2    4
4    2    4
You can use a condition to filter the rows to which ffill is applied:
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1     True
2    False
3     True
4     True
dtype: bool
print (df[m])
    v1   v2
1  1.0  2.0
3  2.0  4.0
4  NaN  NaN
df[m] = df[m].ffill()
print (df)
    v1   v2
1  1.0  2.0
2  NaN  3.0
3  2.0  4.0
4  2.0  4.0
EDIT: If a partially-NaN row sits directly above an all-NaN row, the mask above copies from the wrong row, because the partial row is excluded from the selection. Filling the excluded rows with a sentinel string first keeps ffill from reaching past them:
print (df)
    v1   v2
1  1.0  2.0
2  NaN  7.0
3  4.0  8.0
4  NaN  NaN
5  2.0  4.0
6  NaN  3.0
7  NaN  NaN
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1     True
2    False
3     True
4     True
5     True
6    False
7     True
dtype: bool
long_str = 'some long helper str'
df[~m] = df[~m].fillna(long_str)
df = df.ffill().replace(long_str, np.nan)
print (df)
    v1   v2
1  1.0  2.0
2  NaN  7.0
3  4.0  8.0
4  4.0  8.0
5  2.0  4.0
6  NaN  3.0
7  NaN  3.0
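If all-NaN rows never appear back-to-back (an assumption; the sentinel trick above does not need it), a shorter sketch is to copy the previous row verbatim with shift:

import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [1, np.nan, 2, np.nan],
                   'v2': [2, 3, 4, np.nan]},
                  index=[1, 2, 3, 4])

m = df.isna().all(axis=1)   # rows that are entirely NaN
df[m] = df.shift()[m]       # overwrite them with the row directly above
print(df)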
Let's say I have a Pandas dataframe (that is already in the dataframe format):
x = [[1,2,8,7,9],[1,3,5.6,4.5,4],[2,3,4.5,5,5]]
df = pd.DataFrame(x, columns=['id1','id2','val1','val2','val3'])
id1  id2  val1  val2  val3
  1    2   8.0   7.0     9
  1    3   5.6   4.5     4
  2    3   4.5   5.0     5
I want val1, val2, and val3 in one column, with id1 and id2 as grouping variables. I can use this extremely convoluted code:
dfT = df.iloc[:, 2:].T.reset_index(drop=True)
n_points = dfT.shape[0]
final = pd.DataFrame()
for i in range(0, df.shape[0]):
    data = np.asarray([[df.loc[i, 'id1']] * n_points,
                       [df.loc[i, 'id2']] * n_points,
                       dfT.iloc[:, i].values]).T
    temp = pd.DataFrame(data, columns=['id1', 'id2', 'val'])
    final = pd.concat([final, temp], axis=0)
to get my dataframe into the correct format:
   id1  id2  val
0  1.0  2.0  8.0
1  1.0  2.0  7.0
2  1.0  2.0  9.0
0  1.0  3.0  5.6
1  1.0  3.0  4.5
2  1.0  3.0  4.0
0  2.0  3.0  4.5
1  2.0  3.0  5.0
2  2.0  3.0  5.0
but there must be a more efficient way of doing this, since on a large dataframe this takes way too long.
Suggestions?
You can use melt and then drop the variable column:
print (pd.melt(df, id_vars=['id1','id2'], value_name='val')
         .drop('variable', axis=1))
   id1  id2  val
0    1    2  8.0
1    1    3  5.6
2    2    3  4.5
3    1    2  7.0
4    1    3  4.5
5    2    3  5.0
6    1    2  9.0
7    1    3  4.0
8    2    3  5.0
Another solution with set_index and stack:
print (df.set_index(['id1','id2'])
         .stack()
         .reset_index(level=2, drop=True)
         .reset_index(name='val'))
   id1  id2  val
0    1    2  8.0
1    1    2  7.0
2    1    2  9.0
3    1    3  5.6
4    1    3  4.5
5    1    3  4.0
6    2    3  4.5
7    2    3  5.0
8    2    3  5.0
There's an even simpler one using lreshape (not yet documented, though):
pd.lreshape(df, {'val': ['val1', 'val2', 'val3']}).sort_values(['id1', 'id2'])
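For completeness, a runnable sketch on the question's own data: melt emits the values column by column, so a stable sort on the id columns restores the grouped order shown in the desired output.

import pandas as pd

df = pd.DataFrame([[1, 2, 8, 7, 9],
                   [1, 3, 5.6, 4.5, 4],
                   [2, 3, 4.5, 5, 5]],
                  columns=['id1', 'id2', 'val1', 'val2', 'val3'])

out = (pd.melt(df, id_vars=['id1', 'id2'], value_name='val')
         .drop('variable', axis=1)
         .sort_values(['id1', 'id2'], kind='stable')  # keep val1..val3 order within each group
         .reset_index(drop=True))
print(out)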