I have two dataframes:
Actual_Values
0 0.60
1 0.60
2 0.60
3 0.60
4 0.60
Predicted_Values
0 0.60
1 0.60
2 0.60
and I want something like this:
Actual_Values Predicted_Values
0 0.60 NaN
1 0.60 NaN
2 0.60 0.6
3 0.60 0.6
4 0.60 0.6
I have tried pandas' join, merge, and concat, but none of them works.
Try assigning df2 a new index taken from the last rows of df1, so that join aligns the two frames at the bottom:
df2.index=df1.index[-len(df2):]
out = df1.join(df2)
Out[283]:
Actual_Values Predicted_Values
0 0.6 NaN
1 0.6 NaN
2 0.6 0.6
3 0.6 0.6
4 0.6 0.6
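For anyone who wants to run this end to end, here is a minimal sketch of the same idea. The two frames are rebuilt from the question, and set_axis is swapped in for the in-place index assignment so that df2 itself is left untouched; that substitution is my own choice, not part of the answer above.

```python
import pandas as pd

# Sample frames rebuilt from the question.
df1 = pd.DataFrame({"Actual_Values": [0.6] * 5})
df2 = pd.DataFrame({"Predicted_Values": [0.6] * 3})

# Give a copy of df2 the labels of the last len(df2) rows of df1.
aligned = df2.set_axis(df1.index[-len(df2):], axis=0)

# join aligns on the index; rows of df1 without a match get NaN.
out = df1.join(aligned)
print(out)
```

This assumes df2 is non-empty and no longer than df1.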
Related
I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
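We can use np.sign with diff to label each run of same-sign values, then groupby + cumcount to count rows within each run. The first run is masked because there is no earlier value on the opposite side: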
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
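Broken into steps, the approach might look like the sketch below. The intermediate names (sign, run_id, rows_since) are only illustrative, and the column is named RowsSince to match the question. Note that an exact 0 in MACD has sign 0 and would be treated as its own run here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"MACD": [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})

sign = np.sign(df["MACD"])
# A new run starts wherever the sign differs from the previous row.
run_id = sign.diff().ne(0).cumsum()
# Position of each row within its run, counting from 1.
rows_since = df.groupby(run_id).cumcount() + 1
# The first run has no earlier opposite-sign row, so blank it out.
df["RowsSince"] = rows_since.mask(run_id.eq(1))
print(df)
```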
For each ID, I would like to insert a new row for every D1 value that exists for other IDs but is missing for that ID, leaving df['Value'] blank (N/A) in the inserted rows. Your help is appreciated.
Input
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.02 2 4.5
0.04 2 4.1
0.08 2 3.6
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Expected output:
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.1 1
0.02 2 4.5
0.04 2 4.1
0.06 2
0.08 2 3.6
0.1 2
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Unfortunately, the code I have written so far has been way off or simply raises multiple errors, so unlike my other questions I do not have an attempt to show.
Use unstack and stack, then chain sort_index and reset_index to get the desired order:
df_final = (df.set_index(['D1', 'ID']).unstack().stack(dropna=False)
.sort_index(level=[1,0]).reset_index())
Out[952]:
D1 ID Value
0 0.02 1 1.2
1 0.04 1 1.6
2 0.06 1 1.9
3 0.08 1 2.8
4 0.10 1 NaN
5 0.02 2 4.5
6 0.04 2 4.1
7 0.06 2 NaN
8 0.08 2 3.6
9 0.10 2 NaN
10 0.02 3 2.7
11 0.04 3 2.9
12 0.06 3 2.4
13 0.08 3 2.1
14 0.10 3 1.9
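An alternative that avoids the unstack/stack round trip is to reindex against the full ID x D1 product. This is a swapped-in technique, not the answer above, and full_index is just an illustrative name; the sample frame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "D1": [0.02, 0.04, 0.06, 0.08, 0.02, 0.04, 0.08,
           0.02, 0.04, 0.06, 0.08, 0.1],
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "Value": [1.2, 1.6, 1.9, 2.8, 4.5, 4.1, 3.6,
              2.7, 2.9, 2.4, 2.1, 1.9],
})

# Every (ID, D1) combination, ordered by ID and then D1.
full_index = pd.MultiIndex.from_product(
    [sorted(df["ID"].unique()), sorted(df["D1"].unique())],
    names=["ID", "D1"],
)

# Reindexing inserts the missing combinations with NaN in Value.
df_final = (
    df.set_index(["ID", "D1"])
      .reindex(full_index)
      .reset_index()[["D1", "ID", "Value"]]
)
print(df_final)
```

This assumes each (ID, D1) pair appears at most once, since reindex does not accept a duplicated index.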
I have the following data frame. I want to check the values of each row for the columns of "mental_illness", "feeling", and "flavor". If all the values for those three columns per row are less than 0.5, I want to change the corresponding value of the "unclassified" column to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .lt (strictly less than, as the question asks) and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].lt(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
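A quick check on the sample data suggests that the mask approach does give the expected result. The sketch below rebuilds the frame from the question and collapses the three masks into one with .lt and .all; the names cols and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "sent_no": [0, 1, 2, 3, 4],
    "pos": ["word_1", "word_2", "word_3", "word_4", "word_5"],
    "unclassified": [0.0, 0.0, 0.0, 0.0, 0.0],
    "mental_illness": [0.75, 0.17, 0.19, 0.39, 0.72],
    "feeling": [0.30, 0.72, 0.38, 0.20, 0.30],
    "flavor": [0.28, 0.16, 0.16, 0.14, 0.14],
})

cols = ["mental_illness", "feeling", "flavor"]
# True only for rows where all three scores are strictly below 0.5.
mask = df[cols].lt(0.5).all(axis=1)
# Only the matching rows are changed; other rows keep their current value.
df.loc[mask, "unclassified"] = 1.0
print(df)
```

Because .loc only touches the masked rows, any non-zero values already present in unclassified would be preserved.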
Here is my solution:
data.unclassified = data[['mental_illness', 'feeling', 'flavor']].apply(lambda x: x.le(0.5)).apply(lambda x: 1 if sum(x) == 3 else 0, axis = 1)
output
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14
Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records; split it into chunks of 200 and create df1, df2, …, df5.
Any guidance would be much appreciated.
I've looked on other boards and found no guidance on a function that can automatically create new dataframes.
Use numpy.split for this. See the example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes having 2 rows each.
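For your case, np.split(dfmaster, 5) gives 5 dataframes of 200 rows each. The second argument is the number of equal sections, not the chunk size.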
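To get named chunks without creating variables dynamically, one option is a dict keyed by the names you want. The sketch below uses plain iloc slicing rather than np.split; dfmaster here is a stand-in frame with a made-up column x, and chunk_size and chunks are illustrative names.

```python
import pandas as pd

# Stand-in for the real dfmaster (1000 rows, hypothetical column "x").
dfmaster = pd.DataFrame({"x": range(1000)})
chunk_size = 200

# Keys are "df1", "df2", ...; each value is a slice of chunk_size rows.
chunks = {
    f"df{i + 1}": dfmaster.iloc[start:start + chunk_size]
    for i, start in enumerate(range(0, len(dfmaster), chunk_size))
}

print(list(chunks))        # chunk names
print(len(chunks["df1"]))  # rows in the first chunk
```

Unlike an integer split, slicing also copes with a row count that is not a multiple of chunk_size: the last chunk is simply shorter.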
I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
Just use df = df.values first to convert the dataframe to a numpy.array.
I have a question about how to select a different column for each row (creating a new series) based on another column's value. The raw data is as follows:
DEST_ZIP5 EXP_EDD_FRC_DAY GND_EDD_FRC_DAY \
0 00501 5 6
1 00544 5 6
2 01001 4 8
3 01001 4 8
4 01001 4 8
EXP_DAY_2 EXP_DAY_3 EXP_DAY_4 EXP_DAY_5 ... \
0 0.0 1.00 1.00 1.0 ...
1 0.0 1.00 1.00 1.0 ...
2 0.0 0.85 1.00 1.0 ...
3 0.0 1.00 1.00 1.0 ...
4 0.0 0.85 0.85 1.0 ...
GND_DAY_3 GND_DAY_4 GND_DAY_5 GND_DAY_6 GND_DAY_7 GND_DAY_8 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 0.0 0.0 0.16 0.33 0.83 1.00
3 0.0 0.0 0.00 0.14 0.71 0.85
4 0.1 0.1 0.20 0.40 0.40 0.60
I want two new series that take, for each row, the value from the column named by that row's day number.
(For row 1, EXP_EDD_FRC_DAY = 5, so return df[EXP_DAY_5];
GND_EDD_FRC_DAY = 6, so return df[GND_DAY_6].)
DEST_ZIP5 EXP_percentage GND_percentage \
0 00501 1.0 NaN
1 00544 1.0 NaN
2 01001 1.0 1.00
3 01001 1.0 0.85
4 01001 0.85 0.60
I found the lookup function, but I'm not sure how to use it.
Thank you very much
IIUC:
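# Build each row's target column name, e.g. 6 -> 'GND_DAY_6'
c = df['GND_EDD_FRC_DAY'].astype(str).radd('GND_DAY_')
# Note: df.lookup was deprecated in pandas 1.2 and removed in 2.0
new_series = pd.Series(df.lookup(df.index, c), df.index)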
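On pandas versions where DataFrame.lookup is no longer available (it was deprecated in 1.2 and removed in 2.0), a similar row-wise pick can be sketched with NumPy indexing. The frame below is a cut-down version of the question's data with only the GND columns needed for the demo, and the names sub, col_names and col_pos are illustrative.

```python
import numpy as np
import pandas as pd

# Cut-down sample based on the question's GND columns.
df = pd.DataFrame({
    "GND_EDD_FRC_DAY": [6, 6, 8, 8, 8],
    "GND_DAY_6": [np.nan, np.nan, 0.33, 0.14, 0.40],
    "GND_DAY_8": [np.nan, np.nan, 1.00, 0.85, 0.60],
})

# Restrict to the candidate columns so the array stays numeric.
sub = df.filter(like="GND_DAY_")

# Column name wanted for each row, e.g. 6 -> "GND_DAY_6".
col_names = "GND_DAY_" + df["GND_EDD_FRC_DAY"].astype(str)

# Integer position of each wanted column; -1 would mean "not found".
col_pos = sub.columns.get_indexer(col_names)
if (col_pos < 0).any():
    raise KeyError("some rows point at a column that does not exist")

# Pick one value per row: (row i, column col_pos[i]).
df["GND_percentage"] = sub.to_numpy()[np.arange(len(df)), col_pos]
print(df)
```

The same pattern with the EXP_DAY_ prefix and EXP_EDD_FRC_DAY should give the EXP_percentage series.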