Learning plotly line animation and came across this question.
My df:
         Date   1Mo   2Mo   3Mo   6Mo   1Yr   2Yr
0  2023-02-12  4.66  4.77  4.79  4.89  4.50  4.19
1  2023-02-11  4.66  4.77  4.77  4.90  4.88  4.49
2  2023-02-10  4.64  4.69  4.72  4.88  4.88  4.79
3  2023-02-09  4.62  4.68  4.71  4.82  4.88  4.89
4  2023-02-08  4.60  4.61  4.72  4.83  4.89  4.89
How do I animate this dataframe so that each frame has
x = [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr], and
y = the actual values on a given date, e.g. y=df[df['Date']=="2023-02-08"], with animation_frame=df['Date']?
I tried
plot = px.line(df, x=df.columns[1:], y=df['Date'], title="Treasury Yields", animation_frame=df_treasuries_yield['Date'])
No joy :(
I think the problem is that you cannot pass multiple columns to the animation_frame parameter. But we can get around this by converting your df from wide to long format using pd.melt: for your data, we take all of the values from [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr] and put them in a new column called "value", along with a "variable" column telling us which column each value came from.
df_long = pd.melt(df, id_vars=['Date'], value_vars=['1Mo', '2Mo', '3Mo', '6Mo', '1Yr', '2Yr'])
This will look like the following:
Date variable value
0 2023-02-12 1Mo 4.66
1 2023-02-11 1Mo 4.66
2 2023-02-10 1Mo 4.64
3 2023-02-09 1Mo 4.62
4 2023-02-08 1Mo 4.60
...
28 2023-02-09 2Yr 4.89
29 2023-02-08 2Yr 4.89
Now we can pass the argument animation_frame='Date' to px.line:
fig = px.line(df_long, x="variable", y="value", animation_frame="Date", title="Yields")
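One caveat, not from the original answer: Plotly builds animation frames in row order, and the sample data runs newest to oldest, so you may want to sort before plotting so playback is chronological. A minimal sketch:

# Sort so the animation frames advance chronologically
df_long = df_long.sort_values("Date")

fig = px.line(df_long, x="variable", y="value",
              animation_frame="Date", title="Treasury Yields")
fig.show()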
I have a dataframe containing various data, including a column of Linux timestamps. For further analysis, I need to extract from the timestamp column the minute of each period (minute of the hour, minute of the day, minute of the week, minute of the month, minute of the year).
I have:
TimeStamp var1 var2
1659494100 5.22 6.34
1659494160 4.33 7.33
1659494220 5.46 7.21
1659494280 4.33 4.51
1659494340 6.45 5.67
...
I need to have:
TimeStamp var1 var2 minute_of_hour minute_of_day minute_of_week minute_of_month minute_of_year
1659494100 5.22 6.34 35 155 3035 3035 308315
1659494160 4.33 7.33 36 156 3036 3036 308316
1659494220 5.46 7.21 37 157 3037 3037 308317
1659494280 4.33 4.51 38 158 3038 3038 308318
1659494340 6.45 5.67 39 159 3039 3039 308319
I have a large table and using loops is not an option. Do you have any ideas?
import pandas as pd

# Your dataframe here:
df = pd.DataFrame({
    "Timestamp": [1659494100, 1659494160, 1659494220, 1659494280, 1659494340],
    "var1": [5.22, 4.33, 5.46, 4.33, 6.45],
    "var2": [6.34, 7.33, 7.21, 4.51, 5.67]
})

# Convert the Unix timestamps (in seconds) to datetimes
timestamps = pd.to_datetime(df["Timestamp"], unit="s")

# Map each period name to its pandas frequency alias
freqs = {
    "hour": "H",
    "day": "D",
    "week": "W",
    "month": "M",
    "year": "Y"
}

# The minute number within each period is the time elapsed since the
# start of that period, floor-divided into whole minutes
for name, freq in freqs.items():
    df[f"minute_of_{name}"] = (
        timestamps - timestamps.dt.to_period(freq).dt.start_time
    ) // pd.Timedelta("1Min")
Output:
Timestamp var1 var2 minute_of_hour minute_of_day minute_of_week \
0 1659494100 5.22 6.34 35 155 3035
1 1659494160 4.33 7.33 36 156 3036
2 1659494220 5.46 7.21 37 157 3037
3 1659494280 4.33 4.51 38 158 3038
4 1659494340 6.45 5.67 39 159 3039
minute_of_month minute_of_year
0 3035 308315
1 3036 308316
2 3037 308317
3 3038 308318
4 3039 308319
Note that some columns could be computed more directly (see the sketch below), but this method keeps the code consistent across all frequencies.
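For illustration, a sketch of the more direct route for the first two columns, reusing the timestamps series from above:

# Minute of the hour comes straight off the datetime accessor
df["minute_of_hour"] = timestamps.dt.minute

# Minute of the day is full hours so far times 60, plus the minute
df["minute_of_day"] = timestamps.dt.hour * 60 + timestamps.dt.minute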
I have a dataframe, df, in which I am attempting to fill in values within the empty "Set" column, depending on a condition. The condition is as follows: the value of the 'Set' column needs to be "IN" whenever the 'valence_median_split' column's value is 'Low_Valence' in the corresponding row, and "OUT" in all other cases.
Please see below for an example of my attempt to solve this:
df.head()
Out[65]:
ID Category Num Vert_Horizon Description Fem_Valence_Mean \
0 Animals_001_h Animals 1 h Dead Stork 2.40
1 Animals_002_v Animals 2 v Lion 6.31
2 Animals_003_h Animals 3 h Snake 5.14
3 Animals_004_v Animals 4 v Wolf 4.55
4 Animals_005_h Animals 5 h Bat 5.29
Fem_Valence_SD Fem_Av/Ap_Mean Fem_Av/Ap_SD Arousal_Mean ... Contrast \
0 1.30 3.03 1.47 6.72 ... 68.45
1 2.19 5.96 2.24 6.69 ... 32.34
2 1.19 5.14 1.75 5.34 ... 59.92
3 1.87 4.82 2.27 6.84 ... 75.10
4 1.56 4.61 1.81 5.50 ... 59.77
JPEG_size80 LABL LABA LABB Entropy Classification \
0 263028 51.75 -0.39 16.93 7.86
1 250208 52.39 10.63 30.30 6.71
2 190887 55.45 0.25 4.41 7.83
3 282350 49.84 3.82 1.36 7.69
4 329325 54.26 -0.34 -0.95 7.82
valence_median_split temp_selection set
0 Low_Valence Animals_001_h
1 High_Valence NaN
2 Low_Valence Animals_003_h
3 Low_Valence Animals_004_v
4 Low_Valence Animals_005_h
[5 rows x 36 columns]
df['set'] = np.where(df.loc[df['valence_median_split'] == 'Low_Valence'], 'IN', 'OUT')
ValueError: Length of values does not match length of index
I can accomplish this by using loc to separate the df into two different dfs, but I am wondering if there is a more elegant solution using np.where or a similar approach.
np.where expects a boolean condition as its first argument, but df.loc[...] returns a filtered DataFrame whose length no longer matches the index, hence the ValueError. Change to
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
If you need .loc:
df.loc[df['valence_median_split'] == 'Low_Valence','set']='IN'
df.loc[df['valence_median_split'] != 'Low_Valence','set']='OUT'
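For completeness, a tiny runnable sketch of the fix (the frame here is made up, keeping only the relevant column):

import numpy as np
import pandas as pd

df = pd.DataFrame({"valence_median_split": ["Low_Valence", "High_Valence"]})

# The condition is a boolean Series aligned with df's index,
# so np.where can broadcast 'IN'/'OUT' across every row
df["set"] = np.where(df["valence_median_split"] == "Low_Valence", "IN", "OUT")
print(df)
#   valence_median_split  set
# 0          Low_Valence   IN
# 1         High_Valence  OUT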
I have a SQL table like this:
Ticker Return Shares
AGJ 2.20 1265
ATA 1.78 698
ARS 9.78 10939
ARE -7.51 -26389
AIM 0.91 1758
ABT 10.02 -5893
AC -5.73 -2548
ATD 6.51 7850
AP 1.98 256
ALA -9.58 8524
So essentially, a table of stocks I've longed/shorted.
I want to find the top 4 best performers in this table: the shorts (shares < 0) with the lowest return, and the longs (shares > 0) with the highest return.
Essentially, returning this:
Ticker Return Shares
ARS 9.78 10939
ARE -7.51 -26389
AC -5.73 -2548
ATD 6.51 7850
How would I be able to write the query that lets me do this?
Or, if it's easier, if there are any pandas functions that would do the same thing if I turned this table into a pandas dataframe.
Something like this:
select top (4) t.*
from t
order by (case when shares < 0 then - [return] else [return] end) desc;
Pandas solution:
In [134]: df.loc[(np.sign(df.Shares)*df.Return).nlargest(4).index]
Out[134]:
Ticker Return Shares
2 ARS 9.78 10939
3 ARE -7.51 -26389
7 ATD 6.51 7850
6 AC -5.73 -2548
Explanation:
In [137]: (np.sign(df.Shares)*df.Return)
Out[137]:
0 2.20
1 1.78
2 9.78
3 7.51
4 0.91
5 -10.02
6 5.73
7 6.51
8 1.98
9 -9.58
dtype: float64
In [138]: (np.sign(df.Shares)*df.Return).nlargest(4)
Out[138]:
2 9.78
3 7.51
7 6.51
6 5.73
dtype: float64
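If you prefer to avoid np.sign, the same idea can be written with Series.where, negating the return for the short positions (a sketch, not from the original answer):

# Keep Return for longs (Shares > 0), negate it for shorts,
# then pick the rows with the 4 largest signed returns
signed = df["Return"].where(df["Shares"] > 0, -df["Return"])
top4 = df.loc[signed.nlargest(4).index]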
I have a data-frame (df) with the following structure:
date        a      b      c      d       e         f       g
23/02/2009                               577.9102
24/02/2009                               579.1345
25/02/2009                               583.2158
26/02/2009                               629.7425
27/02/2009                               553.8306
02/03/2009  6.15   5.31   5.80   223716  790.8724          5.7916
03/03/2009  6.16   6.2    6.26   818424  770.6165          6.0161
04/03/2009  6.6    6.485  6.57   636544  858.5754  1.4331  6.4149
05/03/2009  6.1    5.98   6.06   810584  816.5025  1.7475  6.242
06/03/2009  5.845  5.95   6.00   331079  796.7618  1.7144  6.0427
09/03/2009  5.4    5.2    5.28   504271  744.0833  1.6449  5.4076
10/03/2009  5.93   5.59   5.595  906742  814.2862  1.4128  5.8434
Where columns a and g both have data, I would like to multiply them together using the following:
df["h"] = df["a"]*df["g"]
However, as you can see from the time series above, there is not always data with which to perform the calculation, and I am getting the following error:
KeyError: 'g'
Is there a way to check whether the data exists before performing the calculation? I am trying to use:
df["h"] = np.where((df.a == blank)|(df.g == blank),"",df.a*df.g)
I would like to have returned:
date        a      b      c      d       e         f       g       h
23/02/2009                               577.9102
24/02/2009                               579.1345
25/02/2009                               583.2158
26/02/2009                               629.7425
27/02/2009                               553.8306
02/03/2009  6.15   5.31   5.8    223716  790.8724          5.7916  1.0618
03/03/2009  6.16   6.2    6.26   818424  770.6165          6.0161  1.0239
04/03/2009  6.6    6.485  6.57   636544  858.5754  1.4331  6.4149  1.0288
05/03/2009  6.1    5.98   6.06   810584  816.5025  1.7475  6.242   0.9772
06/03/2009  5.845  5.95   6.00   331079  796.7618  1.7144  6.0427  0.9672
09/03/2009  5.4    5.2    5.28   504271  744.0833  1.6449  5.4076  0.9985
10/03/2009  5.93   5.59   5.595  906742  814.2862  1.4128  5.8434  1.0148
but I am unsure of the syntax for a blank data field. What should that be?
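Not part of the original thread, but a note that may answer the "blank" question: when pandas reads in empty cells they become NaN, and arithmetic propagates NaN automatically, so the plain product already leaves blanks blank. A minimal sketch under that assumption:

import numpy as np
import pandas as pd

# Hypothetical frame with blanks (NaN) in a and g
df = pd.DataFrame({"a": [np.nan, 6.15], "g": [np.nan, 5.7916]})

# NaN times anything is NaN, so rows missing a or g stay blank
df["h"] = df["a"] * df["g"]

# The explicit np.where form tests for missing values with isna()
df["h"] = np.where(df["a"].isna() | df["g"].isna(), np.nan, df["a"] * df["g"])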
I have a DataFrame like so:
In [10]: df.head()
Out[10]:
sand silt clay rho_b ... n \
5 25 60 5 25 60 5 25 60 5 ... 60
STID ...
ACME 73.0 60.3 52.5 19.7 23.9 25.9 7.2 15.7 21.5 1.27 ... 1.32
ADAX 61.1 51.1 47.6 22.0 25.4 24.6 16.9 23.5 27.8 1.01 ... 1.25
ALTU 23.8 17.8 14.3 40.0 45.2 40.9 36.2 37.0 44.8 1.57 ... 1.18
ALV2 33.3 21.2 19.8 31.4 29.7 29.8 35.3 49.1 50.5 1.66 ... 1.20
ANT2 55.6 57.5 47.7 34.9 31.1 26.8 9.4 11.3 25.5 1.49 ... 1.29
So for every STID (e.g. ACME, ADAX, ALTU), there's some property (e.g. sand, silt, clay) defined at three depths (5, 25, 60).
This structure makes it really easy to do per-depth calculations at each STID, e.g.:
In [12]: (df['sand'] + df['silt']).head()
Out[12]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
How can I neatly incorporate a calculated result back in to the DataFrame? For example, if I wanted to call the result of the above calculation 'notclay':
In [13]: df['notclay'] = df['sand'] + df['silt']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-a30bd9ba99c3> in <module>()
----> 1 df['notclay'] = df['sand'] + df['silt']
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
A target column needs to be defined for each of the three columns in the result, not just the single 'notclay' column.
I do have a solution using strict assignments, but I'm not very satisfied with it:
In [21]: df[[('notclay', 5), ('notclay', 25), ('notclay', 60)]] = df['sand'] + df['silt']
In [22]: df['notclay'].head()
Out[22]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
I have many other calculations to do similar to this one, and using a strict assignment every time seems tedious. I'm guessing there's a better/"right" way to do this. I think the question "add a field in pandas dataframe with MultiIndex columns" might contain the answer, but I don't understand the solutions very well (or even what a Panel is and whether it can help me).
Edit: Something I tried that doesn't work, prepending a category using concat:
In [36]: concat([df['sand'] + df['silt']], axis=1, keys=['notclay']).head()
Out[36]:
notclay
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
In [37]: df['notclay'] = concat([df['sand'] + df['silt']], axis=1, keys=['notclay'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Same ValueError raised as above.
Depending on your taste, this may be a nicer way to do it still using concat:
In [53]: df
Out[53]:
blah foo
1 2 3 1 2 3
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406
In [54]: newdf
Out[54]:
1 2 3
a 0.433377 0.806679 0.976298
b 0.593683 0.217415 0.086565
c 0.716244 0.908777 0.180252
d 0.031942 0.074283 0.745019
e 0.651517 0.393569 0.861616
In [56]: newdf.columns = pd.MultiIndex.from_product([['bar'], newdf.columns])
In [57]: pd.concat([df, newdf], axis=1)
Out[57]:
blah foo bar \
1 2 3 1 2 3 1
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324 0.433377
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368 0.593683
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477 0.716244
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083 0.031942
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406 0.651517
2 3
a 0.806679 0.976298
b 0.217415 0.086565
c 0.908777 0.180252
d 0.074283 0.745019
e 0.393569 0.861616
To store this in the original dataframe, you can simply assign the result back in the last line:
In [58]: df = pd.concat([df, newdf], axis=1)
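Applied to the 'notclay' example from the question, the same pattern would look something like this sketch:

# Build the result, lift its depth columns under a new top-level
# 'notclay' key, then concat it onto the original frame
notclay = df['sand'] + df['silt']
notclay.columns = pd.MultiIndex.from_product([['notclay'], notclay.columns])
df = pd.concat([df, notclay], axis=1)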