How can I append to an existing Pandas DataFrame - python

This is what I have so far to retrieve data from polygon.io. For each symbol in sym I would like to append the last row of results to df:
import requests
import pandas as pd

sym = ['OCGN', 'TKAT', 'MMAT', 'MDIA', 'PHUN']
df = pd.DataFrame()
for stock in sym:
    # fromdate and to are date strings defined elsewhere
    url = (f'https://api.polygon.io/v2/aggs/ticker/{stock}/range/1/day/{fromdate}/{to}'
           '?adjusted=true&sort=asc&limit=50000&apiKey=Demo')
    tick = requests.get(url)
    tick = pd.json_normalize(tick.json()["results"])
    data = tick.iloc[[-1]]  # last row, kept as a one-row DataFrame
    df = df.append(data, ignore_index=True)
print(df)
Output:
v vw o c h l t
0 15426806.0 6.2736 6.31 6.03 6.66 6.030 1638334800000
1 464144.0 4.9949 5.16 4.73 5.28 4.640 1638334800000
2 8101699.0 3.5164 3.75 3.36 3.82 3.300 1638334800000
3 109407.0 5.0286 4.90 4.77 5.28 4.654 1638334800000
4 45679175.0 3.7679 3.01 3.25 3.26 2.780 1638334800000
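Note that df.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the usual replacement, collecting the rows in a list and concatenating once at the end (assuming the same sym, fromdate and to as above):
rows = []
for stock in sym:
    url = (f'https://api.polygon.io/v2/aggs/ticker/{stock}/range/1/day/{fromdate}/{to}'
           '?adjusted=true&sort=asc&limit=50000&apiKey=Demo')
    tick = pd.json_normalize(requests.get(url).json()["results"])
    rows.append(tick.iloc[[-1]])  # keep only the most recent bar
df = pd.concat(rows, ignore_index=True)
This is also faster than appending inside the loop, since every df.append copies the entire frame.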

Related

Animate a px.line chart with plotly express

I am learning plotly line animation and came across this question.
My df:
   Date        1Mo   2Mo   3Mo   6Mo   1Yr   2Yr
0  2023-02-12  4.66  4.77  4.79  4.89  4.50  4.19
1  2023-02-11  4.66  4.77  4.77  4.90  4.88  4.49
2  2023-02-10  4.64  4.69  4.72  4.88  4.88  4.79
3  2023-02-09  4.62  4.68  4.71  4.82  4.88  4.89
4  2023-02-08  4.60  4.61  4.72  4.83  4.89  4.89
How do I animate this dataframe so the frame has
x = [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr], and
y = the actual value on a date, e.g. y=df[df['Date']=="2023-02-08"], animation_frame = df['Date']?
I tried
plot = px.line(df, x=df.columns[1:], y=df['Date'], title="Treasury Yields", animation_frame=df_treasuries_yield['Date'])
No joy :(
I think the problem is that you cannot pass multiple columns to the animation_frame parameter. But we can get around this by converting your df from wide to long format using pd.melt: for your data, we want to take all of the values from [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr] and put them in a new column called "value", with a "variable" column telling us which original column each value came from.
df_long = pd.melt(df, id_vars=['Date'], value_vars=['1Mo', '2Mo', '3Mo', '6Mo', '1Yr', '2Yr'])
This will look like the following:
Date variable value
0 2023-02-12 1Mo 4.66
1 2023-02-11 1Mo 4.66
2 2023-02-10 1Mo 4.64
3 2023-02-09 1Mo 4.62
4 2023-02-08 1Mo 4.60
...
28 2023-02-09 2Yr 4.89
29 2023-02-08 2Yr 4.89
Now we can pass the argument animation_frame='Date' to px.line:
fig = px.line(df_long, x="variable", y="value", animation_frame="Date", title="Yields")
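A small follow-up: plotly express does not recompute axis ranges for each animation frame, so it is usually worth pinning the y-axis explicitly, e.g.:
fig = px.line(df_long, x="variable", y="value", animation_frame="Date",
              title="Yields", range_y=[df_long["value"].min(), df_long["value"].max()])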

What is the best way to populate a column of a dataframe with conditional values based on corresponding rows in another column?

I have a dataframe, df, in which I am attempting to fill in values within the empty "Set" column, depending on a condition. The condition is as follows: the value of the "Set" column needs to be "IN" whenever the "valence_median_split" column's value is "Low_Valence" in the corresponding row, and "OUT" in all other cases.
Please see below for an example of my attempt to solve this:
df.head()
Out[65]:
ID Category Num Vert_Horizon Description Fem_Valence_Mean \
0 Animals_001_h Animals 1 h Dead Stork 2.40
1 Animals_002_v Animals 2 v Lion 6.31
2 Animals_003_h Animals 3 h Snake 5.14
3 Animals_004_v Animals 4 v Wolf 4.55
4 Animals_005_h Animals 5 h Bat 5.29
Fem_Valence_SD Fem_Av/Ap_Mean Fem_Av/Ap_SD Arousal_Mean ... Contrast \
0 1.30 3.03 1.47 6.72 ... 68.45
1 2.19 5.96 2.24 6.69 ... 32.34
2 1.19 5.14 1.75 5.34 ... 59.92
3 1.87 4.82 2.27 6.84 ... 75.10
4 1.56 4.61 1.81 5.50 ... 59.77
JPEG_size80 LABL LABA LABB Entropy Classification \
0 263028 51.75 -0.39 16.93 7.86
1 250208 52.39 10.63 30.30 6.71
2 190887 55.45 0.25 4.41 7.83
3 282350 49.84 3.82 1.36 7.69
4 329325 54.26 -0.34 -0.95 7.82
valence_median_split temp_selection set
0 Low_Valence Animals_001_h
1 High_Valence NaN
2 Low_Valence Animals_003_h
3 Low_Valence Animals_004_v
4 Low_Valence Animals_005_h
[5 rows x 36 columns]
df['set'] = np.where(df.loc[df['valence_median_split'] == 'Low_Valence'], 'IN', 'OUT')
ValueError: Length of values does not match length of index
I can accomplish this by using .loc to separate the df into two different df's, but I am wondering if there is a more elegant solution using np.where or a similar approach.
Change to
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
If you need .loc:
df.loc[df['valence_median_split'] == 'Low_Valence', 'set'] = 'IN'
df.loc[df['valence_median_split'] != 'Low_Valence', 'set'] = 'OUT'
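If you ever need more than two outcomes, np.select (my addition, not part of the original answer) takes a list of conditions and a matching list of values:
conditions = [df['valence_median_split'] == 'Low_Valence']
df['set'] = np.select(conditions, ['IN'], default='OUT')
With a single condition this is equivalent to the np.where above, but it scales to any number of condition/value pairs.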

Not performing calculation on blank field in dataframe

I have a dataframe (df) with the following structure:
date        a      b      c      d       e         f       g
23/02/2009                               577.9102
24/02/2009                               579.1345
25/02/2009                               583.2158
26/02/2009                               629.7425
27/02/2009                               553.8306
02/03/2009  6.15   5.31   5.80   223716  790.8724          5.7916
03/03/2009  6.16   6.2    6.26   818424  770.6165          6.0161
04/03/2009  6.6    6.485  6.57   636544  858.5754  1.4331  6.4149
05/03/2009  6.1    5.98   6.06   810584  816.5025  1.7475  6.242
06/03/2009  5.845  5.95   6.00   331079  796.7618  1.7144  6.0427
09/03/2009  5.4    5.2    5.28   504271  744.0833  1.6449  5.4076
10/03/2009  5.93   5.59   5.595  906742  814.2862  1.4128  5.8434
Where columns a and g both have data, I would like to multiply them together using the following:
df["h"] = df["a"]*df["g"]
However, as you can see from the time series above, there is not always data with which to perform the calculation, and I am returned the following error:
KeyError: 'g'
Is there a way to check if the data exists before performing the calculation? I am trying to use:
df["h"] = np.where((df.a == blank)|(df.g == blank),"",df.a*df.g)
I would like to have returned:
date        a      b      c      d       e         f       g       h
23/02/2009                               577.9102
24/02/2009                               579.1345
25/02/2009                               583.2158
26/02/2009                               629.7425
27/02/2009                               553.8306
02/03/2009  6.15   5.31   5.8    223716  790.8724          5.7916  1.0618
03/03/2009  6.16   6.2    6.26   818424  770.6165          6.0161  1.0239
04/03/2009  6.6    6.485  6.57   636544  858.5754  1.4331  6.4149  1.0288
05/03/2009  6.1    5.98   6.06   810584  816.5025  1.7475  6.242   0.9772
06/03/2009  5.845  5.95   6.00   331079  796.7618  1.7144  6.0427  0.9672
09/03/2009  5.4    5.2    5.28   504271  744.0833  1.6449  5.4076  0.9985
10/03/2009  5.93   5.59   5.595  906742  814.2862  1.4128  5.8434  1.0148
but I am unsure of the syntax for a blank data field. What should that be?
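One way to handle this, sketched below on the assumption that the blank cells are NaN (which is how pandas represents empty fields after reading in a file): NaN already propagates through arithmetic, so the blank rows come out blank without any explicit check.
import pandas as pd

# guard against the KeyError by confirming both columns exist first
if {'a', 'g'}.issubset(df.columns):
    # NaN * anything is NaN, so rows with a blank a or g stay blank
    df['h'] = df['a'] * df['g']
If the blanks are empty strings rather than NaN, converting first with pd.to_numeric(df['a'], errors='coerce') turns them into NaN so the same multiplication works.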

Add a calculated result with multiple columns to Pandas DataFrame with MultiIndex columns

I have a DataFrame like so:
In [10]: df.head()
Out[10]:
sand silt clay rho_b ... n \
5 25 60 5 25 60 5 25 60 5 ... 60
STID ...
ACME 73.0 60.3 52.5 19.7 23.9 25.9 7.2 15.7 21.5 1.27 ... 1.32
ADAX 61.1 51.1 47.6 22.0 25.4 24.6 16.9 23.5 27.8 1.01 ... 1.25
ALTU 23.8 17.8 14.3 40.0 45.2 40.9 36.2 37.0 44.8 1.57 ... 1.18
ALV2 33.3 21.2 19.8 31.4 29.7 29.8 35.3 49.1 50.5 1.66 ... 1.20
ANT2 55.6 57.5 47.7 34.9 31.1 26.8 9.4 11.3 25.5 1.49 ... 1.29
So for every STID (e.g. ACME, ADAX, ALTU), there's some property (e.g. sand, silt, clay) defined at three depths (5, 25, 60).
This structure makes it really easy to do per-depth calculations at each STID, e.g.:
In [12]: (df['sand'] + df['silt']).head()
Out[12]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
How can I neatly incorporate a calculated result back in to the DataFrame? For example, if I wanted to call the result of the above calculation 'notclay':
In [13]: df['notclay'] = df['sand'] + df['silt']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-a30bd9ba99c3> in <module>()
----> 1 df['notclay'] = df['sand'] + df['silt']
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Three columns are expected to be defined for each column in the result, not just the one 'notclay' column.
I do have a solution using strict assignments, but I'm not very satisfied with it:
In [21]: df[[('notclay', 5), ('notclay', 25), ('notclay', 60)]] = df['sand'] + df['silt']
In [22]: df['notclay'].head()
Out[22]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
I have many other calculations similar to this one to do, and using a strict assignment every time seems tedious. I'm guessing there's a better/"right" way to do this. I think "add a field in pandas dataframe with MultiIndex columns" might contain the answer, but I don't understand the solutions very well (or even what a Panel is and whether it can help me).
Edit: Something I tried that doesn't work, prepending a category using concat:
In [36]: concat([df['sand'] + df['silt']], axis=1, keys=['notclay']).head()
Out[36]:
notclay
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
In [37]: df['notclay'] = concat([df['sand'] + df['silt']], axis=1, keys=['notclay'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Same ValueError raised as above.
Depending on your taste, this may be a nicer way to do it still using concat:
In [53]: df
Out[53]:
blah foo
1 2 3 1 2 3
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406
In [54]: newdf
Out[54]:
1 2 3
a 0.433377 0.806679 0.976298
b 0.593683 0.217415 0.086565
c 0.716244 0.908777 0.180252
d 0.031942 0.074283 0.745019
e 0.651517 0.393569 0.861616
In [56]: newdf.columns=pd.MultiIndex.from_product([['bar'], newdf.columns])
In [57]: pd.concat([df, newdf], axis=1)
Out[57]:
blah foo bar \
1 2 3 1 2 3 1
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324 0.433377
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368 0.593683
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477 0.716244
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083 0.031942
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406 0.651517
2 3
a 0.806679 0.976298
b 0.217415 0.086565
c 0.908777 0.180252
d 0.074283 0.745019
e 0.393569 0.861616
In order to store this into the original dataframe, you can simply assign to it in the last line:
In [58]: df = pd.concat([df, newdf], axis=1)
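Applied to the question's data, the same idea collapses to one step (a sketch reusing the notclay name from above):
df = pd.concat(
    [df, pd.concat([df['sand'] + df['silt']], axis=1, keys=['notclay'])],
    axis=1,
)
The inner concat is exactly the frame built in the question's edit; keys=['notclay'] prepends the top level, and the outer concat attaches it to the original DataFrame instead of assigning with df['notclay'] = ....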

Pandas groupby(),agg() - how to return results without the multi index?

I have a dataframe:
pe_odds[ [ 'EVENT_ID', 'SELECTION_ID', 'ODDS' ] ]
Out[67]:
EVENT_ID SELECTION_ID ODDS
0 100429300 5297529 18.00
1 100429300 5297529 20.00
2 100429300 5297529 21.00
3 100429300 5297529 22.00
4 100429300 5297529 23.00
5 100429300 5297529 24.00
6 100429300 5297529 25.00
When I use groupby and agg, I get results with a multi-index:
pe_odds.groupby( [ 'EVENT_ID', 'SELECTION_ID' ] )[ 'ODDS' ].agg( [ np.min, np.max ] )
Out[68]:
amin amax
EVENT_ID SELECTION_ID
100428417 5490293 1.71 1.71
5881623 1.14 1.35
5922296 2.00 2.00
5956692 2.00 2.02
100428419 603721 2.44 2.90
4387436 4.30 6.20
4398859 1.23 1.35
4574687 1.35 1.46
4881396 14.50 19.00
6032606 2.94 4.20
6065580 2.70 5.80
6065582 2.42 3.65
100428421 5911426 2.22 2.52
I have tried using as_index to return the results without the multi-index:
pe_odds.groupby( [ 'EVENT_ID', 'SELECTION_ID' ], as_index=False )[ 'ODDS' ].agg( [ np.min, np.max ], as_index=False )
But it still gives me a multi-index.
I can use .reset_index(), but it is very slow:
pe_odds.groupby( [ 'EVENT_ID', 'SELECTION_ID' ] )[ 'ODDS' ].agg( [ np.min, np.max ] ).reset_index()
Out[69]:
EVENT_ID SELECTION_ID amin amax
0 100428417 5490293 1.71 1.71
1 100428417 5881623 1.14 1.35
2 100428417 5922296 2.00 2.00
3 100428417 5956692 2.00 2.02
4 100428419 603721 2.44 2.90
5 100428419 4387436 4.30 6.20
How can I return the results without the multi-index, using parameters of the groupby and/or agg functions, and without having to resort to reset_index()?
The call below:
>>> gr = df.groupby(['EVENT_ID', 'SELECTION_ID'], as_index=False)
>>> res = gr.agg({'ODDS':[np.min, np.max]})
>>> res
EVENT_ID SELECTION_ID ODDS
amin amax
0 100429300 5297529 18 25
1 100429300 5297559 30 38
returns a frame with multi-index columns. If you do not want the columns to be a multi-index either, you may do:
>>> res.columns = list(map(''.join, res.columns.values))
>>> res
EVENT_ID SELECTION_ID ODDSamin ODDSamax
0 100429300 5297529 18 25
1 100429300 5297559 30 38
It is also possible to remove the multi-index on the columns using the pipe method, set_axis, and chaining (which I believe is more readable).
(
pe_odds
.groupby(by=['EVENT_ID', 'SELECTION_ID'] )
.agg([ np.min, np.max ])
.pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1))
)
This is the output without resetting the index.
ODDS_amin ODDS_amax
EVENT_ID SELECTION_ID
100429300 5297529 18.0 25.0
100429300 5297559 30.0 38.0
I have taken Kim's comment and optimised it (you don't need to use .to_flat_index() at all) into the below code. I believe this is the most pythonic (easy to understand) and elegant approach:
df.columns = ["_".join(col_name).rstrip('_') for col_name in df.columns]
An example use would be:
>>> df.columns = ["_".join(col_name).rstrip('_') for col_name in df.columns]
>>> df
EVENT_ID SELECTION_ID ODDS_amin ODDS_amax
0 100429300 5297529 18 25
1 100429300 5297559 30 38
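For completeness: pandas 0.25 added named aggregation, which sidesteps the column multi-index entirely. A sketch against the question's frame:
res = (
    pe_odds
    .groupby(['EVENT_ID', 'SELECTION_ID'], as_index=False)
    .agg(amin=('ODDS', 'min'), amax=('ODDS', 'max'))
)
Each keyword becomes a flat output column, so neither reset_index() nor any column renaming is needed.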
