I stored my data in an HDF5 file. The strange thing is that when I select from the table twice with the same condition, HDFStore gives different answers.
Can anyone tell me why?
In [2]: import pandas as pd
In [3]: store=pd.HDFStore("./data/m2016.h5","r")
In [4]: store
Out[4]:
<class 'pandas.io.pytables.HDFStore'>
File path: ./data/m2016.h5
/m2016 frame_table (typ->appendable,nrows->37202055,ncols->6,indexers->[index],dc->[dt,code])
In [5]: a=store.select('m2016',where="code='000001'")
In [6]: b=store.select('m2016',where="code='000001'")
In [7]: a.shape
Out[7]: (2388318, 6)
In [8]: b.shape
Out[8]: (2374525, 6)
In [9]: a.head()
Out[9]:
dt market code price volume preclose
85920 2016-01-04 09:30:00 0 000001 11.98 1102900 11.99
85921 2016-01-04 09:31:00 0 000001 11.96 289100 11.99
85922 2016-01-04 09:32:00 0 000001 11.97 361800 11.99
85923 2016-01-04 09:33:00 0 000001 12.00 279200 11.99
85924 2016-01-04 09:34:00 0 000001 12.00 405600 11.99
I tested this on all three of my computers, with these results:
PC1, OS: Windows Server 2012, Python: WinPython 2.7.10.3 (64-bit), select result is wrong.
PC2, OS: Windows 10, Python: WinPython 2.7.10.3 (64-bit), select result is wrong.
PC3, OS: Windows 7, Python: WinPython 2.7.10.3 (64-bit), select result is OK!
Does HDFStore.select only work correctly on Windows 7?
Maybe the default encoding of your operating system varies?
Would this work: b=store.select('m2016',where="code=u'000001'")
I have tested this further on my PC running Windows 7, and it still gives randomly wrong results.
In [1]: import pandas as pd
In [2]: cd /projects
C:\projects
In [3]: store=pd.HDFStore("./data/m2016.h5","r")
In [4]: d0=store.select("m2016",where='dt<Timestamp("2016-01-10")')
In [5]: d1=store.select("m2016",where='dt<Timestamp("2016-01-10")')
In [6]: d0.shape
Out[6]: (6917149, 6)
In [7]: d1.shape
Out[7]: (4199769, 6)
In [8]: d0.tail()
Out[8]:
dt market code price volume preclose
455381 2016-04-21 11:11:00 1 600461 13.33 16400 13.2
455386 2016-04-21 11:16:00 1 600461 13.36 13800 13.2
455387 2016-04-21 11:17:00 1 600461 13.37 8300 13.2
455388 2016-04-21 11:18:00 1 600461 13.36 9800 13.2
455389 2016-04-21 11:19:00 1 600461 13.34 15300 13.2
In [9]: d1.tail()
Out[9]:
dt market code price volume preclose
573543 2016-04-22 14:03:00 1 601333 3.94 8200 3.97
573548 2016-04-22 14:08:00 1 601333 3.96 45000 3.97
573549 2016-04-22 14:09:00 1 601333 3.96 8800 3.97
573550 2016-04-22 14:10:00 1 601333 3.97 10700 3.97
573551 2016-04-22 14:11:00 1 601333 3.96 6800 3.97
In [10]: !ptdump m2016.h5
/ (RootGroup) ''
/m2016 (Group) ''
/m2016/table (Table(50957318,), shuffle, zlib(9)) ''
I have uploaded my HDF5 file here.
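As a workaround while the where-clause behaviour is being diagnosed, a minimal sketch (assuming the table fits in memory) is to read the table once and filter in pandas, which sidesteps the PyTables selection entirely:

import pandas as pd

# Read the whole table once, then filter in memory; repeated filters on the
# same in-memory frame are deterministic by construction.
with pd.HDFStore("./data/m2016.h5", "r") as store:
    full = store.select("m2016")

a = full[full["code"] == "000001"]
b = full[full["code"] == "000001"]
assert a.shape == b.shape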
I am learning Plotly line animation and came across this question.
My df:
   Date        1Mo   2Mo   3Mo   6Mo   1Yr   2Yr
0  2023-02-12  4.66  4.77  4.79  4.89  4.50  4.19
1  2023-02-11  4.66  4.77  4.77  4.90  4.88  4.49
2  2023-02-10  4.64  4.69  4.72  4.88  4.88  4.79
3  2023-02-09  4.62  4.68  4.71  4.82  4.88  4.89
4  2023-02-08  4.60  4.61  4.72  4.83  4.89  4.89
How do I animate this dataframe so that each frame has
x = [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr], and
y = the actual values on a date, e.g. y=df[df['Date']=="2023-02-08"], with animation_frame = df['Date']?
I tried
plot = px.line(df, x=df.columns[1:], y=df['Date'], title="Treasury Yields", animation_frame=df_treasuries_yield['Date'])
No joy :(
I think the problem is that you cannot pass multiple columns to the animation_frame parameter. But we can get around this by converting your df from wide to long format using pd.melt: for your data, we want to take all of the values from [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr], put them in a new column called "value", and add a column called "variable" that tells us which column each value came from.
df_long = pd.melt(df, id_vars=['Date'], value_vars=['1Mo', '2Mo', '3Mo', '6Mo', '1Yr', '2Yr'])
This will look like the following:
Date variable value
0 2023-02-12 1Mo 4.66
1 2023-02-11 1Mo 4.66
2 2023-02-10 1Mo 4.64
3 2023-02-09 1Mo 4.62
4 2023-02-08 1Mo 4.60
...
28 2023-02-09 2Yr 4.89
29 2023-02-08 2Yr 4.89
Now we can pass the argument animation_frame='Date' to px.line:
fig = px.line(df_long, x="variable", y="value", animation_frame="Date", title="Yields")
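For reference, a minimal end-to-end sketch assuming the five-row frame shown above (the fixed range_y is an arbitrary choice so the y-axis does not jump between frames):

import pandas as pd
import plotly.express as px

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Date': ['2023-02-12', '2023-02-11', '2023-02-10', '2023-02-09', '2023-02-08'],
    '1Mo': [4.66, 4.66, 4.64, 4.62, 4.60],
    '2Mo': [4.77, 4.77, 4.69, 4.68, 4.61],
    '3Mo': [4.79, 4.77, 4.72, 4.71, 4.72],
    '6Mo': [4.89, 4.90, 4.88, 4.82, 4.83],
    '1Yr': [4.50, 4.88, 4.88, 4.88, 4.89],
    '2Yr': [4.19, 4.49, 4.79, 4.89, 4.89],
})

# Wide -> long: one row per (Date, maturity) pair
df_long = pd.melt(df, id_vars=['Date'],
                  value_vars=['1Mo', '2Mo', '3Mo', '6Mo', '1Yr', '2Yr'])

# One animation frame per date
fig = px.line(df_long, x='variable', y='value', animation_frame='Date',
              range_y=[4.0, 5.0], title='Treasury Yields')
fig.show()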
I have a dataframe, df, in which I am attempting to fill in the values of the empty "Set" column based on a condition. The condition is as follows: the value of the 'Set' column needs to be "IN" whenever the 'valence_median_split' column's value is 'Low_Valence' in the corresponding row, and "OUT" in all other cases.
Please see below for an example of my attempt to solve this:
df.head()
Out[65]:
ID Category Num Vert_Horizon Description Fem_Valence_Mean \
0 Animals_001_h Animals 1 h Dead Stork 2.40
1 Animals_002_v Animals 2 v Lion 6.31
2 Animals_003_h Animals 3 h Snake 5.14
3 Animals_004_v Animals 4 v Wolf 4.55
4 Animals_005_h Animals 5 h Bat 5.29
Fem_Valence_SD Fem_Av/Ap_Mean Fem_Av/Ap_SD Arousal_Mean ... Contrast \
0 1.30 3.03 1.47 6.72 ... 68.45
1 2.19 5.96 2.24 6.69 ... 32.34
2 1.19 5.14 1.75 5.34 ... 59.92
3 1.87 4.82 2.27 6.84 ... 75.10
4 1.56 4.61 1.81 5.50 ... 59.77
JPEG_size80 LABL LABA LABB Entropy Classification \
0 263028 51.75 -0.39 16.93 7.86
1 250208 52.39 10.63 30.30 6.71
2 190887 55.45 0.25 4.41 7.83
3 282350 49.84 3.82 1.36 7.69
4 329325 54.26 -0.34 -0.95 7.82
valence_median_split temp_selection set
0 Low_Valence Animals_001_h
1 High_Valence NaN
2 Low_Valence Animals_003_h
3 Low_Valence Animals_004_v
4 Low_Valence Animals_005_h
[5 rows x 36 columns]
df['set'] = np.where(df.loc[df['valence_median_split'] == 'Low_Valence'], 'IN', 'OUT')
ValueError: Length of values does not match length of index
I can accomplish this by using loc to separate the df into two different df's, but I am wondering if there is a more elegant solution using np.where or a similar approach.
Change it to the following; np.where expects a boolean condition (a mask the same length as the frame), not the filtered rows that df.loc[...] returns, which is why your version raised the length mismatch error:
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
If you need .loc:
df.loc[df['valence_median_split'] == 'Low_Valence','set']='IN'
df.loc[df['valence_median_split'] != 'Low_Valence','set']='OUT'
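A minimal sketch on a toy frame (hypothetical data, just to show that np.where takes the boolean mask itself):

import numpy as np
import pandas as pd

df = pd.DataFrame({'valence_median_split': ['Low_Valence', 'High_Valence', 'Low_Valence']})

# The condition is a boolean Series the same length as df, so np.where works
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
print(df)
#   valence_median_split  set
# 0          Low_Valence   IN
# 1         High_Valence  OUT
# 2          Low_Valence   IN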
I have a SQL table like this:
Ticker Return Shares
AGJ 2.20 1265
ATA 1.78 698
ARS 9.78 10939
ARE -7.51 -26389
AIM 0.91 1758
ABT 10.02 -5893
AC -5.73 -2548
ATD 6.51 7850
AP 1.98 256
ALA -9.58 8524
So essentially, a table of stocks I've longed/shorted.
I want to find the top 4 best performers in this table: the shorts (shares < 0) with the lowest returns and the longs (shares > 0) with the highest returns.
Essentially, returning this:
Ticker Return Shares
ARS 9.78 10939
ARE -7.51 -26389
AC -5.73 -2548
ATD 6.51 7850
How would I be able to write the query that lets me do this?
Or, if it's easier, if there are any pandas functions that would do the same thing if I turned this table into a pandas dataframe.
Something like this (TOP is SQL Server syntax; on databases such as PostgreSQL or MySQL you would use ORDER BY ... LIMIT 4 instead):
select top (4) t.*
from t
order by (case when shares < 0 then - [return] else [return] end) desc;
Pandas solution:
In [134]: df.loc[(np.sign(df.Shares)*df.Return).nlargest(4).index]
Out[134]:
Ticker Return Shares
2 ARS 9.78 10939
3 ARE -7.51 -26389
7 ATD 6.51 7850
6 AC -5.73 -2548
Explanation:
In [137]: (np.sign(df.Shares)*df.Return)
Out[137]:
0 2.20
1 1.78
2 9.78
3 7.51
4 0.91
5 -10.02
6 5.73
7 6.51
8 1.98
9 -9.58
dtype: float64
In [138]: (np.sign(df.Shares)*df.Return).nlargest(4)
Out[138]:
2 9.78
3 7.51
7 6.51
6 5.73
dtype: float64
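An equivalent formulation, sketched on a reconstruction of the sample table, puts the signed performance in a helper column and uses DataFrame.nlargest on it:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['AGJ', 'ATA', 'ARS', 'ARE', 'AIM', 'ABT', 'AC', 'ATD', 'AP', 'ALA'],
    'Return': [2.20, 1.78, 9.78, -7.51, 0.91, 10.02, -5.73, 6.51, 1.98, -9.58],
    'Shares': [1265, 698, 10939, -26389, 1758, -5893, -2548, 7850, 256, 8524],
})

# Signed performance: longs keep their return, shorts flip the sign
top4 = (df.assign(perf=np.sign(df.Shares) * df.Return)
          .nlargest(4, 'perf')
          .drop(columns='perf'))
print(top4)
#   Ticker  Return  Shares
# 2    ARS    9.78   10939
# 3    ARE   -7.51  -26389
# 7    ATD    6.51    7850
# 6     AC   -5.73   -2548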
I have a DataFrame like so:
In [10]: df.head()
Out[10]:
sand silt clay rho_b ... n \
5 25 60 5 25 60 5 25 60 5 ... 60
STID ...
ACME 73.0 60.3 52.5 19.7 23.9 25.9 7.2 15.7 21.5 1.27 ... 1.32
ADAX 61.1 51.1 47.6 22.0 25.4 24.6 16.9 23.5 27.8 1.01 ... 1.25
ALTU 23.8 17.8 14.3 40.0 45.2 40.9 36.2 37.0 44.8 1.57 ... 1.18
ALV2 33.3 21.2 19.8 31.4 29.7 29.8 35.3 49.1 50.5 1.66 ... 1.20
ANT2 55.6 57.5 47.7 34.9 31.1 26.8 9.4 11.3 25.5 1.49 ... 1.29
So for every STID (e.g. ACME, ADAX, ALTU), there's some property (e.g. sand, silt, clay) defined at three depths (5, 25, 60).
This structure makes it really easy to do per-depth calculations at each STID, e.g.:
In [12]: (df['sand'] + df['silt']).head()
Out[12]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
How can I neatly incorporate a calculated result back in to the DataFrame? For example, if I wanted to call the result of the above calculation 'notclay':
In [13]: df['notclay'] = df['sand'] + df['silt']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-a30bd9ba99c3> in <module>()
----> 1 df['notclay'] = df['sand'] + df['silt']
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Three columns are expected to be defined for each column in the result, not just the one 'notclay' column.
I do have a solution using strict assignments, but I'm not very satisfied with it:
In [21]: df[[('notclay', 5), ('notclay', 25), ('notclay', 60)]] = df['sand'] + df['silt']
In [22]: df['notclay'].head()
Out[22]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
I have many other calculations to do similar to this one, and using a strict assignment every time seems tedious. I'm guessing there's a better/"right" way to do this. I think the question "add a field in pandas dataframe with MultiIndex columns" might contain the answer, but I don't fully understand the solutions (or even what a Panel is and whether it can help me).
Edit: Something I tried that doesn't work, prepending a category using concat:
In [36]: concat([df['sand'] + df['silt']], axis=1, keys=['notclay']).head()
Out[36]:
notclay
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
In [37]: df['notclay'] = concat([df['sand'] + df['silt']], axis=1, keys=['notclay'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Same ValueError raised as above.
Depending on your taste, this may be a nicer way to do it, still using concat:
In [53]: df
Out[53]:
blah foo
1 2 3 1 2 3
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406
In [54]: newdf
Out[54]:
1 2 3
a 0.433377 0.806679 0.976298
b 0.593683 0.217415 0.086565
c 0.716244 0.908777 0.180252
d 0.031942 0.074283 0.745019
e 0.651517 0.393569 0.861616
In [56]: newdf.columns=pd.MultiIndex.from_product([['bar'], newdf.columns])
In [57]: pd.concat([df, newdf], axis=1)
Out[57]:
blah foo bar \
1 2 3 1 2 3 1
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324 0.433377
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368 0.593683
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477 0.716244
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083 0.031942
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406 0.651517
2 3
a 0.806679 0.976298
b 0.217415 0.086565
c 0.908777 0.180252
d 0.074283 0.745019
e 0.393569 0.861616
To store this back into the original dataframe, you can simply assign the result in the last line:
In [58]: df = pd.concat([df, newdf], axis=1)
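Applied to the question's own data, the failing attempt was nearly there; the notclay block just needs to be concatenated and the result rebound, rather than assigned to df['notclay']. A sketch assuming the sand/silt frame from the question:

import pandas as pd

# Build the derived block with 'notclay' as the new top-level column label
notclay = pd.concat([df['sand'] + df['silt']], axis=1, keys=['notclay'])

# Concatenate along the columns and rebind df
df = pd.concat([df, notclay], axis=1)

df['notclay'].head()  # columns 5, 25, 60 under the new 'notclay' level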
I want to do interpolation for a Pandas series of the following structure
X
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197
Name: Y, dtype: float64
I want to interpolate the data so that I have a new series covering X=[23:392:1]. I looked at the documentation but didn't find where I could supply the new x-axis. Did I miss something? How can I interpolate onto the new x-axis?
This can be done with pandas's reindex and interpolate:
In [27]: s
Out[27]:
1
0
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197
[14 rows x 1 columns]
In [28]: idx = pd.Index(np.arange(23, 392))
In [29]: s.reindex(s.index + idx).interpolate(method='values')
Out[29]:
1
22.88 3.047000
23.00 3.047882
24.00 3.055227
25.00 3.062573
26.00 3.069919
27.00 3.077265
28.00 3.084611
29.00 3.091957
30.00 3.099303
31.00 3.106648
32.00 3.113994
33.00 3.121340
34.00 3.128686
35.00 3.136032
36.00 3.143378
37.00 3.150724
38.00 3.158070
39.00 3.165415
40.00 3.172761
41.00 3.180107
42.00 3.187453
43.00 3.194799
44.00 3.202145
45.00 3.209491
45.75 3.215000
46.00 3.216235
47.00 3.221174
48.00 3.226112
The idea is to create the index you want (s.index + idx), which is sorted automatically, reindex on that (which introduces NaNs at the new points), and then interpolate to fill those NaNs using method='values', which interpolates based on the index values.
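One caveat: on older pandas, adding two Index objects performed a set union, which is what s.index + idx relies on; on current pandas that expression attempts element-wise addition instead, so the union should be spelled out. A sketch of the same idea on modern pandas:

import numpy as np
import pandas as pd

# Union of the original index and the new integer grid, then interpolate on index values
new_idx = s.index.union(pd.Index(np.arange(23, 392)))
s_interp = s.reindex(new_idx).interpolate(method='values')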
You can call numpy.interp() directly:
import numpy as np
import pandas as pd
import io
data = """x y
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197"""
s = pd.read_csv(io.StringIO(data), delim_whitespace=True, index_col=0).squeeze("columns")
new_idx = np.arange(23, 393)
new_val = np.interp(new_idx, s.index.values.astype(float), s.values)
s2 = pd.Series(new_val, index=new_idx)
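Note that np.interp does linear interpolation only (and assumes an increasing x-grid, which holds here). If a smoother fit is wanted, a sketch using SciPy's interp1d, with kind='cubic' as an illustrative choice:

from scipy.interpolate import interp1d

# Cubic interpolation evaluated on the same integer grid
f = interp1d(s.index.values.astype(float), s.values, kind='cubic')
s3 = pd.Series(f(new_idx), index=new_idx)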