I have a bunch of data points in a time series in a pandas DataFrame. Each column is supposedly independent of the others. I want to create a Monte Carlo process to calculate expected values for each of the columns. For that, my assumption is that the underlying data follows a Brownian motion pattern, so I need to generate a normal distribution over the differences between consecutive points in time.
I transform my data like this:
diffs = (data.diff() / data.shift(1))
This is what I have at the moment:
data = diffs.describe()
This gives the following output:
A B C
count 4986.000000 4963.000000 1861.000000
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
I process it like this to generate more samples:
import numpy as np
desired_samples = 1000
random = np.random.default_rng().normal(loc=[data.loc[["mean"]].to_numpy()], scale=[data.loc[["std"]].to_numpy()], size=[len(data.columns), desired_samples])
However this gives me an error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (441, 1000) and arg 1 with shape (1, 1, 441).
What I want is just a matrix of random values whose columns have the same std and mean as the sample's columns, i.e. so that when I do random.describe(), I'd get something like:
A B C
count 1000.0 1000.0 1000.0
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
What'd be the correct way to generate those samples?
You could use apply() to create a data frame of random normal values using the attributes of the associated columns.
Generate Test Data
nv = 50
d = {'A':np.random.normal(1,1,nv),'B':np.random.normal(2,2,nv),'C':np.random.normal(3,3,nv)}
df = pd.DataFrame(d)
print(df)
A B C
0 0.276252 -2.833479 5.746740
1 1.562030 1.497242 2.557416
2 0.883105 -0.861824 3.106192
3 0.352372 0.014653 4.006219
4 1.475524 3.151062 -1.392998
5 2.011649 -2.289844 4.371251
6 3.230964 3.578058 0.610422
7 0.366506 3.391327 0.812932
8 1.669673 -1.021665 4.262500
9 1.835547 4.292063 6.983015
10 1.768208 4.029970 3.971751
...
45 0.501706 0.926860 7.008008
46 1.759266 -0.215047 4.560403
47 1.899167 0.690204 -0.538415
48 1.460267 1.506934 1.306303
49 1.641662 1.066182 0.049233
df.describe()
A B C
count 50.000000 50.000000 50.000000
mean 0.962083 1.522234 2.992492
std 1.073733 1.848754 2.838976
Generate Random Values with Approximately the Same (Calculated) Mean and STD
mat = df.apply(lambda x: np.random.normal(x.mean(),x.std(),100))
print(mat)
A B C
0 0.234955 2.201961 1.910073
1 1.973203 3.528576 5.925673
2 -0.858201 2.234295 1.741338
3 2.245650 2.805498 0.135784
4 1.913691 2.134813 2.246989
.. ... ... ...
95 2.996207 2.248727 2.792658
96 0.663609 4.533541 1.518872
97 0.848259 -0.348086 2.271724
98 3.672370 1.706185 -0.862440
99 0.392051 0.832358 -0.354981
[100 rows x 3 columns]
mat.describe()
A B C
count 100.000000 100.000000 100.000000
mean 0.877725 1.332039 2.673327
std 1.148153 1.749699 2.447532
If you want the result as a NumPy array:
mat.to_numpy()
array([[ 0.78881292, 3.09428714, -1.22757096],
[ 0.13044099, -1.02564025, 2.6566989 ],
[ 0.06090083, 1.50629474, 3.61487469],
[ 0.71418932, 1.88441111, 5.84979454],
[ 2.34287411, 2.58478867, -4.04433653],
[ 1.41846256, 0.36414635, 8.47482082],
[ 0.46765842, 1.37188986, 3.28011085],
[ 0.87433273, 3.45735286, 1.13351138],
[ 1.59029413, 4.0227165 , 3.58282534],
[ 2.23663894, 2.75007385, -0.36242541],
[ 1.80967311, 1.29206572, 1.73277577],
[ 1.20787923, 2.75529187, 4.64721489],
[ 2.33466341, 6.43830387, 4.31354348],
[ 0.87379125, 3.00658046, 4.94270155],
etc ...
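As a side note (not part of the original answer), a vectorized sketch that avoids apply() and also fixes the broadcasting error from the question; it reuses the test df from above, and for the real data you would pass diffs instead:
import numpy as np
import pandas as pd

rng = np.random.default_rng()
desired_samples = 1000

# 1-D arrays of per-column means and stds broadcast against
# size=(desired_samples, n_columns), so each output column gets its own parameters.
samples = rng.normal(loc=df.mean().to_numpy(),
                     scale=df.std().to_numpy(),
                     size=(desired_samples, len(df.columns)))
random_df = pd.DataFrame(samples, columns=df.columns)
print(random_df.describe())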
I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):
In[60]: Results.columns
Out[60]:
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
'Total', 'Unnamed: 9', '30-50', 'Above',
...
'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
'Share', 'Turnout'],
dtype='object', length=147)
So...
In[63]: Results.head()
Out[63]:
Constituency Region Country ID Type \
PAID
1 Aberavon Wales Wales W07000049 County
2 Aberconwy Wales Wales W07000058 County
3 Aberdeen North Scotland Scotland S14000001 Burgh
4 Aberdeen South Scotland Scotland S14000002 Burgh
5 Aberdeenshire West & Kincardine Scotland Scotland S14000058 County
Electorate Total Unnamed: 9 30-50 Above ... WP WRP WVPTFP \
PAID ...
1 49821 31523 NaN NaN NaN ... NaN NaN NaN
2 45525 30148 NaN NaN NaN ... NaN NaN NaN
3 67745 43936 NaN NaN NaN ... NaN NaN NaN
4 68056 48551 NaN NaN NaN ... NaN NaN NaN
5 73445 55196 NaN NaN NaN ... NaN NaN NaN
Yorks Young Zeb Party Votes Share Turnout
PAID
1 NaN NaN NaN Lab 15416 0.489040 0.632725
2 NaN NaN NaN Con 12513 0.415052 0.662230
3 NaN NaN NaN SNP 24793 0.564298 0.648550
4 NaN NaN NaN SNP 20221 0.416490 0.713398
5 NaN NaN NaN SNP 22949 0.415773 0.751528
[5 rows x 147 columns]
The per-constituency results for each party are given in the columns Results.ix[:, 'Unnamed: 9': 'Zeb'].
I can find the winning party (i.e. the party which polled highest number of votes) and the number of votes it polled using:
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)
But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?
To get the highest values of a column, you can use nlargest():
df['High'].nlargest(2)
The above will give you the 2 highest values of column High.
You can also use nsmallest() to get the lowest values.
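Since the question asks for the second-highest value per row rather than per column, a hedged row-wise sketch of the same idea (assuming RawResults is the block of per-party vote columns from the question):
second_votes = RawResults.apply(lambda row: row.nlargest(2).iloc[-1], axis=1)   # second-highest vote count per row
second_party = RawResults.apply(lambda row: row.nlargest(2).index[-1], axis=1)  # name of the second-place party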
Here is a NumPy solution:
In [120]: df
Out[120]:
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404, 1.52208164],
[ 1.28237078, 2.05657214],
[ 0.17379254, 0.95558613],
[ 1.06729107, 1.20100071],
[ 0.86201603, 1.28471676],
[ 1.19706331, 1.57417327],
[ 0.61145573, 1.35202868],
[ 0.15513379, 0.40842477],
[ 0.28792928, 1.42722604],
[ 0.48749578, 2.41126532]])
or as a pandas DataFrame:
In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
2nd-largest largest
0 1.334444 1.522082
1 1.282371 2.056572
2 0.173793 0.955586
3 1.067291 1.201001
4 0.862016 1.284717
5 1.197063 1.574173
6 0.611456 1.352029
7 0.155134 0.408425
8 0.287929 1.427226
9 0.487496 2.411265
or a faster solution from @Divakar:
In [6]: df
Out[6]:
a b c d e f g h
0 0.649517 -0.223116 0.264734 -1.121666 0.151591 -1.335756 -0.155459 -2.500680
1 0.172981 1.233523 0.220378 1.188080 -0.289469 -0.039150 1.476852 0.736908
2 -1.904024 0.109314 0.045741 -0.341214 -0.332267 -1.363889 0.177705 -0.892018
3 -2.606532 -0.483314 0.054624 0.979734 0.205173 0.350247 -1.088776 1.501327
4 1.627655 -1.261631 0.589899 -0.660119 0.742390 -1.088103 0.228557 0.714746
5 0.423972 -0.506975 -0.783718 -2.044002 -0.692734 0.980399 1.007460 0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797 0.599724 -0.884832 -0.086431 -0.738298
7 1.131621 1.218199 0.645709 0.066216 -0.265023 0.606963 -0.194694 0.463576
8 0.421164 0.626731 -0.547738 0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690 0.696714 -0.155727 -0.991975 -0.806530 1.454522 0.788688 0.409516
In [7]: a = df.values
In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665, 0.26473378],
[ 1.47685226, 1.23352348],
[ 0.17770473, 0.10931398],
[ 1.50132666, 0.97973383],
[ 1.62765464, 0.74238959],
[ 1.00745981, 0.98039898],
[ 0.5997243 , -0.0864306 ],
[ 1.21819904, 1.13162068],
[ 0.98982033, 0.62673128],
[ 1.45452173, 0.78868785]])
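If you also need the column labels and not just the values, a small sketch (my addition, not part of the original answer) that reuses the same argpartition indices:
idx = np.argpartition(-a, np.arange(2), axis=1)[:, :2]   # column positions of the top two per row
top2_vals = a[np.arange(len(df))[:, None], idx]           # largest and second-largest values
top2_cols = df.columns.to_numpy()[idx]                    # their column labels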
Here is an interesting approach: what if we replace the maximum value with the minimum value and then take the max again? It is a quick hack, though, and not recommended!
first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(), df.min()).idxmax()
first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
You could just sort your results, such that the first rows will contain the max. Then you can simply use indexing to get the first n places.
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].sort_values(by='votes', ascending=False)
RawResults.iloc[0, :] # First place
RawResults.iloc[1, :] # Second place
RawResults.iloc[n, :] # nth place
Here is a solution using the nlargest function:
>>> df
a b c
0 4 20 2
1 5 10 2
2 3 40 5
3 1 50 10
4 2 30 15
>>> def give_largest(col,n):
... largest = col.nlargest(n).reset_index(drop = True)
... data = [x for x in largest]
... index = [f'{i}_largest' for i in range(1,len(largest)+1)]
... return pd.Series(data,index=index)
...
...
>>> def n_largest(df, axis, n):
... '''
... Function to return the n-largest value of each
... column/row of the input DataFrame.
... '''
... return df.apply(give_largest, axis = axis, n = n)
...
>>> n_largest(df,axis = 1, n = 2)
1_largest 2_largest
0 20 4
1 10 5
2 40 5
3 50 10
4 30 15
>>> n_largest(df,axis = 0, n = 2)
a b c
1_largest 5 50 15
2_largest 4 40 10
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
def second_largest(df):
    return df.nlargest(2).min()
print(df.apply(second_largest))
a 4
b 40
c 20
dtype: int64
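The same helper also works row-wise, which is what the election question needs; a minimal usage sketch:
print(df.apply(second_largest, axis=1))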
df
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
Transpose and use nlargest in a for loop to get the results ordered for each line:
df1=df.T
results=list()
for col in df1.columns: results.append(df1[col].nlargest(len(df.columns)))
The results variable is a list of pandas Series, where the first item in the list is the df's first row sorted in descending order, and so on. Since the frame was transposed, each Series carries the df's column labels as its index, so for every row you get both the sorted values and the column names they came from.
results
[h 1.522082
a 1.334444
b 0.322029
c 0.302296
g -0.157942
e -0.360488
d -0.841236
f -0.860188
Name: 0, dtype: float64,
a 2.056572
g 1.282371
b 0.991643
f 0.533202
e 0.235132
c 0.160067
d -0.066473
h -2.050731
Name: 1, dtype: float64,
....
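For example, the second-place value for the first row and the column it came from would then be (a small usage sketch):
second_value = results[0].iloc[1]    # 1.334444
second_column = results[0].index[1]  # 'a'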
I have a CSV file that looks like the one shown in the picture. There are multiple rows like this whose values are zero in between. For such a row, I want a value interpolated from the rows above and below it. I used df.interpolate(method='linear', limit_direction='forward') to interpolate. However, the zero values are not treated as NaN values, so it didn't work for me.
First replace all the zeros with np.nan, and then the interpolation will work correctly:
import pandas as pd
import numpy as np
data = [
[7260,-4.458639405975710,-4.,7.E-08,0.1393070275997700,0.,-0.11144176562682400],
[8030,-4.452569075111660,-4.,4.E-08,0.1347428577024860,-0.1001462206643270,-0.04915374942019220],
[498,-4.450785570790800,-4.437233532812810,1.E-07,0.1577349354100960,-0.1628636478696300,-0.05505793797144350],
[1500,-4.450303023388150,-4.429207978066990,1.E-07,0.1219543073754720,-0.1886731968341070,-0.14408112469719300],
[6600,-4.462030024237730,-4.4286701710604900,4.E-08,0.100803412848051,-0.1840333872203410,-0.18430271378600200],
[8860,0.0,0.0,0.0,0.0,0.0,0.0],
[530,-4.453994378096950,-4.0037494206318200,-9.E-08,0.0594973737919224,1.0356594366090900,-0.03173366589936420],
[6904,-4.449221525263950,-3.1840342819501800,-2.E-07,0.0918042463623589,1.5125956674286500,-0.01150704151230230],
[7700,-4.454965896625150,-3.041102261967650,-1.E-07,0.1211292098853800,1.837772463779190,0.0680406376006960],
[6463,-4.4524324374160600,-3.1096025723730000,-4.E-08,0.1920291560629040,2.062490856824510,0.10665282217392200],
]
df = pd.DataFrame(data, columns=range(98, 105)) \
       .replace(0, np.nan) \
       .interpolate(method='linear', limit_direction='forward')
print(df)
Giving:
98 99 100 101 102 103 104
0 7260 -4.458639 -4.000000 7.000000e-08 0.139307 NaN -0.111442
1 8030 -4.452569 -4.000000 4.000000e-08 0.134743 -0.100146 -0.049154
2 498 -4.450786 -4.437234 1.000000e-07 0.157735 -0.162864 -0.055058
3 1500 -4.450303 -4.429208 1.000000e-07 0.121954 -0.188673 -0.144081
4 6600 -4.462030 -4.428670 4.000000e-08 0.100803 -0.184033 -0.184303
5 8860 -4.458012 -4.216210 -2.500000e-08 0.080150 0.425813 -0.108018
6 530 -4.453994 -4.003749 -9.000000e-08 0.059497 1.035659 -0.031734
7 6904 -4.449222 -3.184034 -2.000000e-07 0.091804 1.512596 -0.011507
8 7700 -4.454966 -3.041102 -1.000000e-07 0.121129 1.837772 0.068041
9 6463 -4.452432 -3.109603 -4.000000e-08 0.192029 2.062491 0.106653
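If zeros are legitimate values in some columns, a hedged variant is to restrict the replacement to the columns that actually need interpolating (the integer labels here are just the ones used above):
cols = list(range(99, 105))   # leave the first column (98) untouched
df[cols] = df[cols].replace(0, np.nan) \
                   .interpolate(method='linear', limit_direction='forward')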
I have been trying to export data stored as a NumPy array to a GrADS flat binary file.
It seems GrADS does not recognize the Z dimension given in the .ctl file.
Whatever value I use for 'set z <n>', GrADS only shows the first level.
Here is the minimal reproduction of my problem
my python code:
import numpy as np
from array import array
data = np.linspace(1, 60, num=60, endpoint=True)
data = np.reshape(data, [5, 4, 3])
print(data)
with open('temp.dat', 'ab') as wf:
    float_array = array('f', data.flatten())
    float_array.tofile(wf)
Executing this writes the numbers [1, 2, 3, ..., 60] as single-precision floats to a binary file.
my .ctl file:
DSET ^temp.dat
TITLE title
UNDEF -9.99E33
XDEF 3 LINEAR 0.0 1
YDEF 4 LINEAR 0.0 1
ZDEF 5 LEVELS 0 1 2 3 4
TDEF 1 LINEAR 0Z10apr1991 12hr
VARS 1
var 0 99 some var
ENDVARS
This pair of .dat and .ctl files shows the first 12 numbers as the first level of the field, as expected:
ga-> open temp.ctl
Scanning description file: temp.ctl
Data file temp.dat is open as file 1
LON set to 0 2
LAT set to 0 3
LEV set to 0 0
Time values set: 1991:4:10:0 1991:4:10:0
E set to 1 1
ga-> set digsize 0.6
digsiz = 0.6
ga-> set lon -1 3
LON set to -1 3
ga-> set lat -1 4
LAT set to -1 4
ga-> set gxout grid
ga-> d var
However, if I try to 'set z 2' it still shows the first level.
Moreover, with var(z=1) and var(z=3) being
var(z=1)=
[[ 1. 2. 3.]
[ 4. 5. 6.]
[ 7. 8. 9.]
[10. 11. 12.]]
var(z=3)=
[[25. 26. 27.]
[28. 29. 30.]
[31. 32. 33.]
[34. 35. 36.]]
the difference below should be a constant field of 24, but GrADS shows 0, as if z=3 were the same as z=1:
ga-> c
ga-> d var(z=3) - var(z=1)
What is even harder to understand is that if I add a var2 to the .ctl file, GrADS recognizes what should be var(z=2) as var2(z=1)!
I know there are plenty of visualization tools better than GrADS, but I need to reuse legacy GrADS scripts, so using it is unavoidable.
Did I write the binary in the wrong order, or is the binary file missing a header or separator or something?
I would be glad if anyone knows why this is happening.
Thanks in advance.
My colleague has just pointed out that the problem is in the .ctl file.
I had set the variable's level count to zero, which GrADS interprets as a "surface" variable: a single-level field that can be overlaid on any level above it.
With the corrected .ctl file I am able to show any level of the field.
My corrected .ctl file:
DSET ^temp.dat
TITLE title
UNDEF -9.99E33
XDEF 3 LINEAR 0.0 1
YDEF 4 LINEAR 0.0 1
ZDEF 5 LEVELS 0 1 2 3 4
TDEF 1 LINEAR 0Z10apr1991 12hr
VARS 1
* this should be 5, not zero, because there are five levels!
* var 0 99 some var
var 5 99 some var
ENDVARS
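For reference, the write order in the Python snippet is consistent with this descriptor: with shape (5, 4, 3) read as (z, y, x) and C-order flattening, x varies fastest, which matches XDEF 3 / YDEF 4 / ZDEF 5. A small sketch to confirm the slices match the values quoted in the question:
import numpy as np

data = np.linspace(1, 60, num=60).reshape(5, 4, 3)   # (z, y, x)
print(data[0])   # level z=1: values 1..12
print(data[2])   # level z=3: values 25..36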
I have a dataframe like this:
YAU OTBL HLE
2009-03-08 nan nan nan
2009-03-09 1.59904743 1.66397210 1.67345829
2009-03-10 -0.37065629 -0.36541822 -0.36015840
2009-03-11 -0.41055669 0.60004777 0.00536958
This is my function.
def get_covariance_returns(returns):
    return np.cov(returns.values)
The returns parameter is a DataFrame of returns for each ticker and date. The output should be a 2-dimensional ndarray: the covariance of the returns.
When I run my code I have:
AssertionError: Wrong shape for output returns_covariance. Got (4, 4), expected (3, 3)
Now, I modified my function like this:
def get_covariance_returns(returns):
    return np.cov(returns.values, rowvar=False)
My result is:
OUTPUT returns_covariance:
[[ nan nan nan]
[ nan nan nan]
[ nan nan nan]]
Please note that the expected output is:
EXPECTED OUTPUT FOR returns_covariance:
[[ 0.89856076 0.7205586 0.8458721 ]
[ 0.7205586 0.78707297 0.76450378]
[ 0.8458721 0.76450378 0.83182775]]
I need some guidance on what's wrong with my implementation, please. I am programming in Python.
You could use np.cov if you drop the NaNs:
>>> np.cov(df.dropna().values, rowvar=False)
array([[ 1.31997225, 1.01614032, 1.2238726 ],
[ 1.01614032, 1.0304141 , 1.04243784],
[ 1.2238726 , 1.04243784, 1.17528792]])
Or more simply, use pandas' .cov(), which automatically excludes NaN:
>>> df.cov()
YAU OTBL HLE
YAU 1.319972 1.016140 1.223873
OTBL 1.016140 1.030414 1.042438
HLE 1.223873 1.042438 1.175288
[EDIT]: Based on your expected output, you are actually replacing NaN with zero:
>>> np.cov(df.replace(np.nan, 0).values, rowvar=False)
array([[ 0.89856076, 0.7205586 , 0.8458721 ],
[ 0.7205586 , 0.78707297, 0.76450378],
[ 0.8458721 , 0.76450378, 0.83182775]])
>>> df.replace(np.nan, 0).cov()
YAU OTBL HLE
YAU 0.898561 0.720559 0.845872
OTBL 0.720559 0.787073 0.764504
HLE 0.845872 0.764504 0.831828
I'll leave my original post anyway because it shows the distinction between the two cov approaches.