Wy arange and linspace does not produce equal objects? - python

Wy arange and linspace does not produce the same result on the code below?
a = np.arange(12,17,.2, dtype=float)
b = np.linspace(12.,16.8,25, dtype=float)
print(list(a)==list(b))
The number of elements and the elements theyself is the same, apparently. But:
a==b
results not equal.
I expect the output of '''True''', but the actual output is '''False'''.
See !https://imgur.com/qEvGcJW

for _a, _b in zip(a, b):
print (_a, _b)
Based on https://docs.python.org/3/tutorial/floatingpoint.html,
You can see why it is not same from result as following:
12.0 12.0
12.2 12.2
12.399999999999999 12.4
12.599999999999998 12.6
12.799999999999997 12.8
12.999999999999996 13.0
13.199999999999996 13.2
13.399999999999995 13.4
13.599999999999994 13.6
13.799999999999994 13.8
13.999999999999993 14.0
14.199999999999992 14.200000000000001
14.399999999999991 14.4
14.59999999999999 14.600000000000001
14.79999999999999 14.8
14.99999999999999 15.0
15.199999999999989 15.200000000000001
15.399999999999988 15.4
15.599999999999987 15.600000000000001
15.799999999999986 15.8
15.999999999999986 16.0
16.199999999999985 16.200000000000003
16.399999999999984 16.400000000000002
16.599999999999984 16.6
16.799999999999983 16.8
In temporary, We can use np.round for just this problem.
a = np.arange(12,17,.2, dtype=np.float)
a = np.round(a, 1)
b = np.linspace(12.,16.8,25, dtype=np.float)
b = np.round(a, 1)
print (np.array_equal(a,b))
It returns True.

Related

multiplying column of the file by exponential function

I,m struggling with multiplying one column file by an exponential function
so my equation is
y=10.43^(-x/3.0678)+0.654
The first values in the column are my X in the equation, so far I was able to multiply only by scalars but with exponential functions not
the file looks like this
8.09
5.7
5.1713
4.74
4.41
4.14
3.29
3.16
2.85
2.52
2.25
2.027
1.7
1.509
0.76
0.3
0.1
So after the calculations, my Y should get these values
8.7 0.655294908
8.09 0.656064021
5.7 0.6668238549
5.1713 0.6732091509
4.74 0.6807096436
4.41 0.6883719253
4.14 0.6962497391
3.29 0.734902438
3.16 0.7433536016
2.85 0.7672424605
2.52 0.7997286905
2.25 0.8331287249
2.027 0.8664148415
1.7 0.926724933
1.509 0.9695896976
0.76 1.213417197
0.3 1.449100509
0.1 1.580418766````
So far this code is working for me but it´s far away from what i want
from scipy.optimize import minimize_scalar
import math
col_list = ["Position"]
df = pd.read_csv("force.dat", usecols=col_list)
print(df)
A = df["Position"]
X = ((-A/3.0678+0.0.654)
print(X)
If I understand it correctly you just want to apply a function to a column in a pandas dataframe, right? If so, you can define the function:
def foo(x):
y = 10.43 ** (-x/3.0678)+0.654
return y
and apply it to df in a new column. If A is the column with the x values, then y will be
df['y'] = df.apply(foo,axis=1)
Now print(df) should give you the example result in your question.
You can do it in one line:
>>> df['y'] = 10.43 ** (- df['x']/3.0678)+0.654
>>> print(df)
x y
0 8.0900 0.656064
1 5.7000 0.666824
2 5.1713 0.673209
3 4.7400 0.680710
4 4.4100 0.688372
5 4.1400 0.696250
6 3.2900 0.734902
7 3.1600 0.743354
8 2.8500 0.767242
9 2.5200 0.799729
10 2.2500 0.833129
11 2.0270 0.866415
12 1.7000 0.926725
13 1.5090 0.969590
14 0.7600 1.213417
15 0.3000 1.449101
16 0.1000 1.580419

How to read specific lines from text using a starting and ending condition?

I have a document.gca file that contains specific information that I need, I'm trying to extract certain information, in a part of text repeats the next sentences:
#Sta/Elev= xx
(here goes pair numbers)
#Mann
This part of text repeats several times. My goal is to catch (the pair numbers) that are in that interval, and repeat this process in my text. How can I extract that? Say I have this:
Sta/Elev= 259
0 2186.31 .3 2186.14 .9 2185.83 1.4 2185.56 2.5 2185.23
3 2185.04 3.6 2184.83 4.7 2184.61 5.6 2184.4 6.4 2184.17
6.9 2183.95 7.5 2183.69 7.6 2183.59 8 2183.35 8.6 2182.92
10.2 2181.47 10.8 2181.03 11.3 2180.63 11.9 2180.27 12.4 2179.97
13 2179.72 13.6 2179.47 14.1 2179.3 14.3 2179.21 14.7 2179.11
15.7 2178.9 17.4 2178.74 17.9 2178.65 20.1 2178.17 20.4 2178.13
20.4 2178.12 21.5 2177.94 22.6 2177.81 22.6 2177.8 22.9 2177.79
24.1 2177.78 24.4 2177.75 24.6 2177.72 24.8 2177.68 25.2 2177.54
Mann= 3 , 0 , 0
0 .2 0 26.9 .2 0 46.1 .2 0
Bank Sta=26.9,46.1
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2176.01,0.3, 56
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
Type RM Length L Ch R = 1 ,2655 ,11.2,11.1,10.5
XS GIS Cut Line=4
858341.2470677761196439.12427935858354.9998313071196457.53292637
858369.2753539641196470.40256485858387.8228168661196497.81690065
Node Last Edited Time=Aug/05/2019 11:42:02
Sta/Elev= 245
0 2191.01 .8 2190.54 2.5 2189.4 5 2187.76 7.2 2186.4
8.2 2185.73 9.5 2184.74 10.1 2184.22 10.3 2184.04 10.8 2183.55
12.8 2180.84 13.1 2180.55 13.3 2180.29 13.9 2179.56 14.2 2179.25
14.5 2179.03 15.8 2178.18 16.4 2177.81 16.7 2177.65 17 2177.54
17.1 2177.51 17.2 2177.48 17.5 2177.43 17.6 2177.4 17.8 2177.39
18.3 2177.37 18.8 2177.37 19.7 2177.44 20 2177.45 20.6 2177.45
20.7 2177.45 20.8 2177.44 21 2177.42 21.3 2177.41 21.4 2177.4
21.7 2177.32 22 2177.26 22.1 2177.21 22.2 2177.13 22.5 2176.94
22.6 2176.79 22.9 2176.54 23.2 2176.19 23.5 2175.88 23.9 2175.68
24.4 2175.55 24.6 2175.54 24.8 2175.53 24.9 2175.53 25.1 2175.54
25.7 2175.63 26 2175.71 26.3 2175.78 26.4 2175.8 26.4 2175.82
#Mann= 3 , 0 , 0
0 .2 0 22.9 .2 0 43 .2 0
Bank Sta=22.9,43
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2175.68,0.3, 51
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
But I want to select the numbers between Sta/Elev and Mann and save as a pair vectors, for each Sta/Elev right now I have this:
import re
with open('a.g01','r') as file:
file_contents = file.read()
#print(file_contents)
try:
found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
except AttributeError:
found = '' # apply your error handling
print(found)
found is empty and I want to catch all the numbers in interval '#Sta/Elev and #Mann'
The problem is in your regex, try switching
found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
to
found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
output:
>>> import re
>>> file_contents = 'Sta/ElevthisisatestMann'
>>> found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
>>> print(found)
thisisatest
Edit:
For multiline matching try adding the DOTALL parameter:
found = re.search('Sta/Elev=(.*)Mann',file_contents, re.DOTALL).group(1)
It was not clear to me on what is the separating string, since they are different in your examples, but for that you can just change it in the regex expression

Slice MultiIndex pandas DataFrame by position

I am currently trying to to slice a MuliIndex DataFrame that has three levels by position.
I am using pandas 19.1
Level0 Level1 Level2 Value
03-00368 A Item111 6.9
03-00368 A Item333 19.2
03-00368 B Item111 9.7
03-00368 B Item222 17.4
04-00176 C Item110 17.4
04-00176 C Item111 9.7
04-00246 D Item46 12.5
04-00246 D Item66 5.6
04-00246 D Item99 11.2
04-00247 E Item23 12.5
04-00247 E Item24 5.6
04-00247 E Item111 11.2
04-00247 F Item23 7.9
04-00247 F Item24 9.7
04-00247 F Item111 12.5
04-00247 G Item46 11.2
04-00247 G Item66 9.7
04-00247 G Item999 9.7
04-00247 H Item23 11.2
04-00247 H Item94 7.9
04-00247 H Item111 11.2
04-00247 I Item46 5.6
04-00247 I Item66 12.5
04-00247 I Item888 11.2
04-00353 J Item66 12.5
04-00353 J Item99 12.5
04-00354 K Item43 12.5
04-00354 K Item94 12.5
04-00355 L Item54 50
04-00355 L Item99 50
Currently I can achieve:
df.loc[(slice('03-00368', '04-00361'), slice(None), slice(None)), :]
But in practice I won't know what the labels will be. I just want to select the first ten level 0's so I tried this(and many other things which are similar):
>>> df.iloc[(slice(0, 10), slice(None), slice(None)), :]
TypeError: unorderable types: int() >= NoneType()
The end goal is to limit the final number of rows displayed, without breaking up the Level0 index
>>>df.iloc[(0,1,), :]
Level0 Level1 Level2 Value
03-00368 A Item111 6.9
03-00368 A Item333 19.2
Notice that it only returned the first two rows, I would like the result to be:
Level0 Level1 Level2 Value
03-00368 A Item111 6.9
03-00368 A Item333 19.2
03-00368 B Item111 9.7
03-00368 B Item222 17.4
04-00176 C Item110 17.4
04-00176 C Item111 9.7
There are of hacky way to accomplish this but I'm posting because I want to know what I am doing wrong, or why I can't expect to be able to slice MultiIndexes this way.
method 1
groupby + head
df.groupby(level=0).head(10)
method 2
Unnecessarily verbose
IndexSlice
df.sort_index().loc[pd.IndexSlice[df.index.levels[0][:10], :, :], :]
method 3
loc
df.loc[df.index.levels[0][:10].tolist()]
You could groupby level and take the top two this way
df.groupby(level=0).head(2)

Add a calculated result with multiple columns to Pandas DataFrame with MultiIndex columns

I have a DataFrame like so:
In [10]: df.head()
Out[10]:
sand silt clay rho_b ... n \
5 25 60 5 25 60 5 25 60 5 ... 60
STID ...
ACME 73.0 60.3 52.5 19.7 23.9 25.9 7.2 15.7 21.5 1.27 ... 1.32
ADAX 61.1 51.1 47.6 22.0 25.4 24.6 16.9 23.5 27.8 1.01 ... 1.25
ALTU 23.8 17.8 14.3 40.0 45.2 40.9 36.2 37.0 44.8 1.57 ... 1.18
ALV2 33.3 21.2 19.8 31.4 29.7 29.8 35.3 49.1 50.5 1.66 ... 1.20
ANT2 55.6 57.5 47.7 34.9 31.1 26.8 9.4 11.3 25.5 1.49 ... 1.29
So for every STID (e.g. ACME, ADAX, ALTU), there's some property (e.g. sand, silt, clay) defined at three depths (5, 25, 60).
This structure makes it really easy to do per-depth calculations at each STID, e.g.:
In [12]: (df['sand'] + df['silt']).head()
Out[12]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
How can I neatly incorporate a calculated result back in to the DataFrame? For example, if I wanted to call the result of the above calculation 'notclay':
In [13]: df['notclay'] = df['sand'] + df['silt']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-a30bd9ba99c3> in <module>()
----> 1 df['notclay'] = df['sand'] + df['silt']
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Three columns are expected to be defined for each column in the result, not just the one 'notclay' column.
I do have a solution using strict assignments, but I'm not very satisfied with it:
In [21]: df[[('notclay', 5), ('notclay', 25), ('notclay', 60)]] = df['sand'] + df['silt']
In [22]: df['notclay'].head()
Out[22]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
I have many other calculations to do similar to this one, and using a strict assignment every time seems tedious. I'm guessing there's a better/"right" way to do this. I think add a field in pandas dataframe with MultiIndex columns might contain the answer, but I don't very well understand the solutions (or even what a Panel is and if it can help me).
Edit: Something I tried that doesn't work, prepending a category using concat:
In [36]: concat([df['sand'] + df['silt']], axis=1, keys=['notclay']).head()
Out[36]:
notclay
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
In [37]: df['notclay'] = concat([df['sand'] + df['silt']], axis=1, keys=['notclay'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Same ValueError raised as above.
Depending on your taste, this may be a nicer way to do it still using concat:
In [53]: df
Out[53]:
blah foo
1 2 3 1 2 3
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406
In [54]: newdf
Out[54]:
1 2 3
a 0.433377 0.806679 0.976298
b 0.593683 0.217415 0.086565
c 0.716244 0.908777 0.180252
d 0.031942 0.074283 0.745019
e 0.651517 0.393569 0.861616
In [56]: newdf.columns=pd.MultiIndex.from_product([['bar'], newdf.columns])
In [57]: pd.concat([df, newdf], axis=1)
Out[57]:
blah foo bar \
1 2 3 1 2 3 1
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324 0.433377
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368 0.593683
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477 0.716244
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083 0.031942
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406 0.651517
2 3
a 0.806679 0.976298
b 0.217415 0.086565
c 0.908777 0.180252
d 0.074283 0.745019
e 0.393569 0.861616
In order to store this into the original dataframe, you can simply assign to it in the last line:
In [58]: df = pd.concat([df, newdf], axis=1)

Python Empirical distribution function (ecdf) implementation

I am aware of statsmodels.tools.tools.ECDF but since the calculation of an empricial cumulative distribution function (ECDF) is pretty straight-forward and I want to minimise dependencies in my project, I want to code it manually.
In a given list() / np.array() Pandas.Series, the ECDF for each element can be calculated as given in Wikipedia:
I have the Pandas DataFrame ,dfser, below and I want to get the ecdf of the values column. My two one-liner solutions are given as well.
Is there a faster way to do this? Speed is important in my application.
# Note that in my case indices are unique identifiers so I cannot reset them.
import numpy as np
import pandas as pd
# all indices are unique, but there may be duplicate measurement values (that belong to different indices).
dfser = pd.DataFrame({'group':['a','b','b','a','d','c','e','e','c','a','b','d','d','c','d','e','e','a'],
'values':[2.01899E-06, 1.12186E-07, 8.97467E-07, 2.91257E-06, 1.93733E-05,
0.00017889, 0.000120963, 4.27643E-07, 3.33614E-07, 2.08352E-12,
1.39478E-05, 4.28255E-08, 9.7619E-06, 8.51787E-09, 1.28344E-09,
3.5063E-05, 0.01732035,2.08352E-12]},
index = [123, 532, 235, 645, 747, 856, 345, 245, 845, 248, 901, 712, 162, 126,
198,748, 127,395] )
# My 1st Solution - list comprehension
dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]
# My 2nd Solution - ranking
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
dfser
group values ecdf rank ecdf_r
123 a 2.018990e-06 0.555556 9.0 0.555556
532 b 1.121860e-07 0.333333 13.0 0.333333
235 b 8.974670e-07 0.500000 10.0 0.500000
645 a 2.912570e-06 0.611111 8.0 0.611111
747 d 1.937330e-05 0.777778 5.0 0.777778
856 c 1.788900e-04 0.944444 2.0 0.944444
345 e 1.209630e-04 0.888889 3.0 0.888889
245 e 4.276430e-07 0.444444 11.0 0.444444
845 c 3.336140e-07 0.388889 12.0 0.388889
248 a 2.083520e-12 0.111111 17.5 0.083333
901 b 1.394780e-05 0.722222 6.0 0.722222
712 d 4.282550e-08 0.277778 14.0 0.277778
162 d 9.761900e-06 0.666667 7.0 0.666667
126 c 8.517870e-09 0.222222 15.0 0.222222
198 d 1.283440e-09 0.166667 16.0 0.166667
748 e 3.506300e-05 0.833333 4.0 0.833333
127 e 1.732035e-02 1.000000 1.0 1.000000
395 a 2.083520e-12 0.111111 17.5 0.083333
Since you are already using pandas I think it will be silly not to use some of its features:
In [15]:
import numpy as np
from numpy import *
sq=ser.value_counts()
sq.sort_index().cumsum()*1./len(sq)
Out[15]:
2.083520e-12 0.058824
1.283440e-09 0.117647
8.517870e-09 0.176471
4.282550e-08 0.235294
1.121860e-07 0.294118
3.336140e-07 0.352941
4.276430e-07 0.411765
8.974670e-07 0.470588
2.018990e-06 0.529412
2.912570e-06 0.588235
9.761900e-06 0.647059
1.394780e-05 0.705882
1.937330e-05 0.764706
3.506300e-05 0.823529
1.209630e-04 0.882353
1.788900e-04 0.941176
1.732035e-02 1.000000
dtype: float64
And speed comparison
In [19]:
%timeit sq.sort_index().cumsum()*1./len(sq)
1000 loops, best of 3: 344 µs per loop
In [18]:
%timeit ser.value_counts().sort_index().cumsum()*1./len(ser.value_counts())
1000 loops, best of 3: 1.58 ms per loop
In [17]:
%timeit [sum( ser <= x)/float(len(ser)) for x in ser]
100 loops, best of 3: 3.31 ms per loop
If the values are all unique, the ser.value_counts() is no longer needed. That part is slow (Fetching unique values). All you need in that case is just to sort ser.
In [23]:
%timeit np.arange(1, ser.size+1)/float(ser.size)
10000 loops, best of 3: 11.6 µs per loop
The fastest version that I can think of is to use get vectorized:
In [35]:
np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
Out[35]:
array([ 0.55555556, 0.33333333, 0.5 , 0.61111111, 0.77777778,
0.94444444, 0.88888889, 0.44444444, 0.38888889, 0.11111111,
0.72222222, 0.27777778, 0.66666667, 0.22222222, 0.16666667,
0.83333333, 1. , 0.11111111])
Add let see:
In [37]:
%timeit dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]
100 loops, best of 3: 6 ms per loop
In [38]:
%%timeit
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
1000 loops, best of 3: 827 µs per loop
In [39]:
%timeit np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
10000 loops, best of 3: 91.1 µs per loop

Categories