Python genfromtxt file path - python

I have an extremely basic problem with the numpy.genfromtxt function. I'm using the Enthought Canopy package: where shall I save the file.txt I want to use, or how shall I tell Python where to look for it? When using IDLE I simply save the file in a preset folder such as C:\Users\Davide\Python\data.txt and what I get is
>>> import numpy as np
>>> np.genfromtxt('data.txt')
array([[ 33.1 , 32.6 , 18.2 , 17.9 ],
[ 32.95, 32.7 , 17.95, 17.9 ],
[ 32.9 , 32.6 , 18. , 17.9 ],
[ 33. , 32.65, 18. , 17.9 ],
[ 32.95, 32.65, 18.05, 17.9 ],
[ 33. , 32.6 , 18. , 17.9 ],
[ 33.05, 32.7 , 18. , 17.9 ],
[ 33.05, 32.5 , 18.1 , 17.9 ],
[ 33. , 32.6 , 18.05, 17.9 ],
[ 33. , 32.55, 18. , 17.95]])
while working with Canopy the same code gives IOError: data.txt not found, nor something like np.genfromtxt('C:\Users\Davide\Python\data.txt') works. I'm sorry for the question's banality but I'm really going crazy with this. Thanks for help.

You can pass a fully qualified path but this:
np.genfromtxt('C:\Users\Davide\Python\data.txt')
won't work because back slashes need to be escaped:
np.genfromtxt('C:\\Users\\Davide\\Python\\data.txt')
or you could use a raw string:
np.genfromtxt(r'C:\Users\Davide\Python\data.txt')
As to where the currect saved location is you can query this using os.getcwd():
In [269]:
import os
os.getcwd()
Out[269]:
'C:\\WinPython-64bit-3.4.3.1\\notebooks\\docs'

Related

Np Array values are being changed without doing stuff

Why does :
print(np.delete(MatrixAnalytics(Cmp),[0],1))
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
returns :
[[ 2. 2. 2. 2. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 0. 2. 2.]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 0. 0.]
[ 1. 2. 2. 0. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 2. 2. nan]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. nan]]
Second Print
[[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. -1. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]]
This is weird, and can't figure this out. Why Would the values change without any line of code between 3 print ?
def MatrixAnalytics(DataMatrix):
AnalyzedMatrix = DataMatrix
for i in range(len(AnalyzedMatrix)): #Browse Each Column
for j in range(len(AnalyzedMatrix[i])): #Browse Each Line
if j>0:
if AnalyzedMatrix[i][j] > 50:
if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
AnalyzedMatrix[i][j] = 2
else:
AnalyzedMatrix[i][j] = 1
else:
if AnalyzedMatrix[i][j] <50:
if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
AnalyzedMatrix[i][j] = 0
else:
AnalyzedMatrix[i][j] = -1
return AnalyzedMatrix
The input array is :
[[55. 57.6 57.2 57. 51.1 55.9]
[55.3 54.7 56.1 55.8 52.7 55.5]
[55.5 52. 52.2 49.9 53.8 55.6]
[54.9 57.8 57.6 53.6 54.2 59.9]
[47.9 50.7 53.3 52.5 49.9 45.8]
[57. 56.2 58.3 55.4 47.9 56.5]
[56.6 54.2 57.6 54.7 50.1 53.6]
[54.7 53.4 52. 52. 50.9 nan]
[51.4 51.5 51.2 53. 50.1 50.1]
[55.3 58.7 59.2 56.4 53. nan]]
It seems that it call again the function MatrixAnalytics But I don't understand why
**
Doing this works :
**
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print(MyNewMatrix)
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
I think I got the issue.
In this code :
def MatrixAnalytics(DataMatrix):
AnalyzedMatrix = DataMatrix
...
...
return AnalyzedMatrix
AnalyzedMatrix is not a copy of DataMatrix, it's referencing to the same object in memory !
So on the first call of MatrixAnalytics, your are actually modifying the object behind the reference given as argument (because arrays are mutable).
In the second call, your are giving the same reference as argument so the array behind it has already been modified.
note : return AnalyzedMatrix statement just returns the a new reference to the object referenced by the DataMatrix argument (not a copy).
Try to replace this line :
AnalyzedMatrix = DataMatrix
with this one (in your definition of MatrixAnalytics) :
AnalyzedMatrix = np.copy(DataMatrix)
For more info :
mutable vs unmutable
numpy.delete()
numpy.copy()
I believe you want same output in both the cases,
Sadly the thing is np.delete performs changes in the array itself, so when you called the first line (np.delete(MatrixAnalytics(Cmp),[0],1))
it deletes the 0th column and saves it in matrixanalytics, so never call this function in print statement, either call it during assignment or even without assignment as it will make the changes in the given array itself, but never in print since the column would be lost in the print statement.

How to extract a DataFrame to obtain a nested array?

I have a sample DataFrame as below:
First column consists of 2 years, for each year, 2 track exist and each track includes pairs of longitude and latitude coordinated. How can I extract every track for each year separately to obtain an array of tracks with lat and long?
df = pd.DataFrame(
{'year':[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1],
'track_number':[0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1],
'lat': [11.7,11.8,11.9,11.9,12.0,12.1,12.2,12.2,12.3,12.3,12.4,12.5,12.6,12.6,12.7,12.8],
'long':[-83.68,-83.69,-83.70,-83.71,-83.71,-83.73,-83.74,-83.75,-83.76,-83.77,-83.78,-83.79,-83.80,-83.81,-83.82,-83.83]})
You can groupby year and then extract a numpy.array from the created dataframes with .to_numpy().
>>> years = []
>>> for _, df2 in df.groupby(["year"]):
years.append(df2.to_numpy()[:, 1:])
>>> years[0]
array([[ 0. , 11.7 , -83.68],
[ 0. , 11.8 , -83.69],
[ 0. , 11.9 , -83.7 ],
[ 0. , 11.9 , -83.71],
[ 1. , 12. , -83.71],
[ 1. , 12.1 , -83.73],
[ 1. , 12.2 , -83.74],
[ 1. , 12.2 , -83.75]])
>>> years[1]
array([[ 0. , 12.3 , -83.76],
[ 0. , 12.3 , -83.77],
[ 0. , 12.4 , -83.78],
[ 0. , 12.5 , -83.79],
[ 1. , 12.6 , -83.8 ],
[ 1. , 12.6 , -83.81],
[ 1. , 12.7 , -83.82],
[ 1. , 12.8 , -83.83]])
Where years[0] would have the desired information for the year 0. And so on. Inside the array, the positions of the original dataframe are preserved. That is, the first element is the track; the second, the latitude, and the third, the longitude.
If you wish to do the same for the track, i.e, have an array of only latitude and longitude, you can groupby(["year", "track_number"]) as well.

Interpolation of gridded data

I was hoping someone could help me with a problem that Ive been having (I'm still very new to python). I have been trying to interpolate data from a 50x4 array that is read from an excel sheet seen below.
[ 60. 0. 23.88 22.38 ]
[ 60. 5. 19.508 28.2 ]
[ 60. 10. 16.9 32.23 ]
[ 60. 15. 15.4 34.03 ]
[ 60. 20. 14.4 35.12 ]
[ 60. 25. 13.66 36.02 ]
[ 60. 30. 13.14 36.61 ]
[ 60. 35. 12.69 37.14 ]
[ 60. 40. 12.53 37.56 ]
[ 60. 50. 12.33 38.32 ]
[ 70. 0. 19.3 21.34 ]
[ 70. 5. 16.06 25.37 ]
[ 70. 10. 13.74 28.08 ]
[ 70. 15. 12.33 40.07 ]
[ 70. 20. 11.45 41.78 ]
[ 70. 25. 10.77 42.8 ]
etc...
What I'm trying to achieve is to enter 2 values (say 65 and 12) which correspond to interpolated values in the 1st and 2nd column, and it would return the interpolated values for columns 3 and 4. I managed to get it working using the griddata function in matlab. However no luck in python yet.
Thanks in advance
I think that scipy.interpolate might do the same (or at least similar) as MATLAB's Griddata. Below code uses the Radial Basis Function for interpolation. I've only made the example for your column 3 as z-axis.
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
x = np.array([60] * 10 + [70] * 6)
y = np.array([0,5,10,15,20,25,30,35,40,50,0,5,10,15,20,25])
z = np.array([23.88, 19.508, 16.9, 15.4, 14.4, 13.66, 13.14, 12.69, 12.53, 12.33, 19.3, 16.06, 13.74, 12.33, 11.45, 10.77])
x_ticks = np.linspace(60, 70, 11)
y_ticks = np.linspace(0, 50, 51)
XI, YI = np.meshgrid(x_ticks, y_ticks)
rbf = interpolate.Rbf(x, y, z, epsilon=2)
ZI = rbf(XI, YI)
print(ZI[np.argwhere(y_ticks==12)[0][0], np.argwhere(x_ticks==65)[0][0]])
>>> 14.222288614849171
Be aware that the result is ZI[y,x], not ZI[x,y]. Also be aware that your ticks must contain the x and y values you query, otherwise you'll get an IndexError.
Maybe you can build up on that solution depending on your needs.

Not able to get my head around this python

I just implemented a hierarchical clustering by following the documentation here: http://www.mathworks.com/help/stats/hierarchical-clustering.html?s_tid=doc_12b
So, let me try to put down what I am trying to do.
Take a look at the following figure:
Now, this dendogram is generated from the following data:
node1 node2 dist(node1,node2) num_elems
assigning index **37 to [ 16. 26**. 1.14749118 2. ]
assigning index 38 to [ 4. 7. 1.20402602 2. ]
assigning index 39 to [ 13. 29. 1.44708015 2. ]
assigning index 40 to [ 12. 18. 1.45827365 2. ]
assigning index 41 to [ 10. 34. 1.49607538 2. ]
assigning index 42 to [ 17. 38. 1.52565922 3. ]
assigning index 43 to [ 8. 25. 1.58919037 2. ]
assigning index 44 to [ 3. 40. 1.60231007 3. ]
assigning index 45 to [ 6. 42. 1.65755731 4. ]
assigning index 46 to [ 15. 23. 1.77770844 2. ]
assigning index 47 to [ 24. 33. 1.77771082 2. ]
assigning index 48 to [ 20. 35. 1.81301111 2. ]
assigning index 49 to [ 19. 48. 1.9191061 3. ]
assigning index 50 to [ 0. 44. 1.94238609 4. ]
assigning index 51 to [ 2. 36. 2.0444266 2. ]
assigning index 52 to [ 39. 45. 2.11667375 6. ]
assigning index 53 to [ 32. 43. 2.17132916 3. ]
assigning index 54 to [ 21. 41. 2.2882061 3. ]
assigning index 55 to [ 9. 30. 2.34492327 2. ]
assigning index 56 to [ 5. 51. 2.38383321 3. ]
assigning index 57 to [ 46. 52. 2.42100025 8. ]
assigning index 58 to [ **28. 37**. 2.48365024 3. ]
assigning index 59 to [ 50. 53. 2.57305009 7. ]
assigning index 60 to [ 49. 57. 2.69459675 11. ]
assigning index 61 to [ 11. 54. 2.75669475 4. ]
assigning index 62 to [ 22. 27. 2.77163751 2. ]
assigning index 63 to [ 47. 55. 2.79303418 4. ]
assigning index 64 to [ 14. 60. 2.88015327 12. ]
assigning index 65 to [ 56. 59. 2.95413905 10. ]
assigning index 66 to [ 61. 65. 3.12615829 14. ]
assigning index 67 to [ 64. 66. 3.28846304 26. ]
assigning index 68 to [ 31. 58. 3.3282066 4. ]
assigning index 69 to [ 63. 67. 3.47397104 30. ]
assigning index 70 to [ 62. 68. 3.63807605 6. ]
assigning index 71 to [ 1. 69. 4.09465969 31. ]
assigning index 72 to [ 70. 71. 4.74129435 37.
So basically, there are 37 points in my data same indexed from 0-36..Now, when I see the first element in this list... I assign i + len(thiscompletelist) + 1
So for example, when the id is 37 seen again in future iterations, then that basically means that it is linked to a branch as well.
I used matlab to generate this image. But I want to query this information as query_node(node_id) such that it returns me a list by level.. such that... on query_node(37) I get
{ "left": {"level":1 {"id": 28}} , "right":{"level":0 {"left" :"id":16},"right":{"id":26}}}
Actually.. I dont even know what is the right data structure to do this..
Basically I want to query by node and gain some insight on what does the structure of this dendogram looks like when I am standing on that node and looking below. :(
EDIT 1:
*OOH I didn't knew that you wont be able to zoom the image.. basically the fourth element from the left is 28 and the green entry is the first row of the data..
So fourth vertical line on dendogram represents 28
Next to that line (the first green line) represents 16
and next to that line (the second green line) represents 26*
Well it's always good to build upon something already existing so take a look at dendrogram in scipy.

Rolling median in python

I have some stock data based on daily close values. I need to be able to insert these values into a python list and get a median for the last 30 closes. Is there a python library that does this?
In pure Python, having your data in a Python list a, you could do
median = sum(sorted(a[-30:])[14:16]) / 2.0
(This assumes a has at least 30 items.)
Using the NumPy package, you could use
median = numpy.median(a[-30:])
Have you considered pandas? It is based on numpy and can automatically associate timestamps with your data, and discards any unknown dates as long as you fill it with numpy.nan. It also offers some rather powerful graphing via matplotlib.
Basically it was designed for financial analysis in python.
isn't the median just the middle value in a sorted range?
so, assuming your list is stock_data:
last_thirty = stock_data[-30:]
median = sorted(last_thirty)[15]
Now you just need to get the off-by-one errors found and fixed and also handle the case of stock_data being less than 30 elements...
let us try that here a bit:
def rolling_median(data, window):
if len(data) < window:
subject = data[:]
else:
subject = data[-30:]
return sorted(subject)[len(subject)/2]
#found this helpful:
list=[10,20,30,40,50]
med=[]
j=0
for x in list:
sub_set=list[0:j+1]
median = np.median(sub_set)
med.append(median)
j+=1
print(med)
Here is a much faster method with w*|x| space complexity.
def moving_median(x, w):
shifted = np.zeros((len(x)+w-1, w))
shifted[:,:] = np.nan
for idx in range(w-1):
shifted[idx:-w+idx+1, idx] = x
shifted[idx+1:, idx+1] = x
# print(shifted)
medians = np.median(shifted, axis=1)
for idx in range(w-1):
medians[idx] = np.median(shifted[idx, :idx+1])
medians[-idx-1] = np.median(shifted[-idx-1, -idx-1:])
return medians[(w-1)//2:-(w-1)//2]
moving_median(np.arange(10), 4)
# Output
array([0.5, 1. , 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8. ])
The output has the same length as the input vector.
Rows with less than one entry will be ignored and with half of them nans (happens only for an even window-width), only the first option will be returned. Here is the shifted_matrix from above with the respective median values:
[[ 0. nan nan nan] -> -
[ 1. 0. nan nan] -> 0.5
[ 2. 1. 0. nan] -> 1.0
[ 3. 2. 1. 0.] -> 1.5
[ 4. 3. 2. 1.] -> 2.5
[ 5. 4. 3. 2.] -> 3.5
[ 6. 5. 4. 3.] -> 4.5
[ 7. 6. 5. 4.] -> 5.5
[ 8. 7. 6. 5.] -> 6.5
[ 9. 8. 7. 6.] -> 7.5
[nan 9. 8. 7.] -> 8.0
[nan nan 9. 8.] -> -
[nan nan nan 9.]]-> -
The behaviour can be changed by adapting the final slice medians[(w-1)//2:-(w-1)//2].
Benchmark:
%%timeit
moving_median(np.arange(1000), 4)
# 267 µs ± 759 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Alternative approach: (the results will be shifted)
def moving_median_list(x, w):
medians = np.zeros(len(x))
for j in range(len(x)):
medians[j] = np.median(x[j:j+w])
return medians
%%timeit
moving_median_list(np.arange(1000), 4)
# 15.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Both algorithms have a linear time complexity.
Therefore, the function moving_median will be the faster option.

Categories