How to extract a DataFrame to obtain a nested array?

I have a sample DataFrame, shown below.
The first column holds 2 years; each year has 2 tracks, and each track is a sequence of latitude and longitude coordinate pairs. How can I extract every track for each year separately, to obtain an array of tracks with lat and long?
import pandas as pd
df = pd.DataFrame(
    {'year':[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1],
     'track_number':[0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1],
     'lat': [11.7,11.8,11.9,11.9,12.0,12.1,12.2,12.2,12.3,12.3,12.4,12.5,12.6,12.6,12.7,12.8],
     'long':[-83.68,-83.69,-83.70,-83.71,-83.71,-83.73,-83.74,-83.75,-83.76,-83.77,-83.78,-83.79,-83.80,-83.81,-83.82,-83.83]})

You can group by year and then extract a numpy.array from each resulting DataFrame with .to_numpy().
>>> years = []
>>> for _, df2 in df.groupby(["year"]):
...     years.append(df2.to_numpy()[:, 1:])
>>> years[0]
array([[ 0. , 11.7 , -83.68],
[ 0. , 11.8 , -83.69],
[ 0. , 11.9 , -83.7 ],
[ 0. , 11.9 , -83.71],
[ 1. , 12. , -83.71],
[ 1. , 12.1 , -83.73],
[ 1. , 12.2 , -83.74],
[ 1. , 12.2 , -83.75]])
>>> years[1]
array([[ 0. , 12.3 , -83.76],
[ 0. , 12.3 , -83.77],
[ 0. , 12.4 , -83.78],
[ 0. , 12.5 , -83.79],
[ 1. , 12.6 , -83.8 ],
[ 1. , 12.6 , -83.81],
[ 1. , 12.7 , -83.82],
[ 1. , 12.8 , -83.83]])
Here years[0] holds the desired information for year 0, years[1] for year 1, and so on. Inside each array, the column order of the original DataFrame is preserved (minus the year column, which the [:, 1:] slice drops). That is, the first element of each row is the track number; the second, the latitude; and the third, the longitude.
If you wish to do the same per track, i.e. have an array of only latitude and longitude for each track, you can use groupby(["year", "track_number"]) as well.
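The grouping on both columns could be sketched as follows. This is a minimal sketch that assumes the column names from the question; the `tracks` dictionary and its layout (year, then a list of per-track arrays) are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [0] * 8 + [1] * 8,
    'track_number': [0, 0, 0, 0, 1, 1, 1, 1] * 2,
    'lat': [11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.2,
            12.3, 12.3, 12.4, 12.5, 12.6, 12.6, 12.7, 12.8],
    'long': [-83.68, -83.69, -83.70, -83.71, -83.71, -83.73, -83.74, -83.75,
             -83.76, -83.77, -83.78, -83.79, -83.80, -83.81, -83.82, -83.83],
})

# Hypothetical container: tracks[year] is a list of (n_points, 2) arrays,
# one per track, each holding only the lat and long columns.
tracks = {}
for (year, track_number), track_df in df.groupby(['year', 'track_number']):
    coords = track_df[['lat', 'long']].to_numpy()
    tracks.setdefault(int(year), []).append(coords)

print(tracks[0][1])  # lat/long pairs of track 1 in year 0
```

Because groupby sorts the group keys by default, the position in each list corresponds to the track number in this example.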

Related

Np Array values are being changed without doing stuff

Why does this code:
print(np.delete(MatrixAnalytics(Cmp),[0],1))
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
return:
[[ 2. 2. 2. 2. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 0. 2. 2.]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 0. 0.]
[ 1. 2. 2. 0. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 2. 2. nan]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. nan]]
SecondPrint
[[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. -1. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]]
This is weird, and I can't figure it out. Why would the values change when there is no line of code between the prints?
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    for i in range(len(AnalyzedMatrix)): #Browse Each Column
        for j in range(len(AnalyzedMatrix[i])): #Browse Each Line
            if j>0:
                if AnalyzedMatrix[i][j] > 50:
                    if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                        AnalyzedMatrix[i][j] = 2
                    else:
                        AnalyzedMatrix[i][j] = 1
                else:
                    if AnalyzedMatrix[i][j] <50:
                        if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                            AnalyzedMatrix[i][j] = 0
                        else:
                            AnalyzedMatrix[i][j] = -1
    return AnalyzedMatrix
The input array is :
[[55. 57.6 57.2 57. 51.1 55.9]
[55.3 54.7 56.1 55.8 52.7 55.5]
[55.5 52. 52.2 49.9 53.8 55.6]
[54.9 57.8 57.6 53.6 54.2 59.9]
[47.9 50.7 53.3 52.5 49.9 45.8]
[57. 56.2 58.3 55.4 47.9 56.5]
[56.6 54.2 57.6 54.7 50.1 53.6]
[54.7 53.4 52. 52. 50.9 nan]
[51.4 51.5 51.2 53. 50.1 50.1]
[55.3 58.7 59.2 56.4 53. nan]]
It seems that the function MatrixAnalytics is being called again, but I don't understand why.
Doing this works:
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print(MyNewMatrix)
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)

I think I found the issue.
In this code:
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    ...
    ...
    return AnalyzedMatrix
AnalyzedMatrix is not a copy of DataMatrix; it references the same object in memory!
So on the first call of MatrixAnalytics, you are actually modifying the object behind the reference given as argument (because arrays are mutable).
In the second call, you are passing the same reference as argument, so the array behind it has already been modified.
Note: the return AnalyzedMatrix statement just returns a new reference to the object referenced by the DataMatrix argument (not a copy).
Try replacing this line:
AnalyzedMatrix = DataMatrix
with this one (in your definition of MatrixAnalytics):
AnalyzedMatrix = np.copy(DataMatrix)
For more info:
mutable vs immutable
numpy.delete()
numpy.copy()
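The difference between aliasing and copying can be seen in a small, self-contained sketch. The functions `label_alias` and `label_copy` and the tiny input array are hypothetical stand-ins for MatrixAnalytics and Cmp, reduced to a single relabelling rule.

```python
import numpy as np

def label_alias(data):
    # Plain assignment: `result` is another name for the caller's array.
    result = data
    result[result > 50] = 2
    return result

def label_copy(data):
    # np.copy allocates a new array, so the caller's data stays intact.
    result = np.copy(data)
    result[result > 50] = 2
    return result

cmp_a = np.array([[55.0, 57.6], [47.9, 50.7]])

out_copy = label_copy(cmp_a)
copy_left_input_alone = bool(cmp_a[0, 0] == 55.0)

out_alias = label_alias(cmp_a)
alias_changed_input = bool(cmp_a[0, 0] == 2.0)

print(copy_left_input_alone, alias_changed_input)  # True True
print(out_alias is cmp_a)                          # True
```

With the aliasing version, a second call sees the already relabelled values, which is the behaviour described in the question.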

I believe you want the same output in both cases.
Note that np.delete does not change the array in place: it returns a new array with column 0 removed and leaves its input untouched. What changes between the two prints is Cmp itself, because each call to MatrixAnalytics(Cmp) modifies it.
So call MatrixAnalytics only once, store the result (for example in MyNewMatrix), and print that stored result instead of calling the function again inside the print statement.

Vectorized equivalent of dict.get

I'm looking for functionality that operates like this:
lookup_dict = {5:1.0, 12:2.0, 39:2.0...}
# this is the missing magic:
lookup = vectorized_dict(lookup_dict)
x = numpy.array([5.0, 59.39, 39.49...])
xbins = numpy.trunc(x).astype(numpy.int_)
y = lookup.get(xbins, 0.0)
# the idea is that we get this as the postcondition:
for (result, input) in zip(y, xbins):
    assert(result==lookup_dict.get(input, 0.0))
Is there some flavor of sparse array in numpy (or scipy) that gets at this kind of functionality?
The full context is that I'm binning some samples of a 1-D feature.

As far as I know, numpy does not support different data types in the same array structures but you can achieve a similar result if you are willing to separate keys from values and maintain the keys (and corresponding values) in sorted order:
import numpy as np
keys = np.array([5,12,39])
values = np.array([1.0, 2.0, 2.0])
valueOf5 = values[keys.searchsorted(5)] # 1.0
k = np.array([5,5,12,39,12])
values[keys.searchsorted(k)] # array([1., 1., 2., 2., 2.])
This may not be as efficient as a hash lookup, but it works on query arrays with any number of dimensions.
Note that this assumes every queried key is present in the keys array. If one is missing, then rather than an error you may silently get the value of the next key up (or an IndexError if the query is larger than every key).
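To cover missing keys as well, the searchsorted lookup can be combined with an equality check and a default. This is a minimal sketch, assuming a non-empty dictionary with sortable numeric keys; `vectorized_get` is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def vectorized_get(lookup_dict, query, default=0.0):
    """Vectorized analogue of dict.get for numeric keys (hypothetical helper)."""
    sorted_keys = sorted(lookup_dict)
    keys = np.array(sorted_keys)
    values = np.array([lookup_dict[k] for k in sorted_keys], dtype=float)
    query = np.asarray(query)
    # Position where each query would be inserted to keep `keys` sorted.
    idx = np.searchsorted(keys, query)
    # Queries above the largest key give idx == len(keys); clip to stay in range.
    idx = np.clip(idx, 0, len(keys) - 1)
    # Only accept positions whose key really equals the query.
    found = keys[idx] == query
    return np.where(found, values[idx], default)

lookup_dict = {5: 1.0, 12: 2.0, 39: 2.0}
xbins = np.array([5, 59, 39])
y = vectorized_get(lookup_dict, xbins, 0.0)
print(y)  # [1. 0. 2.]
```

The equality check is what turns "next key up" into the default value, at the cost of one extra comparison per element.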

Using np.select with a list of boolean masks over the array ([xbins == k for k in lookup_dict]), the values from the dict (lookup_dict.values(), wrapped in a list), and a default value of 0:
y = np.select(
    [xbins == k for k in lookup_dict],
    list(lookup_dict.values()),
    0.0
)
# In [17]: y
# Out[17]: array([1., 0., 2.])
This relies on the keys and the values of the dictionary being iterated in the same order, which holds as long as the dictionary is not modified in between.
Or, as overkill, with pandas:
import pandas as pd
s = pd.Series(xbins)
s = s.map(lookup_dict).fillna(0)

Another approach is to use searchsorted on a numpy array of the integer 'keys' and return the initially loaded value for any x in the range n <= x < n+1. This may be useful to somebody asking a similar question in the future.
import numpy as np
class NpIntDict:
    """ Class to simulate a python dict get for a numpy array. """
    def __init__( self, dict_in, default = np.nan ):
        """ dict_in: a dictionary with integer keys.
            default: the value to be returned for keys not in the dictionary.
                     defaults to np.nan
                     default must be consistent with the dtype of values
        """
        # Create list of dict items sorted by key.
        list_in = sorted([ item for item in dict_in.items() ])
        # Create three empty lists.
        key_list = []
        val_list = []
        is_def_mask = []
        for key, value in list_in:
            key = int(key)
            if not key in key_list: # key not yet in key list
                # Update the three lists for key as default.
                key_list.append( key )
                val_list.append( default )
                is_def_mask.append( True )
            # Update the lists for key+1. With searchsorted this gives the required results.
            key_list.append( key + 1 )
            val_list.append( value )
            is_def_mask.append( False )
        # Add the key > max(key) to the val and is_def_mask lists.
        val_list.append( default )
        is_def_mask.append( True )
        self.keys = np.array( key_list, dtype = np.int_ )
        self.values = np.array( val_list )
        self.default_mask = np.array( is_def_mask )
    def set_default( self, default = 0 ):
        """ Set the default to a new default value. Using self.default_mask.
            Changes the default value for all future self.get(arr).
        """
        self.values[ self.default_mask ] = default
    def get( self, arr, default = None ):
        """ Returns an array looking up the values in `arr` in the dict.
            default can be used to change the default value returned for this get only.
        """
        if default is None:
            values = self.values
        else:
            values = self.values.copy()
            values[ self.default_mask ] = default
        return values[ np.searchsorted( self.keys, arr, side = 'right' ) ]
        # side = 'right' to ensure key[ix] <= x < key[ix+1]
        # side = 'left' would mean key[ix] < x <= key[ix+1]
This could be simplified if there's no requirement to change the default returned after the NpIntDict is created.
To test it.
d = { 2: 5.1, 3: 10.2, 5: 47.1, 8: -6}
# x <2 Return default
# 2 <= x <3 return 5.1
# 3 <= x < 4 return 10.2
# 4 <= x < 5 return default
# 5 <= x < 6 return 47.1
# 6 <= x < 8 return default
# 8 <= x < 9 return -6.
# 9 <= x return default
test = NpIntDict( d, default = 0.0 )
arr = np.arange( 0., 100. ).reshape(10,10)/10
print( arr )
"""
[[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
[1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
[2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9]
[3. 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9]
[4. 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9]
[5. 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9]
[6. 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9]
[7. 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9]
[8. 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9]
[9. 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9]]
"""
print( test.get( arr ) )
"""
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1]
[10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[-6. -6. -6. -6. -6. -6. -6. -6. -6. -6. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
"""
This could be amended to raise an exception if any of the arr elements aren't in the key list. For me returning a default would be more useful.

Numpy array references behaving strangely when inside a for loop

I'm writing a Python implementation of Euler's method, using an example from Paul's math notes here.
I'm using a n x 3 numpy array to store the results. The goal is to have the t-value in the first column, y in the second, and the value of y' computed using the current row in the third column.
When I did the first problem listed on the page, using only ten iterations, everything behaved exactly as expected. The step size was 0.1, so the values in the first column incremented by 0.1 with each iteration of the for loop.
But now that I've copied the code over and attempted to apply it to problem 3, the first column behaves very strangely. I inputted the step size as 0.01, but for the first ten iterations it increments by 0.1, then after the tenth iteration it appears to reset to zero, then uses the expected 0.01, but later on it resets again in a similar fashion.
Here's my code:
import numpy as np
def ex3(t,y):
    return y + (-0.5 * np.exp(t/2) * np.sin(5*t)) + (5 * np.exp(t/2) * np.cos(5*t))
ex3out = np.empty((0,3), float)
# Input the initial conditions and first y' computation
ex3out = np.append(ex1out, np.array([[0,0,ex3(0,0)]]), axis=0)
h = 0.01
n = 500
for i in range(1,n+1):
    # Compute the new t and y values and put in 0 as a dummy y' for now
    new = np.array([[ex3out[i - 1,0] + h, ex3out[i - 1,1] + h * ex3out[i - 1,2],0]])
    # Append the new row
    ex3out = np.append(ex3out,new,axis=0)
    # Replace the dummy 0 with y' based on the new values
    ex3out[i,2] = ex3(ex3out[i,0],ex3out[i,1])
And here are the first several rows of ex3out after running the above code:
array([[ 0. , 1. , -1. ],
[ 0.1 , 0.9 , 5.2608828 ],
[ 0.2 , 0.852968 , 3.37361534],
[ 0.3 , 0.8374415 , 0.6689041 ],
[ 0.4 , 0.83983378, -2.25688988],
[ 0.5 , 0.85167737, -4.67599317],
[ 0.6 , 0.86780837, -5.90918813],
[ 0.7 , 0.8851749 , -5.51040903],
[ 0.8 , 0.90205891, -3.40904125],
[ 0.9 , 0.91757091, 0.031139 ],
[ 1. , 0.93132436, 4.06022317],
[ 0. , 0. , 5. ],
[ 0.01 , 0.99 , 5.98366774],
[ 0.02 , 0.95260883, 5.92721107],
[ 0.03 , 0.88670415, 5.82942804],
[ 0.04 , 0.84413054, 5.74211536],
[ 0.05 , 0.81726488, 5.65763415],
[ 0.06 , 0.80491744, 5.57481145],
[ 0.07 , 0.80871649, 5.4953251 ],
[ 0.08 , 0.83007081, 5.42066644],
[ 0.09 , 0.8679685 , 5.34993924],
[ 0.1 , 0.9178823 , 5.2787651 ],
[ 0.11 , 0.97192659, 5.19944036],
[ 0.12 , 0.05 , 4.13207859],
[ 0.13 , 1.04983668, 4.97466166],
[ 0.14 , 1.01188094, 4.76791408],
[ 0.15 , 0.94499843, 4.5210138 ],
[ 0.16 , 0.90155169, 4.28666725],
[ 0.17 , 0.87384122, 4.0575499 ],
[ 0.18 , 0.86066555, 3.83286568],
[ 0.19 , 0.86366974, 3.61469476],
[ 0.2 , 0.88427747, 3.40492482],
[ 0.21 , 0.92146789, 3.20302701],
I wondered if this might be a floating point issue, so I tried enclosing various parts of the for loop in float() with the same results.
I must've made a typo somewhere, right?

Simpler loop:
ex3out = [[0, 0, ex3(0,0)]]
h = 0.01
n = 50
for i in range(1,n+1):
    # Compute the new t and y values and put in 0 as a dummy y' for now
    last = ex3out[-1]
    new = [last[0] + h, last[1] + h * last[2], 0]
    new[2] = ex3(new[0], new[1])
    # Append the new row
    ex3out.append(new)
print(np.array(ex3out)) # for pretty numpy display
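A likely cause of the jump in the question's output is that the initial row is appended to ex1out (the 11-row result of the first problem) rather than to the empty ex3out, so ex3out[i - 1] reads rows left over from problem 1. Another way to avoid this kind of slip is to preallocate the result and fill it by index. The following is a minimal sketch; `euler_table` is a hypothetical helper name, and the initial condition (0, 0) is taken from the question's code.

```python
import numpy as np

def ex3(t, y):
    # Right-hand side y' = f(t, y) from the question.
    return y - 0.5 * np.exp(t / 2) * np.sin(5 * t) + 5 * np.exp(t / 2) * np.cos(5 * t)

def euler_table(f, t0, y0, h, n):
    """Return an (n + 1) x 3 array with columns t, y, y' (hypothetical helper)."""
    out = np.zeros((n + 1, 3))
    out[0] = (t0, y0, f(t0, y0))
    for i in range(1, n + 1):
        t = t0 + i * h                         # avoids accumulating rounding in t
        y = out[i - 1, 1] + h * out[i - 1, 2]  # Euler step from the previous row
        out[i] = (t, y, f(t, y))
    return out

table = euler_table(ex3, 0.0, 0.0, 0.01, 500)
print(table[:3])
```

Since the array is created inside the function, no state from an earlier problem can leak into the result.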

How can I generate a rolling metric like this in Pandas

I have a dataframe that initially contains two columns: Home, which is 1 if a game was played at home, else 0, and PTS, which records the number of points a player scored in a given game. I want to end up with a third column, a rolling metric that represents how sensitive a player is to playing at home. I'll calculate this as follows:
Home Sensitivity = (Average PTS Home - Average PTS Away)/Average PTS
I did this successfully in the following code, but it felt cumbersome, as I created many columns I didn't need in the end. How can I solve this problem more directly?
df=pd.DataFrame({'Home':[1,0,1,0,1,0,1,0], 'PTS':[11, 10, 12, 11, 13, 12, 14, 12]})
df.loc[df['Home'] == 1, 'Home PTS'] = df['PTS']
df.loc[df['Home'] == 0, 'Away PTS'] = df['PTS']
df['Home PTS'] = df['Home PTS'].fillna(0)
df['Away PTS'] = df['Away PTS'].fillna(0)
df['Home Sum'] = df['Home PTS'].expanding(min_periods=1).sum()
df['Away Sum'] = df['Away PTS'].expanding(min_periods=1).sum()
df['Home Count']=df['Home'].expanding().sum()
df['Index']=df.index+1
df['Away Count']=df['Index']-df['Home Count']
df['Home Average']=df['Home Sum']/df['Home Count']
df['Away Average']=df['Away Sum']/df['Away Count']
df['Average']=df['PTS'].expanding().mean()
df['Metric']=(df['Home Average']-df['Away Average'])/df['Average']

Here is a naive way to do it: take increasingly larger slices of the DataFrame in a loop; do the math on each slice and store it in a list; assign the list to a new column of the DataFrame (using your DataFrame df with its Home and PTS columns):
sens = []
for i in range(len(df)):
    d = df[:i + 1]   # all games up to and including game i
    mean_pts = d.PTS.mean()
    home = d[d.Home == 1].PTS.mean()
    away = d[d.Home == 0].PTS.mean()
    #print(home, away, (home - away) / mean_pts)
    sens.append((home - away) / mean_pts)
df['sens'] = sens
>>> df
   Home  PTS      sens
0     1   11       NaN
1     0   10  0.095238
2     1   12  0.136364
3     0   11  0.090909
4     1   13  0.131579
5     0   12  0.086957
6     1   14  0.126506
7     0   12  0.105263
Using DataFrame.expanding(): Not quite there yet ...
>>> mean_pts = df.PTS.expanding(1).mean()
>>> away = df[df['Home'] == 0].PTS.expanding(1).mean()
>>> home = df[df['Home'] == 1].PTS.expanding(1).mean()
>>>
>>> home
0 11.0
2 11.5
4 12.0
6 12.5
Name: PTS, dtype: float64
>>> away
1 10.00
3 10.50
5 11.00
7 11.25
Name: PTS, dtype: float64
>>> mean_pts
0 11.000000
1 10.500000
2 11.000000
3 11.000000
4 11.400000
5 11.500000
6 11.857143
7 11.875000
Name: PTS, dtype: float64
>>>
To do the math will require more manipulation.
You cannot get the difference between home and away directly because the indices are different - but you can do ...
>>> home.values - away.values
array([ 1. , 1. , 1. , 1.25])
>>>
Also home and away only have four rows and mean_pts has eight.
I tried .expanding(1).apply() with the following function and didn't get what I expected: expanding doesn't pass both columns to the function; it appears to pass one column, then the other. So I punted...
def f(thing):
    print(thing, '***')
    return thing.mean()
>>> df.expanding(1).apply(f)
[ 1.] ***
[ 1. 0.] ***
[ 1. 0. 1.] ***
[ 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0. 1. 0.] ***
[ 11.] ***
[ 11. 10.] ***
[ 11. 10. 12.] ***
[ 11. 10. 12. 11.] ***
[ 11. 10. 12. 11. 13.] ***
[ 11. 10. 12. 11. 13. 12.] ***
[ 11. 10. 12. 11. 13. 12. 14.] ***
[ 11. 10. 12. 11. 13. 12. 14. 12.] ***
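One way to keep the home and away series aligned on the full index is to mask PTS with Series.where before taking the expanding mean, since the expanding mean skips NaN entries. This is a sketch under that assumption, using the question's column names; the intermediate variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Home': [1, 0, 1, 0, 1, 0, 1, 0],
                   'PTS': [11, 10, 12, 11, 13, 12, 14, 12]})

is_home = df['Home'].eq(1)
# where() keeps PTS on matching rows and puts NaN elsewhere, so both
# series keep all eight rows and can be subtracted directly.
home_avg = df['PTS'].where(is_home).expanding().mean()
away_avg = df['PTS'].where(~is_home).expanding().mean()
overall_avg = df['PTS'].expanding().mean()

df['Metric'] = (home_avg - away_avg) / overall_avg
print(df)
```

Each row includes the current game, as in the question's expanding columns, and the first row is NaN because no away game has been played yet.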

Python genfromtxt file path

I have an extremely basic problem with the numpy.genfromtxt function. I'm using the Enthought Canopy package: where shall I save the file.txt I want to use, or how shall I tell Python where to look for it? When using IDLE I simply save the file in a preset folder such as C:\Users\Davide\Python\data.txt and what I get is
>>> import numpy as np
>>> np.genfromtxt('data.txt')
array([[ 33.1 , 32.6 , 18.2 , 17.9 ],
[ 32.95, 32.7 , 17.95, 17.9 ],
[ 32.9 , 32.6 , 18. , 17.9 ],
[ 33. , 32.65, 18. , 17.9 ],
[ 32.95, 32.65, 18.05, 17.9 ],
[ 33. , 32.6 , 18. , 17.9 ],
[ 33.05, 32.7 , 18. , 17.9 ],
[ 33.05, 32.5 , 18.1 , 17.9 ],
[ 33. , 32.6 , 18.05, 17.9 ],
[ 33. , 32.55, 18. , 17.95]])
while in Canopy the same code gives IOError: data.txt not found, and something like np.genfromtxt('C:\Users\Davide\Python\data.txt') doesn't work either. I'm sorry for the banality of the question, but I'm really going crazy with this. Thanks for the help.

You can pass a fully qualified path but this:
np.genfromtxt('C:\Users\Davide\Python\data.txt')
won't work because back slashes need to be escaped:
np.genfromtxt('C:\\Users\\Davide\\Python\\data.txt')
or you could use a raw string:
np.genfromtxt(r'C:\Users\Davide\Python\data.txt')
As for the current working directory (the folder where a relative path such as 'data.txt' is looked up), you can query it using os.getcwd():
In [269]:
import os
os.getcwd()
Out[269]:
'C:\\WinPython-64bit-3.4.3.1\\notebooks\\docs'
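Building the path with pathlib sidesteps the backslash escaping entirely and makes it easy to check where a file is expected. This is a minimal sketch: it writes a small sample file into a temporary directory (an assumption made only so the example runs anywhere) and loads it through an absolute path.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file; in practice this would be your own data.txt.
    data_path = Path(tmp) / "data.txt"
    data_path.write_text("33.1 32.6 18.2 17.9\n32.95 32.7 17.95 17.9\n")

    print(Path.cwd())          # where relative paths would be resolved
    print(data_path.exists())  # True

    # An absolute path works regardless of the current working directory.
    data = np.genfromtxt(str(data_path))

print(data.shape)  # (2, 4)
```

Checking `exists()` before loading gives a clearer signal than the IOError when the IDE starts in a different folder than expected.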
