I have a pandas dataframe from which I wish to construct some matrices using numpy arrays. These matrices will be constructed based on variables in the dataframe, and I would like to create these via a loop over a list of the dataframe variables. I would also like the numpy arrays to be named based on the variable, so that I can easily reference them.
Below is code to illustrate my problem. I create a dataframe with two categorical variables and an identifier. I then create a list 'vars' with the variable names I'd like to loop over. I show that my code runs outside the loop (although the object created is a pandas Series, not a numpy array). The commented piece at the end does not work, but shows my attempt at including the variable string in the loop.
import pandas as pd
import numpy as np
import random

mult_cat = []  # multiple categories
bin_cat = []   # binary categories
id = []
for i in range(0, 10):
    x = random.randint(0, 4)
    y = random.randint(0, 1)
    z = i + 1
    mult_cat.append(x)
    bin_cat.append(y)
    id.append(z)

data_2 = {'ID': id,
          'mult_cat': mult_cat,
          'bin_cat': bin_cat}
df = pd.DataFrame(data_2,
                  columns=['ID', 'mult_cat', 'bin_cat'])

vars = ['mult_cat', 'bin_cat']

twice_mult_cat = 2*df.mult_cat
print(mult_cat)
print(twice_mult_cat)

"""
for var in vars:
    twice_var = 2*df.var
    print(twice_var)
"""
I believe there are at least two issues here.
1) I am simply multiplying the pandas Series, so the resulting object is not a numpy array.
2) The issue of naming, which is, I think, the more important issue here.
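One way around both issues is to collect the arrays in a dictionary keyed by column name, rather than trying to create dynamically named variables. A minimal sketch, assuming the df and vars defined above (the name arrays is my own):

arrays = {var: 2 * df[var].to_numpy() for var in vars}  # use .values on older pandas
print(arrays['mult_cat'])        # doubled values, now a numpy array
print(type(arrays['mult_cat']))  # <class 'numpy.ndarray'>

df[var] selects each column as a Series, and to_numpy() converts it, so arrays['mult_cat'] can be referenced by name without any dynamic variable creation.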
from random import randint

def claims(dataframe):
    dataframe.loc[(dataframe.severity == 1), 'claims_made'] = randint(200, 20000)
    return dataframe
Here 'severity' is an existing column and 'claims_made' is a new column. I want randint to keep picking different values to assign to the 'claims_made' column, because for now it just picks one random value out of the specified range and assigns that same value to all the rows that satisfy the condition.
Your code gets a single randint and applies that one value to the column you create. It's the same as if you had done
val = randint(200, 20000)
dataframe.loc[(dataframe.severity == 1), 'claims_made'] = val
Instead you could get an index of the rows you want to assign. Use it to create a series of random integers and when you assign that back to the dataframe, non-indexed rows become NaN.
import pandas as pd
import numpy as np

def claims(dataframe):
    wanted_index = dataframe[dataframe.severity == 1].index  # use the argument, not a global df
    dataframe["claims_made"] = pd.Series(
        np.random.randint(20, 20000, size=len(wanted_index)),
        index=wanted_index)
    return dataframe

df = pd.DataFrame({"severity": [1, 1, 0, 8, -1, 99, 1]})
print(claims(df))
If you want to stick with your existing approach, you could do something like this:
from random import randint

def claims2(df):
    n_rows = len(df.loc[df.severity == 1])  # count the matching rows
    vals = [randint(200, 20000) for _ in range(n_rows)]
    df.loc[(df.severity == 1), 'claims_made'] = vals
    return df
p.s. I'd recommend accessing columns via df['severity'] instead of df.severity -- you can get into trouble using the . syntax if you have a dataset with spaces etc. in the column names.
I'll give you a broad hint; coding is up to you.
Form a series (a temporary column object) of random numbers in the desired range. Assign that series to your data frame column. You can find examples of this technique in any tutorial on data frames.
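As a minimal sketch of that hint (assuming a severity column as in the question, and using numpy's vectorised random generator; the names here are my own):

import numpy as np
import pandas as pd

df = pd.DataFrame({"severity": [1, 0, 1, 1]})
# Form a temporary Series of random numbers, one per row.
rand_col = pd.Series(np.random.randint(200, 20000, size=len(df)), index=df.index)
df["claims_made"] = rand_col  # assign the series to a dataframe column
print(df)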
I am trying to apply the argrelextrema function to dataframe df, but I am unable to apply it correctly. Below is my code:
import pandas as pd
import numpy as np
from scipy.signal import argrelextrema

np.random.seed(42)

def maxloc(data):
    loc_opt_ind = argrelextrema(df.values, np.greater)
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind] = 1
    data['loc_max'] = loc_max
    return data

values = np.random.rand(23000)
df = pd.DataFrame({'value': values})
np.all(maxloc(df).loc_max)
It gives me an error at the line loc_max[loc_opt_ind] = 1:
IndexError: too many indices for array
A Pandas dataframe is two-dimensional. That is, df.values is two-dimensional, even when it has only one column. As a result, loc_opt_ind will contain both x and y indices (a tuple of two arrays; just print loc_opt_ind to see), which can't be used to index the one-dimensional loc_max. You probably want to use either df['value'].values (df['value'] is a Series, so its .values is one-dimensional), or np.squeeze(df.values) as input. Note that argrelextrema still returns a tuple in that case, just a one-element one, so you may need loc_opt_ind[0] (np.where has similar behaviour).
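A corrected version of the function along those lines, as a sketch (using the data argument rather than the global df, and the 1-D column values):

def maxloc(data):
    # data['value'].values is one-dimensional, so argrelextrema
    # returns a one-element tuple that can index loc_max directly
    loc_opt_ind = argrelextrema(data['value'].values, np.greater)
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind] = 1
    data['loc_max'] = loc_max
    return data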
I have numpy arrays which are around 2000 elements long each, but not every element has a value; some are blank. As you can see at the end of the code, I've stacked them into one array called 'match'. How would I remove a row in match if it is missing an element? So for example, if a particular ID is missing the magnitude, it removes the entire row. I'm only interested in keeping the rows that have data for all of the elements.
from astropy.table import Table
import numpy as np

data = '/home/myname/datable.fits'
data = Table.read(data, format="fits")

ID = np.array(data['ID'])
ID = ID.astype(str)              # astype returns a new array, so assign it back
redshift = np.array(data['z'])
redshift = redshift.astype(float)
radius = np.array(data['r'])
radius = radius.astype(float)
mag = np.array(data['MAG'])
mag = mag.astype(float)

match = (ID, redshift, radius, mag)
match = np.stack(match, axis=1)  # stack also returns a new array
Here you can use the numpy.isnan method, which gives True for missing values and False for existing values. Note that numpy.isnan can only be applied to NumPy arrays of a native numeric dtype (such as np.float64).
Your requirement can be achieved as follows (assuming data is your numpy array):
import numpy as np
data = np.array(some_array) # set data as your numpy array
key_col = np.array(data[:,0], dtype=np.float64) # If you want to filter based on column 0
filtered_data = data[~np.isnan(key_col)] # ~ is the logical not here
For better flexibility, consider using pandas!
Hope this helps!
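Since the stacked match array mixes strings and floats, a hedged alternative is to mask each source array before stacking (assuming the ID, redshift, radius, and mag arrays from the question, with missing numeric values stored as NaN and missing IDs as empty strings):

# Keep only rows where every column has data.
numeric_ok = ~np.isnan(redshift) & ~np.isnan(radius) & ~np.isnan(mag)
id_ok = ID != ''  # treat blank IDs as missing
match = np.stack((ID, redshift, radius, mag), axis=1)[numeric_ok & id_ok]

Note that stacking strings with floats yields an all-string array, which is another reason pandas (with its dropna method) may be the more flexible choice here.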
I have an h5 file with 5 groups, each group containing a 3D dataset. I am looking to build a for loop that allows me to extract each group into a numpy array and assign the numpy array to an object with the group header name. I am able to get a number of different methods to work with one group, but when I try to build a for loop that applies the code to all 5 groups, it breaks. For example:
import h5py as h5
import numpy as np
f = h5.File("FFM0012.h5", "r+") #read in h5 file
print(list(f.keys())) #['FFM', 'Image'] for my dataset
FFM = f['FFM'] #Generate object with all 5 groups
print(list(FFM.keys())) #['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr'] for my dataset
Amp = FFM['Amp'] #Generate object for 1 group
Amp = np.array(Amp) #Turn into numpy array, this works.
Now when I try to apply the same logic with a for loop:
h5_keys = []
FFM.visit(h5_keys.append)  # Create list of group names ['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr']
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)
print(Amp[30,30,30])  # To check that array is populated
When I run this code I get "NameError: name 'Amp' is not defined". I've tried initializing the numpy array before the for loop with:
h5_keys = []
FFM.visit(h5_keys.append)  # Create list of group names
Amp = np.array([])
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)
print(Amp[30,30,30])  # To check that array is populated
This produces the error message "IndexError: too many indices for array"
I've also tried generating a dictionary and creating numpy arrays from the dictionary. That is a similar story where I can get the code to work for one h5 group, but it falls apart when I build the for loop.
Any suggestions are appreciated!
You seem to have jumped to using h5py and numpy before learning much of Python.

Amp = np.array([])          # creates a numpy array with 0 elements
for h5_key in h5_keys:      # h5_key is set to a new value each iteration
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)  # now you reassign h5_key
print(Amp[30,30,30])        # Amp is still the original (0,) shape array
Try this basic Python loop, paying attention to the value of i:

alist = [1, 2, 3]
for i in alist:
    print(i)
    i = 10
    print(i)
print(alist)  # no change to alist
f is the file.

FFM = f['FFM']

is a group.

Amp = FFM['Amp']

is a dataset. There are various ways of loading a dataset into a numpy array. I believe the [...] slicing is the current preferred one; .value used to be used, but is now deprecated (loading datasets).

Amp = FFM['Amp'][...]

is an array.
alist = [FFM[key][...] for key in h5_keys]
should create a list of arrays from the FFM group.
If the shapes are compatible, you can concatenate the arrays into one array:
np.array(alist)
np.stack(alist)
np.concatenate(alist, axis=0)  # or other axis
etc.
adict = {key: FFM[key][...] for key in h5_keys}
should create a dictionary of arrays keyed by the dataset names.
In Python, lists and dictionaries are the ways of accumulating objects. The h5py groups behave much like dictionaries. Datasets behave much like numpy arrays, though they remain on the disk until loaded with [...].
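Putting that together, a minimal sketch of the loop the question was aiming for, accumulating into a dictionary keyed by group name rather than trying to create a variable named after each key:

arrays = {}
for h5_key in h5_keys:
    arrays[h5_key] = FFM[h5_key][...]  # load each dataset into a numpy array
print(arrays['Amp'][30, 30, 30])       # check that the 'Amp' array is populated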
I have a 5D array called predictors with a shape of [6,288,37,90,107], where 6 is the number of variables, 288 is the time series of those variables, 37 is the k locations, 90 is the j locations, and 107 is the i locations.
I want to have a pandas dataframe that includes columns of each variable's timeseries at each k,j,i location, so of course that will be a lot of columns. Then I would like to somehow obtain the names for each column. For example, the first column would be var1_k_j_i = predictors[0,:,0,0,0], except in the name I actually want the k location, j location, and i location instead of k_j_i. Since there are so many, I can't do this by hand, so I was hoping for a suggestion on the best way to organize this into a pandas dataframe and obtain the names. A loop, possibly?
So in summary, by the end of this I would like my 5D array of predictors turned into a large pandas dataframe where each column is a variable located at different k,j,i locations, with the corresponding names of the variable and location in the header or first row of the dataframe.
Sounds like you need to have fun with reshape here.
Addressing the location i,j,k is easy using reshape. I'm not sure you can reshape again to obtain a 2D representation of exactly what you need, so I'm proposing a loop for you, as follows.
import itertools
import pandas as pd

dfs = []
# predictors is the 5D array from the question; flatten the k,j,i axes
new_matrix = predictors.reshape([6, 288, 37*90*107])
for var in range(6):
    iterator = itertools.product(range(37), range(90), range(107))
    columns = ['var%i_' % var + '_'.join(map(str, x)) for x in iterator]
    dfs.append(pd.DataFrame(new_matrix[var], columns=columns))
result = pd.concat(dfs, axis=1)  # 288 rows, one column per variable/location
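Assuming the naming scheme above, each column can then be looked up by variable and location, for example:

print(result['var0_0_0_0'])     # variable 0 at k=0, j=0, i=0
print(result['var3_12_45_80'])  # variable 3 at k=12, j=45, i=80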