How to organize a data frame with several variables in Python?

When I build a data frame with one variable, it works well.
import numpy as np
import pandas as pd

a = np.random.normal(45, 9, 10000)
source = {"Genotype": ["CV1"] * 10000, "AGW": a}
df = pd.DataFrame(source)
df
However, when I add more variables, it does not work.
import numpy as np
import pandas as pd

a = np.random.normal(45, 9, 10000)
b = np.random.normal(35, 10, 10000)
source = {"Genotype": ["CV1"] * 10000 + ["CV2"] * 10000,
          "AGW": a + b}
df = pd.DataFrame(source)
df
and it raises "ValueError: All arrays must be of the same length".
I think the AGW column computes the actual element-wise sum a + b, which still has 10,000 rows, rather than stacking the arrays vertically. I want to make a data frame with two columns and 20,000 rows.
Could you let me know how to do it?
Thanks!!

Use numpy.hstack to join the two NumPy arrays:
source = {"Genotype": ["CV1"] * 10000 + ["CV2"] * 10000,
          "AGW": np.hstack((a, b))}
df = pd.DataFrame(source)
Or join them as plain lists:
source = {"Genotype": ["CV1"] * 10000 + ["CV2"] * 10000,
          "AGW": list(a) + list(b)}
df = pd.DataFrame(source)
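To sanity-check the result, a quick shape check (not part of the original answer) should show two columns and 20,000 rows:
print(df.shape)                       # expected: (20000, 2)
print(df["Genotype"].value_counts())  # expected: 10000 each for CV1 and CV2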


efficient way to get all numpy slices for different ranges

I want to slice the same numpy array (data_ar) multiple times, each time keeping the values that fall in a different range.
data_ar shape: (203,)
range_ar shape: (1000,)
I implemented it with a for loop, but it takes way too long since I have a lot of data arrays:
# create results array
results_ar = np.zeros(shape=(1000), dtype=object)
i = 0
for range in range_ar:
    results_ar[i] = data_ar[((data_ar >= (range - delta)) & (data_ar < (range + delta)))].values
    i += 1
so for example:
data_ar = [1,3,4,6,10,12]
range_ar = [7,4,2]
delta= 3
expected output (note results_ar shape = (3,), dtype=object; each element is an array):
results_ar = [[6, 10],
              [1, 3, 4, 6],
              [1, 3, 4]]
Any idea on how to tackle this?
You can use numba to speed up the computations.
import numpy as np
import numba
from numba.typed import List
import timeit

data_ar = np.array([1, 3, 4, 6, 10, 12])
range_ar = np.array([7, 4, 2])
delta = 3

def foo(data_ar, range_ar):
    results_ar = list()
    for i in range_ar:
        results_ar.append(data_ar[(data_ar >= (i - delta)) & (data_ar < (i + delta))])

print(timeit.timeit(lambda: foo(data_ar, range_ar)))

@numba.njit(parallel=True, fastmath=True)
def foo(data_ar, range_ar):
    results_ar = List()
    for i in range_ar:
        results_ar.append(data_ar[(data_ar >= (i - delta)) & (data_ar < (i + delta))])

print(timeit.timeit(lambda: foo(data_ar, range_ar)))
15.53519330600102
1.6557575029946747
Almost a 9.4x speedup.
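One caveat worth noting: numba freezes module-level globals such as delta into the compiled function as constants, so a variant that passes delta explicitly (a hypothetical refactor, not part of the timings above) is safer when delta can change between calls:
@numba.njit
def foo_with_delta(data_ar, range_ar, delta):
    # same filtering as above, but delta is an argument, so changes to it
    # between calls actually take effect
    results_ar = List()
    for i in range_ar:
        results_ar.append(data_ar[(data_ar >= (i - delta)) & (data_ar < (i + delta))])
    return results_ar

foo_with_delta(data_ar, range_ar, delta)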
You could use np.searchsorted like this (note that it assumes data_ar is sorted):
import numpy as np

data_ar = np.array([1, 3, 4, 6, 10, 12])
range_ar = np.array([7, 4, 2])
delta = 3

# one (lower, upper) bound pair per range value, shape (len(range_ar), 2)
bounds = range_ar[:, None] + delta * np.array([-1, 1])
# searchsorted returns the start/stop indices of each half-open interval
result = [data_ar[slice(*row)] for row in np.searchsorted(data_ar, bounds)]
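A quick check (not part of the original answer) that this reproduces the ">= lower and < upper" filtering of the loop version:
loop_result = [data_ar[(data_ar >= r - delta) & (data_ar < r + delta)] for r in range_ar]
assert all(np.array_equal(a, b) for a, b in zip(result, loop_result))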

Scatter plot with logical indexing

I have a 100x2 array D and a 100x1 array c (with entries +/- 1). I'm trying to make a scatter plot of the columns of D corresponding to c == 1.
I tried something like this: plt.scatter(D[0][c==1], D[1][c==1]) but it throws IndexError: too many indices for array.
I'm aware that I could use a list comprehension or something of that sort, but I'm fairly new to Python and hence struggling with the format.
Thanks a lot.
Concept
You can use np.where to select only the rows of D that correspond to a 1 in your array C:
D = np.array([[0.25, 0.25], [0.75, 0.75]])
C = np.array([1, 0])
Using np.where, we can select only rows that are 1 in C:
>>> D[np.where(C==1)]
array([[0.25, 0.25]])
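Equivalently, since C here is one-dimensional, a plain boolean mask selects the same rows without the explicit np.where call:
>>> D[C == 1]
array([[0.25, 0.25]])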
Example
On your actual data:
D = np.random.randn(100, 2)
C = np.random.randint(0, 2, (100, 1))
valid = D[np.where(C.ravel()==1)]
import matplotlib.pyplot as plt
plt.scatter(valid[:, 0], valid[:, 1])
Output: (scatter plot of the selected rows of D)
You can use numpy for this (assuming you have two numpy arrays, otherwise you can convert them into numpy arrays):
import numpy as np
c_ones = np.where(c == 1) # Finds all indices where c == 1
d_0 = D[0][c_ones]
d_1 = D[1][c_ones]
Then you can plot d_0, d_1 as normal.
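For completeness, a minimal plotting sketch (mirroring the matplotlib call from the question) would be:
import matplotlib.pyplot as plt

plt.scatter(d_0, d_1)
plt.show()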
For converting your lists if needed,
C_np = np.asarray(c)
D_np = np.asarray(D)
And then perform np.where on C_np as shown above.
Would this solve your issue?

pd.get_dummies() slow on large levels

I'm unsure if this is already the fastest possible method, or if I'm doing this inefficiently.
I want to one-hot encode a particular categorical column which has 27k+ possible levels. The column has different values in the two datasets, so I combined the levels first before calling get_dummies().
def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column_name] = df[column_name].astype('category', categories=combined_cats)
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats)

    df = pd.get_dummies(df, columns=[column_name], sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name], sparse=sparse)

    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass

    return df, df2
However, it's been running for more than 2 hours and is still stuck on the encoding.
Could I be doing something wrong here? Or is it just the nature of running it on large datasets?
df has 6.8M rows and 27 columns; df2 has 19,990 rows and 27 columns, before one-hot encoding the column I wanted to.
Advice appreciated, thank you! :)
I reviewed the get_dummies source code briefly, and I think it may not be taking full advantage of the sparsity for your use case. The following approach may be faster, but I did not attempt to scale it all the way up to the millions of records you have:
import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)

N = 10000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N),
    'col2b': np.random.choice([1, 2, 3], N),
    'target': np.random.choice([1, 2, 3], N),
})

# construct an array of the unique values of the column to be encoded
vals = np.array(dfa.col1.unique())

# extract an array of values to be encoded from the dataframe
col1 = dfa.col1.values

# construct a sparse matrix of the appropriate size and an appropriate,
# memory-efficient dtype
spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8)

# do the encoding. NB: This is only vectorized in one of the two dimensions.
# Finding a way to vectorize the second dimension may yield a large speed up
for idx, val in enumerate(vals):
    spmtx[np.argwhere(col1 == val), idx] = 1

# Construct a SparseDataFrame from the sparse matrix and apply the index
# from the original dataframe and column names.
dfnew = pd.SparseDataFrame(spmtx, index=dfa.index,
                           columns=['col1_' + str(el) for el in vals])
dfnew.fillna(0, inplace=True)
UPDATE
Borrowing insights from other answers here and here, I was able to vectorize the solution in both dimensions. In my limited testing, I noted that constructing the SparseDataFrame seems to increase the execution time several fold. So, if you don't need to return a DataFrame-like object, you can save a lot of time. This solution also handles the case where you need to encode 2+ DataFrames into 2-d arrays with equal numbers of columns.
import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)

N1 = 10000
N2 = 100000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N1),
    'col2a': np.random.choice([1, 2, 3], N1),
    'target': np.random.choice([1, 2, 3], N1),
})

dfb = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N2),
    'col2b': np.random.choice(['foo', 'bar', 'baz'], N2),
    'target': np.random.choice([1, 2, 3], N2),
})

# construct an array of the unique values of the column to be encoded,
# taking the union of the values from both dataframes.
valsa = set(dfa.col1.unique())
valsb = set(dfb.col1.unique())
vals = np.array(list(valsa.union(valsb)), dtype=np.uint16)

def sparse_ohe(df, col, vals):
    """One-hot encoder using a sparse ndarray."""
    colaray = df[col].values
    # construct a sparse matrix of the appropriate size and an appropriate,
    # memory-efficient dtype
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
    # do the encoding
    spmtx[np.where(colaray.reshape(-1, 1) == vals.reshape(1, -1))] = 1
    # Construct a SparseDataFrame from the sparse matrix
    dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index,
                               columns=[col + '_' + str(el) for el in vals])
    dfnew.fillna(0, inplace=True)
    return dfnew

dfanew = sparse_ohe(dfa, 'col1', vals)
dfbnew = sparse_ohe(dfb, 'col1', vals)
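As a quick sanity check (not part of the original answer), both encoded frames should now have one column per category in vals, in the same order, which is what lets them line up downstream:
assert list(dfanew.columns) == list(dfbnew.columns)
print(dfanew.shape, dfbnew.shape)  # (10000, len(vals)) and (100000, len(vals))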

Traversing multiple dataframes simultaneously

I have three dataframes, one per user, with the same column names: time, compass data, accelerometer data, gyroscope data, and camera panning information. I want to traverse all the dataframes simultaneously to check, for a particular time, which user performed camera panning and return that user (i.e. in which data frame panning has been detected for that particular time). I have tried using dash to achieve parallelism, but in vain. Below is my code:
import pandas as pd
import glob
import numpy as np
import math
from scipy.signal import butter, lfilter

order = 3
fs = 30
cutoff = 4.0

data = []
gx = []
gy = []
g_x2 = []
g_y2 = []

dataList = glob.glob(r'C:\Users\chaitanya\Desktop\Thesis\*.csv')
for csv in dataList:
    data.append(pd.read_csv(csv))

for i in range(0, len(data)):
    data[i] = data[i].groupby("Time").agg(lambda x: x.value_counts().index[0])
    data[i].reset_index(level=0, inplace=True)

def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    nor = cutoff / nyq
    b, a = butter(order, nor, btype='low', analog=False)
    return b, a

def lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y

for i in range(0, len(data)):
    gx.append(lowpass_filter(data[i]["Gyro_X"], cutoff, fs, order))
    gy.append(lowpass_filter(data[i]["Gyro_Y"], cutoff, fs, order))
    g_x2.append(gx[i] * gx[i])
    g_y2.append(gy[i] * gy[i])

g_rad = [[] for _ in range(len(data))]
g_ang = [[] for _ in range(len(data))]
for i in range(0, len(data)):
    for j in range(0, len(data[i])):
        g_ang[i].append(math.degrees(math.atan(gy[i][j] / gx[i][j])))
    data[i]["Ang"] = g_ang[i]

panning = [[] for _ in range(len(data))]
for i in range(0, len(data)):
    for j in data[i]["Ang"]:
        if 0 - 30 <= j <= 0 + 30:
            panning[i].append("Panning")
        elif 180 - 30 <= j <= 180 + 30:
            panning[i].append("left")
        else:
            panning[i].append("None")
    data[i]["Panning"] = panning[i]

result = [[] for _ in range(len(data))]
for i in range(0, len(data)):
    result[i].append(data[i].loc[data[i]['Panning'] == 'Panning', 'Ang'])
I'm going to make the assumption that you want to traverse simultaneously in time. In any case, you want your three dataframes to have an index in the dimension you want to traverse.
I'll generate 3 dataframes with rows representing random seconds in a 9 second period.
Then, I'll align these with a pd.concat and ffill to be able to reference the last known data for any gaps.
import numpy as np
import pandas as pd

seconds = pd.date_range('2016-08-31', periods=10, freq='S')
n = 6
ssec = seconds.to_series()
sidx = ssec.sample(n).index

df1 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])
df2 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])
df3 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])

df4 = pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3']).ffill()
df4
You can then proceed to walk through it via iterrows():
for tstamp, row in df4.iterrows():
    print(tstamp)
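To get back to the original goal of reporting which user is panning at a given time, here is a hedged sketch: it assumes each per-user frame also carried a boolean 'panning' column before the concat, so the column name below is hypothetical rather than taken from the sample frames above:
for tstamp, row in df4.iterrows():
    # collect the user keys whose (hypothetical) 'panning' flag is set
    panning_users = [user for user in ['df1', 'df2', 'df3']
                     if row.get((user, 'panning'), False)]
    if panning_users:
        print(tstamp, panning_users)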

Is there a way to confirm all the input array dimensions in numpy?

I'm running Python 2.7.9. I have two numpy arrays (100000 x 142 and 100000 x 20) that I want to concatenate into one 100000 x 162 array.
The following is the code I'm running:
import numpy as np
import pandas as pd

def ratingtrueup():
    actones = np.ones((100000, 20), dtype='f8', order='C')
    actualhhdata = np.array(pd.read_csv(
        'C:/Users/Desktop/2015actualhhrating.csv', index_col=None, header=None, sep=','))
    projectedhhdata = np.array(pd.read_csv(
        'C:/Users/Desktop/2015projectedhhrating.csv', index_col=None, header=None, sep=','))

    adjfctr = round(1 + ((actualhhdata.mean() - projectedhhdata.mean()) / projectedhhdata.mean()), 5)
    projectedhhdata = (adjfctr * projectedhhdata)
    actualhhdata = (actones * actualhhdata)

    end = np.concatenate((actualhhdata.T, projectedhhdata[:, 20:]), axis=1)

ratingtrueup()
I get the following value error:
File "C:/Users/PycharmProjects/TestProjects/M.py",
line 16, in ratingtrueup
end = np.concatenate([actualhhdata.T, projectedhhdata[:, 20:]], axis=1) ValueError: all the input array dimensions except for the
concatenation axis must match exactly
I've confirmed that both arrays are numpy.ndarray.
Is there a way I can check the dimensions of the input arrays to see where I'm going wrong?
Thank you in advance.
I would add a (temporary) print line right before the concatenate:
actualhhdata = (actones * actualhhdata)
print(actualhhdata.T.shape, projectedhhdata[:, 20:].shape)
end = np.concatenate((actualhhdata.T, projectedhhdata[:, 20:]), axis=1)
For more of a production context, you might want to add some sort of test, e.g.:
x, y = np.ones((100, 20)), np.zeros((100, 10))
assert x.shape[0] == y.shape[0], (x.shape, y.shape)
np.concatenate([x, y], axis=1).shape
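If this check comes up repeatedly, a small helper (hypothetical, not from the answer above) keeps it readable:
def check_same_rows(*arrays):
    """Raise a clear error when the arrays disagree on their first dimension."""
    shapes = [a.shape for a in arrays]
    if len({s[0] for s in shapes}) != 1:
        raise ValueError("row counts differ: %s" % (shapes,))

check_same_rows(x, y)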
