Python “for” problem - Can only tuple-index with a MultiIndex - python

I want to take an array C2 of size (N, 1) and build an array B of size (N-1, 1), so that
B[0] = C2[1]
B[1] = C2[2]
and so on. My code is:
import numpy as np
import pandas as pd
fields = "B:D"
data = pd.read_excel(r'C:\Users\file.xlsx', "Sheet2", usecols=fields)
N = 2
# Covariance calculation
C1 = data.cov()
C2 = data.var()
B = np.zeros(shape=(N,1))
for i in B:
    B[i,1] = C2[i+1,1]
But the error is:
ValueError: Can only tuple-index with a MultiIndex
I know it is a simple mistake, but I can't find where :S (new Python user)

First, are you sure you need to be using numpy arrays? This seems like a job for python lists.
Next, what do you mean to be doing with for i in B:? What type is i?
In this case, iterating over B is going to set i to [0.], and you can now see that the next line is going to fail in the substitution
B[[0.],i] = C2[[0.]+1,1]
In addition, the call to data.var() returns a 1-d Series, so the second index isn't doing anything.
I think you want to iterate over N like
for i in range(N):
    B[i, 0] = C2[i+1]
(note that B has shape (N, 1), so its only valid column index is 0)
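Putting that together, a minimal runnable sketch (the Series below is a made-up stand-in for data.var(), since the original spreadsheet isn't available):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.var(): a 1-d Series indexed by column label
C2 = pd.Series([0.5, 1.2, 0.9], index=["B", "C", "D"])

N = 2
B = np.zeros(shape=(N, 1))
for i in range(N):
    # C2 is 1-d, so one positional index is enough; B's only column is 0
    B[i, 0] = C2.iloc[i + 1]
# B is now [[1.2], [0.9]]
print(B)
```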

Related

Function returns only one iteration, instead of multiple. What is wrong?

First of all, I'm a beginner and having an issue with functions and returning values. I need to do some matrix operations to take the minimum value of the right column. However, since I cannot return these values (I could not figure out why), I'm not able to do any operations on them. The problem is, every time I try to use return, it gives me only the first or the last row of the matrix. If you can help, I'd really appreciate it. Thanks.
import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]
def minreg():
    for k in range(2,16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k,x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = ((np.mean(np.abs(x-x_pred)/np.abs(x))*100))
        m = np.array([k,mape_value])
        return m
print(minreg())
The return m command basically terminates the function and returns m, so the function exits during the first iteration of the loop. Firstly, you need to call return after your loop ends. Secondly, you need to append each m value generated in the loop to a list to store them all, and return that list.
import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]
def minreg():
    m_arr = []
    for k in range(2,16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k,x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = ((np.mean(np.abs(x-x_pred)/np.abs(x))*100))
        m_arr.append(np.array([k,mape_value]))
    return m_arr
print(minreg())
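To see the return-placement difference in isolation (a toy example, not the asker's data):

```python
def first_only():
    for k in range(2, 5):
        return k  # return exits the function on the very first pass

def all_values():
    out = []
    for k in range(2, 5):
        out.append(k)  # store every iteration's value
    return out  # return only after the loop has finished

print(first_only())  # 2
print(all_values())  # [2, 3, 4]
```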

Is there a way to write fast nested loops in python?

I wrote a function that takes a set of multiple variables and matches it to a list of dataframes (which contains 5572 rows with presence and absence of those variables), later calculating a distance matrix.
In other words, I have a list with 25 presence/absence vectors (filled with 0s and ONE 1) and a list of 25 dataframes (also filled with 0s and 1s). Vectors and dataframes are in iterable order, meaning veclist[0] matches column number in PAM[0] and so on. The function has a nested loop, with the outer loop running the dataframes in order and the inner loop running the rows (finds in which row the variables occur in the dataframe for all 25 variables and gives it a value)
To illustrate it better I recreated a short version of my data below.
The list of vectors looks like this:
import numpy as np
import pandas as pd
import math
A = np.random.randint(1, size=6).reshape(1, 6)
B = np.random.randint(1, size=11).reshape(1, 11)
A[:,2]=1
B[:,7]=1
vec1 = pd.DataFrame(A, columns=["a","b","c","d","e","f"])
vec2 = pd.DataFrame(B, columns=["a","b","c","d","e","f","g","h","i","j","k"])
veclist=[vec1,vec2]
print(veclist[0])
and the list of dataframes looks like this:
A2 = np.random.randint(2, size=600).reshape(100,6)
B2 = np.random.randint(2, size=1100).reshape(100,11)
df1 = pd.DataFrame(A2, columns=["a","b","c","d","e","f"])
df2 = pd.DataFrame(B2, columns=["a","b","c","d","e","f","g","h","i","j","k"])
dflist=[df1,df2]
print(dflist[0])
This is the code for the function I wrote.
def find_distance(veclist, dflist):
    ncol = len(dflist)
    nrow = dflist[0].shape[0]
    distance = np.zeros((nrow, ncol))
    for k in range(ncol):
        pres = np.where(veclist[k]==1)  # getting matching column (where the vector matches the header)
        PAM = dflist[k]
        for m in range(nrow):
            if (veclist[k].iloc[pres]==1).bool() & (PAM.iloc[m,pres[1]]==1).bool():
                a = 2
            else:
                a = 0
            if (veclist[k].iloc[pres]==1).bool() & (PAM.iloc[m,pres[1]]==0).bool():
                b = 1
            else:
                b = 0
            if (veclist[k].iloc[pres]==0).bool() & (PAM.iloc[m,pres[1]]==1).bool():
                c = 1
            else:
                c = 0
            d = (2 * a)/(2 * a + b + c)
            d = math.sqrt(1 - d)
            distance[m,k] = d
    return distance
This function works: it does everything I need it to do. However, it is very slow. With my real data, one pass of the inner loop takes up to a minute, so the whole function takes about 25 minutes. I am more familiar with R, where the same function runs in seconds. Why is it so slow in Python?
What did I do wrong?
I am guessing that the code needs to be more pythonic. I am slowly migrating from R to Python and still have some difficulties. For example, I don't know how to do away with the nested loop and use NumPy, since everything needs to be perfectly matched and saved. Any help with this problem would be much appreciated.
I am using Python 3.7 and the Spyder IDE.
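No answer is shown above, but one possible direction (a sketch, not anyone's posted solution): since each vector contains exactly one 1, the three branch conditions collapse to arithmetic on a single dataframe column, and the inner row loop can be replaced with whole-column NumPy operations:

```python
import numpy as np
import pandas as pd

def find_distance_vectorized(veclist, dflist):
    # Sketch of a vectorized rewrite; assumes each vector holds exactly one 1.
    ncol = len(dflist)
    nrow = dflist[0].shape[0]
    distance = np.zeros((nrow, ncol))
    for k in range(ncol):
        col = np.flatnonzero(veclist[k].to_numpy()[0] == 1)[0]  # matching column
        pam = dflist[k].iloc[:, col].to_numpy()  # whole 0/1 presence column at once
        a = 2 * pam         # vector is 1 here, so a == 2 exactly where pam == 1
        b = 1 - pam         # vector 1, dataframe 0
        c = np.zeros(nrow)  # vector entry is never 0 at its own position
        d = (2 * a) / (2 * a + b + c)  # denominator is 1 + 3*pam, never zero
        distance[:, k] = np.sqrt(1 - d)
    return distance
```

The per-element formula is unchanged; only the row loop moved into NumPy, which is typically where R-to-Python ports regain their speed.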

How to insert a multidimensional numpy array to pandas column?

I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]
data = np.stack(some_df['A'].values)  # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)  # shape (10, 6, 8)
some_df['B'] = processed  # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended: it is painful and slow, and later processing is not easy.
One possible solution is use list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
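For completeness, a small end-to-end sketch of the question's setup with the list-based assignment (random data stands in for the real arrays):

```python
import numpy as np
import pandas as pd

some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]

data = np.stack(some_df['A'].values)  # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)      # shape (10, 6, 8)

# Assign one (6, 8) array per row by handing pandas a list, not a 3-d ndarray
some_df['B'] = list(processed)
print(some_df['B'].iloc[0].shape)  # (6, 8)
```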
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict
def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))

def create_flat_columns_df_from_dict_of_numpy(
    named_np: Dict[str, np.ndarray],
    n_samples_per_np: int,
):
    named_np_correct_lenth = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_lenth.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_lenth.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df
def parse_series_into_np(df, col_name, shp):
    # can parse the shape from the col names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np
usage to put a ndarray into a Dataframe:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of nd arrays of the same shape[0], hashed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimental time-series in the form of pandas dataframes, with one column holding same-length numpy arrays and the other columns containing meta-data on measurement conditions etc.
The proposed solution by jezrael works very well, and I have used it on a regular basis for the last 4 years. But this method can run into huge memory problems. In my case I came across them when working with dataframes beyond 5 million rows and time-series with approx. 100 data points each.
The solution to these problems is extremely simple, and since I did not find it anywhere, I just want to share it here: simply transform your 2D array into a pandas Series object and assign it to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))
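A quick toy demonstration of that assignment (made-up shapes, just to show the pattern):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"meta": range(5)})
numpy_array_2D = np.arange(15).reshape(5, 3)  # one 3-point series per row

# list(...) splits the 2-d array into 5 separate 1-d arrays before assignment
df["new_list_column"] = pd.Series(list(numpy_array_2D))
print(df["new_list_column"].iloc[0])  # [0 1 2]
```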

using python to read a column vector from excel

OK, I think this must be a super simple thing to do, but I keep getting index error messages no matter how I try to format this. My professor is making us multiply a 1x3 row vector by a 3x1 column vector, and I can't get Python to read the column vector. The row vector is in cells A1:C1, and the column vector is in cells A3:A5 of my Excel spreadsheet. I am using the right "format" for how he wants us to do it (if I do something that works but don't format it the way he likes, I don't get credit). The row vector is reading properly in the variable explorer, but I am only getting a 2x2 column vector (with the 0th column being all zeros, again how he wants it). I haven't even gotten to the multiplication part of the code because I can't get Python to read the column vector correctly. Here is the code:
import xlwings as xw
import numpy as np
filename = 'C:\\python\\homework4.xlsm'
wb=xw.Workbook(filename)
#initialize vectors
a = np.zeros((1+1,3+1))
b = np.zeros((3+1,1+1))
n=3
#Read a and b vectors from excel
for i in range(1,n+1):
    for j in range(1,n+1):
        a[i,j] = xw.Range((i,j)).value
    'end j'
    b[i,j] = xw.Range((i+2,j)).value
'end i'
Something like this should work. The way you iterate over i and j is wrong (plus the initialization of a and b):
#initialize vectors
a = np.zeros((1,3))
b = np.zeros((3,1))
n = 3
#Read a and b vectors from excel
for i in range(0,n):
    a[0,i] = xw.Range((1,i+1)).value
for i in range(0,n):
    b[i,0] = xw.Range((3+i,1)).value
Remember, Python uses 0-based indexing and Excel uses 1-based indexing.
This code will read out the vectors properly, and then you can look up NumPy's "scalar product" to perform the multiplication. You can also assign the whole vectors at once without a loop.
import xlwings as xw
import numpy as np
filename = 'C:\\Temp\\Book2.xlsx'
wb=xw.Book(filename).sheets[0]
n=3
#initialize vectors
a = np.zeros((1,n))
b = np.zeros((n,1))
#Read a and b vectors from excel
for j in range(1,n+1):
    a[0, j-1] = wb.range((1, j)).value
    b[j-1, 0] = wb.range((j+3-1, 1)).value
#Without loop
a = wb.range((1, 1),(1, 3)).value
b = wb.range((3, 1),(5, 1)).value
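Once both vectors are read in, the multiplication itself is a single matrix product. A sketch with stand-in numbers (the spreadsheet contents aren't shown in the question):

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0]])      # 1x3 row vector, as read from A1:C1
b = np.array([[4.0], [5.0], [6.0]])  # 3x1 column vector, as read from A3:A5
product = a @ b                      # equivalently np.dot(a, b)
print(product)  # [[32.]] -> 1*4 + 2*5 + 3*6
```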

Shape manipulation of numpy array

The 'd' is given as a condition, however it was obtained.
I want to get 'result' in the required shape.
I tried it as follows, but the solution is beyond me:
import numpy as np
data = [np.ones((300,1)), np.ones((300,5)), np.ones((300,3))]
result = []
for d in data:
    print(np.shape(np.array(d)))
    result.append(d)
print(np.shape(np.array(result)))
The result should be in this shape:
(300, 1+5+3) = (300,9)
Can someone help me?
I got
ValueError: could not broadcast input array from shape (300,1) into shape (300)
EDIT:
data is just there to make this question; it is a stand-in for my large program. The given condition is d, which is a list, but lists of different shapes are generated by the for loop.
Three 2d arrays that differ in the last dimension can be joined on that dimension:
np.concatenate(data, axis=1)
hstack does the same.
In my comment I suggested axis 0, but that was a quick response and I didn't have a chance to test it.
When you try ideas and they fail, show us what went wrong. You list a ValueError but don't show where it occurred or in what operation.
Your comments make a big deal about d, but you don't show how d might differ from the elements of data.
You can also try numpy.column_stack, which essentially does numpy.concatenate under the hood.
Example use
In [1]: import numpy as np
In [2]: data = [np.ones((300,1)), np.ones((300,5)), np.ones((300,3))]
In [3]: out = np.column_stack(data)
In [4]: out.shape
Out[4]: (300, 9)
Your result is a Python list. In fact it is a list with the exact same contents as the original data. You are trying to concatenate arrays horizontally (along the second dimension), so you need to use numpy.hstack:
import numpy as np
data = []
for d in some_source:
    data.append(d)
result = np.hstack(data)
print(result.shape)
If some_source is a list, a generator, or any other iterable, you can do this even more concisely:
result = np.hstack(some_source)
You want to stack the elements horizontally (if you imagine each element as a matrix with 300 rows and variable number of columns), i.e.
import numpy as np
data = [np.ones((300,1)), np.ones((300,5)), np.ones((300,3))]
result = np.hstack(data)
If you only have access to an iterator that generates elements d you can achieve the same as follows:
result = np.hstack([d for d in some_iterator_that_generates_your_ds])
try:
import numpy as np
data = [np.ones((300,1)), np.ones((300,5)), np.ones((300,3))]
result = []
print(len(data))
for d in data:
    result.append(d)
result = np.hstack(result)
print(result.shape)
This should do the job, yielding (300, 9). You can also try:
import numpy as np
data = np.ones((300,1)), np.ones((300,5)), np.ones((300,3))
result = np.vstack(data[-1])
print(result.shape)
which yields (300, 3), since it stacks only the last array. If you're looking for (300, 9) without the explicit loop, you can simply do:
result = np.hstack(data)
Finally, if you'd like your results to be in a list() as opposed to a numpy.array or numpy.matrix, you can just stick a .tolist() on the end, like so: result.tolist().