ValueError: Columns must be same length as key - python

I have a problem running the code below.
data_final is my dataframe, X is the list of columns used for the train data, and L is a list of categorical features with numeric values.
I want to one-hot encode my categorical features, so I do the following. But a "ValueError: Columns must be same length as key" is thrown for the last line, and after a lot of searching I still don't understand why.
def turn_dummy(df, prop):
    dummies = pd.get_dummies(df[prop], prefix=prop, sparse=True)
    df.drop(prop, axis=1, inplace=True)
    return pd.concat([df, dummies], axis=1)

L = ['A', 'B', 'C']
for col in L:
    data_final[X] = turn_dummy(data_final[X], col)

It appears that this is a problem of dimensionality. It would be like the following:
Say I have a list like so:
mylist = [0, 0, 0, 0]
It is of length 4. If I wanted to do 1:1 mapping of elements of a new list into that one:
otherlist = ['a', 'b']
for i in range(len(mylist)):
    mylist[i] = otherlist[i]
This will obviously throw an IndexError, because it tries to get elements that otherlist just doesn't have.
Much the same is occurring here: the frame returned by turn_dummy has a different number of columns than there are labels in X, so pandas refuses the assignment. Try:
data_final[X] = turn_dummy(data_final[X], L)
assuming the number of columns that comes back matches the number of labels you are assigning to.
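A way to sidestep the mismatch entirely is to let get_dummies encode the whole frame rather than assigning back into a fixed column selection. A minimal sketch, assuming data_final already holds the training columns and L = ['A', 'B', 'C'] lists the categorical ones:
import pandas as pd

# Minimal sketch (assumes data_final and L as described in the question):
# get_dummies drops each encoded column and appends its dummy columns in one call.
data_final = pd.get_dummies(data_final, columns=L, sparse=True)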

Related

Searching index position in python

cols = [2,4,6,8,10,12,14,16,18] # selected the columns i want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b= np.array(df1)
b
outcome
b = [['WV5 6NY' 'RE4 9VU' 'BU4 N90' 'TU3 5RE' 'NE5 4F']
['SA8 7TA' 'BA31 0PO' 'DE3 2FP' 'LR98 4TS' 0]
['MN0 4NU' 'RF5 5FG' 'WA3 0MN' 'EA15 8RE' 'BE1 4RE']
['SB7 0ET' 'SA7 0SB' 'BT7 6NS' 'TA9 0LP' 'BA3 1OE']]
a = np.concatenate(b) #concatenated to get a single array, this worked well
a = np.array([x for x in a if x != 'nan'])  # removed the nan entries
a = a[np.where(a != '0')]  # removed the zeros
print(np.sort(a)) # to sort alphabetically
#Sorted array
['BA3 1OE' 'BA31 0PO' 'BE1 4RE' 'BT7 6NS' 'BU4 N90'
 'DE3 2FP' 'EA15 8RE' 'LR98 4TS' 'MN0 4NU' 'NE5 4F' 'RE4 9VU'
 'RF5 5FG' 'SA7 0SB' 'SA8 7TA' 'SB7 0ET' 'TA9 0LP' 'TU3 5RE'
 'WA3 0MN' 'WV5 6NY']
#Find the index position of all elements of b in a(sorted array)
def findall_index(b, a):
    result = []
    for i in range(len(a)):
        for j in range(len(a[i])):
            if b[i][j] == a:
                result.append((i, j))
    return result
print(findall_index(0, result))
I am still very new to Python. I tried to find the index positions of all elements of b in a above, but the code block doesn't seem to give me any result. Can someone please help me?
Thank you in advance.
One way you could approach this is by zipping (creating pairs of) the index of each element in b with the element itself, and then sorting this new array on the elements only. Now you have a mapping from indices in the original array to positions in the new sorted array. You can then just loop over the sorted pairs to map each current index to its original index.
I would highly suggest coding this yourself, since it will help you learn!
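For reference, a rough sketch of that pairing idea (assuming b is the 2-D array of postcodes from the question, with the 0 and 'nan' fillers dropped):
# Pair each element with its (row, col) position in b; after sorting by the
# element, a pair's position in the sorted list is its index in the sorted array a.
pairs = [((i, j), val) for i, row in enumerate(b) for j, val in enumerate(row)]
pairs = [p for p in pairs if p[1] not in ('0', 0, 'nan')]  # drop the fillers
pairs.sort(key=lambda p: p[1])
for sorted_idx, ((i, j), val) in enumerate(pairs):
    print(val, 'sits at', (i, j), 'in b and at index', sorted_idx, 'in the sorted array')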

Python -- Reduce list of duplicates but preserve duplicates in list if they are alternating

I'm working with a Pandas dataframe, and I need to reduce a column's list of values while preserving alternating duplicates, if they exist, and while preserving order. I'm able to mask the values such that there are only ever two distinct values to work with (e.g., A and B below).
(It's best to show...) I'm looking to define the reduce_list() method below...
dummy_arr_one = ['A','A','B','B','A','A','A','A','B','B','B']
dummy_arr_two = ['A','A','A','B','B','B']
df = pd.DataFrame({"instance":
["group_one" for x in range(0,len(dummy_arr_one))] + ["group_two" for y in range(0,len(dummy_arr_two))],
"value":dummy_arr_one + dummy_arr_two
})
>> x = df[df['instance']=='group_one']['value'].values # ['A','A','B','B','A','A','A','A','B','B','B']
>> y = reduce_list(x)
[output] >> ['A','B','A','B']
OR
>> x = df[df['instance']=='group_two']['value'].values # ['A','A','A','B','B','B']
>> y = reduce_list(x)
[output] >> ['A','B']
I've tried a few approaches with collections and dictionaries, but I can't wrap my head around getting farther than the following (unrelated to collections attempts):
for group in df['instance'].unique():
    val_arr = df[df['instance'] == group]['value'].values
    unique_vals = np.unique(val_arr)
    ...<then what to do?>
since dictionaries need unique keys and I may need to dynamically create the keys (e.g., A_1, B_1, A_2), but then I also need to keep in mind preserving the order.
I feel like I'm overlooking something obvious. So any help is greatly appreciated!
Use itertools.groupby
from itertools import groupby
reduced = [k for k, _ in groupby(df['value'])]
print(reduced)
Output
['A', 'B', 'A', 'B', 'A', 'B']
If you need it per instance group, group first, then apply the same idea to each group:
res = [[k for k, _ in groupby(vs)] for k, vs in df.groupby('instance')['value']]
print(res)
Output
[['A', 'B', 'A', 'B'], ['A', 'B']]
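If you want it in the shape of the reduce_list() function from the question, a thin wrapper around groupby does it (a sketch):
from itertools import groupby

def reduce_list(x):
    # keep one representative from each run of equal consecutive values
    return [k for k, _ in groupby(x)]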
This works on plain lists; I might have misunderstood the context.
def reduce_list(x):
    unique = []
    reduced = []
    for k in x:
        if k not in unique:
            unique.append(k)
    # now we have the uniques.
    for k in range(len(x)-1):
        if x[k] != x[k+1]:
            reduced.append(x[k])
    if x[len(x)-1] != reduced[len(reduced)-1]:
        reduced.append(x[len(x)-1])
    return reduced
This is a loop-intensive implementation.
The first loop collects the unique values, which is very easy to understand.
The second loop checks whether two consecutive elements are different. If they are, it appends the element at the earlier position to the reduced list. However, this loop on its own misses a repeated run at the end.
Therefore you have to add an additional check that compares the last element of x with the last element of reduced and appends it if they differ.
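For example, on the two dummy lists from the question this gives:
print(reduce_list(['A','A','B','B','A','A','A','A','B','B','B']))  # ['A', 'B', 'A', 'B']
print(reduce_list(['A','A','A','B','B','B']))                      # ['A', 'B']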

Compare array with file and form groups from elements of array

I have a text file with letters (tab delimited), and a numpy array (obj) with a few letters (single row). The text file has rows with different numbers of columns. Some rows in the text file may have multiple copies of same letters (I will like to consider only a single copy of a letter in each row). Letters in the same row of the text file are assumed to be similar to each other. Also, each letter of the numpy array obj is present in one or more rows of the text file.
Below is an example of the text file:
b q a i m l r
j n o r o
e i k u i s
In the above example, the letter o appears twice in the second row, and the letter i appears twice in the third row. I would like to consider only a single copy of each letter in every row of the text file.
This is an example of obj: obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
I want to compare obj with rows of the text file and form clusters from elements in obj.
This is how I want to do it. Corresponding to each row of the text file, I want a list that denotes a cluster (in the above example we will have three clusters, since the text file has three rows). For every element of obj, I want to find the rows of the text file where that element is present, and then assign the index of that element of obj to the cluster corresponding to the row of maximum length (row lengths are computed after duplicates within a row have been removed).
Below is the Python code that I have written for this task:
import pandas as pd
import numpy as np
data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python').values[:,:].astype('<U1000')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
for i in range(data.shape[0]):
    globals()['data_row' + str(i).zfill(3)] = []
    globals()['clust' + str(i).zfill(3)] = []
    for j in range(len(obj)):
        if obj[j] in set(data[i, :]):
            globals()['data_row' + str(i).zfill(3)] += [j]

for i in range(len(obj)):
    globals()['obj_lst' + str(i).zfill(3)] = [0] * data.shape[0]
    for j in range(data.shape[0]):
        if i in globals()['data_row' + str(j).zfill(3)]:
            globals()['obj_lst' + str(i).zfill(3)][j] = len(globals()['data_row' + str(j).zfill(3)])
    indx_max = globals()['obj_lst' + str(i).zfill(3)].index(max(globals()['obj_lst' + str(i).zfill(3)]))
    globals()['clust' + str(indx_max).zfill(3)] += [i]

for i in range(data.shape[0]):
    print(globals()['clust' + str(i).zfill(3)])
>> [0]
>> [3]
>> [1, 2, 4]
The above code gives me the right answer. But in my actual work the text file has tens of thousands of rows and the numpy array has hundreds of thousands of elements, and the code above is not very fast. So I want to know whether there is a better (faster) way to implement this in Python.
You can do it with a merge after a stack on data (staying in pandas), then a few groupby operations with nunique and idxmax to get what you want:
#keep data in pandas
data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
#merge to keep only the letters from obj
df = (data.stack().reset_index(0, name='l')
          .merge(pd.DataFrame({'l': obj})).set_index('level_0'))
#get the number of unique elements of obj in each row of data
#and use transform to keep this length along each row of df
df['len'] = df.groupby('level_0').transform('nunique')
#get the result you want in a series
res = (pd.DataFrame({'data_row': df.groupby('l')['len'].idxmax().values})
         .groupby('data_row').apply(lambda x: list(x.index)))
print(res)
print(res)
data_row
0 [0]
1 [3]
2 [1, 2, 4]
dtype: object
res contains the clusters, with the index being the row number in the original data.
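If you prefer the plain lists that the loop version prints, you can pull them straight out of the series; for the three-row example this matches the output above:
clusters = res.tolist()
print(clusters)  # [[0], [3], [1, 2, 4]]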

How to change the values of only part (slice) of a list of strings

Does anyone know how to concatenate a list of strings using a for loop?
This works, without a loop:
# Extract column names
column_names = file.columns.tolist()
column_names[1] = column_names[1] + '_Vol_Adj'
column_names[1]
But it does not work as part of the following loop:
for i in column_names[1:-2]:
    column_names[i] = column_names[i] + '_Vol_Adj'
    i = i + 1
TypeError Traceback (most recent call last)
<ipython-input-242-5509cede9f32> in <module>()
2
3 for i in column_names[1:-2]:
----> 4 column_names[i]= column_names[i] + '_Vol_Adj'
5 i = i+1
TypeError: list indices must be integers or slices, not str
Your problem is that when you iterate over the list column_names using for i in column_names[1:-2], the value of i will be the element in the list (not the corresponding index).
In your case, one simple thing to do is to use enumerate, as shown in the following example:
column_names = ["a", "b", "c", "d", "e"]
for i, val in enumerate(column_names[1:-2]):
column_names[i+1] += '_Vol_Adj'
print(column_names)
#['a', 'b_Vol_Adj', 'c_Vol_Adj', 'd', 'e']
I am updating the value at index i+1 inside the loop because we are starting the iteration at index 1.
Or you could use range(1, len(column_names)-2), as @Barmar suggested (here i is already the list index, so no offset is needed):
for i in range(1, len(column_names)-2):
    column_names[i] += '_Vol_Adj'
You can also use add_suffix and df.rename:
cols = ['b', 'c']
ncols = df[cols].add_suffix('_suffix').columns
df.rename(columns={old: new for old, new in zip(cols, ncols)})
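For example, on a hypothetical frame with columns a through e (note that rename returns a new frame unless you pass inplace=True):
import pandas as pd

# Hypothetical frame just to illustrate the rename above
df = pd.DataFrame(columns=list('abcde'))
cols = ['b', 'c']
ncols = df[cols].add_suffix('_suffix').columns
df = df.rename(columns={old: new for old, new in zip(cols, ncols)})
print(df.columns)
# Index(['a', 'b_suffix', 'c_suffix', 'd', 'e'], dtype='object')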
See @pault's solution for how you can adapt your for loop with enumerate or range.
Given you are using Pandas, you may be interested in a NumPy-based solution:
import pandas as pd, numpy as np
df = pd.DataFrame(columns=list('ABCDE'))
arr = df.columns.values
arr[1:-1] += '_Vol_Adj'
df.columns = arr
print(df.columns)
Index(['A', 'B_Vol_Adj', 'C_Vol_Adj', 'D_Vol_Adj', 'E'], dtype='object')
It's important that you do not modify df.columns.values directly, as this has side-effects. Here we've modified a copy of the underlying NumPy array and then assigned back to the dataframe.

Unknown error on PySpark map + broadcast

I have a big collection of tuples with tuple[0] = integer and tuple[1] = list of integers (resulting from a groupBy). I call the value tuple[0] the key, for simplicity.
The values inside the list tuple[1] may themselves be other keys.
If key = n, all elements of its list are greater than n, and the list is sorted and has distinct elements.
In the problem I am working on, I need to find the number of common elements in the following way:
0, [1,2]
1, [3,4,5]
2, [3,7,8]
.....
list of values of key 0:
1: [3,4,5]
2: [3,7,8]
common_elements between the list of 1 and the list of 2: {3} -> len(common_elements) = 1
Then I apply the same for keys 1, 2 etc, so:
list of values of 1:
3: ....
4: ....
5: ....
The sequential script I wrote is based on a pandas DataFrame df, with the first column v holding the 'keys' (used as the index) and the second column n holding the lists of values:
for i in df.v:  # iterate over each value of v
    for j in df.n[i]:  # iterate within the list
        common_values = set(df.n[i]).intersection(df.n[j])
        if len(common_values) > 0:
            return len(common_values)
Since it is a big dataset, I'm trying to write a parallelized version with PySpark.
df.A  # column of integers
df.B  # column of integers
val_colA = sc.parallelize(df.A)
val_colB = sc.parallelize(df.B)
n_values = val_colA.zip(val_colB).groupByKey().mapValues(sorted)  # RDD -> n_values[0] will be the key, n_values[1] is the list of values
n_values_broadcast = sc.broadcast(n_values.collectAsMap())  # read-only dictionary

def f(element):
    for i in element[1]:  # iterating over the values of "key" element[0]
        common_values = set(element[1]).intersection(n_values_broadcast.value[i])
        if len(common_values) > 0:
            return len(common_values)

collection = n_values.map(f).collect()
The program fails after a few seconds with an error like KeyError: 665, but it does not give any more specific reason for the failure.
I'm a Spark beginner, so I'm not sure whether this is the correct approach (should I consider foreach instead? or mapPartitions?), and especially where the error comes from.
Thanks for the help.
The error is actually pretty clear, and it is not Spark specific. You are accessing a Python dict with __getitem__ ([]):
n_values_broadcast.value[i]
and if the key is missing from the dictionary you'll get a KeyError. Use the get method instead:
n_values_broadcast.value.get(i, [])
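Applied to the mapper from the question, that looks roughly like this (a sketch, keeping the same names; a key whose values never appear in the dictionary simply contributes empty sets to the intersections):
def f(element):
    # element[0] is the key, element[1] its sorted list of values
    for i in element[1]:
        # .get(i, []) returns an empty list when i is not in the broadcast dict,
        # instead of raising KeyError
        common_values = set(element[1]).intersection(n_values_broadcast.value.get(i, []))
        if len(common_values) > 0:
            return len(common_values)

collection = n_values.map(f).collect()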
