Loop over nested dictionary in list [duplicate] - python

This question already has answers here: How do I create variable variables? (17 answers). Closed 1 year ago.
I have a list whose elements are nested dictionaries. There are 12 such dictionaries with 107 key-value pairs each; here is a snapshot of the first dictionary:
big_list = [{0: 0.9065934065934067,
1: 0.0,
2: 0.14285714285714288,
3: 0.03663003663003663,
4: 0.0,
5: 0.0,
6: 0.053113553113553126,
7: 0.03663003663003663,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.01098901098901099,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0,
20: 0.0,
21: 0.0,
22: 0.0,
23: 0.0,
24: 0.0,
25: 0.0,
26: 0.0,
27: 0.0,
28: 0.0,
29: 0.0,
30: 0.0,
31: 0.0,
32: 0.0,
33: 0.0,
34: 0.0,
35: 0.0,
36: 0.0,
37: 0.0,
38: 0.0,
39: 0.0,
40: 0.0,
41: 0.0,
42: 0.0,
43: 0.0,
44: 0.0,
45: 0.0,
46: 0.0,
47: 0.0,
48: 0.0,
49: 0.0,
50: 0.0,
51: 0.0,
52: 0.0,
53: 0.0,
54: 0.0,
55: 0.0,
56: 0.0,
57: 0.0,
58: 0.0,
59: 0.0,
60: 0.0,
61: 0.0,
62: 0.0,
63: 0.0,
64: 0.0,
65: 0.0,
66: 0.0,
67: 0.0,
68: 0.0,
69: 0.0,
70: 0.0,
71: 0.0,
72: 0.0,
73: 0.0,
74: 0.0,
75: 0.0,
76: 0.0,
77: 0.0,
78: 0.0,
79: 0.0,
80: 0.0,
81: 0.0,
82: 0.0,
83: 0.0,
84: 0.0,
85: 0.0,
86: 0.0,
87: 0.0,
88: 0.0,
89: 0.0,
90: 0.0,
91: 0.0,
92: 0.0,
93: 0.0,
94: 0.0,
95: 0.0,
96: 0.0,
97: 0.0,
98: 0.0,
99: 0.0,
100: 0.0,
101: 0.0,
102: 0.0,
103: 0.0,
104: 0.0,
105: 0.0,
106: 0.0},
I want to construct a loop through which I can extract the values in this way:
first_dict[key1][value]
second_dict[key1][value]
third_dict[key1][value]
...
twelfth_dict[key1][value]
...
first_dict[key107][value]
...
twelfth_dict[key107][value]
and so on for every key in every dictionary, and then find the average value of every key across dictionaries, i.e. the average value of key 0, key 1, and so on through key 106. I know the zeros may complicate things a bit, but they're needed for the task. Please let me know if this is possible; happy to elaborate further if needed. Thanks.

If the dictionary keys are ascending integers, you can iterate over the keys first:
big_list = [{
0: 0.9065934065934067,
1: 0.0,
2: 0.14285714285714288,
3: 0.03663003663003663,
4: 0.0,
5: 0.0
}, {
0: 0.9065934065934067,
1: 0.0,
2: 0.14285714285714288,
3: 0.03663003663003663,
4: 0.0,
5: 0.0,
6: 1.0
}, {
0: 0.9065934065934067,
1: 0.0,
2: 0.14285714285714288,
3: 0.03663003663003663,
4: 0.0,
5: 0.0,
6: 1.0,
7: 2.2
}]
longestDict = max(len(d.keys()) for d in big_list)  # length of the longest dictionary, in case they are not equal-sized
for key in range(0, longestDict):
    print(f"Key: {key}")
    for dct in big_list:
        print(f"\t{dct.get(key, None)}")
Out:
Key: 0
0.9065934065934067
0.9065934065934067
0.9065934065934067
Key: 1
0.0
0.0
0.0
Key: 2
0.14285714285714288
0.14285714285714288
0.14285714285714288
...
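To then get the per-key averages the question asks for, a minimal sketch along the same lines (here a missing key is simply skipped rather than counted as zero):
longestDict = max(len(d) for d in big_list)
averages = {}
for key in range(longestDict):
    # collect this key's value from every dictionary that contains it
    values = [dct[key] for dct in big_list if key in dct]
    averages[key] = sum(values) / len(values)
print(averages[0])  # average value of key 0 across all dictionaries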

Related

Improve computational speed for chunk-wise distance calculation

I have a performance issue with some code that computes distances between vectors, but a little context is in order before stating the problem.
I have two sets of vectors stored in two dataframes. I want to compute the distance from every vector in one dataframe to every vector in the other. Here is what these dataframes look like (I post them as dictionaries at the end of the question); only the first 5 rows are shown:
df_sample =
CalVec
1272 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
657 [1.44, 12.0, 10.0, 5.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.23, 4.36, 15.0]
806 [4.58, 13.09, 15.46, 3.59, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 6.31]
771 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 2.0, 0.0, 5.59, 11.67, 3.91, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
1370 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.0, 2.89, 0.0, 0.0, 0.0, 0.0]
df_sample.to_dict()
and
DF =
id \
4538 A4060462000516278
5043 A4050494272716275
11663 A4070271111316245
2701 A4060462848716270
825 A4060454573516274
MeasVec
4538 [0.0, 0.0, 0.0, 0.0, 6.0, 15.0, 16.0, 0.0, 0.0, 5.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0, 3.0]
5043 [0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 12.0, 0.0, 13.0, 15.0, 0.0, 15.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0]
11663 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 15.0, 0.0, 0.0, 0.0, 6.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2701 [0.0, 0.0, 0.0, 8.0, 13.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 7.0]
825 [0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 15.0, 0.0, 13.0, 16.0, 0.0, 9.0, 3.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
In reality df_sample has 1700 rows while DF has 12000 rows. I provide a sample of 10 and 50 respectively.
Now, to compute the distances (in my full-size data) I am forced to chunk the larger dataframe into smaller pieces, and in my actual distance computation I need to make sure that each chunk has the same number of rows as df_sample, hence I pad df_sample with zero vectors until it matches the length of the chunk.
M = len(DF)
N = len(df_sample)
P = int(round(M/N, 0)) - 1
Number_of_id = int(round(M/P, 0))  # there are only unique ids in DF
Number_AP = 26

def zerolistmaker(n):
    listofzeros = [0.0] * n
    return listofzeros

def split_dataframe(df, chunk_size):
    chunks = list()
    num_chunks = len(df) // chunk_size + 1
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks

DF_chunked = split_dataframe(DF, Number_of_id)
and here I compute the distances (actually weighted distances, so they are not symmetric, i.e. d(v1,v2) != d(v2,v1)).
import time

t = time.process_time()
DIST = []
for i in range(P):
    vec = DF_chunked[i]
    number_zero_vectors = len(vec) - len(df_sample)
    df = pd.DataFrame(columns=['CalVec'])
    for k in range(number_zero_vectors):
        a = zerolistmaker(Number_AP)
        df = df.append({'CalVec': a}, ignore_index=True)
    df_sample_ = pd.concat([df_sample, df])
    m = np.repeat(np.vstack(df_sample_['CalVec']), df_sample_.shape[0], axis=0)
    n = np.tile(np.vstack(vec['MeasVec']), (vec.shape[0], 1))
    d = np.count_nonzero(m, axis=1, keepdims=True)
    dist = np.sqrt(np.sum((m - n)**2/d, axis=-1))
    mi = pd.MultiIndex.from_product([vec['id']] * 2, names=['id2', 'id'])
    out = pd.DataFrame({'CalVec': m.tolist(),
                        'MeasVec': n.tolist(),
                        'distance': dist}, index=mi).reset_index()
    DIST.append(out)
elapsed_time = time.process_time() - t
distances = pd.concat(DIST)
distances = distances.drop(['id2'], axis=1)
distances = distances.dropna()
print(elapsed_time)
which gives the time 0.0625 and the distance df:
id \
0 A4060462000516278
1 A4050494272716275
2 A4070271111316245
3 A4060462848716270
CalVec \
0 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
1 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
2 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
3 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
MeasVec \
0 [0.0, 0.0, 0.0, 0.0, 6.0, 15.0, 16.0, 0.0, 0.0, 5.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0, 3.0]
1 [0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 12.0, 0.0, 13.0, 15.0, 0.0, 15.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0]
2 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 15.0, 0.0, 0.0, 0.0, 6.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
3 [0.0, 0.0, 0.0, 8.0, 13.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 7.0]
distance
0 8.98
1 10.45
2 8.92
3 5.19
Now, this seems fast, but it isn't. In fact the time grows rapidly with the data size, and on the entire sets it takes almost 20 minutes, if the kernel doesn't crash. It is so memory-consuming that I cannot do anything else on my computer.
I would appreciate any insight.
DATA
df_sample = {'CalVec': {1272: [0.0,
4.0,
8.0,
15.0,
10.0,
8.0,
2.54,
2.0,
4.91,
0.0,
0.0,
0.0,
0.0,
3.59,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.0],
657: [1.44,
12.0,
10.0,
5.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.23,
4.36,
15.0],
806: [4.58,
13.09,
15.46,
3.59,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
6.31],
771: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
4.0,
0.0,
2.0,
0.0,
5.59,
11.67,
3.91,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
1370: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
15.0,
2.89,
0.0,
0.0,
0.0,
0.0],
991: [0.0,
0.0,
0.0,
0.0,
9.0,
1.75,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
6.5,
14.71,
13.0,
9.0],
194: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
4.0,
15.54,
13.0,
2.12,
0.0],
1128: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.77,
1.8,
7.0,
6.0,
0.0,
1.8,
0.0,
9.0,
7.0,
0.0,
2.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
158: [0.0,
0.0,
0.0,
0.0,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
8.0,
15.44,
13.0,
2.0],
580: [0.0,
2.0,
6.0,
15.64,
2.0,
2.0,
9.23,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
4.23]}}
and
DF = {'id': {4538: 'A4060462000516278',
5043: 'A4050494272716275',
11663: 'A4070271111316245',
2701: 'A4060462848716270',
825: 'A4060454573516274',
8679: 'A4060462010016274',
11700: 'A4060462080916270',
8594: 'A4060461067716272',
8707: 'A4060454363916275',
1071: 'A4060463723916275',
7128: 'A4050494407616274',
8828: 'A4060464006116272',
8505: 'A4050500855716270',
9958: 'A4060462054116273',
2048: 'A4060461032216279',
8522: 'A4050494268116274',
10934: 'A4070270449716242',
10128: 'A4050500604416279',
9453: 'A4050500735216272',
11820: 'A4060462873316274',
7617: 'A4060461991516276',
6930: 'A4050500905516274',
11376: 'A4060454760216279',
5619: 'A4139300114013544',
35: 'A4050470904716271',
7957: 'A4090281675416244',
4216: 'A4050494309816277',
6244: 'A4050494283216272',
11922: 'A4070271196316248',
8914: 'A4060461041916276',
6054: 'A4060462056416278',
12014: 'A4060464023316273',
1362: 'A4050494275316274',
749: 'A4620451876116275',
4405: 'A4620451903216277',
2021: 'A4060454386016271',
7175: 'A4060462829816270',
351: 'A4060454654316272',
5853: 'A4050494877016279',
7980: 'A4050500932116270',
17: 'A4620451899116270',
8234: 'A4050494361416271',
10271: 'A4050500470516271',
1325: 'A4050500771516275',
2391: 'A4050500683216274',
372: 'A4050494830916277',
5527: 'A4050490253316276',
5431: 'A4050500884316278',
717: 'A4060461998716275',
10015: 'A4050500032916279'},
'MeasVec': {4538: [0.0,
0.0,
0.0,
0.0,
6.0,
15.0,
16.0,
0.0,
0.0,
5.0,
0.0,
15.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.5,
0.0,
3.0],
5043: [0.0,
0.0,
0.0,
0.0,
0.0,
16.0,
12.0,
0.0,
13.0,
15.0,
0.0,
15.0,
0.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
3.0,
0.0],
11663: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
15.0,
0.0,
0.0,
0.0,
6.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
2701: [0.0,
0.0,
0.0,
8.0,
13.0,
16.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
6.0,
0.0,
7.0],
825: [0.0,
0.0,
0.0,
0.0,
0.0,
11.0,
15.0,
0.0,
13.0,
16.0,
0.0,
9.0,
3.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
8679: [0.0,
4.0,
9.0,
15.0,
10.0,
3.0,
2.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
9.0],
11700: [0.0,
0.0,
6.0,
0.0,
15.0,
8.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
16.0,
0.0,
6.0],
8594: [12.0,
16.0,
16.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
8.0,
0.0,
5.0],
8707: [7.0,
5.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.0,
15.0],
1071: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
12.0,
15.5,
6.0,
3.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
7128: [0.0,
0.0,
0.0,
0.0,
10.0,
15.0,
16.0,
0.0,
8.0,
12.0,
0.0,
12.0,
0.0,
0.0,
4.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
0.0,
11.0],
8828: [0.0,
0.0,
0.0,
0.0,
11.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
12.0,
15.0,
15.0,
7.0],
8505: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
15.0,
16.0,
4.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
9958: [0.0,
0.0,
0.0,
0.0,
14.0,
9.0,
6.0,
0.0,
0.0,
0.0,
0.0,
13.0,
0.0,
0.0,
7.0,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
7.0,
0.0,
6.0],
2048: [0.0,
0.0,
0.0,
11.0,
0.0,
16.0,
14.0,
0.0,
7.0,
5.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
8522: [0.0,
0.0,
0.0,
4.0,
4.0,
16.0,
9.0,
0.0,
0.0,
3.0,
0.0,
14.0,
0.0,
0.0,
5.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
11.5,
0.0,
0.0],
10934: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.0,
4.5,
0.0,
2.0,
0.0,
15.0,
5.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
10128: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
7.0,
12.0,
0.0,
12.0,
5.0,
3.0,
6.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
9453: [0.0,
0.0,
5.0,
16.0,
0.0,
2.0,
6.0,
0.0,
4.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
11820: [0.0,
0.0,
0.0,
10.0,
9.0,
15.0,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
7617: [0.0,
3.0,
10.0,
9.0,
15.0,
11.0,
8.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
2.0,
15.0],
6930: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
0.0,
0.0,
10.0,
15.5,
14.0,
15.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
11376: [0.0,
0.0,
10.0,
7.0,
7.0,
11.0,
7.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
16.0],
5619: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
12.0,
14.0,
2.5,
2.0,
8.0],
35: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
13.0,
16.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
7957: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
4.5,
0.0,
7.0,
7.0,
0.0,
2.0,
0.0,
15.0,
8.0,
4.5,
4.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
4216: [16.0,
6.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
5.0],
6244: [11.0,
7.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
10.0],
11922: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
15.0,
0.0,
0.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
8914: [2.0,
0.0,
4.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
15.0],
6054: [0.0,
0.0,
0.0,
0.0,
15.0,
9.0,
5.0,
0.0,
0.0,
0.0,
0.0,
13.0,
0.0,
0.0,
6.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
8.0,
0.0,
6.0],
12014: [3.0,
7.0,
6.0,
0.0,
14.0,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
16.0],
1362: [15.0,
16.0,
5.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
0.0],
749: [14.0,
15.0,
16.0,
3.0,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
6.0,
2.0,
12.0],
4405: [11.0,
16.0,
16.0,
3.0,
4.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
4.0],
2021: [0.0,
0.0,
0.0,
0.0,
0.0,
8.0,
16.0,
0.0,
0.0,
4.0,
0.0,
6.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
7175: [0.0,
0.0,
0.0,
2.0,
9.0,
16.0,
15.0,
0.0,
0.0,
3.0,
0.0,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
0.0,
5.0],
351: [0.0,
0.0,
0.0,
0.0,
12.0,
16.0,
5.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
7.0,
0.0,
5.0],
5853: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
16.0,
8.0,
1.5,
0.0,
0.0,
0.0,
4.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
7980: [0.0,
0.0,
13.0,
8.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
4.0],
17: [11.0,
16.0,
16.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
11.0],
8234: [0.0,
0.0,
0.0,
0.0,
0.0,
6.0,
7.0,
5.0,
11.0,
13.0,
0.0,
13.0,
3.0,
11.0,
15.0,
12.0,
12.0,
5.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
10271: [0.0,
0.0,
0.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
9.0,
0.0,
15.0,
9.0,
5.0,
5.0],
1325: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
0.0,
0.0,
5.0,
0.0,
16.0,
0.0,
0.0,
9.0,
0.0,
5.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
2391: [0.0,
0.0,
3.0,
16.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
4.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0],
372: [0.0,
0.0,
0.0,
0.0,
4.0,
16.0,
10.0,
0.0,
0.0,
3.0,
0.0,
12.0,
0.0,
0.0,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
7.0,
6.0,
0.0],
5527: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
0.0,
2.0,
0.0,
0.0,
14.0,
16.0,
7.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
5431: [0.0,
0.0,
0.0,
0.0,
2.0,
3.0,
8.0,
0.0,
4.0,
7.0,
0.0,
16.0,
0.0,
0.0,
8.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
717: [0.0,
0.0,
0.0,
11.0,
2.0,
14.0,
9.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
10015: [0.0,
0.0,
0.0,
7.0,
14.0,
16.0,
15.0,
0.0,
4.0,
9.0,
0.0,
11.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
6.0,
3.0,
12.0]}}
Distance calculation is a common problem, so it is a good idea to use existing functions for it, specifically sklearn. The data you provided is not convenient to handle, but the example below might give ideas on how to adapt this workflow to the specifics of your data:
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
X = pd.DataFrame(np.random.rand(10, 30))
Y = pd.DataFrame(np.random.rand(20, 30))
def custom_distance(x, y):
    """Sample asymmetric function."""
    return max(x) + min(y)
# use n_jobs=-1 to run calculations with all cores
result = pairwise_distances(X, Y, metric=custom_distance, n_jobs=-1)
To complete @SultanOrazbayev's answer:
from sklearn.metrics import pairwise_distances
Ax = df_sample['CalVec'] = df_sample['CalVec'].apply(lambda x: np.array(x))
Bx = DF['MeasVec'] = DF['MeasVec'].apply(lambda x: np.array(x))
A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.stack(A)
BB = np.stack(B)
result = pairwise_distances(AA, BB, metric=custom_distance, n_jobs=-1)
which runs in under 3 minutes.
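For reference, the weighted distance from the question can also be computed without any chunking. A minimal sketch, assuming the column names from the question: expanding sum((a - b)**2) into ||a||**2 + ||b||**2 - 2*a.b turns the whole computation into one matrix product, so only the final (1700 x 12000) matrix is ever materialized.
import numpy as np
A = np.stack(df_sample['CalVec'].to_numpy())    # shape (n_cal, 26)
B = np.stack(DF['MeasVec'].to_numpy())          # shape (n_meas, 26)
w = np.count_nonzero(A, axis=1, keepdims=True)  # per-row weight, as in the loop above
sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
dist = np.sqrt(np.maximum(sq, 0) / w)           # clip tiny negative rounding errors
dist[i, j] is then the weighted distance from CalVec row i to MeasVec row j.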

'ValueError: setting an array element with a sequence' while using scipy.spatial.distance function cdist

I am trying to compute the distance between vectors in two pandas dataframes using cdist from scipy.spatial.distance, but the output is all wrong and I can't pinpoint where it fails. This is actually a follow-up to the question Problems computing cdist of two columns in two different dataframes
So, my original dataframes are of the type:
df_sample =
Fingerprint
1272 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
657 [1.44, 12.0, 10.0, 5.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.23, 4.36, 15.0]
806 [4.58, 13.09, 15.46, 3.59, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 6.31]
and
DF =
barcode \
4538 A4060462000516278
5043 A4050494272716275
11663 A4070271111316245
2701 A4060462848716270
825 A4060454573516274
8679 A4060462010016274
11700 A4060462080916270
8594 A4060461067716272
8707 A4060454363916275
1071 A4060463723916275
Geopos Ack
4538 [0.0, 0.0, 0.0, 0.0, 6.0, 15.0, 16.0, 0.0, 0.0, 5.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0, 3.0]
5043 [0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 12.0, 0.0, 13.0, 15.0, 0.0, 15.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0]
11663 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 15.0, 0.0, 0.0, 0.0, 6.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2701 [0.0, 0.0, 0.0, 8.0, 13.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 7.0]
825 [0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 15.0, 0.0, 13.0, 16.0, 0.0, 9.0, 3.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
8679 [0.0, 4.0, 9.0, 15.0, 10.0, 3.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 9.0]
11700 [0.0, 0.0, 6.0, 0.0, 15.0, 8.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 6.0]
8594 [12.0, 16.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 5.0]
8707 [7.0, 5.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0, 15.0]
1071 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
(I provide dictionaries for both at the end of the question).
As you can see, they have different numbers of rows (although the vectors belong to the same space). So, to remedy this I pad df_sample with zero vectors:
Number_AP = 26
number_zero_vectors = len(DF) - len(df_sample)
df = pd.DataFrame(columns=['Fingerprint'])
for k in range(number_zero_vectors):
    a = zerolistmaker(Number_AP)
    df = df.append({'Fingerprint': a}, ignore_index=True)
df_sample_ = pd.concat([df_sample, df])
Hence, DF and df_sample_ have the same shape. My first approach (described in detail in the question mentioned above) was to deal with both columns and all their rows simultaneously, but it did not work. So my approach now is to work incrementally (e.g. take the first row of df_sample_ and compute the distance to every single vector in DF), all inside a loop:
distances = []
for i in range(len(df_sample_)):
    a = df_sample_['Fingerprint'][i:i+1]
    print(a)
    for j in range(len(DF)):
        b = DF['Geopos Ack'][j:j+1]
        print(b)
        ax = a.to_numpy()
        bx = b.to_numpy()
        aa = ax.reshape(-1, 1)
        bb = bx.reshape(-1, 1)
        print(aa.shape)
        print(bb.shape)
        d = sp.cdist(aa, bb, 'euclidean')
        print(d)
        # I leave out the distances.append(d) part, since what comes before it already fails
This returns:
1272 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
Name: Fingerprint, dtype: object
4538 [0.0, 0.0, 0.0, 0.0, 6.0, 15.0, 16.0, 0.0, 0.0, 5.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0, 3.0]
Name: Geopos Ack, dtype: object
(1, 1)
(1, 1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-368-e29e9515d73a> in <module>
14 print(bb.shape)
15
---> 16 d = sp.cdist(aa,bb,'euclidean')
17 print(d)
18
~\Anaconda3\envs\conda-qgis\lib\site-packages\scipy\spatial\distance.py in cdist(XA, XB, metric, *args, **kwargs)
2802 metric_name = _METRIC_ALIAS.get(mstr, None)
2803 if metric_name is not None:
-> 2804 XA, XB, typ, kwargs = _validate_cdist_input(XA, XB, mA, mB, n,
2805 metric_name, **kwargs)
2806
~\Anaconda3\envs\conda-qgis\lib\site-packages\scipy\spatial\distance.py in _validate_cdist_input(XA, XB, mA, mB, n, metric_name, **kwargs)
245 typ = types[types.index(XA.dtype)] if XA.dtype in types else types[0]
246 # validate data
--> 247 XA = _convert_to_type(XA, out_type=typ)
248 XB = _convert_to_type(XB, out_type=typ)
249
~\Anaconda3\envs\conda-qgis\lib\site-packages\scipy\spatial\distance.py in _convert_to_type(X, out_type)
182
183 def _convert_to_type(X, out_type):
--> 184 return np.ascontiguousarray(X, dtype=out_type)
185
186
~\Anaconda3\envs\conda-qgis\lib\site-packages\numpy\core\_asarray.py in ascontiguousarray(a, dtype)
175
176 """
--> 177 return array(a, dtype, copy=False, order='C', ndmin=1)
178
179
ValueError: setting an array element with a sequence.
Where did I go wrong? All arrays consist of floats and the shapes look right.
I know another approach would be to use pairwise_distances from sklearn, but I did not manage to apply it to my dataframes.
Any help would be appreciated.
Data:
df_sample =
{'Fingerprint': {1272: [0.0,
4.0,
8.0,
15.0,
10.0,
8.0,
2.54,
2.0,
4.91,
0.0,
0.0,
0.0,
0.0,
3.59,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.0],
657: [1.44,
12.0,
10.0,
5.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.23,
4.36,
15.0],
806: [4.58,
13.09,
15.46,
3.59,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
6.31]}}
and
DF =
{'barcode': {4538: 'A4060462000516278',
5043: 'A4050494272716275',
11663: 'A4070271111316245',
2701: 'A4060462848716270',
825: 'A4060454573516274',
8679: 'A4060462010016274',
11700: 'A4060462080916270',
8594: 'A4060461067716272',
8707: 'A4060454363916275',
1071: 'A4060463723916275'},
'Geopos Ack': {4538: [0.0,
0.0,
0.0,
0.0,
6.0,
15.0,
16.0,
0.0,
0.0,
5.0,
0.0,
15.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.5,
0.0,
3.0],
5043: [0.0,
0.0,
0.0,
0.0,
0.0,
16.0,
12.0,
0.0,
13.0,
15.0,
0.0,
15.0,
0.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
3.0,
0.0],
11663: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
15.0,
0.0,
0.0,
0.0,
6.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
2701: [0.0,
0.0,
0.0,
8.0,
13.0,
16.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
6.0,
0.0,
7.0],
825: [0.0,
0.0,
0.0,
0.0,
0.0,
11.0,
15.0,
0.0,
13.0,
16.0,
0.0,
9.0,
3.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
8679: [0.0,
4.0,
9.0,
15.0,
10.0,
3.0,
2.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
9.0],
11700: [0.0,
0.0,
6.0,
0.0,
15.0,
8.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
16.0,
0.0,
6.0],
8594: [12.0,
16.0,
16.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
8.0,
0.0,
5.0],
8707: [7.0,
5.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.0,
15.0],
1071: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
12.0,
15.5,
6.0,
3.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0]}}
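The error itself comes from the fact that 'Fingerprint' and 'Geopos Ack' hold Python lists: a.to_numpy() returns a size-1 object array whose single element is a list, and reshape(-1, 1) cannot turn that into a numeric 2-D array, so cdist fails while converting its input. A minimal sketch of one way around this (assuming sp is scipy.spatial.distance, as in the question): stack the lists into proper float arrays first. Note that cdist does not require both inputs to have the same number of rows, only the same number of columns, so the zero-padding step is not needed at all.
import numpy as np
from scipy.spatial import distance as sp
aa = np.stack(df_sample['Fingerprint'].to_numpy())  # shape (3, 26) for the sample data
bb = np.stack(DF['Geopos Ack'].to_numpy())          # shape (10, 26)
d = sp.cdist(aa, bb, 'euclidean')                   # d[i, j]: distance from row i of aa to row j of bb
print(d.shape)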

How to resample hourly this dataframe?

I have this dataset, and I'm trying to compute the mean of "AC_POWER" every hour, but it isn't working properly. The dataset has 20-22 values every 15 minutes. I want to end up with something like this:
DATE AC_POWER
'15-05-2020 00:00' 400
'15-05-2020 01:00' 500
'15-05-2020 02:00' 500
'15-05-2020 03:00' 500
How to solve this?
import pandas as pd
df = pd.read_csv('dataset.csv')
df = df.reset_index()
df['DATE_TIME'] = df['DATE_TIME'].astype('datetime64[ns]')
df = df.resample('H', on='DATE_TIME').mean()
>>> df.head(10).to_dict()
{'AC_POWER': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
'DAILY_YIELD': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
'DATE_TIME': {0: '15-05-2020 00:00', 1: '15-05-2020 00:00', 2: '15-05-2020 00:00', 3: '15-05-2020 00:00',
4: '15-05-2020 00:00', 5: '15-05-2020 00:00', 6: '15-05-2020 00:00', 7: '15-05-2020 00:00',
8: '15-05-2020 00:00', 9: '15-05-2020 00:00'},
'DC_POWER': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
'PLANT_ID': {0: 4135001, 1: 4135001, 2: 4135001, 3: 4135001, 4: 4135001, 5: 4135001,
6: 4135001, 7: 4135001, 8: 4135001, 9: 4135001},
'SOURCE_KEY': {0: '1BY6WEcLGh8j5v7', 1: '1IF53ai7Xc0U56Y', 2: '3PZuoBAID5Wc2HD', 3: '7JYdWkrLSPkdwr4',
4: 'McdE0feGgRqW7Ca', 5: 'VHMLBKoKgIrUVDU', 6: 'WRmjgnKYAwPKWDb', 7: 'ZnxXDlPa8U1GXgE',
8: 'ZoEaEvLYb1n2sOq', 9: 'adLQvlD726eNBSB'},
'TOTAL_YIELD': {0: 6259559.0, 1: 6183645.0, 2: 6987759.0, 3: 7602960.0, 4: 7158964.0,
5: 7206408.0, 6: 7028673.0, 7: 6522172.0, 8: 7098099.0, 9: 6271355.0}}
EDIT: I tried the same code I've posted with a different dataset and it worked!
You need to set your date as the index first; the following does this and computes the mean over windows of 15 minutes:
df.set_index('DATE_TIME').resample('15T').mean()
Also, make sure your date column is correctly formatted.
I think you're looking for DataFrame.resample:
df.resample(rule='H', on='DATE_TIME')['AC_POWER'].mean()
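Given the EDIT above (the same code worked on a different dataset), the problem is probably date parsing rather than the resample itself: the sample dates are day-first ('15-05-2020 00:00'), and astype('datetime64[ns]') can misread ambiguous day-first strings. A minimal sketch that parses the dates explicitly before resampling, assuming the column names from the question:
import pandas as pd
df = pd.read_csv('dataset.csv')
# parse day-first explicitly so e.g. '05-06-2020' is June 5th, not May 6th
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], format='%d-%m-%Y %H:%M')
hourly = df.resample('H', on='DATE_TIME')['AC_POWER'].mean()
print(hourly.head())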

Number of labels does not match samples on decision tree regression

Trying to run a decision tree regressor on my data, but whenever I try to run my code, I get this error:
ValueError: Number of labels=78177 does not match number of samples=312706
# feature selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
target = ['sale_price']
train, test = train_test_split(housing_data, test_size=0.2)
regression_tree = DecisionTreeRegressor(criterion="entropy", random_state=100,
                                        max_depth=4, min_samples_leaf=5)
regression_tree.fit(train, test)
I have added a sample of my data (as a dictionary); hopefully this gives you more context to better understand my question and problem:
{'Age of House at Sale': {0: 6,
1: 2016,
2: 92,
3: 42,
4: 90,
5: 2012,
6: 89,
7: 3,
8: 2015,
9: 104},
'AreaSource': {0: 2.0,
1: 7.0,
2: 2.0,
3: 2.0,
4: 2.0,
5: 2.0,
6: 2.0,
7: 2.0,
8: 2.0,
9: 2.0},
'AssessLand': {0: 9900.0,
1: 1571850.0,
2: 1548000.0,
3: 36532350.0,
4: 2250000.0,
5: 3110400.0,
6: 2448000.0,
7: 1354500.0,
8: 1699200.0,
9: 1282500.0},
'AssessTot': {0: 34380.0,
1: 1571850.0,
2: 25463250.0,
3: 149792400.0,
4: 27166050.0,
5: 5579990.0,
6: 28309500.0,
7: 23965650.0,
8: 3534300.0,
9: 11295000.0},
'BldgArea': {0: 2688.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 356000.0,
5: 382746.0,
6: 290440.0,
7: 241764.0,
8: 463427.0,
9: 547000.0},
'BldgClass': {0: 72,
1: 89,
2: 80,
3: 157,
4: 150,
5: 44,
6: 92,
7: 43,
8: 39,
9: 61},
'BldgDepth': {0: 50.0,
1: 0.0,
2: 92.0,
3: 0.0,
4: 100.33,
5: 315.0,
6: 125.0,
7: 100.0,
8: 0.0,
9: 80.92},
'BldgFront': {0: 20.0,
1: 0.0,
2: 335.0,
3: 0.0,
4: 202.0,
5: 179.0,
6: 92.0,
7: 500.0,
8: 0.0,
9: 304.0},
'BsmtCode': {0: 5.0,
1: 5.0,
2: 5.0,
3: 5.0,
4: 2.0,
5: 5.0,
6: 2.0,
7: 2.0,
8: 5.0,
9: 5.0},
'CD': {0: 310.0,
1: 302.0,
2: 302.0,
3: 318.0,
4: 302.0,
5: 301.0,
6: 302.0,
7: 301.0,
8: 301.0,
9: 302.0},
'ComArea': {0: 0.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 30000.0,
5: 11200.0,
6: 290440.0,
7: 27900.0,
8: 4884.0,
9: 547000.0},
'CommFAR': {0: 0.0,
1: 2.0,
2: 2.0,
3: 2.0,
4: 0.0,
5: 0.0,
6: 10.0,
7: 2.0,
8: 0.0,
9: 2.0},
'Council': {0: 41.0,
1: 33.0,
2: 33.0,
3: 46.0,
4: 33.0,
5: 33.0,
6: 33.0,
7: 33.0,
8: 33.0,
9: 35.0},
'Easements': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'ExemptLand': {0: 0.0,
1: 1571850.0,
2: 0.0,
3: 0.0,
4: 2250000.0,
5: 0.0,
6: 0.0,
7: 932847.0,
8: 0.0,
9: 0.0},
'ExemptTot': {0: 0.0,
1: 1571850.0,
2: 0.0,
3: 0.0,
4: 27166050.0,
5: 0.0,
6: 11304900.0,
7: 23543997.0,
8: 0.0,
9: 0.0},
'FacilFAR': {0: 0.0,
1: 6.5,
2: 0.0,
3: 0.0,
4: 4.8,
5: 4.8,
6: 10.0,
7: 3.0,
8: 5.0,
9: 4.8},
'FactryArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 547000.0},
'GarageArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1285000.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 22200.0,
8: 0.0,
9: 0.0},
'HealthArea': {0: 6410.0,
1: 1000.0,
2: 2300.0,
3: 8822.0,
4: 2300.0,
5: 400.0,
6: 2300.0,
7: 700.0,
8: 500.0,
9: 9300.0},
'HealthCent': {0: 35.0,
1: 36.0,
2: 38.0,
3: 35.0,
4: 38.0,
5: 30.0,
6: 38.0,
7: 30.0,
8: 30.0,
9: 36.0},
'IrrLotCode': {0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 1, 6: 0, 7: 1, 8: 0, 9: 0},
'LandUse': {0: 2.0,
1: 10.0,
2: 5.0,
3: 5.0,
4: 8.0,
5: 4.0,
6: 5.0,
7: 3.0,
8: 3.0,
9: 6.0},
'LotArea': {0: 2252.0,
1: 134988.0,
2: 32000.0,
3: 905000.0,
4: 20267.0,
5: 57600.0,
6: 12500.0,
7: 50173.0,
8: 44704.0,
9: 113800.0},
'LotDepth': {0: 100.0,
1: 275.33,
2: 335.92,
3: 859.0,
4: 100.33,
5: 320.0,
6: 125.0,
7: 200.0,
8: 281.86,
9: 204.0},
'LotFront': {0: 24.0,
1: 490.5,
2: 92.42,
3: 930.0,
4: 202.0,
5: 180.0,
6: 100.0,
7: 521.25,
8: 225.08,
9: 569.0},
'LotType': {0: 5.0,
1: 5.0,
2: 3.0,
3: 3.0,
4: 3.0,
5: 3.0,
6: 3.0,
7: 1.0,
8: 5.0,
9: 3.0},
'NumBldgs': {0: 1.0,
1: 0.0,
2: 1.0,
3: 4.0,
4: 1.0,
5: 1.0,
6: 1.0,
7: 1.0,
8: 2.0,
9: 13.0},
'NumFloors': {0: 2.0,
1: 0.0,
2: 13.0,
3: 2.0,
4: 15.0,
5: 0.0,
6: 37.0,
7: 6.0,
8: 20.0,
9: 8.0},
'OfficeArea': {0: 0.0,
1: 0.0,
2: 264750.0,
3: 0.0,
4: 30000.0,
5: 1822.0,
6: 274500.0,
7: 4200.0,
8: 0.0,
9: 0.0},
'OtherArea': {0: 0.0,
1: 0.0,
2: 39900.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'PolicePrct': {0: 70.0,
1: 84.0,
2: 84.0,
3: 63.0,
4: 84.0,
5: 90.0,
6: 84.0,
7: 94.0,
8: 90.0,
9: 88.0},
'ProxCode': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 1.0,
8: 0.0,
9: 0.0},
'ResArea': {0: 2172.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 371546.0,
6: 0.0,
7: 213864.0,
8: 458543.0,
9: 0.0},
'ResidFAR': {0: 2.0,
1: 7.2,
2: 0.0,
3: 0.0,
4: 2.43,
5: 2.43,
6: 10.0,
7: 3.0,
8: 5.0,
9: 0.0},
'RetailArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 1263000.0,
4: 0.0,
5: 9378.0,
6: 15940.0,
7: 0.0,
8: 4884.0,
9: 0.0},
'SHAPE_Area': {0: 2316.8863224,
1: 140131.577176,
2: 34656.4472405,
3: 797554.847834,
4: 21360.1476315,
5: 58564.8643115,
6: 12947.145471,
7: 50772.624868800005,
8: 47019.5677861,
9: 118754.78573699998},
'SHAPE_Leng': {0: 249.41135038849998,
1: 1559.88914353,
2: 890.718521021,
3: 3729.78685686,
4: 620.761169374,
5: 1006.33799946,
6: 460.03168012300006,
7: 1385.27352839,
8: 992.915660585,
9: 1565.91477261},
'SanitDistr': {0: 10.0,
1: 2.0,
2: 2.0,
3: 18.0,
4: 2.0,
5: 1.0,
6: 2.0,
7: 1.0,
8: 1.0,
9: 2.0},
'SanitSub': {0: 21,
1: 23,
2: 31,
3: 22,
4: 31,
5: 21,
6: 23,
7: 7,
8: 12,
9: 22},
'SchoolDist': {0: 19.0,
1: 13.0,
2: 13.0,
3: 22.0,
4: 13.0,
5: 14.0,
6: 13.0,
7: 14.0,
8: 14.0,
9: 14.0},
'SplitZone': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1},
'StrgeArea': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 1500.0,
8: 0.0,
9: 0.0},
'UnitsRes': {0: 2.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 522.0,
6: 0.0,
7: 234.0,
8: 470.0,
9: 0.0},
'UnitsTotal': {0: 2.0,
1: 0.0,
2: 0.0,
3: 123.0,
4: 1.0,
5: 525.0,
6: 102.0,
7: 237.0,
8: 472.0,
9: 1.0},
'YearAlter1': {0: 0.0,
1: 0.0,
2: 1980.0,
3: 0.0,
4: 1998.0,
5: 0.0,
6: 2009.0,
7: 2012.0,
8: 0.0,
9: 0.0},
'YearAlter2': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 2000.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0},
'ZipCode': {0: 11220.0,
1: 11201.0,
2: 11201.0,
3: 11234.0,
4: 11201.0,
5: 11249.0,
6: 11241.0,
7: 11211.0,
8: 11249.0,
9: 11205.0},
'ZoneDist1': {0: 24,
1: 76,
2: 5,
3: 64,
4: 24,
5: 24,
6: 30,
7: 74,
8: 45,
9: 27},
'ZoneMap': {0: 3,
1: 19,
2: 19,
3: 22,
4: 19,
5: 19,
6: 19,
7: 2,
8: 19,
9: 19},
'building_class': {0: 141,
1: 97,
2: 87,
3: 176,
4: 168,
5: 8,
6: 102,
7: 46,
8: 97,
9: 66},
'building_class_at_sale': {0: 143,
1: 98,
2: 89,
3: 179,
4: 171,
5: 7,
6: 103,
7: 49,
8: 98,
9: 69},
'building_class_category': {0: 39,
1: 71,
2: 31,
3: 38,
4: 86,
5: 40,
6: 80,
7: 75,
8: 71,
9: 41},
'commercial_units': {0: 1,
1: 0,
2: 0,
3: 123,
4: 1,
5: 0,
6: 102,
7: 3,
8: 0,
9: 1},
'gross_sqft': {0: 0.0,
1: 0.0,
2: 304650.0,
3: 2548000.0,
4: 356000.0,
5: 0.0,
6: 290440.0,
7: 241764.0,
8: 0.0,
9: 547000.0},
'land_sqft': {0: 0.0,
1: 134988.0,
2: 32000.0,
3: 905000.0,
4: 20267.0,
5: 57600.0,
6: 12500.0,
7: 50173.0,
8: 44704.0,
9: 113800.0},
'neighborhood': {0: 43,
1: 48,
2: 6,
3: 44,
4: 6,
5: 40,
6: 6,
7: 28,
8: 40,
9: 56},
'residential_units': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 234,
8: 0,
9: 0},
'sale_date': {0: 2257,
1: 4839,
2: 337,
3: 638,
4: 27,
5: 1458,
6: 2450,
7: 3276,
8: 5082,
9: 1835},
'sale_price': {0: 499401179.0,
1: 345000000.0,
2: 340000000.0,
3: 276947000.0,
4: 202500000.0,
5: 185445000.0,
6: 171000000.0,
7: 169000000.0,
8: 165000000.0,
9: 161000000.0},
'tax_class': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 7, 8: 3, 9: 3},
'total_units': {0: 1,
1: 0,
2: 0,
3: 123,
4: 1,
5: 0,
6: 102,
7: 237,
8: 0,
9: 1},
'zip_code': {0: 11201,
1: 11201,
2: 11201,
3: 11234,
4: 11201,
5: 11249,
6: 11241,
7: 11211,
8: 11249,
9: 11205}}
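The numbers in the error are the hint: with test_size=0.2 the test split has 78177 rows (20%) and the train split has 312706 rows (80%), and regression_tree.fit(train, test) passes the test split as the labels for the train split. A minimal sketch of the presumably intended usage, assuming housing_data contains the sale_price column:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# split features and target together so X and y stay aligned and equal length
X = housing_data.drop(columns=['sale_price'])
y = housing_data['sale_price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
# note: criterion="entropy" is a classification criterion and is rejected by
# DecisionTreeRegressor, so the default regression criterion is used here
regression_tree = DecisionTreeRegressor(random_state=100, max_depth=4, min_samples_leaf=5)
regression_tree.fit(X_train, y_train)
print(regression_tree.score(X_test, y_test))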

How to make a scatter plot using dictionary?

I have the following dictionary of keys and values as lists:
comp = {
0: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
2: [0.2073837448663338, 0.19919737000568305, 0.24386659105843467, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
3: [0.2752555116304319, 0.19919737000568305, 0.21704752129294347, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
4: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
5: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
6: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0],
7: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0]
}
Each list holds 8 values (one for each node/person). The keys can be thought of as time-stamps, and the values are recorded for the 8 nodes/persons at time-stamps 0 through 7.
I want a scatter plot with the time-stamps on the x-axis and the values on the y-axis; the points should be the nodes/persons at their corresponding x and y.
The plot should form a cluster of 8 points (nodes) at each time-stamp. I have the following code that partly works, but I think it takes the average of all 8 values in each list and plots them as one point per time-stamp:
import pylab
import matplotlib.pyplot as plt
for key in comp:
    # print(key)
    for idx, item in enumerate(comp[key]):
        x = idx
        y = item
        if idx == 0:
            pylab.scatter(x, y, label=key)
        else:
            pylab.scatter(x, y)
pylab.legend()
pylab.show()
Not sure how to create the cluster that I want. Any help is appreciated.
(Using Ubuntu 14.04 32-Bit VM and Python 2.7)
I think you are slightly overcomplicating it. If you loop through the keys of the dictionary, you can get each list of values with simply comp[key]. This can then be passed to plt.scatter(). You will have to repeat the key 8 times, using [key] * 8, in order to pass the whole list of values to scatter:
import matplotlib.pyplot as plt
comp = {
0: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
2: [0.2073837448663338, 0.19919737000568305, 0.24386659105843467, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
3: [0.2752555116304319, 0.19919737000568305, 0.21704752129294347, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
4: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
5: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
6: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0],
7: [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0]
}
for key in comp:
    plt.scatter([key]*8, comp[key], label=key)
plt.legend()
plt.show()
Update: To get the colors as you want, you can do the following, which is a modified version of the answer given by @lkriener:
import numpy as np
array = np.zeros((8, 8))
for key in comp:
    array[:, key] = comp[key]
x = range(8)
for i in range(8):
    plt.scatter(x, array[i, :], label=i)
plt.legend()
plt.show()
Which gives the figure:
You can move the legend by giving the call to plt.legend() certain arguments. The most important ones are loc and bbox_to_anchor, the documentation of which can be found here
A slight alternative here, if you're able to use a few other modules. A standard scatter plot is useful, however your data features a large number of overlapping points, which aren't visible in the final graph. For this, seaborn's swarmplot might be useful.
To make life a little easier, I use pandas to reshape the data into a DataFrame and then call the swarmplot directly:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
comp = {
'0': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
'1': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
'2': [0.2073837448663338, 0.19919737000568305, 0.24386659105843467, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'3': [0.2752555116304319, 0.19919737000568305, 0.21704752129294347, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'4': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'5': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'6': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0],
'7': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0],
}
df = pd.DataFrame.from_dict(comp, orient='index')
df.index.rename('Observation', inplace=True)
stacked = df.stack().reset_index()
stacked.rename(columns={'level_1': 'Person', 0: 'Value'}, inplace=True)
sns.swarmplot(data=stacked, x='Observation', y='Value', hue='Person')
plt.show()
This gives the following plot:
To plot the values of the same node in the same color you could do something like this:
import numpy as np
import matplotlib.pyplot as plt
comp = {
'0': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
'1': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
'2': [0.2073837448663338, 0.19919737000568305, .24386659105843467,0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'3': [0.2752555116304319, 0.19919737000568305, 0.21704752129294347, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'4': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'5': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2703400161511446, 0.0, 0.0],
'6': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0],
'7': [0.2752555116304319, 0.19919737000568305, 0.21782751590851177, 0.25659375810265855, 0.0, 0.2691379068024452, 0.0, 0.0],
}
array = np.zeros([8, 8])
for i, key in enumerate(comp.keys()):
    for j in range(8):
        array[j, i] = comp[key][j]
plt.xlim((-1, 8))
plt.ylim((-0.05, 0.3))
plt.xlabel('timestamps')
plt.ylabel('values of nodes')
for i in range(8):
    plt.plot(range(8), array[i], ls='--', marker='o', label='node {}'.format(i))
plt.legend(loc='upper left')
plt.savefig('temp.png')
plt.show()
This would give you the following picture:
