This is the dictionary I have:
docs = {'computer': {'1': 1, '3': 5, '8': 2},
'politics': {'0': 2, '1': 2, '3': 1}}
I want to create a 2 x 9 tensor like this:
[
[0, 1, 0, 5, 0, 0, 0, 0, 2],
[2, 2, 0, 1, 0, 0, 0, 0, 0]
]
Here, because the max key is 8, we have 9 columns; the number of rows and columns can grow with the dictionary.
I have tried to implement this with for-loops, but since the dictionary is big it is not efficient at all, and it also produces a list while I need a tensor.
maxr = 0
for i, val in docs.items():
    for j in val.keys():
        if int(j) > maxr:
            maxr = int(j)

final_lst = []
for val in docs.values():
    lst = [0] * (maxr + 1)
    for j, val2 in sorted(val.items()):
        lst[int(j)] = val2
    final_lst.append(lst)
print(final_lst)
If you are ok with using pandas and numpy, here's how you can do it.
import pandas as pd
import numpy as np

# Create a dataframe with the inner keys as index and the counts as cell values.
df = pd.DataFrame(docs)

# Build the full range of index labels from the min to the max key (inclusive).
new_index = np.arange(int(df.index.min()), int(df.index.max()) + 1).astype(str)

# Reindex with the full range, fill the NaN values with 0, and transpose.
new_df = df.reindex(new_index).fillna(0).T.astype(int)

#           0  1  2  3  4  5  6  7  8
# computer  0  1  0  5  0  0  0  0  2
# politics  2  2  0  1  0  0  0  0  0
If you just want the NumPy array, you can use array = new_df.values:
# [[0 1 0 5 0 0 0 0 2]
#  [2 2 0 1 0 0 0 0 0]]
If you want a tensor, you can use tf.convert_to_tensor(new_df.values).
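For example, a minimal sketch of that last step (assuming TensorFlow is installed; torch.tensor(new_df.values) would do the same for PyTorch):
import tensorflow as tf
# Convert the underlying NumPy array of the reindexed dataframe into a tensor.
tensor = tf.convert_to_tensor(new_df.values)
print(tensor.shape)  # (2, 9) for the example dictionary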
Given my dataframe df with 13 rows and 5 classes, as follows:
import pandas as pd
df = pd.DataFrame([[0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0],
                   [1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0],
                   [1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 1]])
Note that the columns of df are [0, 1, 2, 3, 4] by default.
When you run df.value_counts(), you get:
0  1  2  3  4
0  0  1  0  0    4
      0  0  1    3
   1  1  0  0    2
1  0  0  1  0    2
0  0  0  1  0    1
   1  1  0  1    1
You can observe that it returns every unique sequence of the 5 classes with its count (the blank cells simply repeat the value from the row above in the sparse MultiIndex display). Now I am wondering: is there any way to get the indices that contain each of these unique sequences, in the form of a dictionary?
So, for this case, it could return the following dictionary, where each key is a unique sequence of classes and each value is the list of indices, like this:
{(0,0,1,0,0): [0,2,4,6],
 (0,0,0,0,1): [1,3,5],
 (0,1,1,0,0): [8,10],
 (1,0,0,1,0): [9,11],
 (0,0,0,1,0): [7],
 (0,1,1,0,1): [12]}
Thank you in advance
You can simply use the .groupby method.
.groupby, given the list of columns, groups by every combination of values across those columns, and the .groups attribute gives the expected dictionary:
df.groupby([0,1,2,3,4]).groups
Result:
{(0, 0, 0, 0, 1): [1, 3, 5], (0, 0, 0, 1, 0): [7], (0, 0, 1, 0, 0): [0, 2, 4, 6], (0, 1, 1, 0, 0): [8, 10], (0, 1, 1, 0, 1): [12], (1, 0, 0, 1, 0): [9, 11]}
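If you need plain Python lists rather than the pandas Index objects that .groups returns, a possible follow-up (a sketch, not part of the answer above) is:
groups = df.groupby([0, 1, 2, 3, 4]).groups
# Convert each pandas Index of row labels into a plain list
result = {key: list(idx) for key, idx in groups.items()}
print(result)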
I have a dataframe of N columns. Each element in the dataframe is in the range [0, N-1].
For example, my dataframe can be something like this (N=3):
A B C
0 0 2 0
1 1 0 1
2 2 2 0
3 2 0 0
4 0 0 0
I want to create a co-occurrence matrix (please correct me if there is a different standard name for it) of size N x N, in which each element ij contains the number of times that columns i and j assume the same value.
A B C
A x 2 3
B 2 x 2
C 3 2 x
Where, for example, matrix[0, 1] means that A and B assume the same value 2 times.
I don't care about the value on the diagonal.
What is the smartest way to do that?
DataFrame.corr
We can pass a custom callable for calculating the "correlation" between the columns of the dataframe: the callable takes two 1-D NumPy arrays as its input arguments and returns the count of positions where the elements of the two arrays are equal.
df.corr(method=lambda x, y: (x==y).sum())
A B C
A 1.0 2.0 3.0
B 2.0 1.0 2.0
C 3.0 2.0 1.0
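For reference, a self-contained sketch of the same approach on the example data (note that DataFrame.corr always puts 1.0 on the diagonal, which is fine here since the diagonal is ignored):
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 2, 0],
                   'B': [2, 0, 2, 0, 0],
                   'C': [0, 1, 0, 0, 0]})

# The callable receives two aligned 1-D arrays (one per pair of columns)
# and returns the number of positions where they hold the same value.
co_occurrence = df.corr(method=lambda x, y: (x == y).sum())
print(co_occurrence)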
Let's try broadcasting across the transposed values and summing over axis 2: vals[:, None] has shape (3, 1, 5), comparing it against vals of shape (3, 5) broadcasts to a (3, 3, 5) boolean array, and summing over axis 2 counts the matches for every pair of columns:
import pandas as pd
df = pd.DataFrame({
    'A': {0: 0, 1: 1, 2: 2, 3: 2, 4: 0},
    'B': {0: 2, 1: 0, 2: 2, 3: 0, 4: 0},
    'C': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}
})
vals = df.T.values
e = (vals[:, None] == vals).sum(axis=2)
new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)
print(new_df)
e:
[[5 2 3]
[2 5 2]
[3 2 5]]
new_df:
A B C
A 5 2 3
B 2 5 2
C 3 2 5
I don't know about the smartest way but I think this works:
import numpy as np

m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = 3
ans = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # rows where columns i and j hold the same value have a zero difference
        ans[i, j] = len(m) - np.count_nonzero(m[:, i] - m[:, j])
print(ans + ans.T)
With this DataFrame:
Gp Wo Me CHi
1 0 1 0
2 1 0 0
3 0 1 0
4 1 0 0
5 0 2 0
6 1 0 0
I would like to create a dictionary like:
a={'Gp':['Wo', 'Me','CHi']}
But in the case of the row where column 'Gp' is 5, the value of column 'Me' is 2. How can I convert such a value like this:
a={5: [0, [1, 1], 0]}
i.e. create another nested list if the value is > 1?
You can use df.iterrows(), check whether the row value of the column 'Me' is equal to 2, and write an if statement:
for index, row in df.iterrows():
    if row['Me'] == 2:
        print({row['Gp']: [row['Wo'], [1, 1], row['CHi']]})
    else:
        print({row['Gp']: [row['Wo'], row['Me'], row['CHi']]})
This will output the following dictionaries:
{1: [0, 1, 0]}
{2: [1, 0, 0]}
{3: [0, 1, 0]}
{4: [1, 0, 0]}
{5: [0, [1, 1], 0]}
{6: [1, 0, 0]}
EDIT, based on the comment, generalizing to any value of 'Me' greater than 1:
for index, row in df.iterrows():
    if row['Me'] <= 1:
        print({row['Gp']: [row['Wo'], row['Me'], row['CHi']]})
    else:
        print({row['Gp']: [row['Wo'], [1 for _ in range(row['Me'])], row['CHi']]})
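If you want one combined dictionary instead of printing one dictionary per row, a possible sketch (assuming the values in 'Gp' are unique, as in the sample data) is:
a = {}
for _, row in df.iterrows():
    # expand counts greater than 1 into a list of ones, otherwise keep the value
    me = [1] * row['Me'] if row['Me'] > 1 else row['Me']
    a[row['Gp']] = [row['Wo'], me, row['CHi']]
print(a)
# {1: [0, 1, 0], 2: [1, 0, 0], 3: [0, 1, 0], 4: [1, 0, 0], 5: [0, [1, 1], 0], 6: [1, 0, 0]}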
I have a DataFrame of temperature with thousands of rows (time-series data) and 40 columns (40 points in a catchment). The entries in this DataFrame are zeros and ones (1 means an active part of the catchment, 0 means a non-active part). I want to place the number of intersected values in a separate column (named inter) in the same DataFrame.
I expect the output as shown in the attached image:
- The value in the first row of inter should be zero, as all entries are zero and no part is active on day 1.
- The value in the 2nd row of inter should be 4, as four parts are active on day 2.
- The value in the 3rd row of inter should be 3 (the number of intersected values of all the rows above, including the 3rd row); the green boxes in the image show the value for the 3rd row.
- The value in the 4th row of inter should be the number of intersected values of all the rows above (the yellow shaded area in the image).
- Similarly, the blue boxes show the value for the 5th row, the red boxes show the value for the 6th row, and so on.
Note: for every row I will count the intersection with all the rows above.
I deserve a reward for this :)
Here is your answer:
import pandas as pd
import numpy as np

# set up test data
data = {'0': [0, 0, 0, 1, 0], '1': [0, 0, 1, 0, 1], '2': [0, 0, 0, 1, 0], '3': [0, 0, 1, 1, 1], '4': [0, 1, 1, 1, 0],
        '5': [0, 0, 0, 0, 1], '6': [0, 1, 1, 1, 0], '7': [0, 0, 1, 0, 1], '8': [0, 1, 0, 1, 0], '9': [0, 1, 1, 0, 0],
        '10': [0, 0, 1, 0, 0], '11': [0, 0, 0, 1, 1], '12': [0, 0, 0, 1, 1]}
data = pd.DataFrame(data=data)

# collect inter data
inter_data = []
for main_index, main_row in data.iterrows():
    # select data for calculations
    selected_data = data.loc[0:main_index, :]
    # handle rows with no active parts (e.g. the first row of all zeros)
    if 1 not in main_row.values:
        inter_data.append(0)
    else:
        # handle the second row
        if selected_data.shape[0] == 2:
            inter_data.append(selected_data[1:2].values[0].sum())
        # handle the rest of the data
        else:
            # drop the last (current) row from the selected data
            selected_data = selected_data[:-1]
            # sum the selected data column-wise
            summed_data = 0
            for index, row in selected_data.iterrows():
                summed_data += row.values
            # get the positions of the 1s in the current row
            positions = np.where(main_row.values == 1)
            # get the summed data at those positions
            positions_data = summed_data[positions[0]]
            # count the occurrences in the data
            inter_data.append((positions_data >= 1).sum())
# add the inter data to the raw data
data['inter'] = pd.DataFrame(inter_data)
Output:
   0  1  2  3  4  5  6  7  8  9  10  11  12  inter
0  0  0  0  0  0  0  0  0  0  0   0   0   0      0
1  0  0  0  0  1  0  1  0  1  1   0   0   0      4
2  0  1  0  1  1  0  1  1  0  1   1   0   0      3
3  1  0  1  1  1  0  1  0  1  0   0   1   1      4
4  0  1  0  1  0  1  0  1  0  0   0   1   1      5
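A vectorized alternative that reproduces the same inter column (a sketch, assuming the same data DataFrame before the loop above has run; the second row is special-cased to match the logic above):
# per column, count how many of the strictly preceding rows were active
prev = data.cumsum() - data
# a column "intersects" if it is active now and was active at least once before
inter = ((data == 1) & (prev >= 1)).sum(axis=1)
# special case: the second row just counts its own active parts
if len(data) > 1:
    inter.iloc[1] = int(data.iloc[1].sum())
data['inter'] = inter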
I have a matrix similar to this:
1 0 0
1 0 0
0 2 0
0 2 0
0 0 3
0 0 3
(Non-zero numbers denote the parts that I'm interested in. The actual numbers inside the matrix could be anything.)
And I need to produce a vector like this:
[ 1 1 2 2 3 3 ].T
I can do this with a loop:
result = np.zeros([rows])
for y in range(rows):
    x = y // (rows // cols)  # pick the index of the corresponding column
    result[y] = mat[y][x]
But I can't figure out how to do this in vector form.
This might be what you want: build the row and column index arrays explicitly and use NumPy fancy indexing to pick one element per row.
import numpy as np
m = np.array([
[1, 0, 0],
[1, 0, 0],
[0, 2, 0],
[0, 2, 0],
[0, 0, 3],
[0, 0, 3]
])
rows, cols = m.shape
# row indices (axis 0)
y = np.arange(rows)
# column indices (axis 1): each block of rows // cols rows maps to one column
x = y // (rows // cols)
# fancy indexing: pick one element per (row, column) pair
result = m[y, x]
print(result)
Result:
[1 1 2 2 3 3]
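The same column indices can also be built with np.repeat, which reads directly as "each column covers rows // cols consecutive rows" (a minor variant of the above, not a different method):
x = np.repeat(np.arange(cols), rows // cols)
result = m[np.arange(rows), x]
print(result)  # [1 1 2 2 3 3]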