I have a huge co-occurrence matrix whose index and column names are both skill_id values, and whose cells hold the co-occurrence count of the corresponding pair of skills. Please find a sample below.
I want the data in a three-column dataframe: skillid1, skillid2, count.
Any help would be highly appreciated.
Supposing your co-occurrence matrix is called df and looks like this:
4044 4092 4651 6168 6229 6284 6295
4044 0 0 0 1 1 0 0
4092 0 0 1 0 0 0 0
4651 0 1 0 0 0 0 0
6168 1 0 0 0 1 0 0
6229 1 0 0 1 0 0 0
6284 0 0 0 0 0 0 1
6295 0 0 0 0 0 1 0
I'd suggest the following:
import itertools
import pandas as pd
# get all possible pairs of (skillid1, skillid2)
edges = list(itertools.combinations(df.columns, 2))
# look up the associated weight of each pair in the original df
edges_with_weights = [(node1, node2, df.loc[node1, node2]) for (node1, node2) in edges]
# put it all in a new dataframe
new_df = pd.DataFrame(edges_with_weights, columns=["skillid1", "skillid2", "count"])
This gives you the desired new_df:
skillid1 skillid2 count
0 4044 4092 0
1 4044 4651 0
2 4044 6168 1
3 4044 6229 1
4 4044 6284 0
5 4044 6295 0
6 4092 4651 1
7 4092 6168 0
...
...
...
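Since the question mentions a huge matrix, looking up one cell at a time may become slow. Below is a minimal vectorized sketch of the same idea. It assumes the matrix is symmetric and that the index and the columns are in the same order; the sample values are the ones shown above, and the name nonzero_df is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample co-occurrence matrix from the answer above.
skill_ids = [4044, 4092, 4651, 6168, 6229, 6284, 6295]
matrix = np.array([
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
])
df = pd.DataFrame(matrix, index=skill_ids, columns=skill_ids)

# Row/column positions of the strict upper triangle: each unordered pair once.
rows, cols = np.triu_indices(len(df), k=1)

new_df = pd.DataFrame({
    "skillid1": df.index[rows],
    "skillid2": df.columns[cols],
    "count": df.to_numpy()[rows, cols],
})

# Optionally keep only the pairs that actually co-occur.
nonzero_df = new_df[new_df["count"] > 0].reset_index(drop=True)
print(new_df.head(8))
```

The pairs come out in the same order as itertools.combinations would produce them, so the first rows should match the output shown above.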
from itertools import combinations
weights = []
# diagonal entries: each skill paired with itself
for skill_id in skills.skill_id:
    if str(skill_id) in count_model.vocabulary_:
        i = count_model.vocabulary_[str(skill_id)]
        j = count_model.vocabulary_[str(skill_id)]
        if skills_occurrences[i][j] > 0:
            weights.append([skill_id, skill_id, skills_occurrences[i][j]])
# off-diagonal entries: every unordered pair of distinct skills
for combination in combinations(skills.skill_id, 2):
    if str(combination[0]) in count_model.vocabulary_ and str(combination[1]) in count_model.vocabulary_:
        i = count_model.vocabulary_[str(combination[0])]
        j = count_model.vocabulary_[str(combination[1])]
        if skills_occurrences[i][j] > 0:
            weights.append([str(combination[0]), str(combination[1]), skills_occurrences[i][j]])
I had one more data set to process. After that, I simply looped over both skill IDs in a nested loop, compared them, and kept appending the pair together with the value found at the corresponding indices.
I am trying to make dot products of some columns in my dataset:
df_disorders_3col = df.iloc[:,disorders_indexes]
df_disorders_3col.drop([5278, 10122, 10124, 10125, 10126], axis=0, inplace=True)
df_disorders_3col = df_disorders_3col.astype(int)
df_disorders_3col['Disorders'] = df_disorders_3col.dot(df.columns + ',').str.rstrip(',')
df_disorders_3col.head()
but I get this error when running this block of code:
ValueError: Dot product shape mismatch, (10133, 38) vs (498,)
this is a sample of my data:
>>>df_disorders_3col.sample(5)
HasDiabetes HasHypertension HasCardiacDisease ... HasMS HasPregnancyHypertension HasPregnancyDiabetes
752 0 0 0 1 0 0
6312 0 0 0 0 0 0
6984 1 0 0 0 0 0
9016 0 0 0 0 0 1
8923 0 0 0 0 0 0
5 rows × 38 columns
also this is the shape of df_disorders_3col:
>>>df_disorders_3col.shape
(10133, 38)
and df:
>>>df.shape
(10138, 498)
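Judging only from the shapes in the error message, one likely cause is that the 38-column subset is multiplied by the 498 column names of the full df. The sketch below uses the subset's own column names instead; the small frame and its values are made up for illustration, and only the column names are borrowed from the sample above.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one unrelated column plus three indicators.
df = pd.DataFrame({
    "Age": [34, 51, 47],
    "HasDiabetes": [1, 0, 1],
    "HasHypertension": [0, 0, 1],
    "HasMS": [0, 0, 0],
})
disorders_indexes = [1, 2, 3]

df_disorders = df.iloc[:, disorders_indexes].astype(int)

# The vector of names must have one entry per column of the frame it is
# multiplied with, so take the names from the subset, not from df.
labels = df_disorders.dot(df_disorders.columns + ",").str.rstrip(",")
print(labels)
```

Each 1 contributes its column name followed by a comma and each 0 contributes an empty string, so a row without any disorder ends up as an empty string.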
My feature engineering runs on different documents. For some documents some features do not exist, and consequently the sublist consists of a single repeated value, such as the third sublist [0,0,0,0,0]. One-hot encoding this sublist yields only one column, while the feature lists of other documents are transformed into two columns. Is there any way to tell the one-hot encoder to create two columns even if the sublist consists of one and the same value, and to insert the missing column in the right spot? The main problem is that the feature dataframes of different documents end up with different numbers of columns, which makes them not comparable.
import pandas as pd
features = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]
df = pd.DataFrame(features[0])
df_features_final = pd.get_dummies(df[0])
for feature in features[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis=1, join='inner')
print(df_features_final)
The result is the following dataframe. As the column titles show, the fifth column (a 0) is not followed by a column titled 1:
0 1 0 1 0 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 1 0 1 0 1 1 0
1 1 0 0 1 1 1 0 0 1 1 0 0 1
2 0 1 0 1 1 0 1 1 0 0 1 1 0
3 1 0 1 0 1 0 1 0 1 0 1 1 0
4 1 0 0 1 1 0 1 0 1 0 1 1 0
I am not aware of the functionality you want in pandas, at least. But in TensorFlow, we do have
tf.one_hot(
indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
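If staying in pandas is preferred, a possible workaround (not a built-in one-hot option, just a reindex after encoding) is to force every encoded block onto a fixed list of categories. This is a minimal sketch that assumes the only possible values are 0 and 1; the name categories is illustrative.

```python
import pandas as pd

features = [[0, 0, 1, 0, 0], [1, 1, 1, 0, 1], [0, 0, 0, 0, 0]]
categories = [0, 1]  # assumption: the full set of values is known in advance

encoded = [
    # reindex adds any missing category as an all-zero column, in a fixed position
    pd.get_dummies(pd.Series(values), dtype=int).reindex(columns=categories, fill_value=0)
    for values in features
]
df_features_final = pd.concat(encoded, axis=1)
print(df_features_final)
```

With this, every sublist contributes exactly two columns, so the frames of different documents should end up with the same width.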
I'm trying to create some extra features on a data set. I want to get a spatial context from the features I already have one hot encoded. So for example, I have this:
F1 F2 F3 F4
1 0 1 1 0
2 1 0 0 1
3 1 0 0 0
4 0 0 0 1
I want to create some new columns against the values here:
F1 F2 F3 F4 S1 S2 S3 S4
1 0 1 1 0 0 2 1 0
2 1 0 0 1 1 0 0 3
3 1 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 4
I'm hoping there is an easy way to do this, to calculate changes from the last value of the column and output that to a corresponding column. Any help is appreciated, thanks.
You could do:
import numpy as np
import pandas as pd
def func(x):
    # create result array
    result = np.zeros(x.shape, dtype=int)
    # get the indices of the non-zero entries
    w = np.argwhere(x).ravel()
    # compute the difference between consecutive indices and prepend the first index + 1
    array = np.hstack(([w[0] + 1], np.ediff1d(w)))
    # set the values on result
    np.put(result, w, array)
    return result
columns = ['S{}'.format(i) for i in range(1, 5)]
s = pd.DataFrame(df.ne(0).apply(func, axis=1).values.tolist(),
                 columns=columns)
result = pd.concat([df, s], axis=1)
print(result)
Output
F1 F2 F3 F4 S1 S2 S3 S4
0 0 1 1 0 0 2 1 0
1 1 0 0 1 1 0 0 3
2 1 0 0 0 1 0 0 0
3 0 0 0 1 0 0 0 4
Note that you need to import numpy (import numpy as np) for func to work. The idea is to find the indices of the non-zero entries, compute the difference between consecutive indices, set the first value to its index + 1, and do this for each row.
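To make the idea concrete, here is a small standalone sketch of the same per-row computation applied to the rows from the question. The function name steps_since_previous_one is hypothetical, and the early return for a row with no non-zero entry is an added assumption about how such rows should be handled.

```python
import numpy as np

def steps_since_previous_one(row):
    """For each non-zero entry, store its distance to the previous non-zero
    entry (or its 1-based position if it is the first one in the row)."""
    row = np.asarray(row)
    result = np.zeros(row.shape, dtype=int)
    positions = np.flatnonzero(row)
    if positions.size == 0:
        # assumption: a row without any non-zero entry yields all zeros
        return result
    steps = np.hstack(([positions[0] + 1], np.diff(positions)))
    result[positions] = steps
    return result

print(steps_since_previous_one([0, 1, 1, 0]))  # non-zero positions 1 and 2
print(steps_since_previous_one([1, 0, 0, 1]))  # non-zero positions 0 and 3
```

For the row [1, 0, 0, 1] the non-zero positions are 0 and 3, so the first entry gets 0 + 1 and the last gets 3 - 0.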
I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe whose columns represent the values of the dictionary and whose rows represent the keys of the dictionary, as below?
The best option is #4:
pd.get_dummies(df.stack()).groupby(level=0).sum()
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
.set_index(0,append=True)['level_1']\
.unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
.pivot(index='level_0',columns='Values')['level_1']\
.notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out a short solution, and the fastest so far)
pd.get_dummies(df.stack()).groupby(level=0).sum()
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
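For reference, Option 4 can be tried end to end on the dictionary from the question. This is a minimal sketch; it spells the aggregation as groupby(level=0).sum(), and dtype=int is added only so that the result holds integers regardless of the pandas version.

```python
import pandas as pd

data_dict = {'1.160.139.117': ['712907', '742068'],
             '1.161.135.205': ['667386', '742068'],
             '1.162.51.21': ['326136', '663056', '742068']}

df = pd.DataFrame.from_dict(data_dict, orient='index')

# One indicator row per (key, value) pair, then collapse the rows per key.
result = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).sum()
print(result)
```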
I'd like to create some NetworkX graphs from a simple Pandas DataFrame:
Loc 1 Loc 2 Loc 3 Loc 4 Loc 5 Loc 6 Loc 7
Foo 0 0 1 1 0 0 0
Bar 0 0 1 1 0 1 1
Baz 0 0 1 0 0 0 0
Bat 0 0 1 0 0 1 0
Quux 1 0 0 0 0 0 0
Where Foo… is the index, and Loc 1 to Loc 7 are the columns. But converting to NumPy matrices or recarrays doesn't seem to work for generating input for nx.Graph(). Is there a standard strategy for achieving this? I'm not averse to reformatting the data in Pandas --> dumping to CSV --> importing to NetworkX, but it seems as if I should be able to generate the edges from the index and the nodes from the values.
NetworkX expects a square matrix (of nodes and edges); perhaps* you want to pass it:
In [11]: df2 = pd.concat([df, df.T]).fillna(0)
Note: It's important that the index and columns are in the same order!
In [12]: df2 = df2.reindex(df2.columns)
In [13]: df2
Out[13]:
Bar Bat Baz Foo Loc 1 Loc 2 Loc 3 Loc 4 Loc 5 Loc 6 Loc 7 Quux
Bar 0 0 0 0 0 0 1 1 0 1 1 0
Bat 0 0 0 0 0 0 1 0 0 1 0 0
Baz 0 0 0 0 0 0 1 0 0 0 0 0
Foo 0 0 0 0 0 0 1 1 0 0 0 0
Loc 1 0 0 0 0 0 0 0 0 0 0 0 1
Loc 2 0 0 0 0 0 0 0 0 0 0 0 0
Loc 3 1 1 1 1 0 0 0 0 0 0 0 0
Loc 4 1 0 0 1 0 0 0 0 0 0 0 0
Loc 5 0 0 0 0 0 0 0 0 0 0 0 0
Loc 6 1 1 0 0 0 0 0 0 0 0 0 0
Loc 7 1 0 0 0 0 0 0 0 0 0 0 0
Quux 0 0 0 0 1 0 0 0 0 0 0 0
In [14]: graph = nx.from_numpy_array(df2.values)
This doesn't pass the column/index names to the graph. If you want to do that, you could use relabel_nodes (you may have to be wary of duplicates, which are allowed in pandas' DataFrames):
In [15]: graph = nx.relabel_nodes(graph, dict(enumerate(df2.columns))) # is there nicer way than dict . enumerate ?
*It's unclear exactly what the columns and index represent for the desired graph.
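As a possible alternative to relabelling, newer NetworkX versions provide from_pandas_adjacency, which takes the node names from the index and columns directly. The sketch below rebuilds the square frame from the question's data; it assumes that index and column labels are unique and that a NetworkX version providing this function is installed.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 1, 0, 0, 0],
     [0, 0, 1, 1, 0, 1, 1],
     [0, 0, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 1, 0],
     [1, 0, 0, 0, 0, 0, 0]],
    index=["Foo", "Bar", "Baz", "Bat", "Quux"],
    columns=["Loc {}".format(i) for i in range(1, 8)],
)

# Square, symmetric matrix over all twelve labels, rows ordered like the columns.
df2 = pd.concat([df, df.T]).fillna(0).astype(int)
df2 = df2.reindex(df2.columns)

# Node labels are taken from the index/columns, so no relabelling is needed.
graph = nx.from_pandas_adjacency(df2)
print(sorted(graph.neighbors("Bar")))
```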
This is a rather late answer, but networkx can now read data from pandas dataframes. In that case, the ideal format for a simple directed graph is the following:
+----------+---------+---------+
| Source | Target | Weight |
+==========+=========+=========+
| Node_1 | Node_2 | 0.2 |
+----------+---------+---------+
| Node_2 | Node_1 | 0.6 |
+----------+---------+---------+
If you are using adjacency matrices then Andy Hayden is right: you need to take care that the format is correct. Since you used 0 and 1 in your question, I guess you would like to see an undirected graph. It may seem counterintuitive at first, since you said the index represents e.g. a person and the columns represent the groups to which a given person belongs, but the relation also holds the other way round: a group (membership) belongs to a person. Following this logic, you should actually put the groups in the index and the persons in the columns too.
Just a side note: you can also define this problem in terms of a directed graph, for example if you would like to visualize an association network of hierarchical categories. There, the association from e.g. Samwise Gamgee to Hobbits is usually stronger than in the other direction (since Frodo Baggins is more likely the Hobbit prototype).
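The edge-list format from the table above can be loaded roughly as follows. This is a minimal sketch using the two example rows; passing create_using=nx.DiGraph is what keeps the two directions as separate edges.

```python
import networkx as nx
import pandas as pd

edges = pd.DataFrame({
    "Source": ["Node_1", "Node_2"],
    "Target": ["Node_2", "Node_1"],
    "Weight": [0.2, 0.6],
})

# A directed graph keeps Node_1 -> Node_2 and Node_2 -> Node_1 as two edges.
g = nx.from_pandas_edgelist(
    edges, source="Source", target="Target",
    edge_attr="Weight", create_using=nx.DiGraph,
)
print(g.edges(data=True))
```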
You can also use scipy to create the square matrix like this:
import pandas as pd
import scipy.sparse as sp
cols = df.columns
X = sp.csr_matrix(df.astype(int).values)
Xc = X.T @ X  # sparse matrix product: column-by-column co-occurrence counts
Xc.setdiag(0)  # reset the diagonal (self co-occurrence)
# create a dataframe from the co-occurrence matrix in dense format
df = pd.DataFrame(Xc.toarray(), index=cols, columns=cols)
Later on you can create an edge list from the dataframe and import it into Networkx:
import networkx as nx
df = df.stack().reset_index()
df.columns = ['source', 'target', 'weight']
df = df[df['weight'] != 0]  # remove non-connected nodes
g = nx.from_pandas_edgelist(df, 'source', 'target', ['weight'])