I have a dataset of house price predictions.
House id  society_id  building_type  households  yyyymmdd    floor  price  date
204       a9cvzgJ     170            185         01/02/2006  3      43000  01/02/2006
100       a4Nkquj     170            150         01/04/2006  13     46300  01/04/2006
The dataset has shape (2000, 40), and 1880 of the rows have the same house id.
I have to build heterogeneous graphs from the dataset. The metapaths are as follows:
Here BT stands for building type, and H1 and H2 represent house 1 and house 2.
The meta graph example is:
I know of NetworkX, which can build a graph from a dataframe, but I don't know how to use it in my scenario. The price column is the target node.
A glimpse of dataset
Any guidance will mean a lot.
Thank you. The goal is to make an adjacency matrix of the dataset.
To build a graph like M_1 using only one attribute (such as building type), you could do either of the following. You could use from_pandas_edgelist as follows:
G = nx.from_pandas_edgelist(df, source='house_id', target='building_id')
or you could do the following:
G = nx.Graph()
G.add_edges_from(df.loc[:,['house_id','building_id']].to_numpy())
If you have a list of graphs glist = [M_1, M_2, ...], each of which connects house_id to one other attribute, you can combine them using the compose_all function. For instance,
G = nx.compose_all(glist)
Alternatively, if you have an existing graph made using certain attributes, you can add another attribute with
G.add_edges_from(df.loc[:,['house_id','new_attribute']].to_numpy())
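Since the stated goal is an adjacency matrix, here is a minimal sketch of a house-to-house adjacency over the H1-BT-H2 metapath, written in plain Python rather than networkx. The rows and the names `rows`, `houses_by_type` and `adjacency` are hypothetical and only illustrate the idea: two houses are linked when they share a building type.

```python
from collections import defaultdict

# Hypothetical rows: (house_id, building_type); values are illustrative only.
rows = [("h1", "bt_a"), ("h2", "bt_a"), ("h3", "bt_b"), ("h4", "bt_a")]

# Group houses by the attribute that sits in the middle of the metapath.
houses_by_type = defaultdict(set)
for house, btype in rows:
    houses_by_type[btype].add(house)

houses = sorted({house for house, _ in rows})
index = {house: i for i, house in enumerate(houses)}

# adjacency[i][j] == 1 when houses i and j share a building type (H1-BT-H2).
adjacency = [[0] * len(houses) for _ in houses]
for members in houses_by_type.values():
    for a in members:
        for b in members:
            if a != b:
                adjacency[index[a]][index[b]] = 1

for house, row in zip(houses, adjacency):
    print(house, row)
```

The same grouping could be repeated for each attribute to get one matrix per metapath. With networkx, bipartite.projected_graph offers a comparable projection of the house-attribute graph onto the houses.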
I have a dataframe of 4 columns that looks like the following:
timestamp Label X Y
163370622 0 1.71 18.42
163370623 1 -17.26 -13.76
163370624 1 0.91 5.8
Each row is one object, with its coordinates given at a certain timestamp. Label 0 indicates one object class and Label 1 indicates another object class. I am trying to plot the trajectories of these two classes on one axis in Python (with their timestamps if possible) using the (X, Y) points and labels given in the dataframe. The graph should look like this:
Which Python modules or packages would work well for this purpose? Or do I have to use MATLAB/R?
I want to represent relationships between nodes in Python using pandas.DataFrame.
Each relationship has a weight, so I used a dataframe like this:
nodeA nodeB nodeC
nodeA 0 5 1
nodeB 5 0 4
nodeC 1 4 0
But I think this is an improper way to express relationships, because the dataframe
is symmetric and so contains duplicated data.
Is there a more proper way than a dataframe to represent a graph in Python?
(Sorry for my bad English)
This seems like an acceptable way to represent a graph, and is in fact compatible with, say, networkx. For example, you can recover a networkx graph object as follows:
import networkx as nx
g = nx.from_pandas_adjacency(df)
print(g.edges)
# [('nodeA', 'nodeB'), ('nodeA', 'nodeC'), ('nodeB', 'nodeC')]
print(g.get_edge_data('nodeA', 'nodeB'))
# {'weight': 5}
If your graph is sparse, you may want to store it as an edge list instead, e.g. as discussed here.
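As a rough illustration of the edge-list alternative, the sketch below converts the symmetric matrix from the question into a list of (source, target, weight) tuples. It uses a nested dict as a stand-in for the dataframe and assumes a weight of 0 means "no edge"; both are assumptions, not something the question states.

```python
# Nested dict standing in for the symmetric dataframe from the question.
adjacency = {
    "nodeA": {"nodeA": 0, "nodeB": 5, "nodeC": 1},
    "nodeB": {"nodeA": 5, "nodeB": 0, "nodeC": 4},
    "nodeC": {"nodeA": 1, "nodeB": 4, "nodeC": 0},
}

nodes = list(adjacency)

# Keep each undirected edge once (upper triangle) and skip zero weights.
edge_list = [
    (u, v, adjacency[u][v])
    for i, u in enumerate(nodes)
    for v in nodes[i + 1:]
    if adjacency[u][v] != 0
]

print(edge_list)
```

Storing each undirected edge once removes the duplication that the symmetric matrix carries.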
I'm making a decision Tree based on a dataset with 3 columns:
example:
ID Area Year
1 50 1950
2 150 1981
3 210 1987
4 205 1973
5 176 1992
....
When I make a decision tree using DecisionTreeRegressor, the tree is based on all 3 columns. What I want is for the ID not to be included in the tree itself but still be traceable (so I don't want to delete this column).
Furthermore, I also want the column 'Year' to have priority over the column 'Area', so that the data is first split according to 'Year' and afterwards according to 'Area'. (Right now the decision tree gives priority to 'Area' (X1), and 'Year' is not even used. See image attached: Image tree)
How can I do this?
I tried to convert the first column to a string, but the tree is still using column 'ID'.
My code so far:
import os
import graphviz
from sklearn import tree
# Make the Graphviz binaries discoverable before rendering
os.environ["PATH"] += os.pathsep + r'C:\anaconda3\Library\bin\graphviz'
clf = tree.DecisionTreeRegressor(min_samples_split=20, max_leaf_nodes=20).fit(X_train, y_train)
tree.plot_tree(clf)
# Export the fitted tree to DOT format and render it as test.png
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("test", format="png")
Methods on the model like predict and predict_proba maintain the order of rows, i.e. even if you delete the ID column from your X_train dataset (while keeping it in some base dataset), you can simply concatenate the predicted values later.
With respect to your question on how to split by Year first: I don't think sklearn or any of the other Python ML libraries can do this. You would want to look at alternatives like SAS Enterprise Miner or Angoss Knowledge Studio, but neither of them is FOSS.
A dirty hack could be to build a tree with just the Year column, note the splits, and then segment your data into 2 (or more) parts based on the splits obtained.
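The hack can be sketched in plain Python as follows. A hand-rolled single split on Year (a regression stump that minimises squared error) stands in here for fitting an sklearn tree on the Year column alone. The IDs, areas and years come from the question's sample, while the target values and the helper names `sse` and `best_year_split` are made up for illustration.

```python
# Hypothetical records: (ID, Area, Year, target). The target values are
# invented for illustration; the question does not show them.
records = [
    (1, 50, 1950, 10.0),
    (2, 150, 1981, 30.0),
    (3, 210, 1987, 34.0),
    (4, 205, 1973, 14.0),
    (5, 176, 1992, 32.0),
]

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_year_split(rows):
    """Return the Year threshold that minimises the total SSE of the target."""
    years = sorted({r[2] for r in rows})
    best = None
    for lo, hi in zip(years, years[1:]):
        threshold = (lo + hi) / 2
        left = [r[3] for r in rows if r[2] <= threshold]
        right = [r[3] for r in rows if r[2] > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best[0]

threshold = best_year_split(records)

# Segment on Year first; the ID travels with each row but is never a feature.
older = [r for r in records if r[2] <= threshold]
newer = [r for r in records if r[2] > threshold]
older_ids = [r[0] for r in older]
newer_ids = [r[0] for r in newer]
print(threshold, older_ids, newer_ids)
```

Each segment could then be passed to its own DecisionTreeRegressor using only Area as the feature, which forces Year to act as the first split while the ID lists keep every prediction traceable.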
I am trying to create a networkx graph mapping the business connections in our database. In other words, I would like every id (i.e. each individual business) to be a node, and I would like there to be a line connecting the nodes that are 'connected'. A business is considered connected with another if the lead_id and connection_id are associated together as per the data structure below.
lead_id connection_id
56340 1
56340 2
58684 3
58696 4
58947 5
Every example I find on the networkx documentation uses the following
G=nx.random_geometric_graph(200,0.125)
pos=nx.get_node_attributes(G,'pos')
I am trying to determine how to incorporate my values into this.
Here is a way to create a graph from the data presented:
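import networkx as nx
G = nx.Graph()
# Each (lead_id, connection_id) pair becomes one edge; nodes are created implicitly
for lead, connection in zip(data.lead_id, data.connection_id):
    G.add_edge(lead, connection)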
I want to use this to convert a bunch of identifiers but I need to know exactly which taxonomic rank is assigned to each taxonomy code. Shown below is an example of conversion that makes sense but I don't know what to label some of the taxonomy calls. The basic taxonomic ranks are: (domain, kingdom, phylum, class, order, family, genus, and species) https://en.wikipedia.org/wiki/Taxonomic_rank.
For most cases it will be easy, but in the case of having subspecies and strains for bacteria this can get confusing.
How do I get ete3 to specify what rank the lineage IDs correspond to in the taxonomic rank?
import ete3
import pandas as pd
ncbi = ete3.NCBITaxa()
taxon_id = 505
lineage = ncbi.get_lineage(taxon_id)
Se_lineage = pd.Series(ncbi.get_taxid_translator(lineage), name=taxon_id)
Se_lineage[lineage]
1 root
131567 cellular organisms
2 Bacteria
1224 Proteobacteria
28216 Betaproteobacteria
206351 Neisseriales
481 Neisseriaceae
32257 Kingella
505 Kingella oralis
Name: 505, dtype: object
Use ncbi.get_rank(lineage) to get a dictionary of {taxid: rank}, then combine it with the {taxid: name} dictionary from ncbi.get_taxid_translator(lineage) to get {name: rank}.
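A minimal sketch of that transformation is shown below. The helper name `lineage_ranks` is hypothetical, and the `names` and `ranks` dictionaries are hand-written stand-ins for what the two ete3 calls would return for the last three entries of the lineage above, so the snippet runs without the NCBI taxonomy database.

```python
def lineage_ranks(lineage, names, ranks):
    """Pair each taxid's name with its rank, preserving lineage order."""
    return {names[taxid]: ranks[taxid] for taxid in lineage}

# With ete3 the inputs would come from:
#   lineage = ncbi.get_lineage(taxon_id)
#   names = ncbi.get_taxid_translator(lineage)
#   ranks = ncbi.get_rank(lineage)
# Hand-written stand-ins (assumed values) for three entries of the lineage:
lineage = [481, 32257, 505]
names = {481: "Neisseriaceae", 32257: "Kingella", 505: "Kingella oralis"}
ranks = {481: "family", 32257: "genus", 505: "species"}

name_to_rank = lineage_ranks(lineage, names, ranks)
print(name_to_rank)
```

Entries such as subspecies or strains would simply carry whatever rank string the database assigns to them, so they can be filtered or relabelled afterwards.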