I have lists of coordinates in a CSV file (see the picture). How can I convert them to polygons in a GeoDataFrame?
Below are the coordinates of one polygon; I have thousands of rows like this.
[118.103198,24.527338],[118.103224,24.527373],[118.103236,24.527366],[118.103209,24.527331],[118.103198,24.527338]
I tried the following code:
def bike_fence_format(s):
    s = s.replace('[', '').replace(']', '').split(',')
    return s
df['FENCE_LOC'] = df['FENCE_LOC'].apply(bike_fence_format)
df['LAT'] = df['FENCE_LOC'].apply(lambda x: x[1::2])
df['LON'] = df['FENCE_LOC'].apply(lambda x: x[::2])
df['geom'] = Polygon(zip(df['LON'].astype(str),df['LAT'].astype(str)))
But the last step fails, because df['LON'] is a Series rather than a string. How can I get around this problem? An easier way to achieve my goal would be even better.
I recreated a sample df of what your .csv file would give (depending on how you read it in with .read_csv()).
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon
df = pd.DataFrame({'FENCE_LOC': ['[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]',
'[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]',
'[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]']}, index=[0, 1, 2])
Modified your function slightly because we want numeric values, not strings
def bike_fence_format(s):
    s = s.replace('[', '').replace(']', '').split(',')
    s = [float(x) for x in s]
    return s
df['FENCE_LOC'] = df['FENCE_LOC'].apply(bike_fence_format)
df['LAT'] = df['FENCE_LOC'].apply(lambda x: x[1::2])
df['LON'] = df['FENCE_LOC'].apply(lambda x: x[::2])
We can use some list comprehensions to build a list of Shapely polygons.
geom_list = [(x, y) for x, y in zip(df['LON'],df['LAT'])]
geom_list_2 = [Polygon(tuple(zip(x, y))) for x, y in geom_list]
Finally, we can create a gdf using our list of Shapely polygons.
polygon_gdf = gpd.GeoDataFrame(geometry=geom_list_2)
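As a possible shortcut, each string in FENCE_LOC becomes valid JSON once it is wrapped in one more pair of brackets, so the standard json module can parse it straight into coordinate pairs without the separate LON and LAT columns. The sketch below assumes every row has the '[x,y],[x,y],...' layout shown in the question; parse_fence is a hypothetical helper name, not part of any library.

```python
import json


def parse_fence(s):
    """Parse '[x1,y1],[x2,y2],...' into a list of (x, y) float tuples."""
    # Wrapping the string in brackets turns it into a JSON list of pairs.
    pairs = json.loads(f"[{s}]")
    return [(float(x), float(y)) for x, y in pairs]


# The sample row from the question.
ring = parse_fence(
    "[118.103198,24.527338],[118.103224,24.527373],[118.103236,24.527366],"
    "[118.103209,24.527331],[118.103198,24.527338]"
)
```

Each returned list can be passed directly to Polygon, for example df['FENCE_LOC'].apply(lambda s: Polygon(parse_fence(s))).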
To provide a small representative dataset similar to what the OP posted as an image, I created these rows of data (sorry for the many decimal digits):
[[-2247824.100899419,-4996167.43201861],[-2247824.100899419,-4996067.43201861],[-2247724.100899419,-4996067.43201861],[-2247724.100899419,-4996167.43201861],[-2247824.100899419,-4996167.43201861]]
[[-2247724.100899419,-4996167.43201861],[-2247724.100899419,-4996067.43201861],[-2247624.100899419,-4996067.43201861],[-2247624.100899419,-4996167.43201861],[-2247724.100899419,-4996167.43201861]]
[[-2247624.100899419,-4996167.43201861],[-2247624.100899419,-4996067.43201861],[-2247524.100899419,-4996067.43201861],[-2247524.100899419,-4996167.43201861],[-2247624.100899419,-4996167.43201861]]
[[-2247824.100899419,-4996067.43201861],[-2247824.100899419,-4995967.43201861],[-2247724.100899419,-4995967.43201861],[-2247724.100899419,-4996067.43201861],[-2247824.100899419,-4996067.43201861]]
[[-2247724.100899419,-4996067.43201861],[-2247724.100899419,-4995967.43201861],[-2247624.100899419,-4995967.43201861],[-2247624.100899419,-4996067.43201861],[-2247724.100899419,-4996067.43201861]]
[[-2247624.100899419,-4996067.43201861],[-2247624.100899419,-4995967.43201861],[-2247524.100899419,-4995967.43201861],[-2247524.100899419,-4996067.43201861],[-2247624.100899419,-4996067.43201861]]
[[-2247824.100899419,-4995967.43201861],[-2247824.100899419,-4995867.43201861],[-2247724.100899419,-4995867.43201861],[-2247724.100899419,-4995967.43201861],[-2247824.100899419,-4995967.43201861]]
[[-2247724.100899419,-4995967.43201861],[-2247724.100899419,-4995867.43201861],[-2247624.100899419,-4995867.43201861],[-2247624.100899419,-4995967.43201861],[-2247724.100899419,-4995967.43201861]]
[[-2247624.100899419,-4995967.43201861],[-2247624.100899419,-4995867.43201861],[-2247524.100899419,-4995867.43201861],[-2247524.100899419,-4995967.43201861],[-2247624.100899419,-4995967.43201861]]
This data is saved as the file polygon_data.csv.
In the code, the modules are loaded first:
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
Then the data is read into a dataframe with pandas.read_csv(). To get each row of data into a single column of the dataframe, delimiter="x" is used. Since there is no x within any row of data, each whole row is read as one long string.
df3 = pd.read_csv('polygon_data.csv', header=None, index_col=None, delimiter="x")
To view the content of df3, you can run
df3.head()
and get a single-column dataframe (with header 0):
0
0 [[-2247824.100899419,-4996167.43201861],[-2247...
1 [[-2247724.100899419,-4996167.43201861],[-2247...
2 [[-2247624.100899419,-4996167.43201861],[-2247...
3 [[-2247824.100899419,-4996067.43201861],[-2247...
4 [[-2247724.100899419,-4996067.43201861],[-2247...
Next, df3 is used to create a GeoDataFrame. The data in each row of df3 is turned into a Polygon object, and these polygons become the geometry of the GeoDataFrame polygon_df3.
geometry = [Polygon(eval(xy_string)) for xy_string in df3[0]]
polygon_df3 = gpd.GeoDataFrame(df3,
                               #crs="EPSG:4326",  # uncomment this if (x, y) is long/lat
                               geometry=geometry)
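Because eval executes whatever Python expression it is given, ast.literal_eval from the standard library may be a safer choice for strings that come from a file: it accepts only literals such as nested lists of numbers. A minimal sketch, using one row of the sample data:

```python
import ast

# One row of polygon_data.csv, copied from the sample data.
row = (
    "[[-2247824.100899419,-4996167.43201861],[-2247824.100899419,-4996067.43201861],"
    "[-2247724.100899419,-4996067.43201861],[-2247724.100899419,-4996167.43201861],"
    "[-2247824.100899419,-4996167.43201861]]"
)

# Parses the nested list of [x, y] pairs without executing any code.
coords = ast.literal_eval(row)

# Anything that is not a plain literal is rejected instead of being run.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

In the list comprehension, Polygon(ast.literal_eval(xy_string)) would then take the place of Polygon(eval(xy_string)).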
Finally, the geoDataFrame can be plotted with a simple command:
# plot the GeoDataFrame
polygon_df3.plot(edgecolor='black')
With the sample data above, the output plot is a 3 × 3 grid of adjacent squares.
For context:
I am making a visual graph for a protein-protein interaction network. A node here corresponds to a protein and an edge would indicate interaction between two nodes.
Here is my code:
First I import all the modules and files that I need:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
interactome_edges = pd.read_csv("*a_directory*", delimiter = "\t", header = None)
interactome_nodes = pd.read_csv("*a_directory*", delimiter = "\t", header = None)
# A few adjustments for the dataframes
interactome_nodes = interactome_nodes.drop(columns = [0])
interactome_edges.columns = ["node1","node2"]
Dataframe for nodes looks like this:
1
0 MET3
1 IMD3
2 OLE1
3 MUP1
4 PIS1
...
Dataframe for edges looks like this:
node1 node2
0 MET3 MET3
1 IMD3 IMD4
2 OLE1 OLE1
3 MUP1 MUP1
4 PIS1 PIS1
...
Basically the edge goes from node1 to node2
Now I iterate through each row from the node dataframe and edge dataframe and use it as networkx nodes and edges.
interactome = nx.Graph()
# Adding Nodes to Graph
for index, row in interactome_nodes.iterrows():
    interactome.add_nodes_from(row)
# Adding Edges to Graph
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row["node1", "node2"]) #### Here is the problem
My problem is in the part that adds the edges.
I am currently getting the following error:
KeyError: ('node1', 'node2')
I have also tried:
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from((row["node1"],row["node2"]))
and:
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row["node1"],row["node2"])
and also simply:
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row)
All of these give me some form of error.
How can I use my node to node dataframe as edges for a networkx graph?
In [9]: import networkx as nx
In [10]: import pandas as pd
In [11]: df = pd.read_csv("a.csv")
In [12]: df
Out[12]:
node1 node2
0 MET3 MET3
1 IMD3 IMD4
2 OLE1 OLE1
3 MUP1 MUP1
4 PIS1 PIS1
In [13]: G=nx.from_pandas_edgelist(df, "node1", "node2")
In [14]: [e for e in G.edges]
Out[14]:
[('MET3', 'MET3'),
('IMD3', 'IMD4'),
('OLE1', 'OLE1'),
('MUP1', 'MUP1'),
('PIS1', 'PIS1')]
NetworkX has functions for reading from pandas dataframes. Here I use the from_pandas_edgelist function on the edge dataframe you provided.
The graph can then be plotted and saved:
import matplotlib.pyplot as plt
nx.draw_planar(G, with_labels = True)
plt.savefig("filename2.png")
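For reference, the loop-based attempts in the question fail because add_edges_from expects an iterable of (u, v) pairs, whereas row["node1", "node2"] looks up a single label that happens to be a tuple, hence the KeyError. A minimal sketch of two ways around that, using a hand-built edge dataframe in place of the real file:

```python
import networkx as nx
import pandas as pd

# Stand-in for interactome_edges, built from the rows shown in the question.
edges = pd.DataFrame({
    "node1": ["MET3", "IMD3", "OLE1", "MUP1", "PIS1"],
    "node2": ["MET3", "IMD4", "OLE1", "MUP1", "PIS1"],
})

# Option 1: one edge per row with add_edge (singular), which takes two nodes.
g_loop = nx.Graph()
for _, row in edges.iterrows():
    g_loop.add_edge(row["node1"], row["node2"])

# Option 2: give add_edges_from an iterable of (u, v) pairs.
g_bulk = nx.Graph()
g_bulk.add_edges_from(zip(edges["node1"], edges["node2"]))
```

Both graphs end up with the same edges as the from_pandas_edgelist result; the self-pairs such as MET3 to MET3 become self-loops.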
I have a dataframe with 50 data points per month. I'd like to run a groupby on the date, and then calculate the median value for each decile within each month. I've been able to accomplish this with the code below:
import numpy as np
import pandas as pd
datecol = pd.date_range('12/31/2018','12/31/2019', freq='M')
for ii in range(0,49):
    datecol = datecol.append(pd.date_range('12/31/2018','12/31/2019', freq='M'))
datecol = datecol.sort_values()
df = pd.DataFrame(np.random.randn(len(datecol), 1), index=datecol, columns=['Data'])
dfg = df.groupby([df.index, pd.qcut(df['Data'], 10)])['Data'].median()
Now I'd like to be able to rearrange the dataframe so each decile has its own column. My goal is to plot each decile over time.
You can do:
dfg.unstack(-1).plot()
Output: a line plot with one line per decile over time.
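Note that pd.qcut(df['Data'], 10) in the question computes the decile boundaries over the whole column, not month by month. If the deciles are meant to be computed separately within each month (an assumption about the intent), one possible variant is to assign the decile with a grouped transform first and then unstack. The sketch uses a small synthetic frame with a hypothetical month column instead of the date index:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.to_datetime(["2019-01-31", "2019-02-28", "2019-03-31"])
df = pd.DataFrame({
    "month": months.repeat(50),          # 50 data points per month
    "Data": rng.standard_normal(150),
})

# Decile number (0-9) of each value within its own month.
df["decile"] = df.groupby("month")["Data"].transform(
    lambda s: pd.qcut(s, 10, labels=False)
)

# Median per (month, decile), then move the decile level into columns.
wide = df.groupby(["month", "decile"])["Data"].median().unstack("decile")
```

Here wide has one row per month and one column per decile, so wide.plot() would draw one line per decile.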
I know we can use the following code to create a decile column based on a column of a given data set, even when there are ties in the data (see "How to qcut with non unique bin edges?"):
import numpy as np
import pandas as pd
# create a sample
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 3), columns=list('ABC'))
# sort by column C
df = df.sort_values(['C'] , ascending = False )
# create decile by column C
df['decile'] = pd.qcut(df['C'].rank(method='first'), 10, labels=np.arange(10, 0, -1))
Is there an easy way to save the cut points from df and then use the same cut points to cut a new data set? For example:
np.random.seed([1])
df_new = pd.DataFrame(np.random.rand(100, 1), columns=list('C'))
You can use .left to collect the left edge of every bin:
s1=pd.Series([1,2,3,4,5,6,7,8,9])
s2=pd.Series([2,3,4,6,1])
a=pd.qcut(s1,10).unique()
bins=[x.left for x in a ] + [np.inf]
pd.cut(s2,bins=bins)
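pd.qcut can also return the bin edges directly through retbins=True, which avoids rebuilding them from the interval objects and does not depend on the order in which unique() returns the intervals. A minimal sketch on made-up data (the names train and new are placeholders); the outer edges are widened to infinity here so that new values outside the original range still land in the first or last bin, which is a design choice rather than a requirement:

```python
import numpy as np
import pandas as pd

train = pd.Series(np.arange(1.0, 101.0))   # 1.0, 2.0, ..., 100.0
new = pd.Series([0.5, 10.0, 15.0, 55.0, 250.0])

# retbins=True returns the 11 decile edges alongside the bin codes.
train_decile, edges = pd.qcut(train, 10, labels=False, retbins=True)

# Keep the inner edges, open up the two outer ones.
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Reuse the saved edges on the new data.
new_decile = pd.cut(new, bins=open_edges, labels=False)
```

The edges array can be stored (for example with np.save) and reloaded later to cut further data sets the same way.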
I have the following problem:
Given a 2D dataframe with one column of values and another giving the category of each point, I would like to run k-means on the per-category means and then add a new column to the original dataframe that holds, for each row, the centroid closest to the mean of that row's category.
I would like to do this using groupby.
More generally, my problem is that apply (to my knowledge) can only use functions that are defined on individual groups (like mean()), whereas k-means needs information about all the groups. Is there a nicer way than converting everything to numpy arrays and working with those?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2
k=4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means = groups.mean().unstack()
centroids, dictionary = kmeans2(means,k)
fig, ax = plt.subplots()
print(dictionary)
What I would like to get now is a new column in df that gives the value in dictionary for each entry.
You can achieve it by the following:
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans2
k = 4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means_data_frame = pd.DataFrame(groups.mean())
centroid, means_data_frame['cluster'] = kmeans2(means_data_frame['B'], k)
df = df.join(means_data_frame, rsuffix='_mean', on='A')
This appends two more columns to df, B_mean and cluster, holding the group's mean and the cluster that the group's mean is closest to, respectively.
If you really want to use apply, you can write a function that reads the cluster value from means_data_frame and assigns it to a new column in df.
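A lookup of this kind can also be written with Series.map instead of a custom apply function. The sketch below uses a tiny hand-made frame, and the cluster labels are hard-coded stand-ins for the array that kmeans2 would return (hypothetical values, one per group), so that the mapping step can be seen in isolation:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 50, 90]})

# Per-group mean of B, indexed by the group key A: 1 -> 15, 2 -> 40, 3 -> 90.
group_means = df.groupby("A")["B"].mean()

# Stand-in for the labels from kmeans2 (hypothetical, one label per group).
labels = pd.Series([0, 0, 1], index=group_means.index, name="cluster")

# Broadcast the group-level values back onto the rows via the key column.
df["B_mean"] = df["A"].map(group_means)
df["cluster"] = df["A"].map(labels)
```

The equivalent with apply would be df["A"].apply(lambda a: labels.loc[a]); map does the same lookup without the explicit function.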