Convert pandas column with FeatureCollection to GeoJSON - python

I downloaded a CSV that contains a column in GeoJSON format and imported it as a pandas DataFrame. How can I convert this to a GeoJSON (.geojson) file? I have about 10,000 rows, each with information as shown below.
This is an example of a cell in the column:
{"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[-0.0903517,9.488375],[-0.0905786,9.488523],[-0.0909767,9.48913],[-0.09122,9.4895258],[-0.0909733,9.4901503],[-0.0908833,9.4906802],[-0.0906984,9.4905612],[-0.0907146,9.4898184],[-0.090649,9.4895175],[-0.0907516,9.489142],[-0.0906146,9.4889654],[-0.0903517,9.488375]]]},"properties":{"pointCount":"11","length":"502.9413","area":"8043.091133117676"}}]}
An overview of my pandas DataFrame as printed now:
site_registration_gps_area ... geometry
11 {"type":"FeatureCollection","features":[{"type... ... POINT (-76.75880 2.38031)
14 {"type":"FeatureCollection","features":[{"type... ... POINT (-76.73718 2.33163)
40 {"type":"FeatureCollection","features":[{"type... ... POINT (-0.15727 9.69560)
42 {"type":"FeatureCollection","features":[{"type... ... POINT (-0.11686 9.65522)
44 {"type":"FeatureCollection","features":[{"type... ... POINT (-0.10379 9.65226)

I would suggest using GeoPandas for this task. The simplest approach is to read your CSV file directly into a GeoDataFrame. That looks something like this:
import geopandas as gpd
# read the CSV file into a GeoDataFrame
gdf = gpd.read_file('myFile.csv')
# convert the GeoDataFrame to a GeoJSON string
geo_json = gdf.to_json()
# or, if the result gets very big, write the GeoDataFrame straight to a .geojson file
gdf.to_file('path', driver='GeoJSON')
However, if you are working with the pandas DataFrame you already have, you can convert it to a GeoDataFrame and then reuse the same steps as above. Note that each cell holds the GeoJSON as a string, so it has to be parsed with json.loads first:
import json
import pandas as pd
import geopandas as gpd
# your code here ...
# extract the column holding the GeoJSON strings as a list
geo_j_list = df['site_registration_gps_area'].tolist()
# temporary list for GeoDataFrames
gdfs = []
# iterate over the GeoJSON strings, parsing each one first
for geo_j in geo_j_list:
    features = json.loads(geo_j)['features']
    gdfs.append(gpd.GeoDataFrame.from_features(features))
# merge the list of GeoDataFrames into one
gdf = gpd.GeoDataFrame(pd.concat(gdfs, ignore_index=True))
# convert the gdf to a GeoJSON string
geo_json = gdf.to_json()
# store the gdf in a geojson file
gdf.to_file('path', driver='GeoJSON')
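Alternatively, since each cell in your example holds a FeatureCollection with exactly one Feature, you can parse each cell into a shapely geometry and keep all the other DataFrame columns alongside it. A minimal sketch, assuming the column names from your question, WGS84 coordinates, and an illustrative output file name:
import json
import geopandas as gpd
from shapely.geometry import shape

# parse the first (and only) feature of each FeatureCollection into a shapely geometry
geoms = [
    shape(json.loads(s)['features'][0]['geometry'])
    for s in df['site_registration_gps_area']
]
# build a GeoDataFrame that keeps the original columns (dropping the old point geometry)
gdf = gpd.GeoDataFrame(df.drop(columns=['geometry']), geometry=geoms, crs='EPSG:4326')
gdf.to_file('areas.geojson', driver='GeoJSON')
This writes one GeoJSON Feature per row instead of one FeatureCollection per row.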

Related

How to show all data (no ellipses) when converting NETCDF to CSV?

I have a trajectory file from a molecular simulation that is written in netCDF format. I would like to convert this file to .csv format so that I can apply further Python-based analysis of the proximity between molecules. The trajectory file contains information corresponding to 3D Cartesian coordinates for all 6500 atoms of my simulation for each time step.
I have used the script below to convert this netCDF file to .csv using the netCDF4 and pandas modules.
import netCDF4
import pandas as pd
fp='TEST4_simulate1.traj'
dataset = netCDF4.Dataset(fp, mode='r')
cols = list(dataset.variables.keys())
list_dataset = []
for c in cols:
    list_dataset.append(list(dataset.variables[c][:]))
#print(list_dataset)
df_dataset = pd.DataFrame(list_dataset)
df_dataset = df_dataset.T
df_dataset.columns = cols
df_dataset.to_csv("file_path.csv", index = False)
A small selection of the output .csv file is given below. Notice that a set of ellipses appears between the first and last three sets of atomic coordinates.
time,spatial,coordinates
12.0,b'x',"[[ 33.332325 -147.24976 -107.131 ]
[ 34.240444 -147.80115 -107.4043 ]
[ 33.640083 -146.47362 -106.41945 ]
...
[ 70.31757 -16.499006 -186.13313 ]
[ 98.310844 65.95696 76.43664 ]
[ 84.08772 52.676186 145.48856 ]]"
How can I modify this code so that the entirety of my atomic coordinates are written to my .csv file?
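The ellipses are not in the data itself; they come from numpy's print truncation. Each cell of the coordinates column holds an entire numpy array, and to_csv stringifies it using numpy's repr, which abbreviates long arrays. A minimal sketch of a fix, raising the print threshold before writing and otherwise keeping the script above unchanged:
import sys
import numpy as np

# disable numpy's print truncation so the full arrays are written out
np.set_printoptions(threshold=sys.maxsize)
df_dataset.to_csv("file_path.csv", index=False)
A cleaner long-term layout would be one row per atom per time step (flattening the arrays into real columns), but the print-options change answers the question as asked.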

Python df lat long in for loop

I wanted to rewrite the code as a for-loop so that I can set the style of each point.
The code below works fine without a for-loop:
import simplekml
import pandas as pd
excel_file = 'sample.xlsx'
df=pd.read_excel(excel_file)
kml = simplekml.Kml()
df.apply(lambda X: kml.newpoint(coords=[(X["Long"], X["Lat"])]), axis=1)
kml.save(path = "data.kml")
I wanted to do it in a for-loop so that I can apply a style to each point, but my for-loop is not working:
import simplekml
import pandas as pd
kml = simplekml.Kml()
style = simplekml.Style()
excel_file = 'sample1.xlsx'
df=pd.read_excel(excel_file)
y=df.Long
x=df.Lat
MinLat=int(df.Lat.min())
MaxLat=int(df.Lat.max())
MinLong=int(df.Long.min())
MaxLong=int(df.Long.max())
multipnt =kml.newmultigeometry()
for long in range(MinLong,MaxLong): # Generate longitude values
    for lat in range(MaxLat,MinLat): # Generate latitude values
        multipnt.newpoint(coords=[(y,x)])
        #kml.newpoint(coords=[(y,x)])
kml.save("Point Shared Style.kml")
If you want to iterate over a collection of points in an Excel file and add them to a single Placemark as a MultiGeometry using a for-loop, then try this. Note that KML coordinates are in (longitude, latitude) order:
import simplekml
import pandas as pd
kml = simplekml.Kml()
style = simplekml.Style()
excel_file = 'sample1.xlsx'
df = pd.read_excel(excel_file)
multipnt = kml.newmultigeometry()
for row in df.itertuples(index=False):
    multipnt.newpoint(coords=[(row.Long, row.Lat)])
kml.save("PointSharedStyle.kml")
If you want to generate a grid of points at every whole degree across the bounding box of your points, then try the following:
import simplekml
import pandas as pd
kml = simplekml.Kml()
style = simplekml.Style()
excel_file = 'sample1.xlsx'
df = pd.read_excel(excel_file)
multipnt = kml.newmultigeometry()
MinLat = int(df.Lat.min())
MaxLat = int(df.Lat.max())
MinLong = int(df.Long.min())
MaxLong = int(df.Long.max())
for long in range(MinLong, MaxLong+1): # Generate longitude values
    for lat in range(MinLat, MaxLat+1): # Generate latitude values
        multipnt.newpoint(coords=[(long, lat)])
        #kml.newpoint(coords=[(long,lat)])
kml.save("PointSharedStyle.kml")
Note that the Style is assigned to the Placemark, not the geometry, so the MultiGeometry can only carry a single Style shared by all of its points. If you want a different style for each point, you need to create one Placemark per point and assign each one its own Style, as sketched below.
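A minimal sketch of per-point styling, assuming the same Lat/Long columns as above (the color and scale values are just illustrations):
import simplekml
import pandas as pd

df = pd.read_excel('sample1.xlsx')
kml = simplekml.Kml()
for row in df.itertuples(index=False):
    pnt = kml.newpoint(coords=[(row.Long, row.Lat)])  # one placemark per point
    pnt.style.iconstyle.scale = 2                     # so each point can carry its own style
    pnt.style.iconstyle.color = simplekml.Color.red
kml.save("PerPointStyle.kml")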
For help setting styles, see https://simplekml.readthedocs.io/en/latest/styles.html

How do I write scikit-learn dataset to csv file

I can load a data set from scikit-learn using
from sklearn import datasets
data = datasets.load_boston()
print(data)
What I'd like to do is write this data set to a flat file (.csv)
Using the open() function,
f = open('boston.txt', 'w')
f.write(str(data))
works, but includes the description of the data set.
I'm wondering if there is some way that I can generate a simple .csv with headers from this Bunch object so I can move it around and use it elsewhere.
datasets.load_boston() returns a Bunch, a dictionary-like object. In order to write the data to a .csv file you need the actual data, data['data'], and the column names, data['feature_names']. You can use these to build a pandas DataFrame and then call to_csv() to write the data to a file:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df.to_csv('boston.txt', sep = ',', index = False)
and the output boston.txt should be:
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
...
There are various toy datasets in scikit-learn, such as the Iris and Boston datasets. Let's load the Boston dataset:
from sklearn import datasets
boston = datasets.load_boston()
What type of object is this? If we examine its type, we see that this is a scikit-learn Bunch object.
print(type(boston))
Output:
<class 'sklearn.utils.Bunch'>
A scikit-learn Bunch object is a kind of dictionary, so we can treat it as one and use dictionary methods. Let's look at the keys:
print(boston.keys())
output:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
Here we are interested in data, feature_names and target keys. We will import pandas module and use these keys to create a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
We should also add the target variable, i.e. what we are trying to predict, to the DataFrame. Its name is given in the dataset description, which we can read in full with print(boston["DESCR"]); there we see that the target variable is called MEDV. Now we can add it to the DataFrame:
df['MEDV'] = boston['target']
There is only one step left. We are exporting the DataFrame to a csv file without index numbers:
df.to_csv("scikit_learn_boston_dataset.csv", index=False)
BONUS: the Iris loader has additional parameters that we can utilize (see the load_iris documentation). The following code automatically creates the DataFrame with the target variable included:
iris = datasets.load_iris(as_frame=True)
df = iris["frame"]
Note: If we print(iris.keys()), we can see the 'frame' key:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
BONUS2: If we print(boston["filename"]) or print(iris["filename"]), we can see the physical locations of the csv files of these datasets. For instance:
C:\Users\user\anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv
Just wanted to add to the reply above that you should probably include the target variable (here named "MV") as well. I added one extra line below:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df['MV'] = data['target']
df.to_csv('boston.txt', sep = ',', index = False)
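Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions the snippets above need a different loader. A sketch using fetch_openml instead (assuming network access; the name/version arguments refer to the OpenML copy of this dataset):
from sklearn.datasets import fetch_openml

# fetch the Boston housing data from OpenML; as_frame=True returns pandas objects
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston["frame"]  # features plus the target column
df.to_csv("boston.csv", index=False)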

Extract multiple polygon coordinates of csv file

I want to extract the (multiple) polygon coordinates from a .xlsx file into a pandas DataFrame in Python.
The .xlsx file is available on google docs.
Now I do this:
import pandas as pd
gemeenten2019 = pd.read_excel('document.xlsx', index=False, skiprows=0 )
gemeenten2019['KML'] = str(gemeenten2019['KML'])
for index, row in gemeenten2019.iterrows():
    removepart = str(row['KML'])
    row['KML'] = removepart.replace('<MultiGeometry><Polygon><coordinates>', '')
gemeentenamen = []
gemeentePolygon = []
for gemeentenaam in gemeenten2019['NAAM']:
    gemeentenamen.append(str(gemeentenaam))
for value in gemeenten2019['KML']:
    gemeentePolygon.append(str(value))
df_gemeenteCoordinaten = pd.DataFrame({'Gemeente':gemeentenamen, 'KML': gemeentePolygon})
df_gemeenteCoordinaten
But the result is that every row in the "KML" column contains the same value. I want each row to hold only its own coordinates, not the coordinates of all the rows combined.
The dataframe should have one row per municipality (NAAM) with only that municipality's coordinates in the KML column.
Does anyone know how to extract the multiple coordinates for each row?
The problem is that str(gemeenten2019['KML']) converts the entire column to a single string, which is why every row ends up with the same value. Work on the column with vectorized string methods instead. This would give you each coordinate pair on its own line:
import pandas as pd
gemeenten2019 = pd.read_excel('Gemeenten 2019.xlsx', index=False, skiprows=0)
gemeenten2019['KML'] = gemeenten2019['KML'].str.strip('<>/abcdefghijklmnopqrstuvwxyzGMP').str.replace(' ', '\n')
For example:
NAAM KML
0 Aa en Hunze 6.81394482119469,53.070971596018\n6.8612875225...
1 Aalsmeer 4.79469736599488,52.2606817589009\n4.795085405...
2 Aalten 6.63891586106867,51.9625470164657\n6.639463741...
3 Achtkarspelen 6.23217311778447,53.2567474241222\n6.235100748...
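The strip-based approach above depends on the exact characters appearing in the tags. An alternative sketch that removes all KML/XML tags row by row with a regex (assuming the same column names; the pattern is an illustration, not tuned to every KML variant):
import pandas as pd

gemeenten2019 = pd.read_excel('Gemeenten 2019.xlsx', skiprows=0)
# strip every <...> tag per row, leaving only the coordinate text
gemeenten2019['KML'] = gemeenten2019['KML'].str.replace(r'<[^>]+>', ' ', regex=True).str.strip()
df_gemeenteCoordinaten = gemeenten2019[['NAAM', 'KML']].rename(columns={'NAAM': 'Gemeente'})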

Numpy: load multiple CSV files as dictionaries

I want to use the numpy loadtxt method to read .csv files for my experiment. I have three different time series in the following format, with different characteristics, where the first column is the timestamp and the second column is the value.
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
For reproducibility, I have shared the three time-series data I am using here.
If I do it like the following
import numpy as np
fname="data1.csv"
col_time,col_window = np.loadtxt(fname,delimiter=',').T
It works as intended. However, instead of reading only a single file, I want to pass a dictionary to col_time, col_window = np.loadtxt(types, delimiter=',').T, like the following:
protocols = {}
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
so that I can read multiple CSV files and plot all the results at once using a single for-loop, as in the following.
for protname, fname in types.items():
    col_time, col_window = protocols[protname]["col_time"], protocols[protname]["col_window"]
    rt = np.exp(np.diff(np.log(col_window)))
    plt.plot(quotient_times, quotient, ".", markersize=4, label=protname)
    plt.title(protname)
    plt.xlabel("t")
    plt.ylabel("values")
    plt.legend()
    plt.show()
But it gives me the error ValueError: could not convert string to float: b'data1'. How can I load multiple CSV files into a dictionary?
Assuming that you want to build a protocols dict that will be useable in your code, you can easily build it with a simple loop:
import numpy as np

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
protocols = {}
for name, file in types.items():
    col_time, col_window = np.loadtxt(file, delimiter=',').T
    protocols[name] = {'col_time': col_time, 'col_window': col_window}
You can then successfully plot the 3 graphs:
import matplotlib.pyplot as plt

for protname, fname in types.items():
    col_time, col_window = protocols[protname]["col_time"], protocols[protname]["col_window"]
    rt = np.exp(np.diff(np.log(col_window)))
    plt.plot(col_time, col_window, ".", markersize=4, label=protname)
    plt.title(protname)
    plt.xlabel("t")
    plt.ylabel("values")
    plt.legend()
    plt.show()
Neither pandas nor numpy can load multiple CSV files in a single call. You can load the files one at a time and combine them with pandas' concat function. The example below demonstrates this using pandas; replace the StringIO objects with file objects (or paths) to read real files.
data="""
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
"""
data2="""
0.086206438,29
0.086425551,32
0.089227066,50
0.089262508,54
"""
data3="""
0.086206438,69
0.086425551,72
0.089227066,70
0.089262508,74
"""
import pandas as pd
from io import StringIO
files={"data1":data,"data2":data2,"data3":data3}
# Load the first file into data frame
key=list(files.keys())[0]
df=pd.read_csv(StringIO(files.get(key)),header=None,usecols=[0,1],names=['data1','data2'])
print(df.head())
# remove file from dictionary
files.pop(key,None)
print("final values")
# Efficient: concat this dataframe with the remaining files
df = pd.concat(
    [df] + [pd.read_csv(StringIO(files[i]), header=None, usecols=[0, 1], names=['data1', 'data2'])
            for i in files.keys()],
    ignore_index=True)
print(df.tail())
For more insight: pandas append vs concat
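For the actual files from the question, the same idea reduces to a short sketch (assuming the types dict from above; the column names are just labels):
import pandas as pd

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
frames = []
for name, path in types.items():
    frame = pd.read_csv(path, header=None, names=['col_time', 'col_window'])
    frame['protocol'] = name  # remember which file each row came from
    frames.append(frame)
df = pd.concat(frames, ignore_index=True)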
