Read Shapefiles into Dataframe - python

I have a shapefile that I would like to convert into a dataframe in Python 3.7. I have tried the following code:
import pandas as pd
import shapefile
sf_path = r'data/shapefile'
sf = shapefile.Reader(sf_path, encoding = 'Shift-JIS')
fields = [x[0] for x in sf.fields][1:]
records = sf.records()
shps = [s.points for s in sf.shapes()]
sf_df = pd.DataFrame(columns = fields, data = records)
But I got this error message saying
TypeError: Expected list, got _Record
So how should I convert the _Record objects to lists, or is there a way around this? I have tried GeoPandas too, but had some trouble installing it. Thanks!

def read_shapefile(sf_shape):
    """
    Read a shapefile into a Pandas dataframe with a 'coords'
    column holding the geometry information. This uses the pyshp
    package.
    """
    fields = [x[0] for x in sf_shape.fields][1:]
    records = [y[:] for y in sf_shape.records()]  # slice each _Record into a plain list
    shps = [s.points for s in sf_shape.shapes()]
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps)
    return df
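For example, applying it to the reader from the question (same path and encoding):
sf = shapefile.Reader(sf_path, encoding='Shift-JIS')   # reader from the question
sf_df = read_shapefile(sf)                             # one row per record, plus a 'coords' column
print(sf_df.head())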

I had the same problem. It happens because sf.records() returns _Record objects rather than plain lists, while pandas expects list-like rows when building the dataframe. Slicing each record converts it to a plain list; try changing the records line to:
records = [y[:] for y in sf.records()]
I hope this works!
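Applied to the snippet in the question, only the records line changes; a minimal sketch:
import pandas as pd
import shapefile

sf_path = r'data/shapefile'
sf = shapefile.Reader(sf_path, encoding='Shift-JIS')
fields = [x[0] for x in sf.fields][1:]
records = [y[:] for y in sf.records()]   # slice each _Record into a plain list
shps = [s.points for s in sf.shapes()]
sf_df = pd.DataFrame(columns=fields, data=records)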


Missing features/groups from json file, Python

I am trying to extract the English and Korean names of each municipality in South Korea from a municipality-level geojson file, and here is my Python code.
import json
import pandas as pd
Korean_municipalities = json.load(open('skorea-municipalities-2018-geo.json', 'r'))
munic_map_eng = {}
for feature in Korean_municipalities['features']:
    feature['id'] = feature['properties']['name_eng']
    munic_map_eng[feature['properties']['name']] = feature['id']
df_munic = pd.DataFrame(list(munic_map_eng.items()))
There are 250 municipalities. That is
len(Korean_municipalities['features']) = 250
However, there are only 227 in the data frame df_munic. That is
df_munic.shape = (227,2)
It seems like 23 municipalities are missing in this case. I use the same code at the province and sub-municipality levels. At the sub-municipality level the issue is the same: 3504 sub-municipalities, but the data frame has only 3142 rows. However, there is no such problem at the province level (17 provinces).
Any idea where things may go wrong?
Thanks!
There must be duplicate feature['properties']['name'] values. You're using this as the dictionary key, and keys must be unique, so you only get one row in the dataframe for each name.
Use a list instead of a dictionary to save them all.
import json
import pandas as pd
Korean_municipalities = json.load(open('skorea-municipalities-2018-geo.json', 'r'))
munic_list_eng = []
for feature in Korean_municipalities['features']:
    feature['id'] = feature['properties']['name_eng']
    munic_list_eng.append((feature['properties']['name'], feature['id']))
df_munic = pd.DataFrame(munic_list_eng)
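To confirm that duplicate names are indeed the cause, you could compare the number of features with the number of distinct names; a quick check along these lines (not part of the original answer):
names = [f['properties']['name'] for f in Korean_municipalities['features']]
print(len(names), len(set(names)))   # e.g. 250 vs. 227 would account for the 23 missing rows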

some coordinates that I extracted from geocoder in Python are not saving in the variable I created

Hi,
I want to save some coordinates (latitudes and longitudes) that I extracted with a geocoder. The problem is that they are not being saved, and I can't seem to add them as columns to the table I generated with pandas.
I get this error:
AttributeError: 'NoneType' object has no attribute 'latitude'
import pandas
from geopy.geocoders import Nominatim
df1= pandas.read_json("supermarkets.json")
nom= Nominatim(scheme= 'http')
lis= list(df1.index)
for i in lis:
    l = nom.geocode(list(df1.loc[i, "Address":"Country"]))
    j = [] + [l.latitude]
    k = [] + [l.longitude]
I'm looking for a way to save the coordinates and include them in my table. Thanks
nom.geocode(..) [geopy-doc] can return None when the address cannot be found, or when the query is not answered in sufficient time. This is specified in the documentation:
Return type:
None, geopy.location.Location or a list of them, if exactly_one=False.
from operator import attrgetter
locations = df1.loc[:, 'Address':'Country'].apply(
    lambda r: nom.geocode(list(r)), axis=1
)
nonnull = locations.notnull()
df1.loc[nonnull, 'longitude'] = locations[nonnull].apply(attrgetter('longitude'))
df1.loc[nonnull, 'latitude'] = locations[nonnull].apply(attrgetter('latitude'))
Here we first query all the locations, then check which lookups were successful, and retrieve the latitude and longitude only for those rows.
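Rows whose lookup returned None simply keep NaN in the new columns, so you can afterwards inspect which addresses Nominatim could not resolve; a small check along these lines (column names taken from the question's data):
failed = df1[df1['latitude'].isna()]   # rows where geocoding returned None
print(failed[['Address', 'Country']])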

How to filter spark dataframe with URL exists or not?

I want to filter my Spark dataframe. In this dataframe, there is a column of URLs.
I have tried to use os.path.exists(col("url")) to filter my dataframe, but I got errors like
"string is needed, but column has been found".
Here is part of my code. It currently uses pandas, and now I want to implement the following with Spark:
bob_ross = pd.DataFrame.from_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv")
bob_ross['image'] = ""
# create a column for each of the 85 colors (these will be c0...c84)
# we'll do this in a separate table for now and then merge
cols = ['c%s'%i for i in np.arange(0,85)]
colors = pd.DataFrame(columns=cols)
colors['EPISODE'] = bob_ross.index.values
colors = colors.set_index('EPISODE')
# figure out if we have the image or not, we don't have a complete set
for s in bob_ross.index.values:
    b = bob_ross.loc[s]['TITLE']
    b = b.lower()
    b = re.sub(r'[^a-z0-9\s]', '', b)
    b = re.sub(r'\s', '_', b)
    img = b + ".png"
    if os.path.exists("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img):
        bob_ross.set_value(s, "image", "/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        t = getColors("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        colors.loc[s] = t
bob_ross = bob_ross.join(colors)
bob_ross = bob_ross[bob_ross.image != ""]
Here is how I tried to implement it with Spark; I am stuck at the error line:
from pyspark.sql.functions import *
bob_ross = spark.read.csv('/mnt/umsi-data-science/si618wn2017/bob_ross.csv',header=True)
bob_ross=bob_ross.withColumn("image",concat(lit("/dbfs/mnt/umsi-data-science/si618wn2017/images/"),concat(regexp_replace(regexp_replace(lower(col('TITLE')),r'[^a-z0-9\s]',''),r'\s','_'),lit(".png"))))
#error line ---filter----
bob_ross.filter(os.path.exists(col("image")))
print(bob_ross.head())
You should be using filter function, not an OS function
For example
df.filter("image is not NULL")
os.path.exists takes a string and only operates on the local filesystem, while Spark is meant to run on many servers, so that should be a sign you're not using the correct function.
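If you really need to keep only the rows whose image file exists, one option (not what the answer above suggests) is a plain Python UDF that calls os.path.exists on each executor; this only works when the path, here the /dbfs fuse mount, is visible from the executors. A minimal sketch under that assumption:
import os
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# hypothetical helper: evaluated on the executors, so the path must be reachable there
path_exists = udf(lambda p: p is not None and os.path.exists(p), BooleanType())

bob_ross = bob_ross.filter(path_exists(col("image")))
print(bob_ross.head())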

Send data from all collections to Dataframe (MongoDB, Python)

I have tried to use this code:
preDF = pd.DataFrame
for ticker in tickers:
    df_forpre = read_ticker(ticker, '2017-01-22')
    df_forpre['ticker'] = ticker
    preDF = pd.concat([preDF, df_forpre], axis=0)
But I receive:
cannot concatenate a non-NDFrame object
How can I get the data from all collections and load it into one dataframe, together with the collection name?
The mistake is not really in concat itself: the error comes from preDF = pd.DataFrame, which binds the class rather than an empty dataframe, so on the first iteration pd.concat receives a non-NDFrame object. Either initialise it as preDF = pd.DataFrame(), or, cleaner, collect the per-ticker frames in a list and concatenate once after the loop, as shown below.
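A minimal sketch of that pattern, keeping the question's read_ticker and tickers as they are:
frames = []
for ticker in tickers:
    df_forpre = read_ticker(ticker, '2017-01-22')
    df_forpre['ticker'] = ticker   # tag each row with its collection name
    frames.append(df_forpre)
preDF = pd.concat(frames, axis=0)  # one concat after the loop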

PyTables ValueError on string column with newer pandas

Problem writing pandas dataframe (timeseries) to HDF5 using pytables/tstables:
import pandas
import tables
import tstables
# example dataframe
valfloat = [512.3, 918.8]
valstr = ['abc','cba']
tstamp = [1445464064, 1445464013]
df = pandas.DataFrame(data = zip(valfloat, valstr, tstamp), columns = ['colfloat', 'colstr', 'timestamp'])
df.set_index(pandas.to_datetime(df['timestamp'].astype(int), unit='s'), inplace=True)
df.index = df.index.tz_localize('UTC')
colsel = ['colfloat', 'colstr']
dftoadd = df[colsel].sort_index()
# try string conversion from object-type (no type mixing here ?)
##dftoadd.loc[:,'colstr'] = dftoadd['colstr'].map(str)
h5fname = 'df.h5'
# class to use as tstable description
class TsExample(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)
    colfloat = tables.Float64Col(pos=1)
    colstr = tables.StringCol(itemsize=8, pos=2)
# create new time series
h5f = tables.open_file(h5fname, 'a')
ts = h5f.create_ts('/','example',TsExample)
# append to HDF5
ts.append(dftoadd, convert_strings=True)
# save data and close file
h5f.flush()
h5f.close()
Exception:
ValueError: rows parameter cannot be converted into a recarray object
compliant with table tstables.tstable.TsTable instance at ...
The error was: cannot view Object as non-Object type
While this particular error happens with TsTables, the code chunk responsible for it is identical to PyTables try-section here.
The error is happening after I upgraded pandas to 0.17.0; the same code was running error-free with 0.16.2.
NOTE: if a string column is excluded then everything works fine, so this problem must be related to string-column type representation in the dataframe.
The issue could be related to this question. Is there some conversion required for the 'colstr' column of the dataframe that I am missing?
This is not going to work with a newer pandas, as the index is timezone-aware; see here.
You can:
- convert the index to a type PyTables understands, which would require localizing
- use HDFStore to write the frame
Note that what you are doing is the reason HDFStore exists in the first place: to make reading/writing PyTables friendly for pandas objects. Doing this 'manually' is full of pitfalls.
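A minimal sketch of the HDFStore route, reusing dftoadd from the question (the file name and key are illustrative):
store = pandas.HDFStore('df_store.h5')
store.put('example', dftoadd, format='table')  # 'table' format writes a queryable PyTables table
store.close()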
