Extract multiple polygon coordinates from an .xlsx file - python

I want to extract the (multiple) polygon coordinates from a .xlsx file into a pandas DataFrame in Python.
The .xlsx file is available on google docs.
Now I do this:
import pandas as pd

gemeenten2019 = pd.read_excel('document.xlsx', index=False, skiprows=0)
gemeenten2019['KML'] = str(gemeenten2019['KML'])

for index, row in gemeenten2019.iterrows():
    removepart = str(row['KML'])
    row['KML'] = removepart.replace('<MultiGeometry><Polygon><coordinates>', '')

gemeentenamen = []
gemeentePolygon = []
for gemeentenaam in gemeenten2019['NAAM']:
    gemeentenamen.append(str(gemeentenaam))
for value in gemeenten2019['KML']:
    gemeentePolygon.append(str(value))

df_gemeenteCoordinaten = pd.DataFrame({'Gemeente': gemeentenamen, 'KML': gemeentePolygon})
df_gemeenteCoordinaten
But the result is that every row of the "KML" column contains the same value: the coordinates of all rows combined.
I only want the coordinates that belong to that specific row, not the coordinates of all rows.
The dataframe must look like this (one row per gemeente, with only that row's coordinates in the KML column):
Does anyone know how to extract the multiple coordinates for each row?

This would give you each pair of values on its own line:
import pandas as pd
gemeenten2019 = pd.read_excel('Gemeenten 2019.xlsx', index=False, skiprows=0)
gemeenten2019['KML'] = gemeenten2019['KML'].str.strip('<>/abcdefghijklmnopqrstuvwxyzGMP').str.replace(' ', '\n')
For example:
NAAM KML
0 Aa en Hunze 6.81394482119469,53.070971596018\n6.8612875225...
1 Aalsmeer 4.79469736599488,52.2606817589009\n4.795085405...
2 Aalten 6.63891586106867,51.9625470164657\n6.639463741...
3 Achtkarspelen 6.23217311778447,53.2567474241222\n6.235100748...
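The reason every row ends up identical in the original code is that str(gemeenten2019['KML']) converts the whole Series into one string, which is then broadcast back into every row. A minimal alternative sketch that keeps the work per row and pulls only the numbers out of the KML markup (assuming the column really contains <MultiGeometry><Polygon><coordinates>... fragments, as the replace() call in the question suggests):
import re
import pandas as pd

gemeenten2019 = pd.read_excel('Gemeenten 2019.xlsx')

def extract_coordinates(kml):
    # Grab everything between the <coordinates> tags of each polygon
    parts = re.findall(r'<coordinates>(.*?)</coordinates>', str(kml), flags=re.DOTALL)
    # Join multiple polygons and put each lon,lat pair on its own line
    return '\n'.join(' '.join(parts).split())

df_gemeenteCoordinaten = pd.DataFrame({
    'Gemeente': gemeenten2019['NAAM'].astype(str),
    'KML': gemeenten2019['KML'].apply(extract_coordinates),
})
Because apply() runs once per row, each gemeente keeps only its own coordinate pairs.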

Related

How to use python to separate a one column CSV file if the columns have no headings, then save this into a new excel file?

So, I am quite new to Python and have been googling a lot but have not found a good solution. What I am looking to do is automate Excel's "text to columns" in a document without headers.
Here is the Excel sheet I have: it is a CSV file where all the data is in one column without headers, e.g.
hi ho loe time jobs barber
jim joan hello
009 00487 08234 0240 2.0348 20.34829
The delimiters are spaces and commas.
What I want to come out is saved in another Excel file, with the first two rows deleted and the data separated into columns (this can be done using "text to columns" in Excel, but I would like to automate this for several sheets):
009 | 00487 | 08234 | 0240 | 2.0348 | 20.34829
The code I have written so far is like this:
import os
import csv
import pandas as pd

path = 'C:/Users/ionan/OneDrive - Universiteit Utrecht/Desktop/UCU/test_excel'
os.chdir(path)
for root, dirs, files in os.walk(path):
    for f in files:
        df = pd.read_csv(f, delimiter='\t' + ';', engine='python')
Original file, named data.xlsx:
This means all the data we need is under the column Data.
Code to split data into multiple columns for a single file:
import pandas as pd
import numpy as np

f = 'data.xlsx'

# -- Insert the following code in your `for f in files` loop --
file_data = pd.read_excel(f)

# Since the number of values to be split is not known, set `num_cols` to the
# number of columns you expect in the modified excel file
num_cols = 20

# Create a dataframe with twenty columns
new_file = pd.DataFrame(columns=["col_{}".format(i) for i in range(num_cols)])

# Change the column name of the first column in new_file to "Data"
new_file = new_file.rename(columns={"col_0": file_data.columns[0]})

# Add the value of the first cell in the original file to the first cell of the
# new excel file
new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]

# Loop through all rows of the original excel file
for index, row in file_data.iterrows():
    # Skip the first row
    if index == 0:
        continue
    # Split the row by `space`. This gives us a list of strings.
    split_data = file_data.loc[index, "Data"].split(" ")
    print(split_data)
    # Convert each element to a float (a number) if we want numbers and not strings
    # split_data = [float(i) for i in split_data]
    # Make sure the size of the list matches the number of columns in `new_file`.
    # np.NaN represents no value.
    split_data = [np.NaN] + split_data + [np.NaN] * (num_cols - len(split_data) - 1)
    # Store the list at a given index using the `.loc` method
    new_file.loc[index] = split_data

# Drop all the columns where there is not a single value
new_file.dropna(axis=1, how='all', inplace=True)

# Get the original excel file name
new_file_name = f.split(".")[0]

# Save the new excel file at the same location as the original file.
new_file.to_excel(new_file_name + "_modified.xlsx", index=False)
This creates a new Excel file (with a single sheet) named data_modified.xlsx:
Summary (code without comments):
import pandas as pd
import numpy as np

f = 'data.xlsx'
file_data = pd.read_excel(f)
num_cols = 20
new_file = pd.DataFrame(columns=["col_{}".format(i) for i in range(num_cols)])
new_file = new_file.rename(columns={"col_0": file_data.columns[0]})
new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]
for index, row in file_data.iterrows():
    if index == 0:
        continue
    split_data = file_data.loc[index, "Data"].split(" ")
    split_data = [np.NaN] + split_data + [np.NaN] * (num_cols - len(split_data) - 1)
    new_file.loc[index] = split_data
new_file.dropna(axis=1, how='all', inplace=True)
new_file_name = f.split(".")[0]
new_file.to_excel(new_file_name + "_modified.xlsx", index=False)
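For a shorter route, pandas can do the whole split in one vectorized call. A minimal sketch, assuming the single column is named Data and the file is data.xlsx as in the answer above (values split on whitespace; add comma handling if your files need it):
import pandas as pd

f = 'data.xlsx'
file_data = pd.read_excel(f)

# Skip the first data row, then split each remaining row on whitespace into columns
split = file_data['Data'].iloc[1:].str.split(expand=True)

split.to_excel(f.split('.')[0] + '_modified.xlsx', index=False, header=False)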

xlsxwriter having problems when inserting 8-bit values with ' symbol

My aim is to compare a table from a .htm file with a table from a .xlsx file, which I do by converting both to DataFrames in Python. That part works correctly and I can display the correct value from the .xlsx file, but when I copy the value into a new .xlsx file (where I write the names and values as columns) I get an error. print(data[y].values[z,1]) shows the correct value, but worksheet.write_string(row, col, data[y].values[z,1]) raises an error. I also tried converting the value to a string first with value = str(data[y].values[z,1]), printing it, and then writing it with worksheet.write_string(row, col, value), but then everything in the output file is nan for the value. The name is written out correctly, but the value is not. Is it because my value is an 8-bit value like 8'h0 that contains the ' symbol, so the library cannot handle it? If so, how can I solve this problem?
This is the output file:
This is what I get with print(data[y].values[z,1]):
This is my source code:
import pandas as pd
import numpy as np

htm = pd.read_html('HAS.htm')[5]
xlsx = pd.ExcelFile('MTL_SOCSouth_PCH_and_IOE_Security_Parameters.xlsm')

import xlsxwriter
workbook = xlsxwriter.Workbook('Output01.xlsx')
worksheet = workbook.add_worksheet()

sheets = xlsx.sheet_names
# remove unwanted sheets
sheets.pop(0)
sheets.pop(0)
sheets.pop(0)
sheets.pop(-1)
sheets.pop(-1)
sheets.pop(-1)
sheets.pop(-1)

# create a list to store the data for each sheet
data = [0] * len(sheets)
# read each sheet into the list
for x in range(len(sheets)):
    data[x] = xlsx.parse(sheets[x], header=4, usecols='B,AM')
    data[x] = pd.DataFrame(data[x])

# initialize to the first row
row = 0
# loop from the first row of the htm file to the last row
for x in range(len(htm.index)):
    chapter = htm.values[x, 3]
    chapter = chapter[:chapter.find(": ")]
    chapter = chapter.split("Chapter ", maxsplit=1)[-1]
    # proceed only if the chapter is 37, ignore otherwise
    if chapter == '37':
        col = 0
        source = htm.values[x, 0]
        source = source[:source.find("[")]
        print(source)
        for y in range(len(sheets)):
            for z in range(len(data[y].index)):
                target = data[y].values[z, 0]
                targetname = str(target)
                worksheet.write(row, col, targetname)
                if source == target:
                    col += 1
                    print(sheets[y])
                    worksheet.write(row, col, sheets[y])
                    col += 1
                    print(data[y].values[z, 1])
                    worksheet.write_string(row, col, data[y].values[z, 1])
                    row += 1

workbook.close()
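The ' character is unlikely to be the issue: write_string() only accepts strings, and empty or missing cells read by pandas come back as float NaN, which str() then turns into the literal text "nan" you see in the output. A minimal sketch of a guard around the write call (the names data, y, z, row, col and worksheet are taken from the code above):
import pandas as pd

value = data[y].values[z, 1]
if pd.isna(value):
    # write an empty cell instead of passing NaN to write_string
    worksheet.write_blank(row, col, None)
else:
    # str() also covers numbers; strings like "8'h0" are written unchanged
    worksheet.write_string(row, col, str(value))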

How to extract a particular set of values from excel file using a numerical range in python?

What I intend to do :
I have an excel file with Voltage and Current data which I would like to extract from a specific sheet say 'IV_RAW'. The values are only from 4th row and are in columns D and E.
Let's say the values look like this:
V(voltage) | I(Current)
47         | 1
46         | 2
45         | 3
0          | 4
-0.1       | 5
-10        | 5
Now, I just want to take out only the values starting at a voltage (V) of 45, and negative voltages should not be included. The corresponding current (I) values also need to be taken out. This has to be done for multiple Excel files, so starting from a particular row number is not an option; the voltage values should be the criterion.
What I know:
I know only how to take out the entire set of values using openpyxl:
from openpyxl import load_workbook

loc = ("path")
wb = load_workbook("Data")  # the file name
ws = wb["IV_raw"]  # the active worksheet
# to extract the voltage and current data:
for row in ws.iter_rows(min_row=1, max_col=3, max_row=2, values_only=True):
    print(row)
I am a novice coder and new to Python, so it would be really helpful if you could help. If there is a simplified version with pandas, that would be great.
Thank you in advance
The following uses pandas, which you should definitely take a look at. With sheet_name you set the sheet name, header is the row index of the header (starting at 0, so row 4 -> 3), and usecols defines the columns using A1 notation.
The last line filters the DataFrame. If I understand correctly, you want voltages between 0 and 45; that is what the example does, and df is your resulting DataFrame.
import pandas as pd

file_loc = "path.xlsx"
df = pd.read_excel(file_loc,
                   sheet_name='IV_raw',
                   header=3,
                   usecols="D:E")
df = df[(df['V(voltage)'] > 0) & (df['V(voltage)'] < 45)]
Building on your example, you can use the following to get what you need:
from openpyxl import load_workbook

wb = load_workbook(filepath, data_only=True)  # load the file using its full path
ws = wb["Sheet1"]  # the active worksheet
# extract the voltage and current data:
data = ws.iter_rows(min_col=4, max_col=5, min_row=2, max_row=ws.max_row, values_only=True)
output = [row for row in data if row[0] > 45]
You can try this:
import openpyxl
tWorkbook = openpyxl.load_workbook("YOUR_FILEPATH")
tDataBase = tWorkbook.active
voltageVal= "D4"
currentVal= "E4"
V = tDataBase[voltageVal].value
I = tDataBase[currentVal].value
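Since the same extraction has to run over several workbooks, a small loop with glob can apply the pandas filter from the first answer to every file in a folder. A minimal sketch, assuming the files sit in a folder named data/, each has an 'IV_raw' sheet with the same layout, and you want 0 <= V <= 45 as described in the question:
import glob
import pandas as pd

results = {}  # file path -> filtered DataFrame
for path in glob.glob('data/*.xlsx'):
    df = pd.read_excel(path, sheet_name='IV_raw', header=3, usecols='D:E')
    # keep rows starting at V = 45 and exclude negative voltages
    results[path] = df[(df['V(voltage)'] <= 45) & (df['V(voltage)'] >= 0)]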

How to Perform Mathematical Operation on One Value of a CSV file?

I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in that column.
Here is my attempt:
import csv

f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i)  # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the docs for the preferred approach to reading through a csv file:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly to this:
import csv

with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        # perform operation per row
        pass
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row being outputted is a dictionary.
So if you were going through each row, you can simply do something like this:
for row in rows:
    row['Colum1']  # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv

s = 0
with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
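To get the specific operation from the question (the first value of Colum1 subtracted from the sum of the column), you can keep the column's values in a list while reading. A minimal Python 3 sketch, assuming the file is named data.csv as above:
import csv

with open('data.csv', newline='') as f:
    values = [float(row['Colum1']) for row in csv.DictReader(f)]

# first value of Colum1 subtracted from the sum of the whole column
result = sum(values) - values[0]
print(result)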
You can do pretty much all of this with pandas
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
import os
Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1','Colum2','Colum3'])
df = df[1:] #Remove the headers since they're unnecessary
print df
df.xs(1)['Colum1']=int(df.loc[1,'Colum1'])+5
print df
You can write back to your csv using df.to_csv('File path', index=False, header=True). Having header=True will add the headers back in.
To do this more along the lines of what you already have, you can do it like this:
import csv

Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n', '').replace(' ', '').split(','))
data = data[1:]
print data
data[1][1] = 5
print data
It will read in each row, cut out the column names, and then you can modify the values by index.
So here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum() # here we replace the column by subtracting sum of value in the column
print df
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)
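If the goal is a single number rather than a transformed column (the first value of Colum1 subtracted from the column's sum, as in the question), .iloc picks out that one cell. A small sketch, assuming the same sample.csv:
import pandas as pd

df = pd.read_csv('sample.csv')
result = df['Colum1'].sum() - df['Colum1'].iloc[0]  # sum minus first value
print(result)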

Read a specific column of a certain cell range and store the values using Pandas

I am trying to figure out a way to read data from a specific column within a certain cell range and store the values in an array using pandas.
For example my Excel sheet consists of :
test        | p
Food        | Price
Chicken     | 8.54
Beef        | 6.73
Vegetables  | 3.2
Total Price | 18.47
Note: there is an empty space in the first row for a reason.
Note: | indicates cell separation.
I am trying to get the price values in cells B3 to B5 and store them in an array as [8.54, 6.73, 3.2].
So far the code I have is:
import pandas as pd
xl_workbook = pd.ExcelFile("readme.xlsx") # Load the excel workbook
df = xl_workbook.parse("Sheet1") # Parse the sheet into a dataframe
x1_list = df['p'].tolist() # Cast the desired column into a python list
print(x1_list)
Which then results in [nan, u'price', 8.54, 6.73, 3.2].
If I just wanted to read the values 8.54, 6.73, and 3.2, resulting in [8.54, 6.73, 3.2], how would I do this?
Is there a way to grab a certain column of a certain cell range?
As written, you could use read_excel in Pandas. This assumes you have consistent formatting.
import pandas as pd

# define the file name and sheet name
fn = 'Book1.xlsx'
sn = 'Sheet1'
data = pd.read_excel(fn, sheet_name=sn, index_col=0, skiprows=1, header=0, skipfooter=1)
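To get the plain list from the question out of that DataFrame, convert the column after reading. A small sketch building on the read above; the column name 'Price' is taken from the example sheet, where the second row becomes the header after skiprows=1 and the Total Price row is dropped by skipfooter=1:
import pandas as pd

data = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', index_col=0,
                     skiprows=1, header=0, skipfooter=1)
x1_list = data['Price'].tolist()  # e.g. [8.54, 6.73, 3.2]
print(x1_list)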
