I have a .CSV file with 75 columns and almost 4000 rows, and I need to create a point shapefile from it, with each of the 75 columns brought over as a field.
There seems to be a good amount on this topic already, but everything I can find addresses .csv files with a small number of columns.
https://gis.stackexchange.com/questions/17590/why-is-an-extra-field-necessary-when-creating-point-shapefile-from-csv-files-in
https://gis.stackexchange.com/questions/35593/using-the-python-shape-library-pyshp-how-to-convert-csv-file-to-shp
This script looks close to what I need to accomplish, but it adds each field individually, and in this example there are only three columns: DATE, LAT, LON.
import arcpy, csv
arcpy.env.overwriteOutput = True

# Set variables
arcpy.env.workspace = "C:\\GIS\\StackEx\\"
outFolder = arcpy.env.workspace
pointFC = "art2.shp"
coordSys = "C:\\Program Files\\ArcGIS\\Desktop10.0\\Coordinate Systems" + \
           "\\Geographic Coordinate Systems\\World\\WGS 1984.prj"
csvFile = "C:\\GIS\\StackEx\\chicken.csv"
fieldName = "DATE1"

# Create shapefile and add field
arcpy.CreateFeatureclass_management(outFolder, pointFC, "POINT", "", "", "", coordSys)
arcpy.AddField_management(pointFC, fieldName, "TEXT", "", "", 10)

gpsTrack = open(csvFile, "r")
headerLine = gpsTrack.readline()
# I updated valueList to strip the trailing '\n' before splitting
valueList = headerLine.strip().split(",")
print valueList

latValueIndex = valueList.index("LAT")
lonValueIndex = valueList.index("LON")
dateValueIndex = valueList.index("DATE")

# Read each line in the csv file
cursor = arcpy.InsertCursor(pointFC)
for point in gpsTrack.readlines():
    segmentedPoint = point.split(",")
    # Get the lat/lon values of the current reading
    latValue = segmentedPoint[latValueIndex]
    lonValue = segmentedPoint[lonValueIndex]
    dateValue = segmentedPoint[dateValueIndex]
    vertex = arcpy.CreateObject("Point")
    vertex.X = lonValue
    vertex.Y = latValue
    feature = cursor.newRow()
    feature.shape = vertex
    feature.DATE1 = dateValue
    cursor.insertRow(feature)
del cursor
Is there a simpler way to create a shapefile using Python without adding a field by hand for each of the 75 columns in the .CSV file? Any help is greatly appreciated.
Simply select just the columns you need; you are not required to use all columns.
Use the csv module to read the file, then just pick out the 2 values from each row:
import csv
cursor = arcpy.InsertCursor(pointFC)
with open('yourcsvfile.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        point = arcpy.CreateObject("Point")
        # take the 6th and 28th columns from the row
        point.X, point.Y = float(row[5]), float(row[27])
        feature = cursor.newRow()
        feature.shape = point
        cursor.insertRow(feature)
del cursor
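If you later decide you do want a handful of attribute columns as well as the geometry, csv.DictReader lets you refer to columns by header name instead of position. A minimal sketch, assuming the columns are named LAT, LON and DATE and that a DATE1 text field was added as in the question's script:
import csv

cursor = arcpy.InsertCursor(pointFC)
with open('yourcsvfile.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)  # the header row supplies the keys
    for row in reader:
        point = arcpy.CreateObject("Point")
        point.X, point.Y = float(row['LON']), float(row['LAT'])
        feature = cursor.newRow()
        feature.shape = point
        feature.DATE1 = row['DATE']   # copy over only the fields you actually added
        cursor.insertRow(feature)
del cursor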
I have a CSV file with three columns: "Username", "Date", "Energy saved", and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022, Total Energy saved = 1001; Date: 25/2/2022, Total Energy saved = 700)
I am a beginner at Python and would typically use pandas to solve this, but it is not allowed for this project, so I am at a complete loss on where to even begin. I would appreciate any help and guidance. Thank you.
My alternative to pandas for opening csv files is the csv module in native Python. You read the file row by row and extract only the values you need. Here I filter on the first column and collect the matching values from the third column (index 2).
import csv

energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
    file = csv.reader(csvfile)
    for row in file:
        if row[0] == "merrytan":
            energy_saved.append(row[2])

energy_saved = sum(map(int, energy_saved))
Now you have a list of just the values you care about, which you can sum afterwards.
Edit - So, I just realized that I left out the date part of your request completely. Here's the update.
import csv

my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
    for row in csv.reader(file):
        if row[0] == "merrytan":
            my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need the date column of the file as well. We want to present two "columns" of output, and since pandas is prohibited, we use a dictionary with the dates as keys and the energy totals as values.
But your date column has repeated values (whether intended or not), and dictionary keys must be unique, so we use a loop: each date is added as a key with its energy as the value, and when the key is already present, we add to the existing value instead.
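For what it's worth, the same accumulation can be written with collections.defaultdict from the standard library, so missing dates start at zero automatically; a sketch of that alternative:
import csv
from collections import defaultdict

my_dict = defaultdict(int)  # missing keys start at 0
with open(r"D:\test_stack.csv", newline="") as file:
    for row in csv.reader(file):
        if row[0] == "merrytan":
            my_dict[row[1]] += int(row[2])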
I would turn your CSV file into a two-level dictionary, with username and then date as the keys:
infile = open("data.csv", "r").readlines()
savings = dict()

# Skip the first line of the CSV, since that has the column names,
# not data
for row in infile[1:]:
    username, date_col, saved = row.strip().split(",")
    saved = int(saved)
    if username in savings:
        if date_col in savings[username]:
            savings[username][date_col] = savings[username][date_col] + saved
        else:
            savings[username][date_col] = saved
    else:
        savings[username] = {date_col: saved}
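To answer the original question from that structure, a minimal usage sketch, assuming the savings dictionary built above and the username 'merrytan' from the question:
# Print the aggregated totals for one user, one date per line
for date_col, total in savings.get("merrytan", {}).items():
    print("Date: {} Total Energy saved = {}".format(date_col, total))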
I have a list of CSV filenames (which is in another CSV file called CSV_file_1). However, I want to add two more columns to CSV_file_1, whose row values will come from thousands of individual CSV files.
CSV_file_1 contents are as follows:
1.csv
2.csv
3.csv
The thousands of files in another folder contain the values I want to put in CSV_file_1. For example, 1.csv contains the following rows:
LATITUDE : ;13.63345
LONGITUDE : ;123.207083
2.csv contains the following rows:
LATITUDE : ;13.11111
LONGITUDE : ;123.22222
3.csv contains the following rows:
LATITUDE : ;13.22222
LONGITUDE : ;123.11111
and so on.
The result that I want to have for CSV_file_1 is as follows:
FILENAME: LATITUDE: LONGITUDE:
1.csv 13.63345 123.207083
2.csv 13.11111 123.22222
3.csv 13.22222 123.11111
I already managed to create my CSV_file_1, but without the LATITUDE and LONGITUDE yet (these will come from the individual files, which are delimited as shown above).
My code is like this:
import pandas as pd
import glob

print(glob.glob("D:/2021/*.csv"))

# list of all the filenames collated and put in CSV_file_1
CSV_file_1 = pd.DataFrame(glob.glob("D:/2021/*.csv"))

# creating blank columns in CSV_file_1
CSV_file_1['Latitude'] = ""
CSV_file_1['Longitude'] = ""

# here I'm trying to access each file in the given folder (the file name must
# correspond to the row in CSV_file_1), extract the latitude and longitude,
# and copy them to CSV_file_1
import csv
with open('D:/2021/*.csv', 'rt') as file:
    data = csv.reader(file)
    for row in file:
        if glob.glob("D:/2021/*.csv") = CSV_file_1['FILENAME']:
            CSV_file_1.iloc[i]['LATITUDE:'] == file.iloc[i]

CSV_file_1.to_csv('D:/2021/CSV_file_1.csv', index=False)
but I get a syntax error:
if glob.glob("D:/2021/*.csv") = CSV_file_1['FILENAME']:
^
SyntaxError: invalid syntax
I am a Python newbie, so I would like to seek help to fix my code.
If I understand your problem correctly, I think your approach is a little too complex. I implemented a script that creates the desired output.
First, the CSV file with the names of the other files is read directly into the first column of a data frame. Then the file names are used to extract the longitude and latitude from each file; for this I created a function, which you can see in the first part of the script. In the end, I add the extracted values to the data frame and store it in a file in the desired format.
import pandas as pd
import csv

# Function that takes the path of one csv file and returns its latitude and longitude
def get_lati_and_long_from_csv(csv_path):
    with open(csv_path, 'rt') as file:
        # Read the csv file content into a list of rows
        data = list(csv.reader(file, delimiter=';'))
        # Take the values from rows zero and one
        latitude = data[0][1]
        longitude = data[1][1]
        return (latitude, longitude)
def main():
    # Define the path of the first csv file
    csv_file_1_path = "CSV_file_1.csv"

    # Read the data frame from the csv file and set the correct column name
    CSV_file_1 = pd.read_csv(csv_file_1_path, header=None)
    CSV_file_1.columns = ['FILENAME:']

    # Create the list of files to read the coordinates from
    list_of_csvs = list(CSV_file_1['FILENAME:'])

    # Define empty lists to collect the coordinates
    lat_list = []
    lon_list = []

    # Iterate over all csv files and extract longitude and latitude
    for csv_path in list_of_csvs:
        lat, lon = get_lati_and_long_from_csv(csv_path)
        lat_list.append(lat)
        lon_list.append(lon)

    # Add the coordinates to the data frame
    CSV_file_1['Latitude:'] = lat_list
    CSV_file_1['Longitude:'] = lon_list

    # Save the final data frame to a csv file
    CSV_file_1.to_csv(csv_file_1_path + '.out', index=False, sep='\t')

if __name__ == "__main__":
    main()
Test input file content:
1.csv
2.csv
3.csv
Test output file content:
FILENAME: Latitude: Longitude:
1.csv 13.63345 123.207083
2.csv 13.11111 123.22222
3.csv 13.22222 123.11111
EDIT:
If your files do not contain any other data, I would suggest simplifying things and removing pandas as it is not needed. The following main() function produces the same result but uses only the CSV module.
def main():
    # Define the path of the first csv file
    csv_file_1_path = "CSV_file_1.csv"

    # Read the file into a list containing the paths of the other csv files
    with open(csv_file_1_path, 'rt') as file:
        list_of_csvs = file.read().splitlines()
    print(list_of_csvs)

    # Define empty lists to collect the coordinates
    lat_list = []
    lon_list = []

    # Iterate over all csv files and extract longitude and latitude
    for csv_path in list_of_csvs:
        lat, lon = get_lati_and_long_from_csv(csv_path)
        lat_list.append(lat)
        lon_list.append(lon)

    # Combine the three lists to create the rows of the new csv file
    data = list(zip(list_of_csvs, lat_list, lon_list))

    # Create the headers and combine them with the other rows
    rows = [['FILENAME:', 'Latitude:', 'Longitude:']]
    rows.extend(data)

    # Write everything to the final csv file
    with open(csv_file_1_path + '.out', 'w') as file:
        csv_writer = csv.writer(file, dialect='excel', delimiter='\t')
        csv_writer.writerows(rows)
Unfortunately I'm quite new to Python and don't have time at the moment to dig deeper, so I can't understand and solve the error displayed by the Python console. I am trying to use this code to extract data from multiple netCDF files for multiple locations:
# this is for reading the .nc files in the working folder
import glob
# this is required to read the netCDF4 data
from netCDF4 import Dataset
# required to read and write the csv files
import pandas as pd
# required for using the array functions
import numpy as np

# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    # reading the files
    data = Dataset(file, 'r')
    # saving the data variable time
    time = data.variables['time']
    # saving the year which is written in the file
    year = time.units[11:15]
    # once we have acquired the data for one year, the for loop combines it for all the years
    all_years.append(year)

# Creating an empty Pandas DataFrame covering the whole range of data;
# we will read the required data and put it here
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='D')
# an empty dataframe with 0.0 values will be created, indexed by date_range
df = pd.DataFrame(0.0, columns=['Precipitation'], index=date_range)

# The names, lat, lon for the locations of interest are defined in a csv file
# this will read the file locations
locations = pd.read_csv('stations_locations.csv')

# we use a for loop as we are interested in acquiring all the information row by row
for index, row in locations.iterrows():
    # one by one we extract the information from the csv into temporary variables
    location = row['names']
    location_lat = row['latitude']
    location_lon = row['longitude']

    # Sorting all_years just to be sure that the model writes the data correctly
    all_years.sort()

    # now we read the netCDF files; here I have used netCDF files from the FGOALS model
    for yr in all_years:
        # Reading-in the data
        data = Dataset('pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc' % (yr, yr), 'r')

        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]

        # as we already have the coordinates of the point which needs to be downloaded,
        # we find the closest grid point by subtracting the coordinates
        # and checking which one has the minimum distance
        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_lat)**2
        sq_diff_lon = (lon - location_lon)**2

        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()

        # Accessing the precipitation data
        temp = data.variables['pr']

        # Creating the date range for each year during each iteration
        start = str(yr) + '-01-01'
        end = str(yr) + '-12-31'
        d_range = pd.date_range(start=start, end=end, freq='D')

        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for: ' + str(location) + '_' + str(d_range[t_index]))
            df.loc[d_range[t_index], 'Precipitation'] = temp[t_index, min_index_lat, min_index_lon]

    df.to_csv(str(location) + '.csv')
This is the error displayed:
File "G:\Selection Cannon\Historical\CNRM-CM5_r1i1p1\pr\extracting data_CNRM-CM5_pr.py", line 62, in <module>
data = Dataset('pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc'%(yr,yr), 'r')
File "netCDF4\_netCDF4.pyx", line 2321, in netCDF4._netCDF4.Dataset.__init__
File "netCDF4\_netCDF4.pyx", line 1885, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'pr_day_CNRM-CM5_historical_r1i1p1_18500101-18501231.nc'
When I check the 'time.units' attribute it says "days since 1850-1-1", but I only have files from 1975-2005 in the folder. And if I check "all_years" it just displays '1850' seven times. I think this has to do with the "year = time.units[11:15]" line, but this is how the guy in the YouTube video did it.
Can someone please help me solve this, so that this code extracts the files from 1975 onwards?
Best regards,
Alex
PS: This is my first post, please tell me if you need any supplementary information or data :)
Before anything else, it seems like you didn't give the correct path. It should be something like "G:/path/to/pr_day_CNRM-CM5_historical_r1i1p1_18500101-18501231.nc".
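Beyond the path, note that time.units always reads "days since 1850-1-1" no matter which year the file covers, which is why all_years shows '1850' seven times. A minimal sketch of one workaround, assuming the year is encoded in the file names in the pattern shown by the error message (an assumption about your files):
import glob
import re

# Collect the starting year of each file from its name instead of time.units
all_years = []
for path in glob.glob('*.nc'):
    # e.g. 'pr_day_CNRM-CM5_historical_r1i1p1_19750101-19751231.nc' -> '1975'
    match = re.search(r'_(\d{4})0101-\d{4}1231\.nc$', path)
    if match:
        all_years.append(match.group(1))

all_years.sort()
print(all_years)  # should list '1975' through '2005', one per file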
I'm pretty new to Python and coding in general, so sorry in advance for any dumb questions. My program needs to split an existing log file into several .csv files (run1.csv, run2.csv, ...) based on the keyword 'MYLOG'. When the keyword appears, the program should start copying the two desired columns into a new file until the keyword appears again. When finished, there need to be as many csv files as there are keyword occurrences.
53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv
53.2589 EXP TextStim: autoDraw = None
53.2589 EXP TextStim: autoDraw = None
55.2257 DATA Keypress: t
57.2412 DATA Keypress: t
59.2406 DATA Keypress: t
61.2400 DATA Keypress: t
63.2393 DATA Keypress: t
...
89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]
89.2336 EXP Imported specs/run03_block01.csv as conditions
89.2339 EXP Created sequence: sequential, trialTypes=9
...
[EDIT]: The output per file (run*.csv) should look like this:
onset type
53.2436 EXP
53.2589 EXP
53.2589 EXP
55.2257 DATA
57.2412 DATA
59.2406 DATA
61.2400 DATA
...
The program creates as many run*.csv files as needed, but I can't store the desired columns in my new files. When finished, all I get are empty csv files. If I change the counter variable check to == 1, it creates just one big file with the desired columns.
Thanks again!
import csv

QUERY = 'MYLOG'

with open('localizer.log', 'rt') as log_input:
    i = 0
    for line in log_input:
        if QUERY in line:
            i = i + 1
            with open('run' + str(i) + '.csv', 'w') as output:
                reader = csv.reader(log_input, delimiter=' ')
                writer = csv.writer(output)
                content_column_A = [0]
                content_column_B = [1]
                for row in reader:
                    content_A = list(row[j] for j in content_column_A)
                    content_B = list(row[k] for k in content_column_B)
                    writer.writerow(content_A)
                    writer.writerow(content_B)
Looking at the code, there are a few things that are possibly wrong:
the csv reader should take a file handler, not a single line.
the reader delimiter should not be a single space character as it looks like the actual delimiter in your logs is a variable number of multiple space characters.
the looping logic seems to be a bit off, confusing files/lines/rows a bit.
You may be looking at something like the code below (pending clarification in the question):
import csv

NEW_LOG_DELIMITER = 'MYLOG'

def write_buffer(_index, buffer):
    """
    This function takes an index and a buffer.
    The buffer is just an iterable of iterables (e.g. a list of lists).
    Each buffer item is a row of values.
    """
    filename = 'run{}.csv'.format(_index)
    with open(filename, 'w') as output:
        writer = csv.writer(output)
        writer.writerow(['onset', 'type'])  # adding the heading
        writer.writerows(buffer)

current_buffer = []
_index = 1

with open('localizer.log', 'rt') as log_input:
    for line in log_input:
        # will deal ok with multi-space as long as
        # you don't care about the last column
        fields = line.split()[:2]
        if NEW_LOG_DELIMITER not in line or not current_buffer:
            # If it's the first line (the current_buffer is empty)
            # or the line does NOT contain "MYLOG" then
            # collect it until it's time to write it to file.
            current_buffer.append(fields)
        else:
            write_buffer(_index, current_buffer)
            _index += 1
            current_buffer = [fields]  # EDIT: fixed bug, new buffer should not be empty

if current_buffer:
    # We are now out of the loop;
    # if there's an unwritten buffer, write it to file.
    write_buffer(_index, current_buffer)
You can use pandas to simplify this problem.
Import pandas and read in the log file.
import pandas as pd
df = pd.read_fwf('localizer2.log', header=None)
df.columns = ['onset', 'type', 'event']
df.set_index('onset', inplace=True)
Set Flag where third column == 'MYLOG'
df['flag'] = 0
df.loc[df.event.str[:5] == 'MYLOG', 'flag'] = 1
df.flag = df['flag'].cumsum()
Save each run as a separate run*.csv file
for i in range(1, df.flag.max() + 1):
    df.loc[df.flag == i, 'event'].to_csv('run{0}.csv'.format(i))
EDIT:
Looks like your format is different than I originally assumed, so I changed the code to use pd.read_fwf. My localizer.log file was a copy and paste of your original data; I hope this works for you. I assumed from the original post that it did not have headers. If it does have headers, then remove header=None and the df.columns = ['onset', 'type', 'event'] line.
I have a data file that has 14 lines of header. In the header there is metadata for the latitude-longitude coordinates and the time. I am currently using
pandas.read_csv(filename, delimiter=",", header=14)
to read in the file, but this just gets the data and I can't seem to get the metadata. Would anyone know how to read in the information in the header? The header looks like:
CSD,20160315SSIO
NUMBER_HEADERS = 11
EXPOCODE = 33RR20160208
SECT_ID = I08
STNBBR = 1
CASTNO = 1
DATE = 20160219
TIME = 0558
LATITUDE = -66.6027
LONGITUDE = 78.3815
DEPTH = 462
INSTRUMENT_ID = 0401
CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG
DBAR,,ITS-90,,PSS-78
You have to parse your metadata header yourself, but you can do it in an elegant manner in one pass, even using it on the fly so that you can extract data from it / check the correctness of the file, etc.
First, open the file yourself:
f = open(filename)
Then, do the work of parsing each metadata line to extract data from it. For the sake of the explanation, I'm just skipping those rows here:
for i in range(13):  # skip the first 13 lines that are useless for the columns definition
    f.readline()     # use the resulting string for metadata extraction
Now you have the file pointer ready on the unique header line you want to use to load the DataFrame. The cool thing is that read_csv accepts file objects! Thus you can start loading your DataFrame right away:
pandas.read_csv(f, sep=",")
Note that I don't use the header argument, as I take from your description that only that one last header line is useful for your dataframe. You can build and adjust the header-parsing / rows-to-skip values from that example.
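As a sketch of what the metadata extraction could look like instead of plain skipping (assuming the 14-line layout shown in the question: one title line, eleven KEY = value lines, a column-name line, and a units line; the filename is hypothetical):
import pandas as pd

filename = "ctd_data.csv"  # hypothetical path

metadata = {}
with open(filename) as f:
    f.readline()                 # skip the title line ("CSD,20160315SSIO")
    for i in range(11):          # parse the eleven "KEY = value" lines
        key, _, value = f.readline().partition("=")
        metadata[key.strip()] = value.strip()
    # the file pointer now sits on the column-name line
    df = pd.read_csv(f, sep=",", skiprows=[1])  # skiprows=[1] drops the units line

print(metadata["LATITUDE"], metadata["LONGITUDE"])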
Although the following method does not use pandas, I was able to extract the header information.
import csv

with open(fname) as csvfile:
    forheader_IO2016 = csv.reader(csvfile, delimiter=',')
    header_IO2016 = []
    for row in forheader_IO2016:
        header_IO2016.append(row[0])

# indices below follow the sample header shown in the question
date = header_IO2016[6].split(" ")[2]
time = header_IO2016[7].split(" ")[2]
lat = float(header_IO2016[8].split(" ")[2])
lon = float(header_IO2016[9].split(" ")[2])