Extracting data from netCDF file using python

Unfortunately I'm quite new to Python and don't have time at the moment to dig deeper, so I can't understand and solve the error displayed by the Python console. I am trying to use this code to extract data from multiple netCDF files for multiple locations:
# this is for reading the .nc files in the working folder
import glob
# this is required to read the netCDF4 data
from netCDF4 import Dataset
# required to read and write the csv files
import pandas as pd
# required for using the array functions
import numpy as np

# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    # reading the files
    data = Dataset(file, 'r')
    # saving the data variable time
    time = data.variables['time']
    # saving the year which is written in the file
    year = time.units[11:15]
    # as we are using a for loop, the years of all files are collected in the list
    all_years.append(year)
# Creating an empty Pandas DataFrame covering the whole date range; the required data will be read into it later
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='D')
# an empty dataframe filled with 0.0 values will be created, indexed by date_range, with one column 'Precipitation'
df = pd.DataFrame(0.0, columns=['Precipitation'], index=date_range)
# Defining the names, lat, lon for the locations of interest in a csv file
# this will read the station locations
locations = pd.read_csv('stations_locations.csv')
# we use a for loop to acquire the information row by row
for index, row in locations.iterrows():
    # one by one we extract the information from the csv and put it into temporary variables
    location = row['names']
    location_lat = row['latitude']
    location_lon = row['longitude']
    # Sorting all_years just to be sure the data is written in the correct order
    all_years.sort()
    # now we read the netCDF files (the file names here follow the CNRM-CM5 model output)
    for yr in all_years:
        # Reading-in the data
        data = Dataset('pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc' % (yr, yr), 'r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        # as we already have the coordinates of the point of interest,
        # we subtract them from the grid coordinates and take the
        # grid point with the minimum (squared) distance
        sq_diff_lat = (lat - location_lat)**2
        sq_diff_lon = (lon - location_lon)**2
        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the precipitation data
        temp = data.variables['pr']
        # Creating the date range for each year during each iteration
        start = str(yr) + '-01-01'
        end = str(yr) + '-12-31'
        d_range = pd.date_range(start=start, end=end, freq='D')
        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for: ' + str(location) + '_' + str(d_range[t_index]))
            df.loc[d_range[t_index]]['Precipitation'] = temp[t_index, min_index_lat, min_index_lon]
    df.to_csv(str(location) + '.csv')
This is the error displayed:
File "G:\Selection Cannon\Historical\CNRM-CM5_r1i1p1\pr\extracting data_CNRM-CM5_pr.py", line 62, in <module>
data = Dataset('pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc'%(yr,yr), 'r')
File "netCDF4\_netCDF4.pyx", line 2321, in netCDF4._netCDF4.Dataset.__init__
File "netCDF4\_netCDF4.pyx", line 1885, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'pr_day_CNRM-CM5_historical_r1i1p1_18500101-18501231.nc'
When I check the variable/function 'time.units' it says "days since 1850-1-1", but I only have files from 1975-2005 in the folder. And if I check "all_years" it just displays '1850' seven times. I think this has to do with the "year = time.units[11:15]" line, but this is how the guy in the YouTube video did it.
Can someone please help me solve this, so that the code extracts the files from 1975 onwards?
Best regards,
Alex
PS: This is my first post; please tell me if you need any supplementary information or data :)

Before anything else, it seems like you didn't give the correct path. It should be something like "G:/path/to/pr_day_CNRM-CM5_historical_r1i1p1_18500101-18501231.nc".
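For illustration, a minimal sketch of building an absolute path before opening the file (the folder name below is taken from your traceback and may need adjusting):

import os
from netCDF4 import Dataset

# folder holding the model output, taken from the traceback; adjust as needed
base_dir = r'G:\Selection Cannon\Historical\CNRM-CM5_r1i1p1\pr'

yr = '1975'  # example year; in your script this comes from the all_years loop
fname = 'pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc' % (yr, yr)
data = Dataset(os.path.join(base_dir, fname), 'r')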

Related

Extracting a row value from one file, and putting that value to another row in another file (with filename corresponds to row of the previous file)

I have a list of CSV filenames (stored in another CSV file called CSV_file_1). I want to add two additional columns to CSV_file_1, whose row values will come from the thousands of individual CSV files.
CSV_file_1 contents are as follows:
1.csv
2.csv
3.csv
The thousands of files in another folder contain the values that I want to put in CSV_file_1. For example, 1.csv contains the following rows:
LATITUDE : ;13.63345
LONGITUDE : ;123.207083
2.csv contains the following rows:
LATITUDE : ;13.11111
LONGITUDE : ;123.22222
3.csv contains the following rows:
LATITUDE : ;13.22222
LONGITUDE : ;123.11111
and so on.
The result that I want to have for CSV_file_1 is as follows:
FILENAME: LATITUDE: LONGITUDE:
1.csv 13.63345 123.207083
2.csv 13.11111 123.22222
3.csv 13.22222 123.11111
I already managed to create my CSV_file_1, but without the LATITUDE and LONGITUDE yet (these will come from the individual files delimited as shown above).
My code is like this:
import pandas as pd
import glob

print(glob.glob("D:/2021/*.csv"))
#list of all the filenames collated and put in CSV_file_1
CSV_file_1 = pd.DataFrame(glob.glob("D:/2021/*.csv"))
#creating blank columns in CSV_file_1
CSV_file_1['Latitude'] = ""
CSV_file_1['Longitude'] = ""
#here I'm trying to access each file in the given folder (file name must correspond to the row in CSV_file_1),
#extract the data (latitude and longitude) and copy it to CSV_file_1
import csv
with open('D:/2021/*.csv', 'rt') as file:
    data = csv.reader(file)
    for row in file:
        if glob.glob("D:/2021/*.csv") = CSV_file_1['FILENAME']:
            CSV_file_1.iloc[i]['LATITUDE:'] == file.iloc[i]
CSV_file_1.to_csv('D:/2021/CSV_file_1.csv', index = False)
but I get invalid syntax.
if glob.glob("D:/2021/*.csv") = CSV_file_1['FILENAME']:
^
SyntaxError: invalid syntax
I am a python newbie so I would like to seek help to fix my code.
If I understand your problem correctly, I think your approach is a little too complex. I implemented a script that creates the desired output.
First, the CSV file with the names of the other files is read directly into the first column of the data frame. Then the file names are used to extract the longitude and latitude from each file. For this I created a function, which you can see in the first part of the script. In the end, I add the extracted values to the data frame and store it in a file in the desired format.
import pandas as pd
import csv

# Function that takes the path of a csv file and returns its (latitude, longitude) pair
def get_lati_and_long_from_csv(csv_path):
    with open(csv_path, 'rt') as file:
        # Read csv file content to list of rows
        data = list(csv.reader(file, delimiter=';'))
        # Take the values from rows zero and one
        latitude = data[0][1]
        longitude = data[1][1]
        return (latitude, longitude)

def main():
    # Define path of first csv file
    csv_file_1_path = "CSV_file_1.csv"
    # Read data frame from csv file and create correct column name
    CSV_file_1 = pd.read_csv(csv_file_1_path, header=None)
    CSV_file_1.columns = ['FILENAME:']
    # Create list of files to read the coordinates from
    list_of_csvs = list(CSV_file_1['FILENAME:'])
    # Define empty lists to collect the coordinates
    lat_list = []
    lon_list = []
    # Iterate over all csv files and extract longitude and latitude
    for csv_path in list_of_csvs:
        lat, lon = get_lati_and_long_from_csv(csv_path)
        lat_list.append(lat)
        lon_list.append(lon)
    # Add coordinates to the data frame
    CSV_file_1['Latitude:'] = lat_list
    CSV_file_1['Longitude:'] = lon_list
    # Save final data frame to csv file
    CSV_file_1.to_csv(csv_file_1_path + '.out', index=False, sep='\t')

if __name__ == "__main__":
    main()
Test input file content:
1.csv
2.csv
3.csv
Test output file content:
FILENAME: Latitude: Longitude:
1.csv 13.63345 123.207083
2.csv 13.11111 123.22222
3.csv 13.22222 123.11111
EDIT:
If your files do not contain any other data, I would suggest simplifying things and removing pandas as it is not needed. The following main() function produces the same result but uses only the CSV module.
def main():
    # Define path of first csv file
    csv_file_1_path = "CSV_file_1.csv"
    # Read file to list containing the paths of the other csv files
    with open(csv_file_1_path, 'rt') as file:
        list_of_csvs = file.read().splitlines()
    print(list_of_csvs)
    # Define empty lists to collect the coordinates
    lat_list = []
    lon_list = []
    # Iterate over all csv files and extract longitude and latitude
    for csv_path in list_of_csvs:
        lat, lon = get_lati_and_long_from_csv(csv_path)
        lat_list.append(lat)
        lon_list.append(lon)
    # Combine the three lists to create the rows of the new csv file
    data = list(zip(list_of_csvs, lat_list, lon_list))
    # Create the headers and combine them with the other rows
    rows = [['FILENAME:', 'Latitude:', 'Longitude:']]
    rows.extend(data)
    # Write everything to the final csv file
    with open(csv_file_1_path + '.out', 'w') as file:
        csv_writer = csv.writer(file, dialect='excel', delimiter='\t')
        csv_writer.writerows(rows)

How to find the top 10 rows of AWND from a .CSV file and store the result in a new .CSV file using Python?

From the 2-year data, find the top 10 readings/rows of AWND. Store the result in a .csv file named top10AWND.csv. The new file will have all columns from filteredData.csv, but only the top 10 AWND rows.
Small portion of the filteredData.csv: [screenshot omitted]
I am using Python 3.8 and Pandas.
I need to find the top 10 readings of AWND from my filteredData.csv file. Then, I need to store the results in a new file. The new file needs to have the columns STATION, NAME, DATE, Month, AWND, and SNOW of the top 10 readings.
I am not sure how to go about doing this. This is what I have so far, and it does not work; it gives me errors. One error I run into is a TypeError: "list indices must be integers or slices, not list" for the filtered_weather line in the code.
import numpy as np
import pandas as pd
import re

for filename in ['filteredData.csv']:
    file = pd.read_csv(filename)
    all_loc = dict(file['AWND'].value_counts()).keys()
    most_loc = list(all_loc)[:10]
    filtered_weather = ['filteredData.csv'][['STATION','NAME','DATE','Month','AWND','SNOW']] #Select the column names that you want
    filtered_weather.to_csv('top10AWND.csv', index=False)
You can do something like this:
import pandas as pd

# This is not necessary unless you want to read several files
for filename in ['filteredData.csv']:
    file = pd.read_csv(filename)
    file = file.sort_values('AWND', ascending=False).head(10)

# If it's only one file you can just do:
#
# file = pd.read_csv('filteredData.csv')
# file = file.sort_values('AWND', ascending=False).head(10)

# Considering you want to keep all the columns, you can just write the dataframe to the file
file.to_csv('top10AWND.csv', index=False)
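As a side note, pandas can also do the selection in one step with nlargest; a minimal equivalent, assuming the same file and column names:

import pandas as pd

# pick the 10 rows with the largest AWND values and write them out directly
file = pd.read_csv('filteredData.csv')
file.nlargest(10, 'AWND').to_csv('top10AWND.csv', index=False)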

Extracting rasters from multiple NetCDF files based on date values in Python

I have multiple NetCDF files (one for each year) that contain daily rainfall values for Australia.
At present I am able to extract the specific days I want by reading from a .csv file that contains the list of dates I want. From this it then outputs each day as a raster file.
However, the script I have at the moment only allows me to do this one year at a time. I'm fairly new to python, and rather than re-running the script many times by changing the NetCDF file it reads in (as well as the list of dates in the .csv file), I was hoping to get some assistance in creating a loop that will read through the list of NetCDFs.
I understand that modules such as NetCDF4 are available to treat all files as one, but despite many hours reading what others have done, I am none the wiser.
Here is what I have so far:
import os, sys
import arcpy

# Check out any necessary licenses
arcpy.CheckOutExtension("spatial")
arcpy.env.overwriteOutput = True

# Script arguments
netCDF = "G:\\Gridded_rain\\DAILY\\netcdf\\Daily_analysis_V3"
rainfall = "G:\\output_test\\r_"
arcpy.env.workspace = netCDF

# Read Date from csv file
eveDate = open("G:\\selectdate_TEST1.csv", "r")
headerLine = eveDate.readline()
valueList = headerLine.split(",")
dateValueIndex = valueList.index("Date")
eventList = []
for line in eveDate.readlines():
    segmenLine = line.split(",")
    variable = "pre"
    x_dimension = "lon"
    y_dimension = "lat"
    band_dimension = ""
    #dimensionValues = "r_time 1900025"
    valueSelectionMethod = "BY_VALUE"
    outFile = "Pre"
    # extract dimensionValues from csv file
    arcpy.MakeNetCDFRasterLayer_md("pre.2011.nc", variable, x_dimension, y_dimension, outFile, band_dimension, segmenLine[dateValueIndex], valueSelectionMethod)
    print "layer done"
    # copy and save as raster tif file
    arcpy.CopyRaster_management(outFile, rainfall + segmenLine[dateValueIndex] + ".tif", "", "", "", "NONE", "NONE", "")
    print "raster done"
The NetCDF files are named from pre.1900.nc through to pre.2011.nc
Any help would be greatly appreciated!
If the question is really about python command line arguments, you could add something like:
import sys
year = int(sys.argv[1])
nc_name = 'pre.%d.nc' % (year,)
and then use this nc_name as the filepath argument in your arcpy.MakeNetCDFRasterLayer_md call.
The other possibility would be, as suggested in a comment on the question, to hard-code another loop like so:
for year in range(1900, 2012):
nc_name = 'pre.%d.nc' % (year,)
and then call arcpy.MakeNetCDFRasterLayer_md etc.
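Put together with the date-reading part of your script, that could look roughly like this (a sketch only: it assumes every file from pre.1900.nc to pre.2011.nc exists in the workspace, and it reuses the rainfall variable and csv layout from your script):

for year in range(1900, 2012):
    nc_name = 'pre.%d.nc' % (year,)
    # reopen the csv for each year so every listed date is checked against each file
    eveDate = open("G:\\selectdate_TEST1.csv", "r")
    headerLine = eveDate.readline()
    dateValueIndex = headerLine.split(",").index("Date")
    for line in eveDate.readlines():
        segmenLine = line.split(",")
        arcpy.MakeNetCDFRasterLayer_md(nc_name, "pre", "lon", "lat", "Pre",
                                       "", segmenLine[dateValueIndex], "BY_VALUE")
        arcpy.CopyRaster_management("Pre", rainfall + segmenLine[dateValueIndex] + ".tif",
                                    "", "", "", "NONE", "NONE", "")
    eveDate.close()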

Cut selected data from daily precipitation CSV files

I have a csv file containing daily precipitation data (253 rows and 191 columns per day), so for one year I have 191 * 365 columns.
I want to extract the data for a certain row and column of my area of interest, for example row 20 and column 40 for the first day; for days 2, 3, 4 ... 365 the columns are spaced the same distance apart.
I'm new to python. Is there any way to extract the data for a certain row and column over one year and store it in a new csv?
Thanks
To get a value from a certain row and column you can try something like this:
from itertools import islice

def get_value(f, row, col):
    # skip to the requested line and take it
    line = next(islice(f, row - 1, row))
    values = line.split(',')
    return values[col - 1]

with open('data.csv', 'r') as f:
    print(get_value(f, 10, 4))
Apart from extracting the data, the first thing you need to do is rearrange your data.
As it is now, 191 columns are added every day. To do that, the whole file needs to be parsed (probably in memory, with the data growing every day), data gets added to the end of each row, and everything has to be fully written to disk again.
Usually, to add data to a csv, rows are added at the end of the file instead; there is no need to parse and rewrite the whole file each time.
On top of that, most software that reads csv files starts having problems when the number of columns gets high.
So it would be a lot better to add the daily data as rows at the end of the csv file, as sketched below.
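A minimal sketch of that append pattern (file name and row length are assumptions):

import csv

day_values = [0.0] * 191  # one day's values for one grid row (hypothetical data)

# append mode adds a row at the end without parsing or rewriting the file
with open('precipitation.csv', 'a', newline='') as f:
    csv.writer(f).writerow(day_values)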
While we're at it: assuming the 253 x 191 is some sort of grid, or at least every cell has the same data type, this would be a great candidate for binary storage (numpy can handle that in Python; see the sketch below).
All data could be stored in its binary form, resulting in fixed-length fields/cells. To access a field, its position can simply be calculated, so there is no need to parse and convert all the data each time. Retrieving data would be almost instant.
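For illustration, a minimal numpy sketch of that idea (grid shape and file name are assumptions):

import numpy as np

rows, cols = 253, 191   # assumed grid shape
n_days = 365

# one binary file holding the whole year as fixed-size float32 cells
grid = np.memmap('precip.dat', dtype='float32', mode='w+', shape=(n_days, rows, cols))

# positions are computed, not parsed: writing day 0 and reading back
# row 20, column 40 are direct index operations
grid[0] = 0.0
value = grid[0, 20, 40]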
I already managed to do the cutting with this script, after reading some examples and experimenting:
import netCDF4 as nc
from os import listdir
from os.path import isfile, join
from scipy.io import netcdf

pixel = [[1,36,77],[2,37,77],[3,35,78],[4,36,78],[5,37,78],[6,38,78],[7,39,78],[8,40,78],[9,35,79],[10,36,79],[11,37,79],[12,38,79],[13,39,79],[14,40,79],[15,35,80],[16,36,80],[17,37,80],[18,38,80],[19,35,81],[20,36,81],[21,37,81],[22,36,82]]
print pixel
onlyfiles = [f for f in listdir('D:\RCP45\Hujan_Harian') if isfile(join('D:\RCP45\Hujan_Harian', f))]
print onlyfiles
folder = 'D:\RCP45\Hujan_Harian\\'
fout = open("D:\My Documents\precipitation.txt", "w")
for j in range(0, len(onlyfiles)):
    filename = onlyfiles[j]
    print filename
    tahun = filename[0:4]
    print tahun
    f1 = netcdf.netcdf_file(folder + filename, 'r')
    print f1.variables
    jlh_hari = int(len(f1.variables['time_bnds'][:]))
    print jlh_hari
    output = []
    for h in range(0, jlh_hari):
        for i in range(0, 22):
            x = pixel[i][1]
            y = pixel[i][2]
            pr = f1.variables['pr'][h, x, y]
            fout.write(str(pixel[i][0]) + ', , ' + str(tahun) + ', ' + str(pr) + '\n')
    fout.write('\n')
fout.close()  # flush the output file to disk
print output

Convert .CSV file to point feature class using python 2.7

I have a .CSV file with 75 columns and almost 4000 rows. I need to create a point shapefile for the entire .CSV file, with all 75 columns brought over to the new shapefile, each column represented as a field.
There seems to be a good amount on this topic already, but everything I can find addresses .csv files with a small number of columns.
https://gis.stackexchange.com/questions/17590/why-is-an-extra-field-necessary-when-creating-point-shapefile-from-csv-files-in
https://gis.stackexchange.com/questions/35593/using-the-python-shape-library-pyshp-how-to-convert-csv-file-to-shp
This script looks close to what I need to accomplish, but again it adds a field for every column in the .CSV; in this example there are three fields: DATE, LAT, LON.
import arcpy, csv
arcpy.env.overwriteOutput = True

# Set variables
arcpy.env.workspace = "C:\\GIS\\StackEx\\"
outFolder = arcpy.env.workspace
pointFC = "art2.shp"
coordSys = "C:\\Program Files\\ArcGIS\\Desktop10.0\\Coordinate Systems" + \
           "\\Geographic Coordinate Systems\\World\\WGS 1984.prj"
csvFile = "C:\\GIS\\StackEx\\chicken.csv"
fieldName = "DATE1"

# Create shapefile and add field
arcpy.CreateFeatureclass_management(outFolder, pointFC, "POINT", "", "", "", coordSys)
arcpy.AddField_management(pointFC, fieldName, "TEXT", "", "", 10)

gpsTrack = open(csvFile, "r")
headerLine = gpsTrack.readline()
#print headerLine
# I updated valueList to remove the '\n'
valueList = headerLine.strip().split(",")
print valueList
latValueIndex = valueList.index("LAT")
lonValueIndex = valueList.index("LON")
dateValueIndex = valueList.index("DATE")

# Read each line in csv file
cursor = arcpy.InsertCursor(pointFC)
for point in gpsTrack.readlines():
    segmentedPoint = point.split(",")
    # Get the lat/lon values of the current reading
    latValue = segmentedPoint[latValueIndex]
    lonValue = segmentedPoint[lonValueIndex]
    dateValue = segmentedPoint[dateValueIndex]
    vertex = arcpy.CreateObject("Point")
    vertex.X = lonValue
    vertex.Y = latValue
    feature = cursor.newRow()
    feature.shape = vertex
    feature.DATE1 = dateValue
    cursor.insertRow(feature)
del cursor
Is there a simpler way to create a shapefile using python without adding a field for all 75 columns in the .CSV file? Any help is greatly appreciated.
Simply select just the columns you need; you are not required to use all columns.
Use the csv module to read the file, then just pick out the 2 values from each row:
import csv

cursor = arcpy.InsertCursor(pointFC)  # pointFC as created earlier in your script
with open('yourcsvfile.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    for row in reader:
        point = arcpy.CreateObject("Point")
        point.X, point.Y = float(row[5]), float(row[27])  # take the 6th and 28th columns from the row
        # wrap the geometry in a new row object before inserting
        feature = cursor.newRow()
        feature.shape = point
        cursor.insertRow(feature)
del cursor
