I have multiple NetCDF files (one for each year) that contain daily rainfall values for Australia.
At present I am able to extract the specific days I want by reading from a .csv file that contains the list of dates I want. From this it then outputs each day as a raster file.
However, the script I have at the moment only allows me to do this one year at a time. I'm fairly new to python, and rather than re-running the script many times by changing the NetCDF file it reads in (as well as the list of dates in the .csv file) I was hoping to get some assistance in creating a loop that will read through the list of NetCDF's.
I understand that modules such as NetCDF4 are available to treat all files as one, but despite many hours reading what others have done, I am none the wiser.
Here is what I have so far:
import os, sys
import arcpy
# Check out any necessary licenses
arcpy.CheckOutExtension("spatial")
arcpy.env.overwriteOutput = True
# Script arguments
netCDF = "G:\\Gridded_rain\\DAILY\\netcdf\\Daily_analysis_V3"
rainfall = "G:\\output_test\\r_"
arcpy.env.workspace = netCDF
# Read Date from csv file
eveDate = open ("G:\\selectdate_TEST1.csv", "r")
headerLine = eveDate.readline()
valueList = headerLine.split(",")
dateValueIndex = valueList.index("Date")
eventList = []
for line in eveDate.readlines():
segmenLine = line.split(",")
variable = "pre"
x_dimension = "lon"
y_dimension = "lat"
band_dimension = ""
#dimensionValues = "r_time 1900025"
valueSelectionMethod = "BY_VALUE"
outFile = "Pre"
# extract dimensionValues from csv file
arcpy.MakeNetCDFRasterLayer_md("pre.2011.nc", variable, x_dimension, y_dimension, outFile, band_dimension, segmenLine[dateValueIndex], valueSelectionMethod)
print "layer done"
#copy and save as raster tif file
arcpy.CopyRaster_management(outFile, rainfall + segmenLine[dateValueIndex] + ".tif" , "", "", "", "NONE", "NONE", "")
print "raster done"
The NetCDF files are named from pre.1900.nc through to pre.2011.nc
Any help would be greatly appreciated!
If the question is really about python command line arguments you could add something like:
import sys
year = int(sys.argv[1])
nc_name = 'pre.%d.nc' % (year,)
and then use this nc_name as the filepath argument in your arcpy.MakeNetCDFRasterLayer_md call.
The other possibility, as suggested in the comments on the question, would be to hard-code the loop over the years like so:
for year in range(1900, 2012):
    nc_name = 'pre.%d.nc' % (year,)
and then call arcpy.MakeNetCDFRasterLayer_md etc.
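A minimal sketch of how that loop could wrap the existing extraction (reusing the variables already defined in the question's script, and assuming the same list of dates should be applied to every year):

# read the dates once into a list so they can be reused for every year
dates = [line.split(",")[dateValueIndex] for line in eveDate.readlines()]
for year in range(1900, 2012):
    nc_name = 'pre.%d.nc' % (year,)
    for date_value in dates:
        arcpy.MakeNetCDFRasterLayer_md(nc_name, variable, x_dimension, y_dimension, outFile, band_dimension, date_value, valueSelectionMethod)
        arcpy.CopyRaster_management(outFile, rainfall + date_value + ".tif", "", "", "", "NONE", "NONE", "")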
Related
I'm looking to create a function that will import csv files based on a user input of file names that were created as a list. This is for some data analysis where I will then use pandas to resample the data and calculate the percentages of missing data. So far I have:
parser = lambda x: pd.datetime.strptime(x, '%d/%m/%Y %H:%M')
number_stations = input(" Please tell how many stations you want to analyse: ")
list_of_stations_name_number = []
i = 0
while i < int(number_stations):
    i += 1
    name = input(" Please the stations name for station number {}: ".format(i))
    list_of_stations_name_number.append(name + '.csv')
This works as intended: the user adds the names of the stations they want to analyse and is left with a list in list_of_stations_name_number, such as:
list_of_stations_name_number = ['DM00115_D.csv', 'DM00117_D.csv', 'DM00118_D.csv', 'DM00121_D.csv', 'DM00129_D.csv']
Is there an easy way I can then change to the directory (using os.chdir) and import the csv files based on their matching names? I'm not sure how complicated or simple this would be and am open to trying more efficient methods if applicable.
To read all files, you can do something like -
list_of_dfs = [pd.read_csv(f) for f in list_of_stations_name_number]
list_of_dfs[0] will correspond to the csv file list_of_stations_name_number[0]
If your files are not in the current directory, you can prepend the directory path to the file names -
list_of_stations_name_number = [f'location/to/folder/{fname}' for fname in list_of_stations_name_number]
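If you also want to apply your date parser while reading, a minimal sketch (assuming a hypothetical stations/ folder and a date column named 'Date' — adjust both to match your files) would be:

import os
import pandas as pd

station_folder = 'stations/'  # hypothetical folder holding the csv files
parser = lambda x: pd.datetime.strptime(x, '%d/%m/%Y %H:%M')

list_of_dfs = [
    pd.read_csv(os.path.join(station_folder, fname),
                parse_dates=['Date'],   # assumed name of the date column
                date_parser=parser)
    for fname in list_of_stations_name_number
]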
Unfortunately I'm quite new to python and don't have time at the moment to dig deeper, so I can't understand and solve the error displayed by the python console. I am trying to use this code to extract data from multiple netCDF files for multiple locations:
#this is for reading the .nc files in the working folder
import glob
#this is required to read the netCDF4 data
from netCDF4 import Dataset
#required to read and write the csv files
import pandas as pd
#required for using the array functions
import numpy as np
# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    #reading the files
    data = Dataset(file, 'r')
    #saving the data variable time
    time = data.variables['time']
    #saving the year which is written in the file
    year = time.units[11:15]
    #once we have acquired the data for one year, the loop appends it for all the years
    all_years.append(year)
# Creating an empty Pandas DataFrame covering the whole range of data and then we will read the required data and put it here
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start = str(year_start) + '-01-01',
                           end = str(end_year) + '-12-31',
                           freq = 'D')
#an empty dataframe filled with 0.0 values will be created, indexed by date_range, with one value column
df = pd.DataFrame(0.0, columns = ['Precipitation'], index = date_range)
# Defining the names, lat, lon for the locations of your interest into a csv file
#this will read the file locations
locations = pd.read_csv('stations_locations.csv')
#we would use a for loop as we are interested in aquiring all the information one by one from the rows
for index, row in locations.iterrows():
    # one by one we will extract the information from the csv and put it into temp. variables
    location = row['names']
    location_lat = row['latitude']
    location_lon = row['longitude']
    # Sorting the all_years just to be sure that model writes the data correctly
    all_years.sort()
    #now we will read the netCDF file and here I have used netCDF file from FGOALS model
    for yr in all_years:
        # Reading-in the data
        data = Dataset('pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc'%(yr,yr), 'r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        #as we already have the co-ordinates of the point which needs to be downloaded,
        #in order to find the closest point around it we need to subtract the coordinates
        #and check whichever has the minimum distance
        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_lat)**2
        sq_diff_lon = (lon - location_lon)**2
        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the average temperature data
        temp = data.variables['pr']
        # Creating the date range for each year during each iteration
        start = str(yr) + '-01-01'
        end = str(yr) + '-12-31'
        d_range = pd.date_range(start = start,
                                end = end,
                                freq = 'D')
        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for: ' + str(location) + '_' + str(d_range[t_index]))
            df.loc[d_range[t_index]]['Temparature'] = temp[t_index, min_index_lat, min_index_lon]
    df.to_csv(str(location) + '.csv')
This is the error code displayed:
File "G:\Selection Cannon\Historical\CNRM-CM5_r1i1p1\pr\extracting data_CNRM-CM5_pr.py", line 62, in <module>
data = Dataset('pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc'%(yr,yr), 'r')
File "netCDF4\_netCDF4.pyx", line 2321, in netCDF4._netCDF4.Dataset.__init__
File "netCDF4\_netCDF4.pyx", line 1885, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'pr_day_CNRM-CM5_historical_r1i1p1_18500101-18501231.nc'
When I check the variable/function 'time.units' it says "days since 1850-1-1", but I only have files from 1975-2005 in the folder. And if I check "all_years" it just displays '1850' seven times. I think this has to do with the "year = time.units[11:15]" line, but this is how the guy in the youtube video did it.
Can someone please help me to solve this, so that this code extracts the files from 1975 and on?
Best regards,
Alex
PS: This is my first post, please tell me if you need any supplementary information or data :)
Before anything else, it seems like you didn't give the correct path. Should be something like "G:/path/to/pr_day_CNRM-CM5_historical_r1i1p1_18500101-18501231.nc".
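A minimal sketch of that fix (the folder name here is hypothetical; replace it with the actual location of your .nc files):

import os

nc_folder = 'G:/path/to/'  # hypothetical folder containing the .nc files
nc_name = 'pr_day_CNRM-CM5_historical_r1i1p1_%s0101-%s1231.nc' % (yr, yr)
data = Dataset(os.path.join(nc_folder, nc_name), 'r')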
I am very new to Python, Pandas, and NLP but have taken a few intro courses. I have a directory of 3 PDF files (it will be over a hundred once I get the full data set). I want to open each file and make two columns in a Pandas dataframe that I can eventually use for some NLP work. The two columns needed are an ID column with the name of the PDF and a second column with all of the text/information located within that PDF.
This is the code I used to go through one file at a time:
import PyPDF2 as pdf
i = 0
while i < pdf_reader.getNumPages():
    pageinfo = pdf_reader.getPage(i)
    print(pageinfo.extractText())
    i = i + 1
This is the code that I used to name my directory and print out the file names:
import os
directory = os.listdir('test_files/')
for entry in directory:
    print(entry)
Update: this is what I have so far. Does it seem close?
directory = os.listdir('test_files/')
for entry in directory:
    file = open(entry, 'rb')
    pdf_reader = pdf.PdfFileReader(file)
    i = 0
    while i < pdf_reader.getNumPages():
        pageinfo = pdf_reader.getPage(i)
        i = i + 1
    data = {'PDF_ID': [entry],
            'Text_Data': [pageinfo.extractText()]}
    df = pd.DataFrame(data, columns = ['PDF_ID', 'Text_Data'])
That would be ideal, but I haven't found the best way to combine them and create a dataframe at the same time. I already have a function created that will clean and tokenize the text, but one file at a time is not ideal. Thanks!
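One possible way to combine the two pieces above (a minimal sketch, assuming the same PyPDF2 calls used above and the hypothetical test_files/ folder) is to collect one row per PDF and build the dataframe once at the end:

import os
import pandas as pd
import PyPDF2 as pdf

rows = []
directory = 'test_files/'
for entry in os.listdir(directory):
    if not entry.lower().endswith('.pdf'):
        continue
    with open(os.path.join(directory, entry), 'rb') as f:
        pdf_reader = pdf.PdfFileReader(f)
        # join the text of every page into one string
        text = ''
        for i in range(pdf_reader.getNumPages()):
            text += pdf_reader.getPage(i).extractText()
    rows.append({'PDF_ID': entry, 'Text_Data': text})

# build the dataframe once, after all files have been read
df = pd.DataFrame(rows, columns=['PDF_ID', 'Text_Data'])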
I believe that I have successfully read in my files with a "for loop" as shown in the code below.
import pandas as pd
import glob
filename = glob.glob('1511**.mnd')
data_nov15_hereford = pd.DataFrame()
frames = []
for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows = 33, sep='\s+')
    frames.append(f_nov15_hereford)
data_nov15_hereford = pd.concat(frames)
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True)
My problem now is that I want to take out some information from the files. Specifically I want the 80 m wind speed from the files. When I read in just one file, instead of looping over multiple files the code works like I need it to by simply doing this:
height = data_nov15_hereford['#']
wspd = data_nov15_hereford["z"]
hub = np.where(height==80)
print hub
hub_wspd = wspd[5:4582:32]
hub_wspd is the 80 m wind speed that I am interested in. And I get the index numbers 5:4582 by printing out hub. And then all I have to do is skip every 32 rows to continue to pull out the 80 m wind speed from the file. However, now that I have read in multiple files (that all look the same and have the same layout as this one file) I can't seem to pull out the 80 m wind speed the same way. Basically, I print out hub and get the indices 5:65418 and then I skip every 32 rows but when I print out the tail end of the hub_wspd it doesn't match the file so I must be doing something wrong. Any ideas why it isn't working with multiple files but worked with the single file? I can also attach a copy of the single data file if that would help. Thanks!
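One way to make the selection independent of how many files were concatenated (a minimal sketch, assuming the '#' column really holds the heights and the 'z' column the wind speeds, as in the single-file code above) is to select by value with a boolean mask rather than by fixed row positions:

# keep only the rows where the measurement height equals 80 m
height = data_nov15_hereford['#']
wspd = data_nov15_hereford['z']
hub_wspd = wspd[height == 80]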
I have a folder which contains hundreds (possibly over 1k) of csv data files of chronological data. Ideally this data would be in one csv, so that I can analyse it all in one go. What I would like to know is: is there a way to append all the files to one another using python?
My files exist in folder locations like so:
C:\Users\folder\Database Files\1st September
C:\Users\folder\Database Files\1st October
C:\Users\folder\Database Files\1st November
C:\Users\folder\Database Files\1st December
etc
Inside each of the folders there are 3 csvs (I am using the term csv loosely since these files are actually saved as .txt files containing values separated by pipes |).
Let's say these files are called:
MonthNamOne.txt
MonthNamTwo.txt
MonthNameOneTwoMurged.txt
How would I, or is it even possible to, code something to go through all of these folders in this directory and then merge together all the OneTwoMurged.txt files?
For all files in a folder with the .csv suffix:
import glob
import os
filelist = []
os.chdir("folderwithcsvs/")
for counter, files in enumerate(glob.glob("*.csv")):
    filelist.append(files)
    print "do stuff with file:", files, counter
print filelist
for fileitem in filelist:
    print fileitem
Obviously the "do stuff part" depends on what you want done with the files, this is looking getting your list of files.
If you want to do something with the files on a monthly basis then you could use datetime and create possible months, same for days or yearly data.
For example, for monthly files named Month Year.csv it would look for each file in turn.
import subprocess, datetime, os
start_year, start_month = "2001", "January"
current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
    csv_filename = possible_month.strftime('%B %Y') + '.csv'
    month = possible_month.strftime('%B %Y').split(" ")[0]
    year = possible_month.strftime('%B %Y').split(" ")[1]
    if os.path.exists("folder/" + csv_filename):
        print csv_filename
    possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
Obviously you can change that to however you feel fit, let me know if you need more or if this suffices.
This will recursively process a directory, match a specific file pattern for processing, and append the results of processed files. This will parse the csvs as well, so you could do individual line analysis and processing as well. Modify as needed :)
#!python2
import os
import fnmatch
import csv
from datetime import datetime as dt
# Open result file
with open('output.txt', 'wb') as fout:
    wout = csv.writer(fout, delimiter='|')
    # Recursively process a directory
    for path, dirs, files in os.walk('files'):
        # Sort directories for processing.
        # In this case, sorting directories named "Month Year" chronologically.
        dirs.sort(key=lambda d: dt.strptime(d, '%B %Y'))
        interesting_files = fnmatch.filter(files, '*.txt')
        # Example for sorting filenames with a custom chronological sort "Month Year.txt"
        for filename in sorted(interesting_files, key=lambda f: dt.strptime(f, '%B %Y.txt')):
            # Generate the full path to the file.
            fullname = os.path.join(path, filename)
            print 'Processing', fullname
            # Open and process file
            with open(fullname, 'rb') as fin:
                for line in csv.reader(fin, delimiter='|'):
                    wout.writerow(line)
Reading into a pandas dataframe (the choice of axis depends on your application); my example adds columns of the same length:
import glob
import pandas as pd
df=pd.DataFrame()
for files in glob.glob("*.csv"):
print files
df = pd.concat([df,pd.read_csv(files).iloc[:,1:]],axis=1)
axis = 0 would add row-wise
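A minimal sketch of that row-wise variant (assuming the same *.csv pattern as above):

import glob
import pandas as pd

# read every csv in the current folder and stack the files on top of each other
frames = [pd.read_csv(f) for f in glob.glob("*.csv")]
df = pd.concat(frames, axis=0, ignore_index=True)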