I'm looking to create a function that will import CSV files based on user input of file names collected into a list. This is for some data analysis where I will then use pandas to resample the data and calculate the percentage of missing data. So far I have:
from datetime import datetime
import pandas as pd

# pd.datetime has been removed from recent pandas releases; use the
# datetime module directly
parser = lambda x: datetime.strptime(x, '%d/%m/%Y %H:%M')

number_stations = input("Please tell how many stations you want to analyse: ")
list_of_stations_name_number = []
i = 0
while i < int(number_stations):
    i += 1
    name = input("Please enter the station name for station number {}: ".format(i))
    list_of_stations_name_number.append(name + '.csv')
This works as intended: the user adds the names of the stations they want to analyse and is left with a list in list_of_stations_name_number, such as:
>>> list_of_stations_name_number
['DM00115_D.csv', 'DM00117_D.csv', 'DM00118_D.csv', 'DM00121_D.csv', 'DM00129_D.csv']
Is there an easy way I can then change to the directory (using os.chdir) and import the CSV files based on their matching names? I'm not sure how complicated or simple this would be, and I am open to trying more efficient methods if applicable.
To read all files, you can do something like -
list_of_dfs = [pd.read_csv(f) for f in list_of_stations_name_number]
list_of_dfs[0] will correspond to the csv file list_of_stations_name_number[0]
If your files are not in the current directory, you can prepend the directory path to the file names -
list_of_stations_name_number = [f'location/to/folder/{fname}' for fname in list_of_stations_name_number]
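For example, a minimal sketch that keeps each station's DataFrame keyed by its file name, assuming the CSVs live in a hypothetical data_dir (os.path.join avoids having to os.chdir at all):

import os
import pandas as pd

# data_dir is a placeholder; point it at the folder holding the station files
data_dir = 'path/to/station/files'

# Build the full path for each station file and read it into a dict,
# keyed by the original file name
dfs = {fname: pd.read_csv(os.path.join(data_dir, fname))
       for fname in list_of_stations_name_number}

# dfs['DM00115_D.csv'] is then the DataFrame for that station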
Related
I am trying to write a faster way to read in a group of CSV files. The files share a common partial path leading to a group of subfolders, each identified by some identifier; within each subfolder, the file name starts with that identifier and ends with a common suffix.
For example, let's say I have group names A, B, C. The file paths would be:
C:\Users\Name\Documents\A\A-beginninggroup.csv
C:\Users\Name\Documents\A\A-middlegroup.csv
C:\Users\Name\Documents\A\A-endinggroup.csv
C:\Users\Name\Documents\B\B-beginninggroup.csv
C:\Users\Name\Documents\B\B-middlegroup.csv
C:\Users\Name\Documents\B\B-endinggroup.csv
C:\Users\Name\Documents\C\C-beginninggroup.csv
C:\Users\Name\Documents\C\C-middlegroup.csv
C:\Users\Name\Documents\C\C-endinggroup.csv
I am trying to write code where I can just change the name of the subgroup without having to change it in each read_csv line. The following code shows the logic, but I am not sure how to make it work, or if it's possible.
intro = 'C:\\Users\\Name\\Documents\\'
subgroup = 'C'
ending1 = '-endinggroup.csv'
ending2 = '-middlegroup.csv'
ending3 = '-beginninggroup.csv'
filename_1 = intro + subgroup + '\\' + subgroup + ending1
filename_2 = intro + subgroup + '\\' + subgroup + ending2
filename_3 = intro + subgroup + '\\' + subgroup + ending3
file1 = pd.read_csv(filename_1)
file2 = pd.read_csv(filename_2)
file3 = pd.read_csv(filename_3)
I am not sure exactly what you are after, but you can use an f-string in this case.
You first define your variables (names in your case); note the doubled backslashes, since a single backslash starts an escape sequence:
location = 'somewhere\\anywhere'
group = 'A'
csv = 'A-beginninggroup.csv'
Now you combine these variables in an f-string:
file_location = f"{location}\\{group}\\{csv}"
And pass the file_location to your pandas csv reader. You can freely change the group variable and the csv variable.
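Put together, a minimal sketch (the paths are placeholders; a raw string sidesteps the backslash-escape problem in Windows paths):

import pandas as pd

location = r'C:\Users\Name\Documents'
group = 'A'
csv = f'{group}-beginninggroup.csv'

file_location = f"{location}\\{group}\\{csv}"
df = pd.read_csv(file_location)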
I can just change the name of the subgroup without having to change it in each read_csv line.
You can define a function to handle the logic of joining the paths:

import os
import pandas as pd

intro = 'C:\\Users\\Name\\Documents\\'
ending1 = '-endinggroup.csv'
ending2 = '-middlegroup.csv'
ending3 = '-beginninggroup.csv'

def read_file(subgroup, ending):
    # os.path.join inserts the separators, so no manual backslashes needed
    csv_path = os.path.join(intro, subgroup, subgroup + ending)
    df = pd.read_csv(csv_path)
    return df

file1 = read_file('A', ending1)
file2 = read_file('A', ending2)
file3 = read_file('B', ending1)
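Since the subgroup is now just an argument, a quick sketch of reading every ending for one subgroup in a single comprehension:

# Read all three endings for subgroup 'C' into a dict keyed by ending
endings = [ending1, ending2, ending3]
files_for_C = {ending: read_file('C', ending) for ending in endings}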
I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that allows me to read multiple csv into one dataframe instead of many.
You can list all CSVs under a directory using os.listdir(dirname) and combine it with os.path.splitext and os.path.basename to parse the file names.

import os
import pandas as pd

# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]

# status.csv -> status
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]

d = {}
for fn, csv in zip(fns, csvs):
    d[fn] = pd.read_csv(csv)
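For the status.csv example in the question, the frame is then available under its stem:

status_df = d['status']  # the DataFrame read from status.csv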
You could create a dictionary of DataFrames:

d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to the dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)
I have almost 1,000,000 files, or even more, in a directory.
My goal is to extract some information from just the names of the files.
So far I have saved the names of the files in a list.
What information is in the names of the files?
The format of the file names is something like this:
09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt
All the haha parts are other text that does not matter.
I want to extract 09066271 and 2016-10-07 out of the names and save them in a dataframe. The first number is always 8 characters.
So far, I have saved all the txt file names in the list:
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
First I wanted to save all the txt file names in a dataframe and then do these operations on them. It seems I have to read them into numpy first, then reshape to make them readable in pandas; however, I do not know in advance what the reshape dimensions should be.
df = pd.DataFrame(np.array(file_list).reshape(,))
I would appreciate your ideas on what would be an efficient way of doing this :)
You can use os to list all of the files. Then just construct a DataFrame and use the string methods to get the parts of the filenames you need.
import pandas as pd
import os

path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)

df = pd.DataFrame(file_list, columns=['file_name'])
df['data'] = df.file_name.str[0:8]
# a raw string for the regex, and expand=False so a Series comes back
df['date'] = df.file_name.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)
file_name data date
0 09066271_142468576_1_Haha_-Haha-haha_2016-10-0... 09066271 2016-10-07
1 09014271_142468576_1_Haha_-Haha-haha_2013-02-1... 09014271 2013-02-18
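If you need real dates for later analysis, a one-line sketch converting the extracted strings:

# Turn the extracted text into a proper datetime column
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')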
I have multiple NetCDF files (one for each year) that contain daily rainfall values for Australia.
At present I am able to extract the specific days I want by reading from a .csv file that contains the list of dates I want. From this it then outputs each day as a raster file.
However, the script I have at the moment only allows me to do this one year at a time. I'm fairly new to Python, and rather than re-running the script many times, changing the NetCDF file it reads in (as well as the list of dates in the .csv file) each time, I was hoping to get some assistance in creating a loop that will read through the list of NetCDFs.
I understand that modules such as NetCDF4 are available to treat all files as one, but despite many hours reading what others have done, I am none the wiser.
Here is what I have so far:
import os, sys
import arcpy

# Check out any necessary licenses
arcpy.CheckOutExtension("spatial")
arcpy.env.overwriteOutput = True

# Script arguments
netCDF = "G:\\Gridded_rain\\DAILY\\netcdf\\Daily_analysis_V3"
rainfall = "G:\\output_test\\r_"
arcpy.env.workspace = netCDF

# Read Date from csv file
eveDate = open("G:\\selectdate_TEST1.csv", "r")
headerLine = eveDate.readline()
valueList = headerLine.split(",")
dateValueIndex = valueList.index("Date")
eventList = []

for line in eveDate.readlines():
    segmenLine = line.split(",")
    variable = "pre"
    x_dimension = "lon"
    y_dimension = "lat"
    band_dimension = ""
    #dimensionValues = "r_time 1900025"
    valueSelectionMethod = "BY_VALUE"
    outFile = "Pre"
    # extract dimensionValues from csv file
    arcpy.MakeNetCDFRasterLayer_md("pre.2011.nc", variable, x_dimension, y_dimension, outFile, band_dimension, segmenLine[dateValueIndex], valueSelectionMethod)
    print "layer done"
    # copy and save as raster tif file
    arcpy.CopyRaster_management(outFile, rainfall + segmenLine[dateValueIndex] + ".tif", "", "", "", "NONE", "NONE", "")
    print "raster done"
The NetCDF files are named from pre.1900.nc through to pre.2011.nc
Any help would be greatly appreciated!
If the question is really about python command line arguments you could add something like:
import sys
year = int(sys.argv[1])
nc_name = 'pre.%d.nc' % (year,)
and then use this nc_name as the filepath argument in your arcpy.MakeNetCDFRasterLayer_md call.
The other possibility, as suggested in a comment on the question, would be to hard-code a loop over the years like so:
for year in range(1900, 2012):
    nc_name = 'pre.%d.nc' % (year,)
and then call arcpy.MakeNetCDFRasterLayer_md etc.
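Putting the two together, a sketch of the year loop wrapped around the logic from the script above (variable names reused from it; this assumes the same date list applies to every year, which may not hold if each year has its own csv):

for year in range(1900, 2012):
    nc_name = 'pre.%d.nc' % (year,)
    eveDate = open("G:\\selectdate_TEST1.csv", "r")
    eveDate.readline()  # skip the header line
    for line in eveDate.readlines():
        segmenLine = line.split(",")
        arcpy.MakeNetCDFRasterLayer_md(nc_name, variable, x_dimension, y_dimension, outFile, band_dimension, segmenLine[dateValueIndex], valueSelectionMethod)
        arcpy.CopyRaster_management(outFile, rainfall + segmenLine[dateValueIndex] + ".tif", "", "", "", "NONE", "NONE", "")
    eveDate.close()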
I have a folder which contains hundreds (possibly over 1k) of CSV data files of chronological data. Ideally this data would be in one CSV, so that I can analyse it all in one go. What I would like to know is: is there a way to append all the files to one another using Python?
My files exist in folder locations like so:
C:\Users\folder\Database Files\1st September
C:\Users\folder\Database Files\1st October
C:\Users\folder\Database Files\1st November
C:\Users\folder\Database Files\1st December
etc
Inside each of the folders there are 3 CSVs (I am using the term CSV loosely, since these files are actually saved as .txt files containing values separated by pipes |).
Let's say these files are called:
MonthNamOne.txt
MonthNamTwo.txt
MonthNameOneTwoMurged.txt
How would I code something, if it is even possible, to go through all of these folders in this directory and then merge together all the OneTwoMurged.txt files?
For all files in the folder with a .csv suffix:

import glob
import os

filelist = []
os.chdir("folderwithcsvs/")
for counter, files in enumerate(glob.glob("*.csv")):
    filelist.append(files)
    print "do stuff with file:", files, counter
print filelist
for fileitem in filelist:
    print fileitem

Obviously the "do stuff" part depends on what you want done with the files; this is about getting your list of files first.
If you want to do something with the files on a monthly basis, you could use datetime to generate the possible months (the same idea works for daily or yearly data).
For example, for monthly files named Month Year.csv, it would check for each possible file:
import datetime, os

start_year, start_month = "2001", "January"
current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
    csv_filename = possible_month.strftime('%B %Y') + '.csv'
    month = possible_month.strftime('%B %Y').split(" ")[0]
    year = possible_month.strftime('%B %Y').split(" ")[1]
    if os.path.exists("folder/" + csv_filename):
        print csv_filename
    possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
Obviously you can change that however you see fit; let me know if you need more or if this suffices.
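If the goal is to merge the files the loop finds, a sketch under the assumption that you append each existing csv_filename to a list (say found_files) in place of the print statement; sep='|' matches the pipe-delimited files from the question:

import pandas as pd

# found_files: the matching file names collected by the loop above
frames = [pd.read_csv("folder/" + name, sep='|') for name in found_files]
merged = pd.concat(frames, ignore_index=True)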
This will recursively process a directory, match a specific file pattern for processing, and append the results of the processed files. It parses the CSVs as well, so you could do individual line analysis and processing too. Modify as needed :)
#!python2
import os
import fnmatch
import csv
from datetime import datetime as dt

# Open result file
with open('output.txt','wb') as fout:
    wout = csv.writer(fout,delimiter='|')
    # Recursively process a directory
    for path,dirs,files in os.walk('files'):
        # Sort directories for processing.
        # In this case, sorting directories named "Month Year" chronologically.
        dirs.sort(key=lambda d: dt.strptime(d,'%B %Y'))
        interesting_files = fnmatch.filter(files,'*.txt')
        # Example for sorting filenames with a custom chronological sort "Month Year.txt"
        for filename in sorted(interesting_files,key=lambda f: dt.strptime(f,'%B %Y.txt')):
            # Generate the full path to the file.
            fullname = os.path.join(path,filename)
            print 'Processing',fullname
            # Open and process file
            with open(fullname,'rb') as fin:
                for line in csv.reader(fin,delimiter='|'):
                    wout.writerow(line)
Reading into a pandas DataFrame (the choice of axis depends on your application); my example adds columns of the same length:
import glob
import pandas as pd

df = pd.DataFrame()
for files in glob.glob("*.csv"):
    print files
    df = pd.concat([df, pd.read_csv(files).iloc[:, 1:]], axis=1)
axis=0 would append row-wise instead.
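Concatenating inside the loop copies the growing frame on every iteration; a sketch of the usual cheaper pattern, collecting the frames in a list and concatenating once (same folder of CSVs assumed):

import glob
import pandas as pd

# Read every CSV first, then concatenate a single time
frames = [pd.read_csv(f) for f in glob.glob("*.csv")]
combined = pd.concat(frames, axis=0, ignore_index=True)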