How can I find the input data location(s) used by a Python script? - python

I want to create a general program for monitoring purposes, to see which input data is being used by the various models in our company.
To do this, I want to loop through our (production) model folder, find all the .py or .ipynb files, and read each one as a string using glob (and os). For now, I made a loop that looks for all scripts containing 'csv' (as a start):
path = directory
search_word = 'csv'
# list to store files that contain the matching word
final_files = []
for folder_path, folders, files in os.walk(path):
    # IPYNB files
    path = folder_path + '\\*.IPYNB'
    for filepath in glob.glob(path, recursive=True):
        try:
            with open(filepath) as fp:
                # read the file as a string
                data = fp.read()
                if search_word in data:
                    final_files.append(filepath)
        except:
            print('Exception while reading file')
print(final_files)
This gives back all IPYNB files containing the word 'csv', so I'm able to read within the files.
What I want is that, in the part where I'm now searching for 'csv', the program reads the file (as it does right now) and determines which input data (and, eventually, which output data) is being used.
For example, one file (.IPYNB) contains this script part (the input used for a model):
#Dataset 1
df1 = pd.read_csv('Data.csv', sep=';')
#dataset 2
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes')
query = "SELECT * FROM database.schema.data2"
df2 = pd.read_sql_query(query, sql_conn)
#dataset 3
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes')
query = "SELECT element1, element2 FROM database.schema.data3"
df3 = pd.read_sql_query(query, sql_conn)
How can I make the program extract the following facts:
Data.csv
database.schema.data2
database.schema.data3
Does anyone have a good idea?
Thanks in advance!
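One possible approach, sketched below, is to scan each script with regular expressions: one pattern for the file name inside pd.read_csv(...) and one for the table name after FROM in SQL strings. The patterns and the parse_sources helper are my own illustration, not an established solution, and they only cover the two styles shown in the example above:
import re

# pattern for pd.read_csv('Data.csv', ...) style calls
CSV_PATTERN = re.compile(r"read_csv\(\s*['\"]([^'\"]+)['\"]", re.IGNORECASE)
# pattern for "... FROM database.schema.table ..." in SQL query strings
SQL_PATTERN = re.compile(r"\bFROM\s+([\w\.\[\]]+)", re.IGNORECASE)

def parse_sources(script_text):
    """Best-effort list of data sources referenced in a script's text."""
    return CSV_PATTERN.findall(script_text) + SQL_PATTERN.findall(script_text)

# inside the existing loop, instead of only testing for 'csv':
#     data = fp.read()
#     final_files.append((filepath, parse_sources(data)))
On the example script part above this returns Data.csv, database.schema.data2 and database.schema.data3. Since .ipynb files are JSON, it is more robust to load them with the json module and join the source lines of the code cells before matching, rather than searching the raw file contents.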

Related

SAX Parser in Python

I am parsing XML files in a folder using the Python SAX parser and writing the output to CSV using pandas, but I am only getting the data from the last file in the CSV.
I am new to Python and this is my first time trying SAX parsing.
File read:
for dirpath, dirs, files in os.walk(fp1):
    for filename in files:
        print(files)
        fname = os.path.join(dirpath, filename)
        if fname.endswith('.xml'):
            print(fname)
            #for count in files:
            parser.parse(fname)
def characters(self, content):
    rows = []
    cols = ["ReporterCite", "DecisionDate", "CaseName", "FileNum", "CourtLocation", "CourtName", "CourtAbbrv", "Judge", "CaseLength", "CourtCite", "ParallelCite", "CitedCount", "UCN"]
    rows.append({"ReporterCite": self.rc,
                 "DecisionDate": self.dd,
                 "CaseName": self.can,
                 "FileNum": self.fn,
                 "CourtLocation": self.loc,
                 "CourtName": self.cn,
                 "CourtAbbrv": self.ca,
                 "Judge": self.j,
                 "CaseLength": self.cl,
                 "CourtCite": self.cc,
                 "ParallelCite": self.pc,
                 "CitedCount": self.cd,
                 "UCN": self.rn})
    #print(rows)
    df = pd.DataFrame(rows, columns=cols)
    df.to_csv(fp2, index=False)
I assume you always overwrite your previous result. This is a pandas question, not a SAX question. You would like to append to the existing CSV, right? If that is the case, you have to use mode='a', like
df.to_csv('filename.csv', mode='a')
For more options, see the docs:
'w' open for writing, truncating the file first (default)
'x' open for exclusive creation, failing if file already exists
'a' open for writing, appending to the end of file if it exists
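As a side note (my addition, not part of the original answer): when appending inside a loop, to_csv writes the column header on every call unless you suppress it, so a small guard like the sketch below is often needed; the append_to_csv name is just for illustration.
import os
import pandas as pd

def append_to_csv(df, path):
    # write the header only the first time, when the file does not exist yet
    df.to_csv(path, mode='a', header=not os.path.exists(path), index=False)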

How to: scrape URL, extract all CSV files from within the zip files on the page? [Python]

I've looked all over SO for an approach to this problem and none that I have tried has worked. I've seen several posts about downloading zip files from a URL or unzipping files from a local directory in Python, but I am a bit confused about how to put it all together.
My problem: I have a page of zipped data that is organized by month going back to 2010. I'd like to use some Python code to:
scrape the page and nab only the .zip files (there's other data on the page)
unzip each respective monthly dataset
extract and concatenate all the .csv files in each unzipped folder to one long Pandas dataframe
I've tried
from urllib.request import urlopen

url = "https://s3.amazonaws.com/capitalbikeshare-data/2010-capitalbikeshare-tripdata.zip"
save_as = "tripdata1.csv"

# Download from URL
with urlopen(url) as file:
    content = file.read()

# Save to file
with open(save_as, 'wb') as download:
    download.write(content)
but this returns gibberish.
Then, I tried an approach I saw from a related SO post:
import glob
import pandas as pd
from zipfile import ZipFile

path = r'https://s3.amazonaws.com/capitalbikeshare-data/index.html' # my path to site

# load all zip files in folder
all_files = glob.glob(path + "/*.zip")

df_comb = pd.DataFrame()
flag = False
for filename in all_files:
    zip_file = ZipFile(filename)
    for text_file in zip_file.infolist():
        if text_file.filename.endswith('tripdata.csv'):
            df = pd.read_csv(zip_file.open(text_file.filename),
                             delimiter=';',
                             header=0,
                             index_col=['ride_id']
                             )
            if not flag:
                df_comb = df
                flag = True
            else:
                df_comb = pd.concat([df_comb, df])
print(df_comb.info())
But this returned a df with zero data or, with additional tinkering, an error that there were no filenames on the page... :/
Final data should essentially just be a row-wise merge of all the monthly trip data from the index.
Any advice or fixes will be highly appreciated!
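Here is a minimal sketch of one way to put the pieces together, assuming the monthly .zip URLs are already collected in a list (glob cannot list files on a web page, and the first attempt saved the raw zip bytes under a .csv name, which is why it looked like gibberish). The zip_urls list is an assumption, and the delimiter and index column may differ from the ';' and 'ride_id' used above:
import io
from urllib.request import urlopen
from zipfile import ZipFile

import pandas as pd

# assumption: the monthly archive URLs have been scraped or listed beforehand
zip_urls = [
    "https://s3.amazonaws.com/capitalbikeshare-data/2010-capitalbikeshare-tripdata.zip",
    # ... more monthly archives
]

frames = []
for url in zip_urls:
    with urlopen(url) as response:
        archive = ZipFile(io.BytesIO(response.read()))  # keep the zip in memory
    for member in archive.infolist():
        if member.filename.endswith('.csv'):
            with archive.open(member) as fh:
                frames.append(pd.read_csv(fh))

# row-wise merge of all monthly trip data
df_comb = pd.concat(frames, ignore_index=True)
print(df_comb.info())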

Need some assistance on a DBF "File not found" error in Python when looping through a directory?

I would like to ask for help with a Python script that is supposed to loop through a directory on a drive. Basically, what I want to do is convert over 10,000 DBF files to CSV. So far, I can achieve this on an individual DBF file by using the dbfread and Pandas packages. Running this script over 10,000 individual times is obviously not feasible, which is why I want to automate the task by writing a script that loops through each DBF file in the directory.
Here is what I would like to do.
Define the directory
Write a for loop that will loop through each file in the directory
Only open a file with the extension '.dbf'
Convert to Pandas DataFrame
Define the name for the output file
Write to CSV and place file in a new directory
Here is the code that I was using to test whether I could convert an individual '.dbf' file to a CSV.
from dbfread import DBF
import pandas as pd
table = DBF('Name_of_File.dbf')
#I originally kept receiving a unicode decoding error
#So I manually adjusted the attributes below
table.encoding = 'utf-8' # Set encoding to utf-8 instead of 'ascii'
table.char_decode_errors = 'ignore' #ignore any decode errors while reading in the file
frame = pd.DataFrame(iter(table)) #Convert to DataFrame
print(frame) # Check to make sure the DataFrame is structured properly
frame.to_csv('Name_of_New_File')
The above code worked exactly as it was intended.
Here is my code to loop through the directory.
import os
from dbfread import DBF
import pandas as pd

directory = 'Path_to_diretory'
dest_directory = 'Directory_to_place_new_file'

for file in os.listdir(directory):
    if file.endswith('.DBF'):
        print(f'Reading in {file}...')
        dbf = DBF(file)
        dbf.encoding = 'utf-8'
        dbf.char_decode_errors = 'ignore'
        print('\nConverting to DataFrame...')
        frame = pd.DataFrame(iter(dbf))
        print(frame)
        outfile = frame.os.path.join(frame + '_CSV' + '.csv')
        print('\nWriting to CSV...')
        outfile.to_csv(dest_directory, index = False)
        print('\nConverted to CSV. Moving to next file...')
    else:
        print('File not found.')
When I run this code, I receive a DBFNotFound error that says it couldn't find the first file in the directory. As I am looking at my code, I am not sure why this is happening when it worked in the first script.
This is the code from the dbfread package from where the exception is being raised.
class DBF(object):
    """DBF table."""
    def __init__(self, filename, encoding=None, ignorecase=True,
                 lowernames=False,
                 parserclass=FieldParser,
                 recfactory=collections.OrderedDict,
                 load=False,
                 raw=False,
                 ignore_missing_memofile=False,
                 char_decode_errors='strict'):

        self.encoding = encoding
        self.ignorecase = ignorecase
        self.lowernames = lowernames
        self.parserclass = parserclass
        self.raw = raw
        self.ignore_missing_memofile = ignore_missing_memofile
        self.char_decode_errors = char_decode_errors

        if recfactory is None:
            self.recfactory = lambda items: items
        else:
            self.recfactory = recfactory

        # Name part before .dbf is the table name
        self.name = os.path.basename(filename)
        self.name = os.path.splitext(self.name)[0].lower()
        self._records = None
        self._deleted = None

        if ignorecase:
            self.filename = ifind(filename)
            if not self.filename:
                raise DBFNotFound('could not find file {!r}'.format(filename))  # ERROR IS HERE
        else:
            self.filename = filename
Thank you for any help provided.
os.listdir returns the file names inside the directory, so you have to join them to the base path to get the full path:
for file_name in os.listdir(directory):
    if file_name.endswith('.DBF'):
        file_path = os.path.join(directory, file_name)
        print(f'Reading in {file_name}...')
        dbf = DBF(file_path)
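Building on that answer, here is a sketch of the whole loop with the output path built as well, since the original line outfile = frame.os.path.join(frame + '_CSV' + '.csv') mixes up the DataFrame and os.path. The output naming below is my own assumption:
import os
from dbfread import DBF
import pandas as pd

directory = 'Path_to_diretory'
dest_directory = 'Directory_to_place_new_file'

for file_name in os.listdir(directory):
    if not file_name.endswith('.DBF'):
        continue
    print(f'Reading in {file_name}...')
    dbf = DBF(os.path.join(directory, file_name))
    dbf.encoding = 'utf-8'
    dbf.char_decode_errors = 'ignore'
    frame = pd.DataFrame(iter(dbf))
    # e.g. SOMEFILE.DBF -> Directory_to_place_new_file/SOMEFILE_CSV.csv
    out_path = os.path.join(dest_directory, os.path.splitext(file_name)[0] + '_CSV.csv')
    print('Writing to CSV...')
    frame.to_csv(out_path, index=False)
    print('Converted to CSV. Moving to next file...')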

Python Sequential Requests: Data Processing Automation within ArcMap

My Python skills are very limited (to none) and I've never created an automated, sequential request for ArcMap. Below are the steps I'd like to code; any advice would be appreciated.
Locate File folder
Import “first” file (table csv) (there are over 500 CSVs, the naming convention is not sequential)
Join csv to HUC08 shapefile
Select data without Null values under the field name “Name”
Save selected data as a layer file within my FoTX.gdb
Move to the next file within the folder and complete the same action until all actions are complete.
# Part of the code. The rest depends mostly on your data
import os, arcpy, csv

# Set environment settings
arcpy.env.workspace = 'C:/data'  # whatever it is for you. you can do this or not
mxd = arcpy.mapping.MapDocument("CURRENT")
folderPath = os.path.dirname(mxd.filePath)

# Loop through each csv file
count = 0
for f_name in os.listdir(folderPath):
    fullpath = os.path.join(folderPath, f_name)
    if os.path.isfile(fullpath):
        if f_name.lower().endswith(".csv"):
            count += 1
            # import csv file and join to shape file code here

            # Set local variables
            in_features = ['SomeNAME.shp', 'SomeOtherNAME.shp']  # if there is more than one
            out_location = 'C:/output/FoTX.gdb'
            # out_location = os.path.basename(gdb.filePath)  # or if the gdb is in the same folder as the csv files

            # Execute FeatureClassToGeodatabase
            arcpy.FeatureClassToGeodatabase_conversion(in_features, out_location)

if count == 0:
    print "No CSV files in this folder"

Many-record upload to postgres

I have a series of .csv files with some data, and I want a Python script to open them all, do some preprocessing, and upload the processed data to my postgres database.
I have it mostly complete, but my upload step isn't working. I'm sure it's something simple that I'm missing, but I just can't find it. I'd appreciate any help you can provide.
Here's the code:
import psycopg2
import sys
from os import listdir
from os.path import isfile, join
import csv
import re
import io

try:
    con = db_connect("dbname = '[redacted]' user = '[redacted]' password = '[redacted]' host = '[redacted]'")
except:
    print("Can't connect to database.")
    sys.exit(1)

cur = con.cursor()
upload_file = io.StringIO()

file_list = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in file_list:
    id_match = re.search(r'.*-(\d+)\.csv', file)
    if id_match:
        id = id_match.group(1)
        file_name = format(id_match.group())
        with open(mypath+file_name) as fh:
            id_reader = csv.reader(fh)
            next(id_reader, None)  # Skip the header row
            for row in id_reader:
                [stuff goes here to get desired values from file]
                if upload_file.getvalue() != '': upload_file.write('\n')
                upload_file.write('{0}\t{1}\t{2}'.format(id, [val1], [val2]))

print(upload_file.getvalue())  # prints output that looks like I expect it to,
                               # with thousands of rows that seem to have the right values in the right fields
cur.copy_from(upload_file, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()

if con:
    con.close()
This runs without error, but a select query in psql still shows no records in the table. What am I missing?
Edit: I ended up giving up and writing it to a temporary file, and then uploading the file. This worked without any trouble...I'd obviously rather not have the temporary file though, so I'm happy to have suggestions if someone sees the problem.
When you write to an io.StringIO (or any other file) object, the file pointer remains at the position of the last character written. So, when you do
f = io.StringIO()
f.write('1\t2\t3\n')
s = f.readline()
the file pointer stays at the end of the file and s contains an empty string.
To read (not getvalue) the contents, you must reposition the file pointer to the beginning, e.g. use seek(0)
upload_file.seek(0)
cur.copy_from(upload_file, '[my_table]', columns = ('id', 'col_1', 'col_2'))
This allows copy_from to read from the beginning and import all the lines in your upload_file.
Don't forget that you read and keep all the files in memory, which might work for a single small import, but may become a problem when doing large imports or multiple imports in parallel.
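To address that memory note (my addition, not part of the original answer), one option is to flush the buffer to the database once per file and reuse it, roughly like this:
upload_file = io.StringIO()

for file in file_list:
    # ... build the tab-separated rows for this one file into upload_file ...
    upload_file.seek(0)
    cur.copy_from(upload_file, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
    con.commit()
    # empty the buffer so the next file starts from a clean StringIO
    upload_file.seek(0)
    upload_file.truncate(0)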
