I have written code which basically has the following structure:
Step 1: import libraries
Step 2: read multiple input files
>>> os.getcwd()
'C:\\Users\\User\\Downloads\\Input\\Daily'
>>> df_delhi = pd.read_excel("Delhi.xlsx")
>>> df_mea = pd.read_excel("MEA.xlsx", sheet_name='MTD NSV_Volumes', usecols='B:D', skiprows=3)
Step 3: calculations on each file
Step 4: output in excel file.
I have a condition here: out of 20-odd files, if any one file is missing, I need to ignore it and move ahead with the code. I also need to skip the calculations for that file which come later on in the code.
How do I achieve this?
I'm not sure any code needs to be pasted here, since it's just basic file reads and calculations.
Look into Exceptions (official Python tutorial: https://docs.python.org/3/tutorial/errors.html). By expecting and catching a FileNotFoundError you could neatly ignore that file, and only work on files which you find.
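For example, a minimal sketch reusing the read_excel calls from the question (the dict of per-file arguments is illustrative):
import pandas as pd

# filename -> extra keyword arguments for read_excel (illustrative; extend to all 20-odd files)
files_to_read = {
    "Delhi.xlsx": {},
    "MEA.xlsx": {"sheet_name": "MTD NSV_Volumes", "usecols": "B:D", "skiprows": 3},
}

dataframes = {}
for fname, kwargs in files_to_read.items():
    try:
        dataframes[fname] = pd.read_excel(fname, **kwargs)
    except FileNotFoundError:
        print(fname + " not found, skipping it and its calculations")

# later on, only run the calculations for the files that were actually loaded
for fname, df in dataframes.items():
    pass  # per-file calculations go here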
Looks like you need to loop over a directory and read the files available in it.
import os
import pandas as pd

# each file gets different arguments while reading
file_params = {
    "Delhi.xlsx": {},
    "MEA.xlsx": {"sheet_name": 'MTD NSV_Volumes', 'usecols': 'B:D', 'skiprows': 3},
}

folder = 'C:\\Users\\User\\Downloads\\Input\\Daily'
data = {}
for file_ in os.listdir(folder):
    if not file_.endswith(".xlsx"):
        continue
    data[file_.replace(".xlsx", "")] = pd.read_excel(os.path.join(folder, file_),
                                                     **file_params.get(file_, {}))

# dataframes are contained inside the dict:
# data['Delhi'], data['MEA'], ...
Update: Sorry, it seems my question wasn't asked properly. I am analyzing a transportation network consisting of more than 5000 links. All the data is included in a big CSV file. I have several JSON files, each of which consists of a subset of this network. I am trying to loop through all the JSON files INDIVIDUALLY (i.e. not trying to concatenate them), read each JSON file, extract the corresponding information from the CSV file, perform a calculation, and save the result along with the name of the file in a new dataframe. Something like this:
(screenshot of the desired output table: one row per JSON file, with the file name and the computed % of truck)
This is the code I wrote, but I'm not sure if it's efficient enough.
import os
import glob
import json
import pandas as pd

name = []
percent_of_truck = []
path_to_json = r'\\directory'  # path to the folder with the JSON files

z = glob.glob(os.path.join(path_to_json, '*.json'))
for i in z:
    with open(i, 'r') as myfile:
        l = json.load(myfile)
    name.append(i)
    d_2019 = final.loc[final['LINK_ID'].isin(l)]  # retrieve data from the main CSV file ('final' is read earlier)
    avg_m = (d_2019['AADTT16'] / d_2019['AADT16'] * d_2019['Length']).sum() / d_2019['Length'].sum()  # calculation
    percent_of_truck.append(avg_m)

f = pd.DataFrame()
f['Name'] = name
f['% of truck'] = percent_of_truck
I'm assuming here you just want a dictionary of all the JSON. If so, use the json library (import json); this code may be of use:
import json

def importSomeJSONFile(f):
    with open(f) as fp:
        return json.load(fp)

# make sure the file exists in the same directory
example = importSomeJSONFile("example.json")
print(example)

# access a value within this, replacing "name" with whatever key you want
print(example["name"])
Since you haven't added any schema or other specific requirements, you can follow this approach to solve your problem in any language you prefer (a sketch in Python is shown after the steps):
Get the directory of the JSON files which need to be read
Get the list of all files present in the directory
For each file name returned in Step 2:
Read the file
Parse the JSON from the string
Perform the required calculation
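A minimal sketch of those steps, reusing the dataframe final and the column names from the question's code; 'network.csv' and the folder path are placeholders:
import os
import glob
import json
import pandas as pd

final = pd.read_csv('network.csv')   # the big CSV with all 5000+ links
json_folder = r'\\directory'         # Step 1: directory of the JSON files

results = []
for json_path in glob.glob(os.path.join(json_folder, '*.json')):  # Steps 2-3
    with open(json_path) as fp:
        link_ids = json.load(fp)     # Steps 4-5: read the file and parse the JSON
    subset = final.loc[final['LINK_ID'].isin(link_ids)]
    # Step 6: length-weighted average truck percentage, as in the question
    avg_m = (subset['AADTT16'] / subset['AADT16'] * subset['Length']).sum() / subset['Length'].sum()
    results.append({'Name': os.path.basename(json_path), '% of truck': avg_m})

f = pd.DataFrame(results)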
I'm currently generating three different xml files, and I would like to have the second and third file have the same date/time as the first file.
In the first file, I do
import datetime
time = datetime.datetime.utcnow().strftime('%y%m%d%H%M%S')
This gives me the format I would like. I've tried multiple approaches such as storing it in a different variable and importing it to the second and third files, but it seems that it'll always keep the actual current time and not the time of the first file. I don't know if there's a solution to my problem using the datetime module but if anyone has any ideas that would be wonderful.
Whenever you call that function, whether directly or through an import, it will always run again and give a new "now".
If the same program just uses that string 3 times there shouldn't be a problem, but if you're running 3 different scripts you will get 3 different dates!
To avoid this, I would save the first generated string to a file:
with open('.tmpdate', 'w') as f:
    f.write(time)
And read it in the next two files:
with open('.tmpdate') as f:
    time = f.read()
And finally, just to clean up after yourself, you can delete that file after it was used for the 3rd time with os.remove('.tmpdate') (you need to import os before that, of course)
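Putting it together, a rough sketch of the first and last scripts (the script names and XML steps are placeholders):
# first_script.py: create the timestamp once and save it for the other scripts
import datetime

time = datetime.datetime.utcnow().strftime('%y%m%d%H%M%S')
with open('.tmpdate', 'w') as f:
    f.write(time)
# ... generate the first XML file using `time` ...

# third_script.py: reuse the saved timestamp, then clean up
import os

with open('.tmpdate') as f:
    time = f.read()
# ... generate the third XML file using `time` ...
os.remove('.tmpdate')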
I am trying to make a script that requires a user to input at least 12 different values in order to function. I thought this was somewhat impractical, so I decided to make a function that would generate a dict from a .csv file designed with two columns: variables and their respective values. The user could use a provided .csv file as a template, fill it in with all their necessary values, save it as their own .csv file, and then run it with the script.
Although this sounds simple in theory, I have found that it is not working quite so well in practice. Because some of the input values will be text with a lot of periods in them ("..."), they are sometimes converted into the unicode representing a horizontal ellipsis (\xe2\x80\xa6). Also, a UTF-8 byte order mark will occur at the beginning of the first column and row (which can be designated by codecs.BOM_UTF8) and must be removed. In other cases, the delimiter of the .csv file was changed so that tabs were recognized as separating cells, or the contents of each row were collapsed from two cells to one.
I have no experience with the different forms of encoding or what any of them entail, but from what I have tested, it seems that opening the .csv template file in Excel or using different settings when opening your .csv file causes such problems. It's also possible that copying and pasting the values from other places brings hidden characters with them. I have been trying to fix the problems but then new problems keep springing up, and I feel like it's possible that my current approach is just wrong.
Can anybody recommend me a different, more efficient approach for allowing a user to enter in multiple inputs in one go? Or should I stick to my original approach and figure out how to keep the .csv formatting as rigorous as possible?
You can always use the csv module to abstract away most of the CSV oddities (although you will have to enforce the basic format):
import csv
import sys

def main(argv):
    if len(argv) < 2:
        print("Please provide path to your CSV template as the first argument.")
        return 1
    with open(argv[1], "r") as f:
        reader = csv.DictReader(f)
        your_vars = next(reader)
    print(your_vars)  # prints a dictionary of all CSV vars
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
NOTE: This requires the first row to hold the variables, while the second holds their values.
So all users have to do is call the script with python your_script.py their_file.csv and in most cases it will print out a dict with the values... However, Excel is notoriously bad at handling unicode CSVs, and if your users use it as their primary spreadsheet app they're likely to encounter issues. Some of that can be rectified by installing the unicodecsv module and using it as a drop-in replacement (import unicodecsv as csv), but if your users start going wild with the format, eventually it will break.
If you're looking for suggestions on formats, one of the most user-friendly formats you can use is YAML and there are several parsers available for Python - they largely work the same for the simple stuff like this but I'd recommend using the ruamel.yaml module as it's actively maintained.
Then you can create a YAML template like:
---
var1: value1
var2: value2
var3: value3
etc: add as many as you want
And your users can fill in the values in a simple text editor, then to replicate the above CSV behavior all you need is:
import yaml
import sys

def main(argv):
    if len(argv) < 2:
        print("Please provide path to your YAML template as the first argument.")
        return 1
    with open(argv[1], "r") as f:
        your_vars = yaml.safe_load(f)
    print(your_vars)  # prints a dictionary of all YAML vars
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
A bonus is that YAML is a plain-text format, so your users don't need fancy editors and have less of a chance to screw up. Of course, while YAML is permissive, it still requires a modicum of well-formedness, so be sure to include the usual checks (does the file exist, can it be opened, can it be parsed, etc.).
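For example, a rough sketch of those checks (the function name is illustrative; uses PyYAML's safe_load as above):
import os
import yaml

def load_vars(path):
    # the usual checks: file exists, can be opened, can be parsed
    if not os.path.isfile(path):
        print("Template not found: %s" % path)
        return None
    try:
        with open(path, "r") as f:
            return yaml.safe_load(f)
    except OSError as e:
        print("Could not open %s: %s" % (path, e))
    except yaml.YAMLError as e:
        print("Could not parse %s: %s" % (path, e))
    return None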
I'm trying to create a Python Script but I'm a bit stuck and can't find what I'm looking for on a Google search as it's quite specific.
I need to run a script on two .log files (auth.log and access.log) to view the following information:
Find how many attempts were made with the bin account
So how many attempts the bin account made to try and get into the server.
The logs come from a server that was hacked, and I need to identify how it was done and who is responsible.
Would anyone be able to give me some help in how I go about doing this? I can provide more information if needed.
Thanks in advance.
Edit:
I've managed to print all the times 'bin' appears in the log which is one way of doing it. Does anyone know if I can count how many times 'bin' appears as well?
with open("auth.log") as f:
for line in f:
if "bin" in line:
print line
Given that you work with system logs and their format is known and stable, my approach would be something like:
identify a set of keywords (either common, or one per log)
for each log, iterate line by line
once a keyword matches, add the relevant information from that line to e.g. a dictionary
You could use shell tools (like grep, cut and/or awk) to pre-process the log and extract relevant lines from the log (I assume you only need e.g. error entries).
You can use something like the sketch below as a starting point.
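A rough sketch, assuming you just want to count and collect the lines in auth.log that mention the bin account (the keyword list and dictionary layout are illustrative):
# keywords to look for, e.g. one per log file (illustrative)
keywords = ["bin"]

matches = {kw: [] for kw in keywords}

with open("auth.log") as f:
    for line in f:
        for kw in keywords:
            if kw in line:
                matches[kw].append(line.rstrip())

# number of attempts that mention the bin account
print("bin appears in %d lines" % len(matches["bin"]))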
If you want to use a tool, you can use the ELK stack (Elasticsearch, Logstash and Kibana).
If not, then you have to read the log file first and apply a regex according to your requirement.
In case you are interested in extracting some data and saving it to a .txt file, the following sample code might be helpful:
import re

expDate = '2018-11-27'
expTime = '11-21-09'

infile = r"/home/xenial/Datasets/CIVIT/Nov_27/rover/NMND17420010S_" + expDate + "_" + expTime + ".LOG"
keep_phrases = ["FINESTEERING"]

with open(infile) as f:
    lines = f.readlines()

with open('/home/xenial/Datasets/CIVIT/Nov_27/rover/GPS_' + expDate + '_' + expTime + '.txt', 'w') as outfile:
    outfile.write("gpsWeek,gpsSOW\n")
    for line in lines:
        for phrase in keep_phrases:
            if phrase in line:
                # GPS week is the first number after the keyword
                gpsWeekStr = re.findall(r'FINESTEERING,(\d+)', line)[0]
                # GPS seconds-of-week is the float that follows the week
                gpsSOWStr = re.findall(r'FINESTEERING,' + gpsWeekStr + r',(\d+\.\d*)', line)[0]
                outfile.write(gpsWeekStr + ',' + gpsSOWStr + '\n')
                break

print("------------------------------------")
In my case, FINESTEERING was an interesting keyword in my .log file to extract numbers, including GPS_Week and GPS_Seconds_of_Weeks. You may modify this code to suit your own application.
I have around 2000 JSON files which I'm trying to run through a Python program. A problem occurs when a JSON file is not in the correct format. (Error: ValueError: No JSON object could be decoded) In turn, I can't read it into my program.
I am currently doing something like the below:
for files in folder:
    with open(files) as f:
        data = json.load(f)  # it causes an error at this part
I know there's offline methods to validating and formatting JSON files but is there a programmatic way to check and format these files? If not, is there a free/cheap alternative to fixing all of these files offline i.e. I just run the program on the folder containing all the JSON files and it formats them as required?
SOLVED using @reece's comment:
import os
import simplejson

invalid_json_files = []
read_json_files = []

def parse():
    for files in os.listdir(os.getcwd()):
        with open(files) as json_file:
            try:
                simplejson.load(json_file)
                read_json_files.append(files)
            except ValueError as e:
                print("JSON object issue: %s" % e)
                invalid_json_files.append(files)
    print(invalid_json_files, len(read_json_files))
Turns out that I was saving a file which is not in JSON format in my working directory which was the same place I was reading data from. Thanks for the helpful suggestions.
The built-in JSON module can be used as a validator:
import json

def parse(text):
    try:
        return json.loads(text)
    except ValueError as e:
        print('invalid json: %s' % e)
        return None  # or: raise
You can make it work with files by using:
with open(filename) as f:
    return json.load(f)
instead of json.loads and you can include the filename as well in the error message.
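For instance, a sketch of the file-based variant with the filename included in the error message (the function name is illustrative):
import json

def parse_file(filename):
    # file-based variant that reports which file failed to parse
    with open(filename) as f:
        try:
            return json.load(f)
        except ValueError as e:
            print('%s: invalid json: %s' % (filename, e))
            return None  # or: raise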
On Python 3.3.5, for {test: "foo"}, I get:
invalid json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
and on 2.7.6:
invalid json: Expecting property name: line 1 column 2 (char 1)
This is because the correct json is {"test": "foo"}.
When handling the invalid files, it is best to not process them any further. You can build a skipped.txt file listing the files with the error, so they can be checked and fixed by hand.
If possible, you should check the site/program that generated the invalid json files, fix that and then re-generate the json file. Otherwise, you are going to keep having new files that are invalid JSON.
Failing that, you will need to write a custom json parser that fixes common errors. With that, you should be putting the original under source control (or archived), so you can see and check the differences that the automated tool fixes (as a sanity check). Ambiguous cases should be fixed by hand.
Yes, there are ways to validate that a JSON file is valid. One way is to use a JSON parsing library that will throw exceptions if the input you provide is not well-formatted.
try:
    load_json_file(filename)
except InvalidDataException:  # or something
    pass  # oops, guess it's not valid
Of course, if you want to fix it, you naturally cannot use a JSON loader since, well, it's not valid JSON in the first place. Unless the library you're using will automatically fix things for you, in which case you probably wouldn't even have this question.
One way is to load the file manually and tokenize it and attempt to detect errors and try to fix them as you go, but I'm sure there are cases where the error is just not possible to fix automatically and would be better off throwing an error and asking the user to fix their files.
I have not written a JSON fixer myself so I can't provide any details on how you might go about actually fixing errors.
However, I am not sure whether it would be a good idea to fix all errors, since then you'd have to assume your fixes are what the user actually wants. If it's a missing comma or an extra trailing comma, then that might be OK, but there may be cases where it is ambiguous what the user wants.
Here is a full python3 example for the next novice python programmer that stumbles upon this answer. I was exporting 16000 records as json files. I had to restart the process several times so I needed to verify that all of the json files were indeed valid before I started importing into a new system.
I am no python programmer so when I tried the answers above as written, nothing happened. Seems like a few lines of code were missing. The example below handles files in the current folder or a specific folder.
verify.py
import json
import os
import sys
from os.path import isfile, join

# check if a folder name was specified
if len(sys.argv) > 1:
    folder = sys.argv[1]
else:
    folder = os.getcwd()

# arrays to hold invalid and valid files
invalid_json_files = []
read_json_files = []

def parse():
    # loop through the folder
    for files in os.listdir(folder):
        # check if the combined path and filename is a file
        if isfile(join(folder, files)):
            # open the file
            with open(join(folder, files)) as json_file:
                # try reading the json file using the json interpreter
                try:
                    json.load(json_file)
                    read_json_files.append(files)
                except ValueError as e:
                    # if the file is not valid, print the error
                    # and add the file to the list of invalid files
                    print("JSON object issue: %s" % e)
                    invalid_json_files.append(files)
    print(invalid_json_files)
    print(len(read_json_files))

parse()
Example:
python3 verify.py
or
python3 verify.py somefolder
Tested with Python 3.7.3.
It was not clear to me how to provide the path to the file folder, so I'd like to provide an answer with this option.
import glob
import pandas as pd

path = r'C:\Users\altz7\Desktop\your_folder_name'  # use your path
all_files = glob.glob(path + "/*.json")

data_list = []
invalid_json_files = []

for filename in all_files:
    try:
        df = pd.read_json(filename)
        data_list.append(df)
    except ValueError:
        invalid_json_files.append(filename)

print("Files in correct format: {}".format(len(data_list)))
print("Not readable files: {}".format(len(invalid_json_files)))

# df = pd.concat(data_list, axis=0, ignore_index=True)  # will create a pandas dataframe
# from the readable files, if you like