I wrote some code, and some values for ecd are missing in the XML. I would like to record them as 'None' or 0000 so that I can build a dataframe. Unfortunately, the code runs until it reaches the missing value and then crashes, and I cannot spot the mistake.
The error message:
File "extra.py", line 236, in <module>
if dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'] != None:
KeyError: 'ecdTdAv'
Code:
import xmltodict

xml_file = 'C:\\Users\\jtfra\\Desktop\\Thesis\\Volve_Real_Time_DData\\WITSML Realtime drilling data\\Norway-Statoil-NO 15_$47$_9-F-11\\1\\mudLog\\1.xml'

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f: # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return d
dic = convert(xml_file)
mdTop, ecd = [], []
for i in range(len(dic['mudLogs']['mudLog']['geologyInterval'])):
    mdTop.append(dic['mudLogs']['mudLog']['geologyInterval'][i]['mdTop']['#text'])
    if dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'] != None:
        ecd.append(dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'])
    else:
        ecd.append('None')
print(ecd)
Instead of accessing it as:
dic['mudLogs']['mudLog']['geologyInterval'][0]['ecdTdAv']
do:
dic['mudLogs']['mudLog']['geologyInterval'][0].get('ecdTdAv', '0000')
or similar.
You can also check if the key is present with:
if 'ecdTdAv' in dic['mudLogs']['mudLog']['geologyInterval'][i]:
    # do something with it, e.g.:
    print(dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'])
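Applied to your loop, a minimal sketch (it keeps your convention of appending the string 'None' when the value is missing):

intervals = dic['mudLogs']['mudLog']['geologyInterval']
mdTop, ecd = [], []
for interval in intervals:
    mdTop.append(interval['mdTop']['#text'])
    # .get() returns None instead of raising KeyError when 'ecdTdAv' is missing
    entry = interval.get('ecdTdAv')
    if entry is not None:
        ecd.append(entry['#text'])
    else:
        ecd.append('None')
print(ecd)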
I'm trying to make my life easier at work by writing down errors and the solutions to those errors. The program itself works fine for adding new errors, but then I added a function to verify whether an error already exists in the file and then do something with it (not added yet).
The function doesn't work and I don't know why. I tried to debug it but still couldn't find the error; maybe it's a conceptual mistake?
Anyway, here's my entire code.
import sys
import os

err = {}
PATH = 'C:/users/userdefault/desktop/errordb.txt'

#def open_file(): #Not yet used
#    file_read = open(PATH, 'r')
#    return file_read

def verify_error(error_number, loglist): #Verify if error exists in file
    for error in loglist:
        if error_number in loglist:
            return True

def dict_error(error_number, solution): #Puts input errors in dict
    err = {error_number: solution}
    return err

def verify_file(): #Verify if file exists. Return True if it does
    archive = os.path.isfile(PATH)
    return archive

def new_error():
    file = open(PATH, 'r') #Opens file in read mode
    loglist = file.readlines()
    file.close()
    found = False
    error_number = input("Error number: ")
    if verify_error(error_number, loglist) == True:
        found = True
        # Add new solution, or another solution.
        pass
    solution = str(input("Solution: "))
    file = open(PATH, 'a')
    error = dict_error(error_number, solution)
    #Writes dict on file
    file.write(str(error))
    file.write("\n")
    file.close()

def main():
    verify = verify_file() #Verify if file exists
    if verify == True:
        new = str.lower(input("New job Y/N: "))
        if new == 'n':
            sys.exit()
        while new == 'y':
            new_error()
            new = str.lower(input("New job Y/N: "))
        else:
            sys.exit()
    else:
        file = open(PATH, "x")
        file.close()
        main()

main()
To clarify, the program executes fine; it doesn't return an error code. It just won't execute the way I intended: it's supposed to verify whether a certain error number already exists.
Thanks in advance :)
The issue, I believe, is that you're not creating a single dictionary object in the file and modifying it. Instead, you create an additional dictionary every time an error is added and then read them all back as a list of strings via the .readlines() method.
An easier way of doing it would be to create a dictionary if one doesn't exist and append errors to it. I've made a few modifications to your code which should help.
import sys
import os
import json # Import json and use it as the format to store our data in

err = {}
PATH = 'C:/users/userdefault/desktop/errordb.txt'

# You can achieve this by using a context manager
#def open_file(): #Not yet used
#    file_read = open(PATH, 'r')
#    return file_read

def verify_error(error_number, loglist): #Verify if error exists in file
    # Notice how we're looping over keys of your dictionary to check if
    # an error already exists.
    # To access values use loglist[k]
    for k in loglist.keys():
        if error_number == k:
            return True
    return False

def dict_error(loglist, error_number, solution): #Puts input errors in dict
    # Instead of returning a new dictionary, return the existing one
    # with the new error appended to it
    loglist[error_number] = solution
    return loglist

def verify_file(): #Verify if file exists. Return True if it does
    archive = os.path.isfile(PATH)
    return archive

def new_error():
    # Let's move all the variables to the top, makes it easier to read the function
    # Changes made:
    # 1. Changed the way we open and read files, now using a context manager (i.e. with open(...) as f:)
    # 2. Added a json parser to store in and read from file in a json format. If data doesn't exist (new file?) create a new dictionary object instead
    # 3. Added an exception to signify that an error has been found in the database (this can be removed to add additional logic if you'd like to do more stuff to the error, etc)
    # 4. Changed the way we write to file, instead of appending a new line we now override the contents with a new updated dictionary that has been serialized into a json format
    found = False
    loglist = None
    # Open file as read-only using a context manager, now we don't have to worry about closing it manually
    with open(PATH, 'r') as f:
        # Lets read the file and run it through a json parser to get a python dictionary
        try:
            loglist = json.loads(f.read())
        except json.decoder.JSONDecodeError:
            loglist = {}
    error_number = input("Error number: ")
    if verify_error(error_number, loglist) is True:
        found = True
        raise Exception('Error exists in the database') # Raise exception if you want to stop loop execution
    # Add new solution, or another solution.
    solution = str(input("Solution: "))
    # This time open in write only and replace the dictionary
    with open(PATH, 'w') as f:
        loglist = dict_error(loglist, error_number, solution)
        # Writes dict on file in json format
        f.write(json.dumps(loglist))

def main():
    verify = verify_file() #Verify if file exists
    if verify == True:
        new = str.lower(input("New job Y/N: "))
        if new == 'n':
            sys.exit()
        while new == 'y':
            new_error()
            new = str.lower(input("New job Y/N: "))
        else:
            sys.exit()
    else:
        with open(PATH, "x") as f:
            pass
        main()

main()
Note that you will have to create a new errordb file for this snippet to work.
Hope this has helped somehow. If you have any further questions hit me up in the comments!
References:
Reading and Writing files in Python
JSON encoder and decoder in Python
I think that there may be a couple of problems with your code, but the first thing I noticed was that you are saving error numbers and solutions as a dictionary in errordb.txt, and when you read them back in, you are reading them back as a list of strings:
The line:
loglist = file.readlines()
in new_error returns a list of strings. This means that verify_error will always return False.
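To see why, consider what loglist looks like after .readlines() (hypothetical file contents, for illustration only):

# errordb.txt holds one stringified dict per line, e.g. {'404': 'check the URL'}
loglist = ["{'404': 'check the URL'}\n", "{'500': 'restart the service'}\n"]
print('404' in loglist)     # False: '404' never equals a whole line
print('404' in loglist[0])  # True: substring test against one line

The in operator on a list compares against whole elements, which is why the lookup never succeeds.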
So you have a couple of choices:
You could modify verify_error to the following:
def verify_error(error_number, loglist): #Verify if error exists in file
    for error in loglist:
        if error_number in error:
            return True
    return False
Although, I think a better solution would be to load errordb.txt as a JSON file; then you'll have a dictionary. That would look something like:
import json

errordb = {}
with open(PATH) as handle:
    errordb = json.load(handle)
So here is the full set of changes I would make:
import json

def verify_error(error_number, loglist): #Verify if error exists in file
    for error in loglist:
        if error_number in error:
            return True
    return False

def new_error():
    errordb = list()
    existing = list()
    with open(PATH) as handle:
        existing = json.load(handle)
    errordb += existing
    error_number = input("Error number: ")
    if verify_error(error_number, errordb) == True:
        # Add new solution, or another solution.
        print("I might do something here.")
    else:
        solution = str(input("Solution: "))
        errordb.append({error_number: solution})
    #Writes dict on file
    with open(PATH, "w") as handle:
        json.dump(errordb, handle)
So I'm just trying to make a simple script that filters emails with different domains. It works great, but I need a shortcut, because I don't want to write if and elif statements so many times. Can anyone tell me how to write my script with a function so that it becomes shorter and easier? Thanks in advance. The script is below:
f_location = 'C:/Users/Jack The Reaper/Desktop/mix.txt'
text = open(f_location)
good = open('C:/Users/Jack The Reaper/Desktop/good.txt','w')
for line in text:
    if '#yahoo' in line:
        yahoo = None
    elif '#gmail' in line:
        gmail = None
    elif '#yahoo' in line:
        yahoo = None
    elif '#live' in line:
        live = None
    elif '#outlook' in line:
        outlook = None
    elif '#hotmail' in line:
        hotmail = None
    elif '#aol' in line:
        aol = None
    else:
        if ' ' in line:
            good.write(line.strip(' '))
        elif '' in line:
            good.write(line.strip(''))
        else:
            good.write(line)
text.close()
good.close()
I would suggest you to use dict for this instead of having separate variables for all the cases.
my_dict = {}
...
if '#yahoo' in line:
    my_dict['yahoo'] = None
But if you want to do it the way you described in the question, you can do it as shown below:
email_domains = ['#yahoo', '#gmail', '#live', '#outlook', '#hotmail', '#aol']
for e in email_domains:
    if e in line:
        locals()[e[1:]] = None
        #if you use dict, use the below line
        #my_dict[e[1:]] = None
locals() returns a dictionary of the current namespace. The keys in this dict are the variable names and the values are the variables' values.
So locals()['gmail'] = None creates a variable named gmail (if it doesn't exist) and assigns None to it. Note that this only works reliably at module level; inside a function, writing to locals() is not guaranteed to create or update a real local variable, which is one more reason to prefer the dict approach.
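Putting the dict approach together with your loop, a minimal sketch (same file paths as in your question):

email_domains = ['#yahoo', '#gmail', '#live', '#outlook', '#hotmail', '#aol']
my_dict = {}
with open(f_location) as text, open('C:/Users/Jack The Reaper/Desktop/good.txt', 'w') as good:
    for line in text:
        for e in email_domains:
            if e in line:
                my_dict[e[1:]] = None  # one dict entry per matched domain
                break
        else:
            # no known domain matched, so keep the line
            good.write(line)

The for/else construct runs the else block only when the inner loop finished without hitting break.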
As you stated the problem and provided a sample file, I have two solutions: a one-line solution and a detailed one.
First, let's import the re module and define the regex pattern:
import re
pattern=r'.+#(?!gmail|yahoo|aol|hotmail|live|outlook).+'
Now the detailed version:
emails = []
with open('emails.txt', 'r') as f:
    for line in f:
        match = re.finditer(pattern, line)
        for find in match:
            emails.append(find.group())

with open('result.txt', 'w') as f:
    f.write('\n'.join(emails))
output in result.txt file :
nic-os9#gmx.de
angelique.charuel#sfr.fr
nannik#interia.pl
l.andrioli#freenet.de
kamil_sieminski8#o2.pl
hugo.lebrun.basket#orange.fr
The one-line solution, if you want it short:
with open('results.txt', 'w') as file:
    file.write('\n'.join([find.group() for line in open('emails.txt', 'r') for find in re.finditer(pattern, line)]))
output:
nic-os9#gmx.de
angelique.charuel#sfr.fr
nannik#interia.pl
l.andrioli#freenet.de
kamil_sieminski8#o2.pl
hugo.lebrun.basket#orange.fr
P.S.: with the one-line solution the input file is not closed explicitly. CPython usually cleans that up when the file object is garbage-collected, but it isn't guaranteed, so if you want you can use a context manager instead.
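For deterministic cleanup, a sketch of the one-liner with both files managed:

with open('emails.txt', 'r') as src, open('results.txt', 'w') as dst:
    dst.write('\n'.join(find.group() for line in src for find in re.finditer(pattern, line)))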
I am doing a project analyzing tweets for an Urban Policy class. The purpose of this script is to parse out certain information from JSON files that a colleague downloaded. Here's a link to a sample Tweet I am trying to parse:
https://www.dropbox.com/s/qf1e06601m2mrxr/5thWardChicago.0.json?dl=0
I had a friend of mine test the following script in some version of Python 2 (Windows) and it worked. However, my machine (Windows 10) is running a recent version of Python 3, and it's not working for me.
import json
import collections
import sys, os
import glob
from datetime import datetime
import csv

def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

def to_ilan_csv(json_files):
    # write the column headers
    csv_writer = csv.writer(open("test.csv", "w"))
    headers = ["tweet_id", "handle", "username", "tweet_text", "has_image", "image_url", "created_at", "retweets", "hashtags", "mentions", "isRT", "isMT"]
    csv_writer.writerow(headers)
    # open the JSON files we stored and parse them into the CSV file we're working on
    try:
        #json_files = glob.glob(folder)
        print("Parsing %s files." % len(json_files))
        for file in json_files:
            f = open(file, 'r')
            if f != None:
                for line in f:
                    # hack to avoid the trailing \n at the end of the file - sticking point LH 4/7/16
                    if len(line) > 3:
                        i = 0
                        tweets = convert(json.loads(line))
                        for tweet in tweets:
                            has_media = False
                            is_RT = False
                            is_MT = False
                            hashtags_list = []
                            mentions_list = []
                            media_list = []
                            entities = tweet["entities"]
                            # old tweets don't have key "media" so need a workaround
                            if entities.has_key("media"):
                                has_media = True
                                for item in entities["media"]:
                                    media_list.append(item["media_url"])
                            for hashtag in entities["hashtags"]:
                                hashtags_list.append(hashtag["text"])
                            for user in entities["user_mentions"]:
                                mentions_list.append(user["screen_name"])
                            if tweet["text"][:2] == "RT":
                                is_RT = True
                            if tweet["text"][:2] == "MT":
                                is_MT = True
                            values = [
                                tweet["id_str"],
                                tweet["user"]["id_str"],
                                tweet["user"]["screen_name"],
                                tweet["text"],
                                has_media,
                                ','.join(media_list) if len(media_list) > 0 else "",
                                datetime.strptime(tweet["created_at"], '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d %H:%M:%S'),
                                tweet["retweet_count"],
                                ','.join(hashtags_list) if len(hashtags_list) > 0 else "",
                                ','.join(mentions_list) if len(mentions_list) > 0 else "",
                                is_RT,
                                is_MT
                            ]
                            csv_writer.writerow(values)
                    else:
                        continue
            f.close()
    except:
        print("Something went wrong. Quitting.")
        for i in sys.exc_info():
            print(i)

def parse_tweets():
    file_names = []
    file_names.append("C:\\Users\\Adam\\Downloads\\Test Code\\sample1.json")
    file_names.append("C:\\Users\\Adam\\Downloads\\Test Code\\sample2.json")
    to_ilan_csv(file_names)
Then I execute it by simply calling:
parse_tweets()
But I get the following error:
Parsing 2 files.
Something went wrong. Quitting.
<class 'UnicodeDecodeError'>
'charmap' codec can't decode byte 0x9d in position 3338: character maps to <undefined>
<traceback object at 0x0000016CCFEE5648>
I sought help from a CS friend of mine but he was unable to diagnose the problem. So I've come here.
MY QUESTION
What is this error and why is it only arising in Python 3 instead of Python 2?
For those who want to try it, the code as presented should run in a Jupyter notebook, using the copy of the file at the Dropbox link I provided.
Sooo, after a bit of debugging in chat, here's the solution:
Apparently, the file OP was using was not correctly recognized as UTF-8, so iterating over the file (with for line in f) raised the UnicodeDecodeError from the cp1252 encoding module. We fixed that by explicitly opening the file as UTF-8:
f = open(file, 'r', encoding='utf-8')
After we did that, the file could be opened correctly, and OP ran into the Python 3 issues we had all been expecting. The following three issues came up:
'dict' object has no attribute 'iteritems'
dict.iteritems() no longer exists in Python 3, so we just switch to dict.items() here:
return {convert(key): convert(value) for key, value in input.items()}
name 'unicode' is not defined
Unicode is no longer a separate type in Python 3; the normal string type is already capable of Unicode, so we just delete this case:
elif isinstance(input, unicode):
    return input.encode('utf-8')
'dict' object has no attribute 'has_key'
To check whether a key exists in a dictionary, we use the in operator, so the if check becomes the following:
if "media" in entities:
Afterwards, the code should run fine with Python 3.
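Putting the three fixes together, the converted helper might look like this (a sketch; in Python 3 the function is little more than a deep copy of the parsed JSON, so you could also drop it entirely):

def convert(input):
    # recursively rebuild dicts and lists; str needs no special handling in Python 3
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.items()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    else:
        return input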
My intention was to copy a piece of a string after either a colon or an equals sign from File 1, and paste that string into File 2 in a similar location after either a colon or an equals sign.
For instance, if File 1 has:
username: Stack
File 2 originally has an empty value:
username=
I want Stack to be copied over to File 2 after username. Currently, I'm stuck and not sure what to do. The piece of code I made below doesn't copy the username. I would greatly appreciate any input!
with open("C:/Users/SO//Downloads//f1.txt", "r") as f1:
with open("C:/Users/SO//Downloads//f2.txt", "r+") as f2:
searchlines = f1.readlines()
searchlines_f2=f2.readlines()
for i, line in enumerate(searchlines):
if 'username' in line:
for l in searchlines[i:i+1]:
ind = max(l.find(':'), l.find('='), 0) #finding index of specific characters
copy_string=l[ind+1:].strip() #copying string for file 2
for l in searchlines_f2[i:i+1]:
if 'username' in line:
f2.write(copy_string)
I think something like this will get you what you need in a more maintainable and Pythonic way.
Note the use of regex as well as some string methods (e.g., startswith).
import re

SOURCE_PATH = "C:/Users/SO//Downloads//f1.txt"
TARGET_PATH = "C:/Users/SO//Downloads//f2.txt"

def _get_lines(filepath):
    """ read `filepath` and return a list of strings """
    with open(filepath, "r+") as fh:
        return fh.readlines()

def _get_value(fieldname, text):
    """ parse `text` to get the value of `fieldname` """
    try:
        pattern = r'%s[:=]{1}\s?(.*)' % fieldname
        return re.match(pattern, text).group(1)
    except AttributeError:
        # re.match returned None (no match) - you may want to handle this differently!
        return None

def _write_target(filepath, trgt_lines):
    """ write `trgt_lines` to `filepath` """
    with open(filepath, "w+") as fh:
        fh.writelines(trgt_lines)

src_lines = _get_lines(SOURCE_PATH)
trgt_lines = _get_lines(TARGET_PATH)

# extract field values from source file
fields = ['username', 'id', 'location']
for field in fields:
    value = None
    for cur_src in src_lines:
        if cur_src.startswith(field):
            value = _get_value(field, cur_src)
            break
    # update target file w/ value (if we were able to find it)
    if value is not None:
        for i, cur_trgt in enumerate(trgt_lines):
            if cur_trgt.startswith('{0}='.format(field)):
                trgt_lines[i] = '{0}={1}'.format(field, value)
                break

_write_target(TARGET_PATH, trgt_lines)
OK. I have some background in Matlab and I'm now switching to Python.
I have this bit of code, under Python 2.6.5 on 64-bit Linux, which walks through directories, finds files named 'GeneralData.dat', retrieves some data from them, and stitches it into a new data set:
import pylab as p
import os, re
import linecache as ln

def LoadGenomeMeanSize(arg, dirname, files):
    for file in files:
        filepath = os.path.join(dirname, file)
        if filepath == os.path.join(dirname,'GeneralData.dat'):
            data = p.genfromtxt(filepath)
            if data[-1,4] != 0.0: # checking if data set is OK
                data_chopped = data[1000:-1,:] # removing some of data
                Grand_mean = data_chopped[:,2].mean()
                Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
            else:
                break
        if filepath == os.path.join(dirname,'ModelParams.dat'):
            l = re.split(" ", ln.getline(filepath, 6))
            turb_param = float(l[2])
            arg.append((Grand_mean, Grand_STD, turb_param))

GrandMeansData = []
os.path.walk(os.getcwd(), LoadGenomeMeanSize, GrandMeansData)
GrandMeansData = sorted(GrandMeansData, key=lambda data_sort: data_sort[2])

TheMeans = p.zeros((len(GrandMeansData), 3))
i = 0
for item in GrandMeansData:
    TheMeans[i,0] = item[0]
    TheMeans[i,1] = item[1]
    TheMeans[i,2] = item[2]
    i += 1

print TheMeans # just checking...
# later do some computation on TheMeans in NumPy
And it throws me this (though I would swear it was working a month ago):
Traceback (most recent call last):
File "/home/User/01_PyScripts/TESTtest.py", line 29, in <module>
os.path.walk(os.getcwd(), LoadGenomeMeanSize, GrandMeansData)
File "/usr/lib/python2.6/posixpath.py", line 233, in walk
walk(name, func, arg)
File "/usr/lib/python2.6/posixpath.py", line 225, in walk
func(arg, top, names)
File "/home/User/01_PyScripts/TESTtest.py", line 26, in LoadGenomeMeanSize
arg.append((Grand_mean, Grand_STD, turb_param))
UnboundLocalError: local variable 'Grand_mean' referenced before assignment
All right... so I went and did some reading and came up with this version using a global variable:
import pylab as p
import os, re
import linecache as ln

Grand_mean = p.nan
Grand_STD = p.nan

def LoadGenomeMeanSize(arg, dirname, files):
    for file in files:
        global Grand_mean
        global Grand_STD
        filepath = os.path.join(dirname, file)
        if filepath == os.path.join(dirname,'GeneralData.dat'):
            data = p.genfromtxt(filepath)
            if data[-1,4] != 0.0: # checking if data set is OK
                data_chopped = data[1000:-1,:] # removing some of data
                Grand_mean = data_chopped[:,2].mean()
                Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
            else:
                break
        if filepath == os.path.join(dirname,'ModelParams.dat'):
            l = re.split(" ", ln.getline(filepath, 6))
            turb_param = float(l[2])
            arg.append((Grand_mean, Grand_STD, turb_param))

GrandMeansData = []
os.path.walk(os.getcwd(), LoadGenomeMeanSize, GrandMeansData)
GrandMeansData = sorted(GrandMeansData, key=lambda data_sort: data_sort[2])

TheMeans = p.zeros((len(GrandMeansData), 3))
i = 0
for item in GrandMeansData:
    TheMeans[i,0] = item[0]
    TheMeans[i,1] = item[1]
    TheMeans[i,2] = item[2]
    i += 1

print TheMeans # just checking...
# later do some computation on TheMeans in NumPy
It does not give error messages. It even produces a file with data... but the data are bloody wrong! I checked some of them manually by running these commands:
import pylab as p
data = p.genfromtxt(filepath)
data_chopped = data[1000:-1,:]
Grand_mean = data_chopped[:,2].mean()
Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) \
+ sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
on selected files. They are different :-(
1) Can anyone explain to me what's wrong?
2) Does anyone know a solution to that?
I'll be grateful for help :-)
Cheers,
PTR
I would say this condition is not passing:
if filepath == os.path.join(dirname,'GeneralData.dat'):
which means you are not getting GeneralData.dat before ModelParams.dat. Maybe you need to sort the files alphabetically, or the file is not there.
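A minimal sketch of the sorting idea (assuming alphabetical order is what you want; 'GeneralData.dat' happens to sort before 'ModelParams.dat'):

def LoadGenomeMeanSize(arg, dirname, files):
    # sorted() makes the visit order deterministic within each directory
    for file in sorted(files):
        filepath = os.path.join(dirname, file)
        # ... rest of the body unchanged ...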
I see one issue with the code and the solution that you have provided.
Never hide a "variable referenced before assignment" issue by just making the variable visible.
Try to understand why it happened.
Prior to creating the global variable Grand_mean, you were getting an error saying that you were accessing Grand_mean before any value had been assigned to it. In such a case, initializing the variable outside the function and marking it as global only serves to hide the issue.
You see erroneous results because you have made the variable visible by making it global, but the issue still exists: your Grand_mean was never set to correct data.
This means that the section of code under "if filepath == os.path.join(dirname,..." was never executed.
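A minimal sketch of the safer pattern implied here (defaults local to the function, and skipping the append when no valid data was found; it still assumes GeneralData.dat is visited before ModelParams.dat, so combine it with the sorting suggestion above if that's not guaranteed):

def LoadGenomeMeanSize(arg, dirname, files):
    Grand_mean = None  # defaults local to this directory
    Grand_STD = None
    for file in files:
        filepath = os.path.join(dirname, file)
        if filepath == os.path.join(dirname, 'GeneralData.dat'):
            data = p.genfromtxt(filepath)
            if data[-1,4] != 0.0: # checking if data set is OK
                data_chopped = data[1000:-1,:]
                Grand_mean = data_chopped[:,2].mean()
                Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
        if filepath == os.path.join(dirname, 'ModelParams.dat'):
            l = re.split(" ", ln.getline(filepath, 6))
            turb_param = float(l[2])
            if Grand_mean is not None:  # only record directories with valid stats
                arg.append((Grand_mean, Grand_STD, turb_param))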
Using global is not the right solution. That only makes sense if you do in fact want to reference and assign to the global Grand_mean name. The need for disambiguation comes from the way the interpreter pre-scans function bodies for assignments: any name assigned anywhere in a function is treated as local throughout it.
You should start by assigning a default value to Grand_mean within the scope of LoadGenomeMeanSize(). Only one of the branches in a loop iteration actually assigns Grand_mean a semantically correct value. You are likely running into a case where
if filepath == os.path.join(dirname,'ModelParams.dat'): is true, but either
if filepath == os.path.join(dirname,'GeneralData.dat'): or if data[-1,4] != 0.0: is not. It's likely the second condition that is failing for you. Move the else: break clause accordingly.
The quick and dirty answer is that you probably need to rearrange your code like this:
...
if filepath == os.path.join(dirname,'GeneralData.dat'):
    data = p.genfromtxt(filepath)
    if data[-1,4] != 0.0: # checking if data set is OK
        data_chopped = data[1000:-1,:] # removing some of data
        Grand_mean = data_chopped[:,2].mean()
        Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
if filepath == os.path.join(dirname,'ModelParams.dat'):
    l = re.split(" ", ln.getline(filepath, 6))
    turb_param = float(l[2])
    arg.append((Grand_mean, Grand_STD, turb_param))
else:
    break
...