Cannot pull data from XML files due to differences in format

Cannot pull data from XML files due to differences in format - python

I have a script that takes a bunch of XML files, all in the form of: HMDB61152.xml and pulls them all in using glob. For each file I need to pull some details about each, such as accession, name, and a list of diseases. To parse through each XML I used xmltodict because I traditionally like working with lists instead of XML files, although I may need to change my strategy due to the issues I am facing.
I am able to pull name and acc easily since all XML files have it in the same first level of the tree:
path = '/Users/me/Downloads/hmdb_metabolites'
for data_file in glob.glob(os.path.join(path,'*.xml')):
diseases=[]
with open(data_file) as fd:
doc = xmltodict.parse(fd.read())
name = doc['metabolite']['name']
acc = doc['metabolite']['accession']
So basically at this point there are three options for the disease information:
There are multiple disease tags within each diseases tree. I.e there are 2 or more diseases for the given accession.
There is one disease within the diseases tree meaning the accession has only one disease.
or
There are no disease in the diseases tree at all.
I need to write a loop that can handle any three cases, and thats where I am failing. Here is my approach so far:
#I get the disease root, which returns True if it has lower level items (one or more disease within diseases)
#or False if there are no disease within diseases.
dis_root=doc['metabolite']['diseases']
if (bool(dis_root)==True):
dis_init = doc['metabolite']['diseases']['disease']
if (bool(doc['metabolite']['diseases']['disease'][0]) == True):
for x in range(0,len(dis_init)):
diseases.append(doc['metabolite']['diseases']['disease'][x]['name'])
else:
diseases.append(doc['metabolite']['diseases']['disease']['name'])
else:
diseases=['None']
So the problem is, for the case where there are multiple diseases, I need to pull their names in the following format: doc['metabolite']['diseases']['disease'][x]['name'] for each x in diseases. But for the ones that have only one disease, they have no index at all, so the only way I can pull the name of that one disease is by doing doc['metabolite']['diseases']['disease']['name'].
The script is failing because as soon as we encounter a case of only one disease, it returns a KeyError when it tries to test if doc['metabolite']['diseases']['disease'][0]) == True. If anyone can help me figure this out that'd be great, or direct me to a more appropriate strategy.

Try something like
if 0 in doc['metabolite']['diseases']['disease']:
pass # if 0 is a key in the array, we have multiple entries
else
pass # only a single item.

Found a relatively easy workaround, I simply use try in the following way:
try:
for x in range(0,len(dis_init)):
diseases.append(doc['metabolite']['diseases']['disease'][x]['name'])
except KeyError:
diseases.append(doc['metabolite']['diseases']['disease']['name'])

Related

Biopython: return chain but with the new chain ID already

I have script which can extract selected chains from a structure into a new file. I do it for 400+ structures. Because chainIDs of my selected chains can differ in the structures, I parse .yaml files where I store the corresponding chainIDs. This script is working, everything is fine but the next step is to rename the chains to be the same in each file. I used edited code from here:this. Basically it worked as well, however the problem is that e.g. my new chainID of chain1 is the same as original chainID of chain2, and the error occurrs:Cannot change id from U to T. The id T is already used for a sibling of this entity. Actually, this happened for many variables and it'd be too complicated doing it manually.
I've got idea that this could be solved by renaming the chainIDs right in the moment when I'm extracting it. Is it possible using Biopython like that? Could'nt find anything similar to my problem.
Simplified code for one structure (in the original one is one more loop for iterating over 400+ structures and its .yaml files):
with open(yaml_file, "r") as file:
proteins = yaml.load(file, Loader=yaml.FullLoader)
chain1= proteins["1_chain"].split(",")[0] #just for illustration that I have to parse the original chainIDs
chain2= proteins["2_chain"].split(",")[0]
structure = parser.get_structure("xxx", "xxx.cif" )[0]
for model in structure:
for chain in model:
class ChainSelect(Select):
def accept_chain(self, chain):
if chain.get_id() == '{}'.format(chain1):
return True # I thought that somewhere in this part could be added command renaming the chain to "A"
if chain.get_id() == '{}'.format(chain2):
return True #here I'd rename it "B"
else:
return False
io = MMCIFIO()
io.set_structure(structure)
io.save("new.cif" , ChainSelect())
Is it possible to somehow expand "return" command in a way that it would return the chain with desired chainID (e.g. A)? Note that the original chain ID can differ in the structures (thus I have to use .format(chainX))
I don't have any other idea how I'd get rid of the error that my desired chainID is already in sibling entity.

Getting values from an XML file that has deep keys and values

I have a very large xml file produced from an application whose part of tree is as below:
There are several items under 'item' from 0 to 7. These names are always named as numbers it can range from 0 to any number.
Each of these items will have multiple items all with same structure as per the above tree. Only item 0 to 7 is variable all other structure remains same.
under I have a value <bbmds_questiontype>: which can be Multiple Choice or Matching or Essays.
What I need is to have a list the values of <mat_formattedtext>. ie. the output is supposed to be:
<0>
<bbmds_questiontype>Multiple Choice</bbmds_questiontype>
<mat_formattedtext>This is first question </mat_formattedtext></0>
<1>
<bbmds_questiontype>Multiple Choice</bbmds_questiontype>
<mat_formattedtext>This is second question </mat_formattedtext> </1>
<2>
<bbmds_questiontype>Essay</bbmds_questiontype>
<mat_formattedtext>This is first question </mat_formattedtext> </2>
....
I have tried several solution included xml tree, xmltodict all getting complicated as filters to be applied across different branches of children
import xmltodict
with open("C:/Users/SS/Desktop/moodlexml/00001_questions.dat") as fd:
doc = xmltodict.parse(fd.read())
shortened=doc['questestinterop']['assessment']['section']['item'] # == u'an attribute'
Any advice will be appreciated to proceed further.

Have you tried to use bs4 parsing, its simple
Check it out
https://linuxhint.com/parse_xml_python_beautifulsoup/

Replace a value on one data set with a value from another data set with a dynamic lookup

This question relates primarly to Alteryx, however if it can be done in Python, or R in Alteryx workflow using the R tool then that would work as well.
I have two data sets.
Address (contains address information: Line1, Line2, City, State, Zip)
USPS (contains USPS abbreviations: Street to ST, Boulevard to BLVD, etc.)
Goal: Look at the string on the Address data set for Line1. IF it CONTAINS one of the types of streets in the USPS data set, I want to replace that part of the string with its proper abbreviation which is in a different column of the USPS data set.
Example, 123 Main Street would become 123 Main St
What I have tried:
Imported the two data sets.
Union the two data sets with the instruction of Output All Fields for When Fields Differ.
Added a formula, but this is where I am getting stuck. So far it reads:
if [Addr1] Contains(Sting, Target)
Not sure how to have it look in the USPS for one of the values. I am also not certain if this sort of dynamic lookup can take place.
If this can be done in python (I know very basic Python so I don't have code for this yet because I do not know where to start other than importing the data) I can use python within Alteryx.
Any assistance would be great. Please let me know if you need additional information.
Thank you in advance.

Use the Find Replace tool in Alteryx. This tool is akin to a lookup. Furthermore, use the Alteryx Community as a go to for these types of questions.
Input the Address dataset into the top anchor of the Find Replace tool and the USPS dataset into the bottom anchor. You'll want to find any part of the address field using the lookup field and replace it with the abbreviation field. If you need to do this across several fields in the Address dataset, then you could replicate this logic or you could use a Record ID tool, Transpose, run this logic on one field, and then Cross Tab back to the original schema. It's an advanced recipe that you'll want to master in Alteryx.
https://help.alteryx.com/current/FindReplace.htm

The overall logic that can be used is here: Using str_detect (or some other function) and some way to loop through a list to essentially perform a vlookup
However, in order to expand to Alteryx, you would need to add the Alteryx R tool. Also, some of the code would need to be changed to use the syntax that Alteryx likes.
read in the data with:
read.Alteryx('#Link Number', mode = 'data.frame')
After, the above linked question will provide the overall framework for the logic. Reiterated here:
usps[] = lapply(usps, as.character)
##Copies the original address data to a new column that will
##be altered. Preserves the orignal formatting for rollback
##if necessary
vendorData$new_addr1 = as.character(vendorData$Addr1)
##Loops through the dictionary replacing all of the common names
##with their USPS approved abbreviations for the Addr1 field.
for(i in 1:nrow(usps)) {
vendorData$new_addr1 = str_replace_all(
vendorData$new_addr1,
pattern = paste0("\\b", usps$Abbreviation[i], "\\b"),
replacement = usps$USPS_Abbrv_updated[i]
)
}
Finally, in order to be able to see the output, we would need to write a statement that will output it in one of the 5 output slots the R tool has. Here is the code for that:
write.Alteryx(data, #)

Getting element density from abaqus output database using python scripting

I'm trying to get the element density from the abaqus output database. I know you can request a field output for the volume using 'EVOL', is something similar possible for the density?
I'm afraid it's not because of this: Getting element mass in Abaqus postprocessor
What would be the most efficient way to get the density? Look for every element in which section set it is?

Found a solution, I don't know if it's the fastest but it works:
odb_file_path=r'your_path\file.odb'
odb = session.openOdb(name=odb_file_path)
instance = odb.rootAssembly.instances['MY_PART']
material_name = instance.elements[0].sectionCategory.name[8:-2]
density=odb.materials[material_name].density.table[0][0])
note: the 'name' attribute will give you a string like, 'solid MATERIALNAME'. So I just cut out the part of the string that gave me the real material name. So it's the sectionCategory attribute of an OdbElementObject that is the answer.
EDIT: This doesn't seem to work after all, it turns out that it gives all elements the same material name, being the name of the first material.

The properties are associated something like this:
sectionAssignment connects section to set
set is the container for element
section connects sectionAssignment to material
instance is connected to part (could be from a part from another model)
part is connected to model
model is connected to section
Use the .inp or .cae file if you can. The following gets it from an opened cae file. To thoroughly get elements from materials, you would do something like the following, assuming you're starting your search in rootAssembly.instances:
Find the parts which the instances were created from.
Find the models which contain these parts.
Look for all sections with material_name in these parts, and store all the sectionNames associated with this section
Look for all sectionAssignments which references these sectionNames
Under each of these sectionAssignments, there is an associated region object which has the name (as a string) of an elementSet and the name of a part. Get all the elements from this elementSet in this part.
Cleanup:
Use the Python set object to remove any multiple references to the same element.
Multiply the number of elements in this set by the number of identical part instances that refer to this material in rootAssembly.
E.g., for some cae model variable called model:
model_part_repeats = {}
model_part_elemLabels = {}
for instance in model.rootAssembly.instances.values():
p = instance.part.name
m = instance.part.modelName
try:
model_part_repeats[(m, p)] += 1
continue
except KeyError:
model_part_repeats[(m, p)] = 1
# Get all sections in model
sectionNames = []
for s in mdb.models[m].sections.values():
if s.material == material_name: # material_name is already known
# This is a valid section - search for section assignments
# in part for this section, and then the associated set
sectionNames.append(s.name)
if sectionNames:
labels = []
for sa in mdb.models[m].parts[p].sectionAssignments:
if sa.sectionName in sectionNames:
eset = sa.region[0]
labels = labels + [e.label for e in mdb.models[m].parts[p].sets[eset].elements]
labels = list(set(labels))
model_part_elemLabels[(m,p)] = labels
else:
model_part_elemLabels[(m,p)] = []
num_elements_with_material = sum([model_part_repeats[k]*len(model_part_elemLabels[k]) for k in model_part_repeats])
Finally, grab the material density associated with material_name then multiply it by num_elements_with_material.
Of course, this method will be extremely slow for larger models, and it is more advisable to use string techniques on the .inp file for faster performance.

Parse a file in python to find first a string, then parse the following strings until it find another string

I am trying to scroll trough a result file that one of our process print.
The objective is to look through various blocks and find a specific parameter. I tried to tackle this but can't find an efficient way that would avoid to parse the file multiple times.
This is an example of the output file that I read:
ID:13123
Compound:xyz
... various parameters
RhPhase:abc
ID:543
Compound:lbm
... various parameters
ID:232355
Compound:dfs
... various parameters
RhPhase:cvb
I am looking for a specific ID that has a RhPhase in it, but since the file contains many more entry, I just want that specific ID. It may or may not have an RhPhase in it; if it has one, I get the value.
The only way that I figured out is to actually go through the whole file (which may be hundreds of blocks, to give an idea of the size), and make a list for each ID that has a RhPhase, then in second instance, I scroll through the dictionary, retrieving the value for a specific ID.
This feels so inefficient; I tried to do something different, but got stuck at how you mark the lines while you scroll through them; so I can tell python to read each line->when find the ID that I want continue to read->if you find RhPhase get the value, otherwise stop at the next ID.
I am stuck here:
datafile=open("datafile.txt", "r")
for items in datafile.readline():
if "ID:543" in items:
[read more lines]
[if "RhPhase" in lines:]
[ rhphase=lines ]
[elif ""ID:" in lines ]
[ rhphase=None ]
[ break ]
Once I find the ID; I don't know how to continue to either look for the RhPhase string or find the first ID: string and stop everything (because this means that the ID does not have an associated RhPhase).
This would pass through the file once, and just check for the specific ID, instead of parse the whole thing once and then do a second pass.
Is possible to do so or am I stuck to the double parsing ?

Usually, you solve these kind of things with a simple state machine: You read the lines until you find your id; then you put your reader into a special state that then checks for the parameter you want to extract. In your case, you only have two states: ID not found, and ID found, so a simple boolean is enough:
foundId = False
with open('datafile.txt', 'r') as datafile:
for line in datafile:
if foundId:
if line.startswith('RhPhase'):
print('Found RhPhase for ID 543:')
print(line)
# end reading the file
break
elif line.startswith('ID:'):
print('Error: Found another ID without finding RhPhase first')
break
# if we haven’t found the ID yet, keep looking for it
elif line.startswith('ID:543'):
foundId = True

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cannot pull data from XML files due to differences in format - python

Try something like if 0 in doc['metabolite']['diseases']['disease']: pass # if 0 is a key in the array, we have multiple entries else pass # only a single item.

Found a relatively easy workaround, I simply use try in the following way: try: for x in range(0,len(dis_init)): diseases.append(doc['metabolite']['diseases']['disease'][x]['name']) except KeyError: diseases.append(doc['metabolite']['diseases']['disease']['name'])

Related

Biopython: return chain but with the new chain ID already

Getting values from an XML file that has deep keys and values

Replace a value on one data set with a value from another data set with a dynamic lookup

Getting element density from abaqus output database using python scripting

Parse a file in python to find first a string, then parse the following strings until it find another string

Categories

Resources