How to create a lettered list using docx? - python

I use python for teaching some of my science courses, where I use it to generate unique assignments and tests for students. I've run into an issue that I can't sort out on my own.
I'm trying to make a series of nested lists: a numbered question with lettered sub-parts underneath it. For example:
Use the Henderson-Hasselbalch equation to determine pH of the following solutions:
A. 250 mM Ammonium Chloride
B. 100 mM Acetic Acid
I've used style "List Number" to create the numbered list, but I can't figure out how to create a custom list that starts with the letters.
Here is what I've got so far:
import sys
import os
if os.uname()[1] == 'iMac':
    sys.path.append("/Users/mgreene3/Library/Python/2.7/lib/python/site-packages")
else:
    sys.path.append("/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python")
import numpy as np
import math
import random
import textwrap
from docx import Document
from docx.shared import Pt, Inches
from docx.enum.style import WD_STYLE_TYPE
from docx.text.tabstops import TabStop as ts
from docx.text.parfmt import ParagraphFormat
assignment = Document()
ordered = "a"
style = assignment.styles["Normal"]
font = style.font
font.name = "Calibri"
font.size = Pt(12)
style.paragraph_format.space_after = Pt(0)
LetteredList = style.paragraph_format._NumberingStyle(ordered)
sub_style = assignment.styles["ListBullet"]
sub_font = sub_style.font
sub_font.name = "Calibri"
###sub_style.paragraph_format.style("List")
sub_font.size = Pt(12)
sub_style.paragraph_format.left_indent = Inches(1)
sub_style.paragraph_format.space_before = Pt(0)
sub_style.paragraph_format.space_after = Pt(40)
doc_heading = assignment.add_paragraph("Name:_______________________")
doc_heading.add_run("\t" * 4)
doc_heading.add_run(" " * 12)
doc_heading.add_run("BIOL444: Biochemistry\t\t\t\t\t\t ")
doc_heading.add_run("\n")
doc_heading.add_run("Take Home 1, v.")
doc_heading.add_run((str(1).zfill(2)))
doc_heading.add_run("\n" * 2)
doc_heading.add_run("Instructions: Complete test (")
show_work = doc_heading.add_run("show work")
show_work.bold = True
show_work.underline = True
show_work
doc_heading.add_run("), submit ")
hard_copy = doc_heading.add_run("hard copy")
hard_copy.bold = True
hard_copy.underline = True
hard_copy
doc_heading.add_run(" by ")
doc_heading.add_run("11:59 pm, Friday, February 10").bold =True
doc_heading.add_run(". Late submissions will ")
doc_heading.add_run("NOT").bold=True
doc_heading.add_run(" be accepted.")
question1 = assignment.add_paragraph("Using the data for K", style = "List Number")
question1.add_run("a").font.subscript = True
question1.add_run(" and pK")
question1.add_run("a").font.subscript = True
question1.add_run(" of the following compounds, calculate the concentrations (M) of all ionic species as well as the pH of the following aqueous solutions: ")
question1.add_run("\n")
question1a = assignment.add_paragraph("100 mM Acetic acid", style = sub_style)
question1b = assignment.add_paragraph("250 mM NaOH", style = sub_style)
assignment.save("TestDocx.docx")

The short answer is that it's probably more trouble than it's worth. Creating numbered lists in Word, especially nested ones, is a complex operation, possibly for legacy reasons (we're on version 14 or something of Word). Partly because of this complexity, python-docx doesn't yet have API support for it.
If you really wanted to do it, you would have to manipulate the numbering definitions, which live in a separate package part from the document part (I believe it's numbering.xml), using low-level lxml calls.
For myself, I'd be strongly inclined to use reStructuredText for a job like this, rendering to PDF, perhaps using Sphinx. As a side effect, you could easily get an HTML version as well for posting assignments on the web. However, I'm too far away from your actual requirements to say that would really suit; you'll have to check it out and see for yourself :)
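A low-tech workaround, which is not part of the answer above, is to skip Word's numbering machinery and write the letters into the paragraph text yourself. The sketch below only builds the labelled strings; the helper name lettered is made up for illustration, and it assumes at most 26 sub-parts.

```python
import string


def lettered(items, letters=string.ascii_uppercase):
    """Yield 'A. item', 'B. item', ... (hypothetical helper, at most 26 items)."""
    if len(items) > len(letters):
        raise ValueError("more items than available letters")
    for letter, item in zip(letters, items):
        yield f"{letter}. {item}"


sub_parts = ["250 mM Ammonium Chloride", "100 mM Acetic Acid"]
for line in lettered(sub_parts):
    # With python-docx, each line could be passed to assignment.add_paragraph(line)
    # instead of being printed, and indented through paragraph_format.left_indent.
    print(line)
```

The trade-off is that Word sees these as ordinary paragraphs, so the letters will not renumber themselves if someone edits the document by hand.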


Permutation List with Variable Dependencies- UnboundLocalError

I was trying to break the code down to its simplest form before adding more variables. I'm stuck.
I want to use itertools so that the first element of each result is a trick and the second is one of that trick's landings, as returned by landings(). Later I want to add further variables that branch off from landings() in the same way.
The simplest form should print a list that looks like:
Backflip Complete
Backflip Hyper
180 Round Complete
180 Round Mega
Gumbi Complete
My Code:
from re import I
import pandas as pd
import numpy as np
import itertools
from io import StringIO
backflip = "Backflip"
one80round = "180 Round"
gumbi = "Gumbi"
tricks = [backflip,one80round,gumbi]
complete = "Complete"
hyper = "Hyper"
mega = "Mega"
backflip_landing = [complete,hyper]
one80round_landing = [complete,mega]
gumbi_landing = [complete]
def landings(tricks):
    if tricks == backflip:
        landing = backflip_landing
    elif tricks == one80round:
        landing = one80round_landing
    elif tricks == gumbi:
        landing = gumbi_landing
    return landing
for trik, land in itertools.product(tricks,landings(tricks)):
    trick_and_landing = (trik, land)
    result = (' '.join(trick_and_landing))
    tal = StringIO(result)
    tl = (pd.DataFrame((tal)))
    print(tl)
I get the error:
UnboundLocalError: local variable 'landing' referenced before assignment
Adding landing = "" as the first line inside def landings(tricks): gets rid of the error, but it only hides the real problem.
The if checks in your function are wrong: you compare tricks, which is a list, with backflip and the other trick names, which are all strings. None of the conditions is ever true, so landing is never assigned a value.
That question was also about permutation in python. Maybe it helps.
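One way to get the expected output is to look up the landings per trick instead of passing the whole list to landings(). This is a minimal sketch rather than a drop-in fix: the dictionary landings_by_trick and the function trick_landing_pairs are names invented here, and it relies on dictionaries keeping insertion order (Python 3.7+).

```python
# Map each trick to its own list of landings (hypothetical structure).
landings_by_trick = {
    "Backflip": ["Complete", "Hyper"],
    "180 Round": ["Complete", "Mega"],
    "Gumbi": ["Complete"],
}


def trick_landing_pairs(mapping):
    """Return 'trick landing' strings, pairing each trick only with its own landings."""
    return [
        f"{trick} {landing}"
        for trick, trick_landings in mapping.items()
        for landing in trick_landings
    ]


for line in trick_landing_pairs(landings_by_trick):
    print(line)
```

Further levels (for example, variations that depend on the landing) could be added as another nested mapping and another for clause.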

Exporting 15000x voucher codes made with Python to Excel

I am trying to export a voucher code for a flyer with Excel. I want to create 15000 rows with random voucher codes. So far I have made below, but how can I make it create 15000 voucher codes for each row?
Thanks a lot.
import random
import string
import pandas as pd
def random_string_generator(str_size, allowed_chars):
    return ''.join(random.choice(allowed_chars) for x in range(str_size))
chars = string.ascii_uppercase
size = 5
for i in range(15000):
    print (random_string_generator(size, chars)+"-"+random_string_generator(size, chars)+"-"+random_string_generator(size, chars)+"-"+random_string_generator(size, chars)+"-"+random_string_generator(size, chars))
Here's your original code with a few tweaks:
import random
import string
chars = string.ascii_uppercase
amount_of_vouchers = 10
segments_per_voucher = 5
chars_per_segment = 5
def random_string_generator(allowed_chars, str_size):
    return ''.join(random.choices(allowed_chars, k=str_size))
vouchers = []
for i in range(amount_of_vouchers):
    voucher = [random_string_generator(chars, chars_per_segment) for j in range(segments_per_voucher)]
    vouchers.append('-'.join(voucher))
print('\n'.join(vouchers))
What I've done:
Instead of concatenating the function calls on one line, I've changed this to a loop. This is easier to read, easier to change and shorter.
Added a loop around the code generation, so that we can create multiple vouchers
Vouchers are stored in the imaginatively named vouchers list.
Changed random.choice to random.choices, which allows us to generate the entire segment at once, rather than per character.
Example output:
RIRSE-BURXY-NTBFP-VZTBC-LNQYD
OWTSZ-AIUPS-POXMW-PQXJY-DUXUE
BFDJI-ASLPZ-XIRKR-ZKVLB-YGRCA
SQTHJ-DYJYL-IZQFD-EFBJO-OWPHO
OWPWW-PJGNY-BOCZM-ANNLJ-CFXKY
NHQUN-MMBQB-KHLYL-ZQVTD-TDUQC
MNOYT-WAVWV-QSUND-RYKHB-TNUCF
OAHOR-DPJFN-RQYHE-GUSVF-CPCBF
OFNHT-LCARH-EZDWT-YRLLI-IWJZW
NXLKI-GCJDM-QZGPU-MIZCC-XSOQD
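The code above prints the vouchers but does not write a file that Excel can open. The sketch below is one possible way to do that with only the standard library: it writes a one-column CSV, which Excel opens directly. The function names, the file name vouchers.csv and the decision to discard duplicate codes are all assumptions made for illustration.

```python
import csv
import random
import string


def generate_vouchers(count, segments=5, segment_length=5):
    """Return `count` distinct codes such as 'RIRSE-BURXY-NTBFP-VZTBC-LNQYD'."""
    chars = string.ascii_uppercase
    vouchers = set()  # a set silently drops the (very unlikely) duplicate code
    while len(vouchers) < count:
        parts = ("".join(random.choices(chars, k=segment_length)) for _ in range(segments))
        vouchers.add("-".join(parts))
    return sorted(vouchers)


def write_vouchers_csv(vouchers, path):
    """Write one voucher per row under a 'voucher' header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["voucher"])
        writer.writerows([voucher] for voucher in vouchers)


vouchers = generate_vouchers(15000)
write_vouchers_csv(vouchers, "vouchers.csv")
```

If a native .xlsx file is needed, the same list could be placed in a pandas DataFrame and saved with to_excel, which requires an Excel writer package such as openpyxl to be installed.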

Converting molecule name to SMILES?

I was just wondering, is there any way to convert IUPAC or common molecular names to SMILES? I want to do this without having to manually convert every single one utilizing online systems. Any input would be much appreciated!
For background, I am currently working with python and RDkit, so I wasn't sure if RDkit could do this and I was just unaware. My current data is in the csv format.
Thank you!
RDKit can't convert names to SMILES.
The Chemical Identifier Resolver can convert names and other identifiers (such as CAS numbers), and it has an API, so you can do the conversion from a script.
from urllib.request import urlopen
from urllib.parse import quote
def CIRconvert(ids):
    try:
        url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
        ans = urlopen(url).read().decode('utf8')
        return ans
    except:
        return 'Did not work'
identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']
for ids in identifiers :
    print(ids, CIRconvert(ids))
Output
3-Methylheptane CCCCC(C)CC
Aspirin CC(=O)Oc1ccccc1C(O)=O
Diethylsulfate CCO[S](=O)(=O)OCC
Diethyl sulfate CCO[S](=O)(=O)OCC
50-78-2 CC(=O)Oc1ccccc1C(O)=O
Adamant Did not work
OPSIN (https://opsin.ch.cam.ac.uk/) is another solution for name2structure conversion.
It can be used by installing the cli, or via https://github.com/gorgitko/molminer
(OPSIN is used by the RDKit KNIME nodes also)
PubChemPy has some great features that can be used for this purpose. It supports IUPAC systematic names, trade names and all known synonyms for a given Compound as documented in PubChem database:
https://pubchempy.readthedocs.io/en/latest/
>>> import pubchempy as pcp
>>> results = pcp.get_compounds('Glucose', 'name')
>>> print(results)
[Compound(79025), Compound(5793), Compound(64689), Compound(206)]
The first argument is the identifier, and the second argument is the identifier type, which must be one of name, smiles, sdf, inchi, inchikey or formula. It looks like there are 4 compounds in the PubChem Database that have the name Glucose associated with them. Let’s take a look at them in more detail:
>>> for compound in results:
...     print(compound.isomeric_smiles)
C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
C(C1C(C(C(C(O1)O)O)O)O)O
It looks like they all have different stereochemistry information!
The accepted answer uses the Chemical Identifier Resolver, but for some reason the website seems to be buggy for me and the API seems to be messed up.
Another option is the PubChem Python API (PubChemPy). This example converts in the other direction, from SMILES to IUPAC name, and works if your SMILES is in their database,
e.g.
#!/usr/bin/env python
import sys
import pubchempy as pcp
smiles = str(sys.argv[1])
print(smiles)
s= pcp.get_compounds(smiles,'smiles')
print(s[0].iupac_name)
You can use batch query of pubchem:
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange-help.html
You can use the pubchem API (PUG REST) for this
(https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial)
Basically, the url you are calling will take the compound as a "name", you then give the name, then you specify that you want the "property" of "CanonicalSMILES", as text
import pandas as pd
import requests
from urllib.parse import quote

identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']
rows = []
for x in identifiers:
    try:
        # quote() escapes spaces and other characters that are not URL-safe
        url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/' + quote(x) + '/property/CanonicalSMILES/TXT'
        # remove new line character with rstrip
        smiles = requests.get(url).text.rstrip()
        if 'NotFound' in smiles:
            print(x, " not found")
        else:
            rows.append({'Name': x, 'Smiles': smiles})
    except requests.RequestException:
        print("boo ", x)
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
smiles_df = pd.DataFrame(rows, columns=['Name', 'Smiles'])
print(smiles_df)

Why does my association model find subgroups in a dataset when there shouldn't any?

I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide a group into.
def Freedman_Diaconis(column_values):
    #sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    n_to_pow = len(column_values[1])**(-1/3)
    h = 2 * iqr * n_to_pow
    retval = (column_values[1][-1] - column_values[1][1])/h
    test = int(retval+1)
    return test
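For comparison, here is a minimal sketch of the Freedman-Diaconis rule, bin width h = 2 * IQR * n^(-1/3) and bin count (max - min) / h rounded up. It is not the asker's code: the function name and the simple index-based quartiles are assumptions. Two details differ from the function above and may be worth checking: the exponent is written with floats, because under Python 2 (which the print statements later in the post suggest) -1/3 is integer division and evaluates to -1, and the range uses index 0 rather than index 1 as the minimum.

```python
import math


def freedman_diaconis_bins(values):
    """Number of bins suggested by the Freedman-Diaconis rule (illustrative sketch)."""
    data = sorted(values)          # work on a sorted copy, leave the input untouched
    n = len(data)
    q1 = data[int(n * 0.25)]       # crude index-based first quartile
    q3 = data[int(n * 0.75)]       # crude index-based third quartile
    iqr = q3 - q1
    if iqr == 0:
        return 1                   # all central values identical: a single bin
    h = 2.0 * iqr * n ** (-1.0 / 3.0)   # float exponent avoids integer division
    return int(math.ceil((data[-1] - data[0]) / h))


print(freedman_diaconis_bins(range(1, 9)))
```

For the eight values 1 to 8, the quartile values are 3 and 7, so the bin width is about 4 and the range of 7 needs 2 bins.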
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix+str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1])+'x'+str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    #for filename in os.listdir("."):
    #    if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12":list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
    print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is,
why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift numbers that are above 2 as opposed to the 1 that you should expect if everything was random like the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a way larger sample size in order to really use the method that I was using. In fact, the method that I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.
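For reference, lift compares a rule's confidence with how often its consequent occurs anyway: lift(A -> B) = confidence(A -> B) / support(B), so a value near 1 means the antecedent tells you nothing about the consequent. The sketch below only illustrates that arithmetic. The support of 0.5 for trt1 is a hypothetical number, not taken from the dataset, and the function name is made up.

```python
def lift(confidence, consequent_support):
    """Lift of a rule A -> B, given confidence(A -> B) and support(B)."""
    if consequent_support <= 0:
        raise ValueError("consequent support must be positive")
    return confidence / consequent_support


# Rule from the post: 6x0 -> trt1 with confidence 0.7.
# Assumed for illustration: half of all patients received trt1.
rule_lift = lift(0.7, 0.5)
print(round(rule_lift, 2))
```

With few rows and dozens of discretized variables, many candidate rules are tested against the same small sample, so some will show a lift well above 1 by chance alone.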

Python: Joining and writing (XML.etrees) trees stored in a list

I'm looping over some XML files and producing trees that I would like to store in a defaultdict(list). On each pass through the loop, the next child found is stored under its own key in the dictionary.
d = defaultdict(list)
counter = 0
for child in root.findall(something):
    tree = ET.ElementTree(something)
    d[int(x)].append(tree)
    counter += 1
So then repeating this for several files would result in nicely indexed results; a set of trees that were in position 1 across different parsed files and so on. The question is, how do I then join all of d, and write the trees (as a cumulative tree) to a file?
I can loop through the dict to get each tree:
for x in d:
    for y in d[x]:
        print (y)
This gives a complete list of trees that were in my dict. Now, how do I produce one massive tree from this?
Sample input file 1
Sample input file 2
Required results from 1&2
Given the apparent difficulty in doing this, I'm happy to accept more general answers that show how I can otherwise get the result I am looking for from two or more files.
Use Spyne:
from spyne.model.primitive import *
from spyne.model.complex import *
class GpsInfo(ComplexModel):
    UTC = DateTime
    Latitude = Double
    Longitude = Double
    DopplerTime = Double
    Quality = Unicode
    HDOP = Unicode
    Altitude = Double
    Speed = Double
    Heading = Double
    Estimated = Boolean
class Header(ComplexModel):
    Name = Unicode
    Time = DateTime
    SeqNo = Integer
class CTrailData(ComplexModel):
    index = UnsignedInteger
    gpsInfo = GpsInfo
    Header = Header
class CTrail(ComplexModel):
    LastError = AnyXml
    MaxTrial = Integer
    Trail = Array(CTrailData)
from lxml import etree
from spyne.util.xml import *
file_1 = get_xml_as_object(etree.fromstring(open('file1').read()), CTrail)
file_2 = get_xml_as_object(etree.fromstring(open('file2').read()), CTrail)
file_1.Trail.extend(file_2.Trail)
file_1.Trail.sort(key=lambda x: x.index)
elt = get_object_as_xml(file_1, no_namespace=True)
print etree.tostring(elt, pretty_print=True)
While doing this, Spyne also converts the data fields from string to their native Python formats as well, so it'll be much easier for you to work with the data from this xml document.
Also, if you don't mind using the latest version from git, you can do e.g.:
class GpsInfo(ComplexModel):
    # (...)
    doppler_time = Double(sub_name="DopplerTime")
    # (...)
so that you can get data from the CamelCased tags without having to violate PEP8.
Use lxml.objectify:
from lxml import etree, objectify
obj_1 = objectify.fromstring(open('file1').read())
obj_2 = objectify.fromstring(open('file2').read())
obj_1.Trail.CTrailData.extend(obj_2.Trail.CTrailData)
# .sort() won't work as objectify's lists are not regular python lists.
obj_1.Trail.CTrailData = sorted(obj_1.Trail.CTrailData, key=lambda x: x.index)
print etree.tostring(obj_1, pretty_print=True)
It doesn't do the additional conversion work that the Spyne variant does, but for your use case, that might be enough.
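If neither library is available, the same merge can be sketched with the standard library's xml.etree.ElementTree. The sample input files are not shown in the question, so the element names below (CTrail, Trail, CTrailData, index) are assumptions inferred from the Spyne classes above, and the two inline documents are made-up stand-ins for the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the two input files.
FILE_1 = "<CTrail><Trail><CTrailData><index>2</index></CTrailData></Trail></CTrail>"
FILE_2 = "<CTrail><Trail><CTrailData><index>1</index></CTrailData></Trail></CTrail>"


def merge_trails(xml_documents):
    """Append every CTrailData entry to the first document's Trail, sorted by index."""
    roots = [ET.fromstring(doc) for doc in xml_documents]
    merged_root = roots[0]
    merged_trail = merged_root.find("Trail")
    for other_root in roots[1:]:
        # list(element) gives the element's direct children
        merged_trail.extend(list(other_root.find("Trail")))
    # Slice assignment replaces the children with the sorted sequence.
    merged_trail[:] = sorted(merged_trail, key=lambda entry: int(entry.findtext("index")))
    return ET.ElementTree(merged_root)


tree = merge_trails([FILE_1, FILE_2])
indexes = [entry.findtext("index") for entry in tree.getroot().find("Trail")]
print(indexes)
```

The resulting tree can then be saved with tree.write("merged.xml"), where the output file name is again just an example.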
