I was just wondering, is there any way to convert IUPAC or common molecular names to SMILES? I want to do this without having to manually convert every single one using an online service. Any input would be much appreciated!
For background, I am currently working with Python and RDKit, so I wasn't sure if RDKit could do this and I was just unaware. My current data is in CSV format.
Thank you!
RDKit can't convert names to SMILES.
The Chemical Identifier Resolver can convert names and other identifiers (like CAS numbers) and has an API, so you can do the conversion with a script.
from urllib.request import urlopen
from urllib.parse import quote

def CIRconvert(ids):
    try:
        url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
        ans = urlopen(url).read().decode('utf8')
        return ans
    except:
        return 'Did not work'

identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']

for ids in identifiers:
    print(ids, CIRconvert(ids))
Output
3-Methylheptane CCCCC(C)CC
Aspirin CC(=O)Oc1ccccc1C(O)=O
Diethylsulfate CCO[S](=O)(=O)OCC
Diethyl sulfate CCO[S](=O)(=O)OCC
50-78-2 CC(=O)Oc1ccccc1C(O)=O
Adamant Did not work
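Since your data is in CSV format, you can apply the same function across a whole column with pandas. A minimal sketch, assuming your file has a column called 'Name' (both the filename and the column name here are placeholders, adjust them to your data):

import pandas as pd

# Assumed file and column names, for illustration only
df = pd.read_csv('molecules.csv')
df['SMILES'] = df['Name'].apply(CIRconvert)
df.to_csv('molecules_with_smiles.csv', index=False)

Note that this fires one HTTP request per row, so expect it to be slow for large files.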
OPSIN (https://opsin.ch.cam.ac.uk/) is another solution for name-to-structure conversion.
It can be used by installing the CLI, or via https://github.com/gorgitko/molminer.
(OPSIN is also used by the RDKit KNIME nodes.)
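OPSIN also exposes a simple web service, so a name can be resolved with a plain HTTP GET; appending .smi to the name requests SMILES output. A minimal sketch (worth verifying the endpoint behaviour yourself, and note that OPSIN only parses systematic names, so trade names like 'Aspirin' won't resolve):

from urllib.request import urlopen
from urllib.parse import quote

def opsin_to_smiles(name):
    # OPSIN's web service returns SMILES for /opsin/<name>.smi
    url = 'https://opsin.ch.cam.ac.uk/opsin/' + quote(name) + '.smi'
    return urlopen(url).read().decode('utf8').strip()

print(opsin_to_smiles('2,4,6-trinitrotoluene'))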
PubChemPy has some great features that can be used for this purpose. It supports IUPAC systematic names, trade names, and all known synonyms for a given compound, as documented in the PubChem database:
https://pubchempy.readthedocs.io/en/latest/
>>> import pubchempy as pcp
>>> results = pcp.get_compounds('Glucose', 'name')
>>> print(results)
[Compound(79025), Compound(5793), Compound(64689), Compound(206)]
The first argument is the identifier, and the second argument is the identifier type, which must be one of name, smiles, sdf, inchi, inchikey or formula. It looks like there are 4 compounds in the PubChem Database that have the name Glucose associated with them. Let’s take a look at them in more detail:
>>> for compound in results:
...     print(compound.isomeric_smiles)
C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
C(C1C(C(C(C(O1)O)O)O)O)O
It looks like they all have different stereochemistry information!
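If you only need the SMILES strings rather than full Compound objects, PubChemPy's get_properties can fetch them in a single call. A minimal sketch using the documented property interface:

>>> import pubchempy as pcp
>>> for props in pcp.get_properties('IsomericSMILES', 'Glucose', 'name'):
...     print(props['CID'], props['IsomericSMILES'])

Each entry is a dict keyed by 'CID' plus the requested properties, which is convenient for building a table of name-to-SMILES results.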
The accepted answer uses the Chemical Identifier Resolver, but for some reason the website seems to be buggy for me and the API seems to be broken.
So another way, going in the opposite direction (SMILES to IUPAC name), is the PubChem Python API (PubChemPy), which works if your SMILES is in their database. For example:
#!/usr/bin/env python
import sys
import pubchempy as pcp

smiles = str(sys.argv[1])
print(smiles)
s = pcp.get_compounds(smiles, 'smiles')
print(s[0].iupac_name)
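Saved as, say, smiles2name.py (the filename is just for illustration), an invocation would look something like:

$ python smiles2name.py "CC(=O)Oc1ccccc1C(=O)O"
CC(=O)Oc1ccccc1C(=O)O
2-acetyloxybenzoic acid

with the second output line being PubChem's IUPAC name for aspirin.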
You can use PubChem's batch ID exchange service:
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange-help.html
You can use the PubChem API (PUG REST) for this
(https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial).
Basically, the URL you are calling treats the compound as a "name"; you give the name, then specify that you want the "property" "CanonicalSMILES", returned as text:
import pandas as pd
import requests

identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']

rows = []
for x in identifiers:
    try:
        url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/' + x + '/property/CanonicalSMILES/TXT'
        # remove the trailing newline character with rstrip
        smiles = requests.get(url).text.rstrip()
        if 'NotFound' in smiles:
            print(x, " not found")
        else:
            rows.append({'Name': x, 'Smiles': smiles})
    except requests.RequestException:
        print("boo ", x)

# collect rows in a list and build the DataFrame once at the end
smiles_df = pd.DataFrame(rows, columns=['Name', 'Smiles'])
print(smiles_df)
Can someone let me know how to pull out certain values from a Python output?
I would like to retrieve the value 'ocweeklyreports' from the following output using either indexing or slicing:
'config': '{"hiveView":"ocweeklycur.ocweeklyreports"}
This should be relatively easy; however, I'm having problems defining the slicing / indexing configuration.
The following will successfully give me 'ocweeklyreports':
myslice = config['hiveView'][12:30]
However, I need the indexing or slicing modified so that I will get any value after 'ocweeklycur'.
I'm not sure what output you're dealing with or how robust you want this to be, but if it's just a string you can do something similar to this (for a quick and dirty solution):
text = '{"hiveView":"ocweeklycur.ocweeklyreports"}'
indexStart = text.index('.') + 1  # index just after the '.', which is where the value you want starts
finalResponse = text[indexStart:-2]
print(finalResponse)  # Prints ocweeklyreports
Again, not the most elegant solution, but hopefully it helps or at least offers a starting point. A more robust solution would be to use regex, but I'm not that skilled with regex at the moment.
You could do almost all of it using regex.
See if this helps:
import re

def search_word(di):
    st = di["config"]["hiveView"]
    p = re.compile(r'^ocweeklycur\.(?P<word>\w+)')
    m = p.search(st)
    return m.group('word')

if __name__ == "__main__":
    d = {'config': {"hiveView": "ocweeklycur.ocweeklyreports"}}
    print(search_word(d))
The following worked best for me:
# Extract the value of the "hiveView" key
hive_view = config['hiveView']
# Split the string on the '.' character
parts = hive_view.split('.')
# The value you want is the second part of the split string
desired_value = parts[1]
print(desired_value) # Output: "ocweeklyreports"
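If the surrounding output is actually JSON (the value shown looks like a serialized object), it may be more robust to parse it properly rather than slicing by position. A minimal sketch, assuming config holds the JSON string from the question:

import json

config = '{"hiveView":"ocweeklycur.ocweeklyreports"}'

# Parse the JSON, then split the dotted identifier instead of hard-coding offsets
hive_view = json.loads(config)['hiveView']
print(hive_view.split('.', 1)[1])  # ocweeklyreports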
I'm implementing a client for an already existing (old) standard for exchanging information between shops and providers in some specific sector, let's say vegetables.
It must be in Python, and I want my package to read a plaintext file and build objects accessible by a third-party application. I want to write a client implementing this standard in Python, offer it open source as a library/package, and use it for my project.
It looks roughly like this (without the # comments)
I1234X9876DELIVERY # id line. 1234 is sender id and 9876 target id.
# Doctype "delivery"
H27082022RKG # header line, specific to the "delivery" doctype.
# The delivery will happen on 27 Aug '22, at the Regular time schedule. Units: kg.
PAPPL0010 # Product Apple. 10 kg
PANAN0015 # Product Ananas. 15 kg
PORAN0015 # Product Orange. 15 kg
The standard has three types of lines: identifier, header, and details (or body). The header format depends on the document type given in the identifier line. Body lines also depend on the doc type.
Formats are defined by character length. One character from {I, H, P, ...} at the start of the line identifies the type of line, like P. Then, if it's a product line of a delivery, 4 chars identify the type of product (APPL), and a 4-digit number specifies the amount of product (10).
I thought about using a hierarchy of classes, maybe enums, to identify which kind of document I obtained, so that an application can process differently a delivery document from a catalogue document. And then, for a delivery, as the structure is known, read the date attribute, and the products array.
However, I'm not sure of:
how to parse efficiently the lines.
what to build with the parsed message.
What does it sound like to you? I didn't study computer science theory, and although I've been coding for years, this is outside the bounds of what I usually do. I've read an article about parsing tools for Python, but I'm unsure of the concepts and which tool to use, if any.
Do I need some grammar parser for this?
What would be a pythonic way to represent the data?
Thank you very much!
PS: the documents use 8-bit character encodings, usually Latin-1, so I can read byte by byte.
Looking at the start of each line would allow that line to be sent to a function for processing of that information.
This would allow a function for each format type, making testing and maintenance easier.
The data could be stored in a Python dataclass. The use of enums is possible, as it looks like that is what the document is specifying.
Using enums to give more meaningful names to the abbreviations used in the format is probably a good idea.
Here is an example of doing this:
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
import re
from typing import List, Union

data = """
I1234X9876DELIVERY
H27082022RKG
PAPPL0010
PANAN0015
PORAN0015
"""

class Product(Enum):
    APPLE = "APPL"
    PINEAPPLE = "ANAN"
    ORANGE = "ORAN"

class DocType(Enum):
    UNDEFINED = "NONE"
    DELIVERY = "DELIVERY"

class DeliveryType(Enum):
    UNDEFINED = "NONE"
    REGULAR = "R"

class Units(Enum):
    UNDEFINED = "NONE"
    KILOGRAMS = "KG"

@dataclass
class LineItem:
    product: Product
    quantity: int

@dataclass
class Header:
    sender: int = 0
    target: int = 0
    doc_type: DocType = DocType.UNDEFINED

@dataclass
class DeliveryNote(Header):
    delivery_freq: DeliveryType = DeliveryType.UNDEFINED
    date: Union[datetime, None] = None
    units: Units = Units.UNDEFINED
    line_items: List[LineItem] = field(default_factory=list)

    def show(self):
        print(f"Sender: {self.sender}")
        print(f"Target: {self.target}")
        print(f"Type: {self.doc_type.name}")
        print(f"Delivery Date: {self.date.strftime('%d-%b-%Y')}")
        print(f"Deliver Type: {self.delivery_freq.name}")
        print(f"Units: {self.units.name}")
        print()
        print(f"\t|{'Item':^12}|{'Qty':^6}|")
        print(f"\t|{'-' * 12}|{'-' * 6}|")
        for entry in self.line_items:
            print(f"\t|{entry.product.name:<12}|{entry.quantity:>6}|")

def process_identifier(entry):
    match = re.match(r'(\d+)X(\d+)(\w+)', entry)
    sender, target, doc_type = match.groups()
    doc_type = DocType(doc_type)
    sender = int(sender)
    target = int(target)
    if doc_type == DocType.DELIVERY:
        doc = DeliveryNote(sender, target, doc_type)
        return doc

def process_header(entry, doc):
    match = re.match(r'(\d{8})(\w)(\w+)', entry)
    if match:
        date_str, freq, units = match.groups()
        doc.date = datetime.strptime(date_str, '%d%m%Y')
        doc.delivery_freq = DeliveryType(freq)
        doc.units = Units(units)

def process_details(entry, doc):
    match = re.match(r'(\D+)(\d+)', entry)
    if match:
        prod, qty = match.groups()
        doc.line_items.append(LineItem(Product(prod), int(qty)))

def parse_data(file_content):
    doc = None
    for line in file_content.splitlines():
        if line.startswith('I'):
            doc = process_identifier(line[1:])
        elif line.startswith('H'):
            process_header(line[1:], doc)
        elif line.startswith('P'):
            process_details(line[1:], doc)
    return doc

if __name__ == '__main__':
    this_doc = parse_data(data)
    this_doc.show()
When I ran this test it gave the following output:
$ python3 read_protocol.py
Sender: 1234
Target: 9876
Type: DELIVERY
Delivery Date: 27-Aug-2022
Deliver Type: REGULAR
Units: KILOGRAMS
    |    Item    | Qty  |
    |------------|------|
    |APPLE       |    10|
    |PINEAPPLE   |    15|
    |ORANGE      |    15|
Hopefully that gives you some ideas, as I'm sure there are lots of assumptions about your data that I've got wrong.
For ease of display here I haven't shown reading from a file. Using Python's pathlib.Path.read_text() should make it relatively straightforward to get the data from a file.
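For example, the parse_data function above could be fed from a file like this (the path is illustrative, and Latin-1 matches the encoding mentioned in the question):

from pathlib import Path

# Read the whole document and hand it to the parser defined above
content = Path('delivery.txt').read_text(encoding='latin-1')
this_doc = parse_data(content)
this_doc.show()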
I am trying to scrape a website and have all the data needed in very long matrices, which were obtained through requests and json imports.
I am having issues getting any output.
Is it because of the concatenation of two strings in requests.get()?
Here is the part with the problem; everything used was declared at the start of the code.
balance = []
for q in range(len(DepositMatrix)):
    address = requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
    data4 = address.json()
    TokenBalances = data4['balances'] # returns a dictionary
    balance.append(TokenBalances)
print(balance)
Example of DepositMatrix - a list of lists, each with 4 elements, [[string, float, int, int]]:
[['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]
I think the error is in this part:
requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
This change doesn't help either:
requests.get('https://ethplorer.io/service/service.php?data=' + DepositMatrix[q][0])
Like I said in my comment, I tried your code and it worked for me. But I wanted to highlight some things that could make your code clearer:
import requests
import pprint

DepositMatrix = [['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
                 ['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]

balance = []
for deposit in DepositMatrix:
    address = requests.get('https://ethplorer.io/service/service.php?data=' + deposit[0])
    data4 = address.json()
    TokenBalances = data4['balances'] # returns a dictionary
    balance.append(TokenBalances)

pprint.pprint(balance)
For the loop, instead of creating a range over the length of your list and then using that index q to read the information back out of the list, it's simpler to iterate over the elements directly (for deposit in DepositMatrix:).
I've used the pprint module to ease the visualization of your data.
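One more small improvement worth considering: rather than concatenating the query string yourself, you can let requests build and encode it via the params argument:

address = requests.get('https://ethplorer.io/service/service.php',
                       params={'data': deposit[0]})

This avoids the manual str() conversion and any URL-encoding concerns.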
I am trying to use Biopython (Entrez) with search terms that will return the accession number (and not the GI*).
Here is a tiny excerpt of my code:
from Bio import Entrez
Entrez.email = 'myemailaddress'
search_phrase = 'Escherichia coli[organism]) AND (complete genome[keyword])'
handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=100, rettype='acc', retmode='text')
result = Entrez.read(handle)
handle.close()
gi_numbers = result['IdList']
print(gi_numbers)
['745369752', '910228862', '187736741', '802098270', '802098269',
 '802098267', '387610477', '544579032', '544574430', '215485161',
 '749295052', '387823261', '387605479', '641687520', '641682562',
 '594009615', '557270520', '313848522', '309700213', '284919779',
 '215263233', '544345556', '544340954', '144661', '51773702',
 '202957457', '202957451', '172051323']
I am sure I can convert from GI to accession, but it would be nice to avoid the additional step. What slice of magic am I missing?
Thank you in advance.
*especially since NCBI is phasing out GI numbers
Looking through the docs for esearch on NCBI's website, there are only two rettypes available: uilist, the default XML format you're currently getting (it's parsed into a dict by Entrez.read()), and count, which just displays the Count value (look at the complete contents of result; it's there). I'm unclear on its exact meaning, as it doesn't represent the total number of items in IdList...
At any rate, Entrez.esearch() will take any value of rettype and retmode you like, but it only returns the uilist or count in XML or JSON mode - no accession IDs, no nothin'.
Entrez.efetch() will pass you back all sorts of cool stuff, depending on which DB you're querying. The downside, of course, is that you need to query by one or more IDs, not by a search string, so to get your accession IDs you'd need to run two queries:
search_phrase = "Escherichia coli[organism]) AND (complete genome[keyword])"
handle = Entrez.esearch(db="nuccore", term=search_phrase, retmax=100)
result = Entrez.read(handle)
handle.close()
fetch_handle = Entrez.efetch(db="nuccore", id=results["IdList"], rettype="acc", retmode="text")
acc_ids = [id.strip() for id in fetch_handle]
fetch_handle.close()
print(acc_ids)
gives
['HF572917.2', 'NZ_HF572917.1', 'NC_010558.1', 'NZ_HG941720.1', 'NZ_HG941719.1', 'NZ_HG941718.1', 'NC_017633.1', 'NC_022371.1', 'NC_022370.1', 'NC_011601.1', 'NZ_HG738867.1', 'NC_012892.2', 'NC_017626.1', 'HG941719.1', 'HG941718.1', 'HG941720.1', 'HG738867.1', 'AM946981.2', 'FN649414.1', 'FN554766.1', 'FM180568.1', 'HG428756.1', 'HG428755.1', 'M37402.1', 'AJ304858.2', 'FM206294.1', 'FM206293.1', 'AM886293.1']
So, I'm not terribly sure if I answered your question satisfactorily, but unfortunately I think the answer is "There is no magic."
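One possible shortcut worth testing, though: newer releases of NCBI's E-utilities accept an idtype="acc" parameter on sequence databases, which makes esearch return accession.version identifiers directly instead of GIs. A sketch (hedged, since this postdates the original answer and support varies by database):

from Bio import Entrez

Entrez.email = 'myemailaddress'
handle = Entrez.esearch(db='nuccore',
                        term='Escherichia coli[organism] AND complete genome[keyword]',
                        retmax=100, idtype='acc')
result = Entrez.read(handle)
handle.close()
print(result['IdList'])  # accession.version strings rather than GI numbers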
I am looking at an xml file similar to the below:
<pinnacle_line_feed>
    <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
    <lastContest>28962804</lastContest>
    <lastGame>162995589</lastGame>
    <events>
        <event>
            <event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
            <gamenumber>422739932</gamenumber>
            <sporttype>Alpine Skiing</sporttype>
            <league>DH 145</league>
            <IsLive>No</IsLive>
            <participants>
                <participant>
                    <participant_name>Kjetil Jansrud (NOR)</participant_name>
                    <contestantnum>2001</contestantnum>
                    <rotnum>2001</rotnum>
                    <visiting_home_draw>Visiting</visiting_home_draw>
                </participant>
                <participant>
                    <participant_name>The Field</participant_name>
                    <contestantnum>2002</contestantnum>
                    <rotnum>2002</rotnum>
                    <visiting_home_draw>Home</visiting_home_draw>
                </participant>
            </participants>
            <periods>
                <period>
                    <period_number>0</period_number>
                    <period_description>Matchups</period_description>
                    <periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
                    <period_status>I</period_status>
                    <period_update>open</period_update>
                    <spread_maximum>200</spread_maximum>
                    <moneyline_maximum>100</moneyline_maximum>
                    <total_maximum>200</total_maximum>
                    <moneyline>
                        <moneyline_visiting>116</moneyline_visiting>
                        <moneyline_home>-136</moneyline_home>
                    </moneyline>
                </period>
            </periods>
            <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
        </event>
    </events>
</pinnacle_line_feed>
I have parsed the file with the code below:
import urllib
import xml.etree.ElementTree as ET

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()

list = []
for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT')
            status = period.find('period_status').text
            update = period.find('period_update').text
            max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text
However, I am hoping to get the nodes separated by event, similar to a 2D table (as in a pandas dataframe). The full XML file has multiple "event" children, and some events do not share the same nodes as above. I am struggling quite mightily with taking each event node and simply creating a 2D table where the tag acts as the column name and the text acts as the value.
Up to this point, I have done the above to gauge how I might put that information into a dictionary, and subsequently put a number of dictionaries into a list from which I can create a dataframe using pandas. But that has not worked out: all attempts have required me to find and replace text to create the dictionaries, and Python has not responded well to that when attempting to subsequently create a dataframe. I have also used a simple:
for elt in tree.iter():
    list.append("'%s': '%s'" % (elt.tag, elt.text.strip()))
which worked quite well in simply pulling out every single tag and the corresponding text, but I was unable to make anything of that, because any attempt at finding and replacing the text to create dictionaries was no good.
Any assistance would be greatly appreciated.
Thank you.
Here's an easy way to get your XML into a pandas dataframe. This utilizes the awesome requests library (which you can swap for urllib if you'd like), as well as the always helpful xmltodict library available on PyPI. (NOTE: a reverse library is also available, known as dicttoxml.)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unwieldy XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
To get some idea of what this is starting to look like, here's the first couple of lines of the CSV:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
Notice that the participants and periods columns still contain their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, columns=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball
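If you do want those nested columns flattened rather than dropped, pandas' json_normalize can expand nested dictionaries into dotted column names. A minimal sketch (in older pandas versions it lives at pandas.io.json.json_normalize, and the exact flattened column names are worth inspecting rather than assuming):

# Expand nested dicts like 'periods' into columns such as 'periods.period.moneyline.moneyline_home'
df_flat = pandas.json_normalize(normal_dict)
print(df_flat.columns.tolist())

Lists of dicts (like the participants entries) still need explicit handling, e.g. a second json_normalize pass with record_path.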