Doc2vec : TaggedLineDocument() - python

So, I'm trying to learn and understand Doc2Vec.
I'm following this tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:
input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...]
documents = TaggedLineDocument(input)
model = doc2vec.Doc2Vec(documents,size = 50, window = 10, min_count = 2, workers=2)
But I am getting an error (I tried googling it, but had no luck):
TypeError('don\'t know how to handle uri %s' % repr(uri))
Can somebody please help me understand where I am going wrong? Thank you!

TaggedLineDocument should be instantiated with a file path, not an in-memory list. Make sure the file is set up so that one line equals one document.
documents = TaggedLineDocument('myfile.txt')
documents = TaggedLineDocument('compressed_text.txt.gz')
From the source code:
The uri (the thing you are instantiating TaggedLineDocument with) can be any of:
1. a URI for the local filesystem (compressed ``.gz`` or ``.bz2`` files handled automatically):
`./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
2. a URI for HDFS: `hdfs:///some/path/lines.txt`
3. a URI for Amazon's S3 (can also supply credentials inside the URI):
`s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret#my_bucket/lines.txt`
4. an instance of the boto.s3.key.Key class.
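If your documents are already in memory as a list of token lists (as in the question), one option this implies is to write them out one document per line and then point TaggedLineDocument at that file. A minimal sketch, with placeholder names like input_docs and docs.txt:
from gensim.models import doc2vec
from gensim.models.doc2vec import TaggedLineDocument

# The in-memory documents from the question (placeholder words).
input_docs = [["word1", "word2", "wordn"], ["word1", "word2", "wordn"]]

# Write one document per line, tokens separated by spaces.
with open("docs.txt", "w", encoding="utf-8") as f:
    for doc in input_docs:
        f.write(" ".join(doc) + "\n")

documents = TaggedLineDocument("docs.txt")
# Newer gensim calls the parameter vector_size; older versions used size, as in the question.
model = doc2vec.Doc2Vec(documents, vector_size=50, window=10, min_count=2, workers=2)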

For the data, I have a list formatted the same way as yours:
[['aw', 'wb', 'ce', 'uw', 'qqg'], ['g', 'e', 'ent', 'va'],['a']...]
For the labels, I have a list:
[1, 0, 0 ...]
It indicates the class of each of the sentences above; you can use any class (tag) here, not only 1 or 0.
Since we already have lists like the above, we can use TaggedDocument directly instead of TaggedLineDocument:
import gensim
from gensim.models.doc2vec import TaggedDocument

model = gensim.models.Doc2Vec(self.myDataFlow(data, labels))

def myDataFlow(self, data, labels):
    # Pair each token list with its label and wrap the pair as a TaggedDocument
    for words, label in zip(data, labels):
        yield TaggedDocument(words, [label])
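Outside a class, the same idea can be sketched roughly like this (the tiny data and labels lists are just the examples from above; min_count is lowered so nothing is dropped from such a small sample):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

data = [['aw', 'wb', 'ce', 'uw', 'qqg'], ['g', 'e', 'ent', 'va'], ['a']]
labels = [1, 0, 0]

# Wrap each (token list, label) pair, then train directly on the list.
tagged = [TaggedDocument(words=words, tags=[label]) for words, label in zip(data, labels)]

# Older gensim versions call vector_size "size", as in the question.
model = Doc2Vec(tagged, vector_size=50, window=10, min_count=1, workers=2)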

Related

best way to create file in word2vec format to pass to spacy init model?

I have a homegrown embedding model I have built. I am trying to load my word vectors into spaCy (using the init-model CLI) and therefore need to reformat my vector output as a word2vec table, which as I understand it is: the first line is the shape of the vectors, and each following line is "word" <word_embedding>.
My question (and maybe it is a stupid one): is there a way that I can write the word as a string (with parentheses) and the raw vector? My current file is "word <word_vector>", so when the file is parsed the word vector ends up as a string, which is not desirable.
The code I am trying to conform to is (spacy init-model):
def read_vectors(vectors_loc, truncate_vectors=0):
    f = open_file(vectors_loc)
    shape = tuple(int(size) for size in next(f).split())
    if truncate_vectors >= 1:
        shape = (truncate_vectors, shape[1])
    vectors_data = numpy.zeros(shape=shape, dtype="f")
    vectors_keys = []
    for i, line in enumerate(tqdm(f)):
        line = line.rstrip()
        pieces = line.rsplit(" ", vectors_data.shape[1])
        word = pieces.pop(0)
        if len(pieces) != vectors_data.shape[1]:  # <- pieces is a string!
            msg.fail(Errors.E094.format(line_num=i, loc=vectors_loc), exits=1)
        vectors_data[i] = numpy.asarray(pieces, dtype="f")  # <- will literally create an array of length 1, dtype=object
        vectors_keys.append(word)
        if i == truncate_vectors - 1:
            break
    return vectors_data, vectors_keys
I know I could pretty easily start hacking up the init-model code if need be, but I would really rather not.
Thanks in advance.
End of the day and a bit of a doofus moment.
If anyone else has trouble passing vectors to spaCy... you can actually pass the vectors as a numpy .npz file, insofar as you also pass the --json-loc file (as the JSONL file has the id index into the word vectors).
A bit weird, as I have always serialized my numpy arrays with a .npy extension, but nonetheless.
Hope this helps someone!
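For anyone who still wants the plain-text route instead: read_vectors() above just expects a header line with the row count and dimension, then one "word v1 v2 ... vN" line per entry, so a homegrown table could be written out roughly like this (a sketch with made-up names such as embeddings and vectors.txt):
import numpy as np

# Toy stand-in for the homegrown embeddings: {word: vector}.
embeddings = {
    "cat": np.random.rand(50).astype("f"),
    "dog": np.random.rand(50).astype("f"),
}

dim = len(next(iter(embeddings.values())))
with open("vectors.txt", "w", encoding="utf-8") as f:
    # First line: number of rows and vector dimension.
    f.write("{} {}\n".format(len(embeddings), dim))
    # One line per word: the word, then space-separated float components.
    for word, vec in embeddings.items():
        f.write(word + " " + " ".join("{:.6f}".format(x) for x in vec) + "\n")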

Converting molecule name to SMILES?

I was just wondering, is there any way to convert IUPAC or common molecular names to SMILES? I want to do this without having to manually convert every single one utilizing online systems. Any input would be much appreciated!
For background, I am currently working with Python and RDKit, so I wasn't sure if RDKit could do this and I was just unaware. My current data is in CSV format.
Thank you!
RDKit can't convert names to SMILES.
The Chemical Identifier Resolver can convert names and other identifiers (like CAS numbers) and has an API, so you can convert with a script.
from urllib.request import urlopen
from urllib.parse import quote

def CIRconvert(ids):
    try:
        url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
        ans = urlopen(url).read().decode('utf8')
        return ans
    except:
        return 'Did not work'

identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']

for ids in identifiers:
    print(ids, CIRconvert(ids))
Output
3-Methylheptane CCCCC(C)CC
Aspirin CC(=O)Oc1ccccc1C(O)=O
Diethylsulfate CCO[S](=O)(=O)OCC
Diethyl sulfate CCO[S](=O)(=O)OCC
50-78-2 CC(=O)Oc1ccccc1C(O)=O
Adamant Did not work
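Since the question mentions the names live in a CSV, one simple way to apply this is with pandas. A sketch assuming a hypothetical data.csv with a name column, reusing CIRconvert from above:
import pandas as pd

# Hypothetical file/column names; adjust to your CSV layout.
df = pd.read_csv('data.csv')
df['smiles'] = df['name'].apply(CIRconvert)   # CIRconvert as defined above
df.to_csv('data_with_smiles.csv', index=False)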
OPSIN (https://opsin.ch.cam.ac.uk/) is another solution for name2structure conversion.
It can be used by installing the CLI, or via https://github.com/gorgitko/molminer
(OPSIN is used by the RDKit KNIME nodes also)
PubChemPy has some great features that can be used for this purpose. It supports IUPAC systematic names, trade names and all known synonyms for a given Compound, as documented in the PubChem database:
https://pubchempy.readthedocs.io/en/latest/
>>> import pubchempy as pcp
>>> results = pcp.get_compounds('Glucose', 'name')
>>> print(results)
[Compound(79025), Compound(5793), Compound(64689), Compound(206)]
The first argument is the identifier, and the second argument is the identifier type, which must be one of name, smiles, sdf, inchi, inchikey or formula. It looks like there are 4 compounds in the PubChem Database that have the name Glucose associated with them. Let’s take a look at them in more detail:
>>> for compound in results:
...     print(compound.isomeric_smiles)
C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
C(C1C(C(C(C(O1)O)O)O)O)O
It looks like they all have different stereochemistry information!
The accepted answer uses the Chemical Identifier Resolver, but for some reason the website seems to be buggy for me and the API seems to be messed up.
So another way to convert a SMILES string to an IUPAC name is with the PubChem Python API (PubChemPy), which can work if your SMILES is in their database,
e.g.:
#!/usr/bin/env python
import sys
import pubchempy as pcp
smiles = str(sys.argv[1])
print(smiles)
s = pcp.get_compounds(smiles, 'smiles')
print(s[0].iupac_name)
You can use PubChem's batch query service (Identifier Exchange):
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange-help.html
You can use the PubChem API (PUG REST) for this
(https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial).
Basically, the URL takes the compound as a "name": you give the name, then specify that you want the "property" "CanonicalSMILES", returned as text.
import requests
import pandas as pd

identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']

smiles_df = pd.DataFrame(columns=['Name', 'Smiles'])

for x in identifiers:
    try:
        url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/' + x + '/property/CanonicalSMILES/TXT'
        # remove the trailing newline character with rstrip
        smiles = requests.get(url).text.rstrip()
        if 'NotFound' in smiles:
            print(x, " not found")
        else:
            smiles_df = smiles_df.append({'Name': x, 'Smiles': smiles}, ignore_index=True)
    except:
        print("boo ", x)

print(smiles_df)

How to iterate over and download each image in an image collection from the Google Earth Engine python api

I am new to Google Earth Engine and am trying to understand how to use its Python API. I can create an image collection, but apparently the getDownloadUrl() method operates only on individual images, so I am trying to understand how to iterate over and download all of the images in the collection.
Here is my basic code. I broke it out in great detail for some other work I am doing.
import ee
ee.Initialize()
col = ee.ImageCollection('LANDSAT/LC08/C01/T1')
col.filterDate('1/1/2015', '4/30/2015')
pt = ee.Geometry.Point([-2.40986111110000012, 26.76033333330000019])
buff = pt.buffer(300)
region = ee.Feature.bounds(buff)
col.filterBounds(region)
I pulled the Landsat collection and filtered by date and a buffer geometry, so I should have something like 7-8 images in the collection (with all bands).
However, I could not seem to get iteration to work over the collection.
for example:
for i in col:
    print(i)
The error indicates TypeError: 'ImageCollection' object is not iterable
So if the collection is not iterable, how can I access the individual images?
Once I have an image, I should be able to use the usual
path = col[i].getDownloadUrl({
    'scale': 30,
    'crs': 'EPSG:4326',
    'region': region
})
It's a good idea to use ee.batch.Export for this. Also, it's good practice to avoid mixing client and server functions (reference). For that reason, a for-loop can be used, since Export is a client function. Here's a simple example to get you started:
import ee
ee.Initialize()

rectangle = ee.Geometry.Rectangle([-1, -1, 1, 1])
sillyCollection = ee.ImageCollection([ee.Image(1), ee.Image(2), ee.Image(3)])

# This is OK for small collections
collectionList = sillyCollection.toList(sillyCollection.size())
collectionSize = collectionList.size().getInfo()

for i in range(collectionSize):
    ee.batch.Export.image.toDrive(
        image=ee.Image(collectionList.get(i)).clip(rectangle),
        fileNamePrefix='foo' + str(i + 1),
        dimensions='128x128').start()
Note that converting a collection to a list in this manner is also dangerous for large collections (reference). However, this is probably the most scalable method if you really need to download.
Here is my solution:
import ee
ee.Initialize()

pt = ee.Geometry.Point([-2.40986111110000012, 26.76033333330000019])
region = pt.buffer(10)

col = ee.ImageCollection('LANDSAT/LC08/C01/T1')\
    .filterDate('2015-01-01', '2015-04-30')\
    .filterBounds(region)

bands = ['B4', 'B5']  # Change it!

def accumulate(image, img):
    name_image = image.get('system:index')
    image = image.select([0], [name_image])
    cumm = ee.Image(img).addBands(image)
    return cumm

for band in bands:
    col_band = col.map(lambda img: img.select(band)
                       .set('system:time_start', img.get('system:time_start'))
                       .set('system:index', img.get('system:index')))

    # ImageCollection to List
    col_list = col_band.toList(col_band.size())

    # Define the initial value for iterate.
    base = ee.Image(col_list.get(0))
    base_name = base.get('system:index')
    base = base.select([0], [base_name])

    # Eliminate the image 'base'.
    new_col = ee.ImageCollection(col_list.splice(0, 1))

    img_cummulative = ee.Image(new_col.iterate(accumulate, base))

    task = ee.batch.Export.image.toDrive(
        image=img_cummulative.clip(region),
        folder='landsat',
        fileNamePrefix=band,
        scale=30).start()
    print('Export Image ' + band + ' was submitted, please wait ...')

img_cummulative.bandNames().getInfo()
A reproducible example can be found here: https://colab.research.google.com/drive/1Nv8-l20l82nIQ946WR1iOkr-4b_QhISu
You could possibly use ee.ImageCollection.iterate() with a function that gets the image and adds it to a list.
import ee
def accumulate_images(image, images):
    images.append(image)
    return images

for img in col.iterate(accumulate_images, []):
    url = img.getDownloadURL(dict(scale=30, crs='EPSG:4326', region=region))
Unfortunately I am not able to test this code as I do not have access to the API, but it might help you arrive at a solution.
I had a similar problem and was not able to solve it with the presented solutions, so I put together some sample code for this purpose. It iterates over an image collection on the client side, so it is not affected by the (server side only) limitations of .map() or .iterate().
It is possible to download the code and see its explanation here.
It basically transforms the ImageCollection into a list (ic.toList()). Then it performs a standard loop, and for each individual image it is possible to convert it back with ee.Image(list.get(i)) and process the images one by one, covering the whole collection.
In your particular case, to download each image, the function to be called within the loop could be getDownloadURL() or getThumbURL():
var url = imgNew.getDownloadURL({
    region: geometry,
});
var thumbURL = imgNew.getThumbURL({region: geometry, dimensions: 512, format: 'png'});
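For completeness, here is a rough, untested Python sketch of the same client-side loop, assuming col and region are defined as in the question:
import ee
ee.Initialize()

# Client-side loop: convert the collection to a list, then pull each image out.
# Fine for small collections only; `col` and `region` are assumed from the question.
col_list = col.toList(col.size())
n = col_list.size().getInfo()

for i in range(n):
    img = ee.Image(col_list.get(i))
    url = img.getDownloadURL({
        'scale': 30,
        'crs': 'EPSG:4326',
        'region': region
    })
    print(i, url)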

Fast Named Entity Removal with NLTK

I wrote a couple of user defined functions to remove named entities (using NLTK) in Python from a list of text sentences/paragraphs. The problem I'm having is that my method is very slow, especially for large amounts of data. Does anyone have a suggestion for how to optimize this to make it run faster?
import nltk
import string
# Function to reverse tokenization
def untokenize(tokens):
    return("".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip())

# Remove named entities
def ne_removal(text):
    tokens = nltk.word_tokenize(text)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    tokens = [leaf[0] for leaf in chunked if type(leaf) != nltk.Tree]
    return(untokenize(tokens))
To use the code I typically have a text list and call the ne_removal function through a list comprehension. Example below:
text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
named_entities_removed = [ne_removal(text) for text in text_list]
print(named_entities_removed)
## OUT: ['went to the store.', 'is my friend.']
UPDATE: I tried switching to a batch version with this code, but it's only slightly faster. I will keep exploring. Thanks for the input so far.
def extract_nonentities(tree):
    tokens = [leaf[0] for leaf in tree if type(leaf) != nltk.Tree]
    return(untokenize(tokens))

def fast_ne_removal(text_list):
    token_list = [nltk.word_tokenize(text) for text in text_list]
    tagged = nltk.pos_tag_sents(token_list)
    chunked = nltk.ne_chunk_sents(tagged)
    non_entities = []
    for tree in chunked:
        non_entities.append(extract_nonentities(tree))
    return(non_entities)
Every time you call ne_chunk(), it needs to initialize a chunker object and load the statistical model for chunking from disk. Ditto for pos_tag(). So instead of calling them on one sentence at a time, call their batch versions on the complete list of texts:
all_data = [ nltk.word_tokenize(sent) for sent in list_of_all_sents ]
tagged = nltk.pos_tag_sents(all_data)
chunked = nltk.ne_chunk_sents(tagged)
This should give you a considerable speed-up. If that's still too slow for your needs, try profiling your code and consider whether you need to switch to more high-powered tools, as @Lenz suggested.
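If you do profile, a minimal sketch with the standard library's cProfile (assuming fast_ne_removal and text_list from the update above are defined at module level):
import cProfile
import pstats

# Profile the batch function and print the ten most expensive calls.
cProfile.run('fast_ne_removal(text_list)', 'ne_profile.stats')
pstats.Stats('ne_profile.stats').sort_stats('cumulative').print_stats(10)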

Python 2.7: Persisting search and indexing

I wrote a small 'search engine' that finds all the text files in a directory and its sub-directories. I could include the code, but I don't think it is necessary for my question.
It works by creating a dictionary in a format like this:
term_frequency = {'file1' : {'WORD1' : 1, 'WORD2' : 2, 'WORD3' : 3},
                  'file2' : {'WORD1' : 1, 'WORD3' : 3, 'WORD4' : 4},
                  ...continues with all the files it has found...}
From gathered information it creates a second dictionary like such:
document_frequency = {'WORD1' : ['file1', 'file2', ....],
                      'WORD2' : ['file1', ............],
                      ....every found word..........}
The purpose of the term_frequency dictionary is to hold the number of times each word has been used in each file, and document_frequency records in how many documents each word has been used.
Then, when given a word, it calculates the relevance of every file as tf/df and lists the non-zero values in descending order of relevance.
for example:
file1 : 0.75
file2 : 0.5
I am aware that this is a very simple representation of tf-idf, but I am new to Python and programming (2 weeks) and am still getting familiar with it all.
Sorry for the long-ish intro but I feel it is relevant to the question... which brings me right to it:
How do I go about making an indexer that saves those dictionaries to a file, and then a 'searcher' that reads those dictionaries back from the file? The issue right now is that every time you want to look for a different word, it has to read ALL the files once again and build the same two dictionaries over and over.
The pickle (and for that matter cPickle) library is your friend here. By using pickle.dump(), you can turn the entire dictionary into one file which can be read back later by pickle.load().
In this case, you could use something like this:
import pickle
termfile = open('terms.pkl', 'wb')
documentfile = open('documents.pkl', 'wb')
pickle.dump(term_frequency, termfile)
pickle.dump(document_frequency, documentfile)
termfile.close()
documentfile.close()
and read it back like so:
termfile = open('terms.pkl', 'rb')
documentfile = open('documents.pkl', 'rb')
term_frequency = pickle.load(termfile)
document_frequency = pickle.load(documentfile)
termfile.close()
documentfile.close()
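The same thing reads a little more safely with with blocks, which close the files even if something goes wrong (and on Python 2.7 cPickle is a faster drop-in replacement):
try:
    import cPickle as pickle  # faster C implementation on Python 2.7
except ImportError:
    import pickle

# Save both indexes
with open('terms.pkl', 'wb') as termfile, open('documents.pkl', 'wb') as documentfile:
    pickle.dump(term_frequency, termfile)
    pickle.dump(document_frequency, documentfile)

# Load them back later
with open('terms.pkl', 'rb') as termfile, open('documents.pkl', 'rb') as documentfile:
    term_frequency = pickle.load(termfile)
    document_frequency = pickle.load(documentfile)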
