How to get partial match with soup.find()? - python

For some reason I wasn't able to find the answer to this somewhere.
So, I'm using this
soup.find(text="Dimensions").find_next().text
to grab the text after "Dimensions". My problem is on the website I'm scraping sometimes it is displayed as "Dimensions:" (with colon) or sometimes it has space "Dimensions " and my code throws an error. So that's why I'm looking for smth like (obviously, this is an invalid piece of code) to get a partial match:
soup.find(if "Dimensions" in text).find_next().text
How can I do that?

Ok, I've just found out looks like it's much simpler than I thought
soup.find(text=re.compile(r"Dimensions")).find_next().text
does what I need

Related

How to use searchtype option in PubChemPy

I tried to run the code
pcp.get_compounds('CC', searchtype='superstructure', listkey_count=3)
but, it didn't work.
This code is exactly the same as one shown in the documentation ("https://pubchempy.readthedocs.io/en/latest/guide/searching.html#advanced-search-types").
Another code such as pcp.get_compounds('Aspirin', 'name', record_type='3d') which is shown in the same page worked.
Please give me some advice about how to fix this error.
It appears that the example in the PubChemPy advanced search documentation is missing a parameter. The example does not identify how to search, i.e., by SMILES. Substituting the following statement for the one in the example should give you the desired result.
pcp.get_compounds('CC', 'smiles', searchtype='superstructure', listkey_count=3)

IDA python Find issues

my goal here is to search through the entire memory range of a process for the following pattern:
pop *
pop *
retn
I've tried using FindText but it seems that it only returns results for areas that have already been parsed for their instructions in IDA. so to use FindText id need to figure out how to parse the entire memory range for instructions (which seems like it would be intensive).
So i switched to FindBinary but i ran into an issue there as well. the pattern I'm searching only needs to match the first 5 bits of the byte and the rest is wildcard. so my goal would be to search for:
01011***
01011***
11000011
I've found posts claiming IDA has a ? wildcard for bytes, but i haven't been able to get it to work and even if it did it only seems to work for a full 8 bits. so for this approach i would need to find a way to search for bit patterns then parse the bits around the result. this seems like the most doable route but so far i haven't been able to find anything in the docs that can search bits like this.
does anyone know a way to accomplish what i want?
in classic stackoverflow style, i spent hours trying to figure it out then 20 minutes after asking for help i found the exact function i needed, get_byte()
def find_test():
base = idaapi.get_imagebase()
while True:
res = FindBinary(base, SEARCH_NEXT|SEARCH_DOWN, "C3")
if res==BADADDR: break
if 0b01011 == get_byte(res-1) >> 3 and 0b01011 == get_byte(res-2) >> 3:
print "{0:X}".format(res)
base=res+1
now, if only i could figure out how to do this with a wildcard in every instruction. because for this solution i need to know at least one full byte of the pattern

Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?

I was looking at methods to split documents into paragraphs and I came across texttiling as one possible way to do this.
Here is my attempt to use it. However, I don't understand how to work with the output. I'd appreciate your help.
t = unidecode(doclist[0].decode('utf-8','ignore'))
nltk.tokenize.texttiling.TextTilingTokenizer(t)
output:
<nltk.tokenize.texttiling.TextTilingTokenizer at 0x11e9c6350>
I'm messing around with this one myself just now for the same reason you are and had the same question you did so don't be too upset if this is wrong. I figured best to pass on what little I know... :)
I'm not sure yet but I found in this bug report an example of using the TextTilingTokenizer:
alice=nltk.corpus.gutenberg.raw('carroll-alice.txt')
ttt = nltk.tokenize.TextTilingTokenizer()
tiles = ttt.tokenize(alice[140309 : ])
It appears that you want to feed your text to the tokenize method on the the TextTilingTokenizer.

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit cludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other#example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what url is being generated by Biopython as it'll help me work out how i have to structure the search term for when i want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)

How to transform hyperlink codes into normal URL strings?

I'm trying to build a blog system. So I need to do things like transforming '\n' into < br /> and transform http://example.com into < a href='http://example.com'>http://example.com< /a>
The former thing is easy - just using string replace() method
The latter thing is more difficult, but I found solution here: Find Hyperlinks in Text using Python (twitter related)
But now I need to implement "Edit Article" function, so I have to do the reverse action on this.
So, how can I transform < a href='http://example.com'>http://example.com< /a> into http://example.com?
Thanks! And I'm sorry for my poor English.
Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store the source text only, and only format it as HTML when you need to display it. That way, alternate output formats / views (RSS, summaries, etc) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...
Since you are using the answer from that other question your links will always be in the same format. So it should be pretty easy using regex. I don't know python, but going by the answer from the last question:
import re
myString = 'This is my tweet check it out http://tinyurl.com/blah'
r = re.compile(r'(http://[^ ]+)')
print r.sub(r'\1', myString)
Should work.

Categories