IDAPython find issues - python

My goal here is to search through the entire memory range of a process for the following pattern:
pop *
pop *
retn
I've tried using FindText, but it seems to return results only for areas that IDA has already parsed into instructions. So to use FindText I'd need to figure out how to parse the entire memory range for instructions (which seems like it would be intensive).
So I switched to FindBinary, but I ran into an issue there as well. The pattern I'm searching for only needs to match the first 5 bits of each byte; the rest is wildcard. So my goal would be to search for:
01011***
01011***
11000011
I've found posts claiming IDA has a ? wildcard for bytes, but I haven't been able to get it to work, and even if it did, it only seems to work for a full 8 bits. So for this approach I would need to find a way to search for bit patterns and then parse the bits around each result. This seems like the most doable route, but so far I haven't found anything in the docs that can search bits like this.
Does anyone know a way to accomplish what I want?

In classic Stack Overflow style, I spent hours trying to figure it out, then 20 minutes after asking for help I found the exact function I needed: get_byte().
def find_test():
    # Scan downward from the image base for every RETN (0xC3) byte.
    base = idaapi.get_imagebase()
    while True:
        res = FindBinary(base, SEARCH_NEXT | SEARCH_DOWN, "C3")
        if res == BADADDR:
            break
        # The two bytes before the RETN must both be pop r32 (01011reg),
        # i.e. their top five bits equal 0b01011.
        if get_byte(res - 1) >> 3 == 0b01011 and get_byte(res - 2) >> 3 == 0b01011:
            print("{0:X}".format(res))
        base = res + 1
Now, if only I could figure out how to do this with a wildcard in every instruction, because this solution requires knowing at least one full byte of the pattern.
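For what it's worth, one way around that limitation is to skip FindBinary entirely and scan the raw segment bytes with a (mask, value) pair per byte, so every position can carry its own wildcard bits. A minimal sketch, assuming the IDAPython 7.x names idautils.Segments, idc.get_segm_end and idc.get_bytes:

import idautils, idc

# (mask, value) per byte: pop r32 = 01011***, retn = 11000011
PATTERN = [(0b11111000, 0b01011000),
           (0b11111000, 0b01011000),
           (0b11111111, 0b11000011)]

def find_masked():
    for seg_start in idautils.Segments():
        seg_end = idc.get_segm_end(seg_start)
        data = bytearray(idc.get_bytes(seg_start, seg_end - seg_start) or b"")
        for i in range(len(data) - len(PATTERN) + 1):
            # Every byte must match its own mask/value pair.
            if all((data[i + j] & m) == v for j, (m, v) in enumerate(PATTERN)):
                print("{0:X}".format(seg_start + i))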

Related

How to get partial match with soup.find()?

For some reason I wasn't able to find the answer to this anywhere.
So, I'm using this:
soup.find(text="Dimensions").find_next().text
to grab the text after "Dimensions". My problem is that on the website I'm scraping it is sometimes displayed as "Dimensions:" (with a colon) and sometimes with a trailing space, "Dimensions ", and then my code throws an error. So I'm looking for something like this (obviously an invalid piece of code) to get a partial match:
soup.find(if "Dimensions" in text).find_next().text
How can I do that?
OK, I've just found out it's much simpler than I thought:
soup.find(text=re.compile(r"Dimensions")).find_next().text
does what I need
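As a side note, BeautifulSoup also accepts a callable as the filter, which avoids the regex entirely; newer versions prefer the string= argument over the older text=. A small self-contained check with made-up HTML:

import re
from bs4 import BeautifulSoup

html = "<div><span>Dimensions:</span><span>10 x 20 cm</span></div>"
soup = BeautifulSoup(html, "html.parser")

# Regex match, as above
print(soup.find(string=re.compile(r"Dimensions")).find_next().text)

# Equivalent callable filter: any text node containing "Dimensions"
print(soup.find(string=lambda t: t and "Dimensions" in t).find_next().text)

Both lines print "10 x 20 cm".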

REGEX to find all matches inside a given string

I have a problem that is currently driving me nuts. I have a list with a couple of million entries, and I need to extract product categories from them. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
A type check did indeed give me string: print(type(item)) returns <class 'str'>.
Now I searched online for a possible (and preferably fast, because of the million entries) regex solution to extract all the categories.
I found several questions here, e.g. Match single quotes from python re: I tried re.findall(r"'(\w+)'", item) but only got empty brackets [].
Then I went on and searched for alternative methods like this one: Python Regex to find a string in double quotes within a string. There someone tries the following:
matches = re.findall(r'\"(.+?)\"', item)
print(matches)
but this failed in my case as well...
After that I tried some idiotic approach to get at least a workaround and solve this problem later: list_cat_split = item.split(','), which gives me
["[['Electronics'", " 'Computers & Accessories'", " 'Cables & Accessories'", " 'Memory Card Adapters']]"]
Then I tried string methods to get rid of the stuff and then apply a regex:
list_categories = []
for item in list_cat_split:
    # Note: str.strip() returns a new string; these calls discard the
    # result, so item is never actually modified here.
    item.strip('\"')
    item.strip(']')
    item.strip('[')
    item.strip()
    category = re.findall(r"'(\w+)'", item)
    if category not in list_categories:
        list_categories.append(category)
However, even this approach failed: [['Electronics'], []]. (In hindsight that makes sense: \w+ matches only word characters, so categories containing spaces or ampersands never match, and strip() returns a new string rather than cleaning item in place.)
I searched further but did not find a proper solution. Sorry if this question is completely stupid; I am new to regex, and probably this is a no-brainer for regular regex users.
UPDATE:
Somehow I cannot answer my own question, therefore here an update:
Thanks for the answers, and sorry for the incomplete information; I very rarely ask here and usually try to find solutions on my own. I do not want to use a database, because this is only a small fraction of my preprocessing work for an ML application that is written entirely in Python. Also, this is for my MSc project, so there is no production environment. Therefore I am fine with a slower but working solution, as I do it once and for all. As far as I can see, the solution from @FailSafe worked for me (screenshot of my Jupyter notebook showing the resulting list).
But yes, I totally agree with @Wiktor Stribiżew: in a production setup I would for sure set up a database and let this run overnight. Thanks for all the help anyway, great people here :-)
This may not be your final answer, but it creates a list of categories:
x="[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
y=x[2:-2]
z=y.split(',')
for item in z:
print(item)
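Since each entry is itself a Python list literal in string form, another route (not from the answers above) is to let ast.literal_eval parse it, which handles the quotes, commas, spaces and ampersands in one go. A minimal sketch:

import ast

item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
# literal_eval safely parses the string into a nested list; [0] unwraps it.
categories = ast.literal_eval(item)[0]
print(categories)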

Xpath - obtaining 2 nodes with 1 node having default value if missing

I am using xpath in Python 2.7 with lxml:
from lxml import html
...
tree = html.fromstring(source)
results = tree.xpath(...xpath string...)
Now the problem is the XPath string itself, and I am getting quite lost in it. I am trying to get all the nodes from one path like this:
//a[@class="hyperlinkClass"]/span/text() (1)
There are no missing entries in this part and it works fine. But I'm also trying to get a part relative to this, like so:
//a[@class="hyperlinkClass"]/span/following-sibling::div[@class="divClassName"]/span[@class="spanClassName"]/text() (2)
This works fine by itself, but (2) may or may not have nodes for each node in (1). What I would like is a default value, say "absent", for when (2) is missing or empty for a given node in (1). This sounds straightforward, and maybe it is, but I'm hitting a brick wall here.
By doing '(1) | (2)' I get all the values needed, but there is no way to match them up. If I do '(1) | concat((2), "absent")', this doesn't work either; concat doesn't seem to work from Python, though I've read that it is valid XPath. I saw here the "Becker method", but that doesn't work either (or I can't get it to).
Hopefully, someone can shine a light on how to get this working or if it's even possible.
Don't make this more complicated than it is:
path1 = '//a[@class="hyperlinkClass"]/span'
path2 = './following-sibling::div[@class="divClassName"]/span[@class="spanClassName"]'
for link in tree.xpath(path1):
    other_node = link.xpath(path2)
    if len(other_node):
        print(link.text, other_node[0].text)
    else:
        print(link.text, 'n/a')
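For anyone who wants to see the pairing behaviour in isolation, here is a self-contained check with made-up XML (the class names follow the question; etree is used so the structure is preserved exactly as written):

from lxml import etree

source = '''<root>
  <a class="hyperlinkClass"><span>first</span>
    <div class="divClassName"><span class="spanClassName">value1</span></div>
  </a>
  <a class="hyperlinkClass"><span>second</span></a>
</root>'''
tree = etree.fromstring(source)

path1 = '//a[@class="hyperlinkClass"]/span'
path2 = './following-sibling::div[@class="divClassName"]/span[@class="spanClassName"]'
for link in tree.xpath(path1):
    other_node = link.xpath(path2)
    print(link.text, other_node[0].text if other_node else 'absent')

This prints "first value1" and then "second absent".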

BioPython Pubmed Eutils url?

I'm trying to run some queries against PubMed's EUtils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to PubMed).
A while ago I bodged together a Python script to build a query to do much the same thing, and the resulting URL returns the same number of hits (link to EUtils result).
Of course, not having any formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]", datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the URL is being generated, but I can't work out how to see what URL Biopython generates. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the URL is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what URL Biopython generates, as it will help me work out how to structure the search term when I want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]", datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)
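To tie the two answers together, here is a short sketch (assuming a Biopython version recent enough to expose handle.url, per the update above) that prints the generated URL and the hit count for both spellings of the term:

from Bio import Entrez

Entrez.email = "A.N.Other@example.com"

# Compare the unquoted and the double-quoted forms of the search term.
for term in ['stem+cell[All Fields]', '"stem+cell"[ALL]']:
    handle = Entrez.esearch(db="pubmed", term=term, datetype="pdat",
                            mindate="2012", maxdate="2012")
    print(handle.url)                    # the exact EUtils URL requested
    print(Entrez.read(handle)["Count"])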

How to transform hyperlink codes into normal URL strings?

I'm trying to build a blog system, so I need to do things like transforming '\n' into <br /> and transforming http://example.com into <a href='http://example.com'>http://example.com</a>.
The former is easy: just use the string replace() method.
The latter is more difficult, but I found a solution here: Find Hyperlinks in Text using Python (twitter related).
But now I need to implement an "Edit Article" function, so I have to do the reverse of that transformation.
So, how can I transform <a href='http://example.com'>http://example.com</a> back into http://example.com?
Thanks! And I'm sorry for my poor English.
Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store the source text only, and only format it as HTML when you need to display it. That way, alternate output formats / views (RSS, summaries, etc) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...
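A minimal sketch of that display-time formatting (the function name is illustrative, the URL regex is slightly tightened from the answer below so it stops at whitespace, and HTML escaping is ignored for brevity):

import re

def format_html(source):
    # Wrap bare URLs in anchors, then convert newlines for display.
    html = re.sub(r'(http://\S+)', r"<a href='\1'>\1</a>", source)
    return html.replace('\n', '<br />\n')

print(format_html('Check it out http://example.com\nNew line'))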
Since you are using the answer from that other question, your links will always be in the same format, so it should be pretty easy using regex. I don't know Python, but going by the answer to the last question:
import re

# The anchors produced by the other answer are always in this exact
# format, so capture the href value and drop the surrounding tag.
myString = "This is my tweet check it out <a href='http://tinyurl.com/blah'>http://tinyurl.com/blah</a>"
r = re.compile(r"<a href='([^']+)'>[^<]*</a>")
print(r.sub(r'\1', myString))
Should work.
