Searching a list using astroquery - python

I have the following code below that I got from https://astroquery.readthedocs.io/en/latest/gaia/gaia.html.
It returns br correctly when ra and dec are single numbers. However, I don't have just one pair of numbers; I have lists of ra and dec values. When I tried passing lists for ra and dec in the code below, I got an error saying Error 500: null. Is there a way to find br using lists of ra and dec?
from astroquery.gaia import Gaia
from astropy.coordinates import SkyCoord
import astropy.units as u

coord = SkyCoord(ra=ra, dec=dec, unit=(u.degree, u.degree), frame='icrs')  # ra, dec: my single coordinate values in degrees
width = u.Quantity(0.0005, u.deg)
height = u.Quantity(0.0005, u.deg)
r = Gaia.query_object_async(coordinate=coord, width=width, height=height)
r.pprint()
r.columns
br=[r["phot_bp_rp_excess_factor"]]
print (br)
I am new to astroquery, so any help will be appreciated.

Hi and congrats on your first StackOverflow question. As I understand it, Astroquery is a community effort, and the modules for querying individual online catalogues are in many cases developed and maintained side-by-side with the online query systems, often by the same developers. So different modules within Astroquery are sometimes worked on by different teams, and have varying degrees of feature-completeness and consistency in their interfaces (something I'd like to see improved, but it's very difficult to coordinate).
In the case of Gaia.query_object_async, the docs are not completely clear, and it's not obvious that it even supports an array SkyCoord. It should, or at least, if it doesn't, it should give a better error.
To double-check, I dug into the code and found what I kind of suspected: it does not allow this at all. It builds an SQL query using string replacement (generally considered a bad idea) and passes that query to a web-based service. Because the ra and dec values are arrays, it blindly substitutes their array representations into the query template, resulting in an invalid query:
SELECT
  TOP 50
  DISTANCE(
    POINT('ICRS', ra, dec),
    POINT('ICRS', [99.00000712 87.00000767], [24.99999414 24.99999461])
  ) as dist,
  *
FROM
  gaiadr2.gaia_source
WHERE
  1 = CONTAINS(
    POINT('ICRS', ra, dec),
    BOX(
      'ICRS',
      [99.00000712 87.00000767],
      [24.99999414 24.99999461],
      0.005,
      0.005
    )
  )
ORDER BY
  dist ASC
The server, rather than returning an error message indicating that the query is malformed, just returns a generic server error. Basically, it crashes.
Long story short, you should probably open a bug report about this against astroquery, and see if it's on the Gaia maintainers' radar to deal with: https://github.com/astropy/astroquery/issues/new
In the meantime it sounds like your best bet is to make multiple queries in a loop and join their results together. Since it returns a Table, you can use astropy.table.vstack:
from astropy.table import vstack

results = []
for coord in coords:
    results.append(Gaia.query_object_async(coord, width=width, height=height))

results = vstack(results)
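For completeness, here is a minimal end-to-end sketch of that loop, assuming your coordinates live in two plain Python lists (called ra_list and dec_list here; the values are just the ones from the example query above):

import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.table import vstack
from astroquery.gaia import Gaia

# Assumed input: parallel lists of coordinates in degrees
ra_list = [99.00000712, 87.00000767]
dec_list = [24.99999414, 24.99999461]

width = u.Quantity(0.0005, u.deg)
height = u.Quantity(0.0005, u.deg)

results = []
for ra, dec in zip(ra_list, dec_list):
    coord = SkyCoord(ra=ra, dec=dec, unit=(u.degree, u.degree), frame='icrs')
    results.append(Gaia.query_object_async(coordinate=coord, width=width, height=height))

r = vstack(results)
print(r["phot_bp_rp_excess_factor"])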


IDA python Find issues

My goal here is to search through the entire memory range of a process for the following pattern:
pop *
pop *
retn
I've tried using FindText, but it seems that it only returns results for areas that have already been parsed for their instructions in IDA. So to use FindText I'd need to figure out how to parse the entire memory range for instructions (which seems like it would be intensive).
So I switched to FindBinary, but I ran into an issue there as well. The pattern I'm searching for only needs to match the first 5 bits of each byte, and the rest is wildcard. So my goal would be to search for:
01011***
01011***
11000011
I've found posts claiming IDA has a ? wildcard for bytes, but I haven't been able to get it to work, and even if it did, it only seems to work for a full 8 bits. So for this approach I would need to find a way to search for bit patterns and then parse the bits around the result. This seems like the most doable route, but so far I haven't been able to find anything in the docs that can search bits like this.
Does anyone know a way to accomplish what I want?
In classic Stack Overflow style, I spent hours trying to figure it out, then 20 minutes after asking for help I found the exact function I needed: get_byte()
def find_test():
    base = idaapi.get_imagebase()
    while True:
        # Find the next 0xC3 (retn) byte, searching down from base
        res = FindBinary(base, SEARCH_NEXT | SEARCH_DOWN, "C3")
        if res == BADADDR:
            break
        # Check that the two preceding bytes both start with 0b01011 (pop r32)
        if 0b01011 == get_byte(res - 1) >> 3 and 0b01011 == get_byte(res - 2) >> 3:
            print "{0:X}".format(res)
        base = res + 1
Now, if only I could figure out how to do this with a wildcard in every instruction, because for this solution I need to know at least one full byte of the pattern.
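For the fully wildcarded case, one workaround is to read the raw bytes into Python yourself and match them against a value/mask pair, so no complete byte needs to be known. A minimal sketch, assuming IDA 7+ with Python 3 (ida_bytes.get_bytes and ida_segment.get_segm_by_name), that searching a single named segment is enough, and where find_masked is just a made-up helper name:

import ida_bytes
import ida_segment

def find_masked(pattern, mask, seg_name=".text"):
    # pattern and mask are equal-length lists of ints; the byte at offset j
    # matches when (byte & mask[j]) == pattern[j].
    seg = ida_segment.get_segm_by_name(seg_name)
    buf = ida_bytes.get_bytes(seg.start_ea, seg.end_ea - seg.start_ea)
    n = len(pattern)
    for i in range(len(buf) - n + 1):
        if all((buf[i + j] & mask[j]) == pattern[j] for j in range(n)):
            yield seg.start_ea + i

# pop r32 / pop r32 / retn: first five bits 01011 with the rest wildcard, then 0xC3
for ea in find_masked([0b01011000, 0b01011000, 0b11000011],
                      [0b11111000, 0b11111000, 0b11111111]):
    print("{:X}".format(ea))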

How can I reduce my sqlite3 code using variables?

I have around 500 lines of SQLite code and I am trying to reduce that number, so I tried making a function with variables.
I tried using this code:
x = "c.execute("
o = "'"
y = "'INSERT INTO "
unio = x+y
tabl = ["d","d_2","n_d","n_d_2"]
val = "VALUES ("
w = ") "
k = "?,?"
mark = 369
ek = 963
def db_up(table,):
    aa = y+table+"("+tabl[0] + ", " + tabl[1]+w+val+k+")"+"',"+"(mark,ek,)"
    bb = unio+table+"("+tabl[1]+ ", " + tabl[1]+w+val+k+")"+"',"+"(mark,ek,))"
    print(aa) # 'INSERT INTO avg_dt(d, d_2) VALUES (?,?)',(mark,ek,)
    print(bb) # c.execute('INSERT INTO avg_dt(d_2, d_2) VALUES (?,?)',(mark,ek,))
    c.execute(str(aa)) # no success
    c.execute(aa) # no success
    bb # no success
When I run the c.execute(aa) line, it throws this error:
sqlite3.OperationalError: near "'INSERT INTO avg_dt(d, d_2) VALUES (?,?)'": syntax error
So... how can I write SQLite code using variables and functions?
Thanks for taking the time ;)
I can't post a comment with my current rep, so let me try to provide a full answer.
Your approach seems to be that you have too much SQL code, which you consider a problem, so you're trying to "compress" the code by reducing common statements into shorter variable names. There are a number of reasons you should avoid doing this, and a number of better ways to achieve a similar result.
First, let's talk about why this is a bad idea:
Your code doesn't become more maintainable. Interspersing cryptic variable names like x, o, y, or unio doesn't make the code any easier to read, even for the author, assuming a few days have passed since you wrote it.
Using this kind of method doesn't make your code any more performant, and most likely makes it less performant: your program now has to worry about allocating and reallocating memory when doing string interpolation or concatenation, which takes cycles.
Doing string interpolation or concatenation in SQL should be done with extreme caution: this is essentially a homebrew version of prepared statements, which are usually authored by people with loads of experience in SQL programming. Rolling your own risks making your program a target for SQL injection (or at least column/value type mismatches); sqlite3's built-in query parameters avoid this (a short sketch of that approach follows the lists below).
Now, let's talk about mitigating this issue for you:
Lots of SQL code need not be unmanageable: typically, if you must maintain large amounts of raw SQL in your project, you dump that into a separate SQL file which you can then either execute directly from the database CLI or run from your program via a database driver.
ORMs are (usually) your friend: with the exception of exotic or outstandingly performance-sensitive SQL queries, modern ORMs can get rid of most raw SQL code in your program. The more naturally programmatic structure of ORM code also means you can break it up into different delegate functions to avoid code reuse, to an extent.
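To make the prepared-statement point above concrete, here is a minimal sketch of a reusable insert helper built on sqlite3's ? placeholders rather than string concatenation (the table/column names and whitelist are just illustrations based on the avg_dt example in your output):

import sqlite3

conn = sqlite3.connect("example.db")
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS avg_dt (d INTEGER, d_2 INTEGER)")

def db_up(cursor, table, columns, values):
    # Table/column names cannot be bound as parameters, so validate them
    # against a whitelist; the values themselves go through ? placeholders.
    allowed = {"avg_dt": {"d", "d_2", "n_d", "n_d_2"}}
    if table not in allowed or not set(columns) <= allowed[table]:
        raise ValueError("unexpected table or column name")
    placeholders = ",".join("?" for _ in values)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (table, ",".join(columns), placeholders)
    cursor.execute(sql, values)

db_up(c, "avg_dt", ["d", "d_2"], (369, 963))
conn.commit()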
Please feel free to add details to your question; as it stands, it's not totally clear whether your concerns can be addressed with this answer.

Specifying limit and offset in Django QuerySet won't work

I'm using Django 1.6.5 and have MySQL's general query log on, so I can see the SQL hitting MySQL.
And I noticed that specifying a bigger limit in a Django QuerySet does not work:
>>> from blog.models import Author
>>> len(Author.objects.filter(pk__gt=0)[0:999])
>>> len(Author.objects.all()[0:999])
And MySQL's general log showed that both queries had LIMIT 21.
But a limit smaller than 21 would work, e.g. len(Author.objects.all()[0:10]) would produce SQL with LIMIT 10.
Why is that? Is there something I need to configure?
It happens when you make queries from the shell - the LIMIT clause is added to stop your terminal filling up with thousands of records when debugging:
You were printing (or, at least, trying to print) the repr() of the
queryset. To avoid people accidentally trying to retrieve and print a
million results, we (well, I) changed that to only retrieve and print
the first 20 results and print "remainder truncated" if there were more.
This is achieved by limiting the query to 21 results (if there are 21
results there are more than 20, so we print the "truncated" message).
That only happens in the repr() -- i.e. it's only for diagnostic
printing. No normal user code has this limit included automatically, so
you happily create a queryset that iterates over a million results.
(Source)
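To illustrate the distinction the quote is drawing, here is a rough shell sketch (using the Author model from the question; the exact SQL depends on your Django version):

qs = Author.objects.all()[0:999]
qs               # typing the queryset alone prints repr(qs), which only fetches 21 rows (LIMIT 21)
rows = list(qs)  # explicitly evaluating the queryset fetches the full slice (LIMIT 999)
len(rows)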
Django implements OFFSET using Python's array-slicing syntax. If you want to skip the first 10 elements and then show the next 5 elements, use:
MyModel.objects.all()[OFFSET:OFFSET+LIMIT]
For example if you wanted to check 5 authors after an offset of 10 then your code would look something like this:
Author.objects.all()[10:15]
You can read more about it in the official Django docs.
I have also written a blog post around this concept; you can read more here.
LIMIT and OFFSET don't work in Django quite the way you might expect.
For example:
If we have to read the next 10 rows starting from the 10th row and we specify:
Author.objects.all()[10:10]
it will return an empty record list. In order to fetch the next 10 rows, we have to add the offset to the limit:
Author.objects.all()[10:10+10]
And it will return the next 10 rows starting from the 10th row.
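If you want to see exactly what SQL a slice turns into, you can inspect the queryset's query attribute from the shell; a quick sketch with the Author model from the question:

qs = Author.objects.all()[10:20]
print(qs.query)  # the generated SQL should end with something like: LIMIT 10 OFFSET 10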
For offset and limit I used the following, and it worked for me :)
MyModel.objects.all()[offset:limit]
For example:
Post.objects.filter(Post_type=typeId)[1:1]
It does work, but Django uses an iterator; it does not load all objects at once.

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what URL is being generated by Biopython, as it'll help me work out how I have to structure the search term when I want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting around combined search terms), but currently the generated URL is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)
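Putting the two pieces together, a quick sketch of the corrected query plus the URL check available in recent Biopython versions:

from Bio import Entrez

Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term='"stem+cell"[ALL]',
                        datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)   # the exact Eutils URL Biopython requested
record = Entrez.read(handle)
print(record["Count"])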

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB, and they are a large (ha) source of the problem. I've included some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time
ref_uid_index = """CREATE INDEX ref_uid_idx
ON data(ref_uid)"""
def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""

    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)

    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])

    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks

    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]

    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800 MB database and produces output something like the following.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of gigabyte size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, e.g., double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection also has a cache for the last 100 statements, as you can see in the function signature:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, it is unlikely that you will benefit from actively caching statements or pages at application start.
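For reference, a minimal sketch of applying both settings from Python (the numbers are arbitrary; note that a negative cache_size is interpreted by SQLite as a size in kibibytes rather than a page count):

import sqlite3

# Allow up to 200 cached prepared statements on this connection
conn = sqlite3.connect("/some/file/path.db", cached_statements=200)

# Ask SQLite to use roughly 64 MB of page cache for this connection
conn.execute("PRAGMA cache_size = -65536")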
