Getting the max value of an array in a record - PyMongo - Python

I have a Mongo collection which stores prices for a number of cryptocurrencies.
Each record looks like this:
1. ticker: string
2. dates: array of datetime.datetime
3. prices: array of float
I am trying to get the most recent date for a given ticker. I've tried this Python code:
max_date = date_price_collection.find({"ticker": "BTC"},{"dates":1}).limit(1).sort({"dates",pymongo.ASCENDING})
But it gives the error:
TypeError: if no direction is specified, key_or_list must be an instance of list
How can I get the maximum date for a specific filtered record?
UPDATE:
The following works:
max_date = date_price_collection.aggregate([ {'$match': {"ticker": ticker}},{'$project': {'max':{'$max': '$dates'}}} ])

The following query will return the max date value from the dates array. Note that this syntax (and feature) is available with MongoDB v4.4 and higher only. This feature allows the use of aggregation projection (and its operators) in find queries.
max_date = date_price_collection.find( { "ticker": "BTC" }, { "_id": False, "dates": { "$max": "$dates" } } )
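As with any find() call this returns a cursor, so the value still has to be read out of it; a minimal usage sketch:
for doc in max_date:
    print(doc["dates"])  # the $max projection is returned under the key "dates"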

Related

Sequential Searching Across Multiple Indexes In Elasticsearch

Suppose I have Elasticsearch indexes in the following order:
index-2022-04
index-2022-05
index-2022-06
...
index-2022-04 represents the data stored in the month of April 2022, index-2022-05 represents the data stored in the month of May 2022, and so on. Now let's say in my query payload, I have the following timestamp range:
"range": {
"timestampRange": {
"gte": "2022-04-05T01:00:00.708363",
"lte": "2022-06-06T23:00:00.373772"
}
}
The above range states that I want to query the data that exists from the 5th of April to the 6th of June. That would mean I have to query the data inside three indexes: index-2022-04, index-2022-05 and index-2022-06. Is there a simple and efficient way of performing this query across those three indexes without having to query each index one by one?
I am using Python to handle the query, and I am aware that I can query across different indexes at the same time (see this SO post). Any tips or pointers would be helpful, thanks.
You simply need to define an alias over your indices and query the alias instead of the individual indexes, letting ES figure out which underlying indexes it needs to visit.
Additionally, for increased search performance, you can also configure index-time sorting on timestampRange, so that if your alias spans a full year of indexes, ES knows to visit only three of them based on the range constraint in your query (2022-04-05 -> 2022-06-06).
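For example, with the Python client you could create such an alias once and then search against it (the alias name and index pattern below are only illustrative):
from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://elastic.host:9200')

# point a single alias at all monthly indexes
es_client.indices.put_alias(index='index-2022-*', name='index-all')

# search the alias; ES resolves the underlying indexes itself
result = es_client.search(
    index='index-all',
    query={
        'range': {
            'timestampRange': {
                'gte': '2022-04-05T01:00:00.708363',
                'lte': '2022-06-06T23:00:00.373772'
            }
        }
    }
)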
As you wrote, you can simply use a wildcard and/or pass a list as the target index.
The simplest way would be to just query all of your indices with an asterisk wildcard (e.g. index-* or index-2022-*) as the target. You do not need to define an alias for that; you can just use the wildcard in the target string, like so:
from elasticsearch import Elasticsearch
es_client = Elasticsearch('https://elastic.host:9200')
datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'
result = es_client.search(
    index='index-*',
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)
This will query all indices that match the pattern, but I would expect Elasticsearch to perform some sort of optimization here. As @Val wrote in his answer, configuring index-time sorting will be beneficial for performance, as it limits the number of documents that need to be visited when the index sort and the search sort are the same.
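For illustration, index-time sorting has to be configured when an index is created; a hedged sketch (assuming timestampRange is a plain date field, and the index name is only an example):
es_client.indices.create(
    index='index-2022-07',
    settings={'index': {'sort.field': 'timestampRange', 'sort.order': 'desc'}},
    mappings={'properties': {'timestampRange': {'type': 'date'}}}
)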
For completeness' sake, if you really wanted to pass just the relevant index names to Elasticsearch, another option would be to first figure out on the Python side which sequence of indices you need to query and supply these as a list (e.g. ['index-2022-04', 'index-2022-05', 'index-2022-06']) as the target. You could, e.g., use the pandas date_range() function to easily generate such a list of indices, like so:
from elasticsearch import Elasticsearch
import pandas as pd
es_client = Elasticsearch('https://elastic.host:9200')
datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'
months_list = pd.date_range(
    pd.to_datetime(datestring_start).to_period('M').to_timestamp(),
    datestring_end,
    freq='MS'
).strftime("index-%Y-%m").tolist()
result = es_client.search(
    index=months_list,
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access the inner attributes of the following JSON using pyspark:
[
    {
        "432": [
            {
                "atttr1": null,
                "atttr2": "7DG6",
                "id": 432,
                "score": 100
            }
        ]
    },
    {
        "238": [
            {
                "atttr1": null,
                "atttr2": "7SS8",
                "id": 432,
                "score": 100
            }
        ]
    }
]
In the output, I am looking for something like below, in the form of CSV:
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details like below, but I don't want to hard-code 432 or 238 in the lambda expression, since in a bigger JSON these keys will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x: (x['432'])).first())
print(inputDF.rdd.map(lambda x: (x['238'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4
Without using pyspark features, you can do it like this:
import json

data = json.loads(json_str)  # or however you're getting the data
columns = 'atttr1 atttr2 id score'.split()

print(','.join(columns))  # headers
for item in data:
    for obj in list(item.values())[0]:  # since each list has only one object
        print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in the list
    print(','.join(str(obj[col]) for col in columns))
FYI, you can store these in a variable or write them out to CSV instead of (or in addition to) printing them.
And if you're just looking to dump that to csv, see this answer.
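If you do want to write the rows out yourself, here is a minimal sketch with the standard csv module (the file name is arbitrary), reusing data and columns from above:
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(columns)  # header row
    for item in data:
        for obj in list(item.values())[0]:  # each list holds a single object
            writer.writerow([obj[col] for col in columns])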

What am I doing wrong in the process of vectorizing the test of whether my geolocation fields are valid?

I was calling out to a geolocation API and was converting the results to a DataFrame like so:
results = geolocator.lookup(ip_list)
results:
[{
query: "0.0.0.0",
coordinates: { lat: "0", lon: "0" }
}, ...]
So we queried 0.0.0.0 and the API returned "0"s for the lat/lon, indicating an IP that obviously can't be geolocated. It's a weird way to handle things, as opposed to a False value or something, but we can work with it.
To DataFrame:
df = pd.DataFrame(results)
But wait, this leads to those "coordinates" fields being dictionaries within the DataFrame, and I may be a pandas beginner, but I know I probably want those stored as DataFrames, not dicts, so we can vectorize.
So instead I did:
for result in results:
    result["coordinates"] = pd.DataFrame(result["coordinates"], index=[0])

df = pd.DataFrame(results)
Not sure what index=[0] does there, but without it I get an error, so I did it like that. Stop me here and tell me why I'm wrong if I'm doing this badly so far. I'm new to Python, and DataFrames with more than two dimensions are confusing to visualize.
Then I wanted to process over df and add a "geolocated" column with True or False based on a vectorized test, and tried to do that like so:
def is_geolocated(coordinate_df):
    # yes, the API returned string coords
    lon_zero = np.equal(coordinate_df["lon"], "0")  # error here
    lat_zero = np.equal(coordinate_df["lat"], "0")
    return lon_zero & lat_zero
df["geolocated"] = is_mappable(df["coordinates"])
But this throws a KeyError "lon".
Am I even on the right track, and if not, how should I set this up?
Generally I would agree with you that a dictionary is a bad way to store latitude/longitude values. This happens due to the way pd.DataFrame() works, as it will pick up on the keys query and coordinates, where the value for the key coordinates is simply a dictionary of the lat/lon values.
You can circumvent the entire problem by, e.g., defining every row as a tuple, and the whole dataframe as a list of these tuples. You can then perform a comparison whether both the lat and lon value are zero, and return this as a new column.
import pandas as pd

# Test dataset
results = [
    {
        'query': "0.0.0.0",
        'coordinates': {'lat': "0", 'lon': "0"}
    },
    {
        'query': "0.0.0.0",
        'coordinates': {'lat': "1", 'lon': "1"}
    }
]

df = pd.DataFrame(
    [(result['query'], result['coordinates']['lat'], result['coordinates']['lon'])
     for result in results]
)
df.columns = ['Query', 'Lat', 'Lon']
df['Geolocated'] = (df['Lat'] == '0') & (df['Lon'] == '0')
df.head()
Query Lat Lon Geolocated
0 0.0.0.0 0 0 True
1 0.0.0.0 1 1 False
In this code I used a list comprehension to build the list of tuples and defined the 'Geolocated' column as a series, which comes from the comparison of the row's Lat and Lon values.
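As a side note, pandas can also flatten the nested dictionaries directly with json_normalize, which yields dotted column names instead of dict-valued cells; a small sketch on the same test data:
flat = pd.json_normalize(results)  # columns: 'query', 'coordinates.lat', 'coordinates.lon'
flat['Geolocated'] = (flat['coordinates.lat'] == '0') & (flat['coordinates.lon'] == '0')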

psycopg2/SQLAlchemy: execute function with custom type array parameter

I have a custom postgres type that looks like this:
CREATE TYPE "Sensor".sensor_telemetry AS
(
sensorid character varying(50),
measurement character varying(20),
val numeric(7,3),
ts character varying(20)
);
I am trying to execute a call to a postgres function that takes an array of this type as a parameter.
I am calling this function with SQLAlchemy as follows:
result = db.session.execute("""select "Sensor"."PersistTelemetryBatch"(:batch)""", batch)
where batch looks like:
{
    "batch": [
        {
            "sensorID": "phSensorA.haoshiAnalogPh",
            "measurement": "ph",
            "value": 8.7,
            "timestamp": "2019-12-06 18:32:36"
        },
        {
            "sensorID": "phSensorA.haoshiAnalogPh",
            "measurement": "ph",
            "value": 8.8,
            "timestamp": "2019-12-06 18:39:36"
        }
    ]
}
When running this execution, I am met with this error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'
I'm guessing that psycopg2 is complaining about the custom type array entry as a dict, because I can supply dictionaries as parameters to other pg function executions (but these dictionaries are not contained within an array like this case). Am I correct about this?
How do I go about correctly passing an array of these objects to my pg function?
A straightforward way to pass the data is to convert the list of dicts to a list of tuples in Python and let psycopg2 handle adapting those to suitable SQL constructs:
from operator import itemgetter
ig = itemgetter("sensorID", "measurement", "value", "timestamp")
batch = {"batch": list(map(ig, batch["batch"]))}
query = """
SELECT "Sensor"."PersistTelemetryBatch"(
CAST(:batch AS "Sensor".sensor_telemetry[]))
"""
result = db.session.execute(query, batch)
Another interesting option when your data is a list of dict would be to use json_populate_record() or json_populate_recordset(), but for those you'd have to fix the keys to match:
import json
batch = [{"sensorid": r["sensorID"],
"measurement": r["measurement"],
"val": r["value"],
"ts": r["timestamp"]}
for r in batch["batch"]]
batch = {"batch": json.dumps(batch)}
query = """
SELECT "Sensor"."PersistTelemetryBatch"(ARRAY(
SELECT json_populate_recordset(
NULL::"Sensor".sensor_telemetry,
:batch)))
"""
result = db.session.execute(query, batch)
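In both variants the function's return value, if any, can then be read from the result and the transaction committed; a short sketch assuming the same db object as in the question:
value = result.scalar()  # first column of the first row, or None
db.session.commit()      # persist whatever the function changed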

How to get data by time range with a timestamp format in MongoDB?

How do I sort data based on a timestamp format in MongoDB?
I have data in the following format:
{
    "_id" : ObjectId("5996562c31f238391609f526"),
    "created_at" : 1502683719,
    "uname" : "username_here",
    "source" : "sourcer"
}
...
...
I want to find the data with a "created_at" filter; I tried a command like this:
db.getCollection('data').find({
    'created_at': {
        '$gte': 1502683719,
        '$lt': 1494616578
    }
})
As a result, all documents come out, including ones whose value is greater than or less than the bounds I entered.
The format of "created_at" is integer.
Assuming your collection is users:
db.users.find().sort({created_at: -1})
Check out the documentation for more detailed options https://docs.mongodb.com/manual/reference/operator/meta/orderby/
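For reference, a small PyMongo sketch that combines the range filter (lower bound in $gte, upper bound in $lt) with a sort on created_at; the client, database and collection names are only illustrative:
from pymongo import MongoClient, DESCENDING

client = MongoClient('mongodb://localhost:27017')
coll = client['mydb']['data']

cursor = coll.find(
    {'created_at': {'$gte': 1494616578, '$lt': 1502683719}}  # lower bound first
).sort('created_at', DESCENDING)

for doc in cursor:
    print(doc['created_at'], doc['uname'])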
