Telegram Telethon: Sharing media downloads across multiple different clients - python

We are trying to use one Telegram client to continuously stream messages from a list of channels and produce those messages to Kafka. A second Telegram client then consumes the messages and downloads the associated media (photos/videos) using client.download_media(). Our issue is that this only works when clients 1 and 2 are the same account, not when they are different accounts. We are not sure whether this has to do with the session files, the access hash, or something else.
Is our use case supported at all? The main thing we are trying to address is that the async media download can build up a large backlog, and that backlog would be lost if our server dies. That is why we wanted to put the messages into Kafka for short-term storage in the first place. We would also appreciate any better suggestions.
This is the producer side:
async with client:
    messages = client.iter_messages(channel_id, limit=10)
    async for message in messages:
        print(message)
        if message.media is not None:
            # orig_media = message.media
            # converted_media = BinaryReader(bytes(orig_media)).tgread_object()
            # print('orig media', orig_media)
            # print('converted media', converted_media)
            message_bytes = bytes(message)  # convert the whole message to bytes
            producer.produce(topic, message_bytes)
This is the consumer side, using a different client:
with self._client:
    # telethon.errors.rpcerrorlist.FileReferenceExpiredError: The file reference has expired and is no longer
    # valid or it belongs to self-destructing media and cannot be resent (caused by GetFileRequest)
    try:
        self._client.loop.run_until_complete(self._client.download_media(orig_media, in_memory))
    except Exception as e:
        print(e)

Media files (among many other things in Telegram) contain an access_hash. While Account-A and Account-B will both see media with ID 1234, Account-A may have a hash of 5678 and Account-B may have a hash of 8765.
This is a roundabout way of saying that every account sees an access_hash that is only valid within that account. If a different account tries to use that hash, the request will fail, because that other account needs its own hash.
There is no way to bypass this, other than giving the second account actual access to the media in question (for example, by having it join the same channel) so that it can obtain its own hash.
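One way to implement what this implies, as a hedged sketch rather than an official Telethon pattern: produce only the chat ID and message ID to Kafka, and let the consumer re-fetch the message with its own session, which only works if the consumer account has also joined the channel so that it can obtain its own access_hash. The helper name download_from_kafka_record is made up for illustration.
import json

# Producer side (inside the existing `async with client:` block):
# ship only identifiers instead of the raw serialized message.
async for message in client.iter_messages(channel_id, limit=10):
    if message.media is not None:
        payload = json.dumps({"chat_id": message.chat_id, "message_id": message.id})
        producer.produce(topic, payload.encode())

# Consumer side (different account, member of the same channel):
# re-fetch the message by ID so this session gets its own access_hash, then download.
async def download_from_kafka_record(client, record_value):
    ref = json.loads(record_value)
    msg = await client.get_messages(ref["chat_id"], ids=ref["message_id"])
    if msg is not None and msg.media is not None:
        return await client.download_media(msg)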

Related

Google Cloud Pub/Sub - Filter metrics by attributes

While playing with GCP Pub/Sub I need to keep an eye on my topics and retrieve the number of undelivered messages. This works pretty well with this snippet of Google Query Monitoring: Link.
But I need to group my messages by attributes. Each message has a body with params like {'target': 'A'}, and I really need to get something like this:
msg.target    undelivered messages
A             34
B             42
C             42
I haven't managed to get this without consuming the messages.
This is my first try:
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

project_id = "xxxx"
subscription_id = "xxxx"
subscription_path = subscriber.subscription_path(project_id, subscription_id)

response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 9999}
)
ack_ids = [r.ack_id for r in response.received_messages]
subscriber.modify_ack_deadline(
    request={
        "subscription": subscription_path,
        "ack_ids": ack_ids,
        "ack_deadline_seconds": 0,  # The message will be immediately 'pullable' again?
    }
)
messages = [json.loads(r.message.data.decode()) for r in response.received_messages]
for m in messages:
    # Parse all messages to get my needed counts
    ...
But it's not working very well: I get a random number of messages each time, so it's impossible to be sure of what I'm looking at.
So here is where I am with my experiments.
I see 3 ways :
Maybe it's possible to access message body attributes directly from Google Query Monitoring?
Maybe my method to consume / parse / release all the messages is not written correctly, and that's why it's not working well?
Maybe I'm going about this all wrong and it would be more efficient to create many topics instead of keeping attributes in the message body, OR there is another way to "tag" messages so they can be grouped later in Monitoring?
Do you have any idea how to do this?
Thanks a lot in advance for your help!
The first thing to note is that the number of undelivered messages is a property of a subscription, not a topic. If there are multiple subscriptions to the same topic, then the number of undelivered messages could be different. There is no way in the Google Query Monitoring system to break down messages by attributes; it does not have any introspection into the contents of the backlog of messages, only to the metadata that is the number of messages.
The code as you have it has several things that make it problematic for trying to determine the number of messages remaining:
Synchronous pull can only return up to 1000 messages per request, so setting max_messages to 9999 will never give you that many messages.
Even with max_messages set to 1000, there is no guarantee that 1000 messages will be returned, even if there are 1000 messages that have not yet been delivered. You would need to issue multiple pull requests in order to fetch all of the messages. Of course, since you nack the messages (by doing a modify_ack_deadline with 0), messages could be redelivered and therefore double counted.
Even though you do the modify_ack_deadline request to nack the messages, while the messages are outstanding to this monitor, they are not available for delivery to your actual subscriber, which delays processing. Furthermore, consider the situation where your monitor crashes for some reason before it gets to perform the modify_ack_deadline. In this situation, those messages would not be delivered to your actual subscriber until the ack deadline you configured in the subscription had passed. If your application is latency-sensitive in any way, this could be a problem.
A different approach to consider would be to create a second subscription and have a monitoring application that receives all messages. For each message, it looks at the attribute and counts it as a received message for that attribute, then acknowledges the message. You could report this per-attribute count via a custom metric. In your actual subscriber application, you would also create a custom metric that counts the number of messages received and processed per attribute. To compute the number of messages remaining to process per attribute, you would take the difference of these two numbers.
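As a rough sketch of such a monitoring subscriber (assuming the target is carried in the JSON body as in your messages, and that monitor-sub is a hypothetical second subscription), counting per attribute and acknowledging each message:
import json
from collections import Counter
from google.cloud import pubsub_v1

project_id = "xxxx"
monitor_subscription_id = "monitor-sub"  # hypothetical second subscription

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, monitor_subscription_id)
received_per_target = Counter()

def callback(message):
    body = json.loads(message.data.decode())
    received_per_target[body.get("target", "unknown")] += 1  # count per attribute
    message.ack()
    # report received_per_target via a custom metric (e.g. Cloud Monitoring) here

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # block and keep counting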
Alternatively, you could consider separating messages per attribute into different topics. However, there are a few things to consider:
Is the set of attributes fixed and known in advance? If not, how will your subscriber know which subscriptions to subscribe to?
How big is the set of attributes to be checked? There is a limit of 10,000 topics per project and so if you have more attributes than that, this approach will not work.
Are you using flow control to limit how many messages are being processed by your subscriber simultaneously? If so, is the number of messages per attribute uniform? If not, you may have to consider how to divide up the flow control across the subscribers on the different subscriptions.
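On the flow-control point, the limit is configured per subscriber client, so with one subscription per attribute each subscriber would get its own setting. A minimal sketch (the value 100 is only illustrative; subscription_path and callback are as in the earlier snippet):
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
flow_control = pubsub_v1.types.FlowControl(max_messages=100)  # at most 100 outstanding messages
future = subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)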

Python + Azure Storage Queue receive_messages()

I'm using an Azure storage queue to hold blob paths for an Azure Function that accesses blobs on the same storage account. (It turns out I've more or less manually created a blob storage Azure Function.)
I'm using the QueueClient class to get the messages from the queue, and there are two relevant methods (Azure Python Documentation):
receive_messages(**kwargs)
peek_messages(max_messages=None, **kwargs)
I would like to be able to scale this function horizontally, so each time it's triggered (I've set it up as an HTTP function triggered from an Azure Logic App) it grabs the FIRST message in the queue, and only the first, and deletes that message once it has been retrieved.
My problem is that peek_messages does not make the message invisible or return a pop_receipt for deleting it later, and receive_messages does not have a max_messages parameter, so I can't take one and only one message.
Does anyone have any knowledge of how to get around this roadblock?
You can try receiving messages in batches by passing the messages_per_page argument to receive_messages. From this link:
# Receive messages by batch
messages = queue.receive_messages(messages_per_page=5)
for msg_batch in messages.by_page():
    for msg in msg_batch:
        print(msg.content)
        queue.delete_message(msg)
@Robert,
To fetch only one message from a queue you can use the code below:
pages = queue.receive_messages(visibility_timeout=30, messages_per_page=1).by_page()
page = next(pages)
msg = next(page)
print(msg)
The documentation of receive_messages() is wrong.
Please see this for more information.
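Putting the two answers together, here is a hedged sketch of "take exactly one message, process it, delete it"; the connection string, queue name, and processing step are placeholders:
from azure.storage.queue import QueueClient

conn_str = "<storage-connection-string>"  # placeholder
queue = QueueClient.from_connection_string(conn_str, queue_name="blob-paths")  # queue name is a placeholder

pages = queue.receive_messages(visibility_timeout=30, messages_per_page=1).by_page()
try:
    msg = next(next(pages))  # first message of the first page
except StopIteration:
    msg = None  # queue was empty

if msg is not None:
    blob_path = msg.content        # the blob path stored in the message body
    # ... process the blob here ...
    queue.delete_message(msg)      # remove it once processing succeeded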

Websocket client in Python - is there a way to receive a specific response from a past subscription?

I'm using the websocket library in Python3 (installed as websocket-client) as the client in an API. I connect to the server with the following command:
ws = websocket.WebSocket(sslopt={"cert_reqs": ssl.CERT_NONE});
ws.connect("wss://here_comes_the_url");
I have previously subscribed to several "events" using:
ws.send(json.dumps({"queryId":"A_UNIQUE_VALUE", "msg_type":"subscription", "subscriptionType":"SUBSCRIBE", ...}));
Now I want to get responses. If I connect and do:
result = ws.recv();
I get a response for one of the subscriptions made previously. However, I have no idea which subscription the response corresponds to. The responses seem to cycle through, one by one, until they reach the one I want.
Is there a way to receive a specific response from a specific subscription using, for instance, the unique queryId provided earlier?
Thanks!
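For what it's worth, websocket-client has no built-in filtering, but since each subscription carries a unique queryId, one workaround (a sketch, assuming the server echoes queryId in every response) is to read in a loop and buffer the responses you are not waiting for:
import json

def recv_for(ws, wanted_query_id, buffered):
    # Return the next response whose queryId matches, buffering everything else.
    while True:
        if buffered.get(wanted_query_id):
            return buffered[wanted_query_id].pop(0)
        response = json.loads(ws.recv())
        buffered.setdefault(response.get("queryId"), []).append(response)

buffered = {}
result = recv_for(ws, "A_UNIQUE_VALUE", buffered)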

Why does search in gmail API return different result than search in gmail website?

I'm using the gmail API to search emails from users. I've created the following search query:
ticket after:2015/11/04 AND -from:me AND -in:trash
When I run this query in the browser interface of Gmail I get 11 messages (as expected). When I run the same query in the API however, I get only 10 messages. The code I use to query the gmail API is written in Python and looks like this:
searchQuery = 'ticket after:2015/11/04 AND -from:me AND -in:trash'
messagesObj = google.get('/gmail/v1/users/me/messages', data={'q': searchQuery}, token=token).data
print messagesObj.resultSizeEstimate # 10
I sent the same message on to another gmail address and tested it from that email address and (to my surprise) it does show up in an API-search with that other email address, so the trouble is not the email itself.
After endlessly emailing around through various test Gmail accounts I think (but am not 100% sure) that the browser-interface search has a different definition of "me". It seems that the API search does not include emails coming from addresses that share the same display name, while these results are in fact included in the browser search. For example: if "Pete Kramer" sends an email from petekramer@icloud.com to pete@gmail.com (both of which have their name set to "Pete Kramer"), it will show in the browser search and it will NOT show in the API search.
Can anybody confirm that this is the problem? And if so, is there a way to circumvent this so I get the same results as the browser search returns? Or does anybody else know why the results from the Gmail browser search differ from the Gmail API search? All tips are welcome!
I would suspect it is the after query parameter that is giving you trouble. 2015/11/04 is not a valid ES5 ISO 8601 date. You could try the alternative after:<time_in_seconds_since_epoch>
# 2015-11-04 <=> 1446595200
searchQuery = 'ticket AND after:1446595200 AND -from:me AND -in:trash'
messagesObj = google.get('/gmail/v1/users/me/messages', data={'q': searchQuery}, token=token).data
print messagesObj.resultSizeEstimate # 11 hopefully!
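If you prefer to compute that epoch value instead of hard-coding it, one way (treating the date as UTC midnight) is:
import calendar
import datetime

dt = datetime.datetime(2015, 11, 4)
print(calendar.timegm(dt.timetuple()))  # 1446595200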
The q parameter of /messages/list works the same as in the web UI for me (tried at https://developers.google.com/gmail/api/v1/reference/users/messages/list#try-it ).
I think the problem is that you are calling /messages rather than /messages/list.
The first time your application connects to Gmail, or if partial synchronization is not available, you must perform a full sync. In a full sync operation, your application should retrieve and store as many of the most recent messages or threads as are necessary for your purpose. For example, if your application displays a list of recent messages, you may wish to retrieve and cache enough messages to allow for a responsive interface if the user scrolls beyond the first several messages displayed. The general procedure for performing a full sync operation is as follows:
1. Call messages.list to retrieve the first page of message IDs.
2. Create a batch request of messages.get requests for each of the messages returned by the list request. If your application displays message contents, you should use format=FULL or format=RAW the first time your application retrieves a message and cache the results to avoid additional retrieval operations. If you are retrieving a previously cached message, you should use format=MINIMAL to reduce the size of the response as only the labelIds may change.
3. Merge the updates into your cached results. Your application should store the historyId of the most recent message (the first message in the list response) for future partial synchronization.
Note: You can also perform synchronization using the equivalent Threads resource methods. This may be advantageous if your application primarily works with threads or only requires message metadata.
Partial synchronization
If your application has synchronized recently, you can perform a partial sync using the history.list method to return all history records newer than the startHistoryId you specify in your request. History records provide message IDs and type of change for each message, such as message added, deleted, or labels modified since the time of the startHistoryId. You can obtain and store the historyId of the most recent message from a full or partial sync to provide as a startHistoryId for future partial synchronization operations.
Limitations
History records are typically available for at least one week and often longer. However, the time period for which records are available may be significantly less and records may sometimes be unavailable in rare cases. If the startHistoryId supplied by your client is outside the available range of history records, the API returns an HTTP 404 error response. In this case, your client must perform a full sync as described in the previous section.
From gmail API Documentation
https://developers.google.com/gmail/api/guides/sync
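As a rough illustration of the full-sync steps above with google-api-python-client (service here is assumed to be an authorized Gmail API service object, which differs from the raw HTTP wrapper used in the question):
query = 'ticket after:1446595200 -from:me -in:trash'

# Step 1: list message IDs (paginate with nextPageToken for more than one page)
resp = service.users().messages().list(userId='me', q=query).execute()
ids = [m['id'] for m in resp.get('messages', [])]

# Step 2: batch-fetch the messages themselves
def handle(request_id, response, exception):
    if exception is None:
        print(response['snippet'])

batch = service.new_batch_http_request(callback=handle)
for msg_id in ids:
    batch.add(service.users().messages().get(userId='me', id=msg_id, format='metadata'))
batch.execute()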

Overhead caused while fetching the required client in WAMP WS

I have created a WebSocket server using WAMP WS in Python.
I have a requirement where about 500 clients are subscribed to the WAMP WS server at a time.
But when I publish data, I send it only to a single client, based on certain conditions. I know it is simple enough to just loop through the list of clients, find the eligible one, and then send the data to that client.
I would like to know whether there is any way to do this without looping, since looping leads to a large overhead if the required client happens to be at the last position.
Presumably you loop through each client's eligibility data and make some sort of decision based on that data. It would follow that an index on the eligibility data would give you near-instant access. So, as a sketch, something like:
client_array = []
client_index = {}

def register(new_client):
    client_array.append(new_client)
    if new_client.eligibility_data not in client_index:
        client_index[new_client.eligibility_data] = []
    client_index[new_client.eligibility_data].append(new_client)
I don't know what the eligibility data is, but say it is the weight of the client. If you wanted to send a message to everybody who weighs between 200 and 205 points, you could find those clients in client_index[200] through client_index[205].
If the condition cannot be determined beforehand, then you may need a database that can handle arbitrary queries to determine the target clients.
When doing a publish, you can provide a list of eligible receivers for the event via options, e.g. similar to this. The list of eligible receivers should be specified as a list of WAMP session IDs (which is the correct way to identify WAMP clients in this case).
Internally, AutobahnPython uses Python sets and set operations to compute the actual receivers, which is quite fast (the set operations are built into the language, so native code runs).
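A minimal sketch with a current Autobahn ApplicationSession (the topic name and the session-ID lookup are placeholders; older WAMPv1 APIs spelled this differently):
from autobahn.wamp.types import PublishOptions

# inside an ApplicationSession, once the target client's WAMP session ID is known
def notify_one(self, target_session_id, payload):
    # the event is delivered only to the sessions listed in `eligible`
    self.publish(
        u"com.example.mytopic",
        payload,
        options=PublishOptions(eligible=[target_session_id]),
    )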
