msg = service.users().messages().get(userId='me', id=message['id']).execute()
print(msg['snippet'])
I am currently using the above code, which doesn't get the whole message. I have seen in the documentation that the Gmail API has raw and full options, but the raw option doesn't print in a readable way and I cannot get the full option to work.
Thank you !
This is how it worked for me:
import base64

# Get the message in "full" format (headers + payload)
msg_header = service.users().messages().get(
    userId=user_id,
    id=msg_id,
    format="full",
    metadataHeaders=None
).execute()

# Decode the base64url-encoded body data from the payload
body = base64.urlsafe_b64decode(
    msg_header.get("payload").get("body").get("data").encode("ASCII")
).decode("utf-8")
The body comes in HTML, so in my case I use BeautifulSoup to extract the information I need, like below:
from bs4 import BeautifulSoup as bs

soup = bs(body, 'html.parser')

# Loop over the rows of the e-mail's table
info = {}
for row in soup.findAll('tr'):
    aux = row.findAll('td')
    info[aux[0].string] = aux[1].string
The information extraction will depend on the pattern of the message. In my case, all messages that I'm getting have the same pattern.
I am using kafka-python and BeautifulSoup to scrape a website that I visit often, and to send a message to a Kafka broker with a Python producer.
What I want is that whenever a new post is uploaded to the website (it is a community site, a bit like Reddit, that Korean hip-hop fans use to share information), that post should be sent to the Kafka broker.
However, my problem is that within the while loop, only the latest post keeps being sent to the Kafka broker, repeatedly. This is not what I want.
A second problem is that when a new post is loaded, an HTTP Error 502: Bad Gateway occurs on
soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
and no more messages are sent.
This is dataScraping.py:
from bs4 import BeautifulSoup
import re
import urllib.request

pattern = re.compile('[0-9]+')

def parseContent():
    soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
    # Drop the pinned notice rows so only regular posts remain
    for div in soup.find_all("tr", class_="notice"):
        div.decompose()
    # Scrape the first (latest) post in the board listing
    key_num = pattern.findall(soup.find_all("td", class_="no")[0].text)
    category = soup.find_all("td", class_="categoryTD")[0].find("span").text
    author = soup.find_all("td", class_="author")[0].find("span").text
    title = soup.find_all("td", class_="title")[0].find("a").text
    link = "http://hiphople.com" + soup.find_all("td", class_="title")[0].find("a").attrs["href"]
    # Follow the link and pull the post body, stripping tags and non-breaking spaces
    soup2 = BeautifulSoup(urllib.request.urlopen(link).read(), "html.parser")
    content = str(soup2.find_all("div", class_="article-content")[0].find_all("p"))
    content = re.sub("<.+?>", "", content, 0).strip()
    content = re.sub("\xa0", "", content, 0).strip()
    result = {"key_num": key_num, "category": category, "title": title, "author": author, "content": content}
    return result

if __name__ == "__main__":
    print("data scraping from website")
And this is PythonWebScraping.py:
import json
from kafka import KafkaProducer
from dataScraping import parseContent

def json_serializer(data):
    return json.dumps(data).encode("utf-8")

producer = KafkaProducer(acks=1, compression_type="gzip",
                         bootstrap_servers=["localhost:9092"],
                         value_serializer=json_serializer)

if __name__ == "__main__":
    while True:
        result = parseContent()
        producer.send("hiphople", result)
Please let me know how to fix my code so that newly created posts are sent to the Kafka broker as I expect.
Your function is working, but it's true that it returns only one event. I did not get a 502 Bad Gateway; you may be getting it as DDoS protection because you are accessing the URL too many times, so try adding delays/sleeps, or your IP may have been banned to stop it from scraping the URL.
As for your other error: your function returns only the single, latest post, and you send that result to Kafka on every pass of the loop, which is why you are seeing the same message over and over again. You are scraping and taking just the last event; what did you want your function to do?
prevResult = ""
while True:
    result = parseContent()
    if prevResult != result:
        prevResult = result
        print(result)
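Putting both fixes together, a sketch of the producer loop with deduplication and a delay (the 10-second interval is an arbitrary choice to ease off the site and avoid the 502):

import time

prev_result = None
while True:
    result = parseContent()
    if result != prev_result:
        # Only forward genuinely new posts to the broker
        prev_result = result
        producer.send("hiphople", result)
    time.sleep(10)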
I am trying to decode an email sent to me from a specific source. The email looks like a CSS box that contains the info I need. When I run it through the function provided by Google, I get what appears to be CSS code, I cannot extract the information I need, and the content maintype is "text". But if I forward the same email to myself and run the exact same function on it, the content maintype is "multipart", and I am able to extract the plain text of the CSS body and grab the info I need. I think this is because, when I forward it to myself, it contains plain text at the top (showing the forward info) as well as the CSS body.
So my question is: how can I extract the same plain text that I get from the CSS body after forwarding the email to myself, without forwarding the email to myself? Below is the function I am using:
def get_message(service, user_id, msg_id):
    try:
        # Fetch the email in RAW format
        message = service.users().messages().get(userId=user_id, id=msg_id, format='raw').execute()
        # Decode the base64url string into raw bytes
        msg_raw = base64.urlsafe_b64decode(message['raw'].encode('ASCII'))
        # Parse the bytes into an email.message.Message object
        msg_str = email.message_from_bytes(msg_raw)
        # Check whether the content is multipart (plain text and HTML) or single part
        content_types = msg_str.get_content_maintype()
        print(content_types)
        if content_types == 'multipart':
            # Part 1 is the plain-text part
            part1, part2 = msg_str.get_payload()
            raw_email = part1.get_payload()
            # Strip quoted-printable artifacts and table borders
            remove_char = ["|", "=20", "=C2=A0"]
            for i in remove_char:
                raw_email = raw_email.replace(i, "")
            # Drop blank lines
            raw_email = "".join([s for s in raw_email.strip().splitlines(True) if s.strip()])
            return str(raw_email)
        else:
            return msg_str.get_payload()
    except:
        print('An error has occurred during the get_message function.')
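One way to get readable text without forwarding, assuming the single-part body is actually HTML despite the "text" maintype, is to strip the markup with BeautifulSoup; a minimal sketch (the helper name and the BeautifulSoup dependency are my additions, not part of the function above):

from bs4 import BeautifulSoup

def get_message_text(msg_str):
    # Decode the transfer encoding (quoted-printable/base64) to raw bytes first
    payload = msg_str.get_payload(decode=True)
    soup = BeautifulSoup(payload, "html.parser")
    # Drop <style> blocks so the CSS itself doesn't end up in the text
    for style in soup.find_all("style"):
        style.decompose()
    return soup.get_text(separator="\n", strip=True)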
I am currently getting the body/content of emails in Python using the following:
import email
message = email.message_from_file(open(file))
messages = [part.get_payload() for part in message.walk() if part.get_content_type() == 'text/plain']
This seems to work well in most cases, but I noticed that sometimes there are HTML tables that don't get picked up. They start with
<html>
<style type="text/css">
Would the fix just be to add or part.get_content_type() == 'text/css'?
If I had to guess, I would guess that you need to add 'text/html'.
However, you should be able to figure out which content types are in the email by examining the parts themselves:
import email

message = email.message_from_file(open(file))
# Remove the content-type filter completely, keeping each part's type alongside its payload
messages = [(part.get_payload(), part.get_content_type()) for part in message.walk()]
# Print the whole thing out so that you can see what content types are in there
print(messages)
This will help you see what content types are in there and you can then filter the ones that you need.
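Once you see what's in there, a sketch of keeping both the text/plain and text/html parts (converting HTML with BeautifulSoup is my own assumption, not something your code already does):

from bs4 import BeautifulSoup

texts = []
for part in message.walk():
    ctype = part.get_content_type()
    if ctype == 'text/plain':
        texts.append(part.get_payload())
    elif ctype == 'text/html':
        # Strip the markup so the tables still yield their text
        texts.append(BeautifulSoup(part.get_payload(), 'html.parser').get_text())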
I'm sending a GET request to /users/{id}/calendar/calendarView?startDateTime={start_datetime}&endDateTime={end_datetime} in order to get events of a user and after that I'm filtering the event subject and content.
What happens is that, most of the time, the subject in the JSON response is just the name of the event organizer, and the content is just blank HTML, instead of the values that can normally be seen in the calendar.
I tried to find other fields in the JSON that could provide the correct information, but there aren't any.
It looks like some people have had the same problem (here and here), but either no solution was found or the one given is not what I need.
The following code is what I'm using:
graph_client = OAuth2Session(token=token)

headers = {
    'Prefer': 'outlook.timezone="America/Manaus"'
}

response = graph_client.get(
    f'{self.graph_url}/users/{room}/calendar/calendarView?startDateTime={start_datetime}&endDateTime={end_datetime}',
    headers=headers)
print(response.json()['value'])
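For what it's worth, if the events are genuinely readable by your app, two documented Graph options may help: selecting the fields you need explicitly with $select, and asking for a plain-text body via the Prefer header. Whether they fix this particular case is a guess on my part; the symptom you describe can also come from the room mailbox's calendar processing replacing event details, which no query option can undo. A sketch:

graph_client = OAuth2Session(token=token)

headers = {
    # Ask for local times and a plain-text body in one Prefer header
    'Prefer': 'outlook.timezone="America/Manaus", outlook.body-content-type="text"'
}

url = (f'{self.graph_url}/users/{room}/calendar/calendarView'
       f'?startDateTime={start_datetime}&endDateTime={end_datetime}'
       f'&$select=subject,organizer,body')
response = graph_client.get(url, headers=headers)
for event in response.json().get('value', []):
    print(event['subject'], event['body']['content'])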
I have a working app using the Imgur API with Python.
from imgurpython import ImgurClient

client_id = '5b786edf274c63c'
client_secret = 'e44e529f76cb43769d7dc15eb47d45fc3836ff2f'

client = ImgurClient(client_id, client_secret)
items = client.subreddit_gallery('rarepuppers')
for item in items:
    print(item.link)
It outputs a set of imgur links. I want to display all those pictures on a webpage.
Does anyone know how I can integrate python to do that in an HTML page?
OK, I haven't tested this, but it should work. The HTML is really basic (but you never mentioned anything about formatting), so give this a shot:
from imgurpython import ImgurClient

client_id = '5b786edf274c63c'
client_secret = 'e44e529f76cb43769d7dc15eb47d45fc3836ff2f'

client = ImgurClient(client_id, client_secret)
items = client.subreddit_gallery('rarepuppers')

htmloutput = open('somename.html', 'w')
htmloutput.write("[html headers here]")
htmloutput.write("<body>")
for item in items:
    print(item.link)
    # Embed each image, with whatever caption you like
    htmloutput.write('<img src="' + item.link + '">SOME DESCRIPTION<br>')
htmloutput.write("</body></html>")
htmloutput.close()
You can remove the print(item.link) statement in the code above - it's there to give you some comfort that stuff is happening.
I've made a few assumptions:
The item.link returns a string with only the unformatted link. If it's already in html format, then remove the HTML around item.link in the example above.
You know how to add an HTML header. Let me know if you do not.
This script will create the HTML file in the folder it's run from. This is probably not the greatest of ideas, so adjust the path of the created file appropriately (see the sketch below).
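As a small refinement to the sketch above, a with block closes the file even if something fails mid-loop, and an explicit path addresses the last point; the file name and location here are arbitrary:

import os

out_path = os.path.join(os.path.expanduser("~"), "rarepuppers.html")
with open(out_path, "w") as htmloutput:
    htmloutput.write("<html><head><title>rarepuppers</title></head><body>")
    for item in items:
        # item.link is assumed to be a bare URL, as above
        htmloutput.write(f'<img src="{item.link}"><br>')
    htmloutput.write("</body></html>")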
Once again, though, if that is a real client secret, you probably want to change it ASAP.