Scraping chat file: How to automatically add missing usernames

I have a local HTML file containing messages from a Telegram chat.
I scraped each message and extracted information such as the message ID, the date it was posted, and its content.
The problem: if a user posts multiple times in a row, only the first message contains the user's name. After that, the name column in the df remains empty, because all the following messages have their own ID and timestamp but lack the div class="from_name" element.
Is there a way to fix this so that every message has a username in the df?
messages = doc.select('div.message')
rows = []
for message in messages:
    print('---')
    row = {}
    row['id_number'] = (message['id'])
    try:
        row['time'] = (message.select_one('div[title]').get('title'))
    except:
        print("Couldn't find the time")
    try:
        row['username'] = (message.select_one('div.from_name').contents[0].strip())
    except:
        print("Couldn't find a name")
    try:
        row['text of the message'] = (message.select_one('div.text').text.strip())
    except:
        print("Couldn't find a text")
    print(row)
    rows.append(row)

You can handle it in your DataFrame or explicitly in your code. Instead of the try/except blocks, you can check directly whether the element is available and, if it is not, take the username from the last row added to rows:
if message.select_one('div.from_name'):
    row['username'] = message.select_one('div.from_name').contents[0].strip()
else:
    row['username'] = rows[-1].get('username')
    print("Couldn't find a name, set last username")
EDIT: Based on your comments, you could check len(rows) and continue with the next iteration if it is empty:
if message.select_one('div.from_name'):
    row['username'] = message.select_one('div.from_name').contents[0].strip()
elif len(rows) > 0:
    row['username'] = rows[-1].get('username')
else:
    continue
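The same carry-forward idea can also be applied after the loop, once all rows have been collected. The following is a minimal sketch, assuming rows is a list of dicts like the one built above; fill_missing_usernames and the sample data are hypothetical and not part of BeautifulSoup or pandas.

```python
def fill_missing_usernames(rows, key="username"):
    """Copy the most recent known username into rows that lack one."""
    last_seen = None
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the caller's dicts untouched
        if new_row.get(key):
            last_seen = new_row[key]
        elif last_seen is not None:
            new_row[key] = last_seen
        filled.append(new_row)
    return filled


# Sample data standing in for the scraped rows (hypothetical values).
sample_rows = [
    {"id_number": "message1", "username": "Alice"},
    {"id_number": "message2"},
    {"id_number": "message3", "username": "Bob"},
    {"id_number": "message4"},
]
filled_rows = fill_missing_usernames(sample_rows)
print([r["username"] for r in filled_rows])
```

If the rows are already in a DataFrame, calling ffill() on the username column should give the same result, provided the missing names are stored as NaN.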

Related

How to determine if whatsapp contact search find results with selenium python

I want to send WhatsApp messages using Selenium in Python.
I'm getting my contact numbers from a CSV file. In a loop, I type each phone number into the contact search box on WhatsApp Web (some of my phone contacts are duplicated, so I search by phone number instead of by name) and then press Enter, with Selenium of course. That opens the only matching chat, so I can send the message and so on.
The problem is that when the search returns no result, the messages are sent to the last person who was contacted, so that person gets a duplicate message.
How can I determine whether the search returns any result? Or, in this case, how can I know whether the number has WhatsApp or not?
Thanks
from selenium import webdriver
import time
import pandas as pd
import os
import xlrd
import autoit
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

fileName = 'test.csv'
messages_excel = 'messages.xlsx'
driver = webdriver.Chrome(r'D:\python\chromedriver')
driver.get('https://web.whatsapp.com/')
input('after QR Code')
with open(fileName) as file:
    data = pd.read_csv(file)
    df = pd.DataFrame(data)
    msgdata = pd.read_excel(messages_excel, sheet_name=r'Sheet1')
for index, row in df.iterrows():
    try:
        search_phone = int(row['phone'])
        search_box = driver.find_element_by_class_name('_2zCfw')
        search_box.send_keys(search_phone)
        time.sleep(2)
        search_box.send_keys(u'\ue007')
        for i in msgdata.index:
            try:
                clipButton = driver.find_element_by_xpath('//*[@id="main"]/header/div[3]/div/div[2]/div/span')
                clipButton.click()
                time.sleep(1)
                # To send Videos and Images.
                mediaButton = driver.find_element_by_xpath(
                    '//*[@id="main"]/header/div[3]/div/div[2]/span/div/div/ul/li[1]/button')
                mediaButton.click()
                time.sleep(3)
                image_path = os.getcwd() + "\\Media\\" + msgdata['photoName'][i]+'.jpg'
                autoit.control_focus("Open", "Edit1")
                autoit.control_set_text("Open", "Edit1", (image_path))
                autoit.control_click("Open", "Button1")
                time.sleep(1)
                previewMsg = driver.find_element_by_class_name("_3u328").send_keys(u'\ue007')
                time.sleep(3)
                productName = str(msgdata['name'][i])
                oldPrice = str(msgdata['oldqimat'][i])
                newPrice = str(msgdata['newqimat'][i])
                inventory = str(msgdata['inventory'][i])
                msg_box = driver.find_element_by_xpath('//*[@id="main"]/footer/div[1]/div[2]/div/div[2]')
                msg_box.send_keys("stocks")
                msg_box.send_keys(Keys.SHIFT + '\ue007')
                msg_box.send_keys(productName)
                msg_box.send_keys(Keys.SHIFT + '\ue007')
                if oldPrice != 'nan':
                    msg_box.send_keys("oldPrice : "+ oldPrice)
                    msg_box.send_keys(Keys.SHIFT + '\ue007')
                if newPrice != 'nan':
                    msg_box.send_keys("newPrice : "+ newPrice)
                    msg_box.send_keys(Keys.SHIFT + '\ue007')
                if inventory!= 'nan':
                    msg_box.send_keys("inventory : "+ inventory)
                time.sleep(1)
                msg_box.send_keys(Keys.ENTER)
                time.sleep(3)
            except NoSuchElementException:
                continue
    except NoSuchElementException:
        continue
print("sucessfully Done")

Quoting the question: "when there is no result in searching number it's sending the messages to the last person that was sent to So the last person gets duplicate message"
And: "Im getting my contact numbers from a csv file So With a loop Im typing phone numbers in contact search box (WhatsApp web)"
You could store the last number you contacted in a variable and check whether the current recipient of the message matches the stored number.
A simple if/else should do the trick.
Code:
last_contacted = None
for index, row in df.iterrows():
    search_phone = int(row['phone'])
    if search_phone == last_contacted:
        print("number already contacted")
        continue  # skip this row and move on to the next one
    last_contacted = search_phone
    print(search_phone)
    # ... search for the contact and send the messages as before ...

After you fill the search contact box string and send the Enter key, the “best match” contact name will be displayed at top of the right message panel.
Inspect that element and make sure it matches your search before continuing.
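The comparison step could look something like the sketch below. It assumes you have already read the text of the chat header element with Selenium (the locator is not shown here because it depends on the current WhatsApp Web markup) and that the header shows a phone number rather than a saved contact name. The helper names digits_only and header_matches_search are hypothetical.

```python
import re


def digits_only(text):
    """Strip everything except digits, e.g. '+98 912 000 0000' -> '989120000000'."""
    return re.sub(r"\D", "", text)


def header_matches_search(header_text, searched_phone):
    """Return True if the chat header appears to belong to the searched number.

    Compares by suffix so that a country-code prefix on one side does not matter.
    """
    header_digits = digits_only(header_text)
    searched_digits = digits_only(str(searched_phone))
    if not header_digits or not searched_digits:
        return False
    return (header_digits.endswith(searched_digits)
            or searched_digits.endswith(header_digits))


# Example values (made up for illustration).
print(header_matches_search("+98 912 000 0000", 9120000000))
print(header_matches_search("+98 912 111 1111", 9120000000))
```

If the header shows a saved contact name instead, the same check would have to compare against the name you expect for that number.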

TweepError Failed to parse JSON payload - random error

I am using Tweepy package in Python to collect tweets. I track several users and collect their latest tweets. For some users I get an error like "Failed to parse JSON payload: ", e.g. "Failed to parse JSON payload: Expecting ',' delimiter or '}': line 1 column 694303 (char 694302)". I took a note of the userid and tried to reproduce the error and debug the code. The second time I ran the code for that particular user, I got results (i.e. tweets) with no problem. I adjusted my code so that when I get this error I try once more to extract the tweets. So, I might get this error once, or twice for a user, but in a second or third attempt the code returns the tweets as usual without the error. I get similar behaviour for other userids too.
My question is, why does this error appear randomly? Nothing else has changed. I searched on the internet but couldn't find a similar report. A snippet of my code follows
# initialize a list to hold all the tweepy Tweets
alltweets = []
ntries = 0
# make initial request for most recent tweets (200 is the maximum allowed count)
while True:
    try:  # if process fails due to connection problems, retry.
        if beforeid:
            new_tweets = api.user_timeline(user_id = user,count=200, since_id=sinceid, max_id=beforeid)
        else:
            new_tweets = api.user_timeline(user_id = user,count=200, since_id=sinceid)
        break
    except tweepy.error.RateLimitError:
        print("Rate limit error:", sys.exc_info()[0])
        print("Timeout, retry in 5 minutes...\n")
        time.sleep(60 * 5)
        continue
    except tweepy.error.TweepError as er:
        print('TweepError: ' + er.message)
        if er.message == 'Not authorized.':
            new_tweets = []
            break
        else:
            print(str(ntries))
            ntries +=1
            pass
    except:
        print("Unexpected error:", sys.exc_info()[0])
        new_tweets = []
        break
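Note that the loop above counts attempts in ntries but never stops retrying on a TweepError. A bounded version of the same retry idea might look like the sketch below. It does not explain why the parse error occurs; it only caps the number of attempts. call_with_retries and flaky_fetch are hypothetical names, and flaky_fetch simulates the API call so the example runs without Tweepy.

```python
import time


def call_with_retries(func, max_tries=3, delay_seconds=1.0, retry_on=(Exception,)):
    """Call func() until it succeeds or max_tries attempts have failed."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except retry_on as error:
            last_error = error
            print("attempt %d failed: %s" % (attempt, error))
            if attempt < max_tries:
                time.sleep(delay_seconds)
    raise last_error


# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("Failed to parse JSON payload")
    return ["tweet 1", "tweet 2"]


tweets = call_with_retries(flaky_fetch, max_tries=3, delay_seconds=0)
print(tweets)
```

With the code from the question, func would wrap the api.user_timeline call and retry_on would presumably be (tweepy.error.TweepError,).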

Tweepy Twitter get all tweet replies of particular user

I am trying to get all replies of a particular user. This user has an in_reply_to_user_id_str of 151791801. I tried to print out all the replies, but I only manage to print one of them. Can anyone help me print out all the replies?
My code is:
for page in tweepy.Cursor(api.user_timeline, id="253346744").pages(1):
    for item in page:
        if item.in_reply_to_user_id_str == "151791801":
            print(item.text)
            a = api.get_status(item.in_reply_to_status_id_str)
            print(a.text)

First, find the reply thread of your conversation with your service provider:
# Find the last tweet
for page in tweepy.Cursor(api.user_timeline, id="253346744").pages(1):
    for item in page:
        if item.in_reply_to_user_id_str == "151791801":
            last_tweet = item
The variable last_tweet will contain their last reply to you. From there, you can loop back to your original tweet:
# Loop until the original tweet
while True:
    print(last_tweet.text)
    prev_tweet = api.get_status(last_tweet.in_reply_to_status_id_str)
    last_tweet = prev_tweet
    if not last_tweet.in_reply_to_status_id_str:
        break
It's not pretty, but it gets the job done.
Good luck!

import logging
import time
import tweepy

user_name = "@nameofuser"
replies = tweepy.Cursor(api.search, q='to:{} filter:replies'.format(user_name), tweet_mode='extended').items()
while True:
    try:
        reply = replies.next()
        if not hasattr(reply, 'in_reply_to_user_id_str'):
            continue
        if str(reply.in_reply_to_user_id_str) == "151791801":
            logging.info("reply of :{}".format(reply.full_text))
    except tweepy.RateLimitError as e:
        logging.error("Twitter api rate limit reached: {}".format(e))
        time.sleep(60)
        continue
    except tweepy.TweepError as e:
        logging.error("Tweepy error occured:{}".format(e))
        break
    except StopIteration:
        break
    except Exception as e:
        logging.error("Failed while fetching replies {}".format(e))
        break

Python- Pass control flow back to top of the script

I am trying to fetch the contents of a page using Requests. The URL has 3 parameters:
Unique page ID
Username
Password
My initial block of code looks like this :
import sys
import requests
id = raw_input("Enter the unique id:")
user = raw_input("Enter your username:")
password = raw_input("Enter corresponding password:")
try:
    r = requests.get('http://test.com/request.pl?id=' + id, auth=(user, password))
    if r.status_code == 404:
        print "No such page exists.Please check the ID and try again"
        ## Ask for input again
    else:
        print r.text
except requests.ConnectionError:
    print "Server is refusing connections.Please try after sometime"
    sys.exit(1)
My issue is on the commented line, where I want the user to be prompted for input again. How do I pass the control flow back to the top of the script?
I have a vague feeling that I might be doing this in a very crude way and that there might be more elegant solutions using functions. If there are any, please do enlighten me.

This will do what you want.
import sys
import requests
def user_input():
    id1 = raw_input("Enter the unique id:")
    user = raw_input("Enter your username:")
    password = raw_input("Enter corresponding password:")
    try:
        # pass the credentials via auth, as in the question, not in the URL
        r = requests.get('http://test.com/request.pl?id=' + id1, auth=(user, password))
        if r.status_code == 404:
            print "No such page exists.Please check the ID and try again"
            ## Ask for input again
            user_input()
        else:
            print r.text
    except requests.ConnectionError:
        print "Server is refusing connections.Please try after sometime"
        sys.exit(1)
user_input()

The simplest (but not necessarily the most extensible) way is to put everything in a while True loop.
import sys
import requests
while True:
    id = raw_input("Enter the unique id:")
    user = raw_input("Enter your username:")
    password = raw_input("Enter corresponding password:")
    try:
        r = requests.get('http://test.com/request.pl?id=' + id, auth=(user, password))
        if r.status_code == 404:
            print "No such page exists.Please check the ID and try again"
            ## control flow will reach the bottom and return to the top
        else:
            print r.text
            break
    except requests.ConnectionError:
        print "Server is refusing connections.Please try after sometime"
        sys.exit(1) ## Exit condition of the loop

I would place this code in a loop that always executes (while True:) and use a flag that lets you break out of the loop at the appropriate point.
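A flag-based loop of this kind could be sketched as follows, written for Python 3. To keep it runnable without a server, the input and the HTTP check are passed in as functions; prompt_until_found, read_input and fetch_status are hypothetical names, and in real use fetch_status would wrap the requests.get call and return r.status_code.

```python
def prompt_until_found(read_input, fetch_status):
    """Keep asking for a page ID until fetch_status reports something other than 404."""
    found = False
    page_id = None
    attempts = 0
    while not found:
        page_id = read_input("Enter the unique id:")
        attempts += 1
        if fetch_status(page_id) == 404:
            print("No such page exists. Please check the ID and try again")
        else:
            found = True  # flag flips, so the loop ends after this pass
    return page_id, attempts


# Simulated user input and server, so the sketch runs without a network.
answers = iter(["bad-id", "also-bad", "good-id"])
result = prompt_until_found(
    read_input=lambda prompt: next(answers),
    fetch_status=lambda page_id: 200 if page_id == "good-id" else 404,
)
print(result)
```

Compared with the recursive version, a loop does not grow the call stack on every failed attempt.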

Move an email in GMail with Python and imaplib

I want to be able to move an email in GMail from the inbox to another folder using Python. I am using imaplib and can't figure out how to do it.

There is no explicit move command in the base IMAP protocol. You will have to execute a COPY followed by a STORE (with a suitable flag to indicate deletion) and finally an EXPUNGE. The example below worked for moving messages from one label to another. You'll probably want to add more error checking, though.
import imaplib, getpass, re
pattern_uid = re.compile(r'\d+ \(UID (?P<uid>\d+)\)')

def connect(email):
    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    password = getpass.getpass("Enter your password: ")
    imap.login(email, password)
    return imap

def disconnect(imap):
    imap.logout()

def parse_uid(data):
    match = pattern_uid.match(data)
    return match.group('uid')

if __name__ == '__main__':
    imap = connect('<your mail id>')
    imap.select(mailbox = '<source folder>', readonly = False)
    resp, items = imap.search(None, 'All')
    email_ids = items[0].split()
    latest_email_id = email_ids[-1] # Assuming that you are moving the latest email.
    resp, data = imap.fetch(latest_email_id, "(UID)")
    msg_uid = parse_uid(data[0])
    result = imap.uid('COPY', msg_uid, '<destination folder>')
    if result[0] == 'OK':
        mov, data = imap.uid('STORE', msg_uid , '+FLAGS', r'(\Deleted)')
        imap.expunge()
    disconnect(imap)

As for Gmail, since its API works with labels, all you need to do is add the destination label and remove the source label:
import imaplib
obj = imaplib.IMAP4_SSL('imap.gmail.com', 993)
obj.login('username', 'password')
obj.select(src_folder_name)
typ, data = obj.uid('STORE', msg_uid, '+X-GM-LABELS', desti_folder_name)
typ, data = obj.uid('STORE', msg_uid, '-X-GM-LABELS', src_folder_name)

This assumes you already have the UID of the email that is going to be moved.
import imaplib
obj = imaplib.IMAP4_SSL('imap.gmail.com', 993)
obj.login('username', 'password')
obj.select(src_folder_name)
apply_lbl_msg = obj.uid('COPY', msg_uid, desti_folder_name)
if apply_lbl_msg[0] == 'OK':
    mov, data = obj.uid('STORE', msg_uid , '+FLAGS', r'(\Deleted)')
    obj.expunge()

None of the previous solutions worked for me. I was unable to delete a message from the selected folder, and unable to remove the label for the folder when the label was the selected folder. Here's what ended up working for me:
import email, getpass, imaplib, os, sys, re
user = "user@example.com"
pwd = "password" #getpass.getpass("Enter your password: ")
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user,pwd)
from_folder = "Notes"
to_folder = "food"
m.select(from_folder, readonly = False)
response, emailids = m.search(None, 'All')
assert response == 'OK'
emailids = emailids[0].split()
errors = []
labeled = []
for emailid in emailids:
    result = m.fetch(emailid, '(X-GM-MSGID)')
    if result[0] != 'OK':
        errors.append(emailid)
        continue
    gm_msgid = re.findall(r"X-GM-MSGID (\d+)", result[1][0])[0]
    result = m.store(emailid, '+X-GM-LABELS', to_folder)
    if result[0] != 'OK':
        errors.append(emailid)
        continue
    labeled.append(gm_msgid)
m.close()
m.select(to_folder, readonly = False)
errors2 = []
for gm_msgid in labeled:
    result = m.search(None, '(X-GM-MSGID "%s")' % gm_msgid)
    if result[0] != 'OK':
        errors2.append(gm_msgid)
        continue
    emailid = result[1][0]
    result = m.store(emailid, '-X-GM-LABELS', from_folder)
    if result[0] != 'OK':
        errors2.append(gm_msgid)
        continue
m.close()
m.logout()
if errors: print >>sys.stderr, len(errors), "failed to add label", to_folder
if errors2: print >>sys.stderr, len(errors2), "failed to remove label", from_folder

I know that this is a very old question, but anyway. The solution proposed by Manoj Govindan probably works perfectly (I have not tested it, but it looks like it does). The problem I ran into and had to solve was how to copy/move more than one email.
So I came up with a solution; maybe someone else will have the same problem in the future.
The steps are simple: I connect to my email (Gmail) account, choose the folder to process (e.g. INBOX), and fetch all UIDs instead of the emails' list numbers. This is the crucial point. If we fetched the list numbers and then processed them, we would run into a problem. Moving an email is simple: copy it to the destination folder and delete it from its current location. But suppose there are 4 emails in the inbox and we move the 2nd one. The numbers 3 and 4 now refer to different emails than we expected, and list item number 4 no longer exists, because every later email moved one position down to fill the gap. That results in an error.
So the only solution to this problem was to use UIDs, which are unique numbers for each email. No matter how the list changes, this number stays bound to the email.
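The shifting effect described above can be reproduced without a mail server. The sketch below uses a tiny in-memory stand-in for a folder; FakeMailbox is hypothetical and is not an imaplib class. It only models the fact that sequence numbers are positions while UIDs are fixed identifiers.

```python
class FakeMailbox:
    """Tiny in-memory stand-in for an IMAP folder (not a real imaplib object)."""

    def __init__(self, uids):
        self.uids = list(uids)  # position in this list + 1 = sequence number

    def delete_by_sequence_number(self, seq):
        # Removing a message shifts every later message down by one.
        return self.uids.pop(seq - 1)

    def delete_by_uid(self, uid):
        self.uids.remove(uid)
        return uid


# Four messages with UIDs 101..104; the intent is to remove the 2nd and 3rd (102, 103).
by_seq = FakeMailbox([101, 102, 103, 104])
removed_by_seq = [by_seq.delete_by_sequence_number(n) for n in (2, 3)]

by_uid = FakeMailbox([101, 102, 103, 104])
removed_by_uid = [by_uid.delete_by_uid(u) for u in (102, 103)]

print(removed_by_seq)  # [102, 104]: message 104 was removed instead of 103
print(removed_by_uid)  # [102, 103]: exactly the intended messages
```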
So in the example below, I fetch the UIDs in the first step and check whether the folder is empty, in which case there is no point in processing it; otherwise I iterate over all emails found in the folder. Next, I fetch each email's header. The header lets us read the Subject and compare it with the one we are searching for. If the subject matches, the email is copied and deleted. Then you are done. Simple as that.
#!/usr/bin/env python
import email
import pprint
import imaplib

__author__ = 'author'

def initialization_process(user_name, user_password, folder):
    imap4 = imaplib.IMAP4_SSL('imap.gmail.com')  # Connects over an SSL encrypted socket
    imap4.login(user_name, user_password)
    imap4.list()  # List of "folders" aka labels in gmail
    imap4.select(folder)  # Default INBOX folder alternative select('FOLDER')
    return imap4

def logout_process(imap4):
    imap4.close()
    imap4.logout()
    return

def main(user_email, user_pass, scan_folder, subject_match, destination_folder):
    try:
        imap4 = initialization_process(user_email, user_pass, scan_folder)
        result, items = imap4.uid('search', None, "ALL")  # search and return uids
        dictionary = {}
        if items == ['']:
            dictionary[scan_folder] = 'Is Empty'
        else:
            for uid in items[0].split():  # Each uid is a space separated string
                dictionary[uid] = {'MESSAGE BODY': None, 'BOOKING': None, 'SUBJECT': None, 'RESULT': None}
                result, header = imap4.uid('fetch', uid, '(UID BODY[HEADER])')
                if result != 'OK':
                    raise Exception('Can not retrieve "Header" from EMAIL: {}'.format(uid))
                subject = email.message_from_string(header[0][1])
                subject = subject['Subject']
                if subject is None:
                    dictionary[uid]['SUBJECT'] = '(no subject)'
                else:
                    dictionary[uid]['SUBJECT'] = subject
                if subject_match in dictionary[uid]['SUBJECT']:
                    result, body = imap4.uid('fetch', uid, '(UID BODY[TEXT])')
                    if result != 'OK':
                        raise Exception('Can not retrieve "Body" from EMAIL: {}'.format(uid))
                    dictionary[uid]['MESSAGE BODY'] = body[0][1]
                    list_body = dictionary[uid]['MESSAGE BODY'].splitlines()
                    result, copy = imap4.uid('COPY', uid, destination_folder)
                    if result == 'OK':
                        dictionary[uid]['RESULT'] = 'COPIED'
                        result, delete = imap4.uid('STORE', uid, '+FLAGS', r'(\Deleted)')
                        imap4.expunge()
                        if result == 'OK':
                            dictionary[uid]['RESULT'] = 'COPIED/DELETED'
                        elif result != 'OK':
                            dictionary[uid]['RESULT'] = 'ERROR'
                            continue
                    elif result != 'OK':
                        dictionary[uid]['RESULT'] = 'ERROR'
                        continue
                else:
                    print("Do something with not matching emails")
                    # do something else instead of copy
        dictionary = {scan_folder: dictionary}
    except imaplib.IMAP4.error as e:
        print("Error, {}".format(e))
    except Exception as e:
        print("Error, {}".format(e))
    finally:
        logout_process(imap4)
    return dictionary

if __name__ == "__main__":
    username = 'example.email@gmail.com'
    password = 'examplePassword'
    main_dictionary = main(username, password, 'INBOX', 'BOKNING', 'TMP_FOLDER')
    pprint.pprint(main_dictionary)
    exit(0)
Useful information regarding imaplib: "Python - imaplib IMAP example with Gmail" and the imaplib documentation.

This is a solution for moving multiple messages from one folder to another.
import email
import imaplib

mail_server = 'imap.gmail.com'
account_id = 'yourimap@gmail.com'
password = 'testpasword'
TLS_port = 993
# connection to imap
conn = imaplib.IMAP4_SSL(mail_server,TLS_port)
try:
    (retcode, capabilities) = conn.login(account_id, password)
    # return HttpResponse("pass")
except:
    # return HttpResponse("fail")
    messages.error(request, 'Request Failed! Unable to connect to Mailbox. Please try again.')
    return redirect('addIEmMailboxes')
conn.select('"INBOX"')
(retcode, messagess) = conn.uid('search', None, "ALL")
if retcode == 'OK':
    for num in messagess[0].split():
        typ, data = conn.uid('fetch', num,'(RFC822)')
        msg = email.message_from_bytes((data[0][1]))
        #MOVE MESSAGE TO ProcessedEmails FOLDER
        result = conn.uid('COPY', num, 'ProcessedEmails')
        if result[0] == 'OK':
            mov, data = conn.uid('STORE', num , '+FLAGS', r'(\Deleted)')
            conn.expunge()
conn.close()
return redirect('addIEmMailboxes')

Solution with Python 3, to move Zoho mails from Trash to Archive. (Zoho does not archive deleted messages, so if you want to preserve them forever, you need to move from Trash to an archival folder.)
#!/usr/bin/env python3
import imaplib, sys
obj = imaplib.IMAP4_SSL('imap.zoho.com', 993)
obj.login('account', 'password')
obj.select('Trash')
_, data = obj.uid('FETCH', '1:*' , '(RFC822.HEADER)')
if data[0] is None:
    print("No messages in Trash")
    sys.exit(0)
messages = [data[i][0].split()[2] for i in range(0, len(data), 2)]
for msg_uid in messages:
    apply_lbl_msg = obj.uid('COPY', msg_uid, 'Archive')
    if apply_lbl_msg[0] == 'OK':
        mov, data = obj.uid('STORE', msg_uid , '+FLAGS', r'(\Deleted)')
        obj.expunge()
        print("Moved msg %s" % msg_uid)
    else:
        print("Copy of msg %s did not work" % msg_uid)

My external lib: https://github.com/ikvk/imap_tools
# MOVE all messages from INBOX to INBOX/folder2
from imap_tools import MailBox
with MailBox('imap.ya.ru').login('tst@ya.ru', 'pwd', 'INBOX') as mailbox:
    mailbox.move(mailbox.fetch('ALL'), 'INBOX/folder2')  # *implicit creation of uid list on fetch
