I am working on extracting issue data from a repository on GitHub using github3.py.
The following is the part of my code that extracts issues from a repo.
These are the imports used in the main code:
from github3 import login
from mysql.connector import IntegrityError
import config as cfg
import project_list
from github3.exceptions import NotFoundError
from github3.exceptions import GitHubException
import datetime
from database import Database
import sys
import re
import time
Then the main code is:
DEBUG = False
def process(url, start):
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
splitted = url.split("/")
org_name = splitted[3]
repo_name = splitted[4]
while True:
try:
gh = login(token = cfg.TOKEN)
repo = gh.repository(org_name, repo_name)
print("{} =====================".format(repo))
if start is None:
i = 1
else:
i = int(start)
if start is None:
j = 1
else:
j = int(start)
Database.connect()
while True:
try:
issue = repo.issue(i)
issue_id = issue.id
issue_number = issue.number
status_issue = str(issue.state)
close_author = str(issue.closed_by)
com_count = issue.comments_count
title = re_pattern.sub(u'\uFFFD', issue.title)
created_at = issue.created_at
closed_at = issue.closed_at
now = datetime.datetime.now()
reporter = str(issue.user)
body_text = issue.body_text
body_html = issue.body_html
if body_text is None:
body_text = ""
if body_html is None:
body_html = ""
body_text = re_pattern.sub(u'\uFFFD', body_text)
body_html = re_pattern.sub(u'\uFFFD', body_html)
Database.insert_issues(issue_id, issue_number, repo_name,status_issue , close_author, com_count, title, reporter, created_at, closed_at, now, body_text, body_html)
print("{} inserted.".format(issue_id))
if DEBUG == True:
break;
except NotFoundError as e:
print("Exception # {}: {}".format(i, str(e)))
except IntegrityError as e:
print("Data was there # {}".format(str(e)))
i += 1
j += 1
except GitHubException as e:
print("Exception: {}".format(str(e)))
time.sleep(1000)
i -= 1
j -= 1
if __name__ == "__main__":
if len(sys.argv) == 1:
sys.exit("Please specify project name: python issue-github3.py <project name>")
if len(sys.argv) == 2:
start = None
print("Start from the beginning")
else:
start = sys.argv[2]
project = sys.argv[1]
url = project_list.get_project(project)
process(url, start)
With the above code everything works and I can extract issues from a repo on GitHub.
Problem: the exception 410 Issues are disabled for this repo occurs after 100 issues have been extracted successfully from a repo.
How can I solve this problem?
As shown in the main code, I handle the 404 exception (i.e., issue not found) using from github3.exceptions import NotFoundError and the code below:
except NotFoundError as e:
print("Exception # {}: {}".format(i, str(e)))
Given the main code, which import and what code should I use to handle exception 410?
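One possible direction, sketched only: the traceback above shows the 410 already being caught by the outer except GitHubException handler, so it can be intercepted in the inner loop instead; the check below assumes the error object exposes the HTTP status as a .code attribute, as github3.py's response errors generally do, and reads it defensively with getattr.
from github3.exceptions import GitHubException, NotFoundError

# Sketch of the inner try/except from the main code, extended so a 410
# never reaches the outer handler that sleeps and retries the same issue.
try:
    issue = repo.issue(i)
    # ... extract the fields and call Database.insert_issues as in the main code ...
except NotFoundError as e:
    # 404: this issue number does not exist
    print("Exception # {}: {}".format(i, str(e)))
except GitHubException as e:
    if getattr(e, "code", None) == 410:
        # 410: "Issues are disabled for this repo" -- log it and move on
        print("Gone # {}: {}".format(i, str(e)))
    else:
        raise  # anything else still goes to the outer GitHubException handler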
I found an easy way to work around it, but it doesn't solve the problem completely.
As I mentioned before, the exception occurs after 100 issues have succeeded (i.e., issue number 101 is the problem), and as #LhasaDad said above, there is no issue number 101 in the repo (I checked manually). So we just need to put 102 instead of None where start = None, then execute the code again.
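For reference, given the argument handling in the __main__ block above, the same restart can be done without editing the code by passing the start index as the second command-line argument:
python issue-github3.py <project name> 102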
Related
I am trying to explore the Enron email dataset using a Python Jupyter notebook, but I am getting the attribute error below. I am reading the emails and converting them into CSV format so that I can then apply ML for sentiment analysis.
import tarfile
import re
from datetime import datetime
from collections import namedtuple, Counter
import pandas as pd
import altair as alt
tar = tarfile.open(r"C:\Users\nikip\Documents\2021\Interview Preparation\sentiment analysis\enron_mail_20150507.tar.gz", "r")
items = tar.getmembers()
Email = namedtuple('Email', 'Date, From, To, Subject, Cc, Bcc, Message')
def get_msg(item_number):
f = tar.extractfile(items[item_number])
try:
date = from_ = to = subject = cc= bcc = message= ''
in_to = False
in_message = False
to = []
message = []
item = f.read().decode()
item = item.replace('\r', '').replace('\t', '')
lines = item.split('\n')
for num, line in enumerate(lines):
if line.startswith('Date:') and not date:
date = datetime.strptime(' '.join(line.split('Date: ')[1].split()[:-2]), '%a, %d %b %Y %H:%M:%S')
elif line.startswith('From:') and not from_:
from_ = line.replace('From:', '').strip()
elif line.startswith('To:')and not to:
in_to = True
to = line.replace('To:', '').replace(',', '').replace(',', '').split()
elif line.startswith('Subject:') and not subject:
in_to = False
subject = line.replace('Subject:', '').strip()
elif line.startswith('Cc:') and not cc:
cc = line.replace('Cc:', '').replace(',', '').replace(',', '').split()
elif line.startswith('Bcc:') and not bcc:
bcc = line.replace('Bcc:', '').replace(',', '').replace(',', '').split()
elif in_to:
to.extend(line.replace(',', '').split())
elif line.statswith('Subject:') and not subject:
in_to =False
elif line.startswith('X-FileName'):
in_message = True
elif in_message:
message.append(line)
to = '; '.join(to).strip()
cc = '; '.join(cc).strip()
bcc = '; '.join(bcc).strip()
message = ' '.join(message).strip()
email = Email(date, from_, to, subject, cc, bcc, message)
return email
except Exception as e:
return e
msg = get_msg(3002)
msg.date
I am getting an error message like the one below:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-e1439579a8e7> in <module>
----> 1 msg.To
AttributeError: 'AttributeError' object has no attribute 'To'
Can someone help? Thanks in advance.
The problem is that you are returning an exception from your get_msg function, which broadly looks like this:
def get_msg(item_number):
try:
...do some stuff...
except Exception as e:
return e
It looks like you're triggering an AttributeError exception somewhere in your code, and you're returning that exception, rather than an Email object.
You almost never want to have an except statement that suppresses all exceptions like that, because it will hide errors in your code (as we see here). It is generally better practice to catch specific exceptions, or at least log the error if your code will continue despite the exception.
As a first step, I would suggest removing the entire try/except block and get your code working without it.
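As an illustration only (the header/body parsing is elided and the names it produces are assumed from the question's code), the reworked function might keep a narrow except once the code works, so unexpected errors such as this AttributeError surface immediately:
import logging

def get_msg(item_number):
    f = tar.extractfile(items[item_number])
    item = f.read().decode()
    try:
        # ... the Date/From/To/Subject/Cc/Bcc/body parsing from the question ...
        return Email(date, from_, to, subject, cc, bcc, message)
    except (IndexError, ValueError):
        # only the failures expected from a malformed message;
        # anything else is a bug and is allowed to propagate
        logging.exception("Could not parse message %d", item_number)
        return None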
I'm trying to make my own logging system because I don't like Python's built-in one, but I ran into a small problem. Whenever I log two or more things, only the last log is written. For example, I log 'code initialized' at the start of the program and then log 'exiting'; the only thing in the log file afterwards is 'exiting'. How do I fix this? Here is my code.
File 1.
from functions import *
from variables import *
try:
open('key.txt', "r")
key = True
except FileNotFoundError:
key = False
if check_internet():
x = status_boot[0]
elif not check_internet():
x = status_boot[1]
log("archer initialized")
if key:
print("archer status: " + x)
elif not key:
print(status_boot[4])
File 2.
from datetime import datetime
from variables import *
import requests
def time():
now = datetime.now()
return now.strftime("%H:%M:%S")
def check_internet():
timeout = 5
try:
_ = requests.get(url, timeout=timeout)
return True
except requests.ConnectionError:
return False
def log(m):
write = open('archerlog.txt', 'w+')
return write.write(f'[{time()}] archer: {m}\n')
File 3.
url = "http://www.google.com/"
status_boot = ["online", "offline", "key invalid", "error", "Error: Key not accessible!"]
x = status_boot[3]
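One detail worth checking in File 2: opening archerlog.txt with 'w+' truncates the file on every log() call, which would explain why only the last message ('exiting') survives. A minimal sketch of an append-mode variant of that function:
def log(m):
    # 'a' appends instead of truncating, so earlier entries are preserved
    with open('archerlog.txt', 'a') as write:
        return write.write(f'[{time()}] archer: {m}\n')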
I'm trying to get the list of files that are fully uploaded on an FTP server.
I have access to this FTP server, where a third party writes data and marker files every 15 minutes. Once a data file is completely uploaded, a marker file gets created; once the marker file is there, we know the data files are ready and we can download them. I'm looking for an efficient way to approach this. I want to check every minute whether there are any new stable files on the FTP server, and if there are, download them. One preferred approach: if the marker file is at least 2 minutes old, it is safe to download the marker file and the corresponding data file.
I'm new to Python and looking for help.
I have some code that gets as far as listing the files:
import paramiko
from datetime import datetime, timedelta
FTP_HOST = 'host_address'
FTP_PORT = 21
FTP_USERNAME = 'username'
FTP_PASSWORD = 'password'
FTP_ROOT_PATH = 'path_to_dir'
def today():
return datetime.strftime(datetime.now(), '%Y%m%d')
def open_ftp_connection(ftp_host, ftp_port, ftp_username, ftp_password):
"""
Opens ftp connection and returns connection object
"""
client = paramiko.SSHClient()
client.load_system_host_keys()
try:
transport = paramiko.Transport(ftp_host, ftp_port)
except Exception as e:
return 'conn_error'
try:
transport.connect(username=ftp_username, password=ftp_password)
except Exception as identifier:
return 'auth_error'
ftp_connection = paramiko.SFTPClient.from_transport(transport)
return ftp_connection
def show_ftp_files_stat():
ftp_connection = open_ftp_connection(FTP_HOST, int(FTP_PORT), FTP_USERNAME, FTP_PASSWORD)
full_ftp_path = FTP_ROOT_PATH + "/" + today()
file_attr_list = ftp_connection.listdir_attr(full_ftp_path)
print(file_attr_list)
for file_attr in file_attr_list:
print(file_attr.filename, file_attr.st_size, file_attr.st_mtime)
if __name__ == '__main__':
show_ftp_files_stat()
Sample file name
org-reference-delta-quotes.REF.48C2.20200402.92.1.1.txt.gz
Sample corresponding marker file name
org-reference-delta-quotes.REF.48C2.20200402.92.note.txt.gz
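From these two sample names alone, the marker appears to share the prefix up to the 92 segment with its data file, with note replacing the trailing 1.1 part. A hypothetical helper that pairs data files with their marker on that shared prefix (the naming rule is an assumption drawn only from these samples):
def data_files_for_marker(marker_name, all_names):
    # assumes markers look like <prefix>.note.txt.gz and data files like <prefix>.<n>.<m>.txt.gz
    prefix = marker_name.replace('.note.txt.gz', '')
    return [n for n in all_names if n.startswith(prefix + '.') and n != marker_name]

names = ['org-reference-delta-quotes.REF.48C2.20200402.92.1.1.txt.gz',
         'org-reference-delta-quotes.REF.48C2.20200402.92.note.txt.gz']
print(data_files_for_marker(names[1], names))
# -> ['org-reference-delta-quotes.REF.48C2.20200402.92.1.1.txt.gz']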
I solved my use case with a 2-minute stability rule: if a file's modified time is more than 2 minutes before the current time, I consider it stable.
import logging
import time
from datetime import datetime, timezone
from ftplib import FTP
FTP_HOST = 'host_address'
FTP_PORT = 21
FTP_USERNAME = 'username'
FTP_PASSWORD = 'password'
FTP_ROOT_PATH = 'path_to_dir'
logger = logging.getLogger()
logger.setLevel(logging.ERROR)
def today():
return datetime.strftime(datetime.now(tz=timezone.utc), '%Y%m%d')
def current_utc_ts():
return datetime.utcnow().timestamp()
def current_utc_ts_minus_120():
return int(datetime.utcnow().timestamp()) - 120
def yyyymmddhhmmss_string_epoch_ts(dt_string):
return time.mktime(time.strptime(dt_string, '%Y%m%d%H%M%S'))
def get_ftp_connection(ftp_host, ftp_username, ftp_password):
try:
ftp = FTP(ftp_host, ftp_username, ftp_password)
except Exception as e:
print(e)
logger.error(e)
return 'conn_error'
return ftp
def get_list_of_files(ftp_connection, date_to_process):
full_ftp_path = FTP_ROOT_PATH + "/" + date_to_process + "/"
ftp_connection.cwd(full_ftp_path)
entries = list(ftp_connection.mlsd())
entry_list = [line for line in entries if line[0].endswith('.gz') | line[0].endswith('.zip')]
ftp_connection.quit()
print('Total file count', len(entry_list))
return entry_list
def parse_file_list_to_dict(entries):
try:
file_dict_list = []
for line in entries:
file_dict = dict({"file_name": line[0],
"server_timestamp": int(yyyymmddhhmmss_string_epoch_ts(line[1]['modify'])),
"server_date": line[0].split(".")[3])
file_dict_list.append(file_dict)
except IndexError as e:
# Output expected IndexErrors.
logging.exception(e)
except Exception as exception:
# Output unexpected Exceptions.
logging.exception(exception, False)
return file_dict_list
def get_stable_files_dict_list(dict_list):
stable_list = list(filter(lambda d: d['server_timestamp'] < current_utc_ts_minus_120(), dict_list))
print('stable file count: {}'.format(len(stable_list)))
return stable_list
if __name__ == '__main__':
ftp_connection = get_ftp_connection(FTP_HOST, FTP_USERNAME, FTP_PASSWORD)
if ftp_connection == 'conn_error':
logger.error('Failed to connect FTP Server!')
else:
file_list = get_list_of_files(ftp_connection, today())
parse_file_list = parse_file_list_to_dict(file_list)
stable_file_list = get_stable_files_dict_list(parse_file_list)
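Downloading the stable files is not shown above. A sketch using ftplib's retrbinary, opening a fresh connection because get_list_of_files quits the one it used (the destination directory is an assumption here):
def download_files(file_dicts, date_to_process, dest_dir='.'):
    # fetch each stable file with RETR into dest_dir
    ftp = FTP(FTP_HOST, FTP_USERNAME, FTP_PASSWORD)
    ftp.cwd(FTP_ROOT_PATH + "/" + date_to_process + "/")
    for d in file_dicts:
        with open(dest_dir + "/" + d['file_name'], 'wb') as fh:
            ftp.retrbinary('RETR ' + d['file_name'], fh.write)
    ftp.quit()

# e.g., after the main block above:
# download_files(stable_file_list, today())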
I am trying to scrape data off of WhoScored.com. I am not sure what the best way to do it is, or whether anyone is familiar with this particular website, but I have a Python script that is supposed to scrape the data.
Here is my code:
import time
import bs4
import selenium_func as sel
from helper_functions import read_from_file, append_to_file
TIERS_PATH = 'tiers_urls/tiers_urls.txt'
TEAMS_PATH = 'teams_urls/teams_urls.txt'
TEAMS_LOGS = 'teams_urls/teams_logs.txt'
"""
Functions
"""
def get_teams_urls(start_idx):
"""
Searches each tier and extracts all the teams' urls within that tier.
"""
server, driver = sel.start_server_and_driver()
tiers_urls = read_from_file(TIERS_PATH)
length = len(tiers_urls)
for tier in tiers_urls[start_idx:]:
error = False
teams_urls = []
try:
complete_url = sel.WHOSCORED_URL + tier
try:
driver.get(complete_url)
content = driver.page_source
soup = bs4.BeautifulSoup(''.join(content), 'lxml')
except Exception as e:
print('\n')
print("Problem accessing {}".format(tier))
print(str(e))
print('\n')
append_to_file("\nError accessing: " + tier + "\n", TEAMS_LOGS)
append_to_file("Index: " + str(tiers_urls.index(tier)), TEAMS_LOGS)
continue
stage = None
stages_div = soup.find('div', {'id':'sub-navigation'})
if stages_div != None:
stage_li = stages_div.find_all('li')[0]
if stage_li != None:
stage_href = stage_li.find('a', href=True)['href']
if stage_href != None:
stage = stage_href.split('/')[8]
if stage != None:
standings_table = soup.find('div', {'id':'standings-'+stage})
standings_tbody = standings_table.find(id='standings-'+stage+'-content')
teams_tr = standings_tbody.find_all('tr')
if len(teams_tr) > 0:
for tr in teams_tr:
team_td = tr.find_all('td')[1]
team_href = team_td.find('a', href=True)['href']
teams_urls.append(team_href)
except Exception as e:
print('\n')
print("Problem reading data from: {}".format(tier))
print(str(e))
print('\n')
append_to_file("\nError reading data from: " + tier + "\n", TEAMS_LOGS)
append_to_file("Index: " + str(tiers_urls.index(tier)), TEAMS_LOGS)
error = True
if error == False:
if len(teams_urls) > 0:
to_store = {tier:teams_urls}
append_to_file(str(to_store), TEAMS_PATH)
append_to_file("\nSuccessfully retrieved from: " + str(tiers_urls.index(tier)) + "/" + str(length), TEAMS_LOGS)
time.sleep(1)
sel.stop_server_and_driver(server, driver)
return
if __name__ == '__main__':
get_teams_urls(0)
When I run the script, it opens up the website but then returns this error:
'NoneType' object has no attribute 'find'
How do I fix this and successfully scrape the data?
Sounds like you need some null/None-checks:
for tr in teams_tr:
team_td = tr.find_all('td')[1]
if team_td != None:
team_href = team_td.find('a', href=True)['href']
teams_urls.append(team_href)
You didn't check whether team_td was None before calling find on it.
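The same pattern applies earlier in the function: if the standings-<stage> div is not found, standings_table is None and the very next find call fails the same way, so it is worth guarding that lookup as well, for example:
standings_table = soup.find('div', {'id': 'standings-' + stage})
if standings_table is not None:
    standings_tbody = standings_table.find(id='standings-' + stage + '-content')
    if standings_tbody is not None:
        teams_tr = standings_tbody.find_all('tr')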
I have this code... I want to list this directory on my FTP server. In that directory I have 3 files, but my code reads only 2, and it throws an error which I handle. Take a look at the code and the output.
import datetime
import ftplib
import os
errors = 0
default_ftp_name = 'username'
default_ftp_pass = 'pass'
host = 'ftp.example.com'
print('Connecting To ' + host)
try:
ftp = ftplib.FTP(host)
print('Successfully connected! ')
except:
print('[!]Failed To Connect')
errors += 1
print('Logging with: ' + default_ftp_name)
try:
ftp.login(default_ftp_name,default_ftp_pass)
print('Login Success')
except:
print('[!]Couldnt log in')
errors += 1
print('Changing Directory to /Public/ConnectedUsers/')
try:
ftp.cwd('/Public/ConnectedUsers/')
except:
print('[!]Directory failed to change')
errors += 1
try:
print('Retrieving Files...')
files = ftp.dir()
print(files)
except:
print('[!]Didnt Get The Files')
errors += 1
print('[t] Total Errors: ' + str(errors))
connection = False
if connection is True:
#Dosomehting
var = 10
else:
print('Connection Error')
See the output right here: it shows 2 items, but I have 3.
What do I need to change in order to access all the files?
Take a look here
http://i.stack.imgur.com/TQDyv.png
ftp.dir prints the output to stdout, so files is currently None. You want to replace that call with ftp.nlst (or ftp.retrlines with a callback handler). See the docs.
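For example, either of these collects the listing into a Python list instead of printing it:
# Option 1: plain name listing
files = ftp.nlst()
print(files)

# Option 2: full LIST output gathered via a callback
lines = []
ftp.retrlines('LIST', lines.append)
print(lines)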