How can I prevent my Python web-scraper from stopping?

Ahoy! I've written a quick (Python) program that grabs the occupancy of a climbing gym every five minutes for later analysis. I'd like it to run non-stop, but I've noticed that after a couple hours pass, one of two things will happen.
1. It will detect a keyboard interrupt (which I did not enter) and stop, or
2. It will simply stop writing to the .csv file without showing any failure in the shell.
Here is the code:
import os
os.chdir(os.path.expanduser('~/Documents/Other/g1_capacity'))  # ensure program runs in correct directory if opened elsewhere; expanduser resolves the ~
import requests
import time
from datetime import datetime
import numpy as np
import csv

def get_count():
    url = 'https://portal.rockgympro.com/portal/public/b01ab221559163c5e9a73e078fe565aa/occupancy?&iframeid=occupancyCounter&fId='
    text = requests.get(url).text
    line = ""
    for item in text.split("\n"):
        if "'count'" in item:
            line = item.strip()
    count = int(line.split(":")[1][0:-1])  # really gross way to get count number for this specific source
    return count

while True:  # run until manual stop
    with open('g1_occupancy.csv', mode='a') as occupancy:
        occupancy_writer = csv.writer(occupancy)
        occupancy_writer.writerow([datetime.now(), get_count()])  # append new line to .csv with timestamp and current count
    time.sleep(60 * 5)  # wait five minutes before adding new line
I am new to web scraping (in fact, this is my first time) and I'm wondering if anyone might have a suggestion to help eliminate the issue I described above. Many thanks!
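One pattern that might help, sketched below (an illustration, not a tested fix): wrap each cycle's request and write in a try/except so a transient network failure is logged and skipped instead of killing the loop, and re-raise KeyboardInterrupt so a real Ctrl-C still stops the program. The exception list and the log message are my assumptions.

import csv
import time
import requests
from datetime import datetime

while True:  # run until manual stop
    try:
        count = get_count()  # get_count() as defined in the question above
        with open('g1_occupancy.csv', mode='a', newline='') as occupancy:
            csv.writer(occupancy).writerow([datetime.now(), count])
    except KeyboardInterrupt:
        raise  # let a genuine Ctrl-C still stop the program
    except (requests.exceptions.RequestException, ValueError, IndexError) as e:
        # Network hiccup or a changed page layout: log it and skip this sample.
        print(f"{datetime.now()}: sample skipped ({e})")
    time.sleep(60 * 5)  # wait five minutes before the next sample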

Related

Download csv file and convert to JSON

I would like to write a Python script that downloads a CSV file from a URL and then returns it as JSON. The problem is that I need it to run as fast as possible. What is the best way to do it? I was thinking about something like this:
import io
import csv
import json
import requests

r_bytes = requests.get(URL).content
r = r_bytes.decode('utf8')
reader = csv.DictReader(io.StringIO(r))
json_data = json.dumps(list(reader))
What do you think? It doesn't look good to me, but I can't find a better way to solve this problem.
I tried comparing your conversion process with pandas and used this code:
import io
import pandas as pd
import requests
import json
import csv
import time
r_bytes = requests.get("https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv").content
print("finished download")
r = r_bytes.decode('utf8')
print("finished decode")
start_df_timestamp = time.time()
df = pd.read_csv(io.StringIO(r), sep=";")
result_df = json.dumps(df.to_dict('records'))
end_df_timestamp = time.time()
print("The df method took {d_t}s".format(d_t=end_df_timestamp-start_df_timestamp))
start_csv_reader_timestamp = time.time()
reader = csv.DictReader(io.StringIO(r))
result_csv_reader = json.dumps(list(reader))
end_csv_reader_timestamp = time.time()
print("The csv-reader method took {d_t}s".format(d_t=end_csv_reader_timestamp-start_csv_reader_timestamp))
and the result was:
finished download
finished decode
The df method took 0.200181245803833s
The csv-reader method took 0.3164360523223877s
This was using a random 37k-row CSV file, and I noticed that downloading it was by far the most time-intensive part. Even though the pandas functions were faster for me, you should probably profile your code to see whether the conversion really adds significantly to your runtime. :-)
PS: If you need to constantly monitor the CSV and processing updates turns out to be time-intensive, you could use hashes so that you only reprocess the file when it has actually changed since your last download.
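In case it helps, here is a minimal sketch of that hashing idea (the URL, polling interval, and variable names are placeholders of mine): hash the raw bytes and only decode and convert when the download differs from the previous one.

import csv
import hashlib
import io
import json
import time
import requests

URL = 'https://example.com/data.csv'  # placeholder
last_hash = None

while True:
    r_bytes = requests.get(URL).content
    digest = hashlib.sha256(r_bytes).hexdigest()
    if digest != last_hash:  # only reprocess when the file actually changed
        last_hash = digest
        reader = csv.DictReader(io.StringIO(r_bytes.decode('utf8')))
        json_data = json.dumps(list(reader))
        # ... hand json_data to whatever consumes it ...
    time.sleep(60)  # poll once a minute; adjust to taste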

Python: Issue with rapidly reading and writing excel files after web scraping? Works for a bit then weird issues come up

So I developed a script that would pull data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it on updates every 30 seconds.
The idea is that it creates the request to the site, pulls the html, creates a list of all of the data I need, then restructures into a dataframe (basically it's the country, the cases, deaths, etc.).
Then it takes each row and appends it to the corresponding one of the 123 Excel files for the various countries. This works well for, I believe, somewhere in the range of 30-50 iterations before it either causes file corruption or weird data entries.
I have my code below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to start collecting data quickly; unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
Please note that the 30-second interval is only for quick testing; I don't plan to send requests that often for months. I just wanted to see what the issue was. It was originally set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time

def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os
    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()
        table = pd.DataFrame(columns=['Date and Time','Country','Total Cases','New Cases','Total Deaths','New Deaths','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)
        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country','Total Cases','New Cases (+)','Total Deaths','New Deaths (+)','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop']
        table['Date & Time'] = extractTime
        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i,0] + '.xlsx'
        #     table.iloc[i:i+1,:].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]
        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the files after some number of iterations (usually 30-50 successful ones), a file has either kept only 2 rows and lost all the others, or it keeps appending while the row above loses one entry, the row two above loses two entries, and so on (essentially forming a triangle of missing values).
Above the affected rows there are a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt, but I would still like to learn from it. I appreciate any help in advance.
As far as I understand, the webpage only has one table element. My suggestion would be to use pandas' read_html method, as it returns a clean and structured table.
Try the code below; you can modify it to schedule the process:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
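If you want to tie that back to the per-country files, something along these lines might work as a starting point (a sketch only; I'm assuming the first column of the parsed table is the country name and that the country names are valid file names):

import os
import datetime
import requests
import pandas as pd

url = 'https://www.worldometers.info/coronavirus/'
df = pd.read_html(requests.get(url).text)[-1]
df['Date & Time'] = str(datetime.datetime.now())

for _, row in df.iterrows():
    country = str(row.iloc[0]).strip()  # assumption: the first column holds the country name
    fname = country + '.csv'
    # Append one row per run; write the header only when the file does not exist yet.
    row.to_frame().T.to_csv(fname, mode='a', header=not os.path.exists(fname), index=False)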
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
   Unnamed: 0  Unnamed: 0.1
0           7             7
For some reason I have the above output in every file. I don't know why. But it's in the first 2 columns yet it still seems to be reading and writing correctly. Not sure what's going on here.
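Those "Unnamed: 0" columns are what pandas adds when a CSV was written with its index included and then read back without treating that column as the index. A small sketch of a write/read pair that avoids them (the file and column names here are just a toy example):

import pandas as pd

df = pd.DataFrame({'Country': ['USA'], 'Total Cases': [123]})  # toy data

# Write without the index so no extra unnamed column is stored in the file.
df.to_csv('USA.csv', index=False)

# Read it back; the columns come back exactly as written, no index_col needed.
df2 = pd.read_csv('USA.csv')
print(df2)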

How to automatically import csv file(from local pc) every minute (synchronize it with my computer clock)

I have a CSV file on my computer that updates automatically every minute, e.g. after 08:01 it updates, after 08:02 it updates, and so on.
Importing this file into Python is easy:
import pandas as pd
myfile=pd.read_csv(r'C:\Users\HP\Desktop\levels.csv')
I want to update/re-import this file every minute, based on my PC clock. I want to use threading, since I want to be able to run other cells while the import keeps running in the background.
So basically the code might be (other suggestions are welcome):
import pandas as pd
import threading
import datetime
import time

# code to import the csv file based on the pc clock, automatically every minute
I want this to run in a way that I can still use other functions in other cells (I tried using "schedule", but then I can't run anything else because the cell stays busy and shows the asterisk symbol (*)).
Meaning, if I run the variable 'myfile' in another cell:
myfile
it shows a dataframe with updated values each time.
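One possible shape for this, as a sketch (same file path as above; the refresh simply re-reads the file at the top of every minute on a daemon thread so other cells stay usable):

import threading
import time
import pandas as pd

myfile = pd.read_csv(r'C:\Users\HP\Desktop\levels.csv')

def refresh_every_minute():
    global myfile
    while True:
        # Sleep until the top of the next minute according to the PC clock.
        time.sleep(60 - time.time() % 60)
        try:
            myfile = pd.read_csv(r'C:\Users\HP\Desktop\levels.csv')
        except Exception as e:
            print('reload failed:', e)

threading.Thread(target=refresh_every_minute, daemon=True).start()
# Other cells can keep using `myfile`; it is replaced with fresh data each minute.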

Make Jupyter Notebook script work 1 time per hour

I have a Jupyter Notebook. Here is just a simplified example.
# Parsing the website
def parse_website_function(url):
    return (value, value2)

# Making some calculations (hypothesis)
def linear_model(value, value2):
    return (calculations)

# Jot down calculations to csv file
pd.to_csv(calculations)
I would like to know how to make it run every hour, and how to append new rows of time-series data to the same output CSV file each time. Thanks!
A really basic way to do this would be to just make the program sleep for 3600 seconds.
For example this would make your program pause for 1 hour:
import time
time.sleep(3600)
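If you also need each run to add a new row to the same CSV, a rough sketch building on that (the function names follow the question's simplified example; 'output.csv', the url, and the column names are my placeholders):

import os
import time
import pandas as pd

url = 'https://example.com'  # placeholder

while True:
    value, value2 = parse_website_function(url)
    calculations = linear_model(value, value2)
    row = pd.DataFrame([{'timestamp': pd.Timestamp.now(), 'result': calculations}])
    # mode='a' appends to the existing file; write the header only the first time.
    row.to_csv('output.csv', mode='a', header=not os.path.exists('output.csv'), index=False)
    time.sleep(3600)  # wait an hour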

Use Benthic Golden "ImpExp 6" from python code

Using the Benthic Golden6 "ImpExp6" tool, I can successfully import 122K+ rows of data from a CSV file.
When I attempt to automate this via a .py script, as I have with other, smaller data sets, I encounter an "exceeded table space" error. I dropped everything from the user to maximize available space just for test purposes and continue to receive the error; however, I can still use the import tool and import the 122K rows with no problems.
If I can import the file manually with no issues, should I not be able to do so via a Python script as well? Below is the script I am using.
Note: if I use lines = [] followed by for line in reader: lines.append(line), it appends 5556 rows of data, versus the nothing I am getting with the script below. Using Python 2.7.
import cx_Oracle
import csv

connection = cx_Oracle.connect('myinfo')
cursor = connection.cursor()
L = []
reader = csv.reader(open("myfile.csv", "r"))
for row in reader:
    L.append(row)
cursor.execute("ALTER SESSION SET NLS_DATE_FORMAT = 'MM/DD/YYYY'")
cursor.executemany("INSERT INTO BI_VANTAGE_TEST VALUES(:25,:24,:23,:22,:21,:20,:19,:18,:17,:16,:15,:14,:13,:12,:11,:10,:9,:8,:7,:6,:5,:4,:3,:2,:1)", L)
connection.commit
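As an aside, and only as a sketch (I can't verify it against the tablespace error): committing in smaller batches, and calling commit() as a method, keeps each transaction smaller. Note that connection.commit without parentheses does nothing, so the script above never actually commits.

import csv
import cx_Oracle

connection = cx_Oracle.connect('myinfo')
cursor = connection.cursor()
cursor.execute("ALTER SESSION SET NLS_DATE_FORMAT = 'MM/DD/YYYY'")

reader = csv.reader(open("myfile.csv", "r"))
rows = [row for row in reader]

BATCH = 5000  # arbitrary batch size
sql = "INSERT INTO BI_VANTAGE_TEST VALUES(:25,:24,:23,:22,:21,:20,:19,:18,:17,:16,:15,:14,:13,:12,:11,:10,:9,:8,:7,:6,:5,:4,:3,:2,:1)"
for i in range(0, len(rows), BATCH):
    cursor.executemany(sql, rows[i:i + BATCH])
    connection.commit()  # note the parentheses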
I was able to automate this import using an alternate method (note: the keystroke commands are specific to the steps I needed to complete within the tool I was using).
from pywinauto.application import Application
import pyautogui
import time

app = Application().start("C:\myprogram.exe")
pyautogui.typewrite(['enter', 'right', 'tab'])
pyautogui.typewrite('myfile.txt')
pyautogui.typewrite(['tab'])
pyautogui.typewrite('myoracletbl')
pyautogui.typewrite(['tab', 'tab', 'tab'])
pyautogui.typewrite(['enter'])
pyautogui.typewrite(['enter'])
time.sleep(#seconds)
Application.Kill_(app)
