I'm scraping data with the requests library and parsing it with Beautiful Soup.
I'm storing the scraped data in a MySQL database.
I want to run the scraper each time a new entry appears in a table.
Assuming you already have your scraping method, let's call it scrape_data().
You can use MySQL Connector/Python to run a query on the database and call the scraper as it reads each row (although you might want to buffer the rows into memory to handle disconnects).
# Importing the MySQL-Python-connector
import mysql.connector as mysqlConnector
# Creating connection with the MySQL Server Running. Remember to use your own credentials.
conn = mysqlConnector.connect(host='localhost', user='root', passwd='root')
# Handle bad connections
if conn:
    print("Connection Successful :)")
else:
    print("Connection Failed :(")
# Creating a cursor object to traverse the resultset
cur = conn.cursor()
# Assuming the column is called data in a table called table. Replace as needed.
cur.execute("SELECT data FROM table")
for row in cur:
    scrape_data(row[0])  # Assumes data is the first column.
# Closing the connection - or you will end up with a resource leak
conn.close()
Note
You can find the official connector here.
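If the goal is to kick off scrape_data() only when a new entry appears, one simple approach is to poll the table and remember the highest id you have already processed. This is only a sketch, assuming an auto-increment id column and placeholder database/table names; adapt it to your schema.
import time
import mysql.connector as mysqlConnector

conn = mysqlConnector.connect(host='localhost', user='root', passwd='root', database='mydb')
last_seen_id = 0  # highest id already handed to the scraper

while True:
    cur = conn.cursor()
    # Only fetch rows that have not been processed yet (assumes an auto-increment `id` column)
    cur.execute("SELECT id, data FROM my_table WHERE id > %s ORDER BY id", (last_seen_id,))
    for row_id, data in cur.fetchall():
        scrape_data(data)       # your existing scraping method
        last_seen_id = row_id
    cur.close()
    time.sleep(30)              # poll every 30 seconds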
I am using python 3.9 with a pyodbc connection to call a SQL Server stored procedure with two parameters.
This is the code I am using:
connectionString = buildConnection() # build connection
cursor = connectionString.cursor() # Create cursor
command = """exec [D7Ignite].[Service].[spInsertImgSearchHitResults] #RequestId = ?, #ImageInfo = ?"""
values = (requestid, data)
cursor.execute(command, (values))
cursor.commit()
cursor.close()
requestid is simply an integer, but data is defined as follows (a list of JSON objects):
[{"ImageSignatureId":"27833", "SimilarityPercentage":"1.0"}]
The stored procedure I am trying to run is supposed to insert data into a table, and it works perfectly fine when executed from Management Studio. When running the code above I notice there are no errors but data is not inserted into the table.
To help me debug, I printed the query preview:
exec [D7Ignite].[Service].[spInsertImgSearchHitResults] @RequestId = 1693, @ImageInfo = [{"ImageSignatureId":"27833", "SimilarityPercentage":"1.0"}]
Pasting this exact line into SQL Server runs the stored procedure with no problem, and data is properly inserted.
I have enabled autocommit = True when setting up the connection and other CRUD commands work perfectly fine with pyodbc.
Is there anything I'm overlooking? Or is pyodbc simply not processing my query properly? If so, are there any other ways to run Stored Procedures from Python?
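Not an authoritative fix, but one thing worth trying is the ODBC call escape sequence together with an explicit commit on the connection. The procedure name and parameter order below come from the question; passing the list through json.dumps() is an assumption on my part.
import json

conn = buildConnection()   # same helper as in the question
cursor = conn.cursor()

data = [{"ImageSignatureId": "27833", "SimilarityPercentage": "1.0"}]
command = "{CALL [D7Ignite].[Service].[spInsertImgSearchHitResults] (?, ?)}"
cursor.execute(command, (requestid, json.dumps(data)))  # requestid as defined in the question
conn.commit()              # commit on the connection rather than the cursor
cursor.close()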
Hi, I'm trying to connect to Snowflake via ODBC from Python to run queries. I get results, but the console displays a lot of messages that hide the results. Is there any way to display just the results of my query, or another way to connect from Python?
import pyodbc
con = pyodbc.connect('DSN=MYODBCNAME;UID=MYUSER')
con.setencoding(encoding='utf-8')
con.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cursor=con.cursor()
cursor.execute("use warehouse WAREHOUSENAME;")
cursor.execute("select * from SCHEMA.DATABASE.TABLE limit 5")
while True:
    row = cursor.fetchone()
    if not row:
        break
    print(row)
Here are some of those messages, but if I scroll up for a while I can find the results.
Regards
I would recommend that you use the Snowflake Python connector to connect Snowflake to Python:
https://docs.snowflake.com/en/user-guide/python-connector.html
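A minimal sketch with the snowflake-connector-python package might look like this (the account, credentials, warehouse, and table names are placeholders):
import snowflake.connector

con = snowflake.connector.connect(
    user='MYUSER',
    password='MYPASSWORD',
    account='MYACCOUNT',
    warehouse='WAREHOUSENAME',
)
cur = con.cursor()
try:
    cur.execute("select * from MYDATABASE.MYSCHEMA.MYTABLE limit 5")
    for row in cur:
        print(row)   # print only the query results
finally:
    cur.close()
    con.close()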
I am trying to insert data into the database, but I get this error:
sqlite3.OperationalError: near "WHERE": syntax error
This is my code:
c.execute(f"INSERT INTO math(qula) WHERE name = '{member.name}' VALUES({saboloo})")
I suspect that you want to update the column qula of an existing row of the table math and not insert a new row.
Also, it's a good practice to use ? placeholders:
c.execute("UPDATE math SET qula = ? WHERE name = ?", (saboloo, member.name))
To insert data into SQLite, you first have to import the sqlite3 module from the Python standard library. You then connect to the database by passing a file path to the connect() method in the sqlite3 module; if the database you pass to connect() does not exist, one will be created at that path, and if it does exist, it will be connected to.
import sqlite3
con = sqlite3.connect('/path/xxx.sqlite3')
You then have to create a cursor object using the cursor() method:
c = con.cursor()
You then prepare an SQL query to INSERT a record into the database, again using a ? placeholder:
c.execute("INSERT INTO math(qula) VALUES (?)", (saboloo,))
I hope this one helps.
You can also read more here: Python SQLite insert data.
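Putting those pieces together, a minimal end-to-end sketch (using the table and column names from the question; the file path and the saboloo value are placeholders) would be:
import sqlite3

saboloo = 5  # placeholder value; use your own
con = sqlite3.connect('/path/xxx.sqlite3')
c = con.cursor()
# A ? placeholder avoids both the syntax error and SQL injection
c.execute("INSERT INTO math(qula) VALUES (?)", (saboloo,))
con.commit()   # without this the insert is not persisted to the file
con.close()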
I'm trying to import data from a .xlsx file into a SQL database.
Right now, I have a Python script which uses the openpyxl and MySQLdb modules to:
establish a connection to the database
open the workbook
grab the worksheet
loop through the rows of the worksheet, extracting the columns I need
and insert each record into the database, one by one
Unfortunately, this is painfully slow. I'm working with a huge data set, so I need to find a faster way to do this (preferably with Python). Any ideas?
wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']
conn = MySQLdb.connect()
cursor = conn.cursor()
cursor.execute("SET autocommit = 0")
for row in ws.iter_rows(row_offset=1):
    sql_row = # data i need
    cursor.execute("INSERT sql_row")
conn.commit()
Disable autocommit if it is on! Autocommit causes MySQL to try to push each insert to disk immediately. That is fine if you only have one insert, but it is what makes each individual insert take so long. Instead, turn it off and insert the data all at once, committing only once you've run all of your insert statements.
Something like this might work:
import MySQLdb

con = MySQLdb.connect(
    host="your db host",
    user="your username",
    passwd="your password",
    db="your db name"
)
cursor = con.cursor()
cursor.execute("SET autocommit = 0")
data = # some code to get data from excel
for datum in data:
    cursor.execute("your insert statement".format(datum))
con.commit()
con.close()
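If a single commit is still too slow, batching the rows and sending them with executemany() is another option. The sketch below assumes a hypothetical table my_table with columns a and b, and that the first two cells of each worksheet row hold those values; adjust to your actual schema.
import MySQLdb
import openpyxl

wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']

con = MySQLdb.connect(host="your db host", user="your username",
                      passwd="your password", db="your db name")
cursor = con.cursor()

# Collect the columns you need from each row, then insert them as one batch
rows = [(row[0].value, row[1].value) for row in ws.iter_rows(min_row=2)]
cursor.executemany("INSERT INTO my_table (a, b) VALUES (%s, %s)", rows)
con.commit()
con.close()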
Consider saving the workbook's worksheet as a CSV, then using MySQL's LOAD DATA INFILE. This is often a very fast way to load the data.
sql = """LOAD DATA INFILE '/path/to/data.csv'
INTO TABLE myTable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\n'"""
cursor.execute(sql)
con.commit()
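To produce that CSV from the workbook in the first place, something along these lines should work (the file must be readable by the MySQL server, or use LOAD DATA LOCAL INFILE instead):
import csv
import openpyxl

wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']

with open('/path/to/data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in ws.iter_rows(min_row=2):              # skip the header row
        writer.writerow([cell.value for cell in row])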
I have a query that returns over 125K rows.
The goal is to write a script that iterates through the rows and, for each one, populates a second table with data processed from the result of the query.
To develop the script, I created a duplicate database with a small subset of the data (4126 rows).
On the small database, the following code works:
import os
import sys
import random
import mysql.connector
cnx = mysql.connector.connect(user='dbuser', password='thePassword',
host='127.0.0.1',
database='db')
cnx_out = mysql.connector.connect(user='dbuser', password='thePassword',
host='127.0.0.1',
database='db')
ins_curs = cnx_out.cursor()
curs = cnx.cursor(dictionary=True)
#curs = cnx.cursor(dictionary=True,buffered=True) #fail
with open('sql\\getRawData.sql') as fh:
    sql = fh.read()
curs.execute(sql, params=None, multi=False)
result = curs.fetchall()  # <=== script stops at this point
print(len(result))        # <=== this line never executes
print(curs.column_names)
curs.close()
cnx.close()
cnx_out.close()
sys.exit()
The line curs.execute(sql, params=None, multi=False) succeeds on both the large and small databases.
If I use curs.fetchone() in a loop, I can read all records.
If I alter the line:
curs = cnx.cursor(dictionary=True)
to read:
curs = cnx.cursor(dictionary=True,buffered=True)
The script hangs at curs.execute(sql, params=None, multi=False).
I can find no documentation on any limits to fetchall(), nor can I find any way to increase the buffer size, and no way to tell how large a buffer I even need.
There are no exceptions raised.
How can I resolve this?
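Since fetchone() in a loop does work, one stopgap (sketched against the same connections and sql variable as above, with the row handling left as a placeholder) is to stream the result in chunks with fetchmany() instead of calling fetchall():
curs = cnx.cursor(dictionary=True)
curs.execute(sql, params=None, multi=False)
while True:
    rows = curs.fetchmany(1000)   # pull the result set in chunks of 1000 rows
    if not rows:
        break
    for row in rows:
        # process `row` and write it to the second table via ins_curs here
        pass
cnx_out.commit()
curs.close()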
I was having this same issue, first on a query that returned ~70k rows and then on one that only returned around 2k rows (and for me RAM was also not the limiting factor). I switched from using mysql.connector (i.e. the mysql-connector-python package) to MySQLdb (i.e. the mysql-python package) and then was able to fetchall() on large queries with no problem. Both packages seem to follow the python DB API, so for me MySQLdb was a drop-in replacement for mysql.connector, with no code changes necessary beyond the line that sets up the connection. YMMV if you're leveraging something specific about mysql.connector.
Pragmatically speaking, if you don't have a specific reason to be using mysql.connector, the solution here is just to switch to a package that works better!
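For reference, the swap might look roughly like this; note that the keyword names differ slightly (passwd/db instead of password/database) and that dictionary rows are requested with a DictCursor rather than dictionary=True:
import MySQLdb
import MySQLdb.cursors

cnx = MySQLdb.connect(user='dbuser', passwd='thePassword',
                      host='127.0.0.1', db='db',
                      cursorclass=MySQLdb.cursors.DictCursor)  # rows come back as dicts
curs = cnx.cursor()
with open('sql\\getRawData.sql') as fh:
    sql = fh.read()
curs.execute(sql)
result = curs.fetchall()
print(len(result))
curs.close()
cnx.close()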