I don't get what the problem is here. I want to build a web scraper that scrapes Amazon and writes the price and the name into a database. But for some reason, it tells me that the columns and values don't match. I do have one additional column in my database called "timestamp" where I automatically put in the time, but that is handled by the database. I am using MariaDB; a friend said I can use the MySQL API for MariaDB as well.
P.S. preis = price — I'm from Germany and sometimes switch between English and German, just in case anyone is wondering.
import requests, time, csv, pymysql
from bs4 import BeautifulSoup as bs

#URL = input("URL")
URL = "https://www.amazon.de/gp/product/B075FTXF15/ref=crt_ewc_img_bw_3?ie=UTF8&psc=1&smid=A24FLB4J0NZBNT"

def SOUPIT(tempURL):
    URL = tempURL
    page = requests.get(URL, headers={"User-Agent": "Defined"})
    soup = bs(page.content, "html.parser")
    raw_price = soup.find(id="priceblock_ourprice").get_text()
    price = raw_price[:-2]
    raw_name = soup.find(id="productTitle").get_text()
    name = raw_name.strip()
    for i in range(0, len(name) - 1):
        if name[i] == "(":
            name = name[:i]
            break
    data = [name, price, time.strftime("%H:%M:%S"), time.strftime("%d.%m.%Y")]
    return data

data = SOUPIT(URL)

while True:
    data = SOUPIT(URL)
    db = pymysql.connect("localhost", "root", "root", "test")
    cursor = db.cursor()
    if (data == None):
        break
        print("break")
    else:
        name = data[0]
        preis = data[1]
        sql = """INSERT INTO amazon_preise (Name, Preis) VALUES ('{}',{})""".format(name, preis)
        cursor.execute(sql)
        db.commit()
        print("success")
        print(data)
    time.sleep(60)
Error message:
Traceback (most recent call last):
File "amazonscraper_advanced.py", line 43, in <module>
cursor.execute(sql)
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\cursors.py", line 170, in execute
result = self._query(query)
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\cursors.py", line 328, in _query
conn.query(q)
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\connections.py", line 517, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\connections.py", line 732, in _read_query_result
result.read()
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\connections.py", line 1075, in read
first_packet = self.connection._read_packet()
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\connections.py", line 684, in _read_packet
packet.check_error()
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\protocol.py", line 220, in check_error
err.raise_mysql_exception(self._data)
File "C:\Users\...\AppData\Local\Programs\Python\Python36\lib\site-packages\pymysql\err.py", line 109, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.InternalError: (1136, "Column count doesn't match value count at row 1")
The problem is caused, at least partially, by using string formatting to insert values into an SQL statement.
Here is the scraped data:
>>> data = ['Sweatshirt Alien VS. Predator Z100088', '32,99', '14:08:43', '08.09.2019']
>>> name, preis, *_ = data
Let's create the SQL statement
>>> sql = """INSERT INTO amazon_preise (Name, Preis) VALUES ('{}',{})""".format(name,preis)
And display it:
>>> sql
"INSERT INTO amazon_preise (Name, Preis) VALUES ('Sweatshirt Alien VS. Predator Z100088',32,99)"
Observe that the VALUES clause contains three comma-separated values; this is because the web page displays the price in the German style, with a comma separating the cents from the euros. When interpolated into the SQL statement, preis therefore becomes two values instead of one.
The right way to fix this is to convert preis from a string to a float or decimal, and to use parameter substitution instead of string formatting to interpolate the values:
>>> fpreis = float(preis.replace(',', '.'))
>>> sql = """INSERT INTO amazon_preise (Name, Preis) VALUES (%s, %s)"""
>>> cursor.execute(sql, (name, fpreis))
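Putting it together, a minimal sketch of the corrected write loop might look like this (assuming the amazon_preise table only needs Name and Preis filled in, with the timestamp column populated by the database as described):

import pymysql

db = pymysql.connect(host="localhost", user="root", password="root", database="test")
cursor = db.cursor()

data = SOUPIT(URL)                            # e.g. ['Sweatshirt ...', '32,99', '14:08:43', '08.09.2019']
if data is not None:
    name, preis, *_ = data
    fpreis = float(preis.replace(",", "."))   # '32,99' -> 32.99
    # parameter substitution: the driver quotes the values itself, so neither
    # the comma in the price nor any quote in the name can break the SQL
    cursor.execute(
        "INSERT INTO amazon_preise (Name, Preis) VALUES (%s, %s)",
        (name, fpreis),
    )
    db.commit()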
A related question:
So I have data in a 2-dimensional array, and I am trying to insert it into my SQL database.
for scrapeddata in range(len(all_images)):
    mycursor.execute('SELECT * FROM ScrapedBooks WHERE BookLink = %s', (all_images[scrapeddata][3],))
    img_link_table = mycursor.fetchall()
    if len(img_link_table) == 0:
        HoldBookTitle = [all_images[scrapeddata][0], all_images[scrapeddata][2], all_images[scrapeddata][1], all_images[scrapeddata][3]]
        mycursor.executemany("INSERT INTO ScrapedBooks(BookName, Price, ImageLink, BookLink) VALUES(%s,%s,%s,%s)", (HoldBookTitle))
        mydb.commit()
Error:
mycursor.executemany("INSERT INTO ScrapedBooks(BookName, Price, ImageLink, BookLink) VALUES(%s,%s,%s,%s)", (HoldBookTitle))
File"C:\Users\msala\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\cursor_cext.py", line 355, in executemany
stmt = self._batch_insert(operation, seq_params)
File "C:\Users\msala\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\cursor_cext.py", line 333, in _batch_insert
raise errors.InterfaceError(
mysql.connector.errors.InterfaceError: Failed executing the operation; Could not process parameters
I plan to use the saved data from the database in the HTML code for my website; if you have any idea how to do that, please help. Thank you!
A whole array can't be handled that way, but you can build a list of tuples and insert them into the database in one go:
all_images = []      # filled with the scraped rows elsewhere
HoldBookTitle = []

for scrapeddata in range(len(all_images)):
    mycursor.execute('SELECT * FROM ScrapedBooks WHERE BookLink = %s', (all_images[scrapeddata][3],))
    img_link_table = mycursor.fetchall()
    if len(img_link_table) == 0:
        # book is not in the table yet, so queue it as one tuple per row
        HoldBookTitle.append((all_images[scrapeddata][0], all_images[scrapeddata][2], all_images[scrapeddata][1], all_images[scrapeddata][3]))

if HoldBookTitle:
    mycursor.executemany("INSERT INTO ScrapedBooks(BookName, Price, ImageLink, BookLink) VALUES(%s,%s,%s,%s)", HoldBookTitle)
    mydb.commit()
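For what it's worth, executemany expects a sequence of parameter tuples, one tuple per row; passing a single flat list (as in the original code) makes the connector try to treat each element as a whole row, which is roughly what "Could not process parameters" is complaining about. A small illustration with made-up values:

rows = [
    ("Book A", "9.99", "img_a.jpg", "link_a"),   # one tuple per row
    ("Book B", "4.50", "img_b.jpg", "link_b"),
]
mycursor.executemany(
    "INSERT INTO ScrapedBooks(BookName, Price, ImageLink, BookLink) VALUES (%s,%s,%s,%s)",
    rows,
)
mydb.commit()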
I have data scraped from cars.com, and I am trying to save it to a MySQL database but I haven't managed to do so. Here is my full code:
#ScrapeData.py
import requests
from bs4 import BeautifulSoup

URL = "https://www.cars.com/shopping/results/?dealer_id=&keyword=&list_price_max=&list_price_min=&makes[]=&maximum_distance=all&mileage_max=&page=1&page_size=100&sort=best_match_desc&stock_type=cpo&year_max=&year_min=&zip="
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')
cars = soup.find_all('div', class_='vehicle-card')

name = []
mileage = []
dealer_name = []
rating = []
rating_count = []
price = []

for car in cars:
    #name
    name.append(car.find('h2').get_text())
    #mileage
    mileage.append(car.find('div', {'class':'mileage'}).get_text())
    #dealer_name
    dealer_name.append(car.find('div', {'class':'dealer-name'}).get_text())
    #rate
    try:
        rating.append(car.find('span', {'class':'sds-rating__count'}).get_text())
    except:
        rating.append("n/a")
    #rate_count
    rating_count.append(car.find('span', {'class':'sds-rating__link'}).get_text())
    #price
    price.append(car.find('span', {'class':'primary-price'}).get_text())
#save_to_mysql.py
import pymysql
import scrapeData
import mysql.connector

connection = pymysql.connect(
    host='localhost',
    user='root',
    password='',
    db='cars',
)

name = scrapeData.name
mileage = scrapeData.mileage
dealer_name = scrapeData.dealer_name
rating = scrapeData.rating
rating_count = scrapeData.rating_count
price = scrapeData.price

try:
    mySql_insert_query = """INSERT INTO cars_details (name, mileage, dealer_name, rating, rating_count, price)
                            VALUES (%s, %s, %s, %s, %s, %s) """
    records_to_insert = [(name, mileage, dealer_name, rating, rating_count, price)]
    print(records_to_insert)

    cursor = connection.cursor()
    cursor.executemany(mySql_insert_query, records_to_insert)
    connection.commit()
    print(cursor.rowcount, "Record inserted successfully into cars_details table")

except mysql.connector.Error as error:
    print("Failed to insert record into MySQL table {}".format(error))
    connection.commit()

finally:
    connection.close()
Whenever I run this code I get this error message:
Traceback (most recent call last):
File "c:\scraping\save_to_mysql.py", line 28, in <module>
cursor.executemany(mySql_insert_query, records_to_insert)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\cursors.py", line 173, in executemany
return self._do_execute_many(
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\cursors.py", line 211, in _do_execute_many rows += self.execute(sql + postfix)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\cursors.py", line 148, in execute
result = self._query(query)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\cursors.py", line 310, in _query
conn.query(q)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\connections.py", line 548, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\connections.py", line 775, in _read_query_result
result.read()
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\connections.py", line 1156, in read
first_packet = self.connection._read_packet()
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\connections.py", line 725, in _read_packet packet.raise_for_error()
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\protocol.py", line 221, in raise_for_error err.raise_mysql_exception(self._data)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\pymysql\err.py", line 143, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.OperationalError: (1241, 'Operand should contain 1 column(s)')
Does anybody have an idea how to solve this? I want to insert multiple scraped records into MySQL in one execution. I would be glad for your help.
Firstly, I wouldn't use separate lists for all your data, but a single list with all the information about one car nested inside it. So instead of

mileage = []
dealer_name = []

I would create a single list called cars:
cars = []
Then I would create different variables for all the different pieces of scraped info you have on a car, like this:
#brand
brand = car.find('h2').get_text()
#mileage
mileage = car.find('div', {'class':'mileage'}).get_text()
Then I would build the tuple for one car and append it to the list:
toAppend = brand, mileage, dealer_name, rating, rating_count, price
cars.append(toAppend)
Then the output would be:
[('2018 Mercedes-Benz CLA 250 Base', '21,326 mi.', '\nMercedes-Benz of South Bay\n', '4.6', '(1,020 reviews)', '$33,591'), ('2021 Toyota Highlander Hybrid XLE', '9,529 mi.', '\nToyota of Gastonia\n', '4.6', '(590 reviews)', '$47,869')]
I have made a small change on the MySQL side: I put the insert into a function, imported that function into the main script, and just passed the list in as a parameter. Works like a charm.
I know this isn't an elaborate answer on why and how things work, but it is a solution nonetheless.
import requests
from bs4 import BeautifulSoup
from scrapertestsql import insertScrapedCars

URL = "https://www.cars.com/shopping/results/?dealer_id=&keyword=&list_price_max=&list_price_min=&makes[]=&maximum_distance=all&mileage_max=&page=1&page_size=100&sort=best_match_desc&stock_type=cpo&year_max=&year_min=&zip="
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')
scrapedCars = soup.find_all('div', class_='vehicle-card')

cars = []
# mileage = []
# dealer_name = []
# rating = []
# rating_count = []
# price = []

for car in scrapedCars:
    #name
    brand = car.find('h2').get_text()
    #mileage
    mileage = car.find('div', {'class':'mileage'}).get_text()
    #dealer_name
    dealer_name = car.find('div', {'class':'dealer-name'}).get_text()
    #rate
    try:
        rating = car.find('span', {'class':'sds-rating__count'}).get_text()
    except:
        rating = "n/a"
    #rate_count
    rating_count = car.find('span', {'class':'sds-rating__link'}).get_text()
    #price
    price = car.find('span', {'class':'primary-price'}).get_text()

    toAppend = brand, mileage, dealer_name, rating, rating_count, price
    cars.append(toAppend)

insertScrapedCars(cars)
print(cars)
Next, in scrapertestsql.py, I would:
import pymysql

connection = pymysql.connect(
    host='127.0.0.1',
    user='test',
    password='123',
    db='cars',
    port=8889
)

def insertScrapedCars(CarsToInsert):
    try:
        mySql_insert_query = """INSERT INTO cars_details (name, mileage, dealer_name, rating, rating_count, price)
                                VALUES (%s, %s, %s, %s, %s, %s) """

        cursor = connection.cursor()
        cursor.executemany(mySql_insert_query, CarsToInsert)
        connection.commit()
        print(cursor.rowcount, "Record inserted successfully into cars_details table")

    except pymysql.MySQLError as error:
        # catch pymysql's own error class, since the connection is a pymysql one
        print("Failed to insert record into MySQL table {}".format(error))

    finally:
        connection.close()
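One caveat with the module above: the connection is opened at import time and closed in the finally block, so insertScrapedCars can only be called once per run. If you need to call it repeatedly, a variant that opens and closes the connection inside the function (assuming the same credentials) avoids that:

import pymysql

def insertScrapedCars(CarsToInsert):
    # open a fresh connection per call so the function can be reused
    connection = pymysql.connect(host='127.0.0.1', user='test', password='123',
                                 db='cars', port=8889)
    try:
        cursor = connection.cursor()
        cursor.executemany(
            """INSERT INTO cars_details (name, mileage, dealer_name, rating, rating_count, price)
               VALUES (%s, %s, %s, %s, %s, %s)""",
            CarsToInsert,
        )
        connection.commit()
        print(cursor.rowcount, "records inserted into cars_details")
    except pymysql.MySQLError as error:
        print("Failed to insert records into MySQL table: {}".format(error))
    finally:
        connection.close()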
I'm trying to load data from MySQL to BigQuery. I'm using pandas, jaydebeapi and load_table_from_dataframe.
While doing so, I get the error below:
>>> job = client.load_table_from_dataframe(chunk, table_id)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/google/cloud/bigquery/client.py", line 1993, in load_table_from_dataframe
parquet_compression=parquet_compression,
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 486, in dataframe_to_parquet
arrow_table = dataframe_to_arrow(dataframe, bq_schema)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 450, in dataframe_to_arrow
bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 224, in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
File "pyarrow/array.pxi", line 755, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required
>>>
Couple of points:
My source table exists and has the schema below:
EMPID INTEGER,
EMPNAME VARCHAR,
STREETADRESS VARCHAR,
REGION VARCHAR,
STATE VARCHAR,
COUNTRY VARCHAR,
joining_date date,
last_update_date TIMESTAMP(6) -- to hold till millisecond
My target table also exists in BigQuery and below is the schema:
create table if not exists `Project.dataset.table_name`
(EMPID INT64,
EMPNAME STRING,
STREETADRESS STRING,
REGION STRING,
STATE STRING,
COUNTRY STRING,
joining_date DATE,
last_update_date TIMESTAMP
);
Below is the code I'm using:
import datetime
from google.cloud import bigquery
import pandas as pd
import jaydebeapi
import os
client = bigquery.Client()
table_id = "<project_id>.<dataset>.<target_table>"
database_host='<IP Address>'
database_user='<user id>'
database_password='<password>'
database_port='<port>'
database_db='<database_name>'
jclassname = "com.mysql.jdbc.Driver"
url = "jdbc:mysql://{host}:{port}/{database}".format(host=database_host, port=database_port, database=database_db)
driver_args = [database_user, database_password]
jars = ["/<Home_Dir>/script/jars/mysql-connector-java-5.1.45.jar"]
libs = None
cnx = jaydebeapi.connect(jclassname, url, driver_args, jars=jars, libs=libs)
query='select EMPID,EMPNAME,STREETADRESS,REGION,STATE,COUNTRY,joining_date,last_update_date from <table_name>'
cursor = cnx.cursor()
for chunk in pd.read_sql(query, cnx, coerce_float=True, params=None, parse_dates=None, columns=None,chunksize=500000):chunk.apply(lambda x: x.replace(u'\r', u' ').replace(u'\n', u' ') if isinstance(x, str) or isinstance(x, unicode) else x)
job = client.load_table_from_dataframe(chunk, table_id)
job.result()
Kindly help me get this issue resolved. I tried using LoadJobConfig as well, but the same error occurs.
Is that the exact code you ran? I mean, what about the indentation of the statements?
for chunk in pd.read_sql(query, cnx, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=500000):
    chunk.apply(lambda x: x.replace(u'\r', u' ').replace(u'\n', u' ') if isinstance(x, str) or isinstance(x, unicode) else x)
    job = client.load_table_from_dataframe(chunk, table_id)
    job.result()
Fix the indentation of that loop as shown above.
I'm trying to populate my database from a CSV file using Python.
Below is the code I use to populate my sales table:
import csv
import pymssql as psql

conn = psql.connect('localhost:8888', 'SA', 'superSecret', 'videogame')
cursor = conn.cursor()
cursor.execute("""
IF OBJECT_ID('sales', 'U') IS NOT NULL
DROP TABLE sales
CREATE TABLE sales
(
    Id int,
    Name varchar(250),
    Platform varchar(250),
    Year int,
    Genre varchar(250),
    Publisher varchar(250),
    NA_Sales float,
    EU_Sales float,
    JP_Sales float,
    Other_Sales float,
    Global_Sales float
)
""")
conn.commit()

with open('./sales.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        row[1] = row[1].replace("'", "")
        row[5] = row[5].replace("'", "")
        data = tuple(row)
        query = 'insert into sales values {0}'.format(data).replace("N/A", "0")
        print(query)
        cursor.execute(query)

conn.commit()
conn.close()
However, some of my data contains an apostrophe (') in the name column (e.g. Assassin's Creed). This causes an error, as below:
insert into sales values ('129', "Assassin's Creed III", 'PS3', '2012', 'Action', 'Ubisoft', '2.64', '2.56', '0.16', '1.14', '6.5')
Traceback (most recent call last):
File "pymssql.pyx", line 447, in pymssql.Cursor.execute (pymssql.c:7119)
File "_mssql.pyx", line 1011, in _mssql.MSSQLConnection.execute_query (_mssql.c:11586)
File "_mssql.pyx", line 1042, in _mssql.MSSQLConnection.execute_query (_mssql.c:11466)
File "_mssql.pyx", line 1175, in _mssql.MSSQLConnection.format_and_run_query (_mssql.c:12746)
File "_mssql.pyx", line 1586, in _mssql.check_cancel_and_raise (_mssql.c:16880)
File "_mssql.pyx", line 1630, in _mssql.maybe_raise_MSSQLDatabaseException (_mssql.c:17524)
_mssql.MSSQLDatabaseException: (207, b"Invalid column name 'Assassin's Creed III'.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n")
Is there any workaround for this other than manually updating the row (e.g. row[1] = row[1].replace("'", ""))?
Thanks!!
You could use a proper parameterized query, like this:
row = ["Assassin's", "N/A", 9] # test data as list (e.g., from CSV)
data = tuple("0" if x=="N/A" else x for x in row)
print(data) # ("Assassin's", '0', 9)
placeholders = ','.join(['%s' for i in range(len(data))])
query = 'INSERT INTO sales VALUES ({0})'.format(placeholders)
print(query) # INSERT INTO sales VALUES (%s,%s,%s)
cursor.execute(query, data)
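Applied to the CSV loop from the question, that might look roughly like this (same connection and sales table as above, assuming every CSV row has exactly as many fields as the table has columns):

with open('./sales.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data = tuple("0" if x == "N/A" else x for x in row)
        placeholders = ','.join(['%s'] * len(data))
        query = 'INSERT INTO sales VALUES ({0})'.format(placeholders)
        cursor.execute(query, data)   # the driver handles quoting, apostrophes included
conn.commit()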
You can escape the ' by doubling it up (''), which should stop it crashing whilst preserving the apostrophe in your data. Note that replacing "'" with "\'" does nothing in Python (they are the same string), and that the values then have to be quoted explicitly rather than relying on the tuple's repr, which wraps such strings in double quotes (exactly what your error message shows):

with open('./sales.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        # quote every value explicitly and double any embedded apostrophes
        values = ','.join("'{0}'".format(str(x).replace("'", "''")) for x in row)
        query = 'insert into sales values ({0})'.format(values).replace("N/A", "0")
        print(query)
        cursor.execute(query)
I wish to import my .csv file to a table 'testcsv' in MySQL using Python, but I'm unable to do so because I keep getting the following error:
Traceback (most recent call last):
File "<pyshell#9>", line 2, in <module>
cursor.execute(query,r)
File "C:\Python27\lib\site-packages\MySQLdb\cursors.py", line 184, in execute
query = query % db.literal(args)
TypeError: not enough arguments for format string
My code looks like this:
import csv
import MySQLdb

mydb = MySQLdb.connect(host='127.0.0.1',
                       port=3306,
                       user='root',
                       passwd='tulip',
                       db='frompython')
cursor = mydb.cursor()

csv_data = csv.reader(file('C:\Users\Tulip\Desktop\New_Sheet.csv'))
row_count = sum(1 for row in csv_data)

query = """INSERT INTO testcsv (number, name, marks) VALUES (%s, %s, %s)"""

for r in range(1, row_count):
    cursor.execute(query, r)
I've tried every possible answer given to the related questions here, but none of them worked for me. Please help!
for r in range(1, row_count):
just iterates over numbers, i.e. in the first iteration r is simply 1, not a row of data. Remove the line defining row_count (it also exhausts the csv_data reader, so nothing would be left to read afterwards) and iterate over the actual rows:
for r in csv_data:
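Putting that together, a minimal sketch of the corrected import might be (assuming each CSV row holds exactly the three columns number, name and marks, with no header row):

import csv
import MySQLdb

mydb = MySQLdb.connect(host='127.0.0.1', port=3306, user='root',
                       passwd='tulip', db='frompython')
cursor = mydb.cursor()

query = """INSERT INTO testcsv (number, name, marks) VALUES (%s, %s, %s)"""

with open(r'C:\Users\Tulip\Desktop\New_Sheet.csv') as f:
    for r in csv.reader(f):
        cursor.execute(query, r)   # r is a list of three values, matching the placeholders

mydb.commit()
mydb.close()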