I am using MechanicalSoup and I need to automatically fill a web form using information from a dataframe. The dataframe is called checkdataframe.
import mechanicalsoup
from bs4 import BeautifulSoup

br = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')

for row in checkdataframe.codenumber:
    if row in '105701':
        url = "https://www.internetform.com"
        br.open(url)
        for column in checkdataframe[['ID', 'name', 'email']]:
            br.select_form('form[action="/internetform.com/info.do"]')
            br.form['companycode'] = checkdataframe['ID']      # THIS INFORMATION SHOULD COME FROM THE DATAFRAME
            br.form['username'] = checkdataframe['name']       # THIS INFORMATION SHOULD COME FROM THE DATAFRAME
            br.form['emailaddress'] = checkdataframe['email']  # THIS INFORMATION SHOULD COME FROM THE DATAFRAME
            response = br.submit_selected()
            soup = BeautifulSoup(response.text, 'lxml')
            table = soup.find('div', attrs={'class': 'row'})
            for row in table.findAll('div', attrs={'class': 'col-md-4 col-4'}):
                scrapeinfo = {}
                scrapeinfo['STATUS'] = row.div
                scrapeinfo['NAMEOFITEM'] = row.label
                scrapeinfo['PRICE'] = row.div
            checkdataframe.append(scrapeinfo)
    else:
        break
How can I make br.form['companycode'] = checkdataframe['ID'] work, instead of hard-coding the values like this:
br.form['companycode'] = '105701'
br.form['username'] = 'myusername'
br.form['emailaddress'] = 'myusername@gmail.com'
I also need to append the scraped information to checkdataframe.
I need help, please.
Use Selenium for this activity.
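That said, the immediate bug is independent of the library: the code assigns whole DataFrame columns to the form fields instead of one value per row. A minimal sketch of the per-row pattern, staying with MechanicalSoup and assuming the URL, form selector, and field names from the question:

import mechanicalsoup

br = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')

# Fill and submit the form once per DataFrame row, passing scalar values.
for _, r in checkdataframe.iterrows():
    br.open("https://www.internetform.com")
    br.select_form('form[action="/internetform.com/info.do"]')
    br.form['companycode'] = str(r['ID'])    # one cell, not the whole column
    br.form['username'] = str(r['name'])
    br.form['emailaddress'] = str(r['email'])
    response = br.submit_selected()

The scraped results can then be collected in a list of dicts inside the loop and concatenated onto checkdataframe afterwards, since DataFrame.append does not modify the frame in place.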
I'm trying to extract the data from this website: https://www.texmesonet.org/DataProducts/CustomDownloads
I have to fill in the text fields and select a few options before downloading the data, and I'm trying to do it with MechanicalSoup. The roadblock I'm facing is that the tags corresponding to the parameters that need to be set are not inside a form. Is there any way to tackle this using MechanicalSoup? I have pasted the code that I'm using to select the parameters below.
import mechanicalsoup

browser = mechanicalsoup.Browser()
url = 'https://www.texmesonet.org/DataProducts/CustomDownloads'
page = browser.get(url)
html_page = page.soup
#print(html_page.select('div'))
region = html_page.select('select')[0]
region.select('option')[0]["value"] = 'Station'
data_type = html_page.select('select')[1]
data_type.select('option')[2]["value"] = 'Timeseries'
#print(html_page.select('span'))
start_date = html_page.find_all("div", {"class": "col50 field-container"})[2]
start_date.select('input')[0]["value"] = '11/28/2022'
end_date = html_page.find_all("div", {"class": "col50 field-container"})[3]
end_date.select('input')[0]["value"] = '12/05/2022'
station = html_page.find_all("div", {"class": "col50 field-container"})[4]
station.select('input')[0]["value"] = 'Headwaters Ranch'
interval = html_page.select('select')[3]
interval.select('option')[0]["value"] = 'Daily'
units = html_page.select('select')[5]
units.select('option')[0]["value"] = 'US / Customary'
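For what it's worth, assigning option values in the parsed soup only changes your local copy of the HTML; nothing is sent to the server. When the controls are not inside a <form>, the page typically assembles its download request in JavaScript, which MechanicalSoup cannot execute. A common workaround is to find the real request in the browser's dev-tools network tab and replicate it with requests. A sketch, where the endpoint and parameter names are placeholders, not the site's actual API:

import requests

# Placeholder endpoint and parameter names: inspect the network tab on
# https://www.texmesonet.org/DataProducts/CustomDownloads to find the
# real request the download button fires, then mirror it here.
payload = {
    'region': 'Station',
    'dataType': 'Timeseries',
    'startDate': '11/28/2022',
    'endDate': '12/05/2022',
    'station': 'Headwaters Ranch',
    'interval': 'Daily',
    'units': 'US / Customary',
}
response = requests.post('https://www.texmesonet.org/<download-endpoint>', data=payload)
print(response.status_code)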
https://www.investing.com/economic-calendar/initial-jobless-claims-294
As stated in the question, I tried to web scrape the data table from this link. However, I was only able to scrape the first few rows of data, up to the "show more" button. Besides web scraping, I've also tried investpy.economic_calendar(), but the filtering parameters are so unpredictable that I could not extract the jobless claims data directly. Could somebody please help me with this?
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.investing.com/economic-calendar/initial-jobless-claims-294'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table1 = soup.find('table', id='eventHistoryTable294')

headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

mydata = pd.DataFrame(columns=headers)
table_rows = table1.find_all('tr')
#df_side = pd.DataFrame(mydata)
#x = df_side.head(100)

for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row

print(mydata)
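The rows behind "Show more" are loaded by JavaScript, so plain requests only ever sees the initial page. One option is Selenium: click the button repeatedly, then hand the final HTML to pandas. A sketch, where the "Show more" link text is an assumption to verify against the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://www.investing.com/economic-calendar/initial-jobless-claims-294')

for _ in range(10):  # click "Show more" a fixed number of times
    try:
        driver.find_element(By.LINK_TEXT, 'Show more').click()
        time.sleep(1)  # give the newly loaded rows time to arrive
    except Exception:
        break  # button gone or not yet clickable

# Parse the fully expanded table in one step.
table = pd.read_html(driver.page_source, attrs={'id': 'eventHistoryTable294'})[0]
driver.quit()
print(table.head())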
1) I am trying to scrape data for multiple URLs stored in a CSV, but the result gives None.
2) I want to store the fetched data row by row in a dataframe named df, but it only stores one row.
Here's my code (I have pasted it below from where the data extraction starts):
import csv
import time
import pandas as pd
from bs4 import BeautifulSoup

# 'driver' is assumed to be a Selenium WebDriver initialized earlier.
df = pd.DataFrame()

with open('test1.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for line in reader:
        link = line[0]
        print(type(link))
        print(link)
        driver.get(link)
        # Scroll down gradually so lazy-loaded content renders.
        height = driver.execute_script("return document.body.scrollHeight")
        for scrol in range(100, height, 100):
            driver.execute_script(f"window.scrollTo(0,{scrol})")
            time.sleep(0.2)
        src = driver.page_source
        soup = BeautifulSoup(src, 'lxml')
        name_div = soup.find('div', {'class': 'flex-1 mr5'})
        name_loc = name_div.find_all('ul')
        name = name_loc[0].find('li').get_text().strip()
        loc = name_loc[1].find('li').get_text().strip()
        connection = name_loc[1].find_all('li')
        connection = connection[1].get_text().strip()
        exp_section = soup.find('section', {'id': 'experience-section'})
        exp_section = exp_section.find('ul')
        div_tag = exp_section.find('div')
        a_tag = div_tag.find('a')
        job_title = a_tag.find('h3').get_text().strip()
        company_name = a_tag.find_all('p')[1].get_text().strip()
        joining_date = a_tag.find_all('h4')[0].find_all('span')[1].get_text().strip()
        exp = a_tag.find_all('h4')[1].find_all('span')[1].get_text().strip()
        df['name'] = [name]
        df['location'] = [loc]
        df['connection'] = [connection]
        df['company_name'] = [company_name]
        df['job_title'] = [job_title]
        df['joining_date'] = [joining_date]
        df['tenure'] = [exp]

df
Output:
name location connection company_name job_title joining_date tenure
0 None None None None None None None
I am not sure whether the for loop goes wrong or what the exact problem is, but for a single URL it works fine.
I am using BeautifulSoup for the first time, so I don't have much experience with it. Please help me make the desired changes. Thanks.
I don't think the end of your code is appending new rows to the dataframe.
Try replacing df["name"] = [name] and the following lines with this:
new_line = {
    "name": [name],
    "location": [loc],
    "connection": [connection],
    "company_name": [company_name],
    "job_title": [job_title],
    "joining_date": [joining_date],
    "tenure": [exp],
}
temp_df = pd.DataFrame.from_dict(new_line)
# DataFrame.append returns a new frame (and is removed in pandas 2.x),
# so concatenate and reassign instead of calling df.append(temp_df).
df = pd.concat([df, temp_df], ignore_index=True)
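Alternatively, a sketch using the same variables from the question's loop: collect one dict per URL and build the DataFrame once at the end, which avoids growing the frame inside the loop.

rows = []  # one dict per scraped URL
for line in reader:
    # ... the scraping code from the question ...
    rows.append({
        "name": name,
        "location": loc,
        "connection": connection,
        "company_name": company_name,
        "job_title": job_title,
        "joining_date": joining_date,
        "tenure": exp,
    })
df = pd.DataFrame(rows)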
[Screenshot: the web page being scraped]
[Screenshot: the wrong output I get]
So basically I was trying to scrape the rows of streamers on each page, with the tag name "tr". In each row there are multiple columns that I want to include in my output. I was able to include almost all of them, but two columns that share the same tag name frustrated me a lot (the two columns about followers). I tried indexing them to get only the odd or even entries, but, as the second screenshot shows, it did not work out well: the numbers just keep repeating instead of advancing down the table as they should. So is there some way to get the "followers gained" column correctly into the output?
It's my first time asking here, so I am not sure if this is enough. I am glad to add more info later if needed.
import requests
from bs4 import BeautifulSoup

for i in range(30):  # number of pages plus one
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    headers = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content)
    channels = soup.find_all('tr')
    for idx, channel in enumerate(channels):
        if idx % 2 == 1:
            idx += 1
        Name = ", ".join([p.get_text(strip=True) for p in channel.find_all('a', attrs={'style': 'color:inherit'})])
        Avg = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-viewers')])
        Time = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-streamed')])
        All = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-viewersMax')])
        HW = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-watched')])
        FG = ", ".join([soup.find_all('td', class_='color-followers hidden-sm')[idx].get_text(strip=True)])
Maybe an alternative approach?
It uses pandas to read the tables; you just have to clean the ads out.
I also used time.sleep() to delay the loops and be gentle to the server.
Example
import requests, time
import pandas as pd

df_list = []

for i in range(30):  # number of pages plus one
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    headers = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    df_list.append(pd.read_html(page.text)[0])
    time.sleep(1.5)

df = pd.concat(df_list).reset_index(drop=True)
df.rename(columns={'Unnamed: 2': 'Name'}, inplace=True)
df.drop(df.columns[[0, 1]], axis=1, inplace=True)
df[~df.Rank.str.contains(".ads { display:")].to_csv('table.csv', mode='w', index=False)
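This also sidesteps the duplicated class-name problem from the question: pd.read_html() keeps every <td> in its own column, so "Followers" and "Followers gained" come out as separate columns instead of one interleaved list.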
I am trying to scrape the website https://remittanceprices.worldbank.org/en/corridor/Australia/China for the Fee field.
import requests
from bs4 import BeautifulSoup

url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'lxml')
rows = [i.get_text("|").split("|") for i in soup.select('#tab-1 .corridor-row')]
for row in rows:
    #a,b,c,d,e = row[2],row[15],row[18],row[21],row[25]
    #print(a,b,c,d,e,sep='|')
    print('{0[2]}|{0[15]}|{0[18]}|{0[21]}|{0[25]}').format(row)
But I am getting an AttributeError with the above code.
Can anyone help me out?
The problem is that you are calling .format() on the result of print(), not on the string. .format() is a method of the str type and print() returns None, so try:
url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'lxml')
rows = [i.get_text("|").split("|") for i in soup.select('#tab-1 .corridor-row')]
for row in rows:
    #a,b,c,d,e = row[2],row[15],row[18],row[21],row[25]
    #print(a,b,c,d,e,sep='|')
    print('{0[2]}|{0[15]}|{0[18]}|{0[21]}|{0[25]}'.format(row))
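Equivalently, an f-string avoids the problem entirely:

for row in rows:
    print(f'{row[2]}|{row[15]}|{row[18]}|{row[21]}|{row[25]}')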