How to make a web scraper more human-like? - python

I have a web scraping script written in Python, and when I run it against a website it blocks me and says "you getting the page very fast. you might be a bot".
I tried adding time.sleep() to delay the code, but it still always gets blocked. Is there any way to make this code a little slower?
I'm not sure why it says this. Isn't my script doing the same thing as viewing the page in a browser? What does a browser load that keeps it from being labelled as a bot, when my script is?
from bs4 import BeautifulSoup
import re
import requests
import time
import sys
import csv
FIXED_WEB = "web.net"
def load_car_pages(seq, limit, i):
    time.sleep(10)
    html_web = requests.get(
        f"web.net/homepage",
        headers={
            'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            'Accept-Language': "en-US,en;q=0.5",
            'Accept-Encoding': "gzip, deflate",
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Te': 'trailers'
        }).text

    time.sleep(10)
    sup_me_patate = BeautifulSoup(html_web, 'lxml')
    headers = sup_me_patate.find_all('div', class_='sui-AtomCard-info')  # find headers
    print(f"{headers}")

    for a in headers:
        string = str(a)
        href_pos = [m.start() for m in re.finditer('href=', string)]
        for pos in href_pos:
            slicing = string[pos + 6: string.find('"', pos + 6)]
            print(f"For Link: {slicing}")
            web_link = FIXED_WEB + slicing
            print(f"LINK: {web_link}")
            # limit = 25
            # i = 0
            time.sleep(10)
            try:
                car_web = requests.get(web_link, headers={
                    'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
                    'Origin': FIXED_WEB,
                    "Access-Control-Request-Method": "GET",
                    'Accept-Language': "en-US,en;q=0.5",
                    'Accept-Encoding': "gzip, deflate",
                    'Request-Domain': 'web.net',
                    'Site': 'car',
                    'Referer': web_link,
                    "Sec-Fetch-Dest": "empty",
                    "Sec-Fetch-Mode": "cors",
                    "Sec-Fetch-Site": "same-origin",
                    "Te": "trailers",
                    'Connection': 'close'}).text
                soup = BeautifulSoup(car_web, "lxml")  # parse the fetched page, not the URL string
                # with open(soup.title.string + ".html", 'w') as coolhtml:
                #     string = str(soup)
                #     coolhtml.write(string)
                #     sys.exit(0)

                name = soup.find_all(
                    'h2',
                    class_="mt-TitleBasic-title mt-TitleBasic-title--xs mt-TitleBasic-title--black")
                address = soup.find('p', class_="mt-CardUser-location").text
                phone_number = soup.find(
                    'span', class_='mt-LeadPhoneCall-linkText mt-LeadPhoneCall-linkText--small').text

                j = 0
                for b in name:
                    if j == 8:
                        real_name = b.text
                        print(b.text)
                    j += 1

                # some constants
                NAME = real_name
                ADDRESS = address
                PHONE_NUMBER = phone_number

                header = ['Name', 'Address', 'Phone Number']
                data = [ADDRESS, PHONE_NUMBER, NAME]
                with open("info.csv", 'a', encoding='UTF8') as csv_numbers:
                    writer = csv.writer(csv_numbers)
                    writer.writerow(data)

                i += 1
                print(i)

                if i == limit:
                    print("it prints...")
                    limit += 35
                    seq += 1
                    load_car_pages(seq, limit, i)
            except Exception as ACX:
                print(f"Bro Exception occurred::{ACX}...")
                # continue


def main():
    # get_car_links()
    load_car_pages(0, 35, 0)


main()

You're asking too many overloaded questions all at once (even though they're somewhat related in your particular context). I'll only answer the one in your title: How to make a web scraper more human-like?
That question is too open-ended to be definitively answered. New methods of bot detection will continue to be developed, as well as ways to bypass them.
That being said, here are a couple of highlights off the top of my head:
Browsers send & receive a lot of metadata, like user agent, headers, cookies, runtime JavaScript, etc. Bare HTTP requests look very different from that.
Browser automation systems behave very differently from humans by default: they don't really use the mouse, they click buttons instantly at their exact centers, etc.
Browser automation detection and detection bypass is a rabbit hole: Can a website detect when you are using Selenium with chromedriver?
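To make the first point a bit more concrete, here is a minimal sketch on the plain-requests side (the URLs are placeholders, not from the question): reuse one Session so cookies persist, send a consistent set of browser-like headers, randomize the delay instead of sleeping a fixed interval, and back off when the site explicitly rate-limits you. None of this addresses JavaScript-based detection; for that you would need a real browser (Selenium, Playwright).
import random
import time

import requests

# Placeholder URL list; substitute the pages you are actually allowed to scrape.
URLS = ["https://example.com/page1", "https://example.com/page2"]

# One persistent session keeps cookies and the connection alive, like a browser tab.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
})

for url in URLS:
    response = session.get(url, timeout=30)
    if response.status_code == 429:
        # The site is explicitly telling you to slow down: back off hard.
        time.sleep(60)
        continue
    # ... parse response.text here ...
    # Randomized, human-ish pause instead of a constant interval.
    time.sleep(random.uniform(4, 12))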

Related

How to write separate functions in separate .py files and execute them from main.py without using classes

I am new to Python and have yet to learn the concept of OOP and classes with Python. I thought I understood functions, but I am facing an issue while calling functions from a different .py file.
The code below shows all my functions described in main.py.
I want to split main.py into two other .py files: data_extraction.py and data_processing.py.
I understand that it can be done using classes, but can we do it without using classes as well?
I divided the code into the two other files, but I am getting an error (please find my attached screenshot).
Please explain what I can do here!
main.py
import pandas as pd
import requests
from bs4 import BeautifulSoup
from configparser import ConfigParser
import logging
import data_extraction

config = ConfigParser()
config.read('config.ini')

logging.basicConfig(filename='logfile.log', level=logging.DEBUG,
                    format='%(asctime)s:%(lineno)d:%(name)s:%(levelname)s:%(message)s')

baseurl = config['configData']['baseurl']
sub_url = config['configData']['sub_url']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    "Upgrade-Insecure-Requests": "1", "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate"
}

r = requests.get(baseurl, headers=headers)
status = r.status_code
soup = BeautifulSoup(r.content, 'html.parser')
model_links = []

all_keys = ['Model', 'Platform', 'Product Family', 'Product Line', '# of CPU Cores',
            '# of Threads', 'Max. Boost Clock', 'Base Clock', 'Total L2 Cache', 'Total L3 Cache',
            'Default TDP', 'Processor Technology for CPU Cores', 'Unlocked for Overclocking', 'CPU Socket',
            'Thermal Solution (PIB)', 'Max. Operating Temperature (Tjmax)', 'Launch Date', '*OS Support']

# function to get the model links in one list from soup object (1st page extraction)
def get_links_in_list():
    for model_list in soup.find_all('td', headers='view-name-table-column'):
        # model_list = model_list.a.text - to get the model names
        model_list = model_list.a.get('href')
        # print(model_list)
        model_list = sub_url + model_list
        # print(model_list)
        one_link = model_list.split(" ")[0]
        model_links.append(one_link)
    return model_links

model_links = get_links_in_list()
logging.debug(model_links)
each_link_data = data_extraction()
print(each_link_data)
# all_link_data = data_processing()
# write_to_csv(all_keys)
data_extraction.py
import requests
from bs4 import BeautifulSoup
from main import baseurl
from main import all_keys

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    "Upgrade-Insecure-Requests": "1", "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate"
}

r = requests.get(baseurl, headers=headers)
status = r.status_code
soup = BeautifulSoup(r.content, 'html.parser')
model_links = []

# function to get data for each link from the website (2nd page extraction)
def data_extraction(model_links):
    each_link_data = []
    try:
        for link in model_links:
            r = requests.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            specification = {}
            for key in all_keys:
                spec = soup.select_one(
                    f'.field__label:-soup-contains("{key}") + .field__item, .field__label:-soup-contains("{key}") + .field__items .field__item')
                # print(spec)
                if spec is None:
                    specification[key] = ''
                    if key == 'Model':
                        specification[key] = [i.text for i in soup.select_one('.page-title')]
                        specification[key] = specification[key][0:1:1]
                        # print(specification[key])
                else:
                    if key == '*OS Support':
                        specification[key] = [i.text for i in spec.parent.select('.field__item')]
                    else:
                        specification[key] = spec.text
            specification['link'] = link
            each_link_data.append(specification)
    except:
        print('Error occurred')
    return each_link_data
    # print(each_link_data)
data_processing.py
# function for data processing : converting the each link object into dataframe
def data_processing():
    all_link_data = []
    for each_linkdata_obj in each_link_data:
        # make the nested dictionary to normal dict
        norm_dict = dict()
        for key in each_linkdata_obj:
            if isinstance(each_linkdata_obj[key], list):
                norm_dict[key] = ','.join(each_linkdata_obj[key])
            else:
                norm_dict[key] = each_linkdata_obj[key]
        all_link_data.append(norm_dict)
    return all_link_data
    # print(all_link_data)

all_link_data = data_processing()

# function to write dataframe data into csv
def write_to_csv(all_keys):
    all_link_df = pd.DataFrame.from_dict(all_link_data)
    all_link_df2 = all_link_df.drop_duplicates()
    all_link_df3 = all_link_df2.reset_index()
    # print(all_link_df3)
    all_keys = all_keys + ['link']
    all_link_df4 = all_link_df3[all_keys]
    # print(all_link_df4)
    all_link_df4.to_csv('final_data.csv')

write_to_csv(all_keys)
Move the existing functions (e.g. write_to_csv) to a different file, for example utility_functions.py. Import it in main.py using from utility_functions import write_to_csv. Now you can use the function write_to_csv in main.py as
write_to_csv(all_keys)
Edit
In the main.py file:
use from data_extraction import data_extraction instead of import data_extraction
In the data_extraction.py file:
Remove the lines
from main import baseurl
from main import all_keys
This will throw an undefined-variable error; you can fix it by passing those variables in the function call instead.
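A minimal sketch of what that change looks like, assuming the same names as in the question (model_links, all_keys and headers are built in main.py and passed in as arguments rather than imported back from main):
# data_extraction.py (sketch): accept the shared values as parameters instead of importing them from main
import requests
from bs4 import BeautifulSoup


def data_extraction(model_links, all_keys, headers):
    each_link_data = []
    for link in model_links:
        r = requests.get(link, headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        # ... build the specification dict for this link exactly as before ...
        specification = {'link': link}
        each_link_data.append(specification)
    return each_link_data


# main.py (sketch): import the function, not the module, and pass the values in
# from data_extraction import data_extraction
# each_link_data = data_extraction(model_links, all_keys, headers)
This avoids the circular import (main importing data_extraction while data_extraction imports from main), which is usually what triggers the error you are seeing.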

Python Web Scraping / Beautiful Soup, with list of keywords at the end of URL

I'm trying to build a web scraper to get the reviews of wines off Vivino.com. I have a large list of wines and wanted it to search
url = ("https://www.vivino.com/search/wines?q=")
and then cycle through the list, scraping the rating text ("4.5 - 203 reviews"), the name of the wine, and the attached link to the page.
I found about 20 lines of code at https://www.kashifaziz.me/web-scraping-python-beautifulsoup.html/ for building a web scraper. I was trying to combine it with:
url = ("https://www.vivino.com/search/wines?q=")
#list having the keywords (made by splitting input with space as its delimiter)
keyword = input().split()
#go through the keywords
for key in keywords :
#everything else is same logic
r = requests.get(url + key)
print("URL :", url+key)
if 'The specified profile could not be found.' in r.text:
print("This is available")
else :
print('\nSorry that one is taken')
Also, where would I include the list of keywords?
I'd love any help with this! I'm trying to get better at Python, but I'm not sure I'm at this level yet, haha.
Thank you.
This script traverses all pages for the selected keyword and collects the title, price, rating, review count, and link for each wine:
import re
import requests
from time import sleep
from bs4 import BeautifulSoup

url = 'https://www.vivino.com/search/wines?q={kw}&start={page}'
prices_url = 'https://www.vivino.com/prices'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}


def get_wines(kw):
    with requests.session() as s:
        page = 1
        while True:
            soup = BeautifulSoup(s.get(url.format(kw=kw, page=page), headers=headers).content, 'html.parser')
            if not soup.select('.default-wine-card'):
                break
            params = {'vintages[]': [wc['data-vintage'] for wc in soup.select('.default-wine-card')]}
            prices_js = s.get(prices_url, params=params, headers={
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
                'X-Requested-With': 'XMLHttpRequest',
                'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01'
            }).text
            wine_prices = dict(re.findall(r"\$\('\.vintage-price-id-(\d+)'\)\.find\( '\.wine-price-value' \)\.text\( '(.*?)' \);", prices_js))
            for wine_card in soup.select('.default-wine-card'):
                title = wine_card.select_one('.header-smaller').get_text(strip=True, separator=' ')
                price = wine_prices.get(wine_card['data-vintage'], '-')
                average = wine_card.select_one('.average__number')
                average = average.get_text(strip=True) if average else '-'
                ratings = wine_card.select_one('.text-micro')
                ratings = ratings.get_text(strip=True) if ratings else '-'
                link = 'https://www.vivino.com' + wine_card.a['href']
                yield title, price, average, ratings, link
            sleep(3)
            page += 1


kw = 'angel'
for title, price, average, ratings, link in get_wines(kw):
    print(title)
    print(price)
    print(average + ' / ' + ratings)
    print(link)
    print('-' * 80)
Prints:
Angél ica Zapata Malbec Alta
-
4,4 / 61369 ratings
https://www.vivino.com/wines/1469874
--------------------------------------------------------------------------------
Château d'Esclans Whispering Angel Rosé
16,66
4,1 / 38949 ratings
https://www.vivino.com/wines/1473981
--------------------------------------------------------------------------------
Angél ica Zapata Cabernet Sauvignon Alta
-
4,3 / 27699 ratings
https://www.vivino.com/wines/1471376
--------------------------------------------------------------------------------
... and so on.
EDIT: To select only specific wines, you can put the keywords inside a list and then check each wine in a loop:
import re
import requests
from time import sleep
from bs4 import BeautifulSoup

url = 'https://www.vivino.com/search/wines?q={kw}&start={page}'
prices_url = 'https://www.vivino.com/prices'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}


def get_wines(kw):
    with requests.session() as s:
        page = 1
        while True:
            soup = BeautifulSoup(s.get(url.format(kw=kw, page=page), headers=headers).content, 'html.parser')
            if not soup.select('.default-wine-card'):
                break
            params = {'vintages[]': [wc['data-vintage'] for wc in soup.select('.default-wine-card')]}
            prices_js = s.get(prices_url, params=params, headers={
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
                'X-Requested-With': 'XMLHttpRequest',
                'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01'
            }).text
            wine_prices = dict(re.findall(r"\$\('\.vintage-price-id-(\d+)'\)\.find\( '\.wine-price-value' \)\.text\( '(.*?)' \);", prices_js))
            no = 1
            for no, wine_card in enumerate(soup.select('.default-wine-card'), 1):
                title = wine_card.select_one('.header-smaller').get_text(strip=True, separator=' ')
                price = wine_prices.get(wine_card['data-vintage'], '-')
                average = wine_card.select_one('.average__number')
                average = average.get_text(strip=True) if average else '-'
                ratings = wine_card.select_one('.text-micro')
                ratings = ratings.get_text(strip=True) if ratings else '-'
                link = 'https://www.vivino.com' + wine_card.a['href']
                yield title, price, average, ratings, link
            # if no < 20:
            #     break
            # sleep(3)
            page += 1


wines = ['10 SPAN VINEYARDS CABERNET SAUVIGNON CENTRAL COAST',
         '10 SPAN VINEYARDS CHARDONNAY CENTRAL COAST']

for wine in wines:
    for title, price, average, ratings, link in get_wines(wine):
        print(title)
        print(price)
        print(average + ' / ' + ratings)
        print(link)
        print('-' * 80)
Prints:
10 Span Vineyards Cabernet Sauvignon
-
3,7 / 557 ratings
https://www.vivino.com/wines/4535453
--------------------------------------------------------------------------------
10 Span Vineyards Chardonnay
-
3,7 / 150 ratings
https://www.vivino.com/wines/5815131
--------------------------------------------------------------------------------
import requests

# list having the keywords (made by splitting input with space as its delimiter)
keywords = input().split()

# go through the keywords
for key in keywords:
    url = "https://www.vivino.com/search/wines?q={}".format(key)
    # everything else is same logic
    r = requests.get(url)
    print("URL :", url)
    if 'The specified profile could not be found.' in r.text:
        print("This is available")
    else:
        print('\nSorry that one is taken')
For the list of keywords, you can use a text file with one keyword per line.
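For example, a short sketch of that idea, assuming a file named keywords.txt with one keyword per line (the file name is only an illustration):
import requests

# keywords.txt is assumed to contain one keyword per line, e.g.
#   angel
#   malbec
with open("keywords.txt", "r", encoding="utf-8") as f:
    keywords = [line.strip() for line in f if line.strip()]

for key in keywords:
    url = "https://www.vivino.com/search/wines?q={}".format(key)
    r = requests.get(url)
    print("URL :", url, "status:", r.status_code)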

Python script does not work but gives no error message, any help?

I am new to programming/scripting in general and am just starting to understand the basics of how Python 3.8 works. Anyway, here is a script I have been working on that randomly generates Instagram accounts.
Please excuse me if this seems like a stupid question, but I am new to this after all :(. When I run the script below I receive no output as well as no error message, so I don't know what the heck I'm doing wrong!
import random
import string
from threading import Thread
from discord_webhook import DiscordWebhook
from discord_webhook import DiscordWebhook, DiscordEmbed

webhookbro = "webhook"  # Discord WebHook
password = "MadeWithLoveByOnurCreed"
threads_ammount = 50
max_proxies = 0

proxies = [line for line in list(set(open("Proxies.txt", encoding="UTF-8", errors="ignore").read().splitlines()))]
for x in proxies:
    max_proxies += 1


def run():
    global max_proxies
    while True:
        proxy = proxies[random.randint(0, max_proxies - 1)]
        asd = ('').join(random.choices(string.ascii_letters + string.digits, k=10))
        with requests.Session() as (c):
            email = str(asd + "#gmail.com")
            password = "MadeWithLoveByOnurCreed"
            username = str("Jacob" + asd)
            name = str("Jacon" + "P")
            asd = random
            data = {
                'email': email,
                'password': password,
                'username': username,
                'first_name': name,
                'client_id': 'W6mHTAAEAAHsVu2N0wGEChTQpTfn',
                'seamless_login_enabled': '1',
                'gdpr_s': '%5B0%2C2%2C0%2Cnull%5D',
                'tos_version': 'eu',
                'opt_into_one_tap': 'false'
            }
            headers = {
                'accept': "*/*",
                'accept-encoding': "gzip, deflate, br",
                'accept-language': "en-US,en;q=0.8",
                'content-length': "241",
                'content-type': 'application/x-www-form-urlencoded',
                'origin': "https://www.instagram.com",
                'referer': "https://www.instagram.com/",
                'user-agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36",
                'x-csrftoken': "95RtiLDyX9J6AcVz9jtUIySbwf75WhvG",
                'x-instagram-ajax': "c7e210fa2eb7",
                'x-requested-with': "XMLHttpRequest",
                'Cache-Control': "no-cache"
            }
            try:
                req = c.post("https://www.instagram.com/accounts/web_create_ajax/", data=data, headers=headers, proxies={'https': 'https://' + proxy, 'http': 'http://' + proxy}).json()
                try:
                    created = req['account_created']
                    if created is True:
                        userid = req['user_id']
                        print(f"Successfully Created A Account | Username : {username} | Password : {password}")
                        webhook = DiscordWebhook(url=webhookurlbro, username="Testing")
                        embed = DiscordEmbed(title='Testing', color=242424)
                        embed.set_footer(text='Testing')
                        embed.set_timestamp()
                        embed.add_embed_field(name='Username', value=f'{username}')
                        embed.add_embed_field(name='Password', value=f'{password}')
                        embed.add_embed_field(name='Proxy', value=f'{proxy}')
                        webhook.add_embed(embed)
                        webhook.execute()
                        save = open("Accounts.txt", "a")
                        save.write(f"{username}:{password}\n")
                    else:
                        dk = None
                except:
                    dk = None
            except requests.ConnectionError:
                dk = None


threads = []
for i in range(0, threads_ammount):
    threads.append(Thread(target=run))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
Sorry, I cannot comment yet, so I have to write this as an answer. How about removing the try/except blocks? The inner one in particular catches ALL exceptions, so you never see any error reported. (It is generally good advice to always specify the exception(s) you want to catch.)
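As a minimal, self-contained sketch of that advice (the URL is just a placeholder): catch only the request-related exceptions and let everything else, like a missing import or an undefined variable, crash loudly so you actually see the error.
import requests

session = requests.Session()
try:
    resp = session.post("https://httpbin.org/post", data={"example": "1"}, timeout=30)
    resp.raise_for_status()       # turns HTTP 4xx/5xx into an exception instead of silence
    payload = resp.json()         # fails if the body is not JSON
    print(payload)
except requests.exceptions.RequestException as exc:
    print("request failed:", exc)  # network / HTTP problems only
except ValueError as exc:
    print("unexpected response body:", exc)
# anything else (NameError, KeyError, ...) is NOT swallowed and will be reported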

Why does my Instagram bot comment 'None' sometimes?

I have created a bot for Instagram and it works well when it likes an image, but when it comments, the first few in a set of images work and then it starts commenting 'None'. I cannot for the life of me find the error, and I assume it is a problem with commenting the same thing twice?
Here is the code. I know it is a massive chunk, but it will work out of the box and replicates the problem exactly.
import requests
import json
import random


def gen_comment():
    comment_list = ["great job!", "lb", "nice shot, take a look at mine!", "perfect!", "#beautiful", "nice shot!", "awesome!", "great photo!", "nice", ":)", "love this", "great shot! take a look at my account! ;)", "WOW!", "lol", "WoW!", 'lovely!', 'amazing']
    comment = random.choice(comment_list)
    return comment


def comment(media_id):
    """ Send http request to comment """
    comment_text = gen_comment()
    comment_post = {'comment_text': comment_text}
    url_comment = 'https://www.instagram.com/web/comments/%s/add/' % (media_id)
    try:
        comment = s.post(url_comment, data=comment_post)
        if comment.status_code == 200:
            return comment_text
        else:
            print comment.status_code
    except:
        print("error on comment")


def get_media_id_by_tag(tag):
    """ Get media ID set, by your hashtag """
    url_tag = '%s%s%s' % ('https://www.instagram.com/explore/tags/', tag, '/')
    try:
        r = s.get(url_tag)
        text = r.text
        finder_text_start = ('<script type="text/javascript">'
                             'window._sharedData = ')
        finder_text_start_len = len(finder_text_start) - 1
        finder_text_end = ';</script>'
        all_data_start = text.find(finder_text_start)
        all_data_end = text.find(finder_text_end, all_data_start + 1)
        json_str = text[(all_data_start + finder_text_start_len + 1) \
                        : all_data_end]
        all_data = json.loads(json_str)
        return list(all_data['entry_data']['TagPage'][0]['tag']['media']['nodes'])
    except Exception as e:
        print("Except on get_media!")
        print(e)


username = raw_input("Username: ")
password = raw_input("Password: ")
tag = raw_input("Tag: ")

user_agent = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36")

s = requests.Session()
s.cookies.update({'sessionid': '', 'mid': '', 'ig_pr': '1',
                  'ig_vw': '1920', 'csrftoken': '',
                  's_network': '', 'ds_user_id': ''})

login_post = {'username': username,
              'password': password}

s.headers.update({'Accept-Encoding': 'gzip, deflate',
                  'Accept-Language': 'en',
                  'Connection': 'keep-alive',
                  'Content-Length': '0',
                  'Host': 'www.instagram.com',
                  'Origin': 'https://www.instagram.com',
                  'Referer': 'https://www.instagram.com/',
                  'User-Agent': user_agent,
                  'X-Instagram-AJAX': '1',
                  'X-Requested-With': 'XMLHttpRequest'})

r = s.get('https://www.instagram.com/')
s.headers.update({'X-CSRFToken': r.cookies['csrftoken']})

login = s.post('https://www.instagram.com/accounts/login/ajax/', data=login_post,
               allow_redirects=True)
s.headers.update({'X-CSRFToken': login.cookies['csrftoken']})
csrftoken = login.cookies['csrftoken']

media = get_media_id_by_tag(tag)
for i in media:
    c = comment(i['id'])
    print 'commented {} on media {}'.format(c, i['id'])
EDIT 1: I have edited my code and found that I am getting a 400 error from the comment function, which stops the function from returning the desired text. I don't know why I am getting this 400 error, but I will research it. Any help is appreciated.
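Note on the 'None' comments: comment() only returns the text on a 200 response, so a 400 makes it fall through and return None, and that None is what then gets printed as the comment. A small debugging sketch (reusing s and gen_comment() from the code above, so it is not standalone) that prints the response body so the reason for the 400, e.g. a stale CSRF token, spam filter, or rate limit, becomes visible:
def comment_debug(media_id):
    """ Sketch: like comment(), but shows the server's error body instead of silently returning None """
    comment_text = gen_comment()
    url_comment = 'https://www.instagram.com/web/comments/%s/add/' % (media_id)
    response = s.post(url_comment, data={'comment_text': comment_text})
    if response.status_code == 200:
        return comment_text
    print(response.status_code)
    print(response.text)  # the error body usually names the reason (csrf token, spam filter, rate limit, ...)
    return None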

Sending multiple POST data in Python

I have a Python script that sends a POST request to a website, reads the response and filters it. For the POST data I used ('number', '11111') and it works perfectly. However, I want to create a txt file that contains 100 different numbers (1111,2222,3333,4444,...) and then send the POST request for each of them. Can you help me with how to do this in Python?
import urllib
from bs4 import BeautifulSoup

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://mahmutesat.com/python.aspx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://mahmutesat.com/python.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}


class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'


myopener = MyOpener()
url = 'http://mahmutesat.com/python.aspx'

# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)

# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
viewstategenerator = soup.select("#__VIEWSTATEGENERATOR")[0]['value']

formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategenerator),
    ('number', '11111'),
    ('Button', 'Sorgula'),
)

encodedFields = urllib.urlencode(formData)

# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f.read())

name = soup.findAll('input', {'id': 'name_field'})
for eachname in name:
    print eachname['value']
If your file has data:
"sample.txt"
1111,2222,3333,4444,5555,6666,7777,8888,......(and so on)
To read the file contents, you can use the file open operation:
import itertools

# open the file for read
with open("sample.txt", "r") as fp:
    values = fp.readlines()

# Get the values split with ","
data = [map(int, line.split(",")) for line in values]
numbers = list(itertools.chain(*data))  # concatenate the values if the file has many lines
Now, use it as:
for number in numbers:
    formData = (
        ('__EVENTVALIDATION', eventvalidation),
        ('__VIEWSTATE', viewstate),
        ('__VIEWSTATEGENERATOR', viewstategenerator),
        ('number', str(number)),  # Here you use the number obtained
        ('Button', 'Sorgula'),
    )
    encodedFields = urllib.urlencode(formData)
    # second HTTP request with form data
    f = myopener.open(url, encodedFields)
    soup = BeautifulSoup(f.read())
    name = soup.findAll('input', {'id': 'name_field'})
    for eachname in name:
        print eachname['value']
1 - Here is an example of how to create a file:
f = open('test.txt','w')
This will open the test.txt file for writing ('w') and create it if it does not exist yet. If it already has data, it will be erased; if you want to append instead, write f = open('test.txt','a'). Note that this happens in your current working directory; if you want the file in a specific directory, include the full directory path with the file name, for example:
f = open('C:\\Python\\test.txt','w')
2 - Then write/append to this file the data you want, for example:
for i in range(1,101):
    f.write(str(i*1111)+'\n')
This will write 100 numbers as strings, from 1111 to 111100.
3 - You should always close the file at the end:
f.close()
4 - Now if you want to read from this file 'test.txt':
f = open('C:\\Python\\test.txt','r')
for i in f:
    print i,
f.close()
This is as simple as it can be. You can read more about file I/O in Python at:
https://docs.python.org/2.7/tutorial/inputoutput.html#reading-and-writing-files
Make sure you select the right Python version for you in these docs.
Using a dictionary, you can deal with the multiple requests very easily:
import requests
values = {
    '__EVENTVALIDATION': event_validation,
    '__LASTFOCUS': '',
    '__VIEWSTATE': view_state,
    '__VIEWSTATEGENERATOR': '6264FB8D',
    'ctl00$ContentPlaceHolder1$ButGet': 'Get Report',
    'ctl00$ContentPlaceHolder1$Ddl_Circles': 'All Circles',
    'ctl00$ContentPlaceHolder1$Ddl_Divisions': '-- Select --',
    'ctl00$ContentPlaceHolder1$TxtTin': tin_num,
    'ctl00$ContentPlaceHolder1$dropact': 'all'
}

headers_1 = {
    'Origin': 'https://www.apct.gov.in',
    'User-Agent': user_agent,
    'Cookie': cookie_1,
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': url_1,
    'Content-Type': 'application/x-www-form-urlencoded',
    'Upgrade-Insecure-Requests': '1'
}

try:
    req = requests.post(url_1, data=values, headers=headers_1)