I have Python code that sends a POST request to a website, reads the response, and filters it. For the POST data I used ('number', '11111') and it works perfectly. However, I want to create a txt file that contains 100 different numbers, such as 1111,2222,3333,4444..., and then send a POST request for each of them. How can I do this in Python?
import urllib
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://mahmutesat.com/python.aspx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://mahmutesat.com/python.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'http://mahmutesat.com/python.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve the vital ASP.NET form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
viewstategenerator = soup.select("#__VIEWSTATEGENERATOR")[0]['value']
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategenerator),
    ('number', '11111'),
    ('Button', 'Sorgula'),
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f.read())
name = soup.findAll('input', {'id': 'name_field'})
for eachname in name:
    print eachname['value']
If your file has data:
"sample.txt"
1111,2222,3333,4444,5555,6666,7777,8888,......(and so on)
To read the file contents, you can use the file open operation:
import itertools

# open the file for reading
with open("sample.txt", "r") as fp:
    values = fp.readlines()

# split each line on "," and convert the pieces to integers
data = [map(int, line.split(",")) for line in values]
numbers = list(itertools.chain(*data))  # flatten, in case the file has several lines
Now, use it as:
for number in numbers:
    formData = (
        ('__EVENTVALIDATION', eventvalidation),
        ('__VIEWSTATE', viewstate),
        ('__VIEWSTATEGENERATOR', viewstategenerator),
        ('number', str(number)),  # here you use the number obtained
        ('Button', 'Sorgula'),
    )
    encodedFields = urllib.urlencode(formData)
    # second HTTP request with form data
    f = myopener.open(url, encodedFields)
    soup = BeautifulSoup(f.read())
    name = soup.findAll('input', {'id': 'name_field'})
    for eachname in name:
        print eachname['value']
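If you also want to keep what comes back for each number, here is a minimal sketch built on the loop above (results.txt is just an assumed output file name, not something from your code):
with open("results.txt", "a") as out:
    for number in numbers:
        encodedFields = urllib.urlencode((
            ('__EVENTVALIDATION', eventvalidation),
            ('__VIEWSTATE', viewstate),
            ('__VIEWSTATEGENERATOR', viewstategenerator),
            ('number', str(number)),
            ('Button', 'Sorgula'),
        ))
        soup = BeautifulSoup(myopener.open(url, encodedFields).read())
        # append one "number,name" line per matching input field
        for eachname in soup.findAll('input', {'id': 'name_field'}):
            out.write("%s,%s\n" % (number, eachname['value']))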
1 - Here is an example of how to create a file:
f = open('test.txt','w')
This opens the test.txt file for writing ('w'), creating it if it does not exist yet; if it already has data, that data will be erased (if you want to append instead, write f = open('test.txt','a')). Note that this happens in your current working directory; if you want the file in a specific directory, include the full directory path with the file name, for example:
f = open('C:\\Python\\test.txt','w')
2 - Then write/append to this file the data you want, for example:
for i in range(1, 101):
    f.write(str(i * 1111) + '\n')
This writes the 100 numbers 1111, 2222, ..., 111100 as strings, one per line.
3 - You should always close the file at the end:
f.close()
4 - Now if you want to read from this file 'test.txt':
f = open('C:\\Python\\test.txt','r')
for i in f:
    print i,
f.close()
This is as simple as it gets.
You can read more about file I/O in Python here:
https://docs.python.org/2.7/tutorial/inputoutput.html#reading-and-writing-files
Make sure you select the Python version you are using in these docs.
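Putting steps 1-4 together, a minimal sketch in the same style (using with so the file is closed automatically; test.txt is the same assumed file name as above):
with open('test.txt', 'w') as f:
    # write the 100 numbers 1111, 2222, ..., 111100, one per line
    for i in range(1, 101):
        f.write(str(i * 1111) + '\n')

with open('test.txt', 'r') as f:
    # read them back, one line at a time
    for line in f:
        print line.strip()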
Using a dictionary, you can deal with multiple requests very easily.
import requests
values = {
    '__EVENTVALIDATION': event_validation,
    '__LASTFOCUS': '',
    '__VIEWSTATE': view_state,
    '__VIEWSTATEGENERATOR': '6264FB8D',
    'ctl00$ContentPlaceHolder1$ButGet': 'Get Report',
    'ctl00$ContentPlaceHolder1$Ddl_Circles': 'All Circles',
    'ctl00$ContentPlaceHolder1$Ddl_Divisions': '-- Select --',
    'ctl00$ContentPlaceHolder1$TxtTin': tin_num,
    'ctl00$ContentPlaceHolder1$dropact': 'all'
}
headers_1 = {
    'Origin': 'https://www.apct.gov.in',
    'User-Agent': user_agent,
    'Cookie': cookie_1,
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': url_1,
    'Content-Type': 'application/x-www-form-urlencoded',
    'Upgrade-Insecure-Requests': '1'
}
try:
    req = requests.post(url_1, data=values, headers=headers_1)
except requests.RequestException as exc:
    print('Request failed:', exc)
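To run this for many numbers, you can rebuild (or just update) the same dictionary inside a loop. A small sketch, where tin_nums is an assumed list of strings read from your input file:
for tin_num in tin_nums:
    # swap in the current number and resend the same form
    values['ctl00$ContentPlaceHolder1$TxtTin'] = tin_num
    req = requests.post(url_1, data=values, headers=headers_1)
    print(tin_num, req.status_code)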
I'm trying to simulate sending a tweet containing an image using the requests library. Apparently, to do it properly you need to go through four stages:
The first stage: INIT
The second stage: APPEND
The third stage: FINALIZE
The fourth stage: publishing the tweet.
According to the experiments I have done, I am facing the problem in the second stage, the one that uploads the content of the image. It seems that I am doing it wrong.
I hope to find the right way to do this.
s = requests.Session()
s.headers.update({
'Cookie':'guest_id_marketing=v1%3A167542157148611150; guest_id_ads=v1%3A167542157148611150; personalization_id="v1_RXdbXKB8hgRH0Li/icKGWQ=="; guest_id=v1%3A167542157148611150; gt=1621461709524275202; ct0=f2020601bfa05bab3846cbb2cb6fcc8de5414370c7a2cb3de579fea7e2a344b25771a8989e7bf0a75c82b8f54061c54e95a9a7e8f06eaf995dffb20b1018f4ec1333fe1416a93c1968a44eae9c7cdddd; external_referer=padhuUp37zjgzgv1mFWxJ12Ozwit7owX|0|8e8t2xd8A2w%3D; _ga=GA1.2.180578843.1675421587; _gid=GA1.2.1928677010.1675421587; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoHaWQiJTNlMDdhMDMxNTVmNWI0ZDNmZjAwMWEw%250ANmJjOTFmYzc0Og9jcmVhdGVkX2F0bCsIWO7oFoYBOgxjc3JmX2lkIiUwMDIy%250AYmY4NjNjZWZlMGE0NTZlMTM2ZTYwZTAyYjYyYw%253D%253D--59831749d79402bc50de0786d3c9133b80d9ceca; kdt=Y7chgKmPh7qohOCcrIpIGjefeuwn4xa3tzCO41hT; twid="u=614982902"; auth_token=a43fbd826c64630a88399e3f7d80ae2a71e05f39; att=1-vBhpupCEykW4h0LvapSOdJbpJopEv3saTLiQqkFb',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0',
'Accept':'*/*',
'Accept-Language':'en-US,en;q=0.5',
'Accept-Encoding':'gzip, deflate',
'Origin':'https://twitter.com',
'Referer':'https://twitter.com/',
'Sec-Fetch-Dest':'empty',
'Sec-Fetch-Mode':'cors',
'Sec-Fetch-Site':'same-site',
'Content-Length':'0',
'Te':'trailers',
})
file_name = '1.jpg'
#INIT
total_bytes = os.path.getsize(file_name)
url= f'https://upload.twitter.com/i/media/upload.json?command=INIT&total_bytes={total_bytes}&media_type=image/jpeg&media_category=tweet_image'
json = {'command':'INIT','total_bytes':total_bytes,'media_type':'image/jpeg','media_category':'tweet_image'}
req = s.post(url, json=json, timeout=20, allow_redirects=True)
print(req.json())
media_id = req.json()['media_id']
#APPEND
file = open(file_name, 'rb')
file_data = file.read()
url = f'https://upload.twitter.com/i/media/upload.json?command=APPEND&media_id={media_id}&segment_index=0'
data = {'media':file_data}
req = s.post(url, data=data,allow_redirects=True)
print(req.text)
print(req.status_code)
#FINALIZE
url = f'https://upload.twitter.com/i/media/upload.json?command=FINALIZE&media_id={media_id}'
json = {'command':'FINALIZE','media_id':media_id,'total_bytes':total_bytes}
req = s.post(url, data=json, timeout=20, allow_redirects=True)
print(req.json())
The error:
{'request': '/i/media/upload.json', 'error': 'Segments do not add up to provided total file size.'}
I'm trying to simulate the request the browser makes by reusing its cookies.
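As a hedged guess rather than a confirmed fix: passing the raw bytes through data= makes requests form-encode them, so the uploaded segment may no longer match total_bytes. Here is a sketch of the APPEND step sent as a multipart file upload instead (same s, media_id and file_name as in your code):
with open(file_name, 'rb') as fh:
    file_data = fh.read()
url = f'https://upload.twitter.com/i/media/upload.json?command=APPEND&media_id={media_id}&segment_index=0'
# send the image bytes as a multipart part named 'media'
# note: a fixed 'Content-Length': '0' session header would conflict with this upload
req = s.post(url, files={'media': file_data}, allow_redirects=True)
print(req.status_code)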
I have a web scraping script written in Python, and when I run it against a website it blocks me and says "you're getting the page very fast. you might be a bot".
I tried adding time.sleep() to delay the code, but it still gets blocked. Is there any way to make this code a little slower?
I'm not sure why it says that. Isn't my script doing the same thing as viewing the page in a browser? What does a browser load that keeps it from being labelled as a bot, while my script is?
from bs4 import BeautifulSoup
import re
import requests
import time
import sys
import csv
FIXED_WEB = "web.net"
def load_car_pages(seq, limit, i):
    time.sleep(10)
    html_web = requests.get(
        f"web.net/homepage",
        headers={
            'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            'Accept-Language': "en-US,en;q=0.5",
            'Accept-Encoding': "gzip, deflate",
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Te': 'trailers'
        }).text
    time.sleep(10)
    sup_me_patate = BeautifulSoup(html_web, 'lxml')
    headers = sup_me_patate.find_all('div', class_='sui-AtomCard-info')  # find headers
    print(f"{headers}")
    for a in headers:
        string = str(a)
        href_pos = [m.start() for m in re.finditer('href=', string)]
        for pos in href_pos:
            slicing = string[pos + 6: string.find('"', pos + 6)]
            print(f"For Link: {slicing}")
            web_link = FIXED_WEB + slicing
            print(f"LINK: {web_link}")
            # limit = 25
            # i = 0
            time.sleep(10)
            try:
                car_web = requests.get(web_link, headers={
                    'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
                    'Origin': FIXED_WEB,
                    "Access-Control-Request-Method": "GET",
                    'Accept-Language': "en-US,en;q=0.5",
                    'Accept-Encoding': "gzip, deflate",
                    'Request-Domain': 'web.net',
                    'Site': 'car',
                    'Referer': web_link,
                    "Sec-Fetch-Dest": "empty",
                    "Sec-Fetch-Mode": "cors",
                    "Sec-Fetch-Site": "same-origin",
                    "Te": "trailers",
                    'Connection': 'close'}).text
                soup = BeautifulSoup(car_web, "lxml")  # parse the downloaded page, not the URL string
                # with open(soup.title.string + ".html", 'w') as coolhtml:
                #     string = str(soup)
                #     coolhtml.write(string)
                # sys.exit(0)
                name = soup.find_all('h2',
                                     class_="mt-TitleBasic-title mt-TitleBasic-title--xs mt-TitleBasic-title--black")
                address = soup.find('p', class_="mt-CardUser-location").text
                phone_number = soup.find('span', class_='mt-LeadPhoneCall-linkText mt-LeadPhoneCall-linkText--small')\
                    .text
                j = 0
                for b in name:
                    if j == 8:
                        real_name = b.text
                        print(b.text)
                    j += 1
                # some constants
                NAME = real_name
                ADDRESS = address
                PHONE_NUMBER = phone_number
                header = ['Name', 'Address', 'Phone Number']
                data = [ADDRESS, PHONE_NUMBER, NAME]
                with open("info.csv", 'a', encoding='UTF8') as csv_numbers:
                    writer = csv.writer(csv_numbers)
                    writer.writerow(data)
                i += 1
                print(i)
                if i == limit:
                    print("it prints...")
                    limit += 35
                    seq += 1
                    load_car_pages(seq, limit, i)
            except Exception as ACX:
                print(f"Bro Exception occurred::{ACX}...")
                # continue


def main():
    # get_car_links()
    load_car_pages(0, 35, 0)


main()
You're asking too many overloaded questions all at once (even though they're somewhat related in your particular context). I'll only answer the one in your title: How to make a web scraper more human-like?
That question is too open-ended to be definitively answered. New methods of bot detection will continue to be developed, as well as ways to bypass them.
That being said, here are a couple of highlights off the top of my head:
Browsers send & receive a lot of metadata, like user agent, headers, cookies, runtime JavaScript, etc. Bare HTTP requests look very different from that (see the sketch after these points).
Browser automation systems behave very differently from humans by default: they don't really use the mouse, they click buttons instantly at their exact centers, etc.
Browser automation detection and detection bypass is a rabbit hole: Can a website detect when you are using Selenium with chromedriver?
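To illustrate the first point only, here is a minimal sketch, not a guaranteed bypass: reuse one requests.Session so cookies persist, send browser-like headers, and space requests out with a randomized delay instead of a fixed one.
import random
import time

import requests

session = requests.Session()
session.headers.update({
    # headers copied from a real browser tend to look less bare
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
})

def polite_get(url):
    # randomized pause so the request timing has no fixed pattern
    time.sleep(random.uniform(5, 15))
    return session.get(url, timeout=30)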
I am new to Python and have yet to learn the concepts of OOP and classes in Python. I thought I understood functions, but I am facing an issue while calling functions from a different .py file.
The code below shows all my functions, described in main.py.
I want to split main.py into two other .py files: data_extraction.py and data_processing.py.
I understand that this can be done using classes, but can we do it without using classes as well?
I divided the code into two other files, but I am getting an error (please find my attached screenshot).
Please explain what I can do here!
main.py
import pandas as pd
import requests
from bs4 import BeautifulSoup
from configparser import ConfigParser
import logging
import data_extraction
config = ConfigParser()
config.read('config.ini')
logging.basicConfig(filename='logfile.log', level=logging.DEBUG,
format='%(asctime)s:%(lineno)d:%(name)s:%(levelname)s:%(message)s')
baseurl = config['configData']['baseurl']
sub_url = config['configData']['sub_url']
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
"Upgrade-Insecure-Requests": "1", "DNT": "1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate"
}
r = requests.get(baseurl, headers=headers)
status = r.status_code
soup = BeautifulSoup(r.content, 'html.parser')
model_links = []
all_keys = ['Model', 'Platform', 'Product Family', 'Product Line', '# of CPU Cores',
'# of Threads', 'Max. Boost Clock', 'Base Clock', 'Total L2 Cache', 'Total L3 Cache',
'Default TDP', 'Processor Technology for CPU Cores', 'Unlocked for Overclocking', 'CPU Socket',
'Thermal Solution (PIB)', 'Max. Operating Temperature (Tjmax)', 'Launch Date', '*OS Support']
# function to get the model links in one list from soup object(1st page extraction)
def get_links_in_list():
    for model_list in soup.find_all('td', headers='view-name-table-column'):
        # model_list = model_list.a.text - to get the model names
        model_list = model_list.a.get('href')
        # print(model_list)
        model_list = sub_url + model_list
        # print(model_list)
        one_link = model_list.split(" ")[0]
        model_links.append(one_link)
    return model_links
model_links = get_links_in_list()
logging.debug(model_links)
each_link_data = data_extraction()
print(each_link_data)
#all_link_data = data_processing()
#write_to_csv(all_keys)
data_extraction.py
import requests
from bs4 import BeautifulSoup
from main import baseurl
from main import all_keys
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
"Upgrade-Insecure-Requests": "1", "DNT": "1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate"
}
r = requests.get(baseurl, headers=headers)
status = r.status_code
soup = BeautifulSoup(r.content, 'html.parser')
model_links = []
# function to get data for each link from the website(2nd page extraction)
def data_extraction(model_links):
    each_link_data = []
    try:
        for link in model_links:
            r = requests.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            specification = {}
            for key in all_keys:
                spec = soup.select_one(
                    f'.field__label:-soup-contains("{key}") + .field__item, .field__label:-soup-contains("{key}") + .field__items .field__item')
                # print(spec)
                if spec is None:
                    specification[key] = ''
                    if key == 'Model':
                        specification[key] = [i.text for i in soup.select_one('.page-title')]
                        specification[key] = specification[key][0:1:1]
                        # print(specification[key])
                else:
                    if key == '*OS Support':
                        specification[key] = [i.text for i in spec.parent.select('.field__item')]
                    else:
                        specification[key] = spec.text
            specification['link'] = link
            each_link_data.append(specification)
    except:
        print('Error occurred')
    return each_link_data
# print(each_link_data)
data_processing.py
# function for data processing : converting the each link object into dataframe
def data_processing():
    all_link_data = []
    for each_linkdata_obj in each_link_data:
        # make the nested dictionary to normal dict
        norm_dict = dict()
        for key in each_linkdata_obj:
            if isinstance(each_linkdata_obj[key], list):
                norm_dict[key] = ','.join(each_linkdata_obj[key])
            else:
                norm_dict[key] = each_linkdata_obj[key]
        all_link_data.append(norm_dict)
    return all_link_data
# print(all_link_data)
all_link_data = data_processing()
# function to write dataframe data into csv
def write_to_csv(all_keys):
    all_link_df = pd.DataFrame.from_dict(all_link_data)
    all_link_df2 = all_link_df.drop_duplicates()
    all_link_df3 = all_link_df2.reset_index()
    # print(all_link_df3)
    all_keys = all_keys + ['link']
    all_link_df4 = all_link_df3[all_keys]
    # print(all_link_df4)
    all_link_df4.to_csv('final_data.csv')
write_to_csv(all_keys)
Move the existing functions (e.g. write_to_csv) to a different file, for example utility_functions.py. Import it in main.py using from utility_functions import write_to_csv. Now you can use the function write_to_csv in main.py as
write_to_csv(all_keys)
Edit
In the main.py file,
use from data_extraction import data_extraction instead of import data_extraction.
In the data_extraction.py file,
remove the lines
from main import baseurl
from main import all_keys
This will throw an undefined-variable error, which you can fix by passing those variables in the function call (see the sketch below).
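To make that concrete, here is a minimal sketch (reusing the names from the code above; this is one possible way to wire it, not the only one): data_extraction takes everything it needs as parameters, so it no longer imports anything from main.
# data_extraction.py
import requests
from bs4 import BeautifulSoup

def data_extraction(model_links, all_keys, headers):
    each_link_data = []
    for link in model_links:
        r = requests.get(link, headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        specification = {'link': link}
        for key in all_keys:
            spec = soup.select_one(f'.field__label:-soup-contains("{key}") + .field__item')
            specification[key] = spec.text if spec else ''
        each_link_data.append(specification)
    return each_link_data
and in main.py:
from data_extraction import data_extraction
each_link_data = data_extraction(model_links, all_keys, headers)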
There is a site https://ru.myip.ms/browse/market_bitcoin/%D0%91%D0%B8%D1%82%D0%BA%D0%BE%D0%B8%D0%BD_%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D1%8F_%D1%86%D0%B5%D0%BD.html#a with a table of BTC prices below, and I need to parse this table. I tried to do it, but for some reason dots are displayed instead of the prices in the table.
from time import sleep
import pandas as pd
import requests
host = 'ru.myip.ms'
index_url = 'https://ru.myip.ms'
home_url = "https://ru.myip.ms/browse/market_bitcoin/%D0%91%D0%B8%D1%82%D0%BA%D0%BE%D0%B8%D0%BD_%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D1%8F_%D1%86%D0%B5%D0%BD.html#a"
base_ajax_url = "https://ru.myip.ms/ajax_table/market_bitcoin/{page}"
with requests.Session() as session:
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
        'Host': host
    }

    # visit home page and parse the initial dataframe
    response = session.get(home_url)
    df = pd.read_html(response.text, attrs={"id": "market_bitcoin_tbl"})[0]
    df = df.rename(columns=lambda x: x.strip())  # remove extra newlines from the column names
    sleep(2)

    # start paginating with page=2
    page = 1
    while True:
        url = base_ajax_url.format(page=page)
        print("Processing {url}...".format(url=url))
        response = session.post(url,
                                data={'getpage': 'yes', 'lang': 'ru'},
                                headers={
                                    'X-Requested-With': 'XMLHttpRequest',
                                    'Origin': index_url,
                                    'Referer': home_url
                                })

        # add data to the existing dataframe
        try:
            new_df = pd.read_html("<table>{0}</table>".format(response.text))[0]
        except ValueError:  # could not extract data from HTML - last page?
            break

        new_df.columns = df.columns
        df = pd.concat([df, new_df])

        page += 1
        sleep(1)

print(df)
You are doing it correctly, and you already have your results.
Just try this to see the result:
print(df['Bitcoin Price'])
You see the dots only because the DataFrame is too big to display in full when you run it, but the data is there.
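If you want to see the whole thing, you can loosen pandas' display limits before printing (standard pandas options, shown as a small sketch):
import pandas as pd

# show every row and column instead of the truncated preview with '...'
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print(df)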
I am working on my first project with APIs and I am having trouble accessing the data. I am working off an example that calls its data with this loop:
for item in data['objects']:
    print item['name'], item['phone']
This works great for data stored as nested dictionaries (the outside being called objects, and the inside containing the data).
The issue I am having is that my data is formatted as dictionaries inside of a list:
[
{
"key":"2014cama",
"website":"http://www.cvrobotics.org/frc/regional.html",
"official":true,
"end_date":"2014-03-09",
"name":"Central Valley Regional",
"short_name":"Central Valley",
"facebook_eid":null,
"event_district_string":null,
"venue_address":"Madera South High School\n705 W. Pecan Avenue\nMadera, CA 93637\nUSA",
"event_district":0,
"location":"Madera, CA, USA",
"event_code":"cama",
"year":2014,
"webcast":[],
"timezone":"America/Los_Angeles",
"alliances":[],
"event_type_string":"Regional",
"start_date":"2014-03-07",
"event_type":0
},'more data...']
so calling,
for item in data['objects']:
    print item['name']
Won't work to pull the value stored in name.
Any help would be much appreciated.
Edit: The full Dataset I'm pulling (http://www.thebluealliance.com/api/v2/team/frc254/2014/events?X-TBA-App-Id=Peter_Hartnett:Scouting:v1)
And the code I am running:
import json,urllib2, TBA
team ='frc254'
year = '2014'
Url = 'http://www.thebluealliance.com/api/v2/team/'+team+'/'+year+'/events?X-TBA-App-Id=Peter_Hartnett:Scouting:v1'
data = TBA.GetData(Url)
for item in data:
    print item['name']
The TBA Class just imports the data and returns it.
Edit2:
Here is the TBA class that pulls the data; I can assure you it is identical to that found at the link above.
import urllib2,cookielib
content='none'
def GetData(Url):
    site = Url
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib2.Request(site, headers=hdr)
    try:
        page = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print e.fp.read()
    content = page.read()
    return content
If I understood it correctly, what you had as data['objects'] is now itself a list, right?
So just iterate over the list and your logic will remain the same:
for item in objects:
    print item['name'], item['phone']
where objects is:
objects = [
{
"key":"2014cama",
"website":"http://www.cvrobotics.org/frc/regional.html",
"official":true,
"end_date":"2014-03-09",
"name":"Central Valley Regional",
"short_name":"Central Valley",
"facebook_eid":null,
"event_district_string":null,
"venue_address":"Madera South High School\n705 W. Pecan Avenue\nMadera, CA 93637\nUSA",
"event_district":0,
"location":"Madera, CA, USA",
"event_code":"cama",
"year":2014,
"webcast":[],
"timezone":"America/Los_Angeles",
"alliances":[],
"event_type_string":"Regional",
"start_date":"2014-03-07",
"event_type":0
},'more data...']
Edit
I get your problem now. Your data is a string that represents a JSON array. You should load it before iterating, so that you can work with it as a real list, like so:
data = GetData(Url)
loaded_array = json.loads(data)
for item in loaded_array:
    print item['name']
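A slightly tidier variant, just a sketch along the same lines: have GetData parse the JSON itself, so every caller gets a list of dicts instead of a raw string.
import json
import urllib2

def GetData(Url):
    # fetch the URL and return the parsed JSON (a list of event dicts here)
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
    req = urllib2.Request(Url, headers=hdr)
    page = urllib2.urlopen(req)
    return json.loads(page.read())

for item in GetData(Url):
    print item['name']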