I'm trying to simulate sending a tweet containing an image using the requests library. Apparently, to do it properly you need to go through four stages:
The first stage: INIT
The second stage: APPEND
The third stage: FINALIZE
The fourth stage: publishing the tweet.
From the experiments I have done, I am facing the problem in the second stage (APPEND), where the content of the image is uploaded. It seems that I am doing it wrong.
I hope to find the right way to do this:
s = requests.Session()
s.headers.update({
'Cookie':'guest_id_marketing=v1%3A167542157148611150; guest_id_ads=v1%3A167542157148611150; personalization_id="v1_RXdbXKB8hgRH0Li/icKGWQ=="; guest_id=v1%3A167542157148611150; gt=1621461709524275202; ct0=f2020601bfa05bab3846cbb2cb6fcc8de5414370c7a2cb3de579fea7e2a344b25771a8989e7bf0a75c82b8f54061c54e95a9a7e8f06eaf995dffb20b1018f4ec1333fe1416a93c1968a44eae9c7cdddd; external_referer=padhuUp37zjgzgv1mFWxJ12Ozwit7owX|0|8e8t2xd8A2w%3D; _ga=GA1.2.180578843.1675421587; _gid=GA1.2.1928677010.1675421587; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoHaWQiJTNlMDdhMDMxNTVmNWI0ZDNmZjAwMWEw%250ANmJjOTFmYzc0Og9jcmVhdGVkX2F0bCsIWO7oFoYBOgxjc3JmX2lkIiUwMDIy%250AYmY4NjNjZWZlMGE0NTZlMTM2ZTYwZTAyYjYyYw%253D%253D--59831749d79402bc50de0786d3c9133b80d9ceca; kdt=Y7chgKmPh7qohOCcrIpIGjefeuwn4xa3tzCO41hT; twid="u=614982902"; auth_token=a43fbd826c64630a88399e3f7d80ae2a71e05f39; att=1-vBhpupCEykW4h0LvapSOdJbpJopEv3saTLiQqkFb',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0',
'Accept':'*/*',
'Accept-Language':'en-US,en;q=0.5',
'Accept-Encoding':'gzip, deflate',
'Origin':'https://twitter.com',
'Referer':'https://twitter.com/',
'Sec-Fetch-Dest':'empty',
'Sec-Fetch-Mode':'cors',
'Sec-Fetch-Site':'same-site',
'Content-Length':'0',
'Te':'trailers',
})
file_name = '1.jpg'
#INIT
total_bytes = os.path.getsize(file_name)
url= f'https://upload.twitter.com/i/media/upload.json?command=INIT&total_bytes={total_bytes}&media_type=image/jpeg&media_category=tweet_image'
json = {'command':'INIT','total_bytes':total_bytes,'media_type':'image/jpeg','media_category':'tweet_image'}
req = s.post(url, json=json, timeout=20, allow_redirects=True)
print(req.json())
media_id = req.json()['media_id']
#APPEND
file = open(file_name, 'rb')
file_data = file.read()
url = f'https://upload.twitter.com/i/media/upload.json?command=APPEND&media_id={media_id}&segment_index=0'
data = {'media':file_data}
req = s.post(url, data=data,allow_redirects=True)
print(req.text)
print(req.status_code)
#FINALIZE
url = f'https://upload.twitter.com/i/media/upload.json?command=FINALIZE&media_id={media_id}'
json = {'command':'FINALIZE','media_id':media_id,'total_bytes':total_bytes}
req = s.post(url, data=json, timeout=20, allow_redirects=True)
print(req.json())
The error:
{'request': '/i/media/upload.json', 'error': 'Segments do not add up to provided total file size.'}
I'm trying to simulate the request from the browser by using its cookies.
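A likely cause, offered as an untested sketch rather than a confirmed fix: in the APPEND step the image bytes are passed via data=, so requests sends them as a URL-encoded form field and the uploaded segment no longer adds up to the total_bytes declared at INIT. The chunked-upload APPEND call normally expects the bytes as a multipart/form-data part named media, which requests produces with files=. It is probably also worth dropping the hard-coded 'Content-Length': '0' session header and letting requests compute it. A sketch of the APPEND step with that change, reusing the variables from the code above:
# APPEND - sketch: send the raw bytes as a multipart/form-data part instead of a form field
with open(file_name, 'rb') as file:
    file_data = file.read()
url = f'https://upload.twitter.com/i/media/upload.json?command=APPEND&media_id={media_id}&segment_index=0'
files = {'media': file_data}
req = s.post(url, files=files, allow_redirects=True)
print(req.status_code)  # a successful APPEND returns a 2xx status with an empty body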
I am taking my first steps in web scraping and wanted to get video data from Pornhub.
As a first step I went through all the pages on the main page and collected the video links. This worked and I got a CSV with around 100k links. If I copy/paste those links into the browser, they work fine. BUT when I go over them with my script to get my desired values, it always redirects me to a Cornhub video (I know this was an April Fools' joke some time ago). So it seems that my request gets redirected, but I don't know how this happens and whether I can do anything about it.
import csv
import json
import datetime
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

with open("links.csv", "r") as f:
    lines = csv.reader(f)
    for adress in lines:
        data = []
        print(data)
        headers = {'User-Agent':
                   'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
        sleep(randint(2, 5))
        html = requests.get(adress[0], headers=headers)
        soup = BeautifulSoup(html.text, features="html.parser")
        print(soup)
        views = soup.find("script", type="application/ld+json")
        json_data = json.loads(views.contents[0])
        interaction_stat = json_data["interactionStatistic"]
        views = int(interaction_stat[0]
                    ["userInteractionCount"].replace(",", ""))
        duration = int(
            soup.find("meta", property="video:duration").get("content"))
        upload_date = datetime.datetime.strptime(
            json_data["uploadDate"][0:10], '%Y-%m-%d').date()
        video_id = soup.find("form", id="shareToStream")
        video_id = video_id.find("input", id="attachment").get("value")
        data.append(video_id)
        data.append(upload_date)
        data.append(views)
        data.append(duration)
        with open("data.csv", "a", newline="") as f:  # move this above the loop so it only runs once
            writer = csv.writer(f)
            writer.writerow(data)
Your headers are really old, but this works just fine. Maybe make sure you rotate your IP or wait some time between subsequent requests.
import requests
from bs4 import BeautifulSoup
import json
lines = [
    "XX62f79e2ed1ed8",
    "XX63078405e84b6",
]
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Host": "www.pornhub.com",
    "Referer": "https://www.pornhub.com/",
}
with requests.Session() as s:
    for line in lines:
        soup = (
            BeautifulSoup(s.get(line, headers=headers).text, features="html.parser")
            .find("script", type="application/ld+json")
        )
        json_data = (
            json
            .loads(soup.getText())
            ['interactionStatistic'][0]['userInteractionCount']
        )
        print(json_data)
For the videos I've used, the output is:
3,339,324
384,482
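If you want to confirm whether (and where) a particular link gets redirected, requests keeps the intermediate responses in response.history; a small sketch (the viewkey and User-Agent are placeholders):
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
url = "https://www.pornhub.com/view_video.php?viewkey=XXXXXXXXXXXXX"  # placeholder link from links.csv

r = requests.get(url, headers=headers)
print(r.url)  # final URL after any redirects
for hop in r.history:
    print(hop.status_code, hop.url)  # each intermediate redirect, if any

# or stop requests from following redirects and inspect the Location header yourself
r = requests.get(url, headers=headers, allow_redirects=False)
print(r.status_code, r.headers.get("Location"))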
My code is only making empty folders and not downloading images.
So I think it needs to be modified so that the images are actually downloaded.
I tried to fix it by myself, but I can't figure out how to do it.
Anyone please help me. Thank you!
import requests
import parsel
import os
import time
for page in range(1, 310):  # Total 309 pages
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)
    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')
    for v in containers:
        old_title = v.xpath('.//div[2]/h2/a/text()').get()  #.replace(':', ' -')
        if old_title is not None:
            title = old_title.replace(':', ' -')
            title_url = v.xpath('.//div[2]/h2/a/@href').get()
            print(title, title_url)
            if not os.path.exists('img\\' + title):
                os.mkdir('img\\' + title)
            response_image = requests.get(url=title_url, headers=headers).text
            selector_image = parsel.Selector(response_image)
            # Full Size Images
            images_url = selector_image.xpath('//div[@class="image-context"]/a[@class="download"]/@href').getall()
            for title_url in images_url:
                image_data = requests.get(url=title_url, headers=headers).content
                file_name = title_url.split('/')[-1]
                time.sleep(1)
                with open(f'img\\{title}\\' + file_name, mode='wb') as f:
                    f.write(image_data)
                print('Download complete!!:', file_name)
This page uses JavaScript to create the "download" link, but requests/urllib/beautifulsoup/lxml/parsel/scrapy can't run JavaScript, and this causes the problem.
But it seems the page uses the same URLs to display the images on the page, so you may use //img/@src.
But this creates another problem, because the page uses JavaScript for "lazy loading" of images and only the first img has src. The other images have their URL in data-src (normally JavaScript copies data-src to src as you scroll the page), so you have to read data-src to download the rest of the images.
You need something like this to get @src (for the first image) and @data-src (for the other images).
images_url = selector_image.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_image.xpath('//div[@id="content"]//img/@data-src').getall()
Full working code (with other small changes):
Because I use Linux, the string img\\{title} would create a wrong path for me,
so I use os.path.join('img', title, filename) to create a correct path on Windows, Linux, and Mac.
import requests
import parsel
import os
import time
# you can define it once
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
for page in range(1, 310):  # Total 309 pages
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'
    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)
    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')
    for v in containers:
        old_title = v.xpath('.//div[2]/h2/a/text()').get()  #.replace(':', ' -')
        if old_title is not None:
            title = old_title.replace(':', ' -')
            title_url = v.xpath('.//div[2]/h2/a/@href').get()
            print(title, title_url)
            os.makedirs(os.path.join('img', title), exist_ok=True)  # creates it only if it doesn't exist
            response_article = requests.get(url=title_url, headers=headers)
            selector_article = parsel.Selector(response_article.text)
            # Full Size Images
            images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                         selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
            print('len(images_url):', len(images_url))
            for img_url in images_url:
                response_image = requests.get(url=img_url, headers=headers)
                filename = img_url.split('/')[-1]
                with open(os.path.join('img', title, filename), 'wb') as f:
                    f.write(response_image.content)
                print('Download complete!!:', filename)
I'm trying to scrape data with requests from this website: https://enlinea.sunedu.gob.pe/verificainscripcion. The parameters are a doc (06950413 in the example), the captcha, and a hidden parameter called _token, which I get using XPath. For the captcha I also get the image using XPath and download it into an imagenes folder; then I pause with input() while I type the captcha letters into captcha.txt, and after typing the captcha I hit Enter to continue. But I get a JSON response with a captcha error. This is my code:
from time import sleep
import requests
from lxml import html
from PIL import Image # pip install Pillow
import io
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
    "Host": "enlinea.sunedu.gob.pe",
    "Origin": "https://enlinea.sunedu.gob.pe",
    "Referer": "https://enlinea.sunedu.gob.pe/verificainscripcion",
}
session = requests.Session()
login_form_url = 'https://enlinea.sunedu.gob.pe/verificainscripcion'
login_form_res = session.get(login_form_url, headers=headers)
sleep(5)
parser = html.fromstring(login_form_res.text)
special_token = parser.xpath('//input[@name="_token"]/@value')
print('token:', special_token[0])
span_image = parser.xpath('//div[@class="pull-right"]/span[@id="captchaImgPriv"]/img')[0].get("src")
print(span_image)
image_content = requests.get(span_image).content
image_file = io.BytesIO(image_content)
image = Image.open(image_file).convert('RGB')
file_path = './imagenes/captcha.jpg'
with open(file_path, 'wb') as f:
    image.save(f, "JPEG", quality=85)
input()
login_url = 'https://enlinea.sunedu.gob.pe/consulta'
login_data = {
    "doc": "06950413",
    "opcion": 'PUB',
    "_token": special_token[0],
    "icono": '',
    "captcha": open('captcha.txt').readline().strip()
}
print(login_data)
rep = session.post(
    login_url,
    data=login_data,
    headers=headers
)
print(rep.text)
print(rep.text)
Thanks in advance.
The issue was that you didn't use the session when you built the captcha image request. Requesting the image sets a cookie that has to be sent back with the form.
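In other words, the minimal change to your original code is to fetch the captcha through the same session object (a two-line sketch using the variable names from your script):
# sketch: reuse the session so the captcha cookie is kept for the POST
image_content = session.get(span_image, headers=headers).content  # instead of requests.get(span_image)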
The following script uses BeautifulSoup instead of xpath/lxml; it downloads the captcha, shows it, waits for user input, and gets the data:
import requests
from bs4 import BeautifulSoup
from PIL import Image
import io
import json
host = "https://enlinea.sunedu.gob.pe"
doc = "06950413"
s = requests.Session()
r = s.get(f"{host}/verificainscripcion")
soup = BeautifulSoup(r.text, "html.parser")
payload = dict([
    (t["name"], t.get("value", ""))
    for t in soup.find("form", {"id": "consultaForm"}).find_all("input")
])
payload["doc"] = doc
image_content = s.get(f'{host}/simplecaptcha').content
image_file = io.BytesIO(image_content)
image = Image.open(image_file).convert('RGB')
image.show()
captcha = input("please enter captcha : ")
payload["captcha"] = captcha
print(payload)
r = s.post("https://enlinea.sunedu.gob.pe/consulta", data = payload)
data = json.loads(r.text)
print(data)
output :
please enter captcha : YLthP
{'doc': '06950413', 'opcion': 'PUB', '_token': 'gfgUTJrqPcmM9lyFHqW0u5aOdoF4gNSJm60kUNRu', 'icono': '', 'nombre': '', 'captcha': 'YLthP'}
[{"ID":"1519729","NOMBRE":"OSORIO DELGADILLO, FLOR DE MARIA","DOC_IDENT":"DNI 06950413","GRADO":"<b>BACHILLER EN EDUCACION ","TITULO_REV":"<b>BACHILLER EN EDUCACION","GRADO_REV":null,"DIPL_FEC":"26\/06\/1987","RESO_FEC":"-","ESSUNEDU":"0","UNIV":"UNIVERSIDAD INCA GARCILASO DE LA VEGA ASOCIACI\u00d3N CIVIL","PAIS":"PERU","COMENTARIO":"-","TIPO":"N","TIPO_GRADO":"B","DIPL_TIP_EMI":null,"TIPO_INSCRI":null,"NUM_DIPL_REVA":null,"NUM_ORD_PAG":null,"V_ORIGEN":null,"NRO_RESOLUCION_NULIDAD":null,"FLG_RESOLUCION_NULIDAD":null,"FECHA_RESOLUCION_NULIDAD":null,"MODALIDAD_ESTUDIO":"-"}]
I am trying to create a script that will submit a form and return me the results. I am able to pull the form information from the URL but I am not able to update the fields of the form or get a response.
I currently have:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://dos.elections.myflorida.com/campaign-finance/contributions/'
response = requests.get(url)
soup = bs(response.text)
form_info = soup.find_all('form')
print(form_info[0]['action'])
Which works and returns:
'/cgi-bin/contrib.exe'
This form should be able to be submitted with the defaults, so I then try:
session = requests.Session()
BASE_URL = 'https://dos.elections.myflorida.com'
headers = {'User-Agent': "Mozilla/5.0" , 'referer' :'{}/campaign-finance/contributions/'.format(BASE_URL)}
data = {'Submit' : 'Submit'}
res = session.post( '{}/cgi-bin/contrib.exe'.format(BASE_URL), data = data, headers = headers )
And I get a 502 response. I set the Referer and the URL in the form they are in because of this post. The form page is:
https://dos.elections.myflorida.com/campaign-finance/contributions/
and the results redirect me to:
https://dos.elections.myflorida.com/cgi-bin/contrib.exe
The solution by SIM worked, thanks!!
Try the following to get the required content using default search:
import requests
from bs4 import BeautifulSoup
link = 'https://dos.elections.myflorida.com/campaign-finance/contributions/'
post_url = 'https://dos.elections.myflorida.com/cgi-bin/contrib.exe'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['election'] = '20201103-GEN'
    payload['search_on'] = '1'
    payload['CanNameSrch'] = '2'
    payload['office'] = 'All'
    payload['party'] = 'All'
    payload['ComNameSrch'] = '2'
    payload['committee'] = 'All'
    payload['namesearch'] = '2'
    payload['csort1'] = 'NAM'
    payload['csort2'] = 'CAN'
    payload['queryformat'] = '2'
    r = s.post(post_url, data=payload)
    print(r.text)
I have Python code that sends a POST request to a website, reads the response, and filters it. For the POST data I used ('number', '11111') and it works perfectly. However, I want to create a txt file that contains 100 different numbers, like 1111,2222,3333,4444..., and then send a POST request for each of them. Can you help me do this in Python?
import urllib
from bs4 import BeautifulSoup
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://mahmutesat.com/python.aspx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://mahmutesat.com/python.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'http://mahmutesat.com/python.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
viewstategenerator = soup.select("#__VIEWSTATEGENERATOR")[0]['value']
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategenerator),
    ('number', '11111'),
    ('Button', 'Sorgula'),
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f.read())
name=soup.findAll('input',{'id':'name_field'})
for eachname in name:
    print eachname['value']
If your file has data:
"sample.txt"
1111,2222,3333,4444,5555,6666,7777,8888,......(and so on)
To read the file contents, you can use the file open operation:
import itertools

# open the file for reading
with open("sample.txt", "r") as fp:
    values = fp.readlines()

# split each line on ","
data = [map(int, line.split(",")) for line in values]
numbers = list(itertools.chain(*data))  # if the file has several lines, concatenate them all
Now, use it as:
for number in numbers:
    formData = (
        ('__EVENTVALIDATION', eventvalidation),
        ('__VIEWSTATE', viewstate),
        ('__VIEWSTATEGENERATOR', viewstategenerator),
        ('number', str(number)),  # here you use the number obtained
        ('Button', 'Sorgula'),
    )
    encodedFields = urllib.urlencode(formData)
    # second HTTP request with form data
    f = myopener.open(url, encodedFields)
    soup = BeautifulSoup(f.read())
    name = soup.findAll('input', {'id': 'name_field'})
    for eachname in name:
        print eachname['value']
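For what it's worth, if you are on Python 3 (where urllib.FancyURLopener no longer exists), the same loop can be sketched with requests; the field names come from the question's form, and everything else here is an assumption:
import requests
from bs4 import BeautifulSoup

url = 'http://mahmutesat.com/python.aspx'
with open("sample.txt") as fp:
    numbers = [n.strip() for n in fp.read().split(",") if n.strip()]

with requests.Session() as s:
    for number in numbers:
        # re-read the hidden ASP.NET fields before every submit
        page = BeautifulSoup(s.get(url).text, "html.parser")
        form_data = {
            '__EVENTVALIDATION': page.select_one("#__EVENTVALIDATION")["value"],
            '__VIEWSTATE': page.select_one("#__VIEWSTATE")["value"],
            '__VIEWSTATEGENERATOR': page.select_one("#__VIEWSTATEGENERATOR")["value"],
            'number': number,
            'Button': 'Sorgula',
        }
        result = BeautifulSoup(s.post(url, data=form_data).text, "html.parser")
        for tag in result.find_all('input', {'id': 'name_field'}):
            print(tag['value'])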
1 - Here is an example of how to create a file:
f = open('test.txt','w')
This will open the test.txt file for writing ('w') (if it already has data, it will be erased; if you want to append instead, write f = open('test.txt','a')), or create the file if it does not exist yet. Note that this happens in your current working directory; if you want it in a specific directory, include the full directory path with the file name, for example:
f = open('C:\\Python\\test.txt','w')
2 - Then write/append the data you want to this file, for example:
for i in range(1,101):
    f.write(str(i*1111)+'\n')
This will write 100 numbers as strings, from 1111 to 111100.
3 - You should always close the file at the end (or use a with block, as sketched after these steps, which closes it for you):
f.close()
4 - Now if you want to read from this file 'test.txt':
f = open('C:\\Python\\test.txt','r')
for i in f:
    print i,
f.close()
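Here is the with-block variant mentioned in step 3, as a small sketch; it closes the file automatically even if an error occurs:
# write the numbers, then read them back; the with block closes the file automatically
with open('test.txt', 'w') as f:
    for i in range(1, 101):
        f.write(str(i * 1111) + '\n')

with open('test.txt', 'r') as f:
    for line in f:
        print(line.strip())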
This is as simple as it can be.
You can read more about file I/O in Python here:
https://docs.python.org/2.7/tutorial/inputoutput.html#reading-and-writing-files
Make sure you select the right Python version in those docs.
Using a dictionary, you can deal with the multiple requests very easily.
import requests
values = {
    '__EVENTVALIDATION': event_validation,
    '__LASTFOCUS': '',
    '__VIEWSTATE': view_state,
    '__VIEWSTATEGENERATOR': '6264FB8D',
    'ctl00$ContentPlaceHolder1$ButGet': 'Get Report',
    'ctl00$ContentPlaceHolder1$Ddl_Circles': 'All Circles',
    'ctl00$ContentPlaceHolder1$Ddl_Divisions': '-- Select --',
    'ctl00$ContentPlaceHolder1$TxtTin': tin_num,
    'ctl00$ContentPlaceHolder1$dropact': 'all'
}
headers_1 = {
    'Origin': 'https://www.apct.gov.in',
    'User-Agent': user_agent,
    'Cookie': cookie_1,
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': url_1,
    'Content-Type': 'application/x-www-form-urlencoded',
    'Upgrade-Insecure-Requests': '1'
}
try:
    req = requests.post(url_1, data=values, headers=headers_1)