I have an idx file:
https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx
I could open the idx file with the following code a year ago, but it doesn't work now. Why is that? How should I modify the code?
import requests
import urllib
from bs4 import BeautifulSoup
master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url).content
data_format = byte_data.decode('utf-8').split('------')
content = data_format[-1]
data_list = content.replace('\n','|').split('|')
for index, item in enumerate(data_list):
    if '.txt' in item:
        if data_list[index - 2] == '10-K':
            entry_list = data_list[index - 4: index + 1]
            entry_list[4] = "https://www.sec.gov/Archives/" + entry_list[4]
            master_data.append(entry_list)
print(master_data)
If you inspect the contents of the byte_data variable, you will find that it does not contain the actual content of the idx file; what comes back is essentially a block page meant to deter scraping bots like yours. You can find more information in this answer: Problem HTTP error 403 in Python 3 Web Scraping
In this case, the fix is simply to send a User-Agent header with the request.
import requests
master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url, allow_redirects=True, headers={"User-Agent": "XYZ/3.0"}).content
# Your further processing here
On a side note, your processing still doesn't print any matches because the if condition is never met for any of the lines, so don't take that to mean this solution isn't working.
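For what it's worth, a more robust way to pull out the 10-K rows is to split each line on the pipe separator instead of flattening the whole file. This is only a sketch, and it assumes the standard CIK|Company Name|Form Type|Date Filed|File Name layout of the daily master index (the User-Agent value is just a placeholder):
import requests

file_url = "https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
headers = {"User-Agent": "Sample Company admin@example.com"}  # placeholder; identify yourself here

text = requests.get(file_url, headers=headers).content.decode("utf-8")

master_data = []
for line in text.splitlines():
    parts = line.split("|")
    # Data rows look like: CIK|Company Name|Form Type|Date Filed|File Name
    if len(parts) == 5 and parts[4].endswith(".txt") and parts[2] == "10-K":
        cik, company, form_type, date_filed, file_name = parts
        master_data.append([cik, company, form_type, date_filed,
                            "https://www.sec.gov/Archives/" + file_name])

print(master_data)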
When I write to the csv file all of my data is printed in only the first column. Using my loop, how do I iterate along the columns to write the data?
import csv
import bs4
import urllib
from urllib.request import urlopen as uReq
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
#For sites that can't be opened due to Urllib blocker, use a Mozilla User agent to get access
pageRequest = Request('https://coronavirusbellcurve.com/', headers = {'User-Agent': 'Mozilla/5.0'})
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
specificDiv = page_soup.find("div", {"class": "table-responsive-xl"})
TbodyStats = specificDiv.table.tbody.tr.contents
TbodyDates = specificDiv.table.thead.tr.contents
with open('CovidHTML.csv', 'w', newline='') as file:
    theWriter = csv.writer(file)
    theWriter.writerow(['5/4', ' 5/5', ' 5/6',' 5/7',' 5/8',' 5/9'])
    for i in range(3,len(TbodyStats)):
        if i%2 != 0:
            theWriter.writerow([TbodyStats[i].text])
Another method, for reference only:
from simplified_scrapy import SimplifiedDoc,utils,req
html = req.get('https://coronavirusbellcurve.com/')
doc = SimplifiedDoc(html)
specificDiv = doc.select('div.table-responsive-xl') # Get first div. If you want to get all divs, use this method: doc.selects('div.table-responsive-xl')
# TbodyStats = specificDiv.tbody.trs.selects('td|th').text # Get data
# TbodyDates = specificDiv.thead.trs.selects('td|th').text # Get date
data = specificDiv.table.trs.selects('td|th').text # Get all
rows = []
for row in data:
    rows.append(row[1:])
utils.save2csv('test.csv',rows)
Result:
5/5,5/6,5/7,5/8,5/9
1213260,1237960,1266822,1294664,1314610
24423,24700,28862,27842,19946
2.05%,2.04%,2.33%,2.20%,1.54%
I think you may be able to do this (I can't test for sure because I don't have your exact data on hand):
row = []
for i in range(3, len(TbodyStats), 2):
    row.append(TbodyStats[i].text)
    if len(row) == 6:
        theWriter.writerow(row)
        row = []
I added the step argument to range so you don't have to use % to find odd-numbered indices. Each row is built up until it has 6 members, then flushed to the csv file and emptied so the process can repeat.
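If you'd rather not hard-code the dates header, the same alternating-index trick should work on TbodyDates too. I can't test this against your exact markup, so treat it as a sketch; you may need to adjust the starting index depending on how many label cells come first:
# Sketch: assumes TbodyDates alternates whitespace/element nodes like TbodyStats,
# and that this runs inside the same `with open(...)` block, before the stats rows.
header = []
for i in range(3, len(TbodyDates), 2):
    header.append(TbodyDates[i].text)
theWriter.writerow(header)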
I am trying to check whether each link contains http and print the URL.
import requests
from requests_html import HTMLSession
import sys
link = "http://www.tvil.me/view/93/4/8/v/%D7%90%D7%99%D7%99_%D7%96%D7%95%D7%9E%D7%91%D7%99_IZombie.html"
enter_episodes = HTMLSession().get(link)
page = enter_episodes.html
s = page.xpath("//*[@class='view-watch-button']/a")
for l in s:
    link = l.links
    if link != "set()":
        print(link)
Response:
{'http://streamcloud.eu/ga4m4hizbrfb/iZombie.S04E08.HDTV.x264-SVA.mkv.html'}
{'http://uptostream.com/77p26f7twwhe'}
set()
{'https://clipwatching.com/aog2ni06rzjt/rrFhepnbFfpt6xg.mkv.html'}
set()
[Finished in 1.7s]
I want to drop the set() responses and get just the link, without the surrounding {' and '}.
You just need to make sure the set has at least one element, and then pop it:
import requests
from requests_html import HTMLSession
import sys
link = "http://www.tvil.me/view/93/4/8/v/%D7%90%D7%99%D7%99_%D7%96%D7%95%D7%9E%D7%91%D7%99_IZombie.html"
enter_episodes = HTMLSession().get(link)
page = enter_episodes.html
s = page.xpath("//*[@class='view-watch-button']/a")
for l in s:
    link = l.links
    if len(link) > 0:  # make sure the set is not empty
        print(link.pop())  # get one value (in your case, the only one)
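If a button ever exposes more than one href, popping a single element would silently drop the rest. A slightly more defensive variant (just a sketch) is to iterate the set and keep anything that starts with http:
for l in s:
    for url in l.links:  # l.links is a set of hrefs
        if url.startswith("http"):
            print(url)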
How can I parse every page of ETH addresses from https://etherscan.io/token/generic-tokenholders2?a=0x6425c6be902d692ae2db752b3c268afadb099d3b&s=0&p=1 and then write them to a .txt file?
Okay, possibly off-topic, but I had a play around with this. (Mainly because I thought I might need to use something similar to grab stuff in future that Etherscan's APIs don't return... )
The following Python 2 code will grab what you're after. There's a hacky sleep in there to work around what I think is either slow page loads or some rate limiting imposed by Etherscan; I'm not sure which.
Data gets written to a .csv file - a text file wouldn't be much fun.
#!/usr/bin/env python
from __future__ import print_function
import os
import requests
from bs4 import BeautifulSoup
import csv
import time
RESULTS = "results.csv"
URL = "https://etherscan.io/token/generic-tokenholders2?a=0x6425c6be902d692ae2db752b3c268afadb099d3b&s=0&p="
def getData(sess, page):
    url = URL + page
    print("Retrieving page", page)
    return BeautifulSoup(sess.get(url).text, 'html.parser')

def getPage(sess, page):
    table = getData(sess, str(int(page))).find('table')
    return [[X.text.strip() for X in row.find_all('td')] for row in table.find_all('tr')]

def main():
    resp = requests.get(URL)
    sess = requests.Session()

    with open(RESULTS, 'wb') as f:
        wr = csv.writer(f, quoting=csv.QUOTE_ALL)
        wr.writerow(map(str, "Rank Address Quantity Percentage".split()))

        page = 0
        while True:
            page += 1
            data = getPage(sess, page)

            # Even pages that don't contain the data we're
            # after still contain a table.
            if len(data) < 4:
                break
            else:
                for row in data:
                    wr.writerow(row)

            time.sleep(1)

if __name__ == "__main__":
    main()
I'm sure it's not the best Python in the world.
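If you're on Python 3, the same approach should carry over; the main change I'm aware of is the csv file mode (text mode with newline='' instead of 'wb'), roughly like this (untested):
# Python 3 sketch: open the CSV in text mode and pass newline=''
with open(RESULTS, 'w', newline='') as f:
    wr = csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerow("Rank Address Quantity Percentage".split())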
The following program gives me output that includes URLs both with and without a trailing slash (e.g. ask.census.gov and ask.census.gov/). I need to eliminate one or the other. Thank you in advance for your help!
from bs4 import BeautifulSoup as mySoup
from urllib.parse import urljoin as myJoin
from urllib.request import urlopen as myRequest
my_url = "https://www.census.gov/programs-surveys/popest.html"
# call on packages
html_page = myRequest(my_url)
raw_html = html_page.read()
html_page.close()
page_soup = mySoup(raw_html, "html.parser")
f = open("censusTest.csv", "w")
hyperlinks = page_soup.findAll('a')
set_urls = set()
for checked in hyperlinks:
    found_link = checked.get("href")
    result_set = myJoin(my_url, found_link)
    if result_set and result_set not in set_urls:
        set_urls.add(result_set)
        f.write(str(result_set) + "\n")
f.close()
You can always right-strip the slash; it is removed if present and nothing changes otherwise:
result_set = myJoin(my_url, found_link).rstrip("/")
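For example (illustrative values only):
print("https://ask.census.gov/".rstrip("/"))  # https://ask.census.gov
print("https://ask.census.gov".rstrip("/"))   # https://ask.census.gov (unchanged)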
my_url = "https://www.census.gov/programs-surveys/popest.html/"
if my_url[-1:] == '/':
    my_url = my_url[:-1]
This snippet checks whether the last character in your string is a '/', and if it is, removes it.
Good examples of python string manipulation:
http://www.pythonforbeginners.com/basics/string-manipulation-in-python
I am new to Python and just wanted to know if this is possible: I have scraped a URL using urllib and want to edit different pages.
Example:
http://test.com/All/0.html
I want the 0.html to become 50.html and then 100.html and so on ...
found_url = 'http://test.com/All/0.html'
base_url = 'http://test.com/All/'
for page_number in range(0,1050,50):
    url_to_fetch = "{0}{1}.html".format(base_url, page_number)
That should give you URLs from 0.html to 1000.html
If you want to use urlparse (as suggested in the comments on your question):
import urlparse
found_url = 'http://test.com/All/0.html'
parsed_url = urlparse.urlparse(found_url)
path_parts = parsed_url.path.split("/")
for page_number in range(0, 1050, 50):
    new_path = "{0}/{1}.html".format("/".join(path_parts[:-1]), page_number)
    parsed_url = parsed_url._replace(path=new_path)
    print parsed_url.geturl()
Executing this script would give you the following:
http://test.com/All/0.html
http://test.com/All/50.html
http://test.com/All/100.html
http://test.com/All/150.html
http://test.com/All/200.html
http://test.com/All/250.html
http://test.com/All/300.html
http://test.com/All/350.html
http://test.com/All/400.html
http://test.com/All/450.html
http://test.com/All/500.html
http://test.com/All/550.html
http://test.com/All/600.html
http://test.com/All/650.html
http://test.com/All/700.html
http://test.com/All/750.html
http://test.com/All/800.html
http://test.com/All/850.html
http://test.com/All/900.html
http://test.com/All/950.html
http://test.com/All/1000.html
Instead of printing in the for loop, you can use the value of parsed_url.geturl() as needed. As mentioned, if you want to fetch the content of each page you can use the Python requests module in the following manner:
import requests
import urlparse

found_url = 'http://test.com/All/0.html'
parsed_url = urlparse.urlparse(found_url)
path_parts = parsed_url.path.split("/")

for page_number in range(0, 1050, 50):
    new_path = "{0}/{1}.html".format("/".join(path_parts[:-1]), page_number)
    parsed_url = parsed_url._replace(path=new_path)
    # print parsed_url.geturl()
    url = parsed_url.geturl()
    try:
        r = requests.get(url)
        if r.status_code == 200:
            with open(str(page_number) + '.html', 'w') as f:
                f.write(r.content)
    except Exception as e:
        print "Error scraping - " + url
        print e
This fetches the content from http://test.com/All/0.html through http://test.com/All/1000.html and saves the content of each URL to its own file. The file name on disk will match the file name in the URL, from 0.html to 1000.html.
Depending on the performance of the site you are scraping, you might experience considerable delays when running the script. If performance is important, you can consider using grequests.
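A rough sketch of what that could look like, sticking with the URL pattern above and the Python 2 style of this answer (untested):
import grequests

base_url = 'http://test.com/All/'
urls = ["{0}{1}.html".format(base_url, n) for n in range(0, 1050, 50)]

# Build the requests lazily, then fire them concurrently (size caps concurrency).
reqs = (grequests.get(u) for u in urls)
for url, resp in zip(urls, grequests.map(reqs, size=10)):
    if resp is not None and resp.status_code == 200:
        file_name = url.rsplit("/", 1)[-1]  # e.g. 0.html, 50.html, ...
        with open(file_name, 'w') as f:
            f.write(resp.content)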