Is there a simple way in Python to loop over a simple list of numbers?
I want to scrape some data from different URLs that only differ in 3 numbers.
I'm quite new to Python and couldn't figure out an easy way to do it.
Thanks a lot!
Here's my code:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.example.com/3322")
bsObj = BeautifulSoup(html)
table = bsObj.findAll("table", {"class": "MainContent"})[0]
rows = table.findAll("td")

csvFile = open("/Users/Max/Desktop/file1.csv", 'wt')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['tr', 'td']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
In Stata this would be like:
foreach i in 13 34 55 67 {
    html = urlopen("http://www.example.com/`i'")
    ....
}
Thanks a lot!
Max
I've broken your original code into functions simply to make clearer what I think is the answer to your question: use a simple loop, and .format() to construct the URLs and filenames.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_url(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    table = bsObj.findAll("table", {"class": "MainContent"})[0]
    rows = table.findAll("td")
    return rows

def write_csv_data(path, rows):
    csvFile = open(path, 'wt')
    writer = csv.writer(csvFile)
    try:
        for row in rows:
            csvRow = []
            for cell in row.findAll(['tr', 'td']):
                csvRow.append(cell.get_text())
            writer.writerow(csvRow)
    finally:
        csvFile.close()

for i in (13, 34, 55, 67):
    url = "http://www.example.com/{}".format(i)
    csv_path = "/Users/MaximilianMandl/Desktop/file-{}.csv".format(i)
    rows = scrape_url(url)
    write_csv_data(csv_path, rows)
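As a side note, the try/finally in write_csv_data can be replaced with a with statement, which closes the file even if an exception is raised; a minimal sketch of that variant, assuming Python 3 (where newline='' avoids blank rows on Windows):
import csv

def write_csv_data(path, rows):
    # 'with' closes the file automatically, even on error
    with open(path, 'w', newline='') as csvFile:
        writer = csv.writer(csvFile)
        for row in rows:
            writer.writerow([cell.get_text() for cell in row.findAll(['tr', 'td'])])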
I would use set.intersection() for that:
mylist = [1, 16, 8, 32, 7, 5]
fieldmatch = [5, 7, 16]
intersection = list(set(mylist).intersection(fieldmatch))
I'm not familiar with Stata, but it looks like the Python equivalent might simply be:
import requests

for i in [13, 34, 55, 67]:
    response = requests.get("http://www.example.com/{}".format(i))
    # ...
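If you then want to feed each page into BeautifulSoup as in the original snippet, the response body is available on the requests response object; a minimal sketch (the example.com URL and the loop values are just the placeholders from the question):
import requests
from bs4 import BeautifulSoup

for i in [13, 34, 55, 67]:
    response = requests.get("http://www.example.com/{}".format(i))
    # response.text is the decoded HTML body of the page
    soup = BeautifulSoup(response.text, "html.parser")
    # ... parse the table for this page as in the original code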
The simplest way to do this is to apply the filter inside the loop:
mylist = [1, 16, 8, 32, 7, 5]
for myitem in mylist:
    if myitem in (5, 7, 16):
        print(myitem)
This may not, however, be the most elegant way to do it. If you want to store a new list of the matching results, you can use a list comprehension:
mylist = [1, 16, 8, 32, 7, 5]
fieldmatch = [5, 7, 16]
filteredlist = [x for x in mylist if x in fieldmatch]
You can then take filteredlist, which contains only the items in mylist that match fieldmatch (in other words, your original list filtered by your criteria), and iterate over it like any other list:
for myitem in filteredlist:
    # perform whatever processing you want on each item here
    do_something_with(myitem)
Hope this helps.
Hi everyone. Although I got the data I was looking for in text format, when I try to record it as a list or convert it into a dataframe, it simply doesn't work. What I get is a huge list with only one item, which is the last text line of the data, i.e. the number '9.054.333,18'. Can anyone help me, please? I need to organize all this data in a list or dataframe.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html = urlopen('http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/termo/posicoes-em-aberto/posicoes-em-aberto-8AA8D0CC77D179750177DF167F150965.htm?data=16/04/2021&f=0#conteudo-principal')
soup = BeautifulSoup(html.read(), 'html.parser')
texto = soup.find_all('td')
for t in texto:
    print(t.text)

lista = []
for i in soup.find_all('td'):
    lista.append(t.text)
print(lista)
Your iterators are wrong: you're using i as the loop variable in the last loop while appending t.text.
You can just use a list comprehension:
# ...
soup = BeautifulSoup(html.read(), 'html.parser')
lista = [t.text for t in soup.find_all('td')]
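Since you mentioned wanting a dataframe as well, you could hand that list straight to pandas; a minimal sketch, assuming pandas is installed and reusing the soup object from the snippet above (the column name 'valor' is just illustrative):
import pandas as pd

lista = [t.text for t in soup.find_all('td')]
df = pd.DataFrame({'valor': lista})  # one row per <td> cell
print(df.head())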
How do I go about extracting more than one JSON key at a time with this script? The script cycles through a list of message ids and extracts the JSON response; I only want to extract certain keys from the response.
import urllib3
import json
import csv
from progressbar import ProgressBar
import time
pbar = ProgressBar()
base_url = 'https://api.pipedrive.com/v1/mailbox/mailMessages/'
fields = {"include_body": "1", "api_token": "token"}
json_arr = []
http = urllib3.PoolManager()
with open('ten.csv', newline='') as csvfile:
    for x in pbar(csv.reader(csvfile, delimiter=' ', quotechar='|')):
        r = http.request('GET', base_url + "".join(x), fields=fields)
        mails = json.loads(r.data.decode('utf-8'))
        json_arr.append(mails['data']['from'][0]['id'])
print(json_arr)
This works as intended, but I want to do the following:
json_arr.append(mails(['data']['from'][0]['id'],['data']['to'][0]['id']))
This results in TypeError: list indices must be integers or slices, not str
Did you mean:
json_arr.append(mails['data']['from'][0]['id'])
json_arr.append(mails['data']['to'][0]['id'])
The answer already posted looks good, but I'll share the one-liner equivalent, using extend() instead of append():
json_arr.extend([mails['data']['from'][0]['id'], mails['data']['to'][0]['id']])
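The difference matters because append() would add the two-element list as a single nested item, while extend() adds each id as its own element; a small illustration with made-up ids:
json_arr = []
json_arr.append(['from_id', 'to_id'])   # -> [['from_id', 'to_id']] (one nested element)

json_arr = []
json_arr.extend(['from_id', 'to_id'])   # -> ['from_id', 'to_id'] (two flat elements)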
I am a Python novice and have no experience with BeautifulSoup or urllib.
I've tried to Frankenstein my own code from other questions to no avail, so I will try to detail what I'm trying to achieve with the pseudocode and description below:
import urllib2
from bs4 import BeautifulSoup

for eachurl in "urllist.txt":
    urllib read first (or 2nd or 3rd) url in list
    find.all("<form")
    if number of "<form" > 0:
        result = True
    if number of "<form" == 0:
        result = False
    write result to csv/excel/html
    table col 1 = url in urllist
    table col 2 = result
So basically, I have a txt file with a list of URLs in it; I would like urllib to open each URL one by one and check whether or not the HTML contains a form tag. The result should be written to a new file, with the URL string in the left column and a y or n in the right, depending on whether finding all form tags returned more than 0 results, and the script should of course stop once the URLs in the txt file have been exhausted.
Use requests instead of urllib2.
Try this:
import requests
from bs4 import BeautifulSoup

with open('data.txt', 'r') as data:
    for line in data:
        res = requests.get(line.strip()).content
        soup = BeautifulSoup(res, 'html.parser')
        with open('result.txt', 'a') as result_file:
            if soup.find_all('form'):
                result_file.write('{} y\n'.format(line.strip()))
            else:
                result_file.write('{} n\n'.format(line.strip()))
data.txt
http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains
http://blank.org/
result.txt
http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains y
http://blank.org/ n
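Since the question also mentioned writing the result to csv/excel/html, the same loop can just as easily produce a two-column CSV; a minimal sketch under the same data.txt assumption:
import csv
import requests
from bs4 import BeautifulSoup

with open('data.txt') as data, open('result.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['url', 'has_form'])  # header row
    for line in data:
        url = line.strip()
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        writer.writerow([url, 'y' if soup.find_all('form') else 'n'])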
I want to read a txt file and store it as a list of strings. This is a way I came up with myself, but it looks really clumsy. Is there any better way to do this? Thanks.
import re
import urllib2
import numpy as np

url = 'http://quant-econ.net/_downloads/graph1.txt'
response = urllib2.urlopen(url)
txt = response.read()

f = open('graph1.txt', 'w')
f.write(txt)
f.close()

f = open('graph1.txt', 'r')
nodes = f.readlines()
I tried the solutions provided below, but they all actually return something different from my previous code.
This is the string produced by split():
'node0, node1 0.04, node8 11.11, node14 72.21'
This is what my code produces:
'node0, node1 0.04, node8 11.11, node14 72.21\n'
The problem is that without the '\n', when I try to process the string list I run into an index error:
"row = index[0]  IndexError: list index out of range"
for node in nodes:
    index = re.findall('(?<=node)\w+', node)
    index = map(int, index)
    row = index[0]
    del index[0]
According to the documentation, response is already a file-like object: you should be able to do response.readlines().
For those problems where you do need to create an intermediate file like this, though, you want to use io.StringIO.
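A minimal sketch of both ideas, staying with the question's Python 2 urllib2 (the sample text passed to StringIO is just the node line shown above):
import urllib2
import io

response = urllib2.urlopen('http://quant-econ.net/_downloads/graph1.txt')
nodes = response.readlines()  # list of lines, each still ending in '\n'

# io.StringIO gives an in-memory file-like object when an API insists on a file
buf = io.StringIO(u'node0, node1 0.04, node8 11.11, node14 72.21\n')
lines = buf.readlines()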
Look at split(). So:
nodes = response.read().split("\n")
EDIT: Alternatively, if you want to avoid \r\n newlines, use splitlines.
nodes = response.read().splitlines()
Try:
url = 'http://quant-econ.net/_downloads/graph1.txt'
response = urllib2.urlopen(url)
txt = response.read()
with open('graph1.txt', 'w') as f:
    f.write(txt)
nodes = txt.split("\n")
If you don't want the file, this should work:
url = 'http://quant-econ.net/_downloads/graph1.txt'
response = urllib2.urlopen(url)
txt = response.read()
nodes = txt.split("\n")
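For what it's worth, under Python 3 the same idea needs urllib.request and an explicit decode, since read() returns bytes; a minimal sketch under that assumption:
from urllib.request import urlopen

url = 'http://quant-econ.net/_downloads/graph1.txt'
txt = urlopen(url).read().decode('utf-8')  # bytes -> str
nodes = txt.splitlines()  # also avoids a trailing empty entry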
I use this code:
from bs4 import BeautifulSoup

parser = BeautifulSoup(remote_data)
parse_data = parser.find_all('a')
for atag_data in parse_data:
    URL_list = atag_data.get('href')
When I try to split URL_list into an array:
array = str.split(URL_list)
I get these 3 arrays:
['index1.html']
['example.exe']
['document.doc']
But I need a single array, like:
['index1.html', 'example.exe', 'document.doc']
Any suggestions, please?
You don't get an array; it's a list!
Also, you should avoid naming variables after builtins.
Regarding your question:
from bs4 import BeautifulSoup
parser = BeautifulSoup(remote_data)
link_list = [a['href'] for a in parser.find_all('a')]
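One caveat: a['href'] raises a KeyError for anchor tags without an href attribute, so you may want to ask find_all for only those tags; a minimal sketch under the same remote_data assumption:
from bs4 import BeautifulSoup

parser = BeautifulSoup(remote_data, 'html.parser')
# href=True skips <a> tags that have no href attribute
link_list = [a['href'] for a in parser.find_all('a', href=True)]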