How to split multiline string to array - python

I use this code:
from bs4 import BeautifulSoup
parser = BeautifulSoup(remote_data)
parse_data = parser.find_all('a')
for atag_data in parse_data:
    URL_list = atag_data.get('href')
When I try to split URL_list into an array:
array = str.split(URL_list)
I get these 3 arrays:
['index1.html']
['example.exe']
['document.doc']
But I need only one array, like:
['index1.html','example.exe','document.doc']
Any suggestions please?

You don't get an array - it's a list!
Also, you should avoid naming variables like builtins.
Regarding your question:
from bs4 import BeautifulSoup
parser = BeautifulSoup(remote_data)
link_list = [a['href'] for a in parser.find_all('a')]
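The three separate lists in the question come from calling str.split() once per loop iteration, each time on a single href. A minimal sketch of the difference, using a hard-coded list as a hypothetical stand-in for the values that atag_data.get('href') would return:

```python
# Hypothetical stand-in for the href values pulled from the <a> tags.
hrefs = ['index1.html', 'example.exe', 'document.doc']

# What the question's loop effectively does: one split per href,
# so every iteration produces its own one-element list.
per_iteration = [str.split(href) for href in hrefs]
print(per_iteration)

# Collecting the values into a single list instead.
link_list = []
for href in hrefs:
    link_list.append(href)
print(link_list)
```

The list comprehension in the answer is the compact form of the second loop.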


extract Unique id from the URL using Python

I've a URL like this:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x= 'Enterprise-Business-Planning-Analyst_3103928-1'
I want to extract the ID at the end of the URL, i.e. the x part of the string above, to get the unique ID.
Any help regarding this will be highly appreciated.
_parsed_url.path.split("/")[-1].split('-')[-1]
I am using this, but it is giving an error.
Python's urllib.parse and pathlib standard-library modules can help here.
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
from urllib.parse import urlparse
from pathlib import PurePath
x = PurePath(urlparse(url).path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text Enterprise-Business-Planning-Analyst_3103928-1, you can split() the URL on the / character and take the last piece:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.split("/")[-1])
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text 3103928, you can replace the _ character with - and then split() on the - character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.replace("_", "-").split("-")[-2])
# 3103928
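The approaches above can be combined into one helper. The sketch below is only an illustration: the function name job_slug_and_id is made up, and it assumes the last path segment always has the form name_number-suffix, as in the example URL. It also shows why the attempt in the question, splitting on '-' and taking [-1], yields only the trailing '1'.

```python
from urllib.parse import urlparse

def job_slug_and_id(url):
    """Return (last path segment, numeric id) for a URL shaped like the example."""
    # urlparse keeps any query string or fragment out of the path.
    last_segment = urlparse(url).path.rstrip('/').split('/')[-1]
    # Everything after the last underscore, e.g. '3103928-1'.
    _, _, id_part = last_segment.rpartition('_')
    # The digits before the trailing '-1' suffix.
    numeric_id = id_part.split('-')[0]
    return last_segment, numeric_id

url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
slug, job_id = job_slug_and_id(url)
print(slug)
print(job_id)

# The question's expression, applied to the same segment, keeps only the suffix.
questions_result = slug.split('-')[-1]
print(questions_result)
```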

Recording web scraping data

Hi everyone. Although I got the data I was looking for in text format, it simply doesn't work when I try to record it as a list or convert it into a dataframe. What I got was a huge list that contains only one value, which is the last text line of the data I got, i.e. the number '9.054.333,18'. Can anyone help me, please? I need to organize all this data in a list or dataframe.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html = urlopen('http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/termo/posicoes-em-aberto/posicoes-em-aberto-8AA8D0CC77D179750177DF167F150965.htm?data=16/04/2021&f=0#conteudo-principal')
soup = BeautifulSoup(html.read(), 'html.parser')
texto = soup.find_all('td')
for t in texto:
    print(t.text)
lista=[]
for i in soup.find_all('td'):
    lista.append(t.text)
print(lista)
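Your loop variables are mixed up: the last loop iterates with i but appends t.text, and t still refers to the last element of the previous loop, so the same text is appended every time.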
You can just use a list comprehension:
# ...
soup = BeautifulSoup(html.read(), 'html.parser')
lista = [t.text for t in soup.find_all('td')]
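Once the cell texts are in a flat list, they still need to be grouped into rows before they fit a table or dataframe. The following is a rough sketch under stated assumptions: the cell values are invented placeholders (only '9.054.333,18' comes from the question), the helper names are hypothetical, and the column count of 3 is a guess that has to be checked against the real table.

```python
def chunk_rows(cells, n_columns):
    """Group a flat list of cell texts into rows of n_columns items."""
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]

def parse_br_number(text):
    """Convert a number written like '9.054.333,18' into a float."""
    # '.' is the thousands separator and ',' the decimal separator.
    return float(text.replace('.', '').replace(',', '.'))

# Placeholder values standing in for [t.text for t in soup.find_all('td')].
cells = ['AAAA3', '1.200', '9.054.333,18',
         'BBBB4', '350', '12.345,67']

rows = chunk_rows(cells, 3)  # 3 columns is an assumption for this sketch
for name, quantity, value in rows:
    print(name, parse_br_number(quantity), parse_br_number(value))
```

If pandas is installed, a list of rows like this can be passed to pandas.DataFrame(rows, columns=[...]) with column names of your choosing.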

How to substract soup.find_all() in python 3

I want to change the output of my soup.find_all call. In the original source we have this:
My soup.find_all:
href = [b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
gives me this:
/book/nfo/?id=4756888
but I want this:
http://127.0.0.1/book/download/?id=4756888
You can use Python's string methods to prepend a prefix and replace part of the string:
a='/book/nfo/?id=4756888'
b = 'http://127.0.0.1' + a.replace('nfo', 'download')
print(b)
which gives:
'http://127.0.0.1/book/download/?id=4756888'
There's no need to use regex here.
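A slightly more defensive variant of the same idea, sketched with the standard library's urllib.parse.urljoin. The helper name to_download_url and the BASE_URL constant are made up for illustration. Replacing the whole '/nfo/' path segment rather than the bare letters 'nfo' avoids touching paths such as '/info/'.

```python
from urllib.parse import urljoin

BASE_URL = 'http://127.0.0.1'

def to_download_url(href):
    """Turn a relative '/.../nfo/...' link into an absolute download link."""
    # Replace only the first '/nfo/' path segment, so other occurrences of
    # the letters 'nfo' (for example inside '/info/') are left alone.
    return urljoin(BASE_URL, href.replace('/nfo/', '/download/', 1))

hrefs = ['/book/nfo/?id=4756888']
download_urls = [to_download_url(h) for h in hrefs]
print(download_urls)
```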
You could compile a regular expression and apply it in a list comprehension as follows:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('', 'html.parser')
re_s = re.compile(r'(.*?\/)nfo(\/.*?)').sub
hrefs = [re_s('http://127.0.0.1' + r'\1download\2', a.get('href')) for a in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print(hrefs)
Giving you:
['http://127.0.0.1/book/download/?id=4756888']
You can prepend http://127.0.0.1 and replace 'nfo' with 'download' using Python's re.sub() function.
re.sub(r'pattern_to_match',r'replacement_string', string)
You can implement it as follows:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("""""", 'html.parser')
c = ['http://127.0.0.1'+b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print([re.sub(r'nfo',r'download',q) for q in c ])
Output:
['http://127.0.0.1/book/download/?id=4756888']
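All of the answers reuse the href filter from the question. As far as I know, Beautiful Soup applies a compiled pattern with its search() method, so the pattern can be tried on plain strings first. The sketch below does that with a few invented hrefs; note that without an end anchor, an id longer than 8 digits still passes the filter. The stricter pattern is only a suggestion.

```python
import re

# The filter used in the question and the answers.
href_filter = re.compile(r'.*\?id\=\d{4,8}')
# A stricter, hypothetical alternative: the id must end the string.
strict_filter = re.compile(r'\?id=\d{4,8}$')

# Invented sample hrefs; only the first one appears in the question.
candidates = [
    '/book/nfo/?id=4756888',
    '/book/nfo/?id=123',
    '/book/nfo/?id=123456789',
    '/book/about/',
]

matched = [h for h in candidates if href_filter.search(h)]
strictly_matched = [h for h in candidates if strict_filter.search(h)]
print(matched)
print(strictly_matched)
```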

Simple forvalues loop in python?

Is there a simple way in Python to loop over a simple list of numbers?
I want to scrape some data from different URLs that only differ in 3 numbers.
I'm quite new to Python and couldn't figure out an easy way to do it.
Thanks a lot!
Here's my code:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.example.com/3322")
bsObj = BeautifulSoup(html)
table = bsObj.findAll("table",{"class":"MainContent"})[0]
rows=table.findAll("td")
csvFile = open("/Users/Max/Desktop/file1.csv", 'wt')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow=[]
        for cell in row.findAll(['tr', 'td']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
In Stata this would be like:
foreach i of 13 34 55 67{
    html = urlopen("http://www.example.com/`i'")
    ....
}
Thanks a lot!
Max
I've broken your original code into functions simply to make clearer what I think is the answer to your question: use a simple loop, and .format() to construct urls and filenames.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
def scrape_url(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html)
    table = bsObj.findAll("table",{"class":"MainContent"})[0]
    rows=table.findAll("td")
    return rows
def write_csv_data(path, rows):
    csvFile = open(path, 'wt')
    writer = csv.writer(csvFile)
    try:
        for row in rows:
            csvRow=[]
            for cell in row.findAll(['tr', 'td']):
                csvRow.append(cell.get_text())
            writer.writerow(csvRow)
    finally:
        csvFile.close()
for i in (13, 34, 55, 67):
    url = "http://www.example.com:3322/{}".format(i)
    csv_path = "/Users/MaximilianMandl/Desktop/file-{}.csv".format(i)
    rows = scrape_url(url)
    write_csv_data(csv_path, rows)
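The looping part of that answer can be tried without any network access. This is a small sketch with placeholder URL and file-name templates, followed by range(), which I believe is the closest Python counterpart to Stata's forvalues for evenly spaced numbers:

```python
# An explicit list of numbers, like Stata's foreach over a number list.
ids = (13, 34, 55, 67)

# The URL and file-name templates below are placeholders.
urls = ["http://www.example.com/{}".format(i) for i in ids]
csv_names = ["file-{}.csv".format(i) for i in ids]
for url, csv_name in zip(urls, csv_names):
    print(url, '->', csv_name)

# Evenly spaced numbers: range(start, stop, step), where stop is excluded.
one_to_five = list(range(1, 6))
tens = list(range(10, 51, 10))
print(one_to_five, tens)
```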
I would use set.intersection() for that:
mylist=[1,16,8,32,7,5]
fieldmatch=[5,7,16]
intersection = list(set(mylist).intersection(fieldmatch))
I'm not familiar with Stata, but it looks like the Python equivalent might simply be:
import requests
for i in [13, 34, 55, 67]:
    response = requests.get("http://www.example.com/{}".format(i))
    ....
The simplest way to do this is to apply the filter inside the loop:
mylist=[1,16,8,32,7,5]
for myitem in mylist:
    if myitem in (5,7,16):
        print(myitem)
This may not, however, be the most elegant way to do it. If you wanted to store a new list of the matching results, you can use a list comprehension:
mylist=[1,16,8,32,7,5]
fieldmatch=[5,7,16]
filteredlist=[ x for x in mylist if x in fieldmatch ]
You can then take filteredlist which contains only the items in mylist that match fieldmatch (in other words your original list filtered by your criteria) and iterate over it like any other list:
for myitem in filteredlist:
    # Perform whatever process you want to each item here
    do_something_with(myitem)
Hope this helps.
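The set-based and comprehension-based filters shown above are not interchangeable in one respect: a set has no defined order and drops duplicates, while the comprehension keeps the order (and any repeats) of the original list. A short sketch with the same sample data; the variable names via_set, wanted, and via_comprehension are only illustrative:

```python
mylist = [1, 16, 8, 32, 7, 5]
fieldmatch = [5, 7, 16]

# Set intersection: membership only, no guaranteed order.
via_set = set(mylist).intersection(fieldmatch)

# List comprehension: keeps the order of mylist. Converting fieldmatch
# to a set first makes each membership test fast for long lists.
wanted = set(fieldmatch)
via_comprehension = [x for x in mylist if x in wanted]

print(sorted(via_set))
print(via_comprehension)
```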

"Expected string or buffer" error using Beautiful Soup

I'm trying to write code that pulls numbers from a URL using Beautiful Soup and then sums them, but I keep getting an error that looks like this:
Expected string or buffer
I think it's related to the regular expressions, but I can't pinpoint the problem.
import re
import urllib
from BeautifulSoup import *
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(htm1)
tags = soup('span')
for tag in tags:
    y = re.findall ('([0-9]+)',tag.txt)
print sum(y)
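I recommend bs4 instead of BeautifulSoup (which is the old version). re.findall() needs a string as its second argument, and neither the tag itself nor tag.txt is one, so you also need to change a line like this: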
y = re.findall ('([0-9]+)',tag)
to something like this:
y = re.findall ('([0-9]+)',tag.text)
See if this gets you further:
total = 0 #initialize the running total (avoids shadowing the builtin sum)
for tag in tags:
    y = re.findall ('([0-9]+)',tag.text) #get the text from the tag
    print(y[0]) #y is a list, print the first element of the list
    total += int(y[0]) #convert it to an integer and add it to the total
print('the sum is: {}'.format(total))
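The loop above assumes every span contains a number, because y[0] fails on an empty list. The sketch below is a variant that tolerates spans without digits, and it also reproduces the error from the question by passing None to re.findall(). The span_texts list is an invented stand-in for the tag.text values.

```python
import re

# Invented stand-ins for the tag.text values of the <span> tags.
span_texts = ['97', '90', 'no number here', '88']

total = 0
for text in span_texts:
    # findall returns a (possibly empty) list of digit strings.
    numbers = re.findall(r'[0-9]+', text)
    total += sum(int(n) for n in numbers)
print('the sum is: {}'.format(total))

# Passing something that is not a string raises the TypeError from the question.
try:
    re.findall(r'[0-9]+', None)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```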
