Web data having unequal length arrays (Python 3)

Below is my code to scrape a website. I have to create a DataFrame from arrays of unequal length; for instance, Property_Type varies per listing: some listings have one property type, some have two and some have three. Similarly, the agency name also has varying length.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

urls = []
for i in range(1, 3):
    pages = "http://www.realcommercial.com.au/for-sale/property-offices-retail-in-vic/list-{0}?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs={'class': 'details-panel'})
    hrefs = [link['href'] for link in links]
    for href in hrefs:
        pages = requests.get(href)
        soup_2 = BeautifulSoup(pages.content, 'html.parser')
        Address_1 = soup_2.find_all('p', attrs={'class': 'full-address'})
        Address = [Address.text.strip() for Address in Address_1]
        Prop_Type = soup_2.find_all('div', attrs={'class': 'propType ellipsis'})
        Property_Type = [Property_Type.text.strip() for Property_Type in Prop_Type]
        Agency_1 = soup_2.find_all('div', attrs={'class': 'agencyName ellipsis'})
        Agency_Name = [Agency_Name.text.strip() for Agency_Name in Agency_1]
        Agent_1 = soup_2.find_all('div', attrs={'class': 'agentName ellipsis'})
        Agent_Name = [Agent_Name.text.strip() for Agent_Name in Agent_1]
        raw_data = dict(A=np.array(Address), B=np.array(Property_Type), C=np.array(Agency_Name), D=np.array(Agent_Name))
        raw_df = pd.DataFrame(dict([ k,series(v) for k,v in raw_data.iteritems() ]))
The error I am getting is:
File "<ipython-input-8-3a7c5fc4fb93>", line 32
raw_df = pd.DataFrame(dict([ k,series(v) for k,v in raw_data.iteritems() ]))
^
SyntaxError: invalid syntax
What should I do to have a DataFrame where only the relevant values fall under the relevant columns, i.e. property types end up in the property type column and do not spill into the agency name column?
Any help would be highly appreciated, thanks!

series isn't defined anywhere -- presumably you mean to turn each v into a pandas Series with pd.Series. Also, the syntax inside dict([...]) is invalid: a list comprehension that produces pairs needs parentheses around each (key, value) tuple, which is what triggers the SyntaxError, and dicts in Python 3 have .items() rather than .iteritems(). If you want to create key:value pairs where each value is a Series, and then use those pairs to create a DataFrame, you can do it with a dictionary comprehension, like so:
raw_df = pd.DataFrame.from_dict({k: pd.Series(v) for k, v in raw_data.items()})
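Just to make the effect concrete, here is a minimal sketch with made-up listing values (not the scraped ones): the shorter lists simply become NaN-padded columns.
import pandas as pd

raw_data = {
    'A': ['1 Example St, Melbourne VIC'],           # one address
    'B': ['Offices', 'Retail'],                      # two property types
    'C': ['Some Agency'],                            # one agency
    'D': ['Agent One', 'Agent Two', 'Agent Three'],  # three agents
}
raw_df = pd.DataFrame({k: pd.Series(v) for k, v in raw_data.items()})
print(raw_df)  # columns A and C are padded with NaN up to length 3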

Related

Parsing a table from a website (choosing the correct HTML tag)

I need to make a DataFrame from the following page: http://pitzavod.ru/products/upakovka/
from bs4 import BeautifulSoup
import pandas as pd
import requests
kre = requests.get(f'http://pitzavod.ru/products/upakovka/')
soup = BeautifulSoup(kre.text, 'lxml')
table1 = soup.find('table', id="tab3")
I chose "tab3" because I found <div class="tab-pane fade" id="tab3"> in the HTML, but the variable table1 comes back as None. How can I get the table? Thank you.
NOTE: you can get the table as a DataFrame in one statement with .read_html, but the DataFrame returned by pd.read_html('http://pitzavod.ru/products/upakovka/')[0] will not retain line breaks.
.find('table', id="tab3") searches for table tags with id="tab3", and there are no such elements in that page's HTML.
There's a div with id="tab3" (as you've noticed), but it does not contain any tables.
The only table on the page is contained in a div with id="tab4", so you might have used table1 = soup.find('div', id="tab4").table [although I prefer using .select with CSS selectors for targeting nested tags].
Suggested solution:
kre = requests.get('http://pitzavod.ru/products/upakovka/')
# print(kre.status_code, kre.reason, 'from', kre.url)
kre.raise_for_status()
soup = BeautifulSoup(kre.content, 'lxml')
# table = soup.select_one('div#tab4>div.table-responsive>table')
table = soup.find('table') # soup.select_one('table')
tData = [{
1 if 'center' in c.get('style', '') else ci: '\n'.join([
l.strip() for l in c.get_text('\n').splitlines() if l.strip()
]) for ci, c in enumerate(r.select('td'))
} for r in table.select('tr')]
df = pandas.DataFrame(tData)
## combine the top 2 rows to form header ##
df.columns = ['\n'.join([
f'{d}' for d in df[c][:2] if pandas.notna(d)
]) for c in df.columns]
df = df.drop([0,1], axis='rows').reset_index(drop=True)
# print(df.to_markdown(tablefmt="fancy_grid"))
(Normally, I would use a helper function to specify the separator for tag contents inside cells, but the middle cell in the 2nd header row would be shifted if I used .DataFrame(read_htmlTable(table, tSep='\n', asObj='dicts')); the 1 if 'center' in c.get('style', '') else ci bit in the above code corrects for that.)
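For completeness, the single-statement .read_html route from the note above would look like this; as said there, it does not retain line breaks inside cells:
import pandas as pd

df = pd.read_html('http://pitzavod.ru/products/upakovka/')[0]  # the only table on the page
print(df)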

How to add Items to a Single Dictionary from Multiple For Loops?

I started learning Python recently. What I am basically doing is scraping data from a website and adding it to a list of dictionaries.
This is what the final structure should look like (screenshot omitted).
This is basically my scraping code. I had to use two for loops, since the elements to target are at different positions on the webpage (one loop for the title and another for the description):
jobslist = []

for item in title:
    MainTitle = item.text
    mydict = {
        'title': MainTitle,
    }
    jobslist.append(mydict)

for i in link:
    links = i['href']
    r2 = requests.get(links, headers=headers)
    soup2 = BeautifulSoup(r2.content, 'lxml')
    entry_content = soup2.find('div', class_='entry-content')
    mydict = {
        'description': entry_content
    }
    jobslist.append(mydict)
Finally, saving to a CSV (the pandas library is used, imported as pd):
df = pd.DataFrame(jobslist)
df.to_csv('data.csv')
But the output is quite strange: the descriptions are added below the titles and not side by side. This is the screenshot:
How can I align them side by side?
Disclaimer: It's hard to give a perfect answer because your code is not reproducible; I have no idea what your data looks like, nor what you're trying to do, so I can't really test anything.
From what I understand of your code, it looks like the dictionaries are completely unnecessary. You have a list of titles, and a list of descriptions. So be it:
titles_list = []
for item in title:
    titles_list.append(item.text)

descriptions_list = []
for i in link:
    links = i['href']
    r2 = requests.get(links, headers=headers)
    soup2 = BeautifulSoup(r2.content, 'lxml')
    entry_content = soup2.find('div', class_='entry-content')
    descriptions_list.append(entry_content)

# here we use a dict of lists instead of a list of dicts
df = pd.DataFrame(data={'title': titles_list, 'description': descriptions_list})
df.to_csv('data.csv')
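One caveat that is not part of the answer above: soup2.find() returns a bs4 Tag (or None), so the CSV cell would contain the element's raw HTML. If you only want the visible text, the description loop could store plain text instead (link and headers are assumed to come from the question's setup):
descriptions_list = []
for i in link:
    r2 = requests.get(i['href'], headers=headers)
    soup2 = BeautifulSoup(r2.content, 'lxml')
    entry_content = soup2.find('div', class_='entry-content')
    # get_text() flattens the tag to its text; fall back to '' when the div is missing
    descriptions_list.append(entry_content.get_text(strip=True) if entry_content else '')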

Cannot get the hyperlink href beautiful soup

I am trying to get the hyperlink of an anchor (a) element, but I keep getting:
https://in.finance.yahoo.com/https://in.finance.yahoo.com/
I have tried all solutions provided here: link
Here's my code:
href_links = []
symbols = []
prices = []
commodities = []

CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data)

counter = 40
for i in range(40, 404, 14):
    for row in soup.find_all('tbody'):
        for srow in row.find_all('tr'):
            for symbol in srow.find_all('td', attrs={'class': 'data-col0'}):
                symbols.append(symbol.text)
                href_link = soup.find('a').get('href')
                href_links.append('https://in.finance.yahoo.com/' + href_link)
            for commodity in srow.find_all('td', attrs={'class': 'data-col1'}):
                commodities.append(commodity.text)
            for price in srow.find_all('td', attrs={'class': 'data-col2'}):
                prices.append(price.text)

pd.DataFrame({"Links": href_links, "Symbol": symbols, "Commodity": commodities, "Prices": prices})
Also, I would like to know whether it is feasible, similarly to the website, to have the symbol of the commodity as a hyperlink in my pandas DataFrame.
I'm not sure what's going on with the code you posted, but you can simply get that URL by finding an a element with the attribute data-symbol set to GC=F. The html has 2 such elements. The one you want is the first one, which is what is returned by soup.find('a', {'data-symbol': 'GC=F'}).get('href').
import requests, urllib.parse
from bs4 import BeautifulSoup

CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data)

gold_href = soup.find('a', {'data-symbol': 'GC=F'}).get('href')

# If it is a relative URL, we need to transform it into an absolute URL (it always is, fwiw)
if not gold_href.startswith('http'):
    # If you insist, you can do 'https://in.finance.yahoo.com' + gold_href
    gold_href = urllib.parse.urljoin(CommoditiesUrl, gold_href)

print(gold_href)
Also, I would like to know whether it is feasible, similarly to the website, to have the symbol of the commodity as a hyperlink in my pandas DataFrame.
I'm not familiar with pandas, but I'd say the answer is yes. See: How to create a table with clickable hyperlink in pandas & Jupyter Notebook
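As a rough sketch of that idea (the rows below are made up, not taken from the scrape above), you can store each symbol as an HTML anchor and render the frame without escaping:
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["GC=F", "SI=F"],
    "Links": ["https://in.finance.yahoo.com/quote/GC%3DF", "https://in.finance.yahoo.com/quote/SI%3DF"],
})

# wrap each symbol in an anchor tag, then render as HTML without escaping it
df["Symbol"] = df.apply(lambda row: f'<a href="{row["Links"]}">{row["Symbol"]}</a>', axis=1)
html = df.to_html(escape=False)  # in a Jupyter notebook: IPython.display.HTML(html)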

How to return data from multiple pages of a table at a URL using BeautifulSoup

I am trying to retrieve the code as well as the title, but somehow I am not able to. The website is
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27
Here I have tried to get the values from the table:
import requests

unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text

from bs4 import BeautifulSoup
soup = BeautifulSoup(link, 'lxml')
print(soup.prettify())

all_table = soup.find_all('table')
print(all_table)

right_table = soup.find_all('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
tables = right_table.find_all('td')
print(tables)
The error is: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I expect to save the code as well as the title in a list and save it in a DataFrame later.
Is there any way to continue to the next page without manually providing values in the search code (like 51%), as there are more than 20 pages inside 51%?
From the documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings – a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
df = pd.read_html(str(right_table))[0]
# Clean up the DataFrame
df = df[[0, 1]]
df.columns = df.iloc[0]
df = df[1:]
print(df)
Output:
0 Code Title
1 51180000 Hormones and hormone antagonists
2 51280000 Antibacterials
3 51290000 Antidepressants
4 51390000 Sympathomimetic or adrenergic drugs
5 51460000 Herbal drugs
...
Notes:
The row order may be a little different but the data seems to be the same.
You will have to remove the last one or two rows from the DataFrame as they are not relevant.
This is the data from the first page only. Look into selenium to get the data from all pages by clicking on the buttons [1] [2] ... (a rough sketch follows below). You can also use requests to emulate the POST request, but it is a bit difficult for this site (IMHO).
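A rough, untested sketch of that Selenium route (the table id comes from the answer above, but the pager behaviour -- numbered links that trigger an ASP.NET postback -- is an assumption about this site):
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27")

frames = []
for page in range(1, 4):  # e.g. the first three pages
    table_html = driver.find_element(
        By.ID, "dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView"
    ).get_attribute("outerHTML")
    frames.append(pd.read_html(table_html)[0])                 # parse the current page's table
    driver.find_element(By.LINK_TEXT, str(page + 1)).click()   # click [2], [3], ...
    time.sleep(2)                                              # crude wait for the postback to finish

driver.quit()
df = pd.concat(frames, ignore_index=True)
print(df)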

Writing items extracted from a website to a .xls sheet with lists of different lengths using the pandas module in Python

I am a beginner in Python programming and I am practicing scraping different values from websites.
I have extracted the items from a particular website and now want to write them to a .xls file.
The whole web page has 714 records, including duplicates, but the Excel sheet shows only 707 records because zip() stops when the smallest list is exhausted. Here the smallest list is the email list, so it gets exhausted and the iteration stops early due to how zip() works. I have even added a check inside an if condition for records that have no email address so that it displays "No email address", but the same result is still displayed, with 704 records including duplicates. Kindly tell me where I am going wrong and, if possible, suggest what to do about removing duplicate records and displaying "No email address" where there is no email.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact', headers={'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

names = []
positions = []
phone = []
emails = []

links = [l1['href'] for l1 in soup.select('.agent-name a')]

nlist = soup.find_all('li', class_='agent-name')
plist = soup.find_all('li', class_='agent-role')
phlist = soup.find_all('li', class_='agent-officenum')
elist = soup.find_all('a', class_='val withicon')

for n1 in nlist:
    names.append(n1.text)
for p1 in plist:
    positions.append(p1.text)
for ph1 in phlist:
    phone.append(ph1.text)
for e1 in elist:
    emails.append(e1.get('href') if e1.get('href') is not None else 'No Email address')

df = pd.DataFrame(list(zip(names, positions, phone, emails, links)), columns=['Names', 'Position', 'Phone', 'Email', 'Link'])
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2', index=False, header=True)
The Excel sheet looks like this, where we can see that the last record's name and its email address do not match:
(screenshot: Ray White Excel Sheet)
It looks like you are doing many find_all's and then stitching them together. My advice would be to do one find_all then iterate through that. It makes it a lot easier to build out the columns of your dataframe when all your data is in one place.
I have updated the code below to successfully extract links without error. With any code, there are a number of ways to perform the same task. This one may not be the most elegant, but it does get the job done.
import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact', headers={'User-agent': 'Super Bot 9000'})
soup = BeautifulSoup(r.text, 'html.parser')

get_cards = soup.find_all("div", {"class": "card horizontal-split vcard"})

agent_list = []
for item in get_cards:
    name = item.find('li', class_='agent-name').text
    position = item.find('li', class_='agent-role').text
    phone = item.find('li', class_='agent-officenum').text
    link = item.find('li', class_='agent-name').a['href']
    try:
        email = item.find('a', class_='val withicon')['href'].replace('mailto:', '')
    except:
        email = 'No Email address'
    agent_list.append({'name': name, 'position': position, 'email': email, 'link': link})

df = pd.DataFrame(agent_list)
Above is some sample code I have put together to create the dataframe. The key here is to do a single find_all on the "card horizontal-split vcard" class.
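If you also want to drop the duplicate records and write the result out as in the original script, a small follow-up could be (the subset columns are just a guess at what makes an agent unique):
# drop duplicates, then export the same way the question's script did
df = df.drop_duplicates(subset=['name', 'email']).reset_index(drop=True)
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2', index=False, header=True)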
Hope that has been some help.
Cheers,
Adam
