I am using the following code and getting [] back. Please help me find my mistake.
from urllib import urlopen
optionsUrl = 'http://www.moneycontrol.com/commodity/'
optionsPage = urlopen(optionsUrl)
from bs4 import BeautifulSoup
soup = BeautifulSoup(optionsPage)
print soup.findAll(text='MCX')
This will grab that list of commodities for you (tested on Python 2.7). You need to isolate the commodity table, then work your way down the rows, reading each row and extracting the data from each column.
import urllib2
import bs4

# Page with commodities
URL = "http://www.moneycontrol.com/commodity/"

# Download the page data and create a BeautifulSoup object
commodityPage = urllib2.urlopen(URL)
commodityPageText = commodityPage.read()
commodityPageSoup = bs4.BeautifulSoup(commodityPageText)

# Extract the div with the commodities table and find all the table rows
commodityTable = commodityPageSoup.find_all("div", "equity com_ne")
commodityTableRows = commodityTable[0].find_all("tr")

# Trim off the table header row
commodityTableRows = commodityTableRows[1:]

# Iterate over the table rows and print out the commodity name and price
for commodity in commodityTableRows:
    # Separate all the table columns
    columns = commodity.find_all("td")

    # -------------- Get the values from each column
    # COLUMN 1: Name and date
    nameAndDate = columns[0].text
    nameAndDate = nameAndDate.split('-')
    name = nameAndDate[0].strip()
    date = nameAndDate[1].strip()

    # COLUMN 2: Price
    price = float(columns[1].text)

    # COLUMN 3: Change
    change = columns[2].text.replace(',', '')  # Remove commas from the change value
    change = float(change)

    # COLUMN 4: Percentage change
    percentageChange = columns[3].text.replace('%', '')  # Remove the percentage symbol
    percentageChange = float(percentageChange)

    # Print out the data
    print "%s on %s was %.2f - a change of %.2f (%.2f%%)" % (name, date, price, change, percentageChange)
Which gives the results
Gold on 5 Oct was 30068.00 - a change of 497.00 (1.68%)
Silver on 5 Dec was 50525.00 - a change of 1115.00 (2.26%)
Crudeoil on 19 Sep was 6924.00 - a change of 93.00 (1.36%)
Naturalgas on 25 Sep was 233.80 - a change of 0.30 (0.13%)
Aluminium on 30 Sep was 112.25 - a change of 0.55 (0.49%)
Copper on 29 Nov was 459.80 - a change of 3.40 (0.74%)
Nickel on 30 Sep was 882.20 - a change of 5.90 (0.67%)
Lead on 30 Sep was 131.80 - a change of 0.70 (0.53%)
Zinc on 30 Sep was 117.85 - a change of 0.75 (0.64%)
Menthaoil on 30 Sep was 871.90 - a change of 1.80 (0.21%)
Cotton on 31 Oct was 21350.00 - a change of 160.00 (0.76%)
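For anyone on Python 3, here is a minimal, untested sketch of the same approach (urllib2 became urllib.request, print is a function, and commas are stripped from every numeric field as a precaution; the page markup may have changed since this answer was written):

from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = "http://www.moneycontrol.com/commodity/"
soup = BeautifulSoup(urlopen(URL).read(), "html.parser")

# Same approach: isolate the commodities div, then walk the data rows
rows = soup.find_all("div", "equity com_ne")[0].find_all("tr")[1:]
for tr in rows:
    columns = tr.find_all("td")
    nameAndDate = columns[0].text.split('-')
    name, date = nameAndDate[0].strip(), nameAndDate[1].strip()
    price = float(columns[1].text.replace(',', ''))
    change = float(columns[2].text.replace(',', ''))
    pct = float(columns[3].text.replace('%', '').replace(',', ''))
    print("%s on %s was %.2f - a change of %.2f (%.2f%%)" % (name, date, price, change, pct))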
I'm trying to scrape data from a table on a website.
However, I keep running into "ValueError: cannot set a row with mismatched columns".
The set-up is:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('div', id='content')
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
my_data = pd.DataFrame(columns=headers)
my_data = my_data.iloc[:, :-4]
Here, I was able to make an empty dataframe with the same headers as the table (I did the iloc because there were some repeating columns at the end).
Now, I wanted to fill in the empty dataframe through:
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(my_data)
    my_data.loc[length] = row
However, as mentioned, I get "ValueError: cannot set a row with mismatched columns" in this line: length = len(my_data).
I would really appreciate any help to solve this problem and to fill in the empty dataframe.
Thanks in advance.
Rather than trying to fill an empty DataFrame, it would be simpler to utilize .read_html, which returns a list of DataFrames after parsing every table tag within the HTML.
Even though this page has only two tables ("Top Youtube channels" and "Top Youtube channels - detail stats"), 3 DataFrames are returned, because the second table is split into two table tags between rows 12 and 13 for some reason; but they can all be combined into one DataFrame.
dfList = pd.read_html(url) # OR
# dfList = pd.read_html(page.text) # OR
# dfList = pd.read_html(soup.prettify())
allTime = dfList[0].set_index(['rank', 'Youtuber'])
# (header row in 1st half so 2nd half reads as headerless to pandas)
dfList[2].columns = dfList[1].columns
perYear = pd.concat(dfList[1:]).set_index(['rank', 'Youtuber'])
columns_ordered = [
'started', 'category', 'subscribers', 'subscribers/year',
'video views', 'Video views/Year', 'video count', 'Video count/Year'
] # re-order columns as preferred
combinedDf = pd.concat([allTime, perYear], axis='columns')[columns_ordered]
If the [columns_ordered] part is omitted from the last line, then the expected column order would be 'subscribers', 'video views', 'video count', 'category', 'started', 'subscribers/year', 'Video views/Year', 'Video count/Year'.
combinedDf should then have one row per channel, with the all-time stats and the per-year stats side by side.
You can try to use pd.read_html to read the table into a dataframe:
import pandas as pd
url = "https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en"
df = pd.read_html(url)[0]
print(df)
Prints:
rank Youtuber subscribers video views video count category started
0 1 ✿ Kids Diana Show 106000000 86400421379 1052 People & Blogs 2015
1 2 Movieclips 58500000 59672883333 39903 Film & Animation 2006
2 3 Ryan's World 34100000 53568277882 2290 Entertainment 2015
3 4 Toys and Colors 38300000 44050683425 901 Entertainment 2016
4 5 LooLoo Kids - Nursery Rhymes and Children's Songs 52200000 30758617681 605 Music 2014
5 6 LankyBox 22500000 30147589773 6913 Comedy 2016
6 7 D Billions 24200000 27485780190 582 NaN 2019
7 8 BabyBus - Kids Songs and Cartoons 31200000 25202247059 1946 Education 2016
8 9 FGTeeV 21500000 23255537029 1659 Gaming 2013
...and so on.
I'm scraping a website using Python and I'm having trouble extracting the dates and creating a new Date column with regex.
The code below uses BeautifulSoup to scrape the event data and the event links:
import pandas as pd
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://www.techmeme.com/events').read()
soup = bs.BeautifulSoup(source,'html.parser')
event = []
links = []
# ---Event Data---
for a in soup.find_all('a'):
    event.append(a.text)
df_event = pd.DataFrame(event)
df_event.columns = ['Event']
df_event = df_event.iloc[1:]
# ---Links---
for a in soup.find_all('a', href=True):
    if a.text:
        links.append(a['href'])
df_link = pd.DataFrame(links)
df_link.columns = ['Links']
# ---Combines dfs---
df = pd.concat([df_event.reset_index(drop=True),df_link.reset_index(drop=True)],sort=False, axis=1)
At the beginning of each event data row, the date is present, for example: (May 26-29Augmented World ExpoSan...). The dates follow the formats below, and I have included my regex for each (which I believe is correct).
Different Date Formats:
May 27: [A-Z][a-z]*(\ )[0-9]{1,2}
May 26-29: [A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}
May 28-Jun 2: [A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
Combined
[A-Z][a-z]*(\ )[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
When I try to create a new column and extract the dates using regex, I just receive an empty df['Date'] column.
df['Date'] = df['Event'].str.extract(r'[A-Z][a-z]*(\ )[0-9]{1,2}')
df.head()
Any help would be greatly appreciated! Thank you.
You may use
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
df['Date'] = df['Event'].str.extract(date_reg, expand=False)
See the regex demo. If you want to match as whole words and numbers, you may use (?<![A-Za-z])([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)(?!\d).
Details
[A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
' ' - a space (replace with \s to match any whitespace)
[0-9]{1,2} - one or two digits
(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})? - an optional sequence of:
  - - a hyphen
  (?:[A-Z][a-z]* )? - an optional sequence of:
    [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
    ' ' - a space (replace with \s to match any whitespace)
  [0-9]{1,2} - one or two digits
The (?<![A-Za-z]) construct is a lookbehind that fails the match if there is a letter immediately before the current location, and (?!\d) is a lookahead that fails the match if there is a digit immediately after.
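As a quick sanity check, the pattern can be run against strings shaped like the question's examples (the sample strings below are invented for illustration):

import re

date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
samples = [
    'May 27Earnings: HPQ, BOX',
    'May 26-29Augmented World ExpoSanta Clara',
    'May 28-Jun 2WeAreDevelopers World Congress',
]
for s in samples:
    print(re.search(date_reg, s).group(1))
# May 27
# May 26-29
# May 28-Jun 2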
This script:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.techmeme.com/events'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = []
for row in soup.select('.rhov a'):
    date, event, place = map(lambda x: x.get_text(strip=True), row.find_all('div', recursive=False))
    data.append({'Date': date, 'Event': event, 'Place': place, 'Link': 'https://www.techmeme.com' + row['href']})

df = pd.DataFrame(data)
print(df)
will create this dataframe:
Date Event Place Link
0 May 26-29 NOW VIRTUAL:Augmented World Expo Santa Clara https://www.techmeme.com/gotos/www.awexr.com/
1 May 27 Earnings: HPQ,BOX https://www.techmeme.com/gotos/finance.yahoo.c...
2 May 28 Earnings: CRM, VMW https://www.techmeme.com/gotos/finance.yahoo.c...
3 May 28-29 CANCELED:WeAreDevelopers World Congress Berlin https://www.techmeme.com/gotos/www.wearedevelo...
4 Jun 2 Earnings: ZM https://www.techmeme.com/gotos/finance.yahoo.c...
.. ... ... ... ...
140 Dec 7-10 NEW DATE:GOTO Amsterdam Amsterdam https://www.techmeme.com/gotos/gotoams.nl/
141 Dec 8-10 Microsoft Azure + AI Conference Las Vegas https://www.techmeme.com/gotos/azureaiconf.com...
142 Dec 9-10 NEW DATE:Paris Blockchain Week Summit Paris https://www.techmeme.com/gotos/www.pbwsummit.com/
143 Dec 13-16 NEW DATE:KNOW Identity Las Vegas https://www.techmeme.com/gotos/www.knowidentit...
144 Dec 15-16 NEW DATE, NEW LOCATION:Fortune Brainstorm Tech San Francisco https://www.techmeme.com/gotos/fortuneconferen...
[145 rows x 4 columns]
I am trying to get the total assets values from the 10-K text filings. The problem is that the html format varies from one company to another.
Take Apple's 10-K as an example:
Total assets appears in a table that has a balance sheet header, and typical terms like cash, inventories, ... appear in some rows of that table. In the last row, there is a summation of assets: 290,479 for 2015 and 231,839 for 2014. I want to get the number for 2015 --> 290,479. I have not been able to find a way to:
1) find the relevant table that has some specific headings (like balance sheet) and words in its rows (cash, ...)
2) get the value in the row that has the words total assets and belongs to the greater year (2015 in our example).
import requests
import re
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "xml")
for tag in soup.find_all(text=re.compile(r'Total\sassets')):
    print(tag.findParent('table').findParent('table'))
Using lxml or html.parser instead of xml, I can get:
title > CONSOLIDATED BALANCE SHEETS
row > Total assets
column 0 > Total assets
column 1 >
column 2 > $
column 3 > 290,479
column 4 >
column 5 >
column 6 > $
column 7 > 231,839
column 8 >
using this code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # or 'lxml'

# get all `b` tags to find the title
all_b = soup.find_all('b')
for item in all_b:
    # check the text in every `b`
    title = item.get_text(strip=True)
    if title == 'CONSOLIDATED BALANCE SHEETS':
        print('title >', title)
        # get the first `table` after the `b`
        table = item.parent.findNext('table')
        # all rows in the table
        all_tr = table.find_all('tr')
        for tr in all_tr:
            # all columns in the row
            all_td = tr.find_all('td')
            # text in the first column
            text = all_td[0].get_text(strip=True)
            if text == 'Total assets':
                print('row >', text)
                for i, td in enumerate(all_td):
                    print('column', i, '>', td.get_text(strip=True))
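To pull out just the more recent year's figure (the question's second requirement), one option is to take the first numeric-looking cell after the label. This sketch would sit inside the if text == 'Total assets': branch above; stripping commas before the conversion is an assumption based on the column dump:

# inside the `if text == 'Total assets':` branch
values = [td.get_text(strip=True).replace(',', '') for td in all_td]
# the first numeric cell belongs to the later year (290,479 for 2015 here)
total_assets = next(float(v) for v in values if v.replace('.', '', 1).isdigit())
print('total assets >', total_assets)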
Date Revenue
9-Jan $943,690.00
9-Feb $1,062,565.00
9-Mar $210,079.00
9-Apr -$735,286.00
9-May $842,933.00
9-Jun $358,691.00
9-Jul $914,953.00
9-Aug $723,427.00
9-Sep -$837,468.00
9-Oct -$146,929.00
9-Nov $831,730.00
9-Dec $917,752.00
10-Jan $800,038.00
10-Feb $1,117,103.00
10-Mar $181,220.00
10-Apr $120,968.00
10-May $844,012.00
10-Jun $307,468.00
10-Jul $502,341.00
# This is what I did so far...

# Dependencies
import csv

# Files to load (Remember to change these)
file_to_load = "raw_data/budget_data_2.csv"

totalrev = 0
count = 0

# Read the csv and convert it into a list of dictionaries
with open(file_to_load) as revenue_data:
    reader = csv.reader(revenue_data)
    next(reader)
    for row in reader:
        count += 1
        revenue = float(row[1])
        totalrev += revenue
        for i in range(1, revenue):
            revenue_change = (revenue[i+1] - revenue[i])
            avg_rev_change = sum(revenue_change)/count
            print("avg rev change: ", avg_rev_change)

print("budget_data_1.csv")
print("---------------------------------")
print("Total Months: ", count)
print("Total Revenue:", totalrev)
I have the above data in a CSV file. I am having a problem finding the revenue change, which is revenue of row 1 - row 0, row 2 - row 1, and so on... finally, I want the sum of the total revenue change. I tried with a loop but I guess there is some silly mistake. Please suggest code so I can find my mistake. I am new to Python and coding.
It's unclear whether you can use third-party packages, e.g. pandas, but pandas is great at these types of operations. I would suggest you use its capabilities instead of iterating through, line-by-line.
df is a pandas.DataFrame object. Use pandas.read_csv to load your data into a DataFrame.
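For example, using the path from the question:

>>> import pandas as pd
>>> df = pd.read_csv("raw_data/budget_data_2.csv")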
>>> df
Date Revenue
0 9-Jan $943,690.00
1 9-Feb $1,062,565.00
2 9-Mar $210,079.00
3 9-Apr -$735,286.00
4 9-May $842,933.00
5 9-Jun $358,691.00
6 9-Jul $914,953.00
7 9-Aug $723,427.00
8 9-Sep -$837,468.00
9 9-Oct -$146,929.00
10 9-Nov $831,730.00
11 9-Dec $917,752.00
12 10-Jan $800,038.00
13 10-Feb $1,117,103.00
14 10-Mar $181,220.00
15 10-Apr $120,968.00
16 10-May $844,012.00
17 10-Jun $307,468.00
18 10-Jul $502,341.00
# Remove the dollar sign and any other weird chars
>>> df['Revenue'] = [float(''.join(c for c in row if c in '.1234567890')) for row in df['Revenue']]
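Note that the character filter above also drops the minus sign, which is why -$735,286.00 shows up as a positive 735286.0 in the table below. If the sign matters, keep '-' in the allowed set:

>>> df['Revenue'] = [float(''.join(c for c in row if c in '-.1234567890')) for row in df['Revenue']]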
Use pandas.Series.shift to line up the previous month's value with that of the current month, and subtract the two:
>>> df['Diff'] = df['Revenue'] - df['Revenue'].shift(1)
>>> df
Date Revenue Diff
0 9-Jan 943690.0 NaN
1 9-Feb 1062565.0 118875.0
2 9-Mar 210079.0 -852486.0
3 9-Apr 735286.0 525207.0
4 9-May 842933.0 107647.0
5 9-Jun 358691.0 -484242.0
6 9-Jul 914953.0 556262.0
7 9-Aug 723427.0 -191526.0
8 9-Sep 837468.0 114041.0
9 9-Oct 146929.0 -690539.0
10 9-Nov 831730.0 684801.0
11 9-Dec 917752.0 86022.0
12 10-Jan 800038.0 -117714.0
13 10-Feb 1117103.0 317065.0
14 10-Mar 181220.0 -935883.0
15 10-Apr 120968.0 -60252.0
16 10-May 844012.0 723044.0
17 10-Jun 307468.0 -536544.0
18 10-Jul 502341.0 194873.0
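From there, the average revenue change the question asks for is just the mean of the new column (pandas skips the leading NaN automatically):

>>> df['Diff'].mean()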
import csv

# Files to load (Remember to change these)
file_to_load = "raw_data/budget_data_2.csv"

# Read the csv and convert it into a list of dictionaries
with open(file_to_load) as revenue_data:
    reader = csv.reader(revenue_data)
    # use next() to skip the first title row in the csv file
    next(reader)
    revenue = []
    date = []
    rev_change = []
    # in this loop I sum column 1 (revenue) from the csv file and collect column 0 (dates), which gives the total months
    for row in reader:
        revenue.append(float(row[1]))
        date.append(row[0])

print("Financial Analysis")
print("-----------------------------------")
print("Total Months:", len(date))
print("Total Revenue: $", sum(revenue))

# in this loop I total the differences between consecutive rows of the "Revenue" column to find the total revenue change, and also find the max and min revenue change
for i in range(1, len(revenue)):
    rev_change.append(revenue[i] - revenue[i-1])

avg_rev_change = sum(rev_change)/len(rev_change)
max_rev_change = max(rev_change)
min_rev_change = min(rev_change)
max_rev_change_date = str(date[rev_change.index(max(rev_change))])
min_rev_change_date = str(date[rev_change.index(min(rev_change))])

print("Average Revenue Change: $", round(avg_rev_change))
print("Greatest Increase in Revenue:", max_rev_change_date, "($", max_rev_change, ")")
print("Greatest Decrease in Revenue:", min_rev_change_date, "($", min_rev_change, ")")
Output I got
Financial Analysis
-----------------------------------
Total Months: 86
Total Revenue: $ 36973911.0
Average Revenue Change: $ -5955
Greatest Increase in Revenue: Jun-2014 ($ 1645140.0 )
Greatest Decrease in Revenue: May-2014 ($ -1947745.0 )
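One subtle point in the loop above: rev_change[k] is the change from month k to month k+1, so indexing date with rev_change.index(...) labels each change with the earlier of the two months. If the intent is to report the month the change landed on, add 1 to the index:

max_rev_change_date = str(date[rev_change.index(max_rev_change) + 1])
min_rev_change_date = str(date[rev_change.index(min_rev_change) + 1])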
Here's the site I'm working with: http://www.fantasypros.com/mlb/probable-pitchers.php
What I want to do is run the code every day and have it return a list of the pitchers that are pitching that day, so just the first column. Here's what I have so far.
from bs4 import BeautifulSoup
import requests

url = 'http://www.fantasypros.com/mlb/probable-pitchers.php'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find('table', {'class': 'table table-condensed'})
table2 = table.find('tbody')  # this finds just the rows with pitchers (excludes dates)
daysOnPage = []
for row in table.findAll('th'):
    daysOnPage.append(row.text)
daysOnPage.pop(0)
#print(daysOnPage)
pitchers = []
for row in table2.findAll('a', {'class': 'available mpb-available'}):
    pitchers.append(row.text)
This returns a list of every pitcher on the page. If every cell in the table were always filled, I could do something like deleting every nth player, but that seems pretty inelegant, and it also doesn't work since you never know which cells will be blank. I've looked through the table2.prettify() output but I can't find anything that indicates where a blank cell is coming from.
Thanks for the help.
Edit: Tinkering a little bit, I've figured this much out:
for row in table2.find('tr'):
    for a in row.findAll('a', {'class': 'available mpb-available'}):
        pitchers.append(a.text)
    continue
That prints the first row of pitchers, which is also a problem I was going to tackle later. Why does the continue not make it iterate through the rows?
When I hear table, I think pandas. You can have pandas.read_html do the parsing for you, then use pandas.Series.dropna to return only valid values.
In [1]: import pandas as pd
In [2]: dfs = pd.read_html('http://www.fantasypros.com/mlb/probable-pitchers.php')
In [3]: df = dfs[0].head(10) # get the first dataframe and just use the first 10 teams for this example
In [4]: print(df['Thu Aug 6']) # Selecting just one day by label
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
2 NaN
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
6 NaN
7 NaN
8 NaN
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
In [5]: active = df['Thu Aug 6'].dropna() # now just drop any fields that are NaNs
In [6]: print(active)
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
I suppose the last thing you'll want to do is parse the strings in the table to get just the pitcher's name (a sketch of one way to do that appears at the end of this answer).
If you want to write the Series to a csv, you can do so directly by:
In [7]: active.to_csv('active.csv')
This gives you a csv that looks something like this:
0,#WSHJ. Hellickson(7-6)SP 126
1,MIAM. Wisler(5-1)SP 306
3,#NYYE. Rodriguez(6-3)SP 179
4,SFK. Hendricks(4-5)SP 51
5,STLM. Lorenzen(3-6)SP 301
9,KCB. Farmer(0-2)SP 267
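As for extracting just the pitcher's name, here is a minimal sketch, assuming the cells keep the '#WSHJ. Hellickson(7-6)SP 124' shape shown above:

import re

# The name appears as one initial, a dot, a space, and a surname,
# immediately before the "(W-L)" record, e.g. "J. Hellickson" in "#WSHJ. Hellickson(7-6)".
names = active.map(lambda cell: re.search(r'([A-Z]\. [A-Za-z]+)\(', cell).group(1))
print(names.tolist())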