I'm trying to export the results of this code to a CSV file. I've copied two of the results below, after the code. There are 14 items for each stock, and I'd like to write a CSV file with a column for each of the 14 items and one row for each stock.
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
This is the format of the results, 14 items/columns for each stock.
PTN
Palatin Technologies, Inc.
Healthcare
Diagnostic Substances
USA
240.46M
9.22
193.43M
2.23M
0.76
1.19
7.21%
1,703,285
3
LKM
Link Motion Inc.
Technology
Application Software
China
128.95M
-
50.40M
616.76K
1.73
1.30
16.07%
1,068,798
4
I tried this, but couldn't get it to work:
TextWriter x = File.OpenWrite ("my.csv", ....);
x.WriteLine("Column1,Column2"); // header
x.WriteLine(coups.Cells[0].Text + "," + coups.Cells[1].Text);
This should work:
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)
A few things here:
I used "table-[light|dark]-row-cp" because all rows of interest had one of those classes (and no other rows had them).
There are two separate parts: one is fetching the data in the correct structure, the other is writing the CSV file.
I used the pandas CSV writer because I'm familiar with it, but once you have rectangular data (named "data" here) you can use any other CSV writer.
As a side note, avoid giving variables names like 'sub' or 'link' that shadow built-in or otherwise commonly used names. :)
Hope that helps.
Why don't you use the built-in csv.writer?
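For completeness, a minimal sketch of that approach, applied to the data list built above (the 14 header names are placeholders; replace them with your real column names):
import csv

header = [f'col_{i}' for i in range(1, 15)]  # placeholder names for the 14 items

with open('AAA.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)   # one column per item
    writer.writerows(data)    # one row per stock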
So I am trying to crawl the below data.
The problem is that I don't know how many tr elements are on the page, so I just used range(0, 24). I'm fairly sure there are at least 24, but the code still says the index is out of range.
How do I crawl this website and get all the information (the bilingual text), even if I don't know how many rows there are?
Below is my code.
from bs4 import BeautifulSoup
import requests
url="http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars/"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp_table = soup.find("table", attrs={"class": "table-translations"})
gdp_table_data = gdp_table.tbody.find_all("tr") # contains # rows
for i in range(0, 24):
    for td in gdp_table_data[i].find_all("td"):
        headings = []
        headings.append(td.get_text(strip=True))
    print(headings[1], " | ", headings[2])
You already iterate over each element in gdp_table_data[i].find_all("td"). Use the same idea for the row iteration:
for tr in gdp_table_data:
    for td in tr.find_all("td"):
        ...
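Putting that suggestion together with the rest of the original code, a minimal sketch (reusing the question's variable names, and guarding against rows that don't have three cells) might look like:
for tr in gdp_table_data:
    headings = []
    for td in tr.find_all("td"):
        headings.append(td.get_text(strip=True))
    if len(headings) >= 3:
        # headings[1] is the Mongolian text, headings[2] the English translation
        print(headings[1], " | ", headings[2])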
I guess this is the best solution to save it as a csv:
import pandas as pd
dfs = pd.read_html('http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars/')
df = pd.concat(dfs)
df.to_csv('a.csv')
This saves a CSV file (a.csv) with the data.
Or, to only print it:
import requests
from bs4 import BeautifulSoup
r =requests.get('http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars/')
soup = BeautifulSoup(r.content, 'html.parser')
trs = soup.select('table.table-translations tr')
for tr in trs:
    print(tr.get_text())
prints:
No.
Mongolian text
Loosely translated into English
1.
Зургаан мөнгөн мичид
Six silver stars
2.
Эрт урьд цагт зургаан өнчин хүүхэд товцог толгой дээр наадан суудаг юм санжээ.
Long ago, there were six orphan brothers playing on the top of a hill.
Тэгсэн чинь ах нь нэг өдөр хэлж:
One day the oldest brother said:
and so on...
This script will write all translations to data.csv:
import csv
import requests
from bs4 import BeautifulSoup
url = 'http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('.table-translations tr')[1:]:
    mongolian, english = map(lambda t: t.get_text(strip=True), row.select('td')[1:])
    all_data.append((mongolian, english))

with open('data.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        spamwriter.writerow(row)
This creates data.csv with one Mongolian/English sentence pair per row.
Everything works as expected when I'm using a single URL for the url variable to scrape, but I'm not getting any results when attempting to read the links from a CSV. Any help is appreciated.
Info about the CSV:
One column with a header called "Links"
300 rows of links, with no spaces, commas, semicolons, or other characters before/after the links
One link in each row (a sample layout is sketched below)
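Based on that description, urls.csv would presumably look something like this (the links below are hypothetical placeholders, not real ones):
Links
https://example.com/app-page-1
https://example.com/app-page-2
https://example.com/app-page-3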
import requests # required to make request
from bs4 import BeautifulSoup # required to parse html
import pandas as pd
import csv
with open("urls.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
res = requests.get(link['Links'])
#print(res.url)
url = res
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
email_elm0 = soup.find_all(class_= "app-support-list__item")[0].text.strip()
email_elm1 = soup.find_all(class_= "app-support-list__item")[1].text.strip()
email_elm2 = soup.find_all(class_= "app-support-list__item")[2].text.strip()
email_elm3 = soup.find_all(class_= "app-support-list__item")[3].text.strip()
final_email_elm = (email_elm0,email_elm1,email_elm2,email_elm3)
print(final_email_elm)
df = pd.DataFrame(final_email_elm)
#getting an output in csv format for the dataframe we created
#df.to_csv('draft_part2_scrape.csv')
The problem lies in this part of the code:
with open("urls.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
res = requests.get(link['Links'])
...
After the loop is executed, res only holds the response for the last link, so this program will only scrape the last link.
To solve this problem, store all the links in a list and iterate over that list to scrape each link. You can store each scraped result in a separate dataframe and concatenate them at the end to write a single file:
import requests # required to make request
from bs4 import BeautifulSoup # required to parse html
import pandas as pd
import csv
links = []
with open("urls.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
links.append(link['Links'])
dfs = []
for url in links:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
email_elm0 = soup.find_all(class_="app-support-list__item")[0].text.strip()
email_elm1 = soup.find_all(class_="app-support-list__item")[1].text.strip()
email_elm2 = soup.find_all(class_="app-support-list__item")[2].text.strip()
email_elm3 = soup.find_all(class_="app-support-list__item")[3].text.strip()
final_email_elm = (email_elm0, email_elm1, email_elm2, email_elm3)
print(final_email_elm)
dfs.append(pd.DataFrame(final_email_elm))
#getting an output in csv format for the dataframe we created
df = pd.concat(dfs)
df.to_csv('draft_part2_scrape.csv')
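As a side note, the four repeated find_all calls inside the loop could be collapsed into a single lookup. A minimal sketch of that part of the loop, under the same assumption that each page has at least four elements with the app-support-list__item class (items is just a local name introduced here):
    items = soup.find_all(class_="app-support-list__item")
    # take the text of the first four matched elements for this page
    final_email_elm = tuple(item.text.strip() for item in items[:4])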
I'm creating a web scraper that will let me get the odds of upcoming UFC Fights on William Hill. I'm using beautiful soup but have yet been able to successfully scrape the needed data. (https://sports.williamhill.com/betting/en-gb/ufc)
I need the fighters names and their odds.
I've attempted a variety of methods to try to get the data, scraping different tags etc., but nothing happens.
def scrape_data():
    data = requests.get("https://sports.williamhill.com/betting/en-gb/ufc")
    soup = BeautifulSoup(data.text, 'html.parser')
    links = soup.find_all('a', {'class': 'btmarket__name btmarket__name--featured'}, href=True)
    for link in links:
        links.append(link.get('href'))
    for link in links:
        print(f"Now currently scraping link: {link}")
        data = requests.get(link)
        soup = BeautifulSoup(data.text, 'html.parser')
        time.sleep(1)
        fighters = soup.find_all('p', {'class': "btmarket__name"})
        c = fighters[0].text.strip()
        d = fighters[1].text.strip()
        f1.append(c)
        f2.append(d)
        odds = soup.find_all('span', {'class': "betbutton_odds"})
        a = odds[0].text.strip()
        b = odds[1].text.strip()
        f1_odds.append(a)
        f2_odds.append(b)
    return None
I would expect it to be exported to a CSV file. I'm currently using Morph.io to host and run the scraper, but it returns nothing.
If correct, it would output:
Fighter1Name:
Fighter2Name:
F1Odds:
F2Odds:
For every available fight.
Any help would be greatly appreciated.
The HTML returned has different attributes and values from the ones you are targeting. You need to inspect the actual response.
When writing out to CSV, you will want to prefix the odds with "'" so they are not treated as fractions or dates when opened in Excel (for example, 5/2 may otherwise be converted to the date 2-May). See the commented-out alternatives in the code below.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://sports.williamhill.com/betting/en-gb/ufc')
soup = bs(r.content, 'lxml')
results = []
for item in soup.select('.btmarket:has([data-odds])'):
    match_name = item.select_one('.btmarket__name[title]')['title']
    odds = [i['data-odds'] for i in item.select('[data-odds]')]
    row = {'event-starttime': item.select_one('[datetime]')['datetime'],
           'match_name': match_name,
           'home_name': match_name.split(' vs ')[0],
           #'home_odds': "'" + str(odds[0]),
           'home_odds': odds[0],
           'away_name': match_name.split(' vs ')[1],
           'away_odds': odds[1],
           #'away_odds': "'" + str(odds[1]),
           }
    results.append(row)
df = pd.DataFrame(results, columns = ['event-starttime','match_name','home_name','home_odds','away_name','away_odds'])
print(df.head())
#write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://basketball.realgm.com/ncaa/conferences/Big-12-
Conference/3/Kansas/54/nba-players"
# get permission
response = requests.get(url)
# access html files
soup = BeautifulSoup(response.text, 'html.parser')
# creating data frame
columns = ['Player', 'Position', 'Height', 'Weight', 'Draft Year', 'NBA
Teams', 'Years', 'Games Played','Points Per Game', 'Rebounds Per Game',
'Assists Per Game']
df = pd.DataFrame(columns=columns)
table = soup.find(name='table', attrs={'class': 'tablesaw','data-
tablesaw-mode':'swipe','id': 'table-6615'}).tbody
trs = table.find('tr')
# rewording html
for tr in trs:
tds = tr.find_all('td')
row = [td.text.replace('\n', '')for td in tds]
df = df.append(pd.Series(row, index=columns), ignore_index=True)
df.to_csv('kansas_player', index=False)
I expected a CSV file to be created in my desktop directory.
It looks like your soup.find(...) call cannot find the 'table', and that might be why you get a None type returned. Here is my change; you can tailor it to fit your CSV export needs:
from bs4 import BeautifulSoup
import urllib.request
url = "https://basketball.realgm.com/ncaa/conferences/Big-12-Conference/3/Kansas/54/nba-players"
# get permission
response = urllib.request.urlopen(url)
# access html files
html = response.read()
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "tablesaw"})
At this point, table holds the full content of the table element. From there, you can extract the row information with something like:
for tr in table.findAll('tr'):
    tds = tr.find_all('td')
    row = [td.text.replace('\n', '') for td in tds]
    ...
Each row will then be a list of the cell texts for one player. Finally, you can write each row into the CSV with or without pandas; that's your call.
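If you'd rather skip pandas, a minimal sketch using the standard-library csv module, reusing the table from above and the column names from the original question:
import csv

columns = ['Player', 'Position', 'Height', 'Weight', 'Draft Year', 'NBA Teams', 'Years',
           'Games Played', 'Points Per Game', 'Rebounds Per Game', 'Assists Per Game']

with open('kansas_player.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(columns)  # header row
    for tr in table.findAll('tr'):
        tds = tr.find_all('td')
        row = [td.text.replace('\n', '') for td in tds]
        if row:  # skip rows without td cells (e.g. the header row)
            writer.writerow(row)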
I'm working on a web scraper for Spotify Charts to extract the top 200 daily songs each day. I have done everything to extract the data I'm interested in, including rank, artist, track title, and stream numbers. What I'm stuck on is putting everything into a DataFrame to export as a CSV for Excel. Right now when I print my DataFrame, it treats each cycle as 1 row with 4 columns, as opposed to 200 rows with 4 columns.
I'm not sure what the issue is, as I've tried just about everything and looked into it as much as I could. I know something is wrong with the indexing, because each "what should be a row" has the same first "0" index, when they should go sequentially to 199. Also, the column names for my DataFrame keep repeating after each "what should be a row", so I know there is definitely an issue there.
import requests
from bs4 import BeautifulSoup
from datetime import date, timedelta
from time import time
from time import sleep
from random import randint
import pandas as pd
import numpy as np
base_url = 'https://spotifycharts.com/regional/global/daily/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
chart = soup.find('table', {'class': 'chart-table'})
tbody = chart.find('tbody')
for tr in tbody.find_all('tr'):
    rank_text = []
    rank_text_elem = tr.find('td', {'class': 'chart-table-position'})
    for item in rank_text_elem:
        rank_text = []
        rank_text.append(item)
    artist_text = []
    artist_text_elem = tr.find('td', {'class': 'chart-table-track'}).find_all('span')
    for item in artist_text_elem:
        artist_text = []
        artist_text.append(item.text.replace('by ', '').strip())
    title_text = []
    title_text_elem = tr.find('td', {'class': 'chart-table-track'}).find_all('strong')
    for item in title_text_elem:
        title_text = []
        title_text.append(item.text)
    streams_text = []
    streams_text_elem = tr.find('td', {'class': 'chart-table-streams'})
    for item in streams_text_elem:
        streams_text = []
        streams_text.append(item)
    # creating dataframe to store 4 variables
    list_of_data = list(zip(rank_text, artist_text, title_text, streams_text))
    df = pd.DataFrame(list_of_data, columns=['Rank', 'Artist', 'Title', 'Streams'])
    print(df)
Basically, I'm trying to create a dataframe that holds 4 variables in each row, for 200 rows, for each date of the Spotify global charts. Please ignore some of the modules and libraries I've included at the top; they are used for iterating through each page of the historical data based on dynamic URLs, which I have already figured out. Any help is greatly appreciated! Thank you!
Before the for loop, I create the list all_rows.
Inside the for loop, I append a list with a single row of data to all_rows.
After the for loop, I use all_rows to create the DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd
base_url = 'https://spotifycharts.com/regional/global/daily/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
chart = soup.find('table', {'class': 'chart-table'})
tbody = chart.find('tbody')
all_rows = []
for tr in tbody.find_all('tr'):
    rank_text = tr.find('td', {'class': 'chart-table-position'}).text
    artist_text = tr.find('td', {'class': 'chart-table-track'}).find('span').text
    artist_text = artist_text.replace('by ', '').strip()
    title_text = tr.find('td', {'class': 'chart-table-track'}).find('strong').text
    streams_text = tr.find('td', {'class': 'chart-table-streams'}).text
    all_rows.append([rank_text, artist_text, title_text, streams_text])
# after `for` loop
df = pd.DataFrame(all_rows, columns=['Rank','Artist','Title','Streams'])
print(df.head())
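Since the goal was a CSV for Excel, you can then write the DataFrame out; a minimal sketch (the filename here is just an example):
df.to_csv('spotify_daily.csv', index=False, encoding='utf-8-sig')
The encoding argument is optional; utf-8-sig simply helps Excel display non-ASCII characters correctly.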
You could use pandas and requests:
import pandas as pd
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url ='https://spotifycharts.com/regional/global/daily/'
r = requests.get(url, headers = headers).content
table = pd.read_html(r)[0] #transfer html to pandas
table.dropna(axis = 1, how = 'all', inplace = True) #drop nan column
table[['Title','Artist']] = table['Unnamed: 3'].str.split(' by ',expand=True) #split title artist strings into two columns
del table['Unnamed: 3'] #remove combined column
table = table[['Track', 'Artist','Title', 'Unnamed: 4']] #re-order cols
table.columns= ['Rank', 'Artist','Title', 'Streams'] #rename cols
print(table)