Hi, I'm having trouble understanding a few things when it comes to loops and searching through JSON. I want to get the JSON from a website, retrieve the 25 ['body'] items on that page, then move on to a new JSON page with new ['body'] items and retrieve those as well. Finally, I want to write all of the data to a .txt file.
Here's my code
import json
import requests

# Settings
user_id = 29851266
page_num = 1

# Finds user data
max_p_f = requests.get('http://someforum/users/' + str(user_id) + '/posts.json?page=' + str(page_num))
json_string = max_p_f.text
obj = json.loads(json_string)
max_page = obj['meta']['max_page']
current_page = obj['meta']['page']
posts_count = obj['meta']['posts_count']
username = obj['users'][0]['username']

count = 0
start_page = 1
while page_num <= max_page:
    requests.get('http://www.someforum/users/' + str(user_id) + '/posts.json?page=' + str(page_num))
    page_num += 1
    print("Page " + str(start_page + 1) + " complete")
    for x in range(0, 25):
        data = obj['posts'][x]['body']
        file = open(username + "_postdata.txt", 'a')
        file.write("\n ==============" + str(count) + "==================\n")
        file.write(data)
        count += 1
        file.close()
I want the code to give me the 25 ['body'] values from the JSON on the first page, then go to the second page and retrieve its 25 ['body'] values. The problem is that the text file only ever contains the first page's 25 ['body'] values, repeated over and over until the while loop finishes.
I would start by using requests' native .json() method instead of converting the response from text to JSON, so it would be:
requests.get('http://www.someforum/users/'+str(user_id)+'/posts.json?page='+str(page_num)).json()
Also, you're just issuing the request inside the loop; you're not actually saving a new obj with the new page number, so obj still holds the first page's data.
so outside your loop:
max_p_f = 'http://someforum/users/'+str(user_id)+'/posts.json?page='
and inside your loop it should be:
obj = requests.get(max_p_f +str(page_num)).json()
Here is a sample snippet, how I would do something very similar:
base_url = 'http://somewebsite/bunchofbjectsonapage.json?page='
max_page = 3
current_page = 0
while current_page < max_page:
    current_page = current_page + 1
    obj = requests.get(base_url + str(current_page)).json()
    for item in obj:
        name = item['company_name']
        cat = item['category']
        print([name, cat])
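Applied to the original code, a minimal sketch of the corrected loop might look like this (the field names, the metadata keys, and the output file naming are taken from the question; everything else is an assumption):

import requests

user_id = 29851266
base_url = 'http://someforum/users/' + str(user_id) + '/posts.json?page='

# first request just to read the metadata and the username
obj = requests.get(base_url + '1').json()
max_page = obj['meta']['max_page']
username = obj['users'][0]['username']

count = 0
with open(username + '_postdata.txt', 'a') as out_file:
    for page_num in range(1, max_page + 1):
        # re-fetch and re-parse obj for every page; otherwise the
        # first page's posts get written over and over again
        obj = requests.get(base_url + str(page_num)).json()
        for post in obj['posts']:
            out_file.write('\n ==============' + str(count) + '==================\n')
            out_file.write(post['body'])
            count += 1
        print('Page ' + str(page_num) + ' complete')

Iterating over obj['posts'] directly also avoids assuming there are exactly 25 posts on the last page.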
Related
I am using Selenium to crawl some Facebook group information:
with open("groups.txt") as file:
lines = file.readlines()
total = len(lines)
count = 1
for line in lines:
group_id = line.strip().split(".com/")[1]
if "groups" not in line:
new_line = "https://www.facebook.com/groups/" + str(group_id) + "/about"
else:
new_line = line.strip() + '/about'
sleep(2)
driver.get(new_line)
page_source = driver.page_source
page_id = page_source.split('"groupID":"')[1].split('","')[0]
page_followers = page_source.split('<!-- --> total members')[0][-15:]
page_followers = str(page_followers.split('>')[1]).replace(',', '')
page_name = page_source.split("</title>")[0].split("<title>")[1]
df1.loc[len(df1)] = [line.strip(), 'https://www.facebook.com/' + str(page_id), page_followers, page_name]
print(f"{count}/{total}", line.strip(), 'https://www.facebook.com/' + str(page_id), page_followers)
count += 1
df1.to_csv("groups.csv", encoding='utf-8', index=False, header=False)
Facebook has updated something recently, so this code fails to return the number of group members.
These are the relevant lines:
page_followers = page_source.split('<!-- --> total members')[0][-15:]
page_followers = str(page_followers.split('>')[1]).replace(',', '')
Taking view-source:https://www.facebook.com/groups/764385144252353/about as an example, I find two instances of "total members". Is it possible to get some advice on what I should change to be able to catch this number?
NEW
This code extracts the exact number of members and converts it from a string to an integer:
driver.get('https://www.facebook.com/groups/410943193806268/about')
members = driver.find_element(By.XPATH, "//span[contains(text(), 'total members')]").text
members = int(''.join(i for i in members if i.isdigit()))
print(members)
output
15589
OLD
I suggest not using page_source to extract this kind of data; instead, use find_element like this:
driver.find_element(By.CSS_SELECTOR, "a[href*='members']").text.split()[0]
output
'186'
Explanation: a[href*='members'] searches for <a> elements (for example <a class='test'>...</a>) whose href attribute contains the string members (for example ...).
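For completeness, a self-contained sketch of that OLD approach (the imports and driver setup are assumptions, and the group URL is just the one from the question; as noted above, the XPath method in the NEW section is the one reported to still work after Facebook's markup change):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.facebook.com/groups/764385144252353/about')

# a[href*='members'] matches <a> tags whose href contains the string "members"
members_text = driver.find_element(By.CSS_SELECTOR, "a[href*='members']").text
members = int(members_text.split()[0].replace(',', ''))
print(members)
driver.quit()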
I am trying to scrape a bunch of baseball statistics and get all of that data into separate DataFrames so I can use it for my project. I am able to get all of the data, but I am having trouble figuring out how to store it in variables and slice it accordingly.
def parse_row(rows):
    return [str(x.string) for x in rows.find_all('td')]

def soop(url):
    page = requests.get(url)
    text = soup(page.text, features='lxml')
    row = text.find_all('tr')
    data = [parse_row(rows) for rows in row]
    df = pd.DataFrame(data)
    df = df.dropna()
    if dp_num in url:
        df.columns = dp_col
    elif sb_num in url:
        df.columns = sb_col
    elif hr_num in url:
        df.columns = hr_col
    elif obp_num in url:
        df.columns = obp_col
    elif b2_num in url:
        df.columns = b2_col
    elif b3_num in url:
        df.columns = b3_col
    elif era_num in url:
        df.columns = era_col
    elif fld_num in url:
        df.columns = fld_col
    else:
        print('error')
    return df

# ncaa scraping function
def scrape(id_num):
    loop = 1
    page_num = 2
    page_numii = 2
    page_numiii = 2
    url = 'https://www.ncaa.com/stats/baseball/d1/current/team/' + id_num
    dii_url = 'https://www.ncaa.com/stats/baseball/d2/current/team/' + id_num
    diii_url = 'https://www.ncaa.com/stats/baseball/d3/current/team/' + id_num
    while loop == 1:  # first di page
        df = soop(url)
        loop += 1
        print(df)
    while loop <= 6:  # number of remaining di pages
        df = soop(url + '/p' + str(page_num))
        page_num += 1
        loop += 1
        print(df)
    while loop == 7:  # first d2 page
        df = soop(dii_url)
        loop += 1
        print(df)
    while loop <= 11:  # remaining d2 pages
        df = soop(dii_url + '/p' + str(page_numii))
        page_numii += 1
        loop += 1
        print(df)
    while loop == 12:  # first diii page
        df = soop(diii_url)
        loop += 1
        print(df)
    while loop < 20:  # remaining d3 pages
        df = soop(diii_url + '/p' + str(page_numiii))
        page_numiii += 1
        loop += 1
        print(df)
All of the code works and I get no errors, but I would like to store the data it prints out in variables instead of printing it, and then have a separate DataFrame for each stat page I scraped. I have no clue where to start; I have seen on here that maybe I should try appending to a list? I am a statistics major in college and I am pretty new to programming. Any help is appreciated.
To store the DataFrames in variables, you would have to construct a list or dictionary to hold them.
With that being said, I probably wouldn't store the tables in variables, but rather write them to a database or CSV files so that you have the data locally available. Otherwise you'd have to run the scrape every time to get the data. Pandas can handle that for you (as well as parse the tables with .read_html()).
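If you do want them in memory, here is a minimal sketch of collecting each page's table into a dictionary keyed by stat name (the stat URLs below are placeholders, not real NCAA ids):

import pandas as pd

# hypothetical stat pages; substitute the real NCAA stat URLs
stat_urls = {
    'era': 'https://www.ncaa.com/stats/baseball/d1/current/team/<era_id>',
    'home_runs': 'https://www.ncaa.com/stats/baseball/d1/current/team/<hr_id>',
}

dataframes = {}
for stat_name, url in stat_urls.items():
    # read_html parses every <table> on the page; the stats table is the first one
    dataframes[stat_name] = pd.read_html(url)[0]

print(dataframes['era'].head())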
Not sure exactly what data you want or how you want it (I'm also surprised not to see an API here to get that data), but this will grab it and store it into folders with the following structure:
-data
    -d1
        -INDIVIDUAL STATISTICS
            csv files
            ...
        -TEAM STATISTICS
            csv files
            ...
    -d2
        -INDIVIDUAL STATISTICS
            csv files
            ...
        -TEAM STATISTICS
            csv files
            ...
    -d3
        -INDIVIDUAL STATISTICS
            csv files
            ...
        -TEAM STATISTICS
            csv files
            ...
The code looks like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

statsIds_dict = {}
for division in [1, 2, 3]:
    statsIds_dict[f'd{division}'] = {}
    url = f'https://www.ncaa.com/stats/baseball/d{division}/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    statsIds = soup.find_all('div', {'class': 'stats-header__filter'})
    for each in statsIds:
        statsType = each.text.split('\n')[0]
        statsIds_dict[f'd{division}'][statsType] = {}
        options = each.find_all('option')
        for option in options:
            if option['value']:
                statsIds_dict[f'd{division}'][statsType][option.text] = 'https://www.ncaa.com' + option['value']

for division, v1 in statsIds_dict.items():
    for statsType, v2 in v1.items():
        for statTitle, link in v2.items():
            response = requests.get(link)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                totPages = int(soup.find('ul', {'class': 'stats-pager'}).find_all('li')[-2].text)
            except:
                totPages = 1
            df = pd.read_html(link)[0]
            print(link)
            for page in range(2, totPages + 1):
                temp_df = pd.read_html(link + f'/p{page}')[0]
                print(link + f'/p{page}')
                df = df.append(temp_df).reset_index(drop=True)
            path = f'data/{division}/{statsType}'
            # Check whether the specified path exists or not
            isExist = os.path.exists(path)
            if not isExist:
                # Create a new directory because it does not exist
                os.makedirs(path)
                print(f"The data/{division}/{statsType} directory is created!")
            df.to_csv(f'data/{division}/{statsType}/{division}_{statsType}_{statTitle}.csv', index=False)
            print(f'Saved: {division} {statsType} {statTitle}')
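One caveat if you run this on a current environment: DataFrame.append was removed in pandas 2.0, so the accumulation line would need to become df = pd.concat([df, temp_df]).reset_index(drop=True); everything else should work unchanged.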
import pandas as pd
import requests
import json
import datetime
import csv

def get_pushshift_data(after, before, sub):
    url = 'https://api.pushshift.io/reddit/search/submission/?&after=' + str(after) + '&before=' + str(before) + '&subreddit=' + str(sub) + '&sort=asc&sort_type=created_utc&size=400'
    print(url)
    r = requests.get(url).json()
    # data = json.loads(r.text, strict=False)
    return r['data']

def collect_subData(subm):
    subData = list()  # list to store data points
    title = subm['title']
    url = subm['url']
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"
    try:
        # returns the body of the posts
        body = subm['selftext']
    except KeyError:
        body = ''
    author = subm['author']
    subId = subm['id']
    score = subm['score']
    created = datetime.datetime.fromtimestamp(subm['created_utc'])  # 1520561700.0
    numComms = subm['num_comments']
    permalink = subm['permalink']
    subData.append((subId, title, body, url, author, score, created, numComms, permalink, flair))
    subStats[subId] = subData

def update_subFile():
    upload_count = 0
    location = "subreddit_data_uncleaned/"
    print("Input filename of submission file, please add .csv")
    filename = input()
    file = location + filename
    with open(file, 'w', newline='', encoding='utf-8') as file:
        a = csv.writer(file, delimiter=',')
        headers = ["Post ID", "Title", "Body", "Url", "Author", "Score", "Publish Date", "Total No. of Comments", "Permalink", "Flair"]
        a.writerow(headers)
        for sub in subStats:
            a.writerow(subStats[sub][0])
            upload_count += 1
        print(str(upload_count) + " submissions have been uploaded into a csv file")

# global dictionary to hold 'subData'
subStats = {}
# tracks no. of submissions
subCount = 0
# Subreddit to query
sub = 'politics'
# Unix timestamp of date to crawl from.
before = int(datetime.datetime(2021, 5, 17, 0, 0).timestamp())
after = int(datetime.datetime(2014, 1, 1, 0, 0).timestamp())

data = get_pushshift_data(after, before, sub)
while len(data) > 0:
    for submission in data:
        collect_subData(submission)
        subCount += 1
    # Calls getPushshiftData() with the created date of the last submission
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    after = data[-1]['created_utc']
    data = get_pushshift_data(after, before, sub)

print(len(data))
update_subFile()
The first call to get_pushshift_data(after, before, sub) scrapes the data with no error. But when the same call is made again inside the while loop, with a different after value (type: int), the program raises JSONDecodeError: Expecting value: line 1 column 1 (char 0).
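A hedged way to see what is going wrong (this is only a diagnostic sketch, not a fix for Pushshift itself) is to check the raw response before calling .json(), since a decode error at char 0 usually means the body is empty or an HTML error page rather than JSON:

import time
import requests

def get_pushshift_data(after, before, sub, retries=3):
    url = ('https://api.pushshift.io/reddit/search/submission/'
           '?&after=' + str(after) + '&before=' + str(before) +
           '&subreddit=' + str(sub) + '&sort=asc&sort_type=created_utc&size=400')
    for attempt in range(retries):
        r = requests.get(url)
        # an empty body or a non-200 status will blow up inside .json()
        if r.status_code == 200 and r.text.strip():
            return r.json()['data']
        print('Bad response (status ' + str(r.status_code) + '), retrying...')
        time.sleep(2)
    return []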
I have a script to extract data from here: http://espn.go.com/nba/statistics/player/_/stat/scoring-per-48-minutes/
Part of obtaining the data in the script looks like this:
pts_start = data.find('">',mpg_end) + 2
pts_end = data.find('<',pts_start)
store.append(data[pts_start:pts_end])
mf_start = data.find(' >',pts_end) + 2
mf_end = data.find('<',mf_start)
store.append(data[mf_start:mf_end])
fg_start = data.find(' >',mf_end) + 2
fg_end = data.find('<',fg_start)
store.append(data[fg_start:fg_end])
I see that the names like fg and pts correspond to the table headlines, but I don't understand why certain ones are abbreviated in the script.
I want to modify the script to obtain the headlines on this table: http://espn.go.com/nba/statistics/player/_/stat/rebounds. I tried doing this by just plugging in the names as they appear at the top of the table but the resulting CSV file had missing information.
Full code :
import os
import csv
import time
import urllib2
uri = 'http://espn.go.com/nba/statistics/player/_/stat/scoring-per-48-minutes'
def get_data():
try:
req = urllib2.Request(uri)
response = urllib2.urlopen(req, timeout=600)
content = response.read()
return content
except Exception, e:
print "\n[!] Error: " + str(e)
print ''
return False
def extract(data,rk):
print '\n[+] Extracting data.'
start = 0
while True:
store = [rk]
if data.find('nba/player/',start) == -1:
break
with open("data.csv", "ab") as fcsv:
main = data.find('nba/player/',start)
name_start = data.find('>',main) + 1
name_end = data.find('<',name_start)
store.append(data[name_start:name_end])
team_start = data.find('">',name_end) + 2
team_end = data.find('<',team_start)
store.append(data[team_start:team_end])
gp_start = data.find(' >',team_end) + 2
gp_end = data.find('<',gp_start)
store.append(data[gp_start:gp_end])
mpg_start = data.find(' >',gp_end) + 2
mpg_end = data.find('<',mpg_start)
store.append(data[mpg_start:mpg_end])
pts_start = data.find('">',mpg_end) + 2
pts_end = data.find('<',pts_start)
store.append(data[pts_start:pts_end])
mf_start = data.find(' >',pts_end) + 2
mf_end = data.find('<',mf_start)
store.append(data[mf_start:mf_end])
fg_start = data.find(' >',mf_end) + 2
fg_end = data.find('<',fg_start)
store.append(data[fg_start:fg_end])
m3_start = data.find(' >',fg_end) + 2
m3_end = data.find('<',m3_start)
store.append(data[m3_start:m3_end])
p3_start = data.find(' >',m3_end) + 2
p3_end = data.find('<',p3_start)
store.append(data[p3_start:p3_end])
ft_start = data.find(' >',p3_end) + 2
ft_end = data.find('<',ft_start)
store.append(data[ft_start:ft_end])
ftp_start = data.find(' >',ft_end) + 2
ftp_end = data.find('<',ftp_start)
store.append(data[ftp_start:ftp_end])
start = name_end
rk = rk + 1
csv.writer(fcsv).writerow(store)
fcsv.close()
def main():
print "\n[+] Initializing..."
if not os.path.exists("data.csv"):
with open("data.csv", "ab") as fcsv:
csv.writer(fcsv).writerow(["RK","PLAYER","TEAM","GP", "MPG","PTS","FGM-FGA","FG%","3PM-3PA","3P%","FTM-FTA","FT%"])
fcsv.close()
rk = 1
global uri
while True:
time.sleep(1)
start = 0
print "\n[+] Getting data, please wait."
data = get_data()
if not data:
break
extract(data,rk)
print "\n[+] Preparing for next page."
time.sleep(1.5)
rk = rk + 40
if rk > 300:
print "\n[+] All Done !\n"
break
uri = 'http://espn.go.com/nba/statistics/player/_/stat/scoring-per-48-minutes/sort/avg48Points/count/' + str(rk)
if __name__ == '__main__':
main()
I specifically want to know how to grab info based on the headlines, like TEAM, GP, MPG, PTS, FGM-FGA, FG%, 3PM-3PA, 3P%, FTM-FTA, FT%, so that the script doesn't need to be changed beyond the names like pts or mpg in lines such as pts_start = data.find('">',mpg_end) + 2.
I don't understand why I can't just use the headline names as they appear in the table for certain columns. For example, instead of FTM-FTA, the script uses ft.
Extracting HTML data is rather easy with BeautifulSoup. The following example is meant to give you the idea rather than a complete solution to your problem, but you can easily extend it.
from bs4 import BeautifulSoup
import urllib2

def get_html_page_dom(url):
    response = urllib2.urlopen(url)
    html_doc = response.read()
    return BeautifulSoup(html_doc, 'html5lib')

def extract_rows(dom):
    table_rows = dom.select('.mod-content tbody tr')
    for tr in table_rows:
        # skip headers
        klass = tr.get('class')
        if klass is not None and 'colhead' in klass:
            continue
        tds = tr.select('td')
        yield {'RK': tds[0].string,
               'PLAYER': tds[1].select('a')[0].string,
               'TEAM': tds[2].string,
               'GP': tds[3].string
               # you can fetch the rest of the indexes for the corresponding headers
               }

if __name__ == '__main__':
    dom = get_html_page_dom('http://espn.go.com/nba/statistics/player/_/stat/scoring-per-48-minutes/')
    for data in extract_rows(dom):
        print(data)
You can simply run and see the result ;).
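If the goal is to grab the cells by the table headlines instead of hard-coded indexes, one hedged extension of extract_rows (assuming the colhead rows carry the column names) could be:

def extract_rows_by_header(dom):
    table_rows = dom.select('.mod-content tbody tr')
    headers = None
    for tr in table_rows:
        klass = tr.get('class')
        if klass is not None and 'colhead' in klass:
            # remember the column names from the header row
            headers = [td.get_text() for td in tr.select('td')]
            continue
        cells = [td.get_text() for td in tr.select('td')]
        if headers is not None:
            # keys become the headlines, e.g. 'FGM-FGA' or 'FT%'
            yield dict(zip(headers, cells))

That way abbreviations like ft or fg are never needed; the dict keys come straight from the page.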
I am trying to open a webpage and scrape some strings from it into a list. The list would ultimately be populated by all of the names displayed on the webpage. In trying to do so, my code looks like this:
import xlsxwriter, urllib.request, string, http.cookiejar, requests

def main():
    username = 'john.mauran'
    password = 'fZSUME1q'
    log_url = 'https://aries.case.com.pl/'
    dest_url = 'https://aries.case.com.pl/main_odczyt.php?strona=eksperci'
    login_values = {'username': username, 'password': password}
    r = requests.post(dest_url, data=login_values, verify=False, allow_redirects=False)
    open_sesame = r.text
    # reads the expert page
    readpage_list = open_sesame.splitlines()
    # opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')
    # adds worksheet to file
    worksheet = workbook.add_worksheet()
    # initializing the variable used to move names and dates
    # in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""
    # initializing expert attribute variables and lists
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []
    # this loop goes through and finds all the lines
    # that contain the expert URL and name and saves them to raw_list
    for i in open_sesame:
        if '<tr><td align=left><a href=' in i:
            raw_list += i
    if not raw_list:
        print("List is empty")
    if raw_list:
        print(raw_list)

main()
As you can see, all I want to do is take the lines from the text returned by the Requests operation which start with the following characters '
I don't know exactly what you're trying to do, but this doesn't make any sense:
for i in open_sesame:
    if '<tr><td align=left><a href=' in i:
        raw_list += i
First of all, if you iterate over open_sesame, which is a string, each item in the iteration will be a character in the string. Then '<tr><td align=left><a href=' in i will always be false.
Second of all, raw_list += i is not how you append an item to a list.
Finally, why is the variable called open_sesame? Is it a joke?
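For what it's worth, here is a minimal sketch of what was probably intended, iterating over lines (the code already computes open_sesame.splitlines()) and using append:

raw_list = []
for line in open_sesame.splitlines():
    if '<tr><td align=left><a href=' in line:
        raw_list.append(line)  # append adds the whole line; += would add each character

if not raw_list:
    print("List is empty")
else:
    print(raw_list)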