I can't get wanted parameters through the json response (web-scraping)


I'm trying to extract data from the JSON response behind this page: https://www.bienici.com/recherche/achat/france?page=2
I have two problems:
- First, I want to scrape each listing's parameters (price, area, city, zip code), but I don't know how.
- Second, I want a loop that goes through all the pages, up to page 100.
This is the program:
import requests
from pandas.io.json import json_normalize
import csv
payload = {'filters': '{"size":24,"from":0,"filterType":"buy","newProperty":false,"page":2,"resultsPerPage":24,"maxAuthorizedResults":2400,"sortBy":"relevance","sortOrder":"desc","onTheMarket":[true],"limit":"ih{eIzjhZ?q}qrAzaf}AlrD?rvfrA","showAllModels":false,"blurInfoType":["disk","exact"]}'}
url = 'https://www.bienici.com/realEstateAds.json'
response = requests.get(url, params = payload).json()
with open("selog.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for prop in response['realEstateAds']:
        title = prop['title']
        city = prop['city']
        desc = prop['description']
        price = prop['price']
        df = json_normalize(response['realEstateAds'])
        df.to_csv('selog.csv', index=False)
        writer.writerow([price,title,city,desc])

Hi, the first thing I notice is that you're writing the CSV twice: once with writer and once with .to_csv(). You don't need both; either would work, and the choice only affects how you iterate through the data.
Personally, I like working with pandas. I've had people tell me it's a little overkill to store temp dataframes and append them to a "final" dataframe, but it's what I'm comfortable doing and I haven't had issues with it, so that's what I used.
To get other fields, you'll need to explore what's in the JSON response and work your way through its structure to pull them out (if you're going the csv.writer route).
The page number is part of the payload parameters, so to go through the pages you iterate over it. The odd thing I found when I tried this is that you have to iterate not only the page parameter but also the from parameter. Since I have it fetching 60 results per page, page 1 starts from 0, page 2 from 60, page 3 from 120, and so on, so I had it step through those multiples of 60 (which seems to work). Sometimes a response tells you how many pages there are, but I couldn't find that here, so I wrapped the request in a try/except that breaks the loop when it reaches the end. The downside is that an unrelated error would also stop the loop prematurely. I didn't look into that much; it's just a side note.
It would look something like this (it might take a while to go through all the pages, so I only did pages 1-10).
Before saving to CSV, you can also trim the dataframe down to only the columns you want:
import requests
import pandas as pd
tot_pages = 10
url = 'https://www.bienici.com/realEstateAds.json'
frames = []  # one temporary dataframe per page
for page in range(1, tot_pages+1):
    try:
        # "from" is the offset of the first result on the page (60 results per page)
        payload = {'filters': '{"size":60,"from":%s,"filterType":"buy","newProperty":false,"page":%s,"resultsPerPage":60,"maxAuthorizedResults":2400,"sortBy":"relevance","sortOrder":"desc","onTheMarket":[true],"limit":"ih{eIzjhZ?q}qrAzaf}AlrD?rvfrA","showAllModels":false,"blurInfoType":["disk","exact"]}' %((60 * (page-1)), page)}
        response = requests.get(url, params = payload).json()
        print ('Processing Page: %s' %page)
        # pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
        frames.append(pd.json_normalize(response['realEstateAds']))
    except Exception as exc:
        # Any failure (including running past the last page) ends the loop
        print ('No more pages, or the request failed: %s' %exc)
        break
# DataFrame.append was removed in pandas 2.0, so concatenate the pages instead
results_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
# To filter down to certain columns, un-comment below
#results_df = results_df[['city','district.name','postalCode','price','propertyType','surfaceArea','bedroomsQuantity','bathroomsQuantity']]
results_df.to_csv('selog.csv', index=False)
Output:
print(results_df.head(5).to_string())
             city                    district.name  postalCode   price  propertyType  surfaceArea  bedroomsQuantity  bathroomsQuantity
0        Colombes   Colombes - Fossés Jean Bouvier       92700  469000          flat        92.00               3.0                1.0
1            Nice   Nice - Parc Impérial - Le Piol       06000  215000          flat        49.05               1.0                NaN
2            Nice                  Nice - Gambetta       06000  145000          flat        21.57               0.0                NaN
3  Cagnes-sur-Mer  Cagnes-sur-Mer - Les Bréguières       06800  770000         house       117.00               3.0                3.0
4             Pau                  Pau - Le Hameau       64000  310000         house       110.00               3.0                2.0
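The page/from arithmetic described above can also be kept out of the long format string. This is only a sketch: build_payload is a hypothetical helper name, and the filter keys are copied from the payload in this answer on the assumption that the endpoint still accepts them. It builds the filter as a dict and serializes it with json.dumps, so Python's False/True become JSON false/true automatically.

```python
import json

def build_payload(page, per_page=60):
    """Build the query parameters for one page of results (1-based page number)."""
    filters = {
        "size": per_page,
        "from": per_page * (page - 1),  # offset of the first result on this page
        "filterType": "buy",
        "newProperty": False,
        "page": page,
        "resultsPerPage": per_page,
        "maxAuthorizedResults": 2400,
        "sortBy": "relevance",
        "sortOrder": "desc",
        "onTheMarket": [True],
        "limit": "ih{eIzjhZ?q}qrAzaf}AlrD?rvfrA",
        "showAllModels": False,
        "blurInfoType": ["disk", "exact"],
    }
    # The endpoint receives the whole filter object as a single JSON string
    return {"filters": json.dumps(filters, separators=(",", ":"))}

payload = build_payload(3)
print(json.loads(payload["filters"])["from"])  # 120
```

The returned dict can be passed straight to requests.get(url, params=payload) in place of the hand-formatted string.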

Related

Extract single data point from multiple, webscraping

I am trying to extract stock symbols (3rd column) from the table in below screener:
https://chartink.com/screener/2-short-trend
and pass them on to a dataframe.
Due to my limited knowledge, I have hit a wall and can not move past it.
My code is:
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://chartink.com/screener/2-short-trend')
response.html.render()
for result in response.html.xpath('//*[@id="DataTables_Table_0"]/tbody/tr/td/a[1]'):
    print(f'{result.text}\n')
Output:
Mahindra & Mahindra Limited
M&M
P&F
Apollo Tyres Limited
APOLLOTYRE
P&F
....
I just need the stock symbols (M&M, APOLLOTYRE, etc.) passed into a dataframe.
Can someone please guide me?

Bit of a quick fix, but you could use a counter assuming that the relevant output is the second result for every company. Something like the below:
from requests_html import HTMLSession
import pandas as pd
session = HTMLSession()
response = session.get('https://chartink.com/screener/2-short-trend')
response.html.render()
i = 1
symbols = []
for result in response.html.xpath('//*[@id="DataTables_Table_0"]/tbody/tr/td/a[1]'):
    print(f'{result.text}\n')
    if i == 2:
        symbols.append(result.text)
        i -= 2
    else:
        i += 1
df = pd.DataFrame({"Symbol": symbols})
The counter i triggers an append at each position where a symbol is iterated over (the second of every three results), and the dataframe is then built from the collected list. Running that code gave me a dataframe with the 5 symbols from your link.
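If the results really do arrive in fixed groups of three (name, symbol, "P&F"), as the question's output suggests, the counter can be replaced by a slice with a step. A minimal sketch on a plain list, assuming that grouping holds for every row; pick_symbols is a hypothetical helper, and the sample texts are the ones from the question's output:

```python
# Link texts in the order shown in the question's output
link_texts = [
    "Mahindra & Mahindra Limited", "M&M", "P&F",
    "Apollo Tyres Limited", "APOLLOTYRE", "P&F",
]

def pick_symbols(texts, group_size=3, symbol_pos=1):
    """Take the item at symbol_pos from each consecutive group of group_size items."""
    return texts[symbol_pos::group_size]

symbols = pick_symbols(link_texts)
print(symbols)  # ['M&M', 'APOLLOTYRE']
```

With the scraped page, link_texts would be [r.text for r in response.html.xpath(...)], and the list can go into pd.DataFrame({"Symbol": symbols}) as before.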

Any way to download the data with custom queries from a URL in Python?

I want to download data from the USDA site with custom queries. Instead of manually selecting the query options on the website, I would like to do this in Python. I used requests to access the URL and read the content, but it is not clear to me how to pass the queries, make the selection, and download the data as CSV. Does anyone know an easy way to do this in Python? Is there a workaround for downloading data from a URL with specific queries?
This is my current attempt, with the URL I want to query:
import io
import requests
import pandas as pd
url="https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
Before reading the response into pandas, I need to pass the following query values to select the correct data:
Category = "Retail"
Report Type = "Item"
Species = "Beef"
Region(s) = "National"
Start Dates = "2020-01-01"
End Date = "2021-02-08"
It is not clear to me how to pass these values with the request and then download the filtered data as CSV. Is there an efficient way of doing this in Python? Thanks.

A few details:
- The simplest format to parse is text rather than HTML; I took the text-download URL from the HTML page.
- requests.get(params=...) takes a dict, so I built one and passed it; there is no need to assemble the complete URL string by hand.
- The text is clearly space-delimited, with a minimum of two spaces between columns.
import io
import requests
import pandas as pd
url="https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType":"summary","species":"BEEF","portal":"ls","category":"Retail","format":"text"}
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")  # raw string avoids invalid-escape warnings
         Date         Region  Feature Rate  Outlets  Special Rate  Activity Index
0  02/05/2021       NATIONAL        69.40%   29,200        20.10%          81,650
1  02/05/2021      NORTHEAST        75.00%    5,500         3.80%          17,520
2  02/05/2021      SOUTHEAST        70.10%    7,400        28.00%          23,980
3  02/05/2021        MIDWEST        75.10%    6,100        19.90%          17,430
4  02/05/2021  SOUTH CENTRAL        57.90%    4,900        26.40%           9,720
5  02/05/2021      NORTHWEST        77.50%    1,300         2.50%           3,150
6  02/05/2021      SOUTHWEST        63.20%    3,800        27.50%           9,360
7  02/05/2021         ALASKA        87.00%      200          .00%             290
8  02/05/2021         HAWAII        46.70%      100          .00%             230
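To see why the separator needs at least two whitespace characters, here is a small stand-alone sketch using only the standard library. The sample text is hand-made to resemble the report (the real layout is an assumption); the point is that a region such as SOUTH CENTRAL contains a single space and must not be split.

```python
import re

# Hand-made sample resembling the text report; columns are separated by 2+ spaces
sample = (
    "Date        Region         Feature Rate  Outlets\n"
    "02/05/2021  NATIONAL       69.40%        29,200\n"
    "02/05/2021  SOUTH CENTRAL  57.90%        4,900\n"
)

# Split each line on runs of two or more whitespace characters
rows = [re.split(r"\s{2,}", line.strip()) for line in sample.splitlines()]
header, records = rows[0], rows[1:]

print(header)      # ['Date', 'Region', 'Feature Rate', 'Outlets']
print(records[1])  # ['02/05/2021', 'SOUTH CENTRAL', '57.90%', '4,900']
```

pandas applies the same idea when read_csv is given a regex separator together with engine="python".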

Just format the query data in the url - it's actually a REST API:
To add more query data, as @mullinscr said, you can change the values on the left of the page, press submit, and then read the query name from the URL (for example, the start date is called repDate).
If you hover over the Download as XML link, you will also see that you can specify the download format using format=<format_name>. Parsing the tabular data as XML with pandas might be easier, so I would append format=xml at the end as well.
category = "Retail"
report_type = "Item"
species = "BEEF"
regions = "NATIONAL"
start_date = "01-01-2019"
end_date = "01-01-2021"
# the site separates date parts with "/", which appears URL-encoded as "%2F"
start_date = start_date.replace("-", "%2F")
end_date = end_date.replace("-", "%2F")
url = f"https://www.marketnews.usda.gov/mnp/ls-report-retail?runReport=true&portal=ls&startIndex=1&category={category}&repType={report_type}&species={species}&region={regions}&repDate={start_date}&endDate={end_date}&compareLy=No&format=xml"
# parse with pandas, etc...
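As an alternative to replacing characters by hand, the standard library can do the URL encoding. A sketch only: build_report_url is a hypothetical helper, and the parameter names are the ones from the URL above, assumed to still be what the site expects. Dates are written with "/" and urlencode turns each one into "%2F".

```python
from urllib.parse import urlencode

BASE_URL = "https://www.marketnews.usda.gov/mnp/ls-report-retail"

def build_report_url(start_date, end_date, fmt="xml"):
    """Assemble the report URL; dates are MM/DD/YYYY strings."""
    params = {
        "runReport": "true",
        "portal": "ls",
        "startIndex": 1,
        "category": "Retail",
        "repType": "Item",
        "species": "BEEF",
        "region": "NATIONAL",
        "repDate": start_date,
        "endDate": end_date,
        "compareLy": "No",
        "format": fmt,
    }
    # urlencode percent-encodes reserved characters such as "/" in the values
    return BASE_URL + "?" + urlencode(params)

url = build_report_url("01/01/2019", "01/01/2021")
print(url)
```

Passing the same dict to requests.get(BASE_URL, params=params) would have an equivalent effect.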

Getting no data when scraping a table

I am trying to scrape historical data from a table in coinmarketcap. However, the code that I run gives back "no data." I thought it would be fairly easy, but not sure what I am missing.
import requests
from bs4 import BeautifulSoup
url = "https://coinmarketcap.com/currencies/bitcoin/historical-data/"
data = requests.get(url)
bs=BeautifulSoup(data.text, "lxml")
table_body=bs.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols=row.find_all('td')
    cols=[x.text.strip() for x in cols]
    print(cols)
Output:
C:\Users\Ejer\anaconda3\envs\pythonProject\python.exe C:/Users/Ejer/PycharmProjects/pythonProject/CloudSQL_test.py
['No Data']
Process finished with exit code 0

You don't need to scrape the page; you can request the data directly:
import time
import requests
import pandas as pd
def get_timestamp(datetime: str):
    # Convert 'YYYY-MM-DD HH:MM:SS' (interpreted as local time) to epoch seconds
    return int(time.mktime(time.strptime(datetime, '%Y-%m-%d %H:%M:%S')))
def get_btc_quotes(start_date: str, end_date: str):
    start = get_timestamp(start_date)
    end = get_timestamp(end_date)
    url = f'https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical?id=1&convert=USD&time_start={start}&time_end={end}'
    return requests.get(url).json()
data = get_btc_quotes(start_date='2020-12-01 00:00:00',
                      end_date='2020-12-10 00:00:00')
# making A LOT of assumptions here, hopefully the keys don't change in the future
data_flat = [quote['quote']['USD'] for quote in data['data']['quotes']]
df = pd.DataFrame(data_flat)
print(df)
Output:
open high low close volume market_cap timestamp
0 18801.743593 19308.330663 18347.717838 19201.091157 3.738770e+10 3.563810e+11 2020-12-02T23:59:59.999Z
1 19205.925404 19566.191884 18925.784434 19445.398480 3.193032e+10 3.609339e+11 2020-12-03T23:59:59.999Z
2 19446.966422 19511.404714 18697.192914 18699.765613 3.387239e+10 3.471114e+11 2020-12-04T23:59:59.999Z
3 18698.385279 19160.449265 18590.193675 19154.231131 2.724246e+10 3.555639e+11 2020-12-05T23:59:59.999Z
4 19154.180593 19390.499895 18897.894072 19345.120959 2.529378e+10 3.591235e+11 2020-12-06T23:59:59.999Z
5 19343.128798 19411.827676 18931.142919 19191.631287 2.689636e+10 3.562932e+11 2020-12-07T23:59:59.999Z
6 19191.529463 19283.478339 18269.945444 18321.144916 3.169229e+10 3.401488e+11 2020-12-08T23:59:59.999Z
7 18320.884784 18626.292652 17935.547820 18553.915377 3.442037e+10 3.444865e+11 2020-12-09T23:59:59.999Z
8 18553.299728 18553.299728 17957.065213 18264.992107 2.554713e+10 3.391369e+11 2020-12-10T23:59:59.999Z
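One caveat about get_timestamp: time.mktime interprets the string in the machine's local time zone, so the same date can produce different timestamps on different machines. If the API expects UTC epoch seconds (an assumption on my part, though the timestamps in the captured request URL are whole UTC days), calendar.timegm gives a time-zone-independent result. to_utc_timestamp is a hypothetical name for this sketch:

```python
import calendar
import time

def to_utc_timestamp(datetime_str):
    """Convert 'YYYY-MM-DD HH:MM:SS', read as UTC, to epoch seconds."""
    parsed = time.strptime(datetime_str, "%Y-%m-%d %H:%M:%S")
    # timegm treats the struct_time as UTC, unlike mktime (local time)
    return calendar.timegm(parsed)

print(to_utc_timestamp("2020-10-30 00:00:00"))  # 1604016000
print(to_utc_timestamp("2020-12-01 00:00:00"))  # 1606780800
```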

Your problem is that the table is created dynamically by JavaScript, so a plain request returns the page before the rows exist; to scrape it you would need something that runs that JavaScript. However, you can check the network monitor in your browser for the requests the page makes. One of them probably returns the raw data as JSON or XML, so you don't need to scrape at all. I did that and found this request:
https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical?id=1&convert=USD&time_start=1604016000&time_end=1609286400
Check it out, and I hope it helps!

Scraping OSHA website using BeautifulSoup

I'm looking for help with two main things: (1) scraping a web page and (2) turning the scraped data into a pandas dataframe (mostly so I can output as .csv, but just creating a pandas df is enough for now). Here is what I have done so far for both:
(1) Scraping the web site:
I am trying to scrape this page: https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015. My end goal is to create a dataframe that would ideally contain only the information I am looking for (i.e. I'd be able to select only the parts of the site that I am interested in for my df); it's OK if I have to pull in all the data for now.
As you can see from the URL as well as the ID hyperlinks underneath "Quick Link Reference" at the top of the page, there are five distinct records on this page. I would like each of these IDs/records to be treated as an individual row in my pandas df.
EDIT: Thanks to a helpful comment, I'm including an example of what I would ultimately want in the table below. The first row represents column headers/names and the second row represents the first inspection.
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
1285328.015 12/28/2017 referral 12/28/2017 06/21/2018 2
Mostly relying on BeautifulSoup4, I've tried a few different options to get at the page elements I'm interested in:
# This is meant to give you the first instance of Case Status, which in the case of this page is "CLOSED".
case_status_template = html_soup.head.find('div', {"id" : "maincontain"},
    class_ = "container").div.find('table', class_ = "table-bordered").find('strong').text
# I wasn't able to get the remaining Case Statuses with find_next_sibling or find_all, so I used a different method:
for table in html_soup.find_all('table', class_= "table-bordered"):
    print(table.text)
# This gave me the output I needed (i.e. the Case Status for all five records on the page),
# but didn't give me the structure I wanted and didn't really allow me to connect to the other data on the page.
# I was also able to get to the same place with another page element, Inspection Details.
# This is the information reflected on the page after "Inspection: ", directly below Case Status.
insp_details_template = html_soup.head.find('div', {"id" : "maincontain"},
    class_ = "container").div.find('table', class_ = "table-unbordered")
for div in html_soup.find_all('table', class_ = "table-unbordered"):
    print(div.text)
# Unfortunately, although I could get these two pieces of information to print,
# I realized I would have a hard time getting the rest of the information for each record.
# I also knew that it would be hard to connect/roll all of these up at the record level.
So, I tried a slightly different approach. By focusing instead on a version of that page with a single inspection record, I thought maybe I could just hack it by using this bit of code:
url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
first_table = html_soup.find('table', class_ = "table-borderedu")
first_table_rows = first_table.find_all('tr')
for tr in first_table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
# Then, actually using pandas to get the data into a df and out as a .csv.
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df in dfs_osha:
    print(df)
path = r'~\foo'
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df[1,3] in dfs_osha:
    df.to_csv(os.path.join(path,r'osha_output_table1_012320.csv'))
# This worked better, but didn't actually give me all of the data on the page,
# and wouldn't be replicable for the other four inspection records I'm interested in.
So, finally, I found a pretty handy example here: https://levelup.gitconnected.com/quick-web-scraping-with-python-beautiful-soup-4dde18468f1f. I was trying to work through it, and had gotten as far as coming up with this code:
for elem in all_content_raw_lxml:
    wrappers = elem.find_all('div', class_ = "row-fluid")
    for x in wrappers:
        case_status = x.find('div', class_ = "text-center")
        print(case_status)
        insp_details = x.find('div', class_ = "table-responsive")
        for tr in insp_details:
            td = tr.find_all('td')
            td_row = [i.text for i in td]
            print(td_row)
        violation_items = insp_details.find_next_sibling('div', class_ = "table-responsive")
        for tr in violation_items:
            tr = tr.find_all('tr')
            tr_row = [i.text for i in tr]
            print(tr_row)
        print('---------------')
Unfortunately, I ran into too many bugs with this to be able to use it so I was forced to abandon the project until I got some further guidance. Hopefully the code I've shared so far at least shows the effort I've put in, even if it doesn't do much to get to the final output! Thanks.

For this type of page you don't really need beautifulsoup; pandas is enough.
url = 'your url above'
import pandas as pd
#use pandas to read the tables on the page; there are lots of them...
tables = pd.read_html(url)
#Select from this list of tables only those tables you need:
incident = [] #initialize a list of inspections
for i, table in enumerate(tables): #we need to find the index position of this table in the list; more below
    if table.shape[1]==5: #all relevant tables have this shape
        case = [] #initialize a list of inspection items you are interested in
        case.append(table.iat[1,0]) #this is the location in the table of this particular item
        case.append(table.iat[1,2].split(' ')[2]) #the string in the cell needs to be cleaned up a bit...
        case.append(table.iat[9,1])
        case.append(table.iat[12,3])
        case.append(table.iat[13,3])
        case.append(tables[i+2].iat[0,1]) #this particular item is in a table which is 2 positions down from the current one; this is where the index position of the current table comes in handy
        incident.append(case)
columns = ["inspection_id", "open_date", "inspection_type", "close_conference", "close_case", "violations_serious_initial"]
df2 = pd.DataFrame(incident,columns=columns)
df2
Output:
     inspection_id   open_date  inspection_type  close_conference  close_case  violations_serious_initial
0  Nr: 1285328.015  12/28/2017         Referral        12/28/2017  06/21/2018                           2
1  Nr: 1283809.015  12/18/2017        Complaint        12/18/2017  05/24/2018                           5
2  Nr: 1284178.015  12/18/2017         Accident        05/17/2018  09/17/2018                           1
3  Nr: 1283549.015  12/13/2017         Referral        12/13/2017  05/22/2018                           3
4  Nr: 1282631.015  12/12/2017          Fat/Cat        12/12/2017  11/16/2018                           1
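The inspection_id values in this output still carry the "Nr: " prefix, while the table requested in the question has the bare number. A small clean-up sketch, assuming every value has that exact prefix; clean_inspection_id is a hypothetical helper:

```python
def clean_inspection_id(raw):
    """Return the numeric part of a value such as 'Nr: 1285328.015'."""
    prefix = "Nr:"
    text = raw.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()

raw_ids = ["Nr: 1285328.015", "Nr: 1283809.015"]
clean_ids = [clean_inspection_id(r) for r in raw_ids]
print(clean_ids)  # ['1285328.015', '1283809.015']
```

On the dataframe itself this could be applied with df2["inspection_id"] = df2["inspection_id"].map(clean_inspection_id).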

How to Automatically select data on webpage and download the resulting xls file using Python

I am new to Python. I am trying to scrape the data on the page:
For example:
Category: grains
Organic: No
Commodity: Coarse
SubCommodity: Corn
Publications: Daily
Location: All
Refine Commodity: All
Dates: 07/31/2018 - 08/01/2019
Is there a way for Python to select these options on the webpage, click Run, then click Download as Excel and store the Excel file?
Is it possible? I am new to coding and need some guidance here.
Currently what I have done is enter the data and then on the resulting page I used Beautiful Soup to scrape the table. However it takes a lot of time since the table is spread on more than 200 pages.

Using the query you defined as an example, I input the query manually and found the following URL for the Excel (really HTML) format:
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=excel'
In the URL are parameters we can set in Python, and we could easily make a loop to change the parameters. For now, let me just get into the example of actually getting this data. I use the pandas.read_html to read this HTML result and populate a DataFrame, which could be thought of as a table with columns and rows.
import pandas as pd
# use URL defined earlier
# url = '...'
df_lst = pd.read_html(url, header=1)
Now df_lst is a list of DataFrame objects containing the desired data. For your particular example, this results in 30674 rows and 11 columns:
>>> df_lst[0].columns
Index([u'Report Date', u'Location', u'Class', u'Variety', u'Grade Description',
u'Units', u'Transmode', u'Low', u'High', u'Pricing Point',
u'Delivery Period'],
dtype='object')
>>> df_lst[0].head()
   Report Date         Location   Class  Variety  Grade Description   Units   Transmode   Low  High         Pricing Point  Delivery Period
0   07/31/2018  Blytheville, AR  YELLOW      NaN            US NO 2  Bushel       Truck  3.84  3.84     Country Elevators             Cash
1   07/31/2018       Helena, AR  YELLOW      NaN            US NO 2  Bushel       Truck  3.76  3.76     Country Elevators             Cash
2   07/31/2018  Little Rock, AR  YELLOW      NaN            US NO 2  Bushel       Truck  3.74  3.74  Mills and Processors             Cash
3   07/31/2018   Pine Bluff, AR  YELLOW      NaN            US NO 2  Bushel       Truck  3.67  3.67     Country Elevators             Cash
4   07/31/2018       Denver, CO  YELLOW      NaN            US NO 2  Bushel  Truck-Rail  3.72  3.72    Terminal Elevators             Cash
>>> df_lst[0].shape
(30674, 11)
Now, back to the point I made about the URL parameters--using Python, we can run through lists and format the URL string to our liking. For instance, iterating through 20 years of the given query can be done by modifying the URL to have numbers corresponding to positional arguments in Python's str.format() method. Here's a full example below:
import datetime
import pandas as pd
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    #print(df_lst[0].head()) # uncomment to see first five rows
    #print(df_lst[0].shape) # uncomment to see DataFrame shape
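The date handling in the loop above can be tried on its own without hitting the site. In this sketch the template is a shortened, hypothetical stand-in for the full URL (only the two date parameters are kept), yearly_windows is a made-up helper, and the dates are percent-encoded so that "/" becomes "%2F" as in the original URL:

```python
import datetime
from urllib.parse import quote

# Shortened, hypothetical template: only the two date parameters are kept
template = "https://marketnews.usda.gov/mnp/ls-report?repDate={0}&endDate={1}&format=excel"

def yearly_windows(first_start, first_end, years):
    """Yield (start, end) date pairs, stepping back one year at a time."""
    for i in range(years):
        yield (first_start.replace(year=first_start.year - i),
               first_end.replace(year=first_end.year - i))

def encode_date(d):
    # Format as MM/DD/YYYY, then percent-encode "/" as "%2F"
    return quote(d.strftime("%m/%d/%Y"), safe="")

windows = list(yearly_windows(datetime.date(2018, 7, 31), datetime.date(2019, 8, 1), 3))
urls = [template.format(encode_date(s), encode_date(e)) for s, e in windows]
for u in urls:
    print(u)
```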
Be careful with pd.read_html. I've modified my answer with a header keyword argument to pd.read_html() because the multi-indexing made it a pain to get results. By giving a single row index as the header, it's no longer a multi-index, and data indexing is easy. For instance, I can get corn class using this:
df_lst[0]['Class']
Compiling all the reports into one large file is also easy with pandas. Since we have a DataFrame, we can use the DataFrame.to_csv method to export our data as a CSV (or any other file type you want, but I chose CSV for this example). Here's a modified version with the additional capability of outputting a CSV:
import datetime
import pandas as pd
# URL
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
# CSV output file and flag
csv_out = 'myreports.csv'
flag = True
# Start and end dates
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
# Iterate through dates and get report from URL
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    print(df_lst[0].head()) # first five rows; comment out to silence
    print(df_lst[0].shape) # DataFrame shape; comment out to silence
    # Save to big CSV
    if flag:
        # 0th iteration, so write header and overwrite existing file
        df_lst[0].to_csv(csv_out, header=True, mode='w') # change mode to 'wb' if Python 2.7
        flag = False
    else:
        # Subsequent iterations should append to file and not add new header
        df_lst[0].to_csv(csv_out, header=False, mode='a') # change mode to 'ab' if Python 2.7
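The write-the-header-once idea can also be sketched without a flag variable by checking whether the output file already has content. This is a variation, not the code above: it uses the standard csv module on plain lists, append_rows is a hypothetical helper, the second sample row is made up, and unlike the flag version it never overwrites an existing file.

```python
import csv
import os
import tempfile

def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)

# Demo in a temporary directory with two batches of rows
out_path = os.path.join(tempfile.mkdtemp(), "myreports.csv")
append_rows(out_path, ["Report Date", "Low"], [["07/31/2018", "3.84"]])
append_rows(out_path, ["Report Date", "Low"], [["08/01/2018", "3.76"]])

with open(out_path, newline="") as f:
    lines = list(csv.reader(f))
print(lines)
```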

Your particular query generates at least 1227 pages of data, so I trimmed it down to one state, Arkansas (loc=All AR, from 07/31/2018 to 08/01/2019), which generates 47 pages of data. The XML was about 500KB.
You can semi-automate it like this:
>>> end_day='01'
>>> start_day='31'
>>> start_month='07'
>>> end_month='08'
>>> start_year='2018'
>>> end_year='2019'
>>> link = f"https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={end_month}%2F{end_day}%2F{end_year}&commDetail=All&endDateWeekly={end_month}%2F{end_day}%2F{end_year}&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={start_month}%2F{start_day}%2F{start_year}+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear={end_year}&repDateWeekly={start_month}%2F{start_day}%2F{start_year}&_wrange=1&endDateWeeklyGrain=&repYear={end_year}&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate={start_month}%2F{start_day}%2F{start_year}&endDate={end_month}%2F{end_day}%2F{end_year}&format=xml"
>>> link
'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&endDateWeekly=08%2F01%2F2019&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear=2019&repDateWeekly=07%2F31%2F2018&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=xml'
>>> import urllib.request
>>> with urllib.request.urlopen(link) as response:
...     html = response.read()
...
Loading the response could take a hot minute with large queries.
If for some reason you wished to process the entire data set, you could repeat this process, but you may want to look into techniques better suited to large data, perhaps a solution involving Python's pandas and numexpr (for fast, multi-threaded evaluation of array expressions).
You can find the data used in this answer here - which you can download as an xml.
First import the XML parser:
>>> import xml.etree.ElementTree as ET
You can either parse the bytes downloaded above directly in Python (ET.fromstring returns the root element)
>>> report = ET.fromstring(html)
or parse a file you saved manually
>>> tree = ET.parse('report.xml')
>>> report = tree.getroot()
You can then do stuff like this:
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> for el in report[0]:
...     print(el)
...
<Element 'reportDate' at 0x7f902adcf368>
<Element 'location' at 0x7f902ac814f8>
<Element 'classStr' at 0x7f902ac81548>
<Element 'variety' at 0x7f902ac81b88>
<Element 'grade' at 0x7f902ac29cc8>
<Element 'units' at 0x7f902ac29d18>
<Element 'transMode' at 0x7f902ac29d68>
<Element 'bidLevel' at 0x7f902ac29db8>
<Element 'deliveryPoint' at 0x7f902ac29ea8>
<Element 'deliveryPeriod' at 0x7f902ac29ef8>
More info on parsing xml is here.
You're going to want to learn some Python, and luckily there are many free Python tutorials online, but hopefully you can make sense of the following quick snippet to get you started.
#lets find the lowest bid on a certain day
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> report[0][7]
<Element 'bidLevel' at 0x7f902ac29db8>
>>> report[0][7][0]
<Element 'lowPrice' at 0x7f902ac29e08>
>>> report[0][7][0].text
'3.84'
#how many low bids are there?
>>> len(report)
1216
#get an average of the lowest bids...
>>> low_bid_list = [float(bid[7][0].text) for bid in report]
>>> low_bid_list
[3.84, 3.76, 3.74, 3.67, 3.65, 3.7, 3.5, 3.7, 3.61,...]
>>> total = 0
>>> for el in low_bid_list:
...     total = total + el
...
>>> total
4602.599999999992
>>> total/len(report)
3.7850328947368355
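The positional indexing above (bid[7][0]) depends on the exact order of the child elements. A sketch of a name-based alternative follows, run on a tiny inline document rather than the real download: the lowPrice, bidLevel and reportDate tags come from the session above, while the wrapper tags "results" and "report" and the highPrice tag are assumptions made for illustration.

```python
import statistics
import xml.etree.ElementTree as ET

# Tiny stand-in for the downloaded file; wrapper tags and highPrice are assumed
SAMPLE_XML = """
<results>
  <report>
    <reportDate>07/31/2018</reportDate>
    <bidLevel><lowPrice>3.84</lowPrice><highPrice>3.84</highPrice></bidLevel>
  </report>
  <report>
    <reportDate>07/31/2018</reportDate>
    <bidLevel><lowPrice>3.76</lowPrice><highPrice>3.76</highPrice></bidLevel>
  </report>
</results>
""".strip()

root = ET.fromstring(SAMPLE_XML)

# iter() finds every lowPrice element by tag name, wherever it sits in the tree
low_bids = [float(el.text) for el in root.iter("lowPrice")]
average_low = statistics.mean(low_bids)

print(low_bids)               # [3.84, 3.76]
print(round(average_low, 2))  # 3.8
```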
