I would like to put some data from an HTML file into a pandas DataFrame, but I'm getting an error. My data has the following structure. It is the data between the square brackets after "lots" that I would like to put into a DataFrame, but I'm pretty confused as to what type of object this is.
html_doc = """<html><head><script>
"unrequired_data = [{"ID":XXX, "Name":XXX, "Price":100GBP, "description": null },
{"ID":XXX, "Name":XXX, "Price":150GBP, "description": null },
{"ID":XXX, "Name":XXX, "Price":150GBP, "description": null }]
"lots":[{"ID":123, "Name":ABC, "Price":100, "description": null },
{"ID":456, "Name":DEF, "Price":150, "description": null },
{"ID":789, "Name":GHI, "Price":150, "description": null }]
</script></head></html>"""
I have tried the following code
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(html_doc)
df = pd.DataFrame("lots")
The output I would like to get would be in this format.
Your data is not valid JSON, so you need to fix it.
I would use:
from bs4 import BeautifulSoup
import pandas as pd
import json, re
soup = BeautifulSoup(html_doc)
# extract script
script = soup.find("script").text.strip()
# get the first value that starts with "lots"
data = next((s.split(':', maxsplit=1)[-1] for s in re.split('\n{2,}', script) if s.startswith('"lots"')), None)
# fix the json
if data:
    data = re.sub(r':\s*([^",}]+)\s*', r':"\1"', data)
    df = pd.DataFrame(json.loads(data))
    print(df)
Output:
ID Name Price description
0 123 ABC 100 null
1 456 DEF 150 null
2 789 GHI 150 null
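If the script layout differs from the sample above (for example, there are no blank lines between the two blocks), a sketch that pulls the bracketed array straight after "lots" with one regex and then applies the same quoting fix may be more forgiving:

import json
import re
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
script = soup.find('script').text

# grab everything between the brackets that follow "lots"
m = re.search(r'"lots":\s*(\[.*?\])', script, flags=re.DOTALL)
if m:
    # quote the bare values (ABC, 100, null, ...) so the snippet parses as JSON
    fixed = re.sub(r':\s*([^",}\]]+)\s*', r':"\1"', m.group(1))
    df = pd.DataFrame(json.loads(fixed))
    print(df)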
I have a JSON file that I need to convert into CSV. The JSON contains an object holding an array, and all of my attributes are inside that array, but the code I am trying puts each array object into a single cell, whereas I actually want each of those attributes as its own column.
JSON file content
{
    "leads": [
        {
            "id": "31Y2V29CH0X82",
            "product_type": "prelist"
        },
        {
            "id": "2N649TAJBA50Z",
            "product_type": "prelist"
        }
    ],
    "has_next_page": true,
    "next_cursor": "2022-07-27T20:02:13.856000-07:00"
}
Python code
import pandas as pd
df = pd.read_json (r'C:\Users\Ron\Desktop\Test\Product_List.json')
df.to_csv (r'C:\Users\Ron\Desktop\Test\New_Products.csv', index = None)
The output I am getting is as follows
And the output I want
How can I get the attributes as CSV content with headers?
I think you'll have to do this row by row.
data = {"leads": [{"id": "31Y2V29CH0X82", "product_type": "prelist"}, {"id": "2N649TAJBA50Z", "product_type": "prelist"}], "has_next_page": True,
"next_cursor": "2022-07-27T20:02:13.856000-07:00"}
headers = data.copy()
del headers['leads']
rows = []
for row in data['leads']:
    row.update( headers )
    rows.append( row )
import pandas as pd
df = pd.DataFrame( rows )
print(df)
Output:
id product_type has_next_page next_cursor
0 31Y2V29CH0X82 prelist True 2022-07-27T20:02:13.856000-07:00
1 2N649TAJBA50Z prelist True 2022-07-27T20:02:13.856000-07:00
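Alternatively, pandas can flatten this shape in one call with json_normalize, using record_path for the array and meta for the top-level fields. A sketch reading straight from the file path in your question:

import json
import pandas as pd

with open(r'C:\Users\Ron\Desktop\Test\Product_List.json') as f:
    data = json.load(f)

# 'leads' becomes the rows; the two top-level fields are repeated on every row
df = pd.json_normalize(data, record_path='leads', meta=['has_next_page', 'next_cursor'])
df.to_csv(r'C:\Users\Ron\Desktop\Test\New_Products.csv', index=False)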
I am looking to get just the "ratingValue" and "reviewCount" from the following application/ld+json, but I cannot figure out how to do this after looking through numerous how-tos, so I've essentially given up. Thank you in advance for your help.
Sample of the application/ld+json
{
    "#context": "http://schema.org",
    "#graph": [
        {
            "#type": "Product",
            "name": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) ",
            "description": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) - FilterBuy.com",
            "productID": 30100,
            "sku": "ABR20x20x5M8",
            "mpn": "ABR20x20x5M8",
            "url": "https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20/merv-8/",
            "itemCondition": "new",
            "brand": "FilterBuy",
            "image": "https://filterbuy.com/media/pla_images/20x25x5AB/20x25x5AB-m8-(x1).jpg",
            "aggregateRating": {
                "#type": "AggregateRating",
                "ratingValue": 4.79926,
                "reviewCount": 538}
My Code:
from bs4 import BeautifulSoup
import bs4
import requests
import json
import re
import numpy as np
import csv
urls = ['https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20',
'https://filterbuy.com/brand/trion-air-bear-air-filters/16x25x5-air-bear-1400/?selected_merv=11']
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    mervs = BeautifulSoup(response.text, 'lxml').find_all('strong')
    product = BeautifulSoup(response.text, 'lxml').find("h1", class_="text-center")
    jsonString = soup.find_all('script', type='application/ld+json')[1].text
    json_schema = soup.find_all('script', attrs={'type': 'application/ld+json'})[1]
    json_file = json.loads(json_schema.get_text())
    for i, cart in enumerate(BeautifulSoup(response.text, 'lxml').find_all('form', class_='cart')):
        for tax in cart.attrs:
            if 'data-price' in tax:
                print(product.text, mervs[i].get_text(), [tax], cart[tax], json_file)
In python, pretty much anything can be nested inside other things. This is an example of nesting lists and dictionaries inside a dictionary. You can go about getting the value by thinking about what you need to do on each level.
Start by assigning the above dictionary to a variable, like the_dict. You want to access the "#graph" key, then access the first item in the list it returns, then access "aggregateRating". From there, you can get both the values you want. Your code may look something like this:
the_dict = ...
d = the_dict['#graph'][0]['aggregateRating']
rating_value, review_count = d['ratingValue'], d['reviewCount']
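Applied to the json_file you already build inside your loop, a slightly more defensive sketch (my assumption being that the Product node may not always be the first entry in "#graph") could look like this:

# pick the Product node out of "#graph" rather than assuming it is the first item
product = next((node for node in json_file.get('#graph', []) if node.get('#type') == 'Product'), None)
if product and 'aggregateRating' in product:
    agg = product['aggregateRating']
    print(agg['ratingValue'], agg['reviewCount'])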
So I have this data that I scraped
[
    {
        "id": 4321069,
        "points": 52535,
        "name": "Dennis",
        "avatar": "",
        "leaderboardPosition": 1,
        "rank": ""
    },
    {
        "id": 9281450,
        "points": 40930,
        "name": "Dinh",
        "avatar": "https://uploads-us-west-2.insided.com/koodo-en/icon/90x90/aeaf8cc1-65b2-4d07-a838-1f078bbd2b60.png",
        "leaderboardPosition": 2,
        "rank": ""
    },
    {
        "id": 1087209,
        "points": 26053,
        "name": "Sophia",
        "avatar": "https://uploads-us-west-2.insided.com/koodo-en/icon/90x90/c3e9ffb1-df72-46e8-9cd5-c66a000e98fa.png",
        "leaderboardPosition": 3,
        "rank": ""
And so on... It's a big leaderboard of 20 people.
Scraped with this code
import json
import requests
import pandas as pd
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
# print for all time:
data = requests.get(url_all_time).json()
# for item in data:
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data:
    print(item['name'], item['points'])
And I want to be able to create a table that resembles this
Every time I scrape the data, I want it to add the points to the table under a new date stamp used as the column header. So basically my index would be the usernames and the headers would be dates. The problem is, I can't even manage to make a CSV file with just those NAME/POINTS columns.
The only thing I have succeeded in doing so far is writing ALL the data into a CSV file. I haven't been able to pinpoint the data I want the way I did in the print command.
EDIT: After reading what #Shijith posted I succeeded in transferring the data to .csv, but given what I have in mind (adding more data as time goes by), I'm wondering whether I should write the code with an index or without.
WITH
import pandas as pd
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)
table = pd.DataFrame.from_records(data, index=['name'], columns=['points','name'])
table.to_csv('products.csv', index=True, encoding='utf-8')
WITHOUT
import pandas as pd
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)
table = pd.DataFrame.from_records(data, columns=['points','name'])
table.to_csv('products.csv', index=False, encoding='utf-8')
Have you tried just reading the json directly into a pandas dataframe? From here it should be pretty easy to transform it like you want. You could add a column for today's date and pivot it.
import pandas as pd
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
df = pd.read_json(url_all_time)
df['date'] = pd.Timestamp.today().strftime('%m-%d-%Y')
df.pivot(index='name', columns='date', values='points')
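To get the behaviour you describe, where each scrape adds a new date column, one sketch (the leaderboard.csv file name and layout are my own assumptions, not from your post) would be:

import os
import pandas as pd

url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
today = pd.Timestamp.today().strftime('%m-%d-%Y')

# today's scrape as one column of points, indexed by name
new_col = pd.read_json(url_all_time).set_index('name')['points'].rename(today)

if os.path.exists('leaderboard.csv'):
    table = pd.read_csv('leaderboard.csv', index_col='name')
    table = table.join(new_col, how='outer')  # add today's column next to the earlier ones
else:
    table = new_col.to_frame()

table.to_csv('leaderboard.csv')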
I am trying to use BeautifulSoup to get some data from a website. The data is returned as follows
window._sharedData = {
    "config": {
        "csrf_token": "DMjhhPBY0i6ZyMKYQPjMjxJhRD0gkRVQ",
        "viewer": null,
        "viewerId": null
    },
    "country_code": "IN",
    "language_code": "en",
    "locale": "en_US"
}
How can I load this with json.loads so I can extract the data?
You need to turn it into valid JSON first by removing the variable assignment, then parse the remaining string:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.find('script').text
text = text.replace('window._sharedData = ', '')
data = json.loads(text)
country_code = data['country_code']
Alternatively, you can use the eval function to turn it into a Python dictionary. For that you need to replace the JSON literals with their Python equivalents (here, null with None) before evaluating the string as a dictionary; note that eval should only be used on content you trust:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.find('script').text
text = text.replace('null', 'None')
text = text.replace('window._sharedData = ', '')
data = eval(text)
country_code = data['country_code']
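In either case, if the page contains more than one script tag, a sketch like this (using BeautifulSoup's string filter; the trailing-semicolon handling is my own addition) can locate the one that holds window._sharedData:

import json
import re

# find the script tag whose text contains the window._sharedData assignment
tag = soup.find('script', string=re.compile(r'window\._sharedData'))
if tag:
    text = tag.text.replace('window._sharedData = ', '').rstrip('; \n')  # drop a trailing semicolon if present
    data = json.loads(text)
    country_code = data['country_code']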
I am reading API data from a cloud server in JSON format, as shown here.
How do I write code to store this data in a database, and how do I convert the JSON into a DataFrame?
The required output format is shown in table 2.
Here is an example:
from requests import request
import json
import pandas as pd
response = request(url="http://api.open-notify.org/astros.json", method='get')  # API source
data = json.loads(response.text)['people']  # pick the 'people' entries out of the json
pd.DataFrame(data)  # convert to pandas dataframe
let me know if it works.
You can try this one, tell me if it works!
import requests
import json
import pandas as pd
response = requests.get(url='http://google.com')  # Assuming the url
res = response.json()
df = pd.DataFrame(res)
Good luck mate!
check out the docs here
import pandas as pd
df = pd.read_json("http://some.com/blah.json")
and as for storing it in a database you will need to know some things about your database connection. docs here
tablename = "my_tablename"
connection_values = <your sql alchemy connection here>
df.to_sql(name=tablename, con=connection_values)
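As a minimal sketch of that database step, using an in-memory SQLite engine purely for illustration (swap in your own connection URL):

from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')  # illustrative only; use your real database URL here
df.to_sql(name='my_tablename', con=engine, index=False)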
To help others also, here is an example with nested json. I have tried to make this similar to the example you showed in your question.
import json
import pandas as pd
jsondata = '''
{
"source": { "id": "2480300" },
"time": "2013-07-02T16:32:30.152+02:00",
"type": "huawei_E3131SignalStrength",
"c8y_SignalStrength": {
"rssi": { "value": -53, "unit": "dBm" },
"ber": { "value": 0.14, "unit": "%" }
}
}
'''
data = json.loads(jsondata)  # load the JSON string
df = pd.json_normalize(data)  # flatten the nested fields into columns
df
Result:
c8y_SignalStrength.ber.unit c8y_SignalStrength.ber.value c8y_SignalStrength.rssi.unit c8y_SignalStrength.rssi.value source.id time type
% 0.14 dBm -53 2480300 2013-07-02T16:32:30.152+02:00 huawei_E3131SignalStrength
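If the dotted column names are awkward downstream (for example when writing to a database), json_normalize also accepts a sep argument to control how nested keys are joined, e.g.:

df = pd.json_normalize(data, sep='_')
# columns are then named like c8y_SignalStrength_rssi_value instead of c8y_SignalStrength.rssi.value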