Python with very nested JSON to CSV file

https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json
I have changed the script following Stev's answer. The error no longer applies.
#import pandas as pd
import requests
import json
import csv

outfile = open('D:\\test.csv', 'w')
url = 'https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json'
resp = requests.get(url)
data = json.loads(resp.content.decode('UTF8'))
for d in data['data']:
    for r in d['rows']:
        for sd in d['rows']:
            for td in sd['data']:
                dparsed = sd['data']
                w = csv.DictWriter(outfile, dparsed.keys())
                w.writeheader()
                w.writerow(dparsed)
I ran the script and it created the CSV file, but it shows 0 KB and is reported as locked by another user, so I don't know exactly what I have goofed up this time. This is clearly not a duplicate question, so thanks for flagging it as such... /s
I ran the above script and after about 3 hours of waiting I killed Spyder to see what happened with the file. It kind of worked, but it only managed to spit out some of the data into columns and about 3 rows. Not sure where I fell down yet.
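One likely cause of the empty, locked file: outfile is never closed, so nothing is flushed to disk and the open handle blocks other programs. A minimal sketch of the same loop using a context manager and writing the header once, assuming each r['data'] is a flat dict with consistent keys (as Stev's answer implies):
import csv
import json
import requests

url = 'https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json'
data = json.loads(requests.get(url).content.decode('UTF8'))

with open('D:\\test.csv', 'w', newline='') as outfile:
    writer = None
    for d in data['data']:
        for r in d['rows']:
            dparsed = r['data']  # assumed: a flat dict with the same keys for every row
            if writer is None:
                writer = csv.DictWriter(outfile, dparsed.keys())
                writer.writeheader()  # header once, not once per row
            writer.writerow(dparsed)
# leaving the with-block closes and flushes the file, so it is no longer 0 KB or locked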

This is more of a comment because it doesn't give you the answer, but I am not sure your JSON file is formatted properly for pd.json_normalize. You might have to loop through your JSON, using something like the following:
import pandas as pd
import requests
import json

url = 'https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json'
resp = requests.get(url)
data = json.loads(resp.content.decode('UTF8'))

df1 = pd.DataFrame()
df2 = pd.DataFrame()
for d in data['data']:
    # print(d['identifier'])
    for r in d['rows']:
        # print(r['contractIdentifier'])
        # print(r['data'])
        df1 = df1.append(pd.json_normalize(r['data']))
        df2 = df2.append(pd.DataFrame([r['contractIdentifier']]))
        # print(r)

df = pd.concat([df1, df2], axis=1)
df.to_csv('my_file.txt')
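A side note: DataFrame.append was removed in pandas 2.0, so on current pandas the loop above raises AttributeError. An equivalent sketch that collects the pieces in lists and concatenates once, under the same assumptions about the JSON layout:
import pandas as pd
import requests
import json

url = 'https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json'
data = json.loads(requests.get(url).content.decode('UTF8'))

frames, ids = [], []
for d in data['data']:
    for r in d['rows']:
        frames.append(pd.json_normalize(r['data']))
        ids.append(r['contractIdentifier'])

df = pd.concat(frames, ignore_index=True)
df['contractIdentifier'] = ids  # assumes each r['data'] normalizes to exactly one row
df.to_csv('my_file.txt')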


A DataFrame object does not have an attribute select

In Palantir Foundry, I am trying to read all XML files from a dataset. Then, in a for loop, I parse the XML files.
Up to the second-to-last line, the code runs without errors.
from transforms.api import transform, Input, Output
from transforms.verbs.dataframes import sanitize_schema_for_parquet
from bs4 import BeautifulSoup
import pandas as pd
import lxml

@transform(
    output=Output("/Spring/xx/datasets/mydataset2"),
    source_df=Input("ri.foundry.main.dataset.123"),
)
def read_xml(ctx, source_df, output):
    df = pd.DataFrame()
    filesystem = source_df.filesystem()
    hadoop_path = filesystem.hadoop_path
    files = [f"{hadoop_path}/{f.path}" for f in filesystem.ls()]
    for i in files:
        with open(i, 'r') as f:
            file = f.read()
        soup = BeautifulSoup(file, 'xml')
        data = []
        for e in soup.select('offer'):
            data.append({
                'meldezeitraum': e.find_previous('data').get('meldezeitraum'),
                'id': e.get('id'),
                'parent_id': e.get('parent_id'),
            })
        df = df.append(data)
    output.write_dataframe(sanitize_schema_for_parquet(df))
However, as soon as I add the last line:
output.write_dataframe(sanitize_schema_for_parquet(df))
I get this error:
Missing transform attribute
A DataFrame object does not have an attribute select. Please check the spelling and/or the datatype of the object.
/transforms-python/src/myproject/datasets/mydataset.py
output.write_dataframe(sanitize_schema_for_parquet(df))
What am I doing wrong?
You have to convert your pandas DataFrame to a Spark DataFrame. Even though they have the same name, those are two different object types in Python.
The easiest way to do that is:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_spark = spark.createDataFrame(df)
You can then pass df_spark to the output.write_dataframe() function.
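Putting it together, the end of the transform might look like this; a minimal sketch, assuming the names from the question (ctx.spark_session is the transform context's Spark session):
def read_xml(ctx, source_df, output):
    # ... build the pandas DataFrame `df` by parsing the XML files, as above ...
    df_spark = ctx.spark_session.createDataFrame(df)  # pandas -> Spark DataFrame
    output.write_dataframe(sanitize_schema_for_parquet(df_spark))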

How to extract HTML text output as a list for each input from a list using Python web scraping. I have written code, but it gives only the first entry's output

I am new to Python and programming. I am trying to extract the PubChem ID from a database called IMPPAT (https://cb.imsc.res.in/imppat/home). I have a list of chemical IDs from the database for a herb, where going into each chemical ID hyperlink gives details on its PubChem ID and SMILES data.
I have written a script in Python to take each chemical ID as input, look for the PubChem ID in the HTML page, and print the output to a text file.
I am finding it difficult to get all the data as output. Pretty sure there is some error in the for loop, as it prints only the first output many times instead of a different output for each input.
Please help with this.
Also, I don't know how to save this kind of file where it prints each input and its corresponding output side by side. Please help.
import requests
import xmltodict
from pprint import pprint
import time
from bs4 import BeautifulSoup
import json
import pandas as pd
import os
from pathlib import Path
from tqdm.notebook import tqdm

cids = 'output.txt'
df = pd.read_csv(cids, sep='\t')
df

data = []
for line in df.iterrows():
    out = requests.get(f'https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/CID%{line}')
    soup = BeautifulSoup(out.text, "html.parser")
    if soup.status_code == 200:
        script_data = soup.find('div', {'class': 'views-field views-field-Pubchem-id'}).find('span', {'class': 'field-content'}).find('h3')
        #print(script_data.text)
        for text in script_data:
            texts = script_data.get_text()
            print(text)
            data.append(text)
print(data)
The input file consists of:
cids
0    3A155934
1    3A117235
2    3A12312921
3    3A12303662
4    3A225688
5    3A440966
6    3A443160
There are a few things you need to correct in your code:
The out variable has incorrect indentation.
The status code should be checked on the response object, i.e., out, not soup.
You don't need the second loop, as each response contains only a single PubChem ID, which you are already collecting in the script_data variable.
Lastly, you can use pandas to associate each chemical ID with its PubChem ID and then write the result to a CSV file.
Refer to the code below for the complete result.
Code
import requests
from bs4 import BeautifulSoup
import pandas as pd

cids = 'output.txt'
df = pd.read_csv(cids, sep='\t')

pubchem_id = []
# iterate over the cid values themselves rather than iterrows() tuples,
# so the f-string builds a URL like .../CID%3A155934
for cid in df.iloc[:, 0]:
    out = requests.get(f'https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/CID%{cid}')
    if out.status_code == 200:
        soup = BeautifulSoup(out.text, "html.parser")
        script_data = soup.find('div', {'class': 'views-field views-field-Pubchem-id'}).find('span', {'class': 'field-content'}).find('h3').getText()
        script_data = script_data.replace('PubChem Identifier:', '')
        pubchem_id.append(script_data)

# As you have not mentioned the column index of cids, I am assuming it is the first column
df1 = pd.DataFrame({"chemical_id": df.iloc[:, 0].tolist(), "pubchem_id": pubchem_id})
print(df1)
# uncomment the line below to write the dataframe to a CSV file; replace 'filename' with the complete filepath
# df1.to_csv('filename.csv')
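One caveat with the loop above: if any request returns a status other than 200, pubchem_id ends up shorter than df and the DataFrame constructor raises a length-mismatch error. A small variation, under the same assumptions, that records a placeholder instead:
pubchem_id = []
for cid in df.iloc[:, 0]:
    out = requests.get(f'https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/CID%{cid}')
    if out.status_code == 200:
        soup = BeautifulSoup(out.text, "html.parser")
        text = soup.find('div', {'class': 'views-field views-field-Pubchem-id'}).find('span', {'class': 'field-content'}).find('h3').getText()
        pubchem_id.append(text.replace('PubChem Identifier:', ''))
    else:
        pubchem_id.append(None)  # placeholder keeps both columns the same length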

Extract json data in web page using pd.read_json()?

Trying to extract the table from this page: "https://www.hkex.com.hk/Market-Data/Statistics/Consolidated-Reports/Monthly-Bulletin?sc_lang=en#select1=0&select2=28". Using the Inspect/Network tool in Chrome, the data request link is "https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485". This link looks like JSON when accessed directly. However, the code using this link does not work.
My code:
import pandas as pd
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485"
df = pd.read_json(url)
print(df.info(verbose=True))
print(df)
also tried:
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?"
You can try downloading the JSON first and then converting it into a DataFrame:
import pandas as pd
import urllib.request, json

url = 'https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485'
with urllib.request.urlopen(url) as r:
    data = json.loads(r.read().decode())

# the payload holds a list of tables; each has a 'header' and a flat 'body' of cells
df = pd.DataFrame(data['tables'][0]['body'])
columns = [item['text'] for item in data['tables'][0]['header']]
row_count = max(df['row'])
# the body lists cells row by row, so reshaping the flat 'text' column
# into row_count rows rebuilds the original table
new_df = pd.DataFrame(df.text.values.reshape((row_count, -1)), columns=columns)
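A quick sanity check of the result, assuming the request succeeded and the table layout is as described:
print(new_df.shape)   # should be (row_count, number of header columns)
print(new_df.head())  # first rows under the header names from the JSON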

Web scraping golf data from ESPN. I am receiving 3 outputs of the same table and only want 1. How can I limit this?

I am new to Python and am stuck. I can't figure out how to output only one of the tables given. The output includes the desired table, but three versions of it. The first two are awfully formatted, and the last table is the one desired.
I have tried running a for loop and counting to print only the third table.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header=0)
for df in dfs:
    print(df[0:])
Just use an index to print the table you want.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
Or, if you want to use a loop, try this:
import pandas as pd

url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header=0)
for i in range(len(dfs)):
    if i == 2:
        print(dfs[i])
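Alternatively, pd.read_html takes a match argument (a string or regex) and returns only the tables containing that text, which avoids counting tables by hand. 'PLAYER' below is an assumption about a column header on the leaderboard page:
import pandas as pd

url = 'https://www.espn.com/golf/leaderboard'
# keep only tables whose text matches the pattern; 'PLAYER' is an assumed header
dfs = pd.read_html(url, header=0, match='PLAYER')
print(dfs[0])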

Turning a .csv from yahoo finance into lists of columns with Python

I'm attempting to pull data from Yahoo Finance in the form of a .csv, then turn columns 1 and 5 into lists in Python. The portion of the code that turns the columns into lists is functional if the .csv has been previously downloaded, but what I'm trying to do is get the data from the URL into Python directly.
The error I get is "AttributeError: 'module' object has no attribute 'request'". Here is the code:
import urllib

def data_pull():
    # gets data out of a .csv file from yahoo finance, separates specific columns into lists
    datafile = urllib.request.urlretrieve('http://ichart.finance.yahoo.com/table.csv?s=xom&a=00&b=2&c=1999&d=01&e=12&f=2014&g=m&ignore=.csv')
    datafile = open(datafile)
    datelist = []  # blank list for dates
    pricelist = []  # blank list for prices
    for row in datafile:
        datelist.append(row.strip().split(","))
        pricelist.append(row.strip().split(","))
    datelist = zip(*datelist)  # rows into columns
    datelist = datelist[0]  # turns the list into data from the first column
    pricelist = zip(*pricelist)
    pricelist = pricelist[4]  # list gets data from the fifth column
    print datelist
    print pricelist

data_pull()
I'm brand new to Python and coding in general. I know there are probably more efficient ways of executing the code above, but my main concern is getting the urllib piece to function correctly. Thanks in advance for your comments.
You need to import the full module:
import urllib.request
If you don't, the parent package will not have the submodule as an attribute.
You probably don't want to use urllib.request.urlretrieve() here; you'd normally process the response directly in Python. You can also make use of the csv module to read the data without needing to split:
from urllib.request import urlopen
import io
import csv
url = 'http://ichart.finance.yahoo.com/table.csv?s=xom&a=00&b=2&c=1999&d=01&e=12&f=2014&g=m&ignore=.csv'
reader_input = io.TextIOWrapper(urlopen(url), encoding='utf8', newline='')
reader = csv.reader(reader_input)
next(reader, None)  # skip headers
cols = list(zip(*reader))  # transpose rows into columns of strings
datelist, pricelist = cols[0], cols[4]  # Date and Close columns
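As a usage sketch, the columns come back as tuples of strings, so convert the prices before doing any arithmetic (column positions taken from the question):
prices = [float(p) for p in pricelist]  # Close column as numbers
print(datelist[:3])
print(prices[:3])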
