A DataFrame object does not have an attribute select - python

In Palantir Foundry, I am trying to read all XML files from a dataset and then parse them in a for loop.
Up to the second-to-last line, the code runs fine without errors.
from transforms.api import transform, Input, Output
from transforms.verbs.dataframes import sanitize_schema_for_parquet
from bs4 import BeautifulSoup
import pandas as pd
import lxml
@transform(
    output=Output("/Spring/xx/datasets/mydataset2"),
    source_df=Input("ri.foundry.main.dataset.123"),
)
def read_xml(ctx, source_df, output):
    df = pd.DataFrame()
    filesystem = source_df.filesystem()
    hadoop_path = filesystem.hadoop_path
    files = [f"{hadoop_path}/{f.path}" for f in filesystem.ls()]
    for i in files:
        with open(i, 'r') as f:
            file = f.read()
        soup = BeautifulSoup(file, 'xml')
        data = []
        for e in soup.select('offer'):
            data.append({
                'meldezeitraum': e.find_previous('data').get('meldezeitraum'),
                'id': e.get('id'),
                'parent_id': e.get('parent_id'),
            })
        df = df.append(data)
    output.write_dataframe(sanitize_schema_for_parquet(df))
However, as soon as I add the last line:
output.write_dataframe(sanitize_schema_for_parquet(df))
I get this error:
Missing transform attribute
A DataFrame object does not have an attribute select. Please check the spelling and/or the datatype of the object.
/transforms-python/src/myproject/datasets/mydataset.py
output.write_dataframe(sanitize_schema_for_parquet(df))
What am I doing wrong?

You have to convert your pandas DataFrame to a Spark DataFrame. Even though they share the same name, they are two different object types in Python.
The easiest way to do that is:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_spark = spark.createDataFrame(df)
You can then pass df_spark to the output.write_dataframe() function.
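Applied to the transform in the question, the change is just to convert right before writing. As a rough sketch (Foundry transforms also expose a Spark session on the context as ctx.spark_session, which avoids building a new one, but SparkSession.builder.getOrCreate() works as well):
    # ... build up the pandas DataFrame df in the loop exactly as before ...
    df_spark = ctx.spark_session.createDataFrame(df)  # convert pandas -> Spark
    output.write_dataframe(sanitize_schema_for_parquet(df_spark))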

Related

How do I export a read_html df to Excel, when it related to table ID rather than data in the code?

I am experiencing this error with the code below:
File "\<stdin\>", line 1, in \<module\>
AttributeError: 'list' object has no attribute 'to_excel'
I want to save the table I am scraping from Wikipedia to an Excel file, but I can't work out how to adjust the code to get the data list from the terminal to the Excel file using to_excel.
I can see it works for a similar problem when the data is set out as a DataFrame
(i.e. df = pd.DataFrame(data, columns=['Product', 'Price'])).
But I can't work out how to adjust my code for the df = pd.read_html(str(congress_table)) line, which I think is the issue (i.e. using read_html and sourcing the data from a table id).
How can I adjust the code to make it save an excel file to the path specified?
from bs4 import BeautifulSoup
import requests
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, 'html.parser')
congress_table = soup.find('table', attrs={'id': table_id})
df = pd.read_html(str(congress_table))
df.to_excel (r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index = False, header=True)
print(df)
I was expecting the data list to be saved to Excel at the folder path specified.
I tried following multiple guides, but they don't show the read_html item, only DataFrame solutions.
pandas.read_html() returns a list of DataFrames, one per table it finds, so you have to pick one by index, in your case [0]. You also do not need requests and BeautifulSoup separately; just go with pandas.read_html():
pd.read_html(wiki_url,attrs={'id': table_id})[0]
Example
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
df = pd.read_html(wiki_url,attrs={'id': table_id})[0]
df.to_excel(r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index=False, header=True)

How to Speed-Up Writing Dataframe to s3 from EMR PySpark Notebook?

So I'm learning PySpark by playing around with the DMOZ dataset in a jupyter notebook attached to an EMR cluster. The process I'm trying to achieve is as follows:
Load a CSV with the locations of files in an S3 public dataset into a PySpark DataFrame (~130k rows)
Map over the DF with a function that retrieves the file contents (html) and rips the text
Join the output with the original DF as a new column
Write the joined DF to s3 (the problem: It seems to hang forever, its not a large job and the output json should only be a few gigs)
All of the writing is done in a function called run_job()
I let it sit for about 2 hours on a cluster with 10 m5.8xlarge instances, which should be enough (?). All of the other steps execute fine on their own, except for the df.write(). I have tested on a much smaller subset and it wrote to S3 with no issue, but when I go to do the whole file it seemingly hangs at "0/n jobs complete."
I am new to PySpark and distributed computing in general, so it's probably a simple "best practice" that I am missing. (Edit: Maybe it's in the config of the notebook? I'm not using any magics to configure Spark currently, do I need to?)
Code below...
import html2text
import boto3
import botocore
import os
import re
import zlib
import gzip
from bs4 import BeautifulSoup as bs
from bs4 import Comment
# from pyspark import SparkContext, SparkConf
# from pyspark.sql import SQLContext, SparkSession
# from pyspark.sql.types import StructType, StructField, StringType, LongType
import logging
def load_index():
    input_file = 's3://cc-stuff/uploads/DMOZ_bussineses_ccindex.csv'
    df = spark.read.option("header", True) \
        .csv(input_file)
    # df = df.select('url_surtkey','warc_filename', 'warc_record_offset', 'warc_record_length','content_charset','content_languages','fetch_time','fetch_status','content_mime_type')
    return df

def process_warcs(id_, iterator):
    html_textract = html2text.HTML2Text()
    html_textract.ignore_links = True
    html_textract.ignore_images = True
    no_sign_request = botocore.client.Config(signature_version=botocore.UNSIGNED)
    s3client = boto3.client('s3', config=no_sign_request)
    text = None
    s3pattern = re.compile('^s3://([^/]+)/(.+)')
    PREFIX = "s3://commoncrawl/"
    for row in iterator:
        try:
            start_byte = int(row['warc_record_offset'])
            stop_byte = (start_byte + int(row['warc_record_length']))
            s3match = s3pattern.match((PREFIX + row['warc_filename']))
            bucketname = s3match.group(1)
            path = s3match.group(2)
            # print('Bucketname: ', bucketname, '\nPath: ', path)
            resp = s3client.get_object(Bucket=bucketname, Key=path, Range='bytes={}-{}'.format(start_byte, stop_byte))
            content = resp['Body'].read()  # .decode()
            data = zlib.decompress(content, wbits=zlib.MAX_WBITS | 16).decode('utf-8', errors='ignore')
            data = data.split('\r\n\r\n', 2)[2]
            soup = bs(data, 'html.parser')
            for x in soup.findAll(text=lambda text: isinstance(text, Comment)):
                x.extract()
            for x in soup.find_all(["head", "script", "button", "form", "noscript", "style"]):
                x.decompose()
            text = html_textract.handle(str(soup))
        except Exception as e:
            pass
        yield (id_, text)

def run_job(write_out=True):
    df = load_index()
    df2 = df.rdd.repartition(200).mapPartitionsWithIndex(process_warcs).toDF()
    df2 = df2.withColumnRenamed('_1', 'idx').withColumnRenamed('_2', 'page_md')
    df = df.join(df2.select('page_md'))
    if write_out:
        output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
        df.coalesce(4).write.json(output)
    return df

df = run_job(write_out=True)
So I managed to make it work. I attribute this to either of the two changes below. I also changed the hardware configuration and opted for a higher quantity of smaller instances. Gosh, I just LOVE it when I spend an entire day in a deep state of utter confusion when all I needed to do was add a "/" to the save location......
1. I added a trailing "/" to the output file location in S3.
Old:
output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
New:
output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML/"
I removed the "coalesce" in the "run_job()" function, I have 200 output files now, but it worked and it was super quick (under 1 min).
2 Old:
df.coalesce(4).write.json(output)
2 New:
df.write.mode('overwrite').json(output)
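If you still want fewer output files without funnelling the whole write through 4 tasks, a repartition() right before the write is one option (a sketch, not part of the original fix); unlike coalesce(), it does a full shuffle, so the write itself stays parallel:
# Hypothetical alternative: shuffle into e.g. 20 partitions so the JSON write
# runs on 20 tasks instead of being squeezed through coalesce(4).
df.repartition(20).write.mode('overwrite').json(output)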

Python AttributeError: 'list' object has no attribute 'to_csv'

I'm currently encountering an error with my code, and I have no idea why. I originally thought it was because I couldn't save a CSV file with hyphens in it, but that turns out not to be the case. Does anyone have any suggestions as to what might be causing the problem? My code is below:
import pandas as pd
import requests
query_set = ["points-per-game"]
for query in query_set:
    url = 'https://www.teamrankings.com/ncaa-basketball/stat/' + str(query)
    html = requests.get(url).content
    df_list = pd.read_html(html)
    print(df_list)
    df_list.to_csv(str(query) + "stat.csv", encoding="utf-8")
The read_html() method returns a list of DataFrames, not a single DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
You'll want to loop through your df_list and run to_csv on each entry, like so:
import pandas as pd
import requests

query_set = ["points-per-game"]
for query in query_set:
    url = 'https://www.teamrankings.com/ncaa-basketball/stat/' + str(query)
    html = requests.get(url).content
    df_list = pd.read_html(html)
    print(df_list)
    for current_df in df_list:
        current_df.to_csv(str(query) + "stat.csv", encoding="utf-8")
        print(current_df)
The function pd.read_html returns a list of DataFrames found in the HTML source. Use df_list[0] to get the DataFrame which is the first element of this list.
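Applied to the code in the question, that looks roughly like this (assuming the first table on the page is the one you want):
df_list = pd.read_html(html)
df = df_list[0]  # first table found in the HTML; use another index if needed
df.to_csv(str(query) + "stat.csv", encoding="utf-8")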

Save pretty table output as a pandas data frame

How do I convert the output I get from a PrettyTable to a pandas DataFrame and save it as an Excel file?
My code, which gets the PrettyTable output:
from prettytable import PrettyTable
prtab = PrettyTable()
prtab.field_names = ['Item_1', 'Item_2']
for item in Items_2:
    prtab.add_row([item, difflib.get_close_matches(item, Items_1)])
print(prtab)
I'm trying to convert this to a pandas DataFrame; however, I get an error saying DataFrame constructor not properly called! My code to convert it is shown below:
AA = pd.DataFrame(prtab, columns = ['Item_1', 'Item_2']).reset_index()
I found this method recently.
pretty_table.get_csv_string()
This converts it to a CSV string, which you can then write to a CSV file.
I use it like this:
tbl_as_csv = pretty_table.get_csv_string().replace('\r','')
text_file = open("output_path.csv", "w")
n = text_file.write(tbl_as_csv)
text_file.close()
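Since the question ultimately wants a pandas DataFrame and an Excel file, the same CSV string can also be fed straight back into pandas, for example (a sketch built on the snippet above; the output filename is made up):
import io
import pandas as pd

# Parse the PrettyTable's CSV output into a DataFrame, then export to Excel.
df = pd.read_csv(io.StringIO(pretty_table.get_csv_string()))
df.to_excel("output_path.xlsx", index=False)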
Load the data into a DataFrame first, then export to PrettyTable and Excel:
import io
import difflib
import pandas as pd
import prettytable as pt
data = []
for item in Items_2:
    data.append([item, difflib.get_close_matches(item, Items_1)])
df = pd.DataFrame(data, columns=['Item_1', 'Item_2'])
# Export to prettytable
# https://stackoverflow.com/a/18528589/190597 (Ofer)
# Use io.StringIO with Python3, use io.BytesIO with Python2
output = io.StringIO()
df.to_csv(output)
output.seek(0)
print(pt.from_csv(output))
# Export to Excel file
filename = '/tmp/output.xlsx'
writer = pd.ExcelWriter(filename)
df.to_excel(writer, 'Sheet1')
writer.close()  # finalize the workbook so the file is actually written

Python with very nested JSON to CSV file

https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json
I have changed the script following Stev's answer. The error no longer applies.
#import pandas as pd
import requests
import json
import csv
outfile = open('D:\\test.csv','w')
url = 'https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json'
resp = requests.get(url)
data = json.loads(resp.content.decode('UTF8'))
for d in data['data']:
    for r in d['rows']:
        for sd in (d['rows']):
            for td in (sd['data']):
                dparsed = sd['data']
                w = csv.DictWriter(outfile, dparsed.keys())
                w.writeheader()
                w.writerow(dparsed)
I ran the script and it created the CSV file, but it is showing 0 KB and is saying it is locked by another user, so I don't know exactly what I have goofed up this time. This is clearly not a duplicate question, so thanks for flagging it as such... /s
I ran the above script and after about 3 hours of waiting I killed Spyder to see what happened with the Excel file. It kind of worked, but it only managed to spit out some of the data into columns and about 3 rows. Not sure where I fell down yet.
This is more of a comment because it doesn't give you the answer, but I am not sure your JSON file is formatted properly for pd.json_normalize. You might have to loop through your JSON file, using something like the following:
import pandas as pd
import requests
import json
url = 'https://www.eex.com/data//view/data/detail/phelix-power-futures/2018/02.27.json'
resp = requests.get(url)
data = json.loads(resp.content.decode('UTF8'))
df1 = pd.DataFrame()
df2 = pd.DataFrame()
for d in data['data']:
    # print(d['identifier'])
    for r in d['rows']:
        # print(r['contractIdentifier'])
        # print(r['data'])
        df1 = df1.append(pd.json_normalize(r['data']))
        df2 = df2.append(pd.DataFrame([r['contractIdentifier']]))
        # print(r)
df = pd.concat([df1,df2], axis=1)
df.to_csv('my_file.txt')
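On recent pandas versions DataFrame.append has been removed, so a variant of the same idea (a sketch, assuming each r['data'] is a flat dict as the original loop suggests) is to collect the rows in plain lists and build the frames once at the end:
rows, ids = [], []
for d in data['data']:
    for r in d['rows']:
        rows.append(r['data'])               # one dict per contract/row
        ids.append(r['contractIdentifier'])

# Build both frames in one go and line them up side by side.
df = pd.concat([pd.json_normalize(rows), pd.DataFrame({'contractIdentifier': ids})], axis=1)
df.to_csv('my_file.txt')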
