I've asked this question differently before, but there are some changes, so I thought I'd ask it again as a new question.
I have structured data of which only part is in JSON format, but I need to map the entire data to a SchemaRDD. The data looks like this:
03052015 04:13:20
{"recordType":"NEW","data":{"keycol":"val1","col2":"val2","col3":"val3"}
Each line starts with a date, followed by a time and a JSON-formatted text.
I need to map not only the JSON text but also the date and time into the same structure.
I tried it in Python, but obviously it doesn't work because a Row cannot take an RDD (the result of jsonRDD in this case).
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
orderFile = sc.textFile(myfile)
orderLine = orderFile.map(lambda line: line.split(" ", 2))
anotherOrderLine = orderLine.map(lambda p: Row(date=p[0], time=p[1], content=sqlContext.jsonRDD(p[2])))
schemaOrder = sqlContext.inferSchema(anotherOrderLine)
schemaOrder.printSchema()
for x in schemaOrder.collect():
    print x
The goal is to be able to run a query like this against the schemaRDD:
select date, time, data.keycol, data.val1, data.val2, data.val3 from myOrder
How can I map the entire line to a SchemaRDD?
Any help is appreciated.
The simplest option would be to add the timestamp field to the JSON itself and use jsonRDD.
My data:
03052015 04:13:20 {"recordType":"NEW","data":{"keycol":"val1","col1":"val5","col2":"val3"}}
03062015 04:13:20 {"recordType":"NEW1","data":{"keycol":"val2","col1":"val6","col2":"val3"}}
03072015 04:13:20 {"recordType":"NEW2","data":{"keycol":"val3","col1":"val7","col2":"val3"}}
03082015 04:13:20 {"recordType":"NEW3","data":{"keycol":"val4","col1":"val8","col2":"val3"}}
Code:
import json
def transform(data):
    # the leading characters hold the date and time; the rest of the line is the JSON payload
    ts = data[:18].strip()
    jss = data[18:].strip()
    jsj = json.loads(jss)
    # add the timestamp as an extra field so jsonRDD picks it up
    jsj['ts'] = ts
    return json.dumps(jsj)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.textFile('/sparkdemo/sample.data')
tbl = sqlContext.jsonRDD(rdd.map(transform))
tbl.registerTempTable("myOrder")
sqlContext.sql("select ts, recordType, data.keycol, data.col1, data.col2 data from myOrder").collect()
Result:
[Row(ts=u'03052015 04:13:20', recordType=u'NEW', keycol=u'val1', col1=u'val5', data=u'val3'), Row(ts=u'03062015 04:13:20', recordType=u'NEW1', keycol=u'val2', col1=u'val6', data=u'val3'), Row(ts=u'03072015 04:13:20', recordType=u'NEW2', keycol=u'val3', col1=u'val7', data=u'val3'), Row(ts=u'03082015 04:13:20', recordType=u'NEW3', keycol=u'val4', col1=u'val8', data=u'val3')]
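If date and time are needed as separate columns (as in the query from the question), a small variation of the same idea could split the timestamp before adding it to the JSON. A sketch, assuming every line has the same "date time json" layout (transform_split, tbl2 and myOrder2 are just illustrative names):
def transform_split(line):
    # split off the date and the time, keep the rest as the JSON payload
    date_part, time_part, jss = line.split(" ", 2)
    jsj = json.loads(jss)
    jsj['date'] = date_part
    jsj['time'] = time_part
    return json.dumps(jsj)

tbl2 = sqlContext.jsonRDD(rdd.map(transform_split))
tbl2.registerTempTable("myOrder2")
sqlContext.sql("select date, time, data.keycol, data.col1, data.col2 from myOrder2").collect()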
In your code there is a problem: you are calling jsonRDD for each of the rows, which is not correct; jsonRDD accepts an RDD and returns a SchemaRDD.
sqlContext.jsonRDD creates a SchemaRDD from an RDD of strings where each string contains a JSON representation. This code sample is from the Spark SQL documentation (https://spark.apache.org/docs/1.2.0/sql-programming-guide.html):
val anotherPeopleRDD = sc.parallelize("""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
One of the cool things about jsonRDD is that you can provide an additional parameter stating the JSON schema, which should improve performance. This can be done by creating a SchemaRDD (just load a sample document) and then calling the schemaRDD.schema method to get the schema.
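For example, a rough PySpark sketch of that two-step approach, reusing the rdd and transform from the earlier answer (the exact schema accessor may differ between Spark versions):
# infer the schema once from a small sample of transformed lines
sample = sqlContext.jsonRDD(sc.parallelize(rdd.map(transform).take(10)))
schema = sample.schema()  # schema is a method on SchemaRDD in Spark 1.2
# pass the schema explicitly so the full pass can skip schema inference
tbl = sqlContext.jsonRDD(rdd.map(transform), schema)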
I have a table which has data as shown in the diagram. I want to store the results in dynamically generated data frame names.
For example, in the case below I want to create two different data frames, dnb_df and es_df, store the read results in these two frames, and print the structure of each data frame.
When I run the code below, I get the error
SyntaxError: can't assign to operator (TestGlue2.py, line 66)
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import regexp_replace, col
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
#sc.setLogLevel('DEBUG')
glueContext = GlueContext(sc)
spark = glueContext.spark_session
#logger = glueContext.get_logger()
#logger.DEBUG('Hello Glue')
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
client = boto3.client('glue', region_name='XXXXXX')
response = client.get_connection(Name='XXXXXX')
connection_properties = response['Connection']['ConnectionProperties']
URL = connection_properties['JDBC_CONNECTION_URL']
url_list = URL.split("/")
host = "{}".format(url_list[-2][:-5])
new_host=host.split('#',1)[1]
port = url_list[-2][-4:]
database = "{}".format(url_list[-1])
Oracle_Username = "{}".format(connection_properties['USERNAME'])
Oracle_Password = "{}".format(connection_properties['PASSWORD'])
#print("Oracle_Username:",Oracle_Username)
#print("Oracle_Password:",Oracle_Password)
print("Host:",host)
print("New Host:",new_host)
print("Port:",port)
print("Database:",database)
Oracle_jdbc_url="jdbc:oracle:thin:#//"+new_host+":"+port+"/"+database
print("Oracle_jdbc_url:",Oracle_jdbc_url)
source_df = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", "(select * from schema.table order by VENDOR_EXECUTION_ORDER) ").option("user", Oracle_Username).option("password", Oracle_Password).load()
vendor_data=source_df.collect()
for row in vendor_data:
    vendor_query = row.SRC_QUERY
    row.VENDOR_NAME+'_df' = spark.read.format("jdbc").option("url", Oracle_jdbc_url) \
        .option("dbtable", vendor_query).option("user", Oracle_Username) \
        .option("password", Oracle_Password).load()
    print(row.VENDOR_NAME+'_df')
Added use case in picture
Update: As discussed in the comments, your requirement is to further join all of these with another dataframe.
for row in vendor_data:
    rowAsDict = row.asDict()
    # Here you can use any variable as rowAsDict is not going to be used anywhere else anyway
    rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"] = spark.sql(rowAsDict["SOURCE_QUERY"])
    main_dataframe = main_dataframe.join(rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"], "acc_id")
(The input main_dataframe, source_df, View1 and View2, and the output main_dataframe were shown as screenshots in the original answer.)
If I understood correctly, you need to generate the VENDOR_NAME_DF dynamically.
You won't be able to assign to the Row object, nor would it be useful to assign a dataframe to a Row, as you can't create a DataFrame with a column of type DataFrame.
You can, however, convert a row to a dict using asDict and use that instead.
This would work:
vendor_data = source_df.collect()
for row in vendor_data:
    rowAsDict = row.asDict()
    # Replace this with spark.read() or any way to create a Dataframe
    rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"] = spark.sql(rowAsDict["SOURCE_QUERY"])
    rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"].show()
(The input Source_DF, the result of SOURCE_QUERY, and the output of rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"].show() were shown as screenshots in the original answer.)
Final rowAsDict:
{'VENDOR_NAME': 'Name1', 'SOURCE_QUERY': 'select * from view1', 'Name1_df': DataFrame[id: string, date: string, Code: string]}
Add the last two lines in your for loop and you should be able to get the results.
The first one creates a temp view using the dynamic df name.
The second shows the data in that temp view.
for row in vendor_data:
    vendor_query = row.SRC_QUERY
    spark.read.format("jdbc") \
        .option("url", Oracle_jdbc_url) \
        .option("dbtable", vendor_query) \
        .option("user", Oracle_Username) \
        .option("password", Oracle_Password) \
        .load() \
        .createOrReplaceTempView(row.VENDOR_NAME+'_df')
    spark.sql("select * from "+row.VENDOR_NAME+"_df").show()
I am trying to load a big Pandas table into DynamoDB.
I have tried the for loop method as follows:
for k in range(1000):
    trans = {}
    trans['Director'] = DL_dt['director_name'][k]
    trans['Language'] = DL_dt['original_language'][k]
    print("add :", DL_dt['director_name'][k], DL_dt['original_language'][k])
    table.put_item(Item=trans)
It works, but it's very time consuming.
Is there a faster way to load it? (Something equivalent to to_sql for a SQL database.)
I've found the BatchWriteItem operation, but I am not sure it works and I don't know exactly how to use it.
Thanks a lot.
You can iterate over the dataframe rows, transform each row to JSON and then convert it to a dict using json.loads; this also avoids the numpy data type errors.
you can try this:
import json
from decimal import Decimal

DL_dt = DL_dt.rename(columns={
    'director_name': 'Director',
    'original_language': 'Language'
})

with table.batch_writer() as batch:
    for index, row in DL_dt.iterrows():
        batch.put_item(Item=json.loads(row.to_json(), parse_float=Decimal))
I did this using AWS Data Wrangler (awswrangler). It was a fairly simple process; the only tricky bit was handling pandas floats, so I converted them to decimals before loading the data in.
import awswrangler as wr
from decimal import Decimal

def float_to_decimal(num):
    return Decimal(str(num))

def pandas_to_dynamodb(df):
    df = df.fillna(0)
    # convert any floats to decimals
    for i in df.columns:
        datatype = df[i].dtype
        if datatype == 'float64':
            df[i] = df[i].apply(float_to_decimal)
    # write to dynamodb
    wr.dynamodb.put_df(df=df, table_name='table-name')

pandas_to_dynamodb(df)
Batch writer docs here.
Try this:
with table.batch_writer() as batch:
    for k in range(1000):
        trans = {}
        trans['Director'] = DL_dt['director_name'][k]
        trans['Language'] = DL_dt['original_language'][k]
        print("add :", DL_dt['director_name'][k], DL_dt['original_language'][k])
        batch.put_item(Item=trans)
I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
    start_date = '2019-10-08'
    end_date = '2019-10-19'
    query = datastore_client.query(kind='ParticleEvent')
    query.add_filter('published_at', '>', start_date)
    query.add_filter('published_at', '<', end_date)
    query.order = ['-published_at']
    times = query.fetch(limit=limit)
    return times
creates a JSON-like string of the results for each entity returned by the query:
Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
I thought I had found something that would translate to JSON, which I could then convert to a dataframe, but I get an error that the properties attribute does not exist:
def to_json(gql_object):
    result = []
    for item in gql_object:
        result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
    return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results to get them into a dataframe, either directly or by converting to JSON and then to a dataframe?
Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
If you needed to convert the entity key, or any of its attributes to a column as well, you can pack them into the dictionary separately:
for e in entities:
    e['entity_key'] = e.key
    e['entity_key_name'] = e.key.name  # for example

df = pd.DataFrame(entities)
You can turn the string output of each query result back into a dict and read that into a dataframe.
Assuming the output is the string that you have shared above, the following approach can work.
#Extracting the beginning of the dictionary
startPos = line.find("{")
df = pd.DataFrame([eval(line[startPos:-1])])
Output looks like :
gc_pub_sub_id data event published_at \
0 438169950283983 605 light intensity 2019-10-11T14:37:45.407Z
device_id
0 e00fce6847be7713698287a1
Here, line[startPos:-1] is essentially the entire dictionary in the string input. Using eval, we can convert it into an actual dictionary. Once we have that, it can easily be converted into a dataframe object.
The original poster found a workaround, which is to convert each item in the query result object to a string and then manually parse the string to extract the needed data into a list.
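For completeness, a minimal sketch of that workaround, reusing the string-parsing trick from the previous answer (it assumes each entity's repr ends with its properties dict followed by '>'):
import pandas as pd

rows = []
for entity in fetch_times(10):
    line = str(entity)
    startPos = line.find("{")             # start of the properties dict
    rows.append(eval(line[startPos:-1]))  # drop the trailing '>' before eval

df = pd.DataFrame(rows)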
The return value of the fetch function is a google.cloud.datastore.query.Iterator, which behaves like a List[dict], so the output of fetch can be passed directly into pd.DataFrame.
import pandas as pd
df = pd.DataFrame(fetch_times(10))
This is similar to #bkitej, but I added the use of the original poster's function.
I am trying to create a dataframe in pyspark, then write it as a Hive table, and then read it back, but it is not working...
from pyspark.sql import HiveContext, Row

sqlContext = HiveContext(sc)
hive_context = HiveContext(sc) #Initialize Hive
#load the control table
cntl_dt = [('2016-04-30')]
rdd = sc.parallelize(cntl_dt)
row_cntl_dt = rdd.map(lambda x: Row(load_dt=x[0]))
df_cntl_dt = sqlContext.createDataFrame(row_cntl_dt)
df_cntl_dt.write.mode("overwrite").saveAsTable("schema.cntrl_tbl")
load_dt = hive_context.sql("select load_dt from schema.cntrl_tbl" ).first()['load_dt'];
print (load_dt)
This prints: 2
I expect: 2016-12-31
This is because:
cntl_dt = [('2016-04-30')]
is not valid syntax for a single-element tuple. The parentheses will be ignored and the result will be the same as:
['2016-04-30']
and
Row(load_dt=x[0])
will give:
Row(load_dt='2')
Use:
cntl_dt = [('2016-04-30', )]
Also, you're mixing different contexts (SQLContext and HiveContext), which is generally a bad idea (and neither should be used in any recent Spark version; recent versions use SparkSession instead).
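Putting both fixes together, a minimal corrected sketch (assuming the same schema.cntrl_tbl table name and that Hive support is available):
from pyspark.sql import HiveContext, Row

hive_context = HiveContext(sc)

cntl_dt = [('2016-04-30', )]   # note the trailing comma: single-element tuple
rdd = sc.parallelize(cntl_dt)
row_cntl_dt = rdd.map(lambda x: Row(load_dt=x[0]))
df_cntl_dt = hive_context.createDataFrame(row_cntl_dt)
df_cntl_dt.write.mode("overwrite").saveAsTable("schema.cntrl_tbl")

load_dt = hive_context.sql("select load_dt from schema.cntrl_tbl").first()['load_dt']
print(load_dt)                 # 2016-04-30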
The following is more or less straight Python code which functionally extracts exactly what I want. The data schema for the column I'm filtering on within the dataframe is basically a JSON string.
However, I had to greatly bump up the memory requirement for this, and I'm only running on a single node. Using a collect is probably bad, and creating all of this on a single node really isn't taking advantage of the distributed nature of Spark.
I'd like a more Spark-centric solution. Can anyone help me massage the logic below to better take advantage of Spark? Also, as a learning point, please provide an explanation of why/how the updates make it better.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
from pyspark.sql.types import StructType, StructField, StringType

input_schema = StructType([
    StructField('scrubbed_col_name', StringType(), nullable=True)
])

output_schema = StructType([
    StructField('val01_field_name', StringType(), nullable=True),
    StructField('val02_field_name', StringType(), nullable=True)
])
example_input = [
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_a"},
        {"val01_field_name": "val01_a", "val02_field_name": "val02_b"},
        {"val01_field_name": "val01_b", "val02_field_name": "val02_c"}]''',
    '''[{"val01_field_name": "val01_c", "val02_field_name": "val02_a"}]''',
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_d"}]''',
]
desired_output = {
    'val01_a': ['val02_a', 'val02_b', 'val02_d'],
    'val01_b': ['val02_c'],
    'val01_c': ['val02_a'],
}
def capture(dataframe):
    # Capture column from data frame if it's not empty
    data = dataframe.filter('scrubbed_col_name is not null')\
        .select('scrubbed_col_name')\
        .rdd\
        .collect()
    # Create a mapping of val1: list(val2)
    mapping = {}
    # For every row in the rdd
    for row in data:
        # For each json_string within the row
        for json_string in row:
            # For each item within the json string
            for val in json.loads(json_string):
                # Extract the data properly
                val01 = val.get('val01_field_name')
                val02 = val.get('val02_field_name')
                if val02 not in mapping.get(val01, []):
                    mapping.setdefault(val01, []).append(val02)
    return mapping
One possible solution:
(df
.rdd # Convert to rdd
.flatMap(lambda x: x) # Flatten rows
# Parse JSON. In practice you should add proper exception handling
.flatMap(lambda x: json.loads(x))
# Get values
.map(lambda x: (x.get('val01_field_name'), x.get('val02_field_name')))
# Convert to final shape
.groupByKey())
Given the output specification, this operation is not exactly efficient (do you really require the grouped values?), but it is still much better than collect.
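If you do need the exact desired_output shape from the question, one way to finish is a short sketch like the following; it deduplicates the values per key and only brings the now-small summary back to the driver, and it assumes the chained pipeline above was assigned to a variable named grouped:
# assuming the pipeline above was assigned to `grouped`
mapping = grouped.mapValues(lambda vals: sorted(set(vals))).collectAsMap()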