I am trying to load a big Pandas table to dynamoDB.
I have tried the for loop method as follow
for k in range(1000):
trans = {}
trans['Director'] = DL_dt['director_name'][k]
trans['Language'] = DL_dt['original_language'][k]
print("add :", DL_dt['director_name'][k] , DL_dt['original_language'][k])
table.put_item(Item=trans)
it works but it's very time consuming.
Is there a faster way to load it ? (equivalent of to_sql for sql database)
I've found the batchwriteitem function but i am not sure it works and i don't know exactly how to use it.
Thanks a lot.
You can iterate over the dataframe rows, transform each row to json and then convert it to a dict using json.loads, this will also avoid the numpy data type errors.
you can try this:
import json
from decimal import Decimal
DL_dt = DL_dt.rename(columns={
'director_name': 'Director',
'original_language': 'Language'
})
with table.batch_writer() as batch:
for index, row in DL_dt.iterrows():
batch.put_item(json.loads(row.to_json(), parse_float=Decimal))
I did this using aws wrangler. It was a fairly simple process, the only tricky bit was handling pandas floats, so I converted them to decimals before loading the data in.
import awswrangler as wr
def float_to_decimal(num):
return Decimal(str(num))
def pandas_to_dynamodb(df):
df = df.fillna(0)
# convert any floats to decimals
for i in df.columns:
datatype = df[i].dtype
if datatype == 'float64':
df[i] = df[i].apply(float_to_decimal)
# write to dynamodb
wr.dynamodb.put_df(df=df, table_name='table-name')
pandas_to_dynamodb(df)
Batch writer docs here.
Try this:
with table.batch_writer() as batch:
for k in range(1000):
trans = {}
trans['Director'] = DL_dt['director_name'][k]
trans['Language'] = DL_dt['original_language'][k]
print("add :", DL_dt['director_name'][k] , DL_dt['original_language'][k])
batch.put_item(trans))
Related
I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
start_date = '2019-10-08'
end_date = '2019-10-19'
query = datastore_client.query(kind='ParticleEvent')
query.add_filter(
'published_at', '>', start_date)
query.add_filter(
'published_at', '<', end_date)
query.order = ['-published_at']
times = query.fetch(limit=limit)
return times
creates a json like string of the results for each entity returned by the query:
Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
Thought I found something that would translate to json which I could convert to dataframe, but get an error that the properties attribute does not exist:
def to_json(gql_object):
result = []
for item in gql_object:
result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results to get them into a dataframe either directly to a dataframe or by converting to json then to dataframe?
Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
If you needed to convert the entity key, or any of its attributes to a column as well, you can pack them into the dictionary separately:
for e in entities:
e['entity_key'] = e.key
e['entity_key_name'] = e.key.name # for example
df = pd.DataFrame(entities)
You can use pd.read_json to read your json query output into a dataframe.
Assuming the output is the string that you have shared above, then the following approach can work.
#Extracting the beginning of the dictionary
startPos = line.find("{")
df = pd.DataFrame([eval(line[startPos:-1])])
Output looks like :
gc_pub_sub_id data event published_at \
0 438169950283983 605 light intensity 2019-10-11T14:37:45.407Z
device_id
0 e00fce6847be7713698287a1
Here, line[startPos:-1] is essentially the entire dictionary in that sthe string input. Using eval, we can convert it into an actual dictionary. Once we have that, it can be easily converted into a dataframe object
Original poster found a workaround, which is to convert each item in the query result object to string, and then manually parse the string to extract the needed data into a list.
The return value of the fetch function is google.cloud.datastore.query.Iterator which behaves like a List[dict] so the output of fetch can be passed directly into pd.DataFrame.
import pandas as pd
df = pd.DataFrame(fetch_times(10))
This is similar to #bkitej, but I added the use of the original poster's function.
I have two column, one is a string, and the other is a numpy array of floats
a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])
I would like to store them in a row as separate columns in a hdf5 file. For that I am using pandas
hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame(
{
'string': [a],
'array': [b]
})
store5.append(hdf_key, z, index=False)
store5.close()
However, I get this error
TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype
Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?
I can't help you with pandas, but can show you how do this with pytables.
Basically you create a table referencing either a numpy recarray or a dtype that defines the mixed datatypes.
Below is a super simple example to show how to create a table with 1 string and 4 floats. Then it adds rows of data to the table.
It shows 2 different methods to add data:
1. A list of tuples (1 tuple for each row) - see append_list
2. A numpy recarray (with dtype matching the table definition) -
see simple_recarr in the for loop
To get the rest of the arguments for create_table(), read the Pytables documentation. It's very helpful, and should answer additional questions. Link below:
Pytables Users's Guide
import tables as tb
import numpy as np
with tb.open_file('SO_55943319.h5', 'w') as h5f:
my_dtype = np.dtype([('A','S16'),('b',float),('c',float),('d',float),('e',float)])
dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
# Append one row using a list:
append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
dset.append(append_list)
simple_recarr = np.recarray((1,),dtype=my_dtype)
for i in range(5):
simple_recarr['A']='string_' + str(i)
simple_recarr['b']=2.0*i
simple_recarr['c']=3.0*i
simple_recarr['d']=4.0*i
simple_recarr['e']=5.0*i
dset.append(simple_recarr)
print ('done')
I have a problem to append my array that created from numpy library. Here is my python's code
# to get array's column name
columnData = [x[0] for x in curHeader.description]
# to get data result
rData = curHeader.fetchall()
# loop data
for i in rData:
arrayData = np.asarray(dict(zip(columnData, i)))
# print data
print(arrayData)
# {'KD_VAL': 'USD', 'FOB': None, 'FREIGHT': None, 'CIF': 33090.0}
# sample append data into list
listData.append(arrayData)
# Convert listData to json
# Insert json into MongoDB using insert_many
Unfortunately, the array can't insert into my mongoDB event my code doesn't produce an error. Is there any logical missing?
Thanks!
Sorry guys, I have an answer after several trials.
So, the point is only change the code into listData.append(arrayData.toList()).
Thanks to #Santosh Kumar.
if it is possible to skip using numpy arrays this will work
array_data = dict(zip(columnData, rData))
list_data.append(array_data)
# my_mongo_collection.insert_many(list_data)
I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as #ayhan mention which will give you a column base format
Or you can enclose the object in [ ] (source) as shown below to give you a row format that will be convenient if you are loading multiple values and planing on using matrix for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of a json
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, it wants you to load it as a json, you have to convert it to a dict which is exactly what the other response is doing
The best way is to do a json loads on the string to convert it to a dict and load it into pandas
myfile=f.read()
jsonData=json.loads(myfile)
df=pd.DataFrame(data)
{
"biennials": 522004,
"lb915": 116290
}
df = pd.read_json('values.json')
As pd.read_json expects a list
{
"biennials": [522004],
"lb915": [116290]
}
for a particular key, it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying 'typ' arg in pd.read_json
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='dictionary')
For newer pandas, 0.19.0 and later, use the lines parameter, set it to True.
The file is read as a json object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
If fixed the following errors I encountered especially when some of the json files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example
cat values.json
{
name: "Snow",
age: "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this
Error: if using all scalar values, you must pass an index
Pandas looks up for a list or dictionary in the value. Something like
cat values.json
{
name: ["Snow"],
age: ["31"]
}
So try doing this. Later on to convert to html tohtml()
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
I've asked this question differently before but there are some changes so I thought asking it again as a new question.
I have a structured data which only part of it is in json format but I need to map the entire data to an schemaRDD. The data looks like this:
03052015 04:13:20
{"recordType":"NEW","data":{"keycol":"val1","col2":"val2","col3":"val3"}
Each line starts with date followed by time and a json formatted text.
I need to map not only the text in json but also the date and time into the same structure.
I tried it in Python but obviously it doesn't work because Row does not take an RDD (jsonRDD in this case).
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
orderFile = sc.textFile(myfile)
orderLine = orderFile.map(lambda line: line.split(" ", 2))
anotherOrderLine = orderLine.map(lambda p: Row(date=p[0], time=p[1], content=sqlContext.jsonRDD(p[3])))
schemaOrder = sqlContext.inferSchema(anotherOrderLine)
schemaOrder.printSchema()
for x in schemaOrder.collect():
print x
The goal is to be able to run a query like this against the schemaRDD:
select date, time, data.keycol, data.val1, data.val2, data.val3 from myOrder
How can I map the entire line to a schemaRDD?
Any help is appreciated?
The simplest option would be to add this field to JSON and use jsonRDD
My data:
03052015 04:13:20 {"recordType":"NEW","data":{"keycol":"val1","col1":"val5","col2":"val3"}}
03062015 04:13:20 {"recordType":"NEW1","data":{"keycol":"val2","col1":"val6","col2":"val3"}}
03072015 04:13:20 {"recordType":"NEW2","data":{"keycol":"val3","col1":"val7","col2":"val3"}}
03082015 04:13:20 {"recordType":"NEW3","data":{"keycol":"val4","col1":"val8","col2":"val3"}}
Code:
import json
def transform(data):
ts = data[:18].strip()
jss = data[18:].strip()
jsj = json.loads(jss)
jsj['ts'] = ts
return json.dumps(jsj)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.textFile('/sparkdemo/sample.data')
tbl = sqlContext.jsonRDD(rdd.map(transform))
tbl.registerTempTable("myOrder")
sqlContext.sql("select ts, recordType, data.keycol, data.col1, data.col2 data from myOrder").collect()
Result:
[Row(ts=u'03052015 04:13:20', recordType=u'NEW', keycol=u'val1', col1=u'val5', data=u'val3'), Row(ts=u'03062015 04:13:20', recordType=u'NEW1', keycol=u'val2', col1=u'val6', data=u'val3'), Row(ts=u'03072015 04:13:20', recordType=u'NEW2', keycol=u'val3', col1=u'val7', data=u'val3'), Row(ts=u'03082015 04:13:20', recordType=u'NEW3', keycol=u'val4', col1=u'val8', data=u'val3')]
In your code there is a problem that you are calling jsonRDD for each of the rows, this is not correct - it accepts an RDD and returns SchemaRDD.
The sqlContext.jsonRDD creates a schema rdd from an RDD containing strings where each string contains a JSON representation. This code sample is from the SparkSQL documentation (https://spark.apache.org/docs/1.2.0/sql-programming-guide.html):
val anotherPeopleRDD = sc.parallelize("""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
One of the cool things about jsonRDD is the fact that you can provide and additional parameter stating the JSONs schema, which should improve performance your performance. This can be don by creating an schemaRDD (just load a sample document) and then call the schemaRDD.schema method to get the schema.