Some of my data in Snowflake is stored as JSON strings, but it is actually lists of floats. I used a UDF to convert the JSON strings to lists of floats, but it seems like Snowflake is internally converting the lists of floats back to strings.
I just want to know whether this is how Snowflake works, or whether there is a better way to store lists of floats in their actual format.
I don't want to reprocess the data every time to convert it from a JSON string to a list of floats.
The code below demonstrates the problem:
connection_parameters = {
"account": "MY_ACCOUNT"
"user": "USER",
"password": "PASSWORD",
"role": "MY_ROLE",
"warehouse": "MY_WH",
"database": "MY_DB",
"schema": "MY_SCHEMA"
}
table = "MY_TABLE"
from snowflake.snowpark import Session
sf_session = Session.builder.configs(connection_parameters).create()
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import ArrayType, DoubleType, StringType
import json
from typing import List
def parse_embedding_from_string(x: str) -> List[float]:
    res = json.loads(x)
    return res
retrieve_embedding = udf(parse_embedding_from_string)
df = sf_session.createDataFrame(data=[['[0.4, 2.57, 3.47]'], ['[34.50, 16.34, 12.9]'], ['[413.0, 1.211, 8.41]'], ['[0.4, 8.1, 10.11]'], ['[-6.89, 7.1, -12.1]'], ['[14.0, -21.0, 3.12]'], ['[11.0, 44.1, 26.2]'], ['[-4.4, 5.8, -0.10]']], schema=["embedding"])
df = df.withColumn("embedding_new", retrieve_embedding(col("embedding")))
df.toPandas().iloc[0]["EMBEDDING_NEW"]
Below is the output:
'[\n 0.4,\n 2.57,\n 3.47\n]'
The Snowflake connectors do not support passing arrays in either direction. Passing an array will convert it to a string formatted as JSON. On the Python side, you can parse the string to convert it back to an array.
When sending data to Snowflake, especially using a bind variable, you can convert arrays to a JSON-formatted string and use Snowflake's parse_json function to convert the string back to an array. There's a good example showing that here:
https://community.snowflake.com/s/article/HowTo-Programmatically-insert-the-array-data-using-the-bing-variable-via-python-connector
On the Python side, you can do something like this after retrieving the array as a string:
import ast
my_array = ast.literal_eval(input_string)
There is a full explanation of that here:
How to convert string representation of list to a list
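For example, applied to the string returned in the Snowflake question above, ast.literal_eval recovers the actual list of floats:

```python
import ast

# The stringified array that comes back from Snowflake
input_string = '[\n  0.4,\n  2.57,\n  3.47\n]'

# literal_eval safely parses the list literal without eval's security risks
my_array = ast.literal_eval(input_string)
print(my_array)  # [0.4, 2.57, 3.47]
```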
I have a .geojson that I converted to a dataframe using
df = geojson_to_dataframe(raw, glob='*.geojson', batch_size=100000)
Once that is set up, I need to convert the first column "properties" into a string type, since PySpark labels it as a map type. To cast it as a string I run the code below.
df = df.select(df.properties.cast(StringType()).alias("properties"))
The problem I am having is that the new column is stripped of all of the quotation marks, each colon becomes a "->", and it loses its JSON formatting. Does anyone know how to convert a map type to a string type without losing the JSON formatting?
The column as a map type before the cast
{
"BFE_LN_ID":"01001C_722",
"DFIRM_ID":"01001C",
"NAME":"Alabama",
"ALAND":"131185042550",
}
...
The column as a string after the cast
[
BFE_LN_ID->01001C_722,
DFIRM_ID->01001C,
NAME->Alabama,
ALAND->131185042550,
]
Thank you
Update:
Source geojson data that I converted using the above geojson_to_dataframe() code:
properties
geometry
{"BFE_LN_ID":"01001C_722","DFIRM_ID":"01001C","NAME":"Alabama","ALAND":"131185042550","Shape_Length":"0.00010778212891927651","AWATER":"4582333181","LEN_UNIT":"Feet","VERSION_ID":"1.1.1.0","GFID":"20140910","INTPTLAT":"+32.7395785","STATENS":"01779775","REGION":"3","FUNCSTAT":"A","DIVISION":"6","GEOID":"01","INTPTLON":"-086.8434469","STATEFP":"01","SOURCE_CIT":"01001C_STUDY1","ELEV":"376.0","V_DATUM":"NAVD88","LSAD":"00","STUSPS":"AL","MTFCC":"G4000"}
{"coordinates": [[-86.46788227456062, 32.487228761833364], [-86.46796895264879, 32.48717204906342], [-86.46797248221748, 32.48716977416143]], "type": "LineString"}
Code used to get incorrect result
def nfhl_string(nfhl):
    df = nfhl
    df = df.select(df.properties.cast(StringType()).alias("properties"))
    return df
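The arrow-formatted output appears because casting a map to string uses its display form rather than JSON serialization. In plain Python the same distinction shows up between str() and json.dumps(); in PySpark the analogous column function would be to_json from pyspark.sql.functions rather than a cast, assuming your Spark version provides it. A minimal sketch of the distinction:

```python
import json

# A row's "properties" value as a Python dict (excerpt of the question's data)
props = {"BFE_LN_ID": "01001C_722", "DFIRM_ID": "01001C", "NAME": "Alabama"}

# str() renders the dict's display form, not JSON
plain = str(props)

# json.dumps keeps the double quotes and colons intact
as_json = json.dumps(props)
print(as_json)
```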
I have this dataframe with two fields, coordinates and status.
using pandas to_json, I get this
[{"coordinates":"[143.4865219,-34.7560602]","status":"not started"},
the correct format should be
[{"coordinates":[143.4865219,-34.7560602],"status":"not started"},
How do I tell pandas not to put double quotes around the values of coordinates?
You can try explicitly converting the string lists to lists using the ast module:
import ast
s = [{"coordinates":"[143.4865219,-34.7560602]","status":"not started"},{"coordinates":"[143.4865241,-34.7561332]","status":"not started"}]
s = list(map(lambda x : {"coordinates": ast.literal_eval(x['coordinates'].strip('"')), "status": x['status']}, s))
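Written out as a runnable snippet of the same approach (a list comprehension instead of map, which reads a bit more naturally):

```python
import ast
import json

s = [{"coordinates": "[143.4865219,-34.7560602]", "status": "not started"},
     {"coordinates": "[143.4865241,-34.7561332]", "status": "not started"}]

# Parse each stringified coordinate pair into an actual list of floats
fixed = [{"coordinates": ast.literal_eval(d["coordinates"]), "status": d["status"]}
         for d in s]

# coordinates now serialize as unquoted JSON arrays
print(json.dumps(fixed))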
I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key-value pairs where the values are scalars. You can convert it to a dataframe with ser.to_frame('count').
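A self-contained sketch of the same idea, using an inline string via StringIO instead of the course file (the data here is a two-key excerpt):

```python
import pandas as pd
from io import StringIO

raw = '{"biennials": 522004, "lb915": 116290}'

# typ='series' tells pandas the JSON maps keys to scalar values
ser = pd.read_json(StringIO(raw), typ='series')
df = ser.to_frame('count')
print(df)
```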
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentions, which will give you a column-based format.
Or you can enclose the object in [ ] (source) as shown below to give you a row format, which is convenient if you are loading multiple values and planning on using a matrix for your machine learning models.
df = pd.DataFrame([data])
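Side by side, the two shapes from the answers above look like this (using a two-key excerpt of the question's data):

```python
import pandas as pd

data = {"biennials": 522004, "lb915": 116290}

# Column format: one row per key
df_col = pd.DataFrame({'count': data})

# Row format: one column per key
df_row = pd.DataFrame([data])

print(df_col.shape, df_row.shape)
```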
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas wants you to pass an index to load it. You have to convert it to a dict, which is exactly what the other responses are doing.
The best way is to do json.loads on the string to convert it to a dict and load that into pandas:
with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])
{
"biennials": 522004,
"lb915": 116290
}
df = pd.read_json('values.json')
fails, because pd.read_json expects a list for each key, like
{
"biennials": [522004],
"lb915": [116290]
}
For scalar values it returns the error
ValueError: If using all scalar values, you must pass an index.
So you can resolve this by specifying the 'typ' arg in pd.read_json:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True.
The file is then read as one JSON object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files had only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
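A minimal runnable illustration of lines=True, using inline data via StringIO instead of a file:

```python
import pandas as pd
from io import StringIO

# One JSON object per line (JSON Lines format)
raw = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'

df = pd.read_json(StringIO(raw), lines=True)
print(df.shape)
```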
For example
cat values.json
{
name: "Snow",
age: "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this error:
ValueError: If using all scalar values, you must pass an index
Pandas looks for a list or dictionary as the value, something like:
cat values.json
{
name: ["Snow"],
age: ["31"]
}
So try doing this. Later on, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
I am trying to work with Coinbase's API and would like to use their prices as floats, but the call returns an API object and I don't know how to convert it.
For example if I call client.get_spot_price() it will return this:
{
"amount": "316.08",
"currency": "USD"
}
And I just want the 316.08. How can I solve it?
data = {
"amount": "316.08",
"currency": "USD"
}
price = float(data['amount'])
With the API, use a JSON parser:
import json
data = client.get_spot_price()
price = float(json.loads(data)['amount'])
print(price)
It looks like JSON output. You could import the json library and read it using the loads method, for example:
import json
# get data from the API's method
response = client.get_spot_price()
# parse the content of the response using json format
data = json.loads(response)
# get the amount and convert to float
amount = float(data['amount'])
print(amount)
First, put your returned object into a variable and check the type of the returned value, like this:
print(type(your_variable))
If it is a dictionary, you can access the data via a dictionary key. A dictionary's structure is:
dict = {key_1: value_1, key_2: value_2, ...... key_n: value_n}
You can access any value of the dictionary like below:
print(dict[key_1])  # output will be value_1
Then you can convert the returned data to an integer or float.
For converting to an integer:
int(your_data)
For converting to a float:
float(your_data)
If it is not a dictionary, you need to convert it to a dictionary or JSON via:
json.loads(your_returned_object)
In your case you can do:
variable = client.get_spot_price()
print(type(variable))  # check whether it is a dictionary
print(float(variable["amount"]))  # will return your price as a float
I am trying to import a json file using the function:
sku = pandas.read_json('https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111')
However, I keep getting the following error:
ValueError: arrays must all be same length
What should I do to import it correctly into a dataframe?
This is the structure of the JSON:
{
"id": "5",
"sku": "JOSH:BECO-BRN",
"last_updated": "2013-06-10 15:46:22",
...
"propertyType1": [
"manufacturer_colour"
],
"category": [
{
"category_id": "10",
"category_name": "All Products"
},
...
{
"category_id": "238",
"category_name": "All Sofas"
}
],
"root_categories": [
"516"
],
"url": "/p/Beco Suede Sofa Bed?product_id=5",
"item": [
"2"
],
"image_names": "[\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk \\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\"]"
}
The pandas.read_json function takes multiple formats.
Since you did not specify which format your json file is in (orient= attribute), pandas will default to believing your data is columnar. The different formats pandas expects are discussed below.
The data that you are trying to parse from https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111
does not seem to conform to any of the supported formats, as it seems to be only a single "record". Pandas expects some kind of collection.
You probably should try to collect multiple entries into a single file, then parse it with the read_json function.
EDIT:
Simple way of getting multiple rows and parsing it with the pandas.read_json function:
import urllib2
import pandas as pd
url_base = "https://cws01.worldstores.co.uk/api/product.php?product_sku={}"
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
raw_data_list = []
for sku in products:
    url = url_base.format(sku)
    raw_data_list.append(urllib2.urlopen(url).read())
data = "[" + (",".join(raw_data_list)) + "]"
data = pd.read_json(data, orient='records')
data
/EDIT
My take on the pandas.read_json function formats.
The pandas.read_json function is yet another shining example of pandas trying to jam as much functionality as possible into a single function. This of course leads to a very complicated function.
Series
If your data is a Series, pandas.read_json(orient=) defaults to 'index'
The values allowed for orient while parsing a Series are: {'split','records','index'}
Note that the Series index must be unique for orient='index'.
DataFrame
If your data is a DataFrame, pandas.read_json(orient=) defaults to 'columns'
The values allowed for orient while parsing a DataFrame are:
{'split','records','index','columns','values'}
Note that the Series index must be unique for orient='index' and orient='columns', and the DataFrame columns must be unique for orient='index', orient='columns', and orient='records'.
Format
No matter if your data is a DataFrame or a Series, the orient= will expect data in the same format:
Split
Expects a string representation of a dict like what the DataFrame constructor takes:
{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,7,6,5], [5,6,7,8]]}
Records
Expects a string representation of a list of dicts like:
[{"col1":8,"col2":5},{"col1":7,"col2":6},{"col1":6,"col2":7},{"col1":5,"col2":8}]
Note there is no index set here.
Index
Expects a string representation of a nested dict dict like:
{"1":{"col1":8,"col2":5},"2":{"col1":7,"col2":6},"3":{"col1":6,"col2":7},"4":{"col1":5,"col2":8}}
Good to note is that it won't accept indices of types other than strings. This may be fixed in later versions.
Columns
Expects a string representation of a nested dict like:
{"col1":{"1":8,"2":7,"3":6,"4":5},"col2":{"1":5,"2":6,"3":7,"4":8}}
Values
Expects a string representation of a list like:
[[8, 5],[7, 6],[6, 7],[5, 8]]
Resulting dataframe
In most cases, the dataframe you get will look like this, with the json strings above:
col1 col2
1 8 5
2 7 6
3 6 7
4 5 8
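A quick self-contained check of two of the formats above (split and records), with data matching the resulting dataframe:

```python
import pandas as pd
from io import StringIO

split_json = ('{"index":[1,2,3,4],"columns":["col1","col2"],'
              '"data":[[8,5],[7,6],[6,7],[5,8]]}')
records_json = ('[{"col1":8,"col2":5},{"col1":7,"col2":6},'
                '{"col1":6,"col2":7},{"col1":5,"col2":8}]')

df_split = pd.read_json(StringIO(split_json), orient='split')
df_records = pd.read_json(StringIO(records_json), orient='records')

print(df_split)
print(df_records)
```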
Maybe this is not the most elegant solution, but it gives me back what I want, or at least I believe so; feel free to warn me if something is wrong.
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111"
data = urllib2.urlopen(url).read()
data = json.loads(data)
data = pd.DataFrame(data.items())
data = data.transpose()
Another solution is to use a try except.
json_path = 'https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111'
try:
    a = pd.read_json(json_path)
except ValueError:
    a = pd.read_json("[" + json_path + "]")
Iterating on #firelynx's answer:
#! /usr/bin/env python3
from io import StringIO
from urllib.request import urlopen
import pandas as pd
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
raw_lines = ""
for sku in products:
    url = f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}"
    raw_lines += urlopen(url).read().decode() + "\n"
data = pd.read_json(StringIO(raw_lines), lines=True)
This would support any source returning a single JSON object or a bunch of newline ('\n') separated ones.
Or this one-liner(ish) should work the same:
#! /usr/bin/env python3
import pandas as pd
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
data = pd.concat(
pd.read_json(
f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}",
lines=True
) for sku in products
)
PS: python3 is only for fstring support here, so you should use str.format for python2 compatibility.