Converting a dataframe into JSON (in pyspark) and then selecting desired fields - python

I'm new to Spark. I have a dataframe that contains the results of some analysis. I converted that dataframe into JSON so I could display it in a Flask App:
results = result.toJSON().collect()
An example entry in my json file is below. I then tried to run a for loop in order to get specific results:
{"userId":"1","systemId":"30","title":"interest"}
for i in results:
    print i["userId"]
This doesn't work at all and I get errors such as: Python (json) : TypeError: expected string or buffer
I used json.dumps and json.loads and still nothing - I keep on getting errors such as string indices must be integers, as well as the above error.
I then tried this:
print i[0]
This gave me the first character in the json "{" instead of the first line. I don't really know what to do, can anyone tell me where I'm going wrong?
Many Thanks.

If the result of result.toJSON().collect() is a JSON encoded string, then you would use json.loads() to convert it to a dict. The issue you're running into is that when you iterate a dict with a for loop, you're given the keys of the dict. In your for loop, you're treating the key as if it's a dict, when in fact it is just a string. Try this:
import json
# toJSON() turns each row of the DataFrame into a JSON string;
# calling first() on the result fetches the first row.
results = json.loads(result.toJSON().first())
for key in results:
    print(results[key])
# To decode the entire DataFrame, iterate over the result
# of toJSON()
def print_rows(row):
    data = json.loads(row)
    for key in data:
        print("{key}:{value}".format(key=key, value=data[key]))
results = result.toJSON()
results.foreach(print_rows)
EDIT: The issue is that collect returns a list, not a dict. I've updated the code. Always read the docs.
collect() Return a list that contains all of the elements in this RDD.
Note This method should only be used if the resulting array is
expected to be small, as all the data is loaded into the driver’s
memory.
EDIT2: I can't emphasize enough, always read the docs.
EDIT3: Look here.
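The point about collect() returning a list can be checked without a cluster. This is a minimal sketch, assuming `collected` stands in for what `result.toJSON().collect()` returns (a plain Python list of JSON strings); the second row is made up for illustration.

```python
import json

# Stand-in for result.toJSON().collect(): a list of JSON strings, one per row.
# (Assumption: the second row is hypothetical sample data.)
collected = [
    '{"userId":"1","systemId":"30","title":"interest"}',
    '{"userId":"2","systemId":"31","title":"hobby"}',
]

# Each element is still a string, so indexing yields characters, not fields.
first_char = collected[0][0]

# Decode every row into a dict before selecting fields.
rows = [json.loads(line) for line in collected]
user_ids = [row["userId"] for row in rows]

print(first_char)
print(user_ids)
```

This mirrors the symptoms in the question: `i[0]` returns "{" because `i` is a string, and `i["userId"]` only works after `json.loads`.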

>>> import json
>>> df = sqlContext.read.table("n1")
>>> df.show()
+-----+-------+----+---------------+-------+----+
|   c1|     c2|  c3|             c4|     c5|  c6|
+-----+-------+----+---------------+-------+----+
|00001|Content|   1|Content-article|       |2018|
|00002|Content|null|Content-article|Content|2015|
+-----+-------+----+---------------+-------+----+
>>> results = df.toJSON().map(lambda j: json.loads(j)).collect()
>>> for i in results: print(i["c1"], i["c6"])
...
00001 2018
00002 2015

Here is what worked for me:
df_json = df.toJSON()
for row in df_json.collect():
    # JSON string
    print(row)
    # parsed into a Python dict
    line = json.loads(row)
    # some_key is whichever field name you need
    print(line[some_key])
Keep in mind that .collect() pulls every row into the driver's memory, which defeats the purpose of a distributed DataFrame; use it only when the result is small.

To get an array of python dicts:
results = df.toJSON().map(json.loads).collect()
To get an array of JSON strings:
results = df.toJSON().collect()
To get a JSON string (i.e. a JSON string of an array):
results = df.toPandas().to_json(orient='records')
and using that to get an array of Python dicts:
results = json.loads(df.toPandas().to_json(orient='records'))
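The difference between "a list of JSON strings" and "one JSON string holding an array" can be seen with the standard library alone. A rough sketch, assuming `json_rows` is a hypothetical stand-in for `df.toJSON().collect()`:

```python
import json

# Hypothetical stand-in for df.toJSON().collect(): one JSON string per row.
json_rows = ['{"c1":"00001","c6":"2018"}', '{"c1":"00002","c6":"2015"}']

# List of dicts: the same result as mapping json.loads over the rows.
dict_rows = [json.loads(s) for s in json_rows]

# A single JSON string that encodes an array of records.
json_array = json.dumps(dict_rows)

# Round trip: decoding the array string gives back the list of dicts.
round_trip = json.loads(json_array)

print(type(json_rows[0]).__name__, type(dict_rows[0]).__name__, type(json_array).__name__)
```

For a Flask view, the single-string form is usually the one to send to the client, while the list of dicts is easier to work with on the Python side.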

Related

Convert json dict

I have pulled JSON data from a URL. The result is a dictionary. How can I transform this dictionary so that each metric is a column and the time is the index for each value?
Thanks in advance.
time                      AdrActCnt  BlkCnt  BlkSizeByte
2021-01-28T00:00:00.000Z  1097896.0  145.0   190568423.0
2021-01-29T00:00:00.000Z  1208741.0  152.0   199725189.0
2021-01-30T00:00:00.000Z  1087755.0  136.0   177349536.0
Output:
{"metricData":{"metrics":["AdrActCnt","BlkCnt","BlkSizeByte"],"series":
[{"time":"2021-01-28T00:00:00Z","values":["1097896.0","145.0","190568423.0"]},
{"time":"2021-01-29T00:00:00Z","values":["1208741.0","152.0","199725189.0"]},
{"time":"2021-01-30T00:00:00Z","values":["1087755.0","136.0","177349536.0"]}
You may be looking for a dict comprehension, which is like a list comprehension but produces a dictionary:
liststuff = [{"time":"2021-01-28T00:00:00.000Z","values":["1097896.0","145.0","190568423.0"]},{"time":"2021-01-29T00:00:00.000Z","values":["1208741.0","152.0","199725189.0"]},{"time":"2021-01-30T00:00:00.000Z","values":["1087755.0","136.0","177349536.0"]}]
dictstuff = {item['time']:item['values'] for item in liststuff}
print(dictstuff)
{'2021-01-28T00:00:00.000Z': ['1097896.0', '145.0', '190568423.0'], '2021-01-29T00:00:00.000Z': ['1208741.0', '152.0', '199725189.0'], '2021-01-30T00:00:00.000Z': ['1087755.0', '136.0', '177349536.0']}
liststuff is your data; it just needed wrapping in [] (I assume that's a typo in the question, since it's not valid JSON without the brackets). If you need help with parsing the string, use json.loads() (from the json module) to turn it into actual Python data:
import json
jsonstuff = '[{"time":"2021-01-28T00:00:00.000Z","values":["1097896.0","145.0","190568423.0"]},{"time":"2021-01-29T00:00:00.000Z","values":["1208741.0","152.0","199725189.0"]},{"time":"2021-01-30T00:00:00.000Z","values":["1087755.0","136.0","177349536.0"]}]'
liststuff = json.loads(jsonstuff)
(here jsonstuff is the string you've downloaded)
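To get each metric as its own column, the metric names can be zipped with each entry's values. This is a sketch under the assumption that the full payload looks like the question's output with its closing brackets restored; converting the values to float is also an assumption about what is wanted.

```python
import json

# Payload shaped like the question's output (closing brackets added so it parses).
payload = json.loads("""
{"metricData": {"metrics": ["AdrActCnt", "BlkCnt", "BlkSizeByte"],
 "series": [
  {"time": "2021-01-28T00:00:00Z", "values": ["1097896.0", "145.0", "190568423.0"]},
  {"time": "2021-01-29T00:00:00Z", "values": ["1208741.0", "152.0", "199725189.0"]},
  {"time": "2021-01-30T00:00:00Z", "values": ["1087755.0", "136.0", "177349536.0"]}
 ]}}
""")

metrics = payload["metricData"]["metrics"]

# Map each timestamp to {metric name: numeric value}.
table = {
    item["time"]: dict(zip(metrics, map(float, item["values"])))
    for item in payload["metricData"]["series"]
}

print(table["2021-01-28T00:00:00Z"]["BlkCnt"])
```

If pandas is available, `pandas.DataFrame.from_dict(table, orient="index")` should turn this mapping into a frame with the time as the index and one column per metric.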

How do I replace a JSON list to print to a CSV?

I'm using an API to gather some data that comes to me in JSON format. I'm using json.loads to import the data and can successfully write it to a CSV. Unfortunately, the data arrives in a format that I don't want, so I'd like to reformat the JSON list.
I've tried creating a new list and assigning the JSON list to the desired list. I get the following error: TypeError: list indices must be integers or slices, not str
import requests
import json
import csv
response = requests.get(url).text  # JSON source
data = json.loads(response)
newsdata = data["response"]["docs"]
# These two lines reformat the date to what I want it to look like
newsdate = [y["pub_date"] for y in newsdata]
newsdate = [y.split('T')[0] for y in newsdate]
newsdata["pub_date"] = newsdate  # This line is what I've tried to replace the JSON with
newssnip = [y["snippet"] for y in newsdata]
newshead = [y["headline"]["main"] for y in newsdata]
for z in newsdata:
    csvwriter.writerow([z["pub_date"],  # This is the JSON data I want to reformat
                        z["headline"]["main"],
                        z["snippet"],
                        z["web_url"]])
I expected the newsdata["pub_date"] to be overwritten when I assigned newsdate to it but I get the following error instead: TypeError: list indices must be integers or slices, not str
Thank you for your help! :)
EDIT:
I've uploaded an example json response here on github called "exmaple.json": https://github.com/theChef613/nytnewsscrapper
That error says that newsdata is a list and therefore cannot be indexed with a string. Post the raw JSON data returned, or run print(type(newsdata)) to find out what type newsdata is and how to work with it. It's also possible that newsdata is a 2D (or N-d) array where the first element is the key and the second element is the value.
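Since newsdata is a list of dicts, one way around the error is to rewrite the "pub_date" field on each dict rather than on the list itself. A minimal sketch, assuming the documents look like the hypothetical sample below; it writes to an in-memory buffer so it runs without a file or network access.

```python
import csv
import io

# Hypothetical sample shaped like data["response"]["docs"]: a list of dicts.
newsdata = [
    {"pub_date": "2019-01-05T10:00:00+0000", "headline": {"main": "First"},
     "snippet": "s1", "web_url": "https://example.com/1"},
    {"pub_date": "2019-01-06T11:30:00+0000", "headline": {"main": "Second"},
     "snippet": "s2", "web_url": "https://example.com/2"},
]

# A list has no "pub_date" key; each dict inside it does, so update those.
for doc in newsdata:
    doc["pub_date"] = doc["pub_date"].split("T")[0]

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
csvwriter = csv.writer(buffer)
for doc in newsdata:
    csvwriter.writerow([doc["pub_date"], doc["headline"]["main"],
                        doc["snippet"], doc["web_url"]])

print(buffer.getvalue())
```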

Changing Json from API with Python

I have JSON from an API and I want to collect the new chat members with the code below, but I only get the first two results and not the last (Tester). Why? It should iterate through the whole JSON file, shouldn't it?
r = requests.get("https://api.../getUpdates").json()
chat_members = []
a = 0
for i in r:
    chat_members.append(r['result'][a]['message']['new_chat_members'][0]['last_name'])
    a = a + 1
Json here:
{"ok":true,"result":[{"update_id":213849278,
"message":{"message_id":37731,"from":{"id":593029363,"is_bot":false,"first_name": "#tutu"},"chat":{"id":-1001272017174,"title":"tester account","username":"v_glob","type":"supergroup"},"date":1537470595,"new_chat_participant":{"id":593029363,"is_bot":false,"first_name":"tutu "},"new_chat_member":{"id":593029363,"is_bot":false,"first_name":"\u7535\u62a5\u589e\u7c89\uff0c\u4e2d\u82f1\u6587\u5ba2\u670d\uff0c\u62c9\u4eba\u6e05\u5783\u573e\u8f6f\u4ef6\uff0c\u5e7f\u544a\u63a8\u5e7f\uff0cKYC\u6750\u6599\u8ba4\u8bc1\uff0c","last_name":"#tutupeng"},"new_chat_members":[{"id":593029363,"is_bot":false,"first_name":"\u7535\u62a5\u589e\u7c89\uff0c\u4e2d\u82f1\u6587\u5ba2\u670d\uff0c\u62c9\u4eba\u6e05\u5783\u573e\u8f6f\u4ef6\uff0c\u5e7f\u544a\u63a8\u5e7f\uff0cKYC\u6750\u6599\u8ba4\u8bc1\uff0c","last_name":"#tutu"}]}},{"update_id":213849279,
"message":{"message_id":37732,"from":{"id":658150956,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"chat":{"id":-10012720,"title":"v glob OFFICIAL","username":"v_glob","type":"supergroup"},"date":1537484441,"new_chat_participant":{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"new_chat_member":{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"new_chat_members":[{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"}]}},{"update_id":213849280,
"message":{"message_id":12,"from":{"id":696749142,"is_bot":false,"first_name":"daniel","language_code":"cs-cz"},"chat":{"id":696749142,"first_name":"daniel","type":"private"},"date":1537537013,"text":"/stat","entities":[{"offset":0,"length":5,"type":"bot_command"}]}},{"update_id":213849281,
"message":{"message_id":37740,"from":{"id":669620,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"chat":{"id":-100127201,"title":"test account","username":"v_glob","type":"supergroup"},"date":1537537597,"new_chat_participant":{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"new_chat_member":{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"new_chat_members":[{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"}]}}]}
Because you iterate over the entire response dict. The top level only has two items ("ok" and "result"), so the loop runs only twice. Note that you don't actually use the loop variable, and you have a completely unnecessary separate counter.
Instead, you should iterate over the result list:
for result in r['result']:
    if "new_chat_members" in result['message']:
        chat_members.append(result['message']['new_chat_members'][0]['last_name'])
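A colleague of mine has come up with a solution:
for i in r['result']:
    # not every update contains a new member (e.g. the "/stat" message)
    if 'new_chat_member' in i['message']:
        chat_members.append(i['message']['new_chat_member']['first_name'])
To sum up: iterate through 'result' directly, without a separate positional counter.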
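The effect of looping over the response dict versus its 'result' list can be reproduced with a trimmed payload. This is a sketch only: the dict below is a hypothetical, shortened stand-in for the getUpdates response, and falling back to first_name when last_name is missing is an assumption about the desired behaviour.

```python
# Trimmed, hypothetical stand-in for the getUpdates response in the question.
r = {
    "ok": True,
    "result": [
        {"message": {"new_chat_members": [{"first_name": "tutu", "last_name": "#tutu"}]}},
        {"message": {"new_chat_members": [{"first_name": "Rebecca", "last_name": "Lawson"}]}},
        {"message": {"text": "/stat"}},  # no new members in this update
        {"message": {"new_chat_members": [{"first_name": "Ivan", "last_name": "Tester"}]}},
    ],
}

# The top-level dict has two keys, which is why the original loop ran twice.
top_level_keys = len(r)

chat_members = []
for update in r["result"]:
    # .get returns an empty list when the update has no new members.
    for member in update["message"].get("new_chat_members", []):
        # last_name may be absent, so fall back to first_name.
        chat_members.append(member.get("last_name", member["first_name"]))

print(top_level_keys)
print(chat_members)
```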

How to store list in rows of mysql using Python/Flask?

I am getting some values from a html form and I am storing these values to a list. List is like:
["string1", "string2", "string3", "string4", "string5"]
I want to store each of these values in its own row of a MySQL table, but I am not sure how to do it.
What I have done so far is:
descrip = []
descrip.append(description1)
descrip.append(description2)
descrip.append(description3)
descrip.append(description4)
descrip.append(description5)
for r in descrip:
    result_descrp = db.execute("""INSERT INTO description(id,description) VALUES (1,%s)""", (descrip))
return render_template('forms/success.html')
But I am getting this error:
TypeError: not all arguments converted during string formatting
First, the query has a single %s placeholder, which expects one value, but you pass the whole list to it.
I also don't know the type of the description column in your schema. If you just want to save the string representation of the list in the database, you can convert the list with str(descrip).
MySQL also supports a JSON column type (MariaDB does too).
descrip = []
descrip.append(description1)
descrip.append(description2)
descrip.append(description3)
descrip.append(description4)
descrip.append(description5)
for text in descrip:
    # skip empty form fields
    if text:
        result_add_event = db.execute("""INSERT INTO event_description(event_id,description,created_at) VALUES (%s,%s,%s)""",(id,text,timestamp))
The code above worked fine. :)
Special thanks to @shiva and also to everyone else who helped me.
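The row-per-item insert can also be written as one executemany call with one parameter tuple per row. A hedged sketch: sqlite3 is used here purely as a stand-in so the example runs without a MySQL server, and it uses "?" placeholders where MySQL drivers typically use "%s"; the table layout is assumed from the question.

```python
import sqlite3

# Assumption: sqlite3 stands in for MySQL so the sketch runs anywhere.
# sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE description (id INTEGER, description TEXT)")

# Form values; the empty string represents a field left blank.
descrip = ["string1", "string2", "", "string4", "string5"]

# One parameter tuple per row; empty form fields are skipped.
params = [(1, text) for text in descrip if text]
conn.executemany("INSERT INTO description (id, description) VALUES (?, ?)", params)
conn.commit()

stored = [row[0] for row in
          conn.execute("SELECT description FROM description ORDER BY rowid")]
print(stored)
```

Each placeholder receives exactly one value per row, which avoids the "not all arguments converted" error caused by passing the whole list to a single placeholder.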

extract a dictionary key value from a string

I am currently using Python to transmit a Python dictionary from one Raspberry Pi to another over a 433 MHz link, using virtual wire (vw.py) to send data.
The issue with vw.py is that data being sent is in string format.
I am successfully receiving the data on PI_no2, and now I am trying to reformat the data so it can be placed back in a dictionary.
I have created a small snippet to test with, and created a temporary string in the same format it is received as from vw.py
So far I have successfully split the string at the colon, and I am now trying to get rid of the double quotes, without much success.
my_status = {}
# temp is in the format the data is received in
temp = "'mycode':['1','2','firstname','Lastname']"
key,value = temp.split(':')
print(key)
print(value)
key = key.replace("'",'')
value = value.replace("'",'')
my_status.update({key:value})
print(my_status)
Gives the result
'mycode'
['1','2','firstname','Lastname']
{'mycode': '[1,2,firstname,Lastname]'}
I require the value to be in the format
['1','2','firstname','Lastname']
but the replace call gets rid of all the single quotes, leaving the value as one plain string.
You can use ast.literal_eval
import ast
temp = "'mycode':['1','2','firstname','Lastname']"
key,value = map(ast.literal_eval, temp.split(':'))
status = {key: value}
Will output
{'mycode': ['1', '2', 'firstname', 'Lastname']}
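A variation on the same idea, offered as a sketch rather than a replacement: wrapping the received text in braces turns it into a dict literal, so ast.literal_eval can parse it in one step without splitting on the colon. The second input string is a hypothetical example of a value that itself contains a colon.

```python
import ast

temp = "'mycode':['1','2','firstname','Lastname']"

# Wrapping the text in braces makes it a dict literal, so values that
# themselves contain ":" do not break the parse the way split(':') would.
status = ast.literal_eval("{" + temp + "}")

# Same idea with a value containing a colon (hypothetical input).
timed = ast.literal_eval("{" + "'stamp':['12:30','ok']" + "}")

print(status)
print(timed)
```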
This shouldn't be hard to solve. What you need to do is strip away the [ ] in your list string, then split on the commas. Once you've done this, iterate over the elements and add them to a list. Your code should look like this:
string = "[1,2,firstname,lastname]"
string = string.strip("[")
string = string.strip("]")
values = string.split(",")
final_list = []
for val in values:
    final_list.append(val)
print(final_list)
This will print:
['1', '2', 'firstname', 'lastname']
Then take this list and insert it into your dictionary:
d = {}
d['mycode'] = final_list
The advantage of this method is that you can handle each value independently. If you need to convert 1 and 2 to int then you'll be able to do that while leaving the other two as str.
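The per-value handling mentioned above might look like the following sketch. Using str.isdigit() to decide what counts as a number is an assumption; it covers non-negative integers only.

```python
raw = "[1,2,firstname,lastname]"

# Remove the surrounding brackets, then split on commas.
items = raw.strip("[]").split(",")

# Convert purely numeric items to int and keep the rest as str.
final_list = [int(item) if item.isdigit() else item for item in items]

d = {"mycode": final_list}
print(d)
```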
As an alternative to cricket_007's suggestion of using a syntax tree parser: your format is very similar to standard YAML. PyYAML is a pretty lightweight and intuitive library, so I'll suggest it:
import yaml
a = "'mycode':['1','2','firstname','Lastname']"
print(yaml.safe_load(a.replace(":", ": ")))
# prints the dictionary {'mycode': ['1', '2', 'firstname', 'Lastname']}
The only difference between your format and YAML is that the colon needs a space after it.
It will also distinguish between primitive data types for you, if that's important. Drop the quotes around 1 and 2 and it determines that they're numeric.
Tadhg McDonald-Jensen suggested pickling in the comments. This will allow you to store more complicated objects, though you may lose the human-readable format you've been experimenting with.
