Extract list from dict of lists then append to dataframe - python

I'm trying to extract a field from a JSON document that contains a list and then append that list to a dataframe, but I'm running into a few different errors.
I think I could write it to a CSV and then read the CSV with pandas, but I'm trying to avoid writing any files. I know that I can also use StringIO to make a CSV, but that has issues with null bytes. Replacing those would be (I think) another line-by-line step that would further extend the time the script takes to complete. I'm running this against a query that returns tens of thousands of results, so keeping it fast and simple is a priority.
First I tried this:
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
    df = df.append(ln['_source'], ignore_index=True)
print(df)
This gives me a result that looks like this:
1 2 3 4
a b d,e,f... x
Then I tried this:
df = df.append(ln['_source']['payload'], ignore_index=True)
But that gives me this error:
TypeError: cannot concatenate object of type "<class 'str'>"; only pd.Series,
pd.DataFrame, and pd.Panel (deprecated) objs are valid
What I'm looking for would be something like this:
0 1 2 3 4
d e f g h
On top of this... I need to figure out a way to handle a specific string in this list that contains a comma... which may be a headache that's best handled in a different question... something like:
# Obviously this is incorrect but I think you get the idea :)
str.replace(',', '^')
except if ',' followed by ' '
Greatly appreciate any help!
EDITING TO ADD JSON AS REQUESTED
{
"_index": "sanitized",
"_type": "sanitized",
"_id": "sanitized".,
"_score": sanitized,
"_source": {
"sanitized": sanitized,
"sanitized": "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,\"34,35\",36,37,38,39,40",
"sanitized": "sanitized",
"sanitized": ["sanitized"],
"sanitized": "sanitized",
"sanitized": "sanitized",
"sanitized": "sanitized",
"sanitized": "sanitized",
}
}]
}
}

You could write to an in-memory buffer with StringIO instead of a real file, as is done here.
Then for the second part you could do
if ',' in data and ', ' not in data:
    data = data.replace(',', '^')

You can try the following
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
    data = ln['_source']["payload"].split(",")
    df.loc[len(df)] = pd.Series(data, index=range(len(data)))
print(df)
The benefit of loc is that it does not create a new dataframe on every iteration, so it is faster than append. You can find the post here.
I would also like to suggest an alternative that can be faster: first collect all the data in a dictionary, then load the dictionary into a dataframe in a single step.
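A minimal sketch of the "collect first, build the dataframe once" idea follows. It uses a list of rows rather than a dictionary, and it assumes each `payload` is a single comma-separated string in which fields containing a comma are double-quoted (as in the sample JSON in the question). The sample `hit` string is made up for illustration. Parsing with `csv.reader` instead of `split(",")` keeps a quoted field such as `"34,35"` in one column.

```python
import csv
import json

import pandas as pd

# Hypothetical stand-in for the query response; only the hits/_source/payload
# nesting is taken from the question.
hit = json.dumps({"hits": {"hits": [
    {"_source": {"payload": 'a,b,"c,d",e'}},
    {"_source": {"payload": "1,2,3,4"}},
]}})

hit_json = json.loads(hit)

# Collect every payload string first ...
payloads = [ln["_source"]["payload"] for ln in hit_json["hits"]["hits"]]

# ... parse them as CSV lines so quoted commas stay inside one field ...
rows = list(csv.reader(payloads))

# ... and build the dataframe once instead of appending row by row.
df = pd.DataFrame(rows)
print(df)
```

Building the frame once avoids copying the whole dataframe on each iteration. Null bytes in the payload are not handled here and would still need to be dealt with separately.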

Related

Detecting duplicates in pandas when a column contains lists

Is there a reasonable way to detect duplicates in a Pandas dataframe when a column contains lists or numpy nd arrays, like the example below? I know I could convert the lists into strings, but the act of converting back and forth feels... wrong. Plus, lists seem more legible and convenient given ~how I got here (online code) and where I'm going after.
import pandas as pd
df = pd.DataFrame(
{
"author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
"date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
"ingredients": [
["ingredA", "ingredB", "ingredC"],
["ingredA", "ingredB", "ingredC"],
["ingredA", "ingredB", "ingredD"],
["ingredA", "ingredB", "ingredD", "ingredE"],
["ingredB", "ingredC", "ingredF"],
],
}
)
# Traditional find duplicates
# df[df.duplicated(keep=False)]
# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]
Both methods (the latter from this alternative find duplicates answer) result in
TypeError: unhashable type: 'list'.
They would work, of course, if the dataframe looked like this:
df = pd.DataFrame(
{
"author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
"date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
"recipe": [
"recipeC",
"recipeC",
"recipeD",
"recipeE",
"recipeF",
],
}
)
Which made me wonder if something like integer encoding might be reasonable? It's not that different from converting to/from strings, but at least it's legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).
With map tuple (lists are unhashable, tuples are not):
out = df[df.assign(ingredients = df['ingredients'].map(tuple)).duplicated(keep=False)]
Out[295]:
author date ingredients
0 Jefe98 1423112400 [ingredA, ingredB, ingredC]
1 Jefe98 1423112400 [ingredA, ingredB, ingredC]
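As a self-contained sketch of the tuple approach on the question's data: the list column is converted to tuples only in a temporary copy used to compute the duplicate mask, so the rows that come back still hold the original lists. The last line shows one possible way to get a single string of ingredients per row, which the question also asks about; the `", "` separator is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Temporary copy in which the list column is hashable (tuples).
hashable = df.assign(ingredients=df["ingredients"].map(tuple))

# Use the mask from the copy to select rows from the original frame.
dupes = df[hashable.duplicated(keep=False)]
print(dupes)

# Alternative representation: one joined string per row.
joined = df["ingredients"].map(", ".join)
```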

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access inner attributes of following json using pyspark
[
{
"432": [
{
"atttr1": null,
"atttr2": "7DG6",
"id":432,
"score": 100
}
]
},
{
"238": [
{
"atttr1": null,
"atttr2": "7SS8",
"id":432,
"score": 100
}
]
}
]
In the output, I am looking for something like below in form of csv
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details as shown below, but I don't want to pass 432 or 238 in the lambda expression, because in a bigger JSON this key (the italicized one) will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x:(x['*432*'])).first())
print(inputDF.rdd.map(lambda x:(x['*238*'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4
Without using pyspark features, you can do it like this:
data = json.loads(json_str) # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns)) # headers
for item in data:
    for obj in list(item.values())[0]:  # since each list has only one object
        print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in list
    print(','.join(str(obj[col]) for col in columns))
FYI, you can store those lines in a variable or write them out to a CSV file instead of, or in addition to, printing them.
And if you're just looking to dump that to csv, see this answer.
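For the CSV route, here is a small sketch using the standard library's `csv.DictWriter`. It iterates over whatever top-level keys are present, so neither `432` nor `238` is hard-coded. The `json_str` literal is the question's sample data; writing to an in-memory `io.StringIO` buffer rather than a file is an assumption made to keep the example self-contained.

```python
import csv
import io
import json

json_str = """[
  {"432": [{"atttr1": null, "atttr2": "7DG6", "id": 432, "score": 100}]},
  {"238": [{"atttr1": null, "atttr2": "7SS8", "id": 432, "score": 100}]}
]"""

data = json.loads(json_str)
columns = ["atttr1", "atttr2", "id", "score"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()

for item in data:
    # item.values() yields the inner lists regardless of the key name.
    for records in item.values():
        writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that `csv` writes `None` as an empty field rather than the text `null`, and that the `id` column reflects the values stored in the objects, not the outer keys.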

Extract JSON | API | Pandas DataFrame

I am using the Facebook API (v2.10), from which I've extracted the data I need, 95% of which is perfect. My problem is the 'actions' metric, which is returned as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
"actions": [
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "7"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "3"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "144"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "34"
}]}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type, create a new column and use "action_type" as the column name?
List the correct value under this column
It looks like JSON, but when I look at the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it): can you either point me in the direction of the right material, and I will read it and work it out on my own (I'm not entirely sure what to look for), or, if you decide this is an easy problem, explain to me how and why you solved it this way. I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues with this running in my script, i.e. if it runs within a bigger code block, I get the error shown further below:
for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value
If I save the DF as a csv, call it using pd.read_csv...it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.io.json.json_normalize(d, record_path="actions")
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
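In pandas 1.0 and later the same function is available at the top level as `pd.json_normalize`, and the `pd.io.json` path is deprecated. A minimal sketch of the same call with the newer name, using the sample data from the question:

```python
import pandas as pd

d = {
    "actions": [
        {"action_type": "offsite_conversion.custom.xxxxxxxxxxx", "value": "7"},
        {"action_type": "offsite_conversion.custom.xxxxxxxxxxx", "value": "3"},
        {"action_type": "offsite_conversion.custom.xxxxxxxxxxx", "value": "144"},
        {"action_type": "offsite_conversion.custom.xxxxxxxxxxx", "value": "34"},
    ]
}

# record_path names the key that holds the list of records to flatten.
actions = pd.json_normalize(d, record_path="actions")
print(actions)
```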
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
    df[col] = df2[col]
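A likely reason for the `ValueError` in the question is that `ast.literal_eval` only accepts strings (or AST nodes): after `pd.read_csv` each cell is a string, but inside the script each cell is already a Python list. Assuming every cell of `Conversions` is a list of `{"action_type": ..., "value": ...}` dictionaries, one possible sketch turns each cell into a single dictionary and builds the wide table in one step. The sample frame, the short action names and the helper `actions_to_dict` are hypothetical.

```python
import pandas as pd

# Hypothetical frame: each 'Conversions' cell is a list of action dicts.
df = pd.DataFrame({
    "date": ["2017-09-01", "2017-09-02"],
    "Conversions": [
        [{"action_type": "custom.a", "value": "7"},
         {"action_type": "custom.b", "value": "3"}],
        [{"action_type": "custom.a", "value": "144"}],
    ],
})


def actions_to_dict(actions):
    """Map one cell (list of action dicts) to {action_type: value}."""
    if not isinstance(actions, list):  # e.g. NaN for days without actions
        return {}
    return {a["action_type"]: a["value"] for a in actions}


# One row of action columns per original row, aligned on the original index.
wide = pd.DataFrame(
    [actions_to_dict(cell) for cell in df["Conversions"]],
    index=df.index,
)
result = df.join(wide)
print(result)
```

Because `wide` reuses `df.index`, the join lines up even when the index is not the default integer range. Action types missing on a given day show up as NaN.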

Manipulating/Replacing values in dictionaries within lists

My data looks similar to the data shown below ('snippet.json'). I want to be able to replace values for example, for id:1, replace employee number to 2455.
Data Snippet:
{"employees": [{"level1":{"id":1, "firstname": "John", "employee
number": 2343 },{"level1":{"id":2, "firstname": "Jane", "employee
number": 5647 }}]}
I understand that it is much easier to replace values when in the form of a list or a dictionary, so I did the following to convert it to a list.
import json
viewer_string=open('snippet.json','r')
data_str = viewer_string.read()
data_list = []
data_list.append(data_str)
But this doesn't seem to be working. Is there any way I could convert snippet.json into a dictionary? Or is there another way to go about this?
Since you are importing json, you might want to do something like the following:
json_data = json.loads(viewer_string.read())
Now you have your data as a dict, and you can loop through it and replace values as you wish. Make sure the file contains valid JSON.
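As a rough sketch of the loop-and-replace step: the `raw` string below is the question's snippet with the braces balanced so that it is valid JSON (an assumption about what the real file contains). With a file, `json.load(open('snippet.json'))` would take the place of `json.loads(raw)`.

```python
import json

# Assumed valid version of the snippet from the question.
raw = """{"employees": [
  {"level1": {"id": 1, "firstname": "John", "employee number": 2343}},
  {"level1": {"id": 2, "firstname": "Jane", "employee number": 5647}}
]}"""

data = json.loads(raw)

# Walk the list and update the record whose id is 1.
for employee in data["employees"]:
    record = employee["level1"]
    if record["id"] == 1:
        record["employee number"] = 2455

# Serialize again if the result needs to be written back out.
updated = json.dumps(data, indent=2)
print(updated)
```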

Format JSON objects in python

I have JSON objects in a text file (C:\data.txt). There are millions of records; I have used just one record as an example. I want the file to contain only the data, like this:
1 123-567-9876 TEST1 TEST 717-567-9876 Harrisburg null US_PA
I don't want parentheses, brackets, etc.
Once I have the clean data, the plan is to import it from a text file (say C:\data2.txt) into a SQL database.
This is the format of json object.
{
"status":"ok",
"items":[
{
"1":{
"Work_Phone":"123-567-9876",
"Name_Part":[
"TEST1",
"TEST"
],
"Residence_Phone":"717-567-9876",
"Mailing_City":"Harrisburg",
"Mailing_Street_Address_line_1":"",
"Cell_Phone":null,
"Mailing_Country_AND_Province_OR_State":"US_PA"
}
}
]
}
Can someone please help with Python code to format this JSON object and export it to a text file?
You can use
import simplejson as json
Then you can open your file and load it into a Python dictionary:
f = open("C:/data.txt", "r")
data = json.loads(f.read())
But this only works when the JSON objects are stored in an array in your file. So the file has to look like this:
[{ ... first record ...},
{... second record ...},
...,
{... last record ...}]
Then data holds an array of dictionaries. Now you can write the records to another file:
g = open("output.txt", "w")
for d in data:
    for i in d["items"]:
        for k in i.keys():
            g.write(... some string build from the parameters ...)
If done well, the file output.txt contains the lines. The details might be difficult because each item seems to contain some arrays.
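One possible way to build the output string is sketched below with the standard-library `json` module. The helpers `flatten` and `record_lines` are hypothetical. The sketch assumes the desired line is the record key followed by all leaf values in file order, with nested arrays expanded, `null` written as the text `null`, and empty strings skipped, which matches the example line in the question.

```python
import json

raw = """{"status": "ok", "items": [{"1": {
  "Work_Phone": "123-567-9876",
  "Name_Part": ["TEST1", "TEST"],
  "Residence_Phone": "717-567-9876",
  "Mailing_City": "Harrisburg",
  "Mailing_Street_Address_line_1": "",
  "Cell_Phone": null,
  "Mailing_Country_AND_Province_OR_State": "US_PA"}}]}"""


def flatten(value):
    """Yield the leaf values of a nested JSON value as strings."""
    if isinstance(value, dict):
        for v in value.values():
            yield from flatten(v)
    elif isinstance(value, list):
        for v in value:
            yield from flatten(v)
    elif value is None:
        yield "null"
    elif value != "":  # skip empty strings, as in the desired output
        yield str(value)


def record_lines(doc):
    """Yield one space-separated line per record under doc['items']."""
    for item in doc["items"]:
        for record_id, fields in item.items():
            yield " ".join([record_id, *flatten(fields)])


lines = list(record_lines(json.loads(raw)))
print("\n".join(lines))
```

Each line can then be written to the output file with `g.write(line + "\n")`. A space-separated format becomes ambiguous as soon as a value itself contains a space, so a tab or comma delimiter may be safer for the SQL import.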
