Unable to convert JSON to DataFrame - python

I am unable to convert JSON to a DataFrame; a TypeError is raised.
The following data is created:
test_data =
{'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2020, 10, 30, 8, 3, 54, 190000, tzinfo=tzlocal()),
'id': '12345',
'properties': {'createdate': '[![2020-10-30T08:03:54.190Z][1]][1]',
'email': 'testmail#gmail.com',
'firstname': 'TestFirst',
'lastname': 'TestLast'},
'properties_with_history': None,
'updated_at': datetime.datetime(2022, 11, 10, 6, 44, 14, 5000, tzinfo=tzlocal())}
data = json.loads(test_data)
TypeError: the JSON object must be str, bytes or bytearray, not SimplePublicObjectWithAssociations
The following has been tried:
s1 = json.dumps(test_data)
d2 = json.loads(s1)
TypeError: Object of type SimplePublicObjectWithAssociations is not JSON serializable
Preferred output:

You can try this:
import pandas as pd
df = pd.json_normalize(test_data)
print(df)
'''
archived archived_at associations created_at id properties_with_history updated_at properties.createdate properties.email properties.firstname
0 False None None 2020-10-30T08:03:54.190Z 12345 2022-11-10T06:44:14.500Z [![2020-10-30T08:03:54.190Z][1]][1] testmail#gmail.com TestFirst
'''
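If you want only specific columns:
df = df[['id','properties.createdate','properties.email','properties.firstname','properties.lastname']]
# regex=False treats 'properties.' as a literal prefix instead of a regex where '.' matches any character
df.columns = df.columns.str.replace('properties.', '', regex=False)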
df
id createdate email firstname lastname
0 12345 [![2020-10-30T08:03:54.190Z][1]][1] testmail#gmail.com TestFirst TestLast
If you want to convert the createdate column to datetime:
import datefinder
df['createdate']=df['createdate'].apply(lambda x: list(datefinder.find_dates(x))[0])
df
id createdate email firstname lastname
0 12345 2020-10-30 08:03:54.190000+00:00 testmail#gmail.com TestFirst TestLast
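The TypeError in the question suggests that test_data is an SDK model object rather than a plain dict, so it may need converting before json_normalize can flatten it. A minimal sketch, assuming the object exposes a to_dict() method (OpenAPI-generated client models usually do, but check this for your client version). FakeContact is a hypothetical stand-in used only to keep the example self-contained:

```python
import datetime

import pandas as pd


class FakeContact:
    """Hypothetical stand-in for the SDK object; not part of any real library."""

    def __init__(self, payload):
        self._payload = payload

    def to_dict(self):
        # Assumption: the real object offers an equivalent to_dict() method.
        return dict(self._payload)


contact = FakeContact({
    'archived': False,
    'created_at': datetime.datetime(2020, 10, 30, 8, 3, 54, 190000),
    'id': '12345',
    'properties': {'email': 'testmail#gmail.com',
                   'firstname': 'TestFirst',
                   'lastname': 'TestLast'},
})

# Convert to a plain dict first, then flatten the nested 'properties' keys.
df = pd.json_normalize(contact.to_dict())
print(df[['id', 'properties.email', 'properties.firstname']])
```

If the real object has no such method, this approach does not apply as written.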

Here is a partial solution. Selecting columns from the resulting DataFrame, or unpivoting it, might get you closer:
import pandas as pd
import datetime
import json
import jsonpickle
test_data ={'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2020, 10, 30, 8, 3, 54, 190000),
'id': '12345',
'properties': {'createdate': '[![2020-10-30T08:03:54.190Z][1]][1]',
'email': 'testmail#gmail.com',
'firstname': 'TestFirst',
'lastname': 'TestLast'},
'properties_with_history': None,
'updated_at': datetime.datetime(2022, 11, 10, 6, 44, 14, 5000)}
data = jsonpickle.encode(test_data, unpicklable=False)
pd.read_json(data)
I tried melt and unstack, but I did not reach your preferred output.
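If the data is a plain dict, as in the snippet above, the standard library alone may be enough. This is a sketch, not a fix for the SDK object itself: json.dumps accepts a default callable that is applied to values it cannot serialize, so default=str turns the datetime values into strings. The dates then arrive in the DataFrame as text and would still need parsing.

```python
import datetime
import json

import pandas as pd

test_data = {
    'archived': False,
    'created_at': datetime.datetime(2020, 10, 30, 8, 3, 54, 190000),
    'id': '12345',
    'properties': {'email': 'testmail#gmail.com',
                   'firstname': 'TestFirst',
                   'lastname': 'TestLast'},
    'updated_at': datetime.datetime(2022, 11, 10, 6, 44, 14, 5000),
}

# default=str is called for every value json cannot serialize (here: datetime).
s1 = json.dumps(test_data, default=str)
d2 = json.loads(s1)

# One row, with nested keys flattened to 'properties.<name>' columns.
df = pd.json_normalize(d2)
print(df)
```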

Related

Create subgroups based on dateTime

I have a list of roughly 600 objects.
I have a single object example here:
{'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '', 'StartTime': datetime.datetime(2021, 7, 16, 22, 47, 50, 383000, tzinfo=tzlocal()) }
The problem is that I have a list of these, and I want to split them into subgroups so that each object is grouped with the others whose "StartTime" is within 5 minutes of its own. I've been working on this for over a week and have no idea how to do the grouping. After grouping, I need to apply some rules to each group to ensure its members have the correct tags and information.
Just for reference, these are snapshot objects created by amazon aws boto3 describe_snapshots method. You can read about them here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Client.describe_snapshots
You can use pandas for this to group the dataframe with pd.Grouper(key='StartTime', freq='5min'):
import pandas as pd
import datetime
from dateutil.tz import tzlocal
data = [{'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '', 'StartTime': datetime.datetime(2021, 7, 16, 22, 47, 50, 383000, tzinfo=tzlocal()) },{'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '', 'StartTime': datetime.datetime(2021, 7, 16, 22, 48, 50, 383000, tzinfo=tzlocal()) },{'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '', 'StartTime': datetime.datetime(2021, 7, 16, 22, 58, 50, 383000, tzinfo=tzlocal()) },{'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '', 'StartTime': datetime.datetime(2021, 7, 16, 22, 59, 50, 383000, tzinfo=tzlocal()) }]
df = pd.DataFrame(data)
df_grouped = df.groupby(pd.Grouper(key='StartTime', freq='5min'))
Or you could create an extra column in the original dataframe with the number of the group, e.g.:
def bin_number(table):
    # table.name is the group key; its position among all keys is the bin number
    table['bin'] = list(df_grouped.groups.keys()).index(table.name)
    return table
df_grouped = df.groupby(pd.Grouper(key='StartTime', freq='5min'), as_index=False)
df_grouped = df_grouped.apply(bin_number).reset_index()
Output:
   index Description  Encrypted OwnerId Progress SnapshotId                         StartTime  bin
0      0                  False             100%             2021-07-16 22:47:50.383000+02:00    0
1      1                  False             100%             2021-07-16 22:48:50.383000+02:00    0
2      2                  False             100%             2021-07-16 22:58:50.383000+02:00    2
3      3                  False             100%             2021-07-16 22:59:50.383000+02:00    2
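Note that pd.Grouper with freq='5min' assigns rows to fixed clock windows (22:45 to 22:50, 22:50 to 22:55, and so on), so two snapshots taken at 22:49:50 and 22:50:10 would land in different bins even though they are 20 seconds apart. If "within 5 minutes of each other" is meant as a gap between consecutive snapshots, one alternative is to sort by StartTime and start a new group whenever the gap exceeds 5 minutes. A sketch of that swapped-in technique; the SnapshotId values are made up for illustration:

```python
import datetime

import pandas as pd

utc = datetime.timezone.utc
snapshots = [
    # SnapshotId values are hypothetical placeholders.
    {'SnapshotId': 'a', 'StartTime': datetime.datetime(2021, 7, 16, 22, 47, 50, tzinfo=utc)},
    {'SnapshotId': 'b', 'StartTime': datetime.datetime(2021, 7, 16, 22, 48, 50, tzinfo=utc)},
    {'SnapshotId': 'c', 'StartTime': datetime.datetime(2021, 7, 16, 22, 58, 50, tzinfo=utc)},
    {'SnapshotId': 'd', 'StartTime': datetime.datetime(2021, 7, 16, 22, 59, 50, tzinfo=utc)},
]

df = pd.DataFrame(snapshots).sort_values('StartTime').reset_index(drop=True)

# True wherever the gap to the previous snapshot is larger than 5 minutes.
new_group = df['StartTime'].diff() > pd.Timedelta(minutes=5)

# The running total of "new group" markers gives each row its group number.
df['group'] = new_group.cumsum()

for number, members in df.groupby('group'):
    print(number, members['SnapshotId'].tolist())
```

With this rule a long chain of closely spaced snapshots stays in one group even if the first and last are more than 5 minutes apart, which may or may not be what you want.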

Normalize JSON API data to columns

I'm trying to get data from our HubSpot CRM database and convert it to a dataframe using pandas. I'm still a beginner in Python, and I can't get json_normalize to work.
The output from the database is in JSON format like this:
{'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 56, 24, 739000, tzinfo=tzutc()),
'id': 'xxx',
'properties': {'createdate': '2019-12-21T17:56:24.739Z',
'email': 'xxxxx#xxxxx.com',
'firstname': 'John',
'hs_object_id': 'xxx',
'lastmodifieddate': '2020-04-22T04:37:40.274Z',
'lastname': 'Hansen'},
'updated_at': datetime.datetime(2020, 4, 22, 4, 37, 40, 274000, tzinfo=tzutc())}, {'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 52, 38, 485000, tzinfo=tzutc()),
'id': 'bbb',
'properties': {'createdate': '2019-12-21T17:52:38.485Z',
'email': 'bbb#bbb.dk',
'firstname': 'John2',
'hs_object_id': 'bbb',
'lastmodifieddate': '2020-05-19T07:18:28.384Z',
'lastname': 'Hansen2'},
'updated_at': datetime.datetime(2020, 5, 19, 7, 18, 28, 384000, tzinfo=tzutc())}, {'archived': False,
'archived_at': None,
'associations': None,
etc.
Trying to put it into a dataframe using this code:
import hubspot
import pandas as pd
import json
from pandas.io.json import json_normalize
import os
client = hubspot.Client.create(api_key='################')
all_contacts = contacts_client = client.crm.contacts.get_all()
df=pd.io.json.json_normalize(all_contacts,'properties')
df.head
df.to_csv ('All contacts.csv')
But I keep getting an error that I can't resolve.
I have also tried the
pd.dataframe(all_contacts)
and
pf.dataframe.from_dict(all_contacts)
The all_contacts variable is a list of dictionary-like elements. To create the dataframe, I used a list comprehension to build a list that contains only the 'properties' dictionary of each element.
import datetime
import pandas as pd
from dateutil.tz import tzutc
data = ({'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 56, 24, 739000, tzinfo=tzutc()),
'id': 'xxx',
'properties': {'createdate': '2019-12-21T17:56:24.739Z',
'email': 'xxxxx#xxxxx.com',
'firstname': 'John',
'hs_object_id': 'xxx',
'lastmodifieddate': '2020-04-22T04:37:40.274Z',
'lastname': 'Hansen'},
'updated_at': datetime.datetime(2020, 4, 22, 4, 37, 40, 274000, tzinfo=tzutc())},
{'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 52, 38, 485000, tzinfo=tzutc()),
'id': 'bbb',
'properties': {
'createdate': '2019-12-21T17:52:38.485Z',
'email': 'bbb#bbb.dk',
'firstname': 'John2',
'hs_object_id': 'bbb',
'lastmodifieddate': '2020-05-19T07:18:28.384Z',
'lastname': 'Hansen2'},
'updated_at': datetime.datetime(2020, 5, 19, 7, 18, 28, 384000, tzinfo=tzutc())})
df = pd.DataFrame([row['properties'] for row in data])
print(df)
OUTPUT:
createdate email ... lastmodifieddate lastname
0 2019-12-21T17:56:24.739Z xxxxx#xxxxx.com ... 2020-04-22T04:37:40.274Z Hansen
1 2019-12-21T17:52:38.485Z bbb#bbb.dk ... 2020-05-19T07:18:28.384Z Hansen2
[2 rows x 6 columns]
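The comprehension above drops the top-level fields. If the contact id is also needed next to the properties, the same idea can be extended by merging the two levels per row. A small sketch with trimmed sample data, assuming no key inside 'properties' is itself called 'id':

```python
import pandas as pd

data = [
    {'id': 'xxx',
     'properties': {'email': 'xxxxx#xxxxx.com', 'firstname': 'John', 'lastname': 'Hansen'}},
    {'id': 'bbb',
     'properties': {'email': 'bbb#bbb.dk', 'firstname': 'John2', 'lastname': 'Hansen2'}},
]

# One flat dict per contact: the top-level id followed by every property.
df = pd.DataFrame([{'id': row['id'], **row['properties']} for row in data])
print(df)
```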

Annotation of cumulative sums of fields in queryset

I have a model which records data from when a user has watched a video:
class VideoViewed(models.Model):
    user = models.ForeignKey(Employee, on_delete=models.CASCADE)
    video = models.ForeignKey(Video, on_delete=models.CASCADE)
    date_time = models.DateTimeField(default=timezone.now)
From the queryset I would like to have each object in the queryset store a value which is a cumulative sum of total videos viewed by that specific user up until this point in time.
Currently I have this annotation:
queryset = queryset.annotate(
    user_views_cumsum=Window(Sum('video'),
                             order_by=F('date_time').asc()))\
    .values('user', 'video', 'date_time', 'id', 'user_views_cumsum').order_by('date_time', 'user_views_cumsum')
Which I want to give the Queryset:
<QuerySet [
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 5, 'user_views_cumsum': 1},
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 6, 'user_views_cumsum': 2},
{'user': 4, 'video': 13, 'date_time': datetime, 'id': 7, 'user_views_cumsum': 1},
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 8, 'user_views_cumsum': 3},
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 9, 'user_views_cumsum': 4},
{'user': 4, 'video': 13, 'date_time': datetime, 'id': 10, 'user_views_cumsum': 2}
]>
But is giving me the cumsum of the videos id so it looks like this:
<QuerySet [
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 5, 'user_views_cumsum': 13},
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 6, 'user_views_cumsum': 26},
{'user': 4, 'video': 13, 'date_time': datetime, 'id': 7, 'user_views_cumsum': 39},
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 8, 'user_views_cumsum': 52},
{'user': 2, 'video': 13, 'date_time': datetime, 'id': 9, 'user_views_cumsum': 65},
{'user': 4, 'video': 13, 'date_time': datetime, 'id': 10, 'user_views_cumsum': 78}
]>
There are two issues. I need user_views_cumsum to be a cumulative sum per user, and I need it to count each user's video views instead of adding up the video ids.
Thoughts?
You can use Count() instead of Sum() so that every viewing is counted exactly once.
As you probably know, relational databases store this kind of relation using foreign keys. Calling Sum('video') asks for the sum of the column named video, which holds the id of the video; this is why you are getting the sum of the ids. Since you do not care about the value of any particular column, you can count rows instead.
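Counting alone still runs across all users, so the first issue (one running total per user) needs a partition as well. Django's Window expression accepts a partition_by argument, so something along the lines of Window(Count('id'), partition_by=[F('user')], order_by=F('date_time').asc()) may be what is needed; I have not run that against this model. The plain-Python sketch below only illustrates the result such a partitioned running count should produce, using the ids and users from the question:

```python
from collections import defaultdict

# Rows from the question, assumed to be already ordered by date_time.
views = [
    {'id': 5, 'user': 2},
    {'id': 6, 'user': 2},
    {'id': 7, 'user': 4},
    {'id': 8, 'user': 2},
    {'id': 9, 'user': 2},
    {'id': 10, 'user': 4},
]

# One independent counter per user: this is what partitioning by user means.
running = defaultdict(int)
for view in views:
    running[view['user']] += 1
    view['user_views_cumsum'] = running[view['user']]

for view in views:
    print(view)
```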

How to write list of multilevel dictionaries into csv

I have a list of dictionaries stored in tweets, and I am trying to write these into a csv file using writerows method.
Sample List looks something like this:
[{'sentiment': 'Unknown', 'date': datetime.datetime(2013, 1, 1, 5, 31, 32), 'body': 'mcd brk b'},
{'sentiment': 'Unknown', 'date': datetime.datetime(2013, 1, 1, 6, 55, 23), 'body': 'co hihq'},
{'sentiment': {'basic': 'Bullish'}, 'date': datetime.datetime(2013, 1, 1, 7, 36, 32), 'body': 'new year bac'}]
Here sentiment key has either one level or two. I am trying to write these dictionaries into a csv format such that I only have the values of these keys for above either 'Unknown' or 'Bullish'.
file = open('BAC.csv','w')
keys=tweets[0].keys()
dict_writer=csv.DictWriter(file,keys)
dict_writer.writerows(tweets)
I get the csv file in the following format
Unknown,2013-01-01 05:31:32,mcd brk b
Unknown,2013-01-01 06:55:23,co hihq
{'basic': 'Bullish'},2013-01-01 07:36:32,new year bac
But I need it as
Unknown,2013-01-01 05:31:32,mcd brk b
Unknown,2013-01-01 06:55:23,co hihq
Bullish,2013-01-01 07:36:32,new year bac
Is there an easy way to do this? In many cases the nesting goes up to five levels deep, but the need is the same: I just want the innermost value.
You will need to write a function to flatten these sentiment values.
Something like this could work if you have only one element in each level.
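def flatten(row, field):
    # Unwrap nested single-key dicts until a plain value is left.
    if isinstance(row[field], dict):
        # dict.values() is not subscriptable in Python 3, so take the first value via an iterator
        row[field] = next(iter(row[field].values()))
        return flatten(row, field)
    return row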
Then you would need to call this method on each row before writing it to the csv.
tweets = [{'sentiment': 'Unknown', 'date': datetime.datetime(2013, 1, 1, 5, 31, 32), 'body': 'mcd brk b'},
{'sentiment': 'Unknown', 'date': datetime.datetime(2013, 1, 1, 6, 55, 23), 'body': 'co hihq'},
{'sentiment': {'basic': {'text': 'Bullish' } }, 'date': datetime.datetime(2013, 1, 1, 7, 36, 32), 'body': 'new year bac'}]
print([flatten(row, 'sentiment') for row in tweets])
Output
[{'date': datetime.datetime(2013, 1, 1, 5, 31, 32), 'body': 'mcd brk b', 'sentiment': 'Unknown'},
{'date': datetime.datetime(2013, 1, 1, 6, 55, 23), 'body': 'co hihq', 'sentiment': 'Unknown'},
{'date': datetime.datetime(2013, 1, 1, 7, 36, 32), 'body': 'new year bac', 'sentiment': 'Bullish'}]
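To connect this back to the CSV file, the flattening can be applied while writing each row. A Python 3 sketch under the same assumption as above (every nested dict has at least one key, and only its first value matters). innermost is a hypothetical helper name, and io.StringIO stands in for the real file so the result can be inspected:

```python
import csv
import datetime
import io


def innermost(value):
    # Follow nested dicts down to the first non-dict value.
    while isinstance(value, dict):
        value = next(iter(value.values()))
    return value


tweets = [
    {'sentiment': 'Unknown', 'date': datetime.datetime(2013, 1, 1, 5, 31, 32), 'body': 'mcd brk b'},
    {'sentiment': 'Unknown', 'date': datetime.datetime(2013, 1, 1, 6, 55, 23), 'body': 'co hihq'},
    {'sentiment': {'basic': {'text': 'Bullish'}},
     'date': datetime.datetime(2013, 1, 1, 7, 36, 32), 'body': 'new year bac'},
]

buffer = io.StringIO()  # stand-in for open('BAC.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=['sentiment', 'date', 'body'])
for row in tweets:
    # Copy the row so the original dictionaries are left untouched.
    writer.writerow({**row, 'sentiment': innermost(row['sentiment'])})

lines = buffer.getvalue().splitlines()
print('\n'.join(lines))
```

Passing fieldnames explicitly fixes the column order, which otherwise depends on the key order of the first dictionary.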

How to extract raw data from Salesforce using Beatbox python API

I am using the following code to extract data from Salesforce using beatbox python API.
import beatbox
sf_username = "xyz#salesforce.com"
sf_password = "123"
sf_api_token = "ABC"
def extract():
    sf_client = beatbox.PythonClient()
    password = str("%s%s" % (sf_password, sf_api_token))
    sf_client.login(sf_username, password)
    lead_qry = "SELECT CountryIsoCode__c,LastModifiedDate FROM Country limit 10"
    records = sf_client.query(lead_qry)
    output = open('output','w')
    for record in records:
        output.write('\t'.join(record.values()))
    output.close()
if __name__ == '__main__':
    extract()
But this is what I get in the output. How can I get the raw data, just the values I see in the Workbench? I don't want to parse each datatype to get the raw value.
Actual Output:
[{'LastModifiedDate': datetime.datetime(2012, 11, 2, 9, 32, 4),
'CountryIsoCode_c': 'AU', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 8, 18, 14, 0, 21),
'CountryIsoCode_c': 'LX', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 20, 11),
'CountryIsoCode_c': 'AE', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 20, 29),
'CountryIsoCode_c': 'AR', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 2, 9, 32, 4),
'CountryIsoCode_c': 'AT', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 2, 9, 32, 4),
'CountryIsoCode_c': 'BE', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 21, 28),
'CountryIsoCode_c': 'BR', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 21, 42),
'CountryIsoCode_c': 'CA', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 36, 18),
'CountryIsoCode_c': 'CH', 'type': 'Country_c', 'Id': ''},
{'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 35, 8),
'CountryIsoCode_c': 'CL', 'type': 'Country_c', 'Id': ''}]
Expected Output:
AU 2012-11-02T09:32:04Z
LX 2012-08-18T14:00:21Z
If you work with tabular data, you should use the pandas library.
Here is an example:
import pandas as pd
from datetime import datetime
import beatbox
service = beatbox.PythonClient()
service.login('login_here', 'creds_here')
query_result = service.query("SELECT Name, Country, CreatedDate FROM Lead limit 5") # CreatedDate is a datetime object
records = query_result['records'] # records is a list of dictionaries
records is a list of dictionaries as you mentioned before
df = pd.DataFrame(records)
print (df)
Country CreatedDate Id Name type
0 United States 2011-05-26 23:39:58 qwe qwe Lead
1 France 2011-09-01 08:45:26 qwe qwe Lead
2 France 2011-09-01 08:37:36 qwe qwe Lead
3 France 2011-09-01 08:46:38 qwe qwe Lead
4 France 2011-09-01 08:46:57 qwe qwe Lead
Now you have a table-style DataFrame object. You can index multiple columns and rows:
df['CreatedDate']
0 2011-05-26 23:39:58
1 2011-09-01 08:45:26
2 2011-09-01 08:37:36
3 2011-09-01 08:46:38
4 2011-09-01 08:46:57
More about pandas time series functionality: http://pandas.pydata.org/pandas-docs/stable/timeseries.html
And about installing pandas: http://pandas.pydata.org/pandas-docs/stable/install.html
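If the goal is only the two-column, tab-separated text shown under "Expected Output", pandas is not strictly required: each datetime can be formatted back into the ISO style directly. A sketch with two sample records copied from the question. The field key follows the name used in the query (CountryIsoCode__c), so adjust it if your records use a different key, and the trailing Z assumes the datetimes are in UTC:

```python
import datetime

FIELD = 'CountryIsoCode__c'  # assumption: key name as written in the query

records = [
    {'LastModifiedDate': datetime.datetime(2012, 11, 2, 9, 32, 4), FIELD: 'AU'},
    {'LastModifiedDate': datetime.datetime(2012, 8, 18, 14, 0, 21), FIELD: 'LX'},
]

# Format each datetime as an ISO-style timestamp and join the columns with a tab.
lines = [
    '%s\t%s' % (record[FIELD], record['LastModifiedDate'].strftime('%Y-%m-%dT%H:%M:%SZ'))
    for record in records
]
print('\n'.join(lines))
```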
