Create subgroups based on dateTime - python

So I have a list of objects (dicts), roughly 600 of them.
I have a single object example here:
{'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '', 'StartTime': datetime.datetime(2021, 7, 16, 22, 47, 50, 383000, tzinfo=tzlocal()) }
The problem is that I have a list/array of these and I want to split them into subgroups, such that each object lands in a group with the others whose "StartTime" timestamps are within 5 minutes of each other. I've been working on this for over a week and have no idea how to do the grouping. After I group them, I need to apply some rules to each group to ensure they have the correct tags and information.
Just for reference, these are snapshot objects returned by the AWS boto3 describe_snapshots method. You can read about them here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Client.describe_snapshots

You can use pandas for this, grouping the dataframe with pd.Grouper(key='StartTime', freq='5min'):
import pandas as pd
import datetime
from dateutil.tz import tzlocal
data = [
    {'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '',
     'StartTime': datetime.datetime(2021, 7, 16, 22, 47, 50, 383000, tzinfo=tzlocal())},
    {'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '',
     'StartTime': datetime.datetime(2021, 7, 16, 22, 48, 50, 383000, tzinfo=tzlocal())},
    {'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '',
     'StartTime': datetime.datetime(2021, 7, 16, 22, 58, 50, 383000, tzinfo=tzlocal())},
    {'Description': '', 'Encrypted': False, 'OwnerId': '', 'Progress': '100%', 'SnapshotId': '',
     'StartTime': datetime.datetime(2021, 7, 16, 22, 59, 50, 383000, tzinfo=tzlocal())},
]
df = pd.DataFrame(data)
df_grouped = df.groupby(pd.Grouper(key='StartTime', freq='5min'))
Or you could add an extra column to the original dataframe with the number of the group. E.g.:
def bin_number(table):
    table['bin'] = list(df_grouped.groups.keys()).index(table.name)
    return table

df_grouped = df.groupby(pd.Grouper(key='StartTime', freq='5min'), as_index=False)
df_grouped = df_grouped.apply(bin_number).reset_index()
Output:
   index Description  Encrypted OwnerId Progress SnapshotId                        StartTime  bin
0      0                  False            100%             2021-07-16 22:47:50.383000+02:00    0
1      1                  False            100%             2021-07-16 22:48:50.383000+02:00    0
2      2                  False            100%             2021-07-16 22:58:50.383000+02:00    2
3      3                  False            100%             2021-07-16 22:59:50.383000+02:00    2
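Note that pd.Grouper bins the timestamps into fixed 5-minute windows, so two snapshots one minute apart can still land in different groups if they straddle a window edge. If you need "within 5 minutes of each other" in the gap sense, here is a minimal plain-Python sketch (assuming the same data list as above):
import datetime

# Sort by StartTime, then start a new group whenever the gap to the
# previous snapshot exceeds 5 minutes.
snapshots = sorted(data, key=lambda s: s['StartTime'])
groups = []
for snap in snapshots:
    if groups and snap['StartTime'] - groups[-1][-1]['StartTime'] <= datetime.timedelta(minutes=5):
        groups[-1].append(snap)
    else:
        groups.append([snap])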

Related

Populate a column ONLY if there is a value on another column

I am trying to set a specific value in one column only if the value in the other column is not blank. My current code sets the value for the whole column regardless of whether it's blank or not.
Minimal reproducible example:
import datetime
from datetime import timedelta
import pandas as pd

df = pd.DataFrame(
    {
        'col1': [12, 34, 54, 23, 12, 43, 53, 12, 43, 12],
        'col2': ['USD', 'CAD', 'USD', 'USD', 'CAD', 'USD', 'USD',
                 'EURO', 'USD', 'USD'],
    }
)
Here's my code:
# Set date/time reply and empty column
def add_days_to_date(date, days):
    subtracted_date = pd.to_datetime(date) + timedelta(days=days)
    subtracted_date = subtracted_date.strftime("%m-%d")
    return subtracted_date

def replied_sent_date(date):
    return f"Opened {date} need update {add_days_to_date(date, 1)}"

date = datetime.date.today()
date_format = date.strftime('%m-%d')
df['col3'] = df.apply(lambda _: " ", axis=1)
df.loc[df['col1'] != '', 'col3'] = f"Opened {date_format} will need an update {add_days_to_date(date, 30)}"
Desired output:
df = pd.DataFrame(
    {
        'col1': [12, 34, 54, 23, 12, 43, 53, 12, 43, 12],
        'col2': ['USD', 'CAD', 'USD', 'USD', 'CAD', 'USD', 'USD',
                 'EURO', 'USD', 'USD'],
        'col3': [*Look at note below*]
    }
)
** "col3" would output "f"Opened {date_format} will need an update {add_days_to_date(date, 30)}" for every cell in which there exists a value for col1.
Any help is appreciated, thanks.
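A hedged sketch of one possible fix: for integer data the comparison df['col1'] != '' is always True, so every row gets filled. Assuming the blank cells in your real data are NaN, a boolean mask does the job (np.where and the '' filler are assumptions about how you want blanks rendered):
import numpy as np

msg = f"Opened {date_format} will need an update {add_days_to_date(date, 30)}"
# Fill col3 only where col1 actually has a value; leave other rows blank.
df['col3'] = np.where(df['col1'].notna(), msg, '')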

Unable to convert JSON to DataFrame

Unable to convert the JSON to a DataFrame; the following TypeError shows up.
The following data is created:
test_data =
{'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2020, 10, 30, 8, 3, 54, 190000, tzinfo=tzlocal()),
'id': '12345',
'properties': {'createdate': '[![2020-10-30T08:03:54.190Z][1]][1]',
'email': 'testmail#gmail.com',
'firstname': 'TestFirst',
'lastname': 'TestLast'},
'properties_with_history': None,
'updated_at': datetime.datetime(2022, 11, 10, 6, 44, 14, 5000, tzinfo=tzlocal())}
data = json.loads(test_data)
TypeError: the JSON object must be str, bytes or bytearray, not SimplePublicObjectWithAssociations
The following has been tried:
s1 = json.dumps(test_data)
d2 = json.loads(s1)
TypeError: Object of type SimplePublicObjectWithAssociations is not JSON serializable
Preferred output:
Can you try this:
df = pd.json_normalize(test_data)
print(df)
'''
archived archived_at associations created_at id properties_with_history updated_at properties.createdate properties.email properties.firstname
0 False None None 2020-10-30T08:03:54.190Z 12345 2022-11-10T06:44:14.500Z [![2020-10-30T08:03:54.190Z][1]][1] testmail#gmail.com TestFirst
'''
If you want specific columns:
df = df[['id', 'properties.createdate', 'properties.email', 'properties.firstname', 'properties.lastname']]
df.columns = df.columns.str.replace('properties.', '', regex=False)
df
id createdate email firstname lastname
0 12345 [![2020-10-30T08:03:54.190Z][1]][1] testmail#gmail.com TestFirst TestLast
If you want to convert the createdate column to datetime:
import datefinder
df['createdate']=df['createdate'].apply(lambda x: list(datefinder.find_dates(x))[0])
df
id createdate email firstname lastname
0 12345 2020-10-30 08:03:54.190000+00:00 testmail#gmail.com TestFirst TestLast
Here is a partial solution. Maybe, with some selecting or unpivoting of the dataframe, this approach could be useful:
import pandas as pd
import datetime
import json
import jsonpickle
test_data ={'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2020, 10, 30, 8, 3, 54, 190000),
'id': '12345',
'properties': {'createdate': '[![2020-10-30T08:03:54.190Z][1]][1]',
'email': 'testmail#gmail.com',
'firstname': 'TestFirst',
'lastname': 'TestLast'},
'properties_with_history': None,
'updated_at': datetime.datetime(2022, 11, 10, 6, 44, 14, 5000)}
data = jsonpickle.encode(test_data, unpicklable=False)
pd.read_json(data)
I have tried with melt and unstack but I didn't reach your preferred output...
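One more note: the TypeError says the object is a SimplePublicObjectWithAssociations, i.e. an SDK model rather than a plain dict. Models generated by such clients usually expose a to_dict() method; assuming this one does, a minimal sketch:
test_data_dict = test_data.to_dict()  # assumption: the SDK model provides to_dict()
df = pd.json_normalize(test_data_dict)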

How to filter list of dictionaries in python?

I have a list of dictionaries, as follows:
VehicleList = [
    {
        'id': '1',
        'VehicleType': 'Car',
        'CreationDate': datetime.datetime(2021, 12, 10, 16, 9, 44, 872000)
    },
    {
        'id': '2',
        'VehicleType': 'Bike',
        'CreationDate': datetime.datetime(2021, 12, 15, 11, 8, 21, 612000)
    },
    {
        'id': '3',
        'VehicleType': 'Truck',
        'CreationDate': datetime.datetime(2021, 9, 13, 10, 1, 50, 350095)
    },
    {
        'id': '4',
        'VehicleType': 'Bike',
        'CreationDate': datetime.datetime(2021, 12, 10, 21, 1, 0, 300012)
    },
    {
        'id': '5',
        'VehicleType': 'Car',
        'CreationDate': datetime.datetime(2021, 12, 21, 10, 1, 50, 600095)
    }
]
How can I get a list of the latest vehicles for each 'VehicleType' based on their 'CreationDate'?
I expect something like this:
latestVehicles = [
    {
        'id': '5',
        'VehicleType': 'Car',
        'CreationDate': datetime.datetime(2021, 12, 21, 10, 1, 50, 600095)
    },
    {
        'id': '2',
        'VehicleType': 'Bike',
        'CreationDate': datetime.datetime(2021, 12, 15, 11, 8, 21, 612000)
    },
    {
        'id': '3',
        'VehicleType': 'Truck',
        'CreationDate': datetime.datetime(2021, 9, 13, 10, 1, 50, 350095)
    }
]
I tried separating out each dictionary based on their 'VehicleType' into different lists and then picking up the latest one.
I believe there might be a more optimal way to do this.
Use a dictionary mapping from VehicleType value to the dictionary you want in your final list. Compare the date of each item in the input list with the one in your dict, and keep the later one.
latest_dict = {}
for vehicle in VehicleList:
    t = vehicle['VehicleType']
    if t not in latest_dict or vehicle['CreationDate'] > latest_dict[t]['CreationDate']:
        latest_dict[t] = vehicle
latestVehicles = list(latest_dict.values())
Here is a solution using max and filter:
VehicleLatest = [
    max(
        filter(lambda _: _["VehicleType"] == t, VehicleList),
        key=lambda _: _["CreationDate"]
    )
    for t in {_["VehicleType"] for _ in VehicleList}
]
Result
print(VehicleLatest)
# [{'id': '2', 'VehicleType': 'Bike', 'CreationDate': datetime.datetime(2021, 12, 15, 11, 8, 21, 612000)}, {'id': '3', 'VehicleType': 'Truck', 'CreationDate': datetime.datetime(2021, 9, 13, 10, 1, 50, 350095)}, {'id': '5', 'VehicleType': 'Car', 'CreationDate': datetime.datetime(2021, 12, 21, 10, 1, 50, 600095)}]
I think you can achieve what you want using the groupby function from itertools.
from itertools import groupby

# entries sorted according to the key we wish to group by: 'VehicleType'
VehicleList = sorted(VehicleList, key=lambda x: x["VehicleType"])
latestVehicles = []
# Then the elements are grouped.
for k, v in groupby(VehicleList, lambda x: x["VehicleType"]):
    # We then append to latestVehicles the 0th entry of the
    # grouped elements after sorting according to the 'CreationDate'
    latestVehicles.append(sorted(v, key=lambda x: x["CreationDate"], reverse=True)[0])
Sort by 'VehicleType' and 'CreationDate', then create a dictionary from 'VehicleType' and vehicle to get the latest vehicle for each type:
VehicleList.sort(key=lambda x: (x.get('VehicleType'), x.get('CreationDate')))
out = list(dict(zip([item.get('VehicleType') for item in VehicleList], VehicleList)).values())
Output:
[{'id': '2',
  'VehicleType': 'Bike',
  'CreationDate': datetime.datetime(2021, 12, 15, 11, 8, 21, 612000)},
 {'id': '5',
  'VehicleType': 'Car',
  'CreationDate': datetime.datetime(2021, 12, 21, 10, 1, 50, 600095)},
 {'id': '3',
  'VehicleType': 'Truck',
  'CreationDate': datetime.datetime(2021, 9, 13, 10, 1, 50, 350095)}]
This is very straightforward in pandas. First load the list of dicts as a dataframe, then sort by date, keep the last (i.e. latest) row per VehicleType, and export back to records:
import pandas as pd
df = pd.DataFrame(VehicleList)
df.sort_values('CreationDate').drop_duplicates('VehicleType', keep='last').to_dict(orient='records')
You can use operator.itemgetter to sort by type and date as a first step (combine it with one of the grouping approaches above to pick the latest per type):
import operator
my_sorted_list_by_type_and_date = sorted(VehicleList, key=operator.itemgetter('VehicleType', 'CreationDate'))
A small plea for more readable code:
from operator import itemgetter
from itertools import groupby

vtkey = itemgetter('VehicleType')
cdkey = itemgetter('CreationDate')

latest = [
    # Get latest from each group.
    max(vs, key=cdkey)
    # Sort and group by VehicleType.
    for g, vs in groupby(sorted(VehicleList, key=vtkey), vtkey)
]
A variation on Blckknght's answer using defaultdict to avoid the long if condition:
from collections import defaultdict
import datetime
from operator import itemgetter

latest_dict = defaultdict(lambda: {'CreationDate': datetime.datetime.min})
for vehicle in VehicleList:
    t = vehicle['VehicleType']
    latest_dict[t] = max(vehicle, latest_dict[t], key=itemgetter('CreationDate'))
latestVehicles = list(latest_dict.values())
latestVehicles:
[{'id': '5', 'VehicleType': 'Car', 'CreationDate': datetime.datetime(2021, 12, 21, 10, 1, 50, 600095)},
{'id': '2', 'VehicleType': 'Bike', 'CreationDate': datetime.datetime(2021, 12, 15, 11, 8, 21, 612000)},
{'id': '3', 'VehicleType': 'Truck', 'CreationDate': datetime.datetime(2021, 9, 13, 10, 1, 50, 350095)}]

Normalize JSON API data to columns

I'm trying to get data from our Hubspot CRM database and convert it to a dataframe using pandas. I'm still a beginner in python, but I can't get json_normalize to work.
The output from the database is in JSON format, like this:
{'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 56, 24, 739000, tzinfo=tzutc()),
'id': 'xxx',
'properties': {'createdate': '2019-12-21T17:56:24.739Z',
'email': 'xxxxx#xxxxx.com',
'firstname': 'John',
'hs_object_id': 'xxx',
'lastmodifieddate': '2020-04-22T04:37:40.274Z',
'lastname': 'Hansen'},
'updated_at': datetime.datetime(2020, 4, 22, 4, 37, 40, 274000, tzinfo=tzutc())},
{'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 52, 38, 485000, tzinfo=tzutc()),
'id': 'bbb',
'properties': {'createdate': '2019-12-21T17:52:38.485Z',
'email': 'bbb#bbb.dk',
'firstname': 'John2',
'hs_object_id': 'bbb',
'lastmodifieddate': '2020-05-19T07:18:28.384Z',
'lastname': 'Hansen2'},
'updated_at': datetime.datetime(2020, 5, 19, 7, 18, 28, 384000, tzinfo=tzutc())},
{'archived': False,
'archived_at': None,
'associations': None,
etc.
Trying to put it into a dataframe using this code:
import hubspot
import pandas as pd
import json
from pandas.io.json import json_normalize
import os
client = hubspot.Client.create(api_key='################')
all_contacts = contacts_client = client.crm.contacts.get_all()
df=pd.io.json.json_normalize(all_contacts,'properties')
df.head
df.to_csv ('All contacts.csv')
But I keep getting an error that I can't resolve.
I have also tried
pd.DataFrame(all_contacts)
and
pd.DataFrame.from_dict(all_contacts)
The all_contacts variable is a list of dictionary-like elements, so to create the dataframe I have used a list comprehension that keeps only the 'properties' of each element.
import datetime
import pandas as pd
from dateutil.tz import tzutc
data = ({'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 56, 24, 739000, tzinfo=tzutc()),
'id': 'xxx',
'properties': {'createdate': '2019-12-21T17:56:24.739Z',
'email': 'xxxxx#xxxxx.com',
'firstname': 'John',
'hs_object_id': 'xxx',
'lastmodifieddate': '2020-04-22T04:37:40.274Z',
'lastname': 'Hansen'},
'updated_at': datetime.datetime(2020, 4, 22, 4, 37, 40, 274000, tzinfo=tzutc())},
{'archived': False,
'archived_at': None,
'associations': None,
'created_at': datetime.datetime(2019, 12, 21, 17, 52, 38, 485000, tzinfo=tzutc()),
'id': 'bbb',
'properties': {
'createdate': '2019-12-21T17:52:38.485Z',
'email': 'bbb#bbb.dk',
'firstname': 'John2',
'hs_object_id': 'bbb',
'lastmodifieddate': '2020-05-19T07:18:28.384Z',
'lastname': 'Hansen2'},
'updated_at': datetime.datetime(2020, 5, 19, 7, 18, 28, 384000, tzinfo=tzutc())})
df = pd.DataFrame([row['properties'] for row in data])
print(df)
OUTPUT:
createdate email ... lastmodifieddate lastname
0 2019-12-21T17:56:24.739Z xxxxx#xxxxx.com ... 2020-04-22T04:37:40.274Z Hansen
1 2019-12-21T17:52:38.485Z bbb#bbb.dk ... 2020-05-19T07:18:28.384Z Hansen2
[2 rows x 6 columns]
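If you also want the top-level fields (id, created_at, ...) and not just properties, a short sketch with pd.json_normalize on the same data, which flattens the nested dict automatically:
df = pd.json_normalize(list(data))
# One row per contact; the nested 'properties' keys become
# 'properties.createdate', 'properties.email', ... alongside the top-level fields.
print(df.columns.tolist())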

How to extract raw data from Salesforce using Beatbox python API

I am using the following code to extract data from Salesforce using beatbox python API.
import beatbox
sf_username = "xyz#salesforce.com"
sf_password = "123"
sf_api_token = "ABC"
def extract():
    sf_client = beatbox.PythonClient()
    password = str("%s%s" % (sf_password, sf_api_token))
    sf_client.login(sf_username, password)
    lead_qry = "SELECT CountryIsoCode__c, LastModifiedDate FROM Country__c limit 10"
    records = sf_client.query(lead_qry)
    output = open('output', 'w')
    for record in records:
        output.write('\t'.join(record.values()))
    output.close()

if __name__ == '__main__':
    extract()
But this is what I get in the output. How do I get the raw data, just the values I see in the Workbench? I don't want to parse each datatype to get the raw value.
Actual Output:
[{'LastModifiedDate': datetime.datetime(2012, 11, 2, 9, 32, 4),
  'CountryIsoCode__c': 'AU', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 8, 18, 14, 0, 21),
  'CountryIsoCode__c': 'LX', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 20, 11),
  'CountryIsoCode__c': 'AE', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 20, 29),
  'CountryIsoCode__c': 'AR', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 2, 9, 32, 4),
  'CountryIsoCode__c': 'AT', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 2, 9, 32, 4),
  'CountryIsoCode__c': 'BE', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 21, 28),
  'CountryIsoCode__c': 'BR', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 21, 42),
  'CountryIsoCode__c': 'CA', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 36, 18),
  'CountryIsoCode__c': 'CH', 'type': 'Country__c', 'Id': ''},
 {'LastModifiedDate': datetime.datetime(2012, 11, 12, 15, 35, 8),
  'CountryIsoCode__c': 'CL', 'type': 'Country__c', 'Id': ''}]
Expected Output:
AU 2012-11-02T09:32:04Z
LX 2012-08-18T14:00:21Z
If you work with table data, you should use the pandas library.
Here is an example:
import pandas as pd
from datetime import datetime
import beatbox
service = beatbox.PythonClient()
service.login('login_here', 'creds_here')
query_result = service.query("SELECT Name, Country, CreatedDate FROM Lead limit 5") # CreatedDate is a datetime object
records = query_result['records'] # records is a list of dictionaries
As mentioned, records is a list of dictionaries:
df = pd.DataFrame(records)
print (df)
Country CreatedDate Id Name type
0 United States 2011-05-26 23:39:58 qwe qwe Lead
1 France 2011-09-01 08:45:26 qwe qwe Lead
2 France 2011-09-01 08:37:36 qwe qwe Lead
3 France 2011-09-01 08:46:38 qwe qwe Lead
4 France 2011-09-01 08:46:57 qwe qwe Lead
Now you have a table-style DataFrame object. You can index multiple columns and rows:
df['CreatedDate']
0 2011-05-26 23:39:58
1 2011-09-01 08:45:26
2 2011-09-01 08:37:36
3 2011-09-01 08:46:38
4 2011-09-01 08:46:57
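To produce the tab-separated raw values from the question's expected output, you can format the datetime column before writing; a sketch reusing this answer's dataframe (column names per the Lead query above):
df['CreatedDate'] = pd.to_datetime(df['CreatedDate']).dt.strftime('%Y-%m-%dT%H:%M:%SZ')
# Write just the values, tab-separated, with no header or index.
df[['Country', 'CreatedDate']].to_csv('output.tsv', sep='\t', index=False, header=False)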
Here is more about pandas time functionality: http://pandas.pydata.org/pandas-docs/stable/timeseries.html
And here is more about pandas itself: http://pandas.pydata.org/pandas-docs/stable/install.html
