I have JSON data that looks like this:
"rows": [
["2019-08-02", 364, 209, 2, 2],
["2019-08-03", 386, 250, 2, 5],
["2019-08-04", 382, 221, 3, 1],
["2019-08-05", 361, 218, 1, 0],
["2019-08-06", 338, 205, 4, 0],
["2019-08-07", 353, 208, 2, 2],
["2019-08-08", 405, 223, 2, 2],
["2019-08-09", 405, 266, 2, 2],
["2019-08-10", 494, 288, 0, 1],
]
I want the headers of the data (not included in the JSON file) to be:
["day", "estimatedPeopleVisited", "bought", "gives_pfeedback", "gives_nfeedback"]
I tried the following code for reading the file:
f = pd.read_json("data1308.json")
print(f)
and this gives output like:
rows
0 [2019-08-02, 364, 209, 2, 2]
1 [2019-08-03, 386, 250, 2, 5]
2 [2019-08-04, 382, 221, 3, 1]
3 [2019-08-05, 361, 218, 1, 0]
4 [2019-08-06, 338, 205, 4, 0]
5 [2019-08-07, 353, 208, 2, 2]
6 [2019-08-08, 405, 223, 2, 2]
7 [2019-08-09, 405, 266, 2, 2]
8 [2019-08-10, 494, 288, 0, 1]
I expect the output in the form:
day est bought gives_pfeedback gives_nfeedback
0 2019-08-02 364 209 2 2
1 2019-08-03 386 250 2 5
2 2019-08-04 382 221 3 1
3 2019-08-05 361 218 1 0
4 2019-08-06 338 205 4 0
. . . . . .
. . . . . .
. . . . . .
I can transform the data into the specified form after reading it as above, but is there any way to read the JSON data directly into the specified format?
What about this?
import pandas as pd
data = {"rows": [
["2019-08-02", 364, 209, 2, 2],
["2019-08-03", 386, 250, 2, 5],
["2019-08-04", 382, 221, 3, 1],
["2019-08-05", 361, 218, 1, 0],
["2019-08-06", 338, 205, 4, 0],
["2019-08-07", 353, 208, 2, 2],
["2019-08-08", 405, 223, 2, 2],
["2019-08-09", 405, 266, 2, 2],
["2019-08-10", 494, 288, 0, 1],
]}
cols = ["day", "estimatedPeopleVisited", "bought", "gives_pfeedback", "gives_nfeedback"]
df = pd.DataFrame(data["rows"], columns=cols)
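To go straight from the file to the labeled frame, one option is to parse the JSON with the standard json module and pass the column names to the DataFrame constructor. A minimal sketch, assuming the file is named data1308.json with a top-level "rows" key (an in-memory buffer stands in for the file here):

```python
import io
import json

import pandas as pd

cols = ["day", "estimatedPeopleVisited", "bought", "gives_pfeedback", "gives_nfeedback"]

# Stand-in for open("data1308.json"); the real code would open the file instead.
fh = io.StringIO('{"rows": [["2019-08-02", 364, 209, 2, 2], ["2019-08-03", 386, 250, 2, 5]]}')

raw = json.load(fh)                            # {"rows": [[...], [...]]}
df = pd.DataFrame(raw["rows"], columns=cols)   # headers attached in one step
```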
I'm looking for some assistance with the two following tasks using Python and pandas.
Here's the input sequential log table:
user_id token action_id action_timestamp
7 223 1 timestamp1
12 191 1 timestamp2
45 667 2 timestamp3
7 223 3 timestamp4
12 191 2 timestamp5
12 339 1 timestamp6
7 223 2 timestamp7
12 339 2 timestamp8
7 564 1 timestamp9
12 778 1 timestamp10
91 551 1 timestamp11
12 778 4 timestamp12
12 778 5 timestamp13
91 551 5 timestamp14
91 551 3 timestamp15
91 551 2 timestamp16
45 122 1 timestamp17
The first desired outcome should select only those rows that have both action_id 1 and action_id 2 for a token, and display their timestamps as columns, per user and per token:
user_id token timestamp_action_id_1 timestamp_action_id_2
7 223 timestamp1 timestamp7
12 191 timestamp2 timestamp5
12 339 timestamp6 timestamp8
91 551 timestamp11 timestamp16
The second desired outcome is a calculation of the average time measured from action_id 1 to action_id 2 across all tokens, per user:
user_id action_id_1_to_action_id_2_time_delta_average
7 <avg of time delta for token 223>
12 <avg of time delta for tokens 191 and 339>
91 <avg of time delta for token 551>
Thanks in advance!
Update:
Here's the code that implements mozway's answer:
df = pd.DataFrame({
'user_id': [7, 12, 45, 7, 12, 12, 7, 12, 7, 12, 91, 12, 12, 91, 91, 91, 45],
'token': [223, 191, 667, 223, 191, 339, 223, 339, 564, 778, 551, 778, 778, 551, 551, 551, 122],
'action_id': [1, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 4, 5, 5, 3, 2, 1],
'action_timestamp': [f'timestamp{x}' for x in range(1,18)]
})
# For all columns
df.pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_').reset_index().rename_axis(None, axis=1)
# Only for the desired columns
df2 = df[df['action_id'].isin([1,2])].pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_').reset_index().rename_axis(None, axis=1)
df3 = df2[~df2.isnull().any(axis=1)].reset_index(drop=True)
df3
user_id token timestamp_action_id_1 timestamp_action_id_2
0 7 223 timestamp1 timestamp7
1 12 191 timestamp2 timestamp5
2 12 339 timestamp6 timestamp8
3 91 551 timestamp11 timestamp16
However, if the log table has a repeating action for the user within a token, an 'Index contains duplicate entries, cannot reshape' error occurs.
Here's the table with a repeating action:
df = pd.DataFrame({
'user_id': [7, 12, 45, 7, 12, 12, 7, 12, 7, 12, 91, 12, 12, 91, 91, 91, 91, 45],
'token': [223, 191, 667, 223, 191, 339, 223, 339, 564, 778, 551, 778, 778, 551, 551, 551, 551, 122],
'action_id': [1, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 4, 5, 5, 3, 5, 2, 1],
'action_timestamp': [f'timestamp{x}' for x in range(1,19)]
})
For the first step, pivot your dataframe:
df.pivot(index=['user_id', 'token'], columns='action_id', values='action_timestamp').add_prefix('timestamp_action_id_')
Second, filter, pivot, calculate the delta, group by user_id and take the mean:
(df[df['action_id'].isin([1,2])]
.pivot(index=['user_id', 'token'],
columns='action_id',
values='action_timestamp')
.add_prefix('timestamp_action_id_')
.assign(delta=lambda d:d['timestamp_action_id_2']-d['timestamp_action_id_1'])
.groupby('user_id')['delta'].apply(lambda x: x.dt.total_seconds().mean())
.dropna()
)
output (using random timestamps):
user_id
7 813000.0
12 -29000.0
91 -601000.0
N.B. unfortunately I cannot test/run the code at the moment, so consider it pseudocode
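If the duplicate-entries error from the update is the blocker, one workaround (my suggestion, not part of the answer above) is pivot_table with aggfunc='first', which keeps the first timestamp per (user_id, token, action_id) instead of raising:

```python
import pandas as pd

# Minimal frame with a repeated (user_id, token, action_id) combination.
df = pd.DataFrame({
    'user_id': [7, 7, 91, 91, 91],
    'token': [223, 223, 551, 551, 551],
    'action_id': [1, 2, 1, 5, 5],          # action 5 repeats for user 91
    'action_timestamp': ['t1', 't2', 't3', 't4', 't5'],
})

# pivot() would raise "Index contains duplicate entries, cannot reshape";
# pivot_table with aggfunc='first' keeps the first occurrence instead.
wide = (df.pivot_table(index=['user_id', 'token'],
                       columns='action_id',
                       values='action_timestamp',
                       aggfunc='first')
          .add_prefix('timestamp_action_id_')
          .reset_index()
          .rename_axis(None, axis=1))
```

Whether "first" is the right aggregation depends on what a repeated action means in the log; 'last' or a list aggregation are equally easy to swap in.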
I'm trying to solve a two-step problem. In the first step, I run an assignment model that calculates the best option for optimizing the pickup and delivery arcs between nodes, because not all vehicles can transport the same products, among other complications. The arcs produced by the first model serve as input to the second, VRP model as data['pickups_deliveries']. The code below is a simple example where everything works, but a node can't be a delivery node and a pickup node at the same time, which is what I need to solve.
"""Capacited Vehicles Routing Problem (CVRP)."""
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp
def create_data_model():
"""Stores the data for the problem."""
data = {}
data['distance_matrix'] = [
[
0, 548, 776, 696, 582, 274, 502, 194, 308, 194, 536, 502, 388, 354,
468, 776, 662
],
[
548, 0, 684, 308, 194, 502, 730, 354, 696, 742, 1084, 594, 480, 674,
1016, 868, 1210
],
[
776, 684, 0, 992, 878, 502, 274, 810, 468, 742, 400, 1278, 1164,
1130, 788, 1552, 754
],
[
696, 308, 992, 0, 114, 650, 878, 502, 844, 890, 1232, 514, 628, 822,
1164, 560, 1358
],
[
582, 194, 878, 114, 0, 536, 764, 388, 730, 776, 1118, 400, 514, 708,
1050, 674, 1244
],
[
274, 502, 502, 650, 536, 0, 228, 308, 194, 240, 582, 776, 662, 628,
514, 1050, 708
],
[
502, 730, 274, 878, 764, 228, 0, 536, 194, 468, 354, 1004, 890, 856,
514, 1278, 480
],
[
194, 354, 810, 502, 388, 308, 536, 0, 342, 388, 730, 468, 354, 320,
662, 742, 856
],
[
308, 696, 468, 844, 730, 194, 194, 342, 0, 274, 388, 810, 696, 662,
320, 1084, 514
],
[
194, 742, 742, 890, 776, 240, 468, 388, 274, 0, 342, 536, 422, 388,
274, 810, 468
],
[
536, 1084, 400, 1232, 1118, 582, 354, 730, 388, 342, 0, 878, 764,
730, 388, 1152, 354
],
[
502, 594, 1278, 514, 400, 776, 1004, 468, 810, 536, 878, 0, 114,
308, 650, 274, 844
],
[
388, 480, 1164, 628, 514, 662, 890, 354, 696, 422, 764, 114, 0, 194,
536, 388, 730
],
[
354, 674, 1130, 822, 708, 628, 856, 320, 662, 388, 730, 308, 194, 0,
342, 422, 536
],
[
468, 1016, 788, 1164, 1050, 514, 514, 662, 320, 274, 388, 650, 536,
342, 0, 764, 194
],
[
776, 868, 1552, 560, 674, 1050, 1278, 742, 1084, 810, 1152, 274,
388, 422, 764, 0, 798
],
[
662, 1210, 754, 1358, 1244, 708, 480, 856, 514, 468, 354, 844, 730,
536, 194, 798, 0
],
]
data['pickups_deliveries'] = [
[1, 6],
[2, 10],
[4, 3],
[5, 9],
[7, 8],
[15, 11],
[13, 12],
[16, 14]
]
data['demands'] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data['vehicle_capacities'] = [1, 1, 1, 1, 1, 1, 1, 1, 1]
data['num_vehicles'] = 9
data['depot'] = 0
return data
def print_solution(data, manager, routing, solution):
"""Prints solution on console."""
print(f'Objective: {solution.ObjectiveValue()}')
total_distance = 0
total_load = 0
for vehicle_id in range(data['num_vehicles']):
index = routing.Start(vehicle_id)
plan_output = 'Route for vehicle {}:\n'.format(vehicle_id)
route_distance = 0
route_load = 0
while not routing.IsEnd(index):
node_index = manager.IndexToNode(index)
route_load += data['demands'][node_index]
plan_output += ' {0} Load({1}) -> '.format(node_index, route_load)
previous_index = index
index = solution.Value(routing.NextVar(index))
route_distance += routing.GetArcCostForVehicle(
previous_index, index, vehicle_id)
plan_output += ' {0} Load({1})\n'.format(manager.IndexToNode(index),
route_load)
plan_output += 'Distance of the route: {}m\n'.format(route_distance)
plan_output += 'Load of the route: {}\n'.format(route_load)
print(plan_output)
total_distance += route_distance
total_load += route_load
print('Total distance of all routes: {}m'.format(total_distance))
print('Total load of all routes: {}'.format(total_load))
def main():
"""Entry point of the program."""
# Instantiate the data problem.
# [START data]
data = create_data_model()
# [END data]
# Create the routing index manager.
# [START index_manager]
manager = pywrapcp.RoutingIndexManager(len(data['distance_matrix']),
data['num_vehicles'], data['depot'])
# [END index_manager]
# Create Routing Model.
# [START routing_model]
routing = pywrapcp.RoutingModel(manager)
# [END routing_model]
# Define cost of each arc.
# [START arc_cost]
def distance_callback(from_index, to_index):
"""Returns the manhattan distance between the two nodes."""
# Convert from routing variable Index to distance matrix NodeIndex.
from_node = manager.IndexToNode(from_index)
to_node = manager.IndexToNode(to_index)
return data['distance_matrix'][from_node][to_node]
transit_callback_index = routing.RegisterTransitCallback(distance_callback)
routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
# [END arc_cost]
# Add Distance constraint.
# [START distance_constraint]
dimension_name = 'Distance'
routing.AddDimension(
transit_callback_index,
0, # no slack
3000, # vehicle maximum travel distance
True, # start cumul to zero
dimension_name)
distance_dimension = routing.GetDimensionOrDie(dimension_name)
distance_dimension.SetGlobalSpanCostCoefficient(100)
# [END distance_constraint]
# Define Transportation Requests.
# [START pickup_delivery_constraint]
for request in data['pickups_deliveries']:
pickup_index = manager.NodeToIndex(request[0])
delivery_index = manager.NodeToIndex(request[1])
routing.AddPickupAndDelivery(pickup_index, delivery_index)
routing.solver().Add(
routing.VehicleVar(pickup_index) == routing.VehicleVar(
delivery_index))
routing.solver().Add(
distance_dimension.CumulVar(pickup_index) <=
distance_dimension.CumulVar(delivery_index))
routing.SetPickupAndDeliveryPolicyOfAllVehicles(
pywrapcp.RoutingModel.PICKUP_AND_DELIVERY_FIFO)
# [END pickup_delivery_constraint]
# Setting first solution heuristic.
search_parameters = pywrapcp.DefaultRoutingSearchParameters()
search_parameters.first_solution_strategy = (
routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
search_parameters.local_search_metaheuristic = (
routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH)
search_parameters.time_limit.FromSeconds(1)
# Solve the problem.
solution = routing.SolveWithParameters(search_parameters)
# Print solution on console.
if solution:
print_solution(data, manager, routing, solution)
if __name__ == '__main__':
main()
Route for vehicle 0:
0 Load(1) -> 4 Load(2) -> 3 Load(3) -> 5 Load(4) -> 9 Load(5) -> 0 Load(5)
Distance of the route: 1780m
Load of the route: 5
Route for vehicle 1:
0 Load(1) -> 2 Load(2) -> 10 Load(3) -> 0 Load(3)
Distance of the route: 1712m
Load of the route: 3
Route for vehicle 2:
0 Load(1) -> 0 Load(1)
Distance of the route: 0m
Load of the route: 1
Route for vehicle 3:
0 Load(1) -> 0 Load(1)
Distance of the route: 0m
Load of the route: 1
Route for vehicle 4:
0 Load(1) -> 0 Load(1)
Distance of the route: 0m
Load of the route: 1
Route for vehicle 5:
0 Load(1) -> 0 Load(1)
Distance of the route: 0m
Load of the route: 1
Route for vehicle 6:
0 Load(1) -> 1 Load(2) -> 6 Load(3) -> 0 Load(3)
Distance of the route: 1780m
Load of the route: 3
Route for vehicle 7:
0 Load(1) -> 7 Load(2) -> 8 Load(3) -> 16 Load(4) -> 14 Load(5) -> 0 Load(5)
Distance of the route: 1712m
Load of the route: 5
Route for vehicle 8:
0 Load(1) -> 13 Load(2) -> 12 Load(3) -> 15 Load(4) -> 11 Load(5) -> 0 Load(5)
Distance of the route: 1712m
Load of the route: 5
This code works fine for a simple graph assignment where each pickup node is just a pickup node and each delivery node is just a delivery node. But if I want a node to be both a pickup and a delivery, I thought I could add this as another pair, for example making node 14, formerly a delivery node, also the pickup node for the arc [14, 13]. I thought I could force one vehicle to go 16 -> 14 -> 13 -> 12 by adding this to data['pickups_deliveries'], but Python hangs and stops working.
data['pickups_deliveries'] = [
[1, 6],
[2, 10],
[4, 3],
[5, 9],
[7, 8],
[15, 11],
[13, 12],
[16, 14],
[14,13] ## Added
]
Mainly, what I want is to be able to add pairs where a node is a pickup node in one pair and a delivery node in another.
Thanks, and sorry for the long post.
You must duplicate the node and adapt your transit callback accordingly.
Then you can merge the node ids when post-processing the solution assignment.
Another way is to hack the transit callback to do the mapping there, so you don't have to recompute a new transit matrix.
e.g.
create duplicate nodes 17 and 18 for nodes 13 and 14,
so you can add the new P&D pair [18, 17]
in your transit callback:
def distance_callback(from_index, to_index):
"""Returns the manhattan distance between the two nodes."""
# Convert from routing variable Index to distance matrix NodeIndex.
from_node = manager.IndexToNode(from_index)
# rebind 17 or 18 to 13 or 14 respectively
if from_node in [17, 18]:
from_node = from_node - 4
to_node = manager.IndexToNode(to_index)
# rebind 17 or 18 to 13 or 14 respectively
if to_node in [17, 18]:
to_node = to_node - 4
return data['distance_matrix'][from_node][to_node]
and also change
# [START index_manager]
manager = pywrapcp.RoutingIndexManager(len(data['distance_matrix']) + 2,
data['num_vehicles'], data['depot'])
# [END index_manager]
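The rebinding can be kept in one place with a mapping dict, so it scales past two duplicates. A sketch of just the callback wiring (node_map and make_distance_callback are names I'm introducing here, not OR-Tools API):

```python
# Map each duplicated node id back to the original node it mirrors.
node_map = {17: 13, 18: 14}

def make_distance_callback(distance_matrix, index_to_node):
    """Build a transit callback that rebinds duplicate ids before the lookup."""
    def distance_callback(from_index, to_index):
        from_node = index_to_node(from_index)
        to_node = index_to_node(to_index)
        # Duplicates fall back to their originals; other nodes pass through.
        from_node = node_map.get(from_node, from_node)
        to_node = node_map.get(to_node, to_node)
        return distance_matrix[from_node][to_node]
    return distance_callback
```

With OR-Tools this would be registered as before, e.g. routing.RegisterTransitCallback(make_distance_callback(data['distance_matrix'], manager.IndexToNode)).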
In chapter 2 of "Python Data Science Handbook" by Jake VanderPlas, he computes the sum of squared differences of several 2-d points using the following code:
rand = np.random.RandomState(42)
X = rand.rand(10,2)
dist_sq = np.sum((X[:,np.newaxis,:] - X[np.newaxis,:,:]) ** 2, axis=-1)
Two questions:
Why is a third axis created? What is the best way to visualize what is going on?
Is there a more intuitive way to perform this calculation?
Why is a third axis created? What is the best way to visualize what is going on?
Adding new dimensions before adding/subtracting is a relatively common trick for generating all pairs via broadcasting (None is the same as np.newaxis here):
>>> a = np.arange(10)
>>> a[:,None]
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
>>> a[None,:]
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> a[:,None] + 100*a[None,:]
array([[ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900],
[ 1, 101, 201, 301, 401, 501, 601, 701, 801, 901],
[ 2, 102, 202, 302, 402, 502, 602, 702, 802, 902],
[ 3, 103, 203, 303, 403, 503, 603, 703, 803, 903],
[ 4, 104, 204, 304, 404, 504, 604, 704, 804, 904],
[ 5, 105, 205, 305, 405, 505, 605, 705, 805, 905],
[ 6, 106, 206, 306, 406, 506, 606, 706, 806, 906],
[ 7, 107, 207, 307, 407, 507, 607, 707, 807, 907],
[ 8, 108, 208, 308, 408, 508, 608, 708, 808, 908],
[ 9, 109, 209, 309, 409, 509, 609, 709, 809, 909]])
Your example does the same, just with 2-vectors instead of scalars at the innermost level:
>>> X[:,np.newaxis,:].shape
(10, 1, 2)
>>> X[np.newaxis,:,:].shape
(1, 10, 2)
>>> (X[:,np.newaxis,:] - X[np.newaxis,:,:]).shape
(10, 10, 2)
Thus the 'magical subtraction' is just every coordinate vector in X subtracted from every other, all pairs at once.
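To make the broadcasting concrete, the (10, 10, 2) difference tensor can be spot-checked against an explicit pair; a small sketch reproducing the book's setup:

```python
import numpy as np

rand = np.random.RandomState(42)
X = rand.rand(10, 2)

diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]   # shape (10, 10, 2)
dist_sq = np.sum(diff ** 2, axis=-1)               # shape (10, 10)

# Entry [i, j] is exactly the squared distance between points i and j.
i, j = 3, 7
assert np.isclose(dist_sq[i, j], np.sum((X[i] - X[j]) ** 2))
```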
Is there a more intuitive way to perform this calculation?
Yes, use scipy.spatial.distance.pdist for pairwise distances. To get an equivalent result to your example:
from scipy.spatial.distance import pdist, squareform
dist_sq = squareform(pdist(X))**2
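A quick check that the two routes agree (assuming SciPy is installed):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rand = np.random.RandomState(42)
X = rand.rand(10, 2)

# Broadcasting route from the question.
dist_sq_broadcast = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

# pdist computes the condensed pairwise distances; squareform expands to a matrix.
dist_sq_pdist = squareform(pdist(X)) ** 2

assert np.allclose(dist_sq_broadcast, dist_sq_pdist)
```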
I have created this pd.DataFrame. I want to write it to a .csv file, and it is being saved like this:
0 1
0 [15921, 10, 82, 22, 202973, 368, 1055, 3135, 1... 0
1 [609, 226, 413, 363, 211, 241, 988, 80, 12, 19... 0
2 [22572, 3720, 233, 13, 827, 710, 512, 354, 1, ... 0
3 [345, 656, 25, 2589, 6, 866] 0
4 [29142, 8, 4141, 456, 24] 0
... ..
1599995 [256, 8, 80, 110, 25, 152] 4
1599996 [609039, 22, 129, 184, 163, 9419, 769, 358, 10... 4
1599997 [140, 5715, 6540, 294, 1552] 4
1599998 [59, 22771, 189, 387, 4483, 13, 10305, 112231,... 4
1599999 [59, 15833, 200370, 609041, 609042] 4
when saving with:
data.to_csv("foo.csv", index=True)
The problem is that each list is saved as a str. For example, row 3 is
"[345, 656, 25, 2589, 6, 866]"
And for the skeptics, I've checked type() on row 3 and it's str. Column 2 works fine.
How can I save each row as a list and not as a str? That is, how can I keep all rows of column 1 as lists rather than strings?
Try this:
import ast
import pandas as pd
df = pd.DataFrame({'a': ["[1,2,3,4]", "[6,7,8,9]"]})
df['b'] = df['a'].apply(ast.literal_eval)  # literal_eval is safer than eval for strings from a file
print(df)
The data in column b is now a list.
a b
0 [1,2,3,4] [1, 2, 3, 4]
1 [6,7,8,9] [6, 7, 8, 9]
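For the round trip in the question, read_csv's converters parameter can apply the parsing while loading. A sketch using an in-memory buffer in place of foo.csv, and ast.literal_eval rather than eval since the strings come from a file:

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({0: [[345, 656, 25], [29142, 8, 4141]], 1: [0, 4]})

buf = io.StringIO()              # stand-in for "foo.csv"
df.to_csv(buf, index=True)
buf.seek(0)

# converters parses the stringified lists back into real lists while reading.
restored = pd.read_csv(buf, index_col=0, converters={'0': ast.literal_eval})
```

Note that the CSV header turns the integer column label 0 into the string '0', which is why the converters key is a string.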
How do I get elements from a two-dimensional array when each row of a slice has a different number of columns?
buffer = np.zeros((32, 32, 3), 'u1') # this is our data buffer 2d.
buffer[2:5, (2:4, 3:7, 0:11)] # does not work.
# vertical interval: 2..5; horizontal intervals: 1..3, 4..9, 7..10
multi_intervals = ((2, 5), ((1, 3), (4, 9), (7, 10)))
# our very slow function.
def gen_xy_indices(y_interval, x_multi_intervals):
x_multi_ranges = list(map(lambda x: np.arange(*x),x_multi_intervals))
y_range = np.arange(*y_interval)
y_indices = np.repeat(y_range, list(map(len, x_multi_ranges)))
x_indices = np.concatenate(x_multi_ranges)
return x_indices, y_indices
ix, iy = gen_xy_indices(*multi_intervals)
buffer[iy, ix].shape == (10, 3) # works, but slow.
# Is there a faster way to do this (in Python with NumPy)?
You can use np.repeat and np.concatenate.
>>> import numpy as np
>>>
>>> class By_Row:
... def __getitem__(self, idx):
... y, *x = (np.arange(i.start, i.stop, i.step) for i in idx)
... return y.repeat(np.fromiter((i.size for i in x), int, y.size)), np.concatenate(x)
...
>>>
>>> b_ = By_Row()
>>>
>>> A = sum(np.ogrid[:600:100, :12])
>>> A
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311],
[400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411],
[500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511]])
>>> A[b_[2:5, 2:4, 3:7, 0:11]]
array([202, 203, 303, 304, 305, 306, 400, 401, 402, 403, 404, 405, 406,
407, 408, 409, 410])
Here's one way you could do it:
x = range(2,5)
y = range(17)
divs = [(2,4), (3,7), (12,17)]
y_vals = []
x_vals = []
for d, div in enumerate(divs):
y_grp = y[div[0]:div[1]]
y_vals += y_grp
x_vals += [x[d]]*len(y_grp)
print(x_vals)
print(y_vals)
> [2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]
> [2, 3, 3, 4, 5, 6, 12, 13, 14, 15, 16]