Merge and add duplicate integers from a multidimensional array - python

I have a multidimensional list where the first item of each entry is a date and the second is a value that needs to be added together for matching dates. For example (leaving the second item as an integer for simplicity):
[[01/01/2019, 10], [01/01/2019, 3], [02/01/2019, 4], [03/01/2019, 2]]
The resulting array should be:
[[01/01/2019, 13], [02/01/2019, 4], [03/01/2019, 2]]
Does anyone have a short way of doing this?
The background to this is vehicle tracking: I have a list of trips performed by each vehicle, and I want a per-day summary with the total time driven each day.

You should change your data from 01/01/2019 to '01/01/2019' (a quoted string), since a bare 01/01/2019 is not valid Python.
#naivepredictor suggested a good sample; if you don't want to import pandas, use this.
my_list = [['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]]
result_d = {}
for i in my_list:
    result_d[i[0]] = result_d.get(i[0], 0) + i[1]
print(result_d)  # {'01/01/2019': 13, '02/01/2019': 4, '03/01/2019': 2}
print([list(d) for d in result_d.items()])  # [['01/01/2019', 13], ['02/01/2019', 4], ['03/01/2019', 2]]
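The same accumulation can be written with collections.Counter from the standard library; a minimal sketch, assuming the same list layout as above:

```python
from collections import Counter

my_list = [['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]]

totals = Counter()
for date, value in my_list:
    totals[date] += value

# Counter preserves first-seen insertion order on Python 3.7+
result = [[date, total] for date, total in totals.items()]
# [['01/01/2019', 13], ['02/01/2019', 4], ['03/01/2019', 2]]
```

Counter also supports `totals.update({date: value})` and arithmetic between counters, which can be handy if the daily summaries come from several sources.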

import pandas as pd
# create dataframe out of the given input
df = pd.DataFrame(data=[['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]], columns=['date', 'trip_len'])
# groupby date and sum values for each day
df = df.groupby('date').sum().reset_index()
# output result as list of lists
result = df.values.tolist()

Related

Appending list with highest value from two columns using for loop in pandas

I have two columns: column A, and column B.
I would like to find whether the value in each row of column A is larger than the value in the same row of column B, and, if it is, append those values to a list.
I'm able to append to the list if the value in column A is higher than a set value, but I'm unsure how to compare it to the value from column B.
The code below appends to the list if the value in column A is higher than 4. Hopefully I'm on the right track and can just substitute the 4 with some other code?
my_list = []  # avoid naming this "list", which shadows the builtin
for x in A:
    if x > 4:
        my_list.append(x)
print(my_list)
Any help would be greatly appreciated.
Thank you!
An approach could be:
import pandas as pd
df = pd.DataFrame({"A":[2, 3, 4, 5], "B":[1, 4, 6, 3]}) # Test DataFrame
print(list(df[df["A"] > df["B"]]["A"]))
OUTPUT
[2, 5]
FOLLOW UP
If, as described in the comments, you want to check conditions on multiple columns:
import pandas as pd
df = pd.DataFrame({"A":[2, 3, 4, 5], "B":[1, 4, 6, 3], "C":[1, 1, 1, 10]}) # Test DataFrame
print(list(df[(df["A"] > df["B"]) & (df["A"] > df["C"])]["A"]))
OUTPUT
[2]
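The same filter can also be written with .loc, which reads a little more explicitly; a sketch using the first test DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 4, 5], "B": [1, 4, 6, 3]})

# boolean mask selects the rows where A beats B; .tolist() gives a plain list
result = df.loc[df["A"] > df["B"], "A"].tolist()
# [2, 5]
```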

Pandas: Looking to avoid a for loop when creating a nested dictionary

Here is my data:
df:
id  sub_id
A   1
A   2
B   3
B   4
and I have the following array:
[[1,2],
[2,5],
[1,4],
[7,8]]
Here is my code:
from collections import defaultdict
sub_id_array_dict = defaultdict(dict)
for i, s, a in zip(df['id'].to_list(), df['sub_id'].to_list(), arrays):
    sub_id_array_dict[i][s] = a
Now, my actual dataframe includes a total of 100M rows (unique sub_id) with 500K unique ids. Ideally, I'd like to avoid a for loop.
Any help would be much appreciated.
Assuming the arrays variable has the same number of rows as the DataFrame, assign it as a new column:
df['value'] = arrays
Then convert into a nested dictionary by grouping:
df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
Output
{'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}
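Put together, a runnable sketch of the grouping approach, using a small DataFrame matching the question's data:

```python
import pandas as pd

# small example matching the question's data
df = pd.DataFrame({'id': ['A', 'A', 'B', 'B'], 'sub_id': [1, 2, 3, 4]})
arrays = [[1, 2], [2, 5], [1, 4], [7, 8]]

df['value'] = arrays
# build {id: {sub_id: array}} in one grouped pass
nested = df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
# {'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}
```

Note that groupby.apply still iterates over groups internally, so with 500K unique ids it trades the explicit Python loop for a per-group lambda call rather than eliminating the work entirely.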

Comparing rows in dataset based on a specific column to find min/max

So I have a dataset that contains the history of a specific tag from a start date to an end date. I am trying to compare rows based on a date column: if rows share the same month, day, and year, I add the value from the next column to a temporary list. Once I have all the items for that date, I take the min and max of that list, subtract them, append the result to another list, and empty the temporary list to start over.
For the sake of time and simplicity, I am just presenting a example of 2D List. Here's my example data
dataset = [[1,5],[1,6],[1,10],[1,23],[2,4],[2,8],[2,12],[3,10],[3,20],[3,40],[4,50],[4,500]]
Here the first column acts as the date and the second as the value.
The issue I am having:
I can't seem to compare every row based on its first column so that the value in the second column gets included in the temp list for the min/max operations.
Based on the above 2D list I would expect to get [18, 8, 30, 450], but the result is [5, 4, 10].
dataset = [[1,5],[1,6],[1,10],[1,23],[2,4],[2,8],[2,12],[3,10],[3,30],[3,40],[4,2],[4,5]]
temp_list = []
daily_total = []
for i in range(len(dataset)-1):
    if dataset[i][0] == dataset[i+1][0]:
        temp_list.append(dataset[i][1])
    else:
        max_ = max(temp_list)
        min_ = min(temp_list)
        total = max_ - min_
        daily_total.append(total)
        temp_list = []
print(daily_total)
Try:
tmp = {}
for d, v in dataset:
    tmp.setdefault(d, []).append(v)
out = [max(v) - min(v) for v in tmp.values()]
print(out)
Prints:
[18, 8, 30, 450]
Here is a solution using pandas:
import pandas as pd
dataset = [
    [1, 5],
    [1, 6],
    [1, 10],
    [1, 23],
    [2, 4],
    [2, 8],
    [2, 12],
    [3, 10],
    [3, 20],
    [3, 40],
    [4, 50],
    [4, 500],
]
df = pd.DataFrame(dataset)
df.columns = ["date", "value"]
df = df.groupby("date").agg(min_value=("value", "min"), max_value=("value", "max"))
df["res"] = df["max_value"] - df["min_value"]
df["res"].to_list()
Output:
[18, 8, 30, 450]
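A more compact variant of the same pandas groupby, computing max minus min in a single aggregation:

```python
import pandas as pd

dataset = [[1, 5], [1, 6], [1, 10], [1, 23], [2, 4], [2, 8], [2, 12],
           [3, 10], [3, 20], [3, 40], [4, 50], [4, 500]]
df = pd.DataFrame(dataset, columns=["date", "value"])

# one aggregation per group: the spread (max - min) of each day's values
res = df.groupby("date")["value"].agg(lambda s: s.max() - s.min()).tolist()
# [18, 8, 30, 450]
```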

Can't get the size of numpy.ndarray

I have a dataframe as follows:
   version  count region    listing
2       v2      2    CAN     [7, 8]
2       v3      3    CAN  [7, 8, 9]
I want to extract listing list for each row and get the length. So I did the following:
group_v2_list = group[group['version'] == 'v2']['listing'].values
and I get the output [list([7, 8])]. The type of group_v2_list is numpy.ndarray, which I found using type(group_v2_list).
Now I want to get the number of elements in this group_v2_list but I am unable to get it.
I tried len(group_v2_list) and group_v2_list.size but both are giving me 1. I want to get the number of elements which should be 2 as 7, 8.
How can I get that?
You do not need to access the numpy representation for this.
One way is to use the .loc accessor to extract the series and take the length of its first element:
import pandas as pd

df = pd.DataFrame({'version': ['v2', 'v3'],
                   'count': [2, 3],
                   'region': ['CAN', 'CAN'],
                   'listing': [[7, 8], [7, 8, 9]]})
df_v2_list = df.loc[df['version'] == 'v2', 'listing']
res_v2 = len(df_v2_list[0])
# 2
If there are multiple elements in your filtered data, you can retrieve a list of their lengths by using pd.Series.map(len):
df_v_all_list = df.loc[df['version'].str.startswith('v'), 'listing']
res_all = df_v_all_list.map(len).tolist()
# [2, 3]
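One caveat with df_v2_list[0]: it looks up the label 0, which only works because the matching row happens to sit at index 0. A sketch using positional .iloc, which is robust regardless of the index labels (only the relevant columns are kept here):

```python
import pandas as pd

df = pd.DataFrame({'version': ['v2', 'v3'],
                   'listing': [[7, 8], [7, 8, 9]]})

# .iloc[0] selects by position, so this works even when the filtered
# rows do not carry the label 0
res = len(df.loc[df['version'] == 'v2', 'listing'].iloc[0])
# res == 2
```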

Drop Columns that starts with any of a list of strings Pandas

I'm trying to drop all columns from a df that start with any of a list of strings. I needed to copy these columns to their own dfs, and now want to drop them from a copy of the main df to make it easier to analyze.
df.columns = ["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"...]
Entered some code that gave me this dataframes with these columns:
aaa.columns = ["AAA1234", "AAA5678"]
bbb.columns = ["BBB1234", "BBB5678"]
I did get the final df that I wanted, but my code felt rather clunky:
droplist_cols = [aaa, bbb]
droplist = []
for x in droplist_cols:
    for col in x.columns:
        droplist.append(col)
df1 = df.drop(labels=droplist, axis=1)
Columns of final df:
df1.columns = ["CCC123", "DDD123"...]
Is there a better way to do this?
--Edit for sample data--
df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3], [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]], columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])
Desired result:
CCC123
0 5
1 1
2 3
3 1
4 1
IIUC. Let's begin with a dataframe:
import pandas as pd
df = pd.DataFrame({"A": [0]})
Reindex the dataframe to include your columns:
df2 = df.reindex(columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"], fill_value=0)
Drop all columns starting with A:
df3 = df2.loc[:, ~df2.columns.str.startswith('A')]
If you need to drop columns starting with A OR B, I would use:
df3 = df2.loc[:, ~(df2.columns.str.startswith('A') | df2.columns.str.startswith('B'))]
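Since Python's own str.startswith accepts a tuple of prefixes, the same drop can also be written as one list comprehension without building intermediate dataframes; a sketch using the question's sample data (the prefixes tuple is taken from the column names in the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1]],
                  columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])

# str.startswith accepts a tuple, so one comprehension covers every prefix
prefixes = ("AAA", "BBB")
df1 = df.drop(columns=[c for c in df.columns if c.startswith(prefixes)])
# df1.columns -> ['CCC123']
```

This keeps the drop list in one place, which scales better than chaining a startswith condition per prefix.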
