Can't get the size of numpy.ndarray - python

I have a dataframe as follows:
  version  count region    listing
2      v2      2    CAN     [7, 8]
2      v3      3    CAN  [7, 8, 9]
I want to extract the listing list for each row and get its length. So I did the following:
group_v2_list = group[group['version'] == 'v2']['listing'].values
The output is [list([7, 8])]. The type of group_v2_list is numpy.ndarray, which I confirmed with type(group_v2_list).
Now I want to get the number of elements in group_v2_list, but I am unable to get it.
I tried len(group_v2_list) and group_v2_list.size, but both give me 1. The count should be 2, since the list holds 7 and 8.
How can I get that?

You do not need to access the numpy representation for this.
One way is to use the .loc accessor to extract the Series and take the length of its first element:
import pandas as pd

df = pd.DataFrame({'version': ['v2', 'v3'],
                   'count': [2, 3],
                   'region': ['CAN', 'CAN'],
                   'listing': [[7, 8], [7, 8, 9]]})
df_v2_list = df.loc[df['version'] == 'v2', 'listing']
res_v2 = len(df_v2_list[0])
# 2
If your filtered data contains multiple rows, you can retrieve a list of their lengths with pd.Series.map(len):
df_v_all_list = df.loc[df['version'].str.startswith('v'), 'listing']
res_all = df_v_all_list.map(len).tolist()
# [2, 3]
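If you do want to work with the NumPy array returned by .values, note that it is an object array whose single element is the list [7, 8]; index into the array first and then take the length. A minimal sketch reusing the df built above (the variable name simply mirrors the question's):
group_v2_array = df.loc[df['version'] == 'v2', 'listing'].values
print(len(group_v2_array))     # 1 -- the object array holds one element: the list [7, 8]
print(len(group_v2_array[0]))  # 2 -- index into the array first, then take the length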


How can I find the missing index using python pandas?

Example
Order_ID  Name
 1        Man
 2        Boss
 5        Don
 7        Lil
 9        Dom
10        Bob
I want to get an output like:
3 4 6 8 are the missing Order_ID
Try a list comprehension with range. Membership needs to be checked against the column's values, because `in` on a Series tests the index labels rather than the values:
print([i for i in range(1, 10) if i not in df['Order_ID'].values])
Output:
[3, 4, 6, 8]
If Order_ID is the index, you can generate the missing values dynamically from the index's minimum and maximum:
import numpy as np
print(np.setdiff1d(np.arange(df.index.min(), df.index.max() + 1), df.index).tolist())
[3, 4, 6, 8]
Convert the list to a set and compute its difference with a set containing the integers from min(lst) to max(lst):
lst = df["Order_ID"].to_list()
sorted(set(range(min(lst), max(lst))) - set(lst))
# [3, 4, 6, 8]
Try this code:
missData = list(filter(lambda x: x not in df['Order_ID'].values, range(1, df['Order_ID'].max() + 1)))
print(f"{missData} are the missing Order_ID")
Output:
[3, 4, 6, 8] are the missing Order_ID

How to order columns of old/new values such that the ith old value = the (i-1)th new value

Edit: title suggestions welcome. This probably has a name, but I don't know what it is and could not find something similar.
Edit2: I've rewritten the problem to try and explain it more clearly. In both versions, I think I've met the site standards by putting forth an explanation, reproducible example, and my own solution... if you could suggest improvements before downvoting, that would be appreciated.
I have user entered data from a system containing these three columns:
date: timestamps in %Y-%m-%d %H:%M:%S format; however %S=00 for all cases
old: the old value of this observation
new: the new value of this observation
If the user entered data within the same minute, then sorting by the timestamp alone is insufficient. We end up with a "chunk" of entries that may or may not be in the correct order. To illustrate, I've replaced dates with integers and present a correct and jumbled case:
How do we know the data is in the correct order? When each row's value for old equals the previous row's value for new (ignoring the first/last row where we have nothing to compare to). Put another way: old_i = new_(i-1). This creates the matching diagonal colors on the left, which are jumbled on the right.
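To make the condition concrete, a check over one chunk could look like the sketch below (is_ordered is only an illustrative name and not part of my actual solution):
def is_ordered(chunk):
    # True when every row's `old` equals the previous row's `new`
    return bool((chunk['old'].values[1:] == chunk['new'].values[:-1]).all())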
Other comments:
there may be multiple solutions, as two rows may have the same values for old and new and thus be interchangeable
if an ambiguous chunk occurs by itself (imagine the data is only the rows where date=1 above), any solution will suffice
if the ambiguous chunk occurs with a unique date before and/or after, these serve as additional constraints and must be considered to achieve the solution
consider the case with back to back ambiguous chunks as bonus; I plan to ignore these and am not sure they even exist in the data
My data set is much larger, so my end solution will involve using pandas.groupby() to feed a function chunks like the above. The right side would be passed to the function, and I need the left side returned (or some index/order to get me to the left side).
Here's a reproducible example, using the same data as above, but adding a group column and another chunk so you can see my groupby() solution.
Setup and input jumbled data:
import pandas as pd
import itertools
df = pd.DataFrame({'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
                   'date': [0, 1, 1, 1, 1, 2, 3, 4, 4],
                   'old': [1, 8, 2, 2, 5, 5, 4, 10, 7],
                   'new': [2, 5, 5, 8, 2, 4, 7, 1, 10]})
print(df)
### jumbled: the `new` value of a row is not the same as the next row's `old` value
#   group  date  old  new
# 0     a     0    1    2
# 1     a     1    8    5
# 2     a     1    2    5
# 3     a     1    2    8
# 4     a     1    5    2
# 5     a     2    5    4
# 6     b     3    4    7
# 7     b     4   10    1
# 8     b     4    7   10
I wrote a kludgy solution that begs for a more elegant approach. See my gist here for the code behind the order_rows function I call below. The output is correct:
df1 = df.copy()
df1 = df1.groupby(['group'], as_index=False, sort=False).apply(order_rows).reset_index(drop=True)
print(df1)
### correct: the `old` value in each row equals the `new` value of the previous row
#   group  date  old  new
# 0     a     0    1    2
# 1     a     1    2    5
# 2     a     1    5    2
# 3     a     1    2    8
# 4     a     1    8    5
# 5     a     2    5    4
# 6     b     3    4    7
# 7     b     4    7   10
# 8     b     4   10    1
Update based on networkx suggestion
Note that bullet #2 above suggests that these ambiguous chunks can occur without a prior reference row. In that case, feeding df.iloc[0] as the starting point is not safe. In addition, I found that when the graph is seeded with an incorrect starting point, it appears to output only the edges it could reach from that point. Note that 5 rows were passed, but only 4 pairs were returned.
Example:
import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'a', 'a', 'a'],
                   'date': [1, 1, 1, 1, 1],
                   'old': [8, 1, 2, 2, 5],
                   'new': [5, 2, 5, 8, 2]})
g = nx.from_pandas_edgelist(df[['old', 'new']],
                            source='old',
                            target='new',
                            create_using=nx.DiGraph)
ordered = np.asarray(list(nx.algorithms.traversal.edge_dfs(g, df.old[0])))
ordered
# array([[8, 5],
# [5, 2],
# [2, 5],
# [2, 8]])
This is a graph problem. You can use networkx to create your graph, and then use numpy for manipulation. A simple traversal algorithm, like depth-first search, will visit all your edges starting from a source.
The source is simply your first node (i.e. df.old[0])
For your example:
import networkx as nx
import numpy as np

g = nx.from_pandas_edgelist(df[['old', 'new']],
                            source='old',
                            target='new',
                            create_using=nx.DiGraph)
ordered = np.asarray(list(nx.algorithms.traversal.edge_dfs(g, df.old[0])))
>>> ordered
array([[ 1, 2],
[ 2, 5],
[ 5, 2],
[ 2, 8],
[ 8, 5],
[ 5, 4],
[ 4, 7],
[ 7, 10],
[10, 1]])
You may just assign back to your DataFrame: df[['old', 'new']] = ordered. You might have to change some small details, e.g. if your groups are not inter-connected. But if the starting point is a df sorted on group and date, and the old_i = new_(i-1) dependencies are respected across groups, then you should be fine just assigning the ordered array back.
I still believe, however, that you should investigate your timestamps. I believe this is a simpler problem that can be solved by just sorting the timestamps. Make sure you are not losing precision on your timestamps when reading/writing to files.
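Regarding the update: one way to avoid seeding edge_dfs with an arbitrary row is to start the traversal from a node with no incoming edge, falling back to the first row's old value when the chunk is a pure cycle. A rough sketch along those lines (order_chunk is an illustrative name; duplicate old/new pairs within a chunk would still need extra care, since from_pandas_edgelist collapses them):
import networkx as nx
import numpy as np

def order_chunk(chunk):
    g = nx.from_pandas_edgelist(chunk[['old', 'new']],
                                source='old', target='new',
                                create_using=nx.DiGraph)
    # prefer a node that nothing points to; otherwise fall back to the first row's `old`
    starts = [n for n, d in g.in_degree() if d == 0]
    start = starts[0] if starts else chunk['old'].iloc[0]
    return np.asarray(list(nx.edge_dfs(g, start)))
In the five-row example from the update, node 1 has no incoming edge, so the traversal starts there instead of at 8 and all five edges are returned.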

Merge and add duplicate integers from a multidimensional array

I have a multidimensional list where the first item is a date and the second is a datetime object; entries with the same date need to be added together. For example (keeping the second item as an integer for simplicity):
[[01/01/2019, 10], [01/01/2019, 3], [02/01/2019, 4], [03/01/2019, 2]]
The resulting array should be:
[[01/01/2019, 13], [02/01/2019, 4], [03/01/2019, 2]]
Does someone have a short way of doing this?
The background to this is vehicle tracking: I have a list of trips performed by a vehicle, and I want a per-day summary of the total time driven.
You should change your data 01/01/2019 to '01/01/2019'.
@naivepredictor suggested a good sample; anyway, if you don't want to import pandas, use this:
my_list = [['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]]
result_d = {}
for i in my_list:
    result_d[i[0]] = result_d.get(i[0], 0) + i[1]
print(result_d)  # {'01/01/2019': 13, '02/01/2019': 4, '03/01/2019': 2}
print([list(d) for d in result_d.items()])  # [['01/01/2019', 13], ['02/01/2019', 4], ['03/01/2019', 2]]
import pandas as pd

# create a dataframe out of the given input
df = pd.DataFrame(data=[['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]],
                  columns=['date', 'trip_len'])
# group by date and sum the values for each day
df = df.groupby('date').sum().reset_index()
# output the result as a list of lists
result = df.values.tolist()  # [['01/01/2019', 13], ['02/01/2019', 4], ['03/01/2019', 2]]
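Since the question says the second element is really a duration rather than an integer, the same accumulation works with timedelta values; a minimal sketch, assuming datetime.timedelta durations:
from datetime import timedelta

trips = [['01/01/2019', timedelta(minutes=10)],
         ['01/01/2019', timedelta(minutes=3)],
         ['02/01/2019', timedelta(minutes=4)],
         ['03/01/2019', timedelta(minutes=2)]]

totals = {}
for day, duration in trips:
    totals[day] = totals.get(day, timedelta()) + duration

print([list(t) for t in totals.items()])
# per-day totals: 13, 4 and 2 minutes respectively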

Import a list from a single cell of Excel Python

I am using the xlrd package and I want to import lists from an Excel spreadsheet that are written in single cells (comma separated or space separated), like the spreadsheet below:
I want to import those machine sequence values into four different lists, so that my expected output looks like the following:
M1 = [2, 4, 6, 8]
M2 = [1, 5, 9]
M3 = [2, 5, 1, 4, 9, 4]
M4 = [7, 4]
Should I format the spreadsheet in any better way to do this?
Please help.
Try:
import xlrd
main_book = xlrd.open_workbook('machine_sequence_excel_file.xlsx')
main_sheet = main_book.sheet_by_index(0) # assuming that the data is in the first sheet
M1, M2, M3, M4 = [[int(y) for y in x.split(',')] for x in main_sheet.col_values(1)[1:]]
# M1
# [2, 4, 6, 8]
# M2
# [1, 5, 9]
# M3
# [2, 5, 1, 4, 9, 4]
# M4
# [7, 4]
Essentially, what is happening here is that you are reading the values of the second column, starting with the second value. This is done by invoking the col_values method of the Sheet object. You then split each value on the comma to generate lists whose items are coerced into integers.
I hope this helps.
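The question also mentions space-separated cells. If both separators can occur, one option (an untested sketch that assumes the cells are stored as text) is to normalise commas to spaces and split on whitespace:
# turn "2,4,6,8" and "7 4" alike into lists of ints
M1, M2, M3, M4 = [[int(y) for y in str(x).replace(',', ' ').split()]
                  for x in main_sheet.col_values(1)[1:]]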

In python how can I find all the values in an array that are in between two specific values

Hi, I would like to go over certain values of an array. Let's say:
spikeTimes = [1,2,3,4,5,6,7,8]
stimTimes =[0.5, 4.5, 7.5]
WIN_SIZE=2
For every element of stimTimes, I want to get the spikeTimes values that fall between stimTimes[indx] and stimTimes[indx] + WIN_SIZE.
For the first value of stimTimes (0.5), I should get 1, 2 (between 0.5 and 2.5);
for the second value (4.5), 5, 6 (values between 4.5 and 4.5 + WIN_SIZE = 6.5);
and for the third value (7.5), 8.
This can be done with a nested list comprehension:
spikeTimes = [1, 2, 3, 4, 5, 6, 7, 8]
stimTimes = [0.5, 4.5, 7.5]
WIN_SIZE = 2
partitions = [[spike for spike in spikeTimes if stim <= spike <= stim + WIN_SIZE] for stim in stimTimes]
print(partitions)
Produces:
[[1, 2], [5, 6], [8]]
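If spikeTimes is large, a vectorized variant using NumPy boolean masks gives the same result; a quick sketch on the sample data:
import numpy as np

spike_times = np.array([1, 2, 3, 4, 5, 6, 7, 8])
stim_times = [0.5, 4.5, 7.5]
WIN_SIZE = 2

partitions = [spike_times[(spike_times >= stim) & (spike_times <= stim + WIN_SIZE)].tolist()
              for stim in stim_times]
print(partitions)  # [[1, 2], [5, 6], [8]]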
