Making two arrays the same size by grouping - python

I have 2 arrays of the same size. The first one represents time and the second one represents distance. I want to group by the first one, so each group contains only values with the same integer part (the floats between two integers).
Here's my original time array:
time=[0.2,0.4,0.6,0.8,1,1.2,1.4,1.6,1.8,1.9]
And here's my distance array:
distance=[1,2,4,5.5,7.8,9.6,10,11,11.6,11.9]
So in the time array, after grouping by the same integer part, I'm getting this:
time=[0.2,0.4,0.6,0.8],[1,1.2,1.4,1.6,1.8,1.9]
The first subgroup contains 4 elements and the second contains 6 elements,
so the distance groups should contain 4 and then 6 elements accordingly,
like this:
distance=[1,2,4,5.5],[7.8,9.6,10,11,11.6,11.9]
so that each distance group has the same size as the corresponding time group.
Any ideas or help?

The following code does what you want:
# One bucket per integer part occurring in time; sort the keys so the
# groups come out in increasing order.
s = set(int(i) for i in time)
timesplit = {i: [] for i in sorted(s)}
for i in time:
    k = int(i)
    timesplit[k].append(i)

# Cut distance into chunks with the same lengths as the time groups.
timelengths = [len(v) for v in timesplit.values()]
distances = []
for n in timelengths:
    distances.append(distance[:n])
    distance = distance[n:]

res_times = list(timesplit.values())
res_distances = distances
print(res_times)
print(res_distances)
Output:
[[0.2, 0.4, 0.6, 0.8], [1, 1.2, 1.4, 1.6, 1.8, 1.9]]
[[1, 2, 4, 5.5], [7.8, 9.6, 10, 11, 11.6, 11.9]]
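If the time values are already sorted, as in the example, a tidier sketch uses itertools.groupby to group consecutive elements by their integer part and slices distance with the same boundaries (starting again from the original, unmodified lists):
from itertools import groupby

res_times, res_distances, start = [], [], 0
for _, g in groupby(time, key=int):
    group = list(g)                                 # one integer-part group
    res_times.append(group)
    res_distances.append(distance[start:start + len(group)])
    start += len(group)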

Related

Easy way to cluster entries in a numpy array with a condition

Background: I have a numpy array of float entries. This is basically a set of observations of something, say temperature measured during 24 hours. Imagine that the person who records the temperature is not available for the entire day; instead he/she takes a few (say 5) readings during one hour and, again after a few hours, takes more readings (say 8 of them). All the measurements he/she puts in a single np.array and has handed over to me!
Problem: I have no idea when the readings were taken. So I decide to cluster the observations in the following way: first recognize local peaks in the array, and group together all entries that are close enough (within a chosen tolerance, say 1 deg); meaning, I want to split the array into a list of sub-arrays. Note: any entry should belong to exactly one group.
One possible approach: first sort the array, then split it into sub-arrays with two conditions: (1) the difference between the first and last entries of a sub-array is not more than 1 deg, (2) the difference between the last entry of a sub-array and the first entry of the next sub-array is greater than 1 deg. How can I achieve this fast (the numpy way)?
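A minimal numpy sketch of this sort-then-split approach (the data here is made up for illustration): splitting wherever the gap between consecutive sorted values exceeds the tolerance enforces condition (2) directly; condition (1) then holds whenever each resulting cluster's total spread stays within the tolerance, otherwise a second greedy pass would be needed to subdivide wide clusters.
import numpy as np

a = np.array([20.1, 20.5, 25.0, 25.3, 20.3, 24.8, 25.1, 20.2])  # made-up readings
tol = 1.0                                  # chosen tolerance in degrees

s = np.sort(a)
# Indices where the gap between consecutive sorted values exceeds tol
cuts = np.where(np.diff(s) > tol)[0] + 1
groups = np.split(s, cuts)
print(groups)
# [array([20.1, 20.2, 20.3, 20.5]), array([24.8, 25. , 25.1, 25.3])]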

Python NumPy array: get the last 3 values to identify whether it is ascending, descending, or mixed

I am trying to do the following with a NumPy array or a normal array:
To push the data I am doing:
ar1 = []
# Read from a Pandas dataframe column. i is the row number of the data - it's working fine.
ar1.append(df['rolenumber'][i])
OUTPUT:
[34768, 34739, 34726, 34719, 34715]
The values may come out ascending, descending, or any mix of the two.
Here I want to take the last 3 values to validate whether they are ascending, descending, or mixed.
Ascending: the last 3 values increase steadily. Example: 34726, 34739, 34745
Descending: the last 3 values decrease steadily. Example: 34726, 34719, 34715
Mixed: the last 3 values go big, then small, then big again. Example: 34726, 34719, 34725
Note: no need to sort, only to validate.
This little snippet should get you going:
import numpy as np

a = np.array([34768, 34739, 34726, 34719, 34715])
# np.diff of the last 3 values yields the two consecutive differences
is_descending = np.all(np.diff(a[-3:]) < 0)
is_ascending = np.all(np.diff(a[-3:]) > 0)
is_mixed = ~(is_ascending | is_descending)
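For example, continuing from the snippet above and checking the "mixed" sequence from the question:
b = np.array([34726, 34719, 34725])
print(np.all(np.diff(b[-3:]) < 0))  # False: not strictly decreasing
print(np.all(np.diff(b[-3:]) > 0))  # False: not strictly increasing
# -> neither flag is set, so the sequence is mixed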

How to get consecutive averages of column values based on a condition from another column in the same data frame using pandas

I have a large data frame in pandas with two columns, Time and Values. I want to calculate consecutive averages for the column Values based on a condition formed from the column Time.
I want to calculate the average of the first l values in column Values, then the next l values from the same column, and so on until the end of the data frame. The value l is the number of values that go into each average; it is determined by the time difference in column Time. The starting data frame looks like this:
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, an average needs to be taken every 2 seconds, and the number of time values inside that time window determines the number of values l over which the average is calculated.
a1 would be the first average of l values, a2 the next, and so on.
The second part of the question is the same calculation of averages, but with the number l known in advance. I tried this
df['Time'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
If I expect to have the same number of values in every 2 seconds till the end of the data frame, I can use the following loop to calculate value of l:
s = 1
l = 0
while df['Time'].iloc[s] - df['Time'].iloc[0] <= 2:
    s += 1
    l += 1
Could this be done differently, without the loop?
How can I do this if the number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate the average of every l elements in another column, for example column Values, and populate the column Averages of a data frame df1 with these values.
I tried the following code
p = 0
df1 = pd.DataFrame(columns=['Time', 'Averages'])
for w in range(0, len(df) - 1, 2):
    df1.loc[p, 'Averages'] = df['Values'].iloc[w:w + 2].mean()
    p = p + 1
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns, Time and Values. I want to determine how many consecutive values from the column Values should be averaged at a time. I determine this number l from the column Time by counting how many rows fall inside a time difference of 2 seconds. Once I have that value, for example 2, I average the first two values from the column Values, then the next 2, and so on till the end of the data frame. At the end, I store these values in a separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Value and then about groups of consecutive rows.
If you want to group consecutive rows and get the mean of Time and Value, the following does it for you. You really need to show by example what you are actually trying to achieve.
import datetime as dt
import random

import pandas as pd

d = list(pd.date_range(dt.datetime(2020, 7, 1), dt.datetime(2020, 7, 2), freq="15min"))
df = pd.DataFrame({"Time": d,
                   "Value": [round(random.uniform(0, 1), 6) for x in d]})
df

n = 5  # number of consecutive rows per group
df.assign(grp=df.index // n).groupby("grp").agg({"Time": lambda s: s.mean(), "Value": "mean"})
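For the variable-l case, a sketch along the same lines, assuming the question's df with its Values column and Time holding epoch seconds: bucket the rows into 2-second windows measured from the first timestamp, then average per bucket. On the four example rows this yields the groups {a, b} and {c, d}, with Averages 4.5 and 8.
# Bucket index: 0 for the first 2 seconds, 1 for the next 2 seconds, ...
bins = ((df['Time'] - df['Time'].iloc[0]) // 2).astype(int)
df1 = (df.groupby(bins)
         .agg(Time=('Time', 'mean'), Averages=('Values', 'mean'))
         .reset_index(drop=True))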

random binary matrix with restrictions

I want to create a binary 16*15 matrix with certain conditions. I use binary strings to make the matrix. I want my matrix to be as described:
- The first and last two elements of each row must alternate.
- The sum of each row must be 8 or 7.
- In each row there should not be consecutive 1s or 0s (one pair, 00 or 11, is allowed per row).
- The sum of each column must be 8.
There are 26 possible strings that fulfill the first 3 conditions. How can I fulfill the last condition?
I have code, but it is not working because it takes so much time; it is almost impossible. Is there any other way?
You don't need any extra constraint to fulfill the last condition. The required column sum of 8 is exactly half of the 16 rows, so simply pick 8 rows that satisfy the first three conditions and append their bitwise complements (every 0 and 1 swapped) as the other 8 rows. Each column then gets a 0 and a 1 from every row/complement pair, so every column sums to 8, and the first three conditions survive complementing: a row sum of 7 becomes 8 and vice versa, and the alternation and adjacent-pair patterns are unchanged.
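A minimal sketch of that construction. The valid() check encodes one reading of the first condition (the two leading and the two trailing elements each alternate); adjust it if your 26 known strings follow a different rule.
import itertools
import random

import numpy as np

def valid(row):
    if row.sum() not in (7, 8):                    # row sum must be 8 or 7
        return False
    if row[0] == row[1] or row[-1] == row[-2]:     # ends must alternate
        return False
    pairs = np.count_nonzero(row[:-1] == row[1:])  # adjacent equal pairs
    return pairs <= 1                              # at most one 00 or 11

rows = [np.array(b) for b in itertools.product([0, 1], repeat=15)]
rows = [r for r in rows if valid(r)]

top = random.sample(rows, 8)                    # any 8 valid rows
matrix = np.vstack(top + [1 - r for r in top])  # plus their complements
assert (matrix.sum(axis=0) == 8).all()          # every column sums to 8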

Measuring covariance on several rows

I'm new to Python and I'm trying to find my way by performing some calculations (I can do them easily in Excel, but now I want to know how to do it in Python).
One calculation is the covariance.
I have a simple example with 3 items that are sold, and the demand per item over 24 months.
Here you see a snapshot of the Excel file:
[Image: Items and their demand over 24 months]
The goal is to measure the covariance between all three items: between items 1 and 2, 1 and 3, and 2 and 3. But I also want to know how to do it for more than 3 items, let's say for a thousand items.
The calculations are as follows:
First I have to calculate the averages per item. This is something I already found out how to do with the following code,
after importing the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
I imported the file:
df = pd.read_excel("Directory\\Covariance.xlsx")
And calculated the average per row:
x = df.iloc[:, 1:].values
df['avg'] = x.mean(axis=1)
This gives the file with an extra column, the average (avg):
[Image: Items, their demand and the average]
The next calculation is the covariance between, let's say, items 1 and 2. Mathematically this is done as follows:
(column "1" of item 1 - column "avg" of item 1) * (column "1" of item 2 - column "avg" of item 2). This has to be done for columns "1" to "24", so 24 times, which would add 24 columns to the file df.
After this, we take the average of these 24 columns, and that gives the covariance between items 1 and 2. Because we have to do this N-1 times per item, in this simple case each item gets 2 covariance numbers (for the first item, the covariance with items 2 and 3; for the second item, with items 1 and 3; for the third item, with items 1 and 2).
So the first question is: how can I achieve this for these 3 items, so that the file has columns displaying the 2 covariance outcomes per item (the first item should have a column with the covariance of items 1 and 2 and a second column with the covariance of items 1 and 3, and so on)?
The second question is of course: what if I have 1000 items? Then I have 999 covariance numbers per item and thus 999 extra columns, but also 999*25 extra columns if I calculate it via the above methodology. How do I perform this calculation for every item as efficiently as possible?
Pandas has a built-in function to calculate the covariance matrix, but first you need to make sure your dataframe is in the correct format. The first column in your data actually contains the row labels, so let's put those in the index:
df = pd.read_excel("Directory\\Covariance.xlsx", index_col=0)
Then you can also calculate the mean more easily, but don't put it back in your dataframe yet!
avg = df.mean(axis=1)
To calculate the covariance matrix, just call .cov(). This however calculates pair-wise covariances of columns, so transpose the dataframe first:
cov = df.T.cov()
If you want, you can put everything together in 1 dataframe:
df['avg'] = avg
df = df.join(cov, rsuffix='_cov')
Note: the covariance matrix includes each item's covariance with itself, which is the variance of that item. Also note that .cov() divides by N-1 (sample covariance), whereas averaging the 24 product columns as described in the question divides by N.
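Putting the answer together on made-up numbers (standing in for the Excel file), a self-contained sketch:
import pandas as pd

# Rows are items, columns are months (here only 4 months of fictional demand)
df = pd.DataFrame(
    {1: [10, 12, 9], 2: [11, 14, 8], 3: [9, 13, 10], 4: [12, 15, 7]},
    index=['item1', 'item2', 'item3'],
)
avg = df.mean(axis=1)   # average demand per item
cov = df.T.cov()        # 3x3 pair-wise covariance between items
print(cov)              # diagonal holds each item's variance

df['avg'] = avg
df = df.join(cov, rsuffix='_cov')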
