Good day!
I have the following time series dataset:
Time Value
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 4
11 4
12 5
I need to split and group data by value like this:
Value  Time start  Time end
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
How can I do this fast, and in the most functional programming style, in Python? Various libraries can be used, for example pandas or numpy.
Try with pandas:
df.groupby('Value')['Time'].agg(['min', 'max'])
We can use pandas for this:
Solution:
import pandas as pd

data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        'Value': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5]}
df = pd.DataFrame(data, columns=['Time', 'Value'])
res = df.groupby('Value').agg(['min', 'max'])
f_res = res.rename(columns={'min': 'Start Time', 'max': 'End Time'})
print(f_res)
Output:
Time
Start Time End Time
Value
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
First, get the count of each Value:
result = df.groupby('Value').agg(['count'])
result.columns = result.columns.get_level_values(1)  # drop the MultiIndex
result
count
Value
1 3
2 4
3 2
4 2
5 1
Then use cumcount to get the time start:
s = df.groupby('Value').cumcount()
result["time start"] = s[s == 0].index.tolist()  # positional index of each Value's first row
result
count time start
Value
1 3 0
2 4 3
3 2 7
4 2 9
5 1 11
Finally, shift to 1-based times and derive the end time from the count:
result["time start"] += 1
result["time end"] = result["time start"] + result['count'] - 1
result
count time start time end
Value
1 3 1 3
2 4 4 7
3 2 8 9
4 2 10 11
5 1 12 12
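For the functional style the question asks about, here is a minimal sketch using itertools.groupby from the standard library, reusing the df built above (it assumes the rows are already sorted by Time, as in the sample data):
from itertools import groupby

pairs = list(zip(df['Time'], df['Value']))
runs = [(value, grp[0][0], grp[-1][0])
        for value, g in groupby(pairs, key=lambda p: p[1])
        for grp in [list(g)]]
# [(1, 1, 3), (2, 4, 7), (3, 8, 9), (4, 10, 11), (5, 12, 12)]
Each tuple is (Value, Time start, Time end); itertools.groupby only groups consecutive equal values, which is exactly what this run-length split needs.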
Related
Let's say I have the following data:
day  query  num_searches
1    abc    2
1    def    3
2    abc    6
3    abc    5
4    def    1
4    abc    3
5    abc    7
6    abc    8
7    abc    10
8    abc    1
I'd like to generate a z-score that excludes the current row's value, such that for query 'abc':
Day 1: [6, 5, 3, 7, 8, 10, 1] (exclude the value 2) zscore = -1.32
Day 2: [2, 5, 3, 7, 8, 10, 1] (exclude the value 6) zscore = 0.28
...
Day 7: [2, 6, 5, 3, 7, 8, 1] (exclude the value 10) zscore = 2.22
Day 8: [2, 6, 5, 3, 7, 8, 10] (exclude the value 1) zscore = -1.88
I have the following function to calculate this 'exclusive' zscore.
import numpy as np

def zscore_exclusive(arr):
    newl = []
    for index, val in enumerate(arr):
        l = list(arr)
        val = l.pop(index)          # remove the current element
        arr_popped = np.array(l)
        avg = np.mean(arr_popped)
        stdev = np.std(arr_popped)  # note: ddof=0 by default
        newl.append((val - avg) / stdev)
    return np.array(newl)
How can I apply this custom function to each grouping (by query string)? Remember, I'd like to pop the currently evaluated element from the series.
Given:
day query num_searches
0 1 abc 2
1 1 def 3
2 2 abc 6
3 3 abc 5
4 4 def 1
5 4 abc 3
6 5 abc 7
7 6 abc 8
8 7 abc 10
9 8 abc 1
Doing:
Note: np.std uses ddof=0 by default, but pd.Series.std uses ddof=1 by default. Be sure which one you want to use.
z_score = lambda x: [(x[i]-x.drop(i).mean())/x.drop(i).std(ddof=0) for i in x.index]
df['z-score'] = df.groupby('query')['num_searches'].transform(z_score)
print(df)
Output:
day query num_searches z-score
0 1 abc 2 -1.319950
1 1 def 3 inf
2 2 abc 6 0.277350
3 3 abc 5 -0.092057
4 4 def 1 -inf
5 4 abc 3 -0.866025
6 5 abc 7 0.661438
7 6 abc 8 1.083862
8 7 abc 10 2.223782
9 8 abc 1 -1.877336
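For larger frames, here is a vectorized leave-one-out z-score sketch using per-group sums (my own alternative, not part of the answer above; it reproduces the ddof=0 results, including the inf values for the two-row 'def' group):
g = df.groupby('query')['num_searches']
n = g.transform('count')
s = g.transform('sum')
q = df['num_searches'].pow(2).groupby(df['query']).transform('sum')
loo_mean = (s - df['num_searches']) / (n - 1)                       # mean without the current row
loo_var = (q - df['num_searches'].pow(2)) / (n - 1) - loo_mean**2   # ddof=0 variance without the current row
df['z-score'] = (df['num_searches'] - loo_mean) / loo_var**0.5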
I have a baseline column (base) in a pandas data frame and I want to difference all other columns x* from this column while preserving two groups group1, group2:
The easiest way is to simply difference by doing:
df = pd.DataFrame({'group1': [0, 0, 1, 1], 'group2': [2, 2, 3, 4],
'base': [0, 1, 2, 3], 'x1': [3, 4, 5, 6], 'x2': [5, 6, 7, 8]})
df['diff_x1'] = df['x1'] - df['base']
df['diff_x2'] = df['x2'] - df['base']
group1 group2 base x1 x2 diff_x1 diff_x2
0 0 2 0 3 5 3 5
1 0 2 1 4 6 3 5
2 1 3 2 5 7 3 5
3 1 4 3 6 8 3 5
But I have hundreds of columns I need to do this for, so I'm looking for a more efficient way.
You can subtract a Series from a DataFrame column-wise using the sub method with axis=0, which saves you from doing the subtraction for each column individually:
to_sub = df.filter(regex='x.*') # filter based on your actual logic
pd.concat([
df,
to_sub.sub(df.base, axis=0).add_prefix('diff_')
], axis=1)
# group1 group2 base x1 x2 diff_x1 diff_x2
#0 0 2 0 3 5 3 5
#1 0 2 1 4 6 3 5
#2 1 3 2 5 7 3 5
#3 1 4 3 6 8 3 5
Another way is to use df.drop(..., axis=1) to remove the grouping and base columns, then subtract df['base'] from everything that remains with sub(..., axis=0). This guarantees you catch all the columns and preserve their order, and you don't even need a regex.
df_diff = df.drop(['group1','group2','base'], axis=1).sub(df['base'], axis=0).add_prefix('diff_')
diff_x1 diff_x2
0 3 5
1 3 5
2 3 5
3 3 5
Hence your full solution is:
pd.concat([df, df_diff], axis=1)
group1 group2 base x1 x2 diff_x1 diff_x2
0 0 2 0 3 5 3 5
1 0 2 1 4 6 3 5
2 1 3 2 5 7 3 5
3 1 4 3 6 8 3 5
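The same pattern can also be written as one pipeline with join; a small sketch, assuming every column to difference starts with 'x':
x_cols = [c for c in df.columns if c.startswith('x')]
out = df.join(df[x_cols].sub(df['base'], axis=0).add_prefix('diff_'))
join aligns on the index, so this produces the same frame as the pd.concat call above.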
I have a unique ID and time-series data, and the time-series data contains 3 macro variables.
I want to construct a data frame whose columns combine each variable with each date, with the same time-series values repeated for every ID. Here is an example of the initial and expected outputs.
The length of the ID is not important here.
Setup
Recreate the OP's dataframe:
import pandas as pd

dat = [[3, 4, 1], [4, 5, 3]]
idx = [2017, 2018]
col = ['A', 'B', 'C']
df = pd.DataFrame(dat, idx, col).rename_axis('time')
pd.concat
I wrap enumerate in dict, starting enumerate from 1 to match the OP's ID, which starts from 1:
new = pd.concat(dict(enumerate([df] * 3, 1)), names=['ID']).unstack()
new.columns = [f'{x}{y}' for x, y in new.columns]
new
A2017 A2018 B2017 B2018 C2017 C2018
ID
1 3 4 4 5 1 3
2 3 4 4 5 1 3
3 3 4 4 5 1 3
Details
To see what the concatenated dataframe looks like
pd.concat(dict(enumerate([df] * 3, 1)), names=['ID'])
A B C
ID time
1 2017 3 4 1
2018 4 5 3
2 2017 3 4 1
2018 4 5 3
3 2017 3 4 1
2018 4 5 3
If we unstack it
A B C
time 2017 2018 2017 2018 2017 2018
ID
1 3 4 4 5 1 3
2 3 4 4 5 1 3
3 3 4 4 5 1 3
The only thing left to do is to flatten the column levels together, which is what the f-string comprehension above does.
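For comparison, a hypothetical one-row alternative that builds the wide frame once and repeats it (assuming IDs 1 through 3, as above):
wide = df.unstack().to_frame().T                      # one row, MultiIndex columns (variable, time)
wide.columns = [f'{x}{y}' for x, y in wide.columns]   # flatten to A2017, A2018, ...
new = wide.loc[wide.index.repeat(3)].set_axis(pd.RangeIndex(1, 4, name='ID'))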
I have this pandas series:
ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
What I would like to get is a dataframe with an additional column that contains, for each row, the sum over rows 0, 2, 4, 6 or the sum over rows 1, 3, 5, 7 (that is, every other row is left out when creating each sum).
In this case, this means a new dataframe should look like this one:
index ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
How could I do this?
Use the row index modulo k to group every kth row together:
k = 2
df = ts.to_frame('ts')
df['sum'] = df.groupby(ts.index % k).transform('sum')
# if the index is not the default RangeIndex:
# df['sum'] = df.groupby(np.arange(len(ts)) % k).transform('sum')
print(df)
ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
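An equivalent NumPy sketch (my addition, not part of the answer above; it assumes the length of ts is an exact multiple of k):
import numpy as np

a = ts.to_numpy()
sums = a.reshape(-1, k).sum(axis=0)     # one sum per position in the cycle
df['sum'] = np.tile(sums, len(a) // k)  # repeat the pattern down the column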
I have an Excel sheet with values representing the start and end times of time-series data, as shown below. Times are in seconds.
Start_Time End_Time Value
0 2 A
2 3 B
3 9 A
9 11 C
I want to expand the values between the start and end times and display the value for each second.
Time Value
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
Any help to implement it in Python will be appreciated. Thanks.
Setup
Reading your Excel sheet with pandas should be easy to look up (e.g. pd.read_excel), and the options will depend on the file itself, so I won't cover that part.
Below is the reproduction of your sample dataframe, used for the example.
import pandas as pd
df = pd.DataFrame({'Start_Time': [0, 2, 3, 9],
'End_Time': [2, 3, 9, 11],
'Value': ['A', 'B', 'A', 'C']})
>>> df
Out[]:
End_Time Start_Time Value
0 2 0 A
1 3 2 B
2 9 3 A
3 11 9 C
Solution
(pd.Series(range(df.End_Time.max() + 1), name='Value') # Create a series on whole range
.map(df.set_index('End_Time').Value) # Set values from "df"
.bfill() # Backward fill NaNs values
.rename_axis('Time')) # Purely cosmetic axis rename
Out[]:
Time
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
Name: Value, dtype: object
Walkthrough
Create the whole "Time" range
s = pd.Series(range(df.End_Time.max() + 1))
>>> s
Out[]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
dtype: int32
Use "End_Time" as index for df
>>> df.set_index('End_Time')
Out[]:
Start_Time Value
End_Time
2 0 A
3 2 B
9 3 A
11 9 C
Map df values to corresponding "End_Time" values from s
s = s.map(df.set_index('End_Time').Value)
>>> s
Out[]:
0 NaN
1 NaN
2 A
3 B
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 A
10 NaN
11 C
dtype: object
Backward-fill the NaN values
s = s.bfill()
>>> s
Out[]:
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
dtype: object
Then rename_axis('Time') only renames the series axis to match your desired output.
Note that this works here because the intervals exclude Start_Time (each Value applies up to and including its End_Time).
If instead the Value really starts at Start_Time (inclusive, which is more common), you should map on Start_Time instead of End_Time and change bfill() to ffill() (forward-fill).
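A sketch of that including-Start_Time variant, for completeness:
(pd.Series(range(df.End_Time.max() + 1), name='Value')
 .map(df.set_index('Start_Time').Value)  # set values at Start_Time instead
 .ffill()                                # forward-fill NaN values
 .rename_axis('Time'))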