I have this data frame about vehicle vibration and I want to calculate the dominant frequency of the vibration. I know that it can be calculated using numpy.fft, but I have no idea how to apply numpy.fft to my data frame.
Please enlighten me on how to do it in Python.
Thank you.
A DataFrame column is effectively a NumPy array:
import random
import numpy as np
import pandas as pd
df = pd.DataFrame({"Vibration": [random.uniform(2, 10) for i in range(10)]})
df["fft"] = np.fft.fft(df["Vibration"].values)
print(df.to_string())
output
Vibration fft
0 8.212039 63.320213+0.000000j
1 5.590523 2.640720-2.231825j
2 8.945281 -2.977825-5.716229j
3 6.833036 4.657765+5.649944j
4 5.150939 -0.216720-0.445046j
5 3.174186 10.592292+0.000000j
6 9.054791 -0.216720+0.445046j
7 5.830278 4.657765-5.649944j
8 5.593203 -2.977825+5.716229j
9 4.935937 2.640720+2.231825j
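To get the dominant frequency itself, one option is to pair the FFT with np.fft.fftfreq and take the bin with the largest magnitude, skipping the DC component. A minimal sketch, assuming a sampling rate fs in Hz that you would replace with your sensor's actual rate:
import random
import numpy as np
import pandas as pd

fs = 100.0  # assumed sampling rate in Hz; replace with your sensor's rate
df = pd.DataFrame({"Vibration": [random.uniform(2, 10) for i in range(1000)]})

spectrum = np.fft.fft(df["Vibration"].values)
freqs = np.fft.fftfreq(len(spectrum), d=1.0 / fs)

# look only at the positive-frequency half and skip the DC bin at index 0
half = len(spectrum) // 2
magnitudes = np.abs(spectrum[1:half])
dominant_freq = freqs[1:half][np.argmax(magnitudes)]
print(dominant_freq)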
Batching into groups of 15 rows:
df = pd.DataFrame({"Vibration": [random.uniform(2, 10) for i in range(800)]})
df.assign(
    fft=df.groupby(df.index // 15)["Vibration"].transform(lambda s: np.fft.fft(list(s)).astype("object")),
    grpfirst=df.groupby(df.index // 15)["Vibration"].transform(lambda s: list(s)[0]),
)
Without knowing what the DataFrame looks like, or which fields you need for your calculations, you can apply any function to a DataFrame using .apply():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
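For example, a minimal sketch (the Vibration column here is just assumed from the question for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Vibration": np.random.uniform(2, 10, size=10)})

# apply a function to each column (axis=0) ...
col_means = df.apply(np.mean, axis=0)

# ... or to each row (axis=1)
df["double"] = df.apply(lambda row: row["Vibration"] * 2, axis=1)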
I have a sample dataset. Here it is:
import pandas as pd
import numpy as np
df = {'Point1': [50,50,50,45,45,35,35], 'Point2': [48,44,30,35,33,34,32], 'Dist': [4,6,2,7,8,3,6]}
df = pd.DataFrame(df)
df
And its output is:
   Point1  Point2  Dist
0      50      48     4
1      50      44     6
2      50      30     2
3      45      35     7
4      45      33     8
5      35      34     3
6      35      32     6
My goal is to find, for each group of Point1, the Dist value that meets my condition, along with its corresponding Point2 value.
Here is my code (it gives an error):
if df['dist'] < 5:
    df1 = df[df['dist'].isin(df.groupby('Point1').max()['Dist'].values)]
else:
    df1 = df[df['dist'].isin(df.groupby('Point1').min()['Dist'].values)]
df1
And here is the expected output:
So, if there exist Dist values less than 5 within a group, I would like to take the maximum of those; if not, I would like to take the minimum. I hope that is clear.
IIUC, you want to find the Dist closest to 5, with priority for values lower than 5.
For this you can compute two helper columns to sort the values in order of priority and take the first row per group. Here 'cond' sorts values ≤5 first, then >5, and 'cond2' sorts by absolute distance to 5.
thresh = 5
(df
 .assign(cond=df['Dist'].gt(thresh),
         cond2=df['Dist'].sub(thresh).abs(),
         )
 .sort_values(by=['cond', 'cond2'])
 .groupby('Point1', as_index=False).first()
 .drop(columns=['cond', 'cond2'])
)
output:
Point1 Point2 Dist
0 35 34 3
1 45 35 7
2 50 48 4
NB: this also sorts by Point1 in the process. If that is unwanted, one can write a function that sorts a DataFrame this way and apply it per group; let me know if this is the case.
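If it helps, the same per-group rule ("the max of the Dist values below 5 if any exist, otherwise the min") can also be written explicitly with groupby().apply() and a hypothetical helper pick(); a minimal sketch:
import pandas as pd

df = pd.DataFrame({'Point1': [50, 50, 50, 45, 45, 35, 35],
                   'Point2': [48, 44, 30, 35, 33, 34, 32],
                   'Dist':   [4, 6, 2, 7, 8, 3, 6]})

def pick(group, thresh=5):
    below = group[group['Dist'] < thresh]
    if not below.empty:
        # at least one Dist below the threshold: keep the row with the largest of those
        return group.loc[below['Dist'].idxmax()]
    # no Dist below the threshold: keep the row with the smallest Dist
    return group.loc[group['Dist'].idxmin()]

result = df.groupby('Point1').apply(pick).reset_index(drop=True)
print(result)
This gives the same three rows as the output above.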
Since you are using a pandas DataFrame, you can use bracket syntax to filter the data.
In your case:
df[df['Dist'] < 5]
About the second part of the question: it was a little confusing. Can you explain more about "take the max one of these groups. If no, I would like to take the min one"?
import numpy as np
import pandas as pd
from tqdm import tqdm

df_list = []
for i in tqdm(item_list_short):
    df = query_result_df[query_result_df['id_item'] == i]
    # calculate mean
    mean = np.mean(df['price'])
    # calculate standard deviation
    sd = np.std(df['price'])
    # create empty list to store outliers
    outliers = []
    if sd == 0:
        outliers = 0
    else:
        # detect outlier
        for i in df['price']:
            z = (i - mean) / sd  # calculate z-score
            outliers.append(z)   # add to the empty list
    df['z-score'] = outliers
    df_list.append(df)

df_score = pd.concat(df_list)
df_score
Right now, if the length of item_list_short is in the millions, it takes a few days to finish. I checked the timing using the tqdm library.
Data query_result_df looks something like this:
id_item id_seller price
11 1 40
22 2 30
33 3 10
33 4 9
44 5 8
and the list item_list_short contains all unique id_item values.
You have two major factors causing slow performance here:
1) Filtering in the loop instead of grouping
At the moment your code takes O(rows * items), while a standard .groupby() takes O(rows), i.e. items times faster. Check out some examples.
In your case that would be:
df['z_score'] = df.groupby('id_item')['price'].transform(
    lambda ps: (ps - ps.mean()) / ps.std() if ps.min() < ps.max() else ps * 0.0
)
If you need to speed things up to the max, at the cost of a bit more code, try this:
mean_and_std = df.groupby('id_item')['price'].agg(['mean', 'std']).reset_index()
df = df.merge(mean_and_std, on='id_item')
# a zero std falls back to 1.0 to avoid division by zero
df['z_score'] = (df['price'] - df['mean']) / df['std'].apply(lambda s: s or 1.0)
Please report what speedup you got with this.
It seems to be the best practice for such calculations, so it is definitely worth reading the whole article.
2) Avoiding vectorized operations
Appending elements one by one is way slower than writing something like
z_score = (df['price'] - df['mean']) / df['std']
It looks like you are trying to compute a z-score per group. See the code below, which doesn't use any loops and works with groupby in pandas:
from scipy.stats import zscore
df.groupby(["id_item"])["price"].transform(lambda x: zscore(x, ddof=1))
0 NaN
1 NaN
2 0.707107
3 -0.707107
4 NaN
Name: price, dtype: float64
For testing whether a statistical difference occurs between two (large) samples, I want to compute the mean and sd from a value_counts Series:
In [0]: counts.value_counts()
0 783
1 1128
2 744
3 366
4 119
5 38
6 10
7 3
I'm aware calculating the mean is not hard at all by doing something like
total = 0
for idx, val in counts.value_counts().iteritems():
    total = total + idx * val
m = total / sum(counts.value_counts())
I'm asking if there's a shorter way of doing this.
I'm also asking how to calculate the standard deviation from the counts.value_counts() output.
You can actually do these.
counts.value_counts().mean()
counts.value_counts().median()
counts.value_counts().mode()
counts.value_counts().std()
You can use the pandas Series index to get the mean of the indexes:
import pandas as pd
import numpy as np
df = pd.DataFrame([1,2,3,4,4,4,4,4], columns = ['num'])
np.mean(df['num'].value_counts().index)
# output
2.5
You can get the mean from value counts by doing a weighted average with numpy.average:
counts = df.value_counts()
np.average(counts.index, weights=counts)
1.3979943591350674
Stdev is a little more tricky since it is less common to do that analysis with weights, but it looks like there is something in statsmodels that can help:
from statsmodels.stats.weightstats import DescrStatsW
weighted_stats = DescrStatsW(counts.index, weights=counts, ddof=0)
weighted_stats.mean, weighted_stats.std
(1.3979943591350674, 1.1904965747995073)
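If you prefer to avoid the statsmodels dependency, a minimal sketch of the same weighted mean and (population, ddof=0) standard deviation with plain numpy:
import numpy as np

# counts is the Series returned by value_counts(): index = values, data = frequencies
mean = np.average(counts.index, weights=counts)
std = np.sqrt(np.average((counts.index - mean) ** 2, weights=counts))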
Don't forget about describe(). It can be used on a series or dataframe.
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
df.describe() # returns dataframe containing describes for each column.
df['counts'].describe() # describe for values
df['counts'].value_counts().describe() # describe for value_counts()
df.value_counts().describe()['mean'] # returns mean
df['counts'].describe()[['mean','std']] # returns mean and std
I am creating a new column in a DataFrame that is based on other values in the entire DataFrame. I have found a couple of ways to do so (shown below), but they are very slow when working with large datasets (500k rows take about an hour to run). I am looking to speed this process up.
I have attempted to use .apply with a lambda function. I have also used .map to obtain a list to put into the new column. Both of these methods work but are too slow.
import pandas as pd

values = {'ID': ['1', '2', '3', '4', '1', '2', '3'],
          'MOD': ['X', 'Y', 'Z', 'X', 'X', 'Z', 'Y'],
          'Period': ['Current', 'Current', 'Current', 'Current', 'Past', 'Past', 'Past']
          }
df = pd.DataFrame(values, columns=['ID', 'MOD', 'Period'])
df['ID_MOD'] = df['ID'] + df['MOD']
def funct(identifier, indentifier_modification, period):
    if period == "Current":
        if (df.ID == identifier).sum() == 1:
            return "New"
        elif (df.ID_MOD == indentifier_modification).sum() == 1:
            return "Unique"
        else:
            return "Repeat"
    else:
        return "n/a"
Initial df:
ID MOD Period ID_MOD
0 1 X Current 1X
1 2 Y Current 2Y
2 3 Z Current 3Z
3 4 X Current 4X
4 1 X Past 1X
5 2 Z Past 2Z
6 3 Y Past 3Y
Here are the two methods that are too slow:
1)
df['new_column']=df.apply(lambda x:funct(x['ID'],x['ID_MOD'],x['Period']), axis=1)
2)
df['new_column']=list(map(funct,df['ID'],df['ID_MOD'],df['Period']))
Intended final df:
ID MOD Period ID_MOD new_column
0 1 X Current 1X Repeat
1 2 Y Current 2Y Unique
2 3 Z Current 3Z Unique
3 4 X Current 4X New
4 1 X Past 1X n/a
5 2 Z Past 2Z n/a
6 3 Y Past 3Y n/a
There are no error messages; the code just takes ~1 hour to run with a large data set.
your current code scales as O(N**2) where N is the number of rows. if your df really is 500k rows this is going to take a long time! you really want to be using code from numpy and pandas that has much lower computational complexity.
the aggregations built into pandas would help a lot in place of your use of sum, as would learning about how pandas does indexing and merging. in your case I can get 500k rows down to less than a second pretty easily.
start by defining a dummy data set:
import numpy as np
import pandas as pd
N = 500_000
df = pd.DataFrame({
    'id': np.random.choice(N // 2, N),
    'a': np.random.choice(list('XYZ'), N),
    'b': np.random.choice(list('CP'), N),
})
next we can do the aggregations to count across your various groups:
ids = df.groupby(['id']).size().rename('ids')
idas = df.groupby(['id','a']).size().rename('idas')
next we can join these aggregations back to the original data set
cutting down the data as much as possible is always a good idea; in your case Past values always get n/a, and since they make up half your data, dropping them first should roughly halve the amount of work:
df2 = df.loc[df['b'] == 'C',]
df2 = pd.merge(df2, ids, left_on=['id'], right_index=True)
df2 = pd.merge(df2, idas, left_on=['id','a'], right_index=True)
finally we use where from numpy to vectorise all your conditions and hence work much faster, then use pandas indexing to put everything back together efficiently, patching up missing values afterwards
df2['out'] = np.where(
    df2['ids'] == 1, 'New',
    np.where(df2['idas'] == 1, 'Unique', 'Repeat'))
df['out'] = df2['out']
df['out'].fillna('n/a', inplace=True)
hope some of that helps! for reference, the above runs in ~320ms for 500k rows on my laptop
I'd like to do some math on a series vector. I'd like to take the difference between two rows in a vector. My first intuition was:
def row_diff(prev, next):
    return next - prev
and then using it
my_col_vec.apply(row_diff)
but this doesn't do what I'd like. It appears apply is row-wise, which is fine, but I can't seem to find an equivalent operation that will allow me to easily create a new vector from the old one by subtracting the previous row from the next.
Is there a better way to do this? I've been reading this document and it doesn't look like it.
Thanks!
To calculate inter-row differences use diff:
In [6]:
df = pd.DataFrame({'a':np.random.rand(5)})
df
Out[6]:
a
0 0.525220
1 0.031826
2 0.260853
3 0.273792
4 0.281368
In [7]:
df['diff'] = df['a'].diff()
df
Out[7]:
a diff
0 0.525220 NaN
1 0.031826 -0.493394
2 0.260853 0.229027
3 0.273792 0.012940
4 0.281368 0.007576
Also, please try to avoid using apply, as there is usually a vectorised method available.
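For example, the same inter-row difference can be written as a vectorised subtraction of a shifted copy of the column, which is equivalent to diff:
# equivalent to df['a'].diff(): subtract the previous row's value from each row
df['diff'] = df['a'] - df['a'].shift(1)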