I am trying to compare the two DataFrames below with check_index_type set to False. According to the documentation, when it is set to False, assert_frame_equal shouldn't "check the Index class, dtype and inferred_type are identical". Did I misunderstand the documentation? How can I compare the frames while ignoring the index, so that the test below returns True?
I know I can reset the index, but I'd prefer not to.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.testing.assert_frame_equal.html
from pandas.testing import assert_frame_equal
import pandas as pd
d1 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 2])
d2 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 1])
assert_frame_equal(d1, d2, check_index_type=False)
AssertionError: DataFrame.index are different
DataFrame.index values are different (50.0 %)
[left]: Int64Index([0, 2], dtype='int64')
[right]: Int64Index([0, 1], dtype='int64')
If you really don't care about the index being equal, you can drop the index as follows:
assert_frame_equal(d1.reset_index(drop=True), d2.reset_index(drop=True))
The index is part of the DataFrame. If the indexes differ, the DataFrames should be considered different even when their values are the same. If you only want to compare the values, use array_equal from NumPy:
import numpy as np
d1 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 2])
d2 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 1])
np.array_equal(d1.values, d2.values)
Out[759]: True
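Note that np.array_equal compares only the values. A small sketch (my addition, not part of the original answer) if you also want the column labels, but not the index, to match:
np.array_equal(d1.to_numpy(), d2.to_numpy()) and d1.columns.equals(d2.columns)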
For more info, see the source of assert_frame_equal on GitHub.
For those who came to this question because they're interested in pd.testing.assert_series_equal (which operates on pd.Series): pandas 1.1.0 introduced a check_index argument:
import pandas as pd
s1 = pd.Series({"a": 1})
s2 = pd.Series({"b": 1})
pd.testing.assert_series_equal(s1, s2, check_index=False)
This argument does not yet exist for pd.testing.assert_frame_equal.
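As a stopgap, here is a minimal sketch (my own workaround, not an official pandas API) that applies the Series-level argument column by column, assuming both frames have the same columns in the same order:
import pandas as pd

d1 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 2])
d2 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 1])

for col in d1.columns:
    pd.testing.assert_series_equal(d1[col], d2[col], check_index=False)
This raises no AssertionError, because the index labels are ignored and only the values are compared.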
I am calculating a grouped, row-wise moving average on a large data set. However, the process takes too long on a single thread. How can I speed it up efficiently?
Please find a reproducible example below:
import pandas as pd

dataframe = pd.DataFrame({'id': range(2),
                          'group_id': range(2),
                          'Date_1_F1': [1, 2],
                          'Date_2_F1': [2, 4],
                          'Date_3_F1': [3, 6],
                          'Date_4_F1': [4, 8],
                          'Date_1_F2': [2, 11],
                          'Date_2_F2': [6, 13],
                          'Date_3-F2': [10, 15],
                          'Date_4_F2': [14, 17]})
dataframe
id group_id Date_1_F1 ... Date_2_F2 Date_3-F2 Date_4_F2
0 0 0 1 ... 6 10 14
1 1 1 2 ... 13 15 17
I have a function that returns the (row-wise) smoothed version of the dataset.
def smooth_ts(dataframe, ma_parameter=2):
    dataframe = (dataframe
                 .set_index(["id", "group_id"])
                 .groupby(lambda x: x.split("_")[-1], axis=1, group_keys=False)
                 .apply(lambda x: x.rolling(ma_parameter, axis=1)
                                   .mean()
                                   .dropna(axis=1, how='all')))
    dataframe.reset_index(inplace=True)
    return dataframe
smoothed_df = smooth_ts(dataframe)
Thank you very much
You could (1) melt your data frame with pd.melt, (2) create the grouping variable, and (3) sort, group, and aggregate with rolling(2).mean(). Then you can use df.pivot to bring the result back into the required shape. This approach contains an apply step that can be parallelized with swifter. Here is an example:
import pandas as pd
import numpy as np
import swifter
dataframe = pd.DataFrame({'id': range(2),
                          'group_id': range(2),
                          'Date_1_F1': [1, 2],
                          'Date_2_F1': [2, 4],
                          'Date_3_F1': [3, 6],
                          'Date_4_F1': [4, 8],
                          'Date_1_F2': [2, 11],
                          'Date_2_F2': [6, 13],
                          'Date_3-F2': [10, 15],
                          'Date_4_F2': [14, 17]})
df_melted = pd.melt(dataframe, id_vars=['id', 'group_id'])
# Use next line if you want to parallelize the apply method
# df_melted['groups'] = df_melted['variable'].str.split('_').swifter.apply(lambda v: v[-1])
df_melted['groups'] = df_melted['variable'].str.split('_').apply(lambda v: v[-1])
df_melted = df_melted.sort_values(['id', 'group_id', 'groups'])
df_tmp = df_melted.copy()
df_tmp['rolling_val'] = df_tmp.groupby(['id', 'group_id', 'groups'])['value'].rolling(2).mean().values
(df_tmp.pivot(index=['id', 'group_id'], columns='variable', values='rolling_val')
       .dropna(axis=1)
       .reset_index()
       .rename_axis(None, axis=1))
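Because rolling(2) leaves the first observation of each group as NaN, the Date_1_* columns of the pivoted result are entirely NaN and get removed by dropna(axis=1). Also note that 'Date_3-F2' contains a hyphen (presumably a typo), so split('_')[-1] puts it in its own '3-F2' group rather than in F2; rename it to 'Date_3_F2' if it is meant to be smoothed together with the other F2 columns.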
If you want to stick to your approach, you can accelerate it with the Pool object from the multiprocessing library, which maps a function over an iterable in parallel.
import pandas as pd
import numpy as np
from multiprocessing import Pool
dataframe = pd.DataFrame({'id': range(2),
                          'group_id': range(2),
                          'Date_1_F1': [1, 2],
                          'Date_2_F1': [2, 4],
                          'Date_3_F1': [3, 6],
                          'Date_4_F1': [4, 8],
                          'Date_1_F2': [2, 11],
                          'Date_2_F2': [6, 13],
                          'Date_3-F2': [10, 15],
                          'Date_4_F2': [14, 17]})
dataframe
def smooth_ts(dataframe, ma_parameter=2):
    dataframe = (dataframe
                 .set_index(["id", "group_id"])
                 .groupby(lambda x: x.split("_")[-1], axis=1, group_keys=False)
                 .apply(lambda x: x.rolling(ma_parameter, axis=1)
                                   .mean()
                                   .dropna(axis=1, how='all')))
    dataframe.reset_index(inplace=True)
    return dataframe
id_chunks = np.array_split(dataframe.id.unique(), 2) # 2 : number of splits => corresponds to number of chunks
df_chunks = [dataframe[dataframe['id'].isin(i)] for i in id_chunks] # list containing chunked data frames
# Apply smooth_ts to each chunked DataFrame in a separate process. Two workers
# are used because df_chunks contains only two DataFrames; for more chunks the
# pool size can be increased.
with Pool(2) as p:
    dfs_chunks = p.map(smooth_ts, df_chunks)
pd.concat(dfs_chunks).reset_index(drop=True)
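Note that splitting the rows by id keeps every row intact inside a single chunk, and since smooth_ts operates row-wise (the groupby and the rolling mean both act along the columns), processing the chunks in separate processes gives the same result as running on the full frame.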
So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3, 3, 2, 1], [4, 3, 6, 6, 3, 4], [7, 2, 9, 9, 2, 7]]),
                  columns=['a', 'b', 'c', 'a_select', 'b_select', 'c_select'])
df
Now, I may need to reorganize the dataframe (or use two) to accomplish this, but...
I'd like to select the 2 largest values among the '_select' columns in each row, and then take the mean of the corresponding base columns.
For example, row 1 would average the values from a and b, and row 2 the values from a and c (NOT the values from the _select columns themselves).
Currently I'm just iterating over each row; that is simple enough, but slow on a large dataset, and I can't figure out how to do the equivalent with an apply or lambda function (or whether it's even possible).
A simple one-liner using nlargest:
>>> df.filter(like='select').apply(lambda s: s.nlargest(2), 1).mean(1)
For performance, maybe numpy is useful:
>>> np.sort(df.filter(like='select').to_numpy(), 1)[:, -2:].mean(1)
To pull the values from the base columns instead, use argsort:
>>> arr = df.filter(like='select').to_numpy()
>>> df[['a', 'b', 'c']].to_numpy()[[[x] for x in np.arange(len(arr))],
np.argsort(arr, 1)][:, -2:].mean(1)
array([1.5, 5. , 8. ])
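A slightly more readable variant of the same fancy indexing (my suggestion, assuming NumPy >= 1.15 for np.take_along_axis):
>>> vals = df[['a', 'b', 'c']].to_numpy()
>>> np.take_along_axis(vals, np.argsort(arr, 1), 1)[:, -2:].mean(1)
array([1.5, 5. , 8. ])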
I am trying to apply the following function over each column in a dataframe:
from numpy import log, polyfit, sqrt, std, subtract

def hurst_lag(x):
    minlag = 200
    maxlag = 300
    lags = range(minlag, maxlag)
    tau = [sqrt(std(subtract(x.dropna()[lag:], x.dropna()[:-lag]))) for lag in lags]
    m = polyfit(log(lags), log(tau), 1)
    return m[0] * 2
The function only works on non-NA values, and in my DataFrame the columns end up with different lengths after applying dropna(), e.g.
import pandas as pd

df = pd.DataFrame({
    'colA': [None, None, 1, 2],
    'colB': [None, 2, 6, 4],
    'colC': [None, None, 2, 8],
    'colD': [None, 2.0, 3.0, 4.0],
})
Any ideas how to run the function over each column individually, excluding the NA values for that specific column? Many thanks
Use apply to run it on the DataFrame; each column is passed to the function as its own Series, so dropna() only removes that column's NA values:
df = df.apply(hurst_lag)
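A minimal end-to-end sketch (not the asker's exact data): the random-walk columns, the NaN placement, and the .to_numpy() conversion are my own additions. The conversion avoids index alignment when the two shifted slices are subtracted, and each column just needs more than maxlag non-NA observations:
import numpy as np
import pandas as pd

def hurst_lag(x, minlag=200, maxlag=300):
    x = x.dropna().to_numpy()          # drop only this column's NaNs
    lags = range(minlag, maxlag)
    tau = [np.sqrt(np.std(x[lag:] - x[:-lag])) for lag in lags]
    slope = np.polyfit(np.log(lags), np.log(tau), 1)[0]
    return slope * 2

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'colA': np.cumsum(rng.standard_normal(500)),
    'colB': np.cumsum(rng.standard_normal(500)),
})
df.iloc[:10, 0] = np.nan               # columns with different amounts of missing data
df.iloc[:25, 1] = np.nan

print(df.apply(hurst_lag))             # one Hurst estimate per column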
For a pandas dataframe of:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 1, 2, 1],
    'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
    'match_level_1': [np.nan, np.nan, 1, 1],
    'match_level_2': [np.nan, 1, 1, 1],
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to calculate the largest rows per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but it requires flattening the MultiIndex columns.
Instead I want to directly use,
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails with the key not being found.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)
What works for me is grouping the Series by the first level of the MultiIndex, although it does seem like a bug that it does not work the way you wrote it:
print (df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id match_level_0
1  1.0    55.0
2  1.0     8.0
Name: (anomaly_score, mean), dtype: float64
print (df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
I concatenated three DataFrames. How can I get df.index as a RangeIndex instead of an Int64Index?
My Input:
df = pd.concat([df1, df2, df3])
print(df.index)
My Output:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8,
9,
...
73809, 73810, 73811, 73812, 73813, 73814, 73815, 73816, 73817,
73818],
dtype='int64', length=495673)
Desired Output:
RangeIndex(start=X, stop=X, step=X)
You can use reset_index to get the desired index. For example:
df = pd.concat([df1, df2, df3])
df.index
Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype='int64')
After resetting indices:
df.reset_index(inplace=True)
df.index
RangeIndex(start=0, stop=9, step=1)
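Side note: reset_index() without drop=True also keeps the old index as a new index column; use df.reset_index(drop=True, inplace=True) if all you want is the fresh RangeIndex.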
It is also good practice to pass the axis keyword explicitly to concat.
You can use the built-in ignore_index option:
df = pd.concat([df1, df2, df3],ignore_index=True)
print(df.index)
From the docs:
ignore_index : boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
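With ignore_index=True, the concatenated result gets a brand-new RangeIndex, so for the data above print(df.index) should show something like RangeIndex(start=0, stop=495673, step=1) with no further resetting needed.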