Can DataFrame use np.select after two DataFrames are combined? - python

I can use np.select to insert a new column and set its values for a single DataFrame.
But after I combine the two DataFrames, np.select no longer works; it seems to be an index error.
import pandas as pd
import numpy as np
df = pd.DataFrame([[3, 2, 1],[4, 5, 6]], columns=['col1','col2','col3'], index=['a','b'])
df2 = pd.DataFrame([[14, 15, 16],[17, 16, 15]], columns=['col1','col2','col3'], index=['c','e'])
count = df.append(df2)
print(count)
conditions = [
    (df["col1"] >= df["col2"]) & (df["col2"] >= df["col3"]),
]
choices = [100]
count["col4"] = np.select(conditions,choices, default='WHAT')
count
This works for a single DataFrame. After combining the two DataFrames, it fails with:
ValueError: Length of values does not match length of index

I think there is a typo in your code: the conditions reference df instead of count. The following code works fine.
import pandas as pd
import numpy as np
df = pd.DataFrame([[3, 2, 1],[4, 5, 6]], columns=['col1','col2','col3'], index=['a','b'])
df2 = pd.DataFrame([[14, 15, 16],[17, 16, 15]], columns=['col1','col2','col3'], index=['c','e'])
count = df.append(df2)
print(count)
conditions = [
    (count["col1"] >= count["col2"]) & (count["col2"] >= count["col3"]),
]
print(conditions)
choices = [100]
count["col4"] = np.select(conditions,choices, default='WHAT')
count
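A side note, not part of the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current installation the combination step itself fails before np.select is ever reached. A minimal sketch of the same fix using pd.concat instead:
import pandas as pd
import numpy as np
df = pd.DataFrame([[3, 2, 1], [4, 5, 6]], columns=['col1', 'col2', 'col3'], index=['a', 'b'])
df2 = pd.DataFrame([[14, 15, 16], [17, 16, 15]], columns=['col1', 'col2', 'col3'], index=['c', 'e'])
# pd.concat replaces the removed DataFrame.append
count = pd.concat([df, df2])
conditions = [
    (count["col1"] >= count["col2"]) & (count["col2"] >= count["col3"]),
]
choices = [100]
count["col4"] = np.select(conditions, choices, default='WHAT')
print(count)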

Related

A more efficient way to take samples from a pandas DataFrame

I have a piece of code like this:
import pandas as pd
data = {
    'col1': [17, 2, 3, 4, 5, 5, 10, 22, 31, 11, 65, 86],
    'col2': [6, 7, 8, 9, 10, 31, 46, 12, 20, 37, 91, 32],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
}
df = pd.DataFrame(data)
sampling_period = 3
abnormal_data = set()
for i in range(sampling_period):
    # get index of [0, 3, 6, 9, ...], [1, 4, 7, 10, ...], and [2, 5, 8, 11, ...]
    df_sampled = df[i::sampling_period]
    diff = df_sampled - df_sampled.shift(1)
    # columns where diff >= 5 are considered abnormal
    abnormal_df = df_sampled[diff >= 5].dropna(how="all", axis=1)
    abnormal_data = abnormal_data.union(set(abnormal_df.columns))
print(f"abnormal_data: {abnormal_data}")
The code above does the following:
Samples all the columns in df based on sampling_period.
If the difference between two consecutive elements in df_sampled is greater than or equal to 5, marks that column as abnormal.
Returns the abnormal columns.
Is there any way to avoid the for loop in the code?
The code takes a long time to run when sampling_period and df become large, and I would like it to run faster.
For example, when sampling_period is 60 and df.shape is (20040, 3562), it takes about 683 seconds to run the above code.
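One way to avoid the loop entirely, under the assumption that the check is exactly the one above: shifting each sampled slice by one row is the same as comparing each row of df with the row sampling_period rows earlier, so a single df.diff(sampling_period) call covers all the slices at once. A minimal sketch:
import pandas as pd
data = {
    'col1': [17, 2, 3, 4, 5, 5, 10, 22, 31, 11, 65, 86],
    'col2': [6, 7, 8, 9, 10, 31, 46, 12, 20, 37, 91, 32],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
}
df = pd.DataFrame(data)
sampling_period = 3
# df.diff(sampling_period) compares each row with the row sampling_period rows
# earlier, which is what df_sampled.shift(1) did inside the loop
diff = df.diff(periods=sampling_period)
# a column is abnormal if any of these differences is >= 5
abnormal_data = set(df.columns[(diff >= 5).any()])
print(f"abnormal_data: {abnormal_data}")
This drops both the Python-level loop and the repeated slicing, so it should scale much better as sampling_period grows.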

How to speed up (parallelize) a grouped row-wise rolling mean calculation?

I am calculating a grouped row-wise moving average on a large data set. However, the process takes too long on a single thread. How can I efficiently speed it up?
Please find a reproducible example below:
dataframe = pd.DataFrame({'id': range(2),
                          'group_id': range(2),
                          'Date_1_F1': [1, 2],
                          'Date_2_F1': [2, 4],
                          'Date_3_F1': [3, 6],
                          'Date_4_F1': [4, 8],
                          'Date_1_F2': [2, 11],
                          'Date_2_F2': [6, 13],
                          'Date_3-F2': [10, 15],
                          'Date_4_F2': [14, 17]})
dataframe
   id  group_id  Date_1_F1  ...  Date_2_F2  Date_3-F2  Date_4_F2
0   0         0          1  ...          6         10         14
1   1         1          2  ...         13         15         17
I have a function that returns the (row-wise) smoothed version of the dataset.
def smooth_ts(dataframe, ma_parameter=2):
    dataframe = (dataframe
                 .set_index(["id", "group_id"])
                 .groupby(lambda x: x.split("_")[-1], axis=1, group_keys=False)
                 .apply(lambda x: x.rolling(ma_parameter, axis=1)
                                   .mean()
                                   .dropna(axis=1, how='all')))
    dataframe.reset_index(inplace=True)
    return dataframe
smoothed_df = smooth_ts(dataframe)
Thank you very much
You could (1) melt your data frame using pd.melt, (2) create your grouping variable, and (3) sort it, group it, and aggregate with rolling(2).mean(). Then you can use df.pivot to reshape the result back into the required layout. In this approach there is an apply method that can be parallelized using swifter. Here is an example:
import pandas as pd
import numpy as np
import swifter
dataframe = pd.DataFrame({'id': range(2),
                          'group_id': range(2),
                          'Date_1_F1': [1, 2],
                          'Date_2_F1': [2, 4],
                          'Date_3_F1': [3, 6],
                          'Date_4_F1': [4, 8],
                          'Date_1_F2': [2, 11],
                          'Date_2_F2': [6, 13],
                          'Date_3-F2': [10, 15],
                          'Date_4_F2': [14, 17]})
df_melted = pd.melt(dataframe, id_vars=['id', 'group_id'])
# Use next line if you want to parallelize the apply method
# df_melted['groups'] = df_melted['variable'].str.split('_').swifter.apply(lambda v: v[-1])
df_melted['groups'] = df_melted['variable'].str.split('_').apply(lambda v: v[-1])
df_melted = df_melted.sort_values(['id', 'group_id', 'groups'])
df_tmp = df_melted.copy()
df_tmp['rolling_val'] = df_tmp.groupby(['id', 'group_id', 'groups'])['value'].rolling(2).mean().values
df_tmp.pivot(index=['id', 'group_id'], columns='variable', values='rolling_val').dropna(axis=1).reset_index().rename_axis(None, axis=1)
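Side note, not from the original answer: passing lists to the index and columns arguments of df.pivot requires a reasonably recent pandas (1.1 or later, as far as I know); on older versions you would need to set the index first and use unstack, or fall back to pivot_table.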
If you want to stick to your approach, you can accelerate it using the Pool object from the multiprocessing library, which parallelizes the mapping of a function over an iterable.
import pandas as pd
import numpy as np
from multiprocessing import Pool
dataframe = pd.DataFrame({'id': range(2),
                          'group_id': range(2),
                          'Date_1_F1': [1, 2],
                          'Date_2_F1': [2, 4],
                          'Date_3_F1': [3, 6],
                          'Date_4_F1': [4, 8],
                          'Date_1_F2': [2, 11],
                          'Date_2_F2': [6, 13],
                          'Date_3-F2': [10, 15],
                          'Date_4_F2': [14, 17]})
dataframe
def smooth_ts(dataframe, ma_parameter=2):
    dataframe = (dataframe
                 .set_index(["id", "group_id"])
                 .groupby(lambda x: x.split("_")[-1], axis=1, group_keys=False)
                 .apply(lambda x: x.rolling(ma_parameter, axis=1)
                                   .mean()
                                   .dropna(axis=1, how='all')))
    dataframe.reset_index(inplace=True)
    return dataframe

# 2 splits => corresponds to the number of chunks
id_chunks = np.array_split(dataframe.id.unique(), 2)
# list containing the chunked data frames
df_chunks = [dataframe[dataframe['id'].isin(i)] for i in id_chunks]
# apply smooth_ts to the list of data frames; two processes are used because
# there are only two chunks here, increase the pool size for more chunks
with Pool(2) as p:
    dfs_chunks = p.map(smooth_ts, df_chunks)
pd.concat(dfs_chunks).reset_index(drop=True)
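One caveat to add, not from the original answer: on platforms that start worker processes by spawning (Windows, and macOS by default), the Pool part has to live under a main guard so the module can be safely re-imported by the workers. A minimal sketch of the same steps with that guard:
if __name__ == "__main__":
    id_chunks = np.array_split(dataframe.id.unique(), 2)
    df_chunks = [dataframe[dataframe['id'].isin(i)] for i in id_chunks]
    with Pool(2) as p:
        dfs_chunks = p.map(smooth_ts, df_chunks)
    print(pd.concat(dfs_chunks).reset_index(drop=True))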

Selecting a row in pandas based on all its column values

I would like to locate a specific row (given all of its column values) within a pandas DataFrame.
My attempts so far:
df = pd.DataFrame(
    columns=["A", "B", "C"],
    data=[
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12],
    ])
# row to find (last one)
row = {"A" : 10, "B" : 11, "C" : 12}
# chain
idx = df[(df["A"] == 10) & (df["B"] == 11) & (df["C"] == 12)].index[0]
print(idx)
# iterative
mask = pd.Series([True] * len(df))
for k, v in row.items():
    mask &= (df[k] == v)
idx = df[mask].index[0]
print(idx)
# pandas series
for idx in df.index:
    print(idx, (df.iloc[idx, :] == pd.Series(row)).all())
Is there a simpler way to do that? Something like idx = df.find(row)?
This functionality is often needed for example to locate one specific sample in a time series. I cannot believe that there is no straightforward way to do that.
Do you simply want:
df[df.eq(row).all(axis=1)] #.index # if the indices are needed
output:
    A   B   C
3  10  11  12
Or, if you have more columns and want to ignore them for the comparison:
df[df[list(row)].eq(row).all(axis=1)]

Selecting a Range of Adjacent Columns for Dataframe

I do not understand how to essentially say: columns=[0:6, 12:15].
When I try this I get invalid syntax at the colon.
import pandas as pd
data = pd.read_excel(rf'C:\Users\dusti\Desktop\bulk export.xlsx',
                     sheet_name=1,
                     header=None)
df = pd.DataFrame(data,
                  columns=[0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15])
df.to_csv(rf'C:\Users\dusti\Desktop\bulk export1.csv',
          header=False,
          index=False)
print(df)
What you are trying to do is slicing, which is used to select a subset of a list.
You can use the range function to create the numbers and convert the result to a list with the list function:
list(range(0,6+1)) + list(range(12,15+1))
#output :
[0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15]
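Plugged back into the original snippet (a sketch, assuming the sheet really has at least 16 columns so the integer labels 0-6 and 12-15 all exist):
cols = list(range(0, 6 + 1)) + list(range(12, 15 + 1))
# with header=None the columns of `data` are the integers 0..N-1,
# so passing this list keeps only the columns we want
df = pd.DataFrame(data, columns=cols)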

Transpose multiple data in pandas

I have raw data containing the number of stores, spread over numerous pages, with no headers or column names.
Please see the sample below.
I want to transpose the data to this.
Can anyone help me figure out how to get the result I want?
import pandas as pd
# Creating the DataFrame
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [7, 2, 54, 3, None],
                   "C": [20, 16, 11, 3, 8],
                   "D": [14, 3, None, 2, 6]})
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']
df.index = index_
# Print the DataFrame
print(df)
# Return the transpose
result = df.transpose()
# Print the result
print(result)
