I have an interesting problem, which I have fixed on a surface level, but I would like to enhance and improve my implementation.
I have a DataFrame, which holds a dataset for later Machine Learning. It has feature columns (~500 of them) and 4 columns of targets. The targets are related to each other, in an increasing granularity fashion (e.g. fault/no_fault, fault-where, fault-group, fault-exact).
The DataFrame has quite a lot of NaN values, since it was compiled from 2 separate datasets via an OUTER join - some rows are complete, while others have data from one dataset but not the other.
Anyway, Scikit-Learn's SimpleImputer() transformer did not give me the ML results I was after, and I figured that maybe I should do the imputation based on the targets: e.g. compute a median value from the samples available for each target class in each column, and impute that. Then check whether any NaN values are left, and if there are, move to tar_3 (one level of granularity down), compute the medians there as well, and impute those per target, per column. And so on, until no NaNs are left.
I have implemented that with the code below, which I fully understand is clunky and takes forever to execute:
tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']

for tar in tar_list:
    medians = df.groupby(by = tar).agg('median')
    print("\nFilling values based on {} column granularity.".format(tar))
    for col in [col for col in df.columns if col not in tar_list]:
        print(col)
        uniques = sorted(df[tar].unique())
        for class_name in uniques:
            value_to_fill = medians.loc[class_name][col]
            print("Setting NaNs for target {} in column {} to {}".format(class_name, col, value_to_fill))
            df.loc[df[tar] == class_name, col] = df.loc[df[tar] == class_name, col].fillna(value = value_to_fill)
        print()
While I am happy with the result this code produces, it has 2 drawbacks, which I cannot ignore:
1) It takes forever to execute even on my small ~1000 samples x ~500 columns dataset.
2) It imputes the same median value for all NaNs in each column per target value it is currently working on. I would prefer it to impute values with a bit of noise, to prevent simply repeating the same value (maybe a value randomly drawn from a normal distribution of that column's values for that target?).
As far as I am aware, there are no out-of-the-box tools in Scikit-Learn or Pandas to achieve this task more efficiently. However, if there are - can someone point me in the right direction? Alternatively, I am open to suggestions on how to enhance this code to address both my concerns.
UPDATE:
Code generating sample DataFrame I mentioned:
import numpy as np
import pandas as pd

vsize = 100  # sample size; any positive integer works here

df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                  columns = ["col_{}".format(x) for x in range(10)],
                  index = range(0, vsize * 3, 3))

df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                    columns = ["col_{}".format(x) for x in range(10, 20, 1)],
                    index = range(0, vsize * 2, 2))

df = df.merge(df_2, left_index = True, right_index = True, how = 'outer')

df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)],
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)],
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)],
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})

df = df.merge(df_tar, left_index = True, right_index = True, how = 'inner')
Try this:
tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [col for col in df.columns if col not in tar_list]

# since your dataframe may not have a continuous index
idx = df.index

for tar in tar_list:
    medians = df[cols].groupby(by = df[tar]).agg('median')
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(medians[col])
    df.reset_index(inplace=True)

df.index = idx
Took about 1.5s with the sample data:
np.random.seed(2019)
len_df = 1000
num_cols = 500

df = pd.DataFrame(np.random.choice(list(range(10)) + [np.nan],
                                   size=(len_df, num_cols),
                                   p=[0.05] * 10 + [0.5]),
                  columns=[str(x) for x in range(num_cols)])

for i in range(1, 5):
    np.random.seed(i)
    df[f'tar_{i}'] = np.random.randint(i * 4, (i + 1) * 4, len_df)
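To also address the second concern from the question (avoiding identical imputed values per group), the fill values could be drawn from each group's own distribution instead of a single median. A minimal sketch of that idea, assuming the same df, cols and tar_list as above, and using a normal distribution parameterized by each group's mean and standard deviation (the choice of distribution is an assumption, not something prescribed by the question):

import numpy as np

rng = np.random.default_rng(2019)

for tar in tar_list:
    group_stats = df.groupby(tar)[cols].agg(['mean', 'std'])
    for col in cols:
        for class_name, stats in group_stats[col].iterrows():
            # rows of this target class that are still missing in this column
            mask = (df[tar] == class_name) & df[col].isna()
            n_missing = mask.sum()
            if n_missing == 0 or np.isnan(stats['mean']):
                continue
            # fall back to zero spread when the group has a single observation
            std = 0.0 if np.isnan(stats['std']) else stats['std']
            df.loc[mask, col] = rng.normal(stats['mean'], std, size=n_missing)

As with the median version, this runs from the finest target down to the coarsest, so a class with no observations at one level gets filled at a coarser one.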
I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:
...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...
I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe for each row. In other words, unlike this SO question, I do not want the rows of the new columns to have the same initial value (col 1 all NaN, col 2 all "dogs", etc.).
Right now, I can add my columns by doing something like:
df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis = 1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis = 1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis = 1)
df['duration'] = df.apply(lambda row: calc_dur(row['started_at'], row['ended_at']), axis = 1)
But it seems inefficient since the entire dataset is processed N times (once for each call).
It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.
Examples:
def calc_dur(started_at, ended_at):
    # started_at, ended_at are datetime64[ns]; converted at csv load
    diff = ended_at - started_at
    return diff.total_seconds() / 60

def add_start_time(started_at):
    # started_at is datetime64[ns]; converted at csv load
    return started_at.time()

def add_is_dark(started_at):
    # tz is pytz.timezone('US/Central')
    # chi_town is the astral lookup for Chicago
    st = started_at.replace(tzinfo=TZ)
    chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
    return st >= chk['dusk'] or st <= chk['dawn']
Update 1
Following the information from MoRe, I was able to get the essentials working. I needed to augment the answer by adding the column names, and to specify the index in the merge.
data = pd.Series(df.apply(lambda x: [
    add_start_time(x['started_at']),
    add_is_dark(x['started_at']),
    yrmo(x['year'], x['month']),
    calc_duration_in_minutes(x['started_at'], x['ended_at']),
    add_start_cat(x['started_at'])
], axis = 1))

new_df = pd.DataFrame(data.tolist(),
                      data.index,
                      columns=['start_time', 'is_dark', 'yrmo',
                               'duration', 'start_cat'])

df = df.merge(new_df, left_index=True, right_index=True)
import pandas as pd

data = pd.Series(dataframe.apply(lambda x: [function1(x[column_name]),
                                            function2(x[column_name]),
                                            function3(x[column_name])], axis = 1))
pd.DataFrame(data.tolist(), data.index)
If I understood your meaning correctly, this is your answer. But before everything else, please use the swifter pip package :)
First create a Series of lists and convert it to columns...
swifter is a simple library (at least I think it is simple) that has only one useful method: apply
import swifter
data.swifter.apply(lambda x: x+1)
It uses parallelism to improve speed on large datasets... on small ones it doesn't help, and can even be worse.
https://pypi.org/project/swifter/
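As a side note on the original question: the purely datetime-based columns (start_time, duration) can be computed without any row-wise apply at all. A minimal vectorized sketch, assuming started_at and ended_at are already datetime64 columns as stated; add_is_dark and add_start_cat depend on per-row lookups, so they may still need apply or swifter:

# assumes df already has datetime64 columns 'started_at' and 'ended_at'
df['start_time'] = df['started_at'].dt.time
df['duration'] = (df['ended_at'] - df['started_at']).dt.total_seconds() / 60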
Currently I have a dataframe that I rank the values of each column and output them into a new dataframe. Example code below:
df = pd.DataFrame(np.random.randint(0, 500, size=(500, 1000)), columns=list(range(0, 1000)))
ranking = pd.DataFrame(range(0, 500), columns=['Lineup'])
ranking = pd.concat([ranking, df[range(0, 1000)].rank(ascending=False, method='min')],
                    axis=1)
df is the dataframe of values; each column header is an integer, increasing by 1 for each successive column. ranking is first created with a single identifier column, "Lineup", and then the dataframe df is concatenated and ranked at the same time.
Now the question is: is this the fastest way to do it? When there are tens of thousands of columns and hundreds of rows, this can take far longer than I hoped it would. Is there a way to use a list comprehension to speed this up, or some other method that outputs a list, dictionary, dataframe or anything else I can use for my future steps?
Thanks
You can use the Numba JIT to compute this more efficiently and in parallel. The idea is to compute the rank of each column in parallel. Here is the resulting code:
import numba as nb
import numpy as np
import pandas as pd

# Equivalent of df.rank(ascending=False, method='min')
@nb.njit('int32[:,:](int32[:,:])', parallel=True)
def fastRanks(df):
    n, m = df.shape
    res = np.empty((n, m), dtype=np.int32)

    for col in nb.prange(m):
        dfCol = -df[:, col]
        order = np.argsort(dfCol)

        # Compute the ranks with the min method
        if n > 0:
            prevVal = dfCol[order[0]]
            prevRank = 1
            res[order[0], col] = 1

            for row in range(1, n):
                curVal = dfCol[order[row]]
                if curVal == prevVal:
                    res[order[row], col] = prevRank
                else:
                    res[order[row], col] = row + 1
                    prevVal = curVal
                    prevRank = row + 1

    return res

df = pd.DataFrame(np.random.randint(0, 500, size=(500, 1000)), columns=list(range(0, 1000)))
ranking = pd.DataFrame(range(0, 500), columns=['Lineup'])
# cast to int32 so the input matches the njit signature
ranking = pd.concat([ranking, pd.DataFrame(fastRanks(df[range(0, 1000)].to_numpy(dtype=np.int32)))],
                    axis=1)
On my 6-core machine, the computation of the ranks is about 7 times faster. The overall computation is bounded by the slow pd.concat.
You can further improve the speed of the overall computation by building the output of fastRanks directly with the "Lineup" column. The names of the dataframe columns have to be set manually from the NumPy array produced by the Numba function. Note that this optimization requires all the columns to be of the same type, which is the case in your example.
Note that the ranks are of type int32 in this solution for the sake of performance (float64 is not needed here).
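A possible sketch of that last suggestion, assuming the same fastRanks and df as above and that the "Lineup" values are simply 0..499 as in the question, with the column names set manually from the NumPy output:

lineup = np.arange(0, 500, dtype=np.int32).reshape(-1, 1)
ranks = fastRanks(df[range(0, 1000)].to_numpy(dtype=np.int32))
out = np.hstack([lineup, ranks])  # every block is int32, so no dtype widening occurs
ranking = pd.DataFrame(out, columns=['Lineup'] + list(range(0, 1000)))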
My goal is to group n records by 4, say for example:
0-3
4-7
8-11
etc.
find the max() value of each group of 4 based on one column among the other columns, and create a new dataset or csv file. The max() operation would be performed on that one column, while the other columns remain as they are.
Based on the research I have done here (Stack Overflow), I have tried to customize and apply the following solution to my dataset, but it wasn't giving me what I expected:
# Group by every 4 rows until len(dataset)
groups = dataset.groupby(pd.cut(dataset.index, range(0, len(dataset), 3)))
needataset = groups.max()
I'm getting results similar to the following:
Column 1 Column 2 ... Column n
0. (0,3]
1. (3,6]
The targeted column for the max() operation also did not produce the expected result.
I would appreciate any guidance on tackling the problem.
This example should help you. Here I create a DataFrame of some random values between 0 and 100 and bin those values into the ranges 0 - 4, 5 - 9, and so on (sort_values is really important; it will make your life easier):
df = pd.DataFrame({'value': np.random.randint(0, 100, 5)})
df = df.sort_values(by='value')
labels = ["{0} - {1}".format(i, i + 4) for i in range(0, 100, 5)]
df['group'] = pd.cut(df.value, range(0, 105, 5), right=False, labels=labels)
groups = df["group"].unique()
Then I create an array for max values
max_vals = np.zeros((len(groups)))
for i, group in enumerate(groups):
    max_vals[i] = max(df[df["group"] == group]["value"])
And then a DataFrame out of those max values
pd.DataFrame({"group": groups, "max value": max_vals})
I have two large dataframes I want to compare. I want a comparison result that gives column- and/or row-wise similarity as a percentage. This part is simple. However, I want to be able to make the comparison ignore differences based upon value criteria. A small example is below.
d1 = {'Sample': pd.Series([101, 102, 103]),
      'Col1': pd.Series(['AA', '--', 'BB']),
      'Col2': pd.Series(['AB', 'AA', 'BB'])}
d2 = {'Sample': pd.Series([101, 102, 103]),
      'Col1': pd.Series(['BB', 'AB', '--']),
      'Col2': pd.Series(['AB', 'AA', 'AB'])}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1 = df1.set_index('Sample')
df2 = df2.set_index('Sample')
comparison = df1.eq(df2)
# for column stats
comparison.sum(axis=0) / float(len(df1.index))
# for row stats
comparison.sum(axis=1) / float(len(df1.columns))
My problem is that when, for example, value1 = 'AA' and value2 = '--', I want them to be viewed as equal (so whenever one is '--', the comparison is basically always true), but otherwise perform a normal Boolean comparison. I need an efficient way to do this that doesn't involve excessive looping, as the datasets are quite large.
Below, I'm interpreting "when one is '--' basically always be true" to mean that any comparison against '--' (no matter what the other value is) should return True. In that case, you could use
mask = (df1=='--') | (df2=='--')
to find every location where either df1 or df2 is equal to '--' and then use
comparison |= mask
to update comparison. For example,
import itertools as IT
import numpy as np
import pandas as pd

np.random.seed(2015)
N = 10000
df1, df2 = [pd.DataFrame(
    np.random.choice(list(map(''.join, IT.product(list('ABC'), repeat=2))) + ['--'],
                     size=(N, 2)),
    columns=['Col1', 'Col2']) for i in range(2)]
comparison = df1.eq(df2)
mask = (df1=='--') | (df2=='--')
comparison |= mask
# for column stats
column_stats = comparison.sum(axis=0) / float(len(df1.index))
# for row stats
row_stats = comparison.sum(axis=1) / float(len(df1.columns))
I think a list comprehension should be quite fast:
new_columns = []
for col in df1.columns:
    new_columns.append([True if (x==y or x=='--' or y=='--') else False
                        for x, y in zip(df1[col], df2[col])])
results = pd.DataFrame(new_columns).T
results.index = df1.index
This outputs the full true/false df.
I've just started working with Pandas and I am trying to figure out whether it is the right tool for my problem.
I have a dataset:
date, sourceid, destid, h1..h12
I am basically interested in the sum of each of the H1..H12 columns, but I need to exclude multiple ranges from the dataset.
Examples would be to:
exclude H4, H5, H6 data where sourceid = 4944
exclude H8, H9-H12 where destination = 481981
... and this can go on for many, many filters, as we are constantly removing data to get close to our final model.
I think I saw in a solution that I could build a list of the filters I would want and then create a function to test against, but I haven't found a good example to work from.
My initial thought was to create a copy of the df and just remove the data we didn't want, and, if we needed it back, we could just copy it back in from the original df, but that seems like the wrong road.
By using masks, you don't have to remove data from the dataframe. E.g.:
mask1 = df.sourceid == 4944
var1 = df[mask1][['H4', 'H5', 'H6']].sum()
Or directly do:
var1 = df[df.sourceid == 4944][['H4', 'H5', 'H6']].sum()
In case of multiple filters, you can combine the Boolean masks with Boolean operators:
totmask = mask1 & mask2
You can use DataFrame.loc[] to set the data to zeros.
Create a dummy DataFrame first:
N = 10000
df = pd.DataFrame(np.random.rand(N, 12),
                  columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N + 1)])
df["sourceid"] = np.random.randint(0, 50, N)
df["destid"] = np.random.randint(0, 50, N)
Then for each of your filters you can call:
df.loc[df.sourceid == 10, "h4":"h6"] = 0
Since you have 600k rows, creating a mask array with df.sourceid == 10 may be slow. You can instead create Series objects that map each value to the index of the DataFrame:
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()
and then exclude h4,h5,h6 where sourceid == 10 by:
df.loc[sourceid[10], "h4":"h6"] = 0
to find row ids where sourceid == 10 and destid == 20:
np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
to find row ids where 10 <= sourceid <= 12 and 3 <= destid <= 5:
np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)
sourceid and destid are Series with duplicated index values. When the index values are sorted, Pandas uses searchsorted to find the index; that is O(log N), which is faster than creating mask arrays, which is O(N).
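Putting the pieces together, the row ids returned by intersect1d can be passed straight back to .loc to zero out the chosen columns (same dummy DataFrame and lookup Series as above):

rows = np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
df.loc[rows, "h4":"h6"] = 0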