Averages of DataFrame columns in Python

I am unable to comment on the original question as I don't have a high enough reputation, but I refer to this question DataFrames - Average Columns, specifically this line of code:
dfgrp= df.iloc[:,2:].groupby((np.arange(len(df.iloc[:,2:].columns)) // 2) + 1, axis=1).mean().add_prefix('ColumnAVg')
As I read it: take all rows from column 2 onwards, group the columns (not the rows) by a numbering derived from how many of those columns there are, take the mean of each column group, then add the results as new columns called ColumnAVg1/2/3 etc.
I also know this takes the mean of columns 1&2, 3&4, 5&6 etc., but I don't know how it does that.
And so my question is, what needs to change in the above code to get the mean of columns 1&2, 2&3, 3&4, 4&5 etc. with the results in the same format?

df = pd.DataFrame(np.random.randn(2, 4), columns=['a', 'b', 'c', 'd'])
groups = [(1,2),(2,3),(2,3,4),(1,3)]
df2 = pd.DataFrame([df.iloc[:, i - 1] for z in groups for i in z]).T
labels = [str(z) for z in groups for _ in z]
result = df2.groupby(by=labels, axis=1).mean()
Probably not what you were looking for but something like this should work.

So unfortunately you cannot alter that code to get your result, because it achieves what it does by assigning a group number to each column and grouping them together. However, you can do something cheeky: provide two groupings, get the average for each grouping, and combine them into a single frame.
df = pd.DataFrame(np.random.randn(2, 4), columns=['a', 'b', 'c', 'd'])
d1 = df.groupby((np.arange(len(df.columns)) // 2), axis=1).mean()
d2 = df.groupby((np.arange(len(df.columns) + 1) // 2)[1:], axis=1).mean()
dfo = pd.DataFrame()
for i in range(len(df.columns) - 1):
    c = f'average_{df.columns[i]}_{df.columns[i+1]}'
    if i % 2 == 0:
        dfo[c] = d1[d1.columns[i // 2]]        # // rather than /: column positions must be integers
    else:
        dfo[c] = d2[d2.columns[(i + 1) // 2]]
What the original code did is assign columns 1,2,3,4 to groups 1,1,2,2. So in our code, d1 is grouped according to 0,0,1,1 and d2 according to 0,1,1,2. The for loop combines the results.
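If all you want are the means of the overlapping pairs a&b, b&c, c&d in one go, here is a minimal alternative sketch (not taken from the answers above) that uses a rolling window of 2 across the columns by transposing:
import numpy as np
import pandas as pd
# Means of overlapping column pairs via a rolling window over the transposed frame.
df = pd.DataFrame(np.random.randn(2, 4), columns=['a', 'b', 'c', 'd'])
pairs = df.T.rolling(2).mean().T.iloc[:, 1:]   # first column is all-NaN, drop it
pairs.columns = [f'average_{a}_{b}' for a, b in zip(df.columns[:-1], df.columns[1:])]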

Related

Pandas filter smallest by group

I have a data frame that has the following format:
d = {'id1': ['a', 'a', 'b', 'b',], 'id2': ['a', 'b', 'b', 'c'], 'score': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
print(df)
id1 id2 score
0 a a 1
1 a b 2
3 b b 3
4 b c 4
The data frame has over 1 billion rows; it represents pairwise distance scores between objects in columns id1 and id2. I do not need all object pair combinations: for each object in id1 (there are about 40k unique ids) I only want to keep the top 100 closest (smallest) distance scores.
The code I'm running to do this is the following:
df = df.groupby(['id1'])['score'].nsmallest(100)
The issue with this code is that I run into a memory error each time I try to run it
MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64
I'm assuming it is because in the background pandas is now creating a new data frame for the result of the group by, but the existing data frame is still held in memory.
The reason I am only taking the top 100 for each id is to reduce the size of the data frame, but it seems that while doing so I am actually taking up more space.
Is there a way I can go about filtering this data down but not taking up more memory?
The desired output would be something like this (assuming top 1 instead of top 100)
id1 id2 score
0 a a 1
1 b b 3
Some additional info about the original df:
df.count()
permid_1 1144468900
permid_2 1144468900
distance 1144468900
dtype: int64
df.dtypes
permid_1      int64
permid_2      int64
distance    float64
dtype: object
df.shape
(1144468900, 3)
id1 & id2 unique value counts: 33,830
I can't test this code, lacking your data, but perhaps try something like this:
indicies = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindicies = np.argsort(scores.values)[:100]  # numpy is raw index only
    min_indicies = scores.iloc[min_subindicies].index  # convert to pandas indicies
    indicies.extend(min_indicies)
df = df.loc[indicies]
Descriptively: for each unique ID (the_id), extract the matching scores. Then find the raw positions of the smallest 100, map those raw positions to the Pandas index, and save the resulting index labels to your list. At the end, subset on the collected index.
iloc does take a list input. some_series.iloc aligns with some_series.values, which is what allows this to work. Storing indices indirectly like this should make this substantially more memory-efficient.
df['score'][df['id1'] == the_id] should work more efficiently than df.loc[df['id1'] == the_id, 'score']: instead of taking the whole data frame and masking it, it takes only the score column and masks it for matching IDs. You may want to del scores at the end of each loop iteration if you want to free memory sooner.
You can try the following:
df.sort_values(["id1", "scores"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])
df = df.loc[df["dummy_key"]]
1. You sort ascending (smallest on top), first by the grouping column, then by score.
2. You add a column indicating whether the current id1 differs from the id1 100 rows back (if it does not, the row is 101st or later within its group).
3. You filter by the column from step 2.
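A tiny sketch of how that shift trick behaves, on made-up data and keeping the 2 smallest scores per id instead of 100:
import pandas as pd
# Made-up data; after sorting, shift(2) compares each row's id1 with the id1
# two rows earlier, so rows 3+ within a group get False and are dropped.
toy = pd.DataFrame({'id1': list('aaabbb'), 'score': [3, 1, 2, 9, 7, 8]})
toy.sort_values(['id1', 'score'], inplace=True)
toy['dummy_key'] = toy['id1'].shift(2).ne(toy['id1'])
toy = toy.loc[toy['dummy_key']].drop(columns='dummy_key')
# keeps scores 1, 2 for 'a' and 7, 8 for 'b'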
As Aryerez outlined in a comment, you can do something along the lines of:
closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by = 'score').head(100)
                     for id1 in set(df['id1'])])
You could also do
def get_hundredth(id1):
    sub_df = df.loc[df['id1'] == id1].sort_values(by = 'score')
    return sub_df.iloc[99]['score']  # the 100th-smallest score (position 99)
hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}
def check_distance(row):
    return row['score'] <= hundredth_dict[row['id1']]
closest = df.loc[df.apply(check_distance, axis = 1)]
Another strategy would be to look at how filtering out distances past a threshold affects the dataframe. That is, take
low_scores = df.loc[df['score']<threshold]
Does this significantly decrease the size of the dataframe for some reasonable threshold? You'd need a threshold that makes the dataframe small enough to work with, but leaves the lowest 100 scores for each id1.
You also might want to look into what sort of optimization you can do given your distance metric. There are probably algorithms out there specifically for cosine similarity.
For the given shape (1144468900, 3) with only 33,830 unique values, the id1 and id2 columns are good candidates for the categorical data type: converting them stores each value as a small integer code pointing into the 33,830 unique categories, which substantially reduces the memory footprint of those two columns. Then perform any aggregation you want.
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
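A rough sketch (on made-up, much smaller data) of how to check the effect of that conversion yourself with memory_usage:
import numpy as np
import pandas as pd
# Compare memory before and after converting the id columns to categoricals.
n = 1_000_000
toy = pd.DataFrame({'id1': np.random.randint(0, 33_830, n),
                    'id2': np.random.randint(0, 33_830, n),
                    'score': np.random.rand(n)})
before = toy.memory_usage(deep=True).sum()
toy[['id1', 'id2']] = toy[['id1', 'id2']].astype('category')
after = toy.memory_usage(deep=True).sum()
print(before, after)   # the id columns now store compact integer codes
out = toy.groupby('id1', observed=True)['score'].nsmallest(100)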

Pandas & fillna based on groups

I have an interesting problem, which I have fixed on a surface level, but I would like to enhance and improve my implementation.
I have a DataFrame, which holds a dataset for later Machine Learning. It has feature columns (~500 of them) and 4 columns of targets. The targets are related to each other, in an increasing granularity fashion (e.g. fault/no_fault, fault-where, fault-group, fault-exact).
The DataFrame has quite a lot of NaN values, since it was compiled of 2 separate data sets via OUTER join - some rows are full, others have data from one dataset, but not the other etc. - see pic below, and sorry for terrible edits.
Anyway, Scikit-Learn's SimpleImputer() transformer did not give me the ML results I was after, and I figured that maybe I should do the imputation based on the targets: e.g. compute a median value from the samples available for each target in each column, and impute that. Then check whether any NaN values are left, and if there are, move to tar_3 (one level of granularity down), compute the median there as well, and impute that value per target, per column. And so on, until no NaNs are left.
I have implemented that with the code below, which I fully understand is clunky and takes forever to execute:
tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
for tar in tar_list:
    medians = df.groupby(by = tar).agg('median')
    print("\nFilling values based on {} column granularity.".format(tar))
    for col in [col for col in df.columns if col not in tar_list]:
        print(col)
        uniques = sorted(df[tar].unique())
        for class_name in uniques:
            value_to_fill = medians.loc[class_name][col]
            print("Setting NaNs for target {} in column {} to {}".format(class_name, col, value_to_fill))
            df.loc[df[tar] == class_name, col] = df.loc[df[tar] == class_name, col].fillna(value = value_to_fill)
    print()
While I am happy with the result this code produces, it has 2 drawbacks which I cannot ignore:
1) It takes forever to execute, even on my small ~1000 samples x ~500 columns dataset.
2) It imputes the same median value to all NaNs in each column for the target value it is currently working on. I would prefer it to impute something with a bit of noise, to prevent simple repetition of the data (maybe a value randomly drawn from a normal distribution fitted to the values available in that column for that target?).
As far as I am aware, there are no out-of-the-box tools in Scikit-Learn or Pandas to achieve this task more efficiently. However, if there are, can someone point me in the right direction? Alternatively, I am open to suggestions on how to enhance this code to address both of my concerns.
UPDATE:
Code generating sample DataFrame I mentioned:
df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                  columns = ["col_{}".format(x) for x in range(10)],
                  index = range(0, vsize * 3, 3))
df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                    columns = ["col_{}".format(x) for x in range(10, 20, 1)],
                    index = range(0, vsize * 2, 2))
df = df.merge(df_2, left_index = True, right_index = True, how = 'outer')
df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)],
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)],
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)],
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})
df = df.merge(df_tar, left_index = True, right_index = True, how = 'inner')
Try this:
tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [col for col in df.columns if col not in tar_list]
# since your dataframe may not have continuous index
idx = df.index
for tar in tar_list:
    medians = df[cols].groupby(by = df[tar]).agg('median')
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(medians[col])
    df.reset_index(inplace=True)
df.index = idx
Took about 1.5s with the sample data:
np.random.seed(2019)
len_df=1000
num_cols = 500
df = pd.DataFrame(np.random.choice(list(range(10)) + [np.nan],
                                   size=(len_df, num_cols),
                                   p=[0.05]*10 + [0.5]),
                  columns=[str(x) for x in range(num_cols)])
for i in range(1, 5):
    np.random.seed(i)
    df[f'tar_{i}'] = np.random.randint(i*4, (i+1)*4, len_df)
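That answer addresses the speed concern. For the question's second concern (imputing with a bit of noise rather than a constant median), here is a minimal sketch, assuming per-target normally distributed noise and the column naming of the sample data above; it is an illustration rather than a drop-in replacement:
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
def noisy_fill(s):
    # s is one feature column restricted to one target group
    if s.isna().any() and s.notna().sum() > 1:
        s = s.copy()
        s[s.isna()] = rng.normal(s.mean(), s.std(), size=s.isna().sum())
    return s
cols = [c for c in df.columns if not c.startswith('tar_')]
df[cols] = df.groupby('tar_4')[cols].transform(noisy_fill)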

Why do loc and iloc work differently for slicing rows of a pandas DataFrame?

I want a DataFrame where the top rows of one column (called 'cat') have the value "LOW", and the middle and bottom parts of the frame have the values "MID" and "HI". So, for a frame of 1,200 rows, the value counts for the cat column should be:
LOW 400
MID 400
HI 400
This should be easy. But, apparently it is not really. To no avail I tried to select and change the bottom rows using df.loc[-400:,["cat"]] = "HI"
But, this approach does work for the top-rows: df.loc[:399,["cat"]] = "LOW"
The sample below shows a working example, and note that it requires both loc and iloc. Is this where pandas can improve?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.loc[:399,["cat"]] = "LOW"
df.iloc[-400:,-1] = "HI" # The -1 selects the last column ('cat') - not ideal.
df.cat.value_counts()
Use get_loc to get the position of column cat if you want to select by position with iloc - you need positions for both the rows and the columns:
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.iloc[:400,df.columns.get_loc('cat')] = "LOW"
df.iloc[-400:,df.columns.get_loc('cat')] = "HI"
Detail:
print (df.columns.get_loc('cat'))
4
An alternative is to use loc to select by labels - then you need to pick the corresponding 400 index labels by slicing df.index:
df.loc[df.index[:400],"cat"] = "LOW"
df.loc[df.index[-400:],"cat"] = "HI"
a = df.cat.value_counts()
print (a)
MID 400
HI 400
LOW 400
Name: cat, dtype: int64
Other ways to set the 400-row blocks are numpy.repeat or repeated lists:
df["cat"] = np.array(["LOW", "MID", "HI"]).repeat(400)
df["cat"] = ["LOW"] * 400 + ["MID"] * 400 + ["HI"] * 400
# thanks @Quickbeam2k1
df = df.assign(cat = ['LOW']*400 + ['MID']*400 + ['HI']*400)
Answering the question of whether pandas can improve here:
In the documentation it's clearly stated what loc is doing:
.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.
So -400 is simply not a label in your index; the behavior is as intended.
What one often wants is an accessor for iloc-based row access combined with loc-based column access. For this, the .get_loc function comes into play.
You could also use the deprecated .ix indexer; however, its behavior caused some confusion. See examples and methods using the .loc and .iloc accessors here.
Essentially, @Jezrael's solutions are also found in the link above.
To summarize: pandas had a solution to your problem in place, but it confused users, so in order to provide a more consistent API it was decided to remove that feature in the future.
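A small sketch of what likely went wrong with the original attempt, assuming the default RangeIndex: label slicing on a monotonic index starts from where the missing label -400 would fall, i.e. the very beginning, so the assignment quietly touches every row instead of raising:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.loc[-400:, ["cat"]] = "HI"          # -400 is not a label; the label slice starts at row 0
print(df["cat"].value_counts())        # likely shows HI for all 1200 rows, not the bottom 400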

How to group records with Pandas cut()?

My goal is to group n records by 4, say for example:
0-3
4-7
8-11
etc.
find the max() value of each group of 4 based on one column, and create a new dataset or CSV file. The max() operation would be performed on that one column, while the other columns remain as they are.
Based on the research I have done here (Stack Overflow), I have tried to customize and apply the following solution to my dataset, but it wasn't giving me what I expected:
# Group by every 4 rows until len(dataset)
groups = dataset.groupby(pd.cut(dataset.index, range(0, len(dataset), 3)))
needataset = groups.max()
I'm getting results similar to the following:
Column 1 Column 2 ... Column n
0. (0,3]
1. (3,6]
The targeted column for the max() operation also did not produce the expected result.
I will appreciate any guide to tackling the problem.
This example should help you. Here I create a DataFrame of some random values between 0 and 100 and bin them into ranges of 5, labelled '0 - 4', '5 - 9', and so on (sort_values is really important; it will make your life easier):
df = pd.DataFrame({'value': np.random.randint(0, 100, 5)})
df = df.sort_values(by='value')
labels = ["{0} - {1}".format(i, i + 4) for i in range(0, 100, 5)]
df['group'] = pd.cut(df.value, range(0, 105, 5), right=False, labels=labels)
groups = df["group"].unique()
Then I create an array for max values
max_vals = np.zeros(len(groups))
for i, group in enumerate(groups):
    max_vals[i] = max(df[df["group"] == group]["value"])
And then a DataFrame out of those max values
pd.DataFrame({"group": groups, "max value": max_vals})

Getting the row and column of a string from a Pandas dataframe

I have a dataframe of unique strings and I want to find the row and column for a given string. I want these values because I'll eventually be exporting this dataframe to an Excel spreadsheet. The easiest way I've found so far to get these values is the following:
jnames = list(df.iloc[0].to_frame().index)
for i in jnames:
    for k in df[i]:
        if 'searchstring' in str(k):
            print('Column: {}'.format(jnames.index(i) + 1))
            print('Row: {}'.format(list(df[i]).index('searchstring')))
            break
Can anyone advise a solution that takes better advantage of the inherent capabilities of pandas?
Without reproducible code / data, I'm going to make up a dataframe and show one simple way:
Setup
import pandas as pd, numpy as np
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'b']])
The dataframe looks like this:
0 1 2
0 a b c
1 d e f
2 g h b
Solution
result = list(zip(*np.where(df.values == 'b')))
Result
[(0, 1), (2, 2)]
Explanation
df.values accesses the numpy array underlying the dataframe.
np.where creates an array of coordinates satisfying the provided condition.
zip(*...) transforms [x-coords-array, y-coords-array] into (x, y) coordinate pairs.
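As a small aside, np.argwhere yields the same coordinate pairs directly (same df as in the setup above):
coords = [tuple(p) for p in np.argwhere(df.values == 'b')]   # [(0, 1), (2, 2)]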
Try using str.contains. This will return a dataframe of the rows that contain the slice you are looking for.
df[df['<my_col>'].str.contains('<my_string_slice>')]
Similarly, you can use match for a direct match.
This is my approach, without writing nested for loops:
value_to_search = "c"
matching_cols = [x for x in df.columns if value_to_search in df[x].unique()]
print(matching_cols[0])                                       # the column holding the value
print(df[df[matching_cols[0]] == value_to_search].index[0])   # the row index holding the value
The first print returns the column name and the second returns the row index; combined, they give you the location. Since you mentioned that all values in the df are unique, both will return exactly one value.
You might need a try-except in case value_to_search is not in the data frame.
By using stack (data from jpp):
df[df == 'b'].stack()
Out[211]:
0  1    b
2  2    b
dtype: object
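A small follow-up to the stack approach: the MultiIndex of that stacked result already holds the (row, column) pairs, so they can be pulled out directly (same df as above):
coords = list(df[df == 'b'].stack().index)   # [(0, 1), (2, 2)]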
