My goal is to group n records by 4, say for example:
0-3
4-7
8-11
etc.
then find the max() value of each group of 4 based on one column (among the other columns), and create a new dataset or CSV file. The max() operation would be performed on that one column, while the other columns remain as they are.
Based on research I have done here on Stack Overflow, I have tried to adapt the following solution to my dataset, but it isn't giving me what I expect:
# Group by every 4 rows up to len(dataset)
groups = dataset.groupby(pd.cut(dataset.index, range(0, len(dataset), 3)))
new_dataset = groups.max()
I'm getting results similar to the following:
Column 1 Column 2 ... Column n
0. (0,3]
1. (3,6]
The targeted column for the max() operation also did not produce the expected result.
I would appreciate any guidance on tackling this problem.
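For reference, grouping strictly by row position can be expressed with integer division of the index; a minimal sketch (assuming a default RangeIndex and placeholder column names):

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.randint(0, 100, 12),
                   'other': list('abcdefghijkl')})

# Label rows 0-3, 4-7, 8-11 by integer-dividing the positional index by 4.
groups = df.groupby(df.index // 4)

# Keep, for each block of 4 rows, the whole row holding the max 'value';
# the other columns stay as they are.
new_dataset = df.loc[groups['value'].idxmax()]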
This example should help you. Here I create a DataFrame of some random values between 0 and 100 and group those values into bins of width 5, labelled 0 - 4, 5 - 9, and so on (sort_values is really important; it will make your life easier).
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.randint(0, 100, 5)})
df = df.sort_values(by='value')
labels = ["{0} - {1}".format(i, i + 4) for i in range(0, 100, 5)]
df['group'] = pd.cut(df.value, range(0, 105, 5), right=False, labels=labels)
groups = df["group"].unique()
Then I create an array for max values
max_vals = np.zeros(len(groups))
for i, group in enumerate(groups):
    max_vals[i] = max(df[df["group"] == group]["value"])
And then a DataFrame out of those max values
pd.DataFrame({"group": groups, "max value": max_vals})
I have a data frame that has the following format:
d = {'id1': ['a', 'a', 'b', 'b'], 'id2': ['a', 'b', 'b', 'c'], 'score': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
print(df)

  id1 id2  score
0   a   a      1
1   a   b      2
2   b   b      3
3   b   c      4
The data frame has over 1 billion rows; it represents pairwise distance scores between objects in columns id1 and id2. I do not need all object pair combinations: for each object in id1 (there are about 40k unique IDs) I only want to keep the top 100 closest (smallest) distance scores.
The code I'm running to do this is the following:
df = df.groupby(['id1'])['score'].nsmallest(100)
The issue with this code is that I run into a memory error each time I try to run it:
MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64
I'm assuming it is because in the background pandas is now creating a new data frame for the result of the group by, but the existing data frame is still held in memory.
The reason I am only taking the top 100 for each id is to reduce the size of the data frame, but it seems that while doing so I am actually using more memory.
Is there a way I can go about filtering this data down but not taking up more memory?
The desired output would be something like this (assuming top 1 instead of top 100):

  id1 id2  score
0   a   a      1
2   b   b      3
Some additional info about the original df:
df.count()
permid_1 1144468900
permid_2 1144468900
distance 1144468900
dtype: int64
df.dtypes
permid_1 int64
permid_2 int64
distance float64
dtype: object

df.shape
(1144468900, 3)
id1 & id2 unique value counts: 33,830
I can't test this code, lacking your data, but perhaps try something like this:
import numpy as np

indices = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindices = np.argsort(scores.values)[:100]  # numpy works on raw positions only
    min_indices = scores.iloc[min_subindices].index   # convert back to pandas index labels
    indices.extend(min_indices)

df = df.loc[indices]
Descriptively: for each unique ID (the_id), extract the matching scores. Then find the raw positions of the smallest 100. Select those positions and map them from the raw index to the Pandas index, saving the Pandas index labels to your list. At the end, subset on the collected Pandas index.
iloc does take a list input. some_series.iloc aligns with some_series.values, which is what allows this to work. Storing indices indirectly like this should make the whole thing substantially more memory-efficient.
df['score'][df['id1'] == the_id] should work more efficiently than df.loc[df['id1'] == the_id, 'score']: instead of taking the whole data frame and masking it, it takes only the score column and masks that for matching IDs. You may want to del scores at the end of each loop iteration if you want to free memory sooner.
You can try the following:
df.sort_values(["id1", "score"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])
df = df.loc[df["dummy_key"]]
1. You sort ascending (smallest on top), first by the grouping column, then by score.
2. You add a column indicating whether the current id1 differs from the one 100 rows back (if it does not, your row is the 101st or later within its group).
3. You filter on the column from step 2.
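To see the trick on a small scale, here is a toy sketch keeping the 2 smallest scores per id instead of 100 (names are placeholders):

import pandas as pd

toy = pd.DataFrame({'id1': ['a', 'a', 'a', 'b', 'b'],
                    'score': [3, 1, 2, 5, 4]})

toy = toy.sort_values(['id1', 'score'])
# A row is kept if the row 2 positions earlier belongs to a different id1
# (or does not exist), i.e. if it is among the first 2 rows of its group.
keep = toy['id1'].shift(2).ne(toy['id1'])
print(toy.loc[keep])   # keeps (a, 1), (a, 2), (b, 4), (b, 5)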
As Aryerez outlined in a comment, you can do something along the lines of:
closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by='score').head(100)
                     for id1 in set(df['id1'])])
You could also do
def get_hundredth(id1):
    sub_df = df.loc[df['id1'] == id1].sort_values(by='score')
    return sub_df.iloc[99]['score']  # the 100th-smallest score (position 99)

hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}

def check_distance(row):
    return row['score'] <= hundredth_dict[row['id1']]

closest = df.loc[df.apply(check_distance, axis=1)]
Another strategy would be to look at how filtering out distances past a threshold affects the dataframe. That is, take
low_scores = df.loc[df['score']<threshold]
Does this significantly decrease the size of the dataframe for some reasonable threshold? You'd need a threshold that makes the dataframe small enough to work with, but leaves the lowest 100 scores for each id1.
You also might want to look into what sort of optimization you can do given your distance metric. There are probably algorithms out there specifically for cosine similarity.
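A rough sketch of that two-step idea (threshold is a hypothetical value you would tune so each id1 still keeps at least 100 scores):

threshold = 0.5  # hypothetical cut-off, tune for your data

low_scores = df.loc[df['score'] < threshold]
closest = low_scores.groupby('id1')['score'].nsmallest(100)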
For the given shape (1144468900, 3) with only 33,830 unique values, the id1 and id2 columns are good candidates for the categorical data type. Converting them replaces each stored int64 value with a compact integer code into a single array of 33,830 categories, which roughly halves the memory those two columns need; after that, perform whatever aggregation you want.
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
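If you want to verify the saving, memory_usage shows the per-column footprint before and after the conversion; illustrated here on a small stand-in frame:

import numpy as np
import pandas as pd

# Small stand-in; the real frame has ~1.14 billion rows and 33,830 unique IDs.
n = 1_000_000
df = pd.DataFrame({'id1': np.random.randint(0, 33_830, n),
                   'id2': np.random.randint(0, 33_830, n),
                   'score': np.random.rand(n)})

print(df.memory_usage(deep=True))          # id1/id2 as int64: 8 bytes per row
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
print(df.memory_usage(deep=True))          # codes (int32 here) plus one small categories array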
I have a Pandas dataframe with two columns I am interested in: A categorical label and a timestamp. Presumably what I'm trying to do would also work with ordered numerical data. The dataframe is already sorted by timestamps in ascending order. I want to find out which label spans the longest time-window and select only the values associated with it in the original dataframe.
I have tried grouping the df by label, calculating the difference and selecting the maximum (longest time-window) successfully; however, I'm having trouble finding an expression that selects the corresponding values in the original df using this information.
Consider this example with numerical values:
d = {'cat': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C','C','C'],
     'val': [1,3,5,6,8,9,0,5,10,20,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
Here I would expect something equivalent to df.loc[df.cat == 'B'] since B has the maximum difference of all the categories.
df.groupby('cat').val.apply(lambda x: x.max() - x.min()).max()
gives me the correct difference, but I have no idea how to use this to select the correct category in the original df.
You can go for idxmax to get the category that gave rise to the maximum peak-to-peak value within groups (np.ptp does maximum minus minimum). Then you can index with loc as you said, or use query:
>>> max_cat = df.groupby("cat").val.apply(np.ptp).idxmax()
>>> max_cat
'B'
>>> df.query("cat == @max_cat")  # or df.loc[df.cat == max_cat]
cat val
6 B 0
7 B 5
8 B 10
9 B 20
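If you prefer to stay with the max-minus-min lambda from the question instead of np.ptp, the equivalent one-liner is:
>>> df.groupby("cat")["val"].agg(lambda s: s.max() - s.min()).idxmax()
'B'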
I have a DataFrame containing data with age binned in separate rows, as below:
VALUE,AGE
10, 0-4
20, 5-9
30, 10-14
40, 15-19
.. .. .....
So, basically, the age is grouped in 5 year bins. I'd like to have 10 year bins, that is, 0-9,10-19 etc. What I'm after is the VALUE, but for 10-year based age bins, so the values would be:
VALUE,AGE
30, 0-9
70, 10-19
I can do it by shifting and adding, and taking every second row of the resulting dataframe, but is there any smarter, more general way built into Pandas to do this?
Here's a "dumb" version, based on this answer - just sum every 2 rows:
In[0]
df.groupby(df.index // 2).sum()
Out[0]:
VALUE
0 30
1 70
I say "dumb" because this method doesn't factor in the age cut offs, it just happens to align with them. So say if the age ranges are variable, or if you have data that start at 5-9 instead of 0-4, this will likely cause an issue. You also have to rename the index as it is unclear.
A "smarter" version would be to actually create bins with pd.cut and use that to group the data, based on the ages for each row:
In[0]
df['MAX_AGE'] = df['AGE'].str.split('-').str[-1].astype(int)
bins = [0,10,20]
out = df.groupby(pd.cut(df['MAX_AGE'], bins=bins, right=False)).sum().drop('MAX_AGE',axis=1)
Out[0]:
          VALUE
MAX_AGE
[0, 10)      30
[10, 20)     70
Explanation:
- Use pandas.Series.str methods to get the maximum age for each row, and store it in a column "MAX_AGE".
- Create bins at 10-year cut-offs.
- Use pd.cut to assign the data into bins based on the max age of each row, then groupby on these bins and sum. Note that since we specify right=False, the bins shown in the index mean 0-9 and 10-19.
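If you would rather see the original AGE-style labels in the index than interval notation, pd.cut also accepts explicit labels (a small variation on the code above):

out = df.groupby(pd.cut(df['MAX_AGE'], bins=bins, right=False,
                        labels=['0-9', '10-19']))['VALUE'].sum()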
For reference, here is the data I was using:
import pandas as pd
VALUE = [10,20,30,40,]
AGE = ['0-4','5-9','10-14','15-19']
df = pd.DataFrame({'VALUE': VALUE,
                   'AGE': AGE})
This should work as long as the bins are all in 5-year increments. It finds the rows where the upper bound is odd (9, 19, ...) and groups each of them together with the rows that came before, back to the previous odd upper bound.
The code below splits the AGE string to get the numerical lower and upper bounds:
df['lower'] = df['AGE'].str.split('-').str[0]
df['upper'] = df['AGE'].str.split('-').str[1]
df[['lower','upper']] = df[['lower','upper']].astype(int)
Then it applies the grouping logic and rebuilds the AGE label to represent the new 10-year period:
df['VALUE'] = df.groupby((df['upper'] % 2 == 1).shift().fillna(0).cumsum())['VALUE'].transform('sum')
df = df.drop_duplicates(subset = ['VALUE'],keep = 'last')
df['lower'] = df['lower'] - 5
df[['lower','upper']] = df[['lower','upper']].astype(str)
df['AGE'] = df['lower'] + '-' + df['upper']
df = df.drop(columns = ['lower','upper'])
EDIT:
I realized I set up my example incorrectly; the corrected version follows:
I have two dataframes:
df1 = pd.DataFrame({'x values': [11, 12, 13], 'time': [1, 2.2, 3.5]})
df2 = pd.DataFrame({'x values': [11, 21, 12, 43], 'time': [1, 2.1, 2.6, 3.1]})
What I need to do is iterate over both of these dataframes, and compute a new value, which is a ratio of the x values in df1 and df2. The difficulty comes in because these dataframes are of different lengths.
If I just wanted to compute values pairwise across the two, I know that I could use something like zip, or even map. Unfortunately, I don't want to drop any values. Instead, I need to be able to compare the time column between the two frames to determine whether or not to carry a value over from a previous time into the computation for the next time period.
So for instance, I would compute the first ratio:
df1["x values"][0]/df2["x values"][0]
Then for the second ratio I check which update happens next. In this case it is df2 (its next time, 2.1, comes before df1's 2.2), so:
df1["x values"][0]/df2["x values"][1]
For the third, df1 updates next (its 2.2 comes before df2's 2.6), so the computation would be:
df1["x values"][1]/df2["x values"][1]
The only time both values should be used to compute the ratio from the same "position" is if the times in the two dataframes are equal.
And so on. I'm very confused as to whether or not this is possible to execute using something like a lambda function, or itertools. I've made some attempts, but most have yielded errors. Any help would be appreciated.
Here is what I ended up doing. Hopefully it helps clarify what my question was. Also, if anyone can think of a more pythonic way to do this, I would appreciate the feedback.
#add a column indicating which 'type' of dataframe it is
df1['type']=pd.Series('type1',index=df1.index)
df2['type']=pd.Series('type2',index=df2.index)
#concatenate the dataframes
df = pd.concat((df1, df2),axis=0, ignore_index=True)
#sort by time
df = df.sort_values(by='time').reset_index()
#we create empty arrays in order to track records
#in a way that will let us compute ratios
x1 = []
x2 = []
#we will iterate through the dataframe line by line
for i in range(0, len(df)):
    #if the row contains data from df1
    if df["type"][i] == "type1":
        #we append the x value for that type
        x1.append(df[df["type"] == "type1"]["x values"][i])
        #if the x2 array contains exactly 1 value
        if len(x2) == 1:
            #we add it to match the number of x1
            #that we have recorded up to that point
            #this is useful if one time starts before the other
            for j in range(1, len(x1) - 1):
                x2.append(x2[0])
        #if the x2 array contains more than 1 value
        #add a copy of the previous x2 record to correspond
        #to the new x1 record
        if len(x2) > 0:
            x2.append(x2[len(x2) - 1])
    #if the row contains data from df2
    if df["type"][i] == "type2":
        #we append the x value for that type
        x2.append(df[df["type"] == "type2"]["x values"][i])
        #if the x1 array contains exactly 1 value
        if len(x1) == 1:
            #we add it to match the number of x2
            #that we have recorded up to that point
            #this is useful if one time starts before the other
            for j in range(1, len(x2) - 1):
                x1.append(x1[0])
        #if the x1 array contains more than 1 value
        #add a copy of the previous x1 record to correspond
        #to the new x2 record
        if len(x1) > 0:
            x1.append(x1[len(x1) - 1])
#combine the records
new_df = pd.DataFrame({'Type 1': x1, 'Type 2': x2})
#compute the ratio
new_df['Ratio'] = new_df['Type 1'] / new_df['Type 2']
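For what it's worth, the same carry-forward logic can be expressed more compactly by aligning both series on a shared time index and forward-filling; a sketch using the example frames above (the column names 'x1' and 'x2' are just placeholders):

import pandas as pd

df1 = pd.DataFrame({'x values': [11, 12, 13], 'time': [1, 2.2, 3.5]})
df2 = pd.DataFrame({'x values': [11, 21, 12, 43], 'time': [1, 2.1, 2.6, 3.1]})

# Put both series on a common, sorted time axis.
s1 = df1.set_index('time')['x values'].rename('x1')
s2 = df2.set_index('time')['x values'].rename('x2')
combined = pd.concat([s1, s2], axis=1).sort_index()

# Carry the last known value of each series forward to every new timestamp,
# then take the ratio at each point in time.
combined = combined.ffill()
combined['Ratio'] = combined['x1'] / combined['x2']
print(combined)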
You can merge the two dataframes on time and then calculate the ratios:
new_df = df1.merge(df2, on = 'time', how = 'outer')
new_df['ratio'] = new_df['x values_x'] / new_df['x values_y']
You get
time x values_x x values_y ratio
0 1 11 11 1.000000
1 2 12 21 0.571429
2 2 12 12 1.000000
3 3 13 43 0.302326
I've just started working with Pandas and I am trying to figure out if it is the right tool for my problem.
I have a dataset:
date, sourceid, destid, h1..h12
I am basically interested in the sum of each of the H1..H12 columns, but, I need to exclude multiple ranges from the dataset.
Examples would be to:

exclude H4, H5, H6 data where sourceid = 4944, and exclude H8, H9-H12 where destination = 481981, and so on.

This can go on for many, many filters as we are constantly removing data to get close to our final model.
I think I saw in a solution that I could build a list of the filters I would want and then create a function to test against, but I haven't found a good example to work from.
My initial thought was to create a copy of the df and just remove the data we didn't want, and if we needed it back we could just copy it back in from the original df, but that seems like the wrong road.
By using masks, you don't have to remove data from the dataframe. E.g.:
mask1 = df.sourceid == 4944
var1 = df[mask1][['H4', 'H5', 'H6']].sum()
Or directly do:
var1 = df[df.sourceid == 4944][['H4', 'H5', 'H6']].sum()
In case of multiple filters, you can combine the Boolean masks with Boolean operators:
totmask = mask1 & mask2
You can use DataFrame.loc[] to set the data to zeros (the original answer used the now-removed .ix indexer; .loc works the same way here).
Create a dummy DataFrame first:
import numpy as np
import pandas as pd

N = 10000
df = pd.DataFrame(np.random.rand(N, 12), columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N + 1)])
df["sourceid"] = np.random.randint(0, 50, N)
df["destid"] = np.random.randint(0, 50, N)
Then for each of your filters you can call:
df.loc[df.sourceid == 10, "h4":"h6"] = 0
Since you have 600k rows, creating a mask array with df.sourceid == 10 may be slow. You can instead create Series objects that map each value to the index of the DataFrame:
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()
and then exclude h4,h5,h6 where sourceid == 10 by:
df.loc[sourceid[10], "h4":"h6"] = 0
to find row ids where sourceid == 10 and destid == 20:
np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
to find row ids where 10 <= sourceid <= 12 and 3 <= destid <= 5:
np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)
sourceid and destid are Series with duplicated index values. When the index values are sorted, Pandas uses searchsorted to locate labels, which is O(log N) and faster than creating mask arrays, which is O(N).
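Putting the pieces together, zeroing h4-h6 only where sourceid == 10 and destid == 20 would then look like:

rows = np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
df.loc[rows, "h4":"h6"] = 0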