Pandas: Keep only first occurrence of value in group of consecutive values

I have a dataframe that looks like the following (actually, this is the abstracted result of a calculation):
import pandas as pd
data = {"A":[i for i in range(10)]}
index = [1, 3, 4, 5, 9, 10, 12, 13, 15, 20]
df = pd.DataFrame(index=index, data=data)
print(df)
yields:
A
1 0
3 1
4 2
5 3
9 4
10 5
12 6
13 7
15 8
20 9
Now I want to filter the index values to only show the first value in each group of consecutive values, e.g. the following result:
A
1 0
3 1
9 4
12 6
15 8
20 9
Any hints on how to achieve this efficiently?

Use Series.diff, which is not implemented for Index, so convert the index to a Series and compare for not equal 1. (The first row's diff is NaN, and NaN != 1 is True, so the first row is always kept.)
df = df[df.index.to_series().diff().ne(1)]
print(df)
A
1 0
3 1
9 4
12 6
15 8
20 9

Try this one, which uses the fact that index minus position is constant within each run of consecutive values, so np.unique(..., return_index=True) returns the position of the first row of every run:
import numpy as np
df.iloc[np.unique(np.array(index) - np.arange(len(index)), return_index=True)[1]]
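To see the mechanics on the example data (continuing from the snippet above; keys is my own name for the intermediate array):
keys = np.array(index) - np.arange(len(index))
print(keys)                                   # [ 1  2  2  2  5  5  6  6  7 11]
print(np.unique(keys, return_index=True)[1])  # [0 1 4 6 8 9] -> rows labelled 1, 3, 9, 12, 15, 20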

Try this, grouping runs of consecutive index values and keeping the first row of each run:
runs = df.index.to_series().diff().ne(1).cumsum()
df.groupby(runs).head(1)
For the example index, runs labels the rows [1, 2, 2, 2, 3, 3, 4, 4, 5, 6], so head(1) keeps the first row per label.

Related

Auto re-assign ids in a dataframe

I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
        'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large integers (6 digits). I want a way to simplify it, starting from 10, so that 542588 becomes 10, 542594 becomes 11, etc.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize will enumerate the keys in the order that they occur in your data, while the groupby().ngroup() solution will enumerate the keys in increasing order. You can mimic the increasing order with factorize by sorting the data first, or you can replicate the data order with groupby() by passing sort=False to it.
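For instance, a quick check of that equivalence on the data above (a sketch; both lines number the ids in order of first appearance):
occ_factorize = df['id'].factorize()[0] + 10                 # array([10, 11, 11, 12, 12, 12, 13, 13])
occ_ngroup = df.groupby('id', sort=False).ngroup().add(10)   # same values, as a Series
assert (occ_factorize == occ_ngroup.to_numpy()).all()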
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way of looping through the IDs: every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10, incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)

More efficient way to filter over a subset of a dataframe

The problem is this: I have a dataframe like so:
A B C D
2 3 X 5
7 2 5 7
1 2 7 9
3 4 X 9
1 2 3 5
6 3 X 8
I wish to iterate over the rows of the dataframe, and every time column C equals X I want to reset a counter and start adding the values in column B until column C equals X again. Then rinse and repeat down the rows until complete.
Currently I am iterating over the rows using .iterrows(), comparing column C and then procedurally adding to a variable.
I'm hoping there is a more efficient, 'pandas'-like approach to doing something like this.
Use cumsum() as follows.
import pandas as pd
df = pd.DataFrame({"B": [3, 2, 2, 4, 2, 3],
                   "C": ["X", 5, 7, "X", 3, "X"]})
df.loc[df['C'] == "X", 'C'] = df.loc[df['C'] == "X", 'B'].cumsum()
The output is
B C
0 3 3
1 2 5
2 2 7
3 4 7
4 2 3
5 3 10
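If the goal is instead a running total of B that resets at every "X" row, a grouped cumulative sum avoids iterrows() entirely (a sketch on a fresh copy of the same toy frame; segment and running_B are my own names):
df = pd.DataFrame({"B": [3, 2, 2, 4, 2, 3],
                   "C": ["X", 5, 7, "X", 3, "X"]})
# Each "X" starts a new segment; cumsum over the boolean mask labels the segments.
segment = (df["C"] == "X").cumsum()
# Running total of B within each segment, i.e. a counter that resets at every "X".
df["running_B"] = df.groupby(segment)["B"].cumsum()
This yields running_B = [3, 5, 7, 4, 6, 3].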

Pandas dataframe randomly shuffle some column values in groups

I would like to shuffle some column values but only within a certain group and only a certain percentage of rows within the group. For example, per group, I want to shuffle n% of values in column b with each other.
df = pd.DataFrame({'grouper_col':[1,1,2,3,3,3,3,4,4], 'b':[12, 13, 16, 21, 14, 11, 12, 13, 15]})
grouper_col b
0 1 12
1 1 13
2 2 16
3 3 21
4 3 14
5 3 11
6 3 12
7 4 13
8 4 15
Example output:
grouper_col b
0 1 13
1 1 12
2 2 16
3 3 21
4 3 11
5 3 14
6 3 12
7 4 15
8 4 13
I found
df.groupby("grouper_col")["b"].transform(np.random.permutation)
but then I have no control over the percentage of shuffled values.
Thank you for any hints!
You can use numpy to create a function like this (it takes a numpy array as input):
import numpy as np
def shuffle_portion(arr, percentage):
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0] * percentage / 100),
                            replace=False)
    arr[np.sort(shuf)] = arr[shuf]
    return arr
np.random.choice will choose a set of indexes of the size you need; the corresponding values in the given array are then rearranged into the shuffled order. This should shuffle 3 values out of the 9 in column 'b':
df['b'] = shuffle_portion(df['b'].values, 33)
EDIT:
To use this with apply, you need to convert the passed dataframe to an array inside the function (explained in the comments) and return the dataframe as well:
def shuffle_portion(_df, percentage=50):
    arr = _df['b'].values
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0] * percentage / 100),
                            replace=False)
    arr[np.sort(shuf)] = arr[shuf]
    _df['b'] = arr
    return _df
Now you can just do
df.groupby("grouper_col", as_index=False).apply(shuffle_portion)
It would be better practice to pass the name of the column to shuffle into the function (def shuffle_portion(_df, col='b', percentage=50): arr = _df[col].values ...), as sketched below.
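A sketch of that parameterized version (col and percentage follow the suggestion above; groupby().apply() forwards extra keyword arguments to the function):
def shuffle_portion(_df, col='b', percentage=50):
    arr = _df[col].values
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0] * percentage / 100),
                            replace=False)
    arr[np.sort(shuf)] = arr[shuf]
    _df[col] = arr
    return _df

df.groupby("grouper_col", as_index=False).apply(shuffle_portion, col='b', percentage=33)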

Replacing all values in a Pandas column, with no conditions

I have a Pandas dataframe with a column full of values I want to replace with another value, unconditionally.
For the purpose of this question, let's assume I don't know how long this column is and I don't want to iterate over its values.
Using .replace() is not appropriate since I don't know which values are in that column: I want to replace all values, unconditionally.
Using df.loc[<row selection>, <column selection>] is not appropriate since there is no row selection logic: I want all the rows and simply writing True (as in data.loc[True, 'ColumnName'] = new_value) returns KeyError(True,). I tried data.loc[1, 'ColumnName'] = new_value and it works but it really looks like a shitty solution.
If I knew the len() of data['ColumnName'], I could create an array of that size filled with copies of my new_value and simply replace the column with that array. But that's 10 lines of code to do something simpler than the conditional case, which takes 1 line: this is also not OK.
How can I tell Pandas in 1 line: all the values in ColumnName are now new_value? I refuse to believe there's no way to tell Pandas not to bother me with conditions.
As I explained in the comment, you don't need to create an array.
Let's say you have df:
InvoiceNO Month Year Size
0 1 1 2 7
1 2 1 2 8
2 3 2 2 11
3 4 3 2 9
4 5 7 2 8.5
...and you want to change all values in InvoiceNO to 1234:
df['InvoiceNO'] = 1234
Output:
InvoiceNO Month Year Size
0 1234 1 2 7
1 1234 1 2 8
2 1234 2 2 11
3 1234 3 2 9
4 1234 7 2 8.5
import pandas as pd
df = pd.DataFrame(
    {'num1': [3, 5, 9, 9, 14, 1],
     'num2': [3, 5, 9, 9, 14, 1]},
    index=[0, 1, 2, 3, 4, 5])
print(df)
print('\n')
df['num1'] = 100
print(df)
df['num1'] = 'Hi'
print('\n')
print(df)
The output is
num1 num2
0 3 3
1 5 5
2 9 9
3 9 9
4 14 14
5 1 1
num1 num2
0 100 3
1 100 5
2 100 9
3 100 9
4 100 14
5 100 1
num1 num2
0 Hi 3
1 Hi 5
2 Hi 9
3 Hi 9
4 Hi 14
5 Hi 1

Setting values in grouped data frame in pandas

I have 2 data frames grouped by 4 separate keys. I would like to assign the mean of a column of one group to all the row values in a column of another group. As I understand it, this is how it should be done:
g_test.get_group((1, 5, 13, 8)).monthly_sales = \
g_train.get_group((1, 5, 13, 8)).monthly_sales.mean()
Except this does nothing. The values in monthly_sales of the group identified in g_test are unchanged. Can someone please explain what I am doing wrong and suggest alternatives?
These are the first few rows of g_train.get_group((1, 5, 13, 8))
year month day store item units monthly_sales
1 5 5 13 8 4 466
1 5 6 13 8 12 475
1 5 0 13 8 22 469
1 5 5 13 8 26 469
1 5 6 13 8 39 480
and these are the first few rows of g_test.get_group((1, 5, 13, 8))
year month day store item monthly_sales
1 5 1 13 8 0
1 5 2 13 8 0
1 5 3 13 8 0
1 5 4 13 8 0
1 5 5 13 8 0
Only the first few rows are shown, but the mean of g_train.get_group((1, 5, 13, 8)).monthly_sales is 450, which I want copied over to the monthly_sales column in g_test.
Edit:
I now understand that the code snippet below will work:
df1.loc[(df1.year == 1)
        & (df1.month == 5)
        & (df1.store == 13)
        & (df1.item == 8), 'monthly_sales'] = \
    gb2.get_group((1, 5, 13, 8)).monthly_sales.mean()
This operation is great for copying the mean once; however, the whole reason I split the data frame into groups was to avoid these logic checks and to do this multiple times for different store and item numbers. Is there something else I can do?
You need to assign the result back to the DataFrame, not to the groupby object: get_group returns a copy, so assigning to it never touches the original data. This should work:
df1.loc[(df1.year == 1)
& (df1.month == 5)
& (df1.store == 13)
& (df1.item == 8), 'monthly_sales'] = \
gb2.get_group((1, 5, 13, 8)).monthly_sales.mean()
>>> gb1.get_group((1, 5, 13, 8))
year month day store item units monthly_sales
0 1 5 5 13 8 4 471.8
1 1 5 6 13 8 12 471.8
2 1 5 0 13 8 22 471.8
3 1 5 5 13 8 26 471.8
4 1 5 6 13 8 39 471.8
Actually, I just discovered a better way. g_test is grouped from the dataframe test, so when I tried the line below it worked perfectly:
test.loc[g_test.get_group((1, 5, 13, 8)).index, 'monthly_sales'] = \
g_train.get_group((1, 5, 13, 8)).monthly_sales.mean()
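To apply this to every (year, month, store, item) group at once, rather than one key at a time, a merge-based sketch (assuming the underlying frames are named train and test, per the g_train/g_test groupbys above; those names are assumptions):
keys = ['year', 'month', 'store', 'item']
# Mean monthly_sales per key combination in the training data.
means = train.groupby(keys)['monthly_sales'].mean().rename('mean_sales').reset_index()
# Broadcast each group's mean onto the matching test rows.
test = (test.drop(columns='monthly_sales')
            .merge(means, on=keys, how='left')
            .rename(columns={'mean_sales': 'monthly_sales'}))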
