I have a data frame like the one below
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
df
var1 var2 class
0 1 5 a
1 2 6 a
2 3 7 c
3 4 8 b
I would like to be able to change the proportions in the class column. For example, I would like to randomly down-sample the a class by 50% but keep the number of rows for the other classes the same. The result would be:
df
var1 var2 class
0 1 5 a
1 3 7 c
2 4 8 b
How would this be done?
My approach was to first split the DataFrame into df_selection and df_remaining.
I then reduced df_selection by REMOVE_PERCENTAGE and merged the resulting DataFrame back with df_remaining.
import numpy as np
import pandas as pd
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
REMOVE_PERCENTAGE = 0.5 # between 0 and 1
df = df.set_index(['class'])
df_selection = df.loc['a'] \
    .reset_index()
df_remaining = df.drop('a') \
    .reset_index()
rows_to_remove = int(REMOVE_PERCENTAGE * len(df_selection.index))
drop_indices = np.random.choice(df_selection.index, rows_to_remove, replace=False)
df_selection_reduced = df_selection.drop(drop_indices)
df_result = pd.concat([df_selection_reduced, df_remaining]) \
    .reset_index(drop=True)
print(df_result)
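For a shorter variant, DataFrame.sample can draw the kept fraction directly; a sketch of the same idea (not part of the original answer, with a KEEP_FRACTION constant of my own naming):
import pandas as pd

d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)

KEEP_FRACTION = 0.5  # fraction of the 'a' rows to keep, i.e. remove 50%

mask = df['class'] == 'a'
# Sample the rows to keep from class 'a' and leave the other classes untouched.
df_result = pd.concat([df[mask].sample(frac=KEEP_FRACTION), df[~mask]]) \
    .reset_index(drop=True)
print(df_result)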
I have the following problem: I need to compute a cumcount, but I would like to reset the counter whenever the series is interrupted.
See example:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]}
df = pd.DataFrame.from_dict(data)
I tried this, but it gives the wrong output:
df["seq"] = df.groupby(["col_1"]).cumcount()
What I want is:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3], 'seq': [0, 1, 0, 1, 0]}
How can I do it, please?
Try:
df["seq"] = df.groupby((df["col_1"] != df["col_1"].shift()).cumsum())["col_1"].cumcount()
print(df)
Output
col_1 col_2 seq
0 a 3 0
1 a 2 1
2 b 1 0
3 b 0 1
4 a -3 0
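To see why this works: the inner expression builds a run id that increases each time the value changes, so cumcount restarts for every consecutive run. A quick look at the intermediate result, reusing the question's data:
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]})
# True wherever the value differs from the previous row; the cumulative
# sum then assigns one id per consecutive run.
run_id = (df['col_1'] != df['col_1'].shift()).cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3]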
Note that since you are interested in runs (as in run-length encoding), itertools.groupby might be better suited for this task; consider the following example:
import itertools
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'a']})
# Number the elements within each consecutive run of equal values.
df['seq'] = [i for k, g in itertools.groupby(df['col1']) for i in range(len(list(g)))]
print(df)
Output
col1 seq
0 a 0
1 a 1
2 b 0
3 b 1
4 a 0
I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
Column 'A' is the key in both dataframes.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'? The result for df1 would be:
A B
1 8
3 10
5 12
Maybe DataFrame.isin() is what you're searching for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', '']) \
    .drop(columns=['B_x'])
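For completeness, Series.map with a lookup built from df2 is another common idiom here; a sketch (my variant, not one of the answers above):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
# Look up each key of df1['A'] in a Series indexed by df2['A'].
df1['B'] = df1['A'].map(df2.set_index('A')['B'])
print(df1)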
I have the following dataframe
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [ 1, 2, 4, 1, 2, 3, 4]})
I want a function that would output the following dataframe definition:
df = pd.DataFrame({'a': [ 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [ 1, 2, 4, 1, 2, 3, 4],
                   'c': ['A_0', 'A_0', 'A_1', 'B_0', 'B_0', 'B_0', 'B_0']})
The logic is that, for each value of 'a' (each group), I create a value 'c' that labels each "continuous" run of 'b' values.
So far, my code is the following:
def detection(dataset):
    def detect(series, avalue):
        _id = 0
        start = True
        visits = []
        prev_ = None
        for h in series:
            if start:
                start = False
                prev_ = h
            else:
                if h - prev_ > 1:
                    _id += 1
                prev_ = h
            visits.append(f"{avalue}_{_id}")
        return visits

    res = []
    gb = dataset.groupby("a")
    for avalue in gb.groups:
        dd = gb.get_group(avalue)
        dd["VISIT_ID"] = detect(dd["b"], avalue)
        res.append(dd)
    return pd.concat(res, axis=0)
The good: it works perfectly!
The bad: it is extremely slow on a large dataset (7 million entries, 250k distinct 'a' values).
Is there a better way to do this?
You can build the numeric part of column 'c' using groupby with a cumulative sum over the gaps, then concatenate it with the 'a' values:
df['c'] = df.groupby('a').b.apply(lambda x: (x.diff() > 1).cumsum())
df['c'] = df['a'] + '_' + df['c'].astype(str)
a b c
0 A 1 A_0
1 A 2 A_0
2 A 4 A_1
3 B 1 B_0
4 B 2 B_0
5 B 3 B_0
6 B 4 B_0
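Since groupby().diff() is itself vectorized, a variant that avoids the Python-level lambda entirely may help at 7 million rows; a sketch, assuming the same df as in the question:
# A new run starts wherever the within-group gap in 'b' exceeds 1.
gaps = df.groupby('a')['b'].diff().gt(1).astype(int)
df['c'] = df['a'] + '_' + gaps.groupby(df['a']).cumsum().astype(str)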
I have a column of categorical company sizes within a dataset, where lone hyphens ('-') currently represent missing data.
I want to change the '-' values to nulls so I can analyse the missing data. However, when I use the pandas replace tool with a None value (see the following code), it seems to also affect the genuine entries, since they contain hyphens too (e.g. 51-200).
df['Company Size'].replace({'-': None}, inplace=True, regex=True)
How can I replace only the standalone hyphens and leave the other entries untouched?
You don't need regex=True; without it, replace matches only whole cell values:
df['Company Size'].replace({'-': None}, inplace=True)
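If you do want to keep regex=True, anchoring the pattern so it matches only cells that are exactly one hyphen also works; a sketch with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company Size': ['51-200', '-', '11-50']})
# '^-$' matches only cells consisting of a single hyphen.
df['Company Size'] = df['Company Size'].replace({'^-$': np.nan}, regex=True)
print(df)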
You could also just do:
df['column_name'] = df['column_name'].replace('-', np.nan)  # np.nan, not the string 'None'; passing None as the value triggers pad-filling instead
import numpy as np
df.replace('-', np.nan, inplace=True)
This code worked for me.
You can do it like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', '-', 'c--', 'd', 'e']})
df['C'] = df['C'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
# can also use this -> df['C'] = df['C'].where((pd.notnull(df)), None)
print(df)
Output:
A B C
0 0 5 a
1 1 6 None
2 2 7 c--
3 3 8 d
4 4 9 e
Another example:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': ['5-5', '-', 7, 8, 9],
                   'C': ['a', 'b', 'c--', 'd', 'e']})
df['B'] = df['B'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
print(df)
Output:
A B C
0 0 5-5 a
1 1 None b
2 2 7 c--
3 3 8 d
4 4 9 e
My question is similar to one asked here. I have a dataframe, and I want to repeat each row of it k times. Along with that, I also want to create a column with values 0 to k-1. So:
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [ 1, 2, 3],
    'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n' : [ 1, 2, 2, 3, 3, 3],
    'v' : [ 10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job; I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
If you modify values in df1 later, the modifications would not propagate back to the original data (df), and without the explicit copy pandas warns about this.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
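A NumPy-based sketch of the same idea, building repeat_id directly from the repeat counts instead of using cumcount:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'id': ['A', 'B', 'C'], 'n': [1, 2, 3], 'v': [10, 13, 8]})
df1 = df.loc[df.index.repeat(df.n)].reset_index(drop=True)
# One counter 0..k-1 per original row, concatenated in order.
df1['repeat_id'] = np.concatenate([np.arange(k) for k in df.n])
print(df1)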