Python dataframe drop rows which occur less frequently [duplicate] - python

I have a data frame in which names occur repeatedly across rows. I want to delete the rows whose names occur less often. My data frame is very big, so I am giving only a small sample here.
dataframe:
df =
name value
0 A 10
1 B 20
2 A 30
3 A 40
4 C 50
5 C 60
6 D 70
In the above data frame, the B and D rows occur less often than the others: only once each. I want to delete/drop all such rows whose name occurs fewer than 2 times.
My code:
##### Net strings
net_strs = df['name'].unique().tolist()
strng_list = df.group.unique().tolist()
tempdf = df.groupby('name').count()
##### strings that have less than 2 measurements in whole data set
lesstr = tempdf[tempdf['value']<2].index
##### Strings that have more than 2 measurements in whole data set
strng_list = np.setdiff1d(net_strs,lesstr).tolist()
##### Removing the strings with less measurements
df = df[df['name']==strng_list]
My present output:
ValueError: Lengths must match to compare
My expected output:
name value
0 A 10
1 A 30
2 A 40
3 C 50
4 C 60

You could count the occurrences of each element in name and then select only those rows whose name occurs more than once.
v = df.name.value_counts()
df[df.name.isin(v.index[v.gt(1)])]
Output :
name value
0 A 10
2 A 30
3 A 40
4 C 50
5 C 60
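For context, the ValueError in the question comes from `df['name'] == strng_list`, which attempts an element-wise comparison between a 7-row column and a shorter list; `isin` tests membership instead. A minimal sketch, using the question's sample data and a hand-written `keep` list (an illustrative name standing in for the computed `strng_list`):

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'C', 'C', 'D'],
                   'value': [10, 20, 30, 40, 50, 60, 70]})
keep = ['A', 'C']  # illustrative stand-in for the computed list of frequent names

# Element-wise comparison against a list of a different length fails.
try:
    df[df['name'] == keep]
    raised = False
except ValueError:
    raised = True

# Membership test: True for every row whose name appears in `keep`.
filtered = df[df['name'].isin(keep)]
print(filtered)
```

The original row labels are preserved; append `.reset_index(drop=True)` if a fresh 0-based index is wanted, as in the expected output of the question.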

I believe this code should give you what you want.
df['count'] = df.groupby('name')['name'].transform('count')
df2 = df.loc[df['count'] >= 2].drop(columns='count')
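The same idea can be wrapped in a small helper so that the threshold is a parameter and no temporary column is added. This is only a sketch; `drop_rare` and `min_count` are names made up for illustration, not part of pandas:

```python
import pandas as pd

def drop_rare(frame, column, min_count=2):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    # transform('size') broadcasts each group's row count back onto its rows
    sizes = frame.groupby(column)[column].transform('size')
    return frame[sizes >= min_count]

df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'C', 'C', 'D'],
                   'value': [10, 20, 30, 40, 50, 60, 70]})
df2 = drop_rare(df, 'name', min_count=2)
print(df2)
```

With `min_count=3` only the A rows would remain.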

You should use value_counts() to count the occurrences of each name, then slice the resulting Series to get the names of the rows you want to keep.
df = pd.DataFrame({'name':['A','B','A','A','C','C','D'],
'value':[10,20,30,40,50,60,70]})
counts = df['name'].value_counts()
# names that occur more than once (despite the variable name, these are the rows we keep)
removals = counts[counts > 1].index
Here the threshold is 1: every name that appears more than once is selected. The threshold can of course be stored in a variable or changed as needed.
filtered_df = df[df['name'].isin(removals)]
print(filtered_df)
Output:
name value
0 A 10
2 A 30
3 A 40
4 C 50
5 C 60

Related

changing index of 1 row in pandas

I have the df below, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise, please? I tried reindex and custom sorting but couldn't find a way.
Thanks!
here is the table
Since you can't insert the row and push the others back directly, a clever trick is to create a new ordering:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
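If the week labels are strings, as the question states (dtype = object), the new order can also be built directly from the labels and passed to reindex. A minimal sketch, assuming unique string labels and reusing the sample numbers above:

```python
import pandas as pd

df = pd.DataFrame(
    {'numbers': [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
    index=pd.Index(['1', '2', '3', '4', '53'], name='weeks'),
)

# Put '53' first, then every other label in its existing order.
new_order = ['53'] + [w for w in df.index if w != '53']
reordered = df.reindex(new_order)
print(reordered)
```

This avoids the helper column, at the cost of requiring unique index labels.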
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
# move the last index label to the front; the original labels are kept
new_df = df.loc[[df.index[-1]] + list(df.index[:-1])]
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])

python panda new column with order of values

I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the values in ascending order), but I would like a new column with 2,1,4,0,3,5, so that each row shows the position its number takes in the sorted list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
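To illustrate the tie handling, here is a small sketch on a made-up Series containing a duplicate value. With the default method ('average') tied values get a fractional rank, which astype(int) would truncate, so one of the integer-valued methods is usually a better fit for an order column:

```python
import pandas as pd

s = pd.Series([4, 3, 4, 1, 5])  # the two 4s are tied

# 'min': tied values share the lowest rank, the next rank is skipped
order_min = s.rank(method='min').sub(1).astype(int)
# 'dense': tied values share a rank, no ranks are skipped
order_dense = s.rank(method='dense').sub(1).astype(int)
# 'first': ties are broken by order of appearance
order_first = s.rank(method='first').sub(1).astype(int)

print(pd.DataFrame({'value': s, 'min': order_min,
                    'dense': order_dense, 'first': order_first}))
```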

Iterate rows and find sum of rows not exceeding a number

Below is a dataframe showing coordinate values from and to, each row having a corresponding value column.
I want to find the range of coordinates where the value column doesn't exceed 5. Below is the dataframe input.
import pandas as pd
From=[10,20,30,40,50,60,70]
to=[20,30,40,50,60,70,80]
value=[2,3,5,6,1,3,1]
df=pd.DataFrame({'from':From, 'to':to, 'value':value})
print(df)
hence I want to convert the following table:
to the following outcome:
Further explanation:
Coordinates from 10 to 30 are joined and the value column changed to 5
as its sum of values from 10 to 30 (not exceeding 5)
Coordinates 30 to 40 equals 5
Coordinate 40 to 50 equals 6 (more than 5, however, it's included as it cannot be divided further)
Remaining coordinates sum up to a value of 5
What code is required to achieve the above?
We can do a groupby on cumsum:
s = df['value'].ge(5)
(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
.agg({'from':'min','to':'max', 'value':'sum'})
)
Output:
from to value
0 10 30 5
1 30 40 5
2 40 50 6
3 50 80 5
Update: It looks like you want to accumulate the values so that the new groups do not exceed 5. Several threads on SO say that this can only be done with a for loop, so we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:
        # adding v would exceed the threshold: start a new group
        curr_grp += 1
        partial = v
    else:
        partial += v
    groups.append(curr_grp)
df.groupby(groups).agg({'from':'min','to':'max', 'value':'sum'})
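As a self-contained check, the loop can be wrapped in a function and run on the question's data. This is a sketch; `capped_groups` is a name chosen here for illustration:

```python
import pandas as pd

def capped_groups(values, thresh):
    """Label each value with a group id; a new group starts whenever
    adding the value would push the running sum above `thresh`."""
    labels, running, current = [], thresh, 0
    for v in values:
        if running + v > thresh:
            current += 1
            running = v
        else:
            running += v
        labels.append(current)
    return labels

df = pd.DataFrame({'from': [10, 20, 30, 40, 50, 60, 70],
                   'to': [20, 30, 40, 50, 60, 70, 80],
                   'value': [2, 3, 5, 6, 1, 3, 1]})

labels = pd.Series(capped_groups(df['value'], 5), index=df.index)
result = (df.groupby(labels)
            .agg({'from': 'min', 'to': 'max', 'value': 'sum'})
            .reset_index(drop=True))
print(result)
```

A single value larger than the threshold (6 here) still forms its own group, matching the behaviour requested in the question.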

How to feed random numbers as indices to pandas data frame?

I'm trying to get a random sample from two pandas frames. If (random) rows 2,5,8 are selected in frame A, then the same rows 2,5,8 must be selected from frame B. I first generate a random sample and now want to use it as row indices for both frames. How can I do it? The code should look like
idx = list(random.sample(range(X_train.shape[0]),5000))
lgstc_reg[i].fit(X_train[idx,:], y_train[idx,:])
However, running the code gives an error.
Use iloc:
indexes = [2,5,8] # in your case this is the randomly generated list
A.iloc[indexes]
B.iloc[indexes]
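Put together with the random draw from the question, this might look like the sketch below. The frames `A` and `B` are small made-up examples of equal length, and the seed is fixed only to make the run repeatable:

```python
import random
import pandas as pd

A = pd.DataFrame({'x': range(10)})
B = pd.DataFrame({'y': range(100, 110)})

random.seed(0)  # only for reproducibility of this example
indexes = random.sample(range(len(A)), 3)  # 3 distinct row positions

# iloc selects by position, so both frames yield the same rows
a_part = A.iloc[indexes]
b_part = B.iloc[indexes]
print(a_part)
print(b_part)
```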
An alternative consistent sampling methodology would be to set a random seed, and then sample:
random_seed = 42
A.sample(3, random_state=random_seed)
B.sample(3, random_state=random_seed)
The sampled DataFrames will have the same index, provided A and B have the same number of rows and the same index to begin with.
Hope this helps!
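A quick way to convince yourself of the seeded approach is to compare the sampled indexes. A minimal sketch with two made-up frames of equal length:

```python
import pandas as pd

A = pd.DataFrame({'x': range(10)})
B = pd.DataFrame({'y': range(100, 110)})

random_seed = 42
a_part = A.sample(3, random_state=random_seed)
b_part = B.sample(3, random_state=random_seed)

# Same seed and same length: the same row positions are drawn from both frames.
print(a_part.index.equals(b_part.index))
```

If the two frames differ in length, the draws are no longer guaranteed to line up, so selecting by a shared index is the safer option in that case.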
>>> df1
value ID
0 3 2
1 4 2
2 7 8
3 8 8
4 11 8
>>> df2
value distance
0 3 0
1 4 0
2 7 1
3 8 0
4 11 0
I have two data frames. I want to select random rows of df1 along with the corresponding rows of df2.
First I create a selection_index, which is the index of randomly chosen rows of df1, using the built-in pandas function sample. Then I use this index to locate these rows in df1 and df2 with the help of another built-in function, loc.
>>> selection_index = df1.sample(2).index
>>> selection_index
Int64Index([3, 1], dtype='int64')
>>> df1.loc[selection_index]
value ID
3 8 8
1 4 2
>>> df2.loc[selection_index]
value distance
3 8 0
1 4 0
In your case, this would become something like
idx = X_train.sample(5000).index
lgstc_reg[i].fit(X_train.loc[idx], y_train.loc[idx])

How to apply a Python function to pandas sub-dataframes split 'from the end' and get a new dataframe?

The problem
Starting from a pandas dataframe df with dim_df rows, I need a new dataframe df_new obtained by applying a function to every sub-dataframe of dim_blk rows. Ideally the split starts from the last row, so that the first block, not the last, is the one that may have fewer than dim_blk rows. This should be done in the most efficient way (maybe vectorized?).
Example
In the following example the dataframe is made of few rows, but the real dataframe will be made of millions of rows, that's why I need an efficient solution.
dim_df = 7 # dimension of the starting dataframe
dim_blk = 3 # number of rows of the splitted block
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
print(df)
Output:
TEST
0 1
1 2
2 3
3 4
4 5
5 6
6 7
The split blocks I want:
1 # note: this is the first block composed by a <= dim_blk number of rows
2,3,4
5,6,7 # note: this is the last block and it has dim_blk number of rows
I've done so (I don't know if this is the efficient way):
lst = np.arange(dim_df, 0, -dim_blk) # [7 4 1]
lst_mod = lst[1:] # [4 1] to cut off the last empty sub-dataframe
split_df = np.array_split(df, lst_mod[::-1]) # splitted by reversed list
print(split_df)
Output:
split_df: [
TEST
0 1,
TEST
1 2
2 3
3 4,
TEST
4 5
5 6
6 7]
For example:
print(split_df[1])
Output:
TEST
1 2
2 3
3 4
How can I get a new dataframe, df_new, where every row has two columns, min and max (just an example), calculated for each block?
I.e:
# df_new
Min Max
0 1 1
1 2 4
2 5 7
Thank you,
Gilberto
You can convert split_df into a dataframe and then build a new dataframe from its row-wise min and max:
split_df = pd.DataFrame(np.array_split(df['TEST'], lst_mod[::-1]))
df_new = pd.DataFrame({"MIN":split_df.min(axis=1),"MAX":split_df.max(axis=1)}).reset_index(drop=True)
Output:
MAX MIN
0 1.0 1.0
1 4.0 2.0
2 7.0 5.0
Moved solution from question to answer:
The Solution
Thinking laterally, I found a much faster solution:
1. Apply a rolling function to the entire dataframe.
2. Take every dim_blk-th row, starting from the end.
The code (with different values):
import numpy as np
import pandas as pd
import time
dim_df = 500000
dim_blk = 240
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
start_time = time.time()
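# min_periods=1 lets the window shrink at the start, so the first (shorter) block
# gets its own min/max instead of values back-filled from the first full window
df['MAX'] = df['TEST'].rolling(dim_blk, min_periods=1).max()
df['MIN'] = df['TEST'].rolling(dim_blk, min_periods=1).min()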
df_split = pd.DataFrame(columns=['MIN', 'MAX'])
df_split['MAX'] = df['MAX'][-1::-dim_blk][::-1]
df_split['MIN'] = df['MIN'][-1::-dim_blk][::-1]
df_split.reset_index(inplace=True)
del(df_split['index'])
print(df_split.tail())
print('\n\nEND\n\n')
print("--- %s seconds ---" % (time.time() - start_time))
Time Stats
The original code finished after 545 s. The new code finishes after 0.16 s. Awesome!
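Another vectorized option, not taken from the answers above, is to label each row with a block number counted from the end and aggregate with groupby. A minimal sketch on the 7-row example from the question; `from_end` and `block_id` are names chosen for illustration:

```python
import numpy as np
import pandas as pd

dim_df, dim_blk = 7, 3
df = pd.DataFrame(np.arange(1, dim_df + 1), columns=['TEST'])

# Distance of each row from the last row, in whole blocks: [2 1 1 1 0 0 0]
from_end = np.arange(len(df))[::-1] // dim_blk
# Flip the numbering so the (possibly shorter) first block is block 0
block_id = from_end.max() - from_end

df_new = df.groupby(block_id)['TEST'].agg(Min='min', Max='max')
print(df_new)
```

Any other aggregation can be swapped in for 'min' and 'max'; whether this is faster than the rolling approach on millions of rows would need to be measured.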
