Below is a dataframe showing coordinate values from and to, each row having a corresponding value column.
I want to find the range of coordinates where the value column doesn't exceed 5. Below is the dataframe input.
import pandas as pd
From=[10,20,30,40,50,60,70]
to=[20,30,40,50,60,70,80]
value=[2,3,5,6,1,3,1]
df=pd.DataFrame({'from':From, 'to':to, 'value':value})
print(df)
hence I want to convert the following table:
to the following outcome:
Further explanation:
Coordinates from 10 to 30 are joined and the value column changed to 5
as its sum of values from 10 to 30 (not exceeding 5)
Coordinates 30 to 40 equals 5
Coordinate 40 to 50 equals 6 (more than 5, however, it's included as it cannot be divided further)
Remaining coordinates sum up to a value of 5
What code is required to achieve the above?
We can do a groupby on cumsum:
s = df['value'].ge(5)
(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
.agg({'from':'min','to':'max', 'value':'sum'})
)
Output:
from to value
0 10 30 5
1 30 40 5
2 40 50 6
3 50 80 5
Update: It looks like you want to accumulate the values so the new groups do not exceed 5. There are several threads on SO saying that this can only be done with a for a loop. So we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
if partial + v > thresh:
curr_grp += 1
partial = v
else:
partial += v
groups.append(curr_grp)
df.groupby(groups).agg({'from':'min','to':'max', 'value':'sum'})
Related
I have a dataframe(df) like below (there are more rows actually).
number
0
21
1
35
2
467
3
965
4
2754
5
34r
6
5743
7
841
8
8934
9
275
I want to insert multiple 6 rows in between rows for example I want to get random 6 values within range of index 0 and 1 and add these 6 rows between index 0 and 1.
Same goes to index 1 and 2, 2 and 3 and so forth until the end.
np.linspace(df["number"][0], df["number"][1],8)
Is there a function or any other method to generate 6 additional rows between all existing 9 rows so therefore the final number of rows will be not 9 but 64 rows (after adding 54 rows)?
You could try the following:
from random import uniform
def rng_numbers(row):
left, right = row.iat[0], row.iat[1]
n = left
if pd.isna(right):
return [n]
if right < left:
left, right = right, left
return [n] + [uniform(left, right) for _ in range(6)]
df["number"] = (
pd.concat([df["number"], df["number"].shift(-1)], axis=1)
.apply(rng_numbers, axis=1)
)
df = df.explode("number", ignore_index=True)
First create a dataframe with 2 columns that form the interval boundaries: the number column and number column shifted 1 forth.
Then .apply the function rng_numbers to the rows of the new dataframe: rng_numbers first sorts the interval boundaries and then returns a list that starts with the resp. item from column number and then num_rows many random numbers in the interval. In the last row the left boundary is NaN (due to the .shift(-1)): in this case the function returns the list without the random numbers.
Then .explode df on the new column number.
You could do something similar with NumPy, which is probably faster:
rng = np.random.default_rng()
limits = pd.concat([df["number"], df["number"].shift(-1)], axis=1)
left = limits.min(axis=1).values.reshape(-1, 1)
right = limits.max(axis=1).values.reshape(-1, 1)
df["number"] = (
pd.Series(df["number"].values.reshape(len(df), 1).tolist())
+ pd.Series(rng.uniform(left, right, size=(len(df), 6)).tolist())
)
df["number"].iat[-1] = df["number"].iat[-1][:1]
df = df.explode("number", ignore_index=True)
I have a dataframe that has following columns: X and Y are Cartesian coordinates and Value is the value of element at these coordinates. What I want to achieve is to select only one coordinates out of n that are close to other, lets say coordinates are close if distance is lower than some value m, so the initial DF looks like this (example):
data = {'X':[0,0,0,1,1,5,6,7,8],'Y':[0,1,4,2,6,5,6,4,8],'Value':[6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)
X Y Value
0 0 0 6
1 0 1 7
2 0 4 4
3 1 2 5
4 1 6 6
5 5 5 5
6 6 6 6
7 7 4 4
8 8 8 8
distance is count with following function:
def countDistance(lat1, lon1, lat2, lon2):
#use basic knowledge about triangles - values are in meters
distance = sqrt(pow(lat1-lat2,2)+pow(lon1-lon2,2))
return distance
lets say if we want to m<=3, the output dataframe would look like this:
X Y Value
1 0 1 7
4 1 6 6
8 8 8 8
What is to be done:
rows 0,1,3 are close, highest value is in row 1, continue
rows 2 and 4 (from original df) are close, keep row 4
rows 5,6,7 are close, keep row 6
left over row 6 is close to row 8, keep row 8, has higher value
So I need to go through dataframe row by row, check the rest, select best match and then continue. I can't think about any simple method how to achieve this, this cant be use case of drop_duplicates, since they are not duplicates, but looping over the whole DF will be very inefficient. One method I could think about was to loop just once, for each of rows finds close ones (probably apply countdistance()), select the best fitting row and replace rest with its values, in the end use drop_duplicates. The other idea was to create a recursive function that would create a new DF, then while original df will have rows select first, find close ones, best match append to new DF, remove first row and all close from original DF and continue until empty, then return same function with new DF as to remove possible uncaught close points.
These ideas are all kind of inefficient, is there a nice and efficient pythonic way to achieve this?
For now, I have created simple code with recursion, the code works but is most likely not optimal.
def recModif(self,df):
#columns=['','X','Y','Value']
new_df = df.copy()
new_df = new_df[new_df['Value']<0] #create copy to work with
changed = False
while not df.empty: #for all the data
df = df.reset_index(drop=True) #need to reset so 0 is always accessible
x = df.loc[0,'X'] #first row x and y
y = df.loc[0,'Y']
df['dist'] = self.countDistance(x,y,df['X'],df['Y']) #add column with distances
select = df[df['dist']<10] #number of meters that two elements cant be next to other
if(len(select.index)>1): #if there is more than one elem close
changed = True
#print(select,select['Value'].idxmax())
select = select.loc[[select['Value'].idxmax()]] #get the highest one
new_df = new_df.append(pd.DataFrame(select.iloc[:,:3]),ignore_index=True) #add it to new df
df = df[df['dist'] >= 10] #drop the elements now
if changed:
return self.recModif(new_df) #use recursion if possible overlaps
else:
return new_df #return new df if all was OK
This question already has answers here:
Pandas: Selecting rows based on value counts of a particular column
(2 answers)
How to select rows in Pandas dataframe where value appears more than once
(5 answers)
Closed 2 years ago.
I have a data frame with repeatedly occurring rows with different names. I want to delete less occurring rows. My data frame is very big. I am giving only a small size here.
dataframe:
df =
name value
0 A 10
1 B 20
2 A 30
3 A 40
4 C 50
5 C 60
6 D 70
In the above data frame B and D rows occurred fewer times. That is less than 1. I want to delete/drop all such rows that occur less than 2.
My code:
##### Net strings
net_strs = df['name'].unique().tolist()
strng_list = df.group.unique().tolist()
tempdf = df.groupby('name').count()
##### strings that have less than 2 measurements in whole data set
lesstr = tempdf[tempdf['value']<2].index
##### Strings that have more than 2 measurements in whole data set
strng_list = np.setdiff1d(net_strs,lesstr).tolist()
##### Removing the strings with less measurements
df = df[df['name']==strng_list]
My present output:
ValueError: Lengths must match to compare
My expected output:
name value
0 A 10
1 A 30
2 A 40
3 C 50
4 C 60
You could find the count of each element in name and then select rows only those rows having names that occur more than once.
v = df.name.value_counts()
df[df.name.isin(v.index[v.gt(1)])]
Output :
name value
0 A 10
2 A 30
3 A 40
4 C 50
5 C 60
I believe this code should give you what you want.
df['count'] = df.groupby('name').transform('count')
df2 = df.loc[df['count'] >= 2].drop(columns='count')
You should use value_counts() to get the occurrence of each row, followed by slicing this series to get the name of the rows you can to drop.
df = pd.DataFrame({'name':['A','B','A','A','C','C','D'],
'value':[10,20,30,40,50,60,70]})
removals = df['name'].value_counts().reset_index()
removals = removals[removals['name'] > 1]['index'].values
Here, we're setting a threshold of 1, where all values that show up more than one will get selected, but this can obviously be a variable, or changed accordingly.
filtered_df = df[df['name'].isin(removals)]
print(filtered_df)
Output:
name value
0 A 10
2 A 30
3 A 40
4 C 50
5 C 60
I have a data frame that present some features with cumulative values. I need to identify those features in order to revert the cumulative values.
This is how my dataset looks (plus about 50 variables):
a b
346 17
76 52
459 70
680 96
679 167
246 180
What I wish to achieve is:
a b
346 17
76 35
459 18
680 26
679 71
246 13
I've seem this answer, but it first revert the values and then try to identify the columns. Can't I do the other way around? First identify the features and then revert the values?
Finding cumulative features in dataframe?
What I do at the moment is run the following code in order to give me the feature's names with cumulative values:
def accmulate_col(value):
count = 0
count_1 = False
name = []
for i in range(len(value)-1):
if value[i+1]-value[i] >= 0:
count += 1
if value[i+1]-value[i] > 0:
count_1 = True
name.append(1) if count == len(value)-1 and count_1 else name.append(0)
return name
df.apply(accmulate_col)
Afterwards, I save these features names manually in a list called cum_features and revert the values, creating the desired dataset:
df_clean = df.copy()
df_clean[cum_cols] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
Is there a better way to solve my problem?
To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.
With that out of the way, given a dataframe such as:
import pandas as pd
d = {'a': [1,2,3,4],
'b': [4,3,2,1]
}
df = pd.DataFrame(d)
#Output:
a b
0 1 4
1 2 3
2 3 2
3 4 1
Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.
That can be written as:
out = (df.diff().dropna()>0).all()
#Output:
a True
b False
dtype: bool
Then, you can just use the column names to select only those with True in them
new_df = df[df.columns[out]]
#Output:
a
0 1
1 2
2 3
3 4
*(the term cumulative doesn't really represent the conditions you used.Did you want it to be cumulative or just increasing? Cumulative implies that the value in a particular row/index was the sum of all previous values upto that index, while increasing is just that, the value in current row/index is greater than previous.)
If we run the following code
np.random.seed(0)
features = ['f1','f2','f3']
df = pd.DataFrame(np.random.rand(5000,4), columns=features+['target'])
for f in features:
df[f] = np.digitize(df[f], bins=[0.13,0.66])
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)
df.groupby(features)['target'].agg(['mean','count']).head(9)
We get average values for each grouping of the feature set:
mean count
f1 f2 f3
0 0 0 0.571429 7
1 0.414634 41
2 0.428571 28
1 0 0.490909 55
1 0.467337 199
2 0.486726 113
2 0 0.518519 27
1 0.446281 121
2 0.541667 72
In the table above, some of the groups has too few observations and I want to merge it into 'adjacent' group by some rules. For example, I may want to merge the group [0,0,0] with group [0,0,1] since it has no more than 30 observations. I wonder if there is any good way of operating such group combinations according to columns values without creating a separate dictionary? More specifically, I may want to merge from the smallest count group to its adjacent group (the next group within the index order) until the total number of groups is no more than 10.
A simple way to do it is with a loop for on indexes meeting your condition:
df_group = df.groupby(features)['target'].agg(['mean','count'])
# Fist reset_index to get an easier manipulation
df_group = df_group.reset_index()
list_indexes = df_group[df_group['count'] <=58].index.values # put any value you want
# loop for on list_indexes
for ind in list_indexes:
# check again your condition in case at the previous iteration
# merging the row has increase the count above your cirteria
if df_group['count'].loc[ind] <= 58:
# add the count values to the next row
df_group['count'].loc[ind+1] = df_group['count'].loc[ind+1] + df_group['count'].loc[ind]
# do anything you want on mean
# drop the row
df_group = df_group.drop(axis = 0, index = ind)
# Reindex your df
df_group = df_group.set_index(features)