I have
x cluster_id
0 1 1
1 3 1
2 2 2
3 5 2
4 4 3
I want to generate
x cluster_id s
0 1 1 1
1 3 1 4
2 2 2 2
3 5 2 7
4 4 3 4
i.e. s is the running sum of x, but it gets reset when the cluster id changes. How is this achieved?
Alternatively, if it is easier, it may be Ok to do
x cluster_id s
0 1 1 4
1 3 1 4
2 2 2 7
3 5 2 7
4 4 3 4
i.e. all values for s within the same cluster are the same, and correspond to the total sum in the cluster.
Additionally, I want to subsample this so that I keep the last row of each cluster:
x cluster_id s
1 3 1 4
3 5 2 7
4 4 3 4
(note that all the cluster ids are different). How can I do this?
You can get the running totals using .cumsum() with .groupby()
>>> df
x cluster_id
0 1 1
1 3 1
2 2 2
3 5 2
4 4 3
>>> df['s'] = df.groupby('cluster_id')['x'].cumsum()
>>> df
x cluster_id s
0 1 1 1
1 3 1 4
2 2 2 2
3 5 2 7
4 4 3 4
Then to get only the last row for each cluster_id:
>>> df.groupby('cluster_id').last().reset_index()
cluster_id x s
0 1 3 4
1 2 5 7
2 3 4 4
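The question also mentions an acceptable alternative in which every row of a cluster carries that cluster's total. A transform keeps the original shape; a minimal sketch, assuming the same df as above:
df['s'] = df.groupby('cluster_id')['x'].transform('sum')
# every row now holds its cluster's total: 4, 4, 7, 7, 4
Taking the last row per cluster with .groupby('cluster_id').last() then gives the same totals as the cumulative-sum version.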
I have an Excel dataset which contains 100 rows and 100 columns with order frequencies in locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-coordinate | y-coordinate | value
The "value" column only contains positive integers. The x and y columns contain float type data (geographical coordinates).
The order does not matter, as it can easily be sorted afterwards.
So, basically a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location... which is key for my project.
What is the best way to accomplish this?
Assuming this input:
import pandas as pd

l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
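The question states that the real x and y values are float geographical coordinates rather than row/column positions, so you can attach them as the index and columns before stacking. A minimal sketch, assuming hypothetical coordinate lists x_coords and y_coords taken from the Excel sheet:
# hypothetical coordinate values; replace with the real ones from the sheet
x_coords = [51.1, 51.2, 51.3]
y_coords = [4.1, 4.2, 4.3, 4.4]

df.index = x_coords      # label rows with the actual x coordinates
df.columns = y_coords    # label columns with the actual y coordinates
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
The x and y columns of the result then contain the real coordinates instead of 0, 1, 2.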
You should be able to get the results with a melt operation -
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8
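If the grid really lives in an Excel file as described in the question, the same melt idea applies directly after loading it. A minimal sketch, assuming a hypothetical file orders.xlsx whose first column holds the x coordinates and whose header row holds the y coordinates:
import pandas as pd

# hypothetical file name and layout; adjust to the real sheet
df = pd.read_excel('orders.xlsx', index_col=0)
long_df = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
This yields one row per (x, y, value) triple, which can be sorted afterwards as the question notes.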
Consider a dataframe with a column like this:
sequence
1
2
3
4
5
1
2
3
1
2
3
4
5
6
7
I wish to create a column that increments each time the sequence resets. The sequences are of variable length.
Such that I'd get something like:
sequence run
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
5 3
6 3
7 3
Try diff followed by cumsum:
df['run'] = df['sequence'].diff().ne(1).cumsum()
Out[349]:
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 3
12 3
13 3
14 3
Name: sequence, dtype: int32
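Put together, a minimal self-contained sketch of this approach on the sequence from the question:
import pandas as pd

df = pd.DataFrame({'sequence': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7]})
# a new run starts wherever the step from the previous value is not +1
df['run'] = df['sequence'].diff().ne(1).cumsum()
The first diff() is NaN, which also compares unequal to 1, so the very first row correctly opens run 1.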
Use:
dataset['run'] = dataset.groupby('sequence').cumcount().add(1)
output example:
sequence run
y 1
a 1
g 1
a 2
b 1
a 3
b 2
I am trying to count the consecutive elements in a data frame and store them in a new column. I don't want to count the total number of times an element appears overall in the list, but how many times it appeared consecutively. I used this:
a=[1,1,3,3,3,5,6,3,3,0,0,0,2,2,2,0]
df = pd.DataFrame(list(zip(a)), columns =['Patch'])
df['count'] = df.groupby('Patch').Patch.transform('size')
print(df)
this gave me a result like this:
Patch count
0 1 2
1 1 2
2 3 5
3 3 5
4 3 5
5 5 1
6 6 1
7 3 5
8 3 5
9 0 4
10 0 4
11 0 4
12 2 3
13 2 3
14 2 3
15 0 4
However, I want the result to be like this:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
df = (
    df.groupby((df.Patch != df.Patch.shift(1)).cumsum())  # label and group consecutive runs of the same value
    .agg({"Patch": ("first", "count")})                    # keep each run's value and its length
    .reset_index(drop=True)
    .droplevel(level=0, axis=1)                            # flatten the resulting MultiIndex columns
    .rename(columns={"first": "Patch"})
)
print(df)
Prints:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
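Starting from the original df built in the question, an equivalent variant uses named aggregation, which avoids dropping and renaming column levels; a minimal sketch:
run_id = (df.Patch != df.Patch.shift()).cumsum()   # label each consecutive run
out = (
    df.groupby(run_id)
      .agg(Patch=('Patch', 'first'), count=('Patch', 'count'))
      .reset_index(drop=True)
)
This produces the same eight-row result as above.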
I have a Series that looks like this
col
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 1
11 2
and I would like to generate a second counter that looks like this
col col2
0 1 1
1 2 1
2 3 1
3 4 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
How can I do that in Python?
If 1 always marks the start of a group, create a boolean mask by comparing with Series.eq and then take the cumulative sum with Series.cumsum:
df['col2'] = df['col'].eq(1).cumsum()
print (df)
col col2
0 1 1
1 2 1
2 3 1
3 4 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
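If the data did not always restart at exactly 1, a slightly more general variant, shown here as a minimal sketch, starts a new group whenever the value stops increasing:
# a new group begins wherever the value is not larger than the previous one
df['col2'] = df['col'].diff().le(0).cumsum().add(1)
On the data above this gives the same result, since every reset drops back to 1.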
I have a DataFrame with values 1 and 2.
I want to add a row at the end of the DataFrame, counting the number of 1s in each column. It should be similar to
COUNTIF(A:A,1) dragged across all columns in Excel.
I tried something like df.loc['lastrow']=df.count()[1], but the result is not correct.
How can I do it, and what does count()[1] do?
You can append the result of the sum:
df.append(df.eq(1).sum(), ignore_index=True)
You can just compare your dataframe to the value you are interested in (1 for example), and then perform a sum on these booleans, like:
>>> df
0 1 2 3 4
0 2 2 2 1 2
1 2 2 2 2 2
2 2 1 2 1 1
3 1 2 2 1 1
4 2 2 1 2 1
5 2 2 2 2 2
6 1 1 1 1 2
7 2 2 1 1 1
8 1 1 1 2 1
9 2 2 1 2 1
>>> (df == 1).sum()
0 3
1 3
2 5
3 5
4 6
dtype: int64
You can thus append that row, like:
>>> df.append((df == 1).sum(), ignore_index=True)
0 1 2 3 4
0 2 2 2 1 2
1 2 2 2 2 2
2 2 1 2 1 1
3 1 2 2 1 1
4 2 2 1 2 1
5 2 2 2 2 2
6 1 1 1 1 2
7 2 2 1 1 1
8 1 1 1 2 1
9 2 2 1 2 1
10 3 3 5 5 6
The last row here thus contains the number of 1s of the previous rows.
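Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the same row can be added with pd.concat; a minimal sketch:
import pandas as pd

counts_row = (df == 1).sum().to_frame().T            # one-row DataFrame with the per-column counts
out = pd.concat([df, counts_row], ignore_index=True)
The result is the same DataFrame with the counts appended as the last row.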