Slicing pandas dataframe on equal column values - python

I have a pandas df that looks like this:
import pandas as pd
df = pd.DataFrame({0:[1],5:[1],10:[1],15:[1],20:[0],25:[0],
30:[1],35:[1],40:[0],45:[0],50:[0]})
df
The column names reflect coordinates. I would like to retrieve the start and end coordinate of columns with consecutive equal numbers.
The output should be something like this:
# start,end
0,15
20,25
30,35
40,50

IIUC, use groupby with diff and cumsum to split the groups:
s=df.T.reset_index()
s=s.groupby(s[0].diff().ne(0).cumsum())['index'].agg(['first','last'])
Out[241]:
   first  last
0
1      0    15
2     20    25
3     30    35
4     40    50

Use cumsum to identify the groups, then groupby:
s = df.iloc[0].diff().ne(0).cumsum()
(df.columns.to_series()
   .groupby(s).agg(['min', 'max'])
)
Output:
   min  max
0
1    0   15
2   20   25
3   30   35
4   40   50
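Putting the second approach together as a self-contained script (using the df from the question):

```python
import pandas as pd

# DataFrame from the question: column names are coordinates, row 0 holds the values
df = pd.DataFrame({0: [1], 5: [1], 10: [1], 15: [1], 20: [0], 25: [0],
                   30: [1], 35: [1], 40: [0], 45: [0], 50: [0]})

# Each change in value starts a new run; cumsum assigns a run id
groups = df.iloc[0].diff().ne(0).cumsum()

# For each run of equal values, take the smallest and largest coordinate
out = df.columns.to_series().groupby(groups).agg(['min', 'max'])
print(out)
```

The groupby works because `groups` is a Series indexed by the same column labels, so pandas aligns it with `df.columns.to_series()` on the index.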

Python 3 match values based on column name similarity

I have a dataframe of the following form:
   Year 1 Grade  Year 2 Grade  Year 3 Grade  Year 4 Grade  Year 1 Students  Year 2 Students  Year 3 Students  Year 4 Students
0            60            70            80           100               20               32               18               25
I would like to somehow transpose this table to the following format:
Year  Grade  Students
1     60     20
2     70     32
3     80     18
4     100    25
I created a list of years and initiated a new dataframe with the "year" column. I was thinking of matching the year integer to the column name containing it in the original DF, match and assign the correct value, but got stuck there.
You need a manual reshaping using a split of the Index into a MultiIndex:
out = (df
       .set_axis(df.columns.str.split(expand=True), axis=1)  # make MultiIndex
       .iloc[0]              # select row as Series
       .unstack()            # unstack Grade/Students
       .droplevel(0)         # remove literal "Year"
       .rename_axis('Year')  # set index name
       .reset_index()        # index to column
       )
output:
Year Grade Students
0 1 60 20
1 2 70 32
2 3 80 18
3 4 100 25
Or using pivot_longer from janitor:
# pip install pyjanitor
import janitor
out = (df.pivot_longer(
           names_to=('ignore', 'Year', '.value'),
           names_sep=' ')
         .drop(columns='ignore')
       )
out
Year Grade Students
0 1 60 20
1 2 70 32
2 3 80 18
3 4 100 25
The .value determines which parts of the column sub-labels are retained; the labels are split apart by names_sep, which can be a string or a regex. Another option is to use a regex with names_pattern to split and reshape the columns:
df.pivot_longer(names_to=('Year', '.value'),
                names_pattern=r'.+(\d)\s(.+)')
Year Grade Students
0 1 60 20
1 2 70 32
2 3 80 18
3 4 100 25
Here's one way to do it. Feel free to ask questions about how it works.
import pandas as pd
cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade" , "Year 4 Grade",
"Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
vals = [60,70,80,100,20,32,18,25]
vals = [[v] for v in vals]
df = pd.DataFrame({k:v for k,v in zip(cols,vals)})
grades = df.filter(like="Grade").T.reset_index(drop=True).rename(columns={0:"Grades"})
students = df.filter(like="Student").T.reset_index(drop=True).rename(columns={0:"Students"})
pd.concat([grades,students], axis=1)
I came up with this. Here grades is your first row of values.
import numpy as np
import pandas as pd

grades = [60, 70, 80, 100, 20, 32, 18, 25]  # your first row of values
arr = np.array(grades).reshape(2, 4).T  # 2 rows (Grade, Students) of 4 years each, transposed to 4x2
new_df = pd.DataFrame(arr).reset_index()
new_df.columns = ['Year', 'Grade', 'Students']  # rename the columns
new_df['Year'] += 1  # shift the 0-3 index to years 1-4
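A plain-pandas alternative (no janitor) is to extract the year number and the measure name from each column label with a regex, then pivot back to wide form. A sketch, rebuilding the same one-row frame:

```python
import pandas as pd

cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade", "Year 4 Grade",
        "Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
df = pd.DataFrame([[60, 70, 80, 100, 20, 32, 18, 25]], columns=cols)

melted = df.melt()  # one row per original column: 'variable' and 'value'
# Split "Year 1 Grade" into the year number and the measure name
melted[['Year', 'measure']] = melted['variable'].str.extract(r'Year (\d+) (\w+)')
out = melted.pivot(index='Year', columns='measure', values='value').reset_index()
print(out)
```

Note that the extracted Year values are strings; cast with `astype(int)` if you need integers.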

Pandas Dataframe groupby with overlapping

I'm using a pandas dataframe to read a CSV that has data points for machine learning. I'm trying to come up with a way to index the dataframe so that it returns that index and the next N rows. I don't want to group the dataframe into non-overlapping bins (i.e. index 0:4, 4:8, etc.). What I do want is a result like this: index 0:4, 1:5, 2:6, etc. How would this be done?
Maybe you can create a list of DataFrames, like:
import pandas as pd
import numpy as np
nrows = 7
group_size = 5
df = pd.DataFrame({'col1': np.random.randint(0, 10, nrows)})
print(df)
grp = [df.iloc[x:x+group_size] for x in range(df.shape[0] - group_size + 1)]
print(grp[1])
Original DataFrame:
col1
0 2
1 6
2 6
3 5
4 3
5 3
6 8
2nd DataFrame from the list of DataFrames:
col1
1 6
2 6
3 5
4 3
5 3
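If you need all the windows at once as a single 2-D array rather than a list of DataFrames, NumPy's sliding_window_view is an alternative (a sketch; it returns read-only views into the original array, not copies):

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({'col1': [2, 6, 6, 5, 3, 3, 8]})
group_size = 5

# Shape (n_windows, group_size); row i holds rows i..i+group_size-1 of col1
windows = sliding_window_view(df['col1'].to_numpy(), group_size)
print(windows)
```

Here `windows[1]` holds the same values as `grp[1]` above, just without the DataFrame index.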

Efficiency: Dropping rows with the same timestamp while keeping the median of the second column for that timestamp

What I want to do:
Column 'angle' contains about 20 angle readings per second (this can vary), but my 'Time' timestamp only has an accuracy of 1 s, so roughly ~20 rows share the same timestamp (the dataframe has over 1 million rows in total).
The result should be a new dataframe with a distinct timestamp in each row. The angle for each timestamp should be the median of the ~20 readings in that interval.
My idea:
I iterate through the rows and check whether the timestamp has changed.
If so, I select all rows since the last change, calculate the median, and append it to a new dataframe.
Nevertheless, I have many big data files and I am wondering if there is a faster way to achieve my goal.
Right now my code is the following (see below).
It is not fast and I think there must be a better way to do it with pandas/numpy (or something else?).
a = 0
for i in range(1, len(df1.index)):
    if df1.iloc[[a], [1]].iloc[0][0] == df1.iloc[[i], [1]].iloc[0][0]:
        continue
    else:
        if a == 0:
            df_result = df1[a:i-1].median()
        else:
            df_result = df_result.append(df1[a:i-1].median(), ignore_index=True)
        a = i
You can use groupby here. Below, I made a simple dummy dataframe.
import pandas as pd
df1 = pd.DataFrame({'time': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                    'angle': [8, 9, 7, 1, 4, 5, 11, 4, 3, 8, 7, 6]})
df1
time angle
0 1 8
1 1 9
2 1 7
3 1 1
4 1 4
5 1 5
6 2 11
7 2 4
8 2 3
9 2 8
10 2 7
11 2 6
Then, we group by the timestamp and take the median of the angle column within that group, and convert the result to a pandas dataframe.
df2 = pd.DataFrame(df1.groupby('time')['angle'].median())
df2 = df2.reset_index()
df2
time angle
0 1 6.0
1 2 6.5
You can also use .agg after the groupby to select the operation per column (note that the question's timestamp column is named 'Time'):
df1.groupby('Time', as_index=False).agg({"angle": "median"})

Cumsum as a new column in an existing Pandas dataframe

I have a pandas dataframe defined as:
A B SUM_C
1 1 10
1 2 20
I would like to do a cumulative sum of SUM_C and add it as a new column to the same dataframe. In other words, my end goal is to have a dataframe that looks like below:
A B SUM_C CUMSUM_C
1 1 10 10
1 2 20 30
Using cumsum in pandas on group() shows the possibility of generating a new dataframe where column name SUM_C is replaced with cumulative sum. However, my ask is to add the cumulative sum as a new column to the existing dataframe.
Thank you
Just apply cumsum on the pandas.Series df['SUM_C'] and assign it to a new column:
df['CUMSUM_C'] = df['SUM_C'].cumsum()
Result:
df
Out[34]:
A B SUM_C CUMSUM_C
0 1 1 10 10
1 1 2 20 30
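The question the asker links groups before summing; if you ever need the running total to restart within each group of A, groupby combines directly with cumsum. A sketch with hypothetical extra rows added for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 1, 2],
                   'SUM_C': [10, 20, 5, 15]})

df['CUMSUM_C'] = df['SUM_C'].cumsum()  # running total over the whole column
df['CUMSUM_C_BY_A'] = df.groupby('A')['SUM_C'].cumsum()  # restarts within each A group
print(df)
```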

pandas: append new column of row subtotals

This is very similar to this question, except I want my code to be able to apply to the length of a dataframe, instead of specific columns.
I have a DataFrame, and I'm trying to get a sum of each row to append to the dataframe as a column.
df = pd.DataFrame([[1,0,0],[20,7,1],[63,13,5]],columns=['drinking','drugs','both'],index = ['First','Second','Third'])
drinking drugs both
First 1 0 0
Second 20 7 1
Third 63 13 5
Desired output:
drinking drugs both total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
Current code:
df['total'] = df.apply(lambda row: (row['drinking'] + row['drugs'] + row['both']),axis=1)
This works great. But what if I have another dataframe, with seven columns, which are not called 'drinking', 'drugs', or 'both'? Is it possible to adjust this function so that it applies to the length of the dataframe? That way I can use the function for any dataframe at all, with a varying number of columns, not just a dataframe with columns called 'drinking', 'drugs', and 'both'?
Something like:
df['total'] = df.apply(for col in df: [code to calculate sum of each row]),axis=1)
You can use sum:
df['total'] = df.sum(axis=1)
If you need sum only some columns, use subset:
df['total'] = df[['drinking', 'drugs', 'both']].sum(axis=1)
What about something like this:
df.loc[:, 'Total'] = df.sum(axis=1)
with the output :
Out[4]:
drinking drugs both Total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
It will sum all columns by row.
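One caveat with summing every column: if the frame also contains non-numeric columns, restricting the sum to numeric ones avoids string concatenation. A sketch using the numeric_only flag, with a made-up text column for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 0], [20, 7, 1], [63, 13, 5]],
                  columns=['drinking', 'drugs', 'both'],
                  index=['First', 'Second', 'Third'])
df['label'] = ['a', 'b', 'c']  # hypothetical non-numeric column

# numeric_only=True skips the string column instead of trying to add it
df['total'] = df.sum(axis=1, numeric_only=True)
print(df)
```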
