Grouping two classes in a dataframe with start and stop time - python

I have a DataFrame consisting of time (from a video) and label (based on the occurrence of an action in the video). I would like to get a list for each label that shows the start and finish time of every occurrence.
Time(sec)  Label
76         0
77         0
78         0
79         1
80         1
81         1
82         0
83         0
84         1
The expected output should look like this:
Label_Class_0 = [[76,78],[82,83],...]
Label_Class_1 = [[79,81],[84,..],...]
Thank you

Use this code (note that after reset_index there are four columns, so all four need names):
df['merger'] = df['Label'].ne(df['Label'].shift(1)).cumsum()  # new group whenever the label changes
df = df.groupby('merger').agg({'Label': 'first', 'Time(sec)': ['min', 'max']}).reset_index()
df.columns = ['merger', 'label', 'start', 'stop']
df['interval'] = list(zip(df['start'], df['stop']))
Label_Class_0 = df[df['label'].eq(0)]['interval'].to_list()
Label_Class_1 = df[df['label'].eq(1)]['interval'].to_list()
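With the sample data above this yields (zip produces tuples; wrap each in list(...) if you need nested lists):
Label_Class_0 = [(76, 78), (82, 83)]
Label_Class_1 = [(79, 81), (84, 84)]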

Related

group by and concatenate dataframe

I have a df with frame, m_label, and details columns, any of which can contain duplicates; the same frame may hold different labels with different details. Each m_label+details pair follows a fixed pattern drawn from a few options: for example, Findings may be PL or DV, so a "Findings PL start" always has a matching "Findings PL end". The exception is BBPS, which may start with details 3 and end with 2 (or with the same number). In the end I need to know when each label starts (for example, Action IR starts in frame 31) and when it ends (Action IR ends in frame 101).
This is my input:
frame m_label details
0 BBPS 3
0 BBPS start
0 Findings DV
0 Findings start
0 Findings DV
0 Findings end
31 Actions IR
31 Actions start
99 BBPS 2
99 Findings PL
99 Findings start
99 BBPS end
99 Findings PL
99 Findings end
101 Action IR
101 Action end
So I want to convert this df to something like this:
frame m_label details
0 Findings.DV start
0 Findings.DV end
0 BBPS.3 start
31 Actions.IR start
99 Action.IR end
99 Findings.PL start
99 Findings.PL end
99 BBPS.2 end
101 Action.IR end
So I need to concatenate only the rows without start/end, and then groupby(?) or transform(?) by frame.
I tried this code, but then I got stuck:
import numpy as np

def concat_func(x):
    if x['details'] not in ['start', 'end']:
        result = x['m_label'] + '.' + x['details']
    else:
        result = np.nan
    return result

data_cv["concat"] = data_cv[["m_label", "details"]].apply(concat_func, axis=1)
First I find it useful to move the start/end info to a new column, which is done by merging together the rows that have start/end on one side and the ones that don’t on the other:
>>> detail_type = df['details'].isin({'start', 'end'})
>>> df = pd.merge(df[~detail_type], df[detail_type].rename(columns={'details': 'detail_type'}))
>>> df
frame m_label details detail_type
0 0 BBPS 3 start
1 0 Findings DV start
2 0 Findings DV end
3 0 Findings DV start
4 0 Findings DV end
5 31 Actions IR start
6 99 BBPS 2 end
7 99 Findings PL start
8 99 Findings PL end
9 99 Findings PL start
10 99 Findings PL end
11 101 Action IR end
Now we can replace the two columns with their concatenated text:
>>> df = df.drop(columns=['m_label', 'details']).join(df['m_label'].str.cat(df['details'], sep='.'))
>>> df.drop_duplicates()
frame detail_type m_label
0 0 start BBPS.3
1 0 start Findings.DV
2 0 end Findings.DV
5 31 start Actions.IR
6 99 end BBPS.2
7 99 start Findings.PL
8 99 end Findings.PL
11 101 end Action.IR
You could even pivot to have a start and an end column:
>>> df.drop_duplicates().pivot(columns='detail_type', index='m_label', values='frame')
detail_type end start
m_label
Action.IR 101.0 NaN
Actions.IR NaN 31.0
BBPS.2 99.0 NaN
BBPS.3 NaN 0.0
Findings.DV 0.0 0.0
Findings.PL 99.0 99.0
But for that to be efficient you’ll first need to define rules that uniquely name your labels, e.g. BBPS regardless of details 2 and 3, Action / Actions always spelled the same way, etc.
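For instance, a minimal normalization sketch to run before the concatenation (the exact rules here are assumptions based on the sample data):
# hypothetical rules: unify the Action/Actions spelling and ignore the BBPS 2/3 detail
df['m_label'] = df['m_label'].replace({'Actions': 'Action'})
df.loc[df['m_label'] == 'BBPS', 'details'] = 'any'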
I don't think groupby would help, as the order inside the group also matters.
Try this (since you didn't post the df in a copy-pasteable way, I can't test it myself):
df = df.assign(new_label=None).sort_values(['frame', 'm_label'])
df.loc[~df['details'].isin(['start', 'end']), 'new_label'] = df['m_label'] + '.' + df['details']
# pull the start/end marker up from the following row when frame and m_label match
mask = ((df['frame'] == df['frame'].shift(-1))
        & (df['m_label'] == df['m_label'].shift(-1))
        & df['details'].shift(-1).isin(['start', 'end']))
df.loc[mask, 'details'] = df['details'].shift(-1)
df = df.loc[pd.notna(df['new_label']) & df['details'].isin(['start', 'end']),
            ['frame', 'new_label', 'details']]

How to group by a df in Python by a column with the difference between the max value of one column and the min of another column?

I have a data frame which looks like this:
student_id  session_id  reading_level_id  st_week  end_week
1           3334        3                 3        3
1           3335        2                 4        4
2           3335        2                 2        2
2           3336        2                 2        3
2           3337        2                 3        3
2           3339        2                 3        4
...
There are multiple session_ids, st_weeks, and end_weeks for every student_id. I'm trying to group the data by student_id and calculate the difference between the maximum end_week and the minimum st_week for each student.
Aiming for an output that would look something like this:
Student_id  Diff
1           1
2           2
...
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.
Using the data you shared, a simpler solution is possible:
Group by student_id, passing as_index=False so that student_id stays a regular column and a dataframe is returned;
Next, use a named aggregation to get the max of end_week and the min of st_week for each group;
Get the difference between max_wk and min_wk;
Finally, keep only the required columns:
(
    df.groupby("student_id", as_index=False)
    .agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
    .assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
    .loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2
There's probably a more efficient way to do this, but I broke this into separate steps for the grouping to get max and min values for each id, and then created a new column representing the difference. I used numpy's randint() function in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79
You can create a custom function and apply it to a group-by over students:
def week_diff(g):
    return g.end_week.max() - g.st_week.min()

df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64

How to make box plot group by axis 1

I have dataframe:
id meters availability1 availability2 availability3
0 0 70 80 90
1 50 75 75 80
2 100 100 90 100
3 150 87 85 80
4 200 60 90 100
I want to create a box plot that shows the availability for each specific meter.
For example, for meter 0 the availability ranges from 70 to 90.
So I want to create a box plot for each row, not each column. I cannot find how to do this without changing the structure of my dataframe.
The code that I use is the following:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
file = 'D:\\test_box_plot.csv'
df = pd.read_csv(file, sep = ";", usecols = ['availability1','availability2','availability3'])
sns.boxplot(x="variable", y="value", data=pd.melt(df))
plt.show()
I would appreciate any help.
I think this is a possible solution, but I'm not quite sure I've understood what you wanted:
boxMeters = sns.boxplot(x=0, y=1, data=df.transpose(), palette="Set3")
The trick here is to work with a transposed matrix of your dataframe.
I suggest you print the transposed dataframe to see how to reference every column.
With the data you posted, my transposed dataframe is:
0 1 2 3 4
id 0 1 2 3 4
meters 0 50 100 150 200
availability1 70 75 100 87 60
availability2 80 75 90 85 90
availability3 90 80 100 80 100
"meters" is just below the column addressed as 0 and the availability1 is the column 1.
You tell me if it works for you.
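If you'd rather keep the long format you were already using with pd.melt, here is an alternative sketch (using melt with meters as the id column instead of transposing; the data below is retyped from your post) that draws one box per meter:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                   'meters': [0, 50, 100, 150, 200],
                   'availability1': [70, 75, 100, 87, 60],
                   'availability2': [80, 75, 90, 85, 90],
                   'availability3': [90, 80, 100, 80, 100]})

# keep 'meters' as the identifier so each original row becomes one box
long_df = df.melt(id_vars='meters',
                  value_vars=['availability1', 'availability2', 'availability3'],
                  value_name='availability')
sns.boxplot(x='meters', y='availability', data=long_df)
plt.show()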

Convert Column in CSV to list and sum the total of each return

I am trying to sum the total of returns for each name in a column taken from a CSV file.
For example, the returns are a, b, c, d, a, c, d, c, b... and so on, in no particular order.
I would like to:
Print the returns into a separate file (i.e. getting a, b, c, and d)
Total the number of times EACH return was found in the column.
I want my printed return in a separate file to look something like this:
a: 345
b: 230
c: 450
d: 234
Try the to_json() method.
Here is an example.
import pandas as pd
import numpy as np
n=20
columns_name = list('abcd')
df = pd.DataFrame(data=np.random.randint(1, 100, size=(5, 4)),
                  columns=columns_name)
print(df)
df.sum().to_json("result.json")
The output to the console will be:
a b c d
0 56 91 65 82
1 63 65 50 78
2 46 43 75 3
3 37 96 84 13
4 40 59 61 66
The file will contain:
{"a":165,"b":230,"c":234,"d":336}
Hope it solves your problem.
Please see image. I would like to count how many times each name appears in "candidates" to see how many times each candidate received a vote.
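If the goal is to count how many times each value occurs rather than to sum numeric columns, value_counts() gives the tallies directly; a minimal sketch (the 'candidates' column name comes from the comment above, and the data is made up):

import pandas as pd

df = pd.DataFrame({'candidates': ['a', 'b', 'c', 'd', 'a', 'c', 'd', 'c', 'b']})
counts = df['candidates'].value_counts()  # how many times each name appears
print(counts)
counts.to_json('counts.json')  # write the tallies to a separate file, as above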

Reading a Specific Row of a CSV File based on the 1st Occurrence of a Value within a Column

Below is the CSV File that I have:
Record Time Value 1 Value 2 Value 3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13
I would like to use pandas in order to parse through the 'Record' Column until I find the first FFC and then print that entire row. Additionally, I would like to print the row that is two above the first found FFC. Any suggestions on how to approach this?
My reasoning for wanting to use Pandas is that I am going to need to call upon specific values within the two printed rows and plot them.
To start I have:
csvfile = pd.read_csv('Test.csv')
print(csvfile)
Thank you very much for your assistance, it is greatly appreciated!
This is one way.
import pandas as pd
from io import StringIO
mystr = StringIO("""Record Time Value1 Value2 Value3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr, sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
# get index of condition
idx = df[df['Record'] == 'FFC'].index[0]
# filter for appropriate indices
res1 = df.loc[idx]
res2 = df.loc[idx-2]
To output a dataframe:
print(res1.to_frame().T)
# Record Time Value1 Value2 Value3
# 3 FFC 4 12 89 94
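And likewise for the row two above the first FFC:
print(res2.to_frame().T)
# Record Time Value1 Value2 Value3
# 1 Event 2 48 43 56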
