Rename x rows after value in DataFrame - python

I have a pandas DataFrame containing a time series of data.
Every full second contains a string with the name of the point. The next 4 rows after it contain random point ids; I want to rename those to the name of that first row, with a numeric suffix appended.
time ID
12:00:00,00 pointname1
12:00:00,20 12345
12:00:00,40 45645
12:00:00,60 78963
12:00:00,80 23432
12:00:01,00 pointname2
12:00:01,20 53454
12:00:01,40 24324
12:00:01,60 24324
12:00:01,80 42435
I want to transform this into:
time ID
12:00:00,00 pointname1
12:00:00,20 pointname1_1
12:00:00,40 pointname1_2
12:00:00,60 pointname1_3
12:00:00,80 pointname1_4
12:00:01,00 pointname2
12:00:01,20 pointname2_1
12:00:01,40 pointname2_2
12:00:01,60 pointname2_3
12:00:01,80 pointname2_4
I have a working solution that iterates over the entire DataFrame, detects the 'pointname' rows, and renames the 4 rows after each. However, that takes a very long time with the 1.3 million rows the data contains. Is there a more clever and efficient way of doing this?
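For reference, a minimal frame reproducing the sample data (times kept as plain strings):
import pandas as pd

df = pd.DataFrame({
    'time': ['12:00:00,00', '12:00:00,20', '12:00:00,40', '12:00:00,60', '12:00:00,80',
             '12:00:01,00', '12:00:01,20', '12:00:01,40', '12:00:01,60', '12:00:01,80'],
    'ID':   ['pointname1', '12345', '45645', '78963', '23432',
             'pointname2', '53454', '24324', '24324', '42435'],
})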

Use Series.str.startswith with Series.where to set the non-matching values to missing, then forward fill them; finally build a counter with GroupBy.cumcount and append it to every value except the first of each group:
# keep only the 'pointname' rows, set the rest to missing, then forward fill
df['ID'] = df['ID'].where(df['ID'].str.startswith('pointname')).ffill()
# append a '_'-prefixed per-group counter; the exact value '_0' (first row of a group) is replaced by ''
df['ID'] += df.groupby('ID').cumcount().astype(str).radd('_').replace('_0','')
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4

You can use to_numeric (or str.startswith if your identifier is literal; the only important point is to get True for the rows to use as reference) to identify the ID rows, then for all other rows use ffill and groupby.cumcount to build the new identifier:
# find rows with string identifier (could use other methods)
m = pd.to_numeric(df['ID'], errors='coerce').isna()
# or if "pointname" is literal
# m = df['ID'].str.startswith('pointname')
# for non matching rows, use previous value
# and add group counter
df.loc[~m, 'ID'] = (df['ID'].where(m).ffill()
                    + '_'
                    + df.groupby(m.cumsum()).cumcount().astype(str)
                    )
output:
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4

You can group by the time part of the time column and transform the ID column to add a suffix to every value after the first in each group.
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))['ID']
              .transform(lambda col: [f'{col.iloc[0]}_{s}' for s in [''] + list(range(1, len(col)))])
              .str.rstrip('_'))
# or
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))['ID']
              .transform(lambda col: [col.iloc[0]] + [f'{col.iloc[0]}_{s}' for s in range(1, len(col))]))
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4


nunique compare two Pandas dataframe with duplicates and pivot them

My input:
import pandas as pd

df1 = pd.DataFrame({'frame': [1, 1, 1, 2, 3, 0, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 8, 9, 9, 10],
                    'label': ['GO', 'PL', 'ICV', 'CL', 'AO', 'AO', 'AO', 'ICV', 'PL', 'TI',
                              'PL', 'TI', 'PL', 'CL', 'CL', 'AO', 'TI', 'PL', 'ICV', 'ICV'],
                    'user': ['user1'] * 20})
df2 = pd.DataFrame({'frame': [1, 1, 2, 3, 4, 0, 1, 2, 2, 2, 4, 4, 5, 6, 6, 7, 8, 9, 10, 11],
                    'label': ['ICV', 'GO', 'CL', 'TI', 'PI', 'AO', 'GO', 'ICV', 'TI', 'PL',
                              'ICV', 'TI', 'PL', 'CL', 'CL', 'CL', 'AO', 'AO', 'PL', 'ICV'],
                    'user': ['user2'] * 20})
df_c = pd.concat([df1, df2])
I'm trying to compare the two DataFrames frame by frame, checking whether each label in df1 exists in the same frame in df2, and then do some calculation with the result (a pivot, for example).
Here is my code:
m_df = df1.merge(df2, on=['frame'], how='outer')
m_df['cross'] = m_df.apply(lambda row: 'Matched'
                           if row['label_x'] == row['label_y']
                           else 'Mismatched', axis='columns')
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0, margins=True)
pv_mc = pd.pivot_table(m_df,
                       columns='cross',
                       index='label_x',
                       values='frame',
                       aggfunc=pd.Series.count, fill_value=0, margins=True)
But this creates some problems:
First, I can't calculate a "simple" total (the All column) of matched and mismatched: it comes out duplicated (as for AO in pv_mc) or as a wrong number (as for CL in pv_m_unq).
Second, I don't think the merge as I use it is a clever approach, because whenever frame+label is repeated in a df (which happens often), the merged df gets (number of rows in df1) × (number of rows in df2) for that specific frame+label.
Is there a smarter way to compare the DataFrames and pivot them?
You got the unexpected result in the margin total because the margin is calculated with the same function passed to aggfunc (i.e. pd.Series.nunique in this case), and the values of Matched and Mismatched in those two rows are both 1 (hence only one unique value of 1). You are currently getting the unique count of frame ids.
You can probably achieve more or less what you want by taking the count (including for the margin, Matched, and Mismatched) instead of the unique count of frame ids, by using pd.Series.count in the last lines of code:
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross Matched Mismatched All
label_x
AO 0 1 1
CL 1 0 1
GO 1 1 2
ICV 1 1 2
PL 0 2 2
All 3 5 8
Edit
If all you need is to have the All column being the sum of Matched and Mismatched, you can do it as follows:
Change your code of generating pv_m_unq without building margin:
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
                     'Mismatched': pv_m_unq['Mismatched'].sum(),
                     'All': pv_m_unq['All'].sum()},
                    name='All')
# DataFrame.append was removed in pandas 2.0; concatenate the row instead
pv_m_unq = pd.concat([pv_m_unq, row_All.to_frame().T])
Result:
print(pv_m_unq)
Matched Mismatched All
label_x
AO 1 3 4
CL 1 2 3
GO 1 1 2
ICV 2 4 6
PL 1 5 6
TI 2 3 5
All 8 18 26
You can use the isin() function like this:
df3 = df1[df1.label.isin(df2.label)]
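Note that isin alone ignores the frame. A per-frame variant (a sketch) merges on both columns with an indicator, which also avoids the row multiplication you get from duplicated frame+label pairs:
m_df = df1.merge(df2[['frame', 'label']].drop_duplicates(),
                 on=['frame', 'label'], how='left', indicator='cross')
m_df['cross'] = m_df['cross'].map({'both': 'Matched', 'left_only': 'Mismatched'})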

changing index of 1 row in pandas

I have the below df, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise, please? I tried reindex and custom sorting but can't find the way.
Thanks!
Here is the table (reproduced under "Before:" in the answer below).
Since you can't insert the row and push the others back directly, a clever trick is to create a new order:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
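Since the question mentions trying reindex, that approach also works. A minimal sketch, assuming integer week labels (with an object/string index, use '53' instead):
import pandas as pd

df = pd.DataFrame({'numbers': [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
                  index=pd.Index([1, 2, 3, 4, 53], name='weeks'))

# put week 53 first and keep the remaining order unchanged
new_order = [53] + [w for w in df.index if w != 53]
df = df.reindex(new_order)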
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
# move the last row to the front by listing its index label first
# (add .reset_index(drop=True) afterwards if you want a fresh 0..n-1 index)
new_df = df.loc[[df.index[-1]] + list(df.index[:-1])]
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])

python pandas new column with order of values

I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like a new column with 2,1,4,0,3,5, so that looking at a row tells me in what order its number comes in the total list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
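For example, a small illustration with made-up values containing a tie:
import pandas as pd

s = pd.Series([1, 4, 4, 5])
print(s.rank(method='min').astype(int).tolist())    # [1, 2, 2, 4]: the tie leaves a gap
print(s.rank(method='dense').astype(int).tolist())  # [1, 2, 2, 3]: no gap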

Pandas countifs value from column and other column not null

I am trying to perform the equivalent to an Excel COUNTIFS formula in pandas, where the first range is a dataframe column, and the search criteria is each value in that column. The second search range is a different column and the criteria is non null values in that column.
Written as an Excel formula, it would look like: COUNTIFS(A:A,A2,B:B,"<>")
Here is some sample data:
data = {'ADJL':['BCF-364/BTS-1091/ADJL-4', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-581/BTS-1742/ADJL-1', 'BCF-581/BTS-1742/ADJL-1'],
'LNCEL':['LNBTS-55/LNCEL-63', '', 'LNBTS-801/LNCEL-62', '', 'LNBTS-801/LNCEL-61', '', '', '']}
df = pd.DataFrame(data)
I need to add two columns to this. The first is a count of each "ADJL" value. I found this solution for that column:
df['Count_of_ADJL'] = df.groupby('ADJL')['ADJL'].transform('count')
What I am stuck on is the next one. I need to calculate how many times the value in ADJL occurs throughout the entire ADJL column AND the LNCEL column is not empty.
I removed many other columns to simplify my question, so a solution where I can just add another column is ideal.
Many thanks in advance.
Use groupby.transform with np.count_nonzero:
import numpy as np

df['Count_of_ADJL'] = df.groupby('ADJL')['ADJL'].transform('count')
df['Count_of_ADJL & LNCEL not null'] = df.groupby('ADJL')['LNCEL'].transform(np.count_nonzero)
# 'count' only skips NaN, so it would also count the empty strings here;
# use it instead only if the blanks are NaN:
# df['Count_of_ADJL & LNCEL not null'] = df.groupby('ADJL')['LNCEL'].transform('count')
print(df)
ADJL LNCEL Count_of_ADJL \
0 BCF-364/BTS-1091/ADJL-4 LNBTS-55/LNCEL-63 1
1 BCF-130/BTS-389/ADJL-1 5
2 BCF-130/BTS-389/ADJL-1 LNBTS-801/LNCEL-62 5
3 BCF-130/BTS-389/ADJL-1 5
4 BCF-130/BTS-389/ADJL-1 LNBTS-801/LNCEL-61 5
5 BCF-130/BTS-389/ADJL-1 5
6 BCF-581/BTS-1742/ADJL-1 2
7 BCF-581/BTS-1742/ADJL-1 2
Count_of_ADJL & LNCEL not null
0 1
1 2
2 2
3 2
4 2
5 2
6 0
7 0
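An equivalent without numpy sums a boolean mask per group (a sketch, assuming empty strings mark the missing LNCEL values, as in the sample data):
df['Count_of_ADJL & LNCEL not null'] = (
    df['LNCEL'].ne('')                  # True where LNCEL is non-empty
               .groupby(df['ADJL'])     # group the mask by the ADJL values
               .transform('sum')        # per-group count of True
               .astype(int)
)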

Extract data from index in new column in Dataframe

How do I extract data based on index values in different columns?
So far I was able to extract data based on index number in the same column (block of 5).
The Dataframe looks like this:
3017 39517.3886
3018 39517.4211
3019 39517.4683
3020 39517.5005
3021 39517.5486
5652 39628.1622
5653 39628.2104
5654 39628.2424
5655 39628.2897
5656 39628.3229
5677 39629.2020
5678 39629.2342
5679 39629.2825
5680 39629.3304
5681 39629.3628
where the data extracted into the column are +/-2 rows around the index value.
I would like to have something that looks more like this:
3017-3021 5652-5656 5677-5681
1 39517.3886 39628.1622 39629.2020
2 39517.4211 39628.2104 39629.2342
3 39517.4683 39628.2424 39629.2825
4 39517.5005 39628.2897 39629.3304
5 39517.5486 39628.3229 39629.3628
and so on depending on the number of data that I want to extract.
The code I'm using to extract data based on index is:
## find the indices based on the first 0 of a 000 - 111 list
a = stim_epoc[1:]
ss = [(num + 1) for num, i in enumerate(zip(stim_epoc, a)) if i == (0, 1)]
## extract data from a df (GCaMP_ps) based on the previous indices 'ss'
fin = [i for x in ss for i in range(x - 2, x + 2 + 1) if i in range(len(GCaMP_ps))]
df = time_fip.loc[np.unique(fin)]
print(df)
Form groups of 5 consecutive rows (since you pull +/-2 rows around a center), then create the column and index labels and pivot:
df = df.reset_index()
s = df.index//5 # If always 5 consecutive values. I.e. +/-2 rows from a center.
df['col'] = df.groupby(s)['index'].transform(lambda x: '-'.join(map(str, x.agg(['min', 'max']))))
df['idx'] = df.groupby(s).cumcount()
df.pivot(index='idx', columns='col', values=0) # Assuming column named `0`
Output:
col 3017-3021 5652-5656 5677-5681
idx
0 39517.3886 39628.1622 39629.2020
1 39517.4211 39628.2104 39629.2342
2 39517.4683 39628.2424 39629.2825
3 39517.5005 39628.2897 39629.3304
4 39517.5486 39628.3229 39629.3628
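For reference, a minimal sketch reproducing the input above so the answer runs end to end (the single data column is named 0, matching values=0 in the pivot):
import pandas as pd

idx = [3017, 3018, 3019, 3020, 3021,
       5652, 5653, 5654, 5655, 5656,
       5677, 5678, 5679, 5680, 5681]
vals = [39517.3886, 39517.4211, 39517.4683, 39517.5005, 39517.5486,
        39628.1622, 39628.2104, 39628.2424, 39628.2897, 39628.3229,
        39629.2020, 39629.2342, 39629.2825, 39629.3304, 39629.3628]
df = pd.DataFrame({0: vals}, index=idx)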
