Extract data based on index into new columns in a DataFrame - python

How do I extract data based on index values into different columns?
So far I have been able to extract data based on index number into a single column (in blocks of 5).
The DataFrame looks like this:
3017 39517.3886
3018 39517.4211
3019 39517.4683
3020 39517.5005
3021 39517.5486
5652 39628.1622
5653 39628.2104
5654 39628.2424
5655 39628.2897
5656 39628.3229
5677 39629.2020
5678 39629.2342
5679 39629.2825
5680 39629.3304
5681 39629.3628
The data extracted into the column are the rows +/- 2 around each index value.
I would like to have something that looks more like this:
3017-3021 5652-5656 5677-5681
1 39517.3886 39628.1622 39629.2020
2 39517.4211 39628.2104 39629.2342
3 39517.4683 39628.2424 39629.2825
4 39517.5005 39628.2897 39629.3304
5 39517.5486 39628.3229 39629.3628
and so on, depending on how much data I want to extract.
The code I'm using to extract data based on index is:
## find index based on the first 0 -> 1 transition of a 000 - 111 list
a = stim_epoc[1:]
ss = [num + 1 for num, pair in enumerate(zip(stim_epoc, a)) if pair == (0, 1)]
## extract data from a df (GCaMP_ps) based on the previous index 'ss'
fin = [i for x in ss for i in range(x - 2, x + 2 + 1) if 0 <= i < len(GCaMP_ps)]
df = time_fip.loc[np.unique(fin)]
print(df)

Form groups of 5 consecutive rows (since you pull +/- 2 rows from a center). Then create the column and index labels and pivot:
df = df.reset_index()
s = df.index//5 # If always 5 consecutive values. I.e. +/-2 rows from a center.
df['col'] = df.groupby(s)['index'].transform(lambda x: '-'.join(map(str, x.agg(['min', 'max']))))
df['idx'] = df.groupby(s).cumcount()
df.pivot(index='idx', columns='col', values=0) # Assuming column named `0`
Output:
col 3017-3021 5652-5656 5677-5681
idx
0 39517.3886 39628.1622 39629.2020
1 39517.4211 39628.2104 39629.2342
2 39517.4683 39628.2424 39629.2825
3 39517.5005 39628.2897 39629.3304
4 39517.5486 39628.3229 39629.3628
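Since every block has exactly 5 rows, an alternative, hedged sketch (a rough illustration using the df from the question, before reset_index, with the values in its first column) is to reshape the values directly:
import numpy as np
import pandas as pd
vals = df.iloc[:, 0].to_numpy().reshape(-1, 5).T            # one column per block of 5
idx = df.index.to_numpy().reshape(-1, 5)
cols = [f'{a}-{b}' for a, b in zip(idx[:, 0], idx[:, -1])]  # e.g. '3017-3021'
out = pd.DataFrame(vals, columns=cols)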

Related

nunique: compare two Pandas dataframes with duplicates and pivot them

My input:
df1 = pd.DataFrame({'frame': [1,1,1,2,3,0,1,2,2,2,3,4,4,5,5,5,8,9,9,10],
                    'label': ['GO','PL','ICV','CL','AO','AO','AO','ICV','PL','TI','PL','TI','PL','CL','CL','AO','TI','PL','ICV','ICV'],
                    'user': ['user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1']})
df2 = pd.DataFrame({'frame': [1,1,2,3,4,0,1,2,2,2,4,4,5,6,6,7,8,9,10,11],
                    'label': ['ICV','GO','CL','TI','PI','AO','GO','ICV','TI','PL','ICV','TI','PL','CL','CL','CL','AO','AO','PL','ICV'],
                    'user': ['user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2']})
df_c = pd.concat([df1,df2])
I am trying to compare the two DataFrames frame by frame, checking whether a label in df1 exists in the same frame in df2, and then make some calculation with the result (a pivot, for example).
This is my code:
m_df = df1.merge(df2, on=['frame'], how='outer')
m_df['cross'] = m_df.apply(lambda row: 'Matched'
                           if row['label_x'] == row['label_y']
                           else 'Mismatched', axis='columns')
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0, margins=True)
pv_mc = pd.pivot_table(m_df,
                       columns='cross',
                       index='label_x',
                       values='frame',
                       aggfunc=pd.Series.count, fill_value=0, margins=True)
but this creates some problems:
first, I cannot get a correct "simple" total (the All column) of matched and mismatched as depicted in the picture: it is either "duplicated", as for AO in pv_m, or a wrong number, as for CL in pv_m_unq.
second, I think the merge method as I use it is not a clever way, because if a frame+label pair is repeated in a df (which happens often), in the merged df I get (number of rows in df1) x (number of rows in df2) for that specific frame+label.
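(For illustration, a tiny hypothetical sketch of that row multiplication: a frame repeated twice on each side produces 2 x 2 = 4 merged rows when merging on frame alone.)
import pandas as pd
left = pd.DataFrame({'frame': [2, 2], 'label': ['PL', 'TI']})
right = pd.DataFrame({'frame': [2, 2], 'label': ['PL', 'ICV']})
print(left.merge(right, on=['frame'], how='outer'))  # 4 rows for frame 2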
I think maybe there is a smarter way to compare df and pivot them?
You got the unexpected result in the margin total because the margin uses the same function passed to aggfunc (i.e. pd.Series.nunique in this case) for its calculation, and the values of Matched and Mismatched in these 2 rows are both 1 (hence only one unique value, 1). (You are currently getting the unique count of frame ids.)
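As a minimal standalone sketch of that effect (toy values, not the data above): the margin re-applies nunique to the pooled frame ids, so it is not the sum of the per-group counts.
import pandas as pd
s_matched = pd.Series([3])       # one frame id counted under 'Matched'
s_mismatched = pd.Series([3])    # the same frame id counted under 'Mismatched'
print(s_matched.nunique(), s_mismatched.nunique())     # 1 1
print(pd.concat([s_matched, s_mismatched]).nunique())  # 1, not 2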
Probably, you can achieve more or less what you want by taking the count of them (including the margin, Matched and Mismatched) instead of the unique count of frame ids, by using pd.Series.count instead in the last line of code:
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross Matched Mismatched All
label_x
AO 0 1 1
CL 1 0 1
GO 1 1 2
ICV 1 1 2
PL 0 2 2
All 3 5 8
Edit
If all you need is for the All column to be the sum of Matched and Mismatched, you can do it as follows:
Change your code to generate pv_m_unq without building the margin:
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
                     'Mismatched': pv_m_unq['Mismatched'].sum(),
                     'All': pv_m_unq['All'].sum()},
                    name='All')
pv_m_unq = pv_m_unq.append(row_All)  # on pandas >= 2.0, where DataFrame.append was removed, use pd.concat([pv_m_unq, row_All.to_frame().T]) instead
Result:
print(pv_m_unq)
Matched Mismatched All
label_x
AO 1 3 4
CL 1 2 3
GO 1 1 2
ICV 2 4 6
PL 1 5 6
TI 2 3 5
All 8 18 26
You can use the isin() function like this:
df3 = df1[df1.label.isin(df2.label)]

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas DataFrame and cannot work out how to count the total of each cluster of 1s per row, grouped by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above, in the format
ID
last column heading where a '1' occurs: count of '1's in that cluster
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row, with a count for each cluster, but I am probably not using the correct Python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to use DataFrame.set_index with DataFrame.stack to reshape. Then create consecutive groups by comparing the values with their Series.shift-ed values using Series.ne and taking the cumulative sum with Series.cumsum into a new column g. Then filter the rows equal to 1 and aggregate with named aggregation via GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
         .agg(a=('level_1', 'last'), b=('level_1', 'size'))
         .reset_index(level=1, drop=True)
         .reset_index())
print(df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Last, to write to a txt file use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt", "w") as f:
    for i, g in df1.groupby('ID'):
        f.write(f"{i}\n")
        for a, b in g[['a','b']].to_numpy():
            f.write(f"\t{a}: {b}\n")
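For the sample df1 above, the written file.txt should then contain something like this (each detail line starts with a tab):
335344
	16-17: 3
358213
	19-20: 2
	14-15: 4
	12-13: 1
365663
	13-14: 5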
You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new series equal to the sum of the row.
# create new series equal to sum of values in the index row
df['sum'] = df.sum(axis=1) # specifies index (row) axis
The best method for getting the sum of each column depends on how you want to use that information, but in general the core is just to use the sum method on the series and assign the result to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add the results to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results, so you know which total belongs to which column. The two inputs are 1) the dataframe and 2) a list of any column names you would like to ignore.
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe in a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
Sources: Sum by row, Pandas Sum, Add pairs to dictionary

Pandas: add number of unique values to other dataset (as shown in picture):

I need to add the number of unique values in column C (right table) to the related row in the left table based on the values in common column A (as shown in the picture):
Thank you in advance.
Group by column A in the second dataset and calculate the count of unique values in column C. Merge it with the first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C':'C-count'}, inplace=True)
>>> output
A B C-count
0 2 22 3
1 3 23 2
2 5 21 1
3 1 24 1
4 6 21 1
Use DataFrameGroupBy.nunique with Series.map for new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
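For example, a hedged sketch with toy data (the question's actual tables are only shown in the picture, so these values are hypothetical):
import pandas as pd
df1 = pd.DataFrame({'A': [2, 3, 5], 'B': [22, 23, 21]})
df2 = pd.DataFrame({'A': [2, 2, 2, 3, 3, 5], 'C': [1, 2, 3, 7, 7, 9]})
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
print(df1)
#    A   B  C-count
# 0  2  22        3
# 1  3  23        1
# 2  5  21        1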
This may not be the most efficient way of doing this, so be careful if your DataFrames are too big.
Define the following function:
def c_value(a_value, right_table):
    c_ids = []
    for index, row in right_table.iterrows():
        if row['A'] == a_value:
            if row['C'] not in c_ids:
                c_ids.append(row['C'])
    return len(c_ids)
For this function I'm supposing that right_table is a pandas.DataFrame.
Now, you do the following to build the new column (assuming that the left table is also a pandas.DataFrame):
new_column = []
for index, row in left_table.iterrows():
    new_column.append(c_value(row['A'], right_table))
left_table["C-count"] = new_column
After this, the left_table DataFrame should be the one desired (as far as I understand what you need).

Merge subgroup into adjacent subgroup after groupby

If we run the following code
np.random.seed(0)
features = ['f1','f2','f3']
df = pd.DataFrame(np.random.rand(5000,4), columns=features+['target'])
for f in features:
    df[f] = np.digitize(df[f], bins=[0.13, 0.66])
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)
df.groupby(features)['target'].agg(['mean','count']).head(9)
We get average values for each grouping of the feature set:
mean count
f1 f2 f3
0 0 0 0.571429 7
1 0.414634 41
2 0.428571 28
1 0 0.490909 55
1 0.467337 199
2 0.486726 113
2 0 0.518519 27
1 0.446281 121
2 0.541667 72
In the table above, some of the groups have too few observations, and I want to merge them into an 'adjacent' group by some rules. For example, I may want to merge the group [0,0,0] with group [0,0,1] since it has no more than 30 observations. I wonder if there is any good way of performing such group combinations according to column values without creating a separate dictionary? More specifically, I may want to merge from the smallest count group to its adjacent group (the next group within the index order) until the total number of groups is no more than 10.
A simple way to do it is with a for loop over the indexes meeting your condition:
df_group = df.groupby(features)['target'].agg(['mean', 'count'])
# First reset_index to get an easier manipulation
df_group = df_group.reset_index()
list_indexes = df_group[df_group['count'] <= 58].index.values  # put any value you want
# loop over list_indexes
for ind in list_indexes:
    # check the condition again, in case merging a row at the previous
    # iteration has increased the count above your criteria
    if df_group.loc[ind, 'count'] <= 58:
        # add the count values to the next row
        df_group.loc[ind + 1, 'count'] = df_group.loc[ind + 1, 'count'] + df_group.loc[ind, 'count']
        # do anything you want on mean
        # drop the row
        df_group = df_group.drop(axis=0, index=ind)
# Reindex your df
df_group = df_group.set_index(features)
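If instead you want the rule described in the question (repeatedly merge the smallest-count group into the adjacent group until at most 10 groups remain), a hedged sketch could look like the following. It assumes df and features as defined in the question; the count-weighted handling of mean is an assumption, not something the answer above specifies.
grp = df.groupby(features)['target'].agg(['mean', 'count']).reset_index()
while len(grp) > 10:
    i = grp['count'].idxmin()                    # smallest group
    j = i + 1 if i + 1 < len(grp) else i - 1     # its adjacent group
    total = grp.loc[i, 'count'] + grp.loc[j, 'count']
    # combine the means weighted by counts, then fold row i into row j
    grp.loc[j, 'mean'] = (grp.loc[i, 'mean'] * grp.loc[i, 'count']
                          + grp.loc[j, 'mean'] * grp.loc[j, 'count']) / total
    grp.loc[j, 'count'] = total
    grp = grp.drop(index=i).reset_index(drop=True)
grp = grp.set_index(features)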

Change the row values in pandas based on lookup

I have two pandas DataFrames. One contains the actual data and the second contains the row indexes which I need to replace with some value.
Df1 : Input record
A B record_id record_type
0 12342345 10 011 H
1 65767454 20 012 I
2 78545343 30 013 I
3 43455467 40 014 I
Df2: Information about which row indexes need to change (e.g. here it is #)
Column1 Column2 Column3 record_id
0 1 2 4 011
1 1 2 None 012
2 1 2 4 013
3 1 2 None 014
Output Result:
A B record_id record_type
0 # # 011 #
1 # # 012 I
2 # # 013 #
3 # # 014 I
So, based on a record_id lookup, I want to change the corresponding row index values.
Here, (1 2 4 011) present in Df2 contains information saying we want to modify row indexes first, second and fourth for the particular record whose id is 011 in Df1.
So in the output result we replace the row values for record id 011 at row indexes 1, 2 and 4 and populate the value as #.
Please suggest an approach to do the same in pandas.
First, you can do some preprocessing to make life easier. Set the index to be record_id and then rename Column3 from df2 to be record_type. Now the dataframes have identical index and column names, which makes for easy automatic alignment.
df1 = df1.set_index('record_id')
df2 = df2.set_index('record_id')
df2 = df2.rename(columns={'Column3':'record_type'})
df2 = df2.replace('None', np.nan)
Then we can fill in the missing values of df2 with df1 and then make all the originally non-missing values '#'.
df2.fillna(df1).where(df2.isnull()).fillna('#')
Column1 Column2 record_type
record_id
11 # # #
12 # # I
13 # # #
14 # # I
Here, (1 2 4 011) present in Df2 contains information saying we want to modify row indexes first, second and fourth for the particular record whose id is 011 in Df1.
This makes no sense to me -- the row with record_id = 011 does not itself have further rows (of which you seem to want to choose the first, second, fourth). Please complete the output values with the exact results you expect.
In any case, I came across the same problem as in the title, and solved it like this:
Assuming you have a DataFrame df and three equally long vectors rsel, csel (for row/column selectors) and val (say, of length N), and would like to do the equivalent of
df.lookup(rsel, csel) = val
Then, the following code will work (at least) for pandas v.0.23 and python 3.6, assuming that rsel does not contain duplicates!
Warning: this is not really suited for large datasets, because it initialises a full square matrix of shape (N, N)!
import pandas as pd
import numpy as np
from functools import reduce

def coalesce(df, ltr=True):
    if not ltr:
        df = df.iloc[:, ::-1]  # flip left to right
    # use iloc as safeguard against non-unique column names
    list_of_series = [df.iloc[:, i] for i in range(len(df.columns))]
    # this is like a SQL coalesce
    return reduce(lambda interm, x: interm.combine_first(x), list_of_series)

# column names generally not unique!
square = pd.DataFrame(np.diag(val), index=rsel, columns=csel)
# np.diag creates 0s everywhere off-diagonal; set them to nan
square = square.where(np.diag([True] * len(rsel)))
# assuming no duplicates in rsel; this is empty
upd = pd.DataFrame(index=rsel, columns=sorted(csel.unique()))
# collapse square into upd
upd = upd.apply(lambda col: coalesce(square[square.columns == col.name]))
# actually update values
df.update(upd)
PS. If you know that you only have strings as column names, then square.filter(regex=col.name) is much faster than square[square.columns == col.name].
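For concreteness, a hedged sketch of the kind of inputs the snippet above expects (toy names and values, just to show the shapes involved; not part of the original answer):
import pandas as pd
import numpy as np
df = pd.DataFrame(0.0, index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])
rsel = pd.Series(['r1', 'r3'])      # row labels to update (no duplicates)
csel = pd.Series(['c2', 'c1'])      # column label to update for each row
val = np.array([10.0, 20.0])        # values to write at (rsel[i], csel[i])
# after running the snippet above: df.loc['r1', 'c2'] == 10.0 and df.loc['r3', 'c1'] == 20.0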
