Keep similar rows pandas dataframe with maximum overlap - python

I have a dataframe which looks like this (example):
index  ID  time   value
0      1   2h     10
1      1   2.15h  15
2      1   2.30h  5
3      1   2.45h  24
4      2   2.15h  6
5      2   2.30h  12
6      2   2.45h  18
7      3   2.15h  2
8      3   2.30h  1
I would like to keep only the rows whose time is present for every ID, i.e. the times with maximum overlap across IDs.
So:
index  ID  time   value
1      1   2.15h  15
2      1   2.30h  5
4      2   2.15h  6
5      2   2.30h  12
7      3   2.15h  2
8      3   2.30h  1
I know I could create a df with the unique times, merge each ID into it separately, and then keep only the times for which every ID has a row, but this is quite impractical. I have looked around but have not found an answer for a possible smarter way. Does someone have an idea how to make this more practical?

Use:
cols = df.groupby(['ID', 'time']).size().unstack().dropna(axis=1).columns
df = df[df['time'].isin(cols)]
print(df)
   ID   time  value
1   1  2.15h     15
2   1  2.30h      5
4   2  2.15h      6
5   2  2.30h     12
7   3  2.15h      2
8   3  2.30h      1
Details:
First aggregate the DataFrame with groupby and size, then reshape with unstack; NaNs are created for the non-overlapping times:
print(df.groupby(['ID', 'time']).size().unstack())
time  2.15h  2.30h  2.45h   2h
ID
1       1.0    1.0    1.0  1.0
2       1.0    1.0    1.0  NaN
3       1.0    1.0    NaN  NaN
Remove the columns containing NaN with dropna and grab the column names:
print(df.groupby(['ID', 'time']).size().unstack().dropna(axis=1))
time  2.15h  2.30h
ID
1       1.0    1.0
2       1.0    1.0
3       1.0    1.0
And last, filter the rows with isin and boolean indexing:
df = df[df['time'].isin(cols)]
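For reference, a minimal self-contained version of the approach above (the sample data is rebuilt inline, so the snippet runs as-is):

import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'time':  ['2h', '2.15h', '2.30h', '2.45h',
              '2.15h', '2.30h', '2.45h', '2.15h', '2.30h'],
    'value': [10, 15, 5, 24, 6, 12, 18, 2, 1],
})

# Times present for every ID survive the dropna; all others are removed.
cols = df.groupby(['ID', 'time']).size().unstack().dropna(axis=1).columns
print(df[df['time'].isin(cols)])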

Related

How to drop a row in one dataframe if missing value in another dataframe?

I have two DataFrames (example below). I would like to delete any row in df1 containing a value equal to df2['Patnum'] where df2['City'] is NaN.
For example: I would want to drop rows 1 and 3 in df1, since they contain '4' and Patnum '4' in df2 has a missing value in df2['City'].
How would I do this?
df1
   Citer  Citee
0      1      2
1      2      4
2      3      5
3      4      7
df2
   Patnum        City
0       1    new york
1       2   amsterdam
2       3  copenhagen
3       4         nan
4       5      sydney
expected result:
df1
   Citer  Citee
0      1      2
1      3      5
IIUC: stack, isin and dropna.
The idea is to build a True/False mask from the matches, drop those entries, then unstack the dataframe back and drop the rows left with missing values.
val = df2[df2['City'].isna()]['Patnum'].values
df3 = df1.stack()[~df1.stack().isin(val)].unstack().dropna(how="any")
   Citer  Citee
0    1.0    2.0
2    3.0    5.0
Details
df1.stack()[~df1.stack().isin(val)]
0  Citer    1
   Citee    2
1  Citer    2
2  Citer    3
   Citee    5
3  Citee    7
dtype: int64
print(df1.stack()[~df1.stack().isin(val)].unstack())
   Citer  Citee
0    1.0    2.0
1    2.0    NaN
2    3.0    5.0
3    NaN    7.0
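For reference, a self-contained sketch of this answer; note it assumes the missing city is a real NaN (np.nan), not the literal string 'nan':

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Citer': [1, 2, 3, 4], 'Citee': [2, 4, 5, 7]})
df2 = pd.DataFrame({'Patnum': [1, 2, 3, 4, 5],
                    'City': ['new york', 'amsterdam', 'copenhagen', np.nan, 'sydney']})

# Patnum values whose City is missing
val = df2.loc[df2['City'].isna(), 'Patnum'].values

# Mask out any cell matching those values, then drop rows left incomplete
stacked = df1.stack()
df3 = stacked[~stacked.isin(val)].unstack().dropna(how='any')
print(df3)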

How do I sort a whole pandas dataframe by one column, moving the rows grouped in 3s

I have a dataframe with genes (Ensembl IDs and common names), homologs, counts, and totals, in groups of three rows as such:
Index  Zebrafish Homolog   Human Homolog    Total
0      ENSDARG00000019949  ENSG00000149257
1      serpinh1b           SERPINH1
2      2                   2                4
3      ENSDARG00000052437  ENSG00000268975
4      mia                 MIA-RAB4B
5      2                   0                2
6      ENSDARG00000057992  ENSG00000134363
7      fstb                FST
8      0                   3                3
9      ENSDARG00000045580  ENSG00000139329
10     lum                 LUM
11     15                  15               30
etc...
I want to sort these rows by the totals in descending order, such that the rows stay intact in their groups of 3, in the order shown. The ideal output would be:
Index  Zebrafish Homolog   Human Homolog    Total
0      ENSDARG00000045580  ENSG00000139329
1      lum                 LUM
2      15                  15               30
3      ENSDARG00000019949  ENSG00000149257
4      serpinh1b           SERPINH1
5      2                   2                4
6      ENSDARG00000057992  ENSG00000134363
7      fstb                FST
8      0                   3                3
9      ENSDARG00000052437  ENSG00000268975
10     mia                 MIA-RAB4B
11     2                   0                2
etc...
I tried filling the total into all 3 rows of each group and then sorting with DataFrame.sort_values(), removing the extra 2 rows for each clump of 3 afterwards, but it didn't work properly. Is there a way to group the rows into clumps of 3 and then sort them while maintaining that structure? Thank you in advance for any assistance.
Update #1
If I try to use the code:
df['Total'] = df['Total'].bfill().astype(int)
df = df.sort_values(by='Total', ascending=False)
to fill the total into each group of 3 and then sort, it partially works, but scrambles the rows like this:
Index  Zebrafish Homolog   Human Homolog    Total
0      ENSDARG00000045580  ENSG00000139329  30
1      lum                 LUM              30
2      15                  15               30
4      serpinh1b           SERPINH1         4
3      ENSDARG00000019949  ENSG00000149257  4
5      2                   2                4
8      0                   3                3
7      fstb                FST              3
6      ENSDARG00000057992  ENSG00000134363  3
9      ENSDARG00000052437  ENSG00000268975  2
11     2                   0                2
10     mia                 MIA-RAB4B        2
etc...
Even worse, if multiple genes have the same total count, the rows become interchanged between genes, which gets confusing.
Is this a dead end? Maybe I should just rewrite the code a different way :(
You need to create a second key to keep the records together when sorting; see below:
df['Total'] = df['Total'].bfill()
df['helper'] = np.arange(len(df)) // 3
df = df.sort_values(['Total', 'helper'], ascending=[False, True])
df = df.drop(columns='helper')
It looks like your Total column has missing values in the first two rows of each group, and that actually helps in this case.
Approach 1
df['Total'] = df['Total'].bfill().astype(int)
df['idx'] = np.arange(len(df)) // 3
df = df.sort_values(by=['Total', 'idx'], ascending=False)
df = df.drop(['idx'], axis=1)
    Zebrafish_Homolog   Human_Homolog    Total
9   ENSDARG00000045580  ENSG00000139329  30
10  lum                 LUM              30
11  15                  15               30
0   ENSDARG00000019949  ENSG00000149257  4
1   serpinh1b           SERPINH1         4
2   2                   2                4
6   ENSDARG00000057992  ENSG00000134363  3
7   fstb                FST              3
8   0                   3                3
3   ENSDARG00000052437  ENSG00000268975  2
4   mia                 MIA-RAB4B        2
5   2                   0                2
Note how the index stays the same; if you don't want that, then reset_index():
df = df.reset_index(drop=True)
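For reference, a self-contained sketch of Approach 1, assuming the dataframe comes in with NaN in the Total column on the first two rows of each group (as in the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Zebrafish_Homolog': ['ENSDARG00000019949', 'serpinh1b', '2',
                          'ENSDARG00000052437', 'mia', '2',
                          'ENSDARG00000057992', 'fstb', '0',
                          'ENSDARG00000045580', 'lum', '15'],
    'Human_Homolog': ['ENSG00000149257', 'SERPINH1', '2',
                      'ENSG00000268975', 'MIA-RAB4B', '0',
                      'ENSG00000134363', 'FST', '3',
                      'ENSG00000139329', 'LUM', '15'],
    'Total': [np.nan, np.nan, 4, np.nan, np.nan, 2,
              np.nan, np.nan, 3, np.nan, np.nan, 30],
})

df['Total'] = df['Total'].bfill().astype(int)  # copy each group's total up to all 3 rows
df['idx'] = np.arange(len(df)) // 3            # 0,0,0,1,1,1,... marks the groups of 3
df = (df.sort_values(by=['Total', 'idx'], ascending=False)
        .drop(columns='idx'))
print(df)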
Approach 2
A more manual way of sorting.
The approach is to sort the index and then loc into the df. It looks complicated, but it's just subtracting ints from a list. Note that nothing happens to the df until the very end, so there should be no speed issue for a larger df.
# Sort by total
df = df.reset_index().sort_values('Total', ascending=False)
# Get the index of the sorted values
uniq_index = df[df['Total'].notnull()]['index'].values
# Create the new index
index = uniq_index.repeat(3)
groups = [-2, -1, 0] * (len(df) // 3)
# Update so everything is in order
new_index = index + groups
# Apply to the dataframe
df = df.loc[new_index]
    Zebrafish_Homolog   Human_Homolog    Total
9   ENSDARG00000045580  ENSG00000139329  NaN
10  lum                 LUM              NaN
11  15                  15               30.0
0   ENSDARG00000019949  ENSG00000149257  NaN
1   serpinh1b           SERPINH1         NaN
2   2                   2                4.0
6   ENSDARG00000057992  ENSG00000134363  NaN
7   fstb                FST              NaN
8   0                   3                3.0
3   ENSDARG00000052437  ENSG00000268975  NaN
4   mia                 MIA-RAB4B        NaN
5   2                   0                2.0

Pandas: re-index and interpolate in multi-index dataframe

I'm having trouble understanding pandas reindex. I have a series of measurements, munged into a multi-index df, and I'd like to reindex and interpolate those measurements to align them with some other data.
My actual data has ~7 index levels and several different measurements. I hope the solution for this toy data problem is applicable to my real data. It's "small data"; each individual measurement is a couple KB.
Here's a pair of toy problems, one which shows the expected behavior and one which doesn't seem to do anything.
Single-level index, works as expected:
"""
step,value
1,1
3,2
5,1
"""
df_i = pd.read_clipboard(sep=",").set_index("step")
print(df_i)
new_index = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
df_i = df_i.reindex(new_index).interpolate()
print(df_i)
Outputs, the original df and the re-indexed and interpolated one:
      value
step
1         1
3         2
5         1

      value
step
1       1.0
2       1.5
3       2.0
4       1.5
5       1.0
6       1.0
7       1.0
8       1.0
9       1.0
Works great.
Multi-index, currently not working:
"""
sample,meas_id,step,value
1,1,1,1
1,1,3,2
1,1,5,1
1,2,3,2
1,2,5,2
1,2,7,1
1,2,9,0
"""
df_mi = pd.read_clipboard(sep=",").set_index(["sample", "meas_id", "step"])
print(df_mi)
df_mi = df_mi.reindex(new_index, level="step").interpolate()
print(df_mi)
Output, unchanged after reindex (and therefore after interpolate):
                     value
sample meas_id step
1      1       1         1
               3         2
               5         1
       2       3         2
               5         2
               7         1
               9         0

                     value
sample meas_id step
1      1       1         1
               3         2
               5         1
       2       3         2
               5         2
               7         1
               9         0
How do I actually reindex a column in a multi-index df?
Here's the output I'd like, assuming linear interpolation:
                     value
sample meas_id step
1      1       1     1
               2     1.5
               3     2
               5     1
               6     1
               7     1
               8     1
               9     1
       2       1     NaN (or 2)
               2     NaN (or 2)
               3     2
               4     2
               5     2
               6     1.5
               7     1
               8     0.5
               9     0
I spent some sincere time looking over SO, and if the answer is in there, I missed it:
Fill multi-index Pandas DataFrame with interpolation
Resampling Within a Pandas MultiIndex
pandas multiindex dataframe, ND interpolation for missing values
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
Possibly related GitHub issues:
https://github.com/numpy/numpy/issues/11975
https://github.com/pandas-dev/pandas/issues/23104
https://github.com/pandas-dev/pandas/issues/17132
IIUC, create the full index with MultiIndex.from_product, then just reindex:
idx = pd.MultiIndex.from_product([df_mi.index.levels[0],
                                  df_mi.index.levels[1],
                                  new_index])
df_mi.reindex(idx).interpolate()
Out[161]:
          value
1 1 1  1.000000
    2  1.500000
    3  2.000000
    4  1.500000
    5  1.000000
    6  1.142857
    7  1.285714
    8  1.428571
    9  1.571429
  2 1  1.714286  # bad: interpolation here takes the previous group's values into account
    2  1.857143
    3  2.000000
    4  2.000000
    5  2.000000
    6  1.500000
    7  1.000000
    8  0.500000
    9  0.000000
My approach: reindex and interpolate each group separately, so interpolation never crosses group boundaries:
def idx(x):
    return pd.MultiIndex.from_product([x.index.get_level_values(0).unique(),
                                       x.index.get_level_values(1).unique(),
                                       new_index])

pd.concat([y.reindex(idx(y)).interpolate() for _, y in df_mi.groupby(level=[0, 1])])
       value
1 1 1    1.0
    2    1.5
    3    2.0
    4    1.5
    5    1.0
    6    1.0
    7    1.0
    8    1.0
    9    1.0
  2 1    NaN
    2    NaN
    3    2.0
    4    2.0
    5    2.0
    6    1.5
    7    1.0
    8    0.5
    9    0.0
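For reference, a self-contained version of the per-group approach, with read_clipboard replaced by an explicit DataFrame (an assumption made only so the snippet runs anywhere):

import numpy as np
import pandas as pd

df_mi = pd.DataFrame({
    'sample':  [1, 1, 1, 1, 1, 1, 1],
    'meas_id': [1, 1, 1, 2, 2, 2, 2],
    'step':    [1, 3, 5, 3, 5, 7, 9],
    'value':   [1, 2, 1, 2, 2, 1, 0],
}).set_index(['sample', 'meas_id', 'step'])

new_index = np.arange(1, 10)

def full_idx(g):
    # Full (sample, meas_id, step) index for a single group
    return pd.MultiIndex.from_product(
        [g.index.get_level_values(0).unique(),
         g.index.get_level_values(1).unique(),
         new_index],
        names=g.index.names)

# Interpolating per group keeps values from bleeding across group boundaries.
out = pd.concat([g.reindex(full_idx(g)).interpolate()
                 for _, g in df_mi.groupby(level=[0, 1])])
print(out)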

Pandas: Insert dataframe into other dataframe without preserving indices

I want to insert a pandas dataframe into another pandas dataframe at certain indices.
Let's say we have this dataframe:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
I can then change values at certain indices as follows:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
original_df.iloc[[0,2],[0,1]] = 2
   0  1  2
0  2  2  3
1  4  5  6
2  2  2  9
However, if I use the same technique to insert another dataframe, it doesn't work:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])
original_df.iloc[[0,2],[0,1]] = df_to_insert
      0     1    2
0  10.0  11.0  3.0
1   4.0   5.0  6.0
2   NaN   NaN  9.0
I am looking for a way to get the following result:
    0   1  2
0  10  11  3
1   4   5  6
2  12  13  9
It seems to me that with this syntax, the values from df_to_insert are aligned by their own index and columns with the target locations. Is there a way for me to avoid this?
When you insert, make sure to convert the df to its values: pandas is index-sensitive, which means it will always try to align on the index and columns during assignment.
original_df.iloc[[0,2],[0,1]] = df_to_insert.values
original_df
Out[651]:
    0   1  2
0  10  11  3
1   4   5  6
2  12  13  9
It does work with an array rather than a df:
original_df.iloc[[0,2],[0,1]] = np.array([[10,11],[12,13]])
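On recent pandas versions, .to_numpy() is the recommended spelling for getting the underlying array, and works the same way here:

original_df.iloc[[0, 2], [0, 1]] = df_to_insert.to_numpy()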

Pandas - create total column based on other column

I'm trying to create a total column that sums the numbers from another column, grouped by a third column. I can do this with .groupby(), but that creates a truncated column (one row per group), whereas I want a column that is the same length as the original.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
   a  b  total
0  1  1    1.0
1  2  2    5.0
2  2  3   15.0
3  3  4    NaN
4  3  5    NaN
5  3  6    NaN
My desired result:
   a  b  total
0  1  1    1.0
1  2  2    5.0
2  2  3    5.0
3  3  4   15.0
4  3  5   15.0
5  3  6   15.0
...where every row with the same 'a' value gets the same total.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique groups. Use transform to produce a column of the same length ("like-indexed") as the original data frame, without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
   a  b  total
0  1  1      1
1  2  2      5
2  2  3      5
3  3  4     15
4  3  5     15
5  3  6     15
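For reference, a self-contained version; the string alias 'sum' (rather than the Python builtin) lets pandas dispatch to its optimized implementation:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})

# transform broadcasts each group's sum back to every row of that group
df['total'] = df.groupby('a')['b'].transform('sum')
print(df)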
