MultiIndex pivot table pandas python

import pandas as pd
data = pd.read_excel('.../data.xlsx')
The content looks like this:
Out[57]:
Block Concentration Name Replicate value
0 1 100 GlcNAc2 1 321
1 1 100 GlcNAc2 2 139
2 1 100 GlcNAc2 3 202
3 1 33 GlcNAc2 1 86
4 1 33 GlcNAc2 2 194
5 1 33 GlcNAc2 3 452
6 1 10 GlcNAc2 1 140
7 1 10 GlcNAc2 2 285
... ... ... ... ... ...
1742 24 0 Print buffer 1 -9968
1743 24 0 Print buffer 2 -4526
1744 24 0 Print buffer 3 14246
[1752 rows x 5 columns]
Pivot table looks like this (only a part of the large table):
newdata = data.pivot_table(index=["Block", "Concentration"], columns=["Name", "Replicate"], values="value")
My questions:
How do I fill the '0' concentration of 'GlcNAc2' and 'Man5GIcNAc2' with the 'Print buffer' values?
Desired output:
I have been searching online and haven't really found anything similar. I have not even found a way to point to the 'Print buffer' values from the 'Name' column.
The MultiIndex/advanced indexing chapter says to use
df.xs('one', level='second')
but it doesn't work in my case with the pivot table, and I'm not sure why. Is a pivot table a MultiIndex?

If I understand correctly, you want to duplicate the values with Name == 'Print buffer' into the columns with Name == 'GlcNAc2' or 'Man5GIcNAc2' and Concentration == 0.
A way of doing this is to duplicate the rows in the original dataset:
selection = data[data["Name"] == "Print buffer"].copy()  # .copy() avoids SettingWithCopyWarning
selection["Name"] = "GlcNAc2"
data = pd.concat([data, selection])
selection["Name"] = "Man5GIcNAc2"
data = pd.concat([data, selection])
And then apply the pivot_table.
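For reference, a minimal self-contained sketch of the resulting pivot (the three Print buffer rows are taken from the sample above; everything else is illustrative, not the real data):
import pandas as pd

# Tiny stand-in frame; the real data has 1752 rows.
data = pd.DataFrame({
    "Block": [24, 24, 24],
    "Concentration": [0, 0, 0],
    "Name": ["Print buffer"] * 3,
    "Replicate": [1, 2, 3],
    "value": [-9968, -4526, 14246],
})

newdata = data.pivot_table(index=["Block", "Concentration"],
                           columns=["Name", "Replicate"], values="value")

# The pivot table has a MultiIndex on both axes, so xs does work on it,
# e.g. selecting all columns for one Name:
print(newdata.xs("Print buffer", level="Name", axis=1))
That also answers the side question: the result of pivot_table here has a MultiIndex on both rows and columns, so the xs call from the docs works once you pass the right level name and axis.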
Remark: I am not sure that I understand your question. I am confused by the fact that, in your pictures, the values for Block == 1 change from the first picture to the second. Is that just a mistake, or was it the core of your problem?

Related

How to apply a function to dataframe with data clusters/neighborhoods separately?

Consider the following table. The first column, Data1, contains data values that are clustered in groups: there are values around 100 and values around 200. I am wondering how I can apply a function to each data grouping separately, perhaps via a condition that excludes data points whose values are too far apart to be considered neighbors.
Data1 Value1
99 1
100 2
101 3
102 4
199 5
200 6
201 7
... ...
For example, suppose I want to generate a third column called "Result1" that adds each Data1 cluster's corresponding Value1 values together. The result would look something like this, where 1+2+3+4=10 and 5+6+7=18:
Data1 Value1 Result1
99 1 10
100 2 10
101 3 10
102 4 10
199 5 18
200 6 18
201 7 18
... ... ...
Try merge_asof:
data = [100, 200]
labels = pd.merge_asof(df, pd.DataFrame({'label': data}),
                       left_on='Data1', right_on='label',
                       direction='nearest')['label']
df['Result1'] = df.groupby(labels)['Value1'].transform('sum')
Output:
Data1 Value1 Result1
0 99 1 10
1 100 2 10
2 101 3 10
3 102 4 10
4 199 5 18
5 200 6 18
6 201 7 18
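For completeness, a runnable version of the same idea (the sample frame mirrors the table above; note that merge_asof requires both inputs to be sorted on the join keys, which they are here):
import pandas as pd

df = pd.DataFrame({'Data1': [99, 100, 101, 102, 199, 200, 201],
                   'Value1': [1, 2, 3, 4, 5, 6, 7]})

# Label each row with the nearest cluster center, then sum per label.
centers = pd.DataFrame({'label': [100, 200]})
labels = pd.merge_asof(df, centers,
                       left_on='Data1', right_on='label',
                       direction='nearest')['label']
df['Result1'] = df.groupby(labels)['Value1'].transform('sum')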
In your case, a simple mask ought to do.
mask = df["Data1"] < 150
df.loc[mask, "Result1"] = df.loc[mask, "Value1"].sum()
df.loc[~mask, "Result1"] = df.loc[~mask, "Value1"].sum()

Is there a way to avoid while loops using pandas in order to speed up my code?

I'm writing code to merge several dataframes together using pandas.
Here is my first table :
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two dataframes, I concatenate and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each dataframe. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover "present" values in a df by comparing two columns directly using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
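A runnable sketch of that merge (column names assumed from the tables above); note that an outer merge is needed to keep Values present in only one frame, and fillna(0) supplies the zeros shown in the desired output:
import pandas as pd

df1 = pd.DataFrame({'Values': [11, 12, 13, 24],
                    'Intensity': [98, 855, 500, 140]})
df2 = pd.DataFrame({'Values': [21, 11, 24, 25],
                    'Intensity': [1000, 2000, 0.55, 500]})

# how='outer' keeps Values that appear in only one frame; suffixes name
# the two Intensity columns; fillna(0) matches the desired table.
df3 = df1.merge(df2, on='Values', how='outer',
                suffixes=('_df1', '_df2')).fillna(0)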

Totalling the matching values of a dataframe column with Series values

I have a Series :
350 0
254 1
490 0
688 0
393 1
30 1
and a dataframe :
0 outcome
0 350 1
1 254 1
2 490 0
3 688 0
4 393 0
5 30 1
The code below counts the total number of matches between the Series and the outcome column in the dataframe, which is what was intended.
Is there any better way than the below?
i = 0
match = 0
for pred in results['outcome']:
    if test.values[i] == pred:
        match += 1
    i += 1
print(match)
I tried using results['Survived'].eq(labels_test).sum() but the answer is wrong.
I also tried using a lambda, but the syntax was wrong.
You can compare by mapping the series, i.e.:
(df['0'].map(s) == df['outcome']).sum()
4
First, align the dataframe and series using align.
df, s = df.set_index('0').align(s, axis=0)
Next, compare the outcome column with the values in s and count the number of True values:
df.outcome.eq(s).sum()
4
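Both answers can be checked against the sample data (a self-contained sketch; the variable names s and df are assumptions):
import pandas as pd

s = pd.Series([0, 1, 0, 0, 1, 1], index=[350, 254, 490, 688, 393, 30])
df = pd.DataFrame({'0': [350, 254, 490, 688, 393, 30],
                   'outcome': [1, 1, 0, 0, 0, 1]})

# Look up each key from column '0' in s, then count the matching rows.
print((df['0'].map(s) == df['outcome']).sum())  # 4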

Searching for both ends of range from pandas data frame in another data frame and expanding it to a new data frame

I'm new to Pandas and Python as well. I have 2 data frames created from importing 2 tables from mysql database :
Ranges
titles
ranges data frame :
title s_low s_high post pre noy
1 104 106b 0 2 0
1 1 5 1 0 0
Here, each row represents a range of sections with the last three columns having data related to the sections in the range. s_low represents the lower end section of a range and s_high represents the higher end section. There are tens of titles, and each title has many sections.
titles data frame (it contains all the data relating to all the sections under all the titles):
title section
1 1
1 101
1 102
1 103
1 104
1 105
1 106
1 106a
1 106b
1 107
1 108
1 109
1 110
1 111
1 112
1 112a
1 112b
1 113
1 114
1 2
1 201
1 202
1 203
1 204
1 205
1 206
1 207
1 208
1 209
1 210
1 211
1 212
1 213
1 3
1 4
1 5
1 6
1 7
1 8
I have to expand the ranges in the ranges data frame and write the sections within each range to a new data frame, along with the values in the last three columns: post, pre and noy.
Here's the code I've generated so far.
import MySQLdb as db
from pandas import DataFrame
from pandas.io.sql import frame_query
import pandas as pd

cnxn = db.connect('xxxx', 'xxxx', 'xxxx', 'xxxx', charset='utf8', use_unicode=True)
ranges = frame_query("SELECT * from ranges", cnxn)
titles = frame_query("SELECT title, section from titles", cnxn)
exp = pd.DataFrame(columns=['title', 'section', 'post', 'pre', 'noy'])

for index, row in ranges.iterrows():
    t = row['title']
    s_low = row['s_low']
    s_low1 = str(t) + '$' + s_low
    s_high = row['s_high']
    s_high1 = str(t) + '$' + s_high
    post = row['post']
    pre = row['pre']
    noy = row['noy']
    x = 0
    for i, r in titles.iterrows():
        title = r['title']
        sec = r['section']
        if (str(t) + '$' + s_low) == (str(title) + '$' + sec):
            x = <index of sec>  # placeholder: this is the part I cannot figure out
I am using string concatenation because there are multiple titles and multiple sections; the same section code can be present under a different title.
I understand that I need to loop through the index of titles after finding the s_low until I reach s_high and write the values to the new data frame exp. I'm not able to get the index of s_low to proceed further.
Sample output (exp data frame) for the first row in ranges sample pasted above would be :
title section post pre noy
1 104 0 2 0
1 105 0 2 0
1 106 0 2 0
1 106a 0 2 0
1 106b 0 2 0
Any help with this would be much appreciated.
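As a sketch of one way forward (assuming titles preserves the database row order within each title, so positional indices are meaningful): restrict titles to the current title, reset the index so each section gets a position, look up the positions of s_low and s_high, and take the inclusive slice between them.
import pandas as pd

# Assumed sample frames mirroring the tables above.
titles = pd.DataFrame({'title': [1] * 6,
                       'section': ['104', '105', '106', '106a', '106b', '107']})
ranges = pd.DataFrame({'title': [1], 's_low': ['104'], 's_high': ['106b'],
                       'post': [0], 'pre': [2], 'noy': [0]})

pieces = []
for _, row in ranges.iterrows():
    sub = titles[titles['title'] == row['title']].reset_index(drop=True)
    lo = sub.index[sub['section'] == row['s_low']][0]   # position of s_low
    hi = sub.index[sub['section'] == row['s_high']][0]  # position of s_high
    block = sub.loc[lo:hi].copy()                       # .loc slice is inclusive of hi
    block['post'] = row['post']
    block['pre'] = row['pre']
    block['noy'] = row['noy']
    pieces.append(block)

exp = pd.concat(pieces, ignore_index=True)
This reproduces the sample output above for the first range; filtering on title first also removes the need for the string-concatenation keys.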

Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present", "r", and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd

df = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]})
df.set_index(['Name1', 'Name2'], inplace=True)

df2 = pd.DataFrame({'Data1': [80, 61, 45, 30],
                    'Data2': [6, 8, 7, 3]},
                   index=pd.Series([1, 2, 4, 8], name='Name2'))

result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1, 0], axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3
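An alternative that reaches the same result with a merge on the Name2 column (a sketch against the same sample frames defined above):
result2 = (df.reset_index()
             .merge(df2.reset_index(), on='Name2')
             .set_index(['Name1', 'Name2'])
             .sort_index())
print(result2)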
