pandas SettingWithCopyWarning on groupby - python

While I generally understand the warning, and many posts deal with it, I don't understand why I get the warning only when I reach the groupby line (the last one):
grouped = data.groupby(['group'])
for name, group in grouped:
    data2 = group.loc[data['B-values'] > 0]
    data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')
EDIT:
Here is my dataframe (data):
group A-values B-values
human 1 -1
human 1 5
human 1 4
human 3 4
human 2 10
bird 7 8
....
For B-values > 0 (data2=group.loc[data['B-values'] > 0]):
human has two A-values equal to 1, one equal to 3, and one equal to 2 (data2["unique_A-values"]=data2.groupby(["A-values"])["A-values"].transform('count'))

You get the warning because you take a reference to a chunk of your groupby and then try to add a column to it, so pandas is just warning you that if your intention is to update the original df, this may or may not work.
If you are just modifying a local copy then take a copy using copy() so it's explicit and the warning will go away:
for name, group in grouped:
    data2 = group.loc[data['B-values'] > 0].copy()  # <- add .copy() here
    data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')

FYI the pandas groupby user guide says:
Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results.
for name, group in grouped:
    # making a reference to the group chunk
    data2 = group.loc[data['B-values'] > 0]
    # trying to make a change to that group chunk reference
    data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')
That said, it looks like you just want to count the values in the data frame so you may be better off using value_counts():
>>> data[data['B-values']>0].groupby('group')['A-values'].value_counts()
group  A-values
bird   7           1
human  1           2
       2           1
       3           1
Name: A-values, dtype: int64
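If you want those counts back as a column rather than as a Series, here is a minimal loop-free sketch (an assumption on my part that filtering the whole frame once and grouping on both columns matches your intent):
# Filter once, then count occurrences of each A-value within each group;
# .copy() keeps the later assignment free of the warning.
data2 = data[data['B-values'] > 0].copy()
data2['unique_A-values'] = data2.groupby(['group', 'A-values'])['A-values'].transform('count')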

Related

Splitting row values and count unique's from a DataFrame

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a Dataframe that contains the single Reference code, plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the last two digits from the list in each row.
All I want now is to keep the first element ([0]) of the list in each row, if that makes sense; then I could just retrieve a value_count from the 'Reference' column.
There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value, this should do:
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
You can just use regex to replace the last two digits like this:
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
df = df['a'].str.replace(r'\/\d+$', '', regex=True).value_counts().reset_index()
Output:
    index  a
0  ADD005  6
1  ABS052  3
2  ADA010  1
You are almost there, you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
        1
0
ABS052  3
ADA010  1
ADD005  6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 that have no "/", i.e. where the second column is None.
To output to a df with column names:
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
Output:
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.

How to optimally update cells based on previous cell value / How to elegantly spread values of cell to other cells?

I have a "large" DataFrame table with the index being country codes (alpha-3) and the columns being years (1900 to 2000), imported via pd.read_csv(...) [as I understand, the years are actually strings, so I need to pass them as '1945', for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration (my current implementation is something like this; as you can see, using 2 loops is not optimal, and I guess I could get rid of one by using apply(row)):
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension because my approach is not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need in order to conditionally update the cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a small, fixed set of column names but rather a wide range (all the examples I see use a handful of named columns...)
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of your dataframe) with the previous non-zero value per column.
If you want to do it rowwise unfortunately the .replace() function does not accept an axis argument. But you can transpose your dataframe, replace the zeros and transpose it again: df.T.replace(0, method='ffill').T
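For illustration, a minimal row-wise sketch using mask and ffill instead of .replace (the one-row frame and year columns below are made up just to mirror the example in the question):
import pandas as pd

# Hypothetical one-row frame matching the example row: 0 0 1 0 0 3 0 0 2 1
df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]],
                  columns=[str(y) for y in range(1900, 1910)])

# Turn zeros into NaN, forward-fill along each row, then restore the
# leading zeros (those before the first non-zero value) and the int dtype.
spread = df.mask(df == 0).ffill(axis=1).fillna(0).astype(int)
# spread now holds: 0 0 1 1 1 3 3 3 2 1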

Python Pandas adding two dataframes SettingWithCopyWarning

I don't understand why I am getting the dreaded warning when I am doing exactly as instructed by the official documentation.
We have a dataframe called a
a = pd.DataFrame(data=[['Tom', 1],
                       ['Tom', 1],
                       ['Dick', 1],
                       ['Dick', 1],
                       ['Harry', 1],
                       ['Harry', 1]], columns=['Col1', 'Col2'])
a
Out[377]:
Col1 Col2
0 Tom 1
1 Tom 1
2 Dick 1
3 Dick 1
4 Harry 1
5 Harry 1
First we create a "holder" dataframe:
holder = a
Then we create a subset of a:
c = a.loc[a['Col1'] == 'Tom',:]
c
Out[379]:
Col1 Col2
0 Tom 1
1 Tom 1
We create another subset d which will be added to (a slice of) the previous subset c, but once we try to add d to c, we get the warning:
d = a.loc[a['Col1'] == 'Tom','Col2']
d
Out[389]:
0 1
1 1
c.loc[:,'Col2'] += d
C:\Users\~\anaconda3\lib\site-packages\pandas\core\indexing.py:494: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[item] = s
I would like to understand what I am doing wrong because I use this logic very often (coming from R where everything is not a darn object)
After noticing a different issue, I found a solution.
Whenever you say
dataframe_A = dataframe_B
you need to proceed with caution because Python, it seems, joins these two dataframes at the hip, so to speak. If you make changes to dataframe_B, your dataframe_A will also change!
I understand just enough to fix the problem by using .copy(deep=True), which makes Python create a full, independent copy so that you can make changes to one without affecting the other.
On further investigation, and for those interested, it apparently has to do with "pointers", which is a slightly complicated coding concept with a scope beyond this specific question.
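A minimal sketch of that fix applied to the dataframes above (taking an explicit copy of the slice c is what makes the warning go away here):
c = a.loc[a['Col1'] == 'Tom', :].copy(deep=True)   # independent copy of the slice
d = a.loc[a['Col1'] == 'Tom', 'Col2']
c.loc[:, 'Col2'] += d                              # modifies the copy only, no warning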

Comparison between one element and all the others of a DataFrame column

I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the following rule:
def are_dif(m1, m2, ppm=10):
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I only want the "frag"s whose mass differs from all the other fragments' masses. How can I achieve that "selection"?
Then, I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary is at the position of the protein that it describes.
So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to the protein.
If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:
ppm = .00001
df = df.sort_values('mass')
df['pdiff'] = (df.mass - df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last data lines make this a little tricky, so this next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but might need to be tweaked for other cases (but only as far as the first and last lines of data are concerned).
df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment to @AmiTavory's answer, I think possibly the sorting approach and groupby approach could be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give this a shot themselves if interested.
Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.
Using numpy.round, you can create a new column
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Following that, you can do a groupby of the frags on the rounded mass, and use nunique to count the numbers in the group. Filter for the groups of size 1.
So, the number of frags per bin is:
df.frag.groupby(np.round(df.mass, 6)).nunique()
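A possible way to finish the filtering step described above (a sketch; it assumes df holds the frag and mass columns from the question):
counts = df.frag.groupby(np.round(df.mass, 6)).nunique()
unique_masses = counts[counts == 1].index              # rounded masses seen for exactly one frag
result = df[np.round(df.mass, 6).isin(unique_masses)]  # keep only those frags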
Another solution is to create a duplicate of your list (if you need to preserve it for further processing later), iterate over it, and remove every element that does not satisfy your rule (m1 & m2).
You will get a new list with all the unique masses.
Just don't forget that if you do need to use the original list later, you will need to use deepcopy.
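A rough sketch of that idea, reusing the are_dif rule from the question (it assumes the masses have been pulled out of the dataframe into a plain list):
masses = df['mass'].tolist()
# keep a mass only if it differs from every other mass according to are_dif
unique_masses = [m1 for i, m1 in enumerate(masses)
                 if all(are_dif(m1, m2) for j, m2 in enumerate(masses) if i != j)]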

Combining MultiIndex and Index in a PANDAS DataFrame

I'm trying to come up with a DataFrame to do some data analysis and would really benefit from having a data frame that can handle regular indexing and MultiIndexing together in one data frame.
For each patient, I have 6 slices of various types of data (T1avg, T2avg, etc...). Let's call this dataframe1 (from an ipython notebook):
import numpy
import pandas
dat0 = numpy.zeros([6])
dat1 = numpy.zeros([6])
pat0 = (['NecS3Hs05'] * 6)
pat1 = (['NecS3Hs06'] * 6)
slc = (['Slice ' + str(x) for x in range(dat0.shape[-1])])
ind = list(zip(*[pat0 + pat1, slc + slc]))
named_ind = pandas.MultiIndex.from_tuples(ind, names=['Patients', 'Slices'])
ser = pandas.Series(numpy.append(dat0, dat1), index=named_ind)
df = pandas.DataFrame(data=ser, columns=['T1avg'])
Image of output: df1
I also have, for each patient, various strings of information (tumour type, number of imaging sessions, treatment type):
pats = ['NecS3Hs05','NecS3Hs06']
tx = ['Control','Treated']
Ttype = ['subcutaneous','orthotopic']
NSessions = ['2','3']
cols = ['Tx Group', 'Tumour Type', 'Imaging Sessions']
dat = numpy.array([tx,Ttype,NSessions]).T
df2 = pandas.DataFrame(dat, index=pats,columns=cols)
[I'd like to post a picture here as well, but I need at least 10 reputation to do so]
Ideally, I want to have a dataframe that looks as follows (I sketched it out in an image editor, sorry):
Image of desired output: df-desired
But when I try to use the append command,
com = df.append(df2)
I get something undesired, the MultiIndex that I set up in df is now gone, replaced with a simple index of type tuples ('NecS3Hs05, Slice 0' etc...). The indices from df2 remain the same 'NecS3Hs05'.
Is this possible to do with PANDAS, or am I barking up the wrong tree here? Also, is this even a recommended way of storing Patient attributes in a dataframe (i.e. is this unpandas)? I think what I would really like is to keep everything a simple index, but instead store N-d arrays inside the elements of the data frame.
For instance, if I try something like:
com['NecS3Hs05','T1avg']
I want to get an array/tuple of shape/len 6
and when I try to get the tumour type:
com['NecS3Hs05','Tumour Type']
I get the string 'subcutaneous'. Obviously I also want to retain the cool features of data frames as well, it looks like PANDAS is the right way to go here, I just need to understand a bit more about how to set up my dataframe
I hope this is a sensible question, if not, I'd be happy to re-form it.
Your problem can be solved, I believe, if you drop the MultiIndex business. Imagine df only has the (non-unique) 'Patients' as index; 'Slices' would become a simple column.
ind = list(zip(*[pat0 + pat1]))
named_ind = pandas.MultiIndex.from_tuples(ind, names=['Patients'])
ser = pandas.Series(numpy.append(dat0, dat1), index=named_ind)  # rebuild ser on the patient-only index
df = pandas.DataFrame({'T1avg': ser})
df['Slice'] = pandas.Series(numpy.append(slc, slc), index=df.index)
If you had to select on the slice, you can still do that:
df[df['Slice']=='Slice 4']
Will give you Slice 4 for all patients. Note how this eliminates the need to have that row for all patients.
As long as your new dataframe (df2) defines the same index you can now join on that index quite simply:
df.join(df2)
and you'll get
T1avg Slice Tx Group Tumour Type Imaging Sessions
Patients
NecS3Hs05 0 Slice 0 Control subcutaneous 2
NecS3Hs05 0 Slice 1 Control subcutaneous 2
NecS3Hs05 0 Slice 2 Control subcutaneous 2
NecS3Hs05 0 Slice 3 Control subcutaneous 2
NecS3Hs05 0 Slice 4 Control subcutaneous 2
NecS3Hs05 0 Slice 5 Control subcutaneous 2
NecS3Hs06 0 Slice 0 Treated orthotopic 3
NecS3Hs06 0 Slice 1 Treated orthotopic 3
NecS3Hs06 0 Slice 2 Treated orthotopic 3
NecS3Hs06 0 Slice 3 Treated orthotopic 3
NecS3Hs06 0 Slice 4 Treated orthotopic 3
NecS3Hs06 0 Slice 5 Treated orthotopic 3
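With the joined frame, the lookups from the question then work roughly as hoped (a sketch, assuming the joined result is named com):
com = df.join(df2)
com.loc['NecS3Hs05', 'T1avg']                 # the six T1avg values for that patient
com.loc['NecS3Hs05', 'Tumour Type'].iloc[0]   # 'subcutaneous' (repeated per slice, so take the first)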
