I don't understand why I am getting the dreaded SettingWithCopyWarning when I am doing exactly as instructed by the official documentation.
We have a dataframe called a:
a = pd.DataFrame(data = [['Tom',1],
['Tom',1],
['Dick',1],
['Dick',1],
['Harry',1],
['Harry',1]], columns = ['Col1', 'Col2'])
a
Out[377]:
Col1 Col2
0 Tom 1
1 Tom 1
2 Dick 1
3 Dick 1
4 Harry 1
5 Harry 1
First we create a "holder" dataframe:
holder = a
Then we create a subset of a:
c = a.loc[a['Col1'] == 'Tom',:]
c
Out[379]:
Col1 Col2
0 Tom 1
1 Tom 1
We create another subset d, which will be added to (a slice of) the previous subset c. But once we try to add d to c, we get the warning:
d = a.loc[a['Col1'] == 'Tom','Col2']
d
Out[389]:
0 1
1 1
c.loc[:,'Col2'] += d
C:\Users\~\anaconda3\lib\site-packages\pandas\core\indexing.py:494: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[item] = s
I would like to understand what I am doing wrong, because I use this logic very often (coming from R, where not everything is a darn object).
After noticing a different issue, I found a solution.
Whenever you say
dataframe_A = dataframe_B
you need to proceed with caution, because the assignment does not create a new dataframe: both names end up referring to the same object. If you make changes to dataframe_B, your dataframe_A will also change!
I understand just enough to fix the problem by using .copy(deep=True), which makes pandas create a full and independent copy, so that you can make changes to one without affecting the other.
On further investigation, and for those interested, it has to do with references (two names bound to the same underlying object), a concept whose scope goes beyond this specific question.
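A minimal sketch of both behaviours, reusing the dataframe a from the question (the same .copy() idea, applied to the slice, is also what silences the warning in the snippet above):
# Plain assignment does not copy: both names refer to the same object.
holder = a
a.loc[0, 'Col2'] = 99
print(holder.loc[0, 'Col2'])   # 99 -- holder changed along with a
# A deep copy is independent.
backup = a.copy(deep=True)
a.loc[0, 'Col2'] = 7
print(backup.loc[0, 'Col2'])   # still 99
# The same idea avoids the warning: take an explicit copy of the slice first.
c = a.loc[a['Col1'] == 'Tom', :].copy()
c.loc[:, 'Col2'] += d          # no SettingWithCopyWarning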
I have a dataframe df_my that looks like this:
id name age major
----------------------------------------
0 1 Mark 34 English
1 2 Tom 55 Art
2 3 Peter 31 Science
3 4 Mohammad 23 Math
4 5 Mike 47 Art
...
I am trying to get the value of major (only)
I used this and it works fine when I know the id of the record
df_my["major"][3]
returns
"Math"
great
but I want to get the major for a variable record
I used
i = 3
df_my.loc[df_my["id"]==i]["major"]
and also used
i = 3
df_my[df_my["id"]==i]["major"]
but they both return
3 Math
it includes the record index too
how can I get the major only and nothing else?
You could use squeeze:
i = 3
out = df_my.loc[df_my['id']==i, 'major'].squeeze()
Another option is iat:
out = df_my.loc[df_my['id']==i, 'major'].iat[0]
Output:
'Science'
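One caveat worth adding (assuming df_my from the question): squeeze() only collapses to a scalar when exactly one row matches, and .iat[0] raises an IndexError when nothing matches, so a small guard helps if the id may be absent:
matches = df_my.loc[df_my['id'] == i, 'major']
out = matches.iat[0] if not matches.empty else None   # first match, or None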
I also stumbled upon this problem, from a slightly different angle:
df = pd.DataFrame({'First Name': ['Kumar'],
'Last Name': ['Ram'],
'Country': ['India'],
'num_var': 1})
>>> df.loc[(df['First Name'] == 'Kumar'), "num_var"]
0 1
Name: num_var, dtype: int64
>>> type(df.loc[(df['First Name'] == 'Kumar'), "num_var"])
<class 'pandas.core.series.Series'>
So it returns a Series (although it is a Series with only one element). If you access through the index, you receive the integer:
df.loc[0, "num_var"]
1
type(df.loc[0, "num_var"])
<class 'numpy.int64'>
The answer on how to select the respective single value was already given above. However, I think it is interesting to note that accessing through an index always gives the single value, whereas accessing through a condition returns a Series. This is because accessing by index clearly identifies only one value, whereas accessing through a condition can match several values.
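For completeness, a minimal sketch of collapsing that one-element Series to a plain scalar (using the same df as above):
sel = df.loc[df['First Name'] == 'Kumar', 'num_var']
value = sel.item()        # raises ValueError unless there is exactly one element
# or: value = sel.iloc[0]   # first match; raises IndexError if nothing matches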
If one of the columns of your dataframe is the natural primary index for those data, then it's usually a good idea to make pandas aware of it by setting the index accordingly:
df_my.set_index('id', inplace=True)
Now you can easily get just the major value for any id value i:
df_my.loc[i, 'major']
Note that for i = 3, the output is 'Science', which is expected, as noted in the comments to your question above.
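If you would rather not modify df_my permanently, the same lookup also works on a temporarily indexed copy, e.g.:
df_my.set_index('id').loc[i, 'major']   # 'Science' for i = 3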
I have several sets of data, comprising about 3,000,000 measurements each. In many cases, 'identical' tests yielded different results. I am trying to create a scheme such that these inconsistencies are apparent in the processed data. I am using pandas for my analyses.
Though not my real case, here is a similar example: Suppose I have a set of chemicals, A, B, C, ..., and I mix them together pairwise and note whether a reaction takes place. Let's denote no reaction by '0' and a measurable reaction by '1'. Our data might look like this:
chemical 1   chemical 2   Do they react?   comments
A            B            1                comment 1
D            E            0                comment 2
A            B            1                comment 3
F            G            1                comment 4
A            B            0                comment 5
I am thinking that a workable aim would be to use a sentinel (say, '2') to indicate this inconsistency and summarize the comments for later examination:
chemical 1   chemical 2   Do they react?   comments
A            B            2                comment 1; comment 3; comment 5
D            E            0                comment 2
F            G            1                comment 4
I have developed code to identify the tuples (chemical 1, chemical 2) that lead to these inconsistent results. There are about 30,000 of them in my first dataset. This calculation runs fairly quickly.
I have also created some pandas code to give me the desired dataframe. Here is an adaptation for the type of data above:
def uniquify_rows(df, replicates, sentinel=2):
    for chemical_1, chemical_2 in replicates:
        # boolean mask selecting every row for this (chemical 1, chemical 2) pair
        indices = (df["chemical 1"] == chemical_1) & (df["chemical 2"] == chemical_2)
        rows = df[indices]
        comments = ";".join(rows["comments"])
        field_vals = rows.iloc[0].values  # chem1, chem2, react, comments
        field_vals[2], field_vals[3] = sentinel, comments
        df.loc[indices] = field_vals      # overwrite all rows of the pair
    df.drop_duplicates(inplace=True)
    return df
This code seems to work on my data, but is extremely slow. Being relatively new to pandas, I suspect that I am doing something very inefficient.
Any ideas to speed up this task?
Thank you.
Kind regards,
gyro
Let's try with groupby aggregate:
sentinel = 2
df = df.groupby(['chemical 1', 'chemical 2'], as_index=False).aggregate({
'Do they react?': lambda s: s.iat[0] if (s.iat[0] == s).all() else sentinel,
'comments': ';'.join
})
df:
chemical 1 chemical 2 Do they react? comments
0 A B 2 comment 1;comment 3;comment 5
1 D E 0 comment 2
2 F G 1 comment 4
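If the Python-level lambda turns out to be slow on ~3,000,000 rows, a hedged alternative along the same lines (same column names assumed) is to flag inconsistent pairs with nunique and patch in the sentinel afterwards:
agg = df.groupby(['chemical 1', 'chemical 2'], as_index=False).agg(
    **{
        'Do they react?': ('Do they react?', 'first'),
        'n_unique': ('Do they react?', 'nunique'),
        'comments': ('comments', ';'.join),
    }
)
# overwrite the reaction value with the sentinel wherever a pair disagreed
agg.loc[agg['n_unique'] > 1, 'Do they react?'] = sentinel
agg = agg.drop(columns='n_unique')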
I asked this question for R a few months back and got a great answer that I used often. Now I'm trying to transition to Python, but I was dreading attempting to rewrite this code snippet. And now, after trying, I haven't been able to translate the answer I got (or find anything similar by searching).
The question is: I have a dataframe that I'd like to append new columns to where the calculation is dependent on values in another dataframe which holds the instructions.
I have created a reproducible example below (although in reality there are quite a few more columns and many rows so speed is important and I'd like to avoid a loop if possible):
input dataframes:
import pandas as pd
data = {"A":["orange","apple","banana"],"B":[5,3,6],"C":[7,12,4],"D":[5,2,7],"E":[1,18,4]}
data_df = pd.DataFrame(data)
key = {"cols":["A","B","C","D","E"],"include":["no","no","yes","no","yes"],"subtract":["na","A","B","C","D"],"names":["na","G","H","I","J"]}
key_df = pd.DataFrame(key)
desired output (same as data but with 2 new columns):
output = {"A":["orange","apple","banana"],"B":[5,3,6],"C":[7,12,4],"D":[5,2,7],"E":[1,18,4],"H":[2,9,-2],"J":[-4,16,-3]}
output_df= pd.DataFrame(output)
So, the key dataframe has 1 row for each column in the base dataframe and it has an "include" column that has to be set to "yes" if any calculation is to be done. When it is set to "yes", then I want to add a new column with a defined name that subtracts a defined column (all lookups from the key dataframe).
For example, column "C" in the base dataframe is included, so I want to create a new column called "H" which is the value from column "C" minus the value from column "B".
P.S. Here was the answer from R, in case that triggers any thought processes for someone better skilled than me!
k <- subset(key, include == "yes")
output <- cbind(base,setNames(base[k[["cols"]]]-base[k[["subtract"]]],k$names))
Filter for the yes values in include:
yes = key_df.loc[key_df.include.eq("yes"), ["cols", "subtract", "names"]]
cols subtract names
2 C B H
4 E D J
Create a dictionary of the yes values and unpack it in the assign method:
yes_values = { name: data_df[col] - data_df[subtract]
for col, subtract, name
in yes.to_numpy()}
data_df.assign(**yes_values)
A B C D E H J
0 orange 5 7 5 1 2 -4
1 apple 3 12 2 18 9 16
2 banana 6 4 7 4 -2 -3
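One small usage note: assign returns a new dataframe rather than modifying data_df in place, so keep the result if you need it later, e.g.:
output_df = data_df.assign(**yes_values)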
While I generally understand the warnings, and many posts deal with this, I don't understand why I am getting a warning only when I reach the groupby line (the last one):
grouped = data.groupby(['group'])
for name, group in grouped:
data2=group.loc[data['B-values'] > 0]
data2["unique_A-values"]=data2.groupby(["A-values"])["A-values"].transform('count')
EDIT:
Here is my dataframe (data):
group A-values B-values
human 1 -1
human 1 5
human 1 4
human 3 4
human 2 10
bird 7 8
....
For B-values > 0 (data2=group.loc[data['B-values'] > 0]):
human has two A-values equal to 1, one equal to 3, and one equal to 2 (data2["unique_A-values"]=data2.groupby(["A-values"])["A-values"].transform('count'))
You get the warning because you take a reference to a chunk of your groupby and then try to add a column to it, so pandas is just warning you that if your intention is to update the original df, this may or may not work.
If you are just modifying a local copy then take a copy using copy() so it's explicit and the warning will go away:
for name, group in grouped:
data2=group.loc[data['B-values'] > 0].copy() # <- add .copy() here
data2["unique_A-values"]=data2.groupby(["A-values"])["A-values"].transform('count')
FYI the pandas groupby user guide says:
Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results.
for name, group in grouped:
# making a reference to the group chunk
data2 = group.loc[data['B-values'] > 0]
# trying to make a change to that group chunk reference
data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')
That said, it looks like you just want to count the values in the data frame so you may be better off using value_counts():
>>> data[data['B-values']>0].groupby('group')['A-values'].value_counts()
group A-values
bird 7 1
human 1 2
2 1
3 1
Name: A-values, dtype: int64
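And if the goal is a per-row count column rather than a summary, a rough sketch that avoids iterating over the groups at all is to filter once, take a copy, and transform over both keys:
data2 = data[data['B-values'] > 0].copy()
data2['unique_A-values'] = (
    data2.groupby(['group', 'A-values'])['A-values'].transform('count')
)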
I'm trying to set up a DataFrame for some data analysis and would really benefit from having one dataframe that can handle regular indexing and MultiIndexing together.
For each patient, I have 6 slices of various types of data (T1avg, T2avg, etc...). Let's call this dataframe1 (from an ipython notebook):
import numpy
import pandas

dat0 = numpy.zeros([6])
dat1 = numpy.zeros([6])
pat0 = ['NecS3Hs05']*6
pat1 = ['NecS3Hs06']*6
slc = ['Slice ' + str(x) for x in range(dat0.shape[-1])]
ind = list(zip(pat0+pat1, slc+slc))
named_ind = pandas.MultiIndex.from_tuples(ind, names=['Patients','Slices'])
ser = pandas.Series(numpy.append(dat0, dat1), index=named_ind)
df = pandas.DataFrame(data=ser, columns=['T1avg'])
Image of output: df1
I also have, for each patient, various strings of information (tumour type, number of imaging sessions, treatment type):
pats = ['NecS3Hs05','NecS3Hs06']
tx = ['Control','Treated']
Ttype = ['subcutaneous','orthotopic']
NSessions = ['2','3']
cols = ['Tx Group', 'Tumour Type', 'Imaging Sessions']
dat = numpy.array([tx,Ttype,NSessions]).T
df2 = pandas.DataFrame(dat, index=pats,columns=cols)
[I'd like to post a picture here as well, but I need at least 10 reputation to do so]
Ideally, I want to have a dataframe that looks as follows (sketched it out in an image editor sorry)
Image of desired output: df-desired
But when I try to use the append command,
com = df.append(df2)
I get something undesired, the MultiIndex that I set up in df is now gone, replaced with a simple index of type tuples ('NecS3Hs05, Slice 0' etc...). The indices from df2 remain the same 'NecS3Hs05'.
Is this possible to do with PANDAS, or am I barking up the wrong tree here? Also, is this even a recommended way of storing Patient attributes in a dataframe (i.e. is this unpandas)? I think what I would really like is to keep everything a simple index, but instead store N-d arrays inside the elements of the data frame.
For instance, if I try something like:
com['NecS3Hs05','T1avg']
I want to get an array/tuple of shape/len 6
and when I try to get the tumour type:
com['NecS3Hs05','Tumour Type']
I get the string 'subcutaneous'. Obviously I also want to retain the cool features of data frames as well, it looks like PANDAS is the right way to go here, I just need to understand a bit more about how to set up my dataframe
I hope this is a sensible question, if not, I'd be happy to re-form it.
Your problem can be solved, I believe, if you drop the MultiIndex business. Imagine df only has the (non-unique) 'Patients' as index; 'Slices' would become a simple column.
ind = list(zip(pat0 + pat1))
named_ind = pandas.MultiIndex.from_tuples(ind, names=['Patients'])
ser = pandas.Series(numpy.append(dat0, dat1), index=named_ind)
df = pandas.DataFrame({'T1avg': ser})
df['Slice'] = pandas.Series(numpy.append(slc, slc), index=df.index)
If you had to select on the slice, you can still do that:
df[df['Slice']=='Slice 4']
Will give you Slice 4 for all patients. Note how this eliminates the need to have that row for all patients.
As long as your new dataframe (df2) defines the same index you can now join on that index quite simply:
df.join(df2)
and you'll get
T1avg Slice Tx Group Tumour Type Imaging Sessions
Patients
NecS3Hs05 0 Slice 0 Control subcutaneous 2
NecS3Hs05 0 Slice 1 Control subcutaneous 2
NecS3Hs05 0 Slice 2 Control subcutaneous 2
NecS3Hs05 0 Slice 3 Control subcutaneous 2
NecS3Hs05 0 Slice 4 Control subcutaneous 2
NecS3Hs05 0 Slice 5 Control subcutaneous 2
NecS3Hs06 0 Slice 0 Treated orthotopic 3
NecS3Hs06 0 Slice 1 Treated orthotopic 3
NecS3Hs06 0 Slice 2 Treated orthotopic 3
NecS3Hs06 0 Slice 3 Treated orthotopic 3
NecS3Hs06 0 Slice 4 Treated orthotopic 3
NecS3Hs06 0 Slice 5 Treated orthotopic 3
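As a rough follow-up sketch (assuming the joined frame above): selecting on the non-unique 'Patients' index gets close to the access patterns you describe, with all six T1avg values for a patient coming back as an array and the patient-level strings staying single values in df2:
joined = df.join(df2)
t1_values = joined.loc['NecS3Hs05', 'T1avg'].values   # array of length 6
tumour = df2.loc['NecS3Hs05', 'Tumour Type']          # 'subcutaneous'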