I use Pandas dataframes to manipulate data and I usually visualise them as virtual spreadsheets, with rows and columns defining the positions of individual cells. I'm happy with the methods to slice and dice the dataframes, but there seems to be some odd behaviour when the dataframe contains a single row.

Basically, I want to select rows of data from a large parent dataframe that meet certain criteria and then pass those results as a daughter dataframe to a separate function for further processing. Sometimes there will be only a single record in the parent dataframe that meets the defined criteria, so the daughter dataframe will contain only a single row. Nevertheless, I still need to be able to access data in the daughter in the same way as in the parent dataframe. To illustrate my point, consider the following dataframe:
import pandas as pd
tempDF = pd.DataFrame({'group':[1,1,1,1,2,2,2,2],
                       'string':['a','b','c','d','a','b','c','d']})
print(tempDF)
Which looks like:
   group string
0      1      a
1      1      b
2      1      c
3      1      d
4      2      a
5      2      b
6      2      c
7      2      d
As an example, I can now select those rows where 'group' == 2 and 'string' == 'c', which yields just a single row. As expected, the length of the dataframe is 1 and it's possible to print just a single cell using .loc based on index values in the original dataframe:
tempDF2 = tempDF.loc[((tempDF['group']==2) & (tempDF['string']=='c')),['group','string']]
print(tempDF2)
print('Length of tempDF2 = ',tempDF2.index.size)
print(tempDF2.loc[6,['string']])
Output:
   group string
6      2      c
Length of tempDF2 = 1
string c
However, if I select a single row by passing a scalar label to .loc, then the result is printed in a transposed form and its length is now given as 2 (rather than 1). Clearly, it's no longer possible to select single cell values based on the index of the original parent dataframe:
tempDF3 = tempDF.loc[6,['group','string']]
print(tempDF3)
print('Length of tempDF3 = ',tempDF3.index.size)
Output:
group     2
string    c
Name: 6, dtype: object
Length of tempDF3 = 2
To my mind, both these methods are doing the same thing, namely selecting a single row of data. However, in the second example the rows and columns are transposed, making it impossible to extract data in the expected way.
Why should these 2 behaviours exist? What is the point of transposing a single row of a dataframe as a default behaviour? How can I make sure that a dataframe containing a single row isn't transposed when I pass it to another function?
tempDF3 = tempDF.loc[6,['group','string']]
The scalar 6 in the first position of the .loc selection dictates that the return type will be a Series, and hence your problem. Instead, use [6]:
tempDF3 = tempDF.loc[[6],['group','string']]
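To illustrate the difference, here is a minimal sketch built on the tempDF defined above (the exact console output may vary slightly between pandas versions):

import pandas as pd

tempDF = pd.DataFrame({'group':[1,1,1,1,2,2,2,2],
                       'string':['a','b','c','d','a','b','c','d']})

# a scalar row label returns a Series: one element per selected column
row_as_series = tempDF.loc[6, ['group','string']]
print(type(row_as_series))            # <class 'pandas.core.series.Series'>

# a list of row labels returns a DataFrame, even when it has only one row
row_as_frame = tempDF.loc[[6], ['group','string']]
print(type(row_as_frame))             # <class 'pandas.core.frame.DataFrame'>
print(row_as_frame.loc[6, 'string'])  # c -- cell access works as it does on the parent

This keeps the daughter dataframe two-dimensional, so it can be passed to other functions and indexed exactly like the parent.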
I'm new to the world of Python, so I apologize in advance if this question seems pretty rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe. I want to replace the duplicate columns from the first dataframe with a single column containing the mean values in the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this data frame corresponds to a particular exon. Every column on this data frame corresponds to a time-point (AGE).
Some of these columns share the same name (age) and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate "12 pcw" columns. After that, I hope to pull these averaged values from the first dataframe into a second dataframe.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = df.loc[:, df.columns == age]  # pull the columns of df that have this age as a separate df
    if len(age_df.columns) > 1:  # check if age_df has >1 SAME column; if so, take the avg across those columns
        mean = age_df.mean(axis=1)
        mean_df[age] = mean
    else:
        # just pull out the values and put them into mean_df
        pass
#4) Now, with my new averaged array (or the same array if multiple ages are NOT present), I want to place this array into my 'mean_df' under the appropriate column. I understand that I should use the 'age' variable provided by the for loop to get the proper column name in my mean_df, but I'm not sure how to do this. This has all been quite a steep learning curve and I feel like it's a simple solution, but I can't seem to wrap my head around it. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
import pandas as pd

data = [[1,2,3],[4,5,6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
# col1 col2 col2
# 0 1 2 3
# 1 4 5 6
df = df.groupby(lambda x:x, axis=1).mean()
# col1 col2
# 0 1.0 2.5
# 1 4.0 5.5
The groupby function takes another function (the lambda), which means that each column name is passed to it and it returns the group that column belongs to. In our case, we just want the column name itself to be the group. So, for the third column, named col2, it says 'this column belongs to the group named col2', which already exists (because the second column was passed earlier). You then provide the aggregation you want, in this case mean().
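As a side note, grouping along axis=1 has been deprecated in recent pandas versions. On newer pandas the same result can be obtained by transposing the original frame, grouping its rows by name, and transposing back:

# applied to the original df, this is equivalent to the axis=1 groupby above
df_mean = df.T.groupby(level=0).mean().T
# col1 col2
# 0 1.0 2.5
# 1 4.0 5.5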
I have a column which contains lists of variable sizes. The lists contain a limited set of short text values, around 60 unique values altogether.
0 ["AC","BB"]
1 ["AD","CB", "FF"]
2 ["AA","CC"]
3 ["CA","BB"]
4 ["AA"]
I want to turn these values into columns in my dataframe, where the value of each new column is 1 if the value is present in that row's list and 0 if not.
I know I could expand the lists, then call unique and set those as new columns. But after that I don't know what to do.
Here's one way:
df = pd.get_dummies(df.explode('val')).sum(level = 0)
NOTE: Here sum(level=0) is effectively a grouping operation that uses the index for grouping, which is why it works well right after exploding the dataframe (explode repeats the original index for each list element).
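For completeness, here is a minimal, self-contained sketch of the same idea, assuming the list column is named 'val'; it uses groupby(level=0).sum() since sum(level=0) has been deprecated in recent pandas versions:

import pandas as pd

df = pd.DataFrame({'val': [['AC','BB'], ['AD','CB','FF'], ['AA','CC'], ['CA','BB'], ['AA']]})

# explode gives one row per list element (repeating the original index),
# get_dummies one-hot encodes those elements, and grouping on the original
# index collapses everything back to one row per record
dummies = pd.get_dummies(df.explode('val')['val']).groupby(level=0).sum()
print(dummies)
#    AA  AC  AD  BB  CA  CB  CC  FF
# 0   0   1   0   1   0   0   0   0
# 1   0   0   1   0   0   1   0   1
# 2   1   0   0   0   0   0   1   0
# 3   0   0   0   1   1   0   0   0
# 4   1   0   0   0   0   0   0   0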
Can I create a dataframe which has a unique index or columns, similar to creating a unique key in MySQL, so that it will return an error if I try to add a duplicate index?
Or is my only option to create an if-statement and check for the value in the dataframe before appending it?
EDIT:
It seems my question was a bit unclear. With unique columns I mean that we cannot have non-unique values in a column.
With
df.append(new_row, verify_integrity=True)
we can check for all columns, but how can we check for only one or two columns?
You can use df.append(..., verify_integrity=True) to maintain a unique row index:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[9])
This successfully appends a new row (with index 9):
df.append(new_row, verify_integrity=True)
# A B C D
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 9 10 20 30 40
This raises ValueError because 1 is already in the index:
df.append(dup_row, verify_integrity=True)
# ValueError: Indexes have overlapping values: [1]
While the above works to ensure a unique row index, I'm not aware of a similar method for ensuring a unique column index. In theory you could transpose the DataFrame, append with verify_integrity=True and then transpose again, but generally I would not recommend this, since transposing can alter dtypes: when the column dtypes are not all the same, the transposed DataFrame gets columns of object dtype, and conversion to and from object arrays can be bad for performance.
If you need both unique row- and column- Indexes, then perhaps a better alternative is to stack your DataFrame so that all the unique column index levels become row index levels. Then you can use append with verify_integrity=True on the reshaped DataFrame.
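Note that DataFrame.append was removed in pandas 2.0. On newer versions, pd.concat accepts the same verify_integrity flag, so an equivalent sketch using the frames defined above would be:

# appends new_row (index 9) successfully
df = pd.concat([df, new_row], verify_integrity=True)

# raises ValueError because index 1 already exists
pd.concat([df, dup_row], verify_integrity=True)
# ValueError: Indexes have overlapping values: ...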
OP's follow-up question:
With df.append(new_row, verify_integrity=True), we can check for all columns, but how can we check for only one or two
columns?
To check uniqueness of just one column, say the column name is value, one can try
df['value'].duplicated().any()
This will check whether any value in this column is duplicated. If something is duplicated, the column is not unique.
Given two columns, say C1 and C2, to check whether there are duplicated rows, we can still use DataFrame.duplicated:
df[["C1", "C2"]].duplicated()
It will check row-wise uniqueness. You can again use any to check if any of the returned values is True.
Given two columns, say C1 and C2, to check whether each column contains duplicated values, we can use apply:
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
This will apply the function to each column.
NOTE
pd.DataFrame([[np.nan, np.nan],
[ np.nan, np.nan]]).duplicated()
0 False
1 True
dtype: bool
np.nan will also be captured by duplicated. If you want to ignore np.nan, you can select the non-NaN part first.
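For example, a minimal sketch (using the placeholder column name value from above):

# ignore NaN entries when checking a single column for duplicates
df['value'].dropna().duplicated().any()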
I am trying to create a function that iterates through a pandas dataframe row by row. I want to create a new column based on row values of other columns. My original dataframe could look like this:
df:
A B
0 1 2
1 3 4
2 2 2
Now I want to create a new column filled with the row values of Column A - Column B at each index position, so that the result looks like this:
df:
A B A-B
0 1 2 -1
1 3 4 -1
2 2 2 0
the solution I have works, but only when I do NOT use it in a function:
for index, row in df.iterrows():
    print(index)
    df['A-B'] = df['A'] - df['B']
This gives me the desired output, but when I try to use it as a function, I get an error.
def test(x):
    for index, row in df.iterrows():
        print(index)
        df['A-B'] = df['A'] - df['B']
    return df
df.apply(test)
ValueError: cannot copy sequence with size 4 to array axis with dimension 3
What am I doing wrong here and how can I get it to work?
It's because the apply method works column by column by default; change axis to 1 if you'd like it to go through the rows instead:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
df.apply(test, axis=1)
EDIT
I thought that you needed to do some complex manipulation with each row. If you just need to subtract the columns from each other:
df['A-B'] = df.A - df.B
As indicated by Anton, you should execute the apply function with the axis=1 parameter. However, it is not necessary to then loop through the rows as you did in the function test, since
the apply documentation mentions:
Objects passed to functions are Series objects
So you could simplify the function to:
def test(x):
    x['A-B'] = x['A'] - x['B']
    return x
and then run:
df.apply(test,axis=1)
Note that in fact you named the parameter of test x, while not using x in the function test at all.
Finally, I should note that you can do column-wise operations with pandas (i.e. without a for loop) by simply doing this:
df['A-B']=df['A']-df['B']
Also see:
how to compute a new column based on the values of other columns in pandas - python
How to apply a function to two columns of Pandas dataframe
Say I have a dataframe df
import pandas as pd
df = pd.DataFrame()
and I have the following tuple and value:
column_and_row = ('bar', 'foo')
value = 56
How can I most easily add this tuple to my dataframe so that:
df['bar']['foo']
returns 56?
What if I have a list of such tuples and list of values? e.g.
columns_and_rows = [A, B, C, ...]
values = [5, 10, 15]
where A, B and C are tuples of columns and rows (similar to column_and_row).
Along the same lines, how would this be done with a Series?, e.g.:
import pandas as pd
srs = pd.Series()
and I want to add one item to it with index 'foo' and value 2 so that:
srs['foo']
returns 2?
Note:
I know that none of these are efficient ways of creating dataframes or series, but I need a solution that allows me to grow my structures organically in this way when I have no other choice.
For a series, you can do it with append, but you have to create a series from your value first:
>>> print(x)
A 1
B 2
C 3
>>> print(x.append(pandas.Series([8, 9], index=["foo", "bar"])))
A 1
B 2
C 3
foo 8
bar 9
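Note that Series.append was removed in pandas 2.0; on newer versions, pd.concat gives the same result:

>>> import pandas
>>> x = pandas.Series([1, 2, 3], index=["A", "B", "C"])
>>> print(pandas.concat([x, pandas.Series([8, 9], index=["foo", "bar"])]))
A      1
B      2
C      3
foo    8
bar    9
dtype: int64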
For a DataFrame, you can also use append or concat, but it doesn't make sense to do this for a single cell only. DataFrames are tabular, so you can only add a whole row or a whole column. The documentation has plenty of examples and there are other questions about this.
Edit: Apparently you actually can set a single value with df.set_value('newRow', 'newCol', newVal). However, if that row/column doesn't already exist, this will actually create an entire new row and/or column, with the rest of the values in the created row/column filled with NaN. Note that in this case a new object will be returned, so you'd have to do df = df.set_value('newRow', 'newCol', newVal) to modify the original.
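On newer pandas versions, where set_value has been removed, the same single-cell enlargement can be written as a plain .loc assignment (this modifies df in place); a minimal sketch:

import pandas as pd

df = pd.DataFrame()
df.loc['foo', 'bar'] = 56     # creates row 'foo' and column 'bar' if they don't exist yet
print(df.loc['foo', 'bar'])   # 56 (the stored dtype may vary by pandas version)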
However, no matter how you do it, this is going to be inefficient. Pandas data structures are based on NumPy and fundamentally rely on knowing the size of the array ahead of time. You can add rows and columns, but every time you do so, an entirely new data structure is created, so if you do this a lot, it will be slower than using ordinary Python lists/dicts.
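If many values arrive one at a time, a common pattern that avoids repeatedly rebuilding the DataFrame is to collect them in plain Python structures first and construct the DataFrame once at the end. A minimal sketch (the variable names are purely illustrative):

import pandas as pd

cells = {}    # {row_label: {column_label: value}}
for (col, row), value in zip([('bar', 'foo'), ('baz', 'qux')], [56, 57]):
    cells.setdefault(row, {})[col] = value

df = pd.DataFrame.from_dict(cells, orient='index')
print(df.loc['foo', 'bar'])   # 56.0 (float because the missing cells in that column are NaN)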