Say I have a dataframe df
import pandas as pd
df = pd.DataFrame()
and I have the following tuple and value:
column_and_row = ('bar', 'foo')
value = 56
How can I most easily add this tuple to my dataframe so that:
df['bar']['foo']
returns 56?
What if I have a list of such tuples and a list of values? e.g.
columns_and_rows = [A, B, C, ...]
values = [5, 10, 15]
where A, B and C are tuples of columns and rows (similar to column_and_row).
Along the same lines, how would this be done with a Series? e.g.:
import pandas as pd
srs = pd.Series()
and I want to add one item to it with index 'foo' and value 2 so that:
srs['foo']
returns 2?
Note:
I know that none of these are efficient ways of creating dataframes or series, but I need a solution that allows me to grow my structures organically in this way when I have no other choice.
For a Series, you can do it with append, but you have to create a Series from your value first:
>>> print(x)
A    1
B    2
C    3
>>> print(x.append(pandas.Series([8, 9], index=["foo", "bar"])))
A      1
B      2
C      3
foo    8
bar    9
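Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0. A minimal sketch of the modern equivalents, using the names from the question:

import pandas as pd

srs = pd.Series(dtype="int64")

# Setting with enlargement: assigning to a new label grows the Series in place.
srs['foo'] = 2
print(srs['foo'])  # 2

# To add several values at once, build a second Series and concatenate.
srs = pd.concat([srs, pd.Series([8, 9], index=['bar', 'baz'])])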
For a DataFrame, you can also use append or concat, but it doesn't make sense to do this for a single cell only. DataFrames are tabular, so you can only add a whole row or a whole column. The documentation has plenty of examples and there are other questions about this.
Edit: Apparently you actually can set a single value with df.set_value('newRow', 'newCol', newVal). However, if that row/column doesn't already exist, this will create an entirely new row and/or column, with the rest of the values in the created row/column filled with NaN. Note that in this case a new object will be returned, so you'd have to do df = df.set_value('newRow', 'newCol', newVal) to modify the original.
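set_value itself was deprecated in pandas 0.21 and removed in 1.0. In current pandas the same setting-with-enlargement is spelled df.loc[row, col] = val (or df.at[row, col] = val for a single scalar), and it modifies df in place. A minimal sketch covering both the single cell and the list-of-tuples case from the question (the extra tuples here are made up for illustration):

import pandas as pd

df = pd.DataFrame()

# Single cell: missing rows/columns are created, other cells filled with NaN.
column_and_row = ('bar', 'foo')
value = 56
col, row = column_and_row
df.loc[row, col] = value
print(df['bar']['foo'])  # 56

# Several cells: loop over the (column, row) tuples and the values.
columns_and_rows = [('bar', 'foo'), ('baz', 'qux'), ('quux', 'corge')]
values = [5, 10, 15]
for (col, row), val in zip(columns_and_rows, values):
    df.loc[row, col] = val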
However, no matter how you do it, this is going to be inefficient. Pandas data structures are based on NumPy and fundamentally rely on knowing the size of the array ahead of time. You can add rows and columns, but every time you do so, an entirely new data structure is created, so if you do this a lot it will be slower than using ordinary Python lists/dicts.
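If you have no choice but to grow the structure in a loop, one common workaround is to accumulate in plain Python first and construct the pandas object once at the end; a sketch:

import pandas as pd

# Accumulate cell values in a dict of dicts, {column: {row: value}} ...
cells = {}
for (col, row), val in zip([('bar', 'foo'), ('baz', 'qux')], [5, 10]):
    cells.setdefault(col, {})[row] = val

# ... and build the DataFrame in one go at the end.
df = pd.DataFrame(cells)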
Related
I'm interested in creating a "dynamic" column in a pandas dataframe. The dynamic column's contents contain references to other columns, such that when one is modified the dynamic's values are also.
df = pd.DataFrame({'source':[1,2,3]})
df['dyn_col'] = ## something magic here that means source*2
df.dyn_col
>>> 2,4,6
df.source = [4,5,6]
df.dyn_col
>>> 8,10,12
This could potentially include references to multiple columns, e.g. A*B.
I tried extending __getitem__, but it got messy quickly, in part (I think) because the key wasn't present in the columns.
How can this be accomplished?
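One lightweight way to get this effect, rather than overriding __getitem__, is to subclass DataFrame and expose the dynamic column as a property, so it is recomputed from the source column on every access. A sketch (DynDataFrame and the *2 rule are just illustrations, not a pandas feature):

import pandas as pd

class DynDataFrame(pd.DataFrame):
    # Keep the subclass through pandas operations that return new frames.
    @property
    def _constructor(self):
        return DynDataFrame

    # Recomputed from 'source' on every access, so it tracks modifications.
    @property
    def dyn_col(self):
        return self['source'] * 2

df = DynDataFrame({'source': [1, 2, 3]})
print(df.dyn_col.tolist())  # [2, 4, 6]
df['source'] = [4, 5, 6]
print(df.dyn_col.tolist())  # [8, 10, 12]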
I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work.
In [26]: df.ix[2]
Out[26]:
A 1.027680
B 1.514210
C -1.466963
D -0.162339
Name: 2000-01-03 00:00:00
In [27]: df[2:3]
Out[27]:
A B C D
2000-01-03 1.02768 1.51421 -1.466963 -0.162339
I would expect df[2] to work the same way as df[2:3], to be consistent with Python indexing convention. Is there a design reason for not supporting row indexing by a single integer?
Echoing @HYRY, see the new docs in 0.11:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
Here we have new operators: .iloc to explicitly support only integer indexing, and .loc to explicitly support only label indexing.
e.g. imagine this scenario
In [1]: df = pd.DataFrame(np.random.randn(5,2),index=range(0,10,2),columns=list('AB'))
In [2]: df
Out[2]:
A B
0 1.068932 -0.794307
2 -0.470056 1.192211
4 -0.284561 0.756029
6 1.037563 -0.267820
8 -0.538478 -0.800654
In [5]: df.iloc[[2]]
Out[5]:
A B
4 -0.284561 0.756029
In [6]: df.loc[[2]]
Out[6]:
A B
2 -0.470056 1.192211
[] slices the rows (by label location) only
The primary purpose of the DataFrame indexing operator, [] is to select columns.
When the indexing operator is passed a string or integer, it attempts to find a column with that particular name and return it as a Series.
So, in the question above: df[2] searches for a column name matching the integer value 2. This column does not exist and a KeyError is raised.
The DataFrame indexing operator completely changes behavior to select rows when slice notation is used
Strangely, when given a slice, the DataFrame indexing operator selects rows and can do so by integer location or by index label.
df[2:3]
This will slice beginning from the row with integer location 2 up to 3, exclusive of the last element. So, just a single row. The following selects rows beginning at integer location 6 up to but not including 20 by every third row.
df[6:20:3]
You can also use slices consisting of string labels if your DataFrame index has strings in it. For more details, see this solution on .iloc vs .loc.
I almost never use this slice notation with the indexing operator, as it's not explicit and hardly ever used. When slicing by rows, stick with .loc/.iloc.
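For reference, the explicit spellings of that row slice (assuming a default integer index):

df.iloc[2:3]  # by integer position: rows 2 up to, but not including, 3
df.loc[2:3]   # by label: note .loc slices are inclusive of the end label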
You can think of a DataFrame as a dict of Series. df[key] tries to select the column by key and returns a Series object.
However, slicing inside [] slices the rows, because it's a very common operation.
You can read the documentation for details:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
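A minimal illustration of the two behaviours, with made-up data:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df['A'])   # key lookup: returns the column 'A' as a Series
print(df[1:3])   # slice: returns rows 1 and 2 as a DataFrame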
For index-based access to a pandas DataFrame, one can also convert the table to a NumPy array. Older pandas used df.as_matrix() for this, but it was removed in pandas 1.0; the modern equivalent is
np_df = df.to_numpy()
and then
np_df[i]
would work.
You can take a look at the source code.
DataFrame has a private function _slice() to slice the DataFrame, and its axis parameter determines which axis to slice. __getitem__() for DataFrame doesn't set the axis when invoking _slice(), so _slice() slices along the default axis 0.
You can take a simple experiment, that might help you:
print df._slice(slice(0, 2))
print df._slice(slice(0, 2), 0)
print df._slice(slice(0, 2), 1)
You can loop through the DataFrame like this:
for ad in range(len(dataframe_c)):
    print(dataframe_c.values[ad])
I would normally go for .loc/.iloc as suggested by Ted, but one may also select a row by transposing the DataFrame. To stay with the example above, df.T[2] gives you row 2 of df.
If you want to index multiple rows by their integer indexes, use a list of indexes:
idx = [2,3,1]
df.iloc[idx]
N.B. If idx is created using some rule, you can also reorder the dataframe using .iloc (or .loc), because the output rows will follow the order of idx. In that sense, .iloc can act like a sorting function where idx is the sort key.
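A quick sketch of that ordering behaviour with made-up data:

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})
idx = [2, 3, 1]
print(df.iloc[idx])  # rows come back in the order 2, 3, 1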
I need to compare some columns in a dataframe as a whole, for example:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
#Select condition: If df['A'] == 1 and df['B'] == 4, then pick up this row.
For this simple example, I can use below method:
df.loc[(df['A']==1)&(df['B']==4),'A':'B']
However, in reality my dataframe has tens of columns which should be compared as a whole. The above solution would be very messy if I listed all of them, so I thought comparing them as a whole against a list might solve the problem:
#something just like this:
df.loc[df.loc[:,'A':'B']==[1,4], 'A':'B']
That didn't work. So I came up with the idea of first combining all the desired columns into a new column holding a list value, then comparing this new column with the list. The latter has been solved in Pandas: compare list objects in Series.
Although generally I've solved my case, I still want to know if there is an easier way to solve this problem? Thanks.
Or use [[]] to select multiple columns and compare their values against the list:
df[(df[['A','B']].values==[1,4]).all(1)]
Demo:
>>> df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
>>> df[(df[['A','B']].values==[1,4]).all(1)]
A B
0 1 4
>>>
You can use a Boolean mask via a NumPy array representation of your data:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
res = df[(df.loc[:, 'A':'B'].values == [1, 4]).all(1)]
print(res)
A B
0 1 4
In this situation, never combine your columns into a single series of lists. This is inefficient as you will lose all vectorisation benefits, and any processing thereafter will involve Python-level loops.
I have a pretty simple question, but I'm having trouble achieving what I want.
I have a DataFrame that looks like this:
base
[a,b,c]
[c,d,e]
[a,b,h]
I want to remove the second element of every list, so I would get this:
base
[a,c]
[c,e]
[a,h]
I suppose there's an easy way to do this, but it's not that usual to work with lists in DataFrames, so I'm not finding anything.
Thanks in advance.
Edit: The DataFrame is just one column, which is comprised of lists, all of the same length. I need to remove one element, so the length of the list is the same as the number of columns of the DataFrame it will become.
IIUC
df.base = pd.DataFrame(df.base.values.tolist()).drop(1, axis=1).values.tolist()
df
Out[635]:
base
0 [a, c]
1 [c, e]
2 [a, h]
Don't use lists in a Series
Pandas Series are not designed to hold lists. You lose functionality and performance with two layers of pointers: one for the object-dtype array, another for each list within the Series.
Since each list has the same number of elements, separate into columns instead:
df = pd.DataFrame({'base': [list('abc'), list('cde'), list('abh')]})
res = pd.DataFrame(df['base'].values.tolist()).iloc[:, [0, 2]]
print(res)
0 2
0 a c
1 c e
2 a h
You could work on the underlying NumPy array:
import numpy as np
df['base'] = np.stack(df.base.values)[:, [0, 2]].tolist()
>>> df
base
0 [a, c]
1 [c, e]
2 [a, h]
You can use df['base'].apply(lambda x: x.pop(1)). Note that pop acts in place, so you don't need to assign the result to base (in fact, if you do so, you'll get the removed element instead of the remaining list).
However, as @jpp says, you should consider using some other data structure, such as a dataframe with a multi-index or a three-dimensional numpy array.
And considering your edit, it's probably easier to convert the data to a dataframe with multiple columns, and then delete the extra column, rather than trying to manipulate a column of lists and then turn it into your final dataframe. It may seem simpler to have "only one column", but you're just putting the extra complexity into a separate layer, rather than getting rid of it. Pandas was built around two-dimensional data being represented as columns and rows, not a single column of lists, so you're going out of your way to not use the tools that pandas was built to provide.
Presumably, you had something like this:
data=[['a','b','c'],
['c','d','e'],
['a','b','h']]
And you did something like this:
df = pd.DataFrame({'base':data})
You should instead do
df = pd.DataFrame(data)
df = df[[0,2]]
I use Pandas dataframes to manipulate data and I usually visualise them as virtual spreadsheets, with rows and columns defining the positions of individual cells. I'm happy with the methods to slice and dice the dataframes, but there seems to be some odd behaviour when the dataframe contains a single row.

Basically, I want to select rows of data from a large parent dataframe that meet certain criteria and then pass those results as a daughter dataframe to a separate function for further processing. Sometimes there will only be a single record in the parent dataframe that meets the defined criteria, and therefore the daughter dataframe will only contain a single row. Nevertheless, I still need to be able to access data in the daughter in the same way as for the parent dataframe. To illustrate my point, consider the following dataframe:
import pandas as pd
tempDF = pd.DataFrame({'group':[1,1,1,1,2,2,2,2],
'string':['a','b','c','d','a','b','c','d']})
print(tempDF)
Which looks like:
group string
0 1 a
1 1 b
2 1 c
3 1 d
4 2 a
5 2 b
6 2 c
7 2 d
As an example, I can now select the rows where 'group' == 2 and 'string' == 'c', which yields just a single row. As expected, the length of the dataframe is 1, and it's possible to print just a single cell using .loc based on index values in the original dataframe:
tempDF2 = tempDF.loc[((tempDF['group']==2) & (tempDF['string']=='c')),['group','string']]
print(tempDF2)
print('Length of tempDF2 = ',tempDF2.index.size)
print(tempDF2.loc[6,['string']])
Output:
group string
6 2 c
Length of tempDF2 = 1
string c
However, if I select a single row by passing a scalar label to .loc, then the result is printed in a transposed form and its length is now given as 2 (rather than 1). Clearly, it's no longer possible to select single cell values based on the index of the original parent dataframe:
tempDF3 = tempDF.loc[6,['group','string']]
print(tempDF3)
print('Length of tempDF3 = ',tempDF3.index.size)
Output:
group 2
string c
Name: 6, dtype: object
Length of tempDF3 = 2
In my mind, both these methods are doing the same thing, namely selecting a single row of data. However, in the second example, the rows and columns are transposed, making it impossible to extract data in the expected way.
Why should these 2 behaviours exist? What is the point of transposing a single row of a dataframe as a default behaviour? How can I make sure that a dataframe containing a single row isn't transposed when I pass it to another function?
tempDF3 = tempDF.loc[6,['group','string']]
Passing the scalar 6 in the first position of the .loc selection dictates that the return type will be a Series, hence your problem. Instead, use the list [6]:
tempDF3 = tempDF.loc[[6],['group','string']]
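With the list selector, the result stays a one-row DataFrame, so the cell access from the question keeps working:

tempDF3 = tempDF.loc[[6], ['group', 'string']]
print(tempDF3)                                      # one-row DataFrame
print('Length of tempDF3 = ', tempDF3.index.size)   # 1
print(tempDF3.loc[6, ['string']])                   # string    c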