pandas iterrows changes ints into floats - python

I'm trying to iterate over the rows of a DataFrame that contains some int64s and some floats. iterrows() seems to be turning my ints into floats, which breaks everything I want to do downstream:
>>> import pandas as pd
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [id for id in df.id]
[10000000000000001, 10000000000000002]
>>> [r['id'] for (idx,r) in df.iterrows()]
[10000000000000000.0, 10000000000000002.0]
Iterating directly over df.id is fine. But through iterrows(), I get different values. Is there a way to iterate over the rows in such a way that I can still index by column name and get all the correct values?

Here's the relevant part of the docs:
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames) [...] To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
Example for your data:
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [t[1] for t in df.itertuples()]
[10000000000000001, 10000000000000002]
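Since itertuples() returns namedtuples (by default with the class name Pandas, as in the other answer), you can also pull the value out by column name instead of by position, which reads a little clearer. A small sketch assuming the same df:
>>> [t.id for t in df.itertuples()]
[10000000000000001, 10000000000000002]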

If possible, you're better off avoiding iteration entirely. Check whether you can vectorize your work first.
If vectorization is impossible, you probably want DataFrame.itertuples. That will return an iterable of (named)tuples where the first element is the index label.
In [2]: list(df.itertuples())
Out[2]:
[Pandas(Index=0, id=10000000000000001, prc=1.5),
 Pandas(Index=1, id=10000000000000002, prc=2.5)]
iterrows returns a Series for each row. Since a Series is backed by a single NumPy array, whose elements must all share one dtype, your ints were cast to floats.
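A minimal sketch of that upcast, using the df from the question: the frame keeps one dtype per column, but each row handed out by iterrows() is a single Series and gets promoted to a common dtype (float64 here), which cannot represent the large int exactly.
>>> df.dtypes
id       int64
prc    float64
dtype: object
>>> next(df.iterrows())[1].dtype
dtype('float64')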

Related

Efficient method to append pandas rows into a list

I have a pandas DataFrame with ~5m rows. I am looking for an efficient method to append / store each row into a list.
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'val': ['w', 'x', 'y', 'z'],
    'pos': ['p1', 'p2', 'p3', 'p4']
})

# Using a list comprehension
df_lst = []
[df_lst.append(rows) for rows in df.iterrows()]
Given the size of the DataFrame, I am looking for other methods that are more efficient at storing the rows in a list. Is there a vectorized solution to this?
I'd recommend .tolist(), as others have also mentioned in the comments, so I'll give an example of it.
df_lst = df.values.tolist()
In terms of efficiency (some in the comments have questioned why you would want to do this at all), it depends on the use case. The DataFrame is of course the more efficient structure for performing operations on the data, so converting it can seem redundant, but a plain list can also be lighter-weight than a DataFrame, so the conversion is not unreasonable if you don't need the DataFrame's features.
From here:
You can use df.to_dict('records') to convert the rows into a list of dicts. Whether this is useful depends on what you want to do with the list afterwards.
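For comparison, a small sketch of what the two options produce on the example df above (the variable names here are just for illustration): .values.tolist() drops the column names, while to_dict('records') keeps them at the cost of one dict per row.
rows_as_lists = df.values.tolist()
# [[0, 'w', 'p1'], [1, 'x', 'p2'], [2, 'y', 'p3'], [3, 'z', 'p4']]
rows_as_dicts = df.to_dict('records')
# [{'id': 0, 'val': 'w', 'pos': 'p1'}, {'id': 1, 'val': 'x', 'pos': 'p2'}, ...]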

Filtered pandas dataframe containing boolean version of dataframe

In pandas I have a dataframe (df1) with X by Y dimensions containing values.
Then I have a second dataframe (df2) with the same X by Y dimensions containing True/False values.
I want to return only the elements from df1 where the value at the same location in df2 is True.
What is the fastest way to do this? Is there a way to do this without converting to numpy array?
Without a reproducible example I may be missing a couple of details here, but I think you may be able to accomplish this with dataframe multiplication:
df1.mul(df2)
This multiplies each element by the corresponding element in the other dataframe; where df2 is True the value comes through unchanged, and where it is False the result is 0.
It is also possible to use where:
df1.where(df2)
This is equivalent to df1[df2] and replaces the filtered-out values with NaN, although you can choose the replacement value with the other parameter.
A quick benchmark on a 10x10 dataframe suggests that the df.mul approach is ~5 times faster
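A small self-contained sketch (made-up 2x2 data) of how the two suggestions differ in what they leave behind at the False positions:
import pandas as pd

df1 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
df2 = pd.DataFrame([[True, False], [False, True]])

print(df1.mul(df2))    # False positions become 0.0
print(df1.where(df2))  # False positions become NaN, same result as df1[df2]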

Pandas: how to compare several cells with a list/tuple

I need to compare some columns in a dataframe as a whole, for example:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
#Select condition: If df['A'] == 1 and df['B'] == 4, then pick up this row.
For this simple example, I can use below method:
df.loc[(df['A']==1)&(df['B']==4),'A':'B']
However, in reality my dataframe has tens of columns that should be compared as a whole. The above solution would get very messy if I had to list all of them. So I thought that comparing them as a whole against a list might solve the problem:
#something just like this:
df.loc[df.loc[:,'A':'B']==[1,4], 'A':'B']
That didn't work. So I came up with the idea of first combining all the desired columns into a new column holding a list value, and then comparing that new column with the list. The latter part has been solved in Pandas: compare list objects in Series.
Although I've generally solved my case, I still want to know whether there is an easier way to do this. Thanks.
You can use [[]] to select multiple columns and compare their values against the list:
df[(df[['A','B']].values==[1,4]).all(1)]
Demo:
>>> df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
>>> df[(df[['A','B']].values==[1,4]).all(1)]
   A  B
0  1  4
You can use a Boolean mask via a NumPy array representation of your data:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
res = df[(df.loc[:, 'A':'B'].values == [1, 4]).all(1)]
print(res)
   A  B
0  1  4
In this situation, never combine your columns into a single series of lists. This is inefficient as you will lose all vectorisation benefits, and any processing thereafter will involve Python-level loops.
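If the columns you want to match are not contiguous, one hedged variant (the names below are made up for illustration) is to spell the target out as a dict and build the mask column by column, which stays vectorised within each column:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [4, 5, 6]})
expected = {'A': 1, 'B': 4}          # hypothetical target value per column

mask = pd.Series(True, index=df.index)
for col, val in expected.items():
    mask &= df[col] == val           # AND the per-column comparisons together

print(df[mask])
#    A  B
# 0  1  4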

MultiIndexing rows vs. columns in pandas DataFrame

I am working with a MultiIndexed DataFrame in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like this:
Code:
import numpy as np
import pandas as pd

colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I choose to multiindex the columns, with the rationale that a pandas DataFrame consists of Series, and my data ultimately is a bunch of time series (hence row-indexed by time here).
I have this question because there seems to be some asymmetry between rows and columns for multiindexing. For example, the documentation shows how query works for a row-multiindexed dataframe, but if the dataframe is column-multiindexed then the documented command has to be replaced by something like df.T.query('color == "red"').T.
My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column-propensity of some common operations for DataFrame:
[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects row index as the first argument, and to operate on column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works for row index and one has to do something like transposing the dataframe to operate on the column index.
Based on this, multiindexing the rows seems slightly more convenient.
A natural follow-up question of mine: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the two most common ways of indexing dataframes, yet one slices columns and the others slice rows, which seems a bit odd.
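As a concrete illustration of the asymmetry, a short sketch using the data frame built in the question (assuming a current pandas): cross-sections work on either axis, but the column case needs axis=1 spelled out, while query only sees the named row index.
one_time = data.xs(0)                                          # cross-section on the row index (time == 0)
one_cond = data.xs('condition1', axis=1, level='condition')    # cross-section on the column MultiIndex
by_query = data.query('time == 0')                             # query works on the named row index only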

Adding individual items and sequences of items to dataframes and series

Say I have a dataframe df
import pandas as pd
df = pd.DataFrame()
and I have the following tuple and value:
column_and_row = ('bar', 'foo')
value = 56
How can I most easily add this tuple to my dataframe so that:
df['bar']['foo']
returns 56?
What if I have a list of such tuples and list of values? e.g.
columns_and_rows = [A, B, C, ...]
values = [5, 10, 15]
where A, B and C are tuples of columns and rows (similar to column_and_row).
Along the same lines, how would this be done with a Series? For example:
import pandas as pd
srs = pd.Series()
and I want to add one item to it with index 'foo' and value 2 so that:
srs['foo']
returns 2?
Note:
I know that none of these are efficient ways of creating dataframes or series, but I need a solution that allows me to grow my structures organically in this way when I have no other choice.
For a Series, you can do it with append, but you have to create a Series from your value first:
>>> print(x)
A    1
B    2
C    3
>>> print(x.append(pandas.Series([8, 9], index=["foo", "bar"])))
A      1
B      2
C      3
foo    8
bar    9
For a DataFrame, you can also use append or concat, but it doesn't make sense to do this for a single cell only. DataFrames are tabular, so you can only add a whole row or a whole column. The documentation has plenty of examples and there are other questions about this.
Edit: Apparently you can set a single value with df.set_value('newRow', 'newCol', newVal). However, if that row/column doesn't already exist, this will create an entire new row and/or column, with the rest of the values in the created row/column filled with NaN. Note that in this case a new object will be returned, so you'd have to do df = df.set_value('newRow', 'newCol', newVal) to modify the original.
However, no matter how you do it, this is going to be inefficient. Pandas data structures are based on NumPy and fundamentally rely on knowing the size of the array ahead of time. You can add rows and columns, but every time you do so an entirely new data structure is created, so if you do this a lot it will be slower than using ordinary Python lists/dicts.
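For readers on a current pandas: set_value was removed in later versions, but the same grow-as-you-go pattern can be written with label indexing, which also enlarges the object when the label is new. A hedged sketch (the exact dtype of the stored values may differ between pandas versions):
import pandas as pd

df = pd.DataFrame()
df.loc['foo', 'bar'] = 56     # creates both the row 'foo' and the column 'bar'
print(df['bar']['foo'])       # 56, possibly stored as float/object depending on version

srs = pd.Series(dtype='float64')
srs['foo'] = 2                # enlarges the Series in place
print(srs['foo'])             # 2.0
The efficiency caveat above still applies: enlarging one label at a time rebuilds the underlying arrays, so collecting values in a list or dict and constructing the DataFrame or Series once at the end is the faster route.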
