Iterate two pandas dataframes error when one dataframe is empty - python

I have been trying to iterate over two pandas dataframes using zip. It works perfectly as long as both dataframes have values; if one of the dataframes is empty, the loop body never runs and I get nothing back.
for (kin_index, kin_row), (sub_index, sub_row) in zip(df1.iterrows(), df2.iterrows()):
    print(kin_index, sub_index)
I want to iterate over both dataframes even if one of them is empty, but this doesn't go through when one dataframe is empty.

zip only runs as far as the shortest iterable. If one of the iterables is empty, you won't be able to iterate any values.
itertools.zip_longest iterates to the longest iterable, but to ensure this works with unpacking you need to specify fillvalue as a tuple of length 2:
import pandas as pd
from itertools import zip_longest

df1 = pd.DataFrame([[0, 1], [2, 3]])
df2 = pd.DataFrame()

zipper = zip_longest(df1.iterrows(), df2.iterrows(), fillvalue=(None, None))
for (idx1, row1), (idx2, row2) in zipper:
    print(idx1, idx2)
0 None
1 None
But there are very few occasions when you should need to iterate rows like this. In fact, it should be avoided if at all possible. You should consider refactoring your logic to use vectorised functionality.
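For instance, if the goal is simply to pair rows by position, a concat along the columns handles the empty-frame case without any Python-level loop. This is only a sketch, assuming positional pairing is what's wanted; the prefixes are made up for illustration:
import pandas as pd

df1 = pd.DataFrame([[0, 1], [2, 3]])
df2 = pd.DataFrame()

# Align the two frames positionally; rows missing from the shorter (or empty)
# frame come out as NaN instead of the pairing being dropped entirely.
paired = pd.concat(
    [df1.reset_index(drop=True).add_prefix('df1_'),
     df2.reset_index(drop=True).add_prefix('df2_')],
    axis=1,
)
print(paired)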

Related

Splitting a dataframe by passing a list of indices

I have a dataframe df containing only one column 'Info', which I want to split into multiple dataframes based on a list of indices, ls = [23,76,90,460,790]. If I want to use np.array_split(), how do I pass the list so that it splits the data at these indices, with each index becoming the first row of one of the resulting dataframes?
I don't think you can use np.array_split() here (you can access the underlying .values of the primary DF, but you'd get back numpy arrays, not DFs...) - what you can do is use .iloc to "slice" your DF, e.g.:
from itertools import zip_longest

# Pair each index with the next one; the default fillvalue of None lets the last slice run to the end.
dfs = [df.iloc[s:e] for s, e in zip_longest(ls, ls[1:])]
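With the example list, each index becomes the first row of one slice and the final slice runs to the end of the frame. A quick check on made-up data (the 1000-row frame below is hypothetical, just for illustration):
import pandas as pd
from itertools import zip_longest

df = pd.DataFrame({'Info': range(1000)})   # hypothetical data for illustration
ls = [23, 76, 90, 460, 790]

dfs = [df.iloc[s:e] for s, e in zip_longest(ls, ls[1:])]
print([(d.index[0], d.index[-1]) for d in dfs])
# [(23, 75), (76, 89), (90, 459), (460, 789), (790, 999)]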

Pandas: how to compare several cells with a list/tuple

I need to compare some columns in a dataframe as a whole, for example:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
#Select condition: If df['A'] == 1 and df['B'] == 4, then pick up this row.
For this simple example, I can use the method below:
df.loc[(df['A']==1)&(df['B']==4),'A':'B']
However, in reality my dataframe has tens of columns which should be compared as a whole. The above solution would be very messy if I listed all of them. So I thought that comparing them as a whole against a list might solve the problem:
#something just like this:
df.loc[df.loc[:,'A':'B']==[1,4], 'A':'B']
That didn't work. So I came up with the idea of first combining all desired columns into a new column holding a list value, then comparing this new column with the list. The latter has been solved in Pandas: compare list objects in Series.
Although generally I've solved my case, I still want to know if there is an easier way to solve this problem? Thanks.
You can use [[]] to select multiple columns and compare their values against the list:
df[(df[['A','B']].values==[1,4]).all(1)]
Demo:
>>> df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
>>> df[(df[['A','B']].values==[1,4]).all(1)]
   A  B
0  1  4
>>>
You can use a Boolean mask via a NumPy array representation of your data:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
res = df[(df.loc[:, 'A':'B'].values == [1, 4]).all(1)]
print(res)
   A  B
0  1  4
In this situation, never combine your columns into a single series of lists. This is inefficient as you will lose all vectorisation benefits, and any processing thereafter will involve Python-level loops.
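If the expected values are kept in a dict keyed by column name (a hypothetical arrangement, not from the question), the same vectorised mask scales to any number of columns:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
target = {'A': 1, 'B': 4, 'C': 7}   # hypothetical: one expected value per column

# Comparing a DataFrame to a Series aligns on column labels, so this works
# for tens of columns without spelling out each condition by hand.
mask = (df[list(target)] == pd.Series(target)).all(axis=1)
print(df[mask])
#    A  B  C
# 0  1  4  7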

Appending a list of lists and adding a column efficiently

Appending works fine with this method:
for poss in pos:
    df = df.append([[poss, '1']], ignore_index=True)
Is it possible to write this as a one-liner? Written this way, it raises a syntax error:
df = df.append([[poss,'1']], ignore_index=True) for poss in pos
No, it's not. It seems like you are trying to use list-comprehension syntax to collapse multiple method calls into a single line; you can't do that.
Instead, aggregate your list elements into a dataframe and append in one go. Calling df.append inside a loop is also inefficient compared with building a dataframe from the list of lists and appending just once.
df = df.append(pd.DataFrame(pos, columns=df.columns))
This assumes pos is a list of lists with columns aligned with df.
If you need to add an extra column "in one line", try this:
df = df.append(pd.DataFrame(pos, columns=df.columns)).assign(newcol=1)
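Note that in recent pandas versions DataFrame.append has been removed; pd.concat achieves the same one-shot build. A sketch under the same assumption that pos is a list of lists aligned with df's columns:
import pandas as pd

# Build the new rows once, add the extra column, then concatenate in one go.
new_rows = pd.DataFrame(pos, columns=df.columns).assign(newcol=1)
df = pd.concat([df, new_rows], ignore_index=True)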
I think it is possible, but I cannot test it, because I do not work with these data types, and a normal list.append does not return the resulting list.
from functools import reduce # Python3 only
df = reduce(lambda d, p: d.append([[p,'1']], ignore_index=True), pos, df)
Anyway, the for loop is much better. Also, this is just a rewrite into one line; there might be other ways to modify your data.

pandas iterrows changes ints into floats

I'm trying to iterate over the rows of a DataFrame that contains some int64s and some floats. iterrows() seems to be turning my ints into floats, which breaks everything I want to do downstream:
>>> import pandas as pd
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [id for id in df.id]
[10000000000000001, 10000000000000002]
>>> [r['id'] for (idx,r) in df.iterrows()]
[10000000000000000.0, 10000000000000002.0]
Iterating directly over df.id is fine. But through iterrows(), I get different values. Is there a way to iterate over the rows in such a way that I can still index by column name and get all the correct values?
Here's the relevant part of the docs:
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames) [...] To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
Example for your data:
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [t[1] for t in df.itertuples()]
[10000000000000001, 10000000000000002]
If possible you're better off avoiding iteration. Check if you can vectorize your work first.
If vectorization is impossible, you probably want DataFrame.itertuples. That will return an iterable of (named)tuples where the first element is the index label.
In [2]: list(df.itertuples())
Out[2]:
[Pandas(Index=0, id=10000000000000001, prc=1.5),
Pandas(Index=1, id=10000000000000002, prc=2.5)]
iterrows returns a Series for each row. Since series are backed by numpy arrays, whose elements must all share a single type, your ints were cast as floats.
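You can see the upcast directly by inspecting the Series that iterrows yields for a row of the frame above:
>>> df.dtypes
id       int64
prc    float64
dtype: object
>>> idx, row = next(df.iterrows())
>>> row.dtype
dtype('float64')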

Adding individual items and sequences of items to dataframes and series

Say I have a dataframe df
import pandas as pd
df = pd.DataFrame()
and I have the following tuple and value:
column_and_row = ('bar', 'foo')
value = 56
How can I most easily add this tuple to my dataframe so that:
df['bar']['foo']
returns 56?
What if I have a list of such tuples and list of values? e.g.
columns_and_rows = [A, B, C, ...]
values = [5, 10, 15]
where A, B and C are tuples of columns and rows (similar to column_and_row).
Along the same lines, how would this be done with a Series? e.g.:
import pandas as pd
srs = pd.Series()
and I want to add one item to it with index 'foo' and value 2 so that:
srs['foo']
returns 2?
Note:
I know that none of these are efficient ways of creating dataframes or series, but I need a solution that allows me to grow my structures organically in this way when I have no other choice.
For a series, you can do it with append, but you have to create a series from your value first:
>>> print(x)
A 1
B 2
C 3
>>> print(x.append(pd.Series([8, 9], index=["foo", "bar"])))
A 1
B 2
C 3
foo 8
bar 9
For a DataFrame, you can also use append or concat, but it doesn't make sense to do this for a single cell only. DataFrames are tabular, so you can only add a whole row or a whole column. The documentation has plenty of examples and there are other questions about this.
Edit: Apparently you actually can set a single value with df.set_value('newRow', 'newCol', newVal). However, if that row/column doesn't already exist, this will actually create an entire new row and/or column, with the rest of the values in the created row/column filled with NaN. Note that in this case a new object will be returned, so you'd have to do df = df.set_value('newRow', 'newCol', newVal) to modify the original.
However, no matter how you do it, this is going to be inefficient. Pandas data structures are based on NumPy and are fundamentally reliant on knowing the size of the array ahead of time. You can add rows and columns, but every time you do so, an entirely new data structure is created, so if you do this a lot, it will be slower than using ordinary Python lists/dicts.
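In current pandas, where set_value has been removed, setting with enlargement via .loc (or .at) does the same thing. A short sketch for the examples in the question:
import pandas as pd

df = pd.DataFrame()
df.loc['foo', 'bar'] = 56      # creates the row, the column and the cell in one go
print(df['bar']['foo'])        # 56

srs = pd.Series(dtype=object)
srs['foo'] = 2                 # Series also supports setting-with-enlargement
print(srs['foo'])              # 2
The same caveat applies: every enlargement builds a new underlying array, so if many such cells arrive over time it is usually cheaper to collect them in a plain dict first and build the Series or DataFrame once at the end.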
