Similar to the question "How to add an empty column to a dataframe?", I am interested in the best way to add a column of empty lists to a DataFrame.
What I am trying to do is initialize a column and then, as I iterate over the rows to process some of them, replace the initialized value in this new column with a filled list.
For example, if below is my initial DataFrame:
import pandas as pd
df = pd.DataFrame(data={'a': [1, 2, 3], 'b': [5, 6, 7]}) # Sample DataFrame
>>> df
a b
0 1 5
1 2 6
2 3 7
Then I want to ultimately end up with something like this, where each row has been processed separately (sample results shown):
>>> df
a b c
0 1 5 [5, 6]
1 2 6 [9, 0]
2 3 7 [1, 2, 3]
Of course, if I try to initialize with df['e'] = [] as I would with any other constant, pandas thinks I am trying to add a sequence of zero items, and hence it fails.
If I try initializing a new column as None or NaN, I run into the following issues when trying to assign a list to a location.
df['d'] = None
>>> df
a b d
0 1 5 None
1 2 6 None
2 3 7 None
Issue 1 (it would be perfect if I could get this approach to work! Maybe there is something trivial I am missing):
>>> df.loc[0,'d'] = [1,3]
...
ValueError: Must have equal len keys and value when setting with an iterable
Issue 2 (this one works, but not without a warning because it is not guaranteed to work as intended):
>>> df['d'][0] = [1,3]
C:\Python27\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Hence I resort to initializing with empty lists and extending them as needed. There are a couple of methods I can think of to initialize this way, but is there a more straightforward way?
Method 1:
df['empty_lists1'] = [list() for x in range(len(df.index))]
>>> df
a b empty_lists1
0 1 5 []
1 2 6 []
2 3 7 []
Method 2:
df['empty_lists2'] = df.apply(lambda x: [], axis=1)
>>> df
a b empty_lists1 empty_lists2
0 1 5 [] []
1 2 6 [] []
2 3 7 [] []
Summary of questions:
Is there a minor syntax change for Issue 1 that would allow a list to be assigned to a None/NaN-initialized field?
If not, then what is the best way to initialize a new column with empty lists?
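Regarding Issue 1, one workaround worth noting is cell-level access with .at instead of .loc; a minimal sketch (exact behavior can vary across pandas versions):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
df['d'] = None            # object-dtype column initialized with None
df.at[0, 'd'] = [1, 3]    # the scalar accessor accepts a list value here
print(df)
#    a  b       d
# 0  1  5  [1, 3]
# 1  2  6    None
# 2  3  7    None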
One more way is to use np.empty:
import numpy as np
df['empty_list'] = np.empty((len(df), 0)).tolist()
You could also knock off .index in your "Method 1" when trying to find len of df.
df['empty_list'] = [[] for _ in range(len(df))]
Turns out, np.empty is faster...
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(pd.np.random.rand(1000000, 5))
In [3]: timeit df['empty1'] = pd.np.empty((len(df), 0)).tolist()
10 loops, best of 3: 127 ms per loop
In [4]: timeit df['empty2'] = [[] for _ in range(len(df))]
10 loops, best of 3: 193 ms per loop
In [5]: timeit df['empty3'] = df.apply(lambda x: [], axis=1)
1 loops, best of 3: 5.89 s per loop
EDIT: the commenters caught the bug in my answer
s = pd.Series([[]] * 3)
s.iloc[0].append(1) #adding an item only to the first element
>s # unintended consequences:
0 [1]
1 [1]
2 [1]
So, the correct solution is
s = pd.Series([[] for i in range(3)])
s.iloc[0].append(1)
>s
0 [1]
1 []
2 []
OLD:
I timed all three methods in the accepted answer; the fastest one took 216 ms on my machine. However, this took only 28 ms:
df['empty4'] = [[]] * len(df)
Note: Similarly, df['e5'] = [set()] * len(df) also took 28ms.
Canonical solutions: List comprehension, map and apply
Obligatory disclaimer: avoid using lists in pandas columns where possible; list columns are slow to work with because they are stored as objects, which are inherently hard to vectorize.
With that out of the way, here are the canonical methods of introducing a column of empty lists:
# List comprehension
df['c'] = [[] for _ in range(df.shape[0])]
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
There are also these shorthands involving apply and map:
from collections import defaultdict
# map any column with defaultdict
df['c'] = df.iloc[:,0].map(defaultdict(list))
# same as,
df['c'] = df.iloc[:,0].map(lambda _: [])
# apply with defaultdict
df['c'] = df.apply(defaultdict(list), axis=1)
# same as,
df['c'] = df.apply(lambda _: [], axis=1)
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
Things you should NOT do
Some folks believe multiplying an empty list is the way to go; unfortunately, this is wrong and will usually lead to hard-to-debug issues. Here's a minimal example:
# WRONG
df['c'] = [[]] * len(df)
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc, def]
1 2 6 [abc, def]
2 3 7 [abc, def]
# RIGHT
df['c'] = [[] for _ in range(df.shape[0])]
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc]
1 2 6 [def]
2 3 7 []
In the first case, a single empty list is created and its reference is replicated across all the rows, so an update to one is reflected in all of them. In the latter case, each row is assigned its own empty list, so this is not a concern.
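A quick way to see the aliasing directly, using the same sample frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

df['c'] = [[]] * len(df)                  # WRONG: one list object shared by every row
print(df['c'][0] is df['c'][1])           # True -> all cells point at the same list

df['c'] = [[] for _ in range(len(df))]    # RIGHT: a fresh list per row
print(df['c'][0] is df['c'][1])           # False -> each cell has its own list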
Related
I have the following data structure (an example df is given further below).
The columns s and d indicate the transitions of the object in column x. What I want is a transition string per object present in column x, e.g. stored in a new column.
Is there a lean way to do this using pandas, without too many loops?
This was the code I tried:
obj = df['x'].tolist()
rows = []
for o in obj:
    locs = df[df['x'] == o]['s'].tolist()
    str_locs = '->'.join(str(l) for l in locs)
    print(str_locs)
    d = dict()
    d['x'] = o
    d['new'] = str_locs
    rows.append(d)
tmp = pd.DataFrame(rows)
This gives the output tmp as:
x new
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
b 1->2
b 1->2
Example df:
df = pd.DataFrame({"x":["a","a","a","a","b","b"], "s":[1,2,4,8,5,11],"d":[2,4,8,9,11,12]})
print(df)
x s d
0 a 1 2
1 a 2 4
2 a 4 8
3 a 8 9
4 b 5 11
5 b 11 12
The following code will generate a transition string for all objects present in column x:
groupby on column x and get a list of lists of s and d for every object available in x
Merge the lists of lists sequentially
Remove consecutive duplicates from the merged list using itertools.groupby
Join the items of the merged list with -> to make it a single string
Finally, map the resulting series onto column x of the input df
from itertools import groupby
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
df["new"] = df["x"].map(sr)
print(df)
x s d new
0 a 1 2 1->2->4->8->9
1 a 2 4 1->2->4->8->9
2 a 4 8 1->2->4->8->9
3 a 8 9 1->2->4->8->9
4 b 5 11 5->11->12
5 b 11 12 5->11->12
Problem:
I have a somewhat complicated cross-referencing task I need to perform between a long list (~600,000 entries) and a short list (~300,000 entries). I'm trying to find the similar entries between the two lists, and each unique entry is identified by three different integers (call them int1, int2, and int3). Based on the three integer identifiers in one list, I want to see if those same three integers are in the other list, and return which ones they are.
Attempt:
First I zipped each three-integer tuple in the long list into an array called a. Similarly, I zipped each three-int tuple in the short list into an array called b:
a = [(int1,int2,int3),...] # 600,000 entries
b = [(int1,int2,int3),...] # 300,000 entries
Then I iterated through each entry in a to see if it was in b. If it was, I appended the corresponding tuples to an array outside the loop called c:
c = []
for i in range(0, len(a), 1):
    if a[i] in b:
        c.append(a[i])
The iteration is (not surprisingly) very slow. I'm guessing Python has to check b for a[i] at each iteration (~300,000 times!), and it's iterating 600,000 times. It has taken over an hour now and still hasn't finished, so I know I should be optimizing something.
My question is: what is the most Pythonic or fastest way to perform this cross-referencing?
You can use sets:
c = set(b).intersection(a)
I chose to convert b to a set as it is the shorter of the two lists. Using intersection() does not require that list a first be converted to a set.
You can also do this:
c = set(a) & set(b)
however, both lists require conversion to type set first.
Either way, you have an O(n) operation (see time complexity).
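For illustration, here is the set approach on a small made-up sample (the same tuples used in the pandas answer below):
a = [(1, 2, 3), (4, 5, 6), (4, 5, 8), (1, 2, 8)]   # stand-in for the long list
b = [(1, 2, 3), (0, 3, 7), (4, 5, 8)]              # stand-in for the short list

# Build the set once from the shorter list; each membership test is then O(1) on average
c = set(b).intersection(a)
print(c)   # {(1, 2, 3), (4, 5, 8)} (set order may vary)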
Pandas solution:
a = [(1,2,3),(4,5,6),(4,5,8),(1,2,8) ]
b = [(1,2,3),(0,3,7),(4,5,8)]
df1 = pd.DataFrame(a)
print (df1)
0 1 2
0 1 2 3
1 4 5 6
2 4 5 8
3 1 2 8
df2 = pd.DataFrame(b)
print (df2)
0 1 2
0 1 2 3
1 0 3 7
2 4 5 8
df = pd.merge(df1, df2)
print (df)
0 1 2
0 1 2 3
1 4 5 8
Pure python solution with sets:
c = list(set(b).intersection(set(a)))
print (c)
[(4, 5, 8), (1, 2, 3)]
Another interesting way to do it:
from itertools import compress
list(compress(b, map(lambda x: x in a, b)))
And another one:
filter(lambda x: x in a, b)
I have a list of tuples in the format:
tuples = [('a',1,10,15),('b',11,0,3),('c',7,19,2)] # etc.
I wish to store the data in a DataFrame with the format:
a b c ...
0 1 11 7 ...
1 10 0 19 ...
2 15 3 2 ...
where the first element of each tuple is what I wish to be the column name.
I understand that I can achieve what I want by running:
df = pd.DataFrame(tuples)
df = df.T
df.columns = df.iloc[0]
df = df[1:]
But it seems to me that it should be more straightforward than this. Is there a more pythonic way of solving this?
Here's one way
In [151]: pd.DataFrame({x[0]:x[1:] for x in tuples})
Out[151]:
a b c
0 1 11 7
1 10 0 19
2 15 3 2
You can use a dictionary comprehension, like:
pd.DataFrame({k:v for k,*v in tuples})
in python-3.x, or:
pd.DataFrame({t[0]: t[1:] for t in tuples})
in python-2.7.
which generates:
>>> pd.DataFrame({k:v for k,*v in tuples})
a b c
0 1 11 7
1 10 0 19
2 15 3 2
The columns will be sorted alphabetically.
If you want the columns to be sorted like the original content, you can use the columns parameter:
pd.DataFrame({k:v for k,*v in tuples},columns=[k for k,*_ in tuples])
again in python-3.x, or for python-2.7:
pd.DataFrame({t[0]: t[1:] for t in tuples},columns=[t[0] for t in tuples])
We can shorten this a bit into:
from operator import itemgetter
pd.DataFrame({t[0]: t[1:] for t in tuples},columns=map(itemgetter(0),tuples))
In case the values in the tuples are arranged row-wise, then:
df = pd.DataFrame(tuples, columns=tuples[0])[1:]
I am an R programmer and looking for a similar way to do something like this in R:
data[data$x > value, y] <- 1
(basically, take all rows where the x column is greater than some value and assign the y column at those rows the value of 1)
In pandas it would seem the equivalent would go something like:
data['y'][data['x'] > value] = 1
But this gives a SettingWithCopyWarning.
Equivalent statements I've tried are:
condition = data['x']>value
data.loc(condition,'x')=1
But I'm seriously confused. Maybe I'm thinking too much in R terms and can't wrap my head around what's going on in Python.
What would be equivalent code for this in Python, or workarounds?
Your statement is incorrect; it should be:
data.loc[condition, 'x'] = 1
Example:
In [3]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[3]:
a
0 -0.063579
1 -1.039022
2 -0.011687
3 0.036160
4 0.195576
5 -0.921599
6 0.494899
7 -0.125701
8 -1.779029
9 1.216818
In [4]:
condition = df['a'] > 0
df.loc[condition, 'a'] = 20
df
Out[4]:
a
0 -0.063579
1 -1.039022
2 -0.011687
3 20.000000
4 20.000000
5 -0.921599
6 20.000000
7 -0.125701
8 -1.779029
9 20.000000
As you are subscripting the df, you should use square brackets [] rather than parentheses (), which would be a function call. See the docs.
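A minimal sketch with the question's column names (made-up data), doing the assignment in a single .loc call:
import pandas as pd

# Hypothetical frame mirroring the question's x/y columns
data = pd.DataFrame({'x': [1, 5, 3, 8], 'y': [0, 0, 0, 0]})
value = 4

# One .loc call with square brackets: no chained indexing, no SettingWithCopyWarning
data.loc[data['x'] > value, 'y'] = 1
print(data)
#    x  y
# 0  1  0
# 1  5  1
# 2  3  0
# 3  8  1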
I have a dataframe that may or may not have columns in which every value is the same. For example:
row A B
1 9 0
2 7 0
3 5 0
4 2 0
I'd like to return just
row A
1 9
2 7
3 5
4 2
Is there a simple way to identify if any of these columns exist and then remove them?
I believe this option will be faster than the other answers here as it will traverse the data frame only once for the comparison and short-circuit if a non-unique value is found.
>>> df
0 1 2
0 1 9 0
1 2 7 0
2 3 7 0
>>> df.loc[:, (df != df.iloc[0]).any()]
0 1
0 1 9
1 2 7
2 3 7
Ignoring NaNs like usual, a column is constant if nunique() == 1. So:
>>> df
A B row
0 9 0 1
1 7 0 2
2 5 0 3
3 2 0 4
>>> df = df.loc[:,df.apply(pd.Series.nunique) != 1]
>>> df
A row
0 9 1
1 7 2
2 5 3
3 2 4
I compared various methods on a data frame of size 120 * 10000 and found the most efficient one is:
def drop_constant_column(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    return dataframe.loc[:, (dataframe != dataframe.iloc[0]).any()]
1 loop, best of 3: 237 ms per loop
The other contenders are
def drop_constant_columns(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    result = dataframe.copy()
    for column in dataframe.columns:
        if len(dataframe[column].unique()) == 1:
            result = result.drop(column, axis=1)
    return result
1 loop, best of 3: 19.2 s per loop
def drop_constant_columns_2(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    for column in dataframe.columns:
        if len(dataframe[column].unique()) == 1:
            dataframe.drop(column, inplace=True, axis=1)
    return dataframe
1 loop, best of 3: 317 ms per loop
def drop_constant_columns_3(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    keep_columns = [col for col in dataframe.columns if len(dataframe[col].unique()) > 1]
    return dataframe[keep_columns].copy()
1 loop, best of 3: 358 ms per loop
def drop_constant_columns_4(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    keep_columns = dataframe.columns[dataframe.nunique() > 1]
    return dataframe.loc[:, keep_columns].copy()
1 loop, best of 3: 1.8 s per loop
Assuming that the DataFrame is completely numeric, you can try:
>>> df = df.loc[:, df.var() != 0.0]
which will remove the constant (i.e. variance = 0) columns.
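For example, on the question's sample frame (where column B is constant), the variance filter keeps only A; a quick sketch:
import pandas as pd

# Question's sample data: column B is constant
df = pd.DataFrame({'A': [9, 7, 5, 2], 'B': [0, 0, 0, 0]})
print(df.loc[:, df.var() != 0.0])
#    A
# 0  9
# 1  7
# 2  5
# 3  2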
If the DataFrame is of type both numeric and object, then you should try:
>>> enum_df = df.select_dtypes(include=['object'])
>>> num_df = df.select_dtypes(exclude=['object'])
>>> num_df = num_df.loc[:, num_df.var() != 0.0]
>>> df = pd.concat([num_df, enum_df], axis=1)
which will drop constant columns of numeric type only.
If you also want to ignore/delete constant enum columns, you should try:
>>> enum_df = df.select_dtypes(include=['object'])
>>> num_df = df.select_dtypes(exclude=['object'])
>>> enum_df = enum_df.loc[:, [True if y !=1 else False for y in [len(np.unique(x, return_counts=True)[-1]) for x in enum_df.T.as_matrix()]]]
>>> num_df = num_df.loc[:, num_df.var() != 0.0]
>>> df = pd.concat([num_df, enum_df], axis=1)
Here is my solution, since I needed to handle both object and numerical columns. Not claiming it's super efficient or anything, but it gets the job done.
def drop_constants(df):
    """Iterate through columns and remove columns with constant values (all same)."""
    columns = df.columns.values
    for col in columns:
        # drop col if the number of unique values is 1
        if df[col].nunique(dropna=False) == 1:
            del df[col]
    return df
Extra caveat: it won't work on columns of lists or arrays, since those are not hashable.
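A tiny sketch of that caveat, with a hypothetical list-valued column:
import pandas as pd

# Hypothetical frame with an unhashable (list-valued) column
df = pd.DataFrame({'a': [1, 1], 'lists': [[1, 2], [1, 2]]})
try:
    df['lists'].nunique(dropna=False)
except TypeError as exc:
    print(exc)   # unhashable type: 'list'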
Many examples in this thread do not work properly. Check my answer for a collection of examples that work.