I'm just starting out with Python and getting stuck on something while playing with the Kaggle Titanic data.
https://www.kaggle.com/c/titanic/data
Here's what I am typing in an IPython notebook (train.csv comes from the Titanic data at the Kaggle link above):
import pandas as pd
df = pd.read_csv("C:/fakepath/titanic/data/train.csv")
I then continue with this to check if there's any bad data in the 'Sex' column:
df['Sex'].value_counts()
Which returns:
male 577
female 314
dtype: int64
df['Gender'] = df['Sex'].map( {'male': 1, 'female': 0} ).astype(int)
This doesn't produce any errors. To verify that it created a new column called 'Gender' with integer values:
df
which returns:
# PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Gender
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S 0
...success, the Gender column is appended to the end and is 0 for female, 1 for male. Now, I create a new pandas dataframe which is a subset of the df dataframe.
df2 = df[ ['Survived', 'Pclass', 'Age', 'Gender', 'Embarked'] ]
df2
which returns:
Survived Pclass Age Gender Embarked
0 0 3 22 1 S
1 1 1 38 0 C
2 1 3 26 0 S
3 1 1 35 0 S
4 0 3 35 1 S
5 0 3 NaN 1 Q
df2['Embarked'].value_counts()
...shows that there are 3 unique values (S, C, Q):
S 644
C 168
Q 77
dtype: int64
However, when I try to execute what I think is the same type of operation as when I converted male/female to 1/0, I get an error:
df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)
returns:
ValueError Traceback (most recent call last)
<ipython-input-29-294c08f2fc80> in <module>()
----> 1 df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)
C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in astype(self, dtype, copy, raise_on_error)
2212
2213 mgr = self._data.astype(
-> 2214 dtype=dtype, copy=copy, raise_on_error=raise_on_error)
2215 return self._constructor(mgr).__finalize__(self)
2216
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, **kwargs)
2500
2501 def astype(self, dtype, **kwargs):
-> 2502 return self.apply('astype', dtype=dtype, **kwargs)
2503
2504 def convert(self, **kwargs):
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
2455 copy=align_copy)
2456
-> 2457 applied = getattr(b, f)(**kwargs)
2458
2459 if isinstance(applied, list):
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, copy, raise_on_error, values)
369 def astype(self, dtype, copy=False, raise_on_error=True, values=None):
370 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 371 values=values)
372
373 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
399 if values is None:
400 # _astype_nansafe works fine with 1-d only
--> 401 values = com._astype_nansafe(self.values.ravel(), dtype, copy=True)
402 values = values.reshape(self.values.shape)
403 newb = make_block(values,
C:\Anaconda\lib\site-packages\pandas\core\common.pyc in _astype_nansafe(arr, dtype, copy)
2616
2617 if np.isnan(arr).any():
-> 2618 raise ValueError('Cannot convert NA to integer')
2619 elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
2620 # work around NumPy brokenness, #1987
ValueError: Cannot convert NA to integer
Any idea why I get this error on the second use of the map function but not the first? There are no NaN values in the Embarked column per value_counts(). I'm guessing it's a noob problem :)
By default value_counts() does not count NaN values; you can change this with df['Embarked'].value_counts(dropna=False).
I compared your value_counts for the Gender column (577 + 314 = 891) with the Embarked column (644 + 168 + 77 = 889); they differ by 2, which means you must have 2 NaN values.
So you either drop them first (using dropna) or fill them with some desired value using fillna.
Also, the astype(int) is redundant, as you are mapping to an int anyway.
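For example, a minimal sketch of both options, assuming df2 is the subset built above (the fill value 'S' is just an illustrative choice):
# Option 1: drop the rows whose Embarked is missing, then map
df2 = df2.dropna(subset=['Embarked'])
df2['Embarked_int'] = df2['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# Option 2: fill the missing values first, then map
df2['Embarked_int'] = df2['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})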
I just came across this problem on the same dataset. Removing astype(int) solved the whole problem.
Related
I have a multi-index dataframe in pandas (date and entity_id), and for each date/entity I have observations of a number of variables (A, B, ...). My goal is to create a dataframe with the same shape but where the values are replaced by their decile scores.
My test data looks like this:
I want to apply qcut to each column, grouped by level 0 of the multi-index. The issue I have is creating a result DataFrame.
This code
def qcut_sub_index(df_with_sub_index):
# create empty return value same shape as passed dataframe
df_return=pd.DataFrame()
for date, sub_df in df_with_sub_index.groupby(level=0):
df_return=df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False, duplicates='drop')))
print(df_return)
return df_return
print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
returns
A
as_at_date entity_id
2008-01-27 2928 0
2932 3
3083 6
3333 9
2008-02-27 2928 3
2935 9
3333 0
3874 6
2008-03-27 2928 1
2932 2
2934 0
2936 9
2937 4
2939 9
2940 7
2943 3
2944 0
2945 8
2946 6
2947 5
2949 4
B
as_at_date entity_id
2008-01-27 2928 9
2932 6
3083 0
3333 3
2008-02-27 2928 6
2935 0
3333 3
3874 9
2008-03-27 2928 0
2932 9
2934 2
2936 8
2937 7
2939 6
2940 3
2943 1
2944 4
2945 9
2946 5
2947 4
2949 0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-104-72ff0e6da288> in <module>
11
12
---> 13 print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7546 kwds=kwds,
7547 )
-> 7548 return op.get_result()
7549
7550 def applymap(self, func) -> "DataFrame":
~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
272
273 # wrap results
--> 274 return self.wrap_results(results, res_index)
275
276 def apply_series_generator(self) -> Tuple[ResType, "Index"]:
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results(self, results, res_index)
313 # see if we can infer the results
314 if len(results) > 0 and 0 in results and is_sequence(results[0]):
--> 315 return self.wrap_results_for_axis(results, res_index)
316
317 # dict of scalars
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results_for_axis(self, results, res_index)
369
370 try:
--> 371 result = self.obj._constructor(data=results)
372 except ValueError as err:
373 if "arrays must all be same length" in str(err):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
466
467 elif isinstance(data, dict):
--> 468 mgr = init_dict(data, index, columns, dtype=dtype)
469 elif isinstance(data, ma.MaskedArray):
470 import numpy.ma.mrecords as mrecords
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
281 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
282 ]
--> 283 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
284
285
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
76 # figure out the index, if necessary
77 if index is None:
---> 78 index = extract_index(arrays)
79 else:
80 index = ensure_index(index)
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
385
386 if not indexes and not raw_lengths:
--> 387 raise ValueError("If using all scalar values, you must pass an index")
388
389 if have_series:
ValueError: If using all scalar values, you must pass an index
so something is preventing the second application of the lambda function.
I'd appreciate your help, thanks for taking a look.
P.S. If this can be done implicitly without using apply, I would love to hear it. Thanks.
Your solution appears overcomplicated, and your terminology is non-standard: multi-indexes have levels, so the task is better stated as "qcut() by level 0 of the multi-index" (sub-frames are not a pandas concept).
Bringing it all back together:
use the **kwargs approach to pass arguments to assign() for all columns in the data frame
groupby(level=0) groups by as_at_date
transform() gets a row back for every entry in the index
import datetime as dt
import numpy as np
import pandas as pd

# build a small sample frame: 12 rows across 3 month-end dates
s = 12
df = pd.DataFrame({"as_at_date": np.random.choice(pd.date_range(dt.date(2020,1,27), periods=3, freq="M"), s),
                   "entity_id": np.random.randint(2900, 3500, s),
                   "A": np.random.random(s),
                   "B": np.random.random(s)*(10**np.random.randint(8,10,s))
                  }).sort_values(["as_at_date","entity_id"])
df = df.set_index(["as_at_date","entity_id"])

# decile-score every column within each as_at_date group
df2 = df.assign(**{c: df.groupby(level=0)[c].transform(lambda x: pd.qcut(x, 10, labels=False))
                   for c in df.columns})
df
A B
as_at_date entity_id
2020-01-31 2926 0.770121 2.883519e+07
2943 0.187747 1.167975e+08
2973 0.371721 3.133071e+07
3104 0.243347 4.497294e+08
3253 0.591022 7.796131e+08
3362 0.810001 6.438441e+08
2020-02-29 3185 0.690875 4.513044e+08
3304 0.311436 4.561929e+07
2020-03-31 2953 0.325846 7.770111e+08
2981 0.918461 7.594753e+08
3034 0.133053 6.767501e+08
3355 0.624519 6.318104e+07
df2
A B
as_at_date entity_id
2020-01-31 2926 7 0
2943 0 3
2973 3 1
3104 1 5
3253 5 9
3362 9 7
2020-02-29 3185 9 9
3304 0 0
2020-03-31 2953 3 9
2981 9 6
3034 0 3
3355 6 0
Using concat inside an iteration over the original dataframe does the trick, but is there a smarter way to do this? Thanks.
def qcut_sub_index(df_with_sub_index):
# create empty return value same shape as passed dataframe
df_return=pd.DataFrame()
for date, sub_df in df_with_sub_index.groupby(level=0):
df_return=df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False,
duplicates='drop')))
return df_return
df_x=pd.DataFrame()
for (columnName, columnData) in df_values.iteritems():
df_x=pd.concat([df_x, qcut_sub_index(columnData)], axis=1, join="outer")
df_x
Following on from this question, I have a dataset as such:
ChildID MotherID preDiabetes
0 20 455 No
1 20 455 Not documented
2 13 102 NaN
3 13 102 Yes
4 702 946 No
5 82 571 No
6 82 571 Yes
7 82 571 Not documented
8 60 530 NaN
Which I have transformed to the following such that each mother has a single value for preDiabetes:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 No
I did this by applying the following logic:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if preDiabetes != "Yes" for a particular MotherID, I will assign preDiabetes a value of "No"
However, after thinking about this again, I realised that I should preserve NaN values to impute them later on, rather than just assigning them "No".
So I should edit my logic to be:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if all values for preDiabetes==NaN for a particular MotherID, assign preDiabetes a single NaN value
else assign preDiabetes a value of "No"
So, in the above table MotherID=530 should have a value of NaN for preDiabetes like so:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 NaN
I tried doing this using the following line of code:
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if np.NaN in x.values.all() else 'No'))
However, running this line of code is resulting in the following error:
TypeError: 'in ' requires string as left operand, not float
I'd appreciate it if you could point out what I am doing wrong. Thank you.
You can try:
import pandas as pd
import numpy as np
import io
data_string = """ChildID,MotherID,preDiabetes
20,455,No
20,455,Not documented
13,102,NaN
13,102,Yes
702,946,No
82,571,No
82,571,Yes
82,571,Not documented
60,530,NaN
"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=',', na_values=['NaN'])
df.fillna('no_value', inplace=True)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if (x.values == 'no_value').all() else 'No'))
df
Result:
MotherID ChildID
102 13 Yes
455 20 No
530 60 NaN
571 82 Yes
946 702 No
Name: preDiabetes, dtype: object
You can do it using a custom function:
def func(s):
if s.eq('Yes').any():
return 'Yes'
elif s.isna().all():
return np.nan
else:
return 'No'
df = (df
      .groupby(['ChildID', 'MotherID'])
      .agg({'preDiabetes': func})
      .reset_index())
print(df)
ChildID MotherID preDiabetes
0 13 102 Yes
1 20 455 No
2 60 530 NaN
3 82 571 Yes
4 702 946 No
Try:
df['preDiabetes']=df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].max().map({1: 'Yes', 0: 'No', -1: 'NaN'}).reset_index()
The first line converts preDiabetes to numbers, treating everything other than Yes or No as NaN (denoted by -1).
The second line takes the maximum per group: if at least one preDiabetes is Yes, we output Yes for the group; if we have both No and NaN, we output No; if all are NaN, we output NaN.
Outputs:
>>> df
MotherID ChildID preDiabetes
0 102 13 Yes
1 455 20 No
2 530 60 NaN
3 571 82 Yes
4 946 702 No
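Note that mapping -1 back to 'NaN' produces the literal string 'NaN' rather than a real missing value. If you want an actual NaN (e.g. to impute later), a small variation of the same idea should work, since values not found in the map simply become NaN (this is only a sketch, not tested against your full data):
df['preDiabetes'] = df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].max().map({1: 'Yes', 0: 'No'}).reset_index()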
I am working with a date column in pandas, and I want to have just the year and month as a separate column.
I achieved that by:
df1["month"] = pd.to_datetime(Table_A_df['date']).dt.to_period('M')
Printing it looks like this:
df1["month"]
Out:
0 2017-03
1 2017-03
2 2017-03
3 2017-03
4 2017-03
...
79638 2018-03
79639 2018-03
79640 2018-03
79641 2018-03
79642 2018-03
Name: month, Length: 79643, dtype: period[M]
My customer id looks like this:
0 5094298f068196c5349d43847de5afc9125cf989
1 NaN
2 NaN
3 433fdf385e33176cf9b0d67ecf383aa928fa261c
4 NaN
...
79638 6836d8cdd9c6c537c702b35ccd972fae58070004
79639 bbc08d8abad5e699823f2f0021762797941679be
79640 39b5fdd28cb956053d3e4f3f0b884fb95749da8a
79641 3342d5b210274b01e947cc15531ad53fbe25435b
79642 b3f02d0768c0ba8334047d106eb759f3e80517ac
Name: customer_id, Length: 79643, dtype: object
Now I'm trying to group by customer_id and transform the data.
user_groups = df1.groupby("customer_id")["month"]
df1["Cohort_month"] = user_groups.transform("min")
I get the following error:
TypeError: data type not understood
Complete error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-108-107e17f9a489> in <module>
----> 1 df1["Cohort_month"] = user_groups.transform("min")
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\generic.py in transform(self, func, *args, **kwargs)
475 # result to the whole group. Compute func result
476 # and deal with possible broadcasting below.
--> 477 result = getattr(self, func)(*args, **kwargs)
478 return self._transform_fast(result, func)
479
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(self, **kwargs)
1375 # try a cython aggregation if we can
1376 try:
-> 1377 return self._cython_agg_general(alias, alt=npfunc, **kwargs)
1378 except DataError:
1379 pass
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
887
888 result, agg_names = self.grouper.aggregate(
--> 889 obj._values, how, min_count=min_count
890 )
891
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in aggregate(self, values, how, axis, min_count)
568 ) -> Tuple[np.ndarray, Optional[List[str]]]:
569 return self._cython_operation(
--> 570 "aggregate", values, how, axis, min_count=min_count
571 )
572
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in _cython_operation(self, kind, values, how, axis, min_count, **kwargs)
560 result = type(orig_values)(result.astype(np.int64), dtype=orig_values.dtype)
561 elif is_datetimelike and kind == "aggregate":
--> 562 result = result.astype(orig_values.dtype)
563
564 return result, names
TypeError: data type not understood
This was working before when I had 1 as the day, but when I made it just year and month I started getting this error. Is there a fix for this?
It's working for the sample you shared, so I'm not sure where the issue is; are there any missing values in your month column?
df['month'] = pd.to_datetime(df['month']).dt.to_period('M')
user_groups = df.groupby("customer_id")["month"]
df["Cohort_month"] = user_groups.transform("min")
print(df)
customer_id month Cohort_month
0 5094298f068196c5349d43847de5afc9125cf989 2017-03 2017-03
1 NaN 2017-03 NaT
2 NaN 2017-03 NaT
3 433fdf385e33176cf9b0d67ecf383aa928fa261c 2017-03 2017-03
4 NaN 2017-03 NaT
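If the error persists on your pandas version (some versions had trouble with the cython aggregation path on period[M] columns), a possible workaround is to avoid that path with a plain Python reduction, or to take the minimum on timestamps and convert back to periods afterwards. This is only a sketch assuming df1 and the month/customer_id columns above:
# Workaround 1: Python-level reduction instead of the cython "min" aggregation
df1["Cohort_month"] = df1.groupby("customer_id")["month"].transform(lambda x: x.min())
# Workaround 2: compute the min on timestamps, then convert back to monthly periods
df1["Cohort_month"] = (df1.assign(month_ts=df1["month"].dt.to_timestamp())
                          .groupby("customer_id")["month_ts"]
                          .transform("min")
                          .dt.to_period("M"))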
I have a dataset that looks like this df:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'Name': ['a','b','c','d'],
                        '1/1/2001': ['1/1/2015',0,0,'1/1/2015'],
                        '2/20/2002': ['2/20/2002','2/20/2002','2/20/2002',0],
                        '3/15/2015': [0,0,0,'3/15/2015']}); df
df[df == 0] = np.nan
col = ['1/1/2001','2/20/2002','3/15/2015']
df.loc[:,col] = df.loc[:,col].bfill(axis=1)
df = df.fillna(value=0)
df
Name 1/1/2001 2/20/2002 3/15/2015
0 a 1/1/2015 2/20/2002 0
1 b 2/20/2002 2/20/2002 0
2 c 2/20/2002 2/20/2002 0
3 d 1/1/2015 3/15/2015 3/15/2015
And I want to return a dataframe that just has the unique values per row, so it could look like:
Name x_ x_2
0 a 1/1/2015 2/20/2002
1 b 2/20/2002 0
2 c 2/20/2002 0
3 d 1/1/2015 3/15/2015
But when I try to groupby with the following code:
df.groupby(['Name'])[col].apply(lambda x: list(np.unique(x)))
I get the long error:
TypeError Traceback (most recent call last)
<ipython-input-155-a3f3c8a3e6e5> in <module>
14 df
15
---> 16 df.groupby(['Name'])[col].apply(lambda x: list(np.unique(x)))
17
18
~/miniconda3/envs/planting/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
735
736 with _group_selection_context(self):
--> 737 return self._python_apply_general(f)
738
739 return result
~/miniconda3/envs/planting/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
740
741 def _python_apply_general(self, f):
--> 742 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
743
744 return self._wrap_applied_output(
~/miniconda3/envs/planting/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
235 # group might be modified
236 group_axes = _get_axes(group)
--> 237 res = f(group)
238 if not _is_indexed_like(res, group_axes):
239 mutated = True
<ipython-input-155-a3f3c8a3e6e5> in <lambda>(x)
14 df
15
---> 16 df.groupby(['Name'])[col].apply(lambda x: list(np.unique(x)))
17
18
<__array_function__ internals> in unique(*args, **kwargs)
~/miniconda3/envs/planting/lib/python3.7/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
~/miniconda3/envs/planting/lib/python3.7/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'int' and 'str'
Perhaps the error is related to the fact that the dates are strings. If it's helpful they could be converted to datetime objects.
This can be done by melt, then pivot
s=df.mask(df==0).melt('Name').drop_duplicates(['Name','value']).dropna()
s['row']=s.groupby('Name').cumcount()+1
s.pivot(index='Name',columns='row',values='value')
Out[76]:
row 1 2
Name
a 1/1/2015 2/20/2002
b 2/20/2002 NaN
c 2/20/2002 NaN
d 1/1/2015 3/15/2015
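If you want zeros instead of NaN, as in the desired output in the question, you can chain a fillna onto the pivot (a minimal tweak to the snippet above):
s.pivot(index='Name', columns='row', values='value').fillna(0)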
how about:
df.T.drop_duplicates(keep='first').T
output:
1/1/2001 2/20/2002 3/15/2015 Name
0 1/1/2015 2/20/2002 0 a
1 0 2/20/2002 0 b
2 0 2/20/2002 0 c
3 1/1/2015 0 3/15/2015 d
EDIT:
This solution refers to the first version of the question; little needs to be done to apply it to the latest version.
Problem:
I'm trying to merge two relatively small datasets together, but the merge raises a MemoryError. I have two datasets of aggregates of country trade data that I'm trying to merge on the keys year and country, so the data needs to be placed in a particular way. This unfortunately makes the use of concat and its performance benefits impossible, as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
Importer code CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!
In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
A B C
0 1 2 3
1 2 3 4
df2
A B D
0 1 2 7
1 2 3 8
both
A B C D
0 1 2 3 7
1 2 3 4 8
I haven't benchmarked the performance of this approach, but it didn't get the memory error and worked for my applications.
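One design note: concat along axis=1 does an outer join on the shared index by default, so non-matching keys are kept with NaNs; if you need merge's default inner-join behaviour, you can pass join='inner'. A minimal sketch reusing df1 and df2 from the example above:
both_inner = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1, join='inner').reset_index()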