I have a data set that can be found here: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset.
What we need, exactly, is for every employee to have their own set of rows pairing them with all the other employees who share the same age.
The desired output would be to add these rows to the data frame like so:
source  target
Bob     Tom
Bob     Carl
Tom     Bob
Tom     Carl
Carl    Bob
Carl    Tom
I am using pandas to create the data frame from the csv file with pd.read_csv.
I am struggling with creating the loop to produce my desired output.
This is where I am at so far:
import pandas as pd

path = r"C:\CNT\IBM.csv"
df = pd.read_csv(path)

def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val

df['source'] = ''
df['target'] = ''

df2 = df.loc[df['Age'] == 18]
print(df2)
This produces:
Age EmployeeNumber MonthlyIncome source target
296 18 297 1420
301 18 302 1200
457 18 458 1878
727 18 728 1051
828 18 829 1904
972 18 973 1611
1153 18 1154 1569
1311 18 1312 1514
My desired output is this
Age EmployeeNumber MonthlyIncome source target
296 18 297 1420 297 302
301 18 302 1200 297 458
457 18 458 1878 297 728
727 18 728 1051 297 829
828 18 829 1904 297 973
972 18 973 1611 297 1154
1153 18 1154 1569 297 1312
1311 18 1312 1514
Where do I go from here?
This will need some modification because I don't have the same features you do, but it copies EmployeeNumber into a new column target and shifts the values up within each age group, leaving the last EmployeeNumber in each group as NaN. I added a small modification so that the last row in a group gets an empty target value instead. The line of code would need further changes if the source column contained strings as well. The main point is using .shift(periods=-1) together with groupby().
import pandas as pd
import numpy as np

path = r"C:\CNT\IBM.csv"
df = pd.read_csv(path)

# leftover helper from the question; it is not used below
def f(row):
    val = np.where(row['A'] == row['B'], 0, np.where(row['A'] >= row['B'], 1, -1))
    return val

df['source'] = df['EmployeeNumber']
# within each age group, take the next row's EmployeeNumber as the target;
# the last row in each group gets an empty string instead of NaN
df['target'] = df.groupby('Age')['EmployeeNumber'].shift(periods=-1).fillna(0).astype(int).replace(0, '')
print(df)

df2 = df.loc[df['Age'] == 18]
df3 = df2[['Age', 'EmployeeNumber', 'source', 'target']]
print(df3)
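If what you actually want is every same-age pair (as in the Bob/Tom/Carl example) rather than only the next employee in each group, a self-merge on Age is one way to build that edge list. This is only a sketch against the columns shown above, not tested against the full Kaggle file:

import pandas as pd

df = pd.read_csv(r"C:\CNT\IBM.csv")

# pair every employee with every other employee of the same age
pairs = df[['Age', 'EmployeeNumber']].merge(
    df[['Age', 'EmployeeNumber']], on='Age', suffixes=('_source', '_target'))

# drop the self-pairs (an employee matched with themselves)
pairs = pairs[pairs['EmployeeNumber_source'] != pairs['EmployeeNumber_target']]

edges = pairs.rename(columns={'EmployeeNumber_source': 'source',
                              'EmployeeNumber_target': 'target'})
print(edges[['Age', 'source', 'target']].head())

Each employee then appears as the source once for every other employee of the same age, which matches the Bob/Tom/Carl table.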
I am trying to create some random samples (of a given size) from a static dataframe. The goal is to create multiple columns, one for each sample (and each sample drawn is the same size). I'm expecting to see multiple columns of the same length (i.e. the sample size) in the fully sampled dataframe, but maybe append isn't the right way to go. Here is the code:
import pandas as pd
import numpy as np

# create sample dataframe
target_df = pd.DataFrame(np.arange(1000))
target_df.columns = ['pl']

# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len

sampled_df = pd.DataFrame()   # assumed: the frame being appended to needs to exist first
for i in range(sample_num):
    rndm_start = np.random.choice(df_max_row, 1)[0]
    rndm_end = rndm_start + sample_len
    slicer = target_df.iloc[rndm_start:rndm_end]['pl']
    sampled_df = sampled_df.append(slicer, ignore_index=True)

sampled_df = sampled_df.T
The output of this is shown in the pic below; the red line shows the index I want to remove.
The desired output is shown below that. How do I make this happen?
Thanks!
I would create the new column using
sampled_df[i] = slicer.reset_index(drop=True)
Eventually I would use str(i) for the column name, because later it is simpler to select a column by string than by number.
import pandas as pd
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len

sampled_df = pd.DataFrame()
for i in range(1, sample_num+1):
    start = random.randint(0, df_max_row)
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)

sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 735 396 646 534 769
2 736 397 647 535 770
3 737 398 648 536 771
4 738 399 649 537 772
5 739 400 650 538 773
6 740 401 651 539 774
7 741 402 652 540 775
8 742 403 653 541 776
9 743 404 654 542 777
10 744 405 655 543 778
But to get really random values I would first shuffle the values
np.random.shuffle(target_df['pl'])
and then I don't have to use random to select the start.
shuffle changes the original column, so it can't be assigned to a new variable.
This way values are not repeated across samples.
import pandas as pd
#import numpy as np
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:
sample_num = 5
sample_len = 10

sampled_df = pd.DataFrame()

#np.random.shuffle(target_df['pl'])
random.shuffle(target_df['pl'])

for i in range(1, sample_num+1):
    start = i * sample_len
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)

sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 638 331 171 989 170
2 22 643 47 136 764
3 969 455 211 763 194
4 859 384 174 552 566
5 221 829 62 926 414
6 4 895 951 967 381
7 758 688 594 876 873
8 757 691 825 693 707
9 235 353 34 699 121
10 447 81 36 682 251
If values can repeat then you could use
sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)
import pandas as pd

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:
sample_num = 5
sample_len = 10

sampled_df = pd.DataFrame()
for i in range(1, sample_num+1):
    sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)

sampled_df.index += 1
print(sampled_df)
EDIT
You may also get the shuffled values as a numpy array and use reshape, and later convert back to a DataFrame with many columns. You can then keep only some of the columns.
import pandas as pd
import random
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
random.shuffle(target_df['pl'])
sampled_df = pd.DataFrame(target_df['pl'].values.reshape([sample_len,-1]))
sampled_df = sampled_df.iloc[:, 0:sample_num]
sampled_df.index += 1
print(sampled_df)
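A side note in case you are on a newer pandas: DataFrame.append was deprecated and later removed, so the append loop from the question can be rewritten with pd.concat. A minimal sketch, reusing the same target_df, sample_num and sample_len names as above:

import numpy as np
import pandas as pd

target_df = pd.DataFrame({'pl': np.arange(1000)})
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len

columns = []
for i in range(1, sample_num + 1):
    start = np.random.randint(0, df_max_row + 1)
    s = target_df['pl'].iloc[start:start + sample_len].reset_index(drop=True)
    columns.append(s.rename(str(i)))

# concatenate the sample columns side by side instead of appending rows
sampled_df = pd.concat(columns, axis=1)
print(sampled_df)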
I have a multi-index dataframe in pandas (date and entity_id), and for each date/entity I have observations of a number of variables (A, B, ...). My goal is to create a dataframe of the same shape where the values are replaced by their decile scores.
My test data looks like this:
I want to apply qcut to each column, grouped by level 0 of the multi-index. The issue I have is creating the result DataFrame.
This code
def qcut_sub_index(df_with_sub_index):
    # create empty return value same shape as passed dataframe
    df_return = pd.DataFrame()
    for date, sub_df in df_with_sub_index.groupby(level=0):
        df_return = df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False, duplicates='drop')))
    print(df_return)
    return df_return

print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
returns
A
as_at_date entity_id
2008-01-27 2928 0
2932 3
3083 6
3333 9
2008-02-27 2928 3
2935 9
3333 0
3874 6
2008-03-27 2928 1
2932 2
2934 0
2936 9
2937 4
2939 9
2940 7
2943 3
2944 0
2945 8
2946 6
2947 5
2949 4
B
as_at_date entity_id
2008-01-27 2928 9
2932 6
3083 0
3333 3
2008-02-27 2928 6
2935 0
3333 3
3874 9
2008-03-27 2928 0
2932 9
2934 2
2936 8
2937 7
2939 6
2940 3
2943 1
2944 4
2945 9
2946 5
2947 4
2949 0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-104-72ff0e6da288> in <module>
11
12
---> 13 print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7546 kwds=kwds,
7547 )
-> 7548 return op.get_result()
7549
7550 def applymap(self, func) -> "DataFrame":
~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
272
273 # wrap results
--> 274 return self.wrap_results(results, res_index)
275
276 def apply_series_generator(self) -> Tuple[ResType, "Index"]:
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results(self, results, res_index)
313 # see if we can infer the results
314 if len(results) > 0 and 0 in results and is_sequence(results[0]):
--> 315 return self.wrap_results_for_axis(results, res_index)
316
317 # dict of scalars
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results_for_axis(self, results, res_index)
369
370 try:
--> 371 result = self.obj._constructor(data=results)
372 except ValueError as err:
373 if "arrays must all be same length" in str(err):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
466
467 elif isinstance(data, dict):
--> 468 mgr = init_dict(data, index, columns, dtype=dtype)
469 elif isinstance(data, ma.MaskedArray):
470 import numpy.ma.mrecords as mrecords
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
281 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
282 ]
--> 283 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
284
285
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
76 # figure out the index, if necessary
77 if index is None:
---> 78 index = extract_index(arrays)
79 else:
80 index = ensure_index(index)
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
385
386 if not indexes and not raw_lengths:
--> 387 raise ValueError("If using all scalar values, you must pass an index")
388
389 if have_series:
ValueError: If using all scalar values, you must pass an index
so something is preventing the second application of the lambda function.
I'd appreciate your help, thanks for taking a look.
p.s. If this can be done implicitly without using apply, I would love to hear it. Thanks.
Your solution appears overcomplicated, and the terminology is non-standard: multi-indexes have levels, so the task is better stated as qcut() by level 0 of the multi-index (rather than in terms of sub-frames, which are not a pandas concept).
Bringing it all together:
use the **kwargs approach to pass arguments to assign() for all columns in the data frame
groupby(level=0) is as_at_date
transform() to get a row back for every entry in the index
import datetime as dt
import numpy as np
import pandas as pd

s = 12
df = pd.DataFrame({"as_at_date": np.random.choice(pd.date_range(dt.date(2020,1,27), periods=3, freq="M"), s),
                   "entity_id": np.random.randint(2900, 3500, s),
                   "A": np.random.random(s),
                   "B": np.random.random(s)*(10**np.random.randint(8,10,s))
                   }).sort_values(["as_at_date","entity_id"])
df = df.set_index(["as_at_date","entity_id"])

df2 = df.assign(**{c: df.groupby(level=0)[c].transform(lambda x: pd.qcut(x, 10, labels=False))
                   for c in df.columns})
df
A B
as_at_date entity_id
2020-01-31 2926 0.770121 2.883519e+07
2943 0.187747 1.167975e+08
2973 0.371721 3.133071e+07
3104 0.243347 4.497294e+08
3253 0.591022 7.796131e+08
3362 0.810001 6.438441e+08
2020-02-29 3185 0.690875 4.513044e+08
3304 0.311436 4.561929e+07
2020-03-31 2953 0.325846 7.770111e+08
2981 0.918461 7.594753e+08
3034 0.133053 6.767501e+08
3355 0.624519 6.318104e+07
df2
A B
as_at_date entity_id
2020-01-31 2926 7 0
2943 0 3
2973 3 1
3104 1 5
3253 5 9
3362 9 7
2020-02-29 3185 9 9
3304 0 0
2020-03-31 2953 3 9
2981 9 6
3034 0 3
3355 6 0
Using concat inside an iteration on the original dataframe does the trick but is there a smarter way to do this?
thanks
def qcut_sub_index(df_with_sub_index):
    # create empty return value same shape as passed dataframe
    df_return = pd.DataFrame()
    for date, sub_df in df_with_sub_index.groupby(level=0):
        df_return = df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False,
                                                          duplicates='drop')))
    return df_return

df_x = pd.DataFrame()
for (columnName, columnData) in df_values.iteritems():
    df_x = pd.concat([df_x, qcut_sub_index(columnData)], axis=1, join="outer")

df_x
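For what it's worth, the per-column loop can be collapsed into a single groupby/transform call (essentially the approach from the other answer), which also avoids apply entirely. A sketch, assuming pandas is imported as pd and df_values is the multi-indexed frame of raw values from the question:

decile_scores = df_values.groupby(level=0).transform(
    lambda x: pd.qcut(x, 10, labels=False, duplicates='drop'))

transform() applies the function to each column within each level-0 group and returns a frame with the original index, so no manual reassembly is needed.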
Basically I'm trying to get the regex-matched word plus the word from the loop onto the same row of the DataFrame, if that makes sense.
I've tried creating lists from the column names and zipping through them, but that didn't work. I'm also not sure how to use a variable name inside re.findall with regex, otherwise I'd try something like:
result = pd.DataFrame()
for word in file2:
    x = re.findall(word.*, file1)
    result = result.append(x, word)
I know the above doesn't work, and there are probably a few reasons why; I'd really love some explanation as to why that mock code wouldn't work and how to make it work. Meanwhile, I came up with the below, which HALF works:
import pandas as pd
from pandas import read_csv

file1 = pd.DataFrame(read_csv('alldatacols.csv'))
file2 = pd.DataFrame(read_csv('trainColumnsT2.csv'))

new_df = pd.DataFrame()
for word in file2['Name']:
    y = file1['0'].str.extractall(r'({}.*)'.format(word))
    new_df = new_df.append(y, ignore_index=True)

new_df.reset_index(drop=True)
print(new_df)
print(new_df.columns)
Result:
0 Id
1 MSSubClass
2 MSZoning_C (all)
3 MSZoning_FV
4 MSZoning_RH
5 MSZoning_RL
6 MSZoning_RM
7 LotFrontage
8 LotArea
9 Street_Grvl
10 Street_Pave
11 Alley_Grvl
12 Alley_Pave
13 LotShape_IR1
14 LotShape_IR2
15 LotShape_IR3
16 LotShape_Reg
17 LandContour_Bnk
18 LandContour_HLS
19 LandContour_Low
20 LandContour_Lvl
21 Utilities_AllPub
22 Utilities_NoSeWa
23 LotConfig_Corner
24 LotConfig_CulDSac
25 LotConfig_FR2
26 LotConfig_FR3
27 LotConfig_Inside
28 LandSlope_Gtl
29 LandSlope_Mod
.. ...
266 PoolArea
267 PoolQC_Ex
268 PoolQC_Fa
269 PoolQC_Gd
270 Fence_GdPrv
271 Fence_GdWo
272 Fence_MnPrv
273 Fence_MnWw
274 MiscFeature_Gar2
275 MiscFeature_Othr
276 MiscFeature_Shed
277 MiscFeature_TenC
278 MiscVal
279 MoSold
280 YrSold
281 SaleType_COD
282 SaleType_CWD
283 SaleType_Con
284 SaleType_ConLD
285 SaleType_ConLI
286 SaleType_ConLw
287 SaleType_New
288 SaleType_Oth
289 SaleType_WD
290 SaleCondition_Abnorml
291 SaleCondition_AdjLand
292 SaleCondition_Alloca
293 SaleCondition_Family
294 SaleCondition_Normal
295 SaleCondition_Partial
Example output I'm looking for:
17 LandContour_Bnk LandContour
18 LandContour_HLS LandContour
19 LandContour_Low LandContour
20 LandContour_Lvl LandContour
274 MiscFeature_Gar2 MiscFeature
275 MiscFeature_Othr MiscFeature
276 MiscFeature_Shed MiscFeature
277 MiscFeature_TenC MiscFeature
281 SaleType_COD SaleType
282 SaleType_CWD SaleType
283 SaleType_Con SaleType
284 SaleType_ConLD SaleType
285 SaleType_ConLI SaleType
286 SaleType_ConLw SaleType
287 SaleType_New SaleType
288 SaleType_Oth SaleType
289 SaleType_WD SaleType
290 SaleCondition_Abnorml SaleCondition
291 SaleCondition_AdjLand SaleCondition
292 SaleCondition_Alloca SaleCondition
293 SaleCondition_Family SaleCondition
294 SaleCondition_Normal SaleCondition
295 SaleCondition_Partial SaleCondition
Please help me understand how to get this over the hump. Thank you!
So I just made another DataFrame that got populated by searching for just the word, without the .*. Then I concatenated both DataFrames and got what I needed. That was a simple solution I couldn't think of for the last 7 hours while I tried other approaches.
import pandas as pd
from pandas import read_csv

file1 = pd.DataFrame(read_csv('alldatacols.csv'))
file2 = pd.DataFrame(read_csv('trainColumnsT2.csv'))

new_df = pd.DataFrame()
df = pd.DataFrame()                                         # new: second frame for the bare-word matches
for word in file2['Name']:
    y = file1['0'].str.extractall(r'({}.*)'.format(word))
    x = file1['0'].str.extractall(r'({})'.format(word))     # new: match just the word itself
    new_df = new_df.append(y, ignore_index=True)
    df = df.append(x, ignore_index=True)                    # new

new_df.reset_index(drop=True)
df.reset_index(drop=True)
result = pd.concat([new_df, df], axis=1)
result.to_csv('result.csv')
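For completeness, the prefix word can be captured in the same pass, so only one extractall per word is needed. This is just a sketch against the same layout assumed above (file1 with a column named '0', file2 with a column named 'Name'):

import re
import pandas as pd

file1 = pd.read_csv('alldatacols.csv')
file2 = pd.read_csv('trainColumnsT2.csv')

frames = []
for word in file2['Name']:
    # two capture groups: the whole matched column name and the bare word
    m = file1['0'].str.extractall(r'(({}).*)'.format(re.escape(word)))
    frames.append(m)

result = pd.concat(frames, ignore_index=True)
result.columns = ['matched', 'word']
print(result)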
I have an input file, as shown below, which needs to be arranged so that the key values are in ascending order, while keys that are not present are printed last.
I am getting the data arranged in the required format, but the ordering is missing.
I have tried using the sort() method but it shows "list has no attribute sort".
Please suggest a solution, and also any modifications required.
Input file:
3=1388|4=1388|5=IBM|8=157.75|9=88929|1021=1500|854=n|388=157.75|394=157.75|474=157.75|1584=88929|444=20160713|459=93000546718000|461=7|55=93000552181000|22=89020|400=157.75|361=0.73|981=0|16=1468416600.6006|18=1468416600.6006|362=0.46
3=1388|4=1388|5=IBM|8=157.73|9=100|1021=0|854=p|394=157.73|474=157.749977558|1584=89029|444=20160713|459=93001362639104|461=26142|55=93001362849000|22=89120|361=0.71|981=0|16=1468416601.372|18=1468416601.372|362=0.45
3=1388|4=1388|5=IBM|8=157.69|9=100|1021=600|854=p|394=157.69|474=157.749910415|1584=89129|444=20160713|459=93004178882560|461=27052|55=93004179085000|22=89328|361=0.67|981=1|16=1468416604.1916|18=1468416604.1916|362=0.43
Code I tried:
import pandas as pd
import numpy as np

df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [dict(w.split('=', 1) for w in x) for x in s]
p = pd.DataFrame.from_records(ds)
p1 = p.replace(np.nan, 'n/a', regex=True)
st = p1.stack(level=0, dropna=False)
dfs = [g for i, g in st.groupby(level=0)]
#print(st)

i = 0
while i < len(dfs):
    # index of each column
    print('\nindex[%d]' % i)
    for (_, k), v in dfs[i].iteritems():
        print(k, '\t', v)
    i = i + 1
Output I am getting:
index[0]
1021 1500
1584 88929
16 1468416600.6006
18 1468416600.6006
22 89020
3 1388
361 0.73
362 0.46
388 157.75
394 157.75
4 1388
400 157.75
444 20160713
459 93000546718000
461 7
474 157.75
5 IBM
55 93000552181000
8 157.75
854 n
9 88929
981 0
index[1]
1021 0
1584 89029
16 1468416601.372
18 1468416601.372
22 89120
3 1388
361 0.71
362 0.45
388 n/a
394 157.73
4 1388
400 n/a
444 20160713
459 93001362639104
461 26142
474 157.749977558
5 IBM
55 93001362849000
8 157.73
854 p
9 100
981 0
Expected output:
index[0]
3 1388
4 1388
5 IBM
8 157.75
9 88929
16 1468416600.6006
18 1468416600.6006
22 89020
55 93000552181000
361 0.73
362 0.46
388 157.75
394 157.75
400 157.75
444 20160713
459 93000546718000
461 7
474 157.75
854 n
981 0
1021 1500
1584 88929
index[1]
3 1388
4 1388
5 IBM
8 157.75
9 88929
16 1468416600.6006
18 1468416600.6006
22 89020
55 93000552181000
361 0.73
362 0.46
394 157.75
444 20160713
459 93000546718000
461 7
474 157.75
854 n
981 0
1021 1500
1584 88929
388 n/a
400 n/a
Replace your ds line with
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]
This converts the keys to integers so that they are sorted numerically.
To output the n/a values at the end, you could use pandas selection to print the non-null values first and then the null values, e.g.:
for (ix, series) in p.iterrows():
    print('\nindex[%d]' % ix)
    output_series(ix, series[pd.notnull])
    output_series(ix, series[pd.isnull].fillna('n/a'))
btw, you can also simplify your stack, groupby, print to:
for (ix, series) in p1.iterrows():
    print('\nindex[%d]' % ix)
    for tag, value in series.iteritems():
        print(tag, '\t', value)
So the whole script becomes:
import pandas as pd

def output_series(ix, series):
    for tag, value in series.iteritems():
        print(tag, '\t', value)

df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]
p = pd.DataFrame.from_records(ds)
p = p.sort_index(axis=1)   # make sure the integer-keyed columns are in ascending order

for (ix, series) in p.iterrows():
    print('\nindex[%d]' % ix)
    output_series(ix, series[pd.notnull])
    output_series(ix, series[pd.isnull].fillna('n/a'))
Here:
import pandas as pd
import numpy as np

df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
# cast the keys to int so that sort_index() below orders them numerically
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]
p1 = pd.DataFrame.from_records(ds).fillna('n/a')
st = p1.stack(level=0, dropna=False)

for k, v in st.groupby(level=0):
    print(k, v.sort_index())
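If you also want the n/a placeholders printed at the end of each block, as in the expected output, the series can be split before printing. A sketch reusing st from the snippet above:

for k, v in st.groupby(level=0):
    v = v.sort_index()
    print('\nindex[%d]' % k)
    print(v[v != 'n/a'])   # real values first
    print(v[v == 'n/a'])   # missing keys last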
Problem:
I'm trying to merge two relatively small datasets together, but the merge raises a MemoryError. I have two datasets of aggregated country trade data that I'm trying to merge on the keys year and country, so the data needs to be particularly placed. This unfortunately makes the use of concat and its performance benefits impossible, as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
mporter cod CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!
In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
A B C
0 1 2 3
1 2 3 4
df2
A B D
0 1 2 7
1 2 3 8
both
A B C D
0 1 2 3 7
1 2 3 4 8
I haven't benchmarked the performance of this approach, but it didn't get the memory error and worked for my applications.
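An equivalent spelling of the same idea uses DataFrame.join after setting the index, which avoids repeating set_index inside concat. A minimal sketch with the same toy frames (not benchmarked either):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 3], 'C': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [2, 3], 'D': [7, 8]})

# join aligns on the shared (A, B) index, like concat(axis=1) above
both = df1.set_index(['A', 'B']).join(df2.set_index(['A', 'B'])).reset_index()
print(both)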