sdf = sdf['Name1'].apply(lambda x: tryLookup(x, tdf))
tryLookup is a function that currently takes a string, the value of the Name1 column in sdf. We map the function over every row of the sdf DataFrame using apply.
Instead of tryLookup returning just a string, is there a way for tryLookup to return a DataFrame that I can merge with the sdf DataFrame? tryLookup has some extra information, and I want to include it in the results by adding new columns to all the rows in sdf.
So the return for tryLookup is as such:
return pd.Series({'BEST MATCH': bestMatch, 'SIMILARITY SCORE': humanScore})
I tried something such as
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)), left_index=True, right_index=True)
But that just throws
Traceback (most recent call last):
File "lookup.py", line 160, in <module>
main()
File "lookup.py", line 40, in main
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)), left_index=True, right_index=True)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4618, in merge
copy=copy, indicator=indicator)
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 58, in merge
copy=copy, indicator=indicator)
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 473, in __init__
'type {0}'.format(type(right)))
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
Any help would be great. Thanks.
Try converting the pd.Series to a DataFrame with pandas.Series.to_frame, documented here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)).to_frame(), left_index=True, right_index=True)
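For reference, here is a minimal, self-contained sketch of the pattern (with a stub tryLookup and made-up data, not the real lookup logic). When the applied function returns a pd.Series, Series.apply expands the result into a DataFrame whose columns come from the Series index, and that DataFrame merges back on the index (in that case the .to_frame() call is unnecessary):
import pandas as pd

def tryLookup(name, tdf):
    # stub standing in for the real fuzzy-matching logic
    return pd.Series({'BEST MATCH': name.upper(), 'SIMILARITY SCORE': 0.9})

sdf = pd.DataFrame({'Name1': ['alpha', 'beta']})
tdf = None  # placeholder for the real lookup table

matches = sdf['Name1'].apply(lambda x: tryLookup(x, tdf))  # a DataFrame with two columns
sdf = sdf.merge(matches, left_index=True, right_index=True)
print(sdf)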
I was working with a MultiIndex dataframe (which I find unbelievably complicated to work with). I flattened the MultiIndex into just one level with this line of code.
df_append.columns = df_append.columns.map('|'.join).str.strip('|')
Now, when I print columns, I get this.
Index(['IDRSSD', 'RCFD3531|TRDG ASSETS-US TREAS SECS IN DOM OFF',
'RCFD3532|TRDG ASSETS-US GOV AGC CORP OBLGS',
'RCFD3533|TRDG ASSETS-SECS ISSD BY ST POL SUB',
'TEXTF660|3RD ITEMIZED AMT FOR OTHR TRDG ASTS',
'Unnamed: 115_level_0|Unnamed: 115_level_1',
'Unnamed: 133_level_0|Unnamed: 133_level_1',
'Unnamed: 139_level_0|Unnamed: 139_level_1',
'Unnamed: 20_level_0|Unnamed: 20_level_1',
'Unnamed: 87_level_0|Unnamed: 87_level_1', 'file', 'schedule_code',
'year', 'qyear'],
dtype='object', length=202)
I am trying to concatenate two columns into one single column, like this.
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
Here is the error that I am seeing.
Traceback (most recent call last):
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'IDRSSD'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-153-92d2e8486595>", line 1, in <module>
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'IDRSSD'
To me, it looks like I have a column named 'IDRSSD' and a column named 'qyear', but Python disagrees. Or, perhaps I am misinterpreting the error message. Can I get these two columns concatenated into one, or is this impossible to do? Thanks everyone.
I tried the method below. It worked for me.
1.) First convert the columns to strings:
df_append['IDRSSD'] = df_append['IDRSSD'].astype(str)
df_append['qyear'] = df_append['qyear'].astype(str)
2.) Now join both columns into one using '-' as the separator:
df_append['period'] = df_append[['IDRSSD', 'qyear']].apply(lambda x: '-'.join(x), axis=1)
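If the columns do resolve correctly, an equivalent vectorized one-liner (a sketch, assuming both columns exist under exactly those names) is Series.str.cat:
df_append['period'] = df_append['IDRSSD'].astype(str).str.cat(df_append['qyear'].astype(str), sep='-')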
You can use df_append.columns = df_append.columns.to_flat_index() to change the MultiIndex into a one dimensional array of tuples. From there you should be able to manipulate the columns more easily, or at least see what the issue is.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.to_flat_index.html
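For illustration, a small sketch with made-up column tuples (not the asker's real data) showing what to_flat_index produces:
import pandas as pd

cols = pd.MultiIndex.from_tuples([('IDRSSD', ''),
                                  ('RCFD3531', 'TRDG ASSETS-US TREAS SECS IN DOM OFF')])
df = pd.DataFrame([[1, 2]], columns=cols)
df.columns = df.columns.to_flat_index()  # now a plain Index of tuples
print(df.columns)
print(df[('IDRSSD', '')])  # after flattening, columns are addressed by tuple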
Use the apply method:
import pandas as pd

def concat(row):
    # build "col1-col2" only when both labels are present in the row
    if ("col1" in row) & ("col2" in row):
        return str(row['col1']) + "-" + str(row['col2'])

df = pd.DataFrame([["1", "2"], ["1", "2"]], columns=["col1", "col2"])
df['col3'] = df.apply(lambda row: concat(row), axis=1)
df
I would like to write code that sorts all the rows of Test1.csv (everything after the header row) according to the order of the Entry column in Test2.csv.
I would appreciate your advice. Thank you for your cooperation.
This is a simplified version of the data (the real files have more than 1000 lines).
import pandas as pd
input_path1 = "Test1.csv"
input_path2 = "Test2.csv"
output_path = "output.csv"
df1 = pd.read_csv(filepath_or_buffer=input_path1, encoding="utf-8")
df2 = pd.read_csv(filepath_or_buffer=input_path2, encoding="utf-8")
(df1.merge(df2, how='left', on='Entry')
    .set_index('Entry')
    .drop('Number_x', axis='columns')
    .rename({'Number_y': 'Number'}, axis='columns')
    .to_csv(output_path))
Error message
Traceback (most recent call last):
File "narabekae.py", line 28, in <module>
.drop('Number_x', axis='columns')
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/frame.py", line 4102, in drop
errors=errors,
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/generic.py", line 3914, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/generic.py", line 3946, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 5340, in drop
raise KeyError("{} not found in axis".format(labels[mask]))
KeyError: "['Number_x'] not found in axis"
The output I want
,V1,V2,>sp,Entry,details,PepPI
1,OS=Ha,MTNKG,>sp,A4G4K7,HFQ_HERAR,7.028864399
2,OS=Sh,MAKGQ,>sp,B4TFA6,HFQ_SALHS,7.158609631
3,OS=Oi,MAQSV,>sp,Q8EQQ9,HFQ_OCEIH,9.229953074
4,OS=Bc,MAERS,>sp,A9M5C4,HFQ_BRUC2,8.154348935
5,OS=Re,MAERS,>sp,Q2K8U6,HFQ_RHIEC,8.154348935
Test1.csv
,V1,V2,>sp,Entry,details,PepPI
1,OS=Re,MAERS,>sp,Q2K8U6,HFQ_RHIEC,8.154348935
2,OS=Sh,MAKGQ,>sp,B4TFA6,HFQ_SALHS,7.158609631
3,OS=Ha,MTNKG,>sp,A4G4K7,HFQ_HERAR,7.028864399
4,OS=Bc,MAERS,>sp,A9M5C4,HFQ_BRUC2,8.154348935
5,OS=Oi,MAQSV,>sp,Q8EQQ9,HFQ_OCEIH,9.229953074
Test2.csv
pI,Molecular weight (average),Entry,Entry name,Organism
6.82,8763.13,A4G4K7,HFQ_HERAR,Rat
6.97,11119.33,B4TFA6,HFQ_SALHS,Pig
9.22,8438.69,Q8EQQ9,HFQ_OCEIH,Bacteria
7.95,8854.28,A9M5C4,HFQ_BRUC2,Mouse
7.95,9044.5,Q2K8U6,HFQ_RHIEC,Human
Additional information
macOS 10.15.4, Python 3.7.3, Atom
To reorder the columns, you just define a list of columns in the order that you want and use df[columns]:
In [17]: columns = ["V1","V2",">sp","Entry","details","PepPI"]
In [18]: df = df1.merge(df2, how='left', on='Entry')
In [19]: df[columns]
Out[19]:
V1 V2 >sp Entry details PepPI
0 OS=Re MAERS >sp Q2K8U6 HFQ_RHIEC 8.154349
1 OS=Sh MAKGQ >sp B4TFA6 HFQ_SALHS 7.158610
2 OS=Ha MTNKG >sp A4G4K7 HFQ_HERAR 7.028864
3 OS=Bc MAERS >sp A9M5C4 HFQ_BRUC2 8.154349
4 OS=Oi MAQSV >sp Q8EQQ9 HFQ_OCEIH 9.229953
Naturally, you can save it normally with the to_csv() method:
df[columns].to_csv(output_path)
Notes
The errors are not reproducible with the data given, since there are no Number columns in the dataframes df1 and df2.
You should not set_index("Entry") if you want Entry saved in the middle of the .csv (in "The output I want" above, you have simple integer-based indexing and Entry as a regular column).
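Note also that merge keeps df1's row order, while "The output I want" follows Test2's Entry order. A sketch of one way to reorder the rows (assuming every Entry in Test1 appears exactly once in Test2):
order = df2['Entry'].tolist()  # Entry values in the order they appear in Test2.csv
df1_sorted = df1.set_index('Entry').loc[order].reset_index()
df1_sorted = df1_sorted[["V1", "V2", ">sp", "Entry", "details", "PepPI"]]
df1_sorted.to_csv(output_path)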
I have a dataset which requires missing value treatment.
Column Missing Values
Complaint_ID 0
Date_received 0
Transaction_Type 0
Complaint_reason 0
Company_response 22506
Date_sent_to_company 0
Complaint_Status 0
Consumer_disputes 7698
Now the problem is, when I try to replace the missing values with the mode within groups formed by other columns, using groupby:
Code:
data11["Company_response"] =
data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()
[0]))["Company_response"]
data11["Consumer_disputes"] =
data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()
[0]))["Consumer_disputes"]
I get the following error:
Stacktrace
Traceback (most recent call last):
File "<ipython-input-89-8de6a010a299>", line 1, in <module>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3741, in transform
return self._transform_general(func, *args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3699, in _transform_general
res = path(group)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4360, in apply
ignore_failures=ignore_failures)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4456, in _apply_standard
results[i] = func(v)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "<ipython-input-89-8de6a010a299>", line 1, in <lambda>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2434, in get_value
return libts.get_value_box(s, key)
File "pandas\_libs\tslib.pyx", line 923, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18843)
File "pandas\_libs\tslib.pyx", line 939, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18560)
IndexError: ('index out of bounds', 'occurred at index Consumer_disputes')
I have checked the length of the dataframe and all of its columns, and it is the same: 43266.
I have also found a similar question, but it does not have a correct answer.
Please help resolve the error.
IndexError: ('index out of bounds', 'occurred at index Consumer_disputes')
I am using the code below successfully, but it does not serve my purpose exactly; it does fill the missing values, though.
data11['Company_response'].fillna(data11['Company_response'].mode()[0], inplace=True)
data11['Consumer_disputes'].fillna(data11['Consumer_disputes'].mode()[0], inplace=True)
Edit 1: (attaching a sample)
Input given and expected output: (screenshots in the original post; the replication code below reconstructs the input.)
You can see that the missing Company_response values for Tr-1 and Tr-3 are filled by taking the mode within each Complaint_reason group, and similarly the missing Consumer_disputes value for Tr-5 by taking the mode within each Transaction_Type group.
The snippet below contains the dataframe and the code for anyone who wants to replicate this and give it a try.
Replication Code
import pandas as pd
import numpy as np
data11=pd.DataFrame({'Complaint_ID':['Tr-1','Tr-2','Tr-3','Tr-4','Tr-5','Tr-6'],
'Transaction_Type':['Mortgage','Credit card','Bank account or service','Debt collection','Credit card','Mortgage'],
'Complaint_reason':['Loan servicing, payments, escrow account','Incorrect information on credit report',"Cont'd attempts collect debt not owed","Cont'd attempts collect debt not owed",'Payoff process','Loan servicing, payments, escrow account'],
'Company_response':[np.nan,'Company chooses not to provide a public response',np.nan,'Company believes it acted appropriately as authorized by contract or law','Company has responded to the consumer and the CFPB and chooses not to provide a public response','Company disputes the facts presented in the complaint'],
'Consumer_disputes':['Yes','No','No','No',np.nan,'Yes']})
data11.isnull().sum()
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
data11["Consumer_disputes"] = data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
The error is raised because, for at least one of the groups, the corresponding aggregated column contains only np.nan values. In this case pd.Series([np.nan]).mode() returns an empty Series, which leads to an error when you take the first value.
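This is easy to demonstrate in isolation (a minimal sketch):
import numpy as np
import pandas as pd

m = pd.Series([np.nan]).mode()
print(m)        # Series([], dtype: float64) -- the mode of all-NaN data is empty
print(m.empty)  # True, so m[0] would raise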
So, you may use something like transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty")).
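Applied to the replication code above, the guard looks like this (a sketch; "Empty" is just a placeholder sentinel, and selecting the column before transform keeps the lambda from running over unrelated columns):
data11["Company_response"] = data11.groupby("Complaint_reason")["Company_response"].transform(
    lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty"))
data11["Consumer_disputes"] = data11.groupby("Transaction_Type")["Consumer_disputes"].transform(
    lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty"))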
Try:
data11["Company_response"] = data11.groupby("Complaint_reason")['Company_response'].transform(lambda x: x.fillna(x.mode()[0]))
data11["Consumer_disputes"] = data11.groupby("Transaction_Type")['Consumer_disputes'].transform(lambda x: x.fillna(x.mode()[0]))
@Mikhail Berlinkov is almost certainly correct. I was able to reproduce your error, and then avoid it by using dropna():
data11.groupby("Transaction-Type").transform(
lambda x: x.fillna(x.mode() [0]))["Consumer-disputes"]
# Returns IndexError
data11.dropna().groupby("Transaction-Type").transform(
lambda x: x.fillna(x.mode() [0]))["Consumer-disputes"]
# Works
I have problems when I try to assign a dict to the df DataFrame,
df.loc[index,'count'] = dict()
as I get this error message:
Incompatible indexer with Series
To work around this problem, I can do this,
df.loc[index,'count'] = [dict()]
, but I don't like this solution, since I have to unwrap the list to get the dictionary, i.e.
a = (df.loc[index,'count'])[0]
How can I solve this situation in a more elegant way?
EDIT1
One way to replicate the whole problem is as follows.
Code:
import pandas as pd
df = pd.DataFrame(columns= ['count', 'aaa'])
d = dict()
df.loc[0, 'count'] = [d]; print('OK!');
df.loc[0, 'count'] = d
Output:
OK!
Traceback (most recent call last):
File "<ipython-input-193-67bbd89f2c69>", line 4, in <module>
df.loc[0, 'count'] = d
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 194, in __setitem__
self._setitem_with_indexer(indexer, value)
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 625, in _setitem_with_indexer
value = self._align_series(indexer, Series(value))
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 765, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series
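One commonly suggested workaround (a sketch, not from the original post) is df.at, which sets a single cell directly instead of trying to align the dict like a Series:
import pandas as pd

df = pd.DataFrame(columns=['count', 'aaa'])
df.loc[0, 'aaa'] = 1        # make sure row 0 exists first
df.at[0, 'count'] = dict()  # .at assigns the object to one cell, no alignment
print(df.loc[0, 'count'])   # {} -- no list wrapper to unpack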
So I have two pandas dataframes created via
df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
These both have the column column1. To double check,
print(df1.columns)
print(df2.columns)
both return a column 'column1'.
So, I would like to merge these two dataframes with dask, using 60 threads locally (using an outer merge):
dd1 = dd.merge(df1, df2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
That fails with KeyError: 'column1':
Traceback (most recent call last):
File "INSTALLATIONPATH/python3.5/site-packages/pandas/indexes/base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)
KeyError: 'column1'
I would think this is a parallelizable task, i.e. dd.merge(df1, df2, on='column1').
Is there a "dask-equivalent" operation for this? I also tried reindexing the pandas dataframes on chr (i.e. df1 = df1.reset_index('chr') ) and then tried joining on the index
dd.merge(df1, df2, left_index=True, right_index=True)
That didn't work either, same error.
http://dask.pydata.org/en/latest/dataframe-overview.html
From your error, I would double-check your initial dataframes to make sure column1 really is an actual column in both (no extra spaces or anything), because the merge itself works fine (the following code runs without error).
Additionally, there's a difference between calling merge on a pandas.DataFrame and on a dask.dataframe.
Here's some example data:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame(np.transpose([np.arange(1000), np.arange(1000)]),
                   columns=['column1', 'column1_1'])
df2 = pd.DataFrame(np.transpose([np.arange(1000), np.arange(1000, 2000)]),
                   columns=['column1', 'column1_2'])
And their dask equivalent:
ddf1 = dd.from_pandas(df1, npartitions=100)
ddf2 = dd.from_pandas(df2, npartitions=100)
Using pandas.DataFrame:
In [1]: type(dd.merge(df1, df2, on="column1", how="outer"))
Out [1]: pandas.core.frame.DataFrame
This returns a pandas.DataFrame, so you cannot call compute() on it.
Using dask.dataframe:
In [2]: type(dd.merge(ddf1, ddf2, on="column1", how="outer"))
Out[2]: dask.dataframe.core.DataFrame
Here you can call compute:
In [3]: dd.merge(ddf1,ddf2, how='outer').compute(num_workers=60)
Out[3]:
column1 column1_1 column1_2
0 0 0 1000
1 400 400 1400
2 100 100 1100
3 500 500 1500
4 300 300 1300
Side note: depending on the size of your data and your hardware, you might want to check whether a plain pandas join wouldn't be faster:
df1.set_index('column1').join(df2.set_index('column1'), how='outer').reset_index()
Using a size of (1,000,000, 2) for each df, it's faster than the dask solution on my hardware.
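If you want to check that on your own data, here is a rough timing harness (illustrative only; sizes are made up and results will vary by hardware):
import time
import numpy as np
import pandas as pd

n = 1000000
df1 = pd.DataFrame({'column1': np.arange(n), 'column1_1': np.arange(n)})
df2 = pd.DataFrame({'column1': np.arange(n), 'column1_2': np.arange(n, 2 * n)})

t0 = time.perf_counter()
res = df1.set_index('column1').join(df2.set_index('column1'), how='outer').reset_index()
print('pandas join took', round(time.perf_counter() - t0, 3), 'seconds')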