KeyError using `dask.merge()` - python

So I have two pandas dataframes created via
df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
These both have the column column1. To double check,
print(df1.columns)
print(df2.columns)
both return a column 'column1'.
So, I would like to merge these two dataframes with dask, using 60 threads locally (using an outer merge):
dd1 = dd.merge(df1, df2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
That fails with a KeyError, KeyError: 'column1'
Traceback (most recent call last):
File "INSTALLATIONPATH/python3.5/site-packages/pandas/indexes/base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)
KeyError: 'column1'
I would think this is a parallelizable task, i.e. something like dd.merge(df1, df2, on='column1').
Is there a "dask-equivalent" operation for this? I also tried setting the merge column as the index on the pandas dataframes (i.e. df1 = df1.set_index('column1')) and then joining on the index
dd.merge(df1, df2, left_index=True, right_index=True)
That didn't work either, same error.
http://dask.pydata.org/en/latest/dataframe-overview.html

From your error, I would double-check your initial dataframes to make sure column1 really is present in both as an actual column name (no extra spaces or anything), because the merge itself should work fine (there is no error in the code below).
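If the header of either CSV contains stray whitespace (e.g. " column1"), a minimal check-and-fix sketch, reusing the file names from the question:
import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

# Strip accidental whitespace from the header names before merging
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()

print("column1" in df1.columns, "column1" in df2.columns)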
Additionally, there's a difference between calling merge on pandas.DataFrame or on Dask.dataframe.
Here's some example data:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame(np.transpose([np.arange(1000),
                                 np.arange(1000)]), columns=['column1', 'column1_1'])
df2 = pd.DataFrame(np.transpose([np.arange(1000),
                                 np.arange(1000, 2000)]), columns=['column1', 'column1_2'])
And their dask equivalent:
ddf1 = dd.from_pandas(df1, npartitions=100)
ddf2 = dd.from_pandas(df2, npartitions=100)
Using pandas.DataFrame:
In [1]: type(dd.merge(df1, df2, on="column1", how="outer"))
Out [1]: pandas.core.frame.DataFrame
So this returns a pandas.DataFrame, and you cannot call compute() on it.
Using dask.dataframe:
In [2]: type(dd.merge(ddf1, ddf2, on="column1", how="outer"))
Out[2]: dask.dataframe.core.DataFrame
Here you can call compute:
In [3]: dd.merge(ddf1, ddf2, on='column1', how='outer').compute(num_workers=60)
Out[3]:
column1 column1_1 column1_2
0 0 0 1000
1 400 400 1400
2 100 100 1100
3 500 500 1500
4 300 300 1300
Side note: depending on the size of your data and your hardware, you might want to check whether a plain pandas join isn't faster:
df1.set_index('column1').join(df2.set_index('column1'), how='outer').reset_index()
With a shape of (1,000,000, 2) for each dataframe, it was faster than the dask solution on my hardware.
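If you want to compare the two approaches on your own machine, a rough timing sketch along these lines (the shapes and partition counts are only illustrative):
import time
import numpy as np
import pandas as pd
import dask.dataframe as dd

n = 1000000
df1 = pd.DataFrame({'column1': np.arange(n), 'column1_1': np.arange(n)})
df2 = pd.DataFrame({'column1': np.arange(n), 'column1_2': np.arange(n, 2 * n)})
ddf1 = dd.from_pandas(df1, npartitions=100)
ddf2 = dd.from_pandas(df2, npartitions=100)

start = time.time()
df1.set_index('column1').join(df2.set_index('column1'), how='outer').reset_index()
print('pandas join:', time.time() - start)

start = time.time()
dd.merge(ddf1, ddf2, on='column1', how='outer').compute(num_workers=60)
print('dask merge:', time.time() - start)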

Related

How do I reorder rows in a CSV file by referring to a single column?

I would like to write code that sorts all the rows of Test1.csv (everything below the header) according to the order of the Entry column in Test2.csv.
I would appreciate your advice. Thank you for your cooperation.
This is a simplified version of the data (the real files have more than 1000 rows).
import pandas as pd
input_path1 = "Test1.csv"
input_path2 = "Test2.csv"
output_path = "output.csv"
df1 = pd.read_csv(filepath_or_buffer=input_path1, encoding="utf-8")
df2 = pd.read_csv(filepath_or_buffer=input_path2, encoding="utf-8")
(df1.merge(df2, how='left', on='Entry')
    .set_index('Entry')
    .drop('Number_x', axis='columns')
    .rename({'Number_y': 'Number'}, axis='columns')
    .to_csv(output_path))
Error message
Traceback (most recent call last):
File "narabekae.py", line 28, in <module>
.drop('Number_x', axis='columns')
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/frame.py", line 4102, in drop
errors=errors,
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/generic.py", line 3914, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/generic.py", line 3946, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 5340, in drop
raise KeyError("{} not found in axis".format(labels[mask]))
KeyError: "['Number_x'] not found in axis"
The output I want
,V1,V2,>sp,Entry,details,PepPI
1,OS=Ha,MTNKG,>sp,A4G4K7,HFQ_HERAR,7.028864399
2,OS=Sh,MAKGQ,>sp,B4TFA6,HFQ_SALHS,7.158609631
3,OS=Oi,MAQSV,>sp,Q8EQQ9,HFQ_OCEIH,9.229953074
4,OS=Bc,MAERS,>sp,A9M5C4,HFQ_BRUC2,8.154348935
5,OS=Re,MAERS,>sp,Q2K8U6,HFQ_RHIEC,8.154348935
Test1.csv
,V1,V2,>sp,Entry,details,PepPI
1,OS=Re,MAERS,>sp,Q2K8U6,HFQ_RHIEC,8.154348935
2,OS=Sh,MAKGQ,>sp,B4TFA6,HFQ_SALHS,7.158609631
3,OS=Ha,MTNKG,>sp,A4G4K7,HFQ_HERAR,7.028864399
4,OS=Bc,MAERS,>sp,A9M5C4,HFQ_BRUC2,8.154348935
5,OS=Oi,MAQSV,>sp,Q8EQQ9,HFQ_OCEIH,9.229953074
Test2.csv
pI,Molecular weight (average),Entry,Entry name,Organism
6.82,8763.13,A4G4K7,HFQ_HERAR,Rat
6.97,11119.33,B4TFA6,HFQ_SALHS,Pig
9.22,8438.69,Q8EQQ9,HFQ_OCEIH,Bacteria
7.95,8854.28,A9M5C4,HFQ_BRUC2,Mouse
7.95,9044.5,Q2K8U6,HFQ_RHIEC,Human
Additional information
macOS10.15.4 Python3.7.3 Atom
To reorder the columns, you just define a list of columns in the order that you want, and use df[columns]:
In [17]: columns = ["V1","V2",">sp","Entry","details","PepPI"]
In [18]: df = df1.merge(df2, how='left', on='Entry')
In [19]: df[columns]
Out[19]:
V1 V2 >sp Entry details PepPI
0 OS=Re MAERS >sp Q2K8U6 HFQ_RHIEC 8.154349
1 OS=Sh MAKGQ >sp B4TFA6 HFQ_SALHS 7.158610
2 OS=Ha MTNKG >sp A4G4K7 HFQ_HERAR 7.028864
3 OS=Bc MAERS >sp A9M5C4 HFQ_BRUC2 8.154349
4 OS=Oi MAQSV >sp Q8EQQ9 HFQ_OCEIH 9.229953
Naturally, you can save it normally with the to_csv() method:
df[columns].to_csv(output_path)
Notes
The error is not reproducible with the data given, since there are no Number columns in the dataframes df1 and df2.
You should not set_index("Entry") if you want Entry saved as a regular column in the middle of the .csv (in "The output I want" you have a simple integer-based index).
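If the rows of Test1.csv should also follow the Entry order of Test2.csv (as in "The output I want" above), a minimal sketch, assuming every Entry in Test2.csv occurs exactly once in Test1.csv:
import pandas as pd

df1 = pd.read_csv("Test1.csv", index_col=0)
df2 = pd.read_csv("Test2.csv")

# Put the Test1 rows in the order of Test2's Entry column, then rebuild a 1-based index
ordered = df1.set_index("Entry").reindex(df2["Entry"]).reset_index()
ordered = ordered[["V1", "V2", ">sp", "Entry", "details", "PepPI"]]
ordered.index = ordered.index + 1
ordered.to_csv("output.csv")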

Reshape pandas df of x,y,z in columns to x index, y header, and z values

I have a pandas df with x (latitudes), y (longitudes), and z (topography/elevation).
I want to plot a 3D surface plot but to do that I need a 2D z array whereas I have a 1D array (1 column).
Is there a way to pivot the dataframe so that the latitudes are the index, longitudes are the header and z are the values in the table?
I tried:
newdf = df.pivot(index='Pt_Latitude', columns='Pt_Longitude', values='topography')
but it gives me an index error:
Pt_Longitude Pt_Latitude topography Coordinated_Universal_Time S
0 272.799970 -45.261200 2670.92 2009-07-20T12:12:00.90412170 1
1 272.800520 -45.261986 2677.35 2009-07-20T12:12:00.90412170 2
2 272.798841 -45.261578 2670.04 2009-07-20T12:12:00.90412170 3
3 272.799396 -45.260423 2663.68 2009-07-20T12:12:00.90412170 4
4 272.801063 -45.260832 2671.67 2009-07-20T12:12:00.90412170 5
Traceback (most recent call last):
File "catalan.py", line 28, in <module>
newdf = df.pivot(index='Pt_Latitude', columns='Pt_Longitude', values='topography')
File "C:\Users\polyq\Anaconda3\lib\site-packages\pandas\core\frame.py", line 5923, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "C:\Users\polyq\Anaconda3\lib\site-packages\pandas\core\reshape\pivot.py", line 450, in pivot
return indexed.unstack(columns)
File "C:\Users\polyq\Anaconda3\lib\site-packages\pandas\core\series.py", line 3550, in unstack
return unstack(self, level, fill_value)
File "C:\Users\polyq\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 419, in unstack
constructor=obj._constructor_expanddim,
File "C:\Users\polyq\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 141, in __init__
self._make_selectors()
File "C:\Users\polyq\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 176, in _make_selectors
mask.put(selector, True)
IndexError: index 571859901 is out of bounds for axis 0 with size 571829246
I think you are using the wrong tool here. The pivot works fine with a tiny dataframe like the one in the example, but it produces a dataframe of roughly len(df) * len(df) cells, which means you are likely to be bitten by a memory error.
Long story short: I can see no way to pivot a large dataframe according to your requirement, because it will exhaust the available memory.
Why not just ...plt.plot_surface(*list(df.values), ...)?
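If the end goal is only the surface plot, another route that avoids the huge pivot entirely is matplotlib's plot_trisurf, which accepts the 1-D x, y, z columns directly; a minimal sketch, assuming df is the dataframe from the question:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(df['Pt_Longitude'], df['Pt_Latitude'], df['topography'], cmap='viridis')
ax.set_xlabel('Pt_Longitude')
ax.set_ylabel('Pt_Latitude')
ax.set_zlabel('topography')
plt.show()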

Merging two dataframes with pd.NA in merge column yields 'TypeError: boolean value of NA is ambiguous'

With Pandas 1.0.1, I'm unable to merge if the merge column contains pd.NA:
df = df.merge(df2, on=some_column)
yields
File /home/torstein/code/fintechdb/Sheets/sheets/gild.py, line 42, in gild
df = df.merge(df2, on=some_column)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py, line 7297, in merge
validate=validate,
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 88, in merge
return op.get_result()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 643, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 862, in _get_join_info
(left_indexer, right_indexer) = self._get_join_indexers()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 841, in _get_join_indexers
self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1311, in _get_join_indexers
zipped = zip(*mapped)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1309, in <genexpr>
for n in range(len(left_keys))
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1918, in _factorize_keys
rlab = rizer.factorize(rk)
File pandas/_libs/hashtable.pyx, line 77, in pandas._libs.hashtable.Factorizer.factorize
File pandas/_libs/hashtable_class_helper.pxi, line 1817, in pandas._libs.hashtable.PyObjectHashTable.get_labels
File pandas/_libs/hashtable_class_helper.pxi, line 1732, in pandas._libs.hashtable.PyObjectHashTable._unique
File pandas/_libs/missing.pyx, line 360, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
while this works:
df[some_column].fillna(np.nan, inplace=True)
df2[some_column].fillna(np.nan, inplace=True)
df = df.merge(df2, on=some_column)
# Works
If instead, I do
df[some_column].fillna(pd.NA, inplace=True)
then the error returns.
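For completeness, a minimal self-contained example (column names made up) that reproduces this on pandas 1.0.x and applies the same fillna(np.nan) workaround; newer pandas versions may behave differently:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, pd.NA, 3], 'a': [10, 20, 30]})   # 'key' ends up with object dtype
df2 = pd.DataFrame({'key': [1, 2, pd.NA], 'b': [40, 50, 60]})

# df.merge(df2, on='key')               # raises TypeError: boolean value of NA is ambiguous
df['key'].fillna(np.nan, inplace=True)  # replace pd.NA with np.nan first
df2['key'].fillna(np.nan, inplace=True)
print(df.merge(df2, on='key'))          # works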
This has to do with pd.NA being introduced in pandas 1.0.0 and how the pandas team decided it should work in a boolean context. Also, take into account that it is an experimental feature, hence it shouldn't be used for anything but experimenting:
Warning Experimental: the behaviour of pd.NA can still change without warning.
Another part of the pandas documentation, the section on working with missing values, is where I believe the reason and the answer you are looking for can be found:
NA in a boolean context:
Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value. The following raises an error: TypeError: boolean value of NA is ambiguous
Furthermore, it provides a valuable piece of advice:
"This also means that pd.NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be pd.NA. In such cases, isna() can be used to check for pd.NA or condition being pd.NA can be avoided, for example by filling missing values beforehand."
I decided that the pd.NA instances in my data were valid, and hence I needed to deal with them rather than fill them with fillna(). If you're in the same situation, convert the pd.NA into either True or False using pd.isna(val). Only you can decide whether the null should come out True or False, but here's a simple example:
val = pd.NA
if pd.isna(val):
    print('it is null')
else:
    print('it is not null')
returns: it is null
Then,
val = 7
if pd.isna(val):
    print('it is null')
else:
    print('it is not null')
returns: it is not null
Hope this helps others trying to get a definitive course of action (Celius's answer is accurate, but I wanted to provide actionable code for those struggling with this).

Pandas return DataFrame from apply function?

sdf = sdf['Name1'].apply(lambda x: tryLookup(x, tdf))
tryLookup is a function that currently takes a string, which is the value of the Name1 column in sdf. We map the function with apply over every row of the sdf DataFrame.
Instead of tryLookup returning just a string, is there a way for tryLookup to return a DataFrame that I want to merge with the sdf DataFrame? tryLookup has some extra information, and I want to include that in the results by adding them as new columns to all the rows in sdf.
So the return for tryLookup is as such:
return pd.Series({'BEST MATCH': bestMatch, 'SIMILARITY SCORE': humanScore})
I tried something such as
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)), left_index=True, right_index=True)
But that just throws
Traceback (most recent call last):
File "lookup.py", line 160, in <module>
main()
File "lookup.py", line 40, in main
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)), left_index=True, right_index=True)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4618, in merge
copy=copy, indicator=indicator)
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 58, in merge
copy=copy, indicator=indicator)
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 473, in __init__
'type {0}'.format(type(right)))
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
Any help would be great. Thanks.
Try converting the pd.Series to a dataframe with pandas.Series.to_frame (see the pandas documentation):
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)).to_frame(), left_index=True, right_index=True)
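For context, a runnable sketch of the whole pattern (the tryLookup body here is a made-up stand-in, since the real lookup logic isn't shown). In current pandas, Series.apply with a function that returns a pd.Series already expands into a DataFrame, which can then be merged back on the index:
import pandas as pd

def tryLookup(name, tdf):
    # Hypothetical stand-in for the real lookup logic
    bestMatch = name.upper()
    humanScore = float(len(name))
    return pd.Series({'BEST MATCH': bestMatch, 'SIMILARITY SCORE': humanScore})

sdf = pd.DataFrame({'Name1': ['alpha', 'beta', 'gamma']})
tdf = None  # placeholder for the lookup table the real tryLookup uses

extra = sdf['Name1'].apply(lambda x: tryLookup(x, tdf))  # a DataFrame, one row per Name1
sdf = sdf.merge(extra, left_index=True, right_index=True)
print(sdf)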

How to get n longest entries of DataFrame?

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())
The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:
domain_name domain_length
0 webmagnat.ro 12
1 nickelfreesolutions.com 23
2 scheepvaarttelefoongids.nl 26
3 tursan.net 10
4 plannersanonymous.com 21
domain_name object
domain_length float64
dtype: object
Traceback (most recent call last):
File "nlargest_test.py", line 9, in <module>
print(top_3.head())
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
result = result.compute()
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
return compute(self, **kwargs)[0]
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object
Traceback
---------
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
f = lambda df: df.nlargest(n, columns)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
return self._nsorted(columns, n, 'nlargest', keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
ser = getattr(self[columns[0]], method)(n, keep=keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
return algos.select_n(self, n=n, keep=keep, method='nlargest')
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
I'm confused, because I'm calling nlargest on the column which is of type float64 but still get this error saying it cannot be called on dtype object. Also this works fine in pandas. How can I get the n longest entries from a DataFrame?
I was helped by explicit type conversion:
df['column'].astype(str).astype(float).nlargest(5)
This is how my first data frame looks, and this is how the new data frame looks after getting the top 5 (screenshots omitted):
station_count.nlargest(5, 'count')
You have to call nlargest on a column with an int dtype, not a string dtype, so it can rank the values numerically. Pass the top n number first, followed by the name of the numeric column.
I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?
Pandas example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: df['y'] = df.x.map(len)
In [4]: df
Out[4]:
x y
0 a 1
1 bb 2
2 ccc 3
3 dddd 4
In [5]: df.nlargest(3, 'y')
Out[5]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Dask dataframe example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf['y'] = ddf.x.map(len)
In [6]: ddf.nlargest(3, 'y').compute()
Out[6]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Alternatively, perhaps this is just working now on the git master version?
You only need to change the type of the respective column to int or float using .astype().
For example, in your case:
top_3 = df['domain_length'].astype(float).nlargest(3)
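The likely root cause in the dask version is that the dtype dask inferred for the new column does not match what the partitions actually hold. If so, being explicit about the dtype when creating the column is another option; this is a sketch assuming current dask, where Series.map takes a meta argument and the .str accessor provides len():
import dask.dataframe as dd

df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])

# Tell dask explicitly that map(len) produces integers ...
df['domain_length'] = df.domain_name.map(len, meta=('domain_length', 'int64'))
# ... or use the string accessor, which already yields an integer column
# df['domain_length'] = df.domain_name.str.len()

print(df.nlargest(3, 'domain_length').compute())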
If you want to get the values with the most occurrences from a String type column, you may use value_counts() with nlargest(n), where n is the number of elements you want returned.
df['your_column'].value_counts().nlargest(3)
This will return the top 3 most frequent values from that column.
