DuckDB Group by Pandas Index - python

I have a pandas dataframe with a multiindex. How can I reference the indexes in a duck db query?
import duckdb
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'i1': np.arange(0, 100),
    'i2': np.arange(0, 100),
    'c': np.random.randint(0, 10, 100)
}).set_index(['i1', 'i2'])
>>> duckdb.query('select sum(c) from df group by i1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Binder Error: Referenced column "i1" not found in FROM clause!
Candidate bindings: "df.c"

Not an answer (since I'm looking for one as well), but it may still help: I think DuckDB does not recognize any index. If you do this (where conn is a DuckDB connection):
rel = conn.from_df(df)
rel.create("a_table")
result = conn.execute("select * from a_table").fetch_df()
You will see that the result data frame only has the c column without i1 or i2. The same is true if you have just a simple index, not a multi-index. The workaround is to simply reset the index before you use the data frame in DuckDB.
See the official issue discussion as well: https://github.com/duckdb/duckdb/issues/1011
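A minimal sketch of that workaround (only pandas is needed to run it; the DuckDB query itself is left as a comment, and it assumes DuckDB's replacement scan picks up local DataFrame variables by name, as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i1': np.arange(0, 100),
    'i2': np.arange(0, 100),
    'c': np.random.randint(0, 10, 100),
}).set_index(['i1', 'i2'])

# Index levels are not visible to DuckDB's DataFrame scan; promote them
# back to ordinary columns first.
df2 = df.reset_index()
print(list(df2.columns))  # ['i1', 'i2', 'c']

# With the index reset, the original query binds (requires the duckdb package):
# import duckdb
# duckdb.query('SELECT i1, SUM(c) FROM df2 GROUP BY i1').to_df()
```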

Related

Merge datasets using pandas

Below I have code which was provided to me in order to join 2 datasets.
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
df= pd.read_csv("student/student-por.csv")
ds= pd.read_csv("student/student-mat.csv")
print("before merge")
print(df)
print(ds)
print("After merging:")
dq = pd.merge(df,ds,by=c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"))
print(dq)
I get this error:
Traceback (most recent call last):
File "/Users/PycharmProjects/datamining/main.py", line 15, in <module>
dq = pd.merge(df, ds,by=c ("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"))
NameError: name 'c' is not defined
Any help would be great, I've tried messing about with it for a while. I believe the 'by=c' is the issue.
Thanks
Hi 👋🏻 Hope you are doing well!
The error is happening because of the c symbol in the arguments of the merge function: by = c(...) is R syntax, and Python has no function named c. Also, the pandas merge function has a different signature: it doesn't have a by argument; the equivalent is on, which accepts a list of column names 🙂 So in summary it should be something similar to this:
import pandas as pd
df = pd.read_csv("student/student-por.csv")
ds = pd.read_csv("student/student-mat.csv")
print("Before merge.")
print(df)
print(ds)
print("After merge.")
dq = pd.merge(
    left=df,
    right=ds,
    on=[
        "school",
        "sex",
        "age",
        "address",
        "famsize",
        "Pstatus",
        "Medu",
        "Fedu",
        "Mjob",
        "Fjob",
        "reason",
        "nursery",
        "internet",
    ],
)
print(dq)
Docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
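As a self-contained illustration of the on= keyword (the toy frames below are invented stand-ins, not the real student datasets):

```python
import pandas as pd

# Hypothetical stand-ins for student-por.csv / student-mat.csv
por = pd.DataFrame({'school': ['GP', 'MS'], 'sex': ['F', 'M'], 'G3_por': [12, 15]})
mat = pd.DataFrame({'school': ['GP', 'MS'], 'sex': ['F', 'M'], 'G3_mat': [10, 14]})

# Inner join (the default) on the shared key columns
dq = pd.merge(por, mat, on=['school', 'sex'])
print(dq)
```

Rows match only where every column listed in on= agrees, which is why the original code lists all thirteen shared columns.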

DataFrame.lookup requires unique index and columns with a recent version of Pandas

I am working with python3.7, and I am facing an issue with a recent version of pandas.
Here is my code.
import pandas as pd
import numpy as np
data = {'col_1': [9087.6000, 9135.8000, np.nan, 9102.1000],
        'col_2': [0.1648, 0.1649, '', 5.3379],
        'col_nan': [np.nan, np.nan, np.nan, np.nan],
        'col_name': ['col_nan', 'col_1', 'col_2', 'col_nan']}
df = pd.DataFrame(data, index=[101, 102, 102, 104])
col_lookup = 'results'
col_result = 'col_name'
df[col_lookup] = df.lookup(df.index, df[col_result])
The code works fine with pandas version 1.0.3, but
when I try with version 1.1.1 the following error occurs:
"ValueError: DataFrame.lookup requires unique index and columns"
The dataframe indeed includes a duplication of the index "102".
For different reasons, I have to work with version 1.1.1 of pandas. Is there a solution with the "lookup" command to support index duplication with this version of pandas?
Thanks in advance for your help.
Put a unique index in place then restore the old index...
import pandas as pd
import numpy as np
data = {'col_1': [9087.6000, 9135.8000, np.nan, 9102.1000],
        'col_2': [0.1648, 0.1649, '', 5.3379],
        'col_nan': [np.nan, np.nan, np.nan, np.nan],
        'col_name': ['col_nan', 'col_1', 'col_2', 'col_nan']}
df = pd.DataFrame(data, index=[101, 102, 102, 104])
col_lookup = 'results'
col_result = 'col_name'
df.reset_index(inplace=True)
df[col_lookup] = df.lookup(df.index, df[col_result])
df = df.set_index(df["index"]).drop(columns="index")
Allowing a non-unique index was considered a bug: Github link
The lookup method in pandas 1.1.1 does not allow you to pass a non-unique index as an input argument. The following code was added at the beginning of the lookup method in "frame.py", which for me is at line 3836 in:
C:\Users\Sajad\AppData\Local\Programs\Python\Python38\Lib\site-packages\pandas\core\frame.py
if not (self.index.is_unique and self.columns.is_unique):
    # GH#33041
    raise ValueError("DataFrame.lookup requires unique index and columns")
However, if this error handler didn't exist, the method would continue into a for loop. Substituting the last line of your code with this equivalent loop gives you the same result as previous pandas versions:
result = np.empty(len(df.index), dtype="O")
for i, (r, c) in enumerate(zip(df.index, df[col_result])):
    result[i] = df._get_value(r, c)
df[col_lookup] = result
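Since DataFrame.lookup was later deprecated altogether, a vectorized alternative that avoids lookup entirely, and does not care about duplicate index labels, is plain NumPy positional indexing. A sketch on the question's data:

```python
import numpy as np
import pandas as pd

data = {'col_1': [9087.6, 9135.8, np.nan, 9102.1],
        'col_2': [0.1648, 0.1649, '', 5.3379],
        'col_nan': [np.nan, np.nan, np.nan, np.nan],
        'col_name': ['col_nan', 'col_1', 'col_2', 'col_nan']}
df = pd.DataFrame(data, index=[101, 102, 102, 104])

# Map each row's target column name to its positional index, then pick
# one value per row with fancy indexing (positions are always unique,
# so duplicate index labels are irrelevant).
col_pos = df.columns.get_indexer(df['col_name'])
df['results'] = df.to_numpy()[np.arange(len(df)), col_pos]
print(df['results'].tolist())  # [nan, 9135.8, '', nan]
```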

Not all columns displayed using pandas to display sqlite3 query results

SQL noob here. I am trying to feed a pandas dataframe with SQL data and explore it with the following setup, but the output does not show all column headers. Shouldn't it be possible to display all column headers without having to open the database itself in SQLite Studio?
import pandas as pd
import sqlite3
conn = sqlite3.connect('hubway.db')
def run_query(query):
    return pd.read_sql_query(query, conn)
query = 'SELECT * FROM stations LIMIT 10'
run_query(query)
gives:
id ... lng
0 3 ... -71.100812
1 4 ... -71.069616
2 5 ... -71.090179
3 6 ... -71.06514
4 7 ... -71.044624
and so on. I also tried to use the cursor object, without the pandas dataframe:
crsr.execute('.schema stations')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
sqlite3.OperationalError: near ".": syntax error
Really not sure where to go from here, is there any way to do this?
Try:
pd.set_option('display.max_columns', 500)
By default, pandas shows only a limited number of columns; you can raise that value (or pass None for no limit).
Other parameters:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
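The second traceback in the question is a separate issue: .schema is a dot-command of the sqlite3 command-line shell, not SQL, so the Python driver rejects it. The SQL-level equivalent is PRAGMA table_info; here is a sketch against a throwaway in-memory table (the column names are invented, since the real hubway.db schema isn't shown):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical stand-in for the stations table
conn.execute('CREATE TABLE stations (id INTEGER, station TEXT, lat REAL, lng REAL)')

# PRAGMA table_info returns one row per column; field 1 is the column name
cols = [row[1] for row in conn.execute('PRAGMA table_info(stations)')]
print(cols)  # ['id', 'station', 'lat', 'lng']
```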

How to print unique values of a column in a group using Pandas?

I am trying to print unique values of the column ADO_name in my data set. Following is the example data set and code I tried (which gives error):
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

data = {'ADO_name': ['car1', 'car1', 'car1', 'car2', 'car2', 'car2'],
        'Time_sec': [0, 1, 2, 0, 1, 2],
        'Speed.kph': [50, 51, 52, 0, 0, 52]}
dframe = DataFrame(data)
for ado in dframe.groupby('ADO_name'):
    ado_name = ado["ADO_name"]
    adoID = ado_name.unique()
    print(adoID)
Traceback (most recent call last):
File "C:\Users\Quinton\AppData\Local\Temp\Rtmp88ifpB\chunk-code-188c39fc7de8.txt", line 14, in <module>
ado_name = ado["ADO_name"]
TypeError: tuple indices must be integers or slices, not str
What am I doing wrong and how to fix it? Please help.
You can do: dframe["ADO_name"].unique().
If you want to keep the groupby, note that iterating over a groupby yields (name, group) tuples, which is why ado["ADO_name"] raises a tuple-indexing TypeError. Here is what to correct in your code (index into the group, the second tuple element):
for ado in dframe.groupby('ADO_name'):
    ado_name = ado[1]["ADO_name"]
    adoID = ado_name.unique()
    print(adoID)
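Tuple unpacking in the for statement makes the (name, group) structure explicit; a sketch on a trimmed version of the example data:

```python
import pandas as pd

dframe = pd.DataFrame({'ADO_name': ['car1', 'car1', 'car2'],
                       'Speed.kph': [50, 51, 0]})

# groupby iteration yields (group key, sub-DataFrame) pairs
unique_names = []
for name, group in dframe.groupby('ADO_name'):
    unique_names.extend(group['ADO_name'].unique())
print(unique_names)  # ['car1', 'car2']
```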

Apply SequenceMatcher to DataFrame

I'm new to pandas and Python in general, so I'm hoping someone can help me with this simple question. I have a large dataframe m with several million rows and seven columns, including an ITEM_NAME_x and ITEM_NAME_y. I want to compare ITEM_NAME_x and ITEM_NAME_y using SequenceMatcher.ratio(), and add a new column to the dataframe with the result.
I've tried to come at this several ways, but keep running into errors:
>>> m.apply(SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio(), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4416, in apply
return self._apply_standard(f, axis)
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
raise e
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4480, in _apply_standard
results[i] = func(v)
TypeError: ("'float' object is not callable", 'occurred at index 0')
Could someone help me fix this?
You have to apply a function, not a float. The expression SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio() is a float, evaluated once before apply is even called.
Working demo (a draft):
import difflib
from functools import partial
import pandas as pd

def apply_sm(s, c1, c2):
    return difflib.SequenceMatcher(None, s[c1], s[c2]).ratio()

df = pd.DataFrame({'A': {1: 'one'}, 'B': {1: 'two'}})
print(df.apply(partial(apply_sm, c1='A', c2='B'), axis=1))
output:
1 0.333333
dtype: float64
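The same idea works with a lambda instead of partial; the two-row frame below is invented data shaped like the question's columns:

```python
import difflib
import pandas as pd

m = pd.DataFrame({'ITEM_NAME_x': ['widget', 'gadget'],
                  'ITEM_NAME_y': ['widgets', 'gizmo']})

# Build a fresh SequenceMatcher per row inside the applied function,
# so .ratio() is evaluated once per row rather than once up front.
m['ratio'] = m.apply(
    lambda row: difflib.SequenceMatcher(
        None, str(row['ITEM_NAME_x']), str(row['ITEM_NAME_y'])).ratio(),
    axis=1)
print(m['ratio'].tolist())
```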
