SQL noob here. I am trying to feed a pandas dataframe with SQL data and explore the data with the following setup, but the output does not show all of the column headers. Shouldn't it be possible to display all column headers without having to open the database itself in SQLite Studio?
import pandas as pd
import sqlite3

conn = sqlite3.connect('hubway.db')

def run_query(query):
    return pd.read_sql_query(query, conn)

query = 'SELECT * FROM stations LIMIT 10;'
run_query(query)
gives:
id ... lng
0 3 ... -71.100812
1 4 ... -71.069616
2 5 ... -71.090179
3 6 ... -71.06514
4 7 ... -71.044624
and so on. I also tried using the cursor object directly, without the pandas dataframe:
crsr.execute('.schema stations')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
sqlite3.OperationalError: near ".": syntax error
Really not sure where to go from here. Is there any way to do this?
try:
pd.set_option('display.max_columns', 500)
by default, pandas will only show you a limited number of columns; you can raise that limit...
other parameters:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
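Putting it together with the question's setup, here is a small sketch (assuming the same hubway.db file): raising the display limits makes the full header row visible, and the cursor error happens because .schema is a dot-command of the sqlite3 command-line shell rather than SQL, so PRAGMA table_info() is the equivalent you can run through the Python API.
import sqlite3
import pandas as pd

conn = sqlite3.connect('hubway.db')

# Raise the display limits so the printed frame is not truncated to "..."
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

print(pd.read_sql_query('SELECT * FROM stations LIMIT 10;', conn))

# .schema only works in the sqlite3 shell; PRAGMA works through the API
crsr = conn.cursor()
for row in crsr.execute('PRAGMA table_info(stations);'):
    print(row)  # (cid, name, type, notnull, dflt_value, pk)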
Related
I have a pandas dataframe with a MultiIndex. How can I reference the indexes in a DuckDB query?
import duckdb
import pandas as pd
import numpy as np
df = pd.DataFrame({
'i1': np.arange(0, 100),
'i2': np.arange(0, 100),
'c': np.random.randint(0 , 10, 100)
}).set_index(['i1', 'i2'])
>>> duckdb.query('select sum(c) from df group by i1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Binder Error: Referenced column "i1" not found in FROM clause!
Candidate bindings: "df.c"
Not an answer (since I'm looking for one as well), but it may still help. I think DuckDB does not recognize the pandas index at all. If you do this:
conn = duckdb.connect()
rel = conn.from_df(df)
rel.create("a_table")
result = conn.execute("select * from a_table").fetch_df()
You will see that the result data frame only has the c column without i1 or i2. The same is true if you have just a simple index, not a multi-index. The workaround is to simply reset the index before you use the data frame in DuckDB.
See the official issue discussion as well: https://github.com/duckdb/duckdb/issues/1011
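A minimal sketch of that workaround, rebuilding the example from the question and resetting the index before handing the frame to DuckDB:
import duckdb
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i1': np.arange(0, 100),
    'i2': np.arange(0, 100),
    'c': np.random.randint(0, 10, 100),
}).set_index(['i1', 'i2'])

# Promote the index levels back to ordinary columns so DuckDB can bind them
df_flat = df.reset_index()

print(duckdb.query('SELECT i1, SUM(c) AS total FROM df_flat GROUP BY i1').to_df())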
I am trying to convert a SAS PROC TRANSPOSE statement to PySpark in Databricks.
With the following data as a sample:
data = [{"duns":1234, "finc stress":100,"ver":6.0},{"duns":1234, "finc stress":125,"ver":7.0},{"duns":1234, "finc stress":135,"ver":7.1},{"duns":12345, "finc stress":125,"ver":7.6}]
I would expect the result to have one row per duns, with a separate column for each ver value.
I tried using the pandas pivot_table() function with the following code however I ran into some performance issues with the size of the data:
tst = (df.pivot_table(index=['duns'], columns=['ver'], values='finc stress')
.add_prefix('ver')
.reset_index())
Is there a way to translate the PROC TRANSPOSE SAS logic to PySpark instead of using pandas?
I tried something like this but am getting an error:
tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
AssertionError: all exprs should be Column
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<command-2507760044487307> in <module>
4 df = pd.DataFrame(data) # pandas
5
----> 6 tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
7
8
/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
115 else:
116 # Columns
--> 117 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
118 jdf = self._jgd.agg(exprs[0]._jc,
119 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
AssertionError: all exprs should be Column
If you could help me out I would so appreciate it! Thank you so much.
I don't know how you create df from data but here is what I did:
import pyspark.pandas as ps
df = ps.DataFrame(data)
df['ver'] = df['ver'].astype('str')
Then your pandas code worked.
To use the PySpark approach directly (with pyspark.sql.functions imported as F), here is what I did:
sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))
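Since the question doesn't show how sparkdf is built, here is a self-contained sketch of that approach using the sample data from the question (on Databricks the spark session already exists, so the getOrCreate() line is just for completeness):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [{"duns": 1234, "finc stress": 100, "ver": 6.0},
        {"duns": 1234, "finc stress": 125, "ver": 7.0},
        {"duns": 1234, "finc stress": 135, "ver": 7.1},
        {"duns": 12345, "finc stress": 125, "ver": 7.6}]

sparkdf = spark.createDataFrame(data)

# Pivot ver into columns, keeping the first finc stress value per (duns, ver) pair
wide = sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))
wide.show()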
I am writing code to analyze some data and want to create a dataframe. How do I set it up so that it runs successfully?
This is for data analysis, and I would like to create a dataframe that can categorize the data into different grades, such as A.
Here is the code I wrote:
import analyze_lc_Feb2update
from imp import reload
analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
df = analyze_lc_Feb2update.create_df()
df.shape
df_new = df[df.grade=='A']
df_new.shape
df.columns
df.int_rate.head(5)
df.int_rate.tail(5)
df.int_rate.dtype
df.term.dtype
df_new = df[df.grade =='A']
df_new.shape
output:
TypeError Traceback (most recent call last)
<ipython-input-3-7079435f776f> in <module>()
2 from imp import reload
3 analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
4 df = analyze_lc_Feb2update.create_df()
5 df.shape
6 df_new = df[df.grade=='A']
TypeError: create_df() missing 1 required positional argument: 'grade'
Based on what was provided, I guess your problem is here:
from imp import reload
analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
df = analyze_lc_Feb2update.create_df()
This looks like a custom library you are trying to use, whose .create_df() method requires a positional argument "grade"; you would need to do something like:
df = analyze_lc_Feb2update.create_df(grade="blah")
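Without the library's source this is only a guess, but a minimal sketch of the corrected flow might look like this (the value 'A' passed to create_df() is purely illustrative):
import analyze_lc_Feb2update
from imp import reload  # importlib.reload on Python 3.4+

analyze_lc_Feb2update = reload(analyze_lc_Feb2update)

# Hypothetical: pass whatever grade value create_df() actually expects
df = analyze_lc_Feb2update.create_df(grade='A')

df.shape
df_new = df[df.grade == 'A']
df_new.shape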
I am looking to find the total number of players by counting the unique screen names.
# Dependencies
import pandas as pd
# Save path to data set in a variable
df = "purchase_data.json"
# Use Pandas to read data
data_file_pd = pd.read_json(df)
data_file_pd.head()
# Find total numbers of players
player_count = len(df['SN'].unique())
TypeError Traceback (most recent call last)
<ipython-input-26-94bf0ee04d7b> in <module>()
1 # Find total numbers of players
----> 2 player_count = len(df['SN'].unique())
TypeError: string indices must be integers
Without access to the original data, this is guesswork, but I think you want something like this:
import pandas as pd

# Path to the data set
json_data = "purchase_data.json"

# Convert the JSON data to a pandas dataframe
df = pd.read_json(json_data)
df.head()

# Count the unique screen names
player_count = len(df['SN'].unique())
Simply put: if you are getting this error while connecting to a schema, close the web browser, kill the pgAdmin server, and restart it. It should then work perfectly.
I'm currently querying data from MS SQL Server 2008 based on user input. However, I am getting an error when I try to get the five-number summary using the describe() function.
import pyodbc
import numpy as np
import pandas.io.sql as sql
import pandas
print "What Part Number will you examine?"
PartN = raw_input()
conn = pyodbc.connect('my connection info')
curs = conn.cursor()
sqlr = """SELECT partmadeperhour FROM Completions WHERE PartNumber = ?
AND endtime > '2012-12-31 23:59:00' ORDER BY partmadeperhour"""
q = curs.execute(sqlr,[PartN]).fetchall()
df = pandas.DataFrame(q, columns =['rate'])
print df
columnnames = list(df.columns.values)
print columnnames
df['rate'].describe()
My data frame looks something like this
rate
0 [0.25]
1 [0.67]
2 [0.93]
... ...
1474 [5400.00]
And I am getting the following return and error:
[1475 rows x 1 columns]
['rate']
rate object
dtype: object
Traceback (most recent call last):
File "newr.py", line 30, in <module>
df['rate'].describe()
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 4034, in describe
return describe_1d(self, percentiles)
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 4031, in describe_1d
return describe_categorical_1d(data)
File "C:\Python27\lib\site-packages\pandas\core\generic.py",
line 4007, in describe_categorical_1d
objcounts = data.value_counts()
File "C:\Python27\lib\site-packages\pandas\core\base.py", line 433, in value_counts
normalize=normalize, bins=bins, dropna=dropna)
File "C:\Python27\lib\site-packages\pandas\core\algorithms.py", line 245, in value_counts
keys, counts = htable.value_count_object(values, mask)
File "pandas\hashtable.pyx", line 983, in pandas.hashtable.value_count_object
(pandas\hashtable.c:17616)
File "pandas\hashtable.pyx", line 994, in pandas.hashtable.value_count_object
(pandas\hashtable.c:17353)
TypeError: unhashable type: 'pyodbc.Row'
I understand that I need to convert the data in the dataframe to a different type, as it's currently an object, but I am not sure how to convert it to a float.
Any help is appreciated
Ensure you're using pandas 0.14 or later (read_sql_query was added in 0.14):
>>> import pandas
>>> pandas.__version__
'0.14.1'
Use pandas.read_sql_query to populate the dataframe directly, passing the query string and pyodbc connection. Note that the column alias rate is added to the T-SQL query, since pandas.read_sql_query doesn't support passing a list or dictionary of column names:
...
>>> sql = "select 0.25 union select 0.67 union select 0.93 as rate"
>>> df = pandas.read_sql_query(sql, connection)
>>> df
rate
0 0.25
1 0.67
2 0.93
>>> df['rate'].describe()
count 3.000000
mean 0.616667
std 0.343123
min 0.250000
25% 0.460000
50% 0.670000
75% 0.800000
max 0.930000
dtype: float64
The parameter values in your original query can be supplied using the params parameter of pandas.read_sql_query.
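For example, here is a sketch of the original query using that params argument, reusing PartN and the pyodbc connection from the question (the connection string is still a placeholder) and aliasing the column to rate directly in the SQL:
import pyodbc
import pandas

conn = pyodbc.connect('my connection info')

sqlr = """SELECT partmadeperhour AS rate FROM Completions WHERE PartNumber = ?
          AND endtime > '2012-12-31 23:59:00' ORDER BY partmadeperhour"""

PartN = raw_input()
df = pandas.read_sql_query(sqlr, conn, params=[PartN])
print(df['rate'].describe())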
Instead of this
q = curs.execute(sqlr,[PartN]).fetchall()
df = pandas.DataFrame(q, columns =['rate'])
can you try
df = sql.read_frame(sqlr, conn)  # reads the query result directly into a dataframe