How to insert a Pandas DataFrame into Cassandra? - python

I have a dataframe as below:
df
date time open high low last
01-01-2017 11:00:00 37 45 36 42
01-01-2017 11:23:00 36 43 33 38
01-01-2017 12:00:00 45 55 35 43
....
I want to write it into Cassandra. It's essentially a bulk upload after processing the data in Python.
The schema for cassandra is as below:
CREATE TABLE ks.table1(date text, time text, open float, high float, low
float, last float, PRIMARY KEY(date, time))
To insert a single row into Cassandra we can use cassandra-driver in Python, but I couldn't find any details about uploading an entire dataframe.
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()
session.execute(
"""
INSERT INTO ks.table1 (date,time,open,high,low,last)
VALUES ('01-01-2017', '11:00:00', 37, 45, 36, 42)
""")
P.S.: A similar question has been asked earlier, but it doesn't answer my question.

I was facing this problem too, but I figured out that even uploading millions of rows (19 million, to be exact) into Cassandra didn't take much time.
Coming to your problem, you can use the Cassandra bulk loader to get your job done.
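One common way to feed a bulk load is to dump the DataFrame to CSV first. A minimal sketch, assuming the question's df and ks.table1 schema (the file name ohlc.csv is just an illustration; cqlsh's COPY is shown as one possible loader):
# write the DataFrame to CSV in the Cassandra table's column order
df.to_csv('ohlc.csv', index=False, columns=['date', 'time', 'open', 'high', 'low', 'last'])
Then, for example from cqlsh:
COPY ks.table1 (date, time, open, high, low, last) FROM 'ohlc.csv' WITH HEADER = TRUE;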
EDIT 1:
You can use prepared statements to upload data into the Cassandra table while iterating through the DataFrame.
from cassandra.cluster import Cluster
cluster = Cluster([ip_address])
session = cluster.connect(keyspace_name)
query = "INSERT INTO data(date,time,open,high,low,last) VALUES (?,?,?,?,?,?)"
prepared = session.prepare(query)
The "?" placeholders are bound with your values at execution time.
for item in dataFrame.itertuples(index=False):
    session.execute(prepared, (item.date, item.time, item.open, item.high, item.low, item.last))
or
for item in dataFrame.itertuples(index=False):
    session.execute(prepared, (item[0], item[1], item[2], item[3], item[4], item[5]))
What I mean is: use a for loop to extract the data and upload it with session.execute(). The driver documentation has more info on prepared statements.
Hope this helps..
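If the plain loop is too slow, the driver also ships a concurrent execution helper. A minimal sketch reusing the prepared statement, session and dataFrame from above (the concurrency value of 50 is just an assumption to tune):
from cassandra.concurrent import execute_concurrent_with_args

# build one parameter tuple per row and let the driver run the prepared INSERT concurrently
params = [
    (item.date, item.time, item.open, item.high, item.low, item.last)
    for item in dataFrame.itertuples(index=False)
]
execute_concurrent_with_args(session, prepared, params, concurrency=50)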

A nice option is to use batches. First split the df into even partitions (thanks to Python/Pandas - partitioning a pandas DataFrame in 10 disjoint, equally-sized subsets) and then write each partition as a batch into Cassandra. Batch size is limited by the Cassandra (cassandra.yaml) setting:
batch_size_fail_threshold_in_kb: 50
The code for batch insert of Pandas df:
import numpy as np
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement

CASSANDRA_PARTITION_NUM = 1500

def write_to_cassandra(df):
    cassandra_cluster = Cluster(['ip'])
    session = cassandra_cluster.connect('keyspace')
    prepared_query = session.prepare('INSERT INTO users(id, name) VALUES (?,?)')
    for partition in split_to_partitions(df, CASSANDRA_PARTITION_NUM):
        batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
        for index, item in partition.iterrows():
            batch.add(prepared_query, (item['id'], item['name']))
        session.execute(batch)

def split_to_partitions(df, partition_number):
    # shuffle row indices, then deal them round-robin into partition_number groups
    permuted_indices = np.random.permutation(len(df))
    partitions = []
    for i in range(partition_number):
        partitions.append(df.iloc[permuted_indices[i::partition_number]])
    return partitions
Update:
Only do this when every statement in a batch targets the same Cassandra partition, for example by grouping the rows on the partition key first.
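A minimal sketch of that, assuming the question's ks.table1 schema where date is the partition key, and reusing a connected session:
# group rows by the partition key so each batch only touches one Cassandra partition
insert = session.prepare('INSERT INTO ks.table1 (date, time, open, high, low, last) VALUES (?,?,?,?,?,?)')
for date_value, rows in df.groupby('date'):
    batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
    for _, r in rows.iterrows():
        batch.add(insert, (r['date'], r['time'], r['open'], r['high'], r['low'], r['last']))
    session.execute(batch)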

Related

Using SQLAlchemy ORM for Python: in my REST API, how can I aggregate resources by hour and by day?

I have a MySQL db table that looks like:
time_slot | sales
2022-08-26T01:00:00 | 100
2022-08-26T01:06:40 | 103
...
I am serving the data via an API to a client. The FE engineer wants the data aggregated by hour for each day within the query period (at the moment it's a week). So he gives from and to and wants the sum of sales within each hour of each day as a nested array. Because it's a week, it's a 7-element array, where each element is an array containing all the hourly slots for which we have data.
[
[
"07:00": 567,
"08:00": 657,
....
],
[], [], ...
]
The API is built in Python. There is an ORM (SQLAlchemy) model for the data that looks like:
class HourlyData(Base):
    hour = Column(DateTime)
    sales = Column(Float)
I can query the hourly data and then aggregate it into a list of lists in Python memory. But to save compute time (and conceptual complexity), I would like to run the aggregation through ORM queries.
What is the SQLAlchemy syntax to achieve this?
The below should get you started, where the solution is a mix of SQL and Python using existing tools, and it should work with any RDBMS.
Assumed model definition, and imports:
from itertools import groupby

from sqlalchemy import Column, DateTime, Float, Integer, func

class TimelyData(Base):
    __tablename__ = "timely_data"
    id = Column(Integer, primary_key=True)
    time_slot = Column(DateTime)
    sales = Column(Float)
We get the data from the DB aggregated enough for us to group properly
# below works for PostgreSQL (tested), and should work for MySQL as well
# see: https://mode.com/blog/date-trunc-sql-timestamp-function-count-on
col_hour = func.date_trunc("hour", TimelyData.time_slot)
q = (
    session.query(
        col_hour.label("hour"),
        func.sum(TimelyData.sales).label("total_sales"),
    )
    .group_by(col_hour)
    .order_by(col_hour)  # this is important for the `groupby` call later on
)
Group the results by date again using Python's groupby:
groups = groupby(q.all(), key=lambda row: row.hour.date())
# truncate and format the final list as required
data = [
    [(f"{row.hour:%H}:00", int(row.total_sales)) for row in rows]
    for _, rows in groups
]
Example result:
[[["01:00", 201], ["02:00", 102]], [["01:00", 103]], [["08:00", 104]]]
I am not familiar with MySQL, but with PostgreSQL one could implement it all at the DB level thanks to the extensive JSON support. However, I would argue that the readability of that implementation would not improve, and neither would the speed, assuming we get at most 168 rows (= 7 days x 24 hours) from the database.
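For completeness, a rough, untested sketch of that PostgreSQL-only variant, pushing the per-day JSON arrays into the query itself (the subquery split is needed because aggregates cannot be nested; labels and the to_char format are assumptions):
from sqlalchemy import Date, cast, func

# hour-level totals first
col_hour = func.date_trunc("hour", TimelyData.time_slot)
hourly = (
    session.query(
        col_hour.label("hour"),
        func.sum(TimelyData.sales).label("total_sales"),
    )
    .group_by(col_hour)
    .subquery()
)
# then collapse each day's hours into a JSON array with json_agg
col_day = cast(hourly.c.hour, Date)
daily = (
    session.query(
        col_day.label("day"),
        func.json_agg(
            func.json_build_array(func.to_char(hourly.c.hour, "HH24:MI"), hourly.c.total_sales)
        ).label("hours"),
    )
    .group_by(col_day)
    .order_by(col_day)
)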

Inserting Records To Delta Table Through Databricks

I want to insert 100,000 records into a Delta table using Databricks. I am trying to insert the data using a simple for loop, something like:
revision_date = '01/04/2022'
for i in range(0, 100000):
    spark.sql(f"insert into db.delta_table_name values ('Class1', date_add(to_date('{revision_date}', 'dd/MM/yyyy'), {i}))")
The problem is, it takes awfully long to insert data using INSERT statements in Databricks. It took 5+ hours to complete. Can anyone suggest an alternative or a solution for this problem in Databricks?
My cluster configuration is 168 GB, 24 cores, DBR 9.1 LTS, Spark 3.1.2.
Looping through an enormous number of INSERT operations on a Delta table costs a lot because every single INSERT command creates a new transaction log entry. You can read more in the Delta Lake docs.
Instead, it is better to build the whole Spark dataframe first and then execute just one WRITE operation to insert the data into the Delta table. The example code below finishes in less than a minute.
from pyspark.sql.functions import expr, row_number, lit, to_date, date_add
from pyspark.sql.window import Window
columns = ['col1']
rows = [['Class1']]
revision_date = '01/04/2022'
# just create a one record dataframe
df = spark.createDataFrame(rows, columns)
# duplicate to 100,000 records
df = df.withColumn('col1', expr('explode(array_repeat(col1,100000))'))
# create date column
df = df.withColumn('revision_date', lit(revision_date))
df = df.withColumn('revision_date', to_date('revision_date', 'dd/MM/yyyy'))
# create sequence column
w = Window().orderBy(lit('X'))
df = df.withColumn("col2", row_number().over(w))
# add col2 days to revision_date
df = df.withColumn("revision_date", expr("date_add(revision_date, col2)"))
# drop unused column
df = df.drop("col2")
# write to the delta table location
df.write.format('delta').mode('overwrite').save('/location/of/your/delta/table')
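If the target is the metastore table from the question rather than a path, the same single write can go through saveAsTable; append is assumed here to mimic the INSERT semantics:
# append the generated rows to the existing Delta table registered in the metastore
df.write.format('delta').mode('append').saveAsTable('db.delta_table_name')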

pd.read_sql method to count number of rows in a large Access database

I am trying to read the number of rows in a large Access database and I am trying to find the most efficient method. Here is my code:
import pyodbc

driver = 'access driver as string'
DatabaseLink = 'access database link as string'
Name = 'access table name as string'
conn = pyodbc.connect(r'Driver={' + driver + '};DBQ=' + DatabaseLink + ';')
cursor = conn.cursor()
AccessSize = cursor.execute('SELECT COUNT(1) FROM ' + Name).fetchone()[0]
conn.close()
This works, and AccessSize does give me an integer with the number of rows in the table; however, it takes far too long to compute (my database has over 2 million rows and 15 columns).
I have attempted to read the data through pd.read_sql with the chunksize functionality, looping through and summing the length of each chunk, but this also takes long. I have also attempted .fetchall in the cursor execute section, but the speed is similar to .fetchone.
I would have thought there would be a faster method to quickly calculate the length of the table, as I don't require the entire table to be read. My thought is to find the index value of the last row, as this is essentially the number of rows, but I am unsure how to do this.
Thanks
From a comment on the question:
Unfortunately the database doesn't have a suitable keys or indexes in any of its columns.
Then you can't expect good performance from the database because every SELECT will be a table scan.
I have an Access database on a network share. It contains a single table with 1 million rows and absolutely no indexes. The Access database file itself is 42 MiB. When I do
from time import time

t0 = time()
df = pd.read_sql_query("SELECT COUNT(*) AS n FROM Table1", cnxn)
print(f'{time() - t0} seconds')
it takes 75 seconds and generates 45 MiB of network traffic. Simply adding a primary key to the table increases the file size to 48 MiB, but the same code takes 10 seconds and generates 7 MiB of network traffic.
TL;DR: Add a primary key to the table or continue to suffer from poor performance.
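As a sketch of that fix over ODBC, assuming the table already has a unique numeric column named ID (otherwise add one first) and reusing the driver/DatabaseLink/Name variables from the question:
import pyodbc

conn = pyodbc.connect(r'Driver={' + driver + '};DBQ=' + DatabaseLink + ';')
crsr = conn.cursor()
# add a primary key so COUNT(*) no longer forces a full table scan across the network
crsr.execute('ALTER TABLE ' + Name + ' ADD CONSTRAINT PK_' + Name + ' PRIMARY KEY (ID)')
conn.commit()
conn.close()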
2 million rows should not take that long. I have used pd.read_sql(sql, con) like this:
con = connection
sql = """ my sql statement
here"""
table = pd.read_sql(sql=sql, con=con)
Are you doing something different?
In my case I am using a DB2 database; maybe that is why it is faster.

Importing database takes a lot of time

I am trying to import a table that contains 81,462 rows into a dataframe using the following code:
import pandas as pd
import pyodbc

sql_conn = pyodbc.connect('DRIVER={SQL Server}; SERVER=server.database.windows.net; DATABASE=server_dev; uid=user; pwd=pw')
query = "select * from product inner join brand on Product.BrandId = Brand.BrandId"
df = pd.read_sql(query, sql_conn)
The whole process takes a very long time. I am already 30 minutes in and it's still processing. I assume this is not quite normal - so how else should I import it so that the processing time is quicker?
Thanks to @RomanPerekhrest. FETCH NEXT imported everything within 1-2 minutes:
SELECT product.Name, brand.Name as BrandName, description, size FROM Product inner join brand on product.brandid=brand.brandid ORDER BY Name OFFSET 1 ROWS FETCH NEXT 80000 ROWS ONLY
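For reference, a minimal sketch of reading the table page by page with OFFSET/FETCH and concatenating the chunks, reusing sql_conn from the question (the page size of 20,000 is just an assumption to tune):
import pandas as pd

page_size = 20000
offset = 0
chunks = []
while True:
    q = ("SELECT product.Name, brand.Name AS BrandName, description, size "
         "FROM Product INNER JOIN brand ON product.brandid = brand.brandid "
         f"ORDER BY product.Name OFFSET {offset} ROWS FETCH NEXT {page_size} ROWS ONLY")
    chunk = pd.read_sql(q, sql_conn)
    if chunk.empty:
        break
    chunks.append(chunk)
    offset += page_size
df = pd.concat(chunks, ignore_index=True)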

How to select data from SQL Server based on data available in pandas data frame?

I have a list of data in one of my pandas dataframe columns for which I want to query a SQL Server database. Is there any way I can query a SQL Server DB based on data I have in a pandas dataframe?
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in", where the SQL can query the DB based on the data available in the array; I was trying to find something similar.
Thanks.
If you are working with a tiny DataFrame, then the easiest way would be to generate a corresponding SQL query:
In [8]: df
Out[8]:
id val
0 1 21
1 3 111
2 5 34
3 12 76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DFs, as the generated query (and/or the list of bind variables) might be too long either for pandas or for SQL Server, or even for both.
For bigger DFs I would recommend writing your pandas DF to a SQL Server table and then using a SQL subquery to filter the needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False, if_exists='replace')
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn)
This is a very familiar scenario, and one can use the code below to query SQL Server using a very large pandas dataframe. The parameter n needs to be tuned based on your SQL Server memory. For me, n=25000 worked.
import pandas as pd

n = 25000  # chunk row size
# big_frame dataframe divided into smaller chunks of n rows each
list_df = [big_frame[i:i + n] for i in range(0, big_frame.shape[0], n)]
# collect the result of each chunked query here
frames = []
# print total no. of iterations
print("Total Iterations:", len(list_df))
for i, temp_frame in enumerate(list_df):
    print("Iteration :", i)
    testList = temp_frame['customer_no']
    # pass a smaller chunk of data to SQL (here, a tuple of customer numbers);
    # SQL_Query is your own helper that runs the parameterised query
    temp_DF = SQL_Query(tuple(testList))
    print(temp_DF.shape[0])
    frames.append(temp_DF)
# combine all the data retrieved from SQL into big_frame_2
big_frame_2 = pd.concat(frames, ignore_index=True)
