Writing a lot of data to Cassandra in one query - Python

I have made a program that gets data from one of my Cassandra tables and queries the Twitter API to get the followers and friends of a user. I save all the ids in a set and then, once I have all the followers/friends, I write them to Cassandra.
The problem is that one of the users has 1.24M followers, and when I executed this code the size of the set caused a write error in Cassandra.
def get_data(tweepy_function, author_id, author_username, session):
    if tweepy_function == "followers":
        followers = set()
        for follower_id in tweepy.Cursor(API.followers_ids, id=author_id, count=5000).items():
            if len(followers) % 5000 == 0 and len(followers) != 0:
                print("Collected followers: ", len(followers))
            followers.add(follower_id)
        query = "INSERT INTO {0} (node_id, screen_name, centrality, follower_ids) VALUES ({1}, {2}, {3}, {4})"\
            .format("network", author_id, author_username, 0.0, followers)
        session.execute(query)
    if tweepy_function == "friends":
        friends = set()
        for friend_id in tweepy.Cursor(API.friends_ids, id=author_id, count=5000).items():
            if len(friends) % 5000 == 0 and len(friends) != 0:
                print("Collected friends: ", len(friends))
            friends.add(friend_id)
        query = "INSERT INTO {0} (node_id, screen_name, centrality, friend_ids) VALUES ({1}, {2}, {3}, {4})"\
            .format("network", author_id, author_username, 0.0, friends)
        session.execute(query)
As requested, here is my schema:
table = """CREATE TABLE IF NOT EXISTS
{0} (
node_id bigint ,
screen_name text,
last_tweets set<text>,
follower_ids set<bigint>,
friend_ids set<bigint>,
centrality float,
PRIMARY KEY (node_id))
""".format(table_name)
Why did I get a write error? How can I prevent it? Is this a good way to save data in Cassandra?

You are using follower_ids and friend_ids as set collections.
Limitations of collections in Cassandra:
The maximum size of an item in a collection is 64K or 2B, depending
on the native protocol version.
Keep collections small to prevent delays during querying, because
Cassandra reads a collection in its entirety. Collections are not
paged internally; they are designed to store only a small amount of data.
Never insert more than 64K items in a collection.
If you insert more than 64K items into a collection, only 64K of them will be queryable, resulting in data loss.
You can use the schema below:
CREATE TABLE IF NOT EXISTS my_table (
    node_id bigint,
    screen_name text,
    last_tweets set<text>,
    centrality float,
    friend_follower_id bigint,
    is_friend boolean,
    is_follower boolean,
    PRIMARY KEY ((node_id), friend_follower_id)
);
Here friend_follower_id holds either a friend id or a follower id: if it is a friend, set is_friend to true; if it is a follower, set is_follower to true.
Example :
If for node_id = 1
friend_ids = [10, 20, 30]
follower_ids = [11, 21, 31]
Then your insert queries will be:
INSERT INTO my_table (node_id, friend_follower_id, is_friend) VALUES (1, 10, true);
INSERT INTO my_table (node_id, friend_follower_id, is_friend) VALUES (1, 20, true);
INSERT INTO my_table (node_id, friend_follower_id, is_friend) VALUES (1, 30, true);
INSERT INTO my_table (node_id, friend_follower_id, is_follower) VALUES (1, 11, true);
INSERT INTO my_table (node_id, friend_follower_id, is_follower) VALUES (1, 21, true);
INSERT INTO my_table (node_id, friend_follower_id, is_follower) VALUES (1, 31, true);
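In Python, these per-row writes can be issued with the DataStax cassandra-driver and a prepared statement. The following is a minimal sketch, assuming the my_table schema above, a locally reachable cluster, and a keyspace named my_keyspace (the contact point and keyspace name are illustrative):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # assumed contact point
session = cluster.connect("my_keyspace")   # assumed keyspace name

# Prepare once, bind many times; ? placeholders avoid building CQL strings by hand.
insert_friend = session.prepare(
    "INSERT INTO my_table (node_id, friend_follower_id, is_friend) VALUES (?, ?, true)")
insert_follower = session.prepare(
    "INSERT INTO my_table (node_id, friend_follower_id, is_follower) VALUES (?, ?, true)")

node_id = 1
friend_ids = [10, 20, 30]
follower_ids = [11, 21, 31]

for fid in friend_ids:
    session.execute(insert_friend, (node_id, fid))
for fid in follower_ids:
    session.execute(insert_follower, (node_id, fid))
For a user with over a million followers, issuing these inserts concurrently (for example with cassandra.concurrent.execute_concurrent_with_args) should be considerably faster than a sequential loop.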
If you want to get all friend ids and follower ids, then query:
SELECT * FROM my_table WHERE node_id = 1;
You will get this:
 node_id | friend_follower_id | centrality | is_follower | is_friend | last_tweets | screen_name
---------+--------------------+------------+-------------+-----------+-------------+-------------
       1 |                 10 |       null |        null |      True |        null |        null
       1 |                 11 |       null |        True |      null |        null |        null
       1 |                 20 |       null |        null |      True |        null |        null
       1 |                 21 |       null |        True |      null |        null |        null
       1 |                 30 |       null |        null |      True |        null |        null
       1 |                 31 |       null |        True |      null |        null |        null
Sources:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_collections_c.html
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html

Related

What is the Postgres _text type?

I have a Postgres table with a _text type (note the underscore) and am unable to determine how to insert the string [] into that table.
Here is my table definition:
CREATE TABLE public.newtable (
column1 _text NULL
);
I have the postgis extension enabled:
CREATE EXTENSION IF NOT EXISTS postgis;
And my python code:
conn = psycopg2.connect()
conn.autocommit = True
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
rows = [("[]",)]
insert_query = f"INSERT INTO newtable (column1) values %s"
psycopg2.extras.execute_values(cur, insert_query, rows, template=None, page_size=100)
This returns the following error:
psycopg2.errors.InvalidTextRepresentation: malformed array literal: "[]"
LINE 1: INSERT INTO newtable (column1) values ('[]')
^
DETAIL: "[" must introduce explicitly-specified array dimensions.
How can I insert this data? What does this error mean? And what is a _text type in Postgres?
Pulling my comments together:
CREATE TABLE public.newtable (
column1 _text NULL
);
--_text gets transformed into text[]
\d newtable
Table "public.newtable"
Column | Type | Collation | Nullable | Default
---------+--------+-----------+----------+---------
column1 | text[] | | |
insert into newtable values ('{}');
select * from newtable ;
column1
---------
{}
In Python:
import psycopg2
con = psycopg2.connect(dbname="test", host='localhost', user='postgres')
cur = con.cursor()
cur.execute("insert into newtable values ('{}')")
con.commit()
cur.execute("select * from newtable")
cur.fetchone()
([],)
cur.execute("truncate newtable")
con.commit()
cur.execute("insert into newtable values (%s)", [[]])
con.commit()
cur.execute("select * from newtable")
cur.fetchone()
([],)
From the psycopg2 docs on type adaptation: Postgres arrays are adapted to Python lists and vice versa.
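Applied to the original execute_values call, that means passing a Python list rather than the string "[]". A sketch under the same connection assumptions as the question:
import psycopg2
import psycopg2.extras

conn = psycopg2.connect()           # connection details as in the question
conn.autocommit = True
cur = conn.cursor()

rows = [([],)]                      # one row, one empty text[] value (a Python list, not "[]")
insert_query = "INSERT INTO newtable (column1) VALUES %s"
psycopg2.extras.execute_values(cur, insert_query, rows, template=None, page_size=100)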
UPDATE
Finding the _text type in the Postgres system catalog pg_type, in psql:
\x
Expanded display is on.
select * from pg_type where typname = '_text';
-[ RECORD 1 ]--+-----------------
oid | 1009
typname | _text
typnamespace | 11
typowner | 10
typlen | -1
typbyval | f
typtype | b
typcategory | A
typispreferred | f
typisdefined | t
typdelim | ,
typrelid | 0
typelem | 25
typarray | 0
typinput | array_in
typoutput | array_out
typreceive | array_recv
typsend | array_send
typmodin | -
typmodout | -
typanalyze | array_typanalyze
typalign | i
typstorage | x
typnotnull | f
typbasetype | 0
typtypmod | -1
typndims | 0
typcollation | 100
typdefaultbin | NULL
typdefault | NULL
typacl | NULL
Refer to the pg_type documentation referenced above for what the columns mean. The typcategory of A, which "Table 52.63. typcategory Codes" maps to "Array types", is one clue, as are the typinput, typoutput, etc. values.

Python 3 - How do I extract data from a SQL database, process it, and append it to a pandas DataFrame row by row?

I have a MySQL table whose columns are:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| id | int unsigned | NO | PRI | NULL | auto_increment |
| artist | text | YES | | NULL | |
| title | text | YES | | NULL | |
| album | text | YES | | NULL | |
| duration | text | YES | | NULL | |
| artistlink | text | YES | | NULL | |
| songlink | text | YES | | NULL | |
| albumlink | text | YES | | NULL | |
| instrumental | tinyint(1) | NO | | 0 | |
| downloaded | tinyint(1) | NO | | 0 | |
| filepath | text | YES | | NULL | |
| language | json | YES | | NULL | |
| genre | json | YES | | NULL | |
| style | json | YES | | NULL | |
| artistgender | text | YES | | NULL | |
+--------------+--------------+------+-----+---------+----------------+
I need to extract the data, process it, and add it to a pandas DataFrame.
I know how to extract data from SQL database, and I have already implemented a way to pass the data to DataFrame, but it is extremely slow (about 30 seconds), whereas when I used a flat list of namedtuples the operation is tremendously faster (under 3 seconds).
Specifically, filepath defaults to NULL unless the file is downloaded (currently none of the songs are downloaded); when Python reads filepath the value will be None, and I need that value to become ''.
Because MySQL doesn't have a real BOOLEAN type, I need to cast the received ints to bool.
The language, genre, and style fields are tags stored as JSON lists and are all currently NULL; when Python reads them they are strings, and I need to turn them into lists with json.loads unless they are None, in which case I need to append empty lists instead.
This is my inefficient solution to the problem:
import json
import mysql.connector
from pandas import *

fields = {
    "artist": str(),
    "album": str(),
    "title": str(),
    "id": int(),
    "duration": str(),
    "instrumental": bool(),
    "downloaded": bool(),
    "filepath": str(),
    "language": list(),
    "genre": list(),
    "style": list(),
    "artistgender": str(),
    "artistlink": str(),
    "albumlink": str(),
    "songlink": str(),
}

conn = mysql.connector.connect(
    user="Estranger", password=PWD, host="127.0.0.1", port=3306, database="Music"
)
cursor = conn.cursor()

def proper(x):
    return x[0].upper() + x[1:]

def fetchdata():
    cursor.execute("select {} from songs".format(', '.join(list(fields))))
    data = cursor.fetchall()
    dataframes = list()
    for item in data:
        entry = list(map(proper, item[0:3]))
        entry += [item[3]]
        for j in range(4, 7):
            cell = item[j]
            if isinstance(cell, int):
                entry.append(bool(cell))
            elif isinstance(cell, str):
                entry.append(cell)
        if item[7] is not None:
            entry.append(item[7])
        else:
            entry.append('')
        for j in range(8, 11):
            entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
        entry.append(item[11])
        entry += item[12:15]
        df = DataFrame(fields, index=[])
        row = Series(entry, index=df.columns)
        df = df.append(row, ignore_index=True)
        dataframes.append(df)
    songs = concat(dataframes, axis=0, ignore_index=True)
    songs.sort_values(['artist', 'album', 'title'], inplace=True)
    return songs
Currently there are 4464 songs in the database and the code takes about 30 seconds to finish.
My SQL table is sorted by artist and title, but I need to resort the entries by artist, album, and title for a QTreeWidget; MySQL sorts data differently from Python, and I prefer Python's sorting.
In my testing, df.loc and df = df.append() are slow and pd.concat is fast, but I don't know how to create single-row DataFrames from flat lists instead of a dictionary, whether there is a faster way than pd.concat, or whether the operations in the for loop can be vectorized.
How can my code be improved?
I figured out how to create a DataFrame with a list of lists and specify column names, and it is tremendously faster, but I still don't know how to also specify the data types elegantly without the code throwing errors...
def fetchdata():
    cursor.execute("select {} from songs".format(', '.join(list(fields))))
    data = cursor.fetchall()
    for i, item in enumerate(data):
        entry = list(map(proper, item[0:3]))
        entry += [item[3]]
        for j in range(4, 7):
            cell = item[j]
            if isinstance(cell, int):
                entry.append(bool(cell))
            elif isinstance(cell, str):
                entry.append(cell)
        if item[7] is not None:
            entry.append(item[7])
        else:
            entry.append('')
        for j in range(8, 11):
            entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
        entry.append(item[11])
        entry += item[12:15]
        data[i] = entry
    songs = DataFrame(data, columns=list(fields), index=range(len(data)))
    songs.sort_values(['artist', 'album', 'title'], inplace=True)
    return songs
And I still need the type conversions; they are already pretty fast, but they don't look elegant.
You could make a list of conversion functions for each column:
funcs = [
    str.capitalize,
    str.capitalize,
    str.capitalize,
    int,
    str,
    bool,
    bool,
    lambda v: v if v is not None else '',
    lambda v: json.loads(v) if v is not None else [],
    lambda v: json.loads(v) if v is not None else [],
    lambda v: json.loads(v) if v is not None else [],
    str,
    str,
    str,
    str,
]
Now you can apply the converting function to each field:
for i, item in enumerate(data):
    row = [func(field) for field, func in zip(item, funcs)]
    data[i] = row
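Combining this with the DataFrame construction from your update, the whole conversion can stay as one list comprehension. A sketch, assuming the fields dict, the funcs list above, and data = cursor.fetchall() from your code:
import pandas as pd

# convert each fetched row with the per-column functions, then build the frame in one go
converted = [[func(field) for field, func in zip(item, funcs)] for item in data]
songs = pd.DataFrame(converted, columns=list(fields))
songs.sort_values(["artist", "album", "title"], inplace=True)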
For the first part of the question, for a generic table 'history':
import pymysql

# open database
connection = pymysql.connect(host="localhost", user="root", password="123456", database="blue")
# prepare a cursor object using cursor() method
cursor = connection.cursor()
# prepare SQL command
sql = "SELECT * FROM history"
try:
    cursor.execute(sql)
    data = cursor.fetchall()
    print("Last row uploaded", list(data[-1]))
except pymysql.Error:
    print("Error: unable to fetch data")
# disconnect from server
connection.close()
You can simply fetch data from the table and create a DataFrame using pandas.
import pymysql
import pandas as pd
from pymysql import Error

# fill in your connection details; 3306 is the default MySQL port
conn = pymysql.connect(host="", user="", connect_timeout=10, password="", database="", port=3306)
if conn:
    cursor = conn.cursor()
    sql = "SELECT * FROM schema.table_name;"
    cursor.execute(sql)
    data = pd.DataFrame(cursor.fetchall(), columns=[col[0] for col in cursor.description])
    conn.close()
    # You can go ahead and create a csv from this DataFrame
    csv_gen = data.to_csv(index=False)

MySQL JSON Query sends random numbers

I'm writing a MySQL query in Python using pymysql to send JSON data to a MySQL table. When it sends the data, the following result is produced.
| id | data |
+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 21 | 0x7B226775696C6473223A207B22383538393137313431303633343031353132223A207B226D656D62657273223A207B22343130373936333038323431303535373436223A207B22706F696E7473223A203137302C2022726166666C65735F776F6E223A20307D7D2C2022726166666C6573223A20307D7D7D |
The code I used to send the data is the following:
self.data_str = json.dumps(self.data)
sql = "insert into jsondata ( data) values ('" + self.data_str + "') "
mysql.exec_sql(sql)
mysql.close_db()
The exec_sql function is:
def exec_sql(self, sql):
    # sql is insert, delete or update statement
    cursor = self.db.cursor()
    try:
        cursor.execute(sql)
        # commit sql to mysql
        self.db.commit()
        cursor.close()
        return True
    except:
        self.db.rollback()
        return False
An example line of JSON data is
{"guilds": {"853317141063401512": {"members": {"410846308241055746": {"points": 250, "raffles_won": 0}}, "raffles": 0}}}
My SQL table was set up as follows:
| Field | Type | Null | Key | Default | Extra |
+-------+--------+------+-----+---------+----------------+
| id | int(6) | NO | PRI | NULL | auto_increment |
| data | blob | YES | | NULL | |
+-------+--------+------+-----+---------+----------------+
The bytes are not random. They are the hex representation of ASCII bytes in your JSON string. Observe:
mysql> select unhex('7B226775696C6473223A207B22383538393137313431303633343031353132223A207B226D656D62657273223A207B22343130373936333038323431303535373436223A207B22706F696E7473223A203137302C2022726166666C65735F776F6E223A20307D7D2C2022726166666C6573223A20307D7D7D') as j;
+--------------------------------------------------------------------------------------------------------------------------+
| j |
+--------------------------------------------------------------------------------------------------------------------------+
| {"guilds": {"858917141063401512": {"members": {"410796308241055746": {"points": 170, "raffles_won": 0}}, "raffles": 0}}} |
+--------------------------------------------------------------------------------------------------------------------------+
What you're seeing is that when you store a JSON string in a binary column (BLOB), MySQL "forgets" that it is supposed to be text, and dumps only the hex encoding of the bytes when you query it.
If you want to store JSON, then use the JSON data type, not BLOB.
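For example, a minimal sketch with pymysql, assuming the jsondata table is declared with a JSON column (e.g. CREATE TABLE jsondata (id INT AUTO_INCREMENT PRIMARY KEY, data JSON)) and illustrative connection details. Note the parameterized query, which also avoids the SQL injection risk of concatenating self.data_str into the statement:
import json
import pymysql

# assumed connection details, for illustration only
conn = pymysql.connect(host="localhost", user="root", password="secret", database="mydb")

payload = {"guilds": {"853317141063401512": {"members": {}, "raffles": 0}}}

with conn.cursor() as cursor:
    # let the driver quote the value instead of building the SQL string by hand
    cursor.execute("INSERT INTO jsondata (data) VALUES (%s)", (json.dumps(payload),))
conn.commit()

with conn.cursor() as cursor:
    cursor.execute("SELECT data FROM jsondata ORDER BY id DESC LIMIT 1")
    (stored,) = cursor.fetchone()
    print(json.loads(stored))   # back to a Python dict, readable instead of hex
conn.close()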

How to retrieve a zerofill column from MySQL via Python?

I have a table in mysql as below:
CREATE TABLE `province` (
`pid` int(2) unsigned zerofill NOT NULL,
`pname` varchar(255) CHARACTER SET utf8 COLLATE utf8_persian_ci DEFAULT NULL,
`family` int(12) DEFAULT NULL,
`population` int(11) DEFAULT NULL,
`male` int(11) DEFAULT NULL,
`female` int(11) DEFAULT NULL,
PRIMARY KEY (`pid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_persian_ci
Note that the first column (pid) is a zerofill column,
and the data in the table (province) is as below:
=========================================
|pid|pname|family|population|male|female|
=========================================
|02 | 'A' | 12 | 20 | 8 | 5 |
=========================================
|03 | 'B' | 25 | 20 | 7 | 6 |
=========================================
|05 | 'c' | 34 | 5 | 7 | 9 |
=========================================
I want to retrieve the pid column via Python, so my Python code is:
import mysql.connector

if __name__ == '__main__':
    data = []
    res = []
    cnx = mysql.connector.connect(user='mehdi', password='mehdi', host='127.0.0.1', database='cra_db')
    cursor = cnx.cursor()
    qrystr = 'SELECT pid FROM province;'
    cursor.execute(qrystr)
    print(cursor.fetchall())
    cnx.close()
but when I run this Python code, this exception occurs:
returned a result with an error set
File "C:\Users\M_Parastar\Desktop\New folder\ttt.py", line 11, in
print(cursor.fetchall())
Do you have any idea how to retrieve a zerofill column via Python?
The cursor returns an iterable Python object: replace print(cursor.fetchall()) with for row in cursor: print(row), as in the sketch below. See the Connector/Python API Reference.
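A minimal sketch of that, assuming the province table from the question. Note that ZEROFILL is only a display attribute, so the connector hands you plain integers and any leading zeros have to be re-applied in Python (or in SQL, e.g. with LPAD(pid, 2, '0')):
import mysql.connector

cnx = mysql.connector.connect(user='mehdi', password='mehdi', host='127.0.0.1', database='cra_db')
cursor = cnx.cursor()
cursor.execute('SELECT pid FROM province')
for (pid,) in cursor:
    # pid comes back as an int; pad it to two digits to mimic the zerofill display
    print(str(pid).zfill(2))
cnx.close()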

MySQL Python Insert strange?

I don't see why it's not working. I have created several databases and tables before with no problem, but I am stuck with this table, which was created from a Django data model. To clarify what I have done: I created a new database and table from the MySQL console and tried inserting from Python, and that works. But this one is strange to me.
class Experiment(models.Model):
    user = models.CharField(max_length=25)
    filetype = models.CharField(max_length=10)
    createddate = models.DateField()
    uploaddate = models.DateField()
    time = models.CharField(max_length=20)
    size = models.CharField(max_length=20)
    located = models.CharField(max_length=50)
Here is the view in the MySQL console:
mysql> describe pmass_experiment;
+-------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user | varchar(25) | NO | | NULL | |
| filetype | varchar(10) | NO | | NULL | |
| createddate | date | NO | | NULL | |
| uploaddate | date | NO | | NULL | |
| time | varchar(20) | NO | | NULL | |
| size | varchar(20) | NO | | NULL | |
| located | varchar(50) | NO | | NULL | |
+-------------+-------------+------+-----+---------+----------------+
8 rows in set (0.01 sec)
The pmass_experiment table above was created by the Django ORM after python manage.py syncdb.
Now I am trying to insert data into pmass_experiment through Python MySQLdb:
import MySQLdb
import datetime, time
import sys

conn = MySQLdb.connect(
    host="localhost",
    user="root",
    passwd="root",
    db="experiment")
cursor = conn.cursor()

user = 'tchand'
ftype = 'mzml'
size = '10MB'
located = 'c:\\'
date = datetime.date.today()
time = str(datetime.datetime.now())[10:19]

# Insert into database
sql = """INSERT INTO pmass_experiment (user,filetype,createddate,uploaddate,time,size,located)
         VALUES (user, ftype, date, date, time, size, located)"""
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    conn.commit()
except:
    # Rollback in case there is any error
    conn.rollback()

# disconnect from server
conn.close()
But unfortunately nothing is inserted. I am guessing it may be due to the primary key (id) in the table not incrementing automatically.
mysql> select * from pmass_experiment;
Empty set (0.00 sec)
Can you simply point out my mistake?
Thanks
sql = """INSERT INTO pmass_experiment (user,filetype,createddate,uploaddate,time,size,located)
VALUES (user, ftype, date, date, time, size, located)"""
Parametrize your sql and pass in the values as the second argument to cursor.execute:
sql = """INSERT INTO pmass_experiment (user,filetype,createddate,uploaddate,time,size,located)
VALUES (%s, %s, %s, %s, %s, %s, %s)"""
try:
# Execute the SQL command
cursor.execute(sql,(user, ftype, date, date, time, size, located))
# Commit your changes in the database
conn.commit()
except Exception as err:
# logger.error(err)
# Rollback in case there is any error
conn.rollback()
It is a good habit to always parametrize your sql since this will help prevent sql injection.
The original sql
INSERT INTO pmass_experiment (user,filetype,createddate,uploaddate,time,size,located)
VALUES (user, ftype, date, date, time, size, located)
seems to be valid. An experiment in the mysql shell shows it inserts a row of NULL values:
mysql> insert into foo (first,last,value) values (first,last,value);
Query OK, 1 row affected (0.00 sec)
mysql> select * from foo order by id desc;
+-----+-------+------+-------+
| id | first | last | value |
+-----+-------+------+-------+
| 802 | NULL | NULL | NULL |
+-----+-------+------+-------+
1 row in set (0.00 sec)
So I'm not sure why you are not seeing any rows committed to the database table.
Nevertheless, the original sql is probably not doing what you intend.
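One way to find out is to stop swallowing the exception: the bare except: in the question's code hides whatever MySQLdb raised before the rollback. A small diagnostic sketch, reusing conn, cursor, and sql from the question's snippet (so MySQLdb is already imported):
try:
    cursor.execute(sql)
    conn.commit()
except MySQLdb.MySQLError as err:
    # print the real error instead of silently rolling back
    print("Insert failed:", err)
    conn.rollback()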
