Improving MySQL read time, MySQLdb - python

I have a table with more than a million records with the following structure:
mysql> SELECT * FROM Measurement;
+----------------+---------+-----------------+------+------+
| Time_stamp     | Channel | SSID            | CQI  | SNR  |
+----------------+---------+-----------------+------+------+
| 03_14_14_30_14 | 7       | open            | 40   | -70  |
| 03_14_14_30_14 | 7       | roam            | 31   | -79  |
| 03_14_14_30_14 | 8       | open2           | 28   | -82  |
| 03_14_14_30_15 | 8       | roam2           | 29   | -81  |
...
I am reading data from this table into Python for plotting. The problem is that the MySQL reads are too slow and it is taking me hours to get the plots, even after using MySQLdb.cursors.SSCursor (as suggested by a few in this forum) to speed up the task.
con = mdb.connect('localhost', 'testuser', 'conti', 'My_Freqs',
                  cursorclass=MySQLdb.cursors.SSCursor)
cursor = con.cursor()
cursor.execute("SELECT Time_stamp FROM Measurement")
for row in cursor:
    ... Do processing ....
Will normalizing the table help me speed up the task? If so, how should I normalize it?
P.S: Here is the result for EXPLAIN
+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| Time_stamp | varchar(128) | YES  |     | NULL    |       |
| Channel    | int(11)      | YES  |     | NULL    |       |
| SSID       | varchar(128) | YES  |     | NULL    |       |
| CQI        | int(11)      | YES  |     | NULL    |       |
| SNR        | float        | YES  |     | NULL    |       |
+------------+--------------+------+-----+---------+-------+

The problem is probably that you are looping over the cursor instead of dumping out all the data at once and then processing it. You should be able to dump out a couple million rows in a few seconds. Try something like:
cursor.execute("select Time_stamp FROM Measurement")
data = cusror.fetchall()
for row in data:
#do some stuff...

Well, since you're saying the whole table has to be read, I guess you can't do much about it. It has more than 1 million records... you're not going to optimize much on the database side.
How much time does it take you to process just one record? Maybe you could try optimizing that part. But even if you got down to 1 millisecond per record, it would still take you about half an hour to process the full table. You're dealing with a lot of data.
Maybe run multiple plotting jobs in parallel? With the same metrics as above, dividing your data into 6 equal-sized jobs would (theoretically) give you the plots in 5 minutes.
Do your plots have to be fine-grained? You could look for ways to ignore certain values in the data, and generate a complete plot only when the user needs it (wild speculation here, I really have no idea what your plots look like).
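If you go the parallel route, here is a minimal multiprocessing sketch of the idea; the time ranges and the per-row work are hypothetical placeholders, and each worker opens its own connection so the chunks can be read independently:

from multiprocessing import Pool
import MySQLdb
import MySQLdb.cursors

# hypothetical Time_stamp ranges splitting the table into chunks
CHUNKS = [("03_14_00_00_00", "03_14_08_00_00"),
          ("03_14_08_00_00", "03_14_16_00_00"),
          ("03_14_16_00_00", "03_15_00_00_00")]

def process_chunk(bounds):
    lo, hi = bounds
    con = MySQLdb.connect('localhost', 'testuser', 'conti', 'My_Freqs',
                          cursorclass=MySQLdb.cursors.SSCursor)
    cur = con.cursor()
    cur.execute("SELECT Time_stamp FROM Measurement "
                "WHERE Time_stamp >= %s AND Time_stamp < %s", (lo, hi))
    for row in cur:
        pass  # per-row processing / plotting prep goes here
    con.close()

if __name__ == '__main__':
    Pool(len(CHUNKS)).map(process_chunk, CHUNKS)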

Related

meanshift clustering using pyspark

We're trying to migrate a vanilla python code-base to pyspark. The agenda is to do some filtering on a dataframe (previously pandas, now spark), then group it by user-ids, and finally apply meanshift clustering on top.
I'm using pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) on the grouped-data. But now there's a problem in the way the final output should be represented.
Let's say we have two columns in the input dataframe, user-id and location. For each user we need to get all clusters (on the location), retain only the biggest one, and then return its attributes, which form a 3-dimensional vector. Let's assume the columns of this 3-tuple are col-1, col-2 and col-3. I can only think of creating the original dataframe with 5 columns, with these 3 fields set to None, using something like withColumn('col-i', lit(None).astype(FloatType())). Then, in the first row for each user, I'm planning to populate these three columns with the attributes. But this seems like a really ugly way of doing it, and it would unnecessarily waste a lot of space, because apart from the first row, all entries in col-1, col-2 and col-3 would be null. The output dataframe would look something like this:
+---------+----------+-------+-------+-------+
| user-id | location | col-1 | col-2 | col-3 |
+---------+----------+-------+-------+-------+
| 02751a9 | 0.894956 | 21.9  | 31.5  | 54.1  |
| 02751a9 | 0.811956 | null  | null  | null  |
| 02751a9 | 0.954956 | null  | null  | null  |
| ...     |          |       |       |       |
| 02751a9 | 0.811956 | null  | null  | null  |
+---------+----------+-------+-------+-------+
| 0af2204 | 0.938011 | 11.1  | 12.3  | 53.3  |
| 0af2204 | 0.878081 | null  | null  | null  |
| 0af2204 | 0.933054 | null  | null  | null  |
| 0af2204 | 0.921342 | null  | null  | null  |
| ...     |          |       |       |       |
| 0af2204 | 0.978081 | null  | null  | null  |
+---------+----------+-------+-------+-------+
This feels so wrong. Is there an elegant way of doing it?
What I ended up doing was grouping the df by user-ids and applying functions.collect_list on the columns, so that each cell contains a list. Now each user has only one row. Then I applied meanshift clustering on each row's data.
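For reference, a minimal sketch of that collect_list approach; the column names and the use of sklearn's MeanShift are assumptions, not the original code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import numpy as np
from sklearn.cluster import MeanShift

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("02751a9", 0.894956), ("02751a9", 0.811956), ("0af2204", 0.938011)],
    ["user_id", "location"])

# one row per user, with all locations collected into a single list column
grouped = df.groupBy("user_id").agg(F.collect_list("location").alias("locations"))

def biggest_cluster_center(locations):
    # cluster one user's locations and keep the center of the largest cluster
    ms = MeanShift().fit(np.array(locations).reshape(-1, 1))
    labels, counts = np.unique(ms.labels_, return_counts=True)
    return ms.cluster_centers_[labels[counts.argmax()]].tolist()

for row in grouped.collect():
    center = biggest_cluster_center(row["locations"])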

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
SELECT
    id,
    name
FROM (
    SELECT
        *,
        lag(name) OVER (ORDER BY id) AS prev_name
    FROM test
) s
WHERE prev_name = 'Hello'
The lag function adds a column with the text of the previous row. So you can filter by this text in a subquery.
demo:db<>fiddle
I am not really into Django, but according to the documentation, support for window functions was added in version 2.0.
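For what it's worth, a hedged sketch of that window-function route in Django >= 2.0, assuming the model from the question is named Word; filtering directly on a window annotation is only allowed in recent Django versions, so this sketch post-filters in Python:

from django.db.models import F, Window
from django.db.models.functions import Lag

# annotate every row with the text of the previous row (ordered by id)
words = Word.objects.annotate(
    prev_text=Window(expression=Lag('text'), order_by=F('id').asc()))

following = [w for w in words if w.prev_text == 'Hello']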
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.
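In Django's ORM, a minimal sketch of this subquery approach (assuming the model is named Word, as in the question) could look like this, and it runs as a single query:

from django.db.models import F

# ids of the rows immediately after a 'Hello', computed as a subquery
next_ids = (Word.objects
            .filter(text='Hello')
            .annotate(next_id=F('id') + 1)
            .values('next_id'))

following_words = Word.objects.filter(id__in=next_ids)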

Simple moving average for random related time values

I'm a beginner programmer looking for help with a Simple Moving Average (SMA). I'm working with two-column files, where the first column is the time and the second is the value. The time intervals are random and so are the values. Usually the files are not big, but the process collects data for a long time. At the end the files look similar to this:
+-----------+-------+
| Time      | Value |
+-----------+-------+
| 10        | 3     |
| 1345      | 50    |
| 1390      | 4     |
| 2902      | 10    |
| 34057     | 13    |
| (...)     |       |
| 898975456 | 10    |
+-----------+-------+
After the whole process, the number of rows is around 60k-100k.
Then I'm trying to "smooth" the data with some time window. For this purpose I'm using an SMA. [AWK_method]
awk 'BEGIN{size=$timewindow} {mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]};sum+=$1;array[mod]=$1;print sum/count}' file.dat
To achieve proper working of the SMA with a predefined $timewindow, I create a linearly incremented time column filled with zeros. Next, I run the script with different $timewindow values and observe the results.
+-----------+-------+
| Time      | Value |
+-----------+-------+
| 1         | 0     |
| 2         | 0     |
| 3         | 0     |
| (...)     |       |
| 10        | 3     |
| 11        | 0     |
| 12        | 0     |
| (...)     |       |
| 1343      | 0     |
| (...)     |       |
| 898975456 | 10    |
+-----------+-------+
For small data sets this was relatively comfortable, but now it is quite time-devouring and the created files are starting to get too big. I'm also familiar with Gnuplot, but doing an SMA there is hell...
So here are my questions:
Is it possible to change the awk solution to bypass filling the data with zeros?
Do you recommend any other solution using bash?
I have also considered learning Python, because after 6 months of learning bash I have got to know its limitations. Will I be able to solve this in Python without creating big data files?
I'd be glad for any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
You included a python tag, check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with not-equal intervls
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
key phrase in bold for more research:
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
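Since you mention Python: a minimal pandas sketch, assuming the time column holds seconds and a 300-second window, computes a time-based moving average without any zero-filling:

import pandas as pd

# whitespace-separated columns: time (seconds) and value
df = pd.read_csv("file.dat", sep=r"\s+", names=["time", "value"])
df["ts"] = pd.to_datetime(df["time"], unit="s")      # treat time as seconds since epoch

# rolling mean over a 300-second window keyed on the timestamps themselves
sma = df.set_index("ts")["value"].rolling("300s").mean()
print(sma)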
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file

What might be causing MySQL to hang my Python script?

I have a pretty straightforward Python script. It kicks off a pool of 10 processes that each:
Make an external API request for 1,000 records
Parse the XML response
Insert each record into a MySQL database
There's nothing particularly tricky here, but about the time I reach 90,000 records the script hangs.
mysql> show processlist;
+----+------+-----------------+-------------+---------+------+-------+------------------+
| Id | User | Host            | db          | Command | Time | State | Info             |
+----+------+-----------------+-------------+---------+------+-------+------------------+
| 44 | root | localhost:48130 | my_database | Sleep   |   57 |       | NULL             |
| 45 | root | localhost:48131 | NULL        | Sleep   |    6 |       | NULL             |
| 59 | root | localhost       | my_database | Sleep   |  506 |       | NULL             |
| 60 | root | localhost       | NULL        | Query   |    0 | NULL  | show processlist |
+----+------+-----------------+-------------+---------+------+-------+------------------+
I have roughly a million records to import this way, so I have a long, long way to go.
What can I do to prevent this hang and keep my script moving?
Python 2.7.6
MySQL-python 1.2.5
Not exactly what I wanted to do, but I have found that opening and closing the connection as required seems to move things along.
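A minimal sketch of that open-and-close pattern; the table and column names here are hypothetical placeholders:

import MySQLdb

def insert_batch(records):
    # open a fresh connection per batch instead of holding one across API calls
    con = MySQLdb.connect('localhost', 'root', 'secret', 'my_database')
    try:
        cur = con.cursor()
        cur.executemany(
            "INSERT INTO records (external_id, payload) VALUES (%s, %s)",
            records)
        con.commit()
    finally:
        con.close()

# each worker process calls insert_batch() once per page of 1,000 parsed records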

Why django order_by is so slow in a manytomany query?

I have a ManyToMany field. Like this:
class Tag(models.Model):
    books = models.ManyToManyField('book.Book', related_name='vtags', through=TagBook)

class Book(models.Model):
    nump = models.IntegerField(default=0, db_index=True)
I have around 450,000 books, and some tags are related to around 60,000 books. When I do a query like:
tag.books.order_by('nump')[1:11]
it gets extremely slow, like 3-4 minutes. But if I remove the order_by, the query runs as normal.
The raw SQL for the order_by version looks like this:
SELECT `book_book`.`id`, ... `book_book`.`price`, `book_book`.`nump`
FROM `book_book`
INNER JOIN `book_tagbook` ON (`book_book`.`id` = `book_tagbook`.`book_id`)
WHERE `book_tagbook`.`tag_id` = 1
ORDER BY `book_book`.`nump` ASC
LIMIT 11 OFFSET 1
Do you have any idea on this? How could I fix it? Thanks.
---EDIT---
I checked the previous raw query in MySQL as #bouke suggested:
SELECT `book_book`.`id`, `book_book`.`title`, ... `book_book`.`nump`, `book_book`.`raw_data`
FROM `book_book`
INNER JOIN `book_tagbook` ON (`book_book`.`id` = `book_tagbook`.`book_id`)
WHERE `book_tagbook`.`tag_id` = 1
ORDER BY `book_book`.`nump` ASC LIMIT 11 OFFSET 1;

11 rows in set (4 min 2.79 sec)
Then I used EXPLAIN to find out why:
+----+-------------+--------------+--------+---------------------------------------------+-----------------------+---------+-----------------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------------------------------------+-----------------------+---------+-----------------------------+--------+---------------------------------+
| 1 | SIMPLE | book_tagbook | ref | book_tagbook_3747b463,book_tagbook_752eb95b | book_tagbook_3747b463 | 4 | const | 116394 | Using temporary; Using filesort |
| 1 | SIMPLE | book_book | eq_ref | PRIMARY | PRIMARY | 4 | legend.book_tagbook.book_id | 1 | |
+----+-------------+--------------+--------+---------------------------------------------+-----------------------+---------+-----------------------------+--------+---------------------------------+
2 rows in set (0.10 sec)
And for the table book_book:
mysql> explain book_book;
+----------------+----------------+------+-----+-----------+----------------+
| Field          | Type           | Null | Key | Default   | Extra          |
+----------------+----------------+------+-----+-----------+----------------+
| id             | int(11)        | NO   | PRI | NULL      | auto_increment |
| title          | varchar(200)   | YES  |     | NULL      |                |
| href           | varchar(200)   | NO   | UNI | NULL      |                |
| ..... skip some part .............                                        |
| nump           | int(11)        | NO   | MUL | 0         |                |
| raw_data       | varchar(10000) | YES  |     | NULL      |                |
+----------------+----------------+------+-----+-----------+----------------+
24 rows in set (0.00 sec)