A text corpus is usually represented in XML like this:
<corpus name="foobar" date="08.09.13" authors="mememe">
  <document filename="br-392">
    <paragraph pnumber="1">
      <sentence snumber="1">
        <word wnumber="1" partofspeech="VB" sensetag="012345678-v" nameentity="None">Hello</word>
        <word wnumber="2" partofspeech="NN" sensetag="876543210-n" nameentity="World">Foo bar</word>
      </sentence>
    </paragraph>
  </document>
</corpus>
When I put a corpus into a database, each row represents one word and the columns are as follows:
| uid    | corpusname | docfilename | pnumber | snumber | wnumber | token   | pos | sensetag    | ne    |
| 198317 | foobar     | br-392      | 1       | 1       | 1       | Hello   | VB  | 012345678-v | None  |
| 192184 | foobar     | br-392      | 1       | 1       | 2       | Foo bar | NN  | 876543210-n | World |
I put the data into an sqlite3 database as such:
import sqlite3

# I read the XML file, and each word is now in memory as a tuple.
w1 = (198317, 'foobar', 'br-392', 1, 1, 1, 'Hello', 'VB', '012345678-v', 'None')
w2 = (192184, 'foobar', 'br-392', 1, 1, 2, 'Foo bar', 'NN', '876543210-n', 'World')
wordtokens = [w1, w2]

con = sqlite3.connect('semcor.db', isolation_level=None)
cur = con.cursor()
engtable = ("CREATE TABLE eng(uid INT, corpusname TEXT, docname TEXT, "
            "pnum INT, snum INT, tnum INT, "
            "word TEXT, pos TEXT, sensetag TEXT, ne TEXT)")
cur.execute(engtable)
cur.executemany("INSERT INTO eng VALUES (?,?,?,?,?,?,?,?,?,?)", wordtokens)
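For reference, the reading step hinted at in the comment above could be a small ElementTree pass like this (a sketch; the attribute names follow the XML shown earlier, and the uid would still need to be assigned separately):

import xml.etree.ElementTree as ET

def words_from_corpus(path):
    # yield one tuple per <word>, carrying the attributes of all its ancestors
    corpus = ET.parse(path).getroot()
    for document in corpus.iter("document"):
        for paragraph in document.iter("paragraph"):
            for sentence in paragraph.iter("sentence"):
                for word in sentence.iter("word"):
                    yield (corpus.get("name"),
                           document.get("filename"),
                           int(paragraph.get("pnumber")),
                           int(sentence.get("snumber")),
                           int(word.get("wnumber")),
                           word.text,
                           word.get("partofspeech"),
                           word.get("sensetag"),
                           word.get("nameentity"))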
The purpose of the database is so that I can run queries such as:
SELECT * FROM eng WHERE pnum=1;
SELECT * FROM eng WHERE snum=1;
SELECT * FROM eng WHERE snum=1 AND pos='NN' OR sensetag='876543210-n';
SELECT * FROM eng WHERE pos='NN' AND sensetag='876543210-n';
SELECT * FROM eng WHERE docname='br-392';
SELECT * FROM eng WHERE corpusname='foobar';
With this structure the size of the database explodes, because the number of tokens in each corpus can run into the millions or even billions.
Other than giving each word its own row, with its attributes and its parents' attributes as columns, how else could I structure the database so that I can perform the same queries and get the same output?
For the purpose of indexing a large corpus, should I be using a database program other than sqlite3? And should I still use the same table schema as defined above?
From a relational database design perspective, and to satisfy 1NF, I would use one table per element of the XML file. That saves space and helps DBMS performance, and the desired queries can still be expressed against this model. The draft model would be:
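A minimal sketch of what that could look like in sqlite3 (table and column names here are illustrative, one table per XML element, each child row keyed to its parent):

import sqlite3

con = sqlite3.connect('semcor.db')
cur = con.cursor()

# one table per XML element; each child row points at its parent by id
cur.executescript("""
CREATE TABLE corpus   (corpus_id    INTEGER PRIMARY KEY, name TEXT, date TEXT, authors TEXT);
CREATE TABLE document (document_id  INTEGER PRIMARY KEY,
                       corpus_id    INTEGER REFERENCES corpus(corpus_id), filename TEXT);
CREATE TABLE paragraph(paragraph_id INTEGER PRIMARY KEY,
                       document_id  INTEGER REFERENCES document(document_id), pnumber INTEGER);
CREATE TABLE sentence (sentence_id  INTEGER PRIMARY KEY,
                       paragraph_id INTEGER REFERENCES paragraph(paragraph_id), snumber INTEGER);
CREATE TABLE word     (word_id      INTEGER PRIMARY KEY,
                       sentence_id  INTEGER REFERENCES sentence(sentence_id),
                       wnumber INTEGER, token TEXT, pos TEXT, sensetag TEXT, ne TEXT);
""")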
"Should I be using some other database program other than sqlite3?"
That can only be answered from your application's requirements: how many records you will have after a month, a year, and so on; how many users will be connected; whether the workload is OLTP, OLAP or mixed; the project's budget; etc.
BTW, take a look at free RDBMSs like PostgreSQL and MySQL, and at commercial ones like Oracle.
For NoSQL solutions, having a look at the linked post may be helpful.
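Whichever DBMS you pick, also make sure the columns you filter on are indexed. With your current flat table that would be something along these lines (a sketch using the cursor from your own code; index names are illustrative):

# indexes on the columns the example queries filter on
cur.execute("CREATE INDEX idx_eng_pnum ON eng(pnum)")
cur.execute("CREATE INDEX idx_eng_snum ON eng(snum)")
cur.execute("CREATE INDEX idx_eng_pos_sensetag ON eng(pos, sensetag)")
cur.execute("CREATE INDEX idx_eng_docname ON eng(docname)")
cur.execute("CREATE INDEX idx_eng_corpusname ON eng(corpusname)")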
I guess the obvious answer is "normalisation"... you have an enormous amount of duplicated information per row, and that is going to massively increase the size of your database.
Work out what is duplicated in each row and create a table to contain that data. For example, a duplicated corpus-name string of, say, 20 characters is reduced to a pointer to a row in a "corpus name" table, which for argument's sake might take only 4 bytes as the ID value of that entry.
You don't say what platform you are using either. If it is a mobile device then it really does pay to normalise your data as much as possible. It makes the code a little more complex, but that is always the space/time trade-off with this kind of thing. I am guessing this is some kind of reference application, in which case pure blinding speed is probably secondary to just making it work.
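For example, with a separate "corpus name" table holding each name only once, a word row only needs the integer ID, and the original flat output is still recovered with a join (a sketch using the sqlite3 cursor from the question; table and column names are illustrative):

cur.execute("CREATE TABLE corpusnames(corpus_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE words(uid INT, corpus_id INTEGER, docname TEXT, pnum INT, "
            "snum INT, tnum INT, word TEXT, pos TEXT, sensetag TEXT, ne TEXT)")

# reproduce the original per-word rows, with the corpus name joined back in
cur.execute("""
    SELECT c.name, w.docname, w.pnum, w.snum, w.tnum, w.word, w.pos, w.sensetag, w.ne
    FROM words w
    JOIN corpusnames c ON c.corpus_id = w.corpus_id
    WHERE c.name = 'foobar'
""")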
The mandatory Wikipedia link for normalisation, and this YouTube video.
Google is your friend, hope that helps. :) Sean
I know how to use %s, but it doesn't appear to work for a column name.
My aim here is to take a column name (a roll number here) and use it to find information (how many days that student attended):
roll=input("enter roll no.: ")
c1.execute("select sum(%s) from attendance", ("" + roll + "",))
a=c1.fetchall()
the table looks like:
date | 11b1 | 11b2 | 11b3 |......| 11b45 |
2020-12-01 | 1 | 0 | 1 |......| 1 |
2020-12-02 | 1 | 1 | 1 |......| 0 |
2020-12-03 | 0 | 1 | 1 |......| 1 |
This doesn't work and seems to give me a random value.
So how do I write that middle line? And why does the original code not give an error, yet still return an arbitrary-seeming number?
I will assume you mean the column names.
In Python 3.6+ you can use an f-string:
roll = input("enter roll no.: ")
c1.execute(f"select sum({roll}) from attendance")
a = c1.fetchall()
The names of MySQL schema objects - tables, columns, etc. - can be interpolated using string formatting by surrounding the placeholder with backticks ('`', ASCII 0x60). See MySQL Identifiers.
Using backticks prevents errors if the column name contains a space, or matches a keyword or reserved word.
However, backticks do not protect against SQL injection. As the programmer, it's your responsibility to make sure that any column names coming from outside your program (for example, from user input) are verified as matching the column names in the table. (Incidentally, this is also why the original code returns an odd number rather than an error: %s parameter substitution sends the roll number as a quoted string, so the server evaluates something like SUM('11b1') - a constant for every row - instead of summing the column.)
colname = 'roll'
sql = f"""SELECT sum(`{colname}`) FROM attendance"""
mycursor.execute(sql)
For values in INSERT or UPDATE statements, or WHERE clauses, DB-API parameter substitution should always be used.
colname = 'roll'
colvalue = 42
sql = f"""SELECT sum(`{colname}`) FROM attendance WHERE colvalue = %s"""
mycursor.execute(sql, (colvalue,))
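One simple way to do that verification before interpolating (a sketch; the allowed_columns set is hypothetical and would be built from the real column names of attendance, and mycursor is the cursor from above):

allowed_columns = {f"11b{i}" for i in range(1, 46)}   # hypothetical whitelist of real column names

roll = input("enter roll no.: ")
if roll not in allowed_columns:
    raise ValueError(f"unknown roll number: {roll!r}")

sql = f"SELECT SUM(`{roll}`) FROM attendance"
mycursor.execute(sql)               # safe: the column name was checked against the whitelist
days_attended = mycursor.fetchone()[0]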
I would like to be able to return a list of all fields (ideally with the table details) used by a given SQL query. E.g. the input of the query:
SELECT t1.field1, field3
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 as t2
ON t2.field2 = t1.field2
WHERE t2.field1 = 'someValue'
would return
+--------+-----------+--------+
| schema | tablename | field |
+--------+-----------+--------+
| dbo | table1 | field1 |
| dbo | table1 | field2 |
| dbo | table1 | field3 |
| dbo | table2 | field1 |
| dbo | table2 | field2 |
+--------+-----------+--------+
Really it needs to make use of the SQL engine itself, as there is no way a parser reading the query can know that field3 is in table1 and not table2. For this reason I assume the solution will have to be in SQL. Bonus points if it can handle SELECT * too!
I have attempted a Python solution using sqlparse (https://sqlparse.readthedocs.io/en/latest/), but had trouble with more complex SQL queries involving temporary tables, subqueries and CTEs. Handling aliases was also very difficult (particularly when the query used the same alias in multiple places). Obviously it could not handle cases like field3 above, which has no table identifier, nor can it handle SELECT *.
I was hoping there might be a more elegant solution within SQL Server Management Studio, or even some function within SQL Server itself. We have SQL Prompt from Redgate, which must have some understanding, within its IntelliSense, of the schema and the SQL query it is formatting.
UPDATE:
As requested: the reason I'm trying to do this is to work out which Users can execute which SSRS Reports within our organisation. This is entirely dependent on them having GRANT SELECT permissions assigned to their Roles on all fields used by all datasets (in our case SQL queries) in a given report. I have already managed to report on which Users have GRANT SELECT on which fields according to their Roles. I now want to extend that to which reports those permissions allow them to run.
Mapping columns back to table names may be tricky, because column names can be ambiguous or even derived. However, you can get the column names, their order and their types from virtually any query or stored procedure.
Example
SELECT column_ordinal,
       name,
       system_type_name
FROM sys.dm_exec_describe_first_result_set('SELECT * FROM YourTable', NULL, NULL)
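Applied to the query in the question, this could also be driven from Python (a sketch using pyodbc; the connection string is illustrative):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")

query = """
SELECT t1.field1, field3
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2 ON t2.field2 = t1.field2
WHERE t2.field1 = 'someValue'
"""

# describe the first result set of the query without actually running it
cursor = conn.execute(
    "SELECT column_ordinal, name, system_type_name "
    "FROM sys.dm_exec_describe_first_result_set(?, NULL, 0)", query)
for ordinal, name, type_name in cursor.fetchall():
    print(ordinal, name, type_name)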
I think I have now found an answer. Please note: I currently do not have permissions to execute these functions, so I have not yet tested it - I will update the answer when I've had a chance to test it. Thanks for the answer go to @milivojeviCH. The answer is copied from here: https://stackoverflow.com/a/19852614/6709902
This solves the ultimate goal of selecting all the columns used in a SQL Server execution plan:
USE AdventureWorksDW2012
DBCC FREEPROCCACHE
SELECT dC.Gender, dc.HouseOwnerFlag,
SUM(fIS.SalesAmount) AS SalesAmount
FROM
dbo.DimCustomer dC INNER JOIN
dbo.FactInternetSales fIS ON fIS.CustomerKey = dC.CustomerKey
GROUP BY dC.Gender, dc.HouseOwnerFlag
ORDER BY dC.Gender, dc.HouseOwnerFlag
/*
query_hash query_plan_hash
0x752B3F80E2DB426A 0xA15453A5C2D43765
*/
DECLARE @MyQ AS XML;

-- SELECT qstats.query_hash, query_plan_hash, qplan.query_plan AS [Query Plan], qtext.text
SELECT @MyQ = qplan.query_plan
FROM sys.dm_exec_query_stats AS qstats
CROSS APPLY sys.dm_exec_query_plan(qstats.plan_handle) AS qplan
CROSS APPLY sys.dm_exec_sql_text(qstats.plan_handle) AS qtext
WHERE text LIKE '% fIS %'
  AND query_plan_hash = 0xA15453A5C2D43765

SELECT @MyQ

;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT DISTINCT
    [Database] = x.value('(@Database)[1]', 'varchar(128)'),
    [Schema]   = x.value('(@Schema)[1]',   'varchar(128)'),
    [Table]    = x.value('(@Table)[1]',    'varchar(128)'),
    [Alias]    = x.value('(@Alias)[1]',    'varchar(128)'),
    [Column]   = x.value('(@Column)[1]',   'varchar(128)')
FROM @MyQ.nodes('//ColumnReference') x1(x)
Leads to the following output:
Database Schema Table Alias Column
------------------------- ------ ---------------- ----- ----------------
NULL NULL NULL NULL Expr1004
[AdventureWorksDW2012] [dbo] [DimCustomer] [dC] CustomerKey
[AdventureWorksDW2012] [dbo] [DimCustomer] [dC] Gender
[AdventureWorksDW2012] [dbo] [DimCustomer] [dC] HouseOwnerFlag
[AdventureWorksDW2012] [dbo] [FactInternetSal [fIS] CustomerKey
[AdventureWorksDW2012] [dbo] [FactInternetSal [fIS] SalesAmount
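If you'd rather shred the plan XML client-side instead of with XQuery, the same ColumnReference attributes can be pulled out in Python (a sketch; it assumes plan_xml already holds the query_plan text fetched from sys.dm_exec_query_plan):

import xml.etree.ElementTree as ET

SHOWPLAN_NS = "{http://schemas.microsoft.com/sqlserver/2004/07/showplan}"

def referenced_columns(plan_xml):
    # collect the distinct (database, schema, table, column) tuples in a showplan XML string
    root = ET.fromstring(plan_xml)
    refs = set()
    for node in root.iter(SHOWPLAN_NS + "ColumnReference"):
        refs.add((node.get("Database"), node.get("Schema"),
                  node.get("Table"), node.get("Column")))
    return sorted(refs, key=lambda r: tuple(x or "" for x in r))

# for db, schema, table, column in referenced_columns(plan_xml):
#     print(db, schema, table, column)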
I have created a Zapier Zap to populate data from a SmartSheet to a MySQL database. I have it branching so if the row does not already exist in MySQL a new row is created. This part works fine.
In my second branch, if the row already exists then the data in the row is updated with new data from the SmartSheet row. When existing data is replaced with new data, the Zap works fine, e.g. for an existing MySQL row:
+--------+---------------+--------------------+
| row_id | email_comment | smartsheet_orig_id |
+--------+---------------+--------------------+
| 895 | easy | 6876364645150921 |
+--------+---------------+--------------------+
In the SmartSheet if the user replaces the comment with another, the MySQL data is updated successfully, e.g:
+--------+---------------+--------------------+
| row_id | email_comment | smartsheet_orig_id |
+--------+---------------+--------------------+
| 895 | difficult | 6876364645150921 |
+--------+---------------+--------------------+
But, if the user has deleted the comment in SmartSheet and not replaced it with another, leaving the comment empty, the data is not removed from the corresponding MySQL record e.g:
+--------+---------------+--------------------+
| row_id | email_comment | smartsheet_orig_id |
+--------+---------------+--------------------+
| 895 | difficult | 6876364645150921 |
+--------+---------------+--------------------+
What I need the MySQL record to look like in this case would be:
+--------+---------------+--------------------+
| row_id | email_comment | smartsheet_orig_id |
+--------+---------------+--------------------+
| 895 | | 6876364645150921 |
+--------+---------------+--------------------+
After quite a lot of testing, and a conversation with Zapier support, it appears the problem is that Null values are removed from the Zapier Code step's output. So, in the above case, this is a summary of what I expect to happen:
Zapier Code step: email_comment = Null --> MySQL Update Row step: email_comment = Null
But at the output of the Code step my Null value for email_comment is stripped and so the MySQL Update Row Zap step interprets the record as not needing to be updated as there is no change and leaves the old value there.
I have tried, in my code, passing an empty string " " instead of a Null, but I get the exact same result. The only way around it I can see is to pass some sentinel character and then, in the Update Row step, replace that with a Null to store in the record, but I can't see a way of doing that in Zapier.
I have searched Google and Here for others wrestling with this issue but have drawn a complete blank. The search strings I have been using are [Zapier] delete data, [Zapier] remove data and [Zapier] Null. None of the results of those searches seem to be dealing with the issue I am having.
This is the Python code I'm using to gather the inputs from SmartSheet:
# input_data is supplied by the Zapier Code step; Zapier also wraps this snippet
# in a function, which is why the bare return at the end is valid there.

# for a non-existent input, store an empty value
def gather_vals(inp):
    return input_data.get(inp, emptyInput)

def pull_inputs(inputs, vinputs):
    for key, value in zip(vinputs, inputs):
        v = gather_vals(value)
        d_inputs.update({key: v})

x_vinputs = ['input_equipment', 'input_from', 'input_to', 'input_description', 'input_contractor', 'input_booked', 'input_confirmed', 'input_job_no', 'input_complete', 'input_est_val', 'input_inv_val', 'input_inv_no', 'input_book', 'input_update', 'input_comments_email']
x_inputs = ['equipment', 'from', 'to', 'description', 'contractor', 'booked', 'confirmed', 'job_no', 'complete', 'est_val', 'inv_val', 'inv_no', 'book', 'update', 'comments_email']

# Gather the rest of the inputs
emptyInput = None
d_inputs = {}
results = {}   # earlier parts of the step populate results; initialised here for completeness

# gather pick-up/delivery date/time input data
pull_inputs(x_inputs, x_vinputs)
results.update(d_inputs)
return results
The code appears to work: it returns no errors, and when there is an actual updated value in SmartSheet it is updated in MySQL, but when the comment is deleted the old value is left in MySQL.
I'm hoping someone may have a suggestion for me to follow.
This is the Zap flow:
Zapier support tells me the problem is that Nulls are being stripped off the output of the Python code step circled in red. The Nulls need to flow through to the Update Row step.
Manually entering NULL or Null or null in the Update Row step results in that string of characters being sent to MySQL. See the output from MySQL Workbench for that record:
Sending an empty string results in a string with quotation marks being sent to MySQL:
It appears this Zapier step will only send strings to MySQL so I guess it is a moot point that the code step strips NULLs from the output.
I was working through Pattern 1 and was wondering whether there is a way, in Python or otherwise, to "gather" all the rows with the same id into a single row of dictionaries.
CREATE TABLE temperature (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time)
);
Insert some data
INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:01:00','72F');
INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:02:00','73F');
INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:03:00','73F');
INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:04:00','74F');
Query the database.
SELECT weatherstation_id,event_time,temperature
FROM temperature
WHERE weatherstation_id='1234ABCD';
Result:
weatherstation_id | event_time | temperature
-------------------+--------------------------+-------------
1234ABCD | 2013-04-03 06:01:00+0000 | 72F
1234ABCD | 2013-04-03 06:02:00+0000 | 73F
1234ABCD | 2013-04-03 06:03:00+0000 | 73F
1234ABCD | 2013-04-03 06:04:00+0000 | 74F
This works, but I was wondering whether I can turn it into one row per weatherstation_id.
E.g.:
{
"weatherstationid": "1234ABCD",
"2013-04-03 06:01:00+0000": "72F",
"2013-04-03 06:02:00+0000": "73F",
"2013-04-03 06:03:00+0000": "73F",
"2013-04-03 06:04:00+0000": "74F"
}
Is there some parameter in the Cassandra driver that can be used to group by a certain id (weatherstation_id) and turn everything else into a dictionary? Or does it take some Python magic to turn the list of rows into a single row per id (or set of ids)?
Alex, you will have to do some post-execution data processing to get this format. The driver returns row by row, no matter what row_factory you use.
One of the reasons the driver cannot produce the format you suggest is that pagination is involved internally (the default fetch_size is 5000), so results built that way could be partial or incomplete. Additionally, this can easily be done in Python once the query execution is finished and you are sure that all required results have been fetched.
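A minimal post-processing sketch (it assumes a connected cassandra-driver Session named session; iterating the result set transparently fetches any remaining pages, and the grouping itself is plain Python):

from collections import defaultdict

rows = session.execute(
    "SELECT weatherstation_id, event_time, temperature "
    "FROM temperature WHERE weatherstation_id = %s",
    ("1234ABCD",))

grouped = defaultdict(dict)
for row in rows:
    # one dict per station id, keyed by the event time
    grouped[row.weatherstation_id][str(row.event_time)] = row.temperature

# grouped["1234ABCD"] -> {"2013-04-03 06:01:00": "72F", ...}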
I need some input on how best to load the XML file below into MySQL.
I have an XML file which contains info like this:
<Start><Account>0001</Account><Asset>ABC</Asset><Value>500</Value><Asset>DEF</Asset><Value>600</Value></Start>
<Start>.......
When I use
LOAD XML LOCAL INFILE 'file.xml' INTO TABLE my_tablename ROWS IDENTIFIED BY '<Start>';
the file loads successfully but the account column is all NULL.
I.e., select * from my_tablename;
Account | Asset | Value
Null | ABC | 500
Null | DEF | 600
as opposed to
I.e., select * from my_tablename;
Account | Asset | Value
0001 | ABC | 500
0001 | DEF | 600
What's the best way to handle this? Re-format the file in Python first? Another SQL query?
Thank you.
To have the result you need, your XML should be like this:
<Start><account>0001</account><asset>ABC</asset><value>500</value></Start>
<Start><account>0001</account><asset>DEF</asset><value>600</value></Start>
One account, asset and value tag per Start tag.
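If re-formatting the file in Python first is an option, one possible sketch (it wraps the repeated <Start> elements in a temporary root so the file parses as one document, keeps the original tag names, and the file names are illustrative):

import xml.etree.ElementTree as ET

with open("file.xml") as f:
    # the file is a series of <Start> elements, so wrap them in a single root element
    root = ET.fromstring("<Root>" + f.read() + "</Root>")

flat = ET.Element("Root")
for start in root.findall("Start"):
    account = start.findtext("Account")
    assets = start.findall("Asset")
    values = start.findall("Value")
    # emit one <Start> per Asset/Value pair, repeating the Account for each
    for asset, value in zip(assets, values):
        row = ET.SubElement(flat, "Start")
        ET.SubElement(row, "Account").text = account
        ET.SubElement(row, "Asset").text = asset.text
        ET.SubElement(row, "Value").text = value.text

ET.ElementTree(flat).write("file_flat.xml")
# then: LOAD XML LOCAL INFILE 'file_flat.xml' INTO TABLE my_tablename ROWS IDENTIFIED BY '<Start>';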