Load XML to MySQL - python

I need some input on how best to load the XML file below into MySQL.
I have an XML file that contains info like this:
<Start><Account>0001</Account><Asset>ABC</Asset><Value>500</Value><Asset>DEF</Asset><Value>600</Value></Start>
<Start>.......
When I use
LOAD XML LOCAL INFILE 'file.xml' INTO TABLE my_tablename ROWS IDENTIFIED BY '<Start>';
the file loads successfully but the account column is all NULL.
I.e., select * from my_tablename;
Account | Asset | Value
Null | ABC | 500
Null | DEF | 600
as opposed to
I.e., select * from my_tablename;
Account | Asset | Value
0001 | ABC | 500
0001 | DEF | 600
What's the best way to handle this? Re-format the file in Python first? Another SQL query?
Thank you.

To have the result you need, your XML should be like this:
<Start><account>0001</account><asset>ABC</asset><value>500</value></Start>
<Start><account>0001</account><asset>DEF</asset><value>600</value></Start>
One account, asset and value tag per Start tag.
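If you would rather reshape the file in Python first (as the question suggests), here is a minimal sketch. It assumes the file is just a sequence of <Start> elements with no single root, that the tag names are exactly Account/Asset/Value, and that every <Asset> has a matching <Value>; the file names are placeholders.
import xml.etree.ElementTree as ET

with open('file.xml') as f:
    # The file has no single root element, so wrap it in one to make it parseable.
    root = ET.fromstring('<root>' + f.read() + '</root>')

with open('file_flat.xml', 'w') as out:
    for start in root.findall('Start'):
        account = start.findtext('Account')
        # Pair each Asset with the Value that follows it.
        for asset, value in zip(start.findall('Asset'), start.findall('Value')):
            out.write('<Start><Account>{}</Account><Asset>{}</Asset>'
                      '<Value>{}</Value></Start>\n'.format(account, asset.text, value.text))
After that, the same LOAD XML ... ROWS IDENTIFIED BY '<Start>' statement should fill the Account column on every row, because each <Start> now carries its own <Account> tag.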


Grouping and summing cloudwatch log insights query

I have about 10k logs from log insights in the below format (cannot post actual logs due to privacy rules). I am using boto3 to query the logs.
Log insights query:
filter #message like /ERROR/
Output Logs format:
timestamp:ERROR <some details>Apache error....<error details>
timestamp:ERROR <some details>Connection error.... <error details>
timestamp:ERROR <some details>Database error....<error details>
What I need is to group the errors having similar substring (like group by Connection error, Apache error, Database error) or any other similar errors and get a sum of those.
Expected output:
Apache error 130
Database error 2253
Connection error 3120
Is there some regex or any other way I can use to pull out similar substrings and group them and get the sum? Either in python or in log insights.
It's impossible to say without seeing the source of your data, but you can extract values from logs with a Logs Insights query like:
filter #logStream like 'SOMEHOST'
| parse #message /<EventID.*?>(?<event_id>\d+)<\/EventID>/
| stats count() by event_id
In this case I was parsing Windows event logs to count how many of each event type happened:
| event_id | count() |
|----------|---------|
| 7036     | 80      |
| 7001     | 4       |
| 7002     | 4       |
| 6013     | 1       |
| 7039     | 1       |
| 7009     | 1       |
| 7000     | 1       |
| 7040     | 2       |
| 7045     | 1       |
This query just looked for the EventID XML element. In your case you would need to look at your data to see how best to identify and extract the error. If the error appears in a format that has a field, you can extract it directly; even if there is no field, you can still use a regex as long as there is a pattern to the data.
If you had the logs inside a Python list, you could accomplish this with a regex like this:
import re

logs = [...]  # the error log lines go here

error_counts = {}
for log in logs:
    # Capture the text between "ERROR" and the first "..." that follows it.
    match = re.search(r'ERROR\s+(.*?)\s*\.\.\.', log)
    if match:
        error_type = match.group(1)
        error_counts[error_type] = error_counts.get(error_type, 0) + 1

for error_type, count in error_counts.items():
    print(error_type.ljust(20), count)
If you're using the AWS SDK for Python (boto3) to query the logs, you can submit a Logs Insights query like the one above with the start_query method; the filter and stats clauses go inside the query string, and you then poll get_query_results for the output.
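For completeness, a minimal boto3 sketch of submitting such a query with start_query and polling get_query_results; the log group name, time range, and the parse regex are placeholder assumptions you would adapt to your real log format (Logs Insights field names are normally written with @, e.g. @message).
import time
import boto3

logs = boto3.client('logs')

query = r"""
filter @message like /ERROR/
| parse @message /(?<error_type>\w+ error)/
| stats count() as error_count by error_type
"""

start = logs.start_query(
    logGroupName='/my/log/group',        # placeholder log group
    startTime=int(time.time()) - 3600,   # last hour
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print one dict per result row.
response = logs.get_query_results(queryId=start['queryId'])
while response['status'] in ('Scheduled', 'Running'):
    time.sleep(1)
    response = logs.get_query_results(queryId=start['queryId'])

for row in response['results']:
    print({field['field']: field['value'] for field in row})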

how to update a mysql table efficiently with a pandas dataframe?

I'm doing ETL with Airflow PythonOperator to update a SCD1 dimension table (dim_user).
The structure of the mysql dimension table:
| user_key | open_id | gender | nickname | mobile | load_time | updated_at |
|----------|---------------------|--------|----------|-------------|---------------------|---------------------|
| 117 | ohwv90JTgZSn******* | 2 | ABC | ************| 2019-05-24 10:12:44 | 2019-05-23 19:00:43 |
In the Python script, I have a pandas dataframe df_users_updated with the same structure (except for the user_key and load_time columns).
Now I want to update the mysql table on the condition of open_id field matched:
# database connection (create_engine is from SQLAlchemy)
from sqlalchemy import create_engine

conn = create_engine(db_conn_str)
# update the rows with a for loop
for index, row in df_users_updated.iterrows():
    info = dict(row)
    conn.execute('update dim_user set gender=%s, nickname=%s, mobile=%s, updated_at=%s where open_id=%s',
                 (info['gender'], info['nickname'], info['mobile'], info['updated_at'], info['open_id']))
conn.dispose()
The problem is that I only have 1000 rows in df_users_updated, yet it took over 10 minutes to execute these update queries.
Is there a better way to do this?
Based on my experience, there are some tricks that can improve the performance (see the sketch below):
use the mysqlclient lib and its cursor.executemany(sql, params) method
pass the params as a list of tuples
add an index on the columns used in the WHERE clause
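A minimal sketch of those suggestions with mysqlclient (MySQLdb); the connection parameters are placeholders, and dim_user, its columns, and df_users_updated come from the question.
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='etl', passwd='***', db='dw')  # placeholder credentials
cur = conn.cursor()

sql = ("UPDATE dim_user SET gender=%s, nickname=%s, mobile=%s, updated_at=%s "
       "WHERE open_id=%s")

# Build the parameter tuples once, then hand them all to executemany.
params = [
    (row.gender, row.nickname, row.mobile, row.updated_at, row.open_id)
    for row in df_users_updated.itertuples(index=False)
]

cur.executemany(sql, params)
conn.commit()
cur.close()
conn.close()
The index on dim_user.open_id matters most here: without it, every UPDATE has to scan the whole table to find the matching row.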

Create a search function using Dynamodb and boto3

I'm trying to understand how to create a search function using DynamoDB. This answer helped me to better understand the use of Global Secondary Indexes, but I still have some questions. Suppose we have a structure like this and a GSI called last_name_index:
+------+-----------+----------+---------------+
| User | FirstName | LastName | Email |
+------+-----------+----------+---------------+
| 1001 | Test | Test | test#mail.com |
| 1002 | Jonh | Doe | jdoe#mail.com |
| 1003 | Another | Test | mail#mail.com |
+------+-----------+----------+---------------+
Using boto3 I can now search for a user if I know the last name:
table.query(
    IndexName="last_name_index",
    KeyConditionExpression=Key('LastName').eq(name)
)
But what if I want to search for users and I only know part of the last name? I know there is a contains function in boto3, but it only works with non-index keys. Do I need to change the GSI? Or is there something I'm missing? I want to be able to do something like:
table.query(
    IndexName="last_name_index",
    KeyConditionExpression=Key('LastName').contains(name)  # part of the name
)
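Not an answer to the GSI question, but as an illustration of the limitation mentioned above: contains() can only be used in a FilterExpression, which is applied after items are read, so a partial-match lookup ends up as a scan rather than an indexed key-condition query. A minimal sketch (the table name is a placeholder, name is the search term from the question):
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Users')  # placeholder table name

# contains() is only valid as a FilterExpression, so DynamoDB reads the items
# first and filters them afterwards instead of doing an indexed lookup.
response = table.scan(
    FilterExpression=Attr('LastName').contains(name)  # part of the name
)
users = response['Items']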

After running script, column names not appearing in pgadmin

Sometimes when I run my Python script, which calls shp2pgsql to upload a new table to the database, the table appears in pgAdmin with blank column names:
This one has column names
Usually when I run the script again the problem goes away, and pgAdmin displays a message about database vacuuming. Honestly, the real problem is my boss: he takes this as a sign that something is wrong with my code, and we can't move forward until he sees the names in pgAdmin (by bad luck, when I demonstrated the script it was the one time in ten that it came up without the column names).
In postgres is it even possible to have a table without column names?
Here is the vacuum message
Here is the output from psql's \d (assume XYZ is the name of the project and the name of the db)
xyz => \d asmithe.intersect
Table "asmithe.intersect"
Column | Type | Modifiers
------------+------------------------------+------------------------------------------------------------
gid | integer | not null default nextval('intersection_gid_seq'::regclass)
fid_xyz_09 | integer |
juris_id | character varying(2) |
xyz_plot | numeric |
poly_id | character varying(20) |
layer | character varying(2) |
area | numeric |
perimeter | numeric |
lid_dist | integer |
comm | character varying(252) |
cdate | character varying(30) |
sdate | character varying(30) |
edate | character varying(30) |
afsdate | character varying(30) |
afedate | character varying(30) |
capdate | character varying(30) |
salvage | double precision |
pb_harv | double precision |
utotarea | numeric |
nbacvers | character varying(24) |
totarea | numeric |
areamoda | numeric |
areamodb | numeric |
areamodt | double precision |
areamodv | numeric |
area_intr | numeric |
dist_perct | numeric |
id | double precision |
floodid | double precision |
basr | double precision |
floodmaps | double precision |
floodmapm | double precision |
floodcaus | double precision |
burnclas | double precision |
geom | geometry(MultiPolygon,13862) |
Indexes:
"intersect_pkey" PRIMARY KEY, btree (gid)
Quitting and restarting usually does fix it.
In postgres is it even possible to have a table without column names?
It is possible to create a table with zero columns:
test=> CREATE TABLE zerocolumns();
CREATE TABLE
test=> \d zerocolumns
Table "public.zerocolumns"
Column | Type | Modifiers
--------+------+-----------
but not a zero-width column name:
test=> CREATE TABLE zerowidthcol("" integer);
ERROR: zero-length delimited identifier at or near """"
LINE 1: CREATE TABLE zerowidthcol("" integer);
^
though a column name composed only of a space is permissible:
test=> CREATE TABLE spacecol(" " integer);
CREATE TABLE
test=> \d spacecol
Table "public.spacecol"
Column | Type | Modifiers
--------+---------+-----------
| integer |
Please show the output from psql's \d command if this happens. With only (heavily edited) screenshots I can't tell you anything more useful.
If I had to guess I'd say it's probably a drawing bug in PgAdmin.
Update: The VACUUM message is normal after big changes to a table. Read the message, it explains what is going on. There is no problem there.
There's nothing wrong with the psql output, and since quitting and restarting PgAdmin fixes it, I'm pretty confident you've hit a PgAdmin bug related to drawing or catalog access. If it happens on the current PgAdmin version and you can reproduce it with a script you can share with the public, please post a report on the pgadmin-support mailing list.
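If it helps to show that the columns really are there regardless of what pgAdmin draws, here is a small sketch that reads them straight from the catalog with psycopg2; the connection string is a placeholder, and the schema/table names are the ones from the question.
import psycopg2

conn = psycopg2.connect("dbname=xyz user=asmithe")  # placeholder connection string
cur = conn.cursor()
cur.execute("""
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = %s AND table_name = %s
    ORDER BY ordinal_position
""", ('asmithe', 'intersect'))
for column_name, data_type in cur.fetchall():
    print(column_name, data_type)
cur.close()
conn.close()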
The same happened to me in pgAdmin 1.18.1 when running the DDL (i.e. SQL script that drops and recreates all tables). After restarting pgAdmin or refreshing the database it is working again (just refreshing the table is not sufficient). It seems that pgAdmin simply does not auto-refresh table metadata after the tables are replaced.

Data structure for text corpus database

A text corpus is usually represented in xml as such:
<corpus name="foobar" date="08.09.13" authors="mememe">
<document filename="br-392">
<paragraph pnumber="1">
<sentence snumber="1">
<word wnumber="1" partofspeech="VB" sensetag="012345678-v" nameentity="None">Hello</word>
<word wnumber="2" partofspeech="NN" sensetag="876543210-n" nameentity="World">Foo bar</word>
</sentence>
</paragraph>
</document>
</corpus>
When I put a corpus into a database, I have each row represent a word and the columns its attributes, as such:
| uid    | corpusname | docfilename | pnumber | snumber | wnumber | token  | pos | sensetag    | ne    |
| 198317 | foobar     | br-392      | 1       | 1       | 1       | Hello  | VB  | 012345678-v | None  |
| 192184 | foobar     | br-392      | 1       | 1       | 1       | foobar | NN  | 87654321-n  | World |
I put the data into an sqlite3 database as such:
import sqlite3

# I read the xml file and now it's in memory as such.
w1 = (198317, 'foobar', 'br-392', 1, 1, 1, 'hello', 'VB', '12345678-n', 'Hello')
w2 = (192184, 'foobar', 'br-392', 1, 1, 1, 'foobar', 'NN', '87654321-n', 'World')
wordtokens = [w1, w2]  # in the real script this list holds every word token parsed from the XML

con = sqlite3.connect('semcor.db', isolation_level=None)
cur = con.cursor()
engtable = ("CREATE TABLE eng(uid INT, corpusname TEXT, docname TEXT, "
            "pnum INT, snum INT, tnum INT, "
            "word TEXT, pos TEXT, sensetag TEXT, ne TEXT)")
cur.execute(engtable)
cur.executemany("INSERT INTO eng VALUES(?,?,?,?,?,?,?,?,?,?)", wordtokens)
The purpose of the database is so that I can run queries as such
SELECT * FROM eng WHERE pnum=1;
SELECT * FROM eng WHERE snum=1;
SELECT * FROM eng WHERE snum=1 AND pos='NN' OR sensetag='87654321-n';
SELECT * FROM eng WHERE pos='NN' AND sensetag='87654321-n';
SELECT * FROM eng WHERE docname='br-392';
SELECT * FROM eng WHERE corpusname='foobar';
It seems that when I structure the database as above, the size of the database explodes, because the number of tokens in each corpus can go up to millions or billions.
Other than structuring a corpus with each row representing a word and the columns its attributes and parental attributes, how else could I structure the database so that I can perform the same queries and get the same output?
For the purpose of indexing a large corpus,
should I be using some other database program other than sqlite3?
And should I still use the same schema for the table as I have defined above?
And should I still use the same schema for the table as I have defined above?
From the perspective of relational database design, and because of 1NF, I would use a table per element of the XML file. That saves space and helps DBMS performance, and the desired queries are still possible against such a model.
The draft model would be (see the sketch below):
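A minimal sketch of what such a table-per-element layout could look like in SQLite; the table and column names are illustrative, with each level keeping a foreign key to its parent.
import sqlite3

con = sqlite3.connect('semcor_normalised.db')
con.executescript("""
CREATE TABLE corpus    (corpus_id    INTEGER PRIMARY KEY, name TEXT, date TEXT, authors TEXT);
CREATE TABLE document  (document_id  INTEGER PRIMARY KEY, corpus_id INTEGER REFERENCES corpus, filename TEXT);
CREATE TABLE paragraph (paragraph_id INTEGER PRIMARY KEY, document_id INTEGER REFERENCES document, pnumber INTEGER);
CREATE TABLE sentence  (sentence_id  INTEGER PRIMARY KEY, paragraph_id INTEGER REFERENCES paragraph, snumber INTEGER);
CREATE TABLE word      (word_id      INTEGER PRIMARY KEY, sentence_id  INTEGER REFERENCES sentence, wnumber INTEGER,
                        token TEXT, pos TEXT, sensetag TEXT, ne TEXT);
""")
con.close()
The repeated corpus, document, paragraph, and sentence values are then stored once each, and the word table only carries small integer keys.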
should I be using some other database program other than sqlite3?
That can only be answered from your application's specification: how many records you will have after a month or a year, how many users will be connected, whether the workload is OLTP, OLAP, or mixed, the project's budget, and so on.
BTW, take a look at free RDBMSs like PostgreSQL and MySQL, and at commercial ones like Oracle. For NoSQL solutions, having a look at the post may be helpful.
I guess the obvious answer is "normalisation"... you have an enormous amount of duplicated information per row, and that is going to massively increase the size of your database.
You should work out from each row what is duplicated, and then create a table to contain that data. You will then reduce, for example, a duplicated string containing the corpus name of, say, 20 characters in length to a pointer to a row in the "corpus name" table, which for argument's sake might take just 4 characters as the ID value of that entry.
You don't say what platform you are using, either. If it is a mobile device then it really does pay to normalise your data as much as possible. It makes the code a little more complex, but that is always the space/time trade-off with stuff like this. I am guessing that this is some kind of reference application, in which case pure blinding speed is probably secondary to just making it work.
The mandatory wikipedia link for normalisation
and this YouTube video
Google is your friend, hope that helps. :) Sean
