Sqoop job from python stdout=subprocess.pipe


I am trying to generate a Sqoop command using Python. I am able to pass and fire the Sqoop query. I want to map column names in the Sqoop command with --map-column-java, but the number of columns differs from table to table. Only BLOB and CLOB columns need to be mapped.
Data:
-----------------------------------------------
| COLUMN_NAME | DATA_TYPE |
-----------------------------------------------
| C460 | VARCHAR2 |
| C459 | CLOB |
| C458 | VARCHAR2 |
| C457 | VARCHAR2 |
| C456 | CLOB |
| C8 | BLOB |
| C60901 | VARCHAR2 |
-----------------------------------------------
Sample code:
import re
import subprocess
proc = subprocess.Popen(["sqoop", "eval", "--connect", "jdbc:oracle:thin:@" + config["Production_host"] + ":" + config["port"] + "/" + config['Production_SERVICE_NAME'], "--username", config["Production_User"], "--password", config["Production_Password"], "--query", "SELECT column_name, data_type FROM all_tab_columns where table_name =" + "'" + Tablename + "'"], stdout=subprocess.PIPE)
# Match a C<number> column name that is followed on the same line by CLOB or BLOB
COl_Re = re.compile(r'(?m)(C\d+)(?=.+[CB]LOB)')
# decode() is needed on Python 3, where the pipe yields bytes
columns = COl_Re.findall(proc.stdout.read().decode())
I am able to get the required column names C459, C456 and C8 using the above code. Output: ['C459', 'C456', 'C8']
I should get a new Sqoop query in the format below:
sqoop import "--connect","jdbc:oracle:thin:@" + config["Production_host"]+":"+config["port"]+"/"+config['Production_SERVICE_NAME'],"--username", config["Production_User"], "--password", config["Production_Password"], --table table --fields-terminated-by '|' --map-column-java C456=String,C459=String,C8=String --hive-drop-import-delims --input-null-string '\\N' --input-null-non-string '\\N' --as-textfile --target-dir <Location> -m 1
I only need to add the part --map-column-java C456=String,C459=String,C8=String dynamically, so that my next call to subprocess.call can use it.

Build your Sqoop command in a variable, append or override parameters based on your conditions, and execute it once the final command is built. Hope this helps.
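One way to follow this advice, as a minimal sketch: keep the command as a list of arguments and append --map-column-java only when the regex from the question finds LOB columns. The helper name build_sqoop_import and its parameters (eval_output, connect_url, table, target_dir) are hypothetical, as are the sample URL and paths; the flags are the ones listed in the question.

```python
import re

# Same pattern as in the question: a C<number> column name followed later
# on the same line by CLOB or BLOB.
LOB_COLUMN_RE = re.compile(r'(?m)(C\d+)(?=.+[CB]LOB)')


def build_sqoop_import(eval_output, connect_url, table, target_dir):
    """Return the sqoop import command as an argument list (hypothetical helper)."""
    cmd = ["sqoop", "import", "--connect", connect_url, "--table", table,
           "--fields-terminated-by", "|"]
    lob_columns = LOB_COLUMN_RE.findall(eval_output)
    if lob_columns:
        # Only CLOB/BLOB columns are mapped; skip the flag when there are none.
        mapping = ",".join(col + "=String" for col in sorted(lob_columns))
        cmd += ["--map-column-java", mapping]
    cmd += ["--hive-drop-import-delims",
            "--input-null-string", r"\\N",
            "--input-null-non-string", r"\\N",
            "--as-textfile", "--target-dir", target_dir, "-m", "1"]
    return cmd


sample = """| C460 | VARCHAR2 |
| C459 | CLOB |
| C456 | CLOB |
| C8 | BLOB |"""
command = build_sqoop_import(sample, "jdbc:oracle:thin:@host:1521/service",
                             "MY_TABLE", "/tmp/my_table")
print(command)
```

Because the result is a list, it can be handed to subprocess.call(command) directly, with no shell quoting of '|' or the null strings. The credentials are left out of the sketch and would be appended the same way.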

Related

How do I rewrite this Python code as a SQL query in Redshift

I am trying to see if there is any way I can implement this piece of code using only SQL in Redshift.
a = '''
SELECT to_char(DATE '2022-01-01'
+ (interval '1 day' * generate_series(0,365)), 'YYYY_MM_DD') AS ym
'''
dfa = pd.read_sql(a, conn)
b = f'''
select account_no, {','.join('"' + str(x) + '"' for x in dfa.ym)}
from loan__balance_table
where account_no =
'''
dfb = pd.read_sql(b, conn)
The first query will yield something like this:
| ym |
| ---------- |
| 2022_01_01 |
| 2022_01_02 |
...
| 2022_12_31 |
Then I used string concatenation to combine the dates and used them in the second query to select all the columns listed in ym. The result of the second query should be something like this:
| account_no | 2022_01_01 | 2022_01_02 | ...
| ---------- | ---------- | ---------- | ...
| 1234 | 234,987.09 | 233,989.19 | ...
I just want to know if there is a way to combine both queries into one in SQL, without using Python to concatenate the column names.
I tried using a CTE but I can't seem to get it right, and I don't even know if this is the right approach. The database is Redshift.

psycopg2: cursor.execute storing only table structure, no data

I am trying to store some tables I create in my code in an RDS instance using psycopg2. The script runs without issue and I can see the table being stored correctly in the DB. However, if I try to query the table, I only see the columns, but no data:
import pandas as pd
import psycopg2
test=pd.DataFrame({'A':[1,1],'B':[2,2]})
#connect is a function to connect to the RDS instance
connection= connect()
cursor=connection.cursor()
query='CREATE TABLE test (A varchar NOT NULL,B varchar NOT NULL);'
cursor.execute(query)
connection.commit()
cursor.close()
connection.close()
This script runs without issues and, printing out file_check from the following script:
connection=connect()
# check if file already exists in SQL
sql = """
SELECT "table_name","column_name", "data_type", "table_schema"
FROM INFORMATION_SCHEMA.COLUMNS
WHERE "table_schema" = 'public'
ORDER BY table_name
"""
file_check=pd.read_sql(sql, con=connection)
connection.close()
I get:
table_name column_name data_type table_schema
0 test a character varying public
1 test b character varying public
which looks good.
Running the following however:
read='select * from public.test'
df=pd.read_sql(read,con=connection)
returns:
Empty DataFrame
Columns: [a, b]
Index: []
Anybody have any idea why this is happening? I cannot seem to get around this

Erm, your first script has a test dataframe, but it's never referred to after it's defined.
You'll need to call
test.to_sql("test", connection)
or similar to actually write it, where connection is a SQLAlchemy connection as in the example below.
A minimal example:
$ createdb so63284022
$ python
>>> import sqlalchemy as sa
>>> import pandas as pd
>>> test = pd.DataFrame({'A':[1,1],'B':[2,2], 'C': ['yes', 'hello']})
>>> engine = sa.create_engine("postgresql://localhost/so63284022")
>>> with engine.connect() as connection:
...     test.to_sql("test", connection)
...
>>>
$ psql so63284022
so63284022=# select * from test;
 index | A | B |   C
-------+---+---+-------
     0 | 1 | 2 | yes
     1 | 1 | 2 | hello
(2 rows)
so63284022=# \d+ test
                                   Table "public.test"
 Column |  Type  | Collation | Nullable | Default | Storage  | Stats target | Description
--------+--------+-----------+----------+---------+----------+--------------+-------------
 index  | bigint |           |          |         | plain    |              |
 A      | bigint |           |          |         | plain    |              |
 B      | bigint |           |          |         | plain    |              |
 C      | text   |           |          |         | extended |              |
Indexes:
    "ix_test_index" btree (index)
Access method: heap
so63284022=#

I was able to solve this:
As pointed out by @AKX, I was only creating the table structure, but I was not filling in the table.
I now import psycopg2.extras as well and, after this:
query='CREATE TABLE test (A varchar NOT NULL,B varchar NOT NULL);'
cursor.execute(query)
I add something like:
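update_query='INSERT INTO test(A, B) VALUES(%s,%s) ON CONFLICT DO NOTHING'
# tolist() turns numpy values into plain Python objects that psycopg2 can adapt
psycopg2.extras.execute_batch(cursor, update_query, test.values.tolist())
connection.commit()  # without a commit the inserted rows are discarded on close
cursor.close()
connection.close()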
My table is now correctly filled after checking with pd.read_sql
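The same create-then-insert sequence can be tried without an RDS instance. The sketch below swaps in Python's built-in sqlite3 module for psycopg2, an assumption made only so the example is self-contained; it therefore uses ? placeholders and executemany where the psycopg2 version uses %s and execute_batch. The rows list is a stand-in for the DataFrame's values.

```python
import sqlite3

rows = [("1", "2"), ("1", "2")]  # stand-in for the DataFrame's rows

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# CREATE TABLE only defines the structure; the table starts empty.
cursor.execute("CREATE TABLE test (A varchar NOT NULL, B varchar NOT NULL)")
empty = cursor.execute("SELECT * FROM test").fetchall()

# The rows only exist after an explicit INSERT followed by a commit.
cursor.executemany("INSERT INTO test (A, B) VALUES (?, ?)", rows)
connection.commit()
filled = cursor.execute("SELECT * FROM test").fetchall()

print(empty)
print(filled)
connection.close()
```

The first SELECT returns no rows, which mirrors the empty DataFrame in the question; the second returns the two inserted rows.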

Is there any way to combine two rows of a table into one row using Django ORM?

I have a table with columns named measured_time, data_type and value.
data_type has two possible values, temperature and humidity.
I want to combine two rows of data if they have the same measured_time, using Django ORM.
I am using MariaDB.
Using raw SQL, the following query does what I want:
SELECT T1.measured_time, T1.temperature, T2.humidity
FROM ( SELECT CASE WHEN data_type = 1 then value END as temperature,
CASE WHEN data_type = 2 then value END as humidity ,
measured_time FROM data_table) as T1,
( SELECT CASE WHEN data_type = 1 then value END as temperature ,
CASE WHEN data_type = 2 then value END as humidity ,
measured_time FROM data_table) as T2
WHERE T1.measured_time = T2.measured_time and
T1.temperature IS NOT null and T2.humidity IS NOT null and
DATE(T1.measured_time) = '2019-07-01'
Original Table
| measured_time | data_type | value |
|---------------------|-----------|-------|
| 2019-07-01-17:27:03 | 1 | 25.24 |
| 2019-07-01-17:27:03 | 2 | 33.22 |
Expected Result
| measured_time | temperature | humidity |
|---------------------|-------------|----------|
| 2019-07-01-17:27:03 | 25.24 | 33.22 |

I've never used it and so can't answer in detail, but you can feed a raw SQL query into Django and get the results back through the ORM. Since you have already got the SQL this may be the easiest way to proceed. Documentation here
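Before passing the raw query to Django, it may be possible to simplify it: the self-join can be replaced by conditional aggregation, grouping by measured_time and picking each type's value with MAX(CASE ...). The sketch below checks that idea against the sample rows using Python's built-in sqlite3 module rather than MariaDB or Django, which is an assumption made to keep the example runnable. Table and column names follow the question; the timestamp is written with a space separator so that SQLite's DATE() can parse it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_table (measured_time TEXT, data_type INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?, ?)",
    [("2019-07-01 17:27:03", 1, 25.24), ("2019-07-01 17:27:03", 2, 33.22)],
)

# One output row per measured_time; MAX skips the NULLs produced by the
# CASE branches that do not match the row's data_type.
query = """
SELECT measured_time,
       MAX(CASE WHEN data_type = 1 THEN value END) AS temperature,
       MAX(CASE WHEN data_type = 2 THEN value END) AS humidity
FROM data_table
WHERE DATE(measured_time) = '2019-07-01'
GROUP BY measured_time
"""
result = conn.execute(query).fetchall()
print(result)
```

Unlike the self-join, this version also returns timestamps that have only one of the two readings, with NULL in the other column. A HAVING clause on both aggregates would restore the original behaviour if that matters.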

Python MySQLdb: Inserting duplicate entry into a table with UNIQUE fields

I have a MySQL database that contains a table named commands with the following structure:
+-----------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| input | varchar(3000) | NO | | NULL | |
| inputhash | varchar(66) | YES | UNI | NULL | |
+-----------+---------------+------+-----+---------+----------------+
I am trying to insert rows in it, but only if the inputhash field does not already exist. I thought INSERT IGNORE was the way to do this, but I am still getting warnings.
For instance, suppose that the table already contains
+----+---------+------------------------------------------------------------------+
| id | input | inputhash |
+----+---------+------------------------------------------------------------------+
| 1 | enable | 234a86bf393cadeba1bcbc09a244a398ac10c23a51e7fd72d7c449ef0edaa9e9 |
+----+---------+------------------------------------------------------------------+
Then when using the following Python code to insert a row
import MySQLdb
db = MySQLdb.connect(host='xxx.xxx.xxx.xxx', user='xxxx', passwd='xxxx', db='dbase')
c = db.cursor()
c.execute('INSERT IGNORE INTO `commands` (`input`, `inputhash`) VALUES (%s, %s)', ('enable', '234a86bf393cadeba1bcbc09a244a398ac10c23a51e7fd72d7c449ef0edaa9e9',))
I am getting the warning
Warning: Duplicate entry '234a86bf393cadeba1bcbc09a244a398ac10c23a51e7fd72d7c449ef0edaa9e9' for key 'inputhash'
c.execute('INSERT IGNORE INTO `commands` (`input`, `inputhash`) VALUES (%s, %s)', ('enable','234a86bf393cadeba1bcbc09a244a398ac10c23a51e7fd72d7c449ef0edaa9e9',))
Why does this happen? I thought that the whole point of using INSERT IGNORE on a table with UNIQUE fields is to suppress the error and simply ignore the write attempt?
What is the proper way to resolve this? I suppose I can suppress the warning in Python with warnings.filterwarnings('ignore') but why does the warning appear in the first place?

I hope this helps!
import MySQLdb
db = MySQLdb.connect(host='xxx.xxx.xxx.xxx', user='xxxx', passwd='xxxx',
                     db='dbase')
c = db.cursor()
# On a duplicate key, run a no-op update instead of raising an error or warning.
# Values are passed as parameters so the quoting is handled by the driver.
c.execute('INSERT INTO `commands` (`input`, `inputhash`) VALUES (%s, %s) '
          'ON DUPLICATE KEY UPDATE `inputhash` = `inputhash`',
          ('enable', '234a86bf393cadeba1bcbc09a244a398ac10c23a51e7fd72d7c449ef0edaa9e9'))
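On the side question about warnings.filterwarnings('ignore'): INSERT IGNORE downgrades the duplicate-key error to a warning rather than discarding it, and a blanket 'ignore' filter would silence every other warning in the process as well. A narrower filter can target one warning category. The sketch below uses a stand-in class, DuplicateEntryWarning, and a hypothetical insert_ignore function so that it runs without a database. With MySQLdb the category to pass would presumably be MySQLdb.Warning, which is an assumption to verify against your driver version.

```python
import warnings


class DuplicateEntryWarning(Warning):
    """Stand-in for the driver's warning class (hypothetical)."""


def insert_ignore():
    """Pretend to run INSERT IGNORE and emit warnings the way a driver might."""
    warnings.warn("Duplicate entry for key 'inputhash'", DuplicateEntryWarning)
    warnings.warn("something unrelated", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from a known filter state
    # Ignore only the duplicate-entry category; other warnings still surface.
    warnings.filterwarnings("ignore", category=DuplicateEntryWarning)
    insert_ignore()

remaining = [str(w.message) for w in caught]
print(remaining)
```

Only the unrelated warning is recorded, so genuine problems reported by the driver or other libraries stay visible.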

After running script, column names not appearing in pgadmin

Sometimes when I run my Python script, which calls shp2pgsql to upload a new table to the database, the table appears with blank column names when I view it in pgAdmin:
This one has column names
Usually when I run the script again it fixes the problem, and pgadmin displays a message about database vacuuming. Honestly the problem is my boss because he takes this as a sign there is something wrong with my code and we can't move forward until he sees the names in pgadmin (by chance when I demonstrated the script it was the 1/10 time that it messed up without the column names).
In postgres is it even possible to have a table without column names?
Here is the vacuum message
Here is the output from psql's \d (assume XYZ is the name of the project and the name of the db)
xyz => \d asmithe.intersect
Table "asmithe.intersect"
Column | Type | Modifiers
------------+------------------------------+------------------------------------------------------------
gid | integer | not null default nextval('intersection_gid_seq'::regclass)
fid_xyz_09 | integer |
juris_id | character varying(2) |
xyz_plot | numeric |
poly_id | character varying(20) |
layer | character varying(2) |
area | numeric |
perimeter | numeric |
lid_dist | integer |
comm | character varying(252) |
cdate | character varying(30) |
sdate | character varying(30) |
edate | character varying(30) |
afsdate | character varying(30) |
afedate | character varying(30) |
capdate | character varying(30) |
salvage | double precision |
pb_harv | double precision |
utotarea | numeric |
nbacvers | character varying(24) |
totarea | numeric |
areamoda | numeric |
areamodb | numeric |
areamodt | double precision |
areamodv | numeric |
area_intr | numeric |
dist_perct | numeric |
id | double precision |
floodid | double precision |
basr | double precision |
floodmaps | double precision |
floodmapm | double precision |
floodcaus | double precision |
burnclas | double precision |
geom | geometry(MultiPolygon,13862) |
Indexes:
"intersect_pkey" PRIMARY KEY, btree (gid)
Quitting and restarting usually does fix it.

In postgres is it even possible to have a table without column names?
It is possible to create a table with zero columns:
test=> CREATE TABLE zerocolumns();
CREATE TABLE
test=> \d zerocolumns
Table "public.zerocolumns"
Column | Type | Modifiers
--------+------+-----------
but not a zero-width column name:
test=> CREATE TABLE zerowidthcol("" integer);
ERROR: zero-length delimited identifier at or near """"
LINE 1: CREATE TABLE zerowidthcol("" integer);
^
though a column name composed only of a space is permissible:
test=> CREATE TABLE spacecol(" " integer);
CREATE TABLE
test=> \d spacecol
Table "public.spacecol"
Column | Type | Modifiers
--------+---------+-----------
| integer |
Please show the output from psql's \d command if this happens. With only (heavily edited) screenshots I can't tell you anything more useful.
If I had to guess I'd say it's probably a drawing bug in PgAdmin.
Update: The VACUUM message is normal after big changes to a table. Read the message, it explains what is going on. There is no problem there.
There's nothing wrong with the psql output, and since quitting and restarting PgAdmin fixes it, I'm pretty confident you've hit a PgAdmin bug related to drawing or catalog access. If it happens on the current PgAdmin version and you can reproduce it with a script you can share with the public, please post a report on the pgadmin-support mailing list.

The same happened to me in pgAdmin 1.18.1 when running the DDL (i.e. SQL script that drops and recreates all tables). After restarting pgAdmin or refreshing the database it is working again (just refreshing the table is not sufficient). It seems that pgAdmin simply does not auto-refresh table metadata after the tables are replaced.
