I have a large JSON file (400k lines). I am trying to isolate the following:
Policies- "description"
policy items - "users" and "database values"
JSON FILE - https://pastebin.com/hv8mLfgx
Expected Output from Pandas: https://imgur.com/a/FVcNGsZ
Everything after "Policy Items" is repeated in exactly the same structure throughout the file. I have tried the code below to isolate "users". It doesn't seem to work; I'm trying to dump all of this into a CSV.
Edit: here is a solution I was attempting, but could not get to work correctly - Deeply nested JSON response to pandas dataframe
from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    for item in jsonDF['policies'][0]['policyItems'][0]:
        print('{} - {} - {}'.format(jsonDF['users']))
EDIT 2: I have some working code which is able to grab some of the USERS, but it does not grab all of them. Only 11 out of 25.
from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    pNode = Jnormal(jsonDF['policies'][0]['policyItems'], record_path='users')
    print(pNode.head(500))
EDIT 3: This is the final working copy; however, I am still not copying over all my TABLE data. I set the loop to filter nothing and capture everything, planning to sort it in Excel. Does anyone have any ideas why I cannot capture all the TABLE values?
with open("Ranger_Policies_20190204_195010.json") as file:
    json_data = json.load(file)

with open("test.csv", 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow(('Database name', 'Users', 'Description', 'Table'))
    for policy in json_data['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        db_tables = policy['resources']['table']['values']
        for item in policy['policyItems']:
            users = item['users']
            for dbT in db_tables:
                for user in users:
                    for db in db_values:
                        _ = wr.writerow((db, user, desc, dbT))
Pandas is overkill here: the csv standard module is enough. You just have to iterate over the policies to extract the description and database values, then over the policyItems to extract the users:
with open("Ranger_Policies_20190204_195010.json") as file:
jsonDF = json.load(file)
with open("outputfile.csv", newline='') as fd:
wr = csv.writer(fd)
_ = wr.writerow(('Database name', 'Users', 'Description'))
for policy in js['policies']:
desc = policy['description']
db_values = policy['resources']['database']['values']
for item in policy['policyItems']:
users = item['users']
for user in users:
for db in db_values:
if db != '*':
_ = wr.writerow((db, user, desc))
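If you also want the table values from the third edit, the same loop can carry them along. Here is a sketch based on the structure above; the .get() defaults are an assumption on my part: a policy with no 'table' resource would otherwise raise a KeyError, which may be why some TABLE values never reach the CSV.

with open("outputfile.csv", 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow(('Database name', 'Users', 'Description', 'Table'))
    for policy in jsonDF['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        # Assumption: some policies may define no 'table' resource;
        # default to [''] so the row is still written, with a blank table.
        db_tables = policy['resources'].get('table', {}).get('values', []) or ['']
        for item in policy['policyItems']:
            for user in item['users']:
                for db in db_values:
                    for tbl in db_tables:
                        wr.writerow((db, user, desc, tbl))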
Here is one way to do it; let's assume your JSON data is in a variable called json_data:
import pandas as pd
from itertools import product

def make_dfs(data):
    cols = ['db_name', 'user', 'description']
    for item in data.get('policies'):
        description = item.get('description')
        users = item.get('policyItems', [{}])[0].get('users', [None])
        db_name = item.get('resources', {}).get('database', {}).get('values', [None])
        db_name = [name for name in db_name if name != '*']
        prods = product(db_name, users, [description])
        yield pd.DataFrame.from_records(prods, columns=cols)

df = pd.concat(make_dfs(json_data), ignore_index=True)
print(df)
db_name user description
0 m2_db hive Policy for all - database, table, column
1 m2_db rangerlookup Policy for all - database, table, column
2 m2_db ambari-qa Policy for all - database, table, column
3 m2_db af34 Policy for all - database, table, column
4 m2_db g748 Policy for all - database, table, column
5 m2_db hdfs Policy for all - database, table, column
6 m2_db dh10 Policy for all - database, table, column
7 m2_db gs22 Policy for all - database, table, column
8 m2_db dh27 Policy for all - database, table, column
9 m2_db ct52 Policy for all - database, table, column
10 m2_db livy_pyspark Policy for all - database, table, column
Tested on Python 3.5.1 and pandas==0.23.4
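Note that item.get('policyItems', [{}])[0] only reads the first policyItems entry of each policy; if some policies carry more than one entry, users in the later ones are skipped, which could explain getting only 11 of the 25 users. A variant that walks every entry might look like this (a sketch, untested against the full file):

def make_dfs_all_items(data):
    # Same idea as make_dfs, but iterate over every policyItems entry
    # instead of only the first one.
    cols = ['db_name', 'user', 'description']
    for item in data.get('policies'):
        description = item.get('description')
        db_name = item.get('resources', {}).get('database', {}).get('values', [None])
        db_name = [name for name in db_name if name != '*']
        for policy_item in item.get('policyItems', []):
            users = policy_item.get('users', [None])
            yield pd.DataFrame.from_records(product(db_name, users, [description]), columns=cols)

df = pd.concat(make_dfs_all_items(json_data), ignore_index=True)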
I would like to read a MySQL database in chunks and write its contents to a bunch of CSV files.
While this can be done easily with pandas as below:
df_chunks = pd.read_sql_table(table_name, con, chunksize=CHUNK_SIZE)
for i, df in enumerate(df_chunks):
    df.to_csv("file_{}.csv".format(i))
Assuming I cannot use pandas, what other alternative can I use? I tried using:
import sqlalchemy as sqldb
import csv

CHUNK_SIZE = 100000
table_name = "XXXXX"
host = "XXXXXX"
user = "XXXX"
password = "XXXXX"
database = "XXXXX"
port = "XXXX"

engine = sqldb.create_engine('mysql+pymysql://{}:{}@{}:{}/{}'.format(user, password, host, port, database))
con = engine.connect()
metadata = sqldb.MetaData()
table = sqldb.Table(table_name, metadata, autoload=True, autoload_with=engine)
query = table.select()
proxy = con.execution_options(stream_results=True).execute(query)

cols = [""] + [column.name for column in table.c]
file_num = 0
while True:
    batch = proxy.fetchmany(CHUNK_SIZE)
    if not batch:
        break
    with open("file_{}.csv".format(file_num), 'w', newline='') as f:
        csv_writer = csv.writer(f, delimiter=',')
        csv_writer.writerow(cols)
        #csv_writer.writerows(batch) # while this works, it does not have the index similar to df.to_csv()
        for i, row in enumerate(batch):
            csv_writer.writerow(i + row) # will error here
    file_num += 1
proxy.close()
While using .writerows(batch) works fine, it does not have the index like the result you get from df.to_csv(). I would like to add the row number equivalent as well, but can't seem to prepend it to the row, which is a sqlalchemy.engine.result.RowProxy. How can I do it? Or what other faster alternative can I use?
Look up SELECT ... INTO OUTFILE ...
It will do the task in 1 SQL statement; 0 lines of Python (other than invoking that SQL).
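A minimal sketch of invoking it from the existing SQLAlchemy connection (assumptions: the MySQL account has the FILE privilege, and /tmp/file_0.csv is a path writable by the server process, since INTO OUTFILE writes the file on the server's filesystem, not the client's):

# Sketch only; the output lands on the MySQL *server* host.
con.execute("""
    SELECT * FROM XXXXX
    INTO OUTFILE '/tmp/file_0.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
""")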
I am learning Python and am currently working with it to parse a CSV file.
The CSV file has 3 columns:
Full_name, university, and Birth_Year.
I have successfully loaded, read, and printed the content of a given CSV file into Python, but here’s where I am stuck:
I want to parse ONLY the Full_name column into 3 columns: first, middle, and last. If there are only 2 words in the name, then the middle name should be null.
The resulting parsed output should then be inserted into a SQL db through Python.
Here’s my code so far:
import csv
import sys

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Please enter the csv file too: python name_parsing.py student_info.csv")
        sys.exit()
    else:
        with open(sys.argv[1], "r") as file:
            reader = csv.DictReader(file)  # I use DictReader because the csv file has > 1 col
            # Print the names on the cmd
            for row in reader:
                name = row["Full_name"]
            for name in reader:
                if len(name) == 2:
                    print(first_name = name[0])
                    print(middle_name = None)
                    print(last_name = name[2])
                if len(name) == 3:  # The assumption is that name is either 2 or 3 words only.
                    print(first_name = name[0])
                    print(middle_name = name[1])
                    print(last_name = name[2])
                db.execute("INSERT INTO name (first, middle, last) VALUES(?,?,?)",
                           row["first_name"], row["middle_name"], row["last_name"])
Running the program above gives me no output whatsoever. How do I fix my code to parse the names the right way? Thank you.
I created a sample file based on your description. The content looks as below:
Full_name,University,Birth_Year
Prakash Ranjan Gupta,BPUT,1920
Hari Shankar,NIT,1980
John Andrews,MIT,1950
Arbaaz Aslam Khan,REC,2005
And then I executed the code below. It runs fine in my Jupyter notebook. You can add back the len(sys.argv) != 2 check etc. as you need. I have used a sqlite3 database; I hope this works. In case you want the if/main block added to this, let me know: I can edit.
This follows your code. (Otherwise, I believe you can do this more easily using pandas.)
import csv
import sqlite3

con = sqlite3.connect('name_data.sql')  ## Make DB connection and create a table if it does not exist
cur = con.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS UNIV_DATA
               (FIRSTNAME TEXT,
                MIDDLE_NAME TEXT,
                LASTNAME TEXT,
                UNIVERSITY TEXT,
                YEAR TEXT)''')
with open('names_data.csv') as fh:
    read_data = csv.DictReader(fh)
    for uniData in read_data:
        lst_nm = uniData['Full_name'].split()
        if len(lst_nm) == 2:
            fn, ln = lst_nm
            mn = None
        else:
            fn, mn, ln = lst_nm
        # print(fn, mn, ln, uniData['University'], uniData['Birth_Year'])
        cur.execute('''
            INSERT INTO UNIV_DATA
            (FIRSTNAME, MIDDLE_NAME, LASTNAME, UNIVERSITY, YEAR)
            VALUES(?,?,?,?,?)''',
            (fn, mn, ln, uniData['University'], uniData['Birth_Year'])
        )
con.commit()
cur.close()
con.close()
If you want to read the data in the table UNIV_DATA:
Option 1: (prints the rows in the form of tuples)
import sqlite3
con = sqlite3.connect('name_data.sql') #Make connection to DB and create a connection object
cur = con.cursor() #Create a cursor object
results = cur.execute('SELECT * FROM UNIV_DATA') # Execute the query and store the rows retrieved in 'result'
[print(result) for result in results] #Traverse through 'result' in a loop to print the rows retrieved
cur.close() #close the cursor
con.close() #close the connection
Option 2: (prints all the rows in the form of a pandas data frame; preferably execute in Jupyter)
import sqlite3
import pandas as pd
con = sqlite3.connect('name_data.sql') #Make connection to DB and create a connection object
df = pd.read_sql('SELECT * FROM UNIV_DATA', con) #Query the table and store the result in a dataframe : df
df
When you call name = row["Full_name"] it is going to return a string representing the name, e.g. "John Smith".
In Python, strings can be treated like lists, so in this case if you called len(name) it would return 10, as "John Smith" has 10 characters. As this doesn't equal 2 or 3, nothing will happen in your for loop.
What you need is some way to turn the string into a list containing the first, middle and last names. You can do this using the split function. If you call name.split(" ") it will split the string whenever there is a space. Continuing the above example, this would return ["John", "Smith"], which should make your code work.
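For example, a quick illustration of the difference:

name = "John Smith"
print(len(name))         # 10 -- length in characters, not words
parts = name.split(" ")
print(parts)             # ['John', 'Smith']
print(len(parts))        # 2 -- now the length check works as intended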
I am using Python and SQLAlchemy to fetch data from a table.
import sqlalchemy as db
import pandas as pd

DATABASE_URI = 'postgres+psycopg2://postgres:postgresql@localhost:5432/postgres'
engine = db.create_engine(DATABASE_URI)
connection = engine.connect()
metadata = db.MetaData()
project_table = db.Table('project', metadata, autoload=True, autoload_with=engine)
Here I want to fetch records based on a list of ids which I have:
l=[557997, 558088, 623106, 558020, 623108, 557836, 557733, 622792, 623511, 623185]
query1 = db.select([project_table]).where(project_table.columns.project_id.in_(l))
#sql query= "select * from project where project_id in l"
Result = connection.execute(query1)
Rset = Result.fetchall()
df = pd.DataFrame(Rset)
print(df.head())
Here when I print df.head() I am getting an empty dataframe; I am not able to pass a list to the above query. Is there a way to pass a list to in_ in the above query?
The result should contain the rows in the table whose project_id equals one of the given ids,
i.e.
project_id project_name project_date project_developer
557997 Test1 24-05-2011 Ajay
558088 Test2 24-06-2003 Alex
These rows will be inserted into the dataset.
The query is:
"select * from project where project_id in (557997, 558088, 623106, 558020, 623108, 557836, 557733, 622792, 623511, 623185)"
Here, as I can't give static values, I will insert the values into a list and pass the list to the query as a parameter.
This is where I am having a problem: I can't pass a list as a parameter to db.select(). How can I pass a list to db.select()?
After many trials I found out that, because of the large amount of data the query fetches and the limited RAM on my workstation, the query returned null (no results). So what I did was:
table_data = []  # accumulate rows chunk by chunk
Result = connection.execute(query1)
while True:
    rows = Result.fetchmany(10000)
    if not rows:
        break
    for row in rows:
        table_data.append(row)

df1 = pd.DataFrame(table_data)
df1.columns = columns  # 'columns' holds the column names of the project table
After this the program was working fine.
I want to extract table information from a sqlite file.
I could list all the table names following this page, and tried to extract table information using the query method on the session instance. But I got the following error.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: ComponentSizes [SQL: 'SELECT ComponentSizes']
Does anyone know how I should revise the following code in order to extract a table by specifying its name?
from sqlalchemy import create_engine, inspect
from sqlalchemy.orm import sessionmaker

class read():
    def __init__(self, path):
        engine = create_engine("sqlite:///" + path)
        inspector = inspect(engine)
        for table_name in inspector.get_table_names():
            for column in inspector.get_columns(table_name):
                #print("Column: %s" % column['name'])
                print(table_name + " : " + column['name'])
        Session = sessionmaker(bind=engine)
        self.session = Session()

    def getTable(self, name):
        table = self.session.query(name).all()
        return table

if __name__ == '__main__':
    test = read(sqlFile)
    test.getTable('ComponentSizes')
The error you are getting is suggestive of what is going wrong. Your code translates into the SQL SELECT ComponentSizes, which is incomplete. It's not clear what your end goal is. If you want to extract the contents of a table into a CSV, you could do this:
import sqlite3
import csv

con = sqlite3.connect('mydatabase.db')
outfile = open('mydump.csv', 'w', newline='')  # use 'wb' on Python 2
outcsv = csv.writer(outfile)
cursor = con.execute('select * from ComponentSizes')
# dump column titles (optional)
outcsv.writerow(x[0] for x in cursor.description)
# dump rows
outcsv.writerows(cursor.fetchall())
outfile.close()
Otherwise, if you want the contents of the table in a pandas df for further analysis, you could do this:
import sqlite3
import pandas as pd
# Create your connection.
cnx = sqlite3.connect('file.db')
df = pd.read_sql_query("SELECT * FROM ComponentSizes", cnx)
Hope it helps. Happy coding!
I have a Sqlite 3 and/or MySQL table named "clients".
Using Python 2.6, how do I create a csv file named Clients100914.csv with headers?
excel dialect...
The SQL execute select * only gives table data, but I would like the complete table with headers.
How do I create a record set to get the table headers? The table headers should come directly from SQL, not be written in Python.
w = csv.writer(open(Fn,'wb'),dialect='excel')
#w.writelines("header_row")
#Fetch into sqld
w.writerows(sqld)
This code leaves me with the file open and no headers. Also, I can't figure out how to use the file as a log.
import csv
import sqlite3
from glob import glob; from os.path import expanduser
conn = sqlite3.connect(  # open "places.sqlite" from one of the Firefox profiles
    glob(expanduser('~/.mozilla/firefox/*/places.sqlite'))[0]
)
cursor = conn.cursor()
cursor.execute("select * from moz_places;")

with open("out.csv", "w", newline='') as csv_file:  # Python 3 version
#with open("out.csv", "wb") as csv_file:  # Python 2 version
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow([i[0] for i in cursor.description])  # write headers
    csv_writer.writerows(cursor)
PEP 249 (DB API 2.0) has more information about cursor.description.
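Per PEP 249, cursor.description is a sequence of 7-item sequences, one per result column, with the column name in position 0; that is what the header line above relies on. In sqlite3 the six items after the name are always None:

# cursor.description looks like:
# (('id', None, None, None, None, None, None),
#  ('url', None, None, None, None, None, None), ...)
headers = [col[0] for col in cursor.description]  # just the column names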
Using the csv module is very straightforward and made for this task.
import csv

with open("out.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'address', 'phone', 'etc'])
    writer.writerow(['bob', '2 main st', '703', 'yada'])
    writer.writerow(['mary', '3 main st', '704', 'yada'])
Creates exactly the format you're expecting.
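The resulting out.csv:

name,address,phone,etc
bob,2 main st,703,yada
mary,3 main st,704,yada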
You can easily create it manually, writing a file with a chosen separator. You can also use the csv module.
If it's from a database you can also just use a query from your sqlite client:
sqlite <db params> < queryfile.sql > output.csv
Which will create a delimited text file; the separator depends on the client's .mode/.separator settings.
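For instance, a sketch that drives the sqlite3 command-line shell from Python 3, so the headers and CSV formatting come from the client itself (assuming the sqlite3 binary is on PATH and hypothetical database/table names clients.db and clients):

import subprocess

# The dot-commands turn on headers and CSV mode before the query runs.
subprocess.run(
    ["sqlite3", "clients.db"],
    input=".headers on\n.mode csv\n.output Clients100914.csv\nselect * from clients;\n",
    text=True,
    check=True,
)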
How to extract the column headings from an existing table:
You don't need to parse an SQL "create table" statement. This is fortunate, as the "create table" syntax is neither nice nor clean, it is warthog-ugly.
You can use the table_info pragma. It gives you useful information about each column in a table, including the name of the column.
Example:
>>> #coding: ascii
... import sqlite3
>>>
>>> def get_col_names(cursor, table_name):
... results = cursor.execute("PRAGMA table_info(%s);" % table_name)
... return [row[1] for row in results]
...
>>> def wrong_way(cur, table):
... import re
... cur.execute("SELECT sql FROM sqlite_master WHERE name=?;", (table, ))
... sql = cur.fetchone()[0]
... column_defs = re.findall("[(](.*)[)]", sql)[0]
... first_words = (line.split()[0].strip() for line in column_defs.split(','))
... columns = [word for word in first_words if word.upper() != "CONSTRAINT"]
... return columns
...
>>> conn = sqlite3.connect(":memory:")
>>> curs = conn.cursor()
>>> _ignored = curs.execute(
... "create table foo (id integer, name text, [haha gotcha] text);"
... )
>>> print get_col_names(curs, "foo")
[u'id', u'name', u'haha gotcha']
>>> print wrong_way(curs, "foo")
[u'id', u'name', u'[haha'] <<<<<===== WHOOPS!
>>>
Other problems with the now-deleted "parse the create table SQL" answer:
Stuffs up with e.g. create table test (id1 text, id2 int, msg text, primary key(id1, id2)) ... needs to ignore not only CONSTRAINT, but also keywords PRIMARY, UNIQUE, CHECK and FOREIGN (see the create table docs).
Needs to specify re.DOTALL in case there are newlines in the SQL.
In line.split()[0].strip() the strip is redundant.
This is simple and works fine for me.
Let's say you have already connected to your database table and got a cursor object. So, following on from that point:
import csv

curs = conn.cursor()
curs.execute("select * from orders")
# Note: this assumes a cursor that returns rows as dictionaries
# (e.g. a DictCursor); a plain cursor returns tuples, which have no .keys().
m_dict = list(curs.fetchall())

with open("mycsvfile.csv", "wb") as f:  # use "w", newline='' on Python 3
    w = csv.DictWriter(f, m_dict[0].keys())
    w.writerow(dict((fn, fn) for fn in m_dict[0].keys()))  # header row
    w.writerows(m_dict)
Unless I'm missing something, you just want to do something like so...

f = open("somefile.csv", 'w')
f.write("header_row\n")
# logic to write lines to file (you may need to organize values and add commas or pipes etc...)
f.close()
It can easily be done using pandas and sqlite3, in extension to the answer from Cristian Ciupitu.
import sqlite3
import pandas as pd
from glob import glob; from os.path import expanduser

conn = sqlite3.connect(glob(expanduser('data/clients_data.sqlite'))[0])
cursor = conn.cursor()
Now use pandas to read the table and write to csv.
clients = pd.read_sql('SELECT * FROM clients', conn)
clients.to_csv('data/Clients100914.csv', index=False)
This is more direct and works all the time.
The below code works for Oracle with Python 3.6:
import cx_Oracle
import csv

# Create tns
dsn_tns = cx_Oracle.makedsn('<host>', '<port>', service_name='<service_name>')
# Connect to DB using user, password and tns settings
conn = cx_Oracle.connect(user='<user>', password='<pass>', dsn=dsn_tns)
c = conn.cursor()

# Execute the query
c.execute("select * from <table>")

# Write results into CSV file
with open("<file>.csv", "w", newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow([i[0] for i in c.description])  # write headers
    csv_writer.writerows(c)