I am working with SQLAlchemy and want to fetch data from the database and convert it into JSON format.
I have the code below:
db_string = "postgres://user:pwd#10.**.**.***:####/demo_db"
Base = declarative_base()
db = create_engine(db_string)
record = db.execute("SELECT name, columndata, gridname, ownerid, issystem, ispublic, isactive FROM col.layout WHERE (ispublic=1 AND isactive=1) OR (isactive=1 AND ispublic=1 AND ownerid=ownerid);")
for row in record:
result.append(row)
print(result)
Data is coming in this format:
[('layout-1', {'theme': 'blue', 'sorting': 'price_down', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RealTimeGrid', 1, 0, 1, 1), ('layout-2', {'theme': 'orange', 'sorting': 'price_up', 'filtering': ['FX Rate', 'Start Price']}, 'RealBalancing Grid', 2, 0, 1, 1), ('layout-3', {'theme': 'red', 'sorting': 'mv_price', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RT', 3, 0, 1, 1)]
But I am facing a lot of issues converting the above result into JSON format. Please suggest.
Your data is basically a list of tuples.
For example, one tuple looks like this:
('layout-3',
{'filtering': ['Sub Strategye', 'PM Strategy'],
'sorting': 'mv_price',
'theme': 'red'},
'RT',
3,
0,
1,
1)
If you want to convert the whole data as-is to JSON, you can use the json module's dumps function:
import json
jsn_data = json.dumps(data)  # data is the list of tuples returned by your query
Your list of tuples is converted to JSON:
[["layout-1", {"theme": "blue", "sorting": "price_down", "filtering": ["Sub Strategye", "PM Strategy"]}, "RealTimeGrid", 1, 0, 1, 1], ["layout-2", {"theme": "orange", "sorting": "price_up", "filtering": ["FX Rate", "Start Price"]}, "RealBalancing Grid", 2, 0, 1, 1], ["layout-3", {"theme": "red", "sorting": "mv_price", "filtering": ["Sub Strategye", "PM Strategy"]}, "RT", 3, 0, 1, 1]]
But if you need the JSON as key/value pairs, you first need to convert the result into a Python dictionary and then use json.dumps(dictionary_Var).
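For example, a minimal sketch that pairs each row with the column names from the SELECT in the question (the column list below simply mirrors that query; adjust it if the query changes):

import json

columns = ['name', 'columndata', 'gridname', 'ownerid', 'issystem', 'ispublic', 'isactive']

# build one dictionary per row, then serialize the whole list
result_dicts = [dict(zip(columns, row)) for row in result]
json_data = json.dumps(result_dicts)
print(json_data)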
What you want to accomplish is called "serialization".
You can follow Sudhanshu Patel's answer if you just want to dump json into response.
However, if you intend to produce a more sophisticated application, consider using a serialization library. You'll be able to load request data into the db, check that the input data is in the right format, and send responses in a standardised format.
Check these libraries:
Marshmallow
Python's own Pickle
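As a rough illustration of the Marshmallow route (a sketch only; the schema fields simply mirror the columns from the question and are assumptions, not something dictated by the library):

from marshmallow import Schema, fields

class LayoutSchema(Schema):
    name = fields.Str()
    columndata = fields.Dict()
    gridname = fields.Str()
    ownerid = fields.Int()
    issystem = fields.Int()
    ispublic = fields.Int()
    isactive = fields.Int()

# result_dicts is the list of dicts built from the query result (see the sketch above)
schema = LayoutSchema(many=True)
json_data = schema.dumps(result_dicts)  # returns a JSON string in marshmallow 3.x

The payoff over a plain json.dumps is that the same schema can also validate incoming request data via schema.load().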
I had a very simple idea: Use Python Pandas (for convenience) to do some simple database operations with moderate data amounts and write the data back to S3 in Parquet format.
Then, the data should be exposed to Redshift as an external table in order to not take storage space from the actual Redshift cluster.
I found two ways to do that.
Given the data:
from datetime import date, datetime

import pandas as pd

data = {
'int': [1, 2, 3, 4, None],
'float': [1.1, None, 3.4, 4.0, 5.5],
'str': [None, 'two', 'three', 'four', 'five'],
'boolean': [True, None, True, False, False],
'date': [
date(2000, 1, 1),
date(2000, 1, 2),
date(2000, 1, 3),
date(2000, 1, 4),
None,
],
'timestamp': [
datetime(2000, 1, 1, 1, 1, 1),
datetime(2000, 1, 1, 1, 1, 2),
None,
datetime(2000, 1, 1, 1, 1, 4),
datetime(2000, 1, 1, 1, 1, 5),
]
}
df = pd.DataFrame(data)
df['int'] = df['int'].astype(pd.Int64Dtype())
df['date'] = df['date'].astype('datetime64[D]')
df['timestamp'] = df['timestamp'].astype('datetime64[s]')
The type casts at the end are necessary in both cases to ensure that Pandas' type recognition does not interfere.
With PyArrow:
Using PyArrow, you do it like this:
import pyarrow as pa
pyarrow_schema = pa.schema([
    ('int', pa.int64()),
    ('float', pa.float64()),
    ('str', pa.string()),
    ('boolean', pa.bool_()),  # field name must match the DataFrame column
    ('date', pa.date64()),
    ('timestamp', pa.timestamp(unit='s'))
])
df.to_parquet(
    path='pyarrow.parquet',
    schema=pyarrow_schema,
    engine='pyarrow'
)
Why use PyArrow: Pandas' default engine for Parquet export is PyArrow, so you can expect good integration. Also, PyArrow provides extensive features and caters for many datatypes.
With fastparquet:
First you need to write out the data with these additional steps:
from fastparquet import write
write('fast.parquet', df, has_nulls=True, times='int96')
The important bit here is the 'times' parameter. See this post, where I found a remedy for the 'date' column.
Why use fastparquet: Fastparquet is much more limited than PyArrow, especially when it comes to accepted datatypes. On the other hand, the package is much smaller.
The external table:
Given that you have exported your data to Parquet and stored it in S3, you can then expose it to Redshift like this:
CREATE EXTERNAL TABLE "<your_external_schema>"."<your_table_name>" (
    "int" bigint,
    "float" float,
    "str" varchar(255),
    "boolean" bool,
    "date" date,
    "timestamp" timestamp)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    's3://<your_bucket>/<your_prefix>/';
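For completeness, a small sketch of getting the locally written Parquet file into that S3 location with boto3 (bucket and prefix are the same placeholders as in the DDL above; the file name matches the PyArrow example):

import boto3

s3 = boto3.client('s3')
# upload the file under the prefix the external table's LOCATION points at
s3.upload_file('pyarrow.parquet', '<your_bucket>', '<your_prefix>/pyarrow.parquet')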
Final story and note:
When I started working with Pandas, Parquet and external Redshift tables in the context of AWS Lambda functions, everything was fine for a while, until the bundle for my Lambda package reached its allowed limit (deployment package size). Checking which of my dependencies made up all that weight, I found PyArrow, Pandas and Numpy (a dependency of Pandas) to be the culprits. While I definitely could not drop Numpy (for efficiency) and did not want to lose Pandas (convenience, again), I looked to replace PyArrow with something more lightweight. Et voilà: fastparquet. After some research and a lot of experimentation, I could make this work as well.
I hope some other people find this explanation and these resources helpful.
The question already holds the answer. :)
I have more than 6000 XML files I want to parse and save as CSV (or anything else for storage).
I need to perform a JOIN for each XML file to join them into one big dataframe.
The problem is that the process takes very long and uses too much memory.
I am wondering whether SQL could solve the problem: would it be faster and consume less memory?
import pandas as pd

def get_data(lst):
    results = pd.DataFrame()
    errors = []
    for data in lst:
        try:
            df = parseXML_Annual(data)
            try:
                results = results.join(df, how="outer")
            except:
                # fall back to using df directly if the join fails
                results = df
        except:
            # record files that failed to parse
            errors.append(data)
    return results, errors
results, errors = get_data(lst_result)
As I can see from your sample, the entire XML file is related to the same company. To me it sounds like you need to add this as a new row, not join it as a table. In my understanding, you want to have some list of metrics for each company. If so, you can probably just stick with key-value storage. If Python is your primary tool, use a dictionary and then save it as a JSON file.
In your for loop, just fill a blank dictionary with data from the XML, like this:
report = {
"apple": {
'metricSet1': {"m11": 5, "m12": 2, "m13": 3},
'metricSet2': {"m21": 4, "m22": 5, "m23": 6}
},
"google": {
'metricSet1': {"m11": 1, "m12": 13, "m13": 3},
'metricSet2': {"m21": 9, "m22": 0, "m23": 11}
},
"facebook": {
'metricSet1': {"m11": 1, "m12": 9, "m13": 9},
'metricSet2': {"m21": 7, "m22": 2, "m23": 4}
}
}
When you need to query it or fill some table with data, do something like this:
for k in report.keys():
    row = [
        k,
        report[k]["metricSet1"]["m12"],
        report[k]["metricSet2"]["m22"],
        report[k]["metricSet2"]["m23"]
    ]
    print(row)
If the data structure is not changing (say all these XML files are the same), it would make sense to store it in a SQL database, creating a table for each metric set. If the XML structure may vary, then just keep it as a JSON file, or perhaps in some key-value based database, like Mongo.
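As a rough sketch of that approach (parse_xml_to_dict is a hypothetical variant of your parseXML_Annual that returns the company name and its metrics as a plain dict rather than a DataFrame):

import json

report = {}
for data in lst_result:
    company, metrics = parse_xml_to_dict(data)  # hypothetical helper, see note above
    report[company] = metrics

# a single pass, no repeated DataFrame joins, and the file is easy to reload later
with open('report.json', 'w') as f:
    json.dump(report, f)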
I am trying to pull multiple values from consul.
After pulling data using the following code:
import consul

c = consul.Consul("consulServer")
index, data = c.kv.get("key", recurse=True)  # recurse fetches every key under the prefix
print(data)
I am getting the following JSON in my data list:
[{
    'LockIndex': 0,
    'ModifyIndex': 54,
    'Value': '1',
    'Flags': 0,
    'Key': 'test/one',
    'CreateIndex': 54
}, {
    'LockIndex': 0,
    'ModifyIndex': 69,
    'Value': '2',
    'Flags': 0,
    'Key': 'test/two',
    'CreateIndex': 69
}]
I want to transform this output into a key:value JSON file. For this example it should look like this:
{
"one": "1",
"two": "2"
}
I have two questions:
1. Is there a better way to get multiple values from consul kv?
2. Assuming there is no better way, what is the best way to convert the json from the first example to the second one?
Thanks,
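For the second part, a minimal sketch (assuming every key shares the test/ prefix as in the example; depending on the client version, Value may come back as bytes and need decoding):

import json

flat = {}
for item in data:  # data is the list returned by c.kv.get(..., recurse=True)
    key = item['Key'].split('/')[-1]  # 'test/one' -> 'one'
    value = item['Value']
    if isinstance(value, bytes):  # some client versions return bytes
        value = value.decode('utf-8')
    flat[key] = value

print(json.dumps(flat))
# {"one": "1", "two": "2"}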
I am fairly new to Python and have several nested JSON files I need to convert to CSVs.
The structure of these is the following:
{'Z': {'#SchemaVersion': 9,
'FD': [{'FDRecord': [{'NewCase': {'#TdrRecVer': 5,
'CaseLabel': '',
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5815D34615C15690936B822714009468',
'MsecTime': 5012,
'RecId': 4},
'UniqueCaseId': '5389F346136315497325122714009468'}},
{'NewCase': {'#TdrRecVer': 5,
'CaseLabel': '',
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5819D346166610458312622714009468',
'MsecTime': 9459,
'RecId': 4},
'UniqueCaseId': '5819F346148627009653284714009468'}},
{'AnnotationEvt': {'#EvtName': 'New',
'#TdrRecVer': 1,
'DevEvtCode': 13,
'Payload': '0 0 0 0',
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5899D34616BC1000938B824538219968',
'MsecTime': 7853,
'RecId': 8},
'TreatmentSummary': 1,
'XidCode': '0000000B'}},
{'TrendRpt': {'#TdrRecVer': 9,
'CntId': 0,
'DevEvtCode': 30,
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5819C34616781004852225698409468',
'MsecTime': 4052,
'RecId': 12}, ...
My problem is that most examples online show how to read in a very small JSON file and write it out to a CSV by explicitly stating the keys or field names when creating the CSV. My files are far too large to do this, some of them being over 40 MB.
I tried following another person's example from online (below), but did not succeed:
import json
import csv

with open('path-where-json-located') as file:
    data_parse = json.load(file)

data = data_parse['Z']

data_writer = open('path-for-csv', 'w')
csvwriter = csv.writer(data_writer)

count = 0
for i in data:
    if count == 0:
        header = i.keys()
        csvwriter.writerow(header)
        count += 1
    csvwriter.writerow(i.values())

data_writer.close()
When I run this, I get the following error:
AttributeError: 'str' object has no attribute 'keys'
I understand that for some reason it is treating the key I want to pull as a string object, but I do not know how to get around this and correctly parse the data into a CSV.
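A sketch of one way around this, not a definitive fix: iterating over data_parse['Z'] yields the dictionary's keys (strings such as '#SchemaVersion' and 'FD'), which is why .keys() fails on a str. The row-like records in the sample sit under Z -> FD -> FDRecord, so flattening those with pandas.json_normalize (assuming a reasonably recent pandas and that these paths match your files) avoids spelling out the header by hand:

import json
import pandas as pd

with open('path-where-json-located') as file:
    data_parse = json.load(file)

# collect the individual records from every FD entry
records = []
for fd in data_parse['Z']['FD']:
    records.extend(fd['FDRecord'])

# json_normalize flattens the nested dicts into dotted column names
df = pd.json_normalize(records)
df.to_csv('path-for-csv', index=False)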
I have successfully created some data bar plots with the Python module XlsxWriter using its conditional_format method.
However, is it possible to specify the fill pattern of a conditional format within XlsxWriter, or Python generally?
I tried the following code, which didn't work:
myformat = workbook.add_format()
myformat.set_pattern(1)
worksheet.conditional_format(0, 4, 0, 4, {'type': 'data_bar',
                                          'min_value': 0,
                                          'min_type': 'num',
                                          'max_value': 110,
                                          'max_type': 'num',
                                          'bar_color': '#C00000',
                                          'format': myformat})
It is possible to set the pattern of the format for some conditional formats with XlsxWriter.
However, as far as I know, it isn't possible in Excel to set a pattern type for data_bar conditional formats.
You could do it with a cell format:
import xlsxwriter
workbook = xlsxwriter.Workbook('hello_world3.xlsx')
worksheet = workbook.add_worksheet()
myformat = workbook.add_format()
myformat.set_pattern(1)
myformat.set_bg_color('#C00000')
worksheet.conditional_format(0, 4, 0, 4, {'type': 'cell',
                                          'criteria': 'between',
                                          'minimum': 0,
                                          'maximum': 110,
                                          'format': myformat})
worksheet.write(0, 4, 50)
workbook.close()
If, on the other hand, you are looking for a non-gradient data_bar fill, that isn't currently supported.
For newer versions, you only need to set the 'data_bar_2010' property to True:
worksheet.conditional_format(0, 4, 0, 4, {'type': 'data_bar','data_bar_2010': True})
See the XlsxWriter documentation here.
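For reference, a minimal self-contained sketch of that newer-version approach (file name and cell value are arbitrary):

import xlsxwriter

workbook = xlsxwriter.Workbook('data_bar_2010.xlsx')
worksheet = workbook.add_worksheet()

worksheet.write(0, 4, 50)  # a value for the data bar to act on

# Excel 2010-style data bar, supported by newer XlsxWriter releases
worksheet.conditional_format(0, 4, 0, 4, {'type': 'data_bar',
                                          'data_bar_2010': True})

workbook.close()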