Get schema using SODA API and Python

I am using Python to download a city's data. I want to get the schema (column names and data types) with Python for the columns downloaded with sodapy. I can get the data fine, but it is all of type string. It would be nice to have the data types so I could build a proper schema to load the data into.
On the website, they have data types laid out for the columns.
https://data.muni.org/Housing-and-Homelessness/CAMA-Property-Inventory-Residential-with-Details/r3di-nq2j
There is some metadata at this URL, but it does not have the column info.
https://data.muni.org/api/views/metadata/v1/r3di-nq2j

Unable to comment due to low rep, so adding a new answer, in case anyone else comes across this in search results, as I did.
Omitting /metadata/v1/ from the OP's original URL, i.e. https://data.muni.org/api/views/r3di-nq2j, will return the full table metadata.
Additionally, Socrata HTTP response headers include special fields (X-SODA2-Fields and X-SODA2-Types) containing the column names and data types returned in the results.
Ref: https://dev.socrata.com/docs/response-codes.html
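As a rough sketch in Python with requests (the endpoint is the one from the question; the exact keys of the columns list, such as "name" and "dataTypeName", are assumptions to verify against the live response):

import requests

# Full view metadata, which should include column names and datatypes.
view = requests.get("https://data.muni.org/api/views/r3di-nq2j").json()
for col in view.get("columns", []):
    print(col["name"], col["dataTypeName"])

# The schema also comes back in the response headers of a normal
# SODA resource request (X-SODA2-Fields / X-SODA2-Types).
resp = requests.get("https://data.muni.org/resource/r3di-nq2j.json",
                    params={"$limit": 1})
print(resp.headers.get("X-SODA2-Fields"))
print(resp.headers.get("X-SODA2-Types"))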

There may be a better way, but this URL returns a JSON response with the information.
http://api.us.socrata.com/api/catalog/v1?ids=r3di-nq2j
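A hedged sketch of pulling the column names and types out of that catalog response (the results/resource key names below are assumptions to check against the actual payload):

import requests

catalog = requests.get(
    "http://api.us.socrata.com/api/catalog/v1",
    params={"ids": "r3di-nq2j"},
).json()

resource = catalog["results"][0]["resource"]
# columns_name and columns_datatype are assumed to be parallel lists.
schema = dict(zip(resource["columns_name"], resource["columns_datatype"]))
print(schema)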

Related

Pandas df `to_gbq` with nested data

I'm working in a limited Airflow environment in which I don't have access to google-cloud-bigquery but do have access to pandas-gbq. My goal is to load some JSON API data using some schema involving records into a BigQuery table. My strategy is to first read all the data into a pandas dataframe using a dictionary to represent the records: e.g.
uuid | metadata1                                 | ...
001  | {u'time_updated': u'', u'name': u'jeff'}  | ...
Then, I've been trying to use pandas_gbq.to_gbq to load this into BigQuery. The issue is that I get:
Error at Row: 0, Reason: invalid, Location: metadata1, Message: This field: metadata1 is not a record.
I realize this is because, as the Google Cloud documentation says, pandas-gbq "Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values."
And so I won't be able to upload a dataframe with records to BQ in this way since again I can't use google-cloud-bigquery in my environment.
What would be the best strategy for me to upload my data to BQ (around 30k rows and 6 or so columns with 8ish nested fields each)?
I know this sounds like a very bad strategy, but I could upload a flattened version of all the fields in a record as a single string to the BQ table and then run a query from my code to replace these flattened fields with their record-form versions. But this seems really bad, since for a time the table would contain the wrong schema.
Any thoughts would be much appreciated. Thanks in advance.
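For illustration, a minimal sketch of the flattening workaround described above (the table and project names are placeholders): serialize each nested column to a JSON string so that pandas-gbq can upload plain STRING columns, and transform them back inside BigQuery afterwards.

import json
import pandas as pd
import pandas_gbq

df = pd.DataFrame({
    "uuid": ["001"],
    "metadata1": [{"time_updated": "", "name": "jeff"}],
})

# pandas-gbq sends CSV, which cannot represent nested values,
# so serialize the record columns to JSON strings first.
nested_cols = ["metadata1"]
for col in nested_cols:
    df[col] = df[col].apply(json.dumps)

# "my_dataset.my_table" and "my-project" are placeholders.
pandas_gbq.to_gbq(df, "my_dataset.my_table",
                  project_id="my-project", if_exists="replace")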

Pyspark: Read in only certain fields from nested json data

I'm trying to create a Spark job that can read in thousands of JSON files, perform some actions, and then write out to file (S3) again.
It is taking a long time and I keep running out of memory. I know that Spark tries to infer the schema if one isn't given. The obvious thing to do would be to supply the schema when reading in. However, the schema changes from file to file depending on many factors that aren't important. There are about 100 'core' columns that are in all files, and these are the only ones I want.
Is it possible to write a partial schema that only reads the specific fields I want into spark using pyspark?
First, it is recommended to use a JSON Lines file in which each line contains a single JSON record. In general, you can read just a specific set of fields from a big JSON document, but that should not be considered Spark's job. You should have an initial method that converts your JSON input into an object of a serializable datatype, and feed that object into your Spark pipeline.
Passing the schema is not an appropriate design here and just makes the problem more severe. Instead, define a single method that extracts the specific fields after loading the data from the files, for example as sketched below. You can use the following link to find out how to extract specific fields and values from a JSON string in Python: How to extract specific fields and values from a JSON with python?
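A rough sketch of that approach, reading the files as plain text and keeping only the core fields before building a DataFrame (the field names and S3 paths are placeholders, and one JSON object per line is assumed):

import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder list standing in for the ~100 core columns.
CORE_FIELDS = ["id", "name", "timestamp"]
schema = StructType([StructField(f, StringType(), True) for f in CORE_FIELDS])

def extract_core(line):
    """Parse one JSON line and keep only the core fields, cast to strings."""
    record = json.loads(line)
    return tuple(str(record[f]) if f in record else None for f in CORE_FIELDS)

# Placeholder paths; each input file is assumed to be JSON Lines.
lines = spark.sparkContext.textFile("s3://my-bucket/input/*.json")
df = spark.createDataFrame(lines.map(extract_core), schema)
df.write.json("s3://my-bucket/output/")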

How can I fetch data from a particular index in Elasticsearch?

When I am trying to fetch data, all I am getting is some useless information like data types etc. How can I get the real data stored in the index?
{"rpa-trans-2020.02.26":{"aliases":{},"mappings":{"rpa-trans":{"dynamic":"true","properties":{"#timestamp":{"type":"date"},"#version":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"AreaImpacted":{"type":"keyword"},"AssigneeUserId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"CreatedByUserId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"CrossReferenceId":{"type":"keyword"},"EntityName":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"ErrorCode":{"type":"keyword"},"ErrorDescription":{"type":"keyword"},"FailureTransaction":{"type":"integer"},"Initiator":{"type":"keyword"},"InstanceId":{"type":"integer"},"IsApplication":{"type":"keyword"},"ListenerReqEndTime":{"type":"long"},"ListenerReqStartTime":{"type":"long"},"NotQualifiedRequest":{"type":"integer"},"Param1":{"type":"float"},"Param2":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"Param3":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"Param4":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"ProcessName":{"type":"keyword"},"Processes":{"type":"keyword"},"QualifiedRequest":{"type":"integer"},"RetryTrans":{"properties":{"ReasonRequestUnsuccessful":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"RetryAttemptNumber":{"type":"long"},"RetryReason":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"RobotReqEndTime":{"type":"long"},"RobotReqStartTime":{"type":"long"},"Robots":{"type":"keyword"},"SearchInput":{"type":"keyword"},"SourceApplicationId":{"type":"keyword"},"SourceMachineId":{"type":"keyword"},"StepDescription":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"StepEndTime":{"type":"date"},"StepStartTime":{"type":"date"},"StepStatus":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"SuccessfulTransaction":{"type":"integer"},"TransactionDateTime":{"type":"date","format":"yyyy-MM-dd'T'HH:mm:ss.SSSZZ"},"TransactionEndTime":{"type":"date"},"TransactionId":{"type":"keyword"},"TransactionMessage":{"type":"keyword"},"TransactionProcessName":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"TransactionProfileName":{"type":"keyword"},"TransactionSource":{"type":"keyword"},"TransactionState":{"type":"keyword"},"TransactionStatus":{"type":"keyword"},"TransactionTitle":{"type":"keyword"},"TransactionType":{"type":"keyword"},"UserId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"elapsed_execution_time":{"type":"float"},"elapsed_execution_time ":{"type":"float"},"elapsed_handle_time":{"type":"float"},"elapsed_qualification_time":{"type":"float"},"elapsed_qualification_time ":{"type":"float"},"elapsed_timestamp_start":{"type":"date"},"elapsed_wait_time":{"type":"float"},"manual_proc_time":{"type":"integer"},"process_sla":{"type":"integer"},"tags":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"transactions":{"type":"long"},"type":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}},"settings":{"index":{"number_of_shards":"1","provided_name":"rpa-trans-2020.02.26","creation_date":"1582703183260","requests":{"cache":{"enable":"false"}},"number_of_replicas":"0","uuid":"WslG0Q-YT323WZxNuw_sWw","version":{"created":"5050299"}}}},
Looks like you are trying to issue a request in the format of GET myindex. That would give you the mappings and settings of your index.
To read your data, the request should look like GET myindex/_search.
There are many ways to query data, and you can read the documentation for the ones that suit your query requirements.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
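For example, from Python you could send that _search request with requests (the localhost URL is an assumption; point it at your own cluster):

import requests

# Match-all query against the index from the question; adjust the body as needed.
resp = requests.get(
    "http://localhost:9200/rpa-trans-2020.02.26/_search",
    json={"query": {"match_all": {}}, "size": 10},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])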
If you load that JSON response into Python as a dict, as in the assignment below, then you can access parts of it like:
the_data["rpa-trans-2020.02.26"]["mappings"]
the_data = {"rpa-trans-2020.02.26":{"aliases":{},"mappings":{"rpa-trans":{"dynamic":"true","properties":{"#timestamp":{"type":"date"},"#version":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"AreaImpacted":{"type":"keyword"},"AssigneeUserId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"CreatedByUserId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"CrossReferenceId":{"type":"keyword"},"EntityName":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"ErrorCode":{"type":"keyword"},"ErrorDescription":{"type":"keyword"},"FailureTransaction":{"type":"integer"},"Initiator":{"type":"keyword"},"InstanceId":{"type":"integer"},"IsApplication":{"type":"keyword"},"ListenerReqEndTime":{"type":"long"},"ListenerReqStartTime":{"type":"long"},"NotQualifiedRequest":{"type":"integer"},"Param1":{"type":"float"},"Param2":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"Param3":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"Param4":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"ProcessName":{"type":"keyword"},"Processes":{"type":"keyword"},"QualifiedRequest":{"type":"integer"},"RetryTrans":{"properties":{"ReasonRequestUnsuccessful":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"RetryAttemptNumber":{"type":"long"},"RetryReason":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"RobotReqEndTime":{"type":"long"},"RobotReqStartTime":{"type":"long"},"Robots":{"type":"keyword"},"SearchInput":{"type":"keyword"},"SourceApplicationId":{"type":"keyword"},"SourceMachineId":{"type":"keyword"},"StepDescription":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"StepEndTime":{"type":"date"},"StepStartTime":{"type":"date"},"StepStatus":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"SuccessfulTransaction":{"type":"integer"},"TransactionDateTime":{"type":"date","format":"yyyy-MM-dd'T'HH:mm:ss.SSSZZ"},"TransactionEndTime":{"type":"date"},"TransactionId":{"type":"keyword"},"TransactionMessage":{"type":"keyword"},"TransactionProcessName":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"TransactionProfileName":{"type":"keyword"},"TransactionSource":{"type":"keyword"},"TransactionState":{"type":"keyword"},"TransactionStatus":{"type":"keyword"},"TransactionTitle":{"type":"keyword"},"TransactionType":{"type":"keyword"},"UserId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"elapsed_execution_time":{"type":"float"},"elapsed_execution_time ":{"type":"float"},"elapsed_handle_time":{"type":"float"},"elapsed_qualification_time":{"type":"float"},"elapsed_qualification_time ":{"type":"float"},"elapsed_timestamp_start":{"type":"date"},"elapsed_wait_time":{"type":"float"},"manual_proc_time":{"type":"integer"},"process_sla":{"type":"integer"},"tags":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"transactions":{"type":"long"},"type":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}},"settings":{"index":{"number_of_shards":"1","provided_name":"rpa-trans-2020.02.26","creation_date":"1582703183260","requests":{"cache":{"enable":"false"}},"number_of_replicas":"0","uuid":"WslG0Q-YT323WZxNuw_sWw","version":{"created":"5050299"}}}}}

How to convert data fetched from a database using Python into a JSON string so that I am able to hit a particular API

I exhaustively searched for my question on the web, but the closest answer I found was this:
Answer for converting list to string
To be precise, I fetched some data into a variable in Python using the mysql-python connector module.
After I printed the variable, the fetched data looked something like this:
[(16,'abc#ail.com','lenbox','+91','16','ACTIVE'),
(17,'xyz#ail.com','lenro','+41','17','ACTIVE'),
(18,'sdf#ail.com','meru','+51','18','NOTACTIVE')]
I want to use this data to hit an API whose URL and other details are known to me. For that, the above data should be in JSON string format. Please suggest a way.
I am a newbie in Python and don't know much about JSON.
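As a starting point, a small sketch that turns the list of tuples into a JSON string and posts it to an API (the column names and endpoint URL are made-up placeholders; use the real column names from your table and the details of your API):

import json
import requests

rows = [
    (16, 'abc#ail.com', 'lenbox', '+91', '16', 'ACTIVE'),
    (17, 'xyz#ail.com', 'lenro', '+41', '17', 'ACTIVE'),
    (18, 'sdf#ail.com', 'meru', '+51', '18', 'NOTACTIVE'),
]

# Placeholder column names - replace with the actual columns of your table.
columns = ["id", "email", "name", "country_code", "ref_id", "status"]

# Pair each tuple with the column names to get a list of dicts, then serialize.
payload = [dict(zip(columns, row)) for row in rows]
json_string = json.dumps(payload)

# POST the JSON string to the API; the URL is a placeholder.
response = requests.post("https://example.com/api/endpoint",
                         data=json_string,
                         headers={"Content-Type": "application/json"})
print(response.status_code)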

Building a solr index using large text file

I have a large text file in the following format:
00001,234234|234|235|7345
00005,788|298|234|735
You can treat the values prior to the , as keys, and what I want is a quick-and-dirty approach to query these keys and find the result sets for each key. After reading a bit, I found out that Solr provides a good framework to do this.
What would be the starting point?
Can I use Python to read the file and build this index (search engine) using Solr?
Is there a different mechanism to do this?
You can definitely do that using pysolr, which is a Python library. If the data is in key-value form, you can read it in Python as shown here:
https://pypi.python.org/pypi/pysolr/3.1.0
To have more control over search, you need to modify the schema.xml file so it has the keys you have in your text file.
Once you have the data ingested in Solr, you can follow the above link to perform searches.
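A rough pysolr sketch, assuming a Solr core already exists at a placeholder URL, that each line of the file looks like the sample above, and that the schema defines a multi-valued "values" field:

import pysolr

# Placeholder core URL - point this at your own Solr core.
solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=10)

docs = []
with open("data.txt") as f:
    for line in f:
        # Everything before the first comma is the key; the rest is |-separated values.
        key, values = line.strip().split(",", 1)
        docs.append({"id": key, "values": values.split("|")})

solr.add(docs, commit=True)          # index the documents
results = solr.search("id:00001")    # query by key
for doc in results:
    print(doc)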
You can index your data directly in Solr using the UpdateCSV handler: you just need to specify the destination field names in the fieldnames parameter of your curl call (or add them as the first line of your file if that is easier). No custom code is needed.
Do remember to check that the destination field for the |-separated values splits them into tokens using that character.
See https://wiki.apache.org/solr/UpdateCSV for details.
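Equivalently from Python, you could send the file straight to the CSV update handler with requests (the core URL and field names are placeholders; the f.values.split/f.values.separator parameters tell Solr to split the pipe-separated column into multiple values):

import requests

with open("data.txt", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/mycore/update/csv",  # placeholder core URL
        params={
            "commit": "true",
            "fieldnames": "id,values",   # the file itself has no header line
            "f.values.split": "true",
            "f.values.separator": "|",
        },
        data=f,
        headers={"Content-Type": "text/csv"},
    )
print(resp.text)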
