Splitting .txt file into elements - python

Since we cannot read directly from a JSON file, I'm using a .txt one.
It looks like this, with multiple elements separated by ",":
[
    {
        "Item_Identifier": "FDW58",
        "Outlet_Size": "Medium"
    },
    {
        "Item_Identifier": "FDW14",
        "Outlet_Size": "Small"
    },
]
I want to count the number of elements; here I would get 2.
The problem is that I can't separate the text into elements delimited by a comma ','.
I get every line as a separate element, even if I convert it to JSON format.
lines = p | 'receive_data' >> beam.io.ReadFromText(known_args.input) \
          | 'jsondumps' >> beam.Map(lambda x: json.dumps(x)) \
          | 'jsonloads' >> beam.Map(lambda x: json.loads(x)) \
          | 'print' >> beam.ParDo(PrintFn())

I don't believe this is a safe approach. I haven't used the Python SDK (I use Java), but io.TextIO on the Java side is pretty clear that it will emit a PCollection where each element is one line of input from the source file. Hierarchical data formats (JSON, XML, etc.) are not amenable to being split apart in this way.
If your file is as well-formatted and non-nested as the JSON you've included, you can get away with:
reading the file by line (as I believe you're doing)
filtering for only lines containing }
counting the resulting pcollection size
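A rough, untested sketch of that counting approach with the Python SDK (the function name and transform labels are just illustrative, and it assumes the input stays as flat and pretty-printed as the sample above):
import apache_beam as beam


def count_objects(pipeline, input_path):
    # One element per input line; in the flat, pretty-printed sample above,
    # every line containing '}' closes exactly one object.
    return (
        pipeline
        | 'read_lines' >> beam.io.ReadFromText(input_path)
        | 'closing_braces' >> beam.Filter(lambda line: '}' in line)
        | 'count' >> beam.combiners.Count.Globally()
    )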
To integrate with json more generally, though, we took a different approach:
start with a PCollection of Strings where each value is the path to a file
use native libraries to access the file and parse it in a streaming fashion (we use scala which has a few streaming json parsing libraries available)
alternatively, use Beam's APIs to obtain a ReadableFile instance from a MatchResult and access the file through that
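For the ReadableFile variant, a sketch with the Python SDK's fileio module might look like the following; note that it parses each matched file in memory in one go rather than streaming, so very large files would need a streaming parser (e.g. ijson) instead:
import json

import apache_beam as beam
from apache_beam.io import fileio


def parse_json_files(pipeline, file_pattern):
    # Match file paths, open each match as a ReadableFile, and emit one
    # element per object in the file's top-level JSON array.
    return (
        pipeline
        | 'match' >> fileio.MatchFiles(file_pattern)
        | 'open' >> fileio.ReadMatches()
        | 'parse' >> beam.FlatMap(lambda readable: json.loads(readable.read_utf8()))
    )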
My understanding has been that not all file formats are good matches for distributed processors. Gzip, for example, cannot be split or chunked easily; the same goes for JSON. CSV has the issue that the values are meaningless unless you also have the header line handy.

A .json file is simply a text file whose contents are in JSON-parseable format.
Your JSON is invalid because it has a trailing comma. This should work:
import json
j = r"""
[
    {
        "Item_Identifier": "FDW58",
        "Outlet_Size": "Medium"
    },
    {
        "Item_Identifier": "FDW14",
        "Outlet_Size": "Small"
    }
]
"""
print(json.loads(j))

Related

How to custom indent JSON?

I want to be able to indent a JSON file in a way where the first key is on the same line as the opening bracket; by default, json.dump() puts it on a new line.
For example, if the original file was:
[
    {
        "statusName": "CO_FILTER",
        "statusBar": {}
I want it to start like this:
[
{ "statusName": "CO_FILTER",
"statusBar":{}
I believe that if you want to custom-format it, you will most likely have to build on or extend a JSON library, just like the Google JSON formatter and many others do. For the most part it is not a value you can change on demand; the behaviour is either included in some JSON library apart from Python's default one, or it isn't. Maybe search for different Python JSON libraries that dump JSON the way you like.
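That said, if you only need the cosmetic change shown above, one workaround (a sketch, not a library feature) is to dump normally and then pull each object's first key up onto the brace line with a regular expression:
import json
import re

data = [{"statusName": "CO_FILTER", "statusBar": {}}]

dumped = json.dumps(data, indent=4)
# Merge each opening brace with the first key that follows it.
custom = re.sub(r'\{\n\s+"', '{ "', dumped)
print(custom)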

How to read a newline delimited JSON file from R?

I have a newline-delimited JSON file (i.e., each JSON object is confined to one line in the file):
{"name": "json1"}
{"name": "json2"}
{"name": "json3"}
In Python I can easily read it as follows (I must use encoding='cp850' to read my real data):
import json
objs = []
with open("testfile.json", encoding='cp850') as f:
    for line in f:
        objs.append(json.loads(line))
How can I do a similar trick in R?
At the end I want to get a data.frame:
library("jsonlite")
library("data.table")
d <- fromJSON("testfile.json", flatten=FALSE)
df <- as.data.frame(d)
We can use stream_in from jsonlite
library(jsonlite)
out <- stream_in(file('testfile.json'))
out
# name
#1 json1
#2 json2
#3 json3
str(out)
#'data.frame': 3 obs. of 1 variable:
#$ name: chr "json1" "json2" "json3"
You can read and preprocess the data into a proper format and then parse the JSON:
jsonlite::fromJSON(sprintf('[%s]', paste(readLines('text.json', warn = FALSE),
                                         collapse = ',')))
# name
# 1 json1
# 2 json2
# 3 json3
(You can use any of the many alternative JSON packages, e.g.
jsonlite, a more R-like package, as it mainly works with data frames
RJSONIO, a more Pythonic package, working mainly with lists
or yet another one.)
The fastest way to read in newline json is probably stream_in from ndjson. It uses underlying C++ libraries (although I think it is still single threaded). Still, I find it much faster than (the still very good) jsonlite library. As a bonus, the json is flattened by default.
library(ndjson)
out <- ndjson::stream_in(path='./testfile.json')

Select specific fields from DB export that includes Json

I have an export from a DB table that has the following columns:
name|value|age|external_atributes
The external_atributes column is in JSON format, so the export looks like this:
George|10|30|{"label1":1,"label2":2,"label3":3,"label4":4,"label5":5,"label6":"6","label7":"7","label8":"8"}
What is the most efficient way (since the export has more than 1m lines) to keep only the name and the values of label2, label5 and label6? For example, from the above export I would like to keep only:
George|2|5|6
Edit: I am not sure of the order of the fields/variables in the JSON part. The data could, for example, also be:
George|10|30|{"label2":2,"label1":1,"label4":4,"label3":3,"label6":6,"label8":"8","label7":"7","label5":"5"}
Also, the fact that some of the values are double-quoted while others are not is intentional (this is how they appear in the export).
My understanding so far is that I have to use something with a JSON parser, like Python or jq.
This is what I created in Python, and it seems to be working as expected:
from __future__ import print_function
import sys,json
with open(sys.argv[1], 'r') as file:
    for line in file:
        fields = line.split('|')
        print(fields[0], json.loads(fields[3])['label2'], json.loads(fields[3])['label5'], json.loads(fields[3])['label6'], sep='|')
output:
George|2|5|6
Since I am looking for the most efficient way to do this, any comment is more than welcome.
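For reference, a small variation of the script above that parses each line's JSON only once instead of three times (same pipe-delimited layout assumed; whether it is measurably faster on the full export is untested):
from __future__ import print_function
import json
import sys

with open(sys.argv[1], 'r') as f:
    for line in f:
        # The JSON blob is the fourth pipe-delimited field.
        fields = line.rstrip('\n').split('|', 3)
        attrs = json.loads(fields[3])
        print(fields[0], attrs['label2'], attrs['label5'], attrs['label6'], sep='|')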
Even if the data look easy to parse, I advise using a JSON parser like jq to extract your JSON data:
<file jq -rR '
split("|")|[ .[0], (.[3]|fromjson|(.label2,.label5,.label6)|tostring)]|join("|")'
The options -R and -r allow jq to accept and display a string as input and output (instead of JSON data).
The split function gets all the fields into an array that can be indexed by number: .[0] and .[3].
The fourth field (index 3) is then parsed as JSON data with the fromjson function so that the wanted labels can be extracted.
All the wanted fields are put into an array and joined together with the | delimiter.
You could split with multiple delimiters using a character class.
The following prints the desired result:
awk 'BEGIN { FS = "[|,:]";OFS="|"} {gsub(/"/,"",$15)}{print $1,$7,$13,$15}'
The above solution assumes that the input data has a fixed structure (the same field order on every line).
Since this is about record-based text edits, awk is most probably the best tool for the task. However, here is a sed solution:
sed 's/\([^|]*\).*label2[^0-9]*\([0-9]*\).*label5[^0-9]*\([0-9]*\).*label6[^0-9]*\([0-9]*\).*/\1|\2|\3|\4/' inputFile

YAML vs Python configuration/parameter files (but perhaps also vs JSON vs XML)

I see Python used to do a fair amount of code generation for C/C++ header and source files. Usually, the input files which store parameters are in JSON or YAML format, although most of what I see is YAML. However, why not just use Python files directly? Why use YAML at all in this case?
That also got me thinking: since Python is a scripted language, its files, when containing only data and data structures, could literally be used the same as XML, JSON, YAML, etc. Do people do this? Is there a good use case for it?
What if I want to import a configuration file into a C or C++ program? What about into a Python program? In the Python case it seems to me there is no sense in using YAML at all, as you can just store your configuration parameters and variables in pure Python files. In the C or C++ case, it seems to me you could still store your data in Python files and then just have a Python script import that and auto-generate header and source files for you as part of the build process. Again, perhaps there's no need for YAML or JSON in this case at all either.
Thoughts?
Here's an example of storing some nested key/value hash table pairs in a YAML file:
my_params.yml:
---
dict_key1:
  dict_key2:
    dict_key3a: my string message
    dict_key3b: another string message
And the same exact thing in a pure Python file:
my_params.py
data = {
"dict_key1": {
"dict_key2": {
"dict_key3a": "my string message",
"dict_key3b": "another string message",
}
}
}
And to read in both the YAML and Python data and print it out:
import_config_file.py:
import yaml # Module for reading in YAML files
import json # Module for pretty-printing Python dictionary types
# See: https://stackoverflow.com/a/34306670/4561887
# 1) import .yml file
with open("my_params.yml", "r") as f:
    data_yml = yaml.safe_load(f)  # safe_load: yaml.load() without a Loader is unsafe and no longer supported
# 2) import .py file
from my_params import data as data_py
# OR: Alternative method of doing the above:
# import my_params
# data_py = my_params.data
# 3) print them out
print("data_yml = ")
print(json.dumps(data_yml, indent=4))
print("\ndata_py = ")
print(json.dumps(data_py, indent=4))
Reference for using json.dumps: https://stackoverflow.com/a/34306670/4561887
SAMPLE OUTPUT of running python3 import_config_file.py:
data_yml =
{
"dict_key1": {
"dict_key2": {
"dict_key3a": "my string message",
"dict_key3b": "another string message"
}
}
}
data_py =
{
"dict_key1": {
"dict_key2": {
"dict_key3a": "my string message",
"dict_key3b": "another string message"
}
}
}
Yes, people do this, and have been doing so for years.
But many make the mistake you do and make it unsafe, by using import my_params. That would be the same as loading YAML using YAML(typ='unsafe') in ruamel.yaml (or yaml.load() in PyYAML, which is unsafe).
What you should do is use the ast package that comes with Python to parse your "data" structure, to make such an import safe. My package pon has code to update these kinds of structures, and in each of my __init__.py files there is such a piece of data named _package_data that is read by some code (the function literal_eval) in the setup.py for the package. The ast-based code in setup.py takes around 100 lines.
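As a rough illustration of that ast idea (a minimal sketch, not the actual pon or setup.py code, assuming my_params.py contains a single top-level data = {...} assignment made only of literals):
import ast

# Parse the module source without importing it, so nothing gets executed.
with open("my_params.py") as f:
    tree = ast.parse(f.read())

data = None
for node in tree.body:
    # Find the top-level assignment to `data` and evaluate only literal values.
    if (isinstance(node, ast.Assign)
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == "data"):
        data = ast.literal_eval(node.value)

print(data)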
The advantage of doing this in a structured way is the same as with using YAML: you can programmatically update the data structure (version numbers!), although I consider PON (Python Object Notation) less readable than YAML and slightly less easy to update manually.

What is the difference between loads/dumps and reading a JSON file directly?

I have a simple JSON file 'stackoverflow.json':
{
"firstname": "stack",
"lastname": "overflow"
}
What is the difference between the two functions below?
import ujson

def read_json_file_1():
    with open('stackoverflow.json') as fs:
        payload = ujson.loads(fs.read())
        payload = ujson.dumps(payload)
        return payload

def read_json_file_2():
    with open('stackoverflow.json') as fs:
        return fs.read()
Then I use the 'requests' module to send a POST request with the payload from the two functions above, and it works for both.
Thanks.
The loads function takes a JSON string and converts it into a dictionary or list, depending on the exact JSON.
The dumps function takes a Python data structure and converts it back into JSON.
So your first function is loading and validating the JSON, converting it into a Python structure, and then converting it back to JSON before returning, whereas your second function just reads the content of the file, with no conversion or validation.
The functions are therefore only equivalent if the JSON is valid; if there are JSON errors, the two functions will behave very differently. If you know that the file contains error-free JSON, the two functions should return equivalent output, but if the file contains errors in the JSON, the first function will fail with relevant tracebacks etc., and the second function won't generate any errors at all. The first function is more error-friendly but less efficient (since it converts JSON -> Python -> JSON); the second function is much more efficient but much less error-friendly (i.e. it won't fail if the JSON is broken).
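To see the difference concretely, here is a small sketch (using the standard json module instead of ujson, which behaves the same way for this purpose) fed a deliberately broken document:
import json

broken = '{"firstname": "stack", "lastname": }'  # invalid JSON

# What read_json_file_1 effectively does: parse, then re-serialize.
try:
    payload = json.dumps(json.loads(broken))
except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
    print("parse-and-dump version raises:", err)

# What read_json_file_2 effectively does: return the raw text untouched.
print("raw-read version returns:", broken)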
