I have a newline-delimited JSON file (i.e., each JSON object is confined to one line in the file):
{"name": "json1"}
{"name": "json2"}
{"name": "json3"}
In Python I can easily read it as follows (I must use encoding='cp850' to read my real data):
import json
objs = []
with open("testfile.json", encoding='cp850') as f:
    for line in f:
        objs.append(json.loads(line))
How can I do a similar trick in R?
At the end I want to get a data.frame:
library("jsonlite")
library("data.table")
d <- fromJSON("testfile.json", flatten=FALSE)
df <- as.data.frame(d)
We can use stream_in from jsonlite
library(jsonlite)
out <- stream_in(file('testfile.json'))
out
# name
#1 json1
#2 json2
#3 json3
str(out)
#'data.frame': 3 obs. of 1 variable:
#$ name: chr "json1" "json2" "json3"
You can read the lines, reshape them into a proper JSON array, and then parse the JSON:
jsonlite::fromJSON(sprintf('[%s]', paste(readLines('testfile.json', warn = FALSE),
                                         collapse = ',')))
# name
# 1 json1
# 2 json2
# 3 json3
(You can use one of the many alternative JSON packages, e.g.:
jsonlite, a more R-like package, as it mainly works with data frames;
RJSONIO, a more Python-ic package, working mainly with lists;
or yet another one.)
The fastest way to read in newline-delimited JSON is probably stream_in from ndjson. It uses underlying C++ libraries (although I think it is still single-threaded). Still, I find it much faster than (the still very good) jsonlite library. As a bonus, the JSON is flattened by default.
library(ndjson)
out <- ndjson::stream_in(path='./testfile.json')
I am currently working on a side project where I am converting some code from Python to Rust.
In Python, we can do something like:
Python code->
data = b'commit'+b' '+b'\x00'
print(data)
Output->
b'commit \x00'
Is there any way to achieve this in Rust? I need to concatenate some b'...' byte strings and store them in a file.
Thanks in advance.
I tried using the + operator, but it didn't work and showed an error like: cannot add &[u8; 6] with &[u8; 1]
You have a number of options for combining binary strings. However, it sounds like the best option for your use case is the write! macro. It lets you write bytes the same way you would with the format! and println! macros. This has the benefit of requiring no additional allocation. The write! macro uses UTF-8 encoding (the same as strings in Rust).
One thing to keep in mind, though, is that write! will perform lots of small calls through the std::io::Write trait. To prevent your program from slowing down due to frequent calls to the OS when writing files, you will want to buffer your data using a BufWriter.
As for zlib, this is definitely doable. In Rust, the most popular library for handling deflate, gzip, and zlib is the flate2 crate. Using it we can wrap a writer so the compression happens inline with the rest of the writing process. This lets us save memory and keeps our code tidy.
use std::fs::File;
use std::io::{self, BufWriter, Write};
use flate2::write::ZlibEncoder;
use flate2::Compression;
pub fn save_commits(path: &str, results: &[&str]) -> io::Result<()> {
    // Open the file and wrap it in a buffered writer
    let file = BufWriter::new(File::create(path)?);

    // We can then wrap the file in a Zlib encoder so the data we write will
    // be compressed with zlib as we go
    let mut writer = ZlibEncoder::new(file, Compression::best());

    // Write our header
    write!(&mut writer, "commit {}\0", results.len())?;

    // Write commit hashes
    for hash in results {
        write!(&mut writer, "{}\0", hash)?;
    }

    // This step is not required, but it lets us propagate any error from
    // flushing the final buffered bytes instead of having them flushed (and
    // any errors silently ignored) when the writer is dropped.
    writer.by_ref().flush()?;
    Ok(())
}
Additionally, you may find it helpful to read this answer to learn more about how binary strings work in Rust: https://stackoverflow.com/a/68231883/5987669
I have an export from a DB table that has the following columns:
name|value|age|external_atributes
The external_atributes column is in JSON format. So the export looks like this:
George|10|30|{"label1":1,"label2":2,"label3":3,"label4":4,"label5":5,"label6":"6","label7":"7","label8":"8"}
What is the most efficient way (since the export has more than 1M lines) to keep only the name and the values of label2, label5 and label6? For example, from the above export I would like to keep only:
George|2|5|6
Edit: I am not sure about the order of the fields/variables in the JSON part. The data could also be, for example:
George|10|30|{"label2":2,"label1":1,"label4":4,"label3":3,"label6":6,"label8":"8","label7":"7","label5":"5"}
Also, the fact that some of the values are double-quoted while some are not is intentional (this is also how they appear in the export).
My understanding so far is that I have to use something with a JSON parser, like Python or jq.
This is what I created in Python, and it seems to work as expected:
from __future__ import print_function
import sys,json
with open(sys.argv[1], 'r') as file:
    for line in file:
        fields = line.split('|')
        print(fields[0], json.loads(fields[3])['label2'], json.loads(fields[3])['label5'],
              json.loads(fields[3])['label6'], sep='|')
output:
George|2|5|6
Since I am looking for the most efficient way to do this, any comment is more than welcome.
Even if the data are easy to parse, I advise using a JSON parser like jq to extract your JSON data:
<file jq -rR '
split("|")|[ .[0], (.[3]|fromjson|(.label2,.label5,.label6)|tostring)]|join("|")'
The -R and -r options allow jq to accept and display raw strings as input and output (instead of JSON data).
The split function puts all the fields into an array that can be indexed by number: .[0] and .[3].
The field .[3] is then parsed as JSON data with the function fromjson so that the wanted labels can be extracted.
All wanted fields are put into an array and joined together with the | delimiter.
You could split with multiple delimiters using a character class.
The following prints the desired result:
awk 'BEGIN { FS = "[|,:]"; OFS = "|" } { gsub(/"/, "", $15) } { print $1, $7, $13, $15 }' inputFile
The above solution assumes that the input data is structured the same way on every line (i.e., the JSON keys always appear in the same order).
Since this is about record-based text edits, awk is most probably the best tool to accomplish the task. However, here is a sed solution:
sed 's/\([^|]*\).*label2[^0-9]*\([0-9]*\).*label5[^0-9]*\([0-9]*\).*label6[^0-9]*\([0-9]*\).*/\1|\2|\3|\4/' inputFile
Since we cannot read directly from a JSON file, I'm using a .txt one.
It looks like this, with more elements separated by ",":
[
    {
        "Item_Identifier": "FDW58",
        "Outlet_Size": "Medium"
    },
    {
        "Item_Identifier": "FDW14",
        "Outlet_Size": "Small"
    },
]
I want to count the number of elements; here I would get 2.
The problem is that I can't separate the text into elements separated by a comma ','.
I get every line on its own, even if I convert it into a JSON format.
lines = p | 'receive_data' >> beam.io.ReadFromText(known_args.input) \
          | 'jsondumps' >> beam.Map(lambda x: json.dumps(x)) \
          | 'jsonloads' >> beam.Map(lambda x: json.loads(x)) \
          | 'print' >> beam.ParDo(PrintFn())
I don't believe this is a safe approach. I haven't used the Python SDK (I use Java), but io.TextIO on the Java side is pretty clear that it will emit a PCollection where each element is one line of input from the source file. Hierarchical data formats (JSON, XML, etc.) are not amenable to being split apart in this way.
If your file is as well-formatted and non-nested as the JSON you've included, you can get away with the following (a rough Python sketch comes after the list):
reading the file by line (as I believe you're doing)
filtering for only lines containing }
counting the resulting pcollection size
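As a rough sketch of that line-based workaround (assuming the Python SDK, a placeholder path 'testfile.txt', and that the file looks exactly like the example, with one closing "}" per element and no nesting):

import apache_beam as beam

with beam.Pipeline() as p:
    _ = (p
         | 'read_lines' >> beam.io.ReadFromText('testfile.txt')     # placeholder path
         | 'keep_closers' >> beam.Filter(lambda line: '}' in line)  # one '}' line per element
         | 'count_elements' >> beam.combiners.Count.Globally()
         | 'print_count' >> beam.Map(print))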
To integrate with JSON more generally, though, we took a different approach:
start with a PCollection of Strings where each value is the path to a file
use native libraries to access the file and parse it in a streaming fashion (we use Scala, which has a few streaming JSON parsing libraries available)
alternatively, use Beam's APIs to obtain a ReadableFile instance from a MatchResult and access the file through that (a rough Python sketch of this route follows)
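And a similarly hedged sketch of that whole-file route using the Python SDK's fileio module (rather than the Java/Scala setup described above); 'testfile.txt' is again a placeholder, and the file must be valid JSON and small enough to parse on a single worker:

import json
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    _ = (p
         | 'match_files' >> fileio.MatchFiles('testfile.txt')  # placeholder pattern
         | 'read_matches' >> fileio.ReadMatches()               # yields ReadableFile elements
         | 'count_elements' >> beam.Map(lambda f: len(json.loads(f.read_utf8())))
         | 'print_count' >> beam.Map(print))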
My understanding has been that not all file formats are good matches for distributed processors. Gzip, for example, cannot be 'split' or chunked easily. Same with JSON. CSV has an issue where the values are nonsense unless you also have the header line handy.
A .json file is simply a text file whose contents are in JSON-parseable format.
Your JSON is invalid because it has a trailing comma. This should work:
import json
j = r"""
[
    {
        "Item_Identifier": "FDW58",
        "Outlet_Size": "Medium"
    },
    {
        "Item_Identifier": "FDW14",
        "Outlet_Size": "Small"
    }
]
"""
print(json.loads(j))
I see Python used to do a fair amount of code generation for C/C++ header and source files. Usually, the input files which store parameters are in JSON or YAML format, although most of what I see is YAML. However, why not just use Python files directly? Why use YAML at all in this case?
That also got me thinking: since Python is a scripted language, its files, when containing only data and data structures, could literally be used the same as XML, JSON, YAML, etc. Do people do this? Is there a good use case for it?
What if I want to import a configuration file into a C or C++ program? What about into a Python program? In the Python case it seems to me there is no sense in using YAML at all, as you can just store your configuration parameters and variables in pure Python files. In the C or C++ case, it seems to me you could still store your data in Python files and then just have a Python script import that and auto-generate header and source files for you as part of the build process. Again, perhaps there's no need for YAML or JSON in this case at all either.
Thoughts?
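To make the auto-generation idea concrete, here is a rough, purely illustrative sketch (the file names, macro naming, and data layout are just examples of mine) of a build-step script that imports a Python data file like the my_params.py shown below and emits a C header from it:

# generate_header.py: hypothetical build step turning my_params.py into a C header
from my_params import data

with open("my_params.h", "w") as out:
    out.write("#ifndef MY_PARAMS_H\n#define MY_PARAMS_H\n\n")
    # flatten the nested dict shown below into simple #define string constants
    for key, value in data["dict_key1"]["dict_key2"].items():
        out.write('#define {} "{}"\n'.format(key.upper(), value))
    out.write("\n#endif  // MY_PARAMS_H\n")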
Here's an example of storing some nested key/value hash table pairs in a YAML file:
my_params.yml:
---
dict_key1:
  dict_key2:
    dict_key3a: my string message
    dict_key3b: another string message
And the same exact thing in a pure Python file:
my_params.py:
data = {
    "dict_key1": {
        "dict_key2": {
            "dict_key3a": "my string message",
            "dict_key3b": "another string message",
        }
    }
}
And to read in both the YAML and Python data and print it out:
import_config_file.py:
import yaml # Module for reading in YAML files
import json # Module for pretty-printing Python dictionary types
# See: https://stackoverflow.com/a/34306670/4561887
# 1) import .yml file
with open("my_params.yml", "r") as f:
    data_yml = yaml.load(f)
# 2) import .py file
from my_params import data as data_py
# OR: Alternative method of doing the above:
# import my_params
# data_py = my_params.data
# 3) print them out
print("data_yml = ")
print(json.dumps(data_yml, indent=4))
print("\ndata_py = ")
print(json.dumps(data_py, indent=4))
Reference for using json.dumps: https://stackoverflow.com/a/34306670/4561887
SAMPLE OUTPUT of running python3 import_config_file.py:
data_yml =
{
    "dict_key1": {
        "dict_key2": {
            "dict_key3a": "my string message",
            "dict_key3b": "another string message"
        }
    }
}

data_py =
{
    "dict_key1": {
        "dict_key2": {
            "dict_key3a": "my string message",
            "dict_key3b": "another string message"
        }
    }
}
Yes, people do this, and have been doing it for years.
But many make the mistake you do and make it unsafe by simply importing my_params.py. That would be the same as loading YAML using YAML(typ='unsafe') in ruamel.yaml (or yaml.load() in PyYAML, which is unsafe).
What you should do is use the ast package that comes with Python to parse your "data" structure, to make such an import safe. My package pon has code to update these kinds of structures, and in each of my __init__.py files there is such a piece of data, named _package_data, that is read by some code (the function literal_eval) in the setup.py for the package. The ast-based code in setup.py takes around 100 lines.
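As a minimal sketch of that idea (this is not the pon code itself; it just assumes a my_params.py laid out like the example above): parse the file with ast and evaluate only the literal assigned to data, so nothing in the module is executed:

import ast

with open("my_params.py") as f:
    tree = ast.parse(f.read())

data_py = None
for node in tree.body:
    # look for the top-level assignment "data = {...}"
    if (isinstance(node, ast.Assign)
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == "data"):
        # literal_eval only accepts literals, so no code can run
        data_py = ast.literal_eval(node.value)

print(data_py)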
The advantages of doing this in a structured way are the same as with using YAML: you can programmatically update the data structure (version numbers!), although I consider PON (Python Object Notation) less readable than YAML and slightly less easy to update manually.
I have a 300 MB CSV with 3 million rows worth of city information from Geonames.org. I am trying to convert this CSV into JSON to import into MongoDB with mongoimport. The reason I want JSON is that it allows me to specify the "loc" field as an array and not a string, for use with the geospatial index. The CSV is encoded in UTF-8.
A snippet of my CSV looks like this:
"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,
The desired JSON output (except the charset) that works with mongoimport is below:
{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
I have tried all the available online CSV-to-JSON converters, and they do not work because of the file size. The closest I got was with Mr Data Converter, which would import into MongoDB after removing the start and end brackets and the commas between documents. Unfortunately, that tool doesn't work with a 300 MB file.
The JSON above is supposed to be encoded in UTF-8 but still has charset problems, most likely due to a conversion error.
I spent the last three days learning Python, trying to use Python CSVKIT, trying all the CSV-JSON scripts on Stack Overflow, importing the CSV into MongoDB and changing the "loc" string to an array (this retains the quotation marks, unfortunately), and even trying to manually copy and paste 30,000 records at a time. A lot of reverse engineering, trial and error, and so forth.
Does anyone have a clue how to achieve the JSON above while keeping the encoding proper like in the CSV above? I am at a complete standstill.
The Python standard library (plus simplejson for decimal encoding support) has all you need:
import csv, simplejson, decimal, codecs

data = open("in.csv")
reader = csv.DictReader(data, delimiter=",", quotechar='"')

with codecs.open("out.json", "w", encoding="utf-8") as out:
    for r in reader:
        for k, v in r.items():
            # make sure nulls are generated
            if not v:
                r[k] = None
            # parse and generate decimal arrays
            elif k == "loc":
                r[k] = [decimal.Decimal(n) for n in v.strip("[]").split(",")]
            # generate a number
            elif k == "geonameid":
                r[k] = int(v)
        out.write(simplejson.dumps(r, ensure_ascii=False, use_decimal=True) + "\n")
Where "in.csv" contains your big CSV file. The above code is tested as working on Python 2.6 and 2.7, with an approximately 100 MB CSV file, producing a properly encoded UTF-8 file. Without surrounding brackets, array quoting, or comma delimiters, as requested.
It is also worth noting that passing both the ensure_ascii and use_decimal parameters is required for the encoding to work properly (in this case).
Finally, since it is based on simplejson, the Python stdlib json package will also gain decimal encoding support sooner or later, so only the stdlib will ultimately be needed.
Maybe you could try importing the CSV directly into MongoDB using:
mongoimport -d <dB> -c <collection> --type csv --file location.csv --headerline