I'm trying to find all JSON objects in my JSONL file that contain the same identifier value.
So if my data looks like:
{
    "data": {
        "value": 42,
        "url": "url.com",
        "details": {
            "timestamp": "07:32:29",
            "identifier": "123ABC"
        }
    },
    "message": "string"
}
I want to find every object that has the same identifier value. The file is too large to load all at once, so instead I check line by line and store just the identifier values. This has the drawback of missing the first object that has that identifier (i.e., if objects A, B, and C all have the same identifier, I would only end up with B and C saved). To find the first occurrence of each identifier, I try reading through the file a second time to pick up only the first time each duplicate identifier is found. This is where I encounter some problems.
This part works as intended:
import gzip
import json_lines
import jsonlines
from itertools import groupby

identifiers = set()
duplicates = []

with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in identifiers:
            duplicates.append(item)
        else:
            identifiers.add(ID)

dup_IDs = {dup["data"]["details"]["identifier"] for dup in duplicates}
But when I read through the file a second time:
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in dup_IDs:
            duplicates.append(item)
            dup_IDs.remove(ID)
        else:
            continue
        if len(dup_IDs) == 0:
            break
        else:
            continue
It runs for ~30 minutes and eventually crashes my computer. I'm assuming (hoping) this is because there's a problem with my code and not my computer because the code is easier to fix.
If the file size is too large, I'd suggest uploading the data into a SQL database and using SQL queries to filter what you need.
import gzip
import json_lines
import jsonlines
from itertools import groupby

duplicates = []
nb = {}
i = 0

# First pass: record the line index of each identifier's first occurrence.
# The index is stored as a str the first time an identifier is seen; seeing it
# again converts that stored index to int, marking the identifier as duplicated.
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in nb:
            nb[ID] = int(nb[ID])
        else:
            nb[ID] = str(i)
        i += 1

# Line indices of the first occurrences of duplicated identifiers.
k = set(nb[ID] for ID in nb if isinstance(nb[ID], int))
del nb

# Second pass: collect the items sitting at those line indices.
i = 0
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        if i in k:
            duplicates.append(item)
        i += 1

print(duplicates)
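Another option avoids reading the file twice at all: keep the first item seen for each identifier in memory until a second occurrence turns up, then move both into duplicates. This is only a sketch, and it assumes the first-occurrence items for duplicated identifiers fit in memory:
import json_lines

first_seen = {}   # identifier -> first item with that identifier, or None once emitted
duplicates = []

with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID not in first_seen:
            first_seen[ID] = item
        else:
            if first_seen[ID] is not None:
                # Second occurrence: emit the stored first item as well.
                duplicates.append(first_seen[ID])
                first_seen[ID] = None
            duplicates.append(item)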
Say I have a number of fields, each with its respective length in characters. The first field is the ID with a length of 10, the second field is the phone number with a length of 20, and so forth. Can I set the delimiter based on the length?
The data does not have any structure to it, so in order to read it I have to find a way to construct the table as I read it in. It's a plain text file. I do, however, have the respective length in characters for each field. I have not done anything like this before, so before I spend hours on it I wanted to see if this is even possible.
In Python you can just slice the string.
msg = "ID12345678PHONE123456789012345BLOB"
_id = msg[:10]
phone = msg[10:30]
blob = msg[30:34]
print(_id, phone, blob)
Result
ID12345678 PHONE123456789012345 BLOB
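Applied to a whole file, the same slicing can be done line by line. A sketch, where the file name and the width of the trailing field are assumptions:
records = []
with open("data.txt") as f:           # hypothetical fixed-width text file
    for line in f:
        line = line.rstrip("\n")
        records.append({
            "id": line[:10],          # first field: 10 characters
            "phone": line[10:30],     # second field: 20 characters
            "rest": line[30:],        # whatever fields follow
        })
print(records)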
Option 2: If you open the file in binary mode and get byte strings, you can use the struct module to unpack them.
import struct
msg = b"ID12345678PHONE123456789012345BLOB"
_id, phone, blob = struct.unpack("10s20s4s", msg)
print(_id, phone, blob)
In Python this should be quite simple indeed.
In pandas you can use the function read_fwf(), which reads fixed-width formatted lines.
Given, for example, something like this:
myfile.txt
1name1surname1
2name2surname2
with columns of size [1, 5, 8], you can read the file this way:
import pandas as pd
df = pd.read_fwf("myfile.txt", widths=[1, 5, 8])
See the pandas documentation for details on this function.
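Since the example file has no header row, it may also help to pass header=None and explicit column names (the names here are assumptions, not part of the original example):
import pandas as pd

# Read the fixed-width file without treating the first line as a header.
df = pd.read_fwf("myfile.txt", widths=[1, 5, 8], header=None, names=["id", "name", "surname"])
print(df)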
If you want to parse the file yourself instead, which is also quite simple:
import pandas as pd

# column name + size
meta_data = [('id', 1), ('name', 5), ('surname', 8)]

def my_parser(line):
    curr_dict = {}
    start = 0
    end = 0
    for meta in meta_data:
        end = meta[1] + end
        curr_dict[meta[0]] = line[start:end]
        start = end
    return curr_dict

with open("myfile.txt", "r") as f_o:
    lines = f_o.readlines()

dicts = []
for line in lines:
    dicts.append(my_parser(line))

df = pd.DataFrame(dicts)
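For a single line, the parser above behaves like this (a quick check, assuming my_parser and meta_data as defined above):
print(my_parser("1name1surname1"))
# {'id': '1', 'name': 'name1', 'surname': 'surname1'}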
How can I convert this text file to JSON? Ultimately, I'll be inserting the JSON blobs into a NoSQL database, but for now I plan to parse the text files, build a Python dict, then dump to JSON.
I think there has to be a way to do this with a dict comprehension that I'm just not seeing/following (I'm new to Python).
Example of a file:
file_1.txt
[namespace1] => metric_A = value1
[namespace1] => metric_B = value2
[namespace2] => metric_A = value3
[namespace2] => metric_B = value4
[namespace2] => metric_B = value5
Example of the dict I want to build to convert to JSON:
{ "file1" : {
    "namespace1" : {
        "metric_A" : "value1",
        "metric_B" : "value2"
    },
    "namespace2" : {
        "metric_A" : "value3",
        "metric_B" : ["value4", "value5"]
    }
  }
}
I currently have this working, but my code is a total mess (and much more complex than this example, with clean-up etc.). I'm basically going line by line through the file, building a Python dict. I check each namespace for existence in the dict; if it exists, I check the metric. If the metric already exists, I know I have duplicates and need to convert the value to an array that contains the existing value and my new value(s). There has to be a simpler/cleaner way.
import glob
import json

answer = {}
for fname in glob.glob('file_*.txt'):  # loop over all filenames
    answer[fname] = {}
    with open(fname) as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue
            splits = line.split()[::2]
            splits[0] = splits[0][1:-1]
            namespace, metric, value = splits  # all the values in the line that we're interested in
            answer[fname].setdefault(namespace, {})[metric] = value  # populate the dict

required_json = json.dumps(answer)  # turn the dict into proper JSON
You can use a regex for that. re.findall(r'\w+', line) will find all the text groups you are after; the rest is saving them in a dictionary of dictionaries. The simplest way to do that is to use defaultdict from collections.
import re
from collections import defaultdict

answer = defaultdict(lambda: defaultdict(lambda: []))

with open('file_1.txt', 'r') as f:
    for line in f:
        namespace, metric, value = re.findall(r'\w+', line)
        answer[namespace][metric].append(value)
Since we expect exactly 3 alphanumeric groups, we assign them to 3 variables: namespace, metric, and value. Finally, the outer defaultdict returns a new defaultdict the first time we see a namespace, and the inner defaultdict returns an empty list for the first append, making the code more compact.
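To reproduce the outer file-name key from the question and dump the result to JSON, the same idea could be extended like this. A sketch only: the glob pattern and the unwrapping of single-element lists are assumptions about the intended output shape:
import glob
import json
import re
from collections import defaultdict

result = {}
for fname in glob.glob('file_*.txt'):
    answer = defaultdict(lambda: defaultdict(list))
    with open(fname) as f:
        for line in f:
            parts = re.findall(r'\w+', line)
            if len(parts) == 3:
                namespace, metric, value = parts
                answer[namespace][metric].append(value)
    # Unwrap single-element lists so lone metrics stay scalar, as in the desired output.
    result[fname] = {
        ns: {m: (v[0] if len(v) == 1 else v) for m, v in metrics.items()}
        for ns, metrics in answer.items()
    }

required_json = json.dumps(result, indent=2)
print(required_json)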
I have multiple files, each containing multiple highly nested JSON rows. The first two rows of one such file look like:
{
    "u": "28",
    "evv": {
        "w": {
            "1": 400,
            "2": {
                "i": [{
                    "l": 14,
                    "c": "7",
                    "p": "4"
                }]
            }
        }
    }
}
{
    "u": "29",
    "evv": {
        "w": {
            "3": 400,
            "2": {
                "i": [{
                    "c": 14,
                    "y": "7",
                    "z": "4"
                }]
            }
        }
    }
}
They are actually single-line rows; I just wrote them here this way for readability.
My question is the following:
Is there any simple way, one that doesn't require writing dozens or hundreds of lines of Python specific to my file, to convert all these files to one CSV/Excel file (or multiple, i.e. one per file)? One example would be an external library or script that handles this particular task, regardless of the names of the fields.
The trap is that some elements do not appear in every line. For example, for the "i" key, we have 3 fields (l, c, p) in the first JSON object and 3 different ones (c, y, z) in the second. Ideally, the CSV should contain as many columns as there are possible fields (e.g. evv.w.2.i.l, evv.w.2.i.c, evv.w.2.i.p, evv.w.2.i.y, evv.w.2.i.z), at the risk of having (many) null values per CSV row.
A possible csv output for this example would have the following columns:
u, evv.w.1, evv.w.3, evv.w.2.i.l, evv.w.2.i.c, evv.w.2.i.p, evv.w.2.i.y, evv.w.2.i.z
Any idea/reference is welcome :)
Thanks
No, there is no general-purpose program that does precisely what you ask for.
You can, however, write a Python program that does it.
This program might do what you want. It does not have any code specific to your key names, but it is specific to your file format.
It can take several files on the command line.
Each file is presumed to have one JSON object per line.
It flattens the JSON object, joining labels with "."
import fileinput
import json
import csv

def flattify(d, key=()):
    # Lists: merge the flattened items under the same key path.
    if isinstance(d, list):
        result = {}
        for i in d:
            result.update(flattify(i, key))
        return result
    # Dicts: recurse, extending the key path.
    if isinstance(d, dict):
        result = {}
        for k, v in d.items():
            result.update(flattify(v, key + (k,)))
        return result
    # Scalars: a single entry keyed by the accumulated path.
    return {key: d}

total = []
for line in fileinput.input():
    if line.strip():
        line = json.loads(line)
        line = flattify(line)
        line = {'.'.join(k): v for k, v in line.items()}
        total.append(line)

keys = set()
for d in total:
    keys.update(d)

with open('result.csv', 'w') as output_file:
    output_file = csv.DictWriter(output_file, sorted(keys))
    output_file.writeheader()
    output_file.writerows(total)
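As a quick sanity check, flattify applied to the first sample row produces dotted keys like evv.w.2.i.l. The expected result is shown as a comment; this assumes flattify as defined above:
row = {"u": "28", "evv": {"w": {"1": 400, "2": {"i": [{"l": 14, "c": "7", "p": "4"}]}}}}
flat = {'.'.join(k): v for k, v in flattify(row).items()}
print(flat)
# {'u': '28', 'evv.w.1': 400, 'evv.w.2.i.l': 14, 'evv.w.2.i.c': '7', 'evv.w.2.i.p': '4'}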
Please check if this (python3) solution works for you.
import json
import csv

with open('test.json') as data_file:
    with open('output.csv', 'w', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        for line in data_file:
            data = json.loads(line)
            output = [[data['u'], data['evv']['w'].get('1'), data['evv']['w'].get('3'),
                       data['evv']['w'].get('2')['i'][0].get('l'), data['evv']['w'].get('2')['i'][0].get('c'),
                       data['evv']['w'].get('2')['i'][0].get('p'), data['evv']['w'].get('2')['i'][0].get('y'),
                       data['evv']['w'].get('2')['i'][0].get('z')]]
            a.writerows(output)
test.json
{ "u": "28", "evv": { "w": { "1": 400, "2": { "i": [{ "l": 14, "c": "7", "p": "4" }] } } }}
{"u":"29","evv":{ "w":{ "3":400, "2":{ "i":[{ "c":14, "y":"7", "z":"4" } ] } } }}
output
$ python3 pyprog.py
$ more output.csv
28,400,,14,7,4,,
29,,400,,14,,7,4
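If an external library is acceptable, pandas can do most of the flattening too: pandas.json_normalize turns nested dict keys into dotted column names, though list-valued fields such as evv.w.2.i stay as lists unless handled separately. A rough sketch, reusing the test.json layout above:
import json
import pandas as pd

records = []
with open("test.json") as f:              # one JSON object per line, as above
    for line in f:
        if line.strip():
            records.append(json.loads(line))

df = pd.json_normalize(records, sep=".")  # nested dict keys become dotted column names
df.to_csv("result.csv", index=False)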
I'm trying to figure out what is the best way to go about this problem:
I'm reading text lines from a buffer that eventually produces a log that looks something like this:
Some_Information: here there's some information about date and hour
Additional information: log summary #1234:
details {
    name: "John Doe"
    address: "myAdress"
    phone: 01234567
}
information {
    age: 30
    height: 1.70
    weight: 70
}
I would like to get all the fields in this log into a dictionary which I can later turn into a JSON file. The different sections in the log are not important, so for example if myDictionary is a dictionary variable in Python, I would like:
> myDictionary['age']
will show me 30.
and the same for all other fields.
Speed is very important here, which is why I would like to go through every line only once and get it into a dictionary.
My way of doing this would be: for each line that contains a ":" colon, split the string and put the key and the value into the dictionary.
Is there a better way to do it?
Is there any Python module that would be sufficient?
If more information is needed please let me know.
Edit:
So I've tried something that looks to work best so far.
I am currently reading from a file to simulate the reading of the buffer.
My code:
import json
import shlex

newDict = dict()

with open('log.txt') as f:
    for line in f:
        try:
            line = line.replace(" ", "")
            stringSplit = line.split(':')
            key = stringSplit[0]
            value = stringSplit[1]
            value = shlex.split(value)
            newDict[key] = value[0]
        except:
            continue

with open('result.json', 'w') as fp:
    json.dump(newDict, fp)
Resulting in the following .json:
{"name": "JohnDoe", "weight": "70", "Additionalinformation": "logsummary#1234",
"height": "1.70", "phone": "01234567", "address": "myAdress", "age": "30"}
You haven't described exactly what the desired output should be from the sample input, so it's not completely clear what you want done. So I guessed and the following only extracts data values from lines following one that contains a '{' until one with a '}' in it is encountered, while ignoring others.
It uses the re module to isolate the two parts of each dictionary item definition found on the line, and then uses the ast module to convert the value portion of that into a valid Python literal (i.e. string, number, tuple, list, dict, bool, and None).
import ast
import json
import re

pat = re.compile(r"""(?P<key>\w+)\s*:\s*(?P<value>.+)$""")

data_dict = {}
with open('log.txt', 'rU') as f:
    braces = 0
    for line in (line.strip() for line in f):
        if braces > 0:
            match = pat.search(line)
            if match and len(match.groups()) == 2:
                key = match.group('key')
                value = ast.literal_eval(match.group('value'))
                data_dict[key] = value
        elif '{' in line:
            braces += 1
        elif '}' in line:
            braces -= 1
        else:
            pass  # ignore line

print(json.dumps(data_dict, indent=4))
Output from your example input:
{
    "name": "John Doe",
    "weight": 70,
    "age": 30,
    "height": 1.7,
    "phone": 342391,
    "address": "myAdress"
}
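Note that the phone value 342391 comes from Python 2 evaluating 01234567 as an octal literal; under Python 3, ast.literal_eval raises an error on that token. A small hedged fallback keeps values that aren't valid literals as plain strings:
import ast

def parse_value(raw):
    """Best-effort conversion: return a Python literal if possible, else the raw string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw

# e.g. parse_value('30') -> 30, parse_value('"John Doe"') -> 'John Doe',
#      parse_value('01234567') -> '01234567' on Python 3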
I am importing data from Mongo into a CSV file. The import consists of "timestamp" and "text" for each JSON document.
The documents:
{
    name: ...,
    size: ...,
    timestamp: ISODate("2013-01-09T21:04:12Z"),
    data: { text: ..., place: ...},
    other: ...
}
The code:
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
I would like to remove duplicates (some Mongo docs have the same text), and I would like to keep the first instance (with regards to the time) intact. Is it possible to remove these dupes as I import?
Thanks for your help!
I would use a set to store the hashes of the data, and check for duplicates. Something like this:
import md5

hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        digest = md5.new(r['text']).digest()
        if digest in hashes:
            # It's a duplicate!
            continue
        else:
            hashes.add(digest)
            print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
It's worth noting that you could use the text field directly, but for larger text fields storing just the hash is much more memory efficient.
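For reference, a rough Python 3 sketch of the same idea uses hashlib and encodes the text before hashing, with the csv module handling the quoting. The db.hello collection and the field names are taken from the question, and the output path here is an assumption:
import csv
import hashlib

output = 'output.csv'  # output path (assumption); db is the Mongo database object from the question

hashes = set()
with open(output, 'w', newline='') as fp:
    writer = csv.writer(fp)
    for r in db.hello.find(fields=['text', 'timestamp']):
        digest = hashlib.md5(r['text'].encode('utf-8')).digest()
        if digest in hashes:
            continue  # duplicate text: skip it
        hashes.add(digest)
        writer.writerow([r['text'], r['timestamp'].strftime('%H:%M:%S')])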
You just need to maintain a map (dictionary) of (text, timestamp) pairs. The text is the key, so there won't be any duplicates. I will assume the order of reading is not guaranteed to return the oldest timestamp first; in that case you will have to make two passes: one for reading and a later one for writing.
textmap = {}

def insert(text, ts):
    global textmap
    if text in textmap:
        textmap[text] = min(ts, textmap[text])
    else:
        textmap[text] = ts

for r in db.hello.find(fields=['text', 'timestamp']):
    insert(r['text'], r['timestamp'])

for text in textmap:
    print >>fp, text, textmap[text]  # with whatever format desired.
At the end, you can also easily convert the dictionary into a list of tuples, in case you want to sort the results by timestamp before printing, for example.
(See Sort a Python dictionary by value.)
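For instance, a minimal sketch of that conversion and sort, assuming textmap as built above and timestamps that compare chronologically (as datetime objects do):
# Oldest-first list of (text, timestamp) tuples, ready for writing.
rows = sorted(textmap.items(), key=lambda kv: kv[1])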