How to read the lines from a mmap into a dictionary? - python

My mmap opens a JSON file from disk. Currently I am able to read the file line by line, but I want to store this information in a dictionary of key/value pairs, similar to the structure of the JSON file's contents.
Currently, I use the following line of code to read line by line on Windows.
import mmap

filename = r'C:\Workspace\tempfile.json'  # raw string: '\t' would otherwise be a tab escape
resultsDictionary = {}
with open(filename, "r+b") as f:
    map_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(map_file.readline, b""):
        print(line)
# I want to be able to store it in a resultsDictionary so I could use that resultsDictionary in latter method in my python code. I am not sure on how to do this.
Any help would be appreciated.

You should use the json module from the python standard library.
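A minimal sketch of that approach, assuming the whole file is a single JSON object (the filename and sample data here are stand-ins): the mmap's entire contents can be parsed into a dictionary in one call, with no need to assemble it line by line.

```python
import json
import mmap

# Create a small JSON file to demonstrate (stand-in for the real tempfile.json).
with open('tempfile.json', 'w') as f:
    json.dump({'name': 'demo', 'score': 10}, f)

# mmap the file and parse its full contents into a dictionary in one call.
with open('tempfile.json', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as map_file:
        resultsDictionary = json.loads(map_file[:].decode('utf-8'))
```

If the mmap isn't needed for anything else, `json.load(f)` on the plain file object does the same job.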

Related

Reading large compressed files

This might be a simple question, but I can't seem to find the answer or work out why it is not working in this specific case.
I want to read large files, they can be compressed or not. I used contextlib to write a contextmanager function to handle this. Then using the with statement I read the files in the main script.
My problem here is that the script uses a lot of memory then gets killed (testing using a compressed file). What am I doing wrong? Should I approach this differently?
import gzip
import logging
from contextlib import contextmanager

def process_vcf(location):
    logging.info('Processing vcf')
    logging.debug(location)
    with read_compressed_or_not(location) as vcf:
        for line in vcf.readlines():
            if line.startswith('#'):
                logging.debug(line)

@contextmanager
def read_compressed_or_not(location):
    if location.endswith('.gz'):
        try:
            file = gzip.open(location)
            yield file
        finally:
            file.close()
    else:
        try:
            file = open(location, 'r')
            yield file
        finally:
            file.close()
The lowest-impact solution is simply to skip the readlines call. readlines returns a list containing every line in the file, so it does hold the entire file in memory. Iterating over the file object itself yields one line at a time, so the whole file never has to be in memory at once.
with read_compressed_or_not(location) as vcf:
    for line in vcf:
        if line.startswith('#'):
            logging.debug(line)
Instead of using for line in vcf.readlines(), you can do:
line = vcf.readline()
while line:
    # Do stuff
    line = vcf.readline()
This loads only a single line into memory at a time.
The file-opening function is the only real difference between reading a gzip file and a plain one, so you can assign the opener dynamically and then read the file with it. That way there is no need for a custom context manager at all.
import gzip

open_fn = gzip.open if location.endswith(".gz") else open
with open_fn(location, mode="rt") as vcf:
    for line in vcf:
        ...

How to create and use a text file in Python?

I need to create a text file in Python to store certain data from a game. I do not want to use numpy, or any external libraries if at all possible.
I need to store some numerical data. Do text files require string data, and does the data come back out of the file as a string?
I know how to create and open a text file, and how to convert string to integer and vice versa, as well as handle CSV file data. I do not know how to handle a text file.
Any ideas on what to do?
To create a file:
file = open("textfile.txt","w+")
This will create a file if it doesn't exist in the directory.
To write inside it:
file.write("This is the content of the file.")
And then you'll have to close the instance with
file.close()
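On the numerical-data question: text files hold only strings, so numbers go in through str() and come back out as strings to be converted with int() or float(). A small sketch (the filename is made up):

```python
score = 1250

# write() accepts strings only, so convert the number first
with open("score.txt", "w") as f:
    f.write(str(score))

# read() returns a string; convert it back to a number
with open("score.txt") as f:
    loaded = int(f.read())
```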
By using the with open statement you can create the file and write to it in one step. Here is an example, where 'w' is write mode:
with open('test.txt', 'w') as d:
    d.write('your text goes here')
You can write to the file like this; if the file does not exist, it will be created.
Any ideas on what to do?
Put your data into a dict and use the built-in json module, for example:
import json

data = {'gold': 500, 'name': 'xyzzy'}

# writing
with open('save.json', 'w') as f:
    json.dump(data, f)

# reading
with open('save.json', 'r') as f:
    data2 = json.load(f)
This creates a human-readable text file.

How do I loop through a file and remove all lines that fit the condition?

I have a huge file. I've tried with other software, and it didn't work. So I want to make a custom script.
However, I just cannot work it out myself.
I want to delete every line in a file with the following condition: if "[" in line:
File in question is a .txt file with about 14,000,000 lines. I would prefer something fast.
I've tried other similar functions on this page, but I couldn't find any that would fit my requirements.
Instead of deleting, you can pretty easily make a copy of the file with only the desired records.
in_file_path = 'xxxx'
out_file_path = 'yyyy'
with open(in_file_path, 'r') as fh_in:
    with open(out_file_path, 'w') as fh_out:
        for line in fh_in:
            if '[' not in line:
                fh_out.write(line)
If you want to go even faster, you can read and write in binary mode and check for b'[' in the line.
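A sketch of that binary-mode variant; the tiny generated input stands in for the real 14,000,000-line file, and the paths are placeholders:

```python
# Sample input standing in for the real file.
with open('in.txt', 'wb') as f:
    f.write(b'keep me\ndrop [this]\nkeep too\n')

# Binary mode skips text decoding/encoding, which saves time on huge files.
with open('in.txt', 'rb') as fh_in, open('out.txt', 'wb') as fh_out:
    for line in fh_in:
        if b'[' not in line:
            fh_out.write(line)
```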
Use the readline method of the file object inside a while loop. While in the loop, collect every line that does not match the condition into a data structure.
Afterwards, open a new file and write the entire structure out to it.
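A minimal sketch of that readline/while approach (file names are placeholders, and the sample input is generated just for illustration):

```python
# Sample input standing in for the real file.
with open('input.txt', 'w') as f:
    f.write('alpha\nbeta [x]\ngamma\n')

# Collect every line that does not contain '[' using readline in a while loop.
kept_lines = []
with open('input.txt') as f:
    line = f.readline()
    while line:              # readline returns '' at end of file
        if '[' not in line:
            kept_lines.append(line)
        line = f.readline()

# Afterwards, write the entire structure to a new file.
with open('output.txt', 'w') as f:
    f.writelines(kept_lines)
```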
Try this. This is a simple read and write of a file:
with open("sample_file.txt", "r") as reader:
    new_file = []
    for line in reader:
        if "[" not in line:
            new_file.append(line)

with open("new_file.txt", "w+") as writer:
    writer.writelines(new_file)

Read only first line of gzip JSON file

Although a lot of code has been posted here about how to read the first line of a file, I cannot figure out how to only read the first line of a gzipped JSON file in Python.
Here is my current working example. However, it contains a nasty break statement, and the loop seems completely unnecessary:
import gzip
import json

for line in gzip.open(file, 'rb'):
    one_line = json.loads(line)
    print(one_line)
    break
Is there a solution that keeps the json.loads() command (or a similar one that reads in the JSON file correctly), while only reading the first line of the gzipped JSON file?
Call readline() instead of a for loop.
with gzip.open(file, 'rb') as f:
    line = f.readline()

one_line = json.loads(line)
print(one_line)

How to parse WIkidata JSON (.bz2) file using Python?

I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (from here, a .bz2 file, size ~18 GB).
However, I cannot open the file; it's just too big for my computer.
Is there a way to look into the file without extracting the full .bz2 file, especially using Python? I know that there is a PHP dump reader (here), but I can't use it.
I came up with a strategy that lets you use the json module to access information without decompressing the full file:
import bz2
import json

with bz2.open(filename, "rt") as bzinput:
    lines = []
    for i, line in enumerate(bzinput):
        if i == 10:
            break
        tweets = json.loads(line)
        lines.append(tweets)
In this way lines will be a list of dictionaries that you can easily manipulate, for example reducing their size by removing keys you don't need.
Note also that the condition i == 10 can be changed arbitrarily to fit your needs. For example, you may parse a few lines at a time, analyze them, and write to a text file the indices of the lines you actually want from the original file. Then it is sufficient to read only those lines (using a similar condition on i in the for loop).
You can use the BZ2File interface to manipulate the compressed file, but you can NOT simply load the whole thing with the json module; it would take too much memory. You will have to index the file, meaning you read the file line by line and save the position and length of each interesting object in a dictionary (hashtable); then you can extract a given object and load it with the json module.
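A minimal sketch of that indexing idea on a small generated line-oriented file (a real Wikidata dump would be read through bz2.BZ2File instead of open, and the entity IDs here are made up):

```python
import json

# Build a toy line-oriented JSON file (stand-in for the decompressed dump).
records = [{'id': 'Q1', 'label': 'universe'}, {'id': 'Q2', 'label': 'Earth'}]
with open('dump.jsonl', 'wb') as f:
    for rec in records:
        f.write(json.dumps(rec).encode() + b'\n')

# Pass 1: record each object's byte offset and length, keyed by its id.
index = {}
with open('dump.jsonl', 'rb') as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        obj_id = json.loads(line)['id']
        index[obj_id] = (pos, len(line))

# Later: seek straight to one object and load only that line.
with open('dump.jsonl', 'rb') as f:
    pos, length = index['Q2']
    f.seek(pos)
    entity = json.loads(f.read(length))
```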
You'd have to do line-by-line processing:
import bz2
import json

path = "latest.json.bz2"

with bz2.BZ2File(path) as file:
    for line in file:
        line = line.decode().strip()
        if line in {"[", "]"}:
            continue
        if line.endswith(","):
            line = line[:-1]
        entity = json.loads(line)
        # do your processing here
        print(str(entity)[:50] + "...")
Seeing as WikiData is now 70GB+, you might wish to process it directly from the URL:
import bz2
import json
from urllib.request import urlopen

path = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

with urlopen(path) as stream:
    with bz2.BZ2File(stream) as file:
        ...