I have a huge txt file and I need to put it into DynamoDB.
The file structure is:
223344|blue and orange|Red|16/12/2022
223344|blue and orange|Red|16/12/2022
...
This file has more than 200M lines.
I have tried to convert it to a JSON file using the code below:
import json

filename = 'smini_final_data.json'
with open('mini_data.txt', 'r') as f_in:
    for line in f_in:
        line = line.strip().split('|')
        result = {"fild1": line[0], "fild2": line[1],
                  "fild3": line[2].replace(" ", ""), "fild4": line[3]}
        # Reopen, extend, and rewrite the whole JSON file for every input line;
        # this assumes the file already exists and contains a JSON list.
        with open(filename, "r") as file:
            data = json.load(file)
        data.append(result)
        with open(filename, "w") as file:
            json.dump(data, file)
But this isn't efficient, and it's only the first part of the job (converting the data to JSON); after this I need to put the JSON into DynamoDB.
I have used this code (it looks good):
import decimal
import json

import boto3

def insert(self):
    if not self.dynamodb:
        self.dynamodb = boto3.resource(
            'dynamodb', endpoint_url="http://localhost:8000")
    table = self.dynamodb.Table('fruits')
    json_file = open("final_data.json")
    orange = json.load(json_file, parse_float=decimal.Decimal)
    with table.batch_writer() as batch:
        for fruit in orange:
            fild1 = fruit['fild1']
            fild2 = fruit['fild2']
            fild3 = fruit['fild3']
            fild4 = fruit['fild4']
            batch.put_item(
                Item={
                    'fild1': fild1,
                    'fild2': fild2,
                    'fild3': fild3,
                    'fild4': fild4
                }
            )
So, does anyone have some suggestions for processing this txt file more efficiently?
Thanks
The step of converting from delimited text to JSON seems unnecessary in this case. The way you've written it requires reopening and rewriting the JSON file for each line of your delimited text file. That I/O overhead repeated 200M times can really slow things down.
I suggest going straight from your delimited text to DynamoDB. It might look something like this:
import boto3

dynamodb = boto3.resource(
    'dynamodb', endpoint_url="http://localhost:8000")
table = dynamodb.Table('fruits')

with table.batch_writer() as batch:
    with open('mini_data.txt', 'r') as f_in:
        for line in f_in:
            line = line.strip().split('|')
            batch.put_item(
                Item={
                    'fild1': line[0],
                    'fild2': line[1],
                    'fild3': line[2].replace(" ", ""),
                    'fild4': line[3]
                }
            )
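Note that batch_writer buffers the put requests and sends them to DynamoDB as BatchWriteItem calls of up to 25 items each, automatically resending any unprocessed items, so you aren't paying one network round trip per row. For 200M lines you could also split the input file and run several copies of this loop in parallel processes, since write throughput, not parsing, is usually the bottleneck.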
Related
My program takes a CSV file as input and writes it out as a file in JSON format. On the final line, I use print to output the contents of the JSON file to the screen. However, it does not print the JSON file's contents, and I don't understand why.
Here is my code that I have so far:
import csv
import json

def jsonformat(infile, outfile):
    contents = {}
    csvfile = open(infile, 'r')
    reader = csvfile.read()
    for m in reader:
        key = m['No']
        contents[key] = m
    jsonfile = open(outfile, 'w')
    jsonfile.write(json.dumps(contents))
    csvfile.close()
    jsonfile.close()
    return jsonfile

infile = 'orders.csv'
outfile = 'orders.json'
output = jsonformat(infile, outfile)
print(output)
Your function returns the jsonfile variable, which is a file object.
Try replacing the final return jsonfile with this:
jsonfile.close()
with open(outfile, 'r') as file:
    return file.read()
Your function returns a file handle to the file jsonfile that you then print. Instead, return the contents that you wrote to that file. Since you opened the file in w mode, any previous contents are removed before writing the new contents, so the contents of your file are going to be whatever you just wrote to it.
In your function, do:
def jsonformat(infile, outfile):
    ...
    # Instead of this:
    # jsonfile.write(json.dumps(contents))
    # do this:
    json_contents = json.dumps(contents, indent=4)  # indent=4 to pretty-print
    jsonfile.write(json_contents)
    ...
    return json_contents
Aside from that, you aren't reading the CSV file correctly. If your file has a header, you can use csv.DictReader to read each row as a dictionary. Then you'll be able to use for m in reader: key = m['No']. Change reader = csvfile.read() to reader = csv.DictReader(csvfile).
As of now, reader is a string that contains the entire contents of your file, so for m in reader iterates over each character of that string, and you cannot access the "No" key on a character.
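Putting both fixes together, a corrected version of the function might look like this (a sketch assuming orders.csv has a header row containing a No column):

import csv
import json

def jsonformat(infile, outfile):
    contents = {}
    with open(infile, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)  # each row becomes a dict keyed by the header
        for m in reader:
            contents[m['No']] = m
    json_contents = json.dumps(contents, indent=4)
    with open(outfile, 'w') as jsonfile:
        jsonfile.write(json_contents)
    return json_contents

print(jsonformat('orders.csv', 'orders.json'))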
a_file = open("sample.json", "r")
a_json = json.load(a_file)
pretty_json = json.dumps(a_json, indent=4)
a_file.close()
print(pretty_json)
Use this sample to print the contents of your JSON file. Have a good day.
How can I create an empty JSON file and append each entry to it in the following format?
[
{"name":"alan","job":"clerk"},
{"name":"bob","job":"engineer"}
]
Code
import json

with open("test.json", mode='w', encoding='utf-8') as f:
    json.dump([], f)

test_data = ['{"name":"alan","job":"clerk"}', '{"name":"bob","job":"engineer"}']
for i in test_data:
    with open("test.json", mode='w', encoding='utf-8') as fileobj:
        json.dump(i, fileobj)
How can this be done efficiently?
You can't modify the JSON content like that. You'll need to modify the data structure in memory and then completely rewrite the JSON file. You might be able to just read the data from the JSON file at startup and write it back at shutdown.
import json

def store_my_data(data, filename='test.json'):
    """ write data to json file """
    with open(filename, mode='w', encoding='utf-8') as f:
        json.dump(data, f)

def load_my_data(filename='test.json'):
    """ load data from json file """
    with open(filename, mode='r', encoding='utf-8') as f:
        return json.load(f)

raise Exception  # skipping some steps here

test_data = [
    {"name": "alan", "job": "clerk"},
    {"name": "bob", "job": "engineer"}
]
item_one = test_data[0]
item_two = test_data[1]

# You already know how to store data in a json file.
store_my_data(test_data)

# Suppose you don't have any data at the start.
current_data = []
store_my_data(current_data)

# Later, you want to add to the data.
# You will have to change your data in memory,
# then completely rewrite the file.
current_data.append(item_one)
current_data.append(item_two)
store_my_data(current_data)
I have a CSV file like this:
#Description
#Param1: value
#Param2: value
...
#ParamN: value
Time (s),Header1,Header2
243.41745,3,1
243.417455,3,5
243.41746,7,6
...
I need to read it with Python; not using Pandas is a requirement. How do I read the CSV data itself, ignoring the initial lines up to the empty one? I am using the code below, which successfully reads the metadata.
def read(file_path: str):
    '''Read the data of the Digilent WaveForms Logic Analyzer Acquisition
    (model Discovery2).
    Parameter: File path.
    '''
    meta = {}
    RE_CONFIG = re.compile(r'^#(?P<name>[^:]+)(: *(?P<value>.+)\s*$)*')
    with open(file_path, 'r') as fh:
        # Read the metadata and description at the beginning of the file.
        for line in fh.readlines():
            line = line.strip()
            if not line:
                break
            config = RE_CONFIG.match(line)
            if config:
                if not config.group('value'):
                    meta.update({'Description': config.group('name')})
                else:
                    meta.update({config.group('name'): config.group('value')})
        # Read the data itself.
        data = csv.DictReader(fh, delimiter=',')
        return data, meta
This seems to work. In the portion that reads the metadata, I had to change for line in fh.readlines(): to for line in fh: so that the data lines wouldn't already be consumed (readlines() reads the whole file into memory at once, leaving nothing for the DictReader), then create the DictReader and materialize it with list() so the rows are read before the file is closed.
import csv
import re
from pprint import pprint

def read(file_path: str):
    '''Read the data of the Digilent WaveForms Logic Analyzer Acquisition
    (model Discovery2).
    Parameter: File path.
    '''
    meta = {}
    RE_CONFIG = re.compile(r'^#(?P<name>[^:]+)(: *(?P<value>.+)\s*$)*')
    with open(file_path, 'r') as fh:
        # Read the metadata and description at the beginning of the file.
        for line in fh:  # CHANGED
            line = line.strip()
            if not line:
                break
            config = RE_CONFIG.match(line)
            if config:
                if not config.group('value'):
                    meta.update({'Description': config.group('name')})
                else:
                    meta.update({config.group('name'): config.group('value')})
        # Read the data itself.
        reader = csv.DictReader(fh, delimiter=',')
        data = list(reader)  # materialize the rows before the file closes
    return data, meta

res = read('mixed.csv')
pprint(res)
With boto3, you can read a file's content from a location in S3, given a bucket name and the key, as follows (this assumes a preliminary import boto3):
s3 = boto3.resource('s3')
content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read()
This returns a bytes object. The specific file I need to fetch happens to be a collection of dictionary-like objects, one per line, so it is not valid JSON as a whole. Instead of reading it as one string, I'd like to stream it as a file object and read it line by line; I cannot find a way to do this other than downloading the file locally first:
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
filename = 'my-file'
bucket.download_file(S3_KEY, filename)
f = open('my-file')
What I'm asking is whether it's possible to have this kind of control over the file without having to download it locally first.
I found .splitlines() worked for me...
txt_file = s3.Object(bucket, file).get()['Body'].read().decode('utf-8').splitlines()
Without .splitlines(), the whole blob of text was returned, and trying to iterate over it resulted in each character being iterated. With .splitlines(), iteration by line was achievable.
In my example here I iterate through each line and compile it into a dict.
txt_file = s3.Object(bucket, file).get()['Body'].read().decode(
    'utf-8').splitlines()

for line in txt_file:
    arr = line.split()
    print(arr)
You also can take advantage of StreamingBody's iter_lines method:
for line in s3.Object(bucket, file).get()['Body'].iter_lines():
    decoded_line = line.decode('utf-8')  # if decoding is needed
That consumes less memory than reading the whole object at once and then splitting it.
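For the file described in the question, with one dictionary-like object per line, a fuller sketch might look like the following; the bucket and key names are hypothetical, and the per-line parsing depends on your actual format:

import boto3

s3 = boto3.resource('s3')
body = s3.Object('my-bucket', 'records.txt').get()['Body']  # hypothetical names
for raw in body.iter_lines():
    line = raw.decode('utf-8')
    if not line:
        continue
    # Parse the dictionary-like line however your format requires,
    # e.g. ast.literal_eval for Python-style dict literals.
    print(line)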
The following comment from kooshiwoosh to a similar question provides a nice answer:
from io import TextIOWrapper
from gzip import GzipFile
...
# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
# if gzipped
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)
for line in data:
    # process line
This will do the job:
bytes_to_read = 512
content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read(bytes_to_read)
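That call returns only the first 512 bytes of the object. If you want to stream the whole object in fixed-size chunks, you can keep calling read() until it returns an empty byte string; a minimal sketch, where process is a hypothetical handler:

body = s3.Object(BUCKET_NAME, S3_KEY).get()['Body']
while True:
    chunk = body.read(bytes_to_read)  # read up to 512 bytes at a time
    if not chunk:                     # empty bytes means end of stream
        break
    process(chunk)                    # hypothetical handler for each chunk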
This works for me:
json_object = s3.get_object(Bucket = bucket, Key = json_file_name)
json_file_reader = json_object['Body'].read()
content = json.loads(json_file_reader)
As of now, you can use the download_fileobj function. Here is an example for a CSV file:
import boto3
import csv

bucket_name = 'my_bucket'
file_key = 'my_key/file.csv'
output_file_path = 'output.csv'

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

# Dump the binary content in append mode
with open(output_file_path, 'ab') as file_object:
    bucket.download_fileobj(
        Key=file_key,
        Fileobj=file_object,
    )

# Read your file as usual
with open(output_file_path, 'r') as csvfile:
    lines = csv.reader(csvfile)
    for line in lines:
        doWhatEver(line[0])
I am trying to save my data to a file. My problem is that the file I saved contains double quotes at the beginning and end of each line. I have tried many ways to solve it, from str.replace() and strip to csv, json, and pickle. However, the problem is still there. I am stuck. Please help me. I will detail my problem below.
Firstly, I have a file called angles.txt like that:
{'left_w0': -2.6978887076110842, 'left_w1': -1.3257428944152834, 'left_w2': -1.7533400385498048, 'left_e0': 0.03566505327758789, 'left_e1': 0.6948932961181641, 'left_s0': -1.1665923878540039, 'left_s1': -0.6726505747192383}
{'left_w0': -2.6967382220214846, 'left_w1': -0.8440729275695802, 'left_w2': -1.7541070289428713, 'left_e0': 0.036048548474121096, 'left_e1': 0.16682041049194338, 'left_s0': -0.7731263162109375, 'left_s1': -0.7056311616210938}
I read the text file line by line and transfer the lines into a dict variable called data. Here is the file-reading code:
def read_data_from_file(file_name):
    data = dict()
    f = open(file_name, 'r')
    # number_lines is the number of lines in the file, defined elsewhere
    for index_line in range(1, number_lines + 1):
        data[index_line] = eval(f.readline())
    f.close()
    return data
Then I changed something in the data. Something like data[index_line]['left_w0'] = data[index_line]['left_w0'] + 0.0006. After that I wrote my data into another text file. Here is the code:
def write_data_to_file(data, file_name):
    f = open(file_name, 'w')
    data_convert = dict()
    for index_line in range(1, number_lines + 1):
        data_convert[index_line] = repr(data[index_line])
        data_convert[index_line] = data_convert[index_line].replace('"', '')  # I also used strip
        json.dump(data_convert[index_line], f)
        f.write('\n')
    f.close()
The result I received in the new file is:
"{'left_w0': -2.6978887076110842, 'left_w1': -1.3257428944152834, 'left_w2': -1.7533400385498048, 'left_e0': 0.03566505327758789, 'left_e1': 0.6948932961 181641, 'left_s0': -1.1665923878540039, 'left_s1': -0.6726505747192383}"
"{'left_w0': -2.6967382220214846, 'left_w1': -0.8440729275695802, 'left_w2': -1.7541070289428713, 'left_e0': 0.036048548474121096, 'left_e1': 0.166820410 49194338, 'left_s0': -0.7731263162109375, 'left_s1': -0.7056311616210938}"
I cannot remove the double quotes.
The quotes appear because you call repr() on each dict and then json.dump() the resulting string, which serializes it as a JSON string, wrapped in double quotes. You could simplify your code by removing these unnecessary transformations:
import json

def write_data_to_file(data, filename):
    with open(filename, 'w') as file:
        json.dump(data, file)

def read_data_from_file(filename):
    with open(filename) as file:
        return json.load(file)
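A usage sketch of the round trip with a hypothetical angles.json file, storing the whole structure as one JSON document instead of one repr per line:

data = {1: {'left_w0': -2.6978887076110842},
        2: {'left_w0': -2.6967382220214846}}
write_data_to_file(data, 'angles.json')

loaded = read_data_from_file('angles.json')
# JSON object keys are always strings, so the integer keys
# come back as '1' and '2' after the round trip.
loaded['1']['left_w0'] += 0.0006
write_data_to_file(loaded, 'angles.json')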