Python create list of dictionaries from csv on S3

I am trying to take a CSV and create a list of dictionaries in python with the CSV coming from S3. Code is as follows:
import os
import boto3
import csv
import json
from io import StringIO
import logging
import time
s3 = boto3.resource('s3')
s3Client = boto3.client('s3','us-east-1')
bucket = 'some-bucket'
key = 'some-key'
obj = s3Client.get_object(Bucket = bucket, Key = key)
lines = obj['Body'].read().decode('utf-8').splitlines(True)
newl = []
for line in csv.reader(lines, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar="\\"):
    newl.append(line)
fieldnames = newl[0]
newl1 = newl[1:]
reader = csv.DictReader(newl1,fieldnames)
out = json.dumps([row for row in reader])
jlist1 = json.loads(out)
but this gives me the error:
iterator should return strings, not list (did you open the file in text mode?)
if I alter the for loop to this:
for line in csv.reader(lines, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar="\\"):
    newl.append(','.join(line))
then it works; however, some fields contain commas, so this completely screws up the schema and shifts the data. For example:
|address1 |address2 |state|
------------------------------
|123 Main st|APT 3, Fl1|TX |
becomes:
|address1 |address2 |state|null|
-----------------------------------
|123 Main st|APT 3 |Fl1 |TX |
Where am I going wrong?

The problem is that you are building a list of lists here:
newl.append(line)
and, as the error says: iterator should return strings, not list
so try casting line to a string:
newl.append(str(line))
Hope this helps :)

I ended up changing the code to this:
obj = s3Client.get_object(Bucket=bucket, Key=key)
# split the raw text on newlines instead of re-joining parsed rows
lines1 = obj['Body'].read().decode('utf-8').split('\n')
# the first line holds the header; strip the quotes and split into field names
fieldnames = lines1[0].replace('"', '').split(',')
testls = [row for row in csv.DictReader(lines1[1:], fieldnames)]
out = json.dumps(testls)
jlist1 = json.loads(out)
And got the desired result
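For reference, csv.DictReader can also consume the decoded lines directly, which skips the join/re-parse round trip and still keeps commas inside quoted fields intact. A minimal sketch, reusing the bucket and key from the question:
import csv
import json
import boto3

s3Client = boto3.client('s3', 'us-east-1')
obj = s3Client.get_object(Bucket=bucket, Key=key)
lines = obj['Body'].read().decode('utf-8').splitlines(True)

# DictReader takes the header from the first line and honours the quoting,
# so a field like "APT 3, Fl1" stays in one column
jlist1 = [dict(row) for row in csv.DictReader(lines)]
out = json.dumps(jlist1)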

Related

Updating Json Object using data from a CSV File

I have a JSON file (>1GB) and another CSV file with one matching column (i.e. ID). I need to update the JSON file by mapping the CSV onto the JSON.
My first thought was to convert the JSON to CSV and then overwrite the CSV, but since the file is huge that is not the most efficient way. I am supposed to use Python.
import csv
import json

id = []
qrank = []

def readingCsvFile():
    with open('qrank.csv', 'r') as csvFile:
        dataCsv = csv.reader(csvFile)
        for row in dataCsv:
            id.append(row[0])
            qrank.append(row[1])

dataJson = [json.loads(line) for line in open('enhanced-wikipois', 'r', encoding='UTF-8')]
records = len(dataJson)
readingCsvFile()

for i in range(records):
    x = dataJson[i]['id']
    if (x in id):
        pos = id.index(x)
        dataJson[i]['wikiQRank'] = qrank[pos]

print(dataJson)
The size of the file is not really relevant. What's important is the number of JSON objects and the number of "qrank" values.
If you build a dictionary based on id and rank from the CSV file then the subsequent lookups will be much more efficient.
There are a number of other efficiencies that you could implement.
import csv
import json

CSVFILE = '/Volumes/G-Drive/qrank.csv'
JSONLFILE = '/Volumes/G-Drive/enhanced-wikipois'

def read_csv(filename):
    with open(filename, newline='') as data:
        reader = csv.reader(data)
        # map id -> rank; *_ absorbs any extra columns
        return {_id: rank for _id, rank, *_ in reader}

def read_jsonl(filename):
    with open(filename) as data:
        return [json.loads(line) for line in data]

id_dict = read_csv(CSVFILE)
json_data = read_jsonl(JSONLFILE)

for j in json_data:
    if (_id := j.get('id')) is not None:
        if (rank := id_dict.get(_id)) is not None:
            j['wikiQRank'] = rank

print(json_data)
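Since the goal is to update the file rather than just print it, the amended records can then be streamed back out one object per line. A minimal sketch, with a hypothetical output path:
# write the updated records back out as JSON Lines;
# 'enhanced-wikipois-updated' is a hypothetical output path
with open('enhanced-wikipois-updated', 'w', encoding='UTF-8') as out:
    for j in json_data:
        out.write(json.dumps(j) + '\n')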

replace('\n','') does not work for import to .txt format

This code belongs to an import page. The data I want to import is in .txt format, but it contains the \n character.
if request.method == "POST":
    txt_file = request.FILES['file']
    if not txt_file.name.endswith('.txt'):
        messages.info(request, 'This is not a txt file')
    data_set = txt_file.read().decode('latin-1')
    io_string = io.StringIO(data_set)
    next(io_string)
    csv_reader = csv.reader(io_string, delimiter='\t', quotechar="|")
    for column in csv_reader:
        b = Module_Name(
            user=request.user,
            a=column[1],
            b=column[2],
            c=column[3],
            d=column[4],
            e=column[5],
            f=column[6],
            g=column[7],
            h=column[8],
        )
        b.save()
    messages.success(request, "Successfully Imported...")
    return redirect("return:return_import")
This is the full version of my code. To explain: the data that arrives here as column[1] contains a \n character. The file is a .txt produced by another export, and in that export column[1] is:
This is
a value
and my Django localhost raises the warning new-line character seen in unquoted field - do you need to open the file in universal-newline mode? and aborts the import.
The csv reader iterates over rows, not columns. So if you want to join the data from a given column together, you must iterate over all the rows first. For example:
import csv
from io import StringIO

io_string = "this is , r0 c1\r\na value, r1 c2\r\n"
io_string = StringIO(io_string)
rows = csv.reader(io_string)
column_0_data = []
for row in rows:
    column_0_data.append(row[0])
print("".join(column_0_data))
The rest of your code looks iffy to me, but that is off topic.
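Note also that the warning itself points at quoting: the csv module only tolerates embedded newlines inside quoted fields. A minimal sketch with the delimiter and quotechar='|' from the question, assuming the export actually wraps the multi-line value in that quote character:
import csv
from io import StringIO

# the multi-line field is wrapped in the '|' quote character,
# so the reader keeps it together as a single value
io_string = StringIO("id\t|This is\na value|\tother\r\n")
reader = csv.reader(io_string, delimiter='\t', quotechar='|')
for row in reader:
    print(row)  # ['id', 'This is\na value', 'other']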

Parse JSON to CSV + additional columns

I'm attempting to parse a JSON file with the following syntax into CSV:
{"code":2000,"message":"SUCCESS","data":
{"1":
{"id":1,
"name":"first_name",
"icon":"url.png",
"attribute1":"value",
"attribute2":"value" ...},
"2":
{"id":2,
"name":"first_name",
"icon":"url.png",
"attribute1":"value",
"attribute2":"value" ...},
"3":
{"id":3,
"name":"first_name",
"icon":"url.png",
"attribute1":"value",
"attribute2":"value" ...}, and so forth
}}}
I have found similar questions (e.g. here and here) and I am working with the following method:
import requests
import json
import csv
import os

jsonfile = "/path/to.json"
csvfile = "/path/to.csv"

with open(jsonfile) as json_file:
    data = json.load(json_file)

data_file = open(csvfile, 'w')
csvwriter = csv.writer(data_file)
csvwriter.writerow(data["data"].keys())
for row in data:
    csvwriter.writerow(row["data"].values())
data_file.close()
but I am missing something.
I get this error when I try to run:
TypeError: string indices must be integers
and my csv output is:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,96
At the end of the day, I am trying to convert the following function (from PowerShell) to Python. This converted the JSON to CSV and added 3 additional custom columns to the end:
$json = wget $lvl | ConvertFrom-Json
$json.data | %{$_.psobject.properties.value} `
| select-object *,#{Name='Custom1';Expression={$m}},#{Name='Level';Expression={$l}},#{Name='Custom2';Expression={$a}},#{Name='Custom3';Expression={$r}} `
| Export-CSV -path $outfile
The output looks like:
"id","name","icon","attribute1","attribute2",..."Custom1","Custom2","Custom3"
"1","first_name","url.png","value","value",..."a","b","c"
"2","first_name","url.png","value","value",..."a","b","c"
"3","first_name","url.png","value","value",..."a","b","c"
As suggested by martineau in a now-deleted answer, my key name was incorrect.
I ended up with this:
import json
import csv

jsonfile = "/path/to.json"
csvfile = "/path/to.csv"

with open(jsonfile) as json_file:
    data = json.load(json_file)

data_file = open(csvfile, 'w')
csvwriter = csv.writer(data_file)
# get sample keys from the first entry
header = data["data"]["1"].keys()
# add the new custom fields
keys = list(header)
keys.append("field2")
keys.append("field3")
# write the header
csvwriter.writerow(keys)
# write one row per entry
total = data["data"]
for row in total:
    rowdefault = data["data"][str(row)].values()
    rowdata = list(rowdefault)
    rowdata.append("value1")
    rowdata.append("value2")
    csvwriter.writerow(rowdata)
data_file.close()
Here, I'm grabbing each row by its id key via str(row).
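For reference, the same job can be done with csv.DictWriter, which keeps the custom columns explicit. A minimal sketch under the same assumptions ("field2"/"field3" and "value1"/"value2" are the placeholder names and values from above, and "1" is a sample key from the JSON):
import json
import csv

with open(jsonfile) as json_file:
    data = json.load(json_file)

entries = data["data"]
# take the column order from the first entry, then append the custom columns
fieldnames = list(entries["1"].keys()) + ["field2", "field3"]

with open(csvfile, 'w', newline='') as data_file:
    writer = csv.DictWriter(data_file, fieldnames)
    writer.writeheader()
    for entry in entries.values():
        writer.writerow({**entry, "field2": "value1", "field3": "value2"})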

extract record by csv and filtering by date

I have a csv file where each record is a LinkedIn contact. I have to create another csv file containing only the contacts reached after a specific date (e.g. all the contacts connected to me after 1/04/2017).
So this is my implementation:
def import_from_csv(file):
    key_order = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            single_person = {"FirstName": row["FirstName"], "LastName": row["LastName"],
                             "EmailAddress": row["EmailAddress"], "Company": row["Company"],
                             "ConnectedOn": parser.parse(row["ConnectedOn"])}
            od = OrderedDict((k, single_person[k]) for k in key_order)
            linkedin_contacts.append(od)
    return linkedin_contacts
The first function gives me a list of ordered dicts. I don't know if the way I used to achieve the correct order is good; also, looking at some examples (like here), I'm not using the od.update method, but I don't think I need it. Is that correct?
Then I wrote a second function to filter the list:
def filter_by_date(connections):
    filtered_list = []
    target_date = parser.parse("01/04/2017")
    for row in connections:
        if row["ConnectedOn"] > target_date:
            filtered_list.append(row)
    return filtered_list
Am I doing this correctly?
Is there a way to optimize the code? Thanks
First point: you don't need the OrderedDict at all, just use a csv.DictWriter to write the filtered csv.
fieldnames = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")
with open("/path/to/final.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames)
    writer.writeheader()
    writer.writerows(filtered_contacts)
Second point: you don't need to create a new dict from the one yielded by the csv reader; just update the ConnectedOn key in place:
def import_from_csv(file):
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            row["ConnectedOn"] = parser.parse(row["ConnectedOn"])
            linkedin_contacts.append(row)
    return linkedin_contacts
And finally, if all you have to do is take the source csv, filter out records on ConnectedOn and write the result, you don't need to load the whole source in memory, build a filtered list (in memory again) and then write it out; you can stream the whole operation:
def filter_csv(source_path, dest_path, date):
    fieldnames = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")
    target = parser.parse(date)
    with open(source_path, newline="") as source, open(dest_path, "w", newline="") as dest:
        reader = csv.DictReader(source)
        writer = csv.DictWriter(dest, fieldnames)
        # if you want a header line with the fieldnames - else comment it out
        writer.writeheader()
        for row in reader:
            row_date = parser.parse(row["ConnectedOn"])
            if row_date > target:
                writer.writerow(row)
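Usage is then a single call; for example, with illustrative paths:
filter_csv("/path/to/source.csv", "/path/to/final.csv", "01/04/2017")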
And here you are, plain and simple.
NB: I don't know what parser.parse() is, but as other answers mention, you'd probably be better off using the datetime module instead.
For filtering you could use the filter() function:
def filter_by_date(connections):
    target_date = datetime.strptime("01/04/2017", '%d/%m/%Y').date()
    return list(filter(lambda x: x["ConnectedOn"] > target_date, connections))
And instead of creating a plain dict and then copying its values into an OrderedDict, you could write the values directly to the OrderedDict:
for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], '%d/%m/%Y').date()
    linkedin_contacts.append(od)
If you know the date format you don't need python_dateutil; you can use the built-in datetime.datetime.strptime() with the needed format.
You haven't specified the format string, so use:
from datetime import datetime

format = '%d/%m/%Y'
date_text = '01/04/2017'
# the inverse operation is datetime.strftime(format)
datetime.strptime(date_text, format)

# ....
# with format as a global
for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], format)
    linkedin_contacts.append(od)
Do:
def filter_by_date(connections, date_text):
    target_date = datetime.strptime(date_text, format)
    return [x for x in connections if x["ConnectedOn"] > target_date]

error in reading csv file content on S3 using boto

I am using boto to read a csv file and parse its contents. This is the code I wrote:
import boto
from boto.s3.key import Key
import pandas as pd
import io

conn = boto.connect_s3(keyId, sKeyId)
bucket = conn.get_bucket(bucketName)
# Get the Key object of the given key, in the bucket
k = Key(bucket, srcFileName)
content = k.get_contents_as_string()
reader = pd.read_csv(io.StringIO(content))
for row in reader:
    print(row)
But I am getting error at read_csv line:
TypeError: initial_value must be str or None, not bytes
How can I resolve this error and parse the contents of the csv file present on S3?
UPDATE: if I use BytesIO instead of StringIO then the print(row) line only prints 1st row of the csv. How do I loop over it?
This is my current code:
import boto3

s3 = boto3.resource('s3', aws_access_key_id=keyId, aws_secret_access_key=sKeyId)
obj = s3.Object(bucketName, srcFileName)
content = obj.get_contents_as_string()
reader = pd.read_csv(io.BytesIO(content), header=None)
count = 0
for index, row in reader.iterrows():
    print(row[1])
When I execute this I get an AttributeError: 's3.Object' object has no attribute 'get_contents_as_string'.
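For what it's worth, boto3's Object has no get_contents_as_string; the body is reached through get() instead, and decoding the bytes to text also resolves the original StringIO TypeError. A minimal sketch, reusing the names from the question:
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3', aws_access_key_id=keyId, aws_secret_access_key=sKeyId)
obj = s3.Object(bucketName, srcFileName)

# boto3 returns the body as a stream of bytes; decode before wrapping in StringIO
content = obj.get()['Body'].read().decode('utf-8')
df = pd.read_csv(io.StringIO(content))

# iterate rows with iterrows(); iterating the DataFrame itself yields column labels
for index, row in df.iterrows():
    print(row)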
