I have a JSON file (>1 GB) and another CSV file with one matching column (i.e. ID). I need to update the JSON file by mapping the CSV onto the JSON.
The approach I thought of at first was to convert the JSON to CSV and then overwrite the CSV, but since the file is huge, that's not the most efficient way. I am supposed to use Python.
import csv
import json

id = []
qrank = []

def readingCsvFile():
    with open('qrank.csv', 'r') as csvFile:
        dataCsv = csv.reader(csvFile)
        for row in dataCsv:
            id.append(row[0])
            qrank.append(row[1])

dataJson = [json.loads(line) for line in open('enhanced-wikipois', 'r', encoding='UTF-8')]
records = len(dataJson)

readingCsvFile()

for i in range(records):
    x = dataJson[i]['id']
    if (x in id):
        pos = id.index(x)
        dataJson[i]['wikiQRank'] = qrank[pos]

print(dataJson)
The size of the file is not really relevant. What's important is the number of JSON objects and the number of "qrank" values.
If you build a dictionary based on id and rank from the CSV file then the subsequent lookups will be much more efficient.
There are a number of other efficiencies that you could implement.
import csv
import json

CSVFILE = '/Volumes/G-Drive/qrank.csv'
JSONLFILE = '/Volumes/G-Drive/enhanced-wikipois'

def read_csv(filename):
    with open(filename, newline='') as data:
        reader = csv.reader(data)
        return {_id: rank for _id, rank, *_ in reader}

def read_jsonl(filename):
    with open(filename) as data:
        return [json.loads(line) for line in data]

id_dict = read_csv(CSVFILE)
json_data = read_jsonl(JSONLFILE)

for j in json_data:
    if (_id := j.get('id')) is not None:
        if (rank := id_dict.get(_id)) is not None:
            j['wikiQRank'] = rank

print(json_data)
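If holding the whole JSONL file in memory is still a concern, you could stream it line by line and write each updated record straight to a new file. A minimal sketch under the same assumptions, with a made-up output path:

def stream_update(jsonl_in, jsonl_out, id_dict):
    # Enrich one JSON object per line and write it out immediately,
    # so memory use stays roughly constant regardless of file size.
    with open(jsonl_in, encoding='UTF-8') as src, \
         open(jsonl_out, 'w', encoding='UTF-8') as dst:
        for line in src:
            obj = json.loads(line)
            rank = id_dict.get(obj.get('id'))
            if rank is not None:
                obj['wikiQRank'] = rank
            dst.write(json.dumps(obj) + '\n')

stream_update(JSONLFILE, JSONLFILE + '.updated', read_csv(CSVFILE))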
The place where I put this code is an import page. The data I want to import here is in .txt format, but it contains the \n character.
if request.method == "POST":
    txt_file = request.FILES['file']
    if not txt_file.name.endswith('.txt'):
        messages.info(request, 'This is not a txt file')
    data_set = txt_file.read().decode('latin-1')
    io_string = io.StringIO(data_set)
    # skip the header line
    next(io_string)
    csv_reader = csv.reader(io_string, delimiter='\t', quotechar="|")
    for column in csv_reader:
        b = Module_Name(
            user=request.user,
            a=column[1],
            b=column[2],
            c=column[3],
            d=column[4],
            e=column[5],
            f=column[6],
            g=column[7],
            h=column[8],
        )
        b.save()
    messages.success(request, "Successfully Imported...")
    return redirect("return:return_import")
This can be called the full version of my code. To explain: the data that arrives here as column[1] contains a \n character. The file is a .txt file from another export, and in that export column[1] is:
This is
a value
and my Django dev server gives the warning "new-line character seen in unquoted field - do you need to open the file in universal-newline mode?" and aborts the import.
The csv reader iterates over rows, not columns. So if you want to join the data from a given column together, you must iterate over all the rows first. For example:
import csv
from io import StringIO

io_string = "this is , r0 c1\r\na value, r1 c2\r\n"
io_string = StringIO(io_string)
rows = csv.reader(io_string)

column_0_data = []
for row in rows:
    column_0_data.append(row[0])

print("".join(column_0_data))
The rest of your code looks iffy to me, but that is off topic.
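As a side note, the csv module can cope with embedded newlines, but only when the multi-line field is quoted with the quotechar the reader is configured with. A minimal sketch of that behaviour, assuming "|" quoting as in your code (the sample data is made up):

import csv
from io import StringIO

# A tab-delimited record whose second field spans two lines, quoted with "|".
raw = "id\t|This is\na value|\tother\r\n"
reader = csv.reader(StringIO(raw), delimiter='\t', quotechar='|')

for row in reader:
    print(row)  # ['id', 'This is\na value', 'other']

If the exporting program does not quote its multi-line fields, the reader cannot tell where the record ends, which is exactly the "new-line character seen in unquoted field" error you are seeing.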
I'm attempting to parse a JSON file with the following syntax into CSV:
{"code":2000,"message":"SUCCESS","data":
{"1":
{"id":1,
"name":"first_name",
"icon":"url.png",
"attribute1":"value",
"attribute2":"value" ...},
"2":
{"id":2,
"name":"first_name",
"icon":"url.png",
"attribute1":"value",
"attribute2":"value" ...},
"3":
{"id":3,
"name":"first_name",
"icon":"url.png",
"attribute1":"value",
"attribute2":"value" ...}, and so forth
}}}
I have found similar questions (e.g. here and here) and I am working with the following method:
import requests
import json
import csv
import os

jsonfile = "/path/to.json"
csvfile = "/path/to.csv"

with open(jsonfile) as json_file:
    data = json.load(json_file)

data_file = open(csvfile, 'w')
csvwriter = csv.writer(data_file)
csvwriter.writerow(data["data"].keys())
for row in data:
    csvwriter.writerow(row["data"].values())
data_file.close()
but I am missing something.
I get this error when I try to run:
TypeError: string indices must be integers
and my csv output is:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,96
At the end of the day, I am trying to convert the following function (from PowerShell) to Python. This converted the JSON to CSV and added 3 additional custom columns to the end:
$json = wget $lvl | ConvertFrom-Json
$json.data | %{$_.psobject.properties.value} `
| select-object *,#{Name='Custom1';Expression={$m}},#{Name='Level';Expression={$l}},#{Name='Custom2';Expression={$a}},#{Name='Custom3';Expression={$r}} `
| Export-CSV -path $outfile
The output looks like:
"id","name","icon","attribute1","attribute2",..."Custom1","Custom2","Custom3"
"1","first_name","url.png","value","value",..."a","b","c"
"2","first_name","url.png","value","value",..."a","b","c"
"3","first_name","url.png","value","value",..."a","b","c"
As suggested by martineau in a now-deleted answer, my key name was incorrect: iterating over data yields its top-level keys (strings such as "code"), so row["data"] was indexing into a string, hence the TypeError.
I ended up with this:
import json
import csv

jsonfile = "/path/to.json"
csvfile = "/path/to.csv"

with open(jsonfile) as json_file:
    data = json.load(json_file)

data_file = open(csvfile, 'w')
csvwriter = csv.writer(data_file)

# get sample keys
header = data["data"]["1"].keys()

# add new fields to dict
keys = list(header)
keys.append("field2")
keys.append("field3")

# write header
csvwriter.writerow(keys)

# for each entry
total = data["data"]
for row in total:
    rowdefault = data["data"][str(row)].values()
    rowdata = list(rowdefault)
    rowdata.append("value1")
    rowdata.append("value2")
    csvwriter.writerow(rowdata)
Here, I'm looking up each entry by its key via str(row).
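For reference, a csv.DictWriter variant can get closer to the PowerShell pipeline, since it writes the header and aligns values by key in one place. A rough sketch under the same file layout; the custom column names and values are placeholders:

import json
import csv

with open("/path/to.json") as json_file:
    data = json.load(json_file)

# Each numbered entry under "data" becomes one CSV row.
entries = list(data["data"].values())
fieldnames = list(entries[0].keys()) + ["Custom1", "Custom2", "Custom3"]

with open("/path/to.csv", "w", newline="") as data_file:
    writer = csv.DictWriter(data_file, fieldnames=fieldnames)
    writer.writeheader()
    for entry in entries:
        writer.writerow({**entry, "Custom1": "a", "Custom2": "b", "Custom3": "c"})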
I have a CSV file where each record is a LinkedIn contact. I have to create another CSV file containing only the contacts reached after a specific date (e.g. all the contacts connected to me after 01/04/2017).
So this is my implementation:
def import_from_csv(file):
    key_order = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            single_person = {"FirstName": row["FirstName"], "LastName": row["LastName"],
                             "EmailAddress": row["EmailAddress"], "Company": row["Company"],
                             "ConnectedOn": parser.parse(row["ConnectedOn"])}
            od = OrderedDict((k, single_person[k]) for k in key_order)
            linkedin_contacts.append(od)
    return linkedin_contacts
The first script gives me a list of ordered dicts. I don't know if the way I used to achieve the correct order is good; also, looking at some examples (like here), I'm not using the od.update method, but I don't think I need it. Is that correct?
Now I wrote a second function to filter the list:
def filter_by_date(connections):
    filtered_list = []
    target_date = parser.parse("01/04/2017")
    for row in connections:
        if row["ConnectedOn"] > target_date:
            filtered_list.append(row)
    return filtered_list
Am I doing this correctly?
Is there a way to optimize the code? Thanks
First point: you don't need the OrderedDict at all, just use a csv.DictWriter to write the filtered csv.
fieldnames = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")

with open("/path/to/final.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames)
    writer.writeheader()
    writer.writerows(filtered_contacts)
Second point: you don't need to create a new dict from the one yielded by the csv reader, just update the ConnectedOn key in place:
def import_from_csv(file):
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            row["ConnectedOn"] = parser.parse(row["ConnectedOn"])
            linkedin_contacts.append(row)
    return linkedin_contacts
And finally, if all you have to do is read the source CSV, filter the records on ConnectedOn, and write the result, then you don't need to load the whole source into memory, build a filtered list (in memory again), and write that list out; you can stream the whole operation:
def filter_csv(source_path, dest_path, date):
    fieldnames = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")
    target = parser.parse(date)
    with open(source_path, newline="") as source, open(dest_path, "w", newline="") as dest:
        reader = csv.DictReader(source)
        writer = csv.DictWriter(dest, fieldnames)
        # if you want a header line with the fieldnames - else comment it out
        writer.writeheader()
        for row in reader:
            row_date = parser.parse(row["ConnectedOn"])
            if row_date > target:
                writer.writerow(row)
And here you are, plain and simple.
NB: I don't know what parser.parse() is, but as other answers mention, you'd probably be better off using the datetime module instead.
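For completeness, a quick usage sketch of the streaming version (the paths and cut-off date are placeholders):

filter_csv("linkedin_contacts.csv", "filtered_contacts.csv", "01/04/2017")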
For filtering you could use the filter() function:
def filter_by_date(connections):
    # the format must match the date string, e.g. '%d/%m/%Y' for "01/04/2017"
    target_date = datetime.strptime("01/04/2017", '%d/%m/%Y').date()
    return list(filter(lambda x: x["ConnectedOn"] > target_date, connections))
And instead of creating a plain dict and then copying its values into an OrderedDict, you could write the values directly into the OrderedDict:
for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], '%d/%m/%Y').date()
    linkedin_contacts.append(od)
If you know the date format, you don't need python_dateutil; you can use the built-in datetime.datetime.strptime() with the needed format.
That's because you didn't specify the format string.
Use:
from datetime import datetime

format = '%d/%m/%Y'
date_text = '01/04/2017'

# the inverse operation is datetime.strftime(format)
datetime.strptime(date_text, format)
# ...
# with format as a global
for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], format)
    linkedin_contacts.append(od)
Do:

def filter_by_date(connections, date_text):
    target_date = datetime.strptime(date_text, format)
    return [x for x in connections if x["ConnectedOn"] > target_date]
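A quick usage sketch tying these together (the file name is a placeholder):

contacts = import_from_csv("linkedin_contacts.csv")
recent = filter_by_date(contacts, "01/04/2017")
print(len(recent), "contacts connected after the cut-off")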
I am using boto to read a CSV file and parse its contents. This is the code I wrote:
import boto
from boto.s3.key import Key
import pandas as pd
import io
conn = boto.connect_s3(keyId, sKeyId)
bucket = conn.get_bucket(bucketName)
# Get the Key object of the given key, in the bucket
k = Key(bucket, srcFileName)
content = k.get_contents_as_string()
reader = pd.read_csv(io.StringIO(content))
for row in reader:
    print(row)
But I am getting an error at the read_csv line:
TypeError: initial_value must be str or None, not bytes
How can I resolve this error and parse the contents of the CSV file present on S3?
UPDATE: if I use BytesIO instead of StringIO then the print(row) line only prints the 1st row of the CSV. How do I loop over it?
This is my current code:
import boto3

s3 = boto3.resource('s3', aws_access_key_id=keyId, aws_secret_access_key=sKeyId)
obj = s3.Object(bucketName, srcFileName)
content = obj.get_contents_as_string()

reader = pd.read_csv(io.BytesIO(content), header=None)
count = 0
for index, row in reader.iterrows():
    print(row[1])
When I execute this, I get an AttributeError: 's3.Object' object has no attribute 'get_contents_as_string'.
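For what it's worth, get_contents_as_string is the old boto 2 Key API; a boto3 Object exposes its bytes through get()['Body'] instead. A minimal sketch of that route, reusing the names above:

import io
import boto3
import pandas as pd

s3 = boto3.resource('s3', aws_access_key_id=keyId, aws_secret_access_key=sKeyId)
obj = s3.Object(bucketName, srcFileName)

# boto3: fetch the object and read its body as bytes.
content = obj.get()['Body'].read()
df = pd.read_csv(io.BytesIO(content), header=None)

# iterrows() yields one (index, row) pair per CSV record.
for index, row in df.iterrows():
    print(row[1])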