I have a CSV file that contains data separated by ','. I am trying to convert it into JSON format. To do that I am trying to extract the headers first, but I am not able to differentiate between the header row and the next row.
Here is the data in csv file:
Start Date ,Start Time,End Date,End Time,Event Title
9/5/2011,3:00:00 PM,9/5/2011,,Social Studies Dept. Meeting
9/5/2011,6:00:00 PM,9/5/2011,8:00:00 PM,Curriculum Meeting
I have tried csv.reader as well, but I got stuck on the same issue.
Basically, Event Title and the date on the next line are not being distinguished.
import json
import re

with open(file_path, 'r') as f:
    first_line = re.sub(r'\s+', '', f.read())
    arr = []
    headers = []
    for header in f.readline().split(','):
        headers.append(header)
    for line in f.readlines():
        lineItems = {}
        for i, item in enumerate(line.split(',')):
            lineItems[headers[i]] = item
        arr.append(lineItems)

print(arr)
print(headers)
jsonText = json.dumps(arr)
print(jsonText)
All three print statements give the empty results below.
[]
['']
[]
I expect jsonText to be a json of key value pairs.
Use csv.DictReader to get a list of dicts (each row is a dict) then serialize it.
import json
import csv
with open(csvfilepath) as f, open(jsonfilepath, 'w') as out:
    json.dump(list(csv.DictReader(f)), out)
In Python, each file object keeps a marker that tracks where you are in the file. Once you call read(), you have read through the entire file, and all subsequent read or readline calls will return nothing.
So, just delete the line involving first_line.
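To make that concrete, here is a minimal sketch of the asker's own loop once the read() line is gone. The sample data from the question is written to an assumed events.csv first, and header whitespace is stripped as a small extra cleanup:

```python
import json

# Recreate the sample CSV from the question (assumed file name).
with open("events.csv", "w") as f:
    f.write("Start Date ,Start Time,End Date,End Time,Event Title\n"
            "9/5/2011,3:00:00 PM,9/5/2011,,Social Studies Dept. Meeting\n"
            "9/5/2011,6:00:00 PM,9/5/2011,8:00:00 PM,Curriculum Meeting\n")

arr = []
with open("events.csv") as f:
    # Read the header row once; strip stray whitespace from each name.
    headers = [h.strip() for h in f.readline().split(",")]
    for line in f:
        values = line.rstrip("\n").split(",")
        arr.append(dict(zip(headers, values)))

jsonText = json.dumps(arr)
```

Because the file position is only advanced by readline() and the loop, the header and data rows are now read from their actual positions in the file.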
Related
I have this use case
please create a function called “myfunccsvtojson” that takes in a filename path to a csv file (please refer to attached csv file) and generates a file that contains streamable line delimited JSON.
• Expected filename will be based on the csv filename, i.e. Myfilename.csv will produce Myfilename.json or File2.csv will produce File2.json. Please show this in your code and should not be hardcoded.
• csv file has 10000 lines including the header
• output JSON file should contain 9999 lines
• Sample JSON lines from the csv file below:
CSV:
nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0043044,tt0050419,tt0053137"
nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0071877,tt0038355,tt0117057,tt0037382"
nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0057345,tt0059956,tt0049189,tt0054452"
JSON lines:
{"nconst":"nm0000001","primaryName":"Fred Astaire","birthYear":1899,"deathYear":1987,"primaryProfession":"soundtrack,actor,miscellaneous","knownForTitles":"tt0072308,tt0043044,tt0050419,tt0053137"}
{"nconst":"nm0000002","primaryName":"Lauren Bacall","birthYear":1924,"deathYear":2014,"primaryProfession":"actress,soundtrack","knownForTitles":"tt0071877,tt0038355,tt0117057,tt0037382"}
{"nconst":"nm0000003","primaryName":"Brigitte Bardot","birthYear":1934,"deathYear":null,"primaryProfession":"actress,soundtrack,producer","knownForTitles":"tt0057345,tt0059956,tt0049189,tt0054452"}
What I am not able to understand is how the header can be used as the key for every value of the JSON.
Has anyone come across this scenario and can help me out?
Here is what I was trying; I know the loop is not correct, but I am still figuring it out:
with open(file_name, encoding='utf-8') as file:
    csv_data = csv.DictReader(file)
    csvreader = csv.reader(file)
    # print(csv_data)
    keys = next(csvreader)
    print(keys)
    for i, Value in range(len(keys)), csv_data:
        data[keys[i]] = Value
    print(data)
You can read your csv into a pandas DataFrame and output it as JSON:
import pandas as pd

df = pd.read_csv('data.csv')
df.to_json(orient='records')
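Since this use case asks for line-delimited JSON, note that to_json also accepts lines=True together with orient='records'. A small sketch on made-up data:

```python
import io
import pandas as pd

# Tiny illustrative frame (column names are made up).
df = pd.read_csv(io.StringIO("a,b\n1,x\n2,y\n"))

array_json = df.to_json(orient="records")          # one JSON array
ndjson = df.to_json(orient="records", lines=True)  # one object per line
```

The lines=True variant produces exactly the streamable line-delimited output the question describes, with no surrounding brackets or commas between objects.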
import csv
import json

def csv_to_json(csv_file_path, json_file_path):
    data_dict = []
    with open(csv_file_path, encoding='utf-8') as csv_file_handler:
        csv_reader = csv.DictReader(csv_file_handler)
        for row in csv_reader:
            data_dict.append(row)
    with open(json_file_path, 'w', encoding='utf-8') as json_file_handler:
        json_file_handler.write(json.dumps(data_dict, indent=4))

csv_to_json("/home/devendra/Videos/stackoverflow/Names.csv", "/home/devendra/Videos/stackoverflow/Names.json")
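Note that the function above writes one indented JSON array, not the line-delimited output the question asks for. A sketch closer to the stated requirements, with the output name derived from the input name; the type cleanup (all-digit fields to ints, "\N" to null, matching the sample JSON lines) is an assumption about the IMDb-style data:

```python
import csv
import json
import os

def myfunccsvtojson(csv_filepath):
    # Derive the output name from the input name (File2.csv -> File2.json).
    json_filepath = os.path.splitext(csv_filepath)[0] + ".json"
    with open(csv_filepath, encoding="utf-8", newline="") as src, \
         open(json_filepath, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Assumed cleanup: "\N" means missing -> null,
            # all-digit fields (the year columns) -> ints.
            cleaned = {k: (None if v == r"\N" else int(v) if v.isdigit() else v)
                       for k, v in row.items()}
            dst.write(json.dumps(cleaned) + "\n")  # one object per line
    return json_filepath

# Hypothetical sample input, written here only for illustration.
with open("Names.csv", "w", encoding="utf-8") as f:
    f.write("nconst,primaryName,birthYear\n"
            "nm0000001,Fred Astaire,1899\n"
            "nm0000003,Brigitte Bardot,\\N\n")

out_path = myfunccsvtojson("Names.csv")
```

Because rows are written as they are read, a 10000-line CSV never has to fit in memory, and the output has one line per data row (9999 for 10000 lines including the header).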
I am new to Python and I am trying to export my output to CSV with the headers Artifacts and Size. Below is the Python code for it.
import requests
import json
import csv

with open('content.csv', 'w') as csvfile:
    headers = ['Artifacts', 'Size']
    writer = csv.writer(csvfile)
    writer.writerow(headers)

with open('file_list.txt') as file:
    for line in file:
        url = "http://fqdn/repository/{0}/{1}?describe=json".format(repo_name, line.strip())
        response = requests.get(url)
        json_data = response.text
        data = json.loads(json_data)
        for size in data['items']:
            if size['name'] == 'Payload':
                value_size = size['value']['Size']
                if value_size != -1:
                    with open('content.csv', 'a') as fileappend:
                        data_writer = csv.writer(fileappend)
                        data_writer.writerow({line.strip(), str(value_size)})
The issue I am stuck on: the CSV file is not as expected. Instead of an Artifacts column with artifact names and a Size column with their sizes, I see a mixed pattern where some of the rows are correct and some of the rows show the size first and the artifact next.
Another (less important) issue is an added empty row between each line when opened in Excel.
Sample CSV data
Artifacts,Size
3369,mysql.odbc/5.1.14
641361,curl/7.24.0
2142246,curl/7.24.0.20120225
2163958,curl/7.25.0
curl/7.55.0,3990517
curl/7.55.1,3991943
3875614,curl/7.54.1
curl/7.58.0,3690457
putty.portable/0.67,4201
6227,notepadplusplus/7.5.4
4407,openjdk8/8.242.8.1
5453,dotnetcore-sdk/3.1.201
4405,openjdk8/8.252.9
Any help is much appreciated.
You're passing a set as the argument to writerow().
A set object is hash-based and therefore doesn't necessarily keep the order, which is why the artifact and size columns come out in arbitrary order.
You should use a list or a tuple instead.
In your last writerow call:
# replace curly brackets with square brackets
data_writer.writerow([line.strip(), str(value_size)])
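Putting that fix together with the blank-row issue from the question (on Windows, opening the file without newline='' makes the csv module emit extra blank lines in Excel), a sketch on made-up artifact data, writing through a single writer instead of reopening the file per row:

```python
import csv

# Hypothetical (artifact, size) pairs standing in for the HTTP results.
rows = [("curl/7.24.0", 641361), ("putty.portable/0.67", 4201)]

# newline="" prevents the blank rows Excel shows; one writer, opened once.
with open("content.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Artifacts", "Size"])
    for name, size in rows:
        # A list keeps column order; a set ({...}) does not.
        writer.writerow([name, str(size)])
```

Every row now has the artifact first and the size second, with no empty rows in between.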
I have a .json file where each line is an object. For example, first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I have tried processing using ijson lib as follows:
with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)
However, I get an error:
JSONError: Additional data
It seems I'm getting this error due to the multiple objects.
What's the recommended way to analyze such a JSON file in Jupyter?
Thank you in advance
The file format is not correct if this is the complete file. Between the curly brackets there must be a comma, and the file should start and end with a square bracket, like so: [{...},{...}]. For your data it would look like:
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code how to clean your file:
lastline = None
with open("yourfile.json", "r") as f:
    lineList = f.readlines()
    lastline = lineList[-1]

with open("yourfile.json", "r") as f, open("cleanfile.json", "w") as g:
    for i, line in enumerate(f, 0):
        if i == 0:
            line = "[" + str(line) + ","
            g.write(line)
        elif line == lastline:
            g.write(line)
            g.write("]")
        else:
            line = str(line) + ","
            g.write(line)
To read a json file properly you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html). Since each line of your file is a separate object, pass lines=True:
import pandas as pd

# get a pandas dataframe object from a line-delimited json file
df = pd.read_json("path/to/your/filename.json", lines=True)
If you are not familiar with pandas, here a quick headstart, how to work with a dataframe object:
df.head() #gives you the first rows of the dataframe
df["review_id"] # gives you the column review_id as a vector
df.iloc[1,:] # gives you the complete row with index 1
df.iloc[1,2] # gives you the item in row with index 1 and column with index 2
While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you will have to iterate over each line and parse it into an object.
You can aggregate these objects in one list, and from there do whatever you like with your data:
import json

with open(filename, 'r') as f:
    object_list = []
    for line in f.readlines():
        object_list.append(json.loads(line))

# object_list will contain all of your file's data
You could do it as a list comprehension to make it a little more pythonic:
with open(filename, 'r') as f:
    object_list = [json.loads(line) for line in f.readlines()]

# object_list will contain all of your file's data
You have multiple lines in your file, so that's why it's throwing errors
import json

with open(filename, 'r') as f:
    lines = f.readlines()

first = json.loads(lines[0])
second = json.loads(lines[1])
That should catch both lines and load them in properly
I am new to Python. I am trying to transfer data from a text file to a csv file. I have included a short description of the data in my text file and csv file. Can someone point me in the right direction of what to read up on to get this done?
**Input Text file**
01/20/18 12:19:35#
TARGET_CENTER_COLUMN=0
TARGET_CENTER_ROW=0
TARGET_COLUMN=0
BASELINE_AVERAGE=0
#
01/21/18 12:19:35#
TARGET_CENTER_COLUMN=0
TARGET_CENTER_ROW=13
TARGET_COLUMN=13
BASELINE_AVERAGE=26
#
01/23/18 12:19:36#
TARGET_COLUMN=340
TARGET_CENTER_COLUMN=223
TARGET_CENTER_ROW=3608, 3609, 3610
BASELINE_AVERAGE=28
#
01/24/18 12:19:37#
TARGET_CENTER_COLUMN=224
TARGET_CENTER_ROW=388
TARGET_COLUMN=348
BASELINE_AVERAGE=26
#
01/25/18 12:19:37#
TARGET_CENTER_COLUMN=224
TARGET_CENTER_ROW=388
TARGET_COLUMN=348
BASELINE_AVERAGE=26
#
01/27/18 12:19:37#
TARGET_CENTER_COLUMN=223
TARGET_COLUMN=3444
TARGET_CENTER_ROW=354
BASELINE_AVERAGE=25
#
**Output CSV file**
Date,Time,BASELINE_AVERAGE,TARGET_CENTER_COLUMN,TARGET_CENTER_ROW,TARGET_COLUMN
01/20/18,9:37:16 PM,0,0,0,0
01/21/18,9:37:16 PM,26,0,13,13
01/23/18,9:37:16 PM,28,223,3608,340
0,0,3609,0
0,0,3610,0
01/24/18,9:37:16 PM,26,224,388,348
01/25/18,9:37:16 PM,26,224,388,348
01/27/18,9:37:16 PM,25,223,354,344
Reading up online I've been able to implement this.
import csv

txt_file = r"DebugLog15test.txt"
csv_file = r"15test.csv"

mylist = ['Date', 'Time', 'BASELINE_AVERAGE', 'TARGET_CENTER_COLUMN', 'TARGET_CENTER_ROW', 'TARGET_COLUMN']

in_txt = csv.reader(open(txt_file, "r"))
with open(csv_file, 'w') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(mylist)
Beyond this, I was planning to start a for loop and read data until each '#', since that would be one row. Then I would use '=' as a delimiter, insert the data into the appropriate position in a row list (by comparing the column header with the string before the '='), and populate the row accordingly. Do you think this approach is correct?
Thanks for your help!
Check out csv.DictWriter for a nicer approach. You give it a list of field names, and then you can give it data dictionaries which it will write for you. The writing portion would look like this:
import csv

csv_file = "15test.csv"
headers = ['Date', 'Time', 'BASELINE_AVERAGE', 'TARGET_CENTER_COLUMN', 'TARGET_CENTER_ROW', 'TARGET_COLUMN']
with open(csv_file, 'w', newline='') as myfile:
    wr = csv.DictWriter(myfile, quoting=csv.QUOTE_ALL, fieldnames=headers)
    wr.writeheader()
    # data_dicts is a list of dictionaries looking like so:
    # {'Date': '01/20/18', 'Time': '12:19:35', 'TARGET_CENTER_COLUMN': '0', ...}
    wr.writerows(data_dicts)
As for reading your input, csv.reader won't be of much help: your input file isn't really anything like a csv file. You'd probably be better off writing your own parsing, although it'll be a bit messy because of the inconsistency of the input format. Here's how I would approach that. First, make a function to interpret each line:
def get_data_from_line(line):
    line = line.strip()
    if line == '#':
        # We're between data sections; None will signal that
        return None
    if '=' in line:
        # this is a "KEY=VALUE" line
        key, value = line.split('=', 1)
        return {key: value}
    if ' ' in line:
        # this is a "Date time" line
        date, time = line.split(' ', 1)
        return {'Date': date, 'Time': time}
    # if we get here, either we've missed something or there's bad data
    raise ValueError("Couldn't parse line: {}".format(line))
Then build the list of data dictionaries from the input file:
data_dicts = []
with open(txt_file) as infh:
    data_dict = {}
    for line in infh:
        update = get_data_from_line(line)
        if update is None:
            # we're between sections; add our current data to the list,
            # if we have data.
            if data_dict:
                data_dicts.append(data_dict)
            data_dict = {}
        else:
            # this line had some data; this incorporates it into data_dict
            data_dict.update(update)
    # finally, if we don't have a section marker at the end,
    # we need to append the last section's data
    if data_dict:
        data_dicts.append(data_dict)
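For a concrete end-to-end run of this approach, here is a sketch on a trimmed-down version of the input (only TARGET_COLUMN kept, parser logic inlined; stripping the trailing '#' from the time is an assumption about the desired output):

```python
import csv

# Two sections in the question's format, reduced to one field each.
text = ("01/20/18 12:19:35#\n"
        "TARGET_COLUMN=0\n"
        "#\n"
        "01/21/18 12:19:35#\n"
        "TARGET_COLUMN=13\n"
        "#\n")

data_dicts, data_dict = [], {}
for line in text.splitlines():
    line = line.strip()
    if line == "#":                      # section separator
        if data_dict:
            data_dicts.append(data_dict)
        data_dict = {}
    elif "=" in line:                    # "KEY=VALUE" line
        key, value = line.split("=", 1)
        data_dict[key] = value
    elif " " in line:                    # "Date time" line
        date, time = line.split(" ", 1)
        data_dict.update({"Date": date, "Time": time.rstrip("#")})
if data_dict:
    data_dicts.append(data_dict)

with open("15test.csv", "w", newline="") as f:
    wr = csv.DictWriter(f, fieldnames=["Date", "Time", "TARGET_COLUMN"],
                        quoting=csv.QUOTE_ALL)
    wr.writeheader()
    wr.writerows(data_dicts)
```

Each '#' closes one section, so each section becomes one CSV row regardless of the order the KEY=VALUE lines appear in.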
I am very new to python, so please be gentle.
I have a .csv file, reported to me in this format, so I cannot do much about it:
ClientAccountID AccountAlias CurrencyPrimary FromDate
SomeID SomeAlias SomeCurr SomeDate
OtherID OtherAlias OtherCurr OtherDate
ClientAccountID AccountAlias CurrencyPrimary AssetClass
SomeID SomeAlias SomeCurr SomeClass
OtherID OtherAlias OtherCurr OtherDate
AnotherID AnotherAlias AnotherCurr AnotherDate
I am using the csv package in python, so i have:
with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter=',')
Which, as I understand it, creates the dictionary 'theReader'. How do I subset this dictionary, into several dictionaries, splitting them by the header rows in the original csv file? Is there a simple, elegant, non-loop way to create a list of dictionaries (or even a dictionary of dictionaries, with account IDs as keys)? Does that make sense?
Oh. Please note the header rows are not equivalent, but the header rows will always begin with 'ClientAccountID'.
Thanks to @codie, I am now using the following to split the csv into several dicts, based on the '\t' delimiter.
with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter='\t')
However, I now get the entire header row as a key, and each other row as a value. How do I further split this up?
Thanks to @Benjamin Hodgson below, I have the following:
from csv import DictReader
from io import BytesIO

stringios = []
with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                stringios.append(stringio)
            stringio = BytesIO()
        stringio.write(line)
        stringio.write("\n")
    stringios.append(stringio)

data = [list(DictReader(x.getvalue(), delimiter=',')) for x in stringios]
If I print the first item in stringios, I get what I would expect: it looks like a single csv. However, if I print the first item in data using the code below, I get something odd:
for row in data[0]:
    print row
It returns:
{'C':'U'}
{'C':'S'}
{'C':'D'}
...
So it appears it is splitting every character, instead of using the comma delimiter.
If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID".
So the job is to read the CSV file into a list of lists-of-dictionaries. Each entry in the list corresponds to one of the tables in your CSV file.
Here's how I'd do it:
Break up the single CSV file with multiple tables into multiple files each with a single table. (These files could be in-memory.) Do this by looking for lines which start with "ClientAccountID".
Read each of these files into a list of dictionaries using a DictReader.
Here's some code to read the file into a list of StringIOs. (A StringIO is an in-memory file. It works by wrapping a string up into a file-like interface).
from csv import DictReader
from io import StringIO

stringios = []
with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                stringio.seek(0)
                stringios.append(stringio)
            stringio = StringIO()
        stringio.write(line)
        stringio.write("\n")
    stringio.seek(0)
    stringios.append(stringio)
If we encounter a line starting with 'ClientAccountID', we put the current StringIO into the list and start writing to a new one. When you've finished, remember to add the last one to the list too.
Don't forget (as I did, in an earlier version of this answer) to rewind the StringIO after you've written to it using stringio.seek(0).
Now it's straightforward to loop over the StringIOs to get a table of dictionaries.
data = [list(DictReader(x, delimiter='\t')) for x in stringios]
For each file-like object in the list stringios, create a DictReader and read it into a list.
It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line-by-line.
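As a sketch of that generator-based variant, splitting the file into sections lazily so only one table's lines are held in memory at a time (the sample file, its contents, and the function name are illustrative assumptions):

```python
from csv import DictReader

def tables(path, sentinel="ClientAccountID", delimiter="\t"):
    # Yield one list-of-dicts per table; only the current section's
    # lines are buffered, not every table at once.
    with open(path) as f:
        section = []
        for line in f:
            if line.startswith(sentinel) and section:
                yield list(DictReader(section, delimiter=delimiter))
                section = []
            section.append(line)
        if section:
            yield list(DictReader(section, delimiter=delimiter))

# Hypothetical two-table, tab-delimited sample written for illustration.
with open("file.csv", "w") as f:
    f.write("ClientAccountID\tAccountAlias\n"
            "SomeID\tSomeAlias\n"
            "OtherID\tOtherAlias\n"
            "ClientAccountID\tAssetClass\n"
            "SomeID\tSomeClass\n")

result = list(tables("file.csv"))
```

DictReader accepts any iterable of lines, so each buffered section can be fed to it directly with no intermediate StringIO at all.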
If your data were not comma or tab delimited you could use str.split; you can combine it with itertools.groupby to separate the header rows from the data rows:
from itertools import groupby, izip, imap

with open("test.txt") as f:
    grps, data = groupby(imap(str.split, f), lambda x: x[0] == "ClientAccountID"), []
    for k, v in grps:
        if k:
            names = next(v)
            vals = izip(*next(grps)[1])
            data.append(dict(izip(names, vals)))

from pprint import pprint as pp
pp(data)
Output:
[{'AccountAlias': ('SomeAlias', 'OtherAlias'),
'ClientAccountID': ('SomeID', 'OtherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr'),
'FromDate': ('SomeDate', 'OtherDate')},
{'AccountAlias': ('SomeAlias', 'OtherAlias', 'AnotherAlias'),
'AssetClass': ('SomeClass', 'OtherDate', 'AnotherDate'),
'ClientAccountID': ('SomeID', 'OtherID', 'AnotherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr', 'AnotherCurr')}]
If it is tab delimited just change one line:
with open("test.txt") as f:
    grps, data = groupby(csv.reader(f, delimiter="\t"), lambda x: x[0] == "ClientAccountID"), []
    for k, v in grps:
        if k:
            names = next(v)
            vals = izip(*next(grps)[1])
            data.append(dict(izip(names, vals)))