I have a nested json file (100k rows), which looks like this:
{"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
I am trying to create a csv, so that it can easily be loaded in a rdbms. I am trying to use json_normalize() in pandas but even before I get there I am getting below error.
with open('transactions.json') as data_file:
data = json.load(data_file)
JSONDecodeError: Extra data: line 2 column 1 (char 466)
If your problem originates in reading the json file itself, then i would just use:
json.loads()
and then use
pd.read_csv()
If your problem originates in the conversion from your json dict to dataframe you can use this:
test = {"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
import json
import pandas
# convert json to string and read
df = pd.read_json(json.dumps(test), convert_axes=True)
# 'unpack' the dict as series and merge them with original df
df = pd.concat([df, df.Segment.apply(pd.Series)], axis=1)
# remove dict
df.drop('Segment', axis=1, inplace=True)
That would be my approach but there might be more convenient approaches.
Step one: loop over a file of records
Since your file has one JSON record per line, you need to loop over all the records in your file, which you can do like this:
with open('transactions.json', encoding="utf8") as data_file:
for line in data_file:
data = json.loads(line)
# or
df = pd.read_json(line, convert_axes=True)
# do something with data or df
Step two: write the CSV file
Now, you can combine this with a csv.writer to convert the file into a CSV file.
with open('transactions.csv', "w", encoding="utf8") as csv_file:
writer = csv.writer(csv_file)
#Loop for each record, somehow:
#row = build list with row contents
writer.writerow(row)
Putting it all together
I'll read the first record once to get the keys to display and output them as a CSV header, and then I'll read the whole file and convert it one record at a time:
import copy
import csv
import json
import pandas as pd
# Read the first JSON record to get the keys that we'll use as headers for the CSV file
with open('transactions.json', encoding="utf8") as data_file:
keys = list(json.loads(next(data_file)).keys())
# Our CSV headers are going to be the keys from the first row, except for
# segments, which we'll replace (arbitrarily) by three numbered segment column
# headings.
keys.pop()
base_keys = copy.copy(keys)
keys.extend(["Segment1", "Segment2", "Segment3"])
with open('transactions.csv', "w", encoding="utf8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(keys) # Write the CSV headers
with open('transactions.json', encoding="utf8") as data_file:
for line in data_file:
data = json.loads(line)
row = [data[k] for k in base_keys] + data["Segment"]
writer.writerow(row)
The resulting CSV file will still have a JSON record in each Segmenti column. If you want to format each segment differently, you could define a format_segment(segment) function and replace data["Segment"] by this list comprehension: [format_segment(segment) for segment in data["Segment"]]
I am very new to python, so please be gentle.
I have a .csv file, reported to me in this format, so I cannot do much about it:
ClientAccountID AccountAlias CurrencyPrimary FromDate
SomeID SomeAlias SomeCurr SomeDate
OtherID OtherAlias OtherCurr OtherDate
ClientAccountID AccountAlias CurrencyPrimary AssetClass
SomeID SomeAlias SomeCurr SomeClass
OtherID OtherAlias OtherCurr OtherDate
AnotherID AnotherAlias AnotherCurr AnotherDate
I am using the csv package in python, so i have:
with open(theFile, 'rb') as csvfile:
theReader = csv.DictReader(csvfile, delimiter = ',')
Which, as I understand it, creates the dictionary 'theReader'. How do I subset this dictionary, into several dictionaries, splitting them by the header rows in the original csv file? Is there a simple, elegant, non-loop way to create a list of dictionaries (or even a dictionary of dictionaries, with account IDs as keys)? Does that make sense?
Oh. Please note the header rows are not equivalent, but the header rows will always begin with 'ClientAccountID'.
Thanks to # codie, I am now using the following to split the csv into several dicts, based on using the '\t' delimiter.
with open(theFile, 'rb') as csvfile:
theReader = csv.DictReader(csvfile, delimiter = '\t')
However, I now get the entire header row as a key, and each other row as a value. How do I further split this up?
Thanks to #Benjamin Hodgson below, I have the following:
from csv import DictReader
from io import BytesIO
stringios = []
with open('file.csv', 'r') as f:
stringio = None
for line in f:
if line.startswith('ClientAccountID'):
if stringio is not None:
stringios.append(stringio)
stringio = BytesIO()
stringio.write(line)
stringio.write("\n")
stringios.append(stringio)
data = [list(DictReader(x.getvalue(), delimiter=',')) for x in stringios]
If I print the first item in stringios, I get what I would expect. It looks like a single csv. However, if I print the first item in data, using below, i get something odd:
for row in data[0]:
print row
It returns:
{'C':'U'}
{'C':'S'}
{'C':'D'}
...
So it appears it is splitting every character, instead of using the comma delimiter.
If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID".
So the job is to read the CSV file into a list of lists-of-dictionaries. Each entry in the list corresponds to one of the tables in your CSV file.
Here's how I'd do it:
Break up the single CSV file with multiple tables into multiple files each with a single table. (These files could be in-memory.) Do this by looking for lines which start with "ClientAccountID".
Read each of these files into a list of dictionaries using a DictReader.
Here's some code to read the file into a list of StringIOs. (A StringIO is an in-memory file. It works by wrapping a string up into a file-like interface).
from csv import DictReader
from io import StringIO
stringios = []
with open('file.csv', 'r') as f:
stringio = None
for line in f:
if line.startswith('ClientAccountID'):
if stringio is not None:
stringio.seek(0)
stringios.append(stringio)
stringio = StringIO()
stringio.write(line)
stringio.write("\n")
stringio.seek(0)
stringios.append(stringio)
If we encounter a line starting with 'ClientAccountID', we put the current StringIO into the list and start writing to a new one. When you've finished, remember to add the last one to the list too.
Don't forget (as I did, in an earlier version of this answer) to rewind the StringIO after you've written to it using stringio.seek(0).
Now it's straightforward to loop over the StringIOs to get a table of dictionaries.
data = [list(DictReader(x, delimiter='\t')) for x in stringios]
For each file-like object in the list stringios, create a DictReader and read it into a list.
It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line-by-line.
If your data was not comma or tab delimited you could use str.split, you can combine it with itertools.groupby to delimit the headers and rows:
from itertools import groupby, izip, imap
with open("test.txt") as f:
grps, data = groupby(imap(str.split, f), lambda x: x[0] == "ClientAccountID"), []
for k, v in grps:
if k:
names = next(v)
vals = izip(*next(grps)[1])
data.append(dict(izip(names, vals)))
from pprint import pprint as pp
pp(data)
Output:
[{'AccountAlias': ('SomeAlias', 'OtherAlias'),
'ClientAccountID': ('SomeID', 'OtherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr'),
'FromDate': ('SomeDate', 'OtherDate')},
{'AccountAlias': ('SomeAlias', 'OtherAlias', 'AnotherAlias'),
'AssetClass': ('SomeClass', 'OtherDate', 'AnotherDate'),
'ClientAccountID': ('SomeID', 'OtherID', 'AnotherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr', 'AnotherCurr')}]
If it is tab delimited just change one line:
with open("test.txt") as f:
grps, data = groupby(csv.reader(f, delimiter="\t"), lambda x: x[0] == "ClientAccountID"), []
for k, v in grps:
if k:
names = next(v)
vals = izip(*next(grps)[1])
data.append(dict(izip(names, vals)))
I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using Dictreader and this gives out the correct details:
with open('details.csv') as csvfile:
i=["name","age","sex"]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
But what I want to do is, I need the list of columns, ("i" in the above case)to be automatically parsed with the input csv than hardcoding them inside a list.
with open('details.csv') as csvfile:
rows=iter(csv.reader(csvfile)).next()
header=rows[1:]
re=csv.DictReader(csvfile)
for row in re:
print row
for x in header:
print row[x]
This gives out an error
Keyerrror:'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using Dictreader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
d_reader = csv.DictReader(f)
#get fieldnames from DictReader object and store in list
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
#you can eat the first line before creating DictReader.
#if no "fieldnames" param is passed into
#DictReader object upon creation, DictReader
#will read the upper-most line as the headers
f.readline()
d_reader = csv.DictReader(f)
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
You can read the header by using the next() function which return the next row of the reader’s iterable object as a list. then you can add the content of the file to a list.
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
rest = list(reader)
Now i has the column's names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in python 3. Instead use the the inbuilt next() to get the first line of the csv immediately after reading like so:
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = next(reader)
print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
reader = csv.DictReader(f, delimiter=',')
for row in reader:
print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about
with open(csv_input_path + file, 'r') as ft:
header = ft.readline() # read only first line; returns string
header_list = header.split(',') # returns list
I am assuming your input file is CSV format.
If using pandas, it takes more time if the file is big size because it loads the entire data as the dataset.
I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))
Thanking Daniel Jimenez for his perfect solution to fetch column names alone from my csv, I extend his solution to use DictReader so we can iterate over the rows using column names as indexes. Thanks Jimenez.
with open('myfile.csv') as csvfile:
rest = []
with open("myfile.csv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
i=i[1:]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
here is the code to print only the headers or columns of the csv file.
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
I literally just wanted the first row of my data which are the headers I need and didn't want to iterate over all my data to get them, so I just did this:
with open(data, 'r', newline='') as csvfile:
t = 0
for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
if t > 0:
break
else:
dbh = i
t += 1
Using pandas is also an option.
But instead of loading the full file in memory, you can retrieve only the first chunk of it to get the field names by using iterator.
import pandas as pd
file = pd.read_csv('details.csv'), iterator=True)
column_names_full=file.get_chunk(1)
column_names=[column for column in column_names_full]
print column_names
I'm somewhat new to Python and still trying to learn all its tricks and exploitations.
I'm looking to see if it's possible to collect column data from two separate files to create a single dictionary, rather than two distinct dictionaries. The code that I've used to import files before looks like this:
import csv
from collections import defaultdict
columns = defaultdict(list)
with open("myfile.txt") as f:
reader = csv.DictReader(f,delimiter='\t')
for row in reader:
for (header,variable) in row.items():
columns[header].append(variable)
f.close()
This code makes each element of the first line of the file into a header for the columns of data below it. What I'd like to do now is to import a file that only contains one line which I'll use as my header, and import another file that only contains data that I'll match the headers up to. What I've tried so far resembles this:
columns = defaultdict(list)
with open("headerData.txt") as g:
reader1 = csv.DictReader(g,delimiter='\t')
for row in reader1:
for (h,v) in row.items():
columns[h].append(v)
with open("variableData.txt") as f:
reader = csv.DictReader(f,delimiter='\t')
for row in reader:
for (h,v) in row.items():
columns[h].append(v)
Is nesting the open statements the right way to attempt this? Honestly I am totally lost on what to do. Any help is greatly appreciated.
You can't use DictReader like that if the headers are not in the file. But you can create a fake file object that would yield the headers and then the data, using itertools.chain:
from itertools import chain
with open('headerData.txt') as h, open('variableData.txt') as data:
f = chain(h, data)
reader = csv.DictReader(f,delimiter='\t')
# proceed with you code from the first snippet
# no close() calls needed when using open() with "with" statements
Another way of course would be to just read the headers into a list and use regular csv.reader on variableData.txt:
with open('headerData') as h:
names = next(h).split('\t')
with open('variableData.txt') as f:
reader = csv.reader(f, delimiter='\t')
for row in reader:
for name, value in zip(names, row):
columns[name].append(value)
By default, DictReader will take the first line in your csv file and use that as the keys for the dict. However, according to the docs, you can also pass it a fieldnames parameter, which is a sequence containing the names of the keys to use for the dict. So you could do this:
columns = defaultdict(list)
with open("headerData.txt") as f, open("variableData.txt") as data:
reader = csv.DictReader(data,
fieldnames=f.read().rstrip().split('\t'),
delimiter='\t')
for row in reader:
for (h,v) in row.items():
columns[h].append(v)
I am trying to "clean up" some data - I'm creating a dictionary of the channels that I need to keep and then I've got an if block to create a second dictionary with the correct rounding.
Dictionary looks like this:
{'time, s': (imported array), 'x temp, C':(imported array),
'x pressure, kPa': (diff. imported array).....etc}
Each imported array is 1-d.
I was looking at this example, but I didn't quite get the way to parse it so that I ended up with what I want.
My desired output is a csv file (do not care if the delimiter is spaces or commas or whatever) with the first row being the keys and the subsequent rows simply being the values.
I feel like what I'm missing is how to use the map function properly.
Also, I'm wondering if I'm using DictWriter when I should be using DictReader.
This is what I originally tried:
with open((filename), 'wb') as outfile:
write = csv.DictWriter(outfile, Fieldname_order)
write.writer.writerow(Fieldname_order)
write.writerows(data)
DictWriter's API doesn't match the data structure you have. DictWriter requires list of dictionaries. You have a dictionary of lists.
You can use the ordinary csv.writer:
my_data = {'time, s': [0,1,2,3], 'x temp, C':[0,10,20,30],
'x pressure, kPa': [0,100,200,300]}
import csv
with open('outfile.csv', 'w') as outfile:
writer = csv.writer(outfile)
writer.writerow(my_data.keys())
writer.writerows(zip(*my_data.values()))
That will write the columns in arbitrary order, which order may change from run to run. One way to make the order to be consistent is to replace the last two lines with:
writer.writerow(sorted(my_data.keys()))
writer.writerows(zip(*(my_data[k] for k in sorted(my_data.keys()))))
Edit: in this example data is a list of dictionaries. Each row in the csv contains one value for each key.
To write your dictionary with a header row and then data rows:
with open(filename, 'wb') as outfile:
writer = csv.DictWriter(outfile, fieldnames)
writer.writeheader()
writer.writerows(data)
To read in data as a dictionary then you do need to use DictReader:
with open(filename, 'r') as infile:
reader = csv.DictReader(infile)
data = [row for row in reader]