Parsing multiple json objects from a text file using Python - python

I have a .json file where each line is an object. For example, first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I have tried processing using ijson lib as follows:
with open(filename, 'r') as f:
objects = ijson.items(f, 'columns.items')
columns = list(objects)
However, i get error:
JSONError: Additional data
Its seems due to multiple objects I'm receiving such error.
Whats the recommended way for analyzing such Json file in Jupyter?
Thank You in advance

The file format is not correct if this is the complete file. Between the curly brackets there must be a comma and it should start and end with a square bracket. Like so: [{...},{...}]. For your data it would look like:
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code how to clean your file:
lastline = None
with open("yourfile.json","r") as f:
lineList = f.readlines()
lastline=lineList[-1]
with open("yourfile.json","r") as f, open("cleanfile.json","w") as g:
for i,line in enumerate(f,0):
if i == 0:
line = "["+str(line)+","
g.write(line)
elif line == lastline:
g.write(line)
g.write("]")
else:
line = str(line)+","
g.write(line)
To read a json file properly you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).
import pandas as pd
#get a pandas dataframe object from json file
df = pd.read_json("path/to/your/filename.json")
If you are not familiar with pandas, here a quick headstart, how to work with a dataframe object:
df.head() #gives you the first rows of the dataframe
df["review_id"] # gives you the column review_id as a vector
df.iloc[1,:] # gives you the complete row with index 1
df.iloc[1,2] # gives you the item in row with index 1 and column with index 2

While each line on it's own is valid JSON, your file as a whole is not. As such, you can't parse it in one go, you will have to iterate over each line parse it into an object.
You can aggregate these objects in one list, and from there do whatever you like with your data :
import json
with open(filename, 'r') as f:
object_list = []
for line in f.readlines():
object_list.append(json.loads(line))
# object_list will contain all of your file's data
You could do it as a list comprehension to have it a little more pythonic :
with open(filename, 'r') as f:
object_list = [json.loads(line)
for line in f.readlines()]
# object_list will contain all of your file's data

You have multiple lines in your file, so that's why it's throwing errors
import json
with open(filename, 'r') as f:
lines = f.readlines()
first = json.loads(lines[0])
second = json.loads(lines[1])
That should catch both lines and load them in properly

Related

Reading csv file and want to skip first two columns

I am trying to read a CSV file in Python. Further I want to read my whole file but just don't want first two columns. Also I don't have columns name so that I can easily drop or skip it.
What code do I need to read the file without reading first two columns?
I have tried below code:
with open("data2.csv", "r") as file:
lines = [line.split() for line in file]
for i, x in enumerate(lines):
print("line {0} = {1}".format(i,x))
I am just reading file line by line from above code. But how to skip first two columns and then read the file? I don't have names of the columns.
You should use the csv module in the standard library. You might need to pass additional kwargs (keyword arguments) depending on the format of your csv file.
import csv
with open('my_csv_file', 'r') as fin:
reader = csv.reader(fin)
for line in reader:
print(line[2:])
# do something with rest of columns...
if the lines list does getting the data you want you can use slicing to get rid of the columns you don't want:
getting rid of first two:
lines[2:]
getting rid of last two:
lines[:-2]
with open("data2.csv", "r") as file:
lines = [line.split()[2:] for line in file]
for i, x in enumerate(lines):
print("line {0} = {1}".format(i,x))

Python: Replace string in a txt file but not on every occurrence

I am really new to python and I need to change new artikel Ids to the old ones. The Ids are mapped inside a dict. The file I need to edit is a normal txt where every column is sperated by Tabs. The problem is not replacing the values rather then only replacing the ouccurances in the desired column which is set by pos.
I really would appreciate some help.
def replaceArtCol(filename, pos):
with open(filename) as input_file, open('test.txt','w') as output_file:
for each_line in input_file:
val = each_line.split("\t")[pos]
for row in artikel_ID:
if each_line[pos] == pos
line = each_line.replace(val, artikel_ID[val])
output_file.write(line)`
This Code just replaces any occurance of the string in the text file.
supposed your ID mapping dict looks like ID_mapping = {'old_id': 'new_id'}, I think your code is not far from working correctly. A modified version could look like
with open(filename) as input_file, open('test.txt','w') as output_file:
for each_line in input_file:
line = each_line.split("\t")
if line[pos] in ID_mapping.keys():
line[pos] = ID_mapping[line[pos]]
line = '\t'.join(line)
output_file.write(line)
if you're not working in pandas anyway, this can save a lot of overhead.
if your data is tab separated then you must load this data into dataframe.. this way you can have columns and rows structure.. what you are sdoing right now will not allow you to do what you want to do without some complex and buggy logic. you may try these steps
import pandas as pd
df = pd.read_csv("dummy.txt", sep="\t", encoding="latin-1")
df['desired_column_name'] = df['desired_column_name'].replace({"value_to_be_changed": "newvalue"})
print(df.head())

Converting CSV data from file to JSON

I have a csv file that contains csv data separated by ','. I am trying to convert it into a json format. For this I am tyring to extract headers first. But, I am not able to differentiate between headers and the next row.
Here is the data in csv file:
Start Date ,Start Time,End Date,End Time,Event Title
9/5/2011,3:00:00 PM,9/5/2011,,Social Studies Dept. Meeting
9/5/2011,6:00:00 PM,9/5/2011,8:00:00 PM,Curriculum Meeting
I have tried csvreader as well but I got stuck at the same issue.
Basically Event Title and the date on the next line is not being distinguished.
with open(file_path, 'r') as f:
first_line = re.sub(r'\s+', '', f.read())
arr = []
headers = []
for header in f.readline().split(','):
headers.append(header)
for line in f.readlines():
lineItems = {}
for i,item in enumerate(line.split(',')):
lineItems[headers[i]] = item
arr.append(lineItems)
print(arr)
print(headers)
jsonText = json.dumps(arr)
print(jsonText)
All three print statements give empty result below.
[]
['']
[]
I expect jsonText to be a json of key value pairs.
Use csv.DictReader to get a list of dicts (each row is a dict) then serialize it.
import json
import csv
with open(csvfilepath) as f:
json.dump(list(csv.DictReader(f)), jsonfilepath))
In Python, each file has a marker that keeps track of where you are in the file. Once you call read(), you have read through the entire file, and all future read or readline calls will return nothing.
So, just delete the line involving first_line.

Getting a unique value from csv file in python

I have following code in Python for twitter api, How do i iterate to get the values from CSV file.
get_tweet = set()
gettweet_list = []
with open(TWEET_FILE) as in_file:
for line in in_file:
gettweet_list.append(str(line))
get_tweet.update(set(gettweet_list))
del gettweet_list
if len(get_tweet) > 140:
get_tweet = get_tweet[:140]
return get_tweet
If you just want the first how just get the first row:
with open(TWEET_FILE) as in_file:
data = set(next(in_file).split())
You also cannot index a set so get_tweet[:140] would never work.
If you want a set of all rows:
import csv
from itertools import chain
with open(TWEET_FILE) as in_file:
data = set(chain.from_iterable(csv.reader(in_file)))
Sets also have no order so even turning it back into a list and slicing will not give you the first row.

Reading column names alone in a csv file

I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using Dictreader and this gives out the correct details:
with open('details.csv') as csvfile:
i=["name","age","sex"]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
But what I want to do is, I need the list of columns, ("i" in the above case)to be automatically parsed with the input csv than hardcoding them inside a list.
with open('details.csv') as csvfile:
rows=iter(csv.reader(csvfile)).next()
header=rows[1:]
re=csv.DictReader(csvfile)
for row in re:
print row
for x in header:
print row[x]
This gives out an error
Keyerrror:'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using Dictreader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
d_reader = csv.DictReader(f)
#get fieldnames from DictReader object and store in list
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
#you can eat the first line before creating DictReader.
#if no "fieldnames" param is passed into
#DictReader object upon creation, DictReader
#will read the upper-most line as the headers
f.readline()
d_reader = csv.DictReader(f)
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
You can read the header by using the next() function which return the next row of the reader’s iterable object as a list. then you can add the content of the file to a list.
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
rest = list(reader)
Now i has the column's names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in python 3. Instead use the the inbuilt next() to get the first line of the csv immediately after reading like so:
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = next(reader)
print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
reader = csv.DictReader(f, delimiter=',')
for row in reader:
print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about
with open(csv_input_path + file, 'r') as ft:
header = ft.readline() # read only first line; returns string
header_list = header.split(',') # returns list
I am assuming your input file is CSV format.
If using pandas, it takes more time if the file is big size because it loads the entire data as the dataset.
I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))
Thanking Daniel Jimenez for his perfect solution to fetch column names alone from my csv, I extend his solution to use DictReader so we can iterate over the rows using column names as indexes. Thanks Jimenez.
with open('myfile.csv') as csvfile:
rest = []
with open("myfile.csv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
i=i[1:]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
here is the code to print only the headers or columns of the csv file.
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
I literally just wanted the first row of my data which are the headers I need and didn't want to iterate over all my data to get them, so I just did this:
with open(data, 'r', newline='') as csvfile:
t = 0
for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
if t > 0:
break
else:
dbh = i
t += 1
Using pandas is also an option.
But instead of loading the full file in memory, you can retrieve only the first chunk of it to get the field names by using iterator.
import pandas as pd
file = pd.read_csv('details.csv'), iterator=True)
column_names_full=file.get_chunk(1)
column_names=[column for column in column_names_full]
print column_names

Categories