Parsing multiple json objects from a text file using Python - python
I have a .json file where each line is an object. For example, the first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I have tried processing it using the ijson library as follows:
with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)
However, I get an error:
JSONError: Additional data
It seems I'm receiving this error because the file contains multiple objects.
What's the recommended way of analyzing such a JSON file in Jupyter?
Thank you in advance.
If this is the complete file, it is not a single valid JSON document. To be one, the objects must be separated by commas and the whole file must start and end with square brackets, like so: [{...},{...}]. For your data it would look like:
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code to clean your file:
# Collect the non-empty lines; each one holds a single JSON object.
with open("yourfile.json", "r") as f:
    lines = [line.strip() for line in f if line.strip()]
# Separate the objects with commas and wrap them in square brackets.
# Joining avoids a trailing comma and also works for a one-line file.
with open("cleanfile.json", "w") as g:
    g.write("[" + ",\n".join(lines) + "]")
To read a json file properly you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).
import pandas as pd
# get a pandas DataFrame from a JSON file that holds a single array;
# for a file with one object per line, pass lines=True instead
df = pd.read_json("path/to/your/filename.json")
If you are not familiar with pandas, here is a quick head start on how to work with a DataFrame object:
df.head() #gives you the first rows of the dataframe
df["review_id"] # gives you the column review_id as a vector
df.iloc[1,:] # gives you the complete row with index 1
df.iloc[1,2] # gives you the item in row with index 1 and column with index 2
While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you will have to iterate over the lines and parse each one into an object.
You can aggregate these objects in one list, and from there do whatever you like with your data:
import json
with open(filename, 'r') as f:
    object_list = []
    for line in f.readlines():
        object_list.append(json.loads(line))
# object_list will contain all of your file's data
You could do it as a list comprehension to make it a little more Pythonic:
with open(filename, 'r') as f:
    object_list = [json.loads(line)
                   for line in f.readlines()]
# object_list will contain all of your file's data
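For large files, the same line-by-line idea can be written as a generator so that only one object is parsed at a time. This is a minimal sketch rather than the answerer's code: the name iter_json_lines is hypothetical, the sample records are made up, and skipping blank lines is an assumption about how the file should be treated.

```python
import io
import json

def iter_json_lines(lines):
    """Yield one parsed object per non-empty line (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no data and can be skipped
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report which line was malformed instead of a bare parser error.
            raise ValueError("line {}: {}".format(line_number, exc)) from exc

# io.StringIO stands in for an open file; the records are made up.
sample = io.StringIO(
    '{"review_id": "a1", "stars": 5}\n'
    '\n'
    '{"review_id": "b2", "stars": 3}\n'
)
object_list = list(iter_json_lines(sample))
print(len(object_list))
```

With a real file, passing the open file object (for example inside `with open(filename) as f:`) in place of `sample` should work the same way, since iterating a text file yields its lines.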
Your file contains multiple JSON objects, one per line, which is why the parser throws errors. Parse each line separately:
import json
with open(filename, 'r') as f:
    lines = f.readlines()
    first = json.loads(lines[0])
    second = json.loads(lines[1])
That should catch both lines and load them properly.
Related
Reading csv file and want to skip first two columns
I am trying to read a CSV file in Python. I want to read the whole file except for the first two columns. The file has no column names, so I cannot drop or skip columns by name. What code do I need to read the file without the first two columns? I have tried the code below:
with open("data2.csv", "r") as file:
    lines = [line.split() for line in file]
for i, x in enumerate(lines):
    print("line {0} = {1}".format(i,x))
The code above just reads the file line by line. How do I skip the first two columns while reading the file?
You should use the csv module in the standard library. You might need to pass additional kwargs (keyword arguments) depending on the format of your CSV file.
import csv
with open('my_csv_file', 'r') as fin:
    reader = csv.reader(fin)
    for line in reader:
        print(line[2:])
        # do something with rest of columns...
If the lines list is already giving you the data you want, you can use slicing on each row to get rid of the columns you don't want. To drop the first two items of a list, use [2:]; to drop the last two, use [:-2].
with open("data2.csv", "r") as file:
    lines = [line.split()[2:] for line in file]
for i, x in enumerate(lines):
    print("line {0} = {1}".format(i,x))
Python: Replace string in a txt file but not on every occurrence
I am really new to Python and I need to change new artikel IDs to the old ones. The IDs are mapped inside a dict. The file I need to edit is a normal txt file where the columns are separated by tabs. The problem is not replacing the values, but rather replacing only the occurrences in the desired column, which is set by pos. I would really appreciate some help.
def replaceArtCol(filename, pos):
    with open(filename) as input_file, open('test.txt','w') as output_file:
        for each_line in input_file:
            val = each_line.split("\t")[pos]
            for row in artikel_ID:
                if each_line[pos] == pos
                    line = each_line.replace(val, artikel_ID[val])
            output_file.write(line)
This code just replaces any occurrence of the string in the text file.
Supposing your ID mapping dict looks like ID_mapping = {'old_id': 'new_id'}, I think your code is not far from working correctly. A modified version could look like:
with open(filename) as input_file, open('test.txt','w') as output_file:
    for each_line in input_file:
        # strip the line ending first so the last column can also match
        line = each_line.rstrip("\n").split("\t")
        if line[pos] in ID_mapping.keys():
            line[pos] = ID_mapping[line[pos]]
        line = '\t'.join(line) + "\n"
        output_file.write(line)
If you're not working in pandas anyway, this can save a lot of overhead.
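The column-wise replacement can also be sketched with the csv module, which takes care of splitting on tabs and of the line endings. This is only an illustration: replace_column, the sample rows, and the contents of ID_mapping are made up, and io.StringIO stands in for the real input and output files.

```python
import csv
import io

def replace_column(rows, pos, mapping):
    """Yield rows with the value in column `pos` swapped via `mapping` (hypothetical helper)."""
    for row in rows:
        row = list(row)
        if pos < len(row):
            # dict.get leaves values that have no mapping unchanged
            row[pos] = mapping.get(row[pos], row[pos])
        yield row

ID_mapping = {"N100": "A1", "N200": "A2"}  # made-up IDs
source = io.StringIO("N100\tbolt\tN100\nN200\tnut\tN300\n")
target = io.StringIO()

reader = csv.reader(source, delimiter="\t")
writer = csv.writer(target, delimiter="\t", lineterminator="\n")
writer.writerows(replace_column(reader, 2, ID_mapping))
print(target.getvalue())
```

In the sample, "N100" also appears in the first column, but only the value in column 2 is replaced, which is the behaviour the question asks for.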
If your data is tab-separated, you should load it into a DataFrame so that you have a structure of columns and rows. What you are doing right now will not let you do what you want without some complex and buggy logic. You may try these steps:
import pandas as pd
df = pd.read_csv("dummy.txt", sep="\t", encoding="latin-1")
df['desired_column_name'] = df['desired_column_name'].replace({"value_to_be_changed": "newvalue"})
print(df.head())
Converting CSV data from file to JSON
I have a CSV file that contains data separated by ','. I am trying to convert it into JSON format. For this I am trying to extract the headers first, but I am not able to differentiate between the headers and the next row. Here is the data in the CSV file:
Start Date ,Start Time,End Date,End Time,Event Title
9/5/2011,3:00:00 PM,9/5/2011,,Social Studies Dept. Meeting
9/5/2011,6:00:00 PM,9/5/2011,8:00:00 PM,Curriculum Meeting
I have tried csvreader as well, but I got stuck at the same issue. Basically, Event Title and the date on the next line are not being distinguished.
with open(file_path, 'r') as f:
    first_line = re.sub(r'\s+', '', f.read())
    arr = []
    headers = []
    for header in f.readline().split(','):
        headers.append(header)
    for line in f.readlines():
        lineItems = {}
        for i,item in enumerate(line.split(',')):
            lineItems[headers[i]] = item
        arr.append(lineItems)
print(arr)
print(headers)
jsonText = json.dumps(arr)
print(jsonText)
All three print statements give the empty results below.
[]
['']
[]
I expect jsonText to be JSON made of key-value pairs.
Use csv.DictReader to get a list of dicts (each row is a dict), then serialize it. Note that json.dump expects an open file object, not a path.
import json
import csv
with open(csvfilepath) as f, open(jsonfilepath, 'w') as out:
    json.dump(list(csv.DictReader(f)), out)
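Applied to the sample data from the question, the DictReader approach can be sketched as follows. This is a small illustration under assumptions: io.StringIO stands in for the real CSV file, and json.dumps is used instead of writing to a file so that the result can be inspected directly.

```python
import csv
import io
import json

# The sample data from the question; io.StringIO stands in for the open file.
csv_text = (
    "Start Date ,Start Time,End Date,End Time,Event Title\n"
    "9/5/2011,3:00:00 PM,9/5/2011,,Social Studies Dept. Meeting\n"
    "9/5/2011,6:00:00 PM,9/5/2011,8:00:00 PM,Curriculum Meeting\n"
)

# DictReader takes the first row as the keys and yields one dict per data row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows, indent=2)
print(json_text)
```

The header row is consumed as the keys, so the first data row starts with the date as expected. The first key keeps its trailing space ("Start Date ") because that is how it appears in the file, and the empty End Time field becomes an empty string.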
In Python, each file has a marker that keeps track of where you are in the file. Once you call read(), you have read through the entire file, and all future read or readline calls will return nothing. So, just delete the line involving first_line.
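The effect of the file position can be seen in a few lines. This is a minimal sketch: io.StringIO stands in for an open text file, and the contents are made up.

```python
import io

# io.StringIO behaves like an open text file; the contents are made up.
f = io.StringIO("Start Date,Event Title\n9/5/2011,Curriculum Meeting\n")

whole = f.read()           # moves the position to the end of the file
after_read = f.readline()  # nothing is left to read, so this is ''

f.seek(0)                  # rewind to the beginning
header = f.readline()      # the first line is available again

print(repr(after_read))
print(repr(header))
```

So an alternative to deleting the read() call is to rewind with seek(0) before reading the header, although reading the file only once is simpler.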
Getting a unique value from csv file in python
I have the following code in Python for the Twitter API. How do I iterate to get the values from a CSV file?
get_tweet = set()
gettweet_list = []
with open(TWEET_FILE) as in_file:
    for line in in_file:
        gettweet_list.append(str(line))
get_tweet.update(set(gettweet_list))
del gettweet_list
if len(get_tweet) > 140:
    get_tweet = get_tweet[:140]
return get_tweet
If you just want the first row, just get the first row:
with open(TWEET_FILE) as in_file:
    data = set(next(in_file).split())
You also cannot index a set, so get_tweet[:140] would never work. If you want a set of the values from all rows:
import csv
from itertools import chain
with open(TWEET_FILE) as in_file:
    data = set(chain.from_iterable(csv.reader(in_file)))
Sets also have no order, so even turning one back into a list and slicing it will not give you the first row.
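If the order of the values matters, one option that is not part of the answer above is to deduplicate with dict.fromkeys, which keeps the first occurrence of each value in insertion order (guaranteed from Python 3.7). The following is a sketch with made-up data; io.StringIO stands in for the file opened from TWEET_FILE.

```python
import csv
import io

# Made-up contents standing in for the CSV file.
in_file = io.StringIO("hello,world\nhello,again\n")

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without losing the original order.
unique_values = list(dict.fromkeys(
    value for row in csv.reader(in_file) for value in row
))

# Unlike a set, a list can be sliced.
first_values = unique_values[:2]
print(unique_values)
```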
Reading column names alone in a csv file
I have a CSV file with the following columns:
id,name,age,sex
followed by a lot of values for the above columns. I am trying to read the column names alone and put them inside a list. I am using DictReader, and this gives the correct details:
with open('details.csv') as csvfile:
    i=["name","age","sex"]
    re=csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
But what I want is for the list of columns ("i" in the above case) to be parsed automatically from the input CSV rather than hardcoded inside a list.
with open('details.csv') as csvfile:
    rows=iter(csv.reader(csvfile)).next()
    header=rows[1:]
    re=csv.DictReader(csvfile)
    for row in re:
        print row
        for x in header:
            print row[x]
This gives a KeyError: 'name' on the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using DictReader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution. Python's DictReader object in the csv module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    #get fieldnames from DictReader object and store in list
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row), which allows:
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say, the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    #you can eat the first line before creating DictReader.
    #if no "fieldnames" param is passed into
    #DictReader object upon creation, DictReader
    #will read the upper-most line as the headers
    f.readline()
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
You can read the header by using the next() function, which returns the next row of the reader’s iterable object as a list. Then you can add the rest of the file's content to a list.
import csv
with open("C:/path/to/.filecsv", "rb") as f:
    reader = csv.reader(f)
    i = reader.next()
    rest = list(reader)
Now i has the column names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in Python 3. Instead, use the built-in next() to get the first line of the CSV right after creating the reader. In Python 3 the file must also be opened in text mode rather than "rb", like so:
import csv
with open("C:/path/to/.filecsv", "r", newline="") as f:
    reader = csv.reader(f)
    i = next(reader)
print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about:
with open(csv_input_path + file, 'r') as ft:
    header = ft.readline()  # read only first line; returns string
    header_list = header.strip().split(',')  # strip the line ending, then split; returns list
I am assuming your input file is in CSV format. Using pandas takes more time if the file is big, because it loads the entire data set.
Here is how to get all the column names from a CSV file using the pandas library. First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, to get all the column names as a list from the input file, use:
columns = list(file.head(0))
Thanks to Daniel Jimenez for his perfect solution for fetching the column names alone from my CSV. I extended his solution to use DictReader so we can iterate over the rows using the column names as indexes. Thanks, Jimenez.
import csv
with open('myfile.csv') as csvfile:
    rest = []
    with open("myfile.csv", "rb") as f:
        reader = csv.reader(f)
        i = reader.next()
        i=i[1:]
    re=csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
Here is the code to print only the headers (columns) of the CSV file.
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method, with pandas:
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
I literally just wanted the first row of my data, which holds the headers I need, and didn't want to iterate over all my data to get them, so I just did this:
import csv
with open(data, 'r', newline='') as csvfile:
    t = 0
    for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
        if t > 0:
            break
        else:
            dbh = i
            t += 1
Using pandas is also an option. But instead of loading the full file into memory, you can retrieve only the first chunk of it to get the field names by using an iterator.
import pandas as pd
file = pd.read_csv('details.csv', iterator=True)
column_names_full = file.get_chunk(1)
column_names = [column for column in column_names_full]
print(column_names)