I have a JSON file that is not valid as a single document; each line is a separate JSON value. I read it using this code:
import json
import pandas as pd

data = []
with open('json file') as f:
    for line in f:
        data.append(json.loads(line))
Sorry about the ugly-looking code; I'm using the mobile Stack Exchange app. What I would like to do is convert the data object into a DataFrame whose columns are the first 5 elements of each list in data. Can you help?
Cheers!
Dani
I feel a little bit ashamed. It is as easy as using the DataFrame constructor:
df = pd.DataFrame(data)
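Since you only want the first 5 elements of each list as columns, one extra step (a sketch, assuming every list in data has at least 5 elements):

# keep only the first 5 columns by position
df = pd.DataFrame(data).iloc[:, :5]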
I'm building a site that, based on a user's input, sorts through JSON data and prints a schedule for them into an HTML table. I want to give it the functionality that, once their table is created, they can export the data to a CSV/Excel file, so we don't have to store their credentials (logins & schedules) in a database. Is this possible? If so, how can I do it, preferably using Python?
This is not the exact answer, but rather steps for you to follow in order to get a solution (a sketch follows the list):
1. Read the data from JSON: some_dict = json.loads(json_string)
2. Write appropriate code to get the data out of the dictionary (sorting, conditions, etc.) and collect it in a 2D array (a list of lists).
3. Save that list as CSV: https://realpython.com/python-csv/
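A minimal sketch of those three steps, assuming a hypothetical json_string whose records carry name and time fields:

import csv
import json

# 1. parse the JSON string into Python objects (here: a list of dicts)
json_string = '[{"name": "Alice", "time": "10:30"}, {"name": "Bob", "time": "09:00"}]'
records = json.loads(json_string)

# 2. reshape into a 2D list: a header row plus one row per record, sorted by time
rows = [["name", "time"]]
for rec in sorted(records, key=lambda r: r["time"]):
    rows.append([rec["name"], rec["time"]])

# 3. write the 2D list out as a CSV file
with open("schedule.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)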
I'm pretty lazy and like to utilize pandas for things like this. It would be something along the lines of

import json
import pandas as pd

file = 'data.json'
with open(file) as j:
    json_data = json.load(j)

df = pd.DataFrame.from_dict(json_data, orient='index')
df.to_csv("data.csv")
I have a .ndjson file of 20 GB that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://pinetools.com/split-files
Now I get a file that has the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as JSON or as a CSV file, to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched ''"' when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
    print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that I have some emojis in my file, so do I need to encode them somehow?
ndjson is now supported out of the box by pandas with the argument lines=True:
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
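Note the orient='records' argument on the last line: when writing, pandas only allows lines=True together with orient='records'; reading with lines=True needs no extra argument.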
I think pandas.read_json cannot handle ndjson correctly.
According to this issue, you can do something like the following to read it:
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
ndjson (newline-delimited JSON) is a JSON Lines format: each line is a JSON document. It is ideal for a dataset lacking rigid structure ('non-SQL') where the file size is large enough to warrant multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your json strings do not contain newlines, you can alternatively use:
import json

with open("dump.ndjson.000") as f:
    data = [json.loads(l) for l in f.readlines()]
I have a large JSONL file (~100 GB). I want to convert this to a pandas DataFrame and apply some functions on a column by iterating over all the rows.
What's the best way to read this JSONL file? I am doing the following currently, but that gets stuck (running this on GCP):
import pandas as pd
import json
data = []
with open("my_jsonl_file", 'r') as file:
    for line in file:
        data.append(json.loads(line))
For smaller data you can simply use:
import pandas as pd
path = "test.jsonl"
data = pd.read_json(path, lines=True)
For large data, you can use something like this:

import jsonlines
import pandas as pd

rows = []
with jsonlines.open(path) as reader:
    for line in reader.iter():
        # collect the data from each line here
        rows.append({'c1': line})

df = pd.DataFrame(rows, columns=['c1'])
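Alternatively, pandas can stream a JSONL file itself via the chunksize argument of read_json, which returns an iterator of DataFrames. A sketch, assuming a hypothetical process(chunk) function that holds your per-column logic:

import pandas as pd

# read 100,000 lines at a time instead of loading ~100 GB at once
reader = pd.read_json("my_jsonl_file", lines=True, chunksize=100_000)
for chunk in reader:
    process(chunk)  # hypothetical: apply your functions to the column here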
I have an extremely large list of JSON objects in the form of a TextEdit document, each of which has 6 key-value pairs.
I would like to turn each key into a column name for a pandas DataFrame, and list the values under the column:
{'column1': "stuff stuff", 'column2': "details details", ...}
Is there a standard way to do this?
I think you could begin by loading the file into a DataFrame with

import pandas as pd

df = pd.read_table(file_name)

I think each column could then be created by iterating through each JSON document using groupby.
EDIT: I think the correct approach is to parse each JSON object into a DataFrame, and then write a function that iterates through all the JSONs and combines them into one DataFrame.
Take a look at read_json or json_normalize. You would indeed most likely read each document and then use, for instance, pd.concat to combine them as required.
Something along the lines below should work, depending on what your file looks like (here assuming that each JSON dictionary makes up a line in the file):
import pandas as pd

df = pd.DataFrame()
with open('workfile', 'r') as f:
    for line in f:
        df = pd.concat([df, pd.read_json(line, orient='columns')])
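Since json_normalize was mentioned above, here is a minimal alternative sketch that parses every line first and builds the frame in one call (again assuming one JSON object per line):

import json
import pandas as pd

with open('workfile') as f:
    records = [json.loads(line) for line in f]

# flatten the list of dicts into a single DataFrame
df = pd.json_normalize(records)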
I have some data in a text file which looks like this:
(v14).K TaskList[Parameter Estimation].(Problem)Parameter Estimation.Best Value
5.00885e-007 3.0914e+007
5.75366e-007 2.99467e+007
6.60922e-007 2.99199e+007
I'm trying to get this data into a pandas dataframe. The code I've written below partially works but has formatting issues:
import pandas

def parse_PE_results(results_file):
    with open(results_file) as f:
        data = f.readlines()
    parameter_value = []
    best_value = []
    for i in data:
        split = i.split('\t')
        parameter_value.append(split[0])
        best_value.append(split[1].rstrip())
    pv = pandas.Series(parameter_value, name=parameter_value[0])
    bv = pandas.Series(best_value, name=best_value[0])
    df = pandas.DataFrame({parameter_value[0]: pv, best_value[0]: bv})
    return df
I get the feeling that there must be an easier, more 'pythonic' way of building a data frame from text files. Would anybody happen to know what that is?
Use pandas.read_csv. The entire parse_PE_results function can be replaced with
df = pd.read_csv(results_file, delimiter='\t')
You'll also enjoy better performance by using read_csv instead of calling
data=f.readlines() and looping through it line by line.
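For reference, a quick demonstration of that one-liner on the sample data from the question, using io.StringIO to stand in for the file:

import io
import pandas as pd

sample = (
    "(v14).K\tTaskList[Parameter Estimation].(Problem)Parameter Estimation.Best Value\n"
    "5.00885e-007\t3.0914e+007\n"
    "5.75366e-007\t2.99467e+007\n"
    "6.60922e-007\t2.99199e+007\n"
)
df = pd.read_csv(io.StringIO(sample), delimiter='\t')
print(df)  # the header comes from the first line; both columns parse as floats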