I have a large JSONL file (~100 GB). I want to convert it to a pandas DataFrame and apply some functions to one of its columns by iterating over all the rows.
What's the best way to read this JSONL file? I'm currently doing the following, but it gets stuck (running on GCP):
import pandas as pd
import json
data = []
with open("my_jsonl_file", 'r') as file:
    for line in file:
        data.append(json.loads(line))
For smaller data you can simply use:
import pandas as pd
path = "test.jsonl"
data = pd.read_json(path, lines=True)
For large data, read the file line by line (for example with the jsonlines package) and build the DataFrame once at the end. Appending to a DataFrame inside a loop is very slow, and DataFrame.append was removed in pandas 2.0:
import pandas as pd
import jsonlines

rows = []
with jsonlines.open(path) as reader:
    for line in reader:
        # collect the field(s) you need from each record
        rows.append({'c1': line})
df = pd.DataFrame(rows)
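Alternatively, pd.read_json itself supports chunked reading via chunksize, which avoids materializing all 100 GB at once. A minimal self-contained sketch (the filename my_jsonl_file and column c1 mirror the snippets above; the sample data and the str.upper transformation are made up for illustration):

```python
import json
import pandas as pd

# Create a tiny sample JSONL file so the sketch is self-contained;
# in the question this would be the ~100 GB "my_jsonl_file".
with open("my_jsonl_file", "w") as f:
    for i in range(5):
        f.write(json.dumps({"c1": f"row{i}"}) + "\n")

# Read in chunks (2 lines per chunk here; something like 100_000 for real
# data) so the whole file never has to fit in memory at once.
results = []
with pd.read_json("my_jsonl_file", lines=True, chunksize=2) as reader:
    for chunk in reader:
        # apply a function to the column of interest, chunk by chunk
        results.append(chunk["c1"].str.upper())
processed = pd.concat(results, ignore_index=True)
print(processed.tolist())
```

Each chunk is an ordinary DataFrame, so the per-column function is applied piecewise and only the (much smaller) results are concatenated at the end.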
I have a huge Excel file with over 400,000 rows and 20 columns. I need to transpose the table, but I was unable to do it in Excel, and then I was unable to do it with pandas either, so I worked around it by converting to a CSV file.
import pandas as pd
df = pd.read_excel('file.xlsx')
df.to_csv('file.csv')
Then I was able to transpose the CSV and write it to a txt file:
import pandas as pd
df = pd.read_csv("file.csv")
transposed_df = df.T
with open('transposed_file_from_csv.txt', 'w') as outfile:
    transposed_df.to_string(outfile)
But for some reason I got a 1.5 GB txt file, which I'm unable to open on my laptop. Is there an option to get a smaller file? Any other idea is more than welcome.
Thanks in advance!
If the intention is to save the transposed dataframe as CSV, then it's the same command as in the earlier part of your snippet:
transposed_df = df.T
transposed_df.to_csv('new_file.csv')
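If file size is the main concern, to_csv can also compress on the fly; pandas infers the codec from the filename suffix. A small sketch (the frame here is made up, and the filename transposed.csv.gz is illustrative):

```python
import pandas as pd

# Small illustrative frame standing in for the large transposed one.
df = pd.DataFrame({"a": range(3), "b": range(3)})
transposed_df = df.T

# Writing gzip-compressed CSV typically shrinks large text output a lot;
# pandas picks gzip automatically from the ".gz" suffix.
transposed_df.to_csv("transposed.csv.gz", compression="gzip")

# The compressed file can be read back the same way.
restored = pd.read_csv("transposed.csv.gz", index_col=0)
print(restored.shape)
```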
I am trying to load JSON file data into a dataframe, filter a few records, and write it back to file again. My file contains one JSON record per line and each one has a URL in it.
This is the sample data in the input file.
{"site_code":"111","site_url":"https://www.site111.com"}
{"site_code":"222","site_url":"https://www.site333.com"}
{"site_code":"333","site_url":"https://www.site333.com"}
Sample code I used
import pandas as pd
sites = pd.read_json('sites.json', lines=True)
modified_sites = sites[sites['site_code']!=222]
modified_sites.to_json('modified_sites.json',orient='records',lines=True)
But the generated file contains escaped forward slashes
{"site_code":111,"site_url":"https:\/\/www.site111.com"}
{"site_code":333,"site_url":"https:\/\/www.site333.com"}
How can I avoid it and get the following data in the generated file?
{"site_code":111,"site_url":"https://www.site111.com"}
{"site_code":333,"site_url":"https://www.site333.com"}
Note: I referred to these, but they weren't helpful for my case:
pandas to_json() redundant backslashes
You can replace the escaped slashes in the serialized JSON yourself and save the result to file:
import pandas as pd

sites = pd.read_json('sites.json', lines=True)
modified_sites = sites[sites['site_code'] != 222]

# to_json escapes "/" as "\/"; undo that before writing
formatted_json = modified_sites.to_json(orient='records', lines=True).replace('\\/', '/')
with open('modified_sites.json', 'w') as f:
    f.write(formatted_json)
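Another option: Python's built-in json module does not escape forward slashes, so serializing the records line by line with json.dumps sidesteps the problem entirely. A sketch (the sample rows below stand in for the filtered DataFrame; the output filename matches the question):

```python
import json
import pandas as pd

# Sample rows standing in for the filtered modified_sites frame.
modified_sites = pd.DataFrame({
    "site_code": [111, 333],
    "site_url": ["https://www.site111.com", "https://www.site333.com"],
})

# json.dumps leaves "/" unescaped; compact separators match the input style.
with open("modified_sites.json", "w") as f:
    for record in modified_sites.to_dict(orient="records"):
        f.write(json.dumps(record, separators=(",", ":")) + "\n")

content = open("modified_sites.json").read()
print(content)
```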
I have a 20 GB .ndjson file that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool: https://pinetools.com/split-files
Now I have a file with the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as JSON or as a CSV file to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched '"' when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
    print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that I have some emojis in my file, so I do not know how to encode them.
ndjson is now supported out of the box with the argument lines=True (note that to_json with lines=True also requires orient='records'):
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
I think pandas.read_json cannot handle ndjson correctly.
According to this issue you can do something like the following to read it:
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
ndjson (newline-delimited JSON) is the json-lines format: each line is a JSON document. It is ideal for a dataset lacking rigid structure ('non-SQL') where the file size is large enough to warrant splitting across multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your json strings do not contain newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
    data = [json.loads(line) for line in f]
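One caveat, given how the file was produced: the online tool splits by size, not by line, so a chunk's last line can be cut off mid-record, which is consistent with the "Unterminated string" error above. A defensive sketch that skips unparseable lines (the sample data here is made up to simulate a truncated chunk):

```python
import json

# Build a sample chunk whose final line is cut off mid-record,
# as a byte-based splitter would produce.
with open("dump.ndjson.000", "w") as f:
    f.write('{"id": 1, "text": "ok"}\n')
    f.write('{"id": 2, "text": "also ok"}\n')
    f.write('{"id": 3, "text": "trunc')  # truncated: no closing quote/brace

records = []
with open("dump.ndjson.000") as f:
    for line in f:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # Partial line: in a real pipeline you would stitch it together
            # with the start of the next chunk instead of dropping it.
            pass

print(len(records))
```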
I am trying to import CSV-formatted data into a pandas dataframe. The CSV data is located within a .txt file that is located at a web URL. The issue is that I only want to import the part (or parts) of the .txt file that is formatted as CSV. Essentially I need to skip the first 9 rows and then import rows 10-16 as CSV.
My code
import csv
import pandas as pd
import io
url = "http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt"
df = pd.read_csv(io.StringIO(url), skiprows = 9, sep =',', skipinitialspace = True)
df
I get a lengthy error msg that ultimately says "EmptyDataError: No columns to parse from file"
I have looked at similar examples Read .txt file with Python Pandas - strings and floats but this is different.
The code above passes the URL string itself to the CSV parser instead of the text file fetched from that URL. To see what I mean, take out the skiprows parameter and then show the data frame. You'll see this:
Empty DataFrame
Columns: [http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt]
Index: []
Note that the columns are the URL itself.
Import requests (you may have to install it first) and then try this:
import requests
content = requests.get(url).content
df = pd.read_csv(io.StringIO(content.decode('utf-8')), skiprows=9)
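As an aside, pd.read_csv can also fetch a URL directly, e.g. pd.read_csv(url, skiprows=9, skipinitialspace=True), with no requests needed. The skiprows mechanics can be checked locally with an in-memory stand-in for the file (the sample text below is made up, with a 2-line preamble instead of the real file's 9):

```python
import io
import pandas as pd

# Mimic a text file whose CSV table starts after a non-CSV preamble,
# like the BOM wind-rose file in the question.
text = """Station report
Generated 2020
month,speed,direction
Jan,12,NW
Feb,9,SE
"""

# skiprows skips the preamble so the header row becomes the column names.
df = pd.read_csv(io.StringIO(text), skiprows=2, skipinitialspace=True)
print(df.shape)
```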
I have a JSON file with non-valid lines, which I read using this code:
import json
import pandas as pd
data = []
with open('json file ') as f:
    for line in f:
        data.append(json.loads(line))
Sorry about the ugly-looking code; I'm using the mobile Stack Exchange app. What I would like to do is convert the data object into a data frame whose columns are the first 5 elements of each list in the data object. Can you help?
Cheers!
Dani
I feel a little bit ashamed. It is as easy as using the DataFrame constructor:
df = pd.DataFrame(data)
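And to keep only the first five elements of each record as columns, slicing with iloc works; a sketch with made-up records, since the original file isn't shown:

```python
import pandas as pd

# Made-up records standing in for the parsed JSON lines,
# each a list with more than five elements.
data = [
    [1, 2, 3, 4, 5, 6, 7],
    [8, 9, 10, 11, 12, 13, 14],
]

# Build the frame, then keep only the first five columns.
first_five = pd.DataFrame(data).iloc[:, :5]
print(first_five.shape)
```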