How to avoid pandas to_json escaping forward slashes in URLs - python

I am trying to load JSON file data into a dataframe, filter a few records, and write it back to file again. My file contains one JSON record per line and each one has a URL in it.
This is the sample data in the input file.
{"site_code":"111","site_url":"https://www.site111.com"}
{"site_code":"222","site_url":"https://www.site333.com"}
{"site_code":"333","site_url":"https://www.site333.com"}
Sample code I used
import pandas as pd
sites = pd.read_json('sites.json', lines=True)
modified_sites = sites[sites['site_code']!=222]
modified_sites.to_json('modified_sites.json',orient='records',lines=True)
But the generated file contains escaped forward slashes
{"site_code":111,"site_url":"https:\/\/www.site111.com"}
{"site_code":333,"site_url":"https:\/\/www.site333.com"}
How can I avoid it and get the following data in the generated file?
{"site_code":111,"site_url":"https://www.site111.com"}
{"site_code":333,"site_url":"https://www.site333.com"}
Note: I looked at the following question, but it did not help in my case:
pandas to_json() redundant backslashes

You can replace the escaped slashes in the generated JSON string yourself and then write the result to the file:
import pandas as pd
sites = pd.read_json('sites.json', lines=True)
modified_sites = sites[sites['site_code'] != 222]
# Without a path, to_json() returns the JSON as a string; undo the "\/" escaping
formatted_json = modified_sites.to_json(orient='records', lines=True).replace('\\/', '/')
# Write the cleaned string once and close the file when done
with open('modified_sites.json', 'w') as f:
    print(formatted_json, file=f)
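An alternative to string replacement is to let the standard json module do the final serialization, since json.dumps does not escape forward slashes. The sketch below is only an illustration: the helper name to_json_lines is hypothetical, and the sample frame stands in for the contents of sites.json.

```python
import json
import pandas as pd

# Stand-in for the data loaded from sites.json (sample values, not real sites)
sites = pd.DataFrame({
    "site_code": [111, 222, 333],
    "site_url": [
        "https://www.site111.com",
        "https://www.site222.com",
        "https://www.site333.com",
    ],
})
modified_sites = sites[sites["site_code"] != 222]

def to_json_lines(df):
    """Serialize a DataFrame as one JSON record per line without '\\/' escapes."""
    # Let pandas handle type conversion, then parse back into plain Python objects
    records = json.loads(df.to_json(orient="records"))
    # json.dumps leaves "/" untouched; compact separators mimic pandas' output
    return "\n".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False)
        for record in records
    )

json_lines = to_json_lines(modified_sites)
print(json_lines)
```

The round trip through json.loads costs some extra time on large frames, so the plain replace() approach may still be preferable when speed matters.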

Related

How do I get my response to show ndjson data instead of text?

My API call returns data in ndjson format, which won't let me use "data = response.json".
My only workaround is the code below. It gives me a string that is difficult to parse. Once loaded into a pandas df, it has about 20 columns of dictionaries of dictionaries. Is there a better way of getting the ndjson data and/or parsing these columns?
import pandas as pd
from io import StringIO
import ndjson as nd
data = response.text()
df = pd.read_json(StringIO(data), lines = True)
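One possible way to deal with columns of nested dictionaries is to parse each line with json.loads and flatten the result with pd.json_normalize. This is a minimal sketch, not a tested answer for this API: the helper name ndjson_to_frame and the sample payload are made up for illustration.

```python
import json
import pandas as pd

def ndjson_to_frame(text):
    """Parse newline-delimited JSON text and flatten nested dicts into columns."""
    # One JSON document per non-empty line
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Nested keys become dotted column names such as "user.geo.city"
    return pd.json_normalize(records)

# Made-up sample standing in for the response body
sample = (
    '{"id": 1, "user": {"name": "a", "geo": {"city": "X"}}}\n'
    '{"id": 2, "user": {"name": "b", "geo": {"city": "Y"}}}\n'
)
flat = ndjson_to_frame(sample)
print(flat.columns.tolist())
```

Each nested level ends up as its own column, which is usually easier to filter than a column of dictionaries.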

Pandas exported CSV file not enclosing text/string in double quotes

I have a python script which gets a JSON file from a MongoDB database, performs ETL processes such as filtering, flattening the dictionary and finally exporting the dataframe to CSV (which works fine).
The issue I am having is when I open the CSV in Notepad, the text columns are not enclosed in quotation marks.
Correct me if I'm wrong but I believe when a datatype of a column has been specified as a string/text, when you open that file in Excel there are no quotes but when opened in Notepad it should show those string columns within quotes.
from pymongo import MongoClient
import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient
from azure.core._match_conditions import MatchConditions
from azure.storage.filedatalake._models import ContentSettings
from pandas import json_normalize
from datetime import datetime, timedelta
import numpy as np
import json
mongo_client = MongoClient("XXXX")
db = mongo_client.rfqdb
table = db.request
document = table.find({'createdAt': {'$gt': datetime.utcnow() - timedelta(days=7)}})
docs = list(document)
docs = json.dumps(docs,default=str)
docs = docs.replace(r"\n",'').replace(r"\r\n",'').replace(r"\r",'')
docs = json.loads(docs)
docs = json_normalize(docs)
docs = docs[["id","reportName"]].astype("string")
print(docs.dtypes)
id string
reportName string
When I open the exported CSV file from Pandas in Notepad++ it doesn't show the string within quotes:
Could anyone shed some light on this situation as I've done this same process in Azure Data Factory where I have mapped these two columns as Strings and when I open the CSV in Notepad it shows the strings wrapped inside quotes(see below), so I'm a bit confused why Python Pandas isn't showing this when exporting.
Thanks in advance
As answered here, this comes down to how the CSV is formatted:
Unnecessary double quotes added to rows of CVS file when opening with notepad/notepad++
So, if you want to get rid of the double quotes, I suggest trying this:
csv.writer(csvfile, quoting=csv.QUOTE_NONE)
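For the opposite case, where the text columns should be wrapped in quotes as the question asks, pandas' to_csv accepts the same quoting constants from the csv module. The sketch below uses made-up values in place of the MongoDB data; only the column names come from the question.

```python
import csv
import pandas as pd

# Made-up rows standing in for the flattened MongoDB documents
docs = pd.DataFrame({
    "id": ["a1", "b2"],
    "reportName": ["Weekly report", "Monthly report"],
})

# QUOTE_ALL wraps every field, including the header, in double quotes
csv_text = docs.to_csv(index=False, quoting=csv.QUOTE_ALL)
print(csv_text)
```

csv.QUOTE_NONNUMERIC is the related option for quoting only non-numeric fields. By default pandas uses csv.QUOTE_MINIMAL, which quotes a field only when it contains the delimiter, a quote character, or a line break.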

How to open .ndjson file in Python?

I have a 20 GB .ndjson file that I want to open with Python. The file is too big, so I split it into 50 pieces with an online tool: https://pinetools.com/split-files
Now I have a file with the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as JSON or as a CSV file to read it in pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched ''"' when when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that I have some emojis in my file, so I do not know how to encode them?
ndjson is now supported out of the box with the argument lines=True:
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
# lines=True is only valid together with orient='records' when writing
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
I think pandas.read_json cannot handle ndjson correctly.
According to this issue, you can do something like this to read it:
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S: All credits for this code go to KristianHolsheimer from the Github Issue
ndjson (newline-delimited JSON) is a JSON Lines format, that is, each line is a separate JSON document. It is ideal for a dataset lacking rigid structure ('non-sql') where the file size is large enough to warrant multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your json strings do not contain newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
    data = [json.loads(l) for l in f.readlines()]
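The "Unterminated string" errors in the question are consistent with a splitter that cuts the file at byte offsets, which would leave the last line of each piece incomplete; that is an inference, not something confirmed here. If that is the cause, one option is to skip the external splitter and let pandas read the original file in chunks of whole lines. The sketch below builds a tiny temporary file purely for demonstration; the function name count_rows_in_chunks is hypothetical.

```python
import json
import os
import tempfile
import pandas as pd

def count_rows_in_chunks(path, chunksize=100_000):
    """Stream an ndjson file in chunks of `chunksize` lines and count the rows."""
    total = 0
    # With chunksize set, read_json returns an iterator of DataFrames
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize, encoding="utf-8"):
        total += len(chunk)  # replace with real per-chunk processing
    return total

# Tiny demo file (made-up records, including an emoji) in place of the 20 GB dump
demo_path = os.path.join(tempfile.mkdtemp(), "demo.ndjson")
with open(demo_path, "w", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"id": i, "text": "hello \U0001F600"}) + "\n")

total_rows = count_rows_in_chunks(demo_path, chunksize=2)
print(total_rows)
```

Because every chunk ends on a line boundary, no record is cut in half, and only one chunk is held in memory at a time.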

Python pandas create dataframe from csv embedded within a web txt file

I am trying to import CSV formatted data into a Pandas dataframe. The CSV data is located within a .txt file that is located at a web URL. The issue is that I only want to import a part (or parts) of the .txt file that is formatted as CSV (see image below). Essentially I need to skip the first 9 rows and then import rows 10-16 as CSV.
My code
import csv
import pandas as pd
import io
url = "http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt"
df = pd.read_csv(io.StringIO(url), skiprows = 9, sep =',', skipinitialspace = True)
df
I get a lengthy error msg that ultimately says "EmptyDataError: No columns to parse from file"
I have looked at similar examples Read .txt file with Python Pandas - strings and floats but this is different.
The code above attempts to read a CSV file from the URL itself rather than the text file fetched from that URL. To see what I mean take out the skiprows parameter and then show the data frame. You'll see this:
Empty DataFrame
Columns: [http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt]
Index: []
Note that the columns are the URL itself.
Import requests (you may have to install it first), download the file contents, and pass those to read_csv instead:
import requests
content = requests.get(url).content
df = pd.read_csv(io.StringIO(content.decode('utf-8')), skiprows=9)
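To stop after rows 10-16 rather than reading to the end of the file, skiprows can be combined with nrows. The sketch below runs on made-up text with the same layout instead of the downloaded file, and it assumes that row 10 is the header line, so nrows=6 covers rows 11-16.

```python
import io
import pandas as pd

# Made-up text mimicking the layout: 9 preamble lines, a 7-line CSV block, a footer
preamble = "\n".join(f"description line {i}" for i in range(1, 10))
csv_block = "\n".join([
    "month, calm, north, south",
    "Jan, 1, 2, 3",
    "Feb, 4, 5, 6",
    "Mar, 7, 8, 9",
    "Apr, 1, 3, 5",
    "May, 2, 4, 6",
    "Jun, 3, 6, 9",
])
text = preamble + "\n" + csv_block + "\n" + "footer text that is not CSV\n"

# skiprows drops the preamble; nrows stops after six data rows below the header
df = pd.read_csv(io.StringIO(text), skiprows=9, nrows=6, skipinitialspace=True)
print(df)
```

With the real file, io.StringIO(text) would be replaced by the decoded content fetched with requests.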

Unable to parse string quoted csv data using pandas

I am trying to parse this CSV data, which has quotes in an unusual pattern and a semicolon at the end of each row.
I am not able to parse this file correctly using pandas.
Here is the link to the data (for some reason the pastebin did not recognize it as text/CSV and picked some random formatting; please ignore that):
https://paste.gnome.org/pr1pmw4w2
I have tried using "," as the delimiter, and also a plain call that constructs the pandas dataframe with only the file name as the parameter.
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status,Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)
First, read the data normally; all of the data will end up in the first column. You can then use the pyparsing module to split it on ',' and assign the result back. The code below does this for the first row; you need to repeat it for all the rows.
import pyparsing as pp
import pandas as pd
df = pd.read_csv('input.csv')
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()
Output
df #(since there are 42 columns, pasting just a snippet)
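If installing pyparsing is not an option, the standard csv module can do a similar split. The sketch below swaps in csv.reader for pyparsing and rests on an assumption about the file: that each raw line is one quoted field wrapping the whole row, with inner quotes doubled and a trailing semicolon. The helper name split_wrapped_row and the sample line are made up; check them against the real data before relying on this.

```python
import csv

def split_wrapped_row(raw_line):
    """Split a row that is wrapped in one pair of quotes and ends with ';'."""
    # Drop surrounding whitespace and the trailing semicolon
    line = raw_line.strip().rstrip(";")
    # First pass removes the outer quoting and un-doubles the inner quotes
    fields = next(csv.reader([line]))
    # If everything collapsed into one field, parse that field as a CSV row
    if len(fields) == 1:
        fields = next(csv.reader([fields[0]]))
    return fields

# Made-up line following the assumed pattern
raw = '"Acme,""acme.com"",""Software, SaaS""";'
row = split_wrapped_row(raw)
print(row)
```

Applying the helper to every line of the file and passing the resulting lists to pd.DataFrame with the header list as columns would then give one column per field.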
