I am trying to import a (very) large JSON file (3.3M rows, 1K columns) that has multiple nested JSON objects within it, some of which are double-nested. I have found two ways to import the JSON file into a dataframe; however, I can't get the imported JSON to be flattened and converted to strings at the same time.
The two approaches I am using are:
# 1: Import directly and convert to string
import pandas as pd

def Data_IMP(path):
    with open(path) as Data:
        Data_IMP = pd.read_json(Data, dtype=str)
        Data_IMP = Data_IMP.replace("nan", "", regex=True)
    return Data_IMP
The issue with the above is that it doesn't flatten the json file fully.
# 2: Import json and normalise
import json
from pandas import json_normalize  # on older pandas: from pandas.io.json import json_normalize

def Data_IMP(path):
    with open(path) as Data:
        d = json.load(Data)
        Data_IMP = json_normalize(d)
    return Data_IMP
The above script flattens out the JSON, but lets pandas decide on the dtype for each column.
Is there a way to combine these approaches, so that the JSON file is flattened and all columns are read as strings?
I found a solution that worked, and was able to both import and flatten the JSONs, as well as convert all columns to strings.
# Function to import data from ARIC json file to dataframe
import json

import numpy as np
import pandas as pd
from pandas import json_normalize  # on older pandas: from pandas.io.json import json_normalize

def Data_IMP(path):
    with open(path) as Data:
        d = json.load(Data)
        Data_IMP = json_normalize(d)
    return Data_IMP

# --------------------------------------------------------------------------------------------------------- #
# Function to cleanse Data file
def Data_Cleanse(Data_IMP):
    Data_Cleanse = Data_IMP.replace(np.nan, '', regex=True)
    Data_Cleanse = Data_Cleanse.astype(str)
    return Data_Cleanse
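A minimal usage sketch chaining the two functions (the file name here is just a placeholder):
# Hypothetical path -- substitute the real ARIC json file
df = Data_Cleanse(Data_IMP("aric_data.json"))
print(df.dtypes)  # every column should now be object (i.e. string)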
Related
I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the json file and make a dataframe such as:
  Sentence  Label
0    Text1      4
1    Text2      1
2    TextN    123
Each key-value pair should become a row in the dataframe, and I need the headers "Sentence" and "Label". I tried using lines = True, but it returns all the key-value pairs in one row.
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?
You can use:
import json

import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'Sentence', 0: 'Label'})
Easy way that I remember
import pandas as pd
import json
with open("./data.json", "r") as f:
data = json.load(f)
df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
.T\
.reset_index()\
.rename(columns={"index": "Sentence", 0: "Labels"})
Explanation
A little dirty, but as you probably noticed, lines=True isn't sufficient on its own. The above transposes the result so that you have:
(index)     0
Text1       4
Text2       1
TextN     123
Resetting the index then moves the old index over to a column named "index", and renaming the columns gives the headers you want.
I have a log file where every line is a log record such as:
{"log":{"identifier": "x", "message": {"key" : "value"}}}
What I'd like to do is convert this JSON collection to a single DataFrame for analysis.
Example
identifier | key
------------|-------------
x | value
Up till now, I have done the following
with open("../data/cleaned_logs_xs.json", 'r') as logfile:
for line in logfile:
jsonified = json.loads(line)
log = jsonified["log"]
df = pd.io.json.json_normalize(log)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
I read the file line by line, convert every single record to a DataFrame, and append it to a parent DataFrame. At the end of this loop, the parent holds the final DataFrame I need.
Now I know this is extremely hack-y and inefficient. What would be the best way to go about this?
I don't know exactly if this is what you want, but there is something like this:
import json
from pandas.io.json import json_normalize
my_json = '{"log": {"identifier": "x", "message": {"key": "value"}}}'
data = json.loads(my_json)
data = json_normalize(data)
print(data)
Output:
log.identifier log.message.key
0 x value
In your case just read the json file.
At this point, I've removed the constant appending to the parent dataframe. I append each decoded log message to a list in a loop and, at the end, convert the list of records to a dataframe:
log_messages = list()

with open("../data/cleaned_logs_xs.json", 'r') as logfile:
    for line in logfile:
        jsonified = json.loads(line)
        log = jsonified["log"]
        log_messages.append(log)

log_df = pd.DataFrame.from_records(log_messages)
Can this be optimized further?
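One further option (just a sketch, assuming a pandas version that provides pd.json_normalize) is to collect the raw records and normalize them in a single call, which also flattens the nested "message" dict:
import json

import pandas as pd

log_messages = []
with open("../data/cleaned_logs_xs.json", 'r') as logfile:
    for line in logfile:
        log_messages.append(json.loads(line)["log"])

# One normalize call over the whole list; nested keys become dotted column names
log_df = pd.json_normalize(log_messages)
# Keep only the last part of each dotted column name, as in the original code
log_df.columns = log_df.columns.map(lambda c: c.split(".")[-1])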
I have 4-10 XML files in a folder; these files are broken down from a single large XML file. Luckily, parsing the XML was easy because I could use the xmltodict package, so I can already do whatever I need to with a single XML file. I converted it into a pandas dataframe for my analysis requirement. However, I need to combine 4 XML files into one pandas dataframe. Assume there is no data/index issue; the files are correctly named 00001.xml, 00002.xml, 00003.xml, 00004.xml, in order.
import xmltodict
import numpy as np
import pandas as pd
from collections import Counter
with open('00001.xml') as fd:
    doc = xmltodict.parse(fd.read())

def panda_maker(xml_dict):
    channel_list = xml_dict['logs']['log']['logData']['mnemonicList'].split(",")
    logData_list = [i.split(",") for i in xml_dict['logs']['log']['logData']['data']]
    logData_list.insert(0, xml_dict['logs']['log']['logData']['unitList'].split(","))
    return pd.DataFrame(np.array(logData_list).reshape(len(logData_list), len(channel_list)), columns=channel_list)

logData_frame_01 = panda_maker(doc)
logData_frame_01.head()  # all good
How can I neatly combine logData_frame_01 + _02 + _03 + _04 to one dataframe?
Any further abstraction tips for the above program are also highly welcome.
Try:
doc = []
for i in range(1, 5):
    with open('0000{}.xml'.format(i)) as fd:
        doc.append(xmltodict.parse(fd.read()))

def panda_maker(xml_dict):
    logData_list = []
    for xmlval in xml_dict:
        channel_list = xmlval['logs']['log']['logData']['mnemonicList'].split(",")
        temp = [i.split(",") for i in xmlval['logs']['log']['logData']['data']]
        temp.insert(0, xmlval['logs']['log']['logData']['unitList'].split(","))
        logData_list.extend(temp)
    return pd.DataFrame(np.array(logData_list).reshape(len(logData_list), len(channel_list)), columns=channel_list)

logData_frame_01 = panda_maker(doc)
logData_frame_01.head()  # all good
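An alternative sketch (assuming each file parses on its own, and reusing the single-file panda_maker from the question) is to build one dataframe per file and concatenate them:
frames = []
for i in range(1, 5):
    with open('0000{}.xml'.format(i)) as fd:
        # panda_maker here is the original single-file version from the question
        frames.append(panda_maker(xmltodict.parse(fd.read())))

combined = pd.concat(frames, ignore_index=True)
combined.head()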
I have a big CSV file (approx. 1 GB) that I want to convert to a JSON file in the following way:
the csv file has the following structure:
header: tid;inkey;outkey;value
values:
tid1;inkey1;outkey1;value1
tid1;inkey2;outkey2;value2
tid2;inkey2;outkey3;value2
tid2;inkey4;outkey3;value2
etc.
The idea is to convert this CSV to a JSON with the following structure, basically grouping everything by "tid":
{
"tid1": {
"inkeys":["inkey1", "inkey2"],
"outkeys":["outkey1", "outkey2"]
}
}
I can imagine how to do it with normal Python dicts and lists, but my problem is also the huge amount of data that I have to process. I suppose pandas can help here, but I am still very confused by this tool.
I think this should be straightforward to do with standard Python data structures such as defaultdict. Unless you have very limited memory, I see no reason why a 1 GB file would be problematic using a straightforward approach.
Something like (did not test):
from collections import defaultdict
import csv
import json

out_data = defaultdict(lambda: {"inkeys": [], "outkeys": [], "values": []})

with open("your-file.csv") as f:
    reader = csv.reader(f, delimiter=";")
    next(reader, None)  # skip the header row
    for line in reader:
        tid, inkey, outkey, value = line
        out_data[tid]["inkeys"].append(inkey)
        out_data[tid]["outkeys"].append(outkey)
        out_data[tid]["values"].append(value)

print(json.dumps(out_data))
There might be a faster or more memory efficient way to do it with Pandas or others, but simplicity and zero dependencies go a long way.
First you need to use pandas and read your CSV into a dataframe. Say the CSV is saved in a file called my_file.csv; then you call:
import pandas as pd

my_df = pd.read_csv('my_file.csv', sep=';')
Then you need to convert this dataframe to the form that you specified. The following call will convert it to a dict with the specified structure
my_json = dict(my_df.set_index('tid').groupby(level=0).apply(lambda x: x.to_json(orient='records')))
Now you can export it to a json file if you want
import json
with open('my_json.json', 'w') as outfile:
json.dump(my_json, outfile)
You can use Pandas with groupby and a dictionary comprehension:
from io import StringIO
import pandas as pd
mystr = StringIO("""tid1;inkey1;outkey1;value1
tid1;inkey2;outkey2;value2
tid2;inkey2;outkey3;value2
tid2;inkey4;outkey3;value2""")
# replace mystr with 'file.csv'; with fewer names than columns, the leading column (tid) becomes the index
df = pd.read_csv(mystr, sep=';', header=None, names=['inkeys', 'outkeys', 'values'])
# group by index
grouper = df.groupby(level=0)
# nested dictionary comprehension with selected columns
res = {k: {col: v[col].tolist() for col in ('inkeys', 'outkeys')} for k, v in grouper}
print(res)
{'tid1': {'inkeys': ['inkey1', 'inkey2'], 'outkeys': ['outkey1', 'outkey2']},
 'tid2': {'inkeys': ['inkey2', 'inkey4'], 'outkeys': ['outkey3', 'outkey3']}}
Similar to the other defaultdict() answer:
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

with open('file.txt') as in_file:
    next(in_file, None)  # skip the header line
    for line in in_file:
        tid, inkey, outkey, value = line.strip().split(';')
        d[tid]['inkeys'].append(inkey)
        d[tid]['outkeys'].append(outkey)
        d[tid]['values'].append(value)
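To finish the conversion, the nested defaultdict can be written straight to a JSON file, since defaultdict is a dict subclass (the output file name below is just a placeholder):
import json

with open('grouped.json', 'w') as out_file:
    json.dump(d, out_file, indent=2)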
I have a CSV file that has a field/column containing a comma (","). I load this CSV into MongoDB for data manipulation. I would like to strip all text from the comma to the right, leaving only the text to the left of the comma.
What is the most efficient way to accomplish this task: in my MongoDB CSV import script (which uses pandas), or afterward, once the data is already in MongoDB? Honestly, I'm new to programming and would like to know how to do it in either scenario, but I would like to see a solution for whichever is most efficient.
Here's my csv to python import script:
#!/usr/bin/env python
import sys
import os
import pandas as pd
import pymongo
import json
def import_content(filepath):
    mng_client = pymongo.MongoClient('localhost', 27017)
    mng_db = mng_client['swx_inv']
    collection_name = 'device.switch'
    db_cm = mng_db[collection_name]
    cdir = os.path.dirname(__file__)
    file_res = os.path.join(cdir, filepath)

    data = pd.read_csv(file_res, skiprows=2, skip_footer=1)
    data_json = json.loads(data.to_json(orient='records'))
    db_cm.remove()
    db_cm.insert(data_json)

if __name__ == "__main__":
    filepath = '/vagrant/data/DeviceInventory-Category.Switch.csv'
    import_content(filepath)
Here are the top three rows of the CSV for reference. I'm trying to alter the last field, "OS Image":
Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,"IOS 12.2(33)SXI9, s72033_rp-ADVENTERPRISEK9_WAN-M"
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,"IOS 12.2(53)SG8, cat4500-IPBASEK9-M"
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,"IOS 12.2(55)SE5, C3560-IPBASEK9-M"
EDIT: I found a method to do what I needed via pandas, prior to uploading to the MongoDB collection. I have to do this twice, as the same column's data uses two different delimiters and a regex would not work properly:
# Use pandas to read CSV, skipping top 2 lines & footer line from
# CSV export. Set column data to string type.
data = pd.read_csv(
    file_res, index_col=False, skiprows=2,
    skip_footer=1, dtype={'Device': str, 'Serial Number': str,
                          'Realm': str, 'Vendor': str, 'Model': str,
                          'OS Image': str}
)
# Drop rows where Serial Number is empty
data = data.dropna(subset=['Serial Number'])
# Split the OS Image column by "," and ";" to remove extraneous data
data['OS Image'].update(data['OS Image'].apply(
    lambda x: x.split(",")[0] if len(x.split()) > 1 else None)
)
data['OS Image'].update(data['OS Image'].apply(
    lambda x: x.split(";")[0] if len(x.split()) > 1 else None)
)
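As an aside, the two passes could likely be collapsed into one with the standard library re module, which splits on either delimiter in a single call (a sketch, assuming every 'OS Image' value is a string after the dtype settings above):
import re

# Split on the first "," or ";" and keep only the left-hand part
data['OS Image'] = data['OS Image'].apply(
    lambda x: re.split(r'[,;]', x, maxsplit=1)[0].strip()
)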
import csv
s='''Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,"IOS 12.2(33)SXI9, s72033_rp-ADVENTERPRISEK9_WAN-M"
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,"IOS 12.2(53)SG8, cat4500-IPBASEK9-M"
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,"IOS 12.2(55)SE5, C3560-IPBASEK9-M"'''
print("\n".join([','.join(row[:5])+","+str(row[5].split(",")[0]) for row in csv.reader(s.split("\n"))]))
Converting the list comprehension into a loop for more readability:
newtext=""
for row in csv.reader(s.split("\n")):
newtext+=','.join(row[:5])+","+str(row[5].split(",")[0])+"\n"
print(newtext)
Output:
Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,IOS 12.2(33)SXI9
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,IOS 12.2(53)SG8
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,IOS 12.2(55)SE5
https://ideone.com/FMJCrO
For a file you will have to use:
with open(fname) as f:
    content = f.readlines()
content will contain a list of the lines in the file, and then you can use csv.reader(content).
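Putting the pieces together, a minimal end-to-end sketch (the input/output file names below are placeholders, and the file is assumed to have the same six-column layout as the sample above):
import csv

with open('DeviceInventory-Category.Switch.csv') as f:
    content = f.readlines()

with open('DeviceInventory-trimmed.csv', 'w') as out:
    for row in csv.reader(content):
        if len(row) < 6:  # skip blank or malformed lines
            continue
        # keep only the part of the last field before the first comma
        out.write(','.join(row[:5]) + "," + row[5].split(",")[0] + "\n")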