Convert specific columns to a list and then create JSON - Python

I have a spreadsheet with multiple "tags" columns: "tags_0", "tags_1", "tags_2", and there can be more.
I'm trying to find all the "tags" columns and put their values into a list using a pandas DataFrame, and eventually write them as a "tags" array in a JSON file.
I thought of using a regex, but I can't find a way to apply it.
This is the function I'm using to output the json file. I added the tags array for reference:
import json

import pandas as pd

def convert_products():
    read_exc = pd.read_excel('./data/products.xlsx')
    df = pd.DataFrame(read_exc)
    all_data = []
    for i in range(len(df)):
        js = {
            "sku": df['sku'][i],
            "brand": df['brand'][i],
            "tags": [?]  # <- how do I build this list?
        }
        all_data.append(js)
    json_object = json.dumps(all_data, ensure_ascii=False, indent=2)
    with open("./data/products.json", "w", encoding='utf-8') as outfile:
        outfile.write(json_object)
How can I achieve this?
Thanks

You can achieve that in a much easier way by doing something like this...
df = pd.read_excel('your_file.xlsx')
# collect every column whose name starts with "tags_"
tags_columns = [col for col in df.columns if col.startswith("tags_")]
# combine those columns row-wise into a single list column
df["tags"] = df[tags_columns].values.tolist()
df[["sku", "brand", "tags"]].to_json("test.json", orient="records")
You can try other JSON orientations if you want: ["index", "columns", "split", "records", "values", "table"]. Check them in the pandas documentation.
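One caveat (my note, not part of the original answer): if some rows have fewer tags than others, the unused tags_* cells are read as NaN and end up inside the lists. A minimal sketch to strip them:
import pandas as pd

df = pd.read_excel('your_file.xlsx')
tags_columns = [col for col in df.columns if col.startswith("tags_")]
# pd.notna() filters out the NaN placeholders left by empty cells
df["tags"] = [[t for t in row if pd.notna(t)]
              for row in df[tags_columns].values.tolist()]
df[["sku", "brand", "tags"]].to_json("test.json", orient="records")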

First, you can get all the columns as a list:
list(df.columns.values)
Now search this list for the column names that contain tags_. Once you have the tag column names, loop through them for each row, collect that row's tag values in a list, and pass the list into the JSON object.
tagColumnList = [col for col in df.columns if 'tags_' in col]
for i in range(len(df)):  # for each row in the dataframe
    tagList = []
    for tagColumn in tagColumnList:
        tagList.append(df[tagColumn][i])
    # ... your code for creating the json object ...
    # pass tagList as the value of the "tags" key in the json object
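For completeness, here is a sketch of the asker's convert_products wired up this way (it reuses the asker's paths and keys; the pd.notna check to skip blank cells is my addition):
import json

import pandas as pd

def convert_products():
    df = pd.read_excel('./data/products.xlsx')
    tag_columns = [col for col in df.columns if col.startswith('tags_')]
    all_data = []
    for i in range(len(df)):
        all_data.append({
            "sku": df['sku'][i],
            "brand": df['brand'][i],
            # collect this row's tags, skipping empty cells
            "tags": [df[col][i] for col in tag_columns if pd.notna(df[col][i])],
        })
    with open("./data/products.json", "w", encoding='utf-8') as outfile:
        outfile.write(json.dumps(all_data, ensure_ascii=False, indent=2))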

You are probably looking for filter:
out = pd.concat([df[['sku', 'brand']],
                 df.filter(regex='^tags_').agg(list, axis=1).rename('tags')],
                axis=1).to_json(orient='records', indent=2)
print(out)
# Output
[
  {
    "sku":"ADX112",
    "brand":"ADX",
    "tags":[
      "art",
      "frame",
      "painting"
    ]
  }
]
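A side note (my aside, not from the answer): filter also accepts like= for plain substring matching, which works here as long as no other column names contain tags_:
df.filter(like='tags_')  # same columns as regex='^tags_' in this spreadsheet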

Related

How to read json data into a dataframe using pandas

I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the json file and make a dataframe where each key-value pair becomes a row, and I need the headers "Sentence" and "Label". I tried using lines=True but it returns all the key-value pairs in one row.
data_df = pd.read_json(PATH_TO_DATA, lines=True)
What is the correct way to load such json data?
You can use:
import json

import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)
df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(
    columns={'index': 'Sentence', 0: 'Label'})
An easy way that I remember:
import pandas as pd
import json

with open("./data.json", "r") as f:
    data = json.load(f)
df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
    .T\
    .reset_index()\
    .rename(columns={"index": "Sentence", 0: "Labels"})
Explanation
A little dirty, but as you probably noticed, lines=True isn't sufficient on its own, so the above transposes the result so that you have:
         0
Text1    4
Text2    1
TextN  123
Resetting the index then moves the old index into a column named "index", and the rename gives the columns their final names.

Taking python output to a pandas dataframe

I'm trying to take the output from this code into a pandas dataframe. I'm only trying to pull the first part of the output: the stock symbol, company name, field3, and field4. The output has a lot of other data I'm not interested in, but it's giving me everything. Could someone help me put this into a dataframe if possible?
The current output is in this format
["ABBV","AbbVie","_DRUGM","S&P 100, S&P 500"],["ABC","AmerisourceBergen","_MEDID","S&P 500"],
Desired Output (screenshot omitted; essentially the four fields above as dataframe columns)
Full Code
import pandas as pd
import requests

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
payload = {}
headers = {}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Use a dictionary to store the data from your tuple of lists, then create a DataFrame based on that dictionary. In my solution below, I omit the 'ID' field because the index of the DataFrame serves the same purpose.
import pandas as pd

# Store the data you're getting from requests
data = ["ABBV","AbbVie","_DRUGM","S&P 100, S&P 500"],["ABC","AmerisourceBergen","_MEDID","S&P 500"]

# Create an empty dictionary with relevant keys
dic = {
    "Ticker": [],
    "Name": [],
    "Field3": [],
    "Field4": []
}

# Append data to the dictionary for every list in your `response`
for pos, lst in enumerate(data):
    dic['Ticker'].append(lst[0])
    dic['Name'].append(lst[1])
    dic['Field3'].append(lst[2])
    dic['Field4'].append(lst[3])

# Create a DataFrame from the dictionary above
df = pd.DataFrame(dic)
The resulting DataFrame looks like this:
  Ticker               Name  Field3            Field4
0   ABBV             AbbVie  _DRUGM  S&P 100, S&P 500
1    ABC  AmerisourceBergen  _MEDID           S&P 500
Edit: A More Efficient Approach
In my solution above, I manually called the list form of each key in the dic dictionary. Using zip, we can streamline the process so that it works for a response of any length and for any changes you make to the dictionary's labels.
The only caveat to this method is that you have to make sure the order of keys in the dictionary lines up with the data in each list in your response. For example, if Ticker is the first dictionary key, the ticker must be the first item in the list resulted from your response. This was true for the first solution, too, however.
new_dic = {
    "Ticker": [],
    "Name": [],
    "Field3": [],
    "Field4": []
}

for pos, lst in enumerate(data):               # iterate position and list
    for key, item in zip(new_dic, data[pos]):  # iterate key and item in list
        new_dic[key].append(item)              # append to each key the item in list

df = pd.DataFrame(new_dic)
The result is identical to the method above.
Edit (even better!)
I'm coming back to this after learning from a commenter that pd.DataFrame() can take two-dimensional array data directly and output a DataFrame. This streamlines the entire process several times over:
import pandas as pd
# Store the data you're getting from requests
data = ["ABBV","AbbVie","_DRUGM","S&P 100, S&P 500"],["ABC","AmerisourceBergen","_MEDID","S&P 500"]
# Define columns
columns = ['ticker', 'name', 'field3', 'field4']
df = pd.DataFrame(data, columns=columns)
The result is the same as in the first two approaches.
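None of the above shows how to get from response.text to those lists, so here is a rough sketch (my assumption: the JS payload embeds each row as a JSON-style array of quoted strings; adjust the regex to the real response):
import json
import re

import pandas as pd
import requests

response = requests.get("https://www.stockrover.com/build/production/Research/tail.js?1644930560")
# Pull out every bracketed group of quoted strings, e.g. ["ABBV","AbbVie",...]
candidates = re.findall(r'\[(?:"[^"]*",)*"[^"]*"\]', response.text)
rows = [json.loads(c) for c in candidates]
# Keep only the four-field rows this question cares about
rows = [r for r in rows if len(r) == 4]
df = pd.DataFrame(rows, columns=['ticker', 'name', 'field3', 'field4'])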

Write Python3/Pandas dataframe to JSON with orient=records but without the array when there is only one record

I'm writing a very small pandas DataFrame to a JSON file. In fact, the DataFrame has only one row with two columns.
To build the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(dict({'date': '2020-10-05', 'ppm': 411.1}), orient='index').T
print(df)
prints
         date    ppm
0  2020-10-05  411.1
The desired json output is as follows:
{
  "date": "2020-10-05",
  "ppm": 411.1
}
but when writing the json with pandas, I can only print it as an array with one element, like so:
[
  {
    "date":"2020-10-05",
    "ppm":411.1
  }
]
I've currently hacked my code to convert the Dataframe to a dict, and then use the json module to write the file.
import json

data = df.to_dict(orient='records')
data = data[0]  # keep the only element
with open('data.json', 'w') as fp:
    json.dump(data, fp, indent=2)
Is there a native way with pandas' .to_json() to keep the only dictionary item if there is only one?
I am currently using .to_json() like this, which incorrectly prints the array with one dictionary item.
df.to_json('data.json', orient='records', indent=2)
Python 3.8.6
Pandas 1.1.3
If you want to export only one row, use iloc:
print(df.iloc[0].to_dict())
# {'date': '2020-10-05', 'ppm': 411.1}
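If the goal is the file rather than the printed dict, here is a small sketch combining this with the asker's json.dump approach (the file name is the asker's):
import json

import pandas as pd

df = pd.DataFrame.from_dict({'date': '2020-10-05', 'ppm': 411.1}, orient='index').T
with open('data.json', 'w') as fp:
    json.dump(df.iloc[0].to_dict(), fp, indent=2)  # writes a plain object, no wrapping array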

How to return a specific data structure with inner dictionary of lists

I have a CSV file (image attached) whose rows have the format "{method},{number},{orbital_period},{mass},{distance},{year}", and I need to turn it into a dictionary of lists.
So far I have this code:
import csv

with open('exoplanets.csv') as inputfile:
    reader = csv.reader(inputfile)
    inputm = list(reader)
print(inputm)
but my output is coming out like ['Radial Velocity', '1', '269.3', '7.1', '77.4', '2006'],
when I want it to look like:
"Radial Velocity": {
    "number": [1, 1, 1],
    "orbital_period": [269.3, 874.774, 763.0],
    "mass": [7.1, 2.21, 2.6],
    "distance": [77.4, 56.95, 19.84],
    "year": [2006.0, 2008.0, 2011.0]
},
"Transit": {
    "number": [1, 1, 1],
    "orbital_period": [1.5089557, 1.7429935, 4.2568],
    "mass": [],
    "distance": [200.0, 680.0],
    "year": [2008.0, 2008.0, 2008.0]
}
Any ideas on how I can alter my code?
Hey SKR01, welcome to Stack Overflow!
I would suggest working with the pandas library. It is meant for table-like content like you have there. What you are looking for is then a groupby on your #method column.
import pandas as pd

def remove_index(row):
    d = row._asdict()
    del d["Index"]
    return d

df = pd.read_csv("https://docs.google.com/uc?export=download&id=1PnQzoefx-IiB3D5BKVOrcawoVFLIPVXQ")
{row.Index: remove_index(row) for row in df.groupby('#method').aggregate(list).itertuples()}
The only thing that remains is removing the NaN values from the resulting dict.
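A sketch of that cleanup step (my addition, reusing the df and remove_index defined above):
result = {row.Index: remove_index(row)
          for row in df.groupby('#method').aggregate(list).itertuples()}
# Strip the NaN placeholders from every inner list
cleaned = {method: {field: [v for v in values if pd.notna(v)]
                    for field, values in fields.items()}
           for method, fields in result.items()}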
If you don't want to use Pandas, maybe something like this is what you're looking for:
import csv

with open('exoplanets.csv') as inputfile:
    reader = csv.reader(inputfile)
    inputm = list(reader)

header = inputm.pop(0)
del header[0]  # probably you don't want "#method"

# create and populate the final dictionary
data = {}
for row in inputm:
    if row[0] not in data:
        data[row[0]] = {h: [] for h in header}
    for i, h in enumerate(header):
        data[row[0]][h].append(row[i+1])
print(data)
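One caveat worth flagging (my note, not in the original answer): csv.reader yields strings, while the desired output shows numbers, so you may want a conversion pass afterwards. A minimal sketch, coercing everything to float and dropping blanks:
for method, fields in data.items():
    for h, values in fields.items():
        fields[h] = [float(v) for v in values if v != '']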
This is a bit complex, and I'm questioning why you want the data this way, but this should get you the output format you want without requiring any external libraries like Pandas.
import csv

with open('exoplanets.csv') as input_file:
    rows = list(csv.DictReader(input_file))

# Create the data structure
methods = {d["#method"]: {} for d in rows}

# Get a list of fields, trimming off the method column
fields = list(rows[1])[1:]

# Fill in the data structure
for method in methods:
    methods[method] = {
        # Null-trimmed version of listcomp:
        # f: [r[f] for r in rows if r["#method"] == method and r[f]]
        f: [r[f] for r in rows if r["#method"] == method]
        for f
        in fields
    }
Note: This could be one multi-tiered list/dict comprehension, but I've broken it apart for clarity.
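For the curious, the collapsed single-comprehension form might look like this (a sketch, using the null-trimmed variant from the comment above):
methods = {
    m: {f: [r[f] for r in rows if r["#method"] == m and r[f]] for f in fields}
    for m in {d["#method"] for d in rows}
}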

how to make json.loads() read a json string with the column names as the first element

I am serializing a datatable from an HTTP GET, and for performance reasons I would prefer to serialize it in a Names/Values structure, so that the first element contains the column names. Can json.loads deal with this? If not, is there another JSON parser that will?
{
  "Names": ["summaryDate", "count"],
  "Values": [["2020-01-15T00:00:00", 10], ["2020-01-16T00:00:00", 12], ["2020-01-17T00:00:00", 16]]
}
(this reduces the size to about 20% of a standard JSON stream, where the field names are repeated for each 'row')
I did some digging and found ijson.
It lets you iterate over a json file and access its objects.
You can build your data structure like this (I was lazy and used pandas):
import ijson
import pandas as pd

# ijson consumes the stream as it iterates, so open the file twice
f = open("testjson.txt", 'r')
f2 = open("testjson.txt", 'r')

names = ijson.items(f, 'Names.item')
values = ijson.items(f2, 'Values.item')
pd.DataFrame(values, columns=list(names))
You can load it into a dictionary with json, and then apply this comprehension to obtain the dictionary you want:
dic = {
    "Names": ["summaryDate", "count"],
    "Values": [["2020-01-15T00:00:00", 10], ["2020-01-16T00:00:00", 12], ["2020-01-17T00:00:00", 16]]
}
d = [{'SummaryDate': summaryDate, 'count': count} for summaryDate, count in dic['Values']]
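For what it's worth (my addition): json.loads handles this structure as-is, since it is ordinary JSON; the Names/Values split is just a convention you unpack yourself, e.g. straight into pandas:
import json

import pandas as pd

raw = '{"Names": ["summaryDate", "count"], "Values": [["2020-01-15T00:00:00", 10], ["2020-01-16T00:00:00", 12]]}'
data = json.loads(raw)  # plain json.loads parses it fine
df = pd.DataFrame(data["Values"], columns=data["Names"])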
