Taking python output to a pandas dataframe - python

I'm trying to take the output from this code into a pandas DataFrame. I'm really only trying to pull the first part of the output, which is the stock symbol, company name, field3, and field4. The output has a lot of other data I'm not interested in, but it's giving me everything. Could someone help me put this into a DataFrame, if possible?
The current output is in this format
["ABBV","AbbVie","_DRUGM","S&P 100, S&P 500"],["ABC","AmerisourceBergen","_MEDID","S&P 500"],
Desired output: a DataFrame with one row per stock and columns for the ticker, name, field3, and field4.
Full Code
import requests
import pandas as pd

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"

payload = {}
headers = {}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)

Use a dictionary to store the data from your tuple of lists, then create a DataFrame based on that dictionary. In my solution below, I omit the 'ID' field because the index of the DataFrame serves the same purpose.
import pandas as pd
# Store the data you're getting from requests
data = ["ABBV","AbbVie","_DRUGM","S&P 100, S&P 500"],["ABC","AmerisourceBergen","_MEDID","S&P 500"]
# Create an empty dictionary with relevant keys
dic = {
    "Ticker": [],
    "Name": [],
    "Field3": [],
    "Field4": []
}
# Append data to the dictionary for every list in your `response`
for lst in data:
    dic['Ticker'].append(lst[0])
    dic['Name'].append(lst[1])
    dic['Field3'].append(lst[2])
    dic['Field4'].append(lst[3])
# Create a DataFrame from the dictionary above
df = pd.DataFrame(dic)
The resulting DataFrame has one row per list in data, with the four columns defined above.
Edit: A More Efficient Approach
In my solution above, I manually appended to each key's list in the dic dictionary. Using zip, we can streamline the process so that it works for a response of any length and for any changes you make to the dictionary's labels.
The only caveat to this method is that you have to make sure the order of keys in the dictionary lines up with the data in each list in your response. For example, if Ticker is the first dictionary key, the ticker must be the first item in each list from your response. This was true for the first solution, too, however.
new_dic = {
    "Ticker": [],
    "Name": [],
    "Field3": [],
    "Field4": []
}
for lst in data:                         # iterate over each list in the response
    for key, item in zip(new_dic, lst):  # pair each dictionary key with the matching item
        new_dic[key].append(item)        # append the item under its key
df = pd.DataFrame(new_dic)
The result is identical to the method above.
Edit (even better!)
I'm coming back to this after learning from a commenter that pd.DataFrame() accepts two-dimensional array data directly and outputs a DataFrame. This streamlines the entire process:
import pandas as pd
# Store the data you're getting from requests
data = ["ABBV","AbbVie","_DRUGM","S&P 100, S&P 500"],["ABC","AmerisourceBergen","_MEDID","S&P 500"]
# Define columns
columns = ['ticker', 'name', 'field3', 'field4']
df = pd.DataFrame(data, columns=columns)
The result is the same as in the first two approaches.
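Since the raw response.text is JavaScript rather than clean JSON, the bracketed lists have to be carved out before any of the approaches above can run. A minimal sketch, assuming the relevant part of the payload is a run of comma-separated JSON arrays like the question's sample (the raw string here is a stand-in for the real response):

```python
import json

import pandas as pd

# Stand-in for the fragment of response.text that holds the ticker lists.
raw = '["ABBV","AbbVie","_DRUGM","S&P 100, S&P 500"],["ABC","AmerisourceBergen","_MEDID","S&P 500"],'

# Wrap the comma-separated arrays in outer brackets so the whole
# string parses as one JSON array of arrays.
rows = json.loads("[" + raw.rstrip(",") + "]")

df = pd.DataFrame(rows, columns=["Ticker", "Name", "Field3", "Field4"])
print(df)
```

For the real payload you would first slice out the portion of response.text that contains these lists before wrapping it in brackets.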

Related

Convert specific columns to list and then create json

I have a spreadsheet with multiple "tags" columns, named like this: "tags_0", "tags_1", "tags_2". And there can be more.
I'm trying to find all the "tags" columns and put their values into a list using a pandas DataFrame. And eventually, put them inside an array of "tags" inside a JSON file.
I thought of using regex, but I can't find a way to apply it.
This is the function I'm using to output the json file. I added the tags array for reference:
import json

import pandas as pd

def convert_products():
    read_exc = pd.read_excel('./data/products.xlsx')
    df = pd.DataFrame(read_exc)
    all_data = []
    for i in range(len(df)):
        js = {
            "sku": df['sku'][i],
            "brand": df['brand'][i],
            "tags": [?]
        }
        all_data.append(js)
    json_object = json.dumps(all_data, ensure_ascii=False, indent=2)
    with open("./data/products.json", "w", encoding='utf-8') as outfile:
        outfile.write(json_object)
How can I achieve this?
Thanks
You can achieve that in a much easier way by doing something like this...
df = pd.read_excel('your_file.xlsx')
tags_columns = [col for col in df.columns if col.startswith("tags_")]
df["tags"] = df[tags_columns].values.tolist()
df[["sku","brand","tags"]].to_json("test.json",orient="records")
You can try other JSON orientations if you want: ["index", "columns", "split", "records", "values", "table"]. Check them in the pandas documentation.
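For instance, with a tiny stand-in frame (one hypothetical sku/brand/tags row), "records" and "index" produce differently shaped JSON:

```python
import json

import pandas as pd

# A one-row stand-in for the sku/brand/tags frame built above.
df = pd.DataFrame({"sku": ["ADX112"], "brand": ["ADX"], "tags": [["art", "frame"]]})

records = df.to_json(orient="records")  # list of row objects
index = df.to_json(orient="index")      # dict keyed by row label

print(records)
print(index)
```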
First, you can get all the column names as a list:
list(df.columns.values)
Now you can search this list for the names that contain tags_. Once you have the tag columns, loop through them for each row, collect that row's tag values into a list, and pass the list into the JSON object:
# for each row i in the dataframe:
tagList = []
for tagColumn in tagColumnList:
    tagList.append(df[tagColumn][i])
# ... your code for creating the json object ...
# pass tagList for the "tags" key in the json object
You are probably looking for filter:
out = pd.concat([df[['sku', 'brand']],
                 df.filter(regex='^tags_').agg(list, axis=1).rename('tags')],
                axis=1).to_json(orient='records', indent=2)
print(out)
# Output
[
  {
    "sku":"ADX112",
    "brand":"ADX",
    "tags":[
      "art",
      "frame",
      "painting"
    ]
  }
]

How to read a json data into a dataframe using pandas

I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the json file and make a dataframe such as
Each key-value pair should become a row in the dataframe, and I need the headers "Sentence" and "Label". I tried using lines = True but it returns all the key-value pairs in one row.
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?
you can use:
import json

import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'Sentence', 0: 'Label'})
Easy way that I remember
import pandas as pd
import json

with open("./data.json", "r") as f:
    data = json.load(f)

df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
    .T\
    .reset_index()\
    .rename(columns={"index": "Sentence", 0: "Labels"})
Explanation
A little dirty, but as you probably noticed, lines=True isn't completely sufficient, so the above transposes the result so that you have:
(index)    0
Text1      4
Text2      1
TextN    123
Resetting the index then moves the index over to a column named "index", and the rename gives the final column names.
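For completeness, the same reshape works without the transpose dance by going through a Series (a sketch over the question's sample dict):

```python
import pandas as pd

data = {"Text1": 4, "Text2": 1, "TextN": 123}

# Keys become the index; rename_axis names it, and reset_index turns it
# into the "Sentence" column, with the values under "Label".
df = pd.Series(data).rename_axis("Sentence").reset_index(name="Label")
print(df)
```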

Nested dictionary from dataframe with loop over list

complete newbie here.
I want to read data from two columns in several excel sheets in a nested dictionary. In the end, I'd like to have a dictionary looking like this:
{SheetName1:{Index1: Value1, Index2: Value2,...}, SheetName2:{Index1: Value1, Index2: Value2} ...}
So far I read in the data using pandas and figured out how to combine the two columns I need into the inner dictionary {Index: Value}, which afterwards gets assigned the name of the sheet as a key to form the outer dictionary:
#read excel sheets into a dict of dataframes
df = ExcelWorkbook.parse(sheet_name=None, header=1, usecols=16, skiprows=6)
#read the sheet names into a list
SHEETNAMES = ExcelWorkbook.sheet_names
#nested dictionary
for Sheet in SHEETNAMES:
    df[Sheet] = df[Sheet].loc[0:87, :]
    dic = dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))
    dic = {Sheet: dic}
Now when I run this, it only returns the last sheet with its corresponding {Index: Value} pair:
{'LastSheetName': {Key1: Value1, Key2: Value2, ...}}
Now it seems to me that I've done the "harder" part, but I can't seem to figure out how to fill a new dictionary with the dictionaries generated by this loop...
Any help is greatly appreciated!
Best regards,
Jan
You are assigning dic as a new variable each time you iterate through your for loop. Instead, instantiate dic as an empty dict {} outside of the loop and then update it with the dictionaries you define inside the loop, such as:
#read excel sheets into a dict of dataframes
df = ExcelWorkbook.parse(sheet_name=None, header=1, usecols=16, skiprows=6)
#nested dictionary
dic = {}
for Sheet in ExcelWorkbook.sheet_names:
    df[Sheet] = df[Sheet].iloc[0:87, :]
    out = {Sheet: dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))}
    dic.update(out)
Also, you want to use .iloc in place of .loc, considering you are specifying index locations inside the dataframe.
I just figured it out after tweaking #rahlf23's response a bit. So for anyone looking this up: dictionaries have no .append(), so dic.update() is the way to go:
#nested dictionary
dic1 = {}
for Sheet in SHEETNAMES:
    df[Sheet] = df[Sheet].iloc[0:87, :]
    out = dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))
    out2 = {Sheet: out}
    dic1.update(out2)
Now one can access the values with:
print(dic1[SheetName][Index])
Thanks for your help #rahlf23, without your comment I'd still be trapped in the loop :)
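The loop above can also be collapsed into a dict comprehension over the {sheet_name: DataFrame} mapping that sheet_name=None returns. A sketch, using small stand-in frames in place of the real ExcelWorkbook.parse() result and the question's 'ColumnName' column:

```python
import pandas as pd

# Stand-in for the {sheet_name: DataFrame} mapping returned by
# ExcelWorkbook.parse(sheet_name=None, ...).
df = {
    "Sheet1": pd.DataFrame({"ColumnName": [10, 20, 30]}),
    "Sheet2": pd.DataFrame({"ColumnName": [40, 50]}),
}

# {sheet: {index: value}} in one pass; Series.to_dict() already
# yields the inner {Index: Value} mapping.
dic1 = {sheet: frame["ColumnName"].iloc[0:88].to_dict()
        for sheet, frame in df.items()}
print(dic1)
```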

pandas.DataFrame.from_dict not preserving order using OrderedDict

I want to import OData XML datafeeds from the Dutch Bureau of Statistics (CBS) into our database. Using lxml and pandas I thought this should be straightforward. By using OrderedDict I want to preserve the order of the columns for readability, but somehow I can't get it right.
from collections import OrderedDict
from lxml import etree
import requests
import pandas as pd

# CBS URLs
base_url = 'http://opendata.cbs.nl/ODataFeed/odata'
datasets = ['/37296ned', '/82245NED']

feed = requests.get(base_url + datasets[1] + '/TypedDataSet')
root = etree.fromstring(feed.content)

# all record entries start at tag m:properties, parse into data dict
data = []
for record in root.iter('{{{}}}properties'.format(root.nsmap['m'])):
    row = OrderedDict()
    for element in record:
        row[element.tag.split('}')[1]] = element.text
    data.append(row)

df = pd.DataFrame.from_dict(data)
df.columns
Inspecting data, each OrderedDict is in the right order. But looking at df.head(), the columns have been sorted alphabetically, with capitals first?
Help, anyone?
Something in your example seems to be inconsistent, as data is a list, not a dict, but assuming you really have an OrderedDict:
Try to explicitly specify your column order when you create your DataFrame:
# ... all your data collection
df = pd.DataFrame(data, columns=data.keys())
This should give you your DataFrame with the columns ordered exactly as they are in the OrderedDict (via the list generated from data.keys()).
The above answer didn't work for me and kept giving me "ValueError: cannot use columns parameter with orient='columns'".
Later I found a solution by doing the below, and it worked:
df = pd.DataFrame.from_dict(dict_data)[list(dict_data[0].keys())]
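Worth noting: data here is a list of OrderedDicts, and recent pandas versions preserve that insertion order when building the frame directly, so the workarounds may not be needed at all. A quick check with stand-in, deliberately non-alphabetical keys:

```python
from collections import OrderedDict

import pandas as pd

# Two stand-in records with a deliberately non-alphabetical key order.
data = [OrderedDict([("b_col", 1), ("A_col", 2)]),
        OrderedDict([("b_col", 3), ("A_col", 4)])]

# In recent pandas, the column order follows the dicts' insertion order.
df = pd.DataFrame(data)
print(list(df.columns))
```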

How to append a dictionary to a pandas dataframe?

I have a set of URLs pointing to JSON files and an empty pandas DataFrame whose columns represent the attributes of the JSON files. Not all JSON files have all the attributes. What I need to do is create a dictionary from each JSON file and append it to the DataFrame as a new row; if a JSON file lacks an attribute matching a column, that cell should be left blank.
I managed to create dictionaries as:
import urllib2
import json
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=ULST:7BIS01CF"
data = urllib2.urlopen(url).read()
data = json.loads(data)
and then I tried to create a for loop as follows:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        for column in df.columns:
            if str(column) == str(key):
                df.loc[[str(column)], row] = data[str(key)]
            else:
                df.loc[[str(column)], row] = None
where df is the dataframe and links is the set of urls
However, I get the following error:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['2_seater_depth_mm'] not in index"
where ['2_seater_depth_mm'] is the first column of the pandas dataframe
For me below code works:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        df.loc[row, key] = data[key]
You had the order of arguments in .loc mixed up, and one pair of [] too many.
Assuming that df is empty and has the same columns as the url dictionary keys, i.e.
list(df)
#[u'alternate_product_code',
# u'availability',
# u'boz',
# ...
len(df)
#0
then you can use pandas.append
for url in links:
    url_data = urllib2.urlopen(str(url)).read()
    url_dict = json.loads(url_data)
    a_dict = {k: pandas.Series([str(v)], index=[0]) for k, v in url_dict.iteritems()}
    new_df = pandas.DataFrame.from_dict(a_dict)
    df = df.append(new_df, ignore_index=True)  # append returns a new frame
Not too sure why your code won't work, but consider the following few edits which should clean things up, should you still want to use it:
for row, url in enumerate(links):
    data = urllib2.urlopen(str(url)).read()
    data_dict = json.loads(data)
    for key, val in data_dict.items():
        if key in list(df):
            df.loc[row, key] = val
I used enumerate to iterate over the index and value of the links array, so you don't need an index counter (row in your code), and then I used the .items() dictionary method to iterate over keys and values at once. I believe pandas will automatically handle the empty dataframe entries.
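As a footnote, DataFrame.append has since been deprecated (and removed in pandas 2.0); the now-idiomatic pattern is to collect the parsed dicts and build the frame once, which also fills missing attributes with NaN automatically. A sketch, with stand-in records in place of the question's URL payloads:

```python
import pandas as pd

# Stand-ins for the parsed JSON of two product URLs; the second
# record is missing the "availability" attribute.
records = [
    {"sku": "ULST:7BIS01CF", "availability": "in stock"},
    {"sku": "ULST:OTHER"},
]

# Building the frame once from all records fills any missing
# attribute with NaN, with no per-row assignment needed.
df = pd.DataFrame(records)
print(df)
```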
