I am using a Yahoo Finance Python library to grab accounting/financial data for some basic analysis. All of the financial statement data comes in JSON format, but I want it in tabular format, as I typically see in a pandas dataframe. There are several wrappers around the data and I'm not sure how to remove them so that I can get the data into a simple rows-and-columns dataframe. Here is what the JSON looks like:
{
"incomeStatementHistory":{
"F":[
{
"2019-12-31":{
"researchDevelopment":"None",
"effectOfAccountingCharges":"None",
"incomeBeforeTax":-640000000,
"minorityInterest":45000000,
"netIncome":47000000,
"sellingGeneralAdministrative":10218000000,
"grossProfit":12876000000,
"ebit":2658000000,
"operatingIncome":2658000000,
"otherOperatingExpenses":"None",
"interestExpense":-1049000000,
"extraordinaryItems":"None",
We don't have the full response, so it's difficult to tell whether this will be exactly what you want, but try pd.json_normalize:
d = {
"incomeStatementHistory":{
"F":[
{
"2019-12-31":{
"researchDevelopment":"None",
"effectOfAccountingCharges":"None",
"incomeBeforeTax":-640000000,
"minorityInterest":45000000,
"netIncome":47000000,
"sellingGeneralAdministrative":10218000000,
"grossProfit":12876000000,
"ebit":2658000000,
"operatingIncome":2658000000,
"otherOperatingExpenses":"None",
"interestExpense":-1049000000,
"extraordinaryItems":"None",}}]}}
pd.json_normalize(d['incomeStatementHistory']['F'])
Output:
2019-12-31.researchDevelopment 2019-12-31.effectOfAccountingCharges 2019-12-31.incomeBeforeTax ... 2019-12-31.otherOperatingExpenses 2019-12-31.interestExpense 2019-12-31.extraordinaryItems
0 None None -640000000 ... None -1049000000 None
[1 rows x 12 columns]
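If you would rather have the statement period as a row label than as a column prefix, one possible follow-up (a sketch, reusing the d dictionary above) is to split the flattened column names into a MultiIndex and stack the period level into the index:
import pandas as pd
df = pd.json_normalize(d['incomeStatementHistory']['F'])
# split "2019-12-31.researchDevelopment" into ('2019-12-31', 'researchDevelopment')
df.columns = pd.MultiIndex.from_tuples(tuple(c.split('.', 1)) for c in df.columns)
# move the period level into the row index, leaving plain line-item columns
df = df.stack(level=0)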
You should use pandas. Here is a tutorial on how to do that with pandas, and you could also check this related question.
I have found numerous similar questions on Stack Overflow; however, one issue remains unsolved for me. I have a heavily nested .json file that I need to import and convert into an R or Python data.frame to work with. The JSON contains lists inside (usually empty, but sometimes holding data); an example of its structure is given in the minimal reproducible example below.
I use R's library jsonlite and Python's pandas.
# R
jsonlite::fromJSON(json_file, flatten = TRUE)
# or
jsonlite::read_json(json_file, simplifyVector = T)
# Python
import json
import pandas as pd

with open("json_file.json", encoding="utf-8") as f:
    data = json.load(f)
pd.json_normalize(data)
Generally, in both cases this works: the output looks like a normal data.frame. The problem, however, is that some columns of the new data.frame contain embedded lists. Both pandas and jsonlite seem to have collapsed each list into a single column, which was clearly visible in the R and Python screenshots attached to the original question (not reproduced here): columns such as wymagania.wymaganiaKonieczne.wyksztalcenia are nothing but vectors holding a combined/embedded list, i.e. the content of each list has been collapsed into a single column.
As the desired output, I want each element of such lists split out into its own column of the data.frame. In other words, I want a normal, tidy data.frame without any nested data.frames or lists. Both R and Python answers are appreciated.
Minimum reproducible example:
[
{
"warunkiPracyIPlacy":{"miejscePracy":"abc","rodzajObowiazkow":"abc","zakresObowiazkow":"abc","rodzajZatrudnienia":"abc","kodRodzajuZatrudnienia":"abc","zmianowosc":"abc"},
"wymagania":{
"wymaganiaKonieczne":{
"zawody":[],
"wyksztalcenia":["abc"],
"wyksztalceniaSzczegoly":[{"kodPoziomuWyksztalcenia":"RPs002|WY","kodTypuWyksztalcenia":"abc"}],
"jezyki":[],
"jezykiSzczegoly":[],
"uprawnienia":[]},
"wymaganiaPozadane":{
"zawody":[],
"zawodySzczegoly":[],
"staze":[]},
"wymaganiaDodatkowe":{"zawody":[],"zawodySzczegoly":[]},
"inneWymagania":"abc"
},
"danePracodawcy":{"pracodawca":"abc","nip":"abc","regon":"abc","branza":null},
"pozostaleDane":{"identyfikatorOferty":"abc","ofertaZgloszonaPrzez":"abc","ofertaZgloszonaPrzezKodJednostki":"abc"},
"typOferty":"abc",
"typOfertyNaglowek":"abc",
"rodzajOferty":["DLA_ZAREJESTROWANYCH"],"staz":false,"link":false}
]
This is an answer for Python. It is not very elegant, but I think it will do for your purpose. I have called your example file nested_json.json.
import json
import pandas as pd
json_file = "nested_json.json"
with open(json_file, encoding="utf-8") as f:
data = json.load(f)
df = pd.json_normalize(data)
# give each element of a list-valued column its own row
df_exploded = df.apply(lambda x: x.explode()).reset_index(drop=True)

# check, based on the first row, which columns hold dicts
columns_dict = df_exploded.columns[df_exploded.apply(lambda x: isinstance(x[0], dict))]

# append the split-out dict columns to the dataframe
for col in columns_dict:
    df_splitted_dict = df_exploded[col].apply(pd.Series)
    df_exploded = pd.concat([df_exploded, df_splitted_dict], axis=1)
This leads to a rectangular dataframe
>>> df_exploded.T
0
typOferty abc
typOfertyNaglowek abc
rodzajOferty DLA_ZAREJESTROWANYCH
staz False
link False
warunkiPracyIPlacy.miejscePracy abc
warunkiPracyIPlacy.rodzajObowiazkow abc
warunkiPracyIPlacy.zakresObowiazkow abc
warunkiPracyIPlacy.rodzajZatrudnienia abc
warunkiPracyIPlacy.kodRodzajuZatrudnienia abc
warunkiPracyIPlacy.zmianowosc abc
wymagania.wymaganiaKonieczne.zawody NaN
wymagania.wymaganiaKonieczne.wyksztalcenia abc
wymagania.wymaganiaKonieczne.wyksztalceniaSzcze... {'kodPoziomuWyksztalcenia': 'RPs002|WY', 'kodT...
wymagania.wymaganiaKonieczne.jezyki NaN
wymagania.wymaganiaKonieczne.jezykiSzczegoly NaN
wymagania.wymaganiaKonieczne.uprawnienia NaN
wymagania.wymaganiaPozadane.zawody NaN
wymagania.wymaganiaPozadane.zawodySzczegoly NaN
wymagania.wymaganiaPozadane.staze NaN
wymagania.wymaganiaDodatkowe.zawody NaN
wymagania.wymaganiaDodatkowe.zawodySzczegoly NaN
wymagania.inneWymagania abc
danePracodawcy.pracodawca abc
danePracodawcy.nip abc
danePracodawcy.regon abc
danePracodawcy.branza None
pozostaleDane.identyfikatorOferty abc
pozostaleDane.ofertaZgloszonaPrzez abc
pozostaleDane.ofertaZgloszonaPrzezKodJednostki abc
kodPoziomuWyksztalcenia RPs002|WY
kodTypuWyksztalcenia abc
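If you also want to get rid of the original dict-valued columns once their keys have been split out, a small optional addition (a sketch, reusing df_exploded and columns_dict from above):
# drop the original dict-valued columns; their keys now live in separate columns
df_exploded = df_exploded.drop(columns=columns_dict)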
I'm relatively new to PySpark and trying to solve a data problem. I have a PySpark DF, created from data extracted from MS SQL Server, with 2 columns: ID (integer) and XMLMsg (string). The second column, XMLMsg, contains data in XML format.
The goal is to parse the XMLMsg column and create additional columns in the same DF with the extracted columns from the XML.
Following is a sample structure of the pyspark DF:
ID XMLMsg
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>...
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>...
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>...
Expected output is:
ID XMLMsg b c d
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>... name1 loc1 dept1
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>... name2 loc2 dept2
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>... name3 loc3 dept3
I tried a few suggestions based on my search on SO; however, I could not achieve the expected result. Hence I am reaching out for some help and directions. Thanks for your time.
I finally solved this using a lambda and a UDF, considering I had to get the text of 4 nodes from a huge XML file. Since the XML is already in a column of the PySpark dataframe, I didn't want to write it out as files and parse the whole XML again. I also wanted to avoid using an XSD schema.
The actual XML has multiple namespaces and also some nodes with conditions on attributes.
Example:
<ap:applicationproduct xmlns:xsi="http://www.example.com/2005/XMLSchema-instance" xmlns:ap="http://example.com/productinfo/1_6" xmlns:ct="http://example.com/commontypes/1_0" xmlns:dc="http://example.com/datacontent/1_0" xmlns:tp="http://aexample.com/prmvalue/1_0" ....." schemaVersion="..">
<ap:ParameterInfo>
<ap:Header>
<ct:Version>1.0</ct:Version>
<ct:Sender>ABC</ct:Sender>
<ct:SenderVersion />
<ct:SendTime>...</ct:SendTime>
</ap:Header>
<ap:ProductID>
<ct:Model>
<ct:Series>34AP</ct:Series>
<ct:ModelNo>013780</ct:ModelNo>
..............
..............
<ap:Object>
<ap:Parameter schemaVersion="2.5" Code="DDA">
<dc:Value>
<tp:Blob>mbQAEAgBTgKQEBAX4KJJYABAIASL0AA==</tp:Blob>
</dc:Value>
</ap:Parameter>
.........
........
Here I need to extract the values from ct:ModelNo and tp:Blob
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import xml.etree.ElementTree as ET

# Namespaces used in the document:
ns = {'ap': 'http://example.com/productinfo/1_6',
      'ct': 'http://example.com/commontypes/1_0',
      'dc': 'http://example.com/datacontent/1_0',
      'tp': 'http://aexample.com/prmvalue/1_0'
      }

# extract the text of the ct:ModelNo node from the XML string
parsed_model = lambda x: ET.fromstring(x).find('ap:ParameterInfo/ap:ProductID/ct:Model/ct:ModelNo', ns).text
udf_model = udf(parsed_model)
parsed_model_df = df.withColumn('ModelNo', udf_model('XMLMsg'))
Also, for the node with the blob value, a similar function can be written, but the path to the node would be (note the @Code attribute predicate):
'ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob'
This worked for me and I was able to add the required values in the pyspark DF. Any suggestions are welcome to make it better though. Thank you!
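For completeness, here is a sketch of that second UDF (assuming the same ns dict and the parsed_model_df from above):
# extract the tp:Blob text for the ap:Parameter whose Code attribute is "DDA"
parsed_blob = lambda x: ET.fromstring(x).find(
    'ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob', ns).text
udf_blob = udf(parsed_blob)
parsed_blob_df = parsed_model_df.withColumn('Blob', udf_blob('XMLMsg'))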
I need to parse some JSON files into a pandas dataframe. I want one column with the words present in the text, and another column with the corresponding entity: the entity should be the "Type" from the JSON below whenever the "Value" corresponds to the word; otherwise I want to assign the label 'O'.
Below is an example.
This is the JSON file:
{"Text": "I currently use a Netgear Nighthawk AC1900. I find it reliable.",
"Entities": [
{
"Type": "ORGANIZATION ",
"Value": "Netgear"
},
{
"Type": "DEVICE ",
"Value": "Nighthawk AC1900"
}]
}
Here is what I want to get:
WORD TAG
I O
currently O
use O
a O
Netgear ORGANIZATION
Nighthawk AC1900 DEVICE
. O
I O
find O
it O
reliable O
. O
Can someone help me with the parsing? I can't simply use split() because sometimes the values consist of two words. Hope this is clear. Thank you!
This is a difficult problem, and the answer will depend on what data isn't in this example and on the output required. Do you have repeating data in the entity values? Is order important? Do you want repetition in the output?
There are a few tools that can be used:
Build a trie out of the Entity values before you search the string. This is good if you have overlapping versions of the same name, like "Netgear" and "Netgear INC.", and you want the longest version; a minimal longest-match sketch follows this list.
nltk.PunktSentenceTokenizer. This one is finicky to work with around proper nouns; this tutorial does a better job of explaining how to deal with them.
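Here is that minimal longest-match sketch (an illustration under assumptions, not a full trie; punctuation glued to ordinary words stays attached, which a real tokenizer would split off):
import json
import pandas as pd

with open('netgear.json', encoding='utf-8') as f:
    info = json.load(f)

text = info['Text']
# longest values first, so "Nighthawk AC1900" beats any shorter overlapping match
entities = sorted(info['Entities'], key=lambda e: -len(e['Value']))

rows, i = [], 0
while i < len(text):
    for e in entities:
        if text.startswith(e['Value'], i):
            rows.append((e['Value'], e['Type'].strip()))  # the Types carry trailing spaces in the JSON
            i += len(e['Value'])
            break
    else:
        # consume one whitespace-delimited token (or step over a single space)
        j = i
        while j < len(text) and not text[j].isspace():
            j += 1
        if j > i:
            rows.append((text[i:j], 'O'))
        i = j + 1

df = pd.DataFrame(rows, columns=['WORD', 'TAG'])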
I don't know if you strictly need what you posted as the desired output.
The solution I am giving you is "dirty": it produces extra rows, fills unmatched tags with 0 rather than 'O', and leaves the multi-word entity in its own row.
You can manage to clean it up and put it into the format you need. As you didn't provide a piece of code to start from, you can finish it.
Eventually you will find out that the purpose of Stack Overflow is not to get people to write the code for you, but to get help with the code you are trying to write.
import json
import pandas as pd
# open and read the JSON file:
with open('netgear.json', 'r') as jfile:
    data = jfile.read()
info = json.loads(data)

# split the JSON into its content
words, tags = info['Text'].split(), info['Entities']

# list to hold the Entities
prelist = []
for i in tags:
    j = list(i.values())
    # ['ORGANIZATION ', 'Netgear']
    # ['DEVICE ', 'Nighthawk AC1900']
    prelist.append(j)

# DataFrames to be merged
dft = pd.DataFrame(prelist, columns=['TAG', 'WORD'])
dfw = pd.DataFrame(words, columns=['WORD'])

# combine the DataFrames, filling NaN with 0
df = dfw.merge(dft, on='WORD', how='outer').fillna(0)
This is the output:
WORD TAG
0 I 0
1 I 0
2 currently 0
3 use 0
4 a 0
5 Netgear ORGANIZATION
6 Nighthawk 0
7 AC1900. 0
8 find 0
9 it 0
10 reliable. 0
11 Nighthawk AC1900 DEVICE
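To push that "dirty" frame toward the desired output, one possible cleanup (a sketch, reusing df from above):
df['TAG'] = df['TAG'].replace(0, 'O').str.strip()  # 'O' instead of 0, trailing spaces removed
df = df[['WORD', 'TAG']]                            # WORD column first, as in the desired output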
I'm trying to read a CSV from a web page into pandas, but I get an empty dataframe: an object with 0 rows and 155 columns. I feel like I'm missing a step; in general I'm struggling with using data from the web as opposed to files on my machine.
url ="https://data.world/exercises/data-wrangling-exercise-1/workspace/file?filename=Crime_2015.csv"
crimex= pd.read_csv(url)
print(crimex)
The output is as follows:
Empty DataFrame
Columns: [ data.world Loading...?Feedback{"dataset":{"hasError":false, loadedDatasets:{}, ... (the remaining columns are just the raw HTML/JavaScript of the data.world page) ...]
Index: []
[0 rows x 155 columns]
You need to do what it says on the data.world site. Click the Download link (at the top right), choose Share URL, and you will get a secure download link of the form:
https://query.data.world/s/xxxxxxxlink_codexxxxxxxx
To import it directly:
import pandas as pd
df = pd.read_csv('https://query.data.world/s/xxxxxxxlink_codexxxxxxxx')
Note: my links have been edited so they do not work; substitute your own link.
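As a quick sanity check (a sketch using the requests library), you can inspect the Content-Type header of a URL before handing it to pandas; the original workspace URL reports an HTML type, which is why pandas saw 155 junk columns:
import requests

resp = requests.get(url, stream=True)     # stream=True avoids downloading the body
print(resp.headers.get('Content-Type'))   # HTML for the workspace page; a CSV-like type for a raw file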
I am trying to do some date parsing in Python, and while parsing I came across this weird error:
time data 'nan' does not match format '%d/%m/%y'
When I checked my .csv file in LibreOffice Calc, everything looked fine: no NaN values whatsoever. However, when I checked it in Excel (the mobile version, since I don't want to pay), I saw a different value. The same cell was shown as follows in the two editors:
LibreOffice Calc: 11/09/93
Excel: ########
(A screenshot was attached to the original question but is not reproduced here.)
How could I change it in LibreOffice or Python so that these values won't be treated as NaN but as the real values they should be?
I don't have much knowledge of Excel and LibreOffice Calc, so any explanation of how to solve this simple issue would be welcome.
Here is the python code
import pandas as pd
from datetime import datetime as dt
loc = "C:/Data/"
season1993_94 = pd.read_csv(loc + '1993-94.csv')
def parse_date_type1(date):
    if date == '':
        return None
    return dt.strptime(date, '%d/%m/%y').date()

def parse_date_type2(date):
    if date == '':
        return None
    return dt.strptime(date, '%d/%m/%Y').date()
season1993_94.Date = season1993_94.Date.astype(str).apply(parse_date_type1)
Error:
<ipython-input-13-46ff7e1afe94> in <module>()
----> 1 season1993_94.Date = season1993_94.Date.astype(str).apply(parse_date_type1)
ValueError: time data 'nan' does not match format '%d/%m/%y'
PS: If the question seems inappropriate as per the context given, please feel free to edit it.
To see what is going on, use a text editor such as Notepad++. Viewing with Excel or Calc may not show the problem; at least, the problem cannot be seen from the images in the question.
The error occurs with a CSV file consisting of the following three lines.
Date,Place
28/08/93,Southampton
,Newcastle
Here is the solution, adapted from "How to convert string to datetime with nulls - python, pandas?":
season1993_94['Date'] = pd.to_datetime(season1993_94['Date'], errors='coerce')
The result:
>>> season1993_94
Date Place
0 1993-08-28 Southampton
1 NaT Newcastle
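One caveat: without an explicit format, pd.to_datetime may guess month-first for ambiguous dates. Since the data here is day-first (28/08/93), passing the format keeps the parsing unambiguous, while errors='coerce' still turns blanks into NaT:
season1993_94['Date'] = pd.to_datetime(season1993_94['Date'], format='%d/%m/%y', errors='coerce')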