I have a json file which looks like the one shown in the picture.
How can I import and print all the Quantity and Rate in Pandas?
How can I print the sum of all the Quantity for Buy and Sell separately?
How can I print the sum of all the Quantity values that are greater than x? E.g. SUM(Qty > 5)
In raw format, the data is like this
{"success":true,"message":"","result":{"buy":[{"Quantity":199538.30948659,"Rate":0.00000970},{"Quantity":62142.31715449,"Rate":0.00000968},{"Quantity":233476.03486058,"Rate":0.00000967},{"Quantity":75613.30879931,"Rate":0.00000966},{"Quantity":3109.14961399,"Rate":0.00000965},{"Quantity":66.22406639,"Rate":0.00000964},{"Quantity":401.06420081,"Rate":0.00000963},{"Quantity":186.93339628,"Rate":0.00000961},{"Quantity":122731.01165366,"Rate":0.00000960},{"Quantity":7718.27750144,"Rate":0.00000959},{"Quantity":802.00000000,"Rate":0.00000958},{"Quantity":2050.72163419,"Rate":0.00000956},{"Quantity":1000.00000000,"Rate":0.00000955}
import pandas as pd
# change 'buy' to 'sell' for the other side of the order book
data = pd.DataFrame(pd.read_json('file.json')['result']['buy'])

# filtering before summing, e.g. only rows with Quantity > 5 and Rate > 0.00000966
print(data.query('Quantity > 5').query('Rate > 0.00000966').sum())
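For the other parts of the question, a minimal sketch along the same lines (it assumes the file also contains a 'sell' list shaped like the 'buy' list shown above):

import pandas as pd

# same loading approach as above
result = pd.read_json('file.json')['result']
buy = pd.DataFrame(result['buy'])
sell = pd.DataFrame(result['sell'])   # assumes a 'sell' list exists next to 'buy'

# sum of Quantity for buy and sell separately
print(buy['Quantity'].sum())
print(sell['Quantity'].sum())

# SUM(Qty > 5): sum only the quantities greater than 5
print(buy.loc[buy['Quantity'] > 5, 'Quantity'].sum())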
You can use the pandas.read_json() command to do this. Just pass it your json file in the function and pandas will create a dataframe out of it for you.
Here's the link to the documentation, where you can pass extra parameters like orient='records' to tell pandas what to use as the dataframe columns and what to use as row data: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Once the data is in a dataframe, you can run various commands to calculate the sums of Quantity for buy and sell. Having the data in a dataframe makes life a bit easier when running math calculations, in my opinion.
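As a toy sketch of what orient='records' does (the file name and contents here are made up): if records.json contained a flat list of objects such as [{"Quantity": 1.0, "Rate": 0.5}, {"Quantity": 2.0, "Rate": 0.6}], then

import pandas as pd

# each object in the list becomes one row, its keys become the columns
df = pd.read_json('records.json', orient='records')
print(df['Quantity'].sum())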
Use json_normalize and pass the record path:
import json
import pandas as pd
with open('data.json') as f:
    data = json.load(f)

buy_df = pd.io.json.json_normalize(data['result'], 'buy')
# similarly for the sell data, if you have a separate entity named `sell`
sell_df = pd.io.json.json_normalize(data['result'], 'sell')
Output:
Quantity Rate
0 199538.309487 0.00001
1 62142.317154 0.00001
2 233476.034861 0.00001
3 75613.308799 0.00001
4 3109.149614 0.00001
For the sum you can do
buy_df['Quantity'].sum()
From here, for selecting and indexing the data, refer to this: Indexing and Selecting Data - Pandas
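For the third part of the question, the sum of only those quantities above a threshold, a small sketch using the same frames (sell_df only exists if the JSON has a 'sell' list):

# SUM(Qty > 5) on the buy side
print(buy_df.loc[buy_df['Quantity'] > 5, 'Quantity'].sum())

# and the sell side separately
print(sell_df['Quantity'].sum())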
Related
I would like to use the Codeforces API for some analytics (in Python / SQL / even XLS...).
I tried to get a usable Pandas DataFrame, but I get a DataFrame with 0 rows and 13644 columns.
I have no clue how to get a usable Dataframe out of the API.
What I want to do with the data:
Analyse different aspects like scores / participants / score changes / rounds ...
Just pulling the data into an XLS sheet / SQL should work as well.
Best, Kiki
I tried
import pandas as pd
from sklearn import datasets
contest_list = pd.read_csv("https://codeforces.com/api/contest.list?gym=false")
pd.DataFrame(contest_list)
but got a Dataframe with 0 rows × 13644 columns.
My earlier answer was wrong. I made an assumption that the data on your URL was in CSV format (and that Pandas wouldn't support reading from an HTTP endpoint on read_csv... which it does!).
The data on that URL is not in CSV format.
It is in JSON format.
Therefore:
contest_list = pd.read_json("https://codeforces.com/api/contest.list?gym=false")
pd.DataFrame(contest_list)
However, this results in: [0 rows x 13652 columns]
This is because the JSON is a complex object that needs a little bit of pre-parsing to make a nice dataframe:
import pandas as pd
import requests

url = "https://codeforces.com/api/contest.list?gym=false"
response = requests.get(url)
contents = response.json()

# the list of contests lives under the "result" key of the API response
results = contents.get("result")
df = pd.DataFrame(results)
print(df)
This results in: [1667 rows x 8 columns]
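From there, getting the data into Excel or a database for the analysis you describe is straightforward; a minimal sketch (the output filename is just an example, and the column names depend on what the API returns):

# save the contests table so it can be opened in Excel or loaded into SQL
df.to_csv("codeforces_contests.csv", index=False)

# quick sanity check, e.g. how many contests of each kind (assuming a 'type' column exists)
print(df["type"].value_counts())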
I have a CSV file which has several columns and several rows; please see the picture. The picture shows just the first two baskets, but in the original CSV file I have hundreds of them.
[1]: https://i.stack.imgur.com/R2ZTo.png
I would like to calculate the average for every Fruit in every Basket using Python. Here is my code, but it doesn't seem to work as it should. Better ideas? I have also tried to fix this by importing and using numpy, but I didn't succeed with it.
I would appreciate any help or suggestions! I'm totally new in this.
import csv
from operator import itemgetter
fileLineList = []
averageFruitsDict = {} # Creating an empty dictionary here.
with open('Fruits.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for column in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageFruitsDict[row[0]] = [highest, lowest, round(average)]

averageFruitsList = []
for key, value in averageFruitsDict.items():
    averageFruitsList.append([key, value[2]])
print('\nFruits in Baskets\n')
print(averageFruitsList)
--- So I'm now trying with this code:
import pandas as pd
fruits = pd.read_csv('fruits.csv', sep=';')
print(list(fruits.columns))
fruits['Unnamed: 0'].fillna(method='ffill', inplace = True)
fruits.groupby('Unnamed: 0').mean()
fruits.groupby('Bananas').mean()
fruits.groupby('Apples').mean()
fruits.groupby('Oranges').mean()
fruits.to_csv('results.csv', index=False)
It creates a new CSV file for me and it looks correct; I don't get any errors, but I can't make it calculate the mean of every fruit for every basket. Thankful for any help!
So, using the image you posted and replicating an identical test csv called fruit.csv, I was able to create this quick solution using pandas.
import pandas as pd
fruit = pd.read_csv('fruit.csv')
The unnamed column contains the basket numbers with NaNs in between, so we fill it with the preceding value. By doing so we are able to group by the basket number (using the 'Unnamed: 0' column) and apply the mean to all other columns.
fruit['Unnamed: 0'].fillna(method='ffill', inplace = True)
fruit.groupby('Unnamed: 0').mean()
This gets you your desired output of a fruit average for each basket (please note I made up values for basket 3)
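Put together as one hedged sketch (file and column names taken from the question; the exact header of the basket column may differ in your file):

import pandas as pd

fruit = pd.read_csv('fruit.csv')

# forward-fill the basket numbers so every row knows which basket it belongs to
fruit['Unnamed: 0'].fillna(method='ffill', inplace=True)

# average of every fruit column per basket
averages = fruit.groupby('Unnamed: 0').mean()
print(averages)

# optionally write the result out, as in the question
averages.to_csv('results.csv')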
I currently have a data frame that is reading a csv file called "wimbledons_champions_claned.csv". I need to gather the count of each unique nationality in the column "Champion Nationality". For example, one nationality that shows up in the data is "AUS": I need to count how many there are, and I need to do this with 14 others. Is there an efficient way to do this without having to hard code it? I am currently doing something like this:
import pandas as pd
df = pd.read_csv("wimbledons_champions_claned.csv")
ausChampions = len(df[df['Champion Nationality'] == 'AUS'])
fraChampions = len(df[df['Champion Nationality'] == 'FRA'])
etc...
If there is any better way to do this I would greatly appreciate it. Thank you.
You could just use df['Champion Nationality'].value_counts(). You could also use the Counter class from the collections library. This will return a dictionary-like object with each Champion Nationality in the DataFrame and how many times it is repeated.
import pandas as pd
from collections import Counter
df = pd.read_csv("wimbledons_champions_claned.csv")
result = Counter(df['Champion Nationality'].tolist())
So if you do result['AUS'] it will return how many times it has shown up in the DataFrame.
Something like this:
df['Champion Nationality'].value_counts()
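value_counts() returns a Series indexed by nationality, so individual counts can be looked up directly; a small sketch:

counts = df['Champion Nationality'].value_counts()
print(counts)          # every nationality with its number of champions
print(counts['AUS'])   # count for a single nationality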
I have a big problem filtering my data. I've read a lot here on Stack Overflow and on other pages and tutorials, but I could not solve my specific problem...
The first part of my code, where I load my data into python looks as follow:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from arch import arch_model
spotmarket = pd.read_excel("./data/external/Spotmarket_dhp.xlsx", index=True)
r = spotmarket['Price'].pct_change().dropna()
returns = 100 * r
df = pd.DataFrame(returns)
The Excel table has 43,000 values in one column, the hourly prices. I use this data to calculate the percentage change from hour to hour, and the problem is that there are sometimes big changes of 1,000 to 40,000%. The dataframe looks as follows:
df
Out[12]:
Price
1 20.608229
2 -2.046870
3 6.147789
4 16.519258
...
43827 -16.079874
43828 -0.438322
43829 -40.314465
43830 -100.105374
43831 700.000000
43832 -62.500000
43833 -40400.000000
43834 1.240695
43835 52.124183
43836 12.996778
43837 -17.157795
43838 -30.349971
43839 6.177924
43840 45.073701
43841 76.470588
43842 2.363636
43843 -2.161042
43844 -6.444781
43845 -14.877102
43846 6.762918
43847 -38.790036
[43847 rows x 1 columns]
I want to exclude these outliers. I've tried different ways, like calculating the mean and the std and excluding all values which are more than three times the std away from the mean. It works for a small part of the data, but for the complete data the mean and std are both NaN. Does anyone have an idea how I can filter my dataframe?
I think you need to filter by percentiles, using quantile:
r = spotmarket['Price'].pct_change() * 100

# interquartile-range fences: keep only the values within 1.5 * IQR of the quartiles
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1 - 1.5 * (Q3 - Q1)
q3 = Q3 + 1.5 * (Q3 - Q1)

df = spotmarket[r.between(q1, q3)]
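A minimal self-contained demo of the same idea, with made-up numbers standing in for the pct_change data, just to show the effect of the fences:

import pandas as pd

# toy series with one extreme jump
r = pd.Series([20.6, -2.0, 6.1, 16.5, -40400.0, 1.2, 52.1])

Q1, Q3 = r.quantile(.25), r.quantile(.75)
low, high = Q1 - 1.5 * (Q3 - Q1), Q3 + 1.5 * (Q3 - Q1)

# the -40400% row falls outside the fences and is dropped
print(r[r.between(low, high)])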
Maybe you should first discard all the values that are causing those fluctuations and then create the dataframe. One way is to use filter().
Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both pics show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year, MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying csv data, see this link for the full data and this link for the gapped data. There are two firms, 1020180 and 1020201.
However, when I try Pandas' shift method it fails when there are gaps; try it yourself using the csv files and the code below. All computed columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
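The same trick should work for the one-quarter lag needed for the quarterly change in market value; a hedged sketch along the same lines (column names as in the question, and it assumes the dates are quarter-end dates as in the csv files):

# one quarter back instead of four
dfg['lagQ1'] = dfg['date'] + QuarterEnd(-1)

prev = dfg[['Firm', 'date', 'MV']].copy()
prev.rename(columns={'MV': 'Prev_MV', 'date': 'lagQ1'}, inplace=True)

dfg = pd.merge(dfg, prev, on=['Firm', 'lagQ1'], how='left')

# quarterly change in market value
dfg['D_MV_Q_calc'] = dfg['MV'] - dfg['Prev_MV']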