I'm trying to generate different altair charts programmatically.
I want to base the different chart setups on dictionaries passed to alt.Chart.from_dict().
I've reverse engineered the overall chart configuration by calling chart.to_dict() on an existing chart, but that method serializes the data into JSON, whereas my data lives in pandas dataframes, and I'm struggling to find the right syntax for passing a dataframe in the dictionary.
I've tried a few variations of the following:
d_chart_config = {
"data": df, #or df.to_dict()
"config": {
"view": {"continuousWidth": 400, "continuousHeight": 300},
"title": {"anchor": "start", "color": "#4b5c65", "fontSize": 20},
},
"mark": {"type": "bar", "size": 40},
....}
but I haven't managed to figure out how or where to insert the dataframe in the dictionary, either as a dataframe directly or via df.to_dict().
Please help if you've managed something similar.
The pure pandas way to generate a Vega-Lite data field is {"values": df.to_dict(orient="records")}, but this has problems in some cases (namely handling of datetimes, categoricals, and non-standard numeric & string types).
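A minimal sketch of the datetime problem mentioned above, assuming the specification is eventually serialized with the standard json module. The frame and its column names are made up for illustration, and the string conversion at the end is just one possible workaround, not what Altair does internally:

```python
import json

import pandas as pd

# Hypothetical frame with one datetime column.
df = pd.DataFrame({"a": [1, 2], "b": pd.to_datetime(["2012-12-31", "2013-12-31"])})

# The pure pandas approach keeps pandas Timestamp objects in the records.
records = df.to_dict(orient="records")
naive_data = {"values": records}

try:
    json.dumps(naive_data)
    serializable = True
except TypeError:
    # Timestamp objects are not handled by the standard JSON encoder.
    serializable = False

# One possible workaround: convert datetimes to ISO strings first.
safe = df.assign(b=df["b"].dt.strftime("%Y-%m-%dT%H:%M:%S"))
safe_data = {"values": safe.to_dict(orient="records")}
json.dumps(safe_data)  # serializes without error
```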
Altair has utilities to work around these issues that you can use directly, namely the altair.utils.data.to_values function.
For example:
import pandas as pd
from altair.utils.data import to_values
df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range('2012', freq='Y', periods=3)})
print(to_values(df))
# {'values': [{'a': 1, 'b': '2012-12-31T00:00:00'},
# {'a': 2, 'b': '2013-12-31T00:00:00'},
# {'a': 3, 'b': '2014-12-31T00:00:00'}]}
You can use this directly within a dictionary containing a vega-lite specification and generate a valid chart:
import altair as alt
alt.Chart.from_dict({
"data": to_values(df),
"mark": "bar",
"encoding": {
"x": {"field": "a", "type": "quantitative"},
"y": {"field": "b", "type": "ordinal", "timeUnit": "year"},
}
})
Input : Multiple csv with the same columns (800 million rows) [Time Stamp, User ID, Col1, Col2, Col3]
Memory available : 60GB of RAM and 24 core CPU
Input Output example
Problem : I want to group by User ID, sort by TimeStamp and take a unique of Col1 but dropping duplicates while retaining the order based on the TimeStamp.
Solutions Tried :
Tried using joblib to load csv in parallel and use pandas to sort and write to csv (Get an error at the sorting step)
Used dask (I'm new to Dask):
LocalCluster(dashboard_address=f':{port}', n_workers=4, threads_per_worker=4, memory_limit='7GB') ## Cannot use the full 60 gigs as there are others on the server
ddf = read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")
Questions :
Assuming dask will use disk to sort and write the multipart output, will the order be preserved when I read the output back with read_csv?
How do I achieve the second part of the problem in dask? In pandas, I'd just use apply and gather the results in a new dataframe:
def getUnique(user_group): ## assuming the rows for each user are sorted by timestamp
    res = list()
    for val in user_group["Col1"]:
        if val not in res:
            res.append(val)
    return res
Please direct me if there is a better alternative to dask.
So, I think I would approach this with two passes. In the first pass, I would run through all the CSV files and build a data structure keyed by user_id and col1 that holds the "best" timestamp. In this case, "best" means the lowest.
Note: dictionaries are used here only to clarify what we are attempting to do; if performance or memory were an issue, I would first look to reimplement without them where possible.
So, starting with CSV data like:
[
{"user_id": 1, "col1": "a", "timestamp": 1},
{"user_id": 1, "col1": "a", "timestamp": 2},
{"user_id": 1, "col1": "b", "timestamp": 4},
{"user_id": 1, "col1": "c", "timestamp": 3},
]
After processing all the csv files I hope to have an interim representation of:
{
1: {'a': 1, 'b': 4, 'c': 3}
}
Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a pass 1.5 if you wanted to do that.
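As a rough sketch of that "pass 1.5", assuming each CSV has already been reduced to its own {user_id: {col1: timestamp}} dictionary, the partial results could be merged like this. The helper name merge_interim and the sample dictionaries are hypothetical:

```python
def merge_interim(*partials):
    """Combine per-file {user_id: {col1: timestamp}} dicts, keeping the lowest timestamp."""
    merged = {}
    for partial in partials:
        for user_id, col1_to_ts in partial.items():
            target = merged.setdefault(user_id, {})
            for col1, ts in col1_to_ts.items():
                # Keep the earliest timestamp seen for this user_id/col1 pair.
                if col1 not in target or ts < target[col1]:
                    target[col1] = ts
    return merged


# Hypothetical interim results from two CSV files processed separately.
file_a = {1: {"a": 1, "b": 4}}
file_b = {1: {"a": 2, "c": 3}, 2: {"x": 7}}

merged = merge_interim(file_a, file_b)
print(merged)
# {1: {'a': 1, 'b': 4, 'c': 3}, 2: {'x': 7}}
```

Because taking the minimum is associative, the per-file dictionaries can be built in any order (or in parallel) before merging.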
Now we can create a final representation from the keys of this nested structure, sorted by the innermost value, giving us:
[
{'user_id': 1, 'col1': ['a', 'c', 'b']}
]
Here is how I might first approach this task before tweaking things for performance.
import csv
all_csv_files = [
"some.csv",
"bunch.csv",
"of.csv",
"files.csv",
]
data = {}
for csv_file in all_csv_files:
    #with open(csv_file, "r") as file_in:
    #    rows = csv.DictReader(file_in)
    ## ----------------------------
    ## demo data
    ## ----------------------------
    rows = [
        {"user_id": 1, "col1": "a", "timestamp": 1},
        {"user_id": 1, "col1": "a", "timestamp": 2},
        {"user_id": 1, "col1": "b", "timestamp": 4},
        {"user_id": 1, "col1": "c", "timestamp": 3},
    ]
    ## ----------------------------
    ## ----------------------------
    ## First pass to determine the "best" timestamp
    ## for a user_id/col1
    ## ----------------------------
    for row in rows:
        user_id = row['user_id']
        col1 = row['col1']
        ts_new = row['timestamp']
        ts_old = (
            data
            .setdefault(user_id, {})
            .setdefault(col1, ts_new)
        )
        if ts_new < ts_old:
            data[user_id][col1] = ts_new
## ----------------------------
print(data)
## ----------------------------
## second pass to set order of col1 for a given user_id
## ----------------------------
data_out = [
    {
        "user_id": outer_key,
        "col1": [
            inner_kvp[0]
            for inner_kvp
            in sorted(outer_value.items(), key=lambda v: v[1])
        ]
    }
    for outer_key, outer_value
    in data.items()
]
## ----------------------------
print(data_out)
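One caveat when switching from the demo data to real files: csv.DictReader returns every field as a string, so the timestamps need converting before they are compared. The sketch below uses io.StringIO as a stand-in for an open file and assumes integer timestamps and the same header names as the demo data; real timestamps would need whatever parsing fits their format.

```python
import csv
import io

# Stand-in for an open CSV file; the header names match the demo data above.
csv_text = "user_id,col1,timestamp\n1,a,10\n1,b,9\n1,a,2\n"

data = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    user_id = row["user_id"]
    col1 = row["col1"]
    # DictReader yields strings; as strings "10" < "9", so convert first.
    ts_new = int(row["timestamp"])
    ts_old = data.setdefault(user_id, {}).setdefault(col1, ts_new)
    if ts_new < ts_old:
        data[user_id][col1] = ts_new

# Order each user's col1 values by their earliest timestamp.
ordered = {
    user_id: [col1 for col1, _ in sorted(col1_to_ts.items(), key=lambda kv: kv[1])]
    for user_id, col1_to_ts in data.items()
}
print(ordered)
```

Note that user_id also stays a string here ("1" rather than 1) unless it is converted as well.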
I started learning data analysis using Python, and I have been trying to use the Adzuna dataset for my course project. The response from my API call looks like this:
{
"results": [
{
"salary_min": 50000,
"longitude": -0.776902,
"location": {
"__CLASS__": "Adzuna::API::Response::Location",
"area": [
"UK",
"South East England",
"Marlow"
],
"display_name": "Marlow, Buckinghamshire"
},
"salary_is_predicted": 0,
"description": "JavaScript Developer Corporate ...",
"__CLASS__": "Adzuna::API::Response::Job",
"created": "2013-11-08T18:07:39Z",
"latitude": 51.571999,
"redirect_url": "http://adzuna.co.uk/jobs/land/ad/129698749...",
"title": "Javascript Developer",
"category": {
"__CLASS__": "Adzuna::API::Response::Category",
"label": "IT Jobs",
"tag": "it-jobs"
},
"id": "129698749",
"salary_max": 55000,
"company": {
"__CLASS__": "Adzuna::API::Response::Company",
"display_name": "Corporate Project Solutions"
},
"contract_type": "permanent"
},
... another 19 samples here ...
],
"mean": 43900.46,
"__CLASS__": "Adzuna::API::Response::JobSearchResults",
"count": 74433
}
My goal is to extract 20 samples under "results" individually so that I can create a numpy dataset later for data analysis. So, I wrote Python like this:
item_dict = json.loads(response.text)
# Since "results" start/end with [ and ], Python treats it as a list. So, I need to remove them.
string_data = str(item_dict['results']).lstrip("[")
string_data = string_data.rstrip("]")
# Convert "results" string back to JSON, then extract each sample from 20 samples
json_results_data = json.loads(string_data)
for sample in json_results_data:
    print(sample)
However, json_results_data = json.loads(string_data) doesn't convert the "results" string to JSON well. I am new to Python, so I may be asking a stupid question, but please let me know if you can figure out an easy way to fix this. Thanks.
Stop stripping the square brackets; it's meant to be a list.
Try this:
item_dict = json.loads(response.text)
for sample in item_dict["results"]:
    print(sample)
Your issue was that you thought you had a single dict (JSON object), but you actually have a list of dicts.
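A small sketch of why the original round trip fails, using a made-up, trimmed-down response body in place of the real API response: json.loads already returns a list of dicts under "results", while str() on that list produces Python's repr (single quotes), which is not valid JSON.

```python
import json

# Hypothetical, trimmed-down response body shaped like the one in the question.
response_text = (
    '{"results": [{"id": "1", "title": "Javascript Developer"}, '
    '{"id": "2", "title": "Data Analyst"}], "count": 2}'
)

item_dict = json.loads(response_text)
results = item_dict["results"]  # already a Python list of dicts

titles = [sample["title"] for sample in results]
print(titles)

# str() produces Python repr (single quotes), which json.loads rejects.
try:
    json.loads(str(results))
    round_trip_ok = True
except json.JSONDecodeError:
    round_trip_ok = False
print(round_trip_ok)
```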
Solution
What you are trying to achieve is to organize the data from a JSON object. The first line of your code, item_dict = json.loads(response.text), already returns a dict object, so you can simply use that.
I will show two methods:
Using a pandas.DataFrame to organize your data.
Using a for loop to just print your data.
Note that the pandas.DataFrame also lets you quickly convert your data into a numpy array (use df.to_numpy()).
import pandas as pd
results = response.json()['results'] # same as item_dict['results']
df = pd.DataFrame(results)
print(df)
# df.to_numpy()
output:
a b c d e
0 1.0 2.0 NaN dog True
1 20.0 2.0 0.0 cat True
2 1.0 NaN NaN bird True
3 NaN 2.0 88.0 pig False
Instead, if you just want to print out each dictionary inside results, you could just do this:
for result in results:
    print(result)
Dummy Data
item_dict = {
'results': [
{'a': 1, 'b': 2, 'c': None, 'd': 'dog', 'e': True},
{'a': 20, 'b': 2, 'c': 0, 'd': 'cat', 'e': True},
{'a': 1, 'b': None, 'c': None, 'd': 'bird', 'e': True},
{'a': None, 'b': 2, 'c': 88, 'd': 'pig', 'e': False}
],
"mean": 43900.46,
"__CLASS__": "Adzuna::API::Response::JobSearchResults",
"count": 74433
}
I have a data structure like this:
data = [{
"name": "leopard",
"character": "mean",
"skills": ["sprinting", "hiding"],
"pattern": "striped",
},
{
"name": "antilope",
"character": "good",
"skills": ["running"],
},
.
.
.
]
Each key in the dictionaries has values of type integer, string or
list of strings (not all keys are in all dicts present), each
dictionary represents a row in a table; all rows are given as the list
of dictionaries.
How can I easily import this into Pandas? I tried
df = pd.DataFrame.from_records(data)
but here I get a "ValueError: arrays must all be same length" error.
The DataFrame constructor takes row-based arrays (amongst other structures) as data input. Therefore the following works:
data = [{
"name": "leopard",
"character": "mean",
"skills": ["sprinting", "hiding"],
"pattern": "striped",
},
{
"name": "antilope",
"character": "good",
"skills": ["running"],
}]
df = pd.DataFrame(data)
print(df)
Output:
character name pattern skills
0 mean leopard striped [sprinting, hiding]
1 good antilope NaN [running]
Here is the example JSON I'm working with.
{
":#computed_region_amqz_jbr4": "587",
":#computed_region_d3gw_znnf": "18",
":#computed_region_nmsq_hqvv": "55",
":#computed_region_r6rf_p9et": "36",
":#computed_region_rayf_jjgk": "295",
"arrests": "1",
"county_code": "44",
"county_code_text": "44",
"county_name": "Mifflin",
"fips_county_code": "087",
"fips_state_code": "42",
"incident_count": "1",
"lat_long": {
"type": "Point",
"coordinates": [
-77.620031,
40.612749
]
}
}
I have been able to pull out the columns I want, but I'm having trouble with "lat_long". So far my code looks like:
# PRINTS OUT SPECIFIED COLUMNS
col_titles = ['county_name', 'incident_count', 'lat_long']
df = df.reindex(columns=col_titles)
However, 'lat_long' is added to the data frame like this: {'type': 'Point', 'coordinates': [-75.71107, 4...
I thought that once I figured out how to properly add the coordinates to the data frame, I would then create two separate columns, one for latitude and one for longitude.
Any help with this matter would be appreciated. Thank you.
If I haven't misunderstood your requirements, you can try it this way with json_normalize. I have only added a demo for a single JSON object; you can use apply or a lambda for multiple datasets.
import pandas as pd
df = {":#computed_region_amqz_jbr4":"587",":#computed_region_d3gw_znnf":"18",":#computed_region_nmsq_hqvv":"55",":#computed_region_r6rf_p9et":"36",":#computed_region_rayf_jjgk":"295","arrests":"1","county_code":"44","county_code_text":"44","county_name":"Mifflin","fips_county_code":"087","fips_state_code":"42","incident_count":"1","lat_long":{"type":"Point","coordinates":[-77.620031,40.612749]}}
# pd.json_normalize is the top-level function in pandas 1.0 and later
df = pd.json_normalize(df)
# .copy() avoids a SettingWithCopyWarning when adding columns below
df_modified = df[['county_name', 'incident_count', 'lat_long.type']].copy()
# GeoJSON points store coordinates as [longitude, latitude]
df_modified['lng'] = df['lat_long.coordinates'][0][0]
df_modified['lat'] = df['lat_long.coordinates'][0][1]
print(df_modified)
Here is another way to do it:
df1 = pd.json_normalize(df)
# GeoJSON points store coordinates as [longitude, latitude]
pd.concat([df1, df1['lat_long.coordinates'].apply(pd.Series) \
    .rename(columns={0: 'long', 1: 'lat'})], axis=1) \
    .drop(columns=['lat_long.coordinates', 'lat_long.type'])
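For several records at once, a sketch along the following lines may work. It assumes the coordinates follow the GeoJSON [longitude, latitude] order, and the second record is entirely made up for illustration:

```python
import pandas as pd

# Hypothetical records shaped like the example JSON, trimmed to a few fields.
records = [
    {"county_name": "Mifflin", "incident_count": "1",
     "lat_long": {"type": "Point", "coordinates": [-77.620031, 40.612749]}},
    {"county_name": "Example", "incident_count": "3",
     "lat_long": {"type": "Point", "coordinates": [-75.0, 41.0]}},
]

# Nested dicts become dotted column names; the coordinate lists stay as lists.
flat = pd.json_normalize(records)

# Assumption: coordinates follow the GeoJSON [longitude, latitude] order.
coords = pd.DataFrame(
    flat["lat_long.coordinates"].tolist(),
    columns=["lng", "lat"],
    index=flat.index,
)

result = pd.concat([flat[["county_name", "incident_count"]], coords], axis=1)
print(result)
```

Building the coordinate frame from a list of lists avoids calling apply(pd.Series) on every row, which tends to be slow on larger frames.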
So I have some data in json format, here's a snippet:
"sell": [
{
"Rate": 0.001425,
"Quantity": 537.27713514
},
{
"Rate": 0.00142853,
"Quantity": 6.59174681
}
]
What's the easiest way to access Rate and Quantity so that I can plot it in Matplotlib? Do I have to flatten/normalize it, or create a for loop to generate an array, or can I use pandas or some other library to convert it into matplotlib friendly data automatically?
I know matplotlib can handle inputs in a few ways
plt.plot([1,2,3,4], [1,4,9,16])
plt.plot([1,1],[2,4],[3,9],[4,16])
The simplest way is to use the DataFrame constructor with DataFrame.plot:
import pandas as pd
d = {"sell": [
{
"Rate": 0.001425,
"Quantity": 537.27713514
},
{
"Rate": 0.00142853,
"Quantity": 6.59174681
}
]}
df = pd.DataFrame(d['sell'])
print (df)
Quantity Rate
0 537.277135 0.001425
1 6.591747 0.001429
df.plot(x='Quantity', y='Rate')
EDIT:
It is also possible to use read_json to build the DataFrame.
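A minimal sketch of the read_json route, assuming the "sell" list is available as JSON text. Here it is re-serialized from the dictionary and wrapped in io.StringIO as a stand-in for a file or HTTP response body:

```python
import io
import json

import pandas as pd

d = {"sell": [
    {"Rate": 0.001425, "Quantity": 537.27713514},
    {"Rate": 0.00142853, "Quantity": 6.59174681},
]}

# read_json expects JSON text via a path or file-like object, so serialize
# the list of records and wrap it in StringIO.
json_text = json.dumps(d["sell"])
df = pd.read_json(io.StringIO(json_text), orient="records")
print(df)
```

The resulting frame can then be plotted with df.plot(x='Quantity', y='Rate') just like the one built with the constructor.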