I am trying to convert this string into a pandas DataFrame where each value before a colon becomes a column header and the value after it goes into that column. This is what the data structure looks like:
{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,"abbreviation":"CHA","city":"Charlotte","conference":"East","division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"}}
The string I am converting has hundreds of games, but I stopped pasting after two. What is this data structure called, and how can I quickly move it into a DataFrame?
It looks like JSON, but it's not structured in a very useful way: everything is nested under the "data" key. To get it into a DataFrame, use a combination of Python's json module and pandas.json_normalize().
import json
import pandas as pd
data_string = """{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,
"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,
"abbreviation":"CHA","city":"Charlotte","conference":"East",
"division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},
"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"}}]}"""
raw_data = json.loads(data_string)        # parse the JSON text into a Python dict
df = pd.json_normalize(raw_data['data'])  # flatten the list of game records
There is a lot of redundant information about the teams: look at "home_team" for the two games and you'll see the information is identical. You could pull all the team data into a separate DataFrame and keep only the team "id" or "abbreviation" in the DataFrame with all the games, as sketched below.
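A minimal sketch of that split, reusing raw_data and df from the snippet above (json_normalize flattens the nested team dicts into dot-separated columns such as home_team.abbreviation):
# Teams table: collect the nested team dicts from each game and de-duplicate.
team_records = []
for game in raw_data['data']:
    for side in ('home_team', 'visitor_team'):
        if side in game:                      # the pasted sample is truncated,
            team_records.append(game[side])   # so guard against missing keys
teams = pd.DataFrame(team_records).drop_duplicates(subset='id').set_index('id')

# Games table: keep only the team ids and drop the other flattened team columns.
games = df.rename(columns={'home_team.id': 'home_team_id',
                           'visitor_team.id': 'visitor_team_id'})
games = games[[c for c in games.columns
               if not c.startswith(('home_team.', 'visitor_team.'))]]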
Related
I have imported a JSON file with 10000 rows and saved it in the variable data.
I want a table that shows the father's name, the mother's name, the living situation, and the number of pets.
For this, I wanted to use a pandas (Python) DataFrame. Unfortunately, I don't know how to extract the data; I always get a Family column with {...} entries.
df = pd.DataFrame.from_dict(data)
pd.DataFrame.from_dict(data['Family'])
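If data is a list of records shaped roughly like {"Family": {"father": ..., "mother": ..., ...}, ...}, pd.json_normalize may already do what you want. A hedged sketch; the nested field names below are guesses, so adjust them to your file:
import pandas as pd

# json_normalize flattens the nested Family dict into dot-separated columns.
df = pd.json_normalize(data)
# Hypothetical column names; replace with the keys your file actually uses.
table = df[['Family.father', 'Family.mother',
            'Family.living_situation', 'Family.pets']]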
I have a CSV formatted like the example below, which I'd like to import into pandas. Basically, it is a bunch of measurements and data from multiple samples.
dataformat1,sample1,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sample1,completion_time_for_sample1
...
dataformat1,sampleN,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sampleN,completion_time_for_sampleN
Essentially, the first field of the CSV describes the data format of the rest of that line. I would like to import all of this into a single pandas DataFrame, with each value filled into the relevant section. I am currently planning to do this by:
prepending the line number into the csv
reading each dataformat# into a separate dataframe
combining, sorting, and ffill/bfill/unpivot(?) shenanigans to fill in all the data
I assume this will work, but I am wondering if there's a cleaner way to do this either within pandas or using some other library. It is a somewhat common data logging paradigm in the work I do.
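One possible sketch of that plan without the line-number step, assuming the file is small enough to read line by line; the file path and column names below are placeholders guessed from the sample. The idea is to split the rows by their format tag, number the blocks as you go, then merge on the block number:
import pandas as pd

# Split the raw lines by the format tag in the first field and number the
# blocks: every dataformat1 line starts a new sample.
rows = {'dataformat1': [], 'dataformat2': [], 'dataformat3': []}
sample_no = 0
with open('measurements.csv') as f:          # hypothetical file name
    for line in f:
        fields = line.strip().split(',')
        if fields[0] not in rows:            # skip blank or unexpected lines
            continue
        if fields[0] == 'dataformat1':
            sample_no += 1
        rows[fields[0]].append([sample_no] + fields[1:])

samples = pd.DataFrame(rows['dataformat1'],
                       columns=['sample_no', 'sample', 'source', 'start_time'])
measurements = pd.DataFrame(rows['dataformat2'],
                            columns=['sample_no', 'parameter', 'spec_a',
                                     'spec_b', 'measurement'])
results = pd.DataFrame(rows['dataformat3'],
                       columns=['sample_no', 'result_code', 'completion_time'])

# One wide table: every measurement row carries its sample's metadata.
combined = (measurements.merge(samples, on='sample_no')
                        .merge(results, on='sample_no'))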
Imagine I have a pd.DataFrame containing a lot of rows, and I want to attach some string data describing/summarizing it.
One way to do this would be to add a column with this value, but that would be a complete waste of memory (e.g. when the string is a long specification of an ML model).
I could also insert it into the file name, but this has limitations and is very impractical for saving/loading.
I could store this data in a separate file, but that is not what I want.
I could subclass pd.DataFrame and add this field, but then I am unsure whether pickle save/load would still work properly.
So is there some really clean way to store something like a "description" in a pandas DataFrame? Preferably one that would survive to_pickle/read_pickle operations.
[EDIT]: Of course, this raises further questions, such as what happens if we concatenate DataFrames carrying such information, but let's stick to the saving/loading problem only.
I googled it a bit; you can name a DataFrame as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.ones([10, 10]))
df.name = 'My name'
print(df.name)
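If you are on pandas 1.0 or newer, there is also df.attrs, a plain dict that is explicitly meant for this kind of metadata (still documented as experimental). A small sketch; round-trip behaviour has varied between versions, so test it against your pandas:
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([10, 10]))
df.attrs['description'] = 'long specification of the ML model goes here'

df.to_pickle('df.pkl')                    # hypothetical path
restored = pd.read_pickle('df.pkl')
print(restored.attrs.get('description')) # check whether it survived the round-trip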
I have a JSON string that contains data like this:
{"key1":"1","key2":2,"key 3":"text3",
"key4":{"subkey4.1":"text4.1","subkey4.2":"date"},
"key5":[{"subkey5.1":"text5.1","subkey5.2":"text5.2"}],"key6":"6"}
I would like to transform it into a pandas DataFrame, using the keys as column names, the subkeys as secondary columns, and the key values as rows under the respective columns.
I'm guessing a pivot function needs to be used, but I'm not sure how.
This is a snippet of an actual string (long names and https removed for brevity): https://trinket.io/python/8e989e4d04
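It may be less a pivot than a flattening job. A hedged sketch with pd.json_normalize, which turns the nested dict under key4 into dot-separated columns while leaving the list under key5 as a single object cell:
import json
import pandas as pd

s = '''{"key1":"1","key2":2,"key 3":"text3",
"key4":{"subkey4.1":"text4.1","subkey4.2":"date"},
"key5":[{"subkey5.1":"text5.1","subkey5.2":"text5.2"}],"key6":"6"}'''

df = pd.json_normalize(json.loads(s))
# One row; key4 becomes columns 'key4.subkey4.1' and 'key4.subkey4.2',
# while key5 stays a list you could explode/normalize separately.
print(df.columns.tolist())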
I have a CSV file. It looks something like this:
name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,
I would like to extract the data in the id column and place it inside a data structure. For this, I used Python pandas. Here is the code for doing this:
import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
column_items = df['id']
I would like to check whether there is a duplicate among the data items in the id column. The data items are stored in column_items. In this case, there is a duplicate.
I am using Python 2.7 and the pandas library.
To find out whether there are duplicate IDs in that whole column, do
df['id'].duplicated().any()
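To also see which rows share an id, keep=False marks every occurrence of a duplicated value:
dupes = df[df['id'].duplicated(keep=False)]
print(dupes)   # the BBB,2222 and DDD,2222 rows in the example above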