Creating multiple dataframes using a for loop - python

Hi, I have code which looks like this:
with open("file123.json") as json_file:
    data = json.load(json_file)

df_1 = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in data["spt"][1].items()]))
df_1_made = pd.json_normalize(json.loads(df_1.to_json(orient="records"))).T.drop(["content.id", "shortname", "name"])
df_2 = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in data["spt"][2].items()]))
df_2_made = pd.json_normalize(json.loads(df_2.to_json(orient="records"))).T.drop(["content.id", "shortname", "name"])
df_3 = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in data["spt"][3].items()]))
df_3_made = pd.json_normalize(json.loads(df_3.to_json(orient="records"))).T.drop(["content.id", "shortname", "name"])
The dataframes are built from a JSON file.
The problem is that I am dealing with different JSON files, and each one can lead to a different number of dataframes: the code above handles 3, but another file may need 7. Is there any way to make a for loop that takes the length of the data:
length = len(data["spt"])
and makes the correct number of dataframes from it, so I do not need to do it manually?

The simplest option here is to put all your dataframes into a dictionary or a list: first define a function that creates the dataframe, then use a list comprehension.
def create_df(items):
    df = pd.DataFrame(dict((k, pd.Series(v)) for k, v in items))
    df = pd.json_normalize(
        json.loads(df.to_json(orient="records"))
    ).T.drop(["content.id", "shortname", "name"])
    return df

my_list_of_dfs = [create_df(x.items()) for x in data["spt"]]
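If you prefer named access like df_1, df_2, a dictionary comprehension keyed by position works the same way. A minimal runnable sketch, using hypothetical sample data in place of data["spt"] and a simplified create_df (without the json_normalize/drop step, since the sample has none of those columns):

```python
import pandas as pd

# Hypothetical stand-in for data["spt"]: a list of dicts,
# each mapping column names to lists of values
data = {"spt": [
    {"a": [1, 2], "b": [3, 4]},
    {"a": [5, 6], "b": [7, 8]},
]}

def create_df(items):
    # Each key becomes a column, as in the answer above
    return pd.DataFrame(dict((k, pd.Series(v)) for k, v in items))

# Keys "df_1", "df_2", ... however many entries the file has
my_dfs = {f"df_{i}": create_df(entry.items())
          for i, entry in enumerate(data["spt"], start=1)}
```

Access is then `my_dfs["df_1"]` instead of a bare variable, and the number of dataframes always matches the file.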

Related

How to create a dataframe?

df4 = []
for i in (my_data.points.values.tolist()[0]):
    df3 = pd.json_normalize(j)
    df4.append(df3)
df5 = pd.DataFrame(df4)
df5.head()
When I run this code I get this error: Must pass 2-d input. shape=(16001, 1, 3)
pd.json_normalize converts JSON data to table format, but what pd.DataFrame needs here is an array of dictionaries.
For example
dict_list = [
    {"id": 1, "name": "apple", "price": 10},
    {"id": 1, "name": "orange", "price": 20},
    {"id": 1, "name": "pineapple", "price": 15},
]
df = pd.DataFrame(dict_list)
In your case
df4 = []
for i in (my_data.points.values.tolist()[0]):
    # df3 = pd.json_normalize(j) -- since the structure is not mentioned,
    # I'm assuming "i" is a dictionary which holds the relevant row
    df4.append(i)
df5 = pd.DataFrame(df4)
df5.head()
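The distinction matters: a list of plain dicts converts directly with pd.DataFrame, while pd.json_normalize is for flattening nested structures. A small runnable sketch with made-up records:

```python
import pandas as pd

# Flat records: pd.DataFrame handles a list of dicts directly
rows = [{"id": 1, "name": "apple"}, {"id": 2, "name": "orange"}]
df_direct = pd.DataFrame(rows)

# Nested records: json_normalize flattens inner dicts into
# dotted column names like "info.name"
df_norm = pd.json_normalize([{"id": 3, "info": {"name": "pineapple"}}])
```

Appending normalized one-row DataFrames into a list and passing that list to pd.DataFrame is what produces the "Must pass 2-d input" error above.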

Appending dictionaries generated from a loop to the same dataframe

I have a loop within a nested loop that generates 6 dictionaries at the end. Each dictionary has the same keys but different values. At the end of every iteration I would like to append the dictionary to the same dataframe, but it keeps failing.
At the end I would like to have a table with 6 columns plus an index which holds the keys.
This is the idea behind what I'm trying to do:
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)
    df_final = pd.concat([dictionary, df])
I get the error:
cannot concatenate object of type '<class 'dict'>'; only series and dataframe objs are valid
I created a practice dataset, if necessary, here:
letts = [('a', 'b', 'c'), ('e', 'f', 'g'), ('h', 'i', 'j'), ('k', 'l', 'm'), ('n', 'o', 'p')]
numns = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15)]
dictionary = dict()
for i in letts:
    for j in numns:
        dictionary = dict(zip(i, j))
I am confused by your practice dataset, but the modifications below may give you an idea:
df_final = pd.DataFrame()
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)  # if the dict values are scalars, an index must be passed
    df_final = pd.concat([df_final, df])
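To make this concrete with the practice lists: the nested loop above overwrites `dictionary` on every iteration, so only the last zip survives. One reading of the goal (each letter/number pair becomes a column, keys become the index) can be sketched like this, with the hypothetical column labels col_0, col_1 standing in for whatever names you want:

```python
import pandas as pd

letts = [('a', 'b', 'c'), ('e', 'f', 'g')]
numns = [(1, 2, 3), (4, 5, 6)]

dictionary = {}
for idx, (i, j) in enumerate(zip(letts, numns)):
    # Store each zipped dict under its own label instead of
    # overwriting `dictionary` each time round the loop
    dictionary[f"col_{idx}"] = dict(zip(i, j))

# A dict of dicts becomes a DataFrame directly: outer keys are
# columns, inner keys form the index (missing cells become NaN)
df_final = pd.DataFrame(dictionary)
```

No concat inside the loop is needed; building the dict first and converting once at the end is both simpler and faster.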

Cannot assign to function call when looping through and converting excel files

With this code:
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    'df{}'.format(str(i)) = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
I get this error:
'df{}'.format(str(i)) = pd.read_excel('test.xlsx',sheet_name=snlist,
skiprows=range(6))
^ SyntaxError: cannot assign to function call
I can't understand the error or how to solve it. What's the problem?
df + str(i) also returns an error.
I want the result to look like:
df1 = pd.read_excel... list1...
df2 = pd.read_excel... list2...
You can't assign the result of pd.read_excel to 'df{}'.format(str(i)), which is just a string that looks like "df1", "df2", etc. That is why you get this error. The message is probably confusing because Python treats the left-hand side as a "function call".
It seems like you want a list or a dictionary of DataFrames instead.
To do this, assign the result of pd.read_excel to a variable, e.g. df, and then append it to a list or add it to a dictionary of DataFrames.
As a list:
dataframes = []
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in zip(range(1, 13), sn):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes.append(df)
As a dictionary:
dataframes = {}
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in zip(range(1, 13), sn):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes[i] = df
In both cases, you can access the DataFrames by indexing like this:
for i in range(len(dataframes)):
    print(dataframes[i])
    # Note indexes start at 0 here instead of 1;
    # you may want to change your `range` above to start at 0
Or more simply:
for df in dataframes:
    print(df)
In the case of the dictionary, you'd probably want:
for i, df in dataframes.items():
    print(i, df)
    # Here, `i` is the key and `df` is the actual DataFrame
If you really do want df1, df2 etc as the keys, then do this instead:
dataframes[f'df{i}'] = df
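As a side note, pandas can build that whole dictionary of sheets in a single call: passing sheet_name=None to pd.read_excel returns a dict mapping sheet name to DataFrame. A self-contained sketch (it writes a small two-sheet workbook first, and assumes an .xlsx engine such as openpyxl is installed, which pandas uses for .xlsx by default):

```python
import pandas as pd

# Write a small two-sheet workbook so the example is self-contained
with pd.ExcelWriter("test.xlsx") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

# sheet_name=None reads every sheet at once and returns
# {sheet_name: DataFrame}, so no loop over sheet_names is needed
dataframes = pd.read_excel("test.xlsx", sheet_name=None)
```

The original skiprows=range(6) argument can be passed here too and applies to every sheet.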

Extract part of a json keys value and combine

I have this JSON dataset. From it I only want the "column_names" key and its values, and the "data" key and its values. Each value of column_names corresponds to a value in data. How do I combine only these two keys in Python for analysis?
{"dataset":{"id":42635350,"dataset_code":"MSFT","column_names":
["Date","Open","High","Low","Close","Volume","Dividend","Split",
"Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"],
"frequency":"daily","type":"Time Series",
"data":[["2017-12-28",85.9,85.93,85.55,85.72,10594344.0,0.0,1.0,83.1976157998082,
83.22667201021558,82.85862667838872,83.0232785373639,10594344.0],
["2017-12-27",85.65,85.98,85.215,85.71,14678025.0,0.0,1.0,82.95548071308001,
83.27509902756123,82.53416566217294,83.01359313389476,14678025.0]]}}
for cnames in data['dataset']['column_names']:
    print(cnames)
for cdata in data['dataset']['data']:
    print(cdata)
The for loops give me the column names and data values I want, but I am not sure how to combine them into a Python dataframe for analysis.
Ref: the above piece of data is from the Quandl website.
data = {
"dataset": {
"id":42635350,"dataset_code":"MSFT",
"column_names": ["Date","Open","High","Low","Close","Volume","Dividend","Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"],
"frequency":"daily",
"type":"Time Series",
"data":[
["2017-12-28",85.9,85.93,85.55,85.72,10594344.0,0.0,1.0,83.1976157998082, 83.22667201021558,82.85862667838872,83.0232785373639,10594344.0],
["2017-12-27",85.65,85.98,85.215,85.71,14678025.0,0.0,1.0,82.95548071308001,83.27509902756123,82.53416566217294,83.01359313389476,14678025.0]
]
}
}
Should the following code do what you want?
import pandas as pd

df = pd.DataFrame(columns=data['dataset']['column_names'])
for i, data_row in enumerate(data['dataset']['data']):
    df.loc[i] = data_row
It's quite simple:
cols = data['dataset']['column_names']
rows = data['dataset']['data']
labeled_data = [dict(zip(cols, d)) for d in rows]
The following snippet should work for you
import pandas as pd
df = pd.DataFrame(data['dataset']['data'],columns=data['dataset']['column_names'])
Check the following link to learn more
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
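The constructor approach in the last snippet can be checked with a trimmed-down version of the dataset (fewer columns, same shape of nesting):

```python
import pandas as pd

# Trimmed-down stand-in for the Quandl response above
data = {"dataset": {
    "column_names": ["Date", "Open", "Close"],
    "data": [["2017-12-28", 85.9, 85.72],
             ["2017-12-27", 85.65, 85.71]],
}}

# The "data" rows become the table body, "column_names" the header
df = pd.DataFrame(data["dataset"]["data"],
                  columns=data["dataset"]["column_names"])
```

This single constructor call pairs each column name with its position in every row, which is exactly the correspondence the question describes.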

Looping at specified indexes

I have 2 large lists, each with about 100 000 elements each and one being larger than the other, that I want to iterate through. My loop looks like this:
for i in list1:
    for j in list2:
        function()
This looping currently takes too long. However, every timestamp in list1 needs to be looked up in list2, and beyond a certain index there are no more matching entries in list2, so looping from known indexes might be faster; the problem is I do not know how to do that.
In my project, list2 is a list of dicts that have three keys: value, name, and timestamp. list1 is a list of the timestamps in order. The function takes the value for a timestamp and puts it into a CSV file in the appropriate name column.
This is an example of entries from list1:
[1364310855.004000, 1364310855.005000, 1364310855.008000]
This is what list2 looks like:
{"name":"vehicle_speed","value":2,"timestamp":1364310855.004000}
{"name":"accelerator_pedal_position","value":4,"timestamp":1364310855.004000}
{"name":"engine_speed","value":5,"timestamp":1364310855.005000}
{"name":"torque_at_transmission","value":-3,"timestamp":1364310855.008000}
{"name":"vehicle_speed","value":1,"timestamp":1364310855.008000}
In my final csv file, I should have something like this:
http://s000.tinyupload.com/?file_id=03563948671103920273
If you want this to be fast, you should restructure the data that you have in list2 in order to speedup your lookups:
# The following code converts list2 into a multivalue dictionary
from collections import defaultdict

list2_dict = defaultdict(list)
for item in list2:
    list2_dict[item['timestamp']].append((item['name'], item['value']))
This gives you a much faster way to look up your timestamps:
print(list2_dict)
defaultdict(<class 'list'>, {
    1364310855.008: [('torque_at_transmission', -3), ('vehicle_speed', 1)],
    1364310855.005: [('engine_speed', 5)],
    1364310855.004: [('vehicle_speed', 2), ('accelerator_pedal_position', 4)]})
Lookups will be much more efficient when using list2_dict:
for i in list1:
    for j in list2_dict[i]:
        # here j is a tuple in the form (name, value)
        function()
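A runnable version of this grouping, using a slice of the question's sample data (the unspecified `function()` is replaced here with simply collecting rows):

```python
from collections import defaultdict

list1 = [1364310855.004, 1364310855.005]
list2 = [
    {"name": "vehicle_speed", "value": 2, "timestamp": 1364310855.004},
    {"name": "engine_speed", "value": 5, "timestamp": 1364310855.005},
]

# One pass over list2 builds the timestamp -> [(name, value), ...] index
lookup = defaultdict(list)
for item in list2:
    lookup[item["timestamp"]].append((item["name"], item["value"]))

# Each timestamp lookup is now O(1) instead of a full scan of list2,
# turning the O(n*m) nested loop into O(n + m)
rows = [(ts, name, value) for ts in list1 for name, value in lookup[ts]]
```

With ~100,000 elements per list this is the difference between billions of comparisons and a couple hundred thousand dict operations.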
You appear to only want to use the elements of list2 that correspond to indexes i*2 and i*2 + 1, that is elements 0, 1, then 2, 3, and so on.
You only need one loop:
for i in range(len(list1)):
    j = list2[i * 2]
    k = list2[i * 2 + 1]
    # Process function using j and k
You will only process to the end of list1.
I think the pandas module is a perfect match for your goals...
import ujson  # 'ujson' (UltraJSON) is faster than the standard 'json'
import pandas as pd

filter_list = [1364310855.004000, 1364310855.005000, 1364310855.008000]

def file2list(fn):
    with open(fn) as f:
        return [ujson.loads(line) for line in f]

# Use pd.read_json('data.json') instead of pd.DataFrame(file2list('data.json'))
# if you have a proper JSON file
#
# df = pd.read_json('data.json')
df = pd.DataFrame(file2list('data.json'))
# filter DataFrame with 'filter_list'
df = df[df['timestamp'].isin(filter_list)]
# convert UNIX timestamps to readable format
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
# pivot data frame
# fill NaN's with zeroes
df = df.pivot(index='timestamp', columns='name', values='value').fillna(0)
# save data frame to CSV file
df.to_csv('output.csv', sep=',')
#pd.set_option('display.expand_frame_repr', False)
#print(df)
output.csv
timestamp,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0.0,0.0,-3.0,1.0
P.S. I don't know where you got the [Latitude, Longitude] columns from, but it's pretty easy to add them to your result DataFrame; just add the following lines before calling df.to_csv():
df.insert(0, 'latitude', 0)
df.insert(1, 'longitude', 0)
which would result in:
timestamp,latitude,longitude,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,0,0,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0,0,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0,0,0.0,0.0,-3.0,1.0
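The pivot-and-fill step at the heart of that answer can be verified in isolation with two of the sample records (no files involved):

```python
import pandas as pd

records = [
    {"name": "vehicle_speed", "value": 2, "timestamp": 1364310855.004},
    {"name": "engine_speed", "value": 5, "timestamp": 1364310855.005},
]
df = pd.DataFrame(records)

# pivot turns each distinct name into a column, indexed by timestamp;
# combinations that never occurred become NaN, which fillna(0) zeroes
wide = df.pivot(index="timestamp", columns="name", values="value").fillna(0)
```

This is what produces the one-row-per-timestamp, one-column-per-signal layout shown in output.csv above.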
