Python dictionary with multiple lists into pandas DataFrame - python

I'm trying to get a dictionary with multiple list values and one string into a single DataFrame.
Here's the information I'm trying to get into the DataFrame:
{'a': ['6449.70000', '1', '1.000'],
'b': ['6446.40000', '1', '1.000'],
'c': ['6449.80000', '0.04879000'],
'h': ['6449.90000', '6449.90000'],
'l': ['6362.00000', '6120.30000'],
'o': '6442.30000',
'p': ['6413.12619', '6353.50910'],
't': [5272, 16027],
'v': ['1299.86593468', '4658.87787321']}
The 3 values represented by key "a" all have their own names, say a1, a2 and a3, then b1, b2 and b3, and so on. Preferably I want to define them myself. This goes for all the keys, so there should be 19 columns in total.
I've read a lot about this.
Take multiple lists into dataframe
https://pythonprogramming.net/data-analysis-python-pandas-tutorial-introduction/
http://pbpython.com/pandas-list-dict.html
Video tutorials youtube
Based on these readings I think I could iterate through it with a for loop, build separate dataframes and then join/merge them. But that seems like more work than should be required.
What is the most efficient / readable / logic way to do this using Python 3.6?

Do the cleaning up in pure Python:
colnames = []
values = []
for key, value in d.items():  # dict.iteritems() is Python 2; use .items() on Python 3
    if isinstance(value, list):
        for c in range(len(value)):
            colnames.append(key + str(c + 1))
        values += value
    else:
        colnames.append(key + '1')
        values.append(value)
df = pd.DataFrame(values, index=colnames).T
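An equivalent sketch (a variant of the loop above, not the answer's exact code) builds one flat dict first and hands it to pandas as a single row; f-strings work from Python 3.6 on:

```python
import pandas as pd

# The sample dict from the question
d = {'a': ['6449.70000', '1', '1.000'],
     'b': ['6446.40000', '1', '1.000'],
     'c': ['6449.80000', '0.04879000'],
     'h': ['6449.90000', '6449.90000'],
     'l': ['6362.00000', '6120.30000'],
     'o': '6442.30000',
     'p': ['6413.12619', '6353.50910'],
     't': [5272, 16027],
     'v': ['1299.86593468', '4658.87787321']}

flat = {}
for key, value in d.items():
    if isinstance(value, list):
        # Number the columns a1, a2, a3, ... per key
        for i, item in enumerate(value, start=1):
            flat[f'{key}{i}'] = item
    else:
        flat[f'{key}1'] = value

df = pd.DataFrame([flat])
print(df.shape)  # (1, 19)
```

To use your own column names instead of the generated ones, rename afterwards with `df.columns = [...]` or `df.rename(columns={...})`.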

Related

Returning First Key in a Nested Dictionary as Column Value

I'm trying to learn Python, and at the moment I'm working with the Alpaca API to get historical data. I'm working with the raw data, which looks like this:
{'SPY': [{'t': '2022-01-03T05:00:00Z', 'o': 476.3, 'h': 477.85, 'l': 473.85, 'c': 477.71, 'v': 72604064, 'n': 534803, 'vw': 476.526964}, {'t': '2022-01-04T05:00:00Z', 'o': 479.22......}
What I want to get is it looking like this (just using 4 columns as an example).
ticker  t                     o
SPY     2022-01-03T05:00:00Z  476.3
SPY     2022-01-04T05:00:00Z  479.22
I've searched around, and tried using something like json_normalize(my_data), but that just results in the below.
SPY
0 [{'t': '2022-01-03T05:00:00Z', 'o': 476.3, 'h'...
I'm not sure if this is exactly what you want, but here is one way to print it:
for ticker, ticker_list in my_dict.items():
    for bar in ticker_list:  # renamed from my_dict to avoid shadowing the outer dict
        print(ticker)
        for value in bar.values():
            print(value)
        print()
If you want to build a list of dicts instead:
result = []
for ticker, ticker_list in my_dict.items():
    for bar in ticker_list:
        new_dict = dict(bar)  # copy so the original dict is not mutated
        new_dict["ticker"] = ticker
        result.append(new_dict)
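Since the original goal was a table with a ticker column, the flattened result list feeds straight into pandas. A minimal runnable sketch, using hypothetical sample bars with only the t and o fields for brevity:

```python
import pandas as pd

# Hypothetical sample mirroring the Alpaca raw-data shape
my_dict = {'SPY': [
    {'t': '2022-01-03T05:00:00Z', 'o': 476.3},
    {'t': '2022-01-04T05:00:00Z', 'o': 479.22},
]}

result = []
for ticker, bars in my_dict.items():
    for bar in bars:
        row = dict(bar)
        row['ticker'] = ticker  # tag each bar with its ticker
        result.append(row)

df = pd.DataFrame(result)
print(df[['ticker', 't', 'o']])
```

This produces one row per bar with a ticker column, matching the desired output in the question.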

How to extract specific values from a list of dictionaries in python

I have a list of dictionaries, shown below, and I would like to extract the partID and the corresponding quantity for a specific orderID using Python, but I don't know how to do it.
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
{'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
{'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
{'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]
So, for example, when I search my dataList for orderID == 'D00003', I would like to receive both the partID ('P00001') and the corresponding quantity (1) of the specified order. How would you go about this? Any help is much appreciated.
It depends.
If you are not going to do this often, you can just iterate over the list of dictionaries until you find the "correct" one:
search_for_order_id = 'D00001'
for d in dataList:
    if d['orderID'] == search_for_order_id:
        print(d['partID'], d['quantity'])
        break  # assuming orderID is unique
Outputs
P00001 2
Since this solution is O(n), if you are going to do this search many times the cost will add up.
In that case it will be better to transform the data to a dictionary of dictionaries, with orderID being the outer key (again, assuming orderID is unique):
better = {d['orderID']: d for d in dataList}
This is also O(n) but you pay it only once. Any subsequent lookup is an O(1) dictionary lookup:
search_for_order_id = 'D00001'
print(better[search_for_order_id]['partID'], better[search_for_order_id]['quantity'])
Also outputs
P00001 2
I believe you would like to familiarize yourself with the pandas package, which is very useful for data analysis. If these are the kinds of problems you're up against, I advise you to take the time to work through a pandas tutorial. It can do a lot, and is very popular.
Your dataList is very similar to a DataFrame structure, so what you're looking for would be as simple as:
import pandas as pd
df = pd.DataFrame(dataList)
df[df['orderID']=='D00003']
You can use this:
results = [[x['orderID'], x['partID'], x['quantity']] for x in dataList]
for i in results:
    print(i)
Also,
results = [['Order ID: ' + x['orderID'], 'Part ID: ' + x['partID'], 'Quantity: ' + str(x['quantity'])] for x in dataList]
To get the partID you can make use of the built-in filter function:
myData = [{"x": 1, "y": 1}, {"x": 2, "y": 5}]
filtered = filter(lambda item: item["x"] == 1, myData)  # search for an item with x equal to 1
# Get the next item from the filter (the matching item) and read its y value.
print(next(filtered)["y"])
You should be able to apply this to your situation.
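Applied to the question's dataList, the same filter pattern reads (a sketch):

```python
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
            {'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
            {'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
            {'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]

matching = filter(lambda d: d['orderID'] == 'D00003', dataList)
order = next(matching)  # raises StopIteration if no order matches
print(order['partID'], order['quantity'])  # P00001 1
```

Note that filter is lazy, so it still walks the list until the first match; for repeated lookups the dict-of-dicts approach above remains the faster choice.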

Bokeh: Column DataSource part giving error

I am trying to create an interactive Bokeh plot that holds multiple data series, and I am not sure why I am getting the error:
ValueError: expected an element of ColumnData(String, Seq(Any)),got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
source = ColumnDataSource(data={
    'x'      : data.loc[1970].fertility,
    'y'      : data.loc[1970].life,
    'pop'    : (data.loc[1970].population / 20000000) + 2,
    'region' : data.loc[1970].region,
})
I have tried two different data sets imported from Excel and am running out of ideas on exactly why this is happening.
As the name suggests, the ColumnDataSource is a data structure for storing columns of data. This means that the value of every key in .data must be a column, i.e. a Python list, a NumPy array, or a Pandas series. But you are trying to assign plain numbers as the values, which is what the error message is telling you:
expected an element of ColumnData(String, Seq(Any))
This is saying that the acceptable, expected values are dicts that map strings to sequences. But what you passed is clearly not that:
got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
The value for x for instance is just the number 6.794 and not an array or list, etc.
You can easily do this:
source = ColumnDataSource({str(c): v.values for c, v in df.items()})
This would be a solution. I think the problem is in how the data is extracted from the DataFrame:
source = ColumnDataSource(data={
    'x'      : data[data['Year'] == 1970]['fertility'],
    'y'      : data[data['Year'] == 1970]['life'],
    'pop'    : (data[data['Year'] == 1970]['population'] / 20000000) + 2,
    'region' : data[data['Year'] == 1970]['region']
})
I had this same problem using this same dataset.
My solution was to import the CSV in pandas using "Year" as the index column:
data = pd.read_csv(csv_path, index_col='Year')

How to iteratively create a vector with different name in python

I have a pandas DataFrame
temp = pd.DataFrame({'country': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2'],
                     'seg': ['S1', 'S2', 'S1', 'S2', 'S1', 'S2', 'S1', 'S2'],
                     'agegroup': ['1', '2', '2', '1', '1', '2', '2', '1'],
                     'N': [21, 22, 23, 24, 31, 32, 33, 34]})
and a vector like
vector = ['country', 'seg']
What I want to do is create two vectors, named vector_country and vector_seg, which will contain the respective columns of temp, in this case country and seg.
I have tried
for vec in vector:
    'vector_' + str(vec) = temp[[vec]]
So in the end I would like to end up with two vectors:
vector_country, which will contain the temp.country and
vector_seg, which will contain the temp.seg
Is it possible to do something like that in python ?
Do not try to dynamically name variables. This is bad practice and will make your code intractable.
A better alternative is to use dictionaries, as so:
v = {}
for vec in ['country', 'seg']:
    v[vec] = temp[vec].values
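Put together with a trimmed version of the question's temp frame, the dictionary approach looks like this (a sketch):

```python
import pandas as pd

# Shortened version of the question's temp DataFrame
temp = pd.DataFrame({'country': ['C1', 'C1', 'C2', 'C2'],
                     'seg': ['S1', 'S2', 'S1', 'S2'],
                     'N': [21, 22, 31, 32]})

v = {}
for vec in ['country', 'seg']:
    v[vec] = temp[vec].values  # NumPy array of that column

print(list(v['country']))  # ['C1', 'C1', 'C2', 'C2']
print(list(v['seg']))      # ['S1', 'S2', 'S1', 'S2']
```

You then access "vector_country" as `v['country']`, which is just as convenient and keeps all the vectors in one place.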

Create multiple dataframes in loop

I have a list in which each entry is a company name:
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
I want to create a new dataframe for each entry in the list.
Something like
(pseudocode)
for c in companies:
    c = pd.DataFrame()
I have searched for a way to do this but can't find it. Any ideas?
Just to underline my comment on @maxymoo's answer: it's almost invariably a bad idea ("code smell") to add names dynamically to a Python namespace. There are a number of reasons, the most salient being:
Created names might easily conflict with variables already used by your logic.
Since the names are dynamically created, you typically also end up using dynamic techniques to retrieve the data.
This is why dicts were included in the language. The correct way to proceed is:
d = {}
for name in companies:
    d[name] = pd.DataFrame()
Nowadays you can write a single dict comprehension expression to do the same thing, but some people find it less readable:
d = {name: pd.DataFrame() for name in companies}
Once d is created the DataFrame for company x can be retrieved as d[x], so you can look up a specific company quite easily. To operate on all companies you would typically use a loop like:
for name, df in d.items():
    # operate on DataFrame 'df' for company 'name'
In Python 2 you are better off writing
for name, df in d.iteritems():
because this avoids instantiating a list of (name, df) tuples.
You can do this (although obviously use exec with extreme caution if this is going to be public-facing code)
for c in companies:
    exec('{} = pd.DataFrame()'.format(c))
Adding to the great answers above: they work flawlessly if you need to create empty data frames, but if you need to create multiple dataframes based on some filtering:
Suppose the list you have is a column of some bigger dataframe and you want to make a separate data frame for each unique company in it.
First take the unique names of the companies:-
compuniquenames = df.company.unique()
Create a data frame dictionary to store your data frames
companydict = {elem : pd.DataFrame() for elem in compuniquenames}
The two steps above are already in the post. Then fill each entry with the matching rows:
for key in companydict.keys():
    companydict[key] = df[df.company == key]
This will give you a data frame for each unique company, containing its matching records.
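The split-by-company step can also be written with groupby, which pandas provides for exactly this kind of partitioning. A sketch with a hypothetical company column:

```python
import pandas as pd

# Hypothetical frame with a 'company' column, standing in for the bigger df
df = pd.DataFrame({'company': ['AA', 'AAPL', 'AA'],
                   'price': [30.0, 150.0, 31.0]})

# One sub-DataFrame per unique company, keyed by company name
company_frames = {name: group for name, group in df.groupby('company')}

print(sorted(company_frames))     # ['AA', 'AAPL']
print(len(company_frames['AA']))  # 2
```

This avoids scanning the full frame once per company, since groupby partitions the rows in a single pass.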
Below is the code for dynamically creating data frames in loop:
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
for eachCompany in companies:
    # dynamically create data frames
    vars()[eachCompany] = pd.DataFrame()
For the difference between vars(), locals() and globals(), refer to the link below:
What's the difference between globals(), locals(), and vars()?
You can do it this way:
for xxx in yyy:
    globals()[f'dataframe_{xxx}'] = pd.DataFrame(xxx)
The following is reproducible, so let's say you have a list with the df/company names:
companies = ['AA', 'AAPL', 'BA', 'YHOO']
You probably also have data, presumably also a list (or rather a list of lists), like:
content_of_lists = [
[['a', '1'], ['b', '2']],
[['c', '3'], ['d', '4']],
[['e', '5'], ['f', '6']],
[['g', '7'], ['h', '8']]
]
In this particular example the dfs should probably look very much alike, so this does not need to be very complicated:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    dic["df_{}".format(n)] = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
Here you would have to use dic["df_AA"] to get to the dataframe inside the dictionary.
But should you require more "distinct" naming of the dataframes, I think you would have to use, for example, if-conditions, like:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    if n == 'AA':
        special_naming_1 = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
    elif n == 'AAPL':
        special_naming_2 ...
It is a little more effort, but it allows you to grab the dataframe object in a more conventional way, by just writing special_naming_1 instead of dic['df_AA'], and gives you more control over the dataframe and column names if that's important.