To start this topic off, I've created a stock market environment that returns its observation through the function below. The field 'df' is a pandas DataFrame loaded from a CSV file, and I return one step (row) of the DataFrame as the observation. My issue is that when I assign the data to the observation field, it returns different values than the data sheet.
def _next_observation(self):
    observation = [
        self.df.loc[self.current_step, 'Open'],
        self.df.loc[self.current_step, 'High'],
        self.df.loc[self.current_step, 'Low'],
        self.df.loc[self.current_step, 'Close'],
        self.df.loc[self.current_step, 'Volume'],
        self.account_quantity
    ]
    # Add indicators
    if self.indicators is not None:
        for _ in range(len(self.indicators)):
            observation.append(self.df.loc[self.current_step, self.indicators[_][0]])
    print(observation)                         # Prints normally
    self.observation = np.array(observation)   # Hmmmmmm
    print(self.observation)                    # Prints strangely
    exit(1)
    return self.observation
The first observation returned by the step, which looks incorrect to me, is listed below.
[ 5.17610000e-01 5.17810000e-01 5.15010000e-01 5.15370000e-01
5.18581850e+06 0.00000000e+00 3.76286621e+01 5.15838144e-01
-1.86428571e-05]
I have narrowed the issue down to one line of code.
The correct data, presented as a plain list rather than a numpy array, is below:
[0.51761, 0.51781, 0.51501, 0.51537, 5185818.5, 0, 37.62866206258322, 0.5158381442961018, -1.864285714292535e-05]
If anyone has any tips on how to solve this issue, please let me know; I don't understand why this is happening. I usually don't ask for help, but this is a first. I also have an A2C agent that keeps returning 0 as its action, and I believe the data is to blame.
Sincerely, Richard
The data is identical; numpy is just printing it in exponential (scientific) notation. To suppress exponential notation in numpy you can do the following:
numpy.set_printoptions(suppress=True)
or you can use formatted printing of variables such as:
for item in mylist:
    print(f"{item:0.3f}")
Some background: I'm taking a machine learning class on customer segmentation. My environment is Python with pandas and sklearn. I have two datasets, a general population dataset and a customer demographics dataset, with 85 identical columns.
I'm calling a function I created to run preprocessing steps on the 'customers' data, steps that were previously run outside this function on the general population dataset. Within the function is a loop that replaces missing values with np.nan. Here is the loop:
# Replacing missing data with NaNs.
# feat_sum is a dataframe (feature_summary) of coded values
for i in range(len(feat_sum)):
    mi_unk = feat_sum.iloc[i]['missing_or_unknown']  # locate column and values
    mi_unk = mi_unk.strip('[').strip(']').split(',')  # strip the brackets, then split
    mi_unk = [int(val) if (val != '' and val != 'X' and val != 'XX') else val for val in mi_unk]
    if mi_unk != ['']:
        featsum_attrib = feat_sum.iloc[i]['attribute']
        df = df.replace({featsum_attrib: mi_unk}, np.nan)
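To illustrate what that parsing does, here is a small sketch with a made-up 'missing_or_unknown' string (not a value from the real feature summary):

mi_unk = '[-1,X,XX]'  # hypothetical raw value from feat_sum
mi_unk = mi_unk.strip('[').strip(']').split(',')
mi_unk = [int(val) if (val != '' and val != 'X' and val != 'XX') else val for val in mi_unk]
print(mi_unk)  # [-1, 'X', 'XX'] -- numeric codes become ints, letter codes stay strings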
Toward the end of the function I'm engineering new variables:
#Investigate "CAMEO_INTL_2015" and engineer two new variables.
df['WEALTH'] = df['CAMEO_INTL_2015']
df['LIFE_STAGE'] = df['CAMEO_INTL_2015']
mf_wealth_dict = {'11':1, '12':1, '13':1, '14':1, '15':1, '21':2, '22':2, '23':2, '24':2, '25':2, '31':3,'32':3, '33':3, '34':3, '35':3, '41':4, '42':4, '43':4, '44':4, '45':4, '51':5, '52':5, '53':5, '54':5, '55':5}
mf_lifestage_dict = {'11':1, '12':2, '13':3, '14':4, '15':5, '21':1, '22':2, '23':3, '24':4, '25':5, '31':1, '32':2, '33':3, '34':4, '35':5, '41':1, '42':2, '43':3, '44':4, '45':5, '51':1, '52':2, '53':3, '54':4, '55':5}
#replacing the 'WEALTH' and 'LIFE_STAGE' columns with values from the dictionaries
df['WEALTH'].replace(mf_wealth_dict, inplace=True)
df['LIFE_STAGE'].replace(mf_lifestage_dict, inplace=True)
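For context, here is a small sketch (using a hypothetical column, not the real data) of how .replace() with a dictionary treats values that are not listed as keys:

import numpy as np
import pandas as pd

s = pd.Series(['11', '55', 'XX'])
print(s.replace({'11': 1, '55': 5}).tolist())             # [1, 5, 'XX'] -- 'XX' is not a key, so it is left as-is
# If 'XX' should also become missing, it would need an explicit mapping, e.g.:
print(s.replace({'11': 1, '55': 5, 'XX': np.nan}).tolist())  # [1, 5, nan]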
Near the end of the project code, I'm running an imputer to replace the np.nans, which ran successfully on the general population dataset (azdias):
az_imp = Imputer(strategy="most_frequent")
azdias_cleaned_imp = pd.DataFrame(az_imp.fit_transform(azdias_cleaned_encoded))
So when I call the clean_data function on the 'customers' dataframe, clean_data(customers), it gives me ValueError: could not convert string to float: 'XX' on this line:
customers_imp = Imputer(strategy="most_frequent")
---> 19 customers_cleaned_imputed = pd.DataFrame(customers_imp.fit_transform(customers_cleaned_encoded))
In the data dictionary for the CAMEO_INTL_2015 column of the dataset, the very last category is 'XX': unknown. When I run a value count on the WEALTH and LIFE_STAGE columns, there are 124 occurrences of 'XX' across those two columns. No other columns in the dataset have the 'XX' value. Again, I did not run into this problem with the other dataset. I know this is wordy, but any help is appreciated, and I can provide the project code as well.
A mentor and I tried troubleshooting by looking at all the steps that were performed on both datasets, to no avail. I was expecting the 'XX' values to be dealt with by the loop I mentioned earlier.
First time on Stack Overflow, so bear with me. Code is below. Basically, df_history is a dataframe with different variables. I am trying to pull the 'close' variable and sort it based on the categorical type of the currency.
When I pull data over using the .query command, it gives me one object with all the individual observations together, separated by spaces. I know how to separate that back into independent data, but the issue is that it is also pulling the index along with the observations. In the image you can see 179, 178, 177, etc. in the BTC object. I don't want that there and didn't intend to pull it. How do I get rid of it?
additional_rows = []
for currency in selected_coins:
    df_history = df_history.sort_values(['date'], ascending=True)
    row_data = [currency,
                df_history.query('granularity == "daily" and currency == @currency')['close'],
                df_history.query('granularity == "daily" and currency == @currency').head(180)['close'].pct_change(),
                df_history['date']
                ]
    additional_rows.append(row_data)
df_additional_info = pd.DataFrame(additional_rows, columns=['currency',
                                                            'close',
                                                            'returns',
                                                            'df_history'])
df_additional_info.set_index('currency').transpose()
import ast
list_of_lists = df_additional_info.close.to_list()
flat_list = [i for sublist in list_of_lists for i in ast.literal_eval(sublist)]
uniq_list = list(set(flat_list))
len(uniq_list),len(flat_list)
I was trying to pull data from one dataframe into the next and sort it based on a categorical input from the currency variable. It is not transferring over well.
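One thing worth noting (a sketch, assuming df_history, selected_coins, and a currency value as above): .query() returns a pandas Series, and a Series always carries its index, which is what shows up as 179, 178, 177, ... when it is stored in a cell. Converting the Series to a plain list keeps only the values:

close_series = df_history.query('granularity == "daily" and currency == @currency')['close']
row_data = [currency,
            close_series.to_list(),                        # values only, index dropped
            close_series.head(180).pct_change().to_list(),
            df_history['date'].to_list()]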
I have a problem with getting data.
I have this DataFrame:
I need to filter by 'fabricante' == 'Kellogs' and get the 'calorias' column, which I did like this:
I need that second column (calorias) to pass into this function:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = None  # Select only the data (fabricante, variable) from 'cereal_df'
    inicio, final = None, None  # put the statistical function here
    return inicio, final
And this is my code for the last part:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante][variable]
    inicio, final = sm.stats.DescrStatsW(variable).tconfint_mean(alpha=1 - confianza)
    return inicio, final
The error:
I would be very appreciative if you can help me.
You called DescrStatsW('calorias').
But surely you wanted DescrStatsW(subconjunto), right?
I'm just reading https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html, which explains that you should pass in a 1- or 2-column numpy array or dataframe.
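Roughly, the fix would look something like this (a sketch, assuming cereal_df is defined as in the question and statsmodels is imported as sm):

def valor_medio_intervalo(fabricante, variable, confianza):
    # Select the requested column for the given manufacturer
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante, variable]
    # Pass the data itself (not the column name) to DescrStatsW
    inicio, final = sm.stats.DescrStatsW(subconjunto).tconfint_mean(alpha=1 - confianza)
    return inicio, final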
First time posting here, I apologize if this question has been asked before - I can't find anything that applies.
Is there a way to read the underlying data from an Excel PivotTable into a pandas DataFrame? For several years I've had an Excel Auto_Open macro that downloads several Excel files and double clicks on the "Grand Total" row in order to extract all of the data, which ultimately gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.
This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.
I have been able to read worksheets using OpenPyXl, but these files do not contain the source data on a separate worksheet by default - the pivot cache must be extracted to a new sheet first. What I would like to do, if possible, is read from the Excel PivotCache into a pandas DataFrame and either save that output as a CSV or load it directly into our database. It seems that this is not possible with OpenPyXl and that I'll probably need to use win32com.client.
Does anybody have any experience with this and know if it's even possible? Any pointers on where I might get started? I've tried several items from the Excel object model (PivotCache, GetData, etc.), but either I don't know how to use them or they don't return what I need.
Any help would be much appreciated. Thanks!
This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.
In essence, the steps one can take to solve this with openpyxl are:
1. Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let's call it my_pivot_table).
2. Get cached fields and their values from my_pivot_table.cache.cacheFields.
3. Get the rows data as a dict, in two sub-steps:
   3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes into my_pivot_table.cache.cacheFields.
   3.2) Replace the cache-field indexes in each record with their actual values by "joining" cache.records.r and cache.cacheFields.
4. Convert the dict of rows into a pandas DataFrame.
Below you will find a copy of the code that implements this solution. Since the structure of these Excel objects is somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I'm adding further below minimal examples of the main objects being manipulated, so people can get a better sense of what is going on and what objects are returned.
This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.
"Bare" code
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        fields_map[field.name] = [f.v for f in field.sharedItems._fields]

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it with NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
Results:
>>> df.head(2)
FUEL YEAR REGION STATE UNIT Jan Feb (...)
0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...)
1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)
Some of the objects' details
Object pivot_table
This is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:
<openpyxl.pivot.table.TableDefinition object>
Parameters:
name='Tabela dinâmica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)
Parameters:
ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object>
Parameters: (A LOT OF OMITTED STUFF...)
Object fields_map (from cache.cacheFields)
This is a dict with column names and their available values:
{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0,
2018.0, 2019.0, 2020.0],
'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'],
'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)],
'REGION': ['S', 'NE', 'N', (...)]}
Object row_dict (before mapping)
Each row is a dict with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields (see above)
{'YEAR': 0, # <<<--- 0 stands for index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 0, # <<<--- index in fields_map
'Dec': 10818.094,
'STATE': 0, # <<<--- index in fields_map
(...)
'UNIT': 'm3'}
Object row_dict (after mapping)
After extracting raw values for cache fields from their indexes, we have a dict that represents all values of a row:
{'YEAR': 2000.0, # extracted column value from index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 'GASOLINE (m3)', # extracted from fields_map
'Dec': 10818.094,
'STATE': 'TEXAS', # extracted from fields_map
(...)
'UNIT': 'm3'}
Building on @PMHM's excellent answer, I have modified the code to take care of source data with blank cells. The piece of code that needed modification is the following:
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l
The complete code (mostly copy/paste from above) is therefore:
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it with NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
I have a dataframe that consists of two decimal-valued columns (x and y) and an id:
When I apply the as_matrix function to the x and y columns, it yields an array that looks like this:
coords = df.as_matrix(columns=['x', 'y'])
coords
yields:
array([[ 0.0703843 , 0.170845 ],
[ 0.07022078, 0.17150128],
[ 0.07208886, 0.17159163],
...,
[ 0.07162819, 0.17044404],
[ 0.06951432, 0.17096308],
[ 0.07104143, 0.17040137]])
This immediately seemed strange since the number of decimal places was inconsistent, but I just assumed pandas was doing some shortening for display purposes.
But then when I tried to retrieve the IDs, I could only get one or zero matches when they should all match:
ids = []
for coord in coords:
    try:
        _id = df.loc[df['x'] == coord[0]]['id'][1]
        ids.append(_id)
    except:
        pass
len(ids)
1
What I am trying to understand is why the as_matrix function extracts values from the dataframe that cannot be matched against it again, and how I can retrieve the corresponding ids from the dataframe.
Any help here would be appreciated.
Thanks
Edit
Below is a subset of the dataframe in CSV:
,id,x,y
0,07379a26-2447-4fce-83ac-4784abf07389,0.07038429591623253,0.17084500318384327
1,f5cc3adb-0588-4705-b1a3-fe1b669b776f,0.07022078416348305,0.17150127781674332
2,b5a57ffe-8565-4443-9685-11675ce25dc4,0.07208886125821728,0.17159163002146055
3,940efcaa-6d9d-4b10-a0fe-d8ec8c1d9c7e,0.07057468050347501,0.1700482708522834
4,616d7794-565a-4d2d-98cb-334beb5b91ef,0.07057895306948389,0.170054305037284
5,e2d1819d-1f58-407d-9950-be0a0c00374b,0.07161607658023798,0.17013089473907284
6,6a739687-f9ad-47bd-8a4b-c47bc4b2aec6,0.070163429153604,0.16889764101717875
7,dd2df646-9a66-4baa-8815-d24f1858eda7,0.07035099968831582,0.16995622800529742
8,6a224d76-efea-4313-803d-c25b619dae0a,0.07066777462044714,0.17021849979554743
9,321147fa-ee51-4bab-9634-199c92a42d2f,0.06984869509314469,0.17098101436534555
10,e52d6289-01ba-4e7d-8054-bb9a349c0505,0.07068704829137691,0.17029718331066224
11,517f256b-6171-4d93-9b4b-0f81aac828fb,0.0713283119291569,0.16983952831019206
12,e339c742-9784-49fc-a435-790db0364229,0.07131341496221469,0.1698513011377732
13,6f20ad5a-22fb-43a2-8885-838e5161df14,0.06942397329210678,0.1716572235671854
14,f6e1008f-2b22-4d88-8c84-c0dc4f2d822e,0.06942427697939664,0.17165098925109726
15,8a2d35e5-10a2-4188-b98f-54200d2db8da,0.07048162129308791,0.16896051533992895
16,adab8fd8-4348-412d-85d2-01491886967b,0.07076495746208027,0.16966622176968035
17,df79523b-848b-45a9-8dab-fe53c2a5b62d,0.06988926585338372,0.17028143287771583
18,db05d97c-3b16-4da8-9659-820fc7e3f858,0.0713167479593096,0.1685149810693375
19,d43963d1-b803-473c-85dc-2ed2e9f77f4e,0.07045583812582461,0.1706502407290604
20,9d99c9a6-2de3-4e6a-9bd7-9d7ece358a2f,0.07044174575566758,0.17066067488910522
21,3eec44be-b9e2-45a2-b919-05028f5a0ba9,0.07079585677115756,0.16920818686920963
22,9f836847-2b67-4b33-930a-1f84452628ba,0.07078522829778934,0.16919781903167638
23,fbaa8958-a5d5-4dfb-91f7-8c11afe226a8,0.07128542860765898,0.16834798505762455
24,a84b59c4-4145-472d-a26a-4c930648c16c,0.07196635776157265,0.17047633495883885
25,29cf8ad3-7068-4207-b0a2-4f0cff337c9f,0.0719701195278871,0.17051442269732875
26,d0f512c8-5c4f-427a-99e1-ebb4c5b363e5,0.0718787509597688,0.17054903897593635
27,74b1db2d-002b-4f89-8d02-ac084e9a3cd5,0.07089130417373782,0.16981103290127117
28,89210a0c-8144-491d-9e98-19e7f4c3085e,0.07076060461092577,0.1707011426749184
29,aebb377e-7c26-4bb5-8563-c3055a027844,0.07103977816965212,0.17113978347674103
30,00b527a0-d40a-44b4-90f9-750fd447d2d7,0.07097785505134419,0.16963542019904118
31,8c186559-f50d-40ca-a821-11596e1e5261,0.06992637446216321,0.17110063865050085
32,0e64cf14-6ccd-4ad0-9715-ab410f6baf6a,0.0718311255786932,0.1705675237580442
33,f5479823-1efe-47b8-9977-73dc41d1d69e,0.07016981880399553,0.1703708437681898
34,385cfa13-2476-4e3d-b755-3063a7f802b9,0.07016550435008462,0.17037054473511137
35,a40bf573-b701-46f0-9a06-5857cf3ab199,0.0701443567773146,0.17035314147536326
36,0c5a9751-2c1b-4003-834d-9584d2f907a2,0.07016050805421256,0.17038992836178396
37,65b09067-9cf0-492d-8a70-13d4f92f8a10,0.07137336818557355,0.1684713798357405
The issue is with the df.loc function on geo-dataframes.
Once I exported the data to a CSV and then re-read it using normal pandas, it seemed to work just fine.
Just letting whoever finds this know.
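As a side note (not the fix described above): exact equality on floats that have been round-tripped through an array can be fragile, so a tolerance-based lookup avoids the problem regardless of the dataframe type. A sketch against the same df and coords as in the question:

import numpy as np

ids = []
for coord in coords:
    # match rows by tolerance rather than exact equality
    match = df.loc[np.isclose(df['x'], coord[0]) & np.isclose(df['y'], coord[1]), 'id']
    if not match.empty:
        ids.append(match.iloc[0])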