List as names for names of Pandas Dataframes - python

I want to make the names of some stock symbols the actual name of a pandas dataframe.
import pandas as pd
import pandas_datareader.data as pdr
from datetime import datetime
choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000,1,1),
                   end=datetime(2020,1,1)).to_csv(f'Data/{c}.csv')
    f'{c}'['Price'] = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
I'm getting this error:
TypeError: 'str' object does not support item assignment
Is there a way to do this? Perhaps using the stock symbol as the name of the dataframe is not the best convention.
Thank you

You can store the dataframes in a dictionary, using the stock symbols as keys.
import pandas as pd
import pandas_datareader.data as pdr
from datetime import datetime
choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
dataframes = {}
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000,1,1),
                   end=datetime(2020,1,1)).to_csv(f'Data/{c}.csv')
    dataframes[c] = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
So, you will get a structure like the one below:
>>> print(dataframes)
{'ROK': <your_ROK_dataframe_here>,
'HWM': <your_HWM_dataframe_here>,
...
}
Then, you can access a specific dataframe by using dataframes['XXXX'] where XXXX is one of the choices.
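As a rough sketch of how such a dictionary can be used afterwards: the error in the question arises because f'{c}' evaluates to a plain string, and strings do not support item assignment, whereas a dictionary key can be any symbol. The prices below are made-up stand-ins (an assumption, so that the example runs without downloading anything); the real values would come from the CSV files.

```python
import pandas as pd

# Hypothetical stand-in prices; the real values would come from the CSV files.
dates = pd.date_range('2020-01-01', periods=3, name='Date')
dataframes = {
    'ROK': pd.Series([100.0, 101.5, 99.8], index=dates, name='Adj Close'),
    'GS': pd.Series([230.0, 228.4, 231.1], index=dates, name='Adj Close'),
}

# Look up one symbol by its key.
rok_prices = dataframes['ROK']

# Loop over every symbol together with its data.
last_prices = {symbol: prices.iloc[-1] for symbol, prices in dataframes.items()}

# Combine everything into a single DataFrame with one column per symbol.
combined = pd.concat(dataframes, axis=1)
print(combined)
```

Keeping the symbols as keys also means the list of choices can change without touching the rest of the code.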

You shouldn't create variable names from strings, as it can get quite messy down the line. If you want to keep your convention, I'd advise storing your dataframes in a dictionary with the stock symbols as keys:
import pandas as pd
import pandas_datareader.data as pdr
from datetime import datetime
choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
choices_dict = {}
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000,1,1),
                   end=datetime(2020,1,1)).to_csv(f'Data/{c}.csv')
    csv_pd = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
    # Turn the 'Adj Close' Series into a one-column DataFrame named 'Price'
    choices_dict[c] = csv_pd.to_frame(name='Price')

Related

Why isn't this Pandas pivot table working?

My code takes a bank statement from Excel and creates a dataframe that categorises each transaction based on description:
import pandas as pd
import openpyxl
import datetime as dt
import numpy as np
dff = pd.DataFrame({'Date': ['20221003', '20221005'],
                    'Tran Type': ['BOOK TRANSFER CREDIT', 'ACH DEBIT'],
                    'Debit Amount': [0.00, -220000.00],
                    'Credit Amount': [182.90, 0.0],
                    'Description': ['BOOK TRANSFER CREDIT FROM ACCOUNT 98743987', 'USREF2548 ACH OFFSET'],
                    'Amount': [-220000.00, 182.90]})
import re
dff['Category'] = dff['Description'].str.findall('Ref|BCA|Fund|Transfer', flags=re.IGNORECASE)
But this code will not work. Any ideas why?
pivotf = dff
pivotf = pd.pivot_table(pivotf,
index=["Date"], columns="Category",
values=['Amount'],
margins=False, margins_name="Total")
The error message is TypeError: unhashable type: 'list'
When I change columns from "Category" to anything else, it works fine.
Thanks!
Add these lines before pivoting (untested):
import numpy as np
dff['Category'] = [x[0] if x else np.nan for x in dff['Category']]
This makes sure each Category value is a single string rather than a list (lists can't be hashed).
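A minimal sketch of the idea, under the assumption that keeping only the first keyword match per description is acceptable. The data is a trimmed-down stand-in for the bank statement in the question, not the original spreadsheet.

```python
import re

import numpy as np
import pandas as pd

# Trimmed-down stand-in for the bank statement in the question.
dff = pd.DataFrame({
    'Date': ['20221003', '20221005'],
    'Description': ['BOOK TRANSFER CREDIT FROM ACCOUNT 98743987',
                    'USREF2548 ACH OFFSET'],
    'Amount': [182.90, -220000.00],
})

# findall returns a list of matches per row; lists are unhashable.
matches = dff['Description'].str.findall('Ref|BCA|Fund|Transfer',
                                         flags=re.IGNORECASE)

# Keep only the first match (or NaN when nothing matched) so that every
# category is a plain string, which can be used as a pivot column.
dff['Category'] = [m[0] if m else np.nan for m in matches]

pivot = pd.pivot_table(dff, index='Date', columns='Category', values='Amount')
print(pivot)
```

Rows whose description matches none of the keywords get NaN as their category and therefore do not appear in the pivot columns.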

Getting a dictionary of lists that contain elements from a column using a groupby

I have a dataframe that looks like this, with 1 string column and 1 int column.
import random
columns=['EG','EC','FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '\[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
I would like to end up with a dictionary of lists that stores all values of column_B grouped by column_A.
To achieve this, I used a groupby to get the number of occurrences of each column_B value:
group_by = my_df.groupby(['column_A','column_B'])['column_B'].count().unstack().fillna(0).T
group_by
I then use some list comprehensions to build the list for each column_A value by hand and add them to the dictionary.
Is there any way to get this more directly using a groupby?
I am not aware of a way to achieve that within the groupby statement itself, but you could try something like this instead:
import random
import pandas as pd
columns=['EG','EC','FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '\[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
final_dict = {val: my_df.loc[my_df['column_A'] == val, 'column_B'].values.tolist() for val in my_df['column_A'].unique()}
This dict comprehension is a one-liner: for each unique column_A value, it collects the corresponding column_B values into a list and stores that list in the dict with the column_A value as its key.
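For comparison, one possibility that stays within a groupby is to apply list to each group and convert the result to a dict. This is only a sketch on a small fixed sample (an assumption, chosen so that the result is reproducible rather than random).

```python
import pandas as pd

# Small fixed sample instead of the random data, so the result is reproducible.
my_df = pd.DataFrame({
    'column_A': ['EG', 'EC', 'EG', 'FI', 'EC', 'EG'],
    'column_B': [1, 5, 3, 7, 2, 1],
})

# Group by column_A, collect each group's column_B values into a list,
# then turn the resulting Series (index = column_A) into a plain dict.
final_dict = my_df.groupby('column_A')['column_B'].apply(list).to_dict()
print(final_dict)
```

The values keep their original row order within each group, and duplicates are preserved.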

Convert a dataframe column into a list of object

I am using pandas to read a CSV which contains a phone_number field (string). However, I need to convert this field into the JSON format below
[{'phone_number':'+01 373643222'}] and put it under a new column called phone_numbers. How can I do that?
I searched online, but the examples I found convert all the columns into JSON using to_json(), which apparently cannot solve my case.
Below is an example
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
'phone_number': ['+1 569-483-2388', '+1 555-555-1212', '+1 432-867-5309']})
Use the map function like this:
df["phone_numbers"] = df["phone_number"].map(lambda x: [{"phone_number": x}])
display(df)  # in a notebook; use print(df) elsewhere
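If the end goal is actual JSON text, a possible follow-up is to turn the rows into records and serialize them with the standard json module. This is a sketch, not part of the answer above; dropping the original phone_number column is an assumption about the desired output.

```python
import json

import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane'],
                   'phone_number': ['+1 569-483-2388', '+1 555-555-1212']})

# Wrap each phone number in a one-element list of dicts.
df['phone_numbers'] = df['phone_number'].map(lambda x: [{'phone_number': x}])

# Assumption: the flat phone_number column is no longer wanted in the output.
records = df.drop(columns='phone_number').to_dict(orient='records')

# Serialize the list of row dicts to a JSON string.
json_text = json.dumps(records, indent=2)
print(json_text)
```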

Python custom method set to new variable changes old variable

I have created a class with two methods, NRG_load and NRG_flat. The first loads a CSV, converts it into a DataFrame and applies some filtering; the second takes this DataFrame and, after creating two columns, it melts the DataFrame to pivot it.
I am trying out these methods with the following code:
nrg105 = eNRG.NRG_load('nrg_105a.tsv')
nrg105_flat = eNRG.NRG_flat(nrg105, '105')
where eNRG is the class, and the second argument '105' is needed by an if statement within the method to create the aforementioned columns.
The behaviour I cannot explain is that the second line - the one with the NRG_flat method - changes the nrg105 values.
Note that if I only run the NRG_load method, I get the expected DataFrame.
What am I missing? I have used this kind of syntax before without any problems, so I don't know where to look.
Thank you in advance for all of your suggestions.
EDIT: as requested, here is the class' code:
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 16 15:22:21 2019
@author: CAPIZZI Filippo Antonio
"""
import pandas as pd
from FixFilename import FixFilename as ff
from SplitColumn import SplitColumn as sc
from datetime import datetime as ddt
class EurostatNRG:
    # This class includes the modules needed to load and filter
    # the Eurostat NRG files
    # Default countries' lists to be used by the functions
    COUNTRIES = [
        'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
        'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
        'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
        'TR', 'UA', 'UK', 'XK'
    ]
    # Default years of analysis
    YEARS = list(range(2005, int(ddt.now().year) - 1))
    # NOTE: the 'datetime' library will call the current year, but since
    # the code is using the 'range' function, the end years will be always
    # current-1 (e.g. if we are in 2019, 'current year' will be 2018).
    # Thus, I have added "-1" because the end year is t-2.
    INDIC_PROD = pd.read_excel(
        './Datasets/VITO/map_nrg.xlsx',
        sheet_name=[
            'nrg105a_indic', 'nrg105a_prod', 'nrg110a_indic', 'nrg110a_prod',
            'nrg110'
        ],
        convert_float=True)
    def NRG_load(dataset, countries=COUNTRIES, years=YEARS, unit='ktoe'):
        # This module will load and refine the NRG dataset,
        # preparing it to be filtered
        # Fix eventual flags
        dataset = ff.fix_flags(dataset)
        # Load the dataset into a DataFrame
        df = pd.read_csv(
            dataset,
            delimiter='\t',
            encoding='utf-8',
            na_values=[':', ': ', ' :'],
            decimal='.')
        # Clean up spaces from the column names
        df.columns = df.columns.str.strip()
        # Removes the mentioned column because it's not needed
        if 'Flag and Footnotes' in df.columns:
            df.drop(columns=['Flag and Footnotes'], inplace=True)
        # Split the first column into separate columns
        df = sc.nrg_split_column(df)
        # Rename the columns
        df.rename(
            columns={
                'country': 'COUNTRY',
                'fuel_code': 'KEY_PRODUCT',
                'nrg_code': 'KEY_INDICATOR',
                'unit': 'UNIT'
            },
            inplace=True)
        # Filter the dataset
        df = EurostatNRG.NRG_filter(
            df, countries=countries, years=years, unit=unit)
        return df
    def NRG_filter(df, countries, years, unit):
        # This module will filter the input DataFrame 'df'
        # showing only the 'countries', 'years' and 'unit' selected
        # First, all of the units not of interest are removed
        df.drop(df[df.UNIT != unit.upper()].index, inplace=True)
        # Then, all of the countries not of interest are filtered out
        df.drop(df[~df['COUNTRY'].isin(countries)].index, inplace=True)
        # Finally, all of the years not of interest are removed,
        # and the columns are rearranged according to the desired output
        main_cols = ['KEY_INDICATOR', 'KEY_PRODUCT', 'UNIT', 'COUNTRY']
        cols = main_cols + [str(y) for y in years if y not in main_cols]
        df = df.reindex(columns=cols)
        return df
    def NRG_flat(df, name):
        # This module prepares the DataFrame to be flattened,
        # then it gives it as output
        # Assign the indicators and products' names
        if '105' in name:  # 'name' is the name of the dataset
            # Creating the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg105a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg105a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Creating the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg105a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg105a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        elif '110' in name:
            # Creating the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg110a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg110a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Creating the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg110a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg110a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        # Delete the columns 'KEY_INDICATOR' and 'KEY_PRODUCT', and
        # rearrange the columns in the desired order
        df.drop(columns=['KEY_INDICATOR', 'KEY_PRODUCT'], inplace=True)
        main_cols = ['INDICATOR', 'PRODUCT', 'UNIT', 'COUNTRY']
        year_cols = [y for y in df.columns if y not in main_cols]
        cols = main_cols + year_cols
        df = df.reindex(columns=cols)
        # Pivot the DataFrame to have it in flat format
        df = df.melt(
            id_vars=df.columns[:4], var_name='YEAR', value_name='VALUE')
        # Convert the 'VALUE' column into float numbers
        df['VALUE'] = pd.to_numeric(df['VALUE'], downcast='float')
        # Drop rows that have no indicators (it means they are not in
        # the Excel file with the products of interest)
        df.dropna(subset=['INDICATOR', 'PRODUCT'], inplace=True)
        return df
EDIT 2: if this could help, this is the error I receive when using the EurostatNRG class in IPython:
[autoreload of EurostatNRG failed: Traceback (most recent call last):
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 244, in check
    superreload(m, reload, self.old_objects)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 394, in superreload
    update_generic(old_obj, new_obj)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 331, in update_generic
    update(a, b)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 279, in update_class
    if (old_obj == new_obj) is True:
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
]
I managed to find the culprit.
In the NRG_flat method, the lines:
df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
...
df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
modify the DataFrame that was passed in, which is the same object as nrg105. I therefore replaced them with the Pandas assign method, which returns a new DataFrame:
df = df.assign(INDICATOR=df.KEY_INDICATOR.map(indic_dic))
...
df = df.assign(PRODUCT=df.KEY_PRODUCT.map(prod_dic))
I no longer get any errors.
Thank you for replying!
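The difference between the two approaches can be sketched in isolation. The function and column names below are hypothetical and chosen only for illustration; they are not part of the EurostatNRG class.

```python
import pandas as pd


def add_column_in_place(df):
    # Item assignment modifies the object the caller passed in.
    df['DOUBLE'] = df['VALUE'] * 2
    return df


def add_column_with_assign(df):
    # assign returns a new DataFrame; rebinding the local name 'df'
    # leaves the caller's object untouched.
    df = df.assign(DOUBLE=df['VALUE'] * 2)
    return df


original = pd.DataFrame({'VALUE': [1, 2, 3]})

result_assign = add_column_with_assign(original)
columns_after_assign = list(original.columns)

result_in_place = add_column_in_place(original)
columns_after_in_place = list(original.columns)

print(columns_after_assign)    # original unchanged by the assign version
print(columns_after_in_place)  # original now carries the extra column
```

Another option with the same effect would be to start the function with df = df.copy() and keep the item assignments as they are.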

Pandas df to dictionary with values as python lists aggregated from a df column

I have a pandas df containing 'features' for stocks, which looks like this:
I am now trying to create a dictionary with unique sector as key, and a python list of tickers for that unique sector as values, so I end up having something that looks like this:
{'consumer_discretionary': ['AAP',
'AMZN',
'AN',
'AZO',
'BBBY',
'BBY',
'BWA',
'KMX',
'CCL',
'CBS',
'CHTR',
'CMG',
etc.
I could iterate over the pandas df rows to create the dictionary, but I prefer a more pythonic solution. Thus far, this code is a partial solution:
df.set_index('sector')['ticker'].to_dict()
Any feedback is appreciated.
UPDATE:
The solution by @wrwrwr
df.set_index('ticker').groupby('sector').groups
partially works, but it returns a pandas series as the value instead of a python list. Any ideas about how to transform the pandas series into a python list in the same line, without having to iterate over the dictionary?
Wouldn't f.set_index('ticker').groupby('sector').groups be what you want?
For example:
from pandas import DataFrame
f = DataFrame({
    'ticker': ('t1', 't2', 't3'),
    'sector': ('sa', 'sb', 'sb'),
    'name': ('n1', 'n2', 'n3')})
groups = f.set_index('ticker').groupby('sector').groups
# {'sa': Index(['t1']), 'sb': Index(['t2', 't3'])}
To ensure that they have the type you want:
{k: list(v) for k, v in f.set_index('ticker').groupby('sector').groups.items()}
or:
f.set_index('ticker').groupby('sector').apply(lambda g: list(g.index)).to_dict()
