Why has my txt file been changed after I used pd.DataFrame? - python

The original data is:
The output data is:
import pandas as pd

signal_data = pd.read_csv('B.txt').T
print(pd.read_csv('B.txt').T)
dates = pd.date_range('2015-10-1', periods=19)
signal_data_df = pd.DataFrame(signal_data, index=dates,
                              columns=['PCLN', 'SPY', 'QCOM', 'AAPL', 'USB', 'AMGN', 'GS', 'BIIB', 'AGN'])
print(signal_data_df)

Because you pass a DataFrame as the data source, pd.DataFrame reuses the index and columns from that df; when you pass alternative index and column values, you're effectively reindexing the original df, hence the NaN values everywhere. You can instead rename the columns and overwrite the index directly:
signal_data = pd.read_csv('B.txt').T
signal_data.columns=['PCLN', 'SPY', 'QCOM', 'AAPL', 'USB', 'AMGN', 'GS', 'BIIB', 'AGN']
signal_data.index = dates
Or, to make your original code work, call .values to pass the underlying data as an anonymous NumPy array:
signal_data_df = pd.DataFrame(signal_data.values, index=dates,
                              columns=['PCLN', 'SPY', 'QCOM', 'AAPL', 'USB', 'AMGN', 'GS', 'BIIB', 'AGN'])
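A small self-contained sketch (with made-up numeric data standing in for B.txt) of the two behaviours described above — reindexing versus relabelling:

```python
import numpy as np
import pandas as pd

# Stand-in for the data read from B.txt (hypothetical values).
df = pd.DataFrame(np.arange(6).reshape(3, 2), index=[0, 1, 2], columns=['a', 'b'])

# Passing an existing DataFrame with *new* labels reindexes it:
# none of the new labels match the old ones, so every cell becomes NaN.
reindexed = pd.DataFrame(df, index=[10, 11, 12], columns=['x', 'y'])
print(reindexed.isna().all().all())  # True

# Passing the raw values instead keeps the data and applies the new labels.
relabelled = pd.DataFrame(df.values, index=[10, 11, 12], columns=['x', 'y'])
print(relabelled.iloc[0, 0])  # 0
```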

Related

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a countif-style calculation (similar to Excel), depending on whether a level of the index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
Try (note the first level of the column MultiIndex is 'Assigned', the values name):
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(
    lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
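A self-contained sketch of the same counting idea on the question's sample data; it compares against the empty fill value directly instead of using the lambda, which should be equivalent here:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']})
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')

# Select every column whose 'Unit' level is 'Sales', then count
# the non-empty cells row by row ("countif" on the fill value).
sales_cols = pivot.xs('Sales', level=2, drop_level=False, axis=1)
pivot[('', 'total sales', 'count')] = (sales_cols != '').sum(axis=1)

print(pivot[('', 'total sales', 'count')])
```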

Remove excess of pipes '|' in CSV after append files

I have 3 dataframes. I need to merge them into one CSV separated by pipes '|',
sorted by Column1 after the append.
But when I convert the final df to CSV, extra pipes appear where the shorter rows have no values for the trailing columns. How can I avoid this?
import pandas as pd
import io
df1 = pd.DataFrame({
    'Column1': ['key_1', 'key_2', 'key_3'],
    'Column2': ['1100', '1100', '1100']
})
df2 = pd.DataFrame({
    'Column1': ['key_1', 'key_2', 'key_3', 'key_1', 'key_2', 'key_3'],
    'Column2': ['1110', '1110', '1110', '1110', '1110', '1110'],
    'Column3': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
    'Column4': ['wer', 'cad', 'sder', 'dse', 'sdf', 'csd']
})
df3 = pd.DataFrame({
    'Column1': ['key_1', 'key_2', 'key_3', 'key_1', 'key_2', 'key_3'],
    'Column2': ['1115', '1115', '1115', '1115', '1115', '1115'],
    'Column3': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
    'Column4': ['wer', 'cad', 'sder', 'dse', 'sdf', 'csd'],
    'Column5': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
    'Column6': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
})
print(df1, df2, df3, sep="\n")
output = io.StringIO()
pd.concat([df1, df2, df3]).sort_values("Column1") \
    .to_csv(output, header=False, index=False, sep="|")
print("csv",output.getvalue(),sep="\n")
output.seek(0)
df4 = pd.read_csv(output, header=None, sep="|", keep_default_na=False)
print("df4",df4,sep="\n" )
output.close()
This is the output I get (note the trailing pipes '|'):
key_1|1100||||
key_1|1110|xxr|wer||
key_1|1110|xxt|dse||
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100||||
key_2|1110|xxv|cad||
key_2|1110|xxe|sdf||
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100||||
key_3|1110|xxw|sder||
key_3|1110|xxz|csd||
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
This is what I need. Just to clarify: I will not work on this final data myself; I need to upload it to a specific database in exactly the format shown below, and I'd like to do it without using regex (note there are no trailing pipes). Is there a way to do so?
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
As you note, generate the sorted pipe-delimited CSV, then split it into lines, rstrip("|") each line and join them back:
"\n".join([l.rstrip("|") for l in
           pd.concat([df1, df2, df3])
             .pipe(lambda d: d.sort_values(d.columns.tolist()))
             .to_csv(sep="|", index=False)
             .split("\n")])
You can remove the extra "|" with re.sub():
import re

s = pd.concat([df1, df2, df3]).sort_values("Column1") \
    .to_csv(header=False, index=False, sep="|")
s1 = re.sub(r"\|*\n", "\n", s)                            # with regex
s2 = "\n".join([l.rstrip("|") for l in s.splitlines()])   # with rstrip
>>> print(s1.strip())
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
>>> print(s2)
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
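If you'd rather never produce the trailing pipes at all, another option (a sketch, shown here on a trimmed-down version of the frames above) is to build each line by joining only the non-missing cells. This relies on the missing values being trailing, as they are in this data:

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['key_1'], 'Column2': ['1100']})
df2 = pd.DataFrame({'Column1': ['key_1'], 'Column2': ['1110'],
                    'Column3': ['xxr'], 'Column4': ['wer']})

# concat fills the short frame's missing columns with NaN;
# a stable sort keeps the original order for equal keys.
merged = pd.concat([df1, df2]).sort_values('Column1', kind='stable')

# Join only the non-missing cells of each row, so short rows
# never gain trailing pipes in the first place.
lines = merged.apply(lambda row: '|'.join(row.dropna().astype(str)), axis=1)
text = '\n'.join(lines.tolist())
print(text)
# key_1|1100
# key_1|1110|xxr|wer
```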

List as names for names of Pandas Dataframes

I want to make the names of some stock symbols the actual name of a pandas dataframe.
import pandas as pd
import pandas_datareader.data as pdr
from datetime import datetime

choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000, 1, 1),
                   end=datetime(2020, 1, 1)).to_csv(f'Data/{c}.csv')
    f'{c}'['Price'] = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
I'm getting this error:
TypeError: 'str' object does not support item assignment
Is there a way to go about doing this? Maybe perhaps using the name of the stock symbol as the name of the dataframe is not the best convention.
Thank you
You can store them in a dictionary:
import pandas as pd
import pandas_datareader.data as pdr
from datetime import datetime

choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
dataframes = {}
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000, 1, 1),
                   end=datetime(2020, 1, 1)).to_csv(f'Data/{c}.csv')
    dataframes[c] = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
So you will get a structure like the one below:
>>> print(dataframes)
{'ROK': <your_ROK_dataframe_here>,
'HWM': <your_HWM_dataframe_here>,
...
}
Then, you can access a specific dataframe by using dataframes['XXXX'] where XXXX is one of the choices.
You shouldn't create variable names from strings, as it can get quite messy down the line. If you want to keep your convention, I'd advise storing your dataframes in a dictionary with the stock symbols as keys:
choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
choices_dict = {}
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000, 1, 1),
                   end=datetime(2020, 1, 1)).to_csv(f'Data/{c}.csv')
    csv_pd = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
    choices_dict[c] = pd.DataFrame(csv_pd, columns=['Price'])
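Once the frames live in a dictionary, access and combination are straightforward. A sketch with hypothetical price data standing in for the downloaded CSVs:

```python
import pandas as pd

# Hypothetical stand-ins for the frames loaded above.
choices_dict = {
    'ROK': pd.DataFrame({'Price': [220.1, 221.5]}),
    'HWM': pd.DataFrame({'Price': [31.2, 30.9]}),
}

# Access one frame by its ticker key...
print(choices_dict['ROK']['Price'].iloc[0])  # 220.1

# ...or combine them all into one frame, with the ticker
# becoming the outer level of a column MultiIndex.
combined = pd.concat(choices_dict, axis=1)
print(combined[('HWM', 'Price')].iloc[1])  # 30.9
```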

How do I add values to a list stored as a dictionary value?

I have an empty dictionary and need to pull industry info based on ticker symbols. I would then like to add all tickers in the same industry to a list, with the industry as the key.
For example, the end would look something like the below:
{'technology': ['AAPL', 'ADBE'], 'Consumer Cyclical': ['TSLA', 'UA']}
Here is what I've been working on with no success:
import yfinance as yf
tickers = ['AAPL', 'ADBE', 'AMD', 'AMAT', 'AMZN', 'ANF',
           'APA', 'BA', 'BABA', 'BBY', 'BIDU', 'BMY', 'BRX', 'BZUN',
           'C', 'CAT', 'CLF', 'CMCSA', 'CMG', 'COST', 'CRM', 'CVX',
           'DE', 'EBAY', 'FB', 'FCX', 'FDX', 'FSLR',
           'GILD', 'GM', 'GME', 'GOOG', 'GPRO', 'GS', 'HAL', 'HD',
           'HIG', 'HON', 'IBM', 'JCPB', 'JD', 'JPM', 'LULU', 'LYG',
           'MA', 'MCD', 'MDT', 'MS', 'MSFT', 'MU', 'NEM', 'NFLX',
           'NKE', 'PBR', 'QCOM', 'SLB', 'SNAP', 'SPG', 'TSLA', 'TWTR',
           'TXN', 'UA', 'UAL', 'V', 'VZ', 'X', 'XLNX', 'ZM']
sector_dict = dict()
for ticker in tickers:
    try:
        sector = yf.Ticker(ticker).info['sector']
        sector_dict[sector].update(ticker)
    except:
        sector_dict.update({'no sector': [ticker]})
The below just gives me an empty dictionary. Does anybody see where the issue is?
Assuming the information you need is returned from the API call - the code below may work for you.
import yfinance as yf
from collections import defaultdict
tickers = ['AAPL','ADBE']
sector_dict = defaultdict(list)
for ticker in tickers:
    try:
        sector_dict[yf.Ticker(ticker).info['sector']].append(ticker)
    except Exception as e:
        print(f'Failed to get ticker info for {ticker}')
print(sector_dict)
output
defaultdict(<class 'list'>, {'Technology': ['AAPL', 'ADBE']})
You should always avoid catch-all exceptions.
Your original example was masking the fact that update isn't a list method.
When you subscript a Python dictionary like sector_dict[sector], you're referring to the value associated with that key — in this case, a list.
Also, update isn't used like that: its purpose is to update a dictionary with another dictionary or with an iterable of key/value pairs, not to modify an existing entry.
Finally, the try clause should be as small as possible, so you can be sure where the error is coming from and avoid conflicting exceptions, as in this case.
I think that's why, in my previous solution, the list came back with only the last ticker: yf.Ticker raised a KeyError, so the except branch ran instead of the append.
Here's how I'd do it:
sector_dict = {'no sector': []}
for ticker in tickers:
    try:
        sector = yf.Ticker(ticker).info['sector']
    except KeyError:
        sector_dict['no sector'].append(ticker)
        continue  # move on to the next ticker; 'sector' would be stale here
    try:
        sector_dict[sector].append(ticker)
    except KeyError:
        sector_dict[sector] = [ticker]
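The two-try pattern above can also be written with dict.setdefault, which creates the list the first time a sector is seen (shown here with illustrative, hard-coded sector data instead of the yfinance call):

```python
def group_by_sector(ticker_sectors):
    """Group tickers by sector; None means no sector was found."""
    sector_dict = {}
    for ticker, sector in ticker_sectors.items():
        key = sector if sector is not None else 'no sector'
        # setdefault returns the existing list, or inserts [] first.
        sector_dict.setdefault(key, []).append(ticker)
    return sector_dict

grouped = group_by_sector({'AAPL': 'Technology', 'ADBE': 'Technology', 'XYZ': None})
print(grouped)  # {'Technology': ['AAPL', 'ADBE'], 'no sector': ['XYZ']}
```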

Python custom method set to new variable changes old variable

I have created a class with two methods, NRG_load and NRG_flat. The first loads a CSV, converts it into a DataFrame and applies some filtering; the second takes this DataFrame and, after creating two columns, melts it into a flat format.
I am trying out these methods with the following code:
nrg105 = eNRG.NRG_load('nrg_105a.tsv')
nrg105_flat = eNRG.NRG_flat(nrg105, '105')
where eNRG is the class, and the second argument '105' is needed by an if block within the method to create the aforementioned columns.
The behaviour I cannot explain is that the second line - the one with the NRG_flat method - changes the nrg105 values.
Note that if I only run the NRG_load method, I get the expected DataFrame.
What is the behaviour I am missing? Because it's not the first time I apply a syntax like that, but I never had problems, so I don't know where I should look at.
Thank you in advance for all of your suggestions.
EDIT: as requested, here is the class' code:
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 16 15:22:21 2019
@author: CAPIZZI Filippo Antonio
"""
import pandas as pd
from FixFilename import FixFilename as ff
from SplitColumn import SplitColumn as sc
from datetime import datetime as ddt


class EurostatNRG:
    # This class includes the modules needed to load and filter
    # the Eurostat NRG files

    # Default countries' list to be used by the functions
    COUNTRIES = [
        'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
        'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
        'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
        'TR', 'UA', 'UK', 'XK'
    ]

    # Default years of analysis
    YEARS = list(range(2005, int(ddt.now().year) - 1))
    # NOTE: the 'datetime' library will call the current year, but since
    # the code is using the 'range' function, the end year will always be
    # current-1 (e.g. if we are in 2019, 'current year' will be 2018).
    # Thus, I have added "-1" because the end year is t-2.

    INDIC_PROD = pd.read_excel(
        './Datasets/VITO/map_nrg.xlsx',
        sheet_name=[
            'nrg105a_indic', 'nrg105a_prod', 'nrg110a_indic', 'nrg110a_prod',
            'nrg110'
        ],
        convert_float=True)

    def NRG_load(dataset, countries=COUNTRIES, years=YEARS, unit='ktoe'):
        # This module will load and refine the NRG dataset,
        # preparing it to be filtered

        # Fix eventual flags
        dataset = ff.fix_flags(dataset)
        # Load the dataset into a DataFrame
        df = pd.read_csv(
            dataset,
            delimiter='\t',
            encoding='utf-8',
            na_values=[':', ': ', ' :'],
            decimal='.')
        # Clean up spaces from the column names
        df.columns = df.columns.str.strip()
        # Remove the mentioned column because it's not needed
        if 'Flag and Footnotes' in df.columns:
            df.drop(columns=['Flag and Footnotes'], inplace=True)
        # Split the first column into separate columns
        df = sc.nrg_split_column(df)
        # Rename the columns
        df.rename(
            columns={
                'country': 'COUNTRY',
                'fuel_code': 'KEY_PRODUCT',
                'nrg_code': 'KEY_INDICATOR',
                'unit': 'UNIT'
            },
            inplace=True)
        # Filter the dataset
        df = EurostatNRG.NRG_filter(
            df, countries=countries, years=years, unit=unit)
        return df

    def NRG_filter(df, countries, years, unit):
        # This module will filter the input DataFrame 'df',
        # showing only the 'countries', 'years' and 'unit' selected

        # First, all of the units not of interest are removed
        df.drop(df[df.UNIT != unit.upper()].index, inplace=True)
        # Then, all of the countries not of interest are filtered out
        df.drop(df[~df['COUNTRY'].isin(countries)].index, inplace=True)
        # Finally, all of the years not of interest are removed,
        # and the columns are rearranged according to the desired output
        main_cols = ['KEY_INDICATOR', 'KEY_PRODUCT', 'UNIT', 'COUNTRY']
        cols = main_cols + [str(y) for y in years if y not in main_cols]
        df = df.reindex(columns=cols)
        return df

    def NRG_flat(df, name):
        # This module prepares the DataFrame to be flattened,
        # then it gives it as output

        # Assign the indicators' and products' names
        if '105' in name:  # 'name' is the name of the dataset
            # Create the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg105a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg105a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Create the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg105a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg105a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        elif '110' in name:
            # Create the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg110a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg110a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Create the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg110a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg110a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        # Delete the columns 'KEY_INDICATOR' and 'KEY_PRODUCT', and
        # rearrange the columns in the desired order
        df.drop(columns=['KEY_INDICATOR', 'KEY_PRODUCT'], inplace=True)
        main_cols = ['INDICATOR', 'PRODUCT', 'UNIT', 'COUNTRY']
        year_cols = [y for y in df.columns if y not in main_cols]
        cols = main_cols + year_cols
        df = df.reindex(columns=cols)
        # Melt the DataFrame to have it in flat format
        df = df.melt(
            id_vars=df.columns[:4], var_name='YEAR', value_name='VALUE')
        # Convert the 'VALUE' column into float numbers
        df['VALUE'] = pd.to_numeric(df['VALUE'], downcast='float')
        # Drop rows that have no indicators (it means they are not in
        # the Excel file with the products of interest)
        df.dropna(subset=['INDICATOR', 'PRODUCT'], inplace=True)
        return df
EDIT 2: if this could help, this is the error I receive when using the EurostatNRG class in IPython:
[autoreload of EurostatNRG failed: Traceback (most recent call last):
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 244, in check
    superreload(m, reload, self.old_objects)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 394, in superreload
    update_generic(old_obj, new_obj)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 331, in update_generic
    update(a, b)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 279, in update_class
    if (old_obj == new_obj) is True:
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().]
I managed to find the culprit.
In the NRG_flat method, the lines:
df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
...
df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
modify the df DataFrame in place (and therefore the caller's object), so I had to change them to the pandas assign method, which returns a new DataFrame:
df = df.assign(INDICATOR=df.KEY_INDICATOR.map(indic_dic))
...
df = df.assign(PRODUCT=df.KEY_PRODUCT.map(prod_dic))
I do not get the error any more.
Thank you for replying!
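The underlying mechanism is that functions which assign columns or drop rows with inplace=True on the DataFrame they receive mutate the caller's object, since the parameter is just another reference to the same DataFrame. A minimal sketch (hypothetical data and column name, not the Eurostat files) reproducing the symptom and the df.copy() remedy:

```python
import pandas as pd

def filter_inplace(df):
    # Dropping rows with inplace=True mutates the caller's object.
    df.drop(df[df['UNIT'] != 'KTOE'].index, inplace=True)
    return df

def filter_copy(df):
    # Working on a copy leaves the caller's object untouched.
    df = df.copy()
    df.drop(df[df['UNIT'] != 'KTOE'].index, inplace=True)
    return df

orig = pd.DataFrame({'UNIT': ['KTOE', 'TJ', 'KTOE']})
filter_inplace(orig)
n_inplace = len(orig)   # 2 -- the original shrank

orig2 = pd.DataFrame({'UNIT': ['KTOE', 'TJ', 'KTOE']})
filter_copy(orig2)
n_copy = len(orig2)     # 3 -- the original is intact
print(n_inplace, n_copy)
```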
