Empty values when assigning a new column to an existing DataFrame - Python

When I try to add a new column to an existing DataFrame, the new column contains only empty values. However, when I print "result" before assigning it to the DataFrame, it looks fine! As a result I get this weird max() error:
ValueError: max() arg is an empty sequence
I'm using mplfinance to plot the data
strategy.py
def moving_average(self, df, i):
    signal = df['sma20'][i] * 1.10
    if (df['sma20'][i] > df['sma50'][i]) & (signal > df['Close'][i]):
        return df['Close'][i]
    else:
        return None
trading.py
for i in range(0, len(df['Close'])-1):
    result = strategy.moving_average(df, i)
    print(result)
df['buy'] = result
df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'], scatter=True, marker='^')
mpf.plot(df, type='candle', addplot=apd)

Based on the very small amount of information here, and on your comment
"because df['buy'] column has nan values only."
I'm going to guess that your problem is that strategy.moving_average() is returning None instead of nan when there is no signal.
There is a big difference between None and nan. (The main issue is that nan supports math, whereas None does not; and as a general rule plotting packages always do math).
I suggest you import numpy as np and then in strategy.moving_average()
change return None
to return np.nan.
ALSO, I just saw another problem: you are only assigning a single value to df['buy']. You need to take that assignment out of the loop. I suggest initializing result as an empty list before the loop, then:
result = []
for i in range(len(df['Close'])):  # iterate over every row so len(result) == len(df)
    result.append(strategy.moving_average(df, i))
print(result)
df['buy'] = result
df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'], scatter=True, marker='^')
mpf.plot(df, type='candle', addplot=apd)
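As a side note, the per-row loop can also be replaced with a single vectorized expression. A minimal sketch of the same rule on an invented toy frame (the column names mirror the question; the numbers are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the OP's data (values are illustrative only).
df = pd.DataFrame({
    "Close": [10.0, 11.0, 12.0, 13.0],
    "sma20": [11.0, 12.0, 11.0, 12.0],
    "sma50": [10.0, 11.0, 12.0, 13.0],
})

# Same rule as moving_average(): keep Close where sma20 > sma50
# and sma20 * 1.10 > Close, otherwise NaN (not None) so plotting works.
condition = (df["sma20"] > df["sma50"]) & (df["sma20"] * 1.10 > df["Close"])
df["buy"] = np.where(condition, df["Close"], np.nan)
```

This fills the whole column in one assignment, so there is no loop to get the length wrong, and the non-signal rows hold np.nan rather than None.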

Change Column values in pandas applying another function

I have a data frame in pandas; one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes a string like 'P1Y4M1D' and returns an integer.
I am wondering how it is possible to change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row; iloc selects data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan

def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column to the new parsed values; the final goal is to write a new function that returns the average ['timespan'] of docs created in a particular year. Since I need the parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious what a way would be to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You would only need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd

def parse_postal_code(postal_code):
    # Split the postal code and keep the leading letters
    letters = postal_code.split('_')[0]
    return letters

# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})

# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))

# It can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
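Applied to the original timespan column, the same pattern could look like the sketch below. The helper reuses the question's own regex and 365/30-day conversion; the function name and the sample values are invented for illustration, and this is not a full ISO-8601 duration parser:

```python
import re
import pandas as pd

def timespan_to_days(value):
    # Parse strings like 'P1Y4M1D' (optionally prefixed with '-')
    # into an approximate day count, as in the question's parse().
    is_negative = value.startswith('-')
    if is_negative:
        value = value[1:]
    match = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    year, month, day = [int(num) if num else 0 for num in match[0]] if match else [0, 0, 0]
    days = year * 365 + month * 30 + day
    return -days if is_negative else days

df = pd.DataFrame({'timespan': ['P2Y', 'P1Y4M1D', '-P3M']})
# apply() calls the helper once per cell and returns a new Series
df['timespan_days'] = df['timespan'].apply(timespan_to_days)
```

Once the column holds integers, the "average timespan per year" goal becomes an ordinary groupby/mean over the parsed values.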

Imputing the missing values string using a condition(pandas DataFrame)

Kaggle Dataset(working on)- Newyork Airbnb
Here is code that loads the raw data, to make the issue easier to reproduce:
airbnb = pd.read_csv("https://raw.githubusercontent.com/rafagarciac/Airbnb_NYC-Data-Science_Project/master/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")
airbnb[airbnb["host_name"].isnull()][["host_name","neighbourhood_group"]]
I would like to fill the null values of "host_name" based on the "neighbourhood_group" column entries, like:
if airbnb['host_name'].isnull():
    airbnb["neighbourhood_group"] == "Bronx"
    airbnb["host_name"] = "Vie"
elif:
    airbnb["neighbourhood_group"] == "Manhattan"
    airbnb["host_name"] = "Sonder (NYC)"
else:
    airbnb["host_name"] = "Michael"
(This is wrong; it is just to represent the output format I want.)
I've tried using an if statement but couldn't apply it correctly. Could you please help me solve this?
Thanks
You could try this -
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
Pandas has a special method to fill NA values:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You may create a dict with values for "host_name" field using "neighbourhood_group" values as keys and do this:
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(value=airbnb[airbnb['host_name'].isna()]['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna("Michael")
"value" argument here may be a Series of values.
So, first of all, we create a Series with "neighbourhood_group" values which correspond to our missing values by using this part:
neighbourhood_group_series = airbnb[airbnb['host_name'].isna()]['neighbourhood_group']
Then using map function together with "host_dict" we get a Series with values that we want to impute:
neighbourhood_group_series.map(host_dict)
Finally we just impute in all other NA cells some default value, in our case "Michael".
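A self-contained version of the same idea on toy data (the frame and names below are invented for illustration). Note that fillna only touches missing positions anyway, so the isna pre-filter from the answer above can optionally be dropped:

```python
import pandas as pd

# Toy stand-in for the Airbnb frame: two missing host names.
airbnb = pd.DataFrame({
    'host_name': [None, 'Anna', None, None],
    'neighbourhood_group': ['Bronx', 'Queens', 'Manhattan', 'Queens'],
})

host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}

# Fill missing host names from the neighbourhood mapping first...
airbnb['host_name'] = airbnb['host_name'].fillna(
    airbnb['neighbourhood_group'].map(host_dict))
# ...then fall back to a default for anything still missing
# (here, a neighbourhood not present in host_dict).
airbnb['host_name'] = airbnb['host_name'].fillna('Michael')
```

Because fillna aligns the fill Series on the index, only the rows whose host_name is missing are affected; the existing 'Anna' entry is left untouched.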
You can do it with:
ornek = pd.DataFrame({'samp1': [None, None, None],
                      'samp2': ["sezer", "bozkir", "farkli"]})

def filter_by_col(row):
    if row["samp2"] == "sezer":
        return "ping"
    if row["samp2"] == "bozkir":
        return "pong"
    return None

ornek.apply(lambda x: filter_by_col(x), axis=1)

Is not NaN conditional statement for python 3 and pandas

I am trying to create a new column in a pandas data frame by calculating the value from existing columns.
I have 3 existing columns ("launched_date", "item_published_at", "item_created_at").
However, my "if row[column_name] is not None:" statement is letting columns with NaN values through instead of skipping to the next branch.
In the code below, I would not expect "nan" to be printed after the first conditional; I would expect something like "2018-08-17".
df['adjusted_date'] = df.apply(lambda row: adjusted_launch(row), axis=1)

def adjusted_launch(row):
    if row['launched_date'] is not None:
        print(row['launched_date'])
        exit()
        adjusted_date = date_to_time_in_timezone(row['launched_date'])
    elif row['item_published_at'] is not None:
        adjusted_date = row['item_published_at']  # make datetime in PST
    else:
        adjusted_date = row['item_created_at']  # make datetime in PST
    return adjusted_date
How can I structure this conditional statement correctly?
First, fill the string "nan" in wherever the data is empty:
df.fillna("nan", inplace=True)
Then in the function you can apply the condition like:
def adjusted_launch(row):
    if row['launched_date'] != 'nan':
        ......
Second solution:
import numpy as np
df.fillna(np.nan, inplace=True)

# suggested by @ShadowRanger
def funct(row):
    if pd.notnull(row['col']):  # scalars have no .notnull() method, so use pd.notnull()
        pass
df = df.where((pd.notnull(df)), None)
This will replace all NaNs with None; no other modifications are required.
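Putting the pieces together, a per-row check with pd.notnull (which is False for both None and NaN, unlike `is not None`) might look like the sketch below; the column names match the question, while the sample dates are invented:

```python
import pandas as pd

df = pd.DataFrame({
    'launched_date': ['2018-08-17', None],
    'item_published_at': [None, '2018-09-01'],
    'item_created_at': ['2018-01-01', '2018-01-02'],
})

def adjusted_launch(row):
    # pd.notnull() correctly rejects both None and NaN,
    # whereas `is not None` lets NaN fall through.
    if pd.notnull(row['launched_date']):
        return row['launched_date']
    elif pd.notnull(row['item_published_at']):
        return row['item_published_at']
    return row['item_created_at']

df['adjusted_date'] = df.apply(adjusted_launch, axis=1)
```

Each row picks the first non-missing date in priority order, which is the behavior the original `is not None` chain was aiming for.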

Pandas series append returning NaN values

I'm having difficulty getting my dataframe data into their own timeseries. Here is my code so far:
data = pd.read_csv(s3_data_path, index_col=0, parse_dates=True, dtype="float64")
num_timeseries = data.shape[1]
print("This is the number of time series you're running:")
print(num_timeseries)
data_length = num_timeseries
t0 = data.index[0]
print("This is the beginning date:")
print(t0)
time_series = []
for i in range(num_timeseries):
    index = pd.DatetimeIndex(start=t0, freq=freq, periods=data_length)
    time_series.append(pd.Series(data=data.iloc[:, i], index=index))
print(data.head(10))
print(time_series[0])
Whenever I run print(data.head(10)) I see the following, which is what I expect:
However, my time_series object has values of NaN:
I don't quite understand what I'm missing to get the correct values in my time_series object. Thanks for the help!
Edit:
Whenever I remove index=index from my time_series.append, the code generates my expected result (pictured below). However, this creates an issue because no frequency is defined, which is a requirement.
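For context, a likely cause here is index alignment: when pd.Series is constructed from an existing Series together with an index argument, pandas reindexes the data against the new index rather than repositioning it, so labels that don't match become NaN. Passing the raw values sidesteps the alignment. A minimal sketch with invented dates:

```python
import pandas as pd

# Source data indexed by 2020 dates.
data = pd.DataFrame({'a': [1.0, 2.0, 3.0]},
                    index=pd.date_range('2020-01-01', periods=3, freq='D'))
# A target index that shares no labels with the source.
new_index = pd.date_range('2021-01-01', periods=3, freq='D')

# Aligning a Series on a non-overlapping index yields all-NaN values...
aligned = pd.Series(data=data.iloc[:, 0], index=new_index)
# ...while passing .values keeps the numbers and just attaches the new index.
repositioned = pd.Series(data=data.iloc[:, 0].values, index=new_index)
```

So appending pd.Series(data=data.iloc[:, i].values, index=index) inside the loop would keep the frequency-bearing index without losing the data.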

SettingWithCopyWarning while filtering rows that are <= than float

sencap.csv is a file that has a lot of columns I don't need, so I want to keep just some of them in order to filter and analyze the information and make some graphs; in this case, a pie chart that aggregates energy quantities by energy source. Everything works fine except the condition that restricts the sum() to rows with less than 9.0 MW.
import pandas as pd
import matplotlib.pyplot as plt

aux = pd.read_csv('sencap.csv')
keep_col = ['subsistema', 'propietario', 'razon_social', 'estado',
            'fecha_servicio_central', 'region_nombre', 'comuna_nombre',
            'tipo_final', 'clasificacion', 'tipo_energia', 'potencia_neta_mw',
            'ley_ernc', 'medio_generacion', 'distribuidora', 'punto_conexion',
            ]
c1 = aux['medio_generacion'] == 'PMGD'
c2 = aux['medio_generacion'] == 'PMG'
aux2 = aux[keep_col]
aux3 = aux2[c1 | c2]
for col in ['potencia_neta_mw']:
    aux3[col] = pd.to_numeric(aux3[col].str.replace(',', '.'))
c3 = aux3['potencia_neta_mw'] <= 9.0
aux4 = aux3[c3]
df = aux4.groupby(['tipo_final']).sum()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
aux3[col] = pd.to_numeric(aux3[col].str.replace(',','.'))
aux3[col] = pd.to_numeric(aux3[col].str.replace(',', '.'))
This line is the reason you are getting the warning.
Accessing "col" via chained indexing may result in unpredictable behavior, since it may return either a view or a copy of the original data; which one you get depends on the memory layout of the array, about which pandas makes no guarantees.
The pandas documentation advises users to use .loc instead.
Example:
In: df
Out:
      one          two
   first second first second
0      a      b     c      d
1      e      f     g      h
2      i      j     k      l
3      m      n     o      p

dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
In the second case, __getitem__ is unpredictable: it may return either a view or a copy of the data, and modifying a view works differently from modifying a copy.
A change made to a copy is not reflected in the original data, whereas a change made to a view is.
Note: the warning exists to alert users that, even if you get the expected output, there is a chance the code might cause some unpredictable behavior.
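In the question's code, the simplest way to resolve the warning is to make aux3 an explicit copy (or to assign through .loc). A sketch on invented toy data standing in for sencap.csv:

```python
import pandas as pd

# Toy stand-in for the filtered columns of sencap.csv.
aux2 = pd.DataFrame({
    'medio_generacion': ['PMGD', 'PMG', 'OTRO'],
    'potencia_neta_mw': ['8,5', '9,5', '1,0'],
})

c1 = aux2['medio_generacion'] == 'PMGD'
c2 = aux2['medio_generacion'] == 'PMG'

# .copy() makes aux3 its own DataFrame, so the later assignment
# unambiguously modifies aux3 and the warning goes away.
aux3 = aux2[c1 | c2].copy()
aux3['potencia_neta_mw'] = pd.to_numeric(
    aux3['potencia_neta_mw'].str.replace(',', '.'))
```

With the copy made explicit, pandas no longer has to guess whether the assignment was meant to propagate back to aux2.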
