I want to change all the cells in my DataFrame that are strings with the value ''.
I have one dataset that has 7 columns, for example:
a, b, c, d, e, f, g
and it has 700 rows.
I want to change the value of the cells in 5 specific columns with one statement.
I tried this:
columns = ['a', 'b', 'c', 'd', 'e']

def get_tmp(i):
    if len(i) == 0:
        b = 'tmp'
        return b
    else:
        return i

weights_df[colun] = weights_df[colun].apply(get_tmp)
but this doesn't work.
To fix the problem, I used a for loop:
columns = ['a', 'b', 'c', 'd', 'e']

def get_tmp(i):
    if len(i) == 0:
        b = 'tmp'
        return b
    else:
        return i

for colun in columns:
    weights_df[colun] = weights_df[colun].apply(get_tmp)
Is there another way to fix this using only .apply? If so, do I need to change something in my function, and what?
Thank you guys.
You can try this code.
import pandas as pd
import numpy as np

filename = 'Book1.xlsx'
weights_df = pd.read_excel(filename)

columns = ['a', 'b', 'c', 'd', 'e']
for col in columns:
    weights_df[col] = weights_df[col].apply(lambda x: 'tmp' if x == '' else x)
This code works on my local machine.
Thank you Bright Gene, it's a good answer.
Actually, I want to check whether it's possible to make the same changes without the for loop.
I'll be clearer, since maybe I miscommunicated.
This DataFrame has the columns a, b, c, d, e, f, g.
I want to change only a subset of these columns:
a, b, c, d, e.
The others are numbers.
I created a list: columns_to_modify = ['a', 'b', 'c', 'd', 'e']
I want to try changing them like this:
weights_df[columns_to_modify] = weights_df[columns_to_modify].apply(lambda x: 'tmp' if x == '' else x)
At this point I want to understand whether there is a way to apply this only to specific columns without a for loop.
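One loop-free option, assuming the blank cells really are empty strings '' rather than NaN, is applymap, which applies a function element-wise to every cell of the selected columns. A minimal sketch:

columns_to_modify = ['a', 'b', 'c', 'd', 'e']

# applymap works element-wise on every cell of the selected columns,
# so no explicit for loop is needed
weights_df[columns_to_modify] = weights_df[columns_to_modify].applymap(
    lambda x: 'tmp' if x == '' else x
)

# An even shorter alternative for this particular case is replace:
# weights_df[columns_to_modify] = weights_df[columns_to_modify].replace('', 'tmp')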
I have a DataFrame containing almost 800k rows, so it is impractical to iterate over it using standard methods. I searched a little and found the iterrows() method, but I couldn't understand how to use it. This is my code; can you help me update it to use iterrows()?
for i in range(len(x["Value"])):
    if x.loc[i, "PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        x.loc[i, "Santral_Type"] = "HES"
    elif x.loc[i, "PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
        x.loc[i, "Santral_Type"] = "TERMIK"
    elif x.loc[i, "PP_Name"] in ['BRS','ÇKL','DPZ']:
        x.loc[i, "Santral_Type"] = "RES"
    else:
        x.loc[i, "Santral_Type"] = "SOLAR"
How do you iterate over very big DataFrames? In general, you don't. You should apply some sort of vectorized operation to the column as a whole. For example, your case can be handled with map and fillna:
map_dict = {
    'HES': ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
    'TERMIK': ['BND','BND2','TFB','TFB3','TFB4','KNT'],
    'RES': ['BRS','ÇKL','DPZ']
}
# invert the mapping so each plant name points directly to its type
inv_map_dict = {x: k for k, v in map_dict.items() for x in v}
df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
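If it helps to see it end to end, here is a minimal self-contained check of this approach on a made-up frame (the sample PP_Name values are only for illustration):

import pandas as pd

demo = pd.DataFrame({'PP_Name': ['ARK', 'BND', 'BRS', 'Unknown']})
demo['Santral_Type'] = demo['PP_Name'].map(inv_map_dict).fillna('SOLAR')
print(demo)
#    PP_Name Santral_Type
# 0      ARK          HES
# 1      BND       TERMIK
# 2      BRS          RES
# 3  Unknown        SOLAR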
It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:
# Default value
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"
Note that 800k rows cannot be considered a large table when using standard pandas methods.
I would advise strongly against using iterrows and for loops when you have vectorised solutions available that take advantage of the pandas API.
This is your code adapted to use numpy, which should run much faster than your current method:
import numpy as np

col = 'PP_Name'
conditions = [
    x[col].isin(['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']),
    x[col].isin(['BND','BND2','TFB','TFB3','TFB4','KNT']),
    x[col].isin(['BRS','ÇKL','DPZ']),
]
outcomes = ['HES', 'TERMIK', 'RES']
x['Santral_Type'] = np.select(conditions, outcomes, default='SOLAR')
df.iterrows(), according to the documentation, returns a tuple (index, Series) for each row.
You can use it like this:
for idx, row in df.iterrows():
    if row['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        df.loc[idx, 'Santral_Type'] = 'HES'
    # and so on
By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
It's better to do it as @mcsoini suggested.
The simplest method could be .values, for example:
def f(x0, ..., xn):
    return 'hello or some complicated operation'

df['newColumn'] = [f(r[0], r[1], ..., r[n]) for r in df.values]
The drawbacks of this method, as far as I know, are that you cannot refer to the column values by name but only by position, and that there is no information about the index of the DataFrame. The advantage is that it is faster than the iterrows, itertuples, and apply methods.
Hope it helps.
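For concreteness, here is a small runnable instance of that pattern; the frame and the function f are made up for illustration:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def f(a, b):
    # values arrive by position, not by column name
    return a + b

# each r is a numpy array holding one row: r[0] is column 'a', r[1] is column 'b'
df['newColumn'] = [f(r[0], r[1]) for r in df.values]
print(df['newColumn'].tolist())  # [11, 22, 33]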
I'm trying to loop through the 'vol' dataframe, and conditionally check if the sample_date is between certain dates. If it is, assign a value to another column.
Here's the code I have:
vol = pd.DataFrame(data=pd.date_range(start='11/3/2015', end='1/29/2019'))
vol.columns = ['sample_date']
vol['hydraulic_vol'] = np.nan
for i in vol.iterrows():
    if pd.Timestamp('2015-11-03') <= vol.loc[i, 'sample_date'] <= pd.Timestamp('2018-06-07'):
        vol.loc[i, 'hydraulic_vol'] = 319779
Here's the error I received:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
This is how you would do it properly:
cond = ((pd.Timestamp('2015-11-03') <= vol.sample_date) &
        (vol.sample_date <= pd.Timestamp('2018-06-07')))
vol.loc[cond, 'hydraulic_vol'] = 319779
Another way to do this would be to use the np.where method from the numpy module, in combination with the .between method.
This method works like this:
np.where(condition, value if true, value if false)
Code example
cond = vol.sample_date.between('2015-11-03', '2018-06-07')
vol['hydraulic_vol'] = np.where(cond, 319779, np.nan)
Or you can combine them in one single line of code:
vol['hydraulic_vol'] = np.where(vol.sample_date.between('2015-11-03', '2018-06-07'), 319779, np.nan)
Edit
I see that you're new here, so here's something I had to learn as well when coming to Python/pandas.
Looping over a DataFrame should be your last resort; try to use vectorized solutions, in this case .loc or np.where, as these will perform far better in terms of speed than looping.
There is a dataset with one of the columns containing some missing values. I want to generate a new column: if a cell in the former column is missing, assign 1 to the new column, else 0.
I tried
df[newcolumn] = map(lambda x: 1 if x is None else 0, df[formercolumn])
but it didn't work.
While
df[newcolumn] = df[formercolumn].isnull().apply(lambda x: 1 if x is True else 0)
worked well.
Any better solutions to this situation?
Use np.where:
df['newcolumn'] = np.where(df.formercolumn.isnull(), 1, 0)
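Since the .isnull() mask is already boolean, you can also just cast it to int, which gives the same 1-for-missing result without np.where:

df['newcolumn'] = df['formercolumn'].isnull().astype(int)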
I have the following numpy-based version, which is really similar to your solution but slightly shorter/faster:
df[newcolumn] = df[formercolumn].apply(lambda x: 1 if np.isnan(x) else 0)
I think, however, that Scott's answer is better/faster.
I am taking a dataframe, breaking it into two dataframes, and then I need to change the index values so that no number is greater than the total number of rows.
Here's the code:
import math
import pandas as pd

dataset = pd.read_csv("dataset.csv", usecols=['row_id','x','y','time'], index_col=0)
splitvalue = math.floor((0.9)*786239)
train = dataset[dataset.time < splitvalue]
test = dataset[dataset.time >= splitvalue]
Here's the change that I am doing. I am wondering if there is an easier way:
test.index=range(test.shape[0])
test.index.rename('row_id',inplace=True)
Is there a better way to do this?
Try:
test = test.reset_index(drop=True).rename_axis('row_id')
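As a quick sanity check on a made-up frame:

import pandas as pd

test = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=[512, 907])
test = test.reset_index(drop=True).rename_axis('row_id')
print(test.index)  # RangeIndex(start=0, stop=2, step=1, name='row_id')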
You should shuffle your data before slicing:
dataset = dataset.reindex(np.random.permutation(dataset.index))
Otherwise you're biasing your test/train sets.
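For what it's worth, a common one-liner for the shuffle is sample:

# sample(frac=1) returns all rows in a random order
dataset = dataset.sample(frac=1)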
You can assign a new Index object directly to overwrite the index:
test.index = pd.Index(np.arange(len(test)), name='row_id')