I want to change all the cells in my DataFrame that are strings with the value ''.
I have one dataset that has 7 columns,
for example:
a, b, c, d, e, f, g,
and it has 700 rows.
I want to change the value of the cells in 5 specific columns in one statement.
I tried this:
columns = [a, b, c, d, e]

def get_tmp(i):
    if len(i) == 0:
        b = 'tmp'
        return b
    else:
        return i

weights_df[colun] = weights_df[colun].apply(get_tmp)
but this doesn't work.
To work around the problem I used a for loop:

columns = [a, b, c, d, e]

def get_tmp(i):
    if len(i) == 0:
        b = 'tmp'
        return b
    else:
        return i

for colun in columns:
    weights_df[colun] = weights_df[colun].apply(get_tmp)
Is there another way to handle this using only .apply?
If there is, do I need to change something in my function? What do I need to change?
Thank you guys.
You can try this code.
import pandas as pd

filename = 'Book1.xlsx'
weights_df = pd.read_excel(filename)

columns = ['a', 'b', 'c', 'd', 'e']
for col in columns:
    weights_df[col] = weights_df[col].apply(lambda x: 'tmp' if x == '' else x)
This code works on my local machine.
Thank you Bright Gene, it's a good answer.
Actually, I want to check whether it is possible to make the same changes without the for loop.
I will be more clear; maybe I miscommunicated.
This DataFrame has the columns a, b, c, d, e, f, g.
I want to change only a part of these columns:
a, b, c, d, e.
The others are numbers.
I created a list: columns_to_modify = ['a', 'b', 'c', 'd', 'e']
I want to try to change them like this:

weights_df[columns_to_modify] = weights_df[columns_to_modify].apply(lambda x: 'tmp' if x == '' else x)

At this point I want to understand whether there is a way to apply this only to specific columns without a for loop.
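For what it's worth, a loop-free sketch: DataFrame.apply passes each whole column (a Series) to the function, so the element-wise lambda above fails with a truth-value error; element-wise logic over a block of columns is better expressed with DataFrame.replace or a Series-level operation. Assuming the column names are 'a' through 'e':

import pandas as pd

columns_to_modify = ['a', 'b', 'c', 'd', 'e']

# Option 1: DataFrame.replace matches whole cell values, no loop needed
weights_df[columns_to_modify] = weights_df[columns_to_modify].replace('', 'tmp')

# Option 2: .apply over the sub-frame -- each argument is a whole column
# (a Series), so use a Series-level replacement, not an element-level test
weights_df[columns_to_modify] = weights_df[columns_to_modify].apply(
    lambda col: col.mask(col == '', 'tmp')
)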
I'm working with Pandas. I need to create a new column in a DataFrame according to conditions in other columns. For each value in a series, I try to check whether it contains a given value (a condition to return text). This works when the values are exactly the same, but not when the value is only part of the value in the series.
Sample data:
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])

def conditions(df5):
    if "ores" in df5["Symptom"]:
        return "Things"

df["new_column"] = df.swifter.apply(conditions, axis=1)
It doesn't work, because any("something") is always True.
So I tried:
df['new_column'] = np.where(df2["Symptom"].str.contains('ores'), 'yes', 'no') : return "Things"
It doesn't work because it's inside a loop.
I can't use np.select because it needs two separate lists, and my code has to be easily editable (and it can't come from a dict).
It also doesn't work with find_all. And also not with:

df["new_column"] == "ores" is True: return "things"

I don't really understand why nothing works, or what I have to do.
Edit:

df5 = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])

def conditions(df5):
    (df5["Symptom"].str.contains('ores'), 'Things')

df5["Deversement Service"] = np.where(conditions)
df5
For the moment I have a length-of-values problem.
To add a new column with a condition, use np.where:

import numpy as np
import pandas as pd

df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
df['new'] = np.where(df["Symptom"].str.contains('ores'), 'Things', "")
print(df)
Symptom new
0 ores Things
1 ores + more texts Things
2 anything else
If you need a single boolean value, use pd.Series.any:
if df["Symptom"].str.contains('ores').any():
    print("Things")
# Things
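If several patterns must map to several labels and the two-list form of np.select is hard to maintain, one easily editable pattern is a list of (pattern, label) pairs applied with .str.contains; a sketch (the rules list here is illustrative, not from the question):

import pandas as pd

df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])

# One (pattern, label) rule per line, so the rules stay easy to edit
rules = [
    ('ores', 'Things'),
    # ('another pattern', 'another label'),
]

df['new'] = ''
for pattern, label in rules:
    df.loc[df['Symptom'].str.contains(pattern), 'new'] = label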
I have created a new dataframe:
import pandas as pd
# creating a dataframe:
bb = pd.DataFrame(columns=['INDate', 'INCOME', 'EXDate', 'EXPENSE'])
bb.to_excel('/py/deleteafter/bb_black_book.xlsx')
bb.head()
I can see that the new dataframe has no rows.
Then I need to add a new value to one of the columns in a loop.
income_value = message.text  # depends on the user input
for i in range(len(bb)):
    print(bb['INCOME'][i])
    if bb['INCOME'][i] != 'NaN':
        i += 1
        # print('NOT_EMPTY_CELL')
    else:
        # print('ive found an empty cell=)')
        bb['INCOME'][i] = income_value
        break
And here I got an error, because my df has 0 length:
print(range(len(bb)))
range(0, 0)
I'm not sure that my solution is right, and I'm sure there must be a simpler one. Overall, my main idea is:
How can I check for the next empty cell in a certain column (in my case the column 'INCOME') in order to add the value to this FREE cell?
Or more simply: I need to add a value to the next unfilled cell =)
Will be glad for your replies.
To find the last valid value you can use last_valid_index(). It returns None when the column is empty, so you could do:

import numpy as np

idx = bb["INCOME"].last_valid_index()
if idx is None or np.isnan(idx):
    bb.loc[0, "INCOME"] = income_value
else:
    bb.loc[idx + 1, "INCOME"] = income_value
There is a much simpler way to append a row, e.g.:

row = {'INCOME': 20000, 'EXDate': '14/09/2020'}
df = df.append(row, ignore_index=True)

(In pandas 2.0+ DataFrame.append was removed; pd.concat([df, pd.DataFrame([row])], ignore_index=True) does the same.)
If you want to add data in one column only, at an index number that doesn't exist yet, use loc.
Change line
bb['INCOME'][i]=income_value
to
bb.loc[i,'INCOME']=income_value
and it should work fine.
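As a minimal end-to-end sketch of that loc idiom (assuming a default integer index, and with a stand-in value for the user input): len(bb) is always the first unused row position, so writing there appends without any loop:

import pandas as pd

bb = pd.DataFrame(columns=['INDate', 'INCOME', 'EXDate', 'EXPENSE'])

income_value = 42000  # stand-in for message.text from the question

# len(bb) is the next free integer index, so this appends a new row;
# the remaining columns are filled with NaN
bb.loc[len(bb), 'INCOME'] = income_value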
I am trying to add columns to a python pandas df using the apply function.
However, the number of columns to be added depends on the output of the function
used in the apply function.
Example code:
number_of_columns_to_be_added = 2

def add_columns(number_of_columns_to_be_added):
    df['n1'], df['n2'] = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
Any idea how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip( ... part programmatically?
The output of zip is an iterator of tuples, so you could try this:
temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    key = 'n' + str(i)
    df[key] = value
temp will hold all the entries, and then you iterate over temp to assign the values to your DataFrame under the generated keys. Hope this matches your original idea.
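A quick end-to-end check of that idea, with a made-up do_something (the splitting logic here is purely illustrative) that returns an n-tuple per input value:

import pandas as pd

df = pd.DataFrame({'input': ['a-b', 'c-d', 'e-f']})

def do_something(x, n):
    # Made-up stand-in: derive an n-tuple from each input value
    return tuple(x.split('-')[:n])

number_of_columns_to_be_added = 2
temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    df['n' + str(i)] = value

print(df)
#   input n1 n2
# 0   a-b  a  b
# 1   c-d  c  d
# 2   e-f  e  f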
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, and in my_df2_Group only the second.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name + "_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I built a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation against a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see whether any of the 'Values' in your input data meet the specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1, 12], [1, 15], [1, 3], [1, 6], [2, 8], [2, 1], [2, 17]], columns=['Group', 'Value'])
my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=['Group', 'Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
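For comparison, a sketch of the same filtering without a row-wise apply: compute each group's maximum once with groupby, then keep only the qualifying groups via isin (using the frames and threshold from the question):

import pandas as pd

my_df1 = pd.DataFrame([[1, 12], [1, 15], [1, 3], [1, 6], [2, 8], [2, 1], [2, 17]], columns=['Group', 'Value'])
my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=['Group', 'Group_Value'])

my_max = 16

# Max Value per Group, computed once
group_max = my_df1.groupby('Group')['Value'].max()

# Keep only the groups whose max stays within the limit
good = group_max[group_max <= my_max].index
my_df1_Group = my_df1_Group[my_df1_Group['Group'].isin(good)]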
EDIT: here are the first lines:

import os
import datetime as dt
import pandas as pd

df = pd.read_csv(os.path.join(path, file), dtype=str, delimiter=';', error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x: dt.datetime.strptime(x, '%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains numeric values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
To solve this I wrote this function:
def format_nombre(df):
    length, width = df.shape
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            if type(element) != type(df.iloc[1, 0]):
                a = df.iloc[i, j].replace(".", "")
                b = float(a.replace(",", "."))
                df.iloc[i, j] = b
Basically, I select each intersection of all rows and columns, I replace the problematic characters, I turn the element into a float and I replace it in the dataframe. The if ensures that the function doesn't consider dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60h.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
import numpy as np

def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False so '.' is treated literally, not as a regex wildcard
        return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)

df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric instead of astype(float).
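A quick sanity check of that approach on a toy frame (the column names here are illustrative, not from the question's csv):

import numpy as np
import pandas as pd

def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)

df = pd.DataFrame({
    'CALDAY': pd.to_datetime(['01/02/2020', '02/02/2020'], format='%d/%m/%Y'),
    'amount': ['1.700,35', '2.500,00'],
})

df = df.apply(to_numeric)
print(df)
#       CALDAY   amount
# 0 2020-02-01  1700.35
# 1 2020-02-02  2500.00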