I have a DataFrame that I need to modify based on the values in one of its columns. In particular, when the value in column a is 110 or above, I want column b to be assigned the value -99. The only issue is that the first 3 rows of the dataframe contain a mix of string and numerical data types, so when I try:
df.loc[df['a'] >= 110, 'b'] = -99
I get a TypeError because comparison between str and int is not allowed.
So my question is: how do I do this assignment while ignoring the first 3 rows of the dataframe?
So far I've come up with this rather dodgy way:
try:
    df.loc[df['a'] >= 110, 'b'] = -99
except TypeError:
    pass
This does seem to work, but it obviously isn't the proper way to do it.
EDIT: Also, this method just skips the first 3 rows, but I really need to keep them as they are.
Try:
df.loc[pd.to_numeric(df['a'], errors='coerce').ge(110), 'b'] = -99
errors='coerce' turns values that can't be parsed as numbers into NaN, and NaN compares as False against 110, so the string rows are simply left alone. (errors='ignore' would leave the strings in place, so the .ge(110) comparison would raise the same TypeError.)
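A quick demonstration with made-up data (the first three rows hold strings, mirroring the question):
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'z', 120, 100, 115],
                   'b': [0, 0, 0, 0, 0, 0]})
# string rows coerce to NaN, which compares as False, so they are untouched
df.loc[pd.to_numeric(df['a'], errors='coerce').ge(110), 'b'] = -99
print(df['b'].tolist())  # [0, 0, 0, -99, 0, -99]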
I have numbers in a List that should get assigned to certain rows of a dataframe consecutively.
List = [2, 5, 7, 12, …]
In my dataframe, which looks similar to the table below, I need to do the following:
Every row whose frame_index is 1 gets the next element of List as its sequence_number:
First occurrence of Frame_Index == 1: assign the first element of List as Sequence_number.
Next occurrence of Frame_Index == 1: assign the second element of List as Sequence_number.
So my goal is to achieve a new dataframe like this:
I don't know which functions to use. If this weren't Python, I would use a for loop and check where frame_index == 1, but my dataset is large and I need a pythonic way to achieve this. I appreciate any help.
EDIT: I tried the following to fill in my List values, intending to use fillna with ffill afterwards:
concatenated_df['Sequence_number'] = [List[i] for i in concatenated_df.index
                                      if (concatenated_df['Frame_Index'] == 1).any()]
But of course I'm getting a "list index out of range" error.
I think you could do that in two steps.
Add the column and fill it with your list where frame_index == 1.
Use df.ffill() (equivalent to df.fillna() with the method="ffill" kwarg, which is deprecated in recent pandas).
import pandas as pd
df = pd.DataFrame({"frame_index": [1,2,3,4,1,2]})
sequence = [2,5]
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence
df.ffill(inplace=True) # alias for df.fillna(method="ffill")
This makes sequence_number float64 (because of the intermediate NaNs), which might be acceptable in your use case; if you want int64, you can force the dtype when creating the column (the df.loc assignment above) or cast it later.
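For reference, the resulting frame, plus the optional cast back to integers:
print(df)
#    frame_index  sequence_number
# 0            1              2.0
# 1            2              2.0
# 2            3              2.0
# 3            4              2.0
# 4            1              5.0
# 5            2              5.0
df["sequence_number"] = df["sequence_number"].astype("int64")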
I am looking to slice the string values in a dataframe column based on conditions. I understand I can assign specific values to rows in my df column based on given conditions using .loc; however, here I need the condition just to determine how much to slice.
For example, if the row starts with 'A', I would like the first 6 chars ([:6]), whereas if it starts with 'B' I would like the first 8 chars ([:8]).
I am doing this in order to get the data into the correct format before I perform an inner join with another dataframe using pd.merge()
Regarding .loc: I can use df.loc[df['column'][:1] == 'A'], but it doesn't give me the index of the rows that satisfy the condition. The best solution I can think of is creating a list of all of the indexes that satisfy the conditions and then manipulating each row one by one. Is there a better way to do this?
You can build boolean masks and pick the slice per row with np.select:
import numpy as np

m1 = df.col.str[0] == 'A'
m2 = df.col.str[0] == 'B'
df['NewCol'] = np.select([m1, m2], [df.col.str[:6], df.col.str[:8]], default=df.col)
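A quick demo with made-up strings (the column name col follows the snippet above):
import pandas as pd

df = pd.DataFrame({'col': ['A123456789', 'B123456789', 'C123456789']})
m1 = df.col.str[0] == 'A'
m2 = df.col.str[0] == 'B'
# rows starting with 'A' keep 6 chars, 'B' keeps 8, everything else is unchanged
df['NewCol'] = np.select([m1, m2], [df.col.str[:6], df.col.str[:8]], default=df.col)
print(df['NewCol'].tolist())  # ['A12345', 'B1234567', 'C123456789']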
I have a dataframe with 2 columns, and I want to create a 3rd column that returns True or False for each row according to whether the value in column A is contained in the value in column B.
Here's my code:
C = []
for index, row in df.iterrows():
    if row['A'][index] in row['B'][index]:
        C[index] = True
    else:
        C[index] = False
I get the following errors:
1) TypeError: 'float' object is not subscriptable
2) IndexError: list assignment index out of range
How can I solve these errors?
I think the problem is that some values of row['A'] or row['B'] are floats. A float can't be subscripted, so row['A'][index] effectively becomes float[index], which is what raises the TypeError. Are you expecting a string value there? It could be that not all values in the dataframe have the same data type.
Secondly, index is the row label, so I don't know why you are using it to subscript the cell values. (Note also that C starts out as an empty list, so C[index] = True is itself what raises the "list assignment index out of range" error; C.append(...) would avoid it.) To say more I would need a look at the data, but even if row['A'] were a string or array that could be traversed, the index may be too large. For example:
row['A'] = "hello"
a = row['A'][10]
will give you the IndexError.
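For what it's worth, the whole loop can be replaced by a single comprehension (a sketch; the str() casts sidestep the float cells that tripped up the original code):
df['C'] = [str(a) in str(b) for a, b in zip(df['A'], df['B'])]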
I need to compare some columns in a dataframe as a whole, for example:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
#Select condition: If df['A'] == 1 and df['B'] == 4, then pick up this row.
For this simple example, I can use the method below:
df.loc[(df['A']==1)&(df['B']==4),'A':'B']
However, in reality my dataframe has tens of columns which should be compared as a whole. The above solution would be very messy if I listed all of them, so I thought comparing them as a whole against a list might solve the problem:
#something just like this:
df.loc[df.loc[:, 'A':'B'] == [1, 4], 'A':'B']
That didn't work. So I came up with the idea of first combining all desired columns into a new column holding a list value, then comparing this new column with the list. The latter has been solved in Pandas: compare list objects in Series.
Although generally I've solved my case, I still want to know if there is an easier way to solve this problem? Thanks.
Or use [[]] to select multiple columns:
df[(df[['A','B']].values==[1,4]).all(1)]
Demo:
>>> df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
>>> df[(df[['A','B']].values==[1,4]).all(1)]
A B
0 1 4
You can use a Boolean mask via a NumPy array representation of your data:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
res = df[(df.loc[:, 'A':'B'].values == [1, 4]).all(1)]
print(res)
A B
0 1 4
In this situation, never combine your columns into a single series of lists. This is inefficient as you will lose all vectorisation benefits, and any processing thereafter will involve Python-level loops.
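If you prefer to stay within pandas, the same comparison can be written with DataFrame.eq, which broadcasts the target list across the selected columns and scales to any number of columns (a sketch; cols and target are placeholders to fill in with your real columns and expected values):
cols = ['A', 'B']    # extend with your tens of columns
target = [1, 4]      # one expected value per column, same order as cols
res = df[df[cols].eq(target).all(axis=1)]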
I have a DataFrame df with a column 'a'. How would I create a new column 'b' which has dtype=object?
I know this may be considered poor form, but at the moment I have a dataframe df where column 'a' contains arrays (each element is an np.array). I want to create a new column 'b' where each element is a new np.array containing the base-10 logs of the corresponding element in 'a'.
I have tried these two methods, but neither worked:
for i in df.index:
    df.set_value(i, 'b', log10(df.loc[i, 'a']))
and
for i in df.index:
    df.loc[i, 'b'] = log10(df.loc[i, 'a'])
Both give me ValueError: Must have equal len keys and value when setting with an iterable.
I'm assuming the error comes about because the dtype of the new column defaults to float, although I may be wrong.
As each cell of your column is an array, it's better to use the standard NumPy mathematical functions to compute the element-wise logarithms to base 10, mapped over the cells with apply:
df['log_a'] = df['a'].apply(np.log10)
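A minimal end-to-end sketch (the data here is made up; note the numpy import, and that the resulting column holds arrays, so its dtype is object):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.array([1.0, 10.0]), np.array([100.0, 1000.0])]})
df['log_a'] = df['a'].apply(np.log10)  # each cell becomes an array of base-10 logs
print(df['log_a'].tolist())  # [array([0., 1.]), array([2., 3.])]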