I have the following data set, generated from an equation:
df.loc[:,51] = [1217.0, -20.0, 13970.0, -74]
I dropped the negative values (replacing them with 0) and got this:
df.loc[:,52] = [1217.0, 0, 13970.0, 0]
Now I am trying to get another column holding the dropped values:
df.loc[:,53] = df.drop_duplicates(subset=[df.loc[:,51], df.loc[:,52]])
This is the result I want, a column containing the values that were dropped:
df.loc[:,53] = [0,-20, 0,-74]
But instead I got the following error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Try numpy.where
df.loc[:,53] = np.where(df.loc[:,51] == df.loc[:,52], 0, df.loc[:,51])
Here I've done it with some sample data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'rating': [4, -4, 3.5, -15, 5]
})
# zero out negative ratings, keep positive ones
df.loc[(df['rating'] < 0), 'new_col'] = 0
df.loc[(df['rating'] > 0), 'new_col'] = df['rating']
# recover the values that were zeroed out
df['dropped'] = np.where(df['rating'] == df['new_col'], 0, df['rating'])
df
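For reference, the sample above should produce roughly this frame (values computed from the data shown; float formatting may differ):
     brand  rating  new_col  dropped
0  Yum Yum     4.0      4.0      0.0
1  Yum Yum    -4.0      0.0     -4.0
2  Indomie     3.5      3.5      0.0
3  Indomie   -15.0      0.0    -15.0
4  Indomie     5.0      5.0      0.0
The 'dropped' column recovers exactly the negative values that were zeroed out, which is the pattern asked for.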
I am having an issue returning the original df index of a row, given a groupby condition, after subselecting some of the df. It's easier to understand through code.
So if we start with a toy dataframe:
headers = ['a','b']
nrows = 8
df = pd.DataFrame(columns = headers)
df['a'] = [0]*(nrows//2) + [1]*(nrows//2)
df['b'] = [2]*(nrows//4) + [4]*(nrows//4) + [2]*(nrows//4) + [4]*(nrows//4)
print(df)
then I select the subset of data I am interested in and check that the index is retained:
sub_df = df[df['a']==1] ## selects for only group 1 (indices 4-7)
print(sub_df.index) ## looks good so far
sub_df.index returns
Int64Index([4, 5, 6, 7], dtype='int64')
Which seems great! I would then like to group data from that subset and extract the original df index, and that is where the issue occurs.
For example:
g_df = sub_df.groupby('b')
g_df_idx = g_df.indices
print(g_df_idx) ## bad!
when I print(g_df_idx), it returns positional indices within the subset, {2: array([0, 1]), 4: array([2, 3])}, rather than the original labels. I want it to return:
{2: array([4,5]), 4: array([6,7])}
Due to the way I will be using this code, I can't just groupby(['a','b']).
I'm going nuts with this thing. Here are some of the many solutions I have tried:
## 1
e1_idx = sub_df.groupby('b').indices
# print(e1_idx) ## issue persists
## 2
e2 = sub_df.groupby('b', as_index = True) ## also tried as_index = False
e2_idx = e2.indices
# print(e2_idx) ## issue persists
## 3
e3 = sub_df.reset_index()
e3_idx = e3.groupby('b').indices
# print(e3_idx) ## issue persists
I'm sure there must be some simple solution I'm just overlooking. Would be very grateful for any advice.
You can do it like this:
g_df_idx = g_df.apply(lambda x: x.index).to_dict()
print(g_df_idx)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}
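If a plain dict of index-label arrays is the goal, GroupBy.groups may be a simpler route: it maps each group key directly to the original index labels. A minimal sketch:
# .groups preserves the original (pre-subset) index labels
g_df_idx = {k: v.to_numpy() for k, v in sub_df.groupby('b').groups.items()}
print(g_df_idx)
# {2: array([4, 5]), 4: array([6, 7])}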
import pandas as pd

temp = [79, 80, 81, 80, 80, 79, 76, 75, 76, 78, 80, 81]
for i in range(len(temp)):
    if temp[i] <= 80:
        level = 0
    elif temp[i] > 80 and temp[i] <= 100:
        level = 1
    elif temp[i] <= 75:
        n_level = 0
    elif temp[i] > 75 and temp[i] <= 95:
        n_level = 1
    df = pd.DataFrame([[temp[i], level]], columns=['temp1', 'level1', 'newlevel'])
    print(df)
I am unable to get the output, which should look like this:
temp  level  newlevel
  79      0         0
  80      0         0
  81      1         1
Please find a solution below. I used list comprehensions to create the data required to build the data frame; change the conditions inside them to meet your needs. Code and output:
import pandas as pd
import numpy as np
temp=[79,80,81,80,80,79,76,75,76,78,80,81]
level=[0 if i<=80 else 1 for i in temp]
n_level=[0 if i<=75 else 1 for i in temp]
#print(temp)
#print(level)
#print(n_level)
df=pd.DataFrame({'temp':temp,'level':level,'newlevel':n_level})
df
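Running this should display roughly the following (first rows shown). Note that with these thresholds, 79 gets newlevel 1, since 79 > 75:
   temp  level  newlevel
0    79      0         1
1    80      0         1
2    81      1         1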
You can generate your DataFrame using solely Pandas methods.
To generate both "level" columns, it is enough to use cut:
df = pd.DataFrame({'temp': temp,
                   'level1': pd.cut(temp, [0, 80, 1000], labels=[0, 1]),
                   'newlevel': pd.cut(temp, [0, 75, 1000], labels=[0, 1])})
Note: Both "level" columns of the output DataFrame have category type.
If you are unhappy about that, cast them to int:
df = pd.DataFrame({'temp': temp,
                   'level1': pd.cut(temp, [0, 80, 1000], labels=[0, 1]).astype(int),
                   'newlevel': pd.cut(temp, [0, 75, 1000], labels=[0, 1]).astype(int)})
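As a quick sanity check after the cast (a sketch; the exact integer dtype can be platform-dependent):
print(df.dtypes)
# temp        int64
# level1      int64
# newlevel    int64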
I'm trying to find matching values in a pandas dataframe. Once a match is found I want to perform some operations on the row of the dataframe.
Currently I'm using this code:
import pandas as pd
d = {'child_id': [1, 2, 5, 4], 'parent_id': [3, 4, 2, 3], 'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)
for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            print(df.content[i])
        else:
            pass
It works fine, but it is rather slow; since I'm dealing with a dataset with millions of rows, it would take months. Is there a faster way to do this?
Edit: To clarify, what I want to create is a dataframe which contains the content of the matches.
import pandas as pd

d = {'child_id': [1, 2, 5, 4],
     'parent_id': [3, 4, 2, 3],
     'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

df2 = pd.DataFrame(columns=("content_child", "content_parent"))
for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            content_child = str(df["content"][i])
            content_parent = str(df["content"][j])
            s = pd.Series([content_child, content_parent],
                          index=['content_child', 'content_parent'])
            df2 = df2.append(s, ignore_index=True)
        else:
            pass
print(df2)
The fastest way is to use the features of numpy:
import pandas as pd
import numpy as np

d = {
    'child_id': [1, 2, 5, 4],
    'parent_id': [3, 4, 2, 3],
    'content': ["a", "b", "c", "d"]
}
df = pd.DataFrame(data=d)

# elementwise comparisons: aligned, and with each column reversed
comp1 = df['child_id'].values == df['parent_id'].values
comp2 = df['child_id'].values[::-1] == df['parent_id'].values
comp3 = df['child_id'].values == df['parent_id'].values[::-1]

if comp1.any() and not comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1]]
elif comp1.any() and comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2]]
elif comp1.any() and comp2.any() and comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2], df['content'].values[comp3]]
else:
    comp = np.array([])  # no branch matched for this data

print(comp)
Which outputs:
[]
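For completeness: a self-merge expresses the same pairing as the question's double loop in one vectorized step and scales far better to millions of rows. This is a hedged sketch rather than the approach above; the _child/_parent suffixes are arbitrary names:
import pandas as pd

d = {'child_id': [1, 2, 5, 4],
     'parent_id': [3, 4, 2, 3],
     'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

# pair every parent_id with every matching child_id in one pass
merged = df.merge(df, left_on='parent_id', right_on='child_id',
                  suffixes=('_child', '_parent'))
print(merged[['content_child', 'content_parent']])
#   content_child content_parent
# 0             b              d
# 1             c              b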
I have recently asked a question on applying select_dtypes for specific columns of a data frame.
I have this data frame, whose columns hold mixed dtypes (str and int in this case).
df = pd.DataFrame([
    [-1, 3, 0],
    [5, 2, 1],
    [-6, 3, 2],
    [7, '<blank>', 3],
    ['<blank>', 2, 4],
    ['<blank>', '<blank>', '<blank>']], columns='A B C'.split())
I want to create different masks for strings and integers, and then apply stylings based on these masks.
First, let's define a function that will help me create my mask for different dtypes (thanks to @jpp):
def filter_type(s, num=True):
    s_new = pd.to_numeric(s, errors='coerce')
    if num:
        return s_new.notnull()
    else:
        return s_new.isnull()
then our first mask will be:
mask1 = filter_type(df['A'], num=False) # working and creating the bool values
The second mask will be based on an interval of integers:
mask2 = df['A'].between(7 , 0 , inclusive=False)
But when I run mask2 it gives me the error:
TypeError:'>' not supported between instances of 'str' and 'int'
How can I overcome this issue?
Note: The styling I would like to apply is like below:
def highlight_col(x):
    df = x.copy()
    mask1 = filter_type(df['A'], num=False)
    mask2 = df['A'].between(7, 0, inclusive=False)
    x.loc[mask1, ['A', 'B', 'C']] = 'background-color: ""'
    x.loc[mask2, ['A', 'B', 'C']] = 'background-color: #7fbf7f'
pd.DataFrame.loc is used to set values; you need pd.DataFrame.style to set styles. In addition, you can use try / except as a way of identifying when numeric comparisons fail.
Here's a minimal example:
def styler(x):
    res = []
    for i in x:
        try:
            if 0 <= i <= 7:
                res.append('background: red')
            else:
                res.append('')
        except TypeError:
            res.append('')
    return res
res = df.style.apply(styler, axis = 1)
Result: numeric cells with values between 0 and 7 are given a red background (the rendered table is not reproduced here).
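To inspect the styled table outside a notebook, recent pandas versions can export it to HTML (a sketch; Styler.to_html requires pandas >= 1.3, and styled.html is an arbitrary filename):
# write the rendered Styler to a standalone HTML file
with open('styled.html', 'w') as fh:
    fh.write(res.to_html())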
I tried to calculate specific quantile values from a data frame, as shown in the code below. There was no problem when calculating them on separate lines.
When attempting to run the last two lines, I get the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'quantile(0.25)'
How can I fix this?
import pandas as pd
df = pd.DataFrame(
    {
        'x': [0, 1, 0, 1, 0, 1, 0, 1],
        'y': [7, 6, 5, 4, 3, 2, 1, 0],
        'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000]
    }
)
f = {'number': ['median', 'std', 'quantile']}
df1 = df.groupby('x').agg(f)
df.groupby('x').quantile(0.25)
df.groupby('x').quantile(0.75)
# code below with problem:
f = {'number': ['median', 'std', 'quantile(0.25)', 'quantile(0.75)']}
df1 = df.groupby('x').agg(f)
I prefer defining named functions:
def q1(x):
    return x.quantile(0.25)

def q3(x):
    return x.quantile(0.75)
f = {'number': ['median', 'std', q1, q3]}
df1 = df.groupby('x').agg(f)
df1
Out[1643]:
number
median std q1 q3
x
0 52500 17969.882211 40000 61250
1 43000 16337.584481 35750 55000
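If you would rather not define named functions, pandas 0.25+ also accepts multiple lambdas per column and deduplicates the generated column names (a sketch; the columns come out as <lambda_0> and <lambda_1>):
f = {'number': ['median', 'std',
                lambda x: x.quantile(0.25),
                lambda x: x.quantile(0.75)]}
df1 = df.groupby('x').agg(f)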
@WeNYoBen's answer is great. There is one limitation though: one needs to create a new function for every quantile, which becomes very unpythonic as the number of quantiles grows. A better approach is to use a function that creates a function, and to rename the result appropriately.
def rename(newname):
    def decorator(f):
        f.__name__ = newname
        return f
    return decorator

def q_at(y):
    @rename(f'q{y:0.2f}')
    def q(x):
        return x.quantile(y)
    return q
f = {'number': ['median', 'std', q_at(0.25), q_at(0.75)]}
df1 = df.groupby('x').agg(f)
df1
Out[]:
number
median std q0.25 q0.75
x
0 52500 17969.882211 40000 61250
1 43000 16337.584481 35750 55000
The rename decorator renames the function so that the pandas agg function can deal with the reuse of the returned quantile function (otherwise all quantile results would end up in columns that are all named q).
There's a nice way if you want to give names to aggregated columns:
df.groupby('x').agg(
    q1_foo=pd.NamedAgg('number', q1),
    q2_foo=pd.NamedAgg('number', q2)
)
where q1 and q2 are functions.
Or even simpler:
df.groupby('x').agg(
    q1_foo=('number', q1),
    q2_foo=('number', q2)
)
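For example, with the q1 and q3 defined in the earlier answer (q1_foo and q3_foo are arbitrary output names; values sketched from the data above):
df.groupby('x').agg(
    q1_foo=('number', q1),
    q3_foo=('number', q3)
)
#     q1_foo   q3_foo
# x
# 0  40000.0  61250.0
# 1  35750.0  55000.0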