I have two dataframes:
1) customer_id,gender
2) customer_id,...[other fields]
The first dataset is the answer set (gender is the label). So I want to take from the second dataset those customer_id values which appear in the first dataset (whose gender we know) and call that subset 'train'. The remaining records should become the 'test' dataset.
I think you need boolean indexing with a condition built by isin; the boolean Series is inverted with ~:
import pandas as pd

df1 = pd.DataFrame({'customer_id':[1,2,3],
'gender':['m','f','m']})
print (df1)
customer_id gender
0 1 m
1 2 f
2 3 m
df2 = pd.DataFrame({'customer_id':[1,7,5],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
B C D E F customer_id
0 4 7 1 5 7 1
1 5 8 3 3 4 7
2 6 9 5 6 3 5
mask = df2.customer_id.isin(df1.customer_id)
print (mask)
0 True
1 False
2 False
Name: customer_id, dtype: bool
print (~mask)
0 False
1 True
2 True
Name: customer_id, dtype: bool
train = df2[mask]
print (train)
B C D E F customer_id
0 4 7 1 5 7 1
test = df2[~mask]
print (test)
B C D E F customer_id
1 5 8 3 3 4 7
2 6 9 5 6 3 5
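If you also need the known gender attached to the train rows, an inner merge is one option (a minimal sketch, assuming customer_id is unique in df1):
train = df2.merge(df1, on='customer_id', how='inner')     # keeps only ids present in df1 and adds gender
test = df2[~df2['customer_id'].isin(df1['customer_id'])]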
I am a beginner in python.
I need to recode a CSV file:
unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
The condition is: for each [unique_id], if any row has [Age]==6, then put the value 1 in the corresponding row with [pid]==1; all other rows should be 0.
the output csv will look like this:
unique_id,pid,Age,recode
1,1,1,0
1,2,3,0
2,1,5,1
2,2,6,0
3,1,6,1
3,2,4,0
3,3,6,0
3,4,1,0
3,5,4,0
4,1,6,1
4,2,5,0
I was using numpy, like the following:
import numpy
import pandas as pd

input_file1 = "data.csv"
input_folder = 'G:/My Drive/'
Her_HH = pd.read_csv(input_folder + input_file1)
Her_HH['recode'] = numpy.select([Her_HH['Age']==6, Her_HH['Age']<6], [1,0], default=Her_HH['recode'])
Her_HH.to_csv('recode_elderly.csv', index=False)
but it does not put value 1 in where [pid] is 1. Any help will be appreciated.
You can use DataFrame.assign to create a helper column and GroupBy.transform with 'any' to test whether at least one row in each group matches, chain that mask with the pid test using & (bitwise AND), and finally cast the output to integers:
#sorting if necessary
df = df.sort_values('unique_id')
m1 = df.assign(test=df['Age'] == 6).groupby('unique_id')['test'].transform('any')
Another idea for getting the groups that contain a 6 is to filter their unique_id values with Series.isin:
m1 = df['unique_id'].isin(df.loc[df['Age'] == 6, 'unique_id'])
m2 = df['pid'] == 1
df['recode'] = (m1 & m2).astype(int)
print (df)
unique_id pid Age recode
0 1 1 1 0
1 1 2 3 0
2 2 1 5 1
3 2 2 6 0
4 3 1 6 1
5 3 2 4 0
6 3 3 6 0
7 3 4 1 0
8 3 5 4 0
9 4 1 6 1
10 4 2 5 0
EDIT:
To check the groups with no Age of 6 at all, filter by the inverted mask with ~; if you want only one row per unique_id, add DataFrame.drop_duplicates:
print (df[~m1])
unique_id pid Age
0 1 1 1
1 1 2 3
df1 = df[~m1].drop_duplicates('unique_id')
print (df1)
unique_id pid Age
0 1 1 1
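Put together with the CSV workflow from the question, a minimal end-to-end sketch (the file names follow the question and are illustrative):
import pandas as pd

df = pd.read_csv('data.csv')                                    # unique_id, pid, Age
m1 = df['unique_id'].isin(df.loc[df['Age'] == 6, 'unique_id'])  # groups containing an Age of 6
m2 = df['pid'] == 1                                             # the pid==1 row of each group
df['recode'] = (m1 & m2).astype(int)
df.to_csv('recode_elderly.csv', index=False)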
This is a bit clumsy, since I know numpy a lot better than pandas.
Load your csv sample into a dataframe:
In [205]: df = pd.read_csv('stack59885878.csv')
In [206]: df
Out[206]:
unique_id pid Age
0 1 1 1
1 1 2 3
2 2 1 5
3 2 2 6
4 3 1 6
5 3 2 4
6 3 3 6
7 3 4 1
8 3 5 4
9 4 1 6
10 4 2 5
Generate a groupby object based on the unique_id column:
In [207]: gps = df.groupby('unique_id')
In [209]: gps.groups
Out[209]:
{1: Int64Index([0, 1], dtype='int64'),
2: Int64Index([2, 3], dtype='int64'),
3: Int64Index([4, 5, 6, 7, 8], dtype='int64'),
4: Int64Index([9, 10], dtype='int64')}
I've seen pandas ways for iterating on groups, but here's a list comprehension. The iteration produces a tuple with the id and a dataframe. We want to test each group dataframe for 'Age' and 'pid' values:
In [211]: recode_values = [(gp['Age']==6).any() & (gp['pid']==1) for x, gp in gps]
In [212]: recode_values
Out[212]:
[0 False
1 False
Name: pid, dtype: bool, 2 True
3 False
Name: pid, dtype: bool, 4 True
5 False
6 False
7 False
8 False
Name: pid, dtype: bool, 9 True
10 False
Name: pid, dtype: bool]
The result is a list of Series, with a True where pid is 1 and there's an 'Age' of 6 in the group.
Joining these Series with numpy.hstack produces a boolean array, which we can convert to an integer array:
In [214]: np.hstack(recode_values)
Out[214]:
array([False, False, True, False, True, False, False, False, False,
True, False])
In [215]: df['recode']=_.astype(int) # assign that to a new column
In [216]: df
Out[216]:
unique_id pid Age recode
0 1 1 1 0
1 1 2 3 0
2 2 1 5 1
3 2 2 6 0
4 3 1 6 1
5 3 2 4 0
6 3 3 6 0
7 3 4 1 0
8 3 5 4 0
9 4 1 6 1
10 4 2 5 0
Again, I think there's an idiomatic pandas way of joining those series. But for now this works.
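For the record, pd.concat is one idiomatic way to do that join: the group indices together cover the original index, so the concatenated Series lines up with df (a small sketch using the recode_values list above):
df['recode'] = pd.concat(recode_values).astype(int)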
===
OK, the groupby object has an apply:
In [223]: def foo(gp):
...: return (gp['Age']==6).any() & (gp['pid']==1).astype(int)
...:
In [224]: gps.apply(foo)
Out[224]:
unique_id
1 0 0
1 0
2 2 1
3 0
3 4 1
5 0
6 0
7 0
8 0
4 9 1
10 0
Name: pid, dtype: int64
And remove the multi-indexing with:
In [242]: gps.apply(foo).reset_index(0, True)
Out[242]:
0 0
1 0
2 1
3 0
4 1
5 0
6 0
7 0
8 0
9 1
10 0
Name: pid, dtype: int64
In [243]: df['recode']=_ # and assign to recode
Lots of experimenting and learning here.
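One caveat: in recent pandas versions reset_index takes drop as a keyword-only argument, so the positional call above may fail; the keyword form is equivalent:
gps.apply(foo).reset_index(level=0, drop=True)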
How to drop columns with more than 50 distinct values using a function?
Here the columns to drop are: date_dispatch, con_birth_dt, dat_cust_open, cust_mgr_team, mng_issu_date, created_date
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
label 1
date_dispatch 2883
con_birth_dt 12617
con_sex_mf 2
dat_cust_open 264
cust_mgr_team 2250
mng_issu_date 1796
um_num 38
created_date 2900
hqck_flag 2
dqck_flag 2
tzck_flag 2
yhlcck_flag 2
bzjck_flag 2
gzck_flag 2
jjsz_flag 2
e_yhlcck_flag 2
zq_flag 2
xtsz_flag 1
whsz_flag 1
hjsz_flag 2
yb_flag 2
qslc_flag 2
Use drop with index values filtered by boolean indexing:
a = app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
df = app_train.drop(a.index[a > 50], axis=1)
Another solution is to add reindex for the missing (non-object) columns, which get a count of 0 and are therefore kept, and then filter by the inverted condition <=:
a = (app_train.select_dtypes('object')
.apply(pd.Series.nunique, axis = 0)
.reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 50]
Sample:
app_train = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (app_train)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
a = (app_train.select_dtypes('object')
.apply(pd.Series.nunique, axis = 0)
.reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 5]
print (df)
B C D E F
0 4 7 1 5 a
1 5 8 3 3 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
nunique + loc
You can use nunique followed by loc with Boolean indexing:
n = 5 # maximum number of unique values permitted
counts = app_train.select_dtypes(['object']).apply(pd.Series.nunique)
df = app_train.loc[:, ~app_train.columns.isin(counts[counts > n].index)]
# data from jezrael
print(df)
B C D E F
0 4 7 1 5 a
1 5 8 3 3 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
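Since the title asks for a function, either approach wraps up easily (a sketch; drop_high_cardinality is just an illustrative name and 50 is the threshold from the question):
import pandas as pd

def drop_high_cardinality(df, max_unique=50):
    # count distinct values per object column and drop the columns above the threshold
    counts = df.select_dtypes('object').apply(pd.Series.nunique)
    return df.drop(columns=counts[counts > max_unique].index)

app_train = drop_high_cardinality(app_train)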
I have a 1000 by 6 dataframe, where A, B, C, D were rated by people on a scale of 1-10.
In the SELECT column, I have a value which in all cases is the same as the value in one of A/B/C/D.
I want to change the value in 'SELECT' to the name of the column it matches. For example, for ID 1, SELECT = 1 and D = 1, so the value of SELECT should change to D.
import pandas as pd
df = pd.read_excel("u.xlsx",sheet_name = "Sheet2",header = 0)
But I am lost how to proceed.
Similar to Gwenersl's solution: compare all columns except ID and SELECT (filtered out with difference) using DataFrame.eq (==), take the first True per row with idxmax, and, where no value matches, set 'no match' with numpy.where:
import numpy as np

cols = df.columns.difference(['ID','SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
print (df)
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Detail:
print (mask)
A B C D
0 False False False True
1 False False True False
2 False False True False
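The np.where fallback only matters when a row's SELECT value appears in none of the columns; a hypothetical extra row shows the 'no match' branch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':[1,2], 'A':[4,5], 'B':[9,7], 'C':[7,2], 'D':[1,8],
                   'SELECT':[1,3]})          # the second SELECT value matches no column
cols = df.columns.difference(['ID','SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
# row 0 becomes 'D', row 1 becomes 'no match'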
Assuming the values in A, B, C, D are unique in each row with respect to SELECT, I'd do it like this:
>>> df
ID A B C D SELECT
0 1 4 9 7 1 1
1 2 5 7 2 8 2
2 3 7 4 8 6 8
>>>
>>> df_abcd = df.loc[:, 'A':'D']
>>> df['SELECT'] = df_abcd.apply(lambda row: row.isin(df['SELECT']).idxmax(), axis=1)
>>> df
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Use -
df['SELECT2'] = df.columns[pd.DataFrame([df['SELECT'] == df['A'],
                                         df['SELECT'] == df['B'],
                                         df['SELECT'] == df['C'],
                                         df['SELECT'] == df['D']]).transpose().idxmax(1) + 1]
Output
ID A B C D SELECT SELECT2
0 1 4 9 7 1 1 D
1 2 5 7 2 8 2 C
2 3 7 4 8 6 8 C
I am practising on the IMDB dataset and I would like to find the top genres that had the most budget.
That would be useful in situations where a boxplot is needed and the genres are numerous; reducing them to the most expensive ones would make the boxplot clearer.
I tried this: df.sort_values(by=["genres","budget"])
but it isn't right.
If you need to return all columns, I think you need sort_values + groupby + head:
df=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(5)
Or nlargest:
df = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(5, "budget"))
If you need to return only the genres and budget columns:
df = df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
Samples:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aaabbb')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
5 f 4 3 4 0 b
df1=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(2)
df1 = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(2, "budget"))
print (df1)
A B C E budget genres
2 c 4 9 6 5 a
1 b 5 8 3 3 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
df1=df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
print (df1)
genres budget
0 a 5
1 a 3
2 b 7
3 b 1
---
If you need the top genres by the sum of budget per genre:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aabbcc')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 b
3 d 5 4 9 7 b
4 e 5 2 2 1 c
5 f 4 3 4 0 c
df = df.groupby('genres')['budget'].sum().nlargest(2)
print (df)
genres
b 12
a 4
Name: budget, dtype: int64
Detail:
print (df.groupby('genres')['budget'].sum())
genres
a 4
b 12
c 1
Name: budget, dtype: int64
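To tie it back to the boxplot use case, one option is to keep only the rows whose genre is among the top n by total budget (a sketch, assuming df is the raw frame with genres and budget columns, not the aggregated Series above):
top_genres = df.groupby('genres')['budget'].sum().nlargest(5).index   # top 5, for example
df_top = df[df['genres'].isin(top_genres)]
df_top.boxplot(column='budget', by='genres')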
I have the below result from a pivot table, which counts the customer grades that visited my stores. I used the 'droplevel' method to flatten the column header into one layer; how can I do the same for the index? I want to remove 'Grade' above the index, so that the column headers are at the same level as 'Store No_'.
It seems you need to remove the columns' name:
df.columns.name = None
Or rename_axis:
df = df.rename_axis(None, axis=1)
Sample:
df = pd.DataFrame({'Store No_':[1,2,3],
'A':[4,5,6],
'B':[7,8,9],
'C':[1,3,5],
'D':[5,3,6],
'E':[7,4,3]})
df = df.set_index('Store No_')
df.columns.name = 'Grade'
print (df)
Grade A B C D E
Store No_
1 4 7 1 5 7
2 5 8 3 3 4
3 6 9 5 6 3
print (df.rename_axis(None, axis=1))
A B C D E
Store No_
1 4 7 1 5 7
2 5 8 3 3 4
3 6 9 5 6 3
df = df.rename_axis(None, axis=1).reset_index()
print (df)
Store No_ A B C D E
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
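For completeness, the 'Grade' label usually comes from the pivot itself, because pivot_table names the columns Index after the original column; a minimal sketch of how it appears and how rename_axis clears it (the column names here are illustrative):
import pandas as pd

raw = pd.DataFrame({'Store No_':[1,1,2,2], 'Grade':['A','B','A','A'], 'Customer':[10,20,30,40]})
pivot = raw.pivot_table(index='Store No_', columns='Grade', values='Customer', aggfunc='count')
# pivot.columns.name is 'Grade', which prints above the column headers
pivot = pivot.rename_axis(None, axis=1).reset_index()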