Python Pandas Dataframe: replace variable by the frequency count

I have a dataframe which has categorical variables with hundreds of different values.
I'm able to verify the frequency of these levels using the value_counts() function or using a groupby statement + reset_index() ...
I was trying to replace these hundreds of values by their frequency count (and later on merge levels with low cardinality). I was trying to join two different dataframes (one with the values and the other with the counts), but I'm having issues...
For example, the frequency table would be below, with around 300 records (all unique):
v_catego Time
0 AA 353
1 AAC 136
2 ABB 2
3 ABC 1
4 ACA 13
...
300 ZZZ 33
original dataframe:
v_catego
0 AA
1 AAC
2 ABB
3 AAC
4 DA
5 AAC
................
where I would like to replace (or add another) variable with the 'Time' values for each instance:
v_catego new_v_catego
0 AA 353
1 AAC 136
2 ABB 2
3 AA 353
4 AAC 136
.................
I know in R there is a simple function that does this. Is there an equivalent in python?

IIUC you can use concat, but first you have to set the same categories in both Series (columns) with add_categories:
print df
v_catego Time
0 AA 353
1 AAC 136
2 ABB 2
3 AA 353
4 AAC 136
print df1
v_catego Time
0 ABC 1
1 ACA 13
#remember old categories in df1
old_cat = df1['v_catego']
#set same categories in both dataframes in column v_catego
df1['v_catego'] = df1['v_catego'].cat.add_categories(df['v_catego'].cat.categories)
df['v_catego'] = df['v_catego'].cat.add_categories(old_cat.cat.categories)
print df.v_catego
0 AA
1 AAC
2 ABB
3 AA
4 AAC
Name: v_catego, dtype: category
Categories (5, object): [AA, AAC, ABB, ABC, ACA]
print df1.v_catego
0 ABC
1 ACA
Name: v_catego, dtype: category
Categories (5, object): [ABC, ACA, AA, AAC, ABB]
print pd.concat([df,df1])
v_catego Time
0 AA 353
1 AAC 136
2 ABB 2
3 AA 353
4 AAC 136
0 ABC 1
1 ACA 13
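As a side note, a related helper (not used in this answer) is pd.api.types.union_categoricals, which builds one categorical whose categories are the union of both inputs; a minimal sketch:
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(['AA', 'AAC', 'ABB'])
b = pd.Categorical(['ABC', 'ACA'])

# one categorical holding all values, with the union of both category sets
combined = union_categoricals([a, b])
print(combined.categories)  # Index(['AA', 'AAC', 'ABB', 'ABC', 'ACA'], dtype='object')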
EDIT:
I think you can use merge:
print df
v_catego
0 AA
1 AAC
2 ABB
3 AA
4 AAC
5 ABB
6 AA
7 AAC
8 AA
9 AAC
10 AAC
11 ABB
12 AA
13 AAC
14 ABB
15 AA
16 AAC
17 AA
18 AAC
df1 = (df['v_catego'].value_counts()
                     .reset_index(name='count')
                     .rename(columns={'index': 'v_catego'}))
print df1
v_catego count
0 AAC 8
1 AA 7
2 ABB 4
print pd.merge(df,df1,on=['v_catego'], how='left' )
v_catego count
0 AA 7
1 AAC 8
2 ABB 4
3 AA 7
4 AAC 8
5 ABB 4
6 AA 7
7 AAC 8
8 AA 7
9 AAC 8
10 AAC 8
11 ABB 4
12 AA 7
13 AAC 8
14 ABB 4
15 AA 7
16 AAC 8
17 AA 7
18 AAC 8
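As a footnote (not part of the original answer): if the goal is only to attach each value's frequency as a new column, Series.map with value_counts, or a groupby transform, does it in one step. A minimal sketch on toy data:
import pandas as pd

df = pd.DataFrame({'v_catego': ['AA', 'AAC', 'ABB', 'AA', 'AAC']})

# map every value to its overall frequency
df['new_v_catego'] = df['v_catego'].map(df['v_catego'].value_counts())

# equivalent: compute each group's size and broadcast it back to the rows
df['new_v_catego2'] = df.groupby('v_catego')['v_catego'].transform('size')
print(df)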

Related

How to groupby multiple columns in dataframe, except one in python

I have the following dataframe:
ID Code Color Value
-----------------------------------
0 111 AAA Blue 23
1 111 AAA Red 43
2 111 AAA Green 4
3 121 ABA Green 45
4 121 ABA Green 23
5 121 ABA Red 75
6 122 AAA Red 52
7 122 ACA Blue 24
8 122 ACA Blue 53
9 122 ACA Green 14
...
I want to group this dataframe by the columns "ID", and "Code", and sum the values from the "Value" column, while excluding the "Color" column from this grouping. Or in other words, I want to group by all non-Value columns, except for the "Color" column, and then sum the values from the "Value" column. I am using python for this.
What I am thinking of doing is creating a list, "column_list", of all column names that are not "Color" or "Value", and then simply running:
df.groupby['column_list'].sum()
Though this will not work. How might I augment this code so that I can properly groupby as intended?
EDIT:
This code works:
bins = df.groupby([df.columns[0],
                   df.columns[1],
                   df.columns[2]]).count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))
Full code that is not working:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list, as_index=False)['Value'].count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))
You can pass a list to groupby and specify the column to aggregate with sum:
column_list = [c for c in df.columns if c not in ['Color','Value']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
Or:
column_list = list(df.columns.difference(['Color','Value'], sort=False))
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
It works with the sample data like:
df1 = df.groupby(['ID','Code'], as_index=False)['Value'].sum()
EDIT: Yes, this also works:
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
The reason is that sum drops non-numeric columns by default, and if Value is not specified it sums all numeric columns.
So if Color is numeric, it gets summed too:
print (df)
ID Code Color Value
0 111 AAA 1 23
1 111 AAA 2 43
2 111 AAA 3 4
3 121 ABA 1 45
4 121 ABA 1 23
5 121 ABA 2 75
6 122 AAA 1 52
7 122 ACA 2 24
8 122 ACA 1 53
9 122 ACA 2 14
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
print (df1)
ID Code Value Color
0 111 AAA 4 3
1 111 AAA 23 1
2 111 AAA 43 2
3 121 ABA 23 1
4 121 ABA 45 1
5 121 ABA 75 2
6 122 AAA 52 1
7 122 ACA 14 2
8 122 ACA 24 2
9 122 ACA 53 1
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
print (df1)
ID Code Value
0 111 AAA 4
1 111 AAA 23
2 111 AAA 43
3 121 ABA 23
4 121 ABA 45
5 121 ABA 75
6 122 AAA 52
7 122 ACA 14
8 122 ACA 24
9 122 ACA 53
EDIT: If you need a MultiIndex in bins, remove as_index=False and the column selection after groupby:
bins = df.groupby([df.columns[0],
                   df.columns[1],
                   df.columns[2]]).count()
should be changed to:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list).count()
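For the Weight part of the asker's snippet, one possible sketch (not taken from the answer above) computes it with a groupby transform on the count column instead of dividing whole frames; column names follow the sample data:
import pandas as pd

df = pd.DataFrame({'ID':    [111, 111, 111, 121, 121, 121, 122, 122, 122, 122],
                   'Code':  ['AAA', 'AAA', 'AAA', 'ABA', 'ABA', 'ABA', 'AAA', 'ACA', 'ACA', 'ACA'],
                   'Color': ['Blue', 'Red', 'Green', 'Green', 'Green', 'Red', 'Red', 'Blue', 'Blue', 'Green'],
                   'Value': [23, 43, 4, 45, 23, 75, 52, 24, 53, 14]})

# group by everything except Value (so Color stays a key here, as in the asker's bins)
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list, as_index=False)['Value'].count()

# weight of each bin within its ID
bins['Weight'] = (bins['Value'] / bins.groupby('ID')['Value'].transform('sum')).round(4)
print(bins)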

How to find overlapping rows between two dataframes based on start and end columns?

I have two pandas dataframes df1 and df2 of the form:
df1
start end text source
1 5 abc 1
8 10 def 1
15 20 ghi 1
25 30 xxx 1
42 45 zzz 1
df2
start end text source
1 6 jkl 2
7 9 mno 2
11 13 pqr 2
16 17 stu 2
18 19 vwx 2
32 37 yyy 2
40 47 rrr 2
I want to return the intersections of the two dataframes based on the start and end columns in following format:
out_df
start_1 end_1 start_2 end_2 text_1 text_2
1 5 1 6 abc jkl
8 10 7 9 def mno
15 20 16 17 ghi stu
15 20 18 19 ghi vwx
42 45 40 47 zzz rrr
What is the best method to achieve this?
One option is with conditional_join from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(
    df2,
    ('start', 'end', '<='),
    ('end', 'start', '>='))
left right
start end text source start end text source
0 1 5 abc 1 1 6 jkl 2
1 8 10 def 1 7 9 mno 2
2 15 20 ghi 1 16 17 stu 2
3 15 20 ghi 1 18 19 vwx 2
4 42 45 zzz 1 40 47 rrr 2
In the dev version, you can rename the columns, and avoid the MultiIndex (the MultiIndex occurs because the column names are not unique):
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
df1.conditional_join(
    df2,
    ('start', 'end', '<='),
    ('end', 'start', '>='),
    df_columns={'start': 'start_1',
                'end': 'end_1',
                'text': 'text_1'},
    right_columns={'start': 'start_2',
                   'end': 'end_2',
                   'text': 'text_2'})
start_1 end_1 text_1 start_2 end_2 text_2
0 1 5 abc 1 6 jkl
1 8 10 def 7 9 mno
2 15 20 ghi 16 17 stu
3 15 20 ghi 18 19 vwx
4 42 45 zzz 40 47 rrr
The idea for overlaps is that the start of interval one should be less than or equal to the end of interval two, while the end of interval one should be greater than or equal to the start of interval two; that way overlap is assured. I pulled that idea from pd.Interval.overlaps here
Another option is with the piso library; the answer here might point you in the right direction
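If you would rather stay in plain pandas, a small sketch under the same overlap condition (start_1 <= end_2 and end_1 >= start_2) uses a cross merge plus a boolean filter; it needs pandas >= 1.2 for how='cross' and is only practical for modestly sized frames:
import pandas as pd

df1 = pd.DataFrame({'start': [1, 8, 15, 25, 42],
                    'end': [5, 10, 20, 30, 45],
                    'text': ['abc', 'def', 'ghi', 'xxx', 'zzz']})
df2 = pd.DataFrame({'start': [1, 7, 11, 16, 18, 32, 40],
                    'end': [6, 9, 13, 17, 19, 37, 47],
                    'text': ['jkl', 'mno', 'pqr', 'stu', 'vwx', 'yyy', 'rrr']})

# cartesian product of the two frames, then keep only overlapping pairs
out = df1.merge(df2, how='cross', suffixes=('_1', '_2'))
out = out[(out['start_1'] <= out['end_2']) & (out['end_1'] >= out['start_2'])]
print(out[['start_1', 'end_1', 'start_2', 'end_2', 'text_1', 'text_2']])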

How to find the next row that have a value in column in a dataframe pandas?

I have a dataframe such as:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Every group should have the numbers 7, 8 and 9. In the example above, group 1 does not have all three numbers; the number 9 is missing. In that case, I would like to find the closest row with a 9 in the label, and add it to the dataframe, also changing the date to the group's date.
So the desired result would be:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Welcome to SO. It's good to include what you have tried so far, so keep that in mind. Anyhow, for this question, break your thought process down into pandas syntax. The first step would be to check which group is missing which label from [7, 8, 9]:
dfs = df.groupby(['group', 'date']).agg({'label':set}).reset_index().sort_values('group')
dfs['label'] = dfs['label'].apply(lambda x: {7, 8, 9}.difference(x)).explode() # This is the missing label
dfs
Which will give you:
   group   date  label
0      1  02/05      9
1      2  09/05    NaN
Now merge it with the original on label to fill the info in:
final_df = pd.concat([df, dfs.merge(df[['label', 'info']], on='label', suffixes=['','_grouped'])])
final_df
id    info  date   group  label
1     aa    02/05  1      7
2     ba    02/05  1      8
3     cp    09/05  2      7
4     dd    09/05  2      8
5     ii    09/05  2      9
NaN   ii    02/05  1      9
And prettify:
final_df.reset_index(drop=True).reset_index().assign(id=lambda x:x['index']+1).drop(columns=['index']).sort_values(['group', 'id'])
id    info  date   group  label
1     aa    02/05  1      7
2     ba    02/05  1      8
6     ii    02/05  1      9
3     cp    09/05  2      7
4     dd    09/05  2      8
5     ii    09/05  2      9
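Putting the steps above together as one runnable sketch (the column names and the label set {7, 8, 9} follow the sample data; as in the answer, the borrowed row is simply an existing row carrying the missing label, not literally the nearest one):
import pandas as pd

df = pd.DataFrame({'id':    [1, 2, 3, 4, 5],
                   'info':  ['aa', 'ba', 'cp', 'dd', 'ii'],
                   'date':  ['02/05', '02/05', '09/05', '09/05', '09/05'],
                   'group': [1, 1, 2, 2, 2],
                   'label': [7, 8, 7, 8, 9]})

# which of the labels 7, 8, 9 is missing from each group
dfs = df.groupby(['group', 'date']).agg({'label': set}).reset_index()
dfs['label'] = dfs['label'].apply(lambda x: {7, 8, 9}.difference(x)).explode()
dfs = dfs.dropna(subset=['label'])
dfs['label'] = dfs['label'].astype(int)

# borrow info from an existing row that carries the missing label, then rebuild ids
filler = dfs.merge(df[['label', 'info']], on='label')
out = (pd.concat([df, filler])
         .reset_index(drop=True)
         .assign(id=lambda x: x.index + 1)
         .sort_values(['group', 'id']))
print(out)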

DataFrame MultiIndex - find column by value

I have a multiindex dataframe with two layers of indices and roughly 100 columns. I would like to get groups of values (organized in columns) based on the presence of a certain value, but I am still struggling with the indexing mechanics.
Here is some example data:
import numpy as np
import pandas as pd

index_arrays = [np.array(["one"]*5 + ["two"]*5),
                np.array(["aaa","bbb","ccc","ddd","eee"]*2)]
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],
                   [10,11,12],[13,14,15],[16,1,17],
                   [18,19,20],[21,22,23],[24,25,26],
                   [27,28,29]], index=index_arrays)
Gives
0 1 2
one aaa 1 2 3
bbb 4 5 6
ccc 7 8 9
ddd 10 11 12
eee 13 14 15
two aaa 16 1 17
bbb 18 19 20
ccc 21 22 23
ddd 24 25 26
eee 27 28 29
Now, for each level_0 index (one and two), I want to return the entire column in which the level_1 index of aaa equals a certain value, for example 1.
What I got so far is this:
df[df.loc[(slice(None), "aaa"),:]==1].any(axis=1)
>
one aaa True
bbb False
ccc False
ddd False
eee False
two aaa True
bbb False
ccc False
ddd False
eee False
Instead of the boolean values, I would like to retrieve the actual values. The expected output would be:
expected:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
two aaa 1
bbb 19
ccc 22
ddd 25
eee 28
I would appreciate your help.
Bonus question: Additionally, it would be great to know which column contains the values in question. For the example above, this would be column 0 (for index one) and column 1 (for index two). Is there a way to do this?
Thanks!
This might be what you're looking for:
df.loc[df.index.get_level_values(0) == 'one', df.loc[('one', 'aaa')] == 1]
This outputs:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
To combine the results for all of the different values of the first index, generate these DataFrames and concatenate them:
frames = []
for level_0_val in df.index.get_level_values(0).unique():
    frames.append(df.loc[df.index.get_level_values(0) == level_0_val,
                         df.loc[(level_0_val, 'aaa')] == 1])
output_df = pd.concat(frames)
Here is output_df:
0 1
one aaa 1.0 NaN
bbb 4.0 NaN
ccc 7.0 NaN
ddd 10.0 NaN
eee 13.0 NaN
two aaa NaN 1.0
bbb NaN 19.0
ccc NaN 22.0
ddd NaN 25.0
eee NaN 28.0
You can then generate your desired output from this.
Let's try with DataFrame.xs:
m = df.xs('aaa', level=1).eq(1).any()
Or with pd.IndexSlice:
m = df.loc[pd.IndexSlice[:, 'aaa'], :].eq(1).any()
Result:
df.loc[:, m]
0 1
one aaa 1 2
bbb 4 5
ccc 7 8
ddd 10 11
eee 13 14
two aaa 16 1
bbb 18 19
ccc 21 22
ddd 24 25
eee 27 28
df.columns[m]
Int64Index([0, 1], dtype='int64')
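To get from the boolean mask to the expected single-column output, and to answer the bonus question about which column matched, here is a hedged sketch that walks the level-0 groups; it assumes exactly one matching column per group:
import numpy as np
import pandas as pd

index_arrays = [np.array(["one"]*5 + ["two"]*5),
                np.array(["aaa","bbb","ccc","ddd","eee"]*2)]
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],
                   [10,11,12],[13,14,15],[16,1,17],
                   [18,19,20],[21,22,23],[24,25,26],
                   [27,28,29]], index=index_arrays)

parts, matched_cols = [], {}
for key, sub in df.groupby(level=0):
    # column whose "aaa" row equals 1 within this level-0 group
    col = sub.xs("aaa", level=1).eq(1).idxmax(axis=1).iloc[0]
    matched_cols[key] = col
    parts.append(sub[col].rename(0))

result = pd.concat(parts).to_frame()
print(result)        # the expected single-column frame
print(matched_cols)  # {'one': 0, 'two': 1}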

pandas merge two dataframes without cross-references and with NaN's for uneven number of rows

EDITED 3/5/19:
Tried different ways to merge and/or join the data below but couldn't wrap my head around how to do that correctly.
Initially I have a data like this:
index unique_id group_name id name
0 100 ABC 20 aaa
1 100 ABC 21 bbb
2 100 DEF 22 ccc
3 100 DEF 23 ddd
4 100 DEF 24 eee
5 100 DEF 25 fff
6 101 ABC 30 ggg
7 101 ABC 31 hhh
8 101 ABC 32 iii
9 101 DEF 33 jjj
The goal is to reshape it by merging on unique_id so that the result looks like this:
index unique_id group_name_x id_x name_x group_name_y id_y name_y
0 100 ABC 20 aaa DEF 22 ccc
1 100 ABC 21 bbb DEF 23 ddd
2 100 NaN NaN NaN DEF 24 eee
3 100 NaN NaN NaN DEF 25 fff
4 101 ABC 30 ggg DEF 33 jjj
5 101 ABC 31 hhh NaN NaN NaN
6 101 ABC 32 iii NaN NaN NaN
How can I do this in pandas? The best I could think of is to split the data into two dataframes by group name (ABC and DEF) and then merge them with how='outer', on='unique_id', but that way it creates a cross product between the records (2 ABC x 4 DEF = 8 records) without any NaN's.
pd.concat with axis=1 mentioned in answers doesn't align the data per unique_id and doesn't create any NaN's.
As you said, split the dataframe, then concat the two dataframes side by side (axis=1) after resetting both indexes.
Working code:
df=pd.read_clipboard()
req_cols=['group_name','id','name']
df_1=df[df['group_name']=='ABC'].reset_index(drop=True)
df_2=df[df['group_name']=='DEF'].reset_index(drop=True)
df_1=df_1.rename(columns = dict(zip(df_1[req_cols].columns.values, df_1[req_cols].add_suffix('_x'))))
df_2=df_2.rename(columns = dict(zip(df_2[req_cols].columns.values, df_2[req_cols].add_suffix('_y'))))
req_cols_x=[val+'_x'for val in req_cols]
print (pd.concat([df_2,df_1[req_cols_x]],axis=1))
O/P:
index unique_id group_name_y id_y name_y group_name_x id_x name_x
0 2 100 DEF 22 ccc ABC 20.0 aaa
1 3 100 DEF 23 ddd ABC 21.0 bbb
2 4 100 DEF 24 eee NaN NaN NaN
3 5 100 DEF 25 fff NaN NaN NaN
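Another way to line the groups up (a sketch, not from the answer above, and without read_clipboard) is to number the rows inside each (unique_id, group_name) block with cumcount and then do an outer merge on unique_id plus that counter; rows without a partner come out as NaN:
import pandas as pd

df = pd.DataFrame({'unique_id': [100, 100, 100, 100, 100, 100, 101, 101, 101, 101],
                   'group_name': ['ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'ABC', 'ABC', 'ABC', 'DEF'],
                   'id': [20, 21, 22, 23, 24, 25, 30, 31, 32, 33],
                   'name': ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh', 'iii', 'jjj']})

# position of each row inside its (unique_id, group_name) block
df['pos'] = df.groupby(['unique_id', 'group_name']).cumcount()

abc_part = df[df['group_name'] == 'ABC']
def_part = df[df['group_name'] == 'DEF']

out = (abc_part.merge(def_part, on=['unique_id', 'pos'], how='outer', suffixes=('_x', '_y'))
               .sort_values(['unique_id', 'pos'])
               .drop(columns='pos')
               .reset_index(drop=True))
print(out)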
