I have the following dataframe df1:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Lisa', 'Molly', 'Lisa', 'Molly', 'Fred'],
'gender': ['m', 'f', 'f', 'm', 'f', 'f', 'f', 'f','f', 'm'],
}
df1 = pd.DataFrame(data, index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
I want to create a table with some standard and some custom summary statistics df2.
df2 = df1.describe()
df2.rename(index={'top':'mode'},inplace=True)
df2.rename(index={'freq':'mode freq'},inplace=True)
df2
df2:
gender name
count 10 10
unique 2 7
mode f Molly
mode freq 7 3
I want to append one row to df2 for the second mode and one for the frequency of the second mode:
Example:
gender name
count 10 10
unique 2 7
mode f Molly
mode freq 7 3
2nd mode m Lisa
2nd freq 3 2
I figured out that you can get the second mode & frequency for each column by doing this:
for column in df1:
    my_series = df1[column].value_counts()[1:2]
    print(my_series)
But how do I append this to df2?
You can use apply with value_counts, then reshape the result so it matches df2.
df1.apply(lambda x : x.value_counts().iloc[[1]]).stack().reset_index(level=0).T
Out[172]:
name gender
level_0 Lisa m
0 2 3
The final output (change the index names with rename, as you already showed):
pd.concat([df1.describe(),df1.apply(lambda x : x.value_counts().iloc[[1]]).stack().reset_index(level=0).T])
Out[173]:
gender name
count 10 10
unique 2 7
top f Molly
freq 7 3
level_0 m Lisa
0 3 2
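To get labels matching your expected output, you could also rename the two new index entries (a small sketch, assuming the concatenated frame above is stored in df2):
df2 = pd.concat([df1.describe(), df1.apply(lambda x : x.value_counts().iloc[[1]]).stack().reset_index(level=0).T])
df2 = df2.rename(index={'top': 'mode', 'freq': 'mode freq', 'level_0': '2nd mode', 0: '2nd freq'})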
With Counter
from collections import Counter
def f(s):
    return pd.Series(Counter(s).most_common(2)[1], ['mode2', 'mode2 freq'])
df1.describe().rename(dict(top='mode1', freq='mode1 freq')).append(df1.apply(f))
name gender
count 10 10
unique 7 2
mode1 Molly f
mode1 freq 3 7
mode2 Lisa m
mode2 freq 2 3
value_counts
Same thing without Counter
def f(s):
    c = s.value_counts()
    return pd.Series([c.index[1], c.iat[1]], ['mode2', 'mode2 freq'])
df1.describe().rename(dict(top='mode1', freq='mode1 freq')).append(df1.apply(f))
Numpy bits
import numpy as np

def f(s):
    f, u = pd.factorize(s)
    c = np.bincount(f)
    i = np.argpartition(c, -2)[-2]
    return pd.Series([u[i], c[i]], ['mode2', 'mode2 freq'])
df1.describe().rename(dict(top='mode1', freq='mode1 freq')).append(df1.apply(f))
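Note that DataFrame.append was removed in pandas 2.0, so on recent versions you would concatenate instead (a minimal sketch, reusing the f defined above):
stats = df1.describe().rename(dict(top='mode1', freq='mode1 freq'))
out = pd.concat([stats, df1.apply(f)])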
Related
I have data which I need to transform in order to get 2 columns instead of 4:
data = [['123', 'Billy', 'Bill', 'Bi'],
['234', 'James', 'J', 'Ji'],
['543', 'Floyd', 'Flo', 'F'],
]
processed_data = ?
needed_df = pandas.DataFrame(processed_data, columns=['Number', 'Name'])
I expect the following behaviour:
['123', 'Billy']
['123', 'Bill']
['123', 'Bi']
['234', 'James']
['234', 'J']
['234', 'Ji']
I've tried to use a nested for loop but I'm getting the wrong result:
for row in df.iterrows():
    for col in df.columns:
        new_row = ...
        processed_df = pandas.concat(df, new_row)
Such a construction gives a result that is far too big.
A similar question using SQL:
How to split several columns into one column with several records in SQL?
Or, you can convert your existing data into a DataFrame and then reshape it with melt:
import pandas as pd
data = [['123', 'Billy', 'Bill', 'Bi'],
['234', 'James', 'J', 'Ji'],
['543', 'Floyd', 'Flo', 'F'],
]
df = pd.DataFrame(data)
df.melt(0).sort_values(0)
Output:
0 variable value
0 123 1 Billy
3 123 2 Bill
6 123 3 Bi
1 234 1 James
4 234 2 J
7 234 3 Ji
2 543 1 Floyd
5 543 2 Flo
8 543 3 F
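If you need exactly the two requested columns, a follow-up sketch (assuming the df built above) could rename and drop the helper column:
needed_df = (df.melt(0, value_name='Name')
               .rename(columns={0: 'Number'})
               .sort_values('Number')
               .drop(columns='variable')
               .reset_index(drop=True))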
Let's use a list comprehension to create pairs of Number and Name, then build a new dataframe:
pd.DataFrame([[x, z] for x, *y in data for z in y], columns=['Number', 'Name'])
Number Name
0 123 Billy
1 123 Bill
2 123 Bi
3 234 James
4 234 J
5 234 Ji
6 543 Floyd
7 543 Flo
8 543 F
I have a data frame and a dictionary like this:
thresholds = {'column':{'A':10,'B':11,'C':9}}
df:
Column
A 13
A 7
A 11
B 12
B 14
B 14
C 7
C 8
C 11
For every index group, I want to calculate the count of values less than the threshold and greater than the threshold value.
So my output looks like this:
df:
Values<Thr Values>Thr
A 1 2
B 0 3
C 2 1
Can anyone help me with this?
You can use:
import numpy as np
t = df.index.to_series().map(thresholds['column'])
out = (pd.crosstab(df.index, np.where(df['Column'].gt(t), 'Values>Thr', 'Values≤Thr'))
.rename_axis(index=None, columns=None)
)
Output:
Values>Thr Values≤Thr
A 2 1
B 3 0
C 1 2
syntax variant
out = (pd.crosstab(df.index, df['Column'].gt(t))
.rename_axis(index=None, columns=None)
.rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
)
To apply this to many columns, based on the keys in the dictionary:
def count(s):
    t = s.index.to_series().map(thresholds.get(s.name, {}))
    return (pd.crosstab(s.index, s.gt(t))
              .rename_axis(index=None, columns=None)
              .rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
            )
out = pd.concat({c: count(df[c]) for c in df})
NB. The key of the dictionary must match exactly. I changed the case for the demo.
Output:
Values≤Thr Values>Thr
Column A 1 2
B 0 3
C 2 1
Here is another option:
import pandas as pd
df = pd.DataFrame({'Column': [13, 7, 11, 12, 14, 14, 7, 8, 11]})
df.index = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
thresholds = {'column':{'A':10,'B':11,'C':9}}
df['smaller'] = df['Column'].groupby(df.index).transform(lambda x: x < thresholds['column'][x.name]).astype(int)
df['greater'] = df['Column'].groupby(df.index).transform(lambda x: x > thresholds['column'][x.name]).astype(int)
df.drop(columns=['Column'], inplace=True)
# group by index summing the greater and smaller columns
sums = df.groupby(df.index).sum()
sums
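If you want the headers from the expected output, a small follow-up sketch (assuming the sums frame computed above):
sums = sums.rename(columns={'smaller': 'Values<Thr', 'greater': 'Values>Thr'})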
I have a data frame like this:
df:
ID Names
3 [Ally, Ben, Cris]
5 [Bruno, Coleen, Flyn]
2 [Dave, Bob]
7 [Rob, Ally, Bob]
11 [Jill, Tom, Sal]
The Names column is a list of names. Some of them could be repeated.
I want to filter the data frame on Names columns where the names start with either A or B or D.
So my output should look like this:
ID Names
3 [Ally, Ben]
5 [Bruno]
2 [Dave, Bob]
7 [Ally, Bob]
Reproducible input:
df = pd.DataFrame({'ID': [3, 5, 2, 7, 11],
'Names': [['Ally', 'Ben', 'Cris'],
['Bruno', 'Coleen', 'Flyn'],
['Dave', 'Bob'],
['Rob', 'Ally', 'Bob'],
['Jill', 'Tom', 'Sal']]
})
You can use a list comprehension to filter the names, and boolean indexing to filter the rows:
target = {'A', 'B', 'D'}
df['Names'] = [[n for n in l if n[0] in target] for l in df['Names']]
df = df[df['Names'].str.len().gt(0)]
Or using explode, and groupby.agg:
s = (df['Names']
.explode()
.loc[lambda x: x.str[0].isin(['A', 'B', 'D'])]
.groupby(level=0).agg(list)
)
df = df.loc[s.index].assign(Names=s)
output:
ID Names
0 3 [Ally, Ben]
1 5 [Bruno]
2 2 [Dave, Bob]
3 7 [Ally, Bob]
Just a slight variant of this solution:
df = (df.explode('Names')
        .query("Names.str[0].isin(['A','B','D'])", engine='python')
        .groupby('ID')
        .agg(list)
        .reset_index())
print(df)
ID Names
0 2 [Dave, Bob]
1 3 [Ally, Ben]
2 5 [Bruno]
3 7 [Ally, Bob]
I have two dataframes of different sizes (df1 and df2). I would like to remove from df1 all the rows which are present in df2.
So if I have df2 equals to:
A B
0 wer 6
1 tyu 7
And df1 equals to:
A B C
0 qwe 5 a
1 wer 6 s
2 wer 6 d
3 rty 9 f
4 tyu 7 g
5 tyu 7 h
6 tyu 7 j
7 iop 1 k
The final result should be like so:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
I was able to achieve my goal by using a for loop, but I would like to know if there is a better, more elegant and efficient way to perform such an operation.
Here is the code I wrote in case you need it:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
for i, row in df2.iterrows():
    df1 = df1[(df1['A']!=row['A']) & (df1['B']!=row['B'])].reset_index(drop=True)
Use merge with an outer join and indicator=True, filter out the matches with query, and finally remove the helper column with drop:
df = (pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
        .query("_merge != 'both'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))
print (df)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
The cleanest way I found was to use drop with the index of the dataframe whose rows you want to remove (note this only works if the index labels of df2 correspond to the rows of df1 that should be dropped):
df1.drop(df2.index, axis=0, inplace=True)
You can use np.in1d to check whether a row of df1 exists in df2, and then use the inverted result as a mask to select rows from df1.
import numpy as np
df1[~df1[['A','B']].apply(lambda x: np.in1d(x, df2).all(), axis=1)]\
    .reset_index(drop=True)
Out[115]:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
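As a side note, NumPy recommends np.isin over np.in1d for new code; the same idea would look like this (a sketch on the sample data above):
mask = df1[['A', 'B']].apply(lambda x: np.isin(x, df2).all(), axis=1)
df1[~mask].reset_index(drop=True)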
pandas has a method called isin; however, here we need a single key that identifies a row across both frames. We can define a lambda that builds such a key from the existing 'A' and 'B' columns of df1 and df2. We then negate the isin result (as we want the values not in df2) and reset the index:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print(df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True))
printing:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
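A variant of the same idea without building string keys is to compare MultiIndexes (a sketch using the same df1 and df2 as above):
mask = df1.set_index(['A', 'B']).index.isin(df2.set_index(['A', 'B']).index)
print(df1[~mask].reset_index(drop=True))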
I think the cleanest way can be:
We have a base dataframe D and want to remove a subset D1. Let the output be D2:
D2 = pd.DataFrame(D, index=list(set(D.index).difference(D1.index))).reset_index()
I find this other alternative useful too:
pd.concat([df1,df2], axis=0, ignore_index=True).drop_duplicates(subset=["A","B"],keep=False, ignore_index=True)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
keep=False drops both occurrences of each duplicate. This doesn't require the two DataFrames to share all of their columns, so I find it a bit easier.
Hello, I have the following data frame:
df =
ID Value
a 45
b 3
c 10
And another dataframe with the numeric ID of each value
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd
df1 = pd.DataFrame({
'ID': ['a', 'b', 'c'],
'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
'ID': ['a', 'b', 'c', 'd', 'e'],
'ID_n': [3, 35, 0, 7, 1],
})
df1.set_index(['ID'], drop=False, inplace=True)
df2.set_index(['ID'], drop=False, inplace=True)
print(pd.merge(df1, df2, on="ID", how='left'))
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want the index to be numeric you could use reset_index():
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set the index of df to ID, join df1 after also setting its index to ID, and then reset the index to return your original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df1, its structure remains the same (i.e. you don't change its indexing).
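An alternative sketch using map with the question's frame names (df holding Value, df1 holding ID_n):
df['ID_n'] = df['ID'].map(df1.set_index('ID')['ID_n'])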