I have a large dataset with similar data:
>>> df = pd.DataFrame(
... {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
... 'B': ['a', 'b', 'c', 'a', 'a', np.nan]})
>>> df
A B
0 one a
1 two b
2 two c
3 one a
4 one a
5 three NaN
There are two aggregation functions 'any' and 'unique':
>>> df.groupby('A')['B'].any()
A
one True
three False
two True
Name: B, dtype: bool
>>> df.groupby('A')['B'].unique()
A
one [a]
three [nan]
two [b, c]
Name: B, dtype: object
but I want to get the following result (or something close to it):
A
one a
three False
two True
I can do it with some complex code, but I would rather find an appropriate function in an existing Python package, or at least the easiest way to solve the problem. I'd be grateful if you could help me with that.
You can aggregate with Series.nunique for the first column and with the unique values (after removing possible missing values) for the other column:
df1 = df.groupby('A').agg(count=('B','nunique'),
uniq_without_NaNs = ('B', lambda x: x.dropna().unique()))
print (df1)
count uniq_without_NaNs
A
one 1 [a]
three 0 []
two 2 [b, c]
Then create a mask for where count is greater than 1, and replace the values with the first element of uniq_without_NaNs where count equals 1:
out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one a
three False
two True
Name: count, dtype: object
>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
[True, g("unique").str[0]],
default=False),
index=nun.index)
A
one a
three False
two True
dtype: object
get a hold of the group aggregator
count the number of unique values
if > 1, i.e., more than one unique value, put True
if == 1, i.e., exactly one unique value, put that value
else, i.e., no unique values (all NaN), put False
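The same three-way logic can also be written as a plain Python aggregation function, which is slower on large data but easy to read (a minimal sketch; summarize_uniques is just an illustrative name):
def summarize_uniques(s):
    u = s.dropna().unique()
    if len(u) == 0:      # no non-missing values
        return False
    if len(u) == 1:      # exactly one unique value: return it
        return u[0]
    return True          # several unique values
out = df.groupby('A')['B'].agg(summarize_uniques)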
You can combine groupby with agg and use boolean mask to choose the correct output:
# aggregate with both of your functions
agg = df.groupby('A')['B'].agg(['any', 'unique'])
# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])
Output:
>>> out
A
one a
three False
two True
>>> agg
any unique
A
one True [a]
three False [nan]
two True [b, c]
>>> m
A
one True # choose 'unique' column
three False # choose 'any' column
two False # choose 'any' column
new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']
this will give you:
A B
0 one True
1 three False
2 two True
now if we want to find the values we can do:
df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)
which gives:
A
one a
three NaN
two b
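To get something close to the output asked for, the two pieces can be combined, for example (a sketch building on the code above; n_uniq and first_val are just illustrative names):
n_uniq = df.groupby('A')['B'].nunique()           # number of distinct non-NaN values per group
first_val = df.groupby('A')['B'].apply(
    lambda x: x.dropna().unique()[0] if x.notna().any() else np.nan)
out = first_val.where(n_uniq == 1, n_uniq > 1)    # keep the value only when it is the single unique one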
The expression
series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))
will give the following result:
one a
three NaN
two [b, c]
where a simple value can be identified by its type:
series[series.apply(type) == str]
I think it is simple enough to use often, but it is probably not the optimal solution.
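If you then want something close to the target output, the mixed entries can be collapsed with a small helper (a sketch that assumes the series looks like the output above; collapse is just an illustrative name):
def collapse(v):
    if isinstance(v, str):   # exactly one unique value: keep it
        return v
    if np.ndim(v) == 0:      # scalar NaN: no non-missing values at all
        return False
    return True              # several unique values
out = series.apply(collapse)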
I'm trying to concatenate two Pandas DataFrames and then drop the duplicates; however, for some reason drop_duplicates doesn't work for most of the identical rows (only a few of them are dropped). For instance, these two rows are identical (at least in my eyes), but they still show up (see the screenshot of the identical rows).
This is the code I have tried; the result with or without the subset argument varies, but it still doesn't give me the result I wanted. It tends to over-delete or under-delete when I play with the arguments (for instance, adding or removing columns).
bigdata = pd.concat([df_q,df_q_temp]).drop_duplicates(subset=['Date', 'Value'], keep ='first').reset_index(drop=True)
Can anyone point me to a right direction?
Thanks
Expanding on my comment, here is a way to make the differences explicit and normalize your df to drop near-duplicates:
Part 1: show differences
def eq_nan(a, b):
return (a == b) | ((a != a) & (b != b)) # treat NaN as equal
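A quick check of the helper on a small example (NaN compares equal to NaN, ordinary values compare normally):
>>> eq_nan(pd.Series([1.0, np.nan]), pd.Series([1.0, 2.0]))
0     True
1    False
dtype: bool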
Let's try with some data:
df = pd.DataFrame([
['foo\u00a0bar', 1, np.nan, None, 4.00000000001, pd.Timestamp('2021-01-01')],
['foo bar', 1, np.nan, None, 4, '2021-01-01 00:00:00']],
columns=list('uvwxyz'),
)
df.loc[1, 'z'] = str(df.loc[1, 'z'])  # the constructor above converted the second date (a str) to a Timestamp
>>> df.dtypes
u object
v int64
w float64
x object
y float64
z object
dtype: object
>>> df.drop_duplicates()
u v w x y z
0 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
1 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
Find what elements among those two rows are different:
a = df.loc[0]
b = df.loc[1]
diff = ~eq_nan(a, b)
for (col, x), y in zip(a[diff].items(), b[diff]):
print(f'{col}:\t{x!r} != {y!r}')
# output:
u: 'foo\xa0bar' != 'foo bar'
y: 4.00000000001 != 4.0
z: Timestamp('2021-01-01 00:00:00') != '2021-01-01 00:00:00'
Side note: alternatively, if you have cells containing complex types, e.g. list, dict, etc., you may use pytest (outside of testing) to get some nice verbose explanation of exactly how the values differ:
from _pytest.assertion.util import _compare_eq_verbose
for (col, x), y in zip(a[diff].items(), b[diff]):
da, db = _compare_eq_verbose(x, y)
print(f'{col}:\t{da} != {db}')
# Output:
u: +'foo\xa0bar' != -'foo bar'
y: +4.00000000001 != -4.0
z: +Timestamp('2021-01-01 00:00:00') != -'2021-01-01 00:00:00'
Part 2: example of normalization to help drop duplicates
We use Pandas' own Series formatter to convert each row into a string representation:
def normalize_row(r):
vals = r.to_string(header=False, index=False, name=False).splitlines()
vals = [
' '.join(s.strip().split()) # transform any whitespace (e.g. unicode non-breaking space) into ' '
for s in vals
]
return vals
Example for the first row above:
>>> normalize_row(df.iloc[0])
['foo bar', '1', 'NaN', 'NaN', '4.0', '2021-01-01 00:00:00']
Usage to drop visually identical duplicates:
newdf = df.loc[df.apply(normalize_row, axis=0).drop_duplicates().index]
>>> newdf
u v w x y z
0 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
>>> newdf.dtypes
u object
v int64
w float64
x object
y float64
z object
dtype: object
Note: the rows that make it through this filter are copied exactly into newdf (not the string lists that were used for near-duplicate detection).
Take care of string columns with no values.
Make sure that in the dataframe the cells without a value are read as None. In particular, in object-typed columns there may be cells containing only spaces; these are different from None even though there is effectively no value there.
For example:
import pandas as pd
df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
'Col_2': [1, 2, 3, 2],
'Col_3': ['one', None, 'two', ' ']})
df
  Col_1  Col_2 Col_3
0   one      1   one
1            2  None
2   two      3   two
3  None      2
As you can see, rows 1 and 3 don't have a value in Col_1 and Col_3. But since two of these cells are None and the other two are spaces, pandas treats them as different.
I had the same problem and struggled a lot with the code until I found this. I solved it by replacing the None values with spaces:
df = df.fillna(' ')
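With the example above, normalizing both forms of "no value" to the same representation makes rows 1 and 3 compare equal before dropping duplicates (a sketch; you could equally go the other way and turn whitespace-only cells into NaN):
df_norm = df.fillna(' ')       # None -> ' ', so both kinds of empty cell now match
df_norm.drop_duplicates()      # row 3 is dropped as a duplicate of row 1
# alternative direction: df.replace(r'^\s*$', np.nan, regex=True)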
I am handling a df and want to select the columns that meet a condition by filtering the values of a row.
I only know one clumsy way: query each cell value in a loop, then collect the expected columns.
>>> df = pd.DataFrame({'A':list('abcd'),'B':list('1bfe'),'C':list('ghgk')})
>>> df
A B C
0 a 1 g
1 b b h
2 c f g
3 d e k
>>> # get columns where the second row equals 'b'
...
>>> cols = list()
>>> for val in df:
... if df.loc[1,val] == 'b':
... cols.append(val)
...
>>> cols
['A', 'B']
Use:
df.columns[df.loc[1]=='b']
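With the sample frame above this returns the matching column labels, which can be turned into a plain list if needed:
>>> df.columns[df.loc[1] == 'b']
Index(['A', 'B'], dtype='object')
>>> list(df.columns[df.loc[1] == 'b'])
['A', 'B']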
There is this nice way to make queries in pandas dataframes.
df = pd.DataFrame({'A':list('abcd'),'B':list('1bfe'),'C':list('ghgk')})
df.query("A == 'a' and B == '1'")
This query returns the first row of the dataframe, based on the fact that column A matches 'a' and column B matches '1'.
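For example, with the frame above:
>>> df.query("A == 'a' and B == '1'")
   A  B  C
0  a  1  g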
On the phone, can’t test but this would run:
df.columns[df.loc[1].eq('b')]
I am trying to search for values and portions of values from one column to another and return a third value.
Essentially, I have two dataframes: df and df2. The first has a part number in 'col1'. The second has the part number, or portion of it, in 'col1' and the value I want to put in df['col2'] in 'col2'.
import pandas as pd
df = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3',
'2-1-1', '2-1-2', '2-1-3']})
df2 = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3', '2-1'],
'col2': ['A', 'B', 'C', 'D']})
Of course this:
df['col1'].isin(df2['col1'])
only covers the exact matches, not the partial ones:
df['col1'].isin(df2['col1'])
Out[27]:
0 True
1 True
2 True
3 False
4 False
5 False
Name: col1, dtype: bool
I tried:
df[df['col1'].str.contains(df2['col1'])]
but get:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I also tried using a dictionary made from df2, with the same approaches as above and also mapping it, with no luck.
The results for df I need would look like this:
col1 col2
'1-1-1' 'A'
'1-1-2' 'B'
'1-1-3' 'C'
'2-1-1' 'D'
'2-1-2' 'D'
'2-1-3' 'D'
I can't figure out how to get the 'D' value into 'col2', because df2['col1'] contains '2-1', which is only a portion of the part number.
Any help would be greatly appreciated. Thank you in advance.
We can use str.findall:
s = df.col1.str.findall('|'.join(df2.col1.tolist())).str[0].map(df2.set_index('col1').col2)
df['New'] = s
df
col1 New
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
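The same chain, broken into steps for readability (a sketch; the intermediate names pattern and key are just illustrative):
pattern = '|'.join(df2.col1.tolist())             # '1-1-1|1-1-2|1-1-3|2-1'
key = df.col1.str.findall(pattern).str[0]         # first pattern found in each part number
df['New'] = key.map(df2.set_index('col1').col2)   # look up the matching value from df2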
If your df and df2 have the specific format shown in the sample, another way is using a dict for the mapping, with fillna falling back to a map of the rsplit result:
d = dict(df2[['col1', 'col2']].values)
df['col2'] = df.col1.map(d).fillna(df.col1.str.rsplit('-', n=1).str[0].map(d))
Out[1223]:
col1 col2
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
Otherwise, besides using findall as in Wen's solution, you may also use extract with the dict d from above:
df.col1.str.extract('(' + '|'.join(df2.col1) + ')')[0].map(d)
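The result can be assigned back in the same way:
df['col2'] = df.col1.str.extract('(' + '|'.join(df2.col1) + ')')[0].map(d)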
I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the Series method .isin().
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that together form your ids? We can handle that with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean series from the two isin() calls gives something like an OR operation (the | operator would do the same). Then, as before, we can index with this boolean series:
In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
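Applied back to the question's frame, where the identifier is the concatenation of the first two columns, the same idea would look roughly like this (a sketch using the column names AcNo and Sortcode and the series acids from the question):
mask = (df["AcNo"] + df["Sortcode"]).isin(acids)
filtered = df[mask]
# if this still keeps (or drops) every row, compare a few concatenated values with a few
# entries of acids by eye; a dtype or whitespace mismatch is the usual culprit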
I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print(data)
0 1 2
0 True False True
1 False False True
[3]: print(data*1)
0 1 2
0 1 0 1
1 0 0 1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
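For instance, summing a boolean Series counts the True values, and arithmetic promotes the booleans to integers:
s = pd.Series([True, False, True])
s.sum()    # 2, i.e. the number of True values
s * 10     # 10, 0, 10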
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
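A quick sketch of the behaviour on a frame with mixed column types (only boolean values are replaced; other columns are left untouched):
df = pd.DataFrame({'flag': [True, False], 'name': ['x', 'y']})
df.replace({False: 0, True: 1}, inplace=True)
# df['flag'] is now 0/1, df['name'] is unchanged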
You can also do this directly on DataFrames:
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation for your data frame: given a DataFrame df of True/False values, transforming True/False into 1/0 is just:
df = df * 1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map the column 'type', which has the values FAKE/REAL, to 0/1 (the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
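An equivalent one-liner using map, assuming the same column name (values not covered by the mapping would become NaN):
df['type'] = df['type'].map({'FAKE': 0, 'REAL': 1})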
This is a reproducible example based on some of the existing answers:
import pandas as pd
def bool_to_int(s: pd.Series) -> pd.Series:
"""Convert the boolean to binary representation, maintain NaN values."""
return s.replace({True: 1, False: 0})
# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
a_bool=lambda df: df["a"] > 5,
b_bool=lambda df: df["b"] % 2 == 0,
)
# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]
# apply the new coding to a new dataframe (or can replace the existing one)
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})  # c=c binds each column name into its own lambda
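If the assign/pipe chain feels heavy, a plain loop does the same conversion column by column (a sketch under the same assumptions):
df_new = df.copy()
for c in bool_cols:
    df_new[c] = bool_to_int(df_new[c])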
Tried and tested:
df[col] = df[col].map({True: 1, False: 0})
If there is more than one column with True/False values, use the following:
for col in bool_cols:
    df[col] = df[col].map({True: 1, False: 0})
(If a column stores the strings 'True'/'False' rather than real booleans, use the string keys 'True' and 'False' in the mapping instead.)
As AMC wrote in a comment: if the column is of type object (holding Python booleans),
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)