I am processing inbound user data. I receive a DataFrame h that is supposed to contain only floats but has some strings:
>>> h = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'])
>>> h.loc[0, 'a'] = 'bad'
>>> h.loc[1, 'b'] = 'robot'
>>> h
a b
0 bad 0.747314
1 0.921919 robot
2 0.754256 0.664455
I process it and set the strings to np.nan (I realize np.nan is itself a float; this is just to illustrate):
>>> hh = h.copy()
>>> hh.loc[0, 'a'] = np.nan
>>> hh.loc[1, 'b'] = np.nan
>>> hh
a b
0 NaN 0.747314
1 0.921919 NaN
2 0.754256 0.664455
I have a DataFrame with expected values (or a dict):
>>> g = pd.DataFrame({'a': 'foo', 'b': 'bar'}, index=h.index)
>>> g
a b
0 foo bar
1 foo bar
2 foo bar
I use this to fill in where the bad data is:
>>> hh.fillna(g)
a b
0 foo 0.747314
1 0.921919 bar
2 0.754256 0.664455
I need to include the bad received data too, alongside the expected values. So the result should be:
>>> magic(hh, g)
a b
0 rec=bad; exp=foo 0.747314
1 0.921919 rec=robot; exp=bar
2 0.754256 0.664455
How can I create such a result?
You can convert the unneeded values to NaN with DataFrame.where, join the pieces together as strings, and finally put back the original values with fillna:
m = hh.isna()
df = ('rec=' + h.where(m) + '; exp=' + g.where(m)).fillna(h)
print (df)
a b
0 rec=bad; exp=foo 0.440508
1 0.525949 rec=robot; exp=bar
2 0.337586 0.414336
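The question also mentions that the expected values may arrive as a dict rather than a DataFrame. A minimal hedged sketch of the same where/fillna idea for that case (the name expected is introduced here only for illustration):
import pandas as pd

expected = {'a': 'foo', 'b': 'bar'}
g = pd.DataFrame(expected, index=hh.index)   # broadcast the dict across hh's index

m = hh.isna()                                # True where the bad values were replaced
df = ('rec=' + h.where(m) + '; exp=' + g.where(m)).fillna(hh)
print (df)
This should produce the same result as the magic(hh, g) example above.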
I have a data frame.
I would like to add a column 'e' after checking the condition below:
if the value in 'c' appears in column 'a' AND the value in 'd' appears in column 'b' in the same row, then 'e' is 'OK'
else ''
import pandas as pd
import numpy as np
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)
The result I want to get is
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9'], 'e':['OK','','','']}
You can merge df with itself on ['a', 'b'] on the left and ['c', 'd'] on the right. If index 'survives' the merge, then e should be OK:
df['e'] = np.where(
    df.index.isin(df.merge(df, left_on=['a', 'b'], right_on=['c', 'd']).index),
    'OK', '')
df
Output:
   a  b  c  d   e
0  0  4  1  1  OK
1  2  5  2  4
2  1  1  3  2
3  4  7  6  9
P.S. Before the merge, we need to convert a and b columns to str type (or c and d to numeric), so that we can compare c and a, and d and b:
df[['a', 'b']] = df[['a', 'b']].astype(str)
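If the merge feels indirect, the same condition can also be written with a plain set of (a, b) pairs; this is a hedged alternative sketch, not part of the original answer:
import pandas as pd

A = {'a':[0,2,1,4], 'b':[4,5,1,7], 'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)

# collect all (a, b) pairs as strings so they are comparable with the string (c, d) values
pairs = set(zip(df['a'].astype(str), df['b'].astype(str)))
# a row gets 'OK' when its (c, d) pair occurs among the (a, b) pairs
df['e'] = ['OK' if (c, d) in pairs else '' for c, d in zip(df['c'], df['d'])]
For the example data this gives e = ['OK', '', '', ''], matching the desired result.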
I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: first by A and B to concatenate the values within C, and afterwards only by A to get the groups defined solely by column A. The intermediate result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
Concatenating itself works; however, the sorting does not seem to work.
Can anyone help me with this problem?
Kind regards.
Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A  B
A  A    Test1,Test2,XYZ
   B              BA,AB
B  A                 AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
First sort_values by the 3 columns, then groupby with join, then join the A and B columns, and last groupby again to build a dictionary per group:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if the DataFrame has only these 3 columns, sort by all of them
#df1 = df.sort_values(list(df.columns)).groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If you need tuples instead, only change the first part of the code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}
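To get the final nested text sketched in the question (e.g. A,A:(Test1,Test2,XYZ), A,B:(AB,BA) for group A), one hedged option is to format the plain-string df1 from the first variant into one string per group:
s = df1.groupby('A').apply(lambda x: ', '.join('{}:({})'.format(d, c)
                                               for d, c in zip(x['D'], x['C'])))
print (s)
For the example this gives 'A,A:(Test1,Test2,XYZ), A,B:(AB,BA)' for group A and 'B,A:(AA)' for group B.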
I have a csv file with four columns. I read it like this:
df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])
Now, field C contains string values, but in some rows there are values of non-string type (floats or other numbers). How can I drop those rows? I'm using version 0.18.1 of Pandas.
Setup
df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print df
A B C D
0 a b c d
1 e f 1.2 g
Notice you can see what the individual cell types are.
print type(df.loc[0, 'C']), type(df.loc[1, 'C'])
<type 'str'> <type 'float'>
mask and slice
print df.loc[df.C.apply(type) != float]
A B C D
0 a b c d
more general
print df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))]
A B C D
0 a b c d
You could also try converting each value with float to determine whether it can be interpreted as a float.
def try_float(x):
    try:
        float(x)
        return True
    except:
        return False
print df.loc[~df.C.apply(try_float)]
A B C D
0 a b c d
The problem with this approach is that you'll exclude strings that can be interpreted as floats.
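Since the question only wants to keep rows where C really holds a string, a minimal hedged variant of the mask-and-slice idea that avoids that problem is to test for str directly:
print df.loc[df.C.apply(lambda x: isinstance(x, str))]
   A  B  C  D
0  a  b  c  d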
Comparing times for the few options I've provided, and also jezrael's solution, on small dataframes and on a dataframe with 500,000 rows: checking whether the type is float seems to be the most performant, with the to_numeric check right behind it. If you need to check for both int and float, I'd go with jezrael's answer. If you can get away with checking only for float, use that one.
You can use boolean indexing with a mask created by to_numeric with the parameter errors='coerce' - you get NaN where the string values are. Then check isnull:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':['a',8,9],
                   'D':[1,3,5]})
print (df)
A B C D
0 1 4 a 1
1 2 5 8 3
2 3 6 9 5
print (pd.to_numeric(df.C, errors='coerce'))
0 NaN
1 8.0
2 9.0
Name: C, dtype: float64
print (pd.to_numeric(df.C, errors='coerce').isnull())
0 True
1 False
2 False
Name: C, dtype: bool
print (df[pd.to_numeric(df.C, errors='coerce').isnull()])
A B C D
0 1 4 a 1
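Applied back to the csv from the question, a hedged sketch (the file name and column names are the question's; this keeps only the rows where C does not parse as a number):
df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None,
                 names=['A', 'B', 'C', 'D'])
df = df[pd.to_numeric(df.C, errors='coerce').isnull()]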
Use the pandas.DataFrame.select_dtypes method. For example:
df.select_dtypes(exclude='object')
or
df.select_dtypes(include=['int64','float','int'])
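Note that select_dtypes filters whole columns by their dtype, not individual rows; with the example frame from the previous answer, column C has dtype object because of the mixed values, so excluding 'object' drops that column entirely (a hedged illustration):
print (df.select_dtypes(exclude='object'))
   A  B  D
0  1  4  1
1  2  5  3
2  3  6  5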
I have a dataframe that has two columns, user_id and item_bought.
Here user_id is the index of the dataframe. I want to group by both user_id and item_bought and get the item-wise count for each user.
How do I do that?
From version 0.20.1 it is simpler:
Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
                   'B': np.arange(8)}, index=index)
print (df)
              A  B
first second
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7
print (df.groupby(['second', 'A']).sum())
          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
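Applied to the question's layout (user_id as a named index, item_bought as a column), reusing the small sample data from the pivot_table answer below, a minimal hedged sketch of the same idea:
import pandas as pd

df = pd.DataFrame({'item_bought': ['x', 'x', 'y', 'y']},
                  index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'))
# with pandas >= 0.20.1 the index level name can be passed to groupby directly
print (df.groupby(['user_id', 'item_bought']).size())
user_id  item_bought
b        x              2
         y              1
c        y              1
dtype: int64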
this should work:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df
           col1  col2
ind1 ind2
A    0        3     2
     1        2     0
     2        2     3
B    3        2     4
C    4        3     1
     5        0     0
>>> df.groupby([df.index.get_level_values(0),'col1']).count()
           col2
ind1 col1
A    2        2
     3        1
B    2        1
C    0        1
     3        1
I had the same problem using one of the columns from a MultiIndex. With a MultiIndex, you cannot use df.index.levels[0], since it holds only the distinct values from that particular index level and will most likely have a different length than the whole dataframe.
See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html - get_level_values returns a "vector of label values for requested level, equal to the length of the index".
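With the example frame above, a quick hedged illustration of that difference:
>>> df.index.levels[0]
Index(['A', 'B', 'C'], dtype='object', name='ind1')
>>> df.index.get_level_values(0)
Index(['A', 'A', 'A', 'B', 'C', 'C'], dtype='object', name='ind1')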
import pandas as pd
import numpy as np
In [11]: df = pd.DataFrame()
In [12]: df['user_id'] = ['b','b','b','c']
In [13]: df['item_bought'] = ['x','x','y','y']
In [14]: df['ct'] = 1
In [15]: df
Out[15]:
  user_id item_bought  ct
0       b           x   1
1       b           x   1
2       b           y   1
3       c           y   1
In [16]: pd.pivot_table(df,values='ct',index=['user_id','item_bought'],aggfunc=np.sum)
Out[16]:
user_id  item_bought
b        x              2
         y              1
c        y              1
I had the same problem - imported a bunch of data and I wanted to groupby a field that was the index. I didn't have a multi-index or any of that jazz and nor do you.
I figured the problem is that the field I want is the index, so at first I just reset the index - but this gives me a useless index field that I don't want. So now I do the following (two levels of grouping):
grouped = df.reset_index().groupby(by=['Field1','Field2'])
then I can use 'grouped' in a bunch of ways for different reports
grouped[['Field3','Field4']].agg([np.mean, np.std])
(which was what I wanted, giving me Field3 and Field4 means and standard deviations, grouped by Field1 (the index) and Field2).
For you, if you just want to do the count of items per user, in one simple line using groupby, the code could be
df.reset_index().groupby(by=['user_id']).count()
If you want to do more things then you can (like me) create 'grouped' and then use that. As a beginner, I find it easier to follow that way.
Please note that reset_index is not 'in place', so it will not mess up your original dataframe.
I would like to print all rows of a dataframe where I find the value '-' in any of the columns. Can someone please explain a way that is better than those described below?
This Q&A already explains how to do so by using boolean indexing but each column needs to be declared separately:
print df.ix[df['A'].isin(['-']) | df['B'].isin(['-']) | df['C'].isin(['-'])]
I tried the following but I get an error 'Cannot index with multidimensional key':
df.ix[df[df.columns.values].isin(['-'])]
So I used this code but I'm not happy with the separate printing for each column tested because it is harder to work with and can print the same row more than once:
import pandas as pd
d = {'A': [1,2,3], 'B': [4,'-',6], 'C': [7,8,'-']}
df = pd.DataFrame(d)
for i in range(len(d.keys())):
    temp = df.ix[df.iloc[:,i].isin(['-'])]
    if temp.shape[0] > 0:
        print temp
Output looks like this:
A B C
1 2 - 8
[1 rows x 3 columns]
A B C
2 3 6 -
[1 rows x 3 columns]
Thanks for your advice.
Alternatively, you could do something like df[df.isin(["-"]).any(axis=1)], e.g.
>>> df = pd.DataFrame({'A': [1,2,3], 'B': ['-','-',6], 'C': [7,8,9]})
>>> df.isin(["-"]).any(axis=1)
0 True
1 True
2 False
dtype: bool
>>> df[df.isin(["-"]).any(axis=1)]
A B C
0 1 - 7
1 2 - 8
(Note I changed the frame a bit so I wouldn't get the axes wrong.)
you can do:
>>> idx = df.apply(lambda ts: any(ts == '-'), axis=1)
>>> df[idx]
A B C
1 2 - 8
2 3 6 -
or
lambda ts: '-' in ts.values
Note that in looks at the index, not the values, so you need .values.
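Putting it together on the frame from the question, a hedged usage sketch:
>>> df[df.apply(lambda ts: '-' in ts.values, axis=1)]
   A  B  C
1  2  -  8
2  3  6  -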