I have a pandas Series containing datetime data. I would like to convert it to a series of unique integer labels, one per distinct value. I am looking for a direct, fast command, as the dataset is large.
Example:
0
0 2015-07-05
1 2015-07-12
3 2015-07-19
4 2015-07-12
Should be converted to:
0
0 1
1 2
3 3
4 2
In fact, I am also wondering whether there is a general-purpose command that converts a series of any data type into a series of unique integers in this way.
Use factorize:
s = pd.Series(['2015-07-05', '2015-07-12', '2015-07-19', '2015-07-12'], name=0)
print (s)
0 2015-07-05
1 2015-07-12
2 2015-07-19
3 2015-07-12
Name: 0, dtype: object
s1 = pd.Series(pd.factorize(s)[0] + 1, s.index)
print (s1)
0 1
1 2
2 3
3 2
dtype: int64
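For reference, pd.factorize returns a tuple (codes, uniques): the codes are 0-based integers assigned in order of first appearance, which is why 1 is added above. A minimal sketch:
import pandas as pd

codes, uniques = pd.factorize(['b', 'a', 'b', 'c'])
print(codes)    # [0 1 0 2] -- integer code per element, by first appearance
print(uniques)  # ['b' 'a' 'c'] -- the distinct values, in order of first appearance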
Another possible solution is rank:
s1 = s.rank(method='dense').astype(int)
print (s1)
0 1
1 2
2 3
3 2
Name: 0, dtype: int32
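As an aside, the method matters here: 'dense' keeps the ranks consecutive even with ties, which is what makes them usable as compact integer labels. A quick illustration with made-up values:
import pandas as pd

s = pd.Series([10, 20, 20, 30])
print(s.rank(method='dense').tolist())  # [1.0, 2.0, 2.0, 3.0] -- no gap after the tie
print(s.rank(method='min').tolist())    # [1.0, 2.0, 2.0, 4.0] -- gap after the tie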
The timings differ considerably:
s = pd.concat([s]*100000).reset_index(drop=True)
In [78]: %timeit (pd.Series(pd.factorize(s)[0] + 1, s.index))
100 loops, best of 3: 13.9 ms per loop
In [79]: %timeit (s.rank(method='dense').astype(int))
1 loop, best of 3: 536 ms per loop
I'm using pandas to come up with a new column that, for each row, counts how many values in column A (values in the range [1-100]) are less than that row's value.
See the [df] example below:
A  NewCol
1       0
3       2
2       1
5       4
8       5
3       2
Essentially, for each row I need to look at the entire column A and count how many values are less than the current row's value. So for the value 5, there are 4 values less than 5 (1, 2, 3, 3).
What would be the easiest way of doing this?
Thanks!
One way to do it is to use rank with method='min':
df['NewCol'] = (df['A'].rank(method='min') - 1).astype(int)
Output:
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
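This works because with method='min' each value receives 1 plus the number of strictly smaller values, so subtracting 1 yields exactly the desired count. A quick sanity check against a brute-force version:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 2, 5, 8, 3]})
ranked = (df['A'].rank(method='min') - 1).astype(int)
brute = df['A'].apply(lambda v: (df['A'] < v).sum())  # O(n^2) reference
assert (ranked == brute).all()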
Another option is NumPy broadcasting. Note that this materializes an n×n comparison matrix, so memory use grows quadratically with the length of the column:
s=df.A.values
(s[:,None]>s).sum(1)
Out[649]: array([0, 2, 1, 4, 5, 2])
#df['NewCol']=(s[:,None]>s).sum(1)
Timing:
df=pd.concat([df]*1000)
%%timeit
s=df.A.values
(s[:,None]>s).sum(1)
10 loops, best of 3: 83.7 ms per loop
%timeit (df['A'].rank(method='min') - 1).astype(int)
1000 loops, best of 3: 479 µs per loop
Try this plain-Python approach (filling in the column A values from the question):
A = [1, 3, 2, 5, 8, 3]  # the values of column A
less_than = []
for element in A:
    counter = 0
    for number in A:
        if number < element:
            counter += 1
    less_than.append(counter)
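To attach the result as a new column afterwards (assuming the df built from the question's data):
df['NewCol'] = less_than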
You can do it this way:
import pandas as pd
df = pd.DataFrame({'A': [1,3,2,5,8,3]})
df['NewCol'] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'NewCol'] = (df.loc[:, 'A'] < row.A).sum()
print(df)
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
Another way is to sort and reset the index (see the sketch after the output for mapping the positions back):
m=df.A.sort_values().reset_index(drop=True).reset_index()
m.columns=['new','A']
print(m)
new A
0 0 1
1 1 2
2 2 3
3 3 3
4 4 5
5 5 8
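As given, this only shows each value's position in the sorted order; to produce the per-row counts the question asks for, those positions still need to be mapped back onto the original rows. One way to complete the idea, keeping the first sorted position per value so ties count only strictly smaller values (a sketch, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 2, 5, 8, 3]})
m = df.A.sort_values().reset_index(drop=True).reset_index()
m.columns = ['new', 'A']
first_pos = m.drop_duplicates('A').set_index('A')['new']  # first sorted position per value
df['NewCol'] = df['A'].map(first_pos)
print(df['NewCol'].tolist())  # [0, 2, 1, 4, 5, 2]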
You didn't specify whether speed or memory usage was important (or whether you have a very large dataset). The "easiest" way is straightforward: calculate how many values are less than i for each entry in the column and collect those into a new column:
df=pd.DataFrame({'A': [1,3,2,5,8,3]})
col=df['A']
df['new_col']=[ sum(col<i) for i in col ]
print(df)
Result:
A new_col
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
There might be more efficient ways to do this on large datasets, such as sorting your column first.
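One sort-based route, sketched here as an assumption rather than a tested benchmark, is np.searchsorted, which gives the count of strictly smaller values in O(n log n):
import numpy as np

s = np.array([1, 3, 2, 5, 8, 3])
counts = np.searchsorted(np.sort(s), s, side='left')  # elements strictly less than each value
print(counts)  # [0 2 1 4 5 2]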
I have a problem where data is sorted by date, for example something like this:
date, value, min
2015-08-17, 3, nan
2015-08-18, 2, nan
2015-08-19, 4, nan
2015-08-28, 1, nan
2015-08-29, 5, nan
Now I want the min column to hold the minimum value seen up to and including each row, so the result would look something like this:
date, value, min
2015-08-17, 3, 3
2015-08-18, 2, 2
2015-08-19, 4, 2
2015-08-28, 1, 1
2015-08-29, 5, 1
I've tried a few options but still don't see what I'm doing wrong; here is one attempt:
data['min'] = min(data['value'], data['min'].shift())
I don't want to iterate through all rows because my data is big. What is the best pandas strategy for this kind of problem?
Since you mentioned that you are working with a big dataset and care about performance, here's one approach using NumPy's np.minimum.accumulate -
df['min'] = np.minimum.accumulate(df.value)
Sample run -
In [70]: df
Out[70]:
date value min
0 2015-08-17 3 NaN
1 2015-08-18 2 NaN
2 2015-08-19 4 NaN
3 2015-08-28 1 NaN
4 2015-08-29 5 NaN
In [71]: df['min'] = np.minimum.accumulate(df.value)
In [72]: df
Out[72]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
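np.minimum.accumulate applies the elementwise minimum cumulatively along the array, i.e. it computes a running minimum. A minimal illustration:
import numpy as np

print(np.minimum.accumulate([3, 2, 4, 1, 5]))  # [3 2 2 1 1]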
Runtime test -
In [65]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# #MaxU's soln using pandas cummin
In [66]: %timeit df['min'] = df.value.cummin()
100 loops, best of 3: 6.84 ms per loop
In [67]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# Using NumPy
In [68]: %timeit df['min'] = np.minimum.accumulate(df.value)
100 loops, best of 3: 3.97 ms per loop
Use the cummin() method:
In [53]: df['min'] = df.value.cummin()
In [54]: df
Out[54]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
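As a side note beyond the question, the same idea extends to running minima within groups via groupby, e.g. (made-up data):
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'value': [3, 2, 5, 4]})
df['min'] = df.groupby('key')['value'].cummin()  # running min per key
print(df)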
I have a dataframe that looks like this:
In [60]: df1
Out[60]:
DIFF UID
0 NaN 1
1 13.0 1
2 4.0 1
3 NaN 2
4 3.0 2
5 23.0 2
6 NaN 3
7 4.0 3
8 29.0 3
9 42.0 3
10 NaN 4
11 3.0 4
and for each UID I want to count how many rows have a DIFF value above a given threshold.
I have tried something like this:
In [61]: threshold = 5
In [62]: df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reset_index().rename(columns={'DIFF':'ATTR_NAME'})
Out[62]:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
That works fine as far as returning the right count of instances per user. However, I would like to also include the users that have 0 instances, which are currently excluded by the df1[df1.DIFF > threshold] part.
The desired output would be:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
3 4 0
Any ideas?
Thanks
Simple, use .reindex:
req = df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count()
req = req.reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF':'ATTR_NAME'})
In one line:
df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF':'ATTR_NAME'})
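Note that reindex introduces NaN for labels missing from the grouped result, which is why fill_value=0 is passed above; it also keeps the dtype integer. A minimal illustration:
import pandas as pd

counts = pd.Series({1: 1, 2: 1, 3: 2})
print(counts.reindex([1, 2, 3, 4]))                # label 4 becomes NaN, dtype float64
print(counts.reindex([1, 2, 3, 4], fill_value=0))  # label 4 becomes 0, dtype stays int64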
Another way would be to use a function with apply() to do this:
In [101]: def count_instances(x, threshold):
              counter = 0
              for i in x:
                  if i > threshold:
                      counter += 1
              return counter
In [102]: df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
Out[102]:
UID DIFF
0 1 1
1 2 1
2 3 2
3 4 0
On this small frame the two approaches time essentially the same:
In [103]: %timeit df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
100 loops, best of 3: 2.38 ms per loop
In [104]: %timeit df1[df1.DIFF > 5].groupby('UID')['DIFF'].count().reset_index()
100 loops, best of 3: 2.39 ms per loop
Would something like this work well for you?
Counting the values that match a criterion, without dropping the keys that have no matches, is equivalent to counting the number of True values per group, which can be done by summing booleans:
(df1.DIFF > 5).groupby(df1.UID).sum().reset_index()
UID DIFF
0 1 1.0
1 2 1.0
2 3 2.0
3 4 0.0
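If integer counts are preferred over the floats shown above, a cast can be appended (a small assumption about the desired output dtype):
(df1.DIFF > 5).groupby(df1.UID).sum().astype(int).reset_index()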
In pandas, I have a dataframe with ZipCode, Age, and a bunch of columns that should all have values of 1 or 0, i.e.:
ZipCode Age A B C D
12345 21 0 1 1 1
12345 22 1 0 1 4
23456 45 1 0 1 1
23456 21 3 1 0 0
I want to delete all rows in which a value other than 0 or 1 appears in columns A, B, C, or D, as a way to clean up the data. In this case, I would remove the 2nd and 4th rows, because a 4 appears in column D of row 2 and a 3 appears in column A of row 4. I want to do this even if I have 100 columns to check, without writing out every column in my conditional statement. How would I do this?
Use isin to test for membership and all to check that every value in a row passed the test, then use the resulting boolean mask to filter the df:
In [12]:
df[df.loc[:,'A':].isin([0,1]).all(axis=1)]
Out[12]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
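To see what the mask looks like before filtering, the intermediate steps can be inspected on a small reconstruction of the frame:
import pandas as pd

df = pd.DataFrame({'ZipCode': [12345, 12345, 23456, 23456],
                   'Age': [21, 22, 45, 21],
                   'A': [0, 1, 1, 3], 'B': [1, 0, 0, 1],
                   'C': [1, 1, 1, 0], 'D': [1, 4, 1, 0]})
mask = df.loc[:, 'A':].isin([0, 1])  # elementwise membership test
print(mask.all(axis=1))              # True only where every value in the row is 0 or 1
print(df[mask.all(axis=1)])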
You can opt for a vectorized solution:
In [64]: df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
Out[64]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
The other two solutions work well, but if you are interested in speed you should look at NumPy's in1d function:
data=df.loc[:, 'A':]
In [72]: df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
Out[72]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
Timing:
In [73]: %timeit data=df.loc[:, 'A':]; df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
1000 loops, best of 3: 558 µs per loop
In [74]: %timeit df[df.loc[:,'A':].isin([0,1]).all(axis=1)]
1000 loops, best of 3: 843 us per loop
In [75]: %timeit df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
1000 loops, best of 3: 1.44 ms per loop
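In newer NumPy versions, np.isin is the recommended spelling of in1d; it preserves the input's shape, which removes the manual reshape (a minor modernization, not part of the original answer):
data = df.loc[:, 'A':]
df[np.isin(data.values, [0, 1]).all(axis=1)]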
Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts a 1D array or Series as its argument. Applying pd.qcut to every column therefore requires multiple calls to pd.qcut, so no matter how you dress it up, there will be a loop, either explicit or implicit.
You could, for example, use apply to call pd.qcut for each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply uses a for-loop, so it really isn't very different from your for-loop:
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
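Equivalently, the explicit loop can be folded into a dict comprehension fed to the DataFrame constructor; it performs the same per-column calls, just written more compactly:
df_q = pd.DataFrame({col: pd.qcut(df[col], 5, labels=list(range(5))) for col in df})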