I have a dataframe data with close to 4 million rows. It is a list of cities of the world, and I need to query by city name as fast as possible.
The best I have found so far takes 346 ms, by indexing on the city name:
d2=data.set_index("city",inplace=False)
timeit d2.loc[['PARIS']]
1 loop, best of 3: 346 ms per loop
This is still much too slow. I wonder if group-by could give a faster query (and how to write such a query). Each city has around 10 rows in the dataframe (duplicate city names).
I have searched for several days and could not find a clear solution on the internet.
Thank you.
Setup
df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(100000)]*10,columns=['city','value'])
Baseline
df2 = df.set_index('city')
%timeit df2.loc[['Paris9999']]
10 loops, best of 3: 45.6 ms per loop
Solution
Build a lookup dict, then use iloc:
idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
%timeit df.iloc[idx_dict['Paris9999']]
1000 loops, best of 3: 432 µs per loop
It seems this approach is almost 100 times faster than the baseline.
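The dict-lookup idea above can be sketched end-to-end on toy data (the values here are illustrative; note that iloc expects positions, which coincide with the stored index labels only because the frame keeps its default RangeIndex):

```python
import pandas as pd

# Build a city -> row-positions dict once, then answer each query with iloc.
df = pd.DataFrame({'city': ['Paris', 'Delhi', 'Paris', 'NY'],
                   'value': [1, 2, 3, 4]})
idx_dict = df.groupby('city').apply(lambda g: g.index.tolist()).to_dict()

# With the default RangeIndex, labels equal positions, so iloc is valid here.
rows = df.iloc[idx_dict['Paris']]
print(rows)
```

The dict build is paid once up front; each subsequent lookup is a plain hash-table hit plus a positional take.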
Comparing to other approaches:
%timeit df2[df2.index.values=="Paris9999"]
100 loops, best of 3: 16.7 ms per loop
%timeit full_array_based(df2, "Paris9999")
10 loops, best of 3: 19.6 ms per loop
Working with the index's underlying array data, comparing it against the needed label and then indexing with the resulting mask might be one option when looking for performance. A sample case should make things clear.
1) Input dataframe:
In [591]: df
Out[591]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [592]: d2 = df.set_index("city",inplace=False)
In [593]: d2
Out[593]:
population
city
Delhi 1000
Paris 56
NY 89
Paris 36
Delhi 300
Paris 52
Paris 34
Delhi 40
NY 89
Delhi 450
2) Indexing with .loc:
In [594]: d2.loc[['Paris']]
Out[594]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
3) Mask-based indexing:
In [595]: d2[d2.index.values=="Paris"]
Out[595]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
4) Finally, timings:
In [596]: %timeit d2.loc[['Paris']]
1000 loops, best of 3: 475 µs per loop
In [597]: %timeit d2[d2.index.values=="Paris"]
10000 loops, best of 3: 156 µs per loop
Further boost
Going further with array data, we can extract the entire input dataframe as an array and index into it. An implementation using that philosophy would look something like this -
def full_array_based(d2, indexval):
    # Mask the raw values array with an index comparison, then rebuild a
    # small DataFrame with the matched label repeated as the index.
    df0 = pd.DataFrame(d2.values[d2.index.values == indexval])
    df0.index = [indexval] * df0.shape[0]
    df0.columns = d2.columns
    return df0
Sample run and timings -
In [635]: full_array_based(d2, "Paris")
Out[635]:
population
Paris 56
Paris 36
Paris 52
Paris 34
In [636]: %timeit full_array_based(d2, "Paris")
10000 loops, best of 3: 146 µs per loop
If we are allowed to pre-process and set up a dictionary that can be indexed to extract city-based rows from the input dataframe, here's one solution using NumPy to do so -
def indexed_dict_numpy(df):
    cs = df.city.values.astype(str)
    sidx = cs.argsort()         # positions that would sort the city names
    scs = cs[sidx]              # the city names in sorted order
    # Start/stop boundaries of each run of identical names in the sorted array
    idx = np.concatenate(([0], np.flatnonzero(scs[1:] != scs[:-1]) + 1, [cs.size]))
    return {n: sidx[i:j] for n, i, j in zip(cs[sidx[idx[:-1]]], idx[:-1], idx[1:])}
Sample run -
In [10]: df
Out[10]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [11]: dict1 = indexed_dict_numpy(df)
In [12]: df.iloc[dict1['Paris']]
Out[12]:
city population
1 Paris 56
3 Paris 36
5 Paris 52
6 Paris 34
Runtime test against Allen's solution for setting up a similar dictionary with 4 million rows -
In [43]: # Setup 4 million rows of df
...: df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(400000)]*10,\
...: columns=['city','value'])
...: np.random.shuffle(df.values)
...:
In [44]: %timeit df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
1 loops, best of 3: 2.01 s per loop
In [45]: %timeit indexed_dict_numpy(df)
1 loops, best of 3: 1.15 s per loop
Related
My dataframe is like this:
star_rating actors_list
0 9.3 [u'Tim Robbins', u'Morgan Freeman']
1 9.2 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 [u'Al Pacino', u'Robert De Niro']
3 9.0 [u'Christian Bale', u'Heath Ledger']
4 8.9 [u'John Travolta', u'Uma Thurman']
I want to extract the most frequent names in the actors_list column. I found the code below; do you have a better suggestion, especially for big data?
import pandas as pd
df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
Expected output (for this data):
robert de niro 13
tom hanks 12
clint eastwood 11
johnny depp 10
al pacino 10
james stewart 9
By my tests, it would be much faster to do the regex cleanup after counting.
from itertools import chain
import re
p = re.compile("""^u['"](.*)['"]$""")
ser = pd.Series(list(chain.from_iterable(
x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]
ser.head()
Robert De Niro 18
Brad Pitt 14
Clint Eastwood 14
Tom Hanks 14
Al Pacino 13
dtype: int64
It's often better to go for plain Python than to depend on pandas here, since pandas consumes a huge amount of memory if the lists are large.
If the longest list has 1000 elements, then with expand=True every shorter list gets padded with NaNs up to length 1000, which is a waste of memory. Try this instead:
df = pd.concat([df]*1000) # For the sake of large df.
%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop
%%timeit
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop
%%timeit
words = {}
for i in df['actors_list']:
    for w in i:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
100 loops, best of 3: 5.44 ms per loop
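The same counting loop can also be written with collections.Counter, which implements exactly this dict-based tally in C (a sketch on a toy stand-in for the actors_list column):

```python
from collections import Counter

# Toy stand-in for df['actors_list'] after the strings are split into lists.
actors_lists = [['Tim Robbins', 'Morgan Freeman'],
                ['Al Pacino', 'Robert De Niro'],
                ['Morgan Freeman', 'Al Pacino']]
words = Counter(w for row in actors_lists for w in row)
print(words['Morgan Freeman'])  # counted twice across the rows
```

Counter also gives you words.most_common(n) for free, which replaces the value_counts() step.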
I would use ast to convert the list-like strings into actual lists:
import ast
df.actors_list=df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
According to this code I got the chart below (chart image not reproduced here), where:
coldspeed's code is wen2()
Dark's code is wen4()
my code is wen1()
W-B's code is wen3()
I have the following (sample) dataframe:
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
....
There are no duplicate values in the index.
I am in the unenviable position of having to append to this dataframe using elements from a number of other dataframes. So I'm appending as follows:
names_df = names_df.append({'Age': someage,
                            'height': someheight,
                            'weight': someweight,
                            'haircolor': somehaircolor},
                           ignore_index=True)
My question is: using this method, how do I set the new index value in names_df equal to the person's name?
The only thing I can think of is to reset the df index before I append and then re-set it afterward. Ugly. There has to be a better way.
Thanks in advance.
I am not sure in what format you are getting the data that you are appending to the original df, but one way is as follows:
df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
new_name someage someheight someweight somehaircolor
Time Testing:
%timeit df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
1000 loops, best of 3: 408 µs per loop
%timeit df.append(pd.DataFrame({'Age': 'someage', 'height': 'someheight','weight':'someweight','haircolor': 'somehaircolor'}, index=['some_person']))
100 loops, best of 3: 2.59 ms per loop
Here's another way using append. Instead of passing a dictionary, pass a dataframe (created from a dictionary) while specifying the index:
names_df = names_df.append(pd.DataFrame({'Age': 'someage',
'height': 'someheight',
'weight':'someweight',
'haircolor': 'somehaircolor'}, index=['some_person']))
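For what it's worth, DataFrame.append was later deprecated (pandas 1.4) and removed (pandas 2.0); the same row-with-a-named-index idea can be expressed with pd.concat. A sketch with made-up values:

```python
import pandas as pd

names_df = pd.DataFrame({'Age': [35], 'height': [5.5],
                         'weight': [145], 'haircolor': ['brown']},
                        index=['joe'])
# Wrap the new row in a one-row DataFrame whose index is the person's name.
new_row = pd.DataFrame({'Age': [29], 'height': [5.8],
                        'weight': [150], 'haircolor': ['black']},
                       index=['ann'])
names_df = pd.concat([names_df, new_row])
print(names_df)
```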
I would greatly appreciate someone's help with counting the number of matching state names from two columns in my csv file. For instance, consider the first 7 observations from the columns State_born_in and State_lives_in:
State_born_in State_lives_in
New York Florida
Massachusetts Massachusetts
Florida Massachusetts
Illinois Illinois
Iowa Texas
New Hampshire Massachusetts
California California
Basically I want to count the number of people who live in the same state they were born in, and then the percentage of all people who do. So in the example above I would have count = 2, since two people (the Massachusetts and California rows) live in the same state they were born in. For the percentage I would just divide 2 by the number of observations. I'm still relatively new to pandas, but this is what I've tried so far:
df = pd.read_csv("uscitizens.csv","a")
import pandas as pd
counts = df[(df['State_born_in'] == df['state_lives_in'])] ; counts
percentage = counts/len(df['State_born_in'])
Moreover, how would I do this on a dataset that has over 2 million observations? I would greatly appreciate anyone's help.
You can first use boolean indexing and then simply divide the length of the filtered DataFrame by the length of the original (using the length of the index, which is fastest):
print (df)
State_born_in State_lives_in
0 New York Florida
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
4 Florida Massachusetts
5 Illinois Illinois
6 Iowa Texas
7 New Hampshire Massachusetts
8 California California
same = df[(df['State_born_in'] == df['State_lives_in'])]
print (same)
State_born_in State_lives_in
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
5 Illinois Illinois
8 California California
counts = len(same.index)
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.55555555555556
Timings:
In [21]: %timeit len(same.index)
The slowest run took 18.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 546 ns per loop
In [22]: %timeit same.shape[0]
The slowest run took 21.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop
In [23]: %timeit len(same['State_born_in'])
The slowest run took 46.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
Faster solution:
same = (df['State_born_in'] == df['State_lives_in'])
print (same)
0 False
1 True
2 True
3 True
4 False
5 True
6 False
7 False
8 True
dtype: bool
counts = same.sum()
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.5555555556
Timings on a 2M-row DataFrame:
#[2000000 rows x 2 columns]
df = pd.concat([df]*200000).reset_index(drop=True)
#print (df)
In [127]: %timeit (100 * (df['State_born_in'] == df['State_lives_in']).sum()/len(df.index))
1 loop, best of 3: 444 ms per loop
In [128]: %timeit (100 * len(df[(df['State_born_in'] == df['State_lives_in'])].index)/len(df.index))
1 loop, best of 3: 472 ms per loop
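Since the mean of a boolean Series is just the fraction of True values, the count-and-divide steps can be collapsed into a single expression; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'State_born_in':  ['New York', 'Massachusetts', 'Florida'],
                   'State_lives_in': ['Florida', 'Massachusetts', 'Florida']})
# mean() of the boolean match column gives the matching fraction directly.
percentage = 100 * (df['State_born_in'] == df['State_lives_in']).mean()
print(percentage)  # 2 of the 3 rows match
```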
Are you expecting this?
counts = df[ df['State_born_in'] == df['State_lives_in'] ].groupby('State_born_in').agg(['count']).sum()
counts / len(df['State_born_in'])
I have data like this:
location sales store
0 68 583 17
1 28 857 2
2 55 190 59
3 98 517 64
4 94 892 79
...
For each unique pair (location, store), there are 1 or more sales. I want to add a column, pcnt_sales that shows what percent of the total sales for that (location, store) pair was made up by the sale in the given row.
location sales store pcnt_sales
0 68 583 17 0.254363
1 28 857 2 0.346543
2 55 190 59 1.000000
3 98 517 64 0.272105
4 94 892 79 1.000000
...
This works, but it is slow:
import pandas as pd
import numpy as np
df = pd.DataFrame({'location':np.random.randint(0, 100, 10000), 'store':np.random.randint(0, 100, 10000), 'sales': np.random.randint(0, 1000, 10000)})
import timeit
start_time = timeit.default_timer()
df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
print(timeit.default_timer() - start_time) # 1.46 seconds
By comparison, R's data.table does this super fast
library(data.table)
dt <- data.table(location=sample(100, size=10000, replace=TRUE), store=sample(100, size=10000, replace=TRUE), sales=sample(1000, size=10000, replace=TRUE))
ptm <- proc.time()
dt[, pcnt_sales:=sales/sum(sales), by=c("location", "store")]
proc.time() - ptm # 0.007 seconds
How do I do this efficiently in Pandas (especially considering my real dataset has millions of rows)?
For performance you want to avoid apply. You could use transform to get the result of the groupby expanded to the original index instead, at which point a division would work at vectorized speed:
>>> %timeit df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
1 loop, best of 3: 2.27 s per loop
>>> %timeit df['pcnt_sales2'] = (df["sales"] /
df.groupby(['location', 'store'])['sales'].transform(sum))
100 loops, best of 3: 6.25 ms per loop
>>> df["pcnt_sales"].equals(df["pcnt_sales2"])
True
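Put together as a minimal self-contained example (toy values, column names matching the question):

```python
import pandas as pd

df = pd.DataFrame({'location': [1, 1, 2],
                   'store':    [5, 5, 6],
                   'sales':    [10, 30, 7]})
# transform('sum') broadcasts each group's total back onto every row of the
# group, so the division runs as one vectorized elementwise operation.
df['pcnt_sales'] = df['sales'] / df.groupby(['location', 'store'])['sales'].transform('sum')
print(df['pcnt_sales'].tolist())  # [0.25, 0.75, 1.0]
```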
There is a pandas DataFrame object with some stock data. The SMAs are moving averages calculated from the previous 45/15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates, when SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row where either:
the long SMA (45) value was bigger than the short SMA (15) value for longer than the short SMA period (15) and then became smaller, or
the long SMA (45) value was smaller than the short SMA (15) value for longer than the short SMA period (15) and then became bigger.
I'm taking a crossover to mean the point where the SMA lines, as functions of time, intersect, as depicted on this Investopedia page.
Since the SMAs represent continuous functions, there is a crossing when,
for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is
greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
If we change your data to
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
so that there are crossings,
then
import pandas as pd
df = pd.read_table('data', sep='\s+')
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64
The following method gives similar results, but takes less time than the previous methods:
df['position'] = df['SMA_15'] > df['SMA_45']
df['pre_position'] = df['position'].shift(1)
df.dropna(inplace=True) # dropping the NaN values
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As an alternative to unutbu's answer, something like the following can also be done to find the indices where SMA_15 crosses SMA_45:
diff = df['SMA_15'] < df['SMA_45']
diff_forward = diff.shift(1)
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1,2]
print(df.iloc[crossing])
>>>
Date Price SMA_15 SMA_45
1 20150128 103.05 100 106
2 20150129 105.10 112 105
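A closely related variant of this shift-and-compare idea, sketched on a cut-down version of the sample data (newer pandas/NumPy versions reject subtracting booleans, so != is used in place of abs(diff - diff_forward)):

```python
import pandas as pd

df = pd.DataFrame({'Date': [20150127, 20150128, 20150129],
                   'SMA_45': [113, 100, 112],
                   'SMA_15': [106, 106, 105]})
below = df['SMA_15'] < df['SMA_45']   # True where the short SMA is below the long one
# A crossing occurs wherever the relation flips relative to the previous row.
crossing = below != below.shift(1)
crossing.iloc[0] = False              # the first row has no previous row to compare
print(df.loc[crossing, 'Date'].tolist())  # [20150128, 20150129]
```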