In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]
How should I go about this instead?
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer
setup = "import pandas; male_trips=pandas.load('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)
and here is the result:
In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079
In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
Note that, at this speed, value_counts is still marginally quicker to type and easier to remember when exploring data interactively!
I'd do it like Vishal, but use size() instead of sum() to get the number of rows in each 'start_station_id' group. So:
df = male_trips.groupby('start_station_id').size()
My answer below works in Pandas 0.7.3. Not sure about the new releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straightforward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
male_trips[male_trips.start_station_id.isin(stations.id.values)]
.start_station_id
.value_counts()
)
and this will only give counts for station IDs actually found in stations.id.
male_trips.count()
doesn't work?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
How long would this take:
df = male_trips.groupby('start_station_id').sum()
Edit: after seeing in the answer above that isin and value_counts exist (value_counts even has its own entry in pandas.core.algorithms, and isin isn't simply np.in1d), I updated the three methods below:
male_trips.start_station_id[male_trips.start_station_id.isin(stations.id)].value_counts()
You could also do an inner join on stations.id:
pd.merge(male_trips, stations, left_on='start_station_id', right_on='id') followed by value_counts on start_station_id.
Or:
male_trips.set_index('start_station_id', inplace=True)
stations.set_index('id', inplace=True)
# .ix has been removed from pandas; .loc does the same label-based lookup here.
# Counting on the index avoids relying on the index name surviving the lookup.
male_trips.loc[male_trips.index.intersection(stations.index)].index.value_counts()
If you have the time I'd be interested how this performs differently with a huge DataFrame.
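A minimal sketch of how either count can be lined up with stations['id'], using small made-up frames (the sample values are hypothetical, not the real trip data). Both value_counts() and groupby().size() leave out stations that have no trips at all, so reindex with fill_value=0 is used here to get one number per station, in the same order as the original list comprehension:

```python
import pandas as pd

# Hypothetical sample data standing in for the real frames.
male_trips = pd.DataFrame({"start_station_id": [3, 1, 3, 3, 7, 1]})
stations = pd.DataFrame({"id": [1, 2, 3]})

# One pass over the big frame, then align the result with the small one.
by_value_counts = (
    male_trips["start_station_id"]
    .value_counts()
    .reindex(stations["id"], fill_value=0)  # station 2 has no trips -> 0
)
by_groupby = (
    male_trips.groupby("start_station_id")
    .size()
    .reindex(stations["id"], fill_value=0)
)

# The original (slow) approach, kept only for comparison.
mc = [int((male_trips["start_station_id"] == i).sum()) for i in stations["id"]]

print(by_value_counts.tolist())  # [2, 0, 3]
```

Station 7 appears in the trips but not in stations, so the reindex drops it, which matches what the list comprehension does.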
Related
I have three columns: id (unique), value, time
I want to create a new column holding a simple row number, without any partitioning.
I tried: df['test'] = df.groupby('id_col').cumcount()+1
But the output is only ones.
I expect to get 1 through len(df).
Also, is there a way to do it in numpy for better performance?
If your index is already ordered starting from 0:
df["row_num"] = df.index + 1
Otherwise:
df["row_num"] = df.reset_index().index + 1
Comparing times with %%timeit, from fastest to slowest: @Scott Boston's method > @Henry Ecker's method > mine
df["row_num"] = range(1,len(df)+1)
Alternative:
df.insert(0, "row_num", range(1,len(df)+1))
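For the numpy part of the question, here is a small sketch on a made-up frame (the column values and the non-sequential index are assumptions for illustration). It also shows why the groupby attempt returns only ones: cumcount numbers rows within each group, and with a unique id every group has exactly one row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a unique id column and an unordered index.
df = pd.DataFrame(
    {"id": [10, 20, 30], "value": [1.5, 2.5, 3.5]},
    index=[7, 3, 9],
)

# Each id forms its own group of size 1, so this is 1 for every row.
per_group = df.groupby("id").cumcount() + 1

# np.arange gives positional row numbers regardless of the index.
df["row_num"] = np.arange(1, len(df) + 1)

print(per_group.tolist())      # [1, 1, 1]
print(df["row_num"].tolist())  # [1, 2, 3]
```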
My dataframe contains almost 800k rows, so iterating over it with standard methods is impractical. I searched a bit and found the iterrows() method, but I couldn't understand how to use it. Basically, this is my code. Can you help me update it to use iterrows()?
for i in range(len(x["Value"])):
    if x.loc[i, "PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        x.loc[i, "Santral_Type"] = "HES"
    elif x.loc[i, "PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
        x.loc[i, "Santral_Type"] = "TERMIK"
    elif x.loc[i, "PP_Name"] in ['BRS','ÇKL','DPZ']:
        x.loc[i, "Santral_Type"] = "RES"
    else:
        x.loc[i, "Santral_Type"] = "SOLAR"
How to iterate over very big dataframes -- In general, you don't. You should apply some sort of vectorized operation to the column as a whole. For example, your case can be handled with map and fillna:
map_dict = {
'HES' : ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
'TERMIK' : ['BND','BND2','TFB','TFB3','TFB4','KNT'],
'RES' : ['BRS','ÇKL','DPZ']
}
inv_map_dict = {x:k for k,v in map_dict.items() for x in v}
df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:
# Default value
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"
Note that 800k can not be considered a large table when using standard pandas methods.
I would advise strongly against using iterrows and for loops when you have vectorised solutions available which take advantage of the pandas api.
This is your code adapted with numpy, which should run much faster than your current method.
import numpy as np
col = 'PP_Name'
conditions = [
x[col].isin(
['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
),
x[col].isin(["BND", "BND2", "TFB", "TFB3", "TFB4", "KNT"]),
x[col].isin(["BRS", "ÇKL", "DPZ"])]
outcomes = ["HES", "TERMIK", "RES"]
x["Santral_Type"] = np.select(conditions, outcomes, default='SOLAR')
According to the documentation, df.iterrows() yields (index, Series) tuples.
You can use it like this:
for idx, row in df.iterrows():
    if row['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        # assign to this row only; df['Santral_Type'] = "HES" would overwrite the whole column
        df.loc[idx, 'Santral_Type'] = "HES"
    # and so on
By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
Better to do it as @mcsoini suggested.
The simplest method could be .values. Example:
def f(*args):
    # args holds one row's values, in column order
    return 'hello or some complicated operation'
df['newColumn'] = [f(*r) for r in df.values]
The drawbacks of this method, as far as I know, are that you cannot refer to the column values by name, only by position, and that there is no information about the index of the df.
The advantage is that it is faster than the iterrows, itertuples and apply methods.
Hope it helps.
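If a row loop really is unavoidable and access by column name matters, itertuples is a possible middle ground between iterrows and .values. The sketch below assumes a tiny made-up frame and a shortened, hypothetical mapping (only three plant names); it is meant to illustrate the access pattern, not to replace the vectorized answers above.

```python
import pandas as pd

# Hypothetical sample rows; the real frame has roughly 800k of them.
df = pd.DataFrame(
    {"PP_Name": ["ARK", "BND", "BRS", "XYZ"], "Value": [1, 2, 3, 4]}
)

# Shortened mapping for illustration only.
type_by_name = {"ARK": "HES", "BND": "TERMIK", "BRS": "RES"}

# itertuples yields namedtuples, so columns are reachable by attribute name.
labels = [
    type_by_name.get(row.PP_Name, "SOLAR")
    for row in df.itertuples(index=False)
]
df["Santral_Type"] = labels

print(df["Santral_Type"].tolist())  # ['HES', 'TERMIK', 'RES', 'SOLAR']
```

Building the full list first and assigning it once avoids writing to the frame inside the loop.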
Question
I have an email_alias column and I'd like to find the number of integers in that column (per row) in another column using Python. So far I can only count the total number of numbers in the entire column.
Attempt
I tried: df['count_numbers'] = sum(c.isdigit() for c in df['email_alias'])
Example:
email_alias count_numbers
thisisatest111 3
testnumber2 1
I believe this might be the simplest solution.
df['count_numbers'] = df['email_alias'].str.count(r'\d')
You can apply a custom python function to the column. I don't think there's a vectorized way. sum() here takes advantage of the fact that bools are a subclass of ints so all True values are equal to 1.
import pandas as pd
def count_digits(string):
    return sum(item.isdigit() for item in string)
df = pd.DataFrame({'a': ['thisisatest111', 'testnumber2']})
df['counts'] = df['a'].apply(count_digits)
Your approach of:
df['count_numbers'] = sum(c.isdigit() for c in df['email_alias'])
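could not work because iterating over df['email_alias'] yields whole strings rather than single characters, so c.isdigit() tests whether an entire alias consists of digits. sum() then produces one scalar, and df['count_numbers'] = assigns that single value to every row of the column. Here, apply implicitly iterates over the rows (but in Python time, so it's not vectorized). Then again, most of the .str accessor methods of Pandas do the same, despite the syntax suggesting they will go faster than a for loop.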
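A short sketch of the difference, on made-up aliases (the third value, "12345", is added here only to make the column-level sum non-zero):

```python
import pandas as pd

# Hypothetical sample aliases.
df = pd.DataFrame({"email_alias": ["thisisatest111", "testnumber2", "12345"]})

# Iterating the Series yields whole strings, so this counts how many
# aliases consist entirely of digits: a single number for the column.
column_total = sum(c.isdigit() for c in df["email_alias"])

# Per-row digit counts via the regex-based .str.count.
df["count_numbers"] = df["email_alias"].str.count(r"\d")

print(column_total)                  # 1
print(df["count_numbers"].tolist())  # [3, 1, 5]
```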
I have a Pandas DataFrame of subscriptions, each with a start datetime (timestamp) and an optional end datetime (if they were canceled).
For simplicity, I have created string columns for the date (e.g. "20170901") based on start and end datetimes (timestamps). It looks like this:
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None), ...], columns=["sd", "ed"])
The end result should be a time series of how many subscriptions were active on any given date in a range.
To that end, I created an Index for all the days within a range:
days = df.groupby(["sd"])["sd"].count()
I am able to create what I am interested in with a loop each executing a query over the entire DataFrame df.
count_by_day = pd.DataFrame([
    len(df.loc[(df.sd <= i) & (df.ed.isnull() | (df.ed > i))])
    for i in days.index], index=days.index)
Note that I have values for each day in the original dataset, so there are no gaps. I'm sure getting the date range can be improved.
The actual question is: is there an efficient way to compute this for a large initial dataset df, with multiple thousands of rows? It seems the method I used is quadratic in complexity. I've also tried df.query() but it's 66% slower than the Pythonic filter and does not change the complexity.
I tried to search the Pandas docs for examples but I seem to be using the wrong keywords. Any ideas?
It's an interesting problem; here's how I would do it. I'm not sure about performance.
EDIT: My first answer was incorrect; I didn't fully read the question.
# Initial data, columns as Timestamps
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None)], columns=["sd", "ed"])
df['sd'] = pd.DatetimeIndex(df.sd)
df['ed'] = pd.DatetimeIndex(df.ed)
# Range input and related index
beg = pd.Timestamp('2017-05-15')
end = pd.Timestamp('2017-09-15')
# pd.DatetimeIndex(start=..., end=...) was removed from pandas; date_range builds the same index
idx = pd.date_range(start=beg, end=end, freq='D')
# We drop records that fall entirely outside the range and then clip
# the subscriptions' start/end to the range bounds.
# .copy() avoids writing to a view of df.
fdf = df[(df.sd <= end) & ((df.ed >= beg) | pd.isnull(df.ed))].copy()
fdf['ed'] = fdf['ed'].fillna(end)
fdf['ps'] = fdf.sd.apply(lambda x: max(x, beg))
fdf['pe'] = fdf.ed.apply(lambda x: min(x, end))
# We run a conditional count
idx.to_series().apply(lambda x: len(fdf[(fdf.ps<=x) & (fdf.pe >=x)]))
Ok, I'm answering my own question after quite a bit of research, fiddling and trying things out. I may still be missing an obvious solution but maybe it helps.
The fastest solution I could find to date is (thanks Alex for some nice code patterns):
# Start with test data from question
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
('20170901', None), ...], columns=['sd', 'ed'])
# Convert to datetime columns
df['sd'] = pd.DatetimeIndex(df['sd'])
df['ed'] = pd.DatetimeIndex(df['ed'])
# assign the result back; fillna(..., inplace=True) on df.ed may not update df in recent pandas
df['ed'] = df['ed'].fillna(df['sd'].max())
# Note: In my real data I have timestamps - I convert them like this:
#df['sd'] = pd.to_datetime(df['start_date'], unit='s').apply(lambda x: x.date())
# Set and sort multi-index to enable slices
df = df.set_index(['sd', 'ed'], drop=False)
df.sort_index(inplace=True)
# Compute the active counts by day in range
# pd.DatetimeIndex(start=..., end=...) was removed from pandas; date_range is the replacement
di = pd.date_range(start=df.sd.min(), end=df.sd.max(), freq='D')
count_by_day = di.to_series().apply(lambda i: len(df.loc[
    (slice(None, i.date()), slice(i.date(), None)), :]))
In my real dataset (with >10K rows for df and a date range of about a year), this was twice as fast as the code in the question, about 1.5s.
Here are some lessons I learned:
Creating a Series with counters for the date range and iterating through the dataset df with df.apply or df.itertuples and incrementing counters was much slower. Curiously, apply was slower than itertuples. Don't even think of iterrows.
My dataset had a product_id with each row, so filtering the dataset for each product and running the calculation on the filtered result (for each product) was twice as fast as adding the product_id to the multi-index and slicing on that level too
Building an intermediate Series of active days (from iterating through each row in df and adding each date in the active range to the Series) and then grouping by date was much slower.
Running the code in the question on a df with a multi-index did not change the performance.
Running the code in the question on a df with a limited set of columns (my real dataset has 22 columns) did not change the performance.
I was looking at pd.crosstab and pd.Period but I was not able to get anything to work
Pandas is pretty awesome and trying to outsmart it is really hard (esp. non-vectorized in Python)
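One further option, not taken from the answers above, is to avoid the per-day query altogether: treat every start date as a +1 event and every end date as a -1 event, then take a running sum over the days. This is only a sketch under stated assumptions: it uses the three sample rows from the question without the elided ones, and it follows the question's rule that a subscription counts as active on day i when sd <= i and ed is missing or ed > i.

```python
import pandas as pd

# The three sample rows from the question (the elided rows are left out).
df = pd.DataFrame(
    [("20170511", None), ("20170514", "20170613"), ("20170901", None)],
    columns=["sd", "ed"],
)
df["sd"] = pd.to_datetime(df["sd"], format="%Y%m%d")
df["ed"] = pd.to_datetime(df["ed"], format="%Y%m%d")  # None becomes NaT

# +1 for each subscription starting on a date, -1 for each one ending on it.
starts = df["sd"].value_counts()
ends = df["ed"].dropna().value_counts()
delta = starts.sub(ends, fill_value=0).sort_index()

# Running total of active subscriptions, carried forward to every day.
days = pd.date_range(df["sd"].min(), df["sd"].max(), freq="D")
active = (
    delta.cumsum()
    .reindex(days, method="ffill")
    .fillna(0)
    .astype(int)
)

print(active.loc[pd.Timestamp("2017-06-12")])  # 2
print(active.loc[pd.Timestamp("2017-06-13")])  # 1
```

The work is one pass over the rows plus one pass over the days, rather than one full scan of df per day.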