I have three columns: id (unique), value, time.
I want to create a new column that is a simple row_number, without any partitioning.
I tried: df['test'] = df.groupby('id_col').cumcount() + 1
But the output is all ones.
I'm expecting to get 1 -> len of the dataframe.
Also, is there a way to do it in numpy for better performance?
Since id is unique, every group from groupby('id_col') has exactly one row, so cumcount() returns 0 everywhere and you get all ones after adding 1.
If your index is already an ordered range starting from 0:
df["row_num"] = df.index + 1
Otherwise:
df["row_num"] = df.reset_index().index + 1
Comparing with %%timeit, the speeds from fastest to slowest are: Scott Boston's method > Henry Ecker's method > mine.
df["row_num"] = range(1,len(df)+1)
Alternative:
df.insert(0, "row_num", range(1,len(df)+1))
I need to create a new column in a csv called BTTS, which is based on two other columns, FTHG and FTAG. If FTHG & FTAG are both greater than zero, BTTS should be 1. Otherwise it should be zero.
What's the best way to do this in pandas / numpy?
I'm not sure what the best way is, but here is one solution using the pandas loc method:
df.loc[(df['FTHG'] > 0) & (df['FTAG'] > 0), 'BTTS'] = 1
df['BTTS'] = df['BTTS'].fillna(0)
Another solution using pandas apply method:
def check_greater_zero(row):
    # use 'and' here: with '&', operator precedence would evaluate 0 & row['FTAG'] first
    return 1 if row['FTHG'] > 0 and row['FTAG'] > 0 else 0

df['BTTS'] = df.apply(check_greater_zero, axis=1)
EDIT:
As stated in the comments, the first, vectorized, implementation is more efficient.
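A vectorized alternative is np.where, which builds the whole 0/1 column from the boolean mask at once instead of calling a Python function per row; a minimal sketch, assuming numpy is imported as np and the same column names:
import numpy as np
df['BTTS'] = np.where((df['FTHG'] > 0) & (df['FTAG'] > 0), 1, 0)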
I don't know if this is the best way to do it, but this works :)
df['BTTS'] = [1 if x > 0 and y > 0 else 0 for x, y in zip(df['FTAG'], df['FTHG'])]
I am trying to find a more pandorable way to get all rows of a DataFrame past a certain value in a certain column (the Quarter column in this case).
I want to slice a DataFrame of GDP statistics to get all rows past the first quarter of 2000 (2000q1). Currently, I'm doing this by getting the index number of the value in the GDP_df["Quarter"] column that equals 2000q1 (see below). This seems way too convoluted and there must be an easier, simpler, more idiomatic way to achieve this. Any ideas?
Current Method:
def get_GDP_df():
    GDP_df = pd.read_excel(
        "gdplev.xls",
        names=["Quarter", "GDP in 2009 dollars"],
        parse_cols="E,G", skiprows=7)
    year_2000 = GDP_df.index[GDP_df["Quarter"] == '2000q1'].tolist()[0]
    GDP_df["Growth"] = (GDP_df["GDP in 2009 dollars"]
                        .pct_change()
                        .apply(lambda x: f"{round((x * 100), 2)}%"))
    GDP_df = GDP_df[year_2000:]
    return GDP_df
Also, after the DataFrame has been sliced, the indices now start at 212. Is there a method to renumber the indices so they start at 0 or 1?
The following is equivalent:
year_2000 = (GDP_df["Quarter"] == '2000q1').idxmax()
GDP_df["Growth"] = (GDP_df["GDP in 2009 dollars"]
.pct_change()
.mul(100)
.round(2)
.apply(lambda x: f"{x}%"))
return GDP_df.loc[year_2000:]
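For the follow-up question about renumbering the indices after slicing, resetting the index is one option; a minimal sketch with the same names as above:
GDP_df = GDP_df.loc[year_2000:].reset_index(drop=True)  # index restarts at 0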
As pointed out in the comments, you can use the awesome query() method, which filters the rows of a DataFrame with a boolean expression. The expression is evaluated by the top-level pandas.eval() function, which evaluates a Python expression passed as a string using various backends.
import pandas as pd
raw_data = {'ID': ['101','101','101','102','102','102','102','103','103','103','103'],
            'Week': ['08-02-2000','09-02-2000','11-02-2000','10-02-2000','09-02-2000','08-02-2000',
                     '07-02-2000','01-02-2000','02-02-2000','03-02-2000','04-02-2000'],
            'Quarter': ['2000q1','2000q2','2000q3','2000q4','2000q1','2000q2','2000q3','2000q4','2000q1','2000q2','2000q3'],
            'GDP in 2000 dollars': [15,15,10,15,15,5,10,10,15,20,11]}

def get_GDP_df():
    GDP_df = pd.DataFrame(raw_data).set_index('ID')
    print(GDP_df)  # for reference, print the indexed data to the screen
    GDP_df = GDP_df.query("Quarter >= '2000q1'").reset_index(drop=True)  # filter with query() and renumber the index
    GDP_df["Growth"] = (GDP_df["GDP in 2000 dollars"]
                        .pct_change()
                        .apply(lambda x: f"{round((x * 100), 2)}%"))
    return GDP_df
get_GDP_df()
In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]
how should I go about this instead?
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer
setup = "import pandas; male_trips=pandas.load('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)
and here is the result:
In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079
In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
Note that, at these speeds, for exploring data value_counts is marginally quicker to type and easier to remember!
I'd do it like Vishal, but using size() instead of sum() to get a count of the number of rows in each group of 'start_station_id'. So:
df = male_trips.groupby('start_station_id').size()
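If you then want those counts lined up against stations, one option (a sketch, assuming the column names from the question; n_male_trips is just an illustrative name) is to map the size Series onto stations['id']:
stations['n_male_trips'] = stations['id'].map(df).fillna(0).astype(int)  # df here is the size Series indexed by start_station_id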
My answer below works in Pandas 0.7.3. Not sure about the new releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straightforward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
male_trips[male_trips.start_station_id.isin(stations.id.values)]
.start_station_id
.value_counts()
)
and this will only give counts for station IDs actually found in stations.id.
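If you need the counts in the same order as stations['id'], as in the original list comprehension, reindexing the result is one way to line them up (a sketch using the names from the question):
mc = count_series.reindex(stations['id']).fillna(0).astype(int).tolist()  # 0 for station ids with no male trips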
male_trips.count()
doesn't work?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
how long would this take:
df = male_trips.groupby('start_station_id').sum()
edit: after seeing in the answer above that isin and value_counts exist (value_counts even comes with its own entry in pandas.core.algorithms, and isin isn't simply np.in1d), I updated the three methods below
male_trips.start_station_id[male_trips.start_station_id.isin(stations.id)].value_counts()
You could also do an inner join on stations.id:
pd.merge(male_trips, stations, left_on='start_station_id', right_on='id') followed by value_counts.
Or:
male_trips.set_index('start_station_id', inplace=True)
stations.set_index('id', inplace=True)
male_trips.loc[male_trips.index.intersection(stations.index)].reset_index().start_station_id.value_counts()
If you have the time I'd be interested how this performs differently with a huge DataFrame.
I have two DataFrames, "A" and "B". Each has two columns, "key1" and "key2", but a unique key is the combination of the two. I want to select from the second DataFrame all rows whose combination of "key1" and "key2" appears in DataFrame "A".
Simple example:
import numpy as np
import pandas as pd

A = pd.DataFrame({'a': list(range(20000))*100,
                  'b': np.repeat(list(range(100)), 20000)})

B = pd.DataFrame({'a': list(range(40000))*100,
                  'b': np.repeat(list(range(100)), 40000),
                  'c': np.random.randint(4000000, size=4000000)})
Solution 1:
%%time
A['marker'] = True
C = B.merge(A, on=['a','b'], how='inner').drop('marker', axis=1)
1.26 s
Solution 2:
%%time
A['marker'] = A['a'].astype(str) + '_' + A['b'].astype(str)
B['marker'] = B['a'].astype(str) + '_' + B['b'].astype(str)
C = B[B.marker.isin(A.marker)]
20.4 s
This works, but is there a more elegant (and fast) solution?
You could try taking a look at pd.MultiIndex and using multi-level indices instead of plain, meaningless integer ones. I'm not sure it would be a lot faster on the real data, but modifying your example data slightly:
index1 = pd.MultiIndex.from_arrays([list(range(20000))*100, np.repeat(list(range(100)), 20000)])  # former A
index2 = pd.MultiIndex.from_arrays([list(range(40000))*100, np.repeat(list(range(100)), 40000)])  # index of B[['a', 'b']]
s = pd.Series(np.random.randint(4000000, size=4000000), index=index2)  # former B['c']
In [93]: %timeit c = s[index1]
1 loops, best of 3: 803 ms per loop
Indexing s with a different index (index1) than its original index (index2) is roughly equivalent to your merge operation.
Usually, operations on the index tend to be faster than those performed on regular DataFrame columns. Either way, you are probably looking at a marginal improvement here; I don't think you can get this done on the microsecond scale.
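Another option, not from the answers above but sketched here against the question's A and B (it assumes a pandas version that has MultiIndex.from_frame, i.e. 0.24+), is to build MultiIndexes from the key columns and filter B with isin, which leaves both frames unchanged:
idx_A = pd.MultiIndex.from_frame(A[['a', 'b']])
idx_B = pd.MultiIndex.from_frame(B[['a', 'b']])
C = B[idx_B.isin(idx_A)]  # rows of B whose (a, b) pair also appears in A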