Python Pandas: Efficiently assign values to a slice [duplicate] - python

This question already has an answer here:
Replace values in a pandas series via dictionary efficiently
(1 answer)
Closed 4 years ago.
I have a dataframe next_train with weekly data for many players (80,000 players observed through 4 weeks, total of 320,000 observations) and a dictionary players containing a binary variable for some of the players (say 10,000). I want to add this binary variable to the dataframe next_train (if a player is not in the dictionary players, I set the variable equal to zero). This is how I'm doing it:
next_train = pd.read_csv()
# ... calculate dictionary 'players' ...
next_train['variable'] = 0
for player in players:
    next_train.loc[next_train['id_of_player'] == player, 'variable'] = players[player]
However, the for loop takes ages to complete, and I don't understand why. It looks like the task is to perform a binary search for the value player in my dataframe 10,000 times (the size of the players dictionary), yet the execution time is several minutes. Is there an efficient way to do this?

You should use map instead of slicing; it will be much faster:
next_train['variable'] = next_train['id_of_player'].map(players)
As you want 0 in the other rows, you can then use fillna (assigning the result back rather than using inplace=True on a column, which is not reliable):
next_train['variable'] = next_train['variable'].fillna(0)
Moreover, if your dictionary only contains boolean values, you might want to cast the variable column back to an integer type, since map leaves it as float because of the NaN values (a narrower type such as 'int8' also takes less space). So you end up with this piece of code:
next_train['variable'] = next_train['id_of_player'].map(players).fillna(0).astype(int)
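As for why the loop is slow: each iteration builds a boolean mask by comparing the whole id_of_player column against one id, so 10,000 players times 320,000 rows is roughly 3.2 billion comparisons, not a binary search. map does one dictionary lookup per row instead. A minimal sketch on made-up data (the frame contents and the players dictionary below are assumptions for illustration, not the asker's data):

```python
import pandas as pd

# Hypothetical miniature of next_train: 3 players observed over 2 weeks.
next_train = pd.DataFrame({
    "id_of_player": [1, 2, 3, 1, 2, 3],
    "week": [1, 1, 1, 2, 2, 2],
})

# Hypothetical dictionary: player 2 is deliberately missing.
players = {1: 1, 3: 0}

# One pass over the column: look up each id, fill missing ids with 0,
# then cast from float (caused by NaN) to a compact integer type.
next_train["variable"] = (
    next_train["id_of_player"].map(players).fillna(0).astype("int8")
)

print(next_train)
```

Player 2 is not in the dictionary, so its rows end up with 0.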

Use map and fillna:
next_train['variable'] = next_train['id_of_player'].map(players).fillna(0)
This creates a new column by applying the dictionary on the player ids and then fills all empty values with 0.

Related

PySpark select top N Rows from each group

I want to choose a N rows randomly for each category of a column in a data frame. Let's say the column is the 'color' and N is 5. Then I'd want to choose 5 items for each of the colors.
The usual way of doing this is something like this
from pyspark.sql.window import Window
from pyspark.sql.functions import rand, row_number
# Column names and group size (the key and num names are illustrative)
key, num, color, N = "key", "num", "color", 5
df = (
    # Define a random key that can be used to sort by
    df.select("*", rand().alias(key))
    # Sort the rows within each color by the key
    # Simultaneously enumerate the sorted rows
    .withColumn(num, row_number().over(Window.partitionBy(color).orderBy(key)))
    # Choose only N items for each category
    .where(f"{num} <= {N}")
    # Drop key column
    .drop(key)
)
But orderBy blows up with an out of memory error on large dataframes. I'm considering using sort to work around this. Context: 'orderBy' runs on a single executor and guarantees total order while sort uses several partitions. I'm ok with the approximate nature of sort as I'm using this to select random subsets anyway.
I can't just replace orderBy as sort can't be used with row_number in a window as above.
Any pointers appreciated.
References:
Code snippet from https://sparkbyexamples.com/pyspark/pyspark-retrieve-top-n-from-each-group-of-dataframe/
Comparison between orderBy and sort from https://towardsdatascience.com/sort-vs-orderby-in-spark-8a912475390
You want to use what they call a 'salt' to redistribute the data into smaller chunks. Here I split each colour into 8 sub-partitions with floor(key*8) before the random sort; 8 is just a guess that it will work for you, and you can increase it if needed. After keeping the top N of each chunk, you can re-window as you do today, without the salt.
from pyspark.sql.functions import col, floor, rand, row_number
df = (
    # Define a random key that can be used to sort by and salt by
    df.select("*", rand().alias(key))
    # Sort the rows within each (color, salt) chunk by the key and enumerate them;
    # the salt floor(key * 8) divides each color into 8 smaller chunks
    .withColumn(num, row_number().over(Window.partitionBy(color, floor(col(key) * 8)).orderBy(key)))
    # Keep at most N items per chunk
    .where(f"{num} <= {N}")
    .drop(num)
    # Re-window without the salt on the much smaller remaining data
    .withColumn(num, row_number().over(Window.partitionBy(color).orderBy(key)))
    # Choose only N items for each category
    .where(f"{num} <= {N}")
    # Drop key column
    .drop(key)
)
I do think you should look into df.sample as it's made to do this type of thing but if you like your logic as is this will work for you.

How to arrange a loop in order to loop over columns and then do something

I'm a complete newbie to python, and I'm currently trying to work on a problem that allows me to take the average of each column except the number of columns is unknown.
I figured how to do it if I knew how many columns it is and to do each calculation separate. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np
#average of all data not including NAN
def average (dataset):
    return np.mean (dataset [np.isfinite (dataset)])
#this is how I did it by each column separate
dataset = np.genfromtxt("some file")
print (average(dataset [:,0]))
print (average(dataset [:,1]))
#what I'm trying to do with a loop
def avg (dataset):
    for column in dataset:
        lst = []
        column = #i'm not sure how to define how many columns I have
        Avg = average (column)
    return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis argument indicates whether you are taking the average along columns or rows (axis=0 means you take the average of each column, which is what you are trying to do). The output is a vector with one element per column (or per row for axis=1), each element being the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this. Note that np.mean returns NaN for any column that contains a NaN; np.nanmean(my_data, axis=0) skips them.
You CAN do this using a for loop, but it's not a good idea: looping over matrices in numpy is slow, whereas vectorized operations like np.mean() are very fast. So in general, when using numpy, one tries to use those built-in operations instead of looping over everything, at least where possible.
Also, if you want the number of columns in your matrix:
my_matrix.shape[1] is the number of columns;
my_matrix.shape[0] is the number of rows.
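A small sketch of both approaches, assuming a made-up array in place of the file loaded with np.genfromtxt. It uses np.nanmean so that NaN entries are skipped, as in the question's average helper, and shows the loop version with shape[1] for comparison:

```python
import numpy as np

# Hypothetical data standing in for np.genfromtxt("some file").
dataset = np.array([
    [1.0, 10.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 30.0, 9.0],
])

# Vectorized: one mean per column, ignoring NaN entries.
col_means = np.nanmean(dataset, axis=0)

# Loop version: shape[1] gives the number of columns.
def average(values):
    # Mean of the finite entries only (drops NaN and infinities).
    return np.mean(values[np.isfinite(values)])

loop_means = []
for i in range(dataset.shape[1]):
    loop_means.append(average(dataset[:, i]))

print(col_means)
print(loop_means)
```

Both give the same per-column averages here; the vectorized call is the one to prefer on large arrays.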

Iterate through and overwrite specific values in a pandas dataframe

I have a large dataframe collating a bunch of basketball data (screenshot below). Every column to the right of Opp Lineup is a dummy variable indicating if that player (indicated in the column name) is in the current lineup (the last part of the column name is team name, which needs to be compared to the opponent column to make sure two players with the same number and name on different teams don't mess it up). I know several ways of iterating through a pandas dataframe (iterrows, itertuples, iteritems), but I don't know the way to accomplish what I need to, which is for each line in each column:
Compare the team (columnname.split()[2:]) to the Opponent column (except for LSU players)
See if the name (columnname.split()[:2]) is in Opp Lineup or, for LSU players, lineup
If the above conditions are satisfied, replace that value with 1, otherwise leave it as 0
What is the best method for looping through the dataframe and accomplishing this task? Speed doesn't really matter in this instance. I understand all of the logic involved, except I'm not familiar enough with pandas to know how to loop through it, and trying various things I've seen on Google isn't working.
Consider a reshape/pivot solution as your data is in wide format but you need to compare values row-wise in long format. So, first melt your data so all column headers become an actual column 'Player' and its corresponding value to 'IsInLineup'. Run your conditional comparison for dummy values, and then pivot back to original structure with players across column headers. Of course, I do not have actual data to test this example fully.
# MELT
reshapedf = pd.melt(df, id_vars=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
                                 'Plus Minus Per Minute', 'Opp Lineup'],
                    var_name='Player', value_name='IsInLineup')
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
reshapedf['IsInLineup'] = reshapedf.apply(lambda row: (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
                                                       ' '.join(row['Player'].split(' ')[2:]) in row['Opponent'])*1, axis=1)
# PIVOT (UNMELT)
df2 = reshapedf.pivot_table(index=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
                                   'Plus Minus Per Minute', 'Opp Lineup'], columns='Player').reset_index()
df2.columns = df2.columns.droplevel(0).rename(None)
df2.columns = df.columns
If the above lambda function looks a little complex, try the equivalent named function with apply:
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
def f(row):
    if (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
            ' '.join(row['Player'].split(' ')[2:]) in row['Opponent']):
        return 1
    else:
        return 0
reshapedf['IsInLineup'] = reshapedf.apply(f, axis=1)
I ended up using a workaround. I iterated through the rows using df.iterrows, checked each one for the value I wanted, and appended a 0 or 1 to a temporary list. Then I simply inserted that list into the dataframe. Possibly not the most memory-efficient approach, but it worked.
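That workaround might look roughly like the sketch below. The data, the column names, and the lineup string format are all assumptions made up for illustration, and the special case for LSU players (checking the Lineup column instead) is left out:

```python
import pandas as pd

# Hypothetical miniature of the data; real columns and values will differ.
df = pd.DataFrame({
    "Opponent": ["Duke", "Duke", "Kentucky"],
    "Opp Lineup": ["1 Smith, 5 Jones", "5 Jones, 9 Lee", "1 Smith, 3 Brown"],
    "1 Smith Duke": [0, 0, 0],
    "5 Jones Duke": [0, 0, 0],
    "1 Smith Kentucky": [0, 0, 0],
})

# Every column other than the descriptive ones is a player dummy.
player_cols = [c for c in df.columns if c not in ("Opponent", "Opp Lineup")]

for col in player_cols:
    parts = col.split()
    name = " ".join(parts[:2])   # number and name, e.g. "1 Smith"
    team = " ".join(parts[2:])   # remainder is the team, e.g. "Duke"
    flags = []                   # temporary list, one entry per row
    for _, row in df.iterrows():
        in_lineup = name in row["Opp Lineup"] and team == row["Opponent"]
        flags.append(int(in_lineup))
    df[col] = flags              # overwrite the dummy column

print(df)
```

The substring test is a simplification: "1 Smith" would also match "11 Smith", so real lineup strings may need to be split into individual entries first.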

Merging dataframes together in a for loop [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes, each frame contains timestamps and market caps corresponding to the timestamps, the keys of which are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key 'merged' in the dictionary and use the pd.merge method to merge the 4 existing dataframes on their timestamp (I want complete rows, so an 'inner' join is appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'],right=data2[coin], left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'. However, if data2['merged'] is not first set to data2['dogecoin'] (or some similar data), the merge function won't work, because 'merged' does not exist yet.
EDIT: my desired result is to create one merged dataframe, stored as a new element in the dictionary 'data2' (data2['merged']), containing the merged data frames from the other elements in data2
Try replacing the generalized pd.merge() with the merge method of the dataframe itself. You must start from an existing dataframe, so seed 'merged' with the first one and loop over the rest:
data2['merged'] = data2['dashcoin']
# LEAVE OUT FIRST ELEMENT
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes, it's just about how to write a loop when the first element has to be treated differently to the rest.
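Another way to avoid treating the first element specially is functools.reduce, which folds the merge over the list. This is a sketch on made-up frames (the values below are assumptions, and one row is dropped from 'nxt' only to show the effect of the inner join):

```python
from functools import reduce

import pandas as pd

coins = ["dashcoin", "litecoin", "dogecoin", "nxt"]

# Hypothetical frames: one timestamp column plus one "<coin>_cap" column each.
data2 = {
    coin: pd.DataFrame({
        "timestamp": pd.to_datetime(["2013-12-04", "2013-12-05", "2013-12-06"]),
        f"{coin}_cap": [100 * (i + 1), 110 * (i + 1), 120 * (i + 1)],
    })
    for i, coin in enumerate(coins)
}

# Drop the first row of one frame so the inner join visibly removes that date.
data2["nxt"] = data2["nxt"].iloc[1:]

# Merge the frames pairwise, left to right, on the shared timestamp column.
data2["merged"] = reduce(
    lambda left, right: left.merge(right, on="timestamp", how="inner"),
    (data2[coin] for coin in coins),
)

print(data2["merged"])
```

Only timestamps present in all four frames survive, and each coin's column appears exactly once.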

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has r rows and c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of certain element (and std. dev, variance, etc.) and then determine if the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
  SEQ VAL   KEY
0   1  28  keyA
1   2  15  keyB
2   3  13  keyC
3   4  11  keyD
4   1  12  keyA
5   2  23  keyB
6   3  21  keyC
7   4  15  keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets by:
loop_iter = len(A) // max(A['SEQ_NUM'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to split explicitly on the keys, right? So perhaps I have TWO questions: 1). How to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2). How to match a DF and do summary statistics, etc., on a DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its most simple is that you have a pd.Series of values (i.e. a["key"], which let's just call keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where the b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.
    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass
# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values (those of b["key"]) always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, then summary is a DataFrame with the returned Series as its rows; if the applied function returns a DataFrame, then the result is a multiindex DataFrame. This is a core pattern in Pandas, and there's a whole, whole lot to explore here.
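For the "in range" check described in the question, one possible sketch is to aggregate B per key and merge the statistics back onto A, so every row of A is compared against the range for its own key. The frames and the lowercase column names below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical frames in the question's format (key may repeat in both).
a = pd.DataFrame({"key": ["keyA", "keyB", "keyA"], "val": [20, 30, 5]})
b = pd.DataFrame({
    "key": ["keyA", "keyB", "keyC", "keyA", "keyB", "keyC"],
    "val": [28, 15, 13, 12, 23, 21],
})

# One row of summary statistics per key in B.
stats = b.groupby("key")["val"].agg(["min", "max", "std"]).reset_index()

# A left merge keeps every row of A, in order, and attaches its key's stats.
checked = a.merge(stats, on="key", how="left")

# Flag rows of A whose value lies within [min, max] of the matching B rows.
checked["in_range"] = checked["val"].between(checked["min"], checked["max"])

print(checked)
```

Rows of A that share a key are checked against the same statistics, which are computed only once per key.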
