This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 6 months ago.
I have a df that is 2,000 lines long, and it is taking about 5 minutes to map the 2,000 IDs to their corresponding names, which is way too long.
I'm trying to figure out a way to reduce the mapping time. One option I want to try is mapping the IDs as I build the dictionary, rather than storing more than one entry at a time in the dictionary.
Here is the process I'm using:
df_IDs=
studentID grade age
123 12th 18
432 11th 17
421 11th 16
And I want to replace the 'studentID' column with the student names that I can get from a mapping file (student_directory_df).
I created functions that will make a dictionary and map the IDs to the names:
import numpy as np
import pandas as pd

dicts = {}

def search_tool_student(df, column, accession):
    # find the rows whose ID column contains the given ID
    index = np.where(df[column].str.contains(accession, na=False))
    if len(index[0]) == 0:
        done = ""
    else:
        done = df.iloc[index]
    return done

def create_dict(x):
    result_df = search_tool_student(student_directory_df, 'studentID', x)
    if len(result_df) == 0:
        print('bad match found: ' + x)
    else:
        student_name = result_df['name'].iloc[0]
        dicts[x] = student_name
    return dicts

def map_ID(df_IDs):
    studentIDs = df_IDs['studentID']
    new_dict = list(map(create_dict, studentIDs))[0]
    df_IDs['studentID'] = df_IDs['studentID'].map(new_dict).fillna(df_IDs['studentID'])
    return df_IDs
Desired output:
studentID grade age
sally 12th 18
joe 11th 17
sarah 11th 16
Pandas works much better when you think of DataFrames not as grids of cells to index into, but rather as collections of columns (Series) that you can manipulate. The more work you can offload to a single pandas function, the more likely it is to be optimized and performant.
In this case, you want to use Series.map. Series.map can take a function, a dictionary-like mapping, or a Series.
Since you already have the student names and ids in a dataframe, I would recommend something like the following:
# create a series to do the mapping
id_to_name_map = student_directory_df.set_index('studentID')['name']
# pass it to map all at once
df_IDs['student_name'] = df_IDs['studentID'].map(id_to_name_map)
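If you would rather overwrite the studentID column itself (to match the desired output) while keeping any IDs that have no entry in the directory, the fillna step from your own code can be reused; a small sketch:

# Map IDs to names; where no name is found, fall back to the original ID.
df_IDs['studentID'] = df_IDs['studentID'].map(id_to_name_map).fillna(df_IDs['studentID'])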
Hope that helps!
From a list of names I created (greater_three), I want to find all the names in that list in my DataFrame (new) and append those rows' 'Location' coordinates to a new list. But when I append, I am also picking up an index value.
location = []
for name in new['DBA Name']:
    if name in greater_three:
        location.append(new['Location'])
    else:
        pass
location
My output list (location) should look like this:
[[41.7770923949, -87.6060037796],
[41.7770923949, -87.6060037796],
[41.7770923949, -87.6060037796],
But I am getting it with an Index like this:
[0 (41.777092394888655, -87.60600377956905)
1 (41.78457591499572, -87.6547753761994)
2 (41.74427989606148, -87.5716351762223)
3 (41.69164609748754, -87.6422140544927)
Also, a smaller issue, but I'm curious: it iterates through many times (even after I removed all the duplicate names from the data frame), as shown below; it should only have a length of 26 coordinates (indices 0 through 25):
22 (41.901086765978654, -87.74854019856667)
23 (41.70774046981763, -87.64300283870763)
24 (41.75937734623751, -87.66111539963164)
25 (41.75655095611123, -87.61068980246957)
Name: Location, dtype: object,
0 (41.777092394888655, -87.60600377956905)
1 (41.78457591499572, -87.6547753761994)
2 (41.74427989606148, -87.5716351762223)
...
23 (41.70774046981763, -87.64300283870763)
24 (41.75937734623751, -87.66111539963164)
25 (41.75655095611123, -87.61068980246957)
Name: Location, dtype: object,
0 (41.777092394888655, -87.60600377956905)
1 (41.78457591499572, -87.6547753761994)
2 (41.74427989606148, -87.5716351762223)
3 (41.69164609748754, -87.6422140544927)
My columns look like this. I just need the coordinates in a list; I can take them from either 'Longitude' and 'Latitude' or 'Location'.
[screenshot of the DataFrame columns]
For every name, you're re-appending the whole Location column to your list, rather than just the entries corresponding to each name in your loop. You can fix this using .loc and filtering on where the names match. You should also apply .drop_duplicates to new['DBA Name'] before looping, to avoid appending the same thing several times. Also, your else: pass is not required. Your Location column holds tuples, but you want the output to be lists, so you'll need to convert them. See below:
location = []
for name in new['DBA Name'].drop_duplicates():
    if name in greater_three:
        location.append(list(new.loc[new['DBA Name'] == name, 'Location'].iloc[0]))
location
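As a side note, the loop can be avoided entirely with boolean indexing; a minimal sketch, assuming new and greater_three are defined as in the question:

# Keep only the rows whose DBA Name is in greater_three, then turn each
# Location tuple into a list.
matches = new.loc[new['DBA Name'].isin(greater_three), 'Location']
location = [list(loc) for loc in matches]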
This question already has answers here:
Create new column based on values from other columns / apply a function of multiple columns, row-wise in Pandas
(8 answers)
Closed 2 years ago.
I am trying to build a column that is based on another. The new column should contain the values that meet certain criteria and 0 where the values do not meet the criteria.
For example, a column called bank balance will have negative and positive values; the new column, overdraft, should contain the negative values for the appropriate rows and 0 where the balance is greater than 0.
Bal Ovr
21 0
-34 -34
45 0
-32 -32
The final result should look like that.
Assuming your dataframe is called df, you can use np.where and do:
import numpy as np
df['Ovr'] = np.where(df['Bal'] < 0, df['Bal'], 0)
which will create a column called Ovr with 0 where Bal is positive and the same value as Bal where Bal is negative.
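For a quick check, the same call on the sample balances from the question reproduces the desired output:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Bal': [21, -34, 45, -32]})
df['Ovr'] = np.where(df['Bal'] < 0, df['Bal'], 0)
#    Bal  Ovr
# 0   21    0
# 1  -34  -34
# 2   45    0
# 3  -32  -32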
df["over"] = df.Bal.apply(lambda x: 0 if x>0 else x)
An additional method to enrich your coding skills; however, it isn't needed for such an easy task.
This question already has an answer here:
Replace values in a pandas series via dictionary efficiently
(1 answer)
Closed 4 years ago.
I have a dataframe next_train with weekly data for many players (80,000 players observed through 4 weeks, total of 320,000 observations) and a dictionary players containing a binary variable for some of the players (say 10,000). I want to add this binary variable to the dataframe next_train (if a player is not in the dictionary players, I set the variable equal to zero). This is how I'm doing it:
import pandas as pd

next_train = pd.read_csv()
# ... calculate dictionary 'players' ...
next_train['variable'] = 0
for player in players:
    next_train.loc[next_train['id_of_player'] == player, 'variable'] = players[player]
However, the for loop takes ages to complete, and I don't understand why. The task is just to look up each player value in my dataframe, 10,000 times (the size of the players dictionary), yet the execution time is several minutes. Is there any efficient way to do this task?
You should use map instead of slicing; it will be much faster:
next_train['variable'] = next_train.id_of_player.map(players)
As you want 0 in the other rows, you can then use fillna:
next_train.variable.fillna(0, inplace=True)
Moreover, if your dictionary only contains boolean values, you might want to redefine the dtype of the variable column to take less space. So you end up with this piece of code:
next_train['variable'] = next_train.id_of_player.map(players).fillna(0).astype(int)
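For illustration, a toy sketch with made-up ids and flags (the names follow the question; the data itself is hypothetical):

import pandas as pd

next_train = pd.DataFrame({'id_of_player': [1, 2, 3, 2]})
players = {2: 1, 3: 0}  # hypothetical binary flags for a subset of players

next_train['variable'] = next_train.id_of_player.map(players).fillna(0).astype(int)
# id 1 is not in the dictionary, so it gets 0; ids 2 and 3 get their flags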
Use map and fillna:
next_train['variable'] = next_train['id_of_player'].map(players).fillna(0)
This creates a new column by applying the dictionary to the player ids and then filling all missing values with 0.
My goal is, given a value in a row (let's say 3), to look up the value of a given column 3 rows below. Currently I am performing this with for loops, but it is tremendously inefficient.
I have read that vectorizing can help to solve this problem but I am not sure how.
My data is like this:
Date DaysToReception Quantity QuantityAtTheEnd
20/03 3 102
21/03 - 88
22/03 - 57
23/03 5 178
24/03
And I want to obtain:
Date DaysToReception Quantity QuantityAtReception
20/03 3 102 178
21/03 - 88
22/03 - 57
23/03 5 178
24/03
...
Thanks for your help!
If you have a unique date or DaysToReception, you can use a Map/HashMap where the key is the date or DaysToReception and the value is the rest of the row's information, stored in a list or any other appropriate data structure.
This will definitely improve the efficiency.
As you pointed out, the number of rows you search below depends on the value of "DaysToReception", so I believe "DaysToReception" will not be unique. In that case, the key to your Map should be the date.
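To make the idea concrete, a minimal Python sketch (the column names follow the question; the rows list and its layout are hypothetical):

# Build a lookup keyed by date; each value holds the row's other fields.
rows = [('20/03', 3, 102), ('21/03', None, 88), ('22/03', None, 57), ('23/03', 5, 178)]
by_date = {date: {'DaysToReception': d, 'Quantity': q} for date, d, q in rows}

# O(1) lookup by date instead of scanning row by row
print(by_date['20/03']['Quantity'])  # 102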
The easiest way I can think of to do this in pandas is the following:
import pandas as pd

# something like your dataframe
df = pd.DataFrame(dict(date=['20/03', '21/03', '22/03', '23/03'],
                       days=[3, None, None, 5],
                       quant=[102, 88, 57, 178]))

# get the indexes of all rows where 'days' isn't missing
idxs = df.index[~pd.isnull(df.days)]

# get the number of days to look ahead
values = df.days[idxs].values.astype(int)

# get the index of the row that many days ahead
new_idxs = idxs + values

# create a blank column
df['quant_end'] = None

# now fill it with the data we're after; reindex lets look-ahead rows that
# fall past the end of the frame stay empty instead of raising an error
df.loc[idxs, 'quant_end'] = df['quant'].reindex(new_idxs).values
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes; each frame contains timestamps and the market caps corresponding to those timestamps. The keys of the dictionary are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key 'merged' in the dictionary and, using the pd.merge method, merge the 4 existing dataframes on their timestamp (I want complete rows, so the 'inner' join method is appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin], left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'; however, if data2['merged'] is not initialized to data2['dogecoin'] (or some similar data), then the merge won't work, because there are no existing values in 'merged' to merge onto.
EDIT: my desired result is to create one merged dataframe, stored as a new element of the dictionary data2 (data2['merged']), containing the merged data frames from the other elements in data2.
Try replacing the generalized pd.merge() with the DataFrame's own .merge method on an actual named df, but you must seed 'merged' with a first dataframe and leave it out of the loop:
data2['merged'] = data2['dashcoin']
# LEAVE OUT FIRST ELEMENT
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(data2['merged'], data2[coin], on='timestamp')
Unless I'm misunderstanding, this question isn't specific to dataframes; it's just about how to write a loop when the first element has to be treated differently from the rest.
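If you want to avoid special-casing the first element altogether, functools.reduce expresses the same fold over the list of frames. A minimal sketch, assuming data2 and coins are defined as in the question:

from functools import reduce
import pandas as pd

# Inner-join all four frames on 'timestamp'; only timestamps present in
# every frame survive, which is what the question asks for.
data2['merged'] = reduce(
    lambda left, right: pd.merge(left, right, on='timestamp', how='inner'),
    (data2[coin] for coin in coins),
)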