I am trying to add an incrementing number to the end of a string and apply this to every row in one column of my dataframe. The column has 76 rows. Ideally I want the output to look like this:
Event Reference Number
FEBRUARY_2019_1
FEBRUARY_2019_2
FEBRUARY_2019_3
I am trying to use apply with a lambda function but am stuck trying to figure out the next step.
df_Trade_File['Event Reference Number'].apply(lambda x: enumerate(df_Trade_File['Event Reference Number']))
First, note that apply returns a new Series, so for the change to take effect you need to assign the result back to the column. In this case you don't need apply at all; plain string concatenation is simpler:
df_Trade_File['Event Reference Number'] = df_Trade_File['Event Reference Number'] + '_' + df_Trade_File.index.astype(str)
This assumes the index already holds the numbers 1-76. The default index starts at 0, so with a default index the suffixes would run from 0 to 75.
Try using Series.str.cat
df["Event Reference Number"] = df["Event Reference Number"].str.cat(map(str, df.index), sep='_')
If your index is not a plain 1-76 sequence, replace map(str, df.index) with a range, but convert it to strings first because str.cat only concatenates strings:
map(str, range(1, df.shape[0]+1))
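A minimal sketch of the position-based variant, assuming a hypothetical three-row frame whose column already holds the prefix FEBRUARY_2019. The frame and its odd index are made up for illustration; the point is that the counter comes from row position, not from the index.

```python
import pandas as pd

# Hypothetical sample data standing in for df_Trade_File
df = pd.DataFrame(
    {"Event Reference Number": ["FEBRUARY_2019"] * 3},
    index=[10, 20, 30],  # deliberately not a default index
)

# Build 1-based counters as strings, then join them on with an underscore
counters = [str(n) for n in range(1, len(df) + 1)]
df["Event Reference Number"] = df["Event Reference Number"].str.cat(counters, sep="_")

print(df)
```

Because the counters are a plain list, pandas matches them to rows by position, so the result does not depend on the index labels.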
I'm trying to create a new column that contains all of the assortments (Asst 1 - 50) that a SKU may belong to. A SKU belongs to an assortment if it is indicated by an "x" in the corresponding column.
The script will need to be able to iterate over the rows in the SKU column and check for that 'x' in any of the ASST columns. If it finds one, copy the name of that assortment column into the newly created "all assortments" column.
I have been attempting this using the df.apply method but I cannot seem to get it right.
def assortment_crunch(row):
    if row == 'x':
        ...  # stuck here

df['Asst #1'].apply(assortment_crunch)
My attempt doesn't really account for the need to iterate over all of the "asst" columns, or for how to assign the matching column names to the newly created column.
Here's a fast two-liner (the comparison is vectorized; only the final join loops in Python):
asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
Explanation:
df.filter(like='Asst #') - returns all the columns that contain Asst # in their name
.eq('x') - exactly the same as == 'x', it's just easier for chaining functions like this because of the parentheses mess that would occur otherwise
to_numpy() - converts the boolean mask DataFrame into a 2-D NumPy array, so iterating over it yields one boolean mask per row
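The steps above can be tried on a small frame. This is a sketch with made-up SKUs and only two assortment columns; the column names follow the "Asst #" pattern from the question.

```python
import pandas as pd

# Hypothetical sample data: two assortment columns instead of fifty
df = pd.DataFrame({
    "SKU": ["A1", "B2", "C3"],
    "Asst #1": ["x", "", "x"],
    "Asst #2": ["", "", "x"],
})

asst_cols = df.filter(like="Asst #")      # only the assortment columns
masks = asst_cols.eq("x").to_numpy()      # one boolean row mask per SKU

# For each row, keep the column names where the mask is True
df["All Assortment"] = [", ".join(asst_cols.columns[m]) for m in masks]

print(df)
```

A SKU with no "x" in any assortment column ends up with an empty string.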
I'm not sure if this is the most efficient way, but you can try this.
Instead of applying to the column, apply to the whole DF to get access to the row. Then you can iterate through each column and build up the value for the final column:
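def make_all_assortments_cell(row):
    assortments_in_row = []
    for i in range(1, 51):
        column_name = f'Asst #{i}'
        # row[column_name] is a single cell value, so compare it directly
        if row[column_name] == 'x':
            assortments_in_row.append(column_name)  # keep the column name, not the 'x'
    return ", ".join(assortments_in_row)

# axis=1 passes each row (not each column) to the function
df["All Assortments"] = df.apply(make_all_assortments_cell, axis=1)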
I think this will work though I haven't tested it.
I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
On output each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the below code but don't quite get the right output - every row ends up with the same exact name/score in the new columns.
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = (matched.key[i][0][0])
    matched['MatchedNameScore'] = (matched.key[i][0][1])
matched[['consumer_name_first', 'key','MatchedNameFinal', 'MatchedNameScore']]
First, when going over rows in pandas it is better to use apply:
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
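As for your loop: each assignment inside it writes one value to the entire column, so every row ends up with the values from the last key. Assign to a single row on each iteration instead (this assumes the default 0-based index):
for i, v in enumerate(matched.key):
    matched.loc[i, 'MatchedNameFinal'] = v[0][0]
    matched.loc[i, 'MatchedNameScore'] = v[0][1]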
Generally, you want to avoid explicit Python loops such as enumerate over a pandas column, because built-in pandas operations are usually much faster.
So this solution doesn't iterate with enumerate.
First, turn each single-element list into one tuple per row:
matched.key.explode()
Then use zip with unpacking to split the tuples into two columns (tuples here stands for the exploded Series):
matched['col1'], matched['col2'] = zip(*tuples)
Or do it all in one line:
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
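A small sketch of the explode-and-zip approach, assuming each key cell is a one-element list holding a (name, score) tuple, as the [0][0] indexing in the question suggests. The names and scores are invented.

```python
import pandas as pd

# Hypothetical fuzzymerge-style output: each key is a list with one (name, score) tuple
matched = pd.DataFrame({
    "consumer_name_first": ["Mae", "Jon"],
    "key": [[("May", 0.99)], [("John", 0.95)]],
})

# explode() unwraps the one-element lists; zip(*...) transposes the tuples
names, scores = zip(*matched["key"].explode())
matched["MatchedNameFinal"] = list(names)
matched["MatchedNameScore"] = list(scores)

print(matched[["consumer_name_first", "MatchedNameFinal", "MatchedNameScore"]])
```

This only lines up row for row when every list holds exactly one tuple; a list with several matches would produce more values than there are rows.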
I have a function that aims at printing the sum along a column of a pandas DataFrame after filtering on some rows to be defined ; and the percentage this quantity makes up in the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum/np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would do: df[filter_f1] = df and could be used with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
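A Python slice object, slice(None), is the object form of : and selects every position in an indexable object. So df[slice(None)] selects all rows in the DataFrame. (slice(-1) would not work here: it is the same as [:-1] and drops the last row.) You can store it in a variable as an initial value that your logic may later replace. Note that a slice cannot be combined with boolean masks using &, so replace it rather than combining it:
filter_to_apply = slice(None) # initialize to select all rows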
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
This is a way to select all rows:
df.iloc[range(0, len(df))]
and so is this:
df[:]
A bare : cannot be passed as an argument, but slice(None) is its object equivalent, so df[slice(None)] behaves like df[:].
There's an indexer called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
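For the no-op filter the question asks about, one option not mentioned in the answers above is an all-True boolean Series aligned to the frame's index. It keeps every row and still composes with other masks through &. The sketch below uses a hypothetical summarize function that returns its results instead of printing them, and made-up data.

```python
import pandas as pd

def summarize(df, row_filter, col):
    """Return the filtered sum of `col` and its share of the unfiltered sum."""
    filtered_sum = df.loc[row_filter, col].sum()
    return filtered_sum, filtered_sum / df[col].sum()

# Hypothetical sample data
df = pd.DataFrame({"price": [100, 600, 300], "name": ["Ann", "Brian", "Cy"]})

keep_all = pd.Series(True, index=df.index)  # no-op filter: every row passes
expensive = df["price"] > 250               # an ordinary boolean filter

total, share = summarize(df, keep_all, "price")
subset_total, subset_share = summarize(df, keep_all & expensive, "price")

print(total, share)
print(subset_total, subset_share)
```

Building the mask from df.index keeps it aligned even when the frame does not use a default index.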
I am creating a function. One input of this function will be a pandas DataFrame, and one of its tasks is to do some operation with two variables of this DataFrame. These two variables are not fixed, and I want the freedom to choose them through parameters of the function fun.
For example, suppose at some moment the variables I want to use are 'var1' and 'var2' (but at another time, I may want to use two other variables). Suppose that these variables take the values 1, 2, 3, 4 and I want to reduce df by keeping the rows where var1 == 1 and var2 == 1. My function looks like this:
def fun(df , var = ['input_var1', 'input_var2'] , val):
    df = df.rename(columns={ var[1] : 'aux_var1 ', var[2]:'aux_var2'})
    # Other operations
    df = df.loc[(df.aux_var1 == val ) & (df.aux_var2 == val )]
    # end of operations
    # recover
    df = df.rename(columns={ 'aux_var1': var[1] ,'aux_var2': var[2]})
    return df
When I use the function fun, I have the error
fun(df, var = ['var1','var2'], val = 1)
IndexError: list index out of range
Actually, I want to do other more complex operations and I didn't describe these operations so as not to extend the question. Perhaps the simple example above has a solution that does not need to rename the variables. But maybe this solution doesn't work with the operations I really want to do. So first, I would necessarily like to correct the error when renaming the variables. If you want to give another more elegant solution that doesn't need renaming, I appreciate that too, but I will be very grateful if besides the elegant solution, you offer me the solution about renaming.
Python lists are zero-indexed, i.e. the first element has index 0.
Just change the lines:
df = df.rename(columns={ var[1] : 'aux_var1 ', var[2]:'aux_var2'})
df = df.rename(columns={ 'aux_var1': var[1] ,'aux_var2': var[2]})
to
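df = df.rename(columns={ var[0] : 'aux_var1', var[1]:'aux_var2'})
df = df.rename(columns={ 'aux_var1': var[0] ,'aux_var2': var[1]})
respectively. Note that the trailing space in 'aux_var1 ' is also removed here; with it, the column would be named 'aux_var1 ' and df.aux_var1 would fail.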
In this case you are accessing var[2] but a 2-element list in Python has elements 0 and 1. Element 2 does not exist and therefore accessing it is out of range.
As it has been mentioned in other answers, the error you are receiving is due to the 0-indexing of Python lists, i.e. if you wish to access the first element of the list var, you do that by taking the 0 index instead of 1 index: var[0].
On the topic of renaming: you can filter a pandas DataFrame without renaming any columns. You are accessing the columns as attributes of the DataFrame, but the same columns are available through the __getitem__ method, which is normally written with square brackets, e.g. df[var[0]].
If you wish to have more generality over your function without any renaming happening, I can suggest this:
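from functools import reduce

import pandas as pd

def fun(df, var, val):
    # Start from an all-True mask aligned to df's index, then AND in one
    # equality test per column name in var.
    _sub = reduce(
        lambda x, y: x & (df[y] == val),
        var,
        pd.Series(True, index=df.index)
    )
    return df[_sub]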
This will work with any number of input column variables. Hope this will serve as an inspiration to your more complicated operations you intend to do.
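As an alternative to reduce (a different technique, not the one in the answer above), the same multi-column equality filter can be sketched by comparing a sub-frame to the value and requiring every column to match. The function name filter_equal and the sample data are hypothetical.

```python
import pandas as pd

def filter_equal(df, columns, value):
    """Keep the rows where every column in `columns` equals `value`."""
    mask = (df[columns] == value).all(axis=1)
    return df[mask]

# Hypothetical sample data with a non-default index
df = pd.DataFrame(
    {"var1": [1, 1, 2], "var2": [1, 3, 1], "other": [7, 8, 9]},
    index=["a", "b", "c"],
)

result = filter_equal(df, ["var1", "var2"], 1)
print(result)
```

No renaming is needed, and the list of columns can be any length.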
Problem Overview:
I am attempting to clean stock data loaded from CSV file into Pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, PANDAS ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The PANDAS documentation suggests using the .replace() method, but that doesn't seem to be working with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, and strings are immutable: j.replace() returns a new string, and assigning it to j only rebinds the loop variable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same on both conditions):
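aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        # write through the DataFrame so the change reaches the original data
        all_exchanges.loc[i, 'MarketCapSym'] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset (this assumes the default 0-based index).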
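The point about the loop variable can be checked with a small experiment. This sketch uses invented market-cap strings and shows that reassigning the variable inside a zip loop leaves the DataFrame untouched.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "MarketCap": ["$5M", "$2B"],
    "MarketCapSym": ["n/a", "n/a"],
})

for cap, sym in zip(df["MarketCap"], df["MarketCapSym"]):
    if "M" in cap or "B" in cap:
        sym = sym.replace("n/a", "M")  # rebinds the local name only

# The column still holds the original values
unchanged = df["MarketCapSym"].tolist()
print(unchanged)
```

Any fix therefore has to assign back into the DataFrame itself, for example through loc.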
If both columns are in the same DataFrame, all_exchanges, iterate over the rows with iterrows() (iterating over the DataFrame itself yields column names, not rows):
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index label you can set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
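Here is a quite exhaustive tutorial on missing values in pandas. I suggest using fillna(), which works if the 'n/a' entries were read in as real missing values (NaN) rather than as strings:
df['MarketCap'] = df['MarketCap'].fillna('M')
df['MarketCapSym'] = df['MarketCapSym'].fillna('M')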
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replace as follows.
mask = all_exchanges['MarketCap'].str.contains('M|B')
# overwrites any data in the MarketCapSym column
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'
# alternatively, only replaces 'n/a'
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace('n/a', 'M')
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line (it needs import re):
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Take the values for the new column from 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() to each element, stripping out the characters in [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with PANDAS. Let me know what you think!
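For comparison, the same character stripping can be sketched with the pandas string accessor instead of a list comprehension. This assumes the MarketCap values are strings such as "$1.2B"; the sample values are invented.

```python
import pandas as pd

# Hypothetical sample data
all_exchanges = pd.DataFrame({"MarketCap": ["$1.2B", "$350.5M", "$75M"]})

# Remove dollar signs, dots and digits, leaving only the unit letter
all_exchanges["MarketCapSymbol"] = all_exchanges["MarketCap"].str.replace(
    r"[$.0-9]", "", regex=True
)

print(all_exchanges)
```

Inside a character class, $ and . are literal characters, so no escaping is needed. Unlike re.sub in a comprehension, the string accessor passes missing values through as NaN instead of raising.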