Say we have two dataframes, with one being a slice of the other.
If I'm iterating over the smaller DF, how do I find the index in the bigger DF, then find the row it is on?
So it would be something like this:
for idx in smaller.index:
    loc = bigger.ix[idx].row_location???   # <-- what goes here to get the row position?
    fin = 0
    while not fin:
        looking_for_something = bigger.iloc[loc]
        if looking_for_something != criteria:
            loc += 1
        else:
            fin = 1
I'm sure it is something simple but I can't seem to locate the way to do this.
If smaller is a slice of bigger, isn't all the information you are looking for in bigger already available in smaller?
If not, perhaps there are some columns in bigger not present in smaller. (Perhaps smaller should have been defined to include those columns?) In any case, you could use pd.merge or smaller.join(bigger, how='inner', ...) to match up the rows in bigger with the rows in smaller that share the same index. This will accomplish in one fell swoop all the matches you are looking for with
for idx in smaller.index:
    loc = bigger.ix[idx].row_location???
Moreover, it will be quicker. In general, performing operations row-by-row will not be the fastest way to achieve results. It is better to think in terms of joins or merges or groupbys or some such operation that works on whole arrays at a time.
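A minimal sketch of the join approach (extra_col is a made-up name standing in for whatever columns exist only in bigger):

import pandas as pd

# Inner join on the shared index pulls bigger's extra columns onto smaller's rows.
matched = smaller.join(bigger[['extra_col']], how='inner')

# Equivalent with pd.merge, matching on the index of both frames:
matched = pd.merge(smaller, bigger[['extra_col']],
                   left_index=True, right_index=True, how='inner')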
I'm sorting a dataframe containing stock capitalisations from largest to smallest row-wise (I will compute the ratio of top 10 stocks vs the whole market as a proxy for concentration).
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='last').to_numpy(), index=stock_mv.columns)
stock_mv = stock_mv.apply(f, axis=1)
When I do this, however, the column names (tickers) no longer make sense. I read somewhere that you shouldn't delete column names or have them set to the same thing.
What is the best practice thing to do in this situation?
Thank you very much - I am very much a novice coder.
If I understand your problem right, you want to sort the dataframe row-wise. If that's the case, try this:
stock_mv = stock_mv.sort_values(axis=1, ascending=False)
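If the end goal is the top-10 concentration ratio, here is a minimal sketch that sidesteps the column-name question entirely: sort the underlying array so the columns become rank positions rather than tickers (this assumes stock_mv has one row per date and one column per ticker).

import numpy as np
import pandas as pd

# Sort each row descending; negating twice keeps NaNs at the end of every row.
ranked = -np.sort(-stock_mv.to_numpy(dtype=float), axis=1)

# Columns are now rank positions (1 = largest cap on that date), not tickers.
ranked_df = pd.DataFrame(ranked, index=stock_mv.index,
                         columns=range(1, ranked.shape[1] + 1))

# Share of total market cap held by the 10 largest stocks on each date.
top10_share = ranked_df.iloc[:, :10].sum(axis=1) / np.nansum(ranked, axis=1)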
first I'm a newbie so if there's a simpler way to do this, I'm all ears.
I have some relatively simple code to find duplicates, then remove them. I'm not sure what I'm doing wrong. Basically I create a series from .duplicated. Then I'm running a for loop against the data frame to remove the duplicates. I know I have dups (193 of them), but nothing is getting removed. I start with 1893 rows and still have 1893 at the end. Here's what I have so far.
# drop the rows, starting w/ creating a boolean of where dups are
ms_clnd_bool = ms_clnd_study.duplicated()
print(ms_clnd_bool)  # look at what I have

x = 0
for row in ms_clnd_bool:  # for loop through the duplicates series
    if ms_clnd_bool[x] == True:
        ms_clnd_study.drop(ms_clnd_study.index[x])
    x += 1

ms_clnd_study
Thanks for the help!
Pandas has a drop_duplicates method (see the documentation). It does exactly what you are aiming for, and you can decide which of the duplicates to keep (first, last, or none).
As a general tip: it's not common to use loops to scan through your whole dataframe in pandas. For something as common as dropping duplicates, you're better off looking for an existing solution than writing one on your own.
As for your code, drop returns a new dataframe, so nothing changes unless you assign the result back or pass inplace=True. Also notice that removing rows while looping over them can be dangerous: think what happens if you have the list [1, 2, 3], remove the 2, and keep looping; you'll get an index out of bounds. Maybe it won't happen in pandas, but it's a source of trouble.
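A minimal sketch of the drop_duplicates approach, assuming the goal is simply to remove exact duplicate rows:

# Keep the first occurrence of each duplicated row (keep='last' or keep=False also work).
ms_clnd_study = ms_clnd_study.drop_duplicates(keep='first')

# Or modify the frame in place instead of reassigning:
# ms_clnd_study.drop_duplicates(keep='first', inplace=True)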
Editing a large dataframe in python. How do you drop entire rows in the dataframe if a specific column's row has the value 0.0?
When I drop the 0.0s in the overall satisfaction column the edits are not displayed in my scatterplot matrix of the large dataframe.
I have tried:
filtered_df = filtered_df.drop([('overall_satisfaction'==0)], axis=0)
also tried replacing 0.0 with nulls & dropping the nulls:
filtered_df = filtered_df.['overall_satisfaction'].replace(0.0, np.nan), axis=0)
filtered_df = filtered_df[filtered_NZ_df['overall_satisfaction'].notnull()]
What concept am I missing? Thanks :)
So it seems like your values are small enough to be represented as zeros, but are not actually zeros. This usually happens when calculations result in vanishing gradients (really small numbers that approach zero, but are not quite zero), so equality comparisons do not give you the result you're looking for.
In cases like this, numpy has a handy function called isclose that lets you test whether a number is close enough to another number within a certain tolerance.
In your case, doing
df = df[~np.isclose(df['overall_satisfaction'], 0)]
seems to work.
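np.isclose also takes rtol and atol arguments, so the tolerance can be tightened or loosened if needed; a small usage sketch:

import numpy as np

# Treat anything within 1e-6 of zero as zero, then keep only the remaining rows.
near_zero = np.isclose(filtered_df['overall_satisfaction'], 0, atol=1e-6)
filtered_df = filtered_df[~near_zero]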
I have two dataframes with different lengths (df, df1). They share one label, "collo_number". I want to search the second dataframe for every collo_number in the first dataframe. The problem is that the second dataframe contains multiple rows, for different dates, for every collo_number. So I want to sum those rows and add the result as a new column in the first dataframe.
I currently use a loop, but it is rather slow and has to perform this operation for all 7 days in a week. Is there a way to get better performance? I tried multiple solutions but keep getting the error that I cannot use the equal sign for two dataframes with different lengths. Help would really be appreciated! Here is an example of what works, but with rather bad performance:
df5=[df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie=="N"), "aantal_colli"].sum() for nasa in df.collonr]
Your description is a bit vague (hence my comment). First, what you could do is select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
# keep only the collo numbers that occur in df (missing ones become NaN)
newseries = newseries.reindex(df.collonr.unique())
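To then add the result as a new column on the first dataframe, a sketch (the column name aantal_colli_sum is made up here):

# Map each row's collo number to its summed count; numbers with no match become 0.
df['aantal_colli_sum'] = df['collonr'].map(newseries).fillna(0)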
I am trying to do some analysis on baseball pitch F/x data. All the pitch data is stored in a pandas dataframe with columns like 'Pitch speed' and 'X location.' I have a wrapper function (using pandas.query) that, for a given pitch, will find other pitches with similar speed and location. This function returns a pandas dataframe of unknown size. I would like to use this function over large numbers of pitches; for example, to find all pitches similar to those thrown in a single game. I have a function that does this correctly, but it is quite slow (probably because it is constantly resizing resampled_pitches):
def get_pitches_from_templates(template_pitches, all_pitches):
    resampled_pitches = pd.DataFrame(columns=all_pitches.columns.values.tolist())
    for i, row in template_pitches.iterrows():
        resampled_pitches = resampled_pitches.append(get_pitches_from_template(row, all_pitches))
    return resampled_pitches
I have tried to rewrite the function using pandas.apply on each row, or by creating a list of dataframes and then merging, but can't quite get the syntax right.
What would be the fastest way to this type of sampling and merging?
It sounds like you should use pd.concat for this:
def get_pitches_from_templates(template_pitches, all_pitches):
    res = []
    for i, row in template_pitches.iterrows():
        # collect each result frame, then concatenate once at the end
        res.append(get_pitches_from_template(row, all_pitches))
    return pd.concat(res)
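pd.concat also accepts an iterable of DataFrames, so the same idea can be written more compactly (a sketch, relying on get_pitches_from_template returning a DataFrame as described):

resampled_pitches = pd.concat(
    get_pitches_from_template(row, all_pitches)
    for _, row in template_pitches.iterrows()
)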
I think that a merge might be even faster. Using df.iterrows() isn't recommended, as it creates a new Series for every row.