Apply if else condition in specific pandas column by location - python

I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
import pandas as pd

data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from the PopDF['Pop2'] cells 0 through remainder.
#The remaining cells in the column should stay as is (retain the original Pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line subtracts 1 in the correct locations; however, the remaining cells become NaN. The second line of code does not work – the error is:
ValueError: Length of values (1) does not match length of index (8)

Your first line assigns a 6-row Series back to an 8-row column, so the index labels missing from the right-hand side are filled with NaN. Instead of selecting the first rows and subtracting from them, subtract 1 from the entire column and assign only the leading slice back (note that .loc slicing is label-based and inclusive, so :remainder covers rows 0 through 6):
PopDF.loc[:remainder, 'Pop2'] = PopDF['Pop2'] - 1
Output:
>>> PopDF
Pop Pop2
0 728375 728374
1 733355 733354
2 695395 695394
3 734658 734657
4 732811 732810
5 789396 789395
6 727761 727760
7 751967 751967
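If you prefer a purely positional mask, here is a minimal alternative sketch using numpy.where (my own variant, not from the original answer):
import numpy as np
import pandas as pd

data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6

# Positions 0..remainder get Pop2 - 1; every later row keeps its original value.
PopDF['Pop2'] = np.where(np.arange(len(PopDF)) <= remainder,
                         PopDF['Pop2'] - 1,
                         PopDF['Pop2'])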

Related

Splitting row values and counting uniques from a DataFrame

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a DataFrame that contains the single Reference code plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the last two digits from the list in each row.
All I want now is to keep element [0] of the list in each row, if that makes sense; then I could just run value_counts on the 'Reference' column.
There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value, this should do:
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
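For instance, running the first one-liner on a small frame built from the first rows of the sample (my own quick check, assuming the column is named Reference as in the question):
import pandas as pd

df = pd.DataFrame({'Reference': ['ABS052', 'ABS052/01', 'ABS052/02', 'ADA010/00']})
# Rows without a "/" get None in column 1 but still contribute to the prefix count.
print(df.Reference.str.split("/", expand=True)[0].value_counts())
# ABS052    3
# ADA010    1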
You can just use regex to replace the last two digits like this:
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
df = df['a'].str.replace(r'\/\d+$', '', regex=True).value_counts().reset_index()  # regex=True is required in pandas >= 2.0
Output:
>>> df
    index  a
0  ADD005  6
1  ABS052  3
2  ADA010  1
You are almost there: you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
        1
0
ABS052  3
ADA010  1
ADD005  6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 that have no "/…" suffix, i.e. that would otherwise have None in the second column.
To output a df with proper column names:
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
Output:
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.
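Put together as a self-contained check (sample rows from the question; note the counts follow the actual data rather than the expected table above):
import pandas as pd

df = pd.DataFrame({'Reference': ['ABS052', 'ABS052/01', 'ABS052/02', 'ADA010/00',
                                 'ADD005', 'ADD005/01', 'ADD005/02', 'ADD005/03',
                                 'ADD005/04', 'ADD005/05']})
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
print(df_counts)
#   Reference  Counts
# 0    ADD005       6
# 1    ABS052       3
# 2    ADA010       1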

Wrangling shifted DataFrame with Pandas

In the following pandas DataFrame, the first two columns (Remessas_A and Remessas_A_1d) were given, and I had to find the third (previsao) following the pattern described below. Note that I am not counting the DataEntrega column, which is the datetime index.
DataEntrega,Remessas_A,Remessas_A_1d,previsao
2020-07-25,696.0,,
2020-07-26,0.0,,
2020-07-27,518.0,,
2020-07-28,629.0,,
2020-07-29,699.0,,
2020-07-30,660.0,,
2020-07-31,712.0,,
2020-08-01,2.0,-672.348684948797,23.651315051203028
2020-08-02,0.0,-504.2138715410994,-504.2138715410994
2020-08-03,4.0,-91.10009092298037,426.89990907701963
2020-08-04,327.0,194.46620611760167,823.4662061176017
2020-08-05,442.0,220.65451760630847,919.6545176063084
2020-08-06,474.0,-886.140302693952,-226.14030269395198
2020-08-07,506.0,-61.28132269808316,650.7186773019168
2020-08-08,11.0,207.12286256242962,230.77417761363265
2020-08-09,2.0,109.36137834671834,-394.85249319438105
2020-08-10,388.0,146.2428764085755,573.1427854855951
2020-08-11,523.0,-193.02046115081606,630.4457449667857
2020-08-12,509.0,-358.59415822684485,561.0603593794635
2020-08-13,624.0,966.9258406162757,740.7855379223237
2020-08-14,560.0,175.8273195122506,826.5459968141674
2020-08-15,70.0,19.337299248463978,250.11147686209662
2020-08-16,3.0,83.09413535361391,-311.75835784076713
2020-08-17,401.0,-84.67345026550751,488.4693352200876
2020-08-18,526.0,158.53310638454195,788.9788513513276
2020-08-19,580.0,285.99137337700336,847.0517327564669
2020-08-20,624.0,-480.93226226400344,259.85327565832023
2020-08-21,603.0,-194.68412031046182,631.8618765037056
2020-08-22,45.0,-39.23172496101115,210.87975190108546
2020-08-23,2.0,-115.26376570266325,-427.0221235434304
2020-08-24,463.0,10.04635376084557,498.5156889809332
2020-08-25,496.0,-32.44638720124206,756.5324641500856
2020-08-26,600.0,-198.6715680014182,648.3801647550487
2020-08-27,663.0,210.40991269713578,470.263188355456
2020-08-28,628.0,40.32391720053602,672.1857937042416
2020-08-29,380.0,-2.4418918145294626,208.437860086556
2020-08-30,0.0,152.66166068424076,-274.3604628591896
2020-08-31,407.0,18.499558564880928,517.0152475458141
The first 7 values of Remessas_A_1d and previsao are null and will remain null.
To obtain the first 7 non-null values of previsao, from 2020-08-01 to 2020-08-07, I shifted Remessas_A 7 days ahead and added the shifted Remessas_A to the original Remessas_A_1d:
#res is the name of the dataframe
res['previsao'].loc['2020-08-01':'2020-08-07'] = res['Remessas_A'].shift(7).loc['2020-08-01':'2020-08-07'].add(res['Remessas_A_1d'].loc['2020-08-01':'2020-08-07'])
To find the next 7 values of previsao, from 2020-08-08 to 2020-08-14, I shifted the previsao column 7 days ahead and added the shifted previsao to the original Remessas_A_1d:
res['previsao'].loc['2020-08-08':'2020-08-14'] = res['previsao'].shift(7).loc['2020-08-08':'2020-08-14'].add(res['Remessas_A_1d'].loc['2020-08-08':'2020-08-14'])
To find the next values of previsao, I repeated the last step, moving 7 days ahead each time:
res['previsao'].loc['2020-08-15':'2020-08-21'] = res['previsao'].shift(7).loc['2020-08-15':'2020-08-21'].add(res['Remessas_A_1d'].loc['2020-08-15':'2020-08-21'])
res['previsao'].loc['2020-08-22':'2020-08-28'] = res['previsao'].shift(7).loc['2020-08-22':'2020-08-28'].add(res['Remessas_A_1d'].loc['2020-08-22':'2020-08-28'])
res['previsao'].loc['2020-08-29':'2020-08-31'] = res['previsao'].shift(7).loc['2020-08-29':'2020-08-31'].add(res['Remessas_A_1d'].loc['2020-08-29':'2020-08-31'])
#the last line only spanned 3 days because I reached the end of my dataframe
Instead of doing that by hand, how can I create a function that would take periods=7, Remessas_A and Remessas_A_1d as input and would give previsao as the output?
Not the most elegant code, but this should do the trick:
df["previsao"][df.index <= pd.to_datetime("2020-08-07")] = df["Remessas_A"].shift(7) + df["Remessas_A_1d"]
for d in pd.date_range("2020-08-08", "2020-08-31"):
    df.loc[d, "previsao"] = df.loc[d - pd.Timedelta("7d"), "previsao"] + df.loc[d, "Remessas_A_1d"]
Edit: I've assumed you have DataEntrega as the index, parsed as datetime. I can post the rest of the code if you need it.
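A sketch of the requested helper, generalizing the loop above (the function name and signature are my own; it assumes DataEntrega is already a DatetimeIndex with one row per day, and that the base column is available periods days before the first non-null delta):
import numpy as np
import pandas as pd

def build_previsao(df, base_col='Remessas_A', delta_col='Remessas_A_1d', periods=7):
    out = pd.Series(np.nan, index=df.index)
    valid = df.index[df[delta_col].notna()]
    for i, day in enumerate(valid):
        prev_day = day - pd.Timedelta(days=periods)
        if i < periods:
            # Seed block: base column shifted `periods` days + the delta.
            out.loc[day] = df.loc[prev_day, base_col] + df.loc[day, delta_col]
        else:
            # Recursive block: previsao from `periods` days earlier + the delta.
            out.loc[day] = out.loc[prev_day] + df.loc[day, delta_col]
    return out

res['previsao'] = build_previsao(res)
Since each date depends only on the date periods days earlier, the series splits into periods independent chains (one per weekday when periods=7); a groupby over df.index.dayofweek with a cumulative sum of the deltas could vectorize this if the loop ever becomes a bottleneck.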

How to append dataframe with selected columns having higher feature score

Hi, I am new to Python; let me know if the question is not clear.
Here is my dataframe:
df = pd.DataFrame(df_test)
age bmi children charges
0 19 27.900 0 16884.92400
1 18 33.770 1 1725.55230
2 28 33.000 3 4449.46200
3 33 22.705 0 21984.47061
I am applying SelectKBest feature selection with the chi-squared test to this numerical data:
from sklearn.feature_selection import SelectKBest, chi2

X_clf = numeric_data.iloc[:,0:(col_len-1)]
y_clf = numeric_data.iloc[:,-1]
bestfeatures = SelectKBest(score_func=chi2, k=2)
fit = bestfeatures.fit(X_clf,y_clf)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_clf.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Feature','Score']  # the names shown in the output below
This is my output:
Feature Score
0 age 6703.764216
1 bmi 1592.481991
2 children 1752.136519
I now wish to reduce my dataframe to contain only the features with the 2 highest scores (plus the charges column), without hardcoding the column names.
I have tried storing the column names in a list and keeping those with the highest scores, but I get a ValueError. Is there any method/function I could try that stores the selected columns and then filters the dataframe based on their scores?
Expected output (column 'bmi' is gone, as it has the lowest of the 3 scores):
age children charges
0 19 0 16884.92400
1 18 1 1725.55230
2 28 3 4449.46200
3 33 0 21984.47061
So first you want to find out which features have the largest scores, then take the Feature names of the columns you do not want to keep:
colToDrop = featureScores.loc[~featureScores['Score'].isin(featureScores['Score'].nlargest(2)), 'Feature'].values
Next, we just filter the original df, dropping those columns from its column list:
df[df.columns.drop(colToDrop)]
I believe you need to work on the featureScores dataframe to keep the 2 features with the highest Score and then use these values as a list to filter the columns in the original dataframe. Something along the lines of:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + ['charges']
filtered_df = df[important_features]
The sort_values() call makes sure the features (in case there are more) are sorted from highest score to lowest. We then build a list of the first 2 values of the Feature column (already sorted) with .values.tolist()[:2]. Since you seem to also want the charges column in your output, we append it manually with + ['charges'] to our list of important_features.
Finally, we create filtered_df by selecting only the important_features columns from the original df.
Edit based on comments:
If you can guarantee charges will be the last column in your original df then you can simply do:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + [df.columns[-1]]
filtered_df = df[important_features]
I see you have previously defined your y column with y_clf = numeric_data.iloc[:,-1]; since that is a Series, you can then use [y_clf.name] or [df.columns[-1]], either should work fine.
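A minimal end-to-end sketch with the sample rows from the question (my own assembly; with so few rows the chi-squared scores differ from the question's, but age and children still come out on top, so the mechanics carry over):
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

df = pd.DataFrame({'age': [19, 18, 28, 33],
                   'bmi': [27.900, 33.770, 33.000, 22.705],
                   'children': [0, 1, 3, 0],
                   'charges': [16884.92400, 1725.55230, 4449.46200, 21984.47061]})

X_clf = df.iloc[:, :-1]
y_clf = df.iloc[:, -1].astype(int)  # chi2 expects a discrete target; cast for this sketch

fit = SelectKBest(score_func=chi2, k=2).fit(X_clf, y_clf)
featureScores = pd.DataFrame({'Feature': X_clf.columns, 'Score': fit.scores_})

# Top-2 feature names plus the target column, no hardcoding.
important = featureScores.nlargest(2, 'Score')['Feature'].tolist() + [df.columns[-1]]
filtered_df = df[important]
print(filtered_df)
#    age  children      charges
# 0   19         0  16884.92400
# 1   18         1   1725.55230
# 2   28         3   4449.46200
# 3   33         0  21984.47061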

Check if label, position in one dataframe fall in label, range of second dataframe of different lengths pandas

I have 2 dataframes. The first contains a label column plus start and stop columns; the second contains a label and a position column. The second dataframe is longer than the first.
I want to select rows in the second dataframe that fall within any of the ranges in the first dataframe.
Here is an MCVE. interval_df is the one containing the intervals I want to look inside, check_df has the label + position I want to search for, and result_df is what I expect the final output to be. Uncomment any print statement to see:
import pandas as pd
interval_df = pd.DataFrame({'Chr': [1,1,2,2,3,3], 'Pos1': [1,100,1,60,80,200], 'Pos2': [10,150,50,70,90,210]})
#print(interval_df)
check_df = pd.DataFrame({'Chr': [1,1,1,1,1,2,2,3,3,3,3,3], 'Pos':[8,9,11,110,200,45,55,10,50,80,85,200]})
#print(check_df)
result_df = pd.DataFrame({'Chr': [1,1,1,2,3,3,3], 'Pos':[8,9,110,45,80,85,200]})
#print(result_df)
# create ranges in the first df as a unique column
interval_df['interval'] = interval_df.apply(lambda x: range(x['Pos1'], x['Pos2']+1), axis=1)
# find the positions in check_df that have the same label and a position that is in the range in interval_df
output = check_df.loc[(check_df['Chr'] == interval_df['Chr']) & (check_df['Pos'] in interval_df['interval'])]
But here is the error I get:
ValueError: Can only compare identically-labeled Series objects
Thanks!
You can't do check_df['Chr'] == interval_df['Chr'] because the two Series have different lengths. Plus, (check_df['Pos'] in interval_df['interval']) would fail for several reasons:
you can't do series in list_like, but rather series.isin(list_like);
even then, series.isin(list_like) checks whether each element of the series equals an element of the list_like; in your case interval_df['interval'] is a Series of range objects, so it would test whether e.g. 8 equals range(1, 10), not whether 8 falls inside that range.
All that said, you can perform a cross merge and query:
(check_df.merge(interval_df, on='Chr', how='left')
         .query('Pos1 <= Pos <= Pos2')
         [check_df.columns]
)
Output:
Chr Pos
0 1 8
2 1 9
7 1 110
10 2 45
18 3 80
20 3 85
23 3 200
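As a usage check against the question's expected result (the drop_duplicates and reset_index are my additions, guarding against overlapping intervals and making the frames compare equal):
out = (check_df.merge(interval_df[['Chr', 'Pos1', 'Pos2']], on='Chr', how='left')
       .query('Pos1 <= Pos <= Pos2')
       [check_df.columns]
       .drop_duplicates()
       .reset_index(drop=True))
print(out.equals(result_df))  # True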

Pandas: new column as string extraction of another, only where a condition on string length is verified (fast way)

I am working with a large df (nearly 2 million rows) and need to create a new column from another one. The task seems easy: the starting column, called "PTCODICEFISCALE", contains a string of either 11 or 16 characters, no other possibilities, no NaN.
The new column I have to create ("COGNOME") must contain the first 3 characters of "PTCODICEFISCALE" ONLY IF the length of the "PTCODICEFISCALE" value in that row is 16; when the length is 11 the new column should contain nothing, which means "NaN" I think.
I have tried this:
csv.loc[len(csv['PTCODICEFISCALE']) == 16, 'COGNOME'] = csv.loc[csv.PTCODICEFISCALE.str[:3]]
In the output this error message appears:
ValueError: cannot index with vector containing NA / NaN values
Which I don't understand.
I am sure there are no NA/NaN values in the "PTCODICEFISCALE" column.
Any help? Thanks!
P.S.: "csv" is the name of the DataFrame
I think you need numpy.where and condition with str.len:
import numpy as np

csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)
Sample:
csv = pd.DataFrame({'PTCODICEFISCALE':['0123456789123456','1','01234567891234']})
print (csv)
PTCODICEFISCALE
0 0123456789123456
1 1
2 01234567891234
csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)
print (csv)
PTCODICEFISCALE COGNOME
0 0123456789123456 012
1 1 NaN
2 01234567891234 NaN
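An equivalent one-liner with Series.where (my own variant; where keeps the extracted prefix where the condition holds and inserts NaN elsewhere):
csv['COGNOME'] = csv['PTCODICEFISCALE'].str[:3].where(csv['PTCODICEFISCALE'].str.len() == 16)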
