Taking a proportion of a dataframe based on column values - python

I have a pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each distinct value in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is that I do not know how to do this for the remaining values in an idiomatic way. I also do not know how to combine the results of this sampling into a new dataframe.

You can get the unique values in the column with df['type of use'].unique() and iterate over them like below:
for i in df['type of use'].unique():
    print(df[df['type of use'] == i].sample(frac=0.2))
or
values = df['type of use'].unique()
i = 0
while i < len(values):
    df1 = df[df['type of use'] == values[i]].sample(frac=0.2)
    print(df1.head())
    i = i + 1
To store the samples, you can create a dictionary:
values = df['type of use'].unique()
dfs = ['df' + str(x) for x in values]
dicdf = dict()
i = 0
while i < len(dfs):
    dicdf[dfs[i]] = df[df['type of use'] == values[i]].sample(frac=0.2)
    i = i + 1
print(dicdf)
This prints a dictionary of the dataframes.
You can then print whichever sample you want to see, for example the housing sample: print(dicdf['dfhousing'])
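The same dictionary can be built without index bookkeeping by iterating over a groupby. The sketch below uses a made-up `df_demo` (not the asker's data) and also shows `GroupBy.sample`, which is available in pandas 1.1 and later and returns all per-group samples as a single dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe: 10 rows per category.
df_demo = pd.DataFrame({
    "type of use": ["housing"] * 10 + ["retail"] * 10,
    "value": range(20),
})

# One sampled dataframe per category, keyed like the answer above ('df' + value).
samples = {
    "df" + str(name): group.sample(frac=0.2, random_state=0)
    for name, group in df_demo.groupby("type of use")
}

# Same fraction for every category, returned as one dataframe (pandas >= 1.1).
all_samples = df_demo.groupby("type of use").sample(frac=0.2, random_state=0)

print(samples["dfhousing"])
print(all_samples)
```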

Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received on a similar question here. Applying it to your data:
import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?
#First, create a new 'helper' dataframe:
random_state = 41 # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True) #may need this to simplify multi-index dataframe
# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
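Since the question asks for a different proportion per category, here is a minimal sketch of one way to do that and collect the result in a new dataframe. The `fractions` mapping, the helper name `sample_by_group`, and the demo data are all assumptions made for illustration; every category present in the column must have an entry in the mapping.

```python
import pandas as pd

def sample_by_group(df, column, fractions, random_state=None):
    """Sample each category of `column` with its own fraction (hypothetical helper)."""
    parts = []
    for value, group in df.groupby(column):
        # fractions[value] raises KeyError if a category has no fraction assigned.
        parts.append(group.sample(frac=fractions[value], random_state=random_state))
    return pd.concat(parts)

# Made-up demo data, not the asker's dataframe.
df_demo = pd.DataFrame({
    "type of use": ["housing"] * 10 + ["retail"] * 10 + ["office"] * 4,
    "value": range(24),
})
fractions = {"housing": 0.2, "retail": 0.5, "office": 0.25}

sampled = sample_by_group(df_demo, "type of use", fractions, random_state=0)
print(sampled["type of use"].value_counts())
```

The result keeps the original index, so it can also be used with df.loc to mark the sampled rows as in the answer above.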

Related

Python - How to add a column to a DataFrame based on multiple criteria from the same DataFrame

I'm trying to create a new column in a DataFrame that comes from a CSV file. What makes this a little bit tricky is that the values of this new column depend on conditions from other columns of the DataFrame.
The output column depends on the values from the following columns from this dataframe: VaccineCode | Occurrence | VaccineN | firstVaccineDate
So if the condition is met for a specific vaccine, I have to sum the respective date from the ApplicationDate column, in order to tell the vaccine date of the second dose.
My code:
import pandas as pd
import datetime
from datetime import timedelta, date, datetime
df = pd.read_csv(path_csv, engine='python', sep=';')
criteria_Astrazeneca = (df.VaccineCode == 85) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_Pfizer = (df.VaccineCode == 86) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_CoronaVac = (df.VaccineCode == 87) & (df.Occurrence == 1) & (df.VaccineN == 1)
days_pfizer = 56
days_coronaVac = 28
days_astraZeneca = 84
What I've tried so far:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
This works until the point that I have to complete the same New_Column with the others results, like this:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df['New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df['New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
Naturally, the problem with this approach comes from the fact that the next statement overwrites those before, so I end up just with the New_Column filled with the results that came from the last statement. I need a way to put all results in the same column.
My last try was:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df[criteria_Pfizer].loc[:,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
But it gives the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_column(ilocs[0], value, pi)
Thank you very much @ddejohn, the first link helped me to solve my problem as follows:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df.loc[criteria_Pfizer,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df.loc[criteria_Astrazeneca,'New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
That way, the first statement creates the column and fills it at the CoronaVac indexes, and the following statements fill the same column only at their respective indexes.
Problem solved, thanks again.
You could also use a data frame transform to create a new rule.
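Another way to fill the column in a single pass is to map each vaccine code to its interval and add it as a timedelta. This is only a sketch: the codes and day counts are taken from the question, while `df_demo`, its dates, and the `days_by_code` name are made up for illustration.

```python
import pandas as pd

# Made-up rows using the column names from the question.
df_demo = pd.DataFrame({
    "VaccineCode": [85, 86, 87, 85],
    "Occurrence": [1, 1, 1, 2],
    "VaccineN": [1, 1, 1, 1],
    "firstVaccineDate": pd.to_datetime(["2021-01-01"] * 4),
})

# Days until the second dose, per vaccine code (values from the question).
days_by_code = {85: 84, 86: 56, 87: 28}

# Rows that satisfy the shared part of the three criteria.
eligible = (df_demo["Occurrence"] == 1) & (df_demo["VaccineN"] == 1)

# Per-row offset; codes missing from the mapping become NaT.
offset = pd.to_timedelta(df_demo["VaccineCode"].map(days_by_code), unit="D")

# Keep the computed date only where the criteria hold, NaT elsewhere.
df_demo["New_Column"] = (df_demo["firstVaccineDate"] + offset).where(eligible)
print(df_demo)
```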

Looping over rows in Pandas dataframe taking too long

I have been running this code for about 45 minutes and it is still going. Can someone please suggest how I can make it faster?
df4 is a pandas DataFrame. df4.head() looks like this:
df4 = pd.DataFrame({
    'hashtag': np.random.randn(3000000),
    'sentiment_score': np.random.choice([0, 1], 3000000),
    'user_id': np.random.choice(['11', '12', '13'], 3000000),
})
What I am aiming to have is a new column called rating.
len(df4.index) is 3,037,321.
ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]
    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)
    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)
    rating = songListendForContext/totalSongListendForContext
    ratings.append(rating)
Overall, you'll need groupby. You can either:
Use two groupby calls with transform to get the size of what you called condition and the size of condition & (df4['track_id'] == rowTrackID), then divide the second by the first:
df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
                  / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
Or use groupby with value_counts with the parameter normalize=True and merge the result with df4:
df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                   .value_counts(normalize=True)
                   .rename('ratings').reset_index(),
                how='left')
In both cases, you will get the same result as your list ratings (which I assume you wanted to be a column). I would say the second option is faster, but it depends on the number of groups you have in your real case.
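To make the two-groupby idea concrete, here is a small sketch on made-up data. `df4_demo` and its `track_id` values are assumptions, since the sample frame in the question does not include that column.

```python
import pandas as pd

# Tiny made-up frame with the three columns the loop relies on.
df4_demo = pd.DataFrame({
    "user_id": ["11", "11", "11", "12"],
    "sentiment_score": [1, 1, 1, 0],
    "track_id": ["a", "a", "b", "a"],
})

context = ["user_id", "sentiment_score"]

# Number of rows sharing the same user and sentiment (the loop's `condition`).
in_context = df4_demo.groupby(context)["track_id"].transform("size")

# Number of rows sharing user, sentiment, and track.
track_in_context = df4_demo.groupby(context + ["track_id"])["track_id"].transform("size")

df4_demo["ratings"] = track_in_context / in_context
print(df4_demo)
```

Each transform returns a Series aligned with the original rows, so the division replaces the per-row filtering in the loop with two passes over the data.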

How to assess all values of a row in a pandas dataframe and write into a new column

I have a pandas dataframe of 62 rows x 10 columns. Each row contains numbers, and if any of the numbers in a row is within a certain range I want to write a string into the last column.
I have unsuccessfully tried the .apply method to use a function to make the assessment. I have also tried to import as a series but then the .apply method causes problems because it is a list.
df = pd.read_csv(results)
For example, in the image attached, if any value from Base 2019 to FY26 Load is between 0.95 and 1.05 then return 'Acceptable' into the last column otherwise return 'Not Acceptable'.
Any help, even a start would be much appreciated.
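This should perform as expected:
import numpy as np
import pandas as pd
results = "input.csv"
df = pd.read_csv(results)
low = 0.95
high = 1.05
# The columns to check
cols = df.columns[2:]
# True where at least one checked value lies strictly between low and high
in_range = ((df[cols] > low) & (df[cols] < high)).any(axis=1)
df['Acceptable?'] = np.where(in_range, 'Acceptable', 'Not Acceptable')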
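A point worth checking is that both bounds are tested on the same cell before reducing with any(axis=1); testing each bound separately across the row can flag rows where no single value is in range. A minimal sketch on made-up numbers (the column names come from the question, the values do not):

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names are taken from the question.
df_demo = pd.DataFrame({
    "Base 2019": [0.50, 1.00, 1.20],
    "FY26 Load": [0.97, 1.30, 1.40],
})
low, high = 0.95, 1.05

# Cell-wise test: True only where a single value is inside (low, high).
in_range = (df_demo > low) & (df_demo < high)

# A row is acceptable if at least one of its cells passed the test.
df_demo["Acceptable?"] = np.where(in_range.any(axis=1), "Acceptable", "Not Acceptable")
print(df_demo)
```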

Python: Count instances of a specific character in all rows within a dataframe column

I have a dataframe (df) containing columns ['toaddress', 'ccaddress', 'body']
I want to iterate through the index of the dataframe to get the min, max, and average number of email addresses in the toaddress and ccaddress fields, as determined by counting the instances of '@' within each field in those two columns.
If all else fails, I guess I could just use df.toaddress.str.contains(r'@').sum() and divide that by the number of rows in the dataframe to get the average, but I think that only counts the rows that have at least one @ sign.
You can use
df[['toaddress', 'ccaddress']].applymap(lambda x: str.count(x, '@'))
to get back the count of '@' within each cell. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)
Then you can just compute the pandas max, min, and mean along the row axis in the result.
As I commented on the original question, you already suggested using df.toaddress.str.contains(r'@').sum() -- why not use df.toaddress.str.count(r'@') if you're happy going column by column instead of the method I showed above?
len(df[df.toaddress.str.contains(r'@')])
or even
len(list(filter(lambda row: r'@' in str(row.toaddress), df.itertuples())))
Perhaps something like this:
import pandas as pd
import re
df = pd.DataFrame({"emails": ["fake@gmail.com, example@gmail.com",
                              "KingArthur@aol.com, none, SirRobyn@msn.net, TheBlackKnight@clintonserver.com"]})
at = re.compile(r"@")
def count_emails(string):
    # Count every '@' in the cell
    count = 0
    for i in at.finditer(string):
        count += 1
    return count
df["count"] = df["emails"].map(count_emails)
df
Returns:
                                              emails  count
0                  fake@gmail.com, example@gmail.com      2
1  KingArthur@aol.com, none, SirRobyn@msn.net, Th...      3
This answer uses https://pypi.python.org/pypi/fake-factory to generate the test data.
import pandas as pd
from random import randint
from faker import Factory
fake = Factory.create()
def emails():
    # One to four fake addresses per cell
    emailAdd = [fake.email()]
    for x in range(randint(0, 3)):
        emailAdd.append(fake.email())
    return emailAdd
# DataFrame.append was removed in pandas 2.0, so build the rows first
rows = [{'toaddress': emails(), 'ccaddress': emails(), 'body': fake.text()} for extra in range(10)]
df1 = pd.DataFrame(rows, columns=['toaddress', 'ccaddress', 'body'])
print('toaddress length is {}'.format([len(x) for x in df1.toaddress.values]))
print('ccaddress length is {}'.format([len(x) for x in df1.ccaddress.values]))
The last 2 lines are the part that counts your emails.
I wasn't sure if you wanted to check for '@' specifically, maybe you can use fake-factory to generate some test data as an example?
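Putting the per-cell count together with the min, max, and mean the question asks for could look like the sketch below. It assumes each address contains exactly one marker character; `marker`, `df_demo`, and the addresses in it are made up for illustration.

```python
import pandas as pd

# Character that marks one address; change it if your data uses a different one.
marker = "@"

# Made-up data with the two address columns from the question.
df_demo = pd.DataFrame({
    "toaddress": ["a@x.com, b@y.com", "c@z.com", ""],
    "ccaddress": ["", "d@x.com, e@y.com, f@z.com", "g@x.com"],
})

# Count the marker in every cell, one column at a time.
counts = df_demo[["toaddress", "ccaddress"]].apply(lambda col: col.str.count(marker))

# One row per statistic, one column per address field.
summary = counts.agg(["min", "max", "mean"])
print(summary)
```

Unlike str.contains, str.count returns the number of matches per cell, so rows with several addresses contribute more than 1 to the average.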

Speed up vlookup like operation using pandas in python

I have written some code that essentially does an Excel-style vlookup on two pandas dataframes, and I want to speed it up.
The structure of the data frames is as follows:
dbase1_df.columns:
'VALUE', 'COUNT', 'GRID', 'SGO10GEO'
merged_df.columns:
'GRID', 'ST0', 'ST1', 'ST2', 'ST3', 'ST4', 'ST5', 'ST6', 'ST7', 'ST8', 'ST9', 'ST10'
sgo_df.columns:
'mkey', 'type'
To combine them, I do the following:
1. For each row in dbase1_df, find the row where its 'SGO10GEO' value matches the 'mkey' value of sgo_df. Obtain the 'type' from that row in sgo_df.
2. 'type' contains an integer ranging from 0 to 10. Create a column name by prefixing the type with 'ST'.
3. Find the value in merged_df where its 'GRID' value matches the 'GRID' value in dbase1_df and the column name is the one we obtained in step 2. Output this value into a csv file.
import pandas
# lup_out and err_out are csv writers; sgo_df, dbase1_file and EXTRACT_VAR are defined elsewhere
# Read in dbase1 dbf into data frame
dbase1_df = pandas.read_csv(dbase1_file, index_col=False)
merged_df = pandas.read_csv('merged.csv', index_col=False)
lup_out.writerow(["VALUE", "TYPE", EXTRACT_VAR.upper()])
# For each row in dbase1 data frame:
for index, row in dbase1_df.iterrows():
    # 1. Find the soil type corresponding to the mukey
    tmp = sgo_df.type.values[sgo_df['mkey'] == int(row['SGO10GEO'])]
    if tmp.size > 0:
        s_type = 'ST' + str(tmp[0])
        val = int(row['VALUE'])
        # 2. Obtain hmu value
        tmp_val = merged_df[s_type].values[merged_df['GRID'] == int(row['GRID'])]
        if tmp_val.size > 0:
            hmu_val = tmp_val[0]
            # 3. Output into data frame: VALUE, hmu value
            lup_out.writerow([val, s_type, hmu_val])
        else:
            # Log the type and GRID that had no match in merged_df
            err_out.writerow([s_type, row['GRID']])
Is there anything here that might be a speed bottleneck? Currently it takes around 20 minutes for ~500,000 rows in dbase1_df, ~1,000 rows in merged_df and ~500,000 rows in sgo_df.
thanks!
You need to use the merge operation in pandas to get better performance. I'm not able to test the code below since I don't have the data, but at minimum it should help you get the idea:
import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
dbase1_df = pd.read_csv('dbase1_file.csv', index_col=False)
sgo_df = pd.read_csv('sgo_df.csv', index_col=False)
merged_df = pd.read_csv('merged_df.csv', index_col=False)
# Common columns need the same name for the merge in pandas, so SGO10GEO is renamed to mkey
dbase1_df.columns = ['VALUE', 'COUNT', 'GRID', 'mkey']
# Merge the two dataframes on the shared mkey column
Step1_Merge = pd.merge(dbase1_df, sgo_df)
# Add a new column that concatenates 'ST' and type
Step1_Merge['type_2'] = Step1_Merge['type'].map(lambda x: 'ST' + str(x))
# Reshape merged_df, moving columns to rows, to be able to do another merge operation
grid_ids = merged_df.loc[:, ['GRID']]  # .ix was removed in pandas 1.0
a = pd.merge(merged_df.stack(0).reset_index(1), grid_ids, left_index=True, right_index=True)
# Rename the automatically generated column name to type_2 for the next merge operation
a.columns = ['type_2', 0, 'GRID']
result = pd.merge(Step1_Merge, a, on=['type_2', 'GRID'])
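The reshape step can also be written with DataFrame.melt, which may be easier to follow than stack plus reset_index. The sketch below runs on tiny made-up frames (`dbase1_demo`, `sgo_demo`, `merged_demo`) that only mimic the column layout from the question; the `st_col` and `hmu_val` names are assumptions.

```python
import pandas as pd

# Made-up frames that mimic the column layout described in the question.
dbase1_demo = pd.DataFrame({"VALUE": [1, 2], "GRID": [10, 20], "SGO10GEO": [100, 200]})
sgo_demo = pd.DataFrame({"mkey": [100, 200], "type": [0, 1]})
merged_demo = pd.DataFrame({"GRID": [10, 20], "ST0": [0.1, 0.2], "ST1": [0.3, 0.4]})

# Step 1: look up the type for every row (inner join drops rows without a match).
step1 = dbase1_demo.merge(sgo_demo, left_on="SGO10GEO", right_on="mkey")

# Step 2: build the column name to look up, e.g. 'ST0'.
step1["st_col"] = "ST" + step1["type"].astype(str)

# Step 3: turn the ST columns into rows so (GRID, st_col) can be used as a join key.
long_df = merged_demo.melt(id_vars="GRID", var_name="st_col", value_name="hmu_val")
result = step1.merge(long_df, on=["GRID", "st_col"])

print(result[["VALUE", "st_col", "hmu_val"]])
```

Both merges are vectorized joins, so the per-row boolean filtering in the original loop is avoided entirely.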
