Dataframe.sample - Weights - How to use it? - python

I have this situation:
I have a probability of 0.1348 stored in a variable called treat_conv.
Now I am trying to create a dataframe from the original dataframe, using this probability to fill a specified column. Is that possible? I am trying to use weights, but with no success. Maybe I am using it wrong?
Here is my code:
weights = np.array(treat_conv) #creating a array with treat_conv
new_page_converted = df2.sample(n = treat_group.shape[0], weights=df2.converted(weights)) #creating new dataframe with the number of rows of treat_group and the column converted must have a 0.13 of chance to bring value 1
So, the code works if I use n alone. It creates a new dataframe with the correct number of rows. But I can't get the correct probability of bringing the value 1 into the converted column.
I hope my explanation is understandable.
Thank you!

You could do something like this
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 100, 1), columns=["SomeValue"])
selected = pd.DataFrame(data=np.random.choice(df["SomeValue"], int(len(df["SomeValue"]) * 0.13), replace=False),
                        columns=["SomeValue"])
selected["Trigger"] = 1
df = df.merge(selected, how="left", on="SomeValue")
df["Trigger"] = df["Trigger"].fillna(0)
"df" is your original DataFrame. Then select random 13% of the values and add a column indicating they've been selected. Finally, merge all back to your original Dataframe.
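If you specifically want to use the weights argument of DataFrame.sample: it expects one weight per row (relative sampling probabilities), not a single number. A minimal sketch, assuming a toy df2 in place of yours and sampling with replacement, gives the converted == 1 rows a combined weight of treat_conv so that roughly 13% of the sampled rows have converted equal to 1:

```python
import numpy as np
import pandas as pd

p = 0.1348  # treat_conv from the question

# Toy stand-in for df2 (assumption): 80 non-converted rows, 20 converted rows
df2 = pd.DataFrame({"converted": [0] * 80 + [1] * 20})

is_conv = df2["converted"] == 1
# One weight per row: converted rows share total weight p, the rest share 1 - p
weights = np.where(is_conv, p / is_conv.sum(), (1 - p) / (~is_conv).sum())

# replace=True so that n may exceed the number of rows in df2
sample = df2.sample(n=10_000, weights=weights, replace=True, random_state=0)
rate = sample["converted"].mean()
print(rate)
```

The printed rate should land close to 0.1348; the exact value depends on the random state.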

Related

Calculate Gunning-Fog score on excel values

I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score on each row and have the value output to that same excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard key the text into the df variable. However, it does not work when I define the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, but two fields I am testing contain 3,896 and 4,843 words respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
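This line rfd = df["Item 1A"] returns a column (a pandas Series), not a string, which is why passing it straight to Readability raised the TypeError. rfd.to_string() does give you a string, but it is a printable rendering of the whole column (index labels plus values, and long cell text may be shortened for display depending on your pandas settings), not the text of a single cell, so it is not what you want to score.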
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now a cell identifier exists, this can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog().score)
This shows you how commands can be chained together - which way you find it easier is up to you. The chaining is useful when you give it to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog().score))

In python How to ensure that seed in using randint keeps changing when i am trying to pick a random number?

def claims(dataframe):
    dataframe.loc[(dataframe.severity ==1),'claims_made']= randint(200, 20000)
    return dataframe
here 'severity' is an existing column and 'claims_made' is a new column, I want to have the randint keep picking different values that are being assigned to the 'claims_made' column. because for now it's just picking one random value out of the bucket specified and is assigning the same value to all the rows that satisfy the condition
Your code gets a single randint and applies that one value to the column you create. It's the same as if you had done
val = randint(200, 20000)
dataframe.loc[(dataframe.severity ==1),'claims_made']= val
Instead you could get an index of the rows you want to assign. Use it to create a series of random integers and when you assign that back to the dataframe, non-indexed rows become NaN.
import pandas as pd
import numpy as np
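def claims(dataframe):
    # index labels of the rows that should receive a random value
    wanted_index = dataframe[dataframe.severity==1].index
    dataframe["claims_made"] = pd.Series(
        # the upper bound of np.random.randint is exclusive
        np.random.randint(200, 20001, size=len(wanted_index)),
        index=wanted_index)
    return dataframe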
df = pd.DataFrame({"severity":[1, 1, 0, 8, -1, 99, 1]})
print(claims(df))
If you want to stick with your existing approach, you could do something like this:
from random import randint
def claims2(df):
    # count the matching rows without needing 'claims_made' to exist yet
    n_rows = (df['severity'] == 1).sum()
    vals = [randint(200, 20000) for _ in range(n_rows)]
    df.loc[(df.severity==1), 'claims_made'] = vals
    return df
p.s. I'd recommend accessing columns via df['severity'] instead of df.severity -- you can get into trouble using the . syntax if you have a dataset with spaces etc. in the column names.
I'll give you a broad hint; coding is up to you.
Form a series (a temporary column object) of random numbers in the desired range. Assign that series to your data frame column. You can find examples of this technique in any tutorial on data frames.
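For what it's worth, that hint can be sketched roughly as follows. This is only one way to do it, assuming the same toy severity data used above and NumPy's Generator API (the seed of 42 is arbitrary): build a boolean mask, draw one random integer per matching row, and assign them all at once.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"severity": [1, 1, 0, 8, -1, 99, 1]})

mask = df["severity"] == 1           # rows that should get a random value
rng = np.random.default_rng(42)      # seeded only to make the sketch repeatable

# One draw per matching row; endpoint=True makes 20000 inclusive, like randint
draws = rng.integers(200, 20000, size=mask.sum(), endpoint=True)
df.loc[mask, "claims_made"] = draws

print(df)
```

Rows that do not match the mask are left as NaN in the new column.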

Taking first value in a rolling window that is not numeric

This question follows one I previously asked here, and that was answered for numeric values.
I am raising this second one for data of Period type.
While the example given below appears simple, my actual windows are of variable size. Since I am interested in the first row of each window, I am looking for a technique that makes use of this definition.
import pandas as pd
from random import seed, randint
# DataFrame
pi1h = pd.period_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in pi1h]
df = pd.DataFrame({'Values' : values, 'Period' : pi1h}, index=pi1h)
# This works (numeric type)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
# This doesn't (Period type)
df['OpeningPeriod'] = df['Period'].rolling(3).agg(lambda rows: rows[0])
Result of 2nd command
DataError: No numeric types to aggregate
Please, any idea? Thanks for any help! Bests,
The first row of a rolling window of size 3 is the row two above the current one, so just use pd.Series.shift(2):
df['OpeningPeriod'] = df['Period'].shift(2)
For a variable window size (for the sake of example, I took the Values column as the variable size):
import numpy as np
x=(np.arange(len(df))-df['Values'])
df['OpeningPeriod'] = np.where(x.ge(0), df.loc[df.index[x.tolist()], 'Period'], np.nan)
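A more explicit variant of the same idea, as a rough sketch: it assumes a hypothetical Window column holding the window length for each row (that column is not in the question), computes the position where each window starts, and looks the Period up there. Rows whose window would start before the first row get NaT.

```python
import numpy as np
import pandas as pd

periods = pd.period_range(start="2020-01-01 00:00", periods=6, freq="h")
# 'Window' is a hypothetical per-row window length, for illustration only
df = pd.DataFrame({"Period": periods, "Window": [2, 2, 3, 2, 3, 1]})

# Position of the first row of each window: current position - length + 1
start = np.arange(len(df)) - df["Window"].to_numpy() + 1

# Look up the opening Period, or NaT when the window is not yet complete
df["OpeningPeriod"] = [
    df["Period"].iloc[s] if s >= 0 else pd.NaT for s in start
]
print(df)
```

The list comprehension is slower than a vectorized lookup, but it keeps the Period values intact without converting them to floats.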
Convert your period[H] to a float
# convert to float
df['Period1'] = df['Period'].dt.to_timestamp().values.astype(float)
# rolling and convert back to period
df['OpeningPeriod'] = pd.to_datetime(df['Period1'].rolling(3)\
.agg(lambda rows: rows[0])).dt.to_period('1h')
# drop column
df = df.drop(columns='Period1')

Pandas creating columns by multiplying other columns

I have a dataframe with columns below
df = pd.DataFrame({'t0_p0':[1,2,3], 't1_p0':[1,2,3], 't2_p0':[1,2,3], 't0_p1':[1,2,3], 't1_p1':[1,2,3], 't2_p1':[1,2,3], 't0_p3':[1,2,3], 't1_p3':[1,2,3], 't2_p3':[1,2,3], 'Month_1':[1,0,0],'Month_2':[0,1,0], 'Hour_1':[1,0,0],'Hour_2':[0,1,0], 'x_1':[0,1,1], 'holid':[2,7,8]})
With the dataframe above, I want to multiply the Month and Hour columns by each of the t columns. For example, t0_p0 * Month_1, t0_p0 * Month_2, ..., t2_p3 * Month_2, and the same for the Hour columns. I will not multiply Month by Hour.
The result of each multiplication should be added as a new column named like Month1_t0_p0 or Hour2_t2_p3, so basically the names of the two multiplied columns put together.
What would be the pythonic way of doing this? I know how to multiply columns like:
df['Month1_t0_p0'] = df['Month_1'] * df['t0_p0']
However, I am not sure how to automatically select the columns I want to multiply and create and name columns in the way I described above.
Please help me out here.
Thank you so much.
You can do this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'t0_p0':[1,2,3],
't1_p0':[1,2,3],
't2_p0':[1,2,3],
't0_p1':[1,2,3],
't1_p1':[1,2,3],
't2_p1':[1,2,3],
't0_p3':[1,2,3],
't1_p3':[1,2,3],
't2_p3':[1,2,3],
'Month_1':[1,0,0],
'Month_2':[0,1,0],
'Hour_1':[1,0,0],
'Hour_2':[0,1,0],
'x_1':[0,1,1],
'holid':[2,7,8]})
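cols_tp = df.columns[df.columns.str.startswith('t')]
# Month_* and Hour_* columns are the multipliers
cols_mh = df.columns[df.columns.str.match(r'(?:Month|Hour)_')]
for col_tp in cols_tp:
    for col_m in cols_mh:
        # e.g. 'Month_1' and 't0_p0' give 'Month1_t0_p0'
        df[col_m.replace('_', '') + '_' + col_tp] = df[col_m] * df[col_tp]
df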
Maybe as a starter: create a new df with only the columns you want to multiply with, then iterate over that new df. Then use concatenation to build the final df from the new columns and the ones you didn't want to multiply with. I am not sure, though, how to automatically generate the names of the columns, nor do I have the exact code for the iteration. Sorry for that. As I said, maybe a starter.
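A rough sketch of that concatenation idea, assuming a cut-down version of the dataframe from the question and the naming scheme asked for there (Month1_t0_p0 and so on): the products are collected in a dict keyed by the new column name and then attached in one pd.concat call.

```python
import pandas as pd

# Cut-down version of the question's dataframe, for illustration only
df = pd.DataFrame({
    "t0_p0": [1, 2, 3],
    "t1_p0": [4, 5, 6],
    "Month_1": [1, 0, 0],
    "Hour_2": [0, 1, 0],
    "holid": [2, 7, 8],
})

value_cols = [c for c in df.columns if c.startswith("t")]
mult_cols = [c for c in df.columns if c.startswith(("Month_", "Hour_"))]

# 'Month_1' and 't0_p0' become the key 'Month1_t0_p0'
products = {
    f"{m.replace('_', '')}_{v}": df[m] * df[v]
    for m in mult_cols
    for v in value_cols
}

# Attach all new columns at once; the original columns are kept as they are
result = pd.concat([df, pd.DataFrame(products)], axis=1)
print(result)
```

Building the dict first and concatenating once avoids inserting columns one at a time, which pandas can warn about when there are many of them.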

Monte carlo simulation in python - problem with looping

I am running a simple Python script for a Monte Carlo (MC) simulation. Basically, it reads through every row in the dataframe and selects the min and max of the two variables. Then the simulation is run 1000 times, each time selecting a random value between the min and max, computing the product, and writing the P50 value back to the dataframe.
Somehow the P50 output is the same for all rows. Any help on where I am going wrong?
import pandas as pd
import random
import numpy as np
data = [[0.075,0.085, 120, 150], [0.055, 0.075, 150, 350],[0.045,0.055,175,400]]
df = pd.DataFrame(data, columns = ['P_min','P_max','H_min','H_max'])
NumSim = 1000
for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = (row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1))
        ht = (row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1))
        outdata[k] = phi*ht
    df['out_p50'] = np.percentile(outdata,50)
print(df)
By df['out_p50'] = np.percentile(outdata,50) you are saying that you want the whole column to be set to given value, not a specific row of the column. Therefore, the numbers are generated and saved but they are saved to the whole column and in the end, you see the last generated number in every row.
Instead, use df.loc[index, 'out_p50'] = np.percentile(outdata,50) to specify the specific row you want to set.
Yup -- you're writing a scalar value to the entire column. You overwrite that value on each iteration. If you want, you can simply specify the row with df.loc for a quick fix. Also consider using np.median(outdata) instead of percentile.
Perhaps the most important feature of pandas is the built-in support for vectorization: you work with entire columns of data, rather than looping through the data frame. Think of it like a list comprehension in which you don't need the for row in df iteration at the end.
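To make the vectorization point concrete, here is a minimal sketch using the data from the question. It assumes NumPy's Generator API with an arbitrary seed, and it draws all simulations for all rows as one 2-D array (rows by simulations) instead of looping:

```python
import numpy as np
import pandas as pd

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=["P_min", "P_max", "H_min", "H_max"])

num_sim = 1000
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Column vectors of shape (n_rows, 1) broadcast against size=(n_rows, num_sim)
p_lo, p_hi = df["P_min"].to_numpy()[:, None], df["P_max"].to_numpy()[:, None]
h_lo, h_hi = df["H_min"].to_numpy()[:, None], df["H_max"].to_numpy()[:, None]

phi = rng.uniform(p_lo, p_hi, size=(len(df), num_sim))
ht = rng.uniform(h_lo, h_hi, size=(len(df), num_sim))

# Median (P50) of the product across simulations, one value per row
df["out_p50"] = np.median(phi * ht, axis=1)
print(df)
```

Each row now gets its own P50, because the median is taken along the simulation axis rather than assigned as a single scalar.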
