Generate scenarios with different means from data frame - python

I have the following data frame:
Cluster OPS(4) mean(ln) std(ln)
0 5-894 5-894a 2.203 0.775
1 5-894 5-894b 2.203 0.775
2 5-894 5-894c 2.203 0.775
3 5-894 5-894d 2.203 0.775
4 5-894 5-894e 2.203 0.775
For each surgery type (in column OPS(4)) I would like to generate 10,000 scenarios, which should be stored in another data frame.
I know that I can create scenarios with:
num_reps = 10000
scenarios = np.ceil(np.random.lognormal(mean, std, num_reps))
And the new data frame should look like this, with 10,000 scenarios in each column:
scen_per_surg = pd.DataFrame(index=range(num_reps), columns=merged_information['OPS(4)'])
OPS(4) 5-894a 5-894b 5-894c 5-894d 5-894e
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
...
Unfortunately, I don't know how to iterate over the rows of the first data frame to create the scenarios.
Can somebody help me?
Best regards

Create some sample data to experiment with
import pandas as pd
df = pd.DataFrame(data=[
[ '5-894' , '5-894a' , 2.0 , 0.70],
[ '5-894' , '5-894b' , 2.1 , 0.71],
[ '5-894' , '5-894c' , 2.2 , 0.72],
[ '5-894' , '5-894d' , 2.3 , 0.73],
[ '5-894' , '5-894e' , 2.4 , 0.74] ], columns =['Cluster', 'OPS(4)', 'mean(ln)', 'std(ln)'])
print(df)
Create an empty dataframe
new_df = pd.DataFrame()
Define a function that is applied to each row of the original df, generates the required random values, and assigns them to a column in new_df
import numpy as np
def gen_scenarios(row):
    # Skip 'Cluster'; unpack the surgery code, mean and std by position
    col, mean, std = row.iloc[1:]
    # 10 draws for the demo; use 10000 for the full set of scenarios
    new_df[col] = np.ceil(np.random.lognormal(mean, std, 10))
Apply the function
df.apply(gen_scenarios, axis=1)
print(new_df)
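As an alternative to the row-wise apply above, all columns can be drawn in one call by letting NumPy broadcast the per-surgery parameters. This is only a sketch: it swaps np.random.lognormal for the newer Generator API, and the seed and the three-row sample frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small sample frame in the same layout as the question (values are made up)
df = pd.DataFrame(
    [
        ["5-894", "5-894a", 2.0, 0.70],
        ["5-894", "5-894b", 2.1, 0.71],
        ["5-894", "5-894c", 2.2, 0.72],
    ],
    columns=["Cluster", "OPS(4)", "mean(ln)", "std(ln)"],
)

num_reps = 10_000
rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility

# mean and sigma have shape (n_surgeries,); with size=(num_reps, n_surgeries)
# they broadcast so that column j is drawn with the parameters of row j.
samples = rng.lognormal(
    mean=df["mean(ln)"].to_numpy(),
    sigma=df["std(ln)"].to_numpy(),
    size=(num_reps, len(df)),
)

# One column per surgery type, one row per scenario
scen_per_surg = pd.DataFrame(np.ceil(samples), columns=df["OPS(4)"])
print(scen_per_surg.head())
```

This avoids adding columns to a data frame one at a time, which matters more as the number of surgery types grows.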

Related

Create dataframe pandas from dict where values are list of tuples and each column name is unique

I have two lists that I use to create a dictionary, where list1 has text data and list2 is a list of tuples (text, float). I use these 2 lists to create a dictionary and the goal is to create a dataframe where each row of the first column will contain the elements of list1, each column will have a column name based on each unique text term from the first tuple element and for each row there will be the float values that connect them.
For example here's the dictionary with keys : {be, associate, induce, represent} and values : {('prove', 0.583171546459198), ('serve', 0.4951282739639282)} etc.
{'be': [('prove', 0.583171546459198), ('serve', 0.4951282739639282), ('render', 0.4826732873916626), ('represent', 0.47748714685440063), ('lead', 0.47725602984428406), ('replace', 0.4695377051830292), ('contribute', 0.4529820680618286)],
'associate': [('interact', 0.8237789273262024), ('colocalize', 0.6831706762313843)],
'induce': [('suppress', 0.8159114718437195), ('provoke', 0.7866303324699402), ('elicit', 0.7509980201721191), ('inhibit', 0.7498961687088013), ('potentiate', 0.742023229598999), ('produce', 0.7384929656982422), ('attenuate', 0.7352016568183899), ('abrogate', 0.7260081768035889), ('trigger', 0.717864990234375), ('stimulate', 0.7136563658714294)],
'represent': [('prove', 0.6612186431884766), ('evoke', 0.6591314673423767), ('up-regulate', 0.6582908034324646), ('synergize', 0.6541063785552979), ('activate', 0.6512928009033203), ('mediate', 0.6494284272193909)]}
Desired Output
prove serve render represent
be 0.58 0.49 0.48 0.47
associate 0 0 0 0
induce 0.45 0.58 0.9 0.7
represent 0.66 0 0 1
What trips me up is that the verb prove can be found under more than one key (i.e. for the key be the score is 0.58, and for the key represent the score is 0.66).
If I use df = pd.DataFrame.from_dict(d,orient='index'), then the verb prove will appear twice as a column name, whereas I want each term to appear once in each column.
Can someone help?
With the dictionary that you provided (as d), you can't use from_dict directly.
You either need to rework the dictionary to have elements as dictionaries:
pd.DataFrame.from_dict({k: dict(v) for k,v in d.items()}, orient='index')
Or you need to read it as a Series and to reshape:
(pd.Series(d).explode()
.apply(pd.Series)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
)
output:
prove serve render represent lead replace \
be 0.583172 0.495128 0.482673 0.477487 0.477256 0.469538
represent 0.661219 NaN NaN NaN NaN NaN
associate NaN NaN NaN NaN NaN NaN
induce NaN NaN NaN NaN NaN NaN
contribute interact colocalize suppress ... produce \
be 0.452982 NaN NaN NaN ... NaN
represent NaN NaN NaN NaN ... NaN
associate NaN 0.823779 0.683171 NaN ... NaN
induce NaN NaN NaN 0.815911 ... 0.738493
attenuate abrogate trigger stimulate evoke up-regulate \
be NaN NaN NaN NaN NaN NaN
represent NaN NaN NaN NaN 0.659131 0.658291
associate NaN NaN NaN NaN NaN NaN
induce 0.735202 0.726008 0.717865 0.713656 NaN NaN
synergize activate mediate
be NaN NaN NaN
represent 0.654106 0.651293 0.649428
associate NaN NaN NaN
induce NaN NaN NaN
[4 rows x 24 columns]
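A smaller sketch of the first approach, on a made-up three-key dictionary with rounded scores, shows that a verb shared by several keys ends up in a single column. The fillna(0) step is an addition here, assuming zeros are wanted for missing pairs as in the desired output.

```python
import pandas as pd

# Made-up miniature of the dictionary from the question
d = {
    "be": [("prove", 0.58), ("serve", 0.49)],
    "associate": [("interact", 0.82)],
    "represent": [("prove", 0.66), ("evoke", 0.65)],
}

# Turn each list of (term, score) tuples into a dict, so that from_dict
# can align the terms as column labels across all keys.
nested = {key: dict(pairs) for key, pairs in d.items()}
scores = pd.DataFrame.from_dict(nested, orient="index").fillna(0)
print(scores)
```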

pandas dataframe condition based on regex expression

TTT
1. 802010001-999-00000285-888-
2. 256788
3. 1940
4. NaN
5. NaN
6. 702010001-X-2YZ-00000285-888-
I want to fill the GGT column with all the values from TTT except the amounts (the purely numeric entries).
The required table would look like this:
TTT GGT
1. 802010001-999-00000285-888- 802010001-999-00000285-888-
2. 256788 NaN
3. 1940 NaN
4. NaN NaN
5. NaN NaN
6. 702010001-X-2YZ-00000285-888- 702010001-X-2YZ-00000285-888-
The original table has more than 200 thousand rows.
If you want to remove the rows with only numbers, you can use the match() method of the string elements of the column TTT. You can use code like this:
df["GGT"] = df["TTT"][df["TTT"].str.match(r'^(\d)+$')==False]
Use Series.mask:
df['GGT'] = df['TTT'].mask(pd.to_numeric(df['TTT'], errors='coerce').notna())
Or:
df['GGT'] = df['TTT'].mask(df["TTT"].astype(str).str.contains(r'^\d+$', na=True))
print (df)
TTT GGT
0 802010001-999-00000285-888- 802010001-999-00000285-888-
1 256788 NaN
2 1940 NaN
3 NaN NaN
4 702010001-X-2YZ-00000285-888- 702010001-X-2YZ-00000285-888-
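The same idea can also be sketched with Series.str.fullmatch and Series.where. This assumes the TTT values are stored as strings (object dtype); the sample frame below is rebuilt from the rows shown in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "TTT": [
            "802010001-999-00000285-888-",
            "256788",
            "1940",
            np.nan,
            "702010001-X-2YZ-00000285-888-",
        ]
    }
)

# True only where the whole string consists of digits; NaN counts as False
is_amount = df["TTT"].str.fullmatch(r"\d+", na=False)

# Keep the value where it is not an amount, otherwise leave NaN
df["GGT"] = df["TTT"].where(~is_amount)
print(df)
```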

Pandas Dataframe '[nan nan nan ... nan nan nan] not found in axis'

Receiving this error
'[nan nan nan ... nan nan nan] not found in axis'
When trying to drop columns from a dataframe if the value is zero
train_df.head()
external_company_id company_name email_domain mx_record ...
NaN Expresstext expresstext.net unknown expresstext.net ... 0.0 0.0 0.0 0.0
NaN Jobox jobox.ai unknown www.jobox.ai ... 17.0 -31.0 9.0 30.0
NaN Relola relola.com unknown home.relola.com ... 5.0 -25.0 5.0
train_df.drop(train_df[train_df['total_funding'] == float(0)].index, inplace = True, axis=0)
'[nan nan nan ... nan nan nan] not found in axis'
What would be causing this error?
I learned that read_csv can end up using the first column as the index.
Because my first column was empty, every index label ended up being NaN, as seen in the question above.
I ran these two lines of code to create a new index and name it.
train_df.index = range(1, len(train_df) + 1)
train_df.index.name = 'id'
Then the error disappeared.
Instead of:
train_df.drop(...)
try:
train_df = train_df[train_df['total_funding'] != 0]
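Both suggestions can be combined in a small sketch: rebuild the index first, then filter with a boolean mask, which does not depend on index labels at all. The funding figures below are made up for illustration, and the NaN index is constructed by hand to mimic the situation in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of train_df with an all-NaN index
train_df = pd.DataFrame(
    {
        "company_name": ["Expresstext", "Jobox", "Relola"],
        "total_funding": [0.0, 17.0, 5.0],
    },
    index=[np.nan, np.nan, np.nan],
)

# Replace the unusable NaN labels with a clean 0..n-1 index
train_df = train_df.reset_index(drop=True)

# Boolean-mask filtering keeps rows by position of the mask, not by label
funded = train_df[train_df["total_funding"] != 0]
print(funded)
```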

Python Pandas - Rolling regressions for multiple columns in a dataframe

I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression with a 250-day window for each column over the whole sample period and save the coefficient in another dataframe.
Iterating over the columns and rows using two for-loops isn't very efficient, so I tried the following, but I get this error message:
def regress(start, end):
    y = df_returns.iloc[start:end].values
    if np.isnan(y).any() == False:
        X = np.arange(len(y))
        X = sm.add_constant(X, has_constant="add")
        model = sm.OLS(y,X).fit()
        return model.params[1]
    else:
        return np.nan
regression_window = 250
for t in (regression_window, len(df_returns.index)):
    df_coef[t] = df_returns.apply(regress(t-regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')
Here is my version, using df.rolling() instead and iterating over the columns.
I am not completely sure it is what you were looking for, so don't hesitate to comment.
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2
df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32], '1': [60, 60, 60, 60, 60], '2': [np.nan] * 5})
def regress(X, Z):
    # X: one window of prices; Z: design matrix (constant + time index)
    if not np.isnan(X).any():
        model = sm.OLS(X, Z).fit()
        return model.params[1]  # slope on the time index
    else:
        return np.nan
regression_window = 3
Z = np.arange(regression_window)
Z = sm2.add_constant(Z, has_constant="add")
df_coef = pd.DataFrame()
for col in df_returns.columns:
    # raw=True passes each window to regress as a NumPy array
    df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda w: regress(w, Z), raw=True)
df_coef
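The TypeError in the question arises because regress(...) is called immediately, so apply receives its float result rather than a function. If only the slope of a straight-line time trend is needed, a lighter sketch is possible with np.polyfit in place of statsmodels OLS (a swapped-in technique that gives the same least-squares slope for this simple model). The helper name window_slope is hypothetical.

```python
import numpy as np
import pandas as pd

df_returns = pd.DataFrame(
    {"0": [30, 30, 31, 32, 32], "1": [60, 60, 60, 60, 60], "2": [np.nan] * 5}
)

regression_window = 3
x = np.arange(regression_window)  # time index 0..window-1 within each window


def window_slope(y):
    """Least-squares slope of y on x for one window; NaN if data is missing."""
    if np.isnan(y).any():
        return np.nan
    return np.polyfit(x, y, 1)[0]


# DataFrame.rolling(...).apply runs the function on each column's windows;
# the function object itself is passed, it is not called here.
df_coef = df_returns.rolling(window=regression_window).apply(window_slope, raw=True)
print(df_coef)
```

The first regression_window - 1 rows are NaN because no full window is available yet.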

I'm trying to modify a pandas data frame so that I will have 2 columns: a frequency column and a date column.

Basically, what I'm working with is a dataframe with all of the parking tickets given out in one year. Every ticket takes up its own row in the unaltered dataframe. What I want to do is group all the tickets by date so that I have 2 columns (date, and the amount of tickets issued on that day). Right now I can achieve that, however, the date is not considered a column by pandas.
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
'infraction_description', 'set_fine_amount', 'time_of_infraction',
'location1', 'location2', 'location3', 'location4',
'province']
df1 = df1.drop(unnecessary_cols, axis=1)
df1 = (df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
df1['frequency'] = (df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
print (df1)
df1 = (df1.iloc[121:274])
The output is:
date_of_infraction date_of_infraction frequency
20120101 1059 NaN
20120102 2711 NaN
20120103 6889 NaN
20120104 8030 NaN
20120105 7991 NaN
20120106 8693 NaN
20120107 7237 NaN
20120108 5061 NaN
20120109 7974 NaN
20120110 8872 NaN
20120111 9110 NaN
20120112 8667 NaN
20120113 7247 NaN
20120114 7211 NaN
20120115 6116 NaN
20120116 9168 NaN
20120117 8973 NaN
20120118 9016 NaN
20120119 7998 NaN
20120120 8214 NaN
20120121 6400 NaN
20120122 6355 NaN
20120123 7777 NaN
20120124 8628 NaN
20120125 8527 NaN
20120126 8239 NaN
20120127 8667 NaN
20120128 7174 NaN
20120129 5378 NaN
20120130 7901 NaN
... ... ...
20121202 5342 NaN
20121203 7336 NaN
20121204 7258 NaN
20121205 8629 NaN
20121206 8893 NaN
20121207 8479 NaN
20121208 7680 NaN
20121209 5357 NaN
20121210 7589 NaN
20121211 8918 NaN
20121212 9149 NaN
20121213 7583 NaN
20121214 8329 NaN
20121215 7072 NaN
20121216 5614 NaN
20121217 8038 NaN
20121218 8194 NaN
20121219 6799 NaN
20121220 7102 NaN
20121221 7616 NaN
20121222 5575 NaN
20121223 4403 NaN
20121224 5492 NaN
20121225 673 NaN
20121226 1488 NaN
20121227 4428 NaN
20121228 5882 NaN
20121229 3858 NaN
20121230 3817 NaN
20121231 4530 NaN
Essentially, I want to move all the columns over by one to the right. Right now pandas only considers the last two columns as actual columns. I hope this made sense.
The count of infractions per date should be achievable with just one call to groupby. Try this:
import numpy as np
import pandas as pd
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
'infraction_description', 'set_fine_amount', 'time_of_infraction',
'location1', 'location2', 'location3', 'location4',
'province']
df1 = df1.drop(unnecessary_cols, axis=1)
# reset_index() to move the dates into their own column
counts = df1.groupby('date_of_infraction').count().reset_index()
print(counts)
Note that any dates with zero tickets will not show up as 0; instead, they will simply be absent from counts.
If this doesn't work, it would be helpful for us to see the first few rows of df1 after you drop the unnecessary columns.
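A related sketch, on made-up data, uses size() so the count does not depend on which other columns remain, and then shows one way to make days without tickets appear as 0. It assumes date_of_infraction holds integers in YYYYMMDD form, as the printed output in the question suggests.

```python
import pandas as pd

# Made-up tickets: two on 1 Jan 2012, none on 2 Jan, one on 3 Jan
tickets = pd.DataFrame({"date_of_infraction": [20120101, 20120101, 20120103]})

# Two ordinary columns: the date and the number of tickets on that date
counts = (
    tickets.groupby("date_of_infraction")
    .size()
    .reset_index(name="frequency")
)
print(counts)

# Optional: include days with zero tickets by reindexing over every calendar day
dates = pd.to_datetime(tickets["date_of_infraction"].astype(str), format="%Y%m%d")
all_days = pd.date_range(dates.min(), dates.max(), freq="D")
daily = dates.value_counts().reindex(all_days, fill_value=0)
print(daily)
```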
Try using as_index=False.
For example:
import numpy as np
import pandas as pd
data = {"date_of_infraction":["20120101", "20120101", "20120202", "20120202"],
"foo":np.random.random(4)}
df = pd.DataFrame(data)
df
date_of_infraction foo
0 20120101 0.681286
1 20120101 0.826723
2 20120202 0.669367
3 20120202 0.766019
(df.groupby("date_of_infraction", as_index=False) # <-- acts like reset_index()
.foo.count()
.rename(columns={"foo":"frequency"})
)
date_of_infraction frequency
0 20120101 2
1 20120202 2
