I am doing exploratory data analysis, and I find myself reusing the same lines of code many times. I realized I could write a function for this, but I am new to Python and don't know exactly how to define one, so please help me.
textdata is my main dataframe, and tonumber and smstext are my variables:
# subsetting the textdata
mesbytonum = textdata[['tonumber', 'smstext']]
# calculating the number of messages by tonumber
messbytonum_freq = mesbytonum.groupby('tonumber').agg(len)
# resetting the index
messbytonum_freq.reset_index(inplace=True)
# sorting in descending order (DataFrame.sort was removed in modern pandas; use sort_values)
messbytonum_freq_result = messbytonum_freq.sort_values(['smstext'], ascending=[False])
# calculating percentages
messbytonum_freq_result['percentage'] = messbytonum_freq_result['smstext'] / sum(messbytonum_freq_result['smstext'])
# considering the top 10
top10tonum = messbytonum_freq_result.head(10)
# top10tonum
I have repeated similar code around 20 times, so I want to wrap the code above in a function to make my code shorter. Please help me define it.
Thanks in advance.
A function is defined like this:
def func(arg1, arg2, argN):
    # do something
    # you may need to return value(s) too
And called like this:
func(1, 2, 3)  # you can use anything instead of 1, 2 and 3
In your case it will be:
def MyFunc(textdata):
    mesbytonum = textdata[['tonumber', 'smstext']]
    messbytonum_freq = mesbytonum.groupby('tonumber').agg(len)
    messbytonum_freq.reset_index(inplace=True)
    messbytonum_freq_result = messbytonum_freq.sort_values(['smstext'], ascending=[False])
    messbytonum_freq_result['percentage'] = messbytonum_freq_result['smstext'] / sum(messbytonum_freq_result['smstext'])
    top10tonum = messbytonum_freq_result.head(10)
    return top10tonum  # or whatever you want to return
# use this function
result = MyFunc(<argument here>)
# then you need to use result somehow
Your function can also return multiple values:
return spam, egg
which you have to use like this:
mySpam, myEgg = MyFunc(<argument>)
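Putting it together, a complete sketch of such a function might look like this (note: `DataFrame.sort` was removed in modern pandas, so `sort_values` is used here; the column names follow the question, and the demo data is made up):

```python
import pandas as pd

def top_counts(df, group_col='tonumber', value_col='smstext', n=10):
    """Count rows per group, add a percentage column, return the top n groups."""
    freq = df[[group_col, value_col]].groupby(group_col).agg(len)
    freq.reset_index(inplace=True)
    result = freq.sort_values(value_col, ascending=False)
    result['percentage'] = result[value_col] / result[value_col].sum()
    return result.head(n)

# toy data to show the call
textdata = pd.DataFrame({
    'tonumber': ['111', '222', '111', '333', '111', '222'],
    'smstext':  ['a', 'b', 'c', 'd', 'e', 'f'],
})
top10tonum = top_counts(textdata)
```

The same function can then be reused for each of the ~20 repeated blocks, with different column names passed in as needed.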
How do I pass the following commands into the LaTeX environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip the table numbering for one panel)
In addition, I need to add parentheses and asterisks to the t-statistics, meaning row-specific formatting of the dataframes.
For example:
Current:
variable             value
const                2.439628
t stat               13.921319
FamFirm              0.114914
t stat               0.351283
founder              0.154914
t stat               2.351283
Adjusted R Square    0.291328
I want this:
variable             value
const                2.439628
t stat               (13.921319)***
FamFirm              0.114914
t stat               (0.351283)
founder              0.154914
t stat               (1.651283)**
Adjusted R Square    0.291328
I'm writing my research papers in DataSpell. All empirical work is in Python, and I then use LaTeX (TexiFy) to create the PDF within DataSpell. With this workflow I can't hand-edit tables in the LaTeX code, because they get overwritten every time I run the Jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
# drop index to column
panel_a.reset_index(inplace=True)
# write LaTeX index and cut names to appropriate length
ind_list = [
    "ageFirm",
    "meanAgeF",
    "lnAssets",
    "bsVol",
    "roa",
    "fndrCeo",
    "lnQ",
    "sic",
    "hightech",
    "nonFndrFam",
]
# assign the list of values to the column
panel_a["index"] = ind_list
# format column names
header = ["", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header
with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a': ['some_var', 't stat'], 'b': [1.01235, 2.01235]})
df.style.format({'a': str,
                 'b': lambda x: "{:.3f}".format(x) if x < 2 else '({:.3f})***'.format(x)})
Result: a styled table in which 1.01235 renders as 1.012 and 2.01235 as (2.012)***.
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do: .format(func, subset='value')). So, this is great if each value itself is evaluated (x < 2).
The problem in your case is that the evaluation is over some other value, namely a (not supplied) P value combined with panel_a['variable'] == 't stat'. Now, assuming you have those P values in a different column, I suggest you create a for loop to populate a list that becomes like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

panel_a.style.format({'variable': str, 'value': func})
Result: each value row keeps plain formatting while the t-stat rows are wrapped in parentheses with asterisks, as in the desired output.
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice: if you modify the list again before calling func, you are unlikely to get the expected behaviour, or worse, you may get an error that is difficult to track down. I'm not sure how to remedy this other than simply converting all the floats in panel_a.value to strings in place. In that case you no longer need .format, but it alters your df, which is also not ideal. You could make a copy first (df2 = df.copy()), but that costs memory.
Anyway, hope this helps. In full, you add this to your code as follows:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering',
        ))
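As an aside, a hypothetical variant (not from the answer above) that avoids mutating a global list is to close over an iterator, so each call consumes the next format string:

```python
def make_formatter(formats):
    """Return a formatter that uses one format string per call, in order."""
    it = iter(formats)
    def fmt_next(v):
        return next(it).format(v)
    return fmt_next

# one format string per value, in row order
fmt = make_formatter(['{:.3f}', '({:.3f})***'])
out = [fmt(2.439628), fmt(13.921319)]
```

The same caveat applies: the formatter is stateful, so build a fresh one immediately before calling `.format`.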
I have a very long dictionary containing more than 1 million keys.
df['file'] = files
df['features'] = df.apply(lambda row: getVector(model, row['file']), axis=1)
Below is the getVector function:
def getVector(model, file):
    # inp is the image array loaded from file (preprocessing not shown in the question)
    vector = model.predict(inp.reshape(1, 100, 100, 1))
    print(file + " is added.")
    return vector
It prints "blahblah.jpg is added", but that is not much use, since I do not know how many files have been processed.
My question is: how do I get the count of files that have been processed?
For example,
1200 out of 1,000,000 is processed.
or even just
1200
I do not need a neat and clean output. I just want to know how many files have been processed.
Any idea would be appreciated.
You can use a decorator to count it:
import functools

def count(func):
    @functools.wraps(func)
    def wrappercount(*args):
        result = func(*args)
        func.counter += 1
        print(func.counter, "of 1,000,000 is processed.")
        return result
    func.counter = 0
    return wrappercount

@count
def getVector(*args):
    ...  # write code and add arguments
Learn more about decorators here.
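A minimal, self-contained sketch of the counting-decorator idea (the image-processing body is replaced by a toy stand-in, and the count is stored on the wrapper so it can also be read after the loop):

```python
import functools

def count(func):
    """Wrap func so each call increments and prints a counter."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        wrapper.counter += 1
        print(wrapper.counter, "of 1,000,000 is processed.")
        return result
    wrapper.counter = 0
    return wrapper

@count
def get_vector(file):
    # stand-in for the real model.predict(...) work
    return file.upper()

for f in ["a.jpg", "b.jpg", "c.jpg"]:
    get_vector(f)
```

After the loop, `get_vector.counter` holds the number of processed files.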
If what you have is a class and you want to count how many times it is instantiated, you can use the following:
class getVector:
    counter = 0
    def __init__(self):  # add your arguments here
        # write code here
        getVector.counter += 1
        print(getVector.counter, "of 1,000,000 is processed.")
How about creating a column with incremental numbers in the initial dataframe? You would pass this column as an extra input to your getVector function and print its value.
Something like this:
df['count'] = range(1, len(df) + 1)
df['features'] = df.apply(lambda row: getVector(model, row['file'], row['count']), axis=1)
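A minimal sketch of that idea with toy data (the model argument is replaced by a stand-in so the example runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'file': ['a.jpg', 'b.jpg', 'c.jpg']})
df['count'] = range(1, len(df) + 1)   # incremental numbers

def get_vector(file, count, total):
    print(f"{count} out of {total} is processed.")
    return file.upper()               # stand-in for model.predict(...)

df['features'] = df.apply(
    lambda row: get_vector(row['file'], row['count'], len(df)), axis=1)
```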
I want to generate a number of notes in a music database, but I don't know how to use an iteration such as a for loop to simplify the following. I hope someone can shed some light on this for me.
note1 = note.Note(tune_array_bass[0])
note2 = note.Note(tune_array_bass[1])
note3 = note.Note(tune_array_bass[2])
note4 = note.Note(tune_array_bass[3])
note5 = note.Note(tune_array_bass[4])
note6 = note.Note(tune_array_bass[5])
note7 = note.Note(tune_array_bass[6])
note8 = note.Note(tune_array_bass[7])
note9 = note.Note(tune_array_bass[8])
note10 = note.Note(tune_array_bass[9])
note11 = note.Note(tune_array_bass[10])
note12 = note.Note(tune_array_bass[11])
note13 = note.Note(tune_array_bass[12])
note14 = note.Note(tune_array_bass[13])
I didn't exactly understand what you meant, but here's my solution to list the notes using a loop:
class Note:
    def __init__(self, note):
        self.note = note

tune_array_bass = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
for item in tune_array_bass:
    new_notes = Note(item)
    print(new_notes)
You could store the objects in a list instead:
notes = []
for i in range(14):
    notes.append(note.Note(tune_array_bass[i]))
Then you can access each note object using notes[0], notes[1], etc.
Don't create many individual variables like this.
Instead, you could use a list:
notes = [note.Note(tune_array_bass[i]) for i in range(14)]
# get a note from the list:
notes[0]
Or a dictionary:
notes = {f"Note{i+1}":note.Note(tune_array_bass[i]) for i in range(14)}
# get a note from the dict:
notes["Note1"]
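Both variants can be tried end-to-end with a stand-in Note class (the real `note.Note` presumably comes from a music library, so a toy class is used here):

```python
class Note:
    """Stand-in for note.Note, just storing the pitch it was given."""
    def __init__(self, pitch):
        self.pitch = pitch

tune_array_bass = list(range(14))  # made-up data

# list: access by position
notes = [Note(p) for p in tune_array_bass]

# dict: access by generated name
named = {f"Note{i + 1}": Note(p) for i, p in enumerate(tune_array_bass)}
```

The list form is usually preferable when the notes are only ever used in order.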
Can't get my mind around this...
I read a bunch of spreadsheets, do a bunch of calculations, and then want to create a summary DF from each set of calculations. I can create the initial df but don't know how to control my loops so that I:
create the initial DF (first time through the loop), and
if it has been created, append the next DF (last two rows) for each additional tab.
I just can't wrap my head around how to build the right nested loop so that once the first one is done, the subsequent ones get appended.
My current code looks like this (it just dumbly prints each tab's results separately rather than building a consolidated sumdf with just the last 2 rows of each tab's results):
# make summary
area_tabs = ['5', '12']
for area_tab in area_tabs:
    actdf, aname = get_data(area_tab)
    lastq, fcast_yr, projections, yrahead, aname, actdf, merged2, mergederrs, montdist, ols_test, mergedfcst = do_projections(actdf)
    sumdf = merged2[-2:]
    sumdf['name'] = aname  # <<< I'll be doing a few more calculations here as well
    print(sumdf)
Still a newb learning basic Python loop techniques :-(
Often a neater way than writing for loops, especially if you are planning on using the result, is to use a list comprehension over a function:
def get_sumdf(area_tab):  # perhaps you can name this better?
    actdf, aname = get_data(area_tab)
    lastq, fcast_yr, projections, yrahead, aname, actdf, merged2, mergederrs, montdist, ols_test, mergedfcst = do_projections(actdf)
    sumdf = merged2[-2:]
    sumdf['name'] = aname  # <<< I'll be doing a few more calculations here as well
    return sumdf

[get_sumdf(area_tab) for area_tab in area_tabs]
and concat:
pd.concat([get_sumdf(area_tab) for area_tab in area_tabs])
or you can also use a generator expression:
pd.concat(get_sumdf(area_tab) for area_tab in area_tabs)
To explain my comment re named tuples and dictionaries, I think this line is difficult to read and ripe for bugs:
lastq, fcast_yr, projections, yrahead, aname, actdf, merged2, mergederrs, montdist, ols_test, mergedfcst = do_projections(actdf)
A trick is to have do_projections return a named tuple rather than a plain tuple:
from collections import namedtuple

Projection = namedtuple('Projection', ['lastq', 'fcast_yr', 'projections', 'yrahead', 'aname', 'actdf', 'merged2', 'mergederrs', 'montdist', 'ols_test', 'mergedfcst'])
then inside do_projections:
return (1, 2, 3, 4, ...)  # don't do this
return Projection(1, 2, 3, 4, ...)  # do this
return Projection(lastq=lastq, fcast_yr=fcast_yr, ...)  # or this
I think this avoids bugs and is a lot cleaner, especially when accessing the results later:
projections = do_projections(actdf)
projections.aname
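A tiny self-contained illustration of the named-tuple pattern, with made-up fields and values:

```python
from collections import namedtuple

# a trimmed-down version of the Projection tuple from the answer
Projection = namedtuple('Projection', ['lastq', 'aname', 'merged2'])

def do_projections():
    # stand-in computation returning named fields
    return Projection(lastq='Q4', aname='area5', merged2=[1, 2])

p = do_projections()
print(p.aname)  # fields are accessed by name, not position
```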
Do the initialisation outside the for loop, then append inside it. Something like this:
# make summary
area_tabs = ['5', '12']
if not area_tabs:
    return  # nothing to do
# init the first frame
actdf, aname = get_data(area_tabs[0])
lastq, fcast_yr, projections, yrahead, aname, actdf, merged2, mergederrs, montdist, ols_test, mergedfcst = do_projections(actdf)
sumdf = merged2[-2:]
sumdf['name'] = aname
for area_tab in area_tabs[1:]:
    actdf, aname = get_data(area_tab)
    lastq, fcast_yr, projections, yrahead, aname, actdf, merged2, mergederrs, montdist, ols_test, mergedfcst = do_projections(actdf)
    nextdf = merged2[-2:]
    nextdf['name'] = aname  # <<< I'll be doing a few more calculations here as well
    sumdf = pd.concat([sumdf, nextdf])
print(sumdf)
You can further improve the code by putting the common steps into a function.
OK, I am using different taggers to tag a text: default, unigram, bigram and trigram.
I have to check which combination of three of those four taggers is the most accurate.
To do that I have to loop through all the possible combinations, which I do like this:
permutaties = list(itertools.permutations(['default_tagger', 'unigram_tagger',
                                           'bigram_tagger', 'trigram_tagger'], 3))
resultaten = []
for element in permutaties:
    resultaten.append(accuracy(element))
so each element is a tuple of three tag methods, for example: ('default_tagger', 'bigram_tagger', 'trigram_tagger').
In the accuracy function I now have to dynamically call the three accompanying methods of each tagger; the problem is, I don't know how to do this.
The tagger functions are as follows:
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=backofff)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=backofff)
trigram_tagger = nltk.TrigramTagger(brown_train, backoff=backofff)
default_tagger = nltk.DefaultTagger('NN')
So for the example the code should become:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.BigramTagger(brown_train, backoff=t0)
t2 = nltk.TrigramTagger(brown_train, backoff=t1)
t2.evaluate(brown_test)
So in essence the problem is how to iterate through all 24 combinations of that list of 4 functions.
Any Python Masters that can help me?
Not sure if I understood what you need, but you can use the classes you want to call themselves instead of strings, so your code could become something like:
permutaties = itertools.permutations([nltk.UnigramTagger, nltk.BigramTagger, nltk.TrigramTagger, nltk.DefaultTagger], 3)
resultaten = []
for element in permutaties:
    resultaten.append(accuracy(element, brown_train, brown_test))

def accuracy(element, brown_train, brown_test):
    if element is nltk.DefaultTagger:
        evaluator = element("NN")
    else:
        evaluator = element(brown_train, backoff=XXX)
        # maybe insert more elif clauses to retrieve the proper backoff
        # parameter -- or you could use a tuple in the call to permutations
        # so the appropriate backoff is available for each function to be called
    return evaluator.evaluate(brown_test)  # ? I am not sure from your code if this is your intent
Starting with jsbueno's code, I suggest writing a wrapper function for each of the taggers to give them the same signature. And since you only need them once, I suggest using lambdas:
permutaties = itertools.permutations([lambda: nltk.DefaultTagger("NN"),
                                      lambda: nltk.UnigramTagger(brown_train, backoff),
                                      lambda: nltk.BigramTagger(brown_train, backoff),
                                      lambda: nltk.TrigramTagger(brown_train, backoff)], 3)
This would allow you to call each directly, without a special function that figures out which function you're calling and employs the appropriate signature.
Basing this on jsbueno's code, I think you want to reuse evaluator as the backoff argument, so the code should be:
permutaties = itertools.permutations([nltk.UnigramTagger, nltk.BigramTagger, nltk.TrigramTagger, nltk.DefaultTagger], 3)
resultaten = []
for element in permutaties:
    resultaten.append(accuracy(element, brown_train, brown_test))

def accuracy(element, brown_train, brown_test):
    evaluator = None
    for e in element:
        if e is nltk.DefaultTagger:
            evaluator = e("NN")
        else:
            evaluator = e(brown_train, backoff=evaluator)
    return evaluator.evaluate(brown_test)  # I am not sure from your code if this is your intent
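To see the backoff-chaining logic run end-to-end, here is a sketch with stub classes standing in for the NLTK taggers (the names and behaviour are invented for illustration; the stubs only record how they were chained):

```python
import itertools

class DefaultTagger:
    def __init__(self, tag):
        self.name, self.backoff = "default", None

class UnigramTagger:
    def __init__(self, train, backoff=None):
        self.name, self.backoff = "unigram", backoff

class BigramTagger:
    def __init__(self, train, backoff=None):
        self.name, self.backoff = "bigram", backoff

def build_chain(classes, train):
    """Instantiate taggers in order, each backed off to the previous one."""
    tagger = None
    for cls in classes:
        if cls is DefaultTagger:
            tagger = cls('NN')
        else:
            tagger = cls(train, backoff=tagger)
    return tagger

chains = [build_chain(p, train=[])
          for p in itertools.permutations([DefaultTagger, UnigramTagger, BigramTagger], 3)]
```

With the real NLTK classes you would then call `.evaluate(brown_test)` on the final tagger of each chain to compare accuracies.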