index a column of a dataframe regarding other columns - python

I have provided this data frame.
As you can see, I have three indices: Chapter, ParaIndex (paragraph index), and SentIndex (sentence index). I have 70 chapters, 1699 paragraphs, and 6999 sentences, and each of these indices starts from the beginning (0 or 1). The problem is that I want to make a widget that calls a "specific sentence" placed in a specific paragraph of a specific chapter, something like this:
https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6
but for extracting a specific sentence in a specific paragraph of a specific chapter.
I think I should have another index (like ChapParaSent, an abbreviation of all three), or even a multidimensional index, which shows where exactly each sentence is placed.
Any idea how I can provide that using ipywidgets?
https://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
@interact
def showDetail(Chapter=(1, 70), ParaIndex=(0, 1699), SentIndex=(0, 6999)):
    return df.loc[(df.Chapter == Chapter) & (df.ParaIndex == ParaIndex) & (df.SentIndex == SentIndex)]
The problem with this is that we do not know how many paragraphs each chapter has, nor from which number the SentIndex starts within each paragraph, so most of the time we get no result.
The aim is to adapt this (or define a new index) in such a way that, as the slider values change, we always get exactly one unique sentence.
For example, here I get a result:
but when I change it to this:
I do not get any result. The reason is obvious: there is no index combination 1-2-1, since in chapter 1, paragraph index 2, SentIndex starts from 2!
One solution I saw was a complete definition of a multidimensional data frame, but I need something simpler that I can use with ipywidgets...
many many thanks

I'm sure there is an easier solution out there, but this works, I guess.
import pandas as pd

data = [
    dict(Chapter=0, ParaIndex=0, SentIndex=0, content="0"),
    dict(Chapter=1, ParaIndex=1, SentIndex=1, content="a"),
    dict(Chapter=1, ParaIndex=1, SentIndex=2, content="b"),
    dict(Chapter=2, ParaIndex=2, SentIndex=3, content="c"),
    dict(Chapter=2, ParaIndex=2, SentIndex=4, content="d"),
    dict(Chapter=2, ParaIndex=3, SentIndex=5, content="e"),
    dict(Chapter=3, ParaIndex=4, SentIndex=6, content="f"),
]
df = pd.DataFrame(data)

def showbyindex(target_chapter, target_paragraph, target_sentence):
    # all rows of the requested chapter
    df_chapter = df.loc[df.Chapter == target_chapter]
    # paragraphs are addressed by their position within the chapter,
    # not by their global ParaIndex value
    unique_paragraphs = df_chapter.ParaIndex.unique()
    paragraph_idx = unique_paragraphs[target_paragraph]
    df_paragraph = df_chapter.loc[df_chapter.ParaIndex == paragraph_idx]
    # sentences are likewise addressed by their position within the paragraph
    return df_paragraph.iloc[target_sentence]

showbyindex(target_chapter=2, target_paragraph=0, target_sentence=1)
Edit:
If you want the sliders to stay within a valid range, you can define IntSliders for your interact decorator:
import ipywidgets as widgets
from ipywidgets import interact

chapter_slider = widgets.IntSlider(min=0, max=max(df.Chapter.unique()), step=1, value=0)
paragraph_slider = widgets.IntSlider(min=0, max=1, step=1, value=0)
sentence_slider = widgets.IntSlider(min=0, max=1, step=1, value=0)

@interact(target_chapter=chapter_slider, target_paragraph=paragraph_slider, target_sentence=sentence_slider)
Now you have to check the valid number of paragraphs/sentences within your showbyindex function and set the sliders' value/max accordingly:
if (...):
    paragraph_slider.max = ...
    ...
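For illustration, here is a minimal sketch of how that clamping could look, using the toy df and the sliders defined above (this is just one possible way to wire it up, not the only one):
@interact(target_chapter=chapter_slider, target_paragraph=paragraph_slider, target_sentence=sentence_slider)
def showbyindex(target_chapter, target_paragraph, target_sentence):
    # limit the paragraph slider to the paragraphs that actually exist in this chapter
    df_chapter = df.loc[df.Chapter == target_chapter]
    unique_paragraphs = df_chapter.ParaIndex.unique()
    paragraph_slider.max = len(unique_paragraphs) - 1
    target_paragraph = min(target_paragraph, paragraph_slider.max)
    # limit the sentence slider to the sentences of the selected paragraph
    df_paragraph = df_chapter.loc[df_chapter.ParaIndex == unique_paragraphs[target_paragraph]]
    sentence_slider.max = len(df_paragraph) - 1
    target_sentence = min(target_sentence, sentence_slider.max)
    return df_paragraph.iloc[target_sentence]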

Related

Pandas Styler.to_latex() - how to pass commands and do simple editing

How do I pass the following commands into the latex environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip for a panel the table numbering)
In addition, I would need to add parentheses and asterisks to the t-statistics, meaning row-specific formatting on the dataframes.
For example:
Current:

variable             value
const                2.439628
t stat               13.921319
FamFirm              0.114914
t stat               0.351283
founder              0.154914
t stat               2.351283
Adjusted R Square    0.291328
I want this:

variable             value
const                2.439628
t stat               (13.921319)***
FamFirm              0.114914
t stat               (0.351283)
founder              0.154914
t stat               (1.651283)**
Adjusted R Square    0.291328
I'm doing my research papers in DataSpell. All empirical work is in Python, and then I use LaTeX (TeXiFy) to create the PDF within DataSpell. Due to this workflow, I can't edit the tables in the LaTeX code, because they get overwritten every time I run the Jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
import os

# drop index to column
panel_a.reset_index(inplace=True)

# write LaTeX index and cut names to appropriate length
ind_list = [
    "ageFirm",
    "meanAgeF",
    "lnAssets",
    "bsVol",
    "roa",
    "fndrCeo",
    "lnQ",
    "sic",
    "hightech",
    "nonFndrFam",
]

# assign the list of values to the column
panel_a["index"] = ind_list

# format column names
header = ["", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header

with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a': ['some_var', 't stat'], 'b': [1.01235, 2.01235]})
df.style.format({'a': str,
                 'b': lambda x: "{:.3f}".format(x) if x < 2 else '({:.3f})***'.format(x)})
Result:
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do: .format(func, subset='value')). So, this is great if each value itself is evaluated (x < 2).
The problem in your case is that the evaluation is over some other value, namely a (not supplied) P value combined with panel_a['variable'] == 't stat'. Now, assuming you have those P values in a different column, I suggest you create a for loop to populate a list that becomes like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

panel_a.style.format({'variable': str, 'value': func})
Result:
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice; e.g. if you modify the list again before calling func, its functionality is unlikely to result in the expected behaviour or worse, it may throw an error that is difficult to track down. I'm not sure how to remedy this other than simply turning all the floats into strings in panel_a.value inplace. In that case, of course, you don't need .format anymore, but it will alter your df and that's also not ideal. I guess you could make a copy first (df2 = df.copy()), but that will affect memory.
Anyway, hope this helps. So, in full you add this as follows to your code:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering'
        ))
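If you prefer to avoid the pop-based func altogether, a rough sketch of the string-conversion alternative mentioned above could look like this (working on a copy of panel_a; the stars list here is purely hypothetical and would in practice be derived from your P values):
panel_b = panel_a.copy()  # keep the original numeric values intact
stars = ['', '***', '', '', '', '**', '']  # hypothetical significance markers, one per row
panel_b['value'] = [
    '({:.3f}){}'.format(v, s) if name == 't stat' else '{:.3f}'.format(v)
    for name, v, s in zip(panel_b['variable'], panel_b['value'], stars)
]
# the column now holds strings, so no .format() call is needed before .to_latex()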

Length of values does not match length of index when using pandas

I'm getting 'ValueError: Length of values does not match length of index' while using Pandas. I'm reading in data from an Excel spreadsheet using Pandas' pd.read_excel method. I then filter the data using Pandas' filter method. I've created 'dataSubset' to represent the filtered data, and I use 'dataSubset' to create several 'mean' columns, each representing the mean of multiple columns. I then create 'finalData', which uses pd.concat to concatenate all of the calculated mean columns together. This code runs perfectly; however, if I uncomment any of the additional columns, the code blows up and gives the aforementioned error.
What am I doing wrong? It works as long as I don't concatenate more than it wants.
Help.
import pandas as pd

dataIn = pd.read_excel('IDT/IDT-B.xlsx')

dataSubset = dataIn.filter([
    "First Name",
    "Last Name",
    "2.14.2 Control Structures Example Quiz",
    "2.14.4 Random Hurdles",
    "2.15.2 Quiz: Which Control Structure?",
    "2.16.2 How to Indent Your Code Quiz",
    "2.16.4 Diagonal",
    "2.16.5 Staircase",
    "2.17.2 Debugging Basics",
    "2.17.6 Debugging with Error Messages",
    "3.2.6 Programming with Karel Quiz",
    "5.1.2 Hello World Quiz",
    "5.1.4 Your Name and Hobby",
    "5.2.2 Variables Quiz",
    "5.2.4 Daily Activities",
    "5.3.2 User Input Quiz",
    "5.3.4 Dinner Plans",
    "5.4.2 Basic Math in JavaScript Quiz",
    "5.4.6 T-Shirt Shop",
    "5.4.7 Running Speed",
    "5.5.2 JavaScript Graphics Quiz",
    "5.5.8 Flag of the Netherlands",
    "5.5.9 Snowman",
    "5.6.2 Using RGB to Create Colors",
    "5.6.4 Exploring RGB",
    "5.6.5 Making Yellow",
    "5.6.6 Rainbow",
    "5.6.7 Create a Color Image!",
    "6.1.1 Ghost",
    "6.1.2 Fried Egg",
    "6.1.3 Draw Something",
    "6.1.4 JavaScript and Graphics Quiz"
], axis=1)

# If any of these dataframes are uncommented, the code blows up.
# dataSubset['aver_2.14'] = dataSubset[["2.14.2 Control Structures Example Quiz",
#                                       "2.14.4 Random Hurdles"]],
# dataSubset['aver_2.15'] = dataSubset[["2.15.2 Quiz: Which Control Structure?"]],
#
# dataSubset['aver_2.16'] = dataSubset[["2.16.2 How to Indent Your Code Quiz",
#                                       "2.16.4 Diagonal"]],
#
# dataSubset['aver_2.17'] = dataSubset[["2.17.2 Debugging Basics",
#                                       "2.17.6 Debugging with Error Messages", ]]

dataSubset['unit_quiz_326'] = dataSubset[["3.2.6 Programming with Karel Quiz"]]
dataSubset['aver_5.1'] = dataSubset[["5.1.2 Hello World Quiz",
                                     "5.1.4 Your Name and Hobby"]].mean(axis=1)
dataSubset['aver_5.2'] = dataSubset[["5.2.2 Variables Quiz",
                                     "5.2.4 Daily Activities"]].mean(axis=1)
dataSubset['aver_5.3'] = dataSubset[["5.3.2 User Input Quiz",
                                     "5.3.4 Dinner Plans"]].mean(axis=1)
dataSubset['aver_5.4'] = dataSubset[["5.4.2 Basic Math in JavaScript Quiz",
                                     "5.4.6 T-Shirt Shop",
                                     "5.4.7 Running Speed"]].mean(axis=1)
dataSubset['aver_5.5'] = dataSubset[["5.5.2 JavaScript Graphics Quiz",
                                     "5.5.8 Flag of the Netherlands",
                                     "5.5.9 Snowman"]].mean(axis=1)
dataSubset['aver_5.6'] = dataSubset[["5.6.2 Using RGB to Create Colors",
                                     "5.6.4 Exploring RGB",
                                     "5.6.5 Making Yellow",
                                     "5.6.6 Rainbow",
                                     "5.6.7 Create a Color Image!"]].mean(axis=1)
dataSubset['aver_6.1'] = dataSubset[["6.1.1 Ghost",
                                     "6.1.2 Fried Egg",
                                     "6.1.3 Draw Something",
                                     "6.1.4 JavaScript and Graphics Quiz"]].mean(axis=1)

finalData = pd.concat([dataSubset['First Name'],
                       dataSubset['Last Name'],
                       dataSubset['unit_quiz_326'],
                       # dataSubset['aver_2.14'],
                       # dataSubset['aver_2.15'],
                       # dataSubset['aver_2.16'],
                       # dataSubset['aver_2.17'],
                       dataSubset['aver_5.1'],
                       dataSubset['aver_5.2'],
                       dataSubset['aver_5.3'],
                       dataSubset['aver_5.4'],
                       dataSubset['aver_5.5'],
                       dataSubset['aver_5.6'],
                       dataSubset['aver_6.1']], axis=1)

finalData.to_excel('output/gradesOut.xlsx')
Cause of ValueError
Based on this line:
dataSubset['aver_2.15'] = dataSubset[["2.15.2 Quiz: Which Control Structure?"]],
Because the right side of the assignment has a trailing comma, the line is equivalent to this:
dataSubset['aver_2.15'] = (dataSubset[["2.15.2 Quiz: Which Control Structure?"]], )
Basically, the line is trying to perform the following assignment:
pandas.Series <-- Tuple[pandas.DataFrame] # tuple with length 1
So there is a length mismatch between the left side (assignment target) and the right side (object that should be assigned to the target):
Left side: Length of the Series (think "number of rows")
Right side: One
Why is it pandas.Series on the left, but pandas.DataFrame on the right?
If you use single square brackets, you get a Series object: s = df['a']
If you use double square brackets, you get a DataFrame object: df2 = df[['a']]
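Here is a tiny self-contained reproducer of the same mismatch (toy data, just to show the mechanism):
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df["b"] = (df[["a"]], )  # the trailing comma wraps the DataFrame in a tuple of length 1
# raises ValueError: Length of values does not match length of index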
Possible solution
It seems you want to combine multiple columns into a new column. In one of the working lines, you take the mean of two columns with .mean(axis=1):
dataSubset['aver_5.1'] = dataSubset[["5.1.2 Hello World Quiz",
"5.1.4 Your Name and Hobby"]].mean(axis=1)
So, to fix your code, you probably need to:
Remove trailing commas
Add mean() or some other "combining function" to the lines where you select multiple columns
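For example, applied to two of the commented-out lines from the question, the fixed assignments could look like this (assuming a plain mean is the intended way of combining the columns):
dataSubset['aver_2.14'] = dataSubset[["2.14.2 Control Structures Example Quiz",
                                      "2.14.4 Random Hurdles"]].mean(axis=1)   # no trailing comma
dataSubset['aver_2.15'] = dataSubset["2.15.2 Quiz: Which Control Structure?"]  # single column: single brackets give a Series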

Apply operation and a division operation in the same step using Python

I am trying to get the proportion of nouns in my text using the code below, and it is giving me an error. I am using a function that calculates the number of nouns in my text, and I have the overall word count in a different column.
pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS']
}

def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag, value in x.items():
            if tag in pos_family[flag]:
                cnt += value
    except:
        pass
    return cnt

df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used the nltk package to get the counts by PoS tag, and I have the counts in a dictionary in the PoS_Count column of my dataframe.
If I remove "/df2['word_count']" on the first run to get just the noun count, then add it back and run again, it works fine; but if I run it like this the first time, I get the error below.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas series, but you need to use a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
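For completeness, the row-wise variant hinted at above could look roughly like this (a sketch assuming both columns live in df2); it is usually slower than the vectorized division, which is why the one-liner above is preferable:
df2['noun_count'] = df2.apply(
    lambda row: check_pos_tag(row['PoS_Count'], 'noun') / row['word_count'],
    axis=1
)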

grouped box plots are too narrow to read

Actually everything works just fine: I am trying to compare different groups from my survey in one chart, thus I wrote the following code in Python (Jupyter Notebook):
for value in values:
    catpool = getcat()
    py.offline.init_notebook_mode()
    data = []
    for cata in catpool:
        for con in constraints:
            data.append(go.Box(y=getdf(value, cata, con[0])['Y' + value], x=con[1], name=cata, showlegend=False, boxmean='sd',
                               # boxpoints='all',
                               jitter=0.3,
                               pointpos=0))
    layout = go.Layout(title="categorie: " + getclearname(value) + " - local space syntax measurements on placements<br>",
                       yaxis=dict(title='percentage of range between flat extremes'),
                       boxmode='group', showlegend=True)
    fig = go.Figure(data=data, layout=layout)
    py.offline.iplot(fig, filename='pandas-box-plot')
The function getdf queries a column from a database.
Unfortunately, it results in an unreadable diagram like this:
(a diagram with too narrow box plots)
Is it possible to give the groups less spacing and, accordingly, the box plots within each group more? Or anything else that would make it more readable?
Thank you
I solved the problem on my own:
The strange behaviour occurred because of the for cata in catpool loop: data for box plots in groups has to contain the group value in the data frame. So I did not loop at this place, but instead looped while concatenating the SQL statement, so the same queries were carried out but joined together by "UNION", like this:
def formstmtelse(val, category, ext, constraint, nine=True):
    stmt = ""
    if nine:
        matrix = ['11', '12', '13', '21', '22', '23', '31', '32', '33']
    else:
        matrix = ['11']
    j = 0
    for cin in category:
        j = j + 1
        if j > 1:
            stmt = stmt + " UNION "
        m = 0
        for cell in matrix:
            m = m + 1
            if m > 1:
                stmt = stmt + """UNION
"""
            stmt = stmt + """SELECT '""" + cin + """' AS cata,
"""
            if ext:
                stmt = stmt + """(((""" + val + """-MIN""" + val + """)/(MAX""" + val + """-MIN""" + val + """))*100) AS Y""" + val
            else:
                stmt = stmt + """(((""" + val + """)/(MAX""" + val + """))*100) AS Y""" + val
            stmt = stmt + """
FROM `stests`
JOIN ssubject
ON ssubject.ssubjectID=stests.ssubjectID
JOIN scountries
ON CountryGrownUpIn=scountriesID
JOIN scontinents
ON Continent=scontinentsID
JOIN gridpoint
ON stests.FlatID=gridpoint.GFlatID AND G""" + cin + cell + """=GIntID
JOIN
(
SELECT GFlatID AS GSUBFlatID"""
            stmt = stmt + """,
MAX(""" + val + """) AS MAX""" + val + """,MIN(""" + val + """) AS MIN""" + val
            stmt = stmt + """
FROM gridpoint
GROUP BY GSUBFlatID
)
AS virtualtable
ON FlatID=virtualtable.GSUBFlatID
""" + constraint + " "
    return stmt
That solves the problem. The 'x' attribute of the Box call must be a list or dataframe, not just a single string, even if the string is always the same. I guess the single string was causing 'invisible' x-values, making the automatic scaling impossible; in other words, there was a new legend entry for each query with a new constraint and for each category, whereas only one dataframe is needed per constraint. After changing this, fine adjustments can be made by adding the boxgroupgap or boxgap attribute to the layout...
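As a minimal self-contained sketch of the pattern that ended up working (toy data standing in for the UNION query results; one trace per constraint, with the category column supplying the x values):
import pandas as pd
import plotly.graph_objs as go
import plotly.offline as py

py.init_notebook_mode()

# toy data: each constraint yields one dataframe that already carries its
# category label in a 'cata' column, one row per measurement
df_con1 = pd.DataFrame({'cata': ['A', 'A', 'B', 'B'], 'Yval': [10, 20, 30, 40]})
df_con2 = pd.DataFrame({'cata': ['A', 'A', 'B', 'B'], 'Yval': [15, 25, 35, 45]})

data = [
    go.Box(y=df_con1['Yval'], x=df_con1['cata'], name='constraint 1', boxmean='sd'),
    go.Box(y=df_con2['Yval'], x=df_con2['cata'], name='constraint 2', boxmean='sd'),
]
layout = go.Layout(boxmode='group', boxgroupgap=0.1, boxgap=0.05)
py.iplot(go.Figure(data=data, layout=layout))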
please excuse my english

Why does my association model find subgroups in a dataset when there shouldn't be any?

I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.
def Freedman_Diaconis(column_values):
    # sort the list first
    column_values[1].sort()

    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)

    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]

    iqr = tq_value - fq_value
    n_to_pow = len(column_values[1])**(-1/3)
    h = 2 * iqr * n_to_pow
    retval = (column_values[1][-1] - column_values[1][1])/h

    test = int(retval+1)
    return test
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix + str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1]) + 'x' + str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    # for filename in os.listdir("."):
    #     if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')

    down_length = len(list_of_lists[0])

    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)

print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is,
why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift numbers that are above 2, as opposed to the 1 you should expect if everything were random, as the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a way larger sample size in order to really use the method that I was using. In fact, the method that I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.