Length of values does not match length of index when using pandas - python

I'm getting 'ValueError: Length of values does not match length of index' while using Pandas. I'm reading in data from an Excel spreadsheet using Pandas' pd.read_excel method. I then filter the data using Pandas' filter method. I've created 'dataSubset' to represent the filtered data. I use 'dataSubset' to create several 'mean' columns, each representing the mean of multiple columns. I then create 'finalData' by using pd.concat to concatenate all of the calculated mean columns. This code runs perfectly; however, if I uncomment any of the additional columns, the code blows up and gives the aforementioned error.
What am I doing wrong? It works as long as I don't concatenate more than it wants.
Help.
import pandas as pd
dataIn = pd.read_excel('IDT/IDT-B.xlsx')
dataSubset = dataIn.filter([
"First Name",
"Last Name",
"2.14.2 Control Structures Example Quiz",
"2.14.4 Random Hurdles",
"2.15.2 Quiz: Which Control Structure?",
"2.16.2 How to Indent Your Code Quiz",
"2.16.4 Diagonal",
"2.16.5 Staircase",
"2.17.2 Debugging Basics",
"2.17.6 Debugging with Error Messages",
"3.2.6 Programming with Karel Quiz",
"5.1.2 Hello World Quiz",
"5.1.4 Your Name and Hobby",
"5.2.2 Variables Quiz",
"5.2.4 Daily Activities",
"5.3.2 User Input Quiz",
"5.3.4 Dinner Plans",
"5.4.2 Basic Math in JavaScript Quiz",
"5.4.6 T-Shirt Shop",
"5.4.7 Running Speed",
"5.5.2 JavaScript Graphics Quiz",
"5.5.8 Flag of the Netherlands",
"5.5.9 Snowman",
"5.6.2 Using RGB to Create Colors",
"5.6.4 Exploring RGB",
"5.6.5 Making Yellow",
"5.6.6 Rainbow",
"5.6.7 Create a Color Image!",
"6.1.1 Ghost",
"6.1.2 Fried Egg",
"6.1.3 Draw Something",
"6.1.4 JavaScript and Graphics Quiz"
], axis=1)
# If any of these dataframes are uncommented, the code blows up.
# dataSubset['aver_2.14'] = dataSubset[["2.14.2 Control Structures Example Quiz",
# "2.14.4 Random Hurdles"]],
# dataSubset['aver_2.15'] = dataSubset[["2.15.2 Quiz: Which Control Structure?"]],
# #
# # dataSubset['aver_2.16'] = dataSubset[["2.16.2 How to Indent Your Code Quiz",
# # "2.16.4 Diagonal"]],
# #
# # dataSubset['aver_2.17'] = dataSubset[["2.17.2 Debugging Basics",
# # "2.17.6 Debugging with Error Messages", ]]
dataSubset['unit_quiz_326'] = dataSubset[["3.2.6 Programming with Karel Quiz"]]
dataSubset['aver_5.1'] = dataSubset[["5.1.2 Hello World Quiz",
"5.1.4 Your Name and Hobby"]].mean(axis=1)
dataSubset['aver_5.2'] = dataSubset[["5.2.2 Variables Quiz",
"5.2.4 Daily Activities"]].mean(axis=1)
dataSubset['aver_5.3'] = dataSubset[["5.3.2 User Input Quiz",
"5.3.4 Dinner Plans"]].mean(axis=1)
dataSubset['aver_5.4'] = dataSubset[["5.4.2 Basic Math in JavaScript Quiz",
"5.4.6 T-Shirt Shop",
"5.4.7 Running Speed"]].mean(axis=1)
dataSubset['aver_5.5'] = dataSubset[["5.5.2 JavaScript Graphics Quiz",
"5.5.8 Flag of the Netherlands",
"5.5.9 Snowman"]].mean(axis=1)
dataSubset['aver_5.6'] = dataSubset[["5.6.2 Using RGB to Create Colors",
"5.6.4 Exploring RGB",
"5.6.5 Making Yellow",
"5.6.6 Rainbow",
"5.6.7 Create a Color Image!"]].mean(axis=1)
dataSubset['aver_6.1'] = dataSubset[["6.1.1 Ghost",
"6.1.2 Fried Egg",
"6.1.3 Draw Something",
"6.1.4 JavaScript and Graphics Quiz"]].mean(axis=1)
finalData = pd.concat([dataSubset['First Name'],
dataSubset['Last Name'],
dataSubset['unit_quiz_326'],
# dataSubset['aver_2.14'],
# dataSubset['aver_2.15'],
# dataSubset['aver_2.16'],
# dataSubset['aver_2.17'],
dataSubset['aver_5.1'],
dataSubset['aver_5.2'],
dataSubset['aver_5.3'],
dataSubset['aver_5.4'],
dataSubset['aver_5.5'],
dataSubset['aver_5.6'],
dataSubset['aver_6.1']], axis=1)
finalData.to_excel('output/gradesOut.xlsx')

Cause of ValueError
Based on this line:
dataSubset['aver_2.15'] = dataSubset[["2.15.2 Quiz: Which Control Structure?"]],
The right side of the assignment has a trailing comma, so the line is equivalent to this:
dataSubset['aver_2.15'] = (dataSubset[["2.15.2 Quiz: Which Control Structure?"]], )
Basically, the line is trying to perform the following assignment:
pandas.Series <-- Tuple[pandas.DataFrame] # tuple with length 1
So there is a length mismatch between the left side (assignment target) and the right side (object that should be assigned to the target):
Left side: Length of the Series (think "number of rows")
Right side: One
Why is it pandas.Series on the left, but pandas.DataFrame on the right?
If you use single square brackets, you get a Series object: s = df['a']
If you use double square brackets, you get a DataFrame object: df2 = df[['a']]
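The difference between the two bracket styles is easy to check on a toy frame. This is a minimal sketch; the frame and its column names 'a' and 'b' are made up for illustration and are not from the question:

```python
import pandas as pd

# Toy frame; the column names are hypothetical.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

single = df["a"]    # single brackets: one column as a Series
double = df[["a"]]  # double brackets: a DataFrame with one column

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```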
Possible solution
It seems you want to combine multiple columns into a new column. In one of the working lines, you take the mean of two columns with .mean(axis=1):
dataSubset['aver_5.1'] = dataSubset[["5.1.2 Hello World Quiz",
"5.1.4 Your Name and Hobby"]].mean(axis=1)
So, to fix your code, you probably need to:
Remove trailing commas
Add mean() or some other "combining function" to the lines where you select multiple columns
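Both points can be sketched on a small made-up frame. The column names quiz_a and quiz_b are hypothetical stand-ins for the quiz columns in the question:

```python
import pandas as pd

# Hypothetical scores; stand-ins for the quiz columns in the question.
df = pd.DataFrame({"quiz_a": [1.0, 0.0, 1.0], "quiz_b": [0.0, 0.0, 1.0]})

# The trailing comma wraps the selection in a tuple of length 1.
with_comma = df[["quiz_a", "quiz_b"]],
print(type(with_comma).__name__, len(with_comma))  # tuple 1

# Without the comma, and with a combining function, each row yields one value,
# so the result has the same length as the index.
df["aver"] = df[["quiz_a", "quiz_b"]].mean(axis=1)
print(df["aver"].tolist())  # [0.5, 0.0, 1.0]
```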

Related

Pandas Styler.to_latex() - how to pass commands and do simple editing

How do I pass the following commands into the latex environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip for a panel the table numbering)
In addition, I would need to add parentheses and asterisks to the t-statistics, meaning row-specific formatting on the dataframes.
For example:
Current
variable             value
const                2.439628
t stat               13.921319
FamFirm              0.114914
t stat               0.351283
founder              0.154914
t stat               2.351283
Adjusted R Square    0.291328
I want this
variable             value
const                2.439628
t stat               (13.921319)***
FamFirm              0.114914
t stat               (0.351283)
founder              0.154914
t stat               (1.651283)**
Adjusted R Square    0.291328
I'm writing my research papers in DataSpell. All empirical work is in Python, and I then use LaTeX (TexiFy) to create the PDF within DataSpell. Because of this workflow, I can't edit the tables in the LaTeX code, since they get overwritten every time I run the Jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
# drop index to column
panel_a.reset_index(inplace=True)
# write Latex index and cut names to appropriate length
ind_list = [
"ageFirm",
"meanAgeF",
"lnAssets",
"bsVol",
"roa",
"fndrCeo",
"lnQ",
"sic",
"hightech",
"nonFndrFam"
]
# assign the list of values to the column
panel_a["index"] = ind_list
# format column names
header = ["", "count","mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header
with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a':['some_var','t stat'],'b':[1.01235,2.01235]})
df.style.format({'a': str, 'b': lambda x: "{:.3f}".format(x)
if x < 2 else '({:.3f})***'.format(x)})
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do: .format(func, subset='value')). So, this is great if each value itself is evaluated (x < 2).
The problem in your case is that the evaluation is over some other value, namely a (not supplied) P value combined with panel_a['variable'] == 't stat'. Now, assuming you have those P values in a different column, I suggest you create a for loop to populate a list that becomes like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)
panel_a.style.format({'variable': str, 'value': func})
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice; e.g. if you modify the list again before calling func, its functionality is unlikely to result in the expected behaviour or worse, it may throw an error that is difficult to track down. I'm not sure how to remedy this other than simply turning all the floats into strings in panel_a.value inplace. In that case, of course, you don't need .format anymore, but it will alter your df and that's also not ideal. I guess you could make a copy first (df2 = df.copy()), but that will affect memory.
Anyway, hope this helps. So, in full you add this as follows to your code:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)
with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering'
        ))
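One way to avoid popping from a shared list is to build each display string from the row's own label and p-value. The following is only a sketch under assumptions: the p-values are hypothetical (the question does not supply them), and the star thresholds of 1%, 5% and 10% are a common convention rather than something stated in the question.

```python
def format_cell(label, value, p_value=None):
    """Return the display string for one cell of the 'value' column."""
    if label != "t stat":
        return f"{value:.3f}"
    # Assumed star thresholds (1%, 5%, 10%); adjust to your own convention.
    if p_value is None:
        stars = ""
    elif p_value < 0.01:
        stars = "***"
    elif p_value < 0.05:
        stars = "**"
    elif p_value < 0.10:
        stars = "*"
    else:
        stars = ""
    return f"({value:.3f}){stars}"


labels = ["const", "t stat", "FamFirm", "t stat"]
values = [2.439628, 13.921319, 0.114914, 0.351283]
p_values = [None, 0.001, None, 0.726]  # hypothetical, not from the question

formatted = [format_cell(l, v, p) for l, v, p in zip(labels, values, p_values)]
print(formatted)
# ['2.440', '(13.921)***', '0.115', '(0.351)']
```

The resulting list of strings could then be assigned to the 'value' column of a copy of panel_a before calling .style, so the original frame stays numeric and no state is shared between calls.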

index a column of dataframe regarding other columns and

I have this data frame.
As you can see, it has three indices: Chapter, ParaIndex (paragraph index) and SentIndex (sentence index). There are 70 chapters, 1699 paragraphs and 6999 sentences,
so each index starts from the beginning (0 or 1). The problem is that I want to make a widget that retrieves a specific sentence located in a specific paragraph of a specific chapter, something like this:
https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6
but for extracting a specific sentence in a specific paragraph of a specific chapter.
I think I need another index (something like ChapParaSent, an abbreviation covering all three) or even a multidimensional index that shows exactly where each sentence is located.
Any idea how I can build that using ipywidgets?
https://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from ipywidgets import interact

@interact
def showDetail(Chapter=(1,70), ParaIndex=(0,1699), SentIndex=(0,6999)):
    return df.loc[(df.Chapter == Chapter) & (df.ParaIndex==ParaIndex)&(df.SentIndex==SentIndex)]
The problem with this is that we do not know how many paragraphs each chapter has, nor which number SentIndex starts from within each paragraph, so most of the time we get no result.
The aim is to adapt this (or define a new index) so that moving the sliders always yields exactly one sentence.
For example, here I get a result:
but when I change to this:
I get no result. The reason is obvious: there is no index 1-2-1, since in chapter 1, paragraph index 2, SentIndex starts from 2!
One solution I saw was a full multidimensional data frame, but I need something simpler that I can use with ipywidgets...
Many thanks.
I'm sure there is an easier solution out there, but this works, I guess.
import pandas as pd
data = [
dict(Chapter=0, ParaIndex=0, SentIndex=0, content="0"),
dict(Chapter=1, ParaIndex=1, SentIndex=1, content="a"),
dict(Chapter=1, ParaIndex=1, SentIndex=2, content="b"),
dict(Chapter=2, ParaIndex=2, SentIndex=3, content="c"),
dict(Chapter=2, ParaIndex=2, SentIndex=4, content="d"),
dict(Chapter=2, ParaIndex=3, SentIndex=5, content="e"),
dict(Chapter=3, ParaIndex=4, SentIndex=6, content="f"),
]
df = pd.DataFrame(data)
def showbyindex(target_chapter, target_paragraph, target_sentence):
    # Restrict to the chapter, then treat the slider values as positions
    # within that chapter's paragraphs and that paragraph's sentences.
    df_chapter = df.loc[df.Chapter==target_chapter]
    unique_paragraphs = df_chapter.ParaIndex.unique()
    paragraph_idx = unique_paragraphs[target_paragraph]
    df_paragraph = df_chapter.loc[df_chapter.ParaIndex==paragraph_idx]
    return df_paragraph.iloc[target_sentence]
showbyindex(target_chapter=2, target_paragraph=0, target_sentence=1)
Edit:
If you want the sliders only to be within a valid range you can define IntSliders for your interact decorator:
chapter_slider = widgets.IntSlider(min=0, max=max(df.Chapter.unique()), step=1, value=0)
paragraph_slider = widgets.IntSlider(min=0, max=1, step=1, value=0)
sentence_slider = widgets.IntSlider(min=0, max=1, step=1, value=0)
#interact(target_chapter=chapter_slider, target_paragraph=paragraph_slider, target_sentence=sentence_slider)
Now you have to check the valid number of paragraphs/sentences within your showbyindex function and set the sliders' value/max accordingly.
if(...):
    paragraph_slider.max = ...
...
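The range check can be sketched without any widgets. The helper below, slider_limits, is a hypothetical name and not part of ipywidgets; it reuses the sample data from this answer and assumes the requested chapter exists in the frame. It returns the largest valid paragraph position and sentence position, which is what the sliders' max values would be set to.

```python
import pandas as pd

df = pd.DataFrame([
    dict(Chapter=0, ParaIndex=0, SentIndex=0, content="0"),
    dict(Chapter=1, ParaIndex=1, SentIndex=1, content="a"),
    dict(Chapter=1, ParaIndex=1, SentIndex=2, content="b"),
    dict(Chapter=2, ParaIndex=2, SentIndex=3, content="c"),
    dict(Chapter=2, ParaIndex=2, SentIndex=4, content="d"),
    dict(Chapter=2, ParaIndex=3, SentIndex=5, content="e"),
    dict(Chapter=3, ParaIndex=4, SentIndex=6, content="f"),
])


def slider_limits(df, chapter, paragraph_pos):
    """Return (max paragraph position, max sentence position) for a selection.

    Hypothetical helper; assumes `chapter` occurs in df.Chapter.
    """
    df_chapter = df.loc[df.Chapter == chapter]
    paragraphs = df_chapter.ParaIndex.unique()
    # Clamp so a stale slider value never points past the last paragraph.
    paragraph_pos = min(paragraph_pos, len(paragraphs) - 1)
    df_paragraph = df_chapter.loc[df_chapter.ParaIndex == paragraphs[paragraph_pos]]
    return len(paragraphs) - 1, len(df_paragraph) - 1


print(slider_limits(df, chapter=2, paragraph_pos=0))  # (1, 1)
print(slider_limits(df, chapter=2, paragraph_pos=1))  # (1, 0)
```

Inside showbyindex, the two returned numbers could be assigned to paragraph_slider.max and sentence_slider.max before the lookup.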

When using pygal.maps.world is there a way to format the numbers that display a country's population?

I am using pygal to make an interactive map showing world country populations from 2010. I am trying to find a way to display the populations with commas inserted, i.e. as 10,000 rather than 10000.
I have already tried using "{:,}".format(x) when reading the numbers into my lists for the different population levels, but it causes an error. I believe this is because it changes the value to a string.
I also tried inserting a piece of code I found online:
wm.value_formatter = lambda x: "{:,}".format(x)
This doesn't cause any errors, but it doesn't change how the numbers are formatted either. I am hoping someone might know of a built-in function such as:
wm_style = RotateStyle('#336699')
which lets me set a color scheme.
Below is the part of my code that plots the map.
wm = World()
wm.force_uri_protocol = "http"
wm_style = RotateStyle('#996699')
wm.value_formatter = lambda x: "{:,}".format(x)
wm.value_formatter = lambda y: "{:,}".format(y)
wm = World(style=wm_style)
wm.title = "Country populations year 2010"
wm.add('0-10 million', cc_pop_low)
wm.add("10m to 1 billion", cc_pop_mid)
wm.add('Over 1 billion', cc_pop_high)
wm.render_to_file('world_population.svg')
Setting the value_formatter property will change the label format, but in your code you recreate the World object after setting the property. This newly created object will have the default value formatter. You can also remove one of the lines setting the value_formatter property as they both achieve the same thing.
Re-ordering the code will fix your problem:
wm_style = RotateStyle('#996699')
wm = World(style=wm_style)
wm.value_formatter = lambda x: "{:,}".format(x)
wm.force_uri_protocol = "http"
wm.title = "Country populations year 2010"
wm.add('0-10 million', cc_pop_low)
wm.add("10m to 1 billion", cc_pop_mid)
wm.add('Over 1 billion', cc_pop_high)
wm.render_to_file('world_population.svg')
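The ordering issue is plain Python object behaviour and can be reproduced without pygal. In this sketch, Chart is a hypothetical stand-in class, not pygal's World; it only shows that an attribute set on one object does not carry over to a newly created one.

```python
class Chart:
    """Hypothetical stand-in for a chart object; not part of pygal."""

    def __init__(self):
        # Default formatter: plain string conversion.
        self.value_formatter = str


chart = Chart()
chart.value_formatter = lambda x: "{:,}".format(x)

# Rebinding the name creates a fresh object with the default formatter.
chart = Chart()
before = chart.value_formatter(10000)
print(before)  # 10000

# Setting the formatter after the final construction takes effect.
chart.value_formatter = lambda x: "{:,}".format(x)
after = chart.value_formatter(10000)
print(after)  # 10,000
```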

grouped box plots are too narrow to read

Actually, everything works just fine: I am trying to compare different groups in my survey in one chart, so I wrote the following code in Python (Jupyter Notebook):
for value in values:
    catpool=getcat()
    py.offline.init_notebook_mode()
    data = []
    for cata in catpool:
        for con in constraints:
            data.append( go.Box( y=getdf(value,cata,con[0])['Y'+value],x=con[1], name=cata, showlegend=False, boxmean='sd',
                                 #boxpoints='all',
                                 jitter=0.3,
                                 pointpos=0 ) )
    layout=go.Layout(title="categorie: "+getclearname(value)+" - local space syntax measurements on placements<br>",yaxis=dict( title='percentage of range between flat extremes'),boxmode='group',showlegend=True)
    fig = go.Figure(data=data, layout=layout)
    py.offline.iplot(fig, filename='pandas-box-plot')
The function 'getdf' queries a column from a database.
Unfortunately, it results in an unreadable diagram like this:
[image: a diagram with too narrow box plots]
Is it possible to give the groups less spacing and, accordingly, the box plots within each group more? Or is there anything else that would make it more readable?
Thank you
I solved the problem on my own:
The strange behaviour occurred because of the for cata in catpool loop. Data for box plots in groups has to contain the group value in the data frame, so I no longer loop at this point but instead loop while concatenating the SQL statement. The same queries are carried out, but they are joined together by "UNION" like this:
def formstmtelse(val, category, ext, constraint, nine=True):
    stmt=""
    if nine:
        matrix=['11','12','13','21','22','23','31','32','33']
    else:
        matrix=['11']
    j=0
    for cin in category:
        j=j+1
        if j>1:
            stmt=stmt+" UNION "
        m=0
        for cell in matrix:
            m=m+1
            if m>1:
                stmt=stmt+"""UNION
"""
            stmt=stmt+"""SELECT '"""+cin+"""' AS cata,
"""
            if ext:
                stmt=stmt+"""((("""+val+"""-MIN"""+val+""")/(MAX"""+val+"""-MIN"""+val+"""))*100) AS Y"""+val
            else:
                stmt=stmt+"""((("""+val+""")/(MAX"""+val+"""))*100) AS Y"""+val
            stmt=stmt+"""
FROM `stests`
JOIN ssubject
ON ssubject.ssubjectID=stests.ssubjectID
JOIN scountries
ON CountryGrownUpIn=scountriesID
JOIN scontinents
ON Continent=scontinentsID
JOIN gridpoint
ON stests.FlatID=gridpoint.GFlatID AND G"""+cin+cell+"""=GIntID
JOIN
(
SELECT GFlatID AS GSUBFlatID"""
            stmt=stmt+""",
MAX("""+val+""") AS MAX"""+val+""",MIN("""+val+""") AS MIN"""+val
            stmt=stmt+"""
FROM gridpoint
GROUP BY GSUBFlatID
)
AS virtualtable
ON FlatID=virtualtable.GSUBFlatID
"""+constraint+" "
    return stmt
That solves the problem. The 'x' attribute of the Box call must be a list or data frame column, not just a single string, even though the string is always the same. I guess this was causing 'invisible' x-values that made the automated scaling impossible; in other words, there was a new legend entry for each query with a new constraint and for each category, whereas only one data frame is needed per constraint. After changing this, fine adjustments can be made by adding the boxgroupgap or boxgap attribute to the layout...
please excuse my english
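The point about 'x' can be sketched with plain dicts, which avoids needing a database or plotly to follow the idea. The helper box_trace and the sample numbers are hypothetical; the only thing it illustrates is that 'x' carries one group label per 'y' value instead of a single string.

```python
def box_trace(y_values, group_label, name):
    """Build one box trace as a plain dict (hypothetical helper)."""
    y_values = list(y_values)
    return {
        "type": "box",
        "name": name,
        "y": y_values,
        # One x entry per y entry, so every point is assigned to its group.
        "x": [group_label] * len(y_values),
    }


# Made-up measurements for two constraints of the same category.
traces = [
    box_trace([12.0, 35.5, 41.2], "constraint A", "category 1"),
    box_trace([20.1, 28.4], "constraint B", "category 1"),
]
layout = {"boxmode": "group"}

for trace in traces:
    print(trace["name"], trace["x"])
```

As far as I know, plotly accepts dicts of this shape as traces in go.Figure(data=traces, layout=layout), but treat that as an assumption and check it against your plotly version.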

How to modify SPSS output files with Python?

I'm making custom tables in SPSS, but when the cell values (percentages) are rounded to 1 decimal, they sometimes add up to 99,9 or 100,1 instead of 100,0. My boss asked me to have everything neatly add up to 100. This means slightly changing some values in the output tables.
I wrote some code to retrieve cell values from tables, which works fine, but I cannot find any method or class that allows me to change cells in already generated output. I've tried things like:
Table[(rij,6)] = CellText.Number(11)
and
SpssDataCells[(rij,6)] = CellText.Number(11)
but it keeps giving me "AttributeError: 'SpssClient.SpssTextItem' object has no attribute 'DataCellArray'"
How do I successfully change cell values of output tables in SPSS?
My code so far:
import SpssClient, spss
# Connect Python to SPSS.
SpssClient.StartClient()
OutputDoc = SpssClient.GetDesignatedOutputDoc()
OutputItemList = OutputDoc.GetOutputItems()
# Grab the last table.
lastTab = OutputItemList.Size()-2
OutputItem = OutputItemList.GetItemAt(lastTab)
Table = OutputItem.GetSpecificType()
SpssDataCells = Table.DataCellArray()
# Loop: for each row, test whether the rounded values add up to 100.
# Grab the specific numbers.
rij=0
try:
    while (rij<20):
        b14 = float(SpssDataCells.GetUnformattedValueAt(rij,0))
        z14 = float(SpssDataCells.GetUnformattedValueAt(rij,1))
        zz14 = float(SpssDataCells.GetUnformattedValueAt(rij,2))
        b15 = float(SpssDataCells.GetUnformattedValueAt(rij,4))
        z15 = float(SpssDataCells.GetUnformattedValueAt(rij,5))
        zz15 = float(SpssDataCells.GetUnformattedValueAt(rij,6))
        print([b14,z14,zz14,b15,z15,zz15])
        rij=rij+1
except:
    print('End of table')
The SetValueAt method is what you require to change the value of a cell in a table.
PS. I think your boss should focus on more important things than spending billable time on having percentages add up neatly to 100% (a discrepancy that is due to rounding). Also make sure you use as much decimal precision as possible in your calculations, to minimize this "discrepancy".
Update:
Just to give an example of what you can do with manipulation like this (beyond fixing rounding errors):
The table shows the Share of Voice (SoV) of a respiratory drug brand (R3) and its rank among all brands (first two columns of data), and the SoV and rank within its own class of brands only (third and fourth columns). This is compared against the previous month (July 15): if the rank has increased, it is highlighted in green and an upward-facing arrow is added; if it has declined, it is highlighted in red and a downward-facing red arrow is added. It just adds a little color and visualization to what can otherwise be dull tables.
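For the rounding problem in the question itself, one common technique for deciding which cells to nudge is largest-remainder rounding. This is a swapped-in approach, not something the answer above proposes, and round_to_total is a hypothetical helper name. The sketch assumes the unrounded values already sum to roughly the target total.

```python
import math


def round_to_total(values, total=100.0, decimals=1):
    """Round values so the rounded results add up to `total`.

    Largest-remainder rounding; assumes the inputs sum to about `total`.
    """
    scale = 10 ** decimals
    scaled = [v * scale for v in values]
    floors = [math.floor(s) for s in scaled]
    # Number of smallest units still missing after rounding everything down.
    shortfall = round(total * scale) - sum(floors)
    # Hand the missing units to the entries with the largest fractional parts.
    order = sorted(range(len(values)),
                   key=lambda i: scaled[i] - floors[i], reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return [f / scale for f in floors]


rounded = round_to_total([33.333, 33.333, 33.334])
print(rounded)  # [33.3, 33.3, 33.4]
```

The adjusted numbers could then be written back into the table cells with the SetValueAt method mentioned above.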
