How to avoid this ValueError during concatenation?

I've been trying to concatenate a list of pandas DataFrames with one column each, but I keep getting this error:
ValueError: Shape of passed values is (8980, 2), indices imply (200, 2)
I made sure that all the shapes are identical (200 rows × 1 column) and I removed all the NA values. The concatenation works along the rows (axis=0) but not along the columns (axis=1).
I previously manipulated the DataFrames with some transpositions (df.T) and other operations like dropna(axis=0, how='all'). I don't think this is the cause of the error, because the same steps worked fine on a toy dataset. Here's some code for context:
test_full[:3] #this is what my list of pandas Dataframes looks like (the first 3 items)
[Unnamed: 1
1 3520
2 2014
3 10253
4 5929
1 3243
.. ...
[200 rows x 1 columns],
Unnamed: 2
1 2476
2 1455
3 7245
4 4304
1 2275
.. ...
[200 rows x 1 columns],
Unnamed: 3
1 1044
2 559
3 3008
4 1625
1 968
.. ...
[200 rows x 1 columns]]
For the concatenation I tried:
pd.concat(test_full, axis=1)
ValueError Traceback (most recent call last)
<ipython-input-158-f067bc5875c9> in <module>
----> 1 pd.concat(test_full, axis=1)
ValueError: Shape of passed values is (8980, 104), indices imply (200, 104)
As an output I was hoping for:
Unnamed: 1 Unnamed: 2 Unnamed: 3
1 3520 1232 6349
2 2014 4353 2974
3 10253 1234 1223
4 5929 7456 9854
1 3243 7654 11034
.. ... ... ...
I also don't really know what the shape (8980, 104) and the implied indices (200, 104) are referring to.
I would really appreciate some suggestions.

From my experience, this error tends to happen if either index has duplicate values, as pandas doesn't know how to align them. From your example, it looks like your index contains multiple 1s. If the index isn't needed, you can call df.reset_index(drop=True, inplace=True) on each dataframe before concatenating.
This issue doesn't occur when you concatenate along the index (axis=0), as that just "puts them on top of each other", regardless of what the index is.
What the error message is telling you is that the resulting shape is (8980, 104) but that, based on the index, the expected shape should be (200, 104).
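A minimal sketch of that fix, using made-up 200-row frames in place of your test_full list (the column names are just placeholders):

import numpy as np
import pandas as pd

# Stand-ins for the frames in test_full: 200 rows x 1 column each,
# with deliberately duplicated index labels.
frames = [pd.DataFrame({'Unnamed: %d' % i: np.arange(200)},
                       index=[1, 2, 3, 4] * 50)
          for i in range(1, 4)]

# Drop the duplicated labels so the columns can be aligned positionally.
frames = [df.reset_index(drop=True) for df in frames]

result = pd.concat(frames, axis=1)
print(result.shape)  # (200, 3)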

Related

Pandas and Matplotlib: Can't get the stackplot to work using Matplotlib on the dataframe

I have a DataFrame object coming from a SQL query that looks like this:
Frage/Diskussion ... Wissenschaft&Technik
date ...
2018-05-10 13 ... 6
2018-05-11 28 ... 1
2018-05-12 11 ... 2
2018-05-13 21 ... 3
2018-05-14 30 ... 4
2018-05-15 38 ... 5
2018-05-16 25 ... 7
2018-05-17 23 ... 2
2018-05-18 24 ... 4
2018-05-19 31 ... 4
[10 rows x 6 columns]
I want to visualize this data with a Matplotlib stackplot in python.
What works is the following line:
df.plot(kind='area', stacked=True)
What doesn't work is the following line:
plt.stackplot(df.index, df.values)
The error I get with the last line is:
"ValueError: operands could not be broadcast together with shapes (10,) (6,) "
Obviously the 10 rows × 6 columns shape is being passed into the plotting function, and I can't get rid of the mismatch.
Writing out each column by hand also works, but that's not really what I want, since there will be many columns later on.
plt.stackplot(df.index.values, df['Frage/Diskussion'], df['Humor'], df['Nachrichten'], df['Politik'], df['Interessant'], df['Wissenschaft&Technik'])
Your problem here is that df.values is a rows × columns array, here with shape (10, 6), while plt.stackplot expects each stacked series as a row, i.e. a (6, 10) array. To get the form you want, you need to transpose it. Fortunately, that is easy: replace df.values with df.values.T. So in your code replace:
plt.stackplot(df.index,df.values)
with
plt.stackplot(df.index,df.values.T)
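A short sketch with made-up numbers (the column names are taken from your hand-written call) showing the shapes involved:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the SQL result: 10 dates x 6 category columns.
dates = pd.date_range('2018-05-10', periods=10)
df = pd.DataFrame(np.random.randint(0, 40, size=(10, 6)),
                  index=dates,
                  columns=['Frage/Diskussion', 'Humor', 'Nachrichten',
                           'Politik', 'Interessant', 'Wissenschaft&Technik'])

print(df.values.shape)    # (10, 6) -- one row per date
print(df.values.T.shape)  # (6, 10) -- one row per series, what stackplot wants

plt.stackplot(df.index, df.values.T, labels=df.columns)
plt.legend(loc='upper left')
plt.show()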

Add row in pandas DF as numeric column index?

I have tried this a few ways and am stumped. My last attempt generates an error that says: "ValueError: Plan shapes are not aligned"
So I have a dataframe that can have up to about 1,000 columns in it, based on data read in from an external file. The columns are all going to have their own labels/names, i.e. "Name", "BirthYear", "Hometown", etc. I want to add a row at the beginning of the dataframe that runs from 0 to (as many columns as there are), so if the data ends up having 232 columns, this new first row would have the values 0, 1, 2, 3, 4, ..., 229, 230, 231.
What I am doing is creating a one-row dataframe with as many columns/values as there are in the main ("mega") dataframe, and then concatenating them. It throws this shape error at me, but when I print the shape of each frame, they match up in terms of length. Not sure what I am doing wrong; any help would be appreciated. Thank you!
colList = list(range(0, len(mega.columns)))
indexRow = pd.DataFrame(colList).T
print(indexRow)
print(indexRow.shape)
print(mega.shape)
mega = pd.concat([indexRow, mega],axis=0)
Here is the result...
0 1 2 3 4 5 6 7 8 9 ... 1045 \
0 0 1 2 3 4 5 6 7 8 9 ... 1045
1046 1047 1048 1049 1050 1051 1052 1053 1054
0 1046 1047 1048 1049 1050 1051 1052 1053 1054
[1 rows x 1055 columns]
(1, 1055)
(4, 1055)
ValueError: Plan shapes are not aligned
This is one way to do it. Depending on your data, this could mix types (e.g. if one column was timestamps). Also, this resets your index in mega.
>>> import numpy as np
>>> import pandas as pd
>>> mega = pd.DataFrame(np.random.randn(3, 3), columns=list('ABC'))
>>> indexRow = pd.DataFrame({col: [n] for n, col in enumerate(mega)})
>>> pd.concat([indexRow, mega], ignore_index=True)
A B C
0 0.000000 1.000000 2.000000
1 0.413145 -1.475655 0.529429
2 0.416250 -0.055519 1.611539
3 0.154045 -0.038109 1.020616
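Incidentally, the likely reason your original attempt raised "Plan shapes are not aligned" is that pd.DataFrame(colList).T has integer column labels (0 through 1054) while mega has string labels, so concat cannot line the columns up. A variant of the same idea that builds the row directly from mega's own column labels (a sketch, equivalent to the dict comprehension above):

indexRow = pd.DataFrame([list(range(len(mega.columns)))], columns=mega.columns)
mega = pd.concat([indexRow, mega], ignore_index=True)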

IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723

I have a dataframe, df, with 646585 rows and 3 columns, which looks like:
index inp aco count
0 2.3.6. dp-ptp-a2f 22000
1 2.3.12. ft-ptp-a2f 21300
2 2.5.9. dp-ptp-a2f 21010
3 0.8.0. dp-ptp-a4f 20000
4 2.3.6. ft-ptp-a2f 19000
5 2.3.6. ff-ptp-a2f 18500
... ...
... ...
... ...
I tried to pivot the dataframe using the code:
df1=df.pivot_table(values='count', index='inp', columns='aco',fill_value=0)
print(df1)
but I got
IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723
There is an open pandas issue describing this error.
AFAIK there is currently no solution/workaround provided, except reducing the size of the data set.
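One sketch of the "reduce the data set" route: pivot in pieces and stack the results. Splitting on the unique 'inp' values keeps all rows of a given inp in one chunk, so each partial pivot is exact and the pieces simply stack along the index (the chunk count of 10 is an arbitrary choice):

import numpy as np
import pandas as pd

# Pivot chunk by chunk so each intermediate reshape stays small.
inp_groups = np.array_split(df['inp'].unique(), 10)
partials = [df[df['inp'].isin(g)]
            .pivot_table(values='count', index='inp', columns='aco',
                         fill_value=0)
            for g in inp_groups]

# Combinations absent from a chunk come back as NaN after concat,
# so fill them with 0 to match the original fill_value.
df1 = pd.concat(partials).fillna(0)
print(df1.shape)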

Pandas: ValueError - operands could not be broadcast together with shapes

I get the following runtime error while performing operations like add() and combine_first() on large dataframes:
ValueError: operands could not be broadcast together with shapes (680,) (10411,)
Broadcasting errors seem to happen quite often with NumPy (matrix dimension mismatches), but I do not understand why this affects my MultiIndex dataframes / series. Each of the concat elements produces the runtime error on its own:
My code:
# I want to merge two dataframes data1 and data2
# add up the 'requests' column
# merge 'begin' column choosing data1-entries first on collision
# merge 'end' column choosing data2-entries first on collision
pd.concat([
    data1["begin"].combine_first(data2["begin"]),
    data2["end"].combine_first(data1["end"]),
    data1["requests"].add(data2["requests"], fill_value=0)
], axis=1)
My data:
# data1
requests begin end
IP sessionID
*1.*16.*01.5* 20 9 2011-12-16 13:06:23 2011-12-16 16:50:57
21 3 2011-12-17 11:46:26 2011-12-17 11:46:29
22 15 2011-12-19 10:10:14 2011-12-19 16:10:47
23 9 2011-12-20 09:11:23 2011-12-20 13:01:12
24 9 2011-12-21 00:15:22 2011-12-21 02:50:22
...
6*.8*.20*.14* 6283 1 2011-12-25 01:35:25 2011-12-25 01:35:25
20*.11*.3.10* 6284 1 2011-12-25 01:47:45 2011-12-25 01:47:45
[680 rows x 3 columns]
# data2
requests begin end
IP sessionID
*8.24*.135.24* 9215 1 2011-12-29 03:14:10 2011-12-29 03:14:10
*09.2**.22*.4* 9216 1 2011-12-29 03:14:38 2011-12-29 03:14:38
*21.14*.2**.22* 9217 12 2011-12-29 03:16:06 2011-12-29 03:19:45
...
19*.8*.2**.1*1 62728 2 2012-03-31 11:08:47 2012-03-31 11:08:47
6*.16*.10*.155 77282 1 2012-03-31 11:19:33 2012-03-31 11:19:33
17*.3*.18*.6* 77305 1 2012-03-31 11:55:52 2012-03-31 11:55:52
6*.6*.2*.20* 77308 1 2012-03-31 11:59:05 2012-03-31 11:59:05
[10411 rows x 3 columns]
I don't know why (maybe it is a bug), but explicitly selecting all rows from each series with [:] works as expected. No errors.
print pd.concat([
    data1["begin"][:].combine_first(data2["begin"][:]),
    data2["end"][:].combine_first(data1["end"][:]),
    data1["requests"][:].add(data2["requests"][:], fill_value=0)
], axis=1)
It looks like data1["requests"].add(data2["requests"], fill_value=0) is trying to sum two pandas Series with different numbers of rows. Series.add broadcasts the add operation to all elements in both series, which implies they need compatible dimensions.
Using numpy.concatenate((df['col1'], df['col2']), axis=None) also works.
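For reference, a small sketch (toy data, not the original log frames) of what Series.add with fill_value does when the two series have different indexes: labels are aligned, and a value missing on one side is treated as fill_value instead of producing NaN:

import pandas as pd

s1 = pd.Series({'a': 20, 'b': 3, 'c': 15})
s2 = pd.Series({'b': 1, 'c': 12, 'd': 2})

# 'a' and 'd' exist on only one side; fill_value=0 supplies the other.
print(s1.add(s2, fill_value=0))
# a    20.0
# b     4.0
# c    27.0
# d     2.0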

pandas, dataframe, groupby, std

New to pandas here. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host+operation pair, and calculate the standard deviation of the execution time per host, then per host+operation pair. Seems simple?
It works for grouping by a single column:
df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial 132564 non-null values
host 132564 non-null values
idnum 132564 non-null values
operation 132564 non-null values
time 132564 non-null values
...
dtypes: float32(1), int64(2), object(6)
byhost = df.groupby('host')
byhost.std()
Out[362]:
datespecial idnum time
host
ahost1.test 11946.961952 40367.033852 0.003699
host1.test 15484.975077 38206.578115 0.008800
host10.test NaN 37644.137631 0.018001
...
Nice. Now:
byhostandop = df.groupby(['host', 'operation'])
byhostandop.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
386 # todo, implement at cython level?
387 if ddof == 1:
--> 388 return self._cython_agg_general('std')
389 else:
390 f = lambda x: x.std(ddof=ddof)
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
1615
1616 def _cython_agg_general(self, how, numeric_only=True):
-> 1617 new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
1618 return self._wrap_agged_blocks(new_blocks)
1619
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
1653 values = com.ensure_float(values)
1654
-> 1655 result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
1656
1657 # see if we can cast the block back to the original dtype
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
838 if is_numeric:
839 result = lib.row_bool_subset(result,
--> 840 (counts > 0).view(np.uint8))
841 else:
842 result = lib.row_bool_subset_object(result,
/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'
Huh?? Why do I get this exception?
More questions:
how do I calculate std deviation on dataframe.groupby([several columns])?
how can I limit calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.
It's important to know your version of pandas / Python. It looks like this exception could arise in pandas versions < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid it, you can cast your float32 column to float64 (note that astype returns a new object rather than modifying in place):
df['time'] = df['time'].astype('float64')
To calculate std() on selected columns, just select columns :)
>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
a b c g
0 0 10 a 1
1 1 11 b 1
2 2 12 c 1
3 3 13 d 2
4 4 14 e 2
5 5 15 f 2
6 6 16 g 3
7 7 17 h 3
8 8 18 i 3
9 9 19 j 3
>>> df.groupby('g')[['a', 'b']].std()
a b
g
1 1.000000 1.000000
2 1.000000 1.000000
3 1.290994 1.290994
Update
As far as it goes, it looks like std() calls aggregation on the groupby result and hits a subtle bug (see Python Pandas: Using Aggregate vs Apply to define new columns). To avoid this, you can use apply():
byhostandop['time'].apply(lambda x: x.std())
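Applied to your data, the same route works for the multi-column groupby too. A small sketch with made-up host/operation rows standing in for your frame:

import pandas as pd

# Toy stand-in for the host / operation / time data from the question.
df = pd.DataFrame({
    'host':      ['h1', 'h1', 'h1', 'h2', 'h2', 'h2'],
    'operation': ['read', 'read', 'write', 'read', 'write', 'write'],
    'time':      [0.1, 0.3, 0.2, 0.5, 0.4, 0.6],
})

# std per host, then per (host, operation) pair, on 'time' only,
# so date/timestamp-like columns never enter the calculation.
print(df.groupby('host')['time'].std())
print(df.groupby(['host', 'operation'])['time'].apply(lambda x: x.std()))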
