Pandas: ValueError - operands could not be broadcast together with shapes

I get the following runtime error while performing operations like add() and combine_first() on large DataFrames:
ValueError: operands could not be broadcast together with shapes (680,) (10411,)
Broadcasting errors seem to happen quite often in NumPy (matrix dimension mismatches), but I do not understand why they affect my MultiIndex DataFrames / Series. Each of the elements passed to concat produces the runtime error on its own:
My code:
# I want to merge two dataframes data1 and data2
# add up the 'requests' column
# merge 'begin' column choosing data1-entries first on collision
# merge 'end' column choosing data2-entries first on collision
pd.concat([
    data1["begin"].combine_first(data2["begin"]),
    data2["end"].combine_first(data1["end"]),
    data1["requests"].add(data2["requests"], fill_value=0)
], axis=1)
My data:
# data1
requests begin end
IP sessionID
*1.*16.*01.5* 20 9 2011-12-16 13:06:23 2011-12-16 16:50:57
21 3 2011-12-17 11:46:26 2011-12-17 11:46:29
22 15 2011-12-19 10:10:14 2011-12-19 16:10:47
23 9 2011-12-20 09:11:23 2011-12-20 13:01:12
24 9 2011-12-21 00:15:22 2011-12-21 02:50:22
...
6*.8*.20*.14* 6283 1 2011-12-25 01:35:25 2011-12-25 01:35:25
20*.11*.3.10* 6284 1 2011-12-25 01:47:45 2011-12-25 01:47:45
[680 rows x 3 columns]
# data2
requests begin end
IP sessionID
*8.24*.135.24* 9215 1 2011-12-29 03:14:10 2011-12-29 03:14:10
*09.2**.22*.4* 9216 1 2011-12-29 03:14:38 2011-12-29 03:14:38
*21.14*.2**.22* 9217 12 2011-12-29 03:16:06 2011-12-29 03:19:45
...
19*.8*.2**.1*1 62728 2 2012-03-31 11:08:47 2012-03-31 11:08:47
6*.16*.10*.155 77282 1 2012-03-31 11:19:33 2012-03-31 11:19:33
17*.3*.18*.6* 77305 1 2012-03-31 11:55:52 2012-03-31 11:55:52
6*.6*.2*.20* 77308 1 2012-03-31 11:59:05 2012-03-31 11:59:05
[10411 rows x 3 columns]

I don't know why, maybe it is a bug or something, but explicitly selecting all rows from each Series with [:] works as expected. No errors.
print(pd.concat([
    data1["begin"][:].combine_first(data2["begin"][:]),
    data2["end"][:].combine_first(data1["end"][:]),
    data1["requests"][:].add(data2["requests"][:], fill_value=0)
], axis=1))
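For reference, here is a minimal, self-contained sketch of that merge with small, invented MultiIndex data (all values hypothetical); it is the index-aligned combine_first/add pattern that matters, not the numbers:

import pandas as pd

# Two toy frames standing in for data1 and data2, indexed by (IP, sessionID)
idx1 = pd.MultiIndex.from_tuples([("1.2.3.4", 20), ("1.2.3.4", 21)], names=["IP", "sessionID"])
idx2 = pd.MultiIndex.from_tuples([("1.2.3.4", 21), ("5.6.7.8", 99)], names=["IP", "sessionID"])
data1 = pd.DataFrame({"requests": [9, 3],
                      "begin": pd.to_datetime(["2011-12-16 13:06", "2011-12-17 11:46"]),
                      "end": pd.to_datetime(["2011-12-16 16:50", "2011-12-17 11:46"])}, index=idx1)
data2 = pd.DataFrame({"requests": [2, 1],
                      "begin": pd.to_datetime(["2011-12-17 11:00", "2011-12-29 03:14"]),
                      "end": pd.to_datetime(["2011-12-17 12:00", "2011-12-29 03:14"])}, index=idx2)

merged = pd.concat([
    data1["begin"].combine_first(data2["begin"]),   # data1 wins on collision
    data2["end"].combine_first(data1["end"]),       # data2 wins on collision
    data1["requests"].add(data2["requests"], fill_value=0),
], axis=1)
print(merged)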

It looks like when you do data1["requests"].add(data2["requests"], fill_value=0) you are trying to sum two pandas Series with different numbers of rows. Series.add broadcasts the addition to all elements in both Series, which implies they must have the same dimensions.

Using numpy.concatenate((df['col1'], df['col2']), axis=None) works.

Related

How to avoid this ValueError during concatenation?

I've been trying to concatenate a list of pandas Dataframes with only one column each, but I keep getting this error:
ValueError: Shape of passed values is (8980, 2), indices imply (200, 2)
I made sure that all the shapes are identical (200 rows × 1 column) and I removed all the NA values. The concatenation works along the rows (axis=0) but not along the columns (axis=1).
I previously manipulated the DataFrames with some transpositions (df.T) and other operations like dropna(axis=0, how='all'). I don't think this could be the cause of the error, because I tried the same steps on a toy dataset and it worked fine. Here's some code for context:
test_full[:3] #this is what my list of pandas Dataframes looks like (the first 3 items)
[Unnamed: 1
1 3520
2 2014
3 10253
4 5929
1 3243
.. ...
[200 rows x 1 columns],
Unnamed: 2
1 2476
2 1455
3 7245
4 4304
1 2275
.. ...
[200 rows x 1 columns],
Unnamed: 3
1 1044
2 559
3 3008
4 1625
1 968
.. ...
[200 rows x 1 columns]]
For the Concatenation I tried:
pd.concat(test_full, axis=1)
ValueError Traceback (most recent call last)
<ipython-input-158-f067bc5875c9> in <module>
----> 1 pd.concat(test_full, axis=1)
ValueError: Shape of passed values is (8980, 104), indices imply (200, 104)
As an output I was hoping for:
Unnamed: 1 Unnamed:2 Unnamed:3
1 3520 1232 6349
2 2014 4353 2974
3 10253 1234 1223
4 5929 7456 9854
1 3243 7654 11034
.. ... ... ...
I also don't really know what the shape (8980, 104) and the indices imply (200, 104) are referring to.
I would really appreciate some suggestions.
From my experience, this error tends to happen if either index has duplicate values, since pandas doesn't know how to align them. From your example, it looks like your index contains multiple 1s. If the index isn't needed, you can call df.reset_index(drop=True, inplace=True) on each DataFrame before concatenating.
This issue doesn't occur when you concatenate along the index (axis=0), as that just "puts them on top of each other", regardless of what the index is.
What the error message is telling you is that the resulting shape is (8980, 104), but that the expected shape should be (200, 104) based on the index.
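A short sketch of that fix, with toy one-column frames standing in for test_full (values invented):

import pandas as pd

frames = [pd.DataFrame({"Unnamed: 1": [3520, 2014, 3243]}, index=[1, 2, 1]),
          pd.DataFrame({"Unnamed: 2": [2476, 1455, 2275]}, index=[1, 2, 1])]

for df in frames:
    df.reset_index(drop=True, inplace=True)  # drop the duplicated labels

print(pd.concat(frames, axis=1))  # aligns row-by-row on the fresh 0..n-1 index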

What is happening when pandas.Series converts int64s into NaNs?

I have a csv with dates and integers (Headers: Date, Number), separated by a tab.
I'm trying to create a calendar heatmap with CalMap (demo on that page). The function that creates the chart takes data that's indexed by DateTime.
df = pd.read_csv("data.csv",delimiter="\t")
df['Date'] = df['Date'].astype('datetime64[ns]')
events = pd.Series(df['Date'],index = df['Number'])
calmap.yearplot(events)
But when I check events.head(5), it gives the date followed by NaN. I check df['Number'].head(5) and they appear as int64.
What am I doing wrong that is causing this conversion?
Edit: Data below
Date Number
7/9/2018 40
7/10/2018 40
7/11/2018 40
7/12/2018 70
7/13/2018 30
Edit: Output of events.head(5)
2018-07-09 NaN
2018-07-10 NaN
2018-07-11 NaN
2018-07-12 NaN
2018-07-13 NaN
dtype: float64
First of all, it is not NaN, it is NaT (Not a Timestamp), which is unique to pandas. pandas makes it compatible with NaN and uses it, like NaN in floating-point columns, to mark missing data.
What pd.Series(data, index=index) does depends on the type of data. If data is a list, then index has to be of equal length, and a new Series is constructed with data as the values and index as the index. However, if data is already a Series (such as df['Date']), it instead takes the rows corresponding to index and constructs a new Series out of those rows. For example:
pd.Series(df['Date'], [1, 1, 4])
will give you
1 2018-07-10
1 2018-07-10
4 2018-07-13
where 2018-07-10 comes from row #1 and 2018-07-13 from row #4 of df['Date']. However, there are no rows with index 40, 70 or 30 in your sample input data, so missing data is presumed and NaT is inserted instead.
In contrast, this is what you get when you use a list instead:
pd.Series(df['Date'].to_list(), index=df['Number'])
# => Number
# 40 2018-07-09
# 40 2018-07-10
# 40 2018-07-11
# 70 2018-07-12
# 30 2018-07-13
# dtype: datetime64[ns]
I was able to fix this by turning the Series into lists via df['Date'].tolist() and df['Number'].tolist(). calmap.calendarplot(events) accepted these in place of the original Series arguments.
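Putting that together, a hedged sketch of the working construction (the file name comes from the question; note that calmap expects the numbers to be the data and the dates to be the index, so the arguments are swapped relative to the original code):

import pandas as pd

df = pd.read_csv("data.csv", delimiter="\t")
df['Date'] = pd.to_datetime(df['Date'])

# Build the Series from a plain list so no index alignment is attempted
events = pd.Series(df['Number'].tolist(), index=pd.DatetimeIndex(df['Date']))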

Apply function to all columns in pd dataframe using variables in other dataframes

I have a series of dataframes, some hold static values some are time series.
I have been able to add
I want to transpose the values from one time series to a new time series, applying a function which draws values from both the original time series dataframe and the dataframe which holds static values.
A snip of the time series and static dataframes are below.
Time series dataframe (Irradiance)
Datetime Date Time GHI DIF flagR SE SA TEMP
2017-07-01 00:11:00 01.07.2017 00:11 0 0 0 -9.39 -179.97 11.1
2017-07-01 00:26:00 01.07.2017 00:26 0 0 0 -9.33 -176.47 11.0
2017-07-01 00:41:00 01.07.2017 00:41 0 0 0 -9.14 -172.98 10.9
2017-07-01 00:56:00 01.07.2017 00:56 0 0 0 -8.83 -169.51 10.9
2017-07-01 01:11:00 01.07.2017 01:11 0 0 0 -8.40 -166.04 11.0
Static dataframe (Properties)
Bearing (degrees) Inclination (degrees)
Home ID
151631 244 29
151632 244 29
151633 184 44
I have written a function which I want to use to populate a new dataframe using values from both of these.
import math

def dif(DIF, Inclination, GHI):
    global Albedo
    return DIF * (1 + math.cos(math.radians(Inclination)) / 2) + (GHI * Albedo * (1 - math.cos(math.radians(Inclination)) / 2))
When I tried to do the same within a single dataframe I used the NumPy vectorize function, so I thought I would be able to iterate over each column of the new dataframe with the following code.
for column in DIF:
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], properties.iloc['Inclination (degrees)'][column], irradiance['GHI'])
Instead this throws the following error.
TypeError: cannot do positional indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [Inclination (degrees)] of <class 'str'>
I've checked the dtypes for the values of Inclination (degrees), but they are returned as Int64, not str, so I'm not sure why this error is being generated.
I'm obviously missing something critical here. Are there alternative methods that would work better, or at all? Any help would be much appreciated.
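For what it's worth, a hedged sketch of one likely fix: .iloc is purely positional, which is why passing the string label 'Inclination (degrees)' raises that TypeError; label lookups go through .loc instead. This sketch assumes the new frame's columns are the Home IDs from the Properties dataframe:

import numpy as np

for column in DIF:
    # look up each home's inclination by label rather than by position
    inclination = properties.loc[column, 'Inclination (degrees)']
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], inclination, irradiance['GHI'])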

Plotting counts of a dataframe grouped by timestamp

So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common usecase.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
data.index = data.Timestamp
data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month".
Using pd.resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
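If the NaN months are unwanted in the plot, a small follow-up (continuing df_a from above) fills them with zero before plotting:

import matplotlib.pyplot as plt

df_a['count'] = df_a['count'].fillna(0)  # months with no events become 0
df_a['count'].plot(kind='bar')
plt.show()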

Pandas and Matplotlib: Can't get the stackplot to work on the dataframe

I have a Dataframe object coming from a SQL-Query that looks like this:
Frage/Diskussion ... Wissenschaft&Technik
date ...
2018-05-10 13 ... 6
2018-05-11 28 ... 1
2018-05-12 11 ... 2
2018-05-13 21 ... 3
2018-05-14 30 ... 4
2018-05-15 38 ... 5
2018-05-16 25 ... 7
2018-05-17 23 ... 2
2018-05-18 24 ... 4
2018-05-19 31 ... 4
[10 rows x 6 columns]
I want to visualize this data with a Matplotlib stackplot in python.
What works is following line:
df.plot(kind='area', stacked=True)
What doesn't work is following line:
plt.stackplot(df.index, df.values)
The error I get with the last line is:
"ValueError: operands could not be broadcast together with shapes (10,) (6,) "
Obviously the whole 10 rows x 6 columns array is being passed into the plotting function, and I can't get rid of the shape mismatch.
Writing out each column by hand is also working but not really what I want since there will be many rows later on.
plt.stackplot(df.index.values, df['Frage/Diskussion'], df['Humor'], df['Nachrichten'], df['Politik'], df['Interessant'], df['Wissenschaft&Technik'])
Your problem here is that df.values is a (rows x columns) array, whereas stackplot expects one row per series. To get the form you want, you need to transpose it. Fortunately, that is easy: replace df.values with df.values.T. So in your code, replace:
plt.stackplot(df.index,df.values)
with
plt.stackplot(df.index,df.values.T)
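To make that concrete, here is a self-contained sketch with synthetic data shaped like the frame in the question (10 date rows by 6 topic columns; the column names are invented):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 40, size=(10, 6)),
                  index=pd.date_range("2018-05-10", periods=10),
                  columns=["topic%d" % i for i in range(6)])

# df.values has shape (10, 6); stackplot wants one row per series, i.e. (6, 10)
plt.stackplot(df.index, df.values.T, labels=df.columns)
plt.legend(loc="upper left")
plt.show()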
