I have tried this a few ways and am stumped. My last attempt generates an error that says: "ValueError: Plan shapes are not aligned"
So I have a dataframe that can have up to about 1,000 columns in it, based on data read in from an external file. The columns all have their own labels/names, e.g. "Name", "BirthYear", "Hometown", etc. I want to add a row at the beginning of the dataframe that runs from 0 up to the number of columns minus one, so if the data ends up having 232 columns, this new first row would have the values 0, 1, 2, 3, 4, ..., 229, 230, 231.
What I am doing is creating a one-row dataframe with as many columns/values as there are in the main ("mega") dataframe, and then concatenating them. It throws this shape error at me, but when I print the shape of each frame, they match up in terms of length. Not sure what I am doing wrong, any help would be appreciated. Thank you!
colList = list(range(0, len(mega.columns)))
indexRow = pd.DataFrame(colList).T
print(indexRow)
print(indexRow.shape)
print(mega.shape)
mega = pd.concat([indexRow, mega],axis=0)
Here is the result...
0 1 2 3 4 5 6 7 8 9 ... 1045 \
0 0 1 2 3 4 5 6 7 8 9 ... 1045
1046 1047 1048 1049 1050 1051 1052 1053 1054
0 1046 1047 1048 1049 1050 1051 1052 1053 1054
[1 rows x 1055 columns]
(1, 1055)
(4, 1055)
ValueError: Plan shapes are not aligned
This is one way to do it. Depending on your data, this could mix types (e.g. if one column was timestamps). Also, this resets your index in mega.
import numpy as np
import pandas as pd

mega = pd.DataFrame(np.random.randn(3, 3), columns=list('ABC'))
indexRow = pd.DataFrame({col: [n] for n, col in enumerate(mega)})
>>> pd.concat([indexRow, mega], ignore_index=True)
A B C
0 0.000000 1.000000 2.000000
1 0.413145 -1.475655 0.529429
2 0.416250 -0.055519 1.611539
3 0.154045 -0.038109 1.020616
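Applied to the original code, the key point is to give indexRow the same column labels as mega so that concat can line the columns up; here is a minimal sketch reusing the question's variable names:
import pandas as pd

# build the one-row frame with mega's own column labels so concat aligns the columns
indexRow = pd.DataFrame([range(len(mega.columns))], columns=mega.columns)
mega = pd.concat([indexRow, mega], ignore_index=True)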
Related
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd
df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the dataframe from your example by
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could take a copy of the second column and move its index down by one
>>> s2 = df[2].copy()
>>> s2.index = s2.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s2], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: Generally, shifting is possible with df[2].shift(1), as already posted; however, that would cut off the carried-over value (291 here).
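To see the difference on the question's x1/x2 columns, here is a small sketch comparing a plain shift with the index-based re-combination; only the latter keeps the carried-over 291 in an extra row:
import pandas as pd

df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})

# plain shift: the last value of x2 (291) is pushed off the end
shifted = df.assign(x2=df['x2'].shift(1))

# index-based re-combination: a row 5 appears and keeps 291
s2 = df['x2'].copy()
s2.index = s2.index + 1
combined = pd.concat([df['x1'], s2], axis=1)

print(shifted)
print(combined)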
If you don't want to lose the values you shift past the end of your dataframe, simply add the required number of empty rows first:
offset = 5
empty = pd.DataFrame(np.nan, index=range(offset), columns=DF.columns)
DF = pd.concat([DF, empty])        # add the extra rows first
DF = DF.shift(periods=offset)
DF = DF.reset_index(drop=True)     # only works if the index was sequential
Assuming the imports
import pandas as pd
import numpy as np
First, append a new row of NaN values at the end of the DataFrame (df).
s1 = df.iloc[0] # copy 1st row to a new Series s1
s1[:] = np.NaN # set all values to NaN
df2 = df.append(s1, ignore_index=True) # add s1 to the end of df
It will create a new DataFrame df2. Maybe there is a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a problem similar to yours, I found something in the pandas docs that I think answers this question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
Hope this helps with future questions on the matter.
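A small sketch of that freq behaviour (my own toy frame with a daily DatetimeIndex, not the question's data):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]},
                  index=pd.date_range('2021-01-01', periods=3, freq='D'))
# without freq: the values move down and the first row becomes NaN
print(df.shift(1))
# with freq: the index labels move forward one day and the values stay with their rows
print(df.shift(1, freq='D'))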
df3
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenate them together. But I would really like to see this as a standard feature in pandas so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the Dataframe. I expected shift() to have a flag to change this behaviour, but I can't find anything.
I've been trying to concatenate a list of pandas Dataframes with only one column each, but I keep getting this error:
ValueError: Shape of passed values is (8980, 2), indices imply (200, 2)
I made sure that all the shapes are identical (200 rows × 1 columns) and I removed all the NA values. The concatenation works along the rows (axis=0) but doesn't work along the columns (axis=1).
I previously manipulated the Dataframes with some transpositions (df.T) and other operations like dropna(axis=0, how='all'). I don't think that this could be the cause of the error, because I tried it on a toy dataset and it worked fine. Here's some code for context:
test_full[:3] #this is what my list of pandas Dataframes looks like (the first 3 items)
[Unnamed: 1
1 3520
2 2014
3 10253
4 5929
1 3243
.. ...
[200 rows x 1 columns],
Unnamed: 2
1 2476
2 1455
3 7245
4 4304
1 2275
.. ...
[200 rows x 1 columns],
Unnamed: 3
1 1044
2 559
3 3008
4 1625
1 968
.. ...
[200 rows x 1 columns]]
For the Concatenation I tried:
pd.concat(test_full, axis=1)
ValueError Traceback (most recent call last)
<ipython-input-158-f067bc5875c9> in <module>
----> 1 pd.concat(test_full, axis=1)
ValueError: Shape of passed values is (8980, 104), indices imply (200, 104)
As an output I was hoping for:
Unnamed: 1 Unnamed:2 Unnamed:3
1 3520 1232 6349
2 2014 4353 2974
3 10253 1234 1223
4 5929 7456 9854
1 3243 7654 11034
.. ... ... ...
I also don't really know what the shape (8980, 104) and the "indices imply (200, 104)" parts of the message are referring to.
I would really appreciate some suggestions.
From my experience, this error tends to happen if the index has duplicate values, since concat doesn't know how to align them. From your example, it looks like you have multiple 1s. If the duplicates aren't needed, you can call df.reset_index(drop=True, inplace=True) on each dataframe before concatenating.
This issue doesn't occur when you concatenate along the index as it just "puts them on top of each other", regardless of what the index is.
What the error message is telling you is that the resulting shape is (8980, 104) but that the expected shape should be (200, 104) based on the index.
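A minimal sketch of that fix, assuming test_full is the list of one-column frames from the question:
import pandas as pd

# drop the duplicated row labels so each frame gets a clean 0..199 RangeIndex,
# then concatenate side by side
test_full = [df.reset_index(drop=True) for df in test_full]
result = pd.concat(test_full, axis=1)
print(result.shape)  # should now be (200, number_of_frames)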
Sorry for being naive. I have the following data and I want to feature-engineer some columns, but I don't know how to do multiple operations on the same data frame. One thing to mention: I have multiple entries for each customer, so in the end I want aggregated values (i.e. 1 entry for each customer).
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 06-11-2009 2395
1 860 50.0 09-28-2012 1190
2 1200 100.0 10-25-2005 3720
3 1420 50.0 09-07-2009 2307
4 1940 70.0 01-25-2013 1071
New columns based on min, count, and mean:
customer_purchases['amount'] = customer_purchases.groupby(['customer_id'])['purchase_amount'].agg('min')
customer_purchases['frequency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('count')
customer_purchases['recency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('mean')
Expected outcome:
customer_id purchase_amount date_of_purchase days_since recency frequency amount first_purchase
0 760 25.0 06-11-2009 2395 1273 5 38.000000 3293
1 860 50.0 09-28-2012 1190 118 10 54.000000 3744
2 1200 100.0 10-25-2005 3720 1192 9 102.777778 3907
3 1420 50.0 09-07-2009 2307 142 34 51.029412 3825
4 1940 70.0 01-25-2013 1071 686 10 47.500000 3984
One solution:
I can think of 3 separate operations, one for each needed column, and then joining them all to get a new data frame. I know it's not efficient; it's just to get what I need.
df_1 = customer_purchases.groupby('customer_id', sort = False)["purchase_amount"].min().reset_index(name ='amount')
df_2 = customer_purchases.groupby('customer_id', sort = False)["days_since"].count().reset_index(name ='frequency')
df_3 = customer_purchases.groupby('customer_id', sort = False)["days_since"].mean().reset_index(name ='recency')
However, I either get an error or a data frame without the correct data.
Your help and patience will be appreciated.
SOLUTION
Finally, I found the solution:
def f(x):
    recency = x['days_since'].min()
    frequency = x['days_since'].count()
    monetary_value = x['purchase_amount'].mean()
    c = ['recency', 'frequency', 'monetary_value']
    return pd.Series([recency, frequency, monetary_value], index=c)

df1 = customer_purchases.groupby('customer_id').apply(f)
print(df1)
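For reference, the same one-row-per-customer result can also be written with named aggregation (a sketch assuming pandas >= 0.25; the column choices mirror the function above):
df1 = customer_purchases.groupby('customer_id').agg(
    recency=('days_since', 'min'),
    frequency=('days_since', 'count'),
    monetary_value=('purchase_amount', 'mean'),
)
print(df1)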
Use instead
customer_purchases.groupby('customer_id')['purchase_amount'].transform(lambda x : x.min())
transform will give output for each row of the original dataframe, instead of one row per group as is the case when using agg.
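So, to add all three columns to every row of the original frame (a sketch using the question's column names):
g = customer_purchases.groupby('customer_id')
customer_purchases['amount'] = g['purchase_amount'].transform('min')
customer_purchases['frequency'] = g['days_since'].transform('count')
customer_purchases['recency'] = g['days_since'].transform('mean')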
Example DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(2200, 3300, 50), index=np.random.randint(0, 6, 50), columns=list('A'))
Below is a sample of what the data would look like
A
5 2393
4 2421
0 3038
5 2914
4 2559
4 2314
5 3006
3 2553
0 2642
3 2441
3 2512
0 2412
What I would like to do is drop the first n (let's use 2 for this example) records of each index value. So from the previous data example it would become...
A
4 2314
5 3006
3 2512
0 2412
Any guidance here would be appreciated. I haven't been able to get anything to work.
Use tail with n=-2:
s.groupby(level=0, group_keys=False).apply(pd.DataFrame.tail, n=-2)
A
0 2412
3 2512
4 2314
5 3006
To really nail it down (keeping the groups in their original order of appearance):
s.groupby(level=0, group_keys=False, sort=False).apply(pd.DataFrame.tail, n=-2)
A
5 3006
4 2314
0 2412
3 2512
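An equivalent spelling, if you'd rather not pass pd.DataFrame.tail around, is to use iloc inside apply (a sketch of the same idea, dropping the first n rows of each index group):
n = 2
s.groupby(level=0, group_keys=False, sort=False).apply(lambda g: g.iloc[n:])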
If I understand your question, you want a new dataframe that has the first n rows of your dataframe removed. If that's what you want, I would reset the index, then drop based on pandas' default index, then put the original index back. Here's how you might do that.
df = pd.DataFrame(data=np.random.randint(2200, 3300, 50),
index=np.random.randint(0,6, 50),
columns=list('A'))
n = 5
print(df.head(n * 2))
df_new = df.reset_index().drop(range(n)).set_index('index')
print(df_new.head(n))
New to pandas here. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host+operation pair, and calculate the standard deviation of execution time per host and then per host+operation pair. Seems simple?
It works for grouping by a single column:
df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial 132564 non-null values
host 132564 non-null values
idnum 132564 non-null values
operation 132564 non-null values
time 132564 non-null values
...
dtypes: float32(1), int64(2), object(6)
byhost = df.groupby('host')
byhost.std()
Out[362]:
datespecial idnum time
host
ahost1.test 11946.961952 40367.033852 0.003699
host1.test 15484.975077 38206.578115 0.008800
host10.test NaN 37644.137631 0.018001
...
Nice. Now:
byhostandop = df.groupby(['host', 'operation'])
byhostandop.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
386 # todo, implement at cython level?
387 if ddof == 1:
--> 388 return self._cython_agg_general('std')
389 else:
390 f = lambda x: x.std(ddof=ddof)
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
1615
1616 def _cython_agg_general(self, how, numeric_only=True):
-> 1617 new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
1618 return self._wrap_agged_blocks(new_blocks)
1619
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
1653 values = com.ensure_float(values)
1654
-> 1655 result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
1656
1657 # see if we can cast the block back to the original dtype
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
838 if is_numeric:
839 result = lib.row_bool_subset(result,
--> 840 (counts > 0).view(np.uint8))
841 else:
842 result = lib.row_bool_subset_object(result,
/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'
Huh?? Why do I get this exception?
More questions:
how do I calculate std deviation on dataframe.groupby([several columns])?
how can I limit calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.
It's important to know your version of pandas / Python. Looks like this exception could arise in pandas versions < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid it, you can cast your float columns to float64, e.g. the float32 time column here:
df['time'] = df['time'].astype('float64')
To calculate std() on selected columns, just select columns :)
>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
a b c g
0 0 10 a 1
1 1 11 b 1
2 2 12 c 1
3 3 13 d 2
4 4 14 e 2
5 5 15 f 2
6 6 16 g 3
7 7 17 h 3
8 8 18 i 3
9 9 19 j 3
>>> df.groupby('g')[['a', 'b']].std()
a b
g
1 1.000000 1.000000
2 1.000000 1.000000
3 1.290994 1.290994
Update
As far as it goes, it looks like std() calls aggregation on the groupby result, which runs into a subtle bug (see here - Python Pandas: Using Aggregate vs Apply to define new columns). To avoid this, you can use apply():
byhostandop['time'].apply(lambda x: x.std())
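On a current pandas version that workaround shouldn't be needed: you can select the column and call std() directly on the multi-column groupby (a sketch with a tiny stand-in frame, since the question's data isn't available):
import pandas as pd

df = pd.DataFrame({
    'host': ['h1', 'h1', 'h1', 'h2', 'h2'],
    'operation': ['read', 'read', 'write', 'read', 'read'],
    'time': [0.010, 0.012, 0.030, 0.050, 0.070],
})
# std of the selected column per (host, operation) pair
print(df.groupby(['host', 'operation'])['time'].std())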