Imagine I have a Series that looks like this:
Out[64]:
2 0
3 1
80 1
83 1
84 2
85 2
How can I add an item at the very beginning of this Series? The native pandas.Series.append function only appends at the end.
Thanks a lot.
There is a pandas.concat function...
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
See the Merge, Join, and Concatenate documentation.
Using concat, or append, the resulting series will have duplicate indices:
for concat():
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
Out[143]:
0 1
0 2
1 3
2 4
and for append():
import pandas as pd
a = pd.Series([2,3,4])
a.append(pd.Series([1]))
Out[149]:
0 2
1 3
2 4
0 1
This can cause problems later, because a[0] (if you assign the result back to a) returns two values in either case.
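To make the duplicate-label problem concrete, here is a minimal sketch. It uses .loc for the lookup to keep it explicitly label-based, and the variable names are only illustrative:

```python
import pandas as pd

a = pd.Series([2, 3, 4])
combined = pd.concat([pd.Series([1]), a])

# The index label 0 now appears twice, so the index is no longer unique.
has_unique_index = combined.index.is_unique

# A label lookup returns a Series with both matching rows instead of a scalar.
matches = combined.loc[0]

print(has_unique_index)   # False
print(matches.tolist())   # [1, 2]
```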
My solutions in this case are:
import pandas as pd
a = pd.Series([2,3,4])
b = [1]
b[1:] = a
pd.Series(b)
Out[199]:
0 1
1 2
2 3
3 4
or, by reindexing with concat():
import pandas as pd
a = pd.Series([2,3,4])
a.index = a.index + 1
pd.concat([pd.Series([1]), a])
Out[208]:
0 1
1 2
2 3
3 4
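Another option, if the old index labels don't matter, is to let concat renumber the result with ignore_index=True. A minimal sketch, assuming a fresh 0..n-1 index is what you want:

```python
import pandas as pd

a = pd.Series([2, 3, 4])

# ignore_index=True discards the original labels and assigns 0..n-1.
prepended = pd.concat([pd.Series([1]), a], ignore_index=True)

print(prepended.tolist())        # [1, 2, 3, 4]
print(prepended.index.tolist())  # [0, 1, 2, 3]
```

This avoids having to shift a.index by hand, at the cost of losing any meaningful labels the Series had.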
In case you need to prepend a single value from a different Series b, say its last value, this is what works for me:
import pandas as pd
a = pd.Series([2, 3, 4])
b = pd.Series([0, 1])
pd.concat([b[-1:], a])
Similarly, you can call append on the new Series and pass it a list or tuple of Series (this needs pandas 0.13 or later; note that Series.append was removed in pandas 2.0, where pd.concat is the replacement):
import pandas as pd
a = pd.Series([2,3,4])
pd.Series([1]).append([a])
I have a DataFrame with four columns. For each row, I want the minimum of the first two columns and the minimum of the last two columns.
Code:
import numpy as np
import pandas as pd
np.random.seed(0)
xdf = pd.DataFrame({'a':np.random.rand(1,10)[0]*10,'b':np.random.rand(1,10)[0]*10,'c':np.random.rand(1,10)[0]*10,'d':np.random.rand(1,10)[0]*10,},index=np.arange(0,10,1))
xdf['ab_min'] = xdf[['a','b']].min(axis=1)
xdf['cd_min'] = xdf[['c','d']].min(axis=1)
xdf['minimum'] = xdf['ab_min'].list()+xdf['cd_min'].list()
Expected answer:
xdf['minimum']
0 [ab_min,cd_min]
1 [ab_min,cd_min]
2 [ab_min,cd_min]
3 [ab_min,cd_min]
Present answer:
AttributeError: 'Series' object has no attribute 'list'
Select the columns ab_min and cd_min, convert them to a NumPy array with to_numpy, turn that into a list of rows with tolist, and assign the result to the minimum column:
xdf['minimum'] = xdf[['ab_min', 'cd_min']].to_numpy().tolist()
>>> xdf['minimum']
0 [3.23307959607905, 1.9836323494587338]
1 [6.189440334168731, 1.0578078219990983]
2 [3.1194570407645217, 1.2816570607783184]
3 [1.9170068676155894, 7.158027504597937]
4 [0.6244579166416464, 8.568849995324166]
5 [4.108986697339397, 0.6201685780268684]
6 [4.170639127277155, 2.3385281968695693]
7 [2.0831140755567814, 5.94063873401418]
8 [0.4887113296319978, 6.380570614449363]
9 [2.844815261473105, 0.9146457613970793]
Name: minimum, dtype: object
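If you don't need the intermediate ab_min and cd_min columns in the frame, the same idea can be sketched by pairing the two row-wise minima directly. The small frame below is made-up data for illustration, not the random data from the question:

```python
import pandas as pd

# Hypothetical small frame so the result is easy to check by eye.
xdf = pd.DataFrame({
    'a': [3.0, 1.0, 5.0],
    'b': [2.0, 4.0, 6.0],
    'c': [9.0, 7.0, 8.0],
    'd': [1.0, 8.0, 2.0],
})

ab_min = xdf[['a', 'b']].min(axis=1)
cd_min = xdf[['c', 'd']].min(axis=1)

# Pair the two minima row by row; each cell of 'minimum' holds a 2-item list.
pairs = [list(p) for p in zip(ab_min.tolist(), cd_min.tolist())]
xdf['minimum'] = pd.Series(pairs, index=xdf.index)

print(xdf['minimum'].tolist())  # [[2.0, 1.0], [1.0, 7.0], [5.0, 2.0]]
```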
Try this:
import pandas as pd
import numpy as np
xdf = pd.DataFrame({'a':np.random.rand(1,10)[0]*10,'b':np.random.rand(1,10)[0]*10,'c':np.random.rand(1,10)[0]*10,'d':np.random.rand(1,10)[0]*10,},index=np.arange(0,10,1))
print(xdf)
ab = xdf['ab_min'] = xdf[['a','b']].min(axis=1)
cd = xdf['cd_min'] = xdf[['c','d']].min(axis=1)
blah = pd.concat([ab, cd], axis=1)
print(blah)
Result: blah is a two-column DataFrame holding the a/b minimum and the c/d minimum for each row.
You can use .apply with a lambda function along axis=1:
xdf['minimum'] = xdf.apply(lambda x: [x[['a','b']].min(),x[['c','d']].min()], axis=1)
Result:
>>> xdf
a b c d minimum
0 0.662634 4.166338 8.864823 9.004818 [0.6626341544146663, 8.864822751494284]
1 6.854054 6.163417 6.510728 0.049498 [6.163416966676091, 0.04949754019059838]
2 6.389760 4.462319 2.435369 3.732534 [4.462318678134215, 2.4353686460846893]
3 4.628735 7.571098 1.900726 9.046384 [4.628735362058981, 1.9007255361271058]
4 3.203285 4.364302 2.473973 2.911911 [3.203285015796596, 2.4739732602476727]
5 5.357440 3.166420 9.908758 0.910704 [3.166420385020304, 0.91070444348338]
6 8.120486 6.395869 0.970977 5.278279 [6.395868901095546, 0.9709769503958143]
7 1.574765 7.184971 3.835641 4.495135 [1.574765093192545, 3.835640598199231]
8 8.688497 0.069061 0.771772 8.971878 [0.06906065557899743, 0.7717717844423222]
9 5.455920 2.630342 1.966357 7.374366 [2.6303421168291843, 1.966357159086991]
I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) with the dataframe of several columns.
How can this be fixed?
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
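Since the question mentions a frame that also has non-numeric columns, here is a hedged sketch that restricts corrwith to the numeric ones first. IM_demo and its columns other than imdb_score are made-up stand-ins for the asker's IM frame:

```python
import pandas as pd

# Hypothetical stand-in for the IM frame from the question.
IM_demo = pd.DataFrame({
    'title': ['w', 'x', 'y', 'z'],
    'imdb_score': [1.0, 2.0, 3.0, 4.0],
    'budget': [2.0, 4.0, 6.0, 8.0],
    'gross': [4.0, 3.0, 2.0, 1.0],
})

# Keep only numeric columns, then correlate every other column with the target.
numeric = IM_demo.select_dtypes(include='number')
corr = numeric.drop(columns='imdb_score').corrwith(numeric['imdb_score'])

print(corr)
```

Dropping the target column first just removes the trivial 1.0 self-correlation from the result.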
I think you can just use .corr, which returns the correlations between all pairs of columns, and then select only the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work.
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
import pandas as pd
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c!='a'], columns=['var', 'corr'])
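As a quick sanity check that the "compute everything, then select" and "compute only what you need" routes agree, here is a small sketch on random data. The seed, shape and column names are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list('abcd'))

# Route 1: full correlation matrix, then pick column 'a'.
full = df.corr()['a'].drop('a')

# Route 2: only the correlations against 'a'.
direct = df.drop(columns='a').corrwith(df['a'])

same = np.allclose(full.to_numpy(), direct.to_numpy())
print(same)
```

On a frame with many columns the second route does far less work, since it skips every pair that doesn't involve 'a'.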
I have the DataFrame below.
I want to subtract each even-indexed row's value from the odd-indexed row that follows it,
and build a new DataFrame from the results.
How can I do it?
import pandas as pd
import numpy as np
raw_data = {'Time': [281.54385, 436.55295, 441.74910, 528.36445,
974.48405, 980.67895, 986.65435, 1026.02485]}
data = pd.DataFrame(raw_data)
data
dataframe
Time
0 281.54385
1 436.55295
2 441.74910
3 528.36445
4 974.48405
5 980.67895
6 986.65435
7 1026.02485
Wanted result
ON_TIME
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use NumPy indexing:
res = pd.DataFrame(data.values[1::2] - data.values[::2], columns=['Time'])
print(res)
Time
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use shift for the subtraction, and then pick every second element, starting with the second element (index = 1):
(data.Time - data.Time.shift())[1::2].rename('On Time').reset_index(drop=True)
outputs:
0 155.00910
1 86.61535
2 6.19490
3 39.37050
Name: On Time, dtype: float64
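A third way to sketch this is to reshape the column into (start, end) pairs and subtract the two columns of that array. This assumes the number of rows is even, and it uses the ON_TIME column name from the wanted result:

```python
import pandas as pd

data = pd.DataFrame({'Time': [281.54385, 436.55295, 441.74910, 528.36445,
                              974.48405, 980.67895, 986.65435, 1026.02485]})

# Each row of `pairs` is (even-indexed value, following odd-indexed value).
pairs = data['Time'].to_numpy().reshape(-1, 2)

res = pd.DataFrame({'ON_TIME': pairs[:, 1] - pairs[:, 0]})
print(res)
```

With an odd number of rows, reshape(-1, 2) raises a ValueError, so you would need to drop or handle the unpaired last row first.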
Is there a function that splits a pandas.dataframe object into multiple sub-dataframes, by a specific column value? For example, if I have
A 1
B 2
A 3
B 4
I want the result as follow:
A 1
A 3
and
B 2
B 4
In R, this is the split function. How is it done in Python? I know I can subset inside a for loop, but is there a function that does it? Thanks.
You can use groupby() with list-comprehension to extract a list of sub data frames where each of them contains only a single ind value:
import pandas as pd
from io import StringIO  # Python 3: StringIO lives in the io module
df = pd.read_csv(StringIO("""A 1
B 2
A 3
B 4"""), sep=r"\s+", names=['ind', 'value'])
lst = [g for _, g in df.groupby('ind')]
lst[0]
# ind value
#0 A 1
#2 A 3
lst[1]
# ind value
#1 B 2
#3 B 4
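If you'd rather look the pieces up by their group value than by list position, the same groupby iteration can feed a dict comprehension. A minimal sketch, where the name parts is just illustrative:

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO("A 1\nB 2\nA 3\nB 4"), sep=r"\s+", names=['ind', 'value'])

# Map each distinct value of 'ind' to the sub-frame containing its rows.
parts = {key: group for key, group in df.groupby('ind')}

print(sorted(parts))               # ['A', 'B']
print(parts['A']['value'].tolist())  # [1, 3]
```

This is closer to what R's split returns, a named collection of sub-frames.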
Say I have two pandas Series in python:
import pandas as pd
h = pd.Series(['g',4,2,1,1])
g = pd.Series([1,6,5,4,"abc"])
I can create a DataFrame with just h and then append g to it:
df = pd.DataFrame([h])
df1 = df.append(g, ignore_index=True)
I get:
>>> df1
0 1 2 3 4
0 g 4 2 1 1
1 1 6 5 4 abc
But now suppose that I have an empty DataFrame and I try to append h to it:
df2 = pd.DataFrame([])
df3 = df2.append(h, ignore_index=True)
This does not work. I think the problem is in the second-to-last line of code. I need to somehow define the blank DataFrame to have the proper number of columns.
By the way, the reason I am trying to do this is that I am scraping text from the internet using requests+BeautifulSoup and I am processing it and trying to write it to a DataFrame one row at a time.
So if you don't pass an empty list to the DataFrame constructor then it works:
In [16]:
df = pd.DataFrame()
h = pd.Series(['g',4,2,1,1])
df = df.append(h,ignore_index=True)
df
Out[16]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
The difference between the two constructor approaches appears to be that the index dtypes are set differently: with an empty list it is int64, and with no argument it is object:
In [21]:
df = pd.DataFrame()
print(df.index.dtype)
df = pd.DataFrame([])
print(df.index.dtype)
object
int64
It is unclear to me why the above should affect the behaviour (I'm guessing here).
UPDATE
After revisiting this, I can confirm that this looks to me like a bug in pandas version 0.12.0, as your original code works fine for me:
In [13]:
import pandas as pd
df = pd.DataFrame([])
h = pd.Series(['g',4,2,1,1])
df.append(h,ignore_index=True)
Out[13]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
I am running pandas 0.13.1 and numpy 1.8.1 (64-bit) on Python 3.3.5.0. I think the problem is in pandas, but I would upgrade both pandas and numpy to be safe. I don't think this is a 32-bit versus 64-bit Python issue.
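For the row-at-a-time scraping use case, a common alternative is to collect the rows in a plain Python list and build the DataFrame once at the end. This also sidesteps DataFrame.append entirely, which matters on pandas 2.0 and later, where that method was removed. A minimal sketch, with scraped_rows as a made-up stand-in for whatever the scraping loop produces:

```python
import pandas as pd

# Hypothetical stand-in for rows yielded by a requests + BeautifulSoup loop.
scraped_rows = [['g', 4, 2, 1, 1], [1, 6, 5, 4, 'abc']]

rows = []
for row in scraped_rows:
    # Do any per-row processing here, then store the finished row.
    rows.append(row)

# Build the frame once; the column count comes from the rows themselves.
df = pd.DataFrame(rows)
print(df)
```

Building the frame once is also usually faster than growing it row by row, since each append copies the whole frame.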