I have a list of arrays:
[array([10,20,30]), array([5,6,7])]
How do I turn it into a pandas DataFrame? pd.DataFrame() puts the arrays in one column. The desired result is:
 0  1  2
10 20 30
 5  6  7
Here 0 1 2 are the column names.
import pandas as pd
import numpy as np
a = [np.array([10,20,30]), np.array([5,6,7])]
print(pd.DataFrame(a))
Make sure you put the np before the array.
import pandas as pd
import numpy as np
arrays = [np.array([10,20,30]), np.array([5,6,7])]  # avoid shadowing the built-in name `list`
df = pd.DataFrame(arrays)
print(df)
output:
    0   1   2
0  10  20  30
1   5   6   7
If you still get an error, is the list of arrays a result from previous data manipulation or did you manually type out the values / array lists?
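If the arrays come out of earlier processing, stacking them explicitly can make the intended shape easier to check. The following is a minimal sketch, assuming every array has the same length; the name `arrays` is only illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative input: two equal-length 1-D arrays.
arrays = [np.array([10, 20, 30]), np.array([5, 6, 7])]

# np.vstack builds a 2-D array with one row per input array,
# so each array becomes one row of the DataFrame.
stacked = np.vstack(arrays)
df = pd.DataFrame(stacked)

print(stacked.shape)
print(df)
```

If the arrays differ in length, np.vstack raises an error, which is a quick way to find out that the input is not rectangular.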
I am trying to append certain strings to a matrix. One that reads "Target" on the 0th row 0th column. And one that reads "Dog" on the 0th column 1st row downwards to the last row of the matrix.
My initial matrix looks like:
enter image description here
I have a small issue with the following program:
import numpy as np
import pandas as pd
main=pd.read_csv('C:/Users/Jonas/Desktop/testfile/biggertest.csv', header=None)
target_col = ['dog'] * main.shape[0]
main.insert(loc = 0, column = 'target', value = target_col)
This creates a new matrix that looks like this:
enter image description here
Instead of:
enter image description here
I'm wondering what I need to change to make this happen?
Cheers.
You could simply make the following modification.
import numpy as np
import pandas as pd
main=pd.read_csv('C:/Users/Jonas/Desktop/testfile/biggertest.csv', header=None)
target_col = ['dog'] * main.shape[0]
target_col[0] = 'target'
main.insert(loc = 0, column = -1, value = target_col)
Alternatively,
import numpy as np
import pandas as pd
main=pd.read_csv('C:/Users/Jonas/Desktop/testfile/biggertest.csv', header=None)
main.insert(loc = 0, column = -1, value = 'dog')
main.at[0,-1] = 'target'
If you want the column indices to go from 0 to 4 (instead of -1 to 3), then you can add the following command:
main.rename(columns = lambda x:x+1,inplace=True)
Resulting output from all commands:
        0  1  2  3  4
0  target  5  5  8  9
1     dog  9  0  2  6
2     dog  6  6  4  3
3     dog  3  3  3  3
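To try this without the CSV file, here is a small sketch that builds the frame in memory instead. The numeric values are taken from the output above; treating them as the file's contents is an assumption.

```python
import pandas as pd

# Stand-in for pd.read_csv(..., header=None): columns are labelled 0..3.
main = pd.DataFrame([[5, 5, 8, 9],
                     [9, 0, 2, 6],
                     [6, 6, 4, 3],
                     [3, 3, 3, 3]])

# Insert a new first column labelled -1, filled with 'dog'.
main.insert(loc=0, column=-1, value='dog')

# Overwrite the first row of that column with 'target'.
main.at[0, -1] = 'target'

# Shift the column labels from -1..3 to 0..4.
main.rename(columns=lambda x: x + 1, inplace=True)

print(main)
```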
I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) with the dataframe of several columns.
How can this be fixed?
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
I think you can just use .corr, which returns the correlations between all pairs of columns, and then select just the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work.
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
import pandas as pd
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c!='a'], columns=['var', 'corr'])
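Since the question asks about the numerical columns only, one possible sketch is to select them first and then call corrwith. The DataFrame `movies` and its columns other than `imdb_score` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one text column and three numeric columns.
movies = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'imdb_score': [6.0, 7.0, 8.0, 9.0],
    'budget': [1.0, 2.0, 3.0, 4.0],
    'duration': [120.0, 110.0, 100.0, 90.0],
})

# Keep only the numeric columns so the text column is not passed to corrwith.
numeric = movies.select_dtypes(include='number')

# Correlate every other numeric column with imdb_score.
corr = numeric.drop(columns='imdb_score').corrwith(numeric['imdb_score'])

print(corr)
```

Dropping `imdb_score` before the call just leaves out the trivial correlation of the column with itself.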
I have a pandas dataframe with more than 50 columns. All the data except the 1st column is float. I want to replace any value greater than 5.75 with 100. Can someone advise any function to do the same.
The replace function does not work here, as to_value can only match on equality ("=") and not on a greater-than comparison.
This can be done using
df['ColumnName'] = np.where(df['ColumnName'] > 5.75, 100, df['ColumnName'])
You can make a custom function and pass it to apply:
import pandas as pd
import random
df = pd.DataFrame({'col_name': [random.randint(0,10) for x in range(100)]})
def f(x):
    # Replace values greater than 5.75 with 100; leave the rest unchanged.
    if x > 5.75:
        return 100
    return x
df['modified'] = df['col_name'].apply(f)
print(df.head())
col_name modified
0 2 2
1 5 5
2 7 100
3 1 1
4 9 100
If you have a dataframe:
import pandas as pd
import random
df = pd.DataFrame({'first_column': [random.uniform(5,6) for x in range(10)]})
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 5.923234
8 5.489964
9 5.127154
Then check if the value is greater than 5.75:
df[df > 5.75] = 100
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 100.000000
8 5.489964
9 5.127154
import numpy as np
import pandas as pd
#Create df
np.random.seed(0)
df = pd.DataFrame(2*np.random.randn(100,50))
for col_name in df.columns[1:]:  # Skip first column
    # A single .loc call avoids chained assignment, which may not write back to df
    df.loc[df[col_name] > 5.75, col_name] = 100
np.where(df.value > 5.75, 100, df.value)
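For the case in the question (a non-float first column followed by many float columns), a loop-free variant can be sketched with DataFrame.mask. The column names and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical frame: first column is text, the rest are floats.
df = pd.DataFrame({
    'name': ['x', 'y', 'z'],
    'a': [1.0, 6.0, 5.75],
    'b': [7.5, 2.0, 5.8],
})

# Every column except the first one.
float_cols = df.columns[1:]

# mask() replaces values where the condition is True and keeps the others.
df[float_cols] = df[float_cols].mask(df[float_cols] > 5.75, 100)

print(df)
```

The value 5.75 itself is left alone because the comparison is strictly greater than.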
Say I have two pandas Series in python:
import pandas as pd
h = pd.Series(['g',4,2,1,1])
g = pd.Series([1,6,5,4,"abc"])
I can create a DataFrame with just h and then append g to it:
df = pd.DataFrame([h])
df1 = df.append(g, ignore_index=True)
I get:
>>> df1
0 1 2 3 4
0 g 4 2 1 1
1 1 6 5 4 abc
But now suppose that I have an empty DataFrame and I try to append h to it:
df2 = pd.DataFrame([])
df3 = df2.append(h, ignore_index=True)
This does not work. I think the problem is in the second-to-last line of code. I need to somehow define the blank DataFrame to have the proper number of columns.
By the way, the reason I am trying to do this is that I am scraping text from the internet using requests+BeautifulSoup and I am processing it and trying to write it to a DataFrame one row at a time.
So if you don't pass an empty list to the DataFrame constructor then it works:
In [16]:
df = pd.DataFrame()
h = pd.Series(['g',4,2,1,1])
df = df.append(h,ignore_index=True)
df
Out[16]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
The difference between the two constructor approaches appears to be how the index dtype is set: with no argument it is object, and with an empty list it is int64:
In [21]:
df = pd.DataFrame()
print(df.index.dtype)
df = pd.DataFrame([])
print(df.index.dtype)
object
int64
Unclear to me why the above should affect the behaviour (I'm guessing here).
UPDATE
After revisiting this I can confirm that this looks to me to be a bug in pandas version 0.12.0 as your original code works fine:
In [13]:
import pandas as pd
df = pd.DataFrame([])
h = pd.Series(['g',4,2,1,1])
df.append(h,ignore_index=True)
Out[13]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
I am running pandas 0.13.1 and numpy 1.8.1 (64-bit) on Python 3.3.5.0. I think the problem is in pandas, but I would upgrade both pandas and numpy to be safe. I don't think this is a 32-bit versus 64-bit Python issue.
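Note that DataFrame.append was removed in pandas 2.0, so the snippets in this thread only run on older versions. For the row-at-a-time use case, a rough sketch is to collect rows in a plain list and build the DataFrame once, using pd.concat for any row that arrives later. The extra Series `late_row` is made up for illustration.

```python
import pandas as pd

# Collect rows in a regular Python list while processing.
rows = []
for values in (['g', 4, 2, 1, 1], [1, 6, 5, 4, 'abc']):
    rows.append(values)

# Build the DataFrame once, after all rows are known.
df = pd.DataFrame(rows)

# A row that arrives later as a Series can be added with pd.concat:
# to_frame().T turns the Series into a one-row DataFrame.
late_row = pd.Series([7, 8, 9, 10, 11])
df = pd.concat([df, late_row.to_frame().T], ignore_index=True)

print(df)
```

Building the frame once is usually cheaper than growing it inside the loop, since each concat copies the existing data.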
Imagine I have a series that looks like this:
Out[64]:
2 0
3 1
80 1
83 1
84 2
85 2
How can I append an item at the very beginning of this series? The native pandas.Series.append function only appends at the end.
Thanks a lot.
There is a pandas.concat function...
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
See the Merge, Join, and Concatenate documentation.
Using concat, or append, the resulting series will have duplicate indices:
for concat():
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
Out[143]:
0 1
0 2
1 3
2 4
and for append():
import pandas as pd
a = pd.Series([2,3,4])
a.append(pd.Series([1]))
Out[149]:
0 2
1 3
2 4
0 1
This could be a problem in the future, since a[0] (if you assign the result to a) will return two values for either case.
My solutions in this case are:
import pandas as pd
a = pd.Series([2,3,4])
b = [1]
b[1:] = a
pd.Series(b)
Out[199]:
0 1
1 2
2 3
3 4
or, by reindexing with concat():
import pandas as pd
a = pd.Series([2,3,4])
a.index = a.index + 1
pd.concat([pd.Series([1]), a])
Out[208]:
0 1
1 2
2 3
3 4
In case you need to prepend a single value from a different Series b, say its last value, this is what works for me:
import pandas as pd
a = pd.Series([2, 3, 4])
b = pd.Series([0, 1])
pd.concat([b[-1:], a])
Similarly, you can use append with a list or tuple of series (so long as you're using pandas version 0.13 or greater):
import pandas as pd
a = pd.Series([2,3,4])
# append is an instance method: call it on the series that should come first
pd.Series([1]).append([a])
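Series.append was removed in pandas 2.0, so on current versions pd.concat is the option that remains. A short sketch that also avoids the duplicate-index problem mentioned earlier in this thread, assuming the original index labels do not need to be preserved:

```python
import pandas as pd

a = pd.Series([2, 3, 4])

# ignore_index=True discards the original labels and renumbers from 0,
# so the result has no duplicate index values.
result = pd.concat([pd.Series([1]), a], ignore_index=True)

print(result)
```

If the original labels matter (as in the question, where the index is 2, 3, 80, ...), give the new item its own unused label instead of passing ignore_index=True.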