Sometimes a DataFrame column's .values comes back as an array of arrays:
df['value'].values
array([[1.51808096e+11],
       [1.49119648e+11],
       ...
       [1.18009284e+11],
       [1.44851665e+11]])
And sometimes it is a regular flat array:
df['value'].values
array([1.51808096e+11,
       1.49119648e+11,
       ...
       1.18009284e+11,
       1.44851665e+11])
A DataFrame created from a CSV will sometimes give one format and sometimes the other, which causes issues downstream. Assigning df['value'].values = df['value'].values.flatten() does not work, since .values cannot be assigned to.
cols = df.columns
# Flatten everything, then rebuild the frame with the original shape and column names
df = pd.DataFrame(df.values.flatten().reshape(-1, df.shape[1]), columns=cols)
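If you only need that one column flattened, a minimal sketch (assuming each cell is either a scalar or a one-element array/list) is to normalise the cells yourself:
import numpy as np

# Works whether df['value'] holds scalars or one-element arrays/lists
flat = np.concatenate([np.atleast_1d(v) for v in df['value']])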
I had a heck of a time recreating your data so that an array of arrays prints out from my column, but maybe using reshape and then grabbing the first index will work for you. For example:
# With just an array
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
arr.reshape(1, -1)[0]
Output:
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
Or with a dataframe example:
import pandas as pd

lst = [['tom', 'reacher', 25], ['krish', 'pete', 30],
       ['nick', 'wilson', 26], ['juli', 'williams', 22]]
df = pd.DataFrame(lst, columns=['FName', 'LName', 'Age'])
df.values.reshape(1,-1)[0]
Output:
array(['tom', 'reacher', 25, 'krish', 'pete', 30, 'nick', 'wilson', 26,
       'juli', 'williams', 22], dtype=object)
If this doesn't work, could you add a minimal working example to your question to recreate your dataframe?
A list comprehension would also do the job:
flat = [v for e in df['value'] for v in e]
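Equivalently, itertools.chain flattens one level, assuming every cell is itself a list or array (the nested case):
from itertools import chain

flat = list(chain.from_iterable(df['value']))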
I have a dataframe like below:
df = pd.DataFrame({
    'Aapl': [12, 5, 8],
    'Fs': [18, 12, 8],
    'Bmw': [6, 18, 12],
    'Year': ['2020', '2025', '2030']
})
I want a dictionary like:
d = {'2020': [12, 18, 6],
     '2025': [5, 12, 18],
     '2030': [8, 8, 12]}
I am not able to develop the whole logic:
lst = [list(item.values()) for item in df.to_dict().values()]
dic = {}
for items in lst:
    for i in items[-1]:
        dic[i] = ...  # '2020' should hold all the 0th values of the other lists, and so on
Is there an easier way using pandas?
Set Year as the index, transpose, and then build the lists in a dict comprehension:
d = {k: list(v) for k, v in df.set_index('Year').T.items()}
print(d)
{'2020': [12, 18, 6], '2025': [5, 12, 18], '2030': [8, 8, 12]}
Or use DataFrame.agg:
d = df.set_index('Year').agg(list, axis=1).to_dict()
print(d)
{'2020': [12, 18, 6], '2025': [5, 12, 18], '2030': [8, 8, 12]}
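As a side note, the same dictionary can be built without transposing, by zipping the Year column against the row values (a sketch, assuming the remaining columns are already in the desired order):
d = {year: row for year, row in zip(df['Year'], df.drop(columns='Year').values.tolist())}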
Try this:
import pandas as pd

data = {'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka'],
        'Age': [21, 19, 20, 18],
        'Stream': ['Math', 'Commerce', 'Arts', 'Biology'],
        'Percentage': [88, 92, 95, 70]}

# Convert the dictionary into a DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age', 'Stream', 'Percentage'])
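This snippet only builds a DataFrame with its own sample data and stops there; presumably the intended next step is the same pattern as the answers above, keyed on whichever column should supply the dictionary keys. A sketch using this sample data, with Name as the key column purely for illustration:
d = df.set_index('Name').agg(list, axis=1).to_dict()
# {'Ankit': [21, 'Math', 88], 'Amit': [19, 'Commerce', 92], ...}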
I am working with a pandas dataframe with multi-index columns (two levels). I need to drop a column from level 0 and later get a list of the remaining columns in level=0. Strangely, the dropping part works fine, but somehow the dropped column shows back up if you call df.columns.levels[0].
Here's an MRE. When I call df.columns the result is this:
MultiIndex([('Week2', 'Hours'),
            ('Week2', 'Sales')],
           )
Which sure looks like Week1 is gone. But if I call df.columns.levels[0].tolist()...
['Week1', 'Week2']
Here's the full code:
import pandas as pd
import numpy as np
n = ['Mickey', 'Minnie', 'Snow White', 'Donald',
'Goofy', 'Elsa', 'Pluto', 'Daisy', 'Mad Hatter']
df1 = pd.DataFrame(data={'Hours': [32, 30, 34, 33, 22, 12, 19, 17, 9],
                         'Sales': [10, 15, 12, 15, 6, 11, 9, 7, 4]},
                   index=n)
df2 = pd.DataFrame(data={'Hours': [40, 33, 29, 31, 17, 22, 13, 16, 12],
                         'Sales': [12, 14, 8, 16, 3, 12, 5, 6, 4]},
                   index=n)
df1.columns = pd.MultiIndex.from_product([['Week1'], df1.columns])
df2.columns = pd.MultiIndex.from_product([['Week2'], df2.columns])
df = pd.concat([df1, df2], axis=1)
#I want to remove a whole level
df = df.drop('Week1', axis=1, level=0)
print(df.columns) #Seems successful after this print, only Week 2 is left
print(df.columns.levels[0].tolist()) #This list includes Week1!
Use remove_unused_levels:
From the documentation:
Unused level(s) means levels that are not expressed in the labels. The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.
df.columns = df.columns.remove_unused_levels()
print(df.columns.levels)
# Output
[['Week2'], ['Hours', 'Sales']]
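Another way to get just the remaining level-0 labels, without rebuilding the index, is to read them off the labels actually in use (get_level_values reflects what is present, not the stored levels):
remaining = df.columns.get_level_values(0).unique().tolist()
# ['Week2']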
I have a list of lists with multiple columns:
columns = [id, date, col1, col2, ..., coln]
list_OfRows = [[1, date1, 10, 20, ..., 23],
               [1, date1, 1, 10, ..., 33],
               [2, date2, 3, 7, ..., 8],
               [2, date2, 21, 9, ..., 23],
               [2, date3, 10, 56, ..., 20],
               [2, date4, 10, 20, ..., 42]]
I want to group by id and date and sum the remaining columns, WITHOUT USING PANDAS.
RESULT = [[1, date1, 11, 30, ..., 56],
          [2, date2, 24, 16, ..., 31],
          [2, date3, 10, 56, ..., 20],
          [2, date4, 10, 20, ..., 42]]
You can do it like this:
from itertools import groupby

list_OfRows.sort(key=lambda x: x[:2])

res = []
for k, g in groupby(list_OfRows, key=lambda x: x[:2]):
    res.append(k + list(map(sum, zip(*[c[2:] for c in g]))))
which produces:
[[1, 'date1', 11, 30, 56],
 [2, 'date2', 24, 16, 31],
 [2, 'date3', 10, 56, 20],
 [2, 'date4', 10, 20, 42]]
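For reference, a self-contained run with concrete sample data (three value columns and string dates, chosen only for illustration) reproduces the output above:
from itertools import groupby

list_OfRows = [[1, 'date1', 10, 20, 23],
               [1, 'date1', 1, 10, 33],
               [2, 'date2', 3, 7, 8],
               [2, 'date2', 21, 9, 23],
               [2, 'date3', 10, 56, 20],
               [2, 'date4', 10, 20, 42]]

# Sort so rows sharing (id, date) are adjacent, as groupby requires
list_OfRows.sort(key=lambda x: x[:2])

res = []
for k, g in groupby(list_OfRows, key=lambda x: x[:2]):
    # k is the [id, date] prefix; sum the remaining columns position-wise
    res.append(k + list(map(sum, zip(*[c[2:] for c in g]))))

print(res)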
I have the following matrix:
M = np.matrix([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
               [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])
And I receive a vector indexing the columns of the matrix:
index = np.array([1,1,2,2,2,2,3,4,4,4])
This vector has 4 different values, so my objective is to create a list containing four new matrices, so that the first matrix is made of the first two columns of M, the second of columns 3 to 6, and so on:
M1 = np.matrix([[1,2],[11,12],[21,22]])
M2 = np.matrix([[3,4,5,6],[13,14,15,16],[23,24,25,26]])
M3 = np.matrix([[7],[17],[27]])
M4 = np.matrix([[8,9,10],[18,19,20],[28,29,30]])
l = [M1, M2, M3, M4]
I need to do this in an automated way, since the number of rows and columns of M as well as the indexing scheme are not fixed. How can I do this?
There are 3 points to note:
For a variable number of variables, as in this case, the recommended solution is to use a dictionary.
You can use simple numpy indexing for the individual case.
Unless you have a very specific reason, use numpy.array instead of numpy.matrix.
Combining these points, you can use a dictionary comprehension:
d = {k: np.array(M[:, np.where(index==k)[0]]) for k in np.unique(index)}
Result:
{1: array([[ 1,  2],
           [11, 12],
           [21, 22]]),
 2: array([[ 3,  4,  5,  6],
           [13, 14, 15, 16],
           [23, 24, 25, 26]]),
 3: array([[ 7],
           [17],
           [27]]),
 4: array([[ 8,  9, 10],
           [18, 19, 20],
           [28, 29, 30]])}
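As a side note to the dictionary approach, if (as in this example) the index vector is sorted so that each group occupies a contiguous block of columns, np.split at the positions where the label changes yields the same pieces as a plain list. A sketch:
boundaries = np.where(np.diff(index) != 0)[0] + 1   # column positions where the group label changes
parts = np.split(np.asarray(M), boundaries, axis=1)  # [cols 0-1, cols 2-5, col 6, cols 7-9]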
import numpy as np

M = np.matrix([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
               [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])
index = np.array([1, 1, 2, 2, 2, 2, 3, 4, 4, 4])

m = [[], [], [], []]
for i, c in enumerate(index):
    m[c - 1].append(i)   # collect the column positions that belong to group label c
for idx in m:
    print(M[:, idx])
This is a little hard-coded; I assumed you will always want 4 matrices and such. You can change it for more generalisation.
For a DataFrame df:
name list1 list2
a [1, 3, 10, 12, 20..] [2, 6, 23, 29...]
b [2, 10, 14, 3] [4, 7, 8, 13...]
c [] [98, 101, 200]
...
I want to convert list1 and list2 to np.array and then hstack them. Here is what I did:
df.pv = df.apply(lambda row: np.hstack((np.asarray(row.list1), np.asarray(row.list2))), axis=1)
And I got this error:
ValueError: Shape of passed values is (138493, 175), indices imply (138493, 4)
where 138493 == len(df).
Please note that some values in list1 and list2 are empty lists, [], and the list lengths differ between rows. Do you know what the reason is, and how I can fix the problem? Thanks in advance!
EDIT:
When I just try to convert one list to an array:
df.apply(lambda row: np.asarray(row.list1), axis=1)
An error also occurs:
ValueError: Empty data passed with indices specified.
Your apply call is almost correct. All you have to do is convert the output of np.hstack() back to a Python list:
df.apply(lambda row: list(np.hstack((np.asarray(row.list1), np.asarray(row.list2)))), axis=1)
The code is shown below (including the df creation):
import numpy as np
import pandas as pd

df = pd.DataFrame([('a', [1, 3, 10, 12, 20], [2, 6, 23, 29]),
                   ('b', [2, 10, 1.4, 3], [4, 7, 8, 13]),
                   ('c', [], [98, 101, 200])],
                  columns=['name', 'list1', 'list2'])
df['list3'] = df.apply(lambda row: list(np.hstack((np.asarray(row.list1), np.asarray(row.list2)))), axis=1)
print(df['list3'])
Output:
0 [1, 3, 10, 12, 20, 2, 6, 23, 29]
1 [2.0, 10.0, 1.4, 3.0, 4.0, 7.0, 8.0, 13.0]
2 [98.0, 101.0, 200.0]
Name: list3, dtype: object
If you want a numpy array, the only way I could get it to work is:
df['list3'] = df['list3'].apply(lambda x: np.array(x))
print(type(df['list3'].iloc[0]))
Out[]: numpy.ndarray
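As a side note, a plain list comprehension sidesteps apply's shape inference entirely and stores the arrays directly (a sketch, assuming the same column names):
df['list3'] = [np.hstack((a, b)) for a, b in zip(df['list1'], df['list2'])]
# Each cell is now a numpy array; empty lists are fine, since np.hstack
# concatenates a zero-length array with a non-empty one without complaint.
type(df['list3'].iloc[0])  # numpy.ndarray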