I'm trying to convert my dataframe columns to arrays. For example, I have a dataframe that looks like this:
Total Price Carrier
2 3 C
1 5 D
I'd like to convert the columns to arrays like this: [[2, 1], [3, 5], ['C', 'D']]. I do not want the column names.
I've tried doing this:
df["all"] = 1
df.groupby("all")[["Total","Price", "Carrier"]].apply(list)
However, I get something like this: ["Total", "Price", "Carrier"], and it's an object, not an array. How can I convert all the columns to arrays?
Use df.values instead of apply:
>>> df.values.T.tolist()
[[2, 1], [3, 5], ['C', 'D']]
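Note that df.values converts the whole frame to a single NumPy array, so with mixed dtypes (numbers and strings here) you get an object array, and .T.tolist() then yields one plain Python list per column. In recent pandas, to_numpy() is the recommended spelling; a minimal sketch on the sample frame:

import pandas as pd

df = pd.DataFrame({'Total': [2, 1], 'Price': [3, 5], 'Carrier': ['C', 'D']})

# to_numpy() is the modern equivalent of .values; with mixed dtypes the
# result is an object array, so tolist() gives back plain Python objects
print(df.to_numpy().T.tolist())  # [[2, 1], [3, 5], ['C', 'D']]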
So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3, 3, 2, 1], [4, 3, 6, 6, 3, 4], [7, 2, 9, 9, 2, 7]]),
                  columns=['a', 'b', 'c', 'a_select', 'b_select', 'c_select'])
df
Now, I may need to reorganize the dataframe (or use two) to accomplish this, but...
I'd like to select the 2 largest values from each '_select' column per row, then use them to take the mean of the corresponding columns.
For example, row 1 would average the values from a & b, row 2 from a & c (NOT the values from the _select columns that we're looking at).
Currently I'm just iterating over each row, as that seems rather simple, but it's slow with a large dataset, and I can't figure out how to do the equivalent with an apply or lambda function (or whether it's even possible).
Simple one-liner using nlargest:
>>> df.filter(like='select').apply(lambda s: s.nlargest(2), 1).mean(1)
For performance, numpy may be useful:
>>> np.sort(df.filter(like='select').to_numpy(), 1)[:, -2:].mean(1)
To get the values from the original columns, use argsort:
>>> arr = df.filter(like='select').to_numpy()
>>> df[['a', 'b', 'c']].to_numpy()[[[x] for x in np.arange(len(arr))],
...                                np.argsort(arr, 1)][:, -2:].mean(1)
array([1.5, 5. , 8. ])
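If the fancy indexing above is hard to read, np.take_along_axis (available since NumPy 1.15) expresses the same gather more explicitly; a sketch assuming the same df and arr as above:

import numpy as np

idx = np.argsort(arr, 1)[:, -2:]   # column positions of the 2 largest '_select' values per row
vals = df[['a', 'b', 'c']].to_numpy()
print(np.take_along_axis(vals, idx, axis=1).mean(1))   # array([1.5, 5. , 8. ])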
Here is my data:
df:
id sub_id
A 1
A 2
B 3
B 4
and I have the following array:
[[1,2],
[2,5],
[1,4],
[7,8]]
Here is my code:
from collections import defaultdict
sub_id_array_dict = defaultdict(dict)
for i, s, a in zip(df['id'].to_list(), df['sub_id'].to_list(), arrays):
    sub_id_array_dict[i][s] = a
Now, my actual dataframe includes a total of 100M rows (unique sub_id) with 500K unique ids. Ideally, I'd like to avoid a for loop.
Any help would be much appreciated.
Assuming the arrays variable has the same number of rows as the DataFrame:
df['value'] = arrays
Convert it into a dictionary by grouping:
df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
Output
{'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}
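For reference, a minimal end-to-end version of the above, with the sample data from the question reconstructed:

import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B'], 'sub_id': [1, 2, 3, 4]})
arrays = [[1, 2], [2, 5], [1, 4], [7, 8]]

# attach one array per row, then build a {sub_id: array} dict per id
df['value'] = arrays
nested = df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
print(nested)  # {'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}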
I am trying to create a DataFrame from these two lists
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = b
df
This is the output I got:
First Second
0 a [1, 2, 3]
1 b [4, 5]
2 c [7, 8, 9]
How can I get rid of the [ ] brackets to get my expected output?
My expected output is
First Second
0 a 1, 2, 3
1 b 4, 5
2 c 7, 8, 9
Convert the column to a string type and strip the square brackets:
df['Second'] = df['Second'].astype(str).str.strip('[]')
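On the sample frame, after this assignment the column holds strings, not lists, so no numeric operations will work on it afterwards:

print(df['Second'].tolist())
# ['1, 2, 3', '4, 5', '7, 8, 9']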
What are you trying to achieve here? A column with a list of numeric values that is not a list? It seems a bit counter-intuitive. You can maybe convert the values to strings to get rid of the square brackets of the list representation:
c = [", ".join(str(x) for x in y) for y in b]
df['Second'] = c
This will get rid of the brackets, but I am not really sure what the use of it is, or what your actual use case would be.
One option is to first process the list of lists to convert it to the required type.
The question about the usefulness of storing comma-separated numbers in a dataframe still remains, as you won't be able to perform any computations on them.
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = [','.join(map(str, i)) for i in b]
First Second
0 a 1,2,3
1 b 4,5
2 c 7,8,9
Suppose that we have a data-frame (df) with a high number of rows (1,600,000 × 4). Also, we have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
We need to calculate the average of the first and third columns of this data-frame, and the median of the second and fourth columns, for every list in inx. For example, for the first list of inx, we should do this for the first and second rows and replace both rows with a single new row containing the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
a b c d
0 1 2 3 3
1 4 5 6 1
2 7 8 9 3
3 1 1 1 1
The output for just the first list inside of inx ([1,2]) will be something like this:
a b c d
0 1 2 3 3
1 5.5 6.5 7.5 2
3 1 1 1 1
As you can see, we don't change the first row (0), because it's not in the main list. After that, we're going to do the same for [4,5]. We don't change anything in row 3 either, because it's not in any list. inx is a large list of lists (more than 100000 elements).
EDIT: NEW APPROACH AVOIDING LOOPS
Below you'll find an approach relying on pandas and avoiding loops.
After generating some fake data of the same size as yours, I basically create a list of group labels from your inx list of rows; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark the various groups of rows to operate on.
After the proper calculations, the resulting dataframe is joined back with the original rows which don't need calculations (in my example above, rows: [0, 1, 4, 8, 9, ...]).
You'll find more comments in the code.
At the end of the answer I also leave my previous approach, for the record.
On my box, the old algo involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half a second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx=[]
while len(inxl) > 3:
    i = random.randint(2, 3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create equivalent list (same length)
# of increasing ints. They'll be used to group corresponding rows
gr=[len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx)+1)))
group_list = [a*[b] for (a,b) in t]
# the group labels are flattened as well
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group, retaining
# the original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups
# rows not belonging to a group are marked with 0
df['groups']=df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containg only rows belonging to a group
df_to_operate = df[df['groups']>0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median;
# for each group we retain the index of the group's first row
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set the correct index on the dataframe
df_operated = df_operated.set_index('ind')
# finally, join the previous dataframe with the saved
# dataframe of rows which don't need calculations
df_final = df_operated.combine_first(df_untouched)
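As an aside, the list-comprehension steps that build flat_inx and flat_group_list above can be written more compactly with NumPy; a sketch assuming the same inx as above:

import numpy as np

# one integer label per inx sublist, repeated to match each sublist's length
lengths = [len(sub) for sub in inx]
flat_inx = np.concatenate(inx)
flat_group_list = np.repeat(np.arange(1, len(inx) + 1), lengths)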
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algo, involving a loop, gives a correct result but takes too long for such a large amount of data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]),
                  columns=['a', 'b', 'c', 'd'])
inx = [[1, 2]]
for l in inx:
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    df.iloc[l[0]] = pd.DataFrame([means, medians]).fillna(method='bfill').iloc[0]
    df.drop(index=l[1:], inplace=True)
I like using nested data structures, and now I'm trying to understand how to use Pandas.
Here is a toy model:
a=pd.DataFrame({'x':[1,2],'y':[10,20]})
b=pd.DataFrame({'x':[3,4],'y':[30,40]})
c=[a,b]
now I would like to get:
sol=np.array([[[1],[3]],[[2],[4]]])
I have an idea to get both sol[0] and sol[1] as:
s0 = np.array([item[['x']].iloc[0] for item in c])
s1 = np.array([item[['x']].iloc[1] for item in c])
but to get sol I would have to loop over the index, and I don't think that's really pythonic...
It looks like you want just the x columns from a and b. You can concatenate two Series (or DataFrames) into a new DataFrame using pd.concat:
In [132]: pd.concat([a['x'], b['x']], axis=1)
Out[132]:
x x
0 1 3
1 2 4
[2 rows x 2 columns]
Now, if you want a numpy array, use the values attribute:
In [133]: pd.concat([a['x'], b['x']], axis=1).values
Out[133]:
array([[1, 3],
[2, 4]], dtype=int64)
And if you want a numpy array with the same shape as sol, then use the reshape method:
In [134]: pd.concat([a['x'], b['x']], axis=1).values.reshape(2,2,1)
Out[134]:
array([[[1],
[3]],
[[2],
[4]]], dtype=int64)
In [136]: np.allclose(pd.concat([a['x'], b['x']], axis=1).values.reshape(2,2,1), sol)
Out[136]: True
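Alternatively, since the frames already live in the list c, you can skip concat and stack the NumPy arrays directly; a sketch with the same a, b and c as above:

import numpy as np

# stack the 'x' columns side by side, then add a trailing axis of length 1
sol2 = np.stack([frame['x'].to_numpy() for frame in c], axis=1)[..., None]
print(np.array_equal(sol2, sol))  # True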