I have a dataframe with the below format:
A B C
0 [[1,2],[3,4]] [[5,6],[7,8]] [[9,10],[11,12]]
for multiple rows.
The sub-lists are always of length 2, and the A, B, and C lists in a given row are always the same size. However, that size varies from row to row and can be, for example, 2 or 6, etc.
What I would like to do is explode rows like these into:
A B C
0 [1,2] [5,6] [9,10]
0 [3,4] [7,8] [11,12]
Assuming you really have lists of lists, a simple explode on all columns should work:
df.explode(df.columns.to_list())
output:
A B C
0 [1, 2] [5, 6] [9, 10]
0 [3, 4] [7, 8] [11, 12]
used input:
df = pd.DataFrame([[[[1,2],[3,4]], [[5,6],[7,8]], [[9,10],[11,12]]]],
columns=['A', 'B', 'C'])
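As a quick sanity check, the whole thing runs end to end on that sample input (multi-column explode needs pandas >= 1.3):

```python
import pandas as pd

# Sample frame where each cell holds a list of [x, y] pairs
df = pd.DataFrame([[[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]],
                  columns=['A', 'B', 'C'])

# Explode every column in lockstep; pandas >= 1.3 accepts a list of columns
out = df.explode(df.columns.to_list())
print(out)
```

Each exploded row keeps the original index label (0 here), which is why both output rows show index 0.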
I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
col1 col2
A 1 6
B 2 7
C 3 8
D 4 9
E 5 10
I need to write a function (say, getNrRows(fromIndex)) that takes an index value as input and returns the number of rows between that given index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be
len(df[row_index:]) - 1
For your information, pandas also has the built-in method get_indexer_for:
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
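Wrapping the slicing idea up, a minimal getNrRows sketch (assuming the index labels are unique, since label slicing with duplicates would raise):

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d, index=['A', 'B', 'C', 'D', 'E'])

def getNrRows(fromIndex):
    # Label-based slicing is inclusive of fromIndex, so subtract 1
    # to count the steps from fromIndex to the last row
    return len(df[fromIndex:]) - 1

print(getNrRows("C"))  # → 2
```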
I am trying to create a DataFrame from these two lists
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = b
df
This is the output I got-
First Second
0 a [1, 2, 3]
1 b [4, 5]
2 c [7, 8, 9]
How can I get rid of the [ ] brackets to get my expected output?
My expected output is
First Second
0 a 1, 2, 3
1 b 4, 5
2 c 7, 8, 9
Convert the column to a string type and strip the square brackets
df['Second'] = df['Second'].astype(str).str.strip('[]')
What are you trying to achieve here? A column of numeric values that is not a list? It seems a bit counter-intuitive. You can convert the values to strings to get rid of the square brackets of the list representation.
c = [", ".join(str(x) for x in y) for y in b]
df['Second'] = c
This will get rid of the brackets, but I am not really sure what the actual use case for it is.
One option is to first process the list of lists to convert it to the required type.
The question about usefulness of storing comma-separated numbers in a dataframe still remains as you won't be able to perform any computation on that.
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = [','.join(map(str, i)) for i in b]
First Second
0 a 1,2,3
1 b 4,5
2 c 7,8,9
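On the computation caveat above: if the numbers are ever needed again, the comma-separated strings can be split back into lists, though keeping the original lists is usually simpler. A hypothetical round trip:

```python
import pandas as pd

a = ['a', 'b', 'c']
b = [[1, 2, 3], [4, 5], [7, 8, 9]]

df = pd.DataFrame(a, columns=['First'])
# Store the lists as comma-separated strings (no brackets)
df['Second'] = [','.join(map(str, i)) for i in b]

# Round trip: split the strings and cast back to int for computation
recovered = df['Second'].str.split(',').apply(lambda xs: [int(x) for x in xs])
print(recovered.tolist())  # → [[1, 2, 3], [4, 5], [7, 8, 9]]
```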
I have the following data frame called data_frame:
a b c
0 [1, 2] [3, 4] [5, 6]
1 [7, 8] [9, 10] [11, 12]
2 [13, 14] [15, 16] [17, 18]
What I want to do is element-wise mean of list in all rows. For instance, the result data should be:
a b c
0 [(1 + 7 + 13) / 3, (2 + 8 + 14) / 3] [(3 + 9 + 15) / 3, (4 + 10 + 16) / 3] [(5 + 11 + 17) / 3, (6 + 12 + 18) / 3]
If each element in pandas was a single value, this could be done by data_frame.mean().
However, how can it be done if each element in pandas is the list shown above?
Convert each column's values to a 2d array, take the mean, and finally convert the resulting Series to a one-row DataFrame:
df = df.apply(lambda x: np.array(x.tolist()).mean(axis=0).tolist()).to_frame().T
print (df)
a b c
0 [7.0, 8.0] [9.0, 10.0] [11.0, 12.0]
Another solution, on a very similar principle:
df = pd.DataFrame([[np.array(df[x].tolist()).mean(axis=0).tolist() for x in df.columns]],
columns=df.columns)
Try with stack(), then take the mean per column label (level 1) and aggregate back as a list/join (anything you prefer). Note that mean(level=1) is removed in recent pandas, so use groupby instead:
s = df.stack()
pd.DataFrame(s.tolist(), index=s.index).groupby(level=1).mean().agg(list, axis=1).to_frame().T
a b c
0 [7, 8] [9, 10] [11, 12]
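Equivalently, since all the inner lists have the same length, the whole frame can be turned into one 3-D NumPy array and averaged along the row axis. A sketch under that equal-length assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [[1, 2], [7, 8], [13, 14]],
                   "b": [[3, 4], [9, 10], [15, 16]],
                   "c": [[5, 6], [11, 12], [17, 18]]})

# Shape (n_rows, n_cols, 2): one 3-D array for the whole frame
arr = np.array(df.to_numpy().tolist())

# Average over the row axis, then rebuild a one-row DataFrame of lists
means = arr.mean(axis=0)  # shape (n_cols, 2)
out = pd.DataFrame([[list(m) for m in means]], columns=df.columns)
print(out)
```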
Summary of the code below:
- Transpose the dataframe and convert it to a numpy array.
- zip the entries in each sublist/array; this pairs the numbers together, so for a we'll have (1, 7, 13) and (2, 8, 14); same thing for b and c.
- For each zipped entry, find the mean.
- Create a dictionary pairing the outcomes with the columns.
- Create the dataframe.
data = {"a":[[1,2],[7,8],[13,14]], "b":[[3,4],[9,10],[15,16]], "c":[[5,6],[11,12],[17,18]]}
df = pd.DataFrame(data)
outcome = [[np.mean(entry) for entry in zip(*ent)]
           for ent in df.T.to_numpy()]
pd.DataFrame({key:[value] for key, value in zip(df.columns, outcome)})
a b c
0 [7.0, 8.0] [9.0, 10.0] [11.0, 12.0]
Suppose that we have a data-frame (df) with a high number of rows (1600000X4). Also, we have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
We need to calculate the average of the first and third columns of this dataframe, and the median of the second and fourth columns, for every list in inx. For example, for the first list of inx, we should do this for the first and second rows and replace those rows with a single new row containing the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
a b c d
0 1 2 3 3
1 4 5 6 1
2 7 8 9 3
3 1 1 1 1
The output for just the first list inside of inx ([1,2]) will be something like this:
a b c d
0 1 2 3 3
1 5.5 6.5 7.5 2
3 1 1 1 1
As you can see, we don't change first row (0), because it's not in the main list. After that, we're going to do the same for [4,5]. We don't change anything in row 3 because it's not in the list too. inx is a large list of lists (more than 100000 elements).
EDIT: NEW APPROACH AVOIDING LOOPS
Here below you find an approach relying on pandas and avoiding loops.
After generating some fake data of the same size as yours, I basically create a list of group indexes from your inx list of rows; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark various groups of rows to operate on.
After proper calculations, the resulting dataframe is joined back with original rows which don't need calculations (in my example above, rows: [0, 1, 4, 8, 9, ...]).
You find more comments in the code.
At the end of the answer I leave also my previous approach for the records.
On my box, the old algo involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx=[]
while len(inxl) > 3:
    i = random.randint(2, 3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create equivalent list (same length)
# of increasing ints. They'll be used to group corresponding rows
gr=[len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx)+1)))
group_list = [a*[b] for (a,b) in t]
# the group lists are flattened as well
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group retaining
# original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups
# rows not belonging to a group are marked with 0
df['groups']=df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containg only rows belonging to a group
df_to_operate = df[df['groups']>0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median
# for each group we retain the index of first row of group
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set correct index on dataframe
df_operated = df_operated.set_index('ind')
# finally, join the previous dataframe with the saved
# dataframe of rows which don't need calculations
df_final = df_operated.combine_first(df_untouched)
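The same pandas-only pipeline can be sketched compactly on the small 4-row example from the question (a toy version of the approach above, assuming inx = [[1, 2]]):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1],
                            [7, 8, 9, 3], [1, 1, 1, 1]]),
                  columns=['a', 'b', 'c', 'd'])
inx = [[1, 2]]

# Mark each row with its group id (0 = not in any inx list)
groups = pd.Series(0, index=df.index)
for g, rows in enumerate(inx, start=1):
    groups[rows] = g

# Aggregate the grouped rows, keeping each group's first row index
work = df[groups > 0].assign(ind=lambda x: x.index, groups=groups[groups > 0])
agg = work.groupby('groups').agg({'a': 'mean', 'b': 'median',
                                  'c': 'mean', 'd': 'median',
                                  'ind': 'first'}).set_index('ind')

# Join back with the untouched rows
result = agg.combine_first(df[groups == 0])
print(result)
```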
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algo, involving a loop, gives a correct result but takes too long for such a big amount of data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
inx = [[1,2]]
for l in inx:
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    df.iloc[l[0]] = pd.DataFrame([means, medians]).fillna(method='bfill').iloc[0]
    df.drop(index=l[1:], inplace=True)
I'm trying to convert my dataframe columns to arrays. For example, I have a dataframe that looks like this:
Total Price Carrier
2 3 C
1 5 D
I'd like to convert the columns to arrays like this: [[2, 1], [3, 5], ['C', 'D']]. I do not want the column names.
I've tried doing this:
df["all"] = 1
df.groupby("all")[["Total","Price", "Carrier"]].apply(list)
However, I get something like ["Total", "Price", "Carrier"], and it's an object, not an array. How can I convert all the columns to arrays?
Use df.values instead of apply:
>>> df.values.T.tolist()
[[2, 1], [3, 5], ['C', 'D']]
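A quick runnable check of that answer (note the transposed array is object-dtype because the columns have mixed types, so tolist() gives back plain Python values):

```python
import pandas as pd

df = pd.DataFrame({'Total': [2, 1], 'Price': [3, 5], 'Carrier': ['C', 'D']})

# .values (or .to_numpy()) drops the column names; .T flips rows/columns
out = df.values.T.tolist()
print(out)  # → [[2, 1], [3, 5], ['C', 'D']]
```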