I am trying to create a DataFrame from these two lists
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = b
df
This is the output I got:
First Second
0 a [1, 2, 3]
1 b [4, 5]
2 c [7, 8, 9]
How can I get rid of the [ ] brackets to get my expected output?
My expected output is
First Second
0 a 1, 2, 3
1 b 4, 5
2 c 7, 8, 9
Convert the column to a string type and strip the square brackets:
df['Second'] = df['Second'].astype(str).str.strip('[]')
Note that the result is a plain string column, so you won't be able to do numeric operations on the values afterwards.
What are you trying to achieve here? A column of numeric values that is not a list? That seems a bit counter-intuitive. You could convert the values to strings to get rid of the square brackets of the list representation.
c = [", ".join(str(x) for x in y) for y in b]
df['Second'] = c
This will get rid of the brackets, but I am not really sure what the use of it is, or what your actual use case might be.
One option is to first process the list of lists to convert it to the required type.
The question about the usefulness of storing comma-separated numbers in a dataframe still remains, as you won't be able to perform any computation on them.
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = [','.join(map(str, i)) for i in b]
First Second
0 a 1,2,3
1 b 4,5
2 c 7,8,9
I am a Python newbie and have a question.
As a simple example, I have three variables:
a = 3
b = 10
c = 1
I'd like to create a data frame with three columns ('a', 'b', and 'c') with:
each column containing the original value +/- a certain constant, restricted to values > 0 and <= 10.
If the constant is 1 then:
the possible values of 'a' will be 2, 3, 4
the possible values of 'b' will be 9, 10
the possible values of 'c' will be 1, 2
The final data frame will consist of all possible combinations of a, b and c.
Do you know any Python code to do so?
Here is some code to start with.
import pandas as pd
data = [[3, 10, 1]]
df1 = pd.DataFrame(data, columns=['a', 'b', 'c'])
You may use itertools.product for this.
Create three separate lists with the accepted values. This can be done by calling a function that returns the list of possible values:
def list_of_values(n):
    # n - 1, n, n + 1, clipped to the allowed range (0, 10]
    if 1 < n < 10:
        return [n - 1, n, n + 1]
    elif n == 1:
        return [1, 2]
    elif n == 10:
        return [9, 10]
    return []
So you will have the following:
a = [2, 3, 4]
b = [9, 10]
c = [1, 2]
Next, do the following:
from itertools import product

data = list(product(a, b, c))
pd.DataFrame(data, columns=['a', 'b', 'c'])
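Putting it all together, here is a compact end-to-end sketch. The clipping logic is generalized so the constant is a parameter rather than hard-coded (the possible_values helper and its defaults are my own naming, not from the original):

import pandas as pd
from itertools import product

def possible_values(n, const=1, lo=1, hi=10):
    # all integers within +/- const of n, clipped to [lo, hi]
    return list(range(max(lo, n - const), min(hi, n + const) + 1))

a, b, c = 3, 10, 1
combos = list(product(possible_values(a), possible_values(b), possible_values(c)))
df = pd.DataFrame(combos, columns=['a', 'b', 'c'])
print(df)  # 3 * 2 * 2 = 12 rows, one per combination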
I have a dataframe with the below format:
A B C
0 [[1,2],[3,4]] [[5,6],[7,8]] [[9,10],[11,12]]
for multiple rows.
The sublists are always of length 2, and the A, B, and C lists within a row are always the same size. However, the lengths of the latter vary across rows and can, for example, be of size 2 or 6 for different rows.
What I would like to do is explode rows like these into:
A B C
0 [1,2] [5,6] [9,10]
0 [3,4] [7,8] [11,12]
Assuming you really have lists of lists, a simple explode on all columns should work:
df.explode(df.columns.to_list())
output:
A B C
0 [1, 2] [5, 6] [9, 10]
0 [3, 4] [7, 8] [11, 12]
used input:
df = pd.DataFrame([[[[1,2],[3,4]], [[5,6],[7,8]], [[9,10],[11,12]]]],
columns=['A', 'B', 'C'])
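Note that passing multiple columns to explode requires pandas 1.3 or later. On older versions, a common fallback is to explode every column with apply, relying on the matching list lengths to keep the rows aligned (a sketch using the same df as above):

df.apply(pd.Series.explode)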
Let's say I have an N-dimensional array, for example:
A = [[1, 2],
     [6, 10]]
and another array B that defines an index associated with each value of A
B = [[0, 1], [1, 0]]
And I want to obtain a 1D list or array that for each index contains the sum of the values of A associated with that index. For our example, we would want
C = [11, 8]
Is there a way to do this efficiently, without looping over the arrays manually?
Edit: To make it clearer what I want: if we now keep A the same and set B equal to:
B = [[1, 1], [1, 1]]
then I want all the values of A to sum into index 1 of C, which yields
C = [0, 19]
Or I can write a code snippet (assuming A and B are numpy arrays):
C = np.zeros(np.max(B) + 1)  # one slot per index, so the largest index fits
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        C[B[i, j]] += A[i, j]
I think I found the best answer for now, actually. I can just use:
np.histogram(B, bins=np.arange(np.max(B) + 2), weights=A)[0]
The explicit integer bin edges are needed so that each index gets its own bin (with the default arguments the weights land in 10 evenly spaced bins instead). This provides the solution I want.
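For reference, np.bincount performs the same weighted accumulation more directly and returns exactly C (a small sketch using the arrays from the question):

import numpy as np

A = np.array([[1, 2], [6, 10]])
B = np.array([[0, 1], [1, 0]])

# sum each value of A into the slot given by the corresponding index in B
C = np.bincount(B.ravel(), weights=A.ravel())
print(C)  # [11.  8.]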
I want to select a subset of some pandas DataFrame columns based on several slices.
In [1]: df = pd.DataFrame(data={'A': np.random.rand(100), 'B': np.random.rand(100), 'C': np.random.rand(100)})
df.head()
Out[1]: A B C
0 0.745487 0.146733 0.594006
1 0.212324 0.692727 0.244113
2 0.954276 0.318949 0.199224
3 0.606276 0.155027 0.247255
4 0.155672 0.464012 0.229516
Something like:
In [2]: df.loc[[slice(1, 4), slice(42, 44)], ['B', 'C']]
Expected output:
Out[2]: B C
1 0.692727 0.244113
2 0.318949 0.199224
3 0.155027 0.247255
42 0.335285 0.000997
43 0.019172 0.237810
I've seen that NumPy's r_ object can help when wanting to use multiple slices, e.g:
In [3]: arr = np.array([1, 2, 3, 4, 5, 5, 5, 5])
arr[np.r_[1:3, 4:6]]
Out[3]: array([2, 3, 5, 5])
But I can't get this to work with some predefined collection (list) of slices. Ideally, I would like to be able to specify a collection of ranges/slices and subset based on it. It doesn't seem like r_ accepts iterables? I've seen that one could, for example, create an array with hstack and then use it as an index, like:
In [4]: idx = np.hstack((np.arange(1, 4), np.arange(42, 44)))
df.loc[idx, ['B', 'C']]
Out[4]: B C
1 0.692727 0.244113
2 0.318949 0.199224
3 0.155027 0.247255
42 0.335285 0.000997
43 0.019172 0.237810
Which gets me what I need, but is there any other faster/cleaner/preferred/whatever way of doing this?
A bit late, but it might also help others:
pd.concat([df.loc[sl, ['B', 'C']] for sl in [slice(1, 4), slice(42, 44)]])
This also works when you are dealing with other slices, e.g. time windows.
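For example, with a DatetimeIndex the same pattern selects multiple time windows (a hypothetical illustration, not data from the question):

import pandas as pd

ts = pd.DataFrame({'B': range(10), 'C': range(10)},
                  index=pd.date_range('2021-01-01', periods=10))
windows = [slice('2021-01-02', '2021-01-03'), slice('2021-01-08', '2021-01-09')]
pd.concat([ts.loc[w, ['B', 'C']] for w in windows])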
You can do:
df.loc[list(range(1, 4)) + list(range(42, 44)), ['B', 'C']]
This took about 1/4 of the time of your np.hstack option.
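As an aside, np.r_ can in fact consume a predefined collection of slices: indexing it with a tuple of slice objects is equivalent to the comma-separated literal form (a small sketch, assuming the df from the question):

import numpy as np

slices = [slice(1, 4), slice(42, 44)]
idx = np.r_[tuple(slices)]  # array([ 1,  2,  3, 42, 43])
df.loc[idx, ['B', 'C']]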
Suppose that we have a data-frame (df) with a high number of rows (1,600,000 x 4). Also, we have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
We need to calculate the average of the first and third columns of this data-frame and the median of the second and fourth columns for every list in inx. For example, for the first list of inx, we should do this for the rows with index 1 and 2, and replace both rows with a single new row containing the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
a b c d
0 1 2 3 3
1 4 5 6 1
2 7 8 9 3
3 1 1 1 1
The output for just the first list inside inx ([1,2]) will be something like this:
a b c d
0 1 2 3 3
1 5.5 6.5 7.5 2
3 1 1 1 1
As you can see, we don't change the first row (0), because it's not in the main list. After that, we're going to do the same for [4,5]. We don't change anything in row 3 because it's not in the list either. inx is a large list of lists (more than 100000 elements).
EDIT: NEW APPROACH AVOIDING LOOPS
Below you will find an approach relying on pandas and avoiding loops.
After generating some fake data of a size comparable to yours, I basically create a list of indexes from your inx list of rows; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark the various groups of rows to operate on.
After the proper calculations, the resulting dataframe is joined back with the original rows which don't need calculations (in my example above, rows [0, 1, 4, 8, 9, ...]).
You will find more comments in the code.
At the end of the answer I also leave my previous approach, for the record.
On my box, the old algo involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half a second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx = []
while len(inxl) > 3:
    i = random.randint(2, 3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create an equivalent list (same length)
# of increasing ints; they'll be used to group the corresponding rows
gr = [len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx) + 1)))
group_list = [a * [b] for (a, b) in t]
# the group labels are flattened as well
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group retaining
# original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups
# rows not belonging to a group are marked with 0
df['groups'] = df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containing only rows belonging to a group
df_to_operate = df[df['groups'] > 0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median;
# for each group we retain the index of the first row of the group
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set the correct index on the dataframe
df_operated = df_operated.set_index('ind')
# finally, join the previous dataframe with the saved
# dataframe of rows which don't need calculations
df_final = df_operated.combine_first(df_untouched)
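As a quick sanity check, the same idea in compact form reproduces the expected output from the question on the small 4-row example (a sketch; here np.repeat builds the group labels in one step instead of the zip-based construction above):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]),
                  columns=['a', 'b', 'c', 'd'])
inx = [[1, 2]]

# label each grouped row: the rows listed in inx[k] get group label k+1
flat_inx = np.concatenate(inx)
groups = np.repeat(np.arange(1, len(inx) + 1), [len(l) for l in inx])

df['groups'] = pd.Series(groups, index=flat_inx)  # aligns on the index
df['groups'] = df['groups'].fillna(0)             # ungrouped rows -> 0

untouched = df[df['groups'] == 0].drop(columns='groups')
grouped = (df[df['groups'] > 0]
           .assign(ind=lambda d: d.index)
           .groupby('groups')
           .agg({'a': 'mean', 'b': 'median', 'c': 'mean', 'd': 'median',
                 'ind': 'first'})
           .set_index('ind'))

print(grouped.combine_first(untouched))
#      a    b    c    d
# 0  1.0  2.0  3.0  3.0
# 1  5.5  6.5  7.5  2.0
# 3  1.0  1.0  1.0  1.0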
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algo involving a loop, though giving a correct result, takes too long for such a large amount of data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]),
                  columns=['a', 'b', 'c', 'd'])
inx = [[1, 2]]
for l in inx:
    # mean of columns a/c and median of columns b/d over the rows in l
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    # write the combined row over the first row of the group...
    df.iloc[l[0]] = pd.DataFrame([means, medians]).bfill().iloc[0]
    # ...and drop the remaining rows of the group
    df.drop(index=l[1:], inplace=True)