Pandas explode to create new columns - python

The pandas explode method creates a new row for each value found in the inner list of a given column, so it is a row-wise explode.
Is there an easy column-wise explode already implemented in pandas, i.e. something that transforms df into the second DataFrame below?
MWE:
>>> s = pd.DataFrame([[1, 2], [3, 4]]).agg(list, axis=1)
>>> df = pd.DataFrame({"a": ["a", "b"], "s": s})
>>> df
Out:
a s
0 a [1, 2]
1 b [3, 4]
>>> pd.DataFrame(s.tolist()).assign(a=["a", "b"]).reindex(["a", 0, 1], axis=1)
Out[121]:
a 0 1
0 a 1 2
1 b 3 4

You can use apply to convert the list values to pandas Series, which expands each list into its own columns:
>>> df['s'].apply(pd.Series)
Out[28]:
0 1
0 1 2
1 3 4
As a side note, agg(list, axis=1) returns a pandas Series rather than a DataFrame.
For the updated data, you can concatenate the result with the existing DataFrame:
>>> pd.concat([df, df['s'].apply(pd.Series)], axis=1)
Out[48]:
a s 0 1
0 a [1, 2] 1 2
1 b [3, 4] 3 4
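A possible alternative to apply(pd.Series), sketched here on the assumption that every list in the column has the same length, is to build the new columns directly from the lists and join them back. The names expanded and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": ["a", "b"], "s": [[1, 2], [3, 4]]})

# Build one column per list position, keeping the original row index
expanded = pd.DataFrame(df["s"].tolist(), index=df.index)

# Drop the list column and attach the expanded columns next to "a"
result = df.drop(columns="s").join(expanded)
print(result)
```

Going through tolist avoids constructing one Series per row, which tends to matter on larger frames.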

Related

Apply an operation on pandas DataFrame rows based on a Series that matches the number of columns of the DataFrame

Let's say that I have a DataFrame df and a Series s like this:
>>> df = pd.DataFrame(np.random.randn(2,3), columns=["A", "B", "C"])
>>> df
A B C
0 -0.625816 0.793552 -1.519706
1 -0.955960 0.142163 0.847624
>>> s = pd.Series([1, 2, 3])
>>> s
0 1
1 2
2 3
dtype: int64
I'd like to add the values of s to each row in df. I guess I should use apply with axis=1 or applymap, but I can't figure out how (do I have to transpose at some point?).
Actually, my problem is more complex than that: the final DataFrame will be composed of the elements of the initial DataFrame, processed according to the values of two Series.
A possible solution is to add the 1-D NumPy array created from the Series, which prevents pandas from aligning the DataFrame's columns with the Series index:
df = df + s.values
print (df)
A B C
0 0.207070 1.995021 4.829518
1 0.819741 2.802982 2.801355
If the Series index matches the DataFrame's column names, plain addition works because the labels align:
# index is the same as the column names
s = pd.Series([1, 2, 3], index=df.columns)
print (s)
A 1
B 2
C 3
dtype: int64
df = df + s
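To make the alignment issue concrete, here is a small sketch contrasting the two cases. It uses a frame of zeros instead of random data so the results are easy to read; the variable names misaligned and by_position are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 3)), columns=["A", "B", "C"])
s = pd.Series([1, 2, 3])  # default index 0, 1, 2

# Labels A, B, C never match 0, 1, 2, so label alignment yields only NaN
misaligned = df + s

# A plain NumPy array carries no labels, so the values are added by position
by_position = df + s.to_numpy()
print(by_position)
```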

Replace single value in a pandas dataframe, when index is not known and values in column are unique

There is a question on SO with the title "Set value for particular cell in pandas DataFrame", but it assumes the row index is known. How do I change a value, if the row values in the relevant column are unique? Is using set_index as below the easiest option? Returning a list of index values with index.tolist does not seem like a very elegant solution.
Here is my code:
import pandas as pd
## Fill the data frame.
df = pd.DataFrame([[1, 2], [3, 4]], columns=('a', 'b'))
print(df, end='\n\n')
## Just use the index, if we know it.
index = 0
df.at[index, 'a'] = 5  # set_value was removed in pandas 1.0; .at is its replacement
print('set_value', df, sep='\n', end='\n\n')
## Define a new index column.
df.set_index('a', inplace=True)
print('set_index', df, sep='\n', end='\n\n')
## Use the index values of column A, which are known to us.
index = 3
df.at[index, 'b'] = 6  # set_value was removed in pandas 1.0; .at is its replacement
print('set_value', df, sep='\n', end='\n\n')
## Reset the index.
df.reset_index(inplace = True)
print('reset_index', df, sep='\n')
Here is my output:
a b
0 1 2
1 3 4
set_value
a b
0 5 2
1 3 4
set_index
b
a
5 2
3 4
set_value
b
a
5 2
3 6
reset_index
a b
0 5 2
1 3 6
Regardless of the performance, you should be able to do this using loc with boolean indexing:
df = pd.DataFrame([[5, 2], [3, 4]], columns=('a', 'b'))
# modify value in column b where a is 3
df.loc[df.a == 3, 'b'] = 6
df
# a b
#0 5 2
#1 3 6
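Because the question assumes the values in the lookup column are unique, it may be worth checking that assumption before assigning. The following is only a sketch; set_where_unique is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def set_where_unique(df, key_col, key_val, target_col, new_val):
    """Hypothetical helper: set one cell located by a unique value in key_col."""
    mask = df[key_col] == key_val
    # Refuse to write if the key is missing or appears more than once
    if mask.sum() != 1:
        raise ValueError(f"expected exactly one row with {key_col} == {key_val!r}")
    df.loc[mask, target_col] = new_val


df = pd.DataFrame([[5, 2], [3, 4]], columns=["a", "b"])
set_where_unique(df, "a", 3, "b", 6)
print(df)
```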

Pandas - Sorting By Column

I have a pandas data frame known as "df":
x y
0 1 2
1 2 4
2 3 8
I am splitting it up into two frames, and then trying to merge back together:
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
My goal is to get it back in the same order, but when I concat, I am getting the following:
frames = [df_1, df_2]
solution = pd.concat(frames)
solution.sort_values(by='x', inplace=False)
x y
1 2 4
2 3 8
0 1 2
The problem is I need the 'x' values to go back into the new dataframe in the same order that I extracted. Is there a solution?
Use .loc to specify the order you want, passing the original index:
solution.loc[df.index]
Or, if you trust the index values in each component, then
solution.sort_index()
setup
df = pd.DataFrame([[1, 2], [2, 4], [3, 8]], columns=['x', 'y'])
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_1, df_2]
solution = pd.concat(frames)
Try this:
In [14]: pd.concat([df_1, df_2.sort_values('y')])
Out[14]:
x y
0 1 2
1 2 4
2 3 8
When you sort the solution using
solution.sort_values(by='x', inplace=False)
the sorted frame is returned and solution itself is left unchanged. Either assign the result back to solution or pass inplace=True.
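The difference between discarding and keeping the result of sort_values can be seen in a short sketch. The frames are concatenated in reverse order here so the effect is visible; order_before and order_after are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})
solution = pd.concat([df[df["x"] != 1], df[df["x"] == 1]])

# The return value is discarded, so solution keeps its concat order
solution.sort_values(by="x")
order_before = solution.index.tolist()

# Assigning the result back keeps the sorted frame
solution = solution.sort_values(by="x")
order_after = solution.index.tolist()

print(order_before, order_after)
```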
Based on these assumptions on df:
Columns x and y are not necessarily ordered.
The index is ordered.
Just order your result by index:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_2, df_1]
solution = pd.concat(frames).sort_index()
Now, solution looks like this:
x y
0 1 2
1 2 4
2 3 8

Set a MultiIndex in the DataFrame constructor using the data list provided to the constructor

I know that by using set_index I can convert an existing column into a DataFrame index, but is there a way to specify, directly in the DataFrame constructor, that one of the data columns should be used as the index (instead of becoming a regular column)?
Right now I initialize a DataFrame from data records, then use set_index to turn the column into an index.
DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}], index= ['a', 'b'], columns=('c', 'd'))
I want:
c d
a b
1 1 2 1
1 2 2 2
Instead i get:
c d
a 2 1
b 2 2
You can use MultiIndex.from_tuples:
d = [{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]
print (pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d], names=('a','b')))
MultiIndex(levels=[[1], [1, 2]],
labels=[[0, 0], [0, 1]],
names=['a', 'b'])
df = pd.DataFrame(d,
                  index=pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d],
                                                  names=('a','b')),
                  columns=('c', 'd'))
print (df)
c d
a b
1 1 2 1
2 2 2
You can simply chain a set_index call onto the constructor, without specifying the index and columns parameters:
In [19]:
df=pd.DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]).set_index(['a','b'])
df
Out[19]:
c d
a b
1 1 2 1
2 2 2
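Another option that may fit the "directly at construction time" requirement is DataFrame.from_records, whose index parameter accepts field names. This is a sketch under the assumption that the records are a list of dicts, as in the question.

```python
import pandas as pd

d = [{'a': 1, 'b': 1, 'c': 2, 'd': 1}, {'a': 1, 'b': 2, 'c': 2, 'd': 2}]

# The fields named in index become index levels and are left out of the columns
df = pd.DataFrame.from_records(d, index=['a', 'b'])
print(df)
```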

Delete a column in a pandas DataFrame if its sum is less than x

I am trying to write a program that deletes a column from a pandas DataFrame if the column's sum is less than 10.
I currently have the following solution, but I am curious whether there is a more Pythonic way to do this.
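import pandas

df = pandas.DataFrame(AllData)
col_sums = df.sum(axis=0)  # axis=0 sums each column; avoids shadowing the built-in sum
badCols = list()
for index in range(len(col_sums)):
    # collect the positions of columns whose sum is below the threshold
    if col_sums.iloc[index] < 10:
        badCols.append(index)
df = df.drop(df.columns[badCols], axis=1)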
In my approach, I create a list of column indexes that have sums less than 10, then I delete this list. Is there a better approach for doing this?
You can call sum to generate a Series holding the sum of each column, build a boolean mask from it over the columns, and use that mask to filter the df. DataFrame generation code borrowed from @Alexander:
In [2]:
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
df
Out[2]:
a b c
0 1 1 20
1 10 1 30
In [3]:
df.sum()
Out[3]:
a 11
b 2
c 50
dtype: int64
In [6]:
df[df.columns[df.sum()>=10]]
Out[6]:
a c
0 1 20
1 10 30
You can accomplish your objective with a one-liner by using a list comprehension and items (called iteritems in pandas versions before 2.0) to identify all columns that meet your criteria.
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
>>> df
a b c
0 1 1 20
1 10 1 30
df.drop([col for col, val in df.sum().items() if val < 10], axis=1, inplace=True)
>>> df
a c
0 1 20
1 10 30
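The same filter can also be written with a boolean mask passed to .loc, which avoids the explicit loop over sums. This is a minimal sketch; threshold and kept are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
threshold = 10

# Keep every column whose sum is at least the threshold
kept = df.loc[:, df.sum() >= threshold]
print(kept)
```

This version returns a new DataFrame and leaves df untouched, which can be easier to reason about than dropping in place.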
