Creating Pivot DataFrame using Multiple Columns in Pandas - python

I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1,2,3], 'a_neg': [1, 1, 1],
'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
'b_neg': [1, 1, 1], 'b_zero': [2,1,1], 'b_pos': [1,2,1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot('id', 'a', 'b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot('id', 'b', 'a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that that method is not optimal especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient solution for getting what I want to achieve here? Thanks.

A shorter solution, though still quite in-efficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
a b
-1 0 1 -1 0 1
id
1 1 1 2 1 2 1
2 1 2 1 1 1 2
3 1 2 0 1 1 1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_count/fillna with something cleaner and more efficient, but at the moment it eludes me...

Related

How to change column names of Pandas Series object?

I'm trying to prefix the names of the columns for each series in my Pandas Series, based on one of the other columns value. Currently my objective is to change a Pandas Dataframe that contains 3 columns into a Dataframe of only 1 column named 'Data' - or whatever. Below is an example of stacking a Dataframe to obtain a single dimension to work with.
df_single_level_cols = pd.DataFrame([[0, 1, 20], [2, 3, 40]],columns=['weight', 'height', 'girth'])
df = df_single_level_cols.stack()
print(df)
0 weight 0
height 1
girth 20
1 weight 2
height 3
girth 40
dtype: int64
For each series I need to prefix both column names weight and height with the value of girth. When that is done I will drop girth from the equation, leaving me with only the weight and height for my series. After prefixing and dropping the series object should look like the following:
0 20weight 0
20height 1
1 40weight 2
40height 3
dtype: int64
Then when converting this to a Dataframe I shall have the following:
Data
20weight 0
20height 1
40weight 2
40height 3
I've tried messing around with .apply(...), .rename(...) and .add_prefix(...) but none of them seem to be doing the trick. If I do something like
df[0] = df[0].add_prefix("test")
I end up getting errors as I'm setting an array element with a sequence + this does not actually use the value of girth - but was more a way of getting accustomed with the rename functionality..
You can melt instead:
df = (df_single_level_cols
.astype({'girth': str})
.melt('girth', value_name='Data')
.assign(**{'girth': lambda d: d['girth']+d.pop('variable')})
.set_index('girth')
)
output:
Data
girth
20weight 0
40weight 2
20height 1
40height 3
Like this?
df = pd.DataFrame([[0, 1, 20], [2, 3, 40]],columns=['weight', 'height', 'girth'])
df = df[['weight', 'height']].stack().reset_index(level=1).merge(df.girth, left_index=True, right_index=True, how='left')
df = df.set_index(df.girth.astype(str) + df.level_1).rename(columns={0: 'Data'})[['Data']]
> df
Data
20weight 0
20height 1
40weight 2
40height 3

Pandas - Sorting By Column

I have a pandas data frame known as "df":
x y
0 1 2
1 2 4
2 3 8
I am splitting it up into two frames, and then trying to merge back together:
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
My goal is to get it back in the same order, but when I concat, I am getting the following:
frames = [df_1, df_2]
solution = pd.concat(frames)
solution.sort_values(by='x', inplace=False)
x y
1 2 4
2 3 8
0 1 2
The problem is I need the 'x' values to go back into the new dataframe in the same order that I extracted. Is there a solution?
use .loc to specify the order you want. Choose the original index.
solution.loc[df.index]
Or, if you trust the index values in each component, then
solution.sort_index()
setup
df = pd.DataFrame([[1, 2], [2, 4], [3, 8]], columns=['x', 'y'])
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_1, df_2]
solution = pd.concat(frames)
Try this:
In [14]: pd.concat([df_1, df_2.sort_values('y')])
Out[14]:
x y
0 1 2
1 2 4
2 3 8
When you are sorting the solution using
solution.sort_values(by='x', inplace=False)
you need to specify inplace = True. That would take care of it.
Based on these assumptions on df:
Columns x and y are note necessarily ordered.
The index is ordered.
Just order your result by index:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_2, df_1]
solution = pd.concat(frames).sort_index()
Now, solution looks like this:
x y
0 1 2
1 2 4
2 3 8

Filter rows of pandas dataframe whose values are lower than 0

I have a pandas dataframe like this
df = pd.DataFrame(data=[[21, 1],[32, -4],[-4, 14],[3, 17],[-7,NaN]], columns=['a', 'b'])
df
I want to be able to remove all rows with negative values in a list of columns and conserving rows with NaN.
In my example there is only 2 columns, but I have more in my dataset, so I can't do it one by one.
If you want to apply it to all columns, do df[df > 0] with dropna():
>>> df[df > 0].dropna()
a b
0 21 1
3 3 17
If you know what columns to apply it to, then do for only those cols with df[df[cols] > 0]:
>>> cols = ['b']
>>> df[cols] = df[df[cols] > 0][cols]
>>> df.dropna()
a b
0 21 1
2 -4 14
3 3 17
I've found you can simplify the answer by just doing this:
>>> cols = ['b']
>>> df = df[df[cols] > 0]
dropna() is not an in-place method, so you have to store the result.
>>> df = df.dropna()
I was looking for a solution for this that doesn't change the dtype (which will happen if NaN's are mixed in with ints as suggested in the answers that use dropna. Since the questioner already had a NaN in their data, that may not be an issue for them. I went with this solution which preserves the int64 dtype. Here it is with my sample data:
df = pd.DataFrame(data={'a':[0, 1, 2], 'b': [-1,0,1], 'c': [-2, -1, 0]})
columns = ['b', 'c']
filter_ = (df[columns] >= 0).all(axis=1)
df[filter_]
a b c
2 2 1 0

Delete a column in a pandas' DataFrame if its sum is less than x

I am trying to create a program that will delete a column in a Panda's dataFrame if the column's sum is less than 10.
I currently have the following solution, but I was curious if there is a more pythonic way to do this.
df = pandas.DataFrame(AllData)
sum = df.sum(axis=1)
badCols = list()
for index in range(len(sum)):
if sum[index] < 10:
badCols.append(index)
df = df.drop(df.columns[badCols], axis=1)
In my approach, I create a list of column indexes that have sums less than 10, then I delete this list. Is there a better approach for doing this?
You can call sum to generate a Series that gives the sum of each column, then use this to generate a boolean mask against your column array and use this to filter the df. DF generation code borrowed from #Alexander:
In [2]:
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
df
Out[2]:
a b c
0 1 1 20
1 10 1 30
In [3]:
df.sum()
Out[3]:
a 11
b 2
c 50
dtype: int64
In [6]:
df[df.columns[df.sum()>10]]
Out[6]:
a c
0 1 20
1 10 30
You can accomplish your objective using a one-liner by using a list comprehension and iteritems to identify all columns that meet your criteria.
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
>>> df
a b c
0 1 1 20
1 10 1 30
df.drop([col for col, val in df.sum().iteritems() if val < 10], axis=1, inplace=True)
>>> df
a c
0 1 20
1 10 30

How to keep index when using pandas merge

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?
In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3},
'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})
In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3},
'to_merge_on': {0: 1, 1: 3, 2: 5}})
In [6]: a
Out[6]:
col1 to_merge_on
a 1 1
b 2 3
c 3 4
In [7]: b
Out[7]:
col2 to_merge_on
0 1 1
1 2 3
2 3 5
In [8]: a.merge(b, how='left')
Out[8]:
col1 to_merge_on col2
0 1 1 1.0
1 2 3 2.0
2 3 4 NaN
In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')
EDIT: Switched to example code that can be easily reproduced
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
col1 to_merge_on col2
index
a 1 1 1
b 2 3 2
c 3 4 NaN
Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
You can make a copy of index on left dataframe and do merge.
a['copy_index'] = a.index
a.merge(b, how='left')
I found this simple method very useful while working with large dataframe and using pd.merge_asof() (or dd.merge_asof()).
This approach would be superior when resetting index is expensive (large dataframe).
There is a non-pd.merge solution using Series.map and DataFrame.set_index.
a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2']))
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN
This doesn't introduce a dummy index name for the index.
Note however that there is no DataFrame.map method, and so this approach is not for multiple columns.
df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)
This allows to preserve the index of df1
Assuming that the resulting df has the same number of rows and order as your first df, you can do this:
c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index,inplace=True)
another simple option is to rename the index to what was before:
a.merge(b, how="left").set_axis(a.index)
merge preserves the order at dataframe 'a', but just resets the index so it's safe to use set_axis
You can also use DataFrame.join() method to achieve the same thing. The join method will persist the original index. The column to join can be specified with on parameter.
In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]:
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN
Think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:
First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')
Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:
First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()
Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):
First10ReviewsJoined.set_index('Line Number', inplace=True)
Then removed the index name of Line Number so that it remains blank:
First10ReviewsJoined.index.name = None
Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.
For the people that wants to maintain the left index as it was before the left join:
def left_join(
a: pandas.DataFrame, b: pandas.DataFrame, on: list[str], b_columns: list[str] = None
) -> pandas.DataFrame:
if b_columns:
b_columns = set(on + b_columns)
b = b[b_columns]
df = (
a.reset_index()
.merge(
b,
how="left",
on=on,
)
.set_index(keys=[x or "index" for x in a.index.names])
)
df.index.names = a.index.names
return df

Categories