I am facing a problem with pandas. The input data is a single column:
MixedColumn
-------------
20_5, 20_5**1
20_7**9
20_4, 40_4, 15_4**2
And what I want is to split and transform it into something like this:
Col1 Col2
--------------
20_5 1
20_5 1
20_7 9
20_4 2
40_4 2
15_4 2
The logic is: split each row item (e.g. 20_5, 20_5) on the comma (if present) and place each piece in its own row of the same column (Col1), then split off the part after ** (e.g. **1) and associate that number with each of those values in a separate column (Col2).
Sorry if this is a noob question. Any hints will surely help me out. Thanks, and I wish you all a happy holiday.
First split on ** to get Col2 with Series.str.split and expand=True.
Then we use DataFrame.explode to make a new row for each element to create Col1:
note: this requires pandas >= 0.25.0
df[['Col1', 'Col2']] = df['MixedColumn'].str.split(r'\*\*', expand=True)
df = df.assign(Col1=df['Col1'].str.split(', ')).explode('Col1').drop(columns='MixedColumn')
Col1 Col2
0 20_5 1
0 20_5 1
1 20_7 9
2 20_4 2
2 40_4 2
2 15_4 2
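Put together, a self-contained version of these two steps (constructing the frame from the question):

```python
import pandas as pd

df = pd.DataFrame({"MixedColumn": ["20_5, 20_5**1", "20_7**9", "20_4, 40_4, 15_4**2"]})

# split on the literal ** (escaped, since str.split treats multi-character patterns as regex)
df[["Col1", "Col2"]] = df["MixedColumn"].str.split(r"\*\*", expand=True)

# turn Col1 into lists on ", ", then explode one list element per row
df = (df.assign(Col1=df["Col1"].str.split(", "))
        .explode("Col1")
        .drop(columns="MixedColumn"))

print(df)
```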
Starting with
df = pd.DataFrame({"mixed_column": ["20_5, 20_5**1", "20_7**9", "20_4, 40_4, 15_4**2"]})
df_split = df.mixed_column.str.rsplit("**", n=1, expand=True)
df_split[0] = df_split[0].str.split(", ")
df_split = df_split.explode(0)
which gives you
0 1
0 20_5 1
0 20_5 1
1 20_7 9
2 20_4 2
2 40_4 2
2 15_4 2
I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:
import pandas as pd
# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]
df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df
Returns:
col1 col2
1 1 0
2 2 0
3 3 0
4 1 0
5 2 1
6 3 1
17 3 1
18 4 1
91 1 0
104 3 1
Using .loc, I can collect by index:
rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]
Returns:
col1 col2
5 2 1
6 3 1
17 3 1
Or I can select by column values:
df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returning:
col1 col2
5 2 1
6 3 1
17 3 1
104 3 1
However, I cannot do both. When I try the following:
df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returns:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.
Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).
The reason for the IndexingError is that you're calling df.loc with arrays of two different sizes.
df.loc[rel_index] has a length of 3, whereas df['col1'].isin(relc1) has a length of 10.
You need the index selector to also have a length of 10. If you look at the output of df['col1'].isin(relc1), it is an array of booleans.
You can achieve a similar array with the proper length by replacing df.loc[rel_index] with df.index.isin([5,6,17])
so you end up with:
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]
which returns:
col1 col2
5 2 1
6 3 1
17 3 1
That said, I'm not sure why your index would ever look like this. Typically, when slicing by position you would use df.iloc, and your index would match the 0, 1, 2, ... format.
Alternatively, you could first search by value - then assign the resulting dataframe to a new variable df2
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
then df2.loc[rel_index] would work without issue.
As for your overall goal, you can simply do the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
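Putting the whole answer together on the sample frame from the question:

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
df = pd.DataFrame({"col1": [1, 2, 3, 1, 2, 3, 3, 4, 1, 3],
                   "col2": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]}, index=cid)
rel_index, relc1, relc2 = [5, 6, 17], [2, 3], [1]

# one boolean mask of length 10, combining the index and column conditions
mask = df.index.isin(rel_index) & df["col1"].isin(relc1) & df["col2"].isin(relc2)

# assigning a DataFrame through .loc aligns it on index and columns
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [5, 6, 7]}, index=rel_index)
df.loc[mask] = df2
print(df.loc[rel_index])
```

Note that row 104 matches the column conditions but not the index condition, so it is left untouched.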
@Rexovas explains it quite well. This is an alternative where you can compute the filters on the index before assigning; it is a bit long and involves MultiIndex, but once you get your head around MultiIndex it should be intuitive:
(df
# move columns into the index
.set_index(['col1', 'col2'], append = True)
# filter based on the index
.loc(axis = 0)[rel_index, relc1, relc2]
# return cols 1 and 2
.reset_index(level = [-2, -1])
# assign values
.assign(col1 = c3, col2 = c4)
)
col1 col2
5 1 5
6 2 6
17 3 7
I have two pandas data frames (df1 and df2):
# df1
ID COL
1 A
2 F
2 A
3 A
3 S
3 D
4 D
# df2
ID VAL
1 1
2 0
3 0
3 1
4 0
My goal is to attach the corresponding VAL from df2 to each ID in df1. However, the relationship is not one-to-one (this is my client's fault and there's nothing I can do about it). To solve this problem, I want to filter df1 against df2['ID'] so that df1['ID'] lines up exactly with df2['ID'].
So basically, for any row i in 0 to len(df2):
if df1.loc[i, 'ID'] == df2.loc[i, 'ID'] then keep row i in df1.
if df1.loc[i, 'ID'] != df2.loc[i, 'ID'] then drop row i from df1 and repeat.
The desired result is:
ID COL
1 A
2 F
3 A
3 S
4 D
This way, I can use pandas.concat([df1, df2['VAL']], axis=1) to attach df2['VAL'] to df1.
Is there a standardized way to do this? Does pandas.merge() have a method for doing this?
Before this gets voted as a duplicate, please realize that len(df1) != len(df2), so threads like this are not quite what I'm looking for.
This can be done with merge on both ID and the order within each ID:
(df1.assign(idx=df1.groupby('ID').cumcount())
    .merge(df2.assign(idx=df2.groupby('ID').cumcount()),
           on=['ID', 'idx'],
           suffixes=['', '_drop'])
    [df1.columns]
)
Output:
ID COL
0 1 A
1 2 F
2 3 A
3 3 S
4 4 D
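A self-contained run of the cumcount-and-merge trick, constructing df1 and df2 from the tables in the question:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 2, 3, 3, 3, 4],
                    "COL": ["A", "F", "A", "A", "S", "D", "D"]})
df2 = pd.DataFrame({"ID": [1, 2, 3, 3, 4],
                    "VAL": [1, 0, 0, 1, 0]})

# number the repeats of each ID on both sides, then inner-merge on (ID, position)
out = (df1.assign(idx=df1.groupby("ID").cumcount())
          .merge(df2.assign(idx=df2.groupby("ID").cumcount()), on=["ID", "idx"])
          [df1.columns])
print(out)
```

The inner merge keeps only the rows of df1 whose (ID, repetition) pair also exists in df2, which is exactly the desired filtering.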
The simplest way I can see of getting the result you want is:
# Add a count for each repetition of the ids to temporary frames
x = df1.assign(id_counter=df1.groupby('ID').cumcount())
y = df2.assign(id_counter=df2.groupby('ID').cumcount())
# Merge using the ID and the repetition counter
df1 = pd.merge(x, y, how='right', on=['ID', 'id_counter']).drop('id_counter', axis=1)
Which would produce this output:
ID COL VAL
0 1 A 1
1 2 F 0
2 3 A 0
3 3 S 1
4 4 D 0
I have a Pandas DataFrame that has two columns as such:
item1 label
0 a 0
1 a 1
2 b 0
3 c 0
4 a 1
5 a 0
6 b 0
In sum, there are a total of three kinds of items in the column item1, namely a, b, and c. The entries of the label column are either 0 or 1.
What I want to do is receive a DataFrame where I have a count of how many entries in item1 have label value 1. Using the toy example above, the desired DataFrame would be something like:
item1 label
0 a 2
1 b 0
2 c 0
How might I achieve something like that?
I've tried using the following line of code:
df[['item1', 'label']].groupby('item1').sum()['label']
but the result is a Pandas Series and also displays some behaviors and properties that aren't desired.
IIUC, you can use pd.crosstab:
count_1=pd.crosstab(df['item1'],df['label'])[1]
print(count_1)
item1
a 2
b 0
c 0
Name: 1, dtype: int64
To get a DataFrame:
count_1=pd.crosstab(df['item1'],df['label'])[1].rename('label').reset_index()
print(count_1)
item1 label
0 a 2
1 b 0
2 c 0
The good thing about this method is that it also gives you the count of 0s easily, which you don't get if you use the sum.
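A self-contained check of the crosstab route on the toy frame, including the zero counts it gives for free:

```python
import pandas as pd

df = pd.DataFrame({"item1": list("aabcaab"),
                   "label": [0, 1, 0, 0, 1, 0, 0]})

ct = pd.crosstab(df["item1"], df["label"])   # one row per item, one column per label value
count_1 = ct[1].rename("label").reset_index()
print(count_1)
```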
Filtering columns before the groupby is not necessary, but you can specify the column after the groupby for the sum aggregation. To get a two-column DataFrame back, add the as_index=False parameter:
df = df.groupby('item1', as_index=False)['label'].sum()
An alternative is to use Series.reset_index:
df = df.groupby('item1')['label'].sum().reset_index()
print (df)
item1 label
0 a 2
1 b 0
2 c 0
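The same toy frame confirms the groupby form:

```python
import pandas as pd

df = pd.DataFrame({"item1": list("aabcaab"),
                   "label": [0, 1, 0, 0, 1, 0, 0]})

# sum of a 0/1 column per group counts the 1s
out = df.groupby("item1", as_index=False)["label"].sum()
print(out)
```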
I am trying to do a groupby on the id column so that I can show the number of rows in col1 that are equal to 1.
df:
id col1 col2 col3
a 1 1 1
a 0 1 1
a 1 1 1
b 1 0 1
my code:
df.groupby(['id'])['col1'].count()[1]
The output I got was 2. It didn't show me the values for the other ids, like b.
I want:
id col1
a 2
b 1
If possible, can the total number of rows per id also be displayed as a new column?
example:
id col1 total
a 2 3
b 1 1
Assuming you have only 1 and 0 in col1, you can use named aggregation (the dict-renaming form of agg was removed in pandas 1.0):
df.groupby('id', as_index=False).agg(col1=('col1', 'sum'), total=('col1', 'count'))
# id col1 total
#0 a 2 3
#1 b 1 1
It's because the rows whose id is 'a' sum to 3. Two of them are identical, which is why they were grouped and considered as one, and then it added the unique row which contains the 0 value in its col1. You can't group rows that have different values in them.
Yes, you can add it to your output. Just add the method you used to count all rows to the column section of your code.
If you want to generalize the solution to include values in col1 that are not zero you can do the following. This also orders the columns correctly.
df.set_index('id')['col1'].eq(1).groupby(level=0).agg([('col1', 'sum'), ('total', 'count')]).reset_index()
id col1 total
0 a 2.0 3
1 b 1.0 1
Using a tuple in the agg method, where the first value is the column name and the second the aggregating function, is new to me. I was just experimenting and it seemed to work. I don't remember seeing it in the documentation, so use it with caution.
I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?
I have nine datasets. All of them have the following columns:
org, name, items, spend
I want to join them into a single dataframe with the following columns:
org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...
I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:
ad = pd.DataFrame.merge(df_presents, df_trees,
                        on=['practice', 'name'],
                        suffixes=['_presents', '_trees'])
This works great; doing print(list(ad.columns.values)) shows me the following columns:
[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees', ...]
But how can I do this for nine columns? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:
result = functools.reduce(merge, dfs)
This is equivalent to
result = dfs[0]
for df in dfs[1:]:
result = merge(result, df)
To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:
merge = functools.partial(pd.merge, on=['org', 'name'])
Since specifying the suffixes parameter in functools.partial would only allow
one fixed choice of suffix, and since here we need a different suffix for each
pd.merge call, I think it would be easiest to prepare the DataFrames column
names before calling pd.merge:
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
For example,
import pandas as pd
import numpy as np
import functools
np.random.seed(2015)
N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N, 4)),
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())
yields
org name items_df1 spend_df1 items_df2 spend_df2 items_df3 \
0 2 4 4 2 3 0 1
1 2 4 4 2 3 0 1
2 2 4 4 2 3 0 1
3 2 4 4 2 3 0 1
4 2 4 4 2 3 0 1
spend_df3 items_df4 spend_df4 items_df5 spend_df5 items_df6 \
0 3 1 0 1 0 4
1 3 1 0 1 0 4
2 3 1 0 1 0 4
3 3 1 0 1 0 4
4 3 1 0 1 0 4
spend_df6 items_df7 spend_df7 items_df8 spend_df8 items_df9 spend_df9
0 3 4 1 3 0 1 2
1 3 4 1 3 0 0 3
2 3 4 1 3 0 0 0
3 3 3 1 3 0 1 2
4 3 3 1 3 0 0 3
Would doing a big pd.concat() and then renaming all the columns work for you? Something like:
desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)
new_columns = ['org', 'name']
for i in range(num_dataframes):
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
big_df.columns = new_columns
This should give you columns like:
org, name, items_df0, spend_df0, items_df1, spend_df1, ..., items_df8, spend_df8
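A concrete sketch of this concat-and-rename idea with three small hypothetical frames. Note that pd.concat with axis=1 aligns on the row index, not on the org/name keys, so this only matches rows correctly when all frames are in the same row order:

```python
import pandas as pd

# three toy frames sharing org/name in the same row order (illustrative data)
dfs = [pd.DataFrame({"org": [1, 2], "name": ["x", "y"],
                     "items": [i, i + 1], "spend": [10 * i, 10 * i + 1]})
       for i in range(3)]

desired_columns = ["items", "spend"]
big_df = pd.concat([dfs[0]] + [d[desired_columns] for d in dfs[1:]], axis=1)

# rebuild the column labels with a per-frame suffix
new_columns = ["org", "name"]
for i in range(len(dfs)):
    new_columns.extend(["items_df%i" % i, "spend_df%i" % i])
big_df.columns = new_columns
print(list(big_df.columns))
```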
I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):
Create an empty dictionary, merge_dict.
Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
Generate a new index as sorted(merge_dict).
Generate a new list of data for each column by looping through merge_dict.items().
Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.
Basically, this is somewhat like a hash join in SQL. Seems like the most efficient way I can think of and shouldn't take too long to code up.
Good luck.
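A minimal sketch of those steps, assuming two small frames keyed by index (all names here are illustrative, not from the original answer):

```python
import pandas as pd

df_a = pd.DataFrame({"items": [1, 2]}, index=[10, 20])
df_b = pd.DataFrame({"items": [3, 4]}, index=[20, 30])

merge_dict = {}  # index value -> {column name: value}
for i, df in enumerate([df_a, df_b], start=1):
    for idx in df.index:
        merge_dict.setdefault(idx, {})["items_df%d" % i] = df.at[idx, "items"]

# sorted union of the indices becomes the new index; missing cells become NaN
new_index = sorted(merge_dict)
columns = ["items_df1", "items_df2"]
data = {c: [merge_dict[k].get(c) for k in new_index] for c in columns}
result = pd.DataFrame(data, index=new_index)
print(result)
```

As the answer notes, this behaves like a full outer join on the index, done by hand with a dictionary.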