Rearrange dataframe structure - python

I get a dataframe
df
A B
0 1 4
1 2 5
2 3 6
For further processing, it would be more convenient to have the df restructured
as follows:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
How can I achieve that?

Use unstack with reset_index :
df = df.unstack().reset_index(level=1, drop=True).reset_index()
df.columns = ['letters','numbers']
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
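To see what each step of that chain does, here is a minimal step-by-step sketch of the same approach. The intermediate names (stacked, by_letter, long_df) are mine, not from the answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Step 1: unstack() turns the frame into a Series keyed by (column name, row label).
stacked = df.unstack()

# Step 2: drop the old row labels so only the column name remains in the index.
by_letter = stacked.reset_index(level=1, drop=True)

# Step 3: move the column names out of the index and label both columns.
long_df = by_letter.reset_index()
long_df.columns = ["letters", "numbers"]

print(long_df)
```

Using a new name for the result also leaves the original wide df intact, which helps if you want to try the other answers afterwards.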
Or numpy.concatenate + numpy.repeat + DataFrame (transpose first, so the values are flattened column by column and stay paired with their column name):
a = np.concatenate(df.values.T)
b = np.repeat(df.columns,len(df.index))
df = pd.DataFrame({'letters':b, 'numbers':a})
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6

Probably simplest to melt:
In [36]: pd.melt(df, var_name="letters", value_name="numbers")
Out[36]:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
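If the original row labels matter for later processing, melt can keep them. A small sketch, assuming pandas 1.1 or newer (where the ignore_index parameter exists); the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Default behaviour: a fresh 0..n-1 index, as in the output above.
long_df = df.melt(var_name="letters", value_name="numbers")

# ignore_index=False keeps each value's original row label instead.
kept = df.melt(var_name="letters", value_name="numbers", ignore_index=False)

print(kept)
```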

Related

Autoincrement indexing after groupby with pandas on the original table

I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to assign an incrementing ID to each unique group (grouped by columns a and b). So the result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately, this works only on the grouped dataset and not on the original one. As you can see, the original table has 7 rows and the grouped result has only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Use pd.factorize after creating a tuple from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
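One thing to be aware of: the two answers agree on the question's data only because the groups first appear in sorted order. A rough sketch with made-up data (the frame below is hypothetical, not from the question) shows the difference.

```python
import pandas as pd

# Hypothetical data in which the first row's key is not the smallest one.
df = pd.DataFrame({"a": [2, 1, 2, 1], "b": ["A", "A", "A", "B"]})

# Default groupby: groups are numbered in sorted key order.
sorted_ids = df.groupby(["a", "b"]).ngroup() + 1

# sort=False: groups are numbered in order of first appearance.
appearance_ids = df.groupby(["a", "b"], sort=False).ngroup() + 1

# factorize also numbers the keys in order of first appearance.
factorized_ids = pd.factorize(df[["a", "b"]].apply(tuple, axis=1))[0] + 1

print(sorted_ids.tolist())
print(appearance_ids.tolist())
print(factorized_ids.tolist())
```

So if the IDs should follow the order in which groups show up in the table, passing sort=False to groupby (or using factorize) is probably what you want.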

split a string into separate columns in pandas

I have a dataframe with lots of data and 1 column that is structured like this:
index var_1
1 a=3:b=4:c=5:d=6:e=3
2 b=3:a=4:c=5:d=6:e=3
3 e=3:a=4:c=5:d=6
4 c=3:a=4:b=5:d=6:f=3
I am trying to structure the data in that column to look like this:
index a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
I have done the following thus far:
df1 = df['var_1'].str.split(':', expand=True)
I can then loop through the cols of df1 and do another split on '=', but then I'll just have loads of disorganised label cols and value cols.
Use list comprehension with dictionaries for each value and pass to DataFrame constructor:
comp = [dict([y.split('=') for y in x.split(':')]) for x in df['var_1']]
df = pd.DataFrame(comp).fillna(0).astype(int)
print (df)
a b c d e f
0 3 4 5 6 3 0
1 4 3 5 6 3 0
2 4 0 5 6 3 0
3 4 5 3 6 0 3
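A self-contained variant of the dictionary approach, as a sketch: it assumes the frame's index is named "index" as in the question, and the helper parse_pairs is a name I made up. Passing index=df.index keeps the original row labels, which the snippet above resets to 0..3.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "var_1": [
            "a=3:b=4:c=5:d=6:e=3",
            "b=3:a=4:c=5:d=6:e=3",
            "e=3:a=4:c=5:d=6",
            "c=3:a=4:b=5:d=6:f=3",
        ]
    },
    index=pd.Index([1, 2, 3, 4], name="index"),
)


def parse_pairs(text):
    """Turn 'a=3:b=4' into {'a': 3, 'b': 4} (hypothetical helper)."""
    pairs = (item.split("=") for item in text.split(":"))
    return {key: int(value) for key, value in pairs}


# One dict per row; keys missing from a row become NaN, then 0.
wide = pd.DataFrame([parse_pairs(t) for t in df["var_1"]], index=df.index)
wide = wide.fillna(0).astype(int).sort_index(axis=1)

print(wide)
```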
Or use Series.str.split with expand=True to get a DataFrame, reshape with DataFrame.stack, split again on '=', drop the second level of the MultiIndex, append column 0 (the keys) as a new index level, and finally reshape with Series.unstack:
df = (df['var_1'].str.split(':', expand=True)
.stack()
.str.split('=', expand=True)
.reset_index(level=1, drop=True)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
.rename_axis(None, axis=1))
print (df)
a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
Here's one approach using str.get_dummies:
out = df.var_1.str.get_dummies(sep=':')
out = out * out.columns.str[2:].astype(int).values
out.columns = pd.MultiIndex.from_arrays([out.columns.str[0], out.columns])
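# max(axis=1, level=0) was removed in pandas 2.0; group the transposed frame by the first level instead
print(out.T.groupby(level=0).max().T)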
a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
You can apply "extractall" and "pivot".
After "extractall" you get:
0 1
index match
1 0 a 3
1 b 4
2 c 5
3 d 6
4 e 3
2 0 b 3
1 a 4
2 c 5
3 d 6
4 e 3
3 0 e 3
1 a 4
2 c 5
3 d 6
4 0 c 3
1 a 4
2 b 5
3 d 6
4 f 3
And in one step:
rslt= df.var_1.str.extractall(r"([a-z])=(\d+)") \
.reset_index(level="match",drop=True) \
.pivot(columns=0).fillna(0)
1
0 a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
#rslt.columns= rslt.columns.levels[1].values
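A runnable sketch of the same extractall + pivot idea, with two changes of my own: named regex groups (key, value) so the columns have readable names, and values="value" in pivot so the result has flat column labels. It assumes the index is named "index" as in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "var_1": [
            "a=3:b=4:c=5:d=6:e=3",
            "b=3:a=4:c=5:d=6:e=3",
            "e=3:a=4:c=5:d=6",
            "c=3:a=4:b=5:d=6:f=3",
        ]
    },
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One row per key=value match, indexed by (original index, match number).
pairs = df["var_1"].str.extractall(r"(?P<key>[a-z])=(?P<value>\d+)")

wide = (
    pairs.reset_index(level="match", drop=True)  # keep only the original index
    .pivot(columns="key", values="value")        # one column per key
    .fillna("0")                                 # keys missing from a row
    .astype(int)                                 # extracted values are strings
)
wide.columns.name = None

print(wide)
```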

Python: how to merge two dataframe based only on different columns?

I have two dataframes df1 and df2
df1
A B
0 4 2
1 3 3
2 1 2
df2
B AB C
0 4 8 3
1 3 9 2
2 1 2 4
I would like to make a join only on different columns
df3
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
Use Index.isin with inverse mask or Index.difference:
df22 = df2.loc[:, ~df2.columns.isin(df1.columns)]
df = df1.join(df22)
Or:
df22 = df2[df2.columns.difference(df1.columns)]
df = df1.join(df22)
print (df)
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
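The two variants differ in column order: the boolean mask keeps df2's own order, while Index.difference sorts the labels by default. A small sketch, where df2's columns are deliberately reordered (an assumption for illustration, not the question's layout) to make that visible:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [4, 3, 1], "B": [2, 3, 2]})
# Assumption: columns reordered to B, C, AB purely to show the ordering effect.
df2 = pd.DataFrame({"B": [4, 3, 1], "C": [3, 2, 4], "AB": [8, 9, 2]})

# Boolean mask: keeps df2's column order (C, AB).
masked = df1.join(df2.loc[:, ~df2.columns.isin(df1.columns)])

# Index.difference: sorts the remaining labels by default (AB, C).
sorted_diff = df1.join(df2[df2.columns.difference(df1.columns)])

# sort=False keeps the original order instead (C, AB).
unsorted_diff = df1.join(df2[df2.columns.difference(df1.columns, sort=False)])

print(list(masked.columns))
print(list(sorted_diff.columns))
print(list(unsorted_diff.columns))
```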
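You can also use merge as an alternative. Note that this matches df1['A'] against df2['B'], so it reproduces df3 here only because those two columns happen to hold the same values in the same rows: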
df3=pd.merge(df1,df2, left_on='A', right_on='B', how ='left', suffixes=('','_')).drop('B_',axis=1)

Pandas - combine two Series by all unique combinations

Let's say I have the following series:
0 A
1 B
2 C
dtype: object
0 1
1 2
2 3
3 4
dtype: int64
How can I merge them to create an empty dataframe with every possible combination of values, like this:
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
Assuming the 2 series are s and s1, use itertools.product() which gives a cartesian product of input iterables :
import itertools
df = pd.DataFrame(list(itertools.product(s,s1)),columns=['letter','number'])
print(df)
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As of Pandas 1.2.0, there is a how='cross' option in pandas.merge() that produces the Cartesian product of the columns.
import pandas as pd
letters = pd.DataFrame({'letter': ['A','B','C']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As an additional bonus, this function makes it easy to do so with more than one column.
letters = pd.DataFrame({'letterA': ['A','B','C'],
'letterB': ['D','D','E']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letterA letterB number
0 A D 1
1 A D 2
2 A D 3
3 A D 4
4 B D 1
5 B D 2
6 B D 3
7 B D 4
8 C E 1
9 C E 2
10 C E 3
11 C E 4
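Since the question starts from two Series rather than DataFrames, here is a minimal sketch of the same cross merge applied to them. It assumes pandas 1.2 or newer and that the Series are called s and s1 as in the earlier answer.

```python
import pandas as pd

s = pd.Series(["A", "B", "C"])
s1 = pd.Series([1, 2, 3, 4])

# to_frame() gives each Series a column name; how="cross" pairs every row with every row.
together = pd.merge(s.to_frame("letter"), s1.to_frame("number"), how="cross")

print(together)
```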
If you have two Series s1 and s2 (whose names are "s1" and "s2", since those become the column labels),
you can do this:
pd.DataFrame(index=s1,columns=s2).unstack().reset_index()[["s1","s2"]]
It will give you the following:
s1 s2
0 A 1
1 B 1
2 C 1
3 A 2
4 B 2
5 C 2
6 A 3
7 B 3
8 C 3
9 A 4
10 B 4
11 C 4
You can use pandas.MultiIndex.from_product():
import pandas as pd
pd.DataFrame(
index = pd.MultiIndex
.from_product(
[
['A', 'B', 'C'],
[1, 2, 3, 4]
],
names = ['letters', 'numbers']
)
)
which results in a hierarchical structure:
letters numbers
A 1
2
3
4
B 1
2
3
4
C 1
2
3
4
and you can further call .reset_index() to get ungrouped results:
letters numbers
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
(However I find #NickCHK's answer to be the best)
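The same MultiIndex idea also works directly on the two Series, and MultiIndex.to_frame(index=False) gives the flat table in one step. A sketch, assuming the Series are named s and s1:

```python
import pandas as pd

s = pd.Series(["A", "B", "C"])
s1 = pd.Series([1, 2, 3, 4])

# Build every (letter, number) pair as a MultiIndex ...
idx = pd.MultiIndex.from_product([s, s1], names=["letters", "numbers"])

# ... and convert it to a regular two-column DataFrame with a default index.
combos = idx.to_frame(index=False)

print(combos)
```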

Repeating rows of a dataframe based on a column value

I have a data frame like this:
df1 = pd.DataFrame({'a': [1,2],
'b': [3,4],
'c': [6,5]})
df1
Out[150]:
a b c
0 1 3 6
1 2 4 5
Now I want to create a df that repeats each row based on the difference between columns b and c, plus 1. For the first row the difference is 6-3 = 3, so I want to repeat that row 3+1 = 4 times. Similarly, for the second row the difference is 5-4 = 1, so I want to repeat it 1+1 = 2 times. A new column d should then count up from b to c within each repeated row (for the first row, 3 -> 6). So I want to get this df:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
Do it with reindex + repeat, then use groupby + cumcount to assign the new column d:
df1.reindex(df1.index.repeat(df1.eval('c-b').add(1))).\
assign(d=lambda x : x.c-x.groupby('a').cumcount(ascending=False))
Out[572]:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
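Note that grouping by column a relies on a being unique per original row. A variant that groups by the repeated index label instead, as a sketch (the names out and offset are mine), and counts up from b rather than down from c:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [6, 5]})

# Repeat each row (c - b + 1) times; repeated rows keep their original index label.
out = df1.loc[df1.index.repeat(df1["c"] - df1["b"] + 1)].copy()

# Count 0, 1, 2, ... within each original row (grouped by index label, not by column a)
# and add that offset to b, so d runs from b up to c.
offset = out.groupby(level=0).cumcount().to_numpy()
out["d"] = out["b"].to_numpy() + offset

print(out)
```

This should still behave if two original rows share the same value of a, as long as the index labels are unique.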
