Making a matrix format in Python

I have the following data in my dataframe B:
F1  F2  Count
A   C   5
B   C   2
B   U   6
C   A   1
I want to make a square matrix out of them, so the result will be:
   A  B  C  U
A  0  0  6  0
B  0  0  2  6
C  6  2  0  0
U  0  6  0  0
I initially used pd.crosstab(), but some of the values appearing in F1/F2 are missing from the matrix.
A→C = 5 and C→A = 1, so the output for that pair should be 6.
Also, pd.crosstab() does not treat B→U and U→B as the same pair.
Anyone who could help? I am basically new to python.
Btw, this is my code:
wow = pd.crosstab(B.F1,
                  B.F2,
                  values=B.Count,
                  aggfunc='sum',
                  ).rename_axis(None).rename_axis(None, axis=1)

You can pd.concat wow and wow.T, then group by the index and sum again:
>>> wow = pd.crosstab(B.F1,
...                   B.F2,
...                   values=B.Count,
...                   aggfunc='sum',
...                   ).rename_axis(None).rename_axis(None, axis=1)
>>> wow
     A    C    U
A  NaN  5.0  NaN
B  NaN  2.0  6.0
C  1.0  NaN  NaN
>>> pd.concat([wow, wow.T], sort=True).fillna(0, downcast='infer').groupby(level=0).sum()
   A  B  C  U
A  0  0  6  0
B  0  0  2  6
C  6  2  0  0
U  0  6  0  0
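Note: recent pandas releases deprecate the downcast argument of fillna, so here is an equivalent, version-agnostic sketch of the same idea:
sym = pd.concat([wow, wow.T], sort=True).fillna(0).groupby(level=0).sum().astype(int)
print(sym)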

You can make the columns F1 and F2 categorical and let crosstab do the work.
FDtype = pd.CategoricalDtype(list("ABCU"))
df[["F1", "F2"]] = df[["F1", "F2"]].astype(FDtype)
count = pd.crosstab(df["F1"], df["F2"], df["Count"], aggfunc='sum', dropna=False)
count.fillna(0, inplace=True, downcast="infer")
count += count.T
Remark: it is more efficient to specify the column dtypes while the DataFrame is being constructed.
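For example, a minimal sketch of setting the dtypes at load time (the CSV file name is hypothetical):
import pandas as pd

FDtype = pd.CategoricalDtype(list("ABCU"))
df = pd.read_csv("pairs.csv", dtype={"F1": FDtype, "F2": FDtype})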

You can append the DataFrame with 'F1' and 'F2' swapped to the original DataFrame (via pd.concat, since DataFrame.append was removed in pandas 2.0):
df1 = pd.concat([df, df.rename({'F1': 'F2', 'F2': 'F1'}, axis=1)], sort=False)
Then you can use pivot_table:
res = pd.pivot_table(df1, values='Count', index='F1', columns='F2', aggfunc='sum', fill_value=0)
or crosstab:
res = pd.crosstab(df1.F1, df1.F2, df1.Count, aggfunc='sum').fillna(0)
Finally, remove the column and index names (on the result, not on df1):
del res.columns.name, res.index.name
Result:
   A  B  C  U
A  0  0  6  0
B  0  0  2  6
C  6  2  0  0
U  0  6  0  0
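Putting the steps of this answer together, a minimal end-to-end sketch using the question's sample data:
import pandas as pd

df = pd.DataFrame({'F1': list('ABBC'), 'F2': list('CCUA'),
                   'Count': [5, 2, 6, 1]})

# stack the swapped pairs on top of the original rows
df1 = pd.concat([df, df.rename({'F1': 'F2', 'F2': 'F1'}, axis=1)], sort=False)

res = pd.crosstab(df1.F1, df1.F2, df1.Count, aggfunc='sum').fillna(0).astype(int)
res = res.rename_axis(None).rename_axis(None, axis=1)
print(res)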

Related

Pandas convert rows of dataframe to diagonal dataframe

I have a dataframe where I want to convert each row into a diagonal dataframe and bind all the resulting dataframes into 1 large dataframe.
Input:
            a  b  c
2021-11-06  1  2  3
2021-11-07  4  5  6
Desired output:
              a  b  c
Date
2021-11-06 a  1  0  0
           b  0  2  0
           c  0  0  3
2021-11-07 a  4  0  0
           b  0  5  0
           c  0  0  6
I tried using apply on each row of the original dataframe.
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'],
                    index=pd.date_range('2021-11-06', '2021-11-07'))

def convert_dataframe(ser):
    df_ser = pd.DataFrame(0.0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser

data.apply(lambda x: convert_dataframe(x), axis=1)
However, the output is not the multi-index dataframe that I expected.
The output is instead a single index dataframe where each row is a reference to the diagonal dataframe returned.
Any help is much appreciated. Thanks in advance.
Use MultiIndex.droplevel to remove the first level of the MultiIndex, and call the function in GroupBy.apply after DataFrame.stack:
def convert_dataframe(ser):
    ser = ser.droplevel(0)
    df_ser = pd.DataFrame(0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser

data = data.stack().groupby(level=0).apply(convert_dataframe)
print(data)
              a  b  c
2021-11-06 a  1  0  0
           b  0  2  0
           c  0  0  3
2021-11-07 a  4  0  0
           b  0  5  0
           c  0  0  6
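An alternative sketch (my own variation, not part of the answer above), starting again from the question's original two-row frame: build each diagonal block with np.diag and let the keys of pd.concat create the outer index level:
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'],
                    index=pd.date_range('2021-11-06', '2021-11-07'))

# one diagonal block per row, keyed by its timestamp
blocks = {ts: pd.DataFrame(np.diag(row), index=data.columns, columns=data.columns)
          for ts, row in data.iterrows()}
print(pd.concat(blocks))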

How to extract period and variable name from dataframe column strings for multiindex panel data preparation

I'm new to Python and could not find the answer I'm looking for anywhere.
I have a DataFrame that has the following structure:
df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
A1 A2 B1 B2 C1 C2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Where the numbers are periods and the letters are variables. I'd like to transform the columns so that the periods and variables are split into a MultiIndex. The desired output would look like this:
   A     B     C
   1  2  1  2  1  2
a  0  0  0  0  0  0
b  1  1  1  1  1  1
c  2  2  2  2  2  2
I've tried the following:
periods = list(range(1, 3))
df.columns = df.columns.str.replace(r'\d+', '', regex=True)
df.columns = pd.MultiIndex.from_product([df.columns, periods])
That seems to be multiplying the columns and raising a ValueError: Length mismatch.
In my dataframe I have 72 periods and 12 variables.
Thanks in advance for your help!
Edit: I realized that I haven't been precise enough. I have several column names like Impressions1, Impressions2, ..., Impressions72 and hhi1, hhi2, ..., hhi72. So df.columns.str[0], df.columns.str[1] does not work for me, as the column names have different lengths. I think the solution might involve a regex, but I can't figure out how to write it. Any ideas?
Use pd.MultiIndex.from_tuples:
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
   A     B     C
   1  2  1  2  1  2
a  0  0  0  0  0  0
b  1  1  1  1  1  1
c  2  2  2  2  2  2
Alternative:
pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])
or
pd.MultiIndex.from_tuples(map(tuple, df.columns))
You can also use .str.extract and from_frame:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])
Output:
   A     B     C
   1  2  1  2  1  2
a  0  0  0  0  0  0
b  1  1  1  1  1  1
c  2  2  2  2  2  2
Here is what actually solved my issue:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
Thanks @Scott Boston for the inspiration for this solution!
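For reference, a quick check of that regex on multi-character names (a small made-up frame, not the asker's data):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=['Impressions1', 'Impressions72', 'hhi1', 'hhi72'])
df.columns = pd.MultiIndex.from_frame(
    df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
print(df)  # columns become (Impressions, 1), (Impressions, 72), (hhi, 1), (hhi, 72)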

Pandas redundant multi-indices

I've written some functions to help aggregate data. In the end, they give me what I want, but with a crazy multi-indexed series:
fec988a2-6eba-49e0-8327-a89f25143ccf fec988a2-6eba-49e0-8327-a89f25143ccf com.facebook.katana  fec988a2-6eba-49e0-8327-a89f25143ccf 1067
                                                                          com.android.systemui fec988a2-6eba-49e0-8327-a89f25143ccf  935
                                                                          com.facebook.orca    fec988a2-6eba-49e0-8327-a89f25143ccf  893
                                                                          com.android.chrome   fec988a2-6eba-49e0-8327-a89f25143ccf  739
                                                                          com.whatsapp         fec988a2-6eba-49e0-8327-a89f25143ccf  515
I only need the first index, and the one with the app names (and the value of course). How do I get rid of unwanted indices like this?
You can use reset_index twice: first to remove the unnecessary level (only level 2 here, because group_keys=False in the groupby already removes another one), and then with name='new' to convert the Series to a DataFrame with a new column name:
df = pd.DataFrame({'application': list('abbddedcc'),
                   'id': list('aaabbbbbb')})
print(df)
application id
0 a a
1 b a
2 b a
3 d b
4 d b
5 e b
6 d b
7 c b
8 c b
top = 2
df1 = (df.groupby(['id', 'application'])['id']
         .value_counts()
         .groupby(['id'], group_keys=False)
         .nlargest(top)
         .reset_index(level=2, drop=True)
         .reset_index(name='new'))
print(df1)
id application new
0 a b 2
1 a a 1
2 b d 3
3 b c 2
Or remove id from the first groupby; test whether this gives the same output on your real data:
top = 2
df1 = (df.groupby(['application'])['id']
         .value_counts()
         .groupby(['id'], group_keys=False)
         .nlargest(top)
         .reset_index(name='new'))
print(df1)
application id new
0 b a 2
1 a a 1
2 d b 3
3 c b 2
You can use pd.DataFrame.reset_index() or pd.Series.reset_index() with the drop=True argument:
n = 5
df = pd.DataFrame({'idx0': [0] * n, 'idx1': range(n, 0, -1),
                   'idx2': range(0, n), 'idx3': ['a'] * n,
                   'value': [i / 2 for i in range(n)]},
                  ).set_index(['idx0', 'idx1', 'idx2', 'idx3'])
df
Out:
                     value
idx0 idx1 idx2 idx3
0    5    0    a       0.0
     4    1    a       0.5
     3    2    a       1.0
     2    3    a       1.5
     1    4    a       2.0
df.reset_index(level=(1, 3), drop=True)
Out:
           value
idx0 idx2
0    0       0.0
     1       0.5
     2       1.0
     3       1.5
     4       2.0
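For completeness, newer pandas versions can express the same drop with DataFrame.droplevel (a sketch of the equivalent call):
df.droplevel(['idx1', 'idx3'])  # same result as df.reset_index(level=(1, 3), drop=True)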

Keep the same factorization between two datasets

We have two datasets with one variable, col1.
Some levels are missing in the second dataset. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization between the datasets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two datasets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and applied to df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since performance of vectorized operations improve if you can apply them once to large DataFrames instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 by setting the index to df1.col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
import numpy as np
import pandas as pd

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
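A further option along the same lines (my own sketch, not from the answers above): a shared CategoricalDtype fixes the integer code of every level once, so both frames factorize identically:
import pandas as pd

dtype = pd.CategoricalDtype(["A", "B", "C", "D", "E"])  # the full set of levels
df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})
df1['f_col1'] = df1.col1.astype(dtype).cat.codes
df2['f_col1'] = df2.col1.astype(dtype).cat.codes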

How to replace a value in a pandas dataframe with column name based on a condition?

I have a dataframe that looks something like the one built by the code below.
I want to replace all 1's in the range A:D with the name of the column (so a 1 in column B becomes 'B').
How can I do that?
You can recreate my dataframe with this:
dfz = pd.DataFrame({'A': [1, 0, 0, 1, 0, 0],
                    'B': [1, 0, 0, 1, 0, 1],
                    'C': [1, 0, 0, 1, 3, 1],
                    'D': [1, 0, 0, 1, 0, 0],
                    'E': [22.0, 15.0, None, 10.0, None, 557.0]})
One way could be to use replace and pass in a Series mapping column labels to values (those same labels in this case):
>>> dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))
A B C D
0 A B C D
1 0 0 0 0
2 0 0 0 0
3 A B C D
4 0 0 3 0
5 0 B C 0
To make the change permanent, you'd assign the returned DataFrame back to dfz.loc[:, 'A':'D'].
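For instance, following that suggestion:
dfz.loc[:, 'A':'D'] = dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))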
Solutions aside, it's useful to keep in mind that you may lose a lot of performance benefits when you mix numeric and string types in columns, as pandas is forced to use the generic 'object' dtype to hold the values.
A solution using where:
>>> dfz.where(dfz != 1, dfz.columns.to_series(), axis=1)
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0
Maybe it's not so elegant, but... just loop through the columns and replace:
for i in dfz[['A', 'B', 'C', 'D']].columns:
    dfz[i].replace(1, i, inplace=True)
I do prefer the very elegant solution from @ajcr.
In case you have column names that you can't easily use for slicing, here is my solution (updated to .loc, since .ix has been removed from pandas):
cols = dfz.filter(regex=r'(A|B|C|D)').columns.tolist()
dfz.loc[:, cols] = (dfz[dfz != 1].loc[:, cols]
                    .apply(lambda x: x.fillna(x.name)))
Output:
In [207]: dfz
Out[207]:
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0
