I've written some functions to help aggregate data. In the end, they give me what I want, but with a crazy multi-indexed series:
fec988a2-6eba-49e0-8327-a89f25143ccf  fec988a2-6eba-49e0-8327-a89f25143ccf  com.facebook.katana   fec988a2-6eba-49e0-8327-a89f25143ccf    1067
                                                                            com.android.systemui  fec988a2-6eba-49e0-8327-a89f25143ccf     935
                                                                            com.facebook.orca     fec988a2-6eba-49e0-8327-a89f25143ccf     893
                                                                            com.android.chrome    fec988a2-6eba-49e0-8327-a89f25143ccf     739
                                                                            com.whatsapp          fec988a2-6eba-49e0-8327-a89f25143ccf     515
I only need the first index, and the one with the app names (and the value of course). How do I get rid of unwanted indices like this?
You can use a double reset_index: the first call removes the unnecessary level (only level 2 here, because group_keys=False in the groupby already prevents an extra level), and the second, with name='new', converts the Series to a DataFrame and sets the new column name:
import pandas as pd

df = pd.DataFrame({'application': list('abbddedcc'),
                   'id': list('aaabbbbbb')})
print (df)
  application id
0           a  a
1           b  a
2           b  a
3           d  b
4           d  b
5           e  b
6           d  b
7           c  b
8           c  b
top = 2
df1 = (df.groupby(['id', 'application'])['id']
          .value_counts()
          .groupby(['id'], group_keys=False)
          .nlargest(top)
          .reset_index(level=2, drop=True)
          .reset_index(name='new'))
print (df1)
  id application  new
0  a           b    2
1  a           a    1
2  b           d    3
3  b           c    2
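For reference, here is a sketch (not part of the original answer) of the intermediate Series before the two reset_index calls; the duplicated id level is exactly the unwanted index from the question:

s = (df.groupby(['id', 'application'])['id']
       .value_counts()
       .groupby(['id'], group_keys=False)
       .nlargest(top))
print (s)
# roughly:
# id  application  id
# a   b            a    2
#     a            a    1
# b   d            b    3
#     c            b    2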
Alternatively, remove id from the first groupby; it should give the same output, but it is worth verifying against your real data:
top = 2
df1 = (df.groupby(['application'])['id']
          .value_counts()
          .groupby(['id'], group_keys=False)
          .nlargest(top)
          .reset_index(name='new'))
print (df1)
  application id  new
0           b  a    2
1           a  a    1
2           d  b    3
3           c  b    2
You can use pd.DataFrame.reset_index() or pd.Series.reset_index() with the drop=True argument:
import pandas as pd

n = 5
df = pd.DataFrame({'idx0': [0] * n, 'idx1': range(n, 0, -1),
                   'idx2': range(0, n), 'idx3': ['a'] * n,
                   'value': [i/2 for i in range(n)]},
                  ).set_index(['idx0', 'idx1', 'idx2', 'idx3'])
df
Out:
                     value
idx0 idx1 idx2 idx3
0    5    0    a       0.0
     4    1    a       0.5
     3    2    a       1.0
     2    3    a       1.5
     1    4    a       2.0
df.reset_index(level=(1, 3), drop=True)
Out:
           value
idx0 idx2
0    0       0.0
     1       0.5
     2       1.0
     3       1.5
     4       2.0
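As a related aside (not from the answers above): if the levels to drop are known by position or name, pandas also provides DataFrame.droplevel, which avoids reset_index entirely:

# equivalent to df.reset_index(level=(1, 3), drop=True)
df.droplevel([1, 3])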
I have two dataframes,
df1:
hash a b c
ABC 1 2 3
def 5 3 4
Xyz 3 2 -1
df2:
hash v
Xyz 3
def 5
I want to make
df:
hash a b c
ABC 1 2 3 (= as is, because no matching 'ABC' in df2)
def 25 15 20 (= 5*5 3*5 4*5)
Xyz 9 6 -3 (= 3*3 2*3 -1*3)
As shown above, I want to make a dataframe whose values come from multiplying df1 and df2 wherever their index (or first column, hash) matches.
As df2 only has one column (v), all of df1's columns except for the first one (the index) should be affected.
Is there any neat, Pythonic, pandas way to achieve it?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar attempts don't seem to work.
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]
res = df1.mul(df2, axis=0).combine_first(df1)
print(res)
Output
         a     b     c
hash
ABC    1.0   2.0   3.0
Xyz    9.0   6.0  -3.0
def   25.0  15.0  20.0
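A side note, not from the original answer: the result comes back as floats because the row-wise mul produces NaN for rows with no match in df2 before combine_first fills them in; if you want integers back, you could cast at the end:

res = res.astype(int)  # assumes all values are whole numbers, as in the example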
One Method:
# We'll make this for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df1; unmatched rows
# get v = NaN, which we fill with 1 (the multiplicative identity)
df = df1.merge(df2, 'left').fillna(1)
# We'll make the v column integers again since it's been filled.
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer needed column:
df = df.drop('v', axis=1)
print(df)
Output:
hash a b c
0 ABC 1 2 3
1 def 25 15 20
2 Xyz 9 6 -3
Alternative Method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply multiplication and fill values
df = (df1.mul(df2.v, axis=0)
          .fillna(df1)
          .astype(int)
          .reset_index())
# Output:
hash a b c
0 ABC 1 2 3
1 Xyz 9 6 -3
2 def 25 15 20
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
hash a b
0 ABC 1 2
1 DEF 5 3
2 XYZ 3 -1
>>> df2
hash v
0 XYZ 4
1 ABC 8
df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
hash a b v
0 ABC 1 2 8.0
1 DEF 5 3 1.0
2 XYZ 3 -1 4.0
df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>> df
hash a b v
0 ABC 8.0 16.0 8.0
1 DEF 5.0 3.0 1.0
2 XYZ 12.0 -4.0 4.0
You can actually drop v at the end if you don't need it.
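For instance (a sketch, not part of the original answer):

df = df.drop('v', axis=1)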
I have the following data in my dataframe B:
F1 F2 Count
A C 5
B C 2
B U 6
C A 1
I want to make a square matrix out of them so the results will be:
   A  B  C  U
A  0  0  6  0
B  0  0  2  6
C  6  2  0  0
U  0  6  0  0
I initially used pd.crosstab() but some variables in F1/F2 are missing from the matrix.
AC = 5 and CA = 1, therefore the output should be 6.
Also, pd.crosstab() does not recognize that BU = UB, etc.
Can anyone help? I am basically new to Python.
Btw, this is my code:
wow = pd.crosstab(B.F1,
                  B.F2,
                  values=B.Count,
                  aggfunc='sum',
                  ).rename_axis(None).rename_axis(None, axis=1)
You can pd.concat wow and wow.T, then groupby the index and sum again:
>>> wow = pd.crosstab(B.F1,
                      B.F2,
                      values=B.Count,
                      aggfunc='sum',
                      ).rename_axis(None).rename_axis(None, axis=1)
>>> wow
     A    C    U
A  NaN  5.0  NaN
B  NaN  2.0  6.0
C  1.0  NaN  NaN
>>> pd.concat([wow, wow.T], sort=True).fillna(0, downcast='infer').groupby(level=0).sum()
   A  B  C  U
A  0  0  6  0
B  0  0  2  6
C  6  2  0  0
U  0  6  0  0
You can make the F1 and F2 columns categorical and let crosstab do the work:
FDtype = pd.CategoricalDtype(list("ABCU"))
df[["F1", "F2"]] = df[["F1", "F2"]].astype(FDtype)
count = pd.crosstab(df["F1"], df["F2"], df["Count"], aggfunc='sum', dropna=False)
count.fillna(0, inplace=True, downcast="infer")
count += count.T
Remark: it is more efficient to specify the column dtypes when the DataFrame is constructed.
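A minimal sketch of that remark, using the example data from the question:

import pandas as pd

FDtype = pd.CategoricalDtype(list("ABCU"))
df = pd.DataFrame({
    "F1": pd.Series(["A", "B", "B", "C"], dtype=FDtype),
    "F2": pd.Series(["C", "C", "U", "A"], dtype=FDtype),
    "Count": [5, 2, 6, 1],
})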
You can append the DataFrame where 'F1' and 'F2' are swapped to the original DataFrame (via pd.concat; DataFrame.append was removed in pandas 2.0).
df1 = pd.concat([df, df.rename({'F1': 'F2', 'F2': 'F1'}, axis=1)], sort=False)
Then you can use pivot_table:
res = pd.pivot_table(df1, values='Count', index='F1', columns='F2', aggfunc='sum', fill_value=0)
or crosstab:
res = pd.crosstab(df1.F1, df1.F2, df1.Count, aggfunc='sum').fillna(0)
Finally, remove the column and index names:
res = res.rename_axis(None).rename_axis(None, axis=1)
Result:
   A  B  C  U
A  0  0  6  0
B  0  0  2  6
C  6  2  0  0
U  0  6  0  0
I have a DataFrame with strings and the index as below.
df0
idx  name_id_code  string_line_0
0    0.01          A
1    0.5           B
2    77.6          C
3    29.8          D
4    56.2          E
5    88.1000005    F
6    66.4000008    G
7    2.1           H
8    99            I
9    550.9999999   J
df1
idx  string_line_1
0    A
1    F
2    J
3    G
4    D
Now I want to match df1 with df0, taking the values where df1 equals df0 but keeping df0's original index, as below:
df_result
idx  name_id_code  string_line_0
0    0.01          A
5    88.1000005    F
9    550.9999999   J
6    66.4000008    G
3    29.8          D
I tried the code below, but it didn't work for strings, only for matching the index:
c = df0['name_id_code'] + ' (' + df0['string_line_0'].astype(str) + ')'
out = df1[df2['string_line_1'].isin(s)]
I also tried to keep it simple and just match the last column:
c = df0['string_line_0'].astype(str) + ')'
out = df1[df1['string_line_1'].isin(s)]
but I get a blank output.
Because you filter the df0 DataFrame, its index values are not changed if you use Series.isin with df1['string_line_1']; the column order stays as in the original df0, though note the rows keep df0's row order rather than df1's:
out = df0[df0['string_line_0'].isin(df1['string_line_1'])]
print (out)
     name_id_code string_line_0
idx
0        0.010000             A
3       29.800000             D
5       88.100001             F
6       66.400001             G
9      551.000000             J
Or, if you use DataFrame.merge, then to avoid losing df0's index it is necessary to add DataFrame.reset_index:
out = (df1.rename(columns={'string_line_1':'string_line_0'})
           .merge(df0.reset_index(), on='string_line_0'))
print (out)
  string_line_0  idx  name_id_code
0             A    0      0.010000
1             F    5     88.100001
2             J    9    551.000000
3             G    6     66.400001
4             D    3     29.800000
A similar solution that merges directly on the string_line_1 and string_line_0 columns instead of renaming:
out = (df1.merge(df0.reset_index(), left_on='string_line_1', right_on='string_line_0'))
print (out)
  string_line_1  idx  name_id_code string_line_0
0             A    0      0.010000             A
1             F    5     88.100001             F
2             J    9    551.000000             J
3             G    6     66.400001             G
4             D    3     29.800000             D
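A follow-up note, not in the original answer: if you want idx back as the actual index, as shown in the desired df_result, you could set it after the merge:

out = out.set_index('idx')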
You can do:
out = df0.loc[(df0["string_line_0"].isin(df1["string_line_1"]))].copy()
out["string_line_0"] = pd.Categorical(out["string_line_0"], categories=df1["string_line_1"].unique())
out.sort_values(by=["string_line_0"], inplace=True)
The first line filters df0 to just the rows where string_line_0 is in the string_line_1 column of df1.
The second line converts string_line_0 in the output df to a Categorical column, which sort_values then orders according to the order of the values in df1.
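A sketch of the expected result under this approach (matching the desired output in the question):

print(out)
#     name_id_code string_line_0
# idx
# 0           0.01             A
# 5     88.1000005             F
# 9    550.9999999             J
# 6     66.4000008             G
# 3           29.8             D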
I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced), I want to create a new DataFrame with equally spaced index values and forward fill column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
import numpy as np

df = df.reindex(np.arange(0, df.index.max() + 1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization across the two data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
       col1  f_col1
df1 0     A       0
    1     B       1
    2     C       2
    3     D       3
    4     E       4
df2 0     A       0
    1     B       1
    2     D       3
    3     E       4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since the performance of vectorized operations improves if you can apply them once to large DataFrames instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
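One caveat, as an addition to the original answer: values that appear in df2 but not in df1 come out of the left merge as NaN; you could mark them with -1, the same sentinel pd.factorize uses for missing values:

df2['f_col1'] = df2['f_col1'].fillna(-1).astype(int)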
You could reuse the f_col1 column of df1 and map the values of df2.col1 after setting col1 as the index of df1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them first using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get the unique values across both sets of data, then create a Series or a dictionary. This is your factorization, and it can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4