How do I split a dataframe by a repeating index and enumerate?

I have a pandas dataframe that looks like the following:
   0  1  2
#  A  B  C
1  D  E  F
2  G  H  I
#  J  K  L
1  M  N  O
2  P  Q  R
3  S  T  U
The index has a repeating 'delimiter', namely #. I am seeking an efficient way to transform this to the following:
   0  1  2  3
#  A  B  C  1
1  D  E  F  1
2  G  H  I  1
#  J  K  L  2
1  M  N  O  2
2  P  Q  R  2
3  S  T  U  2
I would like a new column (3) that splits the rows on the # delimiter and enumerates the resulting chunks. This is for an NLP application, and the dataset I am working with can be found here for context: https://sites.google.com/site/germeval2014ner/data.
By the way, I know I can do this with a simple iteration, but I am wondering if there is a vectorized approach or a split capability I am not aware of.
Thanks for your help!

Something like
df['new_col'] = (df.index == '#').cumsum()
Output:
   1  2  3  new_col
0
#  A  B  C        1
1  D  E  F        1
2  G  H  I        1
#  J  K  L        2
1  M  N  O        2
2  P  Q  R        2
3  S  T  U        2
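The comparison df.index == '#' yields a boolean array that is True at the start of each chunk, and cumsum turns those Trues into a running chunk number. A minimal self-contained sketch, with the example frame reconstructed from the question (an assumption, since only the printed frame is shown):

import pandas as pd

# assumed reconstruction of the question's frame
df = pd.DataFrame(
    [list('ABC'), list('DEF'), list('GHI'),
     list('JKL'), list('MNO'), list('PQR'), list('STU')],
    index=['#', '1', '2', '#', '1', '2', '3'],
)

# True at every '#' row; the cumulative sum counts how many
# '#' rows have been seen so far, which enumerates the chunks
df['new_col'] = (df.index == '#').cumsum()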

Related

How to create conditional pandas series/column?

Here is a sample df:
   A  B  C  D  E (New Column)
0  1  2  a  n  ?
1  3  3  p  d  ?
2  5  9  f  z  ?
If Column A == Column B, pick Column C's value and apply it to Column E;
otherwise, pick Column D's value and apply it to Column E.
I have tried many ways but failed. I am new, please teach me how to do it, thank you!
Note:
It needs to pick the value from Col. C or Col. D in this case, so no specific fill values are provided for Col. E (this is the main difference from other similar questions).
Use numpy.where:

import numpy as np

df['E'] = np.where(df['A'] == df['B'],  # condition
                   df['C'],             # value where True
                   df['D'])             # value where False
df

   A  B  C  D  E
0  1  2  a  n  n
1  3  3  p  d  p
2  5  9  f  z  z
Try pandas Series.where:
df['E'] = df['C'].where(df['A'].eq(df['B']), df['D'])
df

   A  B  C  D  E
0  1  2  a  n  n
1  3  3  p  d  p
2  5  9  f  z  z
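Both answers give the same column here. One difference worth noting (my observation, not part of the original answers): Series.where keeps the result as an index-aligned pandas Series, while np.where returns a plain NumPy array that is then assigned. A self-contained sketch, assuming the sample frame above:

import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 3, 9],
                   'C': ['a', 'p', 'f'], 'D': ['n', 'd', 'z']})

# keep C where A equals B, otherwise fall back to D
df['E'] = df['C'].where(df['A'].eq(df['B']), df['D'])
print(df)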

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx  A  X
0    1  A
1    2  B
2    3  C
3    4  D
4    1  E
5    2  F
and
df_2:
idx  B  Y
0    1  H
1    2  I
2    4  J
3    2  K
4    3  L
5    1  M
my goal is to get the following:
df_result:
idx  A  X  B  Y
0    1  A  1  H
1    2  B  2  I
2    4  D  4  J
3    2  F  2  K
I am trying to match the A and B columns, based on the column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2,
                           left_on='idx', right_on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
               .dropna()
               .drop(labels='idx', axis='columns')
               .reset_index(drop=True))
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'],
                       right_on=['idx', 'B'])
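For reference, here is a self-contained sketch of the asker's merge_asof approach, with the frames rebuilt from the question. For each df_1 row, merge_asof takes the nearest df_2 row at or before the same idx, within the same A/B group, as long as the idx values differ by at most 2 (both frames must already be sorted on idx, which they are here):

import pandas as pd

df_1 = pd.DataFrame({'idx': range(6),
                     'A': [1, 2, 3, 4, 1, 2],
                     'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6),
                     'B': [1, 2, 4, 2, 3, 1],
                     'Y': list('HIJKLM')})

# backward search: last df_2 row with idx <= this row's idx,
# restricted to matching A/B values and an idx gap of at most 2
df_result = (pd.merge_asof(df_1, df_2, on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
             .dropna()
             .drop(labels='idx', axis='columns')
             .reset_index(drop=True))
print(df_result)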

How to get top 5 items for each group in grouped dataframe?

df = pd.DataFrame({'Weekday': list('MMMMMMMMMMTTTTTTTTTT'),
                   'Items': list('AAABBCDEFGBBBCCADEFG')})
grouped = df.groupby(['Weekday', 'Items'], sort=True).agg({'Items': 'count'})
Then, I get the result of grouped:
Weekday Items
M       A        3
        B        2
        C        1
        D        1
        E        1
        F        1
        G        1
T       A        1
        B        3
        C        2
        D        1
        E        1
        F        1
        G        1
So how do I output the top 5 items for each weekday (5 for 'M' and 5 for 'T'), like:
Weekday Items
M       A      3
        B      2
        C      1
        D      1
        E      1
T       B      3
        C      2
        A      1
        D      1
        E      1
Can anyone help with this?
df = pd.DataFrame({'Weekday': list('MMMMMMMMMMTTTTTTTTTT'),
                   'Item': list('AAABBCDEFGBBBCCADEFG')})
grouped = df.groupby(['Weekday', 'Item'], sort=True).agg(count=('Item', 'count'))
grouped.sort_values(['Weekday', 'count'], ascending=False).groupby('Weekday').head(5)
              count
Weekday Item
T       B         3
        C         2
        A         1
        D         1
        E         1
M       A         3
        B         2
        C         1
        D         1
        E         1
grouped = (df.groupby(['Weekday', 'Items'])
             .Items.agg(counter='count')
             .groupby(['Weekday'], as_index=False))

pd.concat([group.nlargest(5, 'counter') for name, group in grouped])
               counter
Weekday Items
M       A            3
        B            2
        C            1
        D            1
        E            1
T       B            3
        C            2
        A            1
        D            1
        E            1
Groupby twice: the first groupby computes the counter variable, and the second lets you iterate through the groups to take the top 5 of each with nlargest. The last step concatenates the dataframes in the list into one.
vb_rise's solution should be faster, as it avoids the iteration.
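As a side note (not from the original answers), SeriesGroupBy.value_counts already returns counts sorted in descending order within each group, so a compact sketch of the same result is:

import pandas as pd

df = pd.DataFrame({'Weekday': list('MMMMMMMMMMTTTTTTTTTT'),
                   'Items': list('AAABBCDEFGBBBCCADEFG')})

# value_counts sorts descending within each Weekday group,
# so head(5) per group yields the top 5 items directly
top5 = (df.groupby('Weekday')['Items']
          .value_counts()
          .groupby(level='Weekday')
          .head(5))
print(top5)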

How to split a string and assign as column name for a pandas dataframe?

I have a dataframe which has a single column like this:
   a;d;c;d;e;r;w;e;o
   -----------------
0  h;j;r;d;w;f;g;t;r
1  a;f;c;x;d;e;r;t;y
2  b;h;g;t;t;t;y;u;f
3  g;t;u;n;b;v;d;s;e
When I split it, I get this:

   0  1  2  3  4  5  6  7  8
   --------------------------
0  h  j  r  d  w  f  g  t  r
1  a  f  c  x  d  e  r  t  y
2  b  h  g  t  t  t  y  u  f
3  g  t  u  n  b  v  d  s  e
I need to assign a d c d e r w e o as the column names instead of 0 1 2 3 4 5 6 7 8.
I tried:
df = dataframe
df = df.iloc[:,0].str.split(';')
res = pd.DataFrame(df.columns.tolist())
res = pd.DataFrame(df.values.tolist())
I am getting values assigned to each column, but not the column headers. What should I do?
I think you need to create a new DataFrame using the expand=True parameter and then assign the new column names:
res = df.iloc[:,0].str.split(';', expand=True)
res.columns = df.columns[0].split(';')
print(res)

   a  d  c  d  e  r  w  e  o
0  h  j  r  d  w  f  g  t  r
1  a  f  c  x  d  e  r  t  y
2  b  h  g  t  t  t  y  u  f
3  g  t  u  n  b  v  d  s  e
But maybe you only need sep=';' in read_csv, if the data really is a single delimited column:
res = pd.read_csv(file, sep=';')
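To make the first approach reproducible, here is a minimal sketch; the frame is rebuilt from the question (an assumption on my part), with the single column's name carrying the real headers:

import pandas as pd

# assumed reconstruction: one column whose *name* holds the headers
df = pd.DataFrame({'a;d;c;d;e;r;w;e;o': ['h;j;r;d;w;f;g;t;r',
                                         'a;f;c;x;d;e;r;t;y',
                                         'b;h;g;t;t;t;y;u;f',
                                         'g;t;u;n;b;v;d;s;e']})

res = df.iloc[:, 0].str.split(';', expand=True)  # one column per field
res.columns = df.columns[0].split(';')           # reuse the header names
print(res)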

pandas groupby operation with missing data

In a pandas dataframe I have a column that looks like:
0 M
1 E
2 L
3 M.1
4 M.2
5 M.3
6 E.1
7 E.2
8 E.3
9 E.4
10 L.1
11 L.2
12 M.1.a
13 M.1.b
14 M.1.c
15 M.2.a
16 M.3.a
17 E.1.a
18 E.1.b
19 E.1.c
20 E.2.a
21 E.3.a
22 E.3.b
23 E.4.a
I need to group all the values whose first element is E, M, or L; then, within each group, I need a subgroup keyed by the second element (1, 2, or 3), which in turn contains a record for each lowercase letter (a, b, c, ...).
Ideally the solution should work for any number of concatenated levels (in this case the number of levels is 3, e.g. A.1.a):
0  1  2
E  1  a
      b
      c
   2  a
   3  a
      b
   4  a
L  1
   2
M  1  a
      b
      c
   2  a
   3  a
I tried with:
df.groupby([0,1,2]).count()
But the result is missing the L level, because it doesn't have records at the last sub-level.
A workaround is to add a dummy variable and then remove it ... like:
df[2][(df[0] == 'L') & (df[2].isnull()) & (df[1].notnull())] = 'x'
df = df.replace(np.nan, ' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0, 1, 2]).count()
which gives:
0  1  2
E  1  a
      b
      c
   2  a
   3  a
      b
   4  a
L  1  x
   2  x
M  1  a
      b
      c
   2  a
   3  a
I then deal with the dummy entry x later in my code...
How can I avoid this hackish way of using groupby?
Assuming the column under consideration is represented by s, we can:
Split on the "." delimiter with expand=True to produce an expanded DF.
fnc: checks whether all elements of the grouped frame are None; if so, they are replaced by a dummy entry "", built via a list comprehension. A Series constructor is then called on the resulting list, and any remaining None values are dropped with dropna.
Perform a groupby w.r.t. the 0 & 1 column names and apply fnc to column 2.
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)
produces:

0  1
E  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
      2    b
   4  1    a
L  1  0
   2  0
M  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
Name: 2, dtype: object
To obtain a flattened DF, reset the same index levels that were used to group the DF before:

(split_str.groupby([0, 1])[2]
          .apply(fnc)
          .reset_index(level=[0, 1])
          .reset_index(drop=True))
produces:
    0  1  2
0   E  1  a
1   E  1  b
2   E  1  c
3   E  2  a
4   E  3  a
5   E  3  b
6   E  4  a
7   L  1
8   L  2
9   M  1  a
10  M  1  b
11  M  1  c
12  M  2  a
13  M  3  a
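For completeness, here is a self-contained sketch of that pipeline; the input Series is rebuilt from a subset of the question's values (an assumption on my part):

import pandas as pd

s = pd.Series(['M', 'E', 'L', 'M.1', 'E.1', 'L.1',
               'M.1.a', 'M.1.b', 'E.1.a'])

split_str = s.str.split('.', expand=True)  # columns 0, 1, 2, padded with None

# an all-None group gets a dummy '' so it still yields a row;
# dropna then removes the None padding inside mixed groups
fnc = lambda g: pd.Series(
    ['' if all(x is None for x in g) else x for x in g]
).dropna()

out = (split_str.groupby([0, 1])[2]
                .apply(fnc)
                .reset_index(level=[0, 1])
                .reset_index(drop=True))
print(out)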
Maybe you could do it with a regex:

import pandas as pd

df = pd.read_clipboard(header=None).iloc[:, 1]
df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
print(df2.set_index([0, 1]))
and the result is:

      2
0  1
M
E
L
M  1
   2
   3
E  1
   2
   3
   4
L  1
   2
M  1  a
   1  b
   1  c
   2  a
   3  a
E  1  a
   1  b
   1  c
   2  a
   3  a
   3  b
   4  a
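One caveat (my note, not part of the answer): the regex above hardcodes exactly three levels, while the question asks for a solution that works at any depth. A small sketch that adapts automatically is to split rather than extract; the fourth-level entry here is hypothetical:

import pandas as pd

s = pd.Series(['M', 'M.1', 'M.1.a', 'M.1.a.x'])  # 'M.1.a.x' is hypothetical

# one column per level, however deep; shallower entries are padded with None
levels = s.str.split('.', expand=True)
print(levels)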
