Splitting string multiple times in DataFrame - python

I have a column in a DataFrame that contains a string from which I must retrieve two pieces of information by different separators:
ID STR
280 11040402-38.58551%;11050101-9.29086%;11070101-52.12363%
351 11130203-35%;11130230-65%
510 11070103-69%
655 11090103-41.63463%;11160102-58.36537%
666 11130205-50.00%;11130207-50%
I have been trying to use the .apply method on this series together with a lambda function to make the splitting in one go, to no avail:
df['STR'].apply(lambda x: y.split('-') for y in x.split(';'))
Ideally, not only I would be able to split the string in one go, but also separate the left side of the - from the right side:
ID STR.LEFT STR.RIGHT
280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
351 [11130203, 11130230] [35%, 65%]
510 [11070103] [69%]
655 [11090103, 11160102] [41.63463%, 58.36537%]
666 [11130205, 11130207] [50.00%, 50%]
I believe this could be achievable with .apply and slicing, but any other solution is welcome.

You can try splitting several times:
# set ID as index
df.set_index('ID', inplace=True)
new_series = df.STR.str.split(';', expand=True).stack().reset_index(level=-1,drop=True)
new_df = new_series.str.split('-', expand=True)
new_df.groupby('ID').agg(list).reset_index()
Output:
ID 0 1
-- ---- ------------------------------------ --------------------------------------
0 280 ['11040402', '11050101', '11070101'] ['38.58551%', '9.29086%', '52.12363%']
1 351 ['11130203', '11130230'] ['35%', '65%']
2 510 ['11070103'] ['69%']
3 655 ['11090103', '11160102'] ['41.63463%', '58.36537%']
4 666 ['11130205', '11130207'] ['50.00%', '50%']

str.split
Assuming the pattern always leaves 'l-r;l-r;l-r...'
s = df.STR.str.split('-|;')
df[['ID']].join(pd.concat({'STR.LEFT': s.str[::2], 'STR.RIGTH': s.str[1::2]}, axis=1))
ID STR.LEFT STR.RIGTH
0 280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
1 351 [11130203, 11130230] [35%, 65%]
2 510 [11070103] [69%]
3 655 [11090103, 11160102] [41.63463%, 58.36537%]
4 666 [11130205, 11130207] [50.00%, 50%]
If you want to explode these lists into separate rows
s = df.STR.str.split('-|;')
i = np.arange(len(df)).repeat(s.str.len() // 2)
d = {'STR.LEFT': np.concatenate(s.str[::2]),
'STR.RIGHT': np.concatenate(s.str[1::2])}
df[['ID']].iloc[i].assign(**d).reset_index(drop=True)
ID STR.LEFT STR.RIGHT
0 280 11040402 38.58551%
1 280 11050101 9.29086%
2 280 11070101 52.12363%
3 351 11130203 35%
4 351 11130230 65%
5 510 11070103 69%
6 655 11090103 41.63463%
7 655 11160102 58.36537%
8 666 11130205 50.00%
9 666 11130207 50%

A single str.extractall call will suffice to extract the pairs into separate columns. You can then aggregate them into lists using groupby.
(df['STR'].str.extractall(r'(.*?)-(.*?)(?=;|$)')
.groupby(level=0)
.agg(list)
.set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1, inplace=False))
STR.LEFT STR.RIGHT
0 [11040402, ;11050101, ;11070101] [38.58551%, 9.29086%, 52.12363%]
1 [11130203, ;11130230] [35%, 65%]
2 [11070103] [69%]
3 [11090103, ;11160102] [41.63463%, 58.36537%]
4 [11130205, ;11130207] [50.00%, 50%]
To join with ID, you use just that: join.
(df['STR'].str.extractall(r'(.*?)-(.*?)(?=;|$)')
.groupby(level=0)
.agg(list)
.set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1, inplace=False)
.join(df['ID'])
STR.LEFT STR.RIGHT ID
0 [11040402, ;11050101, ;11070101] [38.58551%, 9.29086%, 52.12363%] 280
1 [11130203, ;11130230] [35%, 65%] 351
2 [11070103] [69%] 510
3 [11090103, ;11160102] [41.63463%, 58.36537%] 655
4 [11130205, ;11130207] [50.00%, 50%] 666

Related

Merge two dataframes based on condition

I have these two dataframes:
sp_client
ConnectionID Value
0 CN01493292 495
1 CN01492424 440
2 CN01491959 403
3 CN01493200 312
4 CN01493278 282
.. ... ...
110 CN01492864 1
111 CN01492513 1
112 CN01492899 1
113 CN01493010 1
114 CN01493032 1
[115 rows x 2 columns]
sp_server
ConnectionID Value
1 CN01491920 2
1 CN01491920 2
3 CN01491922 2
3 CN01491922 2
5 CN01491928 2
.. ... ...
595 CN01493166 3
595 CN01493166 3
595 CN01493166 3
597 CN01493163 2
597 CN01493163 2
[673 rows x 2 columns]
I would like to merge them in a way where sp_client['Value'] increments by addition of sp_sever['Value'] and sp_client['Value'] only when the rows satisfy the condition sp_sever['ConnectionID']==sp_client['ConnectionID'].
It was a little bit complicated for me but I tried the following, but I am missing the condition part. Maybe it does not need to be merged in the first place. Happy to hear suggestions.
as per my comment, try to append tables and group them by ID while summing Value column as per example:
all_data = pd.concat([sp_server,sp_client])
all_data = all_data.groupby('ConnectionID')['Value'].agg(sum).reset_index()
out:
ConnectionID Value
0 CN01491920 4
1 CN01491922 4
2 CN01491928 2
3 CN01491959 403
4 CN01492424 440
5 CN01493200 312

Pandas: How to print the groupby values

I have a following data set from Table_Record:
Seg_ID Lock_ID Code
111 100 1
222 121 2
333 341 2
444 100 1
555 100 1
666 341 2
777 554 4
888 332 5
I am using the sql query to find the Seg_IDs where Lock_ID is repeated:
Select Code,Lock_ID,Seg_ID from Table_Record group by Code, Lock_ID;
Seg_ID Lock_ID Code
111 100 1
444 100 1
555 100 1
222 121 2
333 341 2
666 341 2
777 554 4
888 332 5
How can I achieve the same using Pandas?
Excepted O/P from Pandas is:
eg.
Seg_ID (111,444,555) has Lock_id (1).
Seg_ID (222,333,666) has Lock_ID (2).
First get all codes by filtering only duplicated values and then filter original DaatFrame by boolean indexing with isin:
codes = df.loc[df.duplicated(['Lock_ID']), 'Code'].unique()
df1 = df[df['Code'].isin(codes)]
print (df1)
Seg_ID Lock_ID Code
0 111 100 1
1 222 121 2
2 333 341 2
3 444 100 1
4 555 100 1
5 666 341 2
Then groupby with f-strings:
for k, v in df1.groupby(['Code'])['Seg_ID']:
print (f'Seg_ID {tuple(v)} has Code ({k})')
Seg_ID (111, 444, 555) has Code (1)
Seg_ID (222, 333, 666) has Code (2)
If want output like DataFrame use apply with tuple:
df2 = df1.groupby(['Code'])['Seg_ID'].apply(tuple).reset_index()
print (df2)
Code Seg_ID
0 1 (111, 444, 555)
1 2 (222, 333, 666)
Simply use groupby. As I could understand from your code, you'd want:
grouped= df.groupby(['Code']['LockId'])

Selecting rows with lowest values based on combination two columns from pandas

I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve those rows which have the smallest time associated with x and y. Basically for every element on the y, I want to find which have the smallest time value but I want to exclude those that have time 0.0. This happens when x has the same value as y.
So for example, the fastest way to get to y-0 is by starting from x-225 and so on, therefore it could be the case that x repeats itself but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
I tried up until now groupby but I'm only getting the same x as y.
print(df.groupby('id_y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("id_y")["time"].idxmin()]
Just to point out one thing, I'm open to options, not just groupby, if there are other ways that is very good.
So need remove rows with time equal first by boolean indexing and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Similar alternative with filter by query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')

Reset secondary index in pandas dataframe to start at 1

Suppose I construct a multi-index dataframe like the one show here:
prim_ind=np.array(range(0,1000))
for i in range(0,1000):
prim_ind[i]=round(i/4)
d = {'prim_ind' :prim_ind,
'sec_ind' : np.array(range(1,1001)),
'a' : np.array(range(325,1325)),
'b' : np.array(range(8318,9318))}
df= pd.DataFrame(d).set_index(['prim_ind','sec_ind'])
The sec_ind runs sequentially from 1 upwards, but I want to reset this second index so that for each of the prim_ind levels the sec_ind always starts at 1. I have been trying to work out if I can use reset index to do this but am failing miserably.
I know i could iterate over the dataframe to get this outcome but that will be a horrible way to do it and there must be a more pythonic way - can anyone help?
Note: the dataframe i'm working with is actually imported from csv, the code above is just to illustrate this question.
You can use cumcount for count categories.
df.index = [df.index.get_level_values(0), df.groupby(level=0).cumcount() + 1]
Or better if want also index names is use MultiIndex.from_arrays:
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0),
df.groupby(level=0).cumcount() + 1],
names=df.index.names)
print (df)
a b
prim_ind sec_ind
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
So column sec_ind is not necessary, you can use also:
d = {'prim_ind' :prim_ind,
'a' : np.array(range(325,1325)),
'b' : np.array(range(8318,9318))}
df = pd.DataFrame(d)
print (df.head(8))
a b prim_ind
0 325 8318 0
1 326 8319 0
2 327 8320 0
3 328 8321 1
4 329 8322 1
5 330 8323 1
6 331 8324 2
7 332 8325 2
df = df.set_index(['prim_ind', df.groupby('prim_ind').cumcount() + 1]) \
.rename_axis(('first','second'))
print (df.head(8))
a b
first second
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
2 332 8325

Aligning Dataframes based on count on pandas

I am aligning two dataframes which look like the following:
Dataframe 1
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2404050 622.5 461.3 312 623.3 462.6 260
2404150 623.1 461.5 311 623.4 464 261
2404250 623.6 461.7 310 623.7 465.4 261
2404350 623.8 461.5 309 623.9 466.1 261
Dataframe 2
This dataframe contains the timestamps that a particular event occured.
Timestamp
0 2404030
1 2404050
2 2404250
3 2404266
4 2404282
5 2404298
6 2404314
7 2404330
8 2404350
9 2404382
All timestamps are in milliseconds. As you can see, the first dataframe is resampled to 100milliseconds. So what I want to do is, to align the two dataframes based on count. Which means based on the count how many events occur during a particular 100milliseconds bin time. For example, from the dataframe 1, in the first 100millisecond bin time (24043950 - 2404049), only one event occur according to the second dataframe which is at 2404030 and so on. The aligned table should look like the following:
Timestamp L_x L_y L_a R_x R_y R_a count
2403950 621.3 461.3 313 623.3 461.8 260 1
2404050 622.5 461.3 312 623.3 462.6 260 1
2404150 623.1 461.5 311 623.4 464 261 0
2404250 623.6 461.7 310 623.7 465.4 261 6
2404350 623.8 461.5 309 623.9 466.1 261 2
Thank you for your help and suggestion.
You want to perform integer division on the timestamp (i.e. a // b), but first need to add 50 to it given your bucketing. Then convert it back into the correct units by multiplying by 100 and subtracting 50.
Now, group on this new index and perform a count.
You then merge these counts to your original dataframe and do some formatting operations to get the data in the desired shape. Make sure to fill NaNs with zero.
df2['idx'] = (df2.Timestamp + 50) // 100 * 100 - 50
counts = df2.groupby('idx').count()
>>> counts
Timestamp
idx
2403950 1
2404050 1
2404250 6
2404350 2
df_new =df.merge(counts, how='left', left_on='Timestamp', right_index=True, suffixes=['', '_'])
columns = list(df_new)
columns[-1] = 'count'
df_new.columns = columns
df_new['count'].fillna(0, inplace=True)
>>> df_new
Timestamp L_x L_y L_a R_x R_y R_a count
0 2403950 621.3 461.3 313 623.3 461.8 260 1
1 2404050 622.5 461.3 312 623.3 462.6 260 1
2 2404150 623.1 461.5 311 623.4 464.0 261 0
3 2404250 623.6 461.7 310 623.7 465.4 261 6
4 2404350 623.8 461.5 309 623.9 466.1 261 2

Categories