How to position elements in a table in Pandas - python

I'm using Pandas, and I would like to reposition elements within the columns. I currently have:
Type Label  Initial  2022  2023  Difference
APPS        A/B/C     500   469          31
BACS        B/C/D       5     3           2
CAPS        C/D/E      10     5           5
I would like the table to be displayed like this:
Type Label/Initial  2022  2023  Difference
APPS                 500   469          31
A
B
C
BACS                   5     3           2
B
C
D
CAPS                  10     5           5
C
D
E

Join columns Type Label and Initial (using DataFrame.pop to drop the Initial column), then split by /, use DataFrame.explode, rename the column, and set empty strings for the repeated values:
s = (df['Type Label'] + '/' + df.pop('Initial')).str.split('/')
df = df.assign(**{'Type Label': s}).explode('Type Label').rename(columns={'Type Label': 'Type Label/Initial'})
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.index.to_series().duplicated(), '')
df = df.reset_index(drop=True)
print (df)
   Type Label/Initial 2022 2023 Difference
0                APPS  500  469         31
1                   A
2                   B
3                   C
4                BACS    5    3          2
5                   B
6                   C
7                   D
8                CAPS   10    5          5
9                   C
10                  D
11                  E
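For reference, here is a self-contained version of the above; the sample frame is reconstructed from the question's table, so treat this as a sketch:
import pandas as pd

df = pd.DataFrame({
    'Type Label': ['APPS', 'BACS', 'CAPS'],
    'Initial': ['A/B/C', 'B/C/D', 'C/D/E'],
    '2022': [500, 5, 10],
    '2023': [469, 3, 5],
    'Difference': [31, 2, 5],
})

# Join the two columns, split on '/', and explode into one row per part
s = (df['Type Label'] + '/' + df.pop('Initial')).str.split('/')
df = (df.assign(**{'Type Label': s})
        .explode('Type Label')
        .rename(columns={'Type Label': 'Type Label/Initial'}))
# Blank out the numeric values on the repeated (exploded) rows
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.index.to_series().duplicated(), '')
df = df.reset_index(drop=True)
print(df)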

Related

Splitting the total time (in seconds) and fill the rows of a column value in 1 second frame

I have a dataframe that looks like the following (start_time and stop_time are in seconds, followed by milliseconds), and my expected output should look like the example further down.
I don't know how to approach this. Forward filling may fill NaN values, but I need the total time in seconds to be divided and saved as one-second frames in accordance with the respective labels. I don't have any code snippet to go forward with. All I did was save it in a dataframe as:
df = pd.DataFrame(data, columns=['Labels', 'start_time', 'stop_time'])
Thank you and I really appreciate the help.
>>> df2 = pd.DataFrame({
...     "Labels" : df.apply(lambda x: [x.Labels]*(round(x.stop_time)-round(x.start_time)), axis=1).explode(),
...     "start_time" : df.apply(lambda x: range(round(x.start_time), round(x.stop_time)), axis=1).explode()
... })
>>> df2['stop_time'] = df2.start_time + 1
>>> df2
Labels start_time stop_time
0 A 0 1
0 A 1 2
0 A 2 3
0 A 3 4
0 A 4 5
0 A 5 6
0 A 6 7
0 A 7 8
0 A 8 9
1 B 9 10
1 B 10 11
1 B 11 12
1 B 12 13
2 C 13 14
2 C 14 15
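A self-contained version of the same approach. The question's input table is not shown above, so the frame below is an assumption reconstructed from the output (one row per label with fractional-second bounds):
import pandas as pd

# Assumed input, reconstructed from the output above
df = pd.DataFrame({
    'Labels': ['A', 'B', 'C'],
    'start_time': [0.0, 9.2, 13.1],
    'stop_time': [9.2, 13.1, 15.0],
})

df2 = pd.DataFrame({
    # Repeat each label once per whole second between its rounded bounds
    'Labels': df.apply(lambda x: [x.Labels] * (round(x.stop_time) - round(x.start_time)), axis=1).explode(),
    # One start_time per one-second frame
    'start_time': df.apply(lambda x: range(round(x.start_time), round(x.stop_time)), axis=1).explode(),
})
df2['stop_time'] = df2.start_time + 1
print(df2)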

Select rows from dataframe where the difference in time is smallest per group

QUESTION: How do I find all rows in a pandas data frame which have the minimum time difference when compared to the time of an advice?
Example:
  Advicenr  Advicehour Setdownnr  Zone  Setdownhour
0        A           1         A    16            2   <-- Zone 16 is closest to Advicehour of A
1        A           1         A    16            3
2        A           2         A    18            5
3        A           2         A    18            8
4        B           4         B    19           18
5        B           8         B    20           12   <-- Zone 20 is closest to Advicehour of B
Expected output:
  Advicenr  Advicehour Setdownnr  Zone  Setdownhour
0        A           1         A    16            3
1        A           1         A    16            2
5        B           8         B    20           12
It is not possible that setdownnr is before advice, and it should also not be possible that an advice for a different zone has a timestamp before the previous one ended.
First create a column of absolute differences between the columns, then get the Zone with the minimal difference per group, and select all rows that match:
df['diff'] = df['Setdownhour'].sub(df['Advicehour']).abs()
s = df.set_index('Zone').groupby('Advicenr', sort=False)['diff'].transform('idxmin')
df = df[(s == s.index).to_numpy()]
print (df)
Advicenr Advicehour Setdownnr Zone Setdownhour diff
0 A 1 A 16 2 1
1 A 1 A 16 3 2
5 B 8 B 20 12 4
Solution without helper column in output:
s = df['Setdownhour'].sub(df['Advicehour']).abs()
s1 = df.assign(s = s).set_index('Zone').groupby('Advicenr')['s'].transform('idxmin')
df = df[(s1 == s1.index).to_numpy()]
print (df)
Advicenr Advicehour Setdownnr Zone Setdownhour
0 A 1 A 16 2
1 A 1 A 16 3
5 B 8 B 20 12
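For reference, a self-contained version of the second solution, with the sample frame reconstructed from the question's table (an assumption, not the OP's actual data):
import pandas as pd

df = pd.DataFrame({
    'Advicenr':    ['A', 'A', 'A', 'A', 'B', 'B'],
    'Advicehour':  [1, 1, 2, 2, 4, 8],
    'Setdownnr':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Zone':        [16, 16, 18, 18, 19, 20],
    'Setdownhour': [2, 3, 5, 8, 18, 12],
})

# Absolute difference per row, idxmin per Advicenr group over the Zone index,
# then keep the rows whose Zone matches the group's best Zone
s = df['Setdownhour'].sub(df['Advicehour']).abs()
s1 = df.assign(s=s).set_index('Zone').groupby('Advicenr')['s'].transform('idxmin')
print(df[(s1 == s1.index).to_numpy()])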
Thanks to advice from Jezrael, ended up doing:
df['diff'] = df['Setdownhour'].sub(inner_join_tote_nr['Advicehour']).abs()
df['avg_diff'] = df.groupby(['Setdownnr', 'Advicehour', 'Zone'])['diff'].transform('min')
s = df.groupby(['Advicenr', 'Advicehour'], sort=False)['avg_diff'].min().reset_index()
selected = pd.merge(s, inner_join_tote_nr, left_on=['Advicenr','Advicehour', 'avg_diff'], right_on = ['Advicenr','Advicehour', 'avg_diff'])

Retrieving Unknown Column Names from DataFrame.apply

How can I retrieve column names from a call to DataFrame.apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the amount and names of which I don't know at build-time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
    return pd.Series(col * 2, name=col.name + '2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: I am doubling column values, so missing something, but the question about the column counts stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
In [8]: x = df.apply(fxn) # Apply function on df and store result separately
In [10]: df = pd.concat([df, x], axis=1) # Concat with original df to get all columns
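A minimal, self-contained run of this approach, using the question's fxn and df. Note the duplicated column names, which the renaming step below then fixes:
import pandas as pd

def fxn(col):
    return pd.Series(col * 2, name=col.name + '2')

df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
x = df.apply(fxn)                 # result columns keep their original names
df = pd.concat([df, x], axis=1)
print(df.columns.tolist())        # ['A', 'B', 'A', 'B'] -- duplicates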
Rename duplicate column names by adding suffixes:
In [82]: from collections import Counter
In [38]: mylist = df.columns.tolist()
In [41]: d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(mylist).items()}
In [62]: df.columns = [i+str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
In [63]: df
Out[63]:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Output:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
Alternatively, you can leverage @MayankPorwal's answer by using .add_suffix('2') on the output of your apply function:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
In your function, name=col.name + '2' does nothing (it effectively returns just col * 2), because apply assigns the result back under the original column name.
Anyway, you can take the MayankPorwal approach: pd.concat plus making the duplicated columns unique. Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount() + 1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
which returns the same output and additionally handles further duplicated columns.
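A quick check of that renaming step on a frame built directly with duplicated column names (the 'A'/'B' names are hypothetical):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['A', 'B', 'A', 'B'])
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount() + 1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
print(df.columns.tolist())  # ['A', 'B', 'A2', 'B2']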
Answer on behalf of the OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
    length = row[0]
    indices = [row.index[0] + str(i) for i in range(0, length)]
    series = pd.Series([i for i in range(0, length)], index=indices)
    return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)

Selecting rows from one DataFrame depending on values from another

Take two data frames
print(df1)
A B
0 a 1
1 a 3
2 a 5
3 b 7
4 b 9
5 c 11
6 c 13
7 c 15
print(df2)
C D
a apple 1
b pear 1
c apple 1
So the values in column df1['A'] are the indexes of df2.
I want to select the rows in df1 where the values in column A are 'apple' in df2['C']. Resulting in:
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
Made many edits due to comments and question edits.
Basically, you first extract the indexes of df2 by filtering that dataframe by the values in C, then filter df1 by those indexes with isin:
indexes = df2[df2['C']=='apple'].index
df1[df1['A'].isin(indexes)]
>>>
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
UPDATE
If you want to minimize memory allocation, try to avoid saving intermediate results (note: I am not sure this will solve your memory-allocation issue, because I do not have the full details of the situation, and it may not be a suitable solution):
df1[df1['A'].isin( df2[df2['C']=='apple'].index)]
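A self-contained sketch with the question's frames reconstructed:
import pandas as pd

df1 = pd.DataFrame({'A': list('aaabbccc'),
                    'B': [1, 3, 5, 7, 9, 11, 13, 15]})
df2 = pd.DataFrame({'C': ['apple', 'pear', 'apple'],
                    'D': [1, 1, 1]}, index=list('abc'))

# Keep the rows of df1 whose 'A' value is an index of an 'apple' row in df2
print(df1[df1['A'].isin(df2[df2['C'] == 'apple'].index)])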

Select columns with one of the strings in a list in their name?

I have a dataframe with columns that follow certain naming convention. I want to keep only those that have 'out' and 'agg' as prefixes in the header.
I've drafted the following code to achieve this. I created a list so that I can make this a small function and call it for any combination of col prefixes that I want to extract.
prefix = ['out', 'agg']
cols = []
for pref in prefix:
    cols = cols + [col for col in df.columns if pref in col]
df = df[cols].dropna(how='all', axis=0)
Is there a shorter/faster way to do this? I liked the solutions here: Drop columns whose name contains a specific string from pandas DataFrame, but couldn't make them work for a list of strings.
thanks
Use DataFrame.filter with a regex that matches the column names, joining the strings with | (regex OR):
df = pd.DataFrame({
'A_out':list('abcdef'),
'B_out':[4,5,4,5,5,4],
'C_agg':[7,8,9,4,2,3],
'agg_D':[1,3,5,7,1,0],
'out_E':[5,3,6,9,2,4],
'F_agg':list('aaabbb')
})
prefix = ['out', 'agg']
If you need to match anywhere in the column names:
df0 = df.filter(regex='|'.join(prefix)).dropna(how='all')
print (df0)
A_out B_out C_agg agg_D out_E F_agg
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
If you need to match only suffixes, add $ to match the end of the strings:
df1 = df.filter(regex='|'.join(f'{x}$' for x in prefix)).dropna(how='all')
print (df1)
A_out B_out C_agg F_agg
0 a 4 7 a
1 b 5 8 a
2 c 4 9 a
3 d 5 4 b
4 e 5 2 b
5 f 4 3 b
If you need to match only prefixes, add ^ to match the start of the strings:
df2 = df.filter(regex='|'.join(f'^{x}' for x in prefix)).dropna(how='all')
print (df2)
agg_D out_E
0 1 5
1 3 3
2 5 6
3 7 9
4 1 2
5 0 4
