I have a dataset with two columns:
id to
0 1 0x954b890704693af242613edef1b603825afcd708
1 1 0x954b890704693af242613edef1b603825afcd708
2 1 0x607f4c5bb672230e8672085532f7e901544a7375
3 1 0x9b9647431632af44be02ddd22477ed94d14aacaa
4 2 0x9b9647431632af44be02ddd22477ed94d14aacaa
and I would like to print any value in column 'to' that appears under more than one distinct value of column 'id'. In the example above, the only value that should be printed is 0x9b9647431632af44be02ddd22477ed94d14aacaa.
I have done this with a nested for loop, but I wonder if there is a better way of doing it:
# compare every row against every other row (quadratic, hence slow)
for index, row in df.iterrows():
    to = row['to']
    row_id = row['id']  # renamed to avoid shadowing the built-in id()
    for index2, row2 in df.iterrows():  # inner loop needs its own variables
        if row2['to'] == to and row2['id'] != row_id:
            print(to)
You can use df.groupby on column 'to', take nunique of 'id', and keep only the entries > 1. So:
import pandas as pd
d = {'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
'to': {0: '0x954b890704693af242613edef1b603825afcd708',
1: '0x954b890704693af242613edef1b603825afcd708',
2: '0x607f4c5bb672230e8672085532f7e901544a7375',
3: '0x9b9647431632af44be02ddd22477ed94d14aacaa',
4: '0x9b9647431632af44be02ddd22477ed94d14aacaa'}}
df = pd.DataFrame(d)
nunique = df.groupby('to')['id'].nunique()
print(nunique)
to
0x607f4c5bb672230e8672085532f7e901544a7375 1
0x954b890704693af242613edef1b603825afcd708 1
0x9b9647431632af44be02ddd22477ed94d14aacaa 2
res = nunique[nunique>1]
print(res.index.tolist())
['0x9b9647431632af44be02ddd22477ed94d14aacaa']
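An equivalent sketch, if you want the matching rows rather than just the values, uses groupby plus filter on the same frame:
multi = df.groupby('to').filter(lambda g: g['id'].nunique() > 1)
print(multi['to'].unique().tolist())
['0x9b9647431632af44be02ddd22477ed94d14aacaa']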
Is there a direct way to drop all columns before a matching string in a pandas DataFrame? E.g. if column 8 contains the string 'Matched', I want to drop columns 0 to 7.
Well, you did not give any information about where and how to look for 'Matched', but let's say the integer col_num holds the index of the matched column:
col_num = np.where(df == 'Matched')[1][0]
df.drop(columns=df.columns[0:col_num], inplace=True)
will do the drop
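A minimal sketch of that approach on a hypothetical frame (the column names and the 'Matched' cell below are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': ['Matched'], 'E': [5]})
# column index of the first cell equal to 'Matched'
col_num = np.where(df == 'Matched')[1][0]
df.drop(columns=df.columns[0:col_num], inplace=True)
print(df)
         D  E
0  Matched  5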
Example
data = {'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'Match1': {0: 4}, 'D': {0: 5}}
df = pd.DataFrame(data)
df
A B C Match1 D
0 1 2 3 4 5
Code
Remove the columns in front of the first 'Match*' column, via boolean indexing:
df.loc[:, df.columns.str.startswith('Match').cumsum() > 0]
Result
Match1 D
0 4 5
I have a DataFrame with a really strange format: columns are matched/linked two by two.
The first column of each pair contains keys and codes associated with the adjacent column's values.
Here's what it looks like:
1      2                 3      4
Name   letter_1          Name   letter_2
Title  Choose a letter:  Title  Choose another letter
1      a                 1      z
2      b                 2      y
3      c
4      d
And here's what I need:
Name      Title                  Code  Label
letter_1  Choose a letter:       1     a
letter_1  Choose a letter:       2     b
letter_1  Choose a letter:       3     c
letter_1  Choose a letter:       4     d
letter_2  Choose another letter  1     z
letter_2  Choose another letter  2     y
I managed to do it with this code:
# Init an empty DataFrame
df_format = pd.DataFrame()
# Iterate over the columns, step 2
for i in range(0, len(df.columns), 2):
    # Get the column names for the current col and col+1, since they're linked together
    col_i, col_ii = df.columns[i], df.columns[i] + 1
    # Concat codes from col and labels from col+1
    codes = pd.concat([df[col_i].loc[3:], df[col_ii].loc[3:]], axis=1).dropna()
    # Get the "Name", col+1 line 0
    name = df[col_ii].loc[0]
    # Get the title, col+1 line 1
    title = df[col_ii].loc[1]
    codes.loc[:, ['Name', 'Title']] = [name, title]
    codes.columns = ["Code", "Label", "Name", "Title"]
    df_format = pd.concat([df_format, codes])
But the question is: is there a more pythonic way to do this?
I assume there is, with pandas, but it sometimes breaks my brain.
Here is example data to use with pd.DataFrame:
[{1: 'Name', 2: 'letter_1', 3: 'Name', 4: 'letter_2'},
{1: 'Title', 2: 'Choose a letter:', 3: 'Title', 4: 'Choose another letter'},
{1: 1, 2: 'a', 3: 1, 4: 'z'},
{1: 2, 2: 'b', 3: 2, 4: 'y'},
{1: 3, 2: 'c', 3: np.nan, 4: np.nan},
{1: 4, 2: 'd', 3: np.nan, 4: np.nan}]
Thanks a lot for your help!
EDIT
This data comes from an Excel file; another sheet contains the answers given by some respondents, where the column names correspond to letter_1, letter_2, etc.
The sheet I am working on has about 15,000 columns, all ordered the same way as shown above.
I read it with pd.read_excel(file, header=None, sheetname="sheet1"), which is why I do not have easy-to-read column names (I dropped column 0).
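For reference, a slightly tidier sketch of the same pairwise walk over the sample dict above (assuming, as in that sample, that the data rows start at positional row 2):
import numpy as np
import pandas as pd

rows = [{1: 'Name', 2: 'letter_1', 3: 'Name', 4: 'letter_2'},
        {1: 'Title', 2: 'Choose a letter:', 3: 'Title', 4: 'Choose another letter'},
        {1: 1, 2: 'a', 3: 1, 4: 'z'},
        {1: 2, 2: 'b', 3: 2, 4: 'y'},
        {1: 3, 2: 'c', 3: np.nan, 4: np.nan},
        {1: 4, 2: 'd', 3: np.nan, 4: np.nan}]
df = pd.DataFrame(rows)

parts = []
for i in range(0, len(df.columns), 2):
    pair = df.iloc[:, i:i + 2]                # code column + its linked label column
    name, title = pair.iat[0, 1], pair.iat[1, 1]
    codes = pair.iloc[2:].dropna()            # data rows start at positional row 2 here
    codes.columns = ['Code', 'Label']
    parts.append(codes.assign(Name=name, Title=title))

df_format = pd.concat(parts, ignore_index=True)[['Name', 'Title', 'Code', 'Label']]
print(df_format)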
I have the DataFrame below and I want to pivot it so that the Name column values become multiple columns, with the values from the Data column as the values of those new columns.
Because the Data column holds all kinds of data, I don't get the required result when I pivot. Can someone please advise what I'm doing wrong?
import numpy as np
import pandas as pd
dict_d = {'Name': {0: 'Number', 1: 'Purpose', 2: 'Approver', 3: 'internal/external', 4: 'Name', 5: 'N Mnemonic'}, 'Data': {0: '123456', 1: 'BC', 2: np.nan, 3: 'internal', 4: np.nan, 5: 'xyz'}}
df = pd.DataFrame(dict_d)
df
Output:
Name Data
0 Number 123456
1 Purpose BC
2 Approver NaN
3 internal/external internal
4 Name NaN
5 N Mnemonic xyz
I've tried this:
df.pivot_table(columns='Name', values='Data', aggfunc=lambda x: ''.join(str(x)))
Name Approver N Mnemonic Name Number Purpose internal/external
Data NameData NameData NameData NameData NameData NameData
But I want the actual data values in row 1.
I think you need to convert Name to the index, select it as a one-column DataFrame with double brackets, and transpose:
df1 = df.set_index('Name')[['Data']].T
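For the sample frame above, this yields a single-row frame keyed by the former Name values:
print(df1)
Name  Number Purpose Approver internal/external Name N Mnemonic
Data  123456      BC      NaN          internal  NaN        xyz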
For the following dataframe:
df = pd.DataFrame({'Name': {0: "A", 1: "A", 2:"A", 3: "B"},
'Spec1': {0: '1', 1: '3', 2:'5',
3: '1'},
'Spec2': {0: '2a', 1: np.nan, 2:np.nan,
3: np.nan}
}, columns=['Name', 'Spec1', 'Spec2'])
Name Spec1 Spec2
0 A 1 2a
1 A 3 NaN
2 A 5 NaN
3 B 1 NaN
I would like to aggregate the columns into:
Name Spec
0 A 1,3,5,2a
1 B 1
Is there a more "pandas" way of doing this than just looping and keeping track of the values?
Or using melt:
(df.melt('Name')
   .groupby('Name')['value']
   .apply(lambda x: ','.join(x.dropna()))
   .reset_index()
   .rename(columns={'value': 'spec'}))
Out[2226]:
Name spec
0 A 1,3,5,2a
1 B 1
Another way
In [966]: (df.set_index('Name').unstack()
.dropna().reset_index()
.groupby('Name')[0].apply(','.join))
Out[966]:
Name
A 1,3,5,2a
B 1
Name: 0, dtype: object
Group rows by name, combine column values as a list, dropping NaN:
df = df.groupby('Name').agg(lambda x: list(x.dropna()))
Spec1 Spec2
Name
A [1, 3, 5] [2a]
B [1] []
Now merge Spec1 and Spec2 lists. Bring Name back as a column. Name the new Spec column.
df = (df.Spec1 + df.Spec2).reset_index().rename(columns={0:"Spec"})
Name Spec
0 A [1, 3, 5, 2a]
1 B [1]
Finally, convert Spec lists to string representations:
df.Spec = df.Spec.apply(','.join)
Name Spec
0 A 1,3,5,2a
1 B 1
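The same three steps can also be chained into one expression; a sketch, starting again from the original sample frame df:
out = (df.groupby('Name')
         .agg(lambda x: list(x.dropna()))   # lists per column, NaN removed
         .pipe(lambda d: d.Spec1 + d.Spec2) # merge the two lists per Name
         .apply(','.join)                   # lists to comma-separated strings
         .reset_index(name='Spec'))
print(out)
  Name      Spec
0    A  1,3,5,2a
1    B         1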
I have a dataframe with the following header:
id, type1, ..., type10, location1, ..., location10
and I want to convert it as follows:
id, type, location
I managed to do this using nested for loops but it's very slow:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())
new_index = 0
for index, row in data.iterrows():
    ID = row["ID"]
    for i in range(1, 11):
        # note: comparing with == np.nan is always False; use pd.isnull instead
        if pd.isnull(row["type" + str(i)]):
            continue
        new_row = pd.Series([ID, row["type" + str(i)], row["location" + str(i)]])
        new_format_dataframe.loc[new_index] = new_row.values
        new_index += 1
Any suggestions for improvement using native pandas features?
You can use lreshape:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
Sample:
import pandas as pd
df = pd.DataFrame({
'type1': {0: 1, 1: 4},
'id': {0: 'a', 1: 'a'},
'type10': {0: 1, 1: 8},
'location1': {0: 2, 1: 9},
'location10': {0: 5, 1: 7}})
print (df)
id location1 location10 type1 type10
0 a 2 5 1 1
1 a 9 7 4 8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
Another solution with double melt:
print (pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
pd.melt(df, value_vars=location, value_name='Location')], axis=1)
.drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
EDIT:
lreshape is now undocumented, and it is possible it will be removed in the future (along with pd.wide_to_long).
A possible solution is to merge all three functions into one - maybe melt, but it is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
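In the meantime, a hedged sketch of the pd.wide_to_long route on the sample frame above (same caveat: it, too, may eventually go away). wide_to_long needs a unique row key, so the index is promoted to a column first:
df2 = df.reset_index()
long_df = (pd.wide_to_long(df2, stubnames=['type', 'location'],
                           i='index', j='num')
             .reset_index()[['id', 'type', 'location']])
print(long_df)
# expected output (row order may differ by pandas version):
  id  type  location
0  a     1         2
1  a     4         9
2  a     1         5
3  a     8         7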