I have a dataframe like:
df1
right         left
[a, b]        [c, d, e, f]
[b, c]        [a, d, e, f]
[c, d, e, f]  [a, b]
Lines 1 and 3 are essentially the same (right and left are swapped) and I want to remove such duplicates.
Is there any way to do this?
The data is structured in this way only.
I tried running the command below, which I found elsewhere, but since the values are lists, it throws an error:
df1.duplicated(subset=['right', 'left'], keep=False)
TypeError: unhashable type: 'list'
Create a hashable tuple key from both columns, sorting the pair of lists inside a list comprehension so that swapped rows get the same key, and test for duplicates with Series.duplicated:
L = [tuple(map(tuple, sorted(x))) for x in df1[['right', 'left']].to_numpy()]
m = pd.Series(L, index=df1.index).duplicated(keep=False)
print(m)
0 True
1 False
2 True
dtype: bool
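To actually drop the later duplicate while keeping the first occurrence, build the mask with keep='first' instead and filter with its negation; a minimal, self-contained sketch:

import pandas as pd

df1 = pd.DataFrame({'right': [['a', 'b'], ['b', 'c'], ['c', 'd', 'e', 'f']],
                    'left':  [['c', 'd', 'e', 'f'], ['a', 'd', 'e', 'f'], ['a', 'b']]})

# Order-independent, hashable key per row: sort the two lists,
# then convert each list to a tuple.
L = [tuple(map(tuple, sorted(x))) for x in df1[['right', 'left']].to_numpy()]

# keep='first' marks only the later occurrences as duplicates.
m = pd.Series(L, index=df1.index).duplicated(keep='first')
print(df1[~m])  # rows 0 and 1 remain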
I am writing a custom error message for when two pandas Series are not equal and want to use '<' to point at the differences.
Here's the workflow for a failed equality:
Convert both lists to pandas Series: pd.Series(list)
Compare side by side in a dataframe: table = pd.concat([series1, series2], axis=1)
Add column and index names: table.columns = ['...', '...']; table.index = ['...', '...']
Current output:
|Yours|Actual|
|-|-|
|1|1|
|2|2|
|4|3|
Desired output:
|Yours|Actual|-|
|-|-|-|
|1|1||
|2|2||
|4|3|<|
The naive solution is to iterate over each list index, append '<' to another list whenever the values are not equal, and then pass that list to pd.concat(), but I am looking for a method using pandas. For example:
error_series = '<' if (abs(yours - actual) >= 1).all(axis=None) else ''
Ideally it would append '<' where the difference between the results is greater than the margin of error of 1, and append nothing otherwise.
You can create the DataFrame and set the column names in one line:
import pandas as pd
list1 = [1,2,4]
list2 = [1,2,10]
df = pd.DataFrame(zip(list1, list2), columns=['Yours', 'Actual'])
Create a boolean mask to find the rows that have a too large difference:
margin_of_error = 1
mask = df.diff(axis=1)['Actual'].abs()>margin_of_error
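With this sample data, df.diff(axis=1) replaces Actual with Actual - Yours (and Yours with NaN), so the mask works out to:

0    False
1    False
2     True
Name: Actual, dtype: bool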
Add a column to the DF by mapping the mask's booleans to the markers you want (a single map avoids chained inplace replace calls, which can misbehave on a column slice):
df['too_different'] = mask.map({True: '<', False: ''})
output:
Yours Actual too_different
0 1 1
1 2 2
2 4 10 <
or, starting again from the original two-column DF, you can do it in one step:
df = df.assign(diffr=df.apply(lambda x: '<'
                              if abs(x['Yours'] - x['Actual']) >= 1
                              else '', axis=1))
print(df)

   Yours  Actual diffr
0      1       1
1      2       2
2      4      10     <
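A vectorized alternative, sketched with NumPy's where (reusing margin_of_error from above), avoids the row-wise apply entirely:

import numpy as np

df['diffr'] = np.where(df['Yours'].sub(df['Actual']).abs() > margin_of_error, '<', '')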
My goal is to get a list object, ['assetCode', 'assetName'], whose contents are the labels of a pandas Series, retrieved based on more than one condition. I tried:
tmp3 = datatype[datatype == 'object' | datatype == 'category'].index # extract label from Pandas.series
This gives the error: TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
However, while less elegant, I was able to find the following two working solutions:
tmp2 = datatype[datatype == 'object'].index # extract label from Pandas.series
tmp2[0]
'assetCode'
tmp1 = datatype[datatype == 'category'].index # extract label from Pandas.series
tmp1[0]
'assetName'
How do I combine these two strings into a list object? Is there a better way to achieve that goal than the way I am trying to do it?
Setup
df
A B C
0 8 4 2
1 8 8 6
2 8 5 2
datatype = df.dtypes
datatype
A object
B category
C int64
dtype: object
It looks like you are trying to select object and categorical columns from some DataFrame (not shown here). To fix your code, use:
tmp3 = datatype[(datatype == 'object') | (datatype == 'category')].index.tolist()
tmp3
# ['A', 'B']
Since the bitwise operator | binds more tightly than ==, you need parentheses around each comparison before ORing the masks. After that, the indexing works fine.
To get a list, call .index.tolist().
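An equivalent way to build the mask, sketched with isin (converting the dtypes to their string names first, so string membership testing works):

mask = datatype.astype(str).isin(['object', 'category'])
tmp3 = datatype[mask].index.tolist()
# ['A', 'B']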
Another solution is select_dtypes:
df.select_dtypes(include=['object', 'category'])
A B
0 8 4
1 8 8
2 8 5
df.select_dtypes(include=['object', 'category']).columns.tolist()
# ['A', 'B']
This circumvents the need for an intermediate datatype series.
I have a dataframe containing one column of lists.
names                                unique_values
[B-PER, I-PER, I-PER, B-PER]         2
[I-PER, N-PER, B-PER, I-PER, A-PER]  4
[B-PER, A-PER, I-PER]                3
[B-PER, A-PER, A-PER, A-PER]         2
I have to count each distinct value in the column of lists; if a value appears more than once, it should be counted only once. How can I achieve this?
Thanks
Combine explode with a group-wise nunique: explode expands each list into one row per element while repeating the index, so grouping by that index counts the distinct elements of each original list:
df["unique_values"] = df.names.explode().groupby(level=0).nunique()
You can use the built-in set data type to do this:
df['unique_values'] = df['names'].apply(lambda a : len(set(a)))
This works because sets do not allow duplicate elements, so converting a list to a set strips all duplicates; all that remains is to take the length of the resulting set.
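For instance, with the first row:

len(set(['B-PER', 'I-PER', 'I-PER', 'B-PER']))  # 2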
To ignore NaN values in a list you can do the following:
df['unique_values'] = df['names'].apply(lambda a : len([x for x in set(a) if str(x) != 'nan']))
Try:
df["unique_values"] = df.names.explode().groupby(level = 0).unique().str.len()
Output
df
names unique_values
0 [B-PER, I-PER, I-PER, B-PER] 2
1 [I-PER, N-PER, B-PER, I-PER, A-PER] 4
2 [B-PER, A-PER, I-PER] 3
3 [B-PER, A-PER, A-PER, A-PER] 2
How can I get the values of one column of a CSV file by matching attributes in another column?
The CSV file would look like this:
One,Two,Three
x,car,5
x,bus,7
x,car,9
x,car,6
I only want to get the values of column Three where column Two has the value "car". I do not want them summed up; rather, they should be printed as a list, like this:
5
9
6
My approach looks like this, but doesn't really work:
import pandas as pd
df = pd.read_csv(r"example.csv")
ITEMS = ['car']  # I will need more items; this is just an example
for item in df.Two:
    if item in ITEMS:
        print(df.Three)
How can I get the exact value for a matched item?
In one line you can do it like this:
print(df['Three'][df['Two']=='car'].values)
Output:
[5 9 6]
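If you want a plain Python list rather than a NumPy array, append .tolist():

print(df['Three'][df['Two'] == 'car'].values.tolist())
# [5, 9, 6]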
For multiple items try:
df = pd.DataFrame({'One': ['x', 'x', 'x', 'x', 'x'],
                   'Two': ['car', 'bus', 'car', 'car', 'jeep'],
                   'Three': [5, 7, 9, 6, 10]})
myitems = ['car', 'bus']
res_list = []
for item in myitems:
    res_list += df['Three'][df['Two'] == item].values.tolist()
print(*sorted(res_list), sep='\n')
Output:
5
6
7
9
Explanation
df['Two']=='car' returns a boolean Series with True at the row positions where the value in column Two of df is car
.values gets these booleans as a numpy.ndarray; for the original four-row CSV the result would be [True False True True]
We can filter the values in column Three by using this array of booleans like so: df['Three'][<boolean_array>]
To combine the resulting arrays we convert each numpy.ndarray to a Python list using tolist() and append it to res_list
Then we use sorted to sort res_list
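A loop-free alternative for multiple items, sketched with isin:

print(*sorted(df['Three'][df['Two'].isin(myitems)].tolist()), sep='\n')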
I have a long series like the following:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
In [151]: series
Out[151]:
0 [(1, 2)]
1 [(3, 5)]
2 []
3 [(3, 5)]
dtype: object
I want to remove all entries with an empty list. For some reason, boolean indexing does not work.
The following tests both give the same error:
series == [[(1,2)]]
series == [(1,2)]
ValueError: Arrays were different lengths: 4 vs 1
This is very strange, because in the simple example below, the same kind of comparison works:
In [146]: pd.Series([1,2,3]) == [3]
Out[146]:
0 False
1 False
2 True
dtype: bool
P.S. ideally, I'd like to split the tuples in the series into a DataFrame of two columns also.
You could check to see if the lists are empty using str.len():
series.str.len() == 0
and then use this boolean series to remove the rows containing empty lists.
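For example, keeping only the non-empty rows (the .str accessor works here because the Series has object dtype):

series = series[series.str.len() != 0]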
If each of your entries is a list containing a two-tuple (or else empty), you could create a two-column DataFrame by using the str accessor twice (once to select the first element of the list, then to access the elements of the tuple):
pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
Missing entries default to NaN with this method.
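With the sample series above, the result should look like this (the empty list in row 2 becomes NaN, which also makes the columns float):

     a    b
0  1.0  2.0
1  3.0  5.0
2  NaN  NaN
3  3.0  5.0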
Using the built-in apply you can filter by the length of the list:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
series = series[series.apply(len) > 0]
Your series is in a bad state -- having a Series of lists of tuples of ints
buries the useful data, the ints, inside too many layers of containers.
However, to form the desired DataFrame, you could use
df = series.apply(lambda x: pd.Series(x[0]) if x else pd.Series(dtype=float)).dropna().astype(int)
which yields
   0  1
0  1  2
1  3  5
3  3  5
(Note that dropna keeps the original index, so the row labels are 0, 1 and 3, and astype(int) restores the integer dtype that the intermediate NaNs turned into float.)
A better way would be to avoid building the malformed series altogether and
form df directly from the data:
data = [[(1,2)],[(3,5)],[],[(3,5)]]
data = [pair for row in data for pair in row]
df = pd.DataFrame(data)
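which, for this data, gives

   0  1
0  1  2
1  3  5
2  3  5

and you can then name the columns as you like, e.g. df.columns = ['a', 'b'].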