I have this dataset:
Field
A
A
A
B
C
C
C
D
C
C
C
A
This has been read into pandas through the following code:
import pandas as pd
data = pd.read_csv('data.csv', header=None)
print(data.describe())
How can I transform the column to get the below result?
Field
A
A
A
Others
C
C
C
Others
C
C
C
A
I want to transform values B and D, since they have low frequency, to an aggregate value "Others".
Here is one way:
import pandas as pd
df = pd.DataFrame({'Field': ['A', 'A', 'A', 'B', 'C', 'C', 'C',
'D', 'C', 'C', 'C', 'C', 'A']})
n = 2
counts = df['Field'].value_counts()
others = set(counts[counts < n].index)
df['Field'] = df['Field'].replace(list(others), 'Others')
Result
Field
0 A
1 A
2 A
3 Others
4 C
5 C
6 C
7 Others
8 C
9 C
10 C
11 C
12 A
Explanation
First get the counts of each value in Field via value_counts.
Filter for values that occur fewer than n times; n is user-configurable.
Finally replace those values with 'Others'.
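The three steps above can also be wrapped in a small helper. This is only a sketch and not part of the original answer: the name lump_rare and its parameters are hypothetical, and it uses Series.where with a mapped frequency instead of replace.

```python
import pandas as pd

def lump_rare(series, n=2, label="Others"):
    """Hypothetical helper: replace values seen fewer than n times with label."""
    counts = series.value_counts()
    # Map every row to the frequency of its value, then keep only
    # the rows whose value occurs at least n times.
    return series.where(series.map(counts) >= n, label)

# The twelve values from the question
df = pd.DataFrame({"Field": ["A", "A", "A", "B", "C", "C", "C",
                             "D", "C", "C", "C", "A"]})
df["Field"] = lump_rare(df["Field"], n=2)
print(df["Field"].tolist())
```

Raising n lumps more categories together, so it is worth checking value_counts first to pick a sensible threshold.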
Related
I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:
Index
One
Two
Three
Four
1
a
b
d
c
2
b
b
d
d
3
a
b
d
4
c
b
c
d
5
a
b
c
g
6
a
b
c
7
a
s
c
f
8
a
f
c
9
a
b
10
a
b
t
d
11
a
b
g
...
...
...
...
...
100
a
b
c
d
My goal is to filter for the rows with the most matches against the list in the corresponding position (e.g. position 1 in the list has to match column 1, position 2 has to match column 2, etc.).
In this specific case, excluding row 100, rows 5 and 6 would be selected, since they match 'a', 'b' and 'c'. If row 100 were included, row 100 and any other rows matching all elements would be selected instead.
Also the list might change in length e.g. list_to_match = ['a','b'].
Thanks for your help!
I would use:
list_to_match = ['a','b','c','d']
# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]
print(out)
Output (ignoring the input row 100):
One Two Three Four
Index
5 a b c g
6 a b c None
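To see why cummin is used, here is a minimal runnable sketch on a hypothetical subset of the question's rows (index values 1, 5, 6 and 7). The function name best_prefix_matches is an assumption for illustration. The cumulative minimum turns every match that follows a mismatch into 0, so the score counts only the leading run of matching columns.

```python
import pandas as pd

# Hypothetical subset of the question's rows
df = pd.DataFrame(
    {"One": ["a", "a", "a", "a"],
     "Two": ["b", "b", "b", "s"],
     "Three": ["d", "c", "c", "c"],
     "Four": ["c", "g", None, "f"]},
    index=pd.Index([1, 5, 6, 7], name="Index"),
)

def best_prefix_matches(frame, values):
    """Return the rows with the longest run of leading positional matches."""
    # Compare only as many columns as there are values to match
    mask = frame.iloc[:, :len(values)].eq(values)
    # cummin zeroes out everything after the first mismatch in each row
    score = mask.astype(int).cummin(axis=1).sum(axis=1)
    return frame[score.eq(score.max())]

print(best_prefix_matches(df, ["a", "b", "c", "d"]))
print(best_prefix_matches(df, ["a", "b"]))
```

Row 7 matches 'a' and 'c' but not 'b', so its score is 1 rather than 2. Because the slice uses len(values), the same function handles a shorter list such as ['a', 'b'].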
Here is my approach. Descriptions are commented below.
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
data = {'One': ['a', 'a', 'a', 'a'],
'Two': ['b', 'b', 'b', 'b'],
'Three': ['c', 'c', 'y', 'c'],
'Four': ['g', 'g', 'z', 'd']}
dataframe_ = pd.DataFrame(data)
#encoding Letters into numerical values so we can compute the cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32)-64
#Our input data which we are going to compare with other rows
input_data = np.array(['a', 'b', 'c', 'd'])
#encode input data into numerical values
input_data = input_data.astype('<U1').view(np.uint32)-64
#compute cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)
output:
0 0.999343
1 0.999343
2 0.973916
3 1.000000
Filtering rows based on their cosine similarities:
# dataframe_out is a Series, so mask it directly; rows at or below the threshold become NaN
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0 0.999343
1 0.999343
2 NaN
3 1.000000
From here on you can easily find the rows with non-NaN values by their indexes.
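As a sketch of that last step, the labels of the non-NaN rows can be read off with dropna. The Series below is rebuilt by hand from the filtered output shown above, so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Similarity scores copied from the filtered output above
df_filtered = pd.Series([0.999343, 0.999343, np.nan, 1.000000])

# dropna removes the NaN rows; the remaining index holds the kept row labels
kept_rows = df_filtered.dropna().index.tolist()
print(kept_rows)
```

Those labels can then be passed to .loc on the original DataFrame to recover the matching rows.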
Hey I have two different lists:
One is list of the strings:
['A',
'B',
'C',
'D',
'E']
Second list contains floats:
[(-0.07154222477384509, 0.03681057318023705),
(-0.23678194754416643, 3.408617573881597e-12),
(-0.24277881018771763, 6.991906304566735e-13),
(-0.16858465905189185, 7.569580517034595e-07),
(-0.21850787663602167, 1.1718560531238815e-10)]
I want to have one DataFrame with three columns that looks like this:
var_name val1 val2
A -0.07154222477384509 0.03681057318023705
Ideally the new DataFrame would not show scientific notation, and I don't want the values stored as strings.
Use a list comprehension with zip to build a list of tuples and pass it to the DataFrame constructor:
a = ['A',
'B',
'C',
'D',
'E']
b = [(-0.07154222477384509, 0.03681057318023705),
(-0.23678194754416643, 3.408617573881597e-12),
(-0.24277881018771763, 6.991906304566735e-13),
(-0.16858465905189185, 7.569580517034595e-07),
(-0.21850787663602167, 1.1718560531238815e-10)]
df = pd.DataFrame([(a, *b) for a, b in zip(a,b)])
print (df)
0 1 2
0 A -0.071542 3.681057e-02
1 B -0.236782 3.408618e-12
2 C -0.242779 6.991906e-13
3 D -0.168585 7.569581e-07
4 E -0.218508 1.171856e-10
With set columns names:
df = pd.DataFrame([(a, *b) for a, b in zip(a,b)],
columns=['var_name','val1','val2'])
print (df)
var_name val1 val2
0 A -0.071542 3.681057e-02
1 B -0.236782 3.408618e-12
2 C -0.242779 6.991906e-13
3 D -0.168585 7.569581e-07
4 E -0.218508 1.171856e-10
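The question also asked to avoid scientific notation without turning the values into strings. One possible approach, sketched here under the assumption that only the display matters, is to format the floats when rendering and leave the stored values untouched. The names names, pairs and rendered are placeholders, and only the first two rows of the data are used.

```python
import pandas as pd

names = ["A", "B"]
pairs = [(-0.07154222477384509, 0.03681057318023705),
         (-0.23678194754416643, 3.408617573881597e-12)]

df = pd.DataFrame([(name, *pair) for name, pair in zip(names, pairs)],
                  columns=["var_name", "val1", "val2"])

# Format only while rendering; the columns themselves stay float64
rendered = df.to_string(float_format=lambda x: f"{x:.15f}")
print(rendered)
print(df.dtypes)
```

To apply the same formatting to every print in a session, pd.set_option('display.float_format', ...) accepts the same kind of callable.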
I am trying to find the number of people of a certain group who appear in other groups. For instance, here is the Pandas dataframe:
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
Which looks like this:
Ash: a
Psyduck: b
Pikachu: c
Charizard: b
Ash: b
Psyduck: a
I am trying to create a cross tabulation that looks like the following:
a b c
a 2 2 0
b 2 3 0
c 0 0 1
Essentially, this cross tab shows how many members of group x are also members of group y. For example, there are 2 people who are in both group a and group b, so there is a 2 at the intersection of that row and column.
I have used Pandas cross tab function but it doesn't give the result that I am looking for.
import pandas as pd
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
df = pd.DataFrame(d)
df = df.merge(df, on='name')
print(
pd.crosstab(df.group_x, df.group_y)
)
Output:
group_y a b c
group_x
a 2 2 0
b 2 3 0
c 0 0 1
Demo: https://repl.it/#alexmojaki/TragicFrigidConditions
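The self-merge works because it pairs every group a name belongs to with every other group of the same name, and crosstab then counts those pairs. An equivalent formulation, given here as a sketch and not as the answerer's method, builds a name-by-group membership table and multiplies it by its own transpose. It assumes each (name, group) pair occurs at most once; the names membership and overlap are placeholders.

```python
import pandas as pd

d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'],
     'group': ['a', 'b', 'c', 'b', 'b', 'a']}
df = pd.DataFrame(d)

# Rows are names, columns are groups, entries are 0/1 membership flags
membership = pd.crosstab(df['name'], df['group'])

# (groups x names) dot (names x groups) counts names shared by each group pair
overlap = membership.T.dot(membership)
print(overlap)
```

The diagonal holds the size of each group, and the off-diagonal entries hold the number of shared members.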
I have a Series and a list like this
import pandas as pd
s = pd.Series(data=[1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
filter_list = ['A', 'C', 'D']
print(s)
A 1
B 2
C 3
D 4
How can I create a new Series with row B removed using s and filter_list?
I mean I want to create a Series new_s with the following content
print(new_s)
A 1
C 3
D 4
s.isin(filter_list) doesn't work, because I want to filter on the index of the Series, not on its values.
Use Series.loc if all values of the list exist in the index:
new_s = s.loc[filter_list]
print (new_s)
A 1
C 3
D 4
dtype: int64
If some values might not exist in the index, use Index.intersection or isin, as in Yusuf Baktir's solution:
filter_list = ['A', 'C', 'D', 'E']
new_s = s.loc[s.index.intersection(filter_list)]
print (new_s)
A 1
C 3
D 4
dtype: int64
Another alternative with numpy.isin (the older numpy.in1d is deprecated as of NumPy 2.0):
import numpy as np
filter_list = ['A', 'C', 'D', 'E']
new_s = s[np.isin(s.index, filter_list)]
print (new_s)
A 1
C 3
D 4
dtype: int64
Basically, those are the index values. So, filtering on index will work
s[s.index.isin(filter_list)]
for i in filter_list:
print(i,s[i])
A 1
C 3
D 4
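For the index-filtering answers above, the practical difference between .loc and isin shows up when the list contains a label that is missing from the index. The following sketch assumes a recent pandas version (1.0 or later), where .loc with a list raises KeyError on any missing label; the variable loc_failed is only there for illustration.

```python
import pandas as pd

s = pd.Series(data=[1, 2, 3, 4], index=["A", "B", "C", "D"])
filter_list = ["A", "C", "D", "E"]  # "E" is not in the index

# .loc with a list requires every label to be present
try:
    s.loc[filter_list]
    loc_failed = False
except KeyError:
    loc_failed = True

# A boolean mask on the index silently ignores missing labels
new_s = s[s.index.isin(filter_list)]
print(loc_failed, new_s.to_dict())
```

The mask keeps the original order of the Series, whereas .loc returns rows in the order of the list.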
List1 = [[1,A,!,a],[2,B,#,b],[7,C,&,c],[1,B,#,c],[4,D,#,p]]
The output should look like this: each column should contain one element from every sublist, for example
column1:[1,2,7,1,4]
column2:[A,B,C,B,D]
column3:[!,#,&,#,#]
column4:[a,b,c,c,p]
all in the same DataFrame.
Assuming that you actually meant for List1 to be this (all elements are strings):
list1 = [["1","A","!","a"],["2","B","#","b"],["7","C","&","c"],["1","B","#","c"],["4","D","#","p"]]
I don't think that you need to do anything except pass list1 to the DataFrame constructor. There are several ways to pass information to a DataFrame. Passing a list of lists produces unnamed (integer-labelled) columns.
print(pd.DataFrame(list1))
0 1 2 3
0 1 A ! a
1 2 B # b
2 7 C & c
3 1 B # c
4 4 D # p
Given the list below:
l = [['1', 'A', '!', 'a'], ['2', 'B', '#', 'b'], ['7', 'C', '&', 'c'], ['1', 'B', '#', 'c'], ['4', 'D', '#', 'p']]
You can use pandas.DataFrame to convert it as below:
import pandas as pd
pd.DataFrame(l, columns=['c1', 'c2', 'c3', 'c4'])
# columns parameter for passing customized column names
Result:
c1 c2 c3 c4
0 1 A ! a
1 2 B # b
2 7 C & c
3 1 B # c
4 4 D # p
As commented (and illustrated by John L.'s answer), pandas.DataFrame should be sufficient. If what you actually want is a transposed DataFrame, transpose it explicitly:
import pandas as pd
df = pd.DataFrame(List1).T
Or beforehand using zip:
df = pd.DataFrame(list(zip(*List1)))
Both return:
0 1 2 3 4
0 1 2 7 1 4
1 A B C B D
2 ! # & # #
3 a b c c p
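To check the result against the column lists requested in the question, the sketch below names the columns, converts the first one to integers, and reads each column back as a plain list. The column names and the variable columns_as_lists are assumptions taken from the question's wording; the cast to int is optional.

```python
import pandas as pd

list1 = [["1", "A", "!", "a"], ["2", "B", "#", "b"], ["7", "C", "&", "c"],
         ["1", "B", "#", "c"], ["4", "D", "#", "p"]]

df = pd.DataFrame(list1, columns=["column1", "column2", "column3", "column4"])

# The first element of each sublist is numeric text; cast it if numbers are wanted
df["column1"] = df["column1"].astype(int)

# One list per column, keyed by column name
columns_as_lists = df.to_dict(orient="list")
print(columns_as_lists)
```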