I have a dataframe having categorical variables. I want to convert them to the numerical using the following logic:
I have 2 lists one contains the distinct categorical values in the column and the second list contains the values for each category. Now i need to map these values in place of those categorical values.
For Eg:
List_A = ['A','B','C','D','E']
List_B = [3,2,1,1,2]
I need to replace A with 3, B with 2, C and D with 1 and E with 2.
Is there any way to do this in Python.
I can do this by applying multiple for loops but I am looking for some easier way or some direct function if there is any.
Any help is very much appreciated, Thanks in Advance.
Create a mapping dict
List_A = ['A','B','C','D','E',]
List_B = [3,2,1,1,2]
d=dict(zip(List_A, List_B))
new_list=['A','B','C','D','E','A','B']
new_mapped_list=[d[v] for v in new_list if v in d]
new_mapped_list
Or define a function and use map
List_A = ['A','B','C','D','E',]
List_B = [3,2,1,1,2]
d=dict(zip(List_A, List_B))
def mapper(value):
if value in d:
return d[value]
return None
new_list=['A','B','C','D','E','A','B']
map(mapper,new_list)
Suppose df is your data frame and "Category" is the name of the column holding your categories:
df[df.Category == "A"] = 3,2, 1, 1, 2
df[(df.Category == "B") | (df.Category == "E") ] = 2
df[(df.Category == "C") | (df.Category == "D") ] = 1
If you only need to replace values in one list with the values of other and the structure is like the one you say. Two list, same lenght and same position, then you only need this:
list_a = []
list_a = list_b
A more convoluted solution would be like this, with a function that will create a dictionary that you can use on other lists:
# we make a function
def convert_list(ls_a,ls_b):
dic_new = {}
for letter,number in zip(ls_a,ls_b):
dic_new[letter] = number
return dic_new
This will make a dictionary with the combinations you need. You pass the two list, then you can use that dictionary on other list:
List_A = ['A','B','C','D','E']
List_B = [3,2,1,1,2]
dic_new = convert_list(ls_a, ls_b)
other_list = ['a','b','c','d']
for _ in other_list:
print(dic_new[_.upper()])
# prints
3
2
1
1
cheers
You could use a solution from machine learning scikit-learn module.
OneHotEncoder
LabelEncoder
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
The pandas "hard" way:
https://stackoverflow.com/a/29330853/9799449
Related
How to create python dictionary using the data below
Df1:
Id
mail-id
1
xyz#gm
1
ygzbb
2.
Ghh.
2.
Hjkk.
I want it as
{1:[xyz#gm,ygzbb], 2:[Ghh,Hjkk]}
Something like this?
data = [
[1, "xyz#gm"],
[1, "ygzbb"],
[2, "Ghh"],
[2, "Hjkk"],
]
dataDict = {}
for k, v in data:
if k not in dataDict:
dataDict[k] = []
dataDict[k].append(v)
print(dataDict)
One option is to groupby the Id column and turn the mail-id into a list in a dictionary comprehension:
{k:v["mail-id"].values.tolist() for k,v in df.groupby("Id")}
One option is to iterate over the set version of the ids and check one by one:
>>> _d = {}
>>> df = pd.DataFrame({"Id":[1,1,2,2],"mail-id":["xyz#gm","ygzbb","Ghh","Hjkk"]})
>>> for x in set(df["Id"]):
... _d.update({x:df[df["id"]==x]["mail_id"]})
But it's much faster to use dictionary comprehension and builtin pandas DataFrame.groupby; a quick look from the Official Documentation:
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
as #fsimonjetz pointed out, this code will be sufficent:
>>> df = pd.DataFrame({"Id":[1,1,2,2],"mail-id":["xyz#gm","ygzbb","Ghh","Hjkk"]})
>>> {k:v["mail-id"].values.tolist() for k,v in df.groupby("Id")}
You can do:
df.groupby('Id').agg(list).to_dict()['mail-id']
Output:
{1: ['xyz#gm', 'ygzbb'], 2: ['Ghh.', 'Hjkk.']}
I have a variable that consists of the list after list after list
my code:
>>> text = File(txt) #creates text object from text name
>>> names = text.name_parser() #invokes parser method to extract names from text object
My name_parser() stores names into a list self.names=[]
example:
>>> variable = my_method(txt)
output:
>>> variable
>>> [jacob, david], [jacob, hailey], [judy, david], ...
I want to make them into single list while retaining the duplicate values
desired output:
>>> [jacob, david, jacob, hailey, judy, david, ...]
(edited)
(edited)
Here's a very simple approach to this.
variable = [['a','b','c'], ['d','e','f'], ['g','h','i']]
fileNames = ['one.txt','two.txt','three.txt']
dict = {}
count = 0
for lset in variable:
for letters in lset:dict[letters] = fileNames[count]
count += 1
print(dict)
I hope this helps
#!/usr/bin/python3
#function to iterate through the list of dict
def fun(a):
for i in a:
for ls in i:
f = open(ls)
for x in f:
print(x)
variable ={ "a": "text.txt", "b": "text1.txt" , "c":"text2.txt" , "d": "text3.txt"}
myls = [variable["a"], variable["b"]], [variable["c"], variable["d"]]
fun(myls)
print("Execution Completed")
You can use itertools module that will allow to transform your list of lists into a flat list:
import itertools
foo = [v for v in itertools.chain.from_iterable(variable)]
After that you can iterate over the new variable however you like.
Well, if your variable is list of lists, then you can try something like this:
file_dict = {}
for idx, files in enumerate(variable):
# you can create some dictionary to bind indices to words
# or use any library for this, I believe there are few
file_name = f'{idx+1}.txt'
for file in files:
file_dict[file] = [file_name]
Consider the following snippet:
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
# 20,00,000 such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
Given a huge dataframe and 2 lists, I want to count the number of elements in new_list that are in the same in the dataframe. In the above pseudo example, The result would be 3 as: "aaa-fff", "ccc-ggg", & "ddd-ccc" are in the same row of the dataframe.
Right now, I am using a linear search algorithm but it is very slow as I have to scan through the entire dataframe.
df['col3']=df['col1']+"-"+df['col2']
for a in list_a:
c1 = 0
for b in list_b:
str1=a+"-"+b
str2=b+"-"+a
str1=a+"-"+b
c2 = (df['col3'].str.contains(str1).sum())+(df['col3'].str.contains(str2).sum())
c1+=c2
return c1
Can someone kindly help me implement a faster algorithm preferably with a dictionary data structure?
Note: I have to iterate through the 7,000 rows of another dataframe and create the 2 lists dynamically, and get an aggregate count for each row.
Here is another way. First, I used your definition of df (with 2 columns), list_a and list_b.
# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']
# create set with list_a and list_b pairs
s = ({ f'{a}-{b}' for a, b in zip(list_a, list_b)} |
{ f'{b}-{a}' for a, b in zip(list_a, list_b)})
# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)
3
{'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
UPDATE to handle duplicate values.
# build list (not set) from list_a and list_b
idx = ([ f'{a}-{b}' for a, b in zip(list_a, list_b) ] +
[ f'{b}-{a}' for a, b in zip(list_a, list_b) ])
# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()
# use idx to sub-select from to value counts:
tmp[ tmp.index.isin(idx) ]
# results:
ddd-ccc 1
aaa-fff 1
ccc-ggg 1
Name: col3, dtype: int64
Try this:
from itertools import product
# all combinations of the two lists as tuples
all_list_combinations = list(product(list_a, list_b))
# tuples of the two columns
dftuples = [x for x in df.itertuples(index=False, name=None)]
# take the length of hte intersection of the two sets and print it
print(len(set(dftuples).intersection(set(all_list_combinations))))
yields
3
First join the columns before looping, then instead of looping pass an optional regex to contains with all possible strings.
joined = df.col1+ '-' + df.col2
pat = '|'.join([f'({a}-{b})' for a in list_a for b in list_b] +
[f'({b}-{a})' for a in list_a for b in list_b]) # substitute for itertools.product
ct = joined.str.contains(pat).sum()
To work with dicts instead of dataframes, you can use filter(re, joined) as in this question
import re
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
### build the regex pattern
pat_set = set('-'.join(combo) for combo in set(
list(itertools.product(list_a, list_b)) +
list(itertools.product(list_b, list_a))))
pat = '|'.join(pat_set)
# use itertools to generalize with many colums, remove duplicates with set()
### join the columns row-wise
joined = ['-'.join(row) for row in zip(*[vals for key, vals in data.items()])]
### filter joined
match_list = list(filter(re.compile(pat).match, joined))
ct = len(match_list)
Third option with series.isin() inspired by jsmart's answer
joined = df.col1 + '-' + df.col2
ct = joined.isin(pat_set).sum()
Speed testing
I repeated data 100,000 times for scalability testing. series.isin() takes the day, while jsmart's answer is fast but does not find all occurrences because it removes duplicates from joined
with dicts: 400000 matches, 1.00 s
with pandas: 400000 matches, 1.77 s
with series.isin(): 400000 matches, 0.39 s
with jsmart answer: 4 matches, 0.50 s
I'm trying to get the matching IDs and store the data into one list. I have a list of dictionaries:
list = [
{'id':'123','name':'Jason','location': 'McHale'},
{'id':'432','name':'Tom','location': 'Sydney'},
{'id':'123','name':'Jason','location':'Tompson Hall'}
]
Expected output would be something like
# {'id':'123','name':'Jason','location': ['McHale', 'Tompson Hall']},
# {'id':'432','name':'Tom','location': 'Sydney'},
How can I get matching data based on dict ID value? I've tried:
for item in mylist:
list2 = []
row = any(list['id'] == list.id for id in list)
list2.append(row)
This doesn't work (it throws: TypeError: tuple indices must be integers or slices, not str). How can I get all items with the same ID and store into one dict?
First, you're iterating through the list of dictionaries in your for loop, but never referencing the dictionaries, which you're storing in item. I think when you wrote list[id] you mean item[id].
Second, any() returns a boolean (true or false), which isn't what you want. Instead, maybe try row = [dic for dic in list if dic['id'] == item['id']]
Third, if you define list2 within your for loop, it will go away every iteration. Move list2 = [] before the for loop.
That should give you a good start. Remember that row is just a list of all dictionaries that have the same id.
I would use kdopen's approach along with a merging method after converting the dictionary entries I expect to become lists into lists. Of course if you want to avoid redundancy then make them sets.
mylist = [
{'id':'123','name':['Jason'],'location': ['McHale']},
{'id':'432','name':['Tom'],'location': ['Sydney']},
{'id':'123','name':['Jason'],'location':['Tompson Hall']}
]
def merge(mylist,ID):
matches = [d for d in mylist if d['id']== ID]
shell = {'id':ID,'name':[],'location':[]}
for m in matches:
shell['name']+=m['name']
shell['location']+=m['location']
mylist.remove(m)
mylist.append(shell)
return mylist
updated_list = merge(mylist,'123')
Given this input
mylist = [
{'id':'123','name':'Jason','location': 'McHale'},
{'id':'432','name':'Tom','location': 'Sydney'},
{'id':'123','name':'Jason','location':'Tompson Hall'}
]
You can just extract it with a comprehension
matched = [d for d in mylist if d['id'] == '123']
Then you want to merge the locations. Assuming matched is not empty
final = matched[0]
final['location'] = [d['location'] for d in matched]
Here it is in the interpreter
In [1]: mylist = [
...: {'id':'123','name':'Jason','location': 'McHale'},
...: {'id':'432','name':'Tom','location': 'Sydney'},
...: {'id':'123','name':'Jason','location':'Tompson Hall'}
...: ]
In [2]: matched = [d for d in mylist if d['id'] == '123']
In [3]: final=matched[0]
In [4]: final['location'] = [d['location'] for d in matched]
In [5]: final
Out[5]: {'id': '123', 'location': ['McHale', 'Tompson Hall'], 'name': 'Jason'}
Obviously, you'd want to replace '123' with a variable holding the desired id value.
Wrapping it all up in a function:
def merge_all(df):
ids = {d['id'] for d in df}
result = []
for id in ids:
matches = [d for d in df if d['id'] == id]
combined = matches[0]
combined['location'] = [d['location'] for d in matches]
result.append(combined)
return result
Also, please don't use list as a variable name. It shadows the builtin list class.
I have a some variables and I need to compare each of them and fill three lists according the comparison, if the var == 1 add a 1 to lista_a, if var == 2 add a 1 to lista_b..., like:
inx0=2 inx1=1 inx2=1 inx3=1 inx4=4 inx5=3 inx6=1 inx7=1 inx8=3 inx9=1
inx10=2 inx11=1 inx12=1 inx13=1 inx14=4 inx15=3 inx16=1 inx17=1 inx18=3 inx19=1
inx20=2 inx21=1 inx22=1 inx23=1 inx24=2 inx25=3 inx26=1 inx27=1 inx28=3 inx29=1
lista_a=[]
lista_b=[]
lista_c=[]
#this example is the comparison for the first variable inx0
#and the same for inx1, inx2, etc...
for k in range(1,30):
if inx0==1:
lista_a.append(1)
elif inx0==2:
lista_b.append(1)
elif inx0==3:
lista_c.append(1)
I need get:
#lista_a = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#lista_b = [1,1,1]
#lista_c = [1]
Your inx* variables should almost certinaly be a list to begin with:
inx = [2,1,1,1,4,3,1,1,3,1,2,1,1,1,4,3,1,1,3,1,2,1,1,1,2,3,1,1,3,1]
Then, to find out how many 2's it has:
inx.count(2)
If you must, you can build a new list out of that:
list_a = [1]*inx.count(1)
list_b = [1]*inx.count(2)
list_c = [1]*inx.count(3)
but it seems silly to keep a list of ones. Really the only data you need to keep is a single integer (the count), so why bother carrying around a list?
An alternate approach to get the lists of ones would be to use a defaultdict:
from collections import defaultdict
d = defaultdict(list)
for item in inx:
d[item].append(1)
in this case, what you want as list_a could be accessed by d[1], list_b could be accessed as d[2], etc.
Or, as stated in the comments, you could get the counts using a collections.Counter:
from collections import Counter #python2.7+
counts = Counter(inx)
list_a = [1]*counts[1]
list_b = [1]*counts[2]
...