I am trying to add a column with values from a dictionary. It is easiest to show with some dummy data:
df = pd.DataFrame({'id':[1,2,3,2,5], 'grade':[5,2,2,1,3]})
dictionary = {'1':[5,8,6,3], '2':[1,2], '5':[8,6,2]}
Notice that not every id is in the dictionary, and that the values are lists. I want to find the rows in df whose id matches a key in the dictionary and put the corresponding list in a new column. The desired output looks like this:
output = pd.DataFrame({'id':[1,2,3,2,5], 'grade':[5,2,2,1,3], 'new_column':[[5,8,6,3],[1,2],[],[1,2],[8,6,2]]})
Is this what you want?
df = df.set_index('id')
dictionary = {1:[5,8,6,3], 2:[1,2], 5:[8,6,2]}
df['new_column'] = pd.Series(dictionary)
Note: The keys of the dictionary need to be the same type (int) as the index of the data frame.
>>> print(df)
gender new_column
id
1 0 [5, 8, 6, 3]
2 0 [1, 2]
3 1 NaN
4 1 NaN
5 1 [8, 6, 2]
Update:
A better solution if the 'id' column contains duplicates (see comments below). Note that 'id' must still be a regular column here, not the index:
df['new_column'] = df['id'].map(dictionary)
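If you also want the empty lists from the desired output instead of NaN, here is a minimal sketch. It assumes the question's original frame and its string-keyed dictionary; int_keys and mapped are names made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 2, 5], 'grade': [5, 2, 2, 1, 3]})
dictionary = {'1': [5, 8, 6, 3], '2': [1, 2], '5': [8, 6, 2]}

# Convert the string keys to int so they match the dtype of df['id'].
int_keys = {int(k): v for k, v in dictionary.items()}

# map() leaves NaN for ids without an entry; swap those for empty lists.
mapped = df['id'].map(int_keys)
df['new_column'] = [v if isinstance(v, list) else [] for v in mapped]

print(df)
```

Series.fillna rejects a list as the fill value, hence the list comprehension for the missing ids.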
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5], 'gender':[0,0,1,1,1]})
dictionary = {'1':[5,8,6,3], '2':[1,2], '5':[8,6,2]}
Then just create a list with the values you want and add it to your dataframe:
newValues = [ dictionary.get(str(val),[]) for val in df['id'].values]
df['new_column'] = newValues
>>> print(df)
gender new_column
id
1 0 [5, 8, 6, 3]
2 0 [1, 2]
3 1 []
4 1 []
5 1 [8, 6, 2]
You can construct your column using a special dictionary that returns [] by default for missing keys.
from collections import defaultdict
default_dictionary = defaultdict(list)
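ids = [1,2,3,4,5]  # renamed from id to avoid shadowing the built-in
dictionary = {'1':[5,8,6,3], '2':[1,2], '5':[8,6,2]}
# copy the existing entries into the defaultdict
for n in dictionary:
    default_dictionary[n] = dictionary[n]
# missing keys fall back to an empty list
new_column = [default_dictionary[str(n)] for n in ids]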
new_column is now [[5, 8, 6, 3], [1, 2], [], [], [8, 6, 2]], and you can use it as the value of the last key in the dictionary you pass to pd.DataFrame(...).
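The copy loop can probably be skipped, since defaultdict accepts an existing mapping as its second argument. A short sketch under that assumption; lookup is a hypothetical name and the gender column is borrowed from the other answers.

```python
from collections import defaultdict

import pandas as pd

dictionary = {'1': [5, 8, 6, 3], '2': [1, 2], '5': [8, 6, 2]}
ids = [1, 2, 3, 4, 5]

# Seed the defaultdict with the existing entries in one step.
lookup = defaultdict(list, dictionary)

# Keys are strings, so convert each id before the lookup.
new_column = [lookup[str(n)] for n in ids]

df = pd.DataFrame({'id': ids, 'gender': [0, 0, 1, 1, 1], 'new_column': new_column})
print(df)
```

Keep in mind that looking up a missing key also inserts it into lookup, which is harmless here but grows the dictionary.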
I want to manipulate categorical data using pandas data frame and then convert them to numpy array for model training.
Say I have the following data frame in pandas.
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
c1 c2
0 a d
1 b e
2 None f
Now I want to "compress the categories" horizontally, as follows:
compressed_categories
0 c1-a, c2-d <--- this could be a string, ex. "c1-a, c2-d" or array ["c1-a", "c2-d"] or categorical data
1 c1-b, c2-e
2 c1-nan, c2-f
Next I want to generate a dictionary/vocabulary based on the unique occurrences plus "nan" columns in compressed_categories, ex:
volcab = {
    "c1-a": 0,
    "c1-b": 1,
    "c1-c": 2,
    "c1-nan": 3,
    "c2-d": 4,
    "c2-e": 5,
    "c2-f": 6,
    "c2-nan": 7,
}
Then I can encode them numerically as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
My ultimate goal is to make each row easy to convert to a NumPy array, which I can then convert to a tensor:
input_data = np.asarray(df['compressed_categories_numeric'].tolist())
Then I can train my model using input_data.
Can anyone please show me an example of how to do this series of conversions? Thanks in advance!
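To build the volcab dictionary and the compressed_categories_numeric column, you can use:
import numpy as np
# None becomes the string 'nan', then each value is prefixed with its column name
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
# np.unique flattens and sorts the tokens, so the ids follow alphabetical order
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)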
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
c1 c2 compressed_categories_numeric
0 a d [0, 3]
1 b e [1, 4]
2 None f [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
[1, 4],
[2, 5]])
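If you also need the per-column "nan" entries the question asks for (for example c2-nan, which never occurs in the data), a sketch along these lines might work. It uses Series.map instead of replace; the names tokens, all_tokens, vocab, and encoded are illustrative, not taken from the answer above.

```python
import pandas as pd

df2 = pd.DataFrame({"c1": ['a', 'b', None], "c2": ['d', 'e', 'f']})

# Prefix every value with its column name; missing values become "<col>-nan".
tokens = df2.apply(lambda col: col.name + '-' + col.fillna('nan').astype(str))

# Vocabulary: every observed token plus one "<col>-nan" token per column.
all_tokens = set(tokens.to_numpy().ravel()) | {f"{c}-nan" for c in df2.columns}
vocab = {tok: i for i, tok in enumerate(sorted(all_tokens))}

# Look up each token's id column by column, then export as a 2-D array.
encoded = tokens.apply(lambda col: col.map(vocab))
input_data = encoded.to_numpy()

print(vocab)
print(input_data)
```

Because the ids come from a sorted list, they stay stable as long as the set of tokens does not change.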
new_data = {'mid':mids, 'human':all_tags, 'new':new_tags, 'old':old_tags}
df = pd.DataFrame(new_data.items(), columns=['mid', 'human', 'new', 'old'])
new_data is a dictionary in which each value is a list, and all the lists have the same length. I tried to convert it into a df, but I get this error:
ValueError: 4 columns passed, passed data had 2 columns
How do I convert new_data into a df?
Remove .items():
new_data = {'mid':[1, 2], 'human':[1, 2], 'new':[1, 2], 'old':[1, 2]}
df = pd.DataFrame(new_data, columns=['mid', 'human', 'new', 'old'])
Note:
Passing columns here is redundant, because the names are the same as the dictionary keys anyway. So just use:
>>> pd.DataFrame(new_data)
mid human new old
0 1 1 1 1
1 2 2 2 2
The reason behind the error:
If you try this, here is what you'll get:
>>> pd.DataFrame(new_data.items())
0 1
0 mid [1, 2]
1 human [1, 2]
2 new [1, 2]
3 old [1, 2]
Why?
Check this:
>>> list(new_data.items())
[('mid', [1, 2]), ('human', [1, 2]), ('new', [1, 2]), ('old', [1, 2])]
This is a "list of lists" format (well, a list of tuples in this case). When pd.DataFrame() receives it, it treats each tuple as one row, so it builds only two columns: one for the keys and one for the lists. That is why assigning the column names fails: the data has 2 columns, but you are providing 4 column names.
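To see both behaviours side by side, here is a small sketch using the same sample dictionary. The names by_column and by_row, and the column labels 'key' and 'values', are made up for illustration.

```python
import pandas as pd

new_data = {'mid': [1, 2], 'human': [1, 2], 'new': [1, 2], 'old': [1, 2]}

# Passing the dict itself: each key becomes a column.
by_column = pd.DataFrame(new_data)

# Passing the (key, value) pairs: each pair becomes a row with two fields,
# so only two column names can be assigned.
by_row = pd.DataFrame(list(new_data.items()), columns=['key', 'values'])

print(by_column.shape)  # (2, 4)
print(by_row.shape)     # (4, 2)
```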
I have a dataframe that, as a result of a previous group by, contains 5 rows and two columns. Column A is a unique name, and column B contains a list of unique numbers that correspond to different factors related to the unique name. How can I find the most common number (mode) for each row?
df = pd.DataFrame({"A": [Name1,Name2,...], "B": [[3, 5, 6, 6], [1, 1, 1, 4],...]})
I have tried:
df['C'] = df[['B']].mode(axis=1)
but this simply creates a copy of the lists from column B. Not really sure how to access each list in this case.
Result should be:
A: B: C:
Name 1 [3,5,6,6] 6
Name 2 [1,1,1,4] 1
Any help would be great.
Here's a method using the statistics module's mode function:
from statistics import mode
Two options:
df["C"] = df["B"].apply(mode)
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
Or
df["C"] = [mode(df["B"][i]) for i in range(len(df))]
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
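One thing to watch with statistics.mode is ties. A small sketch, assuming Python 3.8 or later, where mode returns the first of several equally common values and multimode returns all of them. The Name3 row and the all_modes column are made up for illustration.

```python
from statistics import mode, multimode

import pandas as pd

df = pd.DataFrame({
    "A": ["Name1", "Name2", "Name3"],
    "B": [[3, 5, 6, 6], [1, 1, 1, 4], [2, 2, 7, 7]],  # last row has a tie
})

# mode() keeps a single value per row (the first one seen when there is a tie).
df["C"] = df["B"].apply(mode)

# multimode() keeps every value that shares the highest count.
df["all_modes"] = df["B"].apply(multimode)

print(df)
```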
I would use pandas' .apply() method here. It calls a function on each element of a Series. First, we define the function; I'm taking the mode implementation from the question "Find the most common element in a list":
def mode(lst):
    # pick the distinct value that occurs most often in lst
    return max(set(lst), key=lst.count)
Then, we apply this function to the B column to get C:
df['C'] = df['B'].apply(mode)
Our output is:
>>> df
A B C
0 Name1 [3, 5, 6, 6] 6
1 Name2 [1, 1, 1, 4] 1
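Note that max(set(lst), key=lst.count) rescans the list once per distinct value, which can get slow for long lists. A possible alternative, sketched here with collections.Counter from the standard library; counter_mode is a hypothetical name.

```python
from collections import Counter

import pandas as pd


def counter_mode(lst):
    # most_common(1) returns a one-element list of (value, count) pairs.
    return Counter(lst).most_common(1)[0][0]


df = pd.DataFrame({"A": ["Name1", "Name2"], "B": [[3, 5, 6, 6], [1, 1, 1, 4]]})
df["C"] = df["B"].apply(counter_mode)

print(df)
```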
I am very new to pandas. Can anybody tell me how to map the items of the lists in a dataframe column to unique integers?
Data
[phone, laptop]
[life, death, mortal]
[happy]
Expected output:
[1,2]
[3,4,5]
[6]
I used map() and enumerate but both give me errors.
For efficiency, use a list comprehension.
For simple counts:
from itertools import count
c = count(1)
df['new'] = [[next(c) for x in l ] for l in df['Data']]
For unique identifiers in case of duplicates:
from itertools import count
c = count(1)
d = {}
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l ] for l in df['Data']]
Output:
Data new
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy] [6]
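The sample data above has no repeated items, so the two variants give the same result there. A sketch of the second variant on data with a duplicate ("phone" appears twice); the helper token_id and the names counter and ids are illustrative.

```python
from itertools import count

import pandas as pd

df = pd.DataFrame({"Data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy", "phone"]]})

counter = count(1)
ids = {}


def token_id(token):
    # Hand out a new integer only the first time a token is seen.
    if token not in ids:
        ids[token] = next(counter)
    return ids[token]


df["new"] = [[token_id(t) for t in row] for row in df["Data"]]

print(df)
```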
You could explode, replace, and groupby to undo the explode operation:
df = pd.DataFrame({"data": [["phone", "laptop"],
["life", "death", "mortal"],
["happy", "phone"]]})
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=lambda df: range(1, len(df) + 1))
.groupby(df_expl.index).data.apply(list))
print(df)
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 7]
This always increments the counter, even if list items are duplicates.
In case duplicates should have unique integer values, use factorize instead of range:
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=df_expl.data.factorize()[0] + 1)
.groupby(df_expl.index).data.apply(list))
print(df)
# output:
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 1]
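factorize also returns the unique values, which can be handy if you later need to translate the integers back. A minimal sketch, assuming the same sample frame; exploded and id_to_token are made-up names.

```python
import pandas as pd

df = pd.DataFrame({"data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy", "phone"]]})

exploded = df["data"].explode()

# codes are 0-based positions into uniques, in order of first appearance.
codes, uniques = pd.factorize(exploded)

# Shift to 1-based ids to match the output above and keep a reverse lookup.
id_to_token = dict(enumerate(uniques, start=1))

print((codes + 1).tolist())
print(id_to_token)
```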
I'm trying to figure out how to condition on an array I've created.
first6 = df["Tbl_Name_Dur"].unique()[0:6]
for element in first6:
    print(element)
df_test = df[df['Tbl_Name_Dur'] for element in first6]
I've printed the elements and that works. How do I select the rows of my dataframe based on first6? I've tried the following:
df_test = df[df['Tbl_Name_Dur'] in first6]
df_test = df[df['Tbl_Name_Dur'] == first6]
Any help would be much appreciated!
You can use the isin method. Here is an example:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8 ,8 ])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
# test the column itself so non-matching rows are dropped directly (no NaN step needed)
df = df[df['col'].isin(first6)]
print(df)
Alternatively, you can use a lambda function together with map:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8, 8 ])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
df = df[df.col.map(lambda x : x in first6)]
print(df)
Output:
col
0 1
1 2
2 3
3 4
4 4
5 5
6 6
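Applied to a column named as in the question, the isin approach might look like the sketch below. The sample values and the extra value column are invented, and df_rest is a hypothetical name for the complementary selection.

```python
import pandas as pd

df = pd.DataFrame({
    'Tbl_Name_Dur': ['a', 'b', 'c', 'd', 'd', 'e', 'f', 'g', 'h', 'h'],
    'value': range(10),
})

first6 = df['Tbl_Name_Dur'].unique()[:6]

# isin() builds one boolean per row, which is what df[...] expects.
df_test = df[df['Tbl_Name_Dur'].isin(first6)]

# ~ inverts the mask to get the remaining rows.
df_rest = df[~df['Tbl_Name_Dur'].isin(first6)]

print(df_test)
print(df_rest)
```

The attempts with in and == fail because neither produces a per-row membership test: in tries to reduce the whole comparison to a single True/False, and == compares the column with the array position by position.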