Create data frame with list comprehension if/else - python

I'm trying to create a dataframe using 1 list and 1 dictionary.
The first column is the word (equal to the list), the second the count (some words are in the dictionary with the correspondent count).
Example:
list = ['hi', 'hello', 'bye']
dict = {'hi': 10}
df = hi 10
hello 0
bye 0
What I want to do is do it using one list comp, something like:
df = pd.DataFrame([[word, count] for word in list if word in dict.keys(): count = dict[word] else: count = 0 ], columns=['words', 'count'])

You can write this more expressively with the dict.get method, which returns a default value when the key does not exist.
pd.DataFrame([[word, dct.get(word, 0)] for word in lst], columns=['words', 'count'])
I've replaced dict and list with dct and lst to avoid shadowing the built-in names (they are built-ins, not reserved keywords).
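For comparison, the if/else the question was reaching for can be written as a conditional expression inside the comprehension. This is a minimal sketch that reuses the dct and lst names from the answer above; it should produce the same rows as the dict.get version.

```python
import pandas as pd

lst = ['hi', 'hello', 'bye']
dct = {'hi': 10}

# A conditional expression (a if condition else b) goes in the output
# position of the comprehension, not after the for clause.
rows_if_else = [[word, dct[word] if word in dct else 0] for word in lst]

# Equivalent, shorter form using dict.get with a default of 0.
rows_get = [[word, dct.get(word, 0)] for word in lst]

df = pd.DataFrame(rows_if_else, columns=['words', 'count'])
print(df)
```

A trailing "if" after the for clause only filters items; it cannot supply an alternative value, which is why the attempt in the question is a syntax error.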

It looks like you're actually after a Pandas Series object. Here's how you can create one from a dictionary that solves your problem:
>>> lst = ['hi', 'hello', 'bye']
>>> dct = {'hi': 10}
>>> pd.Series({k: dct.get(k, 0) for k in lst})
hi 10
hello 0
bye 0
dtype: int64
Or, if you want a DataFrame:
>>> pd.DataFrame([(k, dct.get(k, 0)) for k in lst], columns=['word', 'count'])
word count
0 hi 10
1 hello 0
2 bye 0
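Another option, not taken from the answers above, is to let pandas do the alignment with Series.reindex. A minimal sketch, assuming the same lst and dct as before:

```python
import pandas as pd

lst = ['hi', 'hello', 'bye']
dct = {'hi': 10}

# Build a Series from the dictionary, then align it to the word list.
# Words that are missing from the dictionary receive fill_value.
counts = pd.Series(dct).reindex(lst, fill_value=0)

# Move the index into a regular column to get a two-column DataFrame.
df = counts.rename_axis('word').reset_index(name='count')
print(df)
```

This avoids the explicit comprehension, at the cost of being a little less obvious to readers who do not know reindex.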

Related

filtering a dataframe if the length of the word inside the series > 3

Community! I really appreciate all support I'm receiving through my journey learning python so far!
I got this following dataframe:
d = {'name': ['john', 'mary', 'james'], 'area':[['IT', 'Resources', 'Admin'], ['Software', 'ITS', 'Programming'], ['Teaching', 'Research', 'KS']]}
df = pd.DataFrame(data=d)
My goal is:
In other words, keep only the words in each list of the column 'area' whose length is > 3, and remove the rest.
I'm trying something like this but I'm really stuck
What is the best way of approaching this situation?
Thanks again!!
Combine .map with list comprehension:
df['area'] = df['area'].map(lambda x: [e for e in x if len(e)>3])
0 [Resources, Admin]
1 [Software, Programming]
2 [Teaching, Research]
Explanation:
x = ["Software", "ABC", "Programming"]
# return e for every element in x but only if length of element is larger than 3
[e for e in x if len(e)>3]
You can expand all your lists, filter on str length and then put them back in lists by aggregating using list:
df = df.explode("area")
df = df[df["area"].str.len() > 3].groupby("name", as_index=False).agg(list)
# name area
# 0 james [Teaching, Research]
# 1 john [Resources, Admin]
# 2 mary [Software, Programming]
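Two side effects of the explode/groupby route are worth checking on your own data: groupby sorts the names by default (hence james first in the output above), and a row whose list ends up empty disappears entirely. The sketch below illustrates both; the extra row 'ann' is a hypothetical addition, not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "mary", "james", "ann"],  # 'ann' is a made-up extra row
    "area": [
        ["IT", "Resources", "Admin"],
        ["Software", "ITS", "Programming"],
        ["Teaching", "Research", "KS"],
        ["HR", "IT"],  # every word here is 3 characters or shorter
    ],
})

# One row per word, keep the long words, then rebuild the lists.
exploded = df.explode("area")
kept = exploded[exploded["area"].str.len() > 3]
# sort=False keeps the names in order of first appearance.
regrouped = kept.groupby("name", as_index=False, sort=False).agg(list)

# The map-based approach keeps every row, with an empty list where needed.
mapped = df["area"].map(lambda words: [w for w in words if len(w) > 3])

print(regrouped)
print(mapped)
```

If rows with no remaining words must survive, the map-based version is the safer choice.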
Before you build the dataframe.
One simple and efficient way is to build a new list for the key "area" in which each inner list keeps only the strings longer than 3 characters. For example:
d = {'name': ['john', 'mary', 'james'], 'area':[['IT', 'Resources', 'Admin'], ['Software', 'ITS', 'Programming'], ['Teaching', 'Research', 'KS']]}
# Retrieving the areas from d.
area_list = d['area']
# Copying all values whose length is larger than 3 into new inner lists.
filtered_area_list = [[a for a in areas if len(a) > 3] for areas in area_list]
# Replacing the old list in the dictionary with the new one.
d['area'] = filtered_area_list
# Creating the dataframe.
df = pd.DataFrame(data=d)
After you build the dataframe.
If your data is in a dataframe, then you can use the "map" function:
df['area'] = df['area'].map(lambda a: [e for e in a if len(e) > 3])

How to assign same values in dictionary to another list of strings

I am trying to convert strings to numbers and then assign same values for same words in another list of strings. Assume I have string A like below. I converted to values using dictionary like the code below. Now I need to assign same values to same string in list B and the output should be like res_B
A='hello world how are you doing'
res_A = [1, 2, 3, 4, 5, 6]
B=['hello world how', 'hello are' ,'hello', 'hello are you doing']
res_B = [[1,2,3],[1,4],[1],[1,4,5,6]]
A='hello world how are you doing'
d = {}
res_A = [d.setdefault(word, len(d)+1) for word in A.lower().split()]
# map words from A into indices 1..N
mapping = {k: v for v, k in enumerate(A.split(), 1)}
# find mappings of words in B
B_res = [[mapping[word] for word in s.split()] for s in B]
Using a list comprehension again, and the same dictionary from before:
res_B = [
[d[word] for word in phrase.lower().split()]
for phrase in B
]
Here's a step-by-step and functional way to do it
# Create a lookup dictionary
lookup = {word: index for word, index in zip(A.split(' '), res_A)}
# Map every sentence to be replaced with lookup values per word
res_B = [list(map(lambda x: lookup[x], sentence.split(' ')))
for sentence in B]
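The setdefault and enumerate approaches above agree when every word in A is unique, but they differ once a word repeats. The sketch below uses a hypothetical A with a repeated word (not the input from the question) to show the difference; the variable names follow the answers above.

```python
A = 'hello world hello you'        # hypothetical input with a repeated word
B = ['hello world', 'you hello']   # hypothetical phrases to encode

# setdefault hands out the next free id the first time a word is seen,
# so ids stay consecutive and a repeated word reuses its first id.
d = {}
res_A = [d.setdefault(word, len(d) + 1) for word in A.lower().split()]

# enumerate stores positions, so a repeated word ends up with its LAST position.
mapping = {k: v for v, k in enumerate(A.split(), 1)}

res_B = [[d[word] for word in phrase.lower().split()] for phrase in B]
res_B_positions = [[mapping[word] for word in phrase.split()] for phrase in B]

print(res_A)            # [1, 2, 1, 3]
print(res_B)            # [[1, 2], [3, 1]]
print(res_B_positions)  # [[3, 2], [4, 3]]
```

Both versions raise KeyError for a word in B that never occurs in A; d.get(word) with a default is one way to soften that, if it suits the data.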

pandas: count the number of unique occurrences of each element of list in a column of lists

I have a dataframe containing a column of lists like the following:
df
pos_tag
0 ['Noun','verb','adjective']
1 ['Noun','verb']
2 ['verb','adjective']
3 ['Noun','adverb']
...
What I would like to get is the number of times each unique element occurs in the overall column, as a dictionary:
desired output:
my_dict = {'Noun':3, 'verb':3, 'adjective':2, 'adverb':1}
Use Series.explode along with Series.value_counts and Series.to_dict:
freq = df['pos_tag'].explode().value_counts().to_dict()
Result:
# print(freq)
{'Noun':3, 'verb':3, 'adjective':2, 'adverb':1}
To improve performance, use Counter with the flattened values of the nested lists:
from collections import Counter
my_dict = dict(Counter([y for x in df['pos_tag'] for y in x]))
print (my_dict)
{'Noun': 3, 'verb': 3, 'adjective': 2, 'adverb': 1}
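The two answers can be checked against each other in one runnable sketch. It assumes the pos_tag column holds real Python lists (not their string representations) and swaps the nested comprehension for itertools.chain.from_iterable, which flattens the lists without building an intermediate list.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'pos_tag': [
        ['Noun', 'verb', 'adjective'],
        ['Noun', 'verb'],
        ['verb', 'adjective'],
        ['Noun', 'adverb'],
    ]
})

# pandas route: one row per tag, then count.
freq_pandas = df['pos_tag'].explode().value_counts().to_dict()

# Pure-Python route: lazily flatten the lists and count with Counter.
freq_counter = dict(Counter(chain.from_iterable(df['pos_tag'])))

print(freq_pandas == freq_counter)
```

Which route is faster depends on the size and shape of the data, so it is worth timing both before relying on either.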

Python pandas: sum item occurrences in a string list by item substring

I have this list of strings:
list = ['a.xxx', 'b.yyy', 'c.zzz', 'a.yyy', 'b.xxx', 'a.www']
I'd like to count item occurrences by item.split('.')[0].
Desired output:
a 3
b 2
c 1
setup
I don't like assigning to variable names that are built-in classes
l = ['a.xxx', 'b.yyy', 'c.zzz', 'a.yyy', 'b.xxx', 'a.www']
option 1
pd.value_counts(pd.Series(l).str.split('.').str[0])
option 2
pd.value_counts([x.split('.', 1)[0] for x in l])
option 3
wrap Counter in pd.Series
from collections import Counter
pd.Series(Counter([x.split('.', 1)[0] for x in l]))
option 4
pd.Series(l).apply(lambda x: x.split('.', 1)[0]).value_counts()
option 5
using find
pd.value_counts([x[:x.find('.')] for x in l])
All yield
a 3
b 2
c 1
dtype: int64
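Options 1, 2 and 5 call the top-level pd.value_counts function. As far as I know, recent pandas releases emit a deprecation warning for it and recommend the Series method instead, so here is a sketch of option 1 written that way. It is an adaptation, not the original answer's code.

```python
import pandas as pd

l = ['a.xxx', 'b.yyy', 'c.zzz', 'a.yyy', 'b.xxx', 'a.www']

# Split each string once on the first dot and keep the part before it.
prefixes = pd.Series(l).str.split('.', n=1).str[0]

# Count with the Series method rather than the top-level function.
counts = prefixes.value_counts()
print(counts)
```

The same change applies to the list-based options: wrap the list in pd.Series(...) and call .value_counts() on the result.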
First of all, list is not a good variable name because you will shadow the built in list. I don't know much pandas, but since it is not required here I'll post an answer anyway.
>>> from collections import Counter
>>> l = ['a.xxx', 'b.yyy', 'c.zzz', 'a.yyy', 'b.xxx', 'a.www']
>>> Counter(x.split('.', 1)[0] for x in l)
Counter({'a': 3, 'b': 2, 'c': 1})
I would try the Counter class from collections. It is a subclass of dict, and gives you a dictionary where the values correspond to the number of observations of each type of key:
a = ['a.xxx', 'b.yyy', 'c.zzz', 'a.yyy', 'b.xxx', 'a.www']
from collections import Counter
Counter([item.split(".")[0] for item in a])
gives
Counter({'a': 3, 'b': 2, 'c': 1})
which is what you require

How to efficiently convert the entries of a dictionary into a dataframe

I have a dictionary like this:
mydict = {'A': 'some thing',
'B': 'couple of words'}
All the values are strings that are separated by white spaces. My goal is to convert this into a dataframe which looks like this:
key_val splitted_words
0 A some
1 A thing
2 B couple
3 B of
4 B words
So I want to split the strings and then add the associated key and these words into one row of the dataframe.
A quick implementation could look like this:
import pandas as pd
mydict = {'A': 'some thing',
'B': 'couple of words'}
all_words = " ".join(mydict.values()).split()
df = pd.DataFrame(columns=['key_val', 'splitted_words'], index=range(len(all_words)))
indi = 0
for item in mydict.items():
    words = item[1].split()
    for word in words:
        # .loc with row and column together avoids chained assignment
        df.loc[indi, 'key_val'] = item[0]
        df.loc[indi, 'splitted_words'] = word
        indi += 1
which gives me the desired output.
However, I am wondering whether there is a more efficient solution to this!?
Here is my one-line approach:
df = pd.DataFrame([(k, s) for k, v in mydict.items() for s in v.split()], columns=['key_val','splitted_words'])
If I split it, it will be:
d=[(k, s) for k, v in mydict.items() for s in v.split()]
df = pd.DataFrame(d, columns=['key_val','splitted_words'])
Output:
Out[41]:
key_val splitted_words
0 A some
1 A thing
2 B couple
3 B of
4 B words
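A pandas-only variant, not taken from the answers here, is to split the values with the .str accessor and then explode them. A minimal sketch, assuming every value in mydict is a whitespace-separated string:

```python
import pandas as pd

mydict = {'A': 'some thing',
          'B': 'couple of words'}

df = (
    pd.Series(mydict)        # index: keys, values: the strings
    .str.split()             # each value becomes a list of words
    .explode()               # one row per word, key repeated in the index
    .rename_axis('key_val')  # name the index so it becomes a column
    .reset_index(name='splitted_words')
)
print(df)
```

It reads as a pipeline rather than a comprehension; whether it is faster than the comprehension is something to measure on realistic input.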
Based on @qu-dong's idea, here is a working example that uses a generator function for readability:
#! /usr/bin/env python
from __future__ import print_function
import pandas as pd
mydict = {'A': 'some thing',
'B': 'couple of words'}
def splitting_gen(in_dict):
    """Generator function to split in_dict items on space."""
    for k, v in in_dict.items():
        for s in v.split():
            yield k, s
df = pd.DataFrame(splitting_gen(mydict), columns=['key_val', 'splitted_words'])
print(df)
# key_val splitted_words
# 0 A some
# 1 A thing
# 2 B couple
# 3 B of
# 4 B words
# real 0m0.463s
# user 0m0.387s
# sys 0m0.057s
But this only improves the elegance and readability of the requested solution, not its speed.
Note that the timings are all alike, a little under 500 milliseconds. So one might profile further to avoid suffering when feeding in larger texts ;-)
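The real/user/sys figures above look like whole-process timings, which presumably include interpreter startup and importing pandas, so they say little about the construction step itself. A rough sketch of timing only that step with timeit follows; big_dict is a made-up larger input and the two function names are hypothetical, chosen only for this illustration.

```python
import timeit

import pandas as pd

# Hypothetical larger input: 200 keys with three words each.
big_dict = {f"key{i}": "some more words" for i in range(200)}
COLS = ['key_val', 'splitted_words']


def build_with_loop(d):
    """Row-by-row fill, in the spirit of the loop from the question."""
    all_words = " ".join(d.values()).split()
    df = pd.DataFrame(columns=COLS, index=range(len(all_words)))
    row = 0
    for key, text in d.items():
        for word in text.split():
            df.loc[row, 'key_val'] = key
            df.loc[row, 'splitted_words'] = word
            row += 1
    return df


def build_with_comprehension(d):
    """Collect all (key, word) pairs first, then build the frame once."""
    return pd.DataFrame([(k, s) for k, v in d.items() for s in v.split()],
                        columns=COLS)


loop_seconds = timeit.timeit(lambda: build_with_loop(big_dict), number=1)
comp_seconds = timeit.timeit(lambda: build_with_comprehension(big_dict), number=1)
print(f"loop: {loop_seconds:.4f}s  comprehension: {comp_seconds:.4f}s")
```

The absolute numbers depend on the machine and the pandas version, so treat them as a relative comparison only.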
