I have a dictionary like this:
mydict = {'A': 'some thing',
'B': 'couple of words'}
All the values are strings that are separated by white spaces. My goal is to convert this into a dataframe which looks like this:
key_val splitted_words
0 A some
1 A thing
2 B couple
3 B of
4 B words
So I want to split the strings and then add the associated key and these words into one row of the dataframe.
A quick implementation could look like this:
import pandas as pd
mydict = {'A': 'some thing',
'B': 'couple of words'}
all_words = " ".join(mydict.values()).split()
df = pd.DataFrame(columns=['key_val', 'splitted_words'], index=range(len(all_words)))
indi = 0
for item in mydict.items():
    words = item[1].split()
    for word in words:
        df.iloc[indi]['key_val'] = item[0]
        df.iloc[indi]['splitted_words'] = word
        indi += 1
which gives me the desired output.
However, I am wondering whether there is a more efficient solution to this!?
Here is my one-line approach:
df = pd.DataFrame([(k, s) for k, v in mydict.items() for s in v.split()], columns=['key_val','splitted_words'])
If I split it, it will be:
d=[(k, s) for k, v in mydict.items() for s in v.split()]
df = pd.DataFrame(d, columns=['key_val','splitted_words'])
Output:
Out[41]:
key_val splitted_words
0 A some
1 A thing
2 B couple
3 B of
4 B words
Based on @qu-dong's idea, here is a working example that uses a generator function for readability:
#! /usr/bin/env python
from __future__ import print_function
import pandas as pd
mydict = {'A': 'some thing',
'B': 'couple of words'}
def splitting_gen(in_dict):
    """Generator function to split in_dict items on space."""
    for k, v in in_dict.items():
        for s in v.split():
            yield k, s
df = pd.DataFrame(splitting_gen(mydict), columns=['key_val', 'splitted_words'])
print(df)
# key_val splitted_words
# 0 A some
# 1 A thing
# 2 B couple
# 3 B of
# 4 B words
# real 0m0.463s
# user 0m0.387s
# sys 0m0.057s
But this only addresses efficiency in the sense of elegance/readability of the requested solution, not run time.
Note that the timings are all alike, a tad shorter than 500 milliseconds. So one might want to profile further to avoid suffering when feeding in larger texts ;-)
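For larger inputs, one option worth profiling is to let pandas do the splitting. This is only a sketch, assuming pandas 0.25 or newer (where Series.explode is available); the column names are taken from the question, and I have not timed it against the list-comprehension version.

```python
import pandas as pd

mydict = {'A': 'some thing',
          'B': 'couple of words'}

# Build a Series keyed by the dict keys, split each string on whitespace,
# then give every word its own row while repeating the key.
df = (
    pd.Series(mydict, name='splitted_words')
    .str.split()
    .explode()
    .rename_axis('key_val')
    .reset_index()
)
print(df)
```

Whether this is actually faster than the plain Python loop depends on the data, so it is worth measuring on a realistic sample.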
Related
I have a dataframe with strings and a dictionary whose values are lists of strings.
I need to check whether each string in the dataframe contains any element of any of the dictionary's values. If it does, I need to label it with the corresponding key from the dictionary. In other words, I need to categorize all the strings in the dataframe with keys from the dictionary.
For example.
df = pd.DataFrame({'a':['x1','x2','x3','x4']})
d = {'one':['1','aa'],'two':['2','bb']}
I would like to get something like this:
df = pd.DataFrame({
'a':['x1','x2','x3','x4'],
'Category':['one','two','x3','x4']})
I tried this, but it has not worked:
df['Category'] = np.nan
for k, v in d.items():
    for l in v:
        df['Category'] = [k if l in str(x).lower() else x for x in df['a']]
Any ideas appreciated!
First, create a function that does this for you:-
def func(val):
    for x in range(0,len(d.values())):
        if val in list(d.values())[x]:
            return list(d.keys())[x]
Now make use of the split() and apply() methods:-
df['Category']=df['a'].str.split('',expand=True)[2].apply(func)
Finally use fillna() method:-
df['Category']=df['Category'].fillna(df['a'])
Now if you print df you will get your expected output:-
a Category
0 x1 one
1 x2 two
2 x3 x3
3 x4 x4
Edit:
You can also do this by:-
def func(val):
    for x in range(0,len(d.values())):
        if any(l in val for l in list(d.values())[x]):
            return list(d.keys())[x]
then:-
df['Category']=df['a'].apply(func)
Finally:-
df['Category']=df['Category'].fillna(df['a'])
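The same substring check can also be written without indexing into list(d.values()), and with the fallback folded into the function so that fillna() is not needed. A minimal sketch, using the df and d from the question; the name categorize is my own, not something from pandas.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}

def categorize(text, mapping=d):
    """Return the first key whose substrings occur in text, else text itself."""
    for key, substrings in mapping.items():
        if any(s in text for s in substrings):
            return key
    return text  # no match: keep the original string

df['Category'] = df['a'].apply(categorize)
print(df)
```

If a string matches more than one key, the first key in the dictionary's iteration order wins, which may or may not be what you want.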
I've come up with the following heuristic, which looks really dirty.
It outputs what you desire, albeit with some warnings, since I've used indices to append values to dataframe.
import pandas as pd
import numpy as np
def main():
    df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
    d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
    found = False
    i = 0
    df['Category'] = np.nan
    for x in df['a']:
        for k,v in d.items():
            for item in v:
                if item in x:
                    df['Category'][i] = k
                    found = True
                    break
                else:
                    df['Category'][i] = x
            if found:
                found = False
                break
        i += 1
    print(df)
main()
I'm trying to create a dataframe using 1 list and 1 dictionary.
The first column is the word (equal to the list), the second the count (some words are in the dictionary with the correspondent count).
Example:
list = ['hi', 'hello', 'bye']
dict = {'hi': 10}
df = hi 10
hello 0
bye 0
What I want to do is do it using one list comp, something like:
df = pd.DataFrame([[word, count] for word in list if word in dict.keys(): count = dict[word] else: count = 0 ], columns=['words', 'count'])
You can be a bit more expressive by using the dict.get method, which returns a default value if the key does not exist.
pd.DataFrame([[word, dct.get(word, 0)] for word in lst], columns=['words', 'count'])
I've replaced dict and list with dct and lst to avoid shadowing the built-in dict and list types.
It looks like you're actually after a Pandas Series object. Here's how you can create one from a dictionary that solves your problem:
>>> lst = ['hi', 'hello', 'bye']
>>> dct = {'hi': 10}
>>> pd.Series({k: dct.get(k, 0) for k in lst})
hi 10
hello 0
bye 0
dtype: int64
Or, if you want a DataFrame:
>>> pd.DataFrame([(k, dct.get(k, 0)) for k in lst], columns=['word', 'count'])
word count
0 hi 10
1 hello 0
2 bye 0
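Another way to get the default of 0, as a sketch rather than a recommendation, is to build the Series from the dictionary and reindex it on the list with fill_value=0. The names lst and dct follow the answers above; the column names are the ones from the question.

```python
import pandas as pd

lst = ['hi', 'hello', 'bye']
dct = {'hi': 10}

# reindex keeps the order of lst and fills words missing from dct with 0
counts = pd.Series(dct).reindex(lst, fill_value=0)

# Turn the Series into a two-column DataFrame
df = counts.rename_axis('words').reset_index(name='count')
print(df)
```

Note that this drops any dictionary key that is not in the list, just like the dct.get versions do.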
Having issues with building a find and replace tool in python. Goal is to search a column in an excel file for a string and swap out every letter of the string based on the key value pair of the dictionary, then write the entire new string back to the same cell. So "ABC" should convert to "BCD". I have to find and replace any occurrence of individual characters.
The below code runs without debugging, but newvalue never creates and I don't know why. No issues writing data to the cell if newvalue gets created.
input: df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
expected output: df = pd.DataFrame({'Code1': ['BCD1', 'C5DE', 'D3EF']})
mycolumns = ["Col1", "Col2"]
mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
for x in mycolumns:
    # 1. If the mycolumn value exists in the headerlist of the file
    if x in headerlist:
        # 2. Get column coordinate
        col = df.columns.get_loc(x) + 1
        # 3. iterate through the rows underneath that header
        for ind in df.index:
            # 4. log the row coordinate
            rangerow = ind + 2
            # 5. get the original value of that coordinate
            oldval = df[x][ind]
            for count, y in enumerate(oldval):
                # 6. generate replacement value
                newval = df.replace({y: mydictionary}, inplace=True, regex=True, value=None)
                print("old: " + str(oldval) + " new: " + str(newval))
                # 7. update the cell
                ws.cell(row=rangerow, column=col).value = newval
            else:
                print("not in the string")
    else:
        # print(df)
        print("column doesn't exist in workbook, moving on")
else:
    print("done")
wb.save(filepath)
wb.close()
I know there's something going on with enumerate, and I'm probably not stitching the string back together after I do the replacements? Or maybe a dictionary is the wrong solution to what I am trying to do; the key:value pairing is what led me to use it. I have a little programming background but very little with Python. Appreciate any help.
newvalue never creates and I don't know why.
DataFrame.replace with inplace=True will return None.
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> df = df.replace('ABC1','999')
>>> df
Code1
0 999
1 B5CD
2 C3DE
>>> q = df.replace('999','zzz', inplace=True)
>>> print(q)
None
>>> df
Code1
0 zzz
1 B5CD
2 C3DE
>>>
An alternative could be to use str.translate on the column (through its str accessor) to translate the entire Series:
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
>>> table = str.maketrans('ABC','BCD')
>>> df
Code1
0 ABC1
1 B5CD
2 C3DE
>>> df.Code1.str.translate(table)
0 BCD1
1 C5DD
2 D3DE
Name: Code1, dtype: object
>>>
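The translation table can also be built straight from mydictionary, since str.maketrans accepts a dict whose keys are single characters. The sketch below also shows why translate is a better fit here than looping over the dictionary with str.replace: translate substitutes every character in one pass, whereas chained replacements cascade (A becomes B, which then becomes C, and so on).

```python
import pandas as pd

df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}

# Build the table from the dict instead of spelling out 'ABC' -> 'BCD'
table = str.maketrans(mydictionary)
translated = df['Code1'].str.translate(table)
print(translated.tolist())

# For comparison: replacing one key at a time re-replaces earlier results
cascaded = 'ABC1'
for old, new in mydictionary.items():
    cascaded = cascaded.replace(old, new)
print(cascaded)  # 'ABC1' -> 'BBC1' -> 'CCC1' -> 'DDD1'
```

Characters without an entry in the dictionary (here D, E and the digits) are left unchanged, so getting 'C5DE' and 'D3EF' as in the question would need D and E added to mydictionary.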
I have a column in a data frame which looks like below
How do I calculate the frequency of each word? For example, the word 'doorman' appears in 4 rows, so I need the word along with its frequency, i.e. doorman = 4.
This needs to be done for each and every word.
Please advise
I think you can first flatten the list of lists in the column and then use Counter:
df = pd.DataFrame({'features':[['a','b','b'],['c'],['a','a']]})
print (df)
features
0 [a, b, b]
1 [c]
2 [a, a]
from itertools import chain
from collections import Counter
print (Counter(list(chain.from_iterable(df.features))))
Counter({'a': 3, 'b': 2, 'c': 1})
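Note that the Counter above counts every occurrence, so a word repeated within one row is counted more than once. If what you need is the number of rows a word appears in (as in the 'doorman' example), one possible sketch is to deduplicate each row with set() before counting. The sample frame is the same one used above; the variable names are my own.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'features': [['a', 'b', 'b'], ['c'], ['a', 'a']]})

# Total occurrences across all rows (same result as the flattened Counter)
total_counts = Counter(word for words in df['features'] for word in words)

# Number of rows each word appears in: set() removes repeats within a row
row_counts = Counter(word for words in df['features'] for word in set(words))

print(total_counts)
print(row_counts)
```

Here 'a' occurs three times in total but only in two rows, which is the difference between the two counters.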
I have a list of bi-grams like this:
[['a','b'],['e', 'f']]
Now I want to add these bigrams to a DataFrame with their frequencies like this:
b f
a|1 0
e|0 1
I tried doing this with the following code, but this raises an error, because the index doesn't exist yet. Is there a fast way to do this for really big data? (like 200000 bigrams)
matrixA = pd.DataFrame()
# Put the counts in a matrix
for elem in grams:
    tag1, tag2 = elem[0], elem[1]
    matrixA.loc[tag1, tag2] += 1
from collections import Counter
bigrams = [[['a','b'],['e', 'f']], [['a','b'],['e', 'g']]]
pairs = []
for bg in bigrams:
    pairs.append((bg[0][0], bg[0][1]))
    pairs.append((bg[1][0], bg[1][1]))
c = Counter(pairs)
>>> pd.Series(c).unstack() # optional: .fillna(0)
b f g
a 2 NaN NaN
e NaN 1 1
The above is for the intuition. This can be wrapped up in a one line generator expression as follows:
pd.Series(Counter((bg[i][0], bg[i][1]) for bg in bigrams for i in range(2))).unstack()
You can use Counter from the collections package. Note that I changed the contents of the list to be tuples rather than lists. This is because Counter keys (like dict keys) must be hashable.
from collections import Counter
l = [('a','b'),('e', 'f')]
index, cols = zip(*l)
df = pd.DataFrame(0, index=index, columns=cols)
c = Counter(l)
for (i, col), count in c.items():
    df.loc[i, col] = count
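If the bigrams are a flat list of pairs as in the question, pd.crosstab may be worth trying as well, since it builds the count matrix without an explicit loop. This is a sketch only: the column names 'first' and 'second' are placeholders of mine, and I have not benchmarked it on 200000 bigrams.

```python
import pandas as pd

grams = [['a', 'b'], ['e', 'f']]

# One row per bigram, one column per position in the bigram
pairs = pd.DataFrame(grams, columns=['first', 'second'])

# Rows are first tokens, columns are second tokens, cells are counts
matrix = pd.crosstab(pairs['first'], pairs['second'])
print(matrix)
```

Combinations that never occur are filled with 0 rather than NaN, so no fillna(0) is needed afterwards.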