Creating dataframe from pandas function - python

I'm attempting some text analytics and writing code to show the occurrence of a word each month in a given dataset. The snippet below outputs the frequency of a given word every month; however, I am struggling to transform this output into a dataframe (columns: month, word frequency).
Appreciate any help!
import collections
import pandas as pd

df = df.set_index(df['Date'])
for u, v in df.groupby(pd.Grouper(freq="M")):
    words = sum(v['Processed'].str.split(' ').values.tolist(), [])
    c = collections.Counter(words)
    print(c['word'])
currently outputs:
0
1
0
1
1
2
1
18
6
0
0
0

You can convert your collection into a dataframe using pd.DataFrame.from_dict:
import collections
import pandas as pd

df = df.set_index(df['Date'])
results = []
for u, v in df.groupby(pd.Grouper(freq="M")):
    words = sum(v['Processed'].str.split(' ').values.tolist(), [])
    c = collections.Counter(words)
    # convert counter to dataframe
    cdf = pd.DataFrame.from_dict(c, orient='index', columns=['frequency']).reset_index()
    # add identifier to dataframe
    cdf['month'] = u
    # collect results
    results += [cdf]
# concatenate results
results = pd.concat(results)
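If only the frequency of a single word per month is needed (the two-column layout asked for in the question), the loop can build rows directly. A minimal sketch on made-up sample data, grouping by to_period('M'), which is equivalent to pd.Grouper(freq="M"):

```python
import collections
import pandas as pd

# hypothetical sample data standing in for the asker's dataset
df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-05', '2021-01-20', '2021-02-10']),
    'Processed': ['word other word', 'other', 'word'],
}).set_index('Date')

rows = []
for month, grp in df.groupby(df.index.to_period('M')):
    words = sum(grp['Processed'].str.split(' ').values.tolist(), [])
    rows.append({'month': month, 'word frequency': collections.Counter(words)['word']})

out = pd.DataFrame(rows)
print(out)
```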

Related

Finding closest timestamp between dataframe columns

I have two dataframes
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now in the first dataframe I create a column:
test1['values'] = np.zeros(10)
I want to fill this column so that next to each date there is the index of the closest date from the second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course my real data is not evenly spaced and has minutes and seconds, but the idea is the same. I use the following code:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

for k in range(10):
    a = nearest(test2['time'], test1['time'][k])  # find nearest timestamp from second dataframe
    b = test2.index[test2['time'] == a].tolist()[0]  # identify the index of this timestamp
    test1['values'][k] = b  # assign this value to the cell
This code is very slow on large datasets; how can I make it more efficient?
P.S. The timestamps in my real data are sorted and increasing, just like in these artificial examples.
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days
if offset < 0:  # test1 time starts before test2 time, prepend zeros:
    offset = abs(offset)
    idx = np.append(np.zeros(offset), np.arange(len(test1['time']) - offset)).astype(int)
else:  # test1 time starts after or with test2 time, use arange right away:
    idx = np.arange(offset, offset + len(test1['time']))
test1['values'] = idx
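Since both time columns are sorted, pandas' merge_asof with direction='nearest' can also do this matching without a Python-level loop (an alternative not shown in the answer above). A sketch on the example frames, carrying test2's index over as the 'values' column:

```python
import pandas as pd

test1 = pd.DataFrame({'time': pd.date_range(start='1/1/2018', end='1/10/2018')})
test2 = pd.DataFrame({'time': pd.date_range(start='1/5/2018', end='1/20/2018')})

# expose test2's index as a column so the merge carries it over
right = test2.reset_index().rename(columns={'index': 'values'})

# for each test1 row, take the index of the nearest test2 timestamp
merged = pd.merge_asof(test1, right, on='time', direction='nearest')
print(merged)
```

Both inputs must be sorted on the merge key, which the question says holds for the real data as well.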

How to create a new string by combining values corresponding to keys in several dictionaries in Python?

I have two dictionaries:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
and a table consisting of one single column where bond names are contained:
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
I need to replace the name with a string of the following format: EUA21 where the first two letters are the corresponding value to the currency key in the dictionary, the next letter is the value corresponding to the month key and the last two digits are the year from the name.
I tried to split the name using the following code:
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
but I am not sure how to proceed from here: I need to look up the currency and the month in the two dictionaries, join the values, and append the year from the name.
This will give you a list of what you need:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = {'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']}
result = []
for names in bond_names['Names']:
    bond = names.split('.')
    result.append(currency[bond[1]] + time[bond[2]] + bond[3])
print(result)
You can do that like this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency = {'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22', 'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})
bond_names['Names2'] = bond_names['Names'].apply(lambda x: currency[x[5:8]] + time[x[9:12]] + x[-2:])
print(bond_names['Names2'])
# 0 USA21
# 1 USC25
# 2 EUD22
# 3 HFF21
# 4 HFH23
# 5 GBA21
# Name: Names2, dtype: object
With an extended regex substitution (regex=True is required for a callable replacement in newer pandas):
In [42]: bond_names['Names'].str.replace(r'^[^.]+\.([^.]+)\.([^.]+)\.(\d+)', lambda m: '{}{}{}'.format(currency.get(m.group(1), m.group(1)), time.get(m.group(2), m.group(2)), m.group(3)), regex=True)
Out[42]:
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
Name: Names, dtype: object
You can try this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
bond_names['Names'] = bond_names['Names'].apply(lambda x: x.split('.'))
for idx, bond in enumerate(bond_names['Names']):
    currencyID = currency.get(bond[1])
    monthID = time.get(bond[2])
    yearID = bond[3]
    bond_names.loc[idx, 'Names'] = currencyID + monthID + yearID
Output
Names
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
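A vectorized alternative (not from the answers above) splits the column once with str.split(expand=True) and maps the pieces through the two dictionaries; a sketch on the same inputs, writing into a hypothetical Short column:

```python
import pandas as pd

time = {'JAN': 'A', 'FEB': 'B', 'MAR': 'C', 'APR': 'D', 'MAY': 'E', 'JUN': 'F', 'JUL': 'H'}
currency = {'USD': 'US', 'EUR': 'EU', 'GBP': 'GB', 'HUF': 'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25',
                                     'Bond.EUR.APR.22', 'Bond.HUF.JUN.21',
                                     'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})

# split into ['Bond', currency, month, year] columns, then map the lookups
parts = bond_names['Names'].str.split('.', expand=True)
bond_names['Short'] = parts[1].map(currency) + parts[2].map(time) + parts[3]
print(bond_names['Short'].tolist())
```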

How to compute per-token word distance and return the count of 0 distances in a column

I have two descriptions: one in a dataframe and the other as a list of words. I need to compute the Levenshtein distance of each word in the description against each word in the list, and return the count of results where the distance equals 0.
import pandas as pd
definitions=['very','similarity','seem','scott','hello','names']
# initialize list of lists
data = [['hello my name is Scott'], ['I went to the mall yesterday'], ['This seems very similar']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Descriptions'])
# print dataframe.
df
df['lev_count_0']= Column counting the number of all words in each row that computing the Lev distances against each word in the dictionary returns 0
So for example, the first case will be
edit_distance("hello","very") # This will be equal to 4
edit_distance("hello","similarity") # this will be equal to 9
edit_distance("hello","seem") # This will be equal to 4
edit_distance("hello","scott") # This will be equal to 5
edit_distance("hello","hello")# This will be equal to 0
edit_distance("hello","names") # this will be equal to 5
So for the first row in df['lev_count_0'] the result should be 1, since there is just one 0 comparing all words in the Descriptions against the list of Definitions
Description | lev_count_0
hello my name is Scott | 1
My solution
from nltk import edit_distance
import pandas as pd
data = [['hello my name is Scott'], ['I went to the mall yesterday'], ['This seems very similar']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Descriptions'])
dictionary=['Hello', 'my']
def lev_dist(colum):
    count = 0
    dataset = list(colum.split(" "))
    for word in dataset:
        for dic in dictionary:
            result = edit_distance(word, dic)
            if result == 0:
                count = count + 1
    return count
df['count_lev_0'] = df.Descriptions.apply(lev_dist)
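Worth noting: a Levenshtein distance of 0 means the two words are identical, so the count can be computed with plain membership tests and no edit-distance library at all. A sketch against the question's definitions list (matching is case-sensitive, so 'Scott' does not match 'scott'):

```python
import pandas as pd

definitions = ['very', 'similarity', 'seem', 'scott', 'hello', 'names']
data = [['hello my name is Scott'], ['I went to the mall yesterday'], ['This seems very similar']]
df = pd.DataFrame(data, columns=['Descriptions'])

defs = set(definitions)
# distance 0 <=> exact match, so just count words that appear in the set
df['lev_count_0'] = df['Descriptions'].apply(
    lambda s: sum(w in defs for w in s.split()))
print(df['lev_count_0'].tolist())
```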

Create a dataframe to detail information from another dataframe

I have one dataframe with the value, the number of payments, and the start date. I'd like to create a new dataframe with all the payments, one row per month.
Can you guys give a tip about how to finish it?
# Import pandas library
import pandas as pd
# initialize list of lists
data = [[1,'2017-06-09',300,3]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['ID','DATE','VALUE','PAYMENTS'])
# print dataframe.
df
EXISTING DATAFRAME FIELDS:
DESIRED DATAFRAME, expanding the payments and updating the date:
My first thought was to make a loop appending the payments. If in this loop I also filled in the other fields and generated the new dataframe, the task would be done.
result = []
for value in df["PAYMENTS"]:
    if value == 1:
        result.append(1)
    elif value == 3:
        for x in range(1, 4):
            result.append(x)
    else:
        for x in range(1, 7):
            result.append(x)
Here's my try:
df.VALUE = df.VALUE / df.PAYMENTS
df = df.merge(df.ID.repeat(df.PAYMENTS), on='ID', how='outer')
df.PAYMENTS = df.groupby('ID').cumcount() + 1
Output:
ID DATE VALUE PAYMENTS
0 1 2017-06-09 100.0 1
1 1 2017-06-09 100.0 2
2 1 2017-06-09 100.0 3
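The question also asks for the date to be updated per installment, which the answer above leaves unchanged. Building on the same row-repetition idea, and assuming each payment falls exactly one month after the previous (the question leaves this open), a sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, '2017-06-09', 300, 3]],
                  columns=['ID', 'DATE', 'VALUE', 'PAYMENTS'])
df['DATE'] = pd.to_datetime(df['DATE'])

# one row per installment, each worth VALUE / PAYMENTS
out = df.loc[df.index.repeat(df['PAYMENTS'])].reset_index(drop=True)
out['VALUE'] = out['VALUE'] / out['PAYMENTS']
out['PAYMENTS'] = out.groupby('ID').cumcount() + 1

# advance the date one month per installment
out['DATE'] = out['DATE'] + out['PAYMENTS'].sub(1).apply(
    lambda n: pd.DateOffset(months=int(n)))
print(out)
```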

Create a new column in a dataframe with increment number based on another column

Consider the below pandas DataFrame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    'day': [Timestamp('2017-03-27'),
            Timestamp('2017-03-27'),
            Timestamp('2017-04-01'),
            Timestamp('2017-04-03'),
            Timestamp('2017-04-06'),
            Timestamp('2017-04-07'),
            Timestamp('2017-04-11'),
            Timestamp('2017-05-01'),
            Timestamp('2017-05-01')],
    'act_id': ['916298883',
               '916806776',
               '923496071',
               '926539428',
               '930641527',
               '931935227',
               '937765185',
               '966163233',
               '966417205']
})
As you may see, there are 9 unique ids distributed in 7 days.
I am looking for a way to add two new columns.
The first column:
An increment number for each new day. For example 1 for '2017-03-27'(same number for same day), 2 for '2017-04-01', 3 for '2017-04-03', etc.
The second column:
An increment number for each new act_id per day. For example 1 for '916298883', 2 for '916806776' (which is linked to the same day '2017-03-27'), 1 for '923496071', 1 for '926539428', etc.
The final table should look like this
I have already tried to build the first column with apply and a function but it doesn't work as it should.
#Create helper function to give index number to a new column
counter = 1
def giveFlag(x):
    global counter
    index = counter
    counter += 1
    return index
And then:
# Create day flagger column
df_helper['day_no'] = df_helper['day'].apply(lambda x: giveFlag(x))
try this:
days = list(set(df['day']))
days.sort()
day_no = list()
iter_no = list()
for index, day in enumerate(days):
    counter = 1
    for dfday in df['day']:
        if dfday == day:
            iter_no.append(counter)
            day_no.append(index + 1)
            counter += 1
df['day_no'] = pd.Series(day_no).values
df['iter_no'] = pd.Series(iter_no).values
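A loop-free alternative (not from the answer above) uses groupby with ngroup and cumcount, which produce exactly these two counters; a sketch on the question's data:

```python
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    'day': [Timestamp('2017-03-27'), Timestamp('2017-03-27'),
            Timestamp('2017-04-01'), Timestamp('2017-04-03'),
            Timestamp('2017-04-06'), Timestamp('2017-04-07'),
            Timestamp('2017-04-11'), Timestamp('2017-05-01'),
            Timestamp('2017-05-01')],
    'act_id': ['916298883', '916806776', '923496071', '926539428',
               '930641527', '931935227', '937765185', '966163233',
               '966417205']
})

# ngroup numbers the distinct days in sorted order; cumcount numbers rows within each day
df['day_no'] = df.groupby('day').ngroup() + 1
df['iter_no'] = df.groupby('day').cumcount() + 1
print(df[['day_no', 'iter_no']])
```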
