I'm trying to count the individual words in a column of my data frame. It looks like this. In reality the texts are Tweets.
text
this is some text that I want to count
That's all I wan't
It is unicode text
So what I found from other stackoverflow questions is that I could use the following:
Count most frequent 100 words from sentences in Dataframe Pandas
Count distinct words from a Pandas Data Frame
My df is called result and this is my code:
from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
3 result2
TypeError: sequence item 25831: expected str instance, float found
The dtype of text is object, which from what I understand is correct for unicode text data.
The issue occurs because some of the values in your series (result['text']) are of type float. If you want to include them in ' '.join() as well, you need to convert the floats to strings before passing them to str.join().
You can use Series.astype() to convert all the values to strings. Also, you do not need .tolist(); you can pass the series directly to str.join(). Example -
result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()
Demo -
In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])
In [61]: df
Out[61]:
A
0 blah
1 asd
2 10.1
In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])
TypeError: sequence item 2: expected str instance, float found
In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'
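In an object column that holds text, the stray floats are often NaN values from missing rows. In that case astype(str) turns them into the literal word 'nan', which then gets counted. A minimal sketch, assuming that is what is happening here and that dropping the missing rows is acceptable (the sample frame is made up for illustration):

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `result` frame; one row is missing.
result = pd.DataFrame({"text": ["this is some text", np.nan, "some more text"]})

# Drop missing values first so they are not counted as the word "nan".
texts = result["text"].dropna().astype(str)

# split() with no argument also collapses runs of whitespace.
word_counts = Counter(" ".join(texts).split())
print(word_counts.most_common(3))
```

If the floats are real numbers that should be counted as tokens, keep astype(str) and skip the dropna() step.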
In the end I went with the following code:
pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words
The problem was however solved by Anand S Kumar.
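For anyone who prefers to stay inside pandas, the same count can probably be done without joining everything into one big string. This is only a sketch, assuming pandas 0.25 or newer (for Series.explode) and a made-up sample frame; missing rows stay NaN through the chain and value_counts() ignores them by default:

```python
import pandas as pd

# Hypothetical stand-in for the real `result` frame; one row is missing.
result = pd.DataFrame({"text": ["This is some text", None, "some more TEXT"]})

words = (
    result["text"]
    .str.lower()      # normalise case; missing rows stay NaN
    .str.split()      # one list of words per row
    .explode()        # one word per row
    .value_counts()   # NaN is dropped by default
    .head(100)        # keep the 100 most frequent words
)
print(words)
```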
I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
id1 id2
0 a 1
1 b 2
2 c 3
I want to make a list that looks like
new_ids=['a_1','b_2','c_3'] where values are from combining values in row 0 for id1 with values for row 0 for id2 and so on.
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
#making separate lists version
#this function works
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1,idlist2=get_ids(df)
#this is the part that doesn't work
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join(str(idlist1[i]),str(idlist2[j]))
    new_id.append(row)
#------------------------------------------------------------------------
#AttributeError Traceback (most recent call last)
#<ipython-input-44-09983bd890a6> in <module>
# 1 newid_list=[]
# 2 for i in range(len(df)):
#----> 3 n1=df['id1'[i].values]
# 4 n2=df['id2'[i].values]
# 5 nid= str(n1)+"_"+str(n2)
#AttributeError: 'str' object has no attribute 'values'
#skipping making lists (also doesn't work)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'[i].values]
    n2=df['id2'[i].values]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
#---------------------------------------------------------------------------
#TypeError Traceback (most recent call last)
#<ipython-input-41-6b0c949a1ad5> in <module>
# 1 new_id=[]
# 2 for i,j in zip(idlist1,idlist2):
#----> 3 row='_'.join(str(idlist1[i]),str(idlist2[j]))
# 4 new_id.append(row)
# 5 #return ', '.join(new_id)
#TypeError: list indices must be integers or slices, not str
(df.id1 + "_" + df.id2.astype(str)).tolist()
output:
['a_1', 'b_2', 'c_3']
Your approaches (corrected):
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1, idlist2=get_ids(df)
#corrected version of the part that didn't work
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join([str(i),str(j)])
    new_id.append(row)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'][i]
    n2=df['id2'][i]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
Points:
In the first approach, when you loop over zip(idlist1, idlist2), i and j are the values themselves, not indices, so use them directly and convert them to strings.
join takes a single iterable, so build a list from the two values, [str(i), str(j)], and pass that to join.
In the second approach, you can get each element of a column with df['id1'][i]; you don't need values, which returns all elements of the column as a NumPy array.
If you want to use values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
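If you would rather keep an explicit loop but avoid indexing altogether, a list comprehension over zip() is a common middle ground. A small sketch using the mock data from the question (new_ids is just the name used there for the desired result):

```python
import pandas as pd

df = pd.DataFrame(data=[["a", 1], ["b", 2], ["c", 3]], columns=["id1", "id2"])

# zip() pairs up the values row by row; the f-string converts each one to text.
new_ids = [f"{a}_{b}" for a, b in zip(df["id1"], df["id2"])]
print(new_ids)
```

This does not depend on the index being 0..n-1, which df['id1'][i] does.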
Try this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index=0
newid_list=[]
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index+=1
I am trying to write a simple program where a new column is added to an existing dataframe. The new column is created by multiplying values of two existing columns.
This is the code I have written :
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df=pd.DataFrame(data)
df=df.to_string(index=False)
print(df)
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
df['Total Amounts']=totalamount
print(df)
Even though I've used the int() method to convert the values back to integer, it still gives the type error, the exact error being:
Traceback (most recent call last):
File "File Path", line 11, in <module>
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
TypeError: string indices must be integers
Earlier, when I did not have the df=df.to_string(index=False) line and also did not use the int() function, there wasn't any error. The list was multiplied, although it was printed in this manner:
[0 400
1 400
2 900
3 1250
4 300
dtype: int64]
But further in the code where I try to add the list to the Dataframe it gives the error ValueError: Length of values (1) does not match length of index (5)
I tried to look for other ways to do this, but can't seem to find any. Thank you in advance!
You can try this short answer:
df['Total Amounts'] = df.apply(lambda x: x['No. of Tickets'] * x['Ticket Rate'], axis=1)
output:
# print(df['Total Amounts'])
0 400
1 400
2 900
3 1250
4 300
Name: Total Amounts, dtype: int64
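Since both columns are numeric, the row-wise apply is not strictly needed: multiplying the two Series directly gives the same result element by element. A minimal sketch with only the two relevant columns from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "No. of Tickets": [4, 2, 6, 5, 3],
    "Ticket Rate": [100, 200, 150, 250, 100],
})

# Series * Series multiplies matching rows; no loop, list, or int() needed.
df["Total Amounts"] = df["No. of Tickets"] * df["Ticket Rate"]
print(df)
```

Note that the product is assigned as a Series. Wrapping it in a list gives a one-element list, which is what triggers the "Length of values (1) does not match length of index (5)" error.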
You converted your df to a string and then reassigned the result to df, so df is no longer a DataFrame. Keep the string in a separate variable instead:
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df = pd.DataFrame(data)
string_df = df.to_string(index=False) #Assign it to another variable!
print(string_df)
totalamount = df['No. of Tickets'] * df['Ticket Rate'] #a Series; do not wrap it in a list
df['Total Amounts'] = totalamount
print(df)
Here is the code and the output. I assume it is about the score not being an int, but I'm not sure how to convert it in this case.
df.index = df.columns
rows = []
for i in df.index:
    for c in df.columns:
        if i == c: continue
        score = df.ix[i, c]
        score = [int(row) for row in score.split('-')]
        rows.append([i, c, score[0], score[1]])
df = pd.DataFrame(rows, columns = ['home', 'away', 'home_score', 'away_score'])
df.head()
You're splitting on "-" (U+002D HYPHEN-MINUS), but your data is using some other character. It's hard to say which, since you provided a picture of the error instead of the actual error text, but it's probably "–" (U+2013 EN DASH). Fix your split to use the character that actually occurs in the input.
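If you can't be sure which dash the data uses, one option is to split on either. This is just a sketch: parse_score is a made-up helper name, and the en dash is a guess about the input, so check the real characters first (for example with repr(score)):

```python
import re


def parse_score(score):
    """Split a score such as '0-3' or '0–3' into a pair of ints.

    Hypothetical helper: accepts HYPHEN-MINUS (U+002D) or EN DASH (U+2013),
    with optional whitespace around the separator.
    """
    home, away = re.split(r"\s*[-\u2013]\s*", score.strip())
    return int(home), int(away)


print(parse_score("0-3"))
print(parse_score("2\u20131"))
```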
I think you should just do
score = [int(row) for row in score if row.isnumeric()]
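Take advantage of the .isnumeric() method of strings. Note that this iterates over the string one character at a time, so it only works for single-digit scores.
P.S. You should not be using the .ix indexer with pandas. It was deprecated in pandas 0.20 and removed in 1.0; use .loc or .iloc instead.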
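For reference, here is roughly what the loop from the question looks like with .loc in place of .ix. The 2x2 grid is invented for illustration and assumes the scores really are separated by a plain hyphen:

```python
import pandas as pd

# Hypothetical results grid: rows are home teams, columns are away teams.
teams = ["A", "B"]
df = pd.DataFrame([[None, "1-0"], ["2-2", None]], index=teams, columns=teams)

rows = []
for home in df.index:
    for away in df.columns:
        if home == away:
            continue
        # .loc does the label-based lookup that .ix was being used for.
        home_score, away_score = (int(part) for part in df.loc[home, away].split("-"))
        rows.append([home, away, home_score, away_score])

long_df = pd.DataFrame(rows, columns=["home", "away", "home_score", "away_score"])
print(long_df)
```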
The screenshot you posted identifies the issue. You are trying to call int() on the str '0-3', which can't be done. From my terminal:
In [1]: int('0-3')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-c6bc87cd2bc7> in <module>
----> 1 int('0-3')
ValueError: invalid literal for int() with base 10: '0-3'
Looking at your dataframe, it looks like you have a lot of data that can't be turned into an int as is.
I am new to Python. I have a data frame with a column named 'Name'. The column contains different types of accents, and I am trying to remove them. For example, rubén => ruben, zuñiga => zuniga, etc. I wrote the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata
data=pd.read_csv('transactions.csv')
data.head()
nm=data['Name']
normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
I am getting error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-41-1410866bc2c5> in <module>()
1 nm=data['Name']
----> 2 normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
TypeError: normalize() argument 2 must be unicode, not Series
You get that error because normalize requires a single string as its second argument, not a Series of strings. I found an example of this online:
unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
Try this for one column:
nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
Try this for multiple columns:
obj_cols = data.select_dtypes(include=['O']).columns
data[obj_cols] = data[obj_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
Try this for one column:
df[column_name] = df[column_name].apply(lambda x: unicodedata.normalize(u'NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8'))
Change the column name according to your data columns.
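A slightly different take, in case the column has missing values or characters you don't want to lose: instead of encoding to ASCII (which silently drops anything it can't represent), you can decompose the text and remove only the combining marks. This is a variant of the approach above, not the same method, and strip_accents is a made-up helper name:

```python
import unicodedata

import pandas as pd


def strip_accents(value):
    """Remove combining accent marks; leave non-string values untouched."""
    if not isinstance(value, str):
        return value  # e.g. NaN or None from missing rows
    decomposed = unicodedata.normalize("NFKD", value)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample standing in for data['Name'].
names = pd.Series(["rubén", "zuñiga", None])
cleaned = names.map(strip_accents)
print(cleaned)
```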
Array python split str (too many values to unpack)
df.timestamp[1]
Out[191]:
'2016-01-01 00:02:16'
# I need to split these into two features
split1,split2=df.timestamp.str.split(' ')
Out[192]:
ValueError Traceback (most recent call last)
<ipython-input-216-bbe8e968766f> in <module>()
----> 1 split1,split2=df.timestamp.str.split(' ')
ValueError: too many values to unpack
Use .str[index]: since you are splitting a series, the output is also a series (of lists), not two separate lists.
df = pd.DataFrame({'timestamp':['2016-01-01 00:02:16','2016-01-01 00:02:16'] })
split1,split2 = df.timestamp.str.split(' ').str[0], df.timestamp.str.split(' ').str[1]
str.split will return a series, for example:
df.timestamp.str.split(' ')
0 [2016-01-01, 00:02:16]
1 [2016-01-01, 00:02:16]
Name: timestamp, dtype: object
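Another way to get the two parts as separate columns is the expand=True option of str.split, which returns a DataFrame with one column per piece. A minimal sketch (the column names date and time are just illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2016-01-01 00:02:16", "2016-01-02 10:30:00"]})

# expand=True returns a DataFrame: column 0 is the date, column 1 the time.
parts = df["timestamp"].str.split(" ", expand=True)
df["date"] = parts[0]
df["time"] = parts[1]
print(df)
```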
You are using the split() method incorrectly.
Given this:
df.timestamp[1]
Out[191]:
'2016-01-01 00:02:16'
Use split() method like this:
# I need to split timestamp[1]
split1, split2 = df.timestamp[1].split(' ') # remove str.