array python split str (too many values to unpack) - python

df.timestamp[1]
Out[191]:
'2016-01-01 00:02:16'
# I need to split these into two features
split1,split2=df.timestamp.str.split(' ')
Out[192]:
ValueError Traceback (most recent call last)
<ipython-input-216-bbe8e968766f> in <module>()
----> 1 split1,split2=df.timestamp.str.split(' ')
ValueError: too many values to unpack

Because you are splitting a Series, the result is also a Series (one list per row), not two separate lists. Unpacking it iterates over the rows, not over the two halves of each string. Use the .str[index] accessor to pull out each part:
df = pd.DataFrame({'timestamp':['2016-01-01 00:02:16','2016-01-01 00:02:16'] })
split1, split2 = df.timestamp.str.split(' ').str[0], df.timestamp.str.split(' ').str[1]
str.split returns a Series of lists, for example:
df.timestamp.str.split(' ')
0 [2016-01-01, 00:02:16]
1 [2016-01-01, 00:02:16]
Name: timestamp, dtype: object
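If the goal is two new columns, a minimal sketch is to pass expand=True so that str.split returns a DataFrame with one column per piece. The column names date and time and the second sample timestamp are assumptions made for illustration; they are not from the question.

```python
import pandas as pd

# Sample data shaped like the question's frame (second value is made up).
df = pd.DataFrame({"timestamp": ["2016-01-01 00:02:16", "2016-01-01 00:05:41"]})

# expand=True returns a DataFrame with one column per piece;
# n=1 splits on the first space only, so there are exactly two pieces.
parts = df["timestamp"].str.split(" ", n=1, expand=True)

df["date"] = parts[0]  # hypothetical column name
df["time"] = parts[1]  # hypothetical column name
print(df)
```

This splits once for both columns instead of calling str.split twice.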

You are using the split() method incorrectly.
Given this:
df.timestamp[1]
Out[191]:
'2016-01-01 00:02:16'
Use the split() method like this:
# I need to split timestamp[1]
split1, split2 = df.timestamp[1].split(' ')  # plain str.split on one element; no .str accessor

Related

Python- trying to make new list combining values from other list

I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
id1 id2
0 a 1
1 b 2
2 c 3
I want to make a list that looks like
new_ids=['a_1','b_2','c_3'] where values are from combining values in row 0 for id1 with values for row 0 for id2 and so on.
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
#making separate lists version
#this function works
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1,idlist2=get_ids(df)
#this is the part that doesn't work
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join(str(idlist1[i]),str(idlist2[j]))
    new_id.append(row)
#------------------------------------------------------------------------
#AttributeError Traceback (most recent call last)
#<ipython-input-44-09983bd890a6> in <module>
# 1 newid_list=[]
# 2 for i in range(len(df)):
#----> 3 n1=df['id1'[i].values]
# 4 n2=df['id2'[i].values]
# 5 nid= str(n1)+"_"+str(n2)
#AttributeError: 'str' object has no attribute 'values'
#skipping making lists (also doesn't work)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'[i].values]
    n2=df['id2'[i].values]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
#---------------------------------------------------------------------------
#TypeError Traceback (most recent call last)
#<ipython-input-41-6b0c949a1ad5> in <module>
# 1 new_id=[]
# 2 for i,j in zip(idlist1,idlist2):
#----> 3 row='_'.join(str(idlist1[i]),str(idlist2[j]))
# 4 new_id.append(row)
# 5 #return ', '.join(new_id)
#TypeError: list indices must be integers or slices, not str
(df.id1 + "_" + df.id2.astype(str)).tolist()
output:
['a_1', 'b_2', 'c_3']
Your approaches (corrected):
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1, idlist2=get_ids(df)
# first approach: i and j are the values themselves, not indices
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join([str(i),str(j)])
    new_id.append(row)
# second approach: index the column directly, without .values
newid_list=[]
for i in range(len(df)):
    n1=df['id1'][i]
    n2=df['id2'][i]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
Points:
In the first approach, the i and j produced by zip are the values themselves, not indices, so use them directly and convert them to strings.
join takes a single iterable, so build a list from the two values, [str(i), str(j)], and pass that to join.
In the second approach, df['id1'][i] already gives the element in row i of the column; you don't need values, which returns all elements of the column as a NumPy array.
If you want to use values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
Try this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index=0
newid_list=[]
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index+=1
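The loops above can also be written as a single list comprehension over the two columns. This is only a sketch of the same idea using the question's mock data; the name new_ids is taken from the question's desired output.

```python
import pandas as pd

df = pd.DataFrame(data=[["a", 1], ["b", 2], ["c", 3]], columns=["id1", "id2"])

# zip pairs the values row by row; the f-string converts each value to text.
new_ids = [f"{a}_{b}" for a, b in zip(df["id1"], df["id2"])]
print(new_ids)
```

Because zip yields the values themselves, no positional indexing is needed.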

Type error when trying to multiply two Lists in Python

I am trying to write a simple program where a new column is added to an existing dataframe. The new column is created by multiplying values of two existing columns.
This is the code I have written :
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df=pd.DataFrame(data)
df=df.to_string(index=False)
print(df)
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
df['Total Amounts']=totalamount
print(df)
Even though I've used the int() method to convert the values back to integer, it still gives the type error, the exact error being:
Traceback (most recent call last):
File "File Path", line 11, in <module>
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
TypeError: string indices must be integers
Earlier, when I did not have the line df=df.to_string(index=False) and also did not use the int() function, there wasn't any error. The list was multiplied, although it was printed in this manner:
[0 400
1 400
2 900
3 1250
4 300
dtype: int64]
But further in the code, where I try to add the list to the DataFrame, it gives the error ValueError: Length of values (1) does not match length of index (5)
I tried to look for other ways to do this, but can't seem to find any. Thank you in advance!
You can try this short answer:
df['Total Amounts'] = df.apply(lambda x: x['No. of Tickets'] * x['Ticket Rate'], axis=1)
output:
# print(df['Total Amounts'])
0 400
1 400
2 900
3 1250
4 300
Name: Total Amounts, dtype: int64
You converted your df to a string and assigned the result back to df, so df is no longer a DataFrame.
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df = pd.DataFrame(data)
string_df = df.to_string(index=False)  # assign it to another variable so df stays a DataFrame
print(string_df)
totalamount = df['No. of Tickets'] * df['Ticket Rate']  # a Series with one value per row, not a one-element list
df['Total Amounts'] = totalamount
print(df)
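Both errors in the question can be reproduced in a few lines. The sketch below uses only the two numeric columns from the question; the names product, wrapped and text are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "No. of Tickets": [4, 2, 6, 5, 3],
    "Ticket Rate": [100, 200, 150, 250, 100],
})

# Multiplying two columns is element-wise and gives one value per row.
product = df["No. of Tickets"] * df["Ticket Rate"]

# Wrapping the Series in [...] makes a list with a single item (the whole
# Series), which is why pandas reports "Length of values (1)".
wrapped = [product]
print(len(product), len(wrapped))

df["Total Amounts"] = product

# to_string returns plain text, so keep it in a separate variable.
text = df.to_string(index=False)
print(type(text).__name__)
```

Indexing text with a column name would raise the TypeError from the question, because text is a str and not a DataFrame.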

Python error: Invalid literal for int() with base 10

Here is the code and the output. I assume is it about the score not being int, but not sure how to convert in this case
df.index = df.columns
rows = []
for i in df.index:
    for c in df.columns:
        if i == c: continue
        score = df.ix[i, c]
        score = [int(row) for row in score.split('-')]
        rows.append([i, c, score[0], score[1]])
df = pd.DataFrame(rows, columns = ['home', 'away', 'home_score', 'away_score'])
df.head()
You're splitting on "-" (U+002D HYPHEN-MINUS), but your data is using some other character. It's hard to say which, since you provided a picture of the error instead of the actual error, but it's probably "–" (U+2013 EN DASH). Fix your split to use the character that actually occurs in the input.
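If the separator is not known for certain, one option is to split on either character. This is a sketch under the assumption that each cell holds exactly two integers separated by a hyphen-minus or an en dash; parse_score is a hypothetical helper, not part of the question's code.

```python
import re

# Matches a hyphen-minus or an en dash (U+2013), with optional spaces around it.
SEPARATOR = re.compile(r"\s*[-\u2013]\s*")

def parse_score(score):
    """Split a score such as '0-3' into a pair of ints (hypothetical helper)."""
    home, away = SEPARATOR.split(score.strip())
    return int(home), int(away)

print(parse_score("0\u20133"))  # en dash
print(parse_score("10-2"))      # hyphen-minus
```

A cell that does not contain exactly two numbers raises ValueError here, which points at the offending data instead of hiding it.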
I think you should just do
score = [int(row) for row in score if row.isnumeric()]
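This takes advantage of the .isnumeric() method of strings. Note that iterating over score visits one character at a time, so it only gives the right result when every score is a single digit.
P.S. You should not be using the .ix indexer with Pandas. It was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.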
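As a sketch, the question's loop can be written with label-based .loc in place of .ix. The two-team table below is made-up sample data, and it assumes the scores use a plain hyphen-minus.

```python
import pandas as pd

# Hypothetical results table: rows are home teams, columns are away teams.
teams = ["A", "B"]
df = pd.DataFrame([["", "2-1"], ["0-3", ""]], index=teams, columns=teams)

rows = []
for home in df.index:
    for away in df.columns:
        if home == away:
            continue  # a team does not play itself
        # .loc selects by row label and column label, like the old .ix call.
        home_score, away_score = (int(s) for s in df.loc[home, away].split("-"))
        rows.append([home, away, home_score, away_score])

result = pd.DataFrame(rows, columns=["home", "away", "home_score", "away_score"])
print(result)
```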
The screenshot you posted identifies the issue: you are trying to call int() on the str '0-3', which can't be done. From my terminal:
In [1]: int('0-3')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-c6bc87cd2bc7> in <module>
----> 1 int('0-3')
ValueError: invalid literal for int() with base 10: '0-3'
Looking at your dataframe it looks like you have a lot of data that can't be turned into an int as is

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged in Python 2.7.
Here is what I'm doing:
def remove_whitespace( x ):
    if isinstance( x, basestring ):
        return x.strip()
    else:
        return x
my_data = my_data.applymap( remove_whitespace )
Is there a better or more Pandas-idiomatic way to do this?
Is there a more efficient way (perhaps by doing things column wise)?
I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.
Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
values
0 ABC
1 DEF
2 GHI
>>> data['values'].str.strip()
0 ABC
1 DEF
2 GHI
Name: values, dtype: object
We want to:
Apply our function to each element in our dataframe - use applymap.
Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).
Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).
Therefore, I've found the following to be the easiest:
df.applymap(lambda x: x.strip() if type(x)==str else x)
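A sketch of the same element-wise idea that also runs on recent pandas: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so the hypothetical helper below uses whichever one is available. The sample frame is made up for illustration.

```python
import pandas as pd

def strip_strings(frame):
    """Strip whitespace from str cells only; other cells pass through unchanged."""
    def strip(x):
        return x.strip() if isinstance(x, str) else x
    # DataFrame.map exists from pandas 2.1 onward; older versions use applymap.
    if hasattr(pd.DataFrame, "map"):
        return frame.map(strip)
    return frame.applymap(strip)

df = pd.DataFrame({"name": [" ABC ", "DEF  "], "mixed": [" x", 7], "n": [1, 2]})
cleaned = strip_strings(df)
print(cleaned)
```

isinstance(x, str) is used instead of type(x)==str so that str subclasses are stripped as well.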
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ddd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
0 1 2 3 4 5
0 1.5 aaa bbb ddd 10 XXX
1 2.5 eee fff ggg 20 YYY
The "data['values'].str.strip()" answer above did not change my DataFrame, because str.strip() returns a new Series and leaves the original untouched. The workaround is simple: strip the column, then assign the stripped Series back to the DataFrame. Below is the example code.
import pandas as pd
data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
print('-----')
print(data)
data['values'].str.strip()  # returns a new Series; data itself is unchanged
print('-----')
print(data)
new = data['values'].str.strip()
data['values'] = new  # assign the stripped Series back to the column
print('-----')
print(new)
Here is a column-wise solution with pandas apply:
import numpy as np
def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col
df = df.apply(strip_obj, axis=0)
This converts every value in object-type columns to a string, so take care with mixed-type columns. For example, a zip code column containing 20001 and ' 21110 ' ends up with '20001' and '21110'.
This worked for me - applies it to the whole dataframe:
def panda_strip(x):
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    return pd.Series(r, index=x.index)  # keep the original row labels
df = df.apply(lambda x: panda_strip(x))
I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.
import pandas as pd
def remove_whitespace(x):
    try:
        # remove all whitespace, both inside and around the string
        x = "".join(x.split())
    except (AttributeError, TypeError):
        # non-string values (numbers, NaN) are left unchanged
        pass
    return x
# Apply remove_whitespace to one column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)
# Apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)

Count individual words in Pandas data frame

I'm trying to count the individual words in a column of my data frame. It looks like this. In reality the texts are Tweets.
text
this is some text that I want to count
That's all I wan't
It is unicode text
So what I found from other stackoverflow questions is that I could use the following:
Count most frequent 100 words from sentences in Dataframe Pandas
Count distinct words from a Pandas Data Frame
My df is called result and this is my code:
from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
3 result2
TypeError: sequence item 25831: expected str instance, float found
The dtype of text is object, which from what I understand is correct for unicode text data.
The issue occurs because some of the values in your series (result['text']) are of type float (typically NaN, which pandas uses for missing values). If you want to include them in ' '.join() as well, you need to convert the floats to strings before passing them to str.join().
You can use Series.astype() to convert all the values to strings. Also, you do not need .tolist(); you can pass the series directly to str.join(). Example -
result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()
Demo -
In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])
In [61]: df
Out[61]:
A
0 blah
1 asd
2 10.1
In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])
TypeError: sequence item 2: expected str instance, float found
In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'
In the end I went with the following code:
pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words
The problem was however solved by Anand S Kumar.
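Note that astype(str) turns missing values into the literal word 'nan', which is then counted like any other word. A sketch that drops missing rows first and stays within pandas is shown below; the sample frame is made up, with a NaN standing in for the float value that caused the error.

```python
import pandas as pd

result = pd.DataFrame(
    {"text": ["this is some text", "It is unicode text", float("nan")]}
)

counts = (
    result["text"]
    .dropna()        # skip missing rows instead of counting the word 'nan'
    .str.lower()
    .str.split()     # no argument: split on any run of whitespace
    .explode()       # one row per word
    .value_counts()
)
print(counts.head(100))
```

Splitting with no argument also avoids the empty strings that split(" ") produces when a text contains consecutive spaces.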
