I have pandas dataframe:
df.text[:3]
0 nena shot by me httptcodcrsfqyvh httpstcokxr...
1 full version of soulless httptcowfmcyyu
2 when youre having a good day but then get to w...
Name: text, dtype: object
Basically it just a series with tweets text. Nothing more.
text = df.text
text.index
Int64Index([0, 1, 2, ...], dtype='int64')
Now I want to split words in this series. It works just fine with this one:
df.text.str.split('')
0 [nena shot by me httptcodcrsfqyvh httpstcokx...
1 [full version of soulless httptcowfmcyyu]
2 [when youre having a good day but then get to ...
But id does not work with apply method:
df.text.apply(lambda x: x.split(' '))
and throws an exception: AttributeError: 'float' object has no attribute 'split'
What am I doing wrong and why apply method takes this int index as parameter?
Same thing if I use df.text.map(lambda x: x.split(' '))
UPD
df[df.text == np.nan].shape
(0, 13)
And
df.text[:3]
0 nena shot by me httptcodcrsfqyvh httpstcokxr...
1 full version of soulless httptcowfmcyyu
2 when youre having a good day but then get to w...
Works just fine:
df.text[:3].map(lambda x: x.split())
0 [nena, shot, by, me, httptcodcrsfqyvh, httpstc...
1 [full, version, of, soulless, httptcowfmcyyu]
2 [when, youre, having, a, good, day, but, then,...
Name: text, dtype: object
Related
I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The '20' in '{0:20>4}' must be a singular value and not a string
Trying to do something like the below just results in df['id'] taking the properties of the last lambda & trying any other way to combine multiple apply/lambdas just didn't work. I started going down the pad left/right route but that seemed to be taking be backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because its long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables
Is there a better way than replacing 'XX' for '20'
I feel like this entire code block can be compress to 1 or 2 lines I just don't know how. Everything I've seen on SO and Pandas documentation on highlights/relates to singular manipulation so far.
One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'm having trouble creating a new column based on columns 'language_1' and 'language_2' in python dataframe. I want to create a 'bilingual' column where a '1' represents a user who speaks both English and Spanish(bi-lingual) and a 0 for non-bilingual speakers. Ultimately I want to compare their average ratings to each other, but want to categorize them first. I tried using if statements but I'm not sure how to write an if statement that combines multiple conditions to result in 1 value. Thank you for any help.
===============================================================================================
name language_1 language_2 rating bilingual
Kevin English Null 4.25
Miguel English Spanish 4.56
Carlos English Spanish 4.61
Aaron Null Spanish 4.33
===============================================================================================
Here is the code I've tried to use to append the new column to my dataframe.
def label_bilingual(row):
if row('language_english') == row['English'] and row('language_spanish') == 'Spanish':
val = 1
else:
val = 0
df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
Here is the error I'm getting.
----> 1 df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
'Series' object is not callable
You have a few issues with your function, one which is causing your error and a few more which will cause more problems after.
1 - You have tried to call the column with row('name') which is not callable.
df('row')
Traceback (most recent call last):
File "<pyshell#30>", line 1, in <module>
df('row')
TypeError: 'DataFrame' object is not callable
2 - You have tried to compare row['column'] to row['English'] which will not work, as a column named English does not exist
KeyError: 'English'
3 - You do not return any values
val = 1
val = 0
You need to modify your function as below to resolve these errors.
def label_bilingual(row):
if row['language_1'] == 'English' and row['language_2'] == 'Spanish':
return 1
else:
return 0
Output
>>> df['bilingual'] = df.apply(label_bilingual, axis=1)
>>> df
name language_1 language_2 rating bilingual
0 Kevin English Null 4.25 0
1 Miguel English Spanish 4.56 1
2 Carlos English Spanish 4.61 1
3 Aaron Null Spanish 4.33 0
To make it simpler I'd suggest having missing values in either column as numpy.nan. For example if missing values were recorded as np.nan:
bilingual = np.where(np.isnan(df[['language_1', 'language_2']].values.any(), 0, 1))
df['bilingual'] = bilingual
Here np.where checks condition inside, which in turn checks whether values in either of language columns are missing. And if true, than a person is not bilingual and gets a 0, otherwise 1.
I'm trying to convert a pandas column to a string in order to use str.extract().
When I run print(data.dtypes), this is what I see:
:Address Line 1: object
:City: object
Address Line 2: object
Case Initiation Date: object
Case Number: object
Case Status: object
Defendants object
Demand Amount: object
Motion Status object
Zip: object
6 object
dtype: object
I'm trying to split the data['Motion Status'] variable using a regular expression, but am running into roadblocks. First, here's a quick look at data['Motion Status']:
0 b'01/31/202008:30155'
1 b'02/03/202008:30155'
2 b'02/03/202008:30155'
3 b'02/04/202008:30155'
4 b'02/04/202008:30155'
Name: Motion Status, dtype: object
You'll note that it's of the format dd/mm/yyyy + hh:mm + 3-digit number. This is the code I have been using to try and parse out the date from the time (i'll do the '155' after i've got it working):
data['Motion Status (date)'] = data['Motion Status'].str.extract('\d{2}\/\d{2}\/\d{4}', expand=True)
When I run it, it returns the error TypeError: Cannot use .str.extract with values of inferred dtype 'bytes'. I've tried four different solutions, but none of them have worked (returning the same error message as above when I re-run the str.extract line):
data['Motion Status'] = data['Motion Status'].astype('|S')
data['Motion Status'] = data['Motion Status'].astype('str')
data['Motion Status'] = data['Motion Status'].astype(str)
data.astype(str)['Motion Status'].map(lambda x: type(x))
Can anyone help me out here? I'm really not wedded to converting this variable to a string. I just want to be able to parse out the date, time, and the '155' at the end (it's not always a '155' by the way - only in the first 20 rows or so).
Any help would be appreciated!
Update:
I can now run this line of code data['Motion Status (date)'], data['Time'], data['Other'] = data['Motion Status'].str.extract('(\d{2})/(\d{2})/(\d{4})', expand=True) and it executes without an error. I honestly don't know what i've done to make this run... However, I now run into a slightly different issue where the code creates three new variables but they are filled with zeros, ones, or twos in all rows (and not the parts of the data['Motion Status'] string I was hoping to get. E.g.
Motion Status (date) Time Other
0 0 1 2
1 0 1 2
2 0 1 2
3 0 1 2
4 0 1 2
So i'm not exactly back to square one, but I still haven't managed to be able parse out the different parts of the string.
You have bytes in the column. decode it first using str.decode method:
s
#0 b'02/03/202008:30155'
#1 b'02/03/202008:30155'
#2 b'02/04/202008:30155'
#3 b'02/04/202008:30155'
#dtype: object
s.str.decode('UTF-8').str.extract('(\d{2})/(\d{2})/(\d{4})', expand=True)
# 0 1 2
#0 02 03 2020
#1 02 03 2020
#2 02 04 2020
#3 02 04 2020
Using pandas I have a result (here aresult) from a df.loc lookup that python is telling me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
A result retreives a data format that cannot be keyed:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0] and aresult[1] all to no avail. Before I do something awful like converting it to a string and strip it out, I thought I'd ask here.
A series requires .item() to retrieve its value.
print aresult.item()
1
I have a DataFrame with some user input (it's supposed to just be a plain email address), along with some other values, like this:
import pandas as pd
from pandas import Series, DataFrame
df = pd.DataFrame({'input': ['Captain Jean-Luc Picard <picard#starfleet.com>','deanna.troi#starfleet.com','data#starfleet.com','William Riker <riker#starfleet.com>'],'val_1':[1.5,3.6,2.4,2.9],'val_2':[7.3,-2.5,3.4,1.5]})
Due to a bug, the input sometimes has the user's name as well as brackets around the email address; this needs to be fixed before continuing with the analysis.
To move forward, I want to create a new column that has cleaned versions of the emails: if the email contains names/brackets then remove those, else just give the already correct email.
There are numerous examples of cleaning string data with Python/pandas, but I've yet to find successfully implement any of these suggestions. Here are a few examples of what I've tried:
# as noted in pandas docs, turns all non-matching strings into NaN
df['cleaned'] = df['input'].str.extract('<(.*)>')
# AttributeError: type object 'str' has no attribute 'contains'
df['cleaned'] = df['input'].apply(lambda x: str.extract('<(.*)>') if str.contains('<(.*)>') else x)
# AttributeError: 'DataFrame' object has no attribute 'str'
df['cleaned'] = df[df['input'].str.contains('<(.*)>')].str.extract('<(.*)>')
Thanks!
Use np.where to use the str.extract for those rows that contain the embedded email, for the else condition just return the 'input' value:
In [63]:
df['cleaned'] = np.where(df['input'].str.contains('<'), df['input'].str.extract('<(.*)>'), df['input'])
df
Out[63]:
input val_1 val_2 \
0 Captain Jean-Luc Picard <picard#starfleet.com> 1.5 7.3
1 deanna.troi#starfleet.com 3.6 -2.5
2 data#starfleet.com 2.4 3.4
3 William Riker <riker#starfleet.com> 2.9 1.5
cleaned
0 picard#starfleet.com
1 deanna.troi#starfleet.com
2 data#starfleet.com
3 riker#starfleet.com
If you want to use regular expressions:
import re
rex = re.compile(r'<(.*)>')
def fix(s):
m = rex.search(s)
if m is None:
return s
else:
return m.groups()[0]
fixed = df['input'].apply(fix)