I have a column in my pandas dataframe with the following values that represent hours worked in a week.
0 40
1 40h / week
2 46.25h/week on average
3 11
I would like to check every row and, if the length of the value is more than 2 characters, extract only the number of hours from it.
I have tried the following:
df['Hours_per_week'].apply(lambda x: (x.extract('(\d+)') if(len(str(x)) > 2) else x))
However, I am getting AttributeError: 'str' object has no attribute 'extract'.
It looks like you could ensure there is an h after the number:
df['Hours_per_week'].str.extract(r'(\d{2}\.?\d*)h', expand=False)
Output:
0 NaN
1 40
2 46.25
3 NaN
Name: Hours_per_week, dtype: object
Assuming the series data are strings, try this:
df['Hours_per_week'].str.extract(r'(\d+)')
Why not extract the float pattern directly, i.e. \d+\.?\d+?
>>> s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])
>>> s.str.extract(r"(\d+\.?\d+)")
0
0 40
1 40
2 46.25
3 11
2-digit values will still match either way.
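Note that whichever pattern you use, str.extract returns strings (dtype: object above). If you need numeric hours, here is a minimal follow-up sketch, assuming the float pattern from this answer:
import pandas as pd

s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])
# Extract the number as a string, then convert to float.
hours = pd.to_numeric(s.str.extract(r'(\d+\.?\d+)', expand=False))
print(hours.dtype)  # float64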
I have a dataframe and I want to change some elements of a column based on a condition.
In particular, given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
The 'VALUE' column contains both strings and numbers; when I find a particular substring in a string value, I want to obtain the same value without that substring.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both of these attempts return IndexError: invalid index to scalar variable.
Any advice?
As the column contains both numeric (non-string) and string values, you cannot use .str.replace(), which handles strings only and converts non-string elements to NaN. You have to use .replace() instead.
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace() instead, the numeric (non-string) values become NaN:
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
In general, if you want to remove leading alphabet substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
>>> df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
Your second solution fails because you also need to apply the row selector to the right side of your assignment.
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
Looking at your sample data, if K is the only problem, just replace it with an empty string (casting to str first, since the column mixes numbers and strings):
df['VALUE'].astype(str).str.replace('K', '')
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you want to do it only for specific occurrences or positions of K, you can do that as well.
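A small sketch of that idea (the exact rule is an assumption, since the question only shows a leading 'KKK'):
import pandas as pd

df = pd.DataFrame({'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]})
# Strip only a leading run of K characters; .replace handles the mixed-type column.
leading_only = df['VALUE'].replace(r'^K+', '', regex=True)
# Or cast to string and remove just the first K of each value via the n argument.
first_k_only = df['VALUE'].astype(str).str.replace('K', '', n=1)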
I am currently working with a dataset that has a column with the following set up:
'age'
20
25
30
35
etc.
I am trying to convert the column to the following:
'age'
'twenty'
'twenty-five'
etc.
I tried to accomplish this using the imported num2words library and a map:
df['age'] = df['age'].map(lambda x: num2words(x))
But I get an AttributeError. The data in age is originally stored as int32, so I am not too sure what would cause it. Any help is appreciated!
df = pd.DataFrame({"k":[1,2,3,4,5,6]})
df.k.map(lambda x: num2words(x))
0 one
1 two
2 three
3 four
4 five
5 six
Name: k, dtype: object
df = pd.DataFrame({"k":["125","2","3","4",5.0,6.0]})
df.k.map(lambda x: num2words(x))
0 one hundred and twenty-five
1 two
2 three
3 four
4 five
5 six
Name: k, dtype: object
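Since the question doesn't include the full traceback, the exact cause of the AttributeError is unclear; a defensive sketch (both the import form and the cast to a plain Python int are assumptions about what might be going wrong, given the int32 column):
import pandas as pd
from num2words import num2words

df = pd.DataFrame({'age': [20, 25, 30, 35]}, dtype='int32')
# Cast the NumPy int32 scalar to a built-in int before handing it to num2words.
df['age_words'] = df['age'].map(lambda x: num2words(int(x)))
print(df['age_words'].tolist())  # ['twenty', 'twenty-five', 'thirty', 'thirty-five']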
I am very sorry if this is a very basic question, but unfortunately I'm failing miserably at figuring out the solution.
I need to subtract the first value within a column (in this case column 8 in my df) from the last value and divide this by a number (e.g. 60), after having applied groupby to my pandas df, to get one value per id. The final output would ideally look something like this:
id
1 1523
2 1644
I have the actual equation which works on its own when applied to the entire column of the df:
(df.iloc[-1,8] - df.iloc[0,8])/60
However, I fail to combine this part with the groupby function. Among other things, I tried apply, which doesn't work:
df.groupby(['id']).apply((df.iloc[-1,8] - df.iloc[0,8])/60)
I also tried creating a function with the equation part and then doing apply(func), but so far none of my attempts have worked. Any help is much appreciated, thank you!
Demo:
In [204]: df
Out[204]:
id val
0 1 12
1 1 13
2 1 19
3 2 20
4 2 30
5 2 40
In [205]: df.groupby(['id'])['val'].agg(lambda x: (x.iloc[-1] - x.iloc[0])/60)
Out[205]:
id
1 0.116667
2 0.333333
Name: val, dtype: float64
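To stay closer to your own equation, you can also select the column first and then apply the same first/last arithmetic per group; a sketch on the demo frame above (for your real data, select column 8 by name or via df.columns[8] instead of 'val'):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [12, 13, 19, 20, 30, 40]})
# Last value minus first value within each group, divided by 60.
out = df.groupby('id')['val'].apply(lambda s: (s.iloc[-1] - s.iloc[0]) / 60)
print(out)
# id
# 1    0.116667
# 2    0.333333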
I am trying to concatenate these three columns, but the code I am using gives me the following output (I changed the format of all the columns to string):
Income_Status_Number Income_Stability_Number Product_Takeup_Number Permutation
1 1 2 1.012
2 1 3 2.013
1 1 1 1.011
This is the code I used:
df['Permutation']=df['Income_Status_Number'].astype(str)+""+df['Income_Stability_Number'].astype(str)+""+df['Product_Takeup_Number'].astype(str)
But I want my output to look like this:
Income_Status_Number Income_Stability_Number Product_Takeup_Number Permutation
1 1 2 112
2 1 3 213
1 1 1 111
Please help.
The issue is that the first column is being treated as a float instead of an int. A simple way to solve this is to sum the values with multipliers that shift each number into the correct place, which keeps the result numeric:
df['Permutation'] = df['Income_Status_Number']*100 + df['Income_Stability_Number']*10 + df['Product_Takeup_Number']
Another solution is to use astype(int).astype(str) to convert the numbers first, but that solution is somewhat slower:
10000 Runs Each
as_type
Total: 9.7106s
Avg: 971059.8162ns
Maths
Total: 7.0491s
Avg: 704909.3242ns
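For reference, a sketch of how such a timing comparison could be reproduced with timeit; the harness and toy frame below are my own assumptions, not necessarily what produced the numbers above:
import timeit
import pandas as pd

# The first column is deliberately float to mirror the question.
df = pd.DataFrame({'Income_Status_Number': [1.0, 2.0, 1.0],
                   'Income_Stability_Number': [1, 1, 1],
                   'Product_Takeup_Number': [2, 3, 1]})

def as_type():
    return (df['Income_Status_Number'].astype(int).astype(str)
            + df['Income_Stability_Number'].astype(str)
            + df['Product_Takeup_Number'].astype(str))

def maths():
    return (df['Income_Status_Number'] * 100
            + df['Income_Stability_Number'] * 10
            + df['Product_Takeup_Number'])

print('as_type:', timeit.timeit(as_type, number=10000))
print('maths:  ', timeit.timeit(maths, number=10000))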
It looks like your first column is being read as a float right before you convert it to a string.
df['Permutation']=df['Income_Status_Number'].astype(int).astype(str)+df['Income_Stability_Number'].astype(int).astype(str)+df['Product_Takeup_Number'].astype(int).astype(str)
Try the following code to add a 'Permutation' column to your data frame formatted in the way you wanted:
df['Permutation'] = df[df.columns[0:]].apply(lambda x: ''.join(x.dropna().astype(int).astype(str)),axis=1)
Which gives you the following dataframe:
df
Income_Status_Number Income_Stability_Number Product_Takeup_Number \
0 1 1 2
1 2 1 3
2 1 1 1
Permutation
0 112
1 213
2 111
I hope this one will work for you.
df['Permutation'] = df[df.columns].apply(lambda x: ''.join(x.dropna().astype(int).astype(str)), axis=1)
I have a lot of experience programming in Matlab; now I'm using Python and I just can't get this to work... I have a dataframe containing a column with timecodes like 00:00:00.033.
timecodes = ['00:00:01.001', '00:00:03.201', '00:00:09.231', '00:00:11.301', '00:00:20.601', '00:00:31.231', '00:00:90.441', '00:00:91.301']
df = pd.DataFrame(timecodes, columns=['TimeCodes'])
All my inputs are around 90 seconds or less, so I want to create a column with just the seconds as a float. To do this, I need to select position 6 to the end and convert that to a float, which I can do for the first row like this:
float(df['TimeCodes'][0][6:])
This works just fine, but if I now want to create a whole new column 'Time_sec', the following does not work:
df['Time_sec'] = float(df['TimeCodes'][:][6:])
Because df['TimeCodes'][:][6:] takes rows 6 through the last row, while I want positions 6 through the end WITHIN each row. This also does not work:
df['Time_sec'] = float(df['TimeCodes'][:,6:])
Do I need to make a loop? There must be a better way... And why does df['TimeCodes'][:][6:] not work?
You can use the slice string method and then cast the whole thing to a float:
In [13]: df["TimeCodes"].str.slice(6).astype(float)
Out[13]:
0 1.001
1 3.201
2 9.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: float64
As to why df['TimeCodes'][:][6:] doesn't work: it ends up chaining selections. First you grab the pd.Series for the TimeCodes column, then [:] selects all of the items from that Series, and then [6:] selects the items with index 6 or higher.
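To see that concretely, a quick sketch with the frame from the question:
import pandas as pd

timecodes = ['00:00:01.001', '00:00:03.201', '00:00:09.231', '00:00:11.301',
             '00:00:20.601', '00:00:31.231', '00:00:90.441', '00:00:91.301']
df = pd.DataFrame(timecodes, columns=['TimeCodes'])

# [:] just copies the Series; [6:] then selects rows 6 onward,
# not characters 6 onward within each string.
print(df['TimeCodes'][:][6:])
# 6    00:00:90.441
# 7    00:00:91.301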
Solution: index with .str and cast to float with astype:
print (df["TimeCodes"].str[6:])
0 01.001
1 03.201
2 09.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: object
df['new'] = df["TimeCodes"].str[6:].astype(float)
print (df)
TimeCodes new
0 00:00:01.001 1.001
1 00:00:03.201 3.201
2 00:00:09.231 9.231
3 00:00:11.301 11.301
4 00:00:20.601 20.601
5 00:00:31.231 31.231
6 00:00:90.441 90.441
7 00:00:91.301 91.301