Converting an entire column of numbers in pandas to word equivalent - python

I am currently working with a dataset that has a column with the following set up:
'age'
20
25
30
35
etc.
I am trying to convert the column to the following:
'age'
'twenty'
'twenty-five'
etc.
I tried to accomplish this using the imported num2words library with a map:
df['age'] = df['age'].map(lambda x: num2words(x))
But I get an attribute error. The data in age is originally stored as an int32 dtype, so I am not too sure what else would cause it. Any help is appreciated!

df = pd.DataFrame({"k":[1,2,3,4,5,6]})
df.k.map(lambda x: num2words(x))
0 one
1 two
2 three
3 four
4 five
5 six
Name: k, dtype: object
df = pd.DataFrame({"k":["125","2","3","4",5.0,6.0]})
df.k.map(lambda x: num2words(x))
0 one hundred and twenty-five
1 two
2 three
3 four
4 five
5 six
Name: k, dtype: object
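If the error persists on an int32 column, one hedged sketch is to cast each value to a plain Python int before passing it on. Here WORDS is a hypothetical stand-in for num2words, so the example runs without the library installed:

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 25, 30, 35]}).astype("int32")

# Casting each numpy int32 with int() first avoids passing numpy scalar
# types into a library that may expect plain Python numbers. The WORDS
# dict is only a stand-in for num2words here.
WORDS = {20: "twenty", 25: "twenty-five", 30: "thirty", 35: "thirty-five"}
df["age"] = df["age"].map(lambda x: WORDS[int(x)])
print(df["age"].tolist())  # ['twenty', 'twenty-five', 'thirty', 'thirty-five']
```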


Modifying pandas row value based on its length

I have a column in my pandas dataframe with the following values that represent hours worked in a week.
0 40
1 40h / week
2 46.25h/week on average
3 11
I would like to check every row, and if the length of the value is larger than 2 digits, extract only the number of hours from it.
I have tried the following:
df['Hours_per_week'].apply(lambda x: (x.extract('(\d+)') if(len(str(x)) > 2) else x))
However I am getting the AttributeError: 'str' object has no attribute 'extract' error.
It looks like you could rely on there being an h after the number:
df['Hours_per_week'].str.extract(r'(\d{2}\.?\d*)h', expand=False)
Output:
0 NaN
1 40
2 46.25
3 NaN
Name: Hours_per_week, dtype: object
Assuming the series data are strings, try this:
df['Hours_per_week'].str.extract(r'(\d+)')
Why not extract a float pattern immediately, i.e. r'\d+\.?\d+'?
>>> s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])
>>> s.str.extract(r"(\d+\.?\d+)")
0
0 40
1 40
2 46.25
3 11
A 2-digit value will still match either way.
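Putting the two answers together, a hedged sketch that handles both the plain-number rows and the annotated ones (note the trailing \d* so a single digit would also match):

```python
import pandas as pd

s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])

# Extract the leading number (one or more digits, optional decimal part);
# expand=False returns a Series instead of a one-column DataFrame.
hours = s.str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
print(hours.tolist())  # [40.0, 40.0, 46.25, 11.0]
```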

How to avoid datatype mismatch within pandas column

I have a dataframe which can be generated from code below
df = pd.DataFrame({'person_id' :['13423523234527afefc9586e8cec5ae2e5c5d46aedcbe6a5652fa0615e92c3ee84bc32792826','123253252364334527afefc9586e8cec536ae2e5c5d46aedcbe6a5652fa0615e92c3ee84bc32792826','123443643643527afefc9586e8cec5346ae2e5c5d46aedcbe6a5652fa0615e92c3ee84bc32792826','1234523463434312de3c1a186a623642a6699bb2f5ab570c37985ec13ed33582486b51aa1234567','123452312de3c1a186a622a6693469bb2f5ab570c37985ec13ed33554321b51aa8891808','1234523146363462de3c1a186a622a3466699bb2f5ab570c37985ec13ed331234551aa8891808','123452312de3c143643a186a622a6699634bb2f5ab570c37985ec13ed12345676b51aa8891808',np.nan,2],'level_1': ['L1FR','L1Date','L1value','L1FR','L1Date','L1value','L2FR','L2Date','L2value'], 'val3':['Fasting','11/4/2005',1.33,'Random','18/1/2007',4.63,'Fasting','18/1/2017',8.63]})
I would like to extract the numeric portion (only 9 digits) from the person_id column, for which I tried the below:
df.fillna(0,inplace=True)
df.person_id.apply(lambda x: int(''.join(filter(str.isdigit, str(x)))))
In the above code, if I don't use str(x), it throws an error because elements 0 (7th row after filling na) and 2 (8th row) are of integer type.
How can the datatype of elements be different from the datatype of the column?
What should I do so that my output is the 9-digit numeric portion for each row?
Use pandas.Series.str.findall:
df.fillna(0, inplace=True)
df['person_id'] = df['person_id'].astype(str)
df['extracted'] = df['person_id'].str.findall(r'\d+').apply(lambda x: ''.join(x)[:9])
print(df['extracted'])
Output:
0 123452795
1 123452795
2 123452795
3 123452312
4 123452312
5 123452312
6 123452312
7 0
8 2
Name: extracted, dtype: object
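An alternative sketch on a shortened version of the IDs, stripping the non-digits with str.replace instead of findall:

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the long alphanumeric IDs in the question.
df = pd.DataFrame({'person_id': ['134235232abc34527afefc', np.nan, 2]})
df.fillna(0, inplace=True)

# Remove every non-digit character, then keep the first 9 digits.
extracted = df['person_id'].astype(str).str.replace(r'\D', '', regex=True).str[:9]
print(extracted.tolist())  # ['134235232', '0', '2']
```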

Subtract 2 values from one another within 1 column after groupby

I am very sorry if this is a very basic question but unfortunately, I'm failing miserably at figuring out the solution.
I need to subtract the first value within a column (in this case column 8 in my df) from the last value and divide this by a number (e.g. 60), after having applied groupby to my pandas df, to get one value per id. The final output would ideally look something like this:
id
1 1523
2 1644
I have the actual equation which works on its own when applied to the entire column of the df:
(df.iloc[-1,8] - df.iloc[0,8])/60
However I fail to combine this part with the groupby function. Among others, I tried apply, which doesn't work.
df.groupby(['id']).apply((df.iloc[-1,8] - df.iloc[0,8])/60)
I also tried creating a function with the equation part and then doing apply(func), but so far none of my attempts have worked. Any help is much appreciated, thank you!
Demo:
In [204]: df
Out[204]:
id val
0 1 12
1 1 13
2 1 19
3 2 20
4 2 30
5 2 40
In [205]: df.groupby(['id'])['val'].agg(lambda x: (x.iloc[-1] - x.iloc[0])/60)
Out[205]:
id
1 0.116667
2 0.333333
Name: val, dtype: float64
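An equivalent sketch using the built-in first/last aggregations instead of a Python-level lambda:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [12, 13, 19, 20, 30, 40]})

# last() and first() each return one value per group, so the whole
# computation stays vectorised at the group level.
g = df.groupby('id')['val']
result = (g.last() - g.first()) / 60
print(result)  # id 1 -> 7/60, id 2 -> 20/60
```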

python dask dataframes - concatenate groupby.apply output to a single data frame

I am using dask dataframe.groupby().apply()
and get a dask Series as a return value.
I am mapping each group to a list of triplets such as (a, b, 1) and then wish to turn all the triplets into a single dask data frame.
I am using this code in the end of the mapping function to return the triplets as a dask df
# assume here that trip is a generator of triplets such as you would produce from itertools.product([l1, l2, l3])
trip = list(itertools.chain.from_iterable(trip))
df = pd.DataFrame.from_records(trip)
return dd.from_pandas(df, npartitions=1)
then when I try to use something similar to pandas concat with dask concatenate
Assume the result of the apply function is the variable result.
I am trying to use
import dask.dataframe as dd
dd.concat(result, axis=0)
and get the error
raise TypeError("dfs must be a list of DataFrames/Series objects")
TypeError: dfs must be a list of DataFrames/Series objects
But when I check for the type of result using
print(type(result))
I get
output: class 'dask.dataframe.core.Series'
What is the proper way to apply a function over groups of dask groupby object and get all the results into one dataframe?
Thanks
edit:--------------------------------------------------------------
in order to produce the use case, assume this fake data generation
import random
import pandas as pd
import dask.dataframe as dd
people = [[random.randint(1,3), random.randint(1,3), random.randint(1,3)] for i in range(1000)]
ddf = dd.from_pandas(pd.DataFrame.from_records(people, columns=["first name", "last name", "cars"]), npartitions=1)
Now my mission is to group people by first and last name (e.g. all the people with the same first name and last name), and then I need to get a new dask data frame which will contain how many cars each group had.
Assume that the apply function can return either a series of lists of tuples e.g [(name,name,cars count),(name,name,cars count)] or a data frame with the same columns - name, name, car count.
Yes, I know that this particular use case can be solved in another way, but please trust me, my use case is more complex. I cannot share the data and cannot generate any similar data, so let's use dummy data :-)
The challenge is to connect all the results of the apply into a single dask data frame (pandas data frame will be a problem here, data will not fit in memory - so transitions via a pandas data frame will be a problem)
This works for me if the output of apply is a pandas DataFrame; at the end, convert to a dask DataFrame if necessary:
def f(x):
    trip = ((1,2,x) for x in range(3))
    df = pd.DataFrame.from_records(trip)
    return df
df1 = ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
# only to remove the MultiIndex
df1 = df1.reset_index()
print (df1)
cars level_1 x y z
0 1 0 1 2 0
1 1 1 1 2 1
2 1 2 1 2 2
3 2 0 1 2 0
4 2 1 1 2 1
5 2 2 1 2 2
6 3 0 1 2 0
7 3 1 1 2 1
8 3 2 1 2 2
ddf1 = dd.from_pandas(df1,npartitions=1)
print (ddf1)
cars level_1 x y z
npartitions=1
0 int64 int64 int64 int64 int64
8 ... ... ... ... ...
Dask Name: from_pandas, 1 tasks
EDIT:
import numpy as np
import dask.array as da

L = []
def f(x):
    trip = ((1,2,x) for x in range(3))
    # append each group's triplets as a dask array chunk
    L.append(da.from_array(np.array(list(trip)), chunks=(1,3)))

ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
dar = da.concatenate(L, axis=0)
print (dar)
dask.array<concatenate, shape=(12, 3), dtype=int32, chunksize=(1, 3)>
For your edit:
In [8]: ddf.groupby(['first name', 'last name']).cars.count().compute()
Out[8]:
first name last name
1 1 107
2 107
3 110
2 1 117
2 120
3 99
3 1 119
2 103
3 118
Name: cars, dtype: int64
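For reference, a plain-pandas sketch of what that dask call computes on the dummy data (assuming it fits in memory; a fixed seed makes the run reproducible):

```python
import random
import pandas as pd

random.seed(0)
people = [[random.randint(1, 3), random.randint(1, 3), random.randint(1, 3)]
          for i in range(1000)]
pdf = pd.DataFrame.from_records(people, columns=["first name", "last name", "cars"])

# Count how many rows (people) fall into each (first name, last name) group;
# the per-group counts sum back to the 1000 input rows.
counts = pdf.groupby(["first name", "last name"])["cars"].count()
print(counts.sum())  # 1000
```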

Creating a new column in Pandas by selecting part of string in other column

I have a lot of experience programming in Matlab and am now using Python, and I just can't get this thing to work... I have a dataframe containing a column with timecodes like 00:00:00.033.
timecodes = ['00:00:01.001', '00:00:03.201', '00:00:09.231', '00:00:11.301', '00:00:20.601', '00:00:31.231', '00:00:90.441', '00:00:91.301']
df = pd.DataFrame(timecodes, columns=['TimeCodes'])
All my inputs are 90 seconds or less, so I want to create a column with just the seconds as float. To do this, I need to select position 6 to end and make that into a float, which I can do for the first row like:
float(df['TimeCodes'][0][6:])
This works just fine, but if I now want to create a whole new column 'Time_sec', the following does not work:
df['Time_sec'] = float(df['TimeCodes'][:][6:])
Because df['TimeCodes'][:][6:] takes row 6 to last row, while I want WITHIN each row the 6th till last position. Also this does not work:
df['Time_sec'] = float(df['TimeCodes'][:,6:])
Do I need to make a loop? There must be a better way... And why does df['TimeCodes'][:][6:] not work?
You can use the slice string method and then cast the whole thing to a float:
In [13]: df["TimeCodes"].str.slice(6).astype(float)
Out[13]:
0 1.001
1 3.201
2 9.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: float64
As to why df['TimeCodes'][:][6:] doesn't work, what this ends up doing is chaining some selections. First you grab the pd.Series associated with the TimeCodes column, then you select all of the items from the Series with [:], and then you just select the items with index 6 or higher with [6:].
Solution: index with str and cast to float with astype:
print (df["TimeCodes"].str[6:])
0 01.001
1 03.201
2 09.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: object
df['new'] = df["TimeCodes"].str[6:].astype(float)
print (df)
TimeCodes new
0 00:00:01.001 1.001
1 00:00:03.201 3.201
2 00:00:09.231 9.231
3 00:00:11.301 11.301
4 00:00:20.601 20.601
5 00:00:31.231 31.231
6 00:00:90.441 90.441
7 00:00:91.301 91.301
