Deleting zeros from string column in pandas dataframe - python

I have a column in my dataframe,where the values are something like this:
col1:
00000000000012VG
00000000000014SG
00000000000014VG
00000000000010SG
20000000000933LG
20000000000951LG
20000000000957LG
20000000000963LG
20000000000909LG
20000000000992LG
I want to delete the padding zeros:
a) those in front of other numbers and letters (for example, in the case of 00000000000010SG I want to delete the part 000000000000 and keep 10SG).
b) in cases like 20000000000992LG I want to delete the part 0000000000 and join the 2 with 992LG.
str.strip('0') solves only part a), as I checked.
But what is the right solution for both cases?

I would recommend something similar to Ed's answer, but using regex to ensure that not all 0s are replaced, and to eliminate the need to hardcode the number of 0s.
In [2426]: df.col1.str.replace(r'[0]{2,}', '', 1)
Out[2426]:
0 12VG
1 14SG
2 14VG
3 10SG
4 2933LG
5 2951LG
6 2957LG
7 2963LG
8 2909LG
9 2992LG
Name: col1, dtype: object
Only the first string of 0s is replaced.
Thanks to @jezrael for pointing out a small bug in my answer.
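One note for current pandas: since 2.0, str.replace treats the pattern as a literal string by default, so pass regex=True explicitly. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'col1': ['00000000000012VG', '20000000000933LG']})
# n=1 limits the replacement to the first run of two or more zeros;
# regex=True is required in pandas >= 2.0, where the default changed
df['col1'] = df['col1'].str.replace(r'0{2,}', '', n=1, regex=True)
print(df)
#      col1
# 0    12VG
# 1  2933LG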

You can just do
In[9]:
df['col1'] = df['col1'].str.replace('000000000000','')
df['col1'] = df['col1'].str.replace('0000000000','')
df
Out[9]:
col1
0 12VG
1 14SG
2 14VG
3 10SG
4 2933LG
5 2951LG
6 2957LG
7 2963LG
8 2909LG
9 2992LG
This replaces a fixed number of 0s with an empty string. It isn't dynamic, but for your given dataset it is the simplest thing to do, unless you can describe the pattern more precisely.
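If you do want it to be dynamic, a single regex with a capture group also works. This is only a sketch, and it assumes the pattern in the sample data: an optional leading digit followed by a run of at least ten padding zeros.
df['col1'] = df['col1'].str.replace(r'^(\d*?)0{10,}', r'\1', regex=True)
# the non-greedy group keeps any digits before the padding (e.g. the
# leading 2), while the run of zeros itself is dropped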

Related

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregated, grouped, nunique column to my pandas dataframe, but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df): same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
implies that there is a name nunique in the namespace that performs some function. transform will take a function, or a string that it knows a function for, and nunique is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred to passing your own functions. This is true even for passing numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
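To get the new column the question asks for, assign the result back; transform preserves the original index, so the values line up row for row:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')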

Python Pandas. Delete cells whose value is contained in another cell in the same column

I have a dataframe like this:
A B
exa 3
example 7
exam 4
hello 5
hell 4
I want to delete the rows that are substrings of another row and keep the longest one (Notice that B is already the length of A)
I want my table to look like this:
A B
example 7
hello 5
I thought about the following boolean filter but it does not work :(
df['Check'] = df.apply(lambda row: df.count(row['A'] in row['A'])>1, axis=1)
This is non-trivial, but we can take advantage of B to sort the data and compare each value only with the strings longer than it, for a solution slightly better than O(N^2).
df = df.sort_values('B')
v = df['A'].tolist()
df[[not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]].sort_index()
A B
1 example 7
3 hello 5
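A commented, self-contained version of the same idea, using the question's sample data:
import pandas as pd

df = pd.DataFrame({'A': ['exa', 'example', 'exam', 'hello', 'hell'],
                   'B': [3, 7, 4, 5, 4]})
# sort by length so every candidate superstring appears after the
# current string, then keep a row only if no later string starts with it
df = df.sort_values('B')
v = df['A'].tolist()
mask = [not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]
print(df[mask].sort_index())
#          A  B
# 1  example  7
# 3    hello  5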
Like the solution cold provided, mine is O(m*n) as well (in your case m = n):
df[np.sum(np.array([[y in x for x in df.A.values] for y in df.A.values]),1)==1]
Out[30]:
A B
1 example 7
3 hello 5
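Spelled out: the nested comprehension builds an n x n matrix whose entry [i, j] is True when A[i] is a substring of A[j]. Every string contains itself, so a row sum of exactly 1 means the string is contained in nothing else. A self-contained sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['exa', 'example', 'exam', 'hello', 'hell'],
                   'B': [3, 7, 4, 5, 4]})
contained = np.array([[y in x for x in df.A.values] for y in df.A.values])
print(df[contained.sum(axis=1) == 1])
#          A  B
# 1  example  7
# 3    hello  5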

Subtract 2 values from one another within 1 column after groupby

I am very sorry if this is a very basic question, but unfortunately I'm failing miserably at figuring out the solution.
After applying groupby to my pandas df to get one value per id, I need to subtract the first value within a column (in this case column 8 in my df) from the last value and divide the result by a number (e.g. 60). The final output would ideally look something like this:
id
1 1523
2 1644
I have the actual equation which works on its own when applied to the entire column of the df:
(df.iloc[-1,8] - df.iloc[0,8])/60
However, I fail to combine this part with the groupby function. Among others, I tried apply, which doesn't work:
df.groupby(['id']).apply((df.iloc[-1,8] - df.iloc[0,8])/60)
I also tried creating a function with the equation part and then doing apply(func), but so far none of my attempts have worked. Any help is much appreciated, thank you!
Demo:
In [204]: df
Out[204]:
id val
0 1 12
1 1 13
2 1 19
3 2 20
4 2 30
5 2 40
In [205]: df.groupby(['id'])['val'].agg(lambda x: (x.iloc[-1] - x.iloc[0])/60)
Out[205]:
id
1 0.116667
2 0.333333
Name: val, dtype: float64
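Applied to the original frame, where the relevant column sits at position 8, the same lambda works once you pick the column out by position. A sketch (the use of df.columns[8] and the name result are mine; the rest is the question's setup):
col = df.columns[8]
result = df.groupby('id')[col].agg(lambda x: (x.iloc[-1] - x.iloc[0]) / 60)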

Creating a new column in Pandas by selecting part of string in other column

I have a lot of experience programming in Matlab; now I'm using Python and I just can't get this thing to work... I have a dataframe containing a column with timecodes like 00:00:00.033.
timecodes = ['00:00:01.001', '00:00:03.201', '00:00:09.231', '00:00:11.301', '00:00:20.601', '00:00:31.231', '00:00:90.441', '00:00:91.301']
df = pd.DataFrame(timecodes, columns=['TimeCodes'])
All my inputs are 90 seconds or less, so I want to create a column with just the seconds as float. To do this, I need to select position 6 to end and make that into a float, which I can do for the first row like:
float(df['TimeCodes'][0][6:])
This works just fine, but if I now want to create a whole new column 'Time_sec', the following does not work:
df['Time_sec'] = float(df['TimeCodes'][:][6:])
Because df['TimeCodes'][:][6:] takes row 6 to the last row, while I want, WITHIN each row, the 6th to the last position. This also does not work:
df['Time_sec'] = float(df['TimeCodes'][:,6:])
Do I need to make a loop? There must be a better way... And why does df['TimeCodes'][:][6:] not work?
You can use the slice string method and then cast the whole thing to a float:
In [13]: df["TimeCodes"].str.slice(6).astype(float)
Out[13]:
0 1.001
1 3.201
2 9.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: float64
As to why df['TimeCodes'][:][6:] doesn't work: what this ends up doing is chaining selections. First you grab the pd.Series associated with the TimeCodes column, then you select all of its items with [:], and then you select the items with index 6 or higher with [6:].
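You can see the chaining with the example frame above; the final [6:] simply drops the first six rows:
In [14]: df['TimeCodes'][:][6:]
Out[14]:
6 00:00:90.441
7 00:00:91.301
Name: TimeCodes, dtype: object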
Solution: indexing with str and casting to float with astype:
print (df["TimeCodes"].str[6:])
0 01.001
1 03.201
2 09.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: object
df['new'] = df["TimeCodes"].str[6:].astype(float)
print (df)
TimeCodes new
0 00:00:01.001 1.001
1 00:00:03.201 3.201
2 00:00:09.231 9.231
3 00:00:11.301 11.301
4 00:00:20.601 20.601
5 00:00:31.231 31.231
6 00:00:90.441 90.441
7 00:00:91.301 91.301
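One caveat: slicing off the first six characters silently drops the minutes and hours if they ever become non-zero. A more general sketch (same df) that splits on ':' and combines the parts into seconds:
parts = df['TimeCodes'].str.split(':', expand=True).astype(float)
df['Time_sec'] = parts[0] * 3600 + parts[1] * 60 + parts[2]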

Delineate twice through a dataframe in pandas

I have a sparse pandas DataFrame/Series with values that look like variations of "AB1234:12, CD5678:34, EF3456:56". Something to the effect of
"AB1234:12, CD5678:34, EF3456:56"
"AB1234:12, CD5678:34"
NaN
"GH5678:34, EF3456:56"
"OH56:34"
Which I'd like to convert into
["AB1234","CD5678", "EF3456"]
["AB1234","CD5678"]
NaN
["GH5678","EF3456"]
["OH56"]
This kind of "double delineation" has been proving difficult. I know we can do A = df["columnName"].str.split(","), but I've run across a couple of problems, including that .split(", ") doesn't seem to work and .split(",") leaves whitespace. Also, iterating through the generated A and splitting again seems to interpret my new lists as floats - although that last one might be a technical difficulty with ipython, and I'm trying to work out that problem as well.
Is there a way to delineate on two types of separators - instead of just one? If not, how do you perform the loop to iterate over the inner list?
//Edit: changed the apostrophes to commas - that was just my dyslexia kicking in.
You nearly had it; note you can use a regular expression to split more generally:
In [11]: s2
Out[11]:
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 NaN
3 GH5678:34, EF3456:56
4 OH56:34
dtype: object
In [12]: s2.str.split(", ")
Out[12]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
In [13]: s2.str.split(r"\s*,\s*")
Out[13]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
Where this removes any spaces before or after a comma.
Here is your DataFrame
>>> df
A
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 None
3 GH5678:34, EF3456:56
4 OH56:34
And now I use replace and split to remove all the ':' characters and split on ', ':
>>> df.A = [i.replace(':','').split(", ") if isinstance(i,str) else i for i in df.A]
>>> df.A
0 [AB123412, CD567834, EF345656]
1 [AB123412, CD567834]
2 None
3 [GH567834, EF345656]
4 [OH5634]
Name: A
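For the exact lists the question asks for - the codes without the ':NN' suffix - both separators can be handled in one pass: split on commas, then cut each piece at the colon. A sketch using the question's sample values:
import pandas as pd

s = pd.Series(["AB1234:12, CD5678:34, EF3456:56",
               "AB1234:12, CD5678:34",
               None,
               "GH5678:34, EF3456:56",
               "OH56:34"])
# split on commas (tolerating surrounding spaces), keep only the part
# before each colon; na_action='ignore' leaves the missing row alone
result = s.str.split(r"\s*,\s*").map(
    lambda parts: [p.split(':')[0] for p in parts], na_action='ignore')
print(result)
# 0 [AB1234, CD5678, EF3456]
# 1 [AB1234, CD5678]
# 2 NaN
# 3 [GH5678, EF3456]
# 4 [OH56]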
