Extracting numerical information from strings in a dataframe column - python

I've seen this done in excel but I'd like to split the SOP and number into different columns. It gets a little tricky since the formatting is different at times.
0 SOP-015641
1 SOP-007809
2 SOP018262
3 SOP-007802
4 SOP-007804
5 SOP-007807

Use the .str.extract() method (a raw string avoids the invalid-escape warning modern Python raises for '\D'):
In [8]: df[['a','b']] = df.pop('col').str.extract(r'(\D+)(\d+)', expand=True)
In [9]: df
Out[9]:
      a       b
0  SOP-  015641
1  SOP-  007809
2   SOP  018262
3  SOP-  007802
4  SOP-  007804
5  SOP-  007807
RegEx explained: (\D+) captures one or more non-digit characters (the prefix, including the hyphen when present), and (\d+) captures the run of digits that follows.
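If you also want to normalize the prefix and treat the number as numeric, a minimal follow-up sketch (assuming the column names a and b from above; note that the integer cast drops the leading zeros, so keep b as a string if the zero-padding matters):
df['a'] = df['a'].str.rstrip('-')   # 'SOP-' and 'SOP' both become 'SOP'
df['b'] = df['b'].astype(int)       # 015641 -> 15641; skip this to keep zero-padding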


Pandas sampling a dataframe but treating multiple rows as a single row based on column

Consider the following toy code that performs a simplified version of my actual question:
import pandas

df = pandas.DataFrame(
    {
        'n_event': [1, 2, 3, 4, 5],
        'some column': [0, 1, 2, 3, 4],
    }
)
df = df.set_index(['n_event'])
print(df)

resampled_df = df.sample(frac=1, replace=True)
print(resampled_df)
The resampled_df is, as its name suggests, a resampled version of the original one (with replacement). This is exactly what I want. An example output of the previous code is
         some column
n_event
1                  0
2                  1
3                  2
4                  3
5                  4

         some column
n_event
4                  3
1                  0
4                  3
4                  3
2                  1
Now for my actual question I have the following dataframe:
import pandas

df = pandas.DataFrame(
    {
        'n_event': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        'n_channel': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        'some column': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    }
)
df = df.set_index(['n_event', 'n_channel'])
print(df)
which looks like
                   some column
n_event n_channel
1       1                    0
        2                    1
2       1                    2
        2                    3
3       1                    4
        2                    5
4       1                    6
        2                    7
5       1                    8
        2                    9
I want to do exactly the same as before, resample with replacements, but treating each group of rows with the same n_event as a single entity. A hand-built example of what I want to do can look like this:
                   some column
n_event n_channel
2       1                    2
        2                    3
2       1                    2
        2                    3
3       1                    4
        2                    5
1       1                    0
        2                    1
5       1                    8
        2                    9
As seen, each n_event was treated as a whole and things within each event were not mixed up.
How can I do this without proceeding by brute force (i.e. without for loops, etc.)?
I have tried with df.sample(frac=1, replace=True, ignore_index=False) and a few things using groupby without success.
Would a pivot()/melt() sequence work for you?
Use pivot() to go from long to wide (make each group a single row).
Do the sampling.
Then go back from wide to long using melt().
I don't have time to work out a full answer, but I thought I would get this idea to you in case it might help.
Following the suggestion of jch I was able to find a solution by combining pivot and stack:
import pandas

df = pandas.DataFrame(
    {
        'n_event': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        'n_channel': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        'some column': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        'other col': [5, 6, 4, 3, 2, 5, 2, 6, 8, 7],
    }
)
# Long to wide: one row per n_event, one column level per value column.
# A list (rather than a set) keeps the column order deterministic and is
# accepted as an indexer by modern pandas.
resampled_df = df.pivot(
    index='n_event',
    columns='n_channel',
    values=[c for c in df.columns if c not in ('n_event', 'n_channel')],
)
# Sampling whole rows now means sampling whole events, with replacement.
resampled_df = resampled_df.sample(frac=1, replace=True)
# Wide back to long: stack n_channel into the index again.
resampled_df = resampled_df.stack()
print(resampled_df)
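An alternative sketch that avoids the reshaping entirely, assuming the MultiIndexed df from the question (index set to ['n_event', 'n_channel']): draw event labels with numpy.random.choice, then let .loc repeat whole groups, since passing a list with duplicate labels selects the matching rows once per occurrence.
import numpy

# Unique event labels from the first index level.
events = df.index.get_level_values('n_event').unique()

# Draw as many events as there are events, with replacement.
picks = numpy.random.default_rng().choice(events, size=len(events), replace=True)

# Every row of each picked event, with groups repeated as often as drawn.
resampled_df = df.loc[list(picks)]
print(resampled_df)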

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe without aggregating the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id columns. I want the number of unique ids for each track/type combination as a new column in the table (but without collapsing track/type combos in the resulting df). Same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
In R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
This implies that there is a name nunique in the namespace that refers to some function. transform will take either a function or a string that it knows a function for, and 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform a transformation indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for passing NumPy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead:
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
import pandas as pd

df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
  id track type
0  X     1    A
1  X     1    A
2  Y     1    A
3  Z     1    A
4  W     2    B
5  W     2    B
6  W     2    B
7  W     2    B
df.groupby(['track', 'type'])['id'].transform('nunique')

0    3
1    3
2    3
3    3
4    1
5    1
6    1
7    1
Name: id, dtype: int64
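To get the new column the question asks for, assign the transform result back; the row count is unchanged and the per-group counts are broadcast to every row:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')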

Find and delete non numeric columns in pandas dataframe [duplicate]

For example, if I want to consider a flower species, number of petals, germination time, and user ID, the user ID is going to have a hyphen in it, so I don't want to use it in my data analysis. I'm aware that I can hard-code it, but I want it so that when I input any dataset, it will automatically remove columns with non-numeric entries.
Edit: the question was unclear. I'm reading in data from a CSV file using pandas.
Example:
  Species  NPetals  GermTime  UserID
1    R. G        5         4   65-78
2    R. F        5         3   65-81
I want to remove the UserID and Species columns from the dataset.
From the docs you can just select the numeric data by filtering using select_dtypes:
In [5]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(6).astype('f4'),
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})
df
Out[5]:
          a      b  c
0  0.338710   True  1
1  1.530095  False  2
2 -0.048261   True  1
3 -0.505742  False  2
4  0.729667   True  1
5 -0.634482  False  2
In [15]:
df.select_dtypes(include=[np.number])
Out[15]:
          a  c
0  0.338710  1
1  1.530095  2
2 -0.048261  1
3 -0.505742  2
4  0.729667  1
5 -0.634482  2
You can pass any valid NumPy dtype in the dtype hierarchy; np.number, for example, covers all numeric dtypes.
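Applied to the flower example, a minimal sketch (the file name is hypothetical; the frame is assumed to come from pd.read_csv): dropping the non-numeric columns is just reassigning the selection.
import numpy as np
import pandas as pd

df = pd.read_csv('flowers.csv')             # hypothetical input file
df = df.select_dtypes(include=[np.number])  # keeps NPetals and GermTime,
                                            # drops Species and UserID (object dtype)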

Obtaining the first few rows of a dataframe

Is there a way to get the first n rows of a dataframe without using the indices? For example, I know that if I have a dataframe called df I could get the first 5 rows via df.ix[:5]. But what if my indices are not ordered and I don't want to order them? This does not seem to work, so I was wondering if there is another way to select the first couple of rows. I apologize if there is already an answer to this; I wasn't able to find one.
Use head(5) or iloc[:5]
In [7]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3))
df
Out[7]:
          0         1         2
0 -1.230919  1.482451  0.221723
1 -0.302693 -1.650244  0.957594
2 -0.656565  0.548343  1.383227
3  0.348090 -0.721904 -1.396192
4  0.849480 -0.431355  0.501644
5  0.030110  0.951908 -0.788161
6  2.104805 -0.302218 -0.660225
7 -0.657953  0.423303  1.408165
8 -1.940009  0.476254 -0.014590
9 -0.753064 -1.083119 -0.901708
In [8]:
df.head(5)
Out[8]:
          0         1         2
0 -1.230919  1.482451  0.221723
1 -0.302693 -1.650244  0.957594
2 -0.656565  0.548343  1.383227
3  0.348090 -0.721904 -1.396192
4  0.849480 -0.431355  0.501644
In [11]:
df.iloc[:5]
Out[11]:
          0         1         2
0 -1.230919  1.482451  0.221723
1 -0.302693 -1.650244  0.957594
2 -0.656565  0.548343  1.383227
3  0.348090 -0.721904 -1.396192
4  0.849480 -0.431355  0.501644
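Both are positional, so an unordered index makes no difference. A minimal sketch illustrating this (the shuffled index is an assumption for demonstration):
import numpy as np
import pandas as pd

# A frame whose index labels are deliberately out of order.
df = pd.DataFrame(np.random.randn(10, 3), index=np.random.permutation(10))

print(df.head(5))   # first 5 rows by position, regardless of labels
print(df.iloc[:5])  # the equivalent positional slice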

Pandas Filter function returned a Series, but expected a scalar bool

I am attempting to use filter on a pandas dataframe to filter out all rows that match a duplicated value (I need to remove ALL the rows when there are duplicates, not just the first or last).
This is what I have that works in the editor:
df = df.groupby("student_id").filter(lambda x: x.count() == 1)
But when I run my script with this code in it I get the error:
TypeError: filter function returned a Series, but expected a scalar bool
I am creating the dataframe by concatenating two other frames immediately before trying to apply the filter.
It should be:
In [32]: grouped = df.groupby("student_id")
In [33]: grouped.filter(lambda x: x["student_id"].count() == 1)
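An equivalent one-liner, a sketch using a boolean mask instead of filter ('count' broadcasts each group's size back to every row via transform):
df = df[df.groupby("student_id")["student_id"].transform("count") == 1]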
Updates:
I'm not sure about the issue you mentioned regarding the interactive console. Technically speaking, in this particular case the console (such as IPython) should behave the same as any other environment (the plain Python interpreter, or one embedded in an IDE); there may be other situations, such as the intricacies of the import machinery, in which different environments behave differently.
An intuitive way to understand the pandas groupby is to treat the object returned by DataFrame.groupby() as a list of dataframes. So when you use filter to apply the lambda function to x, x is actually one of those dataframes:
In [24]: year = [2013, 2014]; data = [[0, 1], [2, 3], [4, 5], [6, 7]] * 2
In [25]: df = pd.DataFrame(data, columns=year)
In [26]: df
Out[26]:
   2013  2014
0     0     1
1     2     3
2     4     5
3     6     7
4     0     1
5     2     3
6     4     5
7     6     7
In [27]: grouped = df.groupby(2013)
In [28]: grouped.count()
Out[28]:
      2014
2013
0        2
2        2
4        2
6        2
In this example, the first dataframe in the grouped object would be (using .loc, since .ix has been removed from modern pandas):
In [33]: df1 = df.loc[[0, 4]]
In [34]: df1
Out[34]:
   2013  2014
0     0     1
4     0     1
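This is also why the original code fails: inside filter, the lambda receives a whole sub-dataframe like df1, so x.count() returns one count per column (a Series), not the scalar bool that filter expects; selecting a single column first, as in x["student_id"].count(), reduces it to a scalar:
In [35]: df1.count()
Out[35]:
2013    2
2014    2
dtype: int64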
How about using the pd.DataFrame.drop_duplicates() method? Documentation.
Are you sure you really want to remove ALL rows, and not n-1?
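For the remove-ALL-rows reading, a minimal sketch of the drop_duplicates route (keep=False drops every member of each duplicated group, not just the surplus copies):
df = df.drop_duplicates(subset="student_id", keep=False)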
