remove ' months' from int in 'term column' example '36 months' [duplicate] - python

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['1a',np.nan,'10a','100b','0b'],
})
df
A
0 1a
1 NaN
2 10a
3 100b
4 0b
I'd like to extract the numbers from each cell (where they exist).
The desired result is:
A
0 1
1 NaN
2 10
3 100
4 0
I know it can be done with str.extract, but I'm not sure how.

Give it a regex capture group:
df.A.str.extract(r'(\d+)', expand=False)  # expand=False returns a Series
Gives you:
0 1
1 NaN
2 10
3 100
4 0
Name: A, dtype: object
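Note that the extracted values are still strings (dtype: object). If actual integers are needed, one possible follow-up is sketched below. The `term` column and its values are hypothetical, modelled on the '36 months' case in the title, and `term_int` is just an illustrative name:

```python
import numpy as np
import pandas as pd

# Hypothetical data modelled on the '36 months' example in the title.
df = pd.DataFrame({'term': ['36 months', np.nan, '60 months']})

# expand=False returns a Series; the captured digits are still strings.
digits = df['term'].str.extract(r'(\d+)', expand=False)

# Convert to numbers; the nullable Int64 dtype keeps the missing value as <NA>.
df['term_int'] = pd.to_numeric(digits).astype('Int64')
print(df)
```

The nullable Int64 dtype is used here because a plain int column cannot hold missing values.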

You can replace the column with the result using the assign function:
df = df.assign(A=lambda x: x['A'].str.extract(r'(\d+)', expand=False))

To answer @Steven G's question in the comment above, this should work:
df.A.str.extract(r'(^\d*)', expand=False)

If you have multiple disjoint sets of digits, as in 1a2b3c, and would like to extract 123, you can do it with Series.str.replace (pass regex=True, since pandas 2.0 treats the pattern as a literal string by default):
>>> df
A
0 1a
1 b2
2 a1b2
3 1a2b3c
>>> df['A'] = df['A'].str.replace(r'\D+', '', regex=True)
>>> df
A
0 1
1 2
2 12
3 123
You could also work this around with Series.str.extractall and groupby but I think that this one is easier.
Hope this helps!
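For comparison, here is a rough sketch of the Series.str.extractall and groupby route mentioned above. The sample Series mirrors the data in this answer; the variable names are only illustrative:

```python
import pandas as pd

s = pd.Series(['1a', 'b2', 'a1b2', '1a2b3c'], name='A')

# extractall returns one row per match, indexed by (original index, match number).
matches = s.str.extractall(r'(\d+)')[0]

# Join the matches that belong to the same original row.
joined = matches.groupby(level=0).agg(''.join)
print(joined.tolist())
```

One difference to keep in mind: rows without any digits produce no matches, so they are missing from the result rather than being returned as empty strings.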

Related

Pivot table based on the first value of the group in Pandas

I have the following DataFrame:
I'm trying to pivot it in pandas and achieve the following format:
I tried the classical approach with pd.pivot_table(), but it does not work:
pd.pivot_table(df,values='col2', index=[df.index], columns = 'col1')
I would appreciate any suggestions :) Thanks!
You can use pivot and then dropna for each column:
>>> df.pivot(columns='col1', values='col2').apply(lambda x: x.dropna().tolist()).astype(int)
col1 a b c
0 1 2 9
1 4 5 0
2 6 8 7
Another option is to create a Series of lists using groupby.agg; then construct a DataFrame:
out = df.groupby('col1')['col2'].agg(list).pipe(lambda x: pd.DataFrame(zip(*x), columns=x.index.tolist()))
Output:
A B C
0 1 2 9
1 4 5 0
2 6 8 7
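A third possibility, sketched here under the assumption that the input looks like the hypothetical frame below (the original data is not shown), is to number the rows within each group with groupby.cumcount and pivot on that counter. The helper column name `row` is made up for the example:

```python
import pandas as pd

# Hypothetical input chosen to be consistent with the output shown above.
df = pd.DataFrame({'col1': list('abc') * 3,
                   'col2': [1, 2, 9, 4, 5, 0, 6, 8, 7]})

# Number each row within its col1 group: 0 for the first 'a', 1 for the second, ...
df['row'] = df.groupby('col1').cumcount()

# Use the per-group counter as the new index so each group fills one column.
out = df.pivot(index='row', columns='col1', values='col2')
print(out)
```

Unlike the zip-based version, this keeps NaN in place when the groups have different sizes instead of truncating to the shortest one.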

Is there a way in pandas to create an integer in a new column if a row contains a specific string

For example, I have the following dataframe:
I want to transform the dataframe from above to something like this:
Thanks for any kind of help!
Run:
df['Number'] = df.svn_changes.str.match(r'r\d+').cumsum()
Yes, use str.contains with a regex, followed by cumsum:
df = pd.DataFrame({'svn_changes':['r123456','RowValueRow','ValueRowValue',
'some_string_string','r234566','ValueRowValue',
'some_string_string','r123789','something_here',
'ValueRowValue','String_2','String_4']})
df['Number'] = df['svn_changes'].str.contains(r'r\d+').cumsum()
print(df)
Output:
svn_changes Number
0 r123456 1
1 RowValueRow 1
2 ValueRowValue 1
3 some_string_string 1
4 r234566 2
5 ValueRowValue 2
6 some_string_string 2
7 r123789 3
8 something_here 3
9 ValueRowValue 3
10 String_2 3
11 String_4 3
Here's a simple reusable line you can use to do that:
df['new_col'] = df['old_col'].str.contains('string_to_match')*1
The new column will have value 1 if the string is present in this column, and 0 otherwise.
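The two answers above use str.match and str.contains, which are not interchangeable. The sketch below, on made-up sample strings, shows the difference and also passes na=False so that missing values count as "no match" instead of propagating NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical values, including a missing entry and a string with 'r42' in the middle.
s = pd.Series(['r123456', 'error42', np.nan, 'String_2'])

# str.match anchors at the start; str.contains searches the whole string.
starts_with_rev = s.str.match(r'r\d+', na=False)
has_rev_anywhere = s.str.contains(r'r\d+', na=False)

print(starts_with_rev.astype(int).tolist())
print(has_rev_anywhere.astype(int).tolist())
```

If revision markers always start the cell, str.match (or an explicit ^ anchor) is the safer choice before calling cumsum.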

python pandas select both head and tail

For a DataFrame in Pandas, how can I select both the first 5 values and last 5 values?
For example
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
How to show the first two and the last two rows?
You can use iloc with numpy.r_:
print (np.r_[0:2, -2:0])
[ 0 1 -2 -1]
print (df.iloc[np.r_[0:2, -2:0]])
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-07 8 8 8
2012-12-08 9 9 9
print (df.iloc[np.r_[0:4, -4:0]])
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
You can use df.head(5) and df.tail(5) to get the first five and last five rows.
Optionally, you can combine them into a new data frame. DataFrame.append was removed in pandas 2.0, so use pd.concat:
new_df = pd.concat([df.head(5), df.tail(5)])
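One caveat when combining head and tail: if the frame has fewer than 2n rows, the two slices overlap and rows appear twice. A minimal sketch of a helper that guards against this follows; `head_tail` is a hypothetical name, not a pandas API:

```python
import numpy as np
import pandas as pd

def head_tail(df, n=5):
    """Return the first and last n rows (hypothetical helper, not part of pandas)."""
    # For short frames head and tail would overlap, so return everything once.
    if len(df) <= 2 * n:
        return df.copy()
    return pd.concat([df.head(n), df.tail(n)])

df = pd.DataFrame({'A': np.arange(10)})
print(head_tail(df, 2))
```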
Not quite the same question, but if you just want to show the top and bottom 5 rows (e.g. with display in Jupyter or a regular print), there's potentially a simpler way: use the pd.option_context context manager.
# make 100 rows of 3 random numbers
df = pd.DataFrame(np.random.randn(100,3))
# reindex by the index of the row sums; this keeps the original order (nothing is sorted here)
df = df.loc[df.sum(axis=1).index]
with pd.option_context('display.max_rows',10):
    print(df)
Outputs:
0 1 2
0 -0.649105 -0.413335 0.374872
1 3.390490 0.552708 -1.723864
2 -0.781308 -0.277342 -0.903127
3 0.433665 -1.125215 -0.290228
4 -2.028750 -0.083870 -0.094274
.. ... ... ...
95 0.443618 -1.473138 1.132161
96 -1.370215 -0.196425 -0.528401
97 1.062717 -0.997204 -1.666953
98 1.303512 0.699318 -0.863577
99 -0.109340 -1.330882 -1.455040
[100 rows x 3 columns]
Small simple function:
def ends(df, x=5):
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    return pd.concat([df.head(x), df.tail(x)])
And use like so:
df = pd.DataFrame(np.random.rand(15,6))
ends(df,2)
I actually use this so much, I think it would be a great feature to add to pandas. (No features are to be added to pandas.DataFrame core API) I add it after import like so:
import pandas as pd
def ends(df, x=5):
    return pd.concat([df.head(x), df.tail(x)])
setattr(pd.DataFrame,'ends',ends)
Use like so:
import numpy as np
df = pd.DataFrame(np.random.rand(15,6))
df.ends(2)
You should use both head() and tail() for this purpose. I think the easiest way to do this is:
pd.concat([df.head(5), df.tail(5)])
In Jupyter, expanding on @bolster's answer, we'll create a reusable convenience function:
def display_n(df,n):
    with pd.option_context('display.max_rows',n*2):
        display(df)
Then
display_n(df,2)
Returns
0 1 2
0 0.167961 -0.732745 0.952637
1 -0.050742 -0.421239 0.444715
... ... ... ...
98 0.085264 0.982093 -0.509356
99 -0.758963 -0.578267 -0.115865
(except as a nicely formatted HTML table)
when df is df = pd.DataFrame(np.random.randn(100,3))
Notes:
Of course, you could make the same thing print as text by changing display to print in the function above.
On Unix-like systems, you can autoload the above function in all notebooks by placing it in a .py or .ipy file in ~/.ipython/profile_default/startup, as described here.
If you want to keep it to just Pandas, you can use apply() to concatenate the head and tail:
import pandas as pd
from string import ascii_lowercase, ascii_uppercase
df = pd.DataFrame(
{"upper": list(ascii_uppercase), "lower": list(ascii_lowercase)}, index=range(1, 27)
)
df.apply(lambda x: pd.concat([x.head(2), x.tail(2)]))
upper lower
1 A a
2 B b
25 Y y
26 Z z
Related to Linas Fx's answer.
Define the following:
pd.DataFrame.less = lambda df, n=10: pd.concat([df.head(n//2), df.tail(n//2)])
Then you only need to type df.less().
With the default n=10, it is the same as typing pd.concat([df.head(), df.tail()]).
If you type df.less(2), the result is the same as pd.concat([df.head(1), df.tail(1)]).
Combining @ic_fl2 and @watsonic to give the below in Jupyter:
def ends_attr():
    def display_n(df,n):
        with pd.option_context('display.max_rows',n*2):
            display(df)
    # set pd.DataFrame attribute where .ends runs display_n() function
    setattr(pd.DataFrame,'ends',display_n)
ends_attr()
View first and last 3 rows of your df:
your_df.ends(3)
I like this because I can copy a single function and know I have everything I need to use the ends attribute.

Conditional statement and split in a Dataframe

I am looking for a conditional statement in Python that searches a specified column for certain information and puts the result in a new column.
Here is an example of my dataset:
OBJECTID CODE_LITH
1 M4,BO
2 M4,BO
3 M4,BO
4 M1,HP-M7,HP-M1
and what I want as results:
OBJECTID CODE_LITH M4 M1
1 M4,BO 1 0
2 M4,BO 1 0
3 M4,BO 1 0
4 M1,HP-M7,HP-M1 0 1
What I have done so far:
import pandas as pd
import numpy as np
lookup = ['M4']
df.loc[df['CODE_LITH'].str.isin(lookup),'M4'] = 1
df.loc[~df['CODE_LITH'].str.isin(lookup),'M4'] = 0
Since there are multiple values per row in "CODE_LITH", the script does not seem able to find "M4" on its own; it only matches the full string "M4,BO" when putting 1 or 0 in the new column.
I have also tried:
if ('M4') in df['CODE_LITH']:
    df['M4'] = 0
else:
    df['M4'] = 1
With the same results.
Thanks for your help.
PS. The dataframe contains about 2.6 millions rows and I need to do this operation for 30-50 variables.
I think this is the Pythonic way to do it:
for mn in ['M1', 'M4']: # Add other "M#" as needed
    df[mn] = df['CODE_LITH'].map(lambda x: mn in x)
Use the str.contains accessor:
>>> for key in ('M4', 'M1'):
...     df.loc[:, key] = df['CODE_LITH'].str.contains(key).astype(int)
>>> df
OBJECTID CODE_LITH M4 M1
0 1 M4,BO 1 0
1 2 M4,BO 1 0
2 3 M4,BO 1 0
3 4 M1,HP-M7,HP-M1 0 1
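With 30-50 codes, a plain substring test can produce false positives, for example 'M1' also matching inside 'M10'. A possible refinement, sketched below, wraps each code in word boundaries. The 'M10,BO' row is a hypothetical addition to show the difference; it is not part of the question's data:

```python
import re
import pandas as pd

# Sample rows from the question plus a hypothetical 'M10,BO' row.
df = pd.DataFrame({'CODE_LITH': ['M4,BO', 'M1,HP-M7,HP-M1', 'M10,BO']})

for key in ('M4', 'M1'):
    # \b keeps 'M1' from matching inside 'M10'; re.escape guards special characters.
    pattern = rf'\b{re.escape(key)}\b'
    df[key] = df['CODE_LITH'].str.contains(pattern).astype(int)

print(df)
```

This stays vectorized, so it should scale to a few million rows far better than a Python-level loop over the cells.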
I was able to do:
for index,data in enumerate(df['CODE_LITH']):
    if "I1" in data:
        df['Plut_Felsic'][index] = 1
    else:
        df['Plut_Felsic'][index] = 0
It does work, but takes quite some time to calculate.
