How to extract numeric information from a string in Pandas? - python

I have a column in my dataframe that contains string rows such as:
'(0.0,0.8638888888888889,3.7091666666666665,12.023333333333333,306.84694444444443)'
This output (produced by another program) corresponds to the min, 25th percentile, median, 75th percentile and max for a given variable.
I would like to extract that information and put it in separate numeric columns, such as
min p25 p50
0.0 0.864 3.70
The data I have is really large. How can I do that in Pandas?
Many thanks!

IIUC then the following should work:
In [280]:
df = pd.DataFrame({'col':['(0.0,0.8638888888888889,3.7091666666666665,12.023333333333333,306.84694444444443)']})
df
Out[280]:
col
0 (0.0,0.8638888888888889,3.7091666666666665,12....
In [297]:
df[['min','p25','p50']] = df['col'].str.replace(r"'|\(|\)", '', regex=True).str.split(',', expand=True).astype(np.float64)[[0,1,2]]
df
Out[297]:
col min p25 p50
0 (0.0,0.8638888888888889,3.7091666666666665,12.... 0.0 0.863889 3.709167
This strips the ', ( and ) characters with str.replace (note regex=True, which is required on pandas 1.4+ where str.replace defaults to literal matching), splits on the comma with str.split(expand=True), casts the result to float, and then selects the columns of interest.
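If every row holds all five statistics, it may be simpler to keep them all at once. A minimal sketch, assuming each value is exactly five comma-separated numbers wrapped in parentheses:
import pandas as pd

df = pd.DataFrame({'col': ['(0.0,0.8638888888888889,3.7091666666666665,12.023333333333333,306.84694444444443)']})

# Strip the surrounding parentheses, split on the commas, and cast to float
stats = df['col'].str.strip('()').str.split(',', expand=True).astype(float)
stats.columns = ['min', 'p25', 'p50', 'p75', 'max']  # name all five statistics
df = df.join(stats)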

Related

Convert a string that contains a price with space after thousands to float in pandas column

I have a column that contains prices with a currency. I want to convert these prices to floats. The issue here is that these prices contain spaces after the thousands.
My initial dataframe :
import numpy as np
import pandas as pd
prices = pd.Series(['239,00$','1 345,00$','1,00$','4 344,33$'])
df = pd.DataFrame(prices,columns = ["prices"])
print(df)
prices
0 239,00$
1 1 345,00$
2 1,00$
3 4 344,33$
The output I want to get is a dataframe column where my values are float and don't have spaces:
prices
0 239.00
1 1345.00
2 1.00
3 4344.33
I tried using the replace function to remove the spaces in the strings, but it doesn't seem to work.
Any idea how I can reach that result?
Remove the characters that are not a digit or a comma ([^\d,]) and then replace , with .:
df.prices.str.replace(r'[^\d,]', '', regex=True).str.replace(',', '.').astype(float)
0 239.00
1 1345.00
2 1.00
3 4344.33
Name: prices, dtype: float64
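Putting it together as a self-contained snippet (a sketch; regex=True is spelled out because newer pandas versions treat str.replace patterns as literal strings by default):
import pandas as pd

prices = pd.Series(['239,00$', '1 345,00$', '1,00$', '4 344,33$'])
df = pd.DataFrame({'prices': prices})

df['prices'] = (df['prices']
                .str.replace(r'[^\d,]', '', regex=True)  # drop spaces, '$' and any other non-digits
                .str.replace(',', '.', regex=False)      # decimal comma -> decimal point
                .astype(float))
print(df)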

Trying to use a lambda function to add a new column with cumulative increase of the values in a DataFrame

Say I have a simple one column DataFrame:
df = pd.DataFrame([0.01,0.02,0.03,0.04,0.05,0.06,0.07])
I am trying to multiply the first two numbers, then multiply the result of that by the third number, then multiply the result of that by the fourth number, and so on down the column.
For example:
df['chainlink'] = df.apply(lambda x: (1+x[0])*(1+x[1]))
This obviously creates a new column with the value 1.0302 in the first row and then NaNs after. What I am then trying to do is (1.0302)(1+0.03) = 1.0611, then (1.0611)(1+0.04) = 1.1036, etc.
The new column should be a sort of cumulative increase of the values.
Check with cumprod
df['new'] = df[0].add(1).cumprod()
0 1.010000
1 1.030200
2 1.061106
3 1.103550
4 1.158728
5 1.228251
6 1.314229
Name: 0, dtype: float64
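As a sanity check, here is a small self-contained sketch that compares the cumprod result with the explicit running product described in the question:
import pandas as pd

df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07])
df['new'] = df[0].add(1).cumprod()

# Reproduce the chained multiplication by hand: running *= (1 + value)
running, manual = 1.0, []
for value in df[0]:
    running *= 1 + value
    manual.append(running)

assert (df['new'] - pd.Series(manual)).abs().max() < 1e-12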

Get far right hand data using Pandas

I'm not familiar with Python, but I aim to get data using Pandas from the data format below.
Is there any method to get the far right-hand data from each row? The data has over 60,000 rows and the last filled level of each row varies.
To access the column in pandas you'll have to give it a name as well ('some_name' for this example). Then it should be as easy as
import pandas as pd
df = pd.read_excel('path/to/your/file')
target = df['some_name']
See pandas.read_excel for further details.
IIUC, you want the last value in each row that is not null. I am assuming that after reading your data using pd.read_csv() your data looks something like this -
# CREATING DUMMY DATA
import numpy as np
import pandas as pd

a = [['AA000', np.nan, np.nan, np.nan],
     ['AA006', 'AA001', np.nan, np.nan],
     ['AA008', 'AA002', np.nan, np.nan],
     ['AA002', 'AA003', 'AA003', np.nan],
     ['AA002', 'AA006', 'AA004', np.nan]]
df = pd.DataFrame(a, columns=['Level1', 'Level2', 'Level3', 'Level4'])
df
Note: I modified an old answer of mine for this solution, so if you are interested in knowing what is happening here, do check it out. In a nutshell, I have oriented/flipped the data in such a way that the values you want become the FIRST values in each row instead of the last. That way, when you do argmax(1), it returns the index of the first occurrence of the notna values. In the case of idxmax, it returns the column label directly instead of the integer position.
Pandas method:
You can use pandas to solve this as -
result = df.lookup(range(df.shape[0]), df.iloc[:, ::-1].notna().idxmax(1))
result
array(['AA000', 'AA001', 'AA002', 'AA003', 'AA004'], dtype=object)
Here is a breakdown of the column-index expression df.iloc[:, ::-1].notna().idxmax(1): iloc[:, ::-1] flips the frame horizontally, notna() builds the boolean mask, and idxmax(1) returns the label of the first True value in each row, i.e. the right-most filled level.
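A small step-by-step sketch of those pieces, continuing from the dummy df above:
flipped = df.iloc[:, ::-1]       # columns in reverse order: Level4 ... Level1
mask = flipped.notna()           # True wherever a value is present
last_col = mask.idxmax(axis=1)   # per row: label of the right-most filled level
print(last_col.tolist())
# ['Level1', 'Level2', 'Level2', 'Level3', 'Level3']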
Numpy method:
You can use NumPy to solve this as follows -
import numpy as np
col_idx = df.shape[1] - np.fliplr(df.notna().values).argmax(1) - 1
row_idx = np.arange(df.shape[0])
result = df.values[row_idx, col_idx]
result
array(['AA000', 'AA001', 'AA002', 'AA003', 'AA004'], dtype=object)
A breakdown of what is happening in df.shape[1] - np.fliplr(df.notna().values).argmax(1) - 1: np.fliplr flips the boolean notna matrix horizontally, argmax(1) then gives the position of the first non-NaN value in each flipped row, and subtracting that from df.shape[1] - 1 (the number of columns minus one) maps it back to the original column index.
Finally, set it to a column by simply assigning it to a new column name in df -
df['last'] = result
print(df)
Level1 Level2 Level3 Level4 last
0 AA000 NaN NaN NaN AA000
1 AA006 AA001 NaN NaN AA001
2 AA008 AA002 NaN NaN AA002
3 AA002 AA003 AA003 NaN AA003
4 AA002 AA006 AA004 NaN AA004
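Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the Pandas method above will not run on current versions. A sketch of one replacement, assuming (as in the sample data) that NaN values only appear to the right of the filled levels:
# Forward-fill across the columns, then keep the last column:
# every trailing NaN is overwritten by the right-most real value.
level_cols = ['Level1', 'Level2', 'Level3', 'Level4']
df['last'] = df[level_cols].ffill(axis=1).iloc[:, -1]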

Python df groupby with agg for string and sum

With this df as a base I want the following output:
Everything should be grouped by column 0; the strings from column 1 should be collected, and the numbers from column 2 should be summed when the strings in column 1 have the same name.
With the following code I could aggregate the strings, but without summing the numbers:
df2= df1.groupby([0]).agg(lambda x: ','.join(set(x))).reset_index()
df2
Avoid an arbitrary number of columns
Your desired output suggests you have an arbitrary number of columns dependent on the number of values in 1 for each group 0. This is anti-Pandas, which is strongly geared towards an arbitrary number of rows. Hence series-wise operations are preferred.
So you can just use groupby + sum to store all the information you require.
df = pd.DataFrame({0: ['2008-04_E.pdf']*3,
                   1: ['Mat1', 'Mat2', 'Mat2'],
                   2: [3, 1, 1]})
df_sum = df.groupby([0, 1]).sum().reset_index()
print(df_sum)
0 1 2
0 2008-04_E.pdf Mat1 3
1 2008-04_E.pdf Mat2 2
But if you insist...
If you insist on your unusual requirement, you can achieve it as follows via df_sum calculated as above.
key = df_sum.groupby(0)[1].cumcount().add(1).map('Key{}'.format)
res = df_sum.set_index([0, key]).unstack().reset_index()
res.columns = res.columns.droplevel(0)
print(res)
Key1 Key2 Key1 Key2
0 2008-04_E.pdf Mat1 Mat2 3 2
This seems like a 2-step process. It also requires that each group in column 0 has the same number of unique elements in column 1. First, group by the columns you want grouped:
df_grouped = df.groupby([0,1]).sum().reset_index()
Then reshape to the form you want:
def group_to_row(group):
    group = group.sort_values(1)
    output = []
    for i, row in group[[1, 2]].iterrows():
        output += row.tolist()
    return pd.DataFrame(data=[output])
df_output = df_grouped.groupby(0).apply(group_to_row).reset_index()
This is untested but this is also quite a non-standard form so unfortunately I don't think there's a standard Pandas function for you.

Replace unwanted strings in pandas dataframe element wise and efficiently

I have a very large dataframe (thousands x thousands); I am only showing a 5 x 3 slice here. time is the index:
col1 col2 col3
time
05/04/2018 05:14:52 AM +unend +unend 0
05/04/2018 05:14:57 AM 0 0 0
05/04/2018 05:15:02 AM 30.691 0.000 0.121
05/04/2018 05:15:07 AM 30.691 n. def. 0.108
05/04/2018 05:15:12 AM 30.715 0.000 0.105
As these are coming from some other device (the df is produced by pd.read_csv(filename)), the dataframe, instead of being entirely float, now ends up having unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace the strings with 0.0. I saw these answers Pandas replace type issue and replace string in pandas dataframe which, although they try to do the same thing, are column- or row-wise, but not elementwise. However, in the comments there were some good hints for proceeding in the general case as well.
If I try to do
mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] =0.0
I get the error: nothing to repeat
If I do
mask = df.apply(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask]=0.0
I get a Series object with True or False for every column rather than an elementwise mask, and therefore an error:
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.
The below
mask = df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask.values]=0.0
does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?), and I am also not sure whether I can use regex for the check rather than in, especially if I know there are mixed datatypes. Is there an efficient, fast, robust, but also elementwise general way to do this?
These are not the classical +infinity or NaN that df.fillna() could take care of
You can specify a list of strings to consider as NA when reading the csv file.
df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])
And then fill the NA values with fillna
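A short sketch of that route (the file name and the time index column are assumptions based on the frame shown in the question):
import pandas as pd

# '+unend' and 'n. def.' are parsed straight to NaN, then filled with 0.0
df = pd.read_csv('data.csv', index_col='time',
                 na_values=['+unend', 'n. def.'])
df = df.fillna(0.0)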
As pointed out by EdChum, if you need to replace all non-numeric values with 0: first, to_numeric with errors='coerce' creates NaN for values that cannot be parsed, and then fillna converts them to 0:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)
If the values are not substrings, use DataFrame.isin or the very nice answer of Haleemur Ali:
df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)
For substrings with defined values:
The + and . are special regex characters, so they need to be escaped with \:
df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'(\+unend|n\. def\.)')), 0).astype(float)
Or use applymap for an elementwise check:
df = df.mask(df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) ), 0).astype(float)
print (df)
col1 col2 col3
time
05/04/2018 05:14:52 AM 0.000 0.0 0.000
05/04/2018 05:14:57 AM 0.000 0.0 0.000
05/04/2018 05:15:02 AM 30.691 0.0 0.121
05/04/2018 05:15:07 AM 30.691 0.0 0.108
05/04/2018 05:15:12 AM 30.715 0.0 0.105
Do not use pd.Series.str.contains or pd.Series.isin
A more efficient solution to this problem is to use pd.to_numeric to try and convert all the data to numeric.
Use errors='coerce' to default to NaN, which you can then use with pd.Series.fillna.
cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').fillna(0)
