Remove value after specific character in pandas dataframe - python

I am using a pandas dataframe and I would like to remove all information after a space occurs. My dataframe is similar to this one:
import pandas as pd
d = {'Australia' : pd.Series([0,'1980 (F)\n\n1957 (T)\n\n',1991], index=['Australia', 'Belgium', 'France']),
'Belgium' : pd.Series([1980,0,1992], index=['Australia','Belgium', 'France']),
'France' : pd.Series([1991,1992,0], index=['Australia','Belgium', 'France'])}
df = pd.DataFrame(d, dtype='str')
df
I am able to remove the values for one specific column, however the split() function does not apply to the whole dataframe.
f = lambda x: x["Australia"].split(" ")[0]
df = df.apply(f, axis=1)
Does anyone have an idea how I could remove the information after a space occurs for each value in the dataframe?

I think you need to convert all columns to strings and then apply the split function:
df = df.astype(str).apply(lambda x: x.str.split().str[0])
Another solution:
df = df.astype(str).applymap(lambda x: x.split()[0])
print (df)
Australia Belgium France
Australia 0 1980 1991
Belgium 1980 0 1992
France 1991 1992 0

Let's try using assign, since the column names in this dataframe are well behaved, meaning they contain no spaces or special characters:
df.assign(Australia=df.Australia.str.split().str[0])
Output:
Australia Belgium France
Australia 0 1980 1991
Belgium 1980 0 1992
France 1991 1992 0
Or you can use apply and a lambda function if all your column dtypes are strings:
df.apply(lambda x: x.str.split().str[0])
Or, if you have a mixture of numeric and string dtypes, you can use select_dtypes with assign like this (this assumes numpy is imported as np):
df.assign(**df.select_dtypes(exclude=np.number).apply(lambda x: x.str.split().str[0]))
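As a minimal, self-contained sketch of that last approach (the frame and column names below are made up for illustration):
import numpy as np
import pandas as pd

# hypothetical frame mixing one numeric column with string columns
df = pd.DataFrame({'code': [1, 2],
                   'Australia': ['1980 (F)', '1991'],
                   'Belgium': ['1992 (T)', '0']})

# split only the non-numeric columns and write them back via assign
cleaned = df.assign(**df.select_dtypes(exclude=np.number)
                        .apply(lambda x: x.str.split().str[0]))
print(cleaned)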

You could loop over all the columns and apply the split to each:
for column in df:
    df[column] = df[column].str.split().str[0]
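If some of the columns are not already strings, a cast inside the loop keeps the .str accessor from failing; a minimal sketch, assuming the df built in the question:
for column in df:
    # cast to str first so .str works even on numeric cells
    df[column] = df[column].astype(str).str.split().str[0]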

Related

Find max and min values for several numeric columns and return a dataframe with the corresponding row value

I have the following dataset, with a 'Geo' column, a 'Geo code' column, and one numeric column per year ('1950', '1951', and so on).
For each year column, I would like to find the max and min values and return both the 'max' and 'min' values together with the corresponding 'Geo' value for each.
For instance, for '1950', '1951', and so on, I would like to produce a dataframe whose columns are the years and whose rows hold the max value, its 'Geo', the min value, and its 'Geo'.
This is a similar thread, but the suggested approaches there don't seem to work because my columns have numeric headers, plus my desired result is slightly different.
Any advice would be helpful. Thanks.
This should work, but surely a better solution exists. I assumed your initial dataframe is a pandas DataFrame named df.
dff = pd.DataFrame({'row_labels': ['Max_value', 'Max_geo', 'Min_value', 'Min_geo']})
for col in df.columns[2:]:  # start at column 1950
    col_list = []
    col_list.append(df[col].max())
    col_list.append(df.loc[df[col] == df[col].max(), 'Geo'].values[0])
    col_list.append(df[col].min())
    col_list.append(df.loc[df[col] == df[col].min(), 'Geo'].values[0])
    dff[col] = col_list
dff.set_index('row_labels', inplace=True, drop=True)
You can do this without having to loop or do any value comparisons to find the max, using max, min, idxmax and idxmin as follows (assuming your dataframe is df):
(df.melt(id_vars='Geo', var_name='year')
   .set_index('Geo')
   .groupby('year')
   .agg({'value': ('max', 'idxmax', 'min', 'idxmin')})
   .T)
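A runnable sketch of that chain on a small frame shaped like the one in the question (the frame below is illustrative; 'Geo code' is dropped first so it is not treated as a year column):
import pandas as pd

df = pd.DataFrame({'Geo': ['Afghanistan', 'Albania', 'Algeria', 'Angola'],
                   'Geo code': [4, 8, 12, 24],
                   '1950': [27.638, 54.191, 42.087, 35.524],
                   '1951': [27.878, 54.399, 42.282, 35.599]})

res = (df.drop(columns='Geo code')        # keep only 'Geo' and the year columns
         .melt(id_vars='Geo', var_name='year')
         .set_index('Geo')
         .groupby('year')
         .agg({'value': ('max', 'idxmax', 'min', 'idxmin')})
         .T)
print(res)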
You can use df.set_index with stack and GroupBy.agg:
In [1915]: df = pd.DataFrame({'Geo':['Afghanistan', 'Albania', 'Algeria', 'Angola'], 'Geo code':[4,8,12,24], '1950':[27.638, 54.191, 42.087, 35.524], '1951':[27.878, 54.399, 42.282, 35.599]})
In [1914]: df
Out[1914]:
Geo Geo code 1950 1951
0 Afghanistan 4 27.638 27.878
1 Albania 8 54.191 54.399
2 Algeria 12 42.087 42.282
3 Angola 24 35.524 35.599
In [1916]: x = df.set_index('Geo').stack().reset_index(level=1, name='value').query('level_1 != "Geo code"')
In [1917]: res = x.groupby('level_1').agg({'value': ('max', 'idxmax', 'min', 'idxmin')}).T
In [1918]: res
Out[1918]:
level_1 1950 1951
value max 54.191 54.399
idxmax Albania Albania
min 27.638 27.878
idxmin Afghanistan Afghanistan

Convert columns in pandas from non-null object to float

I'm having some difficulty converting the values from object to float.
I saw some examples but I wasn't able to get it to work.
I would like to have a for loop to convert the values in all columns.
I don't have a script yet because I saw different ways to do it:
Terawatt-hours Total Asia Pacific Total CIS Total Europe
2000 0.428429 0 0.134473
2001 0.608465 0 0.170166
2002 0.829254 0 0.276783
2003 1.11654 0 0.468726
2004 1.46406 0 0.751126
2005 1.85281 0 1.48641
2006 2.29128 0 2.52412
2007 2.74858 0 3.81573
2008 3.3306 0 7.5011
2009 4.3835 7.375e-06 14.1928
2010 6.73875 0.000240125 23.2634
2011 12.1544 0.00182275 46.7135
I tried this:
df = pd.read_excel(r'bp-stats-review-2019-all-data.xls')
columns = list(df.head(0))
for i in range(len(columns)):
    df[columns[i]].astype(float)
Your question is not clear as to which column you are trying to convert, so I am sharing an example for the first column in your screenshot.
df['Terawatt-hours'] = df['Terawatt-hours'].astype(float)
or the same for any other column.
EDIT
For creating a loop over the dataframe and changing all the columns, you can do the following:
Generate a dummy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
Check the type of each column in the dataframe:
for column in df.columns:
    print(df[column].dtype)
Change the type of all the columns to float:
for column in df.columns:
    df[column] = df[column].astype(float)
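If every column is meant to become float anyway, the per-column loop can be collapsed into a single cast (a minimal equivalent, assuming all values are actually convertible):
df = df.astype(float)
print(df.dtypes)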
Your question is not clear: which columns are you trying to convert to float? Also post what you have done.
EDIT:
What you tried is right up until the last line of your code, where you failed to reassign the columns:
df[columns[i]] = df[columns[i]].astype(float)
Also try using df.columns to get the column names instead of list(df.head(0)).
See the pandas docs on astype for how to cast a pandas object to a specified dtype.
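Putting that together, a minimal corrected version of the code from the question might look like this (the Excel filename is the asker's, and it is an assumption here that every column in that file can be cast to float):
import pandas as pd

df = pd.read_excel(r'bp-stats-review-2019-all-data.xls')
for col in df.columns:
    # reassign, otherwise the converted column is thrown away
    df[col] = df[col].astype(float)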

How to Loop over Numeric Column in Pandas Dataframe and filter Values?

df:
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
I want to loop over the numeric columns (Age, Salary) to check whether each value is numeric or not; if a string value is present in a numeric column, filter out that record and create a new data frame without those errors.
Output :
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
You could extend this approach to filter on multiple columns for numerical data types:
import pandas as pd
from io import StringIO
data = """
Org_Name,Emp_Name,Age,Salary
Axempl,Rick,29,1000
Lastik,John,34,2000
Xenon,sidd,47,9000
Foxtrix,Ammy,thirty,2000
Hensaui,giny,33,ten
menuia,rony,fifty,7000
lopex,nick,23,Ninety
"""
df = pd.read_csv(StringIO(data))
print('Original dataframe\n', df)
df = df[(df.Age.apply(lambda x: x.isnumeric())) &
        (df.Salary.apply(lambda x: x.isnumeric()))]
print('Filtered dataframe\n', df)
gives
Original dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
Filtered dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
I believe this can be solved using Pandas' "to_numeric" function.
import pandas as pd
df['Column to Check'] = pd.to_numeric(df['Column to Check'], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
Where 'Column to Check' is the name of the column you are checking for values that cannot be cast as an integer (or any numeric type); in your question I believe you will want to apply this code to 'Age' and 'Salary'. to_numeric will convert any values in those columns to NaN if they could not be cast as your selected type. The dropna method will remove all rows that have a NaN in any of your columns.
To loop over the columns like you ask, you could do the following:
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
EDIT:
In response to harry's comment: if there are preexisting NaNs in the data, something like the following should keep any valid row that had a preexisting NaN in one of the other columns.
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
    df = df[df[col].notnull()]
You can use a mask to indicate whether or not there is a string type among the Age and Salary columns:
mask_str = (df[['Age', 'Salary']]
            .applymap(lambda x: str(type(x)))
            .sum(axis=1)
            .str.contains("str"))
df[~mask_str]
This is assuming that the dataframe already contains the proper types. If not, you can convert them using the following:
def convert(val):
    try:
        return int(val)
    except ValueError:
        return val

df = (df.assign(Age=lambda f: f.Age.apply(convert),
                Salary=lambda f: f.Salary.apply(convert)))
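A short end-to-end sketch of the convert-then-mask idea on a few rows from the question (assuming the Age and Salary cells arrive as strings, as they would from a CSV):
import pandas as pd

df = pd.DataFrame({'Org_Name': ['Axempl', 'Lastik', 'Xenon', 'Foxtrix'],
                   'Emp_Name': ['Rick', 'John', 'sidd', 'Ammy'],
                   'Age': ['29', '34', '47', 'thirty'],
                   'Salary': ['1000', '2000', '9000', '2000']})

def convert(val):
    try:
        return int(val)
    except ValueError:
        return val

# turn numeric-looking strings into ints, leave the rest as strings
df[['Age', 'Salary']] = df[['Age', 'Salary']].applymap(convert)

# drop rows where either cell is still a string
mask_str = (df[['Age', 'Salary']]
            .applymap(lambda x: str(type(x)))
            .sum(axis=1)
            .str.contains('str'))
print(df[~mask_str])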

Comparing two DataFrames and showing the difference is generating an error

I'm able to find the difference by comparing the two DataFrames and concatenating the differences into a new DataFrame, but there is a problem: when values are missing in one of the DataFrames, an error is generated: ValueError: Can only compare identically-labeled Series objects. I think there is a problem with the header index. If you can help me it would be great.
df1 has one missing value in column 1980.
df1
Country     1980     1981     1982     1983     1984
Bermuda              0.00687  0.00727  0.00971  0.00752
Canada      9.6947   9.58952  9.20637  9.18989  9.78546
Greenland   7        0.00746  0.00722  0.00505  0.00799
Mexico      3.72819  4.11969  4.33477  4.06414  4.18464
df2
Country     1980     1981     1982     1983     1984
Bermuda     0.77777  0.00687  0.00727  0.00971  0.00752
Canada      9.6947   9.58952  9.20637  9.18989  9.78546
Greenland   0.00791  0.00746  0.00722  0.00505  0.00799
Mexico      3.72819  4.11969  4.33477  4.06414  4.18464
def process_df(df):
    res = df.set_index('Country').stack()
    res.index.rename('Column', level=1, inplace=True)
    return res

df1 = process_df(df1)
df2 = process_df(df2)
mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
df3 = pd.concat([df1[mask], df2[mask]], axis=1).rename({0: 'From', 1: 'To'}, axis=1)
print(df3)
I want to show the missing values as blank space, as in the example below:
                  From     To
Country   Column
Bermuda   1980             0.77777
Greenland 1980    0.00791  7
Remember, the code works fine if there are no missing values, but I want to be able to handle missing values as well. Thank you.
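One likely source of the ValueError is that stack() silently drops the NaN cell, so the two stacked Series no longer share the same index. A hedged sketch of one possible workaround, keeping NaN rows in the stack and blanking them for display (not verified against the full data):
def process_df(df):
    # dropna=False keeps the missing cell so both Series stay identically labeled
    res = df.set_index('Country').stack(dropna=False)
    res.index.rename('Column', level=1, inplace=True)
    return res

s1, s2 = process_df(df1), process_df(df2)
mask = (s1 != s2) & ~(s1.isnull() & s2.isnull())
df3 = (pd.concat([s1[mask], s2[mask]], axis=1)
         .rename({0: 'From', 1: 'To'}, axis=1)
         .fillna(''))   # show missing values as blank space
print(df3)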

Fill NaN based on MultiIndex Pandas

I have a pandas Data Frame that I would like to fill in some NaN values of.
import pandas as pd
tuples = [('a', 1990),('a', 1994),('a',1996),('b',1992),('b',1997),('c',2001)]
index = pd.MultiIndex.from_tuples(tuples, names = ['Type', 'Year'])
vals = ['NaN','NaN','SomeName','NaN','SomeOtherName','SomeThirdName']
df = pd.DataFrame(vals, index=index)
print(df)
0
Type Year
a 1990 NaN
1994 NaN
1996 SomeName
b 1992 NaN
1997 SomeOtherName
c 2001 SomeThirdName
The output that I would like is:
Type Year
a 1990 SomeName
1994 SomeName
1996 SomeName
b 1992 SomeOtherName
1997 SomeOtherName
c 2001 SomeThirdName
This needs to be done on a much larger DataFrame (millions of rows) where each 'Type' can have between 1-5 unique 'Years' and the name value is only present for the most recent year. I'm trying to avoid iterating over rows for performance purposes.
You can sort your data frame by index in descending order and then ffill it:
import pandas as pd
df.sort_index(level = [0,1], ascending = False).ffill()
# 0
# Type Year
# c 2001 SomeThirdName
# b 1997 SomeOtherName
# 1992 SomeOtherName
# a 1996 SomeName
# 1994 SomeName
# 1990 SomeName
Note: the example data doesn't really contain np.nan values but the string 'NaN', so in order for ffill to work you need to replace the 'NaN' strings with np.nan:
import numpy as np
df[0] = np.where(df[0] == "NaN", np.nan, df[0])
Or, as @ayhan suggested, after replacing the string "NaN" with np.nan, use df.bfill().
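Putting the pieces together, a small end-to-end sketch (assuming the frame built above; sorting back at the end restores the original ascending order):
import numpy as np
import pandas as pd

df[0] = df[0].replace('NaN', np.nan)           # literal 'NaN' strings become real NaN
filled = (df.sort_index(level=[0, 1], ascending=False)
            .ffill()                           # most recent name propagates to earlier years
            .sort_index())                     # back to the original order
print(filled)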
