Extracting the integers from a column of strings - python

I have 2 dataframes: longdf and shortdf. longdf is the 'master' list, and I basically need to match values from shortdf to longdf; for the rows that match, I replace values in other columns. Both longdf and shortdf need extensive data cleaning.
The goal is to reach the dataframe 'goal'. I was trying to use a for loop where I wanted to 1) extract all the numbers in each cell, and 2) strip the blanks/whitespace from the cell. First: why doesn't this for loop work? Second: is there a better way to do this?
import pandas as pd
a = pd.Series(['EY', 'BAIN', 'KPMG', 'EY'])
b = pd.Series([' 10wow this is terrible data8 ', '10/ USED TO BE ANOTHER NUMBER/ 2', ' OMG 106 OMG ', ' 10?7'])
y = pd.Series(['BAIN', 'KPMG', 'EY', 'EY' ])
z = pd.Series([108, 102, 106, 107 ])
goal = pd.DataFrame
shortdf = pd.DataFrame({'consultant': a, 'invoice_number':b})
longdf = shortdf.copy(deep=True)
goal = pd.DataFrame({'consultant': y, 'invoice_number':z})
shortinvoice = shortdf['invoice_number']
longinvoice = longdf['invoice_number']
frames = [shortinvoice, longinvoice]
new_list=[]
for eachitemer in frames:
    eachitemer.str.extract('(\d+)').astype(float)  # extracting all numbers in the df cell
    eachitemer.str.strip()  # strip the blanks/whitespace in between the numbers
    new_list.append(eachitemer)
new_short_df = new_list[0]
new_long_df = new_list[1]

If I understand correctly, you want to take a series of strings that contain integers and remove all the characters that aren't digits. You don't need a for-loop for this. Instead, you can solve it with a simple regular expression.
b.replace(r'\D+', '', regex=True).astype(int)
Returns:
0    108
1    102
2    106
3    107
dtype: int64
The regex replaces all characters that aren't digits (denoted by \D) with an empty string, removing anything that's not a number. .astype(int) converts the series to the integer type. You can merge the result into your final dataframe as normal:
result = pd.DataFrame({
    'consultant': a,
    'invoice_number': b.replace(r'\D+', '', regex=True).astype(int)
})
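As for why the original for loop seems to do nothing: .str.extract and .str.strip return new Series rather than modifying eachitemer in place, and those return values are discarded. Here is a minimal sketch of the loop with the results captured (note that str.extract(r'(\d+)') keeps only the first run of digits, which is why the replace-based approach above is a better match for the goal):
new_list = []
for eachitemer in frames:
    # capture the result -- the .str methods return new Series
    cleaned = eachitemer.str.strip().str.extract(r'(\d+)', expand=False).astype(float)
    new_list.append(cleaned)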


Merge several rows to a single row with lists in the cells only if elements are different

I have the following table:
   source_system    geo_id  product_subfamily  product_deny_list  product_allow_list  transaction_deny_list                               operation_allow_list  operation_filter
0  CONFIRMING_SCHF  FRK     CASH_MGMT          ' '                'CNF'               ' '                                                 ' '                   NaN
1  EQUATION_SCHF    FRK     CASH_MGMT          'CD','TEST','CB'   'CA'                '408','805','385','856','320','420','825','355...  ' '                   NaN
I would like to convert it to this new table of one single row:
   source_system                     geo_id  product_subfamily  product_deny_list   product_allow_list  transaction_deny_list  operation_allow_list  operation_filter
0  [CONFIRMING_SCHF, EQUATION_SCHF]  FRK     CASH_MGMT          ['CD','TEST','CB']  ['CNF', 'CA']       ' '                    ' '                   NaN
During the conversion, lists should be created in each cell only if the elements in the multiple rows are different; if they are the same, only the single value should be kept. If there is a blank string in one row and a non-blank value in the other row, the blank string should be dropped and the other value kept.
How could I do this?
Thanks in advance
Mini example of your data + solution:
d = {'source_system ': ['CONFIRMING_SCHF ', 'EQUATION_SCHF'], 'geo_id': ['FRK', 'FRK']}
df = pd.DataFrame(data=d)
df_list = df.apply(lambda x: list(set(x)))
df = pd.DataFrame(data=df_list).T
Result: a single-row frame with a list in each cell (note that set() does not preserve order, and the duplicate 'FRK' values collapse to a one-element list).
import pandas as pd
import numpy as np

def apply_func(x):
    x = list(filter(None, set(x)))  # filter out blank strings
    if len(x) <= 1:
        return ''.join(x)
    return x

df = pd.DataFrame((['CONFIRMING_SCHF', 'FRK', 'CASH_MGMT', '', 'CNF', '', '', np.nan],
                   ['EQUATION_SCHF', 'FRK', 'CASH_MGMT', "'CD','TEST','CB'", 'CA', '408355', '', np.nan]),
                  columns=['source_system', 'geo_id', 'product_subfamily', 'product_deny_list',
                           'product_allow_list', 'transaction_deny_list', 'operation_allow_list',
                           'operation_filter'])
df['UNIQUE'] = 1
df_list = df.groupby('UNIQUE').agg(apply_func)  # you can apply reset_index(drop=True) as well
df_list
This might not be the proper solution for you because I added an additional column named UNIQUE, but it produces the output you expected, and you can apply almost all of your conditions inside the apply_func function.
Output:
                           source_system geo_id product_subfamily product_deny_list product_allow_list transaction_deny_list operation_allow_list operation_filter
UNIQUE
1       [CONFIRMING_SCHF, EQUATION_SCHF]    FRK         CASH_MGMT  'CD','TEST','CB'          [CNF, CA]                408355                            [nan, nan]
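If you would rather avoid the helper UNIQUE column, a sketch of the same reduction builds the one-row frame directly by applying apply_func to each column:
# reduce every column with apply_func into a single-row frame; no grouping column needed
collapsed = pd.DataFrame([{col: apply_func(df[col]) for col in df.columns}])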

Concat Columns of Dataframe in python?

I have a data frame generated with the code below:
# importing pandas as pd
import pandas as pd
# Create the dataframe
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D'],
                   'Event': ['Music Theater', 'Poetry Music', 'Theatre Comedy', 'Comedy Theatre'],
                   'Cost': [10000, 5000, 15000, 2000]})
# Print the dataframe
print(df)
I want a list to be generated combining all three columns, with whitespace replaced by "_" and all trailing spaces removed, like:
[A_Music_Theater_10000, B_Poetry_Music_5000, C_Theatre_Comedy_15000, D_Comedy_Theatre_2000]
I want to do it in the most optimized way, as running time is an issue for me, so I'm looking to avoid for loops. Can anybody tell me how I can achieve this in the most optimized way?
The most general solution is to convert all values to strings, use join, and finally replace:
df['new'] = df.astype(str).apply('_'.join, axis=1).str.replace(' ', '_')
If need filter only some columns:
cols = ['Category','Event','Cost']
df['new'] = df[cols].astype(str).apply('_'.join, axis=1).str.replace(' ', '_')
Or process each column separately - if necessary replace, and also convert the numeric column to strings:
df['new'] = (df['Category'] + '_' +
             df['Event'].str.replace(' ', '_') + '_' +
             df['Cost'].astype(str))
Or, after converting to strings, add _ and sum; but it is then necessary, after the replace, to remove the trailing _ with rstrip:
df['new'] = df.astype(str).add('_').sum(axis=1).str.replace(' ', '_').str.rstrip('_')
print(df)
Category Event Cost new
0 A Music Theater 10000 A_Music_Theater_10000
1 B Poetry Music 5000 B_Poetry_Music_5000
2 C Theatre Comedy 15000 C_Theatre_Comedy_15000
3 D Comedy Theatre 2000 D_Comedy_Theatre_2000
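Since the question asks for a plain list rather than a new column, calling .tolist() on the same expression gives it directly:
new_list = df.astype(str).apply('_'.join, axis=1).str.replace(' ', '_').tolist()
# ['A_Music_Theater_10000', 'B_Poetry_Music_5000', 'C_Theatre_Comedy_15000', 'D_Comedy_Theatre_2000']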

Python remove uppercase and empty elements from dataframe

I am new to dealing with lists in a data frame. I have a data frame with one column that contains list-like values. I am trying to remove 'empty list' and 'upper case' elements from this column. Here is what I tried; what am I missing in this code?
Data csv:
id,list_col
1,"['',' books','PARAGRAPH','ISBN number','Harry Potter']"
2,"['',' books','TESTS','events 1234','Harry Potter',' 1 ']"
3,
4,"['',' books','PARAGRAPH','','PUBLISHES number','Garden Guide', '']"
5,"['',' books','PARAGRAPH','PUBLISHES number','The Lord of the Rings']"
Code:
df = pd.read_csv('sample.csv')
# (1) # trying to remove empty list but not working
df['list_col'] = list(filter(None, [w[2:] for w in df['list_col'].astype(str)]))
df['list_col']
# (2) remove upper case elements in the dataframe
#AttributeError: 'map' object has no attribute 'upper'
df['list_col'] = [t for t in (w for w in df['list_col'].astype(str)) != t.upper()]
Output Looking for:
id list_col
1 [' books','ISBN number','Harry Potter']
2 [' books','events 1234','Harry Potter',' 1 ']
3
4 [' books','PUBLISHES number','Garden Guide']
5 [' books','PUBLISHES number','The Lord of the Rings']
When pandas loads your CSV it loads the list as a quoted string, which can be converted into a Python list with eval, and then you can use re.match to remove the uppercase elements.
Code:
import pandas as pd
from re import compile

regex = compile('^[A-Z]{1,}$')
df = pd.read_csv(r'./input.csv')
null_mask = df.loc[:, 'list_col'].isna()
df.loc[~null_mask, 'list_col'] = (
    df.loc[~null_mask, 'list_col']
    .apply(lambda x: eval(x))  # parse the quoted list string
    .apply(lambda y: list(filter(lambda z: z and regex.match(z) is None, y))
           if isinstance(y, list) else list())  # drop empty and all-caps items
)
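A small safety note: if the CSV content is not fully trusted, ast.literal_eval is a drop-in replacement for eval here that parses Python literals without executing arbitrary code:
import ast

safe_lists = df.loc[~null_mask, 'list_col'].apply(ast.literal_eval)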

Values in Pandas dataframe is mixed and shifted

Values in my Pandas dataframe are mixed and shifted, but each column has its own characteristic values. How can I rearrange the values into their proper columns?
'floor_no' has to contain values with ' / ' in them.
'room_count' is at most 2 digits long.
'sq_m_count' has to contain the ' m²' substring.
'price_sq' has to contain ' USD/m²'.
'bs_state' has to contain one of the values 'Have' or 'Do not have'.
I'm adding part of the pandas dataframe.
Consider the following approach:
In [90]: dfs = []
In [91]: url = 'https://ru.bina.az/items/565674'
In [92]: dfs.append(pd.read_html(url)[0].set_index(0).T)
In [93]: url = 'https://ru.bina.az/items/551883'
In [94]: dfs.append(pd.read_html(url)[0].set_index(0).T)
In [95]: df = pd.concat(dfs, ignore_index=True)
In [96]: df
Out[96]:
0    Категория Площадь Количество комнат Купчая
0  Дом / Вилла  376 м²                 6   есть
1  Дом / Вилла  605 м²                 6    нет
I figured out a solution, though it is a bit quick and dirty.
I wrote code that checks whether each of these columns contains a string that identifies the column the value belongs to, and copies that value into a new column. Then I simply substitute the new column for the old one.
I did this with each of the 'mixed' columns. This code fills my needs and fixed the problem. I understand how ugly the code is and will write a function that is much shorter and more professional.
# these boolean-mask assignments are vectorized, so no explicit row loop is needed
bina_az_df.loc[bina_az_df['bs_state'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['bs_state']
bina_az_df.loc[bina_az_df['sq_m_count'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['sq_m_count']
bina_az_df.loc[bina_az_df['floor_no'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['floor_no']
bina_az_df.loc[bina_az_df['price_sq'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['price_sq']
bina_az_df.loc[bina_az_df['room_count'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['room_count']

bina_az_df['sq_m_count'] = bina_az_df['new_sq_m_count']  # substitutes the new column for the old one
del bina_az_df['new_sq_m_count']  # deletes the unnecessary temp column
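A shorter generalization of the same idea, as a sketch: the pattern for each target column below is an assumption reconstructed from the rules in the question, so adjust them to the real data.
import pandas as pd

# map each target column to a pattern that identifies values belonging to it
patterns = {
    'sq_m_count': ' m²|sot',
    'price_sq': ' USD/m²',
    'floor_no': ' / ',
    'bs_state': '^(?:Have|Do not have)$',
}
source_cols = ['floor_no', 'room_count', 'sq_m_count', 'price_sq', 'bs_state']

for target, pattern in patterns.items():
    fixed = pd.Series(index=bina_az_df.index, dtype=object)
    for col in source_cols:
        # copy values matching the target's pattern, wherever they currently sit
        mask = bina_az_df[col].str.contains(pattern, na=False)
        fixed[mask] = bina_az_df.loc[mask, col]
    bina_az_df[target] = fixed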

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged in Python 2.7.
Here is what I'm doing:
def remove_whitespace(x):
    if isinstance(x, basestring):
        return x.strip()
    else:
        return x

my_data = my_data.applymap(remove_whitespace)
Is there a better or more Pandas-idiomatic way to do this?
Is there a more efficient way (perhaps by doing things column wise)?
I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.
Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
values
0 ABC
1 DEF
2 GHI
>>> data['values'].str.strip()
0 ABC
1 DEF
2 GHI
Name: values, dtype: object
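Note that str.strip() returns a new Series rather than modifying the dataframe in place, so assign the result back to keep it:
data['values'] = data['values'].str.strip()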
We want to:
Apply our function to each element in our dataframe - use applymap.
Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).
Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).
Therefore, I've found the following to be the easiest:
df.applymap(lambda x: x.strip() if type(x)==str else x)
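As a side note, in pandas 2.1 and later DataFrame.applymap was renamed to DataFrame.map, so the same one-liner becomes:
df.map(lambda x: x.strip() if type(x) == str else x)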
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ddd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
0 1 2 3 4 5
0 1.5 aaa bbb ddd 10 XXX
1 2.5 eee fff ggg 20 YYY
The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.
import pandas as pd

data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
print('-----')
print(data)

data['values'].str.strip()  # returns a new Series; the dataframe is unchanged
print('-----')
print(data)

new = data['values'].str.strip()
data['values'] = new
print('-----')
print(new)
Here is a column-wise solution with pandas apply:
import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)
This will convert values in object-type columns to strings. Take caution with mixed-type columns: for example, if your column holds zip codes as the integer 20001 and the string ' 21110 ', you will end up with the strings '20001' and '21110'.
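A variant of the function above avoids that cast by stripping only actual str values inside object columns (a sketch; strip_obj_safe is a hypothetical name, not from the original answer):
def strip_obj_safe(col):
    if col.dtypes == object:
        # strip only real strings; leave ints, floats, and NaN untouched
        return col.map(lambda v: v.strip() if isinstance(v, str) else v)
    return col

df = df.apply(strip_obj_safe, axis=0)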
This worked for me - applies it to the whole dataframe:
def panda_strip(x):
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    return pd.Series(r)

df = df.apply(lambda x: panda_strip(x))
I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.
import pandas as pd

def remove_whitespace(x):
    try:
        # remove spaces inside and outside of the string
        x = "".join(x.split())
    except:
        pass
    return x

# apply remove_whitespace to a single column
df.orderId = df.orderId.apply(remove_whitespace)
print(df)

# apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)
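For completeness, a single regex replace over the whole frame strips only leading and trailing whitespace, without collapsing inner spaces the way the split/join approach above does:
df = df.replace(r'^\s+|\s+$', '', regex=True)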
