String substitution with regex or regular Python?

I have a list of strings like the following:
orig = ["a1  2.3  ABC  4  DEFG  567  b890",
        "a2  3.0  HI  4  5  JKL  67  c65",
        "b1  1.2  MNOP  3  45  67  89  QR  987  d64  e112"]
Context here is that this is a CSV file in which certain columns are omitted. I don't think that the pandas CSV reader can handle these cases. The idea is now to inject na for the missing values, so the output becomes
corr = ["a1  2.3  ABC  4  na  na  na  DEFG  567  b890",
        "a2  3.0  HI  4  5  na  na  JKL  67  c65",
        "b1  1.2  MNOP  3  45  67  89  QR  987  d64  e112"]
so that the second column of capitalised words stays aligned later on, when imported into pandas.
The structure is the following: delimiters between columns are two or more whitespaces, and between the two upper-case columns there have to be four values. In the original file there are always exactly two upper-case columns, there are at least one and at most four numbers between them, and there are only number values between these upper-case words.
I can write a script in native Python without problems, so please no suggestions for that. But I thought this might be a case for regex. As a regex beginner, I only managed to extract the string between the two upper-case columns with
for line in orig:
    a = re.findall(r"([A-Z]+[\s\d]+[A-Z]+)", line)
    print(a)
>>> ['ABC  4  DEFG']  # etc.
Is there now an easy way in regex to determine how many numbers are between the upper-case words and to insert 'na' values so that there are always four values in between? Or should I do it in native Python?
Of course, if there is a way to do this with the pandas CSV reader, that would be even better. But I have studied the pandas read_csv docs and haven't found anything useful.

For a complete pandas approach, split and concat might help, i.e.
ndf = pd.Series(orig).str.split(expand=True)
#     0    1     2  3     4    5     6     7     8     9    10
# 0  a1  2.3   ABC  4  DEFG  567  b890  None  None  None  None
# 1  a2  3.0    HI  4     5  JKL    67   c65  None  None  None
# 2  b1  1.2  MNOP  3    45   67    89    QR   987   d64  e112

# sort each tail row so that the None padding moves to the front
df = pd.concat([ndf.iloc[:, :4],
                ndf.iloc[:, 4:].apply(sorted, key=pd.notnull, axis=1)],
               axis=1)
df.astype(str).apply(' '.join, axis=1).tolist()
['a1 2.3 ABC 4 None None None None DEFG 567 b890',
 'a2 3.0 HI 4 None None None 5 JKL 67 c65',
 'b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112']
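The sorting trick works because pd.notnull maps None to False and real values to True, and Python's sort is stable: the None padding of each row moves to the front while the real values keep their relative order. For example:
import pandas as pd

row = ['DEFG', '567', 'b890', None, None, None, None]
print(sorted(row, key=pd.notnull))
# [None, None, None, None, 'DEFG', '567', 'b890']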

Though the consensus seems to be that regex is not the best tool for such a dynamic string substitution, I found the re module quite comfortable to use in this context. The capturing pattern is based on a comment by Jon Clements.
import re

orig = ["a1  2.3  ABC  4  DEFG  567  b890",
        "a2  3.0  HI  4  5  JKL  67  c65",
        "b1  1.2  MNOP  3  45  67  89  QR  987  d64  e112"]
corr = []
for item in orig:
    # capture group starting with the first capitalised word and stopping before the second
    col_betw = re.search(r"\s{2,}([A-Z]+.*)\s{2,}[A-Z]+\s{2,}", item).group(1)
    # determine how many elements we have in this segment
    nr_col_betw = len(re.split(r"\s{2,}", col_betw))
    # substitute if there are not enough numbers
    if nr_col_betw <= 4:
        # pad with NA, which the pandas csv reader interprets as NaN
        subst = col_betw + "  NA" * (5 - nr_col_betw)
        item = item.replace(col_betw, subst, 1)
    corr.append(item)
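For completeness, here is a minimal sketch of how the corrected lines could then be read into pandas, assuming the delimiters really are two or more whitespaces as described above (a regex separator needs the python engine, and the string NA is parsed as NaN by default):
import io
import pandas as pd

# parse the padded lines with a regex delimiter; the data has no header row
buf = io.StringIO("\n".join(corr))
df = pd.read_csv(buf, sep=r"\s{2,}", engine="python", header=None)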

Related

pandas/python: replacing categorical values in dataframe through iteration

I created a database and I am trying to substitute the categorical variables with some numerical values that I calculated via 'pivot'. In my code I am trying to iterate through the whole dataframe: if a cell in one of the categorical columns has the same value as one of the elements in 'sublist_names', it should be replaced by the element of 'sublist_values' located at the same position as that value in 'sublist_names'.
For example, while iterating the dataframe and each of the categorical columns, the first value of the column called 'Name' is the string 'tom'. 'tom' is exactly the 7th element in 'sublist_names', which means it should be replaced by the 7th element in 'sublist_values', which is equal to 150.
I was able to obtain all the needed values, but when it comes to solving this last task by iterating the whole dataframe instead of working column by column, I am not sure how to do it.
I hope I explained it clearly, but feel free to ask any questions.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = [['tom', 10, 6, 'brown', 200],
        ['nick', 15, 5.10, 'red', 150],
        ['juli', 14, 5.5, 'black', 170],
        ['peter', 10, 6, 'blue', 290],
        ['axel', 15, 5.10, 'yellow', 190],
        ['william', 14, 5.5, 'yellow', 170],
        ['tom', 10, 6, 'orange', 100],
        ['tom', 15, 5.10, 'brown', 150],
        ['angela', 14, 5.5, 'black', 160],
        ['peter', 10, 6, 'purple', 220],
        ['nick', 15, 5.10, 'orange', 150],
        ['aroon', 14, 5.5, 'red', 170]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'height', 'color', 'weight'])

categorical_variables = df.select_dtypes('object')  # categorical variables
categ_var_list = list(categorical_variables)
print(categ_var_list)

condition_pivot_list_names = []
pivot_values_list = []
for i in categ_var_list:
    # mean weight per value of this categorical column
    condition_pivot = df.pivot_table(index=i, values='weight', aggfunc=np.mean)
    pivot_names = condition_pivot.index.values.tolist()
    condition_pivot_list_names.append(pivot_names)
    pivot_values_draft = condition_pivot.values.tolist()
    pivot_values = [i[0] for i in pivot_values_draft]
    pivot_values_list.append(pivot_values)
print(condition_pivot_list_names, 'condition pivot list names')
print(pivot_values_list, 'pivot values list')

sublist_names = [sublists for sublists in condition_pivot_list_names]
print(sublist_names)
sublist_values = [sublists1 for sublists1 in pivot_values_list]
print(sublist_values)

def myfunc(x):
    if x in sublist_names:
        index = sublist_names.index(x)
        return sublist_values[index]
    return x

df['Name'] = df['Name'].apply(lambda x: myfunc(x))
print(df['Name'])
This is what print(df['Name']) shows:
0 tom
1 nick
2 juli
3 peter
4 axel
5 william
6 tom
7 tom
8 angela
9 peter
10 nick
11 aroon
And this is what it should show:
0 150
1 150
2 170
3 255
4 190
5 170
6 150
7 150
8 160
9 255
10 150
11 170
You have two categorical columns, Name and color. So you can do something like this:
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
Then you can create a function myfunc() which will receive x from the code above. What the code above does is iterate over the column and pass each row's value, one by one, to the function. Inside the function you can define the logic to convert the categorical values, something like this:
def myfunc(x):
    if x in sublist_names:
        index = sublist_names.index(x)
        return sublist_values[index]
    return x
Do the same thing for the column color.
Try this: it keeps the first occurrence of each name and replaces the repeated ones with the weight.
df.Name = np.where(df.groupby('Name', as_index=False)['Name'].cumcount().eq(0),
                   df.Name, df.weight)
Output:
       Name  Age  height   color  weight
0       tom   10     6.0   brown     200
1      nick   15     5.1     red     150
2      juli   14     5.5   black     170
3     peter   10     6.0    blue     290
4      axel   15     5.1  yellow     190
5   william   14     5.5  yellow     170
6       100   10     6.0  orange     100
7       150   15     5.1   brown     150
8    angela   14     5.5   black     160
9       220   10     6.0  purple     220
10      150   15     5.1  orange     150
11    aroon   14     5.5     red     170
Okay, I see your problem. Just write the code below before the function declaration:
sub_names = []
sub_values = []
for i in sublist_names:
    sub_names.extend(i)
for i in sublist_values:
    sub_values.extend(i)
Also don't forget to update the variable names in myfunc().
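As an alternative to the index lookup, once the sublists are flattened you can build a single dict and let Series.map do the work; a minimal sketch, assuming the flattened sub_names/sub_values lists from above:
# build one value -> number mapping covering names and colours alike
mapping = dict(zip(sub_names, sub_values))
# map each categorical column; entries missing from the mapping keep their value
df['Name'] = df['Name'].map(mapping).fillna(df['Name'])
df['color'] = df['color'].map(mapping).fillna(df['color'])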

Removing from pandas dataframe all rows having less than 3 characters

I have this dataframe:
          Word  Frequency
0            :         79
1            ,         60
2         look         26
3            e         26
4            a         25
..         ...        ...
95       trump          2
96    election          2
97        step          2
98         day          2
99  university          2
I would like to remove all words having fewer than 3 characters.
I tried as follows:
df['Word'] = df['Word'].str.findall(r'\w{3,}').str.join(' ')
but it does not remove them from my dataset.
Can you please tell me how to remove them?
My expected output would be:
          Word  Frequency
2         look         26
..         ...        ...
95       trump          2
96    election          2
97        step          2
98         day          2
99  university          2
Try with:
df = df[df['Word'].str.len() >= 3]
Instead of attempting a regular expression, you can use .str.len() to get the length of each string in your column, then simply filter on that length being >= 3.
Should look like:
df.loc[df["Word"].str.len() >= 3]
Please try:
df[df.Word.str.len() >= 3]
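For a quick self-contained check of the .str.len() approach (a few rows of the question's frame, reconstructed here just for the demo):
import pandas as pd

df = pd.DataFrame({'Word': [':', ',', 'look', 'e', 'a', 'trump'],
                   'Frequency': [79, 60, 26, 26, 25, 2]})
# keep only rows whose Word has at least three characters
print(df[df['Word'].str.len() >= 3])
#     Word  Frequency
# 2   look         26
# 5  trump          2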

How can I merge these two datasets on 'Name' and 'Year'?

I am new to this field and stuck on this problem. I have two datasets.
all_batsman_df: this df has 5 columns ('years', 'team', 'pos', 'name', 'salary')
   years team pos           name     salary
0   1991   SF  1B     Will Clark  3750000.0
1   1991  NYY  1B  Don Mattingly  3420000.0
2   1991  BAL  1B    Glenn Davis  3275000.0
3   1991  MIL  DH   Paul Molitor  3233333.0
4   1991  TOR  3B   Kelly Gruber  3033333.0
all_batting_statistics_df: this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has the name 'Glenn Davis', but the second has 'Glen Davis'.
Now I want to know: how can I merge the two using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest if I can do it in a better way.
import cdifflib
import pandas as pd

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_year'], how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['Name'])[0])
to replace names in df_b with the closest match from df_a, then do your merge. See also this post.
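Putting it together, a minimal sketch with the column names from the question ('Year'/'Name' in df_a, 'years'/'name' in df_b, assuming the year columns share a dtype); note that get_close_matches can return an empty list, hence the fallback:
import difflib
import pandas as pd

def closest(name, candidates):
    # fall back to the original name when no close match exists
    matches = difflib.get_close_matches(name, candidates)
    return matches[0] if matches else name

candidates = df_a['Name'].tolist()
df_b['name'] = df_b['name'].apply(lambda x: closest(x, candidates))
merged_df = pd.merge(df_a, df_b, left_on=['Year', 'Name'],
                     right_on=['years', 'name'], how='inner')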
Let me get at your problem by assuming that you have to make a data set with 2 columns, the 2 columns being 1. 'year' and 2. 'name'.
Okay:
1. We will first rename all the names which are wrong. I assume you know all the wrong names in all_batting_statistics_df; replace them like this:
all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, start from the smaller data set which has the names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only 'year' and 'name'. Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'], axis=1)
I cannot see all the 31 columns, so I left the rest out; you have to add them to the code above.
3. We need to change the column names to look the same, i.e. 'year' and 'name', using the dataframe rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, to merge them we will use this:
all_batsman_df_1.merge(df_new_1, on=['year', 'name'])
FINAL THOUGHTS: If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with that more advanced software. If you like pandas, it's not that difficult; you will find a way. All the best!

Getting KeyError after reading in pipe-separated CSV

I read in a pipe-separated CSV like this
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv")
test.head()
This returns
SK_Country|"Number"|"Alpha2Code"|"Alpha3Code"|"CountryName"|"TopLevelDomain"
0 1|20|"ad"|"and"|"Andorra"|".ad"
1 2|4|"af"|"afg"|"Afghanistan"|".af"
2 3|28|"ag"|"atg"|"Antigua and Barbuda"|".ag"
3 4|660|"ai"|"aia"|"Anguilla"|".ai"
4 5|8|"al"|"alb"|"Albania"|".al"
When I try and extract specific data from it, like below:
df = test[["Alpha3Code"]]
I get the following error:
KeyError: ['Alpha3Code'] not in index
I don't understand what goes wrong: I can see the value in the CSV when I print the head, and likewise when I open the CSV everything looks fine.
I've googled around and read some posts regarding the issue here on Stack Overflow and tried different approaches, but nothing seems to fix this annoying problem.
Notice how everything is crammed into one string column? That's because you didn't specify the delimiter separating columns to pd.read_csv, which in this case has to be '|'.
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv",
                   sep='|')
test.head()
# SK_Country Number Alpha2Code Alpha3Code CountryName \
# 0 1 20 ad and Andorra
# 1 2 4 af afg Afghanistan
# 2 3 28 ag atg Antigua and Barbuda
# 3 4 660 ai aia Anguilla
# 4 5 8 al alb Albania
#
# TopLevelDomain
# 0 .ad
# 1 .af
# 2 .ag
# 3 .ai
# 4 .al
As pointed out in the comment by @chrisz, you have to specify the delimiter:
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv",
                   delimiter='|')
test.head()
SK_Country Number Alpha2Code Alpha3Code CountryName \
0 1 20 ad and Andorra
1 2 4 af afg Afghanistan
2 3 28 ag atg Antigua and Barbuda
3 4 660 ai aia Anguilla
4 5 8 al alb Albania
TopLevelDomain
0 .ad
1 .af
2 .ag
3 .ai
4 .al
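With the delimiter in place, the selection that raised the KeyError works as expected:
df = test[["Alpha3Code"]]
df.head()
#   Alpha3Code
# 0        and
# 1        afg
# 2        atg
# 3        aia
# 4        alb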

How to delete rows with less than a certain amount of items or strings with Pandas?

I have searched a lot but couldn't find a solution for this particular case. I want to remove any row whose list contains fewer than 3 strings or items. My issue is described more clearly further down.
I'm preparing LDA topic modelling on a large Swedish database in pandas and have limited the test case to 1000 rows. I'm only concerned with a specific column, and my approach so far has been as follows:
con = sqlite3.connect('/Users/mo/EXP/NAV/afm.db')
sql = """
select * from stillinger limit 1000
"""
dfs = pd.read_sql(sql, con)
plb = """
select PLATSBESKRIVNING from stillinger limit 1000
"""
dfp = pd.read_sql(plb, con);dfp
Then I've defined a regular expression where the first argument removes any meta characters while keeping the Swedish and Norwegian language-specific letters, and the second argument removes words of three characters or fewer:
rep = {
    'PLATSBESKRIVNING': {
        r'[^A-Za-zÅåÄäÖöÆØÅæøå]+': ' ',
        r'\W*\b\w{1,3}\b': ' '}
}
p0 = (pd.DataFrame(dfp['PLATSBESKRIVNING'].str.lower())
      .replace(rep, regex=True)
      .drop_duplicates('PLATSBESKRIVNING')
      .reset_index(drop=True)); p0
PLATSBESKRIVNING
0 medrek rekrytering söker uppdrag manpower h...
1 familj barn tjejer kille söker pair ...
2 uppgift blir tillsammans medarbetare leda ...
3 behov operasjonssykepleiere langtidsoppdr...
4 detta perfekta jobbet arbetstiderna vardaga...
5 familj paris barn söker älskar barn v...
6 alla inom cafe restaurang förekommande arbets...
.
.
Creating a pandas Series:
s0 = p0['PLATSBESKRIVNING']
Then:
ts = s0.str.lower().str.split();ts
0 [medrek, rekrytering, söker, uppdrag, manpower...
1 [familj, barn, tjejer, kille, söker, pair, vil...
2 [uppgift, blir, tillsammans, medarbetare, leda...
3 [behov, operasjonssykepleiere, langtidsoppdrag...
4 [detta, perfekta, jobbet, arbetstiderna, varda...
5 [familj, paris, barn, söker, älskar, barn, vil...
6 [alla, inom, cafe, restaurang, förekommande, a...
7 [diskare, till, cafe, dubbel, sökes, arbetet, ...
8 [diskare, till, thelins, konditori, sökes, arb...
Removing the stop words (mswl being my stop-word list):
r = s0.str.split().apply(lambda x: [item for item in x if item not in mswl]); r
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 []
8 [thelins]
9 [paris, baby, månader, våning, delar, badrum, ...
Creating a new DataFrame and removing the empty brackets:
dr = pd.DataFrame(r)
dr0 = dr[dr.astype(str)['PLATSBESKRIVNING'] != '[]'].reset_index(drop=True); dr0
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 [thelins]
8 [paris, baby, månader, våning, delar, badrum, ...
Converting each list to its string representation:
dr1 = dr0['PLATSBESKRIVNING'].apply(str); len(dr1),type(dr1), dr1
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['erfaring', 'sykepleier', 'legitimasjon']
4 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
5 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
6 ['förekommande', 'emot', 'utbildning']
7 ['thelins']
8 ['paris', 'baby', 'månader', 'våning', 'delar'...
My issue now is that I want to remove any rows whose lists contain fewer than 3 strings, e.g. rows 3, 6 and 7. The desired result would be like this:
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
4 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
5 ['paris', 'baby', 'månader', 'våning', 'delar'...
.
.
How can I obtain this? I'm also wondering if this could be done in a neater way; my approach seems clumsy and cumbersome.
I would also like to remove both the indexes and the column name for the LDA topic modelling, so that I can write it to a text file without the header and the index digits. I have tried:
dr1.to_csv('LDA1.txt', header=None, index=False)
But this wraps quotation marks around each list of strings in the file, like "['word1', 'word2', 't.. ]".
Any suggestions would be much appreciated.
Best regards
Mo
Just measure the number of items in each list and keep only the rows with more than 3 items:
dr0['length'] = dr0['PLATSBESKRIVNING'].apply(len)
cond = dr0['length'] > 3
dr0 = dr0[cond]
You can apply len and then select the data, storing it in whichever dataframe variable you like, i.e.
df[df['PLATSBESKRIVNING'].apply(len) > 3]
Output:
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, nice]
1 [föräldrarna, citycentre, stort, tomt]
2 [utveckla, övergripande, strategiska, fince]
4 [arbetstiderna, vardagar, härliga, männ]
5 [paris, utav, badrum, båda, yngsta]
8 [paris, baby, månader, våning, delar]
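Regarding the quoting issue at the end of the question: to_csv serializes each token list as a Python list literal, which is where the brackets and quotation marks come from. A minimal sketch of joining the tokens back into plain strings before writing (assuming dr0 holds the filtered lists):
# join each token list into one space-separated string per row
out = dr0['PLATSBESKRIVNING'].apply(' '.join)
out.to_csv('LDA1.txt', header=False, index=False)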
