Pattern to split a column based on regex - python

I have a data frame where each row contains a full name and a website in a single column. I need to split that into two columns: name and website.
I've tried pandas str.split, but I'm struggling to create a regex pattern that catches an initial 'http' plus the rest of the website. Some websites start with http and others with https.
df = pd.DataFrame([['John Smith http://website.com'],['Alan Delon https://alandelon.com']])
I want a pattern that correctly identifies the website so I can split my data. Any help would be very much appreciated.

Using str.split
pd.DataFrame(df[0].str.split(r'\s(?=http)').tolist()).rename({0: 'Name', 1: 'Website'}, axis=1)
Output
         Name                Website
0  John Smith     http://website.com
1  Alan Delon  https://alandelon.com
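A slightly more compact variant of the same idea (a sketch using the df defined above) lets str.split expand straight into named columns; n=1 splits only on the first whitespace that precedes 'http':
out = df[0].str.split(r'\s(?=http)', n=1, expand=True)
out.columns = ['Name', 'Website']
print(out)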

Using str.extract
Ex:
df = pd.DataFrame([['John Smith http://website.com'],['Alan Delon https://alandelon.com']], columns=["data"])
df[["Name", "Url"]] = df["data"].str.extract(r"(.*?)(http.*)")
print(df)
Output:
                               data        Name                    Url
0     John Smith http://website.com  John Smith     http://website.com
1  Alan Delon https://alandelon.com  Alan Delon  https://alandelon.com
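Note that the lazy group (.*?) keeps the trailing space before the URL in the Name column; if you want it trimmed, a small follow-up (a sketch) is:
df["Name"] = df["Name"].str.strip()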

Related

See if object from one dataframe appears in other dataframe, when one has numbers added (e.g. string, string1)

I have two dataframes with actor names (their types are object) that look like the following:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul', etc...]})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998', etc...]})
Normally I would use the following code to find whether an item from one dataframe is present in the other, but because of the numbers appended in the second dataframe I get 0 matches. Is there any smart way of getting around this?
df.assign(indf=df.Actors.isin(df2.id).astype(int))
Obviously, the code above did not work.
You can extract the actor names from df2['id'] and check if df['Actors'] is in it:
df.assign(indf=df['Actors'].isin(df2['id'].str.extract(r'(.*)(?=\s[-–])',
                                                       expand=False)).astype(int))
output:
           Actors  indf
0  Christian Bale     0
1    Ben Kingsley     0
2   Halley Bailey     1
3      Aaron Paul     0
Another, more generic, approach relying on a regex:
import re
regex = '|'.join(map(re.escape, df['Actors']))
# 'Christian\\ Bale|Ben\\ Kingsley|Halley\\ Bailey|Aaron\\ Paul'
actors = df2['id'].str.extract(f'({regex})', expand=False).dropna()
df.assign(indf=df['Actors'].isin(actors).astype(int))
used inputs:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul']})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998']})
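If you want to sanity-check the intermediate extraction with those inputs before feeding it to isin, a quick sketch is:
names = df2['id'].str.extract(r'(.*)(?=\s[-–])', expand=False)
print(names.tolist())  # ['Halley Bailey', 'Coco Jones']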

Parse *.txt file looping with a dictionary comprehension

I have a *.txt file, coming from an SQL query, organised in rows.
I'm reading it with the pandas library through:
df = pd.read_csv('./my_file_path/my_file.txt', sep='\n', header=0)
df.rename(columns={list(df.columns)[0]: 'cols'}, inplace=True)
The output is rows with the information separated by spaces in a standard fixed-width structure (dots are meant to be spaces):
name................address........country..........age
0 Bob.Hope............Broadway.......United.States....101
1 Richard.Donner......Park.Avenue....United.States.....76
2 Oscar.Meyer.........Friedrichshain.Germany...........47
I tried to create a dictionary to get the info with list comprehensions:
col_dict = {'name':    [df.cols[i][0:20].strip()  for i in range(0, len(df.cols))],
            'address': [df.cols[i][21:36].strip() for i in range(0, len(df.cols))],
            'country': [df.cols[i][36:52].strip() for i in range(0, len(df.cols))],
            'age':     [df.cols[i][53:].strip()   for i in range(0, len(df.cols))],
            }
This script works and gives me a dictionary I can build a dataframe from. But I was asking myself whether there is a way to make the script more pythonic, looping directly through a dictionary with the column names and avoiding repeating the same code for every column (the actual dataset is much longer).
The question is how I can store the string slice positions and use them later with the column names to parse everything at once.
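One way to do exactly that while keeping the slicing approach is to store the column boundaries in a dictionary and build all the lists in a single comprehension (a sketch, assuming the fixed offsets above hold for every row):
slices = {'name': slice(0, 20), 'address': slice(21, 36),
          'country': slice(36, 52), 'age': slice(53, None)}
col_dict = {col: [row[sl].strip() for row in df.cols] for col, sl in slices.items()}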
You can read it directly with pandas:
df = pd.read_csv('./my_file_path/my_file.txt', delim_whitespace=True)
If you know that the space between the columns is going to be at least 2 spaces, you can do it this way:
df = pd.read_csv('./my_file_path/my_file.txt', sep=r'\s{2,}', engine='python')
In your case, the file is fixed width so you need to use a different method:
from io import StringIO
# my_text holds the raw text of the file shown above
df = pd.read_fwf(StringIO(my_text), widths=[20, 15, 16, 10], skiprows=1)
The pandas.read_fwf method is what you are looking for.
df = pd.read_fwf('data.txt')
data.txt
name            address         country        age
Bob Hope        Broadway        United States  101
Richard Donner  Park Avenue     United States   76
Oscar Meyer     Friedrichshain  Germany         47
df

             name         address        country  age
0        Bob Hope        Broadway  United States  101
1  Richard Donner     Park Avenue  United States   76
2     Oscar Meyer  Friedrichshain        Germany   47

Matching a nickname to a name in Pandas

I have two dataframes: one with full names and another with nicknames. The nickname is always a portion of the person's full name, and the data is not sorted or indexed, so I can't just merge the two.
What I want as output is one data frame that contains the full name and the associated nickname, found by simple search: find the nickname inside the full name and match it.
Any solutions to this?
df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin', 'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert','Abraham','Patinkin','Daines','Lewis']})
Thanks
Use Series.str.extract with the nicknames joined by | into one regex, each wrapped in \b word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('('+ pat + ')', expand=False)
print (df)
            fullName  nickName
0      Claire Daines    Daines
1       Damian Lewis     Lewis
2     Mandy Patinkin  Patinkin
3      Rupert Friend    Rupert
4  F. Murray Abraham   Abraham
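If a nickname can contain regex metacharacters (a dot, for instance) or the case may differ between the two frames, a slightly more defensive variant (a sketch) escapes each name and passes a flag:
import re
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('(' + pat + ')', expand=False, flags=re.IGNORECASE)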

Parsing log files and write to csv (different number of fields)

This is a question that has concerned me for a long time. I have log files that I want to convert to csv. My problem is that empty fields are omitted in the log files, and I want to end up with a csv file containing all fields.
Currently I parse the log files and write them to xml, because one of the nice features of Microsoft Excel is that when you open an xml file whose records have different numbers of elements, Excel shows all elements as separate columns.
Last week I came up with the idea that this might be possible with Pandas, but I cannot find a good example to get this done.
Does anyone have a good idea how I can get this done?
Updated
I can't share the actual logs here. Below is a fictional sample:
Sample 1:
First : John Last : Doe Address : Main Street Email : j_doe@notvalid.gov Sex : male State : TX City : San Antonio Country : US Phone : 210-354-4030
First : Carolyn Last : Wysong Address : 1496 Hewes Avenue Sex : female State : TX City : KEMPNER Country : US Phone : 832-600-8133 Bank_Account : 0123456789
regex :
matches = re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE)
Sample 2:
:1: John :2: Doe :3: Main Street :4: j_doe#notvalid.gov :5: male :6: TX :7: San Antonio :8: US :9: 210-354-4030
:1: Carolyn :2: Wysong :3: 1496 Hewes Avenue :5: female :6: TX :7: KEMPNER :8: US :9: 832-600-8133 :10: 0123456789
regex:
matches = re.findall(r':(\d+): (.*?) ', line, re.IGNORECASE)
Allow me to concentrate on your first example. Your regex only matches the first word of each field, but let's keep it like that for now as I'm sure you can easily fix that.
You can create a pandas DataFrame to store your parsed data, then for each line you run your regexp, convert it to a dictionary and load it into a pandas Series. Then you append it to your dataframe. Pandas is smart enough to fill missing data with NaN.
df = pd.DataFrame()
for l in lines:
    matches = re.findall(r'(\w+) : (.*?) ', l, re.IGNORECASE)
    s = pd.Series(dict(matches))
    df = df.append(s, ignore_index=True)
>>> print(df)
  Address     City Country               Email    First    Last     Sex State         Phone
0    Main      San      US  j_doe@notvalid.gov     John     Doe    male    TX           NaN
1    1496  KEMPNER      US                 NaN  Carolyn  Wysong  female    TX  832-600-8133
I'm not sure the dict step is needed, maybe there's a pandas way to directly parse your list of tuples.
Then you can easily convert it to csv, you will retain all your columns with empty fields where appropriate.
df.to_csv("result.csv", index=False)
>>> !cat result.csv
Address,City,Country,Email,First,Last,Sex,State,Phone
Main,San,US,j_doe@notvalid.gov,John,Doe,male,TX,
1496,KEMPNER,US,,Carolyn,Wysong,female,TX,832-600-8133
Regarding performance on big files: if you know all the field names in advance you can initialize the dataframe with a columns argument and run the parsing and csv saving one chunk at a time. IIRC there's a mode parameter for to_csv that should allow you to append to an existing file.
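Note that DataFrame.append was removed in pandas 2.0; an equivalent sketch that collects the parsed rows first and builds the frame in one go (assuming the same lines iterable as above) is:
import re
import pandas as pd
rows = [dict(re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE)) for line in lines]
df = pd.DataFrame(rows)  # missing fields become NaN automatically
df.to_csv("result.csv", index=False)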

Python/Pandas: Drop (not filter!) rows from data frame on string match from list

This thread contains information on how to filter out rows. However, I want to know how to delete, not filter, rows from a data frame based on a string match from a list.
What's the fastest way to do this?
Edit: here's an example using the dataset provided in another thread.
>>> import pandas as pd
>>>
>>> df = pd.read_csv('data.csv')
>>> df.head()
  fName   lName                    email   title
0  John   Smith         jsmith@gmail.com     CEO
1   Joe   Schmo      jschmo@business.com  Bagger
2  Some  Person  some.person@hotmail.com   Clerk
One solution given involves filtering out some rows as follows:
In [6]: to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
  fName  lName             email title
0  John  Smith  jsmith@gmail.com   CEO
This works, but the data frame still contains the rows that I want to permanently delete:
In [7]: df.head()
Out[7]:
  fName   lName                    email   title
0  John   Smith         jsmith@gmail.com     CEO
1   Joe   Schmo      jschmo@business.com  Bagger
2  Some  Person  some.person@hotmail.com   Clerk
I quickly figured this one out. Apologies for the seemingly naive question, but I hope it helps out other Python newbies like myself.
The solution, as raised by @mattmilten, is to simply assign the output of the filtering to the same or a new data frame. That is,
In [3]: to_drop = ['Clerk', 'Bagger']
df = df[~df['title'].isin(to_drop)]
df
Out[3]:
  fName  lName             email title
0  John  Smith  jsmith@gmail.com   CEO
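If you want an operation that literally drops the rows from the existing frame rather than reassigning a filtered copy, an equivalent sketch uses DataFrame.drop on the matching index:
to_drop = ['Clerk', 'Bagger']
df.drop(df[df['title'].isin(to_drop)].index, inplace=True)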
