Remove some strings in dataframe - python

I'm trying to remove some strings in a dataframe that start with System:
My dataframe:
A B C
French house Blablabla System:Microsoft Windows XP; Browser:Chrome 32.0.1700;
English house my address: 101-102 bd Charles de Gaulle 75001 Paris
French apartment my name is Liam
French house Hello George!
English apartment System:Microsoft Windows XP; Browser:Chrome 32.0.1700;
I tried:
def remove_lines():
    df['C'] = df['C'].str.replace(r'(\s+)(System:).+','')
    return df
Nothing happens...
Expected output:
A B C
French house Blablabla
English house my address: 101-102 bd Charles de Gaulle 75001 Paris
French apartment my name is Liam
French house Hello George!
English apartment

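Use str.replace with regex=True (since pandas 2.0 the pattern is treated as a literal string unless regex=True is passed):
df['C'] = df['C'].str.replace(r'System:.*', '', regex=True)
df['C']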
# 0 Blablabla
# 1 my address: 101-102 bd Charles de Gaulle 75001...
# 2 my name is Liam
# 3 Hello George!
# 4
# Name: C, dtype: object
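For reference, here is a self-contained sketch of the same replacement. The sample frame is rebuilt from the question. The leading `\s*` in the pattern is an addition that is not part of the answer above; it is there so that the space in front of `System:` is removed as well. The name `cleaned` is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A": ["French", "English", "French", "French", "English"],
    "B": ["house", "house", "apartment", "house", "apartment"],
    "C": [
        "Blablabla System:Microsoft Windows XP; Browser:Chrome 32.0.1700;",
        "my address: 101-102 bd Charles de Gaulle 75001 Paris",
        "my name is Liam",
        "Hello George!",
        "System:Microsoft Windows XP; Browser:Chrome 32.0.1700;",
    ],
})

# regex=True is required on pandas >= 2.0, where the default is a literal match.
# \s* (zero or more spaces) also covers rows that start with "System:".
cleaned = df["C"].str.replace(r"\s*System:.*", "", regex=True)
print(cleaned.tolist())
```

Note that the pattern in the question, `(\s+)(System:).+`, requires at least one whitespace character before `System:`, so even with regex matching enabled it would leave the last row untouched.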

You can simply split on 'System' and keep the first part, like this:
In [1936]: df.C = df.C.str.split('System').str[0]
In [1937]: df
Out[1937]:
A B C
0 French house Blablabla
1 English house my address: 101-102 bd Charles de Gaulle 75001...
2 French apartment my name is Liam
3 French house Hello George!
4 English apartment
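A regex-free variant of the split idea, as a minimal sketch: `str.partition` splits at the first occurrence of the separator only, and `str.rstrip` drops the trailing space that the split leaves behind. The Series `c` below is sample data taken from the question.

```python
import pandas as pd

c = pd.Series([
    "Blablabla System:Microsoft Windows XP; Browser:Chrome 32.0.1700;",
    "my address: 101-102 bd Charles de Gaulle 75001 Paris",
    "my name is Liam",
    "Hello George!",
    "System:Microsoft Windows XP; Browser:Chrome 32.0.1700;",
])

# partition returns three columns: text before, the separator, text after.
# Column 0 holds everything before the first "System:" (or the whole string
# when the separator is absent).
before = c.str.partition("System:")[0].str.rstrip()
print(before.tolist())
```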

Related

How to deal with long names in data cleaning?

I have a users database. I want to separate them into two columns to have user1 and user2.
The way I was solving this was to split the names into multiple columns then merge the names to have the two columns of users.
The issue I run into is that some names are long, so after the split they occupy a varying number of columns in the data frame, which makes it hard to merge them properly.
Users
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
After I split the user columns
0 1 2 3 4 5 6 7 8
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
The expected result is the following
User1                    User2
Maria Melinda Del Valle  Justin Howard
Devin Craig Jr.          Michael Carter III
Jeanne De Bordeaux       Alhamdi
You can use:
def f(sr):
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    return sr.dropna().groupby(m).apply(' '.join)

out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x+1}')
Output:
>>> out
User1 User2
0 Maria Melinda Del Valle Justin Howard
1 Devin Craig Jr. Michael Carter III
2 Jeanne De Bordeaux Alhamdi
As suggested by @Barmar, if you know where to put the blank columns in the first split, you should already know how to create both columns directly.
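To make the approach reproducible, here is a sketch that runs the same function on a reconstructed input. The positions of the empty cells are an assumption (the split table in the question does not show them): one empty cell is placed between the two users, and the rest of each row is padded with empty cells.

```python
import pandas as pd

# Assumed layout: a missing value separates user 1 from user 2 in each row.
rows = [
    ["Maria", "Melinda", "Del", "Valle", None, "Justin", "Howard", None, None],
    ["Devin", "Craig", "Jr.", None, "Michael", "Carter", "III", None, None],
    ["Jeanne", "De", "Bordeaux", None, "Alhamdi", None, None, None, None],
]
df = pd.DataFrame(rows)

def f(sr):
    # Running count of missing cells: 0 before the first gap, 1 after it.
    # Anything after the second gap is ignored.
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    # Join the non-missing name parts within each group.
    return sr.dropna().groupby(m).apply(" ".join)

out = df.apply(f, axis=1).rename(columns=lambda x: f"User{x + 1}")
print(out)
```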

Update DataFrame based on matching rows in another DataFrame

Say there is a group of people who can choose an English and / or a Spanish word. Let's say they chose like this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water',None,'hello','thanks',None,'green'],spanish=[None,'agua',None,None,'bienvenido','verde']))
person english spanish
0 mary water None
1 james None agua
2 patricia hello None
3 robert thanks None
4 jennifer None bienvenido
5 michael green verde
Say I also have an English-Spanish dictionary (assume no duplicates, i.e. one-to-one relationship):
>>> pandas.DataFrame(dict(english=['hello','bad','green','thanks','welcome','water'],spanish=['hola','malo','verde','gracias','bienvenido','agua']))
english spanish
0 hello hola
1 bad malo
2 green verde
3 thanks gracias
4 welcome bienvenido
5 water agua
How can I fill in any missing words, i.e. update the first DataFrame using the second DataFrame where either english or spanish is None, to arrive at this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water','water','hello','thanks','welcome','green'],spanish=['agua','agua','hola','gracias','bienvenido','verde']))
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde
You can combine map with fillna:
df['english'] = df['english'].fillna(df['spanish'].map(df2.set_index('spanish')['english']))
df['spanish'] = df['spanish'].fillna(df['english'].map(df2.set_index('english')['spanish']))
df
Out[200]:
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde
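The same idea as a self-contained sketch, using the two frames from the question. The function name `fill_missing_words` and the variable names `people` and `dictionary` are hypothetical; the logic relies on the one-to-one assumption stated in the question.

```python
import pandas as pd

people = pd.DataFrame(dict(
    person=["mary", "james", "patricia", "robert", "jennifer", "michael"],
    english=["water", None, "hello", "thanks", None, "green"],
    spanish=[None, "agua", None, None, "bienvenido", "verde"],
))
dictionary = pd.DataFrame(dict(
    english=["hello", "bad", "green", "thanks", "welcome", "water"],
    spanish=["hola", "malo", "verde", "gracias", "bienvenido", "agua"],
))

def fill_missing_words(people, dictionary):
    """Fill missing english/spanish words by looking them up in the dictionary."""
    out = people.copy()
    # Lookup Series: index is the known word, value is its translation.
    to_english = dictionary.set_index("spanish")["english"]
    to_spanish = dictionary.set_index("english")["spanish"]
    # Fill english first, so the spanish lookup sees a complete english column.
    out["english"] = out["english"].fillna(out["spanish"].map(to_english))
    out["spanish"] = out["spanish"].fillna(out["english"].map(to_spanish))
    return out

filled = fill_missing_words(people, dictionary)
print(filled)
```

Working on a copy leaves the original frame untouched, which makes it easier to compare before and after.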

Python: Converting multiple columns to a single column with categorical data

If I have this table:
City     State  Person 1  Person 2
Atlanta  GA     Bob       Fred
But, I want to convert it to:
City     State  Person#  Person Name
Atlanta  GA     1        Bob
Atlanta  GA     2        Fred
What is the most efficient way to accomplish this?
Use melt:
out = df.melt(['City', 'State'], var_name='Person#', value_name='Person Name')
out['Person#'] = out['Person#'].str.extract(r'(\d+)', expand=False)
>>> out
City State Person# Person Name
0 Atlanta GA 1 Bob
1 Atlanta GA 2 Fred
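A runnable version with the sample row from the question, as a sketch. Converting `Person#` to an integer with `astype(int)` is an extra step not in the answer above; it assumes every column name ends in a number.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Atlanta"],
    "State": ["GA"],
    "Person 1": ["Bob"],
    "Person 2": ["Fred"],
})

# Keep City/State as identifiers; every other column becomes one row.
out = df.melt(id_vars=["City", "State"], var_name="Person#", value_name="Person Name")

# "Person 1" -> "1" -> 1
out["Person#"] = out["Person#"].str.extract(r"(\d+)", expand=False).astype(int)
print(out)
```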

Pandas -Split data and create columns when string occurs

I am looking to read in a text file (see below) and then create columns for the English leagues only. That is, wherever the "Alias name" starts with "England_", I want to create a new column with the alias name as the header and the player names in the rows. Note that the first alias appears as "Aliases for" rather than "Alias name" in the text file.
"-----------------------------------------------------------------------------------------------------------"
"- NEW TEAM -"
"-----------------------------------------------------------------------------------------------------------"
Europe Players
17/04/2019
07:59 p.m.
Aliases for England_Premier League
-------------------------------------------------------------------------------
Harry Kane
Mohamed Salah
Kevin De Bruyne
The command completed successfully.
Alias name England_Division 1
Comment Teams
Members
-------------------------------------------------------------------------------
Will Grigg
Jonson Clarke-Harris
Jerry Yates
Ivan Toney
Troy Parrott
The command completed successfully.
Alias name Spanish La Liga
Comment
Members
-------------------------------------------------------------------------------
Lionel Messi
Luis Suarez
Cristiano Ronaldo
Sergio Ramos
The command completed successfully.
Alias name England_Division 2
Comment
Members
-------------------------------------------------------------------------------
Eoin Doyle
Matt Watters
James Vughan
The command completed successfully.
This is my current code on how I'm reading in the data
df = pd.read_csv(r'Desktop\SampleData.txt', sep='\n', header=None)
This gives me a pandas DataFrame with one column. I'm fairly new to Python, so I'm wondering how I would go about getting the result below. Should I use a delimiter when reading in the file?
England_Premier League
England_Division 1
England_Division 2
Harry Kane
Will Griggs
Eoin Doyle
Mohamed Salah
Jonson Clarke-Harris
Matt Watters
Kevin De Bruyne
Ivan Toney
James Vughan
Troy Parrott
You can use the re module for this task. For example:
import re
import pandas as pd
txt = """
"-----------------------------------------------------------------------------------------------------------"
"- NEW TEAM -"
"-----------------------------------------------------------------------------------------------------------"
Europe Players
17/04/2019
07:59 p.m.
Aliases for England_Premier League
-------------------------------------------------------------------------------
Harry Kane
Mohamed Salah
Kevin De Bruyne
The command completed successfully.
Alias name England_Division 1
Comment Teams
Members
-------------------------------------------------------------------------------
Will Grigg
Jonson Clarke-Harris
Jerry Yates
Ivan Toney
Troy Parrott
The command completed successfully.
Alias name Spanish La Liga
Comment
Members
-------------------------------------------------------------------------------
Lionel Messi
Luis Suarez
Cristiano Ronaldo
Sergio Ramos
The command completed successfully.
Alias name England_Division 2
Comment
Members
-------------------------------------------------------------------------------
Eoin Doyle
Matt Watters
James Vughan
The command completed successfully.
"""
r_competitions = re.compile(r"^Alias(?:(?:es for)| name)\s*(.*?)$", flags=re.M)
r_names = re.compile(r"^-+$\s*(.*?)\s*The command", flags=re.M | re.S)
dfs = []
for comp, names in zip(r_competitions.findall(txt), r_names.findall(txt)):
    if "England" not in comp:
        continue
    data = []
    for n in names.split("\n"):
        data.append({comp: n})
    dfs.append(pd.DataFrame(data))
print(pd.concat(dfs, axis=1).fillna(""))
Prints:
England_Premier League England_Division 1 England_Division 2
0 Harry Kane Will Grigg Eoin Doyle
1 Mohamed Salah Jonson Clarke-Harris Matt Watters
2 Kevin De Bruyne Jerry Yates James Vughan
3 Ivan Toney
4 Troy Parrott
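To see what the two patterns capture, here is a sketch on a shortened sample. The `sample` text is abbreviated from the file in the question, and building the frame from a dict of Series is a swapped-in alternative to the `pd.concat` loop above, not the answer's own code.

```python
import re
import pandas as pd

# Shortened stand-in for the file contents (assumed to follow the same layout).
sample = """Aliases for England_Premier League
-----
Harry Kane
Mohamed Salah
The command completed successfully.
Alias name     Spanish La Liga
Comment
Members
-----
Lionel Messi
The command completed successfully.
Alias name     England_Division 2
Comment
Members
-----
Eoin Doyle
The command completed successfully.
"""

# Header lines: "Aliases for <name>" or "Alias name <name>".
r_competitions = re.compile(r"^Alias(?:(?:es for)| name)\s*(.*?)$", flags=re.M)
# Member block: everything between a dashed line and "The command ...".
r_names = re.compile(r"^-+$\s*(.*?)\s*The command", flags=re.M | re.S)

competitions = r_competitions.findall(sample)
blocks = r_names.findall(sample)

# One Series per English competition; shorter columns are padded with "".
table = pd.DataFrame({
    comp: pd.Series(names.split("\n"))
    for comp, names in zip(competitions, blocks)
    if "England" in comp
}).fillna("")
print(table)
```

Pairing headers and member blocks with `zip` assumes every alias header is followed by exactly one dashed line and one "The command completed successfully." line, as in the sample.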

Capturing row if column string contains X and at least one of [Y,Z]

My data looks something like this, with household members of three different origin (Dutch, American, French):
Household members nationality:
Dutch American Dutch French
Dutch Dutch French
American American
American Dutch
French American
Dutch Dutch
I want to convert them into three categories:
Dutch only households
Households with 1 Dutch and at least 1 French or American
Non-Dutch households
Category 1 was captured by the following code:
~df['households'].str.contains("French", "American")
I was looking for a solution for category 2 and 3. I had the following in mind:
Mixed households
df['households'].str.contains("Dutch" and ("French" or "American"))
But this solution did not work because it also captured rows containing only French members.
How do I implement this 'and' statement correctly in this context?
Let us try str.get_dummies to create a DataFrame of dummy indicator variables for the Household column, then build boolean masks m1, m2 and m3 for the specified conditions, and finally use these masks to filter the rows:
c = df['Household'].str.get_dummies(sep=' ')
m1 = c['Dutch'].eq(1) & c[['American', 'French']].eq(0).all(axis=1)
m2 = c['Dutch'].eq(1) & c[['American', 'French']].eq(1).any(axis=1)
m3 = c['Dutch'].eq(0)
Details:
>>> c
American Dutch French
0 1 1 1
1 0 1 1
2 1 0 0
3 1 1 0
4 1 0 1
5 0 1 0
>>> df[m1] # category 1
Household
5 Dutch Dutch
>>> df[m2] # category 2
Household
0 Dutch American Dutch French
1 Dutch Dutch French
3 American Dutch
>>> df[m3] # category 3
Household
2 American American
4 French American
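The masks can also be turned into a single label column. The sketch below rebuilds the sample data from the question; the `category` column, the label strings and the use of `numpy.select` are illustrative additions. It also shows why the attempt in the question misfires: Python evaluates the `and`/`or` expression to a plain string before pandas ever sees it.

```python
import numpy as np
import pandas as pd

# Python's and/or return one of their operands, so this is just "French".
pattern = "Dutch" and ("French" or "American")

df = pd.DataFrame({"Household": [
    "Dutch American Dutch French",
    "Dutch Dutch French",
    "American American",
    "American Dutch",
    "French American",
    "Dutch Dutch",
]})

# One 0/1 indicator column per nationality found in the strings.
c = df["Household"].str.get_dummies(sep=" ")
has_dutch = c["Dutch"].eq(1)
has_other = c[["American", "French"]].eq(1).any(axis=1)

# Conditions are checked in order; the first match wins.
df["category"] = np.select(
    [has_dutch & ~has_other, has_dutch & has_other, ~has_dutch],
    ["Dutch only", "Mixed", "Non-Dutch"],
    default="unknown",
)
print(df)
```

Similarly, in `str.contains("French", "American")` the second positional argument is `case`, not a second pattern, so that call only tests for "French".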
