Get dummy variables from a string column full of mess - python

I'm a less-than-a-week beginner in Python and data science, so please forgive me if these questions seem obvious.
I've scraped data on a website, but the result is unfortunately not very well formatted and I can't use it without transformation.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy column "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create another dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match test with regular expressions and combine it with the pd.get_dummies method.
import re
import pandas as pd

def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding how to set the pd.get_dummies arguments to apply the test I would like to the column.
I was even wondering if it's the best strategy, and whether it wouldn't be easier to create other parallel columns and apply a match.group() on my messy strings to populate them.
Not sure I would know how to program that anyway.
Thanks for your help

I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to "string does not contain".
na=False means that if your messy column contains any null values, these will not cause an error; they will just be assumed not to meet the condition.
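If you want to do this for several keywords at once, here is a minimal sketch (the keyword-to-column mapping below is just an illustration; adjust it to the 5 or 6 patterns you care about). str.contains returns a Boolean series, and astype(int) turns it into the 0/1 dummy directly:
import pandas as pd

# Hypothetical mapping of search pattern -> dummy column name
keywords = {"bluetooth": "Bluetooth", "climatisation": "Climatisation"}

for pattern, col in keywords.items():
    # case=False makes the match case-insensitive; na=False treats missing values as "no match"
    df[col] = df["My messy strings colum"].str.contains(pattern, case=False, na=False).astype(int)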

What's the logic behind locating elements using letters in pandas?

I have a CSV file. I load it in pandas dataframe. Now, I am practicing the loc method. This CSV file contains a list of James bond movies and I am passing letters in the loc method. I could not interpret the result shown.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["A": "I"]
The result for the above code is:
bond.loc["a": "i"]
And the result for the above code is:
What is happening here? I could not understand it. Could someone please help me understand this behaviour of pandas?
Following is the file:
Your dataframe uses the first column ("Film") as its index when it is imported, because of the option index_col = "Film". That column contains the name of each film stored as a string, and they all start with a capital letter. bond.loc["A":"I"] returns all films whose index is greater than or equal to "A" and less than or equal to "I" (pandas label slices are upper-bound inclusive). By the rules of string comparison in Python, that includes all films beginning with "A" to "H", and would also include a film called "I" if there were one. If you evaluate e.g. "A" <= "b" <= "I" at the Python prompt you will see that lower-case letters are not within the range, because ord("b") > ord("I").
If you wrote bond.index = bond.index.str.lower(), that would change the index to lower case and you could then slice with e.g. bond.loc["a":"i"] (but bond.loc["A":"I"] would no longer return any films).
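A minimal sketch of the behaviour, using a few made-up film titles in place of the CSV (the titles and years below are only illustrative):
import pandas as pd

# Illustrative data; the real index comes from jamesbond.csv with index_col="Film"
bond = pd.DataFrame(
    {"Year": [1962, 1964, 2006, 1995]},
    index=pd.Index(["Dr. No", "Goldfinger", "Casino Royale", "GoldenEye"], name="Film"),
)
bond.sort_index(inplace=True)

print(bond.loc["A":"I"])   # all four rows: every uppercase title falls between "A" and "I"
print(bond.loc["a":"i"])   # empty: lowercase "a" sorts after every uppercase title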
DataFrame.loc["A":"I"] returns the rows that start with the letter in that range - from what I can see and tried to reproduce. Might you attach the data?

Finding row in Dataframe when dataframe is both int or string?

A minor problem is doing my head in. I have a dataframe similar to the following:
Number Title
12345678 A
34567890-S B
11111111 C
22222222-L D
This is read from an excel file using pandas in python, then the index set to the first column:
db = db.set_index(['Number'])
I then lookup Title based on Number:
lookup = "12345678"
title = str(db.loc[lookup, 'Title'])
However... whilst anything postfixed with "-Something" works, anything without it doesn't find a location (e.g. 12345678 will not find anything, but 34567890-S will). My only hunch is that it's to do with looking up as either strings or ints, but I've tried a few things (converting the table to all strings, changing loc to iloc, ix, etc.) and so far no luck.
Any ideas? Thanks :)
UPDATE: So trying this from scratch doesn't exhibit the same behaviour (creating a test db presumably just sets everything as strings), however importing from CSV is resulting in the above, and...
Searching "12345678" (as a string) doesn't find it, but 12345678 as an int will. Likewise the opposite for the others. So the dataframe is only matching the pure numbers in the index with ints, but anything else with strings.
Also, I can't just drop the postfix from the search, as I have multiple rows with differing postfixes, e.g. 34567890-S, 34567890-L, 34567890-X.
If you want to cast all entries to one particular type, you can use pandas.Series.astype:
db["Number"] = df["Number"].astype(str)
db = db.set_index(['Number'])
lookup = "12345678"
title = db.loc[lookup, 'Title']
Interestingly, this is actually slower than using pandas.Index.map:
import numpy as np
import pandas as pd

# Test objects of increasing length (10 to 10,000 elements)
x1 = [pd.Series(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]
x2 = [pd.Index(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]

def series_astype(x1):
    return x1.astype(str)

def index_map(x2):
    return x2.map(str)
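A rough way to check the timings yourself, as a sketch using the standard-library timeit on the largest test objects defined above (the repeat count is arbitrary):
from timeit import timeit

s, idx = x1[-1], x2[-1]   # the 10,000-element Series and Index
print("astype:", timeit(lambda: series_astype(s), number=100))
print("map:   ", timeit(lambda: index_map(idx), number=100))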
Consider all the indices as strings, as at least some of them are not numbers. If you want to look up a specific item that could have a postfix, you can match it by comparing the start of the strings with .str.startswith:
lookup = db.index.str.startswith("34567890")
title = db.loc[lookup, "Title"]

How to strip a value from a delimited string

I have a list which I have joined using the following code:
patternCore = '|'.join(list(Broker['prime_broker_id']))
patternCore
'CITI|CS|DB|JPM|ML'
Not sure why I did it that way, but I used patternCore to filter multiple strings at the same time. Please note that Broker is a DataFrame.
Broker['prime_broker_id']
29 CITI
30 CS
31 DB
32 JPM
33 ML
Name: prime_broker_id, dtype: object
Now I am looking to strip one string. Say I would like to strip 'DB'. How can I do that please?
I tried this
patternCore.strip('DB')
'CITI|CS|DB|JPM|ML'
but nothing is stripped
Thank you
Since Broker is a Pandas dataframe, you can use loc with Boolean indexing, then use pd.Series.tolist:
mask = Broker['prime_broker_id'] != 'DB'
patternCore = '|'.join(Broker.loc[mask, 'prime_broker_id'].tolist())
A more generic solution, which works with objects other than Pandas dataframes, is to use a list comprehension with an if condition:
patternCore = '|'.join([x for x in Broker['prime_broker_id'] if x != 'DB'])
Without returning to your input series, using the same idea you can split and re-join:
patternCore = 'CITI|CS|DB|JPM|ML'
patternCore = '|'.join([x for x in patternCore.split('|') if x != 'DB'])
You should expect the last option to be relatively expensive, as it requires reading every character of the input string.
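For example, rebuilding the series from the question (a quick check; the index values are taken from the snippet above):
import pandas as pd

Broker = pd.DataFrame({'prime_broker_id': ['CITI', 'CS', 'DB', 'JPM', 'ML']},
                      index=[29, 30, 31, 32, 33])

mask = Broker['prime_broker_id'] != 'DB'
patternCore = '|'.join(Broker.loc[mask, 'prime_broker_id'].tolist())
print(patternCore)   # CITI|CS|JPM|ML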
I would like to mention some points which have not been touched upon till now.
I tried this
patternCore.strip('DB')
'CITI|CS|DB|JPM|ML'
but nothing is stripped
The reason it didn't work is that strip() returns a copy of the string with the leading and trailing characters removed.
NOTE:
It does not remove characters occurring somewhere in the middle.
The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
Here you specified the characters argument as 'DB'. So had your string been something like 'CITI|CS|JPM|ML|DB', your code would have worked partially (the pipe at the end would remain).
But anyway this is not good practice, because it would also strip something like
'DCITI|CS|JPM|MLB' to 'CITI|CS|JPM|ML', or 'CITI|CS|JPM|ML|BD' to 'CITI|CS|JPM|ML|'.
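A quick illustration of that behaviour, just to make the edge cases concrete:
# strip() only removes characters from the two ends, never from the middle
print('CITI|CS|DB|JPM|ML'.strip('DB'))   # CITI|CS|DB|JPM|ML  (unchanged)
print('CITI|CS|JPM|ML|DB'.strip('DB'))   # CITI|CS|JPM|ML|    (trailing D/B gone, pipe remains)
print('DCITI|CS|JPM|MLB'.strip('DB'))    # CITI|CS|JPM|ML     (leading D and trailing B gone)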
I would like to strip 'DB'.
For this part, #jpp has already given a fine answer.

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code because I need to create several columns this way, not just three. Is there a way to make this easier? is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1', 'item2', 'item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype inside the same loop:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
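Here is a self-contained sketch of how that plays out on a tiny made-up frame (the column values are illustrative only):
import pandas as pd

# Toy data; the real frame would have more rows and more items
df = pd.DataFrame({
    'item1_existing': ['NO', 'YES'],
    'item1_sold':     ['YES', 'YES'],
    'item2_existing': ['NO', 'NO'],
    'item2_sold':     ['NO', 'YES'],
})

for i in ['item1', 'item2']:
    df[f'unit_{i}'] = (df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')).astype(int)

print(df)   # unit_item1 is [1, 0]; unit_item2 is [0, 1]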

Key word search just in one column of the file and keeping 2 words before and after key word

I love Python and I am new to it as well. Here, with the help of the community (users like Antti Haapala), I was able to proceed to some extent. But I got stuck at the end. Please help. I have two tasks remaining before I get into my big data POC (planning to use this code on 1+ million records in a text file).
• Search for a keyword in a column (C#3) and keep the 2 words before and after that keyword.
• Divert the print output to a file.
• Here I don't want to touch C#1 and C#2, for referential integrity purposes.
I really appreciate all your help.
My input file:
C#1 C#2 C#3 (these are the column headings; I added them just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it
Desired output file: (only change in Column 3 or last column)
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
Code I am currently using:
s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():
if not line.strip():
continue
fields = line.split(None, 2)
joined = '|'.join(fields)
print(joined)
BTW, if I use the keyword search as is, it also looks at my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, and to search only the 3rd column and keep the 2 words before/after the keyword(s).
First, I need to warn you that using this code for 1 million records is dangerous. You are dealing with regular expressions, and this method is only good as long as the text actually follows regular patterns; otherwise you might end up creating tons of cases to extract the data you want without also extracting data you don't want.
For 1 million rows you'll want pandas, as a plain Python for loop is too slow.
import pandas as pd
import re

df = pd.DataFrame({'C1': [12088, 12089],
                   'C2': ["CITA", "CITA"],
                   'C3': ["Hello very nice lists, better to keep those",
                          "This is great theme for lists keep it"]})

# Look-behind/look-ahead anchors tailored to the two sample rows
df["C3"] = df["C3"].map(lambda x: re.findall(r'(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*', str(x)))
df["C3"] = df["C3"].map(lambda x: x[0].strip())
which gives
df
C1 C2 C3
0 12088 CITA very nice lists, better to
1 12089 CITA theme for lists keep it
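Note that the look-behind anchors above ('Hello', 'great') are hard-wired to the two sample rows. If the requirement is really "the keyword plus up to 2 words on each side", a sketch of a more general pattern built from the keyword itself (assuming whitespace-separated words, and that punctuation attached to a word is kept) could look like this:
import re
import pandas as pd

KEYWORD = "lists"
# up to 2 tokens before the keyword, the keyword itself (plus attached
# punctuation such as a comma), and up to 2 tokens after it
pattern = re.compile(r'(?:\S+\s+){0,2}%s\S*(?:\s+\S+){0,2}' % re.escape(KEYWORD))

df = pd.DataFrame({'C1': [12088, 12089],
                   'C2': ["CITA", "CITA"],
                   'C3': ["Hello very nice lists, better to keep those",
                          "This is great theme for lists keep it"]})

def clip(text):
    m = pattern.search(text)
    return m.group(0) if m else text   # leave lines without the keyword unchanged

df["C3"] = df["C3"].map(clip)
# C3 becomes "very nice lists, better to" and "theme for lists keep it"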
There are still some questions left about how exactly you want to perform your keyword search. One obstacle is already contained in your example: how to deal with characters such as commas? Also, it is not clear what to do with lines that do not contain the keyword, or what to do if there are not two words before or two words after the keyword. I guess that you yourself are a little unsure about the exact requirements and have not thought about all the edge cases.
Nevertheless, I have made some "blind decisions" about these questions, and here is a naive example implementation that assumes your keyword matching rules are rather simple. I have created the function findword(), and you can adjust it to whatever you like. Maybe this example helps you find your own requirements.
KEYWORD = "lists"
S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
def findword(words, keyword):
"""Return index of first occurrence of `keyword` in sequence
`words`, otherwise return None.
The current implementation searches for "keyword" as well as
for "keyword," (with trailing comma).
"""
for test in (keyword, "%s," % keyword):
try:
return words.index(test)
except ValueError:
pass
return None
for line in S.splitlines():
tokens = line.split("|")
words = tokens[2].split()
idx = findword(words, KEYWORD)
if idx is None:
# Keyword not found. Print line without change.
print line
continue
l = len(words)
start = idx-2 if idx > 1 else 0
end = idx+3 if idx < l-2 else -1
tokens[2] = " ".join(words[start:end])
print '|'.join(tokens)
Test:
$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
PS: I hope I got the indices right for slicing. You should check, nevertheless.
