Finding a row in a DataFrame when the index is both int and string? - python

Minor problem doing my head in. I have a dataframe similar to the following:
Number Title
12345678 A
34567890-S B
11111111 C
22222222-L D
This is read from an excel file using pandas in python, then the index set to the first column:
db = db.set_index(['Number'])
I then lookup Title based on Number:
lookup = "12345678"
title = str(db.loc[lookup, 'Title'])
However... whilst anything postfixed with "-Something" works, anything without it doesn't find a location (e.g. 12345678 will not find anything, 34567890-S will). My only hunch is that it's to do with looking up either strings or ints, but I've tried a few things (converting the table to all strings, changing loc to iloc, ix, etc.) but so far no luck.
Any ideas? Thanks :)
UPDATE: So trying this from scratch doesn't exhibit the same behaviour (creating a test db presumably just sets everything as strings), however importing from CSV is resulting in the above, and...
Searching "12345678" (as a string) doesn't find it, but 12345678 as an int will. Likewise the opposite for the others. So the dataframe is only matching the pure numbers in the index with ints, but anything else with strings.
Also, I can't not search for the postfix, as I have multiple rows with differing postfix eg 34567890-S, 34567890-L, 34567890-X.

If you want to cast all entries to one particular type, you can use pandas.Series.astype:
db["Number"] = df["Number"].astype(str)
db = db.set_index(['Number'])
lookup = "12345678"
title = db.loc[lookup, 'Title']
Interestingly this is actually slower than using pandas.Index.map:
import numpy as np
import pandas as pd

x1 = [pd.Series(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]
x2 = [pd.Index(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]

def series_astype(x1):
    return x1.astype(str)

def index_map(x2):
    return x2.map(str)
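For reference, a minimal timing harness (a sketch using the standard-library timeit; it reuses the setup objects and functions above, and the numbers will vary with machine and pandas version) could look like:
import timeit

# Time both conversions on the largest test objects; purely illustrative.
for label, func, data in [("Series.astype", series_astype, x1[-1]),
                          ("Index.map", index_map, x2[-1])]:
    elapsed = timeit.timeit(lambda: func(data), number=100)
    print(label, round(elapsed, 4), "seconds for 100 runs")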

Consider all the indices as strings, since at least some of them are not numbers. If you want to look up a specific item that could have a postfix, you can match it by comparing the start of the strings with .str.startswith:
lookup = db.index.str.startswith("34567890")
title = db.loc[lookup, "Title"]

Related

What's the logic behind locating elements using letters in pandas?

I have a CSV file. I load it into a pandas dataframe. Now, I am practicing the loc method. This CSV file contains a list of James Bond movies, and I am passing letters to the loc method. I could not interpret the result shown.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["A": "I"]
The result for the above code is:
bond.loc["a": "i"]
And the result for the above code is:
What is happening here? I could not understand it. Could someone please help me understand this behaviour of pandas?
Following is the file:
Your dataframe uses the first column ("Film") as an index when it is imported (because of the option index_col = "Film"). The column contains the name of each film stored as a string, and they all start with a capital letter. bond.loc["A":"I"] returns all films where the index is greater than or equal to "A" and less than or equal to "I" (pandas slices are upper-bound inclusive), which by the rules of string comparison in Python includes all films beginning with "A"-"H", and would also include a film called "I" if there was one. If you enter e.g. "A" <= "b" <= "I" at the Python prompt you will see that lower-case letters are not within the range, because ord("b") > ord("I").
If you wrote bond.index = bond.index.str.lower() that would change the index to lower case and you could search films using e.g. bond["a":"i"] (but bond["A":"I"] would no longer return any films).
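To see this concretely, here is a small self-contained sketch (with a handful of invented film titles rather than the actual jamesbond.csv) that reproduces the behaviour:
import pandas as pd

# Invented titles; the point is only the lexicographic label slicing.
bond = pd.DataFrame(
    {"Year": [1962, 1964, 1995, 1973]},
    index=pd.Index(["Dr. No", "Goldfinger", "GoldenEye", "Live and Let Die"], name="Film"),
)
bond.sort_index(inplace=True)

print(bond.loc["A":"I"])  # Dr. No, GoldenEye, Goldfinger (labels between "A" and "I")
print(bond.loc["a":"i"])  # empty: all capitalised titles sort before lower-case "a"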
DataFrame.loc["A":"I"] returns the rows whose index labels fall within that range (roughly, titles starting with a letter in that range), from what I can see and tried to reproduce. Could you attach the data?

Get dummy variables from a string column full of mess

I'm a less-than-a-week beginner in Python and data science, so please forgive me if these questions seem obvious.
I've scraped data on a website, but the result is unfortunately not very well formatted and I can't use it without transformation.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy column "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create another dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match test with regular expressions and combine it with the pd.get_dummies method.
import re
import pandas as pd

def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding how to set the pd.get_dummies arguments to specify the test I would like to apply on the column.
I was even wondering if it's the best strategy and if it wouldn't be easier to create other parallels columns and apply a match.group() on my messy strings to populate them.
Not sure I would know how to program that anyway.
Thanks for your help
I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to "string does not contain".
na=False means that if your messy column contains any null values, these will not cause an error; they will just be assumed not to meet the condition.
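A more compact variant (a sketch assuming the same column name; note that case=False additionally ignores capitalisation, which the version above does not) would be to cast the boolean result of str.contains straight to 0/1:
# str.contains with na=False returns a clean boolean Series; astype(int) turns it into 0/1 dummies.
col = df['My messy strings colum']
df['Bluetooth'] = col.str.contains('bluetooth', case=False, na=False).astype(int)
df['Climatisation'] = col.str.contains('climatisation', case=False, na=False).astype(int)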

Creating a short unique ID based on other values in Python?

I have a number of variables in Python that I want to use to generate a unique ID for those variables (yet have that ID always be the same for those same matching variables).
I have used .encode('hex','strict') to produce an ID which seems to work, however the output value is very long. Is there a way to produce a shorter ID using variables?
myname = 'Midavalo'
mydate = '5 July 2017'
mytime = '8:19am'
codec = 'hex'
print "{}{}{}".format(myname, mydate, mytime).encode(codec,'strict')
This outputs
4d69646176616c6f35204a756c792032303137383a3139616d
I realise with hex it is probably dependent on the length of the three variables, so I'm wondering if there is another codec that can/will produce shorter values without excluding any of the variables.
So far I have tested base64, bz2, hex, quopri, uu, zip from 7.8.4. Python Specific Encodings, but I'm unsure how to get any of these to produce shorter values without removing variables.
Is there another codec I could use, or a way to shorten the values from any of them without removing the uniqueness, or even a completely different way to produce what I require?
All I am trying to do is produce an ID so I can identify those rows when loading them into a database. If the same value already exists it will not create a new row in the database. There is no security requirement, just a unique ID. The values are generated elsewhere into python, so I can't just use a database issued ID for these values.
You could use some hashing algorithm from the hashlib package: https://docs.python.org/3/library/hashlib.html or for python 2: https://docs.python.org/2.7/library/hashlib.html
import hashlib
s = "some string"
hash = hashlib.sha1(str.encode(s)).hexdigest() # you need to encode the strings into bytes here
This hash would be the same for the same string.
Your choice of algorithm depends on the number of chars you want and the risk of collision (two different strings yielding the same hash).
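If the full hex digest is too long, one hedged option (assuming Python 3) is to truncate it, or to use blake2b with a small digest_size; either way, shorter digests increase the collision risk:
import hashlib

# Sketch only: values taken from the question; the 10-character length is arbitrary.
myname, mydate, mytime = 'Midavalo', '5 July 2017', '8:19am'
key = "{}{}{}".format(myname, mydate, mytime).encode()

short_sha = hashlib.sha1(key).hexdigest()[:10]                 # truncated SHA-1
short_blake = hashlib.blake2b(key, digest_size=5).hexdigest()  # 5 bytes -> 10 hex chars
print(short_sha, short_blake)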
If you are not set on hashing and just want a unique value based on two or more strings, this concatenates the first character of every word and outputs a unique value:
# Prints HKRC1LB for the two strings below.
# Concatenate the first char of every word in all strings to get a unique id.
def get_uniq_val(*args):
    id = ""
    for i in args:
        for j in i.split():
            id += j[0]
    return id

def main():
    string_1 = "Howard Kid Recreation Centre"
    string_2 = "150 Lantern Blvd"
    uid = get_uniq_val(string_1, string_2)
    print(uid)

if __name__ == "__main__":
    main()

Key word search just in one column of the file and keeping 2 words before and after key word

I love Python and I am new to it as well. Here, with the help of the community (users like Antti Haapala), I was able to proceed to some extent. But I got stuck at the end. Please help. I have two tasks remaining before I get into my big data POC (planning to use this code on 1+ million records in a text file):
• Search for a keyword in a column (C#3) and keep the 2 words before and after that keyword.
• Divert the print output to a file.
• Here I don’t want to touch C#1, C#2 for referential integrity purposes.
I really appreciate all your help.
My input file:
C#1 C#2 C#3 (these are the column headings, used just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it
Desired output file: (only change in Column 3 or last column)
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
Code I am currently using:
s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():
if not line.strip():
continue
fields = line.split(None, 2)
joined = '|'.join(fields)
print(joined)
BTW, if I use the keyword search as is, it also looks at my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, and to search only the 3rd column and keep 2 words before/after the keyword(s).
First I need to warn you that using this code for 1 million records is dangerous. You are dealing with regular expressions, and this method is only good as long as the patterns stay regular. Otherwise you might end up creating tons of cases to extract the data you want without also extracting data you don't want.
For 1 million records you'll want pandas, as a plain Python for loop is too slow.
import pandas as pd
import re

df = pd.DataFrame({'C1': [12088, 12089],
                   'C2': ["CITA", "CITA"],
                   'C3': ["Hello very nice lists, better to keep those",
                          "This is great theme for lists keep it"]})

df["C3"] = df["C3"].map(lambda x: re.findall(
    r'(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*', str(x)))
df["C3"] = df["C3"].map(lambda x: x[0].strip())
which gives
df
C1 C2 C3
0 12088 CITA very nice lists, better to
1 12089 CITA theme for lists keep it
There are still some questions left about how exactly you want to perform your keyword search. One obstacle is already contained in your example: how to deal with characters such as commas? Also, it is not clear what to do with lines that do not contain the keyword, or what to do if there are not two words before or two words after the keyword. I guess that you yourself are a little unsure about the exact requirements and did not think about all edge cases.
Nevertheless, I have made some "blind decisions" about these questions, and here is a naive example implementation that assumes that your keyword matching rules are rather simple. I have created the function findword(), and you can adjust it to whatever you like. So maybe this example helps you find your own requirements.
KEYWORD = "lists"
S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
def findword(words, keyword):
"""Return index of first occurrence of `keyword` in sequence
`words`, otherwise return None.
The current implementation searches for "keyword" as well as
for "keyword," (with trailing comma).
"""
for test in (keyword, "%s," % keyword):
try:
return words.index(test)
except ValueError:
pass
return None
for line in S.splitlines():
tokens = line.split("|")
words = tokens[2].split()
idx = findword(words, KEYWORD)
if idx is None:
# Keyword not found. Print line without change.
print line
continue
l = len(words)
start = idx-2 if idx > 1 else 0
end = idx+3 if idx < l-2 else -1
tokens[2] = " ".join(words[start:end])
print '|'.join(tokens)
Test:
$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
PS: I hope I got the indices right for slicing. You should check, nevertheless.

Count matches in Mongodb $or

Trying to count the matches across all columns.
I currently use this code to copy across certain fields from a Scrapy item.
def getDbModel(self, item):
    deal = {"name": item['name']}
    if 'imageURL' in item:
        deal["imageURL"] = item['imageURL']
    if 'highlights' in item:
        deal['highlights'] = replace_tags(item['highlights'], ' ')
    if 'fine_print' in item:
        deal['fine_print'] = replace_tags(item['fine_print'], ' ')
    if 'description' in item:
        deal['description'] = replace_tags(item['description'], ' ')
    if 'search_slug' in item:
        deal['search_slug'] = item['search_slug']
    if 'dealURL' in item:
        deal['dealurl'] = item['dealURL']
    return deal
Wondering how I would turn this into an OR search in mongodb.
I was looking at something like the below:
def checkDB(self, item):
    # Check if the record exists in the DB
    deal = self.getDbModel(item)
    return self.db.units.find_one({"$or": [deal]})
Firstly, is this the best method to be doing this?
Secondly, how would I find the count of the number of columns matched, i.e. trying to limit records to those that match at least two columns?
There is no easy way of counting the number of column matches on MongoDB's end; it just kind of matches and then returns.
You would probably be better off doing this client side. I am unsure exactly how you intend to use this count figure, but there is no easy way of doing it, whether through MR (map-reduce) or the aggregation framework.
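For example, a rough client-side sketch (pymongo assumed; deal is the dict built by getDbModel above and db is an open database handle) could fetch candidates with $or and count the matching fields in Python:
# Fetch documents matching any field, then keep those matching at least two.
def count_matches(doc, deal):
    return sum(1 for key, value in deal.items() if doc.get(key) == value)

candidates = db.units.find({"$or": [{k: v} for k, v in deal.items()]})
strong_matches = [doc for doc in candidates if count_matches(doc, deal) >= 2]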
You could, in the aggregation framework, change your schema a little to put these columns within a properties field and then $sum the matches within the subdocument. This is a good approach since you can also sort on the count to create a kind of relevance search (if that is what you're intending).
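Alternatively, without reshaping the schema, a rough aggregation sketch (using $cond and $add rather than the properties/$sum layout just described; the field names are illustrative) can score each document by how many fields match and keep those matching at least two:
# Score each document by the number of matching fields, then filter and sort.
pipeline = [
    {"$project": {
        "doc": "$$ROOT",
        "matchCount": {"$add": [
            {"$cond": [{"$eq": ["$name", deal.get("name")]}, 1, 0]},
            {"$cond": [{"$eq": ["$dealurl", deal.get("dealurl")]}, 1, 0]},
            {"$cond": [{"$eq": ["$search_slug", deal.get("search_slug")]}, 1, 0]},
        ]},
    }},
    {"$match": {"matchCount": {"$gte": 2}}},
    {"$sort": {"matchCount": -1}},
]
results = list(db.units.aggregate(pipeline))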
As to whether this is a good approach: it depends. When using an $or, MongoDB will use an index for each condition; this is a special case within MongoDB indexing, but it does mean you should take it into consideration when writing an $or and ensure you have indexes to cover each condition.
You have also got to consider that MongoDB will effectively evaluate each clause and then merge the results to remove duplicates, which can be heavy for bigger $ors or a large working set.
Of course, the format of your $or is wrong: you need an array of documents, one per field. At the minute you have a single array containing one document with all your attributes. When used like this, the attributes effectively have an $and condition between them, so it won't work.
You could probably change your code to:
def getDbModel(self, item):
    deal = [{"name": item['name']}]
    if 'imageURL' in item:
        deal.append({"imageURL": item['imageURL']})
    if 'highlights' in item:
        deal.append({'highlights': replace_tags(item['highlights'], ' ')})
    # ... and so on for the remaining fields

    # Some way down
    return self.db.units.find_one({"$or": deal})
NB: I am not a Python programmer
Hope it helps,
