Accessing data using df['foo'] truncates data for pattern searching in Python

So I have this function which takes one row from a dataframe, matches a pattern against it, and adds the captured values to the data. Since the pattern search needs its input to be a string, I am forcing it with str(). However, if I do that it cuts off my URL after a certain point.
I figured out that if I force it using the ix function,
str(data.ix[0,'url'])
it does not cut anything off and gets me what I want. However, if I use str(data.ix[:,'url']), it also cuts off after some point.
The problem is that I cannot specify the index position inside the ix function, as I plan to iterate row by row using the apply function. Any suggestions?
import re

def foo(data):
    url = str(data['url'])
    m = re.search(r"model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)", url)
    if m:
        data['make'] = m.group("make")
        data['model'] = m.group("model")
    return data

Iterating row-by-row is a last resort. It's almost always slower, less readable, and less idiomatic.
Fortunately, there is an easy way to do what you want. Check out the Series.str.extract method, added in version 0.13 of pandas.
Something like this...
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
The result, extracted_data, will be a new DataFrame with columns named 'model' and 'make', inferred from the named groups in your regex pattern.
Join it to your original DataFrame, and you're done.
data = data.join(extracted_data)
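For illustration, a minimal end-to-end sketch, assuming a frame with a 'url' column shaped like the one in the question (the sample URL here is made up):
import pandas as pd

data = pd.DataFrame({'url': ['http://example.com/?model=corolla&id=7&make=toyota']})
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
data = data.join(extracted_data)
# data now has 'model' and 'make' columns alongside 'url'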

Related

Check if terms are in columns and remove

Originally I wanted to filter only for specific terms; however, I've found Python will match the pattern regardless of specificity, e.g.:
possibilities = ['temp', 'degc']
temp = (df.filter(regex='|'.join(re.escape(x) for x in possibilities))
        .columns.to_list())
The output does find the correct columns, but unfortunately it also returns columns like temp_uncalibrated, which I do not want.
So to solve this, so far I define and remove the unwanted columns first, before filtering, i.e.:
if 'temp_uncalibrated' in df.columns:
    df = df.drop('temp_uncalibrated', axis=1)
However, I have found more and more of these unwanted columns, and now the code looks messy and hard to read with all the terms. Is there a way to do this more succinctly? I tried putting the terms in a list and doing it that way, but it does not work, i.e.:
if list in df.columns:
    df = df.drop(list, axis=1)
I thought maybe a function might be a better way to do it, but I'm not really sure where to start.
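For reference, a minimal sketch of the list-based idea (the unwanted names below are hypothetical): DataFrame.drop accepts a list, and errors='ignore' skips any names that are absent. Alternatively, word-boundary anchors in the filter regex stop 'temp' from matching 'temp_uncalibrated' in the first place:
import re

# Drop a whole list of unwanted columns in one call (hypothetical names).
unwanted = ['temp_uncalibrated', 'degc_uncalibrated']
df = df.drop(columns=unwanted, errors='ignore')

# Or anchor each term so partial column names don't match.
possibilities = ['temp', 'degc']
pattern = '|'.join(rf'\b{re.escape(x)}\b' for x in possibilities)
temp = df.filter(regex=pattern).columns.to_list()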

How can I mix similar code using loops (for/while) in Python?

I have some code repeated, where only some numbers are changing.
df_h0 = df.copy()
df_h0['hour']='00:00'
df_h0['totalCount']=df.post_time_data.str.split('"00:00","postCount":"').str[1].str.split('","topic').str[0]
df_h0 = df_h0.fillna(0)
df_h1 = df.copy()
df_h1['hour']='01:00'
df_h1['totalCount']=df.post_time_data.str.split('"01:00","postCount":"').str[1].str.split('","topic').str[0]
df_h1 = df_h1.fillna(0)
df_h2 = df.copy()
df_h2['hour']='02:00'
df_h2['totalCount']=df.post_time_data.str.split('"02:00","postCount":"').str[1].str.split('","topic').str[0]
df_h2 = df_h2.fillna(0)
I want to simplify this code with a loop but I'm not sure how to start with that since I'm new in Python.
I will try to show what the process looks like in general, so that you can figure these things out yourself in the future. However, it's not automatic - you will need to think about what you are doing every time, in order to write the best code you are capable of.
Step 1: Grab a single representative block of the code you want to repeat, and identify the parts that change:
df_h0 = df.copy()
# ^^^ the variable name changes
df_h0['hour']='00:00'
# ^^^^^ the hour string changes
df_h0['totalCount']=df.post_time_data.str.split('"00:00","postCount":"').str[1].str.split('","topic').str[0]
# the delimiter string changes ^^^^^^^^^^^^^^^^^^^^^^^
df_h0 = df_h0.fillna(0)
Step 2: Understand that our output will be a list of values, instead of multiple separate variables with related names.
This will be much easier to work with going forward :)
Step 3: Analyze the changes.
We have an hour string which varies, and a delimiter string which also varies; but the delimiter string always has the same general form, which is based upon the hour string. So if we have the hour string, we can create the delimiter string. There is really only one piece of varying information - the hour. We'll adjust the code to reflect that:
hour = '00:00' # give the variable information a name
delimiter = f'"{hour}","postCount":"' # compute the derived information
# and then use those values in the rest of the code
df_h0 = df.copy()
df_h0['hour'] = hour
df_h0['totalCount']=df.post_time_data.str.split(delimiter).str[1].str.split('","topic').str[0]
df_h0 = df_h0.fillna(0)
Step 4: To make the overall code easier to understand, we put this block into its own function.
That lets us give a name to the process of making a single table. We use the input to the function to provide the varying information that we described in step 3. There is one thing that changes, so there will be one parameter to represent that. However, we also need to provide the data context that we're working with here - the df dataframe - so that the function has access to it. So we have two parameters in total.
def hourly_data(df, hour):
    # Since 'hour' was provided, we don't define it here.
    delimiter = f'"{hour}","postCount":"'
    # Now we use a generic name inside the function.
    result = df.copy()
    result['hour'] = hour
    result['totalCount'] = df.post_time_data.str.split(delimiter).str[1].str.split('","topic').str[0]
    # At the last step of the original process, we `return` the value
    # instead of simply assigning it.
    return result.fillna(0)
Now we have code that, given an 'hour' string, can produce a new dataframe, simply by calling it - for example: df_h0 = hourly_data(df, '00:00').
Step 5: A bit more analysis.
We would like to call this function with each possible hour value, presumably from '00:00' through '23:00' inclusive. However, these strings have an obvious pattern to them. It would be easier if we just supply the number for the hour to hourly_data, and have it produce the string.
def hourly_data(df, hour):
    # Locally replace the integer hour value with the hour string.
    # The `:02` here is used to zero-pad and right-align the hour
    # value as two digits.
    hour = f'{hour:02}:00'
    delimiter = f'"{hour}","postCount":"'
    # The rest as before.
    result = df.copy()
    result['hour'] = hour
    result['totalCount'] = df.post_time_data.str.split(delimiter).str[1].str.split('","topic').str[0]
    return result.fillna(0)
Step 6: Now we are ready to use this code in a loop.
In Python, the natural loop to "transform" one input list into another is the list comprehension. It looks like this:
hourly_dfs = [hourly_data(df, hour) for hour in range(24)]
Here, range is a built-in function that gives us the desired sequence of input values.
We can also build the list manually with a for loop:
hourly_dfs = []
for hour in range(24):
    hourly_dfs.append(hourly_data(df, hour))
We could also have done the work inside the body of the for loop (someone else will probably come along with another answer and show code like that). But by making the function first, we get code that is easier to understand, and which also allows us to use a list comprehension. The list comprehension approach is simpler because we don't have to think about the process of starting from empty and .appending each element; we let Python build a list instead of telling it how to do so.
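If the end goal is a single long table rather than 24 separate frames, the pieces combine naturally afterwards; a small sketch (assuming pandas is imported as pd):
hourly_dfs = [hourly_data(df, hour) for hour in range(24)]
all_hours = pd.concat(hourly_dfs, ignore_index=True)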
You could collect the results in a list, iterate over the hour numbers, and use the string .format method:
dfs = []
for x in range(3):
    var = df.copy()
    var['hour'] = '{:02d}:00'.format(x)
    var['totalCount'] = df.post_time_data.str.split('"{:02d}:00","postCount":"'.format(x)).str[1].str.split('","topic').str[0]
    var = var.fillna(0)
    dfs.append(var)
df_h0, df_h1, df_h2 = dfs
If you have Python 3.6+ you can use f-strings instead of .format() as well.
Hopefully I haven't missed anything, but if I have, you can implement the same logic by adjusting the range or the format string.
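For example, with f-strings the two formatted lines might look like this (a sketch of the same logic):
var['hour'] = f'{x:02d}:00'
var['totalCount'] = df.post_time_data.str.split(f'"{x:02d}:00","postCount":"').str[1].str.split('","topic').str[0]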

Pythonic equivalent to Matlab's textscan

There are some similar questions to this, but nothing exact that I can find.
I have a very odd text-file with lines like the following:
field1=1; field2=2; field3=3;
field1=4; field2=5; field3=6;
Matlab's textscan() function deals with this very neatly, as you can do this:
array = textscan(fid, 'field1=%d; field2=%d; field3=%d;');
and you will get back a cell-array where each column contains the respective field, and the text is simply ignored.
I'd like to rewrite the code that deals with this file in Python, but Numpy's loadtxt() and genfromtxt() don't seem to have this ability to ignore text interspersed with the desired numbers?
What are some Python ways to strip out the text and only get back the fields? I'm happy to use pandas or another library if required. Thanks!
EDIT: This question was suggested as an answer, but it only gives equivalents to the basic usage of textscan that does not deal with unwanted text in the input. The answer below with fromregex is what I needed.
Numpy's fromregex function is basically the same as textscan. It lets you read in based on a regular expression, with groups (parts surrounded by ()) as the values. This works for your example:
import numpy as np

data = np.fromregex('temp.txt', r'field1=(\d+); field2=(\d+); field3=(\d+);', dtype='int')
You can also use loadtxt. There is an argument, converters, that lets you provide functions that do the actual conversion from text to a number. You just need to provide it a function that strips out the unneeded text.
So in my tests this works:
myconv = lambda x: int(x.split(b'=')[-1])
mycols = [0, 1, 2]
convdict = {i: myconv for i in mycols}
data = np.loadtxt('temp.txt', delimiter=';', usecols=mycols, converters=convdict)
myconv is an anonymous function that takes a value (say b'field1=1'), splits it on the '=' symbol (making [b'field1', b'1']), takes the last element (b'1'), and converts that to an integer (1).
mycols is just the numbers of the columns you want to keep. Since there is a delimiter at the end of each line, it produces an empty trailing column, so we exclude that.
convdict is a dictionary where each key is a column number and each value is the function to convert that column to a number. In this case they are all the same, but you can customize them however you want.
Python has no exact equivalent of Matlab's textscan (edit: but numpy has fromregex; see @TheBlackCat's answer for more).
With more complicated formats regular expressions may get the job done.
import re
line_pat = re.compile(r'field1=(\d+); field2=(\d+); field3=(\d+);')
with open(filepath, 'r') as f:
    array = [[int(n) for n in line_pat.match(line).groups()] for line in f]

Python Pandas replace string based on format

Please, is there any way to replace "x-y" with "x,x+1,x+2,...,y" in every row of a data frame (where x and y are integers)?
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc
I know that by looping through the lines and using regular expressions we can do that, but the table is quite big and it takes quite some time, so I think using pandas might be faster.
Thanks a lot.
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed as a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):  # 's', not 'str', to avoid shadowing the built-in
    lst = s.split(',')
    newlst = []
    for i in lst:
        if "-" in i:
            start, end = (int(j) for j in i.split("-"))
            newlst.extend(range(start, end + 1))  # inclusive of the end value
        else:
            newlst.append(int(i))
    return newlst
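A usage sketch, assuming the strings live in a column named 'col' (a made-up name) and the result should be a string again:
df['col'] = df['col'].apply(lambda s: ','.join(str(n) for n in make_list(s)))
# "1-3,7" becomes "1,2,3,7"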

Count matches in Mongodb $or

Trying to count the matches across all columns.
I currently use this code to copy across certain fields from a Scrapy item.
def getDbModel(self, item):
    deal = {"name": item['name']}
    if 'imageURL' in item:
        deal["imageURL"] = item['imageURL']
    if 'highlights' in item:
        deal['highlights'] = replace_tags(item['highlights'], ' ')
    if 'fine_print' in item:
        deal['fine_print'] = replace_tags(item['fine_print'], ' ')
    if 'description' in item:
        deal['description'] = replace_tags(item['description'], ' ')
    if 'search_slug' in item:
        deal['search_slug'] = item['search_slug']
    if 'dealURL' in item:
        deal['dealurl'] = item['dealURL']
    return deal
Wondering how I would turn this into an OR search in mongodb.
I was looking at something like the below:
def checkDB(self, item):
    # Check if the record exists in the DB
    deal = self.getDbModel(item)
    return self.db.units.find_one({"$or": [deal]})
Firstly, is this the best method for doing this?
Secondly, how would I find the count of the number of columns matched, i.e. to limit results to records that match at least two columns?
There is no easy way of counting the number of column matches on MongoDB's end; it just matches and then returns.
You would probably be better off doing this client side. I am unsure exactly how you intend to use this count figure, but there is no easy way of producing it, whether through map-reduce or the aggregation framework.
You could, in the aggregation framework, change your schema a little to put these columns within a properties field and then $sum the matches within the subdocument. This is a good approach since you can also sort on it to create a kind of relevance search (if that is what you're intending).
As to whether this is a good approach: it depends. When using an $or, MongoDB can use an index for each condition; this is a special case within MongoDB indexing. However, it does mean you should take this into consideration when writing an $or and ensure you have indexes to cover each condition.
You have also got to consider that MongoDB will effectively evaluate each clause and then merge the results to remove duplicates, which can be heavy for bigger $ors or a large working set.
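For the at-least-two-columns requirement specifically, a rough client-side sketch (hypothetical, reusing the list-of-conditions format for deal shown below, inside a method like checkDB): fetch the $or candidates, then count how many individual conditions each document satisfies:
candidates = self.db.units.find({"$or": deal})
strong_matches = [
    doc for doc in candidates
    if sum(1 for cond in deal
           for k, v in cond.items()
           if doc.get(k) == v) >= 2
]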
Of course, the format of your $or is wrong: you need an array of documents, one per field. At the minute you have a single-element array holding one document with all your attributes. When used like this, the attributes effectively have an $and condition between them, so it won't work.
You could probably change your code to:
def getDbModel(self, item):
    deal = [{"name": item['name']}]
    if 'imageURL' in item:
        deal.append({"imageURL": item['imageURL']})
    if 'highlights' in item:
        ...  # etc., as before
    # Some way down:
    return self.db.units.find_one({"$or": deal})
NB: I am not a Python programmer
Hope it helps,
