Extract numbers from strings from a column in pandas dataframe - python

I have a dataframe called data. I am trying to clean one of its columns so I can convert the prices into numerical values only.
This is how I'm filtering the column to find the incorrect values:
data[data['incorrect_price'].astype(str).str.contains('[A-Za-z]')]
     Incorrect_Price         Occurences  errors
23   99 cents                732         1
50   3 dollars and 49 cents  211         1
72   the price is 625        128         3
86   new price is 4.39       19          2
138  4 bucks                 3           1
199  new price 429           13          1
225  price is 9.99           5           1
240  new price is 499        8           2
I have tried data['incorrect_Price'][20:51].str.findall(r"(\d+) dollars") and data['incorrect_Price'][20:51].str.findall(r"(\d+) cents") to find rows that have "cents" and "dollars" in them so I can extract the dollar and cents amounts, but I haven't been able to incorporate this when iterating over all rows in the dataframe.
I would like the results to look like this:
     Incorrect_Price         Desired  Occurences  errors
23   99 cents                .99      732         1
50   3 dollars and 49 cents  3.49     211         1
72   the price is 625        625      128         3
86   new price is 4.39       4.39     19          2
138  4 bucks                 4.00     3           1
199  new price 429           429      13          1
225  price is 9.99           9.99     5           1
240  new price is 499        499      8           2

The task can be solved relatively easily as long as the strings in Incorrect_Price retain the structure you present in the examples (numbers are not expressed in words).
Using regular expressions you can extract the number part and the optional "cent"/"cents" or "dollar"/"dollars" using an approach from a similar SO question. The two main differences are that you are looking for pairs of a numerical value and "cent[s]" or "dollar[s]", and that they can potentially occur more than once.
import re

def extract_number_currency(value):
    prices = re.findall(r'(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', value)
    result = 0.0
    for value, currency in prices:
        partial = float(value)
        if currency == 'cent':
            result += partial / 100
        else:
            result += partial
    return result

print(extract_number_currency('3 dollars and 49 cent'))
3.49
Now, what you need is to apply this function to all incorrect values in the column with prices in words. For simplicity I am applying it here to all values (but I am sure you will be able to deal with the subset):
data['Desired'] = data['Incorrect_Price'].apply(extract_number_currency)
Voila!
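If you only want to touch the rows you flagged earlier, one possible sketch (assuming the column is actually named Incorrect_Price, and reusing the mask from your question) is:
mask = data['Incorrect_Price'].astype(str).str.contains('[A-Za-z]')
data.loc[mask, 'Desired'] = data.loc[mask, 'Incorrect_Price'].apply(extract_number_currency)
Note this still only converts the "cent"/"dollar" style strings; rows like "4 bucks" or "the price is 625" would need additional patterns.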
Breaking down of the regex '(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?'
There are two capture named groups (?P<name_of_the_capture_group> .... )
The first capture group (?P<value>[\d]*[.]?[\d]{1,2}) captures:
[\d] - digits
[\d]* - repeated 0 or more times
[.]? - followed by optional (?) dot
[\d]{1,2} - followed by a digit repeated from 1 to 2 times
\s* - denotes 0 or more whitespaces
Now the 2nd capture group which is much simpler: (?P<currency>cent|dollar)
cent|dollar - it boils down to alternative between cent and dollar strings being captured
s? is an optional plural, so both 'cent'/'cents' and 'dollar'/'dollars' are matched
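For instance, running the pattern on one of your strings shows the (value, currency) pairs that findall returns:
re.findall(r'(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', '3 dollars and 49 cents')
# [('3', 'dollar'), ('49', 'cent')]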

Related

lookup within filtered range

I have a dataframe with data from ecommerce panel.
It has orders and returns mixed together.
Each row has orderID - it's the same number for normal orders and for corresponding returns that come back from customers.
My data looks like this:
orderID  Shop  Revenue  Note
44       0     -32      Return
45       0     -100     Return
44       1     14
45       3     20       Something else
46       2     50
47       1     80       Something
48       2     222
For each return I want to find a 'Shop' column value that corresponds to original order.
For example : 'orderID' == 44 comes twice: once as return (with 'Shop' == 0) and once as normal order (with 'Shop' == 1).
I want to replace all the 0 values in the 'Shop' column with the values from the matching original orders.
My desired output looks like this:
orderID  Shop  Revenue  Note
44       1     -32      Return
45       3     -100     Return
44       1     14
45       3     20       Something else
46       2     50
47       1     80       Something
48       2     222
I know how to do it in Google Sheets (first I filter the table removing the 'Shop' == 0 values and then I vlookup the numbers in this filtered array).
I know how to filter this table using Pandas, but I don't know how to write the lookup.
I assume that I will need to write a temporary column first, where I store both types of values: copied for normal orders and looked up for returns.
The original dataframe is 1,000,000+ rows.
My data in .csv is available here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vQAJ4tMc_Bcvv-4FsUy3E7sG0m9hm-nLTVLj-LwlSEns-YJ1pbq6gSKp5mj5lZqRI2EgHOsOutwnn1I/pub?gid=0&single=true&output=csv
Thank you for any advice!
IIUC, using map:
m = df.query('Shop != 0').set_index('orderID')['Shop']
df['Shop'] = df['orderID'].map(m)
print(df)
Output:
orderID Shop Revenue Note
0 44 1 -32 Return
1 45 3 -100 Return
2 44 1 14 NaN
3 45 3 20 Something else
4 46 2 50 NaN
5 47 1 80 Something
6 48 2 222 NaN
Create a pd.Series by using query to filter out zero shops, then set_index and map shops to orderID.
This works if there is a 1-to-1 shop-to-order mapping. If you have multiple shops per order, then you'll need logic to determine which shop is valid.
If you have duplicate orders to the same shop, then you need to drop_duplicates first.
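For example, a minimal sketch of the drop_duplicates case (assuming the first nonzero-Shop row per orderID is the one you want to keep):
m = (df.query('Shop != 0')
       .drop_duplicates(subset='orderID', keep='first')
       .set_index('orderID')['Shop'])
df['Shop'] = df['orderID'].map(m)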

Make new column from slice of string from one column pandas python

I read the answer at the linked question (Pandas make new column from string slice of another column), but it does not solve my problem.
df
SKU Noodles FaceCream BodyWash Powder Soap
Jan10_Sales 122 100 50 200 300
Feb10_Sales 100 50 80 90 250
Mar10_sales 40 30 100 10 11
and so on
Now I want month and year columns that take their values from the SKU column and return Jan for the month and 10 for the year (2010).
df['month']=df['SKU'].str[0:3]
df['year']=df['SKU'].str[4:5]
I get KeyError: 'SKU'
To understand why I get the error, I ran the following:
[IN]df.index.name
[OUT]None
[IN]df.columns
[OUT]Index(['Noodles','FaceCream','BodyWash','Powder','Soap'], dtype='object', name='SKU')
Please help
I think the first column is the index, so use .index. Also, for the year, change the 4:5 slice to 3:5 (the 0 can be omitted in 0:3):
df['month']=df.index.str[:3]
df['year']=df.index.str[3:5]
print (df)
Noodles FaceCream BodyWash Powder Soap month year
SKU
Jan10_Sales 122 100 50 200 300 Jan 10
Feb10_Sales 100 50 80 90 250 Feb 10
Mar10_sales 40 30 100 10 11 Mar 10
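If you would rather keep SKU as a regular column instead of the index, one possible sketch (assuming the frame looks like the print above) is:
df = df.reset_index()
df['month'] = df['SKU'].str[:3]
df['year'] = df['SKU'].str[3:5]
The slicing is the same; the only difference is where the SKU strings live.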

Simplify query to just 'where' by making new column with pandas?

I have SQL queries that are run against a column. These are passed to a function called Select_analysis.
Form:
Select_analysis(input_shapefile, output_name, {where_clause}) # it only accepts up to a where clause.
Example:
SELECT * from OT # OT is a dataset
GROUP BY OT.CA # CA is a number that may occur many times. Therefore we group by that field.
HAVING ((Count(OT.OBJECTID))>1) # an id that appears more than once.
OT dataset
objectid CA
1 125
2 342
3 263
1 125
We group by CA.
About HAVING: it keeps the groups whose objectid appears more than once, which is objectid 1 in this example.
My idea is to make another column that stores a result which can then be accessed with a simple where clause in the Select_analysis function.
example: OT dataset
objectid CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 263 1
1 125 2
So then it can be:
Select_analysis(roads.shp,output.shp, count_of_objectid_aftergroupby > '1')
Notes
It has to be done in such a way that the Select_analysis function is still used in the end.
Assuming that you are pulling the data into pandas since it's tagged pandas, here's one possible solution:
df=pd.DataFrame({'objectID':[1,2,3,1],'CA':[125,342,463,125]}).set_index('objectID')
objectID CA
1 125
2 342
3 463
1 125
df['count_of_objectid_aftergroupby'] = [df['CA'].value_counts().loc[x] for x in df['CA']]
objectID CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 463 1
1 125 2
The list comp does basically this:
- pull the value counts for each item in df['CA'] as a Series
- use loc to index into the Series at each value of 'CA' to find the count of that value
- put that item into a list
- append that list as a new column
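A slightly more idiomatic sketch that avoids the Python-level loop and gives the same counts (assuming the frame built above):
df['count_of_objectid_aftergroupby'] = df.groupby('CA')['CA'].transform('count')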

Counting the number of customers by values in a second series

I have imported a list of customers into python to run some RFM analysis, this adds a new field to the data for the RFM Class, so now my data looks like this:
customer RFMClass
0 0001914f-4655-4148-a1dc-1f25ca6d1f15 343
1 0002e50a-5551-4d9a-8734-76307dfe2131 341
2 00039977-512e-47ad-b929-170f18a1b14a 442
3 000693ff-2c61-425c-97c1-0286c874dd2f 443
4 00095dc2-7f37-48b0-894f-910d90cbbee2 142
5 000b748b-7ea0-48f2-a875-5f6cb95561d9 141
...
I'd like to plot a histogram showing the number of customers in each RFM Class, how can I get a count of the number of distinct customers ID's per class?
I tried adding a 1 to every row with summary['number'] = 1, thinking that it might be easier to count these rather than the customer IDs, as those have already been de-duped in my code, but I can't figure out how to sum them per RFM class either.
Any thoughts on how I could do this?
I worked this out by using .groupby on my RFM class and summing the 'number' I assigned to each row; the same pattern is shown below with an Hour/Orders example:
byhour = df.groupby(['Hour']).agg({'Orders': 'sum'})
print(byhour)
This then produces the desired output:
Orders
Hour
0 902
1 438
2 307
3 162
4 149
5 233
6 721
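Applied to the RFM data in the question (assuming the dataframe is called df and the columns are named customer and RFMClass as shown), a minimal sketch would be:
counts = df.groupby('RFMClass')['customer'].nunique()  # distinct customers per RFM class
counts.plot(kind='bar')                                 # histogram-style bar chart of customers per class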

Slicing in Pandas Dataframes and comparing elements

Morning. Recently I have been trying to use pandas for creating large data tables for machine learning (I'm trying to move away from numpy as best I can).
However, I'm running into some issues, namely with slicing pandas data frames.
I'd like to return the rows I specify, and reference and compare particular elements with those in other arrays. Here's a small amount of code I've implemented and an outline:
import pandas as pd
import csv
import math
import random as nd
import numpy
#create the pandas dataframe from my csv. The Csv is entirely numerical data
#with exception of the first row vector which has column labels
df=pd.read_csv(r"C:\Users\Python\Downloads\Data for Brent - Secondattampatatdfrandomsample.csv")
#I use panda functionality to return a random sample of the data (a subset
#of the array)
df_sample=pd.DataFrame.sample(df,10)
It's at this point that I want to compare the first element along each row vector to the original data. Specifically, the first element in any row contains an id number.
If the elements of the original data frame and the sample frame match up, I'd like to compute 3 and 6 month averages of the associated column elements with a matching id number.
I want to be clear that I'm comfortable moving back to numpy and away from pandas, but there are training model methods I hear a ton of good things about in pandas (my training is on the mathematics side of things and less so program development). Thanks for the input!
edit: here is the sample input for the first 11 row vectors in the dataframe (id, year, month, x, y, z):
id year month x y z
0 2 2016 2 1130 343.627538 163660.060200
1 2 2016 4 859 913.314513 360633.159400
2 2 2016 5 931 858.548056 93608.190030
3 2 2016 6 489 548.314860 39925.669950
4 2 2016 7 537 684.441725 80270.240060
5 2 2016 8 618 673.887072 124041.560000
6 2 2016 9 1030 644.749493 88975.429980
7 2 2016 10 1001 543.312870 54874.599830
8 2 2016 11 1194 689.053707 79930.230000
9 2 2016 12 673 483.644736 27567.749940
10 2 2017 1 912 657.716386 54590.460070
11 2 2017 2 671 682.007537 52514.580380
Here is how the sample data is returned (same n-tuple as before). I used native pandas functions to return a randomly generated subset of 10 row vectors out of almost 9000 entries:
2 2016 1 633 877.9282175 75890.97027
5185 2774 2016 4 184 399.418719 9974.375000
9441 4974 2017 2 239 135.520851 0.000000
5134 2745 2017 2 187 217.220657 7711.333333
8561 4063 2017 1 103 505.714286 18880.000000
3328 2033 2016 11 118 452.152542 7622.000000
3503 2157 2016 3 287 446.668831 8092.588235
5228 2791 2016 2 243 400.166008 12655.250000
9380 4708 2017 2 210 402.690583 5282.352941
1631 1178 2016 10 56 563.716667 16911.500000
2700 1766 2016 1 97 486.764151 6449.625000
I'd like to identify the appropriate positions in the sample array, search for identical elements in the original array, and compute averages (and eventually more rigorous statistical models) of their associated numerical data.
for id in df_sample['id'].unique():
    # group the full dataframe by id and average x, y, z for every id
    means = df.groupby('id').mean()[['x', 'y', 'z']].reset_index()
I'm not sure if this is exactly what you want, but I'll walk through it to see if it gives you ideas. For each unique id in the sample (I did it for all of them; implement whatever check you like), I grouped the original dataframe by id (all rows with id == 2 are smushed together) and took the mean of the resulting pandas GroupBy object (which averages the smushed-together rows, for each column not in the groupby call). Since this averages your month and year as well, and all you probably care about is x, y, and z, I selected those columns, and then for aesthetic purposes reset the index.
Alternatively, if you wanted the average for that id for each year in the original df, you could do
df.groupby(['id', 'year']).mean()[['x', 'y', 'z']].reset_index()
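If you specifically want 3 and 6 month averages rather than an all-time mean, here is a minimal sketch, assuming one row per (id, year, month) combination and that sorting by year and month gives chronological order:
df = df.sort_values(['id', 'year', 'month'])
df['x_3mo_avg'] = df.groupby('id')['x'].transform(lambda s: s.rolling(3, min_periods=1).mean())
df['x_6mo_avg'] = df.groupby('id')['x'].transform(lambda s: s.rolling(6, min_periods=1).mean())
The same transform can be repeated for y and z, and the result can be restricted to the sampled ids with df[df['id'].isin(df_sample['id'])].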