Substring of a pandas dataframe column - python

I have a pandas dataframe that has a string column in it. The frame is over 2 million rows long, and looping to extract the elements I need is a poor choice. My current code looks like the following:
for i in range(len(table["series_id"])):
    table.loc[i, "state_code"] = table["series_id"][i][2:4]
    table.loc[i, "area_code"] = table["series_id"][i][5:9]
    table.loc[i, "supersector_code"] = table["series_id"][i][11:12]
where "series_id" is the string containing multiple information fields I want to create an example data element:
columns:
[series_id, year, month, value, footnotes]
The data:
[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']]
However, series_id is the column of interest that I am struggling with. I have looked at the str functions for Python and specifically pandas.
http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern
has a section describing each of the string functions; get and slice are specifically the functions I would like to use. Ideally I could envision a solution like so:
table["state_code"] = table["series_id"].str.get(1:3)
or
table["state_code"] = table["series_id"].str.slice(1:3)
or
table["state_code"] = table["series_id"].str.slice([1:3])
When I try any of the above, I get an invalid syntax error for the ":".
Alas, I cannot seem to figure out the proper way to perform this vectorized substring operation on a pandas dataframe column.
Thank you

I think I would use str.extract with some regex (which you can tweak for your needs):
In [11]: s = pd.Series(["SMU78000009092000001"])
In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]:
  state_code area_code supersector_code
0        U78      0000               92
This reads as: starts (^) with any two characters (which are ignored), the next three characters (any) are state_code, followed by any one character (ignored), followed by four digits which are area_code, and so on.
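If you prefer the get/slice route the question was aiming for, note that str.slice takes start and stop as positional arguments rather than colon syntax, and .str also supports plain slicing; a sketch using the question's own offsets:
table["state_code"] = table["series_id"].str.slice(2, 4)
table["area_code"] = table["series_id"].str[5:9]
table["supersector_code"] = table["series_id"].str[11:12]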

Related

How do I extract numbers from a dataframe column which follows a recurring pattern using pandas?

I am working on a dataframe which has a column that follows a pattern, shown in the image as Column Marks.
I need to create two separate columns, each containing the ENG and HIN marks separately.
I am aware I need to use .extract and enter the pattern to extract the marks but I can't seem to get it to work.
I am using pandas.
Any help will be appreciated.
Try this.
engMarks = []
hinMarks = []
for i in range(len(studentsDF)):
    marksString = studentsDF['Marks'][i]
    # Break the string into a list of single characters
    stringed = []
    for s in marksString:
        stringed.append(s)
    # Characters 4-5 hold the ENG mark, characters 15-16 the HIN mark
    engMarks.append(int(stringed[4] + stringed[5]))
    hinMarks.append(int(stringed[15] + stringed[16]))
studentsDF['ENG Score'] = engMarks
studentsDF['HIN Score'] = hinMarks
studentsDF
Essentially what I'm doing is getting the string of each student's marks, itemizing each character into an array, and getting the characters that correspond with the grade you're looking for. Then, I'm converting those to integers, appending them to new arrays containing all the scores for each respective class, and then adding those as new columns to the original studentsDF DataFrame.
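As a design note, the same fixed offsets can be taken without a Python-level loop via pandas string slicing; a vectorized sketch equivalent to the loop above, assuming every Marks string has the same layout:
# Characters 4-5 hold the ENG mark, characters 15-16 the HIN mark
studentsDF['ENG Score'] = studentsDF['Marks'].str[4:6].astype(int)
studentsDF['HIN Score'] = studentsDF['Marks'].str[15:17].astype(int)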

Replace with Python regex in pandas column

a = "Other (please specify) (What were you trying to do on the website when you encountered ^f('l1')^?Â\xa0)"
There are many values starting with '^f' and ending with '^' in a pandas column, and I need to replace them as below:
"Other (please specify) (What were you trying to do on the website when you encountered THIS ISSUE?Â\xa0)"
You don't mention what you've tried already, nor what the rest of your DataFrame looks like, but here is a minimal example:
# Create a DataFrame with a single data point
df = pd.DataFrame(["... encountered ^f('l1')^?Â\xa0)"])
# Define a regex pattern
pattern = r'(\^f.+\^)'
# Use the .replace() method to replace
df = df.replace(to_replace=pattern, value='TEST', regex=True)
Output
                         0
0  ... encountered TEST? )
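One caveat: because .+ is greedy, a cell containing two ^f...^ tokens would be collapsed into a single match stretching from the first ^f to the last ^. If that can happen in your data, a non-greedy quantifier replaces each token independently; a sketch using the question's desired replacement text:
# Non-greedy: stop at the first closing ^
pattern = r'\^f.+?\^'
df = df.replace(to_replace=pattern, value='THIS ISSUE', regex=True)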

Extract values from array type of column in pandas

I am trying to extract the location codes / product codes from a SQL table using pandas. The field is an array type, i.e. it has multiple values as a list within each row, and I have to extract the product/location codes from those strings.
Here is a sample of the table
df.head()
Target_Type Constraints
45 ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1
45 ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1
45 ti_8894,trad_8894_0.2
Now I want to extract the numeric values of the codes. I also want to ignore the trailing values after the 2nd underscore in the entries, i.e. ignore the _1, _0.2 etc.
Here is a sample of the output I want to achieve. It should be a unique list/df column of all the extracted values -
Target_Type_45_df.head()
Constraints
8188
9258
22420
8894
I have never worked with nested/array type of column before. Any help would be appreciated.
You can use explode to bring each variable into a single cell, under one column:
df = df.explode('Constraints')
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])
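Since the question asks for a unique list of codes, one more step reduces that column to distinct values; a sketch, assuming the second field is always numeric:
unique_codes = df['newConst'].astype(int).drop_duplicates().reset_index(drop=True)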
I would think the following overall strategy would work well (you'll need to debug):
1. Define a function that takes a row as input (the idea being to broadcast this function with the pandas .apply method).
2. In this function, set my_list = row['Constraints'].
3. Then do my_list = my_list.split(','). Now you have a list, with no commas.
4. Next, split on the underscore, take the second element (index 1), and convert to int:
numbers = [int(element.split('_')[1]) for element in my_list]
5. Finally, convert to a set: return set(numbers)
The output for each row will be a set - just union all these sets together to get the final result, as in the sketch below.
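A minimal sketch of that strategy, assuming Constraints holds comma-separated strings as in the sample:
import pandas as pd

def extract_codes(row):
    # Split the comma-separated constraints, then take the numeric
    # field after the first underscore in each token
    tokens = row['Constraints'].split(',')
    return set(int(token.split('_')[1]) for token in tokens)

# Union the per-row sets into one set of unique codes
all_codes = set().union(*df.apply(extract_codes, axis=1))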

Extract regex matches, and not groups, in data frames rows in Python

I am a novice in coding and I generally use R for this (stringr), but I have started to learn Python's syntax as well.
I have a data frame with one column generated from an imported Excel file. The values in this column contain capital and lowercase characters, symbols and numbers.
I would like to generate a second column in the data frame containing only some of these words included in the first column according to a regex pattern.
df = pd.DataFrame(["THIS IS A TEST 123123. s.m.", "THIS IS A Test test 123 .s.c.e", "TESTING T'TEST 123 da."],columns=['Test'])
df
Now, to extract what I want (words in capital case), in R I would generally use:
df <- str_extract_all(df$Test, "\\b[A-Z]{1,}\\b", simplify = FALSE)
to extract the matches of the regular expression in different data frame rows, which are:
* THIS IS A TEST
* THIS IS A
* TESTING T TEST
I couldn't find a similar solution for Python, and the closest I've got to is the following:
df["Name"] = df["Test"].str.extract(r"(\b[A-Z]{1,}\b)", expand = True)
Unfortunately this does not work, as it exports only the groups rather than the matches of the regex. I've tried multiple strategies, but str.extractall does not seem to work either ("TypeError: incompatible index of inserted column with frame index").
How can I extract the information I want with Python?
Thanks!
If I understand well, you can try:
df["Name"] = (df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")
              .unstack().fillna('').apply(' '.join, 1))
[EDIT]:
Here is a shorter version I discovered by looking at the doc:
df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)").unstack(fill_value='').apply(' '.join, 1)
You are on the right track of getting the pattern. This solution uses a regular expression with findall, join and map:
import re

df['Name'] = df['Test'].map(lambda x: ' '.join(re.findall(r"\b[A-Z\s]+\b", x)))
Result:
                             Test            Name
0     THIS IS A TEST 123123. s.m.  THIS IS A TEST
1  THIS IS A Test test 123 .s.c.e       THIS IS A
2          TESTING T'TEST 123 da.  TESTING T TEST

Match similar column elements using pandas and fuzzwuzzy

I have an excel file that contains 1000+ company names in one column and about 20,000 company names in another column.
The goal is to match as many names as possible. The problem is that the names in column one (1000+) are poorly formatted, meaning that a "Company Name" string can look something like "9Com(panynAm9e00". I'm trying to figure out the best way to solve this (only 12 names match exactly).
After trying different methods, I've ended up with attempting to match 4-5 or more characters in each name, depending on the length of each string, using regex. But I'm just struggling to find the most efficient way to do this.
For instance:
Column 1
1. 9Com(panynAm9e00
2. NikE4
3. Mitrosof2
Column 2
1. Microsoft
2. Company Name
3. Nike
Take the first element in Column 1 and look for a match in Column 2. If there is no exact match, then look for a string with 4-5 of the same characters.
Any suggestions?
I would suggest reading your Excel file with pandas and pd.read_excel(), and then using fuzzywuzzy to perform your matching, for example:
import pandas as pd
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['9Com(panynAm9e00'],
                   ['NikE4'],
                   ['Mitrosof2']],
                  columns=['Name'])

known_list = ['Microsoft', 'Company Name', 'Nike']

def find_match(x):
    # Return the best fuzzy match for x from the list of known names
    match = process.extractOne(x, known_list, scorer=fuzz.partial_token_sort_ratio)[0]
    return match

df['match found'] = [find_match(row) for row in df['Name']]
Yields:
               Name   match found
0  9Com(panynAm9e00  Company Name
1             NikE4          Nike
2         Mitrosof2     Microsoft
I imagine numbers are not very common in actual company names, so an initial filter step will help immensely going forward, but here is one implementation that should work relatively well even without it. A bag-of-letters (bag-of-words) approach, if you will (see the sketch after this list):
1. Convert everything (col 1 and 2) to lowercase.
2. For each known company in column 2, store each unique letter, and how many times it appears (count), in a dictionary.
3. Do the same (step 2) for each entry in column 1.
4. For each entry in col 1, find the closest bag-of-letters (dictionary from step 2) from the list of real company names.
The dictionary-distance implementation is up to you.
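For illustration, here is a minimal sketch of that bag-of-letters idea; the distance metric (sum of absolute count differences) is just one of many possible choices:
from collections import Counter

def bag_distance(a, b):
    # Compare lowercase letter counts of the two strings
    bag_a, bag_b = Counter(a.lower()), Counter(b.lower())
    return sum(abs(bag_a[ch] - bag_b[ch]) for ch in set(bag_a) | set(bag_b))

def closest_company(name, known_list):
    # Pick the known name whose letter counts differ the least
    return min(known_list, key=lambda known: bag_distance(name, known))

known_list = ['Microsoft', 'Company Name', 'Nike']
print(closest_company('Mitrosof2', known_list))  # Microsoft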
