I am a novice in coding and generally use R (stringr) for this, but I have started learning Python's syntax as well.
I have a data frame with one column generated from an imported Excel file. The values in this column contain uppercase and lowercase characters, symbols, and numbers.
I would like to generate a second column in the data frame containing only some of the words from the first column, selected according to a regex pattern.
import pandas as pd

df = pd.DataFrame(["THIS IS A TEST 123123. s.m.", "THIS IS A Test test 123 .s.c.e", "TESTING T'TEST 123 da."], columns=['Test'])
df
Now, to extract what I want (words in capital case), in R I would generally use:
df <- str_extract_all(df$Test, "\\b[A-Z]{1,}\\b", simplify = FALSE)
to extract the matches of the regular expression in different data frame rows, which are:
* THIS IS A TEST
* THIS IS A
* TESTING T TEST
I couldn't find a similar solution for Python, and the closest I've gotten is the following:
df["Name"] = df["Test"].str.extract(r"(\b[A-Z]{1,}\b)", expand = True)
Unfortunately this does not work, as it returns only the first match of the group rather than all matches of the regex. I've tried multiple strategies, but str.extractall does not seem to work either ("TypeError: incompatible index of inserted column with frame index").
How can I extract the information I want with Python?
Thanks!
If I understand correctly, you can try:
df["Name"] = (df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")
              .unstack().fillna('').apply(' '.join, axis=1))
[EDIT]:
Here is a shorter version I discovered by looking at the doc:
df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)").unstack(fill_value='').apply(' '.join, axis=1)
You are on the right track with the pattern. This solution uses a regular expression together with join and map:
import re

df['Name'] = df['Test'].map(lambda x: ' '.join(re.findall(r"\b[A-Z\s]+\b", x)))
Result:
Test Name
0 THIS IS A TEST 123123. s.m. THIS IS A TEST
1 THIS IS A Test test 123 .s.c.e THIS IS A
2 TESTING T'TEST 123 da. TESTING T TEST
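Putting that answer's pieces together as a self-contained script (imports included; note that the \s in the character class can leave a trailing space on a match, which the printed table hides):

```python
import re
import pandas as pd

df = pd.DataFrame(
    ["THIS IS A TEST 123123. s.m.",
     "THIS IS A Test test 123 .s.c.e",
     "TESTING T'TEST 123 da."],
    columns=["Test"],
)

# Find runs of capital letters (whitespace allowed inside a run),
# then join the per-row matches into a single string.
df["Name"] = df["Test"].map(lambda x: " ".join(re.findall(r"\b[A-Z\s]+\b", x)))

# Matches can carry a trailing space; strip if you need exact values.
print(df["Name"].str.strip().tolist())
```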
I've used the code below to search across all columns of my dataframe to see if each row has the word "pool" and the words "slide" or "waterslide".
AR11_regex = r"""
(?=.*(?:slide|waterslide)).*pool
"""
f = lambda x: x.str.findall(AR11_regex, flags=re.VERBOSE | re.IGNORECASE)
d['AR'][AR11] = d['AR'].astype(str).apply(f).any(1).astype(int)
This has worked fine but when I want to write a for loop to do this for more than one regex pattern (e.g., AR11, AR12, AR21) using the code below, the new columns are all zeros (i.e., the search is not finding any hits)
for i in AR_list:
    print(i)
    pat = i + "_regex"
    print(pat)
    f = lambda x: x.str.findall(i + "_regex", flags=re.VERBOSE | re.IGNORECASE)
    d['AR'][str(i)] = d['AR'].astype(str).apply(f).any(1).astype(int)
Any advice on why this loop didn't work would be much appreciated!
A small sample data frame would help understand your question. In any case, your code sample appears to have a multitude of problems.
i+"_regex" is just the string "AR11_regex". It won't evaluate to the value of the variable with the identifier AR11_regex. Put your regex patterns in a dict.
d['AR'] is the values in the AR column. It seems like you expect it to be a row.
d['AR'][str(i)] is adding a new row. It seems like you want to add a new column.
Lastly, this approach to setting a cell generally (always for me) yields the following warning:
/var/folders/zj/pnrcbb6n01z2qv1gmsk70b_m0000gn/T/ipykernel_13985/876572204.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The suggested approach would be to use .at, as in d.at[str(i), 'AR'] or some such.
Add a sample data frame and refine your question for more suggestions.
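A minimal sketch of the dict suggestion; the second pattern, the `AR12` name, and the column values are invented for illustration:

```python
import re
import pandas as pd

# Hypothetical patterns keyed by the output column they should produce.
patterns = {
    "AR11": r"(?=.*(?:slide|waterslide)).*pool",
    "AR12": r"(?=.*(?:tub|hottub)).*spa",
}

d = pd.DataFrame({"AR": ["big pool with a waterslide",
                         "just a sauna",
                         "hot tub and spa area"]})

for name, pat in patterns.items():
    # str.contains does a regex search per row; no need to turn a string
    # like "AR11_regex" back into a variable.
    d[name] = d["AR"].str.contains(pat, flags=re.IGNORECASE, regex=True).astype(int)

print(d)
```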
Given a large CSV file (large enough to exceed RAM), I want to read only specific columns matching some patterns. The columns can be any of the following: S_0, S_1, ..., D_1, D_2, etc. For example, a chunk from the data frame looks like this:
And the regex pattern would be, for example, any column that starts with S: S_\d.*.
Now, how do I apply this with pd.read_csv(/path/, __) to read the specific columns as mentioned?
You can first read a few rows and use DataFrame.filter to get the matching columns:
cols = pd.read_csv('path', nrows=10).filter(regex=r'S_\d*').columns
df = pd.read_csv('path', usecols=cols)
Took the same approach (as of now) as mentioned in the comments. Here is the detailed piece I used:
import re

def extract_col_names(all_cols, pattern):
    result = []
    for col in all_cols:
        if re.match(pattern, col):
            result.append(col)
    return result

extract_col_names(cols, pattern=r"S_\d+")
And it works!
But this is a work-around; say even loading the column names is heavy enough by itself. So, does there exist any method to parse regex patterns at the time of reading CSVs? This still remains a question.
Thanks for the response :)
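For what it's worth, here is the two-step idea end-to-end, plus a usecols-as-callable variant, which lets pandas filter column names while the file is being read (an in-memory CSV stands in for the large file):

```python
import io
import re
import pandas as pd

csv_data = "S_0,S_1,D_1,D_2\n1,2,3,4\n5,6,7,8\n"

# Step 1: read only the header row to discover the column names.
header_cols = pd.read_csv(io.StringIO(csv_data), nrows=0).columns
cols = [c for c in header_cols if re.match(r"S_\d+", c)]

# Step 2: read the full file, restricted to the matching columns.
df = pd.read_csv(io.StringIO(csv_data), usecols=cols)

# Single-pass alternative: usecols accepts a callable applied to each
# column name, so the regex runs at read time.
df2 = pd.read_csv(io.StringIO(csv_data),
                  usecols=lambda c: re.match(r"S_\d+", c) is not None)

print(df.columns.tolist(), df2.columns.tolist())
```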
I am working on a dataframe that has a column, Marks, which follows the pattern shown in the image.
I need to create two separate columns each containing ENG and HIN marks separately.
I am aware I need to use .extract and enter the pattern to extract the marks but I can't seem to get it to work.
I am using pandas.
Any help will be appreciated.
Try this.
engMarks = []
hinMarks = []
for i in range(len(studentsDF)):
    marksString = studentsDF['Marks'][i]
    stringed = []
    for s in marksString:
        stringed.append(s)
    engMarks.append(int(stringed[4] + stringed[5]))
    hinMarks.append(int(stringed[15] + stringed[16]))
studentsDF['ENG Score'] = engMarks
studentsDF['HIN Score'] = hinMarks
studentsDF
Here is the code in Jupyter Notebook (so you can see the output):
Essentially what I'm doing is getting the string of each student's marks, itemizing each character into an array, and getting the characters that correspond with the grade you're looking for. Then, I'm converting those to integers, appending them to new arrays containing all the scores for each respective class, and then adding those as new columns to the original studentsDF DataFrame.
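Since the actual Marks format is only visible in the question's image, here is a hedged sketch of the str.extract route the asker mentioned, assuming strings shaped like "ENG: 85, HIN: 72" (the labels and layout are assumptions):

```python
import pandas as pd

# Hypothetical data; the real format is only shown in the question's image.
studentsDF = pd.DataFrame({"Marks": ["ENG: 85, HIN: 72", "ENG: 64, HIN: 91"]})

# Named groups become column names; \d+ tolerates marks of any width,
# unlike fixed character positions.
marks = studentsDF["Marks"].str.extract(r"ENG:\s*(?P<ENG>\d+).*?HIN:\s*(?P<HIN>\d+)")
studentsDF["ENG Score"] = marks["ENG"].astype(int)
studentsDF["HIN Score"] = marks["HIN"].astype(int)
print(studentsDF[["ENG Score", "HIN Score"]].values.tolist())
```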
a = "Other (please specify) (What were you trying to do on the website when you encountered ^f('l1')^?Â\xa0)"
There are many values starting with '^f' and ending with '^' in a pandas column, and I need to replace them as below:
"Other (please specify) (What were you trying to do on the website when you encountered THIS ISSUE?Â\xa0)"
You don't mention what you've tried already, nor what the rest of your DataFrame looks like but here is a minimal example:
# Create a DataFrame with a single data point
df = pd.DataFrame(["... encountered ^f('l1')^?Â\xa0)"])
# Define a regex pattern
pattern = r'(\^f.+\^)'
# Use the .replace() method to replace
df = df.replace(to_replace=pattern, value='TEST', regex=True)
Output
0
0 ... encountered TEST? )
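If the target is a single known column and you want the replacement text from the question ("THIS ISSUE"), Series.str.replace does the same job; the column name `comment` here is made up, and the non-greedy `.+?` keeps the match from swallowing a second `^...^` token on the same line:

```python
import pandas as pd

df = pd.DataFrame({"comment": ["... encountered ^f('l1')^?\xa0)"]})

# Replace every ^f...^ token with the literal text THIS ISSUE.
df["comment"] = df["comment"].str.replace(r"\^f.+?\^", "THIS ISSUE", regex=True)
print(df["comment"][0])
```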
I have a pandas dataframe that has a string column in it. The length of the frame is over 2 million rows and looping to extract the elements I need is a poor choice. My current code looks like the following
for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]
where "series_id" is a string containing multiple information fields. To give an example data element:
columns:
[series_id, year, month, value, footnotes]
The data:
[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
['SMS01000000000000001' '2006' 'M02' 1970.4 '']
['SMS01000000000000001' '2006' 'M03' 1976.6 '']
However, series_id is the column of interest that I am struggling with. I have looked at the str.FUNCTION documentation for Python and specifically pandas.
http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern
has a section describing each of the string functions i.e. specifically get & slice are the functions I would like to use. Ideally I could envision a solution like so:
table["state_code"] = table["series_id"].str.get(1:3)
or
table["state_code"] = table["series_id"].str.slice(1:3)
or
table["state_code"] = table["series_id"].str.slice([1:3])
When I try the functions above I get an invalid syntax error for the ":", but alas I cannot seem to figure out the proper way to perform a vectorized substring operation on a pandas data frame column.
Thank you
I think I would use str.extract with some regex (which you can tweak for your needs):
In [11]: s = pd.Series(["SMU78000009092000001"])
In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]:
state_code area_code supersector_code
0 U78 0000 92
This reads as: starts (^) with any two characters (which are ignored), the next three (any) characters are state_code, followed by any character (ignored), followed by four digits are area_code, ...
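As a side note, the positional slicing the question was originally reaching for is also vectorized: `.str` supports slice notation directly, and `str.slice(start, stop)` takes comma-separated bounds, which is why the `1:3` syntax failed:

```python
import pandas as pd

table = pd.DataFrame({"series_id": ["SMS01000000000000001",
                                    "SMU78000009092000001"]})

# .str[start:stop] slices every string in the column at once.
table["state_code"] = table["series_id"].str[2:4]
# Equivalent spelling with explicit arguments:
table["area_code"] = table["series_id"].str.slice(5, 9)

print(table["state_code"].tolist())  # ['S0', 'U7']
```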