Transform a complicated string based on criteria in Python and pandas

I have a column that contains strings in a complicated format. I would like to keep only the first word, or the first word plus certain other words.
I want to keep certain key words from the string, such as 'RED', 'DB', 'APP', 'Infra', etc.
DATA
type grp
Goodbye-CCC-LET-TestData-A.1 a
Hello-PIR-SSS-Hellosims-App-INN-A.0 b
Hello-PIR-SSS-DB-RED-INN-C.0 c
Hello-PIR-SSS-App-SA200-F.0 d
Goodbye-PIR-SIR-DB_set-int-e.1 c
OK-PIR-SVV-Infra_ll-NA-A.0 e
DESIRED
type grp
Goodbye a
Hello-App b
Hello-DB-RED c
Hello-App d
Goodbye-DB c
OK-Infra e
DOING
s = (df['type'].str.split('-')
               .str[0]
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False),
                        sep=' ',
                        na_rep='')
               .str.strip())
df.insert(1, 'type', s)
The code above only gives me the first word, for example:
Goodbye
Hello
OK
Any suggestion is appreciated. I am still researching.

You can use str.extractall on your Series, then join the values:
import pandas as pd
import re
df.drop(columns='type').join(
    df['type'].str.extractall(r'(^\w+)-|(app|red|infra|db)', flags=re.IGNORECASE)
              .stack()
              .groupby(level=0)
              .agg(type='-'.join)
)
grp type
0 a Goodbye
1 b Hello-App
2 c Hello-DB-RED
3 d Hello-App
4 c Goodbye-DB
5 e OK-Infra
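To see why this works, it can help to inspect the intermediate steps; a minimal sketch, assuming the df from the question:

import re

import pandas as pd

# extractall returns one row per regex match, indexed by (row, match);
# group 0 captures the leading word, group 1 a key word, the other is NaN.
extracted = df['type'].str.extractall(r'(^\w+)-|(app|red|infra|db)',
                                      flags=re.IGNORECASE)

# stack() drops the NaN half of each match, leaving one value per match;
# grouping by level 0 (the original row) joins the values in order.
print(extracted.stack().groupby(level=0).agg('-'.join))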

Related

How to check if string elements of lists are in dataframe/other list (python)

I have the following problem. I have two lists/DataFrames. One is a list/DataFrame of customers, where every row is a customer and the columns are synonyms for that customer, i.e. other verbal expressions.
customer_list = {'A': ['AA', 'AA', 'AAA'], 'B': ['B', 'BB','BBB'], 'C': ['C','CC','CCC']}
customer_df = pd.DataFrame.from_dict(customer_list, orient='index')
Then I have another DataFrame with the following structure:
text = [['A', 'Hello i am AA', 'Hello i am BB', 'Hello i am A'], ['B', 'Hello i am B', 'Hello i am BBB','Hello i am BB'], ['C', 'Hello i am AAA','Hello i am CC','Hello i am CCC']]
text_df = pd.DataFrame(text)
text_df = text_df.set_index(0)
text_df = text_df.rename_axis("customer")
How (which types, which functions) can I check every row of text_df (e.g. every element of row "A") for "wrong entries", meaning elements/synonyms that belong to other customers (so check every entry except the customer's own)? Do I have to create multiple DataFrames in a for loop? Is one loop enough?
Thanks for any advice, even just a hint about which methods to use.
For my example, a result like
Wrong texts: A: Hello i am BB, C: Hello i am AAA
or the corresponding indices would be great.
First, I would use pd.melt to transform this DataFrame into an "index" of (customer, column, value) triples, like so:
df = pd.melt(text_df.reset_index(), id_vars="customer", var_name="columns")
Now, we have a way of "efficiently" operating over the entire data without needing to figure out the "right" columns and the like. So let's solve the "correctness" problem.
def correctness(melted_row: pd.Series, customer_df: pd.DataFrame) -> bool:
    customer = customer_df.loc[melted_row.customer]
    cust_ids = customer.values.tolist()
    return any(melted_row.value.endswith(cust_id) for cust_id in cust_ids)
Note: You could swap out .endswith for a variety of str functions to match your needs; take a look at the docs.
Lastly, you can generate a mask by using the apply method across rows, like so:
df["correct"] = df.apply(correctness, axis=1, args=(customer_df, ))
You'll then have an output that looks like this:
customer columns value correct
0 A 1 Hello i am AA True
1 B 1 Hello i am B True
2 C 1 Hello i am AAA False
3 A 2 Hello i am BB False
4 B 2 Hello i am BBB True
5 C 2 Hello i am CC True
6 A 3 Hello i am A False
7 B 3 Hello i am BB True
8 C 3 Hello i am CCC True
I imagine you have other things you want to do before "un-melting" your data, so I'll point you to this SO question on how to "un-melt" your data.
By "efficient", I really mean that you have a way of leveraging built-in functions of pandas, not that it's "computationally efficient". My memory is foggy on this, but using .apply(...) is generally something to do as a last-resort. I imagine there are multiple ways to crack this problem that use built-ins, but I find this solution to be the most readable.

Python: String match is not working with regular expression

We are trying to extract the rows whose column value is strictly one of the following values: TC1, TC2, TC3. The trick is that some rows contain values such as TC12, TC13, etc., which we don't want to extract, so using str.contains alone is not an option here.
Input:
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
4 D TC12
5 D TC15
6 D TC16
Desired output:
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
We used the following commands:
df1 = df.loc[df1['Col_3'].str.match("TC\d{1}")]
df1 = df.loc[df1['Col_3'].str.match("TC[1-3]{1}")]
df1 = df.loc[df1['Col_3'].str.match("TC[1,2,3]")]
But the problem is that it's not working. Instead of returning the first 3 rows, it returns all of the rows. We don't understand what's wrong.
I would do
import pandas as pd
df = pd.DataFrame({"col":['TC1','TC2','TC3','TC12','TC15','TC16']})
print(df[df["col"].str.match(r"^TC\d$")])
output
col
0 TC1
1 TC2
2 TC3
Explanation: I used ^ and $, which anchor the match at the start and end of the string, so only a full match is detected. I also used a so-called raw string (the r prefix) so I can use \d inside it without additional escaping (for more about this, see the re docs). As a side note, "TC[1,2,3]" does not do what you think: characters enumerated inside [ ] need no separator, so the , is treated as a literal character, and therefore
import re
if re.match("TC[1,2,3]", "TC,"):
    print("match")
else:
    print("no match")
output
match
You can use str.contains -
df = df[df.Col_3.str.contains(pat=r'^TC\d$')]
or via str.match -
df = df[df.Col_3.str.match(pat=r'^TC\d$')]
or via str.fullmatch (no anchors needed, since the whole string must match) -
df = df[df.Col_3.str.fullmatch(pat=r'TC\d')]
or via apply (slow) -
import re
df = df[df.Col_3.apply(lambda x: re.match(r'^TC\d$', x)).notna()]
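As a quick sanity check of the fullmatch variant on a throwaway frame (mirroring the sample data above):

import pandas as pd

df = pd.DataFrame({"Col_3": ["TC1", "TC2", "TC3", "TC12", "TC15", "TC16"]})
# fullmatch requires the entire string to match, so TC12/TC15/TC16 drop out.
print(df[df["Col_3"].str.fullmatch(r"TC\d")])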

Select rows if string begins with certain characters in pandas

I have a csv file like the one pictured below (a words column and a frequency column).
I'm trying to find any word that starts with the letter A or G, or any list of letters that I want, but my code returns an error. Any ideas what I'm doing wrong?
This is my code:
if len(sys.argv) == 1:
    print("please provide a CSV file to analyse")
else:
    fileinput = sys.argv[1]
    wdata = pd.read_csv(fileinput)
    print(list(filter(startswith("a", "g"), wdata)))
To get relevant rows, extract the first letter, then use isin:
df
words frequency
0 what 10
1 and 8
2 how 8
3 good 5
4 yes 7
df[df['words'].str[0].isin(['a', 'g'])]
words frequency
1 and 8
3 good 5
If you want a specific column, use loc:
df.loc[df['words'].str[0].isin(['a', 'g']), 'words']
1 and
3 good
Name: words, dtype: object
df.loc[df['words'].str[0].isin(['a', 'g']), 'words'].tolist()
# ['and', 'good']
Use Series.str.startswith, converting the list to a tuple, and filter by DataFrame.loc with boolean indexing:
wdata = pd.DataFrame({'words':['what','and','how','good','yes']})
L = ['a','g']
s = wdata.loc[wdata['words'].str.startswith(tuple(L)), 'words']
print (s)
1 and
3 good
Name: words, dtype: object
It is very easy and handy: you can just use str.startswith this way (it is case sensitive, so match the case of your data):
df[df['words'].str.startswith('g')]
df[df['words'].str.startswith('a')]
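If case should not matter at all, one small sketch (column name words taken from the sample above) is to lowercase before testing:

import pandas as pd

wdata = pd.DataFrame({'words': ['What', 'And', 'how', 'Good', 'yes']})
# Lowercase a copy of the column, then test both initials in one pass.
print(wdata[wdata['words'].str.lower().str.startswith(('a', 'g'))])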

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using pandas. The .csv contains a column named 'about', which holds the rows of data I want to manipulate.
I've already used the str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, because they work on the same About column; and because the values are converted to lowercase first, the regex can be changed to remove everything that is not a lowercase letter or a space:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example DataFrame:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex (the pattern is reconstructed here from the explanation below):
>>> regex_pat = re.compile(r"[^a-z]+$")
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z] matches a single character not in the range a to z (case sensitive)
+ quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
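A self-contained sketch of that compiled-regex variant, end to end (the sample values here are invented):

import re

import pandas as pd

# Strip everything after the last lowercase letter (trailing digits etc.).
regex_pat = re.compile(r"[^a-z]+$")
df = pd.DataFrame({"About": ["JOHN23", "Remo55", "MERRY jen"]})
df["About"] = df["About"].str.lower().str.replace(regex_pat, "", regex=True)
print(df)  # -> john, remo, merry jen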

Python Pandas read_table with line continuation

Is it possible for pandas to read a text file that contains line continuation?
For example, say I have a text file, 'read_table.txt', that looks like this:
col1, col2
a, a string
b, a very long \
string
c, another string
If I invoke read_table on the file I get this:
>>> pandas.read_table('read_table.txt', delimiter=',')
col1 col2
0 a a string
1 b a very long \
2 string NaN
3 c another string
I'd like to get this:
col1 col2
0 a a string
1 b a very long string
2 c another string
Use escapechar:
df = pd.read_table('in.txt', delimiter=',', escapechar='\\')
That will include the newline, as DSM pointed out; you can remove the newlines with df.col2 = df.col2.str.replace(r"\n\s*", "", regex=True)
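Putting both steps together (a sketch; skipinitialspace=True is my addition here, to absorb the blank after each comma so the second column is named col2):

import pandas as pd

df = pd.read_table('read_table.txt', delimiter=',',
                   escapechar='\\', skipinitialspace=True)
# escapechar drops the backslash but keeps the newline inside the field,
# so collapse the embedded newline and the continuation indentation.
df['col2'] = df['col2'].str.replace(r'\n\s*', '', regex=True)
print(df)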
I couldn't get the escapechar option to work as Padraic suggested, probably because I'm stuck on a Windows box at the moment (tell-tale \r):
col1 col2
0 a a string
1 b a very long \r
2 string NaN
3 c another string
What I did get to work correctly was a regex pass:
import pandas as pd
import re
import StringIO  # python 2 on this machine, embarrassingly

with open('read_table.txt') as f_in:
    file_string = f_in.read()

subbed_str = re.sub(r'\\\n\s*', '', file_string)
df = pd.read_table(StringIO.StringIO(subbed_str), delimiter=',')
This yielded your desired output:
col1 col2
0 a a string
1 b a very long string
2 c another string
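For anyone on Python 3, the same regex pass might look like this (a sketch using io.StringIO in place of the Python 2 StringIO module):

import io
import re

import pandas as pd

with open('read_table.txt') as f_in:
    file_string = f_in.read()

# Remove each escaped line break along with the continuation indentation.
subbed_str = re.sub(r'\\\n\s*', '', file_string)
df = pd.read_table(io.StringIO(subbed_str), delimiter=',')
print(df)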
Very cool question. Thanks for sharing it!
