I have a CSV file like the picture below.
I'm trying to find any word that starts with the letter A or G (or any list of letters I choose), but my code returns an error. Any ideas what I'm doing wrong?
This is my code:
import sys
import pandas as pd

if len(sys.argv) == 1:
    print("please provide a CSV file to analyse")
else:
    fileinput = sys.argv[1]
    wdata = pd.read_csv(fileinput)
    print(list(filter(startswith("a", "g"), wdata)))  # <- this line raises NameError
To get relevant rows, extract the first letter, then use isin:
df
words frequency
0 what 10
1 and 8
2 how 8
3 good 5
4 yes 7
df[df['words'].str[0].isin(['a', 'g'])]
words frequency
1 and 8
3 good 5
If you want a specific column, use loc:
df.loc[df['words'].str[0].isin(['a', 'g']), 'words']
1 and
3 good
Name: words, dtype: object
df.loc[df['words'].str[0].isin(['a', 'g']), 'words'].tolist()
# ['and', 'good']
Use Series.str.startswith, converting the list to a tuple, and filter with DataFrame.loc and boolean indexing:
wdata = pd.DataFrame({'words':['what','and','how','good','yes']})
L = ['a','g']
s = wdata.loc[wdata['words'].str.startswith(tuple(L)), 'words']
print(s)
1 and
3 good
Name: words, dtype: object
It is very easy and handy; you can just use str.startswith in this way:
df[df.words.str.startswith('g')]
df[df.words.str.startswith('a')]
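Folding either fix back into the original script, a minimal corrected sketch (assuming the CSV has a 'words' column, as in the sample data above):

import sys
import pandas as pd

if len(sys.argv) == 1:
    print("please provide a CSV file to analyse")
else:
    wdata = pd.read_csv(sys.argv[1])
    # str.startswith accepts a tuple of prefixes, so any list of
    # starting letters can be passed via tuple(...)
    letters = ['a', 'g']
    print(wdata.loc[wdata['words'].str.startswith(tuple(letters)), 'words'].tolist())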
I have a column that contains a complicated string format. I would like to keep the first word only, and/or keep the first word in addition to certain other words.
I wish to keep certain key words in the string, such as 'RED', 'DB', 'APP', 'Infra', etc.
DATA
type grp
Goodbye-CCC-LET-TestData-A.1 a
Hello-PIR-SSS-Hellosims-App-INN-A.0 b
Hello-PIR-SSS-DB-RED-INN-C.0 c
Hello-PIR-SSS-App-SA200-F.0 d
Goodbye-PIR-SIR-DB_set-int-e.1 c
OK-PIR-SVV-Infra_ll-NA-A.0 e
DESIRED
type grp
Goodbye a
Hello-App b
Hello-DB-RED c
Hello-App d
Goodbye-DB c
OK-Infra e
DOING
s = (df['type'].str.split('-')
               .str[0]
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False),
                        sep=' ',
                        na_rep='')
               .str.strip())
df.insert(1, 'type', s)
The code above just gives me the first word, for example:
Goodbye
Hello
OK
Any suggestion is appreciated. I am still researching.
You can use str.extractall on your series, then join the values:
import pandas as pd
import re

(df.drop(columns='type')
   .join(df['type'].str.extractall(r'(^\w+)-|(app|red|infra|db)', flags=re.IGNORECASE)
                   .stack()
                   .groupby(level=0)
                   .agg(type='-'.join)))
grp type
0 a Goodbye
1 b Hello-App
2 c Hello-DB-RED
3 d Hello-App
4 c Goodbye-DB
5 e OK-Infra
I have two separate dataframes: one with the substrings that I would like to check, and a second one that contains the string and row data. This code will only run weekly, so I was not worried about optimization at the moment. I attempted to do it with nested for loops but couldn't seem to solve it. For example purposes I created the following below; note that the substring could be at the start, middle, or end of the string. Example:
map_df['Number_1'] = [1,2,3,4,5,...,n]
map_df['String'] = ['xxhello', 'randomyy', 'zztodayzz',...,n]
substring_df['Substring'] = ['hello', 'random', 'today', 'dog', 'cat',..., n]
##Desired result
substring_df
Substring  Number_1
hello             1
random            2
today             3
dog
cat
df = pd.DataFrame({'map_df_string': ['xxhello', 'randomyy', 'zztodayzz'], 'substring_df_substring': ['hello', 'random', 'today']})
OUTPUT:
map_df_string substring_df_substring
0 xxhello hello
1 randomyy random
2 zztodayzz today
Now you can perform the following operation:
a = df.apply(lambda row: row['substring_df_substring'] in row['map_df_string'], axis=1)
OUTPUT:
0 True
1 True
2 True
Now you can take the index of this boolean series where it is True and add one to it to get map_df['Number_1'].
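For the unaligned case in the question, where 'dog' and 'cat' have no match, a minimal sketch that scans map_df for each substring might look like this (match_number is a hypothetical helper, not from the answer above):

import pandas as pd

map_df = pd.DataFrame({'Number_1': [1, 2, 3],
                       'String': ['xxhello', 'randomyy', 'zztodayzz']})
substring_df = pd.DataFrame({'Substring': ['hello', 'random', 'today', 'dog', 'cat']})

def match_number(sub):
    # Number_1 of the first row whose String contains the substring, else None
    hits = map_df.loc[map_df['String'].str.contains(sub, regex=False), 'Number_1']
    return hits.iloc[0] if not hits.empty else None

substring_df['Number_1'] = substring_df['Substring'].apply(match_number)
print(substring_df)
#   Substring  Number_1
# 0     hello       1.0
# 1    random       2.0
# 2     today       3.0
# 3       dog       NaN
# 4       cat       NaN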
I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five numbers (only) with asterisks and keep the hyphens as is?
Similar to Mr. Me's answer, this instead drops the first six characters and replaces them with your masked format:
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method.
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last four characters:
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
You can use regex:
import re
import pandas as pd

df = pd.DataFrame({'ssn': ['111-22-3333', '121-22-1123', '345-87-3425']})

def func(x):
    return re.sub(r'\d{3}-\d{2}', '***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
I have this sample data in a cell:
EmployeeID
2016-CT-1028
2016-CT-1028
2017-CT-1063
2017-CT-1063
2015-CT-948
2015-CT-948
So, my problem is: how can I add a 0 inside this data, 2015-CT-948, to make it like this: 2015-CT-0948?
I tried this code:
pattern = re.compile(r'(\d\d+)-(\w\w)-(\d\d\d)')
newlist = list(filter(pattern.match, idList))
just to get the regex matches, then add the 0 with zfill(), but it's not working. Please, can someone give me an idea of how I can do it? Is there any way I can do it with regex or in pandas? Thank you!
This is one approach, using zfill. Example:
import pandas as pd

def custZfill(val):
    val = val.split("-")
    # alternative: split only on the last "-"
    # val = val.rsplit("-", 1)
    val[-1] = val[-1].zfill(4)
    return "-".join(val)

df = pd.DataFrame({"EmployeeID": ["2016-CT-1028", "2016-CT-1028",
                                  "2017-CT-1063", "2017-CT-1063",
                                  "2015-CT-948", "2015-CT-948"]})
print(df["EmployeeID"].apply(custZfill))
Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object
With pandas it can be solved with split instead of regex:
df['EmployeeID'].apply(lambda x: '-'.join(x.split('-')[:-1] + [x.split('-')[-1].zfill(4)]))
In pandas, you could use str.replace
df['EmployeeID'] = df.EmployeeID.str.replace(r'-(\d{3})$', r'-0\1', regex=True)
# Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object
If the format of the IDs is strictly defined, you can also use a simple list comprehension to do this job:
ids = [
'2017-CT-1063',
'2015-CT-948',
'2015-CT-948'
]
new_ids = [id if len(id) == 12 else id[0:8]+'0'+id[8:] for id in ids]
print(new_ids)
# ['2017-CT-1063', '2015-CT-0948', '2015-CT-0948']
Here's a one liner:
df['EmployeeID'].apply(lambda x: '-'.join(xi if i != 2 else '%04d' % int(xi) for i, xi in enumerate(x.split('-'))))
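For reference, the question's original match-then-zfill idea can also work if the padding happens inside a re.sub callback; a minimal sketch (pad_id is a hypothetical helper name):

import re

pattern = re.compile(r'(\d+-\w+-)(\d+)$')

def pad_id(m):
    # zero-pad the trailing numeric group to 4 digits
    return m.group(1) + m.group(2).zfill(4)

ids = ['2016-CT-1028', '2015-CT-948']
print([pattern.sub(pad_id, i) for i in ids])
# ['2016-CT-1028', '2015-CT-0948']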
Using Canopy and Pandas, I have a data frame a, which is defined by:
a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]
text.txt is a single-column file that contains a list of strings containing text, numbers, and punctuation.
Assuming df looks like:
test
%hgh&12
abc123!!!
porkyfries
I want my results to be:
test
hgh12
abc123
porkyfries
Effort so far:
from string import punctuation  # import the punctuation list from Python itself
a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe

for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
    df2
The command above basically just returns the same data set. I'd appreciate any leads.
Edit: The reason why I am using pandas is that the data is huge, spanning about 1M rows, and future usage of the code will be applied to lists that go up to 30M rows.
Long story short, I need to clean data in a very efficient manner for big data sets.
Using replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
text
0 test
1 %hgh&12
2 abc123!!!
3 porkyfries
[4 rows x 1 columns]
Use regex with the pattern [^\w\s], which means "not alphanumeric/whitespace":
In [49]:
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
text
0 test
1 hgh12
2 abc123
3 porkyfries
[4 rows x 1 columns]
For removing punctuation from a text column in your dataframe:
In:
import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]'
In:
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
text
0 book...regh
1 book...
2 boo,
3 book.
4 ball,
5 ballnroll"
6 "rope"
7 rick %
In:
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df
You can replace the pattern with your desired character instead, e.g. replace(pattern, '$', regex=True).
Out:
text
0 bookregh
1 book
2 boo
3 book
4 ball
5 ballnroll
6 rope
7 rick
Translate is often considered the cleanest and fastest way to remove punctuation (source)
import string

# build a deletion table that removes punctuation but keeps the double quote
text = text.translate(str.maketrans('', '', string.punctuation.replace('"', '')))
You may find that it works better to remove punctuation in 'a' before loading it into pandas.
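If the text is already loaded, the same idea can be applied column-wise; a minimal sketch using Series.str.translate on the example column from above (this variant deletes all punctuation, quotes included):

import string
import pandas as pd

# build the translation table once, then map it over the column
table = str.maketrans('', '', string.punctuation)
df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries']})
df['test'] = df['test'].str.translate(table)
print(df['test'].tolist())
# ['hgh12', 'abc123', 'porkyfries']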