regex as a separator to read tables in python (Pandas) - python

I would like to ask for some help reading a text file (Python 2.7, pandas library) that uses "|" as a separator, but the same character also appears inside the records, followed by a space. The first two rows don't have the problem, but the third one has the separator inside the 6th field: TAT Fans | Southern
1. 4_230_0415_99312||||9500|Gedung|||||||||15000|6.11403|102.23061
2. 4_230_0415_99313||||9500|Pakatan|||||||||50450|3.15908|101.71431
3. 4_230_0117_12377||||9990|TAT Fans | Southern||||||||||3.141033333|101.727125
I have been trying to use regex in the separator, but I haven't been able to make it work :
pd.read_table("text_file.txt", sep = "\S+\|\S+")
Can Anyone help me find a solution to my problem?
Many thanks in advance!

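You can use r"\s?[|]+\s?" as the separator. Note that [|]+ treats a run of consecutive pipes as a single separator, so empty fields are dropped, and that the pattern also splits on the " | " inside TAT Fans | Southern.
import pandas as pd
pd.read_table("text_file.txt", sep=r"\s?[|]+\s?", engine="python")  # or r"\s?\|+\s?"; regex separators need the python engine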
Out[18]:
4_230_0415_99312 9500 Gedung 15000 6.11403 102.23061
0 4_230_0415_99313 9500 Pakatan 50450 3.159080 101.714310
1 4_230_0117_12377 9990 TAT Fans Southern 3.141033 101.727125
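If the empty columns and the embedded " | " both need to survive, one option is to split only on pipes that are not followed by a space. The sketch below shows the idea with the standard re module on two of the sample rows. FIELD_SEP and split_record are illustrative names, and the approach assumes a real separator is never followed by a space.

```python
import re

# Split on a pipe only when it is NOT followed by a space, so the
# "| " inside a field such as "TAT Fans | Southern" is left alone.
# Assumption: a genuine separator is never followed by a space.
FIELD_SEP = re.compile(r"\|(?! )")


def split_record(line):
    """Return the fields of one pipe-delimited record, keeping empty ones."""
    return FIELD_SEP.split(line.rstrip("\n"))


rows = [
    "4_230_0415_99312||||9500|Gedung|||||||||15000|6.11403|102.23061",
    "4_230_0117_12377||||9990|TAT Fans | Southern||||||||||3.141033333|101.727125",
]
parsed = [split_record(row) for row in rows]

# The sixth field (index 5) keeps its embedded pipe.
print(parsed[1][5])
```

The same pattern can probably be passed to pd.read_table as sep=r"\|(?! )" together with engine="python" and header=None, but that combination is an assumption and has not been checked against the asker's file.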

Related

Extract date from a string with a lot of numbers

There seems to be quite a few ways to extract datetimes in various formats from a string. But there seems to be an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
    print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. From github of datefinder it seems that it was abandoned a year ago.
Although I don't know exactly how your dates are formatted, here's a regex solution that works with dates separated by '/'. It handles months and days written as a single digit or with a leading zero.
If your dates are separated by hyphens instead, replace each \/ in the regex with a hyphen.
Edit: Added a second print statement with a stricter regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall(r'\d{1,2}\/\d{1,2}\/\d{2,4}', mystring)) # This would probably work in most cases
print(re.findall(r'(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01])\/\d{2,4}', mystring)) # Stricter: the month must be 1-12 and the day 1-31. (More protection against weirdness.)
Edit #2: Here's a way to do it with the month name spelled out (in full, or as a 3-character abbreviation), followed by the day, a comma, and a 2- or 4-digit year.
import re
mystring = r'Jan 1, 2020'
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2},\s+\d{2,4}', mystring))
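For the string in the question, where the only date is written as "June 1, 2018", the month-name approach can be combined with datetime.strptime to get real datetime objects. This is a minimal sketch: MONTHS, DATE_RE and find_dates are illustrative names, it assumes English month names and a 4-digit year, and it relies on the default (English) locale for %b.

```python
import re
from datetime import datetime

# Full month names or their three-letter abbreviations.
MONTHS = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?"
)
# "<month> <day>, <4-digit year>", e.g. "June 1, 2018".
DATE_RE = re.compile(r"\b(" + MONTHS + r")\s+(\d{1,2}),\s+(\d{4})\b")


def find_dates(text):
    """Return a datetime for every '<month> <day>, <year>' found in text."""
    dates = []
    for month, day, year in DATE_RE.findall(text):
        # Only the first three letters are parsed, so "June" and "Jun"
        # are handled the same way (assumes an English locale for %b).
        dates.append(datetime.strptime(f"{month[:3]} {day} {year}", "%b %d %Y"))
    return dates


# Tail of the string from the question; the dollar amounts are ignored.
sample = "Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 "
print(find_dates(sample))
```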

Cleaning a dataset and removing special characters in python

I am fairly new to all of this so apologies in advance.
I've got a dataset (csv). One column contains strings with whole sentences. These sentences contain misinterpreted UTF-8 characters like ’ and emojis like 🥳.
So the dataframe (df) looks kind of like this:
date text
0 Jul 31 2020 it’s crazy. i hope post-covid we can get it done🥳
1 Jul 31 2020 just sayin’ ...
2 Jul 31 2020 nba to hold first games in 'bubble' amid pandemic
The goal is to do a sentiment analysis on the texts.
Would it be best to remove ALL special characters like , . ( ) [ ] + | - to do the sentiment analysis?
How do I do that, and how do I also remove the misinterpreted UTF-8 characters like ’?
I've tried it myself by adapting some code I found to my problem.
This resulted in the following piece of code, which seems to do absolutely nothing. Characters like ’ are still in the text.
spec_chars = ["…","🥳"]
for char in spec_chars:
    df['text'] = df['text'].str.replace(char, ' ')
I'm a bit lost here.
I appreciate any help!
You can change the character encoding like this. x is one of the sentences in the original post.
x = 'it’s crazy. i hope post-covid we can get it done🥳'
x.encode('windows-1252').decode('utf8')
The result is 'it’s crazy. i hope post-covid we can get it done🥳' (the ’ sequence is decoded back to the apostrophe ’).
As jsmart stated, use .encode and .decode. Since the column is a Series, you'd use .str to access its values as strings and apply the methods.
As for the text sentiment, look at NLTK, and take a look at its examples of sentiment analysis.
import pandas as pd
df = pd.DataFrame([['Jul 31 2020','it’s crazy. i hope post-covid we can get it done🥳'],
['Jul 31 2020','just sayin’ ...'],
['Jul 31 2020',"nba to hold first games in 'bubble' amid pandemic"]],
columns = ['date','text'])
df['text'] = df['text'].str.encode('windows-1252').str.decode('utf8')
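One caveat with the round trip above: encode('windows-1252') raises UnicodeEncodeError for any character outside that code page, such as a correctly decoded emoji. A more forgiving sketch repairs each run of non-ASCII characters separately and leaves alone anything that does not round-trip. The names NON_ASCII_RUN, _repair_run and fix_mojibake are illustrative, and the approach assumes the damage is always "UTF-8 bytes read as Windows-1252".

```python
import re

# Mojibake can only occur inside runs of non-ASCII characters.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def _repair_run(match):
    run = match.group(0)
    try:
        # Undo "UTF-8 bytes that were read as Windows-1252".
        return run.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake (for example a real emoji or a plain accented
        # letter): keep the run unchanged.
        return run


def fix_mojibake(text):
    """Repair Windows-1252/UTF-8 mojibake run by run, leaving the rest as-is."""
    return NON_ASCII_RUN.sub(_repair_run, text)


fixed = fix_mojibake("just sayin’ ...")
# ascii() makes the repaired apostrophe visible as \u2019.
print(ascii(fixed))
```

On a DataFrame this could be applied with df['text'].map(fix_mojibake). A run in which mojibake sits directly next to a genuine non-ASCII character would be left untouched by this sketch.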
Try this; it has been quite helpful for me. Note that it drops every word that contains a non-alphanumeric character.
df['clean_text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word.isalnum()]))
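Because the isalnum filter discards whole words such as "crazy." or "post-covid", an alternative is to strip the punctuation and keep the words. The comparison below is only a sketch; keep_alnum_words and strip_punctuation are illustrative names, and whether punctuation should be removed at all depends on the sentiment tool being used.

```python
import re


def keep_alnum_words(text):
    """The filter from the answer: drop any word with a non-alphanumeric character."""
    return " ".join(word for word in text.split() if word.isalnum())


def strip_punctuation(text):
    """Remove everything that is not a word character or whitespace, keep the words."""
    no_punct = re.sub(r"[^\w\s]", "", text)
    # Collapse any double spaces left behind by removed symbols.
    return " ".join(no_punct.split())


sample = "it's crazy. i hope post-covid we can get it done"
print(keep_alnum_words(sample))
print(strip_punctuation(sample))
```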

how to remove one sentence with two specific initial words

I have a dataframe that contains a news dataset. I want to remove one sentence that begins with two specific words, e.g. a sentence of the form "baca juga: ... laga." Any idea how to do it?
This is additional information if u need it.
You can try df.loc to find it and then change it to be blank:
df.loc[df['news'].astype(str).str.contains(r'(?:baca juga)', regex=True), 'news']
and if that works, you can set it to blank with = '' (note that this blanks the whole cell, not just the sentence)
Using regex, find the sentence and replace it with an empty string.
I don't see baca juga in your example, but assuming it's in one of the rows:
import re
df['news'] = df['news'].map(lambda x: re.sub(r'(baca juga[^.]+\.)', '', x))
Explanation
baca juga: the match starts with this text
[^.]+ matches one or more characters that are not a period
\. matches the period that ends the sentence, so that period is removed as well
Example
input_df
news
0 dskfl fsdg wer. baca juga: fgads awr yut. dfaw...
1 rwepu fsan apsj lis. fja jp ios jos lfslt
Output_df
0 dskfl fsdg wer. dfaw top fapw asf
1 rwepu fsan apsj lis. fja jp ios jos lfslt
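The same idea can be tried on plain strings before touching the dataframe. The sketch below uses the two example rows; BACA_JUGA and drop_baca_juga are illustrative names, and matching case-insensitively and swallowing the whitespace after the sentence are assumptions about what is wanted.

```python
import re

# "baca juga", then everything up to and including the next period,
# plus any whitespace that follows it.
BACA_JUGA = re.compile(r"baca juga[^.]*\.\s*", flags=re.IGNORECASE)


def drop_baca_juga(text):
    """Remove every sentence that starts with 'baca juga'."""
    return BACA_JUGA.sub("", text)


rows = [
    "dskfl fsdg wer. baca juga: fgads awr yut. dfaw top fapw asf",
    "rwepu fsan apsj lis. fja jp ios jos lfslt",
]
for row in rows:
    print(drop_baca_juga(row))
```

With pandas this would be applied as df['news'] = df['news'].map(drop_baca_juga).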

Apply function is not working on a data-frame column

I am trying to remove special characters such as "." and "-" (everything except commas) from the "Actors" column of my pandas data-frame. For this I use the apply method on the "Actors" column:
df['Actors']= df['Actors'].apply(lambda x : x.lower().replace("[^a-zA-Z,]","",))
df['Actors'].head()
The output of the above snippet is shown below and we can see no special characters have been replaced:
1 tim robbins, morgan freeman, bob gunton, willi...
2 marlon brando, al pacino, james caan, richard ...
3 al pacino, robert duvall, diane keaton, robert...
4 christian bale, heath ledger, aaron eckhart, m...
5 martin balsam, john fiedler, lee j. cobb, e.g....
Name: Actors, dtype: object
But when I try resolving the above issue using the snippet below, the code works:
df['Actors'] = df['Actors'].str.lower().str.replace("[^a-zA-Z,]","")
df['Actors'].head()
1 timrobbins,morganfreeman,bobgunton,williamsadler
2 marlonbrando,alpacino,jamescaan,richardscastel...
3 alpacino,robertduvall,dianekeaton,robertdeniro
4 christianbale,heathledger,aaroneckhart,michael...
5 martinbalsam,johnfiedler,leejcobb,egmarshall
Name: Actors, dtype: object
I want to know why the apply version doesn't work properly when replacing characters.
You call apply on a Series, so x in the lambda is the single string from each row of the Series. That means x.lower().replace is Python's built-in str.replace, which doesn't support regex: it treats "[^a-zA-Z,]" as a literal substring and looks for it in each x. It doesn't find it, so nothing gets replaced.
On the other hand, pandas str.replace defaulted to regex=True when this was written, so it treats "[^a-zA-Z,]" as a regex pattern and replaces everything properly. Since pandas 2.0 the default is regex=False, so pass regex=True explicitly.
It does not work because you call replace on a plain string; effectively you do str.replace("[^a-zA-Z,]","",). Your string does not contain the literal text [^a-zA-Z,], so nothing is removed. In other words, Python does not interpret those characters as a regex, but simply as literal characters.
To make it work with apply you can do the following. This is just to answer your question; the preferred way is your second example.
import re
remove = re.compile(r"[^a-zA-Z,]")
df['Actors']= df['Actors'].apply(lambda x : re.sub(remove, "", x.lower()))
Here is some documentation:
python str replace
pandas str replace
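The difference between the two calls can be seen on a single string, without pandas. In this small sketch the sample name is taken from row 5 of the question's output, and NOT_LETTER_OR_COMMA is an illustrative name.

```python
import re

NOT_LETTER_OR_COMMA = re.compile(r"[^a-zA-Z,]")

name = "Lee J. Cobb, E.G. Marshall"

# Built-in str.replace looks for the literal text "[^a-zA-Z,]",
# which is not in the string, so only .lower() has any effect.
literal = name.lower().replace("[^a-zA-Z,]", "")

# re.sub treats the same text as a pattern: every character that is
# not a letter or a comma is removed.
cleaned = NOT_LETTER_OR_COMMA.sub("", name.lower())

print(literal)
print(cleaned)
```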

Python Zip Code

I am very new to Python and struggling to execute what I need.
I need to extract Zip codes out of the string "concat".
I was researching regex, but I am struggling on the functionality.
import pandas as pd
import re
from pandas import ExcelWriter
I imported the CSV with an explicit encoding to work around string decoding issues, selected the columns for the data frame, and pulled concat out into its own variable:
Client = pd.read_csv("CLZIPrevamp3.csv",encoding = "ISO-8859-1")
Client = Client[["clnum","concat"]]
clientzip = Client['concat']
CSV Examples
client number client add
40008 All, EdNULLNULLNULLNULLNULL
40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL
Example purposes
The data also contains international zip codes, with both 4-digit and 5-digit zip codes, and not every field has a zip code.
I would then like to write the results back into my Client dataframe as a third column.
Is the ZIP always a US zip code? 5 digits at the end of a field?
Then slice it off.
>>> 'smithjonllcRichmondVa23220'[-5:]
'23220'
If the ZIP may have only 4 digits, then you might want a regex:
>>> import re
>>> re.findall(r'\d{4,5}$', 'smithjonllcRichmondVa3220')[0]
'3220'
For "long zip codes" like 21801-7824 it gets more complex. And when you are handed a CSV file whose columns themselves contain commas (see example)
AWBWA, Inc. 2200 Northhighschool,VA
you need to just ask for a different data format, because good luck parsing that.
As far as pandas is concerned, you can apply() a function over a column.
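A function meant for apply() could look like the sketch below. ZIP_RE and extract_zip are illustrative names, and the pattern assumes US-style codes only (5 digits, optionally followed by a dash and 4 more), so 4-digit international codes would need a different rule. Taking the last match is a guess meant to skip 5-digit street numbers earlier in the address.

```python
import re

# Five digits, optionally "-dddd", not glued to further digits on either side.
ZIP_RE = re.compile(r"(?<!\d)\d{5}(?:-\d{4})?(?!\d)")


def extract_zip(address):
    """Return the last US-style ZIP code in address, or None if there is none."""
    matches = ZIP_RE.findall(address)
    return matches[-1] if matches else None


addresses = [
    "All, EdNULLNULLNULLNULLNULL",
    "EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith",
    "AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL",
]
zips = [extract_zip(address) for address in addresses]
print(zips)
```

On the asker's dataframe this would be written back as a third column with something like Client['zip'] = Client['concat'].apply(extract_zip), where the column name 'zip' is made up for illustration.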
I'll provide 2 examples.
To be honest, if your CSV is consistently formatted in the way you mentioned in your example you can find the zipcodes using a simple albeit finite regex like this (It captures all non-space characters before the string "Attn" which seems to be a theme in your read string):
>>> def zipcodes():
...     import re
...     csv = '''client number client add
... 40008 All, EdNULLNULLNULLNULLNULL
... 40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
... 40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL'''
...     zips = re.findall(r'([\S]+)Attn', csv)
...     print(zips)
...
OUTPUT:
>>> zipcodes()
['63141', '21801-7824']
Now if you want something slightly better, which ignores numbers that start a new line, you can use a negative lookbehind like so (NOTE: Python's lookaround documentation is not the best... sheesh). What the pattern below says is 'capture a run of 5 to 6 digits, optionally followed by a dash and any number of further digits (0 or more), but only if the run is not immediately preceded by a newline character'.
>>> def zipcodes():
...     import re
...     csv = '''client number client add
... 40008 All, EdNULLNULLNULLNULLNULL
... 40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
... 40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL'''
...     zips = re.findall(r'(?<!\n)[\d]{5,6}[\-]?[\d]*', csv)
...     print(zips)
...
OUTPUT:
>>> zipcodes()
['63141', '21801-7824']
Hope this helps.
