Extracting substrings from CSV table - python

I'm trying to clean up the data from a csv table that looks like this:
KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000
I want to extract just the "#" handles, such as:
#katyperry
#justinbieber
#BarackObama
This is the code I've put together, but all it does is repeat the second line of the table over and over:
import csv
import re
with open('C:\\Users\\TK\\Steemit\\Scripts\\twitter.csv', 'rt', encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            if i.isalpha():
                stringafterword = re.split('\\#\\',row)[-1]
                print(stringafterword)

If you are willing to use re, you can get a list of strings in one line:
import re
#content string added to make it a working example
content = """KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000"""
#solution using 're':
m = re.findall('#.*', content)
print(m)
#option without 're' but using string.find() based on your loop:
for row in content.split():
    pos_of_at = row.find('#')
    if pos_of_at > -1:  # -1 indicates "substring not found"
        print(row[pos_of_at:])
You should, of course, replace the content string with the actual file content.
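To run this against your actual file instead of the hard-coded content string, something along these lines should work (a sketch; the path is the one from your question):

import re

# Read the whole file once and pull out every '#' handle.
with open('C:\\Users\\TK\\Steemit\\Scripts\\twitter.csv', 'rt', encoding='UTF-8') as inp:
    content = inp.read()

handles = re.findall('#.*', content)
print(handles)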

Firstly the "#" symbol is a symbol. Therefore the if i.isalpha(): will return False as it is NOT a alpha character. Your re.split() won't even be called.
Try this:
import csv
import re
with open('C:\\Users\\input.csv', 'rt', encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            stringafterword = re.findall('#.*',i)
            print(stringafterword)
Here I have removed the if-condition and used re.findall('#.*', i) instead, which returns the "#" handle directly.
Hope it works.

Related

Replace only exact string

I am trying to replace exact strings only in a csv file, using another csv as a dictionary.
This is my code
import re
text = open("input.csv", "r", encoding="ISO-8859-1")
replacelist = open("replace.csv","r", encoding="ISO-8859-1").readlines()
for r in replacelist:
r = r.split(",")
text = ''.join([i for i in text]) \
.replace(r[0],r[1])
print ({r[0]})
print ({r[1]})
x = open("new.csv","w")
x.writelines(text)
x.close()
Is it possible to use replace method to only replace exact match strings? Should I import and use re.sub() instead of replace?
input.csv example
ciao123;xxxxx;0
ciao12345;xxzzx;2
replace.csv example
ciao123,ok
aaaa,no
bbb,cc
Only first line in input.csv should be replaced.
Well, as per your comments, your task is much simpler and you don't need to play with regex at all!
Basically, you are trying to replace something in a csv column only when it is an exact match. If that is the case, you should not be treating the file as raw text; treat it as column data.
If you do so, you could use an example like the one below:
text = open("input.csv", "r", encoding="ISO-8859-1").readlines()
replacelist = open("replace.csv","r", encoding="ISO-8859-1").readlines()
# make a replace word dictionary with O(n) time complexity
replace_data = {i.split(',')[0]: i.split(',')[1] for i in replacelist}
# Now treat data in input.csv as tabular data to replace the words
# Start another loop of O(n) time complexity
for idx, line in enumerate(text):
line_lis = line.split(';')
if line_lis[0] in replace_data:
only replace word if it is meant to be replaced
line_lis[0] = replace_data.get(line_lis[0])
text[idx] = ';'.join(line_lis)
# write results
with open("new.csv","w") as f:
f.writelines(text)
Result would be as:
ok;xxxxx;0
ciao12345;xxzzx;2
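If you'd rather lean on the standard library for the parsing as well, roughly the same idea can be written with the csv module (a sketch using the same file names and the ';' delimiter from the question):

import csv

# Build the replacement dictionary from replace.csv (comma separated).
with open("replace.csv", encoding="ISO-8859-1", newline="") as f:
    replace_data = {row[0]: row[1] for row in csv.reader(f) if row}

# Rewrite input.csv (semicolon separated), swapping the first column on exact matches.
with open("input.csv", encoding="ISO-8859-1", newline="") as src, \
     open("new.csv", "w", encoding="ISO-8859-1", newline="") as dst:
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=";")
    for row in reader:
        if row:
            row[0] = replace_data.get(row[0], row[0])
        writer.writerow(row)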

Removing punctuation and change to lowercase in python CSV file

The code below allows me to open the CSV file and change all the text to lowercase. However, I'm having difficulty also removing the punctuation in the CSV file. How can I do that? Do I use string.punctuation?
file = open('names.csv','r')
lines = [line.lower() for line in file]
with open('names.csv','w') as out:
    out.writelines(sorted(lines))
print(lines)
A sample of a few lines from the file:
Justine_123
ANDY*#3
ADRIAN
hEnNy!
You can achieve this by importing string and making use of the example code below.
The other way you can achieve this is by using regex.
import string
str(lines).translate(None, string.punctuation)  # Python 2 form of translate
Also, you may want to learn more about how the string module works and its features.
The working example you requested:
import string
with open("sample.csv") as csvfile:
lines = [line.lower() for line in csvfile]
print(lines)
will give you ['justine_123\n', 'andy*#3\n', 'adrian\n', 'henny!']
punc_table = str.maketrans({key: None for key in string.punctuation})
new_res = str(lines).translate(punc_table)
print(new_res)
The result in new_res will be justine123n andy3n adriann henny.
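If you would rather keep the result as a list of cleaned lines instead of stringifying the whole list, the same translate idea can be applied per line (a sketch using the same sample.csv):

import string

# Build one translation table that deletes all punctuation.
punc_table = str.maketrans('', '', string.punctuation)
with open("sample.csv") as csvfile:
    cleaned = [line.strip().lower().translate(punc_table) for line in csvfile]
print(cleaned)  # ['justine123', 'andy3', 'adrian', 'henny']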
Example with regular expressions.
import csv
import re
filename = ('names.csv')
def reg_test(name):
    reg_result = ''
    with open(name, 'r') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            row = re.sub('[^A-Za-z0-9]+', '', str(row))
            reg_result += row + ','
    return reg_result
print(reg_test(filename).lower())
justine123,andy3,adrian,henny,

Extract numeric data from matched regular expression

I have some temperature data in a csv file and I want to extract only the temperature for, say, the first month of the year, so after processing I want a list of [1.4, -5.8] for the example below.
1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G
1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G
1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G
I thought of doing this with the Python module re, but I always have issues getting to grips with regular expressions! For instance, my quick test below returns all lines when I only expect it to return the entries from the first month of the year...
import numpy as np
import re
regex = '\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01;[-+]?\d*\.\d+|\d+;G'
with open('test.csv', 'rb') as fid:
    for line in fid:
        match = re.findall(regex,line)
        if match:
            print line
            print match
Use the csv module, specifying ; as the delimiter. The third column in the data is YYYY-MM, so check whether it's the first month and print the temperature if it is:
import csv
with open('data') as f:
    for row in csv.reader(f, delimiter=';'):
        year, month = row[2].split('-')
        if int(month) == 1:
            print(row[3])
Output
1.4
-5.8
For comparison, here is the simplest regex that I could come up with to extract the required value:
import re
with open('data') as f:
    temperature = re.findall(r'\d{4}-01;(.+?);', f.read())
    print('\n'.join(temperature))
You can see how it takes more effort to read & understand the regex than it does the Python code.
There is an even easier way that relies on your data consisting of fixed width fields:
with open('data') as f:
    for line in f:
        if line[45:47] == '01':
            print(line[48:-3])
I suggest the following regex:
^(?:\d{4}-01-.*?)(-?\d+\.\d+)
Demo and explanation of behavior: regex101
The number is in the first capturing group.
Alternatively, with a positive lookahead:
^(?=\d{4}-01).*?(-?\d+\.\d+)
Demo and explanation of behavior: regex101
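Applied in Python, the first pattern could be used roughly like this (a sketch; 'test.csv' is the file name from the question, and re.MULTILINE is needed so ^ anchors at each line):

import re

pattern = re.compile(r'^(?:\d{4}-01-.*?)(-?\d+\.\d+)', re.MULTILINE)
with open('test.csv') as fid:
    temperatures = [float(m) for m in pattern.findall(fid.read())]
print(temperatures)  # [1.4, -5.8] for the sample data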
You have to put brackets around what you want to extract. So you should change the last part to ;([-+]?\d*\.\d+|\d+);G.
Try this code and tell me if it works:
import re
regex1 = re.compile('\d{4}-01-\d{2}')
regex2 = re.compile('([-+]?\d*\.\d+|\d+);G')
with open('test.csv', 'rb') as fid:
    for line in fid:
        match1 = re.findall(regex1,line)
        if match1:
            match2 = re.findall(regex2, line)
            print line
            print match2
Hope this helps.

Python Regular Expression loop

I have this code which will look for certain things in a file. The file looks like this:
name;lastname;job;5465465
name2;lastname2;job2;5465465
name3;lastname3;job3;5465465
This is the python code:
import re
import sys
filehandle = open('somefile.csv', 'r')
text = filehandle.read()
b = re.search("([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?",text)
print (b.group(2),b.group(1),b.group(3),b.group(4))
Now it will only print:
lastname;name;job;5465465
It's supposed to print the last name first, so I did that with groups. Now I need a loop to print all lines like this:
lastname;name;job;5465465
lastname2;name2;job2;5465465
lastname3;name3;job3;5465465
I tried all kinds of loops but it doesn't go through the whole file... how do I do this?
It must be done with the re module. I know it's easy with the csv module ;)
You need to process the file line by line.
import re
import sys
with open('somefile.csv', 'r') as filehandle:
    for text in filehandle:
        b = re.search("([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?",text)
        print (b.group(2),b.group(1),b.group(3),b.group(4))
Your file has nicely semi-colon separated values, so it would be easier to just use split or the csv library as has been suggested.
No need for re, but a good job for csv:
import csv
with open('somefile.csv', 'r') as f:
    for rec in csv.reader(f, delimiter=';'):
        print (rec[1], rec[0], rec[2], rec[3])
You can use re if you want to check the validity of individual elements (valid phone number, no numbers in name, capitalized names, etc.).
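For example, a quick validity check on the last column after splitting could look like this (a sketch; re.fullmatch requires Python 3.4+):

import csv
import re

with open('somefile.csv', 'r') as f:
    for rec in csv.reader(f, delimiter=';'):
        # Keep the row only if the last field is purely digits.
        if re.fullmatch(r'\d+', rec[3]):
            print(rec[1], rec[0], rec[2], rec[3])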
The fault is not with the loops, but rather with your regex / capture group patterns. The class [a-zA-Z]+ will not match "lastname3" or "lastname2". This sample works:
import re
import sys
for line in open('somefile.csv', 'r'):
b = re.search("(\w+);(\w+);(\w*);([0-9^-]+)\n?",line)
if b:
print "%s;%s;%s;%s" % (b.group(2),b.group(1),b.group(3),b.group(4))
Seems as if you just want to reorder what you have, in which case I don't know whether regex are needed. I believe the following might be of use:
reorder = operator.itemgetter(1, 0, 2, 3)
http://docs.python.org/library/operator.html
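Combined with the csv module, that might be used roughly like this (a sketch; the file name and ';' delimiter are taken from the question):

import csv
import operator

# Put the last name first, then the rest of the record in its original order.
reorder = operator.itemgetter(1, 0, 2, 3)
with open('somefile.csv', 'r') as f:
    for rec in csv.reader(f, delimiter=';'):
        print(';'.join(reorder(rec)))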

Need to remove line breaks from a text file with certain conditions

I have a text file running into 20,000 lines. A block of meaningful data for me would consist of name, address, city, state,zip, phone. My file has each of these on a new line, so a file would go like:
StoreName1
, Address
, City
,State
,Zip
, Phone
StoreName2
, Address
, City
,State
,Zip
, Phone
I need to create a CSV file and will need the above information for each store on a single line:
StoreName1, Address, City,State,Zip, Phone
StoreName2, Address, City,State,Zip, Phone
So essentially, I am trying to remove \r\n only at the appropriate points. How do I do this with Python's re? Examples would be very helpful; I'm a newbie at this.
Thanks.
s/[\r\n]+,/,/g
Globally substitute 'linebreak(s),' with ','
Edit:
If you want to reduce it further with a single linebreak between records:
s/[\r\n]+(,|[\r\n])/$1/g
Globally substitute 'linebreaks(s) (comma or linebreak) with capture group 1.
Edit:
And, if it really gets out of whack, this might cure it:
s/[\r\n]+\s*(,|[\r\n])\s*/$1/g
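In Python's re module, the first of these substitutions would translate to roughly the following (a sketch; 'data.txt' stands in for your actual file):

import re

with open('data.txt') as f:
    text = f.read()

# Replace 'line break(s) followed by a comma' with just the comma.
cleaned = re.sub(r'[\r\n]+,', ',', text)
print(cleaned)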
This iterator/generator version doesn't require reading the entire file into memory at once
from itertools import groupby
with open("inputfile.txt") as f:
    groups = groupby(f, key=str.isspace)
    for row in ("".join(map(str.strip,x[1])) for x in groups if not x[0]):
        ...
Assuming the data is "normal" - see my comment - I'd approach the problem this way:
with open('data.txt') as fhi, open('newdata.txt', 'w') as fho:
    # Iterate over the input file.
    for store in fhi:
        # Read in the rest of the pertinent data
        fields = [next(fhi).rstrip() for _ in range(5)]
        # Generate a list of all fields for this store.
        row = [store.rstrip()] + fields
        # Output to the new data file.
        fho.write('%s\n' % ''.join(row))
        # Consume a blank line in the input file.
        next(fhi)
First mind-numbing solution
import re
ch = ('StoreName1\r\n'
      ', Address\r\n'
      ', City\r\n'
      ',State\r\n'
      ',Zip\r\n'
      ', Phone\r\n'
      '\r\n'
      'StoreName2\r\n'
      ', Address\r\n'
      ', City\r\n'
      ',State\r\n'
      ',Zip\r\n'
      ', Phone')

regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '(.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,[^\r\n]+)')

with open('csvoutput.txt','wb') as f:
    f.writelines(''.join(mat.groups())+'\r\n' for mat in regx.finditer(ch))
ch mimics the content of a file on a Windows platform (newlines == \r\n)
Second mind-numbing solution
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,[^\r\n]+')

with open('csvoutput.txt','wb') as f:
    f.writelines(mat.group().replace('\r\n','')+'\r\n' for mat in regx.finditer(ch))
Third mind-numbing solution, if you want to create a CSV file with delimiters other than commas:
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,([^\r\n]+)')

import csv
with open('csvtry3.txt','wb') as f:
    csvw = csv.writer(f,delimiter='#')
    for mat in regx.finditer(ch):
        csvw.writerow(mat.groups())
EDIT 1
You are right, tchrist, the following solution is far simpler:
regx = re.compile('(?<!\r\n)\r\n')
with open('csvtry.txt','wb') as f:
    f.write(regx.sub('',ch))
EDIT 2
A regex isn't required:
with open('csvtry.txt','wb') as f:
    f.writelines(x.replace('\r\n','')+'\r\n' for x in ch.split('\r\n\r\n'))
EDIT 3
Treating a file, no more ch:
An 'à la gnibbler' solution, for cases when the file can't be read into memory all at once because it is too big:
from itertools import groupby
with open('csvinput.txt','r') as f,open('csvoutput.txt','w') as g:
    groups = groupby(f,key= lambda v: not str.isspace(v))
    g.writelines(''.join(x).replace('\n','')+'\n' for k,x in groups if k)
I have another solution with regex:
import re
regx = re.compile('^((?:.+?\n)+?)(?=\n|\Z)',re.MULTILINE)
with open('input.txt','r') as f,open('csvoutput.txt','w') as g:
    g.writelines(mat.group().replace('\n','')+'\n' for mat in regx.finditer(f.read()))
I find it similar to the gnibbler-like solution
f = open(infilepath, 'r')
s = ''.join([line for line in f])
s = s.replace('\n\n', '\\n')
s = s.replace('\n', '')
s = s.replace("\\n", "\n")
f.close()
f = open(infilepath, 'w')
f.write(s)
f.close()
That should do it. It will replace your input file with the new format.
