I'm reading text blobs from a db and I have to write them to files.
The strings look like this:
Lorem ipsum dolor sit amet,\r\nconsectetur adipiscing elit,\r\nsed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\r\n\r\nUt enim ad minim veniam,\r\nquis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
and I need to write them like this:
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Is there a simpler way to achieve this than reading character by character and calling file.write('\n') every time I run into a \r\n?
My code:
import sqlite3
import codecs
db = sqlite3.connect('blog.sqlite')
cursor = db.cursor()
cursor.execute('''select filename, filecontent from mytable''')
all_rows = cursor.fetchall()
for row in all_rows:
    with codecs.open(row[0], 'w', encoding='utf-8') as f:
        f.write(row[1])
db.close()
filename is just a plain string; filecontent has accented letters and I want to avoid �s.
Example data:
Mi sono trovato nella condizione di dover fare una join su più campi. \r\nChe stavano in una colonna sola. \r\nSeparati da punti e virgola.\r\n\r\nPrima di tentare il suicidio, ho googolato un pò e ho scoperto questo:\r\n\r\n :::sql\r\n SELECT * \r\n FROM tabella1 LEFT JOIN tabella2 ON tabella2.colonnaA IN ( \r\n SELECT REGEXP_SUBSTR(tabella2.colonnaB,'[\\^;]+', 1, level) FROM dual \r\n CONNECT BY REGEXP_SUBSTR(tabella2.colonnaB, '[\\^;]+', 1, level) IS NOT NULL \r\n );\r\n\r\nForse è una cosa risaputa, ma non la conoscevo... \r\nSicuramente la riutilizzerò tantissimo.\r\n
I figured it out: the stored text contains the literal characters \r\n rather than real line breaks, so I had to search for them with a raw string: str.replace(r'\r\n', '\n')
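A minimal sketch of that fix, assuming the database really holds the four literal characters backslash, r, backslash, n. The helper names unescape_newlines and write_text and the sample string are made up for illustration:

```python
import io

def unescape_newlines(text):
    # r'\r\n' is four literal characters: backslash, r, backslash, n
    return text.replace(r'\r\n', '\n')

def write_text(path, text):
    # An explicit encoding keeps accented letters intact on every platform
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(unescape_newlines(text))

# Hypothetical value as it might come back from the database
stored = r'Più campi,\r\nseparati da punti e virgola.\r\n'
print(unescape_newlines(stored))
```

In the loop from the question, write_text(row[0], row[1]) would take the place of the codecs.open block.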
I would like to extract a specific portion from a text.
For example, I have this text:
"*Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur.
Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum*",
I would like to extract the content from "Duis aute" up to the end of that line ("nulla pariatur"), where the new line starts.
How could I do this in Python? Thanks in advance to everyone.
Sorry for poor English.
You can use this.
with open('filename.txt') as f:  # open the file and read its contents
    data = f.read()
s_index = data.index('Duis aute')  # index where the wanted text starts
# index of the first dot after the start; s_index is passed so the search begins there
e_index = data.index('.', s_index)
text = data[s_index:e_index]
print(text)
Output
Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur
If you want the text to end at \n instead, use this:
with open('filename.txt') as f:
    data = f.readlines()
data = ''.join(data)
# try/except is used because index() raises ValueError if the substring is not in the string
try:
    s_index = data.index('Duis aute')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
Testing
with open('filename.txt') as f:
    data = f.readlines()
data = ''.join(data)
# try/except is used because index() raises ValueError if the substring is not in the string
try:
    s_index = data.index('ipsum dolor')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
output
ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
with open('filename.txt') as f:
    data = f.readlines()
data = ''.join(data)
# try/except is used because index() raises ValueError if the substring is not in the string
try:
    s_index = data.index('Ut enim ad minim')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
output
Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur.
And if you need only one word after the given word, use this:
with open('filename.txt') as f:
    data = f.readlines()
data = ''.join(data)
# try/except is used because index() raises ValueError if the substring is not in the string
try:
    s_index = data.index('Lorem')
    # look for the first space after the word that follows 'Lorem'
    e_index = data.index(' ', s_index + len('Lorem') + 1)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
output
Lorem ipsum
If you are trying to extract a particular "sentence" - then one way could be to split on the sentence separator (\n for example)
sentences = s.split('\n')
If you have multiple delimiters for a sentence - you can use the re module -
import re
sentences = re.split(r'\.|\n', s)
You can then extract the matches from sentences -
required = '\n'.join(_ for _ in sentences if _.strip().startswith('Duis aute'))
Of course, you can combine all of this into a single liner -
'\n'.join(_ for _ in s.split('.') if _.strip().startswith('Duis aute'))
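As another option, the "from a marker to the end of its line" extraction can be sketched with a single regular expression. The function name extract_from and the shortened sample text are made up for illustration; re.escape guards against markers that contain regex metacharacters:

```python
import re

def extract_from(text, start):
    """Return the text from `start` up to (not including) the next newline, or None."""
    match = re.search(re.escape(start) + r'[^\n]*', text)
    return match.group(0) if match else None

# Shortened sample, assumed to have one sentence per line
sample = (
    "Lorem ipsum dolor sit amet.\n"
    "Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.\n"
    "Excepteur sint obcaecat cupiditat non proident."
)
print(extract_from(sample, 'Duis aute'))
```

Returning None for a missing marker avoids the try/except that index() needs.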
The function I created to find a list of regex matches doesn't work: instead of printing a list of all the matches, it prints one match at a time. I tried multiple times and I don't understand what the error could be.
For instance, this is the text I want to find the regex in: '] prima ciao hello'
This is the function:
def find_regex(regex, text):
    l = []
    matches_prima = re.findall(regex, text)
    lunghezza_prima = len(matches_prima)
    for x in matches_prima:
        l.extend(matches_prima)
    print(l)
It is called from another function like this:
def main():
    testo = '] prima ciao hello', 'ola'
    find_prima = re.compile(r"\]\s*prima(?!\S)")
    print(find_regex(find_prima,testo))
if __name__ == "__main__":
    main()
So given a regex, I call it like print(find_regex(find_prima,testo)). But the output is:
['] prima']
[]
So I get them printed one at a time.
And I would need the full list instead to count all the matches. What am I doing wrong?
Try this:
import re
txt = """mypattern, Lorem ipsum dolor sit amet, aliquip sunt ad irure ad
labore nulla do et est eiusmod ut fugiat. Minim enim incididunt ullamco
deserunt Lorem cillum in est ullamco dolor qui sint labore. Reprehenderit
laborum anim magna pariatur proident cillum et eiusmod eu laboris cillum.
Quis et nostrud laboris non. Est incididunt dolore sint dolore. Sunt eu
mypattern, ipsum ullamco dolore ad ut veniam est. dolore mollit ut sunt nulla
"""
print([line for line in txt.splitlines() if re.match(r"mypattern, ", line) is not None])
Output:
['mypattern, Lorem ipsum dolor sit amet, aliquip sunt ad irure ad', 'mypattern, ipsum ullamco dolore ad ut veniam est. dolore mollit ut sunt nulla']
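For the find_regex question above, one possible reading is that the function should collect the matches from every text into one list and return it instead of printing it. A minimal sketch under that assumption (the parameter name texts is new; the pattern and sample strings come from the question):

```python
import re

def find_regex(regex, texts):
    """Collect the matches of a compiled pattern across several strings."""
    matches = []
    for text in texts:
        # findall returns a list per string; extend merges it into one flat list
        matches.extend(regex.findall(text))
    return matches

find_prima = re.compile(r"\]\s*prima(?!\S)")
testo = ('] prima ciao hello', 'ola')
all_matches = find_regex(find_prima, testo)
print(all_matches)        # ['] prima']
print(len(all_matches))   # 1
```

Because the function returns the list, the caller can count the matches with len() instead of getting None back from print.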
Is it possible to read data from a csv file into a dictionary, such that the first column is the key and the second column is the value?
E.g. I have a csv file
code msg
123456 Lorem ipsum dolor sit amet, consectetur adipiscing elit
345981 sed do eiusmo ut labore, et dolore magna aliqua;
459827 ullamco, laboris nisi ut aliquip ex ea commodo consequat.
490023 veniam, quis nostrud exercitation
345612 mollit anim id est laborum.
code represents the keys, and msg represents the values associated with each code.
import csv
with open('test.csv') as f:
    reader = csv.reader(f)
    mydict = {rows[0]:rows[1:] for rows in reader}
print(mydict)
x = mydict.get("123456")
print(x)
Result:
{'code;msg': [], '123456;Lorem ipsum dolor sit amet': [' consectetur adipiscing elit'], '345981;"sed do eiusmo ut labore': [' et dolore magna aliqua;"'], '459827;ullamco': [' laboris nisi ut aliquip ex ea commodo consequat.'], '490023;veniam': [' quis nostrud exercitation'], '345612;mollit anim id est laborum.': []}
None
I would like to look up the value associated with each key.
E.g., when I write:
key= "123456"
value=mydict.get(key)
print(key + "has this value : " + value)
I would get as an output:
>>> The key 123456 has this value :Lorem ipsum dolor sit amet, consectetur adipiscing elit
Without any imports, you can use:
with open('test.csv') as f:
    csv = f.readlines()
d = {}
for line in csv[1:]:  # loop over the lines, skipping the first one (headers)
    m = line.split()
    if len(m) > 1:
        d[m[0]] = " ".join(m[1:])
print(d)
Output:
{'123456': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit', '345981': 'sed do eiusmo ut labore, et dolore magna aliqua;', '459827': 'ullamco, laboris nisi ut aliquip ex ea commodo consequat.', '490023': 'veniam, quis nostrud exercitation', '345612': 'mollit anim id est laborum.'}
Notes:
To search by key, I normally use:
if '123456' in d:
    print(d['123456'])
    # Lorem ipsum dolor sit amet, consectetur adipiscing elit
Printing dictionary keys and values
print(d.keys(), d.values())
# dict_keys(['123456', '345981', '459827', '490023', '345612'])
# dict_values(['Lorem ipsum dolor sit amet, consectetur adipiscing elit', 'sed do eiusmo ut labore, et dolore magna aliqua;', 'ullamco, laboris nisi ut aliquip ex ea commodo consequat.', 'veniam, quis nostrud exercitation', 'mollit anim id est laborum.'])
Using csv module.
Ex:
import csv
result = {}
with open('test.csv') as infile:
    reader = csv.reader(infile, delimiter=';')
    next(reader)  # skip header
    for row in reader:  # iterate over each line
        result[row[0]] = row[1]  # build the dictionary
print(result)
Output:
{'123456': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit',
'345612': 'mollit anim id est laborum.',
'345981': 'sed do eiusmo ut labore, et dolore magna aliqua',
'459827': 'ullamco, laboris nisi ut aliquip ex ea commodo consequat.',
'490023': 'veniam, quis nostrud exercitation'}
The problem is with the input data: the file contains multiple commas and you are reading it with csv.reader, which treats every comma as a delimiter. The second column should be enclosed in double quotes.
code,msg
"123456","Lorem ipsum dolor sit amet, consectetur adipiscing elit"
"345981","sed do eiusmo ut labore, et dolore magna aliqua;"
"459827","ullamco, laboris nisi ut aliquip ex ea commodo consequat."
"490023","veniam, quis nostrud exercitation"
"345612","mollit anim id est laborum."
After modifying the data, your snippet works as expected:
{'code': ['msg'], '123456': ['Lorem ipsum dolor sit amet, consectetur adipiscing elit'], '345981': ['sed do eiusmo ut labore, et dolore magna aliqua;'], '459827': ['ullamco, laboris nisi ut aliquip ex ea commodo consequat.'], '490023': ['veniam, quis nostrud exercitation'], '345612': ['mollit anim id est laborum.']}
['Lorem ipsum dolor sit amet, consectetur adipiscing elit']
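The effect of the quoting can be checked without touching a file. This is a small sketch that feeds two of the quoted rows to csv.reader through io.StringIO (a stand-in for test.csv) and builds a plain key-to-string dictionary:

```python
import csv
import io

# Two rows from the quoted version of the data; the commas sit inside the quotes
raw = u'''code,msg
"123456","Lorem ipsum dolor sit amet, consectetur adipiscing elit"
"490023","veniam, quis nostrud exercitation"
'''

reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the header row
mydict = {row[0]: row[1] for row in reader}

key = "123456"
print("The key " + key + " has this value : " + mydict[key])
```

Storing row[1] instead of row[1:] gives a string rather than a one-element list, which is what the string concatenation in the question expects.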
In addition, here is an alternative using pandas:
You can use the to_dict method in pandas: read the CSV file into a DataFrame and then execute the code below.
df.set_index('code').T.to_dict('list')
Complete Code :
import pandas as pd
df = pd.read_csv(filepath_or_buffer=CSV_FILE_PATH)  # CSV_FILE_PATH: path to the CSV file, e.g. 'test.csv'
df.set_index('code').T.to_dict('list')
Output :
{123456: ['Lorem ipsum dolor sit amet, consectetur adipiscing elit'],
345981: ['sed do eiusmo ut labore, et dolore magna aliqua;'],
459827: ['ullamco, laboris nisi ut aliquip ex ea commodo consequat.'],
490023: ['veniam, quis nostrud exercitation'],
345612: ['mollit anim id est laborum.']}
I'm unable to insert the JSON data below into a Postgres DB using Python. I'm posting the entire code here. Please let me know why I'm unable to insert the data.
json data to insert:
{"data":"\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit's, we're sed'tu do\n=====\nWe'll eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse c",
"data2":"\n\n===========\n\n Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\nWe'll incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis\nWe'll nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse c \n\n===========\n\n"
}
Code:
import sys, getopt
import csv
import json
import psycopg2
from psycopg2.extensions import AsIs
def insert_data(entry):
    conn = psycopg2.connect(database = "example", user = "postgres", password = "", host = "127.0.0.1", port = "5432")
    print "Opened database successfully"
    cur = conn.cursor()
    cur.execute("INSERT INTO example (column2) VALUES '{0}'".format(info))
    conn.commit()
    conn.close()
i=0
with open("test.json") as json_file:
    data = json.load(json_file)
    for r in data:
        print json.dumps(data[i])
        dd=json.dumps(data[i])
        insert_data(dd)
        i=i+1
I'm getting this error:
psycopg2.ProgrammingError: syntax error at or near \nWe'll
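A likely cause is that the JSON text is pasted into the SQL string with format(), so the apostrophes in words like We'll end the quoted literal early (the VALUES list is also missing its parentheses). The usual remedy is to let the driver bind the value. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a server; with psycopg2 the placeholder would be %s instead of ?, as in cur.execute("INSERT INTO example (column2) VALUES (%s)", (entry,)). The table layout and shortened sample value are assumptions:

```python
import json
import sqlite3

# In-memory SQLite database standing in for the Postgres table from the question
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE example (column2 TEXT)")

# Shortened sample value containing apostrophes and newlines
entry = json.dumps({"data": "\n\nLorem ipsum's, we're sed'tu do\n=====\nWe'll eiusmod"})

# The value is passed separately from the SQL text, so its quotes are never parsed as SQL
cur.execute("INSERT INTO example (column2) VALUES (?)", (entry,))
conn.commit()

stored = cur.execute("SELECT column2 FROM example").fetchone()[0]
print(json.loads(stored)["data"])
conn.close()
```

Binding parameters this way also protects against SQL injection, which string formatting does not.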
I'm trying to split a large file that has several paragraphs, each one is of variable length and the only delimiter would be the bullet point for the next paragraph...
Is there a way to get several different files with each individual paragraph?
The final thing is to write each individual paragraph to a MySQL DB...
example input:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
output: each paragraph is a separate entry in the DB
This is how you split your file by bullet point:
new_files = open(source_file).read().split(u'\u2022')
for i, par in enumerate(new_files):
    with open("%s.txt" % i, "w") as out:  # one file per paragraph
        out.write(par)
    # then load each file into MySQL with an SQL statement such as:
    # LOAD DATA INFILE '<i>.txt' INTO TABLE your_DB_name.your_table;
This connects to the MySQL DB, reads the file, splits it at each bullet point, and inserts the data into a MySQL table.
My Code:
# Server connection to MySQL:
import MySQLdb
conn = MySQLdb.connect(host="localhost",
                       user="root",
                       passwd="newpassword",
                       db="db")
x = conn.cursor()
try:
    file_data = open("FILE_NAME_WITH_EXTENSION").read().split(u'\u2022')
    for text in file_data:
        print(text)
        # query parameters must be a sequence, hence the one-element tuple
        x.execute("""INSERT INTO TABLE_NAME VALUES (%s)""", (text,))
    conn.commit()
except MySQLdb.Error:
    conn.rollback()
conn.close()
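Splitting on the bullet usually leaves an empty first piece (the text before the first bullet) and stray whitespace around each paragraph. A small sketch of a cleanup step that could run before the inserts; the helper name split_paragraphs and the sample text are made up for illustration:

```python
def split_paragraphs(text, bullet=u'\u2022'):
    """Split text at each bullet and drop empty or whitespace-only pieces."""
    return [part.strip() for part in text.split(bullet) if part.strip()]

# Hypothetical input with two bulleted paragraphs
sample = u'\u2022 Lorem ipsum dolor sit amet.\n\u2022 Duis aute irure dolor.\n'
paragraphs = split_paragraphs(sample)
for number, paragraph in enumerate(paragraphs):
    print(number, paragraph)
```

Each element of paragraphs can then be passed to x.execute as the single query parameter.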