Python csv(Two columns: key/value) to Dictionary - python

Is it possible to read data from a csv file into a dictionary, such that the first column is the key and the second column is the value?
For example, I have this csv file:
code msg
123456 Lorem ipsum dolor sit amet, consectetur adipiscing elit
345981 sed do eiusmo ut labore, et dolore magna aliqua;
459827 ullamco, laboris nisi ut aliquip ex ea commodo consequat.
490023 veniam, quis nostrud exercitation
345612 mollit anim id est laborum.
code represents the keys, and msg represents the values associated with each code.
import csv
with open('test.csv') as f:
    reader = csv.reader(f)
    mydict = {rows[0]:rows[1:] for rows in reader}
print(mydict)
x = mydict.get("123456")
print(x)
Result:
{'code;msg': [], '123456;Lorem ipsum dolor sit amet': [' consectetur adipiscing elit'], '345981;"sed do eiusmo ut labore': [' et dolore magna aliqua;"'], '459827;ullamco': [' laboris nisi ut aliquip ex ea commodo consequat.'], '490023;veniam': [' quis nostrud exercitation'], '345612;mollit anim id est laborum.': []}
None
I would like to look up the value associated with each key.
For example, when I write:
key= "123456"
value=mydict.get(key)
print(key + "has this value : " + value)
I would get as an output:
>>> The key 123456 has this value :Lorem ipsum dolor sit amet, consectetur adipiscing elit

Without any imports, you can use:
with open('test.csv') as f:
    csv = f.readlines()
d = {}
for line in csv[1:]: # Loop csv lines skipping first line csv[1:] (headers)
    m = line.split()
    if len(m) > 1:
        d[m[0]] = " ".join(m[1:])
print(d)
Output:
{'123456': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit', '345981': 'sed do eiusmo ut labore, et dolore magna aliqua;', '459827': 'ullamco, laboris nisi ut aliquip ex ea commodo consequat.', '490023': 'veniam, quis nostrud exercitation', '345612': 'mollit anim id est laborum.'}
Notes:
To search by key, I normally use:
if '123456' in d:
    print(d['123456'])
# Lorem ipsum dolor sit amet, consectetur adipiscing elit
Printing dictionary keys and values
print(d.keys(), d.values())
# dict_keys(['123456', '345981', '459827', '490023', '345612'])
# dict_values(['Lorem ipsum dolor sit amet, consectetur adipiscing elit', 'sed do eiusmo ut labore, et dolore magna aliqua;', 'ullamco, laboris nisi ut aliquip ex ea commodo consequat.', 'veniam, quis nostrud exercitation', 'mollit anim id est laborum.'])

Using the csv module:
import csv
result = {}
with open('test.csv') as infile:
    reader = csv.reader(infile, delimiter=';')
    next(reader) # Skip the header
    for row in reader: # Iterate over each line
        result[row[0]] = row[1] # Build the dictionary
print(result)
Output:
{'123456': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit',
'345612': 'mollit anim id est laborum.',
'345981': 'sed do eiusmo ut labore, et dolore magna aliqua',
'459827': 'ullamco, laboris nisi ut aliquip ex ea commodo consequat.',
'490023': 'veniam, quis nostrud exercitation'}
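A minimal sketch of the lookup the question asks for, assuming (from the printed result in the question) that the file really uses ';' between the two columns. `load_lookup` and the inline `SAMPLE` data are hypothetical stand-ins for reading test.csv:

```python
import csv
import io

# Hypothetical sample mirroring the question's data; ';' separates the columns.
SAMPLE = """code;msg
123456;Lorem ipsum dolor sit amet, consectetur adipiscing elit
345981;"sed do eiusmo ut labore, et dolore magna aliqua;"
490023;veniam, quis nostrud exercitation
"""

def load_lookup(stream, delimiter=";"):
    """Return {first column: second column}, skipping the header row."""
    reader = csv.reader(stream, delimiter=delimiter)
    next(reader, None)  # skip the header line if there is one
    # Ignore blank or malformed rows that do not have two columns.
    return {row[0]: row[1] for row in reader if len(row) >= 2}

lookup = load_lookup(io.StringIO(SAMPLE))
key = "123456"
# dict.get returns the fallback instead of raising when the key is absent.
print("The key " + key + " has this value: " + lookup.get(key, "<missing>"))
```

With a real file you would pass `open('test.csv', newline='')` instead of the `io.StringIO` object. Quoted fields may contain the delimiter, which is why the second sample row survives intact.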

The problem is with the input data: the messages themselves contain commas, and csv.reader splits on commas by default. The second column should be enclosed in double quotes:
code,msg
"123456","Lorem ipsum dolor sit amet, consectetur adipiscing elit"
"345981","sed do eiusmo ut labore, et dolore magna aliqua;"
"459827","ullamco, laboris nisi ut aliquip ex ea commodo consequat."
"490023","veniam, quis nostrud exercitation"
"345612","mollit anim id est laborum."
After modifying the data this way, your snippet works as written and prints:
{'code': ['msg'], '123456': ['Lorem ipsum dolor sit amet, consectetur adipiscing elit'], '345981': ['sed do eiusmo ut labore, et dolore magna aliqua;'], '459827': ['ullamco, laboris nisi ut aliquip ex ea commodo consequat.'], '490023': ['veniam, quis nostrud exercitation'], '345612': ['mollit anim id est laborum.']}
['Lorem ipsum dolor sit amet, consectetur adipiscing elit']
In addition, here is an alternative using pandas. Read the csv file into a DataFrame, then use the to_dict method:
df.set_index('code').T.to_dict('list')
Complete code:
import pandas as pd
df = pd.read_csv(filepath_or_buffer="CSV_FILE_PATH")
df.set_index('code').T.to_dict('list')
Output:
{123456: ['Lorem ipsum dolor sit amet, consectetur adipiscing elit'],
345981: ['sed do eiusmo ut labore, et dolore magna aliqua;'],
459827: ['ullamco, laboris nisi ut aliquip ex ea commodo consequat.'],
490023: ['veniam, quis nostrud exercitation'],
345612: ['mollit anim id est laborum.']}

Related

Function to find regex matches in text prints one match at a time... I need a list

The function I created to find a list of regex matches doesn't work: instead of printing a list of all the matches, it prints one match at a time. I tried multiple times and I don't understand what the error could be.
For instance, this is the text I want to find the regex in: '] prima ciao hello'
This is the function:
import re

def find_regex(regex, text):
    l = []
    matches_prima = re.findall(regex, text)
    lunghezza_prima = len(matches_prima)
    for x in matches_prima:
        l.extend(matches_prima)
    print(l)

It is called from another function like this:
def main():
    testo = '] prima ciao hello', 'ola'
    find_prima = re.compile(r"\]\s*prima(?!\S)")
    print(find_regex(find_prima,testo))

if __name__ == "__main__":
    main()
So given a regex, I call it like print(find_regex(find_prima,testo)). But the output is:
['] prima']
[]
So I get them printed once at a time.
And I would need the full list instead to count all the matches. What am I doing wrong?
Try this:
import re
txt = """mypattern, Lorem ipsum dolor sit amet, aliquip sunt ad irure ad
labore nulla do et est eiusmod ut fugiat. Minim enim incididunt ullamco
deserunt Lorem cillum in est ullamco dolor qui sint labore. Reprehenderit
laborum anim magna pariatur proident cillum et eiusmod eu laboris cillum.
Quis et nostrud laboris non. Est incididunt dolore sint dolore. Sunt eu
mypattern, ipsum ullamco dolore ad ut veniam est. dolore mollit ut sunt nulla
"""
print([line for line in txt.splitlines() if re.match(r"mypattern, ", line) is not None])
Output:
['mypattern, Lorem ipsum dolor sit amet, aliquip sunt ad irure ad', 'mypattern, ipsum ullamco dolore ad ut veniam est. dolore mollit ut sunt nulla']
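To get one combined list for counting, a minimal sketch is to collect the matches from every string and return them rather than print them. This assumes `testo` is meant to be a tuple of separate strings, as in the question; the reworked `find_regex` below is an illustration, not the asker's original code:

```python
import re

def find_regex(regex, texts):
    """Return every match of the compiled `regex` across all strings in `texts`."""
    matches = []
    for text in texts:
        # findall returns a list of matches for this one string;
        # extend appends them all to the running list.
        matches.extend(regex.findall(text))
    return matches

find_prima = re.compile(r"\]\s*prima(?!\S)")
testo = ('] prima ciao hello', 'ola')

found = find_regex(find_prima, testo)
print(found)       # the full list of matches
print(len(found))  # how many matches there are in total
```

Returning the list (instead of printing inside the function) also avoids the extra `None` that `print(find_regex(...))` shows when the function has no return value.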

Is it possible to drop sentences from the text with NLTK in Python?

For example, I have a text that consists of several sentences:
"First sentence is not relevant. Second contains information about KPI I want to keep. Third is useless. Fourth mentions topic relevant for me".
In addition, I have self-constructed dictionary with words {KPI, topic}.
Is it somehow possible to write a code that will keep only those sentences, where at least one word is mentioned in the dictionary? So that from the above example, only 2nd and 4th sentence will remain.
Thanks
P.S. I already have a code to tokenize the text into sentences, but leaving only "relevant" ones is not something common, as I see.
One solution would be to use list comprehensions (see example below).
But there might be a better and more pythonic solution out there.
sentences = ['Lorem ipsum dolor keyword sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
             'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.',
             'Duis aute irure other_keyword dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.',
             'Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.']
vocabulary = {'keyword': 'Topic 1',
              'other_keyword': 'Topic 2'}
[sentence for sentence in sentences if any(word in sentence for word in vocabulary)]
>>> ['Lorem ipsum dolor keyword sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
'Duis aute irure other_keyword dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.']
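Note that `word in sentence` is a substring test, so 'topic' would also match 'topics' or 'utopical'. If whole-word matching is wanted, one possible sketch uses a regex with word boundaries. It assumes the sentences are already tokenized (as the question says); `keep_relevant` is a hypothetical helper name, and ignoring case is an assumption:

```python
import re

def keep_relevant(sentences, keywords):
    """Keep only the sentences that contain at least one keyword as a whole word."""
    if not keywords:
        return []
    # re.escape guards against keywords containing regex metacharacters;
    # \b anchors the match at word boundaries.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in sentences if pattern.search(s)]

sentences = [
    "First sentence is not relevant.",
    "Second contains information about KPI I want to keep.",
    "Third is useless.",
    "Fourth mentions topic relevant for me.",
]
relevant = keep_relevant(sentences, {"KPI", "topic"})
print(relevant)
```

On the question's example this keeps the second and fourth sentences.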

Writing "\r\n" as newlines in a file

I'm reading text blobs from a db and I have to write them to files.
Those strings are like
Lorem ipsum dolor sit amet,\r\nconsectetur adipiscing elit,\r\nsed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\r\n\r\nUt enim ad minim veniam,\r\nquis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
and I need to write them like this:
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Is there a simpler way to achieve this than reading char by char and performing a file.write('\n') every time I run into a \r\n?
My code:
import sqlite3
import codecs
db = sqlite3.connect('blog.sqlite')
cursor = db.cursor()
cursor.execute('''select filename, filecontent from mytable''')
all_rows = cursor.fetchall()
for row in all_rows:
    with codecs.open(row[0], 'w', encoding='utf-8') as f:
        f.write(row[1])
db.close()
filename is just a plain string; filecontent has accented letters, and I want to avoid �s in the output.
example data:
Mi sono trovato nella condizione di dover fare una join su più campi. \r\nChe stavano in una colonna sola. \r\nSeparati da punti e virgola.\r\n\r\nPrima di tentare il suicidio, ho googolato un pò e ho scoperto questo:\r\n\r\n :::sql\r\n SELECT * \r\n FROM tabella1 LEFT JOIN tabella2 ON tabella2.colonnaA IN ( \r\n SELECT REGEXP_SUBSTR(tabella2.colonnaB,'[\\^;]+', 1, level) FROM dual \r\n CONNECT BY REGEXP_SUBSTR(tabella2.colonnaB, '[\\^;]+', 1, level) IS NOT NULL \r\n );\r\n\r\nForse è una cosa risaputa, ma non la conoscevo... \r\nSicuramente la riutilizzerò tantissimo.\r\n
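I figured it out: the blobs contain the literal characters \r\n (backslash, r, backslash, n) rather than real line breaks, so I had to search for the raw string: str.replace(r'\r\n', '\n')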
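The fix above can be sketched as follows. This assumes the stored text holds the four literal characters backslash, r, backslash, n; `normalize_newlines` and the sample `blob` are hypothetical names used only for illustration:

```python
def normalize_newlines(text):
    """Replace the literal four-character sequence backslash-r-backslash-n with a newline."""
    # r"\r\n" is a raw string: it matches the characters '\', 'r', '\', 'n',
    # not an actual carriage return plus line feed.
    return text.replace(r"\r\n", "\n")

# Raw literal, so the backslashes stay in the string exactly as in the database.
blob = r"Lorem ipsum dolor sit amet,\r\nconsectetur adipiscing elit,\r\n\r\nUt enim ad minim veniam,"

cleaned = normalize_newlines(blob)
print(cleaned)
```

In the original loop this would become `f.write(normalize_newlines(row[1]))`. If the blobs held real CR/LF characters instead, the non-raw `'\r\n'` would be the string to replace.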

unable to insert json data into postgresql using python

I'm unable to insert the json data below into the postgres db using python. I'm posting the entire code here. Please let me know why I'm unable to insert the data.
json data to insert:
{"data":"\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit's, we're sed'tu do\n=====\nWe'll eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse c",
"data2":"\n\n===========\n\n Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\nWe'll incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis\nWe'll nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse c \n\n===========\n\n"
}
Code:
import sys, getopt
import csv
import json
import psycopg2
from psycopg2.extensions import AsIs

def insert_data(entry):
    conn = psycopg2.connect(database = "example", user = "postgres", password = "", host = "127.0.0.1", port = "5432")
    print("Opened database successfully")
    cur = conn.cursor()
    cur.execute("INSERT INTO example (column2) VALUES '{0}'".format(entry))
    conn.commit()
    conn.close()

i=0
with open("test.json") as json_file:
    data = json.load(json_file)
    for r in data:
        print(json.dumps(data[i]))
        dd=json.dumps(data[i])
        insert_data(dd)
        i=i+1
I'm getting this error:
psycopg2.ProgrammingError: syntax error at or near \nWe'll
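One likely cause, judging from the error: the JSON text contains apostrophes (We'll, elit's), and building the statement with str.format splices them straight into the SQL, which ends the quoted literal early. The VALUES list also needs parentheses. A minimal sketch, assuming the table `example` and column `column2` from the question, passes the value as a bound parameter so the driver does the quoting. `insert_entry` and `main` are hypothetical names, and walking `data.values()` is an assumption about how the file should be iterated:

```python
import json

# %s is psycopg2's placeholder; the value is sent separately from the SQL text.
INSERT_SQL = "INSERT INTO example (column2) VALUES (%s)"

def insert_entry(cur, entry):
    """Insert one JSON-serialisable value through a bound parameter."""
    # The parameters must be a sequence, hence the one-element tuple.
    cur.execute(INSERT_SQL, (json.dumps(entry),))

def main():
    # Imported here so insert_entry can be tried without a database driver.
    import psycopg2

    conn = psycopg2.connect(database="example", user="postgres",
                            password="", host="127.0.0.1", port="5432")
    try:
        with open("test.json") as json_file:
            data = json.load(json_file)
        with conn.cursor() as cur:
            # Assumption: each top-level value of the JSON object is one row.
            for value in data.values():
                insert_entry(cur, value)
        conn.commit()
    finally:
        conn.close()
```

Calling `main()` runs the import against a live database. Opening one connection for the whole file, rather than one per row as in the question, is a design choice of this sketch.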

How to split a file which is delimited by bullet points

I'm trying to split a large file that has several paragraphs, each one is of variable length and the only delimiter would be the bullet point for the next paragraph...
Is there a way to get several different files with each individual paragraph?
The final thing is to write each individual paragraph to a MySQL DB...
example input:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
output: each paragraph is a separate entry in the DB
This is how you split your file at each bullet point:
new_files = open(source_file).read().split(u'\u2022')
for i, par in enumerate(new_files):
    with open("%s.txt" % i, "w") as out:
        out.write(par)
    # Then load each file from MySQL, e.g.:
    # LOAD DATA INFILE '<i>.txt' INTO TABLE your_DB_name.your_table;
This connects to the MySQL DB, reads the file, splits it at each bullet point, and inserts the data into a MySQL table.
My code:
# Server connection to MySQL:
import MySQLdb
conn = MySQLdb.connect(host="localhost",
                       user="root",
                       passwd="newpassword",
                       db="db")
x = conn.cursor()
try:
    file_data = open("FILE_NAME_WITH_EXTENSION").read().split(u'\u2022')
    for text in file_data:
        print(text)
        # Pass the parameters as a one-element tuple
        x.execute("""INSERT INTO TABLE_NAME VALUES (%s)""", (text,))
    conn.commit()
except Exception:
    conn.rollback()
conn.close()
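Splitting on the bullet usually leaves an empty first piece (the text before the first bullet) and stray whitespace around each paragraph. A small sketch of cleaning that up before inserting; `split_paragraphs` and the sample text are hypothetical, and the bullet is assumed to be U+2022:

```python
BULLET = "\u2022"

def split_paragraphs(text, bullet=BULLET):
    """Split text at each bullet, dropping empty pieces and surrounding whitespace."""
    return [part.strip() for part in text.split(bullet) if part.strip()]

sample = "\u2022 Lorem ipsum dolor sit amet.\n\u2022 Duis aute irure dolor.\n"
paragraphs = split_paragraphs(sample)

# One-element tuples, the shape a DB-API cursor's executemany expects
# for a single-column INSERT.
rows = [(p,) for p in paragraphs]
print(rows)
```

The `rows` list could then be passed to `x.executemany("""INSERT INTO TABLE_NAME VALUES (%s)""", rows)` in place of the per-paragraph loop.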
