I have a CSV file that is structured like this:
['message', 'Date', 'Name', 'Location of a train station']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Haarlem']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Amsterdam']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Sittard']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Venlo']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Helmond']
['Het zou wel wat schoner mogen zijn', 'Tue 08 Nov 2022 00:49', 'Tijmen', 'Hilversum']
['Het zou wel wat schoner mogen zijn', 'Tue 08 Nov 2022 00:49', 'anoniem', 'Roosendaal']
Now I want to insert this information into my PostgreSQL database:
import csv
import psycopg2
with open('C:\\Users\\Danis Porovic\\PycharmProjects\\Module1\\berichten.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    index = list(zip(*csv.reader(csv_file)))
    messages = index[0]
    data = index[1]
    names = index[2]
    stations = index[3]
con = psycopg2.connect(
    host = "localhost",
    database = "fabriek",
    user = "postgres",
    password = "DanisMia1")
cur = con.cursor()
cur.execute("insert into klant (naam) values (%s);", (names,))
con.commit()
con.close()
How would I go about inserting all the names into a column in my database?
The zip call I'm using at the top turns each of the 4 columns into a tuple of strings. Would inserting tuples even work?
This is what the tuple of names looks like, for example:
('Mia', 'Danis', 'Jeffrey', 'Tim', 'Joppe', 'Tijmen', 'anoniem')
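One way this could work is to insert one row per name instead of passing the whole tuple as a single parameter, since a tuple passed as one parameter is treated as one value rather than as many rows. The sketch below is only an illustration under assumptions: the file is a plain comma-separated CSV without a header row, `klant.naam` is a text column, and the helper names `read_names` and `insert_names` are made up for this example.

```python
import csv
import io


def read_names(csv_file):
    """Return the third column (the name) of every CSV row that has one."""
    return [row[2] for row in csv.reader(csv_file) if len(row) >= 3]


def insert_names(con, names):
    """Insert one row per name.

    `con` is assumed to be an open psycopg2 connection, e.g. the result of
    psycopg2.connect(host=..., database=..., user=..., password=...).
    """
    cur = con.cursor()
    # executemany runs the statement once for each one-element parameter tuple.
    cur.executemany(
        "insert into klant (naam) values (%s);",
        [(name,) for name in names],
    )
    con.commit()


# Stand-in for the real file, so the parsing step can be tried without a database.
sample = io.StringIO(
    "vies,Mon 07 Nov 2022 18:43,Mia,Haarlem\n"
    "Het zou wel wat schoner mogen zijn,Tue 08 Nov 2022 00:49,Tijmen,Hilversum\n"
)
names = read_names(sample)
print(names)  # ['Mia', 'Tijmen']
```

With a real connection this would be called as `insert_names(con, names)` before `con.close()`. If each name should be stored only once, the list can be deduplicated first, for example with `list(dict.fromkeys(names))`.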
I have the column of values as below,
array(['Mar 2018', 'Jun 2018', 'Sep 2018', 'Dec 2018', 'Mar 2019',
'Jun 2019', 'Sep 2019', 'Dec 2019', 'Mar 2020', 'Jun 2020',
'Sep 2020', 'Dec 2020'], dtype=object)
From these values I need the following output:
array(['Mar'18', 'Jun'18', 'Sep'18', 'Dec'18', 'Mar'19',
'Jun'19', 'Sep'19', 'Dec'19', 'Mar'20', 'Jun'20',
'Sep'20', 'Dec'20'], dtype=object)
I have tried with following code,
df['Period'] = df['Period'].replace({'20','''})
But this does not convert the values. How can I do this replacement?
Any help?
Thanks
With your shown samples, please try the following.
df['Period'].replace(r" \d{2}", "'", regex=True)
Output will be as follows.
0 Mar'18
1 Jun'18
2 Sep'18
3 Dec'18
4 Mar'19
5 Jun'19
6 Sep'19
7 Dec'19
8 Mar'20
9 Jun'20
10 Sep'20
11 Dec'20
Try this regex:
df['Period'].str.replace(r"\s\d{2}(\d{2})", r"'\1", regex=True)
In the replacement part, \1 refers to the capturing group, which holds the last two digits of the year in this case.
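To see what the capturing group does outside pandas, the same pattern can be tried with the standard `re` module. This is a minimal sketch on a few sample strings; the variable names are only for illustration.

```python
import re

periods = ["Mar 2018", "Jun 2019", "Dec 2020"]

# \s matches the space, \d{2} the century digits (dropped),
# and (\d{2}) captures the last two digits for reuse as \1.
pattern = re.compile(r"\s\d{2}(\d{2})")
short = [pattern.sub(r"'\1", p) for p in periods]
print(short)  # ["Mar'18", "Jun'19", "Dec'20"]
```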
Your code (slightly changed so that it runs) will not give you what you need, because it removes every '20', so 2020 loses all of its digits:
>>> df['Period'] = df['Period'].str.replace('20','')
Out[179]:
Period
0 Mar 18
1 Jun 18
2 Sep 18
3 Dec 18
4 Mar 19
5 Jun 19
6 Sep 19
7 Dec 19
8 Mar
9 Jun
10 Sep
11 Dec
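The same pitfall can be reproduced with plain Python strings, which may make it easier to see why the last rows come out empty. A minimal sketch on three sample values:

```python
periods = ["Mar 2018", "Dec 2019", "Mar 2020"]

# str.replace removes every occurrence of "20", not only the century prefix,
# so "2020" disappears completely.
stripped = [p.replace("20", "") for p in periods]
print(stripped)  # ['Mar 18', 'Dec 19', 'Mar ']
```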
Another way, without regex, is to use vectorized str methods:
df['Period_refined'] = df['Period'].str[:3] + "'" + df['Period'].str[-2:]
Output
df
Period Period_refined
0 Mar 2018 Mar'18
1 Jun 2018 Jun'18
2 Sep 2018 Sep'18
3 Dec 2018 Dec'18
4 Mar 2019 Mar'19
5 Jun 2019 Jun'19
6 Sep 2019 Sep'19
7 Dec 2019 Dec'19
8 Mar 2020 Mar'20
9 Jun 2020 Jun'20
10 Sep 2020 Sep'20
11 Dec 2020 Dec'20
I am trying to extract date from text in python. These are the possible texts and date patterns in it.
"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"
This is what I have written so far,
mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)'
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'
This captures cases 1,2.
match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print match.group()
This somewhat captures cases 3, 4 and 5, but it returns everything between the day and the year. In the case below I want 25 Nov 2016, but the pattern gives me 25 Nov 3:00 p.m. (On Site)(2016).
So Question 1: How do I get only the date here?
match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print match.group()
Question 2: Similarly, how do I capture cases 6, 7 and 8? What should the regex be for those?
If that is not possible, is there a better way to capture the date from these formats?
You may try
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?
See the regex demo.
Note that I made all groups in the month blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added an optional (?:st|nd|rd|th)? group after the day digit pattern, changed the year pattern to \b\d{2}(?:\d{2})?\b so that it only matches 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where the day comes before the month and vice versa.
The day and month are captured into Group 1 and the year is captured into Group 2, so the result is the concatenation of both.
NOTE: If you need to match years more safely, you may want to make the year pattern more precise. E.g., to avoid matching 4- or 2-digit whole words that directly follow a colon (such as the minutes in 1:00), add a negative lookbehind:
year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
            ^^^^^^
Also, you may add word boundaries around the whole pattern to ensure a whole word match.
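The effect of the lookbehind can be checked in isolation on one of the sample strings. This is a small sketch; `plain_year` and `guarded_year` are illustrative names for the two variants of the year pattern.

```python
import re

text = "Wednesday 27 Apr 1:00 p.m. (On site)(2016)"

plain_year = r"\b\d{2}(?:\d{2})?\b"
guarded_year = r"\b(?<!:)\d{2}(?:\d{2})?\b"

# Without the lookbehind, the minutes "00" also look like a 2-digit year.
print(re.findall(plain_year, text))    # ['27', '00', '2016']
# With it, digits directly preceded by ':' are skipped.
print(re.findall(guarded_year, text))  # ['27', '2016']
```

The day "27" still matches the year pattern on its own; in the full regex that is not a problem, because the year is only searched for after the day and month have been consumed.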
Here is the Python demo:
import re
mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'
rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
strs = ["Auction details: 14 December 2016, Pukekohe Park","Auction details: 17 Feb 2017, Gold Sacs Road","Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)","Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)","Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)","Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)","Auction details: Thursday, 28th February '19","Auction details: Friday, 1st February '19","Friday 25 Nov 3:00 p.m. (On Site)(2016)"]
for s in strs:
    print(s)
    m = re.search(rx, s)
    if m:
        print("{} {}".format(m.group(1), m.group(2)))
    else:
        print("NO MATCH")
Output:
Auction details: 14 December 2016, Pukekohe Park
14 December 2016
Auction details: 17 Feb 2017, Gold Sacs Road
17 Feb 2017
Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)
27 Apr 2016
Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)
27 Apr 2016
Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)
27 Apr 2016
Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)
November 16 2016
Auction details: Thursday, 28th February '19
28th February 19
Auction details: Friday, 1st February '19
1st February 19
Friday 25 Nov 3:00 p.m. (On Site)(2016)
25 Nov 2016
Say I have two python lists as:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res = set()
for i in ListB:
    for j in ListA:
        if j in i:
            res.add(f'{i} {j}')
print(res)
this gives me result as:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
I think a regular expression could be helpful here, but I'm not sure how to approach it. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
The values in ListA and ListB are not necessarily in order. Therefore, for a particular month/year value in ListA, the matching month/year value from ListB has to be picked for both the 'Sales' and 'Units sold' components and concatenated.
My main goal is to get a list that I can use later to generate a statement for a Hive query.
Added more explanation as suggested by @andrew_reece.
Assuming no additional edge cases that need taking care of, your original code is not bad, just needs a slight update:
List_op = []
for a in ListA:
    combined = a
    for b in ListB:
        if a in b:
            combined += " " + b
    List_op.append(combined)
List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
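Because the question's edit says the lists may not be in order, it can help to try the same matching idea on shuffled input. The sketch below wraps it in a hypothetical helper `combine` and additionally sorts the matches, an assumption made here so that 'Sales ...' always comes before 'Units sold ...' no matter how ListB is ordered.

```python
def combine(list_a, list_b):
    """For each period in list_a, append every list_b entry that contains it."""
    result = []
    for period in list_a:
        # sorted() puts 'Sales ...' before 'Units sold ...' regardless of input order.
        matches = sorted(item for item in list_b if period in item)
        result.append(" ".join([period] + matches))
    return result


ListA = ['Feb 2018', 'Jan 2018']
ListB = ['Units sold Jan 2018', 'Sales Feb 2018', 'Units sold Feb 2018', 'Sales Jan 2018']
print(combine(ListA, ListB))
# ['Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Jan 2018 Sales Jan 2018 Units sold Jan 2018']
```

The output follows the order of ListA; only the order of the matched ListB entries is normalized.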
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
print([v1 + " " + v2 for v1, v2 in zip(ListA, [v1 + " " + v2 for v1, v2 in zip(ListB[::2], ListB[1::2])])])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example I first concatenate the ListB items pairwise and then join each ListA item with the corresponding entry of this new list.
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [f'{i} {j} {k}' for i, j, k in zipper]
I have a large amount of data in CSV Format that looks like this:
(u'Sat Jan 17 18:56:05 +0000 2015', u'anx321', 'RT #ManojHarry27: If India loses 2015 worldcup, Karishma\ntanna will be held responsible !!! #BB8', '0.0453125', '0.325')
(u'Sat Jan 17 18:56:13 +0000 2015', u'FrancisKimberl3', 'Python form imploration overgrowth-the consummative the very best as representing construction upsurge: sDGy', '1.0', '0.39')
(u'Sat Jan 17 18:56:18 +0000 2015', u'AllTechBot', 'RT #ruby_engineer: A workshop on monads with C++14 http://t.co/OKFc91J0QJ #hacker #rubyonrails #python #AllTech', '0.0', '0.0')
(u'Sat Jan 17 18:56:22 +0000 2015', u'python_job', ' JOB ALERT #ITJob #Job #New York - Senior Software Engineer Python Backed by First Round http://t.co/eqVxoMzYMG view full details', '0.245454545455', '0.44595959596')
(u'Sat Jan 17 18:56:23 +0000 2015', u'weepingtaco', 'Python: basic but beautiful', '0.425', '0.5625')
(u'Sat Jan 17 18:56:27 +0000 2015', u'python_IT_jobs', ' JOB ALERT #ITJob #Job #New York - Senior Software Engineer Python Backed by First Round http://t.co/gavWyraNqE view full details', '0.245454545455', '0.44595959596')
(u'Sat Jan 17 18:56:32 +0000 2015', u'accusoftinfoway', 'RT #findmjob: DevOps Engineer http://t.co/NasdBEEnRp #aws #perl #mysql #linux #hadoop #python #Puppet #jobs #hiring #careers', '0.0', '0.0')
(u'Sat Jan 17 18:56:32 +0000 2015', u'accusoftinfoway', 'RT #arnicas: Very useful - end to end deploying python flask on AWS RT #matt_healy: Great tutorial: https://t.co/RsiM09qJsJ #flask #python ', '0.595', '0.375')
(u'Sat Jan 17 18:56:36 +0000 2015', u'denisegregory10', "Oh you can't beat a good 'python' argument! http://t.co/ELo3GvNsuE via #youtube", '0.875', '0.6')
(u'Sat Jan 17 18:56:38 +0000 2015', u'NoSQLDigest', 'RT #KirkDBorne: R and #Python starter code for participating in #BoozAllen #DataScience Bowl: http://t.co/Q5C01eya95 #abdsc #DataSciBowl #B', '0.0', '0.0')
(u'Sat Jan 17 19:00:05 +0000 2015', u'RedditPython', '"academicmarkdown": a Python module for academic writing with Markdown. Haven\'t tried it o... https://t.co/uv8yFaz6cv http://t.co/EhiIIO7uTW', '0.0', '0.0')
(u'Sat Jan 17 19:00:28 +0000 2015', u'shopawol', 'Only 8.5 and 12 left make sure to get yours \nhttp://t.co/4rxmHqP2Qs\n#wdywt #goawol #sneakerheads http://t.co/wACIOdlGwY', '0.166666666667', '0.62962962963')
(u'Sat Jan 17 19:00:31 +0000 2015', u'AuthorBee', "RT #_kevin_ewb_: I know what your girl won't she just wanna kick it like the #WorldCup ", '0.0', '0.0')
(u'Sat Jan 17 19:00:37 +0000 2015', u'g33kmaddy', 'RT #KirkDBorne: R and #Python starter code for participating in #BoozAllen #DataScience Bowl: http://t.co/Q5C01eya95 #abdsc #DataSciBowl #B', '0.0', '0.0')
(u'Sat Jan 17 19:00:45 +0000 2015', u'Altfashion', 'Photo: A stunning photo of Kaoris latex dreams beautiful custom python bra. Photographer: MagicOwenTog... http://t.co/KdWnr3I8xP', '0.675', '1.0')
(u'Sat Jan 17 19:00:46 +0000 2015', u'oh226twt', 'Python programming: Easy and Step by step Guide for Beginners: Learn Python (English Edition) http://t.co/9optdOCrtE 1532', '0.216666666667', '0.416666666667')
(u'Sat Jan 17 19:00:50 +0000 2015', u'DvSpacefest', 'RT #Pomerantz: Potential team in the Learning XPRIZE looking for Python coders. Details: https://t.co/nGgrmYmXCa', '0.0', '1.0')
(u'Sat Jan 17 19:01:04 +0000 2015', u'cun45', 'SPORTS And More: #Cycling #Ciclismo U23 #Portugal #WorldCup team o... http://t.co/FBeqatfu85', '0.5', '0.5')
(u'Sat Jan 17 19:01:12 +0000 2015', u'insofferentexo', 'RT #FISskijumping: Dawid is already at the hill in Zakopane, in a larger than life format! #skijumping #worldcup http://t.co/SDOnxDwfIX', '0.0', '0.5')
(u'Sat Jan 17 19:01:17 +0000 2015', u'beuhe', 'Madrid Tawarkan Khedira ke Dortmund: Real Madrid dikabarkan telah menawarkan Sami Khedira ... http://t.co/R5YCKjECtm #football #worldcup', '0.2', '0.3')
(u'Sat Jan 17 19:01:18 +0000 2015', u'ITJobs_Karen', ' JOB ALERT #ITJob #Job #Paradise Valley - Python / Django Developer http://t.co/0Xn1k0cL5B view full details', '0.35', '0.55')
(u'Sat Jan 17 19:01:22 +0000 2015', u'DonnerBella', 'So confused about #meninist . Monty Python, is that you?', '-0.4', '0.7')
(u'Sat Jan 17 19:01:34 +0000 2015', u'DoggingTeens', '#Dogging,#OutdoorSex,#Sluts,#GangBang,#Stockings,#Uk_Sex: 13 Inch Black Python Being Sucked http://t.co/n9Yv4nhcxo', '-0.166666666667', '0.433333333333')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', 'Soccer-La Liga results and standings: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/3JOOnBQzvG', '0.0', '0.0')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', 'Soccer-La Liga summaries: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/AZgxr5Z9EV', '0.0', '0.0')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', "Soccer-Late Congo goal spoils Equatorial Guinea's party: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/W6Ff4HikxH", '0.0', '0.0')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Ligue 1 top scorers: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/WS2lcZnzKu', '0.5', '0.5')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Pearce answers critics as Forest seal unlikely win: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/Qb5PKuls6z', '0.15', '0.45')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Israeli championship results and standings: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/dce9Qn9oI5', '0.0', '0.0')
(u'Sat Jan 17 19:02:07 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python jweede.recipe.template 1.2.3: Buildout recipe for making files out of Jinja2 templates http://t.co/dgeuuFWf19', '0.0', '0.0')
(u'Sat Jan 17 19:02:07 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python aclhound 1.7.5: ACL Compiler http://t.co/fNOFSYd7FJ', '0.0', '0.0')
(u'Sat Jan 17 19:02:08 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python Flask-Goat 0.2.0: Flask plugin for security and user administration via GitHub OAuth & organization http://t.co/', '0.0', '0.0')
(u'Sat Jan 17 19:02:08 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python filewatch 0.0.6: Python File Watcher http://t.co/fIHLagCqvf', '0.0', '0.0')
(u'Sat Jan 17 19:02:16 +0000 2015', u'HeatherA789', "Programming Python: Start Learning Python Today, Even If You've Never Coded Before (A Beginner's Guide): http://t.co/3Ss4cwCvP6", '0.0', '0.0')
(u'Sat Jan 17 19:02:18 +0000 2015', u'HeatherA789', 'Python: Learn Python in One Day and Learn It Well. Python for Beginners with Hands-on Project.: Python: Learn http://t.co/zvLIpydd6V', '0.0', '0.0')
(u'Sat Jan 17 19:02:26 +0000 2015', u'AlexeiCherenkov', 'It looks like I should learn Python. Do you think I can do this during 3 hours tomorrow? Yes-Rt; No-Fav.', '0.0', '0.0')
(u'Sat Jan 17 19:02:33 +0000 2015', u'cleansheet', "#WorldCup Cricket World Cup: Australia should've picked a leg-spinner and named Steve Smith vice-captain ... http://t.co/kgXgUVbHDd", '0.0', '0.0')
(u'Sat Jan 17 19:02:34 +0000 2015', u'cleansheet', '#WorldCup Younger Northug earns 1st cross-country World Cup victory http://t.co/y7jozMriFG', '0.0', '0.0')
(u'Sat Jan 17 19:02:35 +0000 2015', u'cleansheet', '#WorldCup ICC World Cup 2015: School massacre survivors inspire Pakistan team http://t.co/Tj1jpCZsj6', '0.0', '0.0')
(u'Sat Jan 17 19:02:35 +0000 2015', u'cleansheet', '#WorldCup We Want to Win World Cup for Peshawar Schoolkids: Misbah-ul-Haq http://t.co/RbeBkrv69s', '0.8', '0.4')
(u'Sat Jan 17 19:02:38 +0000 2015', u'world_latest', 'New: Equatorial Guinea 1-1 Congo http://t.co/32sfrrbBOW #follow #worldcup world_latest world_latest', '0.136363636364', '0.454545454545')
(u'Sat Jan 17 19:02:39 +0000 2015', u'FAHAD_CTID', 'RT #fawadiii: #FAHAD_CTID #VeronaPerqukuu Hahaha. Hanw ;) bdw worldcup bhi hai 15 sy :D', '0.483333333333', '0.8')
(u'Sat Jan 17 19:02:43 +0000 2015', u'amazon_mybot', '#3: Python http://t.co/LLzeKQQBon', '0.0', '0.0')
(u'Sat Jan 17 19:02:45 +0000 2015', u'LarryMesast', '#javascript #html5 #UX #Python #agile #DDD', '0.5', '0.75')
(u'Sat Jan 17 19:02:46 +0000 2015', u'washim987', 'RT #anjali_damania: I was angry at #shaziailmi & #thekiranbedi My husband calms me down & says. Haame Worldcup jitna hai. Sirf Pakistan se ', '-0.327777777778', '0.644444444444')
(u'Sat Jan 17 19:03:02 +0000 2015', u'sksh_rana', '"#ManojHarry27: If India loses 2015 worldcup, Karishma\ntanna will be held responsible !!! #BB8"\n#TheFarahKhan #BeingSalmanKhan', '0.0453125', '0.325')
(u'Sat Jan 17 19:03:14 +0000 2015', u't_kohyama', '#_3mame PythonMatlabPython', '0.0', '0.0')
(u'Sat Jan 17 19:03:16 +0000 2015', u'AntonShipulin', '#photo #worldcup #flowerceremony #sprint #Ruhpolding http://t.co/fe9qpiwsqJ', '0.0', '0.0')
(u'Sat Jan 17 19:03:22 +0000 2015', u'karthik_vik', 'RT #ValaAfshar: Highest paying programming languages, ranked by salary:\n\n1 Ruby\n2 Objective C\n3 Python\n4 Java\n\nhttp://t.co/RudytdjFLC http:', '0.0', '0.1')
Right now I plot the data with the following script:
import matplotlib
matplotlib.use('Agg')
from matplotlib.mlab import csv2rec
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pylab import *
from datetime import datetime
import dateutil
from dateutil import parser
import re
import os
import operator
import csv
input_filename="test_output.csv"
output_image_namep='polarity.png'
output_image_name2='subjectivity.png'
input_file = open(input_filename, 'r')
data = csv2rec(input_file, names=['time', 'name', 'message', 'polarity', 'subjectivity'])
time_list = []
polarity_list = []
''' I am aware there's a much more concise way of doing this'''
for line in data:
    td = line['time']
    ''' stupid regex '''
    s = re.sub('\(\u', '', td)
    dtime = parser.parse(s)
    dtime = re.sub('-', '', str(dtime))
    dtime = re.sub(' ', '', dtime)
    dtime = re.sub('\+00:00', '', dtime)
    dtime = re.sub(':', '', dtime)
    dtime = dtime[:-2]
    try:
        subjectivity = float(line['subjectivity'].replace("'", '').replace(")", ''))
    except:
        pass
    print dtime, polarity
    time_list.append( str(dtime) )
    polarity_list.append( polarity )
rcParams['figure.figsize'] = 10, 4
rcParams['font.size'] = 8
fig = plt.figure()
plt.plot([time_list], [polarity_list], 'ro')
axes = plt.gca()
axes.set_ylim([-1,1])
plt.savefig(output_image_namep)
The plot itself is fine, but I would like the X axis to display the date labels correctly. Right now I'm using some ugly regex to strip the date down to YYYYMMDDHHMM.
What about this:
import time
def format_time_label(original):
    return time.strftime('%Y%m%d%H%M',
        time.strptime(original, "%a %b %d %H:%M:%S +0000 %Y"))
Example:
>>> format_time_label('Sat Jan 17 19:00:50 +0000 2015')
'201501171900'
This works only if every date in your data has the timezone offset +0000, because the format string matches that offset literally. Python 2's strptime has no %z directive to parse it; in Python 3, datetime.strptime accepts %z.
You can adjust the parsing format to account for the leftovers from your data format:
def format_time_label(original):
    return time.strftime('%Y%m%d%H%M',
        time.strptime(original, "(u'%a %b %d %H:%M:%S +0000 %Y'"))
>>> format_time_label("(u'Sat Jan 17 18:56:05 +0000 2015'")
'201501171856'
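If the script runs on Python 3, a variant that does not hard-code the +0000 offset could look like the sketch below. It is not part of the original answer: it assumes `datetime.strptime` with the `%z` directive and normalizes the result to UTC before formatting, reusing the function name only for comparison.

```python
from datetime import datetime, timezone


def format_time_label(original):
    """Parse a Twitter-style timestamp and return it as YYYYMMDDHHMM in UTC."""
    # %z reads the numeric offset (e.g. +0000 or +0100) instead of matching it literally.
    parsed = datetime.strptime(original, "%a %b %d %H:%M:%S %z %Y")
    return parsed.astimezone(timezone.utc).strftime("%Y%m%d%H%M")


print(format_time_label("Sat Jan 17 19:00:50 +0000 2015"))  # 201501171900
print(format_time_label("Sat Jan 17 19:00:50 +0100 2015"))  # 201501171800
```

Like the `time.strptime` version, this relies on English day and month abbreviations, so it assumes the default C/English locale.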