I'm a bit stuck with proceeding with the custom function that would be able to parse in any given language, with a set of rules that would be able to identify and parse months and years including some special cases (that I will show).
Input data:
# Sample raw date-range strings (certificate / experience entries);
# each row is a one-element list, as scraped.
dates_eg = [["Issued Feb 2018 · No Expiration Date"], ["Oct 2021 - Present"],
["Jan 2019 - Oct 2021 · 2 yrs 10 mos"], ["1994 - 2000"], ["Sep 2010 – Sep 2015 • 5 yrs"],
["Nov 2019 – Present"], ["Sep 2015 – Mar 2017 • 1 yr 6 mos"], ["Apr 2019 – Present • 2 yrs 8 mos"],
["Issued: Aug 2018 · Expired Aug 2020"], ["Mar 2017 - Dec 2017 · 10 mos"],
["May 2021 - Present · 9 mos"]]
NOTE: No API translation should be used. Instead of it, I've created a list of FLAG words (in certain languages) and STOPWORDS that would be able to identify all the cases.
Present, Issued, Expired, No Expiration Date -> all of these four keywords/stopwords would be identified through the lists (see underneath).
Current functions work workflow:
e.g. "Issued Feb 2018 · No Expiration Date"
Firstly, algorithm named "raw_dates_extractor" is separating the start date ('Issued Feb 2018') and the end date ('No Expiration Date') with a separator ' · '
Secondly, clean_extract_dates algorithm is identifying 'Issued' and gets rid of it
All the aforementioned resulting in:
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
However, the desired output would go further by extracting/identifying years, months, and stopwords/flag words (e.g. 'No Expiration Date' and 'Present' would be flagged as ONGOING = 1, otherwise 0).
From the first initial raw example:
"Issued Feb 2018 · No Expiration Date" would be:
Desired result:
{'start_year': 2018, 'start_month': "Feb", 'end_year': None,
'end_month': None, 'is_on_going': True}
Complete Code attempt:
Lists with Expired, Issued, No Expiration Date, Present
import re
import pandas as pd
# "Present" word in all language combination
# "Present" in many languages. NOTE: the original list was missing commas
# after "वर्तमान" and "o momento", so adjacent literals were silently merged
# by implicit string concatenation ("वर्तमानSaat ini", "o momentoPrezent")
# and four entries could never match. "ปัจจุบัน " also had a trailing space.
present_all_lang = [
    "současnost", "I dag", "Heute", "Present", "actualidad", "aujourd’hui",
    "वर्तमान", "Saat ini", "Presente",
    "現在", "현재", "Kini", "heden", "Nå", "obecnie", "o momento",
    "Prezent", "настоящее время", "nu", "ปัจจุบัน",
    "Kasalukuyan", "Halen", "至今", "現在"]
# "Issued" word in all language combination
# "Issued" in many languages. NOTE: the original list was missing commas
# after "Diterbitkan" and "Emitido em", merging four entries into two
# useless strings via implicit concatenation.
issued_all_lang = [
    "Vydáno", "Udstedt", "Ausgestellt", "Issued", "Expedición", "Émise le",
    "नव॰", "Diterbitkan", "Data di rilascio",
    "発行日", "발행일", "Dikeluarkan", "Toegekend", "Utstedt", "Wydany",
    "Emitido em", "Дата выдачи", "Utfärdat", "ออกเมื่อ",
    "Inisyu", "tarihinde verildi", "颁发日期", "發照日期"]
# "No Expiration Date" in all language combinations
# "No Expiration Date" in many languages. NOTE: the original list was
# missing commas after "有効期限なし", "Geen vervaldatum",
# "Без истечения срока действия" and "Walang Petsa ng Pagtatapos",
# so eight entries were silently fused into four by implicit
# string concatenation and could never match.
no_experience_all_lang = [
    "Bez data vypršení", "Ingen udløbsdato", "Kein Ablaufdatum",
    "No Expiration Date",
    "Sin fecha de vencimiento", "Sans date d’expiration",
    "समाप्ति की कोई तारीख नहीं",
    "Tidak Ada Tanggal Kedaluwarsa", "Nessuna data di scadenza",
    "有効期限なし", "만료일 없음",
    "Tiada Tarikh Tamat Tempoh", "Geen vervaldatum", "Ingen utløpsdato",
    "Brak daty ważności", "Sem data de expiração",
    "Без истечения срока действия", "Inget utgångsdatum",
    "ไม่มีวันหมดอายุ",
    "Walang Petsa ng Pagtatapos", "Sona Erme Tarihi Yok",
    "长期有效", "永久有效"]
# "Expired" in all lang
# "Expired" in many languages. NOTE: the original had " 만료됨" with a
# leading space — since cleaning matches whitespace-split words exactly,
# it could never match; stripped here.
expires_all_lang = [
    "Vyprší", "Udløbet", "Gültig bis", "Expired", "Vencimiento", "Expire le",
    "को समाप्त हो गया है", "Kedaluwarsa", "Data di scadenza", "有効期限",
    "만료됨", "Tamat Tempoh",
    "Verloopt op", "Utløper", "Wygasa", "Expira em", "A expirat la",
    "Срок действия истек",
    "Upphörde att gälla", "หมดอายุเมื่อ", "Mag-e-expire sa",
    "tarihinde süresi bitiyor", "有效期至", "過期"]
# A colon may appear — get rid of it!
def raw_dates_extractor(list_of_examples: list) -> dict:
    """Split each one-element row into raw start/end date strings.

    Each row looks like ["Issued Feb 2018 · No Expiration Date"]; the text
    before the first separator (·, •, – or -) becomes Start_date_raw<i>,
    the text after it becomes End_date_raw<i>. Extra segments (durations
    like "2 yrs 10 mos") are discarded.

    Fixes vs. original: the parameter was annotated with the runtime
    variable `dates_eg` instead of a type; the dict union was recomputed
    on every iteration; rows without any separator raised IndexError
    (now yield an empty end date).
    """
    raw_start_date, raw_end_date = {}, {}
    for row, datum in enumerate(list_of_examples):
        parts = re.split("[·•–-]", datum[0])
        raw_start_date[f'Start_date_raw{row}'] = parts[0].strip()
        # Guard against rows with no separator at all.
        raw_end_date[f'End_date_raw{row}'] = parts[1].strip() if len(parts) > 1 else ''
    # Merge once after the loop (dict union requires Python 3.9+,
    # which the original already relied on).
    return raw_start_date | raw_end_date
output:
raw_dates_extractor(dates_eg)
{'Start_date_raw0': 'Issued Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Issued: Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Expired Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
def clean_extract_dates(input_dictionary: dict, aggregate_stop_words: list) -> dict:
    """Strip ':' and '.' and remove stopwords from every value.

    Values are split on whitespace and filtered word-by-word, so
    multi-word phrases in `aggregate_stop_words` (e.g. "No Expiration
    Date") are intentionally NOT removed — they survive cleaning so a
    later stage can flag them as ongoing.

    Fixes vs. original: redundant `f'{key}'` lookups, redundant
    membership checks before `replace` (replace is already a no-op when
    the character is absent), and O(n) list membership per word
    (stopwords are now a set, built once).
    """
    container_replica = {}
    stopwords = set(aggregate_stop_words)
    for key, value in input_dictionary.items():
        # Punctuation such as "Issued:" would defeat exact word matching.
        value = value.replace(":", "").replace(".", "")
        kept = [word for word in value.split() if word not in stopwords]
        container_replica[key] = ' '.join(kept)
    return container_replica
output:
clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
# Stopwords to strip before date parsing. NOTE(review): present_all_lang is
# apparently left out on purpose so "Present" survives cleaning and can be
# flagged as ONGOING in a later step — confirm this is intentional.
aggregate_lists = issued_all_lang + no_experience_all_lang + expires_all_lang
saved_dictionary = clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
Related
I know there are many questions regarding this type of sorting; I tried many times by referring to those questions and also by going through the `re` module documentation in Python.
My question is:
class Example(models.Model):
    # NOTE(review): _inherit suggests an Odoo-style model — confirm framework.
    _inherit = 'sorting.example'

    def unable_to_sort(self):
        # Joins the hard-coded birthday list in its ORIGINAL order — no
        # month/day sorting happens here; that is the problem being asked about.
        data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']
        self.update({'list_of_birthday_week': ','.join(r for r in data_list)})
I need to be sorted according to the month & date like:
data_list = ['Anita Oliver Jan 24','Abigail Peterson Jan 25','Ernest Reed Jan 28','Paul Williams Feb 1']
is there any way to achieve this ?
Use a regex to extract the date than use it as key of sorted function.
import re
from datetime import datetime  # required for strptime — missing from the original snippet


# Matches a month abbreviation followed by a 1-2 digit day, e.g. "Jan 25".
pattern = r'(\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D?(?:\d{1,2}\D?))'


def sort_by_date(x):
    """Key function: parse the first 'Mon D' match in *x* into a datetime."""
    return datetime.strptime(re.search(pattern, x).group(0), '%b %d')


out = sorted(data_list, key=sort_by_date)
Output:
>>> out
['Anita Oliver Jan 24',
'Abigail Peterson Jan 25',
'Ernest Reed Jan 28',
'Paul Williams Feb 1']
Input:
>>> data_list
['Abigail Peterson Jan 25',
'Paul Williams Feb 1',
'Ernest Reed Jan 28',
'Anita Oliver Jan 24']
You need to extract the date part from the string, and then turn the date string into a comparable format. For the first task, regexen would be a decent choice here, and for the second part, datetime.strptime would be appropriate:
>>> import re
>>> from datetime import *
>>>
>>> re.search('\w+ \d+$', 'Abigail Peterson Jan 25')
<re.Match object; span=(17, 23), match='Jan 25'>
>>> re.search('\w+ \d+$', 'Abigail Peterson Jan 25')[0]
'Jan 25'
>>>
>>> datetime.strptime('Jan 25', '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
>>> datetime.strptime(re.search('\w+ \d+$', 'Abigail Peterson Jan 25')[0], '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
Then turn that into a callback for list.sort:
>>> data_list.sort(key=lambda i: datetime.strptime(re.search('\w+ \d+$', i)[0], '%b %d'))
['Anita Oliver Jan 24', 'Abigail Peterson Jan 25', 'Ernest Reed Jan 28', 'Paul Williams Feb 1']
You can also use split() to accomplish that.
from datetime import datetime
...
def unable_to_sort(self):
    """Sort the birthday list chronologically by month/day, then store it
    comma-joined under 'list_of_birthday_week'."""
    data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']

    def month_day_key(entry):
        # The last two whitespace-separated tokens are the month
        # abbreviation and the day number.
        month, day = entry.split(" ")[-2:]
        return datetime.strptime(f"{month} {day}", "%b %d")

    ordered = sorted(data_list, key=month_day_key)
    self.update({'list_of_birthday_week': ','.join(ordered)})
You can use sorted function with keys datetime.strptime() and date value.
from datetime import datetime

data_list = ['Anita Oliver Jan 24','Abigail Peterson Jan 25','Ernest Reed Jan 28','Paul Williams Feb 1']
# Tokenise each entry into [first, last, month, day].
k = [x.split() for x in data_list]
# Sort by (parsed month, numeric day). The original compared the day as a
# STRING, which mis-orders e.g. '9' vs '10'; int(x[3]) fixes that.
days_sorted = sorted(k, key=lambda x: (datetime.strptime(x[2], '%b'), int(x[3])))
[['Anita', 'Oliver', 'Jan', '24'],
['Abigail', 'Peterson', 'Jan', '25'],
['Ernest', 'Reed', 'Jan', '28'],
['Paul', 'Williams', 'Feb', '1']]
I am trying to calculate mean i.e moving average every 10sec of data; lets say 1 to 10sec, and 11sec to 20sec etc.
Is the code below right for this? I get an error when using "60sec" in the rolling function. I think it may be due to the "ltt" column, which is of type string; I am converting it to datetime, but the error still occurs.
How to resolve this error? Also how to do the averaging for samples collected every 10sec. This is streaming data coming in, but for testing purpose, I am using the static data in record1.
import pandas as pd
import numpy as np
records1 = [
{'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:07 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:08 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:09 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:10 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:11 2022', 'last': 261},
{'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 262},
{'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:14 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:15 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:16 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:17 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:18 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:19 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:20 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:21 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:22 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:23 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:24 2022', 'last': 260}
]
datalist = []  # accumulates streamed records across calls


def strategy1(record):
    """Append one streamed record, recompute per-10s statistics, print and
    return the resulting DataFrame.

    Fixes vs. original: `rolling('10sec')` inside a groupby-transform
    operated on an integer-indexed Series, which raises
    "window must be an integer" — a time-offset window needs a monotonic
    datetime column/index, supplied here via rolling(on='ltt') after
    sorting. A tumbling 10-second bucket mean (1-10s, 11-20s, ...) is
    added, which is what the question actually asks for.
    """
    global datalist
    datalist.append(record)
    pandas_df = pd.DataFrame(datalist)
    pandas_df['ltt'] = pd.to_datetime(pandas_df['ltt'], format="%a %b %d %H:%M:%S %Y")
    pandas_df['hour'] = pandas_df['ltt'].dt.hour
    pandas_df['minute'] = pandas_df['ltt'].dt.minute
    pandas_df['second'] = pandas_df['ltt'].dt.second
    pandas_df['max'] = pandas_df.groupby('second')['last'].transform("max")
    # Time-based rolling requires monotonic times; duplicates are fine.
    pandas_df = pandas_df.sort_values('ltt')
    pandas_df['ma_10s'] = (
        pandas_df.rolling('10s', on='ltt', min_periods=1)['last'].mean()
    )
    # Tumbling (non-overlapping) 10-second buckets via floored timestamps.
    pandas_df['bucket_mean_10s'] = (
        pandas_df.groupby(pandas_df['ltt'].dt.floor('10s'))['last'].transform('mean')
    )
    print(pandas_df)
    return pandas_df
i don't know how to exactly implement this in your code but i had a kind of similar problem where i had to group each day into 4 hour timeslots. so an approach might be something like this:
# Mean of the 'last' column per (hour, minute, 10-second bucket).
# FIX: column access must be ['last'] — the bare attribute `.last` resolves
# to the GroupBy.last *method*, so `.agg('mean')` on it raised AttributeError.
pandas_df.groupby([pandas_df['ltt'].dt.hour, pandas_df['ltt'].dt.minute, (pandas_df['ltt'].dt.second / 10).astype(int)])['last'].agg('mean')
this should basically give you 6 groups ([0s-9s -> 0], [10s-19s -> 1], etc. for the 3rd groupby index) for every minute of data.
Say I have two python lists as:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res=set()
for i in ListB:
for j in ListA:
if j in i:
res.add(f'{i} {j}')
print (res)
this gives me result as:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
What I think is regular expression could be a handful here but I'm not sure how to approach. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
Values in ListA and ListB are not necessarily to be in order. Therefore for a particular month/year value in ListA, the same month/year value from ListB has to be matched and picked for both 'Sales' and 'Units sold' component and needs to be concatenated.
My main goal here is to get the list which I can use later to generate a statement that I'll be using to write Hive query.
Added more explanation as suggested by #andrew_reece
Assuming no additional edge cases that need taking care of, your original code is not bad, just needs a slight update:
# For each month label in ListA, collect every ListB entry that contains it
# (in ListB order) and join them after the label, space-separated.
List_op = []
for month in ListA:
    matches = [entry for entry in ListB if month in entry]
    List_op.append(" ".join([month] + matches))
List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
# Pair consecutive ListB entries (Sales, Units sold), then prefix each
# pair with its month label from ListA.
paired_b = [sales + " " + units for sales, units in zip(ListB[::2], ListB[1::2])]
print([month + " " + pair for month, pair in zip(ListA, paired_b)])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example I firstly concatenate ListB variables together and then join ListA with this new list.
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice

# Lazily pair each month with the even-indexed (Sales) and odd-indexed
# (Units sold) entries of ListB — no intermediate sliced lists are built.
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [" ".join(triple) for triple in zipper]
I am trying to create a key: value1, value2 dictionary from a byte list and having issues with it.
Here is my list
[b'Expected in April 2018',
b'Murder At Koh E Fiza',
b'34',
b'06 April 2018',
b'Subedar Joginder Singh',
b'0',
b'06 April 2018',
b'Blackmail',
b'86',
b'06 April 2018',
b'Missing',
b'0',
b'13 April 2018',
b'October',
b'59',
b'13 April 2018',
b'Mercury',
b'0',
b'20 April 2018',
b'Omerta',
b'50']
I have tried following code:
b = dict(zip(list[1::3],(list[2::3]+list[0::3])))
but I don't get third value as key value pair.
I have also tried
b = dict(zip(list[1::3],list[2::3]+list[0::3]))
same issue I am getting following output with both of these statements
{b'Murder At Koh E Fiza': b'34', b'Subedar Joginder Singh': b'0',
b'Blackmail': b'86', b'Missing': b'0', b'October': b'59', b'Mercury': b'0',
b'Omerta': b'50'}
I am looking for the following output
b'Murder At Koh E Fiza': b'34',b'Expected in April 2018',
b'Subedar Joginder Singh': b'0',b'06 April 2018',
Please let me know
I think you are looking to associate a list or a tuple with each key in your dictionary. So something like this should work:
dict( zip(list[1::3], zip( list[2::3], list[0::3] ) ))
Which results in
{'Mercury': ('0', '13 April 2018'), 'Murder At Koh E Fiza': ('34', 'Expected in April 2018'), 'October': ('59', '13 April 2018'), 'Missing': ('0', '06 April 2018'), 'Blackmail': ('86', '06 April 2018'), 'Omerta': ('50', '20 April 2018'), 'Subedar Joginder Singh': ('0', '06 April 2018')}
You can use zip alongside with a dict comprehension
a = [b'Expected in April 2018',
     b'Murder At Koh E Fiza',
     b'34',
     b'06 April 2018',
     b'Subedar Joginder Singh',
     b'0',
     b'06 April 2018',
     b'Blackmail',
     b'86',
     b'06 April 2018',
     b'Missing',
     b'0',
     b'13 April 2018',
     b'October',
     b'59',
     b'13 April 2018',
     b'Mercury',
     b'0',
     b'20 April 2018',
     b'Omerta',
     b'50']
# Every consecutive (date, title, rating) triple contributes
# title -> [rating, date]. Overlapping triples also generate spurious keys
# (e.g. b'34'), but lookups by title return the expected pair.
final = {v: [m, k] for k, v, m in zip(a, a[1:], a[2:])}
print(final.get(b'Murder At Koh E Fiza'))  # FIX: closing ')' was missing (SyntaxError)
print(final.get(b'Subedar Joginder Singh'))
output:
[b'34', b'Expected in April 2018']
[b'0', b'06 April 2018']
You can find that out without zipping as well:
# FIX: in the original paste the string literals were hard-wrapped across
# physical lines ('06\nApril 2018', ...), which is a SyntaxError; the list
# is reflowed so every literal is intact.
my_list = ['Expected in April 2018', 'Murder At Koh E Fiza', '34',
           '06 April 2018', 'Subedar Joginder Singh', '0', '06 April 2018',
           'Blackmail', '86', '06 April 2018', 'Missing', '0',
           '13 April 2018', 'October', '59', '13 April 2018', 'Mercury',
           '0', '20 April 2018', 'Omerta', '50']
n = 3
# Chunk the flat list into [date, title, rating] triples.
composite_list = [my_list[x:x+n] for x in range(0, len(my_list), n)]
# Key by title -> [date, rating]. (The comprehension variable is renamed so
# it no longer shadows the chunk size n.)
d = {triple[1]: [triple[0], triple[2]] for triple in composite_list}
I have a large amount of data in CSV Format that looks like this:
(u'Sat Jan 17 18:56:05 +0000 2015', u'anx321', 'RT #ManojHarry27: If India loses 2015 worldcup, Karishma\ntanna will be held responsible !!! #BB8', '0.0453125', '0.325')
(u'Sat Jan 17 18:56:13 +0000 2015', u'FrancisKimberl3', 'Python form imploration overgrowth-the consummative the very best as representing construction upsurge: sDGy', '1.0', '0.39')
(u'Sat Jan 17 18:56:18 +0000 2015', u'AllTechBot', 'RT #ruby_engineer: A workshop on monads with C++14 http://t.co/OKFc91J0QJ #hacker #rubyonrails #python #AllTech', '0.0', '0.0')
(u'Sat Jan 17 18:56:22 +0000 2015', u'python_job', ' JOB ALERT #ITJob #Job #New York - Senior Software Engineer Python Backed by First Round http://t.co/eqVxoMzYMG view full details', '0.245454545455', '0.44595959596')
(u'Sat Jan 17 18:56:23 +0000 2015', u'weepingtaco', 'Python: basic but beautiful', '0.425', '0.5625')
(u'Sat Jan 17 18:56:27 +0000 2015', u'python_IT_jobs', ' JOB ALERT #ITJob #Job #New York - Senior Software Engineer Python Backed by First Round http://t.co/gavWyraNqE view full details', '0.245454545455', '0.44595959596')
(u'Sat Jan 17 18:56:32 +0000 2015', u'accusoftinfoway', 'RT #findmjob: DevOps Engineer http://t.co/NasdBEEnRp #aws #perl #mysql #linux #hadoop #python #Puppet #jobs #hiring #careers', '0.0', '0.0')
(u'Sat Jan 17 18:56:32 +0000 2015', u'accusoftinfoway', 'RT #arnicas: Very useful - end to end deploying python flask on AWS RT #matt_healy: Great tutorial: https://t.co/RsiM09qJsJ #flask #python ', '0.595', '0.375')
(u'Sat Jan 17 18:56:36 +0000 2015', u'denisegregory10', "Oh you can't beat a good 'python' argument! http://t.co/ELo3GvNsuE via #youtube", '0.875', '0.6')
(u'Sat Jan 17 18:56:38 +0000 2015', u'NoSQLDigest', 'RT #KirkDBorne: R and #Python starter code for participating in #BoozAllen #DataScience Bowl: http://t.co/Q5C01eya95 #abdsc #DataSciBowl #B', '0.0', '0.0')
(u'Sat Jan 17 19:00:05 +0000 2015', u'RedditPython', '"academicmarkdown": a Python module for academic writing with Markdown. Haven\'t tried it o... https://t.co/uv8yFaz6cv http://t.co/EhiIIO7uTW', '0.0', '0.0')
(u'Sat Jan 17 19:00:28 +0000 2015', u'shopawol', 'Only 8.5 and 12 left make sure to get yours \nhttp://t.co/4rxmHqP2Qs\n#wdywt #goawol #sneakerheads http://t.co/wACIOdlGwY', '0.166666666667', '0.62962962963')
(u'Sat Jan 17 19:00:31 +0000 2015', u'AuthorBee', "RT #_kevin_ewb_: I know what your girl won't she just wanna kick it like the #WorldCup ", '0.0', '0.0')
(u'Sat Jan 17 19:00:37 +0000 2015', u'g33kmaddy', 'RT #KirkDBorne: R and #Python starter code for participating in #BoozAllen #DataScience Bowl: http://t.co/Q5C01eya95 #abdsc #DataSciBowl #B', '0.0', '0.0')
(u'Sat Jan 17 19:00:45 +0000 2015', u'Altfashion', 'Photo: A stunning photo of Kaoris latex dreams beautiful custom python bra. Photographer: MagicOwenTog... http://t.co/KdWnr3I8xP', '0.675', '1.0')
(u'Sat Jan 17 19:00:46 +0000 2015', u'oh226twt', 'Python programming: Easy and Step by step Guide for Beginners: Learn Python (English Edition) http://t.co/9optdOCrtE 1532', '0.216666666667', '0.416666666667')
(u'Sat Jan 17 19:00:50 +0000 2015', u'DvSpacefest', 'RT #Pomerantz: Potential team in the Learning XPRIZE looking for Python coders. Details: https://t.co/nGgrmYmXCa', '0.0', '1.0')
(u'Sat Jan 17 19:01:04 +0000 2015', u'cun45', 'SPORTS And More: #Cycling #Ciclismo U23 #Portugal #WorldCup team o... http://t.co/FBeqatfu85', '0.5', '0.5')
(u'Sat Jan 17 19:01:12 +0000 2015', u'insofferentexo', 'RT #FISskijumping: Dawid is already at the hill in Zakopane, in a larger than life format! #skijumping #worldcup http://t.co/SDOnxDwfIX', '0.0', '0.5')
(u'Sat Jan 17 19:01:17 +0000 2015', u'beuhe', 'Madrid Tawarkan Khedira ke Dortmund: Real Madrid dikabarkan telah menawarkan Sami Khedira ... http://t.co/R5YCKjECtm #football #worldcup', '0.2', '0.3')
(u'Sat Jan 17 19:01:18 +0000 2015', u'ITJobs_Karen', ' JOB ALERT #ITJob #Job #Paradise Valley - Python / Django Developer http://t.co/0Xn1k0cL5B view full details', '0.35', '0.55')
(u'Sat Jan 17 19:01:22 +0000 2015', u'DonnerBella', 'So confused about #meninist . Monty Python, is that you?', '-0.4', '0.7')
(u'Sat Jan 17 19:01:34 +0000 2015', u'DoggingTeens', '#Dogging,#OutdoorSex,#Sluts,#GangBang,#Stockings,#Uk_Sex: 13 Inch Black Python Being Sucked http://t.co/n9Yv4nhcxo', '-0.166666666667', '0.433333333333')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', 'Soccer-La Liga results and standings: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/3JOOnBQzvG', '0.0', '0.0')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', 'Soccer-La Liga summaries: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/AZgxr5Z9EV', '0.0', '0.0')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', "Soccer-Late Congo goal spoils Equatorial Guinea's party: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/W6Ff4HikxH", '0.0', '0.0')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Ligue 1 top scorers: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/WS2lcZnzKu', '0.5', '0.5')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Pearce answers critics as Forest seal unlikely win: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/Qb5PKuls6z', '0.15', '0.45')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Israeli championship results and standings: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/dce9Qn9oI5', '0.0', '0.0')
(u'Sat Jan 17 19:02:07 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python jweede.recipe.template 1.2.3: Buildout recipe for making files out of Jinja2 templates http://t.co/dgeuuFWf19', '0.0', '0.0')
(u'Sat Jan 17 19:02:07 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python aclhound 1.7.5: ACL Compiler http://t.co/fNOFSYd7FJ', '0.0', '0.0')
(u'Sat Jan 17 19:02:08 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python Flask-Goat 0.2.0: Flask plugin for security and user administration via GitHub OAuth & organization http://t.co/', '0.0', '0.0')
(u'Sat Jan 17 19:02:08 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python filewatch 0.0.6: Python File Watcher http://t.co/fIHLagCqvf', '0.0', '0.0')
(u'Sat Jan 17 19:02:16 +0000 2015', u'HeatherA789', "Programming Python: Start Learning Python Today, Even If You've Never Coded Before (A Beginner's Guide): http://t.co/3Ss4cwCvP6", '0.0', '0.0')
(u'Sat Jan 17 19:02:18 +0000 2015', u'HeatherA789', 'Python: Learn Python in One Day and Learn It Well. Python for Beginners with Hands-on Project.: Python: Learn http://t.co/zvLIpydd6V', '0.0', '0.0')
(u'Sat Jan 17 19:02:26 +0000 2015', u'AlexeiCherenkov', 'It looks like I should learn Python. Do you think I can do this during 3 hours tomorrow? Yes-Rt; No-Fav.', '0.0', '0.0')
(u'Sat Jan 17 19:02:33 +0000 2015', u'cleansheet', "#WorldCup Cricket World Cup: Australia should've picked a leg-spinner and named Steve Smith vice-captain ... http://t.co/kgXgUVbHDd", '0.0', '0.0')
(u'Sat Jan 17 19:02:34 +0000 2015', u'cleansheet', '#WorldCup Younger Northug earns 1st cross-country World Cup victory http://t.co/y7jozMriFG', '0.0', '0.0')
(u'Sat Jan 17 19:02:35 +0000 2015', u'cleansheet', '#WorldCup ICC World Cup 2015: School massacre survivors inspire Pakistan team http://t.co/Tj1jpCZsj6', '0.0', '0.0')
(u'Sat Jan 17 19:02:35 +0000 2015', u'cleansheet', '#WorldCup We Want to Win World Cup for Peshawar Schoolkids: Misbah-ul-Haq http://t.co/RbeBkrv69s', '0.8', '0.4')
(u'Sat Jan 17 19:02:38 +0000 2015', u'world_latest', 'New: Equatorial Guinea 1-1 Congo http://t.co/32sfrrbBOW #follow #worldcup world_latest world_latest', '0.136363636364', '0.454545454545')
(u'Sat Jan 17 19:02:39 +0000 2015', u'FAHAD_CTID', 'RT #fawadiii: #FAHAD_CTID #VeronaPerqukuu Hahaha. Hanw ;) bdw worldcup bhi hai 15 sy :D', '0.483333333333', '0.8')
(u'Sat Jan 17 19:02:43 +0000 2015', u'amazon_mybot', '#3: Python http://t.co/LLzeKQQBon', '0.0', '0.0')
(u'Sat Jan 17 19:02:45 +0000 2015', u'LarryMesast', '#javascript #html5 #UX #Python #agile #DDD', '0.5', '0.75')
(u'Sat Jan 17 19:02:46 +0000 2015', u'washim987', 'RT #anjali_damania: I was angry at #shaziailmi & #thekiranbedi My husband calms me down & says. Haame Worldcup jitna hai. Sirf Pakistan se ', '-0.327777777778', '0.644444444444')
(u'Sat Jan 17 19:03:02 +0000 2015', u'sksh_rana', '"#ManojHarry27: If India loses 2015 worldcup, Karishma\ntanna will be held responsible !!! #BB8"\n#TheFarahKhan #BeingSalmanKhan', '0.0453125', '0.325')
(u'Sat Jan 17 19:03:14 +0000 2015', u't_kohyama', '#_3mame PythonMatlabPython', '0.0', '0.0')
(u'Sat Jan 17 19:03:16 +0000 2015', u'AntonShipulin', '#photo #worldcup #flowerceremony #sprint #Ruhpolding http://t.co/fe9qpiwsqJ', '0.0', '0.0')
(u'Sat Jan 17 19:03:22 +0000 2015', u'karthik_vik', 'RT #ValaAfshar: Highest paying programming languages, ranked by salary:\n\n1 Ruby\n2 Objective C\n3 Python\n4 Java\n\nhttp://t.co/RudytdjFLC http:', '0.0', '0.1')
Right now I plot the data with the following script:
import matplotlib
matplotlib.use('Agg')
from matplotlib.mlab import csv2rec
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pylab import *
from datetime import datetime
import dateutil
from dateutil import parser
import re
import os
import operator
import csv
# --- Plot tweet polarity over time from a CSV dump (Python 2 script) ---
input_filename="test_output.csv"
output_image_namep='polarity.png'
output_image_name2='subjectivity.png'
input_file = open(input_filename, 'r')
# NOTE(review): csv2rec was removed from matplotlib.mlab in matplotlib 3.1 —
# this needs an old matplotlib (or a port to pandas.read_csv) to run.
# Also note the column is registered as the misspelled 'subjectvity' here,
# but read back as 'subjectivity' below — confirm which name is intended.
data = csv2rec(input_file, names=['time', 'name', 'message', 'polarity', 'subjectvity'])
time_list = []
polarity_list = []
''' I am aware there's a much more concise way of doing this'''
for line in data:
    # Raw field looks like "(u'Sat Jan 17 18:56:05 +0000 2015'"; the chain of
    # substitutions below strips punctuation down to a YYYYMMDDHHMM string.
    td = line['time']
    ''' stupid regex '''
    # NOTE(review): '\(\u' only works on Python 2 — in Python 3, '\u' not
    # followed by 4 hex digits is a SyntaxError.
    s = re.sub('\(\u', '', td)
    dtime = parser.parse(s)
    dtime = re.sub('-', '', str(dtime))
    dtime = re.sub(' ', '', dtime)
    dtime = re.sub('\+00:00', '', dtime)
    dtime = re.sub(':', '', dtime)
    dtime = dtime[:-2]  # drop trailing seconds -> YYYYMMDDHHMM
    try:
        subjectivity = float(line['subjectivity'].replace("'", '').replace(")", ''))
    except:
        # NOTE(review): bare except silently swallows the probable KeyError
        # from the 'subjectvity'/'subjectivity' mismatch — narrow and log.
        pass
    # NOTE(review): 'polarity' is never assigned anywhere in this loop (only
    # 'subjectivity' is parsed) — the print/append below raise NameError.
    print dtime, polarity
    time_list.append( str(dtime) )
    polarity_list.append( polarity )
rcParams['figure.figsize'] = 10, 4
rcParams['font.size'] = 8
fig = plt.figure()
plt.plot([time_list], [polarity_list], 'ro')
axes = plt.gca()
axes.set_ylim([-1,1])  # polarity is bounded to [-1, 1]
plt.savefig(output_image_namep)
It ends up looking like:
Which is fine but I would like the X axis to display the date labels correctly. Right now I'm doing some ugly regex to strip the date down to YYYYMMDDHHMM.
What about this:
import time
def format_time_label(original):
    """Reformat 'Sat Jan 17 19:00:50 +0000 2015' as 'YYYYMMDDHHMM'.

    The timezone offset is matched as the fixed text '+0000', so inputs
    with any other offset raise ValueError.
    """
    parsed = time.strptime(original, "%a %b %d %H:%M:%S +0000 %Y")
    return time.strftime('%Y%m%d%H%M', parsed)
Example:
>>> format_time_label('Sat Jan 17 19:00:50 +0000 2015')
'201501171900'
This works only if every date in your data has timezone offset +0000, as there seems to be no code in Python standard library to recognize this.
You can change parsing format expression accordingly to account for leftovers from your data format:
def format_time_label(original):
    """Reformat a raw CSV leftover like "(u'Sat Jan 17 18:56:05 +0000 2015'"
    as 'YYYYMMDDHHMM' — the "(u'" prefix, quote and offset are consumed as
    literal text by the format string."""
    parsed = time.strptime(original, "(u'%a %b %d %H:%M:%S +0000 %Y'")
    return time.strftime('%Y%m%d%H%M', parsed)
>>> format_time_label("(u'Sat Jan 17 18:56:05 +0000 2015'")
'201501171856'