Related
Hey, so I have this CSV file which is structured in this way: ['message', 'Date', 'Name', 'Location of a train station']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Haarlem']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Amsterdam']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Sittard']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Venlo']
['vies', 'Mon 07 Nov 2022 18:43', 'Mia', 'Helmond']
['Het zou wel wat schoner mogen zijn', 'Tue 08 Nov 2022 00:49', 'Tijmen', 'Hilversum']
['Het zou wel wat schoner mogen zijn', 'Tue 08 Nov 2022 00:49', 'anoniem', 'Roosendaal']
Now I want to insert this information into my PostgreSQL database:
import csv
import psycopg2
with open('C:\\Users\\Danis Porovic\\PycharmProjects\\Module1\\berichten.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    index = list(zip(*csv.reader(csv_file)))
    messages = index[0]
    data = index[1]
    names = index[2]
    stations = index[3]
con = psycopg2.connect(
    host="localhost",
    database="fabriek",
    user="postgres",
    password="DanisMia1")
cur = con.cursor()
cur.execute("insert into klant (naam) values (%s);", (names,))
con.commit()
con.close()
How would I go about successfully inserting all of the names into a column in my database?
The zip call I'm using at the top makes a tuple out of each of the 4 columns. Would inserting tuples even work?
This is what the tuple of names looks like, for example:
('Mia', 'Danis', 'Jeffrey', 'Tim', 'Joppe', 'Tijmen', 'anoniem')
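Inserting the whole tuple of names as a single %s parameter won't insert one row per name (psycopg2 adapts a tuple as a single composite value). A minimal sketch of one way to do it, assuming the klant (naam) table from your code and that each name should become its own row; executemany runs the statement once per parameter tuple:
import csv
import psycopg2

# Sketch: read the CSV, skip the header row, and insert one row per name.
with open('berichten.csv', 'r') as csv_file:           # path shortened; use your full path
    rows = list(csv.reader(csv_file))

header, rows = rows[0], rows[1:]                       # drop ['message', 'Date', 'Name', ...]
names = [(row[2],) for row in rows]                    # one single-element tuple per CSV row

con = psycopg2.connect(host="localhost", database="fabriek",
                       user="postgres", password="...")
cur = con.cursor()
cur.executemany("insert into klant (naam) values (%s);", names)
con.commit()
con.close()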
I am trying to calculate a mean, i.e. a moving average, over every 10 seconds of data; let's say 1 to 10 sec, 11 to 20 sec, etc.
Is the code below right for this? I am getting an error when using "60sec" in the rolling function. I thought it might be because the "ltt" column is of type string, but I am converting it to datetime and the error still occurs.
How do I resolve this error? And how do I do the averaging for samples collected every 10 seconds? This is streaming data coming in, but for testing purposes I am using the static data in records1.
import pandas as pd
import numpy as np
records1 = [
{'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:07 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:08 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:09 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:10 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:11 2022', 'last': 261},
{'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 262},
{'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:14 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:15 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:16 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:17 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:18 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:19 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:20 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:21 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:22 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:23 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:24 2022', 'last': 260}
]
datalist = []
def strategy1(record):
    global datalist
    datalist.append(record)
    pandas_df = pd.DataFrame(datalist)
    pandas_df['ltt'] = pd.to_datetime(pandas_df['ltt'], format="%a %b %d %H:%M:%S %Y")
    pandas_df['hour'] = pandas_df['ltt'].dt.hour
    pandas_df['minute'] = pandas_df['ltt'].dt.minute
    pandas_df['second'] = pandas_df['ltt'].dt.second
    pandas_df['max'] = pandas_df.groupby('second')['last'].transform("max")
    pandas_df["ma_1min"] = (
        pandas_df.sort_values("ltt")
        .groupby(["hour", "minute"])["last"]
        .transform(lambda x: x.rolling('10sec', min_periods=1).mean())
    )
    print(pandas_df)
I don't know exactly how to implement this in your code, but I had a similar problem where I had to group each day into 4-hour timeslots, so an approach might be something like this:
pandas_df.groupby([pandas_df['ltt'].dt.hour, pandas_df['ltt'].dt.minute, (pandas_df['ltt'].dt.second / 10).astype(int)])['last'].agg('mean')
This should basically give you 6 groups per minute of data ([0s-9s -> 0], [10s-19s -> 1], etc. for the 3rd groupby index).
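The rolling error is likely because transform passes each group to the lambda as a Series with a plain integer index, and a time-based window like '10sec' needs a DatetimeIndex. If fixed 10-second buckets are all you need, a minimal sketch using resample instead, assuming the records1 data from the question and that a plain per-bucket mean of 'last' is what you want:
import pandas as pd

# Sketch: bucket the 'last' values into fixed 10-second windows and average each bucket.
df = pd.DataFrame(records1)
df['ltt'] = pd.to_datetime(df['ltt'], format="%a %b %d %H:%M:%S %Y")

means = df.set_index('ltt')['last'].resample('10s').mean()
print(means)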
I'm a bit stuck on a custom function that should be able to parse dates in any given language, with a set of rules able to identify and parse months and years, including some special cases (shown below).
Input data:
dates_eg = [["Issued Feb 2018 · No Expiration Date"], ["Oct 2021 - Present"],
["Jan 2019 - Oct 2021 · 2 yrs 10 mos"], ["1994 - 2000"], ["Sep 2010 – Sep 2015 • 5 yrs"],
["Nov 2019 – Present"], ["Sep 2015 – Mar 2017 • 1 yr 6 mos"], ["Apr 2019 – Present • 2 yrs 8 mos"],
["Issued: Aug 2018 · Expired Aug 2020"], ["Mar 2017 - Dec 2017 · 10 mos"],
["May 2021 - Present · 9 mos"]]
NOTE: No translation API should be used. Instead, I've created lists of FLAG words (in certain languages) and STOPWORDS that can identify all of the cases.
Present, Issued, Expired, No Expiration Date -> all four of these keywords/stopwords are identified through the lists (see below).
Current functions' workflow:
e.g. "Issued Feb 2018 · No Expiration Date"
First, the "raw_dates_extractor" function separates the start date ('Issued Feb 2018') from the end date ('No Expiration Date') on the separator ' · '.
Second, the clean_extract_dates function identifies 'Issued' and gets rid of it.
All of which results in:
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
However, the desired output would go further by extracting/identifying years, months, and the stopwords/flag words (like 'No Expiration Date' or 'Present', which would be flagged as ongoing = 1, otherwise 0).
From the first raw example, "Issued Feb 2018 · No Expiration Date", the desired result would be:
{'start_year': 2018, 'start_month': "Feb", 'end_year': None,
 'end_month': None, 'is_on_going': True}
Complete code attempt:
Lists with the "Expired", "Issued", "No Expiration Date", and "Present" keywords:
import re
import pandas as pd
# "Present" word in all language combination
present_all_lang = ["současnost", "I dag", "Heute", "Present", "actualidad", "aujourd’hui", "वर्तमान"
"Saat ini", "Presente",
"現在", "현재", "Kini", "heden", "Nå", "obecnie", "o momento"
"Prezent", "настоящее время", "nu", "ปัจจุบัน ",
"Kasalukuyan", "Halen", "至今", "現在"]
# "Issued" word in all language combination
issued_all_lang = ["Vydáno", "Udstedt", "Ausgestellt", "Issued", "Expedición", "Émise le", "नव॰", "Diterbitkan"
"Data di rilascio",
"発行日", "발행일", "Dikeluarkan", "Toegekend", "Utstedt", "Wydany", "Emitido em"
"Дата выдачи", "Utfärdat", "ออกเมื่อ",
"Inisyu", "tarihinde verildi", "颁发日期", "發照日期"]
# "No experience date" in all language combination
no_experience_all_lang = ["Bez data vypršení", "Ingen udløbsdato", "Kein Ablaufdatum", "No Expiration Date",
"Sin fecha de vencimiento", "Sans date d’expiration", "समाप्ति की कोई तारीख नहीं",
"Tidak Ada Tanggal Kedaluwarsa", "Nessuna data di scadenza", "有効期限なし"
"만료일 없음",
"Tiada Tarikh Tamat Tempoh", "Geen vervaldatum" "Ingen utløpsdato",
"Brak daty ważności", "Sem data de expiração", "Без истечения срока действия"
"Inget utgångsdatum", "ไม่มีวันหมดอายุ",
"Walang Petsa ng Pagtatapos"
"Sona Erme Tarihi Yok", "长期有效", "永久有效"]
# "Expired" in all lang
expires_all_lang = ["Vyprší", "Udløbet", "Gültig bis", "Expired", "Vencimiento", "Expire le",
"को समाप्त हो गया है", "Kedaluwarsa", "Data di scadenza", "有効期限", " 만료됨", "Tamat Tempoh",
"Verloopt op", "Utløper", "Wygasa", "Expira em", "A expirat la", "Срок действия истек",
"Upphörde att gälla", "หมดอายุเมื่อ", "Mag-e-expire sa", "tarihinde süresi bitiyor", "有效期至", "過期"]
# a colon may appear; get rid of it!
def raw_dates_extractor(list_of_examples: list) -> dict:
    raw_start_date, raw_end_date, merged_start_end_dates = {}, {}, {}
    for row, datum in enumerate(list_of_examples):
        date_splitter = re.split("[·•–-]", datum[0])
        raw_start_date[f'Start_date_raw{row}'] = date_splitter[0].strip(' ')
        raw_end_date[f'End_date_raw{row}'] = date_splitter[1].strip(' ')
        merged_start_end_dates = raw_start_date | raw_end_date
    return merged_start_end_dates
output:
raw_dates_extractor(dates_eg)
{'Start_date_raw0': 'Issued Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Issued: Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Expired Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
def clean_extract_dates(input_dictionary: dict, aggregate_stop_words: list):
    container_replica = {}
    for date_element in input_dictionary:
        query_req = input_dictionary[f'{date_element}']
        if ":" in query_req:
            query_req = query_req.replace(":", "")
        if "." in query_req:
            query_req = query_req.replace(".", "")
        stopwords = aggregate_stop_words
        query_words = query_req.split()
        result_words = [word for word in query_words if word not in stopwords]
        end_result = ' '.join(result_words)
        container_replica[f'{date_element}'] = end_result
    return container_replica
output:
clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
aggregate_lists = issued_all_lang + no_experience_all_lang + expires_all_lang
saved_dictionary = clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
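For the final step, a minimal sketch of one way to turn the cleaned start/end strings into the desired structure. It assumes English three-letter month abbreviations and treats any end string without a recognizable year (e.g. 'Present', 'No Expiration Date') as ongoing; the flag-word lists above could be plugged into that check for other languages.
import re

MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

def split_date(raw: str):
    """Return (year, month) from strings like 'Feb 2018' or '1994'."""
    year_match = re.search(r"\b(19|20)\d{2}\b", raw)
    year = int(year_match.group()) if year_match else None
    month = next((word for word in raw.split() if word in MONTHS), None)
    return year, month

def structure_dates(start_raw: str, end_raw: str) -> dict:
    start_year, start_month = split_date(start_raw)
    end_year, end_month = split_date(end_raw)
    return {"start_year": start_year, "start_month": start_month,
            "end_year": end_year, "end_month": end_month,
            # no parsable end year -> 'Present', 'No Expiration Date', etc.
            "is_on_going": end_year is None}

# structure_dates('Feb 2018', 'No Expiration Date')
# -> {'start_year': 2018, 'start_month': 'Feb', 'end_year': None,
#     'end_month': None, 'is_on_going': True}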
The USPTO site offers public data that updates every week. Every time they release new data, they release it in the form of "delta data" from the last week. I'm trying to download this data using Python so I won't have to do it manually every week.
There are a few weird things happening:
First, browser.page_source holds HTML (but not the right HTML; I checked). But when I pass that HTML (as a string) to BeautifulSoup, soup.current_data is empty.
Second, the HTML that comes back is not the full HTML and does not contain "delta" or that section at all, even though it is in the site's HTML in the browser.
Any ideas on how to get that file to download? I eventually need to call the deltaJsonDownload() JS function.
Code to reproduce:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ped.uspto.gov/peds/'
browser = webdriver.PhantomJS(executable_path='/usr/bin/phantomjs')
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(browser.page_source)
assert('delta' in browser.page_source)
When you analyse the website's network calls, you can see it makes an AJAX request to get all the links for the data to download:
import requests
res = requests.get("https://ped.uspto.gov/api/")
data = res.json()
print(data)
Output:
{'message': None,
'helpText': '{}',
'xmlDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 01:30:57-0400',
'sizeInBytes': 10429068701,
'fileName': 'pairbulk-delta-20200815-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:10-0400',
'sizeInBytes': 100685778,
'fileName': '1900-1919-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:14-0400',
'sizeInBytes': 13877,
'fileName': '1920-1939-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
'sizeInBytes': 93016,
'fileName': '1940-1959-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
'sizeInBytes': 82353484,
'fileName': '1960-1979-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:16-0400',
'sizeInBytes': 5019098918,
'fileName': '1980-1999-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:46-0400',
'sizeInBytes': 33231977060,
'fileName': '2000-2019-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:23:23-0400',
'sizeInBytes': 24313575,
'fileName': '2020-2020-pairbulk-full-20200809-xml',
'updatedFile': False}],
'jsonDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 03:08:00-0400',
'sizeInBytes': 5957650088,
'fileName': 'pairbulk-delta-20200815-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:23-0400',
'sizeInBytes': 66467976,
'fileName': '1900-1919-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:25-0400',
'sizeInBytes': 10100,
'fileName': '1920-1939-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:27-0400',
'sizeInBytes': 69891,
'fileName': '1940-1959-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:29-0400',
'sizeInBytes': 54076774,
'fileName': '1960-1979-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:31-0400',
'sizeInBytes': 3009216952,
'fileName': '1980-1999-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:46-0400',
'sizeInBytes': 18853619536,
'fileName': '2000-2019-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:20:30-0400',
'sizeInBytes': 17518389,
'fileName': '2020-2020-pairbulk-full-20200809-json',
'updatedFile': False}],
'links': [{'rel': 'swagger-api-docs', 'href': '/api-docs'}]}
Parse the JSON and, using these links, you can easily download the file you are looking for. These files are pretty huge, though, so it is better to use a streaming download in requests (see the sketch after the URL list below).
The link you are looking for is the first element in data["jsonDownloadMetadata"].
In order to get the downloadable links, parse the JSON:
data = res.json()
for links in data["jsonDownloadMetadata"]:
    print(f"https://ped.uspto.gov/api/full-download?fileName={links['fileName']}")
Output:
https://ped.uspto.gov/api/full-download?fileName=pairbulk-delta-20200815-json
https://ped.uspto.gov/api/full-download?fileName=1900-1919-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1920-1939-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1940-1959-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1960-1979-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1980-1999-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2000-2019-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2020-2020-pairbulk-full-20200809-json
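As a rough sketch of the streaming download mentioned above: the full-download URL pattern comes from the output, while the chunk size and the local file name are just assumptions.
import requests

# Sketch: stream one of the bulk files to disk instead of loading it all into memory.
file_name = "pairbulk-delta-20200815-json"  # first entry of data["jsonDownloadMetadata"]
url = f"https://ped.uspto.gov/api/full-download?fileName={file_name}"

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open(f"{file_name}.zip", "wb") as fh:  # local file name is an assumption
        # Iterate over the response body in chunks so the whole file never sits in RAM
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            fh.write(chunk)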
I have a large amount of data in CSV format that looks like this:
(u'Sat Jan 17 18:56:05 +0000 2015', u'anx321', 'RT #ManojHarry27: If India loses 2015 worldcup, Karishma\ntanna will be held responsible !!! #BB8', '0.0453125', '0.325')
(u'Sat Jan 17 18:56:13 +0000 2015', u'FrancisKimberl3', 'Python form imploration overgrowth-the consummative the very best as representing construction upsurge: sDGy', '1.0', '0.39')
(u'Sat Jan 17 18:56:18 +0000 2015', u'AllTechBot', 'RT #ruby_engineer: A workshop on monads with C++14 http://t.co/OKFc91J0QJ #hacker #rubyonrails #python #AllTech', '0.0', '0.0')
(u'Sat Jan 17 18:56:22 +0000 2015', u'python_job', ' JOB ALERT #ITJob #Job #New York - Senior Software Engineer Python Backed by First Round http://t.co/eqVxoMzYMG view full details', '0.245454545455', '0.44595959596')
(u'Sat Jan 17 18:56:23 +0000 2015', u'weepingtaco', 'Python: basic but beautiful', '0.425', '0.5625')
(u'Sat Jan 17 18:56:27 +0000 2015', u'python_IT_jobs', ' JOB ALERT #ITJob #Job #New York - Senior Software Engineer Python Backed by First Round http://t.co/gavWyraNqE view full details', '0.245454545455', '0.44595959596')
(u'Sat Jan 17 18:56:32 +0000 2015', u'accusoftinfoway', 'RT #findmjob: DevOps Engineer http://t.co/NasdBEEnRp #aws #perl #mysql #linux #hadoop #python #Puppet #jobs #hiring #careers', '0.0', '0.0')
(u'Sat Jan 17 18:56:32 +0000 2015', u'accusoftinfoway', 'RT #arnicas: Very useful - end to end deploying python flask on AWS RT #matt_healy: Great tutorial: https://t.co/RsiM09qJsJ #flask #python ', '0.595', '0.375')
(u'Sat Jan 17 18:56:36 +0000 2015', u'denisegregory10', "Oh you can't beat a good 'python' argument! http://t.co/ELo3GvNsuE via #youtube", '0.875', '0.6')
(u'Sat Jan 17 18:56:38 +0000 2015', u'NoSQLDigest', 'RT #KirkDBorne: R and #Python starter code for participating in #BoozAllen #DataScience Bowl: http://t.co/Q5C01eya95 #abdsc #DataSciBowl #B', '0.0', '0.0')
(u'Sat Jan 17 19:00:05 +0000 2015', u'RedditPython', '"academicmarkdown": a Python module for academic writing with Markdown. Haven\'t tried it o... https://t.co/uv8yFaz6cv http://t.co/EhiIIO7uTW', '0.0', '0.0')
(u'Sat Jan 17 19:00:28 +0000 2015', u'shopawol', 'Only 8.5 and 12 left make sure to get yours \nhttp://t.co/4rxmHqP2Qs\n#wdywt #goawol #sneakerheads http://t.co/wACIOdlGwY', '0.166666666667', '0.62962962963')
(u'Sat Jan 17 19:00:31 +0000 2015', u'AuthorBee', "RT #_kevin_ewb_: I know what your girl won't she just wanna kick it like the #WorldCup ", '0.0', '0.0')
(u'Sat Jan 17 19:00:37 +0000 2015', u'g33kmaddy', 'RT #KirkDBorne: R and #Python starter code for participating in #BoozAllen #DataScience Bowl: http://t.co/Q5C01eya95 #abdsc #DataSciBowl #B', '0.0', '0.0')
(u'Sat Jan 17 19:00:45 +0000 2015', u'Altfashion', 'Photo: A stunning photo of Kaoris latex dreams beautiful custom python bra. Photographer: MagicOwenTog... http://t.co/KdWnr3I8xP', '0.675', '1.0')
(u'Sat Jan 17 19:00:46 +0000 2015', u'oh226twt', 'Python programming: Easy and Step by step Guide for Beginners: Learn Python (English Edition) http://t.co/9optdOCrtE 1532', '0.216666666667', '0.416666666667')
(u'Sat Jan 17 19:00:50 +0000 2015', u'DvSpacefest', 'RT #Pomerantz: Potential team in the Learning XPRIZE looking for Python coders. Details: https://t.co/nGgrmYmXCa', '0.0', '1.0')
(u'Sat Jan 17 19:01:04 +0000 2015', u'cun45', 'SPORTS And More: #Cycling #Ciclismo U23 #Portugal #WorldCup team o... http://t.co/FBeqatfu85', '0.5', '0.5')
(u'Sat Jan 17 19:01:12 +0000 2015', u'insofferentexo', 'RT #FISskijumping: Dawid is already at the hill in Zakopane, in a larger than life format! #skijumping #worldcup http://t.co/SDOnxDwfIX', '0.0', '0.5')
(u'Sat Jan 17 19:01:17 +0000 2015', u'beuhe', 'Madrid Tawarkan Khedira ke Dortmund: Real Madrid dikabarkan telah menawarkan Sami Khedira ... http://t.co/R5YCKjECtm #football #worldcup', '0.2', '0.3')
(u'Sat Jan 17 19:01:18 +0000 2015', u'ITJobs_Karen', ' JOB ALERT #ITJob #Job #Paradise Valley - Python / Django Developer http://t.co/0Xn1k0cL5B view full details', '0.35', '0.55')
(u'Sat Jan 17 19:01:22 +0000 2015', u'DonnerBella', 'So confused about #meninist . Monty Python, is that you?', '-0.4', '0.7')
(u'Sat Jan 17 19:01:34 +0000 2015', u'DoggingTeens', '#Dogging,#OutdoorSex,#Sluts,#GangBang,#Stockings,#Uk_Sex: 13 Inch Black Python Being Sucked http://t.co/n9Yv4nhcxo', '-0.166666666667', '0.433333333333')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', 'Soccer-La Liga results and standings: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/3JOOnBQzvG', '0.0', '0.0')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', 'Soccer-La Liga summaries: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/AZgxr5Z9EV', '0.0', '0.0')
(u'Sat Jan 17 19:02:03 +0000 2015', u'WorldCupFNH', "Soccer-Late Congo goal spoils Equatorial Guinea's party: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/W6Ff4HikxH", '0.0', '0.0')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Ligue 1 top scorers: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/WS2lcZnzKu', '0.5', '0.5')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Pearce answers critics as Forest seal unlikely win: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/Qb5PKuls6z', '0.15', '0.45')
(u'Sat Jan 17 19:02:04 +0000 2015', u'WorldCupFNH', 'Soccer-Israeli championship results and standings: #FNH #WorldCup #Russia2018 #WC2018 http://t.co/dce9Qn9oI5', '0.0', '0.0')
(u'Sat Jan 17 19:02:07 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python jweede.recipe.template 1.2.3: Buildout recipe for making files out of Jinja2 templates http://t.co/dgeuuFWf19', '0.0', '0.0')
(u'Sat Jan 17 19:02:07 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python aclhound 1.7.5: ACL Compiler http://t.co/fNOFSYd7FJ', '0.0', '0.0')
(u'Sat Jan 17 19:02:08 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python Flask-Goat 0.2.0: Flask plugin for security and user administration via GitHub OAuth & organization http://t.co/', '0.0', '0.0')
(u'Sat Jan 17 19:02:08 +0000 2015', u'Jeff88Ho', 'RT #artwisanggeni: #python filewatch 0.0.6: Python File Watcher http://t.co/fIHLagCqvf', '0.0', '0.0')
(u'Sat Jan 17 19:02:16 +0000 2015', u'HeatherA789', "Programming Python: Start Learning Python Today, Even If You've Never Coded Before (A Beginner's Guide): http://t.co/3Ss4cwCvP6", '0.0', '0.0')
(u'Sat Jan 17 19:02:18 +0000 2015', u'HeatherA789', 'Python: Learn Python in One Day and Learn It Well. Python for Beginners with Hands-on Project.: Python: Learn http://t.co/zvLIpydd6V', '0.0', '0.0')
(u'Sat Jan 17 19:02:26 +0000 2015', u'AlexeiCherenkov', 'It looks like I should learn Python. Do you think I can do this during 3 hours tomorrow? Yes-Rt; No-Fav.', '0.0', '0.0')
(u'Sat Jan 17 19:02:33 +0000 2015', u'cleansheet', "#WorldCup Cricket World Cup: Australia should've picked a leg-spinner and named Steve Smith vice-captain ... http://t.co/kgXgUVbHDd", '0.0', '0.0')
(u'Sat Jan 17 19:02:34 +0000 2015', u'cleansheet', '#WorldCup Younger Northug earns 1st cross-country World Cup victory http://t.co/y7jozMriFG', '0.0', '0.0')
(u'Sat Jan 17 19:02:35 +0000 2015', u'cleansheet', '#WorldCup ICC World Cup 2015: School massacre survivors inspire Pakistan team http://t.co/Tj1jpCZsj6', '0.0', '0.0')
(u'Sat Jan 17 19:02:35 +0000 2015', u'cleansheet', '#WorldCup We Want to Win World Cup for Peshawar Schoolkids: Misbah-ul-Haq http://t.co/RbeBkrv69s', '0.8', '0.4')
(u'Sat Jan 17 19:02:38 +0000 2015', u'world_latest', 'New: Equatorial Guinea 1-1 Congo http://t.co/32sfrrbBOW #follow #worldcup world_latest world_latest', '0.136363636364', '0.454545454545')
(u'Sat Jan 17 19:02:39 +0000 2015', u'FAHAD_CTID', 'RT #fawadiii: #FAHAD_CTID #VeronaPerqukuu Hahaha. Hanw ;) bdw worldcup bhi hai 15 sy :D', '0.483333333333', '0.8')
(u'Sat Jan 17 19:02:43 +0000 2015', u'amazon_mybot', '#3: Python http://t.co/LLzeKQQBon', '0.0', '0.0')
(u'Sat Jan 17 19:02:45 +0000 2015', u'LarryMesast', '#javascript #html5 #UX #Python #agile #DDD', '0.5', '0.75')
(u'Sat Jan 17 19:02:46 +0000 2015', u'washim987', 'RT #anjali_damania: I was angry at #shaziailmi & #thekiranbedi My husband calms me down & says. Haame Worldcup jitna hai. Sirf Pakistan se ', '-0.327777777778', '0.644444444444')
(u'Sat Jan 17 19:03:02 +0000 2015', u'sksh_rana', '"#ManojHarry27: If India loses 2015 worldcup, Karishma\ntanna will be held responsible !!! #BB8"\n#TheFarahKhan #BeingSalmanKhan', '0.0453125', '0.325')
(u'Sat Jan 17 19:03:14 +0000 2015', u't_kohyama', '#_3mame PythonMatlabPython', '0.0', '0.0')
(u'Sat Jan 17 19:03:16 +0000 2015', u'AntonShipulin', '#photo #worldcup #flowerceremony #sprint #Ruhpolding http://t.co/fe9qpiwsqJ', '0.0', '0.0')
(u'Sat Jan 17 19:03:22 +0000 2015', u'karthik_vik', 'RT #ValaAfshar: Highest paying programming languages, ranked by salary:\n\n1 Ruby\n2 Objective C\n3 Python\n4 Java\n\nhttp://t.co/RudytdjFLC http:', '0.0', '0.1')
Right now I plot the data with the following script:
import matplotlib
matplotlib.use('Agg')
from matplotlib.mlab import csv2rec
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pylab import *
from datetime import datetime
import dateutil
from dateutil import parser
import re
import os
import operator
import csv
input_filename="test_output.csv"
output_image_namep='polarity.png'
output_image_name2='subjectivity.png'
input_file = open(input_filename, 'r')
data = csv2rec(input_file, names=['time', 'name', 'message', 'polarity', 'subjectivity'])
time_list = []
polarity_list = []
''' I am aware there's a much more concise way of doing this'''
for line in data:
    td = line['time']
    ''' stupid regex '''
    s = re.sub('\(\u', '', td)
    dtime = parser.parse(s)
    dtime = re.sub('-', '', str(dtime))
    dtime = re.sub(' ', '', dtime)
    dtime = re.sub('\+00:00', '', dtime)
    dtime = re.sub(':', '', dtime)
    dtime = dtime[:-2]
    try:
        polarity = float(line['polarity'].replace("'", '').replace(")", ''))
        subjectivity = float(line['subjectivity'].replace("'", '').replace(")", ''))
    except:
        pass
    print dtime, polarity
    time_list.append( str(dtime) )
    polarity_list.append( polarity )
rcParams['figure.figsize'] = 10, 4
rcParams['font.size'] = 8
fig = plt.figure()
plt.plot([time_list], [polarity_list], 'ro')
axes = plt.gca()
axes.set_ylim([-1,1])
plt.savefig(output_image_namep)
The resulting plot is fine, but I would like the X axis to display the date labels correctly. Right now I'm doing some ugly regex to strip the date down to YYYYMMDDHHMM.
What about this:
import time

def format_time_label(original):
    return time.strftime('%Y%m%d%H%M',
                         time.strptime(original, "%a %b %d %H:%M:%S +0000 %Y"))
Example:
>>> format_time_label('Sat Jan 17 19:00:50 +0000 2015')
'201501171900'
This only works if every date in your data has the timezone offset +0000, as there seems to be nothing in the Python standard library here to recognize the offset.
You can change the parsing format expression to account for the leftovers from your data format:
def format_time_label(original):
    return time.strftime('%Y%m%d%H%M',
                         time.strptime(original, "(u'%a %b %d %H:%M:%S +0000 %Y'"))
>>> format_time_label("(u'Sat Jan 17 18:56:05 +0000 2015'")
'201501171856'
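If the goal is readable date labels rather than a YYYYMMDDHHMM string, another option (not from the answer above) is to plot real datetime objects and let matplotlib format the axis. A minimal sketch with hypothetical sample values in the same timestamp format:
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Hypothetical sample values; in practice these come from the CSV
raw_times = ['Sat Jan 17 18:56:05 +0000 2015', 'Sat Jan 17 19:00:50 +0000 2015']
polarities = [0.0453125, 0.0]

# '+0000' is matched as literal text, as in the answer above
times = [datetime.strptime(t, '%a %b %d %H:%M:%S +0000 %Y') for t in raw_times]

fig, ax = plt.subplots()
ax.plot(times, polarities, 'ro')
ax.set_ylim(-1, 1)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d %H:%M'))  # tick label format
fig.autofmt_xdate()  # rotate the date labels so they don't overlap
fig.savefig('polarity.png')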