create separate new list from one mother list - python

I am trying to write a script that reads a USGS seismic bulletin and extracts some of the data to build a new txt file, which will serve as input for another program called Zmap that does seismic statistics.
The USGS bulletin has the following format:
time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
2016-03-31T07:53:28.830Z,-22.6577,-68.5345,95.74,4.8,mww,,33,0.35,0.97,us,us20005dm3,2016-05-07T05:09:39.040Z,"43km NW of San Pedro de Atacama, Chile",earthquake,6.5,4.3,,,reviewed,us,us
2016-03-31T07:17:19.300Z,-18.779,-67.3104,242.42,4.5,mb,,65,1.987,0.85,us,us20005dlx,2016-04-24T07:21:05.358Z,"55km WSW of Totoral, Bolivia",earthquake,10.2,12.6,0.204,7,reviewed,us,us
The file contains many seismic events, so I wrote the following code, which reads each line, splits it, and saves some of the fields in lists so that I can later put them all together in a final *.txt file.
import os, sys
import csv
import string
from itertools import (takewhile,repeat)
os.chdir('D:\\Seismic_Inves\\b-value_osc\\try_tonino')
# The prompt asks for the name of the bulletin file
archi=raw_input('NOMBRE DEL BOLETIN---> ')
ff=open(archi,'rb')
# Count the data lines (total lines minus the header) by reading the file in 1 MB chunks
bufgen=takewhile(lambda x: x, (ff.read(1024*1024) for _ in repeat(None)))
numdelins= sum(buf.count(b'\n') for buf in bufgen if buf) - 1
with open(archi,'rb') as f:
    next(f)  # skip the header line
    tiempo=[]
    lat=[]
    lon=[]
    prof=[]
    mag=[]
    t_mag=[]
    leo=csv.reader(f,delimiter=',')
    for line in leo:
        tiempo.append(line[0])
        lat.append(line[1])
        lon.append(line[2])
        prof.append(line[3])
        mag.append(line[4])
        t_mag.append(line[5])
tiempo=[s.replace('T', ' ') for s in tiempo] # replace the T with a space
tiempo=[s.replace('Z','') for s in tiempo] # remove the Z
tiempo=[s.replace(':',' ') for s in tiempo] # replace the : with spaces
tiempo=[s.replace('-',' ') for s in tiempo] # replace the - with spaces
From the USGS catalog I'd like to take the latitude (lat), longitude (lon), time (tiempo), depth (prof), magnitude (mag), and type of magnitude (t_mag). With this part of the code I extracted the variables I needed:
next(f)
tiempo=[]
lat=[]
lon=[]
prof=[]
mag=[]
t_mag=[]
leo=csv.reader(f,delimiter=',')
for line in leo:
    tiempo.append(line[0])
    lat.append(line[1])
    lon.append(line[2])
    prof.append(line[3])
    mag.append(line[4])
    t_mag.append(line[5])
But I had some trouble with the time, so I applied my newbie knowledge to turn the time from 2016-03-31T07:53:28.830Z into 2016 03 31 07 53 28.830.
Now I am struggling to get the years in one list ([2016,2016,2016,...]), the months in another ([01,01,...03,03,...12]), the days in another ([12,14,...03,11]), the hours in another ([13,22,14,17...]), and the minutes and seconds joined by a point (.) in another, like [minute.seconds] or [12.234,14.443,...]. I tried the following to split on the spaces, without success:
tiempo2=[]
for element in tiempo:
    tiempo2.append(element.split(' '))
print tiempo2
It did not work, because I got this result instead:
[['2016', '03', '31', '07', '53', '28.830'], ['2016', '03', '31', '07', '17', '19.300'], ...]
Can you give me a hand with this part? Or is there a Pythonic way to split the date as I described above?
Thank you for the time you spent reading it.
best regards.
Tonino

Suppose tiempo2 holds the following values extracted from the CSV:
>>> tiempo2 = [['2016', '03', '31', '07', '53', '28.830'], ['2016', '03', '31', '07', '17', '19.300']]
>>> list (map (list, (map (float, items) if index == 5 else map (int, items) for index, items in enumerate (zip (*tiempo2)))))
[[2016, 2016], [3, 3], [31, 31], [7, 7], [53, 17], [28.83, 19.3]]
Here zip(*tiempo2) transposes the rows, so the years, months, days, and so on are grouped together.
Each group is then mapped to int, except the last one (index 5, the seconds), which is mapped to float.
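If separate, named lists are easier to work with than one nested list, the same transposition can be unpacked directly. This is only a sketch under the assumption that every row has exactly six fields; the list names are illustrative and not taken from the question.

```python
tiempo2 = [['2016', '03', '31', '07', '53', '28.830'],
           ['2016', '03', '31', '07', '17', '19.300']]

# zip(*rows) turns a list of rows into one tuple per column
years, months, days, hours, minutes, seconds = zip(*tiempo2)

# Convert each column to numbers; only the seconds carry a fraction
years = [int(value) for value in years]
months = [int(value) for value in months]
days = [int(value) for value in days]
hours = [int(value) for value in hours]
minutes = [int(value) for value in minutes]
seconds = [float(value) for value in seconds]

print(years)
print(seconds)
```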

I would suggest using the time.strptime() function to parse the time string into a Python time.struct_time which is a namedtuple. That means you can access any attributes you want using . notation.
Here's what I mean:
import time
time_string = '2016-03-31T07:53:28.830Z'
timestamp = time.strptime(time_string, '%Y-%m-%dT%H:%M:%S.%fZ')
print(type(timestamp))
print(timestamp.tm_year) # -> 2016
print(timestamp.tm_mon) # -> 3
print(timestamp.tm_mday) # -> 31
print(timestamp.tm_hour) # -> 7
print(timestamp.tm_min) # -> 53
print(timestamp.tm_sec) # -> 28
print(timestamp.tm_wday) # -> 3
print(timestamp.tm_yday) # -> 91
print(timestamp.tm_isdst) # -> -1
You could process a list of time strings by using a for loop as shown below:
import time
tiempo = ['2016-03-31T07:53:28.830Z', '2016-03-31T07:17:19.300Z']
for time_string in tiempo:
    timestamp = time.strptime(time_string, '%Y-%m-%dT%H:%M:%S.%fZ')
    print('year: {}, mon: {}, day: {}, hour: {}, min: {}, sec: {}'.format(
        timestamp.tm_year, timestamp.tm_mon, timestamp.tm_mday,
        timestamp.tm_hour, timestamp.tm_min, timestamp.tm_sec))
Output:
year: 2016, mon: 3, day: 31, hour: 7, min: 53, sec: 28
year: 2016, mon: 3, day: 31, hour: 7, min: 17, sec: 19
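Note that time.struct_time has no field for fractional seconds, so the .830 and .300 are parsed and then dropped, as the output above shows. If the fraction matters, datetime.datetime.strptime keeps it as microseconds. A minimal sketch, assuming the same two timestamps; the list names are illustrative and not from the question.

```python
from datetime import datetime

tiempo = ['2016-03-31T07:53:28.830Z', '2016-03-31T07:17:19.300Z']

years, months, days, hours, minutes, seconds = [], [], [], [], [], []
for time_string in tiempo:
    dt = datetime.strptime(time_string, '%Y-%m-%dT%H:%M:%S.%fZ')
    years.append(dt.year)
    months.append(dt.month)
    days.append(dt.day)
    hours.append(dt.hour)
    minutes.append(dt.minute)
    # Rebuild the seconds with their fractional part from the microseconds
    seconds.append(dt.second + dt.microsecond / 1e6)

print(years)
print(minutes)
print(seconds)
```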

Another solution with the iso8601 add-on (pip install iso8601)
>>> import iso8601
>>> dt = iso8601.parse_date('2016-03-31T07:17:19.300Z')
>>> dt.year
2016
>>> dt.month
3
>>> dt.day
31
>>> dt.hour
7
>>> dt.minute
17
>>> dt.second
19
>>> dt.microsecond
300000
>>> dt.tzname()
'UTC'
Edited 2017/8/6 12h55
IMHO, it is a bad idea to split the datetime timestamp objects into components (year, month, ...) in individual lists. Keeping the datetime timestamp objects as provided by iso8601.parse_date(...) could help to compute time deltas between events, check the chronological order, ... See the doc of the datetime module for more https://docs.python.org/3/library/datetime.html
Having distinct lists for year, month, (...) would make such operations difficult. Anyway, if you prefer this solution, here are the changes
import csv
import iso8601
# Start as former solution
with open(archi,'rb') as f:
    next(f)
    # tiempo=[]
    dt_years = []
    dt_months = []
    dt_days = []
    dt_hours = []
    dt_minutes = []
    dt_timezones = []
    lat=[]
    lon=[]
    prof=[]
    mag=[]
    t_mag=[]
    leo=csv.reader(f,delimiter=',')
    for line in leo:
        # tiempo.append(line[0])
        dt = iso8601.parse_date(line[0])
        dt_years.append(dt.year)
        dt_months.append(dt.month)
        dt_days.append(dt.day)
        dt_hours.append(dt.hour)
        # Decimal minutes: 60 seconds and 60,000,000 microseconds per minute.
        # The float divisors avoid integer division on Python 2.
        dec_minutes = dt.minute + (dt.second / 60.0) + (dt.microsecond / 60000000.0)
        dt_minutes.append(dec_minutes)
        dt_timezones.append(dt.tzname())
        lat.append(line[1])
        lon.append(line[2])
        prof.append(line[3])
        mag.append(line[4])
        t_mag.append(line[5])
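To illustrate why keeping whole datetime objects can pay off, here is a small sketch that computes the time elapsed between consecutive events. It swaps in the standard library's datetime.strptime for iso8601.parse_date so that it runs without the add-on, and the names events and gaps are made up for the example.

```python
from datetime import datetime

tiempo = ['2016-03-31T07:53:28.830Z', '2016-03-31T07:17:19.300Z']
fmt = '%Y-%m-%dT%H:%M:%S.%fZ'

# datetime objects compare chronologically, so sorted() orders the events in time
events = sorted(datetime.strptime(stamp, fmt) for stamp in tiempo)

# Subtracting two datetimes gives a timedelta; report each gap in seconds
gaps = [(later - earlier).total_seconds()
        for earlier, later in zip(events, events[1:])]

print(gaps)
```

With separate year, month, and day lists, the same calculation would first require reassembling each timestamp.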

Related

How to efficient sort arrays by date with Python 3.6

I have a .csv input file that i am reading using Python 3.6.3, that has the following abbreviated outline
Day,Month,Year,Debit(U.S. Dollars)
1,March,2016,487.00
1,March,2016,27.48
6,February,2016,47.81
9,June,2017,218.55
I am reading in the data using the .csv module such that the first column is read to the variable Day, the second column is read to the variable Month, the third column is read to the variable Year, and the fourth column is read to the variable Debit. Each variable is transformed into a numpy array. When I print the variables I get the following output.
>>> print(Day)
[1 1 6 9]
>>> print(Month)
[March March February June]
>>> print(Year)
[2016 2016 2016 2017]
>>> print(Debit)
[487.00 27.48 47.81 218.55]
I would like to find a way to efficiently sort the arrays by date, which is predicated on the combination of the Day, Month, and Year arrays, such that when printed I get the following results
>>> print(Day)
[6 1 1 9]
>>> print(Month)
[February March March June]
>>> print(Year)
[2016 2016 2016 2017]
>>> print(Debit)
[47.81 487.00 27.48 218.55]
I have considered just having a calendar algorithm walk through every date between the first and last date and passing the data points to a new array, if an expense occurs on that date, but that does not seem like a very efficient method. Does anyone have any idea on a good/efficient way to sort the arrays by date?
One approach would be to prepend a datetime representation of the combined date elements to the start of each row. This makes the rows sort correctly by date. The sorted list of rows can then be converted into a list of columns using zip(*rows):
from datetime import datetime
import csv
data = []
with open('input.csv', newline='') as f_input:
    csv_reader = csv.reader(f_input)
    header = next(csv_reader)
    for row in csv_reader:
        data.append([datetime.strptime('{} {} {}'.format(*row[:3]), '%d %B %Y')] + row)
sorted_cols = list(zip(*sorted(data)))
print("Days", sorted_cols[1])
print("Months", sorted_cols[2])
print("Years", sorted_cols[3])
print("Debits", sorted_cols[4])
This would give you:
Days ('6', '1', '1', '9')
Months ('February', 'March', 'March', 'June')
Years ('2016', '2016', '2016', '2017')
Debits ('47.81', '27.48', '487.00', '218.55')
Since each row (not column) is one data entry, I might consider reading by row and not column. But if you don't have control over that, you can convert everything to datetime objects, sort by that and then overwrite your existing arrays:
from datetime import datetime
entries = []
for i, day in enumerate(Day):
    debit = Debit[i]
    # Month holds month names such as 'March', so build the date by parsing text
    time = datetime.strptime('{} {} {}'.format(day, Month[i], Year[i]), '%d %B %Y')
    entries.append([time, debit])
# Sort by the datetime stored in position 0 of each entry
entries.sort(key=lambda x: x[0])
# At this point you can either just use the entries array for your purposes
# or re-create your newly-sorted arrays using list comprehensions
Day = [entry[0].day for entry in entries]
Month = [entry[0].strftime('%B') for entry in entries]  # month names; use entry[0].month for numbers
Year = [entry[0].year for entry in entries]
Debit = [entry[1] for entry in entries]
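Another way to keep several parallel sequences in step is to sort a list of indices by date and then apply that order to every sequence. The sketch below uses plain lists with the sample values from the question, and it assumes an English locale so that '%B' recognizes the month names; the names dates and order are illustrative.

```python
from datetime import datetime

Day = [1, 1, 6, 9]
Month = ['March', 'March', 'February', 'June']
Year = [2016, 2016, 2016, 2017]
Debit = [487.00, 27.48, 47.81, 218.55]

# One datetime per row, built from the three date columns
dates = [datetime.strptime('{} {} {}'.format(d, m, y), '%d %B %Y')
         for d, m, y in zip(Day, Month, Year)]

# sorted() is stable, so rows that share a date keep their original order
order = sorted(range(len(dates)), key=lambda i: dates[i])

# Apply the same permutation to every column
Day = [Day[i] for i in order]
Month = [Month[i] for i in order]
Year = [Year[i] for i in order]
Debit = [Debit[i] for i in order]

print(Day)
print(Month)
print(Year)
print(Debit)
```

With NumPy arrays the same idea is usually expressed with an argsort over a date array followed by fancy indexing.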

Sorting by month-year groups by month instead

I have a curious Python problem.
The script takes two CSV files: one with a column of dates and a column of text snippets, and another with a list of names (substrings).
All the code does is step through both lists, building up a matrix of how often each name is mentioned per month.
FILE with dates and text: (Date, Snippet first column)
ENTRY 1 : Sun 21 nov 2014 etc, The release of the iphone 7 was...
-strings file
iphone 7
apple
apples
innovation etc.
The problem is that when I try to order the output so that the columns follow in ascending order (e.g. oct-2014, nov-2014, dec-2014 and so on), it groups the months together instead, which isn't what I want.
import csv
from datetime import datetime
file_1 = input('Enter first CSV name (one with the date and snippet): ')
file_2 = input('Enter second CSV name (one with the strings): ')
outp = input('Enter the output CSV name: ')
file_1_list = []
head = True
for row in csv.reader(open(file_1, encoding='utf-8', errors='ignore')):
    if head:
        head = False
        continue
    date = datetime.strptime(row[0].strip(), '%a %b %d %H:%M:%S %Z %Y')
    date_str = date.strftime('%b %Y')
    file_1_list.append([date_str, row[1].strip()])
file_2_dict = {}
for line in csv.reader(open(file_2, encoding='utf-8', errors='ignore')):
    s = line[0].strip()
    for d in file_1_list:
        if s.lower() in d[1].lower():
            if s in file_2_dict.keys():
                if d[0] in file_2_dict[s].keys():
                    file_2_dict[s][d[0]] += 1
                else:
                    file_2_dict[s][d[0]] = 1
            else:
                file_2_dict[s] = {
                    d[0]: 1
                }
months = []
for v in file_2_dict.values():
    for k in v.keys():
        if k not in months:
            months.append(k)
months.sort()
rows = [[''] + months]
for k in file_2_dict.keys():
    tmp = [k]
    for m in months:
        try:
            tmp.append(file_2_dict[k][m])
        except:
            tmp.append(0)
    rows.append(tmp)
print("still working on it be patient")
writer = csv.writer(open(outp, "w", encoding='utf-8', newline=''))
for r in rows:
    writer.writerow(r)
print('Done...')
From my understanding, months.sort() isn't doing what I expect it to?
I have looked here, where they apply another function to sort the data, using attrgetter,
from operator import attrgetter
>>> l = [date(2014, 4, 11), date(2014, 4, 2), date(2014, 4, 3), date(2014, 4, 8)]
and then
sorted(l, key=attrgetter('month'))
But I am not sure whether that would work for me?
From my understanding, I parse the dates on lines 12-13 of the script. Am I missing a step that orders the data first, like
data = sorted(data, key = lambda row: datetime.strptime(row[0], "%b-%y"))
I have only just started learning Python, and so many things are new to me that I don't know what is right and what isn't.
What I want(of course with the correctly sorted data):
This took a while because you had so much unrelated stuff about reading csv files and finding and counting tags. But you already have all that, and it should have been completely excluded from the question to avoid confusing people.
It looks like your actual question is "How do I sort dates?"
Of course "Apr-16" comes before "Oct-14", didn't they teach you the alphabet in school? A is the first letter! I'm just being silly to emphasize a point -- it's because they are simple strings, not dates.
You need to convert the string to a date with the datetime class method strptime, as you already noticed. Because the class has the same name as the module, you need to pay attention to how it is imported. You then go back to a string later with the member method strftime on the actual datetime (or date) instance.
Here's an example:
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
unsorted_dates = [datetime.strptime(value, '%b-%y') for value in unsorted_strings]
sorted_dates = sorted(unsorted_dates)
sorted_strings = [value.strftime('%b-%y') for value in sorted_dates]
print(sorted_strings)
['Oct-14', 'Dec-15', 'Apr-16']
or skipping to the end
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
print (sorted(unsorted_strings, key = lambda x: datetime.strptime(x, '%b-%y')))
['Oct-14', 'Dec-15', 'Apr-16']
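Applied to the script in the question, the same idea would mean giving months.sort() a key. The labels there are produced with strftime('%b %Y'), so the key has to parse that format. A minimal sketch with made-up labels, assuming an English locale for the month abbreviations:

```python
from datetime import datetime

# Hypothetical month labels in the '%b %Y' format the question's script produces
months = ['Nov 2014', 'Apr 2015', 'Oct 2014', 'Dec 2014']

# A plain months.sort() would order these alphabetically ('Apr 2015' first);
# parsing each label makes the sort chronological instead
months.sort(key=lambda label: datetime.strptime(label, '%b %Y'))

print(months)
```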

How to distinguish between date, month and year in the string?

I have a question: How to use "strip" function to slice a date like "24.02.1999"?
The output should be like this '24', '02', '1999'.
Can you help to solve this?
You can do like this
>>> stri="24.02.1999"
>>> stri.split('.')
['24', '02', '1999']
>>>
strip is used to remove the characters. What you meant is split. For your code,
date = input('Enter date in the format (DD.MM.YYYY) : ')
dd, mm, yyyy = date.strip().split('.')
print('day = ',dd)
print('month = ',mm)
print('year = ',yyyy)
Output:
Enter date in the format (DD.MM.YYYY) : 24.02.1999
day = 24
month = 02
year = 1999
You need to use split() not strip().
strip() is used to remove the specified characters from a string.
split() is used to split the string to list based on the value provided.
date = str(input()) # reading input date in dd.mm.yyyy format
splitted_date = date.split('.') # splitting date
day = splitted_date[0] # storing day
month = splitted_date[1] # storing month
year = splitted_date[2] # storing year
# Display the values
print('Day : ',day)
print('Month : ',month)
print('Year : ',year)
You can split date given in DD.MM.YYYY format like this.
Instead of splitting the string, you should be using datetime.strptime(..) to convert the string to the datetime object like:
>>> from datetime import datetime
>>> my_date_str = "24.02.1999"
>>> my_date = datetime.strptime(my_date_str, '%d.%m.%Y')
Then you can access the values you desire as:
>>> my_date.day # For day
24
>>> my_date.month # For month
2
>>> my_date.year # For year
1999
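One practical difference from split('.') is that strptime also validates the input: a malformed or impossible date raises ValueError instead of quietly producing three strings. A small sketch of how that might be wrapped; the helper name parse_date is hypothetical.

```python
from datetime import datetime

def parse_date(text):
    """Return (day, month, year) as ints, or None if text is not a valid DD.MM.YYYY date."""
    try:
        parsed = datetime.strptime(text.strip(), '%d.%m.%Y')
    except ValueError:
        # Raised for a wrong format as well as for dates that do not exist
        return None
    return parsed.day, parsed.month, parsed.year

print(parse_date('24.02.1999'))
print(parse_date('31.02.1999'))
```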
Here you go
date="24.02.1999"
[dd,mm,yyyy] = date.split('.')
output=(("'%s','%s','%s'") %(dd,mm,yyyy))
print(output)
alternate way
date="24.02.1999"
dd=date[0:2]
mm=date[3:5]
yyyy=date[6:10]
newdate=(("'%s','%s','%s'") %(dd,mm,yyyy))
print(newdate)
one more alternate way
from datetime import datetime
date="24.02.1999"
date=datetime.strptime(date, '%d.%m.%Y')
date=(("'%s','%s','%s'") %(date.day,date.month,date.year))
print(date)
Enjoy

Iterate through column of dates in csv to calculate growth rate of variables every 30 days

I have a CSV file, which has a column of dates and another column for number of Twitter followers. I would like to calculate the month over month growth rate of Twitter followers, but the dates may not be an even 30 days apart. So, if I have
2016-03-10 with 200 followers
2016-02-08 with 195 followers
2016-01-01 with 105 followers
How can I iterate through this to generate the month over month growth rate? I've tried working with dateutil's rrule with pandas but am having difficulty. I thought about using R for this, but I'd rather do it in Python as I will output the data into a new CSV from Python.
My team and I used the below function to solve this challenge.
The code below:
def compute_mom(data_list):
    list_tuple = zip(data_list[1:],data_list)
    raw_mom_growth_rate = [((float(nxt) - float(prev))/float(prev))*100 for nxt, prev in list_tuple]
    return [round(mom, 2) for mom in raw_mom_growth_rate]
Hope this helps..
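As a quick check, here is the function applied to the three follower counts from the question, listed oldest first. Note the assumption built into it: consecutive entries are treated as consecutive periods, and the dates themselves are never looked at, so uneven gaps between measurements are ignored.

```python
def compute_mom(data_list):
    # Pair each value with the one before it: (next, previous)
    pairs = zip(data_list[1:], data_list)
    growth = [((float(nxt) - float(prev)) / float(prev)) * 100 for nxt, prev in pairs]
    return [round(rate, 2) for rate in growth]

# Follower counts from the question, oldest first
followers = [105, 195, 200]
rates = compute_mom(followers)
print(rates)
```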
Here's an approach with a defaultdict
import csv
from collections import defaultdict
from datetime import datetime
path = "C:\\Users\\USER\\Desktop\\YOUR_FILE_HERE.csv"
with open(path, "r") as f:
    d = defaultdict(int)
    rows = csv.reader(f)
    for dte, followers in rows:
        dte = datetime.strptime(dte, "%Y-%m-%d")
        d[dte.year, dte.month] += int(followers)
print d
to_date_followers = 0
for (year, month) in sorted(d):
    # Keys of d are (year, month), so the previous month must be looked up in that order
    last_year_and_month = (year-1, 12) if month == 1 else (year, month-1)
    old_followers = d.get(last_year_and_month, 0)
    new_followers = d[year, month]
    to_date_followers += new_followers
    print "%d followers gained in %s, %s resulting in a %.2f%% increase from %s (%s followers to date)" % (
        new_followers-old_followers, month, year, new_followers*100.0/to_date_followers, ', '.join(str(x) for x in reversed(last_year_and_month)), to_date_followers
    )
For input below:
2015-12-05,10
2015-12-31,10
2016-01-01,105
2016-02-08,195
2016-03-01,200
2016-03-10,200
2017-03-01,200
It prints:
defaultdict(<type 'int'>, {(2015, 12): 20, (2016, 1): 105, (2016, 3): 400,
(2017, 3): 200, (2016, 2): 195})
20 followers gained in 12, 2015 resulting in a 100.00% increase from 11, 2015 (20 followers to date)
85 followers gained in 1, 2016 resulting in a 84.00% increase from 12, 2015 (125 followers to date)
90 followers gained in 2, 2016 resulting in a 60.94% increase from 1, 2016 (320 followers to date)
205 followers gained in 3, 2016 resulting in a 55.56% increase from 2, 2016 (720 followers to date)
200 followers gained in 3, 2017 resulting in a 21.74% increase from 2, 2017 (920 followers to date)
thank you SO MUCH for replying. I was able to devise the following code, which accomplishes what I am looking for (I did not expect to be able to do this but happened to find the right functions at the right time):
import pandas as pd
df = pd.read_csv('file_name.csv', sep=',')
# This converts our date strings to datetime objects
df['Date'] = pd.to_datetime(df['Date'])
# But we only want the date, so we strip the time part
df['Date'] = df['Date'].dt.date
# This allows us to iterate through the rows in a pandas dataframe
for index, row in df.iterrows():
    if index == 0:
        start_date = df.iloc[0]['Date']
        Present = df.iloc[0]['Count']
        continue
    # This assigns the date of the row to the variable end_date
    end_date = df.iloc[index]['Date']
    # Rows are ordered newest first, so this difference is positive
    delta = start_date - end_date
    # If the number of days is >= 30 (delta is a timedelta, so compare its .days)
    if delta.days >= 30:
        print("Start Date: {}, End Date: {}, delta is {}".format(start_date, end_date, delta))
        Past = df.iloc[index]['Count']
        # float() avoids integer division on Python 2
        percent_change = (float(Present - Past) / Past) * 100
        # DataFrame.set_value was removed in pandas 1.0; .loc does the same job
        df.loc[index, 'MoM'] = percent_change
        # Sets a new start date and new TW FW count
        start_date = df.iloc[index]['Date']
        Present = df.iloc[index]['Count']

How to split dates from weather data file

I am new to Python and I would greatly appreciate some help.
I have data generated by a weather station (rawdate), with dates in the format 2015-04-26 00:00:48, like this:
Date,Ambient Temperature (C),Wind Speed (m/s)
2015-04-26 00:00:48,10.75,0.00
2015-04-26 00:01:48,10.81,0.43
2015-04-26 00:02:48,10.81,0.32
I would like to split the dates into year, month, day, hour and minute. My attempt so far is this:
for i in range(len(rawdate)):
    x=rawdate[1].split()
    date.append(x)
But it gives me a list full of empty lists. My goal is to convert this into a list of lists (using split), where each entry x has the form [date, time]. Then I want to split further on "-" and ":". Can someone offer some advice?
>>> from datetime import datetime
>>> str_date = '2015-04-26 00:00:48'
>>> datte = datetime.strptime(str_date, '%Y-%m-%d %H:%M:%S')
>>> t = datte.timetuple()
>>> y, m, d, h, min, sec, wd, yd, i = t
>>> y
2015
>>> m
4
>>> min
0
>>> sec
48
Your code is clearly broken, because the loop does nothing except repeat the same operation on rawdate[1], len(rawdate) times.
It's possible that you meant i where you have 1.
For this to make sense, your rawdate would have to be a list of strings (as suggested by @SuperBiasedMan).
Maybe something close to what you were after is this:
>>> dates = []
>>> rawdates = ['2015-04-26 00:00:48', '2015-04-26 00:00:49']
>>> for i in range(len(rawdates)):
...     the_date = rawdates[i].split()
...     dates.append(the_date)
...
>>> dates
[['2015-04-26', '00:00:48'], ['2015-04-26', '00:00:49']]
>>>
Always use meaningful names.
If rawdate is a single string rather than a list, rawdate[1] will always return '0', because '2015...'[1] is '0'.
>>> a = '2015-04-26 00:00:48'
>>> print(a.split(' ')[0].split('-') + a.split(' ')[1].split(':'))
['2015', '04', '26', '00', '00', '48']
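For a whole list of readings, the parsing approach from the first answer can be put in a loop. This is only a sketch: it assumes rawdate is a list of strings in the format shown in the question, and the name parts is made up for the example.

```python
from datetime import datetime

# Assumed: rawdate is a list of timestamp strings read from the weather file
rawdate = ['2015-04-26 00:00:48', '2015-04-26 00:01:48', '2015-04-26 00:02:48']

parts = []
for stamp in rawdate:
    dt = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    # Keep year, month, day, hour and minute as integers
    parts.append([dt.year, dt.month, dt.day, dt.hour, dt.minute])

print(parts)
```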
