Sample of my data:
ID    target
1     {"abc":"xyz"}
2     {"abc":"adf"}
This data was a CSV output that I imported in Python as shown below:
data = pd.read_csv('location', converters={'target': json.loads}, header=None, doublequote=True, encoding='unicode_escape')
data=data.drop(labels=0,axis=0)
data=data.rename(columns={0:'ID',1:'target'})
When I try to parse this data using
df=pd.json_normalize(data['target'])
I get an empty DataFrame (just the index values 0 and 1, with no columns).
You need to change the cells from strings to actual dicts and then your code works.
Try this:
df['target'] = df['target'].apply(json.loads)
df = pd.json_normalize(df['target'])
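For context, here is a minimal, self-contained sketch of that flow, building the frame inline (rather than from the original CSV, whose path isn't shown) so it can be run as-is:
import json
import pandas as pd

# Stand-in for the imported CSV data from the question
data = pd.DataFrame({'ID': ['1', '2'],
                     'target': ['{"abc":"xyz"}', '{"abc":"adf"}']})

# Parse the JSON strings into dicts, then flatten
data['target'] = data['target'].apply(json.loads)
df = pd.json_normalize(data['target'])
print(df)
#    abc
# 0  xyz
# 1  adf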
I have a huge CSV log file (200,000+ entries). I am using chunks to read the file and then appending the chunks to get the entire file as a data frame. Sometimes weird values/invalid formats arrive in the log file. I want to discard/ignore the wrongly formatted data, keep only the correctly formatted rows, and then work on the data frame. Below is a sample of the file:
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"
17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"
Below is the sample code I am using:
MyList=[]
Chunksize=10
for chunk in pd.read_csv(g_log,delimiter=',',usecols=["ddmmyyyy","hh:mm:ss","Function"],index_col=None,low_memory=False,chunksize=Chunksize):
    MyList.append(chunk)
len(MyList)
df1=pd.concat(MyList,axis=0)
latest_Update=df1["hh:mm:ss"].max()
print(type(latest_Update))
print(latest_Update)
Output
<class 'str'>
TLSV12
I want only the time format, exactly "hh:mm:ss", as a string in df1["hh:mm:ss"], so that I can calculate the time difference between the current time and latest_Update.
How can I filter out the invalid values from these columns?
I tried date_parser as well, but got the same output.
custom_date_parser=lambda x:pd.to_datetime(x,errors='coerce',infer_datetime_format=True)
for chunk in pd.read_csv(g_log,delimiter=',',usecols=["ddmmyyyy","hh:mm:ss","Function"],index_col=None,date_parser=custom_date_parser,low_memory=False,chunksize=Chunksize):
I also tried the following, which gives me the data as a Timestamp. It fulfills the requirement, but it takes too long to execute, so it needs to be optimized:
df1["hh:mm:ss"]=df1["hh:mm:ss"].apply(lambda x:pd.to_datetime(x,errors='coerce',infer_datetime_format=True))
This is probably not the solution, but to offer some food for thought...
import pandas as pd
import re
from dateutil import parser
'''
test_text.txt contains:
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"
17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"
'''
# regex pattern to find just date strings like 17Jul2021
pattern = r'^[0-9]{0,2}[A-Za-z]{3}[0-9]{0,4}'
# open test CSV and split on commas
df = pd.read_csv('test_text.txt', sep=",")
# create filter using regex to identify valid lines
filter = df['ddmmyyyy'].str.contains(pattern)
# drop all rows apart from valid ones
df = df[filter]
# combine the date and time columns together
df['Timestamp'] = df['ddmmyyyy'] + ' ' + df['hh:mm:ss']
# MAYBE this is the only line of interest?
# using from dateutil import parser ...format the new Timestamp column to datetime
df['Timestamp'] = [parser.parse(row) for row in df['Timestamp']]
# set column order in list and apply to dataframe
cols = ['Timestamp', 'ddmmyyyy', 'hh:mm:ss', 'FileName', 'Function', 'Bytes', 'MsgText']
df = df[cols]
# display dataframe
print(df)
Outputs a dataframe containing only the valid rows, with the parsed Timestamp column first.
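If reading the 200,000+ row file in chunks is still the plan, here is a rough sketch of applying the same kind of filtering per chunk, so that only valid rows are kept before concatenating. It reuses the question's g_log variable; the chunk size and the time regex are illustrative:
import pandas as pd

Chunksize = 10000
chunks = []
for chunk in pd.read_csv(g_log, delimiter=',',
                         usecols=["ddmmyyyy", "hh:mm:ss", "Function"],
                         chunksize=Chunksize):
    # Keep only rows whose time column really looks like hh:mm:ss
    valid = chunk["hh:mm:ss"].astype(str).str.match(r"^\d{2}:\d{2}:\d{2}$", na=False)
    chunks.append(chunk[valid])

df1 = pd.concat(chunks, ignore_index=True)
latest_Update = df1["hh:mm:ss"].max()  # zero-padded hh:mm:ss strings sort correctly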
I'm trying to filter data that is stored in a .csv file that contains time and angle values and save filtered data in an output .csv file. I solved the filtering part, but the problem is that time is recorded in hh:mm:ss:msmsmsms (12:55:34:500) format and I want to change that to hhmmss (125534) or in other words remove the : and the millisecond part.
I tried using the .replace function but I keep getting the KeyError: 'time' error.
Input data:
time,angle
12:45:55,56
12:45:56,89
12:45:57,112
12:45:58,189
12:45:59,122
12:46:00,123
Code:
import pandas as pd
#define min and max angle values
alpha_min = 110
alpha_max = 125
#read input .csv file
data = pd.read_csv('test_csv3.csv', index_col=0)
#filter by angle size
data = data[(data['angle'] < alpha_max) & (data['angle'] > alpha_min)]
#replace ":" with "" in time values
data['time'] = data['time'].replace(':','')
#display results
print(data)
#write results
data.to_csv('test_csv3_output.csv')
That's because time is an index. You can do this and remove the index_col=0:
data = pd.read_csv('test_csv3.csv')
And change this line to:
data['time'] = pd.to_datetime(data['time']).dt.strftime('%H%M%S')
Output:
time angle
2 124557 112
4 124559 122
5 124600 123
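Putting both changes together, a minimal sketch of the corrected script (same file name and thresholds as in the question):
import pandas as pd

alpha_min = 110
alpha_max = 125

# Read without index_col so 'time' stays a regular column
data = pd.read_csv('test_csv3.csv')

# Filter by angle size
data = data[(data['angle'] < alpha_max) & (data['angle'] > alpha_min)]

# Reformat hh:mm:ss -> hhmmss
data['time'] = pd.to_datetime(data['time']).dt.strftime('%H%M%S')

print(data)
data.to_csv('test_csv3_output.csv')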
What would print(data.keys()) or print(data.head()) yield? It seems like you have a stray character before/after the time index string; this happens from time to time, depending on how the csv was created vs how it was read (see this question).
If it's not a bigger project and/or you just want the data, you could just do a quick workaround like timeKeyString=list(data.columns.values)[0] (assuming time is the first column).
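Along those lines, a small sketch of inspecting and normalizing the column names (stripping any stray whitespace) before indexing by 'time':
import pandas as pd

data = pd.read_csv('test_csv3.csv')
print(data.columns.tolist())          # inspect the actual header names

# Strip stray whitespace from header names, just in case
data.columns = data.columns.str.strip()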
I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
    start_date = '2019-10-08'
    end_date = '2019-10-19'
    query = datastore_client.query(kind='ParticleEvent')
    query.add_filter(
        'published_at', '>', start_date)
    query.add_filter(
        'published_at', '<', end_date)
    query.order = ['-published_at']
    times = query.fetch(limit=limit)
    return times
creates a JSON-like string of the results for each entity returned by the query:
Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
I thought I found something that would translate to JSON, which I could then convert to a dataframe, but I get an error that the properties attribute does not exist:
def to_json(gql_object):
    result = []
    for item in gql_object:
        result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
    return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results to get them into a dataframe either directly to a dataframe or by converting to json then to dataframe?
Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
If you needed to convert the entity key, or any of its attributes to a column as well, you can pack them into the dictionary separately:
for e in entities:
    e['entity_key'] = e.key
    e['entity_key_name'] = e.key.name  # for example
df = pd.DataFrame(entities)
You can use pd.read_json to read your json query output into a dataframe.
Assuming the output is the string that you have shared above, then the following approach can work.
#Extracting the beginning of the dictionary
startPos = line.find("{")
df = pd.DataFrame([eval(line[startPos:-1])])
Output looks like:
gc_pub_sub_id data event published_at \
0 438169950283983 605 light intensity 2019-10-11T14:37:45.407Z
device_id
0 e00fce6847be7713698287a1
Here, line[startPos:-1] is essentially the entire dictionary in the string input. Using eval, we can convert it into an actual dictionary. Once we have that, it can easily be converted into a dataframe object.
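As a side note, if calling eval on arbitrary strings is a concern, ast.literal_eval handles this particular dict-literal string as well; a minimal sketch using the sample entity string from the question:
import ast
import pandas as pd

line = "Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>"

startPos = line.find("{")
record = ast.literal_eval(line[startPos:-1])   # parses the dict literal safely
df = pd.DataFrame([record])
print(df)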
Original poster found a workaround, which is to convert each item in the query result object to string, and then manually parse the string to extract the needed data into a list.
The return value of the fetch function is google.cloud.datastore.query.Iterator, which behaves like a List[dict], so the output of fetch can be passed directly into pd.DataFrame.
import pandas as pd
df = pd.DataFrame(fetch_times(10))
This is similar to #bkitej, but I added the use of the original poster's function.
I'm new to Spark and not quite sure how to ask this (which terms to use, etc.), so here's a picture of what I'm conceptually trying to accomplish:
I have lots of small, individual .txt "ledger" files (e.g., line-delimited files with a timestamp and attribute values at that time).
I'd like to:
Read each "ledger" file into individual data frames (read: NOT combining into one, big data frame);
Perform some basic calculations on each individual data frame, which result in a row of new data values; and then
Merge all the individual result rows into a final object & save it to disk in a line-delimited file.
It seems like nearly every answer I find (when googling related terms) is about loading multiple files into a single RDD or DataFrame, but I did find this Scala code:
val data = sc.wholeTextFiles("HDFS_PATH")
val files = data.map { case (filename, content) => filename}
def doSomething(file: String) = {
  println(file);
  // your logic of processing a single file comes here
  val logData = sc.textFile(file);
  val numAs = logData.filter(line => line.contains("a")).count();
  println("Lines with a: %s".format(numAs));
  // save rdd of single file processed data to hdfs comes here
}
files.collect.foreach( filename => {
  doSomething(filename)
})
... but:
A. I can't tell if this parallelizes the read/analyze operation, and
B. I don't think it provides for merging the results into a single object.
Any direction or recommendations are greatly appreciated!
Update
It seems like what I'm trying to do (run a script on multiple files in parallel and then combine results) might require something like thread pools (?).
For clarity, here's an example of the calculation I'd like to perform on the DataFrame created by reading in the "ledger" file:
from dateutil.relativedelta import relativedelta
from datetime import datetime
from pyspark.sql.functions import to_timestamp
# Read "ledger file"
df = spark.read.json("/path/to/ledger-filename.txt")
# Convert string ==> timestamp & sort
df = (df.withColumn("timestamp", to_timestamp(df.timestamp, 'yyyy-MM-dd HH:mm:ss'))).sort('timestamp')
columns_with_age = ("location", "status")
columns_without_age = ("wh_id",)  # trailing comma makes this a one-element tuple rather than a plain string
# Get the most-recent values (from the last row of the df)
row_count = df.count()
last_row = df.collect()[row_count-1]
# Create an empty "final row" dictionary
final_row = {}
# For each column for which we want to calculate an age value ...
for c in columns_with_age:
    # Initialize loop values
    target_value = last_row.__getitem__(c)
    final_row[c] = target_value
    timestamp_at_lookback = last_row.__getitem__("timestamp")
    look_back = 1
    different = False
    while not different:
        previous_row = df.collect()[row_count - 1 - look_back]
        if previous_row.__getitem__(c) == target_value:
            timestamp_at_lookback = previous_row.__getitem__("timestamp")
            look_back += 1
        else:
            different = True
    # At this point, a difference has been found, so calculate the age
    final_row["days_in_{}".format(c)] = relativedelta(datetime.now(), timestamp_at_lookback).days
Thus, a ledger like this:
+---------+------+-------------------+-----+
| location|status| timestamp|wh_id|
+---------+------+-------------------+-----+
| PUTAWAY| I|2019-04-01 03:14:00| 20|
|PICKABLE1| X|2019-04-01 04:24:00| 20|
|PICKABLE2| X|2019-04-01 05:33:00| 20|
|PICKABLE2| A|2019-04-01 06:42:00| 20|
| HOTPICK| A|2019-04-10 05:51:00| 20|
| ICEXCEPT| A|2019-04-10 07:04:00| 20|
| ICEXCEPT| X|2019-04-11 09:28:00| 20|
+---------+------+-------------------+-----+
Would reduce to (assuming the calculation was run on 2019-04-14):
{ '_id': 'ledger-filename', 'location': 'ICEXCEPT', 'days_in_location': 4, 'status': 'X', 'days_in_status': 3, 'wh_id': 20 }
Using wholeTextFiles is not recommended, as it loads each file fully into memory at once. If you really want to create an individual data frame per file, you can simply use the full path instead of a directory. However, this is not recommended and will most likely lead to poor resource utilisation. Instead, consider using input_file_name: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/functions.html#input_file_name--
For example:
spark
.read
.textFile("path/to/files")
.withColumn("file", input_file_name())
.filter($"value" like "%a%")
.groupBy($"file")
.agg(count($"value"))
.show(10, false)
+----------------------------+------------+
|file |count(value)|
+----------------------------+------------+
|path/to/files/1.txt |2 |
|path/to/files/2.txt |4 |
+----------------------------+------------+
This way the files can be processed individually and then later combined.
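Since the question is in PySpark, here is a rough Python equivalent of the same idea (the input path is illustrative):
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, count, col

spark = SparkSession.builder.getOrCreate()

(spark.read.text("path/to/files")              # one row per line, in column "value"
    .withColumn("file", input_file_name())     # tag each row with its source file
    .filter(col("value").like("%a%"))
    .groupBy("file")
    .agg(count("value"))
    .show(10, truncate=False))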
You could fetch the file paths in HDFS:
import org.apache.hadoop.fs.{FileSystem,Path}
val files=FileSystem.get( sc.hadoopConfiguration ).listStatus( new Path(your_path)).map( x => x.getPath ).map(x=> "hdfs://"+x.toUri().getRawPath())
Create a dataframe for each path:
val arr_df= files.map(spark.read.format("csv").option("delimiter", ",").option("header", true).load(_))
Then apply your filter or any transformation before unioning into one dataframe:
val df= arr_df.map(x=> x.where(your_filter)).reduce(_ union _)
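For a PySpark flavour of the same pattern, a sketch assuming the list of paths has already been collected (the paths and the filter condition are placeholders):
from functools import reduce
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Placeholder list of paths; in practice, list them from HDFS as shown above
paths = ["hdfs:///your_path/file1.csv", "hdfs:///your_path/file2.csv"]

# One dataframe per file
arr_df = [spark.read.option("delimiter", ",").option("header", True).csv(p)
          for p in paths]

# Apply a per-file filter (placeholder condition), then union everything
df = reduce(DataFrame.union,
            (x.where(col("some_column") == "some_value") for x in arr_df))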