Cog in the Machine:
The data contains the current 12 months of data, stacked horizontally. Each month's columns receive revised updates, and a new month's columns are appended.
ID |Date |Month1_a |Month1_b |Month1_c |Month2_a |Month2_b |Month2_c |Month3_a |Month3_b |Month3_c
## |MM/DD/YYYY |abc |zxy |123 |NULL |zxy |122 |abc |zxy |123
The actual data file has no headers and is ingested downstream as a distinct file per month:
File Month 1, etc.
ID | Date |Month1_a |Month1_b |Month1_c |New Column
## |MM/DD/YYYY |abc |zxy |123 | #
ID | Date |Month2_a |Month2_b |Month2_c |New Column
## |MM/DD/YYYY |NULL |zxy |122 | #
Other than copying the file 12 times, is there any suggestion for reading it once and looping through to create my outputs? I've worked out the logic for Month 1; I'm stuck on how to move to Month 2+.
I was originally thinking Read File > Drop Month 3+ > Drop Month 1 > Run Logic, but I'm not sure if there is a better/best practice.
Thanks.
This will output n CSV files, where n is the number of months in your input data. Hopefully this is what you are after.
import pandas as pd
df = pd.read_csv('my_data.csv', sep='|')
# Strip whitespace from column names
df.columns = [x.strip() for x in df.columns]
# Get a set of months in the data by splitting on _ and removing 'Month' from
# the first part
months = set([x.split('_')[0].replace('Month','') for x in df.columns if 'Month' in x])
# For each numeric month in months, add those columns with that number in it to
# the ID and Date columns and write to a csv with that month number in the csv title
for month in months:
    base_columns = ['ID', 'Date']
    base_columns.extend([x for x in df.columns if 'Month' + month in x])
    df[base_columns].to_csv(f'Month_{month}.csv', index=False)
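Since the actual file has no headers (as noted in the question), here is a hedged variant of the read step; the column names below are assumptions based on the layout shown above, so adjust them to the real file:
import pandas as pd

# Assumed layout: ID, Date, then 12 months each with suffixes a, b, c
names = ['ID', 'Date'] + ['Month{}_{}'.format(m, s) for m in range(1, 13) for s in ('a', 'b', 'c')]
df = pd.read_csv('my_data.csv', sep='|', header=None, names=names)
The month extraction and the loop above then work unchanged on this df.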
Related
I'm working on some data preparation for a project I'm involved in. We do most of the work in Databricks, using the underlying Apache Spark for computations on large datasets. Everything is done in PySpark.
My goal is to convert a date variable to a variable yearperiod, which divides the year into 13 periods of 4 weeks (with some exceptions). The value is a concatenation of the year and the period, e.g. yearperiod = 201513 would be the year 2015, period 13.
I have two tables: yp_table which contains start and end dates (Edit: type DateType()) for yearperiods (between 2012 and now, Edit: ~120 rows):
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
....
And I have the actual data table, which contains a Date column (Edit: type StringType()):
+--------+--------+--------+-----+
| Var1| Var2| Date| Var3|
+--------+--------+--------+-----+
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
...
My question: how do I compute a column yearperiod for the data table, by comparing data.Date with both yp_table.start and yp_table.end?
So far I've been able to make it work with regular Python (a solution with list comprehensions), but it proves to be too slow for large datasets. Any help is greatly appreciated!
Edit: for privacy reasons I can't give the actual schemas of the dataframes. I've edited above to include the types of the relevant columns.
Add a column to your data df that contains the dates in a format matching the yp_table, then join the two, filtering on the date intervals. Since the yp_table is small (~120 rows), you can use a broadcast join to speed things up.
import pandas as pd
import pyspark.sql.functions as fun

# Date lookup
start_dates = ["2012-01-16", "2012-01-30", "2012-02-27", "2012-03-26", "2012-04-23", "2012-05-21"]
end_dates = ["2012-01-29", "2012-02-26", "2012-03-25", "2012-04-22", "2012-05-20", "2012-06-17"]
yearperiod = ["201201", "201202", "201203", "201204", "201205", "201206"]
yp_table = spark.createDataFrame(pd.DataFrame({'start': start_dates, 'end': end_dates, 'yearperiod': yearperiod}))

# Data df
dates = ["20120116", "20120130", "20120228", "20120301", "20200101", "20200101", "20200101"]
vals = range(0, len(dates))
data = spark.createDataFrame(pd.DataFrame({'Dates': dates, 'vals': vals}))

# Add a formatted date_str column (yyyy-MM-dd) for joining
data = data.withColumn("date_str", fun.concat_ws("-", data.Dates.substr(1, 4), data.Dates.substr(5, 2), data.Dates.substr(7, 2)))

# Broadcast join the small yp_table into the data table using an interval condition
# (the end dates in yp_table are inclusive, hence <=)
joined = data.join(fun.broadcast(yp_table), (data.date_str >= yp_table.start) & (data.date_str <= yp_table.end))
yp_table.show()
data.show()
joined.show()
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
+----------+----------+----------+
+--------+----+----------+
| Dates|vals| date_str|
+--------+----+----------+
|20120116| 0|2012-01-16|
|20120130| 1|2012-01-30|
|20120228| 2|2012-02-28|
|20120301| 3|2012-03-01|
|20200101| 4|2020-01-01|
|20200101| 5|2020-01-01|
|20200101| 6|2020-01-01|
+--------+----+----------+
+--------+----+----------+----------+----------+----------+
| Dates|vals| date_str| start| end|yearperiod|
+--------+----+----------+----------+----------+----------+
|20120116| 0|2012-01-16|2012-01-16|2012-01-29| 201201|
|20120130| 1|2012-01-30|2012-01-30|2012-02-26| 201202|
|20120228| 2|2012-02-28|2012-02-27|2012-03-25| 201203|
|20120301| 3|2012-03-01|2012-02-27|2012-03-25| 201203|
+--------+----+----------+----------+----------+----------+
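As a possible simplification (a sketch, not part of the original answer; it assumes Spark 2.2+ and that Dates is always a well-formed yyyyMMdd string), to_date can parse the string column directly, continuing from the data and yp_table built above:
import pyspark.sql.functions as fun

# Assumption: Dates is a yyyyMMdd string; to_date turns it into a proper DateType column
data = data.withColumn("date_parsed", fun.to_date(data.Dates, "yyyyMMdd"))
joined = data.join(fun.broadcast(yp_table),
                   (data.date_parsed >= yp_table.start) & (data.date_parsed <= yp_table.end))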
I am trying to select data for multiple IDs within a time range using PySpark.
I have four columns in a Spark dataframe 'event_df':
ID     | Time                         | Event_Start_Date             | Event_End_Date
241856 | 2020-10-18T09:16:49.000+0000 | 2020-11-12T20:15:00.000+0000 | 2020-11-12T20:45:00.000+0000
The 'Time' column contains two months' worth of data for each ID. Different IDs have different event start and end datetimes; however, I want to select data only between 'Event_Start_Date' and 'Event_End_Date'.
I have tried the following, but it doesn't seem to return what I want:
refined_df = event_df.where((col('Time') >= col('Event_Start_Date')) & (col('Time') <= col('Event_End_Date ')))
Not sure why your line isn't working for you, but you can also try using between:
import pyspark.sql.functions as F
data = [(241856, '2020-10-18T09:16:49.000+0000', '2019-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000'),
(241857, '2020-10-18T09:16:49.000+0000', '2020-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000')]
df = spark.sparkContext.parallelize(data).toDF(['ID','Time','Event_Start_Date','Event_End_Date'])
df.show()
df.filter(F.col('Time').between(F.col('Event_Start_Date'), F.col('Event_End_Date'))).show()
returns
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
|241857|2020-10-18T09:16:...|2020-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
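One thing worth checking (an observation on the snippet in the question, not confirmed by the asker): the original where clause references col('Event_End_Date ') with a trailing space inside the string, which would not match the real column name. A minimal sketch of the original approach with the name cleaned up, reusing the df built above:
from pyspark.sql.functions import col

# Assumes the real column name has no trailing space
refined_df = df.where((col('Time') >= col('Event_Start_Date')) & (col('Time') <= col('Event_End_Date')))
refined_df.show()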
I have multiple thousands of huge files in a folder.
Each file has 2 header rows and a trailer row.
file1
H|*|F|*|TYPE|*|EXTRACT|*|Stage_|*|2021.04.18 07:35:26|##|
H|*|TYP_ID|*|TYP_DESC|*|UPD_USR|*|UPD_TSTMP|##|
E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##|
H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##|
S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##|
T|*|3|*|2021.04.18 07:35:43|##|
file2
H|*|F|*|PA__STAT|*|EXTRACT|*|Folder|*|2021.04.18 07:35:26|##|
H|*|STAT_ID|*|STAT_DESC|*|UPD_USR|*|UPD_TSTMP|##|
A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##|
D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##|
I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##|
L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##|
P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##|
T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##|
U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##|
T|*|7|*|2021.04.18 07:35:55|##|
file3
H|*|K|*|PA_CPN|*|EXTRACT|*|SuccessFactors|*|2021.04.22 23:09:26|##|
H|*|COL_NUM|*|CPNT_TYP_ID|*|CPNT_ID|*|REV_DTE|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##|
40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##|
T|*|3|*|2021.04.22 23:27:17|##|
I am applying a filter on lines starting with H|*| and T|*|, but it is rejecting the data for a few rows.
df_cleanse=spark.sql("select replace(replace(replace(value,'~','-'),'|*|','~'),'|##|','') as value from linenumber3 where value not like 'T|*|%' and value not like 'H|*|%'")
I know we can use zipWithIndex, but then I have to read file by file, apply the index, and then filter on the rows:
for file in files:  # pseudocode: one pass per file
    df = spark.read.text(file)
    # Add an index column so each row gets its row number; Spark distributes the
    # data, so we need zipWithIndex to preserve the original line order
    df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
    df_1.createOrReplaceTempView("linenumber")
    spark.sql("select * from linenumber where index > 1 and value.value not like 'T|*|%'")
Please let me know the optimal solution for this. I do not want to run an extensive program; all I need is to remove 3 lines per file. Even a regex to remove the rows is fine, since we need to process TBs of files in this format.
Unix commands and sed are ruled out due to the file sizes.
In the meantime, try this to remove the first two lines and the last line of each file:
from pyspark.sql.window import Window
import pyspark.sql.functions as f

# Read every file in the folder as raw lines and remember which file each line came from
df = spark.read.csv('your_path', schema='value string')
df = df.withColumn('filename', f.input_file_name())
df = df.repartition('filename')

# monotonically_increasing_id is increasing within a partition; after repartitioning
# by filename it serves as a per-file line index for the window below
df = df.withColumn('index', f.monotonically_increasing_id())
w = Window.partitionBy('filename')

# Flag the two header lines (smallest indexes) and the trailer line (largest index)
# of each file, then drop them
df = (df
      .withColumn('remove', (f.col('index') == f.max('index').over(w)) | (f.col('index') < f.min('index').over(w) + f.lit(2)))
      .where(~f.col('remove'))
      .select('value'))

df.show(truncate=False)
Output
+-------------------------------------------------------------+
|value |
+-------------------------------------------------------------+
|E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##| |
|H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##| |
|S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##| |
|A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##| |
|D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##| |
|I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##| |
|L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##| |
|P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##| |
|T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##||
|U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##| |
|40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##| |
+-------------------------------------------------------------+
Given the following example dataframe:
advertiser_id | name    | amount      | total               | max_total_advertiser
4061          | source1 | -434.955284 | -354882.75336200005 | -355938.53950700007
4061          | source2 | -594.012216 | -355476.76557800005 | -355938.53950700007
4061          | source3 | -461.773929 | -355938.53950700007 | -355938.53950700007
I need to sum the amount and the max_total_advertiser fields in order to get the correct total value in each row, and I need this total value for every group partitioned by advertiser_id. (The total column in the initial dataframe is incorrect; that's why I want to recalculate it.)
Something like this is what I had in mind:
w = Window.partitionBy("advertiser_id").orderBy("advertiser_id")
df.withColumn("total_aux", when( lag("advertiser_id").over(w) == col("advertiser_id"), lag("total_aux").over(w) + col("amount") ).otherwise( col("max_total_advertiser") + col("amount") ))
This lag("total_aux") is not working because the column is not generated yet. That is what I want to achieve: if it is the first row in the group, sum the columns of the same row; if not, sum the previously obtained value with the current amount field.
Example output:
advertiser_id | name    | amount      | total_aux
4061          | source1 | -434.955284 | -356373.494791
4061          | source2 | -594.012216 | -356967.507007
4061          | source3 | -461.773929 | -357429.280936
Thanks.
I assume that name is a distinct value for each advertiser_id and your dataset is therefore sortable by name. I also assume that max_total_advertiser contains the same value for each advertiser_id. If one of those is not the case, please add a comment.
What you need is a rangeBetween window, which gives you all preceding and following rows within the specified range. We will use Window.unboundedPreceding, as we want to sum up all the previous values.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
    (4061, 'source1', -434.955284, -354882.75336200005, -355938.53950700007),
    (4061, 'source2', -594.012216, -355476.76557800005, -345938.53950700007),
    (4062, 'source1', -594.012216, -355476.76557800005, -5938.53950700007),
    (4062, 'source2', -594.012216, -355476.76557800005, -5938.53950700007),
    (4061, 'source3', -461.773929, -355938.53950700007, -355938.53950700007)
]
columns = ['advertiser_id', 'name', 'amount', 'total', 'max_total_advertiser']
df = spark.createDataFrame(l, columns)

w = Window.partitionBy('advertiser_id').orderBy('name').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('total', F.sum('amount').over(w) + df.max_total_advertiser)
df.show()
Output:
+-------------+-------+-----------+-------------------+--------------------+
|advertiser_id| name| amount| total|max_total_advertiser|
+-------------+-------+-----------+-------------------+--------------------+
| 4062|source1|-594.012216|-6532.5517230000705| -5938.53950700007|
| 4062|source2|-594.012216| -7126.563939000071| -5938.53950700007|
| 4061|source1|-434.955284| -356373.4947910001| -355938.53950700007|
| 4061|source2|-594.012216| -346967.5070070001| -345938.53950700007|
| 4061|source3|-461.773929|-357429.28093600005| -355938.53950700007|
+-------------+-------+-----------+-------------------+--------------------+
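As a side note on the frame choice (a sketch under the same assumption stated above, that name is unique within each advertiser_id): a row-based frame gives the same running total here, reusing F, Window, and df from the snippet above.
# Equivalent sketch with a row frame; with unique 'name' values per group,
# rowsBetween and rangeBetween produce the same cumulative sum
w_rows = Window.partitionBy('advertiser_id').orderBy('name').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('total', F.sum('amount').over(w_rows) + F.col('max_total_advertiser')).show()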
You might be looking for the orderBy() function. Does this work?
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df.withColumn("cumulativeSum", F.sum(df["amount"])
    .over(Window.partitionBy("advertiser_id").orderBy("amount")))
I have a data set that has columns for number of units sold in a given month - the problem being that the monthly units columns are named in MM/yyyy format, meaning that I have 12 columns of units information per record.
So for instance, my data looks like:
ProductID | CustomerID | 04/2018 | 03/2018 | 02/2018 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
What causes this to be problematic is that a new file comes in every month, with the same file name, but different column headers for the units information based on the last 12 months.
What I would like to do is rename the monthly units columns to Month1, Month2, Month3... based on a simple regex such as ([0-9]*)/([0-9]*), resulting in the output:
ProductID | CustomerID | Month1 | Month2 | Month3 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
I know that this should be possible using Python, but as I have never used Python before (I am an old .Net developer) I honestly have no idea how to achieve this.
I have done a bit of research on renaming columns in Python, but none of the examples I found mentioned pattern matching to rename a column, e.g.:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
UPDATE: The data that I am showing in my example is only a subset of the columns; in total, my data set has 120 columns, only 12 of which need to be renamed. This is why I thought that regex might be the simplest way to go.
import re
# regex pattern
pattern = re.compile("([0-9]*)/([0-9]*)")
# get headers as list
headers = list(df)
# apply regex
months = 1
for index, header in enumerate(headers):
    if pattern.match(header):
        headers[index] = 'Month{}'.format(months)
        months += 1
# set new list as column headers
df.columns = headers
If you have some set names that you want to convert to, then rather than using rename, it might be easier to just pass a new list to the df.columns attribute:
df.columns = ['ProductID', 'CustomerID'] + ['Month{}'.format(i) for i in range(1, 13)] + ['FileDate']
If you want to use rename, and you can write a function find_new_name that does the conversion you want for a single name, you can rename an entire list old_names with:
df.rename(columns={old_name: find_new_name(old_name) for old_name in old_names})
Or if you have a function that takes a new name and figures out what old name corresponds to it, then it would be:
df.rename(columns={find_old_name(new_name): new_name for new_name in new_names})
You can also do:
for new_name in new_names:
    old_name = find_old_name(new_name)
    df[new_name] = df[old_name]
This will copy the data into new columns with the new names rather than renaming, so you can then subset to just the columns you want.
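For instance, a minimal sketch of that subsetting step, assuming a hypothetical new_names list and the column names from the question:
# Hypothetical: new_names holds the generated month columns (Month1, Month2, ...)
keep = ['ProductID', 'CustomerID'] + list(new_names) + ['FileDate']
df = df[keep]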
Since rename can take a function as a mapper, we can define a custom function that returns a column name in the new format if the old column name matches the regex and otherwise returns the name unchanged. For example:
import re
def mapper(old_name):
    match = re.match(r'([0-9]*)/([0-9]*)', old_name)
    if match:
        return 'Month{}'.format(int(match.group(1)))
    return old_name
df = df.rename(columns=mapper)
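A quick usage sketch against the headers from the question (a hypothetical frame, just to show what the mapper above produces; note the number comes from the MM part of each header rather than from the column position):
import pandas as pd

# Hypothetical frame with the example headers
df = pd.DataFrame(columns=['ProductID', 'CustomerID', '04/2018', '03/2018', '02/2018', 'FileDate'])
print(list(df.rename(columns=mapper).columns))
# ['ProductID', 'CustomerID', 'Month4', 'Month3', 'Month2', 'FileDate']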