Rename Columns in Python using Regular Expressions - python

I have a data set that has columns for the number of units sold in a given month - the problem being that the monthly units columns are named in MM/yyyy format, meaning that I have 12 columns of units information per record.
So for instance, my data looks like:
ProductID | CustomerID | 04/2018 | 03/2018 | 02/2018 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
What causes this to be problematic is that a new file comes in every month, with the same file name, but different column headers for the units information based on the last 12 months.
What I would like to do, is rename the monthly units columns to Month1, Month2, Month3... based on a simple regex such as ([0-9]*)/([0-9]*) that will result in the output:
ProductID | CustomerID | Month1 | Month2 | Month3 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
I know that this should be possible using Python, but as I have never used Python before (I am an old .Net developer) I honestly have no idea how to achieve this.
I have done a bit of research on renaming columns in Python, but none of the examples mentioned pattern matching to rename a column, e.g.:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
UPDATE: The data that I am showing in my example is only a subset of the columns; in total my data set has 120 columns, only 12 of which need to be renamed, which is why I thought that regex might be the simplest way to go.

import re
# regex pattern
pattern = re.compile("([0-9]*)/([0-9]*)")
# get headers as list
headers = list(df)
# apply regex
months = 1
for index, header in enumerate(headers):
    if pattern.match(header):
        headers[index] = 'Month{}'.format(months)
        months += 1
# set new list as column headers
df.columns = headers
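
As a quick sanity check, here is a minimal demo of the approach on a toy dataframe built from the question's example:

import re
import pandas as pd

df = pd.DataFrame({'ProductID': ['a1032'], 'CustomerID': ['c1576'],
                   '04/2018': [36], '03/2018': [12], '02/2018': [19],
                   'FileDate': ['04/20/2018']})
pattern = re.compile("([0-9]*)/([0-9]*)")
headers = list(df)
months = 1
for index, header in enumerate(headers):
    if pattern.match(header):
        headers[index] = 'Month{}'.format(months)
        months += 1
df.columns = headers
print(list(df))
# ['ProductID', 'CustomerID', 'Month1', 'Month2', 'Month3', 'FileDate']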

If you have some set of names that you want to convert to, then rather than using rename, it might be easier to just pass a new list to the df.columns attribute:
df.columns = ['ProductID', 'CustomerID'] + ['Month{}'.format(i) for i in range(1, 13)] + ['FileDate']
If you want to use rename and you can write a function find_new_name that does the conversion you want for a single name, you can rename an entire list old_names with
df.rename(columns={old_name: find_new_name(old_name) for old_name in old_names})
Or if you have a function that takes a new name and figures out what old name corresponds to it, then it would be
df.rename(columns = {find_old_name(new_name):new_name for new_name in new_names})
You can also do
for new_name in new_names:
    old_name = find_old_name(new_name)
    df[new_name] = df[old_name]
This will copy the data into new columns with the new names rather than renaming, so you can then subset to just the columns you want.
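
A minimal sketch of that subsetting step, assuming new_names is the list of replacement column names and the key columns are the ones from the question's layout:

# keep only the key columns plus the newly copied month columns
keep = ['ProductID', 'CustomerID'] + new_names + ['FileDate']
df = df[keep]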

Since rename can take a function as a mapper, we can define a custom function that returns a new column name in the new format if the old column name matches the regex; otherwise, it returns the column name unchanged. For example,
import re
def mapper(old_name):
    match = re.match(r'([0-9]*)/([0-9]*)', old_name)
    if match:
        return 'Month{}'.format(int(match.group(1)))
    return old_name
df = df.rename(columns=mapper)
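
A quick check of what this produces for the question's headers, reusing the mapper defined above; note that it numbers columns by the month digits (so '04/2018' becomes 'Month4') rather than by position:

import pandas as pd

df = pd.DataFrame(columns=['ProductID', 'CustomerID', '04/2018', '03/2018', '02/2018', 'FileDate'])
print(list(df.rename(columns=mapper).columns))
# ['ProductID', 'CustomerID', 'Month4', 'Month3', 'Month2', 'FileDate']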

Related

Python pandas drop columns if their partial name is in a list or column in pandas

I have the following dataframe called dropthese.
| partname | x1 | x2 | x3....
0 text1_mid1
1 another1_mid2
2 yet_another
And another dataframe called df that looks like this.
text1_mid1_suffix1 | text1_mid1_suffix2 | ... | something_else | another1_mid2_suffix1 | ....
0 .....
1 .....
2 .....
3 .....
I want to drop all the columns from df, if a part of the name is in dropthese['partname'].
So for example, since text1_mid1 is in partname, all columns that contain that partial string should be dropped like text1_mid1_suffix1 and text1_mid1_suffix2.
I have tried,
thisFilter = df.filter(dropthese.partname, regex=True)
df.drop(thisFilter, axis=1)
But I get this error, TypeError: Keyword arguments `items`, `like`, or `regex` are mutually exclusive. What is the proper way to do this filter?
I would use a regex with str.contains (or str.match if you want to restrict to the start of string):
import re
pattern = '|'.join(dropthese['partname'].map(re.escape))
out = df.loc[:, ~df.columns.str.contains(f'({pattern})')]
Output:
something_else
0 ...
Why your command failed
You should pass the pattern to the regex parameter of filter, and use the column names in drop:
pattern = '|'.join(dropthese['partname'].map(re.escape))
thisFilter = df.filter(regex=pattern)
df.drop(thisFilter.columns, axis=1)

Filter 2 header rows and 1 trailer row in 1000s of huge files in pyspark

I have many thousands of huge files in a folder.
Each file has 2 header rows and a trailer row.
file1
H|*|F|*|TYPE|*|EXTRACT|*|Stage_|*|2021.04.18 07:35:26|##|
H|*|TYP_ID|*|TYP_DESC|*|UPD_USR|*|UPD_TSTMP|##|
E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##|
H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##|
S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##|
T|*|3|*|2021.04.18 07:35:43|##|
file 2
H|*|F|*|PA__STAT|*|EXTRACT|*|Folder|*|2021.04.18 07:35:26|##|
H|*|STAT_ID|*|STAT_DESC|*|UPD_USR|*|UPD_TSTMP|##|
A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##|
D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##|
I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##|
L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##|
P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##|
T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##|
U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##|
T|*|7|*|2021.04.18 07:35:55|##|
file3
H|*|K|*|PA_CPN|*|EXTRACT|*|SuccessFactors|*|2021.04.22 23:09:26|##|
H|*|COL_NUM|*|CPNT_TYP_ID|*|CPNT_ID|*|REV_DTE|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##|
40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##|
T|*|3|*|2021.04.22 23:27:17|##|
I am applying a filter on lines starting with H|*| and T|*| but it is rejecting the data for a few rows.
df_cleanse=spark.sql("select replace(replace(replace(value,'~','-'),'|*|','~'),'|##|','') as value from linenumber3 where value not like 'T|*|%' and value not like 'H|*|%'")
I know we can use zipWithIndex, but then I have to read file by file, apply the zip index, and then filter the rows.
for each file:
    df = spark.read.text('file1')
    # Adding an index column: each row gets its row number. Spark distributes
    # the data, so to maintain the order of the data we need to perform this action
    df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
    df_1.createOrReplaceTempView("linenumber")
    spark.sql("select * from linenumber where index > 1 and value.value not like 'T|*|%'")
Please let me know the optimal solution for this. I do not want to run an extensive program; all I need is to just remove 3 lines. Even a regex to remove the rows is fine, as we need to process TBs of files in this format.
Unix commands and sed are ruled out due to the file sizes.
While I wait for your answer, try this to remove the first two lines and the last one:
from pyspark.sql.window import Window
import pyspark.sql.functions as f
df = spark.read.csv('your_path', schema='value string')
df = df.withColumn('filename', f.input_file_name())
df = df.repartition('filename')
df = df.withColumn('index', f.monotonically_increasing_id())
w = Window.partitionBy('filename')
df = (df
      .withColumn('remove', (f.col('index') == f.max('index').over(w)) | (f.col('index') < f.min('index').over(w) + f.lit(2)))
      .where(~f.col('remove'))
      .select('value'))
df.show(truncate=False)
Output
+-------------------------------------------------------------+
|value |
+-------------------------------------------------------------+
|E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##| |
|H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##| |
|S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##| |
|A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##| |
|D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##| |
|I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##| |
|L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##| |
|P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##| |
|T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##||
|U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##| |
|40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##| |
+-------------------------------------------------------------+

Best Practice for repetitive computations

Cog in the Machine:
Data contains the current 12 months of data and is stacked horizontally, with each month's values revised in updates and the new month appended.
ID |Date |Month1_a |Month1_b |Month1_c |Month2_a |Month2_b |Month2_c |Month3_a |Month3_b |Month3_c
## |MM/DD/YYYY |abc |zxy |123 |NULL |zxy |122 |abc |zxy |123
The actual data file has no headers and is ingested downstream as a distinct file per month:
File Month 1, etc.
ID | Date |Month1_a |Month1_b |Month1_c |New Column
## |MM/DD/YYYY |abc |zxy |123 | #
ID | Date |Month2_a |Month2_b |Month2_c |New Column
## |MM/DD/YYYY |NULL |zxy |122 | #
Other than copying the file 12 times, is there any suggestion for reading it once and looping through to create my outputs? I've worked out the logic for Month 1; I'm stuck on how to move to Month 2+.
Was originally thinking Read File > Drop Month 3+ > Drop Month 1 > Run Logic, but I'm not sure if there is a better/best practice.
Thanks.
This will output n CSV files, where n is the number of months in your input data. Hopefully this is what you are after.
import pandas as pd
df = pd.read_csv('my_data.csv', sep='|')
# Strip whitespace from column names
df.columns = [x.strip() for x in df.columns]
# Get a set of months in the data by splitting on _ and removing 'Month' from
# the first part
months = set([x.split('_')[0].replace('Month','') for x in df.columns if 'Month' in x])
# For each numeric month in months, add those columns with that number in it to
# the ID and Date columns and write to a csv with that month number in the csv title
for month in months:
    base_columns = ['ID','Date']
    base_columns.extend([x for x in df.columns if 'Month'+month in x])
    df[base_columns].to_csv(f'Month_{month}.csv', index=False)
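
For the three-month sample above this should write Month_1.csv, Month_2.csv and Month_3.csv; a quick way to verify one of them (file name as generated by the loop):

print(pd.read_csv('Month_1.csv').columns.tolist())
# expected: ['ID', 'Date', 'Month1_a', 'Month1_b', 'Month1_c']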

How to append a new value in a CSV file in Python?

I have a CSV sheet, having data like this:
| not used | Day 1 | Day 2 |
| Person 1 | Score | Score |
| Person 2 | Score | Score |
But with a lot more rows and columns. Every day I get each person's progress as a dictionary where the keys are names and the values are score amounts.
The thing is, sometimes that dictionary will include new people and not include already existing ones. When a new person comes, they should get 0 for every previous day, and when the dict doesn't include an already existing person, that person should get a score of 0 for that day.
My idea of solving this is doing lines = file.readlines() on that CSV file, making a new list of people's names with
for line in lines:
    names.append(line.split(",")[0])
then making a copy of lines (newLines = lines)
and going through the dict's keys, seeing if that person is already in the csv and, if so, appending the value followed by a comma.
But I'm stuck at the part of adding score of 0
Any help or contributions would be appreciated
EXAMPLE: Before I will have this
-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
And I have this dictionary to add
{'Mark': 1750, 'Hannah':1640, 'Brian':1780}
The result should be
-,day1,day2,day3,day4
Mark,1500,0,1660,1750
John,1800,1640,0,0
Peter,1670,1680,1630,0
Hannah,1480,1520,1570,1640
Brian,0,0,0,1780
See how Brian is in the dict but not in the "before" CSV, and he got added with a score of 0 for every other day. I figured out that splitting one line with .split(',') gives a list of N elements, where N - 2 is the number of zero scores to add prior to the first day for that person.
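
A minimal sketch of the plain-csv approach described above, using only the csv module (the file name, function name and day label are just for illustration):

import csv

def add_day(path, day_name, scores):
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    n_days = len(header) - 1                      # number of existing day columns
    header.append(day_name)
    seen = set()
    for row in body:
        seen.add(row[0])
        # an existing person missing from the dict gets 0 for the new day
        row.append(str(scores.get(row[0], 0)))
    for name, score in scores.items():
        if name not in seen:
            # a new person gets 0 for every previous day
            body.append([name] + ['0'] * n_days + [str(score)])
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows([header] + body)

add_day('scores.csv', 'day4', {'Mark': 1750, 'Hannah': 1640, 'Brian': 1780})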
This is easy to do in pandas as an outer join. Read the CSV into a dataframe and generate a new dataframe from the dictionary. The join is almost what you want except that since not-a-number values are inserted for empty cells, you need to fill the NaN's with zero and reconvert everything to integer.
The one potential problem is that the CSV is sorted. You don't simply have the new rows appended to the bottom.
import pandas as pd
import errno
import os
INDEX_COL = "-"
def add_days_score(filename, colname, scores):
    try:
        df = pd.read_csv(filename, index_col=INDEX_COL)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # file doesn't exist, create an empty df
            df = pd.DataFrame([], columns=[INDEX_COL])
            df = df.set_index(INDEX_COL)
        else:
            raise
    new_df = pd.DataFrame.from_dict({colname: scores})
    merged = df.join(new_df, how="outer").fillna(0).astype(int)
    try:
        merged.to_csv(filename + ".tmp", index_label=[INDEX_COL])
    except:
        raise
    else:
        os.rename(filename + ".tmp", filename)
    return merged
#============================================================================
# TEST
#============================================================================
test_file = "this_is_a_test.csv"
before = """-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
"""
after = """-,day1,day2,day3,day4
Brian,0,0,0,1780
Hannah,1480,1520,1570,1640
John,1800,1640,0,0
Mark,1500,0,1660,1750
Peter,1670,1680,1630,0
"""
test_dicts = [
["day4", {'Mark': 1750, 'Hannah':1640, 'Brian':1780}],
]
open(test_file, "w").write(before)
for name, scores in test_dicts:
    add_days_score(test_file, name, scores)
print("want\n", after, "\n")
got = open(test_file).read()
print("got\n", got, "\n")
if got != after:
print("FAILED")

Pandas: Why are my headers being inserted into the first row of my dataframe?

I have a script that collates sets of tags from other dataframes, converts them into a comma-separated string and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row and if it possible for me to not have to read an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header is in read_csv(); if you create your dataframe without using read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the data frame generated by df_empty.
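
For example, if the rows are being pulled from some iterable that still contains the header row, the fix might look like this (source_rows is a hypothetical stand-in for wherever the script gets its rows; concatenating a one-row frame is one safe way to append a dict as a row):

# source_rows[0] is assumed to be the header row, so skip it
for data in source_rows[1:]:
    resultData = pd.concat([resultData, pd.DataFrame([data])], ignore_index=True)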
