Allow duplicate columns in Pandas - python

I'm splitting a large CSV (containing stock financial data) file into smaller chunks. The format of the CSV file is different. Something like an Excel pivot table. The first few rows of the first column contain some headers.
Company name, id, etc. are repeated across the following columns. Because one single company has more than one attribute, not like one company has one column only.
After the first few rows, the columns then start resembling a typical data frame where headers are in columns instead of rows.
Anyways, what I'm trying to do is to make Pandas allow duplicate column headers and not make it add ".1", ".2", ".3", etc after the headers. I know Pandas does not allow this natively, is there a workaround? I tried to set header = None on read_csv but it throws a tokenization error which I think makes sense. I just can't think of an easy way.
import pandas as pd
csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
len(df.columns), len(df)
))
filename = 1
#column increment
x = 30 * 59
for column in df:
loc = df.columns.get_loc(column)
if loc == (x * filename) + 1:
y = filename - 1
a = (x * y) + 1
b = (x * filename) + 1
date_df = df.iloc[:, :1]
out_df = df.iloc[:, a:b]
final_df = pd.concat([date_df, out_df], axis=1, join='inner')
out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
final_df.to_csv(out_path, index=False)
#out_df.to_csv(out_path)
filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
EDIT:
From, https://github.com/pandas-dev/pandas/issues/19383, I add:
final_df.columns = final_df.iloc[0]
final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
So, full code:
import pandas as pd
csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
len(df.columns), len(df)
))
filename = 1
#column increment
x = 30 * 59
for column in df:
loc = df.columns.get_loc(column)
if loc == (x * filename) + 1:
y = filename - 1
a = (x * y) + 1
b = (x * filename) + 1
date_df = df.iloc[:, :1]
out_df = df.iloc[:, a:b]
final_df = pd.concat([date_df, out_df], axis=1, join='inner')
out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
final_df.columns = final_df.iloc[0]
final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
final_df.to_csv(out_path, index=False)
#out_df.to_csv(out_path)
filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
Now, the entire first row is gone. But, the expected output is for the header row to be replaced with the reset index, without the ".1", ".2", etc.
Screenshot:
The SimFin ID row is no longer there.

This is how I did it:
final_df.columns = final_df.columns.str.split('.').str[0]
Reference:
https://pandas.pydata.org/pandas-docs/stable/text.html

Below solution would ensure that other column names with symbol period ('.') in the dataframe do not get modified
import pandas as pd
from csv import DictReader
csv_file_loc = "file.csv"
# Read csv
df = pd.read_csv(csv_file_loc)
# Get column names from csv file using DictReader
col_names = DictReader(open(csv_file_loc, 'r')).fieldnames
# Rename columns
df.columns = col_names

I know I'm pretty late to the draw on this one, but I'm leaving the solution I came up with in case anyone else wanders across this as I have.
Firstly, the linked question has a pretty nice and dynamic solution that seems to work well even for high column counts. I came across that after I made my solution, haha. Check it out here. Another answer on this thread utilizes the csv library to read and use the column names from that, as it doesn't seem to modify duplicates like Pandas does. That should work fine, but I just wanted to avoid using any extra libraries, especially considering I was originally using csv and then upgrade to Pandas for better functionality.
Now here's my solution. I'm sure it could be done more nicely but this does the job for what I needed and is pretty dynamic, from what I can tell. It basically goes through the columns, checks if it can split the string based on the rightmost "." (that's the rpartition), then does a few more checks from there.
It checks:
Is this string in the colMap? The colMap keeps track of all of the column names, duplicate or not. If this comes back true, then that means it's a duplicate of another column that came before it.
Is the string after the rightmost "." a number? All of the columns are strings, so this just makes sure that whatever it is can be converted into a number to prevent grabbing some other random column that meets previous criteria but isn't actually a dupe from Pandas. eg. "DupeCol" and "DupeCol.Stuff" wouldn't get picked up, but "DupeCol" and "DupeCol.1" would.
Does the number that comes after the rightmost "." match up to the current count of duplicates in the colMap? Seeing as the colMap contains all of the names of the columns, duplicates or not, this will ensure that we're not grabbing a user-named column that managed to overlap with the ".number" convention that Pandas uses. Eg. if a user had named two columns "DupeCol" and "DupeCol.6", it wouldn't get picked up unless there were 6 "DupeCol"s preceding "DupeCol.6", indicating that it almost had to be Pandas that named it that way, as opposed to the user. This part is definitely a bit overkill, but I felt like being extra thorough.
colMap = []
for col in df.columns:
if col.rpartition('.')[0]:
colName = col.rpartition('.')[0]
inMap = col.rpartition('.')[0] in colMap
lastIsNum = col.rpartition('.')[-1].isdigit()
dupeCount = colMap.count(colName)
if inMap and lastIsNum and (int(col.rpartition('.')[-1]) == dupeCount):
colMap.append(colName)
continue
colMap.append(col)
df.columns = colMap
Hopefully this helps someone! Feel free to comment if you think it could use any improvements. I don't entirely love using "continue" in my code, but I'm not sure if that's because it's actually bad practice or just me reading random people complain about it too much. I think it doesn't make the code too unreadable here and prevents the need for duplicating the "else" statement; but let me know if there's a way to improve that or anything otherwise. I'm always looking to learn!

If you know types of all data you may consider loading the csv without header first.
df = pd.read_csv(csv_file, header=None)
df.columns = df.iloc[0] # replace column with first row
df = df.drop(0) # remove the first row
(Note that drop is to remove the row, given that your index is unique, and may not be true if you use index_col argument of pd.read_csv)
caveats: The above solution causes you to lose dtypes infomations.
There is some solution to fix the above problem.
# turn each column into numeric
df = df.apply(lambda col: pd.to_numeric(col, errors='ignore'), axis=0)
Otherwise, you may consider reading the csv twice to get the dtype information and apply the correct convertion.

Related

Changing Headers in .csv files

Right now I am trying to read in data which is provided in a messy to read-in format. Here is an example
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].
The excel files are in a folder one path lower than the file that is supposed to read them (i.e. folder 1 contains set of code below, and folder 2 contains the excel files, folder 1 contains folder 2)
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader (sets, df, dataLabels, skip, newHeader, pathName):
for i in range(len(sets)):
df_temp = pd.read_csv(glob.glob(pathName+ sets[i]+".csv"), sep=r'\s*,', skiprows = skip, engine = 'python')[:-1]
df_temp.column.value[0] = [newHeader]
for j in range(len(dataLabels)):
df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]],errors = 'coerce')
df.append(df_temp)
return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
x y
0 1 2
1 3 4
2 5 6
Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension:
files = [glob.glob(f"{pathName}{set_num}.csv") for set_num in sets]
df = pd.concat([pd.read_csv(file, comment="#", names = ["x", "y"]) for file in files])
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows argument to skip header columns:
df= pd.read_csv(path,skiprows=10,skipfooter=2,names=["x","y"])
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately I also have very unfortunate files where the number of headings change.
In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names in this case are taken from the first "good" row and the types from the first data row
This is certainly not fast, it's a last resort solution. If I had a better solution I'd use it:
data = df
if(first_col not in df.columns):
# Skip rows until we find the first col header
for i, row in df.iterrows():
if row[0] == first_col:
data = df.iloc[(i + 1):].reset_index(drop=True)
# Read the column names
series = df.iloc[i]
series = series.str.strip()
data.columns = list(series)
# Use only existing column types
types = {k: v for k, v in dtype.items() if k in data.columns}
# Apply the column types again
data = data.astype(dtype=types)
break
return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adopted to use different conditions, eg looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
if row[0].isnumeric():
data = df.iloc[(i + 1):].reset_index(drop=True)
# Apply names and types
data.columns = columns
data = data.astype(dtype=dtypes)
break
return data

Is there a way to validate data type lengths in Pandas when using the read_csv function?

I'm trying to put some sort of length validation for columns using Pandas. For example, let's say I have a csv named test.csv that has the following data within it:
Column1,Column2,Column3
Data1,Data2,DataDataData3
Data1,Data2,Data3
Now, let's say I have a SQL table called [dbo].[Test1] with the following column datatypes and lengths:
CREATE TABLE [dbo].[Test1](Column1 VARCHAR(5),Column2 VARCHAR(5),Column3 VARCHAR(5))
Now, the scenario- I'm trying to use Pandas read_csv tp pick up this test.csv and then use to_sql to import this data. The code within Pandas would look similar to this (Obviously with more implicit design to pick up multiple files in a directory):
import pandas as pd
file = 'C:\Users\test\Documents\test.csv'
df = pd.read_csv(file, skip_blank_lines = True, warn_bad_lines = True)
df.to_sql(schema='dbo', name='Test1', con=conn, if_exists='append', index=False)
The conn is my connection string variable, but that's not the issue. When this would be ran, it will throw an error since the Column3 data is too big in the first row (13) for the length set in SQL for column 3 (5). My question is- Is there a way in Pandas to either reject this record and import the record that doesn't have an issue?
I'm trying to find something on length validation for Pandas to_sql, but I'm coming up at a loss.
Thank you
If you want to remove those rows with strings that have length > 5 before importing to sql, the below should work in between pd.read_csv() and df.to_sql().
df = df[df['Column3'].apply(lambda x: len(x) <= 5)])
Or you could do a quick for loop, like the below:
for col in df.columns.to_list():
df = df[df[col].apply(lambda x: len(x) <= 5)])
Here's the logic I ended up using, but I can't use it on multiple columns:
for i, row in df.iterrows():
if len(row['Column3']) > 5:
df.drop(index = i, inplace = True)
The only way I found to use is on multiple columns is creating multiple if statements like so:
for i, row in df.iterrows():
if len(row['Column1']) > 5:
df.drop(index = i, inplace = True)
if len(row['Column2']) > 5:
df.drop(index = i, inplace = True)
if len(row['Column3']) > 5:
df.drop(index = i, inplace = True)
I don't think this is the most efficient way, but it does work. Also, I've not tested to see how much this increases the time it takes to import.

Getting rows of data frame excluding header in pandas

I have a CSV file of which the screenshot is shown below :
My goal is to get the rows of the data frame excluding the header part.
My Purpose: I'm converting data in row into "float64" using the function astype() and then rounding the data to two decimal point using pandas round() function. The code for the same is shown below:
df = pd.read_csv('C:/Users/viral/PycharmProjects/Samples/036_20191009T132130.CSV',skiprows = 1)
df = df.astype('float64', errors="ignore")
df = df.round(decimals=2)
Here as you can see that I'm Skipping the first row in order to exclude the header part.
But Unfortunately, the data is not rounding up to 2 decimals place. The results are shown below :
I am not sure, but I guess the line Empty Data frame is making a problem in rounding up the data.
With Header= None
Even header= 1 will act in the same way the skiprow=1 was acting.
Any suggestions are most welcome...
Thank you
I believe you need:
file = 'C:/Users/viral/PycharmProjects/Samples/036_20191009T132130.CSV'
#for default columns 0,1,2, N with omit first row in original data
df = pd.read_csv(file,skiprows = 1, header=None)
#for columns names by first row in file omit skiprows and header parameters
#df = pd.read_csv(file)
#if necessary, convert to floats
df = df.astype('float64', errors="ignore")
#select only numeric columns
cols = df.select_dtypes(np.number).columns
#round only numeric cols
df[cols] = df[cols].round(decimals=2)
Try the following:
df = pd.read_csv('C:/Users/viral/PycharmProjects/Samples/036_20191009T132130.CSV',skiprows = 1)
df = df.astype('float64', errors="ignore")
df = df.round(decimals=2)

DataFrame Split On Rows and apply on header one column using Python Pandas

I'm working on some project and came up with the messy situation across where I've to split the data frame based on the first column of a data frame, So the situation is here the data frame I've with me is coming from SQL queries and I'm doing so much manipulation on that. So that is why not posting the code here.
Target: The data frame I've with me is like the below screenshot, and its available as an xlsx file.
Output: I'm looking for output like the attached file here:
The thing is I'm not able to put any logic here that how do I get this done on dataframe itself as I'm newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
df['Date'] = df['Date'].dt.strftime('%M-%d-%Y')
df_sub = df[['Delivered Impressions','Clicks','Conversion','Spend']].sum(level=0)\
.assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date',append=True).sort_index(level=0)
startline = 0
writer = pd.ExcelWriter('testxls.xlsx', engine='openpyxl')
for n,g in df_out.groupby(level=0):
g.to_excel(writer, startrow=startline, index=True)
startline += len(g)+2
writer.save()
Load the Excel file into a Pandas dataframe, then extract rows based on condition.
dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Where "needed value" would be the value of one of those rows.

pandas read_excel select rows

thanks to StackOverflow (so basically all of you) I've managed to solve almost all my issues regarding reading excel data to DataFrame, except one... My code goes like this:
df = pd.read_excel(
fileName,
sheetname=sheetName,
header=None,
skiprows=3,
index_col=None,
skip_footer=0,
parse_cols='A:J,AB:CC,CE:DJ',
na_values='')
The thing is that in excel files which I'm parsing last row of dataI want to load is every time on different position. The only way I can identify last row of data which interest me is to look for word "SUMA" in first column of each sheet, and the last row I want to load to df will be n-1 row from the one containing "SUMA". Rows below SUMA also have some irrevelant (for me) information and there can by quite a lot of them so I want to avoid loading them.
If you do it with generators, you could do something like this. This loads the complete DataFrame, but afterwards filters out the lines after 'SUMA', using the trick that True == 1, so you only keep the relevant info. You might need some work afterwards to get the dtypes correct
def read_files(files):
sheetname = 'my_sheet'
for file in files:
yield pd.read_excel(
file,
sheetname=sheetName,
header=None,
skiprows=3,
index_col=None,
skip_footer=0,
parse_cols='A:J,AB:CC,CE:DJ',
na_values='')
def clean_files(dataframes):
summary_text = 'SUMA'
for df in dataframes:
index_after_suma = df.index.str.startswith(summary_text).cumsum()
yield df.loc[~index_after_suma, :]

Categories