Right now I am trying to read in data that is provided in a format that is messy to read. Here is an example:
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].
The data files are in a folder one level below the script that is supposed to read them (i.e. folder 1 contains the code below and also contains folder 2, which holds the data files).
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader (sets, df, dataLabels, skip, newHeader, pathName):
    for i in range(len(sets)):
        df_temp = pd.read_csv(glob.glob(pathName+ sets[i]+".csv"), sep=r'\s*,', skiprows = skip, engine = 'python')[:-1]
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
            df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]],errors = 'coerce')
        df.append(df_temp)
    return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
x y
0 1 2
1 3 4
2 5 6
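In the sample as posted, the `[Data]` line has no leading `#`, so `comment='#'` alone would leave it in the result as a row. The following is a minimal sketch of one way to handle that, assuming every real data value is numeric; `read_xy` is a hypothetical helper name, and `io.StringIO` stands in for a file path:

```python
import io
import pandas as pd

# Sample copied from the question; io.StringIO stands in for a real file path.
SAMPLE = """#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
"""

def read_xy(source, names=("x", "y")):
    """Hypothetical helper: read a '#'-commented file and drop marker rows."""
    # Lines starting with '#' are treated as comments and skipped.
    df = pd.read_csv(source, comment="#", names=list(names))
    # Non-numeric cells (such as the "[Data]" marker) become NaN ...
    df = df.apply(pd.to_numeric, errors="coerce")
    # ... and rows that are entirely NaN are dropped.
    return df.dropna(how="all").reset_index(drop=True)

df = read_xy(io.StringIO(SAMPLE))
print(df)
```

Because the column names are a parameter, the same function can be reused with more meaningful headers than 'x' and 'y'.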
Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension:
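# glob.glob returns a list of matches, so flatten the results into one list of paths
files = [path for set_num in sets for path in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(file, comment="#", names = ["x", "y"]) for file in files])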
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows argument to skip header rows (skipfooter needs the Python parser engine):
df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"], engine="python")
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately, I also have files where the number of heading rows changes.
In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names in this case are taken from the first "good" row and the types from the first data row.
This is certainly not fast, it's a last resort solution. If I had a better solution I'd use it:
data = df
if(first_col not in df.columns):
    # Skip rows until we find the first col header
    for i, row in df.iterrows():
        if row[0] == first_col:
            data = df.iloc[(i + 1):].reset_index(drop=True)
            # Read the column names
            series = df.iloc[i]
            series = series.str.strip()
            data.columns = list(series)
            # Use only existing column types
            types = {k: v for k, v in dtype.items() if k in data.columns}
            # Apply the column types again
            data = data.astype(dtype=types)
            break
return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adapted to use different conditions, e.g. looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
    # str() guards against non-string cells such as NaN
    if str(row[0]).isnumeric():
        # Row i is already data, so keep it (slice from i, not i + 1)
        data = df.iloc[i:].reset_index(drop=True)
        # Apply names and types
        data.columns = columns
        data = data.astype(dtype=dtypes)
        break
return data
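A possibly faster alternative to iterating over DataFrame rows is to scan the raw text for the header line first and then let `read_csv` skip everything before it. This is only a sketch under the assumption that the header row can be recognised by its first cell; `find_header_line` is a hypothetical helper and the sample text is made up for illustration:

```python
import io
import pandas as pd

def find_header_line(lines, first_col, sep=","):
    """Return the 0-based index of the first line whose first cell is first_col."""
    for i, line in enumerate(lines):
        if line.split(sep)[0].strip() == first_col:
            return i
    raise ValueError(f"no header line starting with {first_col!r}")

# Made-up sample: two junk lines, then the real header and the data.
TEXT = "Report generated by instrument\nOperator,someone\nx,y\n1,2\n3,4\n"

header_line = find_header_line(TEXT.splitlines(), "x")
# Skip everything before the header; pandas then infers names and dtypes itself.
df = pd.read_csv(io.StringIO(TEXT), skiprows=header_line)
print(df)
```

With a real file, the scan would read the file once with `open(...)` and the second pass would hand the path to `read_csv`.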
I am using lasio (https://lasio.readthedocs.io/en/latest/index.html) to call out data within a .LAS file. It's an oil and gas drilling type file with data in the heading and in the body (called the curve). TL;DR on the lasio docs, but it reads the data as a pandas DataFrame. Hence me using a dictionary to assign the data.
This is an output of a lasio file in notepad:
At the end, I need a file that has the UWI (unique well #), the depth and its porosity reading.
The UWI is one value but there are multiple values for the depth and porosity, so I need the UWI repeated. To complicate matters, not all of my files have the porosity data, so I have had to screen for them too.
My code was going OK until I exported it and saw that the cells in the CSV are nested. The code reads the values into a dictionary, and I need the UWI duplicated for each depth value.
data = []
df_global = pd.DataFrame(data)
alias = ["DPHI", "DPHI_LS", "DPH8", "DPHZ", "DPHZ_LS", "DPOR_LS", "DPOR", "PORD", "DPHI_SCANNED", "SPHI"]
for filename in all_files:
    las = lasio.read(filename)
    df = las.df().reset_index()
    mnemonic = las.keys()
    match = set(alias).intersection(mnemonic)
    if len(match) != 0:
        DEPT = df["DEPT"]
        DPHI2 = df[match]
        DPHI = DPHI2.iloc[:,0]
        UWI = las.well.UWI.value
        df_global = df_global.append({'UWI': UWI, 'DEPTH': DEPT, 'DPHI': DPHI}, ignore_index=True)
df_global.to_csv('las_output.csv', index=False)
This is my output, note the nested rows.
I have tried
df.loc[:,"UWI"] = np.array(las.well.UWI.value*len(df.DEPT))
but the UWI value is just repeated and not put into rows.
Problem
You are appending dictionaries to an already-existing DataFrame. Each dictionary contains a variety of types (an integer under the key UWI, and pandas Series under other keys). This is a very general operation, and pandas reacts by converting the Series contained within the dictionary to strings, which is what you are seeing in columns B and C in Excel.
This is also probably not the operation you want to do, which appears to be appending DataFrames (i.e. one per file) to an existing DataFrame (df_global). Pandas does not make this easy for existing DataFrames, for good reason.
Solution
This is much simpler if you create a Python list (data) containing DataFrames, then use pandas' concat function to create a single DataFrame as the last step. See below for an example. I have not tested the code, because you didn't include a minimal reproducible example, but hopefully it helps.
data = []
alias = ["DPHI", "DPHI_LS", "DPH8", "DPHZ", "DPHZ_LS", "DPOR_LS", "DPOR", "PORD", "DPHI_SCANNED", "SPHI"]
for filename in all_files:
    las = lasio.read(filename)
    df = las.df().reset_index()
    mnemonic = las.keys()
    match = set(alias).intersection(mnemonic)
    if len(match) != 0:
        columns_to_keep = [las.curves[0].mnemonic] + list(match)
        # Assign the single UWI value to a new column called "UWI"
        df['UWI'] = las.well.UWI.value
        columns_to_keep.append('UWI')
        data.append(df[columns_to_keep])
df_final = pd.concat(data, join='outer') # join='outer' means that it will keep all of the different values found from `alias`
df_final.to_csv('las_output.csv', index=False)
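The two ideas used above (assigning a scalar to a column repeats it on every row, and `pd.concat` with `join='outer'` keeps the union of columns) can be checked without any LAS files. This is a small sketch with made-up well identifiers and values standing in for what lasio would return:

```python
import pandas as pd

# Made-up stand-ins for the per-file curve data; the keys play the role of the UWI.
wells = {
    "100/01": pd.DataFrame({"DEPT": [100.0, 100.5], "DPHI": [0.12, 0.15]}),
    "100/02": pd.DataFrame({"DEPT": [200.0, 200.5, 201.0], "DPOR": [0.08, 0.09, 0.10]}),
}

frames = []
for uwi, curves in wells.items():
    frame = curves.copy()
    # Assigning a scalar broadcasts it, so the UWI is repeated on every depth row.
    frame["UWI"] = uwi
    frames.append(frame)

# join="outer" keeps every porosity column; missing readings become NaN.
combined = pd.concat(frames, join="outer", ignore_index=True)
print(combined)
```

Each row of `combined` holds one depth reading with its UWI, so the exported CSV has no nested cells.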
I'm splitting a large CSV file (containing stock financial data) into smaller chunks. The CSV is not laid out like a typical table; it is more like an Excel pivot table. The first few rows of the first column contain some headers.
Company name, id, etc. are repeated across the following columns, because a single company has more than one attribute, so one company spans several columns rather than just one.
After the first few rows, the columns start resembling a typical data frame where headers are in columns instead of rows.
Anyways, what I'm trying to do is to make Pandas allow duplicate column headers and not make it add ".1", ".2", ".3", etc after the headers. I know Pandas does not allow this natively, is there a workaround? I tried to set header = None on read_csv but it throws a tokenization error which I think makes sense. I just can't think of an easy way.
import pandas as pd
csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))
filename = 1
#column increment
x = 30 * 59
for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
EDIT:
From, https://github.com/pandas-dev/pandas/issues/19383, I add:
final_df.columns = final_df.iloc[0]
final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
So, full code:
import pandas as pd
csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))
filename = 1
#column increment
x = 30 * 59
for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.columns = final_df.iloc[0]
        final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
Now, the entire first row is gone. But, the expected output is for the header row to be replaced with the reset index, without the ".1", ".2", etc.
Screenshot:
The SimFin ID row is no longer there.
This is how I did it:
final_df.columns = final_df.columns.str.split('.').str[0]
Reference:
https://pandas.pydata.org/pandas-docs/stable/text.html
The solution below ensures that other column names containing a period ('.') in the dataframe do not get modified:
import pandas as pd
from csv import DictReader
csv_file_loc = "file.csv"
# Read csv
df = pd.read_csv(csv_file_loc)
# Get column names from csv file using DictReader
col_names = DictReader(open(csv_file_loc, 'r')).fieldnames
# Rename columns
df.columns = col_names
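To see the difference on a small input, here is a sketch that uses an in-memory CSV (made up for illustration) instead of `file.csv`. It uses `csv.reader` rather than `DictReader`, which is a swapped-in but equivalent way to get the raw header row:

```python
import csv
import io
import pandas as pd

# Made-up CSV: "a" is duplicated, and "b.1" is a genuine name containing a period.
TEXT = "a,a,b.1\n1,2,3\n"

df = pd.read_csv(io.StringIO(TEXT))
mangled = list(df.columns)  # pandas renames the second "a" to "a.1"

# The csv module returns the header row exactly as written in the file.
header = next(csv.reader(io.StringIO(TEXT)))
df.columns = header

print(mangled)
print(list(df.columns))
```

Unlike splitting every name on '.', this leaves the genuine "b.1" column untouched.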
I know I'm pretty late to the draw on this one, but I'm leaving the solution I came up with in case anyone else wanders across this as I have.
Firstly, the linked question has a pretty nice and dynamic solution that seems to work well even for high column counts. I came across that after I made my solution, haha. Check it out here. Another answer on this thread uses the csv library to read the column names, since it doesn't modify duplicates the way Pandas does. That should work fine, but I wanted to avoid using any extra libraries, especially considering I was originally using csv and then upgraded to Pandas for better functionality.
Now here's my solution. I'm sure it could be done more nicely but this does the job for what I needed and is pretty dynamic, from what I can tell. It basically goes through the columns, checks if it can split the string based on the rightmost "." (that's the rpartition), then does a few more checks from there.
It checks:
Is this string in the colMap? The colMap keeps track of all of the column names, duplicate or not. If this comes back true, then that means it's a duplicate of another column that came before it.
Is the string after the rightmost "." a number? All of the columns are strings, so this just makes sure that whatever it is can be converted into a number to prevent grabbing some other random column that meets previous criteria but isn't actually a dupe from Pandas. eg. "DupeCol" and "DupeCol.Stuff" wouldn't get picked up, but "DupeCol" and "DupeCol.1" would.
Does the number that comes after the rightmost "." match up to the current count of duplicates in the colMap? Seeing as the colMap contains all of the names of the columns, duplicates or not, this will ensure that we're not grabbing a user-named column that managed to overlap with the ".number" convention that Pandas uses. Eg. if a user had named two columns "DupeCol" and "DupeCol.6", it wouldn't get picked up unless there were 6 "DupeCol"s preceding "DupeCol.6", indicating that it almost had to be Pandas that named it that way, as opposed to the user. This part is definitely a bit overkill, but I felt like being extra thorough.
colMap = []
for col in df.columns:
    if col.rpartition('.')[0]:
        colName = col.rpartition('.')[0]
        inMap = col.rpartition('.')[0] in colMap
        lastIsNum = col.rpartition('.')[-1].isdigit()
        dupeCount = colMap.count(colName)
        if inMap and lastIsNum and (int(col.rpartition('.')[-1]) == dupeCount):
            colMap.append(colName)
            continue
    colMap.append(col)
df.columns = colMap
Hopefully this helps someone! Feel free to comment if you think it could use any improvements. I don't entirely love using "continue" in my code, but I'm not sure if that's because it's actually bad practice or just me reading random people complain about it too much. I think it doesn't make the code too unreadable here and prevents the need for duplicating the "else" statement; but let me know if there's a way to improve that or anything otherwise. I'm always looking to learn!
If you know the types of all the data, you may consider loading the CSV without a header first.
df = pd.read_csv(csv_file, header=None)
df.columns = df.iloc[0] # replace column with first row
df = df.drop(0) # remove the first row
(Note that drop removes the row by index label, which assumes your index is unique; that may not be true if you use the index_col argument of pd.read_csv.)
Caveat: the above solution causes you to lose the dtype information.
There is a way to fix this problem:
# turn each column into numeric
df = df.apply(lambda col: pd.to_numeric(col, errors='ignore'), axis=0)
Otherwise, you may consider reading the CSV twice to get the dtype information and apply the correct conversion.
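As a sketch of the header=None approach end to end, the example below converts the columns before the duplicate names are assigned, so that each column can still be addressed unambiguously. It uses a try/except helper (`to_numeric_if_possible`, a hypothetical name) instead of `errors='ignore'`, which recent pandas versions flag as deprecated; the CSV text is made up for illustration:

```python
import io
import pandas as pd

def to_numeric_if_possible(col):
    """Return the column converted to a numeric dtype, or unchanged if that fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Made-up CSV with a duplicated "id" header.
TEXT = "id,id,name\n1,2,a\n3,4,b\n"

raw = pd.read_csv(io.StringIO(TEXT), header=None)
header = list(raw.iloc[0])                  # first row holds the real names
body = raw.iloc[1:].reset_index(drop=True)  # remaining rows are the data
# Convert while the columns still have unique positional labels (0, 1, 2).
body = body.apply(to_numeric_if_possible)
body.columns = header                       # duplicates are kept as-is

print(body)
print(body.dtypes)
```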
I have 4 CSV files exported from an e-shop database. I need to merge them by columns, which I could probably manage on my own, but the problem is matching the right columns.
First file:
"ep_ID","ep_titleCS","ep_titlePL".....
"601","Kancelářská židle šedá",NULL.....
...
Second file:
"pe_photoID","pe_productID","pe_sort"
"459","603","1"
...
Third file:
"epc_productID","epc_categoryID","epc_root"
"2155","72","1"
...
Fourth file:
"ph_ID","ph_titleCS"...
"379","5391132275.jpg"
...
I need to match the rows so that rows with the same "ep_ID" and "epc_productID" are merged together, and likewise rows with the same "ph_ID" and "pe_photoID". I don't really know where to start; hopefully I have explained it clearly.
Update:
I am using :
files = ['produkty.csv', 'prirazenifotek.csv', 'pprirazenikategorii.csv', 'adresyfotek.csv']
dfs = []
for f in files:
    df = pd.read_csv(f,low_memory=False)
    dfs.append(df)
first_and_third =pd.merge(dfs[0],dfs[1],left_on = "ep_ID",right_on="pe_photoID")
first_and_third.to_csv('new_filepath.csv', index=False)
OK, this code works, but it does two things differently from what I need:
When there is a row in file one with ID = 1, for example, and file two has 5 rows with bID = 1, it creates 5 rows in the final file. I would like to have one row that holds multiple values, one from every row with bID = 1 in file two. Is that possible?
And it seems to be deleting some rows... I can't be sure until I get rid of the "duplicates".
You can use pandas's merge method to merge the csvs together. In your question you only provide keys between the 1st and 3rd files, and the 2nd and 4th files. Not sure if you want one giant table that has them all together -- if so you will need to find another intermediary key, maybe one you haven't listed(?).
import pandas as pd
files = ['path_to_first_file.csv', 'second_file.csv', 'third_file.csv', 'fourth_file.csv']
dfs = []
for f in files:
    df = pd.read_csv(f)
    dfs.append(df)
first_and_third = dfs[0].merge(dfs[2], left_on='ep_ID', right_on='epc_productID', how='left')
second_and_fourth = dfs[1].merge(dfs[3], left_on='pe_photoID', right_on='ph_ID', how='left')
If you want to save the dataframe back down to a file you can do so:
first_and_third.to_csv('new_filepath.csv', index=False)
index=False assumes you have no index set on the dataframe and don't want the dataframe's row numbers to be included in the final csv.
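For the follow-up about getting one row per product instead of one row per match, one option is to collapse the many-side table with groupby before merging. This is a sketch with made-up rows, and it assumes that "pe_productID" in the second file refers to "ep_ID" in the first, which the question does not confirm:

```python
import pandas as pd

# Made-up rows using the column names from the question.
products = pd.DataFrame({"ep_ID": [601, 603], "ep_titleCS": ["A", "B"]})
photos = pd.DataFrame({
    "pe_photoID": [459, 460, 461],
    "pe_productID": [603, 603, 601],  # assumption: this column points at ep_ID
    "pe_sort": [1, 2, 1],
})

# Collapse the photo table to one row per product, gathering the photo IDs in a list.
photos_per_product = (
    photos.groupby("pe_productID", as_index=False)
          .agg(pe_photoIDs=("pe_photoID", list))
)

# how="left" keeps every product, including ones without photos.
merged = products.merge(
    photos_per_product, left_on="ep_ID", right_on="pe_productID", how="left"
)
print(merged)
```

A left merge also avoids silently dropping products that have no match, which may explain the "deleted" rows seen with the default inner merge.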
I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the column names.
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes ("Tin_MIX_Air", "Tout_Fan2b", etc.). There are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variables start.
What I need to do is create a Dataframe from this .csv and place these names in the columns names. I'm new to Python and I'm not very sure how to do it
import pandas as pd
path = r'path-to-file.csv'
data=pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)
data.drop(data.index[range(0,29)],inplace=True)
x=len(data.iloc[0])
data.drop(data.columns[[0,1,2,x-1,x-2,x-3]],axis=1,inplace=True)
data.reset_index(drop=True,inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data. I'm dropping all the other columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with the names shown in the first excerpt above. As I said before, I'm currently doing this manually:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility of the CH# - "Name" combination changing.
Thank you very much for the help!
Comment: possible for it to work within the other "OPEN " loop that I have?
Assume the column names are in rows 3 to 6 and the data runs from row 7 to EOF.
For instance (untested code)
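rows = []
columns = []
with open (path) as fh:
    for row, line in enumerate (fh, 1):
        if row > 2 and row <= 6:
            # Channel definition line: the second field is the quoted signal name
            ch, name = line.split(',')[:2]
            columns.append(name.strip().strip('"'))
        elif row > 6:
            rows.append(tuple(line.strip().split(',')))
# Build the DataFrame once at the end instead of appending row by row.
# `columns` holds the signal names, to be assigned after trimming to the CH columns.
data = pd.DataFrame(rows)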
Question: ... I would like to get them from the .csv file
Start with:
with open (path) as fh:
    for row, line in enumerate (fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
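Putting these pieces together, here is a sketch that avoids hardcoded row numbers: it collects the CH# to name mapping until it reaches the "Data" marker line, then lets pandas read the table and renames the channel columns. The sample is a shortened version of the file in the question (the second data row is made up), and `read_logger` is a hypothetical helper name:

```python
import io
import pandas as pd

# Shortened sample based on the question; the last data row is made up.
SAMPLE = '''Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,Alarm1-10
NO.,Time,ms,degC,degC,A1234567890
1,2017-07-31 10:45:38,000,+25.6,+26.2,LLLLLLLLLL
2,2017-07-31 10:45:39,000,+25.7,+26.3,LLLLLLLLLL
'''

def read_logger(text):
    """Hypothetical helper: return only the channel columns, named by signal."""
    names = {}
    data_start = None
    for i, line in enumerate(text.splitlines()):
        fields = [f.strip().strip('"') for f in line.split(",")]
        if fields[0] == "Data":
            data_start = i + 1  # the column header row follows the marker
            break
        # Matches "CH1", "CH2", ... but not the bare "CH" header row.
        if fields[0].startswith("CH") and fields[0][2:].isdigit():
            names[fields[0]] = fields[1]
    if data_start is None:
        raise ValueError("no 'Data' marker line found")
    # Skip everything before the header row, plus the units row right after it.
    skip = list(range(data_start)) + [data_start + 1]
    df = pd.read_csv(io.StringIO(text), skiprows=skip)
    return df[list(names)].rename(columns=names)

df = read_logger(SAMPLE)
print(df)
```

For a real file, the text would come from `open(path).read()`; if the CH# to name assignment changes, the column names follow automatically.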
I'm dealing with badly laid out excel sheets which I'm trying to parse and write into a database.
Every sheet can have multiple tables. Though the header for these possible tables are known, which tables are gonna be on any given sheet is not, neither is their exact location on the sheet (the tables don't align in a consistent way). I've added a pic of two possible sheet layout to illustrate this: This layout has two tables, while this one has all the tables of the first, but not in the same location, plus an extra table.
What I do know:
All the possible table headers, so each individual table can be identified by its headers
Tables are separated by blank cells. They do not touch each other.
My question Is there a clean way to deal with this using some Python module such as pandas?
My current approach:
I'm currently converting to .csv and parsing each row. I split each row around blank cells, and process the first part of the row (should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (this is stored in a global current_df). Subsequent rows which are not header rows are fed into this table (here I'm using pandas.DataFrame for my tables).
Code so far is below (mostly incomplete and untested, but it should convey the approach above):
class DFManager(object): # keeps track of current table and its headers
    current_df = None
    current_headers = []
    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df
def split_row(row, separator):
    while row and row[0] == separator:
        row.pop(0)
    while row and row[-1] == separator:
        row.pop()
    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []
def process_df_row(row, dfmgr):
    df = df_with_header(row) # returns the dataframe with these headers
    if df is None: # is not a header row, add it to current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)
# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)
    for row in iter(row_queue.get, None):
        if not row:
            continue
        first_part, remainder = split_row(row)
        row_queue.put(remainder)
        process_df_row(first_part, dfmgr)
This is such a specific situation that there is likely no "clean" way to do this with a ready-made module.
One way to do this might use the header information you already have to find the starting indices of each table, something like this solution (Python Pandas - Read csv file containing multiple tables), but with an offset in the column direction as well.
Once you have the starting position of each table, you'll want to determine the widths (either known a priori or discovered by reading until the next blank column) and read those columns into a dataframe until the end of the table.
The benefit of an index-based method rather than a queue based method is that you do not need to re-discover where the separator is in each row or keep track of which row fragments belong to which table. It is also agnostic to the presence of >2 tables per row.
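The index-based idea can be sketched as follows, assuming the sheet has already been loaded into a DataFrame with header=None (for example via pandas.read_excel) and that each table's header row is known exactly. `extract_tables` is a hypothetical helper and the grid is made up for illustration:

```python
import pandas as pd

def extract_tables(grid, known_headers):
    """Locate each known header row in the grid and slice out the table below it.

    grid: DataFrame read with header=None; known_headers: {table name: header list}.
    """
    tables = {}
    n_rows, n_cols = grid.shape
    for name, header in known_headers.items():
        width = len(header)
        for r in range(n_rows):
            for c in range(n_cols - width + 1):
                if list(grid.iloc[r, c:c + width]) != header:
                    continue
                # Read downwards until a fully blank row (or the end of the sheet).
                end = r + 1
                while end < n_rows and not grid.iloc[end, c:c + width].isna().all():
                    end += 1
                body = grid.iloc[r + 1:end, c:c + width].reset_index(drop=True)
                body.columns = header
                tables[name] = body
    return tables

# Made-up sheet: two tables at different offsets, separated by a blank column.
grid = pd.DataFrame([
    ["id", "name", None, None, None],
    [1, "a", None, "city", "zip"],
    [2, "b", None, "Oslo", "0150"],
    [None, None, None, "Rome", "00100"],
])
tables = extract_tables(grid, {"people": ["id", "name"], "places": ["city", "zip"]})
print(tables["people"])
print(tables["places"])
```

Because each table is found by its own header position, it does not matter how many tables share a row or where they start.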
I have written code to merge multiple tables separated vertically, with common headers in each table. I am assuming that unique header names do not end with a dot followed by an integer.
'''
import sys
import pandas as pd

def clean(input_file, output_file):
    try:
        df = pd.read_csv(input_file, skiprows=[1,1])
        df = df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis=1)
        df = rename_duplicate_columns(df)
    except Exception:
        print("Error: File Not found\t", sys.exc_info() [0])
        return
    # Columns without a ".<number>" suffix belong to the first table
    udf = df.loc[:, ~df.columns.str.match(r".*\.\d")]
    udf = udf.dropna(how='all')
    try:
        table_num = int(df.columns.values[-1].split('.')[-1])
        fdf = udf
        for i in range(1,table_num+1):
            udfi = df.loc[:, df.columns.str.endswith(f'.{i}')]
            udfi = udfi.rename(columns = lambda x: '.'.join(x.split('.')[:-1]))
            udfi = udfi.dropna(how='all')
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            fdf = pd.concat([fdf, udfi], ignore_index=True)
        fdf.to_csv(output_file)
    except ValueError:
        print ("File Contains only single Table")

def rename_duplicate_columns(df):
    cols=pd.Series(df.columns)
    # Index.get_duplicates() was removed from pandas; duplicated() gives the same labels
    for dup in df.columns[df.columns.duplicated()].unique():
        cols[df.columns.get_loc(dup)]=[dup+'.'+str(d_idx) if d_idx!=0 else
                                       dup for d_idx in range(df.columns.get_loc(dup).sum())]
    df.columns=cols
    print(df.columns)
    return df

clean(input_file, output_file)
'''