Column name in Csv changes after merging - python

I have multiple CSV files that I need to merge. The column names are:
idSite idVisit visitIp visitorId
However in the merged file the column 'idSite' changes to 'idSite'
This is the program I wrote. Everything else seems to be fine.
import pandas as pd
import os
dirListing = os.listdir("D:/Python/Test/Diku/piwik/filteredcsv/")
df=[]
siteIds = [34]
for id in siteIds:
    for item in dirListing:
        if str(id) in item:
            print item
            df.append(pd.read_csv(item,sep = ",",dtype='unicode'))
    df3 = pd.concat(df,axis=0, ignore_index=True)
    df3.to_csv('merged_' + str(id) + '_raw'+'.csv', sep =',')
I can't seem to figure out the problem. Is it an encoding issue?
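The post does not say what the altered name looks like, so the following is only a sketch under one assumption: that the input files start with a UTF-8 byte-order mark (BOM), which can end up glued to the first column name. The file name and contents below are made up for illustration. Reading with encoding="utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.

```python
import os
import tempfile

import pandas as pd

# Write a small CSV that starts with a UTF-8 BOM (bytes EF BB BF).
# The file name and the values are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "bom_example.csv")
with open(path, "wb") as handle:
    handle.write(b"\xef\xbb\xbfidSite,idVisit\n34,1\n")

# "utf-8-sig" strips the BOM, so the first header is read as plain "idSite".
df = pd.read_csv(path, encoding="utf-8-sig", dtype=str)
print(list(df.columns))
```

If the files are in a different encoding, the same idea applies with the matching encoding name passed to read_csv.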

Related

Column appended to dataframe coming up empty

I have the following code:
import glob
import pandas as pd
import os
import csv
myList = []
path = "/home/reallymemorable/Documents/git/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    fileDate = pd.DataFrame({'Date': [dateFromFilename]})
    myList.append(row.join(fileDate))
concatList = pd.concat(myList, sort=True)
print(concatList)
concatList.to_csv('/home/reallymemorable/Documents/test.csv', index=False, header=True)
It goes through a folder of CSVs and grabs a specific row and puts it all in a CSV. The files themselves have names like 10-10-2020.csv. I have some code in there that gets the filename and removes the file extension, so I am left with the date alone.
I am trying to add another column called "Date" that contains the filename for each file.
The script almost works: it gives me a CSV of all the rows I pulled out of the various CSVs, but the Date column itself is empty.
If I do print(dateFromFilename), the date/filename prints as expected (e.g. 10-10-2020).
What am I doing wrong?
join defaults to how='left' and matches rows on the index. Your fileDate dataframe has index 0, while row keeps the index label it had in df, so the labels generally don't match and Date comes out as NaN. Instead, do an assignment:
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    myList.append(row.assign(Date=dateFromFilename))
concatList = pd.concat(myList, sort=True)
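To see the index mismatch on its own, here is a small sketch with a made-up three-row frame standing in for one daily report (the values and the Confirmed column are assumptions for illustration). The filtered row keeps index label 1, the one-row date frame has label 0, so the join finds nothing to attach, while assign broadcasts the scalar.

```python
import pandas as pd

# Toy stand-in for one daily report; the values are made up.
df = pd.DataFrame({
    "Province_State": ["Ohio", "Pennsylvania", "Texas"],
    "Confirmed": [10, 20, 30],
})
row = df.loc[df["Province_State"] == "Pennsylvania"]  # keeps index label 1
file_date = pd.DataFrame({"Date": ["10-10-2020"]})    # has index label 0

# Left join on the index: label 1 does not exist in file_date, so Date is NaN.
joined = row.join(file_date)

# assign broadcasts the scalar to every row, regardless of the index.
assigned = row.assign(Date="10-10-2020")

print(joined)
print(assigned)
```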
Another way is to store the dataframes in a dictionary keyed by date and then concat; the keys become the outer level of the result's index:
myList = dict()
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    myList[dateFromFilename] = row
concatList = pd.concat(myList, sort=True)
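With the dictionary approach the dates land in the index, not in a column. A minimal sketch of turning them into a Date column, assuming two made-up one-row frames (the Confirmed column and its values are hypothetical): names=["Date"] labels the outer index level, and reset_index moves that level into the columns.

```python
import pandas as pd

# Hypothetical per-file rows keyed by the date taken from the filename.
rows_by_date = {
    "10-10-2020": pd.DataFrame({"Confirmed": [20]}, index=[1]),
    "10-11-2020": pd.DataFrame({"Confirmed": [25]}, index=[1]),
}

# The dictionary keys become the outer index level, labelled "Date" here.
combined = pd.concat(rows_by_date, names=["Date"], sort=True)

# Move the outer level out of the index to get an ordinary Date column.
with_date_column = combined.reset_index(level="Date")
print(with_date_column)
```

After this step the frame can be written with index=False just like in the assign version.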

Concatenating dataframes adding additional columns

I'm trying to create a combined dataframe from a series of 12 individual CSVs (12 months to combine for the year). All the CSVs have the same format and column layout.
When I first ran it, it appeared to work and I was left with a combined dataframe with 6 columns (as expected). Upon looking at it, I found that the header row was applied as actual data in all the files, so I had some bad rows I needed to eliminate. I could manually make these changes but I'm looking to have the code take care of this automatically.
So to that end, I updated the code so it only read in the first CSV with headers and the remaining CSVs without headers and concatenate everything together. This appears to work BUT I end up with 12 columns instead of 6 with the first 6 columns having NaNs for the first CSV and the last 6 columns having NaNs for the other 11 CSVs, which is obviously NOT what I want (see image below).
The code is similar, I just use the header=None parameter in pd.read_csv() for the 11 CSVs after the first (and I don't use that parameter for the first CSV). Can anyone give me a hint as to why I'm getting 12 columns (with the data placement as described) when I run this code? The layout of the CSV file is shown below.
Appreciate any help.
import pandas as pd
import numpy as np
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    dfSD = pd.concat([dfSD, df])
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
print()
print("TOTAL ROWS = " + str(totrows + pd.read_csv('c:/data/datasets/' + files[0]).shape[0]))
One simple solution is the following.
import pandas as pd
import numpy as np
import os
totrows = 0
files = os.listdir('c:/data/datasets')
columns = None
dfSD = []
for file in files:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True)
    # Remember the column labels of the first file and reuse them for every file.
    # (Testing "if not columns" on a pandas Index raises a ValueError, hence "is None".)
    if columns is None:
        columns = df.columns
    df.columns = columns
    dfSD.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
dfSD = pd.concat(dfSD, axis = 0)
dfSD = dfSD.reset_index(drop = True)
Another possibility is:
import pandas as pd
import numpy as np
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
df_comb = [dfSD]
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    # Give the header-less frame the same column labels as the first file.
    df.columns = dfSD.columns
    df_comb.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
# df_comb is already a list of dataframes, so pass it directly (not wrapped in another list).
dfSD = pd.concat(df_comb, axis = 0).reset_index(drop = True)
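Both solutions work for the same reason: pd.concat aligns frames on their column labels. A frame read with header=None gets the integer labels 0, 1, 2, ... which never match the named columns of the first file, so the columns sit side by side and the gaps are filled with NaN. A small sketch with two made-up two-column frames (the Product and Price names are assumptions, not the asker's layout):

```python
import pandas as pd

# First file read with its header row: the columns are named.
first = pd.DataFrame({"Product": ["A"], "Price": [1.0]})
# A later file as read_csv(..., header=None) would return it: columns labelled 0 and 1.
later = pd.DataFrame([["B", 2.0]])

# The labels differ, so concat keeps all four columns and pads with NaN.
misaligned = pd.concat([first, later])
print(misaligned)

# Once the labels match, the rows are stacked under the same two columns.
later.columns = first.columns
aligned = pd.concat([first, later], ignore_index=True)
print(aligned)
```

With six columns per file the same effect produces the 12 columns described in the question.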

adding data to an existing empty dataframe containing only column names

How can I add data to an existing empty column in a dataframe?
I have an empty dataframe with column names (stock tickers).
I am trying to add data to each stock, basically, populate the dataframe column by column, from left to right based on the header name.
I am pulling the data from another CSV file which looks like this (CSV file name = column name in the dataframe I'm trying to populate):
P.S. An additional issue may arise from the length of data available for each stock, e.g. I may have a list of 10 values for the first stock, 0 for the second, and 25 for the third. I plan to save this in a CSV, so perhaps it would not cause too big of an issue.
I have tried the following idea but without luck. Any suggestions are welcome.
import pandas as pd
import os
path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'
df_tickers = pd.read_csv(path + Russell3k_Tickers)
divFls = os.listdir(path + Russell3k_Divs)
for i in divFls:
    df = pd.read_csv(path + Russell3k_Divs + i)
    Div = df['Dividends']
    i = i[0].split('.')
    df_tickers[i] = df_tickers.append(Div)
    print(df_tickers)
    break
import pandas as pd
import os
from tqdm import tqdm
path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'
df_tickers = pd.DataFrame()
divFls = os.listdir(path + Russell3k_Divs)
for i in tqdm(divFls):
    df = pd.read_csv(path + Russell3k_Divs + i)
    i = i.split('.')[0]
    df[str(i)] = df['Date']
    df_tickers = df_tickers.join(df[str(i)], how='outer')
df_tickers.to_csv('Russell-3000-Stock-Tickers-List1.csv', encoding='utf-8', index=False)
This answer was posted as an edit to the question adding data to an existing empty dataframe containing only column names by the OP Mr.Riply under CC BY-SA 4.0.
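On the question's P.S. about columns of different lengths: this is not part of the OP's answer, just a sketch of one alternative, assuming each ticker's values are available as a pandas Series (the tickers and numbers below are made up). Concatenating the Series side by side with axis=1 aligns them on the index and pads the shorter columns with NaN, which to_csv then writes as empty cells.

```python
import pandas as pd

# Hypothetical dividend series of different lengths, keyed by ticker.
dividends = {
    "AAA": pd.Series([0.10, 0.12, 0.15]),
    "BBB": pd.Series([0.30]),
    "CCC": pd.Series([0.50, 0.55]),
}

# axis=1 puts one column per key; shorter columns are padded with NaN.
table = pd.concat(dividends, axis=1)
print(table)
```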

Is pandas automatically adding a row and a columns in the first position?

I'm using pandas to merge some CSV files (the number of files can vary).
When I run the script, it seems that a column and a row are automatically added (as you can see in the picture below).
I use pandas with Python 3.7 on a Windows computer. I use Excel to open the CSV files.
Here is the code:
import os
import pandas as pd
L_Log= os.listdir('E://PJT/TEST2/')
dfList=[]
for filename in L_Log:
    filename = "E://PJT/TEST2/" + filename
    typefile=type(filename)
    print = typefile
    print(filename)
    df=pd.read_csv(filename,header=None, sep = ';', error_bad_lines=False, encoding="ANSI")
    #df[1:] = [test[1:] for test in df[1:]]
    dfList.append(df)
concatDf=pd.concat(dfList,axis=0)
concatDf.to_csv('Concat2.csv', sep = ';')
The result I get is shown in the picture; what is highlighted in Excel is what I expect.
Thanks for your help!
UPDATE:
I changed the code a little: I removed the part where it adds the column titles, and I added index=False to the to_csv call:
concatDf.to_csv('Concat2.csv', sep = ';',index=False)
Here is the full new script:
import os
import pandas as pd
L_Log= os.listdir('.')
L_LogClean=[]
'''
for k in range(len(L_Log)):
    if 'Logfile_' in L_Log[k]:
        Tempo = L_Log[k]
        Tempo2 = Tempo[12:16]+Tempo[10:12]+Tempo[8:10]
        Tempo2 = int(Tempo2)
        L_LogClean.append(Tempo2)
L_LogClean = sorted(L_LogClean)
for k in range(len(L_LogClean)):
    Tempo = str(L_LogClean[k])
    Tempo2 = 'Logfile_' + Tempo[6:8]+Tempo[4:6]+Tempo[0:4]+'.csv'
    L_LogClean[k] = Tempo2
print(L_LogClean)
'''
dfList=[]
colnames=['No.','Date','Time','Temp1','Unit','Temp2','Unit','Lux2','Unit','BP1','Humidité Relat','Unit','CO2','Unit','Présence','Temp1_EnO','Unit','Temp2_EnO','Unit','Temp3_EnO','Unit','RH3_EnO','Unit','Chauffage']
for filename in L_Log:
    filename = "E://PJT/TEST2/" + filename
    typefile=type(filename)
    print = typefile
    print(filename)
    df=pd.read_csv(filename,header=None, sep = ';', error_bad_lines=False, encoding="ANSI")
    #df[1:] = [test[1:] for test in df[1:]]
    dfList.append(df)
concatDf=pd.concat(dfList,axis=0)
#concatDf.columns=colnames
concatDf.to_csv('Concat2.csv', sep = ';',index=False)
Now the file seems correct, but it adds the column names once for each file (and obviously I just want the first row to show the column titles).
Here is an example of what I get.
Thanks!
To stop to_csv from writing the extra row and column, you'll want to turn off both index and header (index=False, header=False). (Not exactly intuitive in my opinion, since the counterpart of index would be columns rather than header, but what can you do.)
To prevent having your column names repeated in the data, pandas has to treat the first row of each CSV file as the column names of the dataframe instead of as data. You'll need to edit the header parameter in your for loop where you are reading the CSV files with read_csv; the names are then written once by to_csv if header is left at its default:
for filename in L_Log:
    ...
    df=pd.read_csv(filename,header=0, ...)
    ...
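Here is a self-contained sketch of that combination, using two made-up in-memory "files" in place of the log files (only the column names No., Date and Temp1 are borrowed from the question; the values are invented). Each file is read with header=0, the frames are concatenated, and the result is written with index=False so that no index column appears and the titles are written once.

```python
import io

import pandas as pd

# Two hypothetical log files, each starting with the same title row.
file_a = "No.;Date;Temp1\n1;01/01/2020;20.5\n"
file_b = "No.;Date;Temp1\n2;02/01/2020;21.0\n"

# header=0 makes the first row of each file the column names, not data.
frames = [pd.read_csv(io.StringIO(text), sep=";", header=0)
          for text in (file_a, file_b)]
merged = pd.concat(frames, axis=0, ignore_index=True)

# With no path given, to_csv returns the CSV text; index=False drops the index column.
out = merged.to_csv(sep=";", index=False)
print(out)
```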

Combining Excel worksheets over multiple loops

I've got a number of Excel workbooks, each with multiple worksheets, that I'd like to combine.
I've set up two sets of loops (one while, one for) to read in rows for each sheet in a given workbook and then do the same for all workbooks.
I tried to do it on a subset of these, and it appears to work until I try to combine the two sets using the pd.concat function. Error given is
TypeError: first argument must be an iterable of pandas objects, you
passed an object of type "DataFrame"
Any idea what I'm doing incorrectly?
import pandas as pd
d = 2013
numberOfSheets = 5
while d < 2015:
    #print(str(d) + ' beginning')
    f ='H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1,numberOfSheets+1):
        data = pd.read_excel(f, sheetname = 'Table '+str(i), header=None)
        print(i)
        df.append(data)
    print(str(d) + ' complete')
    print(df)
    d += 1
df = pd.concat(df)
print(df)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df.to_excel(final)
As the error says, pd.concat() requires an iterable, like a list: pd.concat([df1, df2]) will concatenate df1 and df2 along the default axis of 0, which means df2 is appended to the bottom of df1.
Two issues need fixing:
The for loop refers to df before assigning anything to it.
The variable df is overwritten with each iteration of the for loop.
One workaround is to create an empty list of DataFrames before the loops, then append DataFrames to that list, and finally concatenate all the DataFrames in that list. Something like this:
import pandas as pd
d = 2013
numberOfSheets = 5
dfs = []
while d < 2015:
    #print(str(d) + ' beginning')
    f ='H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        # sheet_name is the current spelling of the older sheetname parameter
        data = pd.read_excel(f, sheet_name='Table ' + str(i), header=None)
        print(i)
        dfs.append(data)
    print(str(d) + ' complete')
    print(dfs)
    d += 1
# ignore_index=True gives the result a default RangeIndex
# starting from 0
df_final = pd.concat(dfs, ignore_index=True)
print(df_final)
final_path = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final_path)
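The list-then-concat pattern can be checked without any Excel files. In this sketch two small made-up frames stand in for worksheets (the value column is an assumption); passing a single DataFrame to pd.concat raises the TypeError from the question, while passing the list works.

```python
import pandas as pd

# Two small frames standing in for worksheets; the data is made up.
sheet_a = pd.DataFrame({"value": [1, 2]})
sheet_b = pd.DataFrame({"value": [3]})

# Passing a single DataFrame instead of a list raises a TypeError.
try:
    pd.concat(sheet_a)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Collect the frames in a list first, then concatenate once at the end.
frames = []
for sheet in (sheet_a, sheet_b):
    frames.append(sheet)
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating once at the end also avoids copying the growing result on every iteration.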
Since I can't comment, I'll leave this as an answer: you can speed up this code by opening the file once then parsing the workbook to get each sheet. Should save a second or two off each iteration, since opening the Excel file takes the longest. Here's some code that might help.
Note: setting sheet_name=None will return ALL the sheets in the workbook:
dfs = {<sheetname1>: <DataFrame1>, <sheetname2>: <DataFrame2>, etc.}
Here's the code:
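xl = pd.ExcelFile(fpath)
dfs = xl.parse(sheet_name=None, header=None)
# dfs is a dict, so iterate over its items to get each sheet name and DataFrame
for i, (name, df) in enumerate(dfs.items()):
    # do stuff with each sheet here, if you want
    print('Sheet {0} looks like:\n{1}'.format(i+1, df))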
Thank you, both. I accepted the answer that addressed the specific question, but was able to use the second answer and some additional googling thereafter (eg, glob) to amend the original code, and automate more fully independent of number of workbooks or worksheets.
Final version of the above now below:
import pandas as pd
import glob
#import numpy as np
#import os, collections, csv
#from os.path import basename
fpath = "H:/MyDocuments/Z Project Work/"
dfs = []
files = glob.glob(fpath+'*.xlsx')
for f in files:
    xl = pd.ExcelFile(f)
    # sheet_name=None returns a dict mapping each sheet name to its DataFrame
    xls = xl.parse(sheet_name=None, header=0)
    for i, df in enumerate(xls):
        print(i)
        # df is the sheet name (the dict key) here, so look up the DataFrame
        dfs.append(xls[df])
    print(f+ ' complete')
df_final = pd.concat(dfs, ignore_index=True)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final)
