Why do I get duplicates when trying to append csv files - python

I wrote a function parse_xml that converts XML to CSV. Then I want to convert every file within a folder (100,000 files) and append them all into one file. My code for that task is:
import glob
import marshal
import os
import pandas as pd

result = pd.DataFrame()
os.chdir('/Users/dp/Dropbox/Ratings/SP')
for file in list(glob.glob('*.xml')):
    data = marshal.dumps(file)
    obj = marshal.loads(data)
    parse_xml(obj)
    df = pd.DataFrame(rows, columns=cols)
    result = pd.concat([result, pd.DataFrame.from_records(df)])
result.to_csv('output.csv')
However, the result isn't what I'm looking for. It keeps re-appending the same files over and over again; 90% of the output's observations are duplicates.
Could somebody please give me a hint on how to resolve this issue? Thank you so much!
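Not knowing what parse_xml does internally, here is a hedged sketch of one common fix: if rows and cols are module-level variables that parse_xml keeps appending to, they accumulate across files, so every file re-emits all earlier rows. Having parse_xml return fresh data per file (an assumption about its interface) and concatenating once avoids that; the marshal round-trip of the file name can also be dropped, since it only reproduces the same string:
import glob
import os
import pandas as pd

os.chdir('/Users/dp/Dropbox/Ratings/SP')

frames = []
for file in glob.glob('*.xml'):
    rows, cols = parse_xml(file)               # assumed to return fresh (rows, cols) per file
    frames.append(pd.DataFrame(rows, columns=cols))

result = pd.concat(frames, ignore_index=True)  # concatenate once instead of growing in the loop
result.to_csv('output.csv')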

Related

Python - Read txt files from Network shared drive

I am trying to read in data from text files that are moved into a network share drive over a VPN. The overall intent is to be able to loop through the files with yesterday's date (either in the file name, or by the Modified Date) and extract the pipe delimited data separated by "|" and concat it into a pandas df. The issue I am having is actually being able to read files from the network drive. So far I have only been able to figure out how to use os.listdir to identify the file names, but not actually read them. Anyone have any ideas?
This is what I've tried so far that has actually started to pan out, with os.listdir correctly being able to see the network folder and the files inside. But how would I read the actual files (filtered by date or not) to get the loop to work?
import pandas as pd

# folder = os.listdir(r'\\fileshare.com\PATH\TO\FTP\FILES')
folder = (r'\\fileshare.com\PATH\TO\FTP\FILES')
main_dataframe = pd.DataFrame(pd.read_csv(folder[0]))
for i in range(1, len(folder)):
    data = pd.read_csv(folder[i])
    df = pd.DataFrame(data)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
print(main_dataframe)
I'm pretty new at Python and doing things like this, so I apologize if I refer to anything wrong. Any advice would be greatly appreciated!
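Not a definitive answer, but a minimal sketch of one way the loop could look, assuming the UNC path is reachable from the VPN session and filtering on the file's modified date (the path and the '|' delimiter come from the question; the date logic is an assumption):
import os
import pandas as pd
from datetime import date, timedelta

folder = r'\\fileshare.com\PATH\TO\FTP\FILES'
yesterday = date.today() - timedelta(days=1)

frames = []
for name in os.listdir(folder):
    path = os.path.join(folder, name)  # os.listdir gives names only, so build the full path
    # keep only files last modified yesterday (one of the two options in the question)
    if date.fromtimestamp(os.path.getmtime(path)) == yesterday:
        frames.append(pd.read_csv(path, sep='|'))

main_dataframe = pd.concat(frames, ignore_index=True)
print(main_dataframe)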

How to read in excel files from a folder and join them into a single df?

First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several Excel files with the same naming convention, column headers, and data types, which I am trying to read in with pandas. After reading them in, I want to compare the column 'Agreement Date' across all the dataframes and create a yes/no column if they match. I then want to export the dataframe.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob

xlpath = "/Users/myname/Documents/Python/"
# read .xls file names into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
    # place all the pulled dataframe into a list
    list = [raw_excel]
From here though I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column? Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value, not the whole list, to read_excel.
You have to append to the list within the loop, otherwise only the last item will be in the list.
Do not overwrite Python builtins such as list, or you can run into some difficult-to-debug behavior.
Here's what I would change:
import pandas as pd
import glob

xlpath = "/Users/myname/Documents/Python/"
# get the file name list of .xls files in the directory
allfiles = glob.glob(xlpath + "*.xls")
# loop to read in all files & place each pulled dataframe into a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then append the DataFrames like this:
merged_df = dataframes_list[0]
for df in dataframes_list[1:]:
    # append returns a new DataFrame, so the result must be reassigned
    merged_df = merged_df.append(df, ignore_index=True)
Use ignore_index=True if the indexes are overlapping and causing problems. If they are already distinct and you want to keep them, set it to False.
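A side note: DataFrame.append was deprecated and later removed in pandas 2.0, and pd.concat does the same job in one step:
merged_df = pd.concat(dataframes_list, ignore_index=True)
As for comparing 'Agreement Date' across the files, here is one hedged sketch; the 'ID' column name is an assumption about your data, and the idea is to flag each row yes/no depending on whether its ID carries a single distinct date across all files:
all_df = pd.concat(dataframes_list, ignore_index=True)
# one distinct 'Agreement Date' per ID means the files agree for that ID
match = all_df.groupby('ID')['Agreement Date'].nunique().eq(1)
all_df['dates_match'] = all_df['ID'].map(match).map({True: 'yes', False: 'no'})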

Appending lots of csv files in folders within a folder

I am working on a shared network drive where I have 1 folder (main folder) containing many subfolders; 1 for each date (over 1700) and then within them csv files (results.csv) with a common name at the end (same file format). Each csv contains well over 30k rows.
I wish to read in all the csvs, appending them into one dataframe to perform some minor calculations. I have used the code below. It ran for 3+ days so I quit, but looking at the dataframe it actually got 80% of the way through. It seems inefficient: it takes ages, and when I want to add the latest day's file it will have to re-run again. I also only need a handful of the columns within each csv, so I want to use the usecols=['A', 'B', 'C'] argument but am not sure how to incorporate it. Could someone shed some light on a better solution?
import glob
import os
import pandas as pd

file_source = glob.glob(r"//location//main folder//**//*results.csv", recursive=True)
appended_file = []
for i in file_source:
    df = pd.read_csv(i)
    appended_file.append(df)
combined = pd.concat(appended_file, axis=0, ignore_index=True, sort=False)
Thanks.
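On the usecols question at least: it is just a keyword argument to read_csv, so it slots straight into the existing loop (the names 'A', 'B', 'C' are the placeholders from the question):
for i in file_source:
    # parse only the needed columns; this also speeds up reading
    df = pd.read_csv(i, usecols=['A', 'B', 'C'])
    appended_file.append(df)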

How to call a python function in PySpark?

I have multiple files (CSV and XML) and I want to apply some filters to them.
I defined a function doing all those filters, and I want to know how I can call it so that it applies to my CSV files.
PS: The type of my dataframe is: pyspark.sql.dataframe.DataFrame
Thanks in advance
For example, if you read in your first CSV file as df1 = spark.read.csv(..) and your second CSV file as df2 = spark.read.csv(..),
wrap all the pyspark.sql.dataframe.DataFrames that came from CSV files into a list:
csvList = [df1, df2, ...]
and then,
for i in csvList:
    YourFilterOperation(i)
Basically, for every i in csvList, which is a pyspark.sql.dataframe.DataFrame that came from a CSV file, the loop goes through them one by one and performs whatever filter operation you've written.
Since you haven't provided any reproducible code, I can't see if this works on my Mac.
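For concreteness, a minimal sketch under assumptions: the column name 'rating' and the filter condition are made up, and since Spark transformations return new DataFrames, the results are captured rather than discarded:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def your_filter_operation(df):
    # hypothetical filter: keep rows where the assumed 'rating' column is non-null
    return df.filter(F.col("rating").isNotNull())

df1 = spark.read.csv("file1.csv", header=True)
df2 = spark.read.csv("file2.csv", header=True)
csvList = [df1, df2]

# collect the filtered DataFrames instead of throwing the results away
filtered = [your_filter_operation(df) for df in csvList]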

I want to combine csv's, dropping rows while only keeping certain columns

This is the code I have so far:
import pandas as pd
import glob, os

os.chdir("L:/FMData/")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results = results.append(namedf)
    results.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This, however, is producing a new document as instructed but isn't copying any data into it. I want to look up files in a certain location with names along the lines of FM001, combine them, skip the first 7 rows in each csv, and keep only columns 1 and 2 in the new file. Can anyone help with my code?
Thanks in advance!!!
To combine multiple csv files, you should create a list of dataframes. Then combine the dataframes within your list via pd.concat in a single step. This is much more efficient than appending to an existing dataframe.
In addition, you need to write your result to a file outside your for loop.
For example:
results = []
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results.append(namedf)  # list.append works in place, so no reassignment
df = pd.concat(results, axis=0)
df.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This code works on my side (using Linux and Python 3); it populates a csv file with data.
Add a print just after the read_csv to see if your csv file actually reads any data, otherwise nothing will be written, like this:
namedf = pd.read_csv(file)
print(namedf)
results = results.append(namedf)
It adds row 1 (probably because it is considered the header), then skips 7 rows, then continues. This is my result for a csv file containing the words one to eleven, one per row:
F5331_FM001.csv
one
0 nine
1 ten
2 eleven
Addition:
If print(namedf) shows nothing, then check your input files.
The Python program is looking in L:/FMData/ for your files. Are you sure your files are located in that directory? You can change the directory by pointing the os.chdir command at the correct path.
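As a quick check along those lines (re-using the path and pattern from the question), print what glob actually finds before reading anything:
import glob, os

os.chdir("L:/FMData/")             # point the working directory at the data folder
print(glob.glob("F5331_FM001**"))  # an empty list means the path or pattern is wrong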
