Analysing multiple files using for loop - python

I have 10 files which I need to work on.
I need to import those files using pd.read_csv to turn them all into dataframes, with usecols since I only need the same two specific columns from each file.
I then need to search the two columns for a specific entry in the rows, like 'abcd', and have python return a new df which includes all the rows it appeared in for each file.
Is there a way I could do this using a for loop? So far I've only got a list of all the paths to the 10 files.
So far what I do for one file without the for loop is:
df = pd.read_csv(r'filepath', header=2, usecols=['Column1', 'Column2'])
search_df = df.loc[df['Column1'] == 'abcd']
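A minimal sketch of that loop, assuming the paths live in a list called file_paths (the name is illustrative) and that 'abcd' should be matched in either column:

import pandas as pd

# file_paths is the existing list of paths to the 10 files
matches = []
for path in file_paths:
    df = pd.read_csv(path, header=2, usecols=['Column1', 'Column2'])
    # keep rows where either column equals the search term
    mask = (df['Column1'] == 'abcd') | (df['Column2'] == 'abcd')
    matches.append(df[mask])

# one dataframe holding the matching rows from all files
search_df = pd.concat(matches, ignore_index=True)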

Related

How to read in excel files from a folder and join them into a single df?

First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several Excel files with the same naming pattern, column headers, and data types that I am trying to read in with pandas. After reading them in, I want to compare the column 'Agreement Date' across all the dataframes and create a yes/no column if they match. I then want to export the data frame.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read the .xlsx files into a list
allfiles = glob.glob(xlpath + "*.xlsx")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
    # place all the pulled dataframe into a list
    list = [raw_excel]
From here though I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column? Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value, not the whole list, to read_excel.
You have to append to the list within the loop, otherwise only the last item will be in the list.
Do not overwrite Python builtins such as list, or you can encounter some difficult-to-debug behavior.
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get file name list of .xlsx files in the directory
allfiles = glob.glob(xlpath + "*.xlsx")
# for loop to read in all files & place all the pulled dataframe into a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then append the DataFrames like this:
merged_df = dataframes_list[0]
for df in dataframes_list[1:]:
    merged_df = merged_df.append(df, ignore_index=True)
Use ignore_index=True if the indexes are overlapping and causing problems. If they are already distinct and you want to keep them, set it to False.
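Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas the same merge is a single pd.concat call over the list built above:

# combines every dataframe in the list in one step
merged_df = pd.concat(dataframes_list, ignore_index=True)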

pd.read_excel is combining the first rows into one

I'm using a simple piece of code to import an Excel file. However, the command is combining the first two rows into one. I would like to keep them separated (as they are in the Excel file).
db=pd.read_excel('fileaddress', sheetname='Sheet1')
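If both of those rows are meant to be data, a likely cause is that read_excel is consuming one of them as the header row. One thing to try, a sketch assuming a recent pandas where the argument is spelled sheet_name rather than sheetname:

import pandas as pd

# header=None tells pandas not to consume any row as column names
db = pd.read_excel('fileaddress', sheet_name='Sheet1', header=None)

Alternatively, header=[0, 1] reads the first two rows as a two-level header, if that matches the sheet's layout.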

How to call a python function in PySpark?

I have multiple files (CSV and XML) and I want to apply some filters.
I defined a function doing all those filters, and I want to know how I can call it so that it applies to my CSV files?
PS: The type of my dataframe is: pyspark.sql.dataframe.DataFrame
Thanks in advance
For example, if you read in your first CSV file as df1 = spark.read.csv(..) and your second CSV file as df2 = spark.read.csv(..)
Wrap all the pyspark.sql.dataframe.DataFrame objects that came from the CSV files into a list:
csvList = [df1, df2, ...]
and then,
for i in csvList:
    YourFilterOperation(i)
Basically, for every i in csvList, each a pyspark.sql.dataframe.DataFrame that came from a CSV file, the loop goes through them one by one and performs whatever filter operation you've written.
Since you haven't provided any reproducible code, I can't see if this works on my Mac.
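For completeness, a minimal sketch of what YourFilterOperation might look like; the 'status' column and its value are hypothetical stand-ins, not from the original question:

from pyspark.sql import functions as F

def YourFilterOperation(df):
    # keep only rows where the hypothetical 'status' column equals 'active'
    return df.filter(F.col('status') == 'active')

filtered = [YourFilterOperation(df) for df in csvList]

Since DataFrame.filter returns a new DataFrame rather than mutating i in place, keep the return values instead of discarding them inside the loop.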

I want to combine csv's, dropping rows while only keeping certain columns

This is the code I have so far:
import pandas as pd
import glob, os
os.chdir("L:/FMData/")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results = results.append(namedf)
results.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This however is producing a new document as instructed but isn't copying any data into it. I want to look up files in a certain location with names along the lines of FM001, combine them, skip the first 7 rows of each csv, and only keep columns 1 and 2 in the new file. Can anyone help with my code?
Thanks in advance!!!
To combine multiple csv files, you should create a list of dataframes. Then combine the dataframes within your list via pd.concat in a single step. This is much more efficient than appending to an existing dataframe.
In addition, you need to write your result to a file outside your for loop.
For example:
results = []
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results.append(namedf)  # list.append mutates in place, so don't reassign its result
df = pd.concat(results, axis=0)
df.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This code works on my side (using Linux and Python 3); it populates a csv file with data in it.
Add a print just after the read_csv to see if any data is actually read from your csv file, else nothing will be written, like this:
namedf = pd.read_csv(file)
print(namedf)
results = results.append(namedf)
It adds row 1 (probably because it is considered the header), then skips 7 rows and continues. This is my result for a csv file containing the words one to eleven, one per row:
F5331_FM001.csv
one
0 nine
1 ten
2 eleven
Addition:
If print(namedf) shows nothing, then check your input files.
The python program is looking in L:/FMData/ for your files. Are you sure your files are located in that directory? You can change the directory by adding the correct path with the os.chdir command.
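A quick, hedged way to check both points at once is to print what the glob pattern actually matches before the loop runs:

import glob, os

os.chdir("L:/FMData/")
# should print the list of FM001 csv files you expect;
# an empty list means the path or the pattern is wrong
print(glob.glob("F5331_FM001**"))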

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and, based on the dates in the paths, add two columns to the data frame. The files were stored as ORC, partitioned by these dates using Hive, so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date. So here I am trying to fetch multiple directories from multiple paths, and based on the partitions I have to create two columns, namely mg_load_date and event_date, for each spark dataframe. I am reading these as input and combining them after adding these two columns, finding the dates for each file respectively.
Since I have one read per file, is there a way to read all the files at once while adding the two columns for their specific rows? Or any other way to make the read operation faster, since I have many reads?
I guess reading all the files at once like sqlContext.read.format('orc').load(input_paths) is faster than reading them individually and then merging them.
Any help is appreciated.
import re
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

dfs = []
for i in input_paths:
    df = sqlContext.read.format('orc').load(i)
    # pull the date out of the partition directory name
    date = re.search('mg_load_date=([^/]*)/$', i).group(1)
    df = df.withColumn('event_date', F.lit(date)).withColumn('mg_load_date', F.lit(date))
    dfs += [df]
df = reduce(DataFrame.unionAll, dfs)
As @user8371915 says, you should load your data from the root path instead of passing a list of subdirectories:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path, you can try using pyspark.sql.functions.input_file_name to get the name of the file for each row of your dataframe, as sketched below.
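A sketch of that fallback, assuming the same sqlContext and input_paths as above. The regular expression is adapted from the one in the question; the trailing /$ anchor is dropped because input_file_name returns the full file path, with the partition directory in the middle:

from pyspark.sql import functions as F

df = sqlContext.read.format('orc').load(input_paths)
# record which file each row came from
df = df.withColumn('source_file', F.input_file_name())
# extract the partition value embedded in the path, e.g. .../mg_load_date=2018-01-01/...
df = df.withColumn('mg_load_date', F.regexp_extract('source_file', 'mg_load_date=([^/]*)', 1))
df = df.withColumn('event_date', F.col('mg_load_date'))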
Spark 2.2.0+ can read from multiple folders using the orc format directly:
df = spark.read.orc([path1, path2])
ref: https://issues.apache.org/jira/browse/SPARK-12334
