Subsetting in chunks in pandas

Here is my code:
path = 'C:\\Users\\Daniil\\Desktop\\dw_payments'
# list of all df:
all_files = glob.glob(path + '/*.csv')
all_payments_data = pd.DataFrame()
dfs = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, chunksize=200000)
    df_f = df[df['CUSTOMER_NO'] == 20069675]
    df_f = pd.concat(df_f, ignore_index=True)
    dfs.append(df_f)
all_payments_data = pd.concat(dfs)
As you can see in the line df_f = df[df['CUSTOMER_NO'] == 20069675], I want to select a specific customer in each chunk and then merge the result into the empty data frame. I want to repeat the process many times (there are a lot of files).
But it throws an error:
TypeError: 'TextFileReader' object is not subscriptable
How can I fix it?

You need to iterate over the TextFileReader that read_csv returns when chunksize is set, filter each chunk, and append the filtered chunk to the list df_s. Then call concat only once, after the loops.
Note: all files must have the same structure (the same column names in the same order).
df_s = []
for file in all_files:
    txt = pd.read_csv(file, index_col=None, chunksize=200000)
    for df in txt:
        df_s.append(df[df['CUSTOMER_NO'] == 20069675])
df_f = pd.concat(df_s, ignore_index=True)
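For illustration, here is a self-contained sketch of the same chunk-by-chunk filtering. The helper name filter_customer, the AMOUNT column, and the two temporary CSV files are made up for the demo; only CUSTOMER_NO and the customer number come from the question. A tiny chunksize is used so each file is really read in several pieces.

```python
import glob
import os
import tempfile

import pandas as pd


def filter_customer(files, customer_no, chunksize=200_000):
    """Collect the rows for one customer from many CSV files, chunk by chunk."""
    matches = []
    for file in files:
        # With chunksize set, read_csv returns an iterator of DataFrames
        # (a TextFileReader), so it has to be looped over, not indexed.
        for chunk in pd.read_csv(file, chunksize=chunksize):
            matches.append(chunk[chunk["CUSTOMER_NO"] == customer_no])
    # Concatenate once at the end; note pd.concat raises on an empty list.
    return pd.concat(matches, ignore_index=True)


# Hypothetical demo data written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"CUSTOMER_NO": [20069675, 1, 20069675], "AMOUNT": [10, 20, 30]}
    ).to_csv(os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame(
        {"CUSTOMER_NO": [2, 20069675], "AMOUNT": [40, 50]}
    ).to_csv(os.path.join(tmp, "b.csv"), index=False)

    files = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    result = filter_customer(files, 20069675, chunksize=2)

print(result)
```

Only the filtered rows of each chunk are kept in memory, which is the point of reading in chunks.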

Related

How to read and manipulate multiple CSV files using pandas and for-loop?

I want to read a list of CSV files, for example exon_kipan.00001.csv, exon_kipan.00002.csv, exon_kipan.00003.csv, and exon_kipan.00004.csv (24 files in total), and then perform a series of operations using pandas before concatenating the dataframes.
For a single file, I would do:
df= pd.read_csv("exon_kipan.csv", sep="\t", index_col=0, low_memory=False)
df= df[df.columns[::3]]
df= df.T
del df[df.columns[0]]
df.index = df.index.str.upper()
df= df.sort_index()
df.index = ['-'.join( s.split('-')[:4]) for s in df.index.tolist() ]
df.rename_axis(None, axis=1, inplace=True)
However, now I want to read, manipulate, and concatenate multiple files.
filename = '/work/exon_kipan.{}.csv'
df_dict = {}
exon_clin_list = []
for i in range(1, 25):
    df_dict[i] = pd.read_csv(filename, sep="\t", index_col=0, low_memory=False)
    df_dict[i] = df_dict[i][df_dict[i].columns[::3]]
    df_dict[i] = df_dict[i].T
    del df_dict[i][df_dict[i].columns[0]]
    df_dict[i].index = df_dict[i].index.str.upper()
    df_dict[i] = df_dict[i].sort_index()
    df_dict[i].index = ['-'.join(s.split('-')[:4]) for s in df_dict[i].index.tolist()]
    df_dict[i].rename_axis(None, axis=1, inplace=True)
    exon_clin_list.append(df_dict[i])
exon_clin = pd.concat(exon_clin_list)
My code raised:
FileNotFoundError: [Errno 2] No such file or directory: '/work/exon_kipan.{}.csv'

You have to use the format method of str to fill in the file number:
filename = '/work/exon_kipan.{:05}.csv'  # <- don't forget to modify here
...
for i in range(1, 25):
    df_dict[i] = pd.read_csv(filename.format(i), ...)
Test:
filename = '/work/exon_kipan.{:05}.csv'
for i in range(1, 25):
    print(filename.format(i))
# Output
/work/exon_kipan.00001.csv
/work/exon_kipan.00002.csv
/work/exon_kipan.00003.csv
/work/exon_kipan.00004.csv
/work/exon_kipan.00005.csv
/work/exon_kipan.00006.csv
/work/exon_kipan.00007.csv
/work/exon_kipan.00008.csv
/work/exon_kipan.00009.csv
/work/exon_kipan.00010.csv
/work/exon_kipan.00011.csv
/work/exon_kipan.00012.csv
/work/exon_kipan.00013.csv
/work/exon_kipan.00014.csv
/work/exon_kipan.00015.csv
/work/exon_kipan.00016.csv
/work/exon_kipan.00017.csv
/work/exon_kipan.00018.csv
/work/exon_kipan.00019.csv
/work/exon_kipan.00020.csv
/work/exon_kipan.00021.csv
/work/exon_kipan.00022.csv
/work/exon_kipan.00023.csv
/work/exon_kipan.00024.csv
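As a small sketch of what the {:05} format spec does: it pads the number with zeros to a width of five. The f-string variant shown next to it is an equivalent alternative, not something from the answer above.

```python
# Build the 24 file names with str.format and a zero-padded width of 5.
filename = "/work/exon_kipan.{:05}.csv"
paths = [filename.format(i) for i in range(1, 25)]

# The same thing written with an f-string.
paths_f = [f"/work/exon_kipan.{i:05}.csv" for i in range(1, 25)]

print(paths[0])
print(paths[-1])
```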

Maybe something like this will work:
import glob
import os
import pandas as pd

# read one file, do some processing, and return a dataframe
def read_file_and_do_some_actions(filename):
    df = pd.read_csv(filename, index_col=None, header=0)
    # do some processing here
    return df

path = '/work'
all_files = sorted(glob.glob(os.path.join(path, "exon_kipan.*.csv")))
# call read_file_and_do_some_actions on each file, then concatenate all the dataframes into one
df = pd.concat((read_file_and_do_some_actions(f) for f in all_files), ignore_index=True)

Column appended to dataframe coming up empty

I have the following code:
import glob
import pandas as pd
import os
import csv
myList = []
path = "/home/reallymemorable/Documents/git/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    fileDate = pd.DataFrame({'Date': [dateFromFilename]})
    myList.append(row.join(fileDate))
concatList = pd.concat(myList, sort=True)
print(concatList)
concatList.to_csv('/home/reallymemorable/Documents/test.csv', index=False, header=True)
It goes through a folder of CSVs and grabs a specific row and puts it all in a CSV. The files themselves have names like 10-10-2020.csv. I have some code in there that gets the filename and removes the file extension, so I am left with the date alone.
I am trying to add another column called "Date" that contains the filename for each file.
The script almost works: it gives me a CSV of all the rows I pulled out of the various CSVs, but the Date column itself is empty.
If I do print(dateFromFilename), the date/filename prints as expected (e.g. 10-10-2020).
What am I doing wrong?

join uses how='left' by default and aligns on the index. Your fileDate dataframe has index 0, while row keeps its original index from df, so the indexes generally do not match and the Date column comes out as NaN. Instead, do an assignment:
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    myList.append(row.assign(Date=dateFromFilename))
concatList = pd.concat(myList, sort=True)
Another way is to store the dataframes in a dictionary keyed by date, then concat; the keys become the outer level of the resulting index:
myList = dict()
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    myList[dateFromFilename] = row
concatList = pd.concat(myList, sort=True)
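To see the index mismatch in isolation, here is a minimal sketch with a made-up two-row frame. The Confirmed column and its values are hypothetical; Province_State and the date string follow the question.

```python
import pandas as pd

# Hypothetical stand-in for one daily report.
df = pd.DataFrame(
    {"Province_State": ["Ohio", "Pennsylvania"], "Confirmed": [10, 20]}
)
row = df.loc[df["Province_State"] == "Pennsylvania"]  # keeps index label 1
file_date = pd.DataFrame({"Date": ["10-10-2020"]})    # has index label 0

# Left join on the index: label 1 is not in file_date, so Date is NaN.
joined = row.join(file_date)

# assign broadcasts the scalar to every row, regardless of the index.
assigned = row.assign(Date="10-10-2020")

# Dictionary keys passed to concat become the outer index level.
keyed = pd.concat({"10-10-2020": row})

print(joined)
print(assigned)
print(keyed)
```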

Dynamically append dataframes in pandas

I want to load files from a list, calculate mean, median and standard deviation for each row of each file and then create a dataframe listing all the newly calculated fields.
I have the following code:
# list files to load
file_names = ["file_1", "file_2", ...]
# empty df
data = pd.DataFrame()
# for loop
for filename in file_names:
    df = pd.read_csv(filename, index_col=False, header=0)
    mean = df.mean(axis = 1)
    median = df.median(axis = 1)
    std = df.std(axis = 1)
    df = pd.concat([mean, median, std], axis = 1, ignore_index = 1)
    data = pd.concat(df, axis=1)
I'm getting an error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Individual dfs that are being created in the for loop look exactly how I want it but I can't concatenate them all together.

As it is you're overwriting df every time through the loop.
Instead collect the DataFrames in a list, then concatenate that list together.
df_list = []
# for loop
for filename in file_names:
    df = pd.read_csv(filename, index_col=False, header=0)
    mean = df.mean(axis=1)
    median = df.median(axis=1)
    std = df.std(axis=1)
    df = pd.concat([mean, median, std], axis=1, ignore_index=True)
    df_list.append(df)
data = pd.concat(df_list, axis=1)
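Here is a runnable sketch of the collect-then-concatenate pattern. To keep it self-contained, two small in-memory frames stand in for the CSV files; their contents and the use of keys to label the columns are assumptions added for the demo, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-ins for the frames that read_csv would return.
frames = {
    "file_1": pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]}),
    "file_2": pd.DataFrame({"a": [10, 40], "b": [20, 50], "c": [30, 60]}),
}

stats_list = []
for name, df in frames.items():
    # Row-wise statistics; keys label the three resulting columns.
    stats = pd.concat(
        [df.mean(axis=1), df.median(axis=1), df.std(axis=1)],
        axis=1,
        keys=["mean", "median", "std"],
    )
    stats_list.append(stats)

# One concat at the end; the outer keys record which file each block came from.
data = pd.concat(stats_list, axis=1, keys=list(frames))
print(data)
```

Passing a list to pd.concat is what the TypeError in the question was asking for: the first argument must be an iterable of pandas objects, not a single DataFrame.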

Change this line:
data = pd.concat(df, axis=1)
to:
data = pd.concat([data, df], axis=1)
That should work. Let me know either way.

Python, Pandas from data frame to create new data

Original spreadsheets have 2 columns. I want to pick the rows by given criteria (according to months), and put them into new files.
The original files looked like:
The codes I am using:
import os
import pandas as pd
working_folder = "C:\\My Documents\\"
file_list = ["Jan.xlsx", "Feb.xlsx", "Mar.xlsx"]
with open(working_folder + '201703-1.csv', 'a') as f03:
    for fl in file_list:
        df = pd.read_excel(working_folder + fl)
        df_201703 = df[df.ARRIVAL.between(20170301, 20170331)]
        df_201703.to_csv(f03, header = True)
with open(working_folder + '201702-1.csv', 'a') as f02:
    for fl in file_list:
        df = pd.read_excel(working_folder + fl)
        df_201702 = df[df.ARRIVAL.between(20170201, 20170231)]
        df_201702.to_csv(f02, header = True)
with open(working_folder + '201701-1.csv', 'a') as f01:
    for fl in file_list:
        df = pd.read_excel(working_folder + fl)
        df_201701 = df[df.ARRIVAL.between(20170101, 20170131)]
        df_201701.to_csv(f01, header = True)
The results are like:
Improvements I want to make:
Save them as xlsx files instead of .csv
Not to have the first index columns
Keeping only 1 row (top) headers (now each csv has 3 rows of headers)
How can I do that? Thank you.

Create a list of DataFrames, concat them together, and then write the result to a file once. Writing once gives a single header row, and index = False drops the index column:
dfs1 = []
for fl in file_list:
    df = pd.read_excel(working_folder + fl)
    dfs1.append(df[df.ARRIVAL.between(20170101, 20170131)])
pd.concat(dfs1).to_excel('201701-1.xlsx', index = False)
This can be simplified with a list comprehension:
file_list = ["Jan.xlsx", "Feb.xlsx", "Mar.xlsx"]
dfs1 = [pd.read_excel(working_folder + fl).query('20170101 <= ARRIVAL <= 20170131') for fl in file_list]
pd.concat(dfs1).to_excel('201701-1.xlsx', index = False)
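The three near-identical blocks in the question could also be driven from one table of month ranges. The sketch below assumes a made-up frame with a hypothetical NAME column, and it leaves the to_excel call commented out because writing .xlsx needs an Excel writer package such as openpyxl to be installed.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated spreadsheets.
df = pd.DataFrame(
    {
        "ARRIVAL": [20170105, 20170214, 20170320, 20170131],
        "NAME": ["a", "b", "c", "d"],
    }
)

# Inclusive lower and upper bounds for each output file.
month_ranges = {
    "201701": (20170101, 20170131),
    "201702": (20170201, 20170228),
    "201703": (20170301, 20170331),
}

by_month = {}
for label, (start, end) in month_ranges.items():
    by_month[label] = df[df["ARRIVAL"].between(start, end)]
    # One write per month: a single header row and no index column.
    # by_month[label].to_excel(f"{label}-1.xlsx", index=False)

# query with a chained comparison selects the same rows as between.
jan_query = df.query("20170101 <= ARRIVAL <= 20170131")
print(by_month["201701"])
```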

how to append data in one dataframe from different files?

I have used the following code to read the data from the files. I tried to build one time series in a single data frame, but I am going wrong somewhere.
files = glob.glob('*.txt')
files.sort()
for infile in files:
    year,formatt = infile.split('.')
    year = year.split('_')[1]
    ws = [4,9,7,7,7,7,7,7,7,7,7,7,7]
    df = pd.read_fwf(infile,widths=ws,header=9, nrows=31, keep_default_na = False)
    df = df.drop('Day', 1)
    df = np.array(df.T)
    df = df[df != '']
    data = pd.DataFrame([])
    data['Discharge'] = df
    data = data.set_index(pd.date_range(year, periods=len(data), freq='D'),
                          drop=True, append=False, inplace=False, verify_integrity=False)
    new = pd.DataFrame([])
    all_ = new.append(data)
print(all_)
Can anyone help me figure out my problem?
My sample data is at this link: https://drive.google.com/open?id=0B2rkXkOkG7ExSWQ5djloNkpNWmc
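No answer is recorded for this question. One likely cause, assuming the accumulator frame is created inside the loop, is that it is reset on every iteration, so only the last file survives. The sketch below shows the list-then-concat pattern used in the other answers on this page; the yearly values are made up and stand in for the parsed files.

```python
import pandas as pd

# Hypothetical discharge values per year, standing in for the parsed files.
yearly_values = {"2001": [1.0, 2.0, 3.0], "2002": [4.0, 5.0]}

pieces = []  # created once, before the loop
for year, values in sorted(yearly_values.items()):
    data = pd.DataFrame({"Discharge": values})
    # Daily index starting on 1 January of the given year.
    data.index = pd.date_range(year, periods=len(data), freq="D")
    pieces.append(data)

# Single concat after the loop keeps every year in one time series.
all_ = pd.concat(pieces)
print(all_)
```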
