Pandas apply Series - order of the columns - Python

To aggregate and to find values per second, I am doing the following in Python using pandas. However, the output logged to a file doesn't show the columns in the order they appear here: somehow the column names get sorted, so TotalDMLsSec shows up before UpdateTotal and UpdatesSec.
'DeletesTotal': x['Delete'].sum(),
'DeletesSec': x['Delete'].sum()/VSeconds,
'SelectsTotal': x['Select'].sum(),
'SelectsSec': x['Select'].sum()/VSeconds,
'UpdateTotal': x['Update'].sum(),
'UpdatesSec': x['Update'].sum()/VSeconds,
'InsertsTotal': x['Insert'].sum(),
'InsertsSec': x['Insert'].sum()/VSeconds,
'TotalDMLsSec':(x['Delete'].sum()+x['Update'].sum()+x['Insert'].sum())/VSeconds
})
)
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')
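(A side note on the ordering itself: once df holds the aggregated result, the column order can be forced by reindexing; a minimal sketch, where the column list simply repeats the keys above:)
ordered_cols = ['DeletesTotal', 'DeletesSec', 'SelectsTotal', 'SelectsSec',
                'UpdateTotal', 'UpdatesSec', 'InsertsTotal', 'InsertsSec',
                'TotalDMLsSec']
df = df.reindex(columns=ordered_cols)   # force an explicit column order before writing
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')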
Apart from the above question, I have a couple of other questions:
Despite logging in CSV format, all values/columns appear in a single column when the file is opened in Excel. Is there any way to load the CSV properly?
Can rows be sorted based on one column (let's say InsertsSec) by default when writing to the CSV file?
Any help here would be really appreciated.

Assume that your DataFrame is something like this:
Deletes Selects Updates Inserts
Name
Xxx 20 10 40 50
Yyy 12 32 24 11
Zzz 70 20 30 20
Then both total and total per sec can be computed as:
total = df.sum().rename('Total')
VSeconds = 5 # I assumed some value
tps = (total / VSeconds).rename('Total per sec')
Then you can add both above rows to the DataFrame:
df = df.append(total).append(tps)
The downside is that all numbers are converted to float.
But in pandas there is no other way, as each column must have values of a single type.
Then you can e.g. write it to a CSV file (with totals included).
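For example, a minimal sketch (the output file name is illustrative; on pandas 2.x, where DataFrame.append has been removed, the two summary rows can be attached with pd.concat instead):
# total and tps are the Series built above; concat attaches them as rows
df = pd.concat([df, total.to_frame().T, tps.to_frame().T])
df.to_csv('summary_with_totals.csv', sep='\t', encoding='utf-8-sig')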

This is how I ended up doing it:
df.to_excel(vExcelFile,'All')
vSortedDF=df.sort_values(['Deletes%'],ascending=False)
vSortedDF.loc[vSortedDF['Deletes%']> 5, ['DeletesTotal','DeletesSec','Deletes%']].to_excel(vExcelFile,'Top Delete objects')
vExcelFile.save()
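Here vExcelFile is assumed to be a pandas ExcelWriter; a minimal sketch of how it would be set up (the path is illustrative):
import pandas as pd

vExcelFile = pd.ExcelWriter('/home/summary.xlsx')   # illustrative path
df.to_excel(vExcelFile, 'All')                       # one sheet per call, as in the snippet above
vExcelFile.save()                                    # newer pandas versions use vExcelFile.close() instead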
For CSV, instead of using the \t separator I used , and it worked just fine.
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')

Related

How to create python function that performs multiple checks on a dataframe?

I have multiple inventory tables like so:
line no   qty-1     qty-2
1         -         3
2         42.1 FT   -
3         5         -
4         -         10 FT
5         2         1
6         6.7       -
or
line no   qty
1         2
2         4.5 KG
3         5
4
5         13
6         AR
I want to create a logic check for the quantity column using Python. (The table may have more than one qty column and I need to be able to check all of them. In both examples, I have the tables formatted as dataframes.)
Acceptable criteria:
integer with or without "EA" (meaning each)
"AR" (as required)
integer or float with unit of measure
if multiple QTY columns, then "-" is also accepted (first table)
I want to return a list per page, containing the line no. corresponding to rows where quantity value is missing (line 4, second table) or does not meet acceptance criteria (line 6, table 1). If the line passes the checks, then return True.
I have tried:
qty_col = [col for col in df.columns if 'qty' in col]
df['corr_qty'] = np.where(qty_col.isnull(), False, df['line_no'])
but this creates the quantity columns as a list and yields the following
AttributeError: 'list' object has no attribute 'isnull'
Intro and Suggestions:
Welcome to StackOverflow. A general tip when asking questions on S.O.: include as much information as possible. In addition, always identify the libraries you want to use and the approach you would accept, since there can be multiple solutions to the same problem; it looks like you've done that.
Also, it is best to share all, or at least most, of your attempted solutions so others can understand your thought process and work out the best approach to a potential solution.
The Solution:
It wasn't clear whether the solution you are looking for requires reading the PDF directly into the dataframe, or whether converting the PDF to a CSV and processing the data from the CSV is sufficient. I took the latter approach.
import tabula as tb
import pandas as pd
#PDF file path
input_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.pdf"
#CSV file path
output_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.csv"
#Read the PDF
#id = tb.read_pdf(input_file_path, pages='all')
#Convert the PDF to CSV
cv = tb.convert_into(input_file_path, output_file_path, "csv", pages="all")
#Read initial data
id = pd.read_csv(output_file_path, delimiter=",")
#Print the initial data
print(id)
#Create the dataframe
df = pd.DataFrame(id, columns = ['qty'])
#Print the data as a DataFrame object; boolean values when conditions met
print(df.notna())
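The acceptance checks themselves are not implemented above; the sketch below is one rough way to express them (the 'line no' column name, the helper names, and the exact patterns are assumptions, not part of the original code):
import re

def qty_ok(value, multiple_qty_cols):
    # True if a single qty cell meets the acceptance criteria
    s = str(value).strip()
    if multiple_qty_cols and s == '-':                    # '-' allowed only when several qty columns exist
        return True
    if s.upper() == 'AR':                                 # 'AR' (as required)
        return True
    if re.fullmatch(r'\d+(\s*EA)?', s, re.IGNORECASE):    # integer, optionally followed by 'EA'
        return True
    if re.fullmatch(r'\d+(\.\d+)?\s+[A-Za-z]+', s):       # integer/float with a unit, e.g. '42.1 FT'
        return True
    return False

def check_page(df):
    # Return True if every line passes, otherwise the failing line numbers
    qty_cols = [c for c in df.columns if 'qty' in c.lower()]
    multiple = len(qty_cols) > 1
    bad = [row['line no'] for _, row in df.iterrows()
           if not all(qty_ok(row[c], multiple) for c in qty_cols)]
    return True if not bad else bad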

Combining several csv files and calculating averages of a column

I have a folder with several csv files. Each file has several columns, but the ones I'm interested in are id, sequence (3 different sequences) and time. See an example of a csv file below. I have over 40 csv files; each file belongs to a different participant.
id     sequence   time
300    1          2
       2          3
       1          3
etc
I need to calculate the average times for 3 different sequences for each participant. The code I currently have combines the csv files in one dataframe, selects the columns I'm interested in (id, sequence and time) and calculates the average for each person for 3 conditions, pivots the table to wide format (it's the format I need) and exports this as a csv file. I am not sure if this is the best way to do it but it works. However, for some conditions 'time' is 0. I want to exclude these sequences from the averages. How do I do this? Many thanks in advance.
import glob
import pandas as pd

filenames = sorted(glob.glob('times*.csv'))
df = pd.concat((pd.read_csv(filename) for filename in filenames))
df_new = df[["id", "sequence", "time"]]
df_new_ave = df_new.groupby(['id', 'sequence'])['time'].mean().reset_index(name='Avg_Score')
df_new_ave_wide = df_new_ave.pivot(index='id', columns='sequence', values='Avg_Score')
df_new_ave_wide.to_csv('final_ave.csv', encoding='utf-8', index=True)
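One straightforward way to leave out the zero times (a sketch: filter the rows right after selecting the columns, then compute the averages exactly as before):
# Drop rows whose time is 0 before computing the per-sequence averages
df_new = df_new[df_new["time"] != 0]
df_new_ave = df_new.groupby(['id', 'sequence'])['time'].mean().reset_index(name='Avg_Score')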

generate output files with random samples from pandas dataframe

I have a dataframe with 500K rows. I need to distribute sets of 100 randomly selected rows to volunteers for labeling.
for example:
df = pd.DataFrame(np.random.randint(0,450,size=(450,1)),columns=list('a'))
I can remove a random sample of 100 rows and output a file with time stamp:
df_subset=df.sample(100)
df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
df=df.drop(df_subset.index)
the above works but if I try to apply it to the entire example dataframe:
while len(df) > 0:
    df_subset = df.sample(100)
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
    df = df.drop(df_subset.index)
it runs continuously. My expected output is 5 timestamped dfsample.csv files, 4 of which have 100 rows and the fifth 50 rows, all randomly selected from df. However, df.drop(df_subset.index) doesn't seem to update df, so the condition is always true and it runs forever generating csv files. I'm having trouble solving this problem.
Any guidance would be appreciated.
UPDATE
This gets me almost there:
for i in range(4):
    df_subset = df.sample(100)
    df = df.drop(df_subset.index)
    time.sleep(1)  # added because it runs too fast for unique naming
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
It requires me to specify the number of files. If I say 5 for the example df, I get an error on the 5th iteration. I hoped for 5 files, with the 5th having 50 rows, but I'm not sure how to do that.
After running your code, I think the problem is not with df.drop but with the line containing time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv': pandas creates several CSV files within the same second, which can cause them to overwrite one another.
If you want to label files with a timestamp, going down to the sub-second level is more useful and prevents the possibility of an overwrite. In your case:
from datetime import datetime

while len(df) > 0:
    df_subset = df.sample(100)
    df_subset.to_csv(datetime.now().strftime("%Y%m%d_%H%M%S.%f") + 'dfsample.csv')
    df = df.drop(df_subset.index)
Another way is to shuffle your rows and get rid of that awful loop.
df.sample(frac=1)
and save slices of the shuffled dataframe.
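A minimal sketch of that shuffle-and-slice idea (the chunk size and file-name pattern are illustrative):
# df is the 450-row example DataFrame from the question
shuffled = df.sample(frac=1)                       # shuffle all rows once
for i, start in enumerate(range(0, len(shuffled), 100)):
    chunk = shuffled.iloc[start:start + 100]       # the last chunk may be shorter (50 rows here)
    chunk.to_csv('dfsample_{:03d}.csv'.format(i))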

Append to csv file column-wise under one header

Coding in Python 2.7. I have a csv file already present called input.csv. In that file we have 3 headers — Filename, Category, Version — under which certain values already exist.
I want to know how I can reopen the csv file and input only one value multiple times under the "Version" column such that whatever was written under "Version" gets overwritten by the new input.
So suppose under the "Version" column I had 3 inputs in 3 rows:
VERSION
55
66
88
It gets rewritten by my new input 10 so it will look like:
VERSION
10
10
10
I know normally we input csv row-wise but this time around I just want to input column wise under that specific header "Version".
Solution 1:
With pandas, you can use:
import pandas as pd
df = pd.read_csv(file)
df['VERSION'] = 10
df.to_csv(file, index=False)
Solution 2: If there are multiple rows (and you only want to change the first 3), then you can use:
df.loc[df.index < 3, ['VERSION']] = 10
instead of:
df['VERSION'] = 10

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. A material code looks like this -> 3A8356. There are different material codes in the same column and they repeat a lot. I want to identify them and make a list with only one of each code, no repeats. Is there a way I can analyze the column and extract the codes that repeat so I can take them and make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials'], keep='first'), which keeps one occurrence of each code.
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, the new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
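Putting it together, a minimal end-to-end sketch (the file names are illustrative and the column is assumed to be literally named 'Materials'):
import pandas as pd

df = pd.read_excel('materials.xlsx')
unique_materials = (df['Materials']
                    .drop_duplicates()        # keeps the first occurrence of each code
                    .sort_values()
                    .reset_index(drop=True))
unique_materials.to_excel('unique_materials.xlsx', index=False)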
