Python - How to improve the dataframe performance?

There are 2 CSV files, each with 700,000 rows.
I need to read one file line by line, find the matching row in the other file, and then combine the data from both files into one.
But it takes about 1 minute just per 1,000 rows!
I don't know how to improve the performance.
Here is my code:
import pandas as pd

fail_count = 0
match_count = 0
count = 0

file1_df = pd.read_csv("Data1.csv", sep='\t')
file2_df = pd.read_csv("Data2.csv", sep='\t')

columns = ['Name', 'Age', 'Value_file1', 'Value_file2']
result_df = pd.DataFrame(columns=columns)

for row in file1_df.iterrows():
    name = row[1][2]
    age = row[1][3]
    # Look up the row with the same Name and Age in the second file
    selected = file2_df[(file2_df['Name'] == name) & (file2_df['Age'] == age)]
    if selected.empty:
        fail_count += 1
        continue
    value_file1 = row[1][4]
    value_file2 = selected['Value'].values[0]
    # Append the matched pair to the result
    result_df.loc[len(result_df)] = [name, age, value_file1, value_file2]
    match_count += 1

print('match : ' + str(match_count))
print('fail : ' + str(fail_count))

result_df.to_csv('result.csv', index=False, encoding='utf-8')
Which line can be changed?
Is there any other way to do this process?

This might be too simplistic, but have you tried using pandas.merge() functionality?
See here for syntax.
For your tables:
result_df = pd.merge(left=file1_df, right=file2_df, on=['Name', 'Age'], how='inner')
That will do an "inner" join, only keeping rows with Names & Ages that match in both tables.
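If both files also carry a Value column that you want side by side (as in the result_df above), a minimal sketch, assuming the column is literally named Value in both files, lets merge() suffix the overlapping names:

import pandas as pd

file1_df = pd.read_csv("Data1.csv", sep='\t')
file2_df = pd.read_csv("Data2.csv", sep='\t')

# Inner join on Name and Age; the overlapping Value columns get the given suffixes
result_df = pd.merge(file1_df, file2_df, on=['Name', 'Age'],
                     how='inner', suffixes=('_file1', '_file2'))

result_df = result_df[['Name', 'Age', 'Value_file1', 'Value_file2']]
result_df.to_csv('result.csv', index=False, encoding='utf-8')

The match count is then just len(result_df), and the fail count is the number of file1_df rows without a partner. A single vectorised merge replaces the per-row lookup, which is where almost all of the time goes.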

Related

Creating Multiple output files using pandas in python

The following code checks SampleData.txt and produces Result1.txt. I want to create another file, Result2.txt, from the same data, containing only one column. I am new to pandas and can't figure out what needs to be modified to create Result2.txt.
import pandas as pd
from tabulate import tabulate

dl = []
with open('SampleData.txt', encoding='utf8', errors='ignore') as f:
    for line in f:
        parts = line.split()
        # Insert a placeholder amount when the line has no ($...) field
        if not parts[3][:2].startswith('($'):
            parts.insert(3, '0')
        # Re-join anything after the fourth field into a single CODE column
        if len(parts) > 5:
            temp = ' '.join(parts[4:])
            parts = parts[:4] + [temp]
        parts[1] = int(parts[1])
        parts[2] = float(parts[2].replace(',', ''))
        parts[3] = float(parts[3].strip('($)').replace(',', ''))
        dl.append(parts)

headers = ['ID', 'TRANS', 'VALUE', 'AMOUNT', 'CODE']
df = pd.DataFrame(dl, columns=headers)
pd.set_option('colheader_justify', 'center')

df = df.groupby(['ID', 'CODE']).sum().reset_index().round(2)
df = df.sort_values('TRANS', ascending=False)
df['AMOUNT'] = '($' + df['AMOUNT'].astype(str) + ')'
df = df[headers]

print(df.head(n=40).to_string(index=False))
print()
df.to_csv("Out1.txt", sep="\t", index=None, header=None)
SampleData.txt
0xdata1 1 2,200,000 test1(test1)
0xdata2 1 9,500,000,000 ($70.30) test2(test2)
0xdata3 1 4.6 ($14.08) test3(test3)
0xdata4 1 0.24632941 test4(test4)
0xdata5 1 880,000,000 ($1.94) test5(test5)
Result1.txt #-- Fine and working
0xdata1 1 2,200,000 test1(test1)
0xdata2 1 9,500,000,000 ($70.30) test2(test2)
0xdata3 1 4.6 ($14.08) test3(test3)
0xdata4 1 0.24632941 test4(test4)
0xdata5 1 880,000,000 ($1.94) test5(test5)
Result2.txt #-- Additional output needed and what I am trying to produce
0xdata1
0xdata2
0xdata3
0xdata4
0xdata5
You can select just the column that you want to save, as in your case:
df['ID'].to_csv("Out_ID.txt", sep="\t", index=None, header=None)
This should solve your problem!
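If you prefer to keep a one-column DataFrame rather than a Series (for example, to add more columns to the file later), double brackets do the same job:
df[['ID']].to_csv("Out_ID.txt", sep="\t", index=False, header=False)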

Loops in Python : how to apply same set of code in a loop

Thank you for all your help in my previous questions.
Now it leads me to my final and most difficult task. Let me break it down:
I have a file named:
"PCU1-160321.csv"
Then I run the following code (see below), which does everything I need to do.
This leaves me with the most difficult task, which is to apply the same set of code (see below) again and again to the other 23 files, named:
PCU2-160321.csv
PCU3-160321.csv
...
PCU24-160321.csv
E.g. I wish to call df2 for PCU2-160321.csv, df3 for PCU3-160321.csv, etc. (hence, a loop...).
Or is there a better looping method?
Below I attach the visualisation:
Thank you very much.
UPDATE:
I did try something like this, but it didn't work...
import pandas as pd

# Assign file names
file1 = 'PCU1-160321.csv'
file_out1 = file1.replace('.csv', '15min.csv')

# Read csv file and assign header
df1 = pd.read_csv(gdrive_url + file1, sep=';', names=['Date', 'Time_Decimal', 'Parameter', 'Value'])

# Split the Time_Decimal column, keep the decimal seconds in their own column, and rearrange the column order
df1[['Time', 'DecimalSecond']] = df1.Time_Decimal.str.split(".", expand=True)
df1 = df1[['Date', 'Time', 'DecimalSecond', 'Parameter', 'Value']]

# Split off AM/PM and concatenate DecimalSecond back into the Time column
df1[['Time', 'AMPM']] = df1.Time.str.split(" ", expand=True)
df1 = df1[['Date', 'Time', 'AMPM', 'DecimalSecond', 'Parameter', 'Value']]
df1['Time'] = df1['Time'].map(str) + '.' + df1['DecimalSecond'].map(str) + ' ' + df1['AMPM'].map(str)
df1['Timestamp'] = df1['Date'].map(str) + ' ' + df1['Time']
df1 = df1[['Timestamp', 'Parameter', 'Value']]

# Assign index and set index as timestamp object
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1 = df1.set_index('Timestamp')

# Assigning parameters I want to filter out of the df above
Parameter1 = 'DCA_L_1'
Parameter2 = 'DCA_L_2'

# Filtering based on the variables defined above; contains whatever parameter I need
df1_param1 = df1[df1['Parameter'].str.contains(Parameter1)]
df1_param2 = df1[df1['Parameter'].str.contains(Parameter2)]

# Renaming the column headers
df1_param1.columns = ['Par1', 'Val1']
df1_param2.columns = ['Par2', 'Val2']

# Obtain the exact parameter names as strings; these become the new df's column names
par1 = df1_param1.head(1)['Par1'].values[0]
par2 = df1_param2.head(1)['Par2'].values[0]

# Downsampling to 15 mins
df1_param1 = df1_param1.resample('15min').mean()
df1_param2 = df1_param2.resample('15min').mean()

# Concatenating all the dfs
df1_concat = pd.concat([df1_param1, df1_param2], axis=1)

# Select Values
df1_concat = df1_concat[['Val1', 'Val2']]

# Rename columns of the new df1_concat
df1_concat.columns = [par1, par2]

# Save output as csv
df1_concat.to_csv(gdrive_url_out + file_out1, index=True)
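One way to avoid repeating the block 24 times (a sketch, assuming all files follow the PCU{n}-160321.csv pattern and share the same layout, and that the hypothetical helper process_pcu_file wraps the per-file code above) is to put the logic in a function and loop over the file numbers, collecting the results in a dict instead of 24 separate variables:

import pandas as pd

def process_pcu_file(path):
    # Hypothetical helper: the body is the per-file code above, with df1 as a
    # local df and the hard-coded file name replaced by the path argument.
    df = pd.read_csv(path, sep=';', names=['Date', 'Time_Decimal', 'Parameter', 'Value'])
    # ... the same splitting / filtering / resampling steps as above ...
    return df

dfs = {}  # dfs[2] plays the role of "df2", dfs[3] of "df3", and so on
for n in range(1, 25):
    file_in = f'PCU{n}-160321.csv'
    file_out = file_in.replace('.csv', '15min.csv')
    dfs[n] = process_pcu_file(gdrive_url + file_in)   # gdrive_url as in the question
    dfs[n].to_csv(gdrive_url_out + file_out, index=True)

A dict keyed by the file number is usually easier to work with than dynamically created variable names like df2, df3, and so on.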

Concatenating .mtx files and changing counter for cell IDs

I have several files that look like this, where the header is the count of unique values per column.
How can I read several of these files and concatenate them all into one?
When I concatenate, I need all the values in the middle column to ADD the total count of that column from the previous file, so the count continues across files. The other two columns I don't mind.
My try:
matrixFiles = glob.glob(filesPath + '/*matrix.mtx')
dfs = []
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file, sep=' ')
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    if i > 0:
        matrix.iloc[:, 1] = matrix.iloc[:, 1] + cellNumberInt
    dfs.append(matrix)
    i = i + 1
big_file = pd.concat(dfs)
I don't know how to access cellNumberInt from the file iterated before, to add it to the new one.
When I concat the dfs, the output is not a three-column dataframe. How can I concatenate all the files into the same columns while avoiding the headers?
1.csv:
33694,1298,2465341
33665,1299,20
33663,1299,8
2.csv:
53694,1398,3465341
33665,1399,20
33663,1399,8
3.csv:
13694,7778,3465341
44432,7780,20
33663,7780,8
import pandas as pd
import numpy as np

matrixFiles = ['1.csv', '2.csv', '3.csv']
dfs = []
matrix_list = []
# this dict stores the i number (keys) and the cellNumberInt (values)
cellNumberInt_dict = {}
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file)
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    cellNumberInt_dict[i] = cellNumberInt
    if i > 0:
        matrix.rename(columns={str(cellNumberInt): cellNumberInt + cellNumberInt_dict[i-1]}, inplace=True)
    dfs.append(matrix)
    if i < len(matrixFiles) - 1:
        # we only want to keep the df values here; keeping the columns that don't
        # have shared names messes up the pd.concat()
        matrix_list.append(matrix.values)
    i += 1

# get the last df in the dfs list because it has the last cellNumberInt
last_df = dfs[-1]
# concat all of the values from the dfs except for the last one
arrs = np.concatenate(matrix_list)
# make a df from the numpy arrays
new_df = pd.DataFrame(arrs, columns=last_df.columns.tolist())
big_file = pd.concat([last_df, new_df])
big_file.rename(columns={big_file.columns.tolist()[1]: sum(cellNumberInt_dict.values())}, inplace=True)
print(big_file)
13694 10474 3465341
0 44432 7780 20
1 33663 7780 8
0 33665 1299 20
1 33663 1299 8
2 33665 1399 20
3 33663 1399 8
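An alternative sketch of the same idea, assuming each .mtx file's first line is the header holding the counts (as in the samples above) and that only the middle column needs the running offset, keeps one cumulative counter and skips the header row when reading:

import glob
import pandas as pd

frames = []
offset = 0  # running total of the cell counts from the files read so far
for path in sorted(glob.glob(filesPath + '/*matrix.mtx')):  # filesPath as in the question
    # the first line is the header holding the counts; read it on its own
    header = pd.read_csv(path, sep=' ', nrows=0)
    cell_count = int(header.columns[1])
    # the remaining lines are the data; skip the header row, keep plain 0/1/2 column labels
    data = pd.read_csv(path, sep=' ', skiprows=1, header=None)
    data[1] = data[1] + offset  # shift the middle column by the cumulative count
    frames.append(data)
    offset += cell_count

big_file = pd.concat(frames, ignore_index=True)

Because every file is read with header=None, all the pieces share the same 0/1/2 column labels, so the concatenation stays a three-column dataframe and no header value ends up mixed into the data.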

Loop through cell range (Every 3 cells) and add ranking to it

The problem is I am trying to make a ranking for every 3 cells in that column
using pandas.
For example:
This is the outcome I want
I have no idea how to make it.
I tried something like this:
for i in range(df.iloc[1:], df.iloc[,:], 3):
    counter = 0
    i['item'] += counter + 1
The code is completely wrong, but I need help with the range and with putting df.iloc inside the brackets in pandas.
Does this match the requirements?
import pandas as pd

df = pd.DataFrame()
df['Item'] = ['shoes', 'shoes', 'shoes', 'shirts', 'shirts', 'shirts']

df2 = pd.DataFrame()
for i, item in enumerate(df['Item'].unique(), 1):
    df2.loc[i-1, 'rank'] = i
    df2.loc[i-1, 'Item'] = item
df2['rank'] = df2['rank'].astype('int')

print(df)
print("\n")
print(df2)

df = df.merge(df2, on='Item', how='inner')
print("\n")
print(df)
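If the groups really are fixed blocks of 3 rows (rather than runs of identical items), a shorter sketch, assuming the default 0-based RangeIndex, is integer division on the row position:

import pandas as pd

df = pd.DataFrame({'Item': ['shoes', 'shoes', 'shoes', 'shirts', 'shirts', 'shirts']})
# Every 3 consecutive rows share a rank: rows 0-2 -> 1, rows 3-5 -> 2, ...
df['rank'] = df.index // 3 + 1
print(df)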

Python Pandas 'Unnamed' column keeps appearing

I am running into an issue where each time I run my program (which reads the dataframe from a .csv file), a new column called 'Unnamed' shows up.
Sample output columns after running 3 times:
Unnamed: 0 Unnamed: 0.1 Subreddit Appearances
Here is my code. For each row, the 'Unnamed' columns simply increase by 1.
df = pd.read_csv(Location)
while counter < 50:
    # gets just the subreddit name
    e = str(elem[counter].get_attribute("href"))
    e = e.replace("https://www.reddit.com/r/", "")
    e = e[:-1]
    if e in df['Subreddit'].values:
        # adds 1 to Appearances if the subreddit is already in the DF
        df.loc[df['Subreddit'] == e, 'Appearances'] += 1
    else:
        # adds a new row with the subreddit name and sets the amount of appearances to 1
        df = df.append({'Subreddit': e, 'Appearances': 1}, ignore_index=True)
        df.reset_index(inplace=True, drop=True)
    print(e)
    counter = counter + 2
# (doesn't work) df.drop(df.columns[df.columns.str.contains('Unnamed', case=False)], axis=1)
The first time I run it, with a clean .csv file, it works perfectly, but each time after, another 'Unnamed' column shows up.
I just want the 'Subreddit' and 'Appearances' columns to show each time.
Another solution would be to read your csv with the attribute index_col=0 so that the index column is not taken into account: df = pd.read_csv(Location, index_col=0).
each time I run my program (...) a new column shows up called 'Unnamed'.
I suppose that's due to reset_index, or maybe you have a to_csv somewhere in your code as @jpp suggested. To fix the to_csv, be sure to use index=False:
df.to_csv(path, index=False)
just wanted the 'Subreddit' and 'Appearances' columns
In general, here's how I would approach your task.
What this does is count all appearances first (keyed by e), and from these counts create a new dataframe to merge with the one you already have (how='outer' adds rows that don't exist yet). This avoids resetting the index for each element, which should avoid the problem and is also more performant.
Here's the code with these thoughts included:
from collections import Counter
import pandas as pd

base_df = pd.read_csv(location)

appearances = Counter()
while counter < 50:
    # gets just the subreddit name
    e = str(elem[counter].get_attribute("href"))
    e = e.replace("https://www.reddit.com/r/", "")
    e = e[:-1]
    appearances[e] += 1
    counter = counter + 2

appearances_df = pd.DataFrame({'e': e, 'appearances': c}
                              for e, c in appearances.items())
df = base_df.merge(appearances_df, how='outer', on='e')
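One catch, based on the columns shown in the question: the CSV's columns are 'Subreddit' and 'Appearances' rather than 'e', so for the merge to line up you would match those names and then combine the old and new counts. A possible sketch (the '_old'/'_new' suffixes are just illustrative):

appearances_df = pd.DataFrame(
    [{'Subreddit': e, 'Appearances': c} for e, c in appearances.items()]
)
df = base_df.merge(appearances_df, on='Subreddit', how='outer',
                   suffixes=('_old', '_new'))
df['Appearances'] = df['Appearances_old'].fillna(0) + df['Appearances_new'].fillna(0)
df = df[['Subreddit', 'Appearances']]
df.to_csv(location, index=False)  # index=False keeps the 'Unnamed' columns from coming back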
