Concatenating .mtx files and changing counter for cell IDs - python

I have several files that look like this, where the header row holds the count of unique values per column.
How can I read several of these files and concatenate them into one?
When I concatenate, every value in the middle column needs to have the total count of that column from the previous file added to it, so the numbering continues across files. The other two columns can stay as they are.
My try:
matrixFiles = glob.glob(filesPath + '/*matrix.mtx')
dfs = []
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file, sep=' ')
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    if i > 0:
        matrix.iloc[:, 1] = matrix.iloc[:, 1] + cellNumberInt
    dfs.append(matrix)
    i = i + 1
big_file = pd.concat(dfs)
I don't know how to access the cellNumberInt from the previous file in the iteration so I can add it to the current one.
Also, when I concat the dfs, the output is not a three-column dataframe. How can I concatenate all the files into the same columns while skipping the header?

1.csv:
33694,1298,2465341
33665,1299,20
33663,1299,8
2.csv:
53694,1398,3465341
33665,1399,20
33663,1399,8
3.csv:
13694,7778,3465341
44432,7780,20
33663,7780,8
import pandas as pd
import numpy as np

matrixFiles = ['1.csv', '2.csv', '3.csv']

dfs = []
matrix_list = []
# this dict stores the i number (keys) and the cellNumberInt (values)
cellNumberInt_dict = {}
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file)
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    cellNumberInt_dict[i] = cellNumberInt
    if i > 0:
        matrix.rename(columns={str(cellNumberInt): cellNumberInt + cellNumberInt_dict[i-1]}, inplace=True)
    dfs.append(matrix)
    if i < len(matrixFiles) - 1:
        # we only want to keep the df values here; keeping the columns that don't
        # have shared names messes up the pd.concat()
        matrix_list.append(matrix.values)
    i += 1

# get the last df in the dfs list because it has the last cellNumberInt
last_df = dfs[-1]
# concat all of the values from the dfs except for the last one
arrs = np.concatenate(matrix_list)
# make a df from the numpy arrays
new_df = pd.DataFrame(arrs, columns=last_df.columns.tolist())

big_file = pd.concat([last_df, new_df])
big_file.rename(columns={big_file.columns.tolist()[1]: sum(cellNumberInt_dict.values())}, inplace=True)
print(big_file)
13694 10474 3465341
0 44432 7780 20
1 33663 7780 8
0 33665 1299 20
1 33663 1299 8
2 33665 1399 20
3 33663 1399 8
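For what it's worth, a shorter way to get the running offset (a sketch, not the code above; it assumes each file's header line carries the cell count in its second field, and that the data rows are space-separated like the original .mtx files — use sep=',' for the CSV samples shown here) is to keep a running total and add it to the middle column as each file is read:

import glob
import pandas as pd

offset = 0
parts = []
for path in sorted(glob.glob('*matrix.mtx')):           # adjust the pattern to your files
    header = pd.read_csv(path, sep=' ', nrows=0)         # read only the header line
    n_cells = int(header.columns[1])                      # cell count from the header
    df = pd.read_csv(path, sep=' ', skiprows=1, header=None,
                     names=['gene', 'cell', 'count'])
    df['cell'] += offset                                   # shift cell IDs by the running total
    offset += n_cells
    parts.append(df)

big_file = pd.concat(parts, ignore_index=True)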

Related

Python: concat data frames then save them to one csv

I have multiple data frames. I want to take some rows from each data frame based on a certain condition, add them to one data frame, and then save them to one CSV file.
I tried multiple methods; DataFrame.append is deprecated.
Here is the simple code. I want to retrieve the rows above and below every row whose ID is larger than 2.
result = pd.concat(frames) returns the required rows with the headers, so with every iteration of the for loop it prints the required rows. However, when I save to CSV, only the last three rows are saved. How do I accumulate the rows before writing them to the CSV? What am I missing here?
df_sorted = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                          "User": ['a', 'b', 'c', 'd', 'e', 'f']})

Max = pd.DataFrame()
above = pd.DataFrame()
below = pd.DataFrame()

for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        Max = df_sorted.iloc[[i]]            # first df
        if i < len(df_sorted) - 1:
            above = df_sorted.iloc[[i+1]]    # second df
        if i > 0:
            below = df_sorted.iloc[[i-1]]    # third df
        frames = [above, Max, below]
        result = pd.concat(frames)

result.to_csv('new_df.csv')
The desired result should be,
ID User
2 b
3 c
4 d
3 c
4 d
5 e
4 d
5 e
6 f
5 e
6 f
what I get from result is,
ID User
5 e
6 f
6 f
Here it is:
columns = ['id', 'user']
Max = pd.DataFrame(columns=columns)
above = pd.DataFrame(columns=columns)
below = pd.DataFrame(columns=columns)

for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        Max.loc[i, 'id'] = df_sorted.iloc[i, 0]
        Max.loc[i, 'user'] = df_sorted.iloc[i, 1]
        if i < len(df_sorted) - 1:
            above.loc[i, 'id'] = df_sorted.iloc[i+1, 0]
            above.loc[i, 'user'] = df_sorted.iloc[i+1, 1]
        if i > 0:   # a separate if, not elif, so most rows get both a row above and a row below
            below.loc[i, 'id'] = df_sorted.iloc[i-1, 0]
            below.loc[i, 'user'] = df_sorted.iloc[i-1, 1]

result = pd.concat([above, Max, below], axis=0)
result
It seems that you did not define Max, above and below as accumulators. As written, Max, above and below each hold a single row and are overwritten on every iteration.
You should define Max = pd.DataFrame(columns=columns) (or an array), and do the same for above and below. That way the data accumulates in these dataframes and, when you concat them at the end, nothing is lost.
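Another pattern that fits the original loop more directly (a sketch of the same idea, not the code above) is to collect each group of rows in a plain Python list and call pd.concat once after the loop; this also avoids growing DataFrames cell by cell:

import pandas as pd

df_sorted = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                          "User": ['a', 'b', 'c', 'd', 'e', 'f']})

frames = []
for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        lo = max(i - 1, 0)                        # row below, if any
        hi = min(i + 1, len(df_sorted) - 1)       # row above, if any
        frames.append(df_sorted.iloc[lo:hi + 1])  # keep the whole group

result = pd.concat(frames, ignore_index=True)
result.to_csv('new_df.csv', index=False)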

divide the row into two rows after several columns

I have a CSV file and I am trying to split each row into several rows when it contains more than 4 data columns.
Example: (input table shown as an image in the original post)
Expected output: (also shown as an image in the original post)
Is there a way to do that in pandas or plain Python?
Sorry if this is a simple question.
When there are two columns with the same name in a CSV file, pandas automatically appends an integer suffix to the duplicate column name when it reads the file. For example, a CSV whose header repeats x and y will come back with columns like x, y, x.1, y.1 (the original post showed this with screenshots):
df = pd.read_csv("Book1.csv")
df
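A minimal illustration of that renaming behaviour (made-up data, since the original screenshots aren't reproduced here):

import io
import pandas as pd

csv_text = "id,x,y,x,y\n1_1,31.5,22.6,31.3,22.4\n"
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.columns.tolist())   # ['id', 'x', 'y', 'x.1', 'y.1']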
Now, to solve your question, let's consider the above dataframe as the input. Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id', 'x1', 'y1', 'x2', 'y2']

while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1 + len(temp)]
    if len(temp) < 4:
        temp_df[final_cols[1 + len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)

pd.concat(new_df).reset_index(drop=True)
Result: (output table shown in the original post)
You can first set the video column as the index, then concat every remaining group of 4 columns into a new dataframe. Finally, reset the index to get the video column back.
df.set_index('video', inplace=True)

dfs = []
for i in range(len(df.columns)//4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center']*2, axis=1))

df_ = pd.concat(dfs).reset_index()
I think the following list comprehension should work, but it gives a positional indexing error on my machine and I don't know why:
df_ = pd.concat([df.iloc[: range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1) for i in range(len(df.columns)//4)])
print(df_)
video x_center y_center x_center y_center
0 1_1 31.510973 22.610222 31.383655 22.488293
1 1_1 31.856295 22.830109 32.016905 22.948702
2 1_1 32.011684 22.990689 31.933356 23.004779
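The positional indexing error in the comprehension is most likely just a missing comma: df.iloc[: range(i*4, i*4+4)] tries to slice rows with a range object instead of selecting columns. With the comma restored (a sketch, assuming video has already been set as the index as above), the comprehension mirrors the loop:

df_ = pd.concat(
    [df.iloc[:, range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1)
     for i in range(len(df.columns)//4)]
).reset_index()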

How to dynamically match rows from two pandas dataframes

I have a large dataframe of URLs and a smaller second dataframe that contains columns of strings which I want to use to merge the two dataframes together. Data from the second df will be used to populate the larger first df.
The matching strings can contain * wildcards (and more than one), but the order of the groups still matters; so "path/*path2" would match "exsample.com/eg_path/extrapath2.html" but not "exsample.com/eg_path2/path/test.html". How can I use the strings in the second dataframe to merge the two dataframes together? There can be more than one matching string in the second dataframe.
import pandas as pd

urls = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/', 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/', 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}

df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
what_I_am_after = pd.DataFrame(result)
Not very robust but gives the correct answer for my example.
import pandas as pd

urls = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/', 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/', 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}

df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)

results = pd.DataFrame(columns=['url', 'hits', 'group'])

for index, row in df2.iterrows():
    for x in row[1:]:
        group = x.split('*')
        rx = "".join([str(part) + ".*" if len(part) > 0 else '' for part in group])
        if rx == "":
            continue
        filter = df1['url'].str.contains(rx, na=False, regex=True)
        if filter.any():
            temp = df1[filter]
            temp['group'] = row[0]
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
            results = pd.concat([results, temp])

d3 = df1.merge(results, how='outer', on=['url', 'hits'])
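One caveat with building the regex by hand: the literal parts of each matching string go in unescaped, so a pattern containing regex metacharacters (a dot, question mark, parentheses, ...) could match more than intended. A small sketch of the conversion with escaping added:

import re

def wildcard_to_regex(pattern: str) -> str:
    """Turn a '*' wildcard pattern into a regex, escaping the literal parts."""
    parts = [re.escape(p) for p in pattern.split('*') if p]
    return ".*".join(parts)

# wildcard_to_regex('stackoverflow*questions*56318782')
# -> 'stackoverflow.*questions.*56318782'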

How do I import CSV to Pandas df where data is organized by an index column with a parent/child relationship?

I have GBs of data in this text format:
1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775
1,'Acct02','Daves Tacos'
2,'centrifugal','N'
1000,334,787,143
1001,749,132,987
The first column indicates the row content and is an index series that repeats for each account (Acct01, Acct02, ...). Rows with index values 1 and 2 are one-to-one associated with each account (the parent). I would like to flatten this data into a dataframe that associates the account-level data (index = 1, 2) with its series data (1000, 1001, 1002, 1003, ... — the child rows) in a flat df.
Desired df:
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1000,576,686,837
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1001,683,170,775
'Acct02','Daves Tacos','centrifugal','N',1000,334,787,143
'Acct02','Daves Tacos','centrifugal','N',1001,749,132,987
I've been able to do this in a very mechanical, very slow row-by-row process:
import pandas as pd
import numpy as np
import time

file = 'C:\\PythonData\\AcctData.txt'
t0 = time.time()
pdata = []  # Parse data
acct = []   # Account Data
row = {}    # Assembly Container

# Set dataframe columns
df = pd.DataFrame(columns=['Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT03'])

# open the file and read through it line by line
with open(file, 'r') as f:
    for line in f:
        # Strip each line
        pdata = [x.strip() for x in line.split(',')]
        # Use the index to parse data into acct[] for the parent rows, or into a child row
        indx = int(pdata[0])
        if indx == 1:
            acct.clear()
            acct.append(pdata[1])
            acct.append(pdata[2])
        elif indx == 2:
            acct.append(pdata[1])
            acct.append(pdata[2])
        else:
            row.clear()
            row['Account'] = acct[0]
            row['Name'] = acct[1]
            row['Type'] = acct[2]
            row['Flag'] = acct[3]
            row['Counter'] = pdata[0]
            row['CNT01'] = pdata[1]
            row['CNT02'] = pdata[2]
            row['CNT03'] = pdata[3]
            if indx > 2:
                #data.append(row)
                df = df.append(row, ignore_index=True)

t1 = time.time()
totalTimeDf = t1 - t0
TTDf = '%.3f' % (totalTimeDf)
print(TTDf + " Seconds to Complete df: " + file)
print(df)
Result:
0.018 Seconds to Complete df: C:\PythonData\AcctData.txt
Account Name Type Flag Counter CNT01 CNT02 CNT03
0 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1000 576 686 837
1 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1001 683 170 775
2 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1000 334 787 143
3 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1001 749 132 987
This works but is tragically slow. I suspect there is a very easy pythonic way to import and organize this into a df. It appears an OrderedDict will properly organize the data as follows:
import csv
from collections import OrderedDict

od = OrderedDict()
file_name = 'C:\\PythonData\\AcctData.txt'
try:
    csvfile = open(file_name, 'rt')
except:
    print("File not found")
csvReader = csv.reader(csvfile, delimiter=",")
for row in csvReader:
    key = row[0]
    od.setdefault(key, []).append(row)
od
Result:
OrderedDict([('1',
[['1', "'Acct01'", "'Freds Autoshop'"],
['1', "'Acct02'", "'Daves Tacos'"]]),
('2',
[['2', "'3-way-Cntrl'", "'Y'"],
['2', "'centrifugal'", "'N'"]]),
('1000',
[['1000', '576', '686', '837'], ['1000', '334', '787', '143']]),
('1001',
[['1001', '683', '170', '775'], ['1001', '749', '132', '987']])])
From the OrderedDict I haven't been able to figure out how to combine keys 1 and 2 and associate them with each account's series keys (1000, 1001), then append the result into a df. How do I go from the OrderedDict to a df while flattening the parent/child data? Or is there a better way to process this data?
I'm not sure if it's the fastest or the most pythonic way, but I believe a pandas approach might do, since you need to iterate over the rows in a weirdly specific way.
First, import the libraries to work with:
import pandas as pd
import numpy as np
Since I didn't have a file to load, I just recreated it as an array (this part you'll have to adapt yourself; simply loading the file into a pandas DataFrame with 4 columns, as in the next step, will also be fine):
data = [[1,'Acct01','Freds Autoshop'],
[2,'3-way-Cntrl','Y' ],
[1000,576,686,837 ],
[1001,683,170,775 ],
[1002,333,44,885 ],
[1003,611183,12,1 ],
[1,'Acct02','Daves Tacos' ],
[2,'centrifugal','N' ],
[1000,334,787,143 ] ,
[1001,749,132,987],
[1,'Acct03','Norah Jones' ],
[2,'undertaker','N' ],
[1000,323,1,3 ] ,
[1001,311,2,111 ] ,
[1002,95,112,4]]
Create a dataframe with the above data and add new columns filled with numpy's NaNs (faster than pandas') as placeholders.
df = pd.DataFrame(data)
df['4']= np.nan
df['5']= np.nan
df['6']= np.nan
df['7']= np.nan
df['8']= np.nan
df.columns = ['idx','Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT3']
Make a new df that records every time "AcctXXXX" appears and how many rows lie below it until the next parent.
# Getting the unique "Acct" and their index position into an array
acct_idx_pos = np.array([df[df['Account'].str.contains('Acct').fillna(False)]['Account'].values, df[df['Account'].str.contains('Acct').fillna(False)].index.values])
# Making a df with the transposed array
df_pos = pd.DataFrame(acct_idx_pos.T, columns=['Acct', 'Position'])
# Shifting the values into a new column and filling the last value (nan) with the df length
df_pos['End_position'] = df_pos['Position'].shift(-1)
df_pos['End_position'][-1:] = len(df)
# Making the column we want, that is the number of loops we'll go
df_pos['Position_length'] = df_pos['End_position'] - df_pos['Position']
A custom function that uses a dummy Dataframe and concatenates temporary ones (will be used later)
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    To avoid retyping the same line of code for every df.
    The parameters should be the temporary df created at each loop and the concatenated DF
    that will contain all values, which must first be initialized (outside the loop)
    as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Created a function that will loop to fill each row and drop duplicated rows:
# a complicated loop function
def shorthen_df(df, num_iterations):
    # copy so we don't modify the original df
    dataframe = df.copy()
    # for the slicing, we need to start at the first row
    curr_row = 1

    # fill the current row's NaN values with values from the next rows
    dataframe.iloc[curr_row-1:curr_row:, 3] = dataframe.iloc[curr_row:curr_row+1:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 4] = dataframe.iloc[curr_row:curr_row+1:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 5] = dataframe.iloc[curr_row+1:curr_row+2:, 0].values
    dataframe.iloc[curr_row-1:curr_row:, 6] = dataframe.iloc[curr_row+1:curr_row+2:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 7] = dataframe.iloc[curr_row+1:curr_row+2:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 8] = dataframe.iloc[curr_row+1:curr_row+2:, 3].values

    # the "num_iterations-2" is because the first two lines are filled and not replaced
    # as the next ones will be. So this will vary correctly for each "account"
    for i in range(1, num_iterations-2):
        # replace the next row with values from the previous row
        dataframe.iloc[curr_row+(i-1):curr_row+i:] = dataframe.iloc[curr_row+(i-2):curr_row+(i-1):].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 5] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 0].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 6] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 1].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 7] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 2].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 8] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 3].values

    # drop the last 2 rows of the df
    dataframe = dataframe[0:len(dataframe)-2]
    return dataframe
Finally, create the dummy DF that will collect all the "Acct" blocks, and loop over each one with its position, using both functions above.
df_final = pd.DataFrame()

for start, end, iterations in zip(df_pos.Position.values, df_pos.End_position.values, df_pos.Position_length.values):
    df2 = df[start:end]
    df_temp = shorthen_df(df2, iterations)
    df_final = concatenate_loop_dfs(df_temp, df_final)

# dropping the first/unnecessary column
df_final.drop('idx', axis=1, inplace=True)
# resetting index
df_final.reset_index(inplace=True, drop=True)
df_final
returns
Account Name Type Flag Counter CNT01 CNT02 CNT3
0 Acct01 Freds Autoshop 3-way-Cntrl Y 1000.0 576 686 837
1 Acct01 Freds Autoshop 3-way-Cntrl Y 1001.0 683 170 775
2 Acct01 Freds Autoshop 3-way-Cntrl Y 1002.0 333 44 885
3 Acct01 Freds Autoshop 3-way-Cntrl Y 1003.0 611183 12 1
4 Acct02 Daves Tacos centrifugal N 1000.0 334 787 143
5 Acct02 Daves Tacos centrifugal N 1001.0 749 132 987
6 Acct03 Norah Jones undertaker N 1000.0 323 1 3
7 Acct03 Norah Jones undertaker N 1001.0 311 2 111
8 Acct03 Norah Jones undertaker N 1002.0 95 112 4
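For what it's worth, the slowness of the original row-by-row version comes mostly from calling df.append inside the loop, which copies the whole DataFrame each time. A minimal sketch of the same parse (same assumptions about the layout: index 1 and 2 rows are the parent, everything else is a child row) that collects plain dicts and builds the DataFrame once at the end:

import pandas as pd

cols = ['Account', 'Name', 'Type', 'Flag', 'Counter', 'CNT01', 'CNT02', 'CNT03']
rows = []
acct = []

with open('C:\\PythonData\\AcctData.txt', 'r') as f:
    for line in f:
        pdata = [x.strip() for x in line.split(',')]
        indx = int(pdata[0])
        if indx == 1:
            acct = [pdata[1], pdata[2]]                  # start a new parent block
        elif indx == 2:
            acct += [pdata[1], pdata[2]]                 # complete the parent block
        else:
            rows.append(dict(zip(cols, acct + pdata)))   # child row inherits the parent fields

df = pd.DataFrame(rows)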

Python - How to improve the dataframe performance?

There are 2 CSV files. Each file has 700,000 rows.
I need to read one file line by line and find the matching row in the other file.
After that, I combine the data from both files into one.
But it takes about 1 minute per 1,000 rows!
I don't know how to improve the performance.
Here is my code:
import pandas as pd

fail_count = 0
match_count = 0
count = 0

file1_df = pd.read_csv("Data1.csv", sep='\t')
file2_df = pd.read_csv("Data2.csv", sep='\t')

columns = ['Name', 'Age', 'Value_file1', 'Value_file2']
result_df = pd.DataFrame(columns=columns)

for row in file1_df.iterrows():
    name = row[1][2]
    age = row[1][3]
    selected = file2_df[(file2_df['Name'] == name) & (file2_df['Age'] == age)]
    if selected.empty:
        fail_count += 1
        continue

    value_file1 = row[1][4]
    value_file2 = selected['Value'].values[0]

    result_df.loc[len(result_df)] = [name, age, value_file1, value_file2]
    match_count += 1

print('match : ' + str(match_count))
print('fail : ' + str(fail_count))

result_df.to_csv('result.csv', index=False, encoding='utf-8')
Which line can be changed?
Is there any other way to do this process?
This might be too simplistic, but have you tried the pandas.merge() functionality?
See the pandas documentation for the syntax.
For your tables:
result_df = pd.merge(left=file1_df, right=file2_df, on=['Name', 'Age'], how='inner')
That will do an "inner" join, only keeping rows with Names & Ages that match in both tables.
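If both files carry their value in a column literally named Value (as the loop above suggests), the overlapping columns need suffixes after the merge; a sketch (assuming those column names) that reproduces the loop's output columns:

import pandas as pd

file1_df = pd.read_csv("Data1.csv", sep='\t')
file2_df = pd.read_csv("Data2.csv", sep='\t')

result_df = pd.merge(left=file1_df, right=file2_df, on=['Name', 'Age'],
                     how='inner', suffixes=('_file1', '_file2'))
# Overlapping columns such as 'Value' come back as 'Value_file1' / 'Value_file2'
result_df = result_df[['Name', 'Age', 'Value_file1', 'Value_file2']]
result_df.to_csv('result.csv', index=False, encoding='utf-8')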
