I need to insert rows based on the week column, per groupby type. In some cases there are missing weeks in the middle of the dataframe at different positions, and I want to insert rows to fill them in as copies of the last existing row: in this case, copies of week 7 to fill in weeks 8 and 9, and copies of week 11 to fill in weeks 12, 13 and 14. On that table you can see the jump from week 7 to 10 and from 11 to 15.
The perfect output would be as follows: the final table with incremental values in the week column, filled in the correct way.
Below is the code I have; it inserts only one row and I'm confused why:
def middle_values(final: DataFrame) -> DataFrame:
    finaltemp = pd.DataFrame()
    out = pd.DataFrame()
    for i in range(0, len(final)):
        for f in range(1, 52, 1):
            if final.iat[i,8] == f and final.iat[i-1,8] != f-1:
                if final.iat[i,8] > final.iat[i-1,8] and final.iat[i,8] != (final.iat[i-1,8] - 1):
                    line = final.iloc[i-1]
                    c1 = final[0:i]
                    c2 = final[i:]
                    c1.loc[i] = line
                    concatinated = pd.concat([c1, c2])
                    concatinated.reset_index(inplace=True)
                    concatinated.iat[i,11] = concatinated.iat[i-1,11]
                    concatinated.iat[i,9] = f-1
                    finaltemp = finaltemp.append(concatinated)
    if 'type' in finaltemp.columns:
        for name, groups in finaltemp.groupby(["type"]):
            weeks = range(groups['week'].min(), groups['week'].max()+1)
            out = out.append(pd.merge(finaltemp, pd.Series(weeks, name='week'), how='right').ffill())
        out.drop_duplicates(subset=['project', 'week'], keep='first', inplace=True)
        out.drop_duplicates(inplace=True)
        out.sort_values(["Budget: Budget Name", "Budget Week"], ascending=(False, True), inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        out.reset_index(inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        return out
    else:
        return final
For the first part of your question, suppose we have a dataframe like the following:
df = pd.DataFrame({"project":[1,1,1,2,2,2], "week":[1,3,4,1,2,4], "value":[12,22,18,17,18,23]})
We can create a new multi index to get the additional rows that we need
new_index = pd.MultiIndex.from_arrays(
    [sorted([i for i in df['project'].unique()]*52),
     [i for i in np.arange(1,53,1)]*df['project'].unique().shape[0]],
    names=['project', 'week'])
We can then apply this index to get the new dataframe that you need with blanks in the new rows
df = df.set_index(['project', 'week']).reindex(new_index).reset_index().sort_values(['project', 'week'])
You would then need to apply a forward fill (using ffill) or a back fill (using bfill) with groupby and transform to get the required values in the rows that you need.
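For example, a minimal sketch of that last step on the reindexed frame (here using GroupBy.ffill directly, assuming the new rows should simply copy the last existing row within each project, as in the question):
# forward-fill within each project so the inserted rows copy the last existing row
df['value'] = df.groupby('project')['value'].ffill()
# a back-fill would instead copy the next existing row, e.g. if a group starts with a gap:
# df['value'] = df.groupby('project')['value'].bfill()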
I have a CSV file and I am trying to split each row into multiple rows if it contains more than 4 columns.
Example: (input table shown as an image in the original post)
Expected output: (shown as an image in the original post)
Is there a way to do that in pandas or Python? Sorry if this is a simple question.
When there are two columns with the same name in a CSV file, pandas automatically appends a numeric suffix (.1, .2, ...) to the duplicate column names when reading it.
For example, a CSV file with repeated header names (shown as an image in the original post) comes back with suffixed column names after reading it:
df = pd.read_csv("Book1.csv")
df
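As a small sketch of that renaming behaviour (the header and values here are assumptions, since the original Book1.csv was only shown as an image):
import pandas as pd
from io import StringIO

# a CSV whose header repeats x1,y1 twice
sample = "id,x1,y1,x1,y1\n1,31.51,22.61,31.38,22.48"
print(pd.read_csv(StringIO(sample)).columns.tolist())
# ['id', 'x1', 'y1', 'x1.1', 'y1.1']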
Now, to solve your question, let's consider the dataframe read from Book1.csv as the input dataframe.
Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id','x1','y1','x2','y2']
while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1+len(temp)]
    if len(temp) < 4:
        temp_df[final_cols[1+len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)
pd.concat(new_df).reset_index(drop=True)
Result:
You can first set the video column as the index, then concat every 4 remaining columns into a new dataframe. At last, reset the index to get the video column back.
df.set_index('video', inplace=True)
dfs = []
for i in range(len(df.columns)//4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center']*2, axis=1))
df_ = pd.concat(dfs).reset_index()
I think the following list comprehension should also work; the positional indexing error it originally raised was most likely caused by a missing comma after the colon (df.iloc[: range(...)] instead of df.iloc[:, range(...)]). With the comma in place:
df_ = pd.concat([df.iloc[:, range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1) for i in range(len(df.columns)//4)])
print(df_)
video x_center y_center x_center y_center
0 1_1 31.510973 22.610222 31.383655 22.488293
1 1_1 31.856295 22.830109 32.016905 22.948702
2 1_1 32.011684 22.990689 31.933356 23.004779
I am new to Python. I am working on finance data and want to loop through multiple datasets.
I have the following code to read the data:
df1_url = pd.read_html("https:url1")
df2_url = pd.read_html("https:url2")
df3_url = pd.read_html("https:url3")
df4_url = pd.read_html("https:url4")
Each dataset has 9 different tables in it, but every dataset has the same format.
E.g. the resulting output should be like:
bs_sheet = df1_url[1]
ps_sheet = df1_url[3]
cf_sheet = df1_url[5]
This process is the same for all dataframes, and I want to loop through the 4 different datasets like this.
So I put all 4 datasets in a dictionary:
dfs= {'df1':df1_url,'df2':df2_url,'df3':df3_url,'df4':df4_url}
Then I tried to loop through the different datasets:
def trans(frame):
    for i in dfs:
        bs_sheet = i[1]
        ps_sheet = i[3]
        cf_sheet = i[5]
        data = pd.concat([bs_sheet, pl_sheet, cf_sheet], axis=0)
        data = data.transpose
This operation should be performed for all 4 datasets.
When I ran it I received a "string index out of range" error. After this, how do I access each dataset?
My solution was this:
d = {}
for key, data in dfs.items():
    bs_sheets = data[1]
    ps_sheets = data[3]
    cs_flows = data[5]
    data = pd.concat([bs_sheets, ps_sheets, cs_flows], axis=0)
    data = data.transpose()
    d[key] = data
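Each transposed dataset can then be pulled back out of the dictionary by its key, for example:
d['df1'].head()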
Thanks for helping me out #Zeinab #lucas.
Your function would not work: you need to change frame in your function to dfs, and iterate over the dictionary's values rather than its keys (otherwise i is just the key string, which is what caused the "string index out of range" error):
def trans(dfs):
    for i in dfs.values():
        bs_sheet = i[1]
        ps_sheet = i[3]
        cf_sheet = i[5]
        data = pd.concat([bs_sheet, ps_sheet, cf_sheet], axis=0)
        data = data.transpose()
import pandas as pd
dfs= {'df1':['a','a','a','a'],'df2':['b','b','b','b'],
'df3':['c','c','c','c'],'df4':['d','d','d','d']}
d = []
for i in dfs.values():
    d.append(pd.Series(i))
final_pd= pd.concat(d,axis = 1)
print(final_pd)
0 1 2 3
0 a b c d
1 a b c d
2 a b c d
3 a b c d
I have several files that look like this, where the header is the count of unique values per column.
How can I read several of these files and concatenate them all into one?
When I concatenate, I need all the values in the middle column to ADD the total count of that column from the previous file, so that the count continues across files. The other two columns I don't mind.
My try:
matrixFiles = glob.glob(filesPath + '/*matrix.mtx')
dfs = []
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file, sep=' ')
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    if i > 0:
        matrix.iloc[:,1] = matrix.iloc[:,1] + cellNumberInt
    dfs.append(matrix)
    i = i + 1
big_file = pd.concat(dfs)
I don't know how to access the cellNumberInt from the previously iterated file in order to add it to the new one.
Also, when I concat the dfs, the output is not a three-column dataframe. How can I concatenate all the files into the same columns while avoiding the header rows?
1.csv:
33694,1298,2465341
33665,1299,20
33663,1299,8
2.csv:
53694,1398,3465341
33665,1399,20
33663,1399,8
3.csv:
13694,7778,3465341
44432,7780,20
33663,7780,8
import pandas as pd
import numpy as np
matrixFiles = ['1.csv', '2.csv', '3.csv']
dfs = []
matrix_list = []
#this dict stores the i number (keys) and the cellNumberInt (values)
cellNumberInt_dict = {}
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file)
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    cellNumberInt_dict[i] = cellNumberInt
    if i > 0:
        matrix.rename(columns={str(cellNumberInt) : cellNumberInt + cellNumberInt_dict[i-1]}, inplace=True)
    dfs.append(matrix)
    if i < len(matrixFiles)-1:
        # we only want to keep the df values here; keeping the columns that don't
        # have shared names messes up the pd.concat()
        matrix_list.append(matrix.values)
    i += 1
# get the last df in the dfs list because it has the last cellNumberInt
last_df = dfs[-1]
#concat all of the values from the dfs except for the last one
arrs = np.concatenate(matrix_list)
#make a df from the numpy arrays
new_df = pd.DataFrame(arrs, columns=last_df.columns.tolist())
big_file = pd.concat([last_df, new_df])
big_file.rename(columns={big_file.columns.tolist()[1] : sum(cellNumberInt_dict.values())}, inplace=True)
print (big_file)
13694 10474 3465341
0 44432 7780 20
1 33663 7780 8
0 33665 1299 20
1 33663 1299 8
2 33665 1399 20
3 33663 1399 8
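As a follow-up, a minimal sketch of the same running-offset idea on the sample files above, reading each file's count row separately and shifting the middle column as the question describes (the column names gene/cell/value are assumptions for illustration):
import pandas as pd

files = ['1.csv', '2.csv', '3.csv']
dfs = []
offset = 0
for f in files:
    # the first line holds the counts; grab the middle count from it
    header = pd.read_csv(f, nrows=0).columns
    cell_count = int(header[1])
    # read the data rows themselves, without treating the count row as a header
    data = pd.read_csv(f, skiprows=1, header=None, names=['gene', 'cell', 'value'])
    data['cell'] = data['cell'] + offset   # shift the middle column by the running total
    offset += cell_count
    dfs.append(data)
big_file = pd.concat(dfs, ignore_index=True)
print(big_file)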
I have GBs of data in this text format:
1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775
1,'Acct02','Daves Tacos'
2,'centrifugal','N'
1000,334,787,143
1001,749,132,987
The first column indicates the row content and is an index series that repeats for each account (Acct01, Acct02, ...). Rows with index values 1 and 2 are one-to-one associated with each account (the parent). I would like to flatten this data into a dataframe that associates the account-level data (index = 1, 2) with its associated series data (1000, 1001, 1002, 1003, ...), the child data, in a flat df.
Desired df:
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1000,576,686,837
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1001,683,170,775
'Acct02','Daves Tacos','centrifugal','N',1000,334,787,143
'Acct02','Daves Tacos','centrifugal','N',1001,749,132,987
I've been able to do this in a very mechanical, very slow row-by-row process:
import pandas as pd
import numpy as np
import time

file = 'C:\\PythonData\\AcctData.txt'
t0 = time.time()
pdata = []  # Parse data
acct = []   # Account Data
row = {}    # Assembly Container
# Set dataframe columns
df = pd.DataFrame(columns=['Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT03'])
# open the file and read through it line by line
with open(file, 'r') as f:
    for line in f:
        # Strip each line
        pdata = [x.strip() for x in line.split(',')]
        # Use the index to parse data into acct[] (index 1 and 2) or into row (counter rows)
        indx = int(pdata[0])
        if indx == 1:
            acct.clear()
            acct.append(pdata[1])
            acct.append(pdata[2])
        elif indx == 2:
            acct.append(pdata[1])
            acct.append(pdata[2])
        else:
            row.clear()
            row['Account'] = acct[0]
            row['Name'] = acct[1]
            row['Type'] = acct[2]
            row['Flag'] = acct[3]
            row['Counter'] = pdata[0]
            row['CNT01'] = pdata[1]
            row['CNT02'] = pdata[2]
            row['CNT03'] = pdata[3]
            if indx > 2:
                #data.append(row)
                df = df.append(row, ignore_index=True)
t1 = time.time()
totalTimeDf = t1 - t0
TTDf = '%.3f' % (totalTimeDf)
print(TTDf + " Seconds to Complete df: " + file)
print(df)
Result:
0.018 Seconds to Complete df: C:\PythonData\AcctData.txt
Account Name Type Flag Counter CNT01 CNT02 CNT03
0 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1000 576 686 837
1 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1001 683 170 775
2 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1000 334 787 143
3 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1001 749 132 987
This works but is tragically slow. I suspect there is a very easy pythonic way to import and organize this into a df. It appears an OrderedDict will properly organize the data as follows:
import csv
from collections import OrderedDict

od = OrderedDict()
file_name = 'C:\\PythonData\\AcctData.txt'
try:
    csvfile = open(file_name, 'rt')
except:
    print("File not found")
csvReader = csv.reader(csvfile, delimiter=",")
for row in csvReader:
    key = row[0]
    od.setdefault(key, []).append(row)
od
Result:
OrderedDict([('1',
[['1', "'Acct01'", "'Freds Autoshop'"],
['1', "'Acct02'", "'Daves Tacos'"]]),
('2',
[['2', "'3-way-Cntrl'", "'Y'"],
['2', "'centrifugal'", "'N'"]]),
('1000',
[['1000', '576', '686', '837'], ['1000', '334', '787', '143']]),
('1001',
[['1001', '683', '170', '775'], ['1001', '749', '132', '987']])])
From the OrderedDict I haven't been able to figure out how to combine keys 1 and 2 and associate them with each account's specific series keys (1000, 1001), then append everything into a df. How do I go from the OrderedDict to a df while flattening the parent/child data? Or is there a better way to process this data?
I'm not sure if it's the fastest or most pythonic way, but I believe a pandas approach might do, since you need to iterate over every 4 rows in a very specific way.
First, importing the libraries to work with:
import pandas as pd
import numpy as np
Since I didn't have a file to load, I just recreated it as an array (you'll have to do some work for this part, or simply load it into a pandas DataFrame with 4 columns, as in the next step):
data = [[1,'Acct01','Freds Autoshop'],
[2,'3-way-Cntrl','Y' ],
[1000,576,686,837 ],
[1001,683,170,775 ],
[1002,333,44,885 ],
[1003,611183,12,1 ],
[1,'Acct02','Daves Tacos' ],
[2,'centrifugal','N' ],
[1000,334,787,143 ] ,
[1001,749,132,987],
[1,'Acct03','Norah Jones' ],
[2,'undertaker','N' ],
[1000,323,1,3 ] ,
[1001,311,2,111 ] ,
[1002,95,112,4]]
Created a dataframe with the above data and added new columns filled with numpy's NaNs (faster than pandas') as placeholders.
df = pd.DataFrame(data)
df['4']= np.nan
df['5']= np.nan
df['6']= np.nan
df['7']= np.nan
df['8']= np.nan
df.columns = ['idx','Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT3']
Making a new df that records every time "AcctXXXX" appears and how many rows below it there are until the next parent.
# Getting the unique "Acct" and their index position into an array
acct_idx_pos = np.array([df[df['Account'].str.contains('Acct').fillna(False)]['Account'].values, df[df['Account'].str.contains('Acct').fillna(False)].index.values])
# Making a df with the transposed array
df_pos = pd.DataFrame(acct_idx_pos.T, columns=['Acct', 'Position'])
# Shifting the values into a new column and filling the last value (nan) with the df length
df_pos['End_position'] = df_pos['Position'].shift(-1)
df_pos['End_position'][-1:] = len(df)
# Making the column we want, that is the number of loops we'll go
df_pos['Position_length'] = df_pos['End_position'] - df_pos['Position']
A custom function that uses a dummy DataFrame and concatenates temporary ones (it will be used later):
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    To avoid retyping the same line of code for every df.
    The parameters should be the temporary df created at each loop and the concatenated df that will contain all
    values, which must first be initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Created a function that will loop to fill each row and drop duplicated rows:
# a complicated loop function
def shorthen_df(df, num_iterations):
    # to not delete original df
    dataframe = df.copy()
    # for the slicing, we need to start at the first row.
    curr_row = 1
    # fill current row's nan values with values from next row
    dataframe.iloc[curr_row-1:curr_row:, 3] = dataframe.iloc[curr_row:curr_row+1:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 4] = dataframe.iloc[curr_row:curr_row+1:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 5] = dataframe.iloc[curr_row+1:curr_row+2:, 0].values
    dataframe.iloc[curr_row-1:curr_row:, 6] = dataframe.iloc[curr_row+1:curr_row+2:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 7] = dataframe.iloc[curr_row+1:curr_row+2:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 8] = dataframe.iloc[curr_row+1:curr_row+2:, 3].values
    # the "num_iterations-2" is because the first two lines are filled and not replaced
    # as the next ones will be. So this will vary correctly for each "account"
    for i in range(1, num_iterations-2):
        # Replaces next row with values from previous row
        dataframe.iloc[curr_row+(i-1):curr_row+i:] = dataframe.iloc[curr_row+(i-2):curr_row+(i-1):].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 5] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 0].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 6] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 1].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 7] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 2].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 8] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 3].values
    # drop the last 2 rows of the df
    dataframe = dataframe[0:len(dataframe)-2]
    return dataframe
Finally, creating the dummy df that will concat all "Acct" groups, looping over each one with its position, using both functions above.
df_final = pd.DataFrame()
for start, end, iterations in zip(df_pos.Position.values, df_pos.End_position.values, df_pos.Position_length.values):
    df2 = df[start:end]
    df_temp = shorthen_df(df2, iterations)
    df_final = concatenate_loop_dfs(df_temp, df_final)
# Dropping first/unnecessary columns
df_final.drop('idx', axis=1, inplace=True)
# resetting index
df_final.reset_index(inplace=True, drop=True)
df_final
returns
Account Name Type Flag Counter CNT01 CNT02 CNT3
0 Acct01 Freds Autoshop 3-way-Cntrl Y 1000.0 576 686 837
1 Acct01 Freds Autoshop 3-way-Cntrl Y 1001.0 683 170 775
2 Acct01 Freds Autoshop 3-way-Cntrl Y 1002.0 333 44 885
3 Acct01 Freds Autoshop 3-way-Cntrl Y 1003.0 611183 12 1
4 Acct02 Daves Tacos centrifugal N 1000.0 334 787 143
5 Acct02 Daves Tacos centrifugal N 1001.0 749 132 987
6 Acct03 Norah Jones undertaker N 1000.0 323 1 3
7 Acct03 Norah Jones undertaker N 1001.0 311 2 111
8 Acct03 Norah Jones undertaker N 1002.0 95 112 4