Pandas: modify multiple dataframes (in a loop) - python

I have multiple dataframes to which I want to apply the same function, so I need to iterate over them.
# read text files
df1 = pd.read_csv("df1.txt", sep="\t", error_bad_lines=False, index_col=None)
df2 = pd.read_csv("df2.txt", sep="\t", error_bad_lines=False, index_col=None)
df3 = pd.read_csv("df3.txt", sep="\t", error_bad_lines=False, index_col=None)
I have used the following code, however, it is not working (it means that all dataframes are still the same, and the changes do not affect them):
for df in [df1, df2, df3]:
    df = df[df["Time"] >= 600.0].reset_index(drop=True)
    df.head()
How can I iterate over them, and how can I overwrite the dataframes?

The problem is that you're not changing the data frames in place, but rather creating new ones. Here's a piece of code that changes things in-place. I don't have your data, so I create fake data for the sake of this example:
df1 = pd.DataFrame(range(10))
df2 = pd.DataFrame(range(20))
df3 = pd.DataFrame(range(30))
df_list = [df1, df2, df3]
for df in df_list:
    # use whatever condition you need in the following line,
    # for example df.drop(df[df["Time"] < 600].index, inplace=True)
    # in your case
    df.drop(df[df[0] % 2 == 0].index, inplace=True)
    df.reset_index(inplace=True)
print(df2)  # for example
The result for df2 is:
   index   0
0      1   1
1      3   3
2      5   5
3      7   7
4      9   9
5     11  11
6     13  13
7     15  15
8     17  17
9     19  19
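A common alternative that sidesteps the in-place question entirely is to rebuild the list with a comprehension and unpack the results back into the original names. A minimal sketch with made-up Time data:

```python
import pandas as pd

df1 = pd.DataFrame({"Time": [100.0, 600.0, 700.0]})
df2 = pd.DataFrame({"Time": [50.0, 900.0]})
df3 = pd.DataFrame({"Time": [650.0]})

# Build new, filtered frames and bind them back to the original names
df1, df2, df3 = [
    df[df["Time"] >= 600.0].reset_index(drop=True) for df in (df1, df2, df3)
]
print(df1)  # only the rows with Time >= 600.0 remain
```

This avoids `inplace=True` altogether, which is also the direction recent pandas documentation leans.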

This might work:
df_list = [df1, df2, df3]
for i in range(len(df_list)):
    df = df_list[i]
    df_list[i] = df[df["Time"] >= 600.0].reset_index(drop=True)

If you just store the new df in another list (or the same list), you are all good.
newdf_list = []  # create new list to store df
for df in [df1, df2, df3]:
    df = df[df["Time"] >= 600.0].reset_index(drop=True)
    df.head()
    newdf_list.append(df)  # append changed df to new list
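If you need to keep track of which dataframe is which, storing them in a dict keyed by name works the same way. A small sketch with invented data:

```python
import pandas as pd

# hypothetical frames keyed by name
dfs = {
    "df1": pd.DataFrame({"Time": [100.0, 700.0]}),
    "df2": pd.DataFrame({"Time": [650.0, 50.0]}),
}

# overwrite each entry with its filtered version
for name, df in dfs.items():
    dfs[name] = df[df["Time"] >= 600.0].reset_index(drop=True)

print(dfs["df1"])
```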


Pandas apply row-wise a function and create multiple new columns

What is the best way to apply a row-wise function and create multiple new columns?
I have two dataframes and working code, but it is most likely not optimal.
df1 (dataframe has thousands of rows and xx number of columns)
sic  data1       data2       data3       data4       data5
5    0.90783598  0.84722083  0.47149924  0.98724123  0.50654476
6    0.53442684  0.59730371  0.92486887  0.61531646  0.62784041
3    0.56806423  0.09619383  0.33846097  0.71878313  0.96316724
8    0.86933042  0.64965755  0.94549745  0.08866519  0.92156389
12   0.651328    0.37193774  0.9679044   0.36898991  0.15161838
6    0.24555531  0.50195983  0.79114578  0.9290596   0.10672607
df2 (column headers map to the sic codes in df1; there are 12 sic codes in total and the dataframe is thousands of rows long)
        1           2           3           4           5           6           7           8           9           10          11          12
c_bar   0.4955329   0.92970292  0.68049726  0.91325006  0.55578465  0.78056519  0.53954711  0.90335326  0.93986402  0.0204794   0.51575764  0.61144255
a1_bar  0.75781444  0.81052669  0.99910449  0.62181902  0.11797144  0.40031316  0.08561665  0.35296894  0.14445697  0.93799762  0.80641802  0.31379671
a2_bar  0.41432552  0.36313911  0.13091618  0.39251953  0.66249636  0.31221897  0.15988528  0.1620938   0.55143589  0.66571044  0.68198944  0.23806947
a3_bar  0.38918855  0.83689178  0.15838139  0.39943204  0.48615188  0.06299899  0.86343819  0.47975619  0.05300611  0.15080875  0.73088725  0.3500239
a4_bar  0.47201384  0.90874121  0.50417142  0.70047698  0.24820601  0.34302454  0.4650635   0.0992668   0.55142391  0.82947194  0.28251699  0.53170308
I achieved the result I need with the following code:
ind_list = np.arange(1, 13)  # Create list of industries

def c_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['const', i]

def a1_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a1bar', i]

def a2_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a2bar', i]

def a3_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a3bar', i]

def a4_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a4bar', i]
mlev_merge['c_bar'] = mlev_merge.apply(c_bar, axis=1, result_type='expand')
mlev_merge['a1_bar'] = mlev_merge.apply(a1_bar, axis=1, result_type='expand')
mlev_merge['a2_bar'] = mlev_merge.apply(a2_bar, axis=1, result_type='expand')
mlev_merge['a3_bar'] = mlev_merge.apply(a3_bar, axis=1, result_type='expand')
mlev_merge['a4_bar'] = mlev_merge.apply(a4_bar, axis=1, result_type='expand')
The output is something like this:
sic  data1       data2       data3       data4       c_bar       a1_bar      a2_bar      a3_bar      a4_bar
5    0.10316948  0.61408639  0.04042675  0.79255749  0.56357931  0.42920472  0.20701581  0.67639811  0.37778029
6    0.5730904   0.16753145  0.27835136  0.00178992  0.51793793  0.06772307  0.15084885  0.12451806  0.33114948
3    0.87710893  0.66834187  0.14286608  0.12609769  0.75873957  0.72586804  0.6081763   0.14598001  0.21557266
8    0.24565579  0.56195558  0.93316676  0.20988936  0.67404545  0.65221594  0.79758557  0.67093021  0.33400764
12   0.79703344  0.61066111  0.94602909  0.56218703  0.92384307  0.30836159  0.72521994  0.00795362  0.76348227
6    0.86604791  0.28454782  0.97229172  0.21853932  0.75650652  0.40788056  0.53233553  0.60326386  0.27399405
Cell values in the example are randomly generated, but the point is to map based on sic-codes and add rows from df2 as new columns into df1.
To do this, you need to:
1. Transpose df2 so that its columns are correct for concatenation
2. Index it with the df1["sic"] column to get the correct rows
3. Reset the index of the obtained rows of df2 using .reset_index(drop=True), so that the dataframes can be concatenated correctly. (This replaces the current index, e.g. 5, 6, 3, 8, 12, 6, with a new one, e.g. 0, 1, 2, 3, 4, 5, while keeping the actual values the same, so that pandas doesn't get confused while concatenating them.)
4. Concatenate the two dataframes
Note: I used a method based off of this to read in the dataframe, and it assumed that the columns of df2 were strings but the values of the sic column of df1 were ints. Therefore I used .astype(str) to get step 2 working. If this is not actually the case, you may need to remove the .astype(str).
Here is the single line of code to do these things:
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
Here is the full code I used:
from io import StringIO
import pandas as pd
df1 = pd.read_csv(StringIO("""
sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607
"""), sep="\t")
df2 = pd.read_csv(StringIO("""
1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308
"""), sep="\t", index_col=[0])
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
print(merged)
which produces the output:
sic data1 data2 data3 ... a1_bar a2_bar a3_bar a4_bar
0 5 0.907836 0.847221 0.471499 ... 0.117971 0.662496 0.486152 0.248206
1 6 0.534427 0.597304 0.924869 ... 0.400313 0.312219 0.062999 0.343025
2 3 0.568064 0.096194 0.338461 ... 0.999104 0.130916 0.158381 0.504171
3 8 0.869330 0.649658 0.945497 ... 0.352969 0.162094 0.479756 0.099267
4 12 0.651328 0.371938 0.967904 ... 0.313797 0.238069 0.350024 0.531703
5 6 0.245555 0.501960 0.791146 ... 0.400313 0.312219 0.062999 0.343025
[6 rows x 11 columns]
Try transposing df2 and applying your transformation to it.
Transposing a data frame means converting its rows into the columns of the data frame.
df2_tr = df2.T.apply(mapFunc, axis=0)  # mapFunc is a placeholder for whatever per-column transformation you need
Then you can concatenate the transformed columns of df2 with the columns of df1, using df1 = pd.concat([df1, df2_tr], axis=1).
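For completeness, the same sic-code lookup can also be done with a merge against the transposed df2, which avoids manual label indexing and index resets. A minimal sketch with made-up values (df2_t stands in for df2.T; the column names mirror the question):

```python
import pandas as pd

# made-up stand-ins for the question's data
df1 = pd.DataFrame({"sic": [5, 6, 3], "data1": [0.1, 0.2, 0.3]})
df2_t = pd.DataFrame(  # plays the role of df2.T: one row per sic code
    {"c_bar": [0.68, 0.55, 0.78], "a1_bar": [0.99, 0.11, 0.40]},
    index=[3, 5, 6],
)

# left-merge each row's sic code against df2_t's index;
# row order of df1 is preserved
merged = df1.merge(df2_t, left_on="sic", right_index=True, how="left")
print(merged)
```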

Merge multiple no header csv files with same row length and different column number python [duplicate]

This question already has an answer here:
Pandas: Combining Two DataFrames Horizontally [duplicate]
(1 answer)
Closed 2 years ago.
As the post's title says, what I found on the internet mainly covers merging files with headers. Browsing around only gave me this code that returns a result:
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
but the result is not good at all as in this picture:
what I want is just like this
EDIT
I found this one actually works really well for my case.
import pandas as pd
import glob
import csv
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename, header=None))
full_df = pd.concat(df_list, ignore_index=True, axis=1)
full_df.to_csv('Route_data.csv')  # add header=['a','b',...] and index=False for a clean csv output
Additional code to delete the old files that were just merged makes this even more powerful.
You can use:
pd.concat([df1, df2], axis=1)
Example:
import pandas as pd
df1 = pd.DataFrame([*zip([1,2,3],[4,5,6])])
print(df1)
df2 = pd.DataFrame([*zip([7,8,9],[10,11,12])])
print(df2)
df_combined = pd.concat([df1, df2], axis=1)
print(df_combined)
Output:
df1>
0 1
0 1 4
1 2 5
2 3 6
df2>
0 1
0 7 10
1 8 11
2 9 12
df_combined>
0 1 0 1
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
DataFrame Example:
import pandas as pd
df1 = pd.read_csv("/path/file1.csv", header=None)
print(df1)
df2 = pd.read_csv("/path/file2.csv", header=None)
print(df2)
df_combined = pd.concat([df1, df2], axis=1)
print(df_combined)
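One pitfall worth noting with pd.concat(..., axis=1): it aligns rows by index label, not by position, so frames read from different sources with differing indices produce NaN-padded rows. Resetting the indices first pairs rows positionally; a small sketch with invented data:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]})
df2 = pd.DataFrame({"b": [4, 5, 6]}, index=[10, 11, 12])  # non-default index

# without reset_index this would give six NaN-padded rows, because
# concat(axis=1) aligns on index labels rather than row position
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)
print(aligned)
```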

concat by taking the values from column

I have a list ['df1', 'df2'] where I have stored some dataframes that have been filtered on a few conditions. Then I converted this list to a dataframe using
df = pd.DataFrame(list1)
Now the df has only one column:
0
df1
df2
Sometimes it may also have:
0
df1
df2
df3
I wanted to concat all of these. My static code is:
df_new = pd.concat([df1, df2], axis=1) or
df_new = pd.concat([df1, df2, df3], axis=1)
How can I make it dynamic (without me specifying df1, df2) so that it takes the values and concats them?
Using a list to collect the data frames, then concatenating once at the end:
import pandas as pd
lists = [[1,2,3],[4,5,6]]
arr = []
for l in lists:
    new_df = pd.DataFrame(l)
    arr.append(new_df)
df = pd.concat(arr, axis=1)
df
Result :
0 0
0 1 4
1 2 5
2 3 6
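A related trick worth knowing: if the dataframes live in a dict rather than a bare list of name strings, passing the dict straight to pd.concat uses its keys to label each block of columns. A small sketch with invented frames:

```python
import pandas as pd

# hypothetical frames stored by name instead of a list of name strings
frames = {
    "df1": pd.DataFrame({"a": [1, 2]}),
    "df2": pd.DataFrame({"a": [3, 4]}),
}

# passing the dict itself makes its keys the top level
# of a column MultiIndex, so each source stays identifiable
combined = pd.concat(frames, axis=1)
print(combined)
```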

Concat dataframes on different columns

I have 3 different csv files and I'm looking to concatenate their values. The only condition I need is that the first csv's dataframe must go in column A of the new csv, the second csv's dataframe in column B, and the third csv's dataframe in column C. The number of rows is the same for all csv files.
I also need to change the three headers to ['año_pasado','mes_pasado','este_mes'].
import pandas as pd
df = pd.read_csv('año_pasado_subastas2.csv', sep=',')
df1 = pd.read_csv('mes_pasado_subastas2.csv', sep=',')
df2 = pd.read_csv('este_mes_subastas2.csv', sep=',')
df1
>>>
Subastas
166665859
237944547
260106086
276599496
251813654
223790056
179340698
177500866
239884764
234813107
df2
>>>
Subastas
212003586
161813617
172179313
209185016
203804433
198207783
179410798
156375658
130228140
124964988
df3
>>>
Subastas
142552750
227514418
222635042
216263925
196209965
140984000
139712089
215588302
229478041
222211457
The output that I need is:
año_pasado,mes_pasado,este_mes
166665859,124964988,142552750
237944547,161813617,227514418
260106086,172179313,222635042
276599496,209185016,216263925
251813654,203804433,196209965
223790056,198207783,140984000
179340698,179410798,139712089
177500866,156375658,215588302
239884764,130228140,229478041
234813107,124964988,222211457
I think you need to concat Series, created either with squeeze=True (if the file contains only one column of data) or by selecting the column; for the new column names, use the keys parameter:
df = pd.read_csv('año_pasado_subastas2.csv', squeeze=True)
df1 = pd.read_csv('mes_pasado_subastas2.csv', squeeze=True)
df2 = pd.read_csv('este_mes_subastas2.csv', squeeze=True)
cols = ['año_pasado','mes_pasado','este_mes']
df = pd.concat([df, df1, df2], keys = cols, axis=1)
Or:
df = pd.read_csv('año_pasado_subastas2.csv')
df1 = pd.read_csv('mes_pasado_subastas2.csv')
df2 = pd.read_csv('este_mes_subastas2.csv')
cols = ['año_pasado','mes_pasado','este_mes']
df = pd.concat([df['Subastas'], df1['Subastas'], df2['Subastas']], keys = cols, axis=1)
print (df)
año_pasado mes_pasado este_mes
0 166665859 212003586 142552750
1 237944547 161813617 227514418
2 260106086 172179313 222635042
3 276599496 209185016 216263925
4 251813654 203804433 196209965
5 223790056 198207783 140984000
6 179340698 179410798 139712089
7 177500866 156375658 215588302
8 239884764 130228140 229478041
9 234813107 124964988 222211457
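One caveat for newer pandas: the squeeze=True argument to read_csv was deprecated in pandas 1.4 and removed in 2.0. The .squeeze() method achieves the same one-column-to-Series conversion; a minimal sketch using inline data instead of the question's files:

```python
from io import StringIO
import pandas as pd

csv_data = "Subastas\n166665859\n237944547\n"

# .squeeze("columns") turns a one-column DataFrame into a Series,
# replacing the removed squeeze=True read_csv argument
s = pd.read_csv(StringIO(csv_data)).squeeze("columns")
print(s)
```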

Organising columns in pandas DataFrame

I have 2 sets of columns, 96 in total. At the moment, all 96 columns are in one row. Is it possible to reorganise this so that the 96 columns are split into sets of 12, essentially giving 8 sets of 12 columns (one underneath the other)? My code:
import glob
import pandas as pd
import os
os.chdir('C:/Users/peaches9/Desktop/')
Result = []
def FID_extract(filepath):
    path_pattern = filepath
    files = glob.glob(path_pattern)
    dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
    new_dfa = pd.DataFrame()
    for i, df in enumerate(dataframes):
        colname = 'Run {}'.format(i+1)
        selected_data = df['Unnamed: 3'].ix[12:17]
        new_dfa[colname] = selected_data
    #print new_dfa
    #new_dfa.to_csv('FID_11169_Liquid.csv')
    Result.append(new_dfa)

def TCD_extract(filepath):
    path_pattern = filepath
    files = glob.glob(path_pattern)
    dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
    new_dfb = pd.DataFrame()
    for i, df in enumerate(dataframes):
        colname = 'Run {}'.format(i+1)
        selected_data = df['Unnamed: 3'].ix[12:15]
        new_dfb[colname] = selected_data
    #print new_dfb
    #new_dfb.to_csv('TCD_11169_liquid.csv')
    Result.append(new_dfb)
FID_extract('C:/Users/peaches9/Desktop/Cryostat Verification/GC results/11169_Cryo_1bar/FID_0*') #files directory
TCD_extract('C:/Users/peaches9/Desktop/Cryostat Verification/GC results/11169_Cryo_1bar/TCD_0*')
dfc = pd.concat(Result)
Out:
Run 1..... Run 95 Run 96
12 5193791.85 5193915.21 5194343.34
13 1460874.04 1460929.33 1461072.84
14 192701.82 192729.55 192743.99
15 156836.4 156876.97 156889.26
16 98342.84 98346.7 98374.95
17 NaN NaN NaN
12 3982.69 3982.16 4017.66
13 2913008.04 2913627.33 2914075.7
14 226963.37 226956.1 227106.71
15 25208.2 25173.89 25197.88
I want all 96 columns split into 8 X 12 columns all underneath each other. Many thanks in advance.
EDIT:
I have managed to separate the dataframe into 8 sets of 12 columns... but I can't get each dataframe to go beneath the others. They concat to the right, always!
dfc = pd.concat(Result)
df1 = dfc.ix[:,0:12]
df2 = dfc.ix[:,12:24]
df3 = dfc.ix[:,24:36]
df4 = dfc.ix[:,36:48]
df5 = dfc.ix[:,48:60]
df6 = dfc.ix[:,60:72]
df7 = dfc.ix[:,72:84]
df8 = dfc.ix[:,84:96]
pieces = [df1,df2,df3,df4,df5,df6,df7,df8]
df_final = pd.concat([df1, df2], levels = 1, axis = 3)
Assuming you are trying to take your 96 columns, and create a single 2-dimensional DataFrame with 12 columns and 8 times as many rows, then you want:
df_final = pd.concat( pieces, axis=0, ignore_index=True )
If you are trying to make a 3-dimensional DataFrame, with your new dimension having 8 values, you aren't trying to make a DataFrame, but a Panel.
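The likely reason the pieces kept concatenating "to the right" is that concat(axis=0) stacks frames by matching column labels, and each 12-column slice carries different Run labels. Giving every piece a common set of labels first makes the vertical stack work; a toy sketch with a 1x4 frame split into two 2-column pieces:

```python
import pandas as pd

wide = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "b", "c", "d"])

# slice into column groups, give every group the same labels, then stack;
# without the relabelling, concat(axis=0) would produce a NaN-padded
# union of all four column names instead of stacking
pieces = [wide.iloc[:, 0:2], wide.iloc[:, 2:4]]
pieces = [p.set_axis(["x", "y"], axis=1) for p in pieces]
stacked = pd.concat(pieces, axis=0, ignore_index=True)
print(stacked)
```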
