Many DataFrames in one output, how to merge them (Python)

In my code:
def my_def():
    abc_all = open('file.txt')  # file.txt is a big data file with many lines
    lines = abc_all.readlines()
    lines = lines[4:]
file.txt looks like this (about 120 characters per line, and many lines):
AAA S S SSDAS ASDJAI A 234 33 43 234 2342999 2.31 22 33....
SSS S D W2UUQ Q231WQ A 222 11 23 123 1231299 2.31 22 11....
    for line in lines:
        abcd = line[5:-34]
        abcd2 = abcd[:27] + abcd[40:]
        abc = abcd2.split()
        result = pd.DataFrame(abc)
Now I want to save the results, but when I use for example result.to_csv() I receive only one line in my output file.
I printed result, and as I can see each line becomes its own DataFrame; result is overwritten on every iteration of the loop, so the output contains only a single line.
part of result:
0 1 2 3 4 5 6
0 22.2222 -11.1111 222 111 name1 1 l
0 1 2 3 4 5 6
0 33.2222 -11.1111 444 333 name2 1 c
0 1 2 3 4 5 6
0 12.1111 -11.1111 222 111 name3 1 c
How can I save this output as one DataFrame, or how can I merge all those DataFrames into one?
Thanks for the help!

What about building a list of lists first and then turning it into a DataFrame? It might be easier and faster:
results = []
for line in lines:
    abcd = line[5:-34]
    # if we know all rows have the same length, one split is enough
    results.append(abcd.split())
df = pd.DataFrame(results)
As a one-liner even:
df = pd.DataFrame([line[5:-34].split() for line in lines])
For example:
b = ["0 22.2222 -11.1111 222 111 name1 1 l",
     "0 33.2222 -11.1111 444 333 name2 1 c",
     "0 12.1111 -11.1111 222 111 name3 1 c"]
df = pd.DataFrame([line.split() for line in b])
print(df)
0 1 2 3 4 5 6 7
0 0 22.2222 -11.1111 222 111 name1 1 l
1 0 33.2222 -11.1111 444 333 name2 1 c
2 0 12.1111 -11.1111 222 111 name3 1 c
As stated by @Andrew L, it might be interesting to check whether the file is in a format already supported by one of the many pandas.read_... functions.
EDIT: This seems like a job for pd.read_csv('file.txt', sep=' ') (or sep=r'\s+' if fields are separated by runs of whitespace), then deleting the columns you don't want / selecting the ones you want to keep.
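A small sketch of that EDIT; since the real file layout isn't shown, the sample lines and the column positions kept below are made up for the demo:

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for file.txt; the real column layout is unknown
raw = ("AAA S S SSDAS ASDJAI A 234 33 43\n"
       "SSS S D W2UUQ Q231WQ A 222 11 23\n")

# sep=r'\s+' treats any run of whitespace as a single delimiter
df = pd.read_csv(StringIO(raw), sep=r'\s+', header=None)
df = df[[6, 7, 8]]  # keep only the numeric columns (demo choice)
print(df.shape)  # → (2, 3)
```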

Related

Combine two number columns but exclude zero

I have the following dataframe from a database download that I cleaned up a bit. Unfortunately, some of the single numbers were split across a second column (row 9). I'm trying to merge the two columns while excluding the zero values.
city crashes crashes_1 total_crashes
1 ABERDEEN 710 0 710
2 ACHERS LODGE 1 0 1
3 ACME 1 0 1
4 ADVANCE 55 0 55
5 AFTON 2 0 2
6 AHOSKIE 393 0 393
7 AKERS CENTER 1 0 1
8 ALAMANCE 50 0 50
9 ALBEMARLE 1 58 59
So for row 9 I want to end with:
9 ALBEMARLE 1 58 158
I tried a few snippets but nothing seems to work:
df['total_crashes'] = df['crashes'].astype(str).str.zfill(0) + df['crashes_1'].astype(str).str.zfill(0)
df['total_crashes'] = df['total_crashes'].astype(str).replace('\0', '', regex=True)
df['total_crashes'] = df['total_crashes'].apply(lambda x: ''.join(x[x!=0]))
df['total_crashes'] = df['total_crashes'].str.cat(df['total_crashes'], x[x!=0])
df['total_crashes'] = df.drop[0].sum(axis=1)
Thanks for any help.
You can use a where condition:
df['total_crashes'] = df['crashes'].astype(str) + df['crashes_1'].astype(str).where(df['crashes_1'] != 0, "")
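A quick self-contained check of that expression, using two rows copied from the question's table:

```python
import pandas as pd

# Two rows from the question's data
df = pd.DataFrame({"city": ["ABERDEEN", "ALBEMARLE"],
                   "crashes": [710, 1],
                   "crashes_1": [0, 58]})

# where() keeps the string form of crashes_1 only when it is non-zero,
# otherwise substitutes an empty string before concatenation
df["total_crashes"] = df["crashes"].astype(str) + \
    df["crashes_1"].astype(str).where(df["crashes_1"] != 0, "")
print(df["total_crashes"].tolist())  # → ['710', '158']
```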

develop excel database based on column filters using pandas python

I am working on a raw Excel file to develop a database in an organized (.xlsx) format. The demo input file is given as:
input file
FromTo B# Bname Id Mend
1 to 2 123 bus1 1 F
1 to 3 234 bus2 1 F
5 to 6 321 bus3 2 F
9 to 10 322 bus5 2 F
1 to 2 326 bus6 1 F
1 to 2 457 bus7 1 F
5 to 6 656 bus8 1 F
9 to 10 780 bus9 2 F
1 to 3 875 bus10 2 F
1 to 3 564 bus11 2 F
The required output is in the following format:
output format
Essentially, I want to automate filtering on the column 'FromTo' (based on cell value) of the input and carry over the information in the other columns as-is, as depicted in the output format image.
For the output, I am able to get columns B to E in the required order and format. For this, I used the following pandas logic:
import pandas as pd
df = pd.read_excel('st2_trial.xlsx')
# create an empty dataframe
df_1 = pd.DataFrame()
ai = ['1 to 2', '1 to 3', '5 to 6', '9 to 10']  # all entries from input column 'FromTo'
for i in range(len(ai)):
    filter_ai = (df['FromTo'] == ai[i])
    df_ai = df.loc[filter_ai]
    df_1 = pd.concat([df_1, df_ai])
print(df_1)
Getting the following output from this code:
FromTo B# Bname Id Mend
1 to 2 123 bus1 1 F
1 to 2 326 bus6 1 F
1 to 2 457 bus7 1 F
1 to 3 234 bus2 1 F
1 to 3 875 bus10 2 F
1 to 3 564 bus11 2 F
1 to 3 893 bus12 1 F
5 to 6 321 bus3 2 F
5 to 6 656 bus8 1 F
5 to 6 212 bus13 2 F
9 to 10 322 bus5 2 F
9 to 10 780 bus9 2 F
However, clearly, the first column is not the way I want it! I am looking to avoid redundant entries of '1 to 2', '1 to 3', etc. in the first column.
I believe this can be achieved with proper loops for the first output column. Any help will be highly appreciated!
PS: I have something in mind to work around this:
-create an empty dataframe
-list all unique entries of column 'FromTo'
-take the first element of the list and put it in the first column of the output
-then loop over my logic to get the other required information as-is
This way, I think, it would avoid the redundant entries in the first column of the output.
The above question seems similar, if not identical, to How to print a groupby object. However, I will post my answer here in case it helps.
import pandas as pd
df = pd.read_excel('st2_trial.xlsx')
df_group = df.groupby('FromTo').apply(lambda a: a.drop('FromTo', axis=1).reset_index(drop=True))
print(df_group)
OUTPUT:
B# Bname Id Mend
FromTo
1 to 2 0 123 bus1 1 F
1 326 bus6 1 F
2 457 bus7 1 F
1 to 3 0 234 bus2 1 F
1 875 bus10 2 F
2 564 bus11 2 F
5 to 6 0 321 bus3 2 F
1 656 bus8 1 F
9 to 10 0 322 bus5 2 F
1 780 bus9 2 F
You could also get your expected output by iterating over the groups; note that printing the GroupBy object itself only shows its repr, so print each group instead:
for name, group in df_1.groupby('FromTo'):
    print(name)
    print(group.drop('FromTo', axis=1))
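For reference, here is a self-contained version of the groupby approach on a few rows copied from the question (no Excel file needed); the grouping columns are selected explicitly so the snippet works across pandas versions:

```python
import pandas as pd

# A few rows from the question's input table
df = pd.DataFrame({'FromTo': ['1 to 2', '1 to 3', '5 to 6', '1 to 2'],
                   'B#': [123, 234, 321, 326],
                   'Bname': ['bus1', 'bus2', 'bus3', 'bus6']})

# Each group gets its own 0-based row numbering; 'FromTo' appears only
# once per group, as the outer level of the resulting MultiIndex
grouped = df.groupby('FromTo')[['B#', 'Bname']].apply(
    lambda a: a.reset_index(drop=True))
print(grouped.index.get_level_values(0).unique().tolist())
# → ['1 to 2', '1 to 3', '5 to 6']
```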

Merge two Dataframes with same columns with overwrite

I have a dataframe like this:
df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val": ["111", "111", "222", "222"],
                   "qwe": ["", "11", "", "12"]})
It gives:
flag qwe val
0 1 111
1 0 11 111
2 1 222
3 0 12 222
Then I filter the first dataframe like this:
dff = df.loc[df["flag"]=="1"]
**was:**
dff.loc["qwe"] = "123"
**edited:** (setting all rows in column "qwe" to "123")
dff["qwe"] = "123"
And now I need to merge/join df and dff in such a way that I get:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
That is, taking the changes in 'qwe' from dff only where the df value is empty.
Something like this:
pd.merge(df, dff, left_index=True, right_index=True, how="left")
gives
flag_x qwe_x val_x flag_y qwe_y val_y
0 1 111 1 111
1 0 11 111 NaN NaN NaN
2 1 222 1 222
3 0 12 222 NaN NaN NaN
so after that I need to drop flag_y and val_y, rename the _x columns, and manually merge qwe_x with qwe_y. Is there an easier way?
pd.merge has an on argument that you can use to join columns with the same name in different dataframes.
Try:
pd.merge(df, dff, how="left", on=['flag', 'qwe', 'val'])
However, I don't think you need a merge at all. You can produce the same result using df.loc to conditionally assign a value; note that the empty cells are empty strings rather than NaN, so compare against "":
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
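A quick check of a conditional assignment on the question's data; since the empty cells are empty strings rather than NaN, the condition compares against "":

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val": ["111", "111", "222", "222"],
                   "qwe": ["", "11", "", "12"]})

# Overwrite 'qwe' only where flag is "1" and the cell is an empty string
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
print(df["qwe"].tolist())  # → ['123', '11', '123', '12']
```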
After the edited changes, this code works for me:
c1 = dff.combine_first(df)
It produces:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
This is exactly what I was looking for.
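Put together as a runnable snippet (data from the question; .copy() added to avoid a SettingWithCopyWarning on the filtered slice):

```python
import pandas as pd

df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val": ["111", "111", "222", "222"],
                   "qwe": ["", "11", "", "12"]})
dff = df.loc[df["flag"] == "1"].copy()
dff["qwe"] = "123"

# combine_first takes dff's values where present and fills the
# remaining rows (flag == "0") from df
c1 = dff.combine_first(df)
print(c1["qwe"].tolist())  # → ['123', '11', '123', '12']
```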

Joining two df with condition in python

I have two dataframes.
df1 has over 2 million rows and complete data. I'd like to join data from df2, which has over 70,000 rows, but its structure is a bit complicated.
df1 has keys KO-STA and KO-PAR for each row.
df2 in some cases has data only on KO-STA, in some cases only on KO-PAR, and in some cases on both.
I'd like to merge those two dataframes and get the data in Need1 and Need2.
df1
STA_SID DST_SID CC KO_SIFKO KO-STA KO-PAR
135 10021582 28878502 NaN 634 634-83 537-780/9
117 10028732 29999540 NaN 657 657-1729 537-780/4
117 10028732 29999541 NaN 657 657-1729 537-780/4
117 10028732 29999542 NaN 657 657-1729 537-780/4
117 10028732 29999543 NaN 657 657-1729 537-780/4
117 10028732 31356572 NaN 657 657-1729 537-780/4
df2
KO-STA STA-PAR KO-PAR Need1 Need2
0 1976-_ 366/2 1976-366/2 Bio 49.500000
1 991-_ 329/128 991-329/128 PH 184.399994
2 2147--- 96/19 2147-96/19 Win 8.850000
3 2048-_ 625/4 2048-625/4 SSE 4.940000
4 2194-_ 285/3 2194-285/3 TI f 163.000000
5 2386--- 97/1 2386-97/1 Bio 49.500000
6 2002-_ 2002/9 2002-2002/9 Win 12.850000
7 1324-_ 62 1324-62 Win 8.850000
8 1625-_ 980/1 1625-980/1 Win 8.850000
9 1625-_ 980/1 1625-980/1 Bio 49.500000
My attempt was the following code:
GURS_ES1 = pd.merge(df1.reset_index(), df2.reset_index(), on='KO-STA')
GURS_ES2 = pd.merge(GURS_ES1.reset_index(), df2.reset_index(), on='KO-PAR')
But after the first merge, GURS_ES1 has two columns KO-PAR_x and KO-PAR_y, and it doesn't join them as one column. Any recommendations?
Let me give you an example to show how you can proceed and why you observed this behaviour.
First, let's construct our sample data:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(1, 3, size=(3, 3)), columns=['a1', 'x1', 'x2'])
Output
a1 x1 x2
0 1 2 1
1 2 1 1
2 1 2 2
Now, the other dataframe
df2 = pd.DataFrame(np.random.randint(1, 3, size=(3, 3)), columns=['a2', 'x1', 'x2'])
a2 x1 x2
0 2 2 1
1 1 2 2
2 1 1 2
Now, if we merge on only(!) one of the columns that occurs in both dataframes, pandas needs to let you reconstruct which dataframe the other shared columns originally came from, so it adds suffixes:
pd.merge(df1,df2, on='x1')
Output
a1 x1 x2_x a2 x2_y
0 1 2 1 2 1
1 1 2 1 1 2
2 1 2 2 2 1
3 1 2 2 1 2
4 2 1 1 1 2
Now, the easiest way to get rid of this is to drop one of the doubly occurring columns from one of the dataframes:
pd.merge(df1[df1.columns.drop('x2')], df2, on='x1')
Output
a1 x1 a2 x2
0 1 2 2 1
1 1 2 1 2
2 1 2 2 1
3 1 2 1 2
4 2 1 1 2
But you could also merge on a list of columns. Note that we perform an inner join here, which can reduce the number of rows in the output dataframe significantly (or even yield an empty dataframe if there are no matches on both columns):
pd.merge(df1,df2, on=['x1','x2'])
a1 x1 x2 a2
0 1 2 1 2
1 1 2 2 1
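A deterministic variant of the example above (the random data is replaced by fixed values so the output is reproducible), showing merge's suffixes argument, which is another way to keep both copies of the duplicated column under readable names:

```python
import pandas as pd

df1 = pd.DataFrame({'a1': [1, 2], 'x1': [2, 1], 'x2': [1, 1]})
df2 = pd.DataFrame({'a2': [2, 1], 'x1': [2, 1], 'x2': [1, 2]})

# Merging on 'x1' alone duplicates 'x2'; suffixes names the two copies
m = pd.merge(df1, df2, on='x1', suffixes=('_left', '_right'))
print(list(m.columns))  # → ['a1', 'x1', 'x2_left', 'a2', 'x2_right']
```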

Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present" "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd
df = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]})
df.set_index(['Name1', 'Name2'], inplace=True)
df2 = pd.DataFrame({'Data1': [80, 61, 45, 30],
                    'Data2': [6, 8, 7, 3]},
                   index=pd.Series([1, 2, 4, 8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1, 0], axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3
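In recent pandas versions (0.23 and later), join's on argument also accepts an index level name, so the same result can be had without resetting the index at all. A sketch using the data from the answer above:

```python
import pandas as pd

df = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]})
df.set_index(['Name1', 'Name2'], inplace=True)
df2 = pd.DataFrame({'Data1': [80, 61, 45, 30],
                    'Data2': [6, 8, 7, 3]},
                   index=pd.Series([1, 2, 4, 8], name='Name2'))

# 'Name2' names an index level of df, and df2 is indexed by Name2,
# so join lines them up while keeping the (Name1, Name2) MultiIndex
result = df.join(df2, on='Name2')
print(result.loc[('b', 4), 'Data1'])  # → 45
```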
