Why am I not getting my whole output in the run module? - python

I'm not getting my whole output, nor my column names, on my screen.
import sqlite3
import pandas as pd
hello = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
rs = hello.execute("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000")
df = pd.DataFrame(rs.fetchall())
hello.close()
print(df.head())
actual result:
0 1 2 3 4 ... 6 7 8 9 10
0 1 3390 3390 One and the Same 271 ... 23 None 217732 3559040 0.99
1 1 3392 3392 Until We Fall 271 ... 23 None 230758 3766605 0.99
2 1 3393 3393 Original Fire 271 ... 23 None 218916 3577821 0.99
3 1 3394 3394 Broken City 271 ... 23 None 228366 3728955 0.99
4 1 3395 3395 Somedays 271 ... 23 None 213831 3497176 0.99
[5 rows x 11 columns]
expected result:
PlaylistId TrackId TrackId Name AlbumId MediaTypeId \
0 1 3390 3390 One and the Same 271 2
1 1 3392 3392 Until We Fall 271 2
2 1 3393 3393 Original Fire 271 2
3 1 3394 3394 Broken City 271 2
4 1 3395 3395 Somedays 271 2
GenreId Composer Milliseconds Bytes UnitPrice
0 23 None 217732 3559040 0.99
1 23 None 230758 3766605 0.99
2 23 None 218916 3577821 0.99
3 23 None 228366 3728955 0.99
4 23 None 213831 3497176 0.99

The ... in the middle indicates that some of the columns have been omitted from the display. If you want to see the entire data, you should modify the pandas display options, which you can do with the pandas.set_option() method (see the pandas documentation).
In your case, set display.max_columns to None so that pandas displays an unlimited number of columns. You will also have to read the column names in from the database or set them manually; a sketch of reading them from the database itself is shown below.

To display all the columns, use the snippet below:
pd.set_option("display.max_columns",None)
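For the missing column names, one option (a sketch, assuming the same Chinook database and query as in the question) is to let pandas run the query itself, or to take the names from the cursor's description attribute:
import sqlite3
import pandas as pd

conn = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
query = ("SELECT * FROM PlaylistTrack "
         "INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId "
         "WHERE Milliseconds < 250000")

# Option 1: let pandas run the query and pick up the column names itself
df = pd.read_sql_query(query, conn)

# Option 2: keep the cursor-based approach and take the names from cursor.description
rs = conn.execute(query)
df = pd.DataFrame(rs.fetchall(), columns=[col[0] for col in rs.description])

conn.close()
print(df.head())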

By default, pandas also limits the number of rows it displays, but you can change that to suit your needs. Here is a helper function I use whenever I need to print a full data frame:
import pandas as pd

def print_full(df):
    pd.set_option('display.max_rows', len(df))
    print(df)
    pd.reset_option('display.max_rows')

Related

Merge two dataframes based on condition

I have these two dataframes:
sp_client
ConnectionID Value
0 CN01493292 495
1 CN01492424 440
2 CN01491959 403
3 CN01493200 312
4 CN01493278 282
.. ... ...
110 CN01492864 1
111 CN01492513 1
112 CN01492899 1
113 CN01493010 1
114 CN01493032 1
[115 rows x 2 columns]
sp_server
ConnectionID Value
1 CN01491920 2
1 CN01491920 2
3 CN01491922 2
3 CN01491922 2
5 CN01491928 2
.. ... ...
595 CN01493166 3
595 CN01493166 3
595 CN01493166 3
597 CN01493163 2
597 CN01493163 2
[673 rows x 2 columns]
I would like to merge them in a way where sp_client['Value'] is incremented by sp_server['Value'] whenever the rows satisfy the condition sp_server['ConnectionID'] == sp_client['ConnectionID'].
It was a little bit complicated for me. I tried the following, but I am missing the condition part. Maybe it does not need to be merged in the first place. Happy to hear suggestions.
As per my comment, try appending the tables and grouping them by ID while summing the Value column, as in this example:
all_data = pd.concat([sp_server, sp_client])
all_data = all_data.groupby('ConnectionID')['Value'].sum().reset_index()
out:
ConnectionID Value
0 CN01491920 4
1 CN01491922 4
2 CN01491928 2
3 CN01491959 403
4 CN01492424 440
5 CN01493200 312
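A self-contained sketch of the same idea, built from a few of the rows visible in the question:
import pandas as pd

sp_client = pd.DataFrame({'ConnectionID': ['CN01491959', 'CN01492424'],
                          'Value': [403, 440]})
sp_server = pd.DataFrame({'ConnectionID': ['CN01491920', 'CN01491920', 'CN01491922'],
                          'Value': [2, 2, 2]})

all_data = pd.concat([sp_server, sp_client])
all_data = all_data.groupby('ConnectionID')['Value'].sum().reset_index()
print(all_data)
#   ConnectionID  Value
# 0   CN01491920      4
# 1   CN01491922      2
# 2   CN01491959    403
# 3   CN01492424    440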

Replace blank value in dataframe based on another column condition

I have many blanks in a merged data set and I want to fill them with a condition.
My current code looks like this
import pandas as pd
import csv
import numpy as np
pd.set_option('display.max_columns', 500)
# Read all files into pandas dataframes
Jan = pd.read_csv(r'C:\~\Documents\Jan.csv')
Feb = pd.read_csv(r'C:\~\Documents\Feb.csv')
Mar = pd.read_csv(r'C:\~\Documents\Mar.csv')
Jan=pd.DataFrame({'Department':['52','5','56','70','7'],'Item':['2515','254','818','','']})
Feb=pd.DataFrame({'Department':['52','56','765','7','40'],'Item':['2515','818','524','','']})
Mar=pd.DataFrame({'Department':['7','70','5','8','52'],'Item':['45','','818','','']})
all_df_list = [Jan, Feb, Mar]
appended_df = pd.concat(all_df_list)
df = appended_df
df.to_csv(r"C:\~\Documents\SallesDS.csv", index=False)
Data set:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7
40
7 45
70
5 818
8
52
What I want is to fill the empty cells in Item with the value that corresponds to that row's Department.
So if Department is 52 and Item is empty, it should be filled with 2515;
if Department is 7 and Item is empty, fill it with 45.
The result should look like this:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7 45
40
7 45
70
5 818
8
52 2515
I tried the following methods but none of them worked.
1
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(52)), 'Item'] = 2515
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(7)), 'Item'] = 45
2
df["Item"] = df["Item"].fillna(df["Department"])
df = df.replace({"Item":{"52":"2515", "7":"45"}})
Both either return an error or do not work.
Answer:
Hi, I have used the below code and it worked:
b = [52]
df.Item=np.where(df.Department.isin(b),df.Item.fillna(2515),df.Item)
a = [7]
df.Item=np.where(df.Department.isin(a),df.Item.fillna(45),df.Item)
Hope it helps someone who faces the same issue.
The following solution first creates a map of each department and its maximum corresponding item (assuming there is one), and then matches that item to any department with a blank item. Note that in your data frame, the empty items are an empty string ("") and not NaN.
Create a map:
values = df.groupby('Department').max()
values['Item'] = values['Item'].apply(lambda x: np.nan if x == "" else x)
values = values.dropna().reset_index()
Department Item
0 5 818
1 52 2515
2 56 818
3 7 45
4 765 524
Then use df.apply():
df['Item'] = df.apply(lambda x: values[values['Department'] == x['Department']]['Item'].values if x['Item'] == "" else x['Item'], axis=1)
In this case, the new values will have brackets around them. They can be removed with str.replace():
df['Item'] = df['Item'].astype(str).str.replace(r'\[|\'|\'|\]', "", regex=True)
The result:
Department Item
0 52 2515
1 5 254
2 56 818
3 70
4 7 45
0 52 2515
1 56 818
2 765 524
3 7 45
4 40
0 7 45
1 70
2 5 818
3 8
4 52 2515
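A more compact variant of the same mapping idea (just a sketch, using the df built in the question; it assumes the blanks are empty strings, as noted above, and it keeps the first non-blank Item seen for each Department rather than the maximum):
import numpy as np
import pandas as pd

# build a Department -> Item lookup from the rows that already have an Item
lookup = (df.replace('', np.nan)
            .dropna()
            .drop_duplicates('Department')
            .set_index('Department')['Item'])

# fill blank Items from the lookup; departments with no known Item stay blank
df['Item'] = df['Item'].mask(df['Item'] == '', df['Department'].map(lookup)).fillna('')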

Replace every nth row in df1 with every row from df2

(Absolute beginner here)
The following code should replace every 9th row of the template df with EVERY row of the data df. However it replaces every 9th row of template with every 9th row of data.
template.iloc[::9, 2] = data['Question (en)']
template.iloc[::9, 3] = data['Correct Answer']
template.iloc[::9, 4] = data['Incorrect Answer 1']
template.iloc[::9, 5] = data['Incorrect Answer 2']
Thank you for your help
The source of the problem with your code is that the first step in any operation on two DataFrames is their alignment by index.
To avoid this step, take the underlying NumPy array from one of the frames by accessing values.
Since a NumPy array has no index, pandas cannot perform the alignment mentioned above.
Other corrections are:
to take from the second DataFrame only as many rows as are needed,
and only those columns that are to be saved in the target array,
and to perform the whole update "in one go" (see the code below).
To create both source test arrays, I defined the following function:
def getTestDf(nRows: int, tt: str, valShift=0):
    qn = np.array(list(map(lambda i: tt + str(i), np.arange(nRows, dtype=int))))
    ans = np.arange(nRows * 3, dtype=int).reshape((-1, 3)) + valShift
    return pd.concat([pd.DataFrame({'Question (en)': qn}),
                      pd.DataFrame(ans, columns=['Correct Answer', 'Incorrect Answer 1', 'Incorrect Answer 2'])],
                     axis=1)
and called it:
template = getTestDf(80, 'Question_')
data = getTestDf(9, 'New question ', 1000)
Note that after I created template I counted that just 9 rows in data
are needed, so I created data with just 9 rows.
This way the initial part of template contains:
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 Question_0 0 1 2
1 Question_1 3 4 5
2 Question_2 6 7 8
3 Question_3 9 10 11
4 Question_4 12 13 14
...
and data (in full):
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 New question 0 1000 1001 1002
1 New question 1 1003 1004 1005
2 New question 2 1006 1007 1008
3 New question 3 1009 1010 1011
4 New question 4 1012 1013 1014
5 New question 5 1015 1016 1017
6 New question 6 1018 1019 1020
7 New question 7 1021 1022 1023
8 New question 8 1024 1025 1026
Now, to copy selected rows, run just:
template.iloc[::9] = data.values
The initial part of template contains now:
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 New question 0 1000 1001 1002
1 Question_1 3 4 5
2 Question_2 6 7 8
3 Question_3 9 10 11
4 Question_4 12 13 14
5 Question_5 15 16 17
6 Question_6 18 19 20
7 Question_7 21 22 23
8 Question_8 24 25 26
9 New question 1 1003 1004 1005
10 Question_10 30 31 32
11 Question_11 33 34 35
12 Question_12 36 37 38
13 Question_13 39 40 41
14 Question_14 42 43 44
15 Question_15 45 46 47
16 Question_16 48 49 50
17 Question_17 51 52 53
18 New question 2 1006 1007 1008
19 Question_19 57 58 59
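Applied back to the original question's code, the same trick would look roughly like this (a sketch: it assumes, as in the question, that columns 2 to 5 of template hold the question and the three answers, and that data has exactly one row for each ninth row of template):
cols = ['Question (en)', 'Correct Answer', 'Incorrect Answer 1', 'Incorrect Answer 2']
template.iloc[::9, 2:6] = data[cols].values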
I am pretty sure that there are simpler/nicer ways, but just off the top of my head:
template_9 = template.iloc[::9, 0:2].copy()
# outer join on a dummy key
template_9['key'] = 0
data['key'] = 0
template_9 = template_9.merge(data, on='key', how='left')  # you don't need left here, but I think it's clearer
template_9.drop('key', axis=1, inplace=True)
template = pd.concat([template, template_9]).drop_duplicates(keep='last')
In case you want to keep the index replace:
template_9.reset_index().merge(data, how='left').set_index('index')
and then you can sort by index in the end.
P.S. I'm assuming column names are the same in both data frames, but it should be straightforward to adapt it anyway.
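As a side note, in pandas 1.2 and later the dummy key column is not needed, because a cross join can be requested directly (a sketch, using the same template and data as above):
template_9 = template.iloc[::9, 0:2].copy()
template_9 = template_9.merge(data, how='cross')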

How can I merge these two datasets on 'Name' and 'Year'?

I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
Now, I want to know how I can merge the two using the difflib library even though the names differ.
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest if I can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a, df_b, on=['merge_name','merge_years'], how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['Name'])[0])
to replace names in df_b with closest match from df_a, then do your merge. See also this post.
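A small self-contained sketch of that approach (the miniature salaries/stats frames, the n=1 and the 0.6 cutoff are just illustrative stand-ins for all_batsman_df and all_batting_statistics_df):
import difflib
import pandas as pd

salaries = pd.DataFrame({'years': [1991], 'name': ['Glenn Davis'], 'salary': [3275000.0]})
stats = pd.DataFrame({'Year': [1991], 'Name': ['Glen Davis'], 'Tm': ['BAL']})

# replace each name in stats with its closest match among the names in salaries;
# note that get_close_matches returns an empty list (so [0] would fail) when
# nothing clears the cutoff, so a real run may need a fallback
stats['Name'] = stats['Name'].apply(
    lambda x: difflib.get_close_matches(x, salaries['name'], n=1, cutoff=0.6)[0])

merged = stats.merge(salaries, left_on=['Year', 'Name'], right_on=['years', 'name'])
print(merged)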
Let me get to your problem by assuming that you want to end up merging on 2 columns: 1. 'year' and 2. 'name'. Okay:
1. We will first fix all the names which are spelled wrong.
I hope you know all the wrong names in all_batting_statistics_df; correct them with something like this:
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, work on the smaller data set that has the names you know, so it doesn't take long.
2. We need both data sets to share the key columns, i.e. 'year' and 'name'.
Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'], axis=1)
I cannot see all the 31 columns, so I left them out; you will have to add the rest to the above code.
3. We need to change the column names to match, i.e. 'year' and 'name', using the DataFrame rename method:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, to merge them, we will use this:
merged = all_batsman_df_1.merge(df_new_1, left_on=['years', 'name'], right_on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools; but if you like pandas, it's not that difficult and you will find a way. All the best!

compare 2 dataframe with pandas

This is the first time I have used pandas and I do not really know how to deal with my problem.
In fact I have 2 data frame:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and each cluster contains several sequence IDs.
What I need to do is first split all my clusters (in R it would be something like: liste = split(x = data$V2, f = data$V1)),
and then create a function which displays the most similar pair of sequences within each cluster.
Here is an example:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the 3rd column contains the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues, it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in 'blast' with sequences from two different clusters. In other words, in this solution the cluster ID of a pairing is determined by only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings whose seqids share the same substring (the part after the first underscore), the most readable way is to add columns holding that substring:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec-values can be filtered the same way like it was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption mentioned above, is that if there are several exactly equal max values, only one of them will be taken into account.
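If you do want to keep all tied maxima, a small variation (a sketch) is to compare against the group-wise maximum instead of using idxmax:
# keep every pairing whose pident equals the maximum pident of its cluster (ties included)
best = data[data['pident'] == data.groupby('cluster_name')['pident'].transform('max')]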
Note: if you don't want the spec columns to be of type string, you could easily turn them into integers on the fly:
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
This merges the dataframes first on sseqid, then on qseqid, and returns results_df. Any rows with a 100% match are filtered out first. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names', right_on='sseqid')
results_df = pd.concat([results_df, cluster.merge(blast, left_on='seq_names', right_on='qseqid')])
