How to get a particular Excel cell value in a while loop - Python

I am new to Python and pandas. I have a text file (data.txt) whose content looks like "123 456 789 101123 456 789 101 112 113 110 112 123 456 789 101 112 113 110 113 123 456 789 101 112 113 110 110 ............. " etc., and an Excel file (combination.xlsx) that holds a list of combinations (in the Excel sheet, cell A1 = 123 456, A2 = 456 789, A3 = 789 101123, .................). My problem is how to get each cell value from combination.xlsx, count how often it occurs in data.txt, and print the result to another text file (final.txt). I want a while loop that starts by picking the first cell value (A1); if its count is equal to or greater than 1 it is printed to final.txt, otherwise the loop picks the second cell value (A2), and so on until the cell value/data is empty.

It seems to me that you don't need an explicit while loop here. You can get each cell value using pd.read_excel, which returns a dataframe with all cells. To count the frequency of occurrence, for each row of the dataframe you can apply len over re.findall with the regular expression \b({x})\b. This regex ensures that the number sequence (x in this f-string) matches only between word boundaries. To print to another file, you can use df["Qnt"].to_csv.
import pandas as pd
import re
data_txt = "123 456 789 101123 456 789 101 112 113 110 112 123 456 789 101 112 113 110 113 123 456 789 101 112 113 110 110"
# read XLSX cells
df = pd.read_excel("combination.xlsx", header=None, names=["Comb"])
# count occurrences
find_qnt = lambda x: len(re.findall(rf"\b({x})\b", data_txt))
# apply to each row
df["Qnt"] = df["Comb"].apply(find_qnt)
print(df)
# print into another text file
df["Qnt"].to_csv("final.txt", index=False)
Output from df
Comb Qnt
0 123 456 3
1 456 789 4
2 789 101123 1
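If you only want the combinations that occur at least once written to final.txt (as the question describes), you can filter before writing. A minimal sketch, reusing the df built above:
# keep only combinations that occur one or more times, then write them out
df.loc[df["Qnt"] >= 1, ["Comb", "Qnt"]].to_csv("final.txt", index=False)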

Related

Python: how to get unique ID and remove duplicates from column 1 (ID), and column 3 (Description), Then get the median for column 2 (Value) in Pandas

ID      Value  Description
123456  116    xx
123456  117    xx
123456  113    xx
123456  109    xz
123456  108    xz
123456  98     xz
121214  115    abc
121214  110    abc
121214  103    abc
121214  117    abz
121214  120    abz
121214  125    abz
151416  114    zxc
151416  135    zxc
151416  127    zxc
151416  145    zxm
151416  125    zxm
151416  121    zxm
The processed table should look like:
ID      xx   xz   abc  abz  zxc  zxm
123456  110  151  0    0    0    0
121214  0    0    132  113  0    0
151416  0    0    0    0    124  115
I went for the approach of the mean, but your "expected output" example doesn't give a mean. Is that me misunderstanding what you mean?
pd.pivot_table(DF, 'Value', index='ID', columns='Description')
Should do the trick; the default aggregation function is the mean, so that's ideal. More info can be found here (mind you, DF is the imported dataframe).
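If you want zeros for the ID/Description combinations that do not occur, and the median rather than the mean, you can pass aggfunc and fill_value explicitly. A minimal sketch, assuming DF holds the ID, Value and Description columns shown above:
import pandas as pd
# median per ID/Description pair, missing combinations filled with 0
out = pd.pivot_table(DF, values='Value', index='ID', columns='Description',
                     aggfunc='median', fill_value=0)
out.to_csv('processed.csv')  # hypothetical output file name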
Maybe this approach will work for you?
import pandas as pd

d = {'ID': [1,1,2,3,3,4,4,4,4,5,5], 'Value': [5,6,7,8,9,7,8,5,1,2,4]}
df = pd.DataFrame(data=d)
unique = set(df['ID'])
value_mean = []
for i in unique:
    # mean of Value over all rows sharing this ID
    a = df[df['ID']==i]['Value']
    a = a.mean()
    value_mean.append(a)
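The same per-ID means can also be computed without the explicit loop; a small sketch of the equivalent groupby call:
# mean of Value per unique ID (note: groupby sorts by ID, so the order may differ from the set-based loop)
value_mean = df.groupby('ID')['Value'].mean().tolist()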
Well, you have e.g. 6 'ID' rows with value '123456'. If you only want unique 'ID' values, you need to remove 5 of those rows, and by doing so you will no longer have duplicate 'Description' values either. The question is: do you want unique ID values, unique Description values, or unique combinations of both?
There are probably more options to solve this. What you could do is combine the ID and Description into a new column and remove the duplicates in the DataFrame. Hopefully this helps.
import pandas as pd

a = {'ID': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5],
     'Value': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6],
     'Description': ['a','a','b','b','c','d','d','a','c','d','e','e','e','a','b']}
df = pd.DataFrame(data=a)
unique_combined = []
for i in range(len(df)):
    # combine ID and Description into a single key per row
    unique_combined.append(str(df.iloc[i]['ID']) + df.iloc[i]['Description'])
df['un'] = unique_combined
df = df.drop_duplicates(subset=['un'])
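Applied to the original (non-deduplicated) rows, a groupby plus unstack version could produce the pivoted layout from the question directly; a sketch in which the median aggregation is an assumption based on the question title:
# median of Value per ID/Description pair, reshaped to one column per Description
# (missing combinations become 0); df here is the original, non-deduplicated frame
result = df.groupby(['ID', 'Description'])['Value'].median().unstack(fill_value=0)
result.to_csv('processed.csv')  # hypothetical output file name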

Creating new values in a pandas dataframe using math with existing columns

I have a df with numbers in the second column. Each number represents the length of a DNA sequence. I would like to create two new columns in which the first one says where this sequence start and the second one says where this sequence end.
This is my current df:
Names LEN
0 Ribosomal_S9: 121
1 Ribosomal_S8: 129
2 Ribosomal_L10: 100
3 GrpE: 166
4 DUF150: 141
.. ... ...
115 TIGR03632: 117
116 TIGR03654: 175
117 TIGR03723: 314
118 TIGR03725: 212
119 TIGR03953: 188
[120 rows x 2 columns]
And this is what I am trying to get
Names LEN Start End
0 Ribosomal_S9: 121 0 121
1 Ribosomal_S8: 129 121 250
2 Ribosomal_L10: 100 250 350
3 GrpE: 166 350 516
4 DUF150: 141 516 657
.. ... ... ... ..
115 TIGR03632: 117
116 TIGR03654: 175
117 TIGR03723: 314
118 TIGR03725: 212
119 TIGR03953: 188
[120 rows x 4 columns]
Can anyone please point me in the right direction?
Use DataFrame.assign with new columns created with Series.cumsum; for Start, the cumulative sum is shifted with Series.shift:
#convert column to integers
df['LEN'] = df['LEN'].astype(int)
#alternative for replace non numeric to missing values
#df['LEN'] = pd.to_numeric(df['LEN'], errors='coerce')
s = df['LEN'].cumsum()
df = df.assign(Start = s.shift(fill_value=0), End = s)
print (df)
Names LEN Start End
0 Ribosomal_S9: 121 0 121
1 Ribosomal_S8: 129 121 250
2 Ribosomal_L10: 100 250 350
3 GrpE: 166 350 516
4 DUF150: 141 516 657

Pandas - Grouping columns based on other columns and tagging them into new column

I have a data frame which I want to group based on the value of another column in the same data frame.
For example:
The Parent_Id and child ID are linked and define who is related to whom in a hierarchical tree.
The dataframe looks like (input from a csv file)
No Name ID Parent_Id
1 Tom 211 111
2 Galie 209 111
3 Remo 200 101
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111
7 Armin 234 101
8 Boris 454 109
9 Katya 109 323
I would like to group this data frame based on ID and Parent_Id into the grouping below, and generate CSV files from it based on the top-level parent, i.e. Alfred.csv, Carmen.csv (which will have only its own entry, i.e. line #4), Katya.csv, using the to_csv() function.
Alfred
 |_ Galie
 |_ Tom
 |_ Marvela
     |_ Remo
     |_ Armin
Carmen
Katya
 |_ Boris
And, I want to create a new column in the same data frame, that will have a tag indicating the hierarchy. Like:
No Name ID Parent_Id Tag
1 Tom 211 111 Alfred
2 Galie 209 111 Alfred
3 Remo 200 101 Marvela, Alfred
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111 Alfred
7 Armin 234 101 Marvela, Alfred
8 Boris 454 109 Katya
9 Katya 109 323
Note that the names can repeat, but the ID will be unique.
Kindly let me know how to achieve this using pandas. I tried groupby(), but it seems a little complicated and I am not getting what I intend. There should be one file for each parent, with the child records in the parent's file.
If a child has children of its own (like Marvela), it qualifies for its own csv file.
And the final output would be
Alfred.csv - All records matching Galie, Tom, Marvela
Marvela.csv - All records matching Remo, Armin
Carmen.csv - Only record matching carmen (row)
Katya.csv - all records matching katya, boris
I am building your dataframe from a dictionary:
import pandas as pd

mydf = {"No": [1,2,3,4,5,6,7,8,9],
        "Name": ["Tom","Galie","Remo","Carmen","Alfred","Marvela","Armin","Boris","Katya"],
        "ID": [211,209,200,212,111,101,234,454,109],
        "Parent_Id": [111,111,101,121,191,111,101,109,323]}
df = pd.DataFrame(data=mydf)
Then, for each row, I look up the Name of the row whose ID matches that row's Parent_Id, and finally store the results in a new column:
tag = []
for z in df['Parent_Id']:
    try:
        # find the row whose ID equals this Parent_Id and take its Name
        tag.append(df.query('ID==%s' % z)['Name'].item())
    except ValueError:
        # no row with this ID exists (top-level entry)
        tag.append('')
df['Tag'] = tag
To filter the dataframe based on a value in column Tag, e.g. Alfred:
df[df['Tag'].str.match('Alfred')]
Then save it in a csv file and repeat for the other values. Alternatively, if you have a large number of names in column Tag, use a for loop, as sketched below.
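A small sketch of such a loop, which writes one CSV per distinct parent; note it only covers rows that have a parent, so top-level entries with an empty Tag (e.g. Carmen or Alfred themselves) would need separate handling:
# write one CSV per direct parent found in the Tag column
for name in df['Tag'].unique():
    if name:  # skip rows with no parent
        df[df['Tag'] == name].to_csv('%s.csv' % name, index=False)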

How to preserve the column ordering when accessing a multi-index dataframe using `.loc`?

Let's be given the following dataframe with multi-index columns
import numpy as np
import pandas as pd
a = ['i', 'ii']
b = list('abc')
mi = pd.MultiIndex.from_product([a,b])
df = pd.DataFrame(np.arange(100,100+len(mi)*3).reshape([-1,len(mi)]),
columns=mi)
print(df)
# i ii
# a b c a b c
# 0 100 101 102 103 104 105
# 1 106 107 108 109 110 111
# 2 112 113 114 115 116 117
Using .loc[] and pd.IndexSlice I try to select the columns 'c' and 'b', in that very ordering.
idx = pd.IndexSlice
df.loc[:, idx[:, ['c','b']]]
However, if I look at the output, the requested ordering is not respected!
# i ii
# b c b c
# 0 101 102 104 105
# 1 107 108 110 111
# 2 113 114 116 117
Here are my questions:
Why is the ordering not preserved by pandas? I consider this pretty dangerous, because the list ['c', 'b'] implies an ordering from a user point of view.
How to access the columns via loc[] while preserving the ordering at the same time?
Update: (02.02.2020)
The issue has been identified as a pandas bug. In the process of fixing it, this related issue has been identified, which addresses a semantic ambiguity for expressions like df.loc[:, pd.IndexSlice[:, ['c','b']]].
In the meantime, the problem can be circumvented using the approach described in the accepted answer.
Quoting from this link:
I don't think we make guarantees about the order of returned values
from a .loc operation so I am inclined to say this is not a bug but
let's see what others say
So we should be using reindex instead:
df.reindex(columns=pd.MultiIndex.from_product([a,['c','b']]))
i ii
c b c b
0 102 101 105 104
1 108 107 111 110
2 114 113 117 116
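If you do not want to hard-code the first level, you could derive it from the existing columns; a small sketch of the same reindex idea:
# derive the first level from the existing columns instead of spelling it out
lvl0 = df.columns.get_level_values(0).unique()
df.reindex(columns=pd.MultiIndex.from_product([lvl0, ['c', 'b']]))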

CSV value mapping from 2 files like map in pandas

I have two csv files that I created with Python from unstructured data, but I don't want my script to output two files once I run the script on a json. So let's say I have a file A with columns as follows:
File 1:
feats ID A B C E
AA 123 3343 234 2342 112
BB 121 3342 237 2642 213
CC 122 3341 232 2352 912
DD 123 3343 233 5342 12
EE 121 3345 235 2442 2112
...and so on with, let's say, 10000 rows of different values and 6 columns. Now I want to check these values of column "ID" against file 2 and merge on the values of ID.
File 2:
Char_Name ID Cosmic Awareness
Uatu 123 3.4
Galan 121 4.5
Norrin Radd 122 1.6
Shalla-bal 124 0.3
Nova 125 1.2
This file 2 has only 5 rows, with 5 different ID values and, let's say, 23 columns. I can do this easily with map or apply in pandas, but I'm dealing with thousands of files and don't want to do that. Is there any way of mapping the file 2 values (name and cosmic awareness columns) to File 1 by adding new columns titled 'name' and 'cosmic' (from file 2), matching the corresponding ID values in File 1 and File 2? The expected output should be somewhat like this.
Final File:
feats ID A B C E Char_Name Cosmic Awareness
AA 123 3343 234 2342 112 Uatu 3.4
BB 121 3342 237 2642 213 Galan 4.5
CC 122 3341 232 2352 912 Norrin Radd 1.6
DD 123 3343 233 5342 12 Uatu 3.4
EE 121 3345 235 2442 2112 Galan 4.5
Thanks in advance, and if there is any way to improve this question, suggestions are welcome. I will incorporate them here. I have added the expected outcome above.
I think you need glob to collect all the file names and then create the DataFrames in a list comprehension:
from functools import reduce
import glob
import pandas as pd

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
Last, merge them all together:
df = reduce(lambda left,right: pd.merge(left,right,on='ID'), dfs)
For an outer join it is possible to use concat:
import glob
import pandas as pd

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['ID']) for fp in files]
df = pd.concat(dfs, axis=1)
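For just the two example files above, the same idea reduces to a single left merge on ID; a minimal sketch, assuming the files are saved as file1.csv and file2.csv:
import pandas as pd

df1 = pd.read_csv('file1.csv')  # feats, ID, A, B, C, E
df2 = pd.read_csv('file2.csv')  # Char_Name, ID, Cosmic Awareness
# keep every row of file 1 and pull in the matching Char_Name / Cosmic Awareness from file 2
final = df1.merge(df2, on='ID', how='left')
final.to_csv('final.csv', index=False)  # hypothetical output file name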
