Read data from txt file from a specific pattern in text - python

I need to extract data into dataframe from a text file, after a specific word was found
Here is an example of text file:
Signal:
197 198 180 140
X_values:
11 26 8 15
14 17 12 10
** Exe Trs:
[115] time1: 1 (42 ms) - tou(): 4.7 - JH: 5 (B: 0)
[230] time2: 2 (87 ms) - tou(): 3.0 - Am: 5 (B: 0)
What I want is to have two dataframes created as below:
df1:
Signal X_values1 X_values2
197 11 14
198 26 17
180 8 12
140 15 10
df2:
order time(ms)
115 42
230 87

A quick solution is to use a regex if your files always follow the same order of the fields. If you want something more efficient/robust, you need to write a custom parser.
import re
import numpy as np
import pandas as pd
with open('your_file', 'r') as f:
text = f.read()
signal, values, exec_t = re.search('Signal:\n([\d ]+)\nX_values:\n([\d\n ]+)\n\*\* Exe Trs:\n([\w\W]*)',
text).groups()
signal = re.split('\s+', signal.strip())
values = re.split('\s+', values.strip())
exec_t = re.findall('\[(\d+)][^(]+\((\d+) ms', exec_t)
df1 = pd.DataFrame(np.array(values).reshape(-1, len(signal)).T,
index=signal,
columns=['X_values%s' % (i+1) for i in range(len(values)//len(signal))]
).rename_axis('Signal').reset_index()
df2 = pd.DataFrame(exec_t, columns=['order', 'time(ms)'])
output:
>>> df1
Signal X_values1 X_values2
0 197 11 14
1 198 26 17
2 180 8 12
3 140 15 10
>>> df2
order time(ms)
0 115 42
1 230 87

Related

Set a column to one date format Pandas

I am filtering out records for last month data records, however when doing
emp_df = emp_df[emp_df['Date'].dt.month == (currentMonth-1)]
It neglects some records(treats some records months as days).Link to File
from datetime import datetime, date
import pandas as pd
import numpy as np
cholareport = pd.read_excel("D:/Automations/HealthCheck and Audit Trail/report.xlsx")
uniqueemp = set(cholareport['Email'])
cholareport['Date'] = pd.to_datetime(cholareport['Date'])
uniqueemp = set(cholareport['Email'])
daystoignore = ['Holiday_COE', 'Leave_COE']
# datedfforemp = pd.DataFrame(columns=uniqueemp)
cholareport['Date'] = cholareport['Date'].apply(lambda x:
pd.to_datetime(x).strftime('%d/%m/%Y'))
cholareport["Date"] = pd.to_datetime(cholareport["Date"], utc=True)
for emp in uniqueemp:
emp_df = cholareport[cholareport['Email'].isin([emp])]
emp_df = emp_df[~emp_df['Task: Task Name'].isin(daystoignore)]
# s1 = pd.to_datetime(emp_df['Date']).dt.strftime('%Y-%m')
# s2 = (pd.to_datetime('today').strftime('%Y-%m') -pd.DateOffset(months=1)).strftime('%Y-%m')
# emp_df = emp_df[s1 == s2]
currentMonth = datetime.now().month
# print(currentMonth)
# print(emp_df['Date'])
emp_df['Date'] = pd.to_datetime(emp_df['Date']).dt.strftime("%dd-%mm-%YYYY")
format_data = "%dd-%mm-%YYYY"
empdfdate = []
for i in emp_df['Date']:
empdfdate.append(datetime.strptime(i,format_data))
print(empdfdate)
emp_df['Date'] = empdfdate
for i in emp_df['Date']:
print(i.month, i.day)
# emp_df['Date'] = pd.to_datetime(emp_df['Date']).dt.strftime('%Y-%m')
emp_df = emp_df[emp_df['Date'].dt.month == (currentMonth-1)]
for i in emp_df['Date']:
print(i.month, i.day)
Results :
6 10
7 10
10 10
11 10
12 10
10 13
10 14
Expected:
6 10
7 10
10 10
11 10
12 10
13 10
14 10
I am not entirely sure what you want to accomplish. If I understand it correctly, you simply want to count the number of entries per day for the past month. In such case, you can simply do the following.
from datetime import datetime
import pandas as pd
report = pd.read_excel('report.xlsx')
print('day: counts', report.Date[report.Date.dt.month == datetime.now().month - 1].dt.day.value_counts(), sep='\n')
I do not get your expected results. It might be that you also want to filter by email somehow; however, I cannot understand from your code what it is that you want to do.
Output:
day: counts
3 101
5 101
6 101
7 101
4 101
24 84
28 84
27 84
26 84
25 84
10 82
11 82
12 82
13 82
14 82
17 67
21 67
20 67
19 67
18 67
31 2
Name: Date, dtype: int64

Replace blank value in dataframe based on another column condition

I have many blanks in a merged data set and I want to fill them with a condition.
My current code looks like this
import pandas as pd
import csv
import numpy as np
pd.set_option('display.max_columns', 500)
# Read all files into pandas dataframes
Jan = pd.read_csv(r'C:\~\Documents\Jan.csv')
Feb = pd.read_csv(r'C:\~\Documents\Feb.csv')
Mar = pd.read_csv(r'C:\~\Documents\Mar.csv')
Jan=pd.DataFrame({'Department':['52','5','56','70','7'],'Item':['2515','254','818','','']})
Feb=pd.DataFrame({'Department':['52','56','765','7','40'],'Item':['2515','818','524','','']})
Mar=pd.DataFrame({'Department':['7','70','5','8','52'],'Item':['45','','818','','']})
all_df_list = [Jan, Feb, Mar]
appended_df = pd.concat(all_df_list)
df = appended_df
df.to_csv(r"C:\~\Documents\SallesDS.csv", index=False)
Data set:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7
40
7 45
70
5 818
8
52
What I want is to fill the empty cells in Item with a correspondent values of the Department column.
So If Department is 52 and Item is empty it should be filled with 2515
Department 7 and Item is empty fill it with 45
and the result should look like this
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7 45
40
7 45
70
5 818
8
52 2515
I tried the following method but non of them worked.
1
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(52)), 'Item'] = 2515
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(7)), 'Item'] = 45
2
df["Item"] = df["Item"].fillna(df["Department"])
df = df.replace({"Item":{"52":"2515", "7":"45"}})
both ethir return error or do not work
Answer:
Hi I have used the below code and it worked
b = [52]
df.Item=np.where(df.Department.isin(b),df.Item.fillna(2515),df.Item)
a = [7]
df.Item=np.where(df.Department.isin(a),df.Item.fillna(45),df.Item)
Hope it helps someone who face the same issue
The following solution first creates a map of each department and it's maximum corresponding item (assuming there is one), and then matches that item to a department with a blank item. Note that in your data frame, the empty items are an empty string ("") and not NaN.
Create a map:
values = df.groupby('Department').max()
values['Item'] = values['Item'].apply(lambda x: np.nan if x == "" else x)
values = values.dropna().reset_index()
Department Item
0 5 818
1 52 2515
2 56 818
3 7 45
4 765 524
Then use df.apply():
df['Item'] = df.apply(lambda x: values[values['Department'] == x['Department']]['Item'].values if x['Item'] == "" else x['Item'], axis=1)
In this case, the new values will have brackets around them. They can be removed with str.replace():
df['Item'] = df['Item'].astype(str).str.replace(r'\[|\'|\'|\]', "", regex=True)
The result:
Department Item
0 52 2515
1 5 254
2 56 818
3 70
4 7 45
0 52 2515
1 56 818
2 765 524
3 7 45
4 40
0 7 45
1 70
2 5 818
3 8
4 52 2515
Hi I have used the below code and it worked
b = [52]
df.Item=np.where(df.Department.isin(b),df.Item.fillna(2515),df.Item)
a = [7]
df.Item=np.where(df.Department.isin(a),df.Item.fillna(45),df.Item)
Hope it helps someone who face the same issue

How to get max value from second column & min value from third column in CSV file with no row header in Python

How to get the max value from the second column and min value from the third column in CSV file with no row headers as per the screenshot of DataFrame through defining a function?
My code is:
import pandas as pd
def minmaxvalue(filename):
# some code
minmaxvalue("my_data.cvs")
How to get the max&min value between the defining function?
i a b
1 33 99
2 35 100
3 37 101
4 39 102
5 41 103
6 43 104
7 45 105
8 47 106
9 49 107
10 51 108
11 53 109
12 55 110
13 57 111
14 59 112
15 61 113
import pandas as pd
def minmaxvalue(filename):
# reading from file
df = pd.read_csv(filename, names=['a', 'b'])
# returning max and min
return df['a'].max(), df['b'].min()
minmaxvalue("my_data.csv")
One way is this:
def minmaxvalue(filename):
minim = filename['a'][0]
maxim = filename['b'][0]
for i in range(0, len(filename)):
if minim > filename['a'][i]:
minim = filename['a'][i]
if maxim < filename['b'][i]:
maxim = filename['b'][i]
return minim, maxim

compare 2 dataframe with pandas

It is the first time I use pandas and I do not really know how to deal with my problematic.
In fact I have 2 data frame:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an exemple of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
if you stay focus on the cluster database, the first column correspond to the cluster ID and inside those clusters there are several sequences ID.
What I need to to is first to split all my cluster (in R it would be like: liste=split(x = data$V2, f = data$V1) )
And then, creat a function which displays the most similarity paires sequence within each cluster.
here is an exemple:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
On the blast dataframe there is on the 3th column the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account the sequences paired by themselves
IF someone could give me some clues it would be fantastic.
Thank you all.
Firstly I assume that there are no Pairings in 'blast' with sequences from two different Clusters. In other words: in this solution the cluster-ID of a pairing will be evaluated by only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings which have the same substrings in their seqid, the most readable way would be to add data columns with these data:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec-values can be filtered the same way like it was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here - except the above mentioned assumption - is, that if there are several exactly equal max-values, only one of them would be taken into account.
Note: if you don't want the spec-columns to be of type string, you could easiliy turn them into integers on the fly by:
import numpy as np
data['qspec'] = [np.int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
This merges the dataframes based first on sseqid, then on qseqid, and then returns results_df. Any with 100% match are filtered out. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names',right_on='sseqid')
results_df = results_df.append(cluster.merge(blast, left_on='seq_names',right_on='qseqid'))

Pandas dataframe with multiindex column - merge levels

I have a dataframe, grouped, with multiindex columns as below:
import pandas as pd
codes = ["one","two","three"];
colours = ["black", "white"];
textures = ["soft", "hard"];
N= 100 # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
'code' : [random.choice(codes) for i in range(1,N+1)],
'colour': [random.choice(colours) for i in range(1,N+1)],
'texture': [random.choice(textures) for i in range(1,N+1)],
'size': [random.randint(1,100) for i in range(1,N+1)],
'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
}, columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])
grouped = df.groupby(['code', 'colour']).agg( {'size': [np.sum, np.average, np.size, pd.Series.idxmax],'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()
>> grouped
code colour size scaled_size
sum average size idxmax sum average size idxmax
0 one black 1031 60.647059 17 81 185.153944 10.891408 17 47
1 one white 481 37.000000 13 53 204.139249 15.703019 13 53
2 three black 822 48.352941 17 6 123.269405 7.251141 17 31
3 three white 1614 57.642857 28 50 285.638337 10.201369 28 37
4 two black 523 58.111111 9 85 80.908912 8.989879 9 88
5 two white 669 41.812500 16 78 82.098870 5.131179 16 78
[6 rows x 10 columns]
How can I flatten/merge the column index levels as: "Level1|Level2", e.g. size|sum, scaled_size|sum. etc? If this is not possible, is there a way to groupby() as I did above without creating multi-index columns?
There is potentially a better way, more pythonic way to flatten MultiIndex columns.
1. Use map and join with string column headers:
grouped.columns = grouped.columns.map('|'.join).str.strip('|')
print(grouped)
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 862 53.875000 16 14
1 one white 554 46.166667 12 18
2 three black 842 49.529412 17 90
3 three white 740 56.923077 13 97
4 two black 1541 61.640000 25 50
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 6980 436.250000 16 77
1 6101 508.416667 12 13
2 7889 464.058824 17 64
3 6329 486.846154 13 73
4 12809 512.360000 25 23
2. Use map with format for column headers that have numeric data types.
grouped.columns = grouped.columns.map('{0[0]}|{0[1]}'.format)
Output:
code| colour| size|sum size|average size|size size|idxmax \
0 one black 734 52.428571 14 30
1 one white 1110 65.294118 17 88
2 three black 930 51.666667 18 3
3 three white 1140 51.818182 22 20
4 two black 656 38.588235 17 77
5 two white 704 58.666667 12 17
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 8229 587.785714 14 57
1 8781 516.529412 17 73
2 10743 596.833333 18 21
3 10240 465.454545 22 26
4 9982 587.176471 17 16
5 6537 544.750000 12 49
3. Use list comprehension with f-string for Python 3.6+:
grouped.columns = [f'{i}|{j}' if j != '' else f'{i}' for i,j in grouped.columns]
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 1003 43.608696 23 76
1 one white 1255 59.761905 21 66
2 three black 777 45.705882 17 39
3 three white 630 52.500000 12 23
4 two black 823 54.866667 15 33
5 two white 491 40.916667 12 64
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 12532 544.869565 23 27
1 13223 629.666667 21 13
2 8615 506.764706 17 92
3 6101 508.416667 12 43
4 7661 510.733333 15 42
5 6143 511.916667 12 49
you could always change the columns:
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]
Based on Scott Boston's answer,
little update(it will be work for 2 or more levels column):
temp.columns.map(lambda x: '|'.join([str(i) for i in x]))
Thank you, Boston!
Full credit to suraj's concise answer: https://stackoverflow.com/a/72616083/317797
df.columns = df.columns.map('_'.join)

Categories