DataFrame from string that looks like a table - python

I have a problem with creating a DataFrame from a string that looks like a table. Specifically, I want to create the same table as in my data. This is my data, and below is my code:
0 2017 IX 2018 X 2018 X 2018 X 2018
0 2017 IX 2018 0 2017 IX 2018
UKUPNO 1.053 1.075 1.093 103,8 101,7 1.633 1.669 1.701 104,2 101,9
A Poljoprivreda, šumarstvo i ribolov 907 888 925 102,0 104,2 1.394 1.356 1.420 101,9 104,7
B Vađenje ruda i kamena 913 919 839 91,9 91,3 1.395 1.406 1.297 93,0 92,2
C Prerađivačka industrija 769 764 775 100,8 101,4 1.176 1.169 1.187 100,9 101,5
D Proizvodnja i snabdijevanje 1.574 1.570 1.647 104,6 104,9 2.459 2.455 2.579 104,9 105,1
električnom energijom, plinom,
parom i klimatizacija
E Snabdijevanje vodom; uklanjanje 956 973 954 99,8 98,0 1.462 1.491 1.462 100,0 98,1
otpadnih voda, upravljanje otpadom
import io
import pandas as pd

TESTDATA = io.StringIO(''' ''')  # the table text above goes here
df = pd.read_csv(TESTDATA, sep='delimiter', header=None, engine='python')
When I run my code, I get this DataFrame:
0 Prosječna neto plaća ...
1 u KM ...
2 Index Index ...
3 0 2017 IX 2018 X 2018 X 2018 ...
4 0 2017 IX 2018 ...
5 UKUPNO ...
6 A Poljoprivreda, šumarstvo i ribolov ...
7 B Vađenje ruda i kamena ...
8 C Prerađivačka industrija ...
9 D Proizvodnja i snabdijevanje ...
10 električnom energijom, plinom,
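No answer is shown for this thread, but since the columns in the pasted text are aligned in fixed-width fields rather than separated by a single delimiter, pd.read_fwf is a better fit than read_csv with a dummy separator. A minimal sketch under that assumption, using a hypothetical two-row sample built in the same layout:

import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the report above,
# padded so the label column is 38 characters wide and each number 6.
rows = [
    ("UKUPNO", "1.053", "1.075", "1.093", "103,8", "101,7"),
    ("A Poljoprivreda, šumarstvo i ribolov", "907", "888", "925", "102,0", "104,2"),
]
text = "\n".join(f"{r[0]:<38}" + "".join(f"{v:>6}" for v in r[1:]) for r in rows)

# read_fwf slices each line at fixed positions, so multi-word labels stay
# in one column; thousands='.' and decimal=',' parse the European numbers.
df = pd.read_fwf(io.StringIO(text), widths=[38, 6, 6, 6, 6, 6],
                 header=None, thousands='.', decimal=',')
print(df)

The wrapped description lines (e.g. "električnom energijom, plinom,") would still need to be merged into the row above before or after parsing.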

Web scraping with Python and Pandas - Pagination

With this short code I can get data from the table:
import pandas as pd
df=pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)
df[0].to_csv('2023_I_M_800.csv')
I am trying to get data from all pages, or a set number of them, but since this website doesn't use ul or li elements I don't know exactly how to build it.
Any help or idea would be appreciated.
Since the url contains the page number, why not just make a loop and concat?
import pandas as pd

F, L = 1, 4  # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv') # <- uncomment this line to make a .csv
NB: you can access each sub_df separately by using key-indexing notation, e.g. dico[2] for page 2.
Output :
print(out)
page_number Rank ... Date Results Score
0 1 1 ... 22 JAN 2023 1230
1 1 2 ... 22 JAN 2023 1204
2 1 3 ... 29 JAN 2023 1204
3 1 4 ... 27 JAN 2023 1192
4 1 5 ... 28 JAN 2023 1189
.. ... ... ... ... ...
395 4 394 ... 21 JAN 2023 977
396 4 394 ... 28 JAN 2023 977
397 4 398 ... 27 JAN 2023 977
398 4 399 ... 28 JAN 2023 977
399 4 399 ... 29 JAN 2023 977
[400 rows x 11 columns]
Try this:
for page in range(1, 10):
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')

problem merging list and dataframe in Python

I have CSV files that I want to merge with a list of structs (a class) I made.
In the CSV I have a field 'sector' and another field with information about this sector.
The list's element type is a class I made with fields: name, x, y, where x, y is the location that belongs to this name.
This is how I defined the list (I generated it from a CSV file as well, in which each antenna appears many times with different parameters, so I extracted only the ones I need):
# ant_file is the CSV with all the antennas, ant_list_name is the list with
# only antenna names and ant_list_tot is the list with the name and also x, y
# fields
for rowA in range(size_ant_file):
    rec = ant_file.iloc[rowA]['name']
    if rec not in ant_list_name:
        ant_list_name.append(rec)
        A = Antenna(ant_file.iloc[rowA]['name'], ant_file.iloc[rowA]['x'],
                    ant_file.iloc[rowA]['y'])
        ant_list_tot.append(A)
print(antenna_list)
[Antenna(name='UWE33', x=34.9, y=31.9), Antenna(name='UTN00', x=34.8,
y=32.1), Antenna(name='UWE02', x=34.8, y=32.1)]
I tried to do it with a double for loop:
from dataclasses import dataclass

@dataclass
class Antenna:
    name: str
    x: float
    y: float

# records is the csv file and antenna_list is the list of type Antenna
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
The resulting CSV file is partly right: at the end there are rows where all fields are correct except the x and y fields, which are 0, and some rows with x and y values but without the information from the original fields.
It seems like there is a big shift of rows but I can't understand why.
I checked that there are no missing values.
example:
records.csv at the beginning (date, hour and user_id are random numbers and not important):
sector date hour user_id x y
abc 1.1.19 20:00 123 0 0
dfs 5.8.17 12:40 876 0 0
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
antenna_list in the form of (name, x, y) (also here, x and y are random numbers right now and not important):
antenna_list[0] = (abc,12,16)
antenna_list[1] = (dfs,6,20)
antenna_list[2] = (ngh,13,98)
antenna_list[3] = (yjt,18,41)
the result I want to see is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 13 98
yjt 10.10.16 17:18 492 18 41
abc 6.8.16 22:10 985 12 16
dfs 7.1.15 19:15 542 6 20
but the real result is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
13 98
18 41
12 16
6 20
TIA
If you save antenna_list as two dicts,
antenna_dict_x = {'abc':12, 'dfs':6, 'ngh':13, 'yjt':18}
antenna_dict_y = {'abc':16, 'dfs':20, 'ngh':98, 'yjt':41}
then creating the two columns is an easy map:
data['x']=data['sector'].map(antenna_dict_x)
data['y']=data['sector'].map(antenna_dict_y)
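If the antennas are already collected in antenna_list, the two dicts don't have to be typed by hand; a small sketch building them with dict comprehensions (assuming the Antenna objects from the question, and records as the dataframe to fill):

# Build the name -> coordinate lookups once from the Antenna objects,
# then map them onto the 'sector' column in a single vectorized pass.
antenna_dict_x = {a.name: a.x for a in antenna_list}
antenna_dict_y = {a.name: a.y for a in antenna_list}

records['x'] = records['sector'].map(antenna_dict_x)
records['y'] = records['sector'].map(antenna_dict_y)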
So if you do:
import pandas as pd

class Antenna():
    def __init__(self, name, x, y):
        self.name = name
        self.x = x
        self.y = y

antenna_list = [Antenna('abc',12,16), Antenna('dfs',6,20), Antenna('ngh',13,98), Antenna('yjt',18,41)]
records = pd.read_csv('something.csv')

for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is what you were expecting. Also, if you do:
import pandas as pd
from dataclasses import dataclass

@dataclass
class Antenna:
    name: str
    x: float
    y: float

antenna_list = [Antenna('abc',12,16), Antenna('dfs',6,20), Antenna('ngh',13,98), Antenna('yjt',18,41)]
records = pd.read_csv('something.csv')

for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is, again, what you were expecting. You did not post how you created the antenna list, but I assume that is where your error is.

Parsing and creating a new df with conditions

I need some help with Python and pandas.
I have a dataframe with, in the column seq1_id, all the seq ids of the sequences of species 1, and in the second column those of sp2.
I passed a filter on those sequences and got two dataframes (one with all the sp1 sequences that passed the filter, and one with all the sp2 sequences that passed the filter).
So I have 3 dataframes.
Because within a pair one seq can pass the filter while the other does not, it is important to keep only the pairs in which both genes survived the two previous filterings. So what I need to do is parse my first df, such as this one:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
and check row by row if (e.g. for the first row) seq1_A is present in df2 and seq8_B is also present in df3; if so, keep this row of df1 and add it to a new df4.
Here is an example with output wanted:
first df:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
df2 (sp1) (seq3_A is absent)
Seq_1.id
seq1_A
seq2_A
seq4_A
df3 (sp2) (Seq11_B is absent)
Seq_2.id
seq8_B
Seq9_B
Seq10_B
Then because Seq11_B and seq3_A are not present, the df4 (output) would be:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
candidates_0035 = pd.read_csv("candidates_genes_filtering_0035", sep='\t')
candidates_0042 = pd.read_csv("candidates_genes_filtering_0042", sep='\t')
dN_dS = pd.read_csv("dn_ds.out_sorted", sep='\t')
df4 = dN_dS[dN_dS['seq1_id'].isin(candidates_0042['gene']) & dN_dS['seq2_id'].isin(candidates_0035['gene'])]
and I got an empty output, with only the column names, but it should not be like that.
Here are the data if you want to test the code on them:
df1:
Unnamed: 0 seq1_id seq2_id dN dS Dist_third_pos Dist_brute Length_seq_1 Length_seq_2 GC_content_seq1 GC_content_seq2 GC Mean_length
0 0 g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104 0.23600000000000002 0.142 535.0 1024.0 49.1588785046729 51.171875 50.165376752336456 535.0
1 1 g45594.t1_0035_0035 g1464.t1_0042_0042 0.5208761055250978 5.430485421797574 0.7120000000000001 0.489 246.0 222.0 47.967479674796756 44.594594594594604 46.28103713469567 222.0
2 2 g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867 0.262 0.139 895.0 749.0 56.312849162011176 57.67690253671562 56.994875849363396 749.0
3 3 g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516 26.834927363887587 0.5760000000000001 0.433 597.0 633.0 37.85594639865997 39.810426540284354 38.83318646947217 597.0
4 4 g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165 0.1181241496387666 0.1 0.069 450.0 461.0 45.111111111111114 44.90238611713666 45.006748614123886 450.0
5 5 g1005.t1_0035_0035 g11524.t1_0042_0042 0.3528036631463747 19.32549458735676 0.71 0.512 3177.0 3804.0 39.06200818382121 52.944269190325976 46.0031386870736 3177.0
6 6 g28456.t1_0035_0035 g31669.t1_0042_0035 0.4608959702286786 26.823981621115166 0.6859999999999999 0.469 516.0 591.0 49.224806201550386 53.46869712351946 51.346751662534935 516.0
7 7 g6202.t1_0035_0035 g193.t1_0042_0042 0.4679458383555545 17.81312422445775 0.66 0.462 804.0 837.0 41.91542288557214 47.67025089605735 44.79283689081474 804.0
8 8 g60667.t1_0035_0035 g14327.t1_0042_0042 0.046056273155280165 0.13320612138898 0.122 0.067 348.0 408.0 56.89655172413793 55.392156862745104 56.1443542934415 348.0
9 9 g30148.t1_0035_0042 g37790.t1_0042_0035 0.05631607180881047 0.19747150378706246 0.12300000000000001 0.08800000000000001 405.0 320.0 59.012345679012356 58.4375 58.72492283950618 320.0
10 10 g24481.t1_0035_0035 g37405.t1_0042_0035 0.2151957757290965 0.15106487998618026 0.135 0.17600000000000002 270.0 276.0 51.111111111111114 51.44927536231884 51.28019323671497 270.0
11 11 g33270.t1_0035_0035 g21201.t1_0042_0035 0.2773062983971916 21.13839474189674 0.6940000000000001 0.401 297.0 357.0 54.882154882154886 50.42016806722689 52.65116147469089 297.0
12 12 EOG090X03YJ_0035_0035_1 EOG090X03YJ_0042_0042_1 0.5402471721616758 19.278839157918302 0.7070000000000001 0.488 1321.0 1719.0 38.53141559424678 43.92088423502036 41.22614991463357 1321.0
13 13 g13075.t1_0035_0042 g504.t1_0042_0035 0.3317504066721263 4.790120127840871 0.65 0.38799999999999996 372.0 408.0 59.40860215053763 51.470588235294116 55.43959519291587 372.0
14 14 g1026.t1_0035_0035 g7716.t1_0042_0042 0.21445770772761286 13.92799368027682 0.626 0.344 336.0 315.0 38.095238095238095 44.444444444444436 41.26984126984127 315.0
15 15 g18238.t1_0035_0042 g35401.t1_0042_0035 0.3889830456691637 20.33679494952895 0.6759999999999999 0.44799999999999995 320.0 366.0 50.9375 49.453551912568315 50.19552595628416 320.0
df2:
Unnamed: 0 gene scaf_name start end cov_depth GC
179806 g13600.t1_0042_0042 scaffold_6556 1 1149 2.42361684558216 0.528846153846154
315037 g34744.t1_0042_0035 scaffold_8076 17 765 3.49803921568627 0.386138613861386
317296 g35222.t1_0042_0035 scaffold_9018 1 614 93.071661237785 0.41
183513 g14327.t1_0042_0042 scaffold_9358 122 529 3.3184165232357996 0.36
328164 g37790.t1_0042_0035 scaffold_16356 1 320 2.73125 0.436241610738255
326617 g37405.t1_0042_0035 scaffold_14890 1 341 1.3061224489795902 0.36898395721925104
188515 g15510.t1_0042_0042 scaffold_20183 1 276 137.326086956522 0.669354838709677
184561 g14562.t1_0042_0042 scaffold_10427 1 494 157.993927125506 0.46145940390544704
290684 g30982.t1_0042_0035 scaffold_3800 440 940 174.499839537869 0.39823008849557506
179993 g13632.t1_0042_0042 scaffold_6654 29 1114 3.56506849315068 0.46153846153846206
181670 g13942.t1_0042_0042 scaffold_7830 1 811 5.307028360049321 0.529411764705882
196148 g20290.t1_0042_0035 scaffold_1145 2707 9712 78.84112231766741 0.367283950617284
313624 g34464.t1_0042_0035 scaffold_7610 1 480 7.740440324449589 0.549019607843137
303133 g32700.t1_0042_0035 scaffold_5119 1735 2373 118.436578171091 0.49074074074074103
df3:
Unnamed: 0 gene scaf_name start end cov_depth GC
428708 g66097.t1_0035_0035 scaffold_306390 1 695 32.2431654676259 0.389880952380952
342025 g50055.t1_0035_0035 scaffold_188566 15 954 7.062893081761009 0.351129363449692
214193 g28436.t1_0035_0042 scaffold_231066 1 842 25.9774346793349 0.348837209302326
400337 g60667.t1_0035_0035 scaffold_261197 309 656 15.873529411764698 0.353846153846154
224023 g30148.t1_0035_0042 scaffold_263686 10 414 23.2072538860104 0.34108527131782895
184987 g24481.t1_0035_0035 scaffold_65047 817 1593 27.7840552416824 0.533898305084746
249413 g34492.t1_0035_0035 scaffold_106432 1 511 3.2482544608223396 0.368318122555411
249418 g34493.t1_0035_0035 scaffold_106432 547 1230 3.2482544608223396 0.368318122555411
12667 g1120.t1_0035_0042 scaffold_2095 2294 2794 47.864745898359295 0.56203288490284
252797 g35042.t1_0035_0035 scaffold_108853 274 1276 20.269592476489 0.32735426008968604
255878 g36112.t1_0035_0042 scaffold_437464 1 540 74.8252551020408 0.27884615384615397
40058 g4082.t1_0035_0042 scaffold_11195 579 1535 33.4396168320219 0.48487467588591204
271053 g39343.t1_0035_0042 scaffold_590976 1 290 19.6666666666667 0.38636363636363596
89911 g10947.t1_0035_0035 scaffold_21433 1735 2373 32.4222503160556 0.408571428571429
This should do it:
df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id'])&df1['Seq_2.id'].isin(df3['Seq_2.id'])]
df4
# Seq_1.id Seq_2.id
#0 seq1_A seq8_B
#1 seq2_A Seq9_B
EDIT
You must have swapped the two candidate dataframes; this does not return empty:
df4 = dN_dS[(dN_dS['seq1_id'].isin(candidates_0035['gene']))&(dN_dS['seq2_id'].isin(candidates_0042['gene']))]
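For what it's worth, the same filter can also be written as two inner merges, which keeps the pairing of seq columns and candidate frames explicit; a sketch assuming the column names shown above:

# Each inner merge keeps only the rows whose id appears in the matching
# candidate frame; the helper 'gene' columns are dropped afterwards.
df4 = (dN_dS
       .merge(candidates_0035[['gene']], left_on='seq1_id', right_on='gene')
       .merge(candidates_0042[['gene']], left_on='seq2_id', right_on='gene')
       .drop(columns=['gene_x', 'gene_y']))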

python error when finding count of cells where value was found

I have the below code on toy data, which works the way I want. The last 2 columns provide how many times the value in column Jan was found in column URL, and in how many distinct rows the value in column Jan was found in column URL.
import pandas as pd

sales = [{'account': '3', 'Jan': 'xxx', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf try bbbbb why try'},
         {'account': '1', 'Jan': 'try', 'Feb': '210', 'URL': ''},
         {'account': '2', 'Jan': 'bbbbb', 'Feb': '90', 'URL': 'ea2017-104.pdf bbbbb cc for why try'}]
df = pd.DataFrame(sales)
df
df['found_in_column'] = df['Jan'].apply(lambda x: ''.join(df['URL'].tolist()).count(x))
df['distinct_finds'] = df['Jan'].apply(lambda x: sum(df['URL'].str.contains(x)))
Why does the same code fail in the last case? How could I change my code to avoid the error? In my last example there are special characters in the first column, and I felt that they were causing the problem. But when I look at the rows with index 3 and 4, they have special characters too and the code runs fine.
answer2=answer[['Value','non_repeat_pdf']].iloc[0:11]
print(answer2)
Value non_repeat_pdf
0 effect\nive Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 closing ####
2 executing ####
3 order, ####
4 waives: ####
5 right ####
6 notice ####
7 intention ####
8 prohibit ####
9 further ####
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[220]:
0 1
1 0
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 0
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[10:11]
print(answer2)
Value non_repeat_pdf
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[212]:
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[11:12]
print(answer2)
Value non_repeat_pdf
11 1818(e); ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Traceback (most recent call last):
File "<ipython-input-215-2df7f4b2de41>", line 1, in <module>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2355, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1574, in pandas._libs.lib.map_infer
File "<ipython-input-215-2df7f4b2de41>", line 1, in <lambda>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 1562, in contains
regex=regex)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 254, in str_contains
stacklevel=3)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
update
I modified my code and removed all special characters from the Value column. I am still getting the error... what could be wrong?
Even with the error, the new column gets added to my answer2 dataframe.
answer2=answer[['Value','non_repeat_pdf']]
print(answer2)
Value non_repeat_pdf
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
[1582 rows x 2 columns]
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
Traceback (most recent call last):
File "<ipython-input-298-4dc80361895c>", line 1, in <module>
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
self._set_item(key, value)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2404, in _set_item
self._check_setitem_copy()
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1873, in _check_setitem_copy
warnings.warn(t, SettingWithCopyWarning, stacklevel=stacklevel)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
update 2
The below works:
answer2=answer[['Value','non_repeat_pdf']]
xyz= answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
xyz=xyz.to_frame()
xyz.columns=['found_in_all_PDF']
pd.concat([answer2, xyz], axis=1)
Out[305]:
Value non_repeat_pdf \
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
found_in_all_PDF
0 6
1 1
2 4
3 1036
4 9
5 93
6 4
7 2
8 1
9 2
10 6
11 1
12 0
13 1
14 3
15 1
16 0
17 25
18 20
19 3
20 14
21 4
22 358
23 2
24 1
25 2
26 6
27 1
28 1
29 3
...
1552 3
1553 2
1554 0
1555 5
1556 2
1557 3
1558 0
1559 2
1560 1
1561 5
1562 2
1563 7
1564 8
1565 3
1566 0
1567 1
1568 1
1569 4
1570 1
1571 9
1572 2
1573 2
1574 96
1575 1
1576 1
1577 1
1578 0
1579 0
1580 1
1581 0
[1582 rows x 3 columns]
Unfortunately I can't reproduce exactly the same error in my environment. But what I see is a warning about wrong regex usage. Your string was interpreted as a capturing regular expression because of the brackets in the string "1818(e);". Try using str.contains with regex=False.
answer2 =pd.DataFrame({'Value': {11: '1818(e);'}, 'non_repeat_pdf': {11: '####'}})
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x,regex=False)))
Output:
11 0
Name: Value, dtype: int64
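If regex matching is still wanted for the other rows, an alternative sketch is to escape the literal value instead of disabling regex entirely:

import re

# re.escape turns '1818(e);' into a literal pattern, so the parentheses
# no longer open a capturing group.
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(re.escape(x))))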

Reshape data frame (with R or Python)

I want to know if it's possible to have this result:
example:
With this data frame
df
y Faisceaux destination Trajet RED_Groupe Nbr observation RED Pond Nbr observation total RED pct
1 2015 France DOM-TOM Aller 78248.47 87 85586.75 307 0.9142591 0.04187815
2 2015 Hors Schengen Aller 256817.64 234 195561.26 1194 1.3132337 0.06015340
3 2015 INTERNATIONAL Aller 258534.78 473 288856.53 2065 0.8950283 0.04099727
4 2015 Maghreb Aller 605514.45 270 171718.14 1130 3.5262113 0.16152007
5 2015 NATIONAL Aller 361185.82 923 1082529.19 5541 0.3336500 0.01528302
6 2015 Schengen Aller 312271.06 940 505181.07 4190 0.6181369 0.02831411
7 2015 France DOM-TOM Retour 30408.70 23 29024.60 108 1.0476871 0.04798989
8 2015 Hors Schengen Retour 349805.15 225 168429.96 953 2.0768583 0.09513165
9 2015 INTERNATIONAL Retour 193536.63 138 99160.52 678 1.9517509 0.08940104
10 2015 Maghreb Retour 302863.83 110 41677.90 294 7.2667735 0.33285861
11 2015 NATIONAL Retour 471520.80 647 757258.33 3956 0.6226684 0.02852167
12 2015 Schengen Retour 307691.66 422 243204.76 2104 1.2651548 0.05795112
without using Excel.
With R or Python? I don't know if splitting a column like that is possible.
Thanks to all the comments; here is my solution:
I split my data frame into two data frames, df15 (with the 2015 data) and df16 (the 2016 data), then:
mytable15 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity),data=df15)
mytable16 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity),data=df16)
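For the Python half of the question, a rough pandas equivalent of that reshape is a pivot_table with Trajet as the column level; a sketch assuming df holds the data frame printed above, with the column names as shown there:

import pandas as pd

# One row per Faisceaux destination, one column block per Trajet
# (Aller / Retour), mirroring the tabular() layout from R's tables package.
wide = df.pivot_table(index='Faisceaux destination',
                      columns='Trajet',
                      values=['RED_Groupe', 'Pond', 'RED', 'pct'])
print(wide)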
