I have a .txt file with the data on the total number of queries with valid names. The text inside the file came from a SQL Server 2019 query output. The database consists of the results of an algorithm that retrieves the brands most similar to the inserted query. The file looks something like this:
2 16, 42, 44 A MINHA SAÚDE
3 34 !D D DUNHILL
4 33 #MEGA
5 09 (michelin man)
5 12 (michelin man)
6 33 *MONTE DA PEDRA*
7 35 .FOX
8 33 #BATISTA'S BY PITADA VERDE
9 12 #COM
10 41 + NATUREZA HUMANA
11 12 001
12 12 002
13 12 1007
14 12 101
15 12 102
16 12 104
17 37 112 PC
18 33 1128
19 41 123 PILATES
The 1st column has the Query identifier, the 2nd one has the brand classes where the Query can be located, and the 3rd one is the Query itself (the spaces come from the SQL Server output formatting).
I then made a Pandas DataFrame in Google Colaboratory where I wanted the columns to be like the ones in the text file. However, when I ran the code, the columns did not come out as expected.
The code that I wrote is here:
# DataFrame with the total number of queries with valid names:
df = pd.read_table(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    header=None,
    names=["Query ID", "Query Name", "Classes Where Query is Present"]
)
df
I think this happens because of the commas in the 2nd column, but I'm not quite sure. Any suggestions on why this is happening? I already tried read_csv and read_fwf, and they were even worse in terms of formatting.
You can use pd.read_fwf() in this case, as your columns have fixed widths. Note that, per your description, the 2nd column holds the classes and the 3rd the query name, so the column names below are given in that order:
import pandas as pd
df = pd.read_fwf(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    colspecs=[(0, 20), (20, 40), (40, 1000)],  # adjust to the real column widths in your file
    header=None,
    names=["Query ID", "Classes Where Query is Present", "Query Name"]
)
df.head()
# Query ID Classes Where Query is Present Query Name
# 0 2 16, 42, 44 A MINHA SAÚDE
# 1 3 34 !D D DUNHILL
# 2 4 33 #MEGA
# 3 5 09 (michelin man)
# 4 5 12 (michelin man)
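If the fixed widths in the real file differ, the colspecs will need adjusting (read_fwf can also try to infer them, since colspecs='infer' is the default). Alternatively, although read_csv behaved badly with its default separator, it may work when told to split on runs of two or more spaces; a sketch, assuming the real SQL Server output pads its columns with multiple spaces:

import pandas as pd

# assumes fields are separated by runs of 2+ spaces; the sample in the
# question collapses the original padding, so adjust the pattern as needed
df = pd.read_csv(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    sep=r"\s{2,}",
    engine="python",  # a regex separator requires the python engine
    header=None,
    names=["Query ID", "Classes Where Query is Present", "Query Name"]
)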
I have a dataset in a 2-dimensional array of arbitrary length, as shown below:
[['15,39' '17,43']
['23,40' '18,44']
['28,41' '18,45']
['28,42' '27,46']
['34,43' '26,47']
.
.
.
]
I want to turn it into a pandas DataFrame with columns and rows, as shown below:
15 39 17 43
23 40 18 44
28 41 18 45
28 42 27 46
34 43 26 47
.
.
.
Does anyone have an idea how to achieve this without saving the data out to files during the process?
My strategy is to first define a function that deals with the commas and quotes. Keeping in mind that your data is already a 2-dimensional NumPy array, I define the following function:
import numpy as np
import pandas as pd

def str_to_flt(lst):
    # split each "a,b" string on the comma and convert both halves to float
    tmp = np.array([[float(i.split(",")[0]), float(i.split(",")[1])] for i in lst])
    return tmp

df = pd.DataFrame(np.concatenate((str_to_flt(data[:, 0]), str_to_flt(data[:, 1])), axis=1))
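For example, continuing from the definitions above with a small sample of the array from the question:

data = np.array([['15,39', '17,43'],
                 ['23,40', '18,44'],
                 ['28,41', '18,45']])
df = pd.DataFrame(np.concatenate((str_to_flt(data[:, 0]), str_to_flt(data[:, 1])), axis=1))
print(df)
#       0     1     2     3
# 0  15.0  39.0  17.0  43.0
# 1  23.0  40.0  18.0  44.0
# 2  28.0  41.0  18.0  45.0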
Your data:
from io import StringIO
s="""[['15,39' '17,43']
['23,40' '18,44']
['28,41' '18,45']
['28,42' '27,46']
['34,43' '26,47']]"""
df = pd.read_csv(StringIO(s), header=None)
You can do:
d={"\[\['":"","'\]\]":"","'\]\]'":"","'\]":"","\['":"","' '":','}
df=df.replace(d,regex=True)
df[[1.2,1.5]]=df.pop(1).str.extract(r"(\d+),(\d+)")
df=df.sort_index(axis=1)
output of df:
0.0 1.2 1.5 2.0
0 15 39 17 43
1 23 40 18 44
2 28 41 18 45
3 28 42 27 46
4 34 43 26 47
Of course, you can rename the columns using the columns attribute or the rename() method, and typecast the data using the astype() method, according to your needs.
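For instance, with hypothetical column names:

# the values are still strings after extraction, so cast them after renaming
df.columns = ['x1', 'y1', 'x2', 'y2']
df = df.astype(float)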
I have two columns in a DataFrame (you can see a sample down below).
Usually in columns A & B I get 10 to 12 rows with similar values.
So for example: from index 1 to 10 and then from index 11 to 21.
I would like to group these values and get the mean and standard deviation of each group.
I found the following line of code to get the index of the nearest value, but I don't know how to do this repetitively:
Index = df['A'].sub(df['A'][0]).abs().idxmin()
Does anyone have any ideas on how to approach this problem?
A B
1 3652.194531 -1859.805238
2 3739.026566 -1881.965576
3 3742.095325 -1878.707674
4 3747.016899 -1878.728626
5 3746.214554 -1881.270329
6 3750.325368 -1882.915532
7 3748.086576 -1882.406672
8 3751.786422 -1886.489485
9 3755.448968 -1885.695822
10 3753.714126 -1883.504098
11 -337.969554 24.070990
12 -343.019575 23.438956
13 -344.788697 22.250254
14 -346.433460 21.912217
15 -343.228579 22.178519
16 -345.722368 23.037441
17 -345.923108 23.317620
18 -345.526633 21.416528
19 -347.555162 21.315934
20 -347.229210 21.565183
21 -344.575181 22.963298
22 23.611677 -8.499528
23 26.320500 -8.744512
24 24.374874 -10.717384
25 25.885272 -8.982414
26 24.448127 -9.002646
27 23.808744 -9.568390
28 24.717935 -8.491659
29 25.811393 -8.773649
30 25.084683 -8.245354
31 25.345618 -7.508419
32 23.286342 -10.695104
33 -3184.426285 -2533.374402
34 -3209.584366 -2553.310934
35 -3210.898611 -2555.938332
36 -3214.234899 -2558.244347
37 -3216.453616 -2561.863807
38 -3219.326197 -2558.739058
39 -3214.893325 -2560.505207
40 -3194.421934 -2550.186647
41 -3219.728445 -2562.472566
42 -3217.630380 -2562.132186
43 234.800448 -75.157523
44 236.661235 -72.617806
45 238.300501 -71.963103
46 239.127539 -72.797922
47 232.305335 -70.634125
48 238.452197 -73.914015
49 239.091210 -71.035163
50 239.855953 -73.961841
51 238.936811 -73.887023
52 238.621490 -73.171441
53 240.771812 -73.847028
54 -16.798565 4.421919
55 -15.952454 3.911043
56 -14.337879 4.236691
57 -17.465204 3.610884
58 -17.270147 4.407737
59 -15.347879 3.256489
60 -18.197750 3.906086
A simpler approach consists of grouping the values where the percentage change is not greater than a given threshold (say, 0.5):
df['Group'] = (df.A.pct_change().abs()>0.5).cumsum()
df.groupby('Group').agg(['mean', 'std'])
Output:
A B
mean std mean std
Group
0 3738.590934 30.769420 -1880.148905 7.582856
1 -344.724684 2.666137 22.496995 0.921008
2 24.790470 0.994361 -9.020824 0.977809
3 -3210.159806 11.646589 -2555.676749 8.810481
4 237.902230 2.439297 -72.998817 1.366350
5 -16.481411 1.341379 3.964407 0.430576
Note: I have only used the "A" column, since the "B" column appears to follow the same pattern of consecutive nearest values. You can check whether the identified groups are the same for both columns with:
grps = (df[['A','B']].pct_change().abs() > 0.5).cumsum()
grps.A.eq(grps.B).all()
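If you prefer flat column names over the MultiIndex that agg(['mean', 'std']) produces, named aggregation is one option (a sketch):

out = df.groupby('Group').agg(
    A_mean=('A', 'mean'), A_std=('A', 'std'),
    B_mean=('B', 'mean'), B_std=('B', 'std'),
)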
I would say that if you know the length of each group/index set you want, then you can subset the column and rows with:
df['A'].iloc[0:11].mean()
Then compute the standard deviation the same way, using .std() in place of .mean().
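A sketch of repeating that over the whole frame, assuming a known fixed group length (hypothetical here, since the sample's group sizes vary):

group_len = 11  # hypothetical fixed group length
for start in range(0, len(df), group_len):
    chunk = df['A'].iloc[start:start + group_len]
    print(chunk.mean(), chunk.std())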
(Absolute beginner here)
The following code should replace every 9th row of the template df with EVERY row of the data df. However, it replaces every 9th row of template with every 9th row of data.
template.iloc[::9, 2] = data['Question (en)']
template.iloc[::9, 3] = data['Correct Answer']
template.iloc[::9, 4] = data['Incorrect Answer 1']
template.iloc[::9, 5] = data['Incorrect Answer 2']
Thank you for your help
The source of the problem with your code is that the initial step of any operation on 2 DataFrames is their alignment by index.
To avoid this step, take the underlying NumPy array from one of the DataFrames by invoking .values.
Since a NumPy array has no index, Pandas can't perform the mentioned alignment.
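A minimal illustration of that alignment with two made-up Series:

import pandas as pd

template = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
data = pd.Series(['X', 'Y', 'Z'])

# positions 0, 2, 4 of template carry the labels 0, 2, 4, so they receive
# data[0], data[2] and NaN (data has no label 4), not X, Y, Z in order
template.iloc[::2] = data
print(template.tolist())  # ['X', 'b', 'Z', 'd', nan, 'f']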
Two further corrections: take from the second DataFrame only as many rows as are needed, and only those columns that are to be saved in the target DataFrame; then perform the whole update in one go (see the code below).
To create both source test DataFrames, I defined the following function:
import numpy as np
import pandas as pd

def getTestDf(nRows: int, tt: str, valShift=0):
    qn = np.array(list(map(lambda i: tt + str(i), np.arange(nRows, dtype=int))))
    ans = np.arange(nRows * 3, dtype=int).reshape((-1, 3)) + valShift
    return pd.concat([pd.DataFrame({'Question (en)': qn}),
                      pd.DataFrame(ans, columns=['Correct Answer', 'Incorrect Answer 1', 'Incorrect Answer 2'])],
                     axis=1)
and called it:
template = getTestDf(80, 'Question_')
data = getTestDf(9, 'New question ', 1000)
Note that after I created template, I counted that just 9 rows of data are needed, so I created data with just 9 rows.
This way the initial part of template contains:
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 Question_0 0 1 2
1 Question_1 3 4 5
2 Question_2 6 7 8
3 Question_3 9 10 11
4 Question_4 12 13 14
...
and data (in full):
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 New question 0 1000 1001 1002
1 New question 1 1003 1004 1005
2 New question 2 1006 1007 1008
3 New question 3 1009 1010 1011
4 New question 4 1012 1013 1014
5 New question 5 1015 1016 1017
6 New question 6 1018 1019 1020
7 New question 7 1021 1022 1023
8 New question 8 1024 1025 1026
Now, to copy selected rows, run just:
template.iloc[::9] = data.values
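On newer pandas versions, .to_numpy() is the recommended spelling of .values and works the same way here:

template.iloc[::9] = data.to_numpy()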
The initial part of template contains now:
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 New question 0 1000 1001 1002
1 Question_1 3 4 5
2 Question_2 6 7 8
3 Question_3 9 10 11
4 Question_4 12 13 14
5 Question_5 15 16 17
6 Question_6 18 19 20
7 Question_7 21 22 23
8 Question_8 24 25 26
9 New question 1 1003 1004 1005
10 Question_10 30 31 32
11 Question_11 33 34 35
12 Question_12 36 37 38
13 Question_13 39 40 41
14 Question_14 42 43 44
15 Question_15 45 46 47
16 Question_16 48 49 50
17 Question_17 51 52 53
18 New question 2 1006 1007 1008
19 Question_19 57 58 59
I am pretty sure that there are simpler/nicer ways, but just off the top of my head:
template_9 = template.iloc[::9, 0:2].copy()
# pair rows by position with a running key; a constant key for every row
# would cross-join each row of template_9 with every row of data
template_9['key'] = range(len(template_9))
data['key'] = range(len(data))
# reset_index()/set_index() keeps the original labels (0, 9, 18, ...) so the
# merged rows can replace the originals; 'left' is not strictly needed, but clearer
template_9 = template_9.reset_index().merge(data, how='left', on='key').set_index('index')
template_9.drop('key', axis=1, inplace=True)
data.drop('key', axis=1, inplace=True)
template = pd.concat([template, template_9])
template = template[~template.index.duplicated(keep='last')].sort_index()
P.S. I'm assuming column names are the same in both data frames, but it should be straightforward to adapt it anyway.
I have a dataframe called df_location:
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is search list_of_locations for each unique location and merge the result into df_location, so that each location_id ends up with its corresponding island_id.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69],
'island_id':[10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in Python to do so. I would really appreciate it if someone could give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
# explode() leaves the exploded column as object dtype; cast it so the merge
# keys have matching types
df_islands['location_id'] = df_islands['location_id'].astype(int)
pd.merge(df_locations, df_islands)
Out[]:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
The df.apply() method works here. It's a bit long-winded but it works:
df_location['island_id'] = df_location['location_id'].apply(
    lambda x: [
        df_islands['island_id'][i]
        for i in df_islands.index
        if x in df_islands['list_of_locations'][i]
        # comment out the line above and use this instead if the list is stored as a string:
        # if x in eval(df_islands['list_of_locations'][i])
    ][0]
)
First, we select the final value we want whenever the if statement is True: df_islands['island_id'][i].
Then we loop over each row of df_islands by using df_islands.index.
Then comes the if statement, which checks each list in df_islands['list_of_locations'] and returns True if the value of df_location['location_id'] is in it.
Finally, since we had to wrap this long statement in square brackets, the result is a list. However, we know there is only one value in that list, so we can index it with [0] at the end.
I hope this helps, and I'm happy for other editors to make the answer more legible!
print(df_location)
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
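An equivalent lookup can also be precomputed once as a plain dict from the original (unexploded) df_islands, which avoids rescanning it for every row; a sketch:

# flatten the lists into a location -> island mapping
lookup = {loc: isl
          for isl, locs in zip(df_islands['island_id'], df_islands['list_of_locations'])
          for loc in locs}
df_location['island_id'] = df_location['location_id'].map(lookup)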
I am attempting to load a given csv file whose columns are id, Doc_ID, Sent_ID, Word and tag.
Then, I'd like to join all the words with the same "Sent_ID" into one row, with the following code:
train = pd.read_csv("train.csv")

# Create a dataframe of sentences.
sentence_df = pd.DataFrame(train["Sent_ID"].drop_duplicates(), columns=["Sent_ID", "Sentence", "Target"])

for _, row in train.iterrows():
    print(str(row["Word"]))
    sentence_df.loc[sentence_df["Sent_ID"] == row["Sent_ID"], ["Sentence"]] = str(row["Word"])
However, the result of the print(str(row["Word"])) is:
0 Obesity
1 in
2 Low-
3 and
4 Middle-Income
5 Countries
...
Name: Word, Length: 4543833, dtype: object
i.e. every single word in the column, for any given row. This occurs for all rows.
Printing the entire row gives:
id 89
Doc_ID 1
Sent_ID 4
Word 0 Obesity\n1 ...
tag O
Name: 88, dtype: object
This again suggests that every element of the "Word" column is present in each cell. (The 88th entry is not "Obesity\n1" in the .csv file.)
I have tried changing the quoting argument in the read_csv function, as well as manually inserting the headers in the names argument, to no avail.
How do I ensure each Dataframe entry only contains its own word?
I've added a pastebin with some of the samples here (the pastebin will expire a week after this edit).
Building on @Aravind's answer, since OP wanted a working example:
from io import StringIO
import pandas as pd

csv = StringIO('''
<paste csv snippet here>
''')
df = pd.read_csv(csv)
# Print first 5 rows
print(df.head())
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
Now we have the data loaded as a pandas.DataFrame. We can use the groupby() method to combine the words into sentences.
df = df.groupby('Sent_ID').Word.apply(' '.join).reset_index()
print(df)
Sent_ID Word
0 1 Obesity in Low- and Middle-Income Countries : ...
1 2 We have reviewed the distinctive features of e...
2 3 Obesity is rising in every region of the world...
3 4 In LMICs , overweight is higher in women compa...
4 5 Overweight occurs alongside persistent burdens...
5 6 Changes in the global diet and physical activi...
6 7 Emerging risk factors include environmental co...
7 8 Data on effective strategies to prevent the on...
8 9 Expanding the research in this area is a key p...
9 10 MICROCEPHALIA VERA
10 11 Excellent reproducibility of laser speckle con...
11 12 We compared the inter-day reproducibility of p...
12 13 We also tested whether skin blood flow assessm...
13 14 Skin blood flow was evaluated during PORH and ...
14 15 Data are expressed as cutaneous vascular condu...
15 16 Reproducibility is expressed as within subject...
16 17 Twenty-eight healthy participants were enrolle...
17 18 The reproducibility of the PORH peak CVC was b...
18 19 Inter-day reproducibility of the LTH plateau w...
19 20 Finally , we observed significant correlation ...
20 21 The recently developed LSCI technique showed v...
21 22 Moreover , we showed significant correlation b...
22 23 However , more data are needed to evaluate the...
23 24 Positive inotropic action of cholinesterase on...
24 25 The putative chloride channel hCLCA2 has a sin...
25 26 Calcium-activated chloride channel ( CLCA ) pr...
26 27 Genetic and electrophysiological studies have ...
27 28 The human CLCA2 protein is expressed as a 943-...
28 29 Earlier investigations of transmembrane geomet...
29 30 However , analysis by the more recently derive...
Use groupby()
df = df.groupby('Sent_ID')['Word'].apply(' '.join).reset_index()
You can also group by multiple columns by passing a list, like so:
df.groupby(['Doc_ID','Sent_ID','tag'])
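A grouped object by itself does nothing until an aggregation is applied; for example, to keep the Doc_ID alongside each rebuilt sentence (a sketch following the pattern above):

sentences = df.groupby(['Doc_ID', 'Sent_ID'])['Word'].apply(' '.join).reset_index()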