I have a .txt file that has the data regarding the total number of queries with valid names. The text inside of the file came out of a SQL Server 19 query output. The database used consists of the results of an algorithm that retrieves the most similar brands related to the query inserted. The file looks something like this:
2 16, 42, 44 A MINHA SAÚDE
3 34 !D D DUNHILL
4 33 #MEGA
5 09 (michelin man)
5 12 (michelin man)
6 33 *MONTE DA PEDRA*
7 35 .FOX
8 33 #BATISTA'S BY PITADA VERDE
9 12 #COM
10 41 + NATUREZA HUMANA
11 12 001
12 12 002
13 12 1007
14 12 101
15 12 102
16 12 104
17 37 112 PC
18 33 1128
19 41 123 PILATES
The 1st column has the Query identifier, the 2nd one has the brand classes where the Query can be located and the 3rd one is the Query itself (the spaces came from the SQL Server output formatting).
I then made a Pandas DataFrame in Google Colaboratory where I wanted the columns to be like the ones in the text file. However, when I ran the code, it gave me this:
The code that I wrote is here:
# Dataframe with the total number of queries with valid names:
df = pd.DataFrame(pd.read_table("/content/drive/MyDrive/data/classes/100/queries100.txt", header=None, names=["Query ID", "Query Name", "Classes Where Query is Present"]))
df
I think that this happens because of the commas in the 2nd column but I'm not quite sure. Any suggestions on why this is happening? I already tried read_csv and read_fwf and they were even worse in terms of formatting.
You can use pd.read_fwf() in this case, as your columns have fixed widths:
import pandas as pd
df = pd.read_fwf(
"/content/drive/MyDrive/data/classes/100/queries100.txt",
colspecs=[(0,20),(21,40),(40,1000)],
header=None,
names=["Query ID", "Query Name", "Classes Where Query is Present"]
)
df.head()
# Query ID Query Name Classes Where Query is Present
# 0 2 16, 42, 44 A MINHA SAÚDE
# 1 3 34 !D D DUNHILL
# 2 4 33 #MEGA
# 3 5 09 (michelin man)
# 4 5 12 (michelin man)
I have the two columns in a data frame (you can see a sample down below)
Usually in columns A & B I get 10 to 12 rows with similar values.
So for example: from index 1 to 10 and then from index 11 to 21.
I would like to group these values and get the mean and standard deviation of each group.
I found this following line code where I can get the index of the nearest value. but I don't know how to do this repetitively:
Index = df['A'].sub(df['A'][0]).abs().idxmin()
Anyone has any ideas on how to approach this problem?
A B
1 3652.194531 -1859.805238
2 3739.026566 -1881.965576
3 3742.095325 -1878.707674
4 3747.016899 -1878.728626
5 3746.214554 -1881.270329
6 3750.325368 -1882.915532
7 3748.086576 -1882.406672
8 3751.786422 -1886.489485
9 3755.448968 -1885.695822
10 3753.714126 -1883.504098
11 -337.969554 24.070990
12 -343.019575 23.438956
13 -344.788697 22.250254
14 -346.433460 21.912217
15 -343.228579 22.178519
16 -345.722368 23.037441
17 -345.923108 23.317620
18 -345.526633 21.416528
19 -347.555162 21.315934
20 -347.229210 21.565183
21 -344.575181 22.963298
22 23.611677 -8.499528
23 26.320500 -8.744512
24 24.374874 -10.717384
25 25.885272 -8.982414
26 24.448127 -9.002646
27 23.808744 -9.568390
28 24.717935 -8.491659
29 25.811393 -8.773649
30 25.084683 -8.245354
31 25.345618 -7.508419
32 23.286342 -10.695104
33 -3184.426285 -2533.374402
34 -3209.584366 -2553.310934
35 -3210.898611 -2555.938332
36 -3214.234899 -2558.244347
37 -3216.453616 -2561.863807
38 -3219.326197 -2558.739058
39 -3214.893325 -2560.505207
40 -3194.421934 -2550.186647
41 -3219.728445 -2562.472566
42 -3217.630380 -2562.132186
43 234.800448 -75.157523
44 236.661235 -72.617806
45 238.300501 -71.963103
46 239.127539 -72.797922
47 232.305335 -70.634125
48 238.452197 -73.914015
49 239.091210 -71.035163
50 239.855953 -73.961841
51 238.936811 -73.887023
52 238.621490 -73.171441
53 240.771812 -73.847028
54 -16.798565 4.421919
55 -15.952454 3.911043
56 -14.337879 4.236691
57 -17.465204 3.610884
58 -17.270147 4.407737
59 -15.347879 3.256489
60 -18.197750 3.906086
A simpler approach consist in grouping the values where the percentage change is not greater than a given threshold (let's say 0.5):
df['Group'] = (df.A.pct_change().abs()>0.5).cumsum()
df.groupby('Group').agg(['mean', 'std'])
Output:
A B
mean std mean std
Group
0 3738.590934 30.769420 -1880.148905 7.582856
1 -344.724684 2.666137 22.496995 0.921008
2 24.790470 0.994361 -9.020824 0.977809
3 -3210.159806 11.646589 -2555.676749 8.810481
4 237.902230 2.439297 -72.998817 1.366350
5 -16.481411 1.341379 3.964407 0.430576
Note: I have only used the "A" column, since the "B" column appears to follow the same pattern of consecutive nearest values. You can check if the identified groups are the same between columns with:
grps = (df[['A','B']].pct_change().abs()>1).cumsum()
grps.A.eq(grps.B).all()
I would say that if you know the length of each group/index set you want then you can first subset the column and row with :
df['A'].iloc[0:11].mean()
Then figure out a way to find standard deviation.
I have a dataframe called df_location:
location = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is to search the list_of_locations for each unique location and merge it to df_location in a way where each island_id will correspond to a specific location.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69],
'island_id':[10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in python to do so. I would really appreciate it if someone can give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
pd.merge(df_locations, df_islands)
Out[]:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
The df.apply() method works here. It's a bit long-winded but it works:
df_location['island_id'] = df_location['location_id'].apply(
lambda x: [
df_islands['island_id'][i] \
for i in df_islands.index \
if x in df_islands['list_of_locations'][i]
# comment above line and use this instead if list is stored in a string
# if x in eval(df_islands['list_of_locations'][i])
][0]
)
First we select the final value we want if the if statement is True: df_islands['island_id'][i]
Then we loop over each column in df_islands by using df_islands.index
Then create the if statement which loops over all values in df_islands['list_of_locations'] and returns True if the value for df_location['location_id'] is in the list.
Finally, since we must contain this long statement in square brackets, it is a list. However, we know that there is only one value in the list so we can index it by using [0] at the end.
I hope this helps and happy for other editors to make the answer more legible!
print(df_location)
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
This question already has an answer here:
Pandas filter vs. loc method
(1 answer)
Closed 3 years ago.
I am want to split these dataframe into two new data frames.
1) contains all the google
2) contains all the bing
by matching strings works.. but by finding anything that starts with with "Bing" or "Google" will be even better!!
Dataframe:
number platform rds
0 200 BingBrand 56
1 200 BingNonBrand 56
2 200 BingNonBrand 56
3 151 GoogleNonBrand 56
4 1651 GoogleDisplay 56
5 626 Bing 56
6 626 BingNonBrand 56
7 125 GoogleBrand 56
# It is working but only for the last condition in the brackets,
example: BingBrand, GoogleBrand
# creating two new dataframes for each
cw_bing = cw[cw["platform"] == ('Bing' and 'BingNonBrand' and 'BingBrand' )]
cw_google = cw[cw["platform"] == ('GoogleNonBrand'and 'GoogleDisplay'
and 'Google' and 'GoogleBrand')]
Trying to split by any row started with "Bing" or "Google"
# Doesnt work
# cw_bing = cw[cw["vendorname"].filter(regex='Bing.*')]
# cw_google = cw[cw["vendorname"].filter(regex='Google.*')]
Final result should be like this
df 1 =
BingBrand 56
1 200 BingNonBrand 56
2 200 BingNonBrand 56
5 626 Bing 56
6 626 BingNonBrand 56
df2 =
3 151 GoogleNonBrand 56
4 1651 GoogleDisplay 56
7 125 GoogleBrand 56
You can use contain.
bing = df[df.platform.str.contains('Bing')]
google = df[df.platform.str.contains('Google')]
I have a text file (one.txt) that contains an arbitrary number of key‐value pairs (where the key and value are separated by a = – e.g. 1=8). Here are some examples:
1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d
I need to create a DataFrame with a list of dictionaries, with each row as one dictionary in the list like so:
[{1:88,11:1438,15:kkk,45:7.7....},{4:13,11:1438....},{6:84,41:18,56:TTT...}]
df = pd.read_csv("input.txt",names=['text'],header=None)
data = df['text'].str.split("|")
names=[ y.split('=') for x in data for y in x]
ds=pd.DataFrame(names)
print ds
How can I create a dictionary for each line by splitting on the = symbol?
It should be one row and multiple columns.
The DataFrame should have all keys as rows and values as columns.
Example:
1 11 15 45 21 86 4 49 8 67 84 6 41 56 45 07
88 1438 kkk 00 66 a
na 1438 na .....
I think performing a .pivot would work. Try this:
import pandas as pd
df = pd.read_csv("input.txt",names=['text'],header=None)
data = df['text'].str.split("|")
names=[ y.split('=') for x in data for y in x]
ds=pd.DataFrame(names)
ds = ds.pivot(columns=0).fillna('')
The .fillna('') removes the None values. If you'd like to replace with na you can use .fillna('na').
Output:
ds.head()
1
0 07 1 11 15 21 4 41 45 49 56 6 67 8 84 86
0 88
1 1438
2 KKK
3 00
4 00
For space I didn't print the entire dataframe, but it does column indexing based on the key and then values based on the values for each line (preserving the dict by line concept).