I am trying to use a Canadian gridded historical dataset of temperature anomalies, but I don't seem to have the skills to pull it off. The .grd files contain temperature anomalies on what I believe is a highly regular grid. I have no experience with that kind of grid and I am having trouble building the xarray dataset.
What I have (a subset of the grd and the text file is accessible here) :
2075 '.grd' files ('t190001.grd' to 't202112.grd' following "t{year}{month}.grd" structure)
1 txt file listing the grid coordinates called "CANGRD_points_LL.txt"
From this I would like to build an xarray dataset in order to do some analysis.
Naively, I thought the grid files were already georeferenced, so I started by doing this:
import glob
import rioxarray as rio
import pandas as pd
import numpy as np
import xarray as xr
#not used for the moment even though I believe that will be needed
#df = pd.read_csv(r"CANGRD_points_LL.txt", sep = ' ', header=None)
list_files = sorted(set(glob.glob(r"t?????[0-2].grd" ) + glob.glob(r"t????0[0-9].grd" )))
times = pd.date_range("1900/01/01",freq='M', periods= len(list_files))
datarrays = [rio.open_rasterio(rst, masked=True,band_as_variable=True).assign_coords(time = t).expand_dims(dim='time').squeeze() for rst,t in zip(list_files, times)]
ds = xr.concat(datarrays,dim='time').rename({'band_1' : 'tas', 'y': 'lat', 'x' : 'lon'})
But as I plotted the results it became evident that my coordinates were only the indices of the pixels :
So I believe I have to use the txt file provided. However, I have no idea how to build the xarray grid from the grid's coordinates, or how to match it with the array I get by loading a grid file via rioxarray. Here is a sample; the complete file is available above. What baffles me is that most of the 11874 lines of the dataframe read from the txt file seem to be unique, so how could I fit an array of dimensions 125 lon by 95 lat into it?
0 1 2 3
0 0 0 40.0451 -129.8530
1 0 1 40.1780 -129.3650
2 0 2 40.3080 -128.8740
3 0 3 40.4348 -128.3801
4 0 4 40.5585 -127.8834
5 0 5 40.6790 -127.3840
6 0 6 40.7963 -126.8817
7 0 7 40.9104 -126.3768
8 0 8 41.0211 -125.8693
9 0 9 41.1286 -125.3591
10 0 10 41.2327 -124.8465
11 0 11 41.3335 -124.3314
12 0 12 41.4308 -123.8140
13 0 13 41.5247 -123.2942
14 0 14 41.6151 -122.7722
15 0 15 41.7020 -122.2481
16 0 16 41.7853 -121.7218
17 0 17 41.8651 -121.1936
18 0 18 41.9413 -120.6634
19 0 19 42.0139 -120.1313
20 0 20 42.0828 -119.5975
21 0 21 42.1481 -119.0620
22 0 22 42.2097 -118.5249
23 0 23 42.2675 -117.9863
24 0 24 42.3216 -117.4462
25 0 25 42.3720 -116.9049
26 0 26 42.4186 -116.3622
27 0 27 42.4614 -115.8185
28 0 28 42.5005 -115.2736
29 0 29 42.5357 -114.7279
30 0 30 42.5670 -114.1812
31 0 31 42.5946 -113.6338
32 0 32 42.6182 -113.0857
33 0 33 42.6381 -112.5371
34 0 34 42.6540 -111.9880
35 0 35 42.6661 -111.4385
36 0 36 42.6743 -110.8888
37 0 37 42.6786 -110.3389
38 0 38 42.6791 -109.7889
39 0 39 42.6757 -109.2390
40 0 40 42.6684 -108.6892
41 0 41 42.6572 -108.1397
42 0 42 42.6421 -107.5905
43 0 43 42.6232 -107.0417
44 0 44 42.6004 -106.4935
45 0 45 42.5738 -105.9459
46 0 46 42.5433 -105.3991
47 0 47 42.5090 -104.8531
48 0 48 42.4708 -104.3081
49 0 49 42.4289 -103.7640
Here is the view of one grid file loaded as xarray,
Any help would be greatly appreciated! Thank you so much
I asked directly on the xarray GitHub discussions; here is the original answer from Keewis:
https://github.com/pydata/xarray/discussions/7443#discussioncomment-4700261
The grid file contains stacked 2D coordinates, which I guess is due to the grid's original coordinate system not being aligned with the lat / lon axes.
To read the coordinates into 2D coordinates you can use:
df = pd.read_csv(r"CANGRD_points_LL.txt", sep=" ", header=None, names=["y", "x", "lat", "lon"])
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])
raw = xr.concat([...], dim="time")
ds = xr.merge([raw, grid]).assign_coords(time=times).rename_vars(...)
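A minimal, self-contained sketch of the same idea, assuming a tiny made-up 2 x 3 grid in place of CANGRD_points_LL.txt and an all-zero array in place of the rasters (the values and the variable name "tas" are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for CANGRD_points_LL.txt: one row per grid cell,
# giving its (y, x) index and its latitude / longitude.
points = pd.DataFrame({
    "y": [0, 0, 0, 1, 1, 1],
    "x": [0, 1, 2, 0, 1, 2],
    "lat": [40.0, 40.1, 40.2, 40.5, 40.6, 40.7],
    "lon": [-130.0, -129.5, -129.0, -130.1, -129.6, -129.1],
})

# Unstack the table into 2D (y, x) arrays and mark them as coordinates.
grid = points.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])

# Hypothetical stand-in for the concatenated rasters, indexed by pixel position.
times = pd.date_range("1900-01-01", periods=2, freq="MS")
raw = xr.Dataset(
    {"tas": (("time", "y", "x"), np.zeros((2, 2, 3)))},
    coords={"time": times, "y": [0, 1], "x": [0, 1, 2]},
)

# Merging aligns on the shared y / x indices and attaches lat / lon to tas.
ds = xr.merge([raw, grid])
print(ds)
```

Because lat and lon are 2D, they cannot serve as dimension indexes; the data stays indexed by y and x, and lat / lon are carried along as auxiliary coordinates.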
I have a dataframe called df_location:
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is to search the list_of_locations for each unique location and merge it to df_location in a way where each island_id will correspond to a specific location.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69],
'island_id':[10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in python to do so. I would really appreciate it if someone can give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
pd.merge(df_locations, df_islands)
Out[]:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
The df.apply() method works here. It's a bit long-winded but it works:
df_location['island_id'] = df_location['location_id'].apply(
lambda x: [
df_islands['island_id'][i] \
for i in df_islands.index \
if x in df_islands['list_of_locations'][i]
# comment above line and use this instead if list is stored in a string
# if x in eval(df_islands['list_of_locations'][i])
][0]
)
First we select the final value we want if the if statement is True: df_islands['island_id'][i]
Then we loop over each row in df_islands by using df_islands.index
Then create the if statement which loops over all values in df_islands['list_of_locations'] and returns True if the value for df_location['location_id'] is in the list.
Finally, since we must contain this long statement in square brackets, it is a list. However, we know that there is only one value in the list so we can index it by using [0] at the end.
I hope this helps and happy for other editors to make the answer more legible!
print(df_location)
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
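As a possible alternative to scanning df_islands once per location, the lists can be flattened into a plain dictionary first and then looked up with Series.map. This is only a sketch on the sample data from the question; the name location_to_island is made up for illustration:

```python
import pandas as pd

df_location = pd.DataFrame({
    'location_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'temperature_value': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    'humidity_value': [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
})
df_islands = pd.DataFrame({
    'island_id': [10, 20, 30, 40, 50, 60],
    'list_of_locations': [[1], [2, 3], [4, 5], [6, 7, 8], [9], [10]],
})

# Build {location_id: island_id} by walking every list exactly once.
location_to_island = {
    loc: island
    for island, locs in zip(df_islands['island_id'], df_islands['list_of_locations'])
    for loc in locs
}

# Locations missing from the dictionary would become NaN instead of raising.
df_location['island_id'] = df_location['location_id'].map(location_to_island)
print(df_location)
```

Unlike the `[0]` indexing above, this does not fail when a location belongs to no island, which may or may not be what you want.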
I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on year and name. The problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis', while the second has 'Glen Davis'.
How can I merge them using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I found in another question on this platform, but it is not working for me. I am adding a new column after matching names in both datasets. I know this is not a good approach. Kindly suggest a better way if there is one.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
if cdifflib.CSequenceMatcher(None,comp_a,comp_b).ratio() > .6:
df_b.loc[ixb,'merge_year'] = comp_a # creates a merge key in df_b
if cdifflib.CSequenceMatcher(None,addr_a, addr_b).ratio() > .6:
df_b.loc[ixb,'merge_name'] = addr_a # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')
You can do
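import difflib
candidates = df_a['Name'].unique().tolist()
# fall back to the original name when no close match is found
df_b['name'] = df_b['name'].apply(lambda x: \
(difflib.get_close_matches(x, candidates, n=1) or [x])[0])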
to replace names in df_b with closest match from df_a, then do your merge. See also this post.
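To make the idea concrete, here is a small end-to-end sketch. The two frames are toy stand-ins with illustrative values (not the real statistics), the helper closest_name is hypothetical, and the 0.8 cutoff is an assumption you would need to tune on your own data:

```python
import difflib
import pandas as pd

# Toy stand-ins for all_batsman_df and all_batting_statistics_df.
salaries = pd.DataFrame({
    "years": [1991, 1991],
    "name": ["Glenn Davis", "Will Clark"],
    "salary": [3275000.0, 3750000.0],
})
stats = pd.DataFrame({
    "Year": [1991, 1991, 1991],
    "Name": ["Glen Davis", "Will Clark", "Jim Acker"],
    "G": [37, 148, 21],
})

candidates = salaries["name"].unique().tolist()

def closest_name(name, cutoff=0.8):
    """Return the best match among the salary names, or None if nothing is close."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Add a normalised key to the stats frame, then merge on year and that key.
stats["name"] = stats["Name"].apply(closest_name)
merged = salaries.merge(
    stats, left_on=["years", "name"], right_on=["Year", "name"], how="left"
)
print(merged)
```

Keeping the original Name column next to the matched name makes it easy to review which spellings were rewritten before trusting the merge.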
Let me approach your problem by assuming that you want a data set with two columns: 'year' and 'name'.
1. First, rename all the names that are misspelled.
I assume you know all the misspelled names in all_batting_statistics_df; correct each one like this:
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
When correcting the spellings, work on the smaller data set whose names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only the year and the name.
Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Age','Tm','Lg','G','PA','AB','R','Pos Summary'], axis=1)
I cannot see all 31 columns, so I left the rest out; you have to add them to the code above.
3. We need the column names to match, i.e. 'year' and 'name', using the pandas DataFrame rename method:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, merge them on both columns (note that all_batsman_df calls its year column 'years'):
all_batsman_df_1.merge(df_new_1, left_on=['years', 'name'], right_on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data set to Google Sheets or Microsoft Excel and edit it there. If you like pandas, it's not that difficult and you will find a way. All the best!
I am attempting to load a given csv file with the following structure:
Then, I'd like to join all the words with the same "Sent_ID" into one row, using the following code:
train = pd.read_csv("train.csv")
# Create a dataframe of sentences.
sentence_df = pd.DataFrame(train["Sent_ID"].drop_duplicates(), columns=["Sent_ID", "Sentence", "Target"])
for _, row in train.iterrows():
print(str(row["Word"]))
sentence_df.loc[sentence_df["Sent_ID"] == row["Sent_ID"], ["Sentence"]] = str(row["Word"])
However, the result of the print(str(row["Word"])) is:
0 Obesity
1 in
2 Low-
3 and
4 Middle-Income
5 Countries
...
Name: Word, Length: 4543833, dtype: object
i.e. every single word in the column is printed for any given row. This occurs for all rows.
Printing the entire row gives:
id 89
Doc_ID 1
Sent_ID 4
Word 0 Obesity\n1 ...
tag O
Name: 88, dtype: object
This again suggests that every element of the "Word" column is present in each cell. (The 88th entry is not "Obesity\n1" in the .csv file.)
I have tried changing the quoting argument in the read_csv function, as well as manually inserting the headers in the names argument, to no avail.
How do I ensure each Dataframe entry only contains its own word?
I've added a pastebin with some of the samples here (the pastebin will expire a week after this edit).
Building on @Aravind's answer, since the OP wanted a working example:
from io import StringIO
import pandas as pd
csv = StringIO('''
<paste csv snippet here>
''')
df = pd.read_csv(csv)
# Print first 5 rows
print(df.head())
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
Now that the data is loaded as a pandas.DataFrame, we can use the groupby method to combine the words into sentences.
df = df.groupby('Sent_ID').Word.apply(' '.join).reset_index()
print(df)
Sent_ID Word
0 1 Obesity in Low- and Middle-Income Countries : ...
1 2 We have reviewed the distinctive features of e...
2 3 Obesity is rising in every region of the world...
3 4 In LMICs , overweight is higher in women compa...
4 5 Overweight occurs alongside persistent burdens...
5 6 Changes in the global diet and physical activi...
6 7 Emerging risk factors include environmental co...
7 8 Data on effective strategies to prevent the on...
8 9 Expanding the research in this area is a key p...
9 10 MICROCEPHALIA VERA
10 11 Excellent reproducibility of laser speckle con...
11 12 We compared the inter-day reproducibility of p...
12 13 We also tested whether skin blood flow assessm...
13 14 Skin blood flow was evaluated during PORH and ...
14 15 Data are expressed as cutaneous vascular condu...
15 16 Reproducibility is expressed as within subject...
16 17 Twenty-eight healthy participants were enrolle...
17 18 The reproducibility of the PORH peak CVC was b...
18 19 Inter-day reproducibility of the LTH plateau w...
19 20 Finally , we observed significant correlation ...
20 21 The recently developed LSCI technique showed v...
21 22 Moreover , we showed significant correlation b...
22 23 However , more data are needed to evaluate the...
23 24 Positive inotropic action of cholinesterase on...
24 25 The putative chloride channel hCLCA2 has a sin...
25 26 Calcium-activated chloride channel ( CLCA ) pr...
26 27 Genetic and electrophysiological studies have ...
27 28 The human CLCA2 protein is expressed as a 943-...
28 29 Earlier investigations of transmembrane geomet...
29 30 However , analysis by the more recently derive...
Use groupby()
df = df.groupby('Sent_ID')['Word'].apply(' '.join).reset_index()
You can group by multiple columns as a list. Like so
df.groupby(['Doc_ID','Sent_ID','tag'])
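For a version that runs without the original file, here is a sketch using a few made-up rows in the same column layout. The astype(str) call is an added precaution, on the assumption that some Word cells might be read as NaN or as numbers, which would make a plain ' '.join fail:

```python
from io import StringIO
import pandas as pd

# A few illustrative rows in the same layout as train.csv.
csv = StringIO("""id,Doc_ID,Sent_ID,Word,tag
1,1,1,Obesity,O
2,1,1,in,O
3,1,1,LMICs,O
4,1,2,We,O
5,1,2,reviewed,O
""")
train = pd.read_csv(csv)

# One row per sentence: join the words of each Sent_ID in their original order.
sentences = (
    train.groupby("Sent_ID", sort=False)["Word"]
    .agg(lambda words: " ".join(words.astype(str)))
    .reset_index(name="Sentence")
)
print(sentences)
```

This replaces the iterrows loop from the question entirely, so the per-row assignment that produced the nested Series is no longer needed.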
I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, that condition needs to take in input both from the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
if height1 > height2 and weight1 < weight2:
return True
else:
return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions, using a custom function:
def f(x):
#check boolean mask
#print ((df.height > x.height) & (df.weight < x.weight))
return ((df.height < x.height) & (df.weight > x.weight)).sum()
df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against every row of the dataframe; the result is a boolean mask, and summing it counts the True values.
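If apply turns out to be slow on a larger table, the same count can be sketched with NumPy broadcasting, which builds the full row-by-row comparison matrix in one step. This assumes the condition used in f above (other rows that are shorter and heavier) and that an n x n boolean matrix fits in memory:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Adam", "Bill", "Chris", "Eric", "Fred", "Gary", "Henry"],
    "height": [70, 65, 71, 72, 74, 75, 61],
    "weight": [180, 190, 150, 210, 160, 220, 230],
})

h = df["height"].to_numpy()
w = df["weight"].to_numpy()

# pairwise[i, j] is True when row j is shorter and heavier than row i.
pairwise = (h[None, :] < h[:, None]) & (w[None, :] > w[:, None])

# Summing along each row counts the matching "other" rows for that person.
df["new_column"] = pairwise.sum(axis=1)
print(df)
```

For more complicated logic, the comparison line is the only part that needs to change, as long as it can be written with array operations.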
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons who are both taller AND heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to change the & above to a |, or to switch the > to a <.
When you're more accustomed to pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)