Skip to content

pygmt.grdtrack: Wrong column names of output pandas DataFrame #2703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
yvonnefroehlich opened this issue Sep 29, 2023 · 2 comments
Open

pygmt.grdtrack: Wrong column names of output pandas DataFrame #2703

yvonnefroehlich opened this issue Sep 29, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@yvonnefroehlich
Copy link
Member

yvonnefroehlich commented Sep 29, 2023

Description of the problem

This is related to the GMT forum post at https://forum.generic-mapping-tools.org/t/type-error-with-pygmt-grdtrack-possible-bug-or-intended/4286.

If the input pandas DataFrame contains an additional column with datetime objects, pygmt.grdtrack mixes up the column names of the output pandas Dataframe.

Maybe this is partly related to the issue #2463?

Minimal Complete Verifiable Example

import pygmt
import pandas as pd
import numpy as np
import datetime

np.random.seed(100)

# Make random test data
df = pd.DataFrame(columns=["lon", "lat", "trash", "trashdate"])
df["lon"] = 50 + 2 * np.random.randn(100)
df["lat"] = 55 + 4 * np.random.randn(100)
df["trash"] = 53 + 5 * np.random.randn(100)
# Correct column names after applying grdtrack
# df["trashdate"] = 51 + 7 * np.random.randn(100)
# Mixed up column names after applying grdtrack
df["trashdate"] = (51 + 7 * np.random.randn(100)) \
    * datetime.timedelta(seconds=3e7) + datetime.date(1950, 1, 1)

# Make mesh
spacing = 0.5
region = [min(df.lon), max(df.lon), min(df.lat), max(df.lat)]
n_lon = round((region[1] - region[0]) / spacing) + 1
n_lat = round((region[3] - region[2]) / spacing) + 1
lat, lon = np.mgrid[region[2]:region[3]:n_lat*1j, region[0]:region[1]:n_lon*1j]

# Make GMT ready grid with random z values
pygmt.xyz2grd(
    x=lon.flatten(),
    y=lat.flatten(),
    z=50 + 1 * np.random.randn(len(lon.flatten())),
    region=region,
    spacing=spacing,
    outgrid="test.grd",
)

# Extrakt the z values of the grid at the points stored in the pandas.DataFrame
# and write them to the new column "z_grid"
df_new = pygmt.grdtrack(
    grid="test.grd",
    points=df,
    newcolname="z_grid",
)

Compare the pandas DataFrames:

df
          lon        lat      trash   trashdate
0   46.500469  48.181395  56.022118  1994-11-01
1   50.685361  50.454956  48.464848  1998-02-18
2   52.306072  43.106738  55.960116  2019-08-08
3   49.495128  55.133269  50.814678  2020-09-02
4   51.962642  54.004445  53.508879  2009-02-25
..        ...        ...        ...         ...
95  50.006035  57.596692  48.621346  1999-09-30
96  49.847953  54.300874  53.588042  1984-11-06
97  50.007915  59.069057  48.209158  2003-09-13
98  49.629972  52.600068  43.594427  1996-12-07
99  45.025697  61.304667  53.911714  1990-08-08

[100 rows x 4 columns]

and

df_new
          lon        lat      trash  trashdate      z_grid
0   46.500469  48.181395  56.022118  50.241717  1994-11-01
1   50.685361  50.454956  48.464848  50.100797  1998-02-18
2   52.306072  43.106738  55.960116  50.701331  2019-08-08
3   49.495128  55.133269  50.814678  48.872544  2020-09-02
4   51.962642  54.004445  53.508879  50.619094  2009-02-25
..        ...        ...        ...        ...         ...
95  50.006035  57.596692  48.621346  50.866452  1999-09-30
96  49.847953  54.300874  53.588042  50.468180  1984-11-06
97  50.007915  59.069057  48.209158  49.848821  2003-09-13
98  49.629972  52.600068  43.594427  50.984231  1996-12-07
99  45.025697  61.304667  53.911714  50.887035  1990-08-08

[100 rows x 5 columns]

Full error message

No error message occurs, but the column names after applying grdtrack are mixed up in case the pandas DataFrame contains a column with datetime objects.

System information

PyGMT information:
  version: v0.10.1.dev25
System information:
  python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]
  executable: C:\ProgramData\Anaconda3\envs\pygmt_env_dev\python.exe
  machine: Windows-10-10.0.19045-SP0
Dependency information:
  numpy: 1.24.3
  pandas: 2.0.2
  xarray: 2023.1.1.dev17
  netCDF4: 1.6.2
  packaging: 23.1
  contextily: 1.3.0
  geopandas: 0.13.2
  IPython: 8.14.0
  rioxarray: 0.14.1
  ghostscript: 9.54.0
GMT library information:
  binary version: 6.4.0
  cores: 4
  grid layout: rows
  image layout: 
  library path: C:/ProgramData/Anaconda3/envs/pygmt_env_dev/Library/bin/gmt.dll
  padding: 2
  plugin dir: C:/ProgramData/Anaconda3/envs/pygmt_env_dev/Library/bin/gmt_plugins
  share dir: C:/Program Files (x86)/gmt6/share
  version: 6.4.0
@yvonnefroehlich yvonnefroehlich added the bug Something isn't working label Sep 29, 2023
@seisman
Copy link
Member

seisman commented Oct 2, 2023

The following two GMT commands explain the behavior of the GMT's grdtrack module:

$ echo 46  48  56  1994-11-01 | gmt grdtrack -G@earth_relief_01d_g    
46	48	56	1994-11-01T00:00:00	-2

$ echo 46  48  56  TEXT | gmt grdtrack -G@earth_relief_01d_g
46	48	56	-2	TEXT

In the first case, the z value (-2) is appended as the last column, while in the 2nd case, the z value is inserted at the 3rd case (column number starting from 0) and the text column is appended as the last column. The philosophy of the GMT I/O mechanism is always putting text at the end of a row.

In your example, the df["trashdate"] has an object dtype, which is recognized as a string column, rather then a datetime column.

df["trashdate"] = (51 + 7 * np.random.randn(100)) \
    * datetime.timedelta(seconds=3e7) + datetime.date(1950, 1, 1)

To fix the issue, you need to explicitly convert the column to a datetime column instead. For example:

df["trashdate"] = (51 + 7 * np.random.randn(100)) \
    * datetime.timedelta(seconds=3e7) + datetime.date(1950, 1, 1)
df["trashdate"] = df["trashdate"].astype("datetime64[ms]")

@seisman
Copy link
Member

seisman commented Mar 10, 2025

Updates on the status of the issue.

Since PyGMT v0.14.1, the column names of the output are correct:

          lon        lat      trash     trashdate     z_grid
0   46.500469  48.181395  56.022118  7.836480e+08  50.241717
1   50.685361  50.454956  48.464848  8.877600e+08  50.100797
2   52.306072  43.106738  55.960116  1.565222e+09  50.701331
3   49.495128  55.133269  50.814678  1.599005e+09  48.872544
4   51.962642  54.004445  53.508879  1.235520e+09  50.619094
..        ...        ...        ...           ...        ...
95  50.006035  57.596692  48.621346  9.386496e+08  50.866452
96  49.847953  54.300874  53.588042  4.685472e+08  50.468180
97  50.007915  59.069057  48.209158  1.063411e+09  49.848821
98  49.629972  52.600068  43.594427  8.499168e+08  50.984231
99  45.025697  61.304667  53.911714  6.500736e+08  50.887035

That's because the trashdate column with the object dtype can be correctly recognized as np.datetime64 dtype and passed to GMT C API as "GMT_DATETIME" type. So, the original issue has been fixed.

However, we still have an issue: the output trashdate is in float64 dtype, but we expect it to be in datetime dtype.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants