Skip to content

charindex support for varbinary and enhancement #3810

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: BABEL_5_X_DEV
Choose a base branch
from

Conversation

herambh-shah
Copy link
Contributor

@herambh-shah herambh-shah commented Jun 6, 2025

Description

"This Pull Request includes two main enhancements to the CHARINDEX function:

  1. Added Support for Data Types:
  • Now supports VARBINARY arguments
  • Now supports BINARY arguments
  1. Performance Optimization:
  • Reimplemented system function from SQL to C
  • Improved execution efficiency by moving from SQL-based to C-based implementation

The changes aim to enhance both functionality and performance of the CHARINDEX function."

Issues Resolved

[BABEL-3659] - CHARINDEX() returns incorrect result on [VAR]BINARY arguments
[BABEL-5912] - Enhancement: Charindex Function

Test Scenarios Covered

  • Use case based
  • Boundary conditions
  • Arbitrary inputs
  • Negative test cases

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is under the terms of the Apache 2.0 and PostgreSQL licenses, and grant any person obtaining a copy of the contribution permission to relicense all or a portion of my contribution to the PostgreSQL License solely to contribute all or a portion of my contribution to the PostgreSQL open source project.

For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@herambh-shah herambh-shah force-pushed the jira-babel-3659 branch 2 times, most recently from fc5bdd0 to 583acb1 Compare June 6, 2025 22:23
@coveralls
Copy link
Collaborator

coveralls commented Jun 6, 2025

Pull Request Test Coverage Report for Build 15526664620

Details

  • 39 of 41 (95.12%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 75.467%

Changes Missing Coverage Covered Lines Changed/Added Lines %
contrib/babelfishpg_tsql/runtime/functions.c 39 41 95.12%
Totals Coverage Status
Change from base Build 15464616681: 0.01%
Covered Lines: 48821
Relevant Lines: 64692

💛 - Coveralls

@herambh-shah herambh-shah requested a review from forestkeeper June 9, 2025 19:40
@herambh-shah herambh-shah marked this pull request as ready for review June 9, 2025 19:47
/* Check for NULL pattern and throw error */
if (PG_ARGISNULL(0))
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error code ERRCODE_NULL_VALUE_NOT_ALLOWED looks not match the err msg ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @forestkeeper for pointing this out. I believe we should change it to "NULL" instead of throwing error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as per our discussion for cases like charindex(NULL, varbinary) where it throws error, will try to handle it in analyzer.

/* Get function arguments */
expressionToFind = PG_GETARG_TEXT_PP(0);
expressionToSearch = PG_GETARG_TEXT_PP(1);
start_location = PG_GETARG_INT32(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have NULL check for 3rd arg as well. Else, while fetching the value like this, server may crash. Also add such test case.

Copy link
Contributor Author

@herambh-shah herambh-shah Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating the sys function to STRICT so that we wont have to handle it explicitly.

collation = PG_GET_COLLATION();

/* If no collation specified, use default */
if (!OidIsValid(collation))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this should not happen. We should have error condition here instead of assigning DEFAULT COLLATION. Let's throw error llike check_collation_set function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the "if" condition.

else
{
/* Get substring starting from start_location */
text *substr = DatumGetTextP(DirectFunctionCall2(text_substr_no_len,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use DatumGetTextPP instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

if (pos > 0)
result = pos + start_location - 1;

pfree(substr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a sanity check whether substr is NULL or not. Also, what will be the impact on pos if substr is NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip this because DatumGetTextPP will never return NULL value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatumGetTextPP can return pointer to null. Here you are obtaining the substring using a function, what if that function returns NULL itself? There are case where it will happen

if (start_location <= 0)
{
/* Use byteapos directly */
result = DatumGetInt32(DirectFunctionCall2(byteapos,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments for tsql_charindex_char apply here as well

@@ -0,0 +1,200 @@
-- Basic syntax: CHARINDEX(substring, string [, start_location])
SELECT CHARINDEX('abc', 'xxabcxx'); -- Returns 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have test where 3rd arg is -ve

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added !

GO

-- Force case sensitivity
SELECT CHARINDEX('ABC', 'xxabcxx' COLLATE Latin1_General_CS_AS); -- Returns 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test where collation of both args are different, we expect to throw error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add another test like select charindex('abc' collate DATABASE_DEFAULT, 'abc' collate latin1_general_ci_ai). This should also throw error.
Let's have explicit collation on both args with same collations as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

GO

-- Special character
SELECT CHARINDEX(' ', 'Hello World'); -- Returns 6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check for wildcard characters like %, [], - etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes , added . Thanks

GO

-- Special character
SELECT CHARINDEX(' ', 'Hello World'); -- Returns 6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expressionToSearch is empty?

Copy link
Contributor Author

@herambh-shah herambh-shah Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gives "0". Added more test case for spaces and empty string
SELECT CHARINDEX('Hello World', ''); -- return 0
SELECT CHARINDEX('', ''); -- return 0
SELECT CHARINDEX(' ', ' '); (10 spaces , 3 spaces). -- return 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly add tests for the same

GO
SELECT CHARINDEX('abc', 'xxabcxx', 1); -- Returns 3
GO
SELECT CHARINDEX('abc', 'xxabcxx', 4); -- Returns 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the 3rd arg is NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it gives NULL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly add tests for the same

Comment on lines +5189 to +5193
/* Check for NULL pattern and throw error */
if (PG_ARGISNULL(0))
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("Argument data type is invalid for argument 2 of charindex function.")));
Copy link
Contributor

@rohit01010 rohit01010 Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we throwing invalid argument data type error when pattern is NULL, can you test following query

select charindex(cast(NULL as varbinary), 0x12345678)
go

Copy link
Contributor Author

@herambh-shah herambh-shah Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed above condition.
But in sql server, this is the behavior, I observed for NULL with varbinary arguments -
CASE 1: charindex(NULL, var[binary] ) -> throws error
CASE 2: charindex(d , var[binary]) -> where col d is having NULL values -> gives NULL
when we provide constant NULL value in charindex function along with varbinary argument, it seems sql server throws error
But if NULL value is given through a column, it returns NULL.
It might be the case with other functions also.

Comment on lines +5127 to +5128
if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
PG_RETURN_NULL();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corresponding SQL function which uses this is already marked as STRICT which returns NULL on NULL argument, why do we need to check explicitly here as well ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we actually don't need it . I was following other function convection and thought should use it as a pre-check.

Comment on lines +925 to +936
-- Not STRICT because VAR[BINARY] needs custom NULL handling
CREATE OR REPLACE FUNCTION sys.charindex(expressionToFind VARBINARY,
expressionToSearch VARBINARY,
start_location INTEGER DEFAULT 0)
RETURNS INTEGER AS 'babelfishpg_tsql', 'tsql_charindex_binary'
LANGUAGE C IMMUTABLE PARALLEL SAFE;

CREATE OR REPLACE FUNCTION sys.charindex(expressionToFind BINARY,
expressionToSearch BINARY,
start_location INTEGER DEFAULT 0)
RETURNS INTEGER AS 'babelfishpg_tsql', 'tsql_charindex_binary'
LANGUAGE C IMMUTABLE PARALLEL SAFE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test charindex with UDTs created on binary and varbinary. For example,

declare @a varbinary_udt = 0x656667
declare @b varbinary_udt = 0x707172656667
select charindex(@a, @b)
go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

DECLARE
exception_message text;
BEGIN
ALTER FUNCTION sys.charindex RENAME TO charindex_deprecated_in_5_3_0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use sys.charindex(PG_CATALOG.TEXT, PG_CATALOG.TEXT, INTEGER) explicitly, else it will raise error 'function name "sys.charindex" is not unique' if we are trying to upgrade from a lower version also has this fix to this version. Although we had added the exception handling which will convert that error to warning, but its better not to throw such incorrect warnings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it completely as per your below comment.

Comment on lines +466 to +478
DO $$
DECLARE
exception_message text;
BEGIN
ALTER FUNCTION sys.charindex RENAME TO charindex_deprecated_in_5_3_0;
EXCEPTION WHEN OTHERS THEN
GET STACKED DIAGNOSTICS
exception_message = MESSAGE_TEXT;
RAISE WARNING '%', exception_message;
END;
$$;

CALL sys.babelfish_drop_deprecated_object('function', 'sys', 'charindex_deprecated_in_5_3_0');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are only updating the body, language, strictness and parallel safety of the function and not the definition(i.e. the number of arguments or argument types or return type), i do not see any point in deprecating the function. so why are we deprecating this function ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohkk. I didn't know it before. I thought whenever we update any sys_functions, we need to add it in upgrade file also. Thanks for noticing it. I will remove it from upgrade file.

-- Create tables with specific collations
CREATE TABLE charindex_tests.ChineseTest (
ID INT,
Content NVARCHAR(100) COLLATE Chinese_PRC_CI_AS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets also test other string datatypes columns such as VARCHAR, CHAR, NCHAR, with specific collation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Thanks

( CAST('Test©®™' AS BINARY(50)), CAST('Special©Characters®™' AS VARBINARY(MAX)), 'Special characters' );
GO
~~ROW COUNT: 4~~

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets also add tests for dependent objects(views, procedures, functions, computed columns, constraint) on charindex, to tests its behaviour in the upgrade scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added .

@@ -628,3 +628,4 @@ babel_test_numeric_int8_oper
Test_cast_binary_bytea
numeric_money
MONEY_OBJECT
charindex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add charindex to schedule files of other versions to test the charindex tests in upgrades from other versions as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants