string_matching(qry_a, qry_b, params_as_json)
</br>
Methodology:
For each
a_str
, this function finds the best matchingb_str
. This function utilizes the Jaro-Winkler algorithm, and provides additional fine-tuning paramitization for segmenting, sorting, partial matching, and ranking of all permutations therein.
Query Strings:
This function returns a number between 0 and 1 quantifying the similarity between the results of two postgreSQL search queries. The input queries qry_a
and qry_b
must designate the aliases (a_str
, a_idx
) and (b_str
, b_idx
), respectively.
For example, these tables:
DROP TABLE IF EXISTS z_string_test_1;
CREATE TABLE z_string_test_1 AS (
SELECT
ARRAY['MARTHA','DWAYNE','DIXON'] a_str,
ARRAY[1,2,3] a_idx,
ARRAY['DUANE','MARHTA','DICKSONX'] b_str,
ARRAY[2,3,4] b_idx
);
DROP TABLE IF EXISTS z_string_test_2;
CREATE TABLE z_string_test_2 AS (
SELECT
UNNEST(a_str) a_str,
UNNEST(a_idx) a_idx,
UNNEST(b_str) b_str,
UNNEST(b_idx) b_idx
FROM z_string_test_1
);
queried as:
SELECT * FROM z_string_matching(
'SELECT a_str a_str,a_idx a_idx FROM z_string_test_2',
'SELECT b_str b_str,b_idx b_idx FROM z_string_test_2',
'false'
)
SELECT * FROM z_string_matching(
'SELECT UNNEST(a_str) a_str,UNNEST(a_idx) a_idx FROM z_string_test_1',
'SELECT UNNEST(b_str) b_str,UNNEST(b_idx) b_idx FROM z_string_test_1',
''
)
each produces the generally accepted results:
# | a_idx | a_str | jaro_score | b_str | b_idx | other_matches |
---|---|---|---|---|---|---|
0 | 1 | MARTHA | 0.9444 | MARHTA | 3 | {} |
1 | 2 | DWAYNE | 0.8222 | DUANE | 2 | {} |
2 | 3 | DIXON | 0.7667 | DICKSONX | 4 | {} |
Of further note:
- the Jaro-Winkler algorithm values first letter(s) matches » last letter matches
jaro_score
equals 0.0 where any string evaluated has a length less than 4other_matches
is the index (i.e.,b_idx
) of the otherb_str
matches having the same best jaro score- avoid using “pllua_“ as a prefix for any alias relating to
qry_a
orqry_b
Parameter String:
The params_as_json
argument is either:
- (a) the case-insensitive string “false”, or,
- (b) a one dimensional JSON string.
If a JSON string, it may comprise any combination of the below Parameter List.
Some general points:
- all keys and values of
params_as_json
are strings enclosed with double quotes (and not two single quotes) - all boolean values are case insensitive
PARAMETERS:
first_match_only
: Upon finding ab_str
match for a particulara_str
given the input criteria, stop further processing re:a_str
and return the first-matchedb_str
. This option often works well in conjunction with minimal iterative steps and high-score conditions to quickly extract the easy matches and minimize the number of future, more arduous, steps.
- ==[x] implement==
- ==[ ] finish note==
a_cols_as_prefix
1,a_cols_as_suffix
1: Each of these parameters is a string split by a semi-colon and made into a list that defines additional return columns inqry_a
. The prefix and suffix parameters define what set of strings, and in what order, will be concatenated with the base return column (i.e.,a_str
). The parametersdiv_str
andconcat_str
(described next) define how the string set will be split and joined, respectively, and then compared.
- ==[x] implement==
- ==[ ] finish note==
concat_str
: This parameter defines what character set is used to combine string parts, if applicable.
-
Default:
" "
- ==[x] implement==
- ==[ ] finish note==
div_str
: This parameter defines what characters are used to split strings into segments, and is used by default to normalize string before comparison (seea_str_mod
for more detail). This parameter, used in conjunction witha_str_mod
, can operate as a regular expression substitute.2 A semicolon (==”;”==) marks a break point between segments of characters used to split a string (“splitting segment” hereinafter).
- Multiple criteria of any non-zero length may constitute a splitting segment.
- All splitting segment space(s) (e.g., ==” ; ; ;”==) must be at the beginning of
div_str
. - Only single semi-colons can currently be used as splitting segments (e.g., ==”;;”==), and it must be put at the end of
div_str
. -
**Default: ==” ;-;_;/;\; ;&;;;”==**
- ==[X] implement==
- ==[ ] finish note==
See String Manipulation examples.
a_str_mod
1: this parameter designates an additional option for evaluating variations of the query inputa_str
.
Available options:
- ==”norm”== or ==””== or (undesignated) (Default): By default,
a_str
andb_str
are split by all splitting characters defined bydiv_str
and then concatenated with the value ofconcat_str
. This step effectively normalizes strings and usually allows for more consistent scoring. - ==”none”== or ==”false”==: These case-insensitive options disable the normalization step.
- ==”iter”==: With this option, this function splits each
a_str
bydiv_str
and evaluates each segment therein. For example, wherea_str
equals “one-two-three”, this function will attempt to return the best match for all cases wherea_str
equals “one”, “two”, or “three” (assumingdiv_str
is ==”-“== or includes ==”-;”==). - ==”perm”==: Here, the function splits each
a_str
bydiv_str
and evaluates all permutations of segment arrangements. For example, wherea_str
equals “one-two-three”, this function will attempt to return the best match for cases wherea_str
equals “two-one-three”, “three-one-two”, etc… (assumingdiv_str
is ==”-“== or includes ==”-;”==).
- ==[X] implement==
- ==[ ] finish note==
See String Permutation examples.
a_str_condition
1: This parameter provides a mechanism for limiting the results of the input queries (e.g.,qry_a
) when the input queries return, that is, the result that this function then scores.
- ==[x] implement==
- ==[ ] finish note==
a_idx_condition
: Wheniter_a_str_perms
equals “true”, this condition is handy to minimize the iterations ofa_str
that are compared with everyb_str
result.
- ==[x] implement==
- ==[ ] finish note==
update_cond
: This allows for fine-tuning when and what results are returned.
- ==[x] implement==
- ==[ ] finish note==
String Manipulation Examples:
- ==[ ] implement==
- ==[ ] finish note==
String Permutation Examples:
- ==[ ] implement==
- ==[ ] finish note==
Reference to Related Functions:
-
string_matching_mgr
- ==[ ] implement==
- ==[ ] finish note==
Add this webpage to pgSQL function comments:
- ==[ ] implement==
-
This description relating to
qry_a
applies similarly toqry_b
. ‘&8617;’ ‘&8617;’2 ‘&8617;’3 ‘&8617;’4 -
Lua regular expression patterns are somewhat uncommon, and therefore, were not natively integrated. For example, (1) “or” logic does not exist (or did not exist when developing), (2) matching patterns cannot be quantified (e.g., “{n,}”), and (3) there is no mechanism to capture the n-th string match where n>9. ‘&8617;’