string_matching(qry_a, qry_b, params_as_json)
</br>
Methodology:
For each
a_str, this function finds the best matchingb_str. This function utilizes the Jaro-Winkler algorithm, and provides additional fine-tuning paramitization for segmenting, sorting, partial matching, and ranking of all permutations therein.
Query Strings:
This function returns a number between 0 and 1 quantifying the similarity between the results of two postgreSQL search queries. The input queries qry_a and qry_b must designate the aliases (a_str, a_idx) and (b_str, b_idx), respectively.
For example, these tables:
DROP TABLE IF EXISTS z_string_test_1;
CREATE TABLE z_string_test_1 AS (
SELECT
ARRAY['MARTHA','DWAYNE','DIXON'] a_str,
ARRAY[1,2,3] a_idx,
ARRAY['DUANE','MARHTA','DICKSONX'] b_str,
ARRAY[2,3,4] b_idx
);
DROP TABLE IF EXISTS z_string_test_2;
CREATE TABLE z_string_test_2 AS (
SELECT
UNNEST(a_str) a_str,
UNNEST(a_idx) a_idx,
UNNEST(b_str) b_str,
UNNEST(b_idx) b_idx
FROM z_string_test_1
);
queried as:
SELECT * FROM z_string_matching(
'SELECT a_str a_str,a_idx a_idx FROM z_string_test_2',
'SELECT b_str b_str,b_idx b_idx FROM z_string_test_2',
'false'
)
SELECT * FROM z_string_matching(
'SELECT UNNEST(a_str) a_str,UNNEST(a_idx) a_idx FROM z_string_test_1',
'SELECT UNNEST(b_str) b_str,UNNEST(b_idx) b_idx FROM z_string_test_1',
''
)
each produces the generally accepted results:
| # | a_idx | a_str | jaro_score | b_str | b_idx | other_matches |
|---|---|---|---|---|---|---|
| 0 | 1 | MARTHA | 0.9444 | MARHTA | 3 | {} |
| 1 | 2 | DWAYNE | 0.8222 | DUANE | 2 | {} |
| 2 | 3 | DIXON | 0.7667 | DICKSONX | 4 | {} |
Of further note:
- the Jaro-Winkler algorithm values first letter(s) matches » last letter matches
jaro_scoreequals 0.0 where any string evaluated has a length less than 4other_matchesis the index (i.e.,b_idx) of the otherb_strmatches having the same best jaro score- avoid using “pllua_“ as a prefix for any alias relating to
qry_aorqry_b
Parameter String:
The params_as_json argument is either:
- (a) the case-insensitive string “false”, or,
- (b) a one dimensional JSON string.
If a JSON string, it may comprise any combination of the below Parameter List.
Some general points:
- all keys and values of
params_as_jsonare strings enclosed with double quotes (and not two single quotes) - all boolean values are case insensitive
PARAMETERS:
first_match_only: Upon finding ab_strmatch for a particulara_strgiven the input criteria, stop further processing re:a_strand return the first-matchedb_str. This option often works well in conjunction with minimal iterative steps and high-score conditions to quickly extract the easy matches and minimize the number of future, more arduous, steps.
- ==[x] implement==
- ==[ ] finish note==
a_cols_as_prefix1,a_cols_as_suffix1: Each of these parameters is a string split by a semi-colon and made into a list that defines additional return columns inqry_a. The prefix and suffix parameters define what set of strings, and in what order, will be concatenated with the base return column (i.e.,a_str). The parametersdiv_strandconcat_str(described next) define how the string set will be split and joined, respectively, and then compared.
- ==[x] implement==
- ==[ ] finish note==
concat_str: This parameter defines what character set is used to combine string parts, if applicable.
-
Default:
" " - ==[x] implement==
- ==[ ] finish note==
div_str: This parameter defines what characters are used to split strings into segments, and is used by default to normalize string before comparison (seea_str_modfor more detail). This parameter, used in conjunction witha_str_mod, can operate as a regular expression substitute.2 A semicolon (==”;”==) marks a break point between segments of characters used to split a string (“splitting segment” hereinafter).
- Multiple criteria of any non-zero length may constitute a splitting segment.
- All splitting segment space(s) (e.g., ==” ; ; ;”==) must be at the beginning of
div_str. - Only single semi-colons can currently be used as splitting segments (e.g., ==”;;”==), and it must be put at the end of
div_str. -
**Default: ==” ;-;_;/;\; ;&;;;”==**
- ==[X] implement==
- ==[ ] finish note==
See String Manipulation examples.
a_str_mod1: this parameter designates an additional option for evaluating variations of the query inputa_str.
Available options:
- ==”norm”== or ==””== or (undesignated) (Default): By default,
a_strandb_strare split by all splitting characters defined bydiv_strand then concatenated with the value ofconcat_str. This step effectively normalizes strings and usually allows for more consistent scoring. - ==”none”== or ==”false”==: These case-insensitive options disable the normalization step.
- ==”iter”==: With this option, this function splits each
a_strbydiv_strand evaluates each segment therein. For example, wherea_strequals “one-two-three”, this function will attempt to return the best match for all cases wherea_strequals “one”, “two”, or “three” (assumingdiv_stris ==”-“== or includes ==”-;”==). - ==”perm”==: Here, the function splits each
a_strbydiv_strand evaluates all permutations of segment arrangements. For example, wherea_strequals “one-two-three”, this function will attempt to return the best match for cases wherea_strequals “two-one-three”, “three-one-two”, etc… (assumingdiv_stris ==”-“== or includes ==”-;”==).
- ==[X] implement==
- ==[ ] finish note==
See String Permutation examples.
a_str_condition1: This parameter provides a mechanism for limiting the results of the input queries (e.g.,qry_a) when the input queries return, that is, the result that this function then scores.
- ==[x] implement==
- ==[ ] finish note==
a_idx_condition: Wheniter_a_str_permsequals “true”, this condition is handy to minimize the iterations ofa_strthat are compared with everyb_strresult.
- ==[x] implement==
- ==[ ] finish note==
update_cond: This allows for fine-tuning when and what results are returned.
- ==[x] implement==
- ==[ ] finish note==
String Manipulation Examples:
- ==[ ] implement==
- ==[ ] finish note==
String Permutation Examples:
- ==[ ] implement==
- ==[ ] finish note==
Reference to Related Functions:
-
string_matching_mgr
- ==[ ] implement==
- ==[ ] finish note==
Add this webpage to pgSQL function comments:
- ==[ ] implement==
-
This description relating to
qry_aapplies similarly toqry_b. ‘&8617;’ ‘&8617;’2 ‘&8617;’3 ‘&8617;’4 -
Lua regular expression patterns are somewhat uncommon, and therefore, were not natively integrated. For example, (1) “or” logic does not exist (or did not exist when developing), (2) matching patterns cannot be quantified (e.g., “{n,}”), and (3) there is no mechanism to capture the n-th string match where n>9. ‘&8617;’