Things to do for GNU grep

  Copyright (C) 1992, 1997-2002, 2004-2017 Free Software Foundation, Inc.

  Copying and distribution of this file, with or without modification,
  are permitted in any medium without royalty provided the copyright
  notice and this notice are preserved.

===============
Short term work
===============

See where we are with UTF-8 performance.

Merge Debian patches that seem relevant.

Go through patches in Savannah.

Fix --directories=read.

Write better Texinfo documentation for grep.  The manual page would be a
good place to start, but Info documents are also supposed to contain a
tutorial and examples.

Some tests in tests/spencer2.tests should have failed!  Need to filter out
some bugs in dfa.[ch]/regex.[ch].

Multithreading?

GNU grep originally did 32-bit arithmetic.  Although it has moved to
64-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
this conversion has not been entirely systematic and should be checked.

Lazy dynamic linking of libpcre.  See Debian's 03-397262-dlopen-pcre.patch.

Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one
binary.  Is there a possibility of doing even better by automatically
checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
0x9D for compress, and 0x42 0x5A 0x68 for bzip2)?  Once what to do with
libpcre is decided, do the same for libz and libbz2.

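As a rough illustration of the magic-number check suggested above, here
is a minimal Python sketch.  The byte sequences are the ones listed in
the text; the names MAGIC_PREFIXES and sniff_compression are
hypothetical and do not exist in grep.

```python
import bz2
import gzip

# Magic numbers as listed in the TODO text; the table and function names
# are illustrative assumptions, not part of grep.
MAGIC_PREFIXES = (
    (b"\x1f\x8b", "gzip"),
    (b"\x1f\x9d", "compress"),
    (b"\x42\x5a\x68", "bzip2"),
)


def sniff_compression(header):
    """Return the format suggested by the leading bytes, or None."""
    for prefix, name in MAGIC_PREFIXES:
        if header.startswith(prefix):
            return name
    return None


if __name__ == "__main__":
    # Only the first few bytes of each stream are needed for the check.
    print(sniff_compression(gzip.compress(b"hello")[:4]))
    print(sniff_compression(bz2.compress(b"hello")[:4]))
    print(sniff_compression(b"plain text"))
```

A prefix check like this only suggests a format; a real implementation
would still have to cope with a decompressor rejecting the data.
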
===================
Matching algorithms
===================

Take a look at these and consider opportunities for merging or cloning:

   -- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
   -- ja-grep's mlb2 patch (Japanese grep)
      <http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
   -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
      <http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
   -- cgrep (Context grep) <https://awgn.github.io/cgrep/>
      seems like nice work;
   -- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
   -- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
      from glimpse;
   -- nr-grep (Nondeterministic reverse grep)
      <https://www.dcc.uchile.cl/~gnavarro/software/>;
   -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
   -- freegrep <https://github.com/howardjp/freegrep>;

Check some new algorithms for matching.  See, for example, Faro &
Lecroq (cited in kwset.c).

Fix the DFA matcher to never use exponential space.  (Fortunately, these
cases are rare.)

============================
Standards: POSIX and Unicode
============================

For POSIX compliance issues, see POSIX 1003.1.

Current support for the POSIX [= =] and [. .] constructs is limited to
platforms whose regular expression matchers are sufficiently
compatible with the GNU C library so that the --without-included-regex
option of 'configure' is in effect.  Extend this support to non-glibc
platforms, where --with-included-regex is in effect, by modifying the
included version of the regex code to defer to the native version when
handling [= =] and [. .].

For Unicode, interesting things to check include the Unicode Standard
<http://www.unicode.org/standard/standard.html> and the Unicode Technical
Standard #18 (<http://www.unicode.org/reports/tr18/> "Unicode Regular
Expressions").  Talk to Bruno Haible who's maintaining GNU libunistring.
See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
"Unicode Normalization Forms"), already implemented by GNU libunistring.

In particular, --ignore-case needs to be evaluated against the standards.
We may want to deviate from POSIX if Unicode provides better or clearer
semantics.

POSIX and --ignore-case
-----------------------

For this issue, interesting things to check in POSIX include the Open
Group Base Specifications, Chapter "Regular Expressions", in particular
Section "Regular Expression General Requirements" and its paragraph
about caseless matching (this may not have been fully thought through,
and this text may be self-contradicting [specifically: "of either data
or patterns" versus all the rest]).  See:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02

In particular, consider the following with POSIX's approach to case
folding in mind.  Assume a non-Turkic locale with a character
repertoire reduced to the following various forms of "LATIN LETTER I":

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
  LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1).  This implies that
whole UTF-8 strings cannot be case-converted in place, using the same
memory buffer, and that the needed octet-size of the new buffer cannot
merely be guessed (although there's a simple upper bound of five times
the size of the input, as the longest UTF-8 encoding of any character
is five bytes).

We have

  lc(I) = i, uc(I) = I
  lc(i) = i, uc(i) = I
  lc(İ) = i, uc(İ) = İ
  lc(ı) = ı, uc(ı) = I

where lc() and uc() denote lower-case and upper-case conversions.

There are several candidate --ignore-case logics.

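The octet-length observation above can be checked with a small Python
sketch.  The LC and UC tables are transcribed from the lc()/uc() list
in the text; the helper names utf8_len and convert are hypothetical,
and the tables deliberately cover only the four letters of the example.

```python
# Simple one-to-one case mappings for the four example letters
# (non-Turkic locale), transcribed from the text.  \u0130 is capital I
# with dot above, \u0131 is small dotless i.
LC = {"I": "i", "i": "i", "\u0130": "i", "\u0131": "\u0131"}
UC = {"I": "I", "i": "I", "\u0130": "\u0130", "\u0131": "I"}


def utf8_len(text):
    """Number of octets needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


def convert(text, table):
    """Map each character through table, leaving unknown ones unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = "\u0130\u0131"
    for label, table in (("lc", LC), ("uc", UC)):
        converted = convert(sample, table)
        # The converted string needs a different number of octets than
        # the input, so it cannot simply overwrite the input buffer.
        print(label, utf8_len(sample), "->", utf8_len(converted))
```
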
Using the

  if (lc(input_wchar) == lc(pattern_wchar))

logic leads to the following matches:

     \in  I  i  İ  ı
   pat\   ----------
     I |  Y  Y  Y  n
     i |  Y  Y  Y  n
     İ |  Y  Y  Y  n
     ı |  n  n  n  Y

There is a lack of symmetry between CAPITAL and SMALL LETTERs with
this.

Using the

  if (uc(input_wchar) == uc(pattern_wchar))

logic (which is what GNU grep currently does although this is not
documented or guaranteed in the future), leads to the following
matches:

     \in  I  i  İ  ı
   pat\   ----------
     I |  Y  Y  n  Y
     i |  Y  Y  n  Y
     İ |  n  n  Y  n
     ı |  Y  Y  n  Y

There is a lack of symmetry between CAPITAL and SMALL LETTERs with
this.

Using the

  if (   lc(input_wchar) == lc(pattern_wchar)
      || uc(input_wchar) == uc(pattern_wchar))

logic leads to the following matches:

     \in  I  i  İ  ı
   pat\   ----------
     I |  Y  Y  Y  Y
     i |  Y  Y  Y  Y
     İ |  Y  Y  Y  n
     ı |  Y  Y  n  Y

There is some elegance and symmetry with this.  But there are
potentially two conversions to be made per input character.  If the
pattern is pre-converted, two copies of it need to be kept and used in
a mutually coherent fashion.

Using the

  if (   input_wchar  == pattern_wchar
      || lc(input_wchar) == pattern_wchar
      || uc(input_wchar) == pattern_wchar)

logic (a plausible interpretation of POSIX) leads to the following
matches:

     \in  I  i  İ  ı
   pat\   ----------
     I |  Y  Y  n  Y
     i |  Y  Y  Y  n
     İ |  n  n  Y  n
     ı |  n  n  n  Y

There is a different CAPITAL/SMALL symmetry with this.  But there's
also a loss of pattern/input symmetry that's unique to it.  Also there
are potentially two conversions to be made per input character.

Using the

  if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))

logic leads to the following matches:

     \in  I  i  İ  ı
   pat\   ----------
     I |  Y  Y  Y  Y
     i |  Y  Y  Y  Y
     İ |  Y  Y  Y  Y
     ı |  Y  Y  Y  Y

This shows total symmetry and transitivity (at least in this example
analysis).  There are two conversions to be made per input character,
but support could be added for having a single straight mapping
performing a composition of the two conversions.

Any optimization in the implementation of each logic must not change
its basic semantic.

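The five match tables above can be regenerated mechanically, which
makes it easier to evaluate further candidate logics.  The following
Python sketch assumes the simple lc()/uc() mappings given earlier for
the four-letter repertoire; the dictionary keys and the matrix() helper
are hypothetical names chosen for illustration.

```python
# The four letters of the example: I, i, capital I with dot above
# (\u0130), and small dotless i (\u0131).
LETTERS = ["I", "i", "\u0130", "\u0131"]

# Simple case mappings transcribed from the text (non-Turkic locale).
LC = {"I": "i", "i": "i", "\u0130": "i", "\u0131": "\u0131"}
UC = {"I": "I", "i": "I", "\u0130": "\u0130", "\u0131": "I"}


def lc(ch):
    return LC[ch]


def uc(ch):
    return UC[ch]


# Each candidate logic takes (input_char, pattern_char).
LOGICS = {
    "lc": lambda i, p: lc(i) == lc(p),
    "uc": lambda i, p: uc(i) == uc(p),
    "lc-or-uc": lambda i, p: lc(i) == lc(p) or uc(i) == uc(p),
    "posix": lambda i, p: i == p or lc(i) == p or uc(i) == p,
    "lc-of-uc": lambda i, p: lc(uc(i)) == lc(uc(p)),
}


def matrix(match):
    """One string per pattern row; columns are inputs, as in the tables."""
    return [
        "".join("Y" if match(i, p) else "n" for i in LETTERS)
        for p in LETTERS
    ]


if __name__ == "__main__":
    for name, match in LOGICS.items():
        print(name)
        for pattern, row in zip(LETTERS, matrix(match)):
            print("  ", pattern, "|", "  ".join(row))
```

Because the logics are expressed over explicit tables, an optimized
implementation can be compared against these matrices to confirm that
it has not changed the semantics.
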
Unicode and --ignore-case
-------------------------

For this issue, interesting things to check in Unicode include:

  - The Unicode Standard, Chapter 3
    (<http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
    "Conformance"), Section 3.13 ("Default Case Algorithms") and the
    toCasefold() case conversion operation.

  - The Unicode Standard, Chapter 4
    (<http://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
    "Character Properties"), Section 4.2 ("Case") and
    the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
    SpecialCasing.txt and
    <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
    CaseFolding.txt files.

  - The Unicode Standard, Chapter 5
    (<http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
    "Implementation Guidelines"), Section 5.18 ("Case Mappings"),
    Subsection "Caseless Matching".

  - The Unicode case charts <http://www.unicode.org/charts/case/>.

Unicode uses the

  if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))

logic for caseless matching.  Consider the "LATIN LETTER I" example
mentioned above.  In a non-Turkic locale, simple case folding yields

  toCasefold_simple(U+0049) = U+0069
  toCasefold_simple(U+0069) = U+0069
  toCasefold_simple(U+0130) = U+0130
  toCasefold_simple(U+0131) = U+0131

which leads to the following matches:

     \in  I  i  İ  ı
   pat\   ----------
     I |  Y  Y  n  n
     i |  Y  Y  n  n
     İ |  n  n  Y  n
     ı |  n  n  n  Y

This is different from anything so far!

In a non-Turkic locale, full case folding yields

  toCasefold_full(U+0049) = U+0069
  toCasefold_full(U+0069) = U+0069
  toCasefold_full(U+0130) = <U+0069, U+0307>
  toCasefold_full(U+0131) = U+0131

with

  0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;

which leads to the following matches:

     \in  I  i  İ  ı
   pat\   ----------
     I |  Y  Y  *  n
     i |  Y  Y  *  n
     İ |  n  n  Y  n
     ı |  n  n  n  Y

This is just sad!

Having toCasefold(U+0131), simple or full, map to itself instead of
U+0069 is in contradiction with the rules of Section 5.18 of the
Unicode Standard since toUpperCase(U+0131) is U+0049.
Same thing for toCasefold_simple(U+0130) since toLowerCase(U+0130) is
U+0069.  The justification for the weird toCasefold_full(U+0130)
mapping is unknown; it doesn't even make sense to add a dot (U+0307)
to a letter that already has one (U+0069).  It would have been so
simple to put them all in the same equivalence class!

Otherwise, also consider the following problem with Unicode's approach
on case folding in mind.  Assume that we want to perform

  echo 'AßBC' | grep -i 'Sb'

which corresponds to

  input:    U+0041 U+00DF U+0042 U+0043 U+000A
  pattern:  U+0053 U+0062

Following CaseFolding.txt, applying the toCasefold() transformation to
these yields

  input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
  pattern:  U+0073 U+0062

so, according to this approach, the input should match the pattern.
As long as the original input line is to be reported to the user as a
whole, there is no problem (from the user's point-of-view;
implementation is complicated by this).

However, consider both these GNU extensions:

  echo 'AßBC' | grep -i --only-matching 'Sb'
  echo 'AßBC' | grep -i --color=always 'Sb'

What is to be reported in these cases, since the match begins in the
*middle* of the original input character 'ß'?

Unicode's toCasefold() cannot be implemented in terms of POSIX's
towctrans() since that can only return a single wint_t value per input
wint_t value.
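
The 'AßBC' problem above can be reproduced with a short Python sketch.
It relies on Python's str.casefold(), which applies full case folding
one character at a time; the helper names fold_with_offsets and
find_caseless are hypothetical, and normalization (which full Unicode
caseless matching may also require) is ignored here.

```python
def fold_with_offsets(text):
    """Case-fold text and record, for each folded character, the index
    of the original character it came from."""
    folded = []
    origin = []
    for index, ch in enumerate(text):
        for out in ch.casefold():
            folded.append(out)
            origin.append(index)
    return "".join(folded), origin


def find_caseless(text, pattern):
    """Find the first caseless match of pattern in text.

    Returns (first, last, starts_inside, ends_inside), where first and
    last are indexes into the original text and the two flags say
    whether the match begins or ends part-way through the folding of a
    single original character.  Returns None if there is no match.
    """
    needle = pattern.casefold()
    if not needle:
        return None
    folded, origin = fold_with_offsets(text)
    pos = folded.find(needle)
    if pos < 0:
        return None
    end = pos + len(needle) - 1
    first, last = origin[pos], origin[end]
    starts_inside = pos > 0 and origin[pos - 1] == first
    ends_inside = end + 1 < len(origin) and origin[end + 1] == last
    return first, last, starts_inside, ends_inside


if __name__ == "__main__":
    # \u00df is the sharp s from the example; it folds to "ss".
    print(fold_with_offsets("A\u00dfBC\n"))
    print(find_caseless("A\u00dfBC\n", "Sb"))
```

The starts_inside flag is the case the text asks about: the match
covers only the second half of the folded 'ß', so there is no exact
span of original characters to print or colorize.
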