preg_match and UTF-8 in PHP

Question

I'm trying to search a UTF-8 encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems it is not treating the subject as a UTF-8 encoded string, even though I am passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF-8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

31

pcre php unicode utf-8

asked by JW. 2009-11-12 23:40:34

7 answers

score 17 · Answer 1

Looks like this is a "feature"; see http://bugs.php.net/bug.php?id=37391

The 'u' switch only makes sense for PCRE; PHP itself is not aware of it.

From PHP's point of view, strings are byte sequences, and returning a byte offset seems logical (I'm not saying "correct").
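
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that mbstring.func_overload is not in effect (otherwise strlen itself would be replaced by its multibyte counterpart).

```php
<?php
// "¡" (U+00A1) is encoded in UTF-8 as the two bytes C2 A1,
// so the string below has 6 characters but 7 bytes.
$str = "\xC2\xA1Hola!";

$byteLength = strlen($str);              // counts bytes
$charLength = mb_strlen($str, 'UTF-8');  // counts characters

preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);
$matchOffset = $m[0][1];                 // offset reported by preg_match

var_dump($byteLength);   // int(7)
var_dump($charLength);   // int(6)
var_dump($matchOffset);  // int(2): "H" starts after the two bytes of "¡"
```

The offset agrees with strpos($str, 'H'), which is also byte-based.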

score 34 · Answer 2

Although the u modifier causes both the pattern and the subject to be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
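
The same conversion can be wrapped in a small helper, and it is sometimes needed in the other direction too, because the $offset argument of preg_match is also counted in bytes. The sketch below uses two hypothetical function names (byteToCharOffset and charToByteOffset are not part of PHP) and assumes the byte offset falls on a character boundary, which holds for offsets returned by a /u match.

```php
<?php
// Hypothetical helpers, not part of PHP or mbstring.

// Byte offset (as returned with PREG_OFFSET_CAPTURE) -> character offset.
function byteToCharOffset(string $str, int $byteOffset): int
{
    // Count the characters in the bytes that precede the match.
    return mb_strlen(substr($str, 0, $byteOffset), 'UTF-8');
}

// Character offset -> byte offset (e.g. for preg_match's $offset argument).
function charToByteOffset(string $str, int $charOffset): int
{
    // Count the bytes in the characters that precede the position.
    return strlen(mb_substr($str, 0, $charOffset, 'UTF-8'));
}

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);

echo byteToCharOffset($str, $m[0][1]), "\n"; // 1
echo charToByteOffset($str, 1), "\n";        // 2
```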

score 25 · Answer 3

Try adding (*UTF8) at the start of the regex:

preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment at http://www.php.net/manual/es/function.preg-match.php#95828

score 4 · Answer 4

Excuse the necroposting, but maybe someone will find this useful: the code below can replace the preg_match and preg_match_all functions and returns the correct matches with the correct offsets for UTF-8 encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText = $match[0];
                    $byteOffset  = $match[1]; // offset in bytes, as reported by PCRE
                    $positions[] = array(
                        $matchedText,
                        // count the characters that precede the match
                        mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

score 1 · Answer 5

If all you want to do is find the multibyte-safe position of H, try mb_strpos():

mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";

Output:

¡Hola!
1
H

score 1 · Answer 6

I wrote a small class to convert the byte offsets returned by preg_match to proper UTF-8 character offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J

score 0 · Answer 7

You might want to look at the T-Regx library.

pattern('H', 'u')->match("\xC2\xA1Hola!")->first(function (Match $match) 
{
    echo $match->offset();
});

This $match->offset() is a UTF-8-safe offset.
