Parsing a table with rowspan and colspan

Question

I have a table that I need to parse, specifically a school timetable with 4 time blocks and 5 day blocks for each week. I've tried to parse it, but honestly haven't gotten very far, because I'm stuck on how to handle the rowspan and colspan attributes: they essentially mean there is missing data that I have to carry over into the rows and columns the cell spans.

Here is an example of what I want to do:

<tr>
    <td colspan="2" rowspan="4">#1</td>
    <td rowspan="4">#2</td>
    <td rowspan="2">#3</td>
    <td rowspan="2">#4</td>
</tr>

<tr>
</tr>

<tr>
    <td rowspan="2">#5</td>
    <td rowspan="2">#6</td>
</tr>

<tr>
</tr>

I want to take this table and convert it into this list:

[[1,1,2,3,4],
 [1,1,2,3,4],
 [1,1,2,5,6],
 [1,1,2,5,6]]

Right now I'm getting a flat list, similar to this:

[1,2,3,4,5,6]

but in dictionary form, with information on how many columns and rows each cell spans, a description of it, and which week it belongs to.

Obviously, this needs to work for every possible rowspan/colspan combination, and for multiple weeks in the same table.

The HTML is not as clean as I've portrayed it: there are many attributes that I've omitted, and the text is obviously not as clean-cut as 1, 2, 3, 4 but rather blocks of descriptive text. But if I could solve this part, it should be fairly easy to integrate it into what I've already written.

I've been using lxml.html and Python for this, but I'm open to using other modules if they provide an easier solution.

I hope someone can help me, because I really don't know what to do.

EDIT:

<table>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td rowspan="4">Thing</td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
</table>

This is causing me some problems; it outputs

[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ']

with the code provided by reclosedev. What do I need to change so that it outputs

[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']

instead?

EDIT2: With reclosedev's new function it is getting close to a solution, but there are still cases where it fails to place the cells correctly:

<table> 
    <tr>
        <td> </td>
        <td rowspan="2"> DMAT Aud. 6 </td>
        <td rowspan="4"> Exam</td>
        <td rowspan="2"> DMAT Aud. 7</td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
        <td rowspan="2"> OOP Aud. 7</td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
    </tr>
</table>

For this, the original table displays it like so:

[
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' '],
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' ']
]

But the new function outputs this:

[
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' '],
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' CART Aud. 4', ' OOP Aud. 7'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' ']
]

13

html-parsing html-table lxml python

asked by Atheuz on 2012-04-02 18:41:56

2 answers

score 11 · Answer 1

UPDATE: removed the previous function.

UPDATE2: fixed and simplified.

My first function was wrong. Here is another one, which works but needs testing:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import defaultdict


def table_to_list(table):
    dct = table_to_2d_dict(table)
    return list(iter_2d_dict(dct))


def table_to_2d_dict(table):
    # str replaces Python 2's unicode, which no longer exists in Python 3
    result = defaultdict(lambda: defaultdict(str))
    for row_i, row in enumerate(table.xpath('./tr')):
        for col_i, col in enumerate(row.xpath('./td|./th')):
            colspan = int(col.get('colspan', 1))
            rowspan = int(col.get('rowspan', 1))
            col_data = col.text_content()
            while row_i in result and col_i in result[row_i]:
                col_i += 1
            for i in range(row_i, row_i + rowspan):
                for j in range(col_i, col_i + colspan):
                    result[i][j] = col_data
    return result


def iter_2d_dict(dct):
    for i, row in sorted(dct.items()):
        cols = []
        for j, col in sorted(row.items()):
            cols.append(col)
        yield cols


if __name__ == '__main__':
    import lxml.html
    from pprint import pprint

    doc = lxml.html.parse('tables.html')
    for table_el in doc.xpath('//table'):
        table = table_to_list(table_el)
        pprint(table)

tables.html:

<table border="1">
    <tr>
        <td>1 </td>
        <td>1 </td>
        <td>1 </td>
        <td rowspan="4">Thing</td>
        <td>1 </td>
    </tr>
    <tr>
        <td>2 </td>
        <td>2 </td>
        <td>2 </td>
        <td>2 </td>
    </tr>
    <tr>
        <td>3 </td>
        <td>3 </td>
        <td>3 </td>
        <td>3 </td>
    </tr>
    <tr>
        <td>4 </td>
        <td>4 </td>
        <td>4 </td>
        <td>4 </td>
    </tr>
</table>

<table border="1">
<tr>
    <td colspan="2" rowspan="4">#1</td>
    <td rowspan="4">#2</td>
    <td rowspan="2">#3</td>
    <td rowspan="2">#4</td>
</tr>
<tr></tr>
<tr>
    <td rowspan="2">#5</td>
    <td rowspan="2">#6</td>
</tr>
<tr></tr>
</table>

Output:

[['1 ', '1 ', '1 ', 'Thing', '1 '],
 ['2 ', '2 ', '2 ', 'Thing', '2 '],
 ['3 ', '3 ', '3 ', 'Thing', '3 '],
 ['4 ', '4 ', '4 ', 'Thing', '4 ']]
[['#1', '#1', '#2', '#3', '#4'],
 ['#1', '#1', '#2', '#3', '#4'],
 ['#1', '#1', '#2', '#5', '#6'],
 ['#1', '#1', '#2', '#5', '#6']]
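
For anyone who wants to check the placement logic without an HTML file, here is a minimal sketch of the same idea on plain data. It assumes each row has already been reduced to a list of (text, rowspan, colspan) tuples; `expand_spans` is a name made up for this illustration and is not part of lxml. The only change from the function above is that the column counter is advanced by hand instead of coming from enumerate. The data is the table from EDIT2 in the question.

```python
from collections import defaultdict
from pprint import pprint


def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) tuples into a full 2D list."""
    grid = defaultdict(dict)  # grid[row][col] -> text
    for row_i, cells in enumerate(rows):
        grid[row_i]  # touch the row so an empty <tr> still gets an entry
        col_i = 0
        for text, rowspan, colspan in cells:
            # Skip columns already claimed by a span from an earlier cell.
            while col_i in grid[row_i]:
                col_i += 1
            # Copy the text into every position the cell covers.
            for i in range(row_i, row_i + rowspan):
                for j in range(col_i, col_i + colspan):
                    grid[i][j] = text
            col_i += colspan
    return [[row[j] for j in sorted(row)] for _, row in sorted(grid.items())]


if __name__ == '__main__':
    # The EDIT2 table, written out by hand as (text, rowspan, colspan).
    edit2 = [
        [(' ', 1, 1), ('DMAT Aud. 6', 2, 1), ('Exam', 4, 1),
         ('DMAT Aud. 7', 2, 1), (' ', 1, 1)],
        [(' ', 1, 1), ('CART Aud. 4', 2, 1)],
        [(' ', 1, 1), ('CART Aud. 4', 2, 1), ('OOP Aud. 7', 2, 1)],
        [(' ', 1, 1), (' ', 1, 1)],
    ]
    pprint(expand_spans(edit2))
```

Like the function above, this sketch does not clip a rowspan that runs past the last row, so malformed tables may produce extra rows.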

score 2 · Answer 2

Update: there is a bug in this answer (which is based on reclosedev's solution).

See "How to parse table with rowspan and colspan".

Old:

def table_to_2d(table_tag):
    rows = table_tag("tr")
    cols = rows[0](["td", "th"])
    table = [[None] * len(cols) for _ in range(len(rows))]
    for row_i, row in enumerate(rows):
        for col_i, col in enumerate(row(["td", "th"])):
            insert(table, row_i, col_i, col)
    return table


def insert(table, row, col, element):
    if row >= len(table) or col >= len(table[row]):
        return
    if table[row][col] is None:
        value = element.get_text()
        table[row][col] = value
        if element.has_attr("colspan"):
            span = int(element["colspan"])
            for i in range(1, span):
                table[row][col+i] = value
        if element.has_attr("rowspan"):
            span = int(element["rowspan"])
            for i in range(1, span):
                table[row+i][col] = value
    else:
        insert(table, row, col + 1, element)

Usage:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<table><tr><th>1</th><th>2</th><th>5</th></tr><tr><td rowspan="2">3</td><td colspan="2">4</td></tr><tr><td>6</td><td>7</td></tr></table>', 'html.parser')
print(table_to_2d(soup.table))

It is optimized for my case; I wrote it for my one specific scenario.
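
Since the code above is flagged as buggy, here is a hedged sketch of one way around two problems that can be read off it: the grid width is taken from the number of cells in the first row (so a colspan there is ignored), and a cell with both rowspan and colspan only fills its first row and first column rather than the whole rectangle. This is not the answer author's method. It swaps BeautifulSoup for the standard library's `html.parser`, and the names `CellCollector` and `table_to_grid` are made up for illustration. It assumes a single table with no nested tables.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect rows as lists of [text, rowspan, colspan] (no nested tables)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self.in_cell = False
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            rowspan = int(a.get("rowspan") or 1)
            colspan = int(a.get("colspan") or 1)
            self.rows[-1].append(["", rowspan, colspan])
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1][0] += data


def table_to_grid(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    grid = [{} for _ in parser.rows]  # one {col: text} dict per row
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while c in grid[r]:  # skip positions taken by earlier spans
                c += 1
            # Fill the whole rectangle; clip rowspans past the last row.
            for i in range(r, min(r + rowspan, len(grid))):
                for j in range(c, c + colspan):
                    grid[i][j] = text
            c += colspan
    # Width comes from the filled grid, not from the first row's cell count.
    width = max((max(row) + 1 for row in grid if row), default=0)
    return [[row.get(j) for j in range(width)] for row in grid]


if __name__ == "__main__":
    html = ('<table><tr><th>1</th><th>2</th><th>5</th></tr>'
            '<tr><td rowspan="2">3</td><td colspan="2">4</td></tr>'
            '<tr><td>6</td><td>7</td></tr></table>')
    print(table_to_grid(html))
```

Positions that no cell covers come back as None, so gaps in a malformed table stay visible instead of shifting the remaining columns.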
