Parse an HTML file

I generated an HTML file from a PDF using the pdf2htmlEX library. Below is part of the result (I have posted only a portion of the file):

<div id="page-container">
    <div id="pf1" class="pf w0 h0" data-page-no="1">
        <div class="pc pc1 w0 h0">
            <img
                class="bi x0 y0 w1 h1"
                alt=""
                src="data:image/png;base64,..."
            />
            <div class="c x0 y1 w2 h2">
                <div class="t m0 x1 h3 y2 ff1 fs0 fc0 sc0 ls0 ws0">id<span class="ls1"> </span></div>
            </div>
            <div class="c x2 y1 w3 h2"><div class="t m0 x1 h3 y2 ff1 fs0 fc0 sc0 ls1 ws0">description</div></div>
            <div class="c x3 y1 w4 h2"><div class="t m0 x1 h3 y2 ff1 fs0 fc0 sc0 ls1 ws0">status</div></div>
            <div class="c x4 y1 w5 h2"><div class="t m0 x1 h3 y2 ff1 fs0 fc0 sc0 ls1 ws0">maker_id</div></div>
            <div class="c x5 y1 w6 h2"><div class="t m0 x1 h3 y2 ff1 fs0 fc0 sc0 ls1 ws0">checker_id</div></div>
            <div class="c x6 y1 w7 h2">
                <div class="t m0 x1 h3 y2 ff1 fs0 fc0 sc0 ls1 ws0">commodity_type<span class="_ _0"></span></div>
            </div>
            <div class="c x7 y1 w8 h2"><div class="t m0 x1 h3 y2 ff1 fs0 fc0 sc0 ls1 ws0">inserted_at</div></div>
            <div class="c x0 y3 w2 h4"><div class="t m0 x1 h3 y4 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 y3 w3 h4"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">General</div></div>
            <div class="c x3 y3 w4 h4"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 y3 w5 h4"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 y3 w6 h4"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 y3 w7 h4">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls2 ws0">ggg<span class="ls1"> </span></div>
            </div>
            <div class="c x7 y3 w8 h4">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 y7 w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 y7 w3 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Fuels</div></div>
            <div class="c x3 y7 w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 y7 w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 y7 w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 y7 w7 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls2 ws0">gggg<span class="ls1"> </span></div>
            </div>
            <div class="c x7 y7 w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 y9 w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 y9 w3 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Ours</div></div>
            <div class="c x3 y9 w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 y9 w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 y9 w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 y9 w7 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls4 ws0">ffff<span class="ls1"> </span></div>
            </div>
            <div class="c x7 y9 w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 ya w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 ya w3 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Bauxite</div></div>
            <div class="c x3 ya w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 ya w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 ya w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 ya w7 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x7 ya w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 yb w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 yb w3 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Cement</div></div>
            <div class="c x3 yb w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 yb w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 yb w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 yb w7 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x7 yb w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 yc w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 yc w3 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Coal</div></div>
            <div class="c x3 yc w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 yc w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 yc w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 yc w7 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x7 yc w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 yd w2 h4"><div class="t m0 x1 h3 ye ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 yd w3 h4">
                <div class="t m0 x1 h3 yf ff1 fs0 fc0 sc0 ls1 ws0">Cobalt</div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">Concentrates</div>
            </div>
            <div class="c x3 yd w4 h4"><div class="t m0 x1 h3 yf ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 yd w5 h4"><div class="t m0 x1 h3 yf ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 yd w6 h4"><div class="t m0 x1 h3 yf ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 yd w7 h4"><div class="t m0 x1 h3 ye ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x7 yd w8 h4">
                <div class="t m0 x1 h3 yf ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 y10 w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 y10 w3 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Copper</div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">Concentrates</div>
            </div>
            <div class="c x3 y10 w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 y10 w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 y10 w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 y10 w7 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x7 y10 w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 y11 w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 y11 w3 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Container</div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">empty</div>
            </div>
            <div class="c x3 y11 w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 y11 w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 y11 w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 y11 w7 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x7 y11 w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x0 y12 w2 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x2 y12 w3 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">Container</div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">loaded</div>
            </div>
            <div class="c x3 y12 w4 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">A</div></div>
            <div class="c x4 y12 w5 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x5 y12 w6 h5"><div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2</div></div>
            <div class="c x6 y12 w7 h5"><div class="t m0 x1 h3 y8 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
            <div class="c x7 y12 w8 h5">
                <div class="t m0 x1 h3 y5 ff1 fs0 fc0 sc0 ls1 ws0">2021-<span class="ls3">05</span>-<span class="ls3">26 </span></div>
                <div class="t m0 x1 h3 y6 ff1 fs0 fc0 sc0 ls1 ws0">14:39:22.00<span class="_ _0"></span>0</div>
            </div>
            <div class="c x8 y13 w9 h0"><div class="t m0 x0 h3 y14 ff1 fs0 fc0 sc0 ls1 ws0"></div></div>
        </div>
        <div class="pi" data-data='{"ctm":[1.000000,0.000000,0.000000,1.000000,0.000000,0.000000]}'></div>
    </div>
</div>

I want to parse the HTML file and get this result from it:


[
  %{id: "", description: "General", status: "A", maker_id: "2", checker_id: "2", commodity_type: "ggg", inserted_at: "2021-05-26 14:39:22.000", updated_at: "2021-05-26 14:39:22.000"},
  %{id: "", description: "Fuels", status: "A", maker_id: "2", checker_id: "2", commodity_type: "gggg", inserted_at: "2021-05-26 14:39:22.000", updated_at: "2021-05-26 14:39:22.000"},
  %{id: "", description: "Ours", status: "A", maker_id: "2", checker_id: "2", commodity_type: "ffff", inserted_at: "2021-05-26 14:39:22.000", updated_at: "2021-05-26 14:39:22.000"},
  %{id: "", description: "Bauxite", status: "A", maker_id: "2", checker_id: "2", commodity_type: "", inserted_at: "2021-05-26 14:39:22.000", updated_at: "2021-05-26 14:39:22.000"},
  ...
]

@coilardium I edited the first code block to format the HTML; it’s still tough to read, but now folks won’t have to horizontally scroll through everything

For parsing HTML, I’ve used Floki with good results in the past.
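
Here is a rough sketch of how that could look for the HTML above. It rests on a few assumptions that may not hold for the full file: every table cell is a `div.c`, all cells of one row share the same `y…` class, cells appear in document order, and the first row is the header. The module name `Pdf2HtmlTable` and its functions are made up for illustration, and the Floki version constraint is a guess at a recent release. Since the posted excerpt has no `updated_at` column, that key will not show up unless the full file contains it.

```elixir
# Run as a script (.exs). Mix.install/1 needs network access to fetch Floki.
Mix.install([{:floki, "~> 0.36"}])

defmodule Pdf2HtmlTable do
  # Hypothetical helper module; not part of Floki or pdf2htmlEX.

  # Returns one map per table row, keyed by the header cell texts.
  def rows(html) do
    {:ok, doc} = Floki.parse_document(html)

    [header | body] =
      doc
      |> Floki.find("div.c")
      |> Enum.map(fn cell -> {row_class(cell), cell_text(cell)} end)
      # Assumption: consecutive cells with the same "y…" class form one row.
      |> Enum.chunk_by(fn {row, _text} -> row end)
      |> Enum.map(fn cells -> Enum.map(cells, fn {_row, text} -> text end) end)

    # String.to_atom/1 is acceptable for a small, trusted header row;
    # keep string keys instead if the input is untrusted.
    keys = Enum.map(header, &String.to_atom/1)

    body
    # Drop stray cells (such as the empty trailing one) that are not a full row.
    |> Enum.filter(&(length(&1) == length(keys)))
    |> Enum.map(fn values -> keys |> Enum.zip(values) |> Map.new() end)
  end

  # Extracts the row marker, e.g. "y3", from a cell's class attribute.
  defp row_class({_tag, attrs, _children}) do
    {"class", classes} = List.keyfind(attrs, "class", 0)
    classes |> String.split() |> Enum.find(&String.match?(&1, ~r/^y[0-9a-f]+$/))
  end

  # Joins the text lines ("div.t") of a cell and collapses runs of whitespace.
  defp cell_text(cell) do
    cell
    |> Floki.find("div.t")
    |> Enum.map(&Floki.text/1)
    |> Enum.join(" ")
    |> String.split()
    |> Enum.join(" ")
  end
end

# Usage, assuming the generated file is saved as "output.html" (hypothetical path):
# "output.html" |> File.read!() |> Pdf2HtmlTable.rows() |> IO.inspect()
```

Grouping by the `y…` class rather than counting cells means a multi-line cell such as "Cobalt" / "Concentrates" still ends up as one value ("Cobalt Concentrates"), because both text lines live inside the same `div.c`.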
