While cleaning web page data, I ran into a problem: how do I extract all of the contents when a piece of HTML text contains several target elements?
For example, take the following fragment:
<span style="mso-spacerun:"yes";font-family:;mso-ascii-font-family:Calibri;mso-hansi-font-family:Calibri;mso-bidi-font-family:"Times new roman";font-size:10.5000pt;mso-font-kerning:1.0000pt;">
<font face=""></font></span>
I want to extract the Chinese characters.
Current approach
I use a regular expression to match everything between the tags. The relevant code (an excerpt) is as follows:
import re
s = """
<span style="mso-spacerun:"yes";font-family:;mso-ascii-font-family:Calibri;mso-hansi-font-family:Calibri;mso-bidi-font-family:"Times new roman";font-size:10.5000pt;mso-font-kerning:1.0000pt;">
<font face=""></font></span>
"""
# Try to capture whatever sits between a '>' and the next '<'
rs = re.findall(r"(?<=(>))[\d\D]*?(?=(<))", s, re.M)
for item in rs:
    print(item)
Result
The output is as follows, which is not the result I want:
('>', '<')
('>', '<')
('>', '<')
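
A likely explanation for the tuples above: when a pattern contains capturing groups, `re.findall` returns the groups rather than the whole match, and here both groups sit inside the lookarounds. The sketch below drops the capturing parentheses and then keeps only runs of Chinese characters. It is an illustration under stated assumptions, not a definitive fix: the `sample` string and its Chinese text are hypothetical placeholders (the text content of the original fragment is not shown), the helper names `text_between_tags` and `chinese_runs` are made up for this example, and "Chinese characters" is assumed to mean the CJK Unified Ideographs range U+4E00 to U+9FFF.

```python
import re

# Hypothetical sample: the original fragment's text content is not shown,
# so placeholder Chinese text is used here purely for illustration.
sample = '<span style="font-size:10.5pt;">你好<font face="Calibri">世界</font></span>'


def text_between_tags(html):
    """Return the non-empty text runs found between a '>' and the next '<'."""
    # No capturing groups, so re.findall returns the matched text itself
    # instead of a tuple of groups.
    runs = re.findall(r"(?<=>)[^<>]*(?=<)", html)
    return [run.strip() for run in runs if run.strip()]


def chinese_runs(html):
    """Return runs of CJK Unified Ideographs (U+4E00 to U+9FFF) between tags."""
    found = []
    for run in text_between_tags(html):
        found.extend(re.findall(r"[\u4e00-\u9fff]+", run))
    return found


print(text_between_tags(sample))
print(chinese_runs(sample))
```

Regular expressions are fragile on real-world HTML (for instance when an attribute value contains `>`), so for anything beyond a quick cleanup a proper HTML parser is probably the safer choice.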