Python: extracting HTML text content with regular expressions, and how to get every match when there are several

I ran into this while cleaning web page data: how do I extract all of the text when a piece of HTML contains several target segments?

For example, the following fragment:

<span style="mso-spacerun:"yes";font-family:;mso-ascii-font-family:Calibri;mso-hansi-font-family:Calibri;mso-bidi-font-family:"Times new roman";font-size:10.5000pt;mso-font-kerning:1.0000pt;">
<font face=""></font></span>

I want to extract the Chinese text.

Current approach

I match the text with a regular expression. The relevant code is as follows (excerpt):

import re
s = """
<span style="mso-spacerun:"yes";font-family:;mso-ascii-font-family:Calibri;mso-hansi-font-family:Calibri;mso-bidi-font-family:"Times new roman";font-size:10.5000pt;mso-font-kerning:1.0000pt;">
<font face=""></font></span>
"""
rs = re.findall(r"(?<=(>))[\d\D]*?(?=(<))", s, re.M)
for item in rs:
    print(item)

Result

The output is as follows, which is not what I want:

(">", "<")
(">", "<")
(">", "<")
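
A note on why the output looks like this: when a pattern contains capture groups, re.findall returns the groups rather than the whole match, and the pattern above wraps ">" and "<" in parentheses inside the lookarounds. The sketch below shows the same idea with the inner parentheses removed. The sample string and its placeholder text are assumptions standing in for the original Chinese content, and the pattern assumes that no attribute value contains ">" or "<".

```python
import re

# Hypothetical sample: placeholder text stands in for the original Chinese content.
s = """
<span style="font-size:10.5pt;">first segment
<font face="">second segment</font></span>
"""

# The original pattern: the capture groups inside the lookarounds make
# findall return (">", "<") tuples instead of the text between the tags.
with_groups = re.findall(r"(?<=(>))[\d\D]*?(?=(<))", s)

# Same lookarounds without capture groups: findall now returns the matched text.
segments = re.findall(r"(?<=>)[^<]*(?=<)", s)

# Drop empty or whitespace-only matches (for example between "</font>" and "</span>").
texts = [seg.strip() for seg in segments if seg.strip()]
print(texts)
```

This only works for simple, well-behaved markup; it is a sketch of the regex fix, not a general HTML parser.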

Web-crawler regular-expression python

Jul.14,2021

Don't use regular expressions for this; BeautifulSoup is much better suited to HTML.

from bs4 import BeautifulSoup
s = '''
<span style="mso-spacerun:'yes';font-family:;mso-ascii-font-family:Calibri;mso-hansi-font-family:Calibri;mso-bidi-font-family:'Times new roman';font-size:10.5000pt;mso-font-kerning:1.0000pt;">
<font face=""> </font> </span>
'''
clean_text = BeautifulSoup(s,"lxml").get_text()
print(clean_text)

Output:

We strolled into a small courtyard with a rich rural flavor, which was clean and tidy. The yard is neatly covered with golden corn, and even the corn cobs are neatly lined up. Red chili peppers hang on both sides of the door; chickens, dogs and cats walk leisurely around the courtyard; and there are two nests on the chicken house, one of which happens to have an egg in it. All kinds of flowers, such as hydrangeas, are in full bloom. The owners of the courtyard are both in their eighties: the master is 83 and the hostess is 85. They were still working on the corn when we burst into the yard. Instead of being nervous, they were very welcoming. They invited us to sit down and planned to pour us Scald; we just kept saying no. The two old people, unhurried and slow, never stopped working. According to them, most of their children and grandchildren are now independent and doing well. Seeing such a clean, tidy courtyard so full of warm life, it must take elderly people with real purpose and interest in life to create all this beauty.
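
If each text segment is needed separately rather than as one joined string, a small parser can collect them one by one. The sketch below is an alternative that uses only the standard library's html.parser instead of BeautifulSoup; the TextCollector class name and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects each non-empty run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.segments.append(text)


# Hypothetical sample: placeholder text stands in for the original content.
sample = """
<span style="font-size:10.5pt;">first segment
<font face="">second segment</font> third segment</span>
"""

collector = TextCollector()
collector.feed(sample)
collector.close()  # flush any text still buffered at the end of input
print(collector.segments)
```

With BeautifulSoup itself, iterating over the soup's stripped_strings gives a similar list of per-segment strings.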
