How to Process HTML Files with Regular Expressions in Python

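An HTML file is text written as front-end markup: tags wrapped around text content. It can be opened directly in a browser such as Chrome, which renders the formatting. This article works through a set of exercises on using Python regular expressions to find and handle the tags in an HTML file.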

Processing an HTML file with Python regular expressions

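The finditer method finds every match in a string. You may already have used findall, which returns a list of the matched strings. finditer instead returns an iterator that yields a match object for each match, in order. In the code below, these match objects are visited in a for loop so that group 1 of each can be printed.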
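As a rough illustration of the difference, the sketch below runs findall and finditer with the same pattern over a made-up line. The sample string is an assumption for demonstration only and is not taken from RGX_DATA.html.

```python
import re

# Hypothetical sample line (not from RGX_DATA.html).
line = "Prolog systems: SICStus, logic programming, Logic"
pattern = re.compile(r"(logic|sicstus)", re.I)

# findall returns the captured strings directly.
found = pattern.findall(line)

# finditer yields match objects, which also expose positions.
spans = [(m.group(1), m.start()) for m in pattern.finditer(line)]

print(found)
for text, start in spans:
    print("** TEST-RE:", text, "at offset", start)
```

Because each item from finditer is a match object, the same loop can read any group or the match position, which findall does not provide.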

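Your task is to write Python regular expressions (REs) that recognise certain patterns in an HTML text file. Add code to the STARTER script that compiles an RE for each pattern (assigning it to a meaningfully named variable) and applies these REs to every line of the file, printing any matches found.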

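1. Write a pattern that recognises HTML tags, and print each one as "TAG: tag string" (e.g. "TAG: b" for the tag <b>). For simplicity, assume that the opening and closing angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class that matches any character. Try it out and work out why it is not a good solution, then write a better one that fixes the problem.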

2. Modify the code so that it distinguishes opening tags from closing tags (e.g. <p> vs </p>), printing OPENTAG and CLOSETAG.
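The two tasks above can be explored on a single made-up line before touching the data file. This is only a sketch: the sample string and variable names are assumptions, and the patterns shown are one possible answer rather than the required one.

```python
import re

# Hypothetical sample line (not from RGX_DATA.html).
line = "<p>Some <b>bold</b> text</p>"

# Greedy: '.*' runs to the LAST '>' on the line, swallowing every tag.
greedy = re.findall(r"<.*>", line)

# Two common fixes: non-greedy matching, or a negated character class.
non_greedy = re.findall(r"<.*?>", line)
negated = re.findall(r"<[^>]+>", line)

# Task 2: tell opening tags from closing tags by the leading '/'.
open_tags = re.findall(r"<(\w+)[^>]*>", line)
close_tags = re.findall(r"</(\w+)\s*>", line)

print("greedy:    ", greedy)
print("non-greedy:", non_greedy)
for label in open_tags:
    print("OPENTAG:", label)
for label in close_tags:
    print("CLOSETAG:", label)
```

The greedy pattern returns the whole line as one "tag", while the other two return the four tags separately.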

import re

#------------------------------

testRE = re.compile(r'(logic|sicstus)', re.I)
# Any tag: '<', one or more characters other than '>', then '>'.
testI = re.compile(r'<[^>]+>')
# Opening tags: the first character after '<' must not be '/'.
testO = re.compile(r'<[^/](\S*?)[^>]*>')
# Closing tags: '<' immediately followed by '/'.
testC = re.compile(r'</[^>]*>')

with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

        # search() reports only the first match on the line ...
        m = testRE.search(line)
        if m:
            print('** TEST-RE:', m.group(1))

        # ... whereas finditer() visits every match.
        mm = testRE.finditer(line)
        for m in mm:
            print('** TEST-RE:', m.group(1))

        index = testI.finditer(line)
        for i in index:
            print('Tag:', i.group().replace('<', '').replace('>', ''))

        open1 = testO.finditer(line)
        for m in open1:
            print('opening:', m.group().replace('<', '').replace('>', ''))

        close1 = testC.finditer(line)
        for n in close1:
            print('closing:', n.group().replace('<', '').replace('>', ''))

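Note that some HTML tags take parameters, for example a <table> tag that carries border, cellspacing and cellpadding parameters.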

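Make sure your opening-tag pattern works for tags both with and without parameters, i.e. that it finds and prints the tag label in each case. Now extend your code so that it prints both the label of an opening tag and its parameters, e.g.: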
OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8
        open1 = testO.finditer(line)
        for m in open1:
            # print('opening:', m.group().replace('<', '').replace('>', ''))
            # Split the tag body on whitespace: the first piece is the
            # tag label, the remaining pieces are its parameters.
            firstm = m.group().replace('<', '').replace('>', '').split()
            num = 0
            for otherm in firstm:
                if num == 0:
                    print('opening:', otherm)
                else:
                    print('param:', otherm)
                num += 1

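In a regular expression, a back reference indicates that the substring matched by an earlier part of the pattern must occur again. It is written \N (where N is a positive integer) and refers back to the text matched by the Nth group of the regex. For example, a regex such as r"(\w+) \1" matches only if the exact string matched by the group (\w+) appears again at the position of the back reference \1. It would match within a string in which, for example, "the" appears twice in a row. Using a back reference, write a pattern that matches when a line contains a paired opening and closing tag, e.g. <b>in bold</b>.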
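A minimal sketch of both uses of a back reference follows. The sample strings and the names repeated_word and tag_pair are assumptions made for illustration; the tag-pair pattern is one possible answer to the exercise, not the only one.

```python
import re

# A word followed by a space and the very same word again.
repeated_word = re.compile(r"\b(\w+) \1\b")
m = repeated_word.search("kick the the ball")  # hypothetical sample text
repeated = m.group(1) if m else None

# An opening tag, some text, and a closing tag with the SAME label.
# '\1' refers back to whatever label group 1 captured.
tag_pair = re.compile(r"<(\w+)[^>]*>(.*?)</\1>")

# Hypothetical sample line: the last <b> is closed by </i>, so it is not a pair.
line = "This is <b>bold</b> and <i>italic</i>, but <b>not closed</i>."
pairs = [(p.group(1), p.group(2)) for p in tag_pair.finditer(line)]

print("repeated word:", repeated)
for label, text in pairs:
    print('** PAIR [%s]: "%s"' % (label, text))
```

The non-greedy (.*?) keeps each match to the nearest matching closing tag, which matters when a line holds two pairs with the same label.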
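Consider that we might want to write a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain-text file from which all HTML tags have been removed. We will not do that here; instead, consider the simpler case of removing any HTML tags found in each line of the input data file.
The RE you have already defined for recognising HTML tags should let you do this. Print the resulting text to the screen as STRIPPED: ...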
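Stripping can be sketched with the sub method of a compiled pattern, which replaces every match with a given string. The sample line and the exact tag pattern below are assumptions for illustration; real-world HTML (comments, scripts, '>' inside attribute values) would need more care than this.

```python
import re

# One possible tag pattern: optional '/', a label, then anything up to '>'.
tag = re.compile(r"</?\w+[^>]*>")

# Hypothetical sample line (not from RGX_DATA.html).
line = '<p>Visit <a href="index.html">the <b>home</b> page</a>.</p>'

# Replace every tag with the empty string.
stripped = tag.sub("", line)
print("STRIPPED:", stripped)
```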
import re

#------------------------------
# PART 1:
# Key thing is to avoid matching strings that include
# multiple tags, e.g. treating two adjacent tags as a
# single tag. Can do this in several ways. Firstly, use
# non-greedy matching, so get shortest possible match
# including the two angle brackets:

tag = re.compile('</?(.*?)>')

# The above treats the '/' of a close tag as a separate
# optional component - so that this doesn't turn up as
# part of the match '.group(1)', which is meant to return
# the tag label.

# Following alternative solution uses a negated character
# class to explicitly prevent this including '>':

tag = re.compile('</?([^>]+)>')

# Finally, following version separates finding the tag
# label string from any (optional) parameters that might
# also appear before the close angle bracket:

tag = re.compile(r'</?(\w+\b)([^>]+)?>')

# Note that use of '\b' (as word boundary anchor) here means
# we must mark the regex string as a 'raw' string (r'..').

#------------------------------
# PART 2:
# Following closeTag definition requires first char
# after the open angle bracket to be '/', while openTag
# definition excludes this by requiring first char to be
# a 'word char' (\w):

openTag  = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</(\w[^>]*)>')

# Following revised definitions are more carefully stated
# for correct extraction of tag label (separately from
# any parameters):

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')

#------------------------------
# PART 3:
# Above openTag definition will already get the string
# encompassing any parameters, and return it as
# m.group(2), i.e. defn:

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')

# If assume that parameters are continuous non-whitespace
# chars separated by whitespace chars, then we can divide
# them up using split - and that's how we handle them
# here. (In reality, parameter strings can be a lot more
# messy than this, but we won't try to deal with that.)

#------------------------------
# PART 4:

openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1>')

# Note use of non-greedy matching for the text falling
# *between* the open/close tag pair - to avoid false
# results where have two similar tag pairs on same line.

#------------------------------
# PART 5: URLS
# This is quite tricky. The URL expressions in the file
# are of two kinds, of which the first is a string
# between double quotes ("..") which may include
# whitespace. For this case we might have a regex:

url = re.compile('href=("[^">]+")', re.I)

# The second case does not have quotes, and does not
# allow whitespace, consisting of a continuous sequence
# of non-whitespace material (that ends when you reach a
# space or close bracket '>'). This might be:

url = re.compile(r'href=([^">\s]+)', re.I)

# We can combine these two cases as follows, and still
# get the expression back as group(1):

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Note that I've done nothing here to exclude 'mailto:'
# links as being accepted as URLS.

#------------------------------

with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)

        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))

        # PART 2,3: find open/close tags (+ params of open tags)

        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print('    PARAM:', param)

        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))

        # PART 4: find open/close tag pairs appearing on same line

        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))

        # PART 5: find URLs:

        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))

        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)

        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end='')
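The combined url pattern from PART 5 can also be tried on its own. The two sample lines below are made up for illustration (the example.com addresses are placeholders, not taken from the data file); one uses a quoted href value containing a space, the other an unquoted value in upper-case markup.

```python
import re

# Quoted value (may contain whitespace) OR unquoted run of non-space chars.
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Hypothetical sample lines.
lines = [
    '<a href="http://example.com/a page.html">quoted</a>',
    '<A HREF=http://example.com/index.html>unquoted</A>',
]

found = [m.group(1) for line in lines for m in url.finditer(line)]
for item in found:
    print("** URL:", item)
```

Note that in the quoted case group(1) still includes the surrounding double quotes, since they sit inside the capturing group.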