Python BeautifulSoup - How to extract this text

- August 15, 2014

My current Python script:

    import win_unicode_console
    win_unicode_console.enable()

    import requests
    from bs4 import BeautifulSoup

    data = '''
    <div class="info">
        <h1>company title</h1>
        <p class="type">company type</p>
        <p class="address"><strong>zip, city</strong></p>
        <p class="address"><strong>street 123</strong></p>
        <p style="margin-top:10px;"> phone: <strong>(111) 123-456-78</strong><br />
            fax: <strong>(222) 321-654-87</strong><br />
            phone: <strong>(333) 87-654-321</strong><br />
            fax: <strong>(444) 000-1111-2222</strong><br />
        </p>
        <p style="margin-top:10px;"> e-mail: <a href="mailto:mail@domain.com">mail@domain.com</a><br />
        e-mail: <a href="mailto:mail2@domain.com">mail2@domain.com</a><br />
        </p>
        <p> web: <a href="http://www.domain.com" target="_blank">www.domain.com</a><br />
        </p>
        <p style="margin-top:10px;"> id: <strong>123456789</strong><br />
            vat: <strong>987654321</strong> </p>
        <p class="del" style="margin-top:10px;">some info:</p>
        <ul>
            <li><a href="#category">&raquo; category</a></li>
        </ul>
    </div>
    '''

    html = BeautifulSoup(data, "html.parser")

    # Select only the <p> tags that have no class attribute.
    p = html.find_all('p', attrs={'class': None})

    for pp in p:
        print(pp.contents)

It returns the following (shown here without the area codes):

    [' phone: ', <strong>123-456-78</strong>, <br/>, '\n\t\tfax: ', <strong>321-654-87</strong>, <br/>, '\n\t\tphone: ', <strong>87-654-321</strong>, <br/>, '\n\t\tfax: ', <strong>000-1111-2222</strong>, <br/>, '\n']
    [' e-mail: ', <a href="mailto:mail@domain.com">mail@domain.com</a>, <br/>, '\n\te-mail: ', <a href="mailto:mail2@domain.com">mail2@domain.com</a>, <br/>, '\n']
    [' web: ', <a href="http://www.domain.com" target="_blank">www.domain.com</a>, <br/>, '\n']
    [' id: ', <strong>123456789</strong>, <br/>, '\n\t\tvat: ', <strong>987654321</strong>, ' ']

Problem: I don't know how to extract the text for phone, fax, e-mail, id and vat, and create an array for each of them, like:

    phones = [123-456-78, 87-654-321]
    faxes = [321-654-87, 000-1111-2222]
    emails = [mail@domain.com, mail2@domain.com]
    id = [123456789]
    vat = [987654321]

You can group the data using a defaultdict after splitting:

    from collections import defaultdict

    html = BeautifulSoup(data, "html.parser")
    p = html.find_all('p', attrs={'class': None})

    d = defaultdict(list)
    for pp in p:
        # Split on all whitespace so labels and values alternate.
        spl = iter(pp.text.split())
        for ele in spl:
            # ele is the label; the next token is its value.
            d[ele.rstrip(":")].append(next(spl).rstrip())

    print(d)

which prints:

    defaultdict(<class 'list'>, {'phone': ['123-456-78', '87-654-321'], 'fax': ['321-654-87', '000-1111-2222'], 'e-mail': ['mail@domain.com', 'mail2@domain.com'], 'vat': ['987654321'], 'web': ['www.domain.com'], 'id': ['123456789']})

Splitting the text of each paragraph on whitespace gives you lists like these:

    ['phone:', '123-456-78', 'fax:', '321-654-87', 'phone:', '87-654-321', 'fax:', '000-1111-2222']
    ['e-mail:', 'mail@domain.com', 'e-mail:', 'mail2@domain.com']
    ['web:', 'www.domain.com']
    ['id:', '123456789', 'vat:', '987654321']

So we use every two elements as a key/value pair, appending the values for repeated keys.
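The pairing step can be tried on its own, without BeautifulSoup. The following is a minimal sketch, not the answer's exact code: the helper name `pair_tokens` is made up for illustration, and it uses the `zip(it, it)` idiom instead of calling `next()` by hand.

```python
from collections import defaultdict


def pair_tokens(tokens):
    """Group a flat [label, value, label, value, ...] list into a dict of lists.

    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    it = iter(tokens)
    # zip(it, it) draws two consecutive items from the same iterator per step,
    # so each iteration yields one (label, value) pair.
    for label, value in zip(it, it):
        grouped[label.rstrip(":")].append(value)
    return grouped


tokens = ['phone:', '123-456-78', 'fax:', '321-654-87',
          'phone:', '87-654-321', 'fax:', '000-1111-2222']
result = pair_tokens(tokens)
print(dict(result))
# {'phone': ['123-456-78', '87-654-321'], 'fax': ['321-654-87', '000-1111-2222']}
```

Note that this only works while no value contains whitespace. A number such as `(111) 123-456-78` splits into two tokens, and the label/value alternation breaks.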

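For your edit, to catch the spaces in the fax and phone numbers, split the text into lines with splitlines and then split each line only once on whitespace:

    from collections import defaultdict

    d = defaultdict(list)
    for pp in p:
        spl = pp.text.splitlines()
        for ele in spl:
            # Skip whitespace-only lines left before a closing </p>.
            if not ele.strip():
                continue
            # maxsplit=1 keeps spaces inside the value, e.g. "(111) 123-456-78".
            k, v = ele.strip().split(None, 1)
            d[k.rstrip(":")].append(v.rstrip())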

Output:

    defaultdict(<class 'list'>, {'fax': ['(222) 321-654-87', '(444) 000-1111-2222'], 'web': ['www.domain.com'], 'id': ['123456789'], 'e-mail': ['mail@domain.com', 'mail2@domain.com'], 'vat': ['987654321'], 'phone': ['(111) 123-456-78', '(333) 87-654-321']})
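To get from the grouped dictionary to the separate lists asked for in the question, you can simply index it by label. The sketch below runs on a plain string standing in for the paragraph text, so it needs no parser. The function name `group_lines`, the sample string, and the variable names `ids` and `vats` are assumptions for illustration, and it assumes every non-blank line holds a label followed by a value.

```python
from collections import defaultdict


def group_lines(text):
    """Parse 'label: value' lines into a dict mapping label -> list of values.

    Hypothetical helper; assumes each non-blank line has a label and a value.
    """
    grouped = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank or whitespace-only lines
        # Split once so spaces inside the value are preserved.
        label, value = line.split(None, 1)
        grouped[label.rstrip(":")].append(value)
    return grouped


# Stand-in for the text of the unclassed <p> tags (assumed shape).
sample_text = """ phone: (111) 123-456-78
        fax: (222) 321-654-87
        phone: (333) 87-654-321
        fax: (444) 000-1111-2222
 e-mail: mail@domain.com
    e-mail: mail2@domain.com
 id: 123456789
        vat: 987654321 """

d = group_lines(sample_text)

# Pull out the individual lists by label.
phones = d["phone"]
faxes = d["fax"]
emails = d["e-mail"]
ids = d["id"]
vats = d["vat"]

print(phones)
print(faxes)
print(emails)
```

A line with a label but no value would raise a ValueError at the unpacking step, so real pages may need an extra check there.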
