python - IMDb HTML Extraction - With Beautiful Soup -

- January 15, 2012

With Beautiful Soup 4, I'm trying to extract text that doesn't seem to be tagged. (I may be wrong; I'm not very capable with HTML.)

I need to extract several values from the IMDb source code of a page: the budget value and the latest worldwide gross value for a particular film. The length of the code varies between films, so if there is a method using Beautiful Soup 4 to extract these values regardless of line number, that would be hugely helpful. The code:

<div id="tn15content"> <h5>budget</h5> $165,000,000 (estimated)<br/> <br/>

This is from the source code of the page: the IMDb box office page for Interstellar.

I need '$165,000,000' extracted so I can store it, etc.

The gross code is more confusing:

<h5>gross</h5> $188,020,017 (usa) (<a href="/date/03-19/">19 march</a> <a href="/year/2015/">2015</a>)<br/>$187,991,439 (usa) (<a href="/date/03-15/">15 march</a> <a href="/year/2015/">2015</a>)<br/>$187,930,551 (usa) (<a href="/date/03-14/">14 march</a> <a href="/year/2015/">2015</a>)<br/>$187,918,949 (usa) (<a href="/date/03-11/">11 march</a> <a href="/year/2015/">2015</a>)<br/>$187,888,097 (usa) (<a href="/date/03-08/">8 march</a> <a href="/year/2015/">2015</a>)<br/>

All I need is the most recent figure (the worldwide figures are further down, in a huge chunk of code that I decided to leave out due to spacing on here).

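I know there is a similar problem on here that was solved, but I couldn't get the solution to work, nor could I comment to ask the user who provided the answer about that particular solution, due to being new to the site. I was going to try to get IMDbPY to work, but I wasn't sure how to install it on WinPython.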
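One way the Beautiful Soup side of this might be approached, as a minimal sketch rather than a tested solution for the live page: find the `<h5>` heading by its text, then read the bare text node that follows it. The `HTML` string is a trimmed copy of the markup quoted above, and `first_amount_after` is a hypothetical helper name. The sketch assumes the figures are listed newest first, as in the quoted snippet, and uses the built-in `html.parser`.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question. The real page is
# assumed to contain much more markup around these headings.
HTML = """
<div id="tn15content">
<h5>budget</h5> $165,000,000 (estimated)<br/> <br/>
<h5>gross</h5> $188,020,017 (usa) (<a href="/date/03-19/">19 march</a> <a href="/year/2015/">2015</a>)<br/>$187,991,439 (usa) (<a href="/date/03-15/">15 march</a> <a href="/year/2015/">2015</a>)<br/>
</div>
"""


def first_amount_after(soup, heading):
    """Return the first dollar amount in the text right after the <h5> named `heading`.

    Hypothetical helper: returns None if the heading or amount is missing.
    """
    for h5 in soup.find_all("h5"):
        if h5.get_text(strip=True).lower() != heading.lower():
            continue
        node = h5.next_sibling  # the untagged text directly after </h5>
        if not isinstance(node, str):
            return None
        match = re.search(r"\$[0-9,]+", node)
        return match.group(0) if match else None
    return None


soup = BeautifulSoup(HTML, "html.parser")
budget = first_amount_after(soup, "budget")
latest_gross = first_amount_after(soup, "gross")  # first entry = most recent (assumed)
print(budget, latest_gross)
```

Because the lookup keys on the heading text rather than on a position in the file, it should not depend on how long the page is for a given film.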

Use a regular expression:

\$([0-9,]+) \(usa\)
\$([0-9,]+) \(worldwide\)

You can test the patterns at http://pythex.org/
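As a rough sketch of how these patterns might be applied from Python with the standard `re` module: the USA lines in `TEXT` are copied from the question, while the worldwide line is a made-up placeholder, since the question left that part of the page out. `first_amount` is a hypothetical helper name, and `re.IGNORECASE` is added on the assumption that the live page may capitalise the labels differently.

```python
import re

# USA lines are taken from the question; the worldwide line is an invented
# placeholder figure, used only to exercise the second pattern.
TEXT = (
    '$188,020,017 (usa) (<a href="/date/03-19/">19 march</a> '
    '<a href="/year/2015/">2015</a>)<br/>'
    '$187,991,439 (usa) (<a href="/date/03-15/">15 march</a> '
    '<a href="/year/2015/">2015</a>)<br/>'
    '$1,234,567 (worldwide)<br/>'
)

USA = re.compile(r"\$([0-9,]+) \(usa\)", re.IGNORECASE)
WORLDWIDE = re.compile(r"\$([0-9,]+) \(worldwide\)", re.IGNORECASE)


def first_amount(pattern, text):
    """Return the first captured amount as an int, or None if nothing matches."""
    match = pattern.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


usa_latest = first_amount(USA, TEXT)              # first match = most recent (assumed)
worldwide_latest = first_amount(WORLDWIDE, TEXT)
print(usa_latest, worldwide_latest)
```

`search` stops at the first match, so if the page lists figures newest first, that is the latest value. `pattern.findall(text)` would return every captured amount instead.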
