Parsing HTML source code with AppleScript
I'm trying to parse an HTML file that I converted to a TXT file inside Automator.
I previously downloaded the HTML file from a website using Automator, and now I'm trying to parse its source code.
Ideally I want to extract just the information in the table, and I need to repeat this for 1,800 different HTML files.
Here is a sample of the source code:
<code></head> <body> <div id="header"> <div class="wrapper"> <span class="access"> <div id="fb-root"></div> <span class="access"> Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a> Logged in as Edward | <a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a> </span> </span> </div><!-- /wrapper --> </div><!-- /header --> <div id="masthead"> <div class="wrapper"> <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a> <div id="navigation"> <ul> <li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li> <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul> </div><!-- /navigation --> </div><!-- /wrapper --> </div><!-- /masthead --> <div id="content"> <div class="wrapper"> <div id="main-content"> <!-- per Project stuff --> <span class="section"> <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/> <h1><span id="profile-name-104947" >Christian Sieling</span></h1> <ul class="gbutton-group right"> <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">« Back </a></li> <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752" id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li> </ul> <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" > <span 
id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/> <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a> </div> <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2> </span> <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table"> <tr> <th>Role</th> <td> <p>Other</p> </td> </tr> <tr> <th>Organisation Type</th> <td> <p>Asset Manager</p> </td> </tr> <tr> <th>Email</th> <td><a href="mailto:[email protected]" title="[email protected]" >[email protected]</a></td> </tr> <tr> <th>Website</th> <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td> </tr> <tr> <th>Phone</th> <td>41 78 616 7334</td> </tr> <tr> <th>Fax</th> <td></td> </tr> <tr> <th>Mailing Address</th> <td>Birrenstrasse 30</td> </tr> <tr> <th>City</th> <td>Schindellegi</td> </tr> <tr> <th>State</th> <td>CH</td> </tr> <tr> <th>Country</th> <td>Switzerland</td> </tr> <tr> <th class="lastrow" >Zip/ Postal Code</th> <td class="lastrow" >8834</td> </tr> </table> </div><!-- /main-content --> <div id="sidebar" > </div> <div id="similar_sidebar" class="similar_refine" > </div> </div><!-- /wrapper --> </div><!-- /content --> <div id="footer"> </div> </code>
My AppleScript attempt, which uses text item delimiters to extract the table, looks like this:
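<code>on run {input, parameters}
	-- Automator passes input as a list, so coerce it to text first
	set p to input as text
	-- The opening tag carries attributes, so match "<table" rather than "<table>"
	set ex to extractBetween(p, "<table", "</table>") -- extract the table
	return ex
end run

to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText
	set endItems to text item -1 of SearchText -- everything after the last start marker
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text item 1 of endItems -- everything before the end marker
	set AppleScript's text item delimiters to tid
	return beginningToEnd
end extractBetween
</code>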
How can I parse the table out of the HTML file?
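For comparison, here is a rough sketch of the same extraction done with an HTML parser instead of text item delimiters. It is written in Python rather than AppleScript, on the assumption that `python3` is installed and can be called from Automator's "Run Shell Script" action. The names `ProfileTableParser` and `parse_profile_table` are made up for this example; the only thing taken from the sample above is the `id="profile-table"` attribute on the table.

```python
import sys
from html.parser import HTMLParser


class ProfileTableParser(HTMLParser):
    """Collects (header, value) pairs from the table with id="profile-table"."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.cell = None    # "th" or "td" while inside a cell, otherwise None
        self.buffer = []    # text fragments of the current cell
        self.header = None  # most recent <th> text
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == "profile-table":
            self.in_table = True
        elif self.in_table and tag in ("th", "td"):
            self.cell = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag == self.cell:
            # Collapse runs of whitespace left over from the markup
            text = " ".join("".join(self.buffer).split())
            if tag == "th":
                self.header = text
            else:
                self.rows.append((self.header, text))
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.buffer.append(data)


def parse_profile_table(html):
    """Return a list of (header, value) tuples, one per table row."""
    parser = ProfileTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Each command-line argument is treated as a path to one saved HTML file
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            for header, value in parse_profile_table(f.read()):
                print(f"{header}\t{value}")
```

Saved under a hypothetical name such as `parse_table.py`, this could be run over the whole batch with `python3 parse_table.py *.txt`, printing one tab-separated header and value per line. It assumes one such table per file and does not handle nested tables.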