Subject: Spider/Crawl whole site

Author Messages
marit
Posts: 8
03/26/2008 10:46 AM
Hello,

I've installed Seamus with full text search. I can see that it is now indexing my files and folders. How can I get Seamus to index all my tabs? It is important that it crawls/spiders through all tabs, even those that have custom-built modules. Is there something that I have to set?

Thank you very much.
marit

Current settings in Seamus:
If No Records are Returned: Show "No Records Found"
Update Index Every: 6
Global Crawler: Yes
All other settings are empty/not set.
John Henley
Posts: 25
03/26/2008 1:34 PM
I too am very confused. I purchased this because your features said:

"The Venexus Seamus module is an aggregation module that will crawl/spider your DotNetNuke portal (Tabs, Modules, Folders, and Files), external websites (including non-DotNetNuke websites), and any RSS feed."

That led me to believe that it would crawl/spider HTML content on non-DNN sites which don't have RSS feeds. Is that correct? If so, how do I set it up?
John Henley
Posts: 25
03/26/2008 1:37 PM
For example, I see this in the errors:

ErrorID: 153
When: 3/26/2008 1:36:56 PM
URL: http://www.danalytics.com/guru/letter/
Message: 200 - text/html
John Henley
Posts: 25
03/28/2008 3:07 PM
Is this really an error?
Should questions like this be posted in the issue tracker or asked here?
John Henley
Posts: 25
03/28/2008 3:11 PM
Can you help me understand what needs to be done in order to crawl/spider HTML content on non-DNN sites which don't have RSS feeds? Do I just add the URL as a feed?
tmunn78
Posts: 1
03/28/2008 3:39 PM
You can add items to be crawled through Seamus via the Add Feed link. It will give you an option for Feed or Website. Also, look at this link for reference. http://www.venexus.com/Support/ProductForums/tabid/132/forumid/2/postid/35/view/topic/Default.aspx
John Henley
Posts: 25
03/31/2008 11:32 AM
"It will give you an option for Feed or Website"

I don't have that option when using the Add Feed link. There is just a 'Feed URL'. Is there a difference between an RSS URL and an HTML URL?
Jeff Smith
Posts: 131
03/31/2008 11:51 AM
Make sure you enable "Users Can Add Feeds" in the module settings. When you do, you should see a link at the bottom of the main view on Seamus that says "Add Site". If you click it, you should see radio buttons for RSS Feed or Site URL, and you can add a Title and URL.
marit
Posts: 8
04/07/2008 8:44 AM
When I add a URL, say "www.venexus.com", will Seamus then add just the homepage, or the entire site with all tabs? Thanks very much!
John Henley
Posts: 25
04/07/2008 10:05 AM
I'd like to understand this better as well. Does it work for non-DNN sites?
Jeff Smith
Posts: 131
04/07/2008 11:55 AM
Currently, if you add a URL like http://www.venexus.com and Global Crawler is enabled, Seamus will crawl that page and every other page on the site that it finds by following links. As it crawls http://www.venexus.com, each URL found on a page is added to the aggregation queue. It will also go on to index other sites that are linked from the Venexus website. For example, if we have a link to www.dotnetnuke.com, it will begin indexing that site as well.

The Global Crawler option is limited only by the exclusion list (Edit Seamus > Manage Excludes). That is why we provide the Moderation option and the VenexusSearchQueue to manage any pages/documents that are crawled.

There is currently no way to index only a single site (that is, to ignore external URLs found on the specified domain). However, in a future release we will be adding advanced domain management (placeholder in Edit Seamus > Manage Domains). From there you will be able to approve/ban domains to be crawled. When you add a domain in Manage Domains, it will automatically be added to the aggregation queue, and only URLs containing the approved domain name will be crawled. This is a fairly large piece of work, with a lot of heavy lifting in the aggregation queue and index whenever a domain is approved or banned. It is planned for the 2.6 release of Seamus.
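
To make the queue behaviour described above more concrete, here is a minimal Python sketch of a link-following crawl. It is not Seamus code and says nothing about how Seamus is implemented. The names crawl, fetch, excludes, approved_domains and max_pages are hypothetical, and the simple substring matching used for the exclusion list and for approved domains is an assumption. Pages come from an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, excludes=(), approved_domains=None, max_pages=100):
    """Breadth-first crawl starting at start_url.

    fetch(url) returns the page HTML, or None if the page is unavailable.
    excludes: URL substrings that are never crawled (assumed matching rule).
    approved_domains: if given, only hosts containing one of these names are
    crawled (an illustration of the planned domain approval, not its design).
    """
    queue = deque([start_url])  # stands in for the aggregation queue
    seen = {start_url}
    indexed = []
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        host = urlparse(url).netloc.lower()
        if any(pattern in url for pattern in excludes):
            continue
        if approved_domains is not None and not any(d in host for d in approved_domains):
            continue
        html = fetch(url)
        if html is None:
            continue
        indexed.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# Tiny fake web: the start page links to an internal page and an external site.
PAGES = {
    "http://www.venexus.com/": '<a href="/products">Products</a>'
                               '<a href="http://www.dotnetnuke.com/">DNN</a>',
    "http://www.venexus.com/products": '<a href="/">Home</a>',
    "http://www.dotnetnuke.com/": "",
}

# No restrictions: the external site is crawled too.
print(crawl("http://www.venexus.com/", PAGES.get))
# Exclusion list: URLs containing "dotnetnuke.com" are skipped.
print(crawl("http://www.venexus.com/", PAGES.get, excludes=("dotnetnuke.com",)))
# Approved domains: only hosts containing "venexus.com" are crawled.
print(crawl("http://www.venexus.com/", PAGES.get, approved_domains={"venexus.com"}))
```

With no restrictions the crawl reaches the linked external site, which matches the Global Crawler behaviour described above. The second and third calls show two ways of keeping it on one site in this toy model.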
John Henley
Posts: 25
04/07/2008 4:43 PM

Thank you for updating the Seamus documentation, that is very helpful. I do have a question, of course, regarding this item under "limitations":

"If you have set security on a page or file that is not public and requires authentication, Seamus will not be able to crawl and index that page or document."

Does this apply regardless of which of the four methods is used to crawl, or just to using the scheduled task?

Jeff Smith
Posts: 131
04/08/2008 1:35 PM

Currently, it is always true that Seamus will not handle any authentication. However, this has been moved up the schedule because a client has to have that functionality. It will be in the next release (2.5.5).

marit
Posts: 8
04/09/2008 5:17 AM

Limiting the global crawler to a defined list of domains is an important feature for us. When do you expect version 2.6 to be released?

Jeff Smith
Posts: 131
05/29/2008 10:04 AM
This has been released.