Discussion on WP Content Crawler - Get content from almost any site, automatically!

3,660 sales

turgutsaricam supports this item

Supported

This author's response time can be up to 5 business days.

2655 comments found.

It’s a shame about the Cloudflare block… Unfortunately, I can’t find any proxy to use with the plugin to continue crawling websites :(

Hi,

I can’t seem to crawl this page – https://eurohockey.org/news#news

Can you check whether this is possible, please, and point me in the right direction?

Many thanks

Hi,

You can see this page to learn how to tell whether a site’s content can be crawled. Additionally, if you want to go the JavaScript rendering route, you need a proxy service that can render JavaScript and return the rendered page as static HTML. Once you find such a proxy, you can make the plugin use it by configuring the proxy settings.

Hi, I have two questions about things I cannot figure out how to do.

Date – How do I set the date as the original crawled content date, not the current date? When crawling I am receiving multiple posts in my blog from the same site at the top and not in any date order.

Second question – Can it be set so that it only crawls and posts the last 4 articles (if new)? I have set it to crawl 1 page, but it’s still posting lots of posts from the site, not only the new ones.

Hi,

If you do not set any date selectors, the publish date of the post will be the date and time when the post is crawled. So, you can simply skip the date selector setup to use the crawling date as the publish date of the post.

If you already set it to crawl only the first page, you can configure the post URL selectors in such a way that only the URLs of the first 4 articles of the category page are found. For example, you can use the `:nth-child(-n+4)` pseudo-class to select the first 4 elements, e.g. `ul > li:nth-child(-n+4) > h3 > a`.
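To see why `:nth-child(-n+4)` matches only the first four items: `:nth-child(an+b)` matches a 1-based sibling position when that position equals `a*n + b` for some integer `n >= 0`. Here is a minimal Python sketch of that arithmetic. The helper name `matches_nth_child` is made up for illustration and is not part of the plugin.

```python
def matches_nth_child(a: int, b: int, position: int) -> bool:
    """Return True if a 1-based sibling position matches :nth-child(an+b)."""
    if position < 1:
        return False
    if a == 0:
        return position == b
    # position = a*n + b must hold for some integer n >= 0
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0


# :nth-child(-n+4) corresponds to a = -1, b = 4
first_four = [p for p in range(1, 11) if matches_nth_child(-1, 4, p)]
print(first_four)
```

With `a = -1` and `b = 4`, only positions 1 to 4 satisfy the formula, which is why the selector stops after the fourth `li`.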

The database is becoming overloaded when we have more than 200k posts. Is there any way to optimize this?

Hi,

If you are looking for an option in the plugin, there is no such option to improve your database performance, unfortunately.

Hi,

I want the domain currently registered to my license key to be removed.

My license key is

Thanks

Hi,

Please send your license key by using the contact form on my profile page.

Hello,

I’m contacting you to request assistance with the Content Crawler plugin for WordPress.

Hi,

I replied to your email.

creaorg

creaorg Purchased

Hello Support Team,

We are currently managing a large-scale aggregation project and have identified two critical logic behaviors in WP Content Crawler that are affecting data integrity. We are writing to report these as bugs/logic flaws and to inquire if a fix or “Strict Mode” is planned for an upcoming release.

Issue 1: The “Current Date” Fallback is Corrupting Archival Data

We have observed that when the plugin fails to parse a date selector (i.e., when it returns null or false), the system automatically defaults to current_time().

Current Behavior: If a configured selector like `meta[itemprop="datePublished"]` finds no match on a page, the post is saved and published with today’s date and time.

Desired Behavior: We require a setting to SKIP the post or set status to Draft if the date selector fails to find a value.

Context: For archival projects, no data is better than incorrect data. Defaulting to “now” destroys the chronological accuracy of the archive.

Question: Is there a filter (e.g., wcc_post_date or similar) where we can return false to abort the save if the date is empty? Or do you plan to add a “Require Date” checkbox in the settings?

Issue 2: Soft 404s Being Scraped as Valid Posts

The plugin appears to process pages that return HTTP 200 OK but contain “Page Not Found” content (Soft 404s).

Scenario: A source URL is dead/removed. The target site redirects to a custom 404 page which returns a 200 OK status code.

Failure Mode: The crawler loads this 404 page. It fails to find the article content/date. However, because of the fallback logic mentioned in Issue 1, it assigns “Today’s Date” and saves the “Page Not Found” error message as a valid published post.

Question: How can we configure the crawler to strictly validate that specific selectors (like Date or Content) MUST exist? If they are missing, the crawl for that specific URL should be marked as “Failed” rather than attempting to save what it found.

Technical Request: We are avoiding sharing specific source URLs for privacy, but this behavior is reproducible on any WordPress site that:

Has a meta date tag on valid post pages.

Does not have that tag on 404 pages.

Returns 200 OK for 404 pages (common in many themes).
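The three conditions above can be reproduced outside the plugin with a small check. The following is only a sketch using Python's standard `html.parser`; the class and function names are hypothetical, and treating "200 OK without the date tag" as a soft 404 is a heuristic assumed here, not something the plugin is known to do.

```python
from html.parser import HTMLParser
from typing import Optional


class PublishedDateFinder(HTMLParser):
    """Collects the content of the first <meta itemprop="datePublished"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.published: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.published is not None:
            return
        attributes = dict(attrs)
        if attributes.get("itemprop") == "datePublished":
            # An empty content attribute is treated as "no date".
            self.published = attributes.get("content") or None


def find_published_date(html: str) -> Optional[str]:
    finder = PublishedDateFinder()
    finder.feed(html)
    return finder.published


def looks_like_soft_404(status_code: int, html: str) -> bool:
    """Heuristic: a 200 response lacking the date tag is treated as a soft 404."""
    return status_code == 200 and find_published_date(html) is None
```

For example, `looks_like_soft_404(200, "<h1>Page Not Found</h1>")` returns `True`, while a page carrying the meta date tag is treated as a valid post.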

Could you please clarify if a patch for this fallback behavior is on your roadmap, or if you can provide a snippet to enforce strict date parsing?

Thank you for your assistance.

Hi,

You can set the status of the post via the filters conditionally. For example, you can check if the value of an element does not exist, and, if it does not exist, you can change the post status. You can also check if the text of the element contains/doesn’t contain a value and many more other things. Watching the introduction tutorial of the filters feature is a good starting point. You can also read the documentation of the filters feature here. Here is an example for your “publish date” case: https://ibb.co/V0WNs7pp Additionally, you can find the documentation for all the available filter commands here, which should give you an idea of what can be achieved with the filters.
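The conditional logic described above ("if the value does not exist, change the post status") can be sketched in plain Python. This is not the plugin's filter API: `decide_post_status` is a hypothetical helper, and the ISO 8601 input format is an assumption about what the date selector returns.

```python
from datetime import datetime
from typing import Optional, Tuple


def decide_post_status(raw_date: Optional[str]) -> Tuple[str, Optional[datetime]]:
    """Return ("publish", parsed_date), or ("draft", None) when no usable date exists."""
    if not raw_date or not raw_date.strip():
        return "draft", None
    try:
        return "publish", datetime.fromisoformat(raw_date.strip())
    except ValueError:
        # An unparseable value is treated the same as a missing one.
        return "draft", None
```

The point of this design is that there is no fallback to the current time: a missing or invalid date always results in a draft, so an archive never receives a post stamped with "now".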

insirah

insirah Purchased

Hello, I’m a very long-time customer of yours. There is an issue I’m stuck on and can’t solve: I need to pull product data from a B2B dealer page, but because it is a password-protected area, I could not figure out the settings I need in order to log in to it. I tried adding cookies but was not successful. I’m not sure whether I have explained my problem well. Could you help? Thank you very much.

Hi,

There is an explanation about adding cookies here. By following the steps described, you can add all the cookies of the target site to the site settings. Additionally, you may also need to do what is described in the “importing all request headers” section at the bottom of the page.

arnlweb

arnlweb Purchased

Message: The license could not be checked with the server. Please try saving your license settings again in a few minutes. If the error persists, please contact the developer.

The plugin was running. Suddenly stopped. Domain name, https://lyrics.arnlweb.com/

It says “nothing is broken”, which agrees with my message.

arnlweb

arnlweb Purchased

error: This license has reached its domain limit and is not valid for this domain. Registered domains: gpl.arnlweb.com

my new domain: lyrics.arnlweb.com

Please send your purchase code via the contact form on my profile page so that I can remove the domain registered to it.

My license doesn’t work on any other site. I need help resolving this issue. Please help me fix it and get the extension working again.

Hi,

The plugin’s license server is up and running without any issues. Also, you are the only one having this issue, which suggests that the issue is caused by your setup, not the plugin.

You can first make sure that your site can connect to wpcontentcrawler.com, to eliminate any connectivity issues. If your site can connect to wpcontentcrawler.com, the problem is likely caused by another third party software (your theme or other plugins). In that case, you can see this page to learn how to find out which software causes the issue.
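As a rough first check of the connectivity step, a TCP connection test can be run from the server that hosts the site. This is only a sketch: it assumes Python is available on that server and that port 443 is the relevant port, and it does not go through WordPress's own HTTP stack, so a success here does not rule out problems inside WordPress.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example (run on the server that hosts the WordPress site):
# print(can_connect("wpcontentcrawler.com"))
```

If this returns `False`, the cause is more likely a firewall, DNS, or hosting restriction than a theme or plugin conflict.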

I want to collect all articles from the website https://dienthoaivui.com.vn/. It uses a “load more” button. Can your product collect them?

Hi,

It is possible to configure the plugin to make subsequent requests to retrieve more information from other URLs, but it requires technical knowledge about how the websites and AJAX requests work. So, if you do not have such technical knowledge, it is hard to do, unfortunately.

I’m getting this error with php 8.1 and the last WP:

Notice: Function _load_textdomain_just_in_time was called incorrectly. Translation loading for the wp-content-crawler domain was triggered too early. This is usually an indicator for some code in the plugin or theme running too early. Translations should be loaded at the init action or later. Please see Debugging in WordPress for more information. (This message was added in version 6.7.0.) in /home/xxx/xxx/xxx/public_html/wp-includes/functions.php on line 6121

The plugin stopped crawling and I can’t access the Dashboard either.

Hi,

Some of your other plugins/themes might be triggering something too early, which causes this problem, as the plugin was successfully tested with several WordPress and PHP versions, including PHP 8.1, before release. You can see this page to learn how to find out which third-party software causes this: https://docs.wpcontentcrawler.com/troubleshooting/fixing-problems-caused-by-other-plugins-themes.html

Hello,

I have a licensed copy of WP Content Crawler Pro (codecanyon.net). However, dynamic content loaded with JavaScript (for example, the prices on beko.com.tr pages) is not retrieved by the crawler.

Even though the “Test via: Rendered / Both” options and “Manipulations” are active, the price information does not make it into the page’s HTML. Therefore, the data cannot be captured with selectors such as “script[type=’application/ld+json’]” or “.beko-price”.

Could you please help with the following:

Is the headless browser (JS rendering) feature active in my current version?

If not, how can JavaScript rendering or a puppeteer / chromium module be enabled?

Is a special setting, proxy, or add-on module required for dynamic price elements to appear in the “Rendered test” window?

Thank you,

I worked on this with ChatGPT; this is its message :)

Hi,

There is only one edition of the plugin, and it is sold only on CodeCanyon. The plugin does not have a “headless browser” integration. If you want to run JavaScript, you can use a proxy that is capable of running JavaScript.

https://www.beko.com.tr/no-frost-buzdolabi/660316-mb-buzdolabi

Hello; on this site I cannot get the regular price and the “olizli” price. I had also written about the photos before. If we can solve this problem, we will buy many licenses together with our dealers. Could you please provide a practical solution for this, or an update? I have crawled many sites without problems, but I cannot crawl sites of this kind, especially the Beko site.

Hi,

I already explained this in my email as follows:

Because the `src` attributes of the `img` elements are not defined on the site, the images cannot be loaded correctly. The site sets the `src` attribute with JavaScript after the page loads. Since the plugin cannot run JavaScript, you need to do this by using the plugin’s HTML manipulation options.

If you inspect the page’s source code, you will see that the image URLs are in the `data-srcset` attributes of the `img` elements. To set the first of these URLs as the value of the `src` attribute, I made a setting like this: https://ibb.co/1YtF1kzw As can be seen here ( https://ibb.co/MxZv6rjB ), the image URLs can be retrieved correctly.
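The idea of taking the first URL from a `data-srcset` value and using it as `src` can be illustrated outside the plugin. This is a minimal Python sketch, not the plugin's HTML manipulation setting shown in the screenshots; the function name is hypothetical, and the simple comma split assumes the URLs themselves contain no commas.

```python
from typing import Optional


def first_srcset_url(srcset: str) -> Optional[str]:
    """Return the URL of the first candidate in a srcset-style attribute value."""
    for candidate in srcset.split(","):
        parts = candidate.split()
        if parts:
            # Each candidate is "URL [descriptor]", e.g. "a.jpg 480w".
            return parts[0]
    return None
```

For instance, `first_srcset_url("https://example.com/a.jpg 480w, https://example.com/b.jpg 960w")` returns `"https://example.com/a.jpg"`, which is the value that would be written into `src`.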

Because configuring the plugin’s settings is not within the scope of support, I unfortunately do not configure settings like this for customers who purchase the plugin and cannot configure it themselves. I am sending this only as an example. If you send your further messages through the contact form on my CodeCanyon profile page, it will be easier for me to follow the process.

onishko

onishko Purchased

Critical PHP TypeError in WP Content Crawler v1.15.0 – Plugin Conflict with MyListing Theme

Dear WP Content Crawler Support Team, I’m experiencing a critical issue with your plugin that causes fatal errors and conflicts with my theme functionality. I would appreciate your assistance in resolving this problem.

When WP Content Crawler is active, I encounter a PHP TypeError that breaks the “Related Listing” dropdown functionality in MyListing theme. The dropdown shows “The Results could not be loaded” instead of displaying the listing options. When I deactivate WP Content Crawler, the functionality works perfectly.

Error Details: From my debug.log, I can see the following critical error:

PHP Fatal error: Uncaught TypeError: strtolower(): Argument #1 ($string) must be of type string, array given in /wp-content/plugins/wp-content-crawler/app/Utils.php:1176

Stack trace:

#0 /wp-content/plugins/wp-content-crawler/app/Utils.php(1176): strtolower()

#1 /wp-content/plugins/wp-content-crawler/app/RequirementValidator.php(58): WPCCrawler\Utils::isPluginPage()

#2 /wp-content/plugins/wp-content-crawler/app/WPCCrawler.php(58): WPCCrawler\RequirementValidator->validateAll()

Additional Notices: The plugin also triggers multiple early loading warnings: Function _load_textdomain_just_in_time was called incorrectly. Translation loading for the wp-content-crawler domain was triggered too early.

Environment Information

WordPress Version: 6.8.3

PHP Version: 8.4

WP Content Crawler Version: 1.15.0

Theme: MyListing (based on JobListing)

Steps to Reproduce

Activate WP Content Crawler plugin

Go to Listings → Edit any listing

Try to use the “Related Listing” dropdown field

Error occurs: “The Results could not be loaded”

Expected Behavior: The Related Listing dropdown should load and display available listings via AJAX request.

Temporary Workaround: Currently, I have to deactivate WP Content Crawler to use the Related Listing functionality.

Request: Could you please:

Investigate this TypeError in Utils.php line 1176

Check why the plugin triggers translations too early

Provide a fix that prevents this conflict with AJAX requests

I’m available to provide additional information, screenshots, or perform testing if needed. Thank you for your attention to this matter.

Best regards, Roman https://disk.yandex.ru/d/YRE2oeTsF7jHtA

I am glad it worked. Thanks for letting me know!

onishko

onishko Purchased

Thank you for your help with “Related Listing”! One last question. I’m trying to parse a site where phone numbers are hidden behind a ‘Show’ button. Even with proper cookies/auth, the parser sees only: ‘8 800 10… Show’ instead of the full number.

Current situation:

Main page: https://optlist.ru/company/uyutnosti/ (basic info)

API endpoint: https://optlist.ru/company/uyutnosti/phone_json (full phone)

Question: Is it possible to configure the crawler to:

Get main content from the primary URL

Simultaneously fetch specific data (like phone) from a secondary API endpoint?

Or simulate clicking the ‘Show’ button before parsing?

The phone JSON returns: {Phones 800 101-49-41”],Fax“}

Any solution for this multi-URL parsing scenario?

Best regards, Roman

You can use the `Make` command to make requests to other URLs and include their response into the current page. This video explains how to use the command: https://www.youtube.com/watch?v=VR3rh6DK8_E
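Conceptually, the multi-URL scenario amounts to fetching the main page, fetching a related endpoint, and combining both into one record. The sketch below shows that flow in Python. It is not how the `Make` command is implemented: `crawl_with_secondary` and the injected `fetch` function are hypothetical, and the JSON payload used in the usage note is a placeholder because the real response shape is not fully visible in the question.

```python
import json
from typing import Callable, Dict


def crawl_with_secondary(
    page_url: str,
    fetch: Callable[[str], str],
    secondary_suffix: str = "phone_json",
) -> Dict[str, object]:
    """Fetch a page and a related JSON endpoint, returning both in one record."""
    main_html = fetch(page_url)
    # Assumption: the secondary endpoint lives directly under the page URL.
    secondary_url = page_url.rstrip("/") + "/" + secondary_suffix
    extra = json.loads(fetch(secondary_url))
    return {"url": page_url, "html": main_html, "extra": extra}
```

Passing `fetch` as a parameter keeps the sketch testable offline: a dictionary lookup can stand in for real HTTP requests, e.g. `crawl_with_secondary("https://example.com/company/acme/", pages.__getitem__)` where `pages` maps URLs to response bodies.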

Taxonomy terms not splitting after Find & Replace

Hello!

I’m trying to import movie genres using WP Crawler from a site where genres are listed like this: Триллер, Фантастика, Приключение, Боевик

Here is my setup:

Selector: .stat.wborder li:nth-child(6) span.value

Attribute: text

Custom field: raw_movie_genre

Taxonomy: movie_genre

Delimiter: ,

Find & Replace: added to remove | or |, at the beginning (regex like ^\|\s*)

The issue: Even after Find & Replace, the genres are imported as one single taxonomy term instead of being split into separate terms.

The preview shows correct output (clean list of genres), but taxonomy splitting does not happen.

I’ve tried multiple combinations, including different selectors, delimiters, and regex variations — nothing helps.

My question: Does WP Crawler apply Find & Replace before splitting taxonomy terms? If not — is there any way to fix it? This functionality is critical for my use case.
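For reference, the order of operations expected here (clean first, then split on the delimiter) looks like this in plain Python. Whether the plugin applies Find & Replace before splitting is not confirmed in this thread, so this is only a sketch of the intended result with a hypothetical helper name.

```python
import re
from typing import List


def clean_and_split(raw: str, delimiter: str = ",") -> List[str]:
    """Strip a leading '|' (plus whitespace), then split into trimmed terms."""
    cleaned = re.sub(r"^\|\s*", "", raw.strip())
    return [term.strip() for term in cleaned.split(delimiter) if term.strip()]


print(clean_and_split("| Триллер, Фантастика, Приключение, Боевик"))
```

Each term is trimmed after splitting, so a delimiter of `,` works even though the source text separates genres with a comma followed by a space.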

Thanks in advance for your help!

The top-level category is not selected even if all its subcategories are selected, it seems. So, you can try the following as a workaround.
  • Check `Post Tab > Category Section > Do not add the category defined in the category URLs?` setting’s checkbox. This will make the plugin not assign the top-level category selected in the `Category Tab > Category URLs` setting.
  • Update `Post Tab > Category Section > Category Name Selectors` setting so that it finds the top-level category in the page.

This requires the top-level category information to be available in the target post page. If that is not available, you can add any element to the page by using the settings. For example, this screenshot shows a filter that is used to add an element inside the `body` element, which you can then select via the `#my-top-level-category` CSS selector.

Thanks for your help, the second option helped me

I am glad to hear that. Thanks for letting me know.

Hi, is there a way to scrape/parse from CSV files? I can’t find a plugin that can do that, and it would be the best of both worlds.

Especially if we could choose exactly which parts of the CSV file should be parsed and which parts should be imported directly.

That would really help to get the full description of the product or post during import from a CSV…

Thanks!

Hi,

Sorry, no, the plugin is not designed to process CSV files, unfortunately.

Would you consider adding that function? We affiliate marketers would be very happy, because our biggest challenge is getting the original product description (which is rarely included in the CSV files) to use with an AI API to create new unique content, while at the same time using the CSV for convenient data updates etc. I believe there is an untapped market for that function :)

Anyway, thanks for getting back to me!

I see. Isn’t there a WooCommerce plugin that can be used to update all the existing products by using data from a CSV file? If so, you can crawl the products via WP Content Crawler, with their SKUs, and then use such a plugin to update them. I am not too familiar with the plugins available to WooCommerce in that area, unfortunately.
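The workflow suggested above (crawl products with their SKUs, then update them from a CSV) boils down to matching rows by SKU. Here is a minimal Python sketch of that matching step. It does not use WooCommerce or any specific plugin; the function name and the `sku` and `price` column names are assumptions chosen for illustration.

```python
import csv
import io
from typing import Dict


def apply_csv_updates(products: Dict[str, dict], csv_text: str) -> int:
    """Update products (keyed by SKU) from CSV rows; return how many matched."""
    updated = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        sku = (row.get("sku") or "").strip()
        if sku in products:
            # Overwrite only the columns present in the CSV, keep crawled fields.
            products[sku].update({k: v for k, v in row.items() if k != "sku"})
            updated += 1
    return updated
```

In this sketch the crawled description stays untouched while CSV columns such as the price are refreshed, and rows whose SKU was never crawled are simply ignored.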

Can I use it without openAI or any API?

Hi,

Yes, you can.

Hello! Your plugin has been working great for a whole year, but recently we started seeing a message saying the license couldn’t be verified. This may be due to partial internet blockages in Russia.

Please tell me what needs to be done to get it working again.

Screenshots of the issue: https://ibb.co/pr9stNK2 https://ibb.co/8gw1yWYP

Purchase information: LICENSE CERTIFICATE : Envato Market Item

This document certifies the purchase of: ONE REGULAR LICENSE as defined in the standard terms and conditions on Envato Market.

Licensor’s Author Username: turgutsaricam

Licensee: Denis Babasinov

Item Title: WP Content Crawler – Get content from almost any site, automatically!

Item URL: https://codecanyon.net/item/wp-content-crawler-get-content-from-almost-any-site-automatically/15983018

Item ID: 15983018

Item Purchase Code: 81f2b39c-(XXXXXXXXXXXX)86ed

Purchase Date: 2024-07-05 10:29:25 UTC

Hi,

I replied to your email.

Turgut Bey, my license code is still registered to the temporary Hostinger domain. I am having problems now that I have pointed my own domain to the site. Could you help?

Hi,

If you send an email through the contact form on my profile page, including your license key and stating that you want the registered domain removed, I will remove the domain registered to your license.
