2020-04-21 | crawling / concept | 3 minutes read (about 410 words)

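Web crawling

What to know before you start web crawling

For example, if you want to crawl the Naver home page, enter www.naver.com/robots.txt in the browser address bar and you will see the site's Robots Exclusion Protocol rules.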
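The same lookup can be done from a script. Below is a minimal sketch that builds the robots.txt address for any page on a site. The function name robots_txt_url is made up for this post, and defaulting to https for bare host names is my own assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url):
    """Return the robots.txt URL for the host that serves site_url."""
    # Accept bare host names such as "www.naver.com" by assuming https.
    if "://" not in site_url:
        site_url = "https://" + site_url
    parts = urlsplit(site_url)
    # robots.txt sits at the root of the host, whatever page path was given.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("www.naver.com"))  # https://www.naver.com/robots.txt
```

The resulting URL can then be handed to urllib.robotparser.RobotFileParser via set_url() followed by read() if you want Python to download and parse the file for you.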

Summary of robots.txt rules

1. Allow all robots
  User-agent: *
  Allow: /
2. Block all robots
  User-agent: *
  Disallow: /
3. Block all robots from three directories
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/
  Disallow: /junk/
4. Block all robots from a specific file
  User-agent: *
  Disallow: /directory/file.html
5. Block the BadBot robot from every file
  User-agent: BadBot
  Disallow: /
6. Block BadBot and Googlebot from a specific directory
  User-agent: BadBot
  User-agent: Googlebot
  Disallow: /private/
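
To see how a crawler would read rules like the ones above, here is a small sketch using Python's standard urllib.robotparser module. It parses two of the examples from in-memory text, so nothing is downloaded. The crawler name "MyCrawler" and the host www.example.com are placeholders I picked for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The three-directory example: every robot is kept out of three directories.
THREE_DIRS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

# The BadBot and Googlebot example: only those two are kept out of /private/.
TWO_BOTS = """\
User-agent: BadBot
User-agent: Googlebot
Disallow: /private/
"""

three_dirs = load_rules(THREE_DIRS)
two_bots = load_rules(TWO_BOTS)

# "MyCrawler" and www.example.com are placeholder names.
print(three_dirs.can_fetch("MyCrawler", "http://www.example.com/tmp/a.html"))  # False
print(three_dirs.can_fetch("MyCrawler", "http://www.example.com/index.html"))  # True
print(two_bots.can_fetch("Googlebot", "http://www.example.com/private/x"))     # False
print(two_bots.can_fetch("MyCrawler", "http://www.example.com/private/x"))     # True
```

Calling can_fetch() before every request is a simple way to keep a crawler inside what the site permits.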

Note: the settings below are as of April 21, 2020.

Naver robots.txt settings
Source: https://searchadvisor.naver.com/guide/seo-basic-robots

- Allow crawling of only the site's root page
  User-agent: *
  Disallow: /
  Allow: /$
- Specify sitemap.xml
  User-agent: *
  Allow: /
  Sitemap: http://www.example.com/sitemap.xml
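
The `Allow: /$` line in the root-page example relies on `$` meaning "the path ends here". The sketch below shows one way such a pattern could be evaluated. It assumes the longest-match precedence described in Google's robots.txt documentation (the longest matching pattern wins, and Allow wins a tie), which may not be exactly how every crawler behaves. The helper names pattern_to_regex and is_allowed are hypothetical. As far as I know, Python's built-in urllib.robotparser treats paths as plain prefixes, which is why the matching is written by hand here.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; every other character is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path, rules):
    """rules is a list of ("allow" or "disallow", pattern) pairs."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            length = len(pattern)
            # Assumed precedence: longest pattern wins, Allow wins a tie.
            if length > best_len or (length == best_len and kind == "allow"):
                best_len = length
                allowed = kind == "allow"
    return allowed


# The root-page-only rules from the example above.
ROOT_ONLY = [("disallow", "/"), ("allow", "/$")]

print(is_allowed("/", ROOT_ONLY))            # True
print(is_allowed("/news/today", ROOT_ONLY))  # False
```

Only the bare root path matches `/$`, so every deeper page falls back to `Disallow: /`.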

Daum robots.txt settings

Block all robots
User-agent: *
Disallow: /
Kakao robots.txt settings
Block all robots
# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
# To ban all spiders from the entire site uncomment the next two lines:
User-agent: *
Disallow: /
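
The note in the Kakao file talks about uncommenting two lines. In robots.txt a line that starts with # is a comment and is ignored, so the same two rules behave very differently depending on whether the # is present. Here is a rough sketch of that difference using urllib.robotparser. The helper blocks_everything and the names "MyCrawler" and www.example.com are made up for illustration, and which form the live file actually uses is something to check on the site itself.

```python
from urllib.robotparser import RobotFileParser

# Both rule lines are commented out, so no rule is in effect.
COMMENTED = """\
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
"""

# The same two lines without '#': every robot is blocked.
UNCOMMENTED = """\
User-agent: *
Disallow: /
"""


def blocks_everything(robots_txt, agent="MyCrawler"):
    """Return True if the given robots.txt text keeps agent off the root page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "http://www.example.com/")


print(blocks_everything(COMMENTED))    # False
print(blocks_everything(UNCOMMENTED))  # True
```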

If you spot a problem or a typo, please let me know in a comment or by email.
Thank you :)

For more details, see the sites below.
Source: https://gbsb.tistory.com/80
Source: https://medium.com/@euncho/robots-txt-e08328c4f0fd
Source: https://support.google.com/webmasters/answer/6062596?hl=ko
Source: https://ko.wikipedia.org/wiki/%EB%A1%9C%EB%B4%87_%EB%B0%B0%EC%A0%9C_%ED%91%9C%EC%A4%80