python-project/python-3.7.4-docs-html/library/urllib.robotparser.html
Caleb Fontenot 335515d331 add files
2019-07-15 09:16:41 -07:00


<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>urllib.robotparser — Parser for robots.txt &#8212; Python 3.7.4 documentation</title>
<link rel="stylesheet" href="../_static/pydoctheme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script type="text/javascript" src="../_static/sidebar.js"></script>
<link rel="search" type="application/opensearchdescription+xml"
title="Search within Python 3.7.4 documentation"
href="../_static/opensearch.xml"/>
<link rel="author" title="About these documents" href="../about.html" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="copyright" title="Copyright" href="../copyright.html" />
<link rel="next" title="http — HTTP modules" href="http.html" />
<link rel="prev" title="urllib.error — Exception classes raised by urllib.request" href="urllib.error.html" />
<link rel="shortcut icon" type="image/png" href="../_static/py.png" />
<link rel="canonical" href="https://docs.python.org/3/library/urllib.robotparser.html" />
<script type="text/javascript" src="../_static/copybutton.js"></script>
<script type="text/javascript" src="../_static/switchers.js"></script>
<style>
@media only screen {
table.full-width-table {
width: 100%;
}
}
</style>
</head><body>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="../py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="http.html" title="http — HTTP modules"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="urllib.error.html" title="urllib.error — Exception classes raised by urllib.request"
accesskey="P">previous</a> |</li>
<li><img src="../_static/py.png" alt=""
style="vertical-align: middle; margin-top: -1px"/></li>
<li><a href="https://www.python.org/">Python</a> &#187;</li>
<li>
<span class="language_switcher_placeholder">en</span>
<span class="version_switcher_placeholder">3.7.4</span>
<a href="../index.html">Documentation </a> &#187;
</li>
<li class="nav-item nav-item-1"><a href="index.html" >The Python Standard Library</a> &#187;</li>
<li class="nav-item nav-item-2"><a href="internet.html" accesskey="U">Internet Protocols and Support</a> &#187;</li>
<li class="right">
<div class="inline-search" style="display: none" role="search">
<form class="inline-search" action="../search.html" method="get">
<input placeholder="Quick search" type="text" name="q" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
<script type="text/javascript">$('.inline-search').show(0);</script>
|
</li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="module-urllib.robotparser">
<span id="urllib-robotparser-parser-for-robots-txt"></span><h1><a class="reference internal" href="#module-urllib.robotparser" title="urllib.robotparser: Load a robots.txt file and answer questions about fetchability of other URLs."><code class="xref py py-mod docutils literal notranslate"><span class="pre">urllib.robotparser</span></code></a> — Parser for robots.txt<a class="headerlink" href="#module-urllib.robotparser" title="Permalink to this headline"></a></h1>
<p><strong>Source code:</strong> <a class="reference external" href="https://github.com/python/cpython/tree/3.7/Lib/urllib/robotparser.py">Lib/urllib/robotparser.py</a></p>
<hr class="docutils" id="index-0" />
<p>This module provides a single class, <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code class="xref py py-class docutils literal notranslate"><span class="pre">RobotFileParser</span></code></a>, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file. For more details on the
structure of <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> files, see <a class="reference external" href="http://www.robotstxt.org/orig.html">http://www.robotstxt.org/orig.html</a>.</p>
<dl class="class">
<dt id="urllib.robotparser.RobotFileParser">
<em class="property">class </em><code class="descclassname">urllib.robotparser.</code><code class="descname">RobotFileParser</code><span class="sig-paren">(</span><em>url=''</em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser" title="Permalink to this definition"></a></dt>
<dd><p>This class provides methods to read, parse and answer questions about the
<code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file at <em>url</em>.</p>
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.set_url">
<code class="descname">set_url</code><span class="sig-paren">(</span><em>url</em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.set_url" title="Permalink to this definition"></a></dt>
<dd><p>Sets the URL referring to a <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file.</p>
</dd></dl>
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.read">
<code class="descname">read</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.read" title="Permalink to this definition"></a></dt>
<dd><p>Reads the <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> URL and feeds it to the parser.</p>
</dd></dl>
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.parse">
<code class="descname">parse</code><span class="sig-paren">(</span><em>lines</em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.parse" title="Permalink to this definition"></a></dt>
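<dd><p>Parses the <em>lines</em> argument, a sequence of lines taken from a <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file.</p>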
</dd></dl>
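<p>As a minimal sketch of <code class="docutils literal notranslate"><span class="pre">parse()</span></code>, the rules can be supplied from a string held in memory instead of being fetched with <code class="docutils literal notranslate"><span class="pre">read()</span></code>. The rule text and the <code class="docutils literal notranslate"><span class="pre">example.com</span></code> URLs below are assumptions made up for illustration.</p>

```python
import urllib.robotparser

# Hypothetical robots.txt content, held in memory rather than downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
# parse() takes the file's lines, so split the text before passing it in.
rp.parse(ROBOTS_TXT.splitlines())

# The first matching rule in the group decides the answer.
allowed_home = rp.can_fetch("*", "http://example.com/index.html")
allowed_private = rp.can_fetch("*", "http://example.com/private/data.html")
print(allowed_home, allowed_private)
```

<p>Feeding lines directly like this is also a convenient way to test crawler logic without network access.</p>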
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.can_fetch">
<code class="descname">can_fetch</code><span class="sig-paren">(</span><em>useragent</em>, <em>url</em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.can_fetch" title="Permalink to this definition"></a></dt>
<dd><p>Returns <code class="docutils literal notranslate"><span class="pre">True</span></code> if the <em>useragent</em> is allowed to fetch the <em>url</em>
according to the rules contained in the parsed <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code>
file.</p>
</dd></dl>
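<p>The sketch below illustrates how the answer from <code class="docutils literal notranslate"><span class="pre">can_fetch()</span></code> can depend on the <em>useragent</em>. The crawler names (<code class="docutils literal notranslate"><span class="pre">ExampleBot</span></code>, <code class="docutils literal notranslate"><span class="pre">OtherCrawler</span></code>), the rules and the URLs are all hypothetical.</p>

```python
import urllib.robotparser

# Hypothetical rules: one group for a named crawler, one fallback group.
lines = [
    "User-agent: examplebot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /tmp/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)

url = "http://example.com/docs/page.html"
# The named group applies to ExampleBot and blocks the whole site.
named_bot_allowed = rp.can_fetch("ExampleBot/1.0", url)
# Any other crawler falls back to the "*" group.
other_bot_allowed = rp.can_fetch("OtherCrawler", url)
other_bot_tmp_allowed = rp.can_fetch("OtherCrawler", "http://example.com/tmp/x")
print(named_bot_allowed, other_bot_allowed, other_bot_tmp_allowed)
```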
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.mtime">
<code class="descname">mtime</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.mtime" title="Permalink to this definition"></a></dt>
<dd><p>Returns the time the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> file was last fetched. This is
useful for long-running web spiders that need to check for new
<code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> files periodically.</p>
</dd></dl>
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.modified">
<code class="descname">modified</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.modified" title="Permalink to this definition"></a></dt>
<dd><p>Sets the time the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> file was last fetched to the current
time.</p>
</dd></dl>
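<p>A long-running spider might combine <code class="docutils literal notranslate"><span class="pre">mtime()</span></code> and <code class="docutils literal notranslate"><span class="pre">modified()</span></code> roughly as sketched below. The helper <code class="docutils literal notranslate"><span class="pre">is_stale</span></code> and the one-hour interval are assumptions, not part of the module, and the sketch treats a falsy <code class="docutils literal notranslate"><span class="pre">mtime()</span></code> as meaning that nothing has been loaded yet.</p>

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval


def is_stale(parser, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the rules were never loaded or are older than max_age."""
    now = time.time() if now is None else now
    last = parser.mtime()
    # Assumption: a falsy mtime() means the file has not been fetched yet.
    return not last or (now - last) > max_age


rp = urllib.robotparser.RobotFileParser()
stale_before = is_stale(rp)   # nothing recorded yet

rp.modified()                 # record "fetched just now"
stale_after = is_stale(rp)

# Simulate checking again two intervals later.
stale_later = is_stale(rp, now=rp.mtime() + 2 * MAX_AGE_SECONDS)
print(stale_before, stale_after, stale_later)
```

<p>In a real crawler, a stale result would typically trigger another call to <code class="docutils literal notranslate"><span class="pre">read()</span></code> before any further requests are made.</p>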
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.crawl_delay">
<code class="descname">crawl_delay</code><span class="sig-paren">(</span><em>useragent</em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.crawl_delay" title="Permalink to this definition"></a></dt>
<dd><p>Returns the value of the <code class="docutils literal notranslate"><span class="pre">Crawl-delay</span></code> parameter from <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code>
for the <em>useragent</em> in question. If there is no such parameter, it
doesn't apply to the <em>useragent</em> specified, or the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> entry
for this parameter has invalid syntax, returns <code class="docutils literal notranslate"><span class="pre">None</span></code>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 3.6.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="urllib.robotparser.RobotFileParser.request_rate">
<code class="descname">request_rate</code><span class="sig-paren">(</span><em>useragent</em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.request_rate" title="Permalink to this definition"></a></dt>
<dd><p>Returns the contents of the <code class="docutils literal notranslate"><span class="pre">Request-rate</span></code> parameter from
<code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> as a <a class="reference internal" href="../glossary.html#term-named-tuple"><span class="xref std std-term">named tuple</span></a> <code class="docutils literal notranslate"><span class="pre">RequestRate(requests,</span> <span class="pre">seconds)</span></code>.
If there is no such parameter, it doesn't apply to the <em>useragent</em>
specified, or the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> entry for this parameter has invalid
syntax, returns <code class="docutils literal notranslate"><span class="pre">None</span></code>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 3.6.</span></p>
</div>
</dd></dl>
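<p>Because both <code class="docutils literal notranslate"><span class="pre">crawl_delay()</span></code> and <code class="docutils literal notranslate"><span class="pre">request_rate()</span></code> may return <code class="docutils literal notranslate"><span class="pre">None</span></code>, callers usually need a fallback. The sketch below shows one possible policy: wait for the longest interval that any source asks for. The helper <code class="docutils literal notranslate"><span class="pre">seconds_between_requests</span></code>, its default of one second and the rule text are assumptions for illustration.</p>

```python
import urllib.robotparser

# Hypothetical rules carrying both throttling parameters.
lines = [
    "User-agent: *",
    "Crawl-delay: 6",
    "Request-rate: 3/20",
    "Disallow: /cgi-bin/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)


def seconds_between_requests(parser, useragent, default=1.0):
    """Pick the most conservative wait time, tolerating missing parameters."""
    candidates = [default]
    delay = parser.crawl_delay(useragent)
    if delay is not None:
        candidates.append(float(delay))
    rate = parser.request_rate(useragent)
    if rate is not None and rate.requests:
        # "3/20" means 3 requests per 20 seconds.
        candidates.append(rate.seconds / rate.requests)
    return max(candidates)


delay = rp.crawl_delay("*")
rate = rp.request_rate("*")
wait = seconds_between_requests(rp, "*")
print(delay, rate.requests, rate.seconds, round(wait, 2))
```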
</dd></dl>
<p>The following example demonstrates basic use of the <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code class="xref py py-class docutils literal notranslate"><span class="pre">RobotFileParser</span></code></a>
class:</p>
<div class="highlight-python3 notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">urllib.robotparser</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">robotparser</span><span class="o">.</span><span class="n">RobotFileParser</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">set_url</span><span class="p">(</span><span class="s2">&quot;http://www.musi-cal.com/robots.txt&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rrate</span> <span class="o">=</span> <span class="n">rp</span><span class="o">.</span><span class="n">request_rate</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rrate</span><span class="o">.</span><span class="n">requests</span>
<span class="go">3</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rrate</span><span class="o">.</span><span class="n">seconds</span>
<span class="go">20</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">crawl_delay</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">)</span>
<span class="go">6</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">can_fetch</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">,</span> <span class="s2">&quot;http://www.musi-cal.com/cgi-bin/search?city=San+Francisco&quot;</span><span class="p">)</span>
<span class="go">False</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">can_fetch</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">,</span> <span class="s2">&quot;http://www.musi-cal.com/&quot;</span><span class="p">)</span>
<span class="go">True</span>
</pre></div>
</div>
</div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<h4>Previous topic</h4>
<p class="topless"><a href="urllib.error.html"
title="previous chapter"><code class="xref py py-mod docutils literal notranslate"><span class="pre">urllib.error</span></code> — Exception classes raised by urllib.request</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="http.html"
title="next chapter"><code class="xref py py-mod docutils literal notranslate"><span class="pre">http</span></code> — HTTP modules</a></p>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="../bugs.html">Report a Bug</a></li>
<li>
<a href="https://github.com/python/cpython/blob/3.7/Doc/library/urllib.robotparser.rst"
rel="nofollow">Show Source
</a>
</li>
</ul>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
&copy; <a href="../copyright.html">Copyright</a> 2001-2019, Python Software Foundation.
<br />
The Python Software Foundation is a non-profit corporation.
<a href="https://www.python.org/psf/donations/">Please donate.</a>
<br />
Last updated on Jul 13, 2019.
<a href="../bugs.html">Found a bug</a>?
<br />
Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 2.0.1.
</div>
</body>
</html>