Commit b791151

Fix handling undefined length in CdxRecord (#83)
Some CDX entries have `"-"` for their length, instead of a number. `WaybackClient.search()` used to raise an error in this situation, but now handles it gracefully by setting the resulting `CdxRecord` object's `length` property to `None`.
1 parent 024a80e commit b791151

5 files changed: 37 additions and 3 deletions

CONTRIBUTING.rst

Lines changed: 2 additions & 2 deletions
@@ -79,10 +79,10 @@ Ready to contribute? Here's how to set up `wayback` for local development.
 5. When you're done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox::

     $ flake8 wayback tests
-    $ python setup.py test
+    $ pytest
     $ tox

-   To get flake8 and tox, just pip install them into your virtualenv.
+   To get flake8, pytest and tox, just pip install them into your virtualenv using `pip install -r requirements-dev.txt`.

 6. Commit your changes and push your branch to GitHub::

README.rst

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@ Thanks to the following people for their contributions and help on this package!

 - `Dan Allan <https://github.com/danielballan>`_ (Code, Tests, Documentation, Reviews)
 - `Rob Brackett <https://github.com/Mr0grog>`_ (Code, Tests, Documentation, Reviews)
+- `Will Sackfield <https://github.com/8W9aG>`_ (Code, Tests)
 - `Ed Summers <https://github.com/edsu>`_ (Code, Tests)
 - `Lion Szlagowski <https://github.com/LionSzl>`_ (Code, Tests)

wayback/_client.py

Lines changed: 1 addition & 1 deletion
@@ -539,7 +539,7 @@ def search(self, url, *, matchType=None, limit=None, offset=None,
         status_code = None
     else:
         status_code = int(data.status_code)
-    length = int(data.length)
+    length = None if data.length == '-' else int(data.length)
     capture_time = _utils.parse_timestamp(data.timestamp)
 except Exception as err:
     if 'RobotAccessControlException' in text:
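
The one-line change above treats `'-'` as a "no value" sentinel for `length`, which appears to mirror what the surrounding context lines already do for `status_code`. A minimal standalone sketch of that pattern follows. `parse_optional_int` is a hypothetical helper used only for illustration; the commit itself inlines the conditional instead of adding a function.

```python
from typing import Optional


def parse_optional_int(value: str, missing: str = '-') -> Optional[int]:
    """Return None for the "missing" sentinel, otherwise parse an int.

    Hypothetical helper for illustration only; it is not part of wayback.
    """
    return None if value == missing else int(value)


# A numeric field parses as before; the sentinel becomes None.
numeric = parse_optional_int('24186')
absent = parse_optional_int('-')
print(numeric, absent)
```

Anything that is neither the sentinel nor a valid integer still raises `ValueError`, so genuinely malformed rows would continue to reach the `except` branch shown in the hunk.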

wayback/tests/test_client.py

Lines changed: 28 additions & 0 deletions
@@ -190,6 +190,34 @@ def test_search_removes_malformed_entries(requests_mock):
     assert 2 == len(list(records))


+def test_search_handles_no_length_cdx_records(requests_mock):
+    """
+    The CDX index can contain a "-" in lieu of an actual length, which can't be
+    parsed into an int. We should handle this.
+
+    Because these are rare and hard to get all in a single CDX query that isn't
+    *huge*, we use a made-up mock for this one instead of a VCR recording.
+    """
+    with open(Path(__file__).parent / 'test_files' / 'zero_length_cdx.txt') as f:
+        bad_cdx_data = f.read()
+
+    with WaybackClient() as client:
+        requests_mock.get('http://web.archive.org/cdx/search/cdx'
+                          '?url=www.cnn.com%2F%2A'
+                          '&matchType=domain&filter=statuscode%3A200'
+                          '&showResumeKey=true&resolveRevisits=true',
+                          [{'status_code': 200, 'text': bad_cdx_data}])
+        records = client.search('www.cnn.com/*',
+                                matchType="domain",
+                                filter_field="statuscode:200")
+
+        record_list = list(records)
+        assert 5 == len(record_list)
+        for record in record_list[:4]:
+            assert isinstance(record.length, int)
+        assert record_list[-1].length is None
+
+
 @ia_vcr.use_cassette()
 def test_get_memento():
     with WaybackClient() as client:
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+com,cnn)/ 20100215131836 http://www.cnn.com/ text/html 200 T6CDCO73TS77BY67HLKYFLHKT2APUM5Y 24186
+com,cnn)/ 20100217150314 http://www.cnn.com:80/ text/html 200 3Y4OQ7ETJXGA24RIDYJYKMTJKF525QS2 24028
+com,cnn)/ 20100217231950 http://www.cnn.com/ text/html 200 OSFLJPVDJVSRY4X4AGMRPUMVK7AEQCWI 24116
+com,cnn)/ 20100218000634 http://www.cnn.com/ text/html 200 73LNKJSLFZSDZ7T77JNZWXXNGDRSWTHO 24217
+com,cnn)/ 20100218025451 http://www.cnn.com/ text/html 200 JK7O4OWOMDJHBNQJ5U43X5OQRORF6FQB -
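
To make the fixture concrete, here is a rough sketch of how rows like these could be split into fields, with the final row yielding a `length` of `None`. It assumes the seven space-delimited columns are, in order, URL key, timestamp, original URL, MIME type, status code, digest, and length. `CdxRow` and `parse_cdx_line` are hypothetical names for illustration and are much simpler than the real `CdxRecord` and `WaybackClient.search()`.

```python
from typing import NamedTuple, Optional


class CdxRow(NamedTuple):
    # Hypothetical, simplified stand-in for wayback's CdxRecord.
    key: str
    timestamp: str
    url: str
    mime_type: str
    status_code: Optional[int]
    digest: str
    length: Optional[int]


def parse_cdx_line(line: str) -> CdxRow:
    """Split one space-delimited CDX row, mapping '-' to None for numbers."""
    key, timestamp, url, mime_type, status, digest, length = line.split(' ')
    return CdxRow(
        key, timestamp, url, mime_type,
        None if status == '-' else int(status),
        digest,
        None if length == '-' else int(length),
    )


# The last two rows of the fixture: one with a length, one with "-".
lines = [
    'com,cnn)/ 20100218000634 http://www.cnn.com/ text/html 200 73LNKJSLFZSDZ7T77JNZWXXNGDRSWTHO 24217',
    'com,cnn)/ 20100218025451 http://www.cnn.com/ text/html 200 JK7O4OWOMDJHBNQJ5U43X5OQRORF6FQB -',
]
rows = [parse_cdx_line(line) for line in lines]
for row in rows:
    print(row.timestamp, row.length)
```

This is the same shape of assertion the new test makes: rows with a numeric final column produce an `int`, and the row ending in `-` produces `None`.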
