eaiovnaovbqoebvqoeavibavo 3 6]t@s dZdZddlZddlmZddlZddlZddlZdZyddl Z ddZ WnFe k ryddl Z ddZ Wne k rddZ YnXYnXy ddl Z Wne k rYnXejd jejZejd jejZGd d d eZGd ddZGdddZdS)aBBeautiful Soup bonus library: Unicode, Dammit This library converts a bytestream to Unicode through any means necessary. It is heavily based on code from Mark Pilgrim's Universal Feed Parser. It works best on XML and HTML, but it does not rewrite the XML or HTML to reflect a new encoding; that's the tree builder's job. ZMITN)codepoint2namecCstj|dS)Nencoding)cchardetdetect)sr/usr/lib/python3.6/dammit.pychardet_dammitsr cCstj|dS)Nr)chardetr)rrrrr !scCsdS)Nr)rrrrr 'sz!^<\?.*encoding=['"](.*?)['"].*\?>z0<\s*meta[^>]+charset\s*=\s*["']?([^>]*?)[ /;'">]c@seZdZdZddZe\ZZZdddddd Ze j d Z e j d Z e d d Ze ddZe ddZe dddZe dddZe ddZdS)EntitySubstitutionzASubstitute XML or HTML entities for the corresponding characters.cCsni}i}g}xBttjD]2\}}t|}|dkrD|j||||<|||<qWddj|}||tj|fS)N"z[%s])listritemschrappendjoinrecompile)lookupZreverse_lookupZcharacters_for_reZ codepointname characterZ re_definitionrrr_populate_class_variables9s  z,EntitySubstitution._populate_class_variablesZaposZquotZampltgt)'"&<>z&([<>]|&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;))z([<>&])cCs|jj|jd}d|S)Nrz&%s;)CHARACTER_TO_HTML_ENTITYgetgroup)clsmatchobjentityrrr_substitute_html_entityZsz*EntitySubstitution._substitute_html_entitycCs|j|jd}d|S)zmUsed with a regular expression to substitute the appropriate XML entity for an XML special character.rz&%s;)CHARACTER_TO_XML_ENTITYr")r#r$r%rrr_substitute_xml_entity_sz)EntitySubstitution._substitute_xml_entitycCs6d}d|kr*d|kr&d}|jd|}nd}|||S)a*Make a value into a quoted XML attribute, possibly escaping it. Most strings will be quoted using double quotes. Bob's Bar -> "Bob's Bar" If a string contains double quotes, it will be quoted using single quotes. Welcome to "my bar" -> 'Welcome to "my bar"' If a string contains both single and double quotes, the double quotes will be escaped, and the string will be quoted using double quotes. Welcome to "Bob's Bar" -> "Welcome to "Bob's bar" rrz")replace)selfvalueZ quote_withZ replace_withrrrquoted_attribute_valuefsz)EntitySubstitution.quoted_attribute_valueFcCs"|jj|j|}|r|j|}|S)a Substitute XML entities for special XML characters. :param value: A string to be substituted. The less-than sign will become <, the greater-than sign will become >, and any ampersands will become &. If you want ampersands that appear to be part of an entity definition to be left alone, use substitute_xml_containing_entities() instead. :param make_quoted_attribute: If True, then the string will be quoted, as befits an attribute value. )AMPERSAND_OR_BRACKETsubr(r,)r#r+make_quoted_attributerrrsubstitute_xmls   z!EntitySubstitution.substitute_xmlcCs"|jj|j|}|r|j|}|S)aSubstitute XML entities for special XML characters. :param value: A string to be substituted. The less-than sign will become <, the greater-than sign will become >, and any ampersands that are not part of an entity defition will become &. :param make_quoted_attribute: If True, then the string will be quoted, as befits an attribute value. )BARE_AMPERSAND_OR_BRACKETr.r(r,)r#r+r/rrr"substitute_xml_containing_entitiess   z5EntitySubstitution.substitute_xml_containing_entitiescCs|jj|j|S)aReplace certain Unicode characters with named HTML entities. This differs from data.encode(encoding, 'xmlcharrefreplace') in that the goal is to make the result more readable (to those with ASCII displays) rather than to recover from errors. There's absolutely nothing wrong with a UTF-8 string containg a LATIN SMALL LETTER E WITH ACUTE, but replacing that character with "é" will make it more readable to some people. )CHARACTER_TO_HTML_ENTITY_REr.r&)r#rrrrsubstitute_htmls z"EntitySubstitution.substitute_htmlN)F)F)__name__ __module__ __qualname____doc__rr ZHTML_ENTITY_TO_CHARACTERr3r'rrr1r- classmethodr&r(r,r0r2r4rrrrr 5s$      %  r c@sHeZdZdZdddZddZedd Zed d Z edd d Z dS)EncodingDetectora^Suggests a number of possible encodings for a bytestring. Order of precedence: 1. Encodings you specifically tell EncodingDetector to try first (the override_encodings argument to the constructor). 2. An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a tag (if the bytestring is to be interpreted as an HTML document.) 3. An encoding detected through textual analysis by chardet, cchardet, or a similar external library. 4. UTF-8. 5. Windows-1252. NFcCsN|pg|_|pg}tdd|D|_d|_||_d|_|j|\|_|_dS)NcSsg|] }|jqSr)lower).0xrrr sz-EncodingDetector.__init__..) override_encodingssetexclude_encodingschardet_encodingis_htmldeclared_encodingstrip_byte_order_markmarkupsniffed_encoding)r*rFr?rCrArrr__init__s zEncodingDetector.__init__cCs8|dk r4|j}||jkrdS||kr4|j|dSdS)NFT)r;rAadd)r*rtriedrrr_usables  zEncodingDetector._usableccst}x |jD]}|j||r|VqW|j|j|r>|jV|jdkrZ|j|j|j|_|j|j|rp|jV|jdkrt |j|_|j|j|r|jVxdD]}|j||r|VqWdS)z tag, hopefully near the beginning of the document. iig?N)endposrasciir)) rVmaxintxml_encoding_research html_meta_regroupsdecoder;)r#rFrCZsearch_entire_documentZ xml_endposZ html_endposrDZdeclared_encoding_matchrrrrN+s   z'EncodingDetector.find_declared_encoding)NFN)FF) r5r6r7r8rHrKpropertyrPr9rErNrrrrr:s  ! r:c@sReZdZdZdddZdddgZgdd gfd d Zd d ZdddZdddZ e ddZ ddZ ddZ ddd d!d"d#d$d%d&d'd(d)d*d2d+d2d2d,d-d.d/d0d1d2d3d4d5d6d7d2d8d9dQ ZdRddSdTdUdVdWdXdYdZd[d\d]d2d^d2d2d_d_d`d`dadbdcdddedfdgdhd2didjddkdldmdndodpd[dqdPdrdsdkddtdbdudvdwdxd:dzd{dadSd|drd}d~ddd2ddddddddddddddddddddddddaddddddjddddddddddlddddddddduddudududududdudzdzdzdzddddZdddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd d d d d dzZd;d<d=gZeddZed>dZed?ddZdS(@ UnicodeDammitzA class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, can replace MS smart quotes with their HTML or XML equivalents.z mac-romanz shift-jis) macintoshzx-sjis windows-1252z iso-8859-1z iso-8859-2NFcCs||_g|_d|_||_tjt|_t|||||_ t |t sF|dkr`||_ t ||_ d|_dS|j j |_ d}x,|j jD] }|j j }|j|}|dk rxPqxW|sx@|j jD]4}|dkr|j|d}|dk r|jjdd|_PqW||_ |sd|_dS)NFr rYr)zSSome characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.T)smart_quotes_totried_encodingsZcontains_replacement_charactersrCloggingZ getLoggerr5logr:detectorrTrUrFZunicode_markuporiginal_encodingrP _convert_fromZwarning)r*rFr?rerCrAurrrrrHXs>     zUnicodeDammit.__init__cCs|jd}|jdkr&|jj|j}nf|jj|}t|tkr|jdkrfdj|djdj}qdj|djdj}n|j}|S)z[Changes a MS smart quote character to an XML or HTML entity, or an ASCII character.rYZxmlz&#x;rr)r"reMS_CHARS_TO_ASCIIr!encodeMS_CHARStypetuple)r*matchZorigr.rrr _sub_ms_chars     zUnicodeDammit._sub_ms_charstrictcCs|j|}| s||f|jkr"dS|jj||f|j}|jdk rh||jkrhd}tj|}|j|j |}y|j |||}||_||_ Wn t k r}zdSd}~XnX|jS)Ns([-])) find_codecrfrrFreENCODINGS_WITH_SMART_QUOTESrrr.ru _to_unicoderj Exception)r*ZproposederrorsrFZsmart_quotes_reZsmart_quotes_compiledrlrOrrrrks"     zUnicodeDammit._convert_fromcCs t|||S)zGiven a string and its encoding, decodes the string into Unicode. %encoding is a string recognized by encodings.aliases)rU)r*rWrr{rrrryszUnicodeDammit._to_unicodecCs|js dS|jjS)N)rCrirD)r*rrrdeclared_html_encodingsz$UnicodeDammit.declared_html_encodingcCs`|j|jj||pN|r*|j|jddpN|r@|j|jddpN|rL|jpN|}|r\|jSdS)N-r _)_codecCHARSET_ALIASESr!r)r;)r*charsetr+rrrrws zUnicodeDammit.find_codecc Cs<|s|Sd}ytj||}Wnttfk r6YnX|S)N)codecsr LookupError ValueError)r*rcodecrrrrs zUnicodeDammit._codeceuro20AC sbquo201Afnof192bdquo201Ehellip2026dagger2020Dagger2021circ2C6permil2030Scaron160lsaquo2039OElig152?#x17D17Dlsquo2018rsquo2019ldquo201Crdquo201Dbull2022ndash2013mdash2014tilde2DCtrade2122scaron161rsaquo203Aoelig153#x17E17EYumlr ) ZEUR,fz,,z...+z++^%SrZOEZrr*r}z--~z(TM)rrZoezY!cZGBP$ZYEN|z..z(th)z<>z1/4z1/2z3/4AZAECEIDNOUbBaZaerOin/y)rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrs€s‚sƒs„s…s†s‡sˆs‰sŠs‹sŒsŽs‘s’s“s”s•s–s—s˜s™sšs›sœsžsŸs s¡s¢s£s¤s¥s¦s§s¨s©sªs«s¬s­s®s¯s°s±s²s³s´sµs¶s·s¸s¹sºs»s¼s½s¾s¿sÀsÁsÂsÃsÄsÅsÆsÇsÈsÉsÊsËsÌsÍsÎsÏsÐsÑsÒsÓsÔsÕsÖs×sØsÙsÚsÛsÜsÝsÞsßsàrsâsãsäsåsæsçsèsésêsësìsísîsïsðsñsòsósôsõsös÷søsùsúsûsüsýsþ)zrrrRrrrSrrrQrrmutf8c Cs$|jddjdkrtd|jdkr0tdg}d }d }x|t|kr||}t|tsft|}||jkr||jkrxz|j D]$\}} } ||kr|| kr|| 7}PqWq>|d kr||j kr|j ||||j |j ||d 7}|}q>|d 7}q>W|d kr|S|j ||d d j |S)aFix characters from one encoding embedded in some other encoding. Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8. The input must be a bytestring. If you've already converted the document to Unicode, you're too late. The output is a bytestring in which `embedded_encoding` characters have been converted to their `main_encoding` equivalents. r~r} windows-1252 windows_1252zPWindows-1252 and ISO-8859-1 are the only currently supported embedded encodings.rutf-8z4UTF-8 is the only currently supported main encoding.rrarmN)rr)rr) r)r;NotImplementedErrorrVrTr[ordFIRST_MULTIBYTE_MARKERLAST_MULTIBYTE_MARKERMULTIBYTE_MARKERS_AND_SIZESWINDOWS_1252_TO_UTF8rr) r#Zin_bytesZ main_encodingZembedded_encodingZ byte_chunksZ chunk_startposZbytestartendsizerrr detwingle s<      zUnicodeDammit.detwingle)rv)rv)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr)rr )rr)rrrR)rrrS)rrrQ)rrd)r5r6r7r8rrxrHrurkryrar|rwrrqrorrrrr9rrrrrrbEs`1        rb)r8Z __license__rZ html.entitiesrrrgstringZ chardet_typerr ImportErrorr Z iconv_codecrrprr\r^objectr r:rbrrrrs8